I'm trying to use a deep Q-network (DQN) to solve an optimization problem in which the state (21 inputs) is correlated with the action (20 outputs). The problem has no terminal state: the agent moves in real time, without boundaries, to choose the optimal location (it is a navigation problem).
After training the DQN, the network chooses the same single output for every state. Can anyone help me with this problem? I checked the Q-values during training, and the values for all actions change together in the same way.
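To make the symptom concrete, this is roughly how the collapse can be verified. It is only a sketch: the random probe states are illustrative, and agent.act(state, 0, False) follows the (state, epsilon, train_flag) signature used in my test code below:

import numpy as np

def probe_greedy_actions(agent, state_dim=21, n_probes=100, seed=0):
    # Query the trained agent with random states and count the distinct
    # greedy actions it picks. If the Q-values of all actions only shift
    # by a common offset per state, the argmax (and hence the chosen
    # action) never changes.
    rng = np.random.default_rng(seed)
    chosen = set()
    for _ in range(n_probes):
        state = rng.uniform(-1.0, 1.0, size=state_dim)  # illustrative probe state
        action, q_values = agent.act(state, 0, False)   # epsilon = 0 -> greedy
        chosen.add(int(action))
    print("distinct greedy actions over {} probes: {}".format(n_probes, sorted(chosen)))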
Also, I have another doubt. Looking at the training plots, the reward seems to be converging, but the Q-value has a sharp peak in the initial episodes. I don't know why this happens.
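For reference, the curves are easier to read when smoothed with a moving average. This is only a plotting sketch; episode_rewards and episode_max_q are hypothetical lists logged once per episode during training, not variables from my code:

import numpy as np
import matplotlib.pyplot as plt

def moving_average(x, window=50):
    # Smooth a noisy per-episode curve with a simple moving average.
    x = np.asarray(x, dtype=float)
    return np.convolve(x, np.ones(window) / window, mode="valid")

def plot_training_curves(episode_rewards, episode_max_q, window=50):
    # episode_rewards / episode_max_q: hypothetical per-episode logs.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(moving_average(episode_rewards, window))
    ax1.set_xlabel("episode")
    ax1.set_ylabel("reward (moving average)")
    ax2.plot(moving_average(episode_max_q, window))
    ax2.set_xlabel("episode")
    ax2.set_ylabel("max Q (moving average)")
    plt.tight_layout()
    plt.show()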
My test code is as follows:
from numpy import hstack

# start_time, action_size, k1 and net_power are defined elsewhere in the script.
def test(env, agent, test_runtime, ref_idx):
    # Per-timestep results: [ref_idx, opt_idx, depth, velocity, power, cumulative energy, q_values]
    saving_optdata = [0 for _ in range(test_runtime)]
    print("\n---- TEST ----\n")
    energy = 0
    for t in range(start_time, start_time + test_runtime, 1):
        env.reset(ref_idx=ref_idx)  # reset the environment
        time_window = env.time_window(1 + 1, t)
        state = hstack((ref_idx, time_window[55:60, 0]))
        action, q_values = agent.act(state, 0, False)  # greedy action (epsilon = 0)
        opt_idx = len(env.arr_depth) - action_size + action
        next_idx, done = env.next_timestep(action, action_size)  # send action to environment
        next_state = hstack((next_idx, time_window[55:60, 1]))
        reward = net_power(ref_idx, next_idx, time_window[next_idx, 1], k1, 1)
        agent.step(action, reward, next_state, done, False)
        opt_vel = env.arr_velocity[opt_idx, t + 1]
        opt_p = net_power(ref_idx, opt_idx, opt_vel, k1, 1)
        energy += opt_p
        saving_optdata[t - start_time] = [ref_idx, opt_idx, env.arr_depth[opt_idx], opt_vel, opt_p, energy, q_values]
        print("time = {:3d}, Index = {:2d}, Power = {:9.3f}, Velocity = {}".format(t - start_time, opt_idx, opt_p, opt_vel))
        ref_idx = opt_idx
    return saving_optdata  # hand the collected results back to the caller
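For completeness, the function would be called something like this (env, agent, and the globals used inside test come from the full script; the argument values here are made up):

results = test(env, agent, test_runtime=100, ref_idx=55)  # hypothetical arguments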