python - Deep Q learning - test issue - navigation

I'm trying to use a deep Q-network (DQN) to solve an optimization problem, where my states (i.e., 21 inputs) are correlated with the actions (i.e., 20 outputs). My problem has no terminal state: the agent moves in real time, without any episode boundary, and keeps choosing the optimal location at each step (it's a navigation problem).
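
Roughly, the network shape I mean is like the minimal sketch below (the hidden layer sizes and the PyTorch framework are only illustrative; this is not my exact implementation):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a 21-dimensional state to one Q-value for each of the 20 actions."""

    def __init__(self, state_size=21, action_size=20, hidden=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_size),  # one Q-value per action
        )

    def forward(self, state):
        return self.layers(state)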

After training the deep Q-network, the network chooses the same single output (action) for different states. Can anyone help me with this problem? I checked the Q-values during training, and all of them change together in a very similar way.

[Figure: Q-values during training]
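
This is roughly how I check it (a minimal sketch, assuming agent.act(state, 0, False) returns (action, q_values) as in my test code below; sample_states is just a placeholder list of states):

import numpy as np

def check_action_collapse(agent, sample_states):
    # Query the greedy action and the Q-values for a handful of different states.
    actions = []
    for state in sample_states:
        action, q_values = agent.act(state, 0, False)  # epsilon = 0, no learning step
        actions.append(action)
        print("action:", action,
              "Q min/max:", float(np.min(q_values)), float(np.max(q_values)))
    # If only one distinct action appears here, the policy has collapsed to a single output.
    print("distinct actions chosen:", sorted(set(actions)))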

I also have another question. When I look at the reward:

[Figure: Reward]

it seems that the reward is converging, but the Q-values show a sharp peak in the initial episodes:

[Figure: Q-values with a sharp initial peak]

I don't know why this happens.

My test code is as follows:

from numpy import hstack  # used to build the state vectors below

# start_time, action_size, k1 and net_power are defined elsewhere in my script.
def test(env, agent, test_runtime, ref_idx):
    saving_optdata = [0 for _ in range(test_runtime)]

    print("\n---- TEST ----\n")
    energy = 0
    for t in range(start_time, start_time + test_runtime, 1):
        env.reset(ref_idx=ref_idx)  # reset the environment
        time_window = env.time_window(1 + 1, t)
        state = hstack((ref_idx, time_window[55:60, 0]))
        action, q_values = agent.act(state, 0, False)  # greedy action (epsilon = 0), no learning
        opt_idx = len(env.arr_depth) - action_size + action

        next_idx, done = env.next_timestep(action, action_size)  # send action to environment
        next_state = hstack((next_idx, time_window[55:60, 1]))
        reward = net_power(ref_idx, next_idx, time_window[next_idx, 1], k1, 1)  # -
        agent.step(action, reward, next_state, done, False)

        opt_vel = env.arr_velocity[opt_idx, t + 1]
        opt_p = net_power(ref_idx, opt_idx, opt_vel, k1, 1)
        energy += opt_p
        saving_optdata[t - start_time] = [ref_idx, opt_idx, env.arr_depth[opt_idx], opt_vel, opt_p, energy, q_values]
        print("time = {:3d}, Index= {:2d}, Power = {:9.3f},   Velocity = {},".format(t - start_time, opt_idx, opt_p,
                                                                                     opt_vel))
        ref_idx = opt_idx
Question from: https://stackoverflow.com/questions/65641153/deep-q-learning-test-issue-navigation


1 Reply

Waiting for answers
