I'm trying to use a deep Q-network (DQN) to solve an optimization problem in which the state (21 inputs) is correlated with the action (20 outputs). The problem has no terminal state: the agent moves in real time, without boundaries, to choose the optimal location (it is a navigation problem).
After training the DQN, the network chooses the same single output for every state. Can anyone help me with this problem? I checked the Q-values during training, and the values for all actions change together in the same way.
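To make the symptom concrete, this is roughly how the collapse can be verified. It is only a sketch: the random probe states are illustrative, and agent.act(state, 0, False) follows the (state, epsilon, train_flag) signature used in my test code below:

import numpy as np

def probe_greedy_actions(agent, state_dim=21, n_probes=100, seed=0):
    # Query the trained agent with random states and count the distinct
    # greedy actions it picks. If the Q-values of all actions only shift
    # by a common offset per state, the argmax (and hence the chosen
    # action) never changes.
    rng = np.random.default_rng(seed)
    chosen = set()
    for _ in range(n_probes):
        state = rng.uniform(-1.0, 1.0, size=state_dim)  # illustrative probe state
        action, q_values = agent.act(state, 0, False)   # epsilon = 0 -> greedy
        chosen.add(int(action))
    print("distinct greedy actions over {} probes: {}".format(n_probes, sorted(chosen)))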
Also, I have another doubt. Looking at the training plots, the reward seems to be converging, but the Q-value has a sharp peak in the initial episodes. I don't know why this happens.
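For reference, the curves are easier to read when smoothed with a moving average. This is only a plotting sketch; episode_rewards and episode_max_q are hypothetical lists logged once per episode during training, not variables from my code:

import numpy as np
import matplotlib.pyplot as plt

def moving_average(x, window=50):
    # Smooth a noisy per-episode curve with a simple moving average.
    x = np.asarray(x, dtype=float)
    return np.convolve(x, np.ones(window) / window, mode="valid")

def plot_training_curves(episode_rewards, episode_max_q, window=50):
    # episode_rewards / episode_max_q: hypothetical per-episode logs.
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(moving_average(episode_rewards, window))
    ax1.set_xlabel("episode")
    ax1.set_ylabel("reward (moving average)")
    ax2.plot(moving_average(episode_max_q, window))
    ax2.set_xlabel("episode")
    ax2.set_ylabel("max Q (moving average)")
    plt.tight_layout()
    plt.show()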
My test code is as follows:
from numpy import hstack

# start_time, action_size, k1 and net_power are defined elsewhere in the script.
def test(env, agent, test_runtime, ref_idx):
    # Per-timestep results: [ref_idx, opt_idx, depth, velocity, power, cumulative energy, q_values]
    saving_optdata = [0 for _ in range(test_runtime)]
    print("\n---- TEST ----\n")
    energy = 0
    for t in range(start_time, start_time + test_runtime, 1):
        env.reset(ref_idx=ref_idx)  # reset the environment
        time_window = env.time_window(1 + 1, t)
        state = hstack((ref_idx, time_window[55:60, 0]))
        action, q_values = agent.act(state, 0, False)  # greedy action (epsilon = 0)
        opt_idx = len(env.arr_depth) - action_size + action
        next_idx, done = env.next_timestep(action, action_size)  # send action to environment
        next_state = hstack((next_idx, time_window[55:60, 1]))
        reward = net_power(ref_idx, next_idx, time_window[next_idx, 1], k1, 1)
        agent.step(action, reward, next_state, done, False)
        opt_vel = env.arr_velocity[opt_idx, t + 1]
        opt_p = net_power(ref_idx, opt_idx, opt_vel, k1, 1)
        energy += opt_p
        saving_optdata[t - start_time] = [ref_idx, opt_idx, env.arr_depth[opt_idx], opt_vel, opt_p, energy, q_values]
        print("time = {:3d}, Index = {:2d}, Power = {:9.3f}, Velocity = {}".format(t - start_time, opt_idx, opt_p, opt_vel))
        ref_idx = opt_idx
    return saving_optdata  # hand the collected results back to the caller
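For completeness, the function would be called something like this (env, agent, and the globals used inside test come from the full script; the argument values here are made up):

results = test(env, agent, test_runtime=100, ref_idx=55)  # hypothetical arguments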