I want to implement the following algorithm, taken from this book, section 13.6:
I don't understand how to implement the update rule in PyTorch (the rule for w is quite similar to that of theta).
As far as I know, PyTorch requires a loss on which to call loss.backward().
This form does not seem to apply to the quoted algorithm.
I'm still certain there is a correct way of implementing such update rules in PyTorch.
I would greatly appreciate a code snippet showing how the w weights should be updated, given that V(s,w) is the output of the neural net, parameterized by w.
EDIT: Chris Holland suggested a way to implement this, and I implemented it. It does not converge on CartPole, and I wonder if I did something wrong.
Regardless of the actor or policy, the critic does converge on the solution of the equation gamma*f(n) = f(n) - 1, that is f = 1/(1 - gamma), which is the sum of the geometric series 1 + gamma + gamma^2 + ...
This means gamma=1 diverges, gamma=0.99 converges on 100, gamma=0.5 converges on 2, and so on.
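As a quick sanity check of those numbers, here is a minimal sketch that iterates the recursion f <- 1 + gamma * f and compares the result with the closed form 1 / (1 - gamma). It assumes a reward of 1 on every step (as in CartPole) and a critic that never sees a terminal state; the function name `bootstrapped_value` is made up for illustration.

```python
def bootstrapped_value(gamma, n_iters=10_000):
    """Iterate f <- 1 + gamma * f, i.e. bootstrap forever with reward 1 per step."""
    f = 0.0
    for _ in range(n_iters):
        f = 1.0 + gamma * f
    return f


for gamma in (0.5, 0.99):
    # Second column: value reached by iteration; third column: closed form.
    print(gamma, bootstrapped_value(gamma), 1.0 / (1.0 - gamma))
```

With gamma=1 the same loop simply returns the number of iterations, so the value grows without bound as the iteration count increases.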
The code:
    def _update_grads_with_eligibility(self, is_critic, delta, discount, ep_t):
        gamma = self.args.gamma
        if is_critic:
            params = list(self.critic_nn.parameters())
            lamb = self.critic_lambda
            eligibilities = self.critic_eligibilities
        else:
            params = list(self.actor_nn.parameters())
            lamb = self.actor_lambda
            eligibilities = self.actor_eligibilities

        # Keep only trainable parameters so params[i] and eligibilities[i] stay aligned.
        params = [p for p in params if p.requires_grad]

        is_episode_just_started = (ep_t == 0)
        if is_episode_just_started:
            # Reset the traces to zero at the start of every episode.
            eligibilities.clear()
            for p in params:
                eligibilities.append(torch.zeros_like(p.grad, requires_grad=False))

        # eligibility traces: z <- gamma * lambda * z + discount * grad, then grad <- delta * z
        for i, p in enumerate(params):
            eligibilities[i][:] = (gamma * lamb * eligibilities[i]) + (discount * p.grad)
            p.grad[:] = delta.squeeze() * eligibilities[i]
and
    expected_reward_from_t = self.critic_nn(s_t)
    probs_t = self.actor_nn(s_t)
    expected_reward_from_t1 = torch.tensor([[0]], dtype=torch.float)
    if s_t1 is not None:  # s_t is not a terminal state, s_t1 exists.
        expected_reward_from_t1 = self.critic_nn(s_t1)
    # detach() keeps delta out of the autograd graph; it only scales the gradients.
    delta = r_t + gamma * expected_reward_from_t1.detach() - expected_reward_from_t.detach()
    # Backpropagate -V: the optimizer descends, so this yields w <- w + lr * delta * z.
    negative_expected_reward_from_t = -expected_reward_from_t
    self.critic_optimizer.zero_grad()
    negative_expected_reward_from_t.backward()
    self._update_grads_with_eligibility(is_critic=True,
                                        delta=delta,
                                        discount=discount,
                                        ep_t=ep_t)
    self.critic_optimizer.step()
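For readers who want something they can run in isolation, here is a minimal, self-contained sketch of the same critic update on a toy problem. Everything specific in it is an assumption made for illustration: the linear critic, plain SGD, the hyperparameters, and the names `critic_step` and `traces`. It also leaves out the `discount` factor used above (treating it as 1). It is not the project code, only one way the gradient-overwrite trick can be arranged.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy critic and hyperparameters, chosen only for illustration.
critic = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(critic.parameters(), lr=0.1)
gamma, lamb = 0.99, 0.6

# One trace tensor per parameter; these should be zeroed at the start of each episode.
traces = [torch.zeros_like(p) for p in critic.parameters()]


def critic_step(s_t, r_t, s_t1, done):
    """One critic update with eligibility traces; returns the TD error as a float."""
    v_t = critic(s_t)
    with torch.no_grad():
        # No bootstrapping from a terminal state.
        v_t1 = torch.zeros_like(v_t) if done else critic(s_t1)
        delta = (r_t + gamma * v_t1 - v_t).squeeze()

    optimizer.zero_grad()
    # Backpropagating -V leaves -grad V in p.grad, so the traces below
    # accumulate negated gradients.
    (-v_t).sum().backward()
    with torch.no_grad():
        for z, p in zip(traces, critic.parameters()):
            z.mul_(gamma * lamb).add_(p.grad)  # z <- gamma * lambda * z + (-grad V)
            p.grad.copy_(delta * z)            # what the optimizer will descend along
    # SGD computes w <- w - lr * p.grad, i.e. w <- w + lr * delta * (accumulated grad V).
    optimizer.step()
    return delta.item()
```

The sign flip is the only reason for calling backward() on -V: optimizers minimize, while the update rule in the algorithm adds delta times the trace to the weights.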
EDIT 2:
Chris Holland's solution works. The problem originated from a bug in my code that caused the lines
    if s_t1 is not None:
        expected_reward_from_t1 = self.critic_nn(s_t1)
to always be executed. As a result, expected_reward_from_t1 was never zero at terminal states, so the Bellman equation recursion had no stopping condition.
With no reward engineering, gamma=1, lambda=0.6, and a single hidden layer of size 128 for both actor and critic, this converged on a rather stable optimal policy within 500 episodes.
Convergence was even faster with gamma=0.99, as the graph shows (best discounted episode reward is about 86.6).
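To make the role of the stopping condition concrete, here is a small sketch of the Bellman recursion with the value after the last step pinned to zero. The helper name `discounted_values` is hypothetical, and the reward sequence is an assumption (reward 1 on every step).

```python
def discounted_values(rewards, gamma):
    """Bellman recursion V_t = r_t + gamma * V_{t+1}, anchored by V = 0 after the last step."""
    values = []
    next_value = 0.0  # terminal state: nothing left to collect
    for r in reversed(rewards):
        next_value = r + gamma * next_value
        values.append(next_value)
    return values[::-1]


print(discounted_values([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Assuming episodes capped at 200 steps with reward 1 per step, `discounted_values([1.0] * 200, gamma=0.99)[0]` comes out at about 86.6, which is consistent with the best discounted episode reward quoted above. Without the zero anchor, the same recursion would keep growing toward 1 / (1 - gamma) = 100.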
BIG thank you to @Chris Holland, who "gave this a try"