Pytorch: How to create an update rule that doesn't come from derivatives?

I want to implement the following algorithm, taken from this book, section 13.6:

[Image: pseudocode of the actor-critic algorithm with eligibility traces, from section 13.6 of the book]

I don't understand how to implement the update rule in pytorch (the rule for w is quite similar to that of theta).

As far as I know, torch requires a loss for loss.backward().
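
For reference, the pattern I mean is the usual supervised-style loop; a minimal sketch, where model, criterion, optimizer, x and y are hypothetical placeholders:

prediction = model(x)               # forward pass of some network
loss = criterion(prediction, y)     # a scalar loss, e.g. from nn.MSELoss()
optimizer.zero_grad()
loss.backward()                     # gradients of the loss w.r.t. the parameters
optimizer.step()                    # gradient-descent style update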

This form does not seem to apply to the quoted algorithm.

I'm still certain there is a correct way of implementing such update rules in pytorch.

Would greatly appreciate a code snippet of how the w weights should be updated, given that V(s,w) is the output of the neural net, parameterized by w.


EDIT: Chris Holland suggested a way to implement this, and I implemented it. It does not converge on Cartpole, and I wonder if I did something wrong.

The critic does converge, but only on the solution of gamma*f = f - 1, i.e. f = 1/(1 - gamma), the sum of the geometric series 1 + gamma + gamma^2 + ... This means gamma=1 diverges, gamma=0.99 converges to 100, gamma=0.5 converges to 2, and so on, regardless of the actor or policy.
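
As a quick standalone check of that fixed point (separate from the implementation below), iterating the recursion directly gives the same numbers:

gamma = 0.99
f = 0.0
for _ in range(5000):
    f = gamma * f + 1.0   # the recursion gamma*f + 1 = f that the critic ends up fitting
print(f)                  # ~100.0, i.e. 1 / (1 - gamma)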

The code:

def _update_grads_with_eligibility(self, is_critic, delta, discount, ep_t):
    gamma = self.args.gamma
    if is_critic:
        params = list(self.critic_nn.parameters())
        lamb = self.critic_lambda
        eligibilities = self.critic_eligibilities
    else:
        params = list(self.actor_nn.parameters())
        lamb = self.actor_lambda
        eligibilities = self.actor_eligibilities

    is_episode_just_started = (ep_t == 0)
    if is_episode_just_started:
        eligibilities.clear()
        for i, p in enumerate(params):
            if not p.requires_grad:
                continue
            eligibilities.append(torch.zeros_like(p.grad, requires_grad=False))

    # eligibility traces: z <- gamma * lambda * z + discount * grad, then grad <- delta * z
    for i, p in enumerate(params):
        if not p.requires_grad:
            continue
        eligibilities[i][:] = (gamma * lamb * eligibilities[i]) + (discount * p.grad)
        p.grad[:] = delta.squeeze() * eligibilities[i]

and

expected_reward_from_t = self.critic_nn(s_t)
probs_t = self.actor_nn(s_t)
expected_reward_from_t1 = torch.tensor([[0]], dtype=torch.float)
if s_t1 is not None:  # s_t is not a terminal state, s_t1 exists.
    expected_reward_from_t1 = self.critic_nn(s_t1)

# TD error
delta = r_t + gamma * expected_reward_from_t1.data - expected_reward_from_t.data

# backprop the negated value: optimizer.step() subtracts gradients, so the net update adds alpha*delta*z
negative_expected_reward_from_t = -expected_reward_from_t
self.critic_optimizer.zero_grad()
negative_expected_reward_from_t.backward()
self._update_grads_with_eligibility(is_critic=True,
                                    delta=delta,
                                    discount=discount,
                                    ep_t=ep_t)
self.critic_optimizer.step()
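
For completeness, the actor branch is invoked the same way; a hypothetical sketch of that call, where a_t (the index of the sampled action) and self.actor_optimizer are assumed names that do not appear in the snippets above:

log_prob_t = torch.log(probs_t.squeeze()[a_t])   # log pi(a_t | s_t, theta)
self.actor_optimizer.zero_grad()
(-log_prob_t).backward()                         # grads = -grad_theta log pi(a_t | s_t, theta)
self._update_grads_with_eligibility(is_critic=False,
                                    delta=delta,
                                    discount=discount,
                                    ep_t=ep_t)
self.actor_optimizer.step()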

EDIT 2: Chris Holland's solution works. The problem originated from a bug in my code that caused the line

if s_t1 is not None:
    expected_reward_from_t1 = self.critic_nn(s_t1)

to always be executed, so expected_reward_from_t1 was never zero and the Bellman recursion never had a terminal (stopping) condition.

With no reward engineering, gamma=1, lambda=0.6, and a single hidden layer of size 128 for both actor and critic, this converged on a rather stable optimal policy within 500 episodes.

Convergence was even faster with gamma=0.99 (the best discounted episode reward was about 86.6).
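
For reference, a hypothetical pair of networks matching that description (CartPole's 4-dimensional observation and 2 actions, with ReLU as an assumed activation; these exact layers are not from my actual code):

critic_nn = torch.nn.Sequential(
    torch.nn.Linear(4, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 1))                              # V(s, w)

actor_nn = torch.nn.Sequential(
    torch.nn.Linear(4, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 2), torch.nn.Softmax(dim=-1))    # pi(a | s, theta)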

Thanks, and a BIG thank you to @Chris Holland, who "gave this a try".


1 Reply


I am gonna give this a try.

.backward() does not need a loss function; it just needs a differentiable scalar output. It computes the gradient of that output with respect to the model parameters. Let's just look at the first case: the update for the value function.

There is one gradient appearing in the rule, the gradient of v; we can compute it with:

v = model(s)    # v_hat(s, w): a single scalar output
v.backward()    # fills p.grad with the gradient of v w.r.t. each parameter p

This gives us the gradient of v, with one tensor per model parameter (matching each parameter's shape). Assuming we have already calculated the other quantities in the update rule, we can compute the actual optimizer update:

# z_theta: the eligibility traces (one tensor per parameter); lamda: the trace decay;
# l: the I factor from the algorithm (the question's discount); alpha: the step size; delta: the TD error.
for i, p in enumerate(model.parameters()):
    z_theta[i][:] = gamma * lamda * z_theta[i] + l * p.grad   # z <- gamma*lambda*z + I*grad
    p.grad[:] = alpha * delta * z_theta[i]                    # store alpha*delta*z in place of the gradient

We can then use opt.step() to update the model parameters with the adjusted gradient.
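
Putting it together, here is a minimal self-contained sketch of one full update step. The tiny model, state, delta, I and hyperparameter values are placeholders; plain SGD with lr=1 is assumed so the effective step size is alpha, and the value is negated before backward so that opt.step(), which subtracts the gradient, ends up adding alpha * delta * z:

import torch

model = torch.nn.Linear(4, 1)                           # stand-in value network v_hat(s, w)
opt = torch.optim.SGD(model.parameters(), lr=1.0)
z = [torch.zeros_like(p) for p in model.parameters()]   # eligibility traces (z_theta above)

gamma, lamda, alpha = 0.99, 0.6, 0.01
I, delta = 1.0, 0.5                                     # placeholder values for a single step

s = torch.randn(4)
opt.zero_grad()
v = model(s)
(-v).backward()                                         # p.grad = -grad_w v_hat(s, w)
for i, p in enumerate(model.parameters()):
    z[i][:] = gamma * lamda * z[i] + I * p.grad         # z <- gamma*lambda*z + I*grad (sign folded in)
    p.grad[:] = alpha * delta * z[i]
opt.step()                                              # net effect: w <- w + alpha*delta*(trace of grad_w v)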

