Abstract
In Section 5.4 of their book on reinforcement learning, Sutton and Barto show that the policy improvement theorem applies to soft policies: making a soft policy greedier (but still soft) with respect to its Q-function yields an improved policy.
I found this material difficult to follow and wrote this short document to elaborate on their proof. Familiarity with the material up to that section is assumed.
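
The key inequality behind this result can be sketched as follows. The notation is an assumption, following the book's usual conventions rather than anything stated in this abstract: $\pi$ is an $\varepsilon$-soft policy with $0 < \varepsilon < 1$, meaning $\pi(s,a) \ge \varepsilon/|A(s)|$ for every state $s$ and action $a$; $\pi'$ is the $\varepsilon$-greedy policy with respect to $Q^\pi$; and $V^\pi$ is the state-value function of $\pi$.

```latex
% Assumed notation (not given in the abstract):
%   \pi  : an \varepsilon-soft policy, \pi(s,a) \ge \varepsilon/|A(s)|
%   \pi' : the \varepsilon-greedy policy with respect to Q^\pi
\begin{align*}
\sum_a \pi'(s,a)\, Q^\pi(s,a)
  &= \frac{\varepsilon}{|A(s)|} \sum_a Q^\pi(s,a)
     + (1-\varepsilon) \max_a Q^\pi(s,a) \\
  &\ge \frac{\varepsilon}{|A(s)|} \sum_a Q^\pi(s,a)
     + (1-\varepsilon) \sum_a
       \frac{\pi(s,a) - \frac{\varepsilon}{|A(s)|}}{1-\varepsilon}\, Q^\pi(s,a) \\
  &= \sum_a \pi(s,a)\, Q^\pi(s,a) \;=\; V^\pi(s).
\end{align*}
```

The inequality holds because the weights $\bigl(\pi(s,a) - \varepsilon/|A(s)|\bigr)/(1-\varepsilon)$ are non-negative and sum to one, so the weighted average they define cannot exceed the maximum. The policy improvement theorem then gives $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$.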
Translated title of the contribution | Elaboration on the policy improvement theorem for soft policies in reinforcement learning |
---|---|
Original language | English |
Publisher | Department of Computer Science, University of Bristol |
Publication status | Published - 2010 |
Bibliographical note
Other page information: -
Other identifier: 2001266