In Section 5.4 of their book on reinforcement learning, Sutton and Barto show that the policy improvement theorem applies to soft policies: making a soft policy greedier (while keeping it soft) with respect to its Q-function yields an improved policy. I found this material difficult to follow and wrote this short document to elaborate on their proof. Familiarity with the material up to that section is assumed.
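The claim can be checked numerically. The following is a minimal sketch, not the proof elaborated in the document. It assumes a small randomly generated MDP, a discount factor of 0.9, and the usual definition of an ε-soft policy (every action has probability at least ε divided by the number of actions). All function names, the MDP itself, and the constants are illustrative assumptions. It evaluates a random ε-soft policy, makes it ε-greedy with respect to its own Q-function, and compares the state values before and after.

```python
import random

GAMMA = 0.9  # assumed discount factor (illustrative)


def random_dist(rng, n):
    """Return a random probability distribution over n outcomes."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def q_values(P, R, V):
    """One-step lookahead: Q[s][a] = R[s][a] + GAMMA * sum_t P[s][a][t] * V[t]."""
    n_s = len(V)
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in range(n_s))
             for a in range(len(R[s]))] for s in range(n_s)]


def evaluate(P, R, pi, sweeps=500):
    """Iterative policy evaluation; returns (V, Q) for the policy pi."""
    V = [0.0] * len(R)
    for _ in range(sweeps):
        Q = q_values(P, R, V)
        V = [sum(p * q for p, q in zip(pi[s], Q[s])) for s in range(len(R))]
    return V, q_values(P, R, V)


def epsilon_greedy(Q, eps):
    """Greedy action gets 1 - eps + eps/n; every other action gets eps/n."""
    pi = []
    for q in Q:
        n = len(q)
        best = max(range(n), key=lambda a: q[a])
        pi.append([eps / n + (1.0 - eps if a == best else 0.0) for a in range(n)])
    return pi


rng = random.Random(0)
n_states, n_actions, eps = 4, 3, 0.2

# Hypothetical MDP: random transition probabilities and rewards.
P = [[random_dist(rng, n_states) for _ in range(n_actions)] for _ in range(n_states)]
R = [[rng.uniform(-1.0, 1.0) for _ in range(n_actions)] for _ in range(n_states)]

# An arbitrary eps-soft starting policy: each action has probability >= eps / n_actions.
pi_old = [[eps / n_actions + (1.0 - eps) * p for p in random_dist(rng, n_actions)]
          for _ in range(n_states)]

V_old, Q_old = evaluate(P, R, pi_old)
pi_new = epsilon_greedy(Q_old, eps)  # greedier, but still eps-soft
V_new, _ = evaluate(P, R, pi_new)

for s in range(n_states):
    print(f"state {s}: V_old = {V_old[s]:.4f}, V_new = {V_new[s]:.4f}")
```

In every state the new value should be at least the old one, up to numerical tolerance. Changing the seed or the sizes gives other instances of the same check.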
|Translated title of the contribution|Elaboration on the policy improvement theorem for soft policies in reinforcement learning|
|Publisher|Department of Computer Science, University of Bristol|
|Publication status|Published - 2010|
Bibliographical note
Other identifier: 2001266