TY - UNPB
T1 - Reward Learning with Trees
T2 - Methods and Evaluation
AU - Bewley, Tom
AU - Lawry, Jonathan
AU - Richards, Arthur G.
AU - Craddock, Rachel
AU - Henderson, Ian
N1 - 22 pages (9 main body). Preprint, under review
PY - 2022/10/3
Y1 - 2022/10/3
N2 - Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.
AB - Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.
KW - cs.LG
KW - cs.AI
U2 - 10.48550/arXiv.2210.01007
DO - 10.48550/arXiv.2210.01007
M3 - Preprint
BT - Reward Learning with Trees
ER -