Reinforcement Learning for Control of Inherently Unstable Robots

Student thesis: Doctoral Thesis › Doctor of Philosophy (PhD)


Deep reinforcement learning has reignited interest in Artificial General Intelligence (AGI). Current methods often aim to replicate the high-level cognition displayed by the neocortex, since its size and density are what separate humans from other species. But this organ is only the icing on evolution’s neurological cake and relies on older regions to convert its abstract plans into complex movements. Despite this, current testing benchmarks are often reductionist simulations and toy problems which ignore the temporal abstraction, brutality and complexity of the only environment that has ever produced true intelligence. To this end, this thesis explores reinforcement learning not with pseudo-problems or discrete environments, but with real-world agile locomotion.
Part I (Chapter 3) explores how model-free methods can be adapted to allow a two-wheeled robot to learn balance and position control directly in the field. Implementations and benchmarks are introduced for embedded, high-frequency reinforcement learning. A soft deck approach to termination criteria is presented and shown to eliminate the need for a human in the loop when learning agility from scratch outdoors.
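The abstract does not give the exact formulation of the soft termination criterion, but the idea of a graded "soft" region in place of a hard episode cut-off can be sketched as follows. All names and thresholds here (`soft_limit`, `hard_limit`, `penalty_scale`) are illustrative assumptions, not the thesis's actual values:

```python
def soft_deck_reward(pitch, soft_limit=0.35, hard_limit=0.60, penalty_scale=5.0):
    """Graded penalty instead of an immediate hard termination.

    Below soft_limit (radians) the pitch incurs no penalty; between the
    soft and hard limits the penalty ramps up smoothly, steering the agent
    back towards balance without ending the episode, so no human has to
    reset the robot. Thresholds are illustrative, not from the thesis.
    """
    excess = max(0.0, abs(pitch) - soft_limit)
    frac = min(1.0, excess / (hard_limit - soft_limit))
    terminated = abs(pitch) >= hard_limit  # only a full fall ends the episode
    return -penalty_scale * frac ** 2, terminated
```

The quadratic ramp is one plausible choice; the key property is that the penalty grows continuously before any termination, giving a gradient back to safety.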
Part II (Chapter 4) shows how hierarchical reinforcement learning can rival traditional controllers whilst eliminating catastrophic forgetting by freezing the inner policy. This allowed the inverted pendulum to learn velocity control to a precision of 10⁻⁶ m/s and accommodated differing update rates across sensors. Since high-level human commands are required for many agile systems, it is also shown how a person can manipulate the learnt policy to drive the robot around a hazardous obstacle course.
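The structural point here, a frozen low-level skill beneath a trainable high-level policy, can be illustrated with a minimal sketch. The linear/tanh policies and random placeholder weights below are assumptions for illustration only; the thesis's actual networks and training procedure are not specified in this abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen inner policy: maps (state, velocity setpoint) -> motor command.
# Placeholder weights stand in for a pre-trained balancer; because they
# are never updated again, outer-level learning cannot overwrite the
# balancing skill (no catastrophic forgetting).
W_inner = rng.standard_normal((1, 5))

def inner_policy(state, setpoint):
    x = np.append(state, setpoint)      # 4-dim state + scalar setpoint
    return float(np.tanh(W_inner @ x))  # bounded motor command

# Trainable outer policy: maps a high-level observation to a setpoint.
W_outer = np.zeros((1, 4))  # only this level keeps learning

def outer_policy(obs):
    return float(W_outer @ obs)

def act(obs, state):
    setpoint = outer_policy(obs)
    return inner_policy(state, setpoint)
```

Because only the outer weights change during further training, the hierarchy also tolerates the two levels running at different rates, matching the abstract's point about varied sensor update rates.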
Part III (Chapters 5-6) presents biologically inspired methods to address jerk and symmetry when learning quadrupedal locomotion. It is shown how joint-lock from constrained action spaces, which results in idiosyncratic gaits, can be eliminated with closed-loop mechanisms and frequency-based action spaces. Moving-average filters are combined with action sampling to provide an alternative way to smooth jerkiness during training. It is shown how this can be extended to define sampling pairs which encourage a desired gait coordination to emerge without any need for reward shaping or adding constraints to the optimisation function. This is compared against an alternative method for encouraging gait symmetry which grows the action space throughout an agent’s lifetime.
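The moving-average filtering of sampled actions can be sketched minimally. The class name, window size, and interface below are illustrative assumptions; only the core idea, averaging recent action samples before they reach the joints, comes from the abstract:

```python
from collections import deque

import numpy as np

class ActionSmoother:
    """Moving-average filter over sampled actions to reduce jerk.

    The filtered action, not the raw sample, is sent to the joints, so
    consecutive commands change gradually even while the policy is still
    exploring. Window size is an illustrative choice.
    """

    def __init__(self, n_joints, window=4):
        self.n_joints = n_joints
        self.history = deque(maxlen=window)  # oldest samples drop off

    def __call__(self, raw_action):
        self.history.append(np.asarray(raw_action, dtype=float))
        return np.mean(self.history, axis=0)
```

For example, a raw step from full extension `[1, 1]` to rest `[0, 0]` would be softened to the midpoint `[0.5, 0.5]` on the second call, since only two samples are in the window.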
Date of Award: 19 Mar 2024
Original language: English
Awarding Institution:
  • The University of Bristol
Supervisors: Mark Hansen (Supervisor) & Wenhao Zhang (Supervisor)
