Humanoid Imitation Learning from Diverse Sources

Architecture diagram of our GAIL imitation learning system. The system accepts input from three types of sources, listed in increasing order of complexity: a reinforcement-learned expert, motion capture data, and real-life video. In the diagram, $\pi$ denotes the generator policy $\pi_g$.

This post describes our experience implementing a system that learns locomotion skills for humanoid skeletons by imitation, along with all of the supporting infrastructure and data processing necessary to do so. Our system employs a generative adversarial imitation learning (GAIL) architecture, which is a type of generative adversarial network. We successfully trained our GAIL system to control a custom-designed humanoid skeleton, using expert demonstrations from a reinforcement-learned (RL) policy for that same skeleton. We also explored several methods for deriving real human motion demonstrations from video and developed a preprocessing pipeline for motion capture data. Our system is a work in progress, and it forms the foundation for several possible future research projects.

Introduction to GAIL

Generative adversarial imitation learning (GAIL) is a deep neural network architecture for imitation learning, wherein an agent ("learner") learns a skill by observing the behavior of another agent ("expert"). It is based on the popular generative adversarial network (GAN) architecture, in which the network is divided into two principal functional blocks engaged in an adversarial game: the "discriminator" (or "critic") learns to distinguish real training examples from examples created by the network, while the "generator" (or "actor") module learns to produce convincing counterfeit examples meant to fool the discriminator. These counterfeit examples are the useful output of a GAN.

A key property when applying GANs to imitation learning is that the generator is never exposed to real-world training examples; only the discriminator is. This allows GAIL to side-step the problem of translating expert demonstrations into the target agent's domain. In GAIL, the discriminator learns to distinguish generated performances from expert demonstrations, while the generator attempts to mimic the expert convincingly enough to fool the discriminator into thinking its performance was an expert demonstration. In our GAIL setting, the expert demonstrations are time-series humanoid locomotion trajectories from a variety of sources (motion capture, video, artificially-trained experts via RL), and the output of the generator is a policy $\pi_g$ (a function mapping states $s$ to actions $a$) for moving a humanoid model in simulation so as to mimic those demonstrated motions.


Our GAIL Implementation

The original GAIL formulation introduced by Ho et al. proposes a method in which expert demonstrations containing both states $s$ and corresponding actions $a$ are presented to the discriminator. The discriminator network consists of two hidden layers of 100 neurons with tanh activation functions. In a real-life imitation learning problem, such as humanoid motion, the actions (e.g. joint torques) are much harder to obtain than the states (e.g. joint positions), since recovering them would require solving complex inverse dynamics problems. Our work addresses this and other practical modifications necessary for using GAIL for imitation learning outside simulated environments.
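For concreteness, here is a minimal sketch of a discriminator with this shape. It is written in PyTorch purely for illustration; our project used a TensorFlow-based implementation, and the class and variable names below are not taken from our code.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Maps a feature vector z_t (a state, or a state-action pair) to the
    estimated probability that it came from an expert demonstration."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 100), nn.Tanh(),  # two hidden layers of
            nn.Linear(100, 100), nn.Tanh(),          # 100 tanh units each
            nn.Linear(100, 1), nn.Sigmoid(),         # D_phi(z) in (0, 1)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)
```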

Overall Algorithm

Algorithm 1
Input: set of demonstrations $\{z^d_t\}_{t=1\dots T^d}$
Initialize $\pi_g$ (with parameters $\theta$) and $D_\phi$ randomly
for $i$ in $1 \dots N$ do
    Sample rollouts $\{z^g_t\}_{t=1\dots T^g}$ from $\pi_g$ in the environment $E$
    Calculate rewards $\{r_t = -\log(1 - D_\phi(z^g_t))\}_{t=1\dots T^g}$
    Update $\theta$ with TRPO
    for $j$ in $1\dots M$ do
        $\mathscr{L}(\phi) = \Sigma_{t=1\dots T^g} \log(1 - D_\phi(z^g_t)) - \Sigma_{t=1\dots T^d} \log(D_\phi(z^d_t))$
        Update $\phi$ with gradient descent
Return: $\pi_g$
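The Python sketch below mirrors the structure of Algorithm 1. The injected callables (`sample_rollouts`, `trpo_update`, `discriminator_update`) and the assumption that the discriminator returns probabilities as a NumPy array are placeholders for illustration, not our actual implementation or any specific library API.

```python
import numpy as np

def train_gail(env, policy, discriminator, expert_features,
               sample_rollouts, trpo_update, discriminator_update,
               n_iters=500, m_disc_steps=5):
    """Sketch of Algorithm 1. The injected callables stand in for the real
    rollout sampler, TRPO step, and discriminator gradient step."""
    for _ in range(n_iters):
        # Sample rollouts {z_t^g} from the current generator policy pi_g in E.
        gen_features = sample_rollouts(env, policy)

        # Reward each sample with r_t = -log(1 - D_phi(z_t^g));
        # discriminator(...) is assumed to return values in (0, 1).
        d_gen = discriminator(gen_features)
        rewards = -np.log(1.0 - d_gen + 1e-8)

        # One TRPO update of the policy parameters theta.
        trpo_update(policy, gen_features, rewards)

        # M gradient-descent updates of the discriminator parameters phi.
        for _ in range(m_disc_steps):
            discriminator_update(discriminator, gen_features, expert_features)

    return policy
```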

Loss Functions

In the canonical GAN formulation there are two loss functions, one for the discriminator and one for the generator, and each can be optimized separately by ordinary gradient descent. In GAIL, however, the generator network $G_\theta$ encodes a policy $\pi_g$ and cannot be trained with simple gradient descent. We use a policy gradient method to train the generator, though non-gradient methods have also been explored in the literature. Based on the rollouts $\{z^g_t\}_{t=1\dots T^g}$ of $\pi_g$, and the rewards $\{r_t\}_{t=1\dots T^g}$ that the discriminator assigns to each $z^g_t$, the policy gradient method updates $\theta$, the parameters representing the policy function. Also contrary to typical GAN practice, the literature finds that the reward fed to the policy gradient method yields better results when formulated as $r_t = -\log(1 - D_\phi(\cdot))$ rather than $\log(D_\phi(\cdot))$. The loss function optimized for the discriminator $D_\phi$ is much like in a regular GAN: $$\mathscr{L}(\phi) = \Sigma_{t=1\dots T^g} \log(1 - D_\phi(z^g_t)) - \Sigma_{t=1\dots T^d} \log(D_\phi(z^d_t))$$ After the generator's parameters $\theta$ are updated, the discriminator's parameters $\phi$ are updated $M$ times. Algorithm 1 gives pseudocode for the GAIL policy training.
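As a concrete illustration (a hedged sketch, not our exact implementation), these two quantities can be computed directly from the discriminator's outputs:

```python
import numpy as np

def generator_rewards(d_gen, eps=1e-8):
    """r_t = -log(1 - D_phi(z_t^g)); reported in the literature to work
    better in practice than the alternative log(D_phi(z_t^g))."""
    return -np.log(1.0 - d_gen + eps)

def discriminator_loss(d_gen, d_expert, eps=1e-8):
    """L(phi) = sum_t log(1 - D_phi(z_t^g)) - sum_t log(D_phi(z_t^d))."""
    return np.sum(np.log(1.0 - d_gen + eps)) - np.sum(np.log(d_expert + eps))
```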

Simulation Environment $E$ and Target Skeleton

We developed support for two physics engines as simulation environments: MuJoCo and OpenAI Roboschool.

Our original ambition was to use Roboschool exclusively for our target humanoid environment. However, we found that our implementation of TRPO, based on OpenAI Baselines, is highly tuned for MuJoCo and failed to train successful locomotion policies in Roboschool environments. All experiments detailed herein used MuJoCo as the simulation environment.

We engineered a custom MuJoCo humanoid skeleton that attempts to approximate a real human body in terms of the relative lengths and weights of its body segments. The final skeleton is shown below.

Note that this skeleton is not the same as the real skeletons in the motion capture and real-life video data. Therefore, our implementation performs retargeting as well as imitation. The imitation learning evaluation and rollouts are based solely on the following features $z_t$, derived from the states $\{s_t\}_{t=1\dots T}$:

The generated policy $\pi_g$ takes actions on all 21 joints.

Generator Policy Network $\pi_g$

Our final policy network consists of 3 hidden layers of 150 neurons each, all with tanh activation functions. We also experimented with two-layer networks with hidden sizes (32, 32) and (64, 64), but found the larger network necessary for reliable performance. We found that the network size needed for adequate performance grows with the number of degrees of freedom in the target skeleton. Unfortunately, larger networks also slow training significantly.
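A sketch of a network with this shape, again in PyTorch for illustration only (our implementation was TensorFlow-based; in practice a Gaussian action head with a learned log-standard-deviation would sit on top of the final layer):

```python
import torch.nn as nn

def make_policy_network(obs_dim: int, act_dim: int) -> nn.Sequential:
    """Illustrative policy trunk: 3 hidden layers of 150 tanh units each,
    with a linear output layer producing the mean action for all joints."""
    return nn.Sequential(
        nn.Linear(obs_dim, 150), nn.Tanh(),
        nn.Linear(150, 150), nn.Tanh(),
        nn.Linear(150, 150), nn.Tanh(),
        nn.Linear(150, act_dim),
    )
```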

Expert Policy $\pi_d$

The expert skill can be any movement of the human body, such as walking or running. Such a skill can be represented as a rollout (a series of states $\{s_1, s_2, \ldots, s_n\}$ at discrete timesteps $\{t_1, t_2, \ldots, t_n\}$) generated from the implicit expert policy $\pi_d$, a function mapping the expert's state $s$ to its next action $a$. We attempted to train the system from three sources of expert demonstrations, presented in order of increasing complexity:

  1. Reinforcement Learned Policy
  2. Motion Capture Data
  3. Video of Skill

These three demonstration sources give rise to important complications, and training from each presents its own challenges, which we discuss in detail below.


Sources of Expert Demonstrations

Learning from Artificial Experts (RL)

The simplest way of testing GAIL is to imitate a policy obtained through direct reinforcement learning, in which an agent interacts with the environment, receives rewards or penalties for those interactions, and learns or updates a policy based on those interactions and rewards. Ho et al. use this method to obtain expert policies. Note that an RL policy provides both states and actions, but actions are difficult to observe in the real world. The results of our Trust Region Policy Optimization (TRPO) RL policy and its learned GAIL policy are shown in the video below.

Video: Side-by-side comparison of learned humanoid walking policies from the RL and GAIL algorithms. We trained the RL policy (left) using TRPO, then used it to provide expert demonstrations for training the GAIL network (right).
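For reference, here is a minimal sketch of how demonstrations can be collected by rolling out a trained expert policy in a Gym environment. The environment name, the `expert_policy` callable, and the use of the classic (pre-0.26) Gym step API are assumptions for illustration.

```python
import gym
import numpy as np

def collect_expert_rollout(env_name: str, expert_policy, horizon: int = 1000):
    """Roll out a trained expert policy and record the visited states/actions.
    expert_policy is assumed to map an observation to an action."""
    env = gym.make(env_name)
    obs = env.reset()
    states, actions = [], []
    for _ in range(horizon):
        action = expert_policy(obs)
        states.append(np.asarray(obs))
        actions.append(np.asarray(action))
        obs, reward, done, info = env.step(action)
        if done:
            break
    return np.stack(states), np.stack(actions)
```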

Learning from Motion Capture Data

A setting much closer to reality is to imitate motion capture data. Here, the demonstrator provides neither actions nor rewards. Motion capture data is obtained by attaching trackable markers to human actors, who perform recorded skill demonstrations in front of a special tracking system. The CMU Graphics Lab's Motion Capture Database (MOCAP) provides several skills, such as walking and jumping. As this data provides only states, we adapted GAIL to function without expert actions available, as in Merel et al., an approach also referred to as S-GAIL.

Motion Capture Extraction and Resampling

The CMU motion capture dataset provides AMC files containing the time-series joint angles of the full humanoid skeleton for various motions (e.g. walking, dancing, etc.) at a frequency of 120 Hz. We created an automated tool to interpret the AMC files, down-sample the data to the same frame rate as our simulator, and generate rollouts from the motion capture data. For this project we down-sampled the animation to a frequency of approximately 66.67 Hz using cubic spline interpolation.
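A minimal sketch of this resampling step using SciPy's cubic spline interpolation; the array shapes and function name are illustrative rather than taken from our actual tool:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_joint_angles(angles_120hz: np.ndarray, target_hz: float = 200.0 / 3.0):
    """Resample mocap joint angles from 120 Hz down to the simulator's frame
    rate (~66.67 Hz) with cubic splines.

    angles_120hz: array of shape (n_frames, n_joints)."""
    n_frames = angles_120hz.shape[0]
    t_source = np.arange(n_frames) / 120.0                    # original timestamps
    t_target = np.arange(0.0, t_source[-1], 1.0 / target_hz)  # resampled timestamps
    spline = CubicSpline(t_source, angles_120hz, axis=0)      # one spline per joint
    return spline(t_target)
```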

Learning from Video

Learning from motion capture data requires putting markers on actors who demonstrate various skills and asking them to perform inside an expensive tracking system. We would instead like to learn skills directly from video, which is much cheaper to obtain. The first step toward this, within our existing GAIL framework, is to obtain three-dimensional human poses from videos of a human performing a skill. Most approaches to obtaining a three-dimensional pose from raw images fall into one of two categories:

  1. A pipeline approach, where first a 2D pose is estimated from an image, followed by the estimation of the 3D pose.
  2. Learning 3D poses directly from images in an end-to-end fashion.

We first tried the pipeline approach using a model trained for 2D pose estimation, and fed its output to another module which estimates the 3D pose from the 2D pose. We were able to achieve good results for 2D pose estimation, but the 3D pose estimates suffered from restricted movement for skills like walking. The brittleness of connecting two imperfect models taught us why end-to-end approaches are increasingly popular for such reconstruction problems.

Video: 2D pose results for Approach 1

Next, we experimented with methods for directly learning 3D poses from images. This overcomes the limitation of the pipeline approach by backpropagating 3D information about the skeletal structure to the 2D convolutional layers; in this way, the prediction of the 2D pose benefits from the 3D information encoded in the final prediction. This approach works in two stages. First, belief maps from the previous network stage are used to predict an updated set of belief maps for the 2D human joint positions. Second, these CNN-predicted belief maps are passed to a layer that uses a pretrained probabilistic 3D human pose model to lift the proposed 2D poses into 3D. This approach gave us 3D pose estimates of significantly higher quality.

We then extended this approach from images to videos by processing each frame independently. We present the results of this approach below.

Video: Pose estimation example for walking.
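A minimal sketch of this per-frame processing with OpenCV; the `estimate_3d_pose` callable stands in for the pretrained pose-estimation model's inference call and is not a real library function:

```python
import cv2

def poses_from_video(video_path: str, estimate_3d_pose):
    """Run a single-image 3D pose estimator independently on every frame of a
    video. estimate_3d_pose is a placeholder for the pretrained model."""
    capture = cv2.VideoCapture(video_path)
    poses = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # most models expect RGB
        poses.append(estimate_3d_pose(frame_rgb))
    capture.release()
    return poses
```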


Achievements

We implemented a deep RL development environment and many supporting tools for deep RL research. We trained expert RL policies, and trained GAIL policies from those experts. We also implemented two approaches for human pose reconstruction from video, using three pretrained networks from the literature. We plan on continuing this research in the future.

Lessons Learned

Conclusion and Future Work

Our goal in this project was to familiarize ourselves with state-of-the-art deep RL software and research techniques, while reimplementing previous work and laying a foundation for future research. While we did not exactly replicate the results of the prior work we reimplemented, we certainly achieved our two other goals. We now have a deep RL training environment and workflow, as well as several future research directions we hope to pursue and publish in peer-reviewed venues.

Possible future research directions include:


Acknowledgements

We would like to express our appreciation to Andrew Liao for providing the TensorFlow-based GAIL implementation used in this research. We would also like to thank our TAs Artem Molchanov and Shao-Hua Sun, and Prof. Joseph Lim for their guidance and support. The motion capture data used in this research was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217.