Imitation Learning

Learning dexterous manipulation

Imitation learning is fascinating to me because it promises to learn essentially any task, given a sufficiently large dataset of demonstrations. It is also a great tool for tackling Moravec's paradox - it allows us to teach policies in a simple, supervised manner on real-world data while avoiding some of the issues of RL such as reward engineering and sim2real transfer.

To this end, I have worked on imitation learning for dexterous hands at the Soft Robotics Lab at ETH Zurich, implementing a teleoperation pipeline and a data recording system, and adapting different imitation learning frameworks (ACT, Octo, Diffusion Policy). In addition, I am currently working on something exciting involving cross-embodiment learning!

A few snippets of teleoperation videos at 1x speed:

We can use the collected data to train policies such as ACT, Octo and Diffusion Policy. Here is a video showing a rollout of Diffusion Policy:
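Under the hood, all of these methods share the same supervised "match the demonstration" objective. As a minimal sketch of that idea, here is a toy behavior-cloning loop in PyTorch; the MLP policy, tensor dimensions, and random stand-in batch are illustrative assumptions, not our actual setup (ACT, Octo, and Diffusion Policy use transformer and diffusion architectures with more elaborate losses):

```python
import torch
import torch.nn as nn

# Toy behavior-cloning sketch: regress demonstrated actions from observations.
# Dimensions and the random batch are placeholders for real recorded data.
obs_dim, act_dim = 64, 16
policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Stand-in for a dataloader over recorded (observation, action) pairs.
obs_batch = torch.randn(32, obs_dim)
act_batch = torch.randn(32, act_dim)

for step in range(100):
    pred = policy(obs_batch)
    loss = nn.functional.mse_loss(pred, act_batch)  # imitation as supervised regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```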

For future work, I am interested in cross-embodiment learning and learning from general embodied data, particularly human data. I believe this is inherently a question of finding the right representations for both observations and actions. There are a lot of factors to consider: we have data with action labels, but also plenty of data without action labels. How can we bridge learning from both?

So how do we find the right representations? Either we rely on end-to-end learning, or we employ explicit representation learning techniques. For now, I believe explicit representation learning is the way to go. Robot learning is dominated by many small datasets with largely different observation and action spaces. Relying on end-to-end learning across lots of small datasets might not give us generalizable representations: our models might just learn to fit each dataset individually, which likely limits skill transfer across datasets. In contrast, with explicit representation learning, the learned representations for each dataset are well-aligned by design: we could use contrastive learning or other self-supervised multi-modal learning techniques to ensure this, as in the sketch below.
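To make the alignment idea concrete, here is a minimal CLIP-style contrastive (InfoNCE) sketch between two observation modalities, e.g. images and proprioception. The linear encoders, dimensions, and random paired batch are illustrative assumptions; the point is that paired samples are pulled together in a shared embedding space while unpaired samples are pushed apart:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Symmetric contrastive alignment between two modalities of the same timestep.
img_dim, prop_dim, emb_dim = 128, 32, 64
img_encoder = nn.Linear(img_dim, emb_dim)    # stand-in for a vision backbone
prop_encoder = nn.Linear(prop_dim, emb_dim)  # stand-in for a proprioception MLP

imgs = torch.randn(32, img_dim)              # paired batch: imgs[i] <-> props[i]
props = torch.randn(32, prop_dim)

z_img = F.normalize(img_encoder(imgs), dim=-1)
z_prop = F.normalize(prop_encoder(props), dim=-1)

temperature = 0.07
logits = z_img @ z_prop.T / temperature      # similarity of every pair in the batch
targets = torch.arange(logits.size(0))       # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.T, targets)) / 2
loss.backward()
```

The same recipe could in principle align observations across datasets or embodiments, which is exactly where I see it helping with skill transfer.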

Stay tuned for more updates on my work on this!