No More Blind Spots: Learning Vision-Based Omnidirectional Bipedal Locomotion for Challenging Terrain

Mohitvishnu S. Gadde, Pranay Dugar, Ashish Malik, Alan Fern 2025 IEEE Humanoids

[ Paper ]

Abstract

Effective bipedal locomotion in dynamic environments, such as cluttered indoor spaces or uneven terrain, requires agile and adaptive movement in all directions. This necessitates omnidirectional terrain sensing and a controller capable of processing such input. We present a learning framework for vision-based omnidirectional bipedal locomotion, enabling seamless movement using depth images. A key challenge is the high computational cost of rendering omnidirectional depth images in simulation, making traditional sim-to-real reinforcement learning (RL) impractical. Our method combines a robust blind controller with a teacher policy that supervises a vision-based student policy, trained on noise-augmented terrain data to avoid rendering costs during RL and ensure robustness. We also introduce a data augmentation technique for supervised student training, accelerating training by up to 10 times compared to conventional methods. Our framework is validated through simulation and real-world tests, demonstrating effective omnidirectional locomotion with minimal reliance on expensive rendering. This is, to the best of our knowledge, the first demonstration of vision-based omnidirectional bipedal locomotion, showcasing its adaptability to diverse terrains.

Our fully learned controller integrates vision and locomotion to produce reactive, agile gaits. The proposed approach enables the bipedal robot Cassie to traverse challenging terrain, including randomly placed high blocks, stairs, and a 0.5 m step-up (∼60% of leg length), at speeds of up to 1 m/s.

System Architecture

Overview of the locomotion policy with vision module.

The above image illustrates the overall system, which has two main components: (1) a locomotion policy, which outputs PD setpoints for the robot actuators based on proprioception, a local terrain heightmap, and user commands, and (2) a heightmap predictor, which outputs a predicted heightmap based on proprioceptive information and images from a depth camera. These components are learned in simulation and then transferred to the real robot (Cassie).
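The sketch below illustrates this dataflow as a minimal PyTorch example: a heightmap predictor consumes a depth image plus proprioception, and a locomotion policy consumes proprioception, the predicted heightmap, and a user command to produce PD setpoints. All module names, layer sizes, and tensor shapes are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the two-component pipeline described above
# (assumed module names and tensor shapes; not the published implementation).
import torch
import torch.nn as nn

class HeightmapPredictor(nn.Module):
    """Predicts a local terrain heightmap from a depth image and proprioception."""
    def __init__(self, hmap_dim=20 * 20, proprio_dim=40):
        super().__init__()
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 16 + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, hmap_dim),
        )

    def forward(self, depth, proprio):
        z = self.depth_encoder(depth)
        return self.head(torch.cat([z, proprio], dim=-1))

class LocomotionPolicy(nn.Module):
    """Maps proprioception, predicted heightmap, and user command to PD setpoints."""
    def __init__(self, proprio_dim=40, hmap_dim=20 * 20, cmd_dim=3, n_joints=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + hmap_dim + cmd_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_joints),
        )

    def forward(self, proprio, hmap, cmd):
        return self.net(torch.cat([proprio, hmap, cmd], dim=-1))

# One control step: depth image + proprioception -> heightmap -> PD setpoints.
predictor, policy = HeightmapPredictor(), LocomotionPolicy()
depth = torch.zeros(1, 1, 64, 64)      # depth image
proprio = torch.zeros(1, 40)           # joint positions/velocities, orientation, ...
cmd = torch.tensor([[0.5, 0.0, 0.0]])  # commanded x/y velocity and turn rate
pd_setpoints = policy(proprio, predictor(depth, proprio), cmd)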

Terrain-Aware Policy and Heightmap Predictor

Policy Architecture
Heightmap Predictor

The policy is a neural network that maps observation sequences to actions. It consists of two key components: a pretrained blind policy and a vision-based modulator. The blind policy provides a baseline locomotion control signal suitable for moderate terrain. For more complex terrain, the vision-based modulator refines the baseline control by incorporating local terrain details, enabling more adaptive and robust locomotion.
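Below is a minimal sketch of this decomposition, assuming the modulator produces a residual correction that is added to the blind policy's action (one plausible reading of "refines the baseline control"); the architectures, names, and dimensions are placeholders.

# Sketch of the blind-policy + vision-based-modulator decomposition
# (illustrative only; residual modulation is an assumption).
import torch
import torch.nn as nn

class BlindPolicy(nn.Module):
    """Pretrained baseline: proprioception + command history -> PD setpoints."""
    def __init__(self, proprio_dim=40, cmd_dim=3, n_joints=10, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(proprio_dim + cmd_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_joints)

    def forward(self, proprio_seq, cmd_seq):
        h, _ = self.rnn(torch.cat([proprio_seq, cmd_seq], dim=-1))
        return self.out(h[:, -1])

class VisionModulator(nn.Module):
    """Refines the baseline action using the local terrain heightmap."""
    def __init__(self, proprio_dim=40, hmap_dim=400, n_joints=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + hmap_dim, 256), nn.ReLU(),
            nn.Linear(256, n_joints),
        )

    def forward(self, proprio, hmap):
        return self.net(torch.cat([proprio, hmap], dim=-1))

blind, modulator = BlindPolicy(), VisionModulator()
proprio_seq = torch.zeros(1, 32, 40)   # history of proprioceptive states
cmd_seq = torch.zeros(1, 32, 3)        # history of user commands
hmap = torch.zeros(1, 400)             # local terrain heightmap
base = blind(proprio_seq, cmd_seq)     # baseline action from the pretrained blind policy
action = base + modulator(proprio_seq[:, -1], hmap)  # terrain-aware refinement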

The heightmap predictor is a neural network that estimates the terrain heightmap ahead of the robot from depth images and robot states. It consists of two stages: Stage 1 uses an LSTM to reconstruct missing terrain details by leveraging temporal information, trained with a mean-squared error loss. Stage 2 refines the heightmap with a U-Net, enhancing edge sharpness and surface flatness with an L1 loss, resulting in more accurate and stable predictions.
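The sketch below mirrors this two-stage structure with placeholder shapes: an LSTM produces a coarse heightmap supervised with MSE, and a small encoder-decoder stands in for the U-Net refinement supervised with L1 loss. It is illustrative only, not the published architecture.

# Two-stage heightmap predictor sketch (assumed feature/state dimensions;
# the U-Net is abbreviated to a tiny residual encoder-decoder).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage1LSTM(nn.Module):
    """Reconstructs a coarse heightmap from sequences of depth features and robot states."""
    def __init__(self, feat_dim=128, state_dim=40, hmap_hw=20):
        super().__init__()
        self.hmap_hw = hmap_hw
        self.rnn = nn.LSTM(feat_dim + state_dim, 256, batch_first=True)
        self.out = nn.Linear(256, hmap_hw * hmap_hw)

    def forward(self, feat_seq, state_seq):
        h, _ = self.rnn(torch.cat([feat_seq, state_seq], dim=-1))
        return self.out(h[:, -1]).view(-1, 1, self.hmap_hw, self.hmap_hw)

class Stage2Refiner(nn.Module):
    """U-Net-style refinement that sharpens edges and flattens surfaces."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(
            nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, coarse):
        return coarse + self.up(self.down(coarse))  # residual refinement

stage1, stage2 = Stage1LSTM(), Stage2Refiner()
feat_seq, state_seq = torch.zeros(1, 16, 128), torch.zeros(1, 16, 40)
target = torch.zeros(1, 1, 20, 20)      # ground-truth heightmap from simulation
coarse = stage1(feat_seq, state_seq)
refined = stage2(coarse)
loss = F.mse_loss(coarse, target) + F.l1_loss(refined, target)  # Stage 1: MSE, Stage 2: L1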

Experiments

Various terrains used to train the terrain-aware locomotion policy.

Results

[A] Ablation study of the policy with the simulation heightmap. [B] Ablation study of the policy with different heightmap predictor architectures. Each ablation study uses data collected from the range of terrains defined in Table I. Success rate is the fraction of rollouts in which the robot does not fall within 10 seconds. Episodes with foot collisions counts the episodes in which one or more foot collision events occur during a rollout; such collisions are undesirable for hardware deployment. Termination due to foot collision is the percentage of foot collision events that lead to failure. All plots show 95% confidence intervals.
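As a rough illustration of how these metrics could be computed from rollout logs, the sketch below uses assumed log fields and a normal-approximation 95% confidence interval; the termination statistic is approximated per episode rather than per collision event.

# Hypothetical metric computation from rollout logs (field names are assumptions).
import numpy as np

def summarize(rollouts, horizon_s=10.0):
    """Each rollout is a dict with 'fall_time' (None if no fall) and
    'foot_collisions' (number of foot collision events)."""
    success = np.array([r["fall_time"] is None or r["fall_time"] >= horizon_s
                        for r in rollouts], dtype=float)
    has_collision = np.array([r["foot_collisions"] > 0 for r in rollouts], dtype=float)
    failed_with_collision = np.array(
        [r["foot_collisions"] > 0 and r["fall_time"] is not None for r in rollouts],
        dtype=float)

    def ci95(x):
        # Normal-approximation confidence interval: mean +/- 1.96 * standard error.
        mean = x.mean()
        half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
        return mean, half

    return {
        "success_rate": ci95(success),
        "episodes_with_foot_collision": has_collision.sum(),
        "termination_due_to_foot_collision":
            failed_with_collision.sum() / max(has_collision.sum(), 1),
    }

# Example usage on three toy rollouts.
logs = [{"fall_time": None, "foot_collisions": 0},
        {"fall_time": 4.2, "foot_collisions": 2},
        {"fall_time": None, "foot_collisions": 1}]
print(summarize(logs))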