Learning Dynamic Bipedal Walking Across Stepping Stones

Helei Duan, Ashish Malik, Mohitvishnu S. Gadde, Jeremy Dao, Alan Fern, Jonathan Hurst 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

[ Paper ]

Abstract

In this work, we propose a learning approach for 3D dynamic bipedal walking when footsteps are constrained to stepping stones. While recent work has shown progress on this problem, real-world demonstrations have been limited to relatively simple open-loop, perception-free scenarios. Our main contribution is a more advanced learning approach that enables real-world demonstrations, using the Cassie robot, of closed-loop dynamic walking over moderately difficult stepping-stone patterns. Our approach first uses reinforcement learning (RL) in simulation to train a controller that maps footstep commands onto joint actions without any reference motion information. We then learn a model of that controller’s capabilities, which enables prediction of feasible footsteps given the robot’s current dynamic state. The resulting controller and model are then integrated with a real-time overhead camera system for detecting stepping stone locations. For evaluation, we develop a benchmark set of stepping stone patterns, which are used to test performance in both simulation and the real world. Overall, we demonstrate that sim-to-real learning is extremely promising for enabling dynamic locomotion over stepping stones. We also identify remaining challenges that motivate important future research directions.

Here, Cassie navigates towards the next footstep target using real-time position estimation from an overhead camera. The camera tracks Cassie's position relative to the target, marked by green dots in the simulation. The same stepping stone pattern is used in both hardware and simulation.

System Architecture

The policy network combines proprioceptive states processed by an LSTM dynamics module with footstep commands and a periodic clock signal. A feed-forward (FF) layer integrates these inputs to produce PD targets for motor control. This design allows task-specific inputs, such as commands, to adapt without altering the pretrained LSTM module, enabling flexible and reusable policy learning.
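The described architecture can be sketched as follows. This is a minimal illustration under assumed layer sizes and input dimensions (the paper does not list them here); the class and variable names are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn

class SteppingStonePolicy(nn.Module):
    """Illustrative sketch: proprioceptive states flow through an LSTM
    dynamics module; its latent output is concatenated with the footstep
    command and a periodic clock signal, and a feed-forward (FF) head
    produces PD targets for the motors. All sizes are assumptions."""

    def __init__(self, proprio_dim=40, cmd_dim=3, clock_dim=2,
                 lstm_hidden=128, num_motors=10):
        super().__init__()
        # Dynamics module sees only proprioception, so task-specific
        # inputs (commands, clock) can change without retraining it.
        self.dynamics_lstm = nn.LSTM(proprio_dim, lstm_hidden,
                                     batch_first=True)
        # FF head integrates the latent dynamics with the task inputs.
        self.ff = nn.Sequential(
            nn.Linear(lstm_hidden + cmd_dim + clock_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_motors),  # PD position targets
        )

    def forward(self, proprio_seq, footstep_cmd, clock, hidden=None):
        latent, hidden = self.dynamics_lstm(proprio_seq, hidden)
        x = torch.cat([latent[:, -1], footstep_cmd, clock], dim=-1)
        return self.ff(x), hidden

policy = SteppingStonePolicy()
proprio = torch.zeros(1, 5, 40)     # batch of 1, 5 timesteps of state
cmd = torch.zeros(1, 3)             # offset to the commanded footstep
clock = torch.tensor([[0.0, 1.0]])  # sin/cos phase of the clock
pd_targets, _ = policy(proprio, cmd, clock)
```

Keeping the task inputs out of the LSTM is what makes the dynamics module reusable across commands, as the paragraph above notes.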

Reachability Prediction Model

The Reachability Prediction Model in the paper is a learned model designed to predict reachable footstep locations for the robot based on its current dynamic state. This model encapsulates the robot’s dynamics in a latent state and predicts two key outcomes: the step error for the current footstep target and the latent robot state after the next touchdown. By leveraging these predictions, the model identifies regions where the robot can reliably step within a specified error threshold. Additionally, the model supports multi-step lookahead by recursively predicting future latent states, which is especially useful for navigating highly constrained terrains. It is trained using supervised learning on data that captures relationships between the robot’s state, footstep targets, and the resulting errors, enabling robust performance across diverse scenarios.
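A rough sketch of the two-headed predictor and its recursive multi-step lookahead is below. Layer sizes, the latent dimension, and the error threshold are illustrative assumptions; the function and class names are hypothetical.

```python
import torch
import torch.nn as nn

class ReachabilityModel(nn.Module):
    """Sketch of the described predictor: from a latent robot state and a
    candidate footstep target, predict (a) the step error for that target
    and (b) the latent state after the next touchdown. Sizes assumed."""

    def __init__(self, latent_dim=64, target_dim=3):
        super().__init__()
        self.error_head = nn.Sequential(
            nn.Linear(latent_dim + target_dim, 128), nn.ReLU(),
            nn.Linear(128, 1))
        self.next_state_head = nn.Sequential(
            nn.Linear(latent_dim + target_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))

    def forward(self, latent, target):
        x = torch.cat([latent, target], dim=-1)
        return self.error_head(x), self.next_state_head(x)

def feasible_sequence(model, latent, targets, threshold=0.1):
    """Multi-step lookahead: recursively roll the latent state forward
    through predicted touchdowns, checking each step's predicted error
    against the threshold. Threshold value is an assumption."""
    for target in targets:
        err, latent = model(latent, target)
        if err.item() > threshold:
            return False
    return True

model = ReachabilityModel()
z0 = torch.zeros(1, 64)                       # current latent robot state
plan = [torch.zeros(1, 3) for _ in range(3)]  # candidate footstep targets
ok = feasible_sequence(model, z0, plan)
```

Feeding the predicted post-touchdown latent state back into the model is what enables lookahead over several steps on constrained terrain.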

Experiments

Simulation test of the utility of the prediction model. Left: Stepping stone locations and the initial robot position are randomized for each trial; Cassie must cross the gap by stepping onto one of the stones. Right: “Random” selects uniformly among the candidate stones. “Closest” selects the stone nearest the robot (in the Euclidean sense), and “Closest on touchdown (TD) side” restricts selection to stones on the same side as the touchdown foot. Success rates are measured over 1000 independent simulation trials. Using the reachability model gives Cassie the highest rate of successfully crossing the gap.

Results

(a) The training curve shows the benefit of using a pretrained dynamics layer, which yields higher reward and faster convergence. (b) Evaluating the policy during training (at ∼50 million timesteps) shows that the pretrained method outperforms training from scratch on the benchmark set of patterns. (c) The policy sees only the immediate next footstep command. Among the emergent behaviors, the learned step frequency allows the robot to reach the next target step by taking a longer or shorter swing duration. The policy also learns to raise the body height to enable longer steps.