Generative Video for Humanoid Control

Matei Gardea, Koushil Sreenath

University of California, Berkeley

PDF | Code (coming soon)

Reference video

Scene + reference motion

Abstract

A central challenge in robot learning is converting high-level intent into physically executable behavior. Generative video models have started to show signs of understanding basic physical behavior in visual prediction tasks, suggesting that videos synthesized from language and images may encode useful cues about how objects, agents, and environments evolve under plausible real-world dynamics. In this work, we study a simple route for grounding synthesized videos of a robot acting in a static scene into explicit geometric and kinematic structures for control. The input is a natural-language instruction and a single exocentric RGB image of a Unitree G1 in its starting configuration. The output is a metric scene representation and a parameterized robot reference trajectory, which together define the simulation scene and target behavior for a tracking policy. The current prototype implements the scene-reconstruction side of this pipeline, while robot pose estimation, policy training, and physical deployment are in progress.

Method

The pipeline starts from one exocentric image and a language instruction. A video model generates a short video of the robot performing the instructed task. The moving robot is segmented out so that the static scene can be reconstructed from the remaining pixels. A dense monocular geometry model predicts per-frame point maps, and the static points are registered across the generated sequence. The reconstruction is then aligned to a gravity frame and converted into a mesh. In parallel, the generated robot motion is intended to define a reference trajectory for a tracking policy.
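As an illustration of the masking and registration steps, the sketch below drops robot pixels from a per-frame point map and then estimates a rigid transform between two sets of corresponding static points with the Kabsch algorithm. This is a minimal sketch under stated assumptions, not the project's implementation: the function names are hypothetical, point correspondences are assumed to be given, and the transform is rigid only, so any scale inconsistency between per-frame predictions would need a similarity-transform variant instead.

```python
import numpy as np

def static_points(point_map, robot_mask):
    """Keep the 3D points whose pixels are not covered by the robot mask.

    point_map: (H, W, 3) per-pixel 3D points for one frame.
    robot_mask: (H, W) boolean array, True on robot pixels.
    Returns an (N, 3) array of static points.
    """
    return point_map[~robot_mask]

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) with R @ src[i] + t ~= dst[i].

    src, dst: (N, 3) arrays of corresponding points (assumed given).
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Usage on synthetic data: recover a known rotation about z plus a translation.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.3])
R_est, t_est = kabsch(pts, pts @ R_true.T + t_true)
```

In practice the correspondences would come from whatever matching the registration stage uses, and a robust wrapper (for example RANSAC around this solver) would usually be needed to tolerate outliers.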

The current prototype implements the scene side of the pipeline: video generation, robot segmentation, monocular geometry prediction, static registration, gravity alignment, and mesh reconstruction. Robot pose estimation, policy training, and physical deployment remain in progress.
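The gravity-alignment stage can be pictured as rotating the reconstruction so that an estimated up direction coincides with the +z axis. The sketch below builds that rotation with Rodrigues' formula. It is only an illustration: the function name is hypothetical, and how the up direction is obtained (for example from a fitted floor-plane normal) is an assumption, since the text above does not specify it.

```python
import numpy as np

def rotation_to_z_up(up):
    """Return a 3x3 rotation R such that R @ (up / |up|) equals [0, 0, 1]."""
    a = np.asarray(up, dtype=float)
    a = a / np.linalg.norm(a)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(a, z)            # rotation axis scaled by sin(angle)
    c = float(a @ z)              # cos(angle)
    if np.isclose(c, -1.0):
        # up points straight down: rotate 180 degrees about the x-axis.
        return np.diag([1.0, -1.0, -1.0])
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])   # skew-symmetric matrix of v
    return np.eye(3) + K + (K @ K) / (1.0 + c)

# Usage: align a point cloud whose estimated up direction is `up_est`.
up_est = np.array([0.1, 0.9, 0.2])       # hypothetical estimate
R_align = rotation_to_z_up(up_est)
points = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
points_aligned = points @ R_align.T      # rotate each row vector
```

This fixes only the two tilt degrees of freedom; the heading about the vertical axis stays arbitrary and would need a separate convention.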

Demo

00 Input image (the initial image used as input for the generated task video)

01 Generated video

02 Dynamic mask

03 Depth estimate

04 Static registration

05 Gravity alignment

06 Scene mesh

07 Scene + reference motion