Generative Video for Humanoid Control

Matei Gardea, Koushil Sreenath

University of California, Berkeley

[Figure panels: Reference video; Scene + reference motion]

Abstract

A central challenge in robot learning is converting high-level intent into physically executable behavior. Generative video models have started to show signs of understanding basic physical behavior in visual prediction tasks, suggesting that videos synthesized from language and images may encode useful cues about how objects, agents, and environments evolve under plausible real-world dynamics. In this work, we study a simple route for grounding synthesized videos of a robot acting in a static scene into explicit geometric and kinematic structures for control. The input is a natural-language instruction and a single exocentric RGB image of a Unitree G1 in its starting configuration. The output is a metric scene representation and a parameterized robot reference trajectory, which together define the simulation scene and target behavior for a tracking policy. The current prototype implements the scene-reconstruction side of this pipeline, while robot pose estimation, policy training, and physical deployment are in progress.

Method

The pipeline starts from one exocentric image and a language instruction. A video model generates a short robot video. The moving robot is segmented out so the static scene can be reconstructed from the remaining pixels. Dense monocular geometry provides per-frame point maps, and static points are registered across the generated sequence. The reconstruction is then aligned to a gravity frame and converted into a mesh. In parallel, the generated robot motion is intended to define a reference trajectory for a tracking policy.
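The static-registration step can be sketched as a rigid alignment between the point maps of two frames, restricted to pixels that the robot mask marks as static in both. This is a minimal illustration and not the project's implementation: it uses the standard Kabsch (SVD-based) solution, and it assumes that pixel (i, j) in one frame corresponds to pixel (i, j) in the other, which in practice would have to come from a separate matching or tracking step. The function names `kabsch_rigid_align` and `register_static_points` are hypothetical.

```python
import numpy as np


def kabsch_rigid_align(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src_i + t ~= dst_i.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def register_static_points(points_a, points_b, dynamic_a, dynamic_b):
    """Align frame A to frame B using only pixels that are static in both.

    points_a, points_b: (H, W, 3) per-frame point maps.
    dynamic_a, dynamic_b: (H, W) boolean masks, True where the robot is.
    ASSUMPTION: pixel-wise correspondence between the two frames.
    """
    static = ~dynamic_a & ~dynamic_b
    return kabsch_rigid_align(points_a[static], points_b[static])
```

Excluding masked pixels keeps the moving robot from biasing the estimate; a real pipeline would likely add outlier rejection on top of this.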

The current prototype implements the scene side of the pipeline: video generation, robot segmentation, monocular geometry prediction, static registration, gravity alignment, and mesh reconstruction. Robot pose estimation, policy training, and physical deployment remain in progress.
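As an illustration of the gravity-alignment stage, the sketch below rotates a reconstructed point cloud so that an estimated up direction maps onto the +z axis, using Rodrigues' rotation formula. How the up direction is obtained is not specified here; the code simply takes it as an input (for example, it could come from a fitted floor normal), and the z-up convention is an assumption. The names `rotation_aligning` and `gravity_align` are hypothetical.

```python
import numpy as np


def rotation_aligning(a, b):
    """Rotation matrix that takes direction a onto direction b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)          # rotation axis scaled by sin(angle)
    c = float(np.dot(a, b))     # cos(angle)
    s = np.linalg.norm(v)       # sin(angle)
    if s < 1e-12:
        if c > 0:
            return np.eye(3)    # already aligned
        # Antiparallel: rotate by pi about any axis perpendicular to a.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-6:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    # Rodrigues' formula written in terms of the unnormalized axis v.
    return np.eye(3) + K + (K @ K) * ((1.0 - c) / s**2)


def gravity_align(points, up_in_scene):
    """Rotate (N, 3) points so that up_in_scene maps to +z.

    ASSUMPTION: up_in_scene is supplied by some upstream estimate.
    Returns the rotated points and the rotation matrix.
    """
    R = rotation_aligning(up_in_scene, [0.0, 0.0, 1.0])
    return points @ R.T, R
```

After this rotation, a horizontal floor has constant z, which makes it straightforward to place the ground plane when building the simulation scene.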

Demo

00 Input image (initial image used as the input for the generated task video)
01 Generated video
02 Dynamic mask
03 Depth estimate
04 Static registration
05 Gravity alignment
06 Scene mesh
07 Scene + reference motion