Training Data for HumanPlus
A comprehensive breakdown of Stanford's HumanPlus system -- its shadowing transformer trained on 40 hours of AMASS motion capture, the imitation transformer for autonomous skill learning, and how Claru provides the human motion and teleoperation data HumanPlus requires.
What Is HumanPlus?
HumanPlus is a full-stack system for humanoid robots to learn motion and autonomous skills from human data, developed at Stanford by Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn. Published at CoRL 2024 (arXiv 2406.10454), HumanPlus demonstrates that a humanoid robot can learn to shadow human motion in real time using only a single RGB camera, then leverage that shadowing capability to collect task-specific demonstration data and learn autonomous skills via behavior cloning.
The system runs on a custom 33-DoF humanoid built on the Unitree H1 platform, featuring two 6-DoF dexterous hands (Inspire Hands), two 1-DoF wrists, and a 19-DoF body (two 4-DoF arms, two 5-DoF legs, and a 1-DoF waist). Two egocentric RGB cameras are mounted on the head for visual perception. Using up to 40 teleoperated demonstrations per task, the humanoid autonomously completes tasks including wearing a shoe and standing up, unloading objects from warehouse racks, folding a sweatshirt, rearranging objects, typing on a keyboard, and greeting another robot -- with success rates ranging from 60% to 100%.
The fundamental insight of HumanPlus is that the human body is the best teleoperation interface for a humanoid robot. Rather than designing complex teleoperation rigs with haptic gloves or VR controllers, HumanPlus uses off-the-shelf pose estimation to track a human operator's body motion from a single RGB camera, retargets that motion to the humanoid's joint space in real time, and uses the resulting data for imitation learning. This makes data collection as simple as having a person perform the task while the robot mirrors their movements.
HumanPlus at a Glance
Input / Output Specification
| Parameter | Specification |
|---|---|
| Shadowing Input | Single RGB camera stream processed by off-the-shelf pose estimation (e.g., WHAM) to extract 3D human body and hand joint positions in real time |
| Shadowing Output | 33-DoF humanoid joint-position targets at 30 Hz via the Humanoid Shadowing Transformer (HST) |
| Autonomous Perception | Two head-mounted egocentric RGB cameras providing stereo visual input for the imitation policy |
| Autonomous Action | 33-DoF joint-position targets at 30 Hz via the Humanoid Imitation Transformer (HIT), trained on shadowing-collected demonstrations |
| Language Conditioning | Not language-conditioned; task selection is implicit in the demonstration dataset used for training each skill |
| Control Frequency | 30 Hz for both shadowing and autonomous execution |
Architecture and Key Innovations
HumanPlus relies on two decoder-only transformer models that operate at different stages of the learning pipeline. The Humanoid Shadowing Transformer (HST) is the low-level controller that converts retargeted human joint trajectories into stable humanoid motion. It is trained in simulation via reinforcement learning (PPO) on the AMASS dataset, which contains 40 hours and over 11,000 unique sequences of human motion capture data. The HST takes as input the current humanoid joint state and the retargeted target pose from the human operator, and outputs joint-position commands that achieve the target pose while maintaining balance and physical feasibility. A key property of the HST is zero-shot sim-to-real transfer -- the policy trained in simulation deploys directly to the physical humanoid without any real-world fine-tuning.
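The paper's exact reward terms are not reproduced here, but the tracking objective can be sketched as an exponential pose-tracking reward plus a smoothness penalty — a common shape for RL-based motion tracking. The 0.25 scale and 0.01 smoothness coefficient below are illustrative values, not HumanPlus's:

```python
import numpy as np

def tracking_reward(q, q_target, qd, sigma=0.25, smooth_coef=0.01):
    """Exponential pose-tracking reward with a smoothness penalty.

    q, q_target : (33,) current and retargeted joint positions (rad)
    qd          : (33,) joint velocities, penalized to discourage jitter
    """
    pose_err = np.sum((q - q_target) ** 2)
    r_track = np.exp(-pose_err / sigma)        # equals 1 when perfectly on target
    r_smooth = -smooth_coef * np.sum(qd ** 2)  # penalize jerky motion
    return r_track + r_smooth
```

In practice such a reward is combined with termination penalties for falling and with domain randomization, which is what makes the zero-shot sim-to-real transfer plausible.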
The retargeting pipeline bridges the kinematic gap between human and humanoid bodies. Human 3D pose estimation (using models like WHAM) extracts body and hand joint positions from a single RGB camera at 30 Hz. These human joint positions are then mapped to the humanoid's joint space through kinematic retargeting that accounts for differences in limb lengths, joint limits, and morphology. The retargeted targets are physically achievable poses for the humanoid, not raw human joint angles.
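To make the retargeting step concrete, the sketch below recovers a single elbow flexion angle from three estimated 3D keypoints and clamps it to the humanoid's joint range. The `ELBOW_LIMITS` values are hypothetical (the real limits come from the robot's URDF), and the full pipeline does this for all 33 DoFs while compensating for morphology differences:

```python
import numpy as np

# Hypothetical elbow joint limits in radians; real values come from the URDF.
ELBOW_LIMITS = (0.0, 2.6)

def elbow_angle(shoulder, elbow, wrist):
    """Interior angle at the elbow from three 3D keypoints (metres)."""
    u = shoulder - elbow
    v = wrist - elbow
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_a, -1.0, 1.0))

def retarget_elbow(shoulder, elbow, wrist):
    """Map the human elbow angle onto the humanoid's joint range."""
    # Convert interior angle (pi = straight arm) to flexion (0 = straight).
    flexion = np.pi - elbow_angle(shoulder, elbow, wrist)
    return float(np.clip(flexion, *ELBOW_LIMITS))
```

A straight arm retargets to 0 rad of flexion; a right-angle bend retargets to roughly pi/2, clipped if the humanoid cannot reach it.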
The Humanoid Imitation Transformer (HIT) is the high-level autonomous policy that learns task-specific skills from demonstration data collected via shadowing. During data collection, a human operator performs the task while the robot shadows their motion in real time, producing paired (egocentric observation, robot action) data. The HIT takes egocentric RGB images from the two head-mounted cameras as input and predicts joint-position action chunks, following the action-chunking idea introduced by ACT (Action Chunking with Transformers). Unlike ACT's CVAE formulation, however, HIT is a decoder-only transformer that uses prediction of future image features as an auxiliary regularization objective.
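At inference time, action chunking is commonly paired with temporal ensembling, as in ACT: each control step executes a weighted blend of every chunk that predicted an action for it. Whether HIT applies this exact aggregation is a codebase detail we do not assert; the sketch below shows the general technique. Note that ACT weights the oldest prediction highest, whereas this sketch weights the newest — an arbitrary choice, with an illustrative decay value:

```python
import numpy as np

def temporal_ensemble(chunks, t, decay=0.1):
    """Blend every chunk prediction that covers control step t.

    chunks : dict {t0: (K, 33) array} -- action chunk predicted at step t0
    t      : current control step
    decay  : exponential weight on prediction age (illustrative value)
    """
    actions, weights = [], []
    for t0, chunk in chunks.items():
        age = t - t0
        if 0 <= age < len(chunk):
            actions.append(chunk[age])            # this chunk's action for step t
            weights.append(np.exp(-decay * age))  # newer predictions weighted higher
    w = np.asarray(weights)
    w /= w.sum()
    return np.sum(np.asarray(actions) * w[:, None], axis=0)
```

Blending overlapping chunks smooths out discontinuities at chunk boundaries, which matters when the output is a 33-DoF joint-position stream driving a balancing humanoid.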
The two-stage design (HST for motion transfer + HIT for task learning) creates a virtuous data collection loop. Because the HST enables real-time shadowing, data collection requires no specialized hardware beyond a camera pointed at the human operator. This makes it feasible to collect 10-40 demonstrations per task in minutes rather than hours, which is sufficient for the HIT to learn robust autonomous policies. The simplicity of the data collection pipeline is arguably HumanPlus's most important practical innovation.
Comparison with Related Models
How HumanPlus compares to alternative humanoid learning approaches.
| Dimension | HumanPlus | GR00T N1 | H2O (1X Technologies) | Mobile ALOHA |
|---|---|---|---|---|
| Teleoperation method | RGB camera + pose estimation (no hardware needed) | Standard teleoperation devices | Exoskeleton / VR controllers | Direct kinesthetic teaching |
| Embodiment | 33-DoF humanoid (Unitree H1 + Inspire Hands) | Cross-embodiment (humanoids, arms) | Humanoid (1X Neo) | Bimanual mobile manipulator |
| Demos per task | 10-40 | 500-2,000 | Not publicly specified | 50 |
| Locomotion + manipulation | Yes (whole-body) | Yes (whole-body) | Yes (whole-body) | Mobile base only (no walking) |
| Language conditioned | No | Yes | No | No |
Training Data Requirements
HumanPlus has two distinct data requirements corresponding to its two transformer models. The Humanoid Shadowing Transformer (HST) is trained on human motion capture data in simulation. The published system uses the AMASS dataset, which aggregates motion capture data from multiple sources (CMU MoCap, Human3.6M, ACCAD, and others) into a unified SMPL format, totaling 40 hours and over 11,000 unique motion sequences. AMASS covers diverse activities including walking, running, dancing, sports, object manipulation, and daily activities. The HST is trained via PPO reinforcement learning in the GPU-parallelized IsaacGym simulator, where the humanoid attempts to track retargeted AMASS motions while maintaining balance. No real-world data is needed for this stage.
The Humanoid Imitation Transformer (HIT) requires task-specific demonstration data collected on the physical humanoid via shadowing. For each target skill, a human operator performs the task while the robot mirrors their motion in real time. The resulting data consists of synchronized egocentric RGB frames (from the two head-mounted cameras, typically at 480x640 resolution) and 33-DoF joint-position recordings at 30 Hz. The published tasks required between 10 and 40 demonstrations each -- for example, 20 demonstrations for folding a sweatshirt, 40 for wearing a shoe and standing up, and 10 for greeting another robot.
For the HST, the critical data property is motion diversity. The 40 hours in AMASS cover a wide range of human activities, and this breadth is what enables the HST to track arbitrary human motion during deployment. If the target application involves motions significantly outside AMASS's coverage (e.g., highly specialized industrial movements or sports-specific techniques), supplementing with additional motion capture data improves shadowing fidelity. Each motion sequence should be in SMPL or SMPL-H format with 3D joint positions at 30+ Hz.
For the HIT, data quality is more important than quantity. Because only 10-40 demonstrations are used per task, each demonstration must be a clean, successful execution of the target skill. The egocentric camera viewpoint should be consistent across demonstrations, the objects should be in approximately similar starting configurations, and the human operator should execute the task smoothly without hesitation or errors. Failed or partial demonstrations degrade HIT performance significantly at these small sample sizes.
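These quality requirements lend themselves to automated gating before training. The sketch below is a minimal, hypothetical QA check — the resolution, minimum-duration, and joint-jump thresholds are illustrative defaults, not values from HumanPlus:

```python
import numpy as np

def validate_episode(images, joints, fps=30, resolution=(480, 640)):
    """Basic QA gate for one shadowing demonstration before HIT training.

    images : (T, H, W, 3) uint8 egocentric frames from one camera
    joints : (T, 33) joint-position targets recorded at the same rate
    Returns a list of human-readable problems (empty list = passes).
    """
    problems = []
    if len(images) != len(joints):
        problems.append(f"frame/joint length mismatch: {len(images)} vs {len(joints)}")
    if images.shape[1:3] != resolution:
        problems.append(f"unexpected resolution {images.shape[1:3]}")
    if len(joints) < 2 * fps:
        problems.append("episode shorter than 2 s; likely truncated")
    # Large per-step jumps suggest dropped frames or pose-tracking glitches.
    if len(joints) > 1 and np.abs(np.diff(joints, axis=0)).max() > 0.5:
        problems.append("joint jump > 0.5 rad between consecutive steps")
    return problems
```

At 10-40 demonstrations per task, rejecting one bad episode at ingest is far cheaper than diagnosing a degraded policy after training.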
How Claru Data Integrates with HumanPlus
Claru provides data for both stages of the HumanPlus pipeline. For the HST training stage, we offer extensive human motion capture datasets in SMPL/SMPL-H format that can supplement or replace AMASS for broader motion coverage. Our motion data spans diverse activities including household manipulation, industrial assembly, warehouse operations, personal care, and fitness -- captured via high-fidelity optical motion capture systems (Vicon, OptiTrack) at 120 Hz and downsampled to the target training rate. This expanded motion vocabulary enables the HST to track a wider range of human motions during shadowing, particularly for manipulation-heavy tasks that are underrepresented in standard mocap datasets.
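Downsampling 120 Hz capture to a 30 Hz training rate can be as simple as per-dimension linear interpolation onto a new time grid; a minimal sketch (a production pipeline would typically low-pass filter before resampling):

```python
import numpy as np

def resample_motion(motion, src_hz=120, dst_hz=30):
    """Linearly resample a (T, D) joint-position sequence to a new rate."""
    T, D = motion.shape
    duration = (T - 1) / src_hz                         # seconds spanned by the clip
    t_src = np.linspace(0.0, duration, T)               # original sample times
    t_dst = np.arange(0.0, duration + 1e-9, 1.0 / dst_hz)  # target sample times
    return np.stack(
        [np.interp(t_dst, t_src, motion[:, d]) for d in range(D)], axis=1
    )
```

Interpolating on a time grid, rather than naively taking every fourth frame, also handles capture systems whose rate is not an exact multiple of the target rate.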
For the HIT training stage, Claru provides task-specific demonstration datasets collected on humanoid platforms via shadowing or direct teleoperation. Our collection pipeline captures synchronized egocentric RGB frames from head-mounted cameras at 30 Hz, full 33-DoF joint-position trajectories, and task success labels. We enforce strict data quality standards matching HumanPlus's requirements: each demonstration is a verified successful execution, camera viewpoints are consistent, and object configurations are controlled. For a typical HumanPlus skill, we deliver 20-50 high-quality demonstrations per task.
Beyond direct data provision, Claru's egocentric video catalog (3M+ clips of human daily activities captured from head-mounted cameras) provides a rich source of task-relevant visual data for pretraining or augmenting HIT's vision backbone. While HumanPlus does not currently use video pretraining, the architecture is compatible with frozen pretrained vision encoders, and teams exploring this extension can leverage our egocentric video for backbone pretraining. All data includes full provenance documentation, sensor calibration, and compatibility verification with HumanPlus's open-source codebase.
Key References
- [1] Fu, Zhao, Wu, Wetzstein, & Finn. "HumanPlus: Humanoid Shadowing and Imitation from Humans." CoRL 2024 (arXiv:2406.10454).
- [2] Mahmood et al. "AMASS: Archive of Motion Capture as Surface Shapes." ICCV 2019.
- [3] Zhao et al. "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware." RSS 2023.
- [4] Fu et al. "Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation." arXiv:2401.02117, 2024.
- [5] Shin et al. "WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion." CVPR 2024.
Frequently Asked Questions
How does HumanPlus capture human motion for shadowing?
HumanPlus uses off-the-shelf 3D human pose estimation (such as WHAM) running on a single RGB camera pointed at the human operator. The estimated 3D body and hand joint positions are retargeted to the humanoid's 33-DoF joint space in real time at 30 Hz. The Humanoid Shadowing Transformer (HST) then converts these retargeted targets into physically stable joint-position commands. The operator simply performs the task naturally while being filmed -- no gloves, exoskeletons, or VR equipment needed.
What is AMASS, and how does HumanPlus use it?
AMASS (Archive of Motion Capture as Surface Shapes) aggregates human motion capture data from 15+ sources into a unified SMPL body model format. It contains 40 hours and over 11,000 unique motion sequences spanning walking, running, dancing, sports, and daily activities. HumanPlus trains its low-level Humanoid Shadowing Transformer on AMASS via reinforcement learning in simulation, teaching the humanoid to track diverse human motions while maintaining balance. The diversity of AMASS is critical -- it ensures the HST can handle arbitrary real-time human motion during deployment.
How many demonstrations does HumanPlus need per task?
HumanPlus achieves autonomous task completion with remarkably few demonstrations: 10-40 per task in the published experiments. For example, folding a sweatshirt used 20 demonstrations, wearing a shoe and standing up used 40, and greeting another robot used just 10. The low data requirement is enabled by the egocentric viewpoint consistency (both human and robot see similar views during shadowing) and the Humanoid Imitation Transformer's action chunking architecture.
Can HumanPlus be adapted to humanoids other than the Unitree H1?
The HumanPlus framework is platform-agnostic in principle, though the published system was built on the Unitree H1 with Inspire Hands (33 DoF total). Adapting to a different humanoid requires: (1) adjusting the kinematic retargeting to account for different limb proportions and joint limits, (2) retraining the HST in simulation with the new robot's URDF model and AMASS data, and (3) collecting task demonstrations via shadowing on the new platform. The HST retraining is the most computationally intensive step but remains feasible on a single modern GPU.
What format does Claru deliver HumanPlus training data in?
For HST training, Claru delivers human motion capture data in SMPL/SMPL-H format at 30+ Hz with full 3D joint positions, compatible with HumanPlus's simulation training pipeline. For HIT training, we deliver task-specific demonstrations with synchronized egocentric RGB frames (480x640 at 30 Hz) and 33-DoF joint-position trajectories from the target humanoid platform. Each demonstration is verified for task success and motion quality. Data is packaged in HDF5 format compatible with HumanPlus's open-source codebase.
Get HumanPlus-Ready Training Data
Tell us about your HumanPlus project -- target humanoid platform, task skills, and motion vocabulary -- and we will deliver human motion capture data for HST training and task-specific demonstrations for HIT learning.