Humanoid Robot Training Data: Whole-Body Demonstrations at Scale

Humanoid robots require training data that captures the full complexity of human movement: coordinated bimanual manipulation, dynamic locomotion, and seamless transitions between walking, reaching, and grasping. Public datasets overwhelmingly feature single-arm tabletop manipulation, leaving a critical gap for labs building general-purpose humanoid policies.

Why Is Training Data the Limiting Factor for Humanoid Robots?

Humanoid robots represent the most demanding form factor for learned policies. Unlike fixed-base manipulators, humanoids must simultaneously control locomotion, balance, and upper-body manipulation. GR00T N1, NVIDIA's open foundation model for humanoid robots, uses a dual-system architecture combining a vision-language model with a diffusion transformer to generate whole-body actions. The authors demonstrated that training on a heterogeneous data pyramid — mixing real-robot trajectories, human demonstration videos, and synthetic data — was essential for generalization across manipulation tasks. However, even with this architecture, the model's performance was bounded by the diversity and quality of whole-body demonstration data available. Figure AI's Helix model similarly combines a VLM backbone with a latent action diffusion policy, trained on thousands of hours of teleoperated demonstrations to achieve full-body humanoid control including walking, picking, and placing in real-world environments.

[1][2]
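
To make the dual-system idea concrete, here is a minimal PyTorch sketch of a vision-language backbone conditioning a fast, diffusion-style whole-body action head. It is an illustration under assumed module names and sizes (WholeBodyActionHead, DualSystemPolicy, a 43-dimensional action vector), not GR00T N1's or Helix's actual implementation.

```python
# Minimal sketch of a dual-system VLA policy: a slow vision-language backbone
# ("System 2") conditions a fast diffusion-style whole-body action head
# ("System 1"). Module names, sizes, and the 43-dim action vector are
# illustrative assumptions, not GR00T N1's or Helix's actual implementation.
import torch
import torch.nn as nn


class WholeBodyActionHead(nn.Module):
    """Iteratively refines a whole-body action chunk conditioned on VLM features."""

    def __init__(self, action_dim: int = 43, chunk_len: int = 16, cond_dim: int = 512):
        super().__init__()
        self.action_dim, self.chunk_len = action_dim, chunk_len
        self.net = nn.Sequential(
            nn.Linear(action_dim * chunk_len + cond_dim + 1, 1024),
            nn.GELU(),
            nn.Linear(1024, action_dim * chunk_len),
        )

    def forward(self, noisy_actions, cond, t):
        # noisy_actions: (B, chunk_len, action_dim); cond: (B, cond_dim); t: (B, 1)
        x = torch.cat([noisy_actions.flatten(1), cond, t], dim=-1)
        return self.net(x).view(-1, self.chunk_len, self.action_dim)


class DualSystemPolicy(nn.Module):
    def __init__(self, vlm_backbone, cond_dim: int = 512):
        super().__init__()
        self.vlm = vlm_backbone  # any image+text encoder returning (B, cond_dim) features
        self.head = WholeBodyActionHead(cond_dim=cond_dim)

    @torch.no_grad()
    def act(self, image, instruction, refine_steps: int = 8):
        cond = self.vlm(image, instruction)
        actions = torch.randn(cond.shape[0], self.head.chunk_len, self.head.action_dim)
        for k in range(refine_steps, 0, -1):
            t = torch.full((cond.shape[0], 1), k / refine_steps)
            actions = self.head(actions, cond, t)  # one refinement pass (noise schedule omitted)
        return actions  # (B, chunk_len, action_dim) whole-body joint targets
```

A production policy would add a proper noise schedule, action normalization, and a larger conditioning pathway; the sketch only shows where the fast action head sits relative to the slow VLM, and why its output dimensionality is set by the whole-body action space rather than a single arm.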

What Makes Humanoid Training Data Different from Manipulation Data?

Standard robot manipulation datasets capture single-arm or dual-arm tabletop tasks from fixed-base platforms. Humanoid policies need fundamentally different data: whole-body trajectories that include base locomotion, torso orientation, and coordinated arm movements. The HumanPlus project demonstrated that shadowing human demonstrations with a full-size humanoid enabled autonomous skill learning for tasks like donning a shoe and folding a shirt, but the approach required precise retargeting of human motion to the robot's kinematic structure. RoboCasa showed that even in simulated environments, generating realistic household task data requires modeling the full kinematic chain of a mobile manipulator including navigation, reaching, and bimanual coordination across diverse kitchen and living room layouts. The recurring challenge is that the action space for humanoids is 30-50+ degrees of freedom, versus 6-7 for a typical arm, making data collection exponentially more expensive and annotation more complex.

[3][4]
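
The dimensionality gap is easiest to see in the action representation itself. The sketch below contrasts a 7-DoF arm command with one plausible whole-body layout; the joint groupings and the 44-dimensional total are assumptions, not any specific robot's control interface.

```python
# Illustrative action-space layouts contrasting a 7-DoF arm command with one
# plausible whole-body humanoid command. The joint groupings and the 44-dim
# total are assumptions, not any specific robot's control interface.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ArmAction:
    joint_targets: np.ndarray = field(default_factory=lambda: np.zeros(7))  # 7-DoF arm
    gripper: float = 0.0                                                     # open/close


@dataclass
class WholeBodyAction:
    left_arm: np.ndarray = field(default_factory=lambda: np.zeros(7))
    right_arm: np.ndarray = field(default_factory=lambda: np.zeros(7))
    left_hand: np.ndarray = field(default_factory=lambda: np.zeros(6))
    right_hand: np.ndarray = field(default_factory=lambda: np.zeros(6))
    torso: np.ndarray = field(default_factory=lambda: np.zeros(3))           # pitch, roll, yaw
    legs: np.ndarray = field(default_factory=lambda: np.zeros(12))           # 6 DoF per leg
    base_velocity: np.ndarray = field(default_factory=lambda: np.zeros(3))   # vx, vy, yaw rate

    def as_vector(self) -> np.ndarray:
        # 7 + 7 + 6 + 6 + 3 + 12 + 3 = 44 dimensions per timestep
        return np.concatenate([self.left_arm, self.right_arm, self.left_hand,
                               self.right_hand, self.torso, self.legs,
                               self.base_velocity])
```

Every additional channel has to be demonstrated, synchronized, and annotated, which is where the collection cost compounds.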

How Do Current Datasets Fail Humanoid Generalization?

Open X-Embodiment aggregated over 1 million trajectories from 22 robot platforms, but the vast majority come from single-arm manipulators in lab environments. DROID provides 76,000 manipulation trajectories across 564 scenes, yet all data comes from fixed-base Franka Emika robots. Neither dataset contains locomotion, balance recovery, or loco-manipulation sequences. The 1X World Model Challenge released a dataset of real-world humanoid video data, but without paired action labels the data supports world model pre-training rather than direct policy learning. For labs building humanoid foundation models, this means public data can serve as a visual pre-training source but cannot provide the whole-body action supervision that policies require for deployment.

[5][6]
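
In practice this means a training corpus gets sorted by what supervision each episode can provide. A hedged sketch, assuming a generic episode metadata record with an action_keys field:

```python
# Sketch of sorting a heterogeneous corpus into (a) clips usable only for
# visual or world-model pre-training and (b) trajectories with whole-body
# action labels usable for direct policy supervision. The episode dict and
# its action_keys field are assumptions about a generic metadata record.
from typing import Dict, Iterable, List, Tuple

REQUIRED_ACTION_KEYS = {"arm_joints", "torso", "base_velocity"}


def split_corpus(episodes: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    pretrain_only, policy_ready = [], []
    for ep in episodes:
        if REQUIRED_ACTION_KEYS.issubset(set(ep.get("action_keys", []))):
            policy_ready.append(ep)   # paired whole-body action labels
        else:
            pretrain_only.append(ep)  # video-only or arm-only, pre-training signal at best
    return pretrain_only, policy_ready


# A fixed-base manipulation episode lands in pretrain_only because it has no
# torso or base action channels; a humanoid teleop episode is policy-ready.
episodes = [
    {"source": "fixed_base_arm", "action_keys": ["arm_joints", "gripper"]},
    {"source": "humanoid_teleop", "action_keys": ["arm_joints", "torso", "base_velocity"]},
]
pretrain, ready = split_corpus(episodes)
print(len(pretrain), len(ready))  # -> 1 1
```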

How Do Open Humanoid Datasets Compare to Custom Collection?

The table below compares existing open datasets relevant to humanoid robot training against Claru's custom collection. The key gaps in the open datasets are whole-body action labels, diverse real-world environments, and loco-manipulation task coverage.

Open X-Embodiment

Scale: 1M+ trajectories, 22 robot platforms
Tasks: Single-arm manipulation; pick-and-place, pushing, stacking
Environments: Research labs; standardized tabletop setups
Limitations: No humanoid data; no locomotion; fixed-base robots only; limited to short-horizon manipulation

DROID

Scale: 76K trajectories, 564 scenes
Tasks: Tabletop manipulation with Franka robots
Environments: 13 institutions; predominantly lab environments
Limitations: Single robot morphology (Franka); no whole-body or locomotion data; lab-centric

AgiBot World

Scale: 1M+ trajectories, 100+ scenes
Tasks: Mobile manipulation, navigation, household tasks
Environments: Indoor real-world: kitchens, living rooms, offices
Limitations: Single robot platform; fixed action representation; geographically constrained to specific regions

AMASS (Human Motion)

Scale: 40+ hours, 11,000+ motions
Tasks: Human motion capture; locomotion, gestures, interactions
Environments: Motion capture studios; controlled lighting
Limitations: No manipulation actions; no object interaction labels; studio environments only; requires motion retargeting to robot morphologies

Claru Custom

Scale: 386K+ video clips, ~500 contributors, 10K+ hours of game data
Tasks: Configurable: whole-body demonstrations, loco-manipulation, bimanual tasks, kitchen activities, workplace operations
Environments: Global real-world coverage; homes, workplaces, outdoor; 10+ workplace categories across multiple countries
Limitations: Requires engagement lead time (days to launch, 1-2 week calibration); not a public benchmark
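
As one way to picture the gap the table highlights, the sketch below shows an episode record carrying the fields most open datasets lack: whole-body action labels, environment metadata, and loco-manipulation task tags. The field names are illustrative, not a published schema.

```python
# One way to picture the missing pieces: an episode record that carries the
# fields the comparison above flags as absent from most open datasets.
# All field names are illustrative assumptions, not a published schema.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class WholeBodyEpisode:
    frames: np.ndarray           # (T, H, W, 3) egocentric or third-person RGB
    actions: np.ndarray          # (T, D) whole-body action vectors, D typically 30-50+
    task_tags: List[str]         # e.g. ["loco-manipulation", "bimanual"]
    task_phases: List[str]       # per-segment activity annotations
    environment: str             # e.g. "home_kitchen", "carpentry_workshop"
    region: str                  # geographic metadata for environment diversity
```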

Frequently Asked Questions

What data does Claru provide for humanoid robot training?

Claru provides three categories of data relevant to humanoid policies: (1) egocentric human demonstration videos capturing whole-body activities like cooking, cleaning, and workplace tasks from 500+ global contributors; (2) synchronized observation-action pairs with sub-16ms temporal alignment for direct policy training; and (3) structured activity annotations covering task phases, object interactions, and body part engagement. Data can be configured for specific humanoid morphologies and action space representations.
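
For a sense of what the sub-16ms alignment claim means operationally, here is a small sketch of a verification check; the timestamp arrays, units, and sensor rates are assumptions about a generic synchronized recording.

```python
# Hedged sketch of checking observation-action temporal alignment against a
# sub-16 ms tolerance. Timestamp arrays (in milliseconds) and sensor rates
# are assumptions about a generic synchronized recording.
import numpy as np


def check_alignment(obs_ts_ms: np.ndarray, action_ts_ms: np.ndarray,
                    tolerance_ms: float = 16.0) -> bool:
    """True if every action timestamp has an observation within tolerance_ms."""
    idx = np.searchsorted(obs_ts_ms, action_ts_ms)
    idx = np.clip(idx, 1, len(obs_ts_ms) - 1)
    nearest = np.minimum(np.abs(action_ts_ms - obs_ts_ms[idx - 1]),
                         np.abs(action_ts_ms - obs_ts_ms[idx]))
    return bool(np.all(nearest <= tolerance_ms))


# Example: a 30 Hz camera paired with a 50 Hz action stream stays within tolerance.
obs = np.arange(0.0, 1000.0, 1000.0 / 30.0)
act = np.arange(0.0, 1000.0, 20.0)
print(check_alignment(obs, act))  # -> True
```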

Why is human demonstration video useful for training humanoid policies?

Research from HumanPlus and GR00T N1 demonstrates that human demonstration video serves as both a pre-training signal and a direct supervision source for humanoid policies. Human videos capture the whole-body coordination patterns (reaching while walking, bimanual manipulation, balance adjustments) that humanoid robots must reproduce. Vision-language-action models can extract task structure and motion patterns from human demonstrations, then fine-tune on robot-specific action data for deployment.
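
A hedged sketch of that two-stage recipe follows. The data loaders and the video_objective method are placeholders introduced for illustration, not a specific published training setup.

```python
# Illustrative two-stage recipe: pre-train the visual backbone on human
# demonstration video, then fine-tune the whole-body action head on paired
# robot data. The loaders and the video_objective method are placeholders,
# not a specific published training setup.
import torch
import torch.nn.functional as F


def pretrain_on_human_video(policy, human_video_loader, optimizer, steps=10_000):
    """Stage 1: learn task structure from human clips (no robot action labels)."""
    for _, (frames, instruction) in zip(range(steps), human_video_loader):
        features = policy.vlm(frames, instruction)
        loss = policy.video_objective(features, frames)  # e.g. future-frame or latent-action prediction
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def finetune_on_robot_data(policy, robot_loader, optimizer, steps=5_000):
    """Stage 2: supervise the action head on robot-specific whole-body trajectories."""
    for _, (frames, instruction, actions) in zip(range(steps), robot_loader):
        predicted = policy(frames, instruction)
        loss = F.mse_loss(predicted, actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```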

Can collection be configured for a specific humanoid platform?

Yes. Claru configures collection pipelines to match specific robot morphologies and deployment environments. The structured activity taxonomy is co-developed with each research team through iterative revision cycles. Output formats, including action space representations, frame rates, and annotation schemas, are configured to match your model's training pipeline requirements.
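
A hypothetical request configuration illustrating those knobs; the field names are examples for discussion, not a documented Claru interface.

```python
# Hypothetical collection-request configuration showing the knobs described
# above (morphology, action representation, frame rate, annotation schema).
# Field names are examples for discussion, not a documented Claru interface.
collection_config = {
    "robot_morphology": {
        "degrees_of_freedom": 44,
        "end_effectors": ["left_hand", "right_hand"],
        "base": "bipedal",
    },
    "action_representation": "joint_position_targets",  # or end-effector poses
    "cameras": {"views": ["egocentric", "third_person"], "fps": 30},
    "annotation_schema": ["task_phase", "object_interaction", "body_part_engagement"],
    "environments": ["home_kitchen", "barista_station", "carpentry_workshop"],
    "delivery_format": "hdf5_episodes",
}
```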

Which environments does the data cover?

Claru collects across real-world environments that lab datasets underrepresent. The workplace program covers 10 categories including barista stations, carpentry workshops, tailoring studios, and phone repair shops across multiple countries. The egocentric pipeline spans homes, kitchens, outdoor spaces, and commercial environments. This diversity is critical for humanoid robots that must generalize across deployment settings rather than memorize a single lab layout.

Your next hire isn't a vendor. It's a data team.

Tell us what you're training. We'll scope the dataset.

Or email us directly at [email protected]


References

  1. [1] NVIDIA et al. "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." arXiv, 2025. Open VLA foundation model for humanoid robots trained on a heterogeneous data pyramid of real-robot trajectories, human videos, and synthetic data; dual-system architecture achieves superior manipulation results.
  2. [2] Figure AI. "Helix: A Vision-Language-Action Model for Generalist Humanoid Control." 2025. Full-body humanoid VLA combining a VLM backbone with latent action diffusion, trained on thousands of hours of teleoperated demonstrations for walking, picking, and placing in real-world environments.
  3. [3] Fu et al. "HumanPlus: Humanoid Shadowing and Imitation from Humans." arXiv, 2024. Demonstrated that autonomous humanoid skills, including wearing a shoe and folding a shirt, can be learned from real-time human shadowing demonstrations with motion retargeting.
  4. [4] Nasiriany et al. "RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots." arXiv, 2024. Large-scale simulation benchmark with 150+ kitchen layouts and 2,500+ 3D objects demonstrating that environment diversity dramatically improves policy generalization for household manipulation.
  5. [5] O'Brien et al. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv, 2024. Aggregated 1M+ trajectories from 22 robot platforms but predominantly features single-arm manipulation without locomotion or whole-body data.
  6. [6] Khazatsky et al. "DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset." arXiv, 2024. 76,000 robot manipulation trajectories across 564 scenes and 13 institutions; valuable for manipulation but limited to fixed-base Franka robots.