Training Data for Figure AI
Figure AI is building general-purpose humanoid robots backed by $2.6 billion in funding and an OpenAI partnership. Here is how real-world data accelerates its path from prototype to production.
About Figure AI
Figure AI is building general-purpose humanoid robots for commercial deployment. Founded in 2022 by Brett Adcock (serial entrepreneur, previously Vettery and Archer Aviation), Figure has raised over $2.6 billion at a $39 billion valuation as of its Series B in February 2025 — the largest funding round in robotics history. Investors include Microsoft, NVIDIA, OpenAI, Jeff Bezos, Intel, and Samsung. Figure 02, the company's second-generation humanoid, features 41 degrees of freedom, dexterous hands with 16 degrees of freedom per hand, and onboard AI powered by a custom vision-language-action model developed in collaboration with OpenAI.
Core Data Requirements
Dexterous Manipulation
Two-handed object manipulation with force feedback, contact-rich interactions, and tool use — captured via teleoperation or motion capture across diverse objects and environments.
Egocentric Activity Video
First-person video of humans performing warehouse, factory, and household tasks — the visual pretraining data that grounds Helix's language understanding in physical action.
Bipedal Locomotion
Real-world walking and balancing data across varied terrain, stairs, ramps, and cluttered floors with full body pose tracking and surface characterization.
Language-Action Pairs
Natural language task descriptions paired with corresponding manipulation and navigation demonstrations for training the Helix VLA model.
Known Data Requirements
Figure AI's push toward general-purpose humanoid intelligence demands massive quantities of real-world manipulation data, egocentric video of human activities, and diverse locomotion trajectories across unstructured environments. Their custom VLA model Helix — developed with OpenAI — requires grounded multimodal training data that pairs natural language instructions with physical actions in real environments. The 41-DOF action space of Figure 02 demands orders of magnitude more demonstration data than fixed-base manipulators, and the dexterous 16-DOF hands require fine-grained contact data that no simulation can faithfully reproduce.
Dexterous bimanual manipulation demonstrations
Source: Figure AI job postings for Manipulation Research Scientist and Teleoperation Engineer, 2024
High-quality teleoperated demonstrations of two-handed object manipulation tasks including pick-and-place, tool use, assembly sequences, and contact-rich interactions with full joint-state recordings from Figure 02's 16-DOF dexterous hands. Requires force and tactile feedback data to train policies that apply appropriate grip force across object types.
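To make the required log structure concrete, here is a minimal sketch of a per-timestep record for a teleoperated bimanual demonstration. The field names and array shapes are illustrative assumptions based on the description above, not Figure AI's or Claru's actual log format.

```python
from dataclasses import dataclass, field
import numpy as np

# Illustrative per-timestep record for a teleoperated bimanual demonstration.
# Field names and shapes are assumptions, not Figure AI's internal log format.
@dataclass
class BimanualFrame:
    timestamp: float              # seconds since episode start
    left_hand_q: np.ndarray       # (16,) finger joint positions, radians
    right_hand_q: np.ndarray      # (16,) finger joint positions, radians
    body_q: np.ndarray            # remaining arm/torso/leg joint positions
    fingertip_forces: np.ndarray  # (10, 3) per-fingertip force vectors, newtons
    rgb_frames: dict = field(default_factory=dict)  # camera name -> encoded image bytes
    annotation: str = ""          # optional phase label, e.g. "grasp", "insert"
```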
Egocentric human activity video for visual pretraining
Source: Figure 02 product announcement and Helix VLA architecture description
First-person video of humans performing warehouse, logistics, manufacturing, and household tasks — used to pretrain visual representations and world models that ground Helix's language understanding in physical activity. The VLA architecture requires visual features that encode manipulation-relevant semantics: object affordances, spatial relationships, hand-object contact patterns, and task progression.
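As a concrete illustration of what annotated egocentric video can look like, here is a hypothetical annotation record for a single clip, covering the activity label, temporal segments, object boxes, and hand-object contact spans mentioned above. The schema is an assumption, not a published Claru or Figure format.

```python
# Hypothetical annotation record for one egocentric clip. The schema is an
# assumption based on the labels described above, not a published format.
clip_annotation = {
    "clip_id": "warehouse_000417",
    "activity": "pick items from bin and place in tote",
    "segments": [
        {"start_s": 0.0, "end_s": 2.4, "step": "reach into bin"},
        {"start_s": 2.4, "end_s": 4.1, "step": "grasp item"},
        {"start_s": 4.1, "end_s": 6.0, "step": "place item in tote"},
    ],
    "objects": [
        {"label": "bin",  "frame": 10, "bbox_xyxy": [180, 120, 420, 310]},
        {"label": "tote", "frame": 10, "bbox_xyxy": [460, 200, 630, 340]},
    ],
    "hand_object_contact": [
        {"hand": "right", "object": "item", "start_s": 2.4, "end_s": 5.5},
    ],
}
```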
Whole-body locomotion trajectories across terrain
Source: Published research on sim-to-real transfer for humanoid walking (Radosavovic et al., 2024)
Motion capture and IMU data of humans walking on uneven surfaces, climbing stairs, navigating cluttered environments, and transitioning between walking and manipulation. Figure 02's whole-body coordination requires data that captures the coupling between locomotion and manipulation — how body posture adjusts when reaching, how gait changes when carrying objects.
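One simple example of how raw locomotion logs become usable supervision: a stance-phase detector over foot-mounted IMU data, labeling frames where the foot is in ground contact. The thresholds below are illustrative assumptions; real pipelines tune them per surface, which is part of why terrain diversity in the data matters.

```python
import numpy as np

GRAVITY = 9.81  # m/s^2

def stance_mask(accel: np.ndarray, gyro: np.ndarray,
                acc_tol: float = 1.0, gyro_tol: float = 0.5) -> np.ndarray:
    """Label stance frames from a foot-mounted IMU.

    accel, gyro: (T, 3) arrays of acceleration (m/s^2) and angular velocity
    (rad/s). A frame counts as stance when acceleration magnitude stays close
    to gravity and angular velocity is low -- a standard zero-velocity
    heuristic. Thresholds here are illustrative, not tuned values.
    """
    acc_norm = np.linalg.norm(accel, axis=1)
    gyro_norm = np.linalg.norm(gyro, axis=1)
    return (np.abs(acc_norm - GRAVITY) < acc_tol) & (gyro_norm < gyro_tol)
```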
Language-paired task demonstrations for Helix VLA
Source: Figure-OpenAI partnership announcement (2024) and Helix model description
Manipulation and locomotion demonstrations paired with natural language task descriptions for training the Helix vision-language-action model. Instructions range from simple ('pick up the box') to compositional ('move the red package from shelf B to the conveyor belt'). Requires diverse instruction phrasings for each task to improve language grounding robustness.
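A minimal sketch of what one language-paired episode record could look like, with several instruction phrasings attached to a single demonstration. The keys and file paths are hypothetical; only the example instructions echo the description above.

```python
# Hypothetical language-paired episode record; keys and paths are illustrative.
episode = {
    "episode_id": "ep_000317",
    "instructions": [  # multiple phrasings of the same task
        "move the red package from shelf B to the conveyor belt",
        "take the red box off shelf B and put it on the conveyor",
        "grab the red parcel from the B shelf and set it on the belt",
    ],
    "observations": "episodes/ep_000317/cameras_and_joint_states.hdf5",
    "actions": "episodes/ep_000317/actions_41dof.npy",
    "outcome": {"success": True, "duration_s": 18.2},
}
```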
Manufacturing and warehouse environment recordings
Source: BMW Spartanburg deployment partnership (2024)
Visual and spatial recordings of real manufacturing and warehouse environments — production lines, parts bins, conveyor systems, storage racks — to pretrain Figure 02's perception system on the visual distributions of actual deployment environments. Current data is limited to Figure's headquarters and BMW's single pilot facility.
How Claru Data Addresses These Needs
| Lab Need | Claru Offering | Rationale |
|---|---|---|
| Dexterous bimanual manipulation demonstrations | Manipulation Trajectory Dataset + Custom Dexterous Collection | Claru's manipulation trajectories capture multi-camera, multi-modal recordings of dexterous object interactions with precise temporal annotations. Custom collection campaigns using teleoperation rigs can produce the bimanual demonstrations Figure needs — two-handed coordination, tool use, and contact-rich assembly — across diverse real-world environments rather than a single lab. |
| Egocentric human activity video for visual pretraining | Egocentric Activity Dataset (~386K clips) | Purpose-collected first-person video of daily activities across 100+ real-world locations, annotated with activity labels, object bounding boxes, and temporal segments. This provides the visual pretraining corpus that Helix needs to ground language in physical activity — dramatically higher signal per frame than uncurated internet video. |
| Whole-body locomotion trajectories across terrain | Custom Locomotion Data Collection with Body-Worn Sensors | Claru can deploy collectors with body-worn IMU suites and egocentric cameras to capture locomotion data in target environments — warehouses, factories, outdoor terrain, stairways — across 100+ cities. Each location contributes unique surface characteristics and terrain variation. |
| Language-paired task demonstrations for Helix VLA | Custom Language-Paired Data Collection Campaigns | Claru's annotation pipeline pairs demonstrations with diverse natural language instructions written by human annotators who watch each full episode. Multiple phrasings per task provide the instruction diversity that VLA models need for robust language grounding — not the templated descriptions that scripted collection produces. |
| Manufacturing and warehouse environment recordings | Custom Environmental Recording Campaigns | Claru can coordinate multi-camera visual recordings across partner manufacturing and warehouse facilities, capturing the environmental distributions (lighting, layout, equipment, visual clutter) that Figure 02 will encounter in production deployments beyond BMW Spartanburg. |
Technical Data Analysis
Figure AI's approach to general-purpose humanoid robotics represents the highest-resource bet in the industry. With $2.6 billion in funding and partnerships with OpenAI, Microsoft, and NVIDIA, Figure has the capital to pursue the data-intensive path to humanoid intelligence. Their custom Helix VLA model — built in collaboration with OpenAI and leveraging OpenAI's language model expertise — follows the paradigm established by RT-2: co-train a vision-language model backbone on both web-scale data and robot demonstrations to produce a model that reasons about novel situations and translates language instructions into physical actions.
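For readers unfamiliar with the RT-2-style recipe, here is a minimal sketch of the action-tokenization step that lets a vision-language model emit robot commands as ordinary vocabulary tokens. The bin count, token offset, and function names are assumptions for illustration; this is not Figure's actual Helix implementation.

```python
import numpy as np

# Hypothetical RT-2-style action tokenization: each action dimension is
# discretized into N bins and emitted as extra vocabulary tokens, so a VLM
# can be co-trained on web data and robot episodes in one training loop.
N_BINS = 256
ACTION_TOKEN_OFFSET = 32_000  # assumed start of the action vocabulary

def action_to_tokens(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> list:
    """Map a continuous action vector to one discrete token per dimension."""
    normalized = np.clip((action - low) / (high - low), 0.0, 1.0)
    bins = (normalized * (N_BINS - 1)).round().astype(int)
    return [ACTION_TOKEN_OFFSET + int(b) for b in bins]

def tokens_to_action(tokens: list, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Inverse mapping used at inference time to decode predicted tokens."""
    bins = np.array(tokens) - ACTION_TOKEN_OFFSET
    return low + (bins / (N_BINS - 1)) * (high - low)

# A 41-DOF action becomes 41 tokens appended after the instruction tokens,
# letting web-scale vision-language batches and robot-demo batches share
# the same next-token training objective.
```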
The 41-degree-of-freedom action space of Figure 02 creates a fundamental data scaling challenge. A standard 7-DOF robotic arm needs demonstrations covering a 7-dimensional action space. Figure 02's full body — 41 joints including two 16-DOF dexterous hands — requires coverage of a 41-dimensional space. The curse of dimensionality means that achieving equivalent coverage requires exponentially more demonstrations. This is not a theoretical concern — it is the primary bottleneck Figure faces in scaling from impressive laboratory demonstrations to reliable commercial deployment.
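A back-of-envelope illustration of why coverage scales so badly with the dimensionality of the action space. The bin count is arbitrary, and real policies exploit structure rather than covering the space exhaustively, but the exponential gap is the point.

```python
# Back-of-envelope illustration: if each joint's range is discretized into
# k bins, the number of distinct coarse configurations grows as k**d.
# The numbers below are illustrative only, not Figure AI estimates.

def configuration_count(dof: int, bins_per_joint: int = 3) -> int:
    """Coarse count of joint configurations for a d-DOF action space."""
    return bins_per_joint ** dof

arm_7dof = configuration_count(7)    # 3**7  = 2,187
figure_02 = configuration_count(41)  # 3**41 ~= 3.6e19

print(f"7-DOF arm:   {arm_7dof:.3e} coarse configurations")
print(f"41-DOF body: {figure_02:.3e} coarse configurations")
# Policies never need exhaustive coverage, but the exponential gap is why
# demonstration volume, not model capacity, tends to be the binding
# constraint for whole-body humanoid control.
```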
The dexterous hands are particularly data-hungry. Each hand has 16 degrees of freedom allowing individuated finger control, precision pinch grasps, power grasps, and tool manipulation. The contact dynamics between robot fingers and real-world objects (rigid, deformable, fragile, slippery) are nearly impossible to simulate faithfully — friction, compliance, and texture vary across millions of object types in ways that no physics engine captures. Real-world manipulation data with force and tactile feedback from diverse objects is irreplaceable for training dexterous hand policies.
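As one example of the kind of supervision that only real force data provides, here is a hypothetical post-processing step that flags timesteps where a grasp is close to slipping, using a friction-cone check over logged normal and shear forces. The sensor channels, friction estimate, and margin are assumptions, not Figure 02 specifications.

```python
import numpy as np

# Hypothetical slip-risk labeling over logged fingertip forces. The friction
# coefficient and margin are illustrative assumptions, not measured values.
MU_ESTIMATE = 0.6  # assumed object/finger friction coefficient

def label_slip_risk(normal_force: np.ndarray, shear_force: np.ndarray,
                    margin: float = 0.8) -> np.ndarray:
    """Return a boolean mask marking frames near the friction-cone limit."""
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(normal_force > 0, shear_force / normal_force, np.inf)
    return ratio > margin * MU_ESTIMATE

# Labels like these turn raw force logs into supervision for grip-force
# policies -- exactly the contact signal that simulation rarely reproduces.
```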
Figure's BMW deployment at the Spartanburg, South Carolina manufacturing plant provides initial commercial validation but also exposes the generalization challenge. A policy trained at Spartanburg with specific tooling, part geometries, and environmental conditions must transfer to other manufacturing facilities with different configurations. Each new deployment site requires either massive retraining or initial training data diverse enough to cover the variation across sites. Claru's ability to collect manipulation data across dozens of industrial environments directly addresses this generalization requirement.
The sim-to-real transfer problem is particularly acute for humanoids. Locomotion policies trained purely in simulators such as Isaac Gym or MuJoCo often degrade on real hardware because of unmodeled ground contact dynamics, actuator backlash, and compliant surface interactions. Real-world walking data collected in actual warehouse and factory environments — with realistic floor textures, obstacles, and human co-workers — provides the distributional coverage that simulation alone cannot deliver.
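Where simulation is still used, one common mitigation is to co-train on simulated and real episodes so that real-world contact dynamics anchor the policy. The sketch below uses assumed names and an arbitrary mixing ratio; it illustrates the idea rather than any specific training pipeline.

```python
import random

# Minimal sketch of mixing real and simulated episodes during training.
# The 30% real fraction is an arbitrary illustrative choice.
REAL_FRACTION = 0.3

def sample_episode(real_episodes: list, sim_episodes: list):
    """Draw one training episode, preferring real data REAL_FRACTION of the time."""
    pool = real_episodes if random.random() < REAL_FRACTION else sim_episodes
    return random.choice(pool)
```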
Key Research & References
- [1] Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” CoRL 2023.
- [2] Cheng et al. “Expressive Whole-Body Control for Humanoid Robots.” RSS 2024.
- [3] Radosavovic et al. “Real-World Humanoid Locomotion with Reinforcement Learning.” Science Robotics, Vol. 9, 2024.
- [4] Kim et al. “OpenVLA: An Open-Source Vision-Language-Action Model.” arXiv:2406.09246, 2024.
- [5] Black et al. “π0: A Vision-Language-Action Flow Model for General Robot Control.” arXiv:2410.24164, 2024.
- [6] Open X-Embodiment Collaboration. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” ICRA 2024.
Frequently Asked Questions
What training data does Figure AI need?
Figure AI needs dexterous manipulation demonstrations from its 16-DOF hands, egocentric human activity video for visual pretraining, whole-body locomotion trajectories, and language-action paired data for training the Helix VLA model. The 41-DOF action space of Figure 02 requires significantly more demonstration data than traditional robotic arms: the curse of dimensionality means coverage requirements grow exponentially with the number of controlled joints.
What is Helix, and what data does it require?
Helix is Figure AI's custom vision-language-action model developed in collaboration with OpenAI. It follows the VLA paradigm: a vision-language model backbone processes camera images and natural language instructions, then generates robot actions as output tokens. OpenAI contributes language model expertise while Figure provides the robotics hardware and demonstration data pipeline. Training Helix requires robot demonstrations paired with diverse natural language instructions, the data type that determines the model's ability to follow novel commands.
Why isn't simulation enough?
Simulated humanoid policies often degrade on real hardware due to unmodeled ground contact dynamics, actuator backlash, compliant surface interactions, and the complexity of fingertip-to-object contact physics. Figure 02's dexterous 16-DOF hands are especially sensitive: the friction and compliance of real objects vary in ways that physics engines cannot faithfully model. Real-world data collected in actual deployment environments provides the distributional coverage that fills these simulation gaps.
How does Figure AI's funding compare with other humanoid robotics companies?
Figure AI has raised over $2.6 billion at a $39 billion valuation as of February 2025, making it the most heavily capitalized humanoid robotics company in history. For comparison, 1X Technologies has raised $125 million, Agility Robotics $179 million, and Skild AI approximately $1.4 billion. Figure's investor roster (Microsoft, NVIDIA, OpenAI, Jeff Bezos, Intel, Samsung) reflects the conviction that general-purpose humanoids represent one of the largest technology opportunities of the decade.
Can Claru collect this data for Figure AI?
Yes. Claru operates a global network of 10,000+ data collectors across 100+ cities who can capture teleoperated manipulation demonstrations, egocentric video, and motion capture data in target environments (warehouses, factories, commercial spaces) using standardized recording protocols. This distributed collection provides the environmental diversity and scale that single-lab operations cannot achieve.
Accelerate Figure AI's Data Pipeline
Talk to our team about purpose-built manipulation, egocentric, and locomotion datasets for humanoid robot training.