Training Data for Stanford SAIL
Stanford SAIL helped make robot learning broadly accessible through ALOHA and its contributions to Octo. Here is how diverse real-world data expands what open-source robot policies can do.
About Stanford AI Lab (SAIL)
Stanford's AI Lab houses multiple robotics groups, including Chelsea Finn's IRIS Lab (robot learning), Dorsa Sadigh's ILIAD Lab (human-robot interaction), and Jeannette Bohg's Interactive Perception and Robot Learning Lab (IPRL). SAIL researchers created ALOHA (low-cost bimanual teleoperation, roughly $20K) and Mobile ALOHA (a $32K whole-body system), and co-developed Octo (an open-source generalist robot policy) and BridgeData V2 (a shared manipulation dataset with 60,000+ demonstrations).
Known Data Requirements
Stanford SAIL's open-source approach to robot learning, exemplified by the ALOHA hardware, the Octo model, and the Bridge dataset, creates demand for large, diverse manipulation datasets that the community can build upon. Mobile ALOHA needs household task data from diverse homes. Bridge V2 needs contributions from geographically distributed environments. The ALOHA Unleashed collaboration with Google DeepMind pushes dexterity requirements further still.
Diverse bimanual manipulation data for ALOHA
Source: ALOHA paper (Zhao et al., RSS 2023) and Mobile ALOHA paper (Fu et al., CoRL 2024)
Bimanual teleoperated demonstrations across diverse tasks and environments to train policies on ALOHA-class low-cost hardware platforms. ALOHA Unleashed extends these requirements to complex dexterous tasks like repairing other robots.
Mobile manipulation in home environments
Source: Mobile ALOHA demonstrations of cooking, cleaning, and household tasks
Mobile manipulation recordings in real home environments — cooking, cleaning, organizing, cabinet operation — with whole-body (base + arms) coordinated demonstrations captured in diverse residential settings.
Multi-site data for Bridge dataset expansion
Source: Bridge V2 dataset (Walke et al., CoRL 2023) and cross-institution data sharing model
Manipulation data collected across many institutions and environments using Bridge-compatible formats to expand the dataset ecosystem beyond current US-based university contributors.
Language-paired manipulation demonstrations
Source: Octo's language-conditioned control and RT-2 integration research
Robot demonstrations paired with natural language task descriptions for training Octo and similar language-conditioned policies that follow verbal instructions. A sketch of one such record follows this list.
Household object interaction at scale
Source: Mobile ALOHA target of autonomous household assistance
Object interaction data spanning kitchen utensils, cleaning tools, food items, containers, and furniture — collected in real homes rather than laboratory mockups — with the variety of objects that actual households contain.
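To make the language-pairing requirement concrete, here is a minimal sketch of a single language-paired demonstration step in an RLDS-style nested layout. The field names (`image_primary`, `proprio`) are illustrative conventions borrowed from common Open X-Embodiment dataset layouts, not a fixed schema; only the `is_first`/`is_last`/`is_terminal` markers are standard RLDS step fields.

```python
# Hypothetical language-paired demonstration step in an RLDS-style layout.
# Field names are illustrative; is_first/is_last/is_terminal are the
# standard RLDS episode-boundary markers.
import numpy as np

step = {
    "observation": {
        "image_primary": np.zeros((256, 256, 3), dtype=np.uint8),  # scene camera frame
        "proprio": np.zeros(14, dtype=np.float32),                 # e.g. 2x7-DoF joint state
    },
    "action": np.zeros(14, dtype=np.float32),  # commanded joint targets for both arms
    "language_instruction": "fold the towel and place it on the shelf",
    "is_first": False,
    "is_last": False,
    "is_terminal": False,
}
```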
How Claru Data Addresses These Needs
| Lab Need | Claru Offering | Rationale |
|---|---|---|
| Diverse bimanual manipulation data for ALOHA | Manipulation Trajectory Dataset + Custom Bimanual Collection | Claru's manipulation data includes bimanual recordings. Custom collection campaigns with ALOHA-compatible recording formats can directly expand available training data for the ALOHA research community beyond Bay Area laboratory environments. |
| Mobile manipulation in home environments | Egocentric Activity Dataset + Custom Home Collection | Claru's egocentric video captures human household activities from first-person perspective in real homes across 100+ cities. Targeted mobile manipulation collection in diverse residential environments provides the visual variety Mobile ALOHA needs to generalize beyond Stanford kitchens. |
| Multi-site data for Bridge dataset expansion | Distributed Collection in Bridge-Compatible Formats | Claru can coordinate collection across multiple locations using Bridge V2-compatible RLDS recording formats, enabling seamless integration with the existing dataset ecosystem and dramatically expanding geographic coverage (see the recording sketch below this table). |
| Language-paired manipulation demonstrations | Custom Language-Paired Data Collection | Claru can coordinate collection campaigns where diverse tasks are performed with concurrent natural language narration, producing the language-action pairs that Octo and similar models need for instruction following. |
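As a sketch of what Bridge-compatible recording could look like on the producer side, DeepMind's open-source envlogger library logs dm_env-style interaction into RLDS-compatible storage (it also ships a TFDS backend for writing RLDS/TFDS shards directly). The environment and teleoperation hooks below are hypothetical placeholders; whether a given collection rig exposes a dm_env interface is an assumption.

```python
# Sketch: logging teleoperated episodes with envlogger
# (github.com/google-deepmind/envlogger). TeleopEnv and teleop_action are
# hypothetical stand-ins for an ALOHA-style rig exposing the dm_env interface.
import envlogger

env = TeleopEnv()   # hypothetical dm_env.Environment wrapping the robot
num_episodes = 50   # demonstrations to record in this session

with envlogger.EnvLogger(env, data_directory="/data/bridge_contrib") as env:
    for _ in range(num_episodes):
        timestep = env.reset()
        while not timestep.last():
            action = teleop_action()   # hypothetical: operator input from leader arms
            timestep = env.step(action)
```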
Technical Data Analysis
Stanford SAIL's robotics groups have collectively shaped the modern landscape of accessible robot learning. The ALOHA system, a roughly $20K bimanual teleoperation rig built by Tony Zhao and collaborators under Chelsea Finn's supervision, demonstrated that graduate students can collect high-quality manipulation data without million-dollar hardware. Mobile ALOHA, developed with Zipeng Fu and costing $32K including onboard power and compute, extended this to whole-body household tasks: the robot learned to sauté shrimp, clean wine spills, push in chairs, and call elevators. ALOHA Unleashed, a collaboration with Google DeepMind, pushed dexterity further by training robots to repair other robots and perform tasks requiring fine motor control.
Octo, a cross-institution effort that SAIL researchers co-developed, provides an open-source generalist policy that anyone can fine-tune. BridgeData V2 established a cross-institution data-sharing standard, with 60,000+ demonstrations stored in the RLDS (Reinforcement Learning Datasets) format. Together, these projects form an ecosystem where hardware is cheap, software is open, and data is shared, democratizing robot learning in ways that previously required corporate-scale resources.
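On the consumer side, here is a minimal sketch of loading RLDS-formatted Bridge episodes with tensorflow_datasets, assuming the Google Cloud Storage path published with the Open X-Embodiment release; verify the current dataset location and version before relying on it.

```python
# Sketch: iterating RLDS episodes from BridgeData V2 with tensorflow_datasets.
# The GCS path follows the Open X-Embodiment release and may change.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("gs://gresearch/robotics/bridge/0.1.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):           # RLDS: each element is one full episode
    for step in episode["steps"]:    # nested dataset of per-timestep records
        obs = step["observation"]    # camera images, robot state, etc.
        action = step["action"]
```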
This accessibility-focused research philosophy creates a specific data bottleneck: the capability and robustness of policies trained on ALOHA hardware are directly limited by the diversity of environments where ALOHA data has been collected. Most ALOHA data comes from a handful of university labs in the San Francisco Bay Area: the same kitchens, the same tables, the same lighting. A policy trained to fold towels in a Stanford lab kitchen overfits to those specific visual features and fails in a kitchen with different countertops, cabinet styles, or lighting conditions.
The Bridge dataset faces the same challenge at larger scale. While Bridge V2 aggregated data from multiple institutions, the contributing labs are primarily in the western United States and have similar laboratory setups. Expanding Bridge to include data from diverse global environments — different architectural styles, object categories, cultural objects, and regional home designs — would significantly improve the generalization of models like Octo that are trained on it.
Claru's distributed collection network addresses this gap directly. By coordinating ALOHA-compatible and Bridge-compatible data collection across 100+ cities with diverse home environments, Claru can provide the environmental variety that the Stanford ecosystem needs to train policies that work outside the Bay Area. Each location contributes unique visual characteristics — different countertop materials, cabinet styles, lighting conditions, household objects — that make trained policies more robust. The data can be formatted in RLDS for direct compatibility with Bridge V2 and Octo training pipelines.
Key Research & References
- [1] Zhao, T.Z., et al. “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” RSS, 2023. Link
- [2] Fu, Z., Zhao, T.Z., and Finn, C. “Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation.” CoRL, 2024. Link
- [3] Octo Model Team. “Octo: An Open-Source Generalist Robot Policy.” RSS, 2024. Link
- [4] Walke, H., et al. “BridgeData V2: A Dataset for Robot Learning at Scale.” CoRL, 2023. Link
- [5] Zhao, T.Z., et al. (with Google DeepMind). “ALOHA Unleashed: A Simple Recipe for Robot Dexterity.” arXiv, 2024. Link
- [6] Chi, C., et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS, 2023. Link
Frequently Asked Questions
Why does ALOHA need geographically diverse training data?
Most ALOHA data comes from university labs in the San Francisco Bay Area. Policies trained on this data overfit to Bay Area aesthetics (specific countertop materials, kitchen layouts, and lighting conditions) and fail in environments with different visual characteristics. Data from diverse homes and kitchens across different regions produces policies that generalize more broadly.
What is Bridge V2, and how can it be expanded?
Bridge V2 is a standardized manipulation dataset aggregated from multiple research institutions using the RLDS format, containing 60,000+ demonstrations. It established protocols for cross-institution data sharing. Expansion requires data collection from new, geographically diverse environments using Bridge-compatible recording formats — exactly what distributed collection enables.
What is Octo, and why does training-data diversity matter for it?
Octo is an open-source generalist robot policy designed to be fine-tuned for specific tasks. The quality of Octo's pretrained representations depends on training data diversity. More environments, tasks, and object categories in pretraining produce representations that require less fine-tuning data to adapt to new deployments and unseen conditions.
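For orientation, here is a minimal sketch of querying a pretrained Octo checkpoint, following the usage shown in the public octo repository README; checkpoint names, observation shapes, and dataset-statistics keys track the repo and may differ between releases.

```python
# Sketch: sampling actions from a pretrained Octo checkpoint, per the
# octo-models/octo README. Shapes and the "bridge_dataset" statistics key
# are assumptions to check against the installed version.
import jax
import numpy as np
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base-1.5")
task = model.create_tasks(texts=["pick up the towel"])  # language-conditioned task

observation = {
    # (batch, window, H, W, C); 256x256 primary camera per the repo's examples
    "image_primary": np.zeros((1, 2, 256, 256, 3), dtype=np.uint8),
    "timestep_pad_mask": np.ones((1, 2), dtype=bool),
}
actions = model.sample_actions(
    observation,
    task,
    unnormalization_statistics=model.dataset_statistics["bridge_dataset"]["action"],
    rng=jax.random.PRNGKey(0),
)
```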
What is ALOHA Unleashed?
ALOHA Unleashed is a collaboration between Stanford (Tony Zhao, Chelsea Finn) and Google DeepMind that pushes ALOHA's dexterity further. The project demonstrated robots performing highly dexterous tasks, including repairing other robots, showing that scaling data collection and training compute can unlock manipulation capabilities previously considered beyond imitation learning.
How does Mobile ALOHA differ from the original ALOHA?
The original ALOHA is a stationary bimanual system ($20K) for tabletop manipulation. Mobile ALOHA ($32K) adds a mobile base and whole-body teleoperation, enabling the robot to navigate through environments while performing bimanual tasks such as cooking, cleaning, and calling elevators. This whole-body capability requires training data from full-room or full-home environments rather than just tabletops.
Expand the Open Robot Learning Ecosystem
Discuss diverse, Bridge-compatible manipulation data for Stanford SAIL's robot learning research.