TOTO Benchmark Alternative: Production Data for Real-World Robot Deployment

TOTO (Train Offline, Test Online) provides a benchmark for evaluating robot learning methods on out-of-distribution generalization across manipulation tasks. But a benchmark for measuring generalization is not a data source for achieving it. Compare TOTO with Claru's production-grade data collection.

TOTO Profile

Institution

Carnegie Mellon University / Meta AI

Year

2023

Scale

~1,000 demonstrations per task across 6+ manipulation tasks, with pre-computed representations from 5+ visual encoders

License

MIT License

Modalities
RGB images (2 camera viewpoints)
Proprioception (joint positions, velocities, end-effector pose)
Action labels
Pre-computed visual embeddings (R3M, MVP, CLIP, ResNet, MAE)

How Claru Helps Teams Beyond TOTO

TOTO addresses a critical question in robot learning: which visual representations generalize best under distribution shift? Its controlled evaluation framework, with pre-computed embeddings from R3M, MVP, CLIP, and other backbones, provides a rigorous method for answering this question. But the question itself is only the first step. Once you know which representation to use, you need the data that turns that representation into a deployed policy that handles real-world variability.

This is where Claru comes in. TOTO measures OOD robustness on controlled, one-dimensional shifts in a single lab. Real deployment demands robustness across simultaneous, uncontrolled, multi-dimensional variation in environments that look nothing like a CMU research lab. Claru provides that robustness by collecting demonstrations across the full range of conditions your robot will encounter: different times of day, different environmental configurations, different object instances, different operator interactions.

Our data transforms out-of-distribution conditions into in-distribution training data, which is fundamentally more effective than hoping a representation will generalize to conditions it has never seen. We deliver in standard formats with the multi-modal sensor coverage that production manipulation requires, giving you the data foundation to move from benchmark evaluation to real-world deployment.

What Is TOTO?

TOTO (Train Offline, Test Online) is a benchmark dataset for evaluating robot learning algorithms, developed by researchers at Carnegie Mellon University (CMU) and Meta AI. Published in 2023, TOTO focuses specifically on measuring how well pretrained representations enable out-of-distribution generalization -- testing whether a model trained on one set of conditions can perform under different conditions (new object positions, new objects, new lighting, etc.).

The benchmark consists of demonstrations collected on a Franka Emika Panda arm across multiple manipulation tasks including pushing, picking, stacking, and placing objects. What distinguishes TOTO from generic manipulation datasets is its systematic variation of conditions between training and test splits. For each task, the training set uses one distribution of object positions, lighting conditions, and object instances, while the evaluation set deliberately shifts these conditions. This structured train/test split enables rigorous measurement of generalization capability.
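The structured train/test split described above can be pictured with a small sketch. The condition names and values here are hypothetical illustrations, not TOTO's actual split definitions:

```python
# Hypothetical sketch of a TOTO-style structured split: each condition
# dimension can be shifted between train and eval so generalization along
# that dimension is measured in isolation.

TRAIN_CONDITIONS = {
    "object_position": "left half of workspace",
    "lighting": "overhead lamp on",
    "object_instance": "mug_a",
}

EVAL_SHIFTS = {
    "object_position": "right half of workspace",  # positional shift
    "lighting": "overhead lamp off",               # lighting shift
    "object_instance": "mug_b",                    # instance shift
}

def make_eval_conditions(shift_dim):
    """Shift exactly one dimension, keeping the rest in-distribution."""
    conditions = dict(TRAIN_CONDITIONS)
    conditions[shift_dim] = EVAL_SHIFTS[shift_dim]
    return conditions

print(make_eval_conditions("lighting"))
```

Evaluating a policy under each single-dimension shift in turn is what lets the benchmark attribute a performance drop to one specific kind of distribution change.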

TOTO provides approximately 1,000 demonstrations per task, collected via teleoperation. Each demonstration includes multi-view RGB images from two cameras, proprioceptive state (joint positions, velocities, end-effector pose), and action labels. The benchmark includes pre-computed visual representations from multiple pretrained encoders (R3M, MVP, CLIP, ResNet, MAE) so researchers can directly compare which visual backbone produces the best downstream policy performance under distribution shift.
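As a rough illustration of what one step of such a demonstration contains, a record might be modeled like this. Field names are assumptions for illustration, not TOTO's actual on-disk schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical single-timestep record of a TOTO-style demonstration.
# Field names are illustrative; consult the benchmark's release for
# the real layout.
@dataclass
class DemoStep:
    rgb_cam0: bytes               # encoded RGB frame, camera view 0
    rgb_cam1: bytes               # encoded RGB frame, camera view 1
    joint_positions: List[float]  # 7-DoF Franka arm joint angles
    joint_velocities: List[float]
    ee_pose: List[float]          # end-effector position + orientation
    action: List[float]           # commanded action label
    # Optional pre-computed embeddings, e.g. {"r3m": [...], "clip": [...]}
    embeddings: dict = field(default_factory=dict)
```

Shipping the embeddings alongside the raw frames is what lets researchers swap visual backbones without re-running feature extraction.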

The dataset is released under the MIT License and was designed as a standardized evaluation protocol rather than a large-scale pretraining resource. TOTO's primary contribution is its experimental framework for comparing representation learning methods, not its raw data volume. It has been cited in research on visual representation learning for robotics and out-of-distribution robustness.

TOTO at a Glance

~1,000 demos per task
6+ manipulation tasks
2 camera views
5+ pre-computed representations
MIT license
1 robot (Franka Panda)

TOTO vs. Claru: Side-by-Side Comparison

A comparison for teams that need data addressing the out-of-distribution challenges TOTO measures.

Dimension | TOTO | Claru
Primary Purpose | Benchmark for evaluating OOD generalization | Training data for production deployment
Scale | ~1,000 demos per task (a few thousand total) | 1K to 1M+ demos, scoped to your needs
Robot Platform | Franka Panda in CMU lab | Any physical robot you deploy
Environmental Variation | Controlled train/test splits with systematic OOD shifts | Organic variation from real deployment environments
Task Coverage | 6+ basic manipulation tasks (push, pick, stack, place) | Custom tasks matching your production requirements
Sensor Modalities | RGB (2 views) + proprioception | RGB + depth + force/torque + proprioception + tactile
Pre-Computed Features | R3M, MVP, CLIP, ResNet, MAE embeddings included | Raw data -- use your own feature extraction pipeline
OOD Coverage | Structured: position, lighting, object instance shifts | Comprehensive: natural environmental variability across all dimensions
License | MIT License | Commercial license with IP assignment
Ongoing Collection | Static benchmark release | Continuous collection and expansion

Key Limitations of TOTO for Production Use

TOTO is a benchmark, not a training dataset. Its approximately 1,000 demonstrations per task are designed to evaluate whether a representation enables generalization, not to provide sufficient data for training a production-capable policy. The data volume is intentionally kept small so that the contribution of the representation (not the data scale) can be isolated. Production policies typically require 5-50x more demonstrations per task to achieve deployment-level reliability.

The out-of-distribution shifts in TOTO are controlled and limited. TOTO varies object positions, lighting, and object instances between train and test splits, but real-world deployment involves many more dimensions of variation simultaneously: different times of day, different seasons, different users, different background clutter, sensor degradation over time, and environmental changes that accumulate continuously. TOTO's structured OOD shifts test one dimension at a time; real deployment tests all dimensions simultaneously.
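The gap between one-at-a-time and simultaneous variation grows combinatorially. A toy calculation with four hypothetical variation dimensions of three values each makes the point:

```python
from itertools import product

# Hypothetical variation dimensions; real deployments have many more.
dims = {
    "time_of_day": ["morning", "noon", "evening"],
    "lighting": ["bright", "dim", "mixed"],
    "object_instance": ["a", "b", "c"],
    "background": ["clear", "cluttered", "changing"],
}

# One-at-a-time shifts (TOTO-style): vary one dimension, hold the rest
# at their training values. One baseline + (v - 1) shifts per dimension.
one_at_a_time = 1 + sum(len(values) - 1 for values in dims.values())

# Simultaneous variation (deployment): any combination can occur.
simultaneous = len(list(product(*dims.values())))

print(one_at_a_time, simultaneous)  # 9 vs 81
```

Even in this tiny example, structured single-dimension shifts probe 9 condition settings while deployment exposes the policy to 81; the ratio widens rapidly as dimensions are added.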

TOTO uses a single Franka Panda arm in a single laboratory at CMU. The camera positions, lighting rig, table surface, and workspace geometry are specific to that installation. Teams deploying different robots in different environments cannot directly benefit from TOTO's demonstrations -- the data is too specific to its collection environment to serve as general-purpose training data.

The benchmark lacks depth, force/torque, and tactile sensor data. TOTO provides RGB images and proprioception, which is sufficient for the representation evaluation it was designed for, but insufficient for training policies that must handle contact-rich manipulation. Many production tasks (insertion, packing, assembly, tool use) require haptic feedback that TOTO does not capture.

TOTO's task set is limited to basic manipulation primitives (pushing, picking, stacking, placing). Production deployments require complex, multi-step tasks with sequencing, conditional logic, and error recovery that these basic benchmarks do not evaluate.

When to Use TOTO vs. Commercial Data

TOTO is the right tool for a specific research question: which visual representation enables the best out-of-distribution generalization for robot manipulation? If you are developing or comparing visual encoders, self-supervised learning methods, or representation learning approaches for robotics, TOTO provides the controlled experimental framework to measure OOD performance rigorously. Its pre-computed embeddings from multiple encoders make this comparison especially efficient.

TOTO is also useful as a sanity check during development. Before investing in large-scale data collection, you can use TOTO to verify that your visual backbone generalizes across at least basic distribution shifts. If your method fails on TOTO's controlled variations, it will certainly fail on the uncontrolled variations of real deployment.

Move to Claru when your goal shifts from measuring generalization to achieving it. Real-world robustness comes not from a clever representation but from data that covers the variability your policy will encounter in production. Claru collects demonstrations across the natural variations in your deployment environment -- different times of day, different object configurations, different operators, different environmental states -- providing the data diversity that teaches policies genuine real-world robustness.

How Claru Complements TOTO

TOTO identifies which representations enable generalization; Claru provides the data that exercises those representations in production conditions. After using TOTO to select your visual backbone, Claru supplies the large-scale, diverse, real-world demonstrations needed to train a robust policy on top of that backbone.

Where TOTO creates OOD conditions through controlled lab-setting variations, Claru captures organic OOD conditions by collecting demonstrations across the natural variability of your deployment environment. Different lighting throughout the day, different object placements by different workers, different background states -- these real-world variations are more complex and comprehensive than any controlled benchmark can simulate.

Claru also extends beyond TOTO's sensor coverage. We collect synchronized RGB-D, force/torque, proprioception, and optional tactile data, enabling policies that combine the visual representations TOTO evaluates with the multi-modal inputs that production manipulation requires.

Data is delivered in RLDS, HDF5, zarr, or LeRobot format with standardized schemas. For teams that have used TOTO to validate their representation learning approach, Claru provides the production data layer that turns a research finding into a deployed capability.
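As an illustration of what a standardized episode schema can look like, here is a minimal RLDS-style sketch. The keys and layout are hypothetical and chosen for readability; they are not Claru's delivery schema or the exact RLDS specification:

```python
# Hypothetical RLDS-style episode: a list of steps, each pairing a
# multi-modal observation with an action, plus episode-level metadata.
episode = {
    "steps": [
        {
            "observation": {
                "rgb": "frame_000.png",      # reference to an image asset
                "depth": "depth_000.npy",    # aligned depth frame
                "wrist_ft": [0.0] * 6,       # force/torque (Fx..Tz)
                "joint_pos": [0.0] * 7,      # proprioception
            },
            "action": [0.0] * 7,
            "is_terminal": False,
        },
    ],
    "metadata": {"robot": "franka_panda", "task": "pick_place"},
}

def modalities(ep):
    """List the observation modalities present in an episode."""
    return sorted(ep["steps"][0]["observation"].keys())

print(modalities(episode))
```

A fixed, documented key layout like this is what allows the same training pipeline to consume episodes regardless of which robot or site produced them.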

References

  [1] Zhou et al. "Train Offline, Test Online: A Real Robot Learning Benchmark." arXiv, 2023.
  [2] Nair et al. "R3M: A Universal Visual Representation for Robot Manipulation." CoRL, 2022.
  [3] Radosavovic et al. "Real-World Robot Learning with Masked Visual Pre-training." CoRL, 2022.
  [4] Chi et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS, 2023.

Frequently Asked Questions

Is TOTO a dataset or a benchmark?

TOTO is primarily a benchmark -- an evaluation framework for comparing how well different visual representations enable out-of-distribution generalization in robot manipulation. While it includes demonstration data, the scale (~1,000 demos per task) is designed for evaluation, not for training production policies. Large-scale training data requires a service like Claru.

Is TOTO free to use?

Yes, TOTO is released under the MIT License. However, its small scale and single-lab data make it impractical as a training resource for production systems. It serves best as a research evaluation tool.

Which pre-computed visual representations does TOTO include?

TOTO includes pre-computed embeddings from R3M, MVP, CLIP, ResNet, and MAE. This allows researchers to directly compare how each visual backbone performs under distribution shift without re-running feature extraction. The results help guide representation selection for downstream policy training.

How do TOTO's OOD shifts differ from real-world deployment conditions?

TOTO creates controlled out-of-distribution conditions by systematically varying one factor (object position, lighting, or object instance) between train and test splits. Real deployment involves simultaneous, continuous variation across many factors. TOTO measures whether a representation can handle isolated shifts; real robustness requires handling all shifts at once, which demands diverse training data from actual deployment conditions.

Can commercial data collection provide the robustness that TOTO measures?

Yes, but through a different mechanism than TOTO evaluates. Claru achieves robustness by collecting demonstrations across the natural variability of your deployment environment -- different lighting conditions, object configurations, and environmental states -- so that these variations become part of your training distribution rather than out-of-distribution conditions.

Build Real Robustness, Not Just Benchmark Scores

Get diverse real-world demonstrations that cover the variability your robot will face in production. Turn TOTO-validated representations into deployed policies.