Real-World Data for Colosseum
Colosseum reveals that VLA models fail under systematic visual perturbations. Diverse real-world data builds the per-axis robustness these models lack.
Colosseum at a Glance
Perturbation Axes
Colosseum's 12 perturbation axes each isolate one environmental change, revealing which factors cause the largest performance drops for VLA models.
| Perturbation Axis | What Changes | Typical VLA Success-Rate Drop | Data Implication |
|---|---|---|---|
| Lighting Color | White light replaced with colored (red, blue, green) | 30-50% | Train under varied lighting temperatures and hues |
| Lighting Intensity | Brightness increased or decreased significantly | 15-25% | Include dim and bright environments in collection |
| Table Texture | Workspace surface material changed | 10-30% | Collect on wood, metal, fabric, and patterned surfaces |
| Distractor Objects | Irrelevant objects placed on the workspace | 20-40% | Include cluttered environments with non-task objects |
| Camera Viewpoint | Camera viewpoint shifted by 5-10 degrees | 15-30% | Collect from multiple camera angles per task |
| Object Color | Target object color changed | 10-25% | Use diverse object instances, not single exemplars |
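The table reports drops as percentages without stating whether they are absolute percentage points or relative to nominal success. The Python sketch below computes both readings for a set of per-axis success rates, then ranks the axes by relative drop. The helper name `success_drop` and every number in the example are hypothetical illustrations, not results from the benchmark.

```python
def success_drop(nominal: float, perturbed: float) -> dict:
    """Return the drop in success rate as percentage points and as a relative percentage."""
    if not 0.0 < nominal <= 1.0 or not 0.0 <= perturbed <= 1.0:
        raise ValueError("success rates must lie in [0, 1] and nominal must be > 0")
    absolute_pp = (nominal - perturbed) * 100            # percentage points lost
    relative_pct = (nominal - perturbed) / nominal * 100  # share of nominal success lost
    return {"absolute_pp": round(absolute_pp, 1), "relative_pct": round(relative_pct, 1)}


# Hypothetical success rates, for illustration only.
nominal = 0.80
perturbed = {"lighting_color": 0.48, "distractors": 0.60, "camera_viewpoint": 0.64}

drops = {axis: success_drop(nominal, rate) for axis, rate in perturbed.items()}

# Rank axes from most to least damaging by relative drop.
ranked = sorted(drops, key=lambda axis: drops[axis]["relative_pct"], reverse=True)

for axis in ranked:
    print(axis, drops[axis])
```

Reporting both figures avoids ambiguity: a fall from 80% to 48% success is 32 percentage points but a 40% relative drop.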
VLA Model Robustness Comparison
How leading VLA architectures perform on Colosseum's perturbation axes, based on published results.
| Model | Nominal Success | Lighting Robustness | Distractor Robustness | Overall Degradation |
|---|---|---|---|---|
| RT-2 | High | Low (30-50% drop) | Medium (20-30% drop) | Significant |
| Diffusion Policy | Medium-High | Medium (20-30% drop) | Low (30-40% drop) | Significant |
| OpenVLA | Medium | Medium (15-30% drop) | Medium (20-35% drop) | Moderate-Significant |
| Octo | Medium | Medium (20-35% drop) | Medium (15-30% drop) | Moderate |
Benchmark Profile
Colosseum is a benchmark designed to evaluate the robustness of vision-language-action (VLA) models under systematic environmental perturbations. Created by Pumacay et al. and presented at RSS 2024, it tests how robot manipulation policies degrade when environmental conditions change across 12 independent perturbation axes — lighting, textures, table colors, distractor objects, and camera viewpoints — using a real WidowX 250 robot arm.
The Sim-to-Real Gap
Colosseum is unusual among benchmarks because it runs on real hardware, not simulation. However, its controlled perturbations still underrepresent real-world variability. A real deployment faces simultaneous changes in lighting, backgrounds, distractor objects, and camera drift — not one perturbation at a time. Additionally, Colosseum's perturbations are applied in a controlled lab setting and cannot capture outdoor lighting variation, surface contamination, or the visual complexity of unstructured real environments.
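A simple count shows how much larger the space of simultaneous changes is than the one-at-a-time case. The Python sketch below assumes each of the 12 axes is simply on or off, which is a simplification because real axes take many values, and counts the distinct perturbed conditions when up to k axes change together. `count_conditions` is a hypothetical helper, and the axis names are taken from the table above.

```python
from itertools import combinations
from math import comb


def count_conditions(n_axes: int, max_simultaneous: int) -> int:
    """Count non-empty sets of perturbed axes with at most `max_simultaneous` axes active.

    Assumes every axis is binary (perturbed or not), which understates real variability.
    """
    return sum(comb(n_axes, k) for k in range(1, max_simultaneous + 1))


N_AXES = 12  # number of axes stated in the text

one_at_a_time = count_conditions(N_AXES, 1)       # 12 single-axis conditions
up_to_pairs = count_conditions(N_AXES, 2)         # 12 singles + 66 pairs = 78
any_combination = count_conditions(N_AXES, N_AXES)  # 2**12 - 1 = 4095

# Enumerating the pairs for the six axes named in the table gives 15 combinations.
named_axes = [
    "lighting_color", "lighting_intensity", "table_texture",
    "distractor_objects", "camera_viewpoint", "object_color",
]
named_pairs = list(combinations(named_axes, 2))

print(one_at_a_time, up_to_pairs, any_combination, len(named_pairs))
```

Even under the binary assumption, single-axis testing covers 12 of 4095 possible combinations, which is why combined variation has to come from the training data rather than from exhaustive evaluation.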
Real-World Data Needed
Robustness to the perturbations Colosseum measures requires training data collected under extreme visual diversity: many different lighting conditions, backgrounds, table surfaces, distractor objects, and camera angles. Uncontrolled real-world environments naturally provide this diversity at a scale that controlled perturbation experiments cannot match.
Complementary Claru Datasets
Egocentric Activity Dataset
Collected across 100+ real-world locations with naturally varying lighting, backgrounds, and visual conditions. This is the kind of visual diversity needed to train policies that stay robust along Colosseum's perturbation axes.
Manipulation Trajectory Dataset
Real-world manipulation recordings in diverse, uncontrolled environments provide training data with authentic visual variation rather than synthetic or lab-controlled perturbations.
Custom Visual Diversity Collection
Purpose-collected manipulation data explicitly varying lighting, surface material, background, and distractor density to address each of Colosseum's 12 perturbation axes with real-world instances.
Bridging the Gap: Technical Analysis
Colosseum fills a critical evaluation gap by measuring visual robustness on real hardware. Most benchmarks test task success under nominal conditions only. Colosseum tests whether performance holds when one environmental factor changes at a time, and the results expose a fundamental weakness in current VLA architectures.
The benchmark's findings are sobering. RT-2-based models see success rates fall by 30-50% when lighting color changes from white to colored. Diffusion Policy variants drop 20-40% with distractor objects on the table. Even small camera viewpoint shifts of 5-10 degrees degrade performance by 15-30% for models that appeared robust under nominal evaluation. OpenVLA and Octo show similar sensitivity, and no current architecture demonstrates consistent robustness across all perturbation axes.
What makes these results actionable is the per-axis decomposition. Colosseum reveals that lighting robustness and distractor robustness are largely independent failure modes — a policy can be robust to lighting changes while being fragile to distractors. This implies that training data must cover each axis of variation independently, not just increase overall diversity.
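One way to act on the per-axis view is to audit a dataset's metadata axis by axis instead of looking at a single overall diversity figure. The Python sketch below assumes each episode carries a small metadata record; the field names (`lighting`, `surface`, `distractors`), the helper names, and the threshold are all hypothetical and would need to match whatever a real dataset records.

```python
from collections import Counter

# Hypothetical per-episode metadata, for illustration only.
episodes = [
    {"lighting": "warm", "surface": "wood", "distractors": 0},
    {"lighting": "cool", "surface": "wood", "distractors": 3},
    {"lighting": "dim", "surface": "wood", "distractors": 0},
    {"lighting": "warm", "surface": "wood", "distractors": 5},
]


def axis_coverage(episodes: list, axes: list) -> dict:
    """Count how often each value appears, separately for every axis."""
    return {axis: Counter(ep[axis] for ep in episodes) for axis in axes}


def under_covered(coverage: dict, min_values: int = 2) -> list:
    """Return axes that show fewer than `min_values` distinct values."""
    return [axis for axis, counts in coverage.items() if len(counts) < min_values]


coverage = axis_coverage(episodes, ["lighting", "surface", "distractors"])
gaps = under_covered(coverage)

print(coverage)
print("axes needing more variation:", gaps)
```

In this toy example lighting and distractor count both vary, yet every episode uses the same surface, so the audit flags `surface` even though the dataset looks diverse overall.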
The benchmark also reveals that data quantity alone does not solve robustness. RT-2 was trained on massive internet-scale data but still shows significant drops under Colosseum's perturbations. The issue is the distribution of that data — web images do not proportionally represent the specific perturbation axes that matter for manipulation. Purpose-collected real-world manipulation data under controlled diversity can be more efficient than uncurated web-scale data for building robust policies.
Claru's egocentric activity dataset, collected across 100+ cities in naturally varying conditions, provides precisely this kind of structured visual diversity. Each collection location has different lighting, surfaces, backgrounds, and clutter levels. Training on this diversity produces the per-axis visual robustness that Colosseum measures.
Key Papers
- [1] Pumacay et al. “Colosseum: A Benchmark for Evaluating Generalization for Robotic Manipulation.” RSS 2024.
- [2] Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” CoRL 2023.
- [3] Chi et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS 2023.
- [4] Kim et al. “OpenVLA: An Open-Source Vision-Language-Action Model.” CoRL 2024.
- [5] Octo Model Team. “Octo: An Open-Source Generalist Robot Policy.” RSS 2024.
Frequently Asked Questions
What makes Colosseum different from other benchmarks?
Colosseum systematically tests visual robustness on real hardware: how much performance degrades when lighting, textures, backgrounds, distractors, or camera angles change. Most other benchmarks test task success under nominal conditions only, giving an incomplete picture of deployment readiness. Colosseum's per-axis decomposition reveals exactly which environmental changes cause the largest performance drops for a given VLA model.
Why do VLA models fail under visual perturbations?
Most training data comes from narrow visual distributions, either controlled lab settings or internet images that do not proportionally represent the perturbation axes relevant to manipulation. Models learn visual shortcuts specific to their training distribution. When perturbations break these correlations (e.g., an object's color changes, making color-based identification fail), the policy collapses because its visual features are not truly invariant to task-irrelevant changes.
Can data augmentation close the robustness gap?
Partially, but not completely. Color jittering and random crops help with some axes but fail to capture the correlated structure of real visual changes: how shadows shift with lighting, how reflections change with surface materials, how clutter affects occlusion patterns. Real-world data under authentic visual variation captures these correlations, producing more robust representations than synthetic augmentation alone.
How does Colosseum relate to domain randomization?
Domain randomization applies random visual perturbations during simulation training to build robustness. Colosseum is an evaluation benchmark, not a training method: it measures whether a policy is robust after training, regardless of how it was trained. Colosseum's results show that even domain-randomized policies still degrade under real-world perturbations, suggesting that the randomization distribution does not fully cover real visual variation.
Which perturbations cause the largest performance drops?
Lighting color changes typically cause the largest drops (30-50% for most models), followed by distractor objects (20-40%) and camera viewpoint shifts (15-30%). However, the ranking varies by architecture, and some models handle lighting well but fail on distractors. This model-specific sensitivity is what makes Colosseum's per-axis decomposition valuable for targeted data collection.
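The color jittering and random crops mentioned in the augmentation answer above can be sketched in a few lines. This is a minimal pure-Python illustration on a tiny image stored as nested lists of RGB tuples. The function names and parameters are hypothetical, and a real pipeline would normally use an array or image library.

```python
import random

Pixel = tuple[int, int, int]
Image = list[list[Pixel]]


def color_jitter(image: Image, rng: random.Random, max_gain: float = 0.3) -> Image:
    """Scale each RGB channel by one random gain shared by every pixel in the image."""
    gains = [1.0 + rng.uniform(-max_gain, max_gain) for _ in range(3)]
    return [
        [tuple(min(255, max(0, round(c * g))) for c, g in zip(px, gains)) for px in row]
        for row in image
    ]


def random_crop(image: Image, rng: random.Random, height: int, width: int) -> Image:
    """Cut a randomly positioned window of the requested size out of the image."""
    rows, cols = len(image), len(image[0])
    if height > rows or width > cols:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, rows - height)
    left = rng.randint(0, cols - width)
    return [row[left:left + width] for row in image[top:top + height]]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[(128, 128, 128)] * 8 for _ in range(8)]  # uniform gray 8x8 test image

augmented = color_jitter(random_crop(image, rng, 6, 6), rng)
print(len(augmented), len(augmented[0]), augmented[0][0])
```

Because the gain is applied uniformly, a flat gray image stays flat after jittering. The transform changes the overall tint but adds no shadows, reflections, or occlusion, which is the limitation described above.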
Build Visually Robust Robot Policies
Discuss visually diverse manipulation data that addresses the robustness gaps Colosseum reveals across all 12 perturbation axes.