Real-World Data for Colosseum

Colosseum reveals that VLA models fail under systematic visual perturbations. Diverse real-world data builds the per-axis robustness these models lack.

Colosseum at a Glance

Manipulation Tasks: 14

Perturbation Axes: 12

Typical Success Drop: 30-50%

Robot Platform: WidowX 250

Evaluation Setting: Real hardware

Released: 2024

Perturbation Axes

Colosseum's perturbation axes isolate individual environmental changes, revealing which factors cause the largest performance drops for VLA models. The table below summarizes six of the 12 axes.

Perturbation Axis	What Changes	Typical VLA Drop	Data Implication
Lighting Color	White light replaced with colored (red, blue, green)	30-50%	Train under varied lighting temperatures and hues
Lighting Intensity	Brightness increased or decreased significantly	15-25%	Include dim and bright environments in collection
Table Texture	Workspace surface material changed	10-30%	Collect on wood, metal, fabric, and patterned surfaces
Distractor Objects	Irrelevant objects placed on the workspace	20-40%	Include cluttered environments with non-task objects
Camera Viewpoint	Camera viewing angle shifted by 5-10 degrees	15-30%	Collect from multiple camera angles per task
Object Color	Target object color changed	10-25%	Use diverse object instances, not single exemplars

VLA Model Robustness Comparison

How leading VLA architectures perform on Colosseum's perturbation axes, based on published results.

Model	Nominal Success	Lighting Robustness	Distractor Robustness	Overall Degradation
RT-2	High	Low (30-50% drop)	Medium (20-30% drop)	Significant
Diffusion Policy	Medium-High	Medium (20-30% drop)	Low (30-40% drop)	Significant
OpenVLA	Medium	Medium (15-30% drop)	Medium (20-35% drop)	Moderate-Significant
Octo	Medium	Medium (20-35% drop)	Medium (15-30% drop)	Moderate

Benchmark Profile

Colosseum is a benchmark designed to evaluate the robustness of vision-language-action (VLA) models under systematic environmental perturbations. Created by Pumacay et al. and presented at RSS 2024, it tests how robot manipulation policies degrade when environmental conditions change across 12 perturbation axes, including lighting, textures, table colors, distractor objects, and camera viewpoints, using a real WidowX 250 robot arm.

Task Set

14 manipulation tasks evaluated under 12 systematic perturbation axes. Tasks include pick-and-place, stacking, drawer operations, and object rearrangement. Eleven single-factor axes (lighting color, lighting intensity, table texture, table color, distractor objects, background changes, camera pose shifts, object color, object size, object texture, and tabletop clutter) are each applied independently to isolate their effect on policy performance. The twelfth axis applies perturbations in combination.

Observation Space

RGB images from wrist-mounted and third-person cameras at 224x224 resolution, proprioceptive state including joint positions and gripper aperture, and natural language task descriptions specifying the manipulation goal.

Action Space

6-DOF end-effector delta poses (3D position + 3D orientation) with binary gripper control, executed on a WidowX 250 6-DOF robot arm at 5 Hz control frequency.
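As a rough illustration of the action format described above, the sketch below packs one control step into a 7-D vector (six pose deltas plus one gripper bit). The class name, field names, units, and the roll/pitch/yaw orientation convention are assumptions made for this example; the source only specifies 3D position, 3D orientation, binary gripper control, and a 5 Hz rate.

```python
from dataclasses import dataclass
from typing import List, Tuple

CONTROL_HZ = 5  # control frequency stated in the action-space description


@dataclass(frozen=True)
class Action:
    """One control step: end-effector delta pose plus a binary gripper command."""
    delta_position: Tuple[float, float, float]     # dx, dy, dz (units assumed: metres)
    delta_orientation: Tuple[float, float, float]  # droll, dpitch, dyaw (assumed: radians)
    gripper_closed: bool

    def as_vector(self) -> List[float]:
        """Flatten to a 7-D vector: 6 pose deltas followed by 1 gripper bit."""
        return [*self.delta_position, *self.delta_orientation, float(self.gripper_closed)]


def steps_for_duration(seconds: float, hz: int = CONTROL_HZ) -> int:
    """Number of control steps issued over a time window at a fixed rate."""
    return round(seconds * hz)


# Hypothetical step: move 1 cm along x, rotate slightly about yaw, close the gripper.
step = Action((0.01, 0.0, 0.0), (0.0, 0.0, 0.1), True)
print(step.as_vector())
print(steps_for_duration(10))  # steps in a 10-second episode at 5 Hz
```

At 5 Hz each step spans 0.2 seconds, so a policy has relatively few decisions per episode and each visual misreading carries more weight.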

Evaluation Protocol

Success rate measured per task under each perturbation axis independently and in combination. Each task-perturbation pair runs 10 evaluation trials with deterministic perturbation settings. The benchmark reports both nominal success rate (no perturbations) and per-axis degradation, enabling researchers to identify exactly which environmental changes cause the largest performance drops for a given VLA architecture.
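The per-axis degradation described above can be sketched as follows. This is a minimal example, assuming degradation is reported as the absolute drop in success rate (nominal minus perturbed) averaged over tasks; the function names, the "nominal" label, and all trial outcomes are made up for illustration and are not benchmark results.

```python
from collections import defaultdict

NOMINAL = "nominal"  # hypothetical label for the unperturbed condition


def success_rate(trials):
    """Fraction of successful trials; `trials` is a list of booleans."""
    return sum(trials) / len(trials)


def per_axis_degradation(results):
    """Map each perturbation axis to its mean absolute success-rate drop.

    `results` maps (task, axis) -> list of trial outcomes. Every task must
    also have an entry for the nominal condition.
    """
    drops = defaultdict(list)
    for (task, axis), trials in results.items():
        if axis == NOMINAL:
            continue
        nominal_rate = success_rate(results[(task, NOMINAL)])
        drops[axis].append(nominal_rate - success_rate(trials))
    return {axis: sum(d) / len(d) for axis, d in drops.items()}


# Made-up outcomes for illustration only (10 trials per task-perturbation pair).
results = {
    ("pick_place", NOMINAL): [True] * 9 + [False] * 1,
    ("pick_place", "lighting_color"): [True] * 4 + [False] * 6,
    ("pick_place", "distractors"): [True] * 6 + [False] * 4,
    ("stack", NOMINAL): [True] * 8 + [False] * 2,
    ("stack", "lighting_color"): [True] * 4 + [False] * 6,
    ("stack", "distractors"): [True] * 5 + [False] * 5,
}

ranked = sorted(per_axis_degradation(results).items(), key=lambda kv: -kv[1])
for axis, drop in ranked:
    print(f"{axis}: {drop * 100:.0f} percentage points")
```

With only 10 trials per pair, a single trial moves the success rate by 10 points, so small differences between axes should be read with caution.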

The Sim-to-Real Gap

Colosseum is unusual among benchmarks because it runs on real hardware, not simulation. However, its controlled perturbations still underrepresent real-world variability. A real deployment faces simultaneous changes in lighting, backgrounds, distractor objects, and camera drift, not one perturbation at a time. In addition, Colosseum's perturbations are applied in a controlled lab setting and cannot capture outdoor lighting variation, surface contamination, or the visual complexity of unstructured real environments.

Real-World Data Needed

Robustness to the perturbations Colosseum measures requires training data collected under extreme visual diversity: many different lighting conditions, backgrounds, table surfaces, distractor objects, and camera angles. Uncontrolled real-world environments naturally provide this diversity at a scale that controlled perturbation experiments cannot match.

Complementary Claru Datasets

Egocentric Activity Dataset

Collected across 100+ real-world locations with naturally varying lighting, backgrounds, and visual conditions. This is the kind of visual diversity needed to train policies that hold up along Colosseum's perturbation axes.

Manipulation Trajectory Dataset

Real-world manipulation recordings in diverse, uncontrolled environments provide training data with authentic visual variation rather than synthetic or lab-controlled perturbations.

Custom Visual Diversity Collection

Purpose-collected manipulation data explicitly varying lighting, surface material, background, and distractor density to address each of Colosseum's 12 perturbation axes with real-world instances.

Bridging the Gap: Technical Analysis

Colosseum fills a critical evaluation gap by measuring visual robustness on real hardware. Most benchmarks test task success under nominal conditions only. Colosseum tests whether performance holds when one environmental factor changes at a time, and the results expose a fundamental weakness in current VLA architectures.

The benchmark's findings are sobering. RT-2-based models lose 30-50% success rate when lighting color changes from white to colored. Diffusion Policy variants drop 20-40% with distractor objects on the table. Even small camera viewpoint shifts of 5-10 degrees degrade performance by 15-30% for models that appeared robust under nominal evaluation. OpenVLA and Octo show similar sensitivity, with no current architecture demonstrating consistent robustness across all perturbation axes.

What makes these results actionable is the per-axis decomposition. Colosseum reveals that lighting robustness and distractor robustness are largely independent failure modes: a policy can be robust to lighting changes yet fragile to distractors. This implies that training data must cover each axis of variation independently, not just increase overall diversity.
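One way to act on this is to audit a training set axis by axis instead of relying on a single diversity score. The sketch below counts distinct recorded conditions per axis and flags under-covered axes. The metadata schema, axis names, and threshold are hypothetical; real datasets may record conditions differently or not at all.

```python
from collections import defaultdict


def axis_coverage(episodes, axes, min_distinct=3):
    """Count distinct conditions seen per axis and check a coverage threshold.

    `episodes` is a list of dicts mapping axis name -> recorded condition.
    Returns {axis: (n_distinct, meets_threshold)}.
    """
    seen = defaultdict(set)
    for episode in episodes:
        for axis in axes:
            if axis in episode:
                seen[axis].add(episode[axis])
    return {axis: (len(seen[axis]), len(seen[axis]) >= min_distinct) for axis in axes}


# Hypothetical metadata: lighting varies across episodes, distractors never do.
episodes = [
    {"lighting": "warm", "table_surface": "wood", "distractors": "none"},
    {"lighting": "cool", "table_surface": "wood", "distractors": "none"},
    {"lighting": "dim", "table_surface": "metal", "distractors": "none"},
]

report = axis_coverage(episodes, ["lighting", "table_surface", "distractors"])
under_covered = [axis for axis, (_, ok) in report.items() if not ok]
print(under_covered)
```

A dataset like this one could look diverse in aggregate while still leaving the distractor axis untouched, which is the failure pattern the per-axis decomposition is meant to expose.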

The benchmark also reveals that data quantity alone does not solve robustness. RT-2 was trained on massive internet-scale data but still shows significant drops under Colosseum's perturbations. The issue is the distribution of that data: web images do not proportionally represent the specific perturbation axes that matter for manipulation. Purpose-collected real-world manipulation data with deliberately varied conditions can be more efficient than uncurated web-scale data for building robust policies.

Claru's egocentric activity dataset, collected across 100+ cities in naturally varying conditions, provides precisely this kind of structured visual diversity. Each collection location has different lighting, surfaces, backgrounds, and clutter levels. Training on this diversity produces the per-axis visual robustness that Colosseum measures.

Key Papers

[1] Pumacay et al. “Colosseum: A Benchmark for Evaluating Generalization for Robotic Manipulation.” RSS 2024.
[2] Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” CoRL 2023.
[3] Chi et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS 2023.
[4] Kim et al. “OpenVLA: An Open-Source Vision-Language-Action Model.” CoRL 2024.
[5] Octo Model Team. “Octo: An Open-Source Generalist Robot Policy.” RSS 2024.

Frequently Asked Questions

What makes Colosseum different from other manipulation benchmarks?

Colosseum systematically tests visual robustness on real hardware: it measures how much performance degrades when lighting, textures, backgrounds, distractors, or camera angles change. Most other benchmarks test task success under nominal conditions only, giving an incomplete picture of deployment readiness. Colosseum's per-axis decomposition reveals exactly which environmental changes cause the largest performance drops for a given VLA model.

Why do VLA models fail under visual perturbations?

Most training data comes from narrow visual distributions, either controlled lab settings or internet images that do not proportionally represent the perturbation axes relevant to manipulation. Models learn visual shortcuts specific to their training distribution. When perturbations break these correlations (e.g., an object's color changes, making color-based identification fail), the policy collapses because its visual features are not truly invariant to task-irrelevant changes.

Can data augmentation replace diverse real-world data?

Partially, but not completely. Color jittering and random crops help with some axes but fail to capture the correlated structure of real visual changes: how shadows shift with lighting, how reflections change with surface materials, and how clutter affects occlusion patterns. Real-world data under authentic visual variation captures these correlations, producing more robust representations than synthetic augmentation alone.
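To make the limitation concrete, here is a minimal color-jitter sketch in plain Python (real pipelines would typically use an image library). It draws one brightness factor and one gain per channel, then applies them uniformly to every pixel. The function name, parameter ranges, and image representation are assumptions for illustration.

```python
import random


def color_jitter(image, rng, brightness=0.3, channel_gain=0.2):
    """Apply one random brightness factor and per-channel gain to a whole image.

    `image` is a list of rows; each row is a list of (r, g, b) ints in 0..255.
    """
    base = 1.0 + rng.uniform(-brightness, brightness)
    gains = [base * (1.0 + rng.uniform(-channel_gain, channel_gain)) for _ in range(3)]
    return [
        [tuple(min(255, max(0, round(c * g))) for c, g in zip(pixel, gains)) for pixel in row]
        for row in image
    ]


# Tiny 2x2 image with made-up pixel values.
image = [[(0, 0, 0), (100, 150, 200)], [(255, 255, 255), (10, 20, 30)]]
jittered = color_jitter(image, random.Random(0))
print(jittered)
```

Because the same three gains scale every pixel, the transform changes global color but leaves shadow positions, reflections, and occlusions exactly where they were, which is the correlated structure that real lighting changes alter.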

How does Colosseum relate to domain randomization?

Domain randomization applies random visual perturbations during simulation training to build robustness. Colosseum is an evaluation benchmark, not a training method: it measures whether a policy is robust after training, regardless of how it was trained. Colosseum's results show that even domain-randomized policies still degrade under real-world perturbations, suggesting that the randomization distribution does not fully cover real visual variation.

Which perturbations cause the largest performance drops?

Lighting color changes typically cause the largest drops (30-50% for most models), followed by distractor objects (20-40%) and camera viewpoint shifts (15-30%). However, the ranking varies by architecture: some models handle lighting well but fail on distractors. This model-specific sensitivity is what makes Colosseum's per-axis decomposition valuable for targeted data collection.