Real-World Data for ARNOLD
ARNOLD provides standardized evaluation for language-conditioned robot learning. Real-world data validates whether simulation performance transfers to physical hardware.
ARNOLD at a Glance
Benchmark Profile
ARNOLD (A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes) evaluates an agent's ability to follow natural language instructions to manipulate objects in photorealistic 3D environments. Built on NVIDIA Isaac Sim, it bridges language understanding and physical manipulation.
The Sim-to-Real Gap
Isaac Sim's photorealistic rendering reduces the visual gap, but simulated physics for contact-rich interactions (pouring liquids, opening drawers and cabinets) still diverges from real-world dynamics. Language diversity in evaluation is limited to template-based variations.
Real-World Data Needed
Real-world language-grounded manipulation demonstrations. Diverse natural language instructions paired with manipulation actions. Real physical interactions (actual pouring, grasping, opening drawers and cabinets) whose contact dynamics no simulator fully reproduces.
Complementary Claru Datasets
Custom Language-Paired Collection
Real demonstrations paired with diverse natural language instructions provide the grounding data that templates cannot match.
Egocentric Kitchen Video Dataset
Kitchen manipulation video provides visual pretraining for the household environments in which ARNOLD evaluates agents.
Manipulation Trajectory Dataset
Real manipulation recordings provide authentic physics for the contact-rich interactions ARNOLD simulates.
Bridging the Gap: Technical Analysis
ARNOLD represents the convergence of natural language processing and robotic manipulation benchmarking. The benchmark tests whether models can understand language instructions, ground them in visual observations, and execute the corresponding physical manipulation — the core capability needed for language-conditioned robots.
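As a rough sketch of what that pipeline looks like in code, the snippet below shows the interface a language-conditioned policy exposes: an instruction and a camera observation go in, a low-level action comes out. Every name here (Observation, encode_text, policy_step) is hypothetical, and the encoders are stubs standing in for pretrained networks — this is an illustration of the structure, not ARNOLD's API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray       # (H, W, 3) camera image
    proprio: np.ndarray   # joint positions plus gripper state

def encode_text(instruction: str) -> np.ndarray:
    # Stub: a real system would use a pretrained language encoder.
    rng = np.random.default_rng(abs(hash(instruction)) % (2**32))
    return rng.standard_normal(512)

def encode_image(rgb: np.ndarray) -> np.ndarray:
    # Stub: a real system would use a pretrained vision backbone.
    return np.repeat(rgb.astype(np.float32).mean() / 255.0, 512)

def policy_step(obs: Observation, instruction: str) -> np.ndarray:
    """Fuse language and vision features, then emit a 7-DoF action
    (end-effector pose delta plus a gripper command)."""
    features = np.concatenate(
        [encode_text(instruction), encode_image(obs.rgb), obs.proprio])
    return np.tanh(features[:7])  # stub head standing in for a trained network

obs = Observation(rgb=np.zeros((224, 224, 3), dtype=np.uint8),
                  proprio=np.zeros(8))
action = policy_step(obs, "pour the coffee into the blue mug on the counter")
print(action.shape)  # (7,)
```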
The language grounding challenge in ARNOLD extends beyond simple object references. Instructions like 'pour the coffee into the blue mug on the counter' require resolving spatial references, understanding object attributes, and decomposing the instruction into a sequence of physical actions. Current VLA models achieve moderate success on template-based instructions but struggle with the compositional complexity of natural language.
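To make the decomposition concrete, here is the document's example instruction broken into toy sub-goals. The parse and the sub-goal vocabulary are assumptions for illustration; real systems learn this mapping rather than hard-coding it.

```python
# Toy decomposition of the example instruction into sub-goals.
instruction = "pour the coffee into the blue mug on the counter"

subgoals = [
    ("locate", {"object": "mug", "attribute": "blue", "relation": "on counter"}),
    ("locate", {"object": "coffee container"}),
    ("grasp",  {"object": "coffee container"}),
    ("move",   {"target": "above blue mug"}),
    ("pour",   {"stop_when": "mug sufficiently full"}),  # a continuous goal
]
for step, args in subgoals:
    print(step, args)
```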
Isaac Sim's ray-traced rendering narrows the visual domain gap compared to MuJoCo- or PyBullet-based benchmarks, and its PhysX engine simulates fluids and articulated objects rather than scripting them as instantaneous state changes. Even so, the physical interaction gap remains: simulated fluid dynamics and contact forces only approximate their real counterparts, and door and drawer joints are idealized models without real-world friction, backlash, or compliance. Real-world data where these physical interactions actually occur provides the ground truth for evaluating whether models understand physical causality or just exploit simulation-specific shortcuts.
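ARNOLD's defining feature is that goals are continuous states rather than binary predicates: success means the final state lands within a band around the commanded value. The sketch below applies that style of check to a real rollout; the tolerance and the drawer-opening numbers are illustrative assumptions, not the benchmark's actual parameters.

```python
# Sketch of a tolerance-based success check in the spirit of ARNOLD's
# continuous goal states. Tolerance and rollout values are illustrative.
def goal_reached(achieved: float, goal: float, tolerance: float = 0.05) -> bool:
    """Success if the final continuous state (e.g., drawer-opening fraction
    or fraction of water transferred) lands within a band around the goal."""
    return abs(achieved - goal) <= tolerance

# Simulated rollout: the drawer ends up 47% open when asked for 50%.
print(goal_reached(achieved=0.47, goal=0.50))  # True
# Real rollout: friction and compliance leave the same drawer at 38%.
print(goal_reached(achieved=0.38, goal=0.50))  # False -- the sim-to-real gap
```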
The instruction diversity gap is particularly important. Template-based evaluation ('pick up the [object] from the [location]') tests a narrow slice of language understanding. Real humans give instructions in diverse ways — elliptical, indirect, contextual. Training and evaluation on real language instructions, paired with actual demonstrations, tests genuine language grounding rather than template matching.
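The contrast is easy to see in code. Below, the document's own template is expanded the way template-based evaluation expands it; the natural phrasings are invented examples of the elliptical, indirect instructions humans actually give.

```python
import itertools

TEMPLATE = "pick up the {obj} from the {loc}"
objects = ["mug", "apple", "sponge"]
locations = ["counter", "sink", "table"]

# Template-based evaluation covers the cross product of slot fillers...
templated = [TEMPLATE.format(obj=o, loc=l)
             for o, l in itertools.product(objects, locations)]
print(len(templated), "instructions from one template")  # 9

# ...while the same mug-on-counter task, phrased the way people talk,
# matches no template at all:
natural = [
    "grab that mug, would you?",
    "the blue one next to the toaster",
    "I need my coffee cup",
]
```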
Frequently Asked Questions
What is the ARNOLD benchmark?
ARNOLD (A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes) evaluates an agent's ability to follow natural language instructions to manipulate objects in photorealistic 3D environments. Built on NVIDIA Isaac Sim, it bridges language understanding and physical manipulation.
What real-world data complements ARNOLD?
Real-world language-grounded manipulation demonstrations. Diverse natural language instructions paired with manipulation actions. Real physical interactions (actual pouring, grasping, opening drawers and cabinets) whose contact dynamics no simulator fully reproduces.
What is ARNOLD's sim-to-real gap?
Isaac Sim's photorealistic rendering reduces the visual gap, but simulated physics for contact-rich interactions (pouring liquids, opening drawers and cabinets) still diverges from real-world dynamics. Language diversity in evaluation is limited to template-based variations.
Can Claru match data collection to a specific ARNOLD setup?
Yes. Claru coordinates data collection on specific robot platforms and in specific environments to enable direct comparison between simulated and real performance on ARNOLD tasks.
Get Real-World Data for ARNOLD
Discuss purpose-collected data to validate and improve your ARNOLD-trained policies on physical hardware.