Real-World Data for RLBench

RLBench is the standard 100-task benchmark for multi-task manipulation. Real-world data reveals whether its simulation scores transfer to physical robots.

RLBench at a Glance

Tasks: 100
Action space: 7-DOF
Camera views: 4
Eval episodes per task: 25
Physics engine: CoppeliaSim
Released: 2020

RLBench Task Categories

RLBench's 100 tasks span manipulation primitives, tool use, articulated objects, and multi-step sequences. Each category presents different sim-to-real transfer challenges.

Category | Example Tasks | Task Count | Primary Transfer Gap
--- | --- | --- | ---
Pick & Place | Pick up cup, place wine at rack | ~25 | Grasp stability, object weight, friction
Stacking & Assembly | Stack blocks, stack cups, put ring on peg | ~15 | Contact-rich insertion, alignment tolerance
Articulated Objects | Open drawer, open door, turn tap | ~15 | Mechanism friction, hinge dynamics
Tool Use | Screw nail, sweep to dustpan | ~10 | Tool-object contact, force application
Reaching & Pressing | Reach target, press button, push button | ~10 | Minimal (coarse positioning transfers well)
Multi-Step Sequences | Set the table, put groceries in cupboard | ~10 | Error compounding, state estimation
Precision Manipulation | Close jar, insert peg, place cups | ~15 | Tight tolerances, compliance control

Sim vs. Real: Key Gaps in RLBench

Dimension | RLBench (Simulation) | Real World
--- | --- | ---
Contact Physics | Spring-damper contacts, uniform Coulomb friction | Complex friction cones, material deformation, surface contamination
Visual Rendering | Flat lighting, simple textures, no reflections | Complex shadows, specular highlights, clutter, varying lighting
Actuator Model | Ideal joint control, no backlash or friction | Joint friction, backlash, torque limits, impedance control dynamics
Object Diversity | Parameterized variations (color, size, position) | Infinite geometry, material, and weight variety
Motion Planning | Always succeeds with full environment knowledge | Uncertain geometry, obstacles, joint limit collisions
Sensor Model | Perfect RGB-D, no noise, no occlusion artifacts | Sensor noise, depth holes, motion blur, auto-exposure

RLBench vs. Related Benchmarks

How RLBench compares to other widely used manipulation benchmarks.

Feature | RLBench | ManiSkill 3 | LIBERO | CALVIN
--- | --- | --- | --- | ---
Task count | 100 | 20+ | 130 | 34
Physics engine | CoppeliaSim | SAPIEN | MuJoCo | PyBullet
GPU parallel | No | Yes (4K+ envs) | No | No
Multi-step eval | Some tasks | Some tasks | 10-step suites | 5-step chains
Language conditioning | Task name only | Task name | Templated | Free-form natural language
Rendering quality | Basic rasterized | Ray-traced | Basic rasterized | Basic rasterized

Benchmark Profile

RLBench is a large-scale benchmark and learning environment built on CoppeliaSim (V-REP) and PyRep. Created by Stephen James et al. at Imperial College London in 2020, it provides 100 carefully designed manipulation tasks with scripted demonstrations, supporting both reinforcement learning and imitation learning research. Each task includes multiple variations in object position, color, size, and count, making it the de facto standard for evaluating multi-task manipulation policies.

Task Set
100 manipulation tasks spanning reach target, pick and place, stack blocks, open drawer, slide block, press button, put items in drawer, close jar, screw nail, place wine at rack, and complex multi-step sequences like set the table and put groceries in cupboard. Each task has 10-60 variations that change object color, position, and quantity. Tasks range from simple single-step reaching to complex multi-step sequences requiring 6+ coordinated actions.
Observation Space
RGB images from up to 4 cameras (front, left shoulder, right shoulder, wrist) at 128x128 resolution, aligned depth maps, joint positions (7 joints), joint velocities, gripper open/close state, and task-specific low-dimensional state observations. Demonstrations include full 6-DOF end-effector waypoint trajectories.
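The observation layout above can be sketched as a plain container. Field names, types, and the dummy values are illustrative only, not the actual RLBench API:

```python
from dataclasses import dataclass
from typing import Dict, List

CAMERAS = ("front", "left_shoulder", "right_shoulder", "wrist")

@dataclass
class Observation:
    """One RLBench-style observation (illustrative layout, not the RLBench API)."""
    rgb: Dict[str, List]          # per-camera 128x128x3 image
    depth: Dict[str, List]        # per-camera 128x128 aligned depth map
    joint_positions: List[float]  # 7-DOF arm configuration
    joint_velocities: List[float]
    gripper_open: float           # 1.0 = open, 0.0 = closed
    low_dim_state: List[float]    # task-specific extras (e.g. target pose)

def blank_observation() -> Observation:
    """Build a zero-filled observation with the shapes described above."""
    return Observation(
        rgb={c: [[[0] * 3 for _ in range(128)] for _ in range(128)] for c in CAMERAS},
        depth={c: [[0.0] * 128 for _ in range(128)] for c in CAMERAS},
        joint_positions=[0.0] * 7,
        joint_velocities=[0.0] * 7,
        gripper_open=1.0,
        low_dim_state=[0.0] * 4,
    )

obs = blank_observation()
assert len(obs.rgb) == 4 and len(obs.joint_positions) == 7
```
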
Action Space
7-DOF joint velocities or 6-DOF end-effector delta poses (3D position + quaternion orientation) with binary gripper open/close. Most recent methods use keyframe-based action representations, predicting next-best-pose waypoints rather than continuous joint commands.
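Keyframe-based methods typically mark a demonstration timestep as a waypoint when the gripper state toggles or the arm comes to rest. A minimal sketch of that heuristic (function name, thresholds, and the toy demo are our own):

```python
def extract_keyframes(joint_velocities, gripper_open, vel_eps=1e-2):
    """Return indices of demo timesteps to treat as next-best-pose waypoints.

    A timestep is a keyframe if the gripper toggled since the previous step
    or all joint velocities are near zero. `joint_velocities` is a list of
    7-element lists; `gripper_open` is a list of 0/1 flags. Real pipelines
    also deduplicate consecutive rest frames; omitted here for brevity.
    """
    keyframes = []
    for t in range(1, len(gripper_open)):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        at_rest = all(abs(v) < vel_eps for v in joint_velocities[t])
        if gripper_changed or at_rest:
            keyframes.append(t)
    return keyframes

# Toy 6-step demo: gripper closes at t=3, arm comes to rest at t=5.
vels = [[0.5] * 7, [0.4] * 7, [0.3] * 7, [0.2] * 7, [0.1] * 7, [0.0] * 7]
grip = [1, 1, 1, 0, 0, 0]
print(extract_keyframes(vels, grip))  # → [3, 5]
```

The policy then only has to predict the pose at each keyframe, which is why a motion planner is needed to connect them at execution time.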
Evaluation Protocol
Success rate on held-out task variations over 25 evaluation episodes per task. Multi-task evaluation measures average success rate across all 100 tasks or a standard 18-task subset. Single-task evaluation uses 100 episodes per task with randomized initial conditions. Methods are compared on the number of demonstrations used (1, 5, 10, 20, 100 demos per task).
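The multi-task protocol above reduces to a per-task average of binary episode outcomes, with tasks weighted equally. A minimal sketch (task names and outcomes are illustrative):

```python
def multitask_success(results):
    """Average per-task success rate, RLBench-style.

    `results` maps task name -> list of 0/1 episode outcomes (25 episodes
    per task under the standard protocol). Each task contributes equally
    to the overall average regardless of its episode count.
    """
    per_task = {task: sum(eps) / len(eps) for task, eps in results.items()}
    return per_task, sum(per_task.values()) / len(per_task)

per_task, avg = multitask_success({
    "open_drawer":  [1] * 20 + [0] * 5,   # 80% success
    "stack_blocks": [1] * 10 + [0] * 15,  # 40% success
})
print(per_task, avg)  # avg ≈ 0.6
```
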

The Sim-to-Real Gap

RLBench's CoppeliaSim physics diverges from real-world contact dynamics — objects slide unrealistically on surfaces, grasps succeed or fail discretely rather than exhibiting partial slip, and contacts are modeled as spring-damper systems with simplified friction. Camera rendering lacks photorealistic lighting, textures, and optical effects present in real sensor data. The simulated Franka Panda ignores real joint friction, backlash, torque limits, and the nonlinear dynamics of the real robot's impedance controller.

Real-World Data Needed

Real-world manipulation recordings on the same task categories as RLBench — pick-and-place, stacking, drawer operations, button pressing, jar manipulation, and multi-step sequences — collected with real robots or human demonstrations. Critical needs include authentic contact dynamics with diverse objects, photorealistic visual data from real environments, demonstrations on physical hardware with real actuator limitations, and multi-camera recordings that match RLBench's 4-camera observation setup.

Complementary Claru Datasets

Egocentric Activity Dataset

Human demonstrations of manipulation tasks parallel to RLBench categories, recorded across 100+ real-world environments, provide visual pretraining data that helps close the gap left by non-photorealistic simulation rendering.

Manipulation Trajectory Dataset

Real-world manipulation recordings with multi-camera views and temporal annotations provide authentic contact dynamics for tasks similar to RLBench's 100-task suite, including pick-and-place, drawer operations, and assembly.

Custom Task-Matched Collection

Purpose-collected real-world demonstrations of specific RLBench tasks enable direct sim-to-real comparison, simulation parameter calibration, and policy fine-tuning on physical hardware.

Bridging the Gap: Technical Analysis

RLBench has become the de facto standard for evaluating multi-task manipulation policies. PerAct, RVT, RVT-2, Act3D, and GNFactor all benchmark against RLBench's task suite, creating a well-established leaderboard that drives architectural innovation. However, high RLBench scores do not reliably predict real-world performance, and this gap is well-documented.

The visual sim-to-real gap is particularly pronounced. CoppeliaSim's rendering engine produces clean, uniform lighting with simple flat-colored textures — nothing like the complex visual environment a real robot encounters. Models that learn to exploit RLBench's visual shortcuts (e.g., object color as the sole distinguishing feature between blocks) fail when confronted with real-world visual complexity, where objects have similar colors, specular highlights, and partial occlusion.

The contact dynamics gap is equally critical. CoppeliaSim models contacts as spring-damper systems with simplified Coulomb friction. Real-world grasps involve complex friction cones, material deformation, surface contamination, and the compliance of real gripper pads. A policy that achieves 95% grasp success in RLBench may drop to 60% on real hardware because its learned grasping strategy relies on simulation-specific contact behavior that does not exist physically.
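With only a few dozen real trials per task, a raw success percentage is noisy, so drops like the hypothetical 95%-to-60% one above are best compared with confidence intervals. A sketch using the Wilson score interval (pure Python; trial counts are illustrative):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 ~ 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# Hypothetical: 24/25 grasps succeed in simulation vs. 15/25 on hardware.
sim_lo, sim_hi = wilson_interval(24, 25)
real_lo, real_hi = wilson_interval(15, 25)
# If the intervals do not overlap, the drop is unlikely to be eval noise.
print((round(sim_lo, 2), round(sim_hi, 2)), (round(real_lo, 2), round(real_hi, 2)))
```
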

The keyframe action representation used by modern RLBench methods (PerAct, RVT) introduces an additional transfer challenge. These methods predict discrete next-best-pose waypoints, and a motion planner connects the waypoints. In simulation, motion planning always succeeds because the environment is fully known. On real hardware, motion planning must handle uncertainty, obstacles not in the model, and joint limits that the simulation's idealized robot does not have.

Bridging this gap requires real-world data collected on the same task categories. Claru can coordinate collection of manipulation demonstrations that directly parallel RLBench tasks — pick-and-place with real objects, drawer operations in real furniture, stacking with physical blocks — providing the authentic data needed to validate and calibrate simulation-trained policies before deployment.

Key Papers

  1. James et al., "RLBench: The Robot Learning Benchmark & Learning Environment," RA-L, 2020.
  2. Shridhar et al., "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation," CoRL, 2022.
  3. Goyal et al., "RVT: Robotic View Transformer for 3D Object Manipulation," CoRL, 2023.
  4. Gervet et al., "Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation," CoRL, 2023.
  5. Goyal et al., "RVT-2: Learning Precise Manipulation from Few Demonstrations," RSS, 2024.


Frequently Asked Questions

Why do policies trained on RLBench fail on real robots?
RLBench uses simplified physics (spring-damper contacts), idealized actuators (no backlash or friction), and non-photorealistic rendering (flat lighting, simple textures). Policies learn to exploit simulation-specific shortcuts — relying on uniform object colors for identification, assuming perfect friction for grasping, depending on exact motion planning — that do not exist on real hardware. The visual gap, contact dynamics gap, and actuator model gap each independently contribute to performance drops during transfer.

Which RLBench tasks are hardest to transfer to real hardware?
Contact-rich tasks like stacking, insertion, and jar manipulation are hardest to transfer because they depend on friction and contact dynamics that CoppeliaSim models poorly. Multi-step tasks like set the table are also challenging due to compounding errors across steps. Tasks requiring only coarse positioning (reach target, push button) transfer most easily because they tolerate larger execution errors.

How can real-world data bridge the sim-to-real gap?
Three primary approaches: (1) Fine-tuning simulation-trained policies with a small number of real demonstrations to adapt contact strategies and visual features. (2) Calibrating simulation parameters using real-world force measurements to improve physics fidelity before training. (3) Training domain adaptation models on paired simulation and real visual data to translate between observation domains. The most effective approach combines all three.
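Approach (2) above can be sketched with a deliberately toy example: fit a Coulomb friction coefficient so that a simple sliding model reproduces measured slide distances. The model, grid, and "measurements" below are all illustrative assumptions, not a real calibration pipeline:

```python
G = 9.81  # gravitational acceleration, m/s^2

def slide_distance(v0, mu):
    """Toy model: an object released at speed v0 slides d = v0^2 / (2*mu*g)."""
    return v0**2 / (2 * mu * G)

def calibrate_friction(trials, grid=None):
    """Pick the friction coefficient that best explains measured slides.

    `trials` is a list of (initial_speed_m_s, measured_distance_m) pairs,
    e.g. from pushing an object across a real tabletop. A grid search over
    candidate coefficients minimizes the sum of squared errors.
    """
    grid = grid or [mu / 1000 for mu in range(50, 1000)]
    def sse(mu):
        return sum((slide_distance(v, mu) - d) ** 2 for v, d in trials)
    return min(grid, key=sse)

# Fake "real" measurements generated with mu = 0.30:
trials = [(v, slide_distance(v, 0.30)) for v in (0.2, 0.4, 0.6)]
print(calibrate_friction(trials))  # → 0.3
```

Real calibration replaces the toy model with simulator rollouts and the fake measurements with force/trajectory data from hardware, but the fit-parameters-to-match-reality loop is the same.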

What is the standard RLBench evaluation protocol?
The standard multi-task evaluation trains a single policy on 18 representative tasks (or all 100) using 1, 5, 10, 20, or 100 demonstrations per task, then evaluates success rate over 25 episodes per task with randomized initial conditions. The few-shot protocol (especially 5 and 10 demonstrations) is most commonly reported because it reflects the practical constraint of limited real-world data availability.

How does the keyframe action representation affect sim-to-real transfer?
Keyframe methods like PerAct and RVT predict 6-DOF waypoints rather than continuous joint commands, reducing the policy to a series of perception-to-pose predictions. This simplifies the learning problem but makes the policy dependent on a motion planner to connect the waypoints. In simulation, motion planning always succeeds; on real hardware, planning must handle uncertainty and collision avoidance with imprecise geometry, and closing the gap between planned and executed trajectories requires compliant control.

Get Real-World Data for RLBench Tasks

Discuss purpose-collected manipulation data that parallels RLBench's 100-task suite for sim-to-real validation and policy fine-tuning.