How to Set Up a Teleoperation Rig for Data Collection
A practitioner's guide to building a teleoperation rig for robot data collection — choosing between leader-follower, VR, and SpaceMouse interfaces, selecting and calibrating cameras, configuring the recording pipeline, and optimizing for operator throughput and data quality.
Prerequisites
- Robot arm(s) with position control API
- Workspace with stable mounting for robot and cameras
- Linux workstation with real-time kernel (recommended)
- Budget for input devices and cameras
Choose the Teleoperation Interface
Select the input device based on your task complexity, budget, and target data quality. Here is a decision framework:
Leader-follower arms (e.g., ALOHA, GELLO): Best for fine-grained manipulation tasks requiring dexterous control — the operator directly feels and mirrors the manipulation motion. The ALOHA system uses ViperX 300 6-DoF arms at approximately $4,000 each (leader + follower = $8,000 per side). GELLO uses 3D-printed leader arms with joint-position mirroring to commodity robots like the Franka Emika. Leader-follower produces the highest-quality demonstrations because the kinematic mapping is intuitive and direct. Setup complexity: medium (mechanical alignment, joint calibration, latency tuning). Best for: grasping, insertion, bimanual tasks, food preparation, cloth folding.
VR controller interface (e.g., Meta Quest 3, HTC Vive): The operator sees the robot workspace through cameras (displayed in the headset) and controls the end-effector position using hand controllers. Libraries like dex-retargeting and AnyTeleop map VR controller poses to robot end-effector commands. Cost: $400-800 for the headset. The advantage is that operators can work remotely (over a network link) and the setup is portable. The disadvantage is that depth perception through cameras is limited, causing more collisions and missed grasps. Best for: tabletop pick-and-place, remote data collection, tasks where the operator does not need force feedback.
SpaceMouse / gamepad: The 3Dconnexion SpaceMouse ($200-400) provides a 6-DoF input that maps to incremental end-effector velocity. Low cost, minimal setup, but the mapping from hand motion to robot motion is unintuitive — operators need 2-4 hours of training before producing usable demonstrations. Demonstration quality is lower (more jerky, slower execution). Best for: low-volume collection, quick prototyping, tasks with slow and deliberate motions.
Tip: If budget allows, always prefer leader-follower over SpaceMouse — the quality difference in demonstrations directly translates to 20-40% fewer total demonstrations needed for equivalent policy performance
Set Up Cameras and Calibrate the Sensor Rig
Camera placement and calibration directly determine what the trained policy can see. For most manipulation tasks, the minimum camera setup is: (1) one overhead camera pointing down at the workspace (captures the full scene layout for spatial reasoning), (2) one wrist-mounted camera on each robot arm (captures close-up gripper-object interactions), and (3) optionally one or two side cameras at 30-45 degrees (resolve occlusions from the overhead view).
For camera selection, Intel RealSense D405 (short-range, high-resolution depth) is ideal for wrist cameras. Intel RealSense D435i (medium-range, wider FOV) works well for overhead and side cameras. If depth is not needed (many policies use RGB only), Logitech C920 webcams at $50 each provide excellent 1080p video. All cameras should capture at 30 Hz minimum — some policies (ACT, Diffusion Policy) benefit from higher rates up to 60 Hz.
Calibrate each camera's intrinsics using a ChArUco board (OpenCV's cv2.aruco module): capture 20+ board images at varying angles and distances, targeting reprojection error below 0.3 pixels. Calibrate extrinsics (camera-to-robot-base transform) using hand-eye calibration: move the robot end-effector to 15-20 known poses while capturing the corresponding camera images of a calibration board attached to the end-effector. Use cv2.calibrateHandEye() with the TSAI method. Verify extrinsics by projecting known 3D points into the camera image — misalignment should be under 3 mm at typical working distances.
Mount all cameras on rigid structures (not flexible gooseneck mounts that drift over time). Label each camera with its ID and position. Test the full camera rig by recording a 10-minute demonstration and verifying that all camera feeds are synchronized, in focus, well-exposed, and covering the full manipulation workspace.
Tip: The wrist camera is the single most important camera for policy learning — ensure it has an unobstructed view of the gripper and the immediate contact zone. If you can only afford one camera beyond the wrist camera, choose overhead over side views
Build the Recording Pipeline
The recording pipeline captures, synchronizes, timestamps, and stores all data streams during each demonstration episode. A production-grade pipeline must handle: multi-camera video at 30+ Hz, robot joint states at 50-100 Hz, end-effector poses at 50 Hz, gripper commands, and optionally force/torque data at 100+ Hz.
In ROS2, use rosbag2 recording with a composite message type that bundles all streams. Create a custom launch file that starts all camera drivers, robot state publishers, and the recording node simultaneously. Use hardware timestamps (not software wall-clock time) for all sensor data — software timestamps have 5-20 ms jitter that causes synchronization errors. On Intel RealSense cameras, enable hardware timestamping with rs2_option.RS2_OPTION_GLOBAL_TIME_ENABLED.
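A skeleton of such a launch file might look like the following — the package, node, and topic names are illustrative and will differ with your camera drivers and robot stack:

```python
# launch/record_teleop.launch.py -- names here are placeholders; substitute
# the drivers and topics of your actual rig.
from launch import LaunchDescription
from launch.actions import ExecuteProcess
from launch_ros.actions import Node

def generate_launch_description():
    topics = ['/wrist_cam/color/image_raw', '/overhead_cam/color/image_raw',
              '/joint_states', '/gripper/command']
    return LaunchDescription([
        Node(package='realsense2_camera', executable='realsense2_camera_node',
             name='wrist_cam', namespace='wrist_cam'),
        Node(package='realsense2_camera', executable='realsense2_camera_node',
             name='overhead_cam', namespace='overhead_cam'),
        Node(package='robot_state_publisher',
             executable='robot_state_publisher',
             name='robot_state_publisher'),
        # mcap storage handles multi-camera bandwidth better than sqlite3
        ExecuteProcess(cmd=['ros2', 'bag', 'record', '-s', 'mcap',
                            '-o', 'episode_000'] + topics),
    ])
```

Starting everything from one launch file guarantees that no stream is silently missing from a recording session.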
For non-ROS setups, build a Python recording loop that polls each sensor at its native rate and writes to an HDF5 file with separate datasets per stream. Use a shared high-resolution timer (time.perf_counter_ns()) for consistent timestamps. Implement a ring buffer for each camera stream so that frame drops do not block the recording thread.
Each episode should be stored as a separate file with a structured naming convention: {task_name}_{episode_id}_{timestamp}_{operator_id}.hdf5 (or .bag). Include metadata at the start of each file: task name, environment description, object list, operator ID, camera calibration parameters, and robot URDF version. This metadata is essential for downstream processing and dataset management.
Test the pipeline by recording 50 episodes and verifying: (1) all camera feeds have the expected number of frames (30 frames/sec * episode_duration_sec, within 1%), (2) robot state timestamps align with camera timestamps within 5 ms, (3) no files are corrupted or truncated, and (4) the pipeline can sustain the target recording rate without dropping frames.
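The first two checks can be scripted directly from the recorded timestamp arrays — a sketch, assuming timestamps in nanoseconds:

```python
import numpy as np

def check_episode_sync(cam_t_ns, robot_t_ns, fps=30.0,
                       count_tol=0.01, sync_tol_ns=5_000_000):
    """Verify (1) camera frame count matches fps * duration within 1%
    and (2) every camera frame has a robot state within 5 ms."""
    cam_t_ns = np.asarray(cam_t_ns)
    robot_t_ns = np.sort(np.asarray(robot_t_ns))
    duration_s = (cam_t_ns[-1] - cam_t_ns[0]) / 1e9
    expected = duration_s * fps + 1   # fence-post: N stamps span N-1 gaps
    count_ok = abs(len(cam_t_ns) - expected) <= count_tol * expected
    # distance from each camera stamp to the nearest robot stamp
    idx = np.searchsorted(robot_t_ns, cam_t_ns)
    lo = np.clip(idx - 1, 0, len(robot_t_ns) - 1)
    hi = np.clip(idx, 0, len(robot_t_ns) - 1)
    nearest = np.minimum(np.abs(robot_t_ns[lo] - cam_t_ns),
                         np.abs(robot_t_ns[hi] - cam_t_ns))
    return {'count_ok': bool(count_ok),
            'sync_ok': bool(nearest.max() <= sync_tol_ns),
            'max_sync_error_ms': float(nearest.max() / 1e6)}
```

Running this over all 50 test episodes turns the verification into a pass/fail report rather than a manual inspection.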
Tip: Always include a 'dry run' mode that records and discards data for 30 seconds to warm up sensor drivers and verify all streams are active before the operator begins the actual demonstration
Configure the Workspace and Task Setup
The physical workspace layout affects both operator ergonomics and data diversity. Design the workspace for consistent, repeatable demonstrations while allowing systematic variation.
Table height should match the robot's optimal working range — for a ViperX 300 mounted on a table, the manipulation surface should be 5-15 cm below the robot base to maximize the reachable workspace. For a Franka Emika, the optimal table height is 70-75 cm. Mark the workspace boundaries with tape so operators know the reachable area and avoid commanding the robot to its joint limits (which causes jerky motion and potential damage).
Create a scene reset protocol for each task. After every demonstration episode, the workspace must be returned to a starting configuration. Define this explicitly: 'Place the red mug at position X, the blue bowl at position Y, and the sponge at position Z, all within the marked circles.' Provide visual guides (laminated photographs of the reset configuration) at the workstation. Fast, consistent resets maximize throughput — a 30-second reset enables 40 episodes/hour, while a 2-minute reset cuts throughput to 15 episodes/hour.
For diversity, define a randomization protocol: rotate objects between 3-5 starting positions across episodes, vary object orientation (upright, on side, rotated 90 degrees), and cycle through 5-10 different objects in the same category. Track diversity coverage through a checklist that the operator marks after each episode. At minimum, each (object, position, orientation) combination should appear in at least 5 episodes before adding new combinations.
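The operator's paper checklist can also be tracked programmatically — a sketch with illustrative object and position labels:

```python
from collections import Counter
from itertools import product

class CoverageTracker:
    """Track (object, position, orientation) coverage across episodes
    and report which combinations still need demonstrations."""

    def __init__(self, objects, positions, orientations, min_episodes=5):
        self.target = set(product(objects, positions, orientations))
        self.min_episodes = min_episodes
        self.counts = Counter()

    def log_episode(self, obj, position, orientation):
        self.counts[(obj, position, orientation)] += 1

    def remaining(self):
        """Combinations seen fewer than min_episodes times, neediest first."""
        done = {c for c, n in self.counts.items() if n >= self.min_episodes}
        return sorted(self.target - done, key=lambda c: self.counts[c])
```

Calling remaining() between episodes tells the scene setter exactly which configuration to stage next, so coverage gaps are closed deliberately rather than by chance.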
Tip: Place a clock or timer visible to the operator showing elapsed time per episode — this subtle cue encourages efficient demonstrations without explicit speed pressure, reducing average episode duration by 15-20%
Train Operators and Validate Data Quality
Operator training is the most overlooked step in teleoperation rig setup. An untrained operator using a perfect rig produces worse data than a trained operator using a basic rig. Design a structured training program:
Phase 1 — Interface familiarization (1-2 hours): The operator practices moving the robot through free space without objects, learning the kinematic mapping and identifying comfortable operating postures. For leader-follower rigs, this means learning to suppress arm tremor and execute smooth motions. For VR interfaces, this means calibrating to the depth perception limitations.
Phase 2 — Task practice (2-4 hours): The operator executes the target task 50-100 times without recording, focusing on efficiency and consistency. A supervisor provides feedback on common mistakes: excessive pauses (which teach the policy to hesitate), unnecessary motions (which add noise to the trajectory), and inconsistent grasp strategies (which create multimodal action distributions that confuse behavioral cloning).
Phase 3 — Recorded pilot (50 episodes): Record the first 50 episodes and review them for quality. Compute average episode duration, success rate, and path efficiency (the ratio of the minimum possible path length to the actual end-effector path length, so 1.0 is a perfectly direct motion). Operators should achieve: success rate > 85%, path efficiency > 0.6, and episode duration within 2x the expert baseline. Operators who do not meet these thresholds receive additional coaching before production collection begins.
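Path efficiency can be computed directly from the recorded end-effector positions, with the convention that 1.0 means a perfectly direct motion:

```python
import numpy as np

def path_efficiency(ee_positions):
    """Straight-line start-to-goal distance divided by the actual
    end-effector path length: 1.0 is a perfectly direct motion;
    wandering, pauses, and re-grasps drive the ratio down."""
    p = np.asarray(ee_positions, dtype=float)
    path_len = float(np.linalg.norm(np.diff(p, axis=0), axis=1).sum())
    direct = float(np.linalg.norm(p[-1] - p[0]))
    return direct / path_len if path_len > 0 else 0.0
```

A threshold of 0.6 then means the operator's hand travelled no more than about 1.7x the straight-line distance.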
During production collection, implement automated quality gates: flag episodes where the robot hits joint limits (jerky motion), where the episode duration exceeds 3x the median (operator struggled), or where the gripper opens/closes more than 3x the expected number (re-grasping indicates poor execution). Flagged episodes are reviewed and either accepted with quality annotations or discarded and re-collected.
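These gates reduce to a few comparisons once per-episode statistics are extracted — a sketch, assuming a gripper command signal normalized to 0 (closed) through 1 (open):

```python
import numpy as np

def count_grasp_closures(gripper_cmd, closed_below=0.5):
    """Count open->close transitions in a gripper command signal
    (assumed normalized so values below the threshold mean closed)."""
    closed = np.asarray(gripper_cmd, dtype=float) < closed_below
    return int((np.diff(closed.astype(int)) == 1).sum())

def quality_flags(duration_s, median_duration_s,
                  grasp_closures, expected_closures, hit_joint_limits):
    """Return the list of quality-gate flags an episode trips."""
    flags = []
    if hit_joint_limits:
        flags.append('joint_limit')      # jerky motion near limits
    if duration_s > 3 * median_duration_s:
        flags.append('overlong')         # operator likely struggled
    if grasp_closures > 3 * expected_closures:
        flags.append('regrasping')       # poor grasp execution
    return flags
```

An empty flag list means the episode passes straight into the dataset; anything else routes it to human review.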
Tip: Rotate operators every 2-3 hours to prevent fatigue-induced quality degradation — track quality metrics per operator per session to identify when fatigue sets in for each individual
Validate End-to-End and Optimize Throughput
Before beginning production data collection, run a full end-to-end validation of the rig: record 100 episodes across all target tasks, convert them to the target training format (RLDS, HDF5), load them into a training pipeline, and train a basic policy. This smoke test verifies that the entire chain — from teleoperation input through recording, post-processing, and model training — works correctly. The trained policy does not need to perform well (100 episodes is far too few), but training should complete without errors, and the policy should show at least directional learning (moving toward objects rather than away).
Optimize throughput by identifying and eliminating bottlenecks: (1) Scene reset time — pre-stage objects in bins near the workspace so the operator can grab them quickly. For high-throughput collection, assign a second person as a 'scene setter' who resets the workspace while the operator reviews the last episode. (2) Recording overhead — the recording pipeline should start and stop with a single button press or foot pedal, not a keyboard command that requires the operator to look away from the workspace. (3) Episode metadata — pre-populate episode metadata (task name, object list, environment) and auto-increment the episode counter. The operator should only need to confirm success/failure at the end of each episode.
With optimized throughput, a single-operator single-arm rig produces 200-300 episodes per day for simple tasks and 80-150 episodes per day for complex tasks. A two-operator bimanual rig (one pilot, one scene setter) produces 150-250 episodes per day. These rates are the basis for budgeting collection campaigns: a 5,000-episode dataset at 200 episodes/day requires 25 operator-days.
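The budgeting arithmetic can be captured in a small helper — a sketch, where productive hours per day is an assumption you should adjust to your own operator schedule:

```python
import math

def episodes_per_hour(episode_s, reset_s):
    """Sustained collection rate, including scene resets between episodes."""
    return 3600.0 / (episode_s + reset_s)

def operator_days(total_episodes, eph, productive_hours_per_day=5):
    """Operator-days needed to collect a campaign at the given rate."""
    return math.ceil(total_episodes / (eph * productive_hours_per_day))
```

For example, 60-second episodes with 30-second resets at 5 productive hours per day yield 200 episodes/day, so a 5,000-episode campaign takes 25 operator-days, matching the figures above.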
Tip: Invest in a foot pedal ($15-30) for start/stop recording control — it keeps the operator's hands on the teleoperation interface throughout the session, eliminating the 3-5 second overhead per episode of reaching for a keyboard
References
- [1] Zhao et al. “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” RSS 2023.
- [2] Wu et al. “GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators.” arXiv:2309.13037, 2024.
- [3] Chi et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS 2023.
How Claru Can Help
Claru operates turnkey teleoperation rigs across multiple collection sites, using calibrated leader-follower systems (ALOHA-style ViperX 300 pairs), multi-camera arrays with hardware timestamping, and trained operators who undergo our structured 3-phase training program. Our rigs include automated quality gating, diversity coverage tracking, and optimized scene reset protocols that achieve 200-300 episodes per day for standard manipulation tasks. We handle the full pipeline from rig setup through operator training, production collection, and data delivery in RLDS, HDF5, or custom formats.
Why Teleoperation Rig Design Determines Dataset Quality
The teleoperation interface is the bottleneck between human manipulation expertise and robot training data. A well-designed rig allows operators to produce fluid, efficient demonstrations at 20-40 episodes per hour. A poorly designed rig produces jerky, slow demonstrations with frequent pauses and re-grasps, inflating the dataset size needed for equivalent policy performance. The ALOHA system (Zhao et al., RSS 2023) demonstrated that a low-cost leader-follower rig costing under $20,000 could produce demonstrations of sufficient quality to train fine-grained bimanual policies — proving that expensive haptic interfaces are not required, but thoughtful design is essential.
The three dominant teleoperation paradigms for data collection are: (1) leader-follower arms, where a matched pair of robot arms are mechanically or electrically coupled so the follower mirrors the leader's motion, (2) VR controller interfaces, where the operator wears a headset and uses hand controllers to command the robot end-effector in Cartesian space, and (3) SpaceMouse/gamepad interfaces, where the operator uses a 6-DoF input device to command incremental end-effector motion. Each paradigm has distinct tradeoffs in data quality, operator fatigue, setup cost, and applicable task complexity.
Teleoperation Interface Comparison: Cost, Quality, and Throughput
Leader-follower arms (ALOHA, GELLO) produce the highest-quality demonstrations with the most intuitive kinematic mapping. The operator moves the leader arm and the follower mirrors it, preserving the natural feel of manipulation. ALOHA uses matched ViperX 300 6-DoF arms (approximately $8,000 per leader-follower pair) and achieves 20-40 episodes per hour for tabletop tasks. GELLO takes a different approach: 3D-printed leader arms with joint encoders that map to any commodity robot (Franka Emika, UR5e, xArm) via joint-position control, costing approximately $300 per leader arm. Leader-follower systems excel at contact-rich tasks (insertion, tool use, bimanual coordination) where force feedback through the mechanical coupling helps operators sense and respond to contact events.
VR controllers (Meta Quest 3, HTC Vive) decouple the operator from the robot workspace, enabling remote data collection over a network link. The operator sees the robot's workspace through cameras displayed in the headset and controls end-effector position using hand controllers. Libraries like dex-retargeting and AnyTeleop map controller 6-DoF poses to robot commands. Cost is $400-800 for the headset. The main limitation is depth perception — stereo camera feeds displayed in a headset provide weaker depth cues than direct viewing, causing 10-20% more collisions and missed grasps compared to leader-follower systems. VR is best for tasks that do not require precise force control. SpaceMouse ($200-400) is the budget option: a 6-DoF input device that maps to incremental end-effector velocities. It requires 2-4 hours of operator training and produces lower-quality demonstrations, but the zero-hardware-modification setup makes it ideal for quick prototyping.
Recording Pipeline Architecture for Teleoperation Data
The recording pipeline must capture, synchronize, timestamp, and store all data streams during each demonstration episode. For ROS2-based systems, use rosbag2 recording in mcap format (more efficient than sqlite3 for large files) with hardware timestamps enabled on all sensors. Create a custom launch file that starts all camera drivers, the robot state publisher, the teleoperation node, and the recording node simultaneously. Enable hardware timestamping on Intel RealSense cameras with rs2_option.RS2_OPTION_GLOBAL_TIME_ENABLED to avoid the 5-20ms jitter of software timestamps.
For non-ROS setups, build a Python recording loop that polls each sensor at its native rate and writes to HDF5 using h5py with separate datasets per stream. Use time.perf_counter_ns() as a shared high-resolution timer for consistent timestamps. Structure each episode as a separate file named {task}_{episode_id}_{timestamp}_{operator_id}.hdf5. Include metadata at the start of each file: task name, environment description, object list, operator ID, camera calibration parameters, and robot URDF version. Test the pipeline by recording 50 episodes and verifying: all camera feeds have the expected frame count (within 1%), robot state timestamps align with camera timestamps within 5ms, no files are corrupted, and the pipeline sustains the target recording rate without dropping frames.
Frequently Asked Questions
How much does a teleoperation rig cost?
The ALOHA-style leader-follower setup using two pairs of ViperX 300 6-DoF arms (one leader, one follower per side) costs approximately $16,000-20,000 including grippers and mounting hardware. For single-arm tasks, a single leader-follower pair is $8,000-10,000. The SpaceMouse + existing robot arm approach costs only $200-400 for the input device but produces lower-quality demonstrations due to the unintuitive 6-DoF mapping. For the best cost-quality tradeoff, the leader-follower approach is recommended — the hardware cost is amortized over thousands of demonstrations.
What collection throughput should I expect?
Throughput depends on the teleoperation interface and task complexity. Leader-follower rigs: 20-40 episodes/hour for simple tasks (pick-and-place), 10-20/hour for complex tasks (assembly, tool use). VR teleoperation: 15-30 episodes/hour for simple tasks, 8-15/hour for complex tasks. SpaceMouse: 10-20 episodes/hour for simple tasks, 5-10/hour for complex tasks. These rates include scene resets between episodes. Operator fatigue limits effective session length to 2-3 hours with 15-minute breaks every 45 minutes. Budget for 4-6 productive hours per day per operator.
Do I need a bimanual rig?
Only if your target tasks require bimanual coordination — folding clothes, opening containers while reaching inside, using tools with both hands, or stabilizing objects with one hand while manipulating with the other. Single-arm rigs are simpler, cheaper, and produce higher-quality demonstrations because the operator focuses on one arm. If your task can be decomposed into sequential single-arm steps (pick with one arm, then place with the other), a single-arm rig with arm switching may suffice. For tasks requiring simultaneous bimanual contact (holding a bowl while stirring), a bimanual rig is essential.
Need a Turnkey Teleoperation Setup?
Claru provides fully configured teleoperation rigs and trained operators for data collection campaigns. Tell us your robot platform and task requirements.