How to Collect Warehouse Robot Data for Training
A practitioner's guide to building warehouse-scale robot training datasets — from sensor rig design and task taxonomy through fleet-level telemetry pipelines and final formatting for policy learning.
Prerequisites
- Access to warehouse environment or staging area
- Robot platform with teleoperation capability
- ROS2 Humble or later installed
- Sensor rig (RGB-D cameras, optional LiDAR)
- Python 3.10+ with NumPy, h5py, and tensorflow-datasets
Define Your Task Taxonomy and Data Specification
Warehouse environments contain dozens of distinct manipulation and navigation tasks. Start by building a task taxonomy document that enumerates every task variant you intend to support: single-item pick-and-place from shelf to tote, multi-item bin picking, mixed-case palletizing, depalletizing, conveyor singulation, package scanning and sorting, and AMR navigation with dynamic obstacle avoidance. For each task, specify the observation space (which cameras, what resolution, whether depth is required), the action space (Cartesian end-effector deltas vs. joint position targets, gripper width or binary open/close), the control frequency (typically 10-20 Hz for manipulation, 5-10 Hz for navigation), and the expected episode length.
Create a formal data spec document using a standardized template that covers: target model architecture and its exact input tensor shapes, required coordinate frames and transforms, object categories with unique IDs for each SKU, environment configuration parameters (shelf heights, aisle widths, conveyor speeds), and minimum diversity requirements — for example, 50 unique SKU models, 5 lighting conditions, 3 shelf configurations. Share this document with all stakeholders (ML team, robotics engineers, collection operators) and obtain sign-off from each of them before spending money on hardware or scheduling collection sessions. Changing the data spec mid-collection is the single most expensive mistake teams make.
Tip: Map your task taxonomy to Open X-Embodiment task categories if you plan to leverage cross-embodiment pretraining later
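As an illustration, the per-task portion of such a spec can live in code so it is machine-checkable before collection starts. A minimal sketch (all field names here are hypothetical, not a standard schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskDataSpec:
    """One task's data specification. Field names are illustrative."""
    task_name: str
    cameras: dict          # camera_id -> (height, width, uses_depth)
    action_space: str      # e.g. "ee_delta_7dof" or "joint_position"
    control_hz: float      # control/recording frequency
    max_episode_s: float   # episode timeout
    min_unique_skus: int = 50
    min_lighting_conditions: int = 5

    def expected_frames(self) -> int:
        """Upper bound on frames per episode at the control frequency."""
        return int(self.control_hz * self.max_episode_s)

spec = TaskDataSpec(
    task_name="mixed_case_palletizing",
    cameras={"rgb_wrist": (480, 640, True), "rgb_overhead": (720, 1280, True)},
    action_space="ee_delta_7dof",
    control_hz=10.0,
    max_episode_s=180.0,
)
```

Checking `spec.expected_frames()` against recorded episode lengths gives an early sanity test during the pilot.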
Design and Calibrate Your Sensor Rig
Warehouse data collection demands a sensor rig that captures both the robot's workspace in high fidelity and the broader environment context. A proven configuration for mobile manipulators uses three camera streams: a wrist-mounted Intel RealSense D455 (640x480 at 30 fps, depth + RGB) pointed at the gripper workspace, a head- or shoulder-mounted D435i (640x480 at 30 fps) providing the agent's egocentric view, and an overhead or third-person ZED 2i (1280x720 at 15 fps) for supervision and debugging. For proprioception, record the full joint state vector (positions, velocities, efforts) from the robot's /joint_states topic at 100 Hz, plus end-effector pose from the forward kinematics at the same rate.
Calibrate each camera's intrinsics using a ChArUco board with OpenCV's cv2.aruco.calibrateCameraCharuco — aim for reprojection error below 0.3 pixels. Compute extrinsics (camera-to-base transforms) using hand-eye calibration: the eye-in-hand method for the wrist camera (cv2.calibrateHandEye with the Tsai-Lenz solver), and PnP-based registration for fixed cameras using known AprilTag positions in the workspace. Store all calibration data as YAML files alongside the dataset. For depth quality in warehouse environments, note that RealSense infrared stereo struggles with reflective shrink-wrap and dark conveyor belts — enable the D455's built-in IMU and use the high-accuracy preset. Test depth quality on your actual SKU inventory before committing to a collection schedule.
Tip: Run a 30-minute burn-in test to check for thermal drift in depth readings
Tip: Label each camera stream with a unique camera_id in metadata for multi-view reconstruction later
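Calibration results are worth storing with a hard quality gate attached. A minimal sketch of bundling one camera's results and enforcing the 0.3 px reprojection threshold (the record layout and function names are illustrative, not a standard format):

```python
import math

REPROJECTION_LIMIT_PX = 0.3  # threshold from the calibration procedure above

def rms_reprojection_error(residuals_px):
    """RMS of per-corner reprojection residuals, in pixels."""
    return math.sqrt(sum(r * r for r in residuals_px) / len(residuals_px))

def calibration_record(camera_id, K, dist, T_cam_base, residuals_px):
    """Bundle one camera's calibration for the per-rig YAML file.
    Raises if the intrinsics calibration is not accurate enough."""
    err = rms_reprojection_error(residuals_px)
    if err > REPROJECTION_LIMIT_PX:
        raise ValueError(f"{camera_id}: reprojection RMS {err:.3f} px exceeds limit")
    return {
        "camera_id": camera_id,
        "intrinsics": K,            # 3x3 row-major list
        "distortion": dist,         # k1..k3, p1, p2
        "T_cam_base": T_cam_base,   # 4x4 row-major homogeneous transform
        "reprojection_rms_px": err,
    }
```

Failing loudly here, rather than writing a bad calibration file, prevents an entire collection campaign from inheriting a miscalibrated camera.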
Build the Recording Pipeline with Synchronization
Sensor synchronization is non-negotiable for policy learning — a 50 ms misalignment between an RGB frame and the corresponding joint state renders the observation-action pair nearly useless. Implement hardware synchronization by connecting all RealSense cameras to a shared hardware trigger line using the D455's GPIO sync port. For joint state recording, use a real-time ROS2 node running in a dedicated executor thread with SCHED_FIFO priority. Timestamp all sensor streams using a single NTP-synchronized clock source; avoid mixing ROS time with camera-internal timestamps.
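The matching that a time synchronizer performs can be illustrated with a pure-Python sketch: pair each camera frame with the nearest joint-state sample, and reject pairs outside the slop tolerance. The function name and list-based interface are illustrative; in production the ROS message_filters synchronizer does this for you:

```python
from bisect import bisect_left

SLOP_S = 0.010  # 10 ms tolerance

def match_nearest(frame_ts, state_ts, slop=SLOP_S):
    """For each camera timestamp, find the index of the nearest joint-state
    timestamp within `slop` seconds, or None if nothing is close enough.
    Assumes both lists are sorted ascending (seconds)."""
    pairs = []
    for t in frame_ts:
        i = bisect_left(state_ts, t)
        best = None
        # the nearest candidate is one of the two neighbours of the
        # insertion point
        for j in (i - 1, i):
            if 0 <= j < len(state_ts):
                if best is None or abs(state_ts[j] - t) < abs(state_ts[best] - t):
                    best = j
        if best is not None and abs(state_ts[best] - t) <= slop:
            pairs.append((t, best))
        else:
            pairs.append((t, None))
    return pairs
```

Logging how many frames come back as None is a direct measure of synchronization health during a shift.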
Build a recording manager node in Python that subscribes to all sensor topics via message_filters.ApproximateTimeSynchronizer with a slop tolerance of 10 ms. On each synchronized callback, pack the observation into a dictionary: {"rgb_wrist": (H,W,3) uint8, "depth_wrist": (H,W) float32 in meters, "rgb_head": (H,W,3) uint8, "joint_positions": (N,) float64, "ee_pose": (7,) float64 as [x,y,z,qx,qy,qz,qw], "gripper_width": float64}. Write episodes to HDF5 using h5py with gzip compression (level 4 is a good speed/size tradeoff). Each episode is one HDF5 file containing datasets for obs/{modality}, action, and metadata. Implement a post-episode validation hook that checks: no NaN values in joint states, no all-zero depth frames, frame count matches expected duration, and file size is within expected range. Reject and re-record any episode that fails validation before the operator moves to the next trial.
Tip: Write raw sensor data first, then run a batch post-processing step for image resizing and normalization — this avoids blocking the recording loop
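The post-episode validation hook described above can be sketched as follows, assuming the observation dict keys used in this guide (thresholds and the function name are illustrative defaults):

```python
import numpy as np

def validate_episode(obs, actions, expected_frames, tol=0.1):
    """Post-episode QA checks. Returns a list of failure reasons;
    an empty list means the episode passes and can be kept."""
    problems = []
    joints = np.asarray(obs["joint_positions"])   # (T, N)
    depth = np.asarray(obs["depth_wrist"])        # (T, H, W)
    if np.isnan(joints).any():
        problems.append("NaN in joint states")
    # any frame whose depth pixels are all zero indicates a sensor dropout
    if (depth.reshape(depth.shape[0], -1) == 0).all(axis=1).any():
        problems.append("all-zero depth frame")
    if abs(len(actions) - expected_frames) > tol * expected_frames:
        problems.append("frame count off by more than 10%")
    return problems
```

Returning reasons rather than a bare boolean makes the operator-facing rejection message actionable.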
Design Collection Protocols for Each Task Family
Write a Standard Operating Procedure (SOP) document for each task in your taxonomy. The SOP should be detailed enough that a new operator can follow it with zero verbal instruction. For a palletizing task, the SOP specifies: (1) Initial state setup — place N boxes of specified dimensions on the source conveyor or staging area in a randomized arrangement, position the pallet at the target location, ensure the robot is at its home configuration. (2) Task execution — the teleoperator picks each box in a specified order (or randomized order, depending on the protocol variant) and places it on the pallet following a specified packing pattern. (3) Episode termination criteria — all boxes placed successfully (success) or timeout after 180 seconds (failure). (4) State randomization between episodes — shuffle box positions, vary the number of boxes (3-8 per episode), rotate pallet orientation by 0 or 90 degrees.
Critically, define the teleoperator interface. For warehouse manipulation, a SpaceMouse (3Dconnexion SpaceMouse Pro) controlling end-effector velocity in Cartesian space is the most common choice — it offers intuitive 6-DoF control and operators reach proficiency after 2-3 hours of practice. Configure deadzone thresholds (typically 0.05 for translation, 0.02 for rotation) to filter hand tremor. Set velocity scaling to 0.15 m/s max translation and 0.8 rad/s max rotation for safe operation around shelving. Run a pilot session of 30 episodes with 2-3 operators to measure: average episode duration, success rate, and inter-operator consistency (compute DTW distance between trajectories for the same task). Use pilot results to refine the SOP, adjust speed limits, and estimate total collection time.
Tip: Rotate operators every 90 minutes to prevent fatigue-induced quality degradation
Tip: Record operator ID in episode metadata for later analysis of operator-specific biases
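The deadzone and velocity-scaling mapping can be sketched per axis; the rescaling keeps the commanded velocity continuous at the deadzone boundary rather than jumping. Constants match the values suggested above; the function name is illustrative:

```python
DEADZONE_TRANS = 0.05   # normalized SpaceMouse axis units
DEADZONE_ROT = 0.02
MAX_TRANS = 0.15        # m/s
MAX_ROT = 0.8           # rad/s

def axis_to_velocity(value, deadzone, max_vel):
    """Map one raw SpaceMouse axis in [-1, 1] to a commanded velocity:
    zero inside the deadzone, then linearly rescaled so the output
    ramps continuously from 0 at the deadzone edge to max_vel at 1."""
    if abs(value) < deadzone:
        return 0.0
    sign = 1.0 if value > 0 else -1.0
    return sign * (abs(value) - deadzone) / (1.0 - deadzone) * max_vel
```

The same mapping applies to rotation axes with `DEADZONE_ROT` and `MAX_ROT`.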
Execute Collection with Real-Time Quality Monitoring
Run data collection in structured shifts with dedicated roles: one teleoperator, one environment resetter (who randomizes objects between episodes), and one quality monitor watching a live dashboard. Build the dashboard using a simple Streamlit app that displays: episodes completed vs. target (progress bar), rolling success rate over the last 50 episodes, distribution of episode durations (histogram), depth frame dropout rate, and joint state recording gaps. Flag any episode where depth dropout exceeds 2% of frames or where joint state gaps exceed 50 ms for immediate re-recording.
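The two re-record thresholds can be encoded directly in the monitoring code. A minimal sketch (function name and input conventions are illustrative):

```python
MAX_DEPTH_DROPOUT = 0.02   # flag if >2% of frames have no valid depth
MAX_STATE_GAP_S = 0.050    # flag if any joint-state gap exceeds 50 ms

def flag_for_rerecord(depth_valid_mask, state_timestamps):
    """Return True if the episode should be re-recorded.
    depth_valid_mask: per-frame booleans (True = usable depth frame).
    state_timestamps: sorted joint-state timestamps in seconds."""
    dropout = 1.0 - sum(depth_valid_mask) / len(depth_valid_mask)
    if dropout > MAX_DEPTH_DROPOUT:
        return True
    gaps = (b - a for a, b in zip(state_timestamps, state_timestamps[1:]))
    return any(g > MAX_STATE_GAP_S for g in gaps)
```

The dashboard can call this immediately after the validation hook so the operator learns about a bad episode before the environment is reset.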
For warehouse-scale collection targeting 5,000+ episodes, organize work into collection campaigns of 500 episodes each. At the end of each campaign, run an automated quality audit: compute action velocity statistics and flag outliers (episodes where max joint velocity exceeds safety limits suggest teleoperation glitches), verify observation space coverage (e.g., at least 80% of target SKUs appear in the campaign), and check for recording artifacts (duplicate timestamps, zero-length episodes, corrupted HDF5 files). Maintain a collection log spreadsheet tracking: campaign ID, date, operator, robot serial number, environment configuration, episodes recorded, episodes passing QA, and any hardware issues. This log becomes essential for debugging dataset problems months later when training reveals unexpected failures.
Tip: Keep a physical logbook at each collection station for operators to note hardware issues, unusual events, or protocol deviations that automated monitoring might miss
Post-Process, Validate, and Format for Training
After collection completes, run a comprehensive post-processing pipeline. First, apply consistent image preprocessing: resize all camera streams to the target model's input resolution (typically 256x256 or 224x224), convert depth from raw millimeters to normalized float32 meters, and use the stored camera intrinsics and distortion coefficients to undistort each frame. Second, compute derived features if your model requires them: end-effector velocity by differentiating the pose trajectory with a Savitzky-Golay filter (window=11, order=3 from scipy.signal), wrist force/torque if available, and binary contact labels from force thresholds.
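The velocity computation is one call with scipy's Savitzky-Golay filter, which fits local polynomials and evaluates their derivative; this is smoother than naive finite differencing. The wrapper name is illustrative, `scipy.signal.savgol_filter` is the real API:

```python
import numpy as np
from scipy.signal import savgol_filter

def ee_velocity(positions, dt, window=11, order=3):
    """Smoothed end-effector velocity from a (T, 3) position trajectory.
    dt is the control period in seconds; deriv=1 with delta=dt returns
    the derivative in meters per second along the time axis."""
    return savgol_filter(positions, window_length=window, polyorder=order,
                         deriv=1, delta=dt, axis=0)
```

The same call with a quaternion-derived axis-angle trajectory yields angular velocity, though quaternions need unwrapping first.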
For deduplication, compute a trajectory embedding for each episode by encoding the full joint state sequence through a lightweight 1D-CNN or by computing DTW distance between all episode pairs. Flag episodes with DTW distance below a threshold (calibrate on your pilot data — typically episodes from the same operator performing the same task variant will cluster). Remove exact duplicates and near-duplicates, keeping the highest-quality instance (measured by smoothness of the action trajectory and depth frame completeness). Generate stratified train/validation/test splits (80/10/10) ensuring that splits are stratified by task variant, operator ID, and environment configuration — never leak episodes from the same session across splits. Finally, convert to your target format. For RLDS, implement a custom TFDS DatasetBuilder: define the features_dict matching your observation and action spec, implement _split_generators to map your train/val/test HDF5 files, and run tfds build --data_dir=/path/to/output. Validate the RLDS output by loading 100 random episodes through the standard tfds.load pipeline and visually inspecting observation-action alignment in a Jupyter notebook.
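For modest batch sizes, the pairwise DTW distance used for duplicate flagging can be computed directly. A minimal sketch; at O(T*S) per pair it is fine for a few hundred frames per episode, but reserve it for within-cluster comparison at fleet scale:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two joint-state
    trajectories of shape (T, N), using Euclidean per-frame cost."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    T, S = len(a), len(b)
    D = np.full((T + 1, S + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T, S]
```

Episodes whose pairwise distance falls below the pilot-calibrated threshold go to the near-duplicate review queue.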
Tip: Always keep a copy of the raw unprocessed data — post-processing bugs are easier to fix when you can re-run from raw
Tip: Publish a dataset card following the Google Model Cards template documenting collection conditions, known biases, and intended use
References
- [1] Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” CoRL 2023.
- [2] Open X-Embodiment Collaboration. “Open X-Embodiment: Robotic Learning Datasets and RT-X Models.” ICRA 2024.
- [3] Chi et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS 2023.
- [4] Octo Model Team. “Octo: An Open-Source Generalist Robot Policy.” RSS 2024.
How Claru Can Help
Claru operates data collection infrastructure purpose-built for warehouse and logistics environments. Our teams have deployed sensor rigs in fulfillment centers, distribution hubs, and manufacturing floors across 100+ cities, collecting teleoperation demonstrations for pick-and-place, palletizing, bin picking, and AMR navigation tasks. We handle the full pipeline — sensor calibration, operator training, real-time QA, post-processing, and delivery in RLDS, HDF5, or Zarr format — so your ML team can focus on model development instead of data engineering.
Why Warehouse Environments Demand Purpose-Built Training Data
Warehouse and logistics operations are the largest commercial market for robot learning today. Companies deploying pick-and-place, palletizing, and autonomous navigation robots need training data that reflects the specific challenges of warehouse environments: varying SKU inventories, reflective shrink-wrapped packaging, dynamic obstacles (forklifts, human workers), and industrial lighting conditions that differ dramatically from laboratory settings.
Academic datasets collected in university labs suffer from severe domain shift when applied to warehouse tasks. Lab kitchens have consistent lighting and a few dozen objects; warehouses have fluorescent/LED flicker, thousands of SKUs, and working conditions that change with every shift. The result: policies trained on lab data achieve 95% success in the lab and 40% success on the warehouse floor.
This guide covers the specialized requirements of warehouse data collection — from ruggedized sensor rigs and safety protocols through fleet-level telemetry pipelines and formatting for policy training at scale.
Warehouse Data Collection Benchmarks
Warehouse Task Families and Data Requirements
| Task Family | Action Space | Control Frequency | Key Sensors | Typical Scale |
|---|---|---|---|---|
| Single-item pick-and-place | 7-DoF EE delta + gripper | 10-20 Hz | Wrist RGB-D + head RGB-D | 50-200 demos |
| Multi-SKU bin picking | 7-DoF EE delta + gripper | 10-20 Hz | Overhead RGB-D + wrist RGB-D | 500-2,000 demos |
| Mixed-case palletizing | 7-DoF EE delta + gripper width | 10 Hz | Overhead + side RGB-D + F/T | 1,000-5,000 demos |
| AMR navigation | 2D velocity (v, omega) | 5-10 Hz | LiDAR + stereo + odometry | 100+ hours of driving |
Warehouse-Specific Data Challenges
Reflective Packaging
Shrink-wrapped pallets and metallic packaging create depth sensor failures. Use D455 high-accuracy mode and validate depth completion rate on actual SKUs before collection.
Safety Compliance (ISO 15066)
Data collection alongside human workers requires force-limited mode, geofenced safety zones, and a dedicated safety observer. Budget 2-4 weeks for insurance and facility agreements.
SKU Diversity
Warehouses handle thousands of SKUs with varying sizes, weights, and surface properties. Ensure your dataset covers at least 50 representative SKUs across the size/weight distribution.
Lighting Variability
Industrial fluorescent and LED lighting creates flicker artifacts at certain camera exposure settings. Lock exposure to avoid auto-exposure fluctuations and test for banding.
Fleet-Level Data Aggregation for Multi-Robot Deployments
Warehouse environments increasingly deploy multiple robots simultaneously, creating an opportunity to aggregate training data across the fleet. Each robot collecting demonstrations during its normal operation generates a continuous stream of new training data — but fleet data has unique quality and consistency challenges that single-robot collection does not.
The primary challenge is hardware variation across the fleet. Even identical robot models develop individual characteristics over time: gripper friction changes with wear, camera intrinsics drift with temperature cycling, and joint encoders accumulate calibration offsets. Include per-robot calibration metadata with every episode so that downstream users can account for these differences during training. Run a monthly fleet calibration protocol: each robot executes a standardized motion sequence, and the recorded trajectories are compared to a reference to detect and correct drift.
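The monthly drift check reduces to comparing each robot's recorded standardized motion against the fleet reference. A minimal sketch; the 0.01 rad warning threshold is an illustrative default, not a standard:

```python
import numpy as np

def calibration_drift(recorded, reference, warn_rad=0.01):
    """Per-joint drift for the fleet calibration protocol: mean absolute
    deviation between a robot's recorded standardized motion and the fleet
    reference, both (T, N) joint-position arrays sampled at the same times.
    Returns (drift_per_joint, indices of joints exceeding warn_rad)."""
    drift = np.abs(np.asarray(recorded) - np.asarray(reference)).mean(axis=0)
    flagged = [int(j) for j in np.where(drift > warn_rad)[0]]
    return drift, flagged
```

Flagged joints trigger recalibration before the robot's subsequent episodes are admitted to the data lake.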
Fleet data deduplication is more important than single-robot deduplication because multiple robots performing the same task in the same environment naturally produce near-identical episodes. Use the trajectory embedding + LSH pipeline described in the deduplication guide, but add robot_id as a stratification dimension — near-duplicates from different robots may actually be valuable (they capture robot-specific dynamics), while near-duplicates from the same robot are redundant.
For fleet data pipelines, implement a centralized data lake that receives episodes from all robots via a nightly sync or real-time streaming. Each robot writes to local storage during operation and uploads completed episodes during idle periods. The central pipeline runs validation, deduplication, enrichment, and format conversion as a batch job. Maintain separate per-robot quality dashboards to detect individual robots whose data quality has degraded.
Frequently Asked Questions
What sensor rig do I need for warehouse robot data collection?
The optimal rig depends on the task family. For mobile manipulation (pick-and-place, palletizing), mount a wrist-mounted Intel RealSense D455 for close-range depth, a head-mounted RealSense D435i for workspace context, and an overhead ZED 2i for third-person supervision. Record proprioception at 100 Hz minimum via the robot's joint state publisher. For autonomous mobile robots (AMRs) doing navigation, a Velodyne VLP-16 or Ouster OS1-64 LiDAR combined with a global-shutter stereo pair and wheel odometry is standard. Synchronize all sensors through hardware trigger lines or PTP (Precision Time Protocol) — software timestamps from ROS message headers introduce 5-15 ms jitter that degrades policy training. Budget roughly $3,000-8,000 per data collection station depending on LiDAR inclusion.
How many demonstrations do I need per task?
Modern architectures vary widely in data efficiency. For single-SKU pick-and-place with Diffusion Policy or ACT, 50-200 high-quality teleoperation demonstrations can yield >85% success rates. Multi-SKU bin picking across 20-50 object categories typically needs 500-2,000 demonstrations to generalize reliably. Palletizing with mixed box sizes requires 1,000-5,000 demonstrations because the action space includes both grasp pose and placement planning. If training a foundation model like Octo or OpenVLA for multi-task warehouse operation, plan for 10,000+ demonstrations spanning your full task vocabulary. Start with a 200-episode pilot for your highest-priority task, validate that the policy trains successfully, then scale collection based on generalization gaps observed during evaluation.
How do I collect data safely in an operational warehouse?
Collect data during off-peak shifts or in a dedicated staging area that mirrors real warehouse conditions — same shelving systems, conveyor belts, lighting, and floor surfaces. If collecting alongside human workers, implement geofenced safety zones using the robot's built-in safety controller with force-limited mode enabled (ISO 15066 compliant power and speed limiting). All teleoperation sessions should have a dedicated safety observer with an e-stop in hand. For AMR navigation data, run at reduced speed (0.5 m/s max) with expanded sensor-based stopping zones. Record near-miss events and safety stops as metadata — these become valuable negative examples for training collision avoidance. Insurance and facility agreements should be finalized before any collection begins, as this often takes 2-4 weeks.
Should the dataset include failure episodes, or only successes?
Include both, but tag them explicitly. A dataset of only successes creates policies that have no model of what failure looks like and cannot recover gracefully. Collect intentional failure demonstrations at roughly a 10-15% ratio: dropped objects, misaligned grasps, path planning failures, and object collisions. Label each episode with a success boolean and a failure taxonomy code (e.g., grasp_slip, collision_static, collision_dynamic, placement_miss, timeout). For reinforcement learning and reward model training, paired success/failure examples from the same initial state are especially valuable. In practice, natural failures during teleoperation collection provide 5-10% of episodes anyway — supplement with deliberate failure recording sessions to reach your target ratio and ensure failure mode coverage.
What data format should I use for delivery?
For maximum compatibility with current research infrastructure, store episodes in RLDS (Reinforcement Learning Datasets) format backed by TFRecord files. RLDS is the native format for Open X-Embodiment, Octo, and RT-2, and Google's tensorflow_datasets library provides efficient sharding, shuffling, and streaming. If your primary training framework is PyTorch-based (Diffusion Policy, ACT), use HDF5 via the robomimic convention: one HDF5 file per episode with groups for obs, action, and metadata. For very large fleet datasets (100K+ episodes), consider Zarr with DirectoryStore for S3-compatible cloud storage and parallel reads. Regardless of format, include per-episode metadata as JSON sidecar files: robot model, gripper type, task ID, environment configuration hash, operator ID, timestamp, and success label. This metadata enables stratified sampling during training.
Need Warehouse Robot Data?
Claru operates data collection infrastructure in warehouse environments across 100+ cities. Talk to a specialist about your palletizing, pick-and-place, or AMR navigation data needs.