Packing Task Training Data

Packing datasets for warehouse and e-commerce automation — item arrangement inside boxes, cartons, and totes with spatial reasoning annotations, deformable packaging handling, and multi-item sequence planning for robust order fulfillment policies.

Data Requirements

Modality

RGB-D + heightmaps + proprioception + item dimensions + packing material state

Volume Range

5K-50K complete order packing demonstrations

Temporal Resolution

30 Hz video, per-placement heightmap snapshots, per-item placement annotations

Key Annotations
Per-item 6-DoF placement pose relative to container
Placement order within the packing sequence
Container heightmap state after each placement
Volume utilization percentage progression
Item identity, dimensions, and weight
Void fill type and placement annotations
Compatible Models
Packing Configuration Transformer (PCT)
Diffusion Policy
SpatialVLM
PackIt
TAX-Pose
Online 3D-BPP RL policies
Environment Types
E-commerce fulfillment center
Warehouse packing station
Grocery bagging station
Pharmaceutical packaging line
Electronics packaging cell
Returns processing station

How Claru Supports This Task

Claru operates packing data collection stations that replicate real fulfillment center workflows with high-fidelity multi-sensor instrumentation. Each station features a Zivid structured-light sensor for sub-millimeter container heightmaps, a synchronized multi-camera array (2 overhead + 2 side RGB-D at 30 Hz), and packing material dispensers for realistic void fill demonstrations. We collect demonstrations on real product inventory supplied by clients, covering rigid boxes, deformable clothing, fragile electronics, and irregular consumer goods. Operators follow standardized packing protocols with quality scoring on volume utilization (75%+ target), stability, and damage prevention. Deliverables include per-placement heightmaps, 6-DoF item poses, placement sequences, item dimension annotations, and volume utilization scores — formatted for direct ingestion by 3D bin packing RL environments, Diffusion Policy, or custom sequence planning architectures. Daily throughput of 200-500 complete order demonstrations enables rapid dataset scaling for production packing systems.

What Is Robotic Packing and Why Does Data Matter?

Robotic packing — arranging items into containers for shipping or storage — is the mirror image of bin picking and arguably the harder problem. While picking requires selecting one item from clutter, packing requires placing multiple items into a constrained volume in a stable, damage-free configuration that minimizes void space. Amazon ships over 5 billion packages annually, and the packing station remains one of the most labor-intensive nodes in fulfillment operations. The 3D bin packing problem is NP-hard even for rigid cuboids; real-world packing with irregular shapes, fragile items, and deformable packaging materials like bubble wrap and air pillows is orders of magnitude more complex.

The data challenge in packing is fundamentally about spatial reasoning under constraints. A packing policy must simultaneously reason about item placement order (heavy items first, fragile items on top), geometric fit (maximizing volume utilization while maintaining stability), packing material insertion (void fill, separators, cushioning), and box flap closure or tape sealing. Human packers solve these constraints through years of spatial intuition — they can look at a set of items and mentally simulate stable arrangements in seconds. Transferring this spatial reasoning to robots requires demonstrations that capture not just the final placement poses but the entire decision process: which item to place next, where to position it, how to adjust neighboring items, and when to add protective packaging.

Current robotic packing systems rely heavily on heuristic algorithms that work for uniform products but fail on the mixed-item orders that dominate e-commerce. The Online 3D Bin Packing Problem benchmark (Zhao et al., 2022) showed that reinforcement learning policies trained on 500K+ episodes can achieve 72.1% volume utilization on random cuboid sets, ahead of the best heuristic baseline at 68.5% — but real warehouse items are not cuboids. Deformable items like clothing, bags, and pouches cannot be modeled as rigid bodies, and their placement behavior depends on how they drape and compress against neighboring items. Learning these physics from demonstration data is currently the only viable path to general packing policies.

The economic incentive is enormous. McKinsey estimates that warehouse labor costs account for 65% of total fulfillment operating expenses, and packing stations require 2-3 workers per shift in a typical facility. Dimensional weight pricing by carriers means that poor packing directly increases shipping costs — a 10% improvement in volume utilization can save $0.15-$0.50 per package at scale. For a fulfillment center shipping 100,000 packages daily, that translates to $5.5-$18.3 million in annual savings from packing optimization alone, before accounting for labor cost reduction.

Packing Data by the Numbers

72.1%: RL policy volume utilization on 3D bin packing
5B+: Amazon packages shipped annually
65%: Fulfillment cost from warehouse labor (McKinsey)
500K+: Training episodes for competitive RL packing
10-15%: Volume utilization gain from learned policies
2-3 sec: Target cycle time per item placement

Data Requirements by Packing Approach

Different learning methods for packing have distinct data needs. Hybrid approaches combining spatial planning with learned placement are currently most effective.

| Approach | Data Volume | Key Modalities | Spatial Reasoning | Strengths |
|---|---|---|---|---|
| Online 3D Bin Packing RL | 500K-2M simulated episodes | Heightmap + item dimensions + placement mask | Learned via reward shaping | Handles online item arrival; optimizes utilization |
| Behavioral cloning from demonstrations | 5K-50K real packing demonstrations | RGB-D + proprioception + item identity | Implicit in demonstration quality | Handles deformable items; captures human heuristics |
| Diffusion Policy for placement | 1K-10K demonstrations per item category | Multi-view RGB + depth + proprioception | Multimodal action distribution | Handles placement ambiguity; multiple valid solutions |
| Sim-to-real with domain randomization | 1M+ sim episodes + 1K-5K real calibration | Simulated heightmaps + real RGB-D for transfer | Physics-based in simulation | Scalable to new box sizes; fast iteration |
| Foundation model fine-tuning (SpatialVLM) | 10K-50K annotated packing sequences | RGB images + language + spatial relations | Pretrained spatial reasoning + fine-tuning | Generalizes across item categories; language-conditioned |

State of the Art in Learned Packing

The Online 3D Bin Packing problem has been the primary benchmark for learned packing policies. Zhao et al. (2022) formulated the problem as a Markov Decision Process where items arrive one at a time and the agent must place each item before seeing the next. Their PCT (Packing Configuration Transformer) architecture achieved 72.1% volume utilization on sequences of 50 random cuboids, outperforming all heuristic baselines. The key insight was representing the container state as a heightmap and using attention over candidate placement positions, allowing the policy to reason about global packing quality rather than greedy local placement.
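The heightmap state representation described above can be illustrated with a minimal sketch: the container is a 2D grid of surface heights, and each candidate (x, y) footprint position is scored by how low and how flat its support surface is. The scoring weights here are illustrative placeholders, not the PCT objective.

```python
import numpy as np

def place_score(heightmap, item_wh, x, y, box_height):
    """Score placing an item footprint at (x, y) on a container heightmap.

    The resting height is the max height under the footprint; lower and
    flatter support surfaces score higher. Returns None if the stack would
    exceed the box height. Assumes unit item height for simplicity.
    """
    w, h = item_wh
    patch = heightmap[x:x + w, y:y + h]
    rest = patch.max()
    if rest + 1 > box_height:
        return None
    voids = (rest - patch).sum()          # uneven support leaves trapped voids
    return -float(rest) - 0.1 * float(voids)

def best_placement(heightmap, item_wh, box_height):
    """Exhaustively score every valid footprint position, keep the best."""
    w, h = item_wh
    H, W = heightmap.shape
    scored = [(place_score(heightmap, item_wh, x, y, box_height), x, y)
              for x in range(H - w + 1) for y in range(W - h + 1)]
    scored = [s for s in scored if s[0] is not None]
    return max(scored, default=None)

# Empty 6x6 container: every 2x2 footprint rests on the floor with flat
# support, so all candidates score 0.0 and any corner is a valid optimum.
hm = np.zeros((6, 6), dtype=int)
result = best_placement(hm, (2, 2), box_height=4)
```

Attention over candidate positions in PCT replaces this exhaustive loop with a learned, globally informed ranking, but the state and action space are the same shape.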

For real-world packing with irregular items, simulation alone is insufficient. Ha et al. (2024) demonstrated a sim-to-real pipeline for packing grocery items into bags, training in IsaacGym with 3D-scanned item meshes and transferring to a Franka Panda arm. The system achieved 82% packing success on 20 common grocery items, but performance dropped to 61% on deformable items (bread bags, chip bags) where simulation contact models diverge significantly from reality. This 21-percentage-point gap on deformable items underscores the need for real-world demonstration data that captures the actual physics of soft item packing.

PackIt (Goyal et al., 2020) explored learning geometric packing directly from 3D shape understanding. Given a set of items and a container, PackIt predicts a packing configuration by iteratively selecting items and placement poses using a learned shape-conditioned policy. On a benchmark of 12 household object categories, PackIt achieved 64% valid packing configurations versus 41% for a random placement baseline. However, PackIt operates on known 3D models and does not handle the perception challenge of estimating item geometry from sensor data in a cluttered staging area.

The most promising recent direction combines large vision-language models with physical reasoning. SpatialVLM (Chen et al., 2024) demonstrated that fine-tuning a VLM on spatial relationship annotations enables 3D spatial reasoning from 2D images — predicting relative positions, containment relationships, and stability assessments. While not yet applied specifically to packing, the spatial reasoning capabilities (83% accuracy on relative position questions, 76% on stability prediction) suggest that VLM-based packing planners could leverage pretrained spatial understanding with task-specific fine-tuning on packing demonstrations.

Collection Methodology for Packing Data

Packing data collection requires a staged workspace that mirrors real fulfillment station ergonomics. The setup includes an item staging area (conveyor or tote) where products arrive, a box selection station with multiple container sizes, packing material dispensers (air pillows, kraft paper, bubble wrap), and an overhead + side camera array to capture the evolving container state from multiple viewpoints. Depth sensing is critical for tracking the evolving heightmap inside the container — structured-light sensors (Zivid, Photoneo) mounted 60-80 cm above the box opening provide 0.2 mm accuracy across the packing volume.

Each packing demonstration captures the full order fulfillment sequence: box selection based on item set, item retrieval from the staging area, placement into the container with chosen orientation and position, void fill insertion, and box closure. Annotations must include item identity and measured dimensions, placement pose (6-DoF relative to the container), placement order within the sequence, contact state with neighboring items, void fill type and placement, and container fill level (percentage of volume utilized). For deformable items, additional annotations capture the deformation state — compressed height versus free-standing height, draping contact area, and whether the item was folded or compressed during placement.
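The per-placement annotation fields listed above can be collected into a single record type. This sketch uses illustrative field names, not a fixed delivery schema.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class PlacementAnnotation:
    """One per-item placement record; field names are illustrative."""
    item_id: str
    dimensions_mm: tuple          # measured L x W x H
    weight_g: float
    pose_container: tuple         # 6-DoF: x, y, z, roll, pitch, yaw
    sequence_index: int           # placement order within the demonstration
    contacts: list = field(default_factory=list)  # neighboring item_ids
    void_fill: str = ""           # e.g. "air_pillow", "kraft_paper"
    fill_level_pct: float = 0.0   # container volume used after this placement
    deformed: bool = False        # compressed or folded during placement

rec = PlacementAnnotation(
    item_id="SKU-1042",
    dimensions_mm=(210.0, 148.0, 40.0),
    weight_g=350.0,
    pose_container=(0.05, 0.03, 0.02, 0.0, 0.0, 1.57),
    sequence_index=0,
)
record_dict = asdict(rec)  # serializable for JSON/Parquet export
```

For deformable items, the extra deformation annotations (compressed versus free-standing height, draping contact area) would extend this record rather than replace it.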

Operator training for packing data collection focuses on consistent high-quality packing that balances volume utilization with item protection. We define packing quality metrics: volume utilization (target 75%+ for mixed items), stability (no item shifts when the box is tilted 15 degrees), damage prevention (fragile items cushioned, heavy items on bottom), and presentation (items visible and identifiable for quality inspection). Operators follow a standardized protocol: assess items, select box, place heaviest items first on the bottom layer, fill gaps with medium items, add cushioning around fragile items, place lightweight items on top, and add void fill. Each operator completes a 50-order qualification round with quality scoring before production collection begins.
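The volume utilization gate in the quality protocol reduces to a simple ratio. The sketch below uses a bounding-box approximation, which ignores nesting and compression and therefore slightly overestimates utilization for deformable items.

```python
def volume_utilization(item_dims_mm, box_dims_mm):
    """Fraction of container volume occupied by item bounding boxes.

    item_dims_mm: iterable of (L, W, H) tuples in millimeters.
    box_dims_mm: (L, W, H) of the selected container.
    """
    box_vol = box_dims_mm[0] * box_dims_mm[1] * box_dims_mm[2]
    item_vol = sum(l * w * h for l, w, h in item_dims_mm)
    return item_vol / box_vol

# Three items in a 300 x 200 x 150 mm carton:
items = [(200, 150, 100), (200, 150, 100), (100, 150, 100)]
u = volume_utilization(items, (300, 200, 150))
passes_gate = u >= 0.75  # the 75%+ target for mixed-item orders
```

Stability and damage-prevention scoring require physical checks (the 15-degree tilt test, cushioning inspection) and cannot be computed from dimensions alone.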

For maximum diversity, packing sessions rotate through multiple order profiles: single-item orders (baseline placement data), multi-item homogeneous orders (stacking and layering patterns), multi-item heterogeneous orders (mixed-category spatial reasoning), orders with fragile items (protective packing strategies), and orders with deformable items (clothing, pouches). Each order profile contributes distinct aspects of packing intelligence. Single-item data teaches optimal item orientation and box selection; multi-item data teaches spatial sequencing and stability reasoning; fragile-item data teaches protective placement heuristics.

Key Datasets and Benchmarks for Robotic Packing

Public packing datasets are scarce compared to picking datasets. Most research uses synthetic benchmarks, creating an opportunity for real-world demonstration data.

| Dataset / Benchmark | Year | Scale | Item Types | Key Features | Limitations |
|---|---|---|---|---|---|
| Online 3D-BPP (Zhao et al.) | 2022 | Unlimited procedural generation | Random cuboids only | Online arrival; heightmap state representation | No irregular shapes; no deformable items |
| PackIt (Goyal et al.) | 2020 | 12 object categories, 1K test instances | ShapeNet household objects | 3D shape reasoning; geometric packing | Known 3D models required; no real sensor data |
| RoboCasa Packing subtasks | 2024 | 500+ trajectories | Household items in kitchen containers | Full manipulation trajectories; diverse items | Loose tolerance; limited to kitchen context |
| TAX-Pose placement | 2022 | Single-object placement demonstrations | Rigid objects with defined receptacles | SE(3) relative placement; few-shot learning | Single item at a time; no multi-item reasoning |
| Amazon Packing Challenge | 2015-2017 | Competition format; limited public data | Real e-commerce products | Real products; combined picking + packing | Not publicly available as training data |

How Claru Supports Packing Data Needs

Claru operates packing data collection stations designed to replicate real fulfillment center workflows with instrumentation for high-fidelity demonstration capture. Each station features a multi-camera array (2 overhead + 2 side-mounted RGB-D sensors synchronized at 30 Hz) covering the full packing workspace from box selection through closure. A Zivid structured-light sensor mounted directly above the container captures sub-millimeter heightmaps of the evolving packing state after each item placement, providing the ground-truth spatial data needed for heightmap-based policies.

We collect packing demonstrations on real product inventory supplied by clients, covering the full spectrum of item categories: rigid boxes, cylindrical containers, flexible pouches, clothing, fragile electronics, and irregularly shaped consumer goods. Our operators are trained on standardized packing protocols with quality scoring on volume utilization, stability, and damage prevention. Each demonstration captures the complete sequence from box selection through item placement, void fill insertion, and closure — annotated with per-item placement poses, contact states, packing material usage, and fill level progression.

Claru delivers packing datasets formatted for direct ingestion by 3D bin packing RL environments, Diffusion Policy architectures, or custom sequence planning models. Standard deliverables include per-placement heightmaps, 6-DoF item poses relative to the container, placement order sequences, item dimension annotations, void fill placement data, and final volume utilization scores. For clients building end-to-end packing systems, we provide full manipulation trajectories with proprioceptive data covering the grasp-transport-place cycle for each item. Our daily throughput of 200-500 complete order packing demonstrations enables rapid dataset scaling for production packing policy training.

References

  1. Zhao et al. "Learning Efficient Online 3D Bin Packing on Packing Configuration Trees." ICLR 2022.
  2. Goyal et al. "PackIt: A Virtual Environment for Geometric Planning." ICML 2020.
  3. Ha et al. "Sim-to-Real Transfer for Robotic Packing with Domain Randomization." ICRA 2024.
  4. Chen et al. "SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities." CVPR 2024.
  5. Duan et al. "Multi-Robot Collaborative Dense Bin Packing." IROS 2019.

Frequently Asked Questions

How many packing demonstrations do I need?

For single-item placement into a known box size, 500-2,000 demonstrations per item category typically suffice for behavioral cloning with Diffusion Policy. For multi-item sequence packing with mixed categories, 5,000-50,000 complete order demonstrations are recommended to cover the combinatorial space of item arrangements. Start with your top 10 SKU categories and 200 demonstrations each to validate the pipeline before scaling. RL approaches can supplement with 500K+ simulated episodes using rigid-body approximations, but real demonstrations remain essential for deformable items.

Can simulation replace real-world packing data?

Simulation works well for rigid cuboid packing benchmarks but breaks down for real-world packing with deformable items, packaging materials, and irregular shapes. The sim-to-real gap for deformable packing is 15-25 percentage points — simulation-trained policies achieve 60-65% success on real deformable items versus 80-85% with real demonstration data. The most cost-effective approach is simulation pretraining for spatial reasoning on rigid items, followed by real-world fine-tuning with 1,000-5,000 demonstrations covering deformable items and packing material handling.

What sensor setup is required for packing data collection?

The minimum setup is an overhead RGB-D sensor for container state tracking and a side-mounted RGB camera for item identification. For high-quality data, use a structured-light depth sensor (Zivid or Photoneo) mounted 60-80 cm above the box opening for sub-millimeter heightmaps, plus 2-3 additional RGB cameras for multi-view item tracking. Force/torque sensing at the wrist helps capture compression behavior during deformable item placement. All sensors should be synchronized at 30 Hz minimum.

How do you handle the combinatorial explosion of placement sequences?

For an order with N items, there are N! possible placement sequences. Rather than collecting all permutations, we focus on collecting the optimal sequence determined by packing heuristics (heavy-first-bottom-up) plus 3-5 variation sequences that explore alternative valid orderings. The policy learns a placement priority function rather than memorizing specific sequences. For production diversity, we randomize the item arrival order during collection — the operator decides the placement order, providing natural supervision for sequence planning.
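A placement priority function of the kind described can be sketched as a sort key that approximates the heavy-first-bottom-up heuristic: robust, heavy, large items first and fragile items last. The item fields (`weight_g`, `volume_cm3`, `fragile`) are illustrative, not a fixed schema.

```python
def placement_priority(item):
    """Sort key approximating heavy-first-bottom-up: non-fragile before
    fragile, then heavier first, then larger first. Learned policies
    replace this hand-written key with a trained scoring function."""
    return (item["fragile"], -item["weight_g"], -item["volume_cm3"])

order = [
    {"sku": "mug",   "weight_g": 300, "volume_cm3": 500,  "fragile": True},
    {"sku": "book",  "weight_g": 900, "volume_cm3": 1200, "fragile": False},
    {"sku": "socks", "weight_g": 100, "volume_cm3": 400,  "fragile": False},
]
sequence = sorted(order, key=placement_priority)
# -> book (heavy, robust) first, then socks, with the fragile mug on top
```

Because the key is a total order over items, a policy that learns it generalizes to unseen item sets instead of memorizing the N! sequences of any particular order.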

What volume utilization should training demonstrations target?

Human packers in production fulfillment centers achieve 70-80% volume utilization on mixed-item orders. Training data should target 75%+ utilization, which is sufficient to train policies that match or exceed heuristic baselines (68-72%). Include 10-15% of demonstrations with intentionally suboptimal packing (60-65% utilization) as negative examples — this teaches the policy to distinguish good from poor spatial arrangements. Exclude demonstrations below 50% utilization, which indicate operator error or exceptionally difficult item combinations that add noise without useful signal.
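The curation thresholds above reduce to a simple labeling rule per demonstration. The label names here are illustrative; the proportion cap on negatives (10-15% of the final set) would be enforced downstream when the dataset is assembled.

```python
def demo_label(utilization_pct):
    """Map a demonstration's volume utilization to a dataset label.

    Thresholds follow the curation rules: drop below 50%, keep 50-65%
    as negative examples, keep the rest as positives. Returns None for
    demonstrations to exclude.
    """
    if utilization_pct < 50.0:
        return None          # operator error or pathological order: drop
    if utilization_pct < 65.0:
        return "negative"    # suboptimal packing kept as a negative example
    return "positive"

demos = [45.0, 58.0, 62.0, 71.0, 78.0, 82.0]
labels = [demo_label(u) for u in demos]
```

The 50-65% band intentionally overlaps the "60-65%" range cited for deliberate suboptimal demonstrations, so staged negatives and naturally mediocre packs share one label.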

Get a Custom Quote for Packing Task Data

Tell us about your fulfillment operations — product categories, box sizes, and throughput targets — and we will design a packing data collection plan matched to your specific requirements.