Last updated: March 2026

Data Enrichment for Physical AI: Depth, Pose, and Segmentation at Scale

Raw video is not training data. The gap between a video clip and a training sample is filled by enrichment: the automated and human-supervised process of adding depth maps, pose estimation, segmentation masks, optical flow, action labels, and structured metadata to every frame. This is Claru's single strongest differentiator, and no other provider has published a technical description of how it works.

Why Raw Video Is Not Enough for Physical AI

A raw video frame is a 2D array of RGB pixel values. It contains no explicit information about depth, object boundaries, human pose, motion dynamics, or the semantic meaning of what is happening. A Vision-Language-Action model trained on raw pixels alone must learn all of these representations implicitly — requiring orders of magnitude more data and compute than a model given explicit supervisory signals.

Consider what a robot needs to understand to pick up a cup from a cluttered table:

  • Where is the cup? Requires depth estimation to localize it in 3D space.
  • Which pixels are the cup? Requires segmentation to isolate it from the background and surrounding objects.
  • How is the human holding it? Requires pose estimation to understand grasp configuration.
  • Is anything moving nearby? Requires optical flow to detect dynamic obstacles.
  • What action phase are we in? Requires action labels to decompose the task into approach, grasp, lift, transport.

Each of these is a distinct enrichment layer. Together, they transform raw pixels into the structured representation that physical AI systems need to learn effective manipulation, navigation, and interaction policies. Every major VLA paper from 2025-2026 uses some subset of these layers, either as direct model inputs or as auxiliary training objectives.

The 6-Layer Enrichment Pipeline

Claru runs every video clip through a sequential enrichment pipeline. Each layer builds on the outputs of previous layers, with cross-validation checks ensuring geometric and temporal consistency.

01

Depth Estimation

Depth Anything V2 · NeurIPS 2024

Every frame receives a per-pixel depth map computed by Depth Anything V2, the state-of-the-art monocular depth estimation foundation model. The model offers configurations from 25M to 1.3B parameters and was trained on 595K synthetic images plus 62M+ real pseudo-labeled images. In robotics evaluations, 89.1% of depth estimates for near-field objects (within 2m) fall within 0.5m of ground truth, and inference runs roughly 10x faster than diffusion-based alternatives.

Why It Matters for Physical AI

Depth maps provide the 3D spatial understanding that physical AI models need to plan reach, grasp, and place actions. Without depth, a model operating on 2D pixels alone cannot distinguish between a cup 30cm away and one 3m away. For VLA models and world models, per-frame depth serves as an explicit supervisory signal for learning spatial relationships — dramatically reducing the amount of raw video needed to learn 3D geometry implicitly.

Output: 16-bit PNG or float32 NumPy arrays, one per frame
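As an illustration, depth frames in this format can be converted to metric units in a few lines. This is a minimal sketch that assumes one 16-bit unit encodes one millimeter; the actual scale calibration is recorded per clip in the manifest, and the frame below is synthetic rather than a real delivery file.

```python
import numpy as np

# Assumed convention: 1 uint16 unit = 1 mm. The real scale factor is
# recorded in manifest.json and may differ per capture device.
DEPTH_SCALE_M = 0.001

def depth_to_meters(depth_u16):
    """Convert a uint16 depth map to float32 meters; 0 marks invalid pixels."""
    depth_m = depth_u16.astype(np.float32) * DEPTH_SCALE_M
    depth_m[depth_u16 == 0] = np.nan  # 0 = no valid depth estimate
    return depth_m

# Synthetic 2x2 frame standing in for depth/frame_000.png
frame = np.array([[300, 0], [1500, 2500]], dtype=np.uint16)
meters = depth_to_meters(frame)
```

In a real pipeline the uint16 array would come from decoding the PNG (e.g. with imageio or OpenCV) rather than being constructed inline.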

02

Human Pose Estimation

ViTPose / ViTPose++ · NeurIPS 2022 / TPAMI 2024

ViTPose extracts 2D and 3D joint positions for every person in every frame, using plain Vision Transformer backbones scaled from 100M to 1B parameters. The flagship model achieves 81.1 AP on COCO test-dev. ViTPose++ extends this to generic body pose estimation — including whole-body and hand keypoints — via task-specific Mixture-of-Expert heads trained on multiple datasets simultaneously.

Why It Matters for Physical AI

Pose estimation is essential for human-to-robot transfer learning, the paradigm that NVIDIA's EgoScale and EgoMimic showed dramatically reduces the cost of VLA training. By extracting human body and hand joint positions from egocentric video, pose data enables models to understand manipulation intent, grasp configurations, and bimanual coordination patterns that can be retargeted to robot embodiments. It is also critical for safety systems that need to track humans in the robot's workspace.

Output: COCO-format keypoints (x, y, confidence) per joint per frame, plus optional 3D joint positions in camera coordinates
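A sketch of consuming these keypoints, filtering out low-confidence joints before any downstream retargeting. The JSON layout here is illustrative, not the delivery schema: standard COCO stores keypoints as a flat list of (x, y, confidence) triplets, whereas this example uses named joints for readability.

```python
import json

# Illustrative excerpt of pose/keypoints.json; joint names and nesting
# are assumptions for this sketch.
raw = json.dumps({
    "frame_000": [{
        "person_id": 0,
        # (x, y, confidence) per named joint
        "keypoints": {"left_wrist": [412.0, 220.5, 0.93],
                      "right_wrist": [508.2, 231.0, 0.88],
                      "left_elbow": [390.1, 180.2, 0.31]}
    }]
})

CONF_THRESHOLD = 0.5  # drop low-confidence joints

def confident_joints(people, threshold=CONF_THRESHOLD):
    """Return {person_id: {joint: (x, y)}}, keeping only confident detections."""
    out = {}
    for person in people:
        out[person["person_id"]] = {
            name: (x, y)
            for name, (x, y, c) in person["keypoints"].items()
            if c >= threshold}
    return out

joints = confident_joints(json.loads(raw)["frame_000"])
```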

03

Semantic Segmentation

SAM 3 (Segment Anything with Concepts) · ICLR 2026

SAM 3, published at ICLR 2026, extends Meta's Segment Anything series to concept-based prompting: the model detects, segments, and tracks objects in images and video using short noun phrases (e.g., 'yellow cup'), image exemplars, or both. Where SAM 1 and SAM 2 required point, box, or mask prompts, SAM 3 unifies image, video, and text in a single architecture with improved boundary quality and temporal stability.

Why It Matters for Physical AI

Segmentation masks tell physical AI models which pixels belong to which objects — essential for affordance reasoning (which surfaces can be grasped?), obstacle avoidance (what should the robot not touch?), and scene decomposition (what are the individual objects in this cluttered workspace?). For VLA training, per-object masks combined with depth maps enable the model to construct an implicit 3D scene graph, which is the foundation for generalizable manipulation policies.

Output: COCO RLE masks per instance per frame, with object class labels and tracking IDs across frames
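COCO RLE masks are typically decoded with pycocotools, but the uncompressed form is simple enough to sketch directly: runs alternate background/foreground and pixels are stored column-major. A minimal decoder, for illustration only:

```python
import numpy as np

def decode_rle(counts, height, width):
    """Decode uncompressed COCO RLE: alternating 0/1 run lengths,
    pixels laid out in column-major (Fortran) order."""
    flat = np.zeros(height * width, dtype=np.uint8)
    idx, value = 0, 0
    for run in counts:
        flat[idx:idx + run] = value
        idx += run
        value ^= 1  # runs alternate background / foreground
    return flat.reshape((height, width), order="F")

# Toy 2x3 mask: 1 background pixel, 2 foreground, 3 background
mask = decode_rle([1, 2, 3], height=2, width=3)
```

Production code should prefer `pycocotools.mask.decode`, which also handles the compressed string encoding used in delivery files.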

04

Optical Flow

RAFT (Recurrent All-Pairs Field Transforms) · ECCV 2020, widely adopted through 2026

RAFT computes dense motion fields between consecutive frames, capturing how every pixel moves from one timestep to the next. The architecture uses recurrent iterative refinement over a 4D correlation volume built from all pairs of pixels, producing sub-pixel-accurate flow fields. Recent work (FlowSAM, UnSAMFlow) combines RAFT outputs with SAM segmentation for motion-aware object understanding, computing flow across multiple frame gaps for robustness to noisy inputs.

Why It Matters for Physical AI

Optical flow provides explicit motion information that helps physical AI models predict object dynamics and plan interaction trajectories. For manipulation tasks, flow reveals which objects are being moved, how fast they are moving, and whether the robot's own motion is causing apparent motion in the scene (ego-motion separation). For world models, optical flow serves as a dense temporal consistency signal — the model must predict flow fields that are physically plausible.

Output: .flo files or float32 NumPy arrays with (u, v) displacement per pixel per frame pair
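Working with the flow arrays is straightforward once loaded; a small sketch computing per-pixel motion magnitude from a synthetic (H, W, 2) field standing in for one of the .npy files:

```python
import numpy as np

# Synthetic stand-in for flow/flow_000.npy: (H, W, 2) with (u, v) per pixel
flow = np.zeros((2, 2, 2), dtype=np.float32)
flow[0, 0] = (3.0, 4.0)  # one pixel moving 3 px right, 4 px down

def flow_magnitude(flow_uv):
    """Per-pixel motion magnitude, in pixels per frame pair."""
    return np.linalg.norm(flow_uv, axis=-1)

speed = flow_magnitude(flow)
```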

05

Temporal Action Labels

InternVideo2 + Expert Human Annotation · CVPR 2024 (InternVideo2)

Action labeling combines automated video understanding with expert human annotation. InternVideo2 provides initial temporal action proposals — identifying candidate action segments and generating natural language descriptions. Expert human annotators then refine boundaries to sub-second precision, correct misclassified actions, add fine-grained phase labels (approach, pre-grasp, grasp, lift, transport, place, release), and write diverse natural language instruction paraphrases for each action segment.

Why It Matters for Physical AI

VLA models need to decompose complex tasks into executable sub-actions. Without temporal action labels, the model sees manipulation as a continuous stream of motion with no structure. Action labels provide the temporal scaffolding: they teach the model where one action ends and the next begins, what language instruction corresponds to each phase, and how primitive actions compose into complex multi-step tasks. This is the layer that most open datasets lack entirely, and it is the layer where expert human judgment is most irreplaceable.

Output: JSON annotations with start/end timestamps, action class, phase label, and 3+ natural language descriptions per segment
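A sketch of querying these segments, using an illustrative two-segment file; the field names are assumptions for this example, not the delivery schema:

```python
import json

# Hypothetical excerpt of actions/segments.json
raw = json.dumps([
    {"start": 0.0, "end": 1.8, "action": "reach", "phase": "approach",
     "descriptions": ["reach toward the mug",
                      "move the hand to the cup",
                      "extend the arm toward the mug"]},
    {"start": 1.8, "end": 2.4, "action": "grasp", "phase": "grasp",
     "descriptions": ["close the fingers around the mug handle",
                      "grip the cup",
                      "take hold of the mug"]},
])

def phase_at(segments, t):
    """Return the phase label active at time t (seconds), or None."""
    for seg in segments:
        if seg["start"] <= t < seg["end"]:
            return seg["phase"]
    return None

segments = json.loads(raw)
```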

06

Structured Metadata

Multi-source: GPS, device sensors, automated classification

Every clip receives structured metadata that enables filtering, sampling, and bias analysis across the dataset. This includes: capture device model and settings (resolution, frame rate, FOV), environment classification (indoor/outdoor, kitchen/warehouse/office), geographic region (anonymized to city level), lighting conditions (natural/artificial, brightness level), contributor demographics (age range, handedness), and technical quality scores (motion blur, exposure, compression artifacts).

Why It Matters for Physical AI

Metadata enables the systematic dataset curation that determines model quality. Teams need to filter for specific environments, balance across geographic regions, or sample clips meeting minimum quality thresholds. Without structured metadata, dataset curation devolves into manual review of individual clips — a process that does not scale to 500K+ clips. Metadata also supports bias detection and fairness audits required for responsible deployment of physical AI systems.

Output: Parquet tables with standardized schema, queryable via SQL or Pandas/Polars
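A curation query of the kind described above can be sketched end to end. In practice one would load the Parquet table with `pd.read_parquet` and filter in Pandas/Polars; this self-contained example mirrors the SQL route using stdlib sqlite3, with illustrative column names that are not the actual schema:

```python
import sqlite3

# In-memory stand-in for the metadata table
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE clips (
    clip_id TEXT, environment TEXT, lighting TEXT, quality_score REAL)""")
conn.executemany("INSERT INTO clips VALUES (?, ?, ?, ?)", [
    ("clip_00042137", "kitchen", "natural", 0.91),
    ("clip_00042138", "warehouse", "artificial", 0.64),
    ("clip_00042139", "kitchen", "artificial", 0.88),
])

# Curation query: kitchen clips above a minimum quality threshold
rows = conn.execute(
    "SELECT clip_id FROM clips"
    " WHERE environment = ? AND quality_score >= ? ORDER BY clip_id",
    ("kitchen", 0.8)).fetchall()
selected = [r[0] for r in rows]
```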

Cross-Layer Validation

Individual enrichment layers are not useful if they contradict each other. Claru runs cross-validation checks to ensure geometric and temporal consistency.

Depth vs. Segmentation

Object boundaries in segmentation masks are checked against depth discontinuities. If a segmentation mask puts two objects at the same depth when the depth map shows a 50cm gap, the segmentation is flagged for review. Conversely, depth edges that cross segmentation boundaries indicate potential depth estimation artifacts.
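One way such a check can be sketched: flag any single-instance mask whose pixels span an implausibly large depth range, which often signals that the mask has merged two distinct objects. Percentiles make the check robust to a few noisy depth pixels. Thresholds and function names here are illustrative, not Claru's production logic:

```python
import numpy as np

GAP_THRESHOLD_M = 0.5  # the 50 cm flag threshold mentioned above

def needs_review(depth_m, mask, gap=GAP_THRESHOLD_M):
    """Flag an instance mask whose pixels span a large depth range.
    Percentiles (rather than min/max) tolerate stray noisy depth values."""
    vals = depth_m[mask]
    return float(np.percentile(vals, 95) - np.percentile(vals, 5)) > gap

depth = np.array([[1.0, 1.0], [1.8, 1.8]])            # meters
merged_mask = np.ones((2, 2), dtype=bool)             # spans both surfaces
tight_mask = np.array([[True, True], [False, False]]) # one surface only
```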

Pose vs. Temporal Smoothness

Pose estimates are validated against physics-based motion models. A human wrist cannot teleport 30cm between consecutive frames at 30fps. Pose trajectories that violate kinematic constraints are flagged, interpolated, or sent for manual review. Jitter detection catches high-frequency noise that would corrupt action retargeting.
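A minimal sketch of the kinematic check: flag frame transitions where a 3D joint trajectory exceeds a plausible speed. The 6 m/s cap is an assumed illustrative threshold, not a published constant:

```python
import numpy as np

FPS = 30
MAX_JOINT_SPEED_M_S = 6.0  # assumed plausibility cap (~20 cm per frame)

def kinematic_violations(traj_m, fps=FPS, max_speed=MAX_JOINT_SPEED_M_S):
    """Indices of frame transitions where a joint moves implausibly fast.
    traj_m: (T, 3) joint positions in meters, camera coordinates."""
    step = np.linalg.norm(np.diff(traj_m, axis=0), axis=1)  # meters/frame
    return np.flatnonzero(step * fps > max_speed)

wrist = np.array([[0.0, 0.00, 1.0],
                  [0.0, 0.01, 1.0],   # 1 cm step: fine
                  [0.0, 0.31, 1.0]])  # 30 cm in one frame: flagged
bad = kinematic_violations(wrist)
```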

Flow vs. Segmentation

Optical flow within a rigid object's segmentation mask should be approximately uniform (the whole object moves together). Flow that varies dramatically within a single rigid object indicates either a segmentation error or a flow estimation failure. Deformable objects (cloth, liquids) are exempted from this check.
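A sketch of the uniformity test: flow vectors inside a rigid object's mask should cluster tightly, so a large per-component standard deviation flags the frame pair for review. The 1 px threshold is illustrative:

```python
import numpy as np

FLOW_STD_MAX_PX = 1.0  # assumed uniformity threshold for rigid objects

def rigid_flow_inconsistent(flow_uv, mask, max_std=FLOW_STD_MAX_PX):
    """True if flow inside one rigid-object mask is far from uniform,
    indicating a segmentation or flow failure. flow_uv: (H, W, 2)."""
    vecs = flow_uv[mask]  # (N, 2) displacement vectors
    return bool(vecs.std(axis=0).max() > max_std)

flow = np.zeros((2, 2, 2), dtype=np.float32)
flow[..., 0] = [[2.0, 2.0], [2.0, 8.0]]  # one pixel disagrees wildly
uniform = np.full((2, 2, 2), 2.0, dtype=np.float32)  # whole object together
mask = np.ones((2, 2), dtype=bool)

inconsistent = rigid_flow_inconsistent(flow, mask)
consistent = rigid_flow_inconsistent(uniform, mask)
```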

Action Labels vs. Motion

Action boundary annotations are cross-referenced against detected motion events. A 'grasp' label should coincide with hand or gripper closure and reduced end-effector velocity. A 'lift' label should coincide with upward motion in the depth map. Mismatches are flagged for human review, catching both automated labeling errors and annotator mistakes.

Build vs. Buy: DIY Pipeline vs. Pre-Enriched Data

Many robotics teams consider building enrichment pipelines in-house. Here is a realistic cost comparison.

Engineering Cost
  • Build in-house: $50K-$200K+ in ML engineer time to integrate, optimize, and maintain 6 model architectures
  • Pre-enriched (Claru): Included in data pricing — no engineering overhead

Time to First Data
  • Build in-house: 2-4 months to build, test, and validate the pipeline before any data is enriched
  • Pre-enriched (Claru): Days — enrichment is already running at scale on 500K+ clips

GPU Compute
  • Build in-house: $0.50-$2.00 per clip for full 6-layer enrichment on cloud GPUs (A100/H100)
  • Pre-enriched (Claru): Amortized across Claru's clip volume — per-clip cost is significantly lower

Quality Assurance
  • Build in-house: Must build custom cross-validation, flagging, and review workflows from scratch
  • Pre-enriched (Claru): Cross-layer validation, temporal consistency checks, and human review built in

Model Updates
  • Build in-house: Must track upstream model releases (Depth Anything V3, SAM 4, etc.) and re-validate
  • Pre-enriched (Claru): Claru updates the pipeline and re-enriches datasets as models improve

Scale
  • Build in-house: Limited by team GPU budget and engineering bandwidth
  • Pre-enriched (Claru): 500K+ clips enriched and growing — marginal cost decreases with scale

The calculation is straightforward for most teams: building an enrichment pipeline delays training by 2-4 months and consumes $50K-$200K+ in engineering time that could be spent on model development. Pre-enriched data lets teams start training immediately and iterate on model architecture rather than data infrastructure.

The exception is teams with highly specialized enrichment needs (custom sensor modalities, proprietary annotation schemas) where off-the-shelf enrichment does not apply. For these teams, Claru offers hybrid options: standard layers pre-enriched, with custom layers added through managed annotation campaigns.

What a Single Enriched Clip Looks Like

Here is the complete output structure for one 10-second egocentric clip at 30fps (300 frames):

clip_00042137/
  video.mp4              # Source RGB video (H.264, 1920x1080, 30fps)
  depth/
    frame_000.png        # 16-bit depth maps (300 frames)
    frame_001.png
    ...
  pose/
    keypoints.json       # COCO-format body + hand keypoints per frame
    pose_3d.npy          # 3D joint positions in camera coords
  segmentation/
    masks.json           # COCO RLE instance masks per frame
    tracking.json        # Cross-frame object tracking IDs
  flow/
    flow_000.npy         # Dense optical flow (u,v) per frame pair
    flow_001.npy
    ...
  actions/
    segments.json        # Temporal action boundaries + language
  metadata.json          # Device, environment, quality scores
  manifest.json          # Checksums, schema version, enrichment models

All enrichment layers are frame-aligned by timestamp. The manifest includes SHA-256 checksums for every file and records which model version produced each layer, enabling full reproducibility.
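Verifying a delivered file against the manifest is a one-liner with stdlib hashlib. This sketch uses in-memory bytes and a hypothetical manifest excerpt in place of real files:

```python
import hashlib
import json

# Hypothetical manifest excerpt; real manifests list every file in the clip.
payload = b"fake depth frame bytes"
manifest = {
    "schema_version": "1.0",
    "files": {"depth/frame_000.png": hashlib.sha256(payload).hexdigest()},
}

def verify(path, data, manifest):
    """Recompute SHA-256 for one delivered file and compare to the manifest."""
    return hashlib.sha256(data).hexdigest() == manifest["files"][path]

ok = verify("depth/frame_000.png", payload, manifest)
corrupt = verify("depth/frame_000.png", payload + b"!", manifest)
```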

Who Uses Enriched Training Data

Enriched data serves different roles depending on the team and model architecture.

VLA Teams

Use depth, pose, and action labels as auxiliary training objectives alongside the primary action prediction loss. Depth and segmentation serve as intermediate representations that improve spatial reasoning. Action labels provide the temporal structure for multi-step task decomposition.

World Model Teams

Use all six enrichment layers as supervisory signals for learning physically-grounded video prediction. Depth consistency across predicted frames validates that the model understands 3D geometry. Optical flow coherence confirms that predicted object motion follows physical laws. Segmentation stability ensures objects maintain identity across predictions.

Humanoid Robotics Teams

Use pose estimation for human-to-robot motion retargeting: extracting human body and hand joint trajectories from egocentric video and mapping them to the robot's kinematic chain. Depth maps provide workspace understanding. Action labels define the task structure that the humanoid needs to replicate.

Autonomous Navigation Teams

Use depth maps for obstacle detection and traversability estimation, segmentation masks for semantic scene understanding (road, sidewalk, obstacle, person), and optical flow for dynamic object tracking and ego-motion estimation.

Frequently Asked Questions

What is data enrichment for physical AI?

Data enrichment for physical AI is the process of transforming raw video into multi-layered, training-ready data by adding computed annotation layers. Instead of delivering raw RGB frames, enriched data includes per-frame depth maps, human and hand pose estimation, semantic segmentation masks, optical flow fields, temporal action labels, and structured metadata. These layers provide the supervisory signals that physical AI models — particularly VLAs and world models — need to learn 3D spatial understanding, motion dynamics, and object affordances from 2D video.

Why can't I just use raw video to train a robot?

Raw video provides only pixel-level RGB information. Physical AI models need to understand 3D geometry (how far away objects are), motion dynamics (how objects and people move), scene structure (which pixels belong to which objects), and temporal action structure (what actions are happening and when). Without enrichment layers, the model must learn all of these representations implicitly from raw pixels alone — requiring orders of magnitude more data and compute. Pre-computed enrichment layers provide explicit supervisory signals that dramatically reduce the data and compute needed for effective training.

What models does Claru use for data enrichment?

Claru's enrichment pipeline uses state-of-the-art open models at each layer: Depth Anything V2 (NeurIPS 2024) for monocular depth estimation with models ranging from 25M to 1.3B parameters; ViTPose and ViTPose++ for 2D and 3D human body and hand pose estimation; SAM 3 (ICLR 2026) for concept-based semantic segmentation across images and video; RAFT for dense optical flow computation between frames; and InternVideo2 combined with expert human annotators for temporal action segmentation and natural language descriptions. All enrichment outputs are cross-validated for temporal consistency and geometric coherence.

How much does it cost to build an enrichment pipeline in-house?

Building a production-grade enrichment pipeline in-house typically costs $50,000 to $200,000+ in engineering time, plus ongoing GPU compute costs. The engineering investment includes: integrating and optimizing 5-6 different model architectures, building frame-level synchronization and temporal consistency checks, implementing quality assurance and failure detection, scaling to handle hundreds of thousands of clips, and maintaining the pipeline as upstream models are updated. Teams also face a 2-4 month lead time before any enriched data is available for training. Using pre-enriched data from Claru eliminates this engineering overhead and time-to-data delay.

What formats does enriched data come in?

Claru delivers enriched data in formats compatible with major ML training pipelines. Depth maps are delivered as 16-bit PNG or NumPy arrays with metric scale calibration. Pose estimation outputs come as COCO-format keypoint annotations with confidence scores. Segmentation masks use COCO RLE (Run-Length Encoding) for efficient storage. Optical flow is stored as .flo files or NumPy arrays. Action labels are delivered as temporal annotations with start/end timestamps and natural language descriptions. Complete datasets are packaged as WebDataset (for streaming), HDF5 (for dense arrays), RLDS (for VLA training), or Parquet (for metadata queries).

How does enrichment quality affect model performance?

Enrichment quality directly impacts downstream model performance. Noisy depth maps teach incorrect spatial relationships, causing grasp planning failures. Inconsistent segmentation masks across frames create flickering object boundaries that confuse temporal reasoning. Missing or incorrect action labels result in models that cannot properly decompose tasks into executable sub-actions. Claru addresses quality through three mechanisms: cross-validation between enrichment layers (depth consistency checked against segmentation boundaries), temporal smoothness constraints (pose estimates validated against physics-based motion models), and human review of statistical outliers flagged by automated quality checks.

Can enrichment be applied to existing datasets?

Yes. Claru can enrich existing video datasets that teams have already collected or licensed. This is common for teams that have raw teleoperation recordings, surveillance footage, or video datasets acquired from other providers that lack enrichment layers. The enrichment pipeline processes video at scale regardless of source, adding all six annotation layers. Teams retain their original data and receive enriched versions with all layers aligned to the source frame timestamps. This is often the fastest path to training-ready data for teams that already have relevant video content.

Skip the Pipeline. Start Training.

Claru delivers enriched training data with depth, pose, segmentation, optical flow, and action labels pre-computed. Tell us about your model and we'll scope the dataset.