Last updated: April 2026
The Physical AI Stack: From Raw Sensor Data to Robot Action (2026)
A robot that picks up a cup and places it on a shelf is executing four distinct computational layers simultaneously. Each layer has its own architecture, its own training data requirements, and its own failure modes. This post maps the full stack.
TL;DR
- The physical AI stack has four layers — perception, world model, policy network, and action execution — each requiring fundamentally different training data.
- Perception models (Depth Anything V2, SAM 2, ViTPose) need labeled sensory data; they can be pre-trained on large public datasets and fine-tuned with smaller domain-specific sets.
- Policy networks (OpenVLA, Octo, GR00T N1) need observation-action-instruction triplets collected through robot teleoperation — this is the hardest and most expensive data to collect in the stack.
- World models need large-scale diverse video to learn physics priors; egocentric video across varied environments provides the distributional breadth internet video lacks for manipulation scenarios.
Stack Overview: The Four Layers
When a robot arm reaches for a coffee cup, it is executing a pipeline that spans from raw photons hitting a camera sensor to motor current flowing through servo motors — in under 100 milliseconds. That pipeline involves four distinct computational layers, each with its own training paradigm and data requirements.
SENSORS                       PHYSICAL AI STACK                HARDWARE
─────────                     ──────────────────               ─────────
RGB camera   ──────┐
Depth sensor ──────┤      ┌───────────────────────┐
LiDAR        ──────┼─────▶│ L1: PERCEPTION        │  objects, 3D structure,
IMU          ──────┘      │ (Depth Anything V2,   │  semantic labels,
                          │  SAM 2, ViTPose,      │  pose estimates
                          │  DINOv2)              │
                          └───────────┬───────────┘
                                      │
                          ┌───────────▼───────────┐
                          │ L2: WORLD MODEL       │  predicted next state,
                          │ (GR00T N1 think sys,  │  object dynamics,
                          │  UniSim, GROOT)       │  physics priors
                          └───────────┬───────────┘
                                      │
LANGUAGE     ──────┐                  │
INSTRUCTION  ──────┤      ┌───────────▼───────────┐
                   └─────▶│ L3: POLICY NETWORK    │  joint angles,
                          │ (OpenVLA, Octo,       │  end-effector pose,
                          │  pi-zero, GR00T N1)   │  gripper state
                          └───────────┬───────────┘
                                      │
                          ┌───────────▼───────────┐
                          │ L4: ACTION EXECUTION  │  torque, velocity,
                          │ (PD controllers,      │  position commands
                          │  impedance control,   │  at 100–1000 Hz
                          │  learned residuals)   │
                          └───────────┬───────────┘
                                      │
                                      ▼
                              ROBOT HARDWARE
                         (motors, joints, grippers)
This layered structure is not unique to any single robot architecture — it describes how virtually every physical AI system from autonomous vehicles to humanoid robots organizes its computation. The layers can be trained separately and composed, or trained jointly end-to-end. Most production systems in 2026 use a hybrid: perception is pre-trained on large public datasets, world models are trained on large video corpora, and policies are fine-tuned on task-specific robot trajectory data.
Layer 1: Perception
The perception layer takes raw sensor inputs — RGB images, depth frames, LiDAR point clouds, IMU readings — and transforms them into structured representations the rest of the stack can reason over: object locations, 3D scene geometry, semantic labels, and agent pose.
Perception is the most tractable layer for pre-training on large public datasets because the training signal (ground-truth labels) can be generated with automated tools applied to existing imagery.
Depth Anything V2 — Monocular Depth Estimation
Depth Anything V2 (University of Hong Kong, 2024) produces per-pixel depth estimates from a single RGB image. It was trained on a mixture of high-quality labeled depth data and 62M unlabeled images pseudo-labeled by a teacher model. The result is a depth estimator that generalizes across environments without requiring depth sensor hardware, making it valuable for robots that use RGB-only cameras. For training data: the base model needs per-pixel depth annotations (from LiDAR, structured light, or stereo cameras); the pseudo-labeling pipeline extends coverage to any RGB imagery.
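Once a depth estimate exists, converting it into the 3D structure downstream layers consume is a standard pinhole backprojection. A minimal numpy sketch; the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative values, not those of any particular sensor:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a per-pixel depth map (H, W) into a 3D point cloud (H*W, 3)
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# A flat surface 2 m away, seen through a 320x240 camera.
depth = np.full((240, 320), 2.0)
cloud = backproject(depth, fx=300.0, fy=300.0, cx=160.0, cy=120.0)
```

In a real pipeline the intrinsics come from camera calibration, and depth predicted from a monocular model must first be metrically scaled before backprojection.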
SAM 2 — Segmentation and Tracking
SAM 2 (Meta AI, 2024) segments and tracks objects across video frames with a single prompt (click, bounding box, or mask). For robotics, this enables zero-shot object tracking across manipulation sequences without object-specific training. SAM 2 was trained on SA-1B (1B+ masks) and SA-V (50K+ videos with spatio-temporal masks). The training data requirement is large-scale mask annotation — not action labels — making it substantially cheaper to collect than policy training data.
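SAM 2's own tracker is a learned, memory-based module, but the downstream bookkeeping a robot needs (associating this frame's masks with last frame's object identities) can be sketched with simple IoU matching. This is illustrative only, not SAM 2's internal mechanism:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def associate(prev_masks, curr_masks, threshold=0.5):
    """Greedily match each current-frame mask to the previous-frame mask
    with the highest IoU; returns {curr_index: prev_index or None}."""
    matches = {}
    for i, cm in enumerate(curr_masks):
        ious = [mask_iou(cm, pm) for pm in prev_masks]
        best = int(np.argmax(ious)) if ious else None
        matches[i] = best if best is not None and ious[best] >= threshold else None
    return matches
```

Greedy IoU association breaks under fast motion and occlusion, which is precisely why SAM 2's spatio-temporal memory matters for manipulation video.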
ViTPose — Human and Hand Pose Estimation
ViTPose (University of Sydney & JD Explore Academy, 2022) is a Vision Transformer-based pose estimation model that achieves strong performance on both human body and hand keypoint detection. For robotics, hand pose estimation is particularly important for understanding manipulation from egocentric video — it enables retargeting of human hand trajectories to robot gripper poses. ViTPose requires datasets with 2D and 3D joint annotation: COCO-WholeBody, MPII, Human3.6M, and InterHand2.6M are the primary training sources.
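As a taste of keypoint-to-gripper retargeting: the pinch aperture between thumb and index fingertips can drive a parallel-jaw gripper's opening. This is a deliberately minimal sketch; the 0.08 m jaw limit is an assumed value, and real retargeting must also map wrist pose and orientation:

```python
import numpy as np

def gripper_width_from_hand(thumb_tip, index_tip, max_width=0.08):
    """Map the thumb-index fingertip distance (metres) from a hand-pose
    estimate to a parallel-jaw opening, clipped to the jaw's travel."""
    pinch = np.linalg.norm(np.asarray(thumb_tip) - np.asarray(index_tip))
    return float(np.clip(pinch, 0.0, max_width))

# A 5 cm pinch maps directly; a fully open hand saturates at the jaw limit.
narrow = gripper_width_from_hand([0, 0, 0], [0.05, 0, 0])
wide = gripper_width_from_hand([0, 0, 0], [0.20, 0, 0])
```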
Training data requirements at this layer: Labeled sensory data with ground-truth structural annotations. The key challenge is visual diversity — perception models trained on lab images transfer poorly to novel deployment environments with different lighting and object types. This is where Claru's 500K+ egocentric clips, enriched with Depth Anything V2 depth maps, ViTPose keypoints, and SAM segmentation masks, provide training-ready perception data across 100+ cities and diverse real-world environments.
Layer 2: World Model
A world model learns an internal representation of environment dynamics: given the current state and a proposed action, what will happen next? For robots, this means understanding how objects behave under manipulation — how a cup slides across a table, how cloth deforms when grasped, how a stack of objects responds to contact.
World models enable planning: instead of reacting to each observation, a robot can mentally simulate candidate action sequences and select the one predicted to succeed. This is the computational difference between a reactive policy and a deliberative agent.
The dominant approach to training world models for physical AI is video prediction: train on large video corpora to predict future frames given past frames and (optionally) action inputs. The model must learn physics-like dynamics to make accurate predictions, developing implicit representations of object permanence, gravity, contact, and rigidity.
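The planning loop a world model enables can be sketched independently of any particular architecture: sample candidate action sequences, roll each out through the dynamics model, and execute the first action of the best sequence (random-shooting planning). A toy 1-D version, with a hypothetical `dynamics` function standing in for a learned world model:

```python
import numpy as np

def plan(state, dynamics, cost, horizon=5, n_samples=256, seed=0):
    """Random-shooting planner: sample action sequences, roll each out
    through the dynamics model, return the first action of the cheapest."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    costs = np.zeros(n_samples)
    for i in range(n_samples):
        s = state
        for a in actions[i]:
            s = dynamics(s, a)   # mental simulation, not real execution
            costs[i] += cost(s)
    return actions[np.argmin(costs)][0]

# Toy 1-D world: the state moves by the action; the goal is the origin.
dynamics = lambda s, a: s + a
cost = lambda s: s ** 2
first_action = plan(state=3.0, dynamics=dynamics, cost=cost)
```

In a real system `dynamics` would be the learned video-prediction or latent-dynamics model, costs would come from a task objective, and smarter samplers (CEM, MPPI) would replace uniform sampling.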
GR00T N1's Dual-System Architecture
GR00T N1 (NVIDIA, 2025) uses a two-system architecture: a "thinking" system based on the Eagle2 VLM for high-level scene understanding and task planning (functioning as a world model for task-level reasoning), and an "acting" system based on a diffusion transformer for generating motor trajectories. The thinking system was pretrained on NVIDIA's video corpus — including the EgoScale egocentric video initiative — giving it broad priors about how the physical world behaves.
Training data requirements at this layer: Diverse video at scale, with a preference for content showing physical interactions: objects being picked up, placed, pushed, dropped, and manipulated. Internet video is a reasonable starting point but has important gaps — it is dominated by filmed content rather than first-person manipulation, and it underrepresents the close-range, hand-object interactions that matter most for robotics. Egocentric video from humans performing everyday manipulation tasks fills this gap. Claru's 500K+ clips across kitchens, workshops, warehouses, and outdoor environments are specifically structured for this use case.
Layer 3: Policy Network
The policy network is the most data-hungry layer in the physical AI stack, and its training data is the hardest to collect. It takes the structured state representation from the perception layer, the predictions from the world model, and the natural language instruction, and produces the action sequence the robot should execute.
Policy training data must be in the form of observation-action-instruction triplets: at each timestep, what the robot observed, what instruction it was following, and what action it took. This requires physically collecting demonstrations — either through robot teleoperation or human demonstration retargeted to the robot's action space.
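A minimal, purely illustrative schema for such triplets is sketched below; real corpora such as Open X-Embodiment store episodes in the RLDS format rather than anything like this:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Step:
    """One timestep of policy training data: what the robot observed,
    what instruction it was following, and what action it took."""
    observation: Dict[str, object]  # e.g. {"rgb": ..., "proprio": ...}
    instruction: str                # natural-language task description
    action: List[float]             # e.g. 7-DoF end-effector delta + gripper

@dataclass
class Trajectory:
    """One demonstration: an ordered sequence of steps, plus embodiment
    metadata so trajectories from different robots can be mixed."""
    embodiment: str                 # e.g. "franka_panda" (illustrative)
    steps: List[Step] = field(default_factory=list)

traj = Trajectory(embodiment="franka_panda")
traj.steps.append(Step(observation={"rgb": None},
                       instruction="pick up the cup",
                       action=[0.0] * 7))
```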
The standard public training corpus for policy networks is the Open X-Embodiment dataset: 1M+ trajectories across 22 robot embodiments from 21 research institutions. OpenVLA was pre-trained on a 970K-trajectory subset; Octo was trained on an 800K-trajectory mix drawn from the same dataset. For task-specific fine-tuning, most teams collect additional demonstrations using their specific robot hardware in their deployment environment.
The key models at this layer in 2026:
- OpenVLA (7B params): Open-source VLA pretrained on 970K Open X-Embodiment trajectories. Best open option for fine-tuning experiments.
- Octo (93M params): Smaller, faster VLA from Berkeley. Runs at 20+ Hz; good baseline for real-time control experiments.
- pi-zero (Physical Intelligence): PaliGemma backbone + flow matching action expert for dexterous bimanual tasks. Proprietary weights.
- GR00T N1 (NVIDIA): Foundation model for humanoid robots with an Eagle2 VLM thinking system and diffusion transformer action expert.
Layer 4: Action Execution
The action execution layer translates the high-level action targets from the policy (end-effector pose targets, desired joint configurations, gripper commands) into low-level motor commands that run on hardware at 100–1000 Hz — far faster than the policy network runs.
This layer operates at the interface between the learned policy and physical reality. Even a perfectly specified end-effector pose target from the policy must be converted to joint torques through inverse kinematics, then tracked by servo controllers that account for motor dynamics, gear ratios, and joint limits.
Most current physical AI systems use classical control at this layer — PD controllers, impedance controllers, or model predictive controllers tuned to the specific hardware. The role of learning at this layer is smaller than at L1–L3: it is primarily used for residual learning (correcting systematic errors in classical controllers) and for hardware-specific calibration.
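A PD position loop on a unit-mass single joint, stepped at 1 kHz, is enough to show the shape of this layer. The gains and the unit-mass plant are illustrative, not tuned for any real actuator:

```python
def pd_track(target, kp=40.0, kd=8.0, dt=0.001, steps=2000):
    """Simulate a PD position controller on a unit-mass 1-DoF joint,
    integrated with explicit Euler at 1 kHz (a typical L4 rate)."""
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        torque = kp * (target - pos) - kd * vel  # PD control law
        vel += torque * dt                       # unit mass: accel = torque
        pos += vel * dt
    return pos

final = pd_track(target=0.5)  # converges close to the 0.5 rad target
```

The policy only refreshes its target every few tens of milliseconds; this inner loop is what keeps the joint tracking that target in between, which is why it must run one to two orders of magnitude faster.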
Training data requirements at this layer: Hardware-specific calibration data, including motor characterization, joint encoder calibration, and friction identification. This data does not transfer across robot platforms and must be collected fresh for each deployment. The volume is small (hundreds to thousands of calibration runs) but the specificity requirement is absolute.
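Friction identification is a representative example of such calibration: drive the joint at several constant velocities, record the steady-state torque, and fit a Coulomb-plus-viscous model by least squares. A sketch on synthetic data, where both the model form and the coefficients are illustrative:

```python
import numpy as np

def fit_friction(velocities, torques):
    """Identify Coulomb (c) and viscous (b) friction coefficients from
    constant-velocity runs, assuming tau = c * sign(v) + b * v."""
    v = np.asarray(velocities)
    A = np.column_stack([np.sign(v), v])     # regressor matrix
    (c, b), *_ = np.linalg.lstsq(A, np.asarray(torques), rcond=None)
    return c, b

# Synthetic calibration runs generated with known friction: c=0.3, b=0.05.
v = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
tau = 0.3 * np.sign(v) + 0.05 * v
c, b = fit_friction(v, tau)
```

With real encoder and current measurements the data is noisy and the recovered coefficients are estimates; skipping this step is one way L4 errors get misattributed to the policy.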
Stack Summary Table
| Layer | Training Data | Typical Volume |
|---|---|---|
| L1: Perception | Labeled sensory data: images + depth/segmentation/pose annotations | 10K–1M+ labeled frames |
| L2: World Model | Large video corpora with optional action labels; diverse multi-environment video | Millions of video clips |
| L3: Policy Network | Robot trajectory data: observation-action-instruction triplets | 50K–1M+ trajectories |
| L4: Action Execution | Hardware-specific calibration data, sensor characterization | Varies by hardware (hundreds to thousands of calibration runs) |
Training Data by Layer: Where the Gaps Are
The data supply across the four stack layers is not uniform. Some layers have abundant public training data; others face genuine scarcity. Understanding where the gaps are helps teams prioritize collection efforts.
L1: Perception data
Relatively abundant. Large public datasets exist: NYU-Depth V2, SUN RGB-D, COCO-Panoptic, KITTI, and others. The gap is visual diversity for deployment environments — production robots encounter lighting, objects, and clutter that academic datasets don't cover. Domain-specific enriched video (like Claru's 500K+ clips with pre-computed depth, segmentation, and pose) closes this gap without requiring teams to collect and annotate from scratch.
L2: World model data
Moderate gap. Internet video exists at scale, but underrepresents first-person manipulation: the close-range, hand-object interactions that matter for learning manipulation physics. Ego4D (3,670 hours) and EPIC-Kitchens partially address this, but coverage of diverse environments, industrial tasks, and outdoor manipulation is thin.
L3: Policy data
Significant gap. This is the critical bottleneck. Open X-Embodiment provides 1M+ trajectories but covers limited task types, limited environments, and limited robot embodiments. DROID adds 76K trajectories across 564 environments for better generalization, but the total available policy training data is small compared to what production deployment requires. Every team building a production physical AI system must collect additional demonstrations for their specific tasks.
L4: Action execution data
Hardware-specific, not shared. Calibration data does not transfer across platforms. Each team collects this for their own hardware. Volume is small; the gap is not quantity but systematic effort — many teams skip rigorous hardware calibration and encounter downstream policy failures that are actually L4 problems misattributed to L3.
Key Takeaways
- The physical AI stack has four layers (perception, world model, policy network, action execution), each with distinct architecture, training data requirements, and failure modes.
- Perception data (labeled sensory inputs) is relatively abundant through public datasets; the gap is visual diversity for deployment environments, which enriched real-world video fills.
- World model data needs diverse first-person manipulation video — internet video underrepresents close-range hand-object interactions; egocentric video at scale fills this gap.
- Policy data (observation-action-instruction triplets) is the critical bottleneck: Open X-Embodiment provides 1M+ trajectories as a foundation, but task-specific and environment-specific demonstrations must still be collected for production deployment.
- Action execution operates on hardware calibration data that is platform-specific and non-transferable — systematic calibration is often neglected and is a common source of unexplained deployment failures.
- The practical implication: teams building physical AI systems need to map their data collection strategy to specific stack layers, not treat 'robot training data' as a single undifferentiated category.
Frequently Asked Questions
What is the physical AI stack?
The physical AI stack is the layered pipeline that takes raw sensor inputs (RGB images, depth maps, proprioception, IMU readings) and produces physical robot actions (joint angles, end-effector poses, gripper states). It consists of four functional layers: (1) Perception — extracting structured representations from raw sensor data (object locations, 3D structure, semantic labels); (2) World Model — building an internal model of environment state and predicting future states given actions; (3) Policy Network — mapping perceived state and language instructions to action sequences; (4) Action Execution — translating policy outputs to low-level motor commands on hardware. Each layer requires different types of training data.
How does perception training data differ from policy training data?
Perception training data consists of sensory inputs paired with ground-truth labels about the environment: RGB images with depth maps, segmentation masks, object bounding boxes, or pose estimates. The model learns to infer structure from the raw signal. Policy training data consists of state-action trajectories: observations from the robot's sensors at each timestep paired with the actions taken and the outcomes. The model learns to map state representations to actions. Perception data can be collected at low cost using automated annotation tools (Depth Anything V2 for depth, SAM 2 for segmentation); policy data requires human operators teleoperating robots or performing demonstrations with motion capture, making it far more expensive to collect.
What is a world model in robotics?
A world model in robotics is an internal representation that captures the current state of the environment and can predict how that state will change given the robot's actions. Unlike a pure reactive policy (which maps current state to action without modeling the future), a world model enables planning: the robot can mentally simulate sequences of actions and their consequences before committing to any of them. World models trained on large video datasets learn physics-like priors — how objects fall, roll, deform, and respond to contact — that help robots reason about novel situations. GR00T N1 uses a world model component trained on NVIDIA's video corpus to provide this prior for its humanoid robot policy.
What training data does the perception layer need?
Perception layer training data consists of raw sensory inputs (RGB images, depth frames, LiDAR point clouds) paired with ground-truth structural labels: per-pixel depth values, semantic segmentation masks, object instance masks, 3D bounding boxes, or human pose keypoints. For robotics specifically, the perception data should match the visual distribution the robot will encounter in deployment: similar lighting, similar object types, similar backgrounds. Depth Anything V2 was trained on a mixture of labeled data (NYU-Depth V2, SUN RGB-D, KITTI, and others) and unlabeled images pseudo-labeled by a teacher model — this approach achieves strong generalization without requiring manually labeled depth for every new environment.
Can the physical AI stack layers be trained independently?
In principle yes, and in practice that is how most teams approach it. Perception models (depth estimation, segmentation, pose estimation) are typically pre-trained on large datasets and used as frozen feature extractors or fine-tuned with small domain-specific datasets. World models are trained on video prediction tasks using large video corpora. Policy networks are trained on robot trajectory data, either end-to-end (taking raw pixels as input) or on top of frozen or fine-tuned perception representations. End-to-end training of all layers simultaneously is theoretically possible but requires enormous amounts of robot interaction data and is computationally expensive. That said, jointly fine-tuning the visual representation with the policy has shown gains in recent work; OpenVLA, for example, fine-tunes its vision encoder together with the language-model backbone during VLA training rather than keeping it frozen.
Related Resources
Physical AI Training Data
How Claru collects, enriches, and delivers training data for physical AI at each stack layer.
Embodied AI Datasets
Overview of major embodied AI datasets and their coverage of the physical AI stack.
Glossary: World Model
Definition of world models in robotics with references to GR00T N1 and UniSim architectures.