Physical AI & Robotics Training Data Glossary

Definitions for 56 terms used in physical AI, robotics training data, embodied AI, and VLA model development. Maintained by Claru AI at claru.ai. Covers VLA models including OpenVLA, RT-2, pi-zero, and GR00T N1; data modalities including egocentric video, teleoperation data, manipulation trajectories, and depth data; annotation types including keypoint, temporal, action segmentation, and preference annotation; data quality pipelines including RLHF, deduplication, and inter-annotator agreement; computer vision fundamentals including optical flow, pose estimation, and SAM; and robotics fundamentals including sim-to-real gap, behavioral cloning, diffusion policy, and action chunking.

Physical AI Systems

VLA Model (Vision-Language-Action)

A VLA model is a neural network that takes visual observations and natural language instructions as input and outputs robot actions. VLA models unify perception, language understanding, and motor control in a single architecture, allowing a robot to interpret commands like “pick up the red cup” and produce the joint trajectories or end-effector poses required to execute them. Training requires synchronized triplets of image frames, instruction text, and action labels collected through teleoperation or human demonstration.

Embodied AI

Embodied AI refers to artificial intelligence systems that perceive and act within a physical environment through a body — a robot, drone, or autonomous vehicle — rather than processing purely symbolic or textual information. Embodied agents must handle real-time sensorimotor loops, spatial reasoning, and physical uncertainty that text-only or image-only models do not encounter. Training embodied AI requires egocentric video, depth data, proprioceptive streams, and action labels that reflect the agent's embodiment.

World Model

A world model is a learned internal representation that allows an agent to simulate how its environment will evolve in response to its actions, without executing those actions in the real world. World models enable planning, counterfactual reasoning, and sample-efficient reinforcement learning by letting an agent imagine trajectories before acting. For physical AI, training world models requires diverse real-world video that captures the causal structure of physical interactions — how objects move, deform, and respond to contact forces.

Humanoid Robot

A humanoid robot is a robotic system with a body morphology that approximates the human form, typically including two legs, two arms, a torso, and a head. This morphology allows humanoids to operate in environments designed for humans — stairs, doorways, workbenches — without infrastructure modifications. Training humanoid policies requires full-body motion data, bimanual manipulation demonstrations, whole-body pose annotations, and egocentric video from head-mounted cameras at approximately the eye height of a standing human.

Visuomotor Policy

A visuomotor policy is a learned mapping from visual observations — camera images or video frames — directly to motor commands or control actions. Rather than first building an explicit scene representation and then planning, visuomotor policies compute actions end-to-end from pixels. This approach is data-intensive: the policy must generalize across lighting, viewpoint, and object variation. Effective visuomotor policies are trained on large corpora of egocentric demonstration video paired with synchronized action labels.

Foundation Model for Robotics

A foundation model for robotics is a large pre-trained model that serves as a general-purpose base for many downstream robot learning tasks, analogous to how GPT-4 serves as a base for language applications. These models are pre-trained on broad, cross-embodiment datasets and then fine-tuned for specific robots and tasks. Examples include OpenVLA, Octo, and GR00T N1. Training requires large, diverse datasets spanning many robot types, environments, and task categories.

Cross-Embodiment Data

Cross-embodiment data is training data collected from multiple robot morphologies — different arm designs, grippers, camera configurations, and kinematic chains — assembled into a single dataset to train policies that generalize across robot types. The Open X-Embodiment dataset is the canonical example, combining trajectories from 22 different robots. Cross-embodiment training reduces the need to collect a large demonstration dataset for every new robot platform, enabling faster deployment of pre-trained foundation models on novel hardware.

Physical AI

Physical AI refers to AI systems that perceive, reason about, and act within the physical world, as opposed to systems that operate purely in digital or linguistic domains. Physical AI encompasses robots, embodied agents, autonomous vehicles, and world models — any system where the AI must bridge the gap between perception and physical action. The defining data requirement of physical AI is grounded, multi-modal training data: video paired with depth, force, pose, and action information that reflects how the real physical world behaves.

Data Modalities

Egocentric Video

Egocentric video is first-person video captured from a camera mounted on or near a person's or robot's head, recording the world from the perspective of the agent performing a task. This viewpoint directly mirrors what a robot's on-board camera would see during operation, making egocentric video the most natural training signal for visuomotor policies and embodied AI. Key characteristics include frequent hand-object interactions, dynamic viewpoint changes, and the full visual context of task execution including workspace layout.

Teleoperation Data

Teleoperation data consists of paired observation-action recordings captured while a human operator remotely controls a physical robot to complete tasks. The human drives the robot through VR controllers, exoskeletons, or leader-follower setups, and the system records both what the robot's cameras see and the exact joint positions, end-effector poses, and gripper states the human commands. This creates ground-truth action labels at the deployment embodiment, making teleoperation data particularly valuable for behavioral cloning and VLA fine-tuning.

Manipulation Trajectory

A manipulation trajectory is a time-series record of a robot arm executing a task, capturing the sequence of end-effector positions, orientations, gripper states, and joint angles over the duration of a manipulation action such as grasping, lifting, inserting, or assembling. Trajectories are the primary training signal for imitation learning and behavior cloning in manipulation robotics. High-quality trajectories require sub-16ms temporal alignment between visual observations and action states, and include metadata about object identity, task phase, and success or failure outcome.
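The time-series record described above can be sketched as a simple schema. The field names, types, and units here are illustrative only, not a standard interchange format:

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    """One timestep of a manipulation trajectory (illustrative schema)."""
    timestamp_s: float      # capture time, seconds
    joint_angles: list      # one value per joint, radians
    ee_position: tuple      # end-effector (x, y, z), meters
    ee_orientation: tuple   # quaternion (qx, qy, qz, qw)
    gripper_state: float    # 0.0 = fully open, 1.0 = fully closed

@dataclass
class ManipulationTrajectory:
    task: str               # e.g. "grasp_mug"
    success: bool           # outcome label for the whole episode
    steps: list = field(default_factory=list)

traj = ManipulationTrajectory(task="grasp_mug", success=True)
traj.steps.append(TrajectoryStep(0.0, [0.1, -0.5, 1.2], (0.4, 0.0, 0.3),
                                 (0.0, 0.0, 0.0, 1.0), 0.0))
```

In practice each step would also carry the synchronized camera frame reference, and the trajectory the object and task-phase metadata mentioned above.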

Depth Data

Depth data encodes the distance from a camera to every point in a scene as a per-pixel value, producing a depth map that complements a standard RGB image with 3D spatial information. For robot learning, depth data enables grasp planning, obstacle avoidance, and 3D scene understanding that pure RGB images cannot support. Depth can be measured directly using LiDAR or structured light sensors, or estimated from monocular RGB video using models like Depth Anything V2. Claru enriches raw video with per-frame depth maps as a standard annotation layer.

RGB-D Data

RGB-D data pairs standard color (red-green-blue) video frames with aligned depth frames captured at the same moment, providing both appearance and geometry for every scene. The depth channel (D) gives each pixel a distance value in addition to its color, enabling direct 3D reconstruction and precise grasp pose estimation. RGB-D cameras such as Intel RealSense and Microsoft Azure Kinect are common in robotics research. RGB-D data is particularly valuable for manipulation tasks where the 3D geometry of objects directly determines grasp feasibility.
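The claim that the depth channel enables direct 3D reconstruction reduces to the pinhole camera model: each pixel plus its depth value back-projects to a 3D point in the camera frame. A minimal sketch, with made-up intrinsics for a 640x480 sensor:

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Convert pixel (u, v) with depth (meters) to a 3D camera-frame point
    via the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# Hypothetical intrinsics (fx, fy: focal lengths in pixels; cx, cy: principal point).
point = backproject(u=320, v=240, depth_m=0.5,
                    fx=600.0, fy=600.0, cx=320.0, cy=240.0)
# The principal-point pixel maps straight onto the optical axis: (0.0, 0.0, 0.5)
```

Applying this to every pixel of an aligned depth frame yields the point cloud described in the next entry.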

Point Cloud

A point cloud is a set of data points in three-dimensional space, where each point has (x, y, z) coordinates and often additional attributes such as color or surface normals, representing the geometry of a scanned object or environment. Point clouds are generated by LiDAR sensors, structured light depth cameras, or through reconstruction from RGB-D sequences. In robotics, point clouds feed into grasp planning algorithms, 3D object detection, and occupancy mapping. Standard formats include PLY and PCD files compatible with the Open3D and PCL libraries.

Proprioceptive Data

Proprioceptive data captures a robot's internal state — joint angles, joint velocities, end-effector position and orientation, gripper force, and torque readings — without relying on external sensors like cameras. Proprioception provides the robot's sense of its own body position in space, analogous to human kinesthetic awareness. In manipulation and locomotion policies, proprioceptive data is typically concatenated with visual observations as part of the observation vector, giving the policy information about its current configuration and the forces being applied at each joint.

Synthetic Data (for Robotics)

Synthetic data for robotics is training data generated in simulation environments such as Isaac Sim, MuJoCo, Genesis, or Habitat, rather than collected from physical robot systems or human demonstrations. Synthetic data offers unlimited scale and perfect ground-truth labels — exact object poses, contact forces, and joint states — at a fraction of the cost of real-world collection. Its primary limitation is the sim-to-real gap: visual and physical discrepancies between the simulated and real-world distributions that cause policies to fail on deployment hardware.

Annotation Types

Keypoint Annotation

Keypoint annotation marks specific landmark locations on objects or body parts — such as fingertips, joint centers, or object corners — as (x, y) coordinates within an image frame or (x, y, z) coordinates in 3D space. For human body pose, standard keypoint sets include the COCO 17-point skeleton and the OpenPose 25-point body model. For hand-object interaction, keypoints mark fingertip positions, wrist center, and object contact points. Keypoint annotations are the primary training signal for pose estimation models including ViTPose.
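The COCO 17-point convention mentioned above stores keypoints as a flat list of (x, y, v) triples, where the visibility flag v is 0 (not labeled), 1 (labeled but occluded), or 2 (labeled and visible). A sketch of one annotation record, with invented coordinates:

```python
COCO_SKELETON_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

annotation = {
    "image_id": 42,
    "category_id": 1,             # person
    "keypoints": [0, 0, 0] * 17,  # one (x, y, v) triple per landmark
    "num_keypoints": 0,           # count of labeled landmarks
}

def set_keypoint(ann, name, x, y, visible=True):
    """Write one landmark into the flat keypoint list (helper for illustration)."""
    i = COCO_SKELETON_NAMES.index(name) * 3
    ann["keypoints"][i:i + 3] = [x, y, 2 if visible else 1]
    ann["num_keypoints"] = sum(1 for v in ann["keypoints"][2::3] if v > 0)

set_keypoint(annotation, "left_wrist", 210.5, 340.0)
```

3D keypoint annotations extend each triple with a z coordinate; hand and object keypoint sets follow the same pattern with different landmark name lists.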

Temporal Annotation

Temporal annotation marks the start and end timestamps of events, actions, or state transitions within a video, creating labeled segments along the time axis. In robotics, temporal annotations define the boundaries of discrete action phases — when a grasp begins and ends, when an object is in transit, when contact is made or released. Temporal precision directly affects policy performance: annotations at coarse (e.g., 100ms) granularity miss the frame-accurate timing information that manipulation policies need to learn smooth, reactive behavior.

Action Segmentation

Action segmentation is the task of partitioning a video into temporally contiguous segments and assigning an action class label to each segment — for example, labeling consecutive frames as reach, grasp, transport, and place. Unlike activity recognition, which assigns a single label to an entire video, action segmentation produces a frame-level or segment-level label sequence. Action segmentation annotations are essential for training manipulation policies that decompose complex tasks into primitive actions and for generating the temporal supervision required by sequence models.
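The frame-level label sequence described above is interchangeable with a list of (label, start, end) segments via run-length grouping. A minimal sketch, assuming 30 fps video and the reach/grasp/transport/place classes used as the example:

```python
def frames_to_segments(frame_labels, fps=30.0):
    """Collapse a per-frame label sequence into (label, start_s, end_s) segments."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current segment at the end of the list or on a label change.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start / fps, i / fps))
            start = i
    return segments

labels = ["reach"] * 30 + ["grasp"] * 15 + ["transport"] * 45 + ["place"] * 30
print(frames_to_segments(labels))
# [('reach', 0.0, 1.0), ('grasp', 1.0, 1.5), ('transport', 1.5, 3.0), ('place', 3.0, 4.0)]
```

The inverse direction (segments to frame labels) is how segment-level human annotations become the frame-level supervision that sequence models consume.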

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in an image — for example, labeling all pixels belonging to cups as cup, all countertop pixels as countertop, and all hand pixels as hand. Unlike object detection, which produces bounding boxes, semantic segmentation provides pixel-precise region boundaries. For robot manipulation, semantic segmentation is used to identify graspable objects, avoid obstacles, and segment the workspace layout. Models like SAM (Segment Anything Model) produce high-quality object masks — class-agnostic, and paired with class labels downstream to yield semantic segmentation — that Claru uses as a standard enrichment layer.

Instance Segmentation

Instance segmentation extends semantic segmentation by distinguishing individual object instances of the same class — for example, separately labeling cup_1, cup_2, and cup_3 rather than assigning all three to a single cup class. This enables tracking individual objects across frames, understanding cluttered workspaces where multiple instances of the same object type are present simultaneously, and generating the per-object identity labels required for tasks like multi-object manipulation and assembly sequencing.

Activity Annotation

Activity annotation labels what a person or robot is doing in a video at a coarser temporal granularity than action segmentation — for example, labeling a 30-second clip as preparing breakfast or repairing a bicycle. Activity labels provide high-level semantic context that complements fine-grained action segmentation. In egocentric video datasets, activity annotations define the top-level task category and are used to filter and stratify training data, ensuring that manipulation policies train on task-relevant clips rather than unrelated background footage.

Bounding Box Annotation

Bounding box annotation draws the smallest axis-aligned rectangle that fully encloses an object within an image frame, labeled with an object class identifier. Bounding boxes are the most common object detection annotation format and provide approximate spatial localization without the per-pixel precision of segmentation masks. In robotics, bounding boxes are used for object detection, workspace analysis, and as an input to downstream grasp planning pipelines. Temporal sequences of bounding boxes across video frames provide the training signal for object tracking models.
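Bounding boxes are compared using intersection-over-union (IoU), the standard overlap metric for matching detections to ground-truth annotations and for tracking association. A minimal sketch in the common (x_min, y_min, x_max, y_max) convention:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 overlap / 150 union = 0.333...
```

Detection benchmarks typically count a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.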

Preference Annotation (RLHF)

Preference annotation collects human judgments about which of two or more AI outputs is preferred, providing the training signal for reward models used in reinforcement learning from human feedback (RLHF). Annotators compare pairs of model outputs — robot trajectories, generated videos, text responses — and label which is better along specified dimensions. In robotics, preference annotations evaluate trajectory smoothness, task success, and natural motion. In video generation, they assess motion quality, fidelity, and text-video alignment across model configurations.

Data Quality & Pipelines

Data Enrichment

Data enrichment is the process of augmenting raw collected data with additional annotation layers — depth maps, segmentation masks, pose estimates, optical flow, captions, and action labels — that downstream models need but that are not present in the original capture. Rather than delivering raw video, enrichment pipelines run automated models (Depth Anything V2, ViTPose, SAM, RAFT) and human annotation passes to produce a multi-layer dataset ready for direct use in training. Enrichment is distinct from annotation: it adds derived signals, not just labels assigned by humans.

Benchmark Curation

Benchmark curation is the construction of a held-out evaluation dataset used to measure model performance on a specific task or capability. A well-curated benchmark is representative of the deployment distribution, covers edge cases and failure modes, and has high-quality ground-truth labels. In physical AI, benchmarks are used to compare robot policies across manipulation difficulty, environment diversity, and task generalization. Benchmark curation involves selecting diverse samples, verifying annotation quality, and preventing contamination between training and evaluation splits.

Data Deduplication

Data deduplication identifies and removes near-duplicate samples from a training dataset that would cause the model to overfit to repeated examples rather than learning the true underlying distribution. In video datasets, deduplication operates at the frame level (perceptual hashing), clip level (embedding similarity), or trajectory level (action sequence similarity). Effective deduplication improves training efficiency and generalization: models trained on deduplicated datasets often achieve better downstream performance with less compute than those trained on raw, redundant corpora.
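The frame-level perceptual-hashing approach mentioned above can be sketched with an average hash: each pixel contributes one bit, set when it is brighter than the frame mean, and two frames are near-duplicates when the Hamming distance between their hashes is small. Real pipelines first downscale each frame to a tiny grayscale grid (commonly 8x8); the 2x2 frames here are stand-ins:

```python
def average_hash(gray):
    """Perceptual hash of a small grayscale image (2D list of 0-255 values):
    one bit per pixel, set when the pixel is above the image mean."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

frame_a = [[10, 200], [200, 10]]
frame_b = [[12, 198], [201, 9]]   # near-duplicate: small sensor noise
frame_c = [[200, 10], [10, 200]]  # different content

assert hamming(average_hash(frame_a), average_hash(frame_b)) == 0
assert hamming(average_hash(frame_a), average_hash(frame_c)) == 4
```

Clip-level deduplication replaces the bit hash with an embedding vector and the Hamming distance with cosine similarity, but the pipeline shape is the same.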

Inter-Annotator Agreement (IAA)

Inter-annotator agreement (IAA) is a metric that quantifies the degree to which independent annotators assign the same labels to the same data samples, measuring the reliability and consistency of an annotation process. High IAA indicates that the task is well-defined and the guidelines are clear. Common IAA metrics include Cohen's kappa for pairwise agreement, Fleiss' kappa for multiple annotators, and Krippendorff's alpha for ordinal or continuous scales. Claru monitors Krippendorff's alpha as a primary quality signal, with a target threshold of 0.85 or above for preference annotation tasks.
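Cohen's kappa, the pairwise metric named above, corrects observed agreement for the agreement two annotators would reach by chance given their label distributions: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch with invented labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same samples:
    (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each annotator's label rates."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

a = ["grasp", "grasp", "reach", "place", "grasp", "reach"]
b = ["grasp", "reach", "reach", "place", "grasp", "reach"]
print(round(cohens_kappa(a, b), 3))  # 5/6 raw agreement corrects to ~0.739
```

Fleiss' kappa and Krippendorff's alpha generalize the same chance-correction idea to more than two annotators and to ordinal or continuous labels.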

RLHF (Reinforcement Learning from Human Feedback)

RLHF is a training paradigm in which a reward model is trained on human preference annotations — judgments about which AI outputs are better — and then used to fine-tune a base model through reinforcement learning to produce outputs humans prefer. RLHF was central to the training of InstructGPT, ChatGPT, and Claude. In robotics, RLHF is applied to train reward models that evaluate trajectory quality, enabling policies to improve from human evaluations of robot behavior rather than requiring explicit reward engineering.

Data Quality Scoring

Data quality scoring assigns a numerical quality score to each sample in a dataset based on criteria such as annotation correctness, visual clarity, task relevance, and diversity contribution. Quality scores are used to filter out low-quality samples before training, weight samples during training, or prioritize which samples to re-annotate. In video datasets, quality scoring evaluates factors including motion blur, occlusion severity, camera calibration drift, and action completeness. Automated quality scoring reduces the labeling load by focusing human review on borderline samples rather than clearly acceptable or rejected ones.

Dataset Diversity

Dataset diversity measures the range of variation in a training corpus across dimensions that matter for model generalization — scene appearance, lighting, object category, geographic location, task type, and operator behavior. A diverse dataset reduces overfitting to specific environments and improves zero-shot performance in novel settings. In robotics, diversity is measured along axes including environment category (kitchen, warehouse, outdoor), object type (rigid, deformable, transparent), and viewpoint (wrist camera, head camera, external camera). Claru tracks diversity coverage explicitly across all collection campaigns.

Active Learning

Active learning is a data collection and annotation strategy in which a model identifies which unlabeled samples it is most uncertain about, and those samples are prioritized for human annotation. This concentrates labeling effort on the examples that will most improve model performance, reducing the total annotation volume required to reach a target accuracy. In robotics, active learning selects demonstration scenarios that expose gaps in the current policy — edge cases, failure modes, or underrepresented environments — for targeted data collection rather than random sampling.
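A common concrete form of the uncertainty criterion above is entropy-based sampling: rank unlabeled samples by the entropy of the model's predicted class distribution and send the highest-entropy samples to annotators. A sketch with hypothetical clip IDs and probabilities:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(predictions, k=2):
    """Return the IDs of the k samples the model is least certain about.

    predictions: iterable of (sample_id, class_probabilities) pairs.
    """
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]

preds = {
    "clip_001": [0.98, 0.01, 0.01],  # confident: low priority
    "clip_002": [0.34, 0.33, 0.33],  # near-uniform: most uncertain
    "clip_003": [0.70, 0.20, 0.10],
}
print(select_for_annotation(preds.items(), k=2))  # ['clip_002', 'clip_003']
```

Other acquisition functions (margin sampling, ensemble disagreement, expected model change) slot into the same select-then-annotate loop.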

Computer Vision Fundamentals

Optical Flow

Optical flow is a dense motion field that describes, for every pixel in an image, the apparent velocity of that pixel between consecutive frames — encoded as a 2D vector indicating direction and magnitude of motion. Optical flow captures how objects and surfaces move through the scene and is used in robotics to detect moving obstacles, estimate camera ego-motion, and segment foreground objects from background. RAFT (Recurrent All-Pairs Field Transforms) is the standard model for computing optical flow in physical AI data enrichment pipelines.

Monocular Depth Estimation

Monocular depth estimation predicts a per-pixel depth map from a single RGB image, inferring scene geometry without requiring a stereo camera pair or active depth sensor. This is achieved by neural networks trained on large corpora of RGB-depth pairs, learning visual cues such as perspective foreshortening, object size, and texture gradient that correlate with distance. Depth Anything V2 is the current standard for monocular depth in physical AI pipelines. Monocular depth enables depth enrichment at scale for video datasets collected with standard single-lens cameras.

Pose Estimation

Pose estimation predicts the positions of anatomical landmarks — body joints, hand keypoints, or object corners — in 2D image coordinates or 3D space. Human body pose estimation produces skeleton representations used to understand how humans perform tasks, providing the reference demonstrations that robot learning systems imitate. Hand pose estimation localizes finger joints to capture dexterous manipulation in egocentric video. ViTPose is the standard vision transformer model for human pose estimation in physical AI data pipelines, trained on COCO Keypoints and MPII.

Hand-Object Interaction (HOI)

Hand-object interaction (HOI) refers to the detection, segmentation, and analysis of the contact relationship between human hands and objects in video — identifying which hand is touching which object, the contact region, grip type, and the resulting object state change. HOI annotations are critical for robotics training data because manipulation tasks are fundamentally about how hands (and by extension, robot end-effectors) interact with objects. HOI detection in egocentric video provides the ground-truth skill demonstrations that robot manipulation policies learn from.

Object Tracking

Object tracking maintains the identity of one or more objects across consecutive video frames, assigning consistent identifiers as objects move, become occluded, or change appearance. Tracking converts per-frame detections into temporally coherent object trajectories, which are essential for understanding how objects are manipulated over time. In physical AI training data, tracking links object instances across frames to enable identity-consistent annotations, trajectory prediction training, and the temporal association needed for action segmentation and reward learning.

Video Prediction

Video prediction is the task of generating plausible future video frames given a sequence of past frames, requiring a model to understand scene dynamics, object physics, and the temporal evolution of appearance. Video prediction models are a form of learned world model: they internalize how objects move, deform, and interact under physical constraints. Training video prediction models requires large corpora of real-world video with diverse motion patterns — not just static scene images — making egocentric and robotics video particularly valuable for this task.

SAM (Segment Anything Model)

SAM (Segment Anything Model) is a promptable image segmentation model developed by Meta AI that generates high-quality object masks from point, box, or mask prompts, without task-specific training. SAM can segment any object in an image — known or unknown — making it a general-purpose tool for annotation automation. SAM3 (the video-capable version) tracks and segments objects across video frames. Claru uses SAM3 as a standard layer in its enrichment pipeline to produce segmentation masks for every object in egocentric video collections.

Panoptic Segmentation

Panoptic segmentation combines semantic segmentation and instance segmentation into a unified output where every pixel is assigned both a class label and an instance identifier. Countable objects (things) such as cups, hands, and tools receive unique instance IDs, while background regions (stuff) such as floor, table, and wall receive class labels only. Panoptic segmentation provides the most complete pixel-level scene understanding, enabling robot systems to simultaneously know what type every surface is and which individual object is which without running two separate pipelines.

Robotics Fundamentals

Imitation Learning

Imitation learning is a class of robot learning methods in which a policy is trained to replicate the behavior of an expert demonstrator, learning from observations of how a human or expert robot performs a task rather than from trial-and-error exploration. The simplest form of imitation learning is behavioral cloning, which treats demonstration data as a supervised learning problem. More advanced approaches like DAgger and inverse reinforcement learning address the distributional shift problem that arises when the policy encounters states outside the demonstration distribution.

Behavioral Cloning (BC)

Behavioral cloning (BC) is the simplest form of imitation learning, treating demonstration data as a supervised learning problem: given an observation, predict the action the expert demonstrator took. A policy is trained by minimizing the difference between predicted and demonstrated actions across a dataset of (observation, action) pairs. BC is data-efficient and straightforward to implement but suffers from compounding errors when the policy encounters states slightly outside the demonstration distribution, since small mistakes at each step can compound into large deviations over a long trajectory.
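The supervised objective described above is just a regression loss over (observation, action) pairs. A toy sketch with a linear policy and an invented expert whose behavior is action = 2*obs[0] + 1*obs[1]; real BC replaces this with a neural network and gradient descent:

```python
def bc_loss(policy_weights, dataset):
    """Behavioral cloning objective: mean squared error between the action a
    (toy, linear) policy predicts and the action the demonstrator took."""
    total = 0.0
    for obs, demo_action in dataset:
        predicted = sum(w * o for w, o in zip(policy_weights, obs))
        total += (predicted - demo_action) ** 2
    return total / len(dataset)

# (observation, demonstrated_action) pairs from the hypothetical expert.
demos = [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 3.0)]

assert bc_loss([2.0, 1.0], demos) == 0.0  # a perfect clone of this expert
assert bc_loss([0.0, 0.0], demos) > 0.0   # an untrained policy scores worse
```

Note what the loss does not contain: any term about states outside the demonstrations, which is exactly why compounding errors arise at deployment.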

Sim-to-Real Gap

The sim-to-real gap refers to the performance degradation that occurs when a robot policy trained in simulation is deployed on physical hardware, caused by discrepancies between simulated and real-world visual appearance, physics, sensor noise, and actuator dynamics. Even photorealistic simulators produce textures, lighting, contact physics, and deformable object behavior that differ measurably from the real world. Bridging the sim-to-real gap requires either domain randomization during simulation training, real-world fine-tuning data, or both in combination.

Domain Randomization

Domain randomization is a simulation training technique that trains a policy across a wide range of randomized visual and physical simulation parameters — object textures, lighting colors, camera positions, friction coefficients, and object masses — so that the real world appears as just another variation in the training distribution. By training on many randomized environments, the policy learns representations that are robust to the specific parameter values, making it more likely to transfer to the real-world domain. Domain randomization reduces the sim-to-real gap without requiring large amounts of real-world data.
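Operationally, domain randomization amounts to re-sampling simulator parameters from broad ranges at the start of every training episode. A sketch with hypothetical parameter names and ranges; real ranges are tuned per task and simulator:

```python
import random

# Hypothetical per-episode randomization ranges (name: (low, high)).
RANDOMIZATION_RANGES = {
    "table_friction":  (0.4, 1.2),
    "object_mass_kg":  (0.05, 0.5),
    "light_intensity": (0.3, 1.5),
    "camera_jitter_m": (0.0, 0.02),
}

def sample_sim_params(rng=random):
    """Draw one randomized parameter set per episode, so the policy never
    trains on the same physics and lighting configuration twice."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

episode_params = sample_sim_params()
```

Each sampled dictionary would be pushed into the simulator before the episode begins; the real world then ideally falls inside the spanned distribution.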

Action Chunking

Action chunking is a technique in robot learning where the policy predicts a short sequence of future actions (a chunk) rather than a single action at each timestep, then executes that chunk before predicting the next one. Chunking reduces the effective frequency of policy inference, lowering latency demands on the policy network, and enables the policy to plan ahead within the chunk horizon. The Action Chunking with Transformers (ACT) method popularized this approach, demonstrating that chunks of 10-100 actions significantly improve performance on dexterous manipulation tasks compared to single-step action prediction.
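The execution pattern above (query once, execute a chunk open-loop, re-query) can be sketched as a control loop. The policy and environment here are stubs invented for illustration:

```python
def run_episode(policy, env, horizon=100, chunk_size=10):
    """Control loop with action chunking: query the policy once per chunk,
    then execute the whole predicted chunk before re-querying."""
    obs = env.reset()
    inference_calls = 0
    t = 0
    while t < horizon:
        chunk = policy(obs)[:chunk_size]  # chunk_size future actions
        inference_calls += 1
        for action in chunk:
            obs = env.step(action)
            t += 1
            if t >= horizon:
                break
    return inference_calls

# Stub policy and environment, just to show the call pattern.
class StubEnv:
    def reset(self): return 0.0
    def step(self, action): return action

stub_policy = lambda obs: [0.0] * 10  # always predicts a 10-action chunk
print(run_episode(stub_policy, StubEnv()))  # 10: one inference per 10 steps
```

With chunk_size=1 this degenerates to standard single-step prediction and 100 inference calls; the ACT method additionally blends overlapping chunks (temporal ensembling) rather than executing each chunk fully open-loop.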

Diffusion Policy

Diffusion Policy is a robot learning method that frames action prediction as a conditional denoising diffusion process: the policy generates action sequences by iteratively removing noise from a random sample, conditioned on the current visual observation. Diffusion models naturally represent multi-modal action distributions — situations where multiple different actions are all correct responses to the same observation — which standard regression-based behavioral cloning cannot capture. Diffusion Policy achieves state-of-the-art performance on dexterous manipulation benchmarks and underlies the action heads in several commercial humanoid platforms.

Reward Model

A reward model is a neural network trained to predict a scalar quality score for a given AI output — a robot trajectory, a text response, or a video — based on human preference annotations. The reward model encodes human judgment as a differentiable function, allowing reinforcement learning algorithms to optimize a policy toward outputs that humans prefer. Reward models trained on low-quality or inconsistent preference annotations invite reward hacking: policies that score highly on the reward model while producing outputs humans actually dislike. High inter-annotator agreement is essential for reliable reward model training.
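Reward models are commonly fit to pairwise preference data with the Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected), which is small when the model scores the human-preferred output higher. A minimal sketch of the loss itself, with the network replaced by raw scalar scores:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward model training:
    -log sigmoid(r_chosen - r_rejected). Low when the reward model agrees
    with the human preference, high when it contradicts it."""
    sigmoid = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    return -math.log(sigmoid)

print(preference_loss(2.0, 0.0))  # small: model ranks the preferred output higher
print(preference_loss(0.0, 2.0))  # large: model contradicts the annotators
```

Averaged over a dataset of (chosen, rejected) trajectory pairs and backpropagated into the scoring network, this loss is what turns preference annotations into a differentiable reward signal.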

6-DOF Grasp Planning

6-DOF grasp planning determines the full six-degrees-of-freedom pose — three translational (x, y, z) and three rotational (roll, pitch, yaw) — at which a robot end-effector should approach and grasp an object. Unlike top-down planar grasping, 6-DOF planning considers arbitrary object geometries and orientations, enabling grasps from the side, below, or at any angle. Training 6-DOF grasp networks requires point cloud or RGB-D data paired with labels specifying grasp quality scores or binary success labels for sampled grasp poses.
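A 6-DOF grasp pose is typically materialized as a 4x4 homogeneous transform built from the three translations and three rotations. A sketch using the common Z-Y-X convention, R = Rz(yaw) Ry(pitch) Rx(roll); note that rotation conventions vary between robotics stacks, so this choice is an assumption:

```python
import math

def grasp_pose_matrix(x, y, z, roll, pitch, yaw):
    """4x4 homogeneous transform for a 6-DOF grasp pose from translations
    (meters) and roll-pitch-yaw (radians), with R = Rz(yaw) Ry(pitch) Rx(roll)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, x],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, y],
        [-sp,     cp * sr,                cp * cr,                z],
        [0.0,     0.0,                    0.0,                    1.0],
    ]

# A side grasp 0.3 m in front of the robot base, rolled 90 degrees about x:
T = grasp_pose_matrix(0.3, 0.0, 0.2, math.pi / 2, 0.0, 0.0)
```

A grasp network scores many such candidate poses against the observed point cloud; the executed grasp is the highest-scoring pose that is also kinematically reachable.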

Models & Architectures

RAFT (Optical Flow)

RAFT (Recurrent All-Pairs Field Transforms) is a deep learning architecture for optical flow estimation that builds a 4D correlation volume for all pairs of pixels between two frames and iteratively updates a flow estimate using recurrent refinement steps. RAFT achieves state-of-the-art accuracy on optical flow benchmarks (Sintel, KITTI) and runs efficiently enough for large-scale video processing pipelines. Claru uses RAFT as the standard optical flow model in its enrichment pipeline, computing dense motion fields for every consecutive frame pair in egocentric video collections.

Depth Anything V2

Depth Anything V2 is a monocular depth estimation model developed by researchers at the University of Hong Kong and TikTok, trained on a combination of labeled real-world data and large-scale synthetic data to produce high-quality relative and metric depth maps from single RGB images. The V2 release improved fine-grained detail accuracy on transparent, reflective, and occluded surfaces compared to V1, making it more reliable for real-world robotics enrichment. Depth Anything V2 is the standard depth estimation model in Claru's video enrichment pipeline.

ViTPose

ViTPose is a human pose estimation model that uses a plain Vision Transformer (ViT) backbone, demonstrating that the transformer architecture that dominates NLP and image classification also achieves state-of-the-art performance on keypoint detection tasks without task-specific architectural modifications. ViTPose is trained on COCO Keypoints and MPII Human Pose datasets and supports whole-body pose including body, hand, face, and foot keypoints. Claru uses ViTPose to extract 2D and 3D joint positions from egocentric video as a standard enrichment layer for robotics training data.

Open X-Embodiment (OXE)

Open X-Embodiment (OXE) is a large-scale robot learning dataset released by Google DeepMind and collaborators in 2023, aggregating over 1 million robot trajectories from 22 different robot embodiments across 21 research institutions. OXE provides the broadest available collection of real-robot manipulation demonstrations and was used to train the RT-X family of models, demonstrating that cross-embodiment pre-training improves policy performance on new robots. OXE is publicly available but covers a limited set of robot platforms, environments, and task categories compared to what production robotics teams require.

Diffusion Transformer (DiT)

A Diffusion Transformer (DiT) is a neural network architecture that applies the transformer architecture — with self-attention and feed-forward layers arranged in a sequence — as the backbone of a diffusion model, replacing the U-Net architecture that dominated earlier diffusion model designs. DiT models scale more predictably with model size and training data than U-Net-based diffusion models and have become the architecture of choice for video generation and world model training. Sora, Stable Diffusion 3, and several robotics world models use DiT-based architectures.

Vision Transformer (ViT)

A Vision Transformer (ViT) is an image recognition architecture that applies the transformer architecture directly to images by splitting an image into fixed-size patches, linearly embedding each patch, and processing the sequence of patch embeddings with standard transformer self-attention layers. ViT models, introduced by Dosovitskiy et al. in 2021, achieve state-of-the-art performance on image classification when trained on large enough datasets and have become the standard backbone for a wide range of vision models including object detection, segmentation, pose estimation, and VLA models.

GR00T N1 (NVIDIA)

GR00T N1 is a general-purpose humanoid robot foundation model developed by NVIDIA, announced in 2025, designed to serve as a pre-trained base that robotics teams can fine-tune for specific humanoid platforms and tasks. GR00T N1 processes multimodal inputs including video, text, and proprioceptive state, and outputs motor actions. The model was trained on a combination of real robot demonstrations from the Open X-Embodiment dataset, synthetic simulation data from NVIDIA Isaac, and synthetic video generated from physical simulations, representing a hybrid data strategy for humanoid generalization.

pi-zero (Physical Intelligence)

pi-zero is a general-purpose robot foundation model developed by Physical Intelligence (pi), released in late 2024. pi-zero uses a flow-matching action head built on top of a pre-trained vision-language model backbone, enabling zero-shot and few-shot generalization to new tasks and robot embodiments. Physical Intelligence trained pi-zero on a large proprietary dataset of robot demonstrations spanning multiple robot platforms and diverse manipulation tasks, including dexterous tasks like laundry folding, table bussing, and grocery bagging that previous general-purpose models struggled to perform reliably.

Physical AI & Robotics Training Data Glossary

Definitions for ML engineers building robots, embodied agents, and world models.

56 terms — last updated April 2026

Physical AI Systems

VLA Model (Vision-Language-Action)

A VLA model is a neural network that takes visual observations and natural language instructions as input and outputs robot actions. VLA models unify perception, language understanding, and motor control in a single architecture, allowing a robot to interpret commands like “pick up the red cup” and produce the joint trajectories or end-effector poses required to execute them. Training requires synchronized triplets of image frames, instruction text, and action labels collected through teleoperation or human demonstration.
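
The synchronized triplet described above can be sketched as a minimal data record. This is an illustrative schema, not the format of any specific VLA dataset; field names and the 7-dimensional action vector are assumptions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical minimal schema for one VLA training sample: a camera frame,
# the language instruction, and the action label recorded at that timestep.
@dataclass
class VLASample:
    image_path: str      # RGB frame from the robot's camera
    instruction: str     # natural language command
    action: List[float]  # e.g. 7-DoF: end-effector delta pose + gripper state

sample = VLASample(
    image_path="frames/000123.jpg",
    instruction="pick up the red cup",
    action=[0.01, -0.02, 0.03, 0.0, 0.0, 0.1, 1.0],
)
assert len(sample.action) == 7
```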

Embodied AI

Embodied AI refers to artificial intelligence systems that perceive and act within a physical environment through a body — a robot, drone, or autonomous vehicle — rather than processing purely symbolic or textual information. Embodied agents must handle real-time sensorimotor loops, spatial reasoning, and physical uncertainty that text-only or image-only models do not encounter. Training embodied AI requires egocentric video, depth data, proprioceptive streams, and action labels that reflect the agent's embodiment.

World Model

A world model is a learned internal representation that allows an agent to simulate how its environment will evolve in response to its actions, without executing those actions in the real world. World models enable planning, counterfactual reasoning, and sample-efficient reinforcement learning by letting an agent imagine trajectories before acting. For physical AI, training world models requires diverse real-world video that captures the causal structure of physical interactions — how objects move, deform, and respond to contact forces.

Humanoid Robot

A humanoid robot is a robotic system with a body morphology that approximates the human form, typically including two legs, two arms, a torso, and a head. This morphology allows humanoids to operate in environments designed for humans — stairs, doorways, workbenches — without infrastructure modifications. Training humanoid policies requires full-body motion data, bimanual manipulation demonstrations, whole-body pose annotations, and egocentric video from head-mounted cameras at approximately the eye height of a standing human.

Visuomotor Policy

A visuomotor policy is a learned mapping from visual observations — camera images or video frames — directly to motor commands or control actions. Rather than first building an explicit scene representation and then planning, visuomotor policies compute actions end-to-end from pixels. This approach is data-intensive: the policy must generalize across lighting, viewpoint, and object variation. Effective visuomotor policies are trained on large corpora of egocentric demonstration video paired with synchronized action labels.

Foundation Model for Robotics

A foundation model for robotics is a large pre-trained model that serves as a general-purpose base for many downstream robot learning tasks, analogous to how GPT-4 serves as a base for language applications. These models are pre-trained on broad, cross-embodiment datasets and then fine-tuned for specific robots and tasks. Examples include OpenVLA, Octo, and GR00T N1. Training requires large, diverse datasets spanning many robot types, environments, and task categories.

Cross-Embodiment Data

Cross-embodiment data is training data collected from multiple robot morphologies — different arm designs, grippers, camera configurations, and kinematic chains — assembled into a single dataset to train policies that generalize across robot types. The Open X-Embodiment dataset is the canonical example, combining trajectories from 22 different robots. Cross-embodiment training reduces the need to collect a large demonstration dataset from scratch for every new robot platform, enabling faster deployment of pre-trained foundation models on novel hardware.

Physical AI

Physical AI refers to AI systems that perceive, reason about, and act within the physical world, as opposed to systems that operate purely in digital or linguistic domains. Physical AI encompasses robots, embodied agents, autonomous vehicles, and world models — any system where the AI must bridge the gap between perception and physical action. The defining data requirement of physical AI is grounded, multi-modal training data: video paired with depth, force, pose, and action information that reflects how the real physical world behaves.

Data Modalities

Egocentric Video

Egocentric video is first-person video captured from a camera mounted on or near a person's or robot's head, recording the world from the perspective of the agent performing a task. This viewpoint directly mirrors what a robot's on-board camera would see during operation, making egocentric video the most natural training signal for visuomotor policies and embodied AI. Key characteristics include frequent hand-object interactions, dynamic viewpoint changes, and the full visual context of task execution including workspace layout.

Teleoperation Data

Teleoperation data consists of paired observation-action recordings captured while a human operator remotely controls a physical robot to complete tasks. The human drives the robot through VR controllers, exoskeletons, or leader-follower setups, and the system records both what the robot's cameras see and the exact joint positions, end-effector poses, and gripper states the human commands. This creates ground-truth action labels at the deployment embodiment, making teleoperation data particularly valuable for behavior cloning and VLA fine-tuning.

Manipulation Trajectory

A manipulation trajectory is a time-series record of a robot arm executing a task, capturing the sequence of end-effector positions, orientations, gripper states, and joint angles over the duration of a manipulation action such as grasping, lifting, inserting, or assembling. Trajectories are the primary training signal for imitation learning and behavior cloning in manipulation robotics. High-quality trajectories require sub-16ms temporal alignment between visual observations and action states, and include metadata about object identity, task phase, and success or failure outcome.
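
The sub-16ms alignment requirement above can be checked mechanically. A minimal sketch, with illustrative timestamps and sensor rates:

```python
# Sketch: verify that every camera frame has an action state within 16 ms.
def max_alignment_error_ms(frame_ts, action_ts):
    """Worst-case gap (ms) between a frame and its nearest action sample."""
    return max(min(abs(f - a) for a in action_ts) for f in frame_ts)

frames = [i * 1000.0 / 30.0 for i in range(10)]    # 30 Hz camera
actions = [i * 1000.0 / 100.0 for i in range(34)]  # 100 Hz controller states
assert max_alignment_error_ms(frames, actions) <= 16.0
```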

Depth Data

Depth data encodes the distance from a camera to every point in a scene as a per-pixel value, producing a depth map that complements a standard RGB image with 3D spatial information. For robot learning, depth data enables grasp planning, obstacle avoidance, and 3D scene understanding that pure RGB images cannot support. Depth can be measured directly using LiDAR or structured light sensors, or estimated from monocular RGB video using models like Depth Anything V2. Claru enriches raw video with per-frame depth maps as a standard annotation layer.
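
A depth value turns a pixel into a 3D point via the standard pinhole camera model. A minimal sketch; the intrinsics (fx, fy, cx, cy) are illustrative values, not calibration from any real camera:

```python
# Back-project a pixel with known depth into 3D camera coordinates:
#   X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy
def backproject(u, v, depth_m, fx, fy, cx, cy):
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# A pixel at the principal point maps straight down the optical axis.
assert backproject(320, 240, 1.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0) == (0.0, 0.0, 1.5)
```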

RGB-D Data

RGB-D data pairs standard color (red-green-blue) video frames with aligned depth frames captured at the same moment, providing both appearance and geometry for every scene. The depth channel (D) gives each pixel a distance value in addition to its color, enabling direct 3D reconstruction and precise grasp pose estimation. RGB-D cameras such as Intel RealSense and Microsoft Azure Kinect are common in robotics research. RGB-D data is particularly valuable for manipulation tasks where the 3D geometry of objects directly determines grasp feasibility.

Point Cloud

A point cloud is a set of data points in three-dimensional space, where each point has (x, y, z) coordinates and often additional attributes such as color or surface normals, representing the geometry of a scanned object or environment. Point clouds are generated by LiDAR sensors, structured light depth cameras, or through reconstruction from RGB-D sequences. In robotics, point clouds feed into grasp planning algorithms, 3D object detection, and occupancy mapping. Standard formats include PLY and PCD files compatible with the Open3D and PCL libraries.

Proprioceptive Data

Proprioceptive data captures a robot's internal state — joint angles, joint velocities, end-effector position and orientation, gripper force, and torque readings — without relying on external sensors like cameras. Proprioception provides the robot's sense of its own body position in space, analogous to human kinesthetic awareness. In manipulation and locomotion policies, proprioceptive data is typically concatenated with visual observations as part of the observation vector, giving the policy information about its current configuration and the forces being applied at each joint.

Synthetic Data (for Robotics)

Synthetic data for robotics is training data generated in simulation environments such as Isaac Sim, MuJoCo, Genesis, or Habitat, rather than collected from physical robot systems or human demonstrations. Synthetic data offers unlimited scale and perfect ground-truth labels — exact object poses, contact forces, and joint states — at a fraction of the cost of real-world collection. Its primary limitation is the sim-to-real gap: visual and physical discrepancies between the simulated and real-world distributions that cause policies to fail on deployment hardware.

Annotation Types

Keypoint Annotation

Keypoint annotation marks specific landmark locations on objects or body parts — such as fingertips, joint centers, or object corners — as (x, y) coordinates within an image frame or (x, y, z) coordinates in 3D space. For human body pose, standard keypoint sets include the COCO 17-point skeleton and the OpenPose 25-point body model. For hand-object interaction, keypoints mark fingertip positions, wrist center, and object contact points. Keypoint annotations are the primary training signal for pose estimation models including ViTPose.
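
In the COCO format, keypoints are stored as a flat `[x1, y1, v1, x2, y2, v2, ...]` list, where `v` is a visibility flag (0 = not labeled, 1 = labeled but occluded, 2 = labeled and visible). A small parsing sketch, using only the first three landmarks of the 17-point skeleton:

```python
COCO_NAMES = ["nose", "left_eye", "right_eye"]  # first 3 of the 17-point skeleton

def parse_keypoints(flat, names):
    """Flat COCO keypoint list -> {landmark: (x, y, visibility)}."""
    return {name: (flat[3 * i], flat[3 * i + 1], flat[3 * i + 2])
            for i, name in enumerate(names)}

kps = parse_keypoints([120, 80, 2, 130, 75, 2, 110, 75, 1], COCO_NAMES)
assert kps["nose"] == (120, 80, 2)       # visible
assert kps["right_eye"][2] == 1          # labeled but occluded
```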

Temporal Annotation

Temporal annotation marks the start and end timestamps of events, actions, or state transitions within a video, creating labeled segments along the time axis. In robotics, temporal annotations define the boundaries of discrete action phases — when a grasp begins and ends, when an object is in transit, when contact is made or released. Temporal precision directly affects policy performance: annotations at 100ms granularity miss the frame-level timing (roughly 33ms per frame at 30fps) that manipulation policies need to learn smooth, reactive behavior.

Action Segmentation

Action segmentation is the task of partitioning a video into temporally contiguous segments and assigning an action class label to each segment — for example, labeling consecutive frames as reach, grasp, transport, and place. Unlike activity recognition, which assigns a single label to an entire video, action segmentation produces a frame-level or segment-level label sequence. Action segmentation annotations are essential for training manipulation policies that decompose complex tasks into primitive actions and for generating the temporal supervision required by sequence models.
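
The frame-level and segment-level views are interchangeable. A minimal sketch of collapsing a per-frame label sequence into (label, start_frame, end_frame) segments, with end indices inclusive:

```python
def frames_to_segments(labels):
    """Per-frame labels -> list of (label, start, end) segments, end inclusive."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i - 1))
            start = i
    return segments

seq = ["reach"] * 3 + ["grasp"] * 2 + ["transport"] * 4 + ["place"] * 1
assert frames_to_segments(seq) == [
    ("reach", 0, 2), ("grasp", 3, 4), ("transport", 5, 8), ("place", 9, 9)]
```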

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in an image — for example, labeling all pixels belonging to cups as cup, all countertop pixels as countertop, and all hand pixels as hand. Unlike object detection, which produces bounding boxes, semantic segmentation provides pixel-precise region boundaries. For robot manipulation, semantic segmentation is used to identify graspable objects, avoid obstacles, and segment the workspace layout. Models like SAM (Segment Anything Model) produce high-quality semantic masks that Claru uses as a standard enrichment layer.

Instance Segmentation

Instance segmentation extends semantic segmentation by distinguishing individual object instances of the same class — for example, separately labeling cup_1, cup_2, and cup_3 rather than assigning all three to a single cup class. This enables tracking individual objects across frames, understanding cluttered workspaces where multiple instances of the same object type are present simultaneously, and generating the per-object identity labels required for tasks like multi-object manipulation and assembly sequencing.

Activity Annotation

Activity annotation labels what a person or robot is doing in a video at a coarser temporal granularity than action segmentation — for example, labeling a 30-second clip as preparing breakfast or repairing a bicycle. Activity labels provide high-level semantic context that complements fine-grained action segmentation. In egocentric video datasets, activity annotations define the top-level task category and are used to filter and stratify training data, ensuring that manipulation policies train on task-relevant clips rather than unrelated background footage.

Bounding Box Annotation

Bounding box annotation draws the smallest axis-aligned rectangle that fully encloses an object within an image frame, labeled with an object class identifier. Bounding boxes are the most common object detection annotation format and provide approximate spatial localization without the per-pixel precision of segmentation masks. In robotics, bounding boxes are used for object detection, workspace analysis, and as an input to downstream grasp planning pipelines. Temporal sequences of bounding boxes across video frames provide the training signal for object tracking models.
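
Boxes are typically stored as (x_min, y_min, x_max, y_max) and compared with intersection-over-union (IoU), the standard matching metric for detection and tracking. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

assert iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1 / 7   # overlap 1, union 7
assert iou((0, 0, 1, 1), (2, 2, 3, 3)) == 0.0     # disjoint boxes
```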

Preference Annotation (RLHF)

Preference annotation collects human judgments about which of two or more AI outputs is preferred, providing the training signal for reward models used in reinforcement learning from human feedback (RLHF). Annotators compare pairs of model outputs — robot trajectories, generated videos, text responses — and label which is better along specified dimensions. In robotics, preference annotations evaluate trajectory smoothness, task success, and natural motion. In video generation, they assess motion quality, fidelity, and text-video alignment across model configurations.

Data Quality & Pipelines

Data Enrichment

Data enrichment is the process of augmenting raw collected data with additional annotation layers — depth maps, segmentation masks, pose estimates, optical flow, captions, and action labels — that downstream models need but that are not present in the original capture. Rather than delivering raw video, enrichment pipelines run automated models (Depth Anything V2, ViTPose, SAM, RAFT) and human annotation passes to produce a multi-layer dataset ready for direct use in training. Enrichment is distinct from annotation: it adds derived signals, not just labels assigned by humans.

Benchmark Curation

Benchmark curation is the construction of a held-out evaluation dataset used to measure model performance on a specific task or capability. A well-curated benchmark is representative of the deployment distribution, covers edge cases and failure modes, and has high-quality ground-truth labels. In physical AI, benchmarks are used to compare robot policies across manipulation difficulty, environment diversity, and task generalization. Benchmark curation involves selecting diverse samples, verifying annotation quality, and preventing contamination between training and evaluation splits.

Data Deduplication

Data deduplication identifies and removes near-duplicate samples from a training dataset that would cause the model to overfit to repeated examples rather than learning the true underlying distribution. In video datasets, deduplication operates at the frame level (perceptual hashing), clip level (embedding similarity), or trajectory level (action sequence similarity). Effective deduplication improves training efficiency and generalization: models trained on deduplicated datasets often achieve better downstream performance with less compute than those trained on raw, redundant corpora.
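
Frame-level perceptual hashing can be sketched with a tiny average hash: threshold each cell of a downsampled grayscale grid against the grid mean, then compare hashes by Hamming distance. Real pipelines use larger grids and learned embeddings; the 2x2 grids below are purely illustrative.

```python
def average_hash(grid):
    """2D list of grayscale values -> tuple of bits (1 if above the grid mean)."""
    flat = [v for row in grid for v in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if v > mean else 0 for v in flat)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

a = [[10, 200], [15, 190]]
b = [[12, 205], [14, 188]]   # near-duplicate of a: identical hash
c = [[200, 10], [190, 15]]   # mirrored content: distinct hash
assert hamming(average_hash(a), average_hash(b)) == 0
assert hamming(average_hash(a), average_hash(c)) > 0
```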

Inter-Annotator Agreement (IAA)

Inter-annotator agreement (IAA) is a metric that quantifies the degree to which independent annotators assign the same labels to the same data samples, measuring the reliability and consistency of an annotation process. High IAA indicates that the task is well-defined and the guidelines are clear. Common IAA metrics include Cohen's kappa for pairwise agreement, Fleiss' kappa for multiple annotators, and Krippendorff's alpha for ordinal or continuous scales. Claru monitors Krippendorff's alpha as a primary quality signal, with a target threshold of 0.85 or above for preference annotation tasks.
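
Cohen's kappa corrects observed agreement for the agreement two annotators would reach by chance given their label frequencies. A minimal sketch for categorical labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

a = ["yes", "yes", "no", "yes", "no", "no"]
b = ["yes", "yes", "no", "no", "no", "no"]
assert 0.6 < cohens_kappa(a, b) < 0.7   # substantial but imperfect agreement
```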

RLHF (Reinforcement Learning from Human Feedback)

RLHF is a training paradigm in which a reward model is trained on human preference annotations — judgments about which AI outputs are better — and then used to fine-tune a base model through reinforcement learning to produce outputs humans prefer. RLHF was central to the training of InstructGPT, ChatGPT, and Claude. In robotics, RLHF is applied to train reward models that evaluate trajectory quality, enabling policies to improve from human evaluations of robot behavior rather than requiring explicit reward engineering.

Data Quality Scoring

Data quality scoring assigns a numerical quality score to each sample in a dataset based on criteria such as annotation correctness, visual clarity, task relevance, and diversity contribution. Quality scores are used to filter out low-quality samples before training, weight samples during training, or prioritize which samples to re-annotate. In video datasets, quality scoring evaluates factors including motion blur, occlusion severity, camera calibration drift, and action completeness. Automated quality scoring reduces the labeling load by focusing human review on borderline samples rather than clearly acceptable or rejected ones.

Dataset Diversity

Dataset diversity measures the range of variation in a training corpus across dimensions that matter for model generalization — scene appearance, lighting, object category, geographic location, task type, and operator behavior. A diverse dataset reduces overfitting to specific environments and improves zero-shot performance in novel settings. In robotics, diversity is measured along axes including environment category (kitchen, warehouse, outdoor), object type (rigid, deformable, transparent), and viewpoint (wrist camera, head camera, external camera). Claru tracks diversity coverage explicitly across all collection campaigns.

Active Learning

Active learning is a data collection and annotation strategy in which a model identifies which unlabeled samples it is most uncertain about, and those samples are prioritized for human annotation. This concentrates labeling effort on the examples that will most improve model performance, reducing the total annotation volume required to reach a target accuracy. In robotics, active learning selects demonstration scenarios that expose gaps in the current policy — edge cases, failure modes, or underrepresented environments — for targeted data collection rather than random sampling.
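
The simplest selection rule is uncertainty sampling: rank unlabeled samples by predictive entropy and send the top-k to annotators. A minimal sketch; the clip ids and class probabilities are illustrative:

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, k):
    """predictions: {sample_id: probability list} -> k most uncertain ids."""
    return sorted(predictions, key=lambda s: entropy(predictions[s]), reverse=True)[:k]

preds = {
    "clip_001": [0.98, 0.01, 0.01],   # confident -> skip
    "clip_002": [0.40, 0.35, 0.25],   # near-uniform -> prioritize for labeling
    "clip_003": [0.55, 0.40, 0.05],
}
assert select_for_labeling(preds, 1) == ["clip_002"]
```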

Computer Vision Fundamentals

Optical Flow

Optical flow is a dense motion field that describes, for every pixel in an image, the apparent velocity of that pixel between consecutive frames — encoded as a 2D vector indicating direction and magnitude of motion. Optical flow captures how objects and surfaces move through the scene and is used in robotics to detect moving obstacles, estimate camera ego-motion, and segment foreground objects from background. RAFT (Recurrent All-Pairs Field Transforms) is the standard model for computing optical flow in physical AI data enrichment pipelines.

Monocular Depth Estimation

Monocular depth estimation predicts a per-pixel depth map from a single RGB image, inferring scene geometry without requiring a stereo camera pair or active depth sensor. This is achieved by neural networks trained on large corpora of RGB-depth pairs, learning visual cues such as perspective foreshortening, object size, and texture gradient that correlate with distance. Depth Anything V2 is the current standard for monocular depth in physical AI pipelines. Monocular depth enables depth enrichment at scale for video datasets collected with standard single-lens cameras.

Pose Estimation

Pose estimation predicts the positions of anatomical landmarks — body joints, hand keypoints, or object corners — in 2D image coordinates or 3D space. Human body pose estimation produces skeleton representations used to understand how humans perform tasks, providing the reference demonstrations that robot learning systems imitate. Hand pose estimation localizes finger joints to capture dexterous manipulation in egocentric video. ViTPose is the standard vision transformer model for human pose estimation in physical AI data pipelines, trained on COCO Keypoints and MPII.

Hand-Object Interaction (HOI)

Hand-object interaction (HOI) refers to the detection, segmentation, and analysis of the contact relationship between human hands and objects in video — identifying which hand is touching which object, the contact region, grip type, and the resulting object state change. HOI annotations are critical for robotics training data because manipulation tasks are fundamentally about how hands (and by extension, robot end-effectors) interact with objects. HOI detection in egocentric video provides the ground-truth skill demonstrations that robot manipulation policies learn from.

Object Tracking

Object tracking maintains the identity of one or more objects across consecutive video frames, assigning consistent identifiers as objects move, become occluded, or change appearance. Tracking converts per-frame detections into temporally coherent object trajectories, which are essential for understanding how objects are manipulated over time. In physical AI training data, tracking links object instances across frames to enable identity-consistent annotations, trajectory prediction training, and the temporal association needed for action segmentation and reward learning.

Video Prediction

Video prediction is the task of generating plausible future video frames given a sequence of past frames, requiring a model to understand scene dynamics, object physics, and the temporal evolution of appearance. Video prediction models are a form of learned world model: they internalize how objects move, deform, and interact under physical constraints. Training video prediction models requires large corpora of real-world video with diverse motion patterns — not just static scene images — making egocentric and robotics video particularly valuable for this task.

SAM (Segment Anything Model)

SAM (Segment Anything Model) is a promptable image segmentation model developed by Meta AI that generates high-quality object masks from point, box, or mask prompts, without task-specific training. SAM can segment any object in an image — known or unknown — making it a general-purpose tool for annotation automation. Later releases extended the model to video: SAM3 tracks and segments objects across video frames. Claru uses SAM3 as a standard layer in its enrichment pipeline to produce segmentation masks for every object in egocentric video collections.

Panoptic Segmentation

Panoptic segmentation combines semantic segmentation and instance segmentation into a unified output where every pixel is assigned both a class label and an instance identifier. Countable objects (things) such as cups, hands, and tools receive unique instance IDs, while background regions (stuff) such as floor, table, and wall receive class labels only. Panoptic segmentation provides the most complete pixel-level scene understanding, enabling robot systems to simultaneously know what type every surface is and which individual object is which without running two separate pipelines.
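
One common way to store a panoptic output is a single integer map per pixel. The Cityscapes-style convention below (thing pixels store class_id * 1000 + instance_index, stuff pixels store just class_id) is one of several encodings in use; the class ids here are illustrative and assumed to be below 1000:

```python
def encode(class_id, instance_index=None):
    """Pack (class, instance) into one panoptic id; stuff has no instance."""
    return class_id * 1000 + instance_index if instance_index is not None else class_id

def decode(panoptic_id):
    """Recover (class_id, instance_index); instance is None for stuff pixels."""
    if panoptic_id >= 1000:               # a "thing" pixel
        return panoptic_id // 1000, panoptic_id % 1000
    return panoptic_id, None              # a "stuff" pixel

assert decode(encode(17, 2)) == (17, 2)   # e.g. cup instance #2
assert decode(encode(5)) == (5, None)     # e.g. floor, class label only
```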

Robotics Fundamentals

Imitation Learning

Imitation learning is a class of robot learning methods in which a policy is trained to replicate the behavior of an expert demonstrator, learning from observations of how a human or expert robot performs a task rather than from trial-and-error exploration. The simplest form of imitation learning is behavioral cloning, which treats demonstration data as a supervised learning problem. More advanced approaches like DAgger and inverse reinforcement learning address the distributional shift problem that arises when the policy encounters states outside the demonstration distribution.

Behavioral Cloning (BC)

Behavioral cloning (BC) is the simplest form of imitation learning, treating demonstration data as a supervised learning problem: given an observation, predict the action the expert demonstrator took. A policy is trained by minimizing the difference between predicted and demonstrated actions across a dataset of (observation, action) pairs. BC is data-efficient and straightforward to implement but suffers from compounding errors when the policy encounters states slightly outside the demonstration distribution, since small mistakes at each step can compound into large deviations over a long trajectory.
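
The supervised objective can be made concrete with a toy one-dimensional policy. This is a sketch under heavy simplification — a scalar linear policy, hand-written gradient descent, and synthetic demonstrations where the expert acts as twice the observation:

```python
def mse_loss(w, data):
    """Mean-squared error between predicted (w * obs) and demonstrated actions."""
    return sum((w * obs - act) ** 2 for obs, act in data) / len(data)

def gradient_step(w, data, lr=0.1):
    """One gradient-descent step on the BC objective."""
    grad = sum(2 * (w * obs - act) * obs for obs, act in data) / len(data)
    return w - lr * grad

demos = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # expert policy: action = 2 * obs
w = 0.0
for _ in range(50):
    w = gradient_step(w, demos)
assert abs(w - 2.0) < 1e-3       # the cloned policy recovers the expert mapping
assert mse_loss(w, demos) < 1e-6
```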

Sim-to-Real Gap

The sim-to-real gap refers to the performance degradation that occurs when a robot policy trained in simulation is deployed on physical hardware, caused by discrepancies between simulated and real-world visual appearance, physics, sensor noise, and actuator dynamics. Even photorealistic simulators produce textures, lighting, contact physics, and deformable object behavior that differ measurably from the real world. Bridging the sim-to-real gap requires either domain randomization during simulation training, real-world fine-tuning data, or both in combination.

Domain Randomization

Domain randomization is a simulation training technique that trains a policy across a wide range of randomized visual and physical simulation parameters — object textures, lighting colors, camera positions, friction coefficients, and object masses — so that the real world appears as just another variation in the training distribution. By training on many randomized environments, the policy learns representations that are robust to the specific parameter values, making it more likely to transfer to the real-world domain. Domain randomization reduces the sim-to-real gap without requiring large amounts of real-world data.
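
In practice this amounts to sampling a fresh simulation configuration per episode. A minimal sketch; the parameter names and ranges are illustrative, not from any specific simulator:

```python
import random

# Hypothetical per-episode randomization ranges (min, max).
RANDOMIZATION_RANGES = {
    "friction":          (0.4, 1.2),
    "object_mass_kg":    (0.05, 0.8),
    "light_intensity":   (0.3, 1.0),
    "camera_x_offset_m": (-0.02, 0.02),
}

def sample_episode_params(rng):
    """Draw one randomized configuration for the next training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

params = sample_episode_params(random.Random(0))
assert all(lo <= params[k] <= hi for k, (lo, hi) in RANDOMIZATION_RANGES.items())
```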

Action Chunking

Action chunking is a technique in robot learning where the policy predicts a short sequence of future actions (a chunk) rather than a single action at each timestep, then executes that chunk before predicting the next one. Chunking reduces the effective frequency of policy inference, lowering latency demands on the policy network, and enables the policy to plan ahead within the chunk horizon. The Action Chunking with Transformers (ACT) method popularized this approach, demonstrating that chunks of 10-100 actions significantly improve performance on dexterous manipulation tasks compared to single-step action prediction.
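
The receding-horizon execution loop above can be sketched as follows. `policy`, `get_obs`, and `execute` are stand-ins for the real policy network, observation source, and robot interface:

```python
def run_with_chunking(policy, get_obs, execute, steps, chunk_size):
    """Query the policy once per chunk; execute each action in the chunk."""
    queries, t = 0, 0
    while t < steps:
        chunk = policy(get_obs())[:chunk_size]   # predict a chunk of future actions
        queries += 1
        for action in chunk:
            execute(action)
            t += 1
            if t >= steps:
                break
    return queries

executed = []
queries = run_with_chunking(lambda obs: [0.0] * 20, lambda: None,
                            executed.append, steps=100, chunk_size=20)
assert queries == 5 and len(executed) == 100   # one inference per 20 control steps
```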

Diffusion Policy

Diffusion Policy is a robot learning method that frames action prediction as a conditional denoising diffusion process: the policy generates action sequences by iteratively removing noise from a random sample, conditioned on the current visual observation. Diffusion models naturally represent multi-modal action distributions — situations where multiple different actions are all correct responses to the same observation — which standard regression-based behavioral cloning cannot capture. Diffusion Policy achieves state-of-the-art performance on dexterous manipulation benchmarks and underlies the action heads in several commercial humanoid platforms.

Reward Model

A reward model is a neural network trained to predict a scalar quality score for a given AI output — a robot trajectory, a text response, or a video — based on human preference annotations. The reward model encodes human judgment as a differentiable function, allowing reinforcement learning algorithms to optimize a policy toward outputs that humans prefer. Reward models trained on low-quality or inconsistent preference annotations invite reward hacking: policies learn to score highly on the reward model while producing outputs humans actually dislike. High inter-annotator agreement is essential for reliable reward model training.
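
Reward models are commonly fit with a pairwise (Bradley-Terry) objective on preference pairs: minimize -log sigmoid(r_chosen - r_rejected), which pushes the preferred output's score above the rejected one's. A minimal sketch with illustrative scores:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

good = preference_loss(2.0, -1.0)   # model agrees with the annotator: low loss
bad = preference_loss(-1.0, 2.0)    # model contradicts the annotator: high loss
assert good < bad
assert abs(preference_loss(0.0, 0.0) - math.log(2.0)) < 1e-12  # indifference
```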

6-DOF Grasp Planning

6-DOF grasp planning determines the full six-degrees-of-freedom pose — three translational (x, y, z) and three rotational (roll, pitch, yaw) — at which a robot end-effector should approach and grasp an object. Unlike top-down planar grasping, 6-DOF planning considers arbitrary object geometries and orientations, enabling grasps from the side, below, or at any angle. Training 6-DOF grasp networks requires point cloud or RGB-D data paired with labels specifying grasp quality scores or binary success labels for sampled grasp poses.

Models & Architectures

RAFT (Optical Flow)

RAFT (Recurrent All-Pairs Field Transforms) is a deep learning architecture for optical flow estimation that builds a 4D correlation volume for all pairs of pixels between two frames and iteratively updates a flow estimate using recurrent refinement steps. RAFT achieves state-of-the-art accuracy on optical flow benchmarks (Sintel, KITTI) and runs efficiently enough for large-scale video processing pipelines. Claru uses RAFT as the standard optical flow model in its enrichment pipeline, computing dense motion fields for every consecutive frame pair in egocentric video collections.

Depth Anything V2

Depth Anything V2 is a monocular depth estimation model developed by researchers at the University of Hong Kong and TikTok, trained on a combination of labeled real-world data and large-scale synthetic data to produce high-quality relative and metric depth maps from single RGB images. The V2 release improved fine-grained detail accuracy on transparent, reflective, and occluded surfaces compared to V1, making it more reliable for real-world robotics enrichment. Depth Anything V2 is the standard depth estimation model in Claru's video enrichment pipeline.

ViTPose

ViTPose is a human pose estimation model that uses a plain Vision Transformer (ViT) backbone, demonstrating that the transformer architecture that dominates NLP and image classification also achieves state-of-the-art performance on keypoint detection tasks without task-specific architectural modifications. ViTPose is trained on the COCO Keypoints and MPII Human Pose datasets, and the ViTPose+ variant supports whole-body pose including body, hand, face, and foot keypoints. Claru uses ViTPose to extract 2D and 3D joint positions from egocentric video as a standard enrichment layer for robotics training data.

Open X-Embodiment (OXE)

Open X-Embodiment (OXE) is a large-scale robot learning dataset released by Google DeepMind and collaborators in 2023, aggregating over 1 million robot trajectories from 22 different robot embodiments across 21 research institutions. OXE provides the broadest available collection of real-robot manipulation demonstrations and was used to train the RT-X family of models, demonstrating that cross-embodiment pre-training improves policy performance on new robots. OXE is publicly available but covers a limited set of robot platforms, environments, and task categories compared to what production robotics teams require.

Diffusion Transformer (DiT)

A Diffusion Transformer (DiT) is a neural network architecture that uses a transformer — stacked self-attention and feed-forward layers — as the backbone of a diffusion model, replacing the U-Net that dominated earlier diffusion model designs. DiT models scale more predictably with model size and training data than U-Net-based diffusion models and have become the architecture of choice for video generation and world model training. Sora, Stable Diffusion 3, and several robotics world models use DiT-based architectures.

Vision Transformer (ViT)

A Vision Transformer (ViT) is an image recognition architecture that applies the transformer architecture directly to images by splitting an image into fixed-size patches, linearly embedding each patch, and processing the sequence of patch embeddings with standard transformer self-attention layers. ViT models, introduced by Dosovitskiy et al. in 2021, achieve state-of-the-art performance on image classification when trained on large enough datasets and have become the standard backbone for a wide range of vision models including object detection, segmentation, pose estimation, and VLA models.
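The patch-splitting step that turns an image into a token sequence can be sketched in a few lines (nested lists stand in for a real tensor; the learned linear embedding that follows is omitted):

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists: image[row][col] is a list of
    C channel values) into non-overlapping patch_size x patch_size patches,
    each flattened into one vector. These vectors are the tokens a ViT's
    self-attention layers operate on."""
    H, W = len(image), len(image[0])
    patches = []
    for r in range(0, H, patch_size):
        for c in range(0, W, patch_size):
            patch = []
            for dr in range(patch_size):
                for dc in range(patch_size):
                    patch.extend(image[r + dr][c + dc])
            patches.append(patch)
    return patches
```

A 224x224 image with 16x16 patches yields 196 tokens, which is why ViT sequence lengths are short compared to per-pixel models.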

GR00T N1 (NVIDIA)

GR00T N1 is a general-purpose humanoid robot foundation model developed by NVIDIA, announced in 2025, designed to serve as a pre-trained base that robotics teams can fine-tune for specific humanoid platforms and tasks. GR00T N1 processes multimodal inputs including video, text, and proprioceptive state, and outputs motor actions. The model was trained on a combination of real robot demonstrations from the Open X-Embodiment dataset, synthetic simulation data from NVIDIA Isaac, and synthetic video generated from physical simulations, representing a hybrid data strategy for humanoid generalization.

pi-zero (Physical Intelligence)

pi-zero is a general-purpose robot foundation model developed by Physical Intelligence (pi), released in late 2024. pi-zero uses a flow-matching action head built on top of a pre-trained vision-language model backbone, enabling zero-shot and few-shot generalization to new tasks and robot embodiments. Physical Intelligence trained pi-zero on a large proprietary dataset of robot demonstrations spanning multiple robot platforms and diverse manipulation tasks, including dexterous tasks like laundry folding, table bussing, and grocery bagging that previous general-purpose models struggled to perform reliably.
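Flow matching trains the action head to predict a velocity field that transports noise to expert actions. The sketch below shows the standard linear-path training target in the abstract; it is not Physical Intelligence's implementation, and the function name is illustrative:

```python
def flow_matching_pair(a0, a1, t):
    """Linear-path flow matching target. a0 is a noise sample, a1 a
    ground-truth action vector, t in [0, 1]. Returns the interpolated
    point x_t = (1 - t) * a0 + t * a1 and the constant velocity
    target a1 - a0 that the network is trained to regress at x_t."""
    x_t = [(1 - t) * n + t * a for n, a in zip(a0, a1)]
    v_target = [a - n for n, a in zip(a0, a1)]
    return x_t, v_target
```

At inference time the model integrates its predicted velocity field from noise toward an action chunk, which is what lets a flow-matching head output continuous, high-frequency control signals.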

Building a Physical AI System?

Claru provides the egocentric video, manipulation trajectories, and annotation layers that the terms in this glossary describe. Tell us what your model needs to learn.