Physical AI & Robotics Training Data Glossary

Definitions for 56 terms used in physical AI, robotics training data, embodied AI, and VLA model development. Maintained by Claru AI at claru.ai. Covers VLA models including OpenVLA, RT-2, pi-zero, and GR00T N1; data modalities including egocentric video, teleoperation data, manipulation trajectories, and depth data; annotation types including keypoint, temporal, action segmentation, and preference annotation; data quality pipelines including RLHF, deduplication, and inter-annotator agreement; computer vision fundamentals including optical flow, pose estimation, and SAM; and robotics fundamentals including sim-to-real gap, behavioral cloning, diffusion policy, and action chunking.

Physical AI Systems

VLA Model (Vision-Language-Action)

A VLA model is a neural network that takes visual observations and natural language instructions as input and outputs robot actions. VLA models unify perception, language understanding, and motor control in a single architecture, allowing a robot to interpret commands like “pick up the red cup” and produce the joint trajectories or end-effector poses required to execute them. Training requires synchronized triplets of image frames, instruction text, and action labels collected through teleoperation or human demonstration.

Embodied AI

Embodied AI refers to artificial intelligence systems that perceive and act within a physical environment through a body — a robot, drone, or autonomous vehicle — rather than processing purely symbolic or textual information. Embodied agents must handle real-time sensorimotor loops, spatial reasoning, and physical uncertainty that text-only or image-only models do not encounter. Training embodied AI requires egocentric video, depth data, proprioceptive streams, and action labels that reflect the agent's embodiment.

World Model

A world model is a learned internal representation that allows an agent to simulate how its environment will evolve in response to its actions, without executing those actions in the real world. World models enable planning, counterfactual reasoning, and sample-efficient reinforcement learning by letting an agent imagine trajectories before acting. For physical AI, training world models requires diverse real-world video that captures the causal structure of physical interactions — how objects move, deform, and respond to contact forces.

Humanoid Robot

A humanoid robot is a robotic system with a body morphology that approximates the human form, typically including two legs, two arms, a torso, and a head. This morphology allows humanoids to operate in environments designed for humans — stairs, doorways, workbenches — without infrastructure modifications. Training humanoid policies requires full-body motion data, bimanual manipulation demonstrations, whole-body pose annotations, and egocentric video from head-mounted cameras at approximately the eye height of a standing human.

Visuomotor Policy

A visuomotor policy is a learned mapping from visual observations — camera images or video frames — directly to motor commands or control actions. Rather than first building an explicit scene representation and then planning, visuomotor policies compute actions end-to-end from pixels. This approach is data-intensive: the policy must generalize across lighting, viewpoint, and object variation. Effective visuomotor policies are trained on large corpora of egocentric demonstration video paired with synchronized action labels.

Foundation Model for Robotics

A foundation model for robotics is a large pre-trained model that serves as a general-purpose base for many downstream robot learning tasks, analogous to how GPT-4 serves as a base for language applications. These models are pre-trained on broad, cross-embodiment datasets and then fine-tuned for specific robots and tasks. Examples include OpenVLA, Octo, and GR00T N1. Training requires large, diverse datasets spanning many robot types, environments, and task categories.

Cross-Embodiment Data

Cross-embodiment data is training data collected from multiple robot morphologies — different arm designs, grippers, camera configurations, and kinematic chains — assembled into a single dataset to train policies that generalize across robot types. The Open X-Embodiment dataset is the canonical example, combining trajectories from 22 different robots. Cross-embodiment training reduces the need to collect a large demonstration dataset for every new robot platform, enabling faster deployment of pre-trained foundation models on novel hardware.

Physical AI

Physical AI refers to AI systems that perceive, reason about, and act within the physical world, as opposed to systems that operate purely in digital or linguistic domains. Physical AI encompasses robots, embodied agents, autonomous vehicles, and world models — any system where the AI must bridge the gap between perception and physical action. The defining data requirement of physical AI is grounded, multi-modal training data: video paired with depth, force, pose, and action information that reflects how the real physical world behaves.

Data Modalities

Egocentric Video

Egocentric video is first-person video captured from a camera mounted on or near a person's or robot's head, recording the world from the perspective of the agent performing a task. This viewpoint directly mirrors what a robot's on-board camera would see during operation, making egocentric video the most natural training signal for visuomotor policies and embodied AI. Key characteristics include frequent hand-object interactions, dynamic viewpoint changes, and the full visual context of task execution including workspace layout.

Teleoperation Data

Teleoperation data consists of paired observation-action recordings captured while a human operator remotely controls a physical robot to complete tasks. The human drives the robot through VR controllers, exoskeletons, or leader-follower setups, and the system records both what the robot's cameras see and the exact joint positions, end-effector poses, and gripper states the human commands. This creates ground-truth action labels at the deployment embodiment, making teleoperation data particularly valuable for behavioral cloning and VLA fine-tuning.

Manipulation Trajectory

A manipulation trajectory is a time-series record of a robot arm executing a task, capturing the sequence of end-effector positions, orientations, gripper states, and joint angles over the duration of a manipulation action such as grasping, lifting, inserting, or assembling. Trajectories are the primary training signal for imitation learning and behavior cloning in manipulation robotics. High-quality trajectories require sub-16ms temporal alignment between visual observations and action states, and include metadata about object identity, task phase, and success or failure outcome.
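The time-series record described above can be sketched as a simple schema. The field names, types, and units here are illustrative only, not a standard interchange format:

```python
from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    """One timestep of a manipulation trajectory (illustrative schema)."""
    timestamp_s: float      # capture time, seconds
    joint_angles: list      # one value per joint, radians
    ee_position: tuple      # end-effector (x, y, z), meters
    ee_orientation: tuple   # quaternion (qx, qy, qz, qw)
    gripper_state: float    # 0.0 = fully open, 1.0 = fully closed

@dataclass
class ManipulationTrajectory:
    task: str               # e.g. "grasp_mug"
    success: bool           # outcome label for the whole episode
    steps: list = field(default_factory=list)

traj = ManipulationTrajectory(task="grasp_mug", success=True)
traj.steps.append(TrajectoryStep(0.0, [0.1, -0.5, 1.2], (0.4, 0.0, 0.3),
                                 (0.0, 0.0, 0.0, 1.0), 0.0))
```

In practice each step would also carry the synchronized camera frame reference, and the trajectory the object and task-phase metadata mentioned above.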

Depth Data

Depth data encodes the distance from a camera to every point in a scene as a per-pixel value, producing a depth map that complements a standard RGB image with 3D spatial information. For robot learning, depth data enables grasp planning, obstacle avoidance, and 3D scene understanding that pure RGB images cannot support. Depth can be measured directly using LiDAR or structured light sensors, or estimated from monocular RGB video using models like Depth Anything V2. Claru enriches raw video with per-frame depth maps as a standard annotation layer.

RGB-D Data

RGB-D data pairs standard color (red-green-blue) video frames with aligned depth frames captured at the same moment, providing both appearance and geometry for every scene. The depth channel (D) gives each pixel a distance value in addition to its color, enabling direct 3D reconstruction and precise grasp pose estimation. RGB-D cameras such as Intel RealSense and Microsoft Azure Kinect are common in robotics research. RGB-D data is particularly valuable for manipulation tasks where the 3D geometry of objects directly determines grasp feasibility.
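The claim that the depth channel enables direct 3D reconstruction reduces to the pinhole camera model: each pixel plus its depth value back-projects to a 3D point in the camera frame. A minimal sketch, with made-up intrinsics for a 640x480 sensor:

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Convert pixel (u, v) with depth (meters) to a 3D camera-frame point
    via the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# Hypothetical intrinsics (fx, fy: focal lengths in pixels; cx, cy: principal point).
point = backproject(u=320, v=240, depth_m=0.5,
                    fx=600.0, fy=600.0, cx=320.0, cy=240.0)
# The principal-point pixel maps straight onto the optical axis: (0.0, 0.0, 0.5)
```

Applying this to every pixel of an aligned depth frame yields the point cloud described in the next entry.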

Point Cloud

A point cloud is a set of data points in three-dimensional space, where each point has (x, y, z) coordinates and often additional attributes such as color or surface normals, representing the geometry of a scanned object or environment. Point clouds are generated by LiDAR sensors, structured light depth cameras, or through reconstruction from RGB-D sequences. In robotics, point clouds feed into grasp planning algorithms, 3D object detection, and occupancy mapping. Standard formats include PLY and PCD files compatible with the Open3D and PCL libraries.

Proprioceptive Data

Proprioceptive data captures a robot's internal state — joint angles, joint velocities, end-effector position and orientation, gripper force, and torque readings — without relying on external sensors like cameras. Proprioception provides the robot's sense of its own body position in space, analogous to human kinesthetic awareness. In manipulation and locomotion policies, proprioceptive data is typically concatenated with visual observations as part of the observation vector, giving the policy information about its current configuration and the forces being applied at each joint.

Synthetic Data (for Robotics)

Synthetic data for robotics is training data generated in simulation environments such as Isaac Sim, MuJoCo, Genesis, or Habitat, rather than collected from physical robot systems or human demonstrations. Synthetic data offers unlimited scale and perfect ground-truth labels — exact object poses, contact forces, and joint states — at a fraction of the cost of real-world collection. Its primary limitation is the sim-to-real gap: visual and physical discrepancies between the simulated and real-world distributions that cause policies to fail on deployment hardware.

Annotation Types

Keypoint Annotation

Keypoint annotation marks specific landmark locations on objects or body parts — such as fingertips, joint centers, or object corners — as (x, y) coordinates within an image frame or (x, y, z) coordinates in 3D space. For human body pose, standard keypoint sets include the COCO 17-point skeleton and the OpenPose 25-point body model. For hand-object interaction, keypoints mark fingertip positions, wrist center, and object contact points. Keypoint annotations are the primary training signal for pose estimation models including ViTPose.
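The COCO 17-point convention mentioned above stores keypoints as a flat list of (x, y, v) triples, where the visibility flag v is 0 (not labeled), 1 (labeled but occluded), or 2 (labeled and visible). A sketch of one annotation record, with invented coordinates:

```python
COCO_SKELETON_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

annotation = {
    "image_id": 42,
    "category_id": 1,             # person
    "keypoints": [0, 0, 0] * 17,  # one (x, y, v) triple per landmark
    "num_keypoints": 0,           # count of labeled landmarks
}

def set_keypoint(ann, name, x, y, visible=True):
    """Write one landmark into the flat keypoint list (helper for illustration)."""
    i = COCO_SKELETON_NAMES.index(name) * 3
    ann["keypoints"][i:i + 3] = [x, y, 2 if visible else 1]
    ann["num_keypoints"] = sum(1 for v in ann["keypoints"][2::3] if v > 0)

set_keypoint(annotation, "left_wrist", 210.5, 340.0)
```

3D keypoint annotations extend each triple with a z coordinate; hand and object keypoint sets follow the same pattern with different landmark name lists.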

Temporal Annotation

Temporal annotation marks the start and end timestamps of events, actions, or state transitions within a video, creating labeled segments along the time axis. In robotics, temporal annotations define the boundaries of discrete action phases — when a grasp begins and ends, when an object is in transit, when contact is made or released. Temporal precision directly affects policy performance: annotations at coarse (e.g., 100ms) granularity miss the frame-accurate timing information that manipulation policies need to learn smooth, reactive behavior.

Action Segmentation

Action segmentation is the task of partitioning a video into temporally contiguous segments and assigning an action class label to each segment — for example, labeling consecutive frames as reach, grasp, transport, and place. Unlike activity recognition, which assigns a single label to an entire video, action segmentation produces a frame-level or segment-level label sequence. Action segmentation annotations are essential for training manipulation policies that decompose complex tasks into primitive actions and for generating the temporal supervision required by sequence models.
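The frame-level label sequence described above is interchangeable with a list of (label, start, end) segments via run-length grouping. A minimal sketch, assuming 30 fps video and the reach/grasp/transport/place classes used as the example:

```python
def frames_to_segments(frame_labels, fps=30.0):
    """Collapse a per-frame label sequence into (label, start_s, end_s) segments."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current segment at the end of the list or on a label change.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start / fps, i / fps))
            start = i
    return segments

labels = ["reach"] * 30 + ["grasp"] * 15 + ["transport"] * 45 + ["place"] * 30
print(frames_to_segments(labels))
# [('reach', 0.0, 1.0), ('grasp', 1.0, 1.5), ('transport', 1.5, 3.0), ('place', 3.0, 4.0)]
```

The inverse direction (segments to frame labels) is how segment-level human annotations become the frame-level supervision that sequence models consume.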

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in an image — for example, labeling all pixels belonging to cups as cup, all countertop pixels as countertop, and all hand pixels as hand. Unlike object detection, which produces bounding boxes, semantic segmentation provides pixel-precise region boundaries. For robot manipulation, semantic segmentation is used to identify graspable objects, avoid obstacles, and segment the workspace layout. Models like SAM (Segment Anything Model) produce high-quality object masks — class-agnostic, and paired with class labels downstream to yield semantic segmentation — that Claru uses as a standard enrichment layer.

Instance Segmentation

Instance segmentation extends semantic segmentation by distinguishing individual object instances of the same class — for example, separately labeling cup_1, cup_2, and cup_3 rather than assigning all three to a single cup class. This enables tracking individual objects across frames, understanding cluttered workspaces where multiple instances of the same object type are present simultaneously, and generating the per-object identity labels required for tasks like multi-object manipulation and assembly sequencing.

Activity Annotation

Activity annotation labels what a person or robot is doing in a video at a coarser temporal granularity than action segmentation — for example, labeling a 30-second clip as preparing breakfast or repairing a bicycle. Activity labels provide high-level semantic context that complements fine-grained action segmentation. In egocentric video datasets, activity annotations define the top-level task category and are used to filter and stratify training data, ensuring that manipulation policies train on task-relevant clips rather than unrelated background footage.

Bounding Box Annotation

Bounding box annotation draws the smallest axis-aligned rectangle that fully encloses an object within an image frame, labeled with an object class identifier. Bounding boxes are the most common object detection annotation format and provide approximate spatial localization without the per-pixel precision of segmentation masks. In robotics, bounding boxes are used for object detection, workspace analysis, and as an input to downstream grasp planning pipelines. Temporal sequences of bounding boxes across video frames provide the training signal for object tracking models.
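Bounding boxes are compared using intersection-over-union (IoU), the standard overlap metric for matching detections to ground-truth annotations and for tracking association. A minimal sketch in the common (x_min, y_min, x_max, y_max) convention:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 overlap / 150 union = 0.333...
```

Detection benchmarks typically count a prediction as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.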

Preference Annotation (RLHF)

Preference annotation collects human judgments about which of two or more AI outputs is preferred, providing the training signal for reward models used in reinforcement learning from human feedback (RLHF). Annotators compare pairs of model outputs — robot trajectories, generated videos, text responses — and label which is better along specified dimensions. In robotics, preference annotations evaluate trajectory smoothness, task success, and natural motion. In video generation, they assess motion quality, fidelity, and text-video alignment across model configurations.

Data Quality & Pipelines

Data Enrichment

Data enrichment is the process of augmenting raw collected data with additional annotation layers — depth maps, segmentation masks, pose estimates, optical flow, captions, and action labels — that downstream models need but that are not present in the original capture. Rather than delivering raw video, enrichment pipelines run automated models (Depth Anything V2, ViTPose, SAM, RAFT) and human annotation passes to produce a multi-layer dataset ready for direct use in training. Enrichment is distinct from annotation: it adds derived signals, not just labels assigned by humans.

Benchmark Curation

Benchmark curation is the construction of a held-out evaluation dataset used to measure model performance on a specific task or capability. A well-curated benchmark is representative of the deployment distribution, covers edge cases and failure modes, and has high-quality ground-truth labels. In physical AI, benchmarks are used to compare robot policies across manipulation difficulty, environment diversity, and task generalization. Benchmark curation involves selecting diverse samples, verifying annotation quality, and preventing contamination between training and evaluation splits.

Data Deduplication

Data deduplication identifies and removes near-duplicate samples from a training dataset that would cause the model to overfit to repeated examples rather than learning the true underlying distribution. In video datasets, deduplication operates at the frame level (perceptual hashing), clip level (embedding similarity), or trajectory level (action sequence similarity). Effective deduplication improves training efficiency and generalization: models trained on deduplicated datasets often achieve better downstream performance with less compute than those trained on raw, redundant corpora.
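The frame-level perceptual-hashing approach mentioned above can be sketched with an average hash: each pixel contributes one bit, set when it is brighter than the frame mean, and two frames are near-duplicates when the Hamming distance between their hashes is small. Real pipelines first downscale each frame to a tiny grayscale grid (commonly 8x8); the 2x2 frames here are stand-ins:

```python
def average_hash(gray):
    """Perceptual hash of a small grayscale image (2D list of 0-255 values):
    one bit per pixel, set when the pixel is above the image mean."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

frame_a = [[10, 200], [200, 10]]
frame_b = [[12, 198], [201, 9]]   # near-duplicate: small sensor noise
frame_c = [[200, 10], [10, 200]]  # different content

assert hamming(average_hash(frame_a), average_hash(frame_b)) == 0
assert hamming(average_hash(frame_a), average_hash(frame_c)) == 4
```

Clip-level deduplication replaces the bit hash with an embedding vector and the Hamming distance with cosine similarity, but the pipeline shape is the same.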

Inter-Annotator Agreement (IAA)

Inter-annotator agreement (IAA) is a metric that quantifies the degree to which independent annotators assign the same labels to the same data samples, measuring the reliability and consistency of an annotation process. High IAA indicates that the task is well-defined and the guidelines are clear. Common IAA metrics include Cohen's kappa for pairwise agreement, Fleiss' kappa for multiple annotators, and Krippendorff's alpha for ordinal or continuous scales. Claru monitors Krippendorff's alpha as a primary quality signal, with a target threshold of 0.85 or above for preference annotation tasks.
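Cohen's kappa, the pairwise metric named above, corrects observed agreement for the agreement two annotators would reach by chance given their label distributions: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch with invented labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same samples:
    (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each annotator's label rates."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

a = ["grasp", "grasp", "reach", "place", "grasp", "reach"]
b = ["grasp", "reach", "reach", "place", "grasp", "reach"]
print(round(cohens_kappa(a, b), 3))  # 5/6 raw agreement corrects to ~0.739
```

Fleiss' kappa and Krippendorff's alpha generalize the same chance-correction idea to more than two annotators and to ordinal or continuous labels.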

RLHF (Reinforcement Learning from Human Feedback)

RLHF is a training paradigm in which a reward model is trained on human preference annotations — judgments about which AI outputs are better — and then used to fine-tune a base model through reinforcement learning to produce outputs humans prefer. RLHF was central to the training of InstructGPT, ChatGPT, and Claude. In robotics, RLHF is applied to train reward models that evaluate trajectory quality, enabling policies to improve from human evaluations of robot behavior rather than requiring explicit reward engineering.

Data Quality Scoring

Data quality scoring assigns a numerical quality score to each sample in a dataset based on criteria such as annotation correctness, visual clarity, task relevance, and diversity contribution. Quality scores are used to filter out low-quality samples before training, weight samples during training, or prioritize which samples to re-annotate. In video datasets, quality scoring evaluates factors including motion blur, occlusion severity, camera calibration drift, and action completeness. Automated quality scoring reduces the labeling load by focusing human review on borderline samples rather than clearly acceptable or rejected ones.

Dataset Diversity

Dataset diversity measures the range of variation in a training corpus across dimensions that matter for model generalization — scene appearance, lighting, object category, geographic location, task type, and operator behavior. A diverse dataset reduces overfitting to specific environments and improves zero-shot performance in novel settings. In robotics, diversity is measured along axes including environment category (kitchen, warehouse, outdoor), object type (rigid, deformable, transparent), and viewpoint (wrist camera, head camera, external camera). Claru tracks diversity coverage explicitly across all collection campaigns.

Active Learning

Active learning is a data collection and annotation strategy in which a model identifies which unlabeled samples it is most uncertain about, and those samples are prioritized for human annotation. This concentrates labeling effort on the examples that will most improve model performance, reducing the total annotation volume required to reach a target accuracy. In robotics, active learning selects demonstration scenarios that expose gaps in the current policy — edge cases, failure modes, or underrepresented environments — for targeted data collection rather than random sampling.
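A common concrete form of the uncertainty criterion above is entropy-based sampling: rank unlabeled samples by the entropy of the model's predicted class distribution and send the highest-entropy samples to annotators. A sketch with hypothetical clip IDs and probabilities:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(predictions, k=2):
    """Return the IDs of the k samples the model is least certain about.

    predictions: iterable of (sample_id, class_probabilities) pairs.
    """
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]

preds = {
    "clip_001": [0.98, 0.01, 0.01],  # confident: low priority
    "clip_002": [0.34, 0.33, 0.33],  # near-uniform: most uncertain
    "clip_003": [0.70, 0.20, 0.10],
}
print(select_for_annotation(preds.items(), k=2))  # ['clip_002', 'clip_003']
```

Other acquisition functions (margin sampling, ensemble disagreement, expected model change) slot into the same select-then-annotate loop.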

Computer Vision Fundamentals

Optical Flow

Optical flow is a dense motion field that describes, for every pixel in an image, the apparent velocity of that pixel between consecutive frames — encoded as a 2D vector indicating direction and magnitude of motion. Optical flow captures how objects and surfaces move through the scene and is used in robotics to detect moving obstacles, estimate camera ego-motion, and segment foreground objects from background. RAFT (Recurrent All-Pairs Field Transforms) is the standard model for computing optical flow in physical AI data enrichment pipelines.

Monocular Depth Estimation

Monocular depth estimation predicts a per-pixel depth map from a single RGB image, inferring scene geometry without requiring a stereo camera pair or active depth sensor. This is achieved by neural networks trained on large corpora of RGB-depth pairs, learning visual cues such as perspective foreshortening, object size, and texture gradient that correlate with distance. Depth Anything V2 is the current standard for monocular depth in physical AI pipelines. Monocular depth enables depth enrichment at scale for video datasets collected with standard single-lens cameras.

Pose Estimation

Pose estimation predicts the positions of anatomical landmarks — body joints, hand keypoints, or object corners — in 2D image coordinates or 3D space. Human body pose estimation produces skeleton representations used to understand how humans perform tasks, providing the reference demonstrations that robot learning systems imitate. Hand pose estimation localizes finger joints to capture dexterous manipulation in egocentric video. ViTPose is the standard vision transformer model for human pose estimation in physical AI data pipelines, trained on COCO Keypoints and MPII.

Hand-Object Interaction (HOI)

Hand-object interaction (HOI) refers to the detection, segmentation, and analysis of the contact relationship between human hands and objects in video — identifying which hand is touching which object, the contact region, grip type, and the resulting object state change. HOI annotations are critical for robotics training data because manipulation tasks are fundamentally about how hands (and by extension, robot end-effectors) interact with objects. HOI detection in egocentric video provides the ground-truth skill demonstrations that robot manipulation policies learn from.

Object Tracking

Object tracking maintains the identity of one or more objects across consecutive video frames, assigning consistent identifiers as objects move, become occluded, or change appearance. Tracking converts per-frame detections into temporally coherent object trajectories, which are essential for understanding how objects are manipulated over time. In physical AI training data, tracking links object instances across frames to enable identity-consistent annotations, trajectory prediction training, and the temporal association needed for action segmentation and reward learning.

Video Prediction

Video prediction is the task of generating plausible future video frames given a sequence of past frames, requiring a model to understand scene dynamics, object physics, and the temporal evolution of appearance. Video prediction models are a form of learned world model: they internalize how objects move, deform, and interact under physical constraints. Training video prediction models requires large corpora of real-world video with diverse motion patterns — not just static scene images — making egocentric and robotics video particularly valuable for this task.

SAM (Segment Anything Model)

SAM (Segment Anything Model) is a promptable image segmentation model developed by Meta AI that generates high-quality object masks from point, box, or mask prompts, without task-specific training. SAM can segment any object in an image — known or unknown — making it a general-purpose tool for annotation automation. SAM3 (the video-capable version) tracks and segments objects across video frames. Claru uses SAM3 as a standard layer in its enrichment pipeline to produce segmentation masks for every object in egocentric video collections.

Panoptic Segmentation

Panoptic segmentation combines semantic segmentation and instance segmentation into a unified output where every pixel is assigned both a class label and an instance identifier. Countable objects (things) such as cups, hands, and tools receive unique instance IDs, while background regions (stuff) such as floor, table, and wall receive class labels only. Panoptic segmentation provides the most complete pixel-level scene understanding, enabling robot systems to simultaneously know what type every surface is and which individual object is which without running two separate pipelines.

Robotics Fundamentals

Imitation Learning

Imitation learning is a class of robot learning methods in which a policy is trained to replicate the behavior of an expert demonstrator, learning from observations of how a human or expert robot performs a task rather than from trial-and-error exploration. The simplest form of imitation learning is behavioral cloning, which treats demonstration data as a supervised learning problem. More advanced approaches like DAgger and inverse reinforcement learning address the distributional shift problem that arises when the policy encounters states outside the demonstration distribution.

Behavioral Cloning (BC)

Behavioral cloning (BC) is the simplest form of imitation learning, treating demonstration data as a supervised learning problem: given an observation, predict the action the expert demonstrator took. A policy is trained by minimizing the difference between predicted and demonstrated actions across a dataset of (observation, action) pairs. BC is data-efficient and straightforward to implement but suffers from compounding errors when the policy encounters states slightly outside the demonstration distribution, since small mistakes at each step can compound into large deviations over a long trajectory.
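The supervised objective described above is just a regression loss over (observation, action) pairs. A toy sketch with a linear policy and an invented expert whose behavior is action = 2*obs[0] + 1*obs[1]; real BC replaces this with a neural network and gradient descent:

```python
def bc_loss(policy_weights, dataset):
    """Behavioral cloning objective: mean squared error between the action a
    (toy, linear) policy predicts and the action the demonstrator took."""
    total = 0.0
    for obs, demo_action in dataset:
        predicted = sum(w * o for w, o in zip(policy_weights, obs))
        total += (predicted - demo_action) ** 2
    return total / len(dataset)

# (observation, demonstrated_action) pairs from the hypothetical expert.
demos = [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 3.0)]

assert bc_loss([2.0, 1.0], demos) == 0.0  # a perfect clone of this expert
assert bc_loss([0.0, 0.0], demos) > 0.0   # an untrained policy scores worse
```

Note what the loss does not contain: any term about states outside the demonstrations, which is exactly why compounding errors arise at deployment.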

Sim-to-Real Gap

The sim-to-real gap refers to the performance degradation that occurs when a robot policy trained in simulation is deployed on physical hardware, caused by discrepancies between simulated and real-world visual appearance, physics, sensor noise, and actuator dynamics. Even photorealistic simulators produce textures, lighting, contact physics, and deformable object behavior that differ measurably from the real world. Bridging the sim-to-real gap requires either domain randomization during simulation training, real-world fine-tuning data, or both in combination.

Domain Randomization

Domain randomization is a simulation training technique that trains a policy across a wide range of randomized visual and physical simulation parameters — object textures, lighting colors, camera positions, friction coefficients, and object masses — so that the real world appears as just another variation in the training distribution. By training on many randomized environments, the policy learns representations that are robust to the specific parameter values, making it more likely to transfer to the real-world domain. Domain randomization reduces the sim-to-real gap without requiring large amounts of real-world data.
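Operationally, domain randomization amounts to re-sampling simulator parameters from broad ranges at the start of every training episode. A sketch with hypothetical parameter names and ranges; real ranges are tuned per task and simulator:

```python
import random

# Hypothetical per-episode randomization ranges (name: (low, high)).
RANDOMIZATION_RANGES = {
    "table_friction":  (0.4, 1.2),
    "object_mass_kg":  (0.05, 0.5),
    "light_intensity": (0.3, 1.5),
    "camera_jitter_m": (0.0, 0.02),
}

def sample_sim_params(rng=random):
    """Draw one randomized parameter set per episode, so the policy never
    trains on the same physics and lighting configuration twice."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

episode_params = sample_sim_params()
```

Each sampled dictionary would be pushed into the simulator before the episode begins; the real world then ideally falls inside the spanned distribution.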

Action Chunking

Action chunking is a technique in robot learning where the policy predicts a short sequence of future actions (a chunk) rather than a single action at each timestep, then executes that chunk before predicting the next one. Chunking reduces the effective frequency of policy inference, lowering latency demands on the policy network, and enables the policy to plan ahead within the chunk horizon. The Action Chunking with Transformers (ACT) method popularized this approach, demonstrating that chunks of 10-100 actions significantly improve performance on dexterous manipulation tasks compared to single-step action prediction.
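The execution pattern above (query once, execute a chunk open-loop, re-query) can be sketched as a control loop. The policy and environment here are stubs invented for illustration:

```python
def run_episode(policy, env, horizon=100, chunk_size=10):
    """Control loop with action chunking: query the policy once per chunk,
    then execute the whole predicted chunk before re-querying."""
    obs = env.reset()
    inference_calls = 0
    t = 0
    while t < horizon:
        chunk = policy(obs)[:chunk_size]  # chunk_size future actions
        inference_calls += 1
        for action in chunk:
            obs = env.step(action)
            t += 1
            if t >= horizon:
                break
    return inference_calls

# Stub policy and environment, just to show the call pattern.
class StubEnv:
    def reset(self): return 0.0
    def step(self, action): return action

stub_policy = lambda obs: [0.0] * 10  # always predicts a 10-action chunk
print(run_episode(stub_policy, StubEnv()))  # 10: one inference per 10 steps
```

With chunk_size=1 this degenerates to standard single-step prediction and 100 inference calls; the ACT method additionally blends overlapping chunks (temporal ensembling) rather than executing each chunk fully open-loop.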

Diffusion Policy

Diffusion Policy is a robot learning method that frames action prediction as a conditional denoising diffusion process: the policy generates action sequences by iteratively removing noise from a random sample, conditioned on the current visual observation. Diffusion models naturally represent multi-modal action distributions — situations where multiple different actions are all correct responses to the same observation — which standard regression-based behavioral cloning cannot capture. Diffusion Policy achieves state-of-the-art performance on dexterous manipulation benchmarks and underlies the action heads in several commercial humanoid platforms.

Reward Model

A reward model is a neural network trained to predict a scalar quality score for a given AI output — a robot trajectory, a text response, or a video — based on human preference annotations. The reward model encodes human judgment as a differentiable function, allowing reinforcement learning algorithms to optimize a policy toward outputs that humans prefer. Reward models trained on low-quality or inconsistent preference annotations invite reward hacking: policies that score highly on the reward model while producing outputs humans actually dislike. High inter-annotator agreement is essential for reliable reward model training.
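Reward models are commonly fit to pairwise preference data with the Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected), which is small when the model scores the human-preferred output higher. A minimal sketch of the loss itself, with the network replaced by raw scalar scores:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward model training:
    -log sigmoid(r_chosen - r_rejected). Low when the reward model agrees
    with the human preference, high when it contradicts it."""
    sigmoid = 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))
    return -math.log(sigmoid)

print(preference_loss(2.0, 0.0))  # small: model ranks the preferred output higher
print(preference_loss(0.0, 2.0))  # large: model contradicts the annotators
```

Averaged over a dataset of (chosen, rejected) trajectory pairs and backpropagated into the scoring network, this loss is what turns preference annotations into a differentiable reward signal.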

6-DOF Grasp Planning

6-DOF grasp planning determines the full six-degrees-of-freedom pose — three translational (x, y, z) and three rotational (roll, pitch, yaw) — at which a robot end-effector should approach and grasp an object. Unlike top-down planar grasping, 6-DOF planning considers arbitrary object geometries and orientations, enabling grasps from the side, below, or at any angle. Training 6-DOF grasp networks requires point cloud or RGB-D data paired with labels specifying grasp quality scores or binary success labels for sampled grasp poses.
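A 6-DOF grasp pose is typically materialized as a 4x4 homogeneous transform built from the three translations and three rotations. A sketch using the common Z-Y-X convention, R = Rz(yaw) Ry(pitch) Rx(roll); note that rotation conventions vary between robotics stacks, so this choice is an assumption:

```python
import math

def grasp_pose_matrix(x, y, z, roll, pitch, yaw):
    """4x4 homogeneous transform for a 6-DOF grasp pose from translations
    (meters) and roll-pitch-yaw (radians), with R = Rz(yaw) Ry(pitch) Rx(roll)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, x],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, y],
        [-sp,     cp * sr,                cp * cr,                z],
        [0.0,     0.0,                    0.0,                    1.0],
    ]

# A side grasp 0.3 m in front of the robot base, rolled 90 degrees about x:
T = grasp_pose_matrix(0.3, 0.0, 0.2, math.pi / 2, 0.0, 0.0)
```

A grasp network scores many such candidate poses against the observed point cloud; the executed grasp is the highest-scoring pose that is also kinematically reachable.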

Models & Architectures

RAFT (Optical Flow)

RAFT (Recurrent All-Pairs Field Transforms) is a deep learning architecture for optical flow estimation that builds a 4D correlation volume for all pairs of pixels between two frames and iteratively updates a flow estimate using recurrent refinement steps. RAFT achieves state-of-the-art accuracy on optical flow benchmarks (Sintel, KITTI) and runs efficiently enough for large-scale video processing pipelines. Claru uses RAFT as the standard optical flow model in its enrichment pipeline, computing dense motion fields for every consecutive frame pair in egocentric video collections.

Depth Anything V2

Depth Anything V2 is a monocular depth estimation model developed by researchers at the University of Hong Kong and TikTok, trained on a combination of labeled real-world data and large-scale synthetic data to produce high-quality relative and metric depth maps from single RGB images. The V2 release improved fine-grained detail accuracy on transparent, reflective, and occluded surfaces compared to V1, making it more reliable for real-world robotics enrichment. Depth Anything V2 is the standard depth estimation model in Claru's video enrichment pipeline.

ViTPose

ViTPose is a human pose estimation model that uses a plain Vision Transformer (ViT) backbone, demonstrating that the transformer architecture that dominates NLP and image classification also achieves state-of-the-art performance on keypoint detection tasks without task-specific architectural modifications. ViTPose is trained on COCO Keypoints and MPII Human Pose datasets and supports whole-body pose including body, hand, face, and foot keypoints. Claru uses ViTPose to extract 2D and 3D joint positions from egocentric video as a standard enrichment layer for robotics training data.

Open X-Embodiment (OXE)

Open X-Embodiment (OXE) is a large-scale robot learning dataset released by Google DeepMind and collaborators in 2023, aggregating over 1 million robot trajectories from 22 different robot embodiments across 21 research institutions. OXE provides the broadest available collection of real-robot manipulation demonstrations and was used to train the RT-X family of models, demonstrating that cross-embodiment pre-training improves policy performance on new robots. OXE is publicly available but covers a limited set of robot platforms, environments, and task categories compared to what production robotics teams require.

Diffusion Transformer (DiT)

A Diffusion Transformer (DiT) is a neural network architecture that applies the transformer architecture — with self-attention and feed-forward layers arranged in a sequence — as the backbone of a diffusion model, replacing the U-Net architecture that dominated earlier diffusion model designs. DiT models scale more predictably with model size and training data than U-Net-based diffusion models and have become the architecture of choice for video generation and world model training. Sora, Stable Diffusion 3, and several robotics world models use DiT-based architectures.

Vision Transformer (ViT)

A Vision Transformer (ViT) is an image recognition architecture that applies the transformer architecture directly to images by splitting an image into fixed-size patches, linearly embedding each patch, and processing the sequence of patch embeddings with standard transformer self-attention layers. ViT models, introduced by Dosovitskiy et al. in 2021, achieve state-of-the-art performance on image classification when trained on large enough datasets and have become the standard backbone for a wide range of vision models including object detection, segmentation, pose estimation, and VLA models.

GR00T N1 (NVIDIA)

GR00T N1 is a general-purpose humanoid robot foundation model developed by NVIDIA, announced in 2025, designed to serve as a pre-trained base that robotics teams can fine-tune for specific humanoid platforms and tasks. GR00T N1 processes multimodal inputs including video, text, and proprioceptive state, and outputs motor actions. The model was trained on a combination of real robot demonstrations from the Open X-Embodiment dataset, synthetic simulation data from NVIDIA Isaac, and synthetic video generated from physical simulations, representing a hybrid data strategy for humanoid generalization.

pi-zero (Physical Intelligence)

pi-zero is a general-purpose robot foundation model developed by Physical Intelligence (pi), released in late 2024. pi-zero uses a flow-matching action head built on top of a pre-trained vision-language model backbone, enabling zero-shot and few-shot generalization to new tasks and robot embodiments. Physical Intelligence trained pi-zero on a large proprietary dataset of robot demonstrations spanning multiple robot platforms and diverse manipulation tasks, including dexterous tasks like laundry folding, table bussing, and grocery bagging that previous general-purpose models struggled to perform reliably.

Physical AI & Robotics Training Data Glossary

Definitions for ML engineers building robots, embodied agents, and world models.

56 terms — last updated April 2026

Physical AI Systems

VLA Model (Vision-Language-Action)

A VLA model is a neural network that takes visual observations and natural language instructions as input and outputs robot actions. VLA models unify perception, language understanding, and motor control in a single architecture, allowing a robot to interpret commands like “pick up the red cup” and produce the joint trajectories or end-effector poses required to execute them. Training requires synchronized triplets of image frames, instruction text, and action labels collected through teleoperation or human demonstration.
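
The synchronized triplet described above can be sketched as a minimal data record. This is an illustrative schema, not the format of any specific VLA dataset; field names and the 7-dimensional action vector are assumptions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical minimal schema for one VLA training sample: a camera frame,
# the language instruction, and the action label recorded at that timestep.
@dataclass
class VLASample:
    image_path: str      # RGB frame from the robot's camera
    instruction: str     # natural language command
    action: List[float]  # e.g. 7-DoF: end-effector delta pose + gripper state

sample = VLASample(
    image_path="frames/000123.jpg",
    instruction="pick up the red cup",
    action=[0.01, -0.02, 0.03, 0.0, 0.0, 0.1, 1.0],
)
assert len(sample.action) == 7
```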

Embodied AI

Embodied AI refers to artificial intelligence systems that perceive and act within a physical environment through a body — a robot, drone, or autonomous vehicle — rather than processing purely symbolic or textual information. Embodied agents must handle real-time sensorimotor loops, spatial reasoning, and physical uncertainty that text-only or image-only models do not encounter. Training embodied AI requires egocentric video, depth data, proprioceptive streams, and action labels that reflect the agent's embodiment.

World Model

A world model is a learned internal representation that allows an agent to simulate how its environment will evolve in response to its actions, without executing those actions in the real world. World models enable planning, counterfactual reasoning, and sample-efficient reinforcement learning by letting an agent imagine trajectories before acting. For physical AI, training world models requires diverse real-world video that captures the causal structure of physical interactions — how objects move, deform, and respond to contact forces.

Humanoid Robot

A humanoid robot is a robotic system with a body morphology that approximates the human form, typically including two legs, two arms, a torso, and a head. This morphology allows humanoids to operate in environments designed for humans — stairs, doorways, workbenches — without infrastructure modifications. Training humanoid policies requires full-body motion data, bimanual manipulation demonstrations, whole-body pose annotations, and egocentric video from head-mounted cameras at approximately the eye height of a standing human.

Visuomotor Policy

A visuomotor policy is a learned mapping from visual observations — camera images or video frames — directly to motor commands or control actions. Rather than first building an explicit scene representation and then planning, visuomotor policies compute actions end-to-end from pixels. This approach is data-intensive: the policy must generalize across lighting, viewpoint, and object variation. Effective visuomotor policies are trained on large corpora of egocentric demonstration video paired with synchronized action labels.

Foundation Model for Robotics

A foundation model for robotics is a large pre-trained model that serves as a general-purpose base for many downstream robot learning tasks, analogous to how GPT-4 serves as a base for language applications. These models are pre-trained on broad, cross-embodiment datasets and then fine-tuned for specific robots and tasks. Examples include OpenVLA, Octo, and GR00T N1. Training requires large, diverse datasets spanning many robot types, environments, and task categories.

Cross-Embodiment Data

Cross-embodiment data is training data collected from multiple robot morphologies — different arm designs, grippers, camera configurations, and kinematic chains — assembled into a single dataset to train policies that generalize across robot types. The Open X-Embodiment dataset is the canonical example, combining trajectories from 22 different robots. Cross-embodiment training reduces the need to collect a large demonstration dataset from scratch for every new robot platform, enabling faster deployment of pre-trained foundation models on novel hardware.

Physical AI

Physical AI refers to AI systems that perceive, reason about, and act within the physical world, as opposed to systems that operate purely in digital or linguistic domains. Physical AI encompasses robots, embodied agents, autonomous vehicles, and world models — any system where the AI must bridge the gap between perception and physical action. The defining data requirement of physical AI is grounded, multi-modal training data: video paired with depth, force, pose, and action information that reflects how the real physical world behaves.

Data Modalities

Egocentric Video

Egocentric video is first-person video captured from a camera mounted on or near a person's or robot's head, recording the world from the perspective of the agent performing a task. This viewpoint directly mirrors what a robot's on-board camera would see during operation, making egocentric video the most natural training signal for visuomotor policies and embodied AI. Key characteristics include frequent hand-object interactions, dynamic viewpoint changes, and the full visual context of task execution including workspace layout.

Teleoperation Data

Teleoperation data consists of paired observation-action recordings captured while a human operator remotely controls a physical robot to complete tasks. The human drives the robot through VR controllers, exoskeletons, or leader-follower setups, and the system records both what the robot's cameras see and the exact joint positions, end-effector poses, and gripper states the human commands. This creates ground-truth action labels at the deployment embodiment, making teleoperation data particularly valuable for behavior cloning and VLA fine-tuning.

Manipulation Trajectory

A manipulation trajectory is a time-series record of a robot arm executing a task, capturing the sequence of end-effector positions, orientations, gripper states, and joint angles over the duration of a manipulation action such as grasping, lifting, inserting, or assembling. Trajectories are the primary training signal for imitation learning and behavior cloning in manipulation robotics. High-quality trajectories require sub-16ms temporal alignment between visual observations and action states, and include metadata about object identity, task phase, and success or failure outcome.
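
The sub-16ms alignment requirement above can be checked mechanically. A minimal sketch, with illustrative timestamps and sensor rates:

```python
# Sketch: verify that every camera frame has an action state within 16 ms.
def max_alignment_error_ms(frame_ts, action_ts):
    """Worst-case gap (ms) between a frame and its nearest action sample."""
    return max(min(abs(f - a) for a in action_ts) for f in frame_ts)

frames = [i * 1000.0 / 30.0 for i in range(10)]    # 30 Hz camera
actions = [i * 1000.0 / 100.0 for i in range(34)]  # 100 Hz controller states
assert max_alignment_error_ms(frames, actions) <= 16.0
```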

Depth Data

Depth data encodes the distance from a camera to every point in a scene as a per-pixel value, producing a depth map that complements a standard RGB image with 3D spatial information. For robot learning, depth data enables grasp planning, obstacle avoidance, and 3D scene understanding that pure RGB images cannot support. Depth can be measured directly using LiDAR or structured light sensors, or estimated from monocular RGB video using models like Depth Anything V2. Claru enriches raw video with per-frame depth maps as a standard annotation layer.
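
A depth value turns a pixel into a 3D point via the standard pinhole camera model. A minimal sketch; the intrinsics (fx, fy, cx, cy) are illustrative values, not calibration from any real camera:

```python
# Back-project a pixel with known depth into 3D camera coordinates:
#   X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy
def backproject(u, v, depth_m, fx, fy, cx, cy):
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

# A pixel at the principal point maps straight down the optical axis.
assert backproject(320, 240, 1.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0) == (0.0, 0.0, 1.5)
```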

RGB-D Data

RGB-D data pairs standard color (red-green-blue) video frames with aligned depth frames captured at the same moment, providing both appearance and geometry for every scene. The depth channel (D) gives each pixel a distance value in addition to its color, enabling direct 3D reconstruction and precise grasp pose estimation. RGB-D cameras such as Intel RealSense and Microsoft Azure Kinect are common in robotics research. RGB-D data is particularly valuable for manipulation tasks where the 3D geometry of objects directly determines grasp feasibility.

Point Cloud

A point cloud is a set of data points in three-dimensional space, where each point has (x, y, z) coordinates and often additional attributes such as color or surface normals, representing the geometry of a scanned object or environment. Point clouds are generated by LiDAR sensors, structured light depth cameras, or through reconstruction from RGB-D sequences. In robotics, point clouds feed into grasp planning algorithms, 3D object detection, and occupancy mapping. Standard formats include PLY and PCD files compatible with the Open3D and PCL libraries.

Proprioceptive Data

Proprioceptive data captures a robot's internal state — joint angles, joint velocities, end-effector position and orientation, gripper force, and torque readings — without relying on external sensors like cameras. Proprioception provides the robot's sense of its own body position in space, analogous to human kinesthetic awareness. In manipulation and locomotion policies, proprioceptive data is typically concatenated with visual observations as part of the observation vector, giving the policy information about its current configuration and the forces being applied at each joint.

Synthetic Data (for Robotics)

Synthetic data for robotics is training data generated in simulation environments such as Isaac Sim, MuJoCo, Genesis, or Habitat, rather than collected from physical robot systems or human demonstrations. Synthetic data offers unlimited scale and perfect ground-truth labels — exact object poses, contact forces, and joint states — at a fraction of the cost of real-world collection. Its primary limitation is the sim-to-real gap: visual and physical discrepancies between the simulated and real-world distributions that cause policies to fail on deployment hardware.

Annotation Types

Keypoint Annotation

Keypoint annotation marks specific landmark locations on objects or body parts — such as fingertips, joint centers, or object corners — as (x, y) coordinates within an image frame or (x, y, z) coordinates in 3D space. For human body pose, standard keypoint sets include the COCO 17-point skeleton and the OpenPose 25-point body model. For hand-object interaction, keypoints mark fingertip positions, wrist center, and object contact points. Keypoint annotations are the primary training signal for pose estimation models including ViTPose.
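
In the COCO format, keypoints are stored as a flat `[x1, y1, v1, x2, y2, v2, ...]` list, where `v` is a visibility flag (0 = not labeled, 1 = labeled but occluded, 2 = labeled and visible). A small parsing sketch, using only the first three landmarks of the 17-point skeleton:

```python
COCO_NAMES = ["nose", "left_eye", "right_eye"]  # first 3 of the 17-point skeleton

def parse_keypoints(flat, names):
    """Flat COCO keypoint list -> {landmark: (x, y, visibility)}."""
    return {name: (flat[3 * i], flat[3 * i + 1], flat[3 * i + 2])
            for i, name in enumerate(names)}

kps = parse_keypoints([120, 80, 2, 130, 75, 2, 110, 75, 1], COCO_NAMES)
assert kps["nose"] == (120, 80, 2)       # visible
assert kps["right_eye"][2] == 1          # labeled but occluded
```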

Temporal Annotation

Temporal annotation marks the start and end timestamps of events, actions, or state transitions within a video, creating labeled segments along the time axis. In robotics, temporal annotations define the boundaries of discrete action phases — when a grasp begins and ends, when an object is in transit, when contact is made or released. Temporal precision directly affects policy performance: annotations at 100ms granularity miss the frame-level timing (roughly 33ms per frame at 30fps) that manipulation policies need to learn smooth, reactive behavior.

Action Segmentation

Action segmentation is the task of partitioning a video into temporally contiguous segments and assigning an action class label to each segment — for example, labeling consecutive frames as reach, grasp, transport, and place. Unlike activity recognition, which assigns a single label to an entire video, action segmentation produces a frame-level or segment-level label sequence. Action segmentation annotations are essential for training manipulation policies that decompose complex tasks into primitive actions and for generating the temporal supervision required by sequence models.
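
The frame-level and segment-level views are interchangeable. A minimal sketch of collapsing a per-frame label sequence into (label, start_frame, end_frame) segments, with end indices inclusive:

```python
def frames_to_segments(labels):
    """Per-frame labels -> list of (label, start, end) segments, end inclusive."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i - 1))
            start = i
    return segments

seq = ["reach"] * 3 + ["grasp"] * 2 + ["transport"] * 4 + ["place"] * 1
assert frames_to_segments(seq) == [
    ("reach", 0, 2), ("grasp", 3, 4), ("transport", 5, 8), ("place", 9, 9)]
```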

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in an image — for example, labeling all pixels belonging to cups as cup, all countertop pixels as countertop, and all hand pixels as hand. Unlike object detection, which produces bounding boxes, semantic segmentation provides pixel-precise region boundaries. For robot manipulation, semantic segmentation is used to identify graspable objects, avoid obstacles, and segment the workspace layout. Models like SAM (Segment Anything Model) produce high-quality semantic masks that Claru uses as a standard enrichment layer.

Instance Segmentation

Instance segmentation extends semantic segmentation by distinguishing individual object instances of the same class — for example, separately labeling cup_1, cup_2, and cup_3 rather than assigning all three to a single cup class. This enables tracking individual objects across frames, understanding cluttered workspaces where multiple instances of the same object type are present simultaneously, and generating the per-object identity labels required for tasks like multi-object manipulation and assembly sequencing.

Activity Annotation

Activity annotation labels what a person or robot is doing in a video at a coarser temporal granularity than action segmentation — for example, labeling a 30-second clip as preparing breakfast or repairing a bicycle. Activity labels provide high-level semantic context that complements fine-grained action segmentation. In egocentric video datasets, activity annotations define the top-level task category and are used to filter and stratify training data, ensuring that manipulation policies train on task-relevant clips rather than unrelated background footage.

Bounding Box Annotation

Bounding box annotation draws the smallest axis-aligned rectangle that fully encloses an object within an image frame, labeled with an object class identifier. Bounding boxes are the most common object detection annotation format and provide approximate spatial localization without the per-pixel precision of segmentation masks. In robotics, bounding boxes are used for object detection, workspace analysis, and as an input to downstream grasp planning pipelines. Temporal sequences of bounding boxes across video frames provide the training signal for object tracking models.
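
Boxes are typically stored as (x_min, y_min, x_max, y_max) and compared with intersection-over-union (IoU), the standard matching metric for detection and tracking. A minimal sketch:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

assert iou((0, 0, 2, 2), (1, 1, 3, 3)) == 1 / 7   # overlap 1, union 7
assert iou((0, 0, 1, 1), (2, 2, 3, 3)) == 0.0     # disjoint boxes
```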

Preference Annotation (RLHF)

Preference annotation collects human judgments about which of two or more AI outputs is preferred, providing the training signal for reward models used in reinforcement learning from human feedback (RLHF). Annotators compare pairs of model outputs — robot trajectories, generated videos, text responses — and label which is better along specified dimensions. In robotics, preference annotations evaluate trajectory smoothness, task success, and natural motion. In video generation, they assess motion quality, fidelity, and text-video alignment across model configurations.

Data Quality & Pipelines

Data Enrichment

Data enrichment is the process of augmenting raw collected data with additional annotation layers — depth maps, segmentation masks, pose estimates, optical flow, captions, and action labels — that downstream models need but that are not present in the original capture. Rather than delivering raw video, enrichment pipelines run automated models (Depth Anything V2, ViTPose, SAM, RAFT) and human annotation passes to produce a multi-layer dataset ready for direct use in training. Enrichment is distinct from annotation: it adds derived signals, not just labels assigned by humans.

Benchmark Curation

Benchmark curation is the construction of a held-out evaluation dataset used to measure model performance on a specific task or capability. A well-curated benchmark is representative of the deployment distribution, covers edge cases and failure modes, and has high-quality ground-truth labels. In physical AI, benchmarks are used to compare robot policies across manipulation difficulty, environment diversity, and task generalization. Benchmark curation involves selecting diverse samples, verifying annotation quality, and preventing contamination between training and evaluation splits.

Data Deduplication

Data deduplication identifies and removes near-duplicate samples from a training dataset that would cause the model to overfit to repeated examples rather than learning the true underlying distribution. In video datasets, deduplication operates at the frame level (perceptual hashing), clip level (embedding similarity), or trajectory level (action sequence similarity). Effective deduplication improves training efficiency and generalization: models trained on deduplicated datasets often achieve better downstream performance with less compute than those trained on raw, redundant corpora.
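
Frame-level perceptual hashing can be sketched with a tiny average hash: threshold each cell of a downsampled grayscale grid against the grid mean, then compare hashes by Hamming distance. Real pipelines use larger grids and learned embeddings; the 2x2 grids below are purely illustrative.

```python
def average_hash(grid):
    """2D list of grayscale values -> tuple of bits (1 if above the grid mean)."""
    flat = [v for row in grid for v in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if v > mean else 0 for v in flat)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

a = [[10, 200], [15, 190]]
b = [[12, 205], [14, 188]]   # near-duplicate of a: identical hash
c = [[200, 10], [190, 15]]   # mirrored content: distinct hash
assert hamming(average_hash(a), average_hash(b)) == 0
assert hamming(average_hash(a), average_hash(c)) > 0
```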

Inter-Annotator Agreement (IAA)

Inter-annotator agreement (IAA) is a metric that quantifies the degree to which independent annotators assign the same labels to the same data samples, measuring the reliability and consistency of an annotation process. High IAA indicates that the task is well-defined and the guidelines are clear. Common IAA metrics include Cohen's kappa for pairwise agreement, Fleiss' kappa for multiple annotators, and Krippendorff's alpha for ordinal or continuous scales. Claru monitors Krippendorff's alpha as a primary quality signal, with a target threshold of 0.85 or above for preference annotation tasks.
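
Cohen's kappa corrects observed agreement for the agreement two annotators would reach by chance given their label frequencies. A minimal sketch for categorical labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

a = ["yes", "yes", "no", "yes", "no", "no"]
b = ["yes", "yes", "no", "no", "no", "no"]
assert 0.6 < cohens_kappa(a, b) < 0.7   # substantial but imperfect agreement
```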

RLHF (Reinforcement Learning from Human Feedback)

RLHF is a training paradigm in which a reward model is trained on human preference annotations — judgments about which AI outputs are better — and then used to fine-tune a base model through reinforcement learning to produce outputs humans prefer. RLHF was central to the training of InstructGPT, ChatGPT, and Claude. In robotics, RLHF is applied to train reward models that evaluate trajectory quality, enabling policies to improve from human evaluations of robot behavior rather than requiring explicit reward engineering.

Data Quality Scoring

Data quality scoring assigns a numerical quality score to each sample in a dataset based on criteria such as annotation correctness, visual clarity, task relevance, and diversity contribution. Quality scores are used to filter out low-quality samples before training, weight samples during training, or prioritize which samples to re-annotate. In video datasets, quality scoring evaluates factors including motion blur, occlusion severity, camera calibration drift, and action completeness. Automated quality scoring reduces the labeling load by focusing human review on borderline samples rather than clearly acceptable or rejected ones.

Dataset Diversity

Dataset diversity measures the range of variation in a training corpus across dimensions that matter for model generalization — scene appearance, lighting, object category, geographic location, task type, and operator behavior. A diverse dataset reduces overfitting to specific environments and improves zero-shot performance in novel settings. In robotics, diversity is measured along axes including environment category (kitchen, warehouse, outdoor), object type (rigid, deformable, transparent), and viewpoint (wrist camera, head camera, external camera). Claru tracks diversity coverage explicitly across all collection campaigns.

Active Learning

Active learning is a data collection and annotation strategy in which a model identifies which unlabeled samples it is most uncertain about, and those samples are prioritized for human annotation. This concentrates labeling effort on the examples that will most improve model performance, reducing the total annotation volume required to reach a target accuracy. In robotics, active learning selects demonstration scenarios that expose gaps in the current policy — edge cases, failure modes, or underrepresented environments — for targeted data collection rather than random sampling.
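
The simplest selection rule is uncertainty sampling: rank unlabeled samples by predictive entropy and send the top-k to annotators. A minimal sketch; the clip ids and class probabilities are illustrative:

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, k):
    """predictions: {sample_id: probability list} -> k most uncertain ids."""
    return sorted(predictions, key=lambda s: entropy(predictions[s]), reverse=True)[:k]

preds = {
    "clip_001": [0.98, 0.01, 0.01],   # confident -> skip
    "clip_002": [0.40, 0.35, 0.25],   # near-uniform -> prioritize for labeling
    "clip_003": [0.55, 0.40, 0.05],
}
assert select_for_labeling(preds, 1) == ["clip_002"]
```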

Computer Vision Fundamentals

Optical Flow

Optical flow is a dense motion field that describes, for every pixel in an image, the apparent velocity of that pixel between consecutive frames — encoded as a 2D vector indicating direction and magnitude of motion. Optical flow captures how objects and surfaces move through the scene and is used in robotics to detect moving obstacles, estimate camera ego-motion, and segment foreground objects from background. RAFT (Recurrent All-Pairs Field Transforms) is the standard model for computing optical flow in physical AI data enrichment pipelines.

Monocular Depth Estimation

Monocular depth estimation predicts a per-pixel depth map from a single RGB image, inferring scene geometry without requiring a stereo camera pair or active depth sensor. This is achieved by neural networks trained on large corpora of RGB-depth pairs, learning visual cues such as perspective foreshortening, object size, and texture gradient that correlate with distance. Depth Anything V2 is the current standard for monocular depth in physical AI pipelines. Monocular depth enables depth enrichment at scale for video datasets collected with standard single-lens cameras.

Pose Estimation

Pose estimation predicts the positions of anatomical landmarks — body joints, hand keypoints, or object corners — in 2D image coordinates or 3D space. Human body pose estimation produces skeleton representations used to understand how humans perform tasks, providing the reference demonstrations that robot learning systems imitate. Hand pose estimation localizes finger joints to capture dexterous manipulation in egocentric video. ViTPose is the standard vision transformer model for human pose estimation in physical AI data pipelines, trained on COCO Keypoints and MPII.

Hand-Object Interaction (HOI)

Hand-object interaction (HOI) refers to the detection, segmentation, and analysis of the contact relationship between human hands and objects in video — identifying which hand is touching which object, the contact region, grip type, and the resulting object state change. HOI annotations are critical for robotics training data because manipulation tasks are fundamentally about how hands (and by extension, robot end-effectors) interact with objects. HOI detection in egocentric video provides the ground-truth skill demonstrations that robot manipulation policies learn from.

Object Tracking

Object tracking maintains the identity of one or more objects across consecutive video frames, assigning consistent identifiers as objects move, become occluded, or change appearance. Tracking converts per-frame detections into temporally coherent object trajectories, which are essential for understanding how objects are manipulated over time. In physical AI training data, tracking links object instances across frames to enable identity-consistent annotations, trajectory prediction training, and the temporal association needed for action segmentation and reward learning.

Video Prediction

Video prediction is the task of generating plausible future video frames given a sequence of past frames, requiring a model to understand scene dynamics, object physics, and the temporal evolution of appearance. Video prediction models are a form of learned world model: they internalize how objects move, deform, and interact under physical constraints. Training video prediction models requires large corpora of real-world video with diverse motion patterns — not just static scene images — making egocentric and robotics video particularly valuable for this task.

SAM (Segment Anything Model)

SAM (Segment Anything Model) is a promptable image segmentation model developed by Meta AI that generates high-quality object masks from point, box, or mask prompts, without task-specific training. SAM can segment any object in an image — known or unknown — making it a general-purpose tool for annotation automation. Later releases extended the model to video: SAM3 tracks and segments objects across video frames. Claru uses SAM3 as a standard layer in its enrichment pipeline to produce segmentation masks for every object in egocentric video collections.

Panoptic Segmentation

Panoptic segmentation combines semantic segmentation and instance segmentation into a unified output where every pixel is assigned both a class label and an instance identifier. Countable objects (things) such as cups, hands, and tools receive unique instance IDs, while background regions (stuff) such as floor, table, and wall receive class labels only. Panoptic segmentation provides the most complete pixel-level scene understanding, enabling robot systems to simultaneously know what type every surface is and which individual object is which without running two separate pipelines.
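
One common way to store a panoptic output is a single integer map per pixel. The Cityscapes-style convention below (thing pixels store class_id * 1000 + instance_index, stuff pixels store just class_id) is one of several encodings in use; the class ids here are illustrative and assumed to be below 1000:

```python
def encode(class_id, instance_index=None):
    """Pack (class, instance) into one panoptic id; stuff has no instance."""
    return class_id * 1000 + instance_index if instance_index is not None else class_id

def decode(panoptic_id):
    """Recover (class_id, instance_index); instance is None for stuff pixels."""
    if panoptic_id >= 1000:               # a "thing" pixel
        return panoptic_id // 1000, panoptic_id % 1000
    return panoptic_id, None              # a "stuff" pixel

assert decode(encode(17, 2)) == (17, 2)   # e.g. cup instance #2
assert decode(encode(5)) == (5, None)     # e.g. floor, class label only
```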

Robotics Fundamentals

Imitation Learning

Imitation learning is a class of robot learning methods in which a policy is trained to replicate the behavior of an expert demonstrator, learning from observations of how a human or expert robot performs a task rather than from trial-and-error exploration. The simplest form of imitation learning is behavioral cloning, which treats demonstration data as a supervised learning problem. More advanced approaches like DAgger and inverse reinforcement learning address the distributional shift problem that arises when the policy encounters states outside the demonstration distribution.

Behavioral Cloning (BC)

Behavioral cloning (BC) is the simplest form of imitation learning, treating demonstration data as a supervised learning problem: given an observation, predict the action the expert demonstrator took. A policy is trained by minimizing the difference between predicted and demonstrated actions across a dataset of (observation, action) pairs. BC is data-efficient and straightforward to implement but suffers from compounding errors when the policy encounters states slightly outside the demonstration distribution, since small mistakes at each step can compound into large deviations over a long trajectory.
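
The supervised objective can be made concrete with a toy one-dimensional policy. This is a sketch under heavy simplification — a scalar linear policy, hand-written gradient descent, and synthetic demonstrations where the expert acts as twice the observation:

```python
def mse_loss(w, data):
    """Mean-squared error between predicted (w * obs) and demonstrated actions."""
    return sum((w * obs - act) ** 2 for obs, act in data) / len(data)

def gradient_step(w, data, lr=0.1):
    """One gradient-descent step on the BC objective."""
    grad = sum(2 * (w * obs - act) * obs for obs, act in data) / len(data)
    return w - lr * grad

demos = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # expert policy: action = 2 * obs
w = 0.0
for _ in range(50):
    w = gradient_step(w, demos)
assert abs(w - 2.0) < 1e-3       # the cloned policy recovers the expert mapping
assert mse_loss(w, demos) < 1e-6
```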

Sim-to-Real Gap

The sim-to-real gap refers to the performance degradation that occurs when a robot policy trained in simulation is deployed on physical hardware, caused by discrepancies between simulated and real-world visual appearance, physics, sensor noise, and actuator dynamics. Even photorealistic simulators produce textures, lighting, contact physics, and deformable object behavior that differ measurably from the real world. Bridging the sim-to-real gap requires either domain randomization during simulation training, real-world fine-tuning data, or both in combination.

Domain Randomization

Domain randomization is a simulation training technique that trains a policy across a wide range of randomized visual and physical simulation parameters — object textures, lighting colors, camera positions, friction coefficients, and object masses — so that the real world appears as just another variation in the training distribution. By training on many randomized environments, the policy learns representations that are robust to the specific parameter values, making it more likely to transfer to the real-world domain. Domain randomization reduces the sim-to-real gap without requiring large amounts of real-world data.
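
In practice this amounts to sampling a fresh simulation configuration per episode. A minimal sketch; the parameter names and ranges are illustrative, not from any specific simulator:

```python
import random

# Hypothetical per-episode randomization ranges (min, max).
RANDOMIZATION_RANGES = {
    "friction":          (0.4, 1.2),
    "object_mass_kg":    (0.05, 0.8),
    "light_intensity":   (0.3, 1.0),
    "camera_x_offset_m": (-0.02, 0.02),
}

def sample_episode_params(rng):
    """Draw one randomized configuration for the next training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

params = sample_episode_params(random.Random(0))
assert all(lo <= params[k] <= hi for k, (lo, hi) in RANDOMIZATION_RANGES.items())
```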

Action Chunking

Action chunking is a technique in robot learning where the policy predicts a short sequence of future actions (a chunk) rather than a single action at each timestep, then executes that chunk before predicting the next one. Chunking reduces the effective frequency of policy inference, lowering latency demands on the policy network, and enables the policy to plan ahead within the chunk horizon. The Action Chunking with Transformers (ACT) method popularized this approach, demonstrating that chunks of 10-100 actions significantly improve performance on dexterous manipulation tasks compared to single-step action prediction.
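
The receding-horizon execution loop above can be sketched as follows. `policy`, `get_obs`, and `execute` are stand-ins for the real policy network, observation source, and robot interface:

```python
def run_with_chunking(policy, get_obs, execute, steps, chunk_size):
    """Query the policy once per chunk; execute each action in the chunk."""
    queries, t = 0, 0
    while t < steps:
        chunk = policy(get_obs())[:chunk_size]   # predict a chunk of future actions
        queries += 1
        for action in chunk:
            execute(action)
            t += 1
            if t >= steps:
                break
    return queries

executed = []
queries = run_with_chunking(lambda obs: [0.0] * 20, lambda: None,
                            executed.append, steps=100, chunk_size=20)
assert queries == 5 and len(executed) == 100   # one inference per 20 control steps
```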

Diffusion Policy

Diffusion Policy is a robot learning method that frames action prediction as a conditional denoising diffusion process: the policy generates action sequences by iteratively removing noise from a random sample, conditioned on the current visual observation. Diffusion models naturally represent multi-modal action distributions — situations where multiple different actions are all correct responses to the same observation — which standard regression-based behavioral cloning cannot capture. Diffusion Policy achieves state-of-the-art performance on dexterous manipulation benchmarks and underlies the action heads in several commercial humanoid platforms.

Reward Model

A reward model is a neural network trained to predict a scalar quality score for a given AI output — a robot trajectory, a text response, or a video — based on human preference annotations. The reward model encodes human judgment as a differentiable function, allowing reinforcement learning algorithms to optimize a policy toward outputs that humans prefer. Reward models trained on low-quality or inconsistent preference annotations invite reward hacking: policies learn to score highly on the reward model while producing outputs humans actually dislike. High inter-annotator agreement is essential for reliable reward model training.
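
Reward models are commonly fit with a pairwise (Bradley-Terry) objective on preference pairs: minimize -log sigmoid(r_chosen - r_rejected), which pushes the preferred output's score above the rejected one's. A minimal sketch with illustrative scores:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

good = preference_loss(2.0, -1.0)   # model agrees with the annotator: low loss
bad = preference_loss(-1.0, 2.0)    # model contradicts the annotator: high loss
assert good < bad
assert abs(preference_loss(0.0, 0.0) - math.log(2.0)) < 1e-12  # indifference
```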

6-DOF Grasp Planning

6-DOF grasp planning determines the full six-degrees-of-freedom pose — three translational (x, y, z) and three rotational (roll, pitch, yaw) — at which a robot end-effector should approach and grasp an object. Unlike top-down planar grasping, 6-DOF planning considers arbitrary object geometries and orientations, enabling grasps from the side, below, or at any angle. Training 6-DOF grasp networks requires point cloud or RGB-D data paired with labels specifying grasp quality scores or binary success labels for sampled grasp poses.

Models & Architectures

RAFT (Optical Flow)

RAFT (Recurrent All-Pairs Field Transforms) is a deep learning architecture for optical flow estimation that builds a 4D correlation volume for all pairs of pixels between two frames and iteratively updates a flow estimate using recurrent refinement steps. RAFT achieves state-of-the-art accuracy on optical flow benchmarks (Sintel, KITTI) and runs efficiently enough for large-scale video processing pipelines. Claru uses RAFT as the standard optical flow model in its enrichment pipeline, computing dense motion fields for every consecutive frame pair in egocentric video collections.

Depth Anything V2

Depth Anything V2 is a monocular depth estimation model developed by researchers at the University of Hong Kong and TikTok, trained on a combination of labeled real-world data and large-scale synthetic data to produce high-quality relative and metric depth maps from single RGB images. The V2 release improved fine-grained detail accuracy on transparent, reflective, and occluded surfaces compared to V1, making it more reliable for real-world robotics enrichment. Depth Anything V2 is the standard depth estimation model in Claru's video enrichment pipeline.

ViTPose

ViTPose is a human pose estimation model that uses a plain Vision Transformer (ViT) backbone, demonstrating that the transformer architecture that dominates NLP and image classification also achieves state-of-the-art performance on keypoint detection tasks without task-specific architectural modifications. ViTPose is trained on the COCO Keypoints and MPII Human Pose datasets, and the ViTPose+ variant supports whole-body pose including body, hand, face, and foot keypoints. Claru uses ViTPose to extract 2D and 3D joint positions from egocentric video as a standard enrichment layer for robotics training data.

Open X-Embodiment (OXE)

Open X-Embodiment (OXE) is a large-scale robot learning dataset released by Google DeepMind and collaborators in 2023, aggregating over 1 million robot trajectories from 22 different robot embodiments across 21 research institutions. OXE provides the broadest available collection of real-robot manipulation demonstrations and was used to train the RT-X family of models, demonstrating that cross-embodiment pre-training improves policy performance on new robots. OXE is publicly available but covers a limited set of robot platforms, environments, and task categories compared to what production robotics teams require.

Diffusion Transformer (DiT)

A Diffusion Transformer (DiT) is a neural network architecture that uses a transformer — stacked self-attention and feed-forward layers — as the backbone of a diffusion model, replacing the U-Net that dominated earlier diffusion model designs. DiT models scale more predictably with model size and training data than U-Net-based diffusion models and have become the architecture of choice for video generation and world model training. Sora, Stable Diffusion 3, and several robotics world models use DiT-based architectures.

Vision Transformer (ViT)

A Vision Transformer (ViT) is an image recognition architecture that applies the transformer architecture directly to images by splitting an image into fixed-size patches, linearly embedding each patch, and processing the sequence of patch embeddings with standard transformer self-attention layers. ViT models, introduced by Dosovitskiy et al. in 2021, achieve state-of-the-art performance on image classification when trained on large enough datasets and have become the standard backbone for a wide range of vision models including object detection, segmentation, pose estimation, and VLA models.
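The patch-splitting step that turns an image into a token sequence can be sketched in a few lines (nested lists stand in for a real tensor; the learned linear embedding that follows is omitted):

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists: image[row][col] is a list of
    C channel values) into non-overlapping patch_size x patch_size patches,
    each flattened into one vector. These vectors are the tokens a ViT's
    self-attention layers operate on."""
    H, W = len(image), len(image[0])
    patches = []
    for r in range(0, H, patch_size):
        for c in range(0, W, patch_size):
            patch = []
            for dr in range(patch_size):
                for dc in range(patch_size):
                    patch.extend(image[r + dr][c + dc])
            patches.append(patch)
    return patches
```

A 224x224 image with 16x16 patches yields 196 tokens, which is why ViT sequence lengths are short compared to per-pixel models.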

GR00T N1 (NVIDIA)

GR00T N1 is a general-purpose humanoid robot foundation model developed by NVIDIA, announced in 2025, designed to serve as a pre-trained base that robotics teams can fine-tune for specific humanoid platforms and tasks. GR00T N1 processes multimodal inputs including video, text, and proprioceptive state, and outputs motor actions. The model was trained on a combination of real robot demonstrations from the Open X-Embodiment dataset, synthetic simulation data from NVIDIA Isaac, and synthetic video generated from physical simulations, representing a hybrid data strategy for humanoid generalization.

pi-zero (Physical Intelligence)

pi-zero is a general-purpose robot foundation model developed by Physical Intelligence (pi), released in late 2024. pi-zero uses a flow-matching action head built on top of a pre-trained vision-language model backbone, enabling zero-shot and few-shot generalization to new tasks and robot embodiments. Physical Intelligence trained pi-zero on a large proprietary dataset of robot demonstrations spanning multiple robot platforms and diverse manipulation tasks, including dexterous tasks like laundry folding, table bussing, and grocery bagging that previous general-purpose models struggled to perform reliably.
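Flow matching trains the action head to predict a velocity field that transports noise to expert actions. The sketch below shows the standard linear-path training target in the abstract; it is not Physical Intelligence's implementation, and the function name is illustrative:

```python
def flow_matching_pair(a0, a1, t):
    """Linear-path flow matching target. a0 is a noise sample, a1 a
    ground-truth action vector, t in [0, 1]. Returns the interpolated
    point x_t = (1 - t) * a0 + t * a1 and the constant velocity
    target a1 - a0 that the network is trained to regress at x_t."""
    x_t = [(1 - t) * n + t * a for n, a in zip(a0, a1)]
    v_target = [a - n for n, a in zip(a0, a1)]
    return x_t, v_target
```

At inference time the model integrates its predicted velocity field from noise toward an action chunk, which is what lets a flow-matching head output continuous, high-frequency control signals.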

Building a Physical AI System?

Claru provides the egocentric video, manipulation trajectories, and annotation layers that the terms in this glossary describe. Tell us what your model needs to learn.