Last updated: April 2026

VLM vs VLA: What's the Actual Difference? (2026)

The terms VLM and VLA are used interchangeably in lab papers, job postings, and press releases. They are not the same thing. The distinction matters most when you start asking what data you need to build one.

TL;DR

  • A VLM (Vision-Language Model) takes image and text inputs and outputs text; a VLA (Vision-Language-Action model) takes image and text inputs and outputs robot motor commands.
  • The training data requirement is the key differentiator: VLMs train on web-scraped image-text pairs, while VLAs require observation-action-instruction triplets collected through teleoperation or human demonstrations.
  • RT-2, OpenVLA, pi-zero, and GR00T N1 are all VLAs — each uses a VLM as a backbone and extends it with an action output head trained on robot trajectory data.
  • You cannot build a VLA from internet data alone; action-labeled trajectories must be purposefully collected, which is why VLA training data is scarce and expensive relative to VLM data.

What Is a VLM?

A Vision-Language Model (VLM) is a multimodal model that processes visual inputs (images or video frames) alongside natural language and produces text output. The defining characteristic is the output modality: text. A VLM answers visual questions, generates image captions, describes scenes, or performs visual reasoning — but it does not produce actions that control physical systems.

Well-known VLMs include PaLI-X (Google), PaliGemma (Google DeepMind), LLaVA (Haotian Liu et al., University of Wisconsin), GPT-4V (OpenAI), and Claude 3 Vision (Anthropic). All of these take images as input and output text. Their training data is primarily sourced from web-crawled image-text pairs: LAION-5B, Conceptual Captions 12M (CC12M), image-text data extracted from alt-text, and licensed image datasets.

The practical consequence: VLM training data exists at internet scale. LAION-5B contains 5 billion image-text pairs. CC12M contains 12 million high-quality image-caption pairs. This abundance is why VLMs have scaled so quickly — the training data problem is largely solved through web crawling.

VLMs are also useful as a starting point for building more capable systems. Their learned visual representations — object recognition, spatial reasoning, scene understanding — transfer well to downstream tasks. This is exactly why VLAs frequently start from VLM checkpoints.

What Is a VLA?

A Vision-Language-Action model (VLA) extends the VLM paradigm by replacing the text output head with an action output head. Given a visual observation (one or more camera frames) and a natural language instruction, it predicts the physical actions a robot should take to complete the described task — joint angles, end-effector poses, or gripper states.

The formal input-output contract for a VLA is:

Input:  (o_t, l)       # observation at timestep t, language instruction
Output: a_t            # action vector at timestep t
                       # e.g. [Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper]

This architecture is what makes a model useful for robotics. A VLM can tell you "the cup is to the left of the plate" but cannot tell a robot arm how to pick it up. A VLA can do both: understand the scene through its VLM backbone, then generate the motor commands to execute the task.
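As a minimal sketch, the contract above can be written as a Python interface (class and method names here are illustrative, not taken from any released VLA codebase):

```python
import numpy as np

class VLAPolicy:
    """The (o_t, l) -> a_t contract: one camera frame plus a language
    instruction in, one action vector out. Names are illustrative."""

    def predict_action(self, observation: np.ndarray, instruction: str) -> np.ndarray:
        raise NotImplementedError

class ZeroPolicy(VLAPolicy):
    """Trivial stand-in that always emits the null action, used only to
    demonstrate the interface shape."""

    def predict_action(self, observation, instruction):
        # [dx, dy, dz, droll, dpitch, dyaw, gripper]: delta pose plus gripper state
        return np.zeros(7)

frame = np.zeros((224, 224, 3), dtype=np.uint8)          # o_t: one RGB camera frame
action = ZeroPolicy().predict_action(frame, "pick up the blue cup")  # l: instruction
assert action.shape == (7,)  # a_t: a 7-DoF action vector, not text
```

A VLM, by contrast, implements `(image, text) -> text`; everything downstream of that signature change, including the training data, follows from it.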

The critical implication is for training data. To train this action output head, you need action-labeled trajectories — timestep-by-timestep records of what the robot did alongside what it observed and what instruction it was following. This data cannot be scraped from the internet. It requires human operators teleoperating robots, or motion capture systems recording human demonstrations that are then retargeted to robot kinematics.
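Concretely, one teleoperated demonstration can be represented as a small record type. A sketch (field and class names are illustrative, not from any particular dataset format):

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Step:
    """One timestep of a demonstration."""
    image: np.ndarray    # (H, W, 3) camera frame observed at time t
    action: np.ndarray   # (7,) action the robot executed at time t

@dataclass
class Demonstration:
    """A full trajectory: the language instruction plus its timestep sequence."""
    instruction: str
    steps: List[Step]

def to_triplets(demo: Demonstration) -> List[Tuple[np.ndarray, str, np.ndarray]]:
    """Flatten a demonstration into (observation, instruction, action) training triplets."""
    return [(s.image, demo.instruction, s.action) for s in demo.steps]

demo = Demonstration(
    instruction="put the cup on the plate",
    steps=[Step(np.zeros((224, 224, 3), np.uint8), np.zeros(7)) for _ in range(30)],
)
triplets = to_triplets(demo)  # 30 supervised samples from one teleoperated episode
```

Every one of those action vectors had to come from a physical robot executing the task, which is the collection bottleneck the rest of this section describes.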

The Data Difference: Why It Matters

The VLM/VLA distinction is not primarily architectural — it is fundamentally about data. Both model families use transformer architectures with attention mechanisms over visual and language tokens. The difference is what those models are trained to predict.

Training a VLM requires image-text pairs. These are abundant. You can crawl Common Crawl, filter for image-containing pages, extract alt-text, and arrive at billions of training samples with a few weeks of engineering effort. The cost of VLM training data is primarily compute for filtering and preprocessing, not collection.

Training a VLA requires observation-action-instruction triplets. Every triplet requires a physical robot, a human operator, a task environment, and a data collection protocol. The Open X-Embodiment dataset — the largest public VLA training corpus — aggregates 1M+ trajectories across 22 robot embodiments from 21 research institutions and took years of coordinated effort to assemble. Even with this scale, it covers only a narrow range of manipulation tasks and environments compared to what a generalizable robot needs to handle.

This data scarcity is the primary bottleneck for VLA development in 2026. Teams working on humanoid robots, dexterous manipulation, and mobile manipulation all face the same constraint: there is not enough action-labeled trajectory data covering the diversity of environments, tasks, and object types their robots will encounter in deployment.

One partial solution is pretraining on human egocentric video — footage captured from a first-person viewpoint during manipulation tasks. While this data lacks robot action labels, it provides rich visual patterns of how objects are handled, grasped, and manipulated. Research such as EgoMimic has shown that co-training on egocentric human video improves VLA performance without requiring full teleoperation data for every task.

VLM vs VLA: Side-by-Side Comparison

  • Output modality. VLM: text (captions, answers, descriptions). VLA: robot actions (joint angles, poses, gripper states).
  • Training data format. VLM: image–text pairs. VLA: observation–action–instruction triplets.
  • Data source. VLM: web-scraped (LAION, CC12M, etc.). VLA: robot teleoperation, human demonstrations.
  • Scale of training data. VLM: billions of image-text pairs. VLA: thousands to millions of trajectories.
  • Pretraining target. VLM: visual-language alignment. VLA: often initialized from a VLM, then action fine-tuned.
  • Inference input. VLM: image + text query. VLA: camera frame(s) + language instruction.
  • Key examples. VLM: PaLI-X, PaliGemma, LLaVA, GPT-4V. VLA: RT-2, OpenVLA, pi-zero, GR00T N1, Octo.

Major VLA Models: RT-2, OpenVLA, pi-zero, GR00T N1, Octo

RT-2 (Google DeepMind, 2023)

RT-2 is the paper that established the VLM-to-VLA transfer paradigm at scale. It co-fine-tunes a VLM backbone (PaLI-X 55B or PaLM-E 12B) on both web image-text data and robot trajectory data simultaneously. The robot actions are tokenized as text tokens, which lets the model output actions through the same softmax head used for language generation. RT-2 demonstrated emergent behaviors not present in its robot training data — reasoning chains that generalized to novel objects. However, RT-2's inference latency (~1-3s per action) and proprietary backbone make it primarily a research result rather than a deployable production system.
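RT-2's action tokenization is, at its core, uniform binning of each continuous action dimension so actions can be emitted by a language softmax head. A sketch assuming symmetric [-1, 1] bounds and 256 bins (the paper's exact per-dimension ranges differ):

```python
import numpy as np

def tokenize_action(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to an integer bin in [0, n_bins).
    Bounds and bin count are illustrative defaults, not RT-2's exact values."""
    clipped = np.clip(np.asarray(action, dtype=float), low, high)
    return np.round((clipped - low) / (high - low) * (n_bins - 1)).astype(int)

def detokenize_action(bins, low=-1.0, high=1.0, n_bins=256):
    """Invert the binning back to approximate continuous action values."""
    return low + np.asarray(bins, dtype=float) / (n_bins - 1) * (high - low)

a = np.array([0.10, -0.50, 0.0, 0.0, 0.0, 0.0, 1.0])  # [dx..dyaw, gripper]
roundtrip = detokenize_action(tokenize_action(a))
# quantization error is at most half a bin width, (high - low) / (n_bins - 1) / 2
assert np.allclose(roundtrip, a, atol=(2.0 / 255) / 2 + 1e-12)
```

Each integer bin is then mapped onto a token in the language model's vocabulary, so "generate an action" and "generate text" are the same operation at inference time.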

OpenVLA (Stanford, Berkeley, CMU — 2024)

OpenVLA is a 7B-parameter VLA trained on the Open X-Embodiment dataset (970K trajectories). It uses a Prismatic VLM backbone built on Llama 2 and predicts actions as discrete tokens: each action dimension is binned into 256 values mapped onto reserved tokens in the language model's vocabulary. OpenVLA is significant because it is fully open-source (weights on Hugging Face, training code on GitHub) and matches or exceeds RT-2-X on standard manipulation benchmarks despite being far smaller. Fine-tuning OpenVLA on a new task requires as few as 200 demonstrations for simple pick-and-place tasks, though dexterous manipulation tasks require more.

pi-zero (Physical Intelligence, 2024)

pi-zero uses PaliGemma as its VLM backbone and a flow matching action expert that operates at high frequency for dexterous tasks. Physical Intelligence trained pi-zero on a proprietary dataset covering laundry folding, dish loading, box assembly, and bag packing — tasks that require precise bimanual dexterous manipulation. The key architectural distinction is the separation between the slow VLM reasoning pathway (runs at low frequency) and the fast action expert pathway (runs at high frequency for contact-rich manipulation). This two-speed architecture addresses a fundamental tension in VLA design: language understanding benefits from large models, but real-time control requires low latency.
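The two-speed idea can be sketched as a tick-based control loop: the slow pathway refreshes scene features every few ticks, while the fast pathway runs on every tick against the latest cached features. All names and rates below are illustrative, not pi-zero's actual API:

```python
def run_two_speed(n_fast_steps, slow_period, compute_features, compute_action):
    """Tick-based sketch of a two-speed VLA loop: the slow (VLM) pathway
    refreshes features every `slow_period` control ticks, while the fast
    action-expert pathway runs on every tick using the cached features."""
    features = None
    actions = []
    for t in range(n_fast_steps):
        if t % slow_period == 0:                     # slow pathway: expensive VLM forward pass
            features = compute_features(t)
        actions.append(compute_action(features, t))  # fast pathway: cheap, every tick
    return actions

# A 50 Hz control loop with the VLM refreshed at 2 Hz gives slow_period = 25.
refreshes = []
acts = run_two_speed(
    n_fast_steps=100,
    slow_period=25,
    compute_features=lambda t: refreshes.append(t) or t,
    compute_action=lambda feats, t: (feats, t),
)
assert len(acts) == 100
assert refreshes == [0, 25, 50, 75]  # only four expensive VLM passes in 100 ticks
```

The design choice this illustrates: the action rate is decoupled from the reasoning rate, so a large backbone's latency does not cap the control frequency.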

GR00T N1 (NVIDIA, 2025)

GR00T N1 is NVIDIA's foundation model for humanoid robots. It uses a dual-system architecture: a "thinking" system based on Eagle2 (NVIDIA's VLM) for high-level scene understanding and task planning, and an "acting" system based on a diffusion transformer for generating smooth, dexterous motor trajectories. GR00T N1 was trained on a combination of simulation data from Isaac Lab, human video from NVIDIA's EgoScale initiative, and real robot teleoperation data. NVIDIA has also released a teleoperation toolkit (GR00T-Teleop) and simulation benchmark (GR00T-Bench) to help robotics teams collect compatible training data.

Octo (Berkeley, 2024)

Octo is a smaller, faster VLA (93M parameters) trained on Open X-Embodiment, designed for fine-tuning on new robot platforms and tasks. Unlike RT-2 or OpenVLA, Octo does not use a pretrained VLM backbone — it learns from scratch on robot trajectory data using a diffusion head. The advantage is deployment speed: Octo runs at 20+ Hz on a standard GPU, making it viable for real-time control without specialized inference hardware. Octo's small size makes it a practical baseline for teams that want to study VLA fine-tuning without the compute requirements of 7B+ parameter models.

VLMs as VLA Backbones: Why the Transfer Works

The standard approach to building a VLA in 2026 is to initialize from a pretrained VLM and then adapt it to produce actions. This transfer works for a specific reason: manipulation tasks require the same perceptual skills VLMs are trained to develop — object recognition, spatial reasoning, understanding natural language goal specifications, and visual scene parsing.

Starting from a VLM checkpoint means a VLA team does not need to teach the model what a "cup" or "plate" is from robot trajectories alone. That knowledge is already encoded in the VLM weights. The action fine-tuning only needs to teach the model how to interact with those objects, not how to recognize them.

This separation also explains why the two data types have different collection requirements. VLM pretraining data (internet image-text) can be billion-scale; VLA action data can be far smaller but must be much more precisely structured. The ratio in RT-2 is approximately 1,000:1 (web data to robot trajectory data, by token count). OpenVLA fine-tuning experiments suggest that 100–500 demonstrations are sufficient to adapt a pretrained VLA to a new task, provided the manipulation type (pick-place, pour, fold) is already represented in the base training data.
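Co-fine-tuning on mixed sources amounts to a weighted sampling scheme over datasets. A sketch (the probability, dataset names, and function are illustrative; RT-2's actual mixture is set by token count, not a per-sample coin flip):

```python
import random

def cotraining_stream(web_data, robot_data, web_prob=0.9, n_samples=1000, seed=0):
    """Sketch of co-fine-tuning data mixing: each sample is drawn from web
    image-text data with probability web_prob, otherwise from robot
    trajectories, so the action head keeps seeing trajectory data while
    the backbone retains its web-scale visual-language knowledge."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        source = web_data if rng.random() < web_prob else robot_data
        yield rng.choice(source)

draws = list(cotraining_stream(["web_sample"], ["robot_sample"]))
n_web = draws.count("web_sample")
# the stream is dominated by web data but still interleaves robot trajectories
assert n_web > draws.count("robot_sample") > 0
```

The point of the interleaving is to prevent catastrophic forgetting: training on robot data alone would degrade the perceptual knowledge the VLM checkpoint provides.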

For robotics teams, the practical implication is this: if you are building a VLA for a genuinely novel task or environment not covered by Open X-Embodiment, you need action-labeled demonstrations specific to your robot and task. Claru's 500K+ egocentric clips and 4M+ human annotations cover the visual pretraining side; the action trajectory gap requires teleoperation data or retargeted human demonstration data for your specific platform.

Key Takeaways

  • VLMs output text. VLAs output robot actions. The output modality is the categorical difference.
  • The data consequence of this distinction: VLMs can train on web-scraped image-text pairs at billion-scale; VLAs require physically collected observation-action-instruction triplets.
  • Every major VLA (RT-2, OpenVLA, pi-zero, GR00T N1) uses a VLM as its perceptual and language backbone, then adds an action head trained on robot trajectory data.
  • Open X-Embodiment provides 1M+ trajectories across 22 robot types, but it covers narrow task and environment diversity — teams building production systems need supplementary data.
  • Fine-tuning a pretrained VLA like OpenVLA on a new task requires 100–500 demonstrations for simple pick-place tasks. Dexterous or contact-rich manipulation requires 1,000–50,000 demonstrations.
  • Human egocentric video (first-person footage without action labels) is a cost-effective way to improve VLA visual representations — EgoMimic results suggest that, hour for hour, it can be more valuable than collecting additional robot teleoperation data.

Frequently Asked Questions

What is the difference between VLM and VLA?

A Vision-Language Model (VLM) maps visual inputs and language to text outputs — it answers questions, describes images, or generates captions. A Vision-Language-Action model (VLA) extends this by adding an action output head: given a camera observation and a language instruction, it produces motor commands (joint angles, end-effector poses, gripper states) that directly control a robot. The fundamental difference is the output modality and, consequently, the training data required. VLMs train on image-text pairs scraped from the internet. VLAs require observation-action-instruction triplets collected through robot teleoperation or human demonstrations — data that cannot be scraped and must be purposefully collected.

Do VLAs use language?

Yes. Language is a central input to VLA models — the instruction conditioning is what makes them flexible across tasks. A VLA receives a natural language instruction like 'pick up the blue cup and place it on the plate' alongside camera observations and uses that instruction to condition the action prediction. This is what separates VLAs from older behavior cloning approaches that had no language grounding. The language understanding is typically borrowed from a pretrained VLM backbone: RT-2 uses PaLI-X and PaLM-E, OpenVLA uses Prismatic-7B (built on Llama 2), and pi-zero uses PaliGemma.

What training data do VLA models need?

VLA models require observation-action-instruction triplets: synchronized sequences of (1) visual observations from the robot's cameras, (2) the action trajectory executed at each timestep (end-effector pose, joint angles, gripper state), and (3) a natural language instruction describing the task. This data is collected through human teleoperation of robots or retargeted from egocentric human video. Key public datasets include Open X-Embodiment (1M+ trajectories across 22 robot types), BridgeData V2 (~60K demonstrations), and DROID (76K trajectories across 564 environments). Fine-tuning on a specific task typically requires 50–5,000 additional demonstrations depending on task complexity.

Is RT-2 a VLM or VLA?

RT-2 (Robotic Transformer 2) is a VLA. It was developed by Google DeepMind and published in 2023. RT-2 uses a VLM backbone (PaLI-X or PaLM-E) that was pretrained on web-scale image-text data, then co-fine-tuned on robot action data so it can predict tokenized actions alongside text. The key contribution was showing that a model pretrained on internet-scale visual language data could be adapted to produce robot actions with minimal robot-specific training data — a form of transfer learning from VLM to VLA. RT-2 is frequently cited as the model that established the VLM-to-VLA transfer paradigm.