Physical AI: Models That Understand and Act in the Real World
Physical AI refers to artificial intelligence systems that understand the physical world — its geometry, physics, materials, and dynamics — and can either interact with it through a robot body or generate accurate simulations of it. The field sits at the convergence of computer vision, robotics, physics simulation, and foundation models, unified by the requirement for training data that captures how the real world actually behaves.
What Is Physical AI?
Physical AI is an umbrella term for artificial intelligence systems that model, understand, or interact with the physical world. It encompasses robot control policies that manipulate objects, world models that simulate physical dynamics, video generation systems that produce physically plausible footage, digital twins that mirror real-world processes, and perception systems that reconstruct 3D environments from sensor data.
The defining characteristic of physical AI is that it must respect the constraints of real-world physics. A language model can generate text about objects passing through walls; a physical AI system must understand that solid objects cannot interpenetrate. A language model operates in a discrete token space; physical AI operates in continuous spaces where small perturbations can have large consequences (dropping a glass versus placing it gently). This requirement for physical consistency makes physical AI fundamentally harder than digital AI and creates specific data requirements.
Physical AI models are trained on data that captures real-world physics at the resolution needed for the target application. Robot manipulation policies need synchronized camera images and action labels at 10-50 Hz. World models need diverse video showing how objects, materials, and environments change over time. Perception systems need 3D reconstructions, depth maps, and semantic labels. The common thread is that all physical AI training data must be grounded in reality — synthetic data can supplement but not replace real-world observations because simulators cannot perfectly model the complexity of physical interactions.
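To make "synchronized camera images and action labels at 10-50 Hz" concrete, the sketch below pairs each camera frame with the nearest action label by timestamp, dropping frames with no action within half a control period. The function name and tolerance rule are illustrative, not any particular dataset's convention:

```python
import numpy as np

def synchronize(image_ts, action_ts, actions, rate_hz=30.0):
    """Pair each camera frame with its nearest action label.

    image_ts / action_ts: sorted timestamp arrays in seconds;
    actions: action vectors aligned with action_ts.
    Returns (frame_index, action) pairs, dropping frames whose
    nearest action is more than half a control period away.
    """
    max_gap = 0.5 / rate_hz  # tolerance: half a control cycle
    pairs = []
    for i, t in enumerate(image_ts):
        j = int(np.argmin(np.abs(action_ts - t)))  # nearest-timestamp match
        if abs(action_ts[j] - t) <= max_gap:
            pairs.append((i, actions[j]))
    return pairs
```

At 30 Hz the tolerance is about 17 ms; tightening the rate tightens the pairing, which is why higher-frequency policies need higher-frequency logging.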
The commercial opportunity for physical AI is immense. McKinsey estimates the global market for physical AI-enabled systems (including robotics, autonomous vehicles, and industrial automation) will exceed $200 billion annually by 2030. The bottleneck is not algorithms — the Transformer architecture that powers language models also powers VLA models for robots. The bottleneck is training data: collecting, annotating, and curating the real-world data that physical AI systems need to learn from.
Historical Context
The term "physical AI" gained prominence in 2023-2024, but the underlying research threads extend back decades. Computer vision researchers have studied 3D reconstruction, object physics, and scene understanding since the 1970s. Robotics researchers have developed learned controllers since the 1980s. The novelty of "physical AI" as a framing is the convergence of these fields through foundation models.
NVIDIA CEO Jensen Huang is widely credited with popularizing "physical AI" as a product category during his GTC 2024 keynote, where he announced the Cosmos platform for world foundation models and the GR00T model for humanoid robots. Huang argued that AI's next great challenge is understanding the physical world, and that this requires a new generation of foundation models trained on physical interaction data rather than text.
The foundations were laid by several research threads converging in 2022-2024. Google DeepMind's RT-2 (2023) showed that vision-language models could be fine-tuned for robot control, establishing the VLA paradigm. OpenAI's Sora (2024) demonstrated that video generation models implicitly learn physical dynamics. NVIDIA's Isaac Gym enabled massively parallel physics simulation for robot policy training. Physical Intelligence's pi-zero (2024) combined VLA architectures with flow matching for production-grade manipulation.
The investment cycle followed the terminology. In 2024-2025, over $10 billion in venture capital flowed into physical AI startups: humanoid robots (Figure AI, 1X Technologies, Agility Robotics), manipulation systems (Covariant, Dexterity, Formic), autonomous driving (Waymo, Aurora), and physical AI infrastructure (NVIDIA, Physical Intelligence). This investment has created unprecedented demand for physical AI training data — the raw material that these companies need to build their products.
Practical Implications
For teams building physical AI systems, the data strategy is the most important early decision. The choice of data modalities, collection scale, environment diversity, and annotation layers determines what the model can and cannot learn.
Physical AI data falls into three categories by collection difficulty and cost. Passive video (recording scenes without robot interaction) is the cheapest to collect at scale — any camera can capture it. But passive video provides observation data without action labels, useful for pretraining vision encoders and world models but insufficient for learning robot control. Human demonstration data (egocentric video of people performing tasks) provides observation and implicit action information, useful for pretraining visuomotor representations at moderate cost. Robot teleoperation data (observations paired with explicit robot actions) is the most expensive but provides the exact training signal that manipulation policies need.
The practical approach that most teams adopt is a data pyramid. At the base, large volumes (10,000+ hours) of passive and human demonstration video pretrain the visual backbone. In the middle, moderate volumes (1,000-10,000 episodes) of cross-embodiment robot data pretrain the policy backbone. At the top, smaller volumes (1,000-5,000 episodes) of on-hardware teleoperation data fine-tune for the specific deployment scenario.
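The three tiers above can be written down as a simple data-budget plan. The tier names and volumes here are illustrative values drawn from the ranges in the text, not a prescription:

```python
from dataclasses import dataclass

@dataclass
class DataTier:
    name: str
    unit: str              # "hours" or "episodes"
    volume: int            # planned collection volume
    provides_actions: bool # whether the data carries explicit action labels

# Illustrative budgets within the ranges stated above.
pyramid = [
    DataTier("passive_and_human_video", "hours", 10_000, provides_actions=False),
    DataTier("cross_embodiment_robot", "episodes", 5_000, provides_actions=True),
    DataTier("on_hardware_teleoperation", "episodes", 2_000, provides_actions=True),
]
```

Writing the plan this way makes the key asymmetry explicit: only the two upper tiers carry the action labels that policy learning requires.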
Environment diversity is often underestimated. Physical AI systems that train in a single lab environment achieve impressive demo results but fail when deployed in a different setting with different lighting, surfaces, and object arrangements. Production systems need training data from dozens to hundreds of distinct environments. This is fundamentally a logistics problem: data collection must be geographically distributed, which requires a global network of collection sites and trained operators.
Common Misconceptions
Physical AI is just robotics with a new name.
Robotics traditionally focuses on the engineering of physical machines — mechanical design, control theory, sensor integration. Physical AI focuses on the intelligence component: learning to understand and predict physical phenomena from data. Physical AI includes applications beyond robotics: video generation models that simulate physics, digital twins that model industrial processes, and AR systems that overlay virtual objects with physically correct behavior. Robotics is the most visible application of physical AI but not the only one.
Foundation models trained on internet text understand physics well enough for physical AI.
Language models have approximate commonsense knowledge of physics ('heavy objects are hard to lift', 'glass breaks when dropped') but lack the quantitative understanding needed for physical AI. They cannot predict whether a specific object will tip over given a specific force, or compute the trajectory needed to throw a ball to a specific location. Physical AI requires models trained on physical interaction data, not text descriptions of physics.
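The ball-throwing example can be made quantitative. On level ground and ignoring air resistance, the range equation R = v² sin(2θ) / g gives the launch speed needed for a given distance and angle — exactly the kind of numeric prediction a text-trained model cannot reliably produce. A minimal sketch under those simplifying assumptions:

```python
import math

def launch_speed(distance_m, angle_deg, g=9.81):
    """Speed (m/s) needed to hit a target distance_m away on level
    ground when launched at angle_deg, ignoring air resistance.
    Rearranged from the range equation R = v^2 * sin(2*theta) / g."""
    theta = math.radians(angle_deg)
    return math.sqrt(distance_m * g / math.sin(2 * theta))

# A 10 m throw at 45 degrees needs roughly 9.9 m/s.
speed = launch_speed(10, 45)
```

A real robot would additionally need to account for drag, release height, and its own actuation limits — which is precisely why interaction data, not text, is required.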
NVIDIA's simulation tools will eliminate the need for real-world physical AI data.
NVIDIA Isaac Sim and Cosmos are powerful simulation platforms, but they face the fundamental sim-to-real gap: simulated physics cannot perfectly reproduce real-world contact, deformation, and material behavior. NVIDIA themselves invest in real-world data collection (through the GR00T program) alongside their simulation tools. The industry consensus is that simulation reduces but does not eliminate the need for real-world data. The most capable physical AI systems will combine both.
Key Papers
- [1] Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” CoRL, 2023.
- [2] Black et al. “pi-zero: A Vision-Language-Action Flow Model for General Robot Control.” arXiv:2410.24164, 2024.
- [3] NVIDIA. “GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.” NVIDIA Technical Report, 2025.
How Claru Supports This
Claru is a physical AI data company. We provide the three layers of the training data pyramid: large-scale egocentric video for visual pretraining, cross-environment human demonstration data for representation learning, and on-hardware teleoperation data for policy fine-tuning.
Our positioning within the physical AI ecosystem is as the data infrastructure layer. While model companies (NVIDIA, Physical Intelligence, Figure AI) build architectures and hardware, and simulation companies build virtual environments, Claru provides the real-world data that these systems cannot learn without. With 386,000+ annotated clips, 10,000+ trained collectors, and coverage across 100+ cities and 12+ environment types, Claru delivers physical AI training data at production scale.
Physical AI at a Glance
The Physical AI Data Pyramid
Physical AI data falls into three layers organized by collection difficulty, cost, and training value. At the base of the pyramid: passive video — recording scenes without robot interaction — is the cheapest to collect at scale. Any camera can capture it. Passive video provides observation data without action labels, useful for pretraining vision encoders and world models but insufficient for learning robot control directly. Internet video, egocentric recordings, and surveillance footage all fall in this category.
In the middle of the pyramid: human demonstration data — egocentric video of people performing tasks — provides observation and implicit action information. When a human picks up a cup on camera, the video captures the visual trajectory of the hand, the object, and the scene change. This data is more expensive to collect than passive video (you need a camera-wearing person performing specific tasks) but less expensive than robot data. It is particularly valuable for pretraining visuomotor representations at moderate cost.
At the top: robot teleoperation data — synchronized observations paired with explicit robot actions — is the most expensive but provides the exact training signal that manipulation policies need. Each demonstration requires a physical robot, a skilled teleoperator, and the target objects and environment. This data directly trains the action prediction models that control robots. The practical approach that most teams adopt is to build all three layers: large volumes of passive video at the base, moderate volumes of human demonstration data in the middle, and targeted robot teleoperation data at the top.
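A teleoperation episode at the top of the pyramid is, in essence, a time-ordered sequence of observation-action pairs plus a task label. A minimal schema sketch with illustrative field names (not any specific company's format):

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class TeleopStep:
    timestamp: float   # seconds since episode start
    image: np.ndarray  # camera frame, e.g. shape (H, W, 3)
    action: np.ndarray # commanded joint positions or velocities

@dataclass
class Episode:
    task: str                                # natural-language task label
    steps: list = field(default_factory=list)  # ordered TeleopStep records
```

Everything a manipulation policy learns from — the mapping from what the robot sees to what the operator commanded — lives in these paired records, which is why this tier is both the most expensive and the most valuable.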
Physical AI vs. Digital AI
How physical AI differs from language and image AI in its data requirements and challenges.
| Dimension | Physical AI | Language AI (LLMs) | Image AI (Diffusion) |
|---|---|---|---|
| Data Source | Real-world sensors + robot demos | Internet text | Internet images + captions |
| Data Scale | Thousands of hours (expensive) | Trillions of tokens (cheap) | Billions of images (cheap) |
| Actions | Continuous physical actions | Discrete token prediction | Pixel generation |
| Error Cost | Physical damage, safety risk | Incorrect text output | Incorrect image |
| Sim-to-Real Gap | Fundamental challenge | N/A | N/A |
| Environmental Diversity | Critical (homes, factories vary) | Naturally diverse online | Naturally diverse online |
The Physical AI Ecosystem: Key Companies and Models
NVIDIA has positioned itself as the infrastructure layer for physical AI through three platforms: Isaac Sim for physics simulation, Cosmos for world foundation models trained on physical interaction video, and GR00T N1 for humanoid robot control. Jensen Huang's framing of physical AI as a product category catalyzed industry investment and established NVIDIA's GPU and simulation platforms as the default development environment for the field.
Google DeepMind's robotics research established the VLA (Vision-Language-Action) paradigm with RT-2, demonstrating that large vision-language models can be co-trained on web data and robot demonstrations to produce models with emergent manipulation reasoning. DeepMind's partnership with Boston Dynamics extends this approach to the most mechanically capable humanoid platform. Physical Intelligence (founded by former Google researchers) built pi-zero, a VLA with flow-matching action prediction that targets production-grade manipulation.
On the humanoid side, Figure AI ($2.6B funded, OpenAI partnership), 1X Technologies (data-centric approach, $20K consumer humanoid), Agility Robotics (Digit in Amazon warehouses), and Tesla (Optimus) are all building physical AI systems that require massive training data investments. The common thread across all these companies is that model architecture is largely converged (Transformers + diffusion/flow for actions) and the competitive differentiator is training data quality, diversity, and scale.
Why Physical AI Needs Real-World Data
The sim-to-real gap is the fundamental data challenge for physical AI. Simulated environments — even state-of-the-art platforms like NVIDIA Isaac Sim and MuJoCo — cannot perfectly model the complexity of real-world physical interactions. Contact dynamics between a robot fingertip and a real object depend on surface roughness, material compliance, moisture, temperature, and wear patterns that no simulator captures at sufficient fidelity. Deformable objects (fabric, food, paper) are especially difficult to simulate because their behavior depends on material properties that vary across instances.
NVIDIA and other simulation providers invest in improving simulation fidelity, but the gap is fundamental rather than a matter of engineering. Real-world physics is governed by partial differential equations with parameters that vary continuously across materials and conditions. Simulation discretizes these equations on a finite grid with approximate material parameters. The approximation error propagates through learned policies, causing failures when the policy encounters real-world conditions that fall outside the simulator's approximation envelope.
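One common way to quantify the sim-to-real gap is to roll out the same policy in simulation and on hardware and compare the resulting state trajectories. A minimal sketch using mean per-step Euclidean error as the discrepancy metric (one of several reasonable choices):

```python
import numpy as np

def sim_to_real_gap(sim_traj, real_traj):
    """Mean per-step Euclidean distance between a simulated and a real
    rollout of the same policy. Trajectories are [T, state_dim] arrays;
    the comparison truncates to the shorter rollout."""
    T = min(len(sim_traj), len(real_traj))
    step_errors = np.linalg.norm(sim_traj[:T] - real_traj[:T], axis=1)
    return float(np.mean(step_errors))
```

Tracking this number across policy iterations shows whether added real-world fine-tuning data is actually closing the gap.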
This is why all leading physical AI companies — from NVIDIA to Physical Intelligence to Figure AI — invest in real-world data collection alongside simulation. The industry consensus is that simulation reduces the amount of real-world data needed but does not eliminate the need entirely. The most capable physical AI systems use a combination: simulation for pretraining and policy search, real-world data for fine-tuning and validation. Claru exists to provide the real-world layer of this data stack at production scale.
Frequently Asked Questions
What is the difference between physical AI and embodied AI?
Physical AI is the broader category — it includes any AI system that understands or models the physical world, whether or not it has a body. A video generation model that accurately simulates physics is physical AI but not embodied AI. A robot policy that controls a physical arm is both physical AI and embodied AI. Physical AI also encompasses physics simulators, world models, digital twins, and AR/VR systems that model real-world physics. Embodied AI is specifically about agents with physical bodies.
How does physical AI differ from language AI, and why is it harder?
Language AI (LLMs) operates on text, which is an abundant, easily collected, and well-structured data modality. Physical AI must learn from video, sensor data, and physical interactions, which are orders of magnitude harder to collect, annotate, and learn from. The physical world has continuous dynamics, irreversible actions, and safety constraints absent in text. Solving physical AI would enable robots to work in homes, factories, and hospitals: applications with larger economic impact than text generation, but ones that pose fundamentally harder data and learning challenges.
Which companies and models are leading physical AI?
NVIDIA (Isaac platform, GR00T humanoid model, Cosmos world model), Google DeepMind (RT-2, Genie 2, robotics research), Physical Intelligence (pi-zero VLA), Figure AI (humanoid robots), Tesla (Optimus), Boston Dynamics (Atlas), Agility Robotics (Digit), Covariant (warehouse manipulation), and numerous startups. Major investment from SoftBank, Microsoft, Amazon, and venture funds signals that physical AI is one of the largest technology investment areas of 2025-2026.
Building Physical AI?
Claru provides the real-world training data that physical AI models require: egocentric video, manipulation trajectories, and expert annotations across diverse environments.