
Bezos Project Prometheus $10B Physical AI Infrastructure 2026


Jeff Bezos's reported $10B Project Prometheus initiative targets the infrastructure layers — data, simulation, evaluation — that physical AI and robotics foundation models still lack, signaling a platform play capitalized at two to three times GPT-4's estimated total training cost.

TL;DR

  • Jeff Bezos's reported ~$10B "Project Prometheus" is a bet that the binding constraint on physical AI isn't model architecture — it's data, simulation, and evaluation infrastructure.

  • Li et al. (arXiv:2505.20503) scaled robot manipulation data from 100K to 970K demonstrations and saw success rates jump by 38 percentage points. Data volume, not model size, is the bottleneck.

  • Zhang et al. (arXiv:2503.20020) show that dense reward signals from VLM-generated subgoals can replace millions of environment interactions, compressing sim-to-real transfer timelines by roughly 5×.

  • Whoever controls the physical AI data and evaluation stack will shape which embodied AI models can actually be trained and deployed.

What Project Prometheus actually is

Project Prometheus is a Jeff Bezos-backed initiative, reported by Reuters, The Information, and Bloomberg, that is assembling around $10B to build infrastructure for physical AI including robotics foundation models, simulation environments, and real-world data collection at industrial scale. No peer-reviewed paper has been published under the Prometheus name as of mid-2025. What is publicly known: Bezos has personally invested in Figure AI (a $675M raise at a $2.6B valuation in early 2024, according to Bloomberg) and Physical Intelligence (a $400M raise at a $2.4B pre-money valuation, according to The Information). Prometheus appears to sit above these individual portfolio bets, targeting the shared infrastructure layer that every embodied AI company needs but nobody has built at sufficient scale.

The $10B figure is worth contextualizing. OpenAI reportedly spent $3–5B on GPT-4 training compute and data, according to estimates cited by The Information. A physical AI infrastructure play capitalized at $10B signals how large the gap is between where embodied models are today and where they need to be. This isn't a research grant — it's a bet that the infrastructure layer itself is the moat.

The infrastructure gap physical AI can't paper over

Large language models had three prerequisites before scaling laws kicked in: massive curated text corpora (Common Crawl, The Pile, RedPajama), cheap standardized compute (GPU clusters), and well-understood evaluation benchmarks (MMLU, HellaSwag, HumanEval). Physical AI in 2025 has none of these at equivalent maturity.

Data. There is no Common Crawl for robot manipulation. The largest open robot datasets — Open X-Embodiment at approximately 1M trajectories across 22 robot embodiments (Padalkar et al., arXiv:2310.08864) and DROID at approximately 76K episodes (Khazatsky et al., arXiv:2403.12945) — are orders of magnitude smaller than text corpora and fragmented across incompatible formats, morphologies, and task definitions. As we cover in our overview of embodied AI datasets, the heterogeneity problem is at least as severe as the volume problem.
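The heterogeneity problem can be made concrete with a sketch. The adapters below map two invented record layouts — loosely inspired by the fact that Open X-Embodiment spans many embodiments while DROID is single-platform — onto one common schema; every field name here is hypothetical, not any dataset's published format.

```python
# Hypothetical sketch of dataset fragmentation: two demonstration records that
# describe the same kind of data under incompatible field names and action
# spaces. All schemas and keys are invented for illustration.

def normalize_oxe_style(record: dict) -> dict:
    """Map a multi-embodiment-style record to a common schema (illustrative)."""
    return {
        "embodiment": record["robot_type"],
        "instruction": record["language_instruction"],
        "actions": record["action"],           # e.g. end-effector deltas
        "success": record["is_terminal"] and record["reward"] > 0,
    }

def normalize_droid_style(record: dict) -> dict:
    """Map a single-embodiment-style record to the same schema (illustrative)."""
    return {
        "embodiment": "franka_panda",          # fixed platform, so it's implicit
        "instruction": record["task"],
        "actions": record["joint_velocities"], # a different action space entirely
        "success": record["outcome"] == "success",
    }

# Both normalizers emit the same keys, so downstream training code sees one format:
a = normalize_oxe_style({"robot_type": "widowx", "language_instruction": "pick cup",
                         "action": [[0.1, 0.0]], "is_terminal": True, "reward": 1.0})
b = normalize_droid_style({"task": "pick cup", "joint_velocities": [[0.2, 0.1]],
                           "outcome": "success"})
assert set(a) == set(b)
```

Writing and maintaining such adapters per dataset, per embodiment, per action space is exactly the unglamorous integration work that an infrastructure play would centralize.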

Simulation. NVIDIA Isaac Sim and MuJoCo provide physics engines, but no unified sim-to-real pipeline exists that a frontier lab can plug into and get reliable transfer. The sim-to-real gap remains domain-specific and manually tuned.

Evaluation. There is no MMLU for robot manipulation. Success rate on a specific task rig is the standard metric, which means results from one lab aren't directly comparable to another's.

Prometheus, if the reporting is accurate, is an attempt to build all three layers at once — the kind of vertically integrated infrastructure play that creates platform lock-in.

Data volume as the binding constraint

The empirical case for data-as-bottleneck is now strong. Li et al. (arXiv:2505.20503) ran what is arguably the most systematic scaling study for robot manipulation to date.


| Dataset size (demonstrations) | Mean success rate (multi-task manipulation) | Δ vs. 100K baseline |
|---|---|---|
| 100K | ~42% | baseline |
| 300K | ~58% | +16 pp |
| 970K | ~80% | +38 pp |

Li et al. (arXiv:2505.20503) found that scaling from 100K to 970K real demonstrations improved multi-task manipulation success rates by up to 38 percentage points, with log-linear scaling behavior persisting across the entire data range tested. The models used diffusion policy architectures with vision-language conditioning — architectures most frontier labs already have access to. What those labs generally don't have is 970K demonstrations.
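The log-linear claim can be sanity-checked with a least-squares fit of success rate against log10(dataset size) using the three reported points. This is a back-of-envelope sketch of the numbers cited above, not the paper's own analysis.

```python
import math

# Reported (dataset size, success rate %) points from the table above.
points = [(100_000, 42.0), (300_000, 58.0), (970_000, 80.0)]

# Ordinary least squares for: success ≈ a + b * log10(demos), stdlib only.
xs = [math.log10(n) for n, _ in points]
ys = [s for _, s in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

print(f"fit: success ≈ {a:.1f} + {b:.1f} * log10(demos)")
# The fitted slope is roughly +38 points of success per 10x more data over
# this range. Note the trend must eventually saturate (extrapolating to 10M
# demos predicts >100%); the paper only reports no saturation through 970K.
```

The slope is the strategically interesting number: if another ~38-point gain requires another 10× of data, the next increment is roughly 10M demonstrations, which is far beyond any open dataset today.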

This maps onto the LLM scaling story. Hoffmann et al.'s Chinchilla study (2022) showed that most LLMs were undertrained relative to their data budgets. The Li et al. data suggests most robot foundation models are underdata'd relative to their parameter counts. A 1B-parameter VLA trained on 50K demonstrations is almost certainly not compute-limited — it's data-limited.

For labs building VLA models, the implication is blunt: architecture improvements will hit diminishing returns until the training data pipeline scales by 10–100×. Our analysis of VLA training data volume requirements reaches similar conclusions from a different angle. This is the gap a $10B infrastructure initiative would target.

Real-world data collection at the 970K+ scale requires systematic operator networks, diverse environments, and consistent annotation protocols. Claru operates a managed network of over 10,000 data collectors capturing egocentric video and manipulation demonstrations across thousands of real-world environments — the kind of distributed collection infrastructure that turns a scaling curve from theory into a training pipeline, and that any Prometheus-scale effort will need to either build or acquire.

Simulation and reward: the missing middle layer

Real-world data alone isn't sufficient. Robots need to explore novel configurations, and real-world exploration is slow, expensive, and occasionally destructive.

Zhang et al. (arXiv:2503.20020) from Tsinghua University and Shanghai AI Lab used vision-language models to decompose long-horizon manipulation tasks into subgoals, then generated dense reward signals from those subgoals for reinforcement learning in simulation. Their VLM-guided reward shaping achieved 50–60% success rates on multi-step manipulation tasks where sparse-reward baselines scored under 10%, according to the paper's experimental results. The approach also cut required environment interactions by roughly 5× compared to curriculum learning baselines.
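The sparse-vs-dense contrast can be sketched abstractly. This is not Zhang et al.'s implementation — the boolean subgoal checks below are stand-ins for predicates a VLM would ground in the current observation.

```python
# Illustrative contrast between a sparse terminal reward and a dense reward
# assembled from VLM-proposed subgoals. The subgoal flags are stand-ins for
# visually grounded predicates; the scheme itself is a simplified sketch.

def sparse_reward(task_done: bool) -> float:
    """Reward only on full task completion -- most rollouts see exactly 0."""
    return 1.0 if task_done else 0.0

def dense_subgoal_reward(subgoals_done: list[bool]) -> float:
    """Partial credit for the longest completed prefix of ordered subgoals.

    Progress toward the task produces a learning signal long before the
    final state is reached, which is what makes RL tractable on
    long-horizon manipulation.
    """
    done = 0
    for ok in subgoals_done:
        if not ok:
            break
        done += 1
    return done / len(subgoals_done)

# A rollout that grasps the drawer handle and opens it, but never places the
# object inside, still gets a gradient-worthy signal under the dense scheme:
progress = [True, True, False]   # [grasped, opened, placed]
assert sparse_reward(all(progress)) == 0.0
assert dense_subgoal_reward(progress) == 2 / 3
```

Under sparse rewards, an agent that fails the final step looks identical to one that never touched the object; dense subgoal credit breaks that degeneracy, which is the intuition behind the sub-10% vs. 50–60% gap reported above.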

This matters for the Prometheus thesis because the simulation layer and the data layer are tightly coupled, not independent. VLM-generated rewards are only as good as the VLM's visual grounding, which itself depends on large-scale visual data from real-world manipulation. The dependency loop runs as follows:

  • Real-world data trains VLMs with manipulation-relevant visual grounding

  • VLM-guided reward shaping enables efficient RL in simulation

  • Simulated experience augments real-world data for policy training

  • Deployed policies generate new real-world data

Whoever controls step 1 — real-world training data for robotics — has leverage over every subsequent step.
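The compounding structure of the loop can be sketched as a toy flywheel. Every multiplier below is an arbitrary placeholder — the point is only the shape: each stage is gated by the real-world data stock, so a head start in step 1 compounds.

```python
# Toy sketch of the four-step data flywheel described above. Multipliers and
# thresholds are arbitrary placeholders; only the structure -- each stage
# gated by the real-world data stock -- is the point.

def run_flywheel(real_demos: float, cycles: int) -> float:
    for _ in range(cycles):
        vlm_grounding = min(1.0, real_demos / 1_000_000)    # step 1: data -> VLM quality
        sim_experience = real_demos * 5 * vlm_grounding     # step 2: VLM rewards unlock sim RL
        policy_quality = min(1.0, (real_demos + 0.2 * sim_experience) / 2_000_000)  # step 3
        real_demos += 50_000 * policy_quality               # step 4: deployment yields new data
    return real_demos

# An entity starting with more real-world data compounds faster, because every
# downstream stage is throttled by step 1:
assert run_flywheel(500_000, 10) > run_flywheel(50_000, 10)
```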

No public evidence suggests any single company runs this full loop at scale today. Physical Intelligence has demonstrated impressive zero-shot generalization with their π0 model but hasn't disclosed their data collection infrastructure in detail. Google DeepMind's RT-2 and RT-X efforts relied on internal fleet data from Everyday Robots, a division Google shut down in 2023. The physical AI stack remains fragmented.

Who controls the stack matters

Project Prometheus's reported $10B valuation becomes strategically interesting when you consider the structural differences between physical AI data and LLM training data.

In the LLM era, data moats turned out to be weaker than expected — web text is abundant and largely undifferentiated, and synthetic data generation partially decoupled model quality from unique data access. Physical AI is structurally different for three reasons:

  • Real-world manipulation data is expensive to collect. Every demonstration requires a physical environment, a human operator, and calibrated sensors. You can't scrape it from the internet.

  • Embodiment-specific data doesn't transfer freely. A demonstration collected on a Franka Panda arm transfers poorly to a bimanual setup. Cross-embodiment transfer exists — Open X-Embodiment (Padalkar et al., arXiv:2310.08864) demonstrated this — but at a significant performance penalty.

  • Evaluation data is even harder. Standardized test scenarios require physical reproducibility, which requires standardized hardware and environments.

An entity that controls (a) a large proprietary dataset of real-world manipulation demonstrations, (b) a simulation stack tightly calibrated to those demonstrations, and (c) a standardized evaluation protocol occupies a position analogous to AWS for cloud computing in 2008 — not the only option, but the default option that shapes how everyone else builds.

The risk for frontier labs is dependency. If Prometheus becomes the primary source of training data and sim-to-real pipelines, labs building embodied models are renting their most critical input. Labs that controlled their own data pipelines — OpenAI with web-scale text, DeepMind with game environments — have tended to outperform those that relied on external data providers.


| Infrastructure layer | LLM analog | Physical AI status (2025) | Prometheus potential role |
|---|---|---|---|
| Training data | Common Crawl, The Pile | Open X-Embodiment, DROID (fragmented, small) | Proprietary large-scale collection |
| Simulation | N/A (not needed for text) | NVIDIA Isaac, MuJoCo (uncalibrated to real) | Integrated sim-to-real pipeline |
| Evaluation | MMLU, HumanEval | No standard exists | Standardized benchmark suite |
| Compute | H100 clusters | Same hardware + robot hardware | End-to-end training infrastructure |
| Reward/annotation | RLHF providers (Scale AI) | Manual, expensive, unscalable | VLM-guided automated annotation |

The data enrichment pipeline — converting raw demonstrations into labeled, reward-annotated, sim-compatible training data — is the layer with the least competition and the most lock-in potential.
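One way to picture that enrichment layer is the record it would produce per demonstration. The schema below is entirely hypothetical — every field name is invented for illustration, not drawn from any published format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one enriched demonstration record: the output of a
# raw-demo -> labeled, reward-annotated, sim-compatible pipeline like the one
# described above. All field names are invented for illustration.

@dataclass
class EnrichedDemo:
    # Raw capture
    video_uri: str                        # egocentric video of the demonstration
    proprioception: list[list[float]]     # joint states per timestep
    embodiment: str                       # e.g. "franka_panda", "bimanual_aloha"
    # Labeling
    instruction: str                      # natural-language task description
    subgoals: list[str] = field(default_factory=list)      # VLM-proposed steps
    # Reward annotation
    per_step_reward: list[float] = field(default_factory=list)
    # Sim compatibility
    scene_assets: list[str] = field(default_factory=list)  # object meshes for sim replay
    calibrated: bool = False              # physics params matched to the real rollout

demo = EnrichedDemo(
    video_uri="s3://bucket/demo_0001.mp4",
    proprioception=[[0.0] * 7],
    embodiment="franka_panda",
    instruction="open the top drawer",
)
assert not demo.calibrated  # raw demos enter the pipeline uncalibrated
```

Each group of fields maps to a layer in the table above: capture is the data layer, subgoals and per-step rewards come from VLM-guided annotation, and scene assets plus calibration are what make the record consumable by the simulation stack.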

My read: the $10B valuation isn't about any single model or robot. It's about controlling the substrate on which all physical AI models get trained. Labs that don't invest in independent data and evaluation infrastructure now may find themselves negotiating from a weak position within 2–3 years.

Key takeaways

  • Li et al. (arXiv:2505.20503) show that scaling robot manipulation demonstrations from 100K to 970K yields up to 38 percentage-point improvements in success rate. Data volume is the primary bottleneck for embodied foundation models.

  • Zhang et al. (arXiv:2503.20020) show that VLM-guided dense reward shaping reduces required simulation interactions by roughly 5× compared to curriculum learning baselines for multi-step manipulation.

  • Project Prometheus's reported $10B capitalization is two to three times OpenAI's estimated GPT-4 training investment of $3–5B (per The Information). Physical AI infrastructure is being treated as a platform play.

  • No single entity currently runs the full data → simulation → evaluation → deployment loop at scale for general-purpose manipulation. This gap is what Prometheus appears designed to fill.

  • Real-world data quality and VLM-guided simulation reward are tightly coupled. Controlling the data layer gives leverage over the entire physical AI training stack.

  • Frontier labs building embodied models without independent data and evaluation infrastructure face structural dependency risks within 2–3 years.

  • Cross-embodiment transfer remains lossy, as shown by Open X-Embodiment (Padalkar et al., arXiv:2310.08864). Embodiment-specific data collection infrastructure is a durable competitive advantage.

FAQ

What is Bezos Project Prometheus physical AI?

Project Prometheus is a reported Jeff Bezos-backed initiative, capitalized at around $10B according to Reuters, The Information, and Bloomberg, aimed at building infrastructure for physical AI. The initiative covers robotics foundation models, simulation environments, and large-scale real-world data collection. It sits above Bezos's individual investments in Figure AI ($675M raise at $2.6B valuation, per Bloomberg) and Physical Intelligence ($400M at $2.4B pre-money valuation, per The Information), targeting the shared data, simulation, and evaluation layers that all embodied AI companies need. No peer-reviewed publications or official announcements have been made under the Prometheus name as of mid-2025. The strategic logic parallels AWS's early cloud infrastructure play: own the substrate that others build on.

How much robot training data do you need for foundation models?

Li et al. (arXiv:2505.20503) provide the most systematic answer available: scaling from 100K to 970K real-world manipulation demonstrations improved multi-task success rates by up to 38 percentage points, with log-linear scaling persisting through the tested range. Most current robot foundation models train on 10K to 100K demonstrations, which means they are significantly data-limited. The scaling curves in the Li et al. study show no sign of saturation at 970K, implying that million-scale datasets may still be undertrained. The largest open datasets — Open X-Embodiment at approximately 1M trajectories (Padalkar et al., arXiv:2310.08864) and DROID at approximately 76K episodes (Khazatsky et al., arXiv:2403.12945) — sit at or below this threshold and are fragmented across incompatible embodiments and task definitions.

Why does physical AI need different infrastructure than LLMs?

Physical AI requires fundamentally different infrastructure than LLMs for three structural reasons. First, training data cannot be scraped from the internet — every robot demonstration requires a physical environment, a human operator, and calibrated sensors, making data collection roughly 100–1,000× more expensive per token-equivalent than web text. Second, simulation is required but not sufficient: Zhang et al. (arXiv:2503.20020) show that sim-to-real transfer depends on dense reward signals derived from VLMs trained on real-world visual data, creating a circular dependency between the data and simulation layers. Third, evaluation cannot be done on a benchmark leaderboard — it requires physical test rigs with standardized objects, environments, and success criteria, and no widely accepted standard exists as of 2025. These three gaps are what make physical AI infrastructure a $10B platform opportunity.

Who competes with Project Prometheus in physical AI infrastructure?

NVIDIA is the most visible competitor, with Isaac Sim for physics simulation and Omniverse for digital twin construction. Google DeepMind has internal fleet data from its RT-X program but shut down its Everyday Robots division in 2023. Physical Intelligence and Figure AI are building their own data pipelines but have not disclosed collection scale. Toyota Research Institute runs a notable bimanual manipulation data collection effort. On the data layer specifically, providers like Claru operate distributed physical AI training data collection networks with over 10,000 operators. No single entity currently controls more than one layer of the full stack (data + simulation + evaluation + compute), which is what makes Prometheus's reported vertical integration ambition unusual.

Will Project Prometheus create vendor lock-in for robotics AI?

The structural conditions for vendor lock-in are present. Real-world manipulation data is expensive to duplicate (every demonstration requires physical infrastructure), sim-to-real calibration is data-dependent, and evaluation standards tend to be set by whoever publishes them first. If Prometheus builds a tightly integrated data-simulation-evaluation stack, labs using it will face switching costs similar to those experienced by early AWS customers. The main counterforce is the open-source community around Open X-Embodiment, DROID, and Hugging Face's LeRobot project, but these efforts are 1–2 orders of magnitude smaller than what the Li et al. (arXiv:2505.20503) scaling curves suggest is needed. Labs that want to avoid dependency should invest in independent data collection and evaluation infrastructure now, before a dominant platform emerges.

Related resources

  • Physical AI Training Data — how real-world manipulation datasets are collected and structured for foundation model training


  • Glossary — definitions of key terms used in this post including VLA, sim-to-real transfer, and embodied AI
