Diffusion Transformer (DiT): Scalable Denoising for Video Generation and Robot Control
The Diffusion Transformer replaces the U-Net backbone in diffusion models with a standard Transformer architecture, unlocking the same scaling laws that made GPT and ViT successful. DiT architectures now power Sora, Stable Diffusion 3, and emerging robot control systems like pi-zero — generating everything from photorealistic video to multi-modal robot action trajectories.
What Is Diffusion Transformer (DiT)?
The Diffusion Transformer (DiT) is a neural network architecture that replaces the U-Net backbone traditionally used in denoising diffusion models with a Transformer backbone. Introduced by Peebles and Xie (2023), DiT processes inputs by patchifying them into token sequences and applying a stack of Transformer blocks with self-attention and feed-forward layers — the same architecture used in Vision Transformers (ViT) and GPT. Conditioning on the diffusion timestep and class labels is achieved through adaptive layer normalization (adaLN-Zero), where the normalization parameters are predicted from the conditioning signal.
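The adaLN-Zero mechanism can be illustrated with a minimal sketch. This is not the paper's implementation, just a pure-Python toy on single token vectors: a small modulation layer maps the conditioning vector to shift, scale, and gate terms, and because that layer is zero-initialized, the gate starts at zero and each block begins training as the identity function.

```python
import math

def layer_norm(x, eps=1e-6):
    # Normalize one token vector to zero mean, unit variance (no learned affine).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight, bias):
    # y = W x + b, with weight as a plain list of rows.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def adaln_zero_sublayer(x, cond, mod_weight, mod_bias, sublayer):
    """One adaLN-Zero-modulated sub-layer (attention or MLP).

    The modulation layer maps the conditioning vector (timestep + class
    embedding) to shift, scale, and gate vectors. With mod_weight and
    mod_bias zero-initialized, gate == 0 and the sub-layer is the
    identity at the start of training.
    """
    d = len(x)
    mod = linear(cond, mod_weight, mod_bias)           # length 3*d
    shift, scale, gate = mod[:d], mod[d:2 * d], mod[2 * d:]
    h = layer_norm(x)
    h = [hv * (1 + s) + sh for hv, s, sh in zip(h, scale, shift)]
    h = sublayer(h)
    # Gated residual: zero gate means the input passes through unchanged.
    return [xv + g * hv for xv, g, hv in zip(x, gate, h)]

# Zero-init check: the block reduces to the identity.
d, dc = 4, 3
x = [0.5, -1.0, 2.0, 0.25]
cond = [1.0, 0.0, -1.0]
w0 = [[0.0] * dc for _ in range(3 * d)]
b0 = [0.0] * (3 * d)
out = adaln_zero_sublayer(x, cond, w0, b0, sublayer=lambda h: h)
assert out == x
```

The zero-initialized gate is the "Zero" in adaLN-Zero: deep stacks of these blocks start as identity maps, which stabilizes early training.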
The motivation for DiT is scaling. U-Nets have an irregular architecture with asymmetric encoder/decoder paths, skip connections at multiple resolutions, and a mix of convolutions and attention at different scales. This complexity makes it difficult to predict how performance will change as the model grows. Transformers, by contrast, have uniform repeated blocks with well-characterized scaling behavior — loss decreases as a smooth power law of model size and training data (Kaplan et al., 2020). DiT inherits this predictability: Peebles and Xie showed that FID on ImageNet improved monotonically from 68.4 (DiT-S, 33M params) to 2.27 (DiT-XL/2, 675M params) with no irregular jumps.
For video generation, DiT scales to the temporal dimension by treating video frames as additional spatial patches in the token sequence. A video of T frames at resolution HxW with patch size P produces T*H*W/P^2 tokens, processed by the same Transformer blocks. This unified spatial-temporal attention, or factored variants (spatial attention and temporal attention in alternating layers), powers commercial video generation: OpenAI's Sora is reported to use a DiT-class architecture, and systems such as Runway Gen-3 are widely believed to follow similar designs.
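The token-count formula above determines sequence length, and therefore attention cost. A small sketch with illustrative (assumed) numbers: 16 latent frames at 32x32 with 2x2 patches.

```python
def video_token_count(t_frames, height, width, patch):
    # Tokens after patchifying each frame into patch x patch squares:
    # T * H * W / P^2, assuming H and W divide evenly by P.
    assert height % patch == 0 and width % patch == 0
    return t_frames * (height // patch) * (width // patch)

# 16 latent frames at 32x32 with 2x2 patches -> 16 * 16 * 16 = 4096 tokens.
assert video_token_count(16, 32, 32, 2) == 4096
# Doubling the patch size quarters the per-frame token count (and
# attention cost), at the cost of coarser spatial detail.
assert video_token_count(16, 32, 32, 4) == 1024
```

Because self-attention scales quadratically in sequence length, patch size is the main lever trading compute against spatial fidelity.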
In robotics, DiT serves as the backbone for scalable action generation. Physical Intelligence's pi-zero uses a flow-matching variant of DiT where the action denoising process is modeled as an ordinary differential equation (ODE) rather than a stochastic process, enabling fewer function evaluations (5-10 vs. 100+ for DDPM) and faster inference. The Robotics Diffusion Transformer (RDT, Liu et al., 2024) demonstrated that a 1.2B-parameter DiT trained on multi-robot data from Open X-Embodiment produces broadly capable manipulation policies, with performance improving smoothly as model and data scale increase — among the first demonstrations of Transformer-like scaling laws for robot control.
Historical Context
DiT emerged from the convergence of two lines of research: denoising diffusion models and Vision Transformers. Diffusion models reached mainstream impact with DDPM (Ho et al., 2020) and were scaled commercially using U-Net backbones in Stable Diffusion (Rombach et al., 2022) and DALL-E 2 (Ramesh et al., 2022). Simultaneously, the Vision Transformer (Dosovitskiy et al., 2021) demonstrated that pure Transformer architectures could match or exceed convolutional networks on vision tasks when trained at sufficient scale.
Peebles and Xie (2023) connected these threads with the DiT paper, showing that replacing U-Net with a ViT-like backbone in the diffusion framework produced a cleaner, more scalable architecture. DiT-XL/2 achieved state-of-the-art FID of 2.27 on class-conditional ImageNet 256x256, surpassing all previous U-Net-based models. The key technical contribution was adaLN-Zero conditioning — a parameter-efficient method for injecting timestep and class information that outperformed cross-attention and in-context conditioning.
The impact on video generation was rapid. OpenAI's Sora (2024) scaled DiT to video with spatial-temporal attention over patch tokens from variable-length, variable-resolution video. This architecture could process video as a single sequence of visual tokens, enabling coherent long-duration video generation. Stability AI's Stable Diffusion 3 (Esser et al., 2024) used a multimodal DiT (MMDiT) that jointly processed text and image tokens in a shared Transformer.
The robotics adoption followed. Chi et al.'s original Diffusion Policy (2023) used a U-Net backbone. Liu et al. (2024) introduced RDT, demonstrating that replacing U-Net with a 1.2B DiT improved manipulation performance and enabled multi-robot training. Physical Intelligence's pi-zero (Black et al., 2024) combined DiT with flow matching and vision-language pretraining, creating a generalist robot policy trained on cross-embodiment data. The trajectory is clear: DiT is becoming the standard backbone for both visual generation and robot action generation, unifying these fields under a common architecture.
Practical Implications
For teams building robot learning systems, DiT's practical implications are twofold: it enables scaling to larger models and datasets, but it also increases computational requirements for both training and inference.
On the training side, DiT-based robot policies require GPU infrastructure comparable to medium-scale language model training. A 675M-parameter DiT policy training on 1M demonstration episodes for 500K steps requires approximately 256 A100 GPU-hours. This is 5-10x more compute than a U-Net-based Diffusion Policy of comparable capability on single-task benchmarks. The payoff comes from multi-task and cross-embodiment generalization: the same DiT model can handle dozens of tasks across multiple robots, amortizing the training cost.
Data requirements for DiT are higher in total volume but lower per task. A U-Net Diffusion Policy might need 200 demonstrations for a single task. A DiT policy trained on 50 tasks needs 5,000-10,000 total demonstrations (100-200 per task), but the per-task requirement stays flat or even decreases, because shared visual and motor representations transfer across tasks. The data should be formatted as (observation, action) trajectories with consistent action representations (7D Cartesian delta is standard) and task language annotations for conditioning.
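The trajectory format described above can be sketched as a plain data structure. The field names and values here are hypothetical, chosen only to illustrate the (observation, action) shape with a 7D Cartesian delta action and a language annotation:

```python
# Hypothetical single-step trajectory record; field names are illustrative,
# not a specific dataset schema.
trajectory = {
    "task": "put the red block in the bowl",   # language conditioning
    "steps": [
        {
            "observation": {
                "image": "frame_0000.png",      # camera frame reference
                "proprio": [0.0] * 7,           # end-effector / joint state
            },
            # 7D Cartesian delta: dx, dy, dz, droll, dpitch, dyaw, gripper
            "action": [0.01, 0.0, -0.02, 0.0, 0.0, 0.05, 1.0],
        },
    ],
}

assert len(trajectory["steps"][0]["action"]) == 7
```

Keeping the action representation identical across tasks and robots is what lets a single DiT policy train on the pooled data.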
Inference optimization is critical for deployment. A 675M DiT with 100 DDPM steps is far too slow for real-time control (~0.5 Hz). Three optimization paths exist: (1) DDIM sampling with 10-16 steps (5-10 Hz); (2) flow matching with 5-10 ODE steps, as used by pi-zero (5-15 Hz); (3) consistency distillation to 1-2 steps (15-30 Hz). For tabletop manipulation at 10 Hz control, flow matching is the current best tradeoff between quality and speed. For higher-frequency tasks (20+ Hz), consistency distillation or smaller DiT variants are necessary.
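The flow-matching path can be sketched as a few-step Euler ODE integration. The velocity field below is a toy stand-in for a learned model (rectified-flow paths are straight, so a constant field makes Euler exact here); a real policy would evaluate a DiT conditioned on observations at each step.

```python
def euler_sample(x0, velocity_fn, n_steps):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with
    n_steps Euler steps -- the few-step ODE sampling used by
    flow-matching policies."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "learned" field: straight-line flow from noise to a fixed target
# action (values are made up for illustration).
target = [0.02, -0.01, 0.0, 0.0, 0.0, 0.1, 1.0]   # 7D action
noise = [0.5, -0.3, 0.2, 0.0, -0.1, 0.4, 0.0]
v_const = [t - n for t, n in zip(target, noise)]

out = euler_sample(noise, lambda x, t: v_const, n_steps=5)
assert all(abs(a - b) < 1e-9 for a, b in zip(out, target))
```

Each Euler step costs one model forward pass, which is why 5-10 ODE steps translate directly into the 5-15 Hz rates quoted above, versus 100+ passes for DDPM.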
The data pipeline for DiT training should emphasize diversity: diverse tasks, diverse environments, diverse object categories, and (for cross-embodiment models) diverse robot platforms. Unlike single-task models where demonstration quality is paramount, DiT's large capacity allows it to benefit from noisier data mixed with clean data — the model learns to extract the signal while being robust to quality variation. However, a quality floor still applies: demonstrations with task failures, dropped objects, or operator confusion should be filtered out.
Common Misconceptions
DiT is just a marginal improvement over U-Net for diffusion models.
DiT represents a fundamental architectural shift, not an incremental improvement. The transition from U-Net to Transformer backbone mirrors the CNN-to-Transformer transition in computer vision (ResNet to ViT) and enables the same scaling advantages. DiT-XL/2 at 675M parameters outperformed the best U-Net diffusion models on ImageNet at the time of publication. More importantly, DiT scales predictably — doubling parameters consistently improves quality — while U-Net scaling is unpredictable. This is why leading commercial generative systems (Sora, SD3, Runway Gen-3) have reportedly migrated to DiT-class architectures.
DiT is too computationally expensive for robot learning applications.
While DiT is more expensive than U-Net per training run, the total compute budget depends on the use case. For single-task single-robot policies, U-Net-based Diffusion Policy remains more efficient and should be preferred. For multi-task or cross-embodiment policies — where the goal is a single model that handles many tasks on many robots — DiT's superior scaling means it reaches higher performance per total compute dollar. Additionally, inference optimizations (flow matching, consistency distillation) have brought DiT inference speed to practical real-time control rates of 10-20 Hz on standard GPU hardware.
DiT requires massive datasets that only large companies can afford to collect.
The original DiT paper achieved state-of-the-art results on ImageNet, which is only 1.28M images. RDT achieved strong multi-robot manipulation with the Open X-Embodiment dataset (publicly available). The scaling benefit of DiT is that it can absorb more data if available, not that it requires massive data to function. A 130M-parameter DiT-B model trained on 5,000-10,000 demonstrations provides a strong multi-task baseline that is well within the data budget of most robotics teams. The advantage of DiT emerges most clearly when combining data from multiple sources — in-house demonstrations plus public datasets plus human video — which DiT's capacity handles more gracefully than smaller architectures.
Key Papers
- [1] Peebles & Xie. “Scalable Diffusion Models with Transformers.” ICCV 2023.
- [2] Black et al. “pi0: A Vision-Language-Action Flow Model for General Robot Control.” arXiv:2410.24164, 2024.
- [3] Liu et al. “RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation.” arXiv:2410.07864, 2024.
- [4] Esser et al. “Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.” ICML 2024.
- [5] Ho et al. “Denoising Diffusion Probabilistic Models.” NeurIPS 2020.
How Claru Supports This
Claru provides the diverse, multi-task training data that DiT-based robot policies need to realize their scaling potential. Our demonstration datasets span manipulation tasks across multiple robot platforms (Franka, UR5, xArm, ALOHA) with standardized 7D Cartesian delta actions and natural language task annotations — the exact data format consumed by RDT, pi-zero, and other DiT-based architectures.
The key value Claru adds to DiT training is data diversity at scale. DiT models improve most when training data covers diverse environments, objects, and manipulation strategies. Our collection network of 10,000+ operators in 100+ cities provides the environmental breadth that lab-collected datasets lack. Combined with public cross-embodiment datasets like OXE, Claru's data fills the diversity gap that limits generalization in current DiT robot policies. Datasets ship in RLDS format with quality scores, enabling teams to set their own quality-diversity tradeoff during training data curation.
Frequently Asked Questions
How does DiT differ from traditional diffusion model architectures?
Traditional diffusion models (DDPM, Stable Diffusion 1-2, Imagen) use a U-Net backbone with downsampling/upsampling paths, skip connections, and interleaved convolutions and attention layers. DiT replaces this entire architecture with a sequence of standard Transformer blocks operating on patchified latent tokens. Input images or latents are split into patches (e.g., 2x2 or 4x4), embedded into token sequences, and processed by self-attention and feed-forward layers. The timestep and conditioning information are injected via adaptive layer normalization (adaLN-Zero) rather than cross-attention or concatenation. This architectural simplification brings two advantages: (1) DiT inherits the well-understood scaling behavior of Transformers — doubling parameters consistently improves generation quality — while U-Nets have unpredictable scaling properties; (2) DiT can leverage the massive infrastructure (FlashAttention, tensor parallelism, sequence parallelism) developed for LLM training, making it efficient at billion-parameter scales.
What scaling laws has DiT demonstrated, and how much data does it need?
Peebles and Xie (2023) demonstrated clean scaling laws for DiT on ImageNet: FID score improved monotonically as model size increased from DiT-S (33M params, FID 68.4) through DiT-B (130M), DiT-L (458M), to DiT-XL/2 (675M params, FID 2.27). Larger patch sizes reduced compute at the cost of quality. Critically, the scaling curves showed no sign of plateauing at 675M parameters, suggesting that further scaling would continue to improve quality — a finding that motivated billion-parameter DiT models for video generation. For training data, DiT-XL/2 was trained on ImageNet (1.28M images) for 7M steps. Video DiT models like Sora are reported to train on hundreds of millions of video clips. The data efficiency of DiT is comparable to U-Net models on a per-parameter basis, but the ability to scale to much larger sizes means DiT models ultimately achieve better quality by absorbing more data.
How is DiT applied to robot control?
DiT is applied to robot control by treating action sequences as the data to be denoised. Instead of denoising image patches, the model denoises action tokens conditioned on visual observations and language instructions. Physical Intelligence's pi-zero (Black et al., 2024) uses a DiT-like architecture where the denoising process generates multi-modal robot action sequences (arm position, gripper, base velocity) conditioned on image and language embeddings. The key advantage over U-Net-based Diffusion Policy is scalability: DiT can absorb much larger and more diverse training datasets, enabling cross-task and cross-embodiment generalization that smaller architectures cannot achieve. RDT (Robotics Diffusion Transformer, Liu et al., 2024) demonstrated that a 1.2B-parameter DiT trained on multi-robot data produces more capable and general manipulation policies than smaller diffusion models, with performance improving smoothly as parameters and data increase.
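The training objective behind this can be sketched in a few lines. This is a generic DDPM-style noise-prediction loss on an action chunk, not any specific system's code; a real policy would replace the toy predictors with a DiT conditioned on observations and language.

```python
import math
import random

def corrupt(actions, alpha_bar_t, noise):
    # Forward diffusion on a clean action chunk:
    # x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
    return [math.sqrt(alpha_bar_t) * a + math.sqrt(1 - alpha_bar_t) * e
            for a, e in zip(actions, noise)]

def mse(pred, target):
    # Training loss: mean squared error between predicted and true noise.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

random.seed(0)
# One clean 7D Cartesian-delta action (illustrative values).
chunk = [0.01, 0.0, -0.02, 0.0, 0.0, 0.05, 1.0]
eps = [random.gauss(0.0, 1.0) for _ in chunk]
noisy = corrupt(chunk, alpha_bar_t=0.5, noise=eps)

# A perfect noise predictor drives the loss to zero; a constant-zero
# predictor does not. Training a DiT minimizes this MSE in expectation.
assert mse(eps, eps) == 0.0
assert mse([0.0] * len(eps), eps) > 0.0
```

At inference, the learned noise predictor is applied iteratively to walk a pure-noise chunk back to a clean action sequence.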
What training data do DiT-based robot policies require?
DiT-based robot policies follow the same data paradigm as other robot learning approaches — teleoperation demonstrations with synchronized (observation, action) pairs — but they benefit disproportionately from data scale and diversity due to their larger parameter counts. pi-zero was trained on cross-embodiment data spanning multiple robot platforms and thousands of tasks. RDT used the Open X-Embodiment dataset combined with human manipulation video converted to pseudo-robot-actions. For a single-task DiT policy, 500-2,000 demonstrations suffice. For multi-task cross-embodiment DiT policies, 100,000+ demonstrations across diverse tasks and environments are typical. The key data requirement specific to DiT is diversity rather than volume per task: the Transformer's capacity allows it to absorb and benefit from heterogeneous data mixtures that would confuse smaller models.
How fast is DiT inference, and can it support real-time control?
DiT inference speed depends on model size, sequence length, and number of denoising steps. A 675M-parameter DiT-XL generating a 16-step action chunk with DDIM (10 denoising steps) runs at approximately 3-5 Hz on an NVIDIA A100. Consistency distillation reduces this to 1-2 denoising steps, pushing inference to 10-20 Hz. Flow matching formulations (used by pi-zero) achieve 5-15 Hz with 5-10 function evaluations. For real-time robot control at 10-20 Hz, either model distillation, flow matching, or smaller DiT variants (DiT-B at 130M params) are practical. The receding-horizon approach — generating an action chunk and executing only the first few steps before re-planning — allows the effective control frequency to exceed the per-chunk generation rate by 2-4x.
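The receding-horizon arithmetic is simple enough to write down directly. A sketch with assumed rates, matching the 2-4x multiplier described above:

```python
def effective_control_rate(chunk_rate_hz, chunk_len, executed_steps):
    """Receding horizon: generate a chunk of `chunk_len` actions at
    `chunk_rate_hz`, execute only the first `executed_steps`, then
    re-plan. Effective rate = actions actually executed per second."""
    assert 0 < executed_steps <= chunk_len
    return chunk_rate_hz * executed_steps

# Generating 16-step chunks at 5 Hz but executing 4 steps per chunk
# yields 20 actions/second -- a 4x multiplier over the chunk rate.
assert effective_control_rate(5, 16, 4) == 20
# Executing only 2 steps per chunk halves the multiplier but re-plans
# more often relative to executed actions, improving reactivity.
assert effective_control_rate(5, 16, 2) == 10
```

The executed-steps count is a reactivity/throughput dial: fewer executed steps per chunk means fresher observations behind each action, at the cost of effective rate.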
Training a DiT-Based Robot Policy?
Claru provides the diverse, multi-task demonstration data that DiT architectures need at scale. Tell us your model's data requirements.