Training Data for GENIMA (Generative Image as Action Models)
A deep dive into GENIMA's affordance-centric architecture, the Stable Diffusion backbone it fine-tunes, the ACT controller it pairs with, and the specific data formats and volumes required to replicate or extend its results on your own manipulation tasks.
What Is GENIMA?
GENIMA (Generative Image as Action Models) is a behavior-cloning framework developed at the Dyson Robot Learning Lab by Mohit Shridhar, Yat Long Lo, and Stephen James. Published in July 2024 (arXiv 2407.07875), GENIMA introduces a fundamentally different approach to visuomotor policy learning: instead of regressing actions directly from images, it fine-tunes a pretrained image generation model to 'draw' target joint positions as colored spheres overlaid on the current RGB observation. A downstream ACT-style controller then converts those visual affordance targets into executable robot joint trajectories.
The key insight is that lifting the action representation into image space lets the policy inherit the strong visual priors of internet-pretrained diffusion models. Because Stable Diffusion already understands geometry, lighting, and object identity, the resulting policy is far more robust to visual perturbations than a standard visuomotor approach. GENIMA outperforms ACT in 16 out of 25 RLBench simulation tasks and beats Diffusion Policy in all 25 tasks, while also demonstrating strong generalization on 9 real-world manipulation tasks with a Franka Panda arm.
A distinctive emergent property of GENIMA is canonical texture reversion: the diffusion model tends to normalize object appearances to canonical colors and textures during affordance image generation, making the policy invariant to randomized object colors, distractors, lighting changes, and background textures. This property is not explicitly trained but arises from the visual priors of the pretrained diffusion backbone.
GENIMA at a Glance
Input / Output Specification
| Parameter | Specification |
|---|---|
| Observation Format | Single or multi-view 128x128 RGB images (RLBench); 480x640 RGB (real-world Franka) |
| Action Representation | Colored spheres drawn on the observation image encoding 3D joint-position targets, decoded by an ACT controller into a sequence of joint positions |
| Language Conditioning | Natural language task instructions fed as text prompts to the Stable Diffusion backbone |
| Control Frequency | 10 Hz (action chunk length varies by task) |
| Diffusion Backbone | SD-Turbo (single-step distilled Stable Diffusion) with ControlNet conditioning on the current RGB frame |
| Controller | ACT (Action Chunking with Transformers) using ResNet-18 vision encoder and Transformer action decoder |
Architecture and Key Innovations
GENIMA's architecture is a two-stage pipeline. Stage 1 is an affordance image generator built by fine-tuning SD-Turbo with ControlNet. The ControlNet adapter takes the current RGB observation as a spatial conditioning signal, while the text encoder receives the language instruction. The denoising UNet then generates an output image identical to the observation but with colored spheres overlaid at the predicted 3D positions of each robot joint, projected into the camera frame. Because SD-Turbo is a single-step distilled model, this generation requires only one forward pass, keeping inference latency low.
Stage 2 is an ACT-based controller that consumes the generated affordance image. The controller uses a ResNet-18 backbone to encode the affordance image and a Transformer decoder to predict a chunk of future joint-position actions. The ACT architecture enables temporal action chunking, producing smooth multi-step trajectories from a single affordance image rather than requiring per-timestep replanning.
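Action chunking is typically paired with temporal ensembling at execution time: chunks predicted at earlier timesteps overlap the current step, and their candidate actions are blended with exponential weights (the ACT convention gives the oldest prediction the highest weight). A minimal numpy sketch of that aggregation, with the decay constant `k` as an illustrative assumption rather than a GENIMA-specified value:

```python
import numpy as np

def temporal_ensemble(chunk_buffer, t, k=0.01):
    """Blend overlapping action chunks for timestep t.

    chunk_buffer: list of (start_time, chunk) pairs, chunk shape (H, 7).
    Following ACT's temporal aggregation, candidates are weighted
    exp(-k * i) with i = 0 for the oldest prediction.
    """
    cands = sorted(
        ((start, chunk[t - start]) for start, chunk in chunk_buffer
         if 0 <= t - start < len(chunk)),
        key=lambda p: p[0],
    )
    weights = np.exp(-k * np.arange(len(cands)))  # oldest gets weight 1
    weights /= weights.sum()
    actions = np.stack([a for _, a in cands])
    return (actions * weights[:, None]).sum(axis=0)
```

With two overlapping chunks the result is close to an even blend, drifting toward the older prediction as `k` grows.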
The two-stage decomposition offers a clean separation of concerns. The diffusion model handles the 'what' and 'where' of manipulation (spatial reasoning, object identity, grasp point selection), while the ACT controller handles the 'how' (motion planning, trajectory smoothness, dynamics). This separation means each stage can be trained and debugged independently, and the affordance images provide an interpretable intermediate representation that engineers can visually inspect.
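The control loop implied by this decomposition can be sketched with stub models. The function names and signatures below are illustrative, not the GENIMA codebase API; the point is the structure: regenerate an affordance image only when a new chunk is needed, and execute each chunk open-loop in between.

```python
import numpy as np

def draw_affordance(rgb, instruction):
    # Stage 1 stub: the fine-tuned SD-Turbo + ControlNet model would
    # return the observation with colored joint-target spheres drawn on.
    return rgb.copy()

def decode_actions(affordance_img, chunk_len=20):
    # Stage 2 stub: the ACT controller maps the affordance image to a
    # chunk of 7-DoF joint-position targets.
    return np.zeros((chunk_len, 7))

def run_episode(get_obs, send_action, instruction, steps=60, chunk_len=20):
    """Replan every chunk_len steps; execute the chunk open-loop between
    replans, as action chunking allows."""
    executed = 0
    while executed < steps:
        affordance = draw_affordance(get_obs(), instruction)
        for action in decode_actions(affordance, chunk_len):
            send_action(action)
            executed += 1
            if executed >= steps:
                break
```

At 10 Hz with a chunk length of 20, the diffusion model only needs to run once every two seconds, which is what makes a single-step distilled backbone practical here.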
A critical innovation is the use of internet-pretrained visual priors for policy robustness. Because the diffusion backbone was trained on billions of internet images, it implicitly encodes strong priors about 3D geometry, lighting, material properties, and object categories. These priors transfer directly to the manipulation policy, enabling zero-shot generalization to novel object instances, colors, and backgrounds that would require explicit domain randomization in conventional approaches.
Comparison with Related Models
How GENIMA compares to alternative visuomotor policy architectures on key dimensions.
| Dimension | GENIMA | ACT | Diffusion Policy | RT-2 |
|---|---|---|---|---|
| Action representation | Affordance images (visual targets) | Direct joint-position chunks | Diffused continuous actions | Discretized token actions |
| Pretrained backbone | SD-Turbo (image generation) | None (trained from scratch) | None (trained from scratch) | PaLI-X VLM |
| Visual robustness | High (canonical texture reversion) | Low (overfits to textures) | Moderate | High (VLM priors) |
| Demos per task | 20-50 | 50-100 | 100-200 | 100K+ (large-scale) |
| Language conditioning | Yes (text-to-image prompt) | No | Optional | Yes (VLM) |
Training Data Requirements
GENIMA's data-efficiency is one of its most compelling properties. In the RLBench evaluation, each of the 25 tasks was trained with as few as 20 to 50 demonstrations, a fraction of what standard visuomotor policies require. Each demonstration consists of an RGB observation stream (128x128 resolution in simulation, 480x640 in real-world), paired with ground-truth robot joint positions at each timestep. From these joint positions, the training pipeline renders colored sphere overlays on the observation images to create the affordance image targets used to fine-tune the diffusion model.
For real-world deployment, the authors collected 9 manipulation tasks on a Franka Panda arm with a wrist-mounted RealSense camera. Each task used between 20 and 50 teleoperated demonstrations. The teleoperation data includes RGB frames, 7-DoF joint positions, and gripper state at 10 Hz. The affordance sphere rendering is computed offline from the recorded joint trajectories using known camera intrinsics and extrinsics, so no special annotation beyond standard teleoperation recording is needed.
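The offline sphere-rendering step reduces to a standard pinhole projection of each recorded joint position into the image. A minimal sketch, assuming calibrated intrinsics `K` and world-to-camera extrinsics `(R, t)`:

```python
import numpy as np

def project_joints(joints_world, K, R, t):
    """Project 3D joint positions (N, 3) in the world frame into pixel
    coordinates. K is the 3x3 intrinsic matrix; (R, t) map world ->
    camera coordinates."""
    p_cam = joints_world @ R.T + t   # (N, 3) points in the camera frame
    uv = p_cam @ K.T                 # (N, 3) homogeneous pixel coords
    return uv[:, :2] / uv[:, 2:3]    # perspective divide
```

Points behind the camera (non-positive depth) should be culled before drawing; that check is omitted here for brevity.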
The ControlNet fine-tuning of SD-Turbo requires paired (observation, affordance_image) samples. The observation is the raw RGB frame, and the affordance image is the same frame with colored spheres at each joint position projected into the camera view. Training typically converges in 10,000 to 50,000 gradient steps with a batch size of 8 to 16 on a single A100 GPU. The ACT controller is trained separately on the affordance images and corresponding action chunks, using the same demonstration dataset.
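Producing the paired samples then amounts to stamping a colored disk at each projected joint position on a copy of the observation. The per-joint palette below is an assumption for illustration (the paper assigns each joint a distinct color, but the exact colors are the implementation's choice):

```python
import numpy as np

# Illustrative per-joint colors; the actual palette is a design choice.
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0),
           (255, 0, 255), (0, 255, 255), (255, 128, 0)]

def render_affordance_target(obs, joint_pixels, radius=4):
    """Draw a filled colored disk at each projected joint pixel (u, v)
    to create the affordance image target paired with `obs`."""
    img = obs.copy()
    h, w, _ = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    for (u, v), color in zip(joint_pixels, PALETTE):
        mask = (xx - u) ** 2 + (yy - v) ** 2 <= radius ** 2
        img[mask] = color
    return img
```

The `(observation, affordance_image)` pairs produced this way are exactly the supervision signal for the ControlNet fine-tune.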
For teams looking to extend GENIMA to new tasks, the minimum viable dataset is approximately 20 high-quality teleoperated demonstrations per task, with consistent camera placement and calibrated intrinsics. Increasing to 50 demonstrations per task provides meaningful improvements in success rate, particularly for tasks involving articulated objects or precise contact dynamics.
How Claru Data Integrates with GENIMA
Claru provides teleoperated robot demonstration datasets with the precise data modalities GENIMA requires: time-synchronized RGB image streams, 7-DoF joint positions, gripper states, and calibrated camera parameters. Our collection pipeline uses instrumented Franka Panda and UR5e setups with RealSense and ZED cameras, producing data at 10-30 Hz with sub-millimeter joint-position accuracy from high-resolution encoders.
For GENIMA specifically, the critical data annotation is the joint-position trajectory at each timestep, which is used to render the affordance sphere targets. Claru's data includes full kinematic chain recordings (joint positions, end-effector poses, and gripper width) that can be directly projected into the camera frame to generate GENIMA's colored sphere training targets. We also provide camera intrinsic and extrinsic calibration files in standard formats (JSON, YAML) compatible with the GENIMA codebase.
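Loading such a calibration file into the matrices the projection step needs is straightforward. The JSON schema below is hypothetical, shown only to illustrate the shape of the data; key names in delivered files may differ:

```python
import json
import numpy as np

# Hypothetical calibration layout -- adapt key names to the actual files.
calib_json = """{
  "intrinsics": {"fx": 615.0, "fy": 615.0, "cx": 320.0, "cy": 240.0},
  "extrinsics": {"rotation": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
                 "translation": [0.0, 0.0, 0.5]}
}"""

def load_calibration(text):
    """Parse a calibration file into (K, R, t) for pinhole projection."""
    c = json.loads(text)
    i = c["intrinsics"]
    K = np.array([[i["fx"], 0.0, i["cx"]],
                  [0.0, i["fy"], i["cy"]],
                  [0.0, 0.0, 1.0]])
    R = np.array(c["extrinsics"]["rotation"], dtype=float)
    t = np.array(c["extrinsics"]["translation"], dtype=float)
    return K, R, t
```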
Beyond raw demonstrations, Claru can supply datasets with controlled visual diversity -- varying object instances, lighting conditions, backgrounds, and distractor objects -- that stress-test GENIMA's canonical texture reversion property and verify generalization before deployment. Our data quality pipeline includes trajectory smoothness validation, camera calibration verification, and demonstration success labeling to ensure that every sample in the training set represents a successful task completion.
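A trajectory smoothness check of the kind described above can be as simple as finite-difference velocity and acceleration limits. The limits below are illustrative defaults, not Claru's actual thresholds:

```python
import numpy as np

def validate_trajectory(q, dt=0.1, vel_limit=2.0, acc_limit=10.0):
    """Flag demonstrations whose finite-difference joint velocities or
    accelerations exceed limits (rad/s, rad/s^2).

    q: (T, 7) joint positions sampled at 1/dt Hz.
    """
    v = np.diff(q, axis=0) / dt
    a = np.diff(v, axis=0) / dt
    return {"max_vel": float(np.abs(v).max()),
            "max_acc": float(np.abs(a).max()),
            "smooth": bool(np.abs(v).max() <= vel_limit
                           and np.abs(a).max() <= acc_limit)}
```

Demonstrations that fail the check usually indicate teleoperation jitter or dropped frames and are better excluded from the fine-tuning set.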
Key References
- [1] Shridhar, Lo, & James. “Generative Image as Action Models.” arXiv:2407.07875, 2024.
- [2] Zhao et al. “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.” RSS 2023.
- [3] Chi et al. “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.” RSS 2023.
- [4] Zhang, Rao, & Agrawala. “Adding Conditional Control to Text-to-Image Diffusion Models.” ICCV 2023.
- [5] James et al. “RLBench: The Robot Learning Benchmark & Learning Environment.” IEEE Robotics and Automation Letters, 2020.
Frequently Asked Questions
Why is GENIMA robust to visual changes like lighting, colors, and distractors?
GENIMA inherits the visual priors of Stable Diffusion, which was pretrained on billions of internet images. During affordance image generation, the diffusion model exhibits an emergent canonical texture reversion property, normalizing object appearances to canonical colors and textures. This makes the policy invariant to randomized object colors, distractors, lighting changes, and background textures without explicit domain randomization.
How many demonstrations does GENIMA need per task?
GENIMA is highly data-efficient compared to conventional visuomotor approaches. In the RLBench evaluation, each task was trained with 20 to 50 demonstrations. For real-world Franka Panda tasks, the same range (20-50 demonstrations) was sufficient to achieve an average 64% success rate across 9 manipulation tasks. This is roughly 2-5x fewer demonstrations than standard ACT or Diffusion Policy baselines require.
How do the diffusion agent and the ACT controller divide the work?
The diffusion agent (Stage 1) is a fine-tuned SD-Turbo model with ControlNet that generates an affordance image showing where the robot joints should move, rendered as colored spheres on the current observation. The ACT controller (Stage 2) takes this affordance image as input and outputs a chunk of joint-position actions that move the robot toward those visual targets. The diffusion model handles spatial reasoning and the controller handles trajectory generation.
Does GENIMA support language conditioning?
Yes. The Stable Diffusion backbone natively accepts text prompts, which GENIMA repurposes to encode task instructions. The language instruction is passed through the text encoder of SD-Turbo and conditions the denoising process, allowing the same model to generate different affordance images for different tasks given the same observation. This is a natural advantage of building on a text-to-image foundation model.
What data does Claru deliver for GENIMA training?
Claru delivers teleoperated demonstration datasets containing time-synchronized RGB image streams (480x640 or higher), 7-DoF joint-position trajectories, gripper states, and calibrated camera intrinsic/extrinsic parameters at 10 Hz. These recordings can be used directly to render GENIMA's colored sphere affordance targets with the provided camera calibration, without any additional annotation. We deliver in HDF5 or the GENIMA codebase's native format.
Get GENIMA-Ready Demonstration Data
Tell us about your GENIMA project -- target tasks, robot platform, and camera setup -- and we will deliver calibrated teleoperation datasets formatted for GENIMA's affordance image training pipeline.