TFDS (TensorFlow Datasets): Complete Guide for Robotics Data
TFDS provides structured dataset pipelines for TensorFlow. Learn how robotics datasets are structured in TFDS and how Claru delivers TFDS-compatible data.
Schema and Structure
TFDS (TensorFlow Datasets) extends TFRecord with a standardized builder pattern that combines data serialization, schema definition, versioning, and pipeline integration into a single framework. A DatasetBuilder subclass defines a DatasetInfo object containing the feature structure (tfds.features.Image for images with automatic JPEG/PNG encoding, tfds.features.Tensor for numerical arrays with specified dtype and shape, tfds.features.Text for strings, tfds.features.FeaturesDict for nested dictionaries), the semantic version number (e.g., 1.0.0), and a human-readable description. Data is stored as sharded TFRecords (typically 256 MB per shard) with a dataset_info.json manifest that records the feature schema, split statistics, number of examples, and file checksums.
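The schema-plus-manifest idea above can be sketched without TFDS installed. The dict below stands in for the feature connectors a dataset_info.json manifest records; the field names and the validation helper are illustrative, not part of the TFDS API.

```python
# Illustrative sketch of what a dataset_info.json manifest records:
# feature names, dtypes, shapes, and encodings. Plain Python dicts
# stand in for tfds.features connectors; field names are made up.
SCHEMA = {
    "observation/image": {"dtype": "uint8", "shape": (224, 224, 3), "encoding": "jpeg"},
    "action": {"dtype": "float32", "shape": (7,)},
    "language_instruction": {"dtype": "string", "shape": ()},
}

def validate_example(example: dict, schema: dict) -> bool:
    """Check that an example provides exactly the fields the schema declares."""
    return set(example) == set(schema)

example = {
    "observation/image": b"...jpeg bytes...",
    "action": [0.0] * 7,
    "language_instruction": "pick up the block",
}
print(validate_example(example, SCHEMA))  # True
```

In the real framework this check happens automatically: the feature connectors declared in _info() both serialize examples at build time and decode them at read time, so schema and data cannot drift apart.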
The TFDS builder pattern provides a standardized way to define, generate, and version datasets. A DatasetBuilder subclass implements three key methods: _info() returns the DatasetInfo with feature connectors and metadata, _split_generators() defines how to discover and partition raw data files into train/validation/test splits, and _generate_examples() yields individual examples as (key, feature_dict) pairs that the framework serializes to sharded TFRecords with automatic checksumming. The builder system handles shard balancing (distributing examples evenly across shards), deterministic shuffling (consistent ordering across builds), and download management (caching raw data and tracking URLs). Version numbers follow semantic versioning: major version changes indicate backward-incompatible feature changes, minor versions add features, and patch versions fix data errors.
TFDS's integration with tf.data provides production-grade data pipeline primitives that are critical for efficient GPU utilization during training. The tfds.load() function returns a tf.data.Dataset with configurable batching, shuffling (with configurable buffer size), prefetching (overlapping data loading with GPU computation), interleaving (reading from multiple shards simultaneously), and caching (keeping frequently accessed data in memory). For robotics pipelines, the combination of RLDS (which defines the episode/step structure on top of TFDS) with TFDS (which handles the storage and pipeline mechanics) has become the standard in the Google DeepMind ecosystem, powering RT-1, RT-2, Octo, and OpenVLA training runs. TFDS also supports loading datasets from Google Cloud Storage with automatic local caching, enabling efficient access to multi-terabyte datasets without full downloads.
Frameworks and Models Using TFDS
TensorFlow / Keras
Native loading via tfds.load() with automatic batching, shuffling, prefetching, and distributed strategy support for multi-GPU training.
JAX / Flax
TFDS integrates with JAX training loops via tfds.as_numpy() conversion and is widely used for JAX-based robotics model training at Google.
Google DeepMind Robotics
RT-1, RT-2, RT-X, Octo, and OpenVLA all use TFDS (via RLDS) as their primary dataset management framework.
Open X-Embodiment
The largest cross-embodiment robotics dataset collection (60+ datasets) uses TFDS builders for standardized data access.
Grain
Google's next-generation data loading library designed for JAX, providing efficient TFDS dataset access with multihost support.
SeqIO
Sequence-to-sequence data pipeline from Google that builds on TFDS for text and multimodal datasets used in language models.
Reading and Writing TFDS Robotics Data in Python
Loading a TFDS dataset takes two lines: dataset = tfds.load('dataset_name', split='train') returns a tf.data.Dataset, and for example in dataset: iterates over individual examples as nested dictionaries of tensors. For robotics RLDS datasets, dataset = tfds.load('bridge_dataset', split='train') returns the full BridgeData V2 collection. Each example is an episode containing a nested 'steps' dataset: iterate with for episode in dataset: and for step in episode['steps']:, where step['observation']['image'] yields a decoded image tensor and step['action'] yields the action vector. tfds.load() handles downloading, caching, shuffling, and batching automatically.
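The episode/step nesting described above can be sketched with plain Python lists and dicts standing in for the tf.data.Dataset objects tfds.load would return; the field names follow the RLDS convention, the values are placeholders.

```python
# Minimal sketch of the RLDS episode/step nesting returned by tfds.load,
# using plain Python containers in place of tf.data.Dataset objects.
fake_dataset = [
    {  # one episode
        "steps": [
            {"observation": {"image": "<HxWx3 tensor>", "state": [0.0] * 7},
             "action": [0.1] * 7, "is_first": True, "is_last": False},
            {"observation": {"image": "<HxWx3 tensor>", "state": [0.0] * 7},
             "action": [0.2] * 7, "is_first": False, "is_last": True},
        ]
    }
]

total_steps = 0
for episode in fake_dataset:          # outer loop: episodes
    for step in episode["steps"]:     # inner loop: time steps
        image = step["observation"]["image"]
        action = step["action"]
        total_steps += 1
print(total_steps)  # 2
```

With the real library, episode['steps'] is itself a tf.data.Dataset, so the inner loop streams steps lazily rather than holding whole episodes in memory.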
Creating a new TFDS dataset involves writing a DatasetBuilder subclass. The _info method returns tfds.core.DatasetInfo with a features dictionary: tfds.features.FeaturesDict({'image': tfds.features.Image(shape=(H, W, 3)), 'action': tfds.features.Tensor(shape=(7,), dtype=tf.float32), 'reward': tf.float32}). The _split_generators method points to raw data directories, and _generate_examples yields (example_key, example_dict) tuples. Running tfds build --datasets=your_dataset generates the sharded TFRecords, dataset_info.json, and label metadata. For the Open X-Embodiment ecosystem, builders follow the RLDS convention of yielding episodes as nested datasets of steps, and the repository at github.com/google-deepmind/open_x_embodiment maintains reference builder implementations.
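The yield contract of _generate_examples can be sketched as a standalone generator that runs without TFDS installed; the file names and fields here are hypothetical, and a real builder would parse the raw files instead of fabricating an episode dict.

```python
# Sketch of the (key, example_dict) yield contract of _generate_examples,
# written as a plain generator. File names and fields are hypothetical.
def generate_examples(episode_files):
    """Yield (unique_key, feature_dict) pairs, one per episode."""
    for path in episode_files:
        # A real builder would parse the raw file here (HDF5, ROS bag, ...).
        episode = {"steps": [{"action": [0.0] * 7}], "file_path": path}
        yield path, episode  # the key must be unique and deterministic

examples = dict(generate_examples(["ep_000.h5", "ep_001.h5"]))
print(sorted(examples))  # ['ep_000.h5', 'ep_001.h5']
```

The framework consumes these pairs, serializes each dict through the feature connectors declared in _info(), and writes the sharded TFRecords; deterministic keys are what make rebuilds reproducible.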
TFDS provides several advanced features critical for large-scale robotics training. The tfds.ReadConfig class controls reading behavior: num_parallel_reads (default 64) determines how many shards are read concurrently, interleave_cycle_length controls round-robin shard reading for better shuffling, and try_autocache enables automatic in-memory caching for datasets that fit in RAM. For distributed training across multiple TPU/GPU hosts, TFDS integrates with tf.distribute strategies to automatically shard data across workers, ensuring each worker processes a unique subset of examples. The tfds.benchmark() utility provides throughput measurement, and well-configured TFDS pipelines achieve 1-10 GB/s data throughput on modern storage systems.
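The effect of interleave_cycle_length can be illustrated with a pure-Python round-robin reader over in-memory "shards"; this is a conceptual sketch of the behavior, not the tf.data implementation.

```python
# Pure-Python sketch of shard interleaving: read from several shard
# iterators round-robin instead of draining one shard at a time, which
# mixes examples from different shards for better shuffling.
def interleave(shards, cycle_length):
    iterators = [iter(s) for s in shards[:cycle_length]]
    pending = shards[cycle_length:]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)
                if pending:  # open the next shard when one is exhausted
                    iterators.append(iter(pending.pop(0)))

shard_a = ["a0", "a1"]
shard_b = ["b0", "b1"]
print(list(interleave([shard_a, shard_b], cycle_length=2)))
# ['a0', 'b0', 'a1', 'b1']
```

With cycle_length=1 the same shards would be read back to back (['a0', 'a1', 'b0', 'b1']), which is why a larger cycle length improves shuffling quality when examples within a shard are correlated.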
When to Use TFDS vs Alternatives
TFDS is the natural choice for TensorFlow/JAX ecosystems, but other formats may be better fits for PyTorch workflows.
| Format | Best For | Framework | Versioning | Cloud Support |
|---|---|---|---|---|
| TFDS | TF/JAX robotics, Open X-Embodiment | TensorFlow, JAX | Built-in semantic versioning | GCS native + caching |
| HF Datasets | PyTorch/HF Hub ecosystem | PyTorch (primary), TF | Git-based on HF Hub | HF Hub streaming |
| WebDataset | Large-scale distributed PyTorch | PyTorch | Manual (shard naming) | S3/HTTP streaming |
| HDF5 | Local training, robomimic/D4RL | Framework-agnostic | Manual | Poor (local files) |
| LeRobot (Parquet+MP4) | HF robotics ecosystem | PyTorch | Git-based on HF Hub | HF Hub streaming |
Converting from Other Formats
| Source Format | Tool / Library | Complexity | Notes |
|---|---|---|---|
| HDF5 | TFDS builder + h5py | moderate | Write a DatasetBuilder that reads HDF5 groups, maps to feature connectors, yields examples. |
| WebDataset (tar shards) | Custom TFDS builder | moderate | Stream tar shards, group files by key prefix, map to TFDS feature structure. |
| CSV / JSON | Simple TFDS builder | simple | Tabular data maps directly to feature connectors; _generate_examples can read rows with the standard csv/json modules. |
| LeRobot (Parquet + MP4) | Custom TFDS builder | moderate | Read Parquet metadata, decode MP4 video frames, yield as TFDS episodes following RLDS structure. |
| ROS bag | TFDS builder + rosbags | complex | Deserialize ROS messages, synchronize multi-rate topics, segment into episodes, yield as TFDS examples. |
| Raw files (images + labels) | TFDS builder + custom loader | moderate | Define features matching your file types, implement _generate_examples to read and yield pairs. |
TFDS in the Google DeepMind Robotics Ecosystem
TFDS is the foundational data layer for Google DeepMind's robotics research stack. The Open X-Embodiment project, which aggregates over 60 robotics datasets from 21 institutions into a unified collection for training cross-embodiment policies, uses TFDS builders for every dataset. Each contributing lab writes a DatasetBuilder that converts their native format (HDF5, custom CSVs, ROS bags) into a standardized RLDS-on-TFDS representation. This standardization enables models like RT-X and Octo to train on the entire collection with a single data loading pipeline, despite the extreme heterogeneity of the underlying data sources.
The combination of TFDS and tf.data provides critical performance optimizations for large-scale robotics model training. TFDS's deterministic shuffling ensures reproducible training runs (important for ablation studies), while tf.data's prefetching and interleaving pipelines keep GPU utilization above 90% even when reading from networked storage. For TPU training (used by RT-2 and RT-X), TFDS's integration with tf.data.service enables distributed data preprocessing where CPU-bound operations (image decoding, augmentation) run on separate preprocessing workers, freeing TPU host CPUs for feeding data to accelerators.
For teams outside the Google ecosystem who want to use TFDS datasets, several interoperability paths exist. The tfds.as_numpy() function strips TensorFlow tensors to NumPy arrays suitable for any framework. The dlimp library (used by the Octo team) wraps TFDS datasets in PyTorch-compatible iterators. The fog_x library provides a format-agnostic abstraction layer that can read both TFDS and other robotics formats. For teams that want to consume Open X-Embodiment data in PyTorch, Claru can convert any RLDS/TFDS dataset to LeRobot format (Parquet + MP4) for native HuggingFace Hub compatibility.
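The hand-off that tfds.as_numpy performs can be sketched in plain Python: recursively replace tensors with their underlying arrays while preserving dict nesting. FakeTensor below is a stand-in for tf.Tensor (its .numpy() method mirrors the real API); the recursive helper is illustrative, not the TFDS implementation.

```python
# Sketch of the framework hand-off tfds.as_numpy performs: tensors are
# stripped to plain array values so any framework can consume them.
class FakeTensor:
    """Stand-in for tf.Tensor; .numpy() mirrors the real accessor."""
    def __init__(self, values):
        self._values = values
    def numpy(self):
        return self._values

def as_numpy(example):
    """Recursively strip tensors to plain values, keeping dict nesting."""
    if isinstance(example, dict):
        return {k: as_numpy(v) for k, v in example.items()}
    if hasattr(example, "numpy"):
        return example.numpy()
    return example

step = {"observation": {"state": FakeTensor([0.0, 1.0])}, "reward": FakeTensor(1.0)}
print(as_numpy(step))  # {'observation': {'state': [0.0, 1.0]}, 'reward': 1.0}
```

The resulting nested dicts of arrays can be fed to a PyTorch DataLoader or a JAX training loop with no TensorFlow types crossing the boundary.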
Claru Data Delivery in TFDS Format
Claru provides custom TFDS DatasetBuilder code alongside the data, enabling one-line loading via tfds.load('claru_your_dataset'). Builders include properly typed feature connectors (tfds.features.Image with configurable encoding, tfds.features.Tensor with exact dtype and shape specifications), split configurations matching your train/val/test requirements, and comprehensive dataset documentation embedded in the DatasetInfo. Image features use JPEG compression by default with configurable quality (default 95), and proprioceptive state uses float32 precision.
Every delivery includes the complete builder Python package that can be registered in your local TFDS installation, a set of pre-built TFRecord shards with checksums, and a verification script that runs the builder's built-in integrity checks. For teams targeting the Open X-Embodiment ecosystem, we validate schema compatibility with RT-X and Octo model expectations (observation dictionary structure, action space dimensionality, language instruction format). For teams needing PyTorch compatibility, we additionally deliver a dlimp-compatible wrapper and can provide the same data in LeRobot or HDF5 format.
Frequently Asked Questions
What is the difference between TFDS and RLDS?
TFDS is the general-purpose dataset framework providing storage (sharded TFRecords), schema definition (feature connectors), versioning (semantic versions), and pipeline integration (tf.data). RLDS is built on top of TFDS specifically for reinforcement learning and robotics data, adding the episode-of-steps nesting structure and RL-specific fields (observation, action, reward, discount, is_first, is_last, is_terminal). Every RLDS dataset is a TFDS dataset, but not every TFDS dataset follows the RLDS conventions. For robotics data that will be used with RT-X, Octo, or OpenVLA, use RLDS. For other perception or classification tasks, standard TFDS is sufficient.
Can I use TFDS datasets with PyTorch?
Yes, via several approaches. The tfds.as_numpy() function converts tf.data.Dataset outputs to NumPy arrays, which can be wrapped in a PyTorch DataLoader. The dlimp library (github.com/kvablack/dlimp) provides a thin wrapper that converts TFDS datasets into PyTorch-compatible iterators and is the same tool used by the Octo model team. The fog_x library offers another format-agnostic abstraction layer. For teams wanting zero TensorFlow dependency, Claru can deliver the same data simultaneously in TFDS and a PyTorch-native format like HDF5, WebDataset, or LeRobot.
How does TFDS handle dataset versioning?
TFDS has built-in semantic version management. Each DatasetBuilder specifies a VERSION attribute (e.g., tfds.core.Version('2.1.0')), and tfds.load() targets specific versions for reproducible training. Major version changes (2.0.0 to 3.0.0) indicate backward-incompatible changes like removed features or changed shapes. Minor versions (2.1.0 to 2.2.0) add new features. Patch versions (2.1.0 to 2.1.1) fix data errors. The TFDS catalog at tensorflow.org/datasets/catalog maintains version history for all registered datasets, and local builds store version-tagged directories.
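The compatibility rule these version numbers encode can be made concrete with a small helper; this function is illustrative and not part of the TFDS API.

```python
# Sketch of the compatibility rule TFDS semantic versions encode:
# same major version and not older => safe to substitute.
def parse(version):
    """Split 'MAJOR.MINOR.PATCH' into a comparable integer tuple."""
    major, minor, patch = (int(p) for p in version.split("."))
    return major, minor, patch

def is_backward_compatible(built, requested):
    """A build satisfies a request if majors match and it is not older."""
    return parse(built)[0] == parse(requested)[0] and parse(built) >= parse(requested)

print(is_backward_compatible("2.1.1", "2.1.0"))  # True  (patch fix)
print(is_backward_compatible("3.0.0", "2.1.0"))  # False (breaking change)
```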
How does TFDS handle datasets too large to fit in memory?
TFDS uses sharded TFRecords (typically 256 MB per shard) that are read sequentially using tf.data's streaming infrastructure. The tf.data pipeline loads data lazily: only the current batch plus prefetched batches are in memory at any time. For datasets on Google Cloud Storage, TFDS downloads and caches shards on demand (configurable via download_and_prepare(download_dir=...)). The interleave operation reads from multiple shards simultaneously (default num_parallel_reads=64) for better I/O throughput, and the cache operation optionally stores the full dataset in memory for datasets that fit. Multi-terabyte datasets like Open X-Embodiment run efficiently because the streaming pipeline never loads the full dataset.
How do I create a TFDS dataset from my own robotics data?
Start by subclassing tfds.core.GeneratorBasedBuilder. In _info(), define your feature structure with tfds.features.FeaturesDict mapping field names to feature types (Image, Tensor, Text, etc.). In _split_generators(), return a dictionary mapping split names to generator configs pointing at your raw data. In _generate_examples(), yield (key, example_dict) pairs where each example matches the feature structure. Run tfds build to generate TFRecords. For RLDS-compatible robotics data, follow the RLDS episode/step conventions; community builder templates for Open X-Embodiment contributions enforce this structure. Claru automates this entire process and delivers builder code that passes all TFDS and OXE validation checks.
Get Data in TFDS Format
Claru delivers robotics data as TFDS-compatible datasets with custom DatasetBuilder code, enabling one-line loading via tfds.load(). Tell us your requirements.