HDF5 (Hierarchical Data Format 5): Complete Guide for Robotics Data

HDF5 is the most widely used file format for robotics datasets. Learn its structure, compression options, and how Claru delivers robot training data in HDF5.

Schema and Structure

HDF5 organizes data hierarchically: /episodes/episode_0/observations/images, /episodes/episode_0/actions, etc. Each group can contain datasets (multi-dimensional arrays) with arbitrary dtypes. Chunked storage enables partial reads, and compression filters (gzip is built in; lz4, zstd, and blosc are available via plugins) typically reduce file size 2-5x. Metadata is stored as HDF5 attributes on groups or datasets.

The HDF5 specification defines two core primitives: groups (analogous to filesystem directories) and datasets (analogous to files containing N-dimensional arrays). Groups can be nested to arbitrary depth, and both groups and datasets can carry key-value metadata as attributes. For robotics, the typical convention uses top-level groups for episodes, with each episode containing sub-groups for observations (further split by modality), actions, rewards, and metadata. The robomimic convention, widely adopted in manipulation research, structures files as /data/demo_0/obs/{camera_name, joint_positions, gripper_state} and /data/demo_0/actions, with a top-level /mask group defining train/val splits.

HDF5's chunking mechanism is critical for training performance. When you create a dataset with chunks=(1, 480, 640, 3), each single-frame image occupies one chunk on disk, enabling O(1) random access to any frame without reading the entire array. The chunk size determines the minimum I/O granularity, so aligning chunks with your access pattern (per-step reads for training, per-episode reads for evaluation) is essential. Virtual datasets (VDS), introduced in HDF5 1.10, allow you to create a unified view across multiple physical files without copying data, which is valuable for combining datasets from different collection campaigns into a single logical dataset.
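The chunking behavior described above can be sketched with h5py. The file path and group names here are illustrative, and gzip stands in for LZ4 since the latter requires the hdf5plugin package:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "episodes.hdf5")

# Ten dummy camera frames, chunked one frame per chunk so that reading any
# single frame touches exactly one chunk on disk.
with h5py.File(path, "w") as f:
    frames = np.zeros((10, 480, 640, 3), dtype=np.uint8)
    f.create_dataset(
        "data/demo_0/obs/agentview_image",
        data=frames,
        chunks=(1, 480, 640, 3),  # one frame per chunk -> O(1) random access
        compression="gzip",       # built-in filter; lz4/zstd/blosc need hdf5plugin
    )

with h5py.File(path, "r") as f:
    dset = f["data/demo_0/obs/agentview_image"]
    frame = dset[7]   # reads a single ~900 KB chunk, not the whole array
    print(frame.shape)  # (480, 640, 3)
```

Because the chunk shape matches a per-step read, this access pattern never decompresses more than one frame per lookup.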

Frameworks and Models Using HDF5

robomimic

The imitation learning framework from Mandlekar et al. (CoRL 2021) uses HDF5 as its native data format for manipulation demonstrations.

ManiSkill

The ManiSkill benchmark suite stores trajectories in HDF5 with per-step observations and actions.

D4RL

The offline RL benchmark uses HDF5 for all its locomotion and manipulation datasets.

RoboCasa

Large-scale simulation benchmark for household robotics, storing demonstrations in robomimic-style HDF5.

Reading and Writing HDF5 Robotics Data in Python

The h5py library is the standard Python interface for HDF5. Reading a robomimic-style dataset is straightforward: open the file with h5py.File('dataset.hdf5', 'r'), then access episodes as f['data/demo_0/obs/agentview_image'][:] to get a NumPy array of all frames. For training, use f['data/demo_0/actions'][start:end] to read action slices without loading the entire episode. The key performance optimization is to keep the file handle open across batches rather than opening and closing it per sample, and to use HDF5's built-in chunked reading rather than loading entire arrays into memory.
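A minimal, runnable sketch of this read pattern follows. It first writes a tiny synthetic episode so the reads have something to hit; the 84x84 resolution and demo_0 naming are placeholders following the robomimic convention:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "dataset.hdf5")

# Write a tiny synthetic robomimic-style episode so the reads below are runnable.
with h5py.File(path, "w") as f:
    f.create_dataset("data/demo_0/obs/agentview_image",
                     data=np.zeros((30, 84, 84, 3), dtype=np.uint8),
                     chunks=(1, 84, 84, 3))
    f.create_dataset("data/demo_0/actions",
                     data=np.zeros((30, 7), dtype=np.float32))

# Keep one handle open across batches; opening per sample is a common bottleneck.
f = h5py.File(path, "r")
demo = f["data/demo_0"]
images = demo["obs/agentview_image"]   # lazy handle: nothing read from disk yet
window = demo["actions"][10:26]        # reads only steps 10..25
frame = images[0]                      # reads a single chunk
print(window.shape, frame.shape)       # (16, 7) (84, 84, 3)
f.close()
```

Note that indexing a dataset object slices on disk, while `[:]` materializes the full array in memory; prefer the former inside training loops.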

Writing a new robotics HDF5 dataset follows a pattern: create the file with h5py.File('output.hdf5', 'w'), then for each episode, create a group with f.create_group(f'data/demo_{i}'), and write observation arrays with f.create_dataset('data/demo_0/obs/image', data=images, chunks=(1, H, W, 3), compression=hdf5plugin.LZ4()). Note that LZ4 is not a built-in filter: the hdf5plugin package provides it, along with Blosc and other fast compressors absent from the default HDF5 installation. Always store metadata as attributes: f['data/demo_0'].attrs['num_samples'] = len(actions). The command h5dump -H dataset.hdf5 (from the HDF5 command-line tools) lets you inspect the structure without loading data, and h5stat dataset.hdf5 reports file-level statistics including chunk utilization.
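The writing pattern above, as a self-contained sketch. Small array sizes and gzip are stand-ins here; with hdf5plugin installed you would pass compression=hdf5plugin.LZ4() instead:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "output.hdf5")
H, W, STEPS = 64, 64, 50   # small stand-ins for real camera resolution

with h5py.File(path, "w") as f:
    for i in range(2):                       # two dummy episodes
        grp = f.create_group(f"data/demo_{i}")
        images = np.zeros((STEPS, H, W, 3), dtype=np.uint8)
        actions = np.zeros((STEPS, 7), dtype=np.float32)
        grp.create_dataset(
            "obs/image", data=images,
            chunks=(1, H, W, 3),             # one step per chunk
            compression="gzip",              # swap in hdf5plugin.LZ4() for fast decode
            fletcher32=True,                 # per-chunk checksums catch corruption
        )
        grp.create_dataset("actions", data=actions, chunks=(STEPS, 7))
        grp.attrs["num_samples"] = STEPS     # metadata lives in attributes

with h5py.File(path, "r") as f:
    print(f["data/demo_1"].attrs["num_samples"])  # 50
```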

When to Use HDF5 vs Alternatives

HDF5 is the most widely adopted format in manipulation research, but other formats may be better depending on your infrastructure.

Format     | Best For                        | Random Access               | Cloud Streaming           | Compression
---------- | ------------------------------- | --------------------------- | ------------------------- | ----------------------
HDF5       | robomimic, D4RL, local training | Excellent (chunked)         | Poor (requires download)  | gzip, lz4, zstd, blosc
Zarr       | Cloud-native, parallel writes   | Excellent (chunked)         | Excellent (S3/GCS native) | Same codecs as HDF5
RLDS       | Open X-Embodiment, RT-X         | Moderate (sequential shards)| Good (GCS via TFDS)       | TFRecord built-in
WebDataset | Distributed GPU training        | Poor (sequential tar)       | Excellent (HTTP/S3)       | Per-file in tar
LeRobot    | HuggingFace ecosystem           | Moderate (Parquet index)    | Good (HF Hub)             | MP4 video + Parquet

Converting from Other Formats

Source Format     | Tool / Library                    | Complexity | Notes
----------------- | --------------------------------- | ---------- | -----
RLDS              | Custom Python (h5py + tensorflow) | Moderate   | Read TFRecords, write to HDF5 groups preserving episode structure.
Zarr              | zarr + h5py                       | Trivial    | h5py and zarr expose near-identical APIs; zarr.copy_all copies between an open zarr group and an h5py file in either direction.
WebDataset        | Custom Python                     | Moderate   | Unpack tar shards, write samples to HDF5 episode groups.
ROS bag           | rosbag + h5py                     | Moderate   | Extract synchronized topics by timestamp, write to episode groups.
LeRobot (Parquet) | pandas + h5py                     | Moderate   | Read Parquet metadata, decode MP4 frames, write to HDF5 arrays.

Migration Guide: Optimizing HDF5 for Training Performance

The most common performance mistake with HDF5 robotics datasets is poor chunk alignment. If your training loop reads one step at a time but your chunks span 100 steps, every read loads 100x more data than needed. For imitation learning, set image dataset chunks to (1, H, W, C) and state/action dataset chunks to (1, state_dim) or (1, action_dim). If your training loop reads fixed-length windows (e.g., action chunking with horizon=16), align chunks to (16, action_dim) for optimal throughput.
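A sketch of window-aligned chunking under these assumptions (horizon 16, 7-dof actions; the path and group names are illustrative):

```python
import os
import tempfile

import h5py
import numpy as np

HORIZON, ACTION_DIM = 16, 7
path = os.path.join(tempfile.mkdtemp(), "aligned.hdf5")

with h5py.File(path, "w") as f:
    actions = np.zeros((320, ACTION_DIM), dtype=np.float32)
    f.create_dataset("data/demo_0/actions", data=actions,
                     chunks=(HORIZON, ACTION_DIM))  # one training window per chunk

with h5py.File(path, "r") as f:
    # A horizon-16 window starting on a chunk boundary touches exactly one chunk.
    window = f["data/demo_0/actions"][32:32 + HORIZON]
print(window.shape)  # (16, 7)
```

Windows that straddle a chunk boundary still work, but they read two chunks instead of one; sampling window starts at multiples of the horizon keeps I/O minimal.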

For distributed training, HDF5's file-locking mechanism can become a bottleneck when multiple workers read the same file. The solution is to set the HDF5_USE_FILE_LOCKING environment variable to FALSE and ensure workers open the file in read-only mode. Alternatively, shard the dataset into multiple HDF5 files (one per N episodes) and assign different shards to different workers. robomimic's SequenceDataset class, for example, opens its HDF5 file handle lazily on first access, which keeps each DataLoader worker process on its own handle.
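The shard-to-worker assignment can be as simple as a round-robin split. The shards_for_worker helper and shard file names below are hypothetical, not part of any framework API; the environment variable must be set before the process opens its first HDF5 file:

```python
import os

# Must be set before any HDF5 file is opened in this process.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

def shards_for_worker(shard_paths, worker_rank, num_workers):
    """Round-robin assignment of HDF5 shard files to one worker."""
    return [p for i, p in enumerate(shard_paths) if i % num_workers == worker_rank]

shards = [f"dataset_shard_{i:03d}.hdf5" for i in range(8)]
print(shards_for_worker(shards, worker_rank=1, num_workers=4))
# -> ['dataset_shard_001.hdf5', 'dataset_shard_005.hdf5']
```

Each worker then opens only its own shards read-only, so no two processes contend for the same file lock.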

When migrating from raw video files to HDF5, consider your compression strategy carefully. Storing raw uint8 frames with LZ4 compression gives the fastest decompression (important for GPU-bound training), while JPEG-compressed bytes in an opaque dataset reduce file size by 5-10x at the cost of decode overhead and lossy quality. For proprioceptive data (joint positions, forces), float32 with no compression is typically fastest since these arrays are small. Always validate your HDF5 files after creation using h5py's built-in checksumming: create datasets with fletcher32=True to detect corruption.
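A validation pass that exploits fletcher32 checksums might look like this; validate_hdf5 is a hypothetical helper, not a library function. Fully reading every dataset forces HDF5 to verify each chunk's checksum, so a corrupt chunk surfaces as an OSError:

```python
import os
import tempfile

import h5py
import numpy as np

def validate_hdf5(path):
    """Read every dataset fully; fletcher32 checksums raise OSError on corrupt chunks."""
    corrupt = []
    with h5py.File(path, "r") as f:
        def check(name, obj):
            if isinstance(obj, h5py.Dataset):
                try:
                    obj[...]          # forces checksum verification of every chunk
                except OSError:
                    corrupt.append(name)
        f.visititems(check)
    return corrupt

# Demo against a freshly written (uncorrupted) file.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("data/demo_0/actions",
                     data=np.zeros((20, 7), dtype=np.float32),
                     fletcher32=True)
print(validate_hdf5(path))  # [] -> no corrupt datasets
```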

References

  1. Mandlekar et al. "robomimic: A Framework for Robot Learning from Demonstration." CoRL 2021.
  2. Fu et al. "D4RL: Datasets for Deep Data-Driven Reinforcement Learning." arXiv preprint, 2020.
  3. The HDF Group. "HDF5 Reference Manual." Technical documentation, 2022.
  4. Nasiriany et al. "RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots." RSS 2024.
  5. Gu et al. "ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills." ICLR 2023.

Claru Data Delivery in HDF5 Format

Claru delivers HDF5 files with standardized group hierarchy: /episodes/{id}/observations/{modality} and /episodes/{id}/actions. Chunked storage with lz4 compression balances read speed and file size. Metadata attributes include camera calibration, robot URDF paths, and task descriptions.

Every HDF5 delivery is validated against the robomimic SequenceDataset loader to ensure compatibility with the most widely used imitation learning framework. We provide a Python verification script that checks chunk alignment, compression settings, dtype consistency, and attribute completeness. For teams using D4RL-style flat layouts, ManiSkill episode structures, or custom schemas, we adapt our output format to match your existing data pipeline. Large deliveries (100K+ episodes) are sharded across multiple HDF5 files with a manifest JSON that maps episode IDs to file paths.

Frequently Asked Questions

Which compression should I use for HDF5 robotics datasets?

LZ4 offers the best balance of compression ratio and decompression speed for GPU-bound training pipelines, achieving 2-3x compression on image data while decompressing at over 3 GB/s on modern CPUs. GZIP achieves better compression ratios (3-5x) but decompresses 5-10x slower, making it better for archival or storage-constrained scenarios. For the best overall compression, Blosc with LZ4 backend and shuffle filter can achieve near-GZIP ratios at near-LZ4 speed by exploiting the byte-level patterns in floating-point arrays. Install the hdf5plugin Python package to access LZ4 and Blosc codecs. Claru defaults to LZ4 with chunk sizes aligned to single-step reads, and we benchmark decompression throughput against your target GPU utilization.

Can HDF5 scale to very large robotics datasets?

Yes. The HDF5 specification supports files up to exabytes, and individual datasets can be larger than available RAM thanks to chunked I/O. For very large robotics datasets (100 TB+), the standard practice is to shard across multiple HDF5 files with an index file or manifest JSON mapping episodes to shard paths. Each shard is typically 1-10 GB for good filesystem performance. The h5py library also supports an in-memory file driver (driver='core', which loads the entire file into RAM) for datasets that fit in memory, and swmr=True (Single Writer Multiple Reader) mode enables concurrent reading during data collection. For datasets with hundreds of thousands of video episodes, Claru provides pre-sharded deliveries with a DataLoader-compatible manifest.

Does HDF5 support distributed or parallel training?

HDF5 supports parallel I/O via MPI-IO when h5py is built with mpi4py support, but this requires a parallel-enabled HDF5 build, which is uncommon in standard conda/pip installations. For simpler distributed training setups, the recommended approach is file sharding: assign different HDF5 shards to different GPU workers, with each worker reading its own subset of episodes. PyTorch's DistributedSampler handles this shard-to-worker mapping. Set the HDF5_USE_FILE_LOCKING=FALSE environment variable to prevent lock contention when multiple processes read the same file. For cloud training, consider converting to zarr or WebDataset, since HDF5 lacks native cloud storage streaming.

How do I inspect and debug an HDF5 dataset?

The HDF5 command-line tools provide essential debugging capabilities. Run h5dump -H dataset.hdf5 to print the full group/dataset tree without loading data, showing shapes, dtypes, chunk sizes, and compression filters. Use h5stat dataset.hdf5 for file-level statistics including total size, metadata overhead, and chunk utilization percentage. In Python, h5py's visititems method walks the entire tree: f.visititems(lambda name, obj: print(name, obj.shape if hasattr(obj, 'shape') else 'group')). For visual inspection, HDFView (from the HDF Group) and ViTables provide GUI-based browsing. The robomimic library includes a dataset_states_to_obs.py utility that replays demonstrations and renders observations, which is the gold standard for validating manipulation datasets.
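The visititems walk can be wrapped into a small inspection helper; describe_tree below is a hypothetical utility, shown against a tiny synthetic file:

```python
import os
import tempfile

import h5py
import numpy as np

def describe_tree(path):
    """Return (name, description) pairs for every group and dataset in the file."""
    entries = []
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset):
                entries.append((name, f"dataset shape={obj.shape} chunks={obj.chunks}"))
            else:
                entries.append((name, "group"))
        f.visititems(visit)
    return entries

# Demo against a minimal synthetic file.
path = os.path.join(tempfile.mkdtemp(), "inspect.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("data/demo_0/actions", data=np.zeros((10, 7)), chunks=(10, 7))

for name, desc in describe_tree(path):
    print(name, "->", desc)
```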

How does HDF5 compare to zarr?

HDF5 and zarr are architecturally similar (both offer chunked, compressed N-dimensional arrays with group hierarchies), but they differ in key operational aspects. HDF5 stores everything in a single monolithic file, which simplifies management but prevents concurrent writes and makes cloud access difficult. Zarr stores each chunk as a separate object in a directory or object store, enabling native S3/GCS access and parallel writes from multiple processes. HDF5 has broader tool support (HDFView, h5dump, MATLAB, C/C++ libraries), while zarr is Python-native with growing support in other languages. For local training on a single machine, HDF5 is more mature and better supported by robomimic and D4RL. For cloud-native or multi-writer workflows, zarr is the better choice. Claru can deliver in either format and provides conversion scripts between them.

Get Data in HDF5 Format

Claru delivers robotics training data in HDF5 format, ready to load into your pipeline. Tell us your requirements.