Zarr (Chunked Array Storage): Complete Guide for Robotics Data

Zarr provides chunked, compressed N-dimensional array storage ideal for large robotics datasets. Understand its structure and cloud-native capabilities.

Schema and Structure

Zarr stores N-dimensional arrays in chunked format with per-chunk compression. A dataset is a directory (or zip file, or cloud object store) containing chunks named by their indices (e.g., 0.0.0, 0.0.1). Metadata in .zarray and .zattrs JSON files describes dtype, shape, chunks, and compressor. Groups organize arrays hierarchically like HDF5.

The fundamental advantage of zarr's architecture is that each chunk is an independent file (or object in a cloud store), which means different chunks can be read, written, or compressed completely independently. This is in stark contrast to HDF5, where the entire file is a single monolithic container with internal addressing. For robotics datasets, this translates to true parallel writes (multiple robots can write different episodes simultaneously to the same zarr store without locking), native cloud access (each chunk is a separate S3/GCS object that can be fetched independently), and incremental updates (adding new episodes does not require rewriting existing data).

Zarr v3, the latest specification, introduces a more flexible metadata system with zarr.json replacing the separate .zarray/.zattrs files, support for variable-length data types (useful for storing natural language instructions alongside fixed-size arrays), and a codec pipeline that allows chaining multiple transformations (e.g., delta encoding followed by Blosc compression) per chunk. The v3 spec also standardizes the storage-agnostic interface, making it easier to swap between local filesystem, S3, GCS, and Azure Blob backends without changing application code.

Frameworks and Models Using Zarr

robomimic v2

Recent robomimic versions support zarr as an alternative to HDF5 for manipulation datasets.

BridgeData V2

BridgeData V2 uses zarr for efficient cloud-hosted dataset access across research groups.

Xarray/Dask

Scientific computing stack for out-of-core array operations on zarr data.

Diffusion Policy (Chi et al.)

The original Diffusion Policy codebase uses zarr for storing demonstration datasets.

Reading and Writing Zarr Robotics Data in Python

Reading a zarr dataset is straightforward with the zarr-python library. Open a local store with z = zarr.open('dataset.zarr', mode='r'), then access arrays as z['episodes/0/observations/image'][:] for a NumPy array. For cloud access, use z = zarr.open('s3://bucket/dataset.zarr', mode='r', storage_options={'anon': False}) with fsspec handling authentication. Slicing works lazily: z['actions'][100:200] only downloads the chunks covering that range, making it efficient to sample random timesteps from a large cloud-hosted dataset.

Writing a new zarr robotics dataset follows the group-creation pattern: root = zarr.open('output.zarr', mode='w'), then for each episode, create groups with ep = root.create_group(f'episodes/{i}') and write data with ep.create_dataset('observations/image', data=images, chunks=(1, 480, 640, 3), compressor=zarr.Blosc(cname='lz4', clevel=5)). The zarr.convenience.copy_all function, which accepts h5py groups and datasets as sources or destinations, enables efficient format conversion from HDF5 to zarr (or vice versa) without loading all data into memory. For Diffusion Policy-style datasets, the convention stores observations and actions as flat arrays indexed by timestep with separate episode_ends arrays marking boundaries, which the Diffusion Policy codebase reads via its ReplayBuffer class.

When to Use Zarr vs Alternatives

Zarr excels in cloud-native and parallel-write scenarios. For local-only workflows with existing tooling, HDF5 may be simpler.

| Format | Cloud Native | Parallel Write | Compression | Tool Ecosystem |
|---|---|---|---|---|
| Zarr | Excellent (S3, GCS, Azure) | Yes (per-chunk) | Blosc, LZ4, ZSTD, Zlib | Python-centric, growing |
| HDF5 | Poor (requires download) | MPI-IO only | Same codecs + shuffle | Mature (C, C++, Java, MATLAB) |
| RLDS (TFRecord) | GCS via TFDS | Via shard separation | TFRecord built-in | TensorFlow ecosystem |
| WebDataset | Excellent (HTTP/S3) | Via shard separation | Per-file in tar | PyTorch-centric |
| LeRobot (Parquet) | HF Hub streaming | Via episode partitioning | MP4 video + Snappy | HuggingFace ecosystem |

Converting from Other Formats

| Source Format | Tool / Library | Complexity | Notes |
|---|---|---|---|
| HDF5 | zarr.convenience.copy_all | trivial | Accepts h5py groups directly and copies the full hierarchy to zarr. |
| NumPy arrays | zarr.save() | trivial | Direct conversion from NumPy arrays to zarr chunks. |
| RLDS | Custom Python | moderate | Read TFRecords, write observation/action arrays as zarr groups. |
| ROS bag | rosbag + zarr | moderate | Extract synchronized topics, write to zarr groups by episode. |
| LeRobot (Parquet) | pandas + zarr | moderate | Read Parquet metadata, decode MP4 frames, write to zarr arrays. |

Migration Guide: Moving from HDF5 to Zarr

Migrating from HDF5 to zarr is one of the simplest format conversions in the robotics data ecosystem because zarr's Python API was explicitly modeled on h5py. The zarr convenience functions accept h5py objects directly: zarr.convenience.copy_all copies all groups, datasets, and attributes from an open h5py.File to a zarr store in a single call. The converted zarr store maintains the same group hierarchy, so code that accesses data by path (e.g., root['episodes/0/observations/image']) works identically on both formats.

The primary migration decision is chunk layout. HDF5 and zarr use the same chunking concept, but zarr's chunk-per-file architecture means that very small chunks (e.g., 1 KB each) create millions of tiny files that overwhelm filesystem metadata operations. For robotics data on S3, the optimal chunk size is 1-10 MB per chunk, which typically means chunking images as (32, H, W, 3) rather than (1, H, W, 3) to avoid excessive S3 GET requests during training. The rechunk function from the rechunker package can re-chunk an existing zarr store without loading all data into memory.

After migration, update your data loading code to use zarr.open() instead of h5py.File(). For Diffusion Policy users, the framework already expects zarr stores and no code changes are needed. For robomimic users, the ZarrDataset class provides a drop-in replacement for the HDF5-based SequenceDataset. Test your training pipeline end-to-end after migration, paying attention to data loading throughput: zarr's parallel chunk reads can improve throughput 2-4x over HDF5 on multi-core machines.

References

  1. Zarr Development Team. "Zarr: An Implementation of Chunked, Compressed, N-Dimensional Arrays." zarr-python Documentation, 2022.
  2. Walke et al. "BridgeData V2: A Dataset for Robot Learning at Scale." CoRL 2023.
  3. Chi et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS 2023.
  4. Zarr Community. "Zarr v3 Specification." Zarr Specs, 2024.

Claru Data Delivery in Zarr Format

Claru delivers zarr datasets with chunk sizes optimized for your access patterns: contiguous episode reads or random step access. Cloud-hosted zarr stores on S3 enable direct streaming without local download.

Every zarr delivery is tuned for your specific infrastructure. For cloud training clusters, we host zarr stores on S3 with chunk sizes optimized for your batch size and number of data-loading workers, typically targeting 2-8 MB per chunk. For local workstations, we deliver consolidated zarr stores (zarr.convenience.consolidate_metadata) that eliminate the per-chunk metadata lookups that can slow down initial dataset scanning. We include a zarr-compatible DataLoader implementation that handles parallel chunk fetching, episode shuffling, and prefetching, tested to saturate GPU utilization on A100 and H100 training rigs.

Frequently Asked Questions

How does zarr differ from HDF5?

Zarr and HDF5 share the same core abstraction (chunked, compressed N-dimensional arrays organized in group hierarchies), but they differ fundamentally in storage architecture. HDF5 stores everything in a single monolithic file with internal addressing, while zarr stores each chunk as a separate filesystem file or cloud object. This gives zarr three major advantages: native cloud storage support (each chunk is an independent S3/GCS object), true parallel writes (multiple processes can write different chunks without locking), and incremental updates (adding episodes does not require rewriting the file). HDF5 has broader language support (C, C++, Java, MATLAB, Fortran bindings), a more mature tool ecosystem (HDFView, h5dump, h5repack), and slightly better performance for sequential local reads. For new robotics projects, zarr is increasingly preferred, especially for cloud-native or multi-robot data collection workflows.

Can zarr stream data directly from cloud storage like S3?

Yes, and this is zarr's primary advantage over HDF5. Zarr supports S3, GCS, Azure Blob Storage, and HTTP endpoints as storage backends via the fsspec library. To open an S3-hosted zarr store, use zarr.open('s3://your-bucket/dataset.zarr', mode='r', storage_options={'key': 'ACCESS_KEY', 'secret': 'SECRET_KEY'}). Each array slice only downloads the chunks needed for that slice, so reading a single episode from a 10TB dataset transfers only megabytes. For authenticated access, pass AWS credentials via storage_options or rely on IAM roles. The s3fs library handles connection pooling and retry logic automatically. Claru delivers zarr stores on S3 with pre-configured IAM policies for your team's AWS accounts.

How should I choose chunk sizes for robotics data?

Chunk sizing depends on your access patterns and storage backend. For cloud storage (S3/GCS), target 1-10 MB per chunk to balance request overhead against granularity. For image arrays at 480x640x3 uint8, this means chunking as (8, 480, 640, 3) to (32, 480, 640, 3) rather than single-frame chunks. For proprioceptive data (joint states, forces), chunk along the time axis at 500-2000 steps per chunk. On local SSDs, smaller chunks (256 KB - 1 MB) work well because filesystem overhead is minimal. The rechunker package (pip install rechunker) can re-chunk an existing zarr store to different chunk sizes without loading all data into memory, which is essential for tuning performance after initial dataset creation. Claru benchmarks chunk sizes against your specific training pipeline before delivery.
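The arithmetic behind these targets is straightforward and worth checking before writing a dataset; a back-of-envelope sketch for the image layout above:

```python
import numpy as np

def chunk_nbytes(chunk_shape, dtype):
    """Uncompressed bytes per chunk for a given chunk shape and dtype."""
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize

single_frame = chunk_nbytes((1, 480, 640, 3), np.uint8)
eight_frames = chunk_nbytes((8, 480, 640, 3), np.uint8)

print(single_frame / 1e6)  # ~0.92 MB: near the bottom of the 1-10 MB target
print(eight_frames / 1e6)  # ~7.4 MB: comfortably inside the target range
```

Note these are uncompressed sizes; with Blosc/LZ4 the on-disk chunks will be smaller, so larger chunk shapes (e.g., 32 frames) can still land in the 1-10 MB range after compression.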

Does zarr support concurrent reads and writes from multiple processes?

Zarr supports concurrent reads from any number of processes by design, since each chunk is an independent file. Concurrent writes are also supported as long as different processes write to different chunks (which naturally happens when different robots or collection workers write different episodes). The zarr.sync module provides a ProcessSynchronizer and ThreadSynchronizer for coordinating writes to the same chunk from multiple processes, though this is rarely needed in robotics data collection. For multi-robot fleet data collection, the recommended pattern is to have each robot write to a separate zarr group (e.g., root/robot_0, root/robot_1) and merge them after collection using zarr's group copy operations. This avoids any coordination overhead during the time-critical collection phase.

What is consolidated metadata and when should I use it?

By default, opening a zarr store requires reading metadata files (.zarray, .zattrs, .zgroup) from every group and array in the hierarchy, which can result in hundreds of HTTP requests for a large dataset on cloud storage. The zarr.convenience.consolidate_metadata function reads all metadata into a single .zmetadata file at the root of the store, reducing the initial scan to a single request. Call it after writing your dataset: zarr.consolidate_metadata('s3://bucket/dataset.zarr'). When opening a consolidated store, use zarr.open_consolidated() instead of zarr.open(). This is especially important for S3-hosted datasets where each metadata file is a separate GET request with ~50ms latency. Claru delivers all cloud-hosted zarr stores with pre-consolidated metadata.

Get Data in Zarr Format

Claru delivers robotics training data in Zarr format, ready to load into your pipeline. Tell us your requirements.