WebDataset (Tar-based Shards): Complete Guide for Robotics Data

WebDataset uses tar archives for efficient sequential I/O in large-scale training. This guide covers the shard format, streaming capabilities, and Claru's WebDataset delivery.

Schema and Structure

WebDataset, created by Thomas Breuel (formerly at Google Brain and NVIDIA), stores training samples as consecutive entries in standard POSIX tar archives. Each sample consists of multiple files sharing a common key prefix: 000042.jpg (image), 000042.json (metadata), 000042.actions.npy (action vector), 000042.state.npy (proprioceptive state). The key is everything before the first dot in the filename, and the extension determines the data type. This convention means that any file format can be embedded in a WebDataset shard: JPEG images, NumPy arrays, JSON metadata, PNG depth maps, protobuf messages, or raw binary blobs. Shards are typically 100 MB to 1 GB tar files named with a pattern like shard-{000000..001000}.tar, and the library supports brace expansion for specifying shard ranges in URLs.

The naming convention is the core design insight of WebDataset. Within a tar shard, files are grouped into samples by their shared key prefix. The webdataset Python library (pip install webdataset) handles this grouping automatically during iteration, yielding dictionaries with extension-based keys: {'jpg': image_bytes, 'json': metadata_dict, 'actions.npy': action_array, 'state.npy': state_array}. Built-in decoders automatically handle common formats: JPEG/PNG images are decoded to PIL Images or NumPy arrays, .npy files are loaded as NumPy arrays, .json files are parsed to dictionaries, and .txt files are read as strings. Custom decoders can be registered for domain-specific formats via wds.decode(custom_handler).
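Because shards are plain POSIX tar files, the grouping convention can be demonstrated with nothing but the standard library. The sketch below (stdlib only, not the webdataset library itself) writes samples as consecutive tar entries and regroups them by key prefix on read — the same logic the library applies during iteration:

```python
import io
import os
import tarfile
from itertools import groupby

def write_shard(path, samples):
    """Write samples (key -> {ext: bytes}) as consecutive tar entries."""
    with tarfile.open(path, "w") as tar:
        for key, fields in samples.items():
            for ext, data in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))

def iterate_samples(path):
    """Group consecutive tar entries by key (everything before the first dot)."""
    def key_of(member):
        return os.path.basename(member.name).split(".", 1)[0]

    with tarfile.open(path, "r") as tar:
        for key, members in groupby(tar, key=key_of):
            sample = {"__key__": key}
            for m in members:
                # the remainder after the first dot is the extension,
                # so '000042.actions.npy' yields the key 'actions.npy'
                ext = os.path.basename(m.name).split(".", 1)[1]
                sample[ext] = tar.extractfile(m).read()
            yield sample
```

Note that grouping relies on a sample's files being adjacent in the archive — which is why WebDataset writers always emit all files of one sample before starting the next.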

For distributed training, WebDataset's shard-based architecture provides a natural parallelism unit. Each GPU worker loads a different subset of shards, and the ResampledShards or ShardList classes handle shard-to-worker assignment with configurable shuffling granularity. Because tar files support efficient sequential streaming, data can flow directly from S3, GCS, or HTTP endpoints to GPU memory without intermediate disk writes, using standard Python HTTP libraries. The webdataset.ShardWriter class creates new shards automatically when the current shard exceeds a configurable maximum size (maxsize) or sample count (maxcount, default 100,000), distributing samples evenly across shards for balanced distributed loading. This design achieves near-linear scaling: 8 GPU workers reading 8 shards simultaneously achieve close to 8x the throughput of a single worker.
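The shard-to-worker assignment can be sketched as simple strided slicing of the global shard list — first by node rank, then by DataLoader worker id. This mirrors the idea behind webdataset's split_by_node/split_by_worker helpers (the function name here is illustrative, not the library API):

```python
def shards_for_worker(shard_urls, rank, world_size, worker_id=0, num_workers=1):
    """Deterministically assign a disjoint subset of shards to one worker:
    slice the global shard list by node rank, then by loader worker id."""
    node_shards = shard_urls[rank::world_size]
    return node_shards[worker_id::num_workers]

# 8 shards, 2 nodes, 2 loader workers per node -> 2 shards each, no overlap
shards = [f"shard-{i:06d}.tar" for i in range(8)]
```

Every worker ends up with a disjoint, near-equal slice, which is why consistent sample counts per shard matter for load balancing.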

Frameworks and Models Using WebDataset

PyTorch DataPipes

WebDataset integrates with PyTorch's DataPipe system via torchdata, providing composable data loading primitives for distributed training.

Hugging Face Datasets

The datasets library can stream WebDataset shards from Hugging Face Hub and S3, with automatic sample decoding and batching.

NVIDIA DALI

NVIDIA's GPU-accelerated data loading library supports WebDataset as an input format, decoding images directly on GPU memory.

OpenCLIP / LAION

The LAION-5B dataset (5 billion image-text pairs) and the OpenCLIP training pipeline use WebDataset as their primary data format.

Stable Diffusion training

Large-scale diffusion model training (Stable Diffusion, SDXL) uses WebDataset shards for efficient multi-node image-text pair loading.

img2dataset

The standard tool for downloading and creating large-scale image datasets outputs WebDataset shards with configurable shard sizes.

Reading and Writing WebDataset in Python

Reading a WebDataset is a single pipeline expression: dataset = wds.WebDataset('shards/shard-{000000..000099}.tar').decode('pil').to_tuple('jpg', 'json'). This loads 100 shards, decodes JPEG images as PIL Images, and yields (image, metadata) tuples. For PyTorch training: dataloader = wds.WebLoader(dataset, batch_size=32, num_workers=4) provides a standard DataLoader interface with automatic shard resampling for distributed training. The pipeline supports chaining operations: .shuffle(1000) shuffles samples within a 1000-sample buffer, .batched(32) creates batches, and .map(transform_fn) applies arbitrary transformations.

Writing WebDataset shards uses the ShardWriter class: with wds.ShardWriter('output/shard-%06d.tar', maxcount=10000, maxsize=1e9) as sink: for each sample, sink.write({'__key__': f'{i:06d}', 'jpg': image_bytes, 'json': metadata_dict, 'actions.npy': action_array}). The __key__ field sets the sample key prefix, and other dictionary entries are written as files with the corresponding extension. ShardWriter automatically creates new shards when the current shard exceeds maxsize bytes or maxcount samples. For images, pass raw JPEG/PNG bytes (not decoded arrays) to avoid re-encoding overhead. For NumPy arrays, the .npy extension triggers automatic np.save encoding.
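The rollover behavior described above — start a new shard once maxcount or maxsize is reached — is easy to see in a minimal stdlib-only sketch. This MiniShardWriter class is a hypothetical stand-in for webdataset.ShardWriter, not the library implementation:

```python
import io
import tarfile

class MiniShardWriter:
    """Minimal sketch of ShardWriter-style rollover: begin a new tar shard
    once the current one holds maxcount samples or maxsize bytes."""

    def __init__(self, pattern, maxcount=10000, maxsize=1e9):
        self.pattern, self.maxcount, self.maxsize = pattern, maxcount, maxsize
        self.shard, self.count, self.size, self.tar = -1, 0, 0, None
        self.paths = []
        self._next_shard()

    def _next_shard(self):
        if self.tar:
            self.tar.close()
        self.shard += 1
        path = self.pattern % self.shard  # e.g. 'shard-%06d.tar' -> shard-000000.tar
        self.paths.append(path)
        self.tar = tarfile.open(path, "w")
        self.count = self.size = 0

    def write(self, sample):
        if self.count >= self.maxcount or self.size >= self.maxsize:
            self._next_shard()
        key = sample["__key__"]
        for ext, data in sample.items():
            if ext == "__key__":
                continue
            info = tarfile.TarInfo(name=f"{key}.{ext}")
            info.size = len(data)
            self.tar.addfile(info, io.BytesIO(data))
            self.size += len(data)
        self.count += 1

    def close(self):
        self.tar.close()
```

All files of a sample are written before the limits are rechecked, so a sample is never split across two shards — the invariant the real ShardWriter also maintains.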

For cloud-native training, WebDataset supports streaming from any URL that tar files can be served from. Reading from S3: dataset = wds.WebDataset('pipe:aws s3 cp s3://bucket/shard-{000..099}.tar -') streams shards through the AWS CLI. For HTTP: dataset = wds.WebDataset('https://host/shards/shard-{000..099}.tar') fetches shards via standard HTTP GET requests. The pipe: prefix enables arbitrary shell commands as data sources, supporting custom authentication, caching, and compression. For maximum throughput on cloud storage, run multiple DataLoader workers, each streaming different shards, so network I/O overlaps with GPU computation.
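The pipe: mechanism is straightforward to sketch with the standard library: run the shell command after the prefix and treat its stdout as a sequential, non-seekable tar stream. The function name here is illustrative; the webdataset library does something similar internally via its URL handlers:

```python
import shlex
import subprocess
import tarfile

def open_pipe_url(url):
    """Sketch of 'pipe:' URL handling: execute the shell command after the
    prefix and read its stdout as a streaming tar archive."""
    assert url.startswith("pipe:")
    proc = subprocess.Popen(shlex.split(url[len("pipe:"):]),
                            stdout=subprocess.PIPE)
    # mode "r|" reads a non-seekable stream, matching sequential tar access
    return tarfile.open(fileobj=proc.stdout, mode="r|")
```

Any command that writes a tar to stdout works as a source — aws s3 cp ... -, gsutil cp ... -, curl, or a local cat — which is what makes custom authentication and caching schemes pluggable.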

When to Use WebDataset vs Alternatives

WebDataset excels at large-scale distributed training but has tradeoffs for other access patterns.

| Format | Best For | Random Access | Cloud Streaming | Write Pattern |
| --- | --- | --- | --- | --- |
| WebDataset | Large-scale distributed GPU training | Poor (sequential tar) | Excellent (S3/HTTP/GCS) | Write-once, read-many |
| HDF5 | Local training, random frame access | Excellent (chunked) | Poor (monolithic file) | Read-write |
| RLDS (TFRecord) | TF/JAX ecosystem, Open X-Embodiment | Moderate (shard + seek) | Good (GCS via TFDS) | Write-once |
| LeRobot (Parquet+MP4) | HuggingFace robotics ecosystem | Moderate (Parquet index) | Good (HF Hub) | Read-write (Parquet) |
| Zarr | Cloud-native array access | Excellent (per-chunk files) | Excellent (S3/GCS native) | Parallel write |

Converting from Other Formats

| Source Format | Tool / Library | Complexity | Notes |
| --- | --- | --- | --- |
| HDF5 | webdataset.ShardWriter + h5py | moderate | Read HDF5 episodes, write each step as a tar sample with image, state, and action files. |
| RLDS (TFRecord) | Custom Python (tensorflow + webdataset) | moderate | Iterate TFRecord episodes and steps, write each step as a WebDataset sample with extension-mapped fields. |
| Individual files (images + labels) | webdataset.ShardWriter | trivial | Read files, write to tar shards: sink.write({'__key__': name, 'jpg': open(img).read(), 'json': label}). |
| img2dataset URLs | img2dataset CLI | trivial | img2dataset directly outputs WebDataset shards from URL lists with parallel downloading and resizing. |
| Parquet / CSV metadata + files | Custom Python (pyarrow + webdataset) | moderate | Read metadata from Parquet, load referenced files, write samples to WebDataset shards. |
| ROS bag | Custom Python (rosbags + webdataset) | moderate | Extract synchronized sensor data per timestep, write each step as a multi-file WebDataset sample. |

WebDataset for Robotics Training at Scale

WebDataset has become the format of choice for large-scale vision model training, and this approach is increasingly adopted for robotics foundation models. The LAION-5B dataset (5.85 billion image-text pairs, ~240 TB) is stored entirely as WebDataset shards, and the OpenCLIP training pipeline reads these shards at over 10,000 samples per second per GPU node using WebDataset's streaming architecture. For robotics, the same pattern applies when training vision-language-action models or large-scale perception models on millions of trajectory steps: each step becomes a WebDataset sample with image, proprioceptive state, action, and language instruction as separate files within the shard.

The key performance advantage of WebDataset over random-access formats (HDF5, Zarr) is its sequential I/O pattern. Modern storage systems (NVMe SSDs, cloud object stores) achieve maximum throughput on sequential reads: NVMe SSDs deliver 3-7 GB/s sequential versus 0.5-1 GB/s random, and S3 delivers roughly 100 MB/s per stream with practically unlimited parallel streams. WebDataset exploits this by reading each shard as a single sequential stream, achieving near-theoretical-maximum storage throughput. For a training cluster reading 64 shards in parallel from S3, this translates to 6+ GB/s aggregate throughput, which is sufficient to keep 8 A100 GPUs fully utilized on image-based training.

For multi-modal robotics data, WebDataset's file-per-field approach maps naturally to the heterogeneous data types in robot demonstrations. A single WebDataset sample for a manipulation demonstration step might contain: 000042.cam0.jpg (overhead camera, JPEG compressed), 000042.cam1.jpg (wrist camera), 000042.depth.png (16-bit depth map), 000042.state.npy (7-DoF joint positions as float32), 000042.action.npy (7-DoF action vector), 000042.language.txt (natural language instruction), and 000042.meta.json (episode ID, timestamp, task label). The webdataset decoder handles each format appropriately based on extension, and custom decoders can handle domain-specific formats like compressed point clouds.
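The extension-based decoding that makes this work can be sketched as a simple dispatch table. The DECODERS mapping below is hypothetical (the real library registers decoders for jpg/png/npy via Pillow and NumPy, which are omitted here to keep the sketch stdlib-only):

```python
import json

# Hypothetical decoder table keyed by extension, mimicking webdataset's
# extension-based dispatch; image and npy handlers are intentionally omitted.
DECODERS = {
    "json": lambda b: json.loads(b),
    "txt": lambda b: b.decode("utf-8"),
}

def decode_sample(raw):
    """Apply extension-keyed decoders; leave unknown extensions as raw bytes."""
    out = {}
    for ext, data in raw.items():
        if ext == "__key__":
            out[ext] = data
            continue
        # dispatch on the last extension component ('state.npy' -> 'npy')
        handler = DECODERS.get(ext.split(".")[-1])
        out[ext] = handler(data) if handler else data
    return out
```

Registering a custom decoder for a domain-specific format (say, a compressed point cloud) amounts to adding one entry to such a table.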

Claru Data Delivery in WebDataset Format

Claru delivers WebDataset shards optimized for your training cluster: shard sizes tuned to your storage backend (100 MB for NFS, 500 MB-1 GB for S3/GCS), consistent sample counts per shard for balanced distributed loading, and S3-compatible URLs for direct streaming with no local download required. Each shard contains multi-modal samples with JPEG-compressed images, NumPy arrays for proprioceptive state and actions, JSON metadata, and optional language instruction text files.

Every delivery includes a shard manifest (JSON file listing all shard URLs, total sample count, and per-shard statistics) that integrates with distributed training coordinators. For teams using NVIDIA DALI, we provide DALI-compatible shard specifications. For teams using PyTorch DataPipes, we include a pre-built DataPipe configuration. Shard contents are validated for consistency (all samples have the same set of extensions, all arrays have consistent shapes) and a sample verification script is provided. For incremental data deliveries (adding new collection batches to an existing training set), new shards follow the existing naming convention and the manifest is updated atomically.

Frequently Asked Questions

Can WebDataset stream directly from cloud storage?

Yes. WebDataset natively supports streaming from S3, GCS, HTTP, and any storage backend accessible via shell commands. For S3: wds.WebDataset('pipe:aws s3 cp s3://bucket/shard-{000..099}.tar -') streams shards through the AWS CLI with automatic credential handling. For HTTP: wds.WebDataset('https://host/shards/shard-{000..099}.tar') uses standard HTTP GET. For GCS: pipe:gsutil cp gs://bucket/shard-{000..099}.tar -. No local download is needed -- data flows directly from cloud storage to your training pipeline, and multiple shards are fetched in parallel for maximum throughput.

What shard size should I use?

100 MB to 1 GB per shard balances I/O efficiency with shuffling granularity and distributed load balancing. Smaller shards (100-200 MB) provide finer shuffling granularity (important for convergence on heterogeneous datasets) and better load balancing across workers, but add per-shard overhead. Larger shards (500 MB-1 GB) maximize sequential I/O throughput and minimize the number of HTTP connections for cloud streaming. For robotics datasets, a practical guideline is: 100 MB for local NFS training, 500 MB for S3/GCS streaming, and at least 10x as many shards as GPUs for good load balancing. Claru tunes shard sizes based on your specific training infrastructure.

How well does WebDataset handle multi-modal robotics data?

Excellent. Each WebDataset sample can contain arbitrary file types grouped by a shared key prefix, naturally handling multi-modal robotics data. A single sample for a manipulation step might contain camera images (.jpg), depth maps (.depth.png), joint states (.state.npy), action vectors (.action.npy), force/torque readings (.ft.npy), language instructions (.txt), and metadata (.json). The webdataset library automatically groups files by key prefix and provides extension-based decoding, and custom decoders can handle domain-specific formats. This per-sample multi-file approach is more flexible than column-oriented formats when different modalities need different compression strategies.

How does shuffling work with sequential tar reads?

WebDataset provides two levels of shuffling. Shard-level shuffling: the ShardList or ResampledShards classes randomize the order in which shards are read, providing coarse-grained shuffling. Sample-level shuffling: the .shuffle(buffer_size) operation maintains a buffer of N samples and yields them in random order, providing fine-grained shuffling within the sequential read stream. A typical configuration uses both: wds.WebDataset(urls).shuffle(5000).decode('pil').batched(32). The buffer size controls the tradeoff between shuffling quality and memory usage. For robotics datasets where episodes should not be split across shards, ensure each shard contains complete episodes; shard-level shuffling then provides episode-level randomization.
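The buffer-shuffle idea is compact enough to sketch in a few lines of stdlib Python. This is an illustration of the technique, not the library's exact implementation:

```python
import random

def buffer_shuffle(iterable, bufsize, rng=None):
    """Sketch of a .shuffle(bufsize)-style operation: keep a rolling buffer of
    bufsize samples and emit a randomly chosen element as new samples arrive."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility in this sketch
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) >= bufsize:
            yield buf.pop(rng.randrange(len(buf)))
    # drain the remaining buffer in random order
    while buf:
        yield buf.pop(rng.randrange(len(buf)))
```

Every input element is emitted exactly once, but only within a window of roughly bufsize positions of its original location — which is why larger buffers give better shuffling at the cost of memory.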

Does WebDataset support random access to individual samples?

Standard WebDataset is optimized for sequential access and does not support efficient random access to individual samples. If you need random access (e.g., sampling specific episodes by ID), consider: (1) building a shard index that maps sample keys to (shard_path, offset) pairs, enabling seeking to specific samples within a shard; (2) using WebDataset's local shard caching (the cache_dir option) for repeated access to the same shards; or (3) using a complementary format like HDF5 or Zarr for random access workloads while keeping WebDataset for training. For most ML training, sequential access with shuffling is sufficient and provides better throughput than random access patterns.
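Option (1), a key-to-offset index, falls out of the tar format directly, since the standard library exposes each member's byte offset. A minimal sketch (stdlib only; the function name and index layout are illustrative):

```python
import tarfile

def build_shard_index(shard_paths):
    """Sketch of a random-access index: map each sample key to the shard path
    and byte offset of its first tar member, using tarfile's offset metadata."""
    index = {}
    for path in shard_paths:
        with tarfile.open(path, "r") as tar:
            for member in tar:
                key = member.name.split(".", 1)[0]
                # record the header offset of the first file for this key
                index.setdefault(key, (path, member.offset))
    return index
```

A lookup then opens the shard, seeks to the recorded offset, and reads that sample's consecutive members — one index build pays for arbitrarily many random reads afterward.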

Get Data in WebDataset Format

Claru delivers WebDataset shards optimized for your training cluster with S3-compatible streaming URLs and balanced distributed loading. Tell us your requirements.