Apache Arrow / Parquet: Complete Guide for Robotics Data
Apache Arrow and Parquet provide columnar data storage for efficient analytics and ML training. Learn how robotics tabular data is stored in Arrow format.
Schema and Structure
Apache Arrow defines a language-independent columnar memory format designed for efficient analytical operations. Data is organized in record batches where each column is a contiguous array of a single type, enabling SIMD (Single Instruction, Multiple Data) vectorized processing and zero-copy reads between processes. Arrow supports rich nested types including structs, lists, maps, and union types, making it capable of representing complex robotics data structures like variable-length joint trajectories and nested observation dictionaries. The in-memory layout uses 64-byte alignment for cache-line efficiency and maintains validity bitmaps for null handling.
Apache Parquet is the complementary on-disk format, storing Arrow-compatible columnar data with efficient compression. Parquet files are organized into row groups (typically 64-128 MB each), where each row group contains column chunks that can be independently compressed with codecs like Snappy (fast, ~2x compression), ZSTD (balanced, ~4x compression), or LZ4 (low-latency, ~2.5x compression). Column-level statistics (min, max, null count) in the file footer enable predicate pushdown, allowing queries to skip entire row groups that do not match filter conditions. For robotics tabular data such as joint trajectories, sensor readings, and episode metadata, this column-oriented layout enables reading only the specific columns needed for a given analysis without scanning irrelevant data.
The Arrow ecosystem includes Arrow IPC (Inter-Process Communication) format for streaming between processes, Arrow Flight for high-throughput network data transfer, and the Feather format (Arrow IPC with file metadata) for fast local persistence. In robotics pipelines, Arrow IPC is particularly valuable for zero-copy data sharing between a data collection process and a real-time monitoring dashboard, or between a preprocessing step and a training loop running in separate processes. The pyarrow library provides the Python interface, while arrow-rs (Rust), Arrow C++ (libarrow), and Arrow Java provide native implementations for other languages.
Frameworks and Models Using Apache Arrow / Parquet
Hugging Face Datasets
The datasets library uses Arrow as its in-memory format and Parquet for on-disk storage, enabling memory-mapped access to datasets larger than RAM.
LeRobot
Hugging Face's robotics framework stores trajectory metadata and tabular episode data in Parquet files, with video frames stored as MP4 and indexed by Parquet timestamps.
Polars
High-performance DataFrame library built natively on Arrow; its multi-threaded, vectorized engine is often an order of magnitude or more faster than pandas for analytical queries on robotics metadata.
DuckDB
In-process analytical database that reads Parquet files directly and queries them with SQL, useful for exploring large robotics dataset metadata.
Apache Spark
Distributed computing framework that uses Parquet as its default persistent storage format for large-scale data processing pipelines.
Delta Lake / Iceberg
Table formats built on Parquet that add ACID transactions, schema evolution, and time travel for versioned robotics dataset management.
Reading and Writing Arrow/Parquet Robotics Data in Python
The pyarrow library (pip install pyarrow) is the standard Python interface for both Arrow and Parquet. Reading a Parquet file is straightforward: pq.read_table('episodes.parquet') returns an Arrow Table that can be converted to pandas with .to_pandas(), to a Python dictionary with .to_pydict(), or column-by-column to NumPy with .column('name').to_numpy(). For large files, use pq.ParquetFile('episodes.parquet').read_row_group(i) to load specific row groups, or pq.read_table('episodes.parquet', columns=['joint_positions', 'timestamp']) to read only the needed columns. Memory-mapped reading via pq.read_table(..., memory_map=True) avoids copying data into Python's heap, which is critical for datasets that approach available RAM.
Writing robotics data to Parquet follows a pattern of building Arrow arrays and tables. For a trajectory dataset: create arrays with pa.array(joint_positions_list, type=pa.list_(pa.float32())), build a table with pa.table({'timestamp': timestamps, 'joint_positions': joints, 'episode_id': episodes}), then write with pq.write_table(table, 'episodes.parquet', compression='zstd', row_group_size=65536). The row_group_size parameter controls the granularity of predicate pushdown and should be tuned to your query patterns: smaller row groups enable finer-grained skipping but add metadata overhead. For time-series robotics data, sorting by timestamp before writing ensures that temporal range queries can skip most row groups.
The Hugging Face datasets library wraps pyarrow with higher-level APIs for ML workflows. Loading a LeRobot dataset returns a datasets.Dataset backed by memory-mapped Arrow files: dataset = datasets.load_dataset('lerobot/aloha_sim_transfer_cube'). The dataset supports .select(), .filter(), and .map() operations that execute lazily on Arrow arrays without materializing intermediate copies. For custom robotics datasets, datasets.Dataset.from_parquet('your_data.parquet') creates a HF-compatible dataset from any Parquet file, enabling integration with the Hugging Face Hub for versioned dataset hosting and streaming.
When to Use Parquet vs Alternatives
Parquet excels at tabular robotics data but other formats may be better for array-heavy or sequential workloads.
| Format | Best For | Column Access | Cloud Native | Compression |
|---|---|---|---|---|
| Parquet | Tabular metadata, episode indices, sensor logs | Excellent (columnar) | Excellent (S3/GCS) | Snappy, ZSTD, LZ4, Brotli |
| HDF5 | Multi-dimensional arrays, images, chunked access | Moderate (dataset-level) | Poor (monolithic file) | gzip, lz4, zstd, blosc |
| Arrow IPC / Feather | Fast IPC, low-latency local persistence | Excellent (columnar) | Moderate | LZ4, ZSTD |
| CSV | Human-readable exchange, small datasets | None (row-based scan) | Good (any storage) | External (gzip) |
| LeRobot (Parquet + MP4) | HF Hub robotics, video + tabular | Parquet columns + video frames | Good (HF Hub streaming) | ZSTD (Parquet) + H.264 (video) |
Converting from Other Formats
| Source Format | Tool / Library | Complexity | Notes |
|---|---|---|---|
| CSV | pyarrow.csv.read_csv() | trivial | Direct conversion with automatic type inference; use ConvertOptions for explicit schemas. |
| HDF5 | Custom h5py + pyarrow | moderate | Read HDF5 arrays per episode, flatten to columnar layout, write row groups to Parquet. |
| pandas DataFrame | df.to_parquet() | trivial | Direct conversion; use engine='pyarrow' and compression='zstd' for best results. |
| JSON / JSONL | pyarrow.json.read_json() | trivial | Infers schema from JSON structure; supports nested records as Arrow structs. |
| RLDS (TFRecord) | Custom Python (tensorflow + pyarrow) | moderate | Iterate TFRecords, extract features, build Arrow tables; used by LeRobot for RLDS-to-Parquet conversion. |
| ROS bag | rosbags + pyarrow | moderate | Deserialize ROS messages, extract fields, write timestamped rows to sorted Parquet files. |
Arrow and Parquet in Robotics Data Pipelines
In modern robotics data pipelines, Parquet has emerged as the metadata and index layer that ties together multi-modal data stored across different formats. The LeRobot framework exemplifies this pattern: video frames are stored as MP4 files for compression efficiency, while all tabular data (timestamps, joint positions, action vectors, episode boundaries, language instructions) lives in Parquet files. The Parquet episode index enables efficient temporal lookups and episode filtering without scanning video data. This separation of concerns allows teams to query metadata at analytical speeds (millions of rows per second via DuckDB or Polars) while keeping bulk sensor data in format-appropriate containers.
Arrow Flight, the network-native data transfer protocol in the Arrow ecosystem, is gaining traction for real-time robotics data streaming. Flight provides gRPC-based endpoints that transfer Arrow record batches with minimal serialization overhead, so throughput is typically bounded by the network link itself (roughly 1 GB/s on 10 GbE) rather than by encoding costs. For robotics fleet management, a central Flight server can ingest telemetry from hundreds of robots simultaneously, with each robot's data landing as Arrow record batches that are immediately queryable. The Arrow Flight SQL extension adds SQL query capabilities over Flight endpoints, enabling analysts to query live robot fleet data with standard SQL tools.
For dataset versioning and reproducibility, table formats like Delta Lake and Apache Iceberg build on Parquet to add transactional guarantees. These formats maintain a transaction log alongside Parquet data files, enabling schema evolution (adding new sensor columns without rewriting existing data), time travel (querying the dataset as it existed at any past point), and atomic updates (safely adding new episodes while others are reading). For robotics teams managing evolving datasets across multiple collection campaigns, these table formats solve the practical problem of data versioning that raw Parquet files do not address.
Claru Data Delivery in Apache Arrow / Parquet Format
Claru delivers tabular robotics data (joint states, sensor readings, episode metadata, action vectors, language instructions) in Parquet format with optimized row group sizes tuned for your training workload. Numeric columns use float32 precision by default with configurable dtype overrides, and categorical columns (task labels, robot IDs, environment names) use dictionary encoding for 5-10x compression on repetitive string data. All Parquet files include column-level statistics in the footer for efficient predicate pushdown during filtering operations.
For teams using the Hugging Face ecosystem, Claru provides LeRobot-compatible deliveries with Parquet metadata files and MP4 video files structured for direct upload to the Hugging Face Hub. For teams running analytical queries on large datasets, we additionally provide DuckDB-compatible Parquet with sorted columns and Bloom filter indexes for sub-second point queries on datasets with millions of rows. Every delivery includes a schema documentation file mapping column names to their semantic meaning, units, and coordinate frame conventions.
Frequently Asked Questions
Should I use Parquet or HDF5 for robotics data?
Use Parquet for tabular data where columnar access matters: joint trajectories, episode metadata, sensor logs, action labels, and language instructions. Parquet excels at analytical queries (filtering episodes by task type, computing statistics across thousands of trajectories) and integrates natively with tools like DuckDB, Polars, and pandas. Use HDF5 for dense multi-dimensional array data (images, depth maps, point clouds) where chunked random access to individual frames is the primary access pattern. Many modern robotics frameworks (LeRobot, Open X-Embodiment datasets) use both: Parquet for the metadata and tabular index, HDF5 or MP4 for the bulk sensor data.
Can Arrow or Parquet store images and video directly?
Arrow can store images as binary columns (pa.binary() or pa.large_binary()), but this is not recommended for large datasets because columnar compression is inefficient on opaque binary blobs. The recommended pattern is to store images as separate files (JPEG, PNG, or MP4 video) and use Arrow/Parquet for the metadata index with file paths, timestamps, and frame offsets. This is exactly how LeRobot works: Parquet files contain trajectory metadata with timestamp columns that index into MP4 video files. Hugging Face datasets also support this pattern via the Image and Video feature types, which store file references in Arrow and handle decoding transparently.
How should I store high-frequency timestamped sensor data in Parquet?
Parquet supports timestamp columns with nanosecond precision (pa.timestamp('ns')), which is sufficient for even the highest-frequency robot sensors (1 kHz IMU data at microsecond resolution). For efficient time-range queries, sort your data by timestamp before writing and enable column statistics in the Parquet footer. This allows readers to skip entire row groups whose timestamp range does not overlap the query window. For multi-rate sensor fusion (e.g., 30 Hz camera + 100 Hz force/torque), store each sensor stream in its own Parquet file sorted by timestamp, then join on timestamp ranges at query time using DuckDB's ASOF JOIN or Polars' join_asof().
What is the difference between Arrow IPC, Feather, and Parquet?
Arrow IPC is the streaming protocol for passing Arrow record batches between processes with zero serialization overhead. Feather (v2) is Arrow IPC written to a file with additional metadata, designed for fast local read/write (no compression by default, but supports LZ4/ZSTD). Parquet is the analytics-optimized on-disk format with columnar compression, predicate pushdown, and row group organization. Use Feather for temporary local persistence where read/write speed matters more than file size. Use Parquet for long-term storage, cloud hosting, and analytical queries. Arrow IPC is typically used programmatically (not saved to files) for inter-process communication in data pipelines.
Can I query Parquet files directly with SQL?
Yes. DuckDB (pip install duckdb) can query Parquet files directly with SQL: SELECT episode_id, AVG(gripper_force) FROM 'episodes.parquet' WHERE task = 'pick_and_place' GROUP BY episode_id. DuckDB reads Parquet with predicate pushdown and columnar projection, so queries on large datasets are fast without loading everything into memory. This is particularly useful for robotics dataset exploration: finding episodes that match specific criteria, computing per-episode statistics, and generating train/val/test splits based on metadata conditions.
Get Data in Apache Arrow / Parquet Format
Claru delivers robotics training data in Parquet format with optimized row groups, column statistics, and LeRobot-compatible structure. Tell us your requirements.