Stereo Outdoor Dataset
Calibrated stereo camera pairs from outdoor environments for training depth estimation and terrain-aware navigation. 40K+ clips across 15+ terrain types with disparity maps, traversability labels, and obstacle annotations.
Why Stereo Outdoor Data Matters for Robotics
Outdoor mobile robots -- delivery bots, agricultural rovers, search-and-rescue platforms, military ground vehicles, and planetary exploration rovers -- must perceive 3D terrain structure to navigate safely. Unlike indoor robots that can assume flat floors and regular geometry, outdoor robots encounter slopes, ditches, loose gravel, vegetation of varying density, water hazards, and terrain that changes dramatically with weather and season. Stereo vision provides dense, real-time depth estimation that is more reliable than monocular approaches in the textureless and repetitive environments common outdoors.
Depth estimation from calibrated stereo pairs is fundamentally different from monocular depth prediction. Stereo provides metric depth (actual distances in meters) through geometric triangulation, while monocular methods produce relative depth that requires scale estimation and frequently fails on textureless surfaces, repetitive patterns (crop rows, paving stones), and at the far-range distances critical for outdoor navigation. Training stereo-specific depth networks requires calibrated stereo image pairs with precise ground-truth disparity maps -- data that single-camera datasets simply cannot provide.
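The triangulation relationship described above can be sketched in a few lines. The focal length and baseline below are illustrative placeholders, not this dataset's calibration:

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Metric depth (m) from disparity (px) via stereo triangulation: Z = f * B / d."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(disparity_px, np.inf)  # zero disparity -> point at infinity
    valid = disparity_px > 0
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Illustrative values: ~1300 px focal length, 120 mm baseline.
# Disparities of 78, 13, and 2.6 px correspond to depths of 2 m, 12 m, and 60 m.
print(disparity_to_depth([78.0, 13.0, 2.6], focal_px=1300.0, baseline_m=0.12))
```

Because depth is inversely proportional to disparity, small disparity errors at far range translate into large metric errors, which is why calibrated pairs with ground-truth disparity matter.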
Existing stereo outdoor datasets like KITTI are limited to urban driving scenes from a single geographic region. They lack the terrain diversity that field robots encounter: forest trails, rocky hillsides, muddy paths, sand dunes, snow-covered ground, agricultural fields between rows, construction sites, riverbanks, and the transitional zones between terrain types that are often the most challenging for traversability estimation. Claru's stereo outdoor dataset captures 15+ terrain types across seasons and weather conditions with calibrated stereo pairs and synchronized IMU data.
Research from ICRA 2024 and the Field Robotics workshop at RSS 2023 shows that stereo depth networks trained on diverse outdoor terrain data improve traversability prediction accuracy by 35-50% compared to networks trained on urban-only stereo datasets, with the improvement driven primarily by exposure to non-rigid surfaces (vegetation, mud, snow) and unstructured geometry (rock fields, root systems) absent from structured road environments.
Sensor Configuration and Collection Methodology
The stereo rig uses a pair of FLIR Blackfly S cameras (1280x960, global shutter, 12mm lenses) mounted on a rigid carbon-fiber baseline bar with 120mm separation, factory-calibrated for stereo rectification. Global shutter is essential for outdoor stereo -- rolling shutter causes temporal skew between scanlines that corrupts disparity computation, especially during rapid platform motion over rough terrain. Stereo frames are hardware-triggered, achieving sub-microsecond synchronization.
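One way to see why a rigid, precisely calibrated baseline matters: the depth uncertainty caused by a fixed disparity error grows quadratically with range. A sketch, assuming a focal length of roughly 1300 px (the true value depends on the lens and sensor pixel pitch, and is an assumption here):

```python
def depth_step_error(depth_m, focal_px, baseline_m, disp_err_px=1.0):
    """Approximate depth error from a disparity error: dZ ~= Z**2 * delta_d / (f * B)."""
    return depth_m ** 2 * disp_err_px / (focal_px * baseline_m)

# Assumed values: ~1300 px focal length, 120 mm baseline.
for z in (2.0, 10.0, 50.0):
    print(f"{z:5.1f} m -> ~{depth_step_error(z, 1300.0, 0.12):.2f} m per pixel of disparity error")
```

Under these assumed numbers, a one-pixel disparity error at 50 m corresponds to a depth error on the order of meters, which is why subpixel matching and stable calibration are needed at the far end of the navigation range.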
An Xsens MTi-630 IMU (9-axis, 400Hz) is rigidly mounted to the stereo baseline bar, providing synchronized inertial data for visual-inertial odometry and motion compensation. GPS/RTK position is recorded at 10Hz for global localization, with RTK corrections providing centimeter-level accuracy when available. The complete sensor package weighs under 800g and mounts to standard robot platforms, survey poles, or chest harnesses for human-carried collection.
Collection spans 15+ terrain types across multiple geographic regions and all four seasons. Terrain categories include: paved trails, gravel paths, forest floors (deciduous and coniferous), grasslands, agricultural rows, rocky hillsides, sandy beaches and dunes, muddy paths, snow-covered ground, urban sidewalks with vegetation, construction sites, riverbanks, wetlands, desert hardpack, and mixed transitional zones. Each terrain type is captured in multiple weather conditions (dry, wet, after rain, during light rain, overcast, direct sun) and times of day (morning, midday, evening) to cover the full range of lighting conditions.
Environmental metadata per session includes terrain type, weather conditions, GPS track, time of day, season, recent precipitation history, vegetation state (full leaf, partial, bare), ground moisture level estimate, and notable hazards encountered (water crossings, steep slopes, loose surfaces). This metadata enables researchers to condition navigation models on environmental context and study how traversability changes with conditions.
Comparison with Public Datasets
How Claru's stereo outdoor dataset compares to publicly available alternatives for outdoor robot navigation.
| Dataset | Clips | Hours | Modalities | Environments | Annotations |
|---|---|---|---|---|---|
| KITTI Stereo (2012) | ~400 pairs | <1 | Stereo RGB, LiDAR | Urban driving (1 city) | Sparse disparity (LiDAR) |
| Middlebury Stereo (2014) | ~33 pairs | N/A (stills) | Stereo RGB | Indoor lab scenes | Dense disparity (structured light) |
| TartanAir (2020) | ~1M frames | ~30 | Stereo RGB (synthetic) | Simulated outdoor | Perfect depth, flow, segmentation |
| Claru Stereo Outdoor | 40K+ | 300+ | Stereo RGB, IMU, GPS | 15+ real terrain types | Disparity, traversability, obstacles, terrain class, weather |
Annotation Pipeline and Quality Assurance
Stage one automated processing generates dense disparity maps from the calibrated stereo pairs using RAFT-Stereo, with confidence-weighted filtering to suppress errors in textureless regions. Semi-global matching (SGM) provides a second disparity estimate, and pixels where RAFT-Stereo and SGM agree within 1 pixel are marked as high-confidence ground truth. IMU-integrated visual odometry provides camera pose for each frame, enabling temporal consistency checks on the depth estimates.
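The cross-check between the two matchers reduces to a simple per-pixel mask. A sketch, assuming both disparity maps are in pixels on the same rectified grid and that non-positive disparities mark invalid pixels:

```python
import numpy as np

def high_confidence_mask(disp_raft, disp_sgm, tol_px=1.0):
    """Mask of pixels where two independent stereo estimates agree within tol_px.

    Non-positive disparities are treated as invalid and excluded."""
    disp_raft = np.asarray(disp_raft, dtype=np.float64)
    disp_sgm = np.asarray(disp_sgm, dtype=np.float64)
    valid = (disp_raft > 0) & (disp_sgm > 0)
    return valid & (np.abs(disp_raft - disp_sgm) <= tol_px)
```

Because RAFT-Stereo and SGM fail in different ways (learned priors vs. local matching), their agreement is a stronger signal than either method's own confidence score.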
Stage two human annotation adds semantic and traversability labels: terrain type classification per region (15+ classes including path, grass, rock, mud, water, vegetation, obstacle, structure), traversability scoring on a 5-point scale (clear, traversable, difficult, dangerous, impassable), obstacle detection and classification (static obstacles like rocks and posts, dynamic obstacles like people and animals, negative obstacles like holes and ditches), and path boundary segmentation for trail-following applications.
Stage three QA combines automated geometric checks with human review. Disparity maps are validated against IMU-derived motion estimates (if the platform moved forward at known speed, the depth change between frames must be consistent). Traversability annotations are reviewed by a second annotator with field robotics experience. Agreement targets: 96%+ on terrain classification, 93%+ on traversability scoring, and 95%+ on obstacle detection. Clips with poor stereo quality (due to rain drops on lenses, extreme sun flare, or insufficient texture) are flagged with quality scores rather than discarded, enabling researchers to train robust stereo networks that handle degraded conditions.
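The motion-consistency check can be sketched as follows. This is a simplified version assuming near-pure forward motion and points close to the optical axis, where a static point's depth should shrink by roughly the distance travelled between frames; the tolerance and median aggregation are illustrative choices, not the production pipeline:

```python
import numpy as np

def temporal_depth_check(depth_t0, depth_t1, forward_motion_m, rel_tol=0.1):
    """Flag a frame pair whose depth change disagrees with the odometry estimate.

    For static points near the optical axis under forward motion, Z1 ~= Z0 - dz.
    A large median residual suggests bad disparity or a bad motion estimate."""
    residual = np.asarray(depth_t0) - np.asarray(depth_t1) - forward_motion_m
    median_err = float(np.median(np.abs(residual)))
    return median_err <= rel_tol * forward_motion_m, median_err
```

A check like this catches gross disparity failures (e.g. a matcher locking onto a repeated texture) that per-frame confidence scores can miss.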
The complete annotation taxonomy covers 15+ terrain classes, 5 traversability levels, 20+ obstacle categories (static, dynamic, and negative), path boundary delineations, weather and visibility condition tags, surface roughness estimates, and slope angle derived from stereo reconstruction. This enables training navigation systems that assess terrain safety from visual appearance before committing to traverse it.
Use Cases
Stereo Depth Estimation
Training learning-based stereo matching networks that provide dense, metric depth maps for outdoor environments. The diversity of terrain textures, weather conditions, and lighting in the dataset builds stereo networks robust to the challenging conditions that cause classical stereo matching to fail. Example architectures: RAFT-Stereo, CREStereo, GMStereo.
Terrain-Aware Navigation
Training traversability estimation models that predict which terrain is safe to cross based on visual appearance and depth structure. Models learn to associate visual texture patterns with traversability -- distinguishing firm gravel from soft sand, or dense vegetation from sparse brush. Critical for field robots operating beyond paved roads.
Visual-Inertial Odometry
Training and benchmarking VIO systems for outdoor environments using synchronized stereo and IMU data. The dataset provides GPS/RTK ground-truth trajectories for evaluating odometry drift across diverse terrain types and motion profiles.
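Evaluating odometry drift against the RTK trajectories comes down to a trajectory-error metric. A minimal sketch using translation-only alignment (a full evaluation would typically use Umeyama SE(3) alignment between the estimated and ground-truth frames):

```python
import numpy as np

def absolute_trajectory_error(est_xyz, gt_xyz):
    """RMSE position error after removing the mean offset between an
    estimated trajectory and a ground-truth trajectory (no rotation alignment)."""
    est = np.asarray(est_xyz, dtype=np.float64)
    gt = np.asarray(gt_xyz, dtype=np.float64)
    d = (est - est.mean(axis=0)) - (gt - gt.mean(axis=0))
    return float(np.sqrt((d ** 2).sum(axis=1).mean()))
```

Both trajectories are assumed to be time-aligned, Nx3 arrays in the same metric frame; interpolating the 10Hz RTK positions to the camera timestamps would be a prerequisite in practice.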
Key References
- [1] Geiger et al., “Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” CVPR 2012.
- [2] Lipson et al., “RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching,” 3DV 2021.
- [3] Wang et al., “TartanAir: A Dataset to Push the Limits of Visual SLAM,” IROS 2020.
- [4] Frey et al., “Learning Robust Terrain Traversability from Multi-Modal Self-Supervised Data,” ICRA 2024.
How Claru Delivers This Data
Claru's collector network deploys calibrated stereo rigs across diverse outdoor environments -- from forest trails in the Pacific Northwest to sandy coastal paths in the Southeast, rocky alpine terrain in the Rockies, and agricultural access roads in the Midwest. This geographic diversity captures the terrain variation that outdoor robots will encounter in real deployment, not just the controlled paths used in academic field robotics experiments.
Custom campaigns can target specific terrain types (agricultural, urban trails, wilderness), weather conditions (specifically rain or snow collection for degraded-condition robustness), seasons (fall leaf cover, spring mud, winter snow), or locomotion platforms (mounted on wheeled robots, legged platforms, or human-carried for different viewpoints). Turnaround is typically 4-6 weeks for standard terrain campaigns.
Data is delivered with full stereo calibration parameters, pre-computed disparity maps (with confidence scores), IMU data at 400Hz, and GPS/RTK trajectories. All streams are time-synchronized to a common clock. Format options include RLDS, HDF5, WebDataset, and standard stereo dataset formats (KITTI-compatible if desired).
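With the calibration parameters loaded, a delivered disparity map converts to metric 3-D points via the standard rectification matrix. A sketch following the homogeneous (u, v, d, 1) -> (X, Y, Z, W) convention used by OpenCV's `reprojectImageTo3D`; the Q matrix used in any real pipeline comes from the delivered calibration, and the simplified example values here (zero principal point, assumed focal length) are hypothetical:

```python
import numpy as np

def reproject_disparity(disparity, Q):
    """Map a rectified disparity image (h, w) to 3-D points (h, w, 3)
    using a 4x4 reprojection matrix Q."""
    h, w = disparity.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)  # pixel row/column grids
    pts = np.stack([u, v, disparity, np.ones_like(disparity)], axis=-1)
    homo = pts @ Q.T
    return homo[..., :3] / homo[..., 3:4]  # divide out the homogeneous W
```

For example, with a simplified Q built from a ~1300 px focal length and the 120 mm baseline, a uniform 78 px disparity maps to Z = 2 m everywhere, matching Z = fB/d.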
Frequently Asked Questions
What is the stereo camera configuration?
120mm baseline with FLIR Blackfly S cameras at 1280x960 resolution, global shutter, hardware-synchronized to sub-microsecond precision. This baseline provides reliable depth estimation from 1-50m, covering the range relevant for most outdoor navigation decisions.
Are pre-computed disparity maps included?
Yes. RAFT-Stereo disparity maps with per-pixel confidence scores are provided. High-confidence regions (where RAFT-Stereo and SGM agree) are flagged as ground-truth quality. Raw stereo pairs are also provided for teams that prefer to run their own stereo matching.
What terrain types are covered?
15+ types including paved trails, gravel paths, forest floors, grasslands, agricultural rows, rocky hillsides, sandy surfaces, muddy paths, snow-covered ground, construction sites, riverbanks, wetlands, desert hardpack, urban sidewalks, and mixed transitional zones.
Is synchronized IMU and GPS data included?
Yes. Xsens MTi-630 IMU at 400Hz and GPS/RTK at 10Hz are hardware-synchronized with stereo frames. Camera-IMU extrinsic calibration is provided for visual-inertial fusion. RTK positions provide centimeter-level ground-truth trajectories when available.
Can collection target specific weather or seasonal conditions?
Yes. Custom campaigns can target specific weather (rain, snow, fog, direct sun, overcast) or seasonal conditions (full leaf, partial leaf, bare, spring mud). Degraded-condition data is particularly valuable for training robust stereo networks.
Request a Sample Pack
Get a curated sample of calibrated stereo outdoor data with disparity maps and traversability annotations to evaluate for your outdoor navigation project.