Navigation Training Data
Multi-sensor navigation datasets for indoor and outdoor autonomous systems — SLAM recordings, semantic maps, obstacle annotations, and traversability labels for training robust navigation policies across diverse environments.
Data Requirements
RGB + LiDAR + IMU + wheel odometry + GPS (outdoor)
10K-500K trajectory segments across 20+ environments
10-30 Hz LiDAR, 30 Hz RGB, 100-200 Hz IMU
How Claru Supports This Task
Claru collects navigation data using standardized mobile platforms deployed across our global network of 100+ cities, providing the environment diversity that is the primary driver of navigation policy generalization. Each recording includes hardware-synchronized RGB, LiDAR, IMU, and odometry streams with ground-truth poses from visual-inertial odometry or RTK-GPS. Our collection protocol captures diverse indoor and outdoor environments with natural human traffic at multiple times of day, ensuring coverage of lighting variation, dynamic obstacle densities, and terrain types. We deliver processed datasets with semantic map annotations, traversability labels, dynamic obstacle bounding boxes with velocity vectors, and human-verified trajectory quality scores. Standard delivery includes 50-200 hours of navigation data per deployment domain, formatted for direct ingestion by ViNT, GNM, NoMaD, or custom navigation architectures in RLDS, HDF5, or ROS bag formats.
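As a concrete illustration, the synchronized streams described above might be organized per trajectory as follows before serialization to HDF5 or RLDS. This is a minimal sketch with hypothetical field names, not Claru's actual delivery schema:

```python
from dataclasses import dataclass, field

@dataclass
class NavStep:
    """One timestep of a navigation recording (illustrative fields only)."""
    timestamp_ns: int        # common hardware clock
    rgb_path: str            # 30 Hz front-facing camera frame
    lidar_path: str          # 10-30 Hz point cloud (nearest scan)
    imu_samples: list        # 100-200 Hz readings since the previous step
    odom_pose: tuple         # (x, y, yaw) from wheel or visual-inertial odometry
    gt_pose: tuple           # ground truth from VIO or RTK-GPS

@dataclass
class NavTrajectory:
    environment_id: str
    robot_platform: str
    steps: list = field(default_factory=list)
```

A downstream loader would iterate `NavTrajectory.steps` and resolve the frame paths lazily, keeping the index small even for multi-hour recordings.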
What Is Robot Navigation and Why Data Is the Bottleneck?
Robot navigation — the ability to move autonomously from one location to another while avoiding obstacles — is the foundational capability that enables every downstream mobile robotics application. Whether a robot is delivering packages in a warehouse, assisting patients in a hospital, or patrolling a construction site, it must first solve navigation reliably. The mobile robot market is projected to exceed $54 billion by 2030, and while classical navigation stacks (SLAM + A* + DWA) work in controlled environments, they fail systematically in the unstructured, dynamic real world where learned navigation policies are required.
The fundamental challenge is perceptual diversity. A navigation policy must handle reflective floors that confuse LiDAR, transparent glass walls invisible to depth sensors, dynamic obstacles like pedestrians who change direction unpredictably, and lighting conditions ranging from direct sunlight to pitch darkness. Classical cost maps cannot encode these nuances — they treat all obstacles identically and all free space as equally traversable. Learned policies trained on diverse real-world trajectory data implicitly capture the full complexity of navigable environments, including soft preferences like staying away from fragile objects or yielding to pedestrians.
The Visual Navigation Transformer (ViNT, Shah et al., 2023) demonstrated that a single navigation policy trained on diverse trajectory data from 6 different robot platforms totaling 100+ hours of experience can generalize to entirely new environments zero-shot. The key insight was data diversity over data volume: ViNT's cross-embodiment training set spanning indoor hallways, outdoor sidewalks, and off-road trails produced stronger generalization than any single-environment dataset regardless of size. Similarly, GNM (Shah et al., 2023) showed that goal-conditioned navigation improves with environment diversity more than raw trajectory count.
Policies trained exclusively in simulation show 20-40% performance degradation in real deployment due to the sim-to-real gap in perceptual inputs. Simulated environments cannot capture real floor textures (reflections, scuff marks, wet patches), realistic sensor noise patterns (LiDAR multipath in narrow corridors, depth sensor failures on dark surfaces), or the behavioral patterns of real dynamic obstacles. For production deployment, real-world navigation data collected across the target environment distribution is not optional — it is the primary determinant of deployment reliability.
Data Requirements by Navigation Approach
Different navigation architectures have distinct data requirements. The trend is toward vision-first methods that reduce sensor cost while increasing generalization.
| Approach | Data Volume | Primary Modality | Key Annotations | Best For |
|---|---|---|---|---|
| Classical SLAM + planner (no learning) | No training data — manual tuning | LiDAR + odometry | Pre-built occupancy map | Static, known environments only |
| Goal-conditioned visual navigation (GNM/ViNT) | 50-200 hrs diverse trajectories | Front-facing RGB | Goal images + odometry waypoints | Cross-environment generalization; low-cost robots |
| Language-conditioned navigation (LM-Nav) | 100+ hrs trajectories + language annotations | RGB + language instructions | Natural language route descriptions + landmarks | Human-directed navigation with verbal commands |
| End-to-end visuomotor (BC from demonstrations) | 10K-50K trajectory segments per environment | RGB + proprioception | Velocity commands + collision labels | Single-environment deployment with high reliability |
| Reinforcement learning with real data (NoMaD) | 50-500 hrs + reward labels | RGB + LiDAR | Traversability scores + collision events + goal progress | Off-road and unstructured terrain |
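The volume figures above can be converted between hours and segment counts with rough arithmetic. The sketch below assumes roughly 10-second segments and 80% of raw recording surviving quality filtering; both numbers are illustrative, not taken from the table:

```python
def estimated_segments(hours: float, avg_segment_s: float = 10.0,
                       usable_fraction: float = 0.8) -> int:
    """Rough count of trajectory segments obtainable from raw recording hours.

    Assumes ~10 s segments and ~80% usable data after quality filtering;
    both defaults are illustrative assumptions.
    """
    return int(hours * 3600 * usable_fraction / avg_segment_s)

# The 50-200 hr range for goal-conditioned navigation maps to roughly:
low, high = estimated_segments(50), estimated_segments(200)
```

Under these assumptions, 50-200 hours lands in the tens of thousands of segments, consistent with the 10K-50K per-environment figure for behavior cloning.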
State of the Art in Learned Navigation
GNM (Shah et al., 2023) introduced the general navigation model paradigm: a single goal-conditioned policy trained on trajectories from multiple robots that transfers zero-shot to new platforms. Trained on 2,735 trajectories across 6 environments from 3 different robot platforms, GNM achieved 85% goal-reaching success on unseen environments versus 42% for a policy trained on a single environment. The architecture processes a current observation and goal image to predict a waypoint sequence, decoupling perception from low-level control and enabling cross-embodiment transfer.
ViNT (Shah et al., 2023) scaled this approach with a transformer backbone trained on over 100 hours of navigation data from 6 robot embodiments spanning indoor, outdoor, and off-road domains. ViNT demonstrated that navigation is a scalable learning problem: performance on held-out environments improved log-linearly with training data diversity, reaching sub-30cm goal accuracy on environments never seen during training. The model uses an EfficientNet visual encoder with a GPT-style action decoder that predicts 8-step future waypoints at 4 Hz.
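To make the waypoint interface concrete, the sketch below turns a single predicted waypoint (in the robot frame) into a velocity command with a simple proportional rule. This is an illustrative controller, not ViNT's actual action decoding; the gains and limits are assumptions:

```python
import math

def waypoint_to_cmd(waypoint_xy, replan_hz=4.0, max_v=1.0, max_w=1.5):
    """Convert the next predicted waypoint (robot frame, metres) into a
    (linear, angular) velocity command.

    Proportional sketch only: drive fast enough to reach the waypoint
    before the next replan, and steer toward its bearing. Gains and
    limits are illustrative assumptions.
    """
    x, y = waypoint_xy
    dist = math.hypot(x, y)
    heading = math.atan2(y, x)          # bearing to waypoint, radians
    v = min(max_v, dist * replan_hz)    # cover dist within one replan cycle
    w = max(-max_w, min(max_w, 2.0 * heading))  # clamped steering gain
    return v, w
```

At the 4 Hz replan rate quoted above, a waypoint 0.1 m ahead yields a gentle 0.4 m/s forward command, while a waypoint off to the side saturates the turn rate.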
NoMaD (Sridhar et al., 2023) extended foundation navigation models with diffusion-based action prediction, achieving more robust behavior in cluttered environments where multimodal action distributions matter. NoMaD's diffusion head predicts a distribution over future trajectories rather than a single path, naturally handling decision points like choosing which side of an obstacle to pass. On the RECON benchmark, NoMaD reduced collision rate by 43% compared to ViNT while maintaining comparable goal-reaching performance.
LM-Nav (Shah et al., 2022) demonstrated that large language models can provide the high-level planning layer for navigation without any navigation-specific language training. By combining a pre-trained visual navigation model with GPT-3 for instruction parsing and CLIP for landmark grounding, LM-Nav executed complex multi-step navigation instructions (e.g., 'go past the fountain, turn left at the red building, and stop at the bench') in outdoor campus environments — entirely from pre-trained models without navigation-specific fine-tuning.
Collection Methodology for Navigation Data
Production navigation data collection requires instrumented mobile platforms that capture synchronized multi-sensor streams while traversing target environments. The core sensor suite includes a front-facing RGB camera (minimum 640x480 at 30 Hz), a 2D or 3D LiDAR scanner (10-20 Hz), wheel odometry or visual-inertial odometry for ego-motion estimation, and an IMU (100-200 Hz) for dead-reckoning through GPS-denied areas. For outdoor navigation, RTK-GPS provides centimeter-level ground truth positioning. All sensors must be hardware-synchronized and extrinsically calibrated to a common body frame.
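Extrinsic calibration can be illustrated with the planar case: transforming a LiDAR point into the common body frame given the sensor's mounting pose. Real pipelines use full SE(3) transforms; this 2D sketch shows the idea:

```python
import math

def lidar_to_body(point_xy, mount_xy, mount_yaw):
    """Transform a 2D LiDAR point into the robot body frame.

    mount_xy / mount_yaw describe the sensor's mounting pose relative to
    the body frame (from extrinsic calibration). Planar SE(2) case for
    clarity; production systems apply full SE(3) transforms.
    """
    c, s = math.cos(mount_yaw), math.sin(mount_yaw)
    x, y = point_xy
    return (mount_xy[0] + c * x - s * y,
            mount_xy[1] + s * x + c * y)
```

For a LiDAR mounted 0.2 m ahead of the body origin with no rotation, a return 1 m in front of the sensor maps to 1.2 m in front of the body frame, which is the figure the planner actually needs.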
Environment diversity is the single most important variable for navigation dataset quality. A production dataset should span at least 20 distinct environments covering the target deployment domain. For indoor service robots, this means offices with open floor plans and cubicle mazes, hospitals with long corridors and elevator lobbies, retail stores with dense aisle layouts, and homes with narrow doorways and furniture clutter. Each environment needs trajectories from 50+ distinct start-goal pairs to ensure spatial coverage, with recordings at different times of day to capture lighting variation.
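A quick way to audit spatial coverage is to count distinct start-goal pairs per environment against the 50-pair target. The sketch below assumes trajectories are tagged with discretized start and goal cells, which is an illustrative format rather than a standard one:

```python
from collections import defaultdict

def coverage_report(trajectories, min_pairs=50):
    """Count distinct start-goal pairs per environment.

    `trajectories` is an iterable of (environment_id, start_cell, goal_cell)
    tuples, where cells are hashable discretized map coordinates (assumed
    format). Returns {env: (pair_count, meets_target)}.
    """
    pairs = defaultdict(set)
    for env, start, goal in trajectories:
        pairs[env].add((start, goal))
    return {env: (len(p), len(p) >= min_pairs) for env, p in pairs.items()}
```

Running this over a candidate dataset before annotation begins surfaces environments that need more recording sessions while the platform is still on site.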
Dynamic obstacle coverage is critical and often underrepresented in navigation datasets. At least 30% of trajectories should include active obstacle avoidance events — pedestrians crossing the path, doors opening, carts being moved. Record these as they occur naturally rather than staging encounters, because the distribution of human movement patterns (speed, predictability, density) varies dramatically between hospitals, offices, and retail environments. Each trajectory should be annotated with obstacle encounter timestamps and avoidance outcomes.
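One simple way to propose candidate avoidance events for annotators is to flag timesteps where the robot turns sharply while an obstacle is nearby. The heuristic and thresholds below are illustrative assumptions, not values from the collection protocol:

```python
def avoidance_events(path_yaw, obstacle_dist, yaw_thresh=0.3, dist_thresh=1.5):
    """Flag timesteps that look like active obstacle avoidance.

    Heuristic sketch: a heading change larger than yaw_thresh radians
    between consecutive steps while the nearest obstacle is within
    dist_thresh metres. Thresholds are illustrative assumptions.
    """
    events = []
    for t in range(1, len(path_yaw)):
        dyaw = abs(path_yaw[t] - path_yaw[t - 1])
        if dyaw > yaw_thresh and obstacle_dist[t] < dist_thresh:
            events.append(t)
    return events
```

Proposed events would still be human-verified, but pre-filtering keeps annotators focused on the roughly 30% of trajectories that actually contain encounters.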
Key Datasets for Robot Navigation
Public navigation datasets range from small single-building collections to large cross-environment corpora. Most focus on either indoor or outdoor navigation, with few spanning both domains.
| Dataset | Year | Scale | Environments | Sensors | Dynamic Obstacles |
|---|---|---|---|---|---|
| SACSoN (Hirose et al.) | 2023 | 100+ hrs, 6 robot platforms | Indoor, outdoor, off-road | RGB + odometry | Natural occurrence |
| RECON (Shah et al.) | 2022 | 50+ hrs across campus | Outdoor university campus | RGB + GPS | Pedestrians, cyclists |
| TartanDrive (Triest et al.) | 2022 | 200K+ frames off-road | Off-road terrain | RGB + IMU + LiDAR | None (unstructured terrain) |
| Habitat MP3D (Chang et al.) | 2017 | 90 building scans | Indoor simulation from real scans | RGB-D (simulated) | Simulated only |
| SCAND (Karnan et al.) | 2022 | 8.7 hrs of sidewalk trajectories | Urban sidewalks | RGB + LiDAR + IMU | Pedestrians |
References
- [1] Shah et al. "ViNT: A Foundation Model for Visual Navigation." CoRL, 2023.
- [2] Shah et al. "GNM: A General Navigation Model to Drive Any Robot." ICRA, 2023.
- [3] Shah et al. "LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action." CoRL, 2022.
- [4] Sridhar et al. "NoMaD: Goal Masking Diffusion Policies for Navigation and Exploration." arXiv:2310.07896, 2023.
- [5] Karnan et al. "Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation." IEEE RA-L, 2022.
Frequently Asked Questions
How much data do I need to train a navigation policy?
For a single building type, 50-100 hours of diverse trajectory data — approximately 50,000 trajectory segments covering 20+ distinct start-goal pairs per floor — enables reliable point-to-point navigation with sub-50cm goal accuracy. For cross-building generalization (the robot deploys in buildings it has never seen), 200-500 hours across 20+ distinct environments is recommended, based on ViNT's finding that navigation performance scales log-linearly with environment diversity. Start with your target building and 2-3 similar environments for initial policy development, then expand the dataset to cover the full deployment distribution. Each environment should include trajectories at different times of day to capture lighting variation, and at least 30% of recordings should contain dynamic obstacle encounters.
What sensors does a navigation data collection platform need?
At minimum: a front-facing RGB camera (640x480+ at 30 Hz), a 2D or 3D LiDAR (10-20 Hz), and wheel odometry or visual-inertial odometry for ego-motion. For outdoor navigation, add RTK-GPS for centimeter-level ground truth. For semantic navigation tasks, add a second rear-facing camera. LiDAR is critical for obstacle detection even if the final policy uses vision only — it provides ground-truth obstacle labels and traversability maps for supervised training. An IMU (100-200 Hz) enables dead-reckoning through GPS-denied zones like tunnels or indoor corridors. All sensors must be hardware-synchronized to within 5ms and extrinsically calibrated to a shared body frame, as temporal misalignment between camera and odometry causes trajectory drift in the training labels.
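The 5 ms synchronization requirement can be checked offline by measuring the worst-case gap between each camera timestamp and its nearest odometry timestamp, as in this sketch:

```python
def max_sync_error_ms(cam_ts, odom_ts):
    """Worst-case camera-to-odometry misalignment in milliseconds.

    For each camera timestamp (seconds), find the nearest odometry
    timestamp and report the largest gap. Values persistently above
    ~5 ms suggest the streams are not hardware-synchronized.
    """
    worst = 0.0
    for t in cam_ts:
        nearest = min(odom_ts, key=lambda o: abs(o - t))
        worst = max(worst, abs(nearest - t) * 1000.0)
    return worst
```

This brute-force nearest-neighbor scan is fine for spot checks; for full recordings a merge over the two sorted streams avoids the quadratic cost.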
How should dynamic obstacles be handled during data collection?
Record in environments with natural human traffic rather than staging encounters, because real pedestrian behavior (speed, predictability, reaction to the robot) varies dramatically across environments. At least 30% of trajectories should include dynamic obstacle avoidance events. Annotate each dynamic obstacle with a bounding box, velocity vector, and trajectory prediction confidence. Include a range of obstacle densities from sparse (one pedestrian in a hallway) to dense (hospital cafeteria during lunch). Without sufficient dynamic obstacle data, policies develop pathological behaviors: freezing indefinitely, taking excessively conservative detours, or failing to yield in socially appropriate ways. NoMaD showed that diffusion-based policies handle multimodal avoidance decisions better than deterministic policies, reducing collision rates by 43%.
Can simulation replace real-world navigation data?
Simulation is valuable for pre-training basic spatial reasoning and obstacle avoidance, but cannot replicate real-world perceptual challenges: reflective floors that create phantom LiDAR returns, transparent glass walls invisible to depth sensors, varying lighting from direct sunlight to flickering fluorescents, and complex sensor noise patterns specific to each hardware platform. Research consistently shows a 20-40% performance degradation when sim-trained policies are deployed in the real world without fine-tuning. The most cost-effective approach combines 70-80% simulation for basic traversal skills with 20-30% real-world data for perceptual robustness. Simulation is most useful for rare safety-critical scenarios (near-collisions, emergency stops) that are difficult to collect safely in real deployments.
What is the difference between SLAM and navigation?
SLAM (Simultaneous Localization and Mapping) is a specific technical component of the navigation stack — it builds a map of the environment while tracking the robot's position within it. Navigation is the complete end-to-end task: perceiving the environment, planning a path to the goal, and executing motion commands to reach it. Classical navigation stacks decompose this into separate modules (SLAM for mapping, A* for path planning, DWA for local control), while modern learned navigation treats it end-to-end: a single neural network takes sensor observations and a goal specification and outputs velocity commands. Learned approaches outperform classical stacks in unstructured environments because they implicitly handle the failure modes (perceptual aliasing, dynamic obstacles, terrain assessment) that require brittle hand-tuning in classical systems.
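The classical planning module mentioned above can be sketched in a few lines: A* over an occupancy grid with 4-connectivity and a Manhattan heuristic. A minimal illustration, not a production planner:

```python
import heapq

def astar(grid, start, goal):
    """A* path search on an occupancy grid.

    `grid` is a list of rows where 1 marks an occupied cell; `start` and
    `goal` are (row, col) tuples. Uses 4-connectivity, unit step cost,
    and the Manhattan distance heuristic. Returns a cell path or None.
    """
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_set = [(h(start), 0, start, [start])]   # (f, g, cell, path)
    seen = set()
    while open_set:
        _, g, cur, path = heapq.heappop(open_set)
        if cur == goal:
            return path
        if cur in seen:
            continue
        seen.add(cur)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = cur[0] + dr, cur[1] + dc
            if (0 <= r < len(grid) and 0 <= c < len(grid[0])
                    and grid[r][c] == 0 and (r, c) not in seen):
                heapq.heappush(open_set, (g + 1 + h((r, c)), g + 1,
                                          (r, c), path + [(r, c)]))
    return None  # goal unreachable
```

The brittleness discussed above lives outside this function: the hard part is keeping `grid` accurate when floors reflect, glass disappears, and obstacles move.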
Get a Custom Quote for Navigation Data
Share your target deployment environment and robot platform, and we will design a multi-sensor navigation data collection plan covering the full diversity of conditions your system will encounter.