KITTI Format: Complete Guide for Robotics Data
The KITTI format is one of the most widely supported data formats in autonomous driving and 3D vision. Learn its file structure and how Claru delivers KITTI-compatible data.
Schema and Structure
KITTI format uses a file-based directory structure developed by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago for their 2012 vision benchmark suite. The format organizes data by modality: images in /image_02/ (left color camera) and /image_03/ (right color camera), LiDAR point clouds in /velodyne/ as binary float32 files (each point stored as 4 floats: x, y, z, reflectance), calibration data in /calib/ as text files containing projection matrices, and labels in /label_02/ as space-delimited text files (the object-detection split shortens these directory names to /image_2/, /image_3/, and /label_2/). Each frame is identified by a 6-digit zero-padded index (000000.png, 000000.bin, 000000.txt), and all modalities for a given frame share the same index number.
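Laid out on disk, a single split therefore looks roughly like this (a sketch; exact directory names vary slightly between the detection, tracking, and raw-data splits):

```
training/
    image_02/   000000.png, 000001.png, ...   (left color camera)
    image_03/   000000.png, 000001.png, ...   (right color camera)
    velodyne/   000000.bin, 000001.bin, ...   (float32 x, y, z, reflectance)
    calib/      000000.txt, 000001.txt, ...   (projection + transformation matrices)
    label_02/   000000.txt, 000001.txt, ...   (one object per line)
```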
The KITTI label format encodes 3D bounding boxes as a single line per object with 15 space-delimited fields: type (class name like Car, Pedestrian, Cyclist), truncated (float 0-1 indicating how much of the object is outside the image), occluded (integer 0-3 from fully visible to fully occluded), alpha (observation angle in radians), 2D bbox (left, top, right, bottom in pixels), 3D dimensions (height, width, length in meters), 3D location (x, y, z of the bottom center in camera coordinates), and rotation_y (yaw rotation around the vertical Y-axis in radians). The 3D bounding box is parameterized as a 7-DoF representation: (x, y, z, h, w, l, ry), where the center is defined as the bottom face center of the box. This convention means the Y-axis points downward (camera convention), which is a common source of confusion when interfacing with LiDAR-centric systems that use Z-up conventions.
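The 15-field label line described above maps directly onto a small parser. A minimal sketch (field names follow the layout above; the sample line is illustrative, not taken from the real dataset):

```python
def parse_kitti_label_line(line):
    """Parse one 15-field KITTI label line into a dict."""
    f = line.split()
    return {
        "type": f[0],
        "truncated": float(f[1]),
        "occluded": int(f[2]),
        "alpha": float(f[3]),
        "bbox": [float(v) for v in f[4:8]],         # left, top, right, bottom (pixels)
        "dimensions": [float(v) for v in f[8:11]],  # height, width, length (meters)
        "location": [float(v) for v in f[11:14]],   # x, y, z of bottom-face center (camera frame)
        "rotation_y": float(f[14]),
    }

# Example line (values illustrative):
obj = parse_kitti_label_line(
    "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
)
```

Note that dimensions are ordered (height, width, length), a frequent source of silent bugs when converting to formats that use (length, width, height).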
The calibration files contain one 3x4 projection matrix per camera (P0-P3) and two transformation matrices (Tr_velo_to_cam for Velodyne LiDAR to camera 0, and Tr_imu_to_velo for IMU to Velodyne). The projection matrices encode both camera intrinsics and the rigid transformation from the reference camera (camera 0) to each target camera. To project a LiDAR point to the left color image (camera 2): first apply Tr_velo_to_cam to transform from Velodyne to camera 0 coordinates, then left-multiply by the rectification matrix R0_rect (a 3x3 matrix in the calib file, typically padded to 4x4 for the chained multiplication), and finally by P2 (3x4) to get homogeneous pixel coordinates. This multi-step projection pipeline is the most implementation-intensive aspect of working with KITTI format.
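The projection chain can be sketched in a few lines of NumPy. This assumes the matrices have already been read from the calib file in their stored shapes (Tr_velo_to_cam as 3x4, R0_rect as 3x3, P2 as 3x4):

```python
import numpy as np

def project_velo_to_image(pts_velo, Tr_velo_to_cam, R0_rect, P2):
    """Project (N, 3) Velodyne points to (N, 2) pixel coordinates in camera 2.

    Tr_velo_to_cam: (3, 4) as stored in the calib file; R0_rect: (3, 3); P2: (3, 4).
    Returns (pixels, depths); points with depth <= 0 are behind the camera.
    """
    n = pts_velo.shape[0]
    # Pad the stored matrices to homogeneous 4x4 form for chaining.
    Tr = np.vstack([Tr_velo_to_cam, [0.0, 0.0, 0.0, 1.0]])   # (4, 4)
    R0 = np.eye(4)
    R0[:3, :3] = R0_rect                                      # (4, 4)
    pts_h = np.hstack([pts_velo, np.ones((n, 1))])            # (N, 4) homogeneous
    cam = R0 @ Tr @ pts_h.T                                   # (4, N) rectified cam-0 frame
    img = P2 @ cam                                            # (3, N) homogeneous pixels
    uv = img[:2] / img[2]                                     # perspective divide
    return uv.T, cam[2]
```

In practice you would mask out points with non-positive depth and points outside the image bounds before drawing.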
Frameworks and Models Using KITTI Format
OpenPCDet
OpenMMLab's point cloud detection framework with KITTI as a primary benchmark, supporting PointPillars, SECOND, PV-RCNN, and CenterPoint.
SECOND
Sparsely Embedded Convolutional Detection network, one of the first efficient voxel-based 3D detectors, designed for KITTI-format data.
PointPillars
Real-time 3D object detection using pillar-based point cloud encoding, known for a strong speed-accuracy tradeoff on the KITTI benchmark.
PV-RCNN / PV-RCNN++
Point-voxel feature fusion for 3D detection, long among the top entries on the KITTI 3D detection leaderboard thanks to combining voxel and point-level features.
MMDetection3D
OpenMMLab's 3D detection toolbox with comprehensive KITTI data loaders, evaluation, and visualization for 20+ detection models.
CenterPoint
Center-based 3D detection and tracking model, evaluated on both KITTI and nuScenes with KITTI-format data loading support.
Reading and Writing KITTI Format Data in Python
Reading KITTI point clouds is straightforward: points = np.fromfile('000000.bin', dtype=np.float32).reshape(-1, 4) yields an (N, 4) array of [x, y, z, reflectance]. Images are standard PNG files loadable with cv2.imread() or PIL.Image.open(). Label files are parsed line-by-line, splitting on whitespace: each line produces a dictionary with type, truncated, occluded, alpha, bbox (4 floats), dimensions (3 floats), location (3 floats), and rotation_y. The calibration file contains key-value lines like P2: followed by 12 floats representing the 3x4 projection matrix.
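A minimal parser for the key-value calibration format described above might look like this (a sketch; real calib files may contain additional keys, which the loop keeps as flat arrays):

```python
import numpy as np

def read_kitti_calib(path):
    """Parse a KITTI calib file into a dict of NumPy arrays."""
    calib = {}
    with open(path) as f:
        for line in f:
            if ":" not in line:
                continue
            key, vals = line.split(":", 1)
            calib[key.strip()] = np.array(vals.split(), dtype=np.float64)
    # Reshape the standard entries to their matrix forms.
    for key in ("P0", "P1", "P2", "P3", "Tr_velo_to_cam", "Tr_imu_to_velo"):
        if key in calib:
            calib[key] = calib[key].reshape(3, 4)
    if "R0_rect" in calib:
        calib["R0_rect"] = calib["R0_rect"].reshape(3, 3)
    return calib
```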
Writing KITTI-format data requires careful attention to coordinate conventions. Point clouds must be in the Velodyne coordinate frame (X-forward, Y-left, Z-up) and written as contiguous float32 binary: points.astype(np.float32).tofile('000000.bin'). Labels must use the camera 2 coordinate frame (X-right, Y-down, Z-forward), which means applying the Velodyne-to-camera transformation to any LiDAR-centric annotations before writing. The 2D bounding box must be the tight axis-aligned enclosure of the projected 3D box corners in image space, and the alpha angle is the observation angle (not the global yaw), computed as alpha = rotation_y - arctan2(location_x, location_z).
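The alpha conversion is small but easy to get backwards; a sketch following the devkit convention, with the result wrapped to [-pi, pi):

```python
import numpy as np

def rotation_y_to_alpha(rotation_y, x, z):
    """Convert global yaw to observation angle alpha for a box whose
    bottom-face center sits at (x, z) in camera coordinates.

    Devkit convention: alpha = rotation_y - arctan2(x, z), wrapped to [-pi, pi).
    """
    alpha = rotation_y - np.arctan2(x, z)
    return (alpha + np.pi) % (2 * np.pi) - np.pi
```

For an object straight ahead of the camera (x = 0), alpha equals rotation_y; the correction grows as the object moves off the optical axis.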
The KITTI evaluation toolkit (devkit provided by the benchmark organizers) computes Average Precision (AP) at three difficulty levels: Easy (minimum bbox height 40px, max occlusion 0, max truncation 0.15), Moderate (25px, occlusion 1, truncation 0.30), and Hard (25px, occlusion 2, truncation 0.50). For 3D detection, the IoU threshold is 0.7 for cars and 0.5 for pedestrians and cyclists. The evaluation computes both BEV (Bird's Eye View) AP and 3D AP. Many 3D detection papers report the 40-point interpolated AP (R40) rather than the original 11-point interpolated AP (R11), a change introduced by the KITTI benchmark in 2019 that produced different numerical values for the same detections.
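The three difficulty thresholds above can be expressed as a simple classifier. Note one assumption in this sketch: it returns the easiest level a box qualifies for, whereas in the official evaluation easier boxes also count toward the harder levels (Moderate includes Easy, Hard includes both):

```python
def kitti_difficulty(bbox_height_px, occlusion, truncation):
    """Return the easiest KITTI difficulty level a ground-truth box
    qualifies for, or None if it falls outside all three."""
    if bbox_height_px >= 40 and occlusion <= 0 and truncation <= 0.15:
        return "Easy"
    if bbox_height_px >= 25 and occlusion <= 1 and truncation <= 0.30:
        return "Moderate"
    if bbox_height_px >= 25 and occlusion <= 2 and truncation <= 0.50:
        return "Hard"
    return None
```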
When to Use KITTI Format vs Alternatives
KITTI is the most widely supported format in open-source 3D detection code, but newer formats offer more features.
| Format | Best For | Temporal Support | Multi-sensor | Open-source Support |
|---|---|---|---|---|
| KITTI | 3D detection, stereo, optical flow | Sequences (separate split) | Stereo + LiDAR + IMU | Excellent (most frameworks) |
| nuScenes | Multi-sensor temporal driving | Native (linked samples) | 6 cameras + LiDAR + radar | Good (growing) |
| Waymo Open | Large-scale diverse driving | Native (segments) | 5 cameras + 5 LiDARs | Moderate (protobuf) |
| Argoverse 2 | HD maps + 3D detection | Native (sequences) | 7 cameras + 2 LiDARs | Good (av2 SDK) |
| ONCE | One-million-scene 3D detection | Native | Camera + LiDAR | Moderate |
Converting from Other Formats
| Source Format | Tool / Library | Complexity | Notes |
|---|---|---|---|
| nuScenes | nuscenes-devkit export_kitti() | moderate | Convert relational schema to KITTI file structure; handles coordinate frame transformation and calibration matrix generation. |
| Waymo Open | waymo_open_dataset / custom script | moderate | Extract camera images and LiDAR from protobuf TFRecords, write KITTI-compatible files with calibration mapping. |
| ROS bag (camera + LiDAR) | Custom Python (rosbags + cv2) | moderate | Extract synchronized camera and LiDAR topics, compute calibration from TF tree, write to KITTI directory structure. |
| Custom sensors | Custom Python | trivial | Write images as PNG and point clouds as binary float32 following KITTI naming and coordinate conventions. |
| Argoverse 2 | av2 SDK + custom script | moderate | Extract per-frame LiDAR sweeps and camera images, transform annotations from global to camera frame. |
KITTI Format Beyond Autonomous Driving
Although KITTI format was designed for autonomous driving, its simplicity has led to widespread adoption in other robotics domains. Indoor navigation systems use KITTI format for RGB-D SLAM evaluation by substituting depth camera data for LiDAR point clouds. Agricultural robotics teams use the label format for 3D crop detection with modified category names. Warehouse robotics systems adopt KITTI format for forklift and pallet detection because the extensive ecosystem of pre-trained 3D detectors (PointPillars, SECOND, CenterPoint) can be fine-tuned on domain-specific KITTI-format data with minimal code changes.
The KITTI format's main limitation is the lack of native temporal linking between frames. In the original KITTI benchmark, tracking sequences are stored in a separate split with consecutive frame indices, but there is no explicit mechanism to link annotations across time (no tracking IDs in the detection label format, though the tracking split adds them). Newer formats like nuScenes address this with linked sample tokens, and Argoverse 2 uses unique track UUIDs. For teams that need temporal consistency in KITTI format, the standard workaround is to add a track_id column to the label files (position 16), which most KITTI-compatible loaders ignore but can be parsed by custom code.
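Custom code that consumes such extended labels only needs to tolerate the optional 16th field; a sketch of the workaround described above:

```python
def parse_label_with_track_id(line):
    """Split a KITTI label line into the standard 15 fields plus an
    optional trailing track_id (None when the column is absent)."""
    f = line.split()
    track_id = int(f[15]) if len(f) > 15 else None
    return f[:15], track_id
```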
Despite these limitations, KITTI format remains the most broadly supported format in 3D detection research as of 2026. The OpenPCDet framework alone supports over 20 different detection architectures that all consume KITTI-format data, and nearly every new 3D detection paper includes KITTI benchmark results. This means that training a model on KITTI-format data gives you access to the largest collection of pre-trained checkpoints, training recipes, and community-maintained codebases of any 3D detection format.
Claru Data Delivery in KITTI Format
Claru delivers data in KITTI format with complete calibration files ensuring geometric consistency between camera images, LiDAR point clouds, and 3D annotations. All labels follow the standard 15-field format with accurate 3D bounding boxes, occlusion levels, and truncation estimates. Point clouds are delivered as binary float32 files in the Velodyne coordinate frame with reflectance values, and calibration matrices are computed from rigorous multi-target camera-LiDAR calibration procedures.
For teams migrating from or benchmarking against the original KITTI dataset, Claru maintains strict format compatibility verified by running the official KITTI evaluation toolkit against delivered data. For applications beyond standard automotive categories, we define custom class taxonomies (e.g., pallet, forklift, shelf for warehouse robotics) while maintaining the same label file structure. Large deliveries include pre-computed train/val splits following KITTI's Moderate difficulty distribution, and we provide OpenPCDet-compatible configuration files for immediate training with any of the 20+ supported detection architectures.
Frequently Asked Questions
Is KITTI format still worth using for new projects?
Yes. While newer formats like nuScenes and Waymo Open offer richer annotation schemas and temporal linking, KITTI format remains the most widely supported format in open-source 3D detection frameworks. OpenPCDet, MMDetection3D, and dozens of individual model repositories all provide KITTI data loaders. Nearly every 3D detection paper published at top venues still includes KITTI benchmark results. For teams that want to leverage the broadest possible ecosystem of pre-trained models and training recipes, KITTI format is the pragmatic choice.
What are the main limitations of KITTI format?
KITTI format lacks native support for temporal sequences (no built-in frame linking or tracking IDs in the detection split), multi-sweep LiDAR accumulation (each frame is a single 360-degree scan), rich attribute annotations (no pedestrian posture, vehicle state, etc.), and map data (no lane markings or drivable area). The format also uses camera-centric coordinates for labels, which requires coordinate transformations when working with LiDAR-centric detection models. For applications requiring these features, nuScenes or Argoverse 2 formats are preferred.
How are 3D bounding boxes represented in KITTI format?
KITTI uses 7-DoF boxes parameterized as (x, y, z, h, w, l, ry): the 3D location (x, y, z) in camera 2 coordinates where Y points down, dimensions in meters (height, width, length), and yaw rotation (ry) around the Y-axis in radians. The (x, y, z) location refers to the bottom face center of the box, not the geometric center. This means y_center = y - h/2 if you need the true 3D center. Only yaw rotation is annotated; roll and pitch are assumed to be zero (objects are on a flat ground plane).
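Computing the eight box corners from this 7-DoF parameterization makes the bottom-center and Y-down conventions concrete. A sketch:

```python
import numpy as np

def kitti_box_corners(location, dimensions, rotation_y):
    """Return the (8, 3) corners of a KITTI 7-DoF box in camera coordinates.

    location = bottom-face center (x, y, z); dimensions = (h, w, l);
    the camera Y axis points down, so the top face sits at y - h.
    """
    x, y, z = location
    h, w, l = dimensions
    # Corner offsets in the box frame (origin at the bottom-face center).
    xc = np.array([ l/2,  l/2, -l/2, -l/2,  l/2,  l/2, -l/2, -l/2])
    yc = np.array([ 0.0,  0.0,  0.0,  0.0,   -h,   -h,   -h,   -h])
    zc = np.array([ w/2, -w/2, -w/2,  w/2,  w/2, -w/2, -w/2,  w/2])
    # Rotate around the Y axis by rotation_y, then translate.
    c, s = np.cos(rotation_y), np.sin(rotation_y)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return (R @ np.vstack([xc, yc, zc])).T + np.array([x, y, z])
```

Projecting these corners with the calibration matrices and taking their tight axis-aligned enclosure is also how the 2D bbox field is derived.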
How do I project LiDAR points onto the camera image?
The projection requires three matrices from the calibration file: Tr_velo_to_cam (the 3x4 Velodyne-to-camera-0 transform, padded to 4x4 for chaining), R0_rect (3x3 rectification matrix, padded to 4x4), and P2 (3x4 projection for camera 2). For a LiDAR point p = [x, y, z, 1]^T in Velodyne coordinates: p_cam = R0_rect @ Tr_velo_to_cam @ p, then p_img = P2 @ p_cam, and finally pixel coordinates are (p_img[0]/p_img[2], p_img[1]/p_img[2]). Filter out points behind the camera (p_cam[2] < 0) and outside image bounds before visualization.
Can I use KITTI format for indoor or non-automotive robotics?
Yes, with some adaptations. Replace the Velodyne LiDAR data with depth camera point clouds (converted to the same binary float32 format), use your camera calibration for the projection matrices, and define indoor-specific category names in the label files. The KITTI format's simplicity (flat files, text labels, standard images) makes it easy to adapt to any camera+depth sensor setup. Several indoor 3D detection projects have successfully used KITTI format for warehouse, factory, and domestic robotics applications.
Get Data in KITTI Format
Claru delivers robotics data in KITTI format with accurate calibration, 3D annotations, and compatibility with 20+ open-source detection frameworks. Tell us your requirements.