KITTI Format: Complete Guide for Robotics Data

The KITTI format is one of the most widely supported data formats in autonomous driving and 3D vision. Learn its file structure and how Claru delivers KITTI-compatible data.

Schema and Structure

KITTI format uses a file-based directory structure developed by the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago for their 2012 vision benchmark suite. The format organizes data by modality: images in /image_02/ (left color camera) and /image_03/ (right color camera), LiDAR point clouds in /velodyne/ as binary float32 files (each point stored as 4 floats: x, y, z, reflectance), calibration data in /calib/ as text files containing projection matrices, and labels in /label_02/ as space-delimited text files. Each frame is identified by a 6-digit zero-padded index (000000.png, 000000.bin, 000000.txt), and all modalities for a given frame share the same index number.

The KITTI label format encodes 3D bounding boxes as a single line per object with 15 space-delimited fields: type (class name like Car, Pedestrian, Cyclist), truncated (float 0-1 indicating how much of the object is outside the image), occluded (integer 0-3 from fully visible to fully occluded), alpha (observation angle in radians), 2D bbox (left, top, right, bottom in pixels), 3D dimensions (height, width, length in meters), 3D location (x, y, z of the bottom center in camera coordinates), and rotation_y (yaw rotation around the vertical Y-axis in radians). The 3D bounding box is parameterized as a 7-DoF representation: (x, y, z, h, w, l, ry), where the center is defined as the bottom face center of the box. This convention means the Y-axis points downward (camera convention), which is a common source of confusion when interfacing with LiDAR-centric systems that use Z-up conventions.
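The 15-field layout above maps directly onto a small parser. This is a minimal sketch (the sample line uses illustrative values, not a real KITTI annotation):

```python
# Sketch: parse one KITTI label line into the 15 named fields,
# in the order defined by the KITTI object development kit.
def parse_kitti_label_line(line):
    f = line.split()
    return {
        "type": f[0],
        "truncated": float(f[1]),
        "occluded": int(f[2]),
        "alpha": float(f[3]),
        "bbox": [float(v) for v in f[4:8]],         # left, top, right, bottom (px)
        "dimensions": [float(v) for v in f[8:11]],  # height, width, length (m)
        "location": [float(v) for v in f[11:14]],   # x, y, z in camera coords (m)
        "rotation_y": float(f[14]),                 # yaw around camera Y (rad)
    }

line = "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
obj = parse_kitti_label_line(line)
```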

The calibration files contain four projection matrices (P0-P3 for each camera) and two transformation matrices (Tr_velo_to_cam for Velodyne LiDAR to camera 0, and Tr_imu_to_velo for IMU to Velodyne). The projection matrices are 3x4 and encode both camera intrinsics and the rigid transformation from the reference camera (camera 0) to each target camera. To project a LiDAR point to the left color image (camera 2): first apply Tr_velo_to_cam to transform from Velodyne to camera 0 coordinates, then left-multiply by the rectification matrix R0_rect (a 3x3 matrix, usually padded to 4x4 for homogeneous chaining), and finally by P2 (3x4) to get homogeneous pixel coordinates. This multi-step projection pipeline is the most implementation-intensive aspect of working with KITTI format.
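The projection chain can be sketched in a few lines of NumPy. The calibration matrices below are synthetic stand-ins (an idealized Velodyne-to-camera rotation and a pinhole P2); real values come from each frame's calib file:

```python
import numpy as np

def project_velo_to_image(pts_velo, Tr_velo_to_cam, R0_rect, P2):
    """Project (N, 3) Velodyne points to (N, 2) pixel coordinates."""
    n = pts_velo.shape[0]
    pts_h = np.hstack([pts_velo, np.ones((n, 1))])      # homogeneous (N, 4)
    Tr = np.vstack([Tr_velo_to_cam, [0, 0, 0, 1]])      # pad 3x4 -> 4x4
    R0 = np.eye(4)
    R0[:3, :3] = R0_rect                                # pad 3x3 -> 4x4
    cam = R0 @ Tr @ pts_h.T                             # rectified camera coords (4, N)
    img = P2 @ cam                                      # homogeneous pixels (3, N)
    return (img[:2] / img[2]).T                         # perspective divide -> (N, 2)

# Illustrative calibration: Velodyne (X-fwd, Y-left, Z-up) -> camera
# (X-right, Y-down, Z-fwd), identity rectification, pinhole intrinsics.
Tr_velo_to_cam = np.array([[0., -1., 0., 0.],
                           [0., 0., -1., 0.],
                           [1., 0., 0., 0.]])
R0_rect = np.eye(3)
P2 = np.array([[700., 0., 600., 0.],
               [0., 700., 180., 0.],
               [0., 0., 1., 0.]])

# A point 10 m straight ahead lands at the principal point (600, 180).
uv = project_velo_to_image(np.array([[10.0, 0.0, 0.0]]),
                           Tr_velo_to_cam, R0_rect, P2)
```

In practice you would also discard points with non-positive depth (behind the camera) before the perspective divide.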

Frameworks and Models Using KITTI Format

OpenPCDet

OpenMMLab's point cloud detection framework with KITTI as a primary benchmark, supporting PointPillars, SECOND, PV-RCNN, and CenterPoint.

SECOND

Sparse Embedded Convolutional Detection network, one of the first efficient voxel-based 3D detectors, designed for KITTI-format data.

PointPillars

Real-time 3D object detection using pillar-based point cloud encoding, offering a strong speed-accuracy tradeoff on the KITTI benchmark.

PV-RCNN / PV-RCNN++

Point-voxel feature fusion for 3D detection, long among the top entries on the KITTI 3D detection leaderboard thanks to its combination of voxel and point-level features.

MMDetection3D

OpenMMLab's 3D detection toolbox with comprehensive KITTI data loaders, evaluation, and visualization for 20+ detection models.

CenterPoint

Center-based 3D detection and tracking model, evaluated on both KITTI and nuScenes with KITTI-format data loading support.

Reading and Writing KITTI Format Data in Python

Reading KITTI point clouds is straightforward: points = np.fromfile('000000.bin', dtype=np.float32).reshape(-1, 4) yields an (N, 4) array of [x, y, z, reflectance]. Images are standard PNG files loadable with cv2.imread() or PIL.Image.open(). Label files are parsed line-by-line, splitting on whitespace: each line produces a dictionary with type, truncated, occluded, alpha, bbox (4 floats), dimensions (3 floats), location (3 floats), and rotation_y. The calibration file contains key-value lines like P2: followed by 12 floats representing the 3x4 projection matrix.
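The reading steps above can be sketched end to end. This example writes a tiny synthetic .bin file so it is self-contained; the file name and calibration values are illustrative:

```python
import os
import tempfile
import numpy as np

# Round-trip a KITTI point cloud: contiguous float32 [x, y, z, reflectance].
pts = np.array([[12.3, -1.5, 0.2, 0.31],
                [40.0,  3.2, 1.1, 0.08]], dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "000000.bin")
pts.tofile(path)

loaded = np.fromfile(path, dtype=np.float32).reshape(-1, 4)

# Calibration lines look like "P2: v0 v1 ... v11" (12 floats, row-major 3x4).
calib_line = "P2: 700 0 600 0 0 700 180 0 0 0 1 0"
key, vals = calib_line.split(":", 1)
P2 = np.array(vals.split(), dtype=np.float64).reshape(3, 4)
```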

Writing KITTI-format data requires careful attention to coordinate conventions. Point clouds must be in the Velodyne coordinate frame (X-forward, Y-left, Z-up) and written as contiguous float32 binary: points.astype(np.float32).tofile('000000.bin'). Labels must use the camera 2 coordinate frame (X-right, Y-down, Z-forward), which means applying the Velodyne-to-camera transformation to any LiDAR-centric annotations before writing. The 2D bounding box must be the tight axis-aligned enclosure of the projected 3D box corners in image space, and the alpha angle is the observation angle (not the global yaw), computed as alpha = rotation_y - arctan2(location_x, location_z).
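Composing a label line can be sketched as follows. This is a minimal helper with illustrative input values; a real exporter would first transform LiDAR-centric annotations into camera 2 coordinates:

```python
import numpy as np

def make_label_line(cls, loc, dims, rotation_y, bbox,
                    truncated=0.0, occluded=0):
    """Build one 15-field KITTI label line from camera-2-frame values."""
    x, y, z = loc        # bottom-face center in camera coords (m)
    h, w, l = dims       # height, width, length (m)
    # Observation angle: global yaw minus the viewing-direction angle.
    alpha = rotation_y - np.arctan2(x, z)
    fields = [cls, f"{truncated:.2f}", str(occluded), f"{alpha:.2f}",
              *(f"{v:.2f}" for v in bbox),
              f"{h:.2f}", f"{w:.2f}", f"{l:.2f}",
              f"{x:.2f}", f"{y:.2f}", f"{z:.2f}", f"{rotation_y:.2f}"]
    return " ".join(fields)

line = make_label_line("Car", (-0.65, 1.71, 46.70), (1.65, 1.67, 3.64),
                       -1.59, (587.01, 173.33, 614.12, 200.12))
```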

The KITTI evaluation toolkit (devkit provided by the benchmark organizers) computes Average Precision (AP) at three difficulty levels: Easy (minimum bbox height 40px, max occlusion 0, max truncation 0.15), Moderate (25px, occlusion 1, truncation 0.30), and Hard (25px, occlusion 2, truncation 0.50). For 3D detection, the IoU threshold is 0.7 for cars and 0.5 for pedestrians and cyclists. The evaluation computes both BEV (Bird's Eye View) AP and 3D AP. Many 3D detection papers report the 40-point interpolated AP (R40) rather than the original 11-point interpolated AP (R11), a change introduced by the KITTI benchmark in 2019 that produced different numerical values for the same detections.
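The difficulty assignment rules translate directly into code. This is a simplified sketch of the devkit's logic (the function name is ours, and real evaluation also handles don't-care regions and per-class minimum heights):

```python
def kitti_difficulty(bbox_height_px, occluded, truncated):
    """Assign the KITTI difficulty level using the benchmark thresholds."""
    if bbox_height_px >= 40 and occluded <= 0 and truncated <= 0.15:
        return "Easy"
    if bbox_height_px >= 25 and occluded <= 1 and truncated <= 0.30:
        return "Moderate"
    if bbox_height_px >= 25 and occluded <= 2 and truncated <= 0.50:
        return "Hard"
    return "Ignored"  # too small, occluded, or truncated to evaluate
```

Note that the levels are cumulative: an Easy object also counts for the Moderate and Hard evaluations.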

When to Use KITTI Format vs Alternatives

KITTI is the most widely supported format in open-source 3D detection code, but newer formats offer more features.

| Format | Best For | Temporal Support | Multi-sensor | Open-source Support |
| --- | --- | --- | --- | --- |
| KITTI | 3D detection, stereo, optical flow | Sequences (separate split) | Stereo + LiDAR + IMU | Excellent (most frameworks) |
| nuScenes | Multi-sensor temporal driving | Native (linked samples) | 6 cameras + LiDAR + radar | Good (growing) |
| Waymo Open | Large-scale diverse driving | Native (segments) | 5 cameras + 5 LiDARs | Moderate (protobuf) |
| Argoverse 2 | HD maps + 3D detection | Native (sequences) | 7 cameras + 2 LiDARs | Good (av2 SDK) |
| ONCE | One-million-scene 3D detection | Native | Camera + LiDAR | Moderate |

Converting from Other Formats

| Source Format | Tool / Library | Complexity | Notes |
| --- | --- | --- | --- |
| nuScenes | nuscenes-devkit export_kitti() | Moderate | Convert relational schema to KITTI file structure; handles coordinate frame transformation and calibration matrix generation. |
| Waymo Open | waymo_open_dataset / custom script | Moderate | Extract camera images and LiDAR from protobuf TFRecords, write KITTI-compatible files with calibration mapping. |
| ROS bag (camera + LiDAR) | Custom Python (rosbags + cv2) | Moderate | Extract synchronized camera and LiDAR topics, compute calibration from TF tree, write to KITTI directory structure. |
| Custom sensors | Custom Python | Trivial | Write images as PNG and point clouds as binary float32 following KITTI naming and coordinate conventions. |
| Argoverse 2 | av2 SDK + custom script | Moderate | Extract per-frame LiDAR sweeps and camera images, transform annotations from global to camera frame. |

KITTI Format Beyond Autonomous Driving

Although KITTI format was designed for autonomous driving, its simplicity has led to widespread adoption in other robotics domains. Indoor navigation systems use KITTI format for RGB-D SLAM evaluation by substituting depth camera data for LiDAR point clouds. Agricultural robotics teams use the label format for 3D crop detection with modified category names. Warehouse robotics systems adopt KITTI format for forklift and pallet detection because the extensive ecosystem of pre-trained 3D detectors (PointPillars, SECOND, CenterPoint) can be fine-tuned on domain-specific KITTI-format data with minimal code changes.

The KITTI format's main limitation is the lack of native temporal linking between frames. In the original KITTI benchmark, tracking sequences are stored in a separate split with consecutive frame indices, but there is no explicit mechanism to link annotations across time (no tracking IDs in the detection label format, though the tracking split adds them). Newer formats like nuScenes address this with linked sample tokens, and Argoverse 2 uses unique track UUIDs. For teams that need temporal consistency in KITTI format, the standard workaround is to add a track_id column to the label files (position 16), which most KITTI-compatible loaders ignore but can be parsed by custom code.
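The track_id workaround described above is easy to support in a custom loader: treat the 16th field as optional. A minimal sketch (the sample lines use illustrative values):

```python
def parse_track_id(line):
    """Return the optional 16th-field track_id, or None if absent.

    Standard KITTI detection loaders consume only the first 15 fields,
    so lines with or without the extra field remain compatible.
    """
    fields = line.split()
    return int(fields[15]) if len(fields) >= 16 else None

with_id = ("Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 "
           "1.65 1.67 3.64 -0.65 1.71 46.70 -1.59 7")
without_id = with_id.rsplit(" ", 1)[0]
```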

Despite these limitations, KITTI format remains the most broadly supported format in 3D detection research as of 2026. The OpenPCDet framework alone supports over 20 different detection architectures that all consume KITTI-format data, and nearly every new 3D detection paper includes KITTI benchmark results. This means that training a model on KITTI-format data gives you access to the largest collection of pre-trained checkpoints, training recipes, and community-maintained codebases of any 3D detection format.

References

  1. Geiger et al., "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite," CVPR 2012.
  2. Geiger et al., "Vision meets Robotics: The KITTI Dataset," IJRR 2013.
  3. Lang et al., "PointPillars: Fast Encoders for Object Detection from Point Clouds," CVPR 2019.
  4. Yan et al., "SECOND: Sparsely Embedded Convolutional Detection," Sensors 2018.
  5. Shi et al., "PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection," CVPR 2020.

Claru Data Delivery in KITTI Format

Claru delivers data in KITTI format with complete calibration files ensuring geometric consistency between camera images, LiDAR point clouds, and 3D annotations. All labels follow the standard 15-field format with accurate 3D bounding boxes, occlusion levels, and truncation estimates. Point clouds are delivered as binary float32 files in the Velodyne coordinate frame with reflectance values, and calibration matrices are computed from rigorous multi-target camera-LiDAR calibration procedures.

For teams migrating from or benchmarking against the original KITTI dataset, Claru maintains strict format compatibility verified by running the official KITTI evaluation toolkit against delivered data. For applications beyond standard automotive categories, we define custom class taxonomies (e.g., pallet, forklift, shelf for warehouse robotics) while maintaining the same label file structure. Large deliveries include pre-computed train/val splits following KITTI's Moderate difficulty distribution, and we provide OpenPCDet-compatible configuration files for immediate training with any of the 20+ supported detection architectures.

Frequently Asked Questions

Is KITTI format still worth using for new projects?

Yes. While newer formats like nuScenes and Waymo Open offer richer annotation schemas and temporal linking, KITTI format remains the most widely supported format in open-source 3D detection frameworks. OpenPCDet, MMDetection3D, and dozens of individual model repositories all provide KITTI data loaders. Nearly every 3D detection paper published at top venues still includes KITTI benchmark results. For teams that want to leverage the broadest possible ecosystem of pre-trained models and training recipes, KITTI format is the pragmatic choice.

What features does KITTI format lack compared to newer formats?

KITTI format lacks native support for temporal sequences (no built-in frame linking or tracking IDs in the detection split), multi-sweep LiDAR accumulation (each frame is a single 360-degree scan), rich attribute annotations (no pedestrian posture, vehicle state, etc.), and map data (no lane markings or drivable area). The format also uses camera-centric coordinates for labels, which requires coordinate transformations when working with LiDAR-centric detection models. For applications requiring these features, nuScenes or Argoverse 2 formats are preferred.

How does KITTI represent 3D bounding boxes?

KITTI uses 7-DoF boxes parameterized as (x, y, z, h, w, l, ry): the 3D location (x, y, z) in camera 2 coordinates where Y points down, dimensions in meters (height, width, length), and yaw rotation (ry) around the Y-axis in radians. The (x, y, z) location refers to the bottom face center of the box, not the geometric center. This means y_center = y - h/2 if you need the true 3D center. Only yaw rotation is annotated; roll and pitch are assumed to be zero (objects are on a flat ground plane).
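A minimal sketch of recovering the geometric center and the 8 corners of a KITTI 7-DoF box in camera coordinates (Y down, location = bottom-face center); the function names are ours:

```python
import numpy as np

def box_center(x, y, z, h):
    """Geometric center sits half the height above the bottom face (Y down)."""
    return (x, y - h / 2, z)

def kitti_box_corners(x, y, z, h, w, l, ry):
    """Return the (8, 3) corners of a KITTI box in camera coordinates."""
    # Corners in the object frame, origin at the bottom-face center.
    xs = np.array([l, l, -l, -l, l, l, -l, -l]) / 2
    ys = np.array([0., 0., 0., 0., -h, -h, -h, -h])  # roof is at y - h
    zs = np.array([w, -w, -w, w, w, -w, -w, w]) / 2
    # Rotation by ry around the (downward-pointing) Y axis.
    R = np.array([[np.cos(ry), 0., np.sin(ry)],
                  [0., 1., 0.],
                  [-np.sin(ry), 0., np.cos(ry)]])
    corners = R @ np.vstack([xs, ys, zs]) + np.array([[x], [y], [z]])
    return corners.T

corners = kitti_box_corners(0.0, 0.0, 0.0, 2.0, 1.0, 4.0, 0.0)
```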

How do I project LiDAR points onto the camera image?

The projection requires three matrices from the calibration file: Tr_velo_to_cam (Velodyne-to-camera-0 transform, padded to 4x4), R0_rect (3x3 rectification matrix, padded to 4x4), and P2 (3x4 projection for camera 2). For a LiDAR point p = [x, y, z, 1]^T in Velodyne coordinates: p_cam = R0_rect @ Tr_velo_to_cam @ p, then p_img = P2 @ p_cam, and finally pixel coordinates are (p_img[0]/p_img[2], p_img[1]/p_img[2]). Filter out points behind the camera (p_cam[2] < 0) and outside image bounds before visualization.

Can I use KITTI format with indoor or non-automotive sensors?

Yes, with some adaptations. Replace the Velodyne LiDAR data with depth camera point clouds (converted to the same binary float32 format), use your camera calibration for the projection matrices, and define indoor-specific category names in the label files. The KITTI format's simplicity (flat files, text labels, standard images) makes it easy to adapt to any camera+depth sensor setup. Several indoor 3D detection projects have successfully used KITTI format for warehouse, factory, and domestic robotics applications.

Get Data in KITTI Format

Claru delivers robotics data in KITTI format with accurate calibration, 3D annotations, and compatibility with 20+ open-source detection frameworks. Tell us your requirements.