Egocentric Agricultural Video Dataset
First-person video from agricultural settings for training harvesting robots and crop monitoring AI. 35K+ clips across 12+ crop types with dense manipulation and crop-state annotations.
Why Egocentric Agricultural Data Matters for Robotics
Agricultural robotics is undergoing a fundamental transition from GPS-guided, route-following machines to dexterous manipulators that must perceive individual plants, assess ripeness, and execute precise harvesting motions. This transition demands training data captured from the robot's own perspective -- egocentric video that faithfully reproduces the visual complexity of real crop canopies, variable lighting under leaf cover, and the fine-grained hand-object interactions involved in picking, pruning, and sorting.
Existing academic agricultural datasets are overwhelmingly aerial or third-person, designed for remote-sensing tasks like NDVI mapping or field-level yield estimation. They do not capture the close-range, manipulation-centric viewpoint that a harvesting robot arm or weeding end-effector actually operates from. Claru's egocentric agricultural dataset fills this gap with first-person video recorded by experienced farm workers performing genuine harvesting, thinning, grafting, and inspection tasks across 12+ crop types and growing seasons.
The dataset captures the full spectrum of agricultural manipulation challenges: soft-body deformation of fruit during grasping, partial occlusion by leaves and branches, rapid lighting changes as the collector moves through row crops, and the diverse hand postures required for different crop geometries. These visual conditions are critical for training vision-language-action models that must generalize across crop varieties and maturity stages.
Research from ICRA 2024 and CoRL 2023 consistently demonstrates that robot policies trained on in-domain egocentric data outperform those trained on third-person or synthetic data by 25-40% on agricultural manipulation benchmarks. Domain-specific training data is not optional for agricultural robots -- it is the primary bottleneck to field deployment.
Sensor Configuration and Collection Methodology
Collection rigs use GoPro HERO12 cameras (5.3K capable, captured at 1080p/30fps for bandwidth efficiency) mounted on chest harnesses that position the lens at the approximate height and angle of a typical robotic arm end-effector. Depth data is captured simultaneously via Intel RealSense D455 modules co-mounted on the harness, providing aligned RGB-D pairs with depth accuracy within 2% at the typical 0.3-1.5m working distance for crop manipulation.
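For reference, a minimal capture-side sketch of how aligned RGB-D pairs can be read from a RealSense D455 with the pyrealsense2 SDK. The stream resolutions and frame rates shown are illustrative placeholders, not the exact rig settings.

```python
import numpy as np
import pyrealsense2 as rs

# Illustrative D455 configuration -- resolutions and frame rates are assumptions,
# not the exact settings used on the collection rig.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 848, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
profile = pipeline.start(config)

# Align depth frames to the color camera so RGB and depth share a pixel grid.
align = rs.align(rs.stream.color)
depth_scale = profile.get_device().first_depth_sensor().get_depth_scale()

try:
    frames = pipeline.wait_for_frames()
    aligned = align.process(frames)
    depth_m = np.asanyarray(aligned.get_depth_frame().get_data()) * depth_scale  # meters
    color = np.asanyarray(aligned.get_color_frame().get_data())                  # BGR uint8
finally:
    pipeline.stop()
```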
Each collection session lasts 45-90 minutes and covers a complete agricultural workflow: walking to the work area, selecting target plants, executing manipulation tasks (picking, cutting, sorting), and transitioning between rows or plots. Collectors are trained agricultural workers -- not actors -- performing genuine tasks at natural pace. This ensures the motion profiles, gaze patterns, and object interactions in the data reflect real-world task dynamics rather than scripted approximations.
Environmental metadata is recorded for every session: GPS coordinates, weather conditions (temperature, humidity, cloud cover, wind speed), time of day, crop variety and growth stage, days since last irrigation, and soil type. This metadata enables researchers to condition models on environmental factors and study how manipulation strategies should adapt to changing conditions.
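An illustrative per-session metadata record covering the fields listed above; the field names and structure are hypothetical and do not reflect the delivered schema.

```python
# Hypothetical per-session environmental metadata record (field names illustrative only).
session_metadata = {
    "session_id": "example-session-0001",
    "gps": {"lat": 36.77, "lon": -119.42},        # collection site coordinates
    "weather": {
        "temperature_c": 24.5,
        "humidity_pct": 61,
        "cloud_cover_pct": 20,
        "wind_speed_m_s": 2.1,
    },
    "time_of_day": "08:15",
    "crop": {"variety": "strawberry", "growth_stage_bbch": 87},
    "days_since_irrigation": 2,
    "soil_type": "sandy loam",
}
```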
Camera calibration is performed at the start of each collection day using a ChArUco board pattern, with intrinsic parameters verified against factory calibration. RGB-D temporal alignment is maintained within 5ms through hardware triggering. All sensor streams are synchronized to a common GPS-disciplined clock to enable multi-modal fusion during training.
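As a rough sketch of what the 5 ms RGB-D alignment budget implies on the consumer side, the helper below pairs each RGB timestamp with its nearest depth timestamp and flags pairs that exceed the tolerance. The function name and the millisecond timestamp convention are assumptions for illustration.

```python
import numpy as np

def check_rgbd_sync(rgb_timestamps, depth_timestamps, tolerance_ms=5.0):
    """Pair each RGB frame with the nearest depth frame (timestamps in ms from
    the common GPS-disciplined clock) and flag pairs exceeding the tolerance."""
    rgb = np.asarray(rgb_timestamps, dtype=np.float64)
    depth = np.asarray(depth_timestamps, dtype=np.float64)
    # Index of the nearest depth timestamp for every RGB timestamp.
    idx = np.clip(np.searchsorted(depth, rgb), 1, len(depth) - 1)
    nearest = np.where(np.abs(depth[idx - 1] - rgb) <= np.abs(depth[idx] - rgb),
                       depth[idx - 1], depth[idx])
    gaps_ms = np.abs(nearest - rgb)
    return gaps_ms, gaps_ms <= tolerance_ms

# Example: three RGB frames at 30 fps checked against the depth stream.
gaps, within_budget = check_rgbd_sync([0.0, 33.3, 66.7], [1.2, 34.0, 70.5])
```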
Comparison with Public Datasets
How Claru's egocentric agricultural dataset compares to publicly available alternatives for agricultural robotics training.
| Dataset | Clips / Images | Hours | Modalities | Environments | Annotations |
|---|---|---|---|---|---|
| Agriculture-Vision (CVPR 2020) | 94K images | N/A (stills) | RGB, NIR | US farmland (aerial) | 9 field patterns |
| CropAndWeed (2023) | ~8K images | N/A (stills) | RGB | European fields | Crop/weed segmentation |
| MinneApple (2019) | ~1K images | N/A (stills) | RGB | Apple orchards | Fruit detection, counting |
| Claru Egocentric Ag | 35K+ clips | 250+ | RGB-D, IMU | Greenhouses, fields, orchards, vineyards (12+ crop types) | Actions, grasps, hand-object interactions, crop state, manipulation readiness |
Annotation Pipeline and Quality Assurance
Annotation follows a three-stage pipeline combining automated pre-labeling with expert human review. Stage one applies foundation models: DINOv2 for crop-vs-background segmentation, SAM2 for instance-level fruit and plant part masks, and DepthAnything V2 for monocular depth estimation on frames where the RealSense depth is incomplete due to transparent or reflective surfaces (common with wet leaves and shiny fruit).
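A sketch of the depth-completion step described above: where the RealSense map has holes (zero pixels), fall back to monocular estimates. The model ID, the affine fit in inverse-depth space, and the hole-filling strategy are illustrative assumptions, not the production pipeline.

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Monocular depth via the Hugging Face depth-estimation pipeline; the model ID
# is an assumption for illustration.
depth_estimator = pipeline("depth-estimation",
                           model="depth-anything/Depth-Anything-V2-Small-hf")

def fill_depth_holes(rgb: Image.Image, hw_depth_m: np.ndarray) -> np.ndarray:
    # Relative monocular depth, resized to the hardware depth resolution.
    mono = np.asarray(depth_estimator(rgb)["depth"], dtype=np.float32)
    mono = np.asarray(Image.fromarray(mono).resize(
        (hw_depth_m.shape[1], hw_depth_m.shape[0]), Image.Resampling.BILINEAR))

    valid = hw_depth_m > 0
    # Monocular predictions are inverse-depth-like, so fit them affinely to
    # 1/depth on pixels where hardware depth exists (a crude calibration).
    A = np.stack([mono[valid], np.ones(valid.sum())], axis=1)
    a, b = np.linalg.lstsq(A, 1.0 / hw_depth_m[valid], rcond=None)[0]

    fused = hw_depth_m.astype(np.float32).copy()
    inv = np.clip(a * mono[~valid] + b, 1e-6, None)
    fused[~valid] = 1.0 / inv   # fill holes with calibrated monocular estimates
    return fused
```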
Stage two involves trained agricultural annotators -- agronomists and experienced farm workers -- who correct automated labels and add domain-specific annotations: crop maturity stage (using BBCH scale coding), disease indicators, pest damage classification, and manipulation readiness scores. Hand-object interaction annotations follow the EPIC-KITCHENS-style contact frame protocol adapted for agricultural tools: secateurs, picking bags, grafting knives, and sorting trays.
Stage three is quality assurance. Every annotated clip is reviewed by a second annotator, with disagreements resolved by a domain expert. Inter-annotator agreement targets: 96%+ for action boundary placement (within 0.3 seconds), 92%+ IoU for instance segmentation masks, and 94%+ for crop maturity classifications. Clips failing QA thresholds are re-annotated from scratch rather than patched.
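A sketch of how the two headline agreement metrics could be computed between a pair of annotators, assuming binary instance masks and boundary times in seconds; the helper names are illustrative, and the thresholds mirror the targets above.

```python
import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """IoU between two binary instance masks (QA target: >= 0.92)."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union else 1.0

def boundary_agreement(bounds_a, bounds_b, tol_s: float = 0.3) -> float:
    """Fraction of action boundaries (times in seconds) that two annotators
    place within tol_s of each other (QA target: >= 0.96)."""
    a = np.asarray(bounds_a, dtype=float)
    b = np.asarray(bounds_b, dtype=float)
    return float(np.mean(np.abs(a - b) <= tol_s))
```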
The complete annotation taxonomy covers 85+ action verbs specific to agricultural manipulation (reach, grasp, twist-pull, cut, strip, sort, place, inspect), 40+ object categories (fruit varieties at different maturity stages, leaves, branches, stems, tools, containers), and 12 crop-state attributes (ripeness, size, color, firmness, damage type, pest presence).
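An illustrative temporally-aligned action segment combining this taxonomy with the stage-two crop-state attributes; the field names are hypothetical and do not reflect the exact export schema.

```python
# Hypothetical action-segment record (field names are illustrative only).
segment = {
    "clip_id": "example-clip-0001",
    "start_s": 12.4,
    "end_s": 15.1,
    "verb": "twist-pull",             # one of the 85+ manipulation verbs
    "object": "strawberry",           # one of the 40+ object categories
    "hand": "right",
    "contact_frame": 381,             # EPIC-KITCHENS-style contact annotation
    "crop_state": {                   # subset of the 12 crop-state attributes
        "maturity_bbch": 87,
        "firmness": "firm",
        "damage_type": None,
        "pest_presence": False,
    },
}
```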
Use Cases
Robotic Harvesting Policies
Training manipulation policies for fruit and vegetable harvesting robots. Egocentric grasp demonstrations across crop geometries train vision-language-action models like RT-2, OpenVLA, and Octo to select pick points, plan approach trajectories, and execute compliant grasps on deformable produce.
Crop Health and Maturity Assessment
Close-range visual assessment of individual plant health from the manipulator's perspective. Models learn to classify ripeness, detect early-stage disease symptoms, and estimate yield at the plant level -- capabilities needed for selective harvesting and precision treatment.
Agricultural World Models
Training video prediction and scene dynamics models for agricultural environments. Predicting how canopy structure changes during interaction (branch deflection, leaf movement, fruit detachment) enables model-predictive control for gentle harvesting. Example architectures: UniSim, DayDreamer, TD-MPC2.
Key References
- [1] Grauman et al. “Ego4D: Around the World in 3,000 Hours of Egocentric Video.” CVPR 2022.
- [2] Chiu et al. “Agriculture-Vision: A Large Aerial Image Database for Agricultural Pattern Analysis.” CVPR 2020.
- [3] Brohan et al. “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.” CoRL 2023.
- [4] Arad et al. “Development of a Sweet Pepper Harvesting Robot.” Journal of Field Robotics, 2020.
How Claru Delivers This Data
Claru's distributed collector network spans agricultural regions across North America, Europe, and Southeast Asia, enabling collection across growing seasons, climate zones, and crop varieties that no single research farm can replicate. Unlike academic datasets limited to one orchard or a single crop type, Claru captures the full diversity of conditions that agricultural robots encounter in commercial deployment: greenhouses, open fields, orchards, vineyards, polytunnels, and hydroponic facilities.
Custom collection campaigns can target specific crop types (e.g., strawberry picking only, or citrus across three maturity stages), particular manipulation tasks (pruning vs. harvesting vs. thinning), or environmental conditions (rain, low-light, post-harvest debris). Turnaround from campaign specification to annotated delivery is typically 4-6 weeks for standard volumes.
Data is delivered in your preferred format -- RLDS, HDF5, WebDataset, LeRobot, or custom schemas -- with all format conversion handled by Claru's pipeline at no additional cost. Annotation exports include per-frame JSON, COCO-format instance annotations, and temporally-aligned action segment files compatible with standard activity recognition frameworks.
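As one example of consuming a delivery, the snippet below reads a clip from an HDF5 export with h5py. The file name, group layout, and dataset names are assumptions for illustration, not Claru's actual export structure.

```python
import json
import h5py

# Hypothetical layout: one group per clip holding RGB, depth, IMU, and a JSON
# annotation blob. Names and structure are illustrative only.
with h5py.File("claru_ag_sample.h5", "r") as f:
    clip = f["clips/example-clip-0001"]
    rgb = clip["rgb"][:]      # (T, H, W, 3) uint8 frames
    depth = clip["depth"][:]  # (T, H, W) float32 meters
    imu = clip["imu"][:]      # (T, 6) accelerometer + gyroscope
    annotations = json.loads(clip.attrs["annotations_json"])
```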
Frequently Asked Questions
Which crop types are covered?
12+ crop types including strawberries, tomatoes, peppers, apples, grapes, citrus, lettuce, herbs, cucumbers, blueberries, cherries, and specialty greenhouse crops. Each crop includes multiple varieties and maturity stages captured across growing seasons.
How does egocentric data differ from aerial agricultural datasets?
Aerial datasets capture field-level patterns from above for remote-sensing tasks. Egocentric agricultural data captures the close-range, manipulation-centric perspective that harvesting and inspection robots actually operate from -- showing hand-object interactions, individual plant structures, and the visual conditions at 0.3-1.5m working distance.
Can collection campaigns be customized?
Yes. Custom campaigns can target specific crops, manipulation tasks (harvesting, pruning, thinning, grafting), growth stages, or environmental conditions. Contact us with your requirements for scoping and timeline.
What annotation formats are delivered?
Annotations are delivered as per-frame JSON, COCO-format instance masks, and temporal action segments (compatible with EPIC-KITCHENS tooling), and can optionally be embedded in RLDS/HDF5 exports. Custom annotation schemas can be accommodated on request.
Is depth data included?
Yes. Intel RealSense D455 depth is aligned and synchronized with RGB at 30fps. Monocular depth estimates from DepthAnything V2 supplement hardware depth where stereo matching fails (wet leaves, transparent surfaces).
Request a Sample Pack
Get a curated sample of egocentric agricultural video with crop manipulation annotations to evaluate for your harvesting or monitoring project.