Egocentric Video Datasets: First-Person Video Data for Embodied AI
Robots see the world from their own perspective. Training data should match that perspective. Claru provides commercially licensed egocentric video with multi-layer annotations — 500,000+ clips from 10,000+ contributors in 100+ cities, enriched with depth, segmentation, pose, and action labels.
What Is Egocentric Video and Why It Matters for AI
Egocentric (first-person) video is recorded from the perspective of the person performing an activity. A camera worn on the head, chest, or wrist captures what the wearer sees — the same viewpoint a robot's onboard camera would have.
This viewpoint distinction is not cosmetic. It is architecturally fundamental for several reasons:
- Viewpoint matching. A visuomotor policy trained on third-person video must learn an implicit viewpoint transformation before it can predict actions from first-person observations. Egocentric training data eliminates this learning burden.
- Attention signal. In egocentric video, the camera naturally points where the person is looking and reaching. This implicit attention signal is free information for models learning which objects and regions are task-relevant.
- Hand-object interaction. Egocentric cameras capture detailed views of hand-object interactions — precisely the manipulation information robotics models need to learn grasping, tool use, and assembly.
- Occlusion patterns. The occlusion patterns in egocentric video (hands occluding objects, objects occluding each other from the manipulator's perspective) match the occlusion patterns a deployed robot will encounter.
As the robotics field has converged on visuomotor policies (models that directly map visual observations to motor commands), egocentric video has become the most important single data modality for robot learning.
Egocentric Video Datasets: Academic vs. Commercial
Several academic egocentric datasets exist. Here is how they compare to Claru's commercially licensed collection.
| Feature | Ego4D | EPIC-KITCHENS | Claru |
|---|---|---|---|
| Total Hours | 3,670 hours | 100+ hours | 10,000+ hours (growing) |
| Clip Count | ~9,600 videos | ~90,000 clips | 500,000+ clips |
| Contributors | 923 participants | 45 participants | 10,000+ contributors |
| Geographic Spread | 9 countries | 4 cities | 100+ cities, 14+ countries |
| Environment Types | Mixed (daily activities) | Kitchens only | 12+ categories (kitchen, warehouse, workshop, retail, outdoor, etc.) |
| Commercial License | No (research only) | No (research only) | Yes (full commercial rights) |
| Custom Collection | No (fixed dataset) | No (fixed dataset) | Yes (task-specific, on demand) |
| Depth Annotations | Limited | No | Yes (monocular depth on all clips) |
| Segmentation | Partial (benchmark subsets) | Partial (object detections) | Yes (semantic + instance on all clips) |
| Hand Pose | Partial (benchmark subsets) | No | Yes (2D + 3D hand pose on all clips) |
| Freshness | Fixed (2022 collection) | Fixed (2018-2023 collection) | Continuously updated (new data weekly) |
Academic datasets like Ego4D and EPIC-KITCHENS have been invaluable for advancing egocentric vision research. They provide standardized benchmarks that enable fair comparison across methods.
However, they were not designed for commercial production use. Licensing restrictions prevent use in proprietary training pipelines. Fixed datasets cannot be expanded to cover new environments or tasks. Annotation coverage is limited to specific benchmark tasks rather than the full enrichment stack ML teams need.
Claru provides the commercial alternative: fully licensed egocentric video that can be expanded, customized, and enriched to the exact specifications your training pipeline requires.
Types of Egocentric Video Data
Claru collects egocentric video across diverse activity categories to support different physical AI use cases.
Kitchen and Food Preparation
120K+ clips. Cooking, food prep, dishwashing, kitchen organization. Covers 200+ kitchen layouts with diverse cookware, ingredients, and appliance types. Critical for household robotics and restaurant automation research.
Workshop and Repair
80K+ clips. Carpentry, electronics repair, sewing, phone repair, small engine work. Captures fine-grained tool use and multi-step assembly operations. Essential for training dexterous manipulation policies.
Warehouse and Logistics
60K+ clips. Picking, packing, shelving, inventory management, cart operations. Collected in real commercial warehouses with authentic product variety and bin configurations. Directly applicable to warehouse robotics.
Retail and Commercial
45K+ clips. Shopping, product interaction, checkout operations, store navigation. Captures human behavior in commercial spaces for service robotics and retail automation applications.
Outdoor and Navigation
90K+ clips. Sidewalk walking, park navigation, urban traversal, transit use. Provides the visual and locomotion data needed for mobile robots, delivery systems, and autonomous navigation.
Custom Task-Specific
On demand. Targeted collection for client-specific tasks: specific manufacturing operations, specific household routines, specific agricultural activities. Captured with custom protocols developed with the client's ML team.
Annotation Layers on Every Clip
Raw video is the starting point, not the deliverable. Claru enriches every egocentric clip with multiple annotation layers that provide the supervision signals ML models need.
Monocular Depth Estimation
Per-frame depth maps computed using state-of-the-art monocular depth models. Provides metric or relative depth at every pixel, enabling 3D scene understanding from a single camera. Depth maps are calibrated against LiDAR ground truth where available and cross-validated against segmentation boundaries for geometric consistency.
Format: 16-bit PNG or NumPy float32 arrays
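To illustrate how these deliverables plug into a training pipeline, here is a minimal Python sketch of loading a 16-bit PNG depth map; the file name and the millimeter encoding are illustrative assumptions, not a documented Claru convention:

```python
# Minimal sketch: loading a per-frame depth map delivered as 16-bit PNG.
# The file name and the millimeter scale are assumptions for illustration.
import numpy as np
from PIL import Image

depth_png = Image.open("frame_000123_depth.png")    # 16-bit grayscale PNG
depth_raw = np.asarray(depth_png, dtype=np.uint16)  # (H, W) integer depth

# Assumed encoding: depth stored in millimeters; 0 marks invalid pixels.
depth_m = depth_raw.astype(np.float32) / 1000.0
valid = depth_raw > 0

print(f"median depth: {np.median(depth_m[valid]):.2f} m")
```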
Semantic and Instance Segmentation
Per-pixel labels with 100+ object categories (furniture, appliances, food items, tools, containers, surfaces) plus instance IDs distinguishing individual objects. Enables models to identify what objects are present, where they are, and which specific instance is being interacted with.
Format: Indexed PNG masks or NumPy uint16 arrays
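A minimal sketch of combining the semantic and instance masks for one frame; the file names and the class index used are illustrative assumptions:

```python
# Minimal sketch: reading semantic and instance masks for one frame.
import numpy as np
from PIL import Image

semantic = np.asarray(Image.open("frame_000123_semantic.png"))  # (H, W) class indices
instance = np.load("frame_000123_instance.npy")                 # (H, W) uint16 instance IDs

# Example query: pixels belonging to a hypothetical "cup" class (index 17 assumed).
CUP_CLASS = 17
cup_pixels = semantic == CUP_CLASS
cup_instance_ids = np.unique(instance[cup_pixels])
print(f"{cup_instance_ids.size} cup instances in frame")
```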
Human and Hand Pose Estimation
Full-body 2D and 3D joint positions (17+ keypoints) plus detailed hand articulation (21 keypoints per hand). Critical for understanding manipulation: which fingers are in contact with which object, what grasp type is being used, what is the hand trajectory during a reaching motion.
Format: JSON keypoint arrays or COCO-format annotations
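A minimal sketch of parsing hand keypoints from a per-clip JSON file; the schema shown (field names, keypoint ordering) is an illustrative assumption rather than Claru's documented format:

```python
# Minimal sketch: parsing per-frame hand keypoints from a JSON annotation file.
# Field names and the 21-keypoint ordering are assumptions for illustration.
import json
import numpy as np

with open("clip_000042_pose.json") as f:
    pose = json.load(f)

for frame in pose["frames"]:                     # assumed: list of per-frame records
    right_hand = frame.get("right_hand_2d")      # assumed: 21 [x, y, confidence] triplets
    if right_hand is None:
        continue
    kp = np.array(right_hand, dtype=np.float32)  # (21, 3)
    wrist, index_tip = kp[0, :2], kp[8, :2]      # wrist and index fingertip (ordering assumed)
    span = np.linalg.norm(index_tip - wrist)
    print(f"frame {frame['index']}: wrist-to-index span {span:.1f} px")
```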
Optical Flow
Dense motion vectors between consecutive frames, capturing both camera motion and object motion. Optical flow provides the dynamic information that complements the static information in depth and segmentation — it reveals which parts of the scene are moving, how fast, and in what direction.
Format: Float16 flow fields in .flo or NumPy format
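A minimal sketch of loading a flow field delivered as a NumPy array and flagging moving pixels; the file name, (H, W, 2) layout, and motion threshold are assumptions:

```python
# Minimal sketch: loading a float16 flow field saved as a NumPy array
# and computing per-pixel motion magnitude.
import numpy as np

flow = np.load("frame_000123_flow.npy").astype(np.float32)  # (H, W, 2): u, v in pixels
magnitude = np.linalg.norm(flow, axis=-1)

moving = magnitude > 1.0  # assumed threshold: >1 px displacement between frames
print(f"{100 * moving.mean():.1f}% of pixels moving")
```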
AI-Generated Captions
Natural language descriptions of the activity, objects, and spatial relationships in each clip. Generated by vision-language models and validated for accuracy. Enables language-grounded learning — training models to associate visual observations with natural language instructions.
Format: UTF-8 text with per-clip and per-segment granularity
Action Boundary Labels
Temporal annotations marking the start and end of discrete actions within each clip: reach, grasp, lift, transport, place, cut, pour, stir. Labels follow a structured verb-noun taxonomy developed for robotics applications. Available on request as a custom annotation layer.
Format: JSON with timestamp ranges and verb-noun labels
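A minimal sketch of turning action boundary labels into frame ranges; the JSON schema, field names, and frame rate are illustrative assumptions:

```python
# Minimal sketch: reading action boundary labels and mapping them to frame indices.
import json

with open("clip_000042_actions.json") as f:
    actions = json.load(f)

FPS = 30  # assumed frame rate of the source clip

for segment in actions["segments"]:                   # assumed: list of labeled segments
    verb, noun = segment["verb"], segment["noun"]     # e.g. "pour", "kettle"
    start_f = int(segment["start_time"] * FPS)
    end_f = int(segment["end_time"] * FPS)
    print(f"{verb} {noun}: frames {start_f}-{end_f}")
```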
How Claru Collects Egocentric Video at Scale
Claru operates three parallel egocentric capture pipelines, each optimized for different collection scenarios:
Pipeline 1: Wearable Camera Network
10,000+ contributors worldwide are equipped with GoPro or similar wearable cameras and capture video during their regular activities. Contributors are recruited from specific demographic and occupational backgrounds to ensure environmental diversity. A barista captures coffee shop operations. A warehouse worker captures logistics activities. A home cook captures kitchen tasks. This distributed approach produces data from the true distribution of real-world environments — diversity that no lab or studio setup can replicate.
Pipeline 2: Managed Smartphone Capture
For scenarios where wearable cameras are impractical, contributors use phone-mounted cameras following specific protocols for angle, stability, and duration. This pipeline is faster to deploy (no hardware shipping) and captures complementary viewpoints.
Pipeline 3: Activity-Specific Collection
Targeted campaigns designed for specific client requirements. Example: a client needs 5,000 clips of hand-washing procedures in commercial kitchens. Claru recruits contributors from the target demographic, develops a task protocol specifying camera placement, lighting requirements, and activity sequence, and deploys the campaign. First clips are available within 48 hours. Same-day quality assurance catches issues early.
All three pipelines feed into the same enrichment stack: depth, segmentation, pose, flow, and captions are computed automatically, then validated by human annotators. The result is a continuously growing egocentric video collection with consistent annotation quality regardless of capture source.
Frequently Asked Questions
What is egocentric video data?
Egocentric video data is video recorded from a first-person perspective, typically using a wearable camera mounted on the head, chest, or wrist. Unlike third-person video (recorded from a fixed external viewpoint), egocentric video captures what the wearer sees — the same viewpoint that a robot's head or wrist camera would have. This makes egocentric data uniquely valuable for training embodied AI systems: visuomotor policies, world models, activity recognition systems, and hand-object interaction models. Egocentric video naturally captures attention (where the person looks), intention (what they reach for), and manipulation (how they grasp and use objects) in a way that third-person video cannot.
How is Claru's egocentric data different from Ego4D?
Ego4D is a large-scale academic dataset released by a consortium of universities. It provides 3,670 hours of egocentric video from 923 participants across 9 countries with standardized benchmarks. Claru's egocentric data differs in several important ways. First, commercial licensing: Ego4D restricts commercial use and requires academic affiliation, while Claru's data is fully commercially licensed for production training. Second, scale on demand: Ego4D is a fixed dataset; Claru can collect additional egocentric data on demand — specific environments, specific tasks, specific camera configurations — through a network of 10,000+ contributors. Third, enrichment depth: Claru provides 6+ annotation layers per clip (depth, segmentation, pose, optical flow, captions, action labels) as standard; Ego4D provides annotations for specific benchmark tasks. Fourth, freshness: Ego4D was collected over a fixed time period; Claru continuously collects new data, ensuring the dataset reflects current environments and objects.
What environments does Claru collect egocentric video in?
Claru collects egocentric video across 12+ environment categories: residential kitchens and living spaces, commercial kitchens and restaurants, retail stores and shopping environments, warehouses and logistics facilities, manufacturing and assembly lines, office environments, outdoor urban spaces (sidewalks, parks, transit), outdoor rural and agricultural settings, workshops (carpentry, metalwork, electronics repair), healthcare and clinical settings, gyms and fitness facilities, and vehicle interiors. Each environment category includes multiple specific locations to ensure visual diversity — different lighting conditions, layouts, object arrangements, and cultural contexts across 100+ cities worldwide.
What annotation layers are available on egocentric video?
Claru provides six standard annotation layers on egocentric video clips. Monocular depth estimation: per-frame depth maps providing 3D spatial information. Semantic segmentation: per-pixel object class labels (100+ categories). Instance segmentation: per-pixel instance IDs distinguishing individual objects of the same class. Human and hand pose estimation: 2D and 3D joint positions for full body and detailed hand articulation, critical for understanding manipulation. Optical flow: dense motion vectors between consecutive frames, capturing dynamic scene information. AI-generated captions: natural language descriptions of activities, objects, and spatial relationships in each clip. Additional custom annotation layers — action boundary labels, object affordance annotations, gaze estimation — are available on request.
How many egocentric video clips does Claru have?
Claru's egocentric video collection contains 500,000+ clips across three parallel capture pipelines. The wearable camera pipeline has produced 386,000+ clips from GoPro and similar cameras worn during real-world activities. The smartphone capture pipeline adds clips recorded from phone-mounted cameras in complementary scenarios. The activity-specific pipeline collects targeted clips for particular tasks (cooking specific recipes, performing specific assembly operations, navigating specific routes) based on client requirements. The collection is continuously growing — Claru's network of 10,000+ contributors can be deployed to collect additional data for specific environments, tasks, or scenarios within days.
Can Claru collect egocentric data for a specific task or environment?
Yes. Custom egocentric data collection is one of Claru's core services. The process starts with a task specification developed in collaboration with the client's ML team: what activities need to be captured, in what environments, from what camera viewpoint (head-mounted, chest-mounted, wrist-mounted), at what resolution and frame rate, and with what metadata. Claru then deploys contributors from its 10,000+ person network who match the environmental requirements (e.g., baristas for coffee shop data, warehouse workers for logistics data, home cooks for kitchen data). Collection campaigns typically produce first clips within 48 hours and can scale to thousands of clips per week. All custom data comes with the full enrichment pipeline (depth, segmentation, pose, flow, captions) and a project-specific quality assurance process.
What formats is egocentric video data delivered in?
Claru delivers egocentric video datasets in the formats robotics and ML teams actually use. Video files are delivered as MP4 (H.264 or H.265 encoding) or as extracted frame sequences in PNG or WebP format. Annotations are delivered as Parquet files (for tabular metadata), NumPy arrays (for dense annotations like depth maps and segmentation masks), and JSON (for structured labels like pose keypoints and action boundaries). For streaming training at scale, Claru packages datasets in WebDataset format (tar shards with co-located video and annotation files). HDF5 and RLDS formats are available for reinforcement learning pipelines. All deliveries include a manifest file with SHA-256 checksums and a datasheet documenting collection methodology, annotator demographics, and known limitations.
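As an illustration of the streaming option, here is a minimal sketch that iterates one WebDataset shard using the open-source webdataset package; the shard name and per-sample key names are assumptions about how clips and annotations might be co-located, not Claru's documented layout:

```python
# Minimal sketch: streaming paired video and annotation files from one tar shard.
import json
import webdataset as wds

dataset = wds.WebDataset("egocentric-shard-000000.tar")

for sample in dataset:
    clip_id = sample["__key__"]           # shared basename of the co-located files
    video_bytes = sample["mp4"]           # raw MP4 bytes; decode with your video reader
    pose = json.loads(sample["pose.json"])  # assumed co-located annotation file
    print(clip_id, len(video_bytes), len(pose.get("frames", [])))
    break
```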
Need Egocentric Video for Your Model?
Whether you need to license existing egocentric datasets or commission custom collection for specific tasks and environments, Claru can help.