Egocentric Video Datasets: First-Person Video Data for Embodied AI
Robots see the world from their own perspective. Training data should match that perspective. Claru provides commercially licensed egocentric video with multi-layer annotations — 500,000+ clips from 10,000+ contributors in 100+ cities, enriched with depth, segmentation, pose, and action labels.
What Is Egocentric Video and Why It Matters for AI
Egocentric (first-person) video is recorded from the perspective of the person performing an activity. A camera worn on the head, chest, or wrist captures what the wearer sees — the same viewpoint a robot's onboard camera would have.
This viewpoint distinction is not cosmetic. It is architecturally fundamental for several reasons:
- Viewpoint matching. A visuomotor policy trained on third-person video must learn an implicit viewpoint transformation before it can predict actions from first-person observations. Egocentric training data eliminates this learning burden.
- Attention signal. In egocentric video, the camera naturally points where the person is looking and reaching. This implicit attention signal is free information for models learning which objects and regions are task-relevant.
- Hand-object interaction. Egocentric cameras capture detailed views of hand-object interactions — precisely the manipulation information robotics models need to learn grasping, tool use, and assembly.
- Occlusion patterns. The occlusion patterns in egocentric video (hands occluding objects, objects occluding each other from the manipulator's perspective) match the occlusion patterns a deployed robot will encounter.
As the robotics field has converged on visuomotor policies (models that directly map visual observations to motor commands), egocentric video has become the most important single data modality for robot learning.
Egocentric Video Datasets: Academic vs. Commercial
Several academic egocentric datasets exist. Here is how they compare to Claru's commercially licensed collection.
| Feature | Ego4D | EPIC-KITCHENS | Claru |
|---|---|---|---|
| Total Hours | 3,670 hours | 100+ hours | 10,000+ hours (growing) |
| Clip Count | ~9,600 videos | ~90,000 clips | 500,000+ clips |
| Contributors | 923 participants | 45 participants | 10,000+ contributors |
| Geographic Spread | 9 countries | 4 cities | 100+ cities, 14+ countries |
| Environment Types | Mixed (daily activities) | Kitchens only | 12+ categories (kitchen, warehouse, workshop, retail, outdoor, etc.) |
| Commercial License | No (research only) | No (research only) | Yes (full commercial rights) |
| Custom Collection | No (fixed dataset) | No (fixed dataset) | Yes (task-specific, on demand) |
| Depth Annotations | Limited | No | Yes (monocular depth on all clips) |
| Segmentation | Partial (benchmark subsets) | Partial (object detections) | Yes (semantic + instance on all clips) |
| Hand Pose | Partial (benchmark subsets) | No | Yes (2D + 3D hand pose on all clips) |
| Freshness | Fixed (2022 collection) | Fixed (2018-2023 collection) | Continuously updated (new data weekly) |
Academic datasets like Ego4D and EPIC-KITCHENS have been invaluable for advancing egocentric vision research. They provide standardized benchmarks that enable fair comparison across methods.
However, they were not designed for commercial production use. Licensing restrictions prevent use in proprietary training pipelines. Fixed datasets cannot be expanded to cover new environments or tasks. Annotation coverage is limited to specific benchmark tasks rather than the full enrichment stack ML teams need.
Claru provides the commercial alternative: fully licensed egocentric video that can be expanded, customized, and enriched to the exact specifications your training pipeline requires.
Types of Egocentric Video Data
Claru collects egocentric video across diverse activity categories to support different physical AI use cases.
Kitchen and Food Preparation
120K+ clips. Cooking, food prep, dishwashing, kitchen organization. Covers 200+ kitchen layouts with diverse cookware, ingredients, and appliance types. Critical for household robotics and restaurant automation research.
Workshop and Repair
80K+ clips. Carpentry, electronics repair, sewing, phone repair, small engine work. Captures fine-grained tool use and multi-step assembly operations. Essential for training dexterous manipulation policies.
Warehouse and Logistics
60K+ clips. Picking, packing, shelving, inventory management, cart operations. Collected in real commercial warehouses with authentic product variety and bin configurations. Directly applicable to warehouse robotics.
Retail and Commercial
45K+ clips. Shopping, product interaction, checkout operations, store navigation. Captures human behavior in commercial spaces for service robotics and retail automation applications.
Outdoor and Navigation
90K+ clips. Sidewalk walking, park navigation, urban traversal, transit use. Provides the visual and locomotion data needed for mobile robots, delivery systems, and autonomous navigation.
Custom Task-Specific
On demand. Targeted collection for client-specific tasks: specific manufacturing operations, specific household routines, specific agricultural activities. Captured with custom protocols developed with the client's ML team.
Annotation Layers on Every Clip
Raw video is the starting point, not the deliverable. Claru enriches every egocentric clip with multiple annotation layers that provide the supervision signals ML models need.
Monocular Depth Estimation
Per-frame depth maps computed using state-of-the-art monocular depth models. Provides metric or relative depth at every pixel, enabling 3D scene understanding from a single camera. Depth maps are calibrated against LiDAR ground truth where available and cross-validated against segmentation boundaries for geometric consistency.
Format: 16-bit PNG or NumPy float32 arrays
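To illustrate how these deliverables plug into a training pipeline, here is a minimal Python sketch of loading a 16-bit PNG depth map; the file name and the millimeter encoding are illustrative assumptions, not a documented Claru convention:

```python
# Minimal sketch: loading a per-frame depth map delivered as 16-bit PNG.
# The file name and the millimeter scale are assumptions for illustration.
import numpy as np
from PIL import Image

depth_png = Image.open("frame_000123_depth.png")    # 16-bit grayscale PNG
depth_raw = np.asarray(depth_png, dtype=np.uint16)  # (H, W) integer depth

# Assumed encoding: depth stored in millimeters; 0 marks invalid pixels.
depth_m = depth_raw.astype(np.float32) / 1000.0
valid = depth_raw > 0

print(f"median depth: {np.median(depth_m[valid]):.2f} m")
```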
Semantic and Instance Segmentation
Per-pixel labels with 100+ object categories (furniture, appliances, food items, tools, containers, surfaces) plus instance IDs distinguishing individual objects. Enables models to identify what objects are present, where they are, and which specific instance is being interacted with.
Format: Indexed PNG masks or NumPy uint16 arrays
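A minimal sketch of combining the semantic and instance masks for one frame; the file names and the class index used are illustrative assumptions:

```python
# Minimal sketch: reading semantic and instance masks for one frame.
import numpy as np
from PIL import Image

semantic = np.asarray(Image.open("frame_000123_semantic.png"))  # (H, W) class indices
instance = np.load("frame_000123_instance.npy")                 # (H, W) uint16 instance IDs

# Example query: pixels belonging to a hypothetical "cup" class (index 17 assumed).
CUP_CLASS = 17
cup_pixels = semantic == CUP_CLASS
cup_instance_ids = np.unique(instance[cup_pixels])
print(f"{cup_instance_ids.size} cup instances in frame")
```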
Human and Hand Pose Estimation
Full-body 2D and 3D joint positions (17+ keypoints) plus detailed hand articulation (21 keypoints per hand). Critical for understanding manipulation: which fingers are in contact with which object, what grasp type is being used, what is the hand trajectory during a reaching motion.
Format: JSON keypoint arrays or COCO-format annotations
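A minimal sketch of parsing hand keypoints from a per-clip JSON file; the schema shown (field names, keypoint ordering) is an illustrative assumption rather than Claru's documented format:

```python
# Minimal sketch: parsing per-frame hand keypoints from a JSON annotation file.
# Field names and the 21-keypoint ordering are assumptions for illustration.
import json
import numpy as np

with open("clip_000042_pose.json") as f:
    pose = json.load(f)

for frame in pose["frames"]:                     # assumed: list of per-frame records
    right_hand = frame.get("right_hand_2d")      # assumed: 21 [x, y, confidence] triplets
    if right_hand is None:
        continue
    kp = np.array(right_hand, dtype=np.float32)  # (21, 3)
    wrist, index_tip = kp[0, :2], kp[8, :2]      # wrist and index fingertip (ordering assumed)
    span = np.linalg.norm(index_tip - wrist)
    print(f"frame {frame['index']}: wrist-to-index span {span:.1f} px")
```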
Optical Flow
Dense motion vectors between consecutive frames, capturing both camera motion and object motion. Optical flow provides the dynamic information that complements the static information in depth and segmentation — it reveals which parts of the scene are moving, how fast, and in what direction.
Format: Float16 flow fields in .flo or NumPy format
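A minimal sketch of loading a flow field delivered as a NumPy array and flagging moving pixels; the file name, (H, W, 2) layout, and motion threshold are assumptions:

```python
# Minimal sketch: loading a float16 flow field saved as a NumPy array
# and computing per-pixel motion magnitude.
import numpy as np

flow = np.load("frame_000123_flow.npy").astype(np.float32)  # (H, W, 2): u, v in pixels
magnitude = np.linalg.norm(flow, axis=-1)

moving = magnitude > 1.0  # assumed threshold: >1 px displacement between frames
print(f"{100 * moving.mean():.1f}% of pixels moving")
```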
AI-Generated Captions
Natural language descriptions of the activity, objects, and spatial relationships in each clip. Generated by vision-language models and validated for accuracy. Enables language-grounded learning — training models to associate visual observations with natural language instructions.
Format: UTF-8 text with per-clip and per-segment granularity
Action Boundary Labels
Temporal annotations marking the start and end of discrete actions within each clip: reach, grasp, lift, transport, place, cut, pour, stir. Labels follow a structured verb-noun taxonomy developed for robotics applications. Available on request as a custom annotation layer.
Format: JSON with timestamp ranges and verb-noun labels
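A minimal sketch of turning action boundary labels into frame ranges; the JSON schema, field names, and frame rate are illustrative assumptions:

```python
# Minimal sketch: reading action boundary labels and mapping them to frame indices.
import json

with open("clip_000042_actions.json") as f:
    actions = json.load(f)

FPS = 30  # assumed frame rate of the source clip

for segment in actions["segments"]:                   # assumed: list of labeled segments
    verb, noun = segment["verb"], segment["noun"]     # e.g. "pour", "kettle"
    start_f = int(segment["start_time"] * FPS)
    end_f = int(segment["end_time"] * FPS)
    print(f"{verb} {noun}: frames {start_f}-{end_f}")
```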
How Claru Collects Egocentric Video at Scale
Claru operates three parallel egocentric capture pipelines, each optimized for different collection scenarios:
Pipeline 1: Wearable Camera Network
10,000+ contributors worldwide are equipped with GoPro or similar wearable cameras and capture video during their regular activities. Contributors are recruited from specific demographic and occupational backgrounds to ensure environmental diversity. A barista captures coffee shop operations. A warehouse worker captures logistics activities. A home cook captures kitchen tasks. This distributed approach produces data from the true distribution of real-world environments — diversity that no lab or studio setup can replicate.
Pipeline 2: Managed Smartphone Capture
For scenarios where wearable cameras are impractical, contributors use phone-mounted cameras following specific protocols for angle, stability, and duration. This pipeline is faster to deploy (no hardware shipping) and captures complementary viewpoints.
Pipeline 3: Activity-Specific Collection
Targeted campaigns designed for specific client requirements. Example: a client needs 5,000 clips of hand-washing procedures in commercial kitchens. Claru recruits contributors from the target demographic, develops a task protocol specifying camera placement, lighting requirements, and activity sequence, and deploys the campaign. First clips are available within 48 hours. Same-day quality assurance catches issues early.
All three pipelines feed into the same enrichment stack: depth, segmentation, pose, flow, and captions are computed automatically, then validated by human annotators. The result is a continuously growing egocentric video collection with consistent annotation quality regardless of capture source.
Frequently Asked Questions
What is egocentric video data?
Egocentric video data is video recorded from a first-person perspective, typically using a wearable camera mounted on the head, chest, or wrist. Unlike third-person video (recorded from a fixed external viewpoint), egocentric video captures what the wearer sees — the same viewpoint that a robot's head or wrist camera would have. This makes egocentric data uniquely valuable for training embodied AI systems: visuomotor policies, world models, activity recognition systems, and hand-object interaction models. Egocentric video naturally captures attention (where the person looks), intention (what they reach for), and manipulation (how they grasp and use objects) in a way that third-person video cannot.
How is Claru's egocentric data different from Ego4D?
Ego4D is a large-scale academic dataset released by a consortium of universities. It provides 3,670 hours of egocentric video from 923 participants across 9 countries with standardized benchmarks. Claru's egocentric data differs in several important ways. First, commercial licensing: Ego4D restricts commercial use and requires academic affiliation, while Claru's data is fully commercially licensed for production training. Second, scale on demand: Ego4D is a fixed dataset; Claru can collect additional egocentric data on demand — specific environments, specific tasks, specific camera configurations — through a network of 10,000+ contributors. Third, enrichment depth: Claru provides 6+ annotation layers per clip (depth, segmentation, pose, optical flow, captions, action labels) as standard; Ego4D provides annotations for specific benchmark tasks. Fourth, freshness: Ego4D was collected over a fixed time period; Claru continuously collects new data, ensuring the dataset reflects current environments and objects.
What environments does Claru collect egocentric video in?
Claru collects egocentric video across 12+ environment categories: residential kitchens and living spaces, commercial kitchens and restaurants, retail stores and shopping environments, warehouses and logistics facilities, manufacturing and assembly lines, office environments, outdoor urban spaces (sidewalks, parks, transit), outdoor rural and agricultural settings, workshops (carpentry, metalwork, electronics repair), healthcare and clinical settings, gyms and fitness facilities, and vehicle interiors. Each environment category includes multiple specific locations to ensure visual diversity — different lighting conditions, layouts, object arrangements, and cultural contexts across 100+ cities worldwide.
What annotation layers are available on egocentric video?
Claru provides six standard annotation layers on egocentric video clips. Monocular depth estimation: per-frame depth maps providing 3D spatial information. Semantic segmentation: per-pixel object class labels (100+ categories). Instance segmentation: per-pixel instance IDs distinguishing individual objects of the same class. Human and hand pose estimation: 2D and 3D joint positions for full body and detailed hand articulation, critical for understanding manipulation. Optical flow: dense motion vectors between consecutive frames, capturing dynamic scene information. AI-generated captions: natural language descriptions of activities, objects, and spatial relationships in each clip. Additional custom annotation layers — action boundary labels, object affordance annotations, gaze estimation — are available on request.
How many egocentric video clips does Claru have?
Claru's egocentric video collection contains 500,000+ clips across three parallel capture pipelines. The wearable camera pipeline has produced 386,000+ clips from GoPro and similar cameras worn during real-world activities. The smartphone capture pipeline adds clips recorded from phone-mounted cameras in complementary scenarios. The activity-specific pipeline collects targeted clips for particular tasks (cooking specific recipes, performing specific assembly operations, navigating specific routes) based on client requirements. The collection is continuously growing — Claru's network of 10,000+ contributors can be deployed to collect additional data for specific environments, tasks, or scenarios within days.
Can Claru collect egocentric data for a specific task or environment?
Yes. Custom egocentric data collection is one of Claru's core services. The process starts with a task specification developed in collaboration with the client's ML team: what activities need to be captured, in what environments, from what camera viewpoint (head-mounted, chest-mounted, wrist-mounted), at what resolution and frame rate, and with what metadata. Claru then deploys contributors from its 10,000+ person network who match the environmental requirements (e.g., baristas for coffee shop data, warehouse workers for logistics data, home cooks for kitchen data). Collection campaigns typically produce first clips within 48 hours and can scale to thousands of clips per week. All custom data comes with the full enrichment pipeline (depth, segmentation, pose, flow, captions) and a project-specific quality assurance process.
What formats is egocentric video data delivered in?
Claru delivers egocentric video datasets in the formats robotics and ML teams actually use. Video files are delivered as MP4 (H.264 or H.265 encoding) or as extracted frame sequences in PNG or WebP format. Annotations are delivered as Parquet files (for tabular metadata), NumPy arrays (for dense annotations like depth maps and segmentation masks), and JSON (for structured labels like pose keypoints and action boundaries). For streaming training at scale, Claru packages datasets in WebDataset format (tar shards with co-located video and annotation files). HDF5 and RLDS formats are available for reinforcement learning pipelines. All deliveries include a manifest file with SHA-256 checksums and a datasheet documenting collection methodology, annotator demographics, and known limitations.
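As an illustration of the streaming option, here is a minimal sketch that iterates one WebDataset shard using the open-source webdataset package; the shard name and per-sample key names are assumptions about how clips and annotations might be co-located, not Claru's documented layout:

```python
# Minimal sketch: streaming paired video and annotation files from one tar shard.
import json
import webdataset as wds

dataset = wds.WebDataset("egocentric-shard-000000.tar")

for sample in dataset:
    clip_id = sample["__key__"]           # shared basename of the co-located files
    video_bytes = sample["mp4"]           # raw MP4 bytes; decode with your video reader
    pose = json.loads(sample["pose.json"])  # assumed co-located annotation file
    print(clip_id, len(video_bytes), len(pose.get("frames", [])))
    break
```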
Need Egocentric Video for Your Model?
Whether you need to license existing egocentric datasets or commission custom collection for specific tasks and environments, Claru can help.