Egocentric Video Data for Physical AI

Egocentric (first-person) video data is footage recorded from the perspective of a person performing real-world activities — captured with head-mounted cameras, GoPro-style rigs, or smart glasses. Because this viewpoint matches what a robot's or AR headset's camera sees, egocentric data is the primary training modality for visuomotor policies, world models, embodied AI agents, and hand-interaction research. Claru provides commercially licensed egocentric video with multi-layer annotations from 10,000+ collectors in 100+ cities.

What Makes Claru's Egocentric Data Different

20+ environment types, not just kitchens

Most egocentric datasets are concentrated in kitchen or lab settings. Claru captures across 20+ categories — restaurants, farms, clinics, construction sites, salons, gyms, transit, storefronts, and more — giving models exposure to the full range of real-world environments they will encounter at deployment.

Multi-layer enrichment on every clip

Every clip ships with hand bounding boxes, activity classification labels, object detection, and optional depth maps and spatial segmentation. It arrives as a complete training-ready unit, not raw footage that requires a separate annotation pipeline before you can use it.

Custom collection on demand, delivered in days

If the environment or task you need is not in the existing archive, Claru can deploy collectors specifically for it. Task specifications are turned around in 48 hours; most custom campaigns produce first clips within one week. Minimum viable sample packs start at 10–15 clips.

Use Cases

Egocentric video powers three converging research areas where first-person observation is not a preference but a requirement.

Vision-Language-Action (VLA) Model Training

VLA models like OpenVLA, RT-2, and pi-zero map visual observations directly to robot actions. Training these models requires large volumes of first-person video that matches the robot's onboard camera perspective. Egocentric data from Claru provides the observation-action pairs these architectures need, particularly for manipulation tasks where hand-object interaction is the critical signal. See the complete guide to VLA training data.
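As a hedged illustration of what observation-action pairing can look like in practice, the sketch below samples one frame per labelled action segment and pairs it with a verb-noun label ID. The directory layout, JSON fields, and label vocabulary are assumptions made for the example, not Claru's delivery schema.

```python
# Hypothetical sketch: pairing egocentric frames with segment-level action labels
# for VLA-style supervised training. Paths, JSON fields, and the label vocabulary
# are assumptions, not Claru's delivery schema.
import json
from pathlib import Path

import cv2
import torch
from torch.utils.data import Dataset


class EgocentricActionPairs(Dataset):
    """Yields (frame_tensor, action_label_id) pairs sampled from annotated clips."""

    def __init__(self, clip_dir: str, labels_file: str):
        self.samples = []
        # Assumed layout: {clip_id: [{"start_s": ..., "end_s": ..., "verb_noun": ...}, ...]}
        labels = json.loads(Path(labels_file).read_text())
        self.vocab = sorted({seg["verb_noun"] for segs in labels.values() for seg in segs})
        for clip_id, segments in labels.items():
            video_path = Path(clip_dir) / f"{clip_id}.mp4"
            for seg in segments:
                # Sample the midpoint of each labelled segment as one training example.
                mid_s = 0.5 * (seg["start_s"] + seg["end_s"])
                self.samples.append((video_path, mid_s, self.vocab.index(seg["verb_noun"])))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video_path, t_s, label_id = self.samples[idx]
        cap = cv2.VideoCapture(str(video_path))
        cap.set(cv2.CAP_PROP_POS_MSEC, t_s * 1000.0)  # seek to the segment midpoint
        ok, frame_bgr = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError(f"Could not decode frame at {t_s:.1f}s from {video_path}")
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        frame = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0  # CHW, [0, 1]
        return frame, label_id
```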

Visuomotor Policies for Embodied AI and Robot Learning

Robots learning from imitation require demonstrations that match their embodiment's camera viewpoint. A manipulator arm with a wrist camera sees the world the same way a person wearing a wrist-mounted GoPro does. Claru's egocentric captures — collected across real workplaces, homes, and outdoor environments — provide the behavioral diversity these policies need to generalize beyond the training lab. See our embodied AI datasets.
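To make the imitation-learning framing concrete, here is a minimal behavior-cloning sketch, assuming demonstrations have already been paired as (frame, action) tensors. The small CNN policy and the 7-DoF action space are placeholders for illustration, not a prescribed pipeline.

```python
# Illustrative behavior-cloning step: regress robot actions from egocentric frames.
# The network, action dimensionality (7-DoF end-effector deltas here), and the
# batch source are placeholders, not a prescribed Claru pipeline.
import torch
import torch.nn as nn


class VisuomotorPolicy(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(  # small CNN over RGB observations
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(obs))


def bc_step(policy, optimizer, obs_batch, action_batch):
    """One behavior-cloning update: minimize MSE between predicted and demonstrated actions."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(policy(obs_batch), action_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```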

AR/VR and Human-Computer Interaction (HCI) Research

Augmented and mixed reality systems that understand user intent must recognize what hands are doing, what objects are nearby, and what action is being performed — all from the first-person perspective of the headset wearer. Egocentric video annotated with hand detection, gaze-aligned activity labels, and object interactions provides the supervision signal for these understanding models. See our physical AI training data.
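As one hedged example of how the hand and object layers can be combined downstream, the sketch below flags frames where a hand box overlaps an object box as candidate hand-object interactions. The box format and field names are assumptions about the annotation JSON, not a documented schema.

```python
# Hedged sketch: a simple contact heuristic that flags frames where a detected
# hand box overlaps an object box. Box format (x1, y1, x2, y2 in pixels) and the
# annotation field names are assumptions, not the delivered spec.
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def hands_in_contact(frame_annotation, iou_threshold=0.05):
    """Return (hand_side, object_class) pairs whose boxes overlap in this frame."""
    contacts = []
    for hand in frame_annotation.get("hands", []):        # assumed: {"side": "left", "box": [...]}
        for obj in frame_annotation.get("objects", []):   # assumed: {"class": "cup", "box": [...]}
            if box_iou(hand["box"], obj["box"]) > iou_threshold:
                contacts.append((hand["side"], obj["class"]))
    return contacts
```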

Environment Types

Diverse environments are not a bonus — they are the whole point. A model that only saw kitchens during training will fail in a warehouse. Claru covers 20+ categories from the start.

Restaurants
Hair / Nail Salons
Farms
Retail / Grocery
Trails
Construction Sites
Heavy Equipment
Offices
Clinics / Waiting Rooms
Gyms
Labs
Sewing / Textiles
Ceramics Studios
Jewelry Workshops
Sidewalks
Transit
Residential Yards
Pharmacies
Parks
Storefronts

Custom environments available on request — contact us with your task specification.

Sample Data Specifications

Clip Properties
Resolution: 1080p – 4K
Frame rate: 25 – 60 fps
Clip duration: 40 – 180 seconds
Format: MP4 (H.264 / H.265), MOV
Annotation Layers Available
Hand detection: Bounding boxes per frame with left/right classification
Activity classification: Clip-level and segment-level verb-noun action labels
Object detection: COCO-compatible bounding boxes, 100+ categories
Depth maps: Per-frame monocular depth (16-bit PNG or float32 NumPy)
Semantic segmentation: Per-pixel class labels + instance IDs (on request)
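For the optional depth layer, here is a loading sketch under stated assumptions: the .npy variant is read directly, and the 16-bit PNG variant is assumed to store millimetres. Confirm the actual encoding in the datasheet that accompanies your delivery.

```python
# Hedged sketch: loading the optional depth layer in either delivered format.
# The millimetre scale for the 16-bit PNG variant is an assumption; check the
# delivery datasheet for the actual encoding.
import numpy as np
from PIL import Image


def load_depth(path: str) -> np.ndarray:
    """Return a float32 depth map in metres from a 16-bit PNG or a .npy file."""
    if path.endswith(".npy"):
        return np.load(path).astype(np.float32)       # already float32 depth
    depth_raw = np.asarray(Image.open(path), dtype=np.uint16)
    return depth_raw.astype(np.float32) / 1000.0      # assumed: PNG stores millimetres
```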
10,000+ collectors in 100+ cities across 14+ countries
20+ environment types covered in the standard collection catalogue
156+ pre-approved clips available for immediate sample delivery
48-hour sample turnaround from request to delivery for standard environments

Frequently Asked Questions

What is egocentric video data?

Egocentric video data is footage recorded from a first-person perspective — typically a camera worn on the head, chest, or wrist of a human performing real-world activities. The camera captures what the wearer sees: their hands reaching for objects, the surfaces they interact with, the spatial layout of the environment around them. This viewpoint is architecturally significant for AI training because it matches the visual input a robot, AR headset, or wearable device receives in deployment. Models trained on egocentric video learn to interpret scenes, recognize hand-object interactions, and infer intent from the same perspective they will encounter in the real world — without requiring a viewpoint transformation that third-person training data would necessitate.

How is egocentric data different from surveillance or dashcam footage?

Surveillance and dashcam footage is recorded from a fixed external viewpoint — a camera mounted on a wall or vehicle that observes activity from outside. Egocentric data is captured from the first-person perspective of the person performing the activity. Three differences matter for AI training. First, viewpoint: egocentric video shows the hands, tools, and objects the person is directly manipulating, not a bird's-eye view of the same scene. Second, attention signal: in egocentric footage, the camera naturally follows where the person looks and reaches — this implicit attention signal tells a model which parts of the scene are task-relevant. Third, occlusion: the occlusion patterns in egocentric video (hands blocking objects, objects blocking other objects during manipulation) match what a deployed robot or AR system will actually encounter. Surveillance footage does not contain this information in a usable form for embodied AI training.

What environments can you collect egocentric video in?

Claru collects egocentric video across 20+ environment categories: restaurants and commercial kitchens, hair and nail salons, farms and agricultural settings, retail and grocery stores, hiking trails and outdoor paths, construction sites, heavy equipment operation, offices, clinics and waiting rooms, gyms and fitness facilities, research labs, sewing and textile work, ceramics studios, jewelry workshops, sidewalks and urban pedestrian environments, transit (buses, trains, stations), residential yards and outdoor home spaces, pharmacies, parks, and commercial storefronts. Custom environment types not on this list are available on request — if your use case requires a specific setting, Claru can recruit collectors with direct access to that environment.

How quickly can I get a sample pack?

For standard environments, Claru maintains a pre-collected archive of 156+ approved clips across common categories. A sample pack of 10–15 clips can typically be delivered within 48 hours of request. For custom environments or specific activity protocols, collection campaigns produce first clips within 3–5 business days of task specification sign-off. Full dataset campaigns (500+ clips) typically run 2–4 weeks depending on environment availability and geographic requirements. Minimum viable sample packs start at $500–$2,000 for 10–15 clips with basic metadata; pricing for annotated packs depends on annotation layers required.

What format is the data delivered in?

Video files are delivered as MP4 (H.264 or H.265) or MOV. Resolution ranges from 1080p to 4K; frame rate is 25–60 fps; clip duration is typically 40–180 seconds. Annotation layers are delivered as Parquet files for tabular metadata (activity labels, environment tags, contributor metadata), NumPy arrays for dense per-frame annotations (depth maps, segmentation masks), and JSON for structured labels (hand bounding boxes, pose keypoints, object detections). Datasets can be delivered via S3, GCS, or direct download. WebDataset format (tar shards with co-located video and annotation files) is available for streaming training pipelines. All deliveries include a manifest with SHA-256 checksums and a datasheet documenting collection methodology and known limitations.
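As a minimal sketch of ingesting a delivery, the example below assumes a checksum manifest with path and sha256 columns and a metadata.parquet table of clip-level labels; these filenames and column names are illustrative assumptions rather than the documented layout.

```python
# Hedged sketch: verifying a delivery against its checksum manifest and loading
# the tabular metadata layer. The manifest filename, its column names, and the
# metadata Parquet layout are assumptions, not the documented delivery schema.
import hashlib
from pathlib import Path

import pandas as pd


def sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_delivery(root: str, manifest_name: str = "manifest.parquet") -> pd.DataFrame:
    """Check every listed file's SHA-256, then return the clip metadata table."""
    root = Path(root)
    manifest = pd.read_parquet(root / manifest_name)        # assumed columns: path, sha256
    for row in manifest.itertuples():
        if sha256(root / row.path) != row.sha256:
            raise ValueError(f"Checksum mismatch for {row.path}")
    return pd.read_parquet(root / "metadata.parquet")        # assumed: activity + environment tags
```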

Request a Sample Pack

10–15 annotated egocentric clips from any environment type. Delivered within 48 hours for standard categories.

Sample packs from $500. Full dataset pricing on request.