Egocentric Video Data for Physical AI

Egocentric (first-person) video data is footage recorded from the perspective of a person performing real-world activities — captured with head-mounted cameras, GoPro-style rigs, or smart glasses. Because this viewpoint matches what a robot's or AR headset's camera sees, egocentric data is the primary training modality for visuomotor policies, world models, embodied AI agents, and hand-interaction research. Claru provides commercially licensed egocentric video with multi-layer annotations from 10,000+ collectors in 100+ cities.

What Makes Claru's Egocentric Data Different

20+ environment types, not just kitchens

Most egocentric datasets are concentrated in kitchen or lab settings. Claru captures across 20+ categories — restaurants, farms, clinics, construction sites, salons, gyms, transit, storefronts, and more — giving models exposure to the full range of real-world environments they will encounter at deployment.

Multi-layer enrichment on every clip

Every clip ships with hand bounding boxes, activity classification labels, object detection, and optional depth maps and spatial segmentation. It arrives as a complete training-ready unit, not raw footage that requires a separate annotation pipeline before you can use it.

Custom collection on demand, delivered in days

If the environment or task you need is not in the existing archive, Claru can deploy collectors specifically for it. Task specifications are turned around in 48 hours; most custom campaigns produce first clips within one week. Minimum viable sample packs start at 10–15 clips.

Use Cases

Egocentric video powers three converging research areas where first-person observation is not a preference but a requirement.

Vision-Language-Action (VLA) Model Training

VLA models like OpenVLA, RT-2, and pi-zero map visual observations directly to robot actions. Training these models requires large volumes of first-person video that matches the robot's onboard camera perspective. Egocentric data from Claru provides the observation-action pairs these architectures need, particularly for manipulation tasks where hand-object interaction is the critical signal. See the complete guide to VLA training data.
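As a hedged illustration of what observation-action pairing can look like in practice, the sketch below samples one frame per labelled action segment and pairs it with a verb-noun label ID. The directory layout, JSON fields, and label vocabulary are assumptions made for the example, not Claru's delivery schema.

```python
# Hypothetical sketch: pairing egocentric frames with segment-level action labels
# for VLA-style supervised training. Paths, JSON fields, and the label vocabulary
# are assumptions, not Claru's delivery schema.
import json
from pathlib import Path

import cv2
import torch
from torch.utils.data import Dataset


class EgocentricActionPairs(Dataset):
    """Yields (frame_tensor, action_label_id) pairs sampled from annotated clips."""

    def __init__(self, clip_dir: str, labels_file: str):
        self.samples = []
        # Assumed layout: {clip_id: [{"start_s": ..., "end_s": ..., "verb_noun": ...}, ...]}
        labels = json.loads(Path(labels_file).read_text())
        self.vocab = sorted({seg["verb_noun"] for segs in labels.values() for seg in segs})
        for clip_id, segments in labels.items():
            video_path = Path(clip_dir) / f"{clip_id}.mp4"
            for seg in segments:
                # Sample the midpoint of each labelled segment as one training example.
                mid_s = 0.5 * (seg["start_s"] + seg["end_s"])
                self.samples.append((video_path, mid_s, self.vocab.index(seg["verb_noun"])))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video_path, t_s, label_id = self.samples[idx]
        cap = cv2.VideoCapture(str(video_path))
        cap.set(cv2.CAP_PROP_POS_MSEC, t_s * 1000.0)  # seek to the segment midpoint
        ok, frame_bgr = cap.read()
        cap.release()
        if not ok:
            raise RuntimeError(f"Could not decode frame at {t_s:.1f}s from {video_path}")
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        frame = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0  # CHW, [0, 1]
        return frame, label_id
```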

Visuomotor Policies for Embodied AI and Robot Learning

Robots learning from imitation require demonstrations that match their embodiment's camera viewpoint. A manipulator arm with a wrist camera sees the world the same way a person wearing a wrist-mounted GoPro does. Claru's egocentric captures — collected across real workplaces, homes, and outdoor environments — provide the behavioral diversity these policies need to generalize beyond the training lab. See our embodied AI datasets.
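To make the imitation-learning framing concrete, here is a minimal behavior-cloning sketch, assuming demonstrations have already been paired as (frame, action) tensors. The small CNN policy and the 7-DoF action space are placeholders for illustration, not a prescribed pipeline.

```python
# Illustrative behavior-cloning step: regress robot actions from egocentric frames.
# The network, action dimensionality (7-DoF end-effector deltas here), and the
# batch source are placeholders, not a prescribed Claru pipeline.
import torch
import torch.nn as nn


class VisuomotorPolicy(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(  # small CNN over RGB observations
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(obs))


def bc_step(policy, optimizer, obs_batch, action_batch):
    """One behavior-cloning update: minimize MSE between predicted and demonstrated actions."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(policy(obs_batch), action_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```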

AR/VR and Human-Computer Interaction (HCI) Research

Augmented and mixed reality systems that understand user intent must recognize what hands are doing, what objects are nearby, and what action is being performed — all from the first-person perspective of the headset wearer. Egocentric video annotated with hand detection, gaze-aligned activity labels, and object interactions provides the supervision signal for these understanding models. See our physical AI training data.
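As one hedged example of how the hand and object layers can be combined downstream, the sketch below flags frames where a hand box overlaps an object box as candidate hand-object interactions. The box format and field names are assumptions about the annotation JSON, not a documented schema.

```python
# Hedged sketch: a simple contact heuristic that flags frames where a detected
# hand box overlaps an object box. Box format (x1, y1, x2, y2 in pixels) and the
# annotation field names are assumptions, not the delivered spec.
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def hands_in_contact(frame_annotation, iou_threshold=0.05):
    """Return (hand_side, object_class) pairs whose boxes overlap in this frame."""
    contacts = []
    for hand in frame_annotation.get("hands", []):        # assumed: {"side": "left", "box": [...]}
        for obj in frame_annotation.get("objects", []):   # assumed: {"class": "cup", "box": [...]}
            if box_iou(hand["box"], obj["box"]) > iou_threshold:
                contacts.append((hand["side"], obj["class"]))
    return contacts
```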

Environment Types

Diverse environments are not a bonus — they are the whole point. A model that only saw kitchens during training will fail in a warehouse. Claru covers 20+ categories from the start.

Restaurants
Hair / Nail Salons
Farms
Retail / Grocery
Trails
Construction Sites
Heavy Equipment
Offices
Clinics / Waiting Rooms
Gyms
Labs
Sewing / Textiles
Ceramics Studios
Jewelry Workshops
Sidewalks
Transit
Residential Yards
Pharmacies
Parks
Storefronts

Custom environments available on request — contact us with your task specification.

Sample Data Specifications

Clip Properties
Resolution: 1080p – 4K
Frame rate: 25 – 60 fps
Clip duration: 40 – 180 seconds
Format: MP4 (H.264 / H.265), MOV
Annotation Layers Available
Hand detection: Bounding boxes per frame with left/right classification
Activity classification: Clip-level and segment-level verb-noun action labels
Object detection: COCO-compatible bounding boxes, 100+ categories
Depth maps: Per-frame monocular depth (16-bit PNG or float32 NumPy)
Semantic segmentation: Per-pixel class labels + instance IDs (on request)
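For the optional depth layer, here is a loading sketch under stated assumptions: the .npy variant is read directly, and the 16-bit PNG variant is assumed to store millimetres. Confirm the actual encoding in the datasheet that accompanies your delivery.

```python
# Hedged sketch: loading the optional depth layer in either delivered format.
# The millimetre scale for the 16-bit PNG variant is an assumption; check the
# delivery datasheet for the actual encoding.
import numpy as np
from PIL import Image


def load_depth(path: str) -> np.ndarray:
    """Return a float32 depth map in metres from a 16-bit PNG or a .npy file."""
    if path.endswith(".npy"):
        return np.load(path).astype(np.float32)       # already float32 depth
    depth_raw = np.asarray(Image.open(path), dtype=np.uint16)
    return depth_raw.astype(np.float32) / 1000.0      # assumed: PNG stores millimetres
```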
10,000+ collectors in 100+ cities across 14+ countries
20+ environment types covered in the standard collection catalogue
156+ pre-approved clips available for immediate sample delivery
48-hour sample turnaround from request to delivery for standard environments

Frequently Asked Questions

What is egocentric video data?

Egocentric video data is footage recorded from a first-person perspective — typically a camera worn on the head, chest, or wrist of a human performing real-world activities. The camera captures what the wearer sees: their hands reaching for objects, the surfaces they interact with, the spatial layout of the environment around them. This viewpoint is architecturally significant for AI training because it matches the visual input a robot, AR headset, or wearable device receives in deployment. Models trained on egocentric video learn to interpret scenes, recognize hand-object interactions, and infer intent from the same perspective they will encounter in the real world — without requiring a viewpoint transformation that third-person training data would necessitate.

How is egocentric data different from surveillance or dashcam footage?

Surveillance and dashcam footage is recorded from a fixed external viewpoint — a camera mounted on a wall or vehicle that observes activity from outside. Egocentric data is captured from the first-person perspective of the person performing the activity. Three differences matter for AI training. First, viewpoint: egocentric video shows the hands, tools, and objects the person is directly manipulating, not a bird's-eye view of the same scene. Second, attention signal: in egocentric footage, the camera naturally follows where the person looks and reaches — this implicit attention signal tells a model which parts of the scene are task-relevant. Third, occlusion: the occlusion patterns in egocentric video (hands blocking objects, objects blocking other objects during manipulation) match what a deployed robot or AR system will actually encounter. Surveillance footage does not contain this information in a usable form for embodied AI training.

What environments can you collect egocentric video in?

Claru collects egocentric video across 20+ environment categories: restaurants and commercial kitchens, hair and nail salons, farms and agricultural settings, retail and grocery stores, hiking trails and outdoor paths, construction sites, heavy equipment operation, offices, clinics and waiting rooms, gyms and fitness facilities, research labs, sewing and textile work, ceramics studios, jewelry workshops, sidewalks and urban pedestrian environments, transit (buses, trains, stations), residential yards and outdoor home spaces, pharmacies, parks, and commercial storefronts. Custom environment types not on this list are available on request — if your use case requires a specific setting, Claru can recruit collectors with direct access to that environment.

How quickly can I get a sample pack?

For standard environments, Claru maintains a pre-collected archive of 156+ approved clips across common categories. A sample pack of 10–15 clips can typically be delivered within 48 hours of request. For custom environments or specific activity protocols, collection campaigns produce first clips within 3–5 business days of task specification sign-off. Full dataset campaigns (500+ clips) typically run 2–4 weeks depending on environment availability and geographic requirements. Minimum viable sample packs start at $500–$2,000 for 10–15 clips with basic metadata; pricing for annotated packs depends on annotation layers required.

What format is the data delivered in?

Video files are delivered as MP4 (H.264 or H.265) or MOV. Resolution ranges from 1080p to 4K; frame rate is 25–60 fps; clip duration is typically 40–180 seconds. Annotation layers are delivered as Parquet files for tabular metadata (activity labels, environment tags, contributor metadata), NumPy arrays for dense per-frame annotations (depth maps, segmentation masks), and JSON for structured labels (hand bounding boxes, pose keypoints, object detections). Datasets can be delivered via S3, GCS, or direct download. WebDataset format (tar shards with co-located video and annotation files) is available for streaming training pipelines. All deliveries include a manifest with SHA-256 checksums and a datasheet documenting collection methodology and known limitations.
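As a minimal sketch of ingesting a delivery, the example below assumes a checksum manifest with path and sha256 columns and a metadata.parquet table of clip-level labels; these filenames and column names are illustrative assumptions rather than the documented layout.

```python
# Hedged sketch: verifying a delivery against its checksum manifest and loading
# the tabular metadata layer. The manifest filename, its column names, and the
# metadata Parquet layout are assumptions, not the documented delivery schema.
import hashlib
from pathlib import Path

import pandas as pd


def sha256(path: Path) -> str:
    """Compute the SHA-256 digest of a file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_delivery(root: str, manifest_name: str = "manifest.parquet") -> pd.DataFrame:
    """Check every listed file's SHA-256, then return the clip metadata table."""
    root = Path(root)
    manifest = pd.read_parquet(root / manifest_name)        # assumed columns: path, sha256
    for row in manifest.itertuples():
        if sha256(root / row.path) != row.sha256:
            raise ValueError(f"Checksum mismatch for {row.path}")
    return pd.read_parquet(root / "metadata.parquet")        # assumed: activity + environment tags
```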

Request a Sample Pack

10–15 annotated egocentric clips from any environment type. Delivered within 48 hours for standard categories.

Sample packs from $500. Full dataset pricing on request.