Egocentric Video Data for Physical AI
Egocentric (first-person) video data is footage recorded from the perspective of a person performing real-world activities — captured with head-mounted cameras, GoPro-style rigs, or smart glasses. Because this viewpoint matches what a robot's or AR headset's camera sees, egocentric data is the primary training modality for visuomotor policies, world models, embodied AI agents, and hand-interaction research. Claru provides commercially licensed egocentric video with multi-layer annotations from 10,000+ collectors in 100+ cities.
What Makes Claru's Egocentric Data Different
20+ environment types, not just kitchens
Most egocentric datasets are concentrated in kitchen or lab settings. Claru captures across 20+ categories — restaurants, farms, clinics, construction sites, salons, gyms, transit, storefronts, and more — giving models exposure to the full range of real-world environments they will encounter at deployment.
Multi-layer enrichment on every clip
Raw video ships with hand bounding boxes, activity classification labels, object detection, and optional depth maps and spatial segmentation. Each clip arrives as a complete training-ready unit — not raw footage that requires a separate annotation pipeline before you can use it.
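To make the "training-ready unit" idea concrete, here is a minimal sketch of consuming a per-clip annotation record. The JSON schema below (field names like `activity`, `frames`, `hands`, `bbox`) is purely illustrative, not Claru's actual delivery format:

```python
import json

# Hypothetical per-clip annotation record -- field names are illustrative,
# not a documented Claru schema.
clip_annotation = json.loads("""
{
  "clip_id": "demo_0001",
  "activity": "chopping_vegetables",
  "environment": "commercial_kitchen",
  "frames": [
    {"t": 0.0, "hands": [{"side": "left",  "bbox": [412, 310, 188, 140]},
                         {"side": "right", "bbox": [705, 298, 201, 155]}]},
    {"t": 0.5, "hands": [{"side": "right", "bbox": [698, 305, 195, 150]}]}
  ]
}
""")

def hand_boxes(annotation, side):
    """Collect (timestamp, bbox) pairs for one hand across all annotated frames."""
    return [(f["t"], h["bbox"])
            for f in annotation["frames"]
            for h in f["hands"] if h["side"] == side]

right = hand_boxes(clip_annotation, "right")
print(len(right))  # 2: the right hand appears in both annotated frames
```

Because the labels ship alongside the video, a loader like this is the only glue code needed before training; no separate annotation pass is required.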
Custom collection on demand, delivered in days
If the environment or task you need is not in the existing archive, Claru can deploy collectors specifically for it. Task specifications are turned around in 48 hours; most custom campaigns produce first clips within one week. Minimum viable sample packs start at 10–15 clips.
Use Cases
Egocentric video powers three converging research areas where first-person observation is not a preference but a requirement.
VLA Model Training
VLA models like OpenVLA, RT-2, and pi-zero map visual observations directly to robot actions. Training these models requires large volumes of first-person video that matches the robot's onboard camera perspective. Egocentric data from Claru provides the observation-action pairs these architectures need, particularly for manipulation tasks where hand-object interaction is the critical signal. See the complete guide to VLA training data.
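The observation-action pairing can be sketched as a simple data structure. This is an assumed shape for illustration only, not the real OpenVLA, RT-2, or pi-zero input format; the 7-element action vector (end-effector deltas plus gripper state) is a common convention but an assumption here:

```python
from dataclasses import dataclass

@dataclass
class ObservationActionPair:
    frame_path: str     # path to one egocentric RGB frame on disk
    instruction: str    # language goal for the task
    action: list        # e.g. 7-DoF end-effector delta + gripper state

def pair_clip(frame_paths, instruction, actions):
    """Zip per-frame observations with per-frame action labels for one clip."""
    assert len(frame_paths) == len(actions), "one action per frame"
    return [ObservationActionPair(p, instruction, a)
            for p, a in zip(frame_paths, actions)]

pairs = pair_clip(
    ["clip_0001/frame_000.jpg", "clip_0001/frame_001.jpg"],
    "pick up the red mug",
    [[0.01, 0.0, -0.02, 0.0, 0.0, 0.0, 1.0],
     [0.00, 0.0, -0.01, 0.0, 0.0, 0.0, 1.0]],
)
print(len(pairs))  # 2
```

The point of the sketch is the pairing itself: each first-person frame is tied to a language goal and an action label, which is the supervision signal VLA architectures consume.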
Embodied AI and Robot Learning
Robots learning from imitation require demonstrations that match their embodiment's camera viewpoint. A manipulator arm with a wrist camera sees the world much the way a person wearing a wrist-mounted GoPro does. Claru's egocentric captures — collected across real workplaces, homes, and outdoor environments — provide the behavioral diversity these policies need to generalize beyond the training lab. See the embodied AI datasets page.
AR/VR and HCI Research
Augmented and mixed reality systems that understand user intent must recognize what hands are doing, what objects are nearby, and what action is being performed — all from the first-person perspective of the headset wearer. Egocentric video annotated with hand detection, gaze-aligned activity labels, and object interactions provides the supervision signal for these understanding models. See the Physical AI training data page.
Environment Types
Diverse environments are not a bonus — they are the whole point. A model that only saw kitchens during training will fail in a warehouse. Claru covers 20+ categories from the start.
Custom environments available on request — contact us with your task specification.
Sample Data Specifications
Frequently Asked Questions
What is egocentric video data?
Egocentric video data is footage recorded from a first-person perspective — typically a camera worn on the head, chest, or wrist of a human performing real-world activities. The camera captures what the wearer sees: their hands reaching for objects, the surfaces they interact with, the spatial layout of the environment around them. This viewpoint is architecturally significant for AI training because it matches the visual input a robot, AR headset, or wearable device receives in deployment. Models trained on egocentric video learn to interpret scenes, recognize hand-object interactions, and infer intent from the same perspective they will encounter in the real world — without the viewpoint transformation that third-person training data would require.
How is egocentric data different from surveillance or dashcam footage?
Surveillance and dashcam footage is recorded from a fixed external viewpoint — a camera mounted on a wall or vehicle that observes activity from outside. Egocentric data is captured from the first-person perspective of the person performing the activity. Three differences matter for AI training. First, viewpoint: egocentric video shows the hands, tools, and objects the person is directly manipulating, not a bird's-eye view of the same scene. Second, attention signal: in egocentric footage, the camera naturally follows where the person looks and reaches — this implicit attention signal tells a model which parts of the scene are task-relevant. Third, occlusion: the occlusion patterns in egocentric video (hands blocking objects, objects blocking other objects during manipulation) match what a deployed robot or AR system will actually encounter. Surveillance footage does not contain this information in a usable form for embodied AI training.
What environments can you collect egocentric video in?
Claru collects egocentric video across 20+ environment categories: restaurants and commercial kitchens, hair and nail salons, farms and agricultural settings, retail and grocery stores, hiking trails and outdoor paths, construction sites, heavy equipment operation, offices, clinics and waiting rooms, gyms and fitness facilities, research labs, sewing and textile work, ceramics studios, jewelry workshops, sidewalks and urban pedestrian environments, transit (buses, trains, stations), residential yards and outdoor home spaces, pharmacies, parks, and commercial storefronts. Custom environment types not on this list are available on request — if your use case requires a specific setting, Claru can recruit collectors with direct access to that environment.
How quickly can I get a sample pack?
For standard environments, Claru maintains a pre-collected archive of 156+ approved clips across common categories. A sample pack of 10–15 clips can typically be delivered within 48 hours of request. For custom environments or specific activity protocols, collection campaigns produce first clips within 3–5 business days of task specification sign-off. Full dataset campaigns (500+ clips) typically run 2–4 weeks depending on environment availability and geographic requirements. Minimum viable sample packs start at $500–$2,000 for 10–15 clips with basic metadata; pricing for annotated packs depends on annotation layers required.
What format is the data delivered in?
Video files are delivered as MP4 (H.264 or H.265) or MOV. Resolution ranges from 1080p to 4K; frame rate is 25–60 fps; clip duration is typically 40–180 seconds. Annotation layers are delivered as Parquet files for tabular metadata (activity labels, environment tags, contributor metadata), NumPy arrays for dense per-frame annotations (depth maps, segmentation masks), and JSON for structured labels (hand bounding boxes, pose keypoints, object detections). Datasets can be delivered via S3, GCS, or direct download. WebDataset format (tar shards with co-located video and annotation files) is available for streaming training pipelines. All deliveries include a manifest with SHA-256 checksums and a datasheet documenting collection methodology and known limitations.
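The manifest with SHA-256 checksums allows a delivery to be verified before training. The sketch below assumes a manifest shaped like `{"files": [{"file": ..., "sha256": ...}]}`; that field layout is an assumption for illustration, not a documented Claru schema. The demo stands in a throwaway file for a delivered clip:

```python
import hashlib
import json
import pathlib
import tempfile

def sha256_of(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_delivery(root, manifest):
    """Return the files whose on-disk checksum does not match the manifest."""
    root = pathlib.Path(root)
    return [e["file"] for e in manifest["files"]
            if sha256_of(root / e["file"]) != e["sha256"]]

# Demo: a temporary file stands in for a delivered clip.
with tempfile.TemporaryDirectory() as d:
    clip = pathlib.Path(d) / "clip_0001.mp4"
    clip.write_bytes(b"fake video bytes")
    manifest = {"files": [{"file": "clip_0001.mp4",
                           "sha256": sha256_of(clip)}]}
    mismatches = verify_delivery(d, manifest)
    print(mismatches)  # [] -- every file matches its recorded checksum
```

Running a check like this on arrival catches truncated downloads or corrupted shards before they silently degrade a training run.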
Related Resources
Request a Sample Pack
10–15 annotated egocentric clips from any environment type. Delivered within 48 hours for standard categories.
Sample packs from $500. Full dataset pricing on request.