Egocentric Video Data Collection for Robotics and World Modeling
Challenge: Research on robotic manipulation and world models requires massive volumes of first-person video showing natural human behavior in diverse real-world environments, and no public dataset meets frontier labs' bar for quality or diversity.
Solution: We designed a purpose-built capture and ingestion platform rather than adapting an off-the-shelf tool, and launched three parallel pipelines within days of engagement, each optimized for different environments and interaction types.
Result: The dataset became primary training data for the lab's world-modeling and robotic manipulation research.
Research on robotic manipulation and world models requires massive volumes of first-person video showing natural human behavior in diverse real-world environments, and no public dataset meets frontier labs' bar for quality or diversity. The lab had attempted internal collection but stalled at modest scale due to hardware logistics, contributor recruitment across geographies, and the overhead of enforcing consistent capture quality (resolution, frame rate, scene coverage, and activity diversity) across hundreds of participants. They needed a partner who could launch a production-ready pipeline in days, not months, and adapt to evolving research specifications on a weekly cadence without sacrificing annotation-ready quality standards.
We designed a purpose-built capture and ingestion platform rather than adapting an off-the-shelf tool, and launched three parallel pipelines within days of engagement, each optimized for different environments and interaction types.
The first pipeline deployed GoPro and DJI wearable cameras for high-fidelity, wide-angle egocentric capture of manipulation tasks, cooking, and locomotion — producing 219,000+ clips. The second pipeline used smartphone cameras for rapid, high-volume capture of everyday activities across diverse indoor and outdoor environments — producing 155,000+ clips. The third pipeline targeted specific activity categories (pouring, cutting, assembling, folding, fastening) with structured task instructions — producing 12,000+ precisely labeled clips.
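For illustration, the three streams can be thought of as independently configured capture profiles. The sketch below is a minimal, hypothetical representation of that structure; the names, fields, and the device mix for the third pipeline are assumptions, not details taken from the engagement.

```python
from dataclasses import dataclass


@dataclass
class CapturePipeline:
    name: str
    devices: list[str]               # capture hardware for this stream
    focus: str                       # what the stream is optimized for
    example_activities: list[str]


PIPELINES = [
    CapturePipeline(
        name="wearable",
        devices=["GoPro", "DJI wearable camera"],
        focus="high-fidelity, wide-angle egocentric capture",
        example_activities=["manipulation tasks", "cooking", "locomotion"],
    ),
    CapturePipeline(
        name="smartphone",
        devices=["smartphone"],
        focus="rapid, high-volume capture across diverse environments",
        example_activities=["everyday indoor and outdoor activities"],
    ),
    CapturePipeline(
        name="activity-specific",
        devices=["smartphone"],      # device mix for this stream is assumed
        focus="structured task instructions for targeted categories",
        example_activities=["pouring", "cutting", "assembling", "folding", "fastening"],
    ),
]
```

Keeping each stream as its own configuration record is one way to let the streams be tuned and rebalanced independently, as described later in this case study.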
A structured activity taxonomy was co-developed with the lab's research team through three iterative revision cycles — testing draft categories against real captured footage to resolve gaps and ambiguities. The final taxonomy organized activities by environment (kitchen, workshop, outdoor), motor complexity (gross motor, fine manipulation, locomotion), and interaction type (tool use, object transfer, environmental navigation). Every clip was annotated with structured activity classifications and enriched metadata — environment type, lighting conditions, number of objects involved, and interaction complexity score — using a labeling interface that enforced taxonomy compliance at the UI level.
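A minimal sketch of how the taxonomy dimensions and per-clip annotation could be expressed in code, assuming hypothetical field names and an illustrative (not exhaustive) set of category members drawn from the text:

```python
from dataclasses import dataclass
from enum import Enum


# Taxonomy dimensions from the text; member lists are illustrative,
# not the lab's full category set.
class Environment(Enum):
    KITCHEN = "kitchen"
    WORKSHOP = "workshop"
    OUTDOOR = "outdoor"


class MotorComplexity(Enum):
    GROSS_MOTOR = "gross_motor"
    FINE_MANIPULATION = "fine_manipulation"
    LOCOMOTION = "locomotion"


class InteractionType(Enum):
    TOOL_USE = "tool_use"
    OBJECT_TRANSFER = "object_transfer"
    ENVIRONMENTAL_NAVIGATION = "environmental_navigation"


@dataclass
class ClipAnnotation:
    clip_id: str
    environment: Environment
    motor_complexity: MotorComplexity
    interaction_type: InteractionType
    lighting: str                    # e.g. "daylight", "artificial" (assumed values)
    num_objects: int
    interaction_complexity: float    # complexity score; scale is assumed, not stated
```

Converting raw labels through the enums (for example, `Environment(value)`) rejects out-of-taxonomy values, loosely mirroring the compliance the labeling interface enforced at the UI level.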
Automated checks at upload time validated resolution, duration, orientation, and file integrity before clips entered the annotation pipeline. An in-house QA team then ran continuous validation on every submission within 24 hours. Real-time dashboards enabled dynamic rebalancing across geographies and activity types as the lab's priorities shifted. The operational model was built for velocity: contributor onboarding took under 48 hours, QA turnaround was same-day, and weekly delivery batches kept the lab's training pipeline fed continuously.
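As a rough sketch of what those upload-time checks could look like, the snippet below validates resolution, duration, orientation, and file integrity before a clip would enter annotation. The thresholds, field names, and function names are assumptions; the actual capture spec and tooling are not described here.

```python
import hashlib
from dataclasses import dataclass

# Assumed thresholds; the actual capture spec is not stated in the text.
MIN_LONG_SIDE, MIN_SHORT_SIDE = 1920, 1080
MIN_DURATION_S, MAX_DURATION_S = 5.0, 600.0


@dataclass
class ClipMetadata:
    width: int
    height: int
    duration_s: float
    orientation: str     # "landscape" or "portrait"
    sha256: str          # checksum reported by the capture app at upload time


def file_checksum(path: str) -> str:
    """Recompute the SHA-256 of the uploaded file to verify integrity."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_upload(path: str, meta: ClipMetadata) -> list[str]:
    """Return the failed checks; an empty list means the clip may enter annotation."""
    failures = []
    long_side, short_side = max(meta.width, meta.height), min(meta.width, meta.height)
    if long_side < MIN_LONG_SIDE or short_side < MIN_SHORT_SIDE:
        failures.append("resolution below minimum")
    if not MIN_DURATION_S <= meta.duration_s <= MAX_DURATION_S:
        failures.append("duration out of bounds")
    if meta.orientation not in ("landscape", "portrait"):
        failures.append("unknown orientation")
    if file_checksum(path) != meta.sha256:
        failures.append("checksum mismatch: corrupt or incomplete upload")
    return failures
```

Running cheap, deterministic checks like these at upload time keeps obviously unusable footage out of the annotation queue, so human QA effort is spent only on clips that can actually ship.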
The dataset became primary training data for the lab's world-modeling and robotic manipulation research. The three-pipeline architecture meant each stream could be independently tuned — GoPro footage for high-fidelity manipulation, smartphone for environmental diversity, activity-specific for rare interaction types. Geographic and demographic diversity across approximately 500 contributors reduced the distribution shift between training data and real-world deployment. Weekly delivery cadence allowed the lab to begin training runs during collection rather than after, compressing their research iteration cycle by weeks.
Representative record from the annotation pipeline.
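The record itself is not reproduced here; as a stand-in, the sketch below shows what such a record might look like. Every value is invented for illustration, and the fields simply mirror the activity classification and enriched metadata described above.

```python
# All values are hypothetical and stand in for the representative record
# referenced by the caption; no actual clip from the dataset is shown.
record = {
    "clip_id": "wearable-000184",            # hypothetical identifier
    "pipeline": "wearable",                  # GoPro/DJI stream
    "activity": "cutting",
    "environment": "kitchen",
    "motor_complexity": "fine_manipulation",
    "interaction_type": "tool_use",
    "lighting": "artificial",
    "num_objects": 3,
    "interaction_complexity": 0.7,           # assumed 0-1 scale
    "resolution": "3840x2160",
    "duration_s": 42.5,
}
```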
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.