Egocentric Video Data Collection for Robotics and World Modeling
Challenge: Research on robotic manipulation and world models requires massive volumes of first-person video showing natural human behavior in diverse real-world environments, and no public dataset meets frontier labs' bar for quality or diversity.
Solution: We designed a purpose-built capture and ingestion platform rather than adapting an off-the-shelf tool, and launched three parallel pipelines within days of engagement, each optimized for different environments and interaction types.
Result: The dataset became primary training data for the lab's world-modeling and robotic manipulation research.
Research on robotic manipulation and world models requires massive volumes of first-person video showing natural human behavior in diverse real-world environments, and no public dataset meets frontier labs' bar for quality or diversity. The lab had attempted internal collection but stalled at modest scale due to hardware logistics, contributor recruitment across geographies, and the overhead of enforcing consistent capture quality (resolution, frame rate, scene coverage, and activity diversity) across hundreds of participants. They needed a partner who could launch a production-ready pipeline in days, not months, and adapt to evolving research specifications on a weekly cadence without sacrificing annotation-ready quality standards.
We designed a purpose-built capture and ingestion platform rather than adapting an off-the-shelf tool, and launched three parallel pipelines within days of engagement, each optimized for different environments and interaction types.
The first pipeline deployed GoPro and DJI wearable cameras for high-fidelity, wide-angle egocentric capture of manipulation tasks, cooking, and locomotion — producing 219,000+ clips. The second pipeline used smartphone cameras for rapid, high-volume capture of everyday activities across diverse indoor and outdoor environments — producing 155,000+ clips. The third pipeline targeted specific activity categories (pouring, cutting, assembling, folding, fastening) with structured task instructions — producing 12,000+ precisely labeled clips.
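For illustration, the three streams can be thought of as independently configured capture profiles. The sketch below is a minimal, hypothetical representation of that structure; the names, fields, and the device mix for the third pipeline are assumptions, not details taken from the engagement.

```python
from dataclasses import dataclass


@dataclass
class CapturePipeline:
    name: str
    devices: list[str]               # capture hardware for this stream
    focus: str                       # what the stream is optimized for
    example_activities: list[str]


PIPELINES = [
    CapturePipeline(
        name="wearable",
        devices=["GoPro", "DJI wearable camera"],
        focus="high-fidelity, wide-angle egocentric capture",
        example_activities=["manipulation tasks", "cooking", "locomotion"],
    ),
    CapturePipeline(
        name="smartphone",
        devices=["smartphone"],
        focus="rapid, high-volume capture across diverse environments",
        example_activities=["everyday indoor and outdoor activities"],
    ),
    CapturePipeline(
        name="activity-specific",
        devices=["smartphone"],      # device mix for this stream is assumed
        focus="structured task instructions for targeted categories",
        example_activities=["pouring", "cutting", "assembling", "folding", "fastening"],
    ),
]
```

Keeping each stream as its own configuration record is one way to let the streams be tuned and rebalanced independently, as described later in this case study.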
A structured activity taxonomy was co-developed with the lab's research team through three iterative revision cycles — testing draft categories against real captured footage to resolve gaps and ambiguities. The final taxonomy organized activities by environment (kitchen, workshop, outdoor), motor complexity (gross motor, fine manipulation, locomotion), and interaction type (tool use, object transfer, environmental navigation). Every clip was annotated with structured activity classifications and enriched metadata — environment type, lighting conditions, number of objects involved, and interaction complexity score — using a labeling interface that enforced taxonomy compliance at the UI level.
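A minimal sketch of how the taxonomy dimensions and per-clip annotation could be expressed in code, assuming hypothetical field names and an illustrative (not exhaustive) set of category members drawn from the text:

```python
from dataclasses import dataclass
from enum import Enum


# Taxonomy dimensions from the text; member lists are illustrative,
# not the lab's full category set.
class Environment(Enum):
    KITCHEN = "kitchen"
    WORKSHOP = "workshop"
    OUTDOOR = "outdoor"


class MotorComplexity(Enum):
    GROSS_MOTOR = "gross_motor"
    FINE_MANIPULATION = "fine_manipulation"
    LOCOMOTION = "locomotion"


class InteractionType(Enum):
    TOOL_USE = "tool_use"
    OBJECT_TRANSFER = "object_transfer"
    ENVIRONMENTAL_NAVIGATION = "environmental_navigation"


@dataclass
class ClipAnnotation:
    clip_id: str
    environment: Environment
    motor_complexity: MotorComplexity
    interaction_type: InteractionType
    lighting: str                    # e.g. "daylight", "artificial" (assumed values)
    num_objects: int
    interaction_complexity: float    # complexity score; scale is assumed, not stated
```

Converting raw labels through the enums (for example, `Environment(value)`) rejects out-of-taxonomy values, loosely mirroring the compliance the labeling interface enforced at the UI level.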
Automated checks at upload time validated resolution, duration, orientation, and file integrity before clips entered the annotation pipeline. An in-house QA team then ran continuous validation on every submission within 24 hours. Real-time dashboards enabled dynamic rebalancing across geographies and activity types as the lab's priorities shifted. The operational model was built for velocity: contributor onboarding took under 48 hours, QA turnaround was same-day, and weekly delivery batches kept the lab's training pipeline fed continuously.
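As a rough sketch of what those upload-time checks could look like, the snippet below validates resolution, duration, orientation, and file integrity before a clip would enter annotation. The thresholds, field names, and function names are assumptions; the actual capture spec and tooling are not described here.

```python
import hashlib
from dataclasses import dataclass

# Assumed thresholds; the actual capture spec is not stated in the text.
MIN_LONG_SIDE, MIN_SHORT_SIDE = 1920, 1080
MIN_DURATION_S, MAX_DURATION_S = 5.0, 600.0


@dataclass
class ClipMetadata:
    width: int
    height: int
    duration_s: float
    orientation: str     # "landscape" or "portrait"
    sha256: str          # checksum reported by the capture app at upload time


def file_checksum(path: str) -> str:
    """Recompute the SHA-256 of the uploaded file to verify integrity."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_upload(path: str, meta: ClipMetadata) -> list[str]:
    """Return the failed checks; an empty list means the clip may enter annotation."""
    failures = []
    long_side, short_side = max(meta.width, meta.height), min(meta.width, meta.height)
    if long_side < MIN_LONG_SIDE or short_side < MIN_SHORT_SIDE:
        failures.append("resolution below minimum")
    if not MIN_DURATION_S <= meta.duration_s <= MAX_DURATION_S:
        failures.append("duration out of bounds")
    if meta.orientation not in ("landscape", "portrait"):
        failures.append("unknown orientation")
    if file_checksum(path) != meta.sha256:
        failures.append("checksum mismatch: corrupt or incomplete upload")
    return failures
```

Running cheap, deterministic checks like these at upload time keeps obviously unusable footage out of the annotation queue, so human QA effort is spent only on clips that can actually ship.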
The dataset became primary training data for the lab's world-modeling and robotic manipulation research. The three-pipeline architecture meant each stream could be independently tuned — GoPro footage for high-fidelity manipulation, smartphone for environmental diversity, activity-specific for rare interaction types. Geographic and demographic diversity across approximately 500 contributors reduced the distribution shift between training data and real-world deployment. Weekly delivery cadence allowed the lab to begin training runs during collection rather than after, compressing their research iteration cycle by weeks.
Representative record from the annotation pipeline.
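The record itself is not reproduced here; as a stand-in, the sketch below shows what such a record might look like. Every value is invented for illustration, and the fields simply mirror the activity classification and enriched metadata described above.

```python
# All values are hypothetical and stand in for the representative record
# referenced by the caption; no actual clip from the dataset is shown.
record = {
    "clip_id": "wearable-000184",            # hypothetical identifier
    "pipeline": "wearable",                  # GoPro/DJI stream
    "activity": "cutting",
    "environment": "kitchen",
    "motor_complexity": "fine_manipulation",
    "interaction_type": "tool_use",
    "lighting": "artificial",
    "num_objects": 3,
    "interaction_complexity": 0.7,           # assumed 0-1 scale
    "resolution": "3840x2160",
    "duration_s": 42.5,
}
```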
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.