Game-Based Data Capture for Real-World Simulation
Challenge: Training agents to behave like humans in 3D environments requires paired observation-action data — synchronized video of what the player sees with a precise record of what they do — and no commercial capture tool provides this.
Solution: We designed and built a custom capture application from scratch.
Result: The synchronized dataset enabled the lab to train agents that predict human actions from visual observations with significantly higher fidelity than models trained on video-only data.
Training agents to behave like humans in 3D environments requires paired observation-action data — synchronized video of what the player sees with a precise record of what they do — and no commercial capture tool provides this. Existing screen-recording software captures pixels but discards control inputs entirely. Game telemetry APIs expose some state variables but not raw keystrokes at frame-level resolution. The lab needed a purpose-built solution that could record both streams with sub-frame temporal alignment, scale across different game engines and input devices, and sustain long capture sessions (4+ hours) without data loss, clock drift, or performance degradation on consumer hardware.
We designed and built a custom capture application from scratch. The system performs simultaneous screen recording at native resolution and raw input logging, capturing every keystroke, mouse movement, and controller input as structured data with microsecond-precision timestamps. Frame-level alignment between the video and control streams is maintained via a shared monotonic clock, with periodic sync markers to detect and correct any drift.
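As a minimal sketch of this logging scheme — assuming CSV output, Python's monotonic clock, and illustrative column names (the real application's implementation details are not shown here):

```python
import csv
import time

SYNC_INTERVAL_NS = 1_000_000_000  # emit a sync marker roughly once per second

def log_events(events, path):
    """Write input events with monotonic timestamps and periodic sync markers.

    `events` is an iterable of (device, key_or_axis, value) tuples; in a real
    capture application these would arrive from OS-level input hooks.
    """
    last_sync = time.monotonic_ns()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_ns", "device", "key_axis", "value"])
        for device, key, value in events:
            now = time.monotonic_ns()
            if now - last_sync >= SYNC_INTERVAL_NS:
                # Sync marker: the video thread records the same marker on its
                # stream, letting post-processing detect and correct drift.
                writer.writerow([now, "sync", "marker", now // SYNC_INTERVAL_NS])
                last_sync = now
            writer.writerow([now, device, key, value])
```

Because both the video thread and the input logger stamp events from the same monotonic clock, the two streams can be merged deterministically in post-processing.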
The application was engineered for robustness during sustained sessions. Memory management, disk I/O buffering, and CPU scheduling were tuned to prevent frame drops or input lag during 4+ hour recording windows. We validated capture fidelity by replaying logged inputs against recorded footage and measuring temporal alignment error — consistently under 16ms (one frame at 60fps).
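One way to express that validation check, as a hedged sketch: given the frame and event timestamp streams (both in nanoseconds on the shared monotonic clock), compute each event's distance to the nearest frame and compare the worst case against the one-frame budget. Function and variable names here are illustrative, not the lab's actual tooling.

```python
import bisect

FRAME_BUDGET_NS = 16_666_667  # one frame at 60 fps, in nanoseconds

def alignment_errors(frame_ts, event_ts):
    """For each input event, return its distance (ns) to the nearest frame.

    Both arguments are monotonic-clock timestamps in nanoseconds; frame_ts
    must be sorted and non-empty. A capture passes validation when
    max(alignment_errors(...)) stays under FRAME_BUDGET_NS.
    """
    errors = []
    for t in event_ts:
        i = bisect.bisect_left(frame_ts, t)
        # Compare against the frames immediately before and after the event.
        candidates = frame_ts[max(i - 1, 0):i + 1]
        errors.append(min(abs(t - c) for c in candidates))
    return errors
```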
The solution scaled across multiple game titles, covering both first-person and third-person 3D environments, with players recruited across a range of skill levels and playstyles. Output format was standardized: per-frame JPEG streams paired with CSV control logs, each row containing timestamp, input device, key/axis, and value. A master manifest mapped each session to game title, player demographics, and session metadata.
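To illustrate the standardized row layout, here is a hypothetical control-log excerpt and a small parser. The column names and timestamp units are assumptions for the sketch, not the lab's published schema.

```python
import csv
import io

# Illustrative rows in the standardized control-log layout
# (timestamp, input device, key/axis, value); the data is hypothetical.
SAMPLE_LOG = """timestamp_us,device,key_axis,value
1709000000123456,keyboard,W,1
1709000000140211,mouse,dx,-14
1709000000140211,mouse,dy,3
1709000000251902,keyboard,W,0
"""

def load_control_log(text):
    """Parse a control log into typed records, one dict per row."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "timestamp_us": int(row["timestamp_us"]),
            "device": row["device"],
            "key_axis": row["key_axis"],
            "value": float(row["value"]),
        })
    return records
```

Key presses appear as value 1 (down) and 0 (up), while mouse axes carry signed per-event deltas — so a single flat schema covers every input device.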
The synchronized dataset enabled the lab to train agents that predict human actions from visual observations with significantly higher fidelity than models trained on video-only data. The frame-level action labels eliminated the need for inverse dynamics models to infer intent from pixel changes — a noisy and lossy intermediate step that had been a primary error source in prior work. The dataset also served as a benchmark for evaluating action-prediction architectures, with the control stream providing ground-truth supervision.
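To make the ground-truth supervision concrete, here is a sketch of how per-frame action labels might be assembled from the two streams — a hypothetical helper, not the lab's training pipeline. Each event is assigned to the frame that was on screen when it occurred.

```python
from bisect import bisect_right

def label_frames(frame_ts, events):
    """Group input events by the frame interval they occurred in.

    frame_ts: sorted frame timestamps; events: (timestamp, action) tuples.
    Returns a list of action lists, one per frame, usable as per-frame
    supervision targets alongside the corresponding images.
    """
    labels = [[] for _ in frame_ts]
    for t, action in events:
        # The frame shown at time t is the last one with timestamp <= t.
        idx = max(bisect_right(frame_ts, t) - 1, 0)
        labels[idx].append(action)
    return labels
```

With labels attached this way, a model can be trained directly on (frame, actions) pairs, with no inverse dynamics model needed to infer actions from pixel changes.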
Sample Capture Data
Real gameplay with synchronized input telemetry. Each clip includes the raw keystroke and mouse data captured alongside the video at microsecond precision.
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.