Game-Based Data Capture for Real-World Simulation
Challenge: Training agents to behave like humans in 3D environments requires paired observation-action data — synchronized video of what the player sees with a precise record of what they do — and no commercial capture tool provides this.
Solution: We designed and built a custom capture application from scratch.
Result: The synchronized dataset enabled the lab to train agents that predict human actions from visual observations with significantly higher fidelity than models trained on video-only data.
Training agents to behave like humans in 3D environments requires paired observation-action data — synchronized video of what the player sees with a precise record of what they do — and no commercial capture tool provides this. Existing screen-recording software captures pixels but discards control inputs entirely. Game telemetry APIs expose some state variables but not raw keystrokes at frame-level resolution. The lab needed a purpose-built solution that could record both streams with sub-frame temporal alignment, scale across different game engines and input devices, and sustain long capture sessions (4+ hours) without data loss, clock drift, or performance degradation on consumer hardware.
We designed and built a custom capture application from scratch. The system performs simultaneous screen recording at native resolution and raw input logging, capturing every keystroke, mouse movement, and controller input as structured data with microsecond-precision timestamps. Frame-level alignment between the video and control streams is maintained via a shared monotonic clock, with periodic sync markers to detect and correct any drift.
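As a minimal sketch of this logging scheme — assuming CSV output, Python's monotonic clock, and illustrative column names (the real application's implementation details are not shown here):

```python
import csv
import time

SYNC_INTERVAL_NS = 1_000_000_000  # emit a sync marker roughly once per second

def log_events(events, path):
    """Write input events with monotonic timestamps and periodic sync markers.

    `events` is an iterable of (device, key_or_axis, value) tuples; in a real
    capture application these would arrive from OS-level input hooks.
    """
    last_sync = time.monotonic_ns()
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_ns", "device", "key_axis", "value"])
        for device, key, value in events:
            now = time.monotonic_ns()
            if now - last_sync >= SYNC_INTERVAL_NS:
                # Sync marker: the video thread records the same marker on its
                # stream, letting post-processing detect and correct drift.
                writer.writerow([now, "sync", "marker", now // SYNC_INTERVAL_NS])
                last_sync = now
            writer.writerow([now, device, key, value])
```

Because both the video thread and the input logger stamp events from the same monotonic clock, the two streams can be merged deterministically in post-processing.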
The application was engineered for robustness during sustained sessions. Memory management, disk I/O buffering, and CPU scheduling were tuned to prevent frame drops or input lag during 4+ hour recording windows. We validated capture fidelity by replaying logged inputs against recorded footage and measuring temporal alignment error — consistently under 16ms (one frame at 60fps).
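One way to express that validation check, as a hedged sketch: given the frame and event timestamp streams (both in nanoseconds on the shared monotonic clock), compute each event's distance to the nearest frame and compare the worst case against the one-frame budget. Function and variable names here are illustrative, not the lab's actual tooling.

```python
import bisect

FRAME_BUDGET_NS = 16_666_667  # one frame at 60 fps, in nanoseconds

def alignment_errors(frame_ts, event_ts):
    """For each input event, return its distance (ns) to the nearest frame.

    Both arguments are monotonic-clock timestamps in nanoseconds; frame_ts
    must be sorted and non-empty. A capture passes validation when
    max(alignment_errors(...)) stays under FRAME_BUDGET_NS.
    """
    errors = []
    for t in event_ts:
        i = bisect.bisect_left(frame_ts, t)
        # Compare against the frames immediately before and after the event.
        candidates = frame_ts[max(i - 1, 0):i + 1]
        errors.append(min(abs(t - c) for c in candidates))
    return errors
```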
The solution scaled across multiple game titles, covering both first-person and third-person 3D environments, with players recruited across a range of skill levels and playstyles. Output format was standardized: per-frame JPEG streams paired with CSV control logs, each row containing timestamp, input device, key/axis, and value. A master manifest mapped each session to game title, player demographics, and session metadata.
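To illustrate the standardized row layout, here is a hypothetical control-log excerpt and a small parser. The column names and timestamp units are assumptions for the sketch, not the lab's published schema.

```python
import csv
import io

# Illustrative rows in the standardized control-log layout
# (timestamp, input device, key/axis, value); the data is hypothetical.
SAMPLE_LOG = """timestamp_us,device,key_axis,value
1709000000123456,keyboard,W,1
1709000000140211,mouse,dx,-14
1709000000140211,mouse,dy,3
1709000000251902,keyboard,W,0
"""

def load_control_log(text):
    """Parse a control log into typed records, one dict per row."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "timestamp_us": int(row["timestamp_us"]),
            "device": row["device"],
            "key_axis": row["key_axis"],
            "value": float(row["value"]),
        })
    return records
```

Key presses appear as value 1 (down) and 0 (up), while mouse axes carry signed per-event deltas — so a single flat schema covers every input device.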
The synchronized dataset enabled the lab to train agents that predict human actions from visual observations with significantly higher fidelity than models trained on video-only data. The frame-level action labels eliminated the need for inverse dynamics models to infer intent from pixel changes — a noisy and lossy intermediate step that had been a primary error source in prior work. The dataset also served as a benchmark for evaluating action-prediction architectures, with the control stream providing ground-truth supervision.
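To make the ground-truth supervision concrete, here is a sketch of how per-frame action labels might be assembled from the two streams — a hypothetical helper, not the lab's training pipeline. Each event is assigned to the frame that was on screen when it occurred.

```python
from bisect import bisect_right

def label_frames(frame_ts, events):
    """Group input events by the frame interval they occurred in.

    frame_ts: sorted frame timestamps; events: (timestamp, action) tuples.
    Returns a list of action lists, one per frame, usable as per-frame
    supervision targets alongside the corresponding images.
    """
    labels = [[] for _ in frame_ts]
    for t, action in events:
        # The frame shown at time t is the last one with timestamp <= t.
        idx = max(bisect_right(frame_ts, t) - 1, 0)
        labels[idx].append(action)
    return labels
```

With labels attached this way, a model can be trained directly on (frame, actions) pairs, with no inverse dynamics model needed to infer actions from pixel changes.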
Sample Capture Data
Real gameplay with synchronized input telemetry. Each clip includes the raw keystroke and mouse data captured alongside the video at microsecond precision.
Ready to build your next dataset?
Tell us about your project and we will scope a plan within 48 hours.