Egocentric Kitchen Video Dataset
First-person video of real kitchen activities — cooking, cleaning, organizing — captured across diverse home and commercial kitchen layouts with dense manipulation annotations for training robotic kitchen assistants and embodied AI systems.
Why Kitchen Video Data Matters for Robotics
Kitchens are the primary target environment for household robotics. Companies building robotic kitchen assistants need training data that captures the full complexity of real cooking workflows — multi-step recipes, tool handling, ingredient manipulation, and the spatial reasoning required to navigate cluttered countertops. First-person video from real kitchens provides this signal in a way that simulation cannot replicate.
The egocentric viewpoint is critical because it matches the camera perspective of a robot operating in the same space. A head-mounted or chest-mounted camera captures exactly what a robot's onboard camera would see: hands reaching into drawers, objects partially occluded by other items, steam from cooking, and the dynamic lighting conditions of real kitchens. This perspective alignment dramatically reduces the domain gap between training data and deployment.
Kitchen environments present unique challenges for computer vision and robot learning. Transparent objects (glass bowls, measuring cups), deformable materials (dough, vegetables being chopped), liquids, and steam create visual complexity that standard object detection struggles with. Training on diverse real kitchen video is the most reliable path to robust perception in these conditions.
Collection Methodology
Claru collectors wear lightweight head-mounted cameras (GoPro Hero 12 or similar) while performing genuine cooking tasks in their own kitchens. This is not scripted — collectors follow actual recipes and perform real meal preparation, generating naturalistic data with authentic object interactions, timing, and error recovery behaviors that scripted collection cannot produce.
Each collection session captures 30-90 minutes of continuous activity. Collectors are recruited across geographic regions to ensure diversity in kitchen layouts (galley, L-shaped, island, commercial), equipment (gas vs electric, various appliance brands), and cooking styles (Western, Asian, South Asian, Latin American). This diversity is essential for training models that generalize beyond a single kitchen configuration.
Raw video is captured at 1080p or 4K resolution at 30fps, with optional depth sensing from RealSense D435 cameras for collectors who use our depth rig. Each session includes metadata: kitchen layout sketch, appliance inventory, and recipe or task description. This metadata enables researchers to filter and subset the data for specific training objectives.
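As a sketch of how that metadata could support filtering and subsetting, the snippet below selects sessions by kitchen layout and capture resolution. The record fields (`layout`, `resolution`, `task`) are illustrative assumptions for this example, not Claru's actual delivery schema.

```python
# Sketch: filtering session metadata to build a training subset.
# Field names (layout, resolution, task) are illustrative assumptions,
# not Claru's actual delivery schema.

sessions = [
    {"id": "s001", "layout": "galley",   "resolution": "1080p", "task": "stir-fry"},
    {"id": "s002", "layout": "island",   "resolution": "4k",    "task": "baking"},
    {"id": "s003", "layout": "L-shaped", "resolution": "4k",    "task": "meal prep"},
]

def select_sessions(sessions, layouts=None, resolution=None):
    """Return IDs of sessions matching the requested layouts and resolution."""
    selected = []
    for s in sessions:
        if layouts is not None and s["layout"] not in layouts:
            continue
        if resolution is not None and s["resolution"] != resolution:
            continue
        selected.append(s["id"])
    return selected

print(select_sessions(sessions, resolution="4k"))  # ['s002', 's003']
```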
Annotation Layers
Temporal Action Segments
Start/end timestamps for every discrete action: open fridge, pick up knife, chop onion, stir pot. Follows EPIC-KITCHENS taxonomy with extensions for 200+ kitchen-specific verbs and nouns.
Hand-Object Contact Frames
Per-frame labels marking which hand is in contact with which object, including grasp type classification (power grasp, precision grasp, pinch) for manipulation policy training.
Object Bounding Boxes
2D bounding boxes tracked across frames for all manipulated objects, ingredients, tools, and containers. Includes object identity tracking through occlusion events.
Semantic Segmentation
Pixel-level segmentation masks for countertop surfaces, appliances, tools, food items, and hands. Enables scene understanding and spatial reasoning training.
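To make the annotation layers concrete, here is a minimal sketch of how temporal action segments and per-frame hand-object contact labels could be joined to recover the contact events inside one action. The record structure is an illustrative assumption, not Claru's actual annotation schema.

```python
# Sketch: joining temporal action segments with per-frame hand-object
# contact labels. The record structure is an illustrative assumption.

segments = [
    {"action": "chop onion", "start_frame": 120, "end_frame": 300},
    {"action": "stir pot",   "start_frame": 310, "end_frame": 450},
]

# Per-frame contact labels: frame index -> (hand, object, grasp_type)
contacts = {
    150: ("right", "knife", "power"),
    200: ("left",  "onion", "precision"),
    330: ("right", "spoon", "power"),
}

def contacts_in_segment(segment, contacts):
    """Collect contact events whose frame falls inside the action segment."""
    return {
        frame: event for frame, event in contacts.items()
        if segment["start_frame"] <= frame <= segment["end_frame"]
    }

print(contacts_in_segment(segments[0], contacts))
# frames 150 and 200 fall inside "chop onion" (120-300)
```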
Comparison with Public Kitchen Datasets
How Claru's kitchen video compares to publicly available academic datasets.
| Dataset | Clips | Hours | Kitchens | Annotations |
|---|---|---|---|---|
| EPIC-KITCHENS-100 | 90K | 100 | 45 | Actions, nouns, verbs |
| Ego4D (kitchen subset) | ~40K | ~200 | ~100 | Narrations, hands |
| YouCook2 | 2K | 176 | N/A (YouTube) | Recipe steps, descriptions |
| Claru Kitchen | 120K+ | 800+ | 50+ | Actions, hands, objects, segmentation, depth |
Use Cases and Model Training
Vision-language-action (VLA) models such as RT-2 and OpenVLA require diverse manipulation demonstrations to learn kitchen tasks. Claru's kitchen dataset provides the observation sequences these models need, with consistent action annotations that can be mapped to robot action spaces through retargeting. The diversity of kitchen layouts ensures the model does not overfit to a single environment.
World models trained on kitchen video learn the causal structure of cooking: what happens when you pour liquid into a hot pan, how dough deforms when kneaded, how steam disperses. These physical dynamics are critical for predictive models that plan multi-step cooking sequences. The temporal density of our annotations (every action boundary labeled) provides the supervision signal world models need.
Activity recognition and anticipation models benefit from the naturalistic collection protocol. Because collectors perform real cooking rather than scripted motions, the data includes genuine error recovery, multitasking (stirring while chopping), and realistic timing variation that synthetic or scripted datasets lack.
Frequently Asked Questions
What resolution and frame rate is the video captured at?
Standard collection is 1920x1080 at 30fps. We also support 4K (3840x2160) collection at 30fps for projects requiring higher spatial resolution, and 60fps for projects studying fast hand movements. Depth data, when included, is captured at 848x480 at 30fps from Intel RealSense D435 cameras.
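Because action annotations are timestamped while training pipelines typically index by frame, a small conversion helper is often useful. The sketch below assumes 30fps video and second-denominated timestamps; the frame rate would change for 60fps collections.

```python
# Sketch: mapping annotation timestamps (seconds) to frame indices,
# assuming 30 fps video (use 60 for high-frame-rate collections).
FPS = 30

def timestamp_to_frame(t_seconds: float, fps: int = FPS) -> int:
    """Round a timestamp to its nearest frame index."""
    return round(t_seconds * fps)

def frame_to_timestamp(frame: int, fps: int = FPS) -> float:
    """Convert a frame index back to a timestamp in seconds."""
    return frame / fps

print(timestamp_to_frame(4.5))   # 135
print(frame_to_timestamp(135))   # 4.5
```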
How many kitchen layouts does the dataset cover?
The current dataset includes over 50 unique kitchen layouts across North America, Europe, and Asia. Layouts range from compact apartment kitchens (under 50 sq ft) to large commercial kitchens (500+ sq ft), with diverse configurations including galley, L-shaped, U-shaped, and island layouts. Each kitchen is documented with a floor plan sketch and appliance inventory.
Can you deliver the data in robot-learning formats?
Yes. Claru delivers kitchen video data in any standard robotics format including RLDS (TensorFlow Datasets), HDF5, WebDataset, zarr, and LeRobot format. We handle all format conversion as part of the delivery pipeline, so you receive data ready to load directly into your training framework.
Request a Sample Pack
Get a curated sample of egocentric kitchen video with full annotations to evaluate for your robotics or embodied AI project.