Egocentric Kitchen Video Dataset

First-person video of real kitchen activities — cooking, cleaning, organizing — captured across diverse home and commercial kitchen layouts with dense manipulation annotations for training robotic kitchen assistants and embodied AI systems.

Why Kitchen Video Data Matters for Robotics

Kitchens are the primary target environment for household robotics. Companies building robotic kitchen assistants need training data that captures the full complexity of real cooking workflows — multi-step recipes, tool handling, ingredient manipulation, and the spatial reasoning required to navigate cluttered countertops. First-person video from real kitchens provides this signal in a way that simulation cannot replicate.

The egocentric viewpoint is critical because it matches the camera perspective of a robot operating in the same space. A head-mounted or chest-mounted camera captures exactly what a robot's onboard camera would see: hands reaching into drawers, objects partially occluded by other items, steam from cooking, and the dynamic lighting conditions of real kitchens. This perspective alignment dramatically reduces the domain gap between training data and deployment.

Kitchen environments present unique challenges for computer vision and robot learning. Transparent objects (glass bowls, measuring cups), deformable materials (dough, vegetables being chopped), liquids, and steam create visual complexity that standard object detection struggles with. Training on diverse real kitchen video is the most reliable path to robust perception in these conditions.

Dataset at a Glance

120K+ video clips
800+ hours recorded
50+ kitchen layouts
15+ annotation layers

Collection Methodology

Claru collectors wear lightweight head-mounted cameras (GoPro Hero 12 or similar) while performing genuine cooking tasks in their own kitchens. This is not scripted — collectors follow actual recipes and perform real meal preparation, generating naturalistic data with authentic object interactions, timing, and error recovery behaviors that scripted collection cannot produce.

Each collection session captures 30-90 minutes of continuous activity. Collectors are recruited across geographic regions to ensure diversity in kitchen layouts (galley, L-shaped, island, commercial), equipment (gas vs electric, various appliance brands), and cooking styles (Western, Asian, South Asian, Latin American). This diversity is essential for training models that generalize beyond a single kitchen configuration.

Raw video is captured at 1080p or 4K resolution at 30fps, with optional depth sensing from RealSense D435 cameras for collectors who use our depth rig. Each session includes metadata: kitchen layout sketch, appliance inventory, and recipe or task description. This metadata enables researchers to filter and subset the data for specific training objectives.
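As a sketch of how such session metadata could drive filtering, the snippet below uses illustrative field names and values (`layout`, `stove`, `has_depth`, and the session records themselves are hypothetical, not Claru's actual schema):

```python
# Hypothetical per-session metadata records; field names are illustrative only.
sessions = [
    {"id": "s001", "layout": "galley", "stove": "gas",
     "resolution": "4K", "has_depth": True, "task": "stir-fry"},
    {"id": "s002", "layout": "island", "stove": "electric",
     "resolution": "1080p", "has_depth": False, "task": "baking"},
]

def filter_sessions(sessions, **criteria):
    """Return sessions whose metadata matches every key/value criterion."""
    return [s for s in sessions
            if all(s.get(k) == v for k, v in criteria.items())]

# Subset to depth-equipped sessions on gas stoves.
depth_gas = filter_sessions(sessions, has_depth=True, stove="gas")
print([s["id"] for s in depth_gas])  # ['s001']
```

The same pattern extends to any metadata key, so a training pipeline can subset by layout, appliance type, or cooking style before download.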

Annotation Layers

🕐 Temporal Action Segments

Start/end timestamps for every discrete action: open fridge, pick up knife, chop onion, stir pot. Follows the EPIC-KITCHENS taxonomy with extensions for 200+ kitchen-specific verbs and nouns.
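A minimal sketch of how temporal action segments can be represented and queried; the `ActionSegment` fields are illustrative and follow the EPIC-KITCHENS verb/noun convention, not the delivered file schema:

```python
from dataclasses import dataclass

@dataclass
class ActionSegment:
    """One annotated action span (fields are illustrative, not the
    delivered schema); verb/noun follow an EPIC-KITCHENS-style taxonomy."""
    start_s: float
    end_s: float
    verb: str
    noun: str

segments = [
    ActionSegment(0.0, 3.2, "open", "fridge"),
    ActionSegment(3.2, 5.0, "pick-up", "knife"),
    ActionSegment(5.0, 21.4, "chop", "onion"),
]

def action_at(segments, t):
    """Return the segment active at time t (seconds), or None."""
    for seg in segments:
        if seg.start_s <= t < seg.end_s:
            return seg
    return None

seg = action_at(segments, 10.0)
print(seg.verb, seg.noun)  # chop onion
```

Because every action boundary is labeled, lookups like this give dense frame-level supervision for action recognition and anticipation models.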

🤲 Hand-Object Contact Frames

Per-frame labels marking which hand is in contact with which object, including grasp type classification (power grasp, precision grasp, pinch) for manipulation policy training.
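Per-frame contact labels are typically consumed as contiguous episodes. The sketch below groups consecutive frames with identical labels into contact episodes; the tuple layout `(frame, hand, object, grasp)` is a hypothetical representation, not Claru's delivery format:

```python
# Hypothetical per-frame labels: (frame_idx, hand, object, grasp_type);
# None marks frames with no hand-object contact.
contacts = [
    (0, "right", "knife", "power"),
    (1, "right", "knife", "power"),
    (2, "right", "knife", "power"),
    (3, None, None, None),
    (4, "left", "onion", "precision"),
]

def contact_episodes(contacts):
    """Merge runs of consecutive frames with the same (hand, object, grasp)
    into (start_frame, end_frame, hand, object, grasp) episodes."""
    episodes = []
    for frame, hand, obj, grasp in contacts:
        if hand is None:
            continue  # no contact on this frame
        last = episodes[-1] if episodes else None
        if (last and last[2] == hand and last[3] == obj
                and last[4] == grasp and last[1] == frame - 1):
            last[1] = frame  # extend the current episode
        else:
            episodes.append([frame, frame, hand, obj, grasp])
    return [tuple(e) for e in episodes]

print(contact_episodes(contacts))
# [(0, 2, 'right', 'knife', 'power'), (4, 4, 'left', 'onion', 'precision')]
```

Episode boundaries like these are what manipulation policy training pipelines use to segment grasp demonstrations.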

📦 Object Bounding Boxes

2D bounding boxes tracked across frames for all manipulated objects, ingredients, tools, and containers. Includes object identity tracking through occlusion events.
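One common way identity tracking through occlusion is evaluated or re-associated downstream is intersection-over-union (IoU) matching between frames. The helpers below are a generic sketch of that technique, not Claru's annotation tooling:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    if inter == 0.0:
        return 0.0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def match_track(prev_box, candidates, threshold=0.5):
    """Greedily continue a track: pick the candidate with highest IoU
    against the previous frame's box, or return None when the best
    overlap falls below the threshold (e.g. during an occlusion)."""
    best = max(candidates, key=lambda b: iou(prev_box, b), default=None)
    if best is None or iou(prev_box, best) < threshold:
        return None
    return best
```

When `match_track` returns None for several frames and the object later reappears, the dataset's identity tracking is what lets the resumed boxes keep the original object ID.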

🎨 Semantic Segmentation

Pixel-level segmentation masks for countertop surfaces, appliances, tools, food items, and hands. Enables scene understanding and spatial reasoning training.

Comparison with Public Kitchen Datasets

How Claru's kitchen video compares to publicly available academic datasets.

| Dataset | Clips | Hours | Kitchens | Annotations |
|---|---|---|---|---|
| EPIC-KITCHENS-100 | 90K | 100 | 45 | Actions, nouns, verbs |
| Ego4D (kitchen subset) | ~40K | ~200 | ~100 | Narrations, hands |
| YouCook2 | 2K | 176 | N/A (YouTube) | Recipe steps, descriptions |
| Claru Kitchen | 120K+ | 800+ | 50+ | Actions, hands, objects, segmentation, depth |

Use Cases and Model Training

Vision-language-action (VLA) models such as RT-2 and OpenVLA require diverse manipulation demonstrations to learn kitchen tasks. Claru's kitchen dataset provides the observation sequences these models need, with consistent action annotations that can be mapped to robot action spaces through retargeting. The diversity of kitchen layouts helps keep the model from overfitting to a single environment.

World models trained on kitchen video learn the causal structure of cooking: what happens when you pour liquid into a hot pan, how dough deforms when kneaded, how steam disperses. These physical dynamics are critical for predictive models that plan multi-step cooking sequences. The temporal density of our annotations (every action boundary labeled) provides the supervision signal world models need.

Activity recognition and anticipation models benefit from the naturalistic collection protocol. Because collectors perform real cooking rather than scripted motions, the data includes genuine error recovery, multitasking (stirring while chopping), and realistic timing variation that synthetic or scripted datasets lack.

Frequently Asked Questions

What resolution and frame rate is the video captured at?

Standard collection is 1920x1080 at 30fps. We also support 4K (3840x2160) collection at 30fps for projects requiring higher spatial resolution, and 60fps for projects studying fast hand movements. Depth data, when included, is captured at 848x480 at 30fps from Intel RealSense D435 cameras.

How many different kitchens are represented?

The current dataset includes over 50 unique kitchen layouts across North America, Europe, and Asia. Layouts range from compact apartment kitchens (under 50 sq ft) to large commercial kitchens (500+ sq ft), with diverse configurations including galley, L-shaped, U-shaped, and island layouts. Each kitchen is documented with a floor plan sketch and appliance inventory.

What delivery formats are supported?

Claru delivers kitchen video data in any standard robotics format, including RLDS (TensorFlow Datasets), HDF5, WebDataset, zarr, and LeRobot format. We handle all format conversion as part of the delivery pipeline, so you receive data ready to load directly into your training framework.

Request a Sample Pack

Get a curated sample of egocentric kitchen video with full annotations to evaluate for your robotics or embodied AI project.