Preserving Object Identity Across Video Time
Challenge: Video understanding models struggle with identity persistence — maintaining that object A at time T is the same entity as object A at time T+N when appearance changes due to viewpoint rotation, partial occlusion, lighting shifts, or motion blur.
Solution: The pipeline operates in four stages.
Result: The 1.07M+ identity-persistence pairs became primary training data for the client's video understanding and generation models.
Video understanding models struggle with identity persistence — maintaining that object A at time T is the same entity as object A at time T+N when appearance changes due to viewpoint rotation, partial occlusion, lighting shifts, or motion blur. Existing datasets for re-identification focus narrowly on pedestrians or vehicles, and constructing identity-persistence training data manually is prohibitively expensive because annotators must watch extended video sequences and make subtle perceptual judgments about whether two object instances are genuinely the same entity. The client needed a pipeline that could produce high-confidence identity pairs at million-unit scale across diverse object categories (products, people, animals) from licensed video sources, without requiring annotators to watch hours of footage per pair.
The pipeline operates in four stages. Stage 1 selects two clips from the same licensed video source separated by a configurable time gap — the temporal distance between the clips is a tunable parameter that controls the difficulty of the identity-persistence task. Stage 2 samples frames from each clip and runs semantic segmentation to identify candidate objects, producing bounding boxes and — for people — facial keypoints, filtering for objects that are sufficiently complete, unobstructed, and visually distinct to serve as identity anchors.
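Stages 1 and 2 can be sketched in a few lines. The parameter names here (clip length, gap, detector score threshold) are illustrative assumptions — the source does not specify concrete values — but the structure follows the description above: pick two clips separated by a configurable gap, then keep only detections confident enough to serve as identity anchors.

```python
import random

def sample_clip_pair(video_duration_s, clip_len_s, time_gap_s, seed=0):
    """Stage 1 (sketch): choose start/end times for two clips from the
    same video, separated by a configurable time gap. A larger gap makes
    the downstream identity-persistence task harder."""
    rng = random.Random(seed)
    # Latest the first clip can start while leaving room for
    # clip1 + gap + clip2 inside the video.
    latest_start = video_duration_s - (2 * clip_len_s + time_gap_s)
    if latest_start < 0:
        raise ValueError("video too short for the requested clips and gap")
    t1 = rng.uniform(0, latest_start)
    t2 = t1 + clip_len_s + time_gap_s
    return (t1, t1 + clip_len_s), (t2, t2 + clip_len_s)

def filter_candidates(detections, min_score=0.8):
    """Stage 2 (sketch): keep detections that are confident enough to act
    as identity anchors. Each detection is a dict shaped like the sample
    record at the end of this page: {"bbox": [x1, y1, x2, y2],
    "class": ..., "score": ...}. The 0.8 threshold is an assumption."""
    keep = []
    for d in detections:
        x1, y1, x2, y2 = d["bbox"]
        area = (x2 - x1) * (y2 - y1)
        if d["score"] >= min_score and area > 0:
            keep.append(d)
    return keep
```

Because the gap is a single argument, the same sampling code serves both easy (small-gap) and hard (large-gap) pair generation.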
Stage 3 is human validation: annotators confirm whether a candidate object from the earlier clip is identifiable in the later clip using aligned thumbnail comparisons and bounding box overlays. This is a targeted perceptual judgment ("Is this the same entity?") rather than an open-ended annotation task, keeping per-item annotation time under 30 seconds. Stage 4 forms the final identity pairs, with optional similarity labeling that provides finer-grained supervision (identical, same-category-different-instance, or unrelated).
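The output of Stages 3 and 4 can be modeled as a small record type. The field names and schema below are hypothetical, but the three similarity levels come directly from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Similarity(Enum):
    """Optional Stage 4 fine-grained label (values from the text above)."""
    IDENTICAL = "identical"
    SAME_CATEGORY = "same_category_different_instance"
    UNRELATED = "unrelated"

@dataclass
class IdentityPair:
    """One verified identity pair (hypothetical schema)."""
    anchor_bbox: Tuple[int, int, int, int]     # object in the earlier clip
    candidate_bbox: Tuple[int, int, int, int]  # object in the later clip
    confirmed: bool                            # Stage 3 annotator verdict
    similarity: Optional[Similarity] = None    # optional Stage 4 label

# Illustrative record (bbox values made up for the example):
pair = IdentityPair((851, 116, 1174, 494), (640, 90, 980, 470),
                    confirmed=True, similarity=Similarity.IDENTICAL)
```

Keeping the Stage 3 verdict as a plain boolean, with the similarity label optional, mirrors the pipeline's split between the fast perceptual judgment and the finer-grained supervision.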
The hybrid automation-validation architecture is deliberate: automation handles the computationally tractable parts (temporal sampling, object detection, segmentation, keypoint extraction), while humans handle the perceptually difficult part (identity confirmation) that current vision models cannot reliably perform. This division scaled to 1.07M+ verifications while maintaining high confidence in the resulting pairs.
The 1.07M+ identity-persistence pairs became primary training data for the client's video understanding and generation models. Models trained on these pairs showed measurable improvement in re-identification and tracking tasks compared to models trained on static image pairs or single-frame datasets. A key finding was that temporal distance proved as critical as visual clarity — sampling farther apart in time increased the difficulty and diversity of appearance changes, producing more robust identity representations without adding annotation cost. The pipeline's configurable time-gap parameter allows the client to generate progressively harder training curricula for curriculum-learning approaches.
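One way to turn the configurable time-gap parameter into a training curriculum is a simple increasing schedule. The geometric progression below is an illustrative choice, not the client's actual schedule.

```python
def gap_curriculum(start_gap_s: float, factor: float, stages: int) -> list:
    """Return one time gap (in seconds) per curriculum stage, growing
    geometrically so later stages sample clips farther apart in time —
    harder appearance changes at no extra annotation cost."""
    return [start_gap_s * factor ** i for i in range(stages)]

# e.g. gap_curriculum(5.0, 2.0, 4) -> gaps of 5, 10, 20, 40 seconds
```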
Representative record from the annotation pipeline:

{
  "project_title": "Object Identity v4: Multiple Segments",
  "classification_id": "b04ca94e-ea32-4215-91bc-a750e06df806",
  "segment1_image_1": "yes",
  "segment2_image_2": "yes",
  "segment3_image_3": "yes",
  "segment": [
    {
      "bbox": [851, 116, 1174, 494],
      "class": "face",
      "score": 0.882
    }
  ],
  "created_at": "May 16, 2025, 10:09 PM",
  "status": "completed"
}