Preserving Object Identity Across Video Time
Challenge: Video understanding models struggle with identity persistence — maintaining that object A at time T is the same entity as object A at time T+N when appearance changes due to viewpoint rotation, partial occlusion, lighting shifts, or motion blur.
Solution: The pipeline operates in four stages.
Result: The 1.07M+ identity-persistence pairs became primary training data for the client's video understanding and generation models.
Video understanding models struggle with identity persistence — maintaining that object A at time T is the same entity as object A at time T+N when appearance changes due to viewpoint rotation, partial occlusion, lighting shifts, or motion blur. Existing datasets for re-identification focus narrowly on pedestrians or vehicles, and constructing identity-persistence training data manually is prohibitively expensive because annotators must watch extended video sequences and make subtle perceptual judgments about whether two object instances are genuinely the same entity. The client needed a pipeline that could produce high-confidence identity pairs at million-unit scale across diverse object categories (products, people, animals) from licensed video sources, without requiring annotators to watch hours of footage per pair.
The pipeline operates in four stages. Stage 1 selects two clips from the same licensed video source separated by a configurable time gap — the temporal distance between the clips is a tunable parameter that controls the difficulty of the identity-persistence task. Stage 2 samples frames from each clip and runs semantic segmentation to identify candidate objects, producing bounding boxes and — for people — facial keypoints, filtering for objects that are sufficiently complete, unobstructed, and visually distinct to serve as identity anchors.
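Stages 1 and 2 can be sketched in a few lines. The parameter names here (clip length, gap, detector score threshold) are illustrative assumptions — the source does not specify concrete values — but the structure follows the description above: pick two clips separated by a configurable gap, then keep only detections confident enough to serve as identity anchors.

```python
import random

def sample_clip_pair(video_duration_s, clip_len_s, time_gap_s, seed=0):
    """Stage 1 (sketch): choose start/end times for two clips from the
    same video, separated by a configurable time gap. A larger gap makes
    the downstream identity-persistence task harder."""
    rng = random.Random(seed)
    # Latest the first clip can start while leaving room for
    # clip1 + gap + clip2 inside the video.
    latest_start = video_duration_s - (2 * clip_len_s + time_gap_s)
    if latest_start < 0:
        raise ValueError("video too short for the requested clips and gap")
    t1 = rng.uniform(0, latest_start)
    t2 = t1 + clip_len_s + time_gap_s
    return (t1, t1 + clip_len_s), (t2, t2 + clip_len_s)

def filter_candidates(detections, min_score=0.8):
    """Stage 2 (sketch): keep detections that are confident enough to act
    as identity anchors. Each detection is a dict shaped like the sample
    record at the end of this page: {"bbox": [x1, y1, x2, y2],
    "class": ..., "score": ...}. The 0.8 threshold is an assumption."""
    keep = []
    for d in detections:
        x1, y1, x2, y2 = d["bbox"]
        area = (x2 - x1) * (y2 - y1)
        if d["score"] >= min_score and area > 0:
            keep.append(d)
    return keep
```

Because the gap is a single argument, the same sampling code serves both easy (small-gap) and hard (large-gap) pair generation.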
Stage 3 is human validation: annotators confirm whether a candidate object from the earlier clip is identifiable in the later clip using aligned thumbnail comparisons and bounding box overlays. This is a targeted perceptual judgment ("Is this the same entity?") rather than an open-ended annotation task, keeping per-item annotation time under 30 seconds. Stage 4 forms the final identity pairs, with optional similarity labeling that provides finer-grained supervision (identical, same-category-different-instance, or unrelated).
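The output of Stages 3 and 4 can be modeled as a small record type. The field names and schema below are hypothetical, but the three similarity levels come directly from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Similarity(Enum):
    """Optional Stage 4 fine-grained label (values from the text above)."""
    IDENTICAL = "identical"
    SAME_CATEGORY = "same_category_different_instance"
    UNRELATED = "unrelated"

@dataclass
class IdentityPair:
    """One verified identity pair (hypothetical schema)."""
    anchor_bbox: Tuple[int, int, int, int]     # object in the earlier clip
    candidate_bbox: Tuple[int, int, int, int]  # object in the later clip
    confirmed: bool                            # Stage 3 annotator verdict
    similarity: Optional[Similarity] = None    # optional Stage 4 label

# Illustrative record (bbox values made up for the example):
pair = IdentityPair((851, 116, 1174, 494), (640, 90, 980, 470),
                    confirmed=True, similarity=Similarity.IDENTICAL)
```

Keeping the Stage 3 verdict as a plain boolean, with the similarity label optional, mirrors the pipeline's split between the fast perceptual judgment and the finer-grained supervision.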
The hybrid automation-validation architecture is deliberate: automation handles the computationally tractable parts (temporal sampling, object detection, segmentation, keypoint extraction), while humans handle the perceptually difficult part (identity confirmation) that current vision models cannot reliably perform. This division scaled to 1.07M+ verifications while maintaining high confidence in the resulting pairs.
The 1.07M+ identity-persistence pairs became primary training data for the client's video understanding and generation models. Models trained on these pairs showed measurable improvement in re-identification and tracking tasks compared to models trained on static image pairs or single-frame datasets. A key finding was that temporal distance proved as critical as visual clarity — sampling farther apart in time increased the difficulty and diversity of appearance changes, producing more robust identity representations without adding annotation cost. The pipeline's configurable time-gap parameter allows the client to generate progressively harder training curricula for curriculum-learning approaches.
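One way to turn the configurable time-gap parameter into a training curriculum is a simple increasing schedule. The geometric progression below is an illustrative choice, not the client's actual schedule.

```python
def gap_curriculum(start_gap_s: float, factor: float, stages: int) -> list:
    """Return one time gap (in seconds) per curriculum stage, growing
    geometrically so later stages sample clips farther apart in time —
    harder appearance changes at no extra annotation cost."""
    return [start_gap_s * factor ** i for i in range(stages)]

# e.g. gap_curriculum(5.0, 2.0, 4) -> gaps of 5, 10, 20, 40 seconds
```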
Representative record from the annotation pipeline:

{
  "project_title": "Object Identity v4: Multiple Segments",
  "classification_id": "b04ca94e-ea32-4215-91bc-a750e06df806",
  "segment1_image_1": "yes",
  "segment2_image_2": "yes",
  "segment3_image_3": "yes",
  "segment": [
    {
      "bbox": [851, 116, 1174, 494],
      "class": "face",
      "score": 0.882
    }
  ],
  "created_at": "May 16, 2025, 10:09 PM",
  "status": "completed"
}