Video Annotation Concepts


We annotate video to capture meaning in motion: cars moving, a basketball shot being taken, factory equipment operating. Because of this motion, the default assumption when annotating is that every frame is different.

Consider this raw video:

  • Cars move in and out of the frame
  • This happens at different points in time

Attributes of the car may change too. For example, in one frame it may be fully visible and in the next partially occluded.
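As a rough illustration of the "every frame is different" default, the sketch below stores annotations per frame. The dictionary layout, the field names (`label`, `box`, `occluded`), and the helper `frames_where` are assumptions made for this example, not a documented format.

```python
# Hypothetical layout: frame number -> list of annotations drawn on that frame.
annotations_by_frame = {
    10: [{"label": "car", "box": (50, 80, 210, 160), "occluded": False}],
    11: [{"label": "car", "box": (58, 80, 218, 160), "occluded": True}],
    12: [],  # the car has left the frame
}


def frames_where(annotations, label, **attributes):
    """Return sorted frame numbers that hold a `label` annotation
    whose fields match every given attribute value."""
    return sorted(
        frame
        for frame, items in annotations.items()
        if any(
            item["label"] == label
            and all(item.get(key) == value for key, value in attributes.items())
            for item in items
        )
    )


occluded_frames = frames_where(annotations_by_frame, "car", occluded=True)
visible_frames = frames_where(annotations_by_frame, "car", occluded=False)
```

Each frame carries its own box and attribute values, so a change in occlusion from one frame to the next is recorded directly.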

Static Objects

Sometimes a video will have a static object, for example an intersection with a traffic light that's not moving, or a retail store with a shelving unit or other display that doesn't move (or doesn't move often).

You can represent this with either:

  • A single keyframe, e.g. at frame 0.
  • A keyframe at the start and at the end, e.g. (0, 656).

This effectively tags it as a "static object".
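The two options can be sketched as follows. This is a minimal sketch, assuming a keyframe is a mapping from frame number to a box and that frames between keyframes reuse the most recent keyframe. That "hold" rule, the box values, and the function `box_at` are assumptions for illustration only.

```python
from bisect import bisect_right


def box_at(keyframes, frame):
    """Return the box of the latest keyframe at or before `frame`.

    Returns None when `frame` comes before the first keyframe.
    """
    frames = sorted(keyframes)
    index = bisect_right(frames, frame)
    if index == 0:
        return None
    return keyframes[frames[index - 1]]


shelf_box = (40, 120, 300, 260)  # hypothetical (x_min, y_min, x_max, y_max)

# Option 1: a single keyframe at frame 0.
single_keyframe = {0: shelf_box}

# Option 2: keyframes at the start and end, i.e. frames 0 and 656.
start_end_keyframes = {0: shelf_box, 656: shelf_box}

mid_video_single = box_at(single_keyframe, 300)
mid_video_start_end = box_at(start_end_keyframes, 300)
```

Under this assumption both representations give the same box at any frame in the video, which is what makes the object effectively static.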

Persistent Objects

A video may have multiple objects in it. For example, 2 different cars, apples, football players, etc.

From the human perspective, we know that a football player (e.g. Ronaldo) is the same person in frames 0, 5, 10, etc. In every frame, he is still Ronaldo.


But this is not as clear to a computer. To help it, we create a sequence object, e.g. "Ronaldo". Since it's the first object we created, it gets assigned sequence #1.
If another player, "Messi", was also in the frame, we could create a new sequence for him, and he would get #2. Another player would get #3, and so on.

The key point is that each sequence represents a real-world object (or a series of events) and has a number that's unique within each video.
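The numbering described above can be sketched like this. The class `VideoSequences` and its method names are hypothetical, and the sketch assumes numbers are handed out in creation order starting at 1 and are tracked separately for each video.

```python
class VideoSequences:
    """Tracks the sequences created for a single video (illustrative only)."""

    def __init__(self):
        self._numbers = {}  # sequence name -> sequence number

    def get_or_create(self, name):
        """Return the number for `name`, assigning the next free one if new."""
        if name not in self._numbers:
            self._numbers[name] = len(self._numbers) + 1
        return self._numbers[name]


video_a = VideoSequences()
ronaldo = video_a.get_or_create("Ronaldo")        # first sequence created
messi = video_a.get_or_create("Messi")            # second sequence created
ronaldo_again = video_a.get_or_create("Ronaldo")  # same object, same number

# A different video keeps its own numbering.
video_b = VideoSequences()
first_in_b = video_b.get_or_create("Messi")
```

Annotations of Ronaldo at frames 0, 5, and 10 would all carry the same sequence number, which is how the frames are tied back to one real-world object.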