Video Import Specifications

Formats, Frames, Processing Time


Admin Guide



Working with video is challenging. Diffgram has a variety of features that ease some of the burden of working with video and getting the best annotation results.

Raw video files are often multiple gigabytes in size, yet an annotator should be able to load a file on their computer in seconds.
Diffgram solves this by creating "web ready" versions of the video, much as YouTube has to "process" an uploaded video before it is ready to be shown.

While it is possible to create training data without considering these concepts, using them produces higher quality, more reliable results, especially at scale.

Automatic features

  • Web ready format (for playback)
  • Frame by frame parsing (for actual annotation)
    In addition to the primary benefit of frame level accuracy guarantee, this also supports automatic preview crops, TFRecords / binary exports, and Deep Learning Pre-Labeling.

Optional features

  • File splits
  • FPS conversions

Preparing files for import

  • Web ready formats process much faster than raw files. If possible, preprocess your media into smaller file sizes. If relevant, downsize to 2048 x 2048 or smaller.
  • Lower frame rate and more splits (for long files) process faster.
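
As a rough sketch of the preprocessing suggested above, the following Python builds an FFmpeg command that downsizes a video to fit within 2048 x 2048 and lowers its frame rate. The function names, the 30 FPS target, and the H.264 and "+faststart" choices are illustrative assumptions, not Diffgram requirements. The command is only constructed here, not executed.

```python
def fit_within(width, height, max_side=2048):
    """Return even (width, height) that fit inside max_side x max_side, never upscaling."""
    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic keeps the aspect ratio without float rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest
    # Round down to even values, which H.264 with 4:2:0 chroma expects.
    return max(2, width - width % 2), max(2, height - height % 2)


def build_preprocess_command(src, dst, width, height, target_fps=30):
    """Build (but do not run) an ffmpeg command line as an argument list."""
    out_w, out_h = fit_within(width, height)
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale={out_w}:{out_h}",  # resize to the computed dimensions
        "-r", str(target_fps),             # output frame rate (assumed target)
        "-c:v", "libx264",                 # widely supported video codec
        "-movflags", "+faststart",         # place the mp4 index at the front
        dst,
    ]


if __name__ == "__main__":
    # A 4096 x 2160 source is scaled to 2048 x 1080.
    print(" ".join(build_preprocess_command("raw.mov", "ready.mp4", 4096, 2160)))
```

The argument list can be passed to subprocess.run on a machine with FFmpeg installed. The source width and height are supplied by the caller here, so the sketch stays free of external dependencies.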

A video can be imported by dragging and dropping it in the UI or from the SDK/API. The SDK implementation is strongly recommended for large file sizes and large volumes of files.

Behind the scenes

Time based approaches to video annotation aren't accurate for deep learning.

A common older approach is to keep the video file exactly as it is and attempt to annotate over the top of it. The problem is that the HTML5 media element (HTMLMediaElement) does not guarantee frame level accuracy. See the full discussion on W3C standards. Even in best case scenarios with known FPS values, off-by-one errors are common.
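
To make the off-by-one problem concrete, here is a simplified model, not a description of how any particular browser behaves. It assumes a constant frame rate and a playback clock that reports time truncated to whole milliseconds, then maps that time back to a frame index. All function names are illustrative.

```python
import math
from fractions import Fraction


def frame_start_time(frame, fps):
    """Exact start time (in seconds) of a frame at a constant frame rate."""
    return Fraction(frame) / Fraction(fps)


def clock_in_whole_ms(seconds):
    """Simulate a playback clock that truncates time to whole milliseconds."""
    return Fraction(math.floor(seconds * 1000), 1000)


def frame_from_time(seconds, fps):
    """Time based lookup: which frame index does this timestamp fall in?"""
    return math.floor(seconds * fps)


def count_off_by_one(fps, frames):
    """Count frames whose truncated timestamp maps back to a different frame."""
    wrong = 0
    for frame in range(frames):
        reported = clock_in_whole_ms(frame_start_time(frame, fps))
        if frame_from_time(reported, fps) != frame:
            wrong += 1
    return wrong


if __name__ == "__main__":
    # At 30 FPS, 20 of the first 30 frames resolve to the previous frame.
    print(count_off_by_one(fps=30, frames=30))
```

In this model, frame 1 starts at 1/30 s, the clock reports 0.033 s, and 0.033 x 30 = 0.99, which rounds down to frame 0. Only frames whose start time is a whole number of milliseconds survive the round trip.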

Frame Level Accuracy

To guarantee frame level accuracy, Diffgram parses every frame using FFmpeg and replaces the video frame with the exact image frame prior to annotation in the UI. Because deep learning approaches ultimately look at tensors that consist of per frame data, this guarantees alignment.


We are used to converting videos to a set of images first. How does this relate? How do we maintain order?

The best way to handle video in Diffgram is to upload the raw video - without pre-processing it into images. At a high level, it maintains the video abstraction and reduces risks associated with converting the video as a separate process. In detail:

For Annotators:

  • Video can then be played as normal, including playback with annotations, changing playback rate, etc.
  • Annotators can use all the Sequence and other features built for video.

For Admins & Data Science:

  • It's easier to structure tasks since they can be done at the whole video file level.
  • Video files can be added to datasets and more easily moved around.
  • Integration with other features, such as video split and global frame number.

For Engineering:

  • It's easier to send and verify because one file is sent instead of thousands.
  • Order is automatically preserved. No need to worry about file naming schemes.
  • The export includes the converted video, and because we provide the global frame reference, you can always go back to your original video and use that index if there's ever a need to regenerate or adjust that formatting.
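
One way to picture a global frame reference when a video has been split is sketched below. It assumes equally sized splits and zero-based frame numbers, and the function and parameter names are hypothetical. The export's own global frame value should be used directly where it is available.

```python
def to_global_frame(local_frame, split_index, frames_per_split):
    """Map a frame number inside one split back to the original video."""
    return split_index * frames_per_split + local_frame


def to_local_frame(global_frame, frames_per_split):
    """Inverse mapping: return (split_index, local_frame)."""
    return divmod(global_frame, frames_per_split)


if __name__ == "__main__":
    # Frame 25 of the third split (index 2), with 1000 frames per split.
    print(to_global_frame(25, split_index=2, frames_per_split=1000))
    print(to_local_frame(2025, frames_per_split=1000))
```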

Import Format

Accepted formats: mp4, mov, avi.

The default file size limit is 5 GB. Please contact us if you wish to process larger videos. Enterprise plans can process files up to 100 GB, with some limitations.
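
A local check before uploading can catch format and size problems early. The sketch below is only an illustration: it reads "5GB" as 5 GiB, which is an assumption, and the function name is hypothetical. It checks the file extension only, not the codec.

```python
from pathlib import Path

ACCEPTED_EXTENSIONS = {".mp4", ".mov", ".avi"}
DEFAULT_SIZE_LIMIT_BYTES = 5 * 1024**3  # assumption: "5GB" read as 5 GiB


def import_problems(filename, size_bytes, size_limit=DEFAULT_SIZE_LIMIT_BYTES):
    """Return a list of reasons the file may be rejected (empty if none found)."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > size_limit:
        problems.append(f"file is {size_bytes} bytes, over the {size_limit} byte limit")
    return problems


if __name__ == "__main__":
    print(import_problems("clip.MP4", 10))            # no problems found
    print(import_problems("clip.mkv", 6 * 1024**3))   # two problems found
```

For a file on disk, the size can be read with Path(path).stat().st_size.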

If your video fails to process, please check that it uses standard codecs. The video must be parseable by FFmpeg.
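
One way to check a file locally is ffprobe, which ships with FFmpeg. The sketch below builds an ffprobe command and summarizes its JSON output. The helper names and the sample output string are illustrative, and the command itself is not run here.

```python
import json
from fractions import Fraction


def build_probe_command(path):
    """Build an ffprobe command that reports the first video stream as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=codec_name,width,height,avg_frame_rate",
        "-of", "json",
        path,
    ]


def summarize_probe(probe_json):
    """Return codec, size and FPS, or None if no video stream was reported."""
    streams = json.loads(probe_json).get("streams", [])
    if not streams:
        return None
    stream = streams[0]
    return {
        "codec": stream["codec_name"],
        "size": (stream["width"], stream["height"]),
        # ffprobe reports frame rates as a ratio such as "30000/1001".
        "fps": float(Fraction(stream["avg_frame_rate"])),
    }


if __name__ == "__main__":
    # Hand-written sample in the shape of ffprobe's JSON output.
    sample = ('{"streams": [{"codec_name": "h264", "width": 1920, '
              '"height": 1080, "avg_frame_rate": "30/1"}]}')
    print(summarize_probe(sample))
```

If ffprobe exits with an error or reports no video stream, the file is unlikely to process. A frame rate that ffprobe cannot determine is reported as "0/0", which this sketch does not handle.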

Split Videos

SDK & API Reference

FPS Conversion

We recommend keeping FPS <= 60 and duration <= 10 minutes.
The technical limit is based on number of frames, so a 2 hour video with a very low FPS is fine.
Also consider the integrated video splitting function in combination with or as an alternative to this.


  • Makes fast moving video files a workable size for annotation QA.
  • In some cases, more variance per frame means more effective annotation leading to improved neural network performance. Note this may not apply to "event" based annotation where we may wish to label an event that happens at a single frame only.
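
Since the limit depends on the number of frames, it can help to work out a frame budget before import. In the sketch below, 36,000 frames (60 FPS for 10 minutes) is used only as an illustrative budget derived from the recommendation above. It is not a documented Diffgram limit, and the function names are hypothetical.

```python
import math


def frame_count(duration_seconds, fps):
    """Total frames for a video of the given duration and frame rate."""
    return math.ceil(duration_seconds * fps)


def fps_within_budget(duration_seconds, source_fps, max_frames):
    """Highest FPS, not above the source FPS, that stays within max_frames."""
    return min(source_fps, max_frames / duration_seconds)


if __name__ == "__main__":
    print(frame_count(10 * 60, 60))                    # 36000 frames
    print(frame_count(2 * 60 * 60, 1))                 # 7200 frames
    print(fps_within_budget(60 * 60, 30, 36000))       # 10.0
```

A 2 hour video at 1 FPS comes to 7,200 frames, well under a 10 minute video at 60 FPS. A 1 hour video at 30 FPS would need conversion to about 10 FPS to fit the same budget.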


Disabling FPS Conversion

Set FPS conversion rate to 0.

Processing Time

Because of the nature of the video features offered, new videos take some time to process in Diffgram. Please review the "Preparing files for import" section for ideas on speeding up this process. In general, Diffgram assumes you prefer a quality result at the expense of slightly slower upfront processing.

Every frame that's imported is a frame an annotator may need to look at.

Multiple files are processed in parallel, up to 10 at a time per account.

Under ~300 frames or file under 300 MB: ~5 to 15 minutes
Under ~5000 frames or file under 1 GB: ~20 to 40 minutes


Large batches of work

If you are processing large batches (i.e. 100+ videos with over 3000 frames each), we recommend overnight processing. You can use the job scheduling feature to launch automatically when ready (it checks that files are ready before launching).
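
A back-of-the-envelope estimate shows why overnight processing is suggested. The sketch below combines the per file estimates above with the limit of 10 files in parallel. It assumes files are processed in full "waves" of 10, that a file must be under both the frame and size thresholds to fall in a tier, and that 1 GB is 1000 MB. These are simplifications, not documented behavior.

```python
import math

# (max frames, max size in MB, (min minutes, max minutes)), from the estimates above.
TIERS = [
    (300, 300, (5, 15)),
    (5000, 1000, (20, 40)),
]
PARALLEL_FILES = 10


def per_file_minutes(frames, size_mb):
    """Return the (min, max) minute estimate, or None beyond the listed tiers."""
    for max_frames, max_mb, minutes in TIERS:
        if frames < max_frames and size_mb < max_mb:
            return minutes
    return None


def batch_minutes(file_count, frames, size_mb):
    """Estimate (min, max) minutes for a batch of similar files."""
    estimate = per_file_minutes(frames, size_mb)
    if estimate is None:
        return None
    waves = math.ceil(file_count / PARALLEL_FILES)
    return waves * estimate[0], waves * estimate[1]


if __name__ == "__main__":
    # 100 videos of about 3000 frames and 500 MB each.
    print(batch_minutes(100, frames=3000, size_mb=500))  # (200, 400)
```

Under these assumptions, 100 videos of about 3000 frames each take roughly 200 to 400 minutes, or about 3 to 7 hours.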

Longer videos

Annotation on a video is a more complex task and for best results it's good practice to limit the length of an individual video.

Have a longer clip? Some options:

  • Split large videos into smaller files
  • Lower the FPS conversion rate (in project settings)
  • Extract the section of interest from the video
  • Upgrade your account
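
If you prefer to split or trim a file locally before import, the sketch below shows one possible approach with FFmpeg. The function names are illustrative and the commands are only built, not run. This is separate from Diffgram's integrated splitting feature.

```python
def split_ranges(total_seconds, chunk_seconds):
    """Return (start, duration) pairs that cover the video in chunks."""
    ranges = []
    start = 0
    while start < total_seconds:
        ranges.append((start, min(chunk_seconds, total_seconds - start)))
        start += chunk_seconds
    return ranges


def build_extract_command(src, dst, start_seconds, duration_seconds):
    """Build an ffmpeg command that copies one section of the video."""
    return [
        "ffmpeg",
        "-ss", str(start_seconds),     # seek to the start of the section
        "-i", src,
        "-t", str(duration_seconds),   # keep this many seconds
        "-c", "copy",                  # no re-encode, so cuts snap to keyframes
        dst,
    ]


if __name__ == "__main__":
    # A 25 minute video cut into 10 minute pieces.
    for index, (start, duration) in enumerate(split_ranges(1500, 600)):
        print(" ".join(build_extract_command("long.mp4", f"part_{index}.mp4", start, duration)))
```

Stream copy is fast, but cut points land on keyframes rather than exact frames. Re-encoding instead of using "-c copy" gives more precise cuts at the cost of processing time.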

Signed URLs

A major advantage when sending video data is that the entire video can be sent as a single file.

Instead of splitting a video into, say, 800 images, sending them, and then worrying about grouping them back together, we can just send the whole file. This keeps the video playable, which maintains the benefits of annotating video.

This works especially well for large files, including files over 500 MB.

See Example