File Processing Performance Notes

Most of the heavy lifting for file ingestion is done by the walrus service. This service downloads the image or video from cloud storage and performs all the preprocessing steps needed to make it ready to annotate.

Note: The benchmarks below do not account for load from other modules, so analyze them alongside how the other parts of the system are used.

Images

Key notes for Image performance:

  • You can use pass by reference to make processing time negligible.
  • Each image takes roughly 1 to 2.59 s to process on a single core, with negligible memory consumption.
  • Each walrus instance can run multiple threads (depending on your machine size).
  • For more scalability, you can spin up multiple walrus instances in parallel to increase throughput.

Example
If we want to process 100,000 images on a single instance, using the upper bound of 2.59 s per image:

1 Worker

  • 100,000 * 2.59 = 259,000 seconds => about 71.9 hours for a single worker/thread

The key idea here is that horizontal scaling of the walrus service will greatly improve image throughput.

4 Workers:

  • 100,000 * 2.59 = 259,000 seconds / 4 workers = 64,750 seconds => about 18.0 hours, assuming the work divides evenly across workers
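
The arithmetic above can be reproduced for other dataset sizes and worker counts. The following is a minimal sketch, not part of walrus itself: the function name estimate_hours is hypothetical, and it assumes the work divides evenly across workers with no overhead from downloads, the database, or other modules.

```python
def estimate_hours(num_images: int, seconds_per_image: float, workers: int = 1) -> float:
    """Estimate wall-clock hours to process num_images.

    Assumes perfectly even division of work across workers (an idealization).
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    total_seconds = num_images * seconds_per_image
    return total_seconds / workers / 3600


# Upper-bound per-image time from the notes above (2.59 s).
for workers in (1, 4, 16):
    hours = estimate_hours(100_000, 2.59, workers)
    print(f"{workers:>2} workers: {hours:.1f} hours")
```

Real throughput will likely be lower than this estimate once other parts of the system compete for the same resources.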

Videos

  • Downloading a large video file currently consumes about 1.7x the file size in memory. So for a 1 GB video upload, you must have at least 1.7 GB of RAM available.
  • Frames are processed in parallel by 8 threads per machine. We will work on adding cross-machine parallel processing of frames in the future.
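
The memory note for video downloads can be turned into a quick sizing check. This is a rough sketch under the 1.7x figure stated above; the helper names min_ram_gb and fits_in_memory are hypothetical and are not walrus settings.

```python
# Memory multiplier for video downloads, taken from the note above.
VIDEO_MEMORY_FACTOR = 1.7


def min_ram_gb(video_size_gb: float, factor: float = VIDEO_MEMORY_FACTOR) -> float:
    """Minimum RAM (GB) needed to download a video of the given size."""
    return video_size_gb * factor


def fits_in_memory(video_size_gb: float, available_ram_gb: float) -> bool:
    """True if the machine has enough free RAM for the download."""
    return min_ram_gb(video_size_gb) <= available_ram_gb


print(min_ram_gb(1.0))           # RAM needed for a 1 GB video
print(fits_in_memory(5.0, 8.0))  # a 5 GB video needs 8.5 GB, so this is False
```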

DB Connections

When running a large number of walrus services, the bottleneck can be the DB connections.

The default connection pool for each walrus instance is 10 connections. Depending on the number of connections available on your database, you can tweak this value.

Please also note that the settings PROCESS_MEDIA_NUM_FRAME_THREADS and PROCESS_MEDIA_NUM_VIDEO_THREADS create more threads, which use up multiple connections from the connection pool. So be sure to set the connection pool size and thread counts to values that are reasonable for your current usage and data consumption.
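
A connection budget can be worked out before scaling out. The sketch below assumes the worst case where every walrus instance opens its full pool; the function name max_walrus_instances and the reserved parameter (connections set aside for other services) are hypothetical, and the database limit you pass in is whatever your own database allows.

```python
def max_walrus_instances(db_max_connections: int, pool_size: int = 10, reserved: int = 0) -> int:
    """Upper bound on walrus instances the database can serve.

    Worst-case assumption: each instance uses its whole connection pool.
    pool_size defaults to 10, the default pool size noted above.
    reserved is a hypothetical allowance for other services on the same database.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be >= 1")
    usable = db_max_connections - reserved
    return max(usable // pool_size, 0)


# Example: a database allowing 100 connections, 20 of them kept for other services.
print(max_walrus_instances(100, pool_size=10, reserved=20))
```

If the thread settings above are raised, a larger pool per instance may be needed, which lowers the number of instances that fit in the same budget.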

We are always working on improving overall system performance to make file ingestion faster and to allow huge datasets to be easily ingested by the platform.