Scale
Store virtually any scale of training data and access slices nearly instantly
Introduction
Diffgram is built to scale every aspect of your training data.
- People
- Processes
- Data
And so much more.
Database Choices
Diffgram internally uses PostgreSQL by default for all annotation (metadata) storage.
Diffgram also uses SQLAlchemy, which means you can use nearly any popular database, including column-oriented databases. Examples of other database choices include Oracle, SQL Server, Amazon Redshift, Apache Drill, Apache Druid, Apache Hive and Presto, EXASolution, Snowflake, and more.
Note: Postgres is the only officially supported database.
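Because the database layer goes through SQLAlchemy, switching backends is largely a matter of the connection URL. A minimal sketch, assuming a standard SQLAlchemy setup (the URLs and credentials here are illustrative, not Diffgram's actual configuration):

```python
from sqlalchemy import create_engine

# Default, officially supported backend: PostgreSQL.
engine = create_engine("postgresql://diffgram:secret@db-host:5432/diffgram")

# Hypothetical alternative backend. This requires the matching SQLAlchemy
# dialect and driver to be installed (e.g. snowflake-sqlalchemy), and is
# not officially supported.
# engine = create_engine("snowflake://user:password@account/database")
```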
Horizontally Scalable
Diffgram acts as a new form of training-data-specific database, layered on top of a metadata database (e.g. Postgres) and a BLOB store (e.g. cloud storage). Diffgram itself can scale horizontally: you can add as many machines to the cluster, service, etc. as required.
Everything in Diffgram can scale horizontally, from the metadata storage (e.g. Postgres with Citus, Redshift, etc.) to the BLOB store (local, GCP, AWS, Azure, HDFS, etc.).
Decoupled Services
Diffgram has decoupled online and batch processing services. This means that ingestion, and other long-running operations, can scale independently of online access demands.
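The sketch below illustrates the general pattern rather than Diffgram's actual internals: the online path only enqueues work and returns immediately, while separate batch workers (scaled to match ingestion load) do the long-running processing. All names here are hypothetical.

```python
import queue
import threading
import time

# Hypothetical work queue connecting the online service to batch workers.
ingest_queue: queue.Queue = queue.Queue()

def process_media(file_url: str) -> None:
    # Stand-in for a long-running operation (download, pre-process, index).
    time.sleep(0.1)

def online_api_upload(file_url: str) -> None:
    # Online path: enqueue and return immediately, so annotators and API
    # consumers are never blocked by ingestion load.
    ingest_queue.put(file_url)

def batch_worker() -> None:
    # Batch path: run as many of these, on as many machines, as needed.
    while True:
        process_media(ingest_queue.get())
        ingest_queue.task_done()

threading.Thread(target=batch_worker, daemon=True).start()
online_api_upload("gs://bucket/raw/video_001.mp4")
ingest_queue.join()
```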
Using Internal Dataset Representation
Diffgram has an internal dataset representation and piggybacks on the metadata database (e.g. Postgres) to store the references to the BLOB data. Additionally, in some cases we use other patterns, such as the "known reference," which allows us to bypass the metadata store and access the BLOB data directly.
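As a rough illustration of this reference pattern (the table and column names below are hypothetical, not Diffgram's actual schema): the SQL row holds the metadata, while the heavy bytes stay in the BLOB store and are addressed by URL.

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class FileRecord(Base):
    """One metadata row per file; the raw media lives in the BLOB store."""
    __tablename__ = "file"

    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, index=True)
    # Reference into the BLOB store, e.g. "gs://bucket/raw/frame_000123.jpg".
    blob_url = Column(String)
```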
SQL Power and Slices (Access)
Because the metadata database is based on SQL, accessing a slice of the data is as simple as a SQL query. This means all the power of SQL filtering is available on your dataset, and access times, even for frame-specific instances on videos, are very fast.
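For example, a slice like "all pedestrian instances between frames 100 and 200" is a single filtered query. A minimal sketch, assuming a hypothetical `instance` table (not Diffgram's actual schema):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://diffgram:secret@db-host:5432/diffgram")

with engine.connect() as conn:
    rows = conn.execute(
        text(
            "SELECT file_id, frame_number FROM instance "
            "WHERE label = :label AND frame_number BETWEEN :lo AND :hi"
        ),
        {"label": "pedestrian", "lo": 100, "hi": 200},
    )
    for row in rows:
        print(row.file_id, row.frame_number)
```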
Parallel Ingestion Processing (Storage)
It's super easy to get the data into Diffgram thanks to our new Import Wizard.
Ingestion is actually a big challenge for many systems: from accepting new local data (e.g. from Internet of Things deployments), to loading new data, to handling model predictions.
Diffgram's ingestion class, Process Media, and other supporting subsystems are all built to run in parallel, both locally on each machine and across multiple machines. This means that thousands of predictions can be fully validated and converted to the Diffgram format in seconds. The same applies to the built-in pre-processing for raw media.
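The sketch below shows the single-machine half of this idea using a process pool; Diffgram additionally distributes the work across machines. `validate_and_convert` is a hypothetical stand-in for the real validation and conversion logic, not Diffgram's actual code.

```python
from concurrent.futures import ProcessPoolExecutor

def validate_and_convert(prediction: dict) -> dict:
    # Validate one incoming prediction and map it to the internal format.
    assert "type" in prediction and "points" in prediction
    return {"instance_type": prediction["type"], "points": prediction["points"]}

def ingest(predictions: list) -> list:
    # Fan the work out across all available CPU cores.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(validate_and_convert, predictions, chunksize=64))

if __name__ == "__main__":
    fake_predictions = [{"type": "box", "points": [0, 0, 10, 10]}] * 10_000
    print(len(ingest(fake_predictions)))
```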
Training Data Focused Optimizations
On a practical level, the system has a variety of optimizations, such as indices, and is structured in such a way that it can query on a per-sample basis quickly, in an "online" fashion.
The net effect is that you can store Petabytes of raw data and Terabytes of metadata on a single cluster, and access it through a unified abstraction in an online (<500 ms) fashion.
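As an example of the kind of index-level optimization this refers to (table and column names are again hypothetical), a composite index matching the common "all instances for this file and frame" access pattern keeps per-sample queries fast:

```python
from sqlalchemy import Column, Index, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Instance(Base):
    __tablename__ = "instance"

    id = Column(Integer, primary_key=True)
    file_id = Column(Integer)
    frame_number = Column(Integer)
    label = Column(String)

# Composite index matching the per-sample access pattern, so looking up
# one frame's instances never scans the whole table.
Index("ix_instance_file_frame", Instance.file_id, Instance.frame_number)
```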
Built for Extreme Scale
Diffgram has many great features no matter the volume of annotation, but it is unique in that we think about scale across all aspects of the system.
Do any of these apply to you?
- Models running in staging or production?
- Using pre-labels or interactive automations?
- Need versioning?
- Have expanding use cases or need better model performance?
- Expanding your annotation team or needs? Have multiple teams accessing training data?
- Using complex data types like video, 3D, or multi-modal?
These things all stack to make for 10x, 100x, 1000x+ increases in annotation volume.
A single Diffgram install is capable of handling 100,000,000+ (100 million+) annotations. We plan to scale it to support 10,000,000,000+ (10 billion+) annotations per install in 2023.
Examples of things we think about for you that go beyond the literal numbers:
- Is this cost effective at scale? If you need an automation to produce millions of instances, how can we do that in a way that approaches $0?
- What does access time for data look like when the volume is 10x, 100x, 1000x+?
- What does the annotator experience look like if the system is at max ingestion capacity?
- How does a new team get data in and out of Diffgram in an easy, standard process?
- How can teams access data across Diffgram installations? How can we serve multiple teams' needs through one unified data model?
If you need extreme scale, choose Diffgram.