Total Cost of Ownership

Considerations for Training Data

🚧

This doc is out of date and pending removal.

Executive Summary

It's commonly assumed that the annotation itself is the biggest cost center. In fact, many other costs are involved, such as Administering Datasets, Data Prep, Curation, and Set Iteration, and together they far exceed the cost of annotation. Annotation is in fact one of the ways value is added to the system.

📘

The Database Analogy

Consider populating a database. It's easy to populate it with random or somewhat relevant data. But compare the value of that database to one populated with records of actual human activity!

Annotation is like writing a valuable line of data to the database. Designing the tables in the first place, connecting the database to the application, and maintaining it are the unavoidable costs.

Each line that's written is really the expected usage of the system.

What are the cost centers with Training Data?

  • Administration of Data
  • Integrations
  • Data Prep
  • Curation, What to Label
  • Iteration on Datasets
  • Subject Matter Expert Training
  • Subject Matter Expert Annotation
  • Collaboration

Administration of Data Costs

The reality of controlling data is messy. Specifically, a single dataset often requires multiple manual steps of moving data. These steps are often blocking, so a user must repeatedly check where the data is in the process.

  • Cost to shepherd data through a non-deterministic process with many blocks
  • Cost to administer the overall process of creating and updating datasets

These overall organization costs are over and above subsystem costs such as point integrations.

Integration Costs

  • Cost to integrate a data provider (AWS, GCP, Azure, Private Data Source)
  • Cost to maintain the integration with that provider
  • Cost to integrate with a new data provider as needs change (e.g., moving from AWS to GCP)
  • Cost to create and iterate on UIs for Admin productivity
  • Cost to integrate the Tooling provider with the Application

Diffgram solves this with fully integrated, one-click connections.
This makes importing, curating what to label, and exporting 10x easier.

Even when the data source is "internal", it's usually a different system or subsystem that requires some degree of integration.

Data Prep

Wait, images and video have data prep? I thought that was just for classic Machine Learning, you know - tables and stuff? Yes!
Let's take videos for example.

  • Cost to split videos and track relations between splits (because it's usually not useful to curate, label, or review hour-long or multi-hour videos directly)
  • Cost to process videos into a useful DL encoding (e.g., frames, or binary frames)
  • Cost to resize and format for reasonable online browsing and use (applies to images too)
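
To make the first bullet concrete, here is a minimal sketch of planning splits and keeping each split's relation to its parent video. It is an illustration only: `VideoSplit`, `plan_splits`, and `to_parent_time` are hypothetical names, not part of Diffgram, and the sketch only plans time ranges rather than cutting any video.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VideoSplit:
    """Hypothetical record tying one split back to its parent video."""
    parent_id: str   # identifier of the original, unsplit video
    index: int       # position of this split within the parent
    start_s: float   # start offset in the parent, in seconds
    end_s: float     # end offset in the parent, in seconds


def plan_splits(parent_id: str, duration_s: float, max_split_s: float) -> List[VideoSplit]:
    """Plan consecutive, non-overlapping splits that cover the whole video."""
    if duration_s <= 0 or max_split_s <= 0:
        raise ValueError("duration_s and max_split_s must be positive")
    splits: List[VideoSplit] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_split_s, duration_s)
        splits.append(VideoSplit(parent_id, len(splits), start, end))
        start = end
    return splits


def to_parent_time(split: VideoSplit, local_s: float) -> float:
    """Map a timestamp inside a split back to the parent video's timeline."""
    if not 0 <= local_s <= split.end_s - split.start_s:
        raise ValueError("local_s is outside this split")
    return split.start_s + local_s


# Illustrative input: a 3700 second video cut into pieces of at most 600 seconds.
splits = plan_splits("video_001", 3700.0, 600.0)
```

Keeping the parent id and offsets on every split is what lets an annotation made on a short clip be mapped back to the original footage later.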

Curation, What to Label

Projects usually have more raw data than they have the means to reasonably label. Curation takes many forms.

  • Cost to curate in the initial data wrangling phases
  • Cost of identifying "high value" data to label (Manual review, Entropy or similar)
  • Cost of identifying what data needs most review (during Training)
  • Cost of determining what an "error" is in a running system

For the initial data wrangling phase, the Diffgram integrations and robust drag-and-drop system make this easy.
Within a Dataset, Diffgram offers importing of existing instances, over which your desired "entropy" function can be run.
The flexible Dataset concept in Diffgram makes it easy to subdivide work and experiment on what to label.
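
As one possible reading of "Entropy or similar", the sketch below ranks files by the Shannon entropy of a model's predicted class probabilities, so the most uncertain files surface first. The function names and the shape of `predictions` (file id mapped to a probability list) are assumptions for illustration, not a Diffgram API.

```python
import math
from typing import Dict, List, Sequence


def shannon_entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def rank_by_entropy(predictions: Dict[str, Sequence[float]]) -> List[str]:
    """Return file ids ordered from most to least uncertain."""
    return sorted(
        predictions,
        key=lambda file_id: shannon_entropy(predictions[file_id]),
        reverse=True,
    )


# Illustrative, made-up probabilities for three files and two classes.
predictions = {
    "file_a": [0.5, 0.5],   # model is maximally unsure
    "file_b": [1.0, 0.0],   # model is fully confident
    "file_c": [0.9, 0.1],   # model is fairly confident
}
ranked = rank_by_entropy(predictions)
```

Files at the front of `ranked` are candidates for labeling first. Any other scoring function can be swapped in for `shannon_entropy` without changing the ranking step.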

Data Science, Structuring the Experiment, PoC, and Production Product

  • Cost of structuring the initial Labels and Attributes Templates (e.g., is it car or car_large, or car and truck, or ...)
  • Cost of iterating on the Labels and Attributes Templates (we usually see at least 3-4 iterations for even a single dataset)
  • Cost of structuring the initial sets and the relations between datasets

Diffgram addresses the iteration cost in three ways:

  • Making it easy to reuse Label and Attribute Templates across batches of work (through the Project Architecture, which automatically organizes batches of work).
  • Making it easy to "lock" or "unlock" labels based on needs. For example, locking an "old" label or, if the change is applicable, allowing an update such as adding attributes to an existing "good" label.
  • Making it straightforward (and possible!) to directly structure dataset relations in Diffgram.
    For example, Task Templates watch Datasets, automatically creating new tasks whenever data is added.
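
The "watch" relation can be pictured as a simple observer pattern. The sketch below is a conceptual illustration only: the `Dataset` and `TaskTemplate` classes here are hypothetical stand-ins written for this example and are not Diffgram's SDK or internal implementation.

```python
from typing import Callable, List


class Dataset:
    """Hypothetical in-memory dataset that notifies watchers when a file is added."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.files: List[str] = []
        self._watchers: List[Callable[[str], None]] = []

    def watch(self, callback: Callable[[str], None]) -> None:
        """Register a callback to run for every file added from now on."""
        self._watchers.append(callback)

    def add_file(self, file_id: str) -> None:
        self.files.append(file_id)
        for callback in self._watchers:
            callback(file_id)


class TaskTemplate:
    """Hypothetical template that creates one task per file in a watched dataset."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.tasks: List[str] = []

    def attach(self, dataset: Dataset) -> None:
        dataset.watch(self._create_task)

    def _create_task(self, file_id: str) -> None:
        self.tasks.append(f"{self.name}:{file_id}")


# Adding data after the template is attached creates tasks with no extra step.
raw = Dataset("raw_images")
template = TaskTemplate("bounding_boxes")
template.attach(raw)
raw.add_file("img_001")
raw.add_file("img_002")
```

The point of the pattern is that nobody has to declare the set "finished" before work begins: tasks appear as data arrives.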

Diffgram shifts the mindset from Static by Default to Dynamic by Default.
This flips the standard assumption that a set must be "ready" before it is sent for annotation.

Is Annotation Really a Net Cost?

Let's invert the assumption that Annotation is just a money pit.
Annotation is a form of Human Control over the system.
In fact, Human Control is how value is added to the system.

Human Control, exercised through the entire process of curating data, defining classes, and annotating, is how the knowledge of the system is structured.

There are many methods to reduce certain aspects of the "grunt work" of annotation.
Diffgram offers a generic boundary point here, where you can use any method desired to assist annotation. Keep in mind that these methods come with their own costs, including knowledge of the "assist" method itself, computation time to run it, limits on the ability to iterate, and more.

Subject Matter Expert (SME) Training

  • Cost of training on and learning the Human Control interface
  • Cost of creating and distributing Training material
  • Cost of creating and administering Training

Diffgram makes it possible to organize your training. Task Templates can be set as repeatable Exams. New SMEs complete the training, and Admins review the results (including AutoGrader results).