Pipelines Cookbook

Save Expert Time with Multi-Stage Pipelines

Create 2 Task Templates
Create a First Pass Dataset
Create a First Pass Complete Dataset
Have the Expert Template watch the First Pass Complete Dataset
Put the more “advanced” Attributes in the Expert Task Template
Populate the First Pass Dataset with machine predictions, entry-level labor, etc. Invite experts only to the Expert Template.
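
The two-stage flow above can be sketched in a few lines of Python. This is a minimal illustration only: the `Dataset` and `TaskTemplate` classes, their methods, and the attribute names are hypothetical stand-ins, not the Diffgram SDK. It shows how a file reaches the expert queue only after the first pass finishes it.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a Dataset; not a Diffgram SDK class."""
    name: str
    files: list = field(default_factory=list)
    watchers: list = field(default_factory=list)

    def add(self, file_id):
        self.files.append(file_id)
        # Every template watching this dataset gets a task for the new file.
        for template in self.watchers:
            template.create_task(file_id)


@dataclass
class TaskTemplate:
    """Hypothetical stand-in for a Task Template."""
    name: str
    attributes: list
    output: Dataset = None
    tasks: list = field(default_factory=list)

    def watch(self, dataset):
        dataset.watchers.append(self)

    def create_task(self, file_id):
        self.tasks.append(file_id)

    def complete_task(self, file_id):
        # A finished task leaves the queue and moves to the output dataset.
        self.tasks.remove(file_id)
        if self.output is not None:
            self.output.add(file_id)


first_pass = Dataset("First Pass")
first_pass_complete = Dataset("First Pass Complete")

# Attribute names here are made up for illustration.
basic = TaskTemplate("First Pass Template", ["object_class"], output=first_pass_complete)
expert = TaskTemplate("Expert Template", ["severity", "diagnosis"])

basic.watch(first_pass)
expert.watch(first_pass_complete)

first_pass.add("image_001")
first_pass.add("image_002")
basic.complete_task("image_001")

print(expert.tasks)  # ['image_001']
```

Experts only ever see `image_001` here; `image_002` stays in the first pass queue until that stage completes it.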

Concepts

Start testing the algorithm “early” in the process. Don’t wait for a complete dataset. This way dataset definitions can be adapted on demand instead of waiting for annotation to “complete”.
Map supervision work more closely to model results. Eg if another "Day" of annotation doesn't change the metrics, stop annotating that.

Create Task Templates & Datasets as normal
Set up an Action, eg trigger a Webhook & run an AWS Lambda script to export when 25% of the work is complete
Set up metrics here. Eg if another “Day” of annotation isn’t improving the desired metrics enough, then adjust the Template
Eg resample, change definitions (Labels), or change approach (even the spatial type).
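
One way to make the "another Day isn't improving the metrics" rule concrete is a small plateau check. The sketch below is an assumption about how such a rule could look, not something Diffgram provides: the function name, the `min_gain` and `patience` parameters, and the sample metric values are all made up for illustration.

```python
def should_keep_annotating(metric_history, min_gain=0.005, patience=2):
    """Return False once each of the last `patience` checkpoints improved
    the metric by less than `min_gain`.

    metric_history: one metric value per checkpoint (eg per day), oldest first.
    """
    if len(metric_history) <= patience:
        return True  # Not enough checkpoints to judge yet.
    recent = metric_history[-(patience + 1):]
    gains = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return any(gain >= min_gain for gain in gains)


# Hypothetical daily metric values after each export and training run.
daily_metric = [0.41, 0.52, 0.58, 0.581, 0.582]
print(should_keep_annotating(daily_metric))  # False
```

When the check returns False, that is the signal to adjust the Template (resample, change Labels, etc.) rather than keep annotating the same way.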

For example, if any of these are true:

Create a Data Pipeline
When new data is ready, send the data to the Dataset (Start the pipeline)
eg for AWS, trigger an Import. Once the data hits the first dataset, the rest is handled inside Diffgram.
Use Notifications to involve Human Operators
Automatically export snapshots, eg a rolling window of x period
Use Webhooks on the output end to start Auto Training, eg running an AutoML script, or doing a run on a custom algorithm.

Note these steps are easily swappable, eg stay all in AWS, go GCP -> AWS, etc.

Example: data is in AWS. You want to use GCP (raw ML engine compute, or AutoML) and Scale for services (human labor).

Use "best in class" tools. Is GCP or another provider better for training, but all your data is on AWS?
Want to try out different service providers? Or use multiple providers for different contexts?
Use Diffgram as a single "ETL" point. Plus benefit from all the preprocessing, eg for Video.

Set up a repeatable pipeline in Diffgram. Less work upfront, less maintenance, fewer midnight pages.

Use integrated Connections instead of downloading to a local machine
Use Diffgram as a consistent single data model, instead of "recreating" the data structure in a variety of tools and contexts.
Leave Deep Learning specific pre-processing to Diffgram
Work with Data Scientist(s) to set up incoming Data Abstractions (Datasets)
Work with Data Scientist(s) to set up outgoing Datasets, especially Webhooks.

Refocus on models. More easily communicate label/class changes. Focus on what you want to define, not on shuffling data or manual one-off scripts.

Export on demand. Ideally to desired cloud bucket or process.
Define training data abstractions, like Labels and Attributes, upfront, before data is even ready.
Change these abstractions often (during development), including adding new work and adjusting existing work.
Build "relational" datasets using the Data Explorer by combining multiple sets. For example, maintaining a main training set + region specific set.
Manage data versions with "copy on write" and named versions
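
To show what "copy on write" with named versions means in general, here is a toy in-memory sketch. It is an illustration of the idea only, under the assumption that a version is a mapping from file to labels; the `VersionedDataset` class and its methods are hypothetical and say nothing about how Diffgram stores versions internally.

```python
class VersionedDataset:
    """Toy copy-on-write store: versions share data until one is written to."""

    def __init__(self):
        self._versions = {"main": {}}

    def create_version(self, name, source="main"):
        # Point the new name at the same dict; nothing is copied yet.
        self._versions[name] = self._versions[source]

    def write(self, version, file_id, labels):
        data = self._versions[version]
        shared = sum(1 for other in self._versions.values() if other is data) > 1
        if shared:
            # First write after a snapshot: copy, then modify only the copy.
            data = dict(data)
            self._versions[version] = data
        data[file_id] = labels

    def read(self, version):
        return dict(self._versions[version])


ds = VersionedDataset()
ds.write("main", "image_001", ["car"])
ds.create_version("v1")                          # cheap: no data copied
ds.write("main", "image_001", ["car", "truck"])  # triggers the copy

print(ds.read("v1"))    # {'image_001': ['car']}
print(ds.read("main"))  # {'image_001': ['car', 'truck']}
```

The named version `v1` stays frozen while `main` keeps changing, and the copy cost is only paid when a write actually happens.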

Improve the biggest blocker - the process. Give your team the tools they need and get the reporting you need.

Choose labeling Interface(s) most relevant to your task.
Choose for each Task Template which provider to use. (Leave the credential setup to the Data Engineers.)
Use the Reporting to track people and process results.
Use Notifications to monitor work in real time and respond to critical issues.

Updated over 3 years ago