Pipelines Cookbook

Save Expert Time with a Multi-Stage Pipeline

  1. Create two Task Templates: a First Pass template and an Expert template
  2. Create a First Pass Dataset
  3. Create a First Pass Complete Dataset
  4. Have the Expert template watch First Pass Complete
  5. Put the more "advanced" Attributes on the Expert template only
  6. Populate the first pass with machine predictions, entry-level labor, etc. Invite experts only to the Expert template.
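The wiring above can be sketched in plain Python. This is an illustrative in-memory model only, not the Diffgram SDK; every field name here (attributes, watch_dataset, members) is an assumption made for the sketch.

```python
# Minimal in-memory sketch of the multi-stage setup above.
# Objects are plain dicts; field names are illustrative, not the real schema.

def build_multi_stage_pipeline():
    # 1. Two Task Templates
    first_pass = {"name": "First Pass", "attributes": ["label"]}
    expert = {"name": "Expert",
              # 5. The "advanced" attributes live only on the expert template
              "attributes": ["label", "occlusion", "difficulty"]}

    # 2. / 3. Datasets for incoming work and completed first-pass work
    datasets = {"First Pass": [], "First Pass Complete": []}

    # 4. The Expert template watches First Pass Complete: files landing
    #    there become expert tasks automatically.
    expert["watch_dataset"] = "First Pass Complete"

    # 6. First pass is fed by machine predictions / entry-level labor;
    #    only experts are invited to the Expert template.
    first_pass["members"] = ["model-v1", "entry-level-team"]
    expert["members"] = ["expert-team"]

    return first_pass, expert, datasets
```

The key design point is step 4: the watch relationship is what chains the two templates into one pipeline, so expert time is spent only on work that already passed the first stage.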

Start Faster with Early Export


  • Start testing the algorithm early in the process. Don't wait for a complete dataset; dataset definitions can be adapted on demand instead of waiting for annotation to "complete".
  • Map supervision work more closely to model results. E.g. if another "day" of annotation doesn't change the metrics, stop annotating that.
  1. Create Task Templates & Datasets as normal
  2. Set up an Action, e.g. trigger a Webhook and run an AWS Lambda script to export when, say, 25% of the work is complete
  3. Set up metrics here. E.g. if another "day" of annotation isn't improving the desired metrics enough, adjust the Template
  4. E.g. resample, change definitions (Labels), or change the approach (even the spatial type)
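Step 2 might look something like this as an AWS Lambda handler. The payload fields (`tasks_complete`, `tasks_total`, `dataset_id`) and `trigger_export()` are assumptions for the sketch, not a documented webhook schema; substitute whatever your webhook actually sends.

```python
import json

# Hypothetical Lambda handler: a webhook fires as annotation progresses,
# and once ~25% of the work is complete we kick off an early export.

EXPORT_THRESHOLD = 0.25

def trigger_export(dataset_id):
    # Placeholder: call your export API / start the metrics run here.
    return f"export started for dataset {dataset_id}"

def lambda_handler(event, context):
    body = json.loads(event["body"])
    done, total = body["tasks_complete"], body["tasks_total"]
    if total and done / total >= EXPORT_THRESHOLD:
        return {"statusCode": 200, "body": trigger_export(body["dataset_id"])}
    return {"statusCode": 200, "body": "below threshold, waiting"}
```

From there, step 3 hangs off `trigger_export`: the export feeds the metrics job, and the metrics decide whether the Template needs adjusting.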

Event Driven Pipeline for Fresh Data

For example, if any of these are true:

  • Data is not yet ready
  • Data requires preprocessing, which may take time
  • Data comes in a stream
  • The pipeline is in production and the data updates over time
  • Multiple stakeholders are involved


  1. Create a Data Pipeline
  2. When new data is ready, send it to the Dataset (start the pipeline)
    E.g. for AWS, trigger an Import. Once the data hits the first dataset, the rest is handled inside Diffgram
  3. Use Notifications to involve Human Operators
  4. Automatically export snapshots, e.g. a rolling window of x period
  5. Use Webhooks on the output end to start auto-training, e.g. running an AutoML script or a custom algorithm
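For the AWS variant of step 2, the glue is typically an S3-triggered Lambda. A sketch, assuming a standard S3 "ObjectCreated" event; `send_to_dataset()` and the dataset name are stand-ins for your actual import call:

```python
# Lambda triggered on S3 object creation: forward each new file
# into the first Dataset to start the pipeline.

def send_to_dataset(bucket, key, dataset="fresh-data"):
    # Placeholder: start an Import pointing at s3://bucket/key here.
    return {"dataset": dataset, "source": f"s3://{bucket}/{key}"}

def lambda_handler(event, context):
    imported = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        imported.append(send_to_dataset(bucket, key))
    # Once the files hit the first dataset, the rest of the pipeline
    # (tasks, notifications, exports) runs inside Diffgram.
    return imported
```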

Note these steps are easily swappable: e.g. the pipeline can stay entirely in AWS, go GCP -> AWS, etc.

Detailed walkthrough:
🍄 Goomba Detector with GCP, Autogluon & Diffgram

Use the Training Data Ecosystem

For example: your data is in AWS, but you want to use GCP for training (raw ML engine compute, or AutoML) and Scale for services (human labor).


  • Use "best in class" tools. Is GCP or another provider better for training, but all your data is on AWS?
  • Want to try out different services providers? Or use multiple providers for different contexts?
  • Use Diffgram as a single "ETL" point. Plus benefit from all the preprocessing, eg for Video.


  1. Create a connection to AWS by entering credentials
  2. Set up a Task Template and choose Scale as the provider
  3. Connect to GCP and export straight to a cloud bucket
  4. Use Actions to trigger AutoML training
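Step 4 can be sketched as a small handler behind the Action: when an export lands in the GCS bucket, submit a training job. `submit_automl_job()` and the payload fields (`bucket`, `path`) are placeholders for this sketch, not real Google Cloud client calls.

```python
# Hypothetical Action handler: an export-complete webhook arrives,
# and we hand the exported file to a training job.

def submit_automl_job(gcs_uri):
    # Placeholder for e.g. a Vertex AI / AutoML training request.
    return {"job": "train", "input": gcs_uri}

def on_export_complete(payload):
    # Assumed payload fields: bucket and object path of the export.
    gcs_uri = f"gs://{payload['bucket']}/{payload['path']}"
    return submit_automl_job(gcs_uri)
```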

Role Specific Best Practices

Data Engineer

Set up repeatable pipelines in Diffgram. Less work upfront, less maintenance, fewer midnight pages.

  • Use integrated Connections instead of downloading to a local machine
  • Use Diffgram as a consistent single data model, instead of recreating the data structure in a variety of tools and contexts.
  • Leave Deep Learning specific pre-processing to Diffgram
  • Work with Data Scientist(s) to set up incoming Data Abstractions (Datasets)
  • Work with Data Scientist(s) to set up outgoing Datasets, especially Webhooks.
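On the outgoing side, the receiving end of a Webhook is usually a few lines of code. A sketch of a receiver that verifies a shared-secret HMAC signature before acting on the payload; the signing scheme and the payload fields are assumptions, so check what your sender actually signs.

```python
import hashlib
import hmac
import json

SECRET = b"shared-secret"  # load from a secrets manager in real code

def verify(body: bytes, signature: str) -> bool:
    # Compare against an HMAC-SHA256 of the raw request body.
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def handle_webhook(body: bytes, signature: str) -> dict:
    if not verify(body, signature):
        return {"status": 403}
    payload = json.loads(body)
    # e.g. hand the exported dataset info to a downstream job here
    return {"status": 200, "dataset": payload.get("dataset")}
```

`hmac.compare_digest` is used instead of `==` to avoid timing side-channels when comparing signatures.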

Data Scientist

Refocus on models. Communicate label/class changes more easily. Focus on what you want to define, not on shuffling data or writing manual one-off scripts.

  • Export on demand. Ideally to desired cloud bucket or process.
  • Define training data abstractions, like Labels and Attributes upfront - before data is even ready.
  • Change these abstractions often (during development), including adding new and adjusting existing work.
  • Build "relational" datasets using the Data Explorer by combining multiple sets. For example, maintaining a main training set + region specific set.
  • Manage data versions with "copy on write" and named versions
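The "relational" dataset idea above amounts to a deduplicated union over member sets. A toy sketch, where files are plain dicts keyed by an assumed `id` field:

```python
# Combine a main training set with a region-specific set,
# keeping the first occurrence of each file id.

def combine(*datasets):
    seen, combined = set(), []
    for dataset in datasets:
        for f in dataset:
            if f["id"] not in seen:
                seen.add(f["id"])
                combined.append(f)
    return combined

main_set = [{"id": 1}, {"id": 2}]
region_eu = [{"id": 2}, {"id": 3}]  # overlaps with the main set on id 2
combined = combine(main_set, region_eu)  # ids 1, 2, 3, no duplicates
```

The point of doing this inside the Data Explorer rather than in a script is that the member sets keep updating, and the combined view stays current without re-running anything.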

Data Managers

Improve the biggest blocker: the process. Give your team the tools they need and get the reporting you need.

  • Choose labeling Interface(s) most relevant to your task.
  • Choose for each Task Template which provider to use. (Leave the credential setup to the Data Engineers.)
  • Use Reporting to track people and process results.
  • Use Notifications to monitor work in real time and respond to critical issues.