This doc is older and may not align with our brand standards
Core system areas
- Data pipeline
- Human supervision for AI
- Team communication and administration
Frequency of Retraining
Hourly, Daily, or Weekly
Number of Datasets
Near real time options
0 -> 1
Diffgram shifts these concepts to become functions of your system (and of its level of integration with Diffgram). Experiment with many datasets and combinations. Improve performance by fine-tuning sets for each store, location, or subsystem.
The net effect is that it enables the Data Science team to scale deep learning products. A system can ship with a human fallback (the near-real-time option). Then, as new stores and hard-to-label cases are found, datasets are split and only the rare, hard cases are worked on.
Your system can undergo less initial evaluation. Worries about ongoing performance are reduced because there is a clear path to continued retraining.
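One way to picture the human fallback is as a simple routing rule in your own application. The sketch below is an illustration only and is not part of Diffgram's API: the `Prediction` class, the `route` and `split_hard_cases` functions, and the use of a confidence score with a 0.9 cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float  # assumed to be a score in [0, 1]


def route(prediction: Prediction, min_confidence: float = 0.9) -> str:
    """Return "model" to accept the prediction, or "human" to send it for review."""
    if prediction.confidence >= min_confidence:
        return "model"
    return "human"


def split_hard_cases(predictions, min_confidence: float = 0.9):
    """Collect the predictions routed to humans (hypothetical 'hard case' set)."""
    return [p for p in predictions if route(p, min_confidence) == "human"]


predictions = [
    Prediction("shelf_empty", 0.97),
    Prediction("shelf_empty", 0.55),
    Prediction("shelf_full", 0.91),
]
hard_cases = split_hard_cases(predictions)
```

Under these assumptions, only the low-confidence item ends up in `hard_cases`, which mirrors the idea of humans working on the rare, hard cases while the model handles the rest.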
Diffgram does not provide any model evaluation or deployment. This diagram is provided to convey the overall mindset and mental concepts at play.
Area of Focus
Model training & optimization
Tight connection to training data.
Expectation that the model will be regularly retrained
Input/Output is part of core application.
Manual data input/output
Training is "one off"
Better path to fix data issues
Requires more engineering effort in general
Set conditions for a valid automatic deploy (e.g., a metric above a threshold)
Sliding window approach on data for statistical evaluation methods, e.g., the last 30, 60, or 180 days of data
Strong human integration. In addition to statistical analysis, stronger “one-off” and reasonableness checks.
Manually assess validity, e.g., by looking at an area-under-the-curve (AUC) chart or a confusion matrix
Supports high-frequency retraining and frequent deploys
Deploy early in a fail-safe way, e.g., starting as internal recommendations only
Deploy often with human in the loop
Heuristics to check for correctness
Mental model of trying to get a desired level of accuracy prior to deploying.
Projects more closely align with real world data
Reduces costs by shipping sooner and improves performance
Increases operational effort.
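The sliding-window evaluation and the threshold condition for automatic deploy can be sketched together. This is a minimal sketch under stated assumptions, not something Diffgram provides: the record layout (`date`, `predicted`, `actual`), the function names, the choice of accuracy as the metric, and the 0.95 threshold are all hypothetical.

```python
from datetime import date, timedelta


def window_accuracy(records, today, days):
    """Accuracy over records dated within the last `days` days, up to and including today."""
    start = today - timedelta(days=days)
    recent = [r for r in records if start < r["date"] <= today]
    if not recent:
        return None  # no data in the window, so there is nothing to evaluate
    correct = sum(1 for r in recent if r["predicted"] == r["actual"])
    return correct / len(recent)


def can_auto_deploy(records, today, windows=(30, 60, 180), threshold=0.95):
    """Allow an automatic deploy only if every window has data and meets the threshold."""
    for days in windows:
        acc = window_accuracy(records, today, days)
        if acc is None or acc < threshold:
            return False
    return True


today = date(2024, 6, 30)
records = [
    {"date": today - timedelta(days=10), "predicted": "cat", "actual": "cat"},
    {"date": today - timedelta(days=100), "predicted": "cat", "actual": "dog"},
]
recent_ok = can_auto_deploy(records, today, windows=(30, 60), threshold=0.9)
all_ok = can_auto_deploy(records, today, threshold=0.9)
```

With this sample data the 30-day and 60-day windows pass, while the 180-day window picks up the older mistake and blocks the deploy. A real system would likely pair a gate like this with the human reasonableness checks described above.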
In this new paradigm there are no real "datasets", in the sense that the data is static only at the exact moment of training and evaluation. Data becomes more like a dynamic "channel": there is a continuous stream of new training data, and each training run uses a conditional slice of it. The initial training process is effectively just a larger slice of the ongoing stream.
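The "channel" idea can be illustrated with a small filter over a stream. This is a sketch only: the `slice_channel` function, the item fields (`store`, `created`, `label`), and the example conditions are hypothetical and do not represent Diffgram's data model.

```python
from datetime import date


def slice_channel(stream, condition):
    """Yield only the items of an ongoing stream that satisfy `condition`."""
    for item in stream:
        if condition(item):
            yield item


# Stand-in for a continuous stream of newly labeled training data.
stream = [
    {"store": "A", "created": date(2024, 1, 5), "label": "cat"},
    {"store": "B", "created": date(2024, 3, 9), "label": "dog"},
    {"store": "A", "created": date(2024, 4, 2), "label": "dog"},
]

# The initial training run takes a larger slice: everything seen so far.
initial = list(slice_channel(stream, lambda item: True))

# A later retraining run takes a conditional slice: recent data for one store.
snapshot = list(
    slice_channel(
        stream,
        lambda item: item["store"] == "A" and item["created"] >= date(2024, 3, 1),
    )
)
```

Materializing the slice with `list(...)` is what makes the data static for the moment of training and evaluation; the underlying stream keeps growing.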
Updated 11 months ago