Classic and New Supervised Machine Learning


Human Supervised Training Data is different from Classic Training Data. It’s new. It has different goals, involves different skill sets, and uses different algorithmic approaches.

Discovery vs Automation

In the Classic case we usually don’t know what the answer is, and we wish to discover it.

In the New, Supervised case we already know what’s correct, and we wish to structure this understanding so it can be repeated by an AI/ML program.

Examples of Classic cases:

  • We don’t know what movie preferences someone has so we wish to discover them.
  • We don’t know what is causing a weather pattern and we want to discover it.

Examples of New Supervised cases and areas:

  • Understand or "read" a document and repeat common actions on it.
  • Cashier-less Checkout
  • Visual Sports Analysis
  • Self Driving
  • Computer Vision
  • NLP, driven by humans adding annotations
  • Time Series
  • Speech

Spreadsheets vs Unstructured Data

In the Classic case, typified by the spreadsheet, the data is already fixed and there is little to label. In the New case, by contrast, the raw data doesn’t mean anything by itself. Humans must add labels to control the meaning.

For example, videos of a street don't mean anything without human labels.


This means there are more degrees of freedom and more capabilities. In other words, in the classic context there is only indirect human control, whereas in the new context there is direct human control.
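To make the contrast concrete, here is a minimal sketch of raw data gaining meaning only through human labels. The class names, the file name `street.mp4`, and the label strings are all hypothetical and chosen for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class RawFrame:
    """One frame of street video: just pixels, with no meaning attached."""
    video: str          # hypothetical file name
    frame_number: int


@dataclass
class Annotation:
    """A human-drawn box plus the label that gives it meaning."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class LabeledFrame:
    """Raw data paired with whatever supervision humans have added."""
    frame: RawFrame
    annotations: List[Annotation] = field(default_factory=list)

    def labels(self) -> Set[str]:
        return {a.label for a in self.annotations}


frame = RawFrame(video="street.mp4", frame_number=120)

# Without human input the frame carries no labels to learn from.
unlabeled = LabeledFrame(frame)

# A human decides what matters in this frame and marks it directly.
labeled = LabeledFrame(frame, [
    Annotation("pedestrian", (40, 60, 90, 200)),
    Annotation("traffic_light", (300, 10, 320, 60)),
])

print(unlabeled.labels())        # set()
print(sorted(labeled.labels()))  # ['pedestrian', 'traffic_light']
```

The same `RawFrame` appears in both objects. Only the human-added annotations differ, which is the direct control described above.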

Comparison Table

Item                  Classic                                     New, Supervised
--------------------  ------------------------------------------  --------------------------------------------------
Use case              Recommender systems                         Document Understanding
                      Anomaly detection                           Cashier-less Checkout
                      Reconditioning costs                        Visual Sports Analysis
                                                                  Self Driving
                                                                  Computer Vision

Human Team            Data Science Primary                        End User, Labeler and Subject Matter Expert
                                                                  Data Science as Partner
                                                                  Deeper Business and Normal Engineering engagement

Feature Store         Feature Store Software                      Supervised Training Data Catalog
                                                                  This fulfills a similar conceptual idea of
                                                                    organizing and searching existing work.

Data Prep ETL         Classic ETL tools.                          Supervised Training Data ETL

Data Formats          Tabular, logs, text, series                 Videos, images, audio files, geo-spatial,
                                                                    point clouds, unstructured text.

Generation Procedure  Fixed                                       Capture of raw data + human supervision
                      Generated by an existing separate system.   Can capture “new” data “after the fact”
                      Generally can’t “change” the raw data,      Can generate novel data by adding novel Schema.
                        beyond “Cleaning” it.
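
The idea that novel data can be generated by adding a novel Schema can be sketched as follows. This is an illustrative assumption, not a description of any specific product: the frame names, the label names, and the `build_examples` helper are hypothetical. The raw frames are captured once, and extending the Schema lets annotators produce new training examples from the same raw data.

```python
from typing import Dict, List, Set, Tuple

# Raw data captured once; the file names are hypothetical.
RAW_FRAMES: List[str] = ["frame_001.jpg", "frame_002.jpg"]


def build_examples(
    schema: Set[str],
    human_labels: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (frame, label) pairs, keeping only labels the schema defines."""
    examples = []
    for frame in RAW_FRAMES:
        for label in human_labels.get(frame, []):
            if label in schema:
                examples.append((frame, label))
    return examples


# Everything the annotators have marked so far.
human_labels = {
    "frame_001.jpg": ["car"],
    "frame_002.jpg": ["car", "pedestrian"],
}

schema_v1 = {"car", "pedestrian"}
v1 = build_examples(schema_v1, human_labels)

# Later the team adds a new concept to the Schema, and an annotator
# revisits the same, unchanged raw frame to mark it.
schema_v2 = schema_v1 | {"open_car_door"}
human_labels["frame_001.jpg"].append("open_car_door")
v2 = build_examples(schema_v2, human_labels)

print(len(v1), len(v2))  # 3 4
```

No new raw data was captured between `v1` and `v2`. The extra example exists only because the Schema grew and a human applied it, which a fixed, system-generated table in the Classic case does not allow.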

Best Practices with Architecture

It's important to have a different process for Supervised Training Data. The end users and implementation details are very different. Supervised Training Data must have its own named processes to be successful.