Classic and New Supervised Machine Learning

Introduction

Human Supervised Training Data is different from Classic Training Data. It’s new. It has different goals, involves different skill sets, and uses different algorithmic approaches.

Discovery vs Automation

In the Classic case we usually don’t know what the answer is and we wish to discover it.

In the new, Supervised case we already know what’s correct and we wish to structure this understanding so it can be repeated by an AI/ML program.

Examples of Classic cases:

We don’t know what movie preferences someone has so we wish to discover them.
We don’t know what is causing a weather pattern and we want to discover it.

Examples of New Supervised cases and areas:

Understand or "read" a document and repeat common actions on it.
Cashier-less Checkout
Visual Sports Analysis
Self Driving
Computer Vision
NLP, driven by humans adding annotations
Time Series
Speech

Spreadsheets vs Unstructured Data

In the Classic case (the spreadsheet), the data is already fixed and there is little to label, whereas in the New case the raw data doesn’t mean anything by itself. Humans must add labels to control the meaning.

For example, videos of a street don't mean anything without human labels, as shown below:

This means there are more degrees of freedom and more capabilities. In other words, in the classic context there is only indirect human control, whereas in the new context there is direct human control.
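
The contrast between the two kinds of data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`ClassicRow`, `Annotation`, `RawVideoFrame`), the file name, and the field values are all hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class ClassicRow:
    # Classic case: the meaning is already fixed by the column names.
    user_id: int
    movie_id: int
    rating: float


@dataclass
class Annotation:
    # A human-supplied label: what the object is and where it sits in the frame.
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels; hypothetical layout


@dataclass
class RawVideoFrame:
    # New case: the raw pixels carry no meaning until a human adds labels.
    video_path: str
    frame_number: int
    annotations: list = field(default_factory=list)

    def has_supervision(self) -> bool:
        # True once at least one human label has been attached.
        return len(self.annotations) > 0


row = ClassicRow(user_id=1, movie_id=42, rating=4.5)

frame = RawVideoFrame(video_path="street.mp4", frame_number=120)
print(frame.has_supervision())  # False: raw data only, no meaning yet

frame.annotations.append(Annotation(label="pedestrian", box=(34, 50, 80, 160)))
print(frame.has_supervision())  # True: a human has defined the meaning
```

The row is meaningful as soon as it exists, while the frame becomes training data only after a person decides what to label and how.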

Comparison Table

Item	Classic	New, Supervised
Use case	Recommender systems; Anomaly detection; Reconditioning costs	Document Understanding; Cashier-less Checkout; Visual Sports Analysis; Self Driving; Computer Vision; NLP
Human Team	Data Science Primary	End User, Labeler and Subject Matter Expert; Data Science as Partner; Deeper Business and Normal Engineering engagement
Feature Store	Feature Store Software	Supervised Training Data Catalog. This fulfills a similar conceptual idea of organizing and searching existing work.
Data Prep (ETL)	Classic ETL tools	Supervised Training Data ETL
Data Formats	Tabular, logs, text, series	Videos, images, audio files, geo-spatial, point clouds, unstructured text
Generation Procedure	Fixed. Generated by an existing separate system. Generally can’t “change” the raw data, beyond “Cleaning” it.	Dynamic. Capture of raw data + human supervision. Can capture “new” data “after the fact”. Can generate novel data by adding novel Schema.
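
The last row, generating novel data by adding a novel Schema, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the file names, the label names, the `human_observations` dictionary standing in for real labelers, and the `build_training_data` helper are hypothetical.

```python
# Raw data is captured once; hypothetical file names.
raw_images = ["img_001.jpg", "img_002.jpg"]

# Two Schemas over the same raw data. Here a Schema is simply the set of
# labels that humans are allowed to apply (hypothetical label names).
schema_v1 = {"car", "pedestrian"}
schema_v2 = {"car", "pedestrian", "traffic_light"}

# Stand-in for human supervision: what a labeler sees in each image.
human_observations = {
    "img_001.jpg": ["car", "traffic_light"],
    "img_002.jpg": ["pedestrian"],
}


def build_training_data(images, schema, observations):
    """Keep only the observed labels that the Schema allows."""
    return {
        image: [label for label in observations[image] if label in schema]
        for image in images
    }


data_v1 = build_training_data(raw_images, schema_v1, human_observations)
data_v2 = build_training_data(raw_images, schema_v2, human_observations)

print(data_v1["img_001.jpg"])  # ['car']
print(data_v2["img_001.jpg"])  # ['car', 'traffic_light']
```

The raw images never change, yet the second Schema yields training data that did not exist under the first. A fixed spreadsheet has no equivalent step.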

Best Practices with Architecture

It's important to have a different process for Supervised Training Data. The end users and implementation details are very different. Supervised must have its own named processes to be successful.
