Catalog

Introduction

Catalog is a way to explore, discover, use, and maintain your AI/ML data.

Catalog focuses on unstructured data, meaning raw data such as images, video, and text.
Search, curate, and visualize all of your unstructured data in one place, including metadata, labels, and predictions.

Using Catalog

Catalog is part of the overall Diffgram system.
The same concepts used to import data for Annotation are generally used to import data to Catalog.
Data annotated in Diffgram is automatically available in Catalog. You may also use Catalog directly without using Annotation.

Conceptual Areas

Querying
Curation
Search
Sharing
Streaming
Maintenance

Annotation and Catalog

Catalog operates on more than one sample of data at a time; in other words, on sets of data.
Annotation operates on one sample at a time, even if that one sample has multiple pages, attachments etc.

Accessing ML Programs

Having your data in Catalog unlocks many use cases through Workflow. For example, you can use an ML program installed through Workflow to curate your data and automatically store the output in Catalog. You can also curate data in Catalog and send it for Annotation.

Dividing Annotation from Catalog

As a project grows, it's common to treat annotation as a separate concern from using the data. For example, the people consuming the data may be less involved in creating and maintaining the annotations. This is similar to the front end and back end of a system: the two are deeply interconnected, but different teams work on them.

Over time, expectations around the data change. Previously, it was common to have a single dataset A, annotate it, and then use it directly as A. Now there may be datasets A, B, C, ... n, and the Schema for those sets may include XYZ. A scientist, or an ML program, may then be configured to train a model on all data that has the XYZ schema, across all datasets, from the past 12 months, and so on.
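As a rough illustration of that kind of selection, the sketch below filters a set of samples by schema and age across several datasets. It is plain Python over in-memory records and does not use the Diffgram SDK. The `Sample` class, the `select_samples` function, and all field names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for one catalogued file; not a Diffgram type."""
    file_id: int
    dataset: str
    schema_labels: frozenset
    created: datetime


def select_samples(samples, required_labels, now, max_age_days=365):
    """Return samples from any dataset that carry every required label
    and were created within the last max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    required = frozenset(required_labels)
    return [
        s for s in samples
        if required <= s.schema_labels and s.created >= cutoff
    ]


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 1, 1)
samples = [
    Sample(1, "A", frozenset({"X", "Y", "Z"}), datetime(2023, 6, 1)),
    Sample(2, "B", frozenset({"X", "Y"}), datetime(2023, 9, 1)),       # missing Z
    Sample(3, "C", frozenset({"X", "Y", "Z"}), datetime(2021, 3, 1)),  # too old
    Sample(4, "B", frozenset({"X", "Y", "Z", "W"}), datetime(2023, 12, 15)),
]

selected = select_samples(samples, {"X", "Y", "Z"}, now)
print([(s.file_id, s.dataset) for s in selected])  # [(1, 'A'), (4, 'B')]
```

The selection is defined by properties of the data (schema and age) rather than by naming one dataset, which is what allows it to span A, B, C, and any datasets added later.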

Known Issues

See Quality Bar
Issues
