Schema University

What do we care about?

Generally, we care where something is, what it is, and how it relates to other things.

Labels and Attributes are the tools we use to express “What” something is. In the next section, I will introduce Spatial types to discuss Where something is.

The concept of representing What something is can expand with near-infinite complexity, whereas the spatial location aspects generally have more obvious limits to their expansion. In other words, getting the What right is a greater ongoing challenge than the mechanical specifics of Where something is in a document or image.

Introduction To Labels

Labels are the “top-level” semantic meaning. In the base case, a Label may represent only itself. For example, a label “car” may map to literally “car”. In most cases, Labels organize a set of attributes.

For technical readers, to help ground this idea, I like to compare it to SQL design, as you can see in Table 3-1.

Intuitive Concept | Training Data | SQL

Table 3-1: Comparison to SQL

Of particular note here, Tables don’t usually have a “Type”, whereas Columns of course do. In this same sense, a Label doesn’t have a type, whereas Attributes do have types. Another way to think of a Label is as a folder or set of attributes.

Interestingly, in E.F. Codd’s “The Relational Model for Database Management”, he mentions that Columns were originally thought of as Attributes. While the SQL comparison is far from a perfect analogy, it helps convey the general idea. Continuing that line of thought, Attributes can be shared between Labels, which is roughly analogous to Foreign keys.

When an end user is annotating, organizing sets of attributes into labels can also help hide irrelevant options. For videos, labels help constrain relationships and organize sequences.
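The Label and Attribute relationship described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference to any particular tool: the class names, the `kind` field, and the example labels (“car”, “pedestrian”, “sky”) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Attribute:
    """A typed field, loosely analogous to a SQL column."""
    name: str
    kind: str  # hypothetical type tag, e.g. "boolean" or "select"


@dataclass
class Label:
    """An untyped grouping of attributes, loosely analogous to a SQL table."""
    name: str
    attributes: List[Attribute] = field(default_factory=list)


# One attribute definition shared by two labels (the rough "foreign key" idea).
occluded = Attribute(name="occluded", kind="boolean")
door_state = Attribute(name="door_state", kind="select")

car = Label(name="car", attributes=[occluded, door_state])
pedestrian = Label(name="pedestrian", attributes=[occluded])
sky = Label(name="sky")  # base case: the label represents only itself


def visible_attribute_names(label: Label) -> List[str]:
    """Return only the attributes an annotator should see for this label."""
    return [attribute.name for attribute in label.attributes]
```

Selecting “pedestrian” here would surface only the “occluded” question, so the door-related option that applies to “car” stays hidden from the annotator.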

It’s expected that some of the specific organizational principles discussed here will be implementation-specific and change over time. In general, the broad strokes will be similar. As this new area of Training Data continues to be refined, standards will continue to evolve.

Next, we will talk about Attributes, which is where the bulk of the Schema definition usually lives.

Attributes Introduction

Attributes represent the bulk of the “What is it”. They are the heart of both the human-encoded meaning and the technical definition. Attributes are usually defined to include, at minimum, the following structures: the human question or prompt, the form type, and the technical constraints. This set of human and machine definitions together makes one “Attribute”.
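As a rough sketch of those three parts, one Attribute could be modeled as follows. The field names, the `FormType` values, and the traffic light example are assumptions chosen for illustration, not a standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class FormType(Enum):
    """Hypothetical set of form types an annotation UI might offer."""
    RADIO = "radio"
    MULTI_SELECT = "multi_select"
    FREE_TEXT = "free_text"


@dataclass(frozen=True)
class Constraints:
    """Technical constraints that the stored value is expected to satisfy."""
    required: bool = False
    allowed_values: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Attribute:
    prompt: str               # the human question shown to the annotator
    form_type: FormType       # how the question is rendered in the UI
    constraints: Constraints  # what the machine side can rely on


traffic_light_state = Attribute(
    prompt="What color is the traffic light?",
    form_type=FormType.RADIO,
    constraints=Constraints(required=True, allowed_values=("red", "yellow", "green")),
)
```

The prompt serves the human, the constraints serve the machine, and the form type sits between the two.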

Training Data Attributes may appear superficially simple or similar to other technologies. In practice, however, there is a lot of complexity at this intersection of human-centered and machine-centered definitions.

In the same way that Training Data is a combination of raw data and human-defined meaning, Attributes are a combination of technical definitions and human-centered definitions. In order to be useful for Machine Learning, both the technical definition and the human definition are needed. Attributes are the joint representation of those two things.

More technically speaking, Attributes can be thought of as well-defined Forms or as “Data Classes meet UI specifications”. One way to wrap your head around this is to think of a spectrum between Forms and Classes and put Attributes somewhere on that spectrum. A Form can be arbitrarily complex but, unlike a Class, usually isn’t thought of as having defined Types. Further, while the implementation of a Form may have validation, it’s usually end-user validation, not a formal database Constraint. Because the Training Data is relied upon by the ML program, and is usually expected to be queryable, Attributes have more “structure” than a typical Form. Conversely, due to the expectations of Human control, Attributes usually have a flair of “form-like” behavior, more than a typical programming Class or database Table definition would have.
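To make the “Data Classes meet UI specifications” idea concrete, here is a minimal sketch in which a single definition carries both the form-like part (a prompt) and the class-like part (a checked set of values). The `SelectAttribute` name, its fields, and the weather example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SelectAttribute:
    prompt: str               # shown to the annotator (the "Form" side)
    options: Tuple[str, ...]  # machine-checked value set (the "Class" side)
    required: bool = True

    def validate(self, answer: Optional[str]) -> Optional[str]:
        """Return the answer if it satisfies the constraints, else raise ValueError."""
        if answer is None:
            if self.required:
                raise ValueError(f"An answer is required for: {self.prompt!r}")
            return None
        if answer not in self.options:
            raise ValueError(f"{answer!r} is not one of {self.options}")
        return answer


weather = SelectAttribute(
    prompt="What is the weather?",
    options=("clear", "rain", "snow"),
)
```

Calling `weather.validate("rain")` returns the answer unchanged, while `weather.validate("fog")` raises a `ValueError`. Rejecting bad values at write time, rather than only hinting in the UI, is what keeps the resulting data queryable by downstream ML code.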

In practice, Attributes fill a need for Training Data that is distinctly different from other technologies. As this area continues to evolve, I expect that Attributes will continue to expand.

Continued Reading

Please see Chapter 3 in the Training Data Book