Cybersecurity for Training Data 101

Secure Systems and Data

Security Policies

See Security Policies

Security Architecture

Let’s consider the Confidentiality, Integrity and Availability (CIA) model.

Confidentiality

Network access is the starting point for cybersecurity: the attack surface of an inaccessible network is low. You can install Diffgram and run it inside your own hardened network, or use Diffgram.com with an IP whitelist and other mechanisms, including changing IP configurations only through backchannel methods.
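To make the IP whitelist idea concrete, here is a minimal sketch of an allowlist check using Python's standard `ipaddress` module. It is not Diffgram's implementation. The `ALLOWED_NETWORKS` list and the `is_allowed` function are hypothetical names, and the CIDR ranges are reserved documentation ranges chosen purely for illustration.

```python
import ipaddress

# Hypothetical allowlist: these are reserved documentation ranges,
# not values taken from Diffgram.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # e.g. an office network
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single VPN egress IP
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allowlisted network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed addresses are rejected rather than guessed at.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)
```

In a real deployment this kind of check usually lives in a firewall, load balancer, or ingress rule rather than in application code, so that disallowed traffic never reaches the application.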

Integrity

An attacker may not intend to exfiltrate data. They may instead wish to alter your training data or deny you access to it. With Diffgram you can set real security, including all keys, based on your security posture. You control network security, the annotations database, the raw data, and the entire keychain.

For higher security installations we strongly recommend hosting your own version of Diffgram. This gives you complete control over your own security practices: you control the encryption access keys and the location of every part of the system, from the network to data at rest.

Availability

We align with NIST SP 800-207 (Zero Trust Architecture), which applies to both diffgram.com and your own installation.

For Internal IT Teams

Attack Surface

Installation configuration is the starting point, since networking is important for cybersecurity. The attack surface of an inaccessible network is low. For example, if you already have a hardened cluster, you can install Diffgram and use it within that network.

Installation

Each Diffgram install has its own cloud bucket and database. This is where all media and annotations are stored.
Depending on your requirements, you may have one installation for all of your projects, or a separate installation per project.

Please treat this guide as surface-level notes only. We encourage everyone with high security requirements to contact us. We have staff with cybersecurity expertise who can assist with every aspect of securing your installation.

Diffgram can be installed using Helm or Docker. Example configurations:

  • Install Diffgram once on a new k8s cluster
  • Install multiple instances of Diffgram on a k8s cluster
  • Install Diffgram once on an existing k8s cluster

Each of these three configurations has a different security posture.

Data Access

Each Diffgram installation has two main data access points:

  1. The cloud bucket that everything gets deposited into
  2. The database

Inside the application, Diffgram allows users to add and configure cloud data access based on credentials the user adds. Separate from this, a single cloud bucket is defined at installation time, and all media ingested from other sources is stored there. You can configure this bucket as you wish, and also use it as a further control mechanism, since changing access here will invalidate all raw storage access.

Identity

You can use Diffgram's built-in basic Auth, or use any OpenID, SSO, SAML 2.0, Social Login, LDAP or Active Directory. See Keycloak Config.

Diffgram has role-based access control, which is set up on a "per project/tenant" basis. See Role Based Access Control.
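As a rough illustration of what "per project" role-based access control means, the sketch below checks an action against the role a user holds in one specific project. This is a generic model, not Diffgram's actual data model or API. The role names, permission names, `User` class, and `can` function are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "annotator": {"read", "annotate"},
    "admin": {"read", "annotate", "export", "manage_members"},
}


@dataclass
class User:
    name: str
    # Maps a project identifier to the role the user holds in that project.
    project_roles: dict = field(default_factory=dict)


def can(user: User, project: str, action: str) -> bool:
    """Check an action against the user's role in one specific project."""
    role = user.project_roles.get(project)
    if role is None:
        return False  # No role in this project means no access at all.
    return action in ROLE_PERMISSIONS.get(role, set())


alice = User("alice", {"project-a": "admin", "project-b": "viewer"})
print(can(alice, "project-a", "export"))  # admin in project-a
print(can(alice, "project-b", "export"))  # only a viewer in project-b
print(can(alice, "project-c", "read"))    # no role in project-c
```

The point of scoping roles to a project is that being an admin in one project grants nothing in another, which is what lets a single installation keep the data of different projects apart.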

For Companies who do Annotation work for Clients

There are three common configurations:

  1. Secure: Host your own Diffgram installation and use Diffgram's internal Project scope to separate data
  2. More Secure: Host a Diffgram install for each high security client
  3. Most Secure: Have the client host Diffgram, and access it as a user

For example, if the client provides you with a secure bucket, and you install Diffgram with that credential, the client can revoke the credential and remove your access. This would essentially be the same as "Access without download", since any operations done in Diffgram would write/read to the client's system.

Comparison to bucket IAM delegation schemes

Bucket-based IAM delegation is a really bad idea. Here is a small sampling of why:

  1. It is security obfuscation - which means it is not real security. Since the application has IAM access to the data, it can at any point access the data and store it somewhere else. Therefore in a security event, revoking access is only partly effective.

  2. What about network security? What about the annotations (the database)? Bucket IAM solves only a small fraction of the problem.

  3. Most IAM schemes generate signed URLs, which are difficult or impossible to invalidate after the fact. Further, the act of invalidating them may involve days of work and breaking changes such as moving all files to a different directory. Bucket IAM schemes don't even solve the problem they were supposed to solve.

  4. Data access is only one threat vector. All applications have vulnerabilities and all changes can introduce vulnerabilities. An attacker may not intend to exfiltrate data. They may instead wish to alter your training data. Or deny you access.

  5. It is well established that open source software is more secure over the long run than most closed source software. Diffgram is open source.

In practice this means that in any breach of the system where an "export" is generated, the attacker will have both the annotations and access to the raw media, and you will have no technical way to stop it. For example, if you are using Labelbox with an IAM solution, an attacker would only need to compromise one admin account, or one Labelbox super admin, to access all of your annotations and raw data. The moment they get that export file, they have you.

In contrast with Diffgram

  1. You can set real security, including all keys, based on your real and current security posture.
  2. You control network security, the annotations database, the raw data, everything.
  3. You control the entire keychain.
  4. You are aware of the other threats and you can take action such as pinning specific versions.
  5. Diffgram is open source.

Signed URLs are not automatically bad, but they are a tool whose context must be considered.
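To see why signed URLs are hard to invalidate after the fact, consider this simplified sketch of an HMAC-based scheme. It is a generic illustration and not the exact signing format of any cloud provider. The `sign` and `verify` functions, the keys, and the file path are hypothetical. The key property is that verification is stateless: only the signing key and the clock decide whether a URL is valid.

```python
import hashlib
import hmac


def sign(path: str, expires_at: int, key: bytes) -> str:
    """Return a hex signature binding the path to its expiry time."""
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(path: str, expires_at: int, signature: str, key: bytes, now: int) -> bool:
    """Stateless check: only the key and the clock decide validity."""
    if now >= expires_at:
        return False
    expected = sign(path, expires_at, key)
    return hmac.compare_digest(expected, signature)


key = b"original-key"          # hypothetical signing key
path = "/media/frame_001.jpg"  # hypothetical object path
sig = sign(path, expires_at=2000, key=key)

# Before expiry the URL is valid for anyone who holds it.
print(verify(path, 2000, sig, key, now=1000))             # True
# It only stops working once the expiry time passes...
print(verify(path, 2000, sig, key, now=2000))             # False
# ...or once the key is rotated, which breaks every URL signed with it.
print(verify(path, 2000, sig, b"rotated-key", now=1000))  # False
```

Because there is no per-URL record to delete, a leaked URL in a scheme like this stays usable until it expires unless the key is rotated or the object is moved, and both of those are disruptive, broad changes rather than a targeted revocation.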

For Air Gapped and TS/SCI cases

You can install Diffgram in your sensitive compartmented information facility (SCIF).

We also support "Hand me a USB" cases including offline documentation. Please contact us.

