Cybersecurity for Training Data 101

Introduction

Diffgram installations run on critical infrastructure and handle sensitive documents. You or your team can install Diffgram in a high-security configuration. Naturally, the security posture depends on your configuration and choices. This document consolidates a variety of high-level security concepts.

CISO Brief

Read the CISO Training Data Brief

Security Policies

See Security Policies

Improving Security

One of the biggest security challenges in data science is the connection between ML programs and the data. A common pattern is for each scientist to be responsible for transferring the data between its at-rest location and the ML program (for example, provisioning a new server and sending the data there, or even running it on their local machine). Another is for the one team that handles security to be openly circumvented in the name of consolidating data to make it work (yes, we have really seen this at a big tech company). Diffgram solves this, providing an easy way to plug what is likely one of your largest security holes.

Two concrete ways Diffgram improves this and reduces your security risk are:

1) Connecting the data between your secure Diffgram install and the secure ML program. The scientist triggers the connection to start the training and controls all the parameters, even some of the code, but never sees the actual data directly (unless authorized under a different process, e.g. viewing a sample of it in view-only mode). This seemingly simple change directly removes one of the most common and largest security holes.
2) Installing your ML programs on the Diffgram cluster or framework. This runs the ML program in the same environment as the data. The result is similar to 1).

For implementation details see Workflow.

Security Architecture

Diffgram must be installed on your own hardware (cloud or on-prem)
Use Pass By Reference for BLOB storage from_blob_path
Optional: Use a custom URL signer as a further insulation layer
Use an identity provider via OIDC

Further options include

Setting custom roles
Using tags to add business context and further grouping of permissions
Integrating with event logging and other security software
Reducing the signed URL access time (we have seen some as low as 60 seconds)
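
To make the idea of a short signed URL access time concrete, here is a minimal sketch of an expiring, HMAC-signed URL using only the Python standard library. It is not Diffgram's signer or any cloud provider's scheme; `SECRET_KEY`, `sign_url`, and `verify_url` are hypothetical names chosen for illustration, and a real deployment would normally rely on the storage provider's own signing mechanism.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import urlencode

# Hypothetical secret for illustration; load a real one from your key management system.
SECRET_KEY = b"replace-with-a-real-secret"


def sign_url(path: str, ttl_seconds: int = 60, now: Optional[float] = None) -> str:
    """Return `path` with an expiry timestamp and an HMAC-SHA256 signature appended."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    message = f"{path}:{expires}".encode()
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'signature': signature})}"


def verify_url(path: str, expires: int, signature: str, now: Optional[float] = None) -> bool:
    """Accept the request only if the URL has not expired and the signature matches."""
    current = time.time() if now is None else now
    if current > expires:
        return False
    message = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)
```

With a 60 second lifetime, a leaked URL stops working a minute after it was issued, which limits how long it is useful to anyone who intercepts it.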

These approaches must be accompanied by other security best practices to be effective. For example, your cluster may deny incoming connections from outside your defined network.

Moving to Diffgram is a big security improvement. We have seen teams go from data that moves from A to B to C to D, with big security holes along the way, to data that lives in just one place (Diffgram).

Diffgram can be further customized to meet any security needs. Diffgram users store highly sensitive documents for major companies.

Security Architecture

Let’s consider the Confidentiality, Integrity and Availability (CIA) model.

Confidentiality

Networking is 101 for cybersecurity. The attack surface of an inaccessible network is low. With Diffgram you install and use it within your hardened network or use Diffgram.com with IP whitelist and other mechanisms, including only changing IP configs through backchannel methods.
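
As a rough illustration of what an IP allowlist check does, the sketch below uses Python's standard `ipaddress` module. The networks listed and the `is_allowed` helper are assumptions made for the example; they do not describe how Diffgram.com or any particular firewall is configured.

```python
import ipaddress

# Hypothetical allowlist: one private range and one documentation-only range.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_allowed(client_ip: str) -> bool:
    """Return True only if client_ip parses and falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed input is rejected rather than raising.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)
```

In practice this kind of check usually lives in a firewall, load balancer, or ingress controller rather than in application code; the sketch only shows the decision being made.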

Integrity

An attacker may not intend to exfiltrate data. They may instead wish to alter your training data (adversarial attack by modifying the raw data or training data). Or deny you access. With Diffgram you can set real security, including all keys, based on your security posture. With Diffgram you control network security, the annotations database, the raw data, everything, the entire keychain.

Installing Diffgram allows you complete control to set your own security practices. You can control the encryption access keys and location of all aspects of the system from network to data at rest. And of course you can then set your own custom security practices.
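
One common way to detect altered raw data when you control the storage yourself is to keep a manifest of content hashes and compare against it later. The sketch below is a generic illustration using the Python standard library, not a Diffgram feature; `build_manifest` and `changed_files` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: Dict[str, str]) -> List[str]:
    """List files that were modified, added, or removed since the manifest was built."""
    current = build_manifest(root)
    names = set(manifest) | set(current)
    return sorted(name for name in names if manifest.get(name) != current.get(name))
```

The manifest itself needs to be stored somewhere an attacker cannot rewrite it, otherwise the comparison proves nothing.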

Availability

We align with NIST 800-207 (Zero Trust), which is applicable to both diffgram.com and your own installation.

For Internal IT Teams

Attack Surface

Installation configuration is the starting point, since networking is important for cybersecurity. The attack surface of an inaccessible network is low. So, for example, if you already have a hardened cluster, you can install Diffgram and use it within that network.

Installation

Depending on your requirements, you may have one installation for all of your projects, or separate installations per project if required.

Please treat this guide as surface-level notes only. We encourage everyone with high security requirements to contact us. We have experts with cybersecurity knowledge on staff who can assist with every aspect of securing your installation.

Diffgram can be installed using Helm or Docker. Example configurations:

Install Diffgram once on a new k8s cluster
Install multiple instances of Diffgram on a k8s cluster
Install Diffgram once on an existing k8s cluster.
All 3 of those have different security postures.

Data Access

Each Diffgram installation has two main data access points

BLOB Storage
The database

Storage may be by reference for higher security.
Each Diffgram install has its own cloud bucket and database. In higher security postures this cloud bucket may hold only minimal artifacts.

Inside the application, Diffgram allows users to add and configure cloud data access based on credentials the user adds. Separately, at installation time, a single cloud bucket is defined where all media ingested from other sources (excluding media passed by reference) is stored. You can configure this bucket as you wish and also use it as a further control mechanism, since changing access here will invalidate all raw storage access.

Identity

You can use Diffgram's built-in basic Auth, or use any OpenID, SSO, SAML 2.0, Social Login, LDAP or Active Directory. See OIDC

Diffgram has a role based access control concept. This is set up on a "per project/tenant" basis. See Role Based Access Control.
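
To illustrate what "per project" role assignment means in general terms, here is a small sketch. The role names, permission names, and the `Member` and `can` helpers are all hypothetical and are not Diffgram's actual roles or API; see the Role Based Access Control page for the real model.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": {"read"},
    "annotator": {"read", "annotate"},
    "admin": {"read", "annotate", "export", "manage_members"},
}


@dataclass
class Member:
    email: str
    # Maps project name to the member's role in that project.
    project_roles: Dict[str, str] = field(default_factory=dict)


def can(member: Member, project: str, action: str) -> bool:
    """Allow an action only if the member's role in that project grants it."""
    role = member.project_roles.get(project)
    if role is None:
        # No role in this project means no access at all.
        return False
    return action in ROLE_PERMISSIONS.get(role, set())
```

The point of scoping roles by project is that being an admin in one project grants nothing in another.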

For Companies who do Annotation work for Clients

There are three common configurations:

Secure: Host your own Diffgram installation and use Diffgram's internal Project scope to separate data
More Secure: Host a Diffgram install for each high security client
Most Secure: Have the client host Diffgram, and access it as a user

For example, if the client provides you with a secure bucket, and you install Diffgram with that credential, the client can revoke the credential and remove your access. This would essentially be the same as "Access without download", since any operations done in Diffgram would write/read to the client's system.

Comparison to bucket IAM delegation schemes

Bucket IAM is similar to Pass By Reference methods (where the data doesn't move).

However, with Diffgram, it's your own hardware and the software is open source. This means you can inspect the software and control the hardware. By contrast, if you provide IAM delegation to a remote cloud application (like Labelbox), it is really just security obfuscation, not real security. Since the remote application has IAM access to the data, it can access the data at any point and store it somewhere else, and there is no real way to know whether it is doing that.

Another difference is that with Pass By Reference, you can include a custom URL signer. With the IAM method, you are forced to rely on the integrated IAM signer, so you lose that extra layer of control between IAM concepts and actually generating a signed URL.
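
The "extra layer of control" can be pictured as a wrapper that applies your own policy before any signing happens. The sketch below assumes a signer is simply a function from a blob path and a lifetime to a URL; `make_guarded_signer` and the `Signer` type are hypothetical and do not reflect Diffgram's custom signer interface.

```python
from typing import Callable, Sequence

# Assumed shape of a signer: (blob_path, ttl_seconds) -> signed URL.
Signer = Callable[[str, int], str]


def make_guarded_signer(
    base_signer: Signer,
    allowed_prefixes: Sequence[str],
    max_ttl_seconds: int = 60,
) -> Signer:
    """Wrap a signer with a path allowlist and a cap on URL lifetime."""

    def guarded(blob_path: str, ttl_seconds: int) -> str:
        # Refuse to sign anything outside the approved storage prefixes.
        if not any(blob_path.startswith(prefix) for prefix in allowed_prefixes):
            raise PermissionError(f"path not allowed: {blob_path}")
        # Never hand out a URL that lives longer than the configured cap.
        return base_signer(blob_path, min(ttl_seconds, max_ttl_seconds))

    return guarded
```

Because the policy sits in code you own, you can tighten it (for example, by adding logging or per-user checks) without changing the underlying IAM configuration.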

Comparison to Labelbox

It is well established that open source software is more secure over the long run than most closed source software. Diffgram is open source.

Raw data access is only one threat vector. All applications have vulnerabilities and all changes can introduce vulnerabilities. An attacker may not intend to exfiltrate data. They may instead wish to alter your training data. Or deny you access.

In practice this means that if Labelbox's system is breached and an "export" is generated, the attacker will have both the annotations and access to the raw media, and you will have no technical way to stop it. For example, if you are using Labelbox with an IAM solution, an attacker would only need to compromise one admin account, or one Labelbox super admin, to access all of your annotations and raw data. The moment they get that export file, they have you.

IAM schemes generate signed URLs at an interval outside your control (it is under Labelbox's control). Signed URLs are difficult or impossible to invalidate after the fact. Further, the act of invalidating them may involve days of work and breaking changes, such as moving all files to a different directory. With Diffgram, by contrast, you control the signed URL generation time.

With Diffgram:

You can set real security, including all keys, based on your real and current security posture.
You control network security, the annotations database, the raw data, everything.
You control the entire keychain.
If you become aware of threats you can take action such as pinning specific versions.
Diffgram is open source.
You can deeply control access to referenced BLOB storage, including optional signed url time config and optional insulating layers like a custom signer.

Signed URLs are not automatically bad, but they are a tool whose context must be considered.

For Air Gapped and TS/SCI cases

You can install Diffgram in your sensitive compartmented information facility (SCIF).

We also support "Hand me a USB" cases including offline documentation. Please contact us.