Directory (Dataset)

The directory/dataset concept is a container of multiple files. Read more about the concept here.

In the SDK context, any directory is an iterable object. This means you can index it or loop over it with a for loop, just as you would with a normal Python list.

dataset = project.directory.get('my dataset')

file = dataset[5] # Gets the file at index 5 (the 6th file, since indexing starts at 0)

# Loop through all files
for file in dataset:
  print(file)

# Display an image
from matplotlib import pyplot as plt
plt.imshow(dataset[0]['image'])
plt.show()

Data Streaming Concepts:

The dataset object does NOT contain all the files in the dataset; it only contains their file IDs. Files are loaded on demand, when the image/video data or the annotation data is actually needed. This is useful for several reasons:

  • Avoids requiring large amounts of hardware resources for very big datasets.
  • Allows the developer to load only the files that are needed.
  • Allows easier manipulation of data without moving huge amounts of raw data (file IDs act as pointers to the files).
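
The on-demand idea described above can be illustrated with a minimal sketch. This is not the SDK's implementation: the class name LazyDataset, the fake_fetch function, and the fetch_count attribute are all hypothetical and exist only to show how an object can hold IDs and load each file when it is requested.

```python
class LazyDataset:
    """Hypothetical stand-in for an ID-backed dataset (not the SDK's class)."""

    def __init__(self, file_ids, fetch):
        self.file_ids = list(file_ids)  # only the IDs are kept in memory
        self._fetch = fetch             # callable that loads one file by ID
        self.fetch_count = 0            # counts how many files were loaded

    def __len__(self):
        return len(self.file_ids)

    def __getitem__(self, index):
        # The file is fetched only at the moment it is requested.
        self.fetch_count += 1
        return self._fetch(self.file_ids[index])

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


def fake_fetch(file_id):
    # Placeholder for a network call that would return image and annotation data.
    return {"id": file_id, "image": None}


lazy = LazyDataset([101, 102, 103], fake_fetch)
first = lazy[0]  # triggers exactly one fetch
```

In this sketch, creating the object loads nothing; only indexing or iterating triggers a fetch, which is why memory use stays small even when the ID list is long.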

You can access the entire list of file IDs in the dataset with:

dataset.diffgram_file_id_list # A Python list with the IDs of all files in the dataset/directory.
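
Because the ID list is a plain Python list, it can be manipulated without loading any file data. The sketch below splits a list of IDs into training and validation subsets. The helper split_file_ids and the example IDs are hypothetical and not part of the SDK; in practice the input would presumably be dataset.diffgram_file_id_list.

```python
import random


def split_file_ids(file_ids, train_fraction=0.8, seed=0):
    """Shuffle a copy of the IDs and split it into two lists (hypothetical helper)."""
    ids = list(file_ids)               # copy, so the original list is untouched
    random.Random(seed).shuffle(ids)   # seeded for a reproducible split
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Stand-in for dataset.diffgram_file_id_list (made-up IDs for illustration).
example_ids = list(range(1, 11))
train_ids, val_ids = split_file_ids(example_ids)
```

Only integers are shuffled and sliced here, so the split costs the same regardless of how large the underlying images or videos are.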

Methods: