from_url is the preferred method of input over from_local.
From Python SDK
url, string, valid URL to the resource
media_type, string, in ["image", "video"], default "image"
instance_list, dictionary, the structure for an image, describing the labels that should be added with the file
frame_packet_map, dictionary, the structure for a video, describing the labels that should be added with the file
result = project.file.from_url(
    signed_url)
result = project.file.from_url(
    signed_url,
    media_type="video")
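Because the call only queues the input, it can help to check the arguments locally before sending them. The following is a minimal sketch, not part of the Diffgram SDK: the helper name check_from_url_args is hypothetical, and restricting URLs to http and https is an assumption. It only checks the two documented constraints, that url is a string holding a valid URL and that media_type is "image" or "video".

```python
from urllib.parse import urlsplit

# Values documented for the media_type parameter
VALID_MEDIA_TYPES = ("image", "video")


def check_from_url_args(url, media_type="image"):
    """Hypothetical helper: raise ValueError if the arguments look wrong.

    Assumption: only http and https URLs are accepted.
    """
    if not isinstance(url, str):
        raise ValueError("url must be a string")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a valid http(s) URL: {url!r}")
    if media_type not in VALID_MEDIA_TYPES:
        raise ValueError(f"media_type must be one of {VALID_MEDIA_TYPES}")
    return url, media_type
```

The returned pair could then be passed straight to project.file.from_url(url, media_type=media_type).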
Sending data to a Job at the same time
This is a common pattern. A short example is below, and a long-form example is here.
job = project.job.new()
for signed_url in signed_url_list:
    result = project.file.from_url(
        signed_url,
        job=job
    )
Returns Immediately, Long Running Background Operation
The call returns the input_id right away and places the input in the queue.
Inputs are processed in the background by our distributed workers. Depending on factors such as type of file, account type, and system load, your file may be available in Diffgram almost immediately or may take several hours.
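Since processing time varies this much, a caller that needs the file usually ends up polling. This document does not show the SDK's status call, so the sketch below is only a generic polling loop with exponential backoff. fetch_status is a hypothetical callable you would supply yourself, and the status strings "success" and "failed" are assumptions.

```python
import time


def wait_until_processed(fetch_status, input_id, timeout_seconds=3600.0,
                         first_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Poll until the input reaches a final status, backing off between checks.

    fetch_status is a hypothetical callable: input_id -> status string.
    The final statuses "success" and "failed" are assumed, not documented.
    """
    delay = first_delay
    waited = 0.0
    while True:
        status = fetch_status(input_id)
        if status in ("success", "failed"):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"input {input_id} still not processed")
        sleep(delay)
        waited += delay
        # Double the wait each round, up to max_delay
        delay = min(delay * 2, max_delay)
```

Backing off keeps the number of status requests low when a file takes hours rather than seconds.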
Example of Generating URL from Google and Sending to Diffgram
# pip install google-cloud-storage
from google.cloud import storage
import time
import os
from diffgram import Project
project = Project(
client_id = "",
client_secret = "",
project_string_id = ""
)
"""
Docs https://googleapis.github.io/google-cloud-python/latest/storage/blobs.html?highlight=generate_signed_url#google.cloud.storage.blob.Blob.generate_signed_url
To get the service account .json
https://cloud.google.com/iam/docs/creating-managing-service-accounts
CLOUD_STORAGE_BUCKET is your storage bucket name,
and each path is as shown in Google Cloud Storage, starting from the bucket.
"""
# Path to the service account .json, relative to this script
SERVICE_ACCOUNT = "shared/helpers/your_account.json"
CLOUD_STORAGE_BUCKET = "diffgram-sandbox"
# Helper function: build a storage client from the service account file
def get_gcs_service_account():
    path = os.path.join(os.path.dirname(os.path.realpath(__file__)), SERVICE_ACCOUNT)
    return storage.Client.from_service_account_json(path)
gcs = get_gcs_service_account()
bucket = gcs.get_bucket(CLOUD_STORAGE_BUCKET)
# Path to file in cloud storage
path_list = [
"dog_ate_my_homework.mov",
"nuclear_launch_codes.mov",
"lions_tigers_bears_oh_my.mov"]
blob_expiry = int(time.time() + (60 * 60 * 24 * 30))
for path in path_list:
    blob = bucket.blob("directory_example/" + path)
    # Generate signed url (string) using google sdk
    # this gives temporary access to the resource
    signed_url = blob.generate_signed_url(expiration=blob_expiry)
    # Let's see what it looks like
    print(signed_url)
    # Use the Diffgram from_url() method
    result = project.file.from_url(
        signed_url,
        media_type="video")
    print(result)
Video Split Duration (In Seconds)
Smaller units of work improve annotation quality.
Beta Feature
A video file sent with a video_split_duration will be parsed into separate video files based on the duration given.
For example, a 30 minute video with a video_split_duration of 30 seconds will be split into 60 videos.
- Valid range is 2 to 180 seconds. We recommend 30 seconds or less.
- Limit of 100 videos created.
- Parsed videos inherit properties like job and directory from their parent.
# SDK >= 0.1.7.5
result = project.file.from_url(
    signed_url,
    video_split_duration=30
)
# Full
result = project.file.from_url(
    signed_url,
    media_type="video",
    job=job,
    video_split_duration=30
)
Note: the status of the parent video will show success once it has been split into clips. After that, the status of each clip can be tracked separately.
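The limits above (2 to 180 seconds per clip, at most 100 videos created) can be checked before uploading. The sketch below is illustrative only. The function names are hypothetical, and it assumes that a trailing partial clip counts as one video. What the service does when a split would exceed 100 videos is not stated here, so the sketch only reports the expected count.

```python
import math

MIN_SPLIT_SECONDS = 2    # documented lower bound
MAX_SPLIT_SECONDS = 180  # documented upper bound
MAX_CLIPS = 100          # documented limit on videos created


def expected_clip_count(video_seconds, video_split_duration):
    """Number of clips a split would produce (assumes a partial last clip counts)."""
    if not MIN_SPLIT_SECONDS <= video_split_duration <= MAX_SPLIT_SECONDS:
        raise ValueError("video_split_duration must be between 2 and 180 seconds")
    return math.ceil(video_seconds / video_split_duration)


def smallest_split_within_limit(video_seconds):
    """Smallest split duration that keeps the clip count at or under MAX_CLIPS."""
    needed = max(MIN_SPLIT_SECONDS, math.ceil(video_seconds / MAX_CLIPS))
    if needed > MAX_SPLIT_SECONDS:
        raise ValueError("video is too long to split into 100 clips or fewer")
    return needed
```

For the 30 minute example, expected_clip_count(30 * 60, 30) gives 60, which is within the limit, while a 10 second split of the same video would give 180 clips.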
Media Type Detection
We will attempt to determine the type automatically in this order:
- By the URL. The filename starts after the last "/", and the extension follows the first "." in the filename.
These are both valid:
- https: ... a/b/c/filename.extension?otherstuff...
- https: ... a/b/c/filename.extension
- Fall back to the 'Content-Type' metadata header, for example 'video/mp4'.
For example, if the URL is https: ... a/b/c/filename?something (no "." after the last "/"), the extension is treated as None, which triggers the fallback.
If a media type cannot be determined it will throw an error on the Input object.
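The detection order described above can be sketched as follows. This is an illustration, not Diffgram's implementation. The function names and the two extension tables are assumptions, and so is the choice to fall back to Content-Type when an extension is present but unrecognized.

```python
from urllib.parse import urlsplit

# Assumed extension tables, for illustration only
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png"}
VIDEO_EXTENSIONS = {"mp4", "mov"}


def extension_from_url(url):
    """Return the text after the first "." in the filename, or None."""
    path = urlsplit(url).path            # drops "?otherstuff..."
    filename = path.rsplit("/", 1)[-1]   # everything after the last "/"
    if "." not in filename:
        return None
    return filename.split(".", 1)[1].lower()


def detect_media_type(url, content_type=None):
    """Try the URL extension first, then the Content-Type header."""
    extension = extension_from_url(url)
    if extension in IMAGE_EXTENSIONS:
        return "image"
    if extension in VIDEO_EXTENSIONS:
        return "video"
    if content_type:
        # "video/mp4" -> "video"
        major = content_type.split("/", 1)[0].strip().lower()
        if major in ("image", "video"):
            return major
    raise ValueError("media type could not be determined")
```

With this sketch, a URL ending in filename.mp4?otherstuff resolves to "video" from the extension alone, and a URL with no extension resolves only if the Content-Type is something like 'video/mp4'.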
Why is URL ahead of 'Content-Type'?
We have found that, surprisingly often, Content-Type is not set, or is set to a confusing value such as application/octet-stream when the host (e.g. Google) cannot determine the type.
Checking the URL first means detection "just works" even when Content-Type exists but is invalid. It still allows expert users to purposely exclude an extension and set a Content-Type.