Storage Options

How to authenticate using Activeloop storage, AWS S3, and Google Cloud Storage.

Deep Lake datasets can be stored locally, or on several cloud storage providers including Activeloop Storage, AWS S3, and Google Cloud Storage. Datasets are accessed by choosing the correct prefix for the dataset path that is passed to methods such as deeplake.load(path), deeplake.dataset(path), and deeplake.empty(path). The path prefixes are:

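| Storage Location     | Path                           | Notes                                                            |
|----------------------|--------------------------------|------------------------------------------------------------------|
| Local                | /local_path                    |                                                                  |
| Deep Lake Storage    | hub://org_id/dataset_name      |                                                                  |
| Deep Lake Managed DB | hub://org_id/dataset_name      | Specify runtime = {"managed_db": True} when creating the dataset |
| AWS S3               | s3://bucket_name/dataset_name  |                                                                  |
| Google Cloud         | gcs://bucket_name/dataset_name |                                                                  |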

If you choose to manage your credentials in Deep Lake, you can access datasets in your own cloud buckets using the Deep Lake path hub://org_name/dataset_name, without having to pass credentials in the Python API.
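As a rough illustration of how the prefixes above map to storage locations, the following sketch classifies a dataset path by its prefix. The helper storage_location and the PREFIXES mapping are hypothetical and not part of the Deep Lake API; they only cover the prefixes listed in the table and reject any other scheme.

```python
# Hypothetical helper, not part of the Deep Lake API: it only mirrors the
# prefix table above.
PREFIXES = {
    "hub://": "Deep Lake Storage or Managed DB",
    "s3://": "AWS S3",
    "gcs://": "Google Cloud",
}


def storage_location(path: str) -> str:
    """Return the storage location implied by the prefix of a dataset path."""
    for prefix, location in PREFIXES.items():
        if path.startswith(prefix):
            return location
    if "://" in path:
        # A scheme that the table above does not list.
        raise ValueError(f"Unrecognized path prefix in {path!r}")
    # No scheme at all: treat the path as a local directory.
    return "Local"


print(storage_location("s3://bucket_name/dataset_name"))  # AWS S3
print(storage_location("/local_path"))                    # Local
```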

Authentication for each cloud storage provider:

Activeloop Storage and Managed Datasets

To access datasets stored in Activeloop, or datasets in other clouds that are managed by Activeloop, from Python, users must register on the Deep Lake App or through the CLI, and then log in through the CLI using:

activeloop register

activeloop login

Authentication using tokens

Authentication can also be performed using tokens, which can be created after registration on the Deep Lake App (Profile -> API tokens). Tokens can be passed to any Deep Lake function that requires authentication:

deeplake.load(path, token = "...")
deeplake.empty(path, token = "...")
...

Credentials created using the CLI command activeloop login expire after 1000 hours. Credentials created using API tokens in the Deep Lake App expire after the time specified for the individual token. Long-running workflows should therefore use API tokens, so that credentials do not expire mid-workflow.
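For long-running workflows, one way to avoid hard-coding a token is to read it from an environment variable and forward it through the token argument shown above. This is only a minimal sketch: the helper token_kwargs is hypothetical, and the variable name ACTIVELOOP_TOKEN is an assumption here rather than something stated on this page.

```python
import os


def token_kwargs(env=None, var_name="ACTIVELOOP_TOKEN"):
    """Build keyword arguments carrying an API token, if one is set.

    Hypothetical helper; var_name is an assumed environment variable name.
    """
    env = os.environ if env is None else env
    token = env.get(var_name)
    # Return an empty dict when no token is set, so no token argument is passed.
    return {"token": token} if token else {}


# Intended use (requires deeplake to be installed):
#   import deeplake
#   ds = deeplake.load(path, **token_kwargs())
```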

AWS S3

Authentication with AWS S3 has 4 options:

  1. Use Deep Lake on a machine in the AWS ecosystem that has access to the relevant S3 bucket via AWS IAM, in which case there is no need to pass credentials in order to access datasets in that bucket.

  2. Configure AWS through the CLI using aws configure. This creates a credentials file on your machine that is automatically accessed by Deep Lake during authentication.

  3. Save the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN (optional) in environment variables of the same name, which are loaded as default credentials if no other credentials are specified.

  4. Create a dictionary with the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN (optional), and pass it to Deep Lake through the creds parameter:

    Note: the dictionary keys must be lowercase!

import deeplake

ds = deeplake.load('s3://...', creds={
    'aws_access_key_id': 'abc',
    'aws_secret_access_key': 'xyz',
    'aws_session_token': '123',  # Optional
})
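Options 3 and 4 can be combined. The sketch below reads the AWS environment variables named above and builds a creds dictionary with the required lowercase keys, leaving out the session token when it is not set. The helper aws_creds_from_env is hypothetical and not part of Deep Lake.

```python
import os

AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def aws_creds_from_env(env=None):
    """Build a creds dictionary with lowercase keys from AWS environment variables.

    Hypothetical helper, not part of the Deep Lake API.
    """
    env = os.environ if env is None else env
    # Lowercase each variable name; skip variables that are unset or empty.
    creds = {name.lower(): env[name] for name in AWS_VARS if env.get(name)}
    # The access key and secret key are mandatory; the session token is optional.
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"Missing AWS credentials: {missing}")
    return creds


# Intended use (requires deeplake to be installed):
#   import deeplake
#   ds = deeplake.load('s3://...', creds=aws_creds_from_env())
```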

endpoint_url can be used for connecting to other object storages supporting S3-like API such as MinIO, StorageGrid and others.

Custom Storage with S3 API

To connect to other object storage services that support an S3-like API, such as MinIO or StorageGrid, add endpoint_url to the creds dictionary.

import deeplake

ds = deeplake.load('s3://...', creds={
    'aws_access_key_id': 'abc',
    'aws_secret_access_key': 'xyz',
    'aws_session_token': '123',  # Optional
    'endpoint_url': 'http://localhost:8888',
})
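Because a malformed endpoint is an easy mistake to make, a small check before adding endpoint_url can help. The sketch below is illustrative only: with_endpoint is a hypothetical helper, not part of Deep Lake, and the rule that the URL must use http or https is an assumption made for this example.

```python
from urllib.parse import urlparse


def with_endpoint(creds, endpoint_url):
    """Return a copy of creds with endpoint_url added, after a basic sanity check.

    Hypothetical helper, not part of the Deep Lake API.
    """
    parsed = urlparse(endpoint_url)
    # Assumed rule for this sketch: require an http(s) scheme and a host.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"endpoint_url should look like http(s)://host[:port], got {endpoint_url!r}"
        )
    # Copy so the caller's original dictionary is left untouched.
    return {**creds, "endpoint_url": endpoint_url}


base = {"aws_access_key_id": "abc", "aws_secret_access_key": "xyz"}
minio_creds = with_endpoint(base, "http://localhost:8888")
```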

Google Cloud Storage

Authentication with Google Cloud Storage has 2 options:

  1. Create a service account, download the JSON file containing the keys, and then pass the path to that file to the creds parameter in deeplake.load('gcs://.....', creds = 'path_to_keys.json'). It is also possible to manually pass the information from the JSON file into the creds parameter using:

    deeplake.load('gcs://.....', creds = {information from the JSON file})

  2. Authenticate through the browser using deeplake.load('gcs://.....', creds = 'browser'). This requires that the project credentials are stored on your machine, which happens after gcloud is initialized and logged in through the CLI.

    1. After this step, re-authentication through the browser can be skipped using: deeplake.load('gcs://.....', creds = 'cache')
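For the manual variant of option 1, the information from the JSON file first has to be read into a dictionary. The sketch below shows one way to do that with the standard library. The helper names service_account_creds and load_service_account are hypothetical, and the sample key contents are placeholders rather than real credentials.

```python
import json


def service_account_creds(json_text):
    """Parse the text of a service account key file into a dictionary."""
    info = json.loads(json_text)
    if not isinstance(info, dict):
        raise ValueError("Expected a JSON object containing service account keys")
    return info


def load_service_account(path):
    """Read a service account key file from disk and parse it."""
    with open(path, encoding="utf-8") as f:
        return service_account_creds(f.read())


# Placeholder content standing in for a downloaded key file.
sample = '{"type": "service_account", "project_id": "my-project"}'
gcs_creds = service_account_creds(sample)

# Intended use (requires deeplake to be installed):
#   import deeplake
#   ds = deeplake.load('gcs://.....', creds=load_service_account('path_to_keys.json'))
```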
