Step 5: Accessing Data

Accessing and loading Hub Datasets.

Loading Datasets

Hub Datasets can be loaded and created in a variety of storage locations with minimal configuration.

from hub import Dataset

# Local Filepath
ds = Dataset('./my_dataset_path')

# S3
ds = Dataset('s3://my_dataset_bucket', creds={...})

## Activeloop Storage - See Step 6
# Public Dataset hosted by Activeloop
ds = Dataset('hub://activeloop/public_dataset_name')

# Dataset in another workspace on Activeloop Platform
ds = Dataset('hub://workspace_name/dataset_name')

Since ds = Dataset(path) is used both to create and to load datasets, a typo in the path can create a new, empty dataset when you intended to load an existing one. If that happens, call ds.delete() to permanently remove the unintended dataset.
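For local paths, one way to guard against this is to confirm that the dataset directory exists before opening it. The sketch below is an illustration only and is not part of Hub: load_existing_local is a hypothetical helper, it assumes a local dataset is stored as a directory, and the loader is passed in as a callable so the check itself does not rely on any Hub API.

```python
import os


def load_existing_local(path, loader):
    """Call loader(path) only if path is an existing local directory.

    Hypothetical helper, not part of Hub. `loader` is any callable that
    opens a dataset, for example the Dataset class imported earlier.
    """
    if not os.path.isdir(path):
        # Fail loudly instead of letting a typo create a new dataset
        raise FileNotFoundError(
            f"No dataset directory at {path!r}; check the path for typos"
        )
    return loader(path)
```

With Hub installed, this could be called as ds = load_existing_local('./my_dataset_path', Dataset). The check applies to local paths only; s3:// and hub:// paths would need a different existence test.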

Referencing Tensors

Hub allows you to reference specific tensors using keys or via the "." notation outlined below.

Note: these commands only create references to tensors; no data is loaded yet.

### NO HIERARCHY ###
ds.images # is equivalent to
ds['images']

ds.labels # is equivalent to
ds['labels']

### WITH HIERARCHY - COMING SOON ###
ds.localization.boxes # is equivalent to
ds['localization/boxes']

ds.localization.labels # is equivalent to
ds['localization/labels']
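To make the equivalence concrete, here is a toy sketch of how attribute access can be routed to key access in Python. ToyDataset is a hypothetical stand-in written for this example; it is not Hub's implementation and says nothing about how Hub stores tensors.

```python
class ToyDataset:
    """Toy stand-in for a dataset; NOT Hub's implementation."""

    def __init__(self, tensors):
        self._tensors = dict(tensors)

    def __getitem__(self, key):
        # Key-based access: toy['images']
        return self._tensors[key]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails,
        # so toy.images falls through to toy['images']
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


toy = ToyDataset({"images": "images-ref", "labels": "labels-ref"})
same_object = toy.images is toy["images"]  # both forms return one reference
```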

Accessing Data

Data within the tensors is loaded and accessed using the .numpy() method:

# Indexing
W = ds.images[0].numpy() # Fetch an image and return a NumPy array
X = ds.labels[0].numpy(aslist=True) # Fetch a label and store it as a 
                                    # list of NumPy arrays

# Slicing
Y = ds.images[0:100].numpy() # Fetch 100 images and return a NumPy array
                             # The method above produces an exception if 
                             # the images are not all the same size

Z = ds.labels[0:100].numpy(aslist=True) # Fetch 100 labels and store 
                                         # them as a list of NumPy arrays

The .numpy() method raises an exception if the requested samples do not all have the same shape. In that case, use .numpy(aslist=True), which returns a list of NumPy arrays with one list entry per sample.
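The reason for the two modes can be seen with plain NumPy: samples can be combined into a single array only when their shapes match. The sketch below is an analogy, not Hub's code. to_numpy is a hypothetical helper that uses np.stack, and the exception type Hub raises may differ from the ValueError that np.stack produces.

```python
import numpy as np


def to_numpy(samples, aslist=False):
    """Combine per-sample arrays (illustration only, not Hub's code)."""
    arrays = [np.asarray(s) for s in samples]
    if aslist:
        # One list entry per sample; shapes may differ
        return arrays
    # np.stack raises ValueError if the shapes are not all identical
    return np.stack(arrays)


uniform = [np.zeros((2, 2)), np.ones((2, 2))]
ragged = [np.zeros((2, 2)), np.ones((3, 2))]

stacked = to_numpy(uniform)              # single array of shape (2, 2, 2)
as_list = to_numpy(ragged, aslist=True)  # list of two arrays
```

Calling to_numpy(ragged) without aslist=True fails, which mirrors the behavior described above for tensors whose samples differ in shape.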
