Hub API Basics

Summary of the most important Hub API commands

Please note that the code examples below may not run standalone.

Creating and Loading Hub Datasets

import hub
import numpy as np # Used by the tensor examples in later sections
# Load a Hub Dataset if it already exists (same as hub.load), or initialize
# a new Hub Dataset if it does not already exist (same as hub.empty)
ds = hub.dataset('./local_path') # Local path
ds = hub.dataset('hub://username/dataset_name') # Activeloop Platform Storage
ds = hub.dataset('s3://bucket_name/dataset_name', creds = {}) # AWS S3
# Load a Hub Dataset
ds = hub.load('dataset_path') # Can use any of the paths above
# Create an empty Hub Dataset
ds = hub.empty('dataset_path') # Can use any of the paths above
# Automatically create a Hub Dataset - Coming Soon
ds = hub.ingest.from_path('source_path', 'hub_dataset_path')
ds = hub.ingest.from_kaggle('kaggle_path', 'hub_dataset_path')
# Delete a Hub Dataset
ds.delete()

Creating Tensors and Adding Data

# Create a tensor
# Specifying htype is recommended for maximizing performance.
# Specifying dtype is required if you desire a different dtype compared
# to the default dtype for the specified htype.
ds.create_tensor('my_tensor', htype = 'bbox', dtype = 'int32')
ds.create_tensor('localization/my_tensor', htype = 'bbox', dtype = 'float32')
# Specifying the correct compression is critical for images, videos, and
# other rich data types.
ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
# Append a single sample array at the end of a tensor
ds.my_tensor.append(np.ones((1,4)))
# Append multiple samples at the end of a tensor. The first axis in the
# numpy array is assumed to be the sample axis for the tensor
ds.my_tensor.extend(np.ones((5,1,4)))
# Append multiple samples of different shapes by passing a list of arrays.
ds.my_tensor.extend([np.ones((1,4)), np.ones((3,4)), np.ones((2,4))])
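To make the sample-axis convention concrete, the sketch below uses plain nested lists in place of NumPy arrays and makes no Hub calls. The helpers `ones` and `shape_of` are hypothetical and exist only for this illustration; they show how a batch passed to `extend` breaks down into individual samples.

```python
# Illustration only: nested lists stand in for numpy arrays, and no Hub
# calls are made. `ones` and `shape_of` are hypothetical helpers.

def ones(*dims):
    """Build a nested list of 1.0 values with the given dimensions."""
    if not dims:
        return 1.0
    return [ones(*dims[1:]) for _ in range(dims[0])]


def shape_of(nested):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


# A (5, 1, 4) batch: the first axis is the sample axis, so the batch
# contributes five samples, each of shape (1, 4).
batch = ones(5, 1, 4)
batch_sample_shapes = [shape_of(sample) for sample in batch]

# A list of arrays: each element is one sample, and shapes may differ.
ragged = [ones(1, 4), ones(3, 4), ones(2, 4)]
ragged_sample_shapes = [shape_of(sample) for sample in ragged]

print(batch_sample_shapes)
print(ragged_sample_shapes)
```

The first form suits fixed-shape data; the list form suits data such as bounding boxes, where the number of rows varies per sample.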

Adding User-Specified Metadata to Datasets and Tensors

# Add or update dataset metadata
ds.info.update(key1 = 'text', key2 = 5)
# The same update can also be passed as a dictionary:
# ds.info.update({'key1': 'value1', 'key2': 5})
# Add or update tensor metadata
ds.my_tensor.info.update(key1 = 'text', key2 = 5)
# Delete metadata
ds.info.delete() # Delete all metadata
ds.info.delete('key1') # Delete one key in metadata
ds.info.delete(['key1', 'key2']) # Delete multiple keys in metadata
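The update and delete semantics listed above can be modeled with a small dictionary wrapper. `InfoSketch` is a hypothetical stand-in written for this summary, not Hub's implementation; in particular, silently ignoring missing keys on delete is an assumption.

```python
# Hypothetical stand-in for a dataset's `info` object. This is NOT Hub's
# implementation; it only mirrors the call patterns listed above.
class InfoSketch:
    def __init__(self):
        self._data = {}

    def update(self, *args, **kwargs):
        # Accepts a dictionary, keyword arguments, or both.
        self._data.update(*args, **kwargs)

    def delete(self, keys=None):
        # No argument: delete all metadata.
        if keys is None:
            self._data.clear()
            return
        # A single key is treated as a one-element list.
        if isinstance(keys, str):
            keys = [keys]
        for key in keys:
            # Assumption: missing keys are ignored rather than raising.
            self._data.pop(key, None)

    def as_dict(self):
        return dict(self._data)


info = InfoSketch()
info.update(key1='text', key2=5)
info.update({'key2': 6, 'key3': 'more'})   # existing keys are overwritten
info.delete('key1')
after_single_delete = info.as_dict()
info.delete(['key2', 'key3'])
after_multi_delete = info.as_dict()
info.update(key4=1)
info.delete()
after_delete_all = info.as_dict()
```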

Maximizing performance

# Data gets written to long-term storage at the end of the 'with'
# block or whenever the cache is full. This minimizes the number of
# write operations during dataset creation.
with hub.load('dataset_path') as ds:
    ds.create_tensor('my_tensor')
    for i in range(10):
        ds.my_tensor.append(i)
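The flush-on-exit and flush-when-full behavior described above can be sketched with a small context manager. `BufferedWriter`, its `capacity` parameter, and the list used as storage are all hypothetical; Hub's actual cache is not shown in this summary and may work differently.

```python
# Hypothetical write-back buffer illustrating the idea only.
class BufferedWriter:
    def __init__(self, storage, capacity):
        self.storage = storage      # a list standing in for long-term storage
        self.capacity = capacity    # samples held before a forced flush
        self.buffer = []
        self.flush_count = 0        # number of write operations performed

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the 'with' block writes whatever is still buffered.
        self.flush()
        return False

    def append(self, sample):
        self.buffer.append(sample)
        # A full buffer triggers a write before the block ends.
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if self.buffer:
            self.storage.extend(self.buffer)
            self.buffer.clear()
            self.flush_count += 1


storage = []
with BufferedWriter(storage, capacity=4) as writer:
    for i in range(10):
        writer.append(i)

print(writer.flush_count, len(storage))
```

With a capacity of 4, the ten appends result in three writes (two when the buffer fills and one on exit) instead of ten.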

Accessing Tensor Data

# Read tensor sample into numpy array
np_array = ds.my_tensor[0].numpy()
# Read multiple tensor samples into numpy array
# Raises an error if the tensor samples do not have equal shapes
np_array = ds.my_tensor[0:10].numpy()
# Read multiple tensor samples into a list of numpy arrays
np_array_list = ds.my_tensor[0:10].numpy(aslist=True)
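When it is not known in advance whether the samples share a shape, one option is to try the stacked read first and fall back to the list form. `read_samples` is a hypothetical helper that uses only the two `numpy()` calls shown above. The exception type raised for unequal shapes is not stated in this summary, so the sketch catches `Exception` as a placeholder that should be narrowed in real code.

```python
def read_samples(tensor_view):
    """Return a single stacked array if possible, else a list of arrays.

    `tensor_view` is assumed to be a sliced tensor such as
    ds.my_tensor[0:10].
    """
    try:
        return tensor_view.numpy()
    except Exception:
        # Assumption: the failure is due to unequal sample shapes.
        return tensor_view.numpy(aslist=True)


# Usage, assuming `ds` is a loaded dataset:
# samples = read_samples(ds.my_tensor[0:10])
```

Callers need to handle both return types, so this pattern fits exploratory code better than a training pipeline, where `aslist=True` gives a predictable result.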

Connecting Hub Datasets to ML Frameworks

# PyTorch Dataloader
dataloader = ds.pytorch(batch_size = 16, num_workers = 2)
# TensorFlow Dataset
ds_tensorflow = ds.tensorflow()