API Basics
Summary of the most important Hub commands.

Hub API Basics

Import and Installation

!pip3 install hub
import hub
Default installation does not support GCS, audio, and video data. Installation of these features is described here.

Loading Hub Datasets

Hub datasets can be stored at a variety of storage locations using the appropriate dataset_path parameter below. We support S3, GCS, Activeloop storage, and are constantly adding to the list.
# Load a Hub Dataset
ds = hub.load('dataset_path', creds = {'optional'}, token = 'optional')

Creating Hub Datasets

# Create an empty hub dataset
ds = hub.empty('dataset_path', creds = {'optional'}, token = 'optional')
# Create an Hub Dataset with the same tensors as another dataset
ds = hub.like(ds_object or 'dataset_path', creds = {'optional'}, token = 'optional')
# Automatically create a Hub Dataset from another data source
ds = hub.ingest('source_path', 'hub_dataset_path', creds = {'optional'}, token = 'optional')
ds = hub.ingest_kaggle('kaggle_path', 'hub_dataset_path', creds = {'optional'}, token = 'optional')

Deleting Datasets

hub.delete('dataset_path', creds = {'optional'}, token = 'optional', token = 'optional')

Creating Tensors

# Specifying htype is recommended for maximizing performance.
ds.create_tensor('my_tensor', htype = 'bbox')
# Specifiying the correct compression is critical for images, videos, audio and
# other rich data types.
ds.create_tensor('songs', htype = 'audio', sample_compression = 'mp3')

Creating Tensor Hierarchies

ds.create_tensor('my_group/my_tensor') #Automatically creates the group 'my_group'

Visualizing and Inspecting Datasets


Appending Data to Datasets

ds.append('tensor_1': np.ones((1,4)), 'tensor_2': hub.read('image.jpg'))
ds.my_group.append('tensor_1': np.ones((1,4)), 'tensor_2': hub.read('image.jpg'))

Appending/Updating Data in Individual Tensors

# Append a single sample
# Append multiple samples. The first axis in the
# numpy array is assumed to be the sample axis for the tensor
# Editing or adding data at a specific index
ds.my_tensor[i] = hub.read('image.jpg')

Appending Empty Samples or Skipping Samples

# Data appended as None will be returned as an empty array
ds.append('tensor_1': hub.read(...), 'tensor_2': None)
# Empty arrays can be explicitly appended if the length of the shape
# of the empty array matches that of the other samples

Accessing Tensor Data

# Read the i-th tensor sample
np_array = ds.my_tensor[i].numpy()
text = ds.my_text_tensor[i].data() # More comprehensive view of the data
bytes = ds.my_tensor[i].tobytes() # More comprehensive view of the data
# Read the i-th dataset sample as a numpy array
image = ds[i].images.numpy()
# Read the i-th labels as a numpy array or list of strings
labels_array = ds.labels[i].numpy()
labels_array = ds.labels[i].data()['value'] # same as .numpy()
labels_string_list = ds.labels[i].data()['text']
# Read a tensor sample from a hierarchical group
np_array = ds.my_group.my_tensor_1[i].numpy()
np_array = ds.my_group.my_tensor_2[i].numpy()
# Read multiple tensor samples into numpy array
np_array = ds.my_tensor[0:10].numpy()
# Read multiple tensor samples into a list of numpy arrays
np_array_list = ds.my_tensor[0:10].numpy(aslist=True)

Maximizing performance

Make sure to use the with context when making any updates to datasets.
with ds:
for i in range(10):

Connecting Hub Datasets to ML Frameworks

# PyTorch Dataloader
dataloader = ds.pytorch(batch_size = 16, transform = {'images': torchvision_tform, 'labels': None}, num_workers = 2, scheduler = 'threaded')
# TensorFlow Dataset
ds_tensorflow = ds.tensorflow()

Versioning Datasets

# Commit data
commit_id = ds.commit('Added 100 images of trucks')
# Print the commit log
log = ds.log()
# Checkout a branch or commit
ds.checkout('branch_name' or commit_id)
# Create a new branch
ds.checkout('new_branch', create = True)
# Examine differences between commits

Adding Tensor and Dataset-Level Metadata

# Add or update dataset metadata
ds.info.update(key1 = 'text', key2 = number)
# Also can run ds.info.update({'key1'='value1', 'key2' = num_value})
# Add or update tensor metadata
ds.my_tensor.info.update(key1 = 'text', key2 = number)
# Delete metadata
ds.info.delete() #Delete all metadata
ds.info.delete('key1') #Delete 1 key in metadata
ds.info.delete(['key1', 'key2']) #Delete multiple keys in metadata


# Load a Hub Dataset if it already exists (same as hub.load), or initialize
# a new Hub Dataset if it does not already exist (same as hub.empty)
ds = hub.dataset('dataset_path', creds = {'optional'}, token = 'optional')
# Append multiple samples using a list
ds.my_tensor.extend([np.ones((1,4)), np.ones((3,4)), np.ones((2,4)
# Fetch adjacent data in the chunk -> Increases speed when loading
# sequantially or if a tensor's data fits in the cache.
numeric_label = ds.labels[i].numpy(fetch_chunks = True)