API Basics
Summary of the most important Hub commands.

Import and Installation

!pip3 install hub

import hub
The default installation does not support GCS, audio, or video data. Installation of these features is described here.
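As a rough sketch, these optional dependencies can typically be pulled in via pip extras. The extras names below are assumptions; see the installation guide linked above for the exact ones.

!pip3 install "hub[audio]"   # hypothetical extra for audio support
!pip3 install "hub[video]"   # hypothetical extra for video support
!pip3 install "hub[gcp]"     # hypothetical extra for GCS support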

Loading Hub Datasets

Hub datasets can be stored in a variety of storage locations using the appropriate dataset_path parameter below. We support S3, GCS, and Activeloop storage, and are constantly adding to the list.
# Load a Hub Dataset
ds = hub.load('dataset_path', creds = {'optional'}, token = 'optional')

# Load a Hub Dataset if it already exists (same as hub.load), or initialize
# a new Hub Dataset if it does not already exist (same as hub.empty)
ds = hub.dataset('dataset_path', creds = {'optional'}, token = 'optional')
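As a rough guide, dataset_path can point to different storage backends. The path prefixes below reflect common usage and are assumptions; check the storage documentation for your hub version.

ds = hub.load('hub://org_name/dataset_name')     # Activeloop storage
ds = hub.load('s3://bucket_name/dataset_name')   # AWS S3
ds = hub.load('gcs://bucket_name/dataset_name')  # Google Cloud Storage
ds = hub.load('./path/to/local/dataset')         # Local filesystem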

Creating Hub Datasets

# Create an empty Hub Dataset
ds = hub.empty('dataset_path', creds = {'optional'}, token = 'optional')

# Create a Hub Dataset with the same tensors as another dataset
ds = hub.like(ds_object or 'dataset_path', creds = {'optional'}, token = 'optional')

# Automatically create a Hub Dataset from another data source
ds = hub.ingest('source_path', 'hub_dataset_path', creds = {'optional'}, token = 'optional')
ds = hub.ingest_kaggle('kaggle_path', 'hub_dataset_path', creds = {'optional'}, token = 'optional')

Deleting Datasets

ds.delete()

hub.delete('dataset_path', creds = {'optional'}, token = 'optional')

Creating Tensors

# Specifying htype is recommended for maximizing performance.
ds.create_tensor('my_tensor', htype = 'bbox')

# Specifying the correct compression is critical for images, videos, audio, and
# other rich data types.
ds.create_tensor('songs', htype = 'audio', sample_compression = 'mp3')

Creating Tensor Hierarchies

ds.create_group('my_group')
ds.my_group.create_tensor('my_tensor')
ds.create_tensor('my_group/my_tensor') # Automatically creates the group 'my_group'

Visualizing and Inspecting Datasets

ds.visualize()

ds.summary()

Appending Data to Datasets

ds.append({'tensor_1': np.ones((1,4)), 'tensor_2': hub.read('image.jpg')})
ds.my_group.append({'tensor_1': np.ones((1,4)), 'tensor_2': hub.read('image.jpg')})

Appending Data to Individual Tensors

# Append a single sample
ds.my_tensor.append(np.ones((1,4)))
ds.my_tensor.append(hub.read('image.jpg'))

# Append multiple samples. The first axis in the
# numpy array is assumed to be the sample axis for the tensor
ds.my_tensor.extend(np.ones((5,1,4)))

# Append multiple samples at the end of a tensor
ds.my_tensor.extend([np.ones((1,4)), np.ones((3,4)), np.ones((2,4))])

Appending Empty Samples or Skipping Samples

# Data appended as None will be returned as an empty array
ds.append({'tensor_1': None, 'tensor_2': None})
ds.my_tensor.append(None)

# Empty arrays can be explicitly appended if the length of the shape
# of the empty array matches that of the other samples
ds.boxes.append(np.zeros((0,4)))

Accessing Tensor Data

# Read a tensor sample
np_array = ds.my_tensor[0].numpy()
text = ds.my_text_tensor[0].data() # Same as .numpy() if data can be returned as a numpy array
bytes = ds.my_tensor[0].tobytes()

# Read a tensor sample from a hierarchical group
np_array_1 = ds.my_group.my_tensor_1[0].numpy()
np_array_2 = ds.my_group.my_tensor_2[0].numpy()

# Read multiple tensor samples into a numpy array
np_array = ds.my_tensor[0:10].numpy()

# Read multiple tensor samples into a list of numpy arrays
np_array_list = ds.my_tensor[0:10].numpy(aslist=True)

Maximizing performance

Make sure to use the with context manager when making any updates to datasets.
with ds:

    ds.create_tensor('my_tensor')

    for i in range(10):
        ds.my_tensor.append(i)

Adding User-Specified Metadata

# Add or update dataset metadata
ds.info.update(key1 = 'text', key2 = number)
# Also can run ds.info.update({'key1': 'value1', 'key2': num_value})

# Add or update tensor metadata
ds.my_tensor.info.update(key1 = 'text', key2 = number)

# Delete metadata
ds.info.delete() # Delete all metadata
ds.info.delete('key1') # Delete one key in metadata
ds.info.delete(['key1', 'key2']) # Delete multiple keys in metadata

Connecting Hub Datasets to ML Frameworks

# PyTorch Dataloader
dataloader = ds.pytorch(batch_size = 16, transform = {'images': torchvision_tform, 'labels': None}, num_workers = 2, scheduler = 'threaded')

# TensorFlow Dataset
ds_tensorflow = ds.tensorflow()
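A rough sketch of how the resulting dataloader might be consumed in a training loop; the assumption that each batch is a dict keyed by tensor name should be verified for your hub version.

for batch in dataloader:
    # Assumed structure: a dict of batched tensors keyed by tensor name
    images = batch['images']
    labels = batch['labels']
    # ... run a training step on (images, labels) here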

Versioning Datasets

# Commit data
commit_id = ds.commit('Added 100 images of trucks')

# Print the commit log
log = ds.log()

# Checkout a branch or commit
ds.checkout('branch_name') # or ds.checkout(commit_id)

# Create a new branch
ds.checkout('new_branch', create = True)

# Examine differences between commits
ds.diff()