Step 7: Connecting Hub Datasets to ML Frameworks
Hub Datasets can be connected to popular ML frameworks such as PyTorch and TensorFlow with minimal boilerplate code. These integrations let you train models while streaming data from the cloud, without the data pipeline bottlenecking the training process.

PyTorch

There are two ways to train models in PyTorch using Hub datasets:
  1. Hub Data Loaders are highly optimized and provide the fastest streaming and shuffling, using Hub's internal shuffling method. However, they do not support custom sampling or the fully random shuffling that is possible with PyTorch datasets + data loaders.
  2. PyTorch Datasets + Data Loaders enable all the customizability supported by PyTorch. However, streaming from Hub datasets this way is highly sub-optimal and may be 5X or more slower than using Hub data loaders.

Using Hub Data Loaders

Best option for fast streaming!
The fastest streaming of data to GPUs using PyTorch is achieved using Hub's built-in PyTorch dataloader, ds.pytorch(). If your model training is highly sensitive to the randomization of the input data, please pre-shuffle the data, or explore our writeup on Shuffling in ds.pytorch().
import hub
from torchvision import datasets, transforms, models

ds = hub.dataset('hub://activeloop/cifar100-train') # Hub Dataset

Transform syntax #1 - For independent transforms per tensor

The transform parameter in ds.pytorch() is a dictionary where each key is a tensor name and each value is the transformation function for that tensor. If a tensor's data does not need to be returned, omit the tensor from the keys. If a tensor's data does not need to be modified during preprocessing, set its transformation function to None.
tform = transforms.Compose([
    transforms.ToPILImage(), # Must convert to PIL image for subsequent operations to run
    transforms.RandomRotation(20), # Image augmentation
    transforms.ToTensor(), # Must convert to PyTorch tensor for subsequent operations to run
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])

# PyTorch Dataloader
dataloader = ds.pytorch(batch_size = 16, num_workers = 2,
    transform = {'images': tform, 'labels': None}, shuffle = True)
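The dictionary semantics described above can be sketched in plain Python. This is only an illustration of the documented behavior, not Hub's actual implementation: the helper `apply_transform_dict` and the extra `boxes` tensor are hypothetical, and plain Python values stand in for image and label data.

```python
def apply_transform_dict(sample, transform):
    """Apply a per-tensor transform dictionary to one sample.

    Hypothetical helper, not part of Hub. Tensors missing from
    `transform` are dropped; a value of None passes data through.
    """
    out = {}
    for name, fn in transform.items():
        value = sample[name]
        out[name] = value if fn is None else fn(value)
    return out


# Plain Python values stand in for real tensor data.
sample = {'images': [1, 2, 3], 'labels': 7, 'boxes': [0, 0, 4, 4]}

# 'images' is modified, 'labels' is returned unchanged, 'boxes' is omitted.
transform = {'images': lambda px: [2 * p for p in px], 'labels': None}

result = apply_transform_dict(sample, transform)
print(result)
```

Only the tensors named in the dictionary appear in the result, which is why a tensor you still want returned unmodified needs an explicit None entry.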

Transform syntax #2 - For complex or dependent transforms per tensor

Transforms are sometimes more complex: the same transform might need to be applied to all tensors, or tensors might need to be combined in a transform. In this case, you can pass a function instead of a dictionary. The syntax below performs the exact same transform as above:
def transform(sample_in):
    return {'images': tform(sample_in['images']), 'labels': sample_in['labels']}

# PyTorch Dataloader
dataloader = ds.pytorch(batch_size = 16, num_workers = 2,
    transform = transform, shuffle = True)
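As a rough illustration of a dependent transform, the sketch below makes one random decision and applies it to two tensors together so they stay aligned. It assumes a hypothetical dataset with `images` and `masks` tensors (the CIFAR-100 dataset used here has no masks), and it uses nested lists in place of arrays so that it runs without any dependencies. The function name `paired_flip` and its `rng` parameter are likewise assumptions for the example.

```python
import random


def paired_flip(sample_in, rng=random):
    """Horizontally flip 'images' and 'masks' together, or neither.

    Hypothetical function-style transform: it receives the whole
    sample, so one random draw can be shared across tensors.
    """
    flip = rng.random() < 0.5

    def maybe_flip(rows):
        # Reverse each row when flipping; otherwise copy unchanged.
        return [row[::-1] if flip else list(row) for row in rows]

    return {'images': maybe_flip(sample_in['images']),
            'masks': maybe_flip(sample_in['masks'])}


# Nested lists stand in for 2D image and mask arrays.
sample = {'images': [[1, 2, 3], [4, 5, 6]],
          'masks': [[0, 0, 1], [0, 1, 1]]}

# Try several seeds to exercise both the flipped and unflipped cases.
outputs = [paired_flip(sample, random.Random(seed)) for seed in range(20)]
```

A per-tensor dictionary cannot express this, because each function in the dictionary sees only its own tensor and would make its own independent random draw.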
Some datasets, such as ImageNet, contain both grayscale and color images, which can cause errors when the transformed images are passed to the model. To convert only the grayscale images to color format, you can add this Torchvision transform to your pipeline (after transforms.ToTensor(), since it operates on a channel-first tensor):
transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1))

Using PyTorch Datasets + Data Loaders

Best option for full customizability.
Hub datasets can be integrated into the PyTorch Dataset class by passing the ds object to the PyTorch Dataset's constructor and pulling data in the __getitem__ method using self.ds.images[idx].numpy():
import numpy as np
from torch.utils.data import DataLoader, Dataset

class ClassificationDataset(Dataset):
    def __init__(self, ds, transform = None):
        self.ds = ds
        self.transform = transform

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        # Pull a single sample from the Hub dataset as numpy arrays
        image = self.ds.images[idx].numpy()
        label = self.ds.labels[idx].numpy().astype(np.int32)

        if self.transform is not None:
            image = self.transform(image)

        sample = {"images": image, "labels": label}

        return sample
The PyTorch dataset + data loader are instantiated using the built-in PyTorch functions:
cifar100_pytorch = ClassificationDataset(ds, transform = tform)

dataloader_pytorch = DataLoader(cifar100_pytorch, batch_size = 16, num_workers = 2, shuffle = True)

Iteration and Training

You can iterate through both data loaders above using the exact same syntax. Loading the first batch of data using the Hub data loader may take up to 30 seconds because the shuffle buffer is filled before any data is returned.
for data in dataloader:
    print(data)
    break

    # Training Loop
for data in dataloader_pytorch:
    print(data)
    break

    # Training Loop
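To see why a shuffle buffer delays the first batch and does not produce a fully random order, here is a minimal sketch of a generic buffered shuffle over a stream. It is not Hub's actual shuffling implementation; the function `buffered_shuffle` and its parameters are hypothetical and only illustrate the general technique.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=None):
    """Yield items from `stream` in a locally shuffled order.

    Generic sketch of buffer-based shuffling (an assumption, not
    Hub's code). Nothing is yielded until the buffer is full.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Still filling the buffer: no output yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random buffered item, then
        # put the newly streamed item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

With a buffer of 4, the first output can only be one of the first four items in the stream, so items stay roughly near their original position. A larger buffer gives better randomization at the cost of a longer wait before the first item is returned.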
For more information on training, check out the tutorial on Training an Image Classification Model in PyTorch.

TensorFlow

Hub Datasets can be converted to TensorFlow Datasets using ds.tensorflow(). Downstream, functions from the tf.data API such as map and shuffle can be applied to process the data before training.
ds # Hub Dataset object, to be used for training
ds_tf = ds.tensorflow() # A TensorFlow Dataset