Data Processing Using Parallel Computing
How to use hub.compute to transform and process datasets in parallel.

This tutorial is also available as a Colab Notebook

Step 7 in the Getting Started Guide highlights how hub.compute can be used to rapidly upload datasets. This tutorial expands further and highlights the power of parallel computing for dataset processing.

Dataset Transformations

Computer vision applications often require users to process and transform their data as part of their workflows. For example, you may perform perspective transforms, resize images, adjust their coloring, or apply many other operations. In this example, a flipped version of the MNIST dataset is created, which may be useful for training a model that identifies text from reflections in a mirror.
The first step to creating a flipped version of the MNIST dataset is to define a function that will flip the dataset images.

import hub
from PIL import Image
import numpy as np

@hub.compute
def flip_horizontal(sample_in, sample_out):
    ## First two arguments are always default arguments containing:
    #     1st argument is an element of the input iterable (list, dataset, array,...)
    #     2nd argument is a dataset sample

    # Append the label and image to the output sample
    sample_out.labels.append(sample_in.labels.numpy())
    sample_out.images.append(np.flip(sample_in.images.numpy(), axis=1))

    return sample_out

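To see what the flip does to a single image, here is a minimal NumPy-only sketch on a made-up 2x3 array. The array `image` is illustrative and is not MNIST data. For a 2D image stored as (rows, columns), axis 1 is the column axis, so reversing it mirrors the image left to right.

```python
import numpy as np

# Hypothetical 2x3 "image"; rows are axis 0, columns are axis 1.
image = np.array([[1, 2, 3],
                  [4, 5, 6]], dtype=np.uint8)

# Reversing the column axis mirrors each row left to right.
flipped = np.flip(image, axis=1)

print(flipped.tolist())  # [[3, 2, 1], [6, 5, 4]]
```

Flipping the result a second time along the same axis returns the original array, which is a handy sanity check for this kind of transform.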
Next, the existing MNIST dataset is loaded, and hub.like is used to create an empty dataset with the same tensor structure.

ds_mnist = hub.load('hub://activeloop/mnist-train')

# overwrite=True makes this code re-runnable
ds_mnist_flipped = hub.like('./mnist_flipped', ds_mnist, overwrite=True)

Finally, the flipping operation is evaluated for the first 100 elements of the input dataset ds_mnist, and the result is automatically stored in the output dataset ds_mnist_flipped.

flip_horizontal().eval(ds_mnist[0:100], ds_mnist_flipped, num_workers=2)

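The num_workers argument sets how many workers share the input samples. The sketch below illustrates that general idea using only Python's standard library. It is an analogy under its own assumptions and says nothing about how hub.compute is implemented internally; the names `flip_row` and `samples` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def flip_row(row):
    """Reverse one row, standing in for a per-sample transform."""
    return row[::-1]

# Hypothetical input: four tiny one-row "samples".
samples = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# Two workers share the samples; map() yields results in input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(flip_row, samples))

print(results)
```

Because each sample is transformed independently of the others, the work can be split across workers without changing the result.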
Let's check out the flipped images:

# Original image
Image.fromarray(ds_mnist.images[0].numpy())

# Flipped image
Image.fromarray(ds_mnist_flipped.images[0].numpy())

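Viewing the images is a quick visual check, and a numerical comparison can back it up. The helper below is a hypothetical sketch (the name `is_horizontal_flip` is not part of hub) that tests whether one array is the left-right mirror of another. It runs here on stand-in arrays; with the datasets above, the same call could presumably be made on ds_mnist.images[0].numpy() and ds_mnist_flipped.images[0].numpy().

```python
import numpy as np

def is_horizontal_flip(original, candidate):
    """Return True if candidate is original mirrored along axis 1."""
    return (original.shape == candidate.shape
            and np.array_equal(np.flip(original, axis=1), candidate))

# Stand-in arrays; in the tutorial these would come from the two datasets.
original = np.arange(6, dtype=np.uint8).reshape(2, 3)
mirrored = original[:, ::-1]

print(is_horizontal_flip(original, mirrored))  # True
print(is_horizontal_flip(original, original))  # False
```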

Dataset Processing Pipelines

In order to modularize your dataset processing, it is often helpful to create functions for specific data processing tasks, and combine them in pipelines in order to transform your data end-to-end. In this example, you can create a pipeline using the flip_horizontal function above and the resize function below.

@hub.compute
def resize(sample_in, sample_out, new_size):
    ## First two arguments are always default arguments containing:
    #     1st argument is an element of the input iterable (list, dataset, array,...)
    #     2nd argument is a dataset sample
    ## Third argument is the required size for the output images

    # Append the label and image to the output sample
    sample_out.labels.append(sample_in.labels.numpy())
    sample_out.images.append(np.array(Image.fromarray(sample_in.images.numpy()).resize(new_size)))

    return sample_out

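One detail worth keeping in mind when choosing new_size: PIL's resize expects (width, height), while a NumPy array reports its shape as (rows, columns). The minimal sketch below shows the difference on a hypothetical blank 28x28 image rather than a real MNIST sample; the non-square target size is chosen only to make the ordering visible.

```python
import numpy as np
from PIL import Image

# Hypothetical blank 28x28 grayscale image, the same shape as an MNIST digit.
array_in = np.zeros((28, 28), dtype=np.uint8)

# PIL's resize takes (width, height)...
new_size = (64, 32)
array_out = np.array(Image.fromarray(array_in).resize(new_size))

# ...while NumPy reports (rows, columns), i.e. (height, width).
print(array_out.shape)  # (32, 64)
```

For a square target such as (64, 64) the ordering makes no difference, but it matters for any non-square size.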
Functions decorated using hub.compute can be easily combined into pipelines using hub.compose. Required arguments for the functions must be passed into the pipeline in this step:

pipeline = hub.compose([flip_horizontal(), resize(new_size=(64, 64))])

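Conceptually, a pipeline feeds the output of each step into the next, with any extra arguments bound when the step is created. The plain-Python sketch below illustrates that pattern only. It is not how hub.compose is implemented, and the names `compose`, `flip`, `repeat`, and `pipeline_sketch` are hypothetical.

```python
from functools import reduce

def compose(steps):
    """Chain single-argument functions so each output feeds the next step."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

def flip(row):
    """Reverse a row, standing in for a flip transform."""
    return row[::-1]

def repeat(times):
    """Return a step that repeats each element, with `times` bound up front."""
    return lambda row: [x for x in row for _ in range(times)]

# The extra argument is supplied when the pipeline is built, not when it runs.
pipeline_sketch = compose([flip, repeat(times=2)])

print(pipeline_sketch([1, 2, 3]))  # [3, 3, 2, 2, 1, 1]
```

The order of the list matters: steps run from first to last, so here the row is flipped before its elements are repeated.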
Just like for the single-function example above, the output dataset is created first, and the pipeline is then evaluated for the first 100 elements of the input dataset ds_mnist. The result is automatically stored in the output dataset ds_mnist_pipe.

# overwrite=True makes this code re-runnable
ds_mnist_pipe = hub.like('./mnist_pipeline', ds_mnist, overwrite=True)

pipeline.eval(ds_mnist[0:100], ds_mnist_pipe, num_workers=2)
