Step 10: Dataset Filtering
Filtering datasets using user-defined functions or our Pythonic query language
Filtering and querying are an important part of data engineering because they enable users to focus on subsets of their datasets in order to obtain important insights, perform quality control, and train models on parts of their data.
Hub enables you to perform queries using user-defined functions (UDFs) or Hub's Pythonic query language, both of which can be parallelized using our simple multi-processing API.

Filtering with user-defined functions

The first step in querying with UDFs is to define a function that returns a boolean indicating whether an input sample in a dataset meets the user-defined condition. In this example, we define a function that returns True if the sample's text label is in the desired labels_list. If the filtering function takes inputs other than sample_in, it must be decorated with @hub.compute.
import hub

@hub.compute
def filter_labels(sample_in, labels_list, class_names):
    # Convert the sample's numeric label to its text label
    text_label = class_names[sample_in.labels.numpy()[0]]

    return text_label in labels_list
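To check the label lookup without loading a dataset, the same logic can be run on a stand-in sample. The sketch below is plain Python and does not use Hub: FakeTensor and FakeSample are hypothetical helpers that only mimic the sample_in.labels.numpy() access pattern, label_in_list is an undecorated copy of the function body, and the class_names list is an illustrative subset rather than the dataset's real mapping.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor: numpy() returns the stored values."""
    values: List[int]

    def numpy(self) -> List[int]:
        return self.values


@dataclass
class FakeSample:
    """Hypothetical stand-in for one dataset sample with a labels tensor."""
    labels: FakeTensor


def label_in_list(sample_in, labels_list: Sequence[str], class_names: Sequence[str]) -> bool:
    # Same body as filter_labels, without the @hub.compute decorator
    text_label = class_names[sample_in.labels.numpy()[0]]
    return text_label in labels_list


class_names = ['airplane', 'automobile', 'bird', 'ship']  # illustrative subset (assumption)
labels_list = ['automobile', 'ship']

car = FakeSample(labels=FakeTensor([1]))
bird = FakeSample(labels=FakeTensor([2]))

print(label_in_list(car, labels_list, class_names))   # True
print(label_in_list(bird, labels_list, class_names))  # False
```

Testing the predicate this way keeps the condition itself separate from dataset loading and scheduling, which can make a mistake in the label lookup easier to spot.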
The filtering function is executed using the ds.filter() command below, which returns a virtual view of the dataset (a dataset view) containing only the indices of the samples that met the filtering condition. Just like in the Parallel Computing API, the sample_in parameter does not need to be passed into the filter function when evaluating it, and multi-processing can be specified using the scheduler and num_workers parameters.
import hub
from PIL import Image

ds = hub.load('hub://activeloop/cifar10-test')

labels_list = ['automobile', 'ship']  # Desired labels for filtering
class_names = ds.labels.info.class_names  # Mapping from numeric to text labels
ds_view = ds.filter(filter_labels(labels_list, class_names), scheduler='threaded', num_workers=0)
The data in the returned ds_view can be accessed just like a regular dataset.
Image.fromarray(ds_view.images[0].numpy())
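One way to picture a dataset view is as the list of positions whose samples passed the predicate, with lookups routed back to the original data. The following is a rough plain-Python sketch of that idea, not Hub's implementation; filter_indices and the toy label values are hypothetical.

```python
from typing import Callable, List, Sequence


def filter_indices(samples: Sequence[int], predicate: Callable[[int], bool]) -> List[int]:
    """Return the positions of the samples for which predicate is True."""
    return [i for i, sample in enumerate(samples) if predicate(sample)]


# Toy data: each sample is just its numeric label (hypothetical values)
class_names = ['airplane', 'automobile', 'bird', 'ship']
numeric_labels = [1, 2, 3, 0, 1]
labels_list = ['automobile', 'ship']

view_indices = filter_indices(numeric_labels, lambda label: class_names[label] in labels_list)
print(view_indices)  # [0, 2, 4]

# Indexing the "view" maps back to the original samples
first_match = numeric_labels[view_indices[0]]
print(class_names[first_match])  # automobile
```

Because only indices are stored in this sketch, no sample data is copied, which is the sense in which such a view is "virtual".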
In most cases, multi-processing is not necessary for queries that involve simple data such as labels or bounding boxes. However, multi-processing significantly accelerates queries that must load rich data types such as images and videos.
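As a loose illustration of why parallel evaluation helps when each sample is costly to load, the sketch below evaluates a predicate over several samples with a thread pool from the Python standard library. It is an assumption-laden stand-in, not how Hub's scheduler works: filter_indices_threaded and is_bright are hypothetical names, and the "images" are small nested lists rather than real image data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def filter_indices_threaded(
    samples: Sequence, predicate: Callable[[object], bool], num_workers: int = 4
) -> List[int]:
    """Evaluate predicate on every sample in worker threads and keep matching positions."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        keep = list(pool.map(predicate, samples))  # map preserves input order
    return [i for i, flag in enumerate(keep) if flag]


def is_bright(image) -> bool:
    # Hypothetical predicate that needs the full pixel data of a sample
    return min(min(row) for row in image) > 5


# Toy 2x2 "images" (hypothetical pixel values)
images = [
    [[10, 20], [30, 40]],
    [[0, 255], [12, 9]],
    [[6, 7], [8, 9]],
]

print(filter_indices_threaded(images, is_bright, num_workers=2))  # [0, 2]
```

In this sketch the benefit would come from overlapping slow reads of large samples; for small in-memory values like labels, the thread overhead is likely to outweigh any gain.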

Filtering using our Pythonic query language

Queries can also be executed using Hub's Pythonic query language. This UX is primarily intended for use in Activeloop Platform, but it can also be applied programmatically in Python.
ds_view = ds.filter("labels == 'automobile' or labels == 'airplane'", scheduler='threaded', num_workers=0)
Tensors can be referred to by name. The language supports common comparison and membership operators (in, ==, !=, >, <, >=, <=), and NumPy-like operators and indexing can also be applied, as in 'images.min > 5' and 'images.shape[2]==1'.
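To make the meaning of a query string concrete, the example query can be read as an ordinary Python predicate on a sample's text label. The sketch below is plain Python with hypothetical function names; it shows the condition only and does not reproduce how Hub parses or evaluates query strings.

```python
def matches_or_query(text_label: str) -> bool:
    # Plain-Python reading of: labels == 'automobile' or labels == 'airplane'
    return text_label == 'automobile' or text_label == 'airplane'


def matches_in_query(text_label: str) -> bool:
    # Same condition written with membership (in) instead of two comparisons
    return text_label in ('automobile', 'airplane')


sample_labels = ['ship', 'automobile', 'airplane', 'frog']  # hypothetical labels

print([matches_or_query(label) for label in sample_labels])  # [False, True, True, False]
print([matches_in_query(label) for label in sample_labels])  # [False, True, True, False]
```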
Congrats! You just learned to filter data with Hub! 🎈