Dataset Filtering

Using Hub, you can filter a dataset to get a DatasetView that contains only the items you are interested in. Filtering can be applied either to a Dataset or to a DatasetView (obtained by slicing or filtering a Dataset).

Filtering using a function

Using filter, you can pass in a function that is applied element by element to the dataset. Only those elements for which the function returns True stay in the newly created DatasetView.

Example:

import hub
import numpy as np
from hub.schema import Tensor, Text

my_schema = {
    "img": Tensor((100, 100)),
    "name": Text((None,), max_shape=(10,))
}
ds = hub.Dataset("./data/filtering_example", shape=(20,), schema=my_schema)
for i in range(10):  # assign values to the first 10 of the 20 samples
    ds["img", i] = np.ones((100, 100))
    ds["name", i] = "abc" + str(i) if i % 2 == 0 else "def" + str(i)

def my_filter(sample):
    # keep samples whose name starts with "abc" and whose image is all ones
    return sample["name"].compute().startswith("abc") and (sample["img"].compute() == np.ones((100, 100))).all()

ds2 = ds.filter(my_filter)

# alternatively, a lambda function achieves the same result
ds3 = ds.filter(
    lambda x: x["name"].compute().startswith("abc")
    and (x["img"].compute() == np.ones((100, 100))).all()
)
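
To make the element-by-element behaviour concrete, here is a minimal plain-Python sketch of a filtered view. It is not Hub's implementation: SimpleView and its attributes are hypothetical names, and samples are modelled as plain dicts rather than Hub samples, so no .compute() call is needed. The sketch assumes a view can be represented as a list of retained indexes, and shows that a filtered view can itself be filtered again.

```python
class SimpleView:
    """Hypothetical stand-in for a DatasetView: shared samples plus retained indexes."""

    def __init__(self, samples, indexes=None):
        self.samples = samples
        # By default the view covers every sample.
        self.indexes = list(range(len(samples))) if indexes is None else list(indexes)

    def filter(self, fn):
        # Apply fn to one sample at a time; keep the indexes where it returns True.
        kept = [i for i in self.indexes if fn(self.samples[i])]
        return SimpleView(self.samples, kept)


# Plain dicts standing in for dataset samples (an assumption of this sketch).
samples = [
    {"name": ("abc" if i % 2 == 0 else "def") + str(i), "value": i}
    for i in range(10)
]
view = SimpleView(samples)

abc_only = view.filter(lambda s: s["name"].startswith("abc"))
# Filtering an existing view narrows it further without touching the original.
abc_large = abc_only.filter(lambda s: s["value"] >= 4)
```

Each call returns a new view and leaves the one it was called on unchanged, which mirrors how filter is described here as producing a newly created DatasetView.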

API

hub.api.dataset.Dataset.filter(self, fn)
Applies a function to each element, one by one, as a filter to get a new DatasetView.
Parameters

fn (function) – Should take in a single sample of the dataset and return True or False. This function is applied to all the items of the dataset, and only the items for which it returns True are retained.

hub.api.datasetview.DatasetView.filter(self, fn)
Applies a function to each element, one by one, as a filter to get a new DatasetView.
Parameters

fn (function) – Should take in a single sample of the dataset and return True or False. This function is applied to all the items of the DatasetView, and only the items for which it returns True are retained.
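
Because fn receives one sample and must return a single True or False, several conditions are usually folded into one function. The sketch below shows one possible way to do that in plain Python. The all_of helper is hypothetical and not part of Hub, and the samples are plain dicts standing in for dataset samples.

```python
def all_of(*predicates):
    """Hypothetical helper: build one filter function from several conditions."""
    def combined(sample):
        # all() stops at the first failing condition and always yields a bool.
        return all(pred(sample) for pred in predicates)
    return combined


def starts_with_abc(sample):
    return sample["name"].startswith("abc")


def has_even_id(sample):
    return sample["id"] % 2 == 0


fn = all_of(starts_with_abc, has_even_id)

# Plain dicts stand in for dataset samples (an assumption of this sketch).
results = [
    fn({"name": "abc0", "id": 0}),
    fn({"name": "abc1", "id": 1}),
    fn({"name": "def2", "id": 2}),
]
```

With Hub samples, each condition would first read the stored value with .compute(), as my_filter does in the example earlier on this page.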