Shuffling and Splitting Datasets

Datasets often need to be split into subsets for training, testing, and validation. It is also advisable to shuffle a dataset so that results are not skewed by systematic bias from the order in which the data was ingested. Both operations are available in the hub.util module:

from hub import Dataset
from hub.util import shuffle, split

Shuffling

The shuffle() function returns a randomly shuffled version of the original dataset by shuffling the dataset indices. The data itself is not shuffled in long-term storage.

ds = Dataset('dataset_path')
ds_shuffled = shuffle(ds)

Because the indices are shuffled completely at random, samples at adjacent indices are typically loaded from different chunks, which significantly reduces data reading speed.
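To make the cost concrete, here is a minimal sketch in plain Python that does not use hub at all. It assumes a hypothetical layout in which samples are stored in fixed-size chunks of CHUNK_SIZE samples, and it counts how often two consecutive reads land in different chunks. The names chunk_of and chunk_switches are illustrative and are not part of hub.util.

```python
import random

CHUNK_SIZE = 100      # assumed number of samples per storage chunk (hypothetical)
NUM_SAMPLES = 1000    # assumed dataset length (hypothetical)


def chunk_of(index, chunk_size=CHUNK_SIZE):
    """Return the id of the chunk assumed to hold the sample at `index`."""
    return index // chunk_size


def chunk_switches(indices, chunk_size=CHUNK_SIZE):
    """Count how many consecutive reads land in a different chunk."""
    return sum(
        1
        for prev, cur in zip(indices, indices[1:])
        if chunk_of(prev, chunk_size) != chunk_of(cur, chunk_size)
    )


# Sequential access: indices 0..N-1 in stored order.
sequential = list(range(NUM_SAMPLES))

# Index-level shuffle: only the list of indices is permuted;
# the stored samples themselves are never moved.
rng = random.Random(0)
shuffled = sequential[:]
rng.shuffle(shuffled)

sequential_switches = chunk_switches(sequential)
shuffled_switches = chunk_switches(shuffled)
print(sequential_switches, shuffled_switches)
```

Under these assumptions, sequential access crosses a chunk boundary only 9 times, while a full index shuffle is expected to switch chunks on roughly nine out of ten consecutive reads.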

A method for shuffling a dataset by shuffling chunks is coming soon, and it will enable random shuffling without sacrificing data reading speed.
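Since that method is not yet available, the following is only an illustration of the general idea and not hub's planned implementation. It assumes a hypothetical layout of fixed-size chunks: the order of the chunks is shuffled, and the indices are then shuffled inside each chunk, so every chunk can still be read in one contiguous pass. chunk_shuffle is a hypothetical helper name.

```python
import random


def chunk_shuffle(num_samples, chunk_size, seed=None):
    """Return a permutation of range(num_samples) that keeps each chunk contiguous.

    Hypothetical helper: shuffles the chunk order, then the indices inside
    each chunk, so a reader only needs one chunk in memory at a time.
    """
    rng = random.Random(seed)
    # Group indices by the chunk they are assumed to live in.
    chunks = [
        list(range(start, min(start + chunk_size, num_samples)))
        for start in range(0, num_samples, chunk_size)
    ]
    rng.shuffle(chunks)          # randomize the order in which chunks are visited
    order = []
    for chunk in chunks:
        rng.shuffle(chunk)       # randomize sample order inside the chunk
        order.extend(chunk)
    return order


chunked_order = chunk_shuffle(num_samples=1000, chunk_size=100, seed=0)
```

The trade-off in this sketch is weaker randomness: samples stored in the same chunk always stay close together in the shuffled order.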

Splitting

The split() function returns a list of datasets, each containing a subset of the original dataset, according to user-specified proportions:

ds = Dataset('dataset_path')
[train, test, val] = split(ds, [0.8, 0.1, 0.1]) # 80:10:10 split
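As an illustration of how proportions can map onto a dataset, the sketch below converts a list of fractions into contiguous index ranges. It is plain Python and does not reflect how split() is implemented in hub.util; split_ranges is a hypothetical helper, and assigning any rounding remainder to the last subset is an assumption made for this example.

```python
def split_ranges(num_samples, proportions):
    """Turn proportions such as [0.8, 0.1, 0.1] into contiguous index ranges.

    Hypothetical helper: boundaries come from rounding the cumulative
    proportions, and the last range always ends at num_samples so that
    no sample is dropped.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    ranges = []
    start = 0
    cumulative = 0.0
    for i, p in enumerate(proportions):
        cumulative += p
        is_last = i == len(proportions) - 1
        end = num_samples if is_last else round(cumulative * num_samples)
        ranges.append(range(start, end))
        start = end
    return ranges


train_idx, test_idx, val_idx = split_ranges(1000, [0.8, 0.1, 0.1])
print(len(train_idx), len(test_idx), len(val_idx))  # 800 100 100
```

In this sketch the ranges follow the stored order, so ordered data would need to be shuffled first for each subset to be representative.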

To shuffle and split a dataset in a single step, run:

[train, test, val] = split(shuffle(ds), [0.8, 0.1, 0.1]) # 80:10:10 split

Note that the shuffling step slows down data reading, as explained in the Shuffling section.