It is often necessary to split datasets into subsets for training, testing, and validation. It is also advisable to shuffle datasets so they are not affected by systematic bias due to the order in which the data was ingested. This functionality is available in hub.util:
from hub import Dataset
from hub.util import shuffle, split
The shuffle() method returns a randomly shuffled version of the original dataset by shuffling the dataset indices. The data is not shuffled in long-term storage.
ds = Dataset('dataset_path')
ds_shuffled = shuffle(ds)
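The idea of shuffling indices rather than stored data can be sketched in plain Python. This is a minimal illustration only, not Hub's implementation: the `ShuffledView` class and its `seed` parameter are hypothetical names introduced here, and an ordinary list stands in for long-term storage.

```python
import random
from typing import Optional, Sequence


class ShuffledView:
    """Read-only view that exposes a sequence in random order.

    Hypothetical helper for illustration; it is not part of hub.
    """

    def __init__(self, data: Sequence, seed: Optional[int] = None):
        self._data = data
        # Only this list of indices is permuted; the underlying data is untouched.
        self._indices = list(range(len(data)))
        random.Random(seed).shuffle(self._indices)

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int):
        # Translate the view position into a position in the original storage.
        return self._data[self._indices[i]]


stored = ["a", "b", "c", "d", "e"]  # stands in for data in long-term storage
view = ShuffledView(stored, seed=0)
shuffled_items = [view[i] for i in range(len(view))]
```

In this sketch every read goes through an index lookup, so consecutive reads land on non-adjacent positions in storage. That is one plausible reason why reading a dataset shuffled this way would be slower than reading it in order.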
A method for shuffling a dataset by shuffling chunks is coming soon, and it will enable random shuffling without sacrificing data reading speed.
The split() method returns a list of datasets, each containing a subset of the original dataset, sized according to user-specified proportions:
ds = Dataset('dataset_path')
[train, test, val] = split(ds, [0.8, 0.1, 0.1])  # 80:10:10 split
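To make the proportion arithmetic concrete, here is a rough sketch of how a proportional split could be computed over a plain list. The function `split_by_proportions` is a hypothetical stand-in, and its rounding behaviour is an assumption; Hub's split() may handle rounding or validation differently.

```python
import math
from typing import List, Sequence


def split_by_proportions(data: Sequence, proportions: Sequence[float]) -> List[Sequence]:
    """Cut `data` into contiguous subsets sized by `proportions` (hypothetical helper)."""
    if not math.isclose(sum(proportions), 1.0):
        raise ValueError("proportions must sum to 1")
    n = len(data)
    subsets = []
    start = 0
    cumulative = 0.0
    for k, p in enumerate(proportions):
        cumulative += p
        # The last boundary is pinned to n so rounding never drops samples.
        end = n if k == len(proportions) - 1 else round(cumulative * n)
        subsets.append(data[start:end])
        start = end
    return subsets


train, test, val = split_by_proportions(list(range(100)), [0.8, 0.1, 0.1])
```

The subsets here are contiguous slices, so the result depends entirely on the order of the input.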
To shuffle and split a dataset in a single step, run:
[train, test, val] = split(shuffle(ds), [0.8, 0.1, 0.1]) # 80:10:10 split
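The following sketch suggests why shuffling before splitting matters when data was ingested in a systematic order. It uses plain Python lists and made-up labels (80 samples of class 0 followed by 20 of class 1); the helper `split_80_10_10` is hypothetical and does not reflect Hub internals.

```python
import random

# 100 samples ingested in class order: 80 of class 0 followed by 20 of class 1.
labels = [0] * 80 + [1] * 20


def split_80_10_10(indices):
    """Cut a list of 100 indices into contiguous 80:10:10 parts (hypothetical helper)."""
    return indices[:80], indices[80:90], indices[90:]


# Splitting in ingestion order: the training subset never sees class 1.
ordered = list(range(len(labels)))
train_o, test_o, val_o = split_80_10_10(ordered)
train_classes_ordered = {labels[i] for i in train_o}
test_classes_ordered = {labels[i] for i in test_o}

# Shuffling the indices first mixes the classes before the cut is made.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)
train_s, test_s, val_s = split_80_10_10(shuffled)
```

Without the shuffle, the training subset contains only class 0 and the test subset only class 1. With the shuffle, the three subsets still partition the data, but each is drawn from a random ordering.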
Note that reading from a dataset returned by shuffle() is slow, as explained in the Shuffle section.