Dataset

Auto Create

If your dataset format is supported, you can point hub.Dataset to its path & let the hub.auto package infer its schema & automatically convert it into hub format.

Supported Dataset Formats

The hub.auto package supports the following dataset formats:

Computer Vision

Dataset: dandelionimages (Kaggle)
Example notebook: Open In Colab

Supports [.png, .jpg, .jpeg] file extensions.

  • Image Classification:

    • Expects the folder path to point to a directory where the folder structure is the following:

      • root

        • class1

          • sample1.jpg

          • sample2.jpg

        • class2

          • sample1.png

          • sample2.png

Tabular

Dataset: IMDb Movie Reviews (Kaggle)
Example notebook: Open In Colab

Supports the .csv file format.

Expects the folder path to point to a directory where the folder structure is the following:

  • root

    • file1.csv

    • file2.csv

Auto Usage

If your dataset is supported (see above), you can convert it into hub format with a single line of code:

from hub import Dataset

ds = Dataset.from_path("path/to/dataset")

Auto Contribution

If you created & uploaded a dataset into hub, consider contributing your ingestion code to the hub.auto package. The API for doing so is quite simple:

  • If you are writing the ingestion code for a computer vision dataset, then you can create a new file and/or function within hub.auto.computer_vision. If your code cannot be organized under preexisting packages/files, you can create new ones & populate the appropriate __init__.py files with import code.

  • This function should be decorated with hub.auto.infer.state.directory_parser. Example:

import hub
from hub.auto.infer import state

# priority is the sort index of this parser;
# it's useful for executing more general code first
@state.directory_parser(priority=0)
def image_classification(path, scheduler, workers):
    # build an iterable over samples & a matching schema from the files under `path`
    data_iter = ...
    schema = ...

    @hub.transform(schema=schema, scheduler=scheduler, workers=workers)
    def upload_data(sample):
        ...

    # must return a hub dataset, in other words this function should handle
    # reading, transforming, & uploading the dataset into hub format.
    ds = upload_data(data_iter)
    return ds

Best Practice

  • Only follow the instructions below for Create/Upload/Load if your dataset is NOT supported by hub.auto.

  • This will make your life significantly easier.

  • If your dataset is not supported, consider contributing (instructions above)!

Create

BEST PRACTICE: Before you try creating a dataset this way, try following the Auto Dataset Creation instructions first.

To create and store a dataset, you need to define its shape and specify the dataset structure (schema).

For example, to create a dataset named "basic" with 4 samples, containing images and labels of shape (512, 512) and dtype 'float', in the account username:

from hub import Dataset, schema
tag = "username/basic"

ds = Dataset(
    tag,
    shape=(4,),
    schema={
        "image": schema.Tensor((512, 512), dtype="float"),
        "label": schema.Tensor((512, 512), dtype="float"),
    },
)

Upload the Data

BEST PRACTICE: Before you try uploading a dataset this way, try following the Auto Dataset Creation instructions first.

To add data to the dataset:

ds["image"][:] = np.ones((4, 512, 512))
ds["label"][:] = np.ones((4, 512, 512))
ds.flush()

Load the data

Load the dataset and access its elements:

ds = Dataset('username/basic')

# Use .numpy() to get the numpy array of the element
print(ds["image"][0].numpy())
print(ds["label", 100:110].numpy())

Convert to Pytorch

import torch

ds = ds.to_pytorch()
ds = torch.utils.data.DataLoader(
    ds,
    batch_size=8,
    num_workers=2,
)

# Iterate over the data
for batch in ds:
    print(batch["image"], batch["label"])

Convert to Tensorflow

ds = ds.to_tensorflow().batch(8)

# Iterate over the data
for batch in ds:
    print(batch["image"], batch["label"])

Visualize

After uploading, make sure your dataset visualizes correctly at app.activeloop.ai.

Delete

You can delete your dataset from the dataset overview tab at app.activeloop.ai.
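
If you prefer to do this from code, here is a minimal sketch using the Dataset.delete() method from the API reference below (assuming you still hold a handle to the dataset):

ds = Dataset("username/basic")
ds.delete()  # removes the dataset from storage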

Issues

If you run into any trouble or have any questions, please open a GitHub issue.

API

class hub.Dataset(url: str, mode: Optional[str] = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: Optional[str] = None)
__getitem__(slice_)
Gets a slice or slices from dataset
Usage:
>>> return ds["image", 5, 0:1920, 0:1080, 0:3].compute() # returns numpy array
>>> images = ds["image"]
>>> return images[5].compute() # returns numpy array
>>> images = ds["image"]
>>> image = images[5]
>>> return image[0:1920, 0:1080, 0:3].compute()
__init__(url: str, mode: Optional[str] = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: Optional[str] = None)
Open a new or existing dataset for read/write
Parameters
  • url (str) – The url where dataset is located/should be created

  • mode (str, optional, defaults to "a") – Whether the dataset is opened for reading or writing (e.g. "r", "w", "a")

  • shape (tuple, optional) – Tuple with (num_samples,) format, where num_samples is number of samples

  • schema (optional) – Describes the data of a single sample using Hub schemas. Required for 'a' and 'w' modes

  • token (str or dict, optional) – If url refers to a place where authorization is required, token is the parameter to pass the credentials; it can be a filepath or a dict

  • fs (optional) –

  • fs_map (optional) –

  • meta_information (dict, optional) – Additional information about the dataset, given as a dictionary

  • cache (int, optional) – Size of the memory cache. Default is 64MB (2**26). If 0, False or None, the cache is not used

  • storage_cache (int, optional) – Size of the storage cache. Default is 256MB (2**28). If 0, False or None, the storage cache is not used

  • lock_cache (bool, optional) – Lock the cache to avoid multiprocessing errors

  • lazy (bool, optional) – Setting this to False will stop lazy computation and will allow items to be accessed without .compute()

  • public (bool, optional) – Only applicable when using hub storage, ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible to the public in the visualizer

  • name (str, optional) – Only applicable when using hub storage; this is the name that shows up in the visualizer
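
As an illustrative sketch (the tag "username/basic" is a placeholder and the values simply restate the defaults above), opening an existing dataset read-only with eager access might look like:

>>> ds = hub.Dataset("username/basic", mode="r", lazy=False, cache=2**26)
>>> ds["image"][0]  # with lazy=False, items are accessible without .compute()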

__iter__()

Returns Iterable over samples

__len__()

Number of samples in the dataset

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds["image", 5, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
>>> images = ds["image"]
>>> image = images[5]
>>> image[0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_auto_checkout()
Automatically checks out to a new branch if the current commit is not at the head of a branch
_check_and_prepare_dir()

Checks if the input data is valid. Creates or overwrites the dataset folder. Returns True if the dataset needs to be created, as opposed to being read.

_get_dictionary(subpath, slice_=None)

Gets dictionary from dataset given incomplete subpath

append_shape(size: int)

Append the shape: Heavy Operation
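
A minimal sketch; treating size as the number of extra samples appended along the first dimension is an assumption, not confirmed by this reference:

>>> ds.append_shape(4)  # assumption: grows the dataset by 4 additional samples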

property branches

Gets a list of all the branches of the dataset

checkout(address: str, create: bool = False) → str
Changes the state of the dataset to the address mentioned. Creates a new branch if address isn’t a commit id or branch name and create is True.

Always checks out to the head of a branch if the address specified is a branch name.

Returns the commit id of the commit that has been switched to.

Only works if dataset was created on or after Hub v1.3.0

Parameters
  • address (str) – The branch name or commit id to checkout to

  • create (bool, optional) – Specifying create as True creates a new branch from the current commit if the address isn’t an existing branch name or commit id

close()

Save changes from cache to dataset final storage. Doesn’t create a new commit. This invalidates this object.

commit(message: str = '') → str
Saves the current state of the dataset and returns the commit id.

Checks out automatically to an auto branch if the current commit is not the head of the branch

Only saves the dataset without any version control information if the dataset was created before Hub v1.3.0

Parameters

message (str, optional) – The commit message to store along with the commit
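
A sketch of a simple version-control flow using commit, checkout, log and branches as documented in this reference (the branch name "dev" and the messages are placeholders):

>>> first = ds.commit("initial version")
>>> ds.checkout("dev", create=True)  # create and switch to a new branch
>>> ds.commit("changes on dev")
>>> ds.checkout(first)  # move back to the first commit
>>> ds.log()  # print the commits before the current commit
>>> ds.branches  # list of all branches of the dataset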

compute(label_name=False)

Gets the values from different tensorview objects in the dataset schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.
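
For example (a sketch, assuming the schema contains a ClassLabel tensor):

>>> values = ds.compute(label_name=True)  # ClassLabel values come back as label names instead of encoded integers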

copy(dst_url: str, token=None, fs=None, public=True)
Creates a copy of the dataset at the specified url and returns the dataset object
Parameters
  • dst_url (str) – The destination url where dataset should be copied

  • token (str or dict, optional) – If dst_url refers to a place where authorization is required, token is the parameter to pass the credentials; it can be a filepath or a dict

  • fs (optional) –

  • public (bool, optional) – Only applicable when using hub storage, ignored otherwise. Setting this to False allows only the user who created it to access the new copied dataset, and the dataset won't be visible to the public in the visualizer

delete()

Deletes the dataset

filter(fn)
Applies a function on each element one by one as a filter to get a new DatasetView
Parameters

fn (function) – Should take in a single sample of the dataset and return True or False. This function is applied to all the items of the DatasetView, and those items that return True are retained
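
A minimal sketch (the key "label" and the predicate are placeholders; the exact way a value is read inside the function may vary with your schema):

>>> filtered = ds.filter(lambda sample: sample["label"].compute() == 0)  # keep only samples whose label is 0
>>> len(filtered)  # number of retained samples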

flush()

Save changes from cache to dataset final storage. Doesn’t create a new commit. Does not invalidate this object.

static from_pytorch(dataset, scheduler: str = 'single', workers: int = 1)
Converts a pytorch dataset object into hub format
Parameters
  • dataset – The pytorch dataset object that needs to be converted into hub format

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use
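
By analogy with the from_tensorflow examples below, a sketch of converting a PyTorch dataset might look like this (torch_ds stands in for your own torch.utils.data.Dataset, and the destination tag is a placeholder):

>>> out_ds = hub.Dataset.from_pytorch(torch_ds)
>>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset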

static from_tensorflow(ds, scheduler: str = 'single', workers: int = 1)

Converts a tensorflow dataset into hub format.

Parameters
  • ds – The tensorflow dataset object that needs to be converted into hub format

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Examples

>>> ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
>>> ds = tf.data.Dataset.from_tensor_slices({'a': [1, 2], 'b': [5, 6]})
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
>>> ds = hub.Dataset(schema=my_schema, shape=(1000,), url="username/dataset_name", mode="w")
>>> ds = ds.to_tensorflow()
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
static from_tfds(dataset, split=None, num: int = -1, sampling_amount: int = 1, scheduler: str = 'single', workers: int = 1)
Converts a TFDS Dataset into hub format.
Parameters
  • dataset (str) – The name of the tfds dataset that needs to be converted into hub format

  • split (str, optional) – A string representing the splits of the dataset that are required such as “train” or “test+train” If not present, all the splits of the dataset are used.

  • num (int, optional) – The number of samples required. If not present, all the samples are taken. If num is -1, or if num is greater than the size of the dataset, the new dataset will contain all elements of this dataset.

  • sampling_amount (float, optional) – A value from 0 to 1 that specifies how much of the dataset should be sampled to determine feature shapes. A value of 0 means no sampling, and 1 implies that the entire dataset is sampled

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Examples

>>> out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
>>> res_ds = out_ds.store("username/mnist") # res_ds is now a usable hub dataset
property keys

Gets the keys of the dataset

log()
Prints the commits in the commit tree before the current commit

Only works if dataset was created on or after Hub v1.3.0

numpy(label_name=False)

Gets the values from different tensorview objects in the dataset schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.

rename(name: str) → None

Renames the dataset

resize_shape(size: int) → None

Resize the shape of the dataset by resizing the first dimension of each tensor

save()

Save changes from cache to dataset final storage. Doesn’t create a new commit. Does not invalidate this object.

store(url: str, token: Optional[dict] = None, sample_per_shard: Optional[int] = None, public: bool = True, scheduler='single', workers=1)
Used to save the dataset as a new dataset; very similar to copy, but uses transforms instead
Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter to pass the credentials, it can be filepath or dict

  • length (int) – In case shape is None, the user can provide a length

  • sample_per_shard (int) – How many samples to place in each shard when splitting the iterator, so as not to overfill RAM

  • public (bool, optional) – Only applicable when using hub storage, ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible to the public in the visualizer

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Returns

ds – uploaded dataset

Return type

hub.Dataset
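
A brief sketch (the destination tag is a placeholder):

>>> new_ds = ds.store("username/new_dataset", scheduler="threaded", workers=2)  # new_ds is the uploaded hub.Dataset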

to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>, indexes=None, key_list=None)
Converts the dataset into a pytorch compatible format.

Note: Pytorch does not support the uint16, uint32 and uint64 dtypes. These are implicitly cast to int32, int64 and int64 respectively. Avoid having these dtypes in your schema if you want to avoid this implicit conversion. This method does not work with the Sequence schema.

Parameters
  • transform (function, optional) – A function that transforms data in a dict format

  • inplace (bool, optional) – Defines whether data should be converted to torch.Tensor before or after transforms are applied (depends on what data type you need for your transforms). Default is True.

  • output_type (one of list, tuple, dict, optional) – Defines the output type. Default is dict - same as in original Hub Dataset.

  • indexes (list or int, optional) – The samples to be converted into Pytorch format. Takes all samples in dataset by default.

  • key_list (list, optional) – The list of keys that are needed in Pytorch format. For nested schemas such as {“a”:{“b”:{“c”: Tensor()}}} use [“a/b/c”] as key_list
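
As a sketch building on the Convert to Pytorch example earlier (the key name is a placeholder):

>>> torch_ds = ds.to_pytorch(output_type=dict, key_list=["image"])  # convert only the "image" key, returning dict samples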

to_tensorflow(indexes=None, include_shapes=False, key_list=None)
Converts the dataset into a tensorflow compatible format
Parameters
  • indexes (list or int, optional) – The samples to be converted into tensorflow format. Takes all samples in dataset by default.

  • include_shapes (boolean, optional) – False by default. Setting it to True passes the shapes to tf.data.Dataset.from_generator. Setting to True could lead to issues with dictionaries inside Tensors.

  • key_list (list, optional) – The list of keys that are needed in tensorflow format. For nested schemas such as {“a”:{“b”:{“c”: Tensor()}}} use [“a/b/c”] as key_list
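
A sketch building on the Convert to Tensorflow example earlier (the indexes are placeholders):

>>> tf_ds = ds.to_tensorflow(indexes=[0, 1], include_shapes=True)  # only the first two samples; shapes are passed to tf.data.Dataset.from_generator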