Dataset

Create

To create and store a dataset, you need to define its shape and specify the dataset structure (schema).

For example, to create a dataset named "basic" with 4 samples containing images and labels of shape (512, 512) and dtype "float" in the account username:

import numpy as np
from hub import Dataset, schema
tag = "username/basic"

ds = Dataset(
    tag,
    shape=(4,),
    schema={
        "image": schema.Tensor((512, 512), dtype="float"),
        "label": schema.Tensor((512, 512), dtype="float"),
    },
)

Upload the Data

To add data to the dataset:

ds["image"][:] = np.ones((4, 512, 512))
ds["label"][:] = np.ones((4, 512, 512))
ds.flush()  # commit() is a deprecated alias to flush()

Load the Data

Load the dataset and access its elements:

ds = Dataset('username/basic')

# Use .numpy() to get the numpy array of the element
print(ds["image"][0].numpy())
print(ds["label", 100:110].numpy())

Convert to PyTorch

import torch

torch_ds = ds.to_pytorch()
train_loader = torch.utils.data.DataLoader(
    torch_ds,
    batch_size=8,
    num_workers=2,
)

# Iterate over the data
for batch in train_loader:
    print(batch["image"], batch["label"])

Convert to TensorFlow

tf_ds = ds.to_tensorflow().batch(8)

# Iterate over the data
for batch in tf_ds:
    print(batch["image"], batch["label"])

Visualize

After uploading, check that your dataset visualizes correctly at app.activeloop.ai.

Delete

You can delete your dataset in the dataset overview tab at app.activeloop.ai.
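
You can also delete a dataset programmatically with the delete() method documented in the API below:

from hub import Dataset

ds = Dataset('username/basic')
ds.delete()  # removes the dataset from storage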

Issues

If you run into trouble or have any questions, please open a GitHub issue.

API

class hub.Dataset(url: str, mode: str = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: str = None)
__getitem__(slice_)
Gets a slice or slices from the dataset
Usage:
>>> return ds["image", 5, 0:1920, 0:1080, 0:3].compute() # returns numpy array
>>> images = ds["image"]
>>> return images[5].compute() # returns numpy array
>>> images = ds["image"]
>>> image = images[5]
>>> return image[0:1920, 0:1080, 0:3].compute()
__init__(url: str, mode: str = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: str = None)
Open a new or existing dataset for read/write
Parameters
  • url (str) – The url where dataset is located/should be created

  • mode (str, optional, defaults to "a") – Mode in which the dataset is opened for reading or writing (ex. "r", "w", "a"); see the example after this parameter list

  • shape (tuple, optional) – Tuple with (num_samples,) format, where num_samples is number of samples

  • schema (optional) – Describes the data of a single sample using hub schemas. Required for 'a' and 'w' modes

  • token (str or dict, optional) – If url refers to a place where authorization is required, token is the parameter to pass the credentials; it can be a filepath or a dict

  • fs (optional) –

  • fs_map (optional) –

  • meta_information (dict, optional) – Additional information about the dataset

  • cache (int, optional) – Size of the memory cache. Default is 64MB (2**26). If 0, False, or None, the cache is not used

  • storage_cache (int, optional) – Size of the storage cache. Default is 256MB (2**28). If 0, False, or None, the storage cache is not used

  • lock_cache (bool, optional) – Lock the cache to avoid multiprocessing errors

  • lazy (bool, optional) – Setting this to False will stop lazy computation and will allow items to be accessed without .compute()

  • public (bool, optional) – Only applicable when using hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible to the public in the visualizer

  • name (str, optional) – Only applicable when using hub storage; this is the name that shows up in the visualizer
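
For example, a short sketch of opening the quickstart dataset read-only (no shape or schema is needed in "r" mode):

>>> ds = Dataset("username/basic", mode="r")
>>> len(ds)  # number of samples
4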

__iter__()

Returns an iterable over the samples

__len__()

Number of samples in the dataset

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds["image", 5, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
>>> images = ds["image"]
>>> image = images[5]
>>> image[0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_check_and_prepare_dir()

Checks that the input data is valid. Creates or overwrites the dataset folder. Returns True if the dataset needs to be created, as opposed to read.

_get_dictionary(subpath, slice_=None)

Gets a dictionary from the dataset given an incomplete subpath

append_shape(size: int)

Appends to the shape of the dataset. This is a heavy operation.

close()

Save changes from cache to dataset final storage. This invalidates this object.

commit()

Deprecated alias to flush()

delete()

Deletes the dataset

filter(dic)
Applies a filter to get a new DatasetView that matches the dictionary provided
Parameters

dic (dict) – A dictionary of key-value pairs used to filter the dataset. For nested schemas, use the flattened dictionary representation, i.e. instead of {"abc": {"xyz": 5}} use {"abc/xyz": 5}
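
A short sketch, assuming a hypothetical dataset whose schema has a scalar "label" field:

>>> view = ds.filter({"label": 1})  # keeps only samples whose label equals 1
>>> view = ds.filter({"abc/xyz": 5})  # nested schemas use the flattened key form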

flush()

Save changes from cache to dataset final storage. Does not invalidate this object.

static from_pytorch(dataset, scheduler: str = 'single', workers: int = 1)
Converts a pytorch dataset object into hub format
Parameters
  • dataset – The pytorch dataset object that needs to be converted into hub format

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use
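
A minimal sketch, by analogy with the from_tensorflow examples below; the toy dataset is hypothetical and assumed to yield dictionary samples:

>>> import numpy as np
>>> import torch
>>> class ToyDataset(torch.utils.data.Dataset):  # hypothetical
...     def __len__(self):
...         return 16
...     def __getitem__(self, idx):
...         return {"image": np.ones((512, 512), dtype="float32")}
>>> out_ds = hub.Dataset.from_pytorch(ToyDataset())
>>> res_ds = out_ds.store("username/from_torch")  # res_ds is now a usable hub dataset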

static from_tensorflow(ds, scheduler: str = 'single', workers: int = 1)

Converts a tensorflow dataset into hub format.

Parameters
  • ds – The tensorflow dataset object that needs to be converted into hub format

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Examples

>>> ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
>>> ds = tf.data.Dataset.from_tensor_slices({'a': [1, 2], 'b': [5, 6]})
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
>>> ds = hub.Dataset(schema=my_schema, shape=(1000,), url="username/dataset_name", mode="w")
>>> ds = ds.to_tensorflow()
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
static from_tfds(dataset, split=None, num: int = -1, sampling_amount: int = 1, scheduler: str = 'single', workers: int = 1)
Converts a TFDS Dataset into hub format.
Parameters
  • dataset (str) – The name of the tfds dataset that needs to be converted into hub format

  • split (str, optional) – A string representing the splits of the dataset that are required, such as "train" or "test+train". If not present, all splits of the dataset are used.

  • num (int, optional) – The number of samples required. If not present, all the samples are taken. If num is -1, or greater than the size of the dataset, the new dataset will contain all elements of this dataset.

  • sampling_amount (float, optional) – A value from 0 to 1 that specifies how much of the dataset is sampled to determine feature shapes. A value of 0 means no sampling, and 1 means the entire dataset is sampled.

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Examples

>>> out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
>>> res_ds = out_ds.store("username/mnist") # res_ds is now a usable hub dataset
property keys

Gets the keys of the dataset

rename(name: str) → None

Renames the dataset

resize_shape(size: int) → None

Resizes the shape of the dataset by resizing the first dimension of each tensor
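
For example, a sketch of growing the quickstart dataset from 4 to 8 samples and filling the new slots:

>>> ds.resize_shape(8)
>>> ds["image", 4:8] = np.zeros((4, 512, 512))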

to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>, indexes=None)
Converts the dataset into a pytorch compatible format.
Parameters
  • transform (function that transforms data in a dict format) –

  • inplace (bool, optional) – Defines if data should be converted to torch.Tensor before or after Transforms applied (depends on what data type you need for Transforms). Default is True.

  • output_type (one of list, tuple, dict, optional) – Defines the output type. Default is dict - same as in original Hub Dataset.

  • indexes (int or list, optional) – The index or list of indexes of the samples to convert; by default, all samples in the dataset are converted
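
For example, a sketch of passing a transform that operates on each sample dictionary (scale_image is hypothetical):

>>> def scale_image(sample):
...     sample["image"] = sample["image"] / 255.0
...     return sample
>>> torch_ds = ds.to_pytorch(transform=scale_image)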

to_tensorflow(indexes=None)
Converts the dataset into a tensorflow compatible format
Parameters
  • indexes (int or list, optional) – The index or list of indexes of the samples to convert; by default, all samples in the dataset are converted
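
A short sketch, assuming indexes accepts a list of sample indexes to convert:

>>> tf_ds = ds.to_tensorflow(indexes=[0, 1])  # converts only the selected samples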