Dataset

Create

To create and store a dataset, you need to define its shape and specify its structure (schema).

For example, to create a dataset named "basic" with 4 samples containing images and labels of shape (512, 512) and dtype "float" in the account username:

import numpy as np
from hub import Dataset, schema

tag = "username/basic"

ds = Dataset(
    tag,
    shape=(4,),
    schema={
        "image": schema.Tensor((512, 512), dtype="float"),
        "label": schema.Tensor((512, 512), dtype="float"),
    },
)

Upload the Data

To add data to the dataset:

ds["image"][:] = np.ones((4, 512, 512))
ds["label"][:] = np.ones((4, 512, 512))
ds.commit()

Load the Data

Load the dataset and access its elements:

ds = Dataset("username/basic")

# Use .numpy() to get the numpy array of the element
print(ds["image"][0].numpy())
print(ds["label", 100:110].numpy())

Convert to PyTorch

import torch

train_ds = ds.to_pytorch()
train_loader = torch.utils.data.DataLoader(
    train_ds,
    batch_size=8,
    num_workers=2,
)

# Iterate over the data
for batch in train_loader:
    print(batch["image"], batch["label"])

Convert to TensorFlow

tf_ds = ds.to_tensorflow().batch(8)

# Iterate over the data
for batch in tf_ds:
    print(batch["image"], batch["label"])

Visualize

Check that your dataset visualizes correctly at app.activeloop.ai

Issues

If you run into any trouble or have questions, please open a GitHub issue.

API

class hub.Dataset(url: str, mode: str = 'a', safe_mode: bool = False, shape=None, schema=None, token=None, fs=None, fs_map=None, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None)
__getitem__(slice_)
Gets a slice or slices from the dataset
Usage:
>>> return ds["image", 5, 0:1920, 0:1080, 0:3].numpy() # returns numpy array
>>> images = ds["image"]
>>> return images[5].numpy() # returns numpy array
>>> images = ds["image"]
>>> image = images[5]
>>> return image[0:1920, 0:1080, 0:3].numpy()
__init__(url: str, mode: str = 'a', safe_mode: bool = False, shape=None, schema=None, token=None, fs=None, fs_map=None, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None)

Open a new or existing dataset for read/write.

Parameters
  • url (str) – The url where the dataset is located or should be created

  • mode (str, optional) – Whether the dataset is opened for reading or writing (e.g. "r", "w", "a"); defaults to "a"

  • safe_mode (bool, optional) – If the dataset exists, it cannot be overwritten in safe mode; otherwise the first write is allowed

  • shape (tuple, optional) – Tuple in (num_samples,) format, where num_samples is the number of samples

  • schema (optional) – Describes the data of a single sample using Hub schemas. Required for "a" and "w" modes

  • token (str or dict, optional) – If url refers to a place where authorization is required, token passes the credentials; it can be a filepath or a dict

  • fs (optional) –

  • fs_map (optional) –

  • cache (int, optional) – Size of the memory cache. Defaults to 64MB (2**26). If 0, False or None, the cache is not used

  • storage_cache (int, optional) – Size of the storage cache. Defaults to 256MB (2**28). If 0, False or None, the storage cache is not used

  • lock_cache (bool, optional) – Lock the cache to avoid multiprocessing errors
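
A minimal sketch (assuming the tutorial dataset above already exists), re-opening it read-only with a smaller memory cache:

>>> import hub
>>> ds = hub.Dataset("username/basic", mode="r", cache=2**24)  # 16MB in-memory cache
>>> len(ds)
4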

__iter__()

Returns an iterable over the samples
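
A hedged sketch of iteration, assuming each yielded sample can be indexed by key and read with .numpy() like a dataset slice:

>>> for sample in ds:
...     print(sample["image"].numpy().shape)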

__len__()

Number of samples in the dataset

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds["image", 5, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
>>> images = ds["image"]
>>> image = images[5]
>>> image[0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_check_and_prepare_dir()

Checks if the input data is ok. Creates or overwrites the dataset folder. Returns True if the dataset needs to be created, as opposed to read.

_get_dictionary(subpath, slice_=None)

Gets a dictionary from the dataset, given an incomplete subpath

append_shape(size: int)

Appends to the dataset's shape along the first dimension (a heavy operation)
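
A hedged sketch of the assumed behaviour, growing the 4-sample tutorial dataset:

>>> ds.append_shape(4)  # assumed: appends 4 samples along the first dimension
>>> len(ds)
8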

close()

Saves changes from the cache to the dataset's final storage. This invalidates the object.

commit()

Deprecated alias to flush()

flush()

Saves changes from the cache to the dataset's final storage. Does not invalidate the object.
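
A short sketch contrasting flush() with close(), reusing the tutorial dataset:

>>> ds["image", 0] = np.zeros((512, 512))
>>> ds.flush()  # persists changes; ds remains usable
>>> ds.close()  # persists changes; ds must not be used afterwards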

static from_pytorch(dataset)

Converts a PyTorch dataset object into hub format

Parameters
  • dataset – The PyTorch dataset object that needs to be converted into hub format
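
A hedged sketch, assuming from_pytorch accepts a PyTorch dataset whose samples are dicts keyed like a Hub schema (ToyDataset and the output tag are hypothetical):

>>> import numpy as np
>>> import torch
>>> class ToyDataset(torch.utils.data.Dataset):
...     def __len__(self):
...         return 4
...     def __getitem__(self, idx):
...         # hypothetical: dict samples that map to schema keys
...         return {"image": np.ones((512, 512)), "label": np.ones((512, 512))}
>>> out_ds = hub.Dataset.from_pytorch(ToyDataset())
>>> res_ds = out_ds.store("username/from_torch")  # res_ds is now a usable hub dataset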

static from_tensorflow(ds)

Converts a TensorFlow dataset into hub format

Parameters
  • ds – The TensorFlow dataset object that needs to be converted into hub format

Examples

>>> ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset

>>> ds = tf.data.Dataset.from_tensor_slices({"a": [1, 2], "b": [5, 6]})
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset

>>> ds = hub.Dataset(schema=my_schema, shape=(1000,), url="username/dataset_name", mode="w")
>>> ds = ds.to_tensorflow()
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset

static from_tfds(dataset, split=None, num=-1, sampling_amount=1)

Converts a TFDS dataset into hub format

Parameters
  • dataset (str) – The name of the tfds dataset that needs to be converted into hub format

  • split (str, optional) – A string representing the required splits of the dataset, such as "train" or "test+train". If not present, all splits of the dataset are used.

  • num (int, optional) – The number of samples required. If not present, all samples are taken. If num is -1, or greater than the size of the dataset, the new dataset contains all elements of the original.

  • sampling_amount (float, optional) – A value from 0 to 1 specifying how much of the dataset is sampled to determine feature shapes; 0 means no sampling and 1 means the entire dataset is sampled.

Examples

>>> out_ds = hub.Dataset.from_tfds("mnist", split="test+train", num=1000)
>>> res_ds = out_ds.store("username/mnist")  # res_ds is now a usable hub dataset

property keys

Gets the keys of the dataset
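
A hedged one-liner, assuming keys reflects the schema of the tutorial dataset:

>>> ds.keys  # e.g. the schema keys "image" and "label"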

resize_shape(size: int) → None

Resizes the shape of the dataset by resizing the first dimension of each tensor
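
A hedged sketch, assuming the size argument becomes the new first dimension:

>>> ds.resize_shape(8)  # every tensor's first dimension is resized to 8
>>> len(ds)
8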

to_pytorch(Transform=None, offset=None, num_samples=None)

Converts the dataset into a PyTorch-compatible format

Parameters
  • Transform (optional) –

  • offset (int, optional) – The offset from which the dataset needs to be converted

  • num_samples (int, optional) – The number of samples of the dataset that need to be converted
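
A short sketch of offset and num_samples on the 4-sample tutorial dataset (exact slicing semantics assumed):

>>> torch_ds = ds.to_pytorch(offset=0, num_samples=2)  # assumed: samples 0 and 1 only
>>> loader = torch.utils.data.DataLoader(torch_ds, batch_size=2)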

to_tensorflow(offset=None, num_samples=None)

Converts the dataset into a TensorFlow-compatible format

Parameters
  • offset (int, optional) – The offset from which the dataset needs to be converted

  • num_samples (int, optional) – The number of samples of the dataset that need to be converted
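
The TensorFlow counterpart under the same assumption:

>>> tf_ds = ds.to_tensorflow(offset=0, num_samples=2).batch(2)  # assumed: samples 0 and 1 only
>>> for batch in tf_ds:
...     print(batch["image"].shape)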