Dataset

Hub Datasets are dictionaries containing tensors. You can think of them as folders in the cloud. To store a tensor in the cloud, you first put it in a dataset and then store the dataset.

Store

To create and store a dataset, define the tensors and collect them in a dataset dictionary.

from hub import dataset, tensor

tensor1 = tensor.from_zeros((20,512,512), dtype="uint8", dtag="image")
tensor2 = tensor.from_zeros((20,), dtype="bool", dtag="label")

ds = dataset.from_tensors({"name1": tensor1, "name2": tensor2})

ds.store("username/namespace")

Load

To load a dataset from the central repository:

from hub import dataset

ds = dataset.load("mnist/mnist")

Combine

You can concatenate datasets vertically (adding samples) or combine them horizontally (adding tensors).

from hub import dataset

... 

# vertical
dataset.concat(ds1, ds2)

# horizontal
dataset.combine(ds1, ds2)
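As a mental model (an assumption about the semantics, not the library's implementation), concat appends samples along axis 0, while combine merges the tensor dictionaries of two datasets side by side:

```python
import numpy as np

# Two toy "datasets" as dictionaries of arrays, mirroring how Hub
# datasets map tensor names to tensors.
ds1 = {"image": np.zeros((10, 512, 512), dtype="uint8")}
ds2 = {"image": np.ones((5, 512, 512), dtype="uint8")}

# Vertical (concat): same tensors, more samples along axis 0.
concatenated = {k: np.concatenate([ds1[k], ds2[k]]) for k in ds1}
print(concatenated["image"].shape)  # (15, 512, 512)

# Horizontal (combine): same samples, more tensors per sample.
labels = {"label": np.zeros((10,), dtype="bool")}
combined = {**ds1, **labels}
print(sorted(combined.keys()))  # ['image', 'label']
```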

How to Upload a Dataset

For small datasets that fit in RAM, you can upload directly by converting a NumPy array into a Hub tensor. For complete examples, please check Uploading MNIST and Uploading CIFAR.

For larger datasets, you need to define a dataset generator and apply the transformation iteratively; see the Uploading COCO example below. Pay careful attention to the meta(...) function, where you describe the properties of each tensor: provide a full meta description including shape, dtype, dtag, chunk_shape, etc.
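A minimal sketch of such a generator (the class name, the exact keys, and the sample logic are illustrative assumptions; consult the Uploading COCO example for the real registration call):

```python
import numpy as np

class ImageFolderGenerator:
    """Hypothetical generator: describes every tensor in meta() and
    produces one sample per input item in __call__()."""

    def meta(self):
        # Full meta description for each tensor: shape, dtype,
        # dtag and chunk_shape.
        return {
            "image": {
                "shape": (1, 256, 256, 3),
                "dtype": "uint8",
                "dtag": "image",
                "chunk_shape": (1, 256, 256, 3),
            },
            "label": {
                "shape": (1,),
                "dtype": "int64",
                "dtag": "default",
                "chunk_shape": (1000,),
            },
        }

    def __call__(self, item):
        # In a real generator, `item` would point at a file on disk;
        # here we fabricate a sample with the declared shapes.
        return {
            "image": np.zeros((1, 256, 256, 3), dtype="uint8"),
            "label": np.array([item], dtype="int64"),
        }

gen = ImageFolderGenerator()
sample = gen(7)
print(sample["image"].shape)  # (1, 256, 256, 3)
```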

Dtag

For each tensor you need to specify a dtag so that the visualizer knows how to draw it and transformations have the context to transform it.

Dtag          Shape                                                                     Types
default       any array                                                                 any
image         (width, height), (channel, width, height) or (width, height, channel)     int, float
text          used for labels                                                           str or object
box           [(4)]                                                                     int32
mask          (width, height)                                                           bool
segmentation  (width, height), (channel, width, height) or (width, height, channel)     int
video         (sequence, width, height, channel) or (sequence, channel, width, height)  int, float
embedding
tabular
time
event
audio
pointcloud
landmark
polygon
mesh
document
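The shape and type conventions above can be sanity-checked against plain NumPy arrays before uploading (a sketch for illustration, not part of the Hub API):

```python
import numpy as np

# image: (width, height, channel), integer pixels
image = np.zeros((512, 512, 3), dtype="uint8")

# mask: (width, height), boolean
mask = np.zeros((512, 512), dtype="bool")

# box: four int32 coordinates
box = np.zeros((4,), dtype="int32")

# video: (sequence, width, height, channel)
video = np.zeros((16, 64, 64, 3), dtype="float32")

assert image.ndim == 3 and image.dtype == np.uint8
assert mask.dtype == np.bool_
assert box.shape == (4,) and box.dtype == np.int32
assert video.shape[0] == 16  # leading sequence axis
```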

Guidelines

  1. Fork the GitHub repo and create a folder under examples/dataset

  2. Train a model using PyTorch

import hub
import torch

ds = hub.load("username/dataset")
ds = ds.to_pytorch()

# Implement a training loop for the dataset in pytorch
...
  3. Train a model using TensorFlow

import hub
import tensorflow

ds = hub.load("username/dataset")
ds = ds.to_tensorflow()

# Implement a training loop for the dataset in tensorflow
...
  4. Make sure visualization works perfectly at app.activeloop.ai

Final Checklist

Here is the checklist for the pull request:

  • Accessible using the SDK

  • Trainable on TensorFlow

  • Trainable on PyTorch

  • Visualizable at app.activeloop.ai

  • Pull Request merged into master

Issues

If you run into any trouble or have any questions, please open a GitHub issue.

API

class hub.dataset.Dataset(tensors: Dict[str, hub.collections.tensor.core.Tensor], metainfo={})
property citation

Dataset citation

property count

Length of the dataset (the length of the tensors across axis 0; they should all be equal to each other). Returns -1 if the length is unknown

delete(tag, creds=None, session_creds=True) → bool

Deletes the dataset given its tag (filepath) and optional credentials

property description

Dataset description

property howtoload

Dataset howtoload

items()

Returns tensors

keys()

Returns names of tensors

property license

Dataset license

property meta

Dict of each tensor's meta; the meta of a tensor contains all metadata for tensor storage

store(tag, creds=None, session_creds=True) → hub.collections.dataset.core.Dataset

Stores the dataset by tag (filepath) given credentials (can be omitted)

to_pytorch(transform=None, max_text_len=30)

Transforms the dataset into a PyTorch dataset

Parameters
  • transform (func) – any transform that takes a dictionary of a sample as input and returns the transformed dictionary

  • max_text_len (integer) – the maximum length of text strings that would be stored. Strings longer than this would be snipped
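For example, a transform is any callable that takes a sample dictionary and returns a (possibly modified) dictionary; a hypothetical normalization step might look like:

```python
import numpy as np

def normalize(sample):
    # Scale uint8 image pixels to [0, 1] floats; leave other keys untouched.
    sample = dict(sample)
    sample["image"] = sample["image"].astype("float32") / 255.0
    return sample

# It would then be passed as ds.to_pytorch(transform=normalize).
sample = {"image": np.full((2, 2), 255, dtype="uint8"), "label": 1}
out = normalize(sample)
print(out["image"].max())  # 1.0
```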

to_tensorflow(max_text_len=30)

Transforms the dataset into a TensorFlow dataset

Parameters

max_text_len (integer) – the maximum length of text strings that would be stored. Strings longer than this would be snipped

values()

Returns tensors