Hub Datasets are dictionaries containing tensors. You can think of them as folders in the cloud. To store a tensor in the cloud, first put it in a dataset and then store the dataset.


To create and store a dataset, define the tensors and pass them to the dataset as a dictionary.

from hub import dataset, tensor

tensor1 = tensor.from_zeros((20, 512, 512), dtype="uint8", dtag="image")
tensor2 = tensor.from_zeros((20,), dtype="bool", dtag="label")

ds = dataset.from_tensors({"name1": tensor1, "name2": tensor2})
ds.store("username/namespace")
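For intuition, the two tensors above behave like numpy arrays of the same shape and dtype (a sketch using plain numpy, not the hub API):

```python
import numpy as np

# 20 grayscale 512x512 images, matching tensor.from_zeros((20, 512, 512), dtype="uint8")
images = np.zeros((20, 512, 512), dtype="uint8")
# 20 boolean labels, matching tensor.from_zeros((20,), dtype="bool")
labels = np.zeros((20,), dtype="bool")

print(images.shape, images.dtype)  # (20, 512, 512) uint8
print(labels.shape, labels.dtype)  # (20,) bool
```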


To load a dataset from a central repository:

from hub import dataset

ds = dataset.load("mnist/mnist")


You can concatenate or combine datasets.

from hub import dataset


ds = dataset.concat(ds1, ds2)

ds = dataset.combine(ds1, ds2)

Get text labels

To get text labels from a dataset:


from hub import dataset
import torch

ds = dataset.load("mnist/fashion-mnist")

ds = ds.to_pytorch()

data_loader = torch.utils.data.DataLoader(ds, batch_size=BATCH_SIZE, collate_fn=ds.collate_fn)

for batch in data_loader:
    tl = dataset.get_text(batch['named_label'])


from hub import dataset
import tensorflow as tf

ds = dataset.load("mnist/fashion-mnist")

ds = ds.to_tensorflow()

batched = ds.batch(BATCH_SIZE)

for batch in batched:
    tl = dataset.get_text(batch['named_label'])

How to Upload a Dataset

For small datasets that fit into your RAM, you can upload directly by converting a numpy array into a hub tensor. For complete examples, please check Uploading MNIST and Uploading CIFAR.

For larger datasets, you need to define a dataset generator and apply the transformation iteratively; see the Uploading COCO example below. Pay careful attention to the meta(...) function, where you describe each tensor's properties: provide a full meta description including shape, dtype, dtag, chunk_shape, etc.
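The generator pattern can be sketched in plain Python (illustrative only; the class and key names here are hypothetical, not the exact hub API): meta() declares every tensor's properties up front, and __call__ maps one raw input item to a dictionary of arrays matching that declaration.

```python
class CocoLikeGenerator:
    """Hypothetical sketch of a dataset generator: meta() describes each
    tensor, __call__ transforms one input item into a sample dictionary."""

    def meta(self):
        # Full meta description for every tensor: shape, dtype, dtag, chunk_shape
        return {
            "image": {"shape": (512, 512, 3), "dtype": "uint8",
                      "dtag": "image", "chunk_shape": (1, 512, 512, 3)},
            "label": {"shape": (1,), "dtype": "bool", "dtag": "text"},
        }

    def __call__(self, item):
        # Produce exactly the tensors declared in meta()
        return {"image": item["pixels"], "label": item["is_valid"]}

gen = CocoLikeGenerator()
sample = gen({"pixels": [[0]], "is_valid": True})
print(sorted(sample.keys()))  # ['image', 'label']
```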


For each tensor you need to specify a dtag so that the visualizer knows how to draw it and transformations have the context to transform it.

Dtag          Shape                                                                     Types
default       any array                                                                 any
image         (width, height), (channel, width, height) or (width, height, channel)     int, float
text          used for label                                                            str or object
box           [(4)]                                                                     int32
mask          (width, height)                                                           bool
segmentation  (width, height), (channel, width, height) or (width, height, channel)     int
video         (sequence, width, height, channel) or (sequence, channel, width, height)  int, float
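For example, arrays matching a few of the dtag layouts above can be sketched with plain numpy (shape names follow the table; this is not the hub API):

```python
import numpy as np

# channel-last image layout: (width, height, channel), dtag="image"
img = np.zeros((256, 256, 3), dtype="uint8")
# boolean mask layout: (width, height), dtag="mask"
mask = np.zeros((256, 256), dtype="bool")
# a single bounding box: four int32 coordinates, dtag="box"
box = np.zeros((4,), dtype="int32")

print(img.ndim, mask.dtype, box.shape)  # 3 bool (4,)
```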


  1. Fork the GitHub repo and create a folder under examples/dataset

  2. Train a model using Pytorch

import hub
import torch

ds = hub.load("username/dataset")
ds = ds.to_pytorch()

# Implement a training loop for the dataset in pytorch
  3. Train a model using Tensorflow

import hub
import tensorflow

ds = hub.load("username/dataset")
ds = ds.to_tensorflow()

# Implement a training loop for the dataset in tensorflow
  4. Make sure visualization works perfectly at

Final Checklist

Here is the checklist for the pull request:

  • Accessible using the sdk

  • Trainable on Tensorflow

  • Trainable on PyTorch

  • Visualizable at

  • Pull Request merged into master


If you spot any trouble or have any questions, please open a GitHub issue.


class hub.dataset.Dataset(tensors: Dict[str, hub.collections.tensor.core.Tensor], metainfo={})
property citation

Dataset citation

property count

Length of the dataset (the length of each tensor along axis 0; they must all be equal). Returns -1 if the length is unknown
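The invariant behind count can be illustrated without hub: every tensor must share the same axis-0 length (a plain-Python sketch; get_count here is hypothetical, not the library's implementation):

```python
def get_count(tensors):
    """Return the shared axis-0 length of all tensors, or -1 if unknown (empty)."""
    if not tensors:
        return -1  # length unknown
    lengths = {len(t) for t in tensors.values()}
    if len(lengths) != 1:
        raise ValueError("all tensors must have equal length along axis 0")
    return lengths.pop()

print(get_count({"image": [0] * 20, "label": [False] * 20}))  # 20
print(get_count({}))  # -1
```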

delete(tag, creds=None, session_creds=True) → bool

Deletes the dataset given a tag (filepath) and credentials (optional)

property description

Dataset description

property howtoload

Dataset howtoload


Returns tensors


Returns names of tensors

property license

Dataset license

property meta

Dict of each tensor's meta; a tensor's meta contains all metadata for tensor storage

store(tag, creds=None, session_creds=True) → hub.collections.dataset.core.Dataset

Stores the dataset by tag (filepath) given credentials (can be omitted)

to_pytorch(transform=None, max_text_len=30)

Transforms into pytorch dataset

  • transform (func) – any transform that takes a sample dictionary as input and returns the transformed dictionary

  • max_text_len (integer) – the maximum length of text strings that will be stored. Strings longer than this will be snipped
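A transform is any function mapping a sample dictionary to a transformed dictionary; for instance (a sketch — the key names 'image' and 'named_label' are illustrative assumptions):

```python
def normalize_sample(sample):
    """Example transform: scale image values to [0, 1], pass other keys through."""
    out = dict(sample)
    out["image"] = [v / 255.0 for v in sample["image"]]
    return out

sample = {"image": [0, 128, 255], "named_label": "shirt"}
print(normalize_sample(sample)["image"][2])  # 1.0
```

Such a function would be passed as ds.to_pytorch(transform=normalize_sample).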


to_tensorflow(max_text_len=30)

Transforms into tensorflow dataset


max_text_len (integer) – the maximum length of text strings that will be stored. Strings longer than this will be snipped
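The effect of max_text_len can be sketched in plain Python (truncate here is a hypothetical helper mirroring the documented behavior):

```python
def truncate(text, max_text_len=30):
    """Strings longer than max_text_len are snipped to that length."""
    return text[:max_text_len]

print(len(truncate("a" * 40)))  # 30 characters survive
print(truncate("short"))        # short strings pass through unchanged
```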


Returns tensors