API Reference

Datasets

Dataset

class hub.Dataset(url: str, mode: Optional[str] = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: Optional[str] = None)
__getitem__(slice_)
Gets a slice or slices from dataset
Usage:
>>> return ds["image", 5, 0:1920, 0:1080, 0:3].compute() # returns numpy array
>>> images = ds["image"]
>>> return images[5].compute() # returns numpy array
>>> images = ds["image"]
>>> image = images[5]
>>> return image[0:1920, 0:1080, 0:3].compute()
__init__(url: str, mode: Optional[str] = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: Optional[str] = None)
Open a new or existing dataset for read/write
Parameters
  • url (str) – The url where dataset is located/should be created

  • mode (str, optional (defaults to "a")) – Mode in which the dataset is opened for reading or writing (e.g. "r", "w", "a")

  • shape (tuple, optional) – Tuple with (num_samples,) format, where num_samples is number of samples

  • schema (optional) – Describes the data of a single sample; Hub schemas are used for this. Required for "a" and "w" modes

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter to pass the credentials; it can be a filepath or a dict

  • fs (optional) –

  • fs_map (optional) –

  • meta_information (dict, optional) – Additional information about the dataset, given as a dictionary

  • cache (int, optional) – Size of the memory cache. Default is 64 MB (2**26). If 0, False or None, the cache is not used

  • storage_cache (int, optional) – Size of the storage cache. Default is 256 MB (2**28). If 0, False or None, the storage cache is not used

  • lock_cache (bool, optional) – Lock the cache to avoid multiprocessing errors

  • lazy (bool, optional) – Setting this to False will stop lazy computation and will allow items to be accessed without .compute()

  • public (bool, optional) – Only applicable when using Hub storage, ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible to the public in the visualizer

  • name (str, optional) – Only applicable when using Hub storage; this is the name that shows up in the visualizer
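
Examples

A minimal sketch of creating a dataset, writing a sample, and re-opening it for reading (the local path "./my_ds" and the single-tensor schema are illustrative placeholders):

>>> import numpy as np
>>> from hub import Dataset, schema
>>> my_schema = {"image": schema.Tensor(shape=(28, 28), dtype="uint8")}
>>> ds = Dataset("./my_ds", shape=(100,), schema=my_schema, mode="w")
>>> ds["image", 0] = np.ones((28, 28), dtype="uint8")
>>> ds.flush()
>>> ds = Dataset("./my_ds", mode="r")  # re-open the same dataset for reading
>>> ds["image", 0].compute().shape
(28, 28)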

__iter__()

Returns Iterable over samples

__len__()

Number of samples in the dataset

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds["image", 5, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
>>> images = ds["image"]
>>> image = images[5]
>>> image[0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_auto_checkout()
Automatically checks out to a new branch if the current commit is not at the head of a branch
_check_and_prepare_dir()

Checks whether the input data is valid. Creates or overwrites the dataset folder. Returns True if the dataset needs to be created, as opposed to read.

_get_dictionary(subpath, slice_=None)

Gets dictionary from dataset given incomplete subpath

append_shape(size: int)

Appends to the shape of the dataset. This is a heavy operation.

property branches

Gets a list of all the branches of the dataset

checkout(address: str, create: bool = False) → str
Changes the state of the dataset to the address mentioned. Creates a new branch if address isn’t a commit id or branch name and create is True.

Always checks out to the head of a branch if the address specified is a branch name.

Returns the commit id of the commit that has been switched to.

Only works if dataset was created on or after Hub v1.3.0

Parameters
  • address (str) – The branch name or commit id to checkout to

  • create (bool, optional) – Specifying create as True creates a new branch from the current commit if the address isn’t an existing branch name or commit id

close()

Save changes from cache to dataset final storage. Doesn’t create a new commit. This invalidates this object.

commit(message: str = '') → str
Saves the current state of the dataset and returns the commit id.

Checks out automatically to an auto branch if the current commit is not the head of the branch

Only saves the dataset without any version control information if the dataset was created before Hub v1.3.0

Parameters

message (str, optional) – The commit message to store along with the commit
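
Examples

A hedged sketch of the commit/checkout flow (the branch name, commit messages and the "label" key are placeholders; requires a dataset created on or after Hub v1.3.0):

>>> commit_id = ds.commit("first snapshot")
>>> ds.checkout("dev", create=True)    # create and switch to a new branch
>>> ds["label", 0] = 1
>>> ds.commit("update label on dev")
>>> ds.log()                           # print the commits before the current commit
>>> ds.checkout(commit_id)             # switch back to the earlier commit
>>> ds.branches                        # list all branches of the dataset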

compute(label_name=False)

Gets the values from different tensorview objects in the dataset schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.

copy(dst_url: str, token=None, fs=None, public=True)
Creates a copy of the dataset at the specified url and returns the dataset object
Parameters
  • dst_url (str) – The destination url where dataset should be copied

  • token (str or dict, optional) – If dst_url is referring to a place where authorization is required, token is the parameter to pass the credentials; it can be a filepath or a dict

  • fs (optional) –

  • public (bool, optional) – Only applicable when using Hub storage, ignored otherwise. Setting this to False allows only the user who created it to access the new copied dataset, and the dataset won't be visible to the public in the visualizer
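
Examples

A minimal sketch, assuming a local destination path:

>>> copied_ds = ds.copy("./copied_ds")
>>> len(copied_ds) == len(ds)
True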

delete()

Deletes the dataset

filter(fn)
Applies a function on each element one by one as a filter to get a new DatasetView
Parameters

fn (function) – Should take in a single sample of the dataset and return True or False. This function is applied to all the items of the DatasetView and retains those items that return True
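
Examples

A hedged sketch, assuming the schema contains a ClassLabel key named "label":

>>> ds_view = ds.filter(lambda sample: sample["label"].compute() == 1)
>>> for sample in ds_view:
>>>     print(sample["label"].compute())  # only samples whose label equals 1 are retained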

flush()

Save changes from cache to dataset final storage. Doesn’t create a new commit. Does not invalidate this object.

static from_pytorch(dataset, scheduler: str = 'single', workers: int = 1)
Converts a pytorch dataset object into hub format
Parameters
  • dataset – The pytorch dataset object that needs to be converted into hub format

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use
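
Examples

A hedged sketch, assuming (as with from_tensorflow below) that the converted object is materialized with store and that the source torch.utils.data.Dataset yields dict samples:

>>> import torch
>>> import numpy as np
>>> class MyTorchDataset(torch.utils.data.Dataset):
>>>     def __len__(self):
>>>         return 10
>>>     def __getitem__(self, index):
>>>         return {"value": np.array([index])}
>>> out_ds = hub.Dataset.from_pytorch(MyTorchDataset())
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset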

static from_tensorflow(ds, scheduler: str = 'single', workers: int = 1)

Converts a tensorflow dataset into hub format.

Parameters
  • ds – The tensorflow dataset object that needs to be converted into hub format

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Examples

>>> ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
>>> ds = tf.data.Dataset.from_tensor_slices({'a': [1, 2], 'b': [5, 6]})
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
>>> ds = hub.Dataset(schema=my_schema, shape=(1000,), url="username/dataset_name", mode="w")
>>> ds = ds.to_tensorflow()
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
static from_tfds(dataset, split=None, num: int = -1, sampling_amount: int = 1, scheduler: str = 'single', workers: int = 1)
Converts a TFDS Dataset into hub format.
Parameters
  • dataset (str) – The name of the tfds dataset that needs to be converted into hub format

  • split (str, optional) – A string representing the splits of the dataset that are required, such as "train" or "test+train". If not present, all the splits of the dataset are used.

  • num (int, optional) – The number of samples required. If not present, all the samples are taken. If num is -1, or if num is greater than the size of the source dataset, the new dataset will contain all elements of the source dataset.

  • sampling_amount (float, optional) – A value from 0 to 1 that specifies how much of the dataset is sampled to determine feature shapes. A value of 0 means no sampling and a value of 1 means the entire dataset is sampled.

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Examples

>>> out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
>>> res_ds = out_ds.store("username/mnist") # res_ds is now a usable hub dataset
property keys

Get Keys of the dataset

log()
Prints the commits in the commit tree before the current commit

Only works if dataset was created on or after Hub v1.3.0

numpy(label_name=False)

Gets the values from different tensorview objects in the dataset schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.

rename(name: str) → None

Renames the dataset

resize_shape(size: int) → None

Resizes the shape of the dataset by resizing the first dimension of each tensor

save()

Save changes from cache to dataset final storage. Doesn’t create a new commit. Does not invalidate this object.

store(url: str, token: Optional[dict] = None, sample_per_shard: Optional[int] = None, public: bool = True, scheduler='single', workers=1)
Saves the dataset as a new dataset at the given url; very similar to copy, but uses transforms instead
Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter to pass the credentials, it can be filepath or dict

  • length (int) – in case shape is None, user can provide length

  • sample_per_shard (int) – How many samples to keep per shard so that the iterator does not overfill RAM

  • public (bool, optional) – Only applicable when using Hub storage, ignored otherwise. Setting this to False allows only the user who created it to access the dataset, and the dataset won't be visible to the public in the visualizer

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Returns

ds – uploaded dataset

Return type

hub.Dataset
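
Examples

A minimal sketch, assuming a local destination path:

>>> new_ds = ds.store("./stored_ds", scheduler="threaded", workers=4)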

to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>, indexes=None, key_list=None, shuffle=False)
Converts the dataset into a pytorch compatible format.

Note: PyTorch does not support the uint16, uint32 and uint64 dtypes; these are implicitly cast to int32, int64 and int64 respectively. Avoid these dtypes in your schema if you want to avoid this implicit conversion. This method does not work with the Sequence schema.

Parameters
  • transform (function that transforms data in a dict format) –

  • inplace (bool, optional) – Defines whether data should be converted to torch.Tensor before or after transforms are applied (depends on what data type you need for the transforms). Default is True.

  • output_type (one of list, tuple, dict, optional) – Defines the output type. Default is dict - same as in original Hub Dataset.

  • indexes (list or int, optional) – The samples to be converted into Pytorch format. Takes all samples in dataset by default.

  • key_list (list, optional) – The list of keys that are needed in Pytorch format. For nested schemas such as {“a”:{“b”:{“c”: Tensor()}}} use [“a/b/c”] as key_list

  • shuffle (bool, optional) – whether to shuffle the data chunkwise or not. Default is False.
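
Examples

A hedged sketch of feeding the converted dataset to a PyTorch DataLoader (the "image" key and the batch size are illustrative):

>>> import torch
>>> torch_ds = ds.to_pytorch(shuffle=False)
>>> loader = torch.utils.data.DataLoader(torch_ds, batch_size=8)
>>> for batch in loader:
>>>     print(batch["image"].shape)  # output_type defaults to dict
>>>     break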

to_tensorflow(indexes=None, include_shapes=False, key_list=None)
Converts the dataset into a tensorflow compatible format
Parameters
  • indexes (list or int, optional) – The samples to be converted into tensorflow format. Takes all samples in dataset by default.

  • include_shapes (boolean, optional) – False by default. Setting it to True passes the shapes to tf.data.Dataset.from_generator. Setting to True could lead to issues with dictionaries inside Tensors.

  • key_list (list, optional) – The list of keys that are needed in tensorflow format. For nested schemas such as {“a”:{“b”:{“c”: Tensor()}}} use [“a/b/c”] as key_list
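
Examples

A hedged sketch, assuming an "image" key in the schema:

>>> tf_ds = ds.to_tensorflow(include_shapes=False)
>>> tf_ds = tf_ds.batch(8)
>>> for batch in tf_ds:
>>>     print(batch["image"].shape)
>>>     break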

DatasetView

class hub.api.datasetview.DatasetView(dataset=None, lazy: bool = True, indexes=None)
__getitem__(slice_)
Gets a slice or slices from DatasetView
Usage:
>>> ds_view = ds[5:15]
>>> return ds_view["image", 7, 0:1920, 0:1080, 0:3].compute() # returns numpy array of the image at index 12 of the dataset
__init__(dataset=None, lazy: bool = True, indexes=None)

Creates a DatasetView object for a subset of the Dataset.

Parameters
  • dataset (hub.api.dataset.Dataset object) – The dataset whose DatasetView is being created

  • lazy (bool, optional) – Setting this to False will stop lazy computation and will allow items to be accessed without .compute()

  • indexes (optional) – It can be either a list or an integer, depending upon the slicing. Represents the indexes of the dataset that the DatasetView refers to.

__iter__()

Returns Iterable over samples

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds_view = ds[5:15]
>>> ds_view["image", 3, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8") # sets the 8th image
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_get_dictionary(subpath, slice_)

Gets dictionary from dataset given incomplete subpath

commit(message='') → None

Commit dataset

compute(label_name=False)

Gets the value from different tensorview objects in the datasetview schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.

filter(fn)
Applies a function on each element one by one as a filter to get a new DatasetView
Parameters

fn (function) – Should take in a single sample of the dataset and return True or False. This function is applied to all the items of the DatasetView and retains those items that return True

flush() → None

Flush dataset

property keys

Get Keys of the dataset

numpy(label_name=False)

Gets the value from different tensorview objects in the datasetview schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.

resize_shape(size: int) → None

Resizes the shape of the underlying dataset, not of the DatasetView

store(url: str, token: Optional[dict] = None, sample_per_shard: Optional[int] = None, public: bool = True, scheduler='single', workers=1)
Saves the DatasetView as a new dataset
Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter to pass the credentials, it can be filepath or dict

  • length (int) – in case shape is None, user can provide length

  • sample_per_shard (int) – How many samples to keep per shard so that the iterator does not overfill RAM

  • public (bool, optional) – Only applicable when using Hub storage, ignored otherwise. Setting this to False allows only the user who created it to access the dataset, and the dataset won't be visible to the public in the visualizer

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Returns

ds – uploaded dataset

Return type

hub.Dataset

to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>, key_list=None, shuffle=False)
Converts the dataset into a pytorch compatible format.

Note: PyTorch does not support the uint16, uint32 and uint64 dtypes; these are implicitly cast to int32, int64 and int64 respectively. Avoid these dtypes in your schema if you want to avoid this implicit conversion. This method does not work with the Sequence schema.

Parameters
  • transform (function that transforms data in a dict format) –

  • inplace (bool, optional) – Defines whether data should be converted to torch.Tensor before or after transforms are applied (depends on what data type you need for the transforms). Default is True.

  • output_type (one of list, tuple, dict, optional) – Defines the output type. Default is dict - same as in original Hub Dataset.

  • shuffle (bool, optional) – whether to shuffle the data chunkwise or not. Default is False.

to_tensorflow(include_shapes=False, key_list=None)

Converts the dataset into a tensorflow compatible format

Parameters
  • include_shapes (boolean, optional) – False by default. Setting it to True passes the shapes to tf.data.Dataset.from_generator. Setting to True could lead to issues with dictionaries inside Tensors.

  • key_list (list, optional) – The list of keys that are needed in tensorflow format. For nested schemas such as {“a”:{“b”:{“c”: Tensor()}}} use [“a/b/c”] as key_list

Sharded Dataset

class hub.api.sharded_datasetview.ShardedDatasetView(datasets: list)
__init__(datasets: list) → None
Creates a sharded simple dataset.
Datasets should have the same schema.
Parameters

datasets (list of Datasets) –

__iter__()

Returns Iterable over samples

__repr__()

Return repr(self).

__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

identify_shard(index) → tuple

Computes shard id and returns the shard index and offset

slicing(slice_list)

Identifies the dataset shard that should be used
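
Examples

A hedged sketch, assuming ds1 and ds2 are existing datasets with the same schema:

>>> from hub.api.sharded_datasetview import ShardedDatasetView
>>> sharded_ds = ShardedDatasetView([ds1, ds2])
>>> shard_id, offset = sharded_ds.identify_shard(12)  # shard index and offset for global index 12
>>> for sample in sharded_ds:
>>>     pass  # iterates over the samples of all shards in order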

Pipelines

Transform


hub.compute.transform(schema, scheduler='single', workers=1)
Transform is a decorator of a function. The function should output a dictionary per sample.
schema: Schema

The output format of the transformed dataset

scheduler: str

"single" for single-threaded execution, "threaded" for multiple threads, "processed" for multiple processes, "ray" for the Ray scheduler, "dask" for the Dask scheduler

workers: int

how many workers will be started for the process
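
Examples

A minimal sketch of the decorator (the schema, the "value" key and the local output path are illustrative placeholders):

>>> import numpy as np
>>> from hub import transform, schema
>>> my_schema = {"value": schema.Tensor(shape=(1,), dtype="int64")}
>>>
>>> @transform(schema=my_schema, scheduler="single", workers=1)
>>> def double(sample):
>>>     return {"value": np.array([sample * 2])}
>>>
>>> out_ds = double(list(range(10)))       # returns a Transform object
>>> res_ds = out_ds.store("./doubled_ds")  # materializes it as a hub dataset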

class hub.compute.transform.Transform(func, schema, ds, scheduler: str = 'single', workers: int = 1, **kwargs)
__getitem__(slice_)
Get an item to be computed without iterating on the whole dataset.
Creates a dataset view, then a temporary dataset to apply the transform.
slice_: slice

Gets a slice or slices from dataset

__init__(func, schema, ds, scheduler: str = 'single', workers: int = 1, **kwargs)
Transform applies a user-defined function to each sample in a single-threaded manner.
Parameters
  • func (function) – user defined function func(x, **kwargs)

  • schema (dict of dtypes) – the structure of the final dataset that will be created

  • ds (Iterable) – input dataset or a list that can be iterated

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

  • **kwargs – additional arguments that will be passed to func as static argument for all samples

__weakref__

list of weak references to the object (if defined)

classmethod _flatten_dict(d: Dict, parent_key='', schema=None)
Helper function to flatten dictionary of a recursive tensor
Parameters

d (dict) –

_pbar(show: bool = True)

Returns a progress bar; if show is False, it does nothing

_split_list_to_dicts(xs)
Helper function that transforms a list of dicts into a dict of lists
Parameters

xs (list of dicts) –

Returns

xs_new

Return type

dict of lists

classmethod _unwrap(results)

If there is any list then unwrap it into its elements

call_func(fn_index, item, as_list=False)

Calls all the functions one after the other

Parameters
  • fn_index (int) – The index starting from which the functions need to be called

  • item – The item on which functions need to be applied

  • as_list (bool, optional) – If true then treats the item as a list.

Returns

The final output obtained after all transforms

Return type

result

create_dataset(url: str, length: Optional[int] = None, token: Optional[dict] = None, public: bool = True)

Helper function to create a dataset

classmethod dtype_from_path(path, schema)

Helper function to get the dtype from the path

store(url: str, token: Optional[dict] = None, length: Optional[int] = None, ds: Optional[Iterable] = None, progressbar: bool = True, sample_per_shard: Optional[int] = None, public: bool = True)
Applies the transformation to each element in a batched manner
Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter to pass the credentials, it can be filepath or dict

  • length (int) – in case shape is None, user can provide length

  • ds (Iterable) –

  • progressbar (bool) – Show progress bar

  • sample_per_shard (int) – How to split the iterator not to overfill RAM

  • public (bool, optional) – Only applicable when using Hub storage, ignored otherwise. Setting this to False allows only the user who created it to access the dataset, and the dataset won't be visible to the public in the visualizer

Returns

ds – uploaded dataset

Return type

hub.Dataset

store_shard(ds_in: Iterable, ds_out: hub.api.dataset.Dataset, offset: int, token=None)

Takes a shard of the iterable ds_in, computes it and stores it in the DatasetView

upload(results, ds: hub.api.dataset.Dataset, token: dict, progressbar: bool = True)

Batched upload of results. For each tensor, batches based on its chunk size and uploads. If a tensor is dynamic, it is still uploaded element by element. For dynamic tensors, the dynamic property is disabled during upload and then enabled back.

Parameters
  • ds (hub.Dataset) – Dataset object that should be written to

  • results – Output of transform function

  • progressbar (bool) –

Returns

ds – Uploaded dataset

Return type

hub.Dataset

RayTransform

class hub.compute.ray.RayTransform(func, schema, ds, scheduler='ray', workers=1, **kwargs)
__init__(func, schema, ds, scheduler='ray', workers=1, **kwargs)
Transform applies a user-defined function to each sample in a single-threaded manner.
Parameters
  • func (function) – user defined function func(x, **kwargs)

  • schema (dict of dtypes) – the structure of the final dataset that will be created

  • ds (Iterable) – input dataset or a list that can be iterated

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

  • **kwargs – additional arguments that will be passed to func as static argument for all samples

set_dynamic_shapes(results, ds)

Sets shapes for dynamic tensors after the dataset is uploaded

Parameters
  • results (Tuple) – results from uploading each chunk which includes (key, slice, shape) tuple

  • ds – Dataset to set the shapes to

store(url: str, token: Optional[dict] = None, length: Optional[int] = None, ds: Optional[Iterable] = None, progressbar: bool = True, public: bool = True)

Applies the transformation to each element in a batched manner

Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter to pass the credentials; it can be a filepath or a dict

  • length (int) – in case shape is None, user can provide length

  • ds (Iterable) –

  • progressbar (bool) – Show progress bar

  • public (bool, optional) – Only applicable when using Hub storage, ignored otherwise. Setting this to False allows only the user who created it to access the dataset, and the dataset won't be visible to the public in the visualizer

Returns

ds – uploaded dataset

Return type

hub.Dataset

upload(results, url: str, token: dict, progressbar: bool = True, public: bool = True)

Batched upload of results. For each tensor, batches based on its chunk size and uploads. If a tensor is dynamic, it is still uploaded element by element.

Parameters
  • dataset (hub.Dataset) – Dataset object that should be written to

  • results – Output of transform function

  • progressbar (bool) –

  • public (bool, optional) – Only applicable when using Hub storage, ignored otherwise. Setting this to False allows only the user who created it to access the dataset, and the dataset won't be visible to the public in the visualizer

Returns

ds – Uploaded dataset

Return type

hub.Dataset

Schema

class hub.schema.audio.Audio(shape: Tuple[int, ] = (None), dtype='int64', file_format=None, sample_rate: Optional[int] = None, max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

The Audio schema defines the maximum shape of audio samples in the dataset and their sampling rate.

Example: This example uploads an audio file to a Hub dataset audio_dataset with HubSchema and retrieves it.

>>> import hub
>>> from hub import Dataset, transform, schema
>>> from hub.schema import Audio, Primitive
>>> from glob import glob
>>> import librosa
>>> import numpy as np
>>> # Define schema
>>> my_schema = {
>>>     "wav": Audio(shape=(None,), max_shape=(1920000,), file_format="wav", dtype=float),
>>>     "sampling_rate": Primitive(dtype=int),
>>> }
>>>
>>> sample = glob("audio.wav")
>>> # Define transform
>>> @transform(schema=my_schema)
>>> def load_transform(sample):
>>>     audio, sr = librosa.load(sample, sr=None)
>>>
>>>     return {
>>>         "wav": audio,
>>>         "sampling_rate": sr
>>>     }
>>>
>>> # Returns a transform object
>>> ds = load_transform(sample)
>>>
>>> tag = "username/audio_dataset"
>>>
>>> # Push to Hub
>>> ds2 = ds.store(tag)
>>>
>>> # Fetch from Hub
>>> data = Dataset(tag)
>>>
>>> # Fetch the first sample
>>> audio_sample = data["wav"][0].compute()
>>>
>>> # Audio file
    array([ 9.15527344e-05,  2.13623047e-04,  0.00000000e+00, ...,
    -2.73132324e-02, -2.99072266e-02, -2.44750977e-02])
__init__(shape: Tuple[int, ] = (None), dtype='int64', file_format=None, sample_rate: Optional[int] = None, max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

Constructs the connector.

Parameters
  • file_format (str) – the audio file format. Can be any format ffmpeg understands. If None, will attempt to infer from the file extension.

  • shape (tuple) – shape of the data.

  • dtype (str) – The dtype of the data.

  • sample_rate (int) – additional metadata exposed to the user through info.schema['audio'].sample_rate. This value is used neither in encoding nor in decoding.

Raises

ValueError – If the shape is invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.bbox.BBox(shape: Tuple[int, ] = (4), max_shape: Optional[Tuple[int, ]] = None, dtype='float64', chunks=None, compressor='lz4')
HubSchema for a normalized bounding box.

Output: Tensor of type float32 and shape [4,] which contains the normalized coordinates of the bounding box [ymin, xmin, ymax, xmax]

Example: This example uploads a dataset with a Bounding box schema and retrieves it.

>>> import hub
>>> from hub import Dataset, schema
>>> from hub.schema import BBox
>>> import numpy as np
>>> tag = "username/dataset"
>>>
>>> # Create dataset
>>> ds = Dataset(
>>>   tag,
>>>   shape=(10,),
>>>   schema={
>>>      "bbox": schema.BBox(dtype="uint8"),
>>>  },
>>> )
>>>
>>> ds["bbox", 1] = np.array([1,2,3,4])
>>> ds.flush()
>>> # Load data
>>> ds = Dataset(tag)
>>>
>>> print(ds["bbox"][1].compute())
[1 2 3 4]
__init__(shape: Tuple[int, ] = (4), max_shape: Optional[Tuple[int, ]] = None, dtype='float64', chunks=None, compressor='lz4')

Construct the connector.

Parameters
  • shape (tuple of ints or None) – The shape of the bounding box. Will be (4,) if there is only one bounding box corresponding to each sample. If there are N bboxes corresponding to each sample, shape should be (N,). If the number of bboxes for each sample varies from 0 to M, the shape should be set to (None, 4) and max_shape should be set to (M, 4). Defaults to (4,).

  • max_shape (Tuple[int], optional) – Maximum shape of BBox

  • dtype (str) – dtype of bbox coordinates. Default: 'float64'

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample Count is also in the list of tensor’s dimensions (first dimension) If default value is chosen, automatically detects how to split into chunks

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.class_label.ClassLabel(shape: Tuple[int, ] = (), dtype='uint8', max_shape: Optional[Tuple[int, ]] = None, num_classes: Optional[int] = None, names: Optional[List[str]] = None, names_file: Optional[str] = None, chunks=None, compressor='lz4')
Constructs a ClassLabel HubSchema.
Returns an integer representation of the given classes. Preserves the names of the classes so they can be converted back to strings if needed.
There are 3 ways to define a ClassLabel, corresponding to the 3 arguments below. Note: In Python 2, the strings are encoded as utf-8.
>>> import hub
>>> from hub import Dataset, schema
>>> from hub.schema import ClassLabel
1. num_classes: create 0 to (num_classes-1) labels using ClassLabel(num_classes=`number of classes`)
>>> tag = "username/dataset"
>>>
>>> # Create dataset
>>> ds=Dataset(
>>>    tag,
>>>    shape=(10,),
>>>    schema = {
>>>         "label_1": ClassLabel(num_classes=3),
>>>    },
>>> )
>>>
>>> ds["label_1",0] = 0
>>> ds["label_1",1] = 1
>>> ds["label_1",2] = 2
>>>
>>> ds.flush()
>>>
>>> # Load data
>>> ds = Dataset(tag)
>>>
>>> print(ds["label_1"][0].compute(True))
>>> print(ds["label_1"][1].compute(True))
>>> print(ds["label_1"][2].compute(True))
0
1
2
2. names: a list of label strings. ClassLabel(names=['class1', 'class2'])
>>> tag = "username/dataset"
>>>
>>> # Define schema
>>> my_schema = {
>>>     "label_2": ClassLabel(names=['class1', 'class2', 'class3']),
>>> }
>>>
>>> # Create dataset
>>> ds=Dataset(
>>>    tag,
>>>    shape=(10,),
>>>    schema = my_schema,
>>> )
>>>
>>> ds.flush()
>>>
>>> # Load data
>>> ds = Dataset(tag)
Note: The ClassLabel HubSchema returns an integer representation of classes.
Hence use str2int() and int2str() to convert between class names and integers.
>>> print(my_schema["label_2"].str2int("class1"))
>>> print(my_schema["label_2"].int2str(0))
0
class1
3. names_file: a file containing the list of labels. ClassLabel(names_file=”/path/to/file/names.txt”)

Let's assume names.txt is located at /content:

>>> # Contents of "names.txt"
welcome
to
hub
>>> tag = "username/dataset"
>>>
>>> # Define Schema
>>> my_schema = {
>>>     "label_3": ClassLabel(names_file="/content/names.txt"),
>>> }
>>>
>>> # Create dataset
>>> ds=Dataset(
>>>    tag,
>>>    shape=(10,),
>>>    schema = my_schema,
>>> )
>>>
>>> ds.flush()
>>>
>>> # Load data
>>> ds = Dataset(tag)
>>>
>>> print(my_schema["label_3"].int2str(0))
>>> print(my_schema["label_3"].int2str(1))
>>> print(my_schema["label_3"].int2str(2))
welcome
to
hub
__init__(shape: Tuple[int, ] = (), dtype='uint8', max_shape: Optional[Tuple[int, ]] = None, num_classes: Optional[int] = None, names: Optional[List[str]] = None, names_file: Optional[str] = None, chunks=None, compressor='lz4')
Parameters
  • shape (tuple of ints or None) – The shape of the classlabel. Will be () if there is only one classlabel corresponding to each sample. If there are N classlabels corresponding to each sample, shape should be (N,). If the number of classlabels for each sample varies from 0 to M, the shape should be set to (None,) and max_shape should be set to (M,). Defaults to ().

  • max_shape (Tuple[int], optional) – Maximum shape of ClassLabel

  • num_classes (int) – number of classes. All labels must be < num_classes.

  • names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – path to a file with names for the integer classes, one per line.

  • chunks (Tuple[int] | True, optional) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample Count is also in the list of tensor’s dimensions (first dimension) If default value is chosen, automatically detects how to split into chunks

  • Note – Only one of num_classes, names, or names_file should be provided.

Raises

ValueError – If more than one argument is provided:

__repr__()

Return repr(self).

__str__()

Return str(self).

int2str(int_value: int)

Conversion integer => class name string.

str2int(str_value: str)

Conversion class name string => integer.

class hub.schema.image.Image(shape: Tuple[int, ] = (None, None, 3), dtype='uint8', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

The Image schema defines the shape and structure of image samples in the dataset.

Output: Tensor of type uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images

Example: This example uploads an image to a Hub dataset image_dataset with HubSchema and retrieves it.

>>> import hub
>>> from hub import Dataset, schema
>>> from PIL import Image
>>> from numpy import asarray
>>> import os
>>> tag = "username/image_dataset"
>>>
>>> # Create dataset
>>> ds=Dataset(
>>>     tag,
>>>     shape=(10,),
>>>     schema={
>>>         "image": schema.Image((height, width, 3), dtype="uint8"),
>>>     },
>>> )
>>>
>>> for index, file_name in enumerate(os.listdir("path/to/folder")):
>>>         data = asarray(Image.open(os.path.join("path/to/folder", file_name)))
>>>
>>>         # Upload data
>>>         ds["image"][index] = data
>>>
>>> ds.flush()
>>> # Load data
>>> ds = Dataset(tag)
>>>
>>> for i in range(len(ds)):
>>>     print(ds["image"][i].compute())
[[[124 112  64]
[124 112  64]
[124 112  64]
...
[236 237 232]
[238 239 234]
[238 239 234]]]
__init__(shape: Tuple[int, ] = (None, None, 3), dtype='uint8', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')
Construct the connector.
Parameters
  • shape (tuple of ints or None) – The shape of decoded image: (height, width, channels) where height and width can be None. Defaults to (None, None, 3).

  • dtype (uint16 or uint8 (default)) – uint16 can be used only with png encoding_format

  • encoding_format ('jpeg' or 'png' (default)) – Format to serialize np.ndarray images on disk.

  • max_shape (Tuple[int]) – Maximum shape of tensor shape if tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample Count is also in the list of tensor’s dimensions (first dimension) If default value is chosen, automatically detects how to split into chunks

Returns

tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images

Raises

ValueError – If the shape, dtype or encoding formats are invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

_set_dtype(dtype)

Set the dtype.


class hub.schema.features.FlatTensor(path: str, shape: Tuple[int, ], dtype, max_shape: Tuple[int, ], chunks: Tuple[int, ])

Tensor metadata after applying flatten function

__init__(path: str, shape: Tuple[int, ], dtype, max_shape: Tuple[int, ], chunks: Tuple[int, ])

Initialize self. See help(type(self)) for accurate signature.

__weakref__

list of weak references to the object (if defined)

class hub.schema.features.HubSchema

Base class for all datatypes

__weakref__

list of weak references to the object (if defined)

_flatten() → Iterable[hub.schema.features.FlatTensor]

Flattens dtype into a list of tensors that will need to be stored separately

class hub.schema.features.Primitive(dtype, chunks=None, compressor='lz4')

Class for handling primitive datatypes. All numpy primitive data types like int32, float64, etc. should be wrapped in this class.

__eq__(other)

Return self==value.

__init__(dtype, chunks=None, compressor='lz4')

Initialize self. See help(type(self)) for accurate signature.

__ne__(other)

Return self!=value.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens dtype into a list of tensors that will need to be stored separately

class hub.schema.features.SchemaDict(dict_)

Class for dict branching of a datatype. SchemaDict dtype contains str -> dtype associations. This way you can describe complex datatypes.

__init__(dict_)

Initialize self. See help(type(self)) for accurate signature.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens dtype into a list of tensors that will need to be stored separately

class hub.schema.features.Tensor(shape: Tuple[int, ] = (None), dtype='float64', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

Tensor type in schema. Has an np-array-like structure and contains elements of any type (Primitive and non-Primitive). Tensors can't be visualized at app.activeloop.ai.

__init__(shape: Tuple[int, ] = (None), dtype='float64', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')
Parameters
  • shape (Tuple[int]) – Shape of the tensor; can contain None(s), meaning the shape can be dynamic. A dynamic shape can change while editing the dataset.

  • dtype (SchemaConnector or str) – dtype of each element in Tensor. Can be Primitive and non-Primitive type

  • max_shape (Tuple[int]) – Maximum shape of tensor shape if tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample Count is also in the list of tensor’s dimensions (first dimension) If default value is chosen, automatically detects how to split into chunks

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens dtype into a list of tensors that will need to be stored separately

hub.schema.features.featurify(schema) → hub.schema.features.HubSchema

This function converts naked primitive datatypes and dicts into Primitives and SchemaDicts. That way every node in the dtype tree is a SchemaConnector-type object.

hub.schema.features.flatten(dtype, root='')

Flattens nested dictionary and returns tuple (dtype, path)

class hub.schema.mask.Mask(shape: Optional[Tuple[int, ]] = None, max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

HubSchema for mask

Usage:
>>> mask_tensor = Mask(shape=(300, 300, 1))
__init__(shape: Optional[Tuple[int, ]] = None, max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

Constructs a Mask HubSchema.

Parameters
  • shape (tuple of ints or None) – Shape in format (height, width, 1)

  • dtype (str) – Dtype of mask array. Default: uint8

  • max_shape (Tuple[int]) – Maximum shape of tensor shape if tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample Count is also in the list of tensor’s dimensions (first dimension) If default value is chosen, automatically detects how to split into chunks

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.polygon.Polygon(shape: Optional[Tuple[int, ]] = None, dtype='int32', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

HubSchema for polygon

Usage:
>>> polygon_tensor = Polygon(shape=(10, 2))
>>> polygon_tensor = Polygon(shape=(None, 2))
__init__(shape: Optional[Tuple[int, ]] = None, dtype='int32', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

Constructs a Polygon HubSchema. shape: tuple of ints or None, e.g. (None, 2)

Parameters
  • shape (tuple of ints or None) – Shape in format (None, 2)

  • max_shape (Tuple[int]) – Maximum shape of tensor shape if tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample Count is also in the list of tensor’s dimensions (first dimension) If default value is chosen, automatically detects how to split into chunks

Raises

ValueError – If the shape is invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

_check_shape(shape)

Check if the provided shape matches polygon characteristics.

class hub.schema.segmentation.Segmentation(shape: Optional[Tuple[int, ]] = None, dtype: Optional[str] = None, num_classes: Optional[int] = None, names: Optional[Tuple[str]] = None, names_file: Optional[str] = None, max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

HubSchema for segmentation

__init__(shape: Optional[Tuple[int, ]] = None, dtype: Optional[str] = None, num_classes: Optional[int] = None, names: Optional[Tuple[str]] = None, names_file: Optional[str] = None, max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

Constructs a Segmentation HubSchema. Also constructs ClassLabel HubSchema for Segmentation classes.

Parameters
  • shape (tuple of ints or None) – Shape in format (height, width, 1)

  • dtype (str) – dtype of segmentation array: uint16 or uint8

  • num_classes (int) – Number of classes. All labels must be < num_classes.

  • names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – Path to a file with names for the integer classes, one per line.

  • max_shape (tuple[int]) – Maximum shape of tensor shape if tensor is dynamic

  • chunks (tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample Count is also in the list of tensor’s dimensions (first dimension) If default value is chosen, automatically detects how to split into chunks
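
Usage (a hedged sketch; the shape, dtype and class names are illustrative):
>>> segmentation_tensor = Segmentation(shape=(300, 300, 1), dtype="uint8", names=["background", "object"])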

__repr__()

Return repr(self).

__str__()

Return str(self).

get_segmentation_classes()

Get classes of the segmentation mask

class hub.schema.sequence.Sequence(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4')

Sequence corresponds to a sequence of features.HubSchema. At generation time, a list for each of the sequence elements is given. The output of Dataset will batch all the elements of the sequence together. If the length of the sequence is static and known in advance, it should be specified in the constructor using the shape param.

Usage:
>>> sequence = Sequence(shape=(5,), dtype = Image((100, 100, 3)))
__init__(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4')
Construct a sequence of Tensors.
Parameters
  • shape (Tuple[int] | int) – Single-integer-element tuple representing the length of the sequence. If None, then dynamic.

  • dtype (str | HubSchema) – Datatype of each element in sequence

  • chunks (Tuple[int] | int) – Number of elements in a chunk. Works only for a top-level sequence. You can also include the number of samples in a single chunk.

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.text.Text(shape: Tuple[int, ] = (None), dtype='uint8', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

The Text schema defines the shape and structure of text samples in the dataset.

Output: Tensor of type uint8 holding an integer representation of the given string.

Example: This example uploads text to a Hub dataset with HubSchema and retrieves it.

>>> import hub
>>> from hub import Dataset, schema
>>> from hub.schema import Text
>>> tag = "username/dataset"
>>>
>>> # Create dataset
>>> ds = Dataset(
>>>     tag,
>>>     shape=(5,),
>>>     schema = {
>>>         "text": Text(shape=(11,)),
>>>    },
>>> )
>>>
>>> ds["text",0] = "Hello There"
>>>
>>> ds.flush()
>>>
>>> # Load the data
>>> ds = Dataset(tag)
>>>
>>> print(ds["text"][0].compute())
Hello There

For data with variable shape, it is recommended to use max_shape

>>> ds = Dataset(
>>>     tag,
>>>     shape=(5,),
>>>     schema = {
>>>         "text": Text(max_shape=(10,)),
>>>    },
>>> )
>>>
>>> ds["text",0] = "Welcome"
>>> ds["text",1] = "to"
>>> ds["text",2] = "Hub"
>>>
>>> ds.flush()
>>>
>>> # Load data
>>> ds = Dataset(tag)
>>>
>>> print(ds["text"][0].compute())
>>> print(ds["text"][1].compute())
>>> print(ds["text"][2].compute())
Welcome
to
Hub
__init__(shape: Tuple[int, ] = (None), dtype='uint8', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')
Construct the connector.

Returns integer representation of given string.

Parameters
  • shape (tuple of ints or None) – The shape of the text

  • dtype (str) – the dtype for storage.

  • max_shape (Tuple[int]) – Maximum number of words in the text

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample Count is also in the list of tensor’s dimensions (first dimension) If default value is chosen, automatically detects how to split into chunks

__repr__()

Return repr(self).

__str__()

Return str(self).

_set_dtype(dtype)

Set the dtype.

class hub.schema.video.Video(shape: Optional[Tuple[int, ]] = None, dtype: str = 'uint8', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

HubSchema for videos, encoding frames individually on disk.

The connector accepts as input a 4 dimensional uint8 array representing a video.

Returns

Tensor of type uint8 and shape [num_frames, height, width, channels], where channels must be 1 or 3

__init__(shape: Optional[Tuple[int, ]] = None, dtype: str = 'uint8', max_shape: Optional[Tuple[int, ]] = None, chunks=None, compressor='lz4')

Initializes the connector.

Parameters
  • shape (tuple of ints) – The shape of the video (num_frames, height, width, channels), where channels is 1 or 3.

  • encoding_format (str) – The video is stored as a sequence of encoded images. You can use any encoding format supported by Image.

  • dtype (uint16 or uint8 (default)) –

Raises

ValueError – If the shape, dtype or encoding formats are invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).