API Reference

Datasets

Dataset

class hub.Dataset(url: str, mode: str = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: str = None)
__getitem__(slice_)
Gets a slice or slices from dataset
Usage:
>>> return ds["image", 5, 0:1920, 0:1080, 0:3].compute() # returns numpy array
>>> images = ds["image"]
>>> return images[5].compute() # returns numpy array
>>> images = ds["image"]
>>> image = images[5]
>>> return image[0:1920, 0:1080, 0:3].compute()
__init__(url: str, mode: str = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: str = None)
Open a new or existing dataset for read/write
Parameters
  • url (str) – The url where dataset is located/should be created

  • mode (str, optional (default "a")) – Specifies whether the dataset is opened for reading, writing, or appending (e.g. "r", "w", "a")

  • shape (tuple, optional) – Tuple with (num_samples,) format, where num_samples is number of samples

  • schema (optional) – Describes the data of a single sample using Hub schemas. Required for "a" and "w" modes

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter used to pass the credentials; it can be a filepath or a dict

  • fs (optional) –

  • fs_map (optional) –

  • meta_information (dict, optional) – Additional information about the dataset, given as a dictionary

  • cache (int, optional) – Size of the memory cache. Default is 64MB (2**26). If 0, False, or None, the cache is not used

  • storage_cache (int, optional) – Size of the storage cache. Default is 256MB (2**28). If 0, False, or None, the storage cache is not used

  • lock_cache (bool, optional) – Lock the cache to avoid multiprocessing errors

  • lazy (bool, optional) – Setting this to False will stop lazy computation and will allow items to be accessed without .compute()

  • public (bool, optional) – Only applicable when using Hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won’t be visible to the public in the visualizer

  • name (str, optional) – Only applicable when using Hub storage; this is the name that shows up in the visualizer
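
A minimal sketch of creating and writing to a dataset (the path, schema fields, and values below are illustrative):

>>> import numpy as np
>>> import hub
>>> from hub.schema import Image, ClassLabel
>>> my_schema = {"image": Image(shape=(None, None, 3), max_shape=(512, 512, 3)), "label": ClassLabel(num_classes=10)}
>>> ds = hub.Dataset("./data/example", shape=(100,), schema=my_schema, mode="w")
>>> ds["image", 0] = np.zeros((28, 28, 3), dtype="uint8")
>>> ds["label", 0] = 3
>>> ds.flush()  # persist cached changes to storage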

__iter__()

Returns Iterable over samples

__len__()

Number of samples in the dataset

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds["image", 5, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
>>> images = ds["image"]
>>> image = images[5]
>>> image[0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_check_and_prepare_dir()

Checks whether the input data is valid. Creates or overwrites the dataset folder. Returns True if the dataset needs to be created, as opposed to read.

_get_dictionary(subpath, slice_=None)

Gets dictionary from dataset given incomplete subpath

append_shape(size: int)

Appends size to the first dimension of the dataset shape. This is a heavy operation.

close()

Save changes from cache to dataset final storage. This invalidates this object.

commit()

Deprecated alias to flush()

compute(label_name=False)

Gets the values from different tensorview objects in the dataset schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.
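
For example (assuming the dataset has a "label" field of type ClassLabel; the values shown are illustrative):

>>> ds["label", 0:3].compute()                  # e.g. array([0, 2, 1])
>>> ds["label", 0:3].compute(label_name=True)   # e.g. ['cat', 'bird', 'dog']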

copy(dst_url: str, token=None, fs=None, public=True)
Creates a copy of the dataset at the specified url and returns the dataset object
Parameters
  • dst_url (str) – The destination url where dataset should be copied

  • token (str or dict, optional) – If dst_url is referring to a place where authorization is required, token is the parameter used to pass the credentials; it can be a filepath or a dict

  • fs (optional) –

  • public (bool, optional) – Only applicable when using Hub storage; ignored otherwise. Setting this to False allows only the user who created it to access the newly copied dataset, and the dataset won’t be visible to the public in the visualizer

delete()

Deletes the dataset

filter(fn)
Applies a function to each element, one by one, as a filter to obtain a new DatasetView
Parameters

fn (function) – Should take in a single sample of the dataset and return True or False. This function is applied to all the items of the DatasetView, retaining those items for which it returns True
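
A short sketch (the "label" field and the value compared against are illustrative):

>>> filtered_view = ds.filter(lambda sample: sample["label"].compute() == 1)
>>> for sample in filtered_view:
>>>     print(sample["label"].compute())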

flush()

Save changes from cache to dataset final storage. Does not invalidate this object.

static from_directory(path_to_dir, labels=None, dtype='uint8', scheduler: str = 'single', workers: int = 1)
Utility function to create a dataset from a directory of categorical images, for easy use in the categorical image use case.
Parameters
  • path_to_dir (str) – path of the directory where the image dataset root folder exists

  • labels (list) – a list of class names

  • dtype (str) – datatype of the images; can be defined by the user. Default: uint8

  • scheduler (str) – choice between "single", "threaded", "processed"

  • workers (int) – how many threads or processes to use

Returns A dataset object that the user can use and store at a defined path.

>>> ds = Dataset.from_directory('path/test')
>>> ds.store('store_here')

static from_pytorch(dataset, scheduler: str = 'single', workers: int = 1)
Converts a pytorch dataset object into hub format
Parameters
  • dataset – The pytorch dataset object that needs to be converted into hub format

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use
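
Example (mirroring the from_tensorflow pattern below; torch_ds is assumed to be any pytorch dataset yielding dict-like samples):

>>> out_ds = hub.Dataset.from_pytorch(torch_ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset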

static from_tensorflow(ds, scheduler: str = 'single', workers: int = 1)

Converts a tensorflow dataset into hub format.

Parameters
  • dataset – The tensorflow dataset object that needs to be converted into hub format

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Examples

>>> ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
>>> ds = tf.data.Dataset.from_tensor_slices({'a': [1, 2], 'b': [5, 6]})
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
>>> ds = hub.Dataset(schema=my_schema, shape=(1000,), url="username/dataset_name", mode="w")
>>> ds = ds.to_tensorflow()
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset") # res_ds is now a usable hub dataset
static from_tfds(dataset, split=None, num: int = -1, sampling_amount: int = 1, scheduler: str = 'single', workers: int = 1)
Converts a TFDS Dataset into hub format.
Parameters
  • dataset (str) – The name of the tfds dataset that needs to be converted into hub format

  • split (str, optional) – A string representing the splits of the dataset that are required such as “train” or “test+train” If not present, all the splits of the dataset are used.

  • num (int, optional) – The number of samples required. If not present, all the samples are taken. If count is -1, or if count is greater than the size of this dataset, the new dataset will contain all elements of this dataset.

  • sampling_amount (float, optional) – a value from 0 to 1 that specifies how much of the dataset should be sampled to determine feature shapes. A value of 0 means no sampling, and 1 implies that the entire dataset is sampled

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

Examples

>>> out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
>>> res_ds = out_ds.store("username/mnist") # res_ds is now a usable hub dataset
property keys

Get Keys of the dataset

numpy(label_name=False)

Gets the values from different tensorview objects in the dataset schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.

rename(name: str) → None

Renames the dataset

resize_shape(size: int) → None

Resize the shape of the dataset by resizing the first dimension of each tensor

to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>, indexes=None)
Converts the dataset into a pytorch compatible format.
Parameters
  • transform (function that transforms data in a dict format) –

  • inplace (bool, optional) – Defines if data should be converted to torch.Tensor before or after Transforms applied (depends on what data type you need for Transforms). Default is True.

  • output_type (one of list, tuple, dict, optional) – Defines the output type. Default is dict - same as in original Hub Dataset.

  • indexes (list or int, optional) – The samples to be converted into pytorch format. Takes all samples in the dataset by default.
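
A usage sketch (the batch size and the "image" field are illustrative; assumes torch is installed):

>>> import torch
>>> torch_ds = ds.to_pytorch()
>>> loader = torch.utils.data.DataLoader(torch_ds, batch_size=8)
>>> for batch in loader:
>>>     print(batch["image"].shape)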

to_tensorflow(indexes=None, include_shapes=False)
Converts the dataset into a tensorflow compatible format
Parameters
  • indexes (list or int, optional) – The samples to be converted into tensorflow format. Takes all samples in dataset by default.

  • include_shapes (boolean, optional) – False by default. Setting it to True passes the shapes to tf.data.Dataset.from_generator. Setting to True could lead to issues with dictionaries inside Tensors.
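
A usage sketch (the "image" field is illustrative):

>>> tf_ds = ds.to_tensorflow()
>>> for sample in tf_ds.batch(8).take(1):
>>>     print(sample["image"].shape)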

DatasetView

class hub.api.datasetview.DatasetView(dataset=None, lazy: bool = True, indexes=None)
__getitem__(slice_)
Gets a slice or slices from DatasetView
Usage:
>>> ds_view = ds[5:15]
>>> return ds_view["image", 7, 0:1920, 0:1080, 0:3].compute() # returns numpy array of 12th image
__init__(dataset=None, lazy: bool = True, indexes=None)

Creates a DatasetView object for a subset of the Dataset.

Parameters
  • dataset (hub.api.dataset.Dataset object) – The dataset whose DatasetView is being created

  • lazy (bool, optional) – Setting this to False will stop lazy computation and will allow items to be accessed without .compute()

  • indexes (optional) – Either a list or an integer, depending on the slicing. Represents the indexes that the DatasetView covers.

__iter__()

Returns Iterable over samples

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds_view = ds[5:15]
>>> ds_view["image", 3, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8") # sets the 8th image
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_get_dictionary(subpath, slice_)

Gets dictionary from dataset given incomplete subpath

commit() → None

Commit dataset

compute(label_name=False)

Gets the value from different tensorview objects in the datasetview schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.

filter(fn)
Applies a function to each element, one by one, as a filter to obtain a new DatasetView
Parameters

fn (function) – Should take in a single sample of the dataset and return True or False. This function is applied to all the items of the DatasetView, retaining those items for which it returns True

property keys

Get Keys of the dataset

numpy(label_name=False)

Gets the value from different tensorview objects in the datasetview schema

Parameters

label_name (bool, optional) – If the TensorView object is of the ClassLabel type, setting this to True would retrieve the label names instead of the label encoded integers, otherwise this parameter is ignored.

resize_shape(size: int) → None

Resize dataset shape, not DatasetView

to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>)
Converts the dataset into a pytorch compatible format.
Parameters
  • transform (function that transforms data in a dict format) –

  • inplace (bool, optional) – Defines if data should be converted to torch.Tensor before or after Transforms applied (depends on what data type you need for Transforms). Default is True.

  • output_type (one of list, tuple, dict, optional) – Defines the output type. Default is dict - same as in original Hub Dataset.

to_tensorflow(include_shapes)

Converts the dataset into a tensorflow compatible format

Parameters

include_shapes (boolean, optional) – False by default. Setting it to True passes the shapes to tf.data.Dataset.from_generator. Setting to True could lead to issues with dictionaries inside Tensors.

Sharded Dataset

class hub.api.sharded_datasetview.ShardedDatasetView(datasets: list)
__init__(datasets: list) → None
Creates a sharded simple dataset.
Datasets should have the same schema.
Parameters

datasets (list of Datasets) –
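
For example (ds1 and ds2 are assumed to be two hub datasets with the same schema):

>>> from hub.api.sharded_datasetview import ShardedDatasetView
>>> sharded_ds = ShardedDatasetView([ds1, ds2])
>>> for sample in sharded_ds:   # iterates over all samples across both shards
>>>     pass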

__iter__()

Returns Iterable over samples

__repr__()

Return repr(self).

__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

identify_shard(index) → tuple

Computes shard id and returns the shard index and offset

slicing(slice_list)

Identifies the dataset shard that should be used

Pipelines

Transform


hub.compute.transform(schema, scheduler='single', workers=1)
Transform is a decorator for a function. The decorated function should output a dictionary per sample.
schema: Schema

The output format of the transformed dataset

scheduler: str

“single” for single-threaded execution, “threaded” for multiple threads, “processed” for multiple processes, “ray” for the Ray scheduler, “dask” for the Dask scheduler

workers: int

how many workers will be started for the process
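
A minimal sketch of using the decorator (the schema, field name, and paths are illustrative):

>>> from hub.compute import transform
>>> from hub.schema import Image
>>> my_schema = {"image": Image(shape=(None, None, 3), max_shape=(512, 512, 3))}
>>> @transform(schema=my_schema, scheduler="single", workers=1)
>>> def crop(sample):
>>>     return {"image": sample["image"].compute()[:256, :256]}
>>> out_ds = crop(ds)                        # ds is an existing hub dataset
>>> res_ds = out_ds.store("./data/cropped")  # applies the transform and stores the result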

class hub.compute.transform.Transform(func, schema, ds, scheduler: str = 'single', workers: int = 1, **kwargs)
__getitem__(slice_)
Get an item to be computed without iterating on the whole dataset.
Creates a dataset view, then a temporary dataset to apply the transform.
slice_: slice

Gets a slice or slices from dataset

__init__(func, schema, ds, scheduler: str = 'single', workers: int = 1, **kwargs)
Transform applies a user-defined function to each sample in a single-threaded manner.
Parameters
  • func (function) – user defined function func(x, **kwargs)

  • schema (dict of dtypes) – the structure of the final dataset that will be created

  • ds (Iterable) – input dataset or a list that can be iterated over

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

  • **kwargs – additional arguments that will be passed to func as static arguments for all samples

__weakref__

list of weak references to the object (if defined)

classmethod _flatten_dict(d: Dict, parent_key='', schema=None)
Helper function to flatten dictionary of a recursive tensor
Parameters

d (dict) –

_pbar(show: bool = True)

Returns a progress bar; if show is False, it does nothing

_split_list_to_dicts(xs)
Helper function that transforms a list of dicts into a dict of lists
Parameters

xs (list of dicts) –

Returns

xs_new

Return type

dicts of lists

classmethod _unwrap(results)

If there is any list then unwrap it into its elements

call_func(fn_index, item, as_list=False)

Calls all the functions one after the other

Parameters
  • fn_index (int) – The index starting from which the functions need to be called

  • item – The item on which functions need to be applied

  • as_list (bool, optional) – If true then treats the item as a list.

Returns

The final output obtained after all transforms

Return type

result

create_dataset(url: str, length: int = None, token: dict = None, public: bool = True)

Helper function to create a dataset

classmethod dtype_from_path(path, schema)

Helper function to get the dtype from the path

store(url: str, token: dict = None, length: int = None, ds: Iterable = None, progressbar: bool = True, sample_per_shard: int = None, public: bool = True)
Applies the transformation to each element in a batched manner
Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter used to pass the credentials; it can be a filepath or a dict

  • length (int) – in case shape is None, the user can provide the length

  • ds (Iterable) –

  • progressbar (bool) – Show progress bar

  • sample_per_shard (int) – How many samples per shard; used to split the iterator so that RAM is not overfilled

  • public (bool, optional) – Only applicable when using Hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won’t be visible to the public in the visualizer

Returns

ds – uploaded dataset

Return type

hub.Dataset

store_shard(ds_in: Iterable, ds_out: hub.api.dataset.Dataset, offset: int, token=None)

Takes a shard of the iterable ds_in, computes it, and stores it in a DatasetView

upload(results, ds: hub.api.dataset.Dataset, token: dict, progressbar: bool = True)

Batched upload of results. Each tensor is batched based on its chunk size and uploaded. If a tensor is dynamic, it is still uploaded element by element. For dynamic tensors, dynamic shape is disabled during upload and then re-enabled.

Parameters
  • dataset (hub.Dataset) – Dataset object that should be written to

  • results – Output of transform function

  • progressbar (bool) –

Returns

ds – Uploaded dataset

Return type

hub.Dataset

RayTransform

class hub.compute.ray.RayTransform(func, schema, ds, scheduler='ray', workers=1, **kwargs)
__init__(func, schema, ds, scheduler='ray', workers=1, **kwargs)
Transform applies a user-defined function to each sample in a single-threaded manner.
Parameters
  • func (function) – user defined function func(x, **kwargs)

  • schema (dict of dtypes) – the structure of the final dataset that will be created

  • ds (Iterable) – input dataset or a list that can be iterated over

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

  • **kwargs – additional arguments that will be passed to func as static arguments for all samples

set_dynamic_shapes(results, ds)

Sets shapes for dynamic tensors after the dataset is uploaded

Parameters
  • results (Tuple) – results from uploading each chunk which includes (key, slice, shape) tuple

  • ds – Dataset to set the shapes to

store(url: str, token: dict = None, length: int = None, ds: Iterable = None, progressbar: bool = True, public: bool = True)

Applies the transformation to each element in a batched manner

Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If url is referring to a place where authorization is required, token is the parameter used to pass the credentials; it can be a filepath or a dict

  • length (int) – in case shape is None, the user can provide the length

  • ds (Iterable) –

  • progressbar (bool) – Show progress bar

  • public (bool, optional) – Only applicable when using Hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won’t be visible to the public in the visualizer

Returns

ds – uploaded dataset

Return type

hub.Dataset

upload(results, url: str, token: dict, progressbar: bool = True, public: bool = True)

Batched upload of results. Each tensor is batched based on its chunk size and uploaded. If a tensor is dynamic, it is still uploaded element by element.

Parameters
  • dataset (hub.Dataset) – Dataset object that should be written to

  • results – Output of transform function

  • progressbar (bool) –

  • public (bool, optional) – Only applicable when using Hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won’t be visible to the public in the visualizer

Returns

ds – Uploaded dataset

Return type

hub.Dataset

Schema

Serialization


hub.schema.serialize.serialize(input)

Converts the input into a serializable format

hub.schema.serialize.serialize_SchemaDict(fdict)

Converts SchemaDict into a serializable format

hub.schema.serialize.serialize_primitive(primitive)

Converts Primitive into a serializable format

hub.schema.serialize.serialize_tensor(tensor)

Converts Tensor and its derivatives into a serializable format

Schema

class hub.schema.audio.Audio(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
__init__(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs the connector.

Parameters
  • file_format (str) – the audio file format. Can be any format ffmpeg understands. If None, will attempt to infer from the file extension.

  • shape (tuple) – shape of the data.

  • dtype (str) – The dtype of the data.

  • sample_rate (int) – additional metadata exposed to the user through info.schema[‘audio’].sample_rate. This value is used in neither encoding nor decoding.

Raises

ValueError – If the shape is invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.bbox.BBox(dtype='float64', chunks=None, compressor='lz4')
HubSchema for a normalized bounding box.

Output: Tensor of type float32 and shape [4,] which contains the normalized coordinates of the bounding box [xmin, ymin, xmax, ymax]

__init__(dtype='float64', chunks=None, compressor='lz4')

Construct the connector.

Parameters
  • dtype (str) – dtype of bbox coordinates. Default: ‘float64’

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample count is also in the list of the tensor’s dimensions (first dimension). If the default value is chosen, it automatically detects how to split into chunks

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.class_label.ClassLabel(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4')

HubSchema for integer class labels.

__init__(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4')
Constructs a ClassLabel HubSchema.
Returns integer representations of the given classes. Preserves the class names so they can be converted back to strings if needed.
There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
* num_classes: create 0 to (num_classes-1) labels
* names: a list of label strings
* names_file: a file containing the list of labels.

Note: In python2, the strings are encoded as utf-8.

Usage:
>>> class_label_tensor = ClassLabel(num_classes=10)
>>> class_label_tensor = ClassLabel(names=['class1', 'class2', 'class3', ...])
>>> class_label_tensor = ClassLabel(names_file='/path/to/file/with/names')
Parameters
  • num_classes (int) – number of classes. All labels must be < num_classes.

  • names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – path to a file with names for the integer classes, one per line.

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample count is also in the list of the tensor’s dimensions (first dimension). If the default value is chosen, it automatically detects how to split into chunks

  • Note – Only one of num_classes, names, or names_file should be provided

Raises

ValueError – If more than one argument is provided:

__repr__()

Return repr(self).

__str__()

Return str(self).

int2str(int_value: int)

Conversion integer => class name string.

str2int(str_value: str)

Conversion class name string => integer.
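
For example:

>>> labels = ClassLabel(names=["cat", "dog", "bird"])
>>> labels.str2int("dog")   # -> 1
>>> labels.int2str(2)       # -> 'bird'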

class hub.schema.image.Image(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
HubSchema for images.

Output: tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images

>>> image_tensor = Image(shape=(None, None, 1),
>>>                      encoding_format='png')
__init__(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Construct the connector.
Parameters
  • shape (tuple of ints or None) – The shape of decoded image: (height, width, channels) where height and width can be None. Defaults to (None, None, 3).

  • dtype (uint16 or uint8 (default)) – uint16 can be used only with png encoding_format

  • encoding_format ('jpeg' or 'png' (default)) – Format to serialize np.ndarray images on disk.

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample count is also in the list of the tensor’s dimensions (first dimension). If the default value is chosen, it automatically detects how to split into chunks

Returns

tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images

Raises

ValueError – If the shape, dtype or encoding formats are invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

_set_dtype(dtype)

Set the dtype.


class hub.schema.features.FlatTensor(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …])

Tensor metadata after applying flatten function

__init__(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …])

Initialize self. See help(type(self)) for accurate signature.

__weakref__

list of weak references to the object (if defined)

class hub.schema.features.HubSchema

Base class for all datatypes

__weakref__

list of weak references to the object (if defined)

_flatten() → Iterable[hub.schema.features.FlatTensor]

Flattens dtype into a list of tensors that will need to be stored separately

class hub.schema.features.Primitive(dtype, chunks=None, compressor='lz4')

Class for handling primitive datatypes. All numpy primitive data types like int32, float64, etc. should be wrapped with this class.

__init__(dtype, chunks=None, compressor='lz4')

Initialize self. See help(type(self)) for accurate signature.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens dtype into a list of tensors that will need to be stored separately

class hub.schema.features.SchemaDict(dict_)

Class for dict branching of a datatype. SchemaDict dtype contains str -> dtype associations. This way you can describe complex datatypes.

__init__(dict_)

Initialize self. See help(type(self)) for accurate signature.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens dtype into a list of tensors that will need to be stored separately

class hub.schema.features.Tensor(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Tensor type in schema. Has a numpy-array-like structure and contains elements of any type (Primitive and non-Primitive). Tensors can’t be visualized at app.activeloop.ai.

__init__(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Parameters
  • shape (Tuple[int]) – Shape of the tensor; can contain None(s), meaning the shape is dynamic. A dynamic shape can change while editing the dataset

  • dtype (SchemaConnector or str) – dtype of each element in Tensor. Can be Primitive and non-Primitive type

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample count is also in the list of the tensor’s dimensions (first dimension). If the default value is chosen, it automatically detects how to split into chunks

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens dtype into a list of tensors that will need to be stored separately

hub.schema.features.featurify(schema) → hub.schema.features.HubSchema

This function converts naked primitive datatypes and dicts into Primitives and SchemaDicts. That way every node in the dtype tree is a SchemaConnector-type object.

hub.schema.features.flatten(dtype, root='')

Flattens nested dictionary and returns tuple (dtype, path)

class hub.schema.mask.Mask(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for mask

Usage:
>>> mask_tensor = Mask(shape=(300, 300, 1))
__init__(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Mask HubSchema.

Parameters
  • shape (tuple of ints or None) – Shape in format (height, width, 1)

  • dtype (str) – Dtype of mask array. Default: uint8

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample count is also in the list of the tensor’s dimensions (first dimension). If the default value is chosen, it automatically detects how to split into chunks

__repr__()

Return repr(self).

__str__()

Return str(self).

_check_shape(shape)

Check if the provided shape matches mask characteristics.

class hub.schema.polygon.Polygon(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for polygon

Usage:
>>> polygon_tensor = Polygon(shape=(10, 2))
>>> polygon_tensor = Polygon(shape=(None, 2))
__init__(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Polygon HubSchema. Args: shape: tuple of ints or None, i.e (None, 2)

Parameters
  • shape (tuple of ints or None) – Shape in format (None, 2)

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample count is also in the list of the tensor’s dimensions (first dimension). If the default value is chosen, it automatically detects how to split into chunks

Raises

ValueError – If the shape is invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

_check_shape(shape)

Check if the provided shape matches polygon characteristics.

class hub.schema.segmentation.Segmentation(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for segmentation

__init__(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Segmentation HubSchema. Also constructs ClassLabel HubSchema for Segmentation classes.

Parameters
  • shape (tuple of ints or None) – Shape in format (height, width, 1)

  • dtype (str) – dtype of segmentation array: uint16 or uint8

  • num_classes (int) – Number of classes. All labels must be < num_classes.

  • names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – Path to a file with names for the integer classes, one per line.

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample count is also in the list of the tensor’s dimensions (first dimension). If the default value is chosen, it automatically detects how to split into chunks
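
A schema sketch (the shapes and class names are illustrative):

>>> from hub.schema import Segmentation
>>> seg_schema = {"mask": Segmentation(shape=(None, None, 1), dtype="uint8", names=["background", "road", "car"], max_shape=(512, 512, 1))}
>>> ds = hub.Dataset("./data/seg_example", shape=(10,), schema=seg_schema, mode="w")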

__repr__()

Return repr(self).

__str__()

Return str(self).

get_segmentation_classes()

Get classes of the segmentation mask

class hub.schema.sequence.Sequence(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4')

Sequence corresponds to a sequence of features.HubSchema. At generation time, a list is given for each element of the sequence. The output of Dataset will batch all the elements of the sequence together. If the length of the sequence is static and known in advance, it should be specified in the constructor using the length param.

Usage:
>>> sequence = Sequence(Image(), length=NB_FRAME)
__init__(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4')
Construct a sequence of Tensors.
Parameters
  • shape (Tuple[int] | int) – Single-integer-element tuple representing the length of the sequence. If None, the length is dynamic

  • dtype (str | HubSchema) – Datatype of each element in sequence

  • chunks (Tuple[int] | int) – Number of elements in a chunk. Works only for the top-level sequence. You can also include the number of samples in a single chunk

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.text.Text(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for text

__init__(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Construct the connector.

Returns integer representation of given string.

Parameters
  • shape (tuple of ints or None) – The shape of the text

  • dtype (str) – the dtype for storage.

  • max_shape (Tuple[int]) – Maximum number of words in the text

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. Sample count is also in the list of the tensor’s dimensions (first dimension). If the default value is chosen, it automatically detects how to split into chunks
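
A usage sketch (the path and values are illustrative):

>>> from hub.schema import Text
>>> my_schema = {"sentence": Text(shape=(None,), max_shape=(1000,))}
>>> ds = hub.Dataset("./data/text_example", shape=(10,), schema=my_schema, mode="w")
>>> ds["sentence", 0] = "hello world"
>>> ds["sentence", 0].compute()  # -> 'hello world'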

__repr__()

Return repr(self).

__str__()

Return str(self).

_set_dtype(dtype)

Set the dtype.

class hub.schema.video.Video(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for videos, encoding frames individually on disk.

The connector accepts as input a 4 dimensional uint8 array representing a video.

Returns

Tensor of type uint8 and shape [num_frames, height, width, channels], where channels must be 1 or 3

__init__(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Initializes the connector.

Parameters
  • shape (tuple of ints) – The shape of the video (num_frames, height, width, channels), where channels is 1 or 3.

  • encoding_format (str) – The video is stored as a sequence of encoded images. You can use any encoding format supported by Image.

  • dtype (uint16 or uint8 (default)) –

Raises

ValueError – If the shape, dtype or encoding formats are invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).