API Reference

Datasets

Dataset

class hub.Dataset(url: str, mode: str = 'a', safe_mode: bool = False, shape=None, schema=None, token=None, fs=None, fs_map=None, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None)
__getitem__(slice_)
Gets a slice or slices from dataset
Usage:
>>> return ds["image", 5, 0:1920, 0:1080, 0:3].numpy() # returns numpy array
>>> images = ds["image"]
>>> return images[5].numpy() # returns numpy array
>>> images = ds["image"]
>>> image = images[5]
>>> return image[0:1920, 0:1080, 0:3].numpy()
__init__(url: str, mode: str = 'a', safe_mode: bool = False, shape=None, schema=None, token=None, fs=None, fs_map=None, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None)

Open a new or existing dataset for read/write. See the example after the parameter list below.

Parameters
  • url (str) – The url where the dataset is located or should be created

  • mode (str, optional) – Whether the dataset is opened for reading or writing, e.g. "r", "w", "a" (defaults to "a")

  • safe_mode (bool, optional) – In safe mode an existing dataset cannot be overwritten; if it does not exist, it can be written for the first time

  • shape (tuple, optional) – Tuple in (num_samples,) format, where num_samples is the number of samples

  • schema (optional) – Describes the data of a single sample using Hub schemas. Required for 'a' and 'w' modes

  • token (str or dict, optional) – If the url refers to a place where authorization is required, token is used to pass the credentials; it can be a filepath or a dict

  • fs (optional) –

  • fs_map (optional) –

  • cache (int, optional) – Size of the memory cache. Default is 64MB (2**26). If 0, False or None, the cache is not used

  • storage_cache (int, optional) – Size of the storage cache. Default is 256MB (2**28). If 0, False or None, the storage cache is not used

  • lock_cache (bool, optional) – Lock the cache to avoid multiprocessing errors
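
Example

A minimal sketch: the local path, schema, and values below are illustrative.

```python
import numpy as np
from hub import Dataset
from hub.schema import Image, ClassLabel

# hypothetical schema and local path for illustration
my_schema = {
    "image": Image(shape=(28, 28, 1), dtype="uint8"),
    "label": ClassLabel(num_classes=10),
}
ds = Dataset("./data/example", shape=(100,), schema=my_schema, mode="w")
ds["image", 0] = np.ones((28, 28, 1), dtype="uint8")
ds["label", 0] = 3
ds.flush()  # persist cached changes without invalidating the object
```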

__iter__()

Returns Iterable over samples

__len__()

Number of samples in the dataset

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds["image", 5, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
>>> images = ds["image"]
>>> image = images[5]
>>> image[0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_check_and_prepare_dir()

Checks if the input data is valid. Creates or overwrites the dataset folder. Returns True if the dataset needs to be created, as opposed to being read.

_get_dictionary(subpath, slice_=None)

Gets a dictionary from the dataset given an incomplete subpath

append_shape(size: int)

Appends size samples to the dataset shape (a heavy operation)

close()

Saves changes from the cache to the dataset's final storage. This invalidates the object.

commit()

Deprecated alias to flush()

flush()

Saves changes from the cache to the dataset's final storage. Does not invalidate the object.

static from_pytorch(dataset)

Converts a PyTorch dataset object into hub format, as in the example below.

Parameters
  • dataset – The PyTorch dataset object that needs to be converted into hub format
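
Example

A minimal sketch, assuming a PyTorch dataset whose samples are dicts of numpy-compatible values; the class name and dataset path are hypothetical.

```python
import numpy as np
import torch
import hub

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        # one dict per sample, keyed like a hub schema
        return {"image": np.zeros((28, 28, 1), dtype="uint8"), "label": idx}

out_ds = hub.Dataset.from_pytorch(ToyDataset())
res_ds = out_ds.store("username/toy_dataset")  # res_ds is now a usable hub dataset
```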

static from_tensorflow(ds)

Converts a TensorFlow dataset into hub format.

Parameters
  • ds – The TensorFlow dataset object that needs to be converted into hub format

Examples

ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
out_ds = hub.Dataset.from_tensorflow(ds)
res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset

ds = tf.data.Dataset.from_tensor_slices({'a': [1, 2], 'b': [5, 6]})
out_ds = hub.Dataset.from_tensorflow(ds)
res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset

ds = hub.Dataset(schema=my_schema, shape=(1000,), url="username/dataset_name", mode="w")
ds = ds.to_tensorflow()
out_ds = hub.Dataset.from_tensorflow(ds)
res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset

static from_tfds(dataset, split=None, num=-1, sampling_amount=1)

Converts a TFDS dataset into hub format.

Parameters
  • dataset (str) – The name of the tfds dataset that needs to be converted into hub format

  • split (str, optional) – A string representing the splits of the dataset that are required, such as "train" or "test+train". If not present, all the splits of the dataset are used.

  • num (int, optional) – The number of samples required. If not present, all the samples are taken. If num is -1, or greater than the size of the dataset, the new dataset will contain all elements of this dataset.

  • sampling_amount (float, optional) – A value from 0 to 1 that specifies how much of the dataset is sampled to determine feature shapes. A value of 0 means no sampling, and 1 implies that the entire dataset is sampled.

Examples

out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
res_ds = out_ds.store("username/mnist")  # res_ds is now a usable hub dataset

property keys

Get Keys of the dataset

resize_shape(size: int) → None

Resize the shape of the dataset by resizing the first dimension of each tensor

to_pytorch(Transform=None, offset=None, num_samples=None)

Converts the dataset into a PyTorch-compatible format. See the example below.

Parameters
  • offset (int, optional) – The offset from which the dataset needs to be converted

  • num_samples (int, optional) – The number of samples required of the dataset that needs to be converted
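
Example

A minimal sketch, assuming ds is an existing hub.Dataset.

```python
import torch

pt_ds = ds.to_pytorch()
loader = torch.utils.data.DataLoader(pt_ds, batch_size=8)
for batch in loader:
    # each batch is a dict keyed by the schema's top-level names
    break
```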

to_tensorflow(offset=None, num_samples=None)

Converts the dataset into a TensorFlow-compatible format. See the example below.

Parameters
  • offset (int, optional) – The offset from which the dataset needs to be converted

  • num_samples (int, optional) – The number of samples required of the dataset that needs to be converted
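
Example

A minimal sketch, assuming ds is an existing hub.Dataset.

```python
import tensorflow as tf

tf_ds = ds.to_tensorflow().batch(8)
for batch in tf_ds:
    # each batch is a dict of tf.Tensors keyed by the schema's top-level names
    break
```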

DatasetView

class hub.api.datasetview.DatasetView(dataset=None, num_samples=None, offset=None, squeeze_dim=False)
__getitem__(slice_)
Gets a slice or slices from DatasetView
Usage:
>>> ds_view = ds[5:15]
>>> return ds_view["image", 7, 0:1920, 0:1080, 0:3].compute() # returns numpy array of 12th image
__init__(dataset=None, num_samples=None, offset=None, squeeze_dim=False)

Creates a DatasetView object for a subset of the Dataset

Parameters
  • dataset (hub.api.dataset.Dataset object) – The dataset whose DatasetView is being created

  • num_samples (int) – The number of samples in this DatasetView

  • offset (int) – The offset from which the DatasetView starts

  • squeeze_dim (bool) – When slicing with integers, the first dimension is removed to simplify the result

__iter__()

Returns Iterable over samples

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> ds_view = ds[5:15]
>>> ds_view["image", 3, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8") # sets the 8th image
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_get_dictionary(subpath, slice_=None)

Gets a dictionary from the dataset given an incomplete subpath

commit() → None

Commit dataset

property keys

Get Keys of the dataset

resize_shape(size: int) → None

Resize the shape of the underlying dataset, not of the DatasetView

to_pytorch(Transform=None)

Converts the dataset into a pytorch compatible format

to_tensorflow()

Converts the dataset into a tensorflow compatible format

TensorView

class hub.api.tensorview.TensorView(dataset=None, subpath=None, slice_=None, squeeze_dims=[])
__getitem__(slice_)
Gets a slice or slices from tensorview
Usage:
>>> images_tensorview = ds["image"]
>>> return images_tensorview[7, 0:1920, 0:1080, 0:3].compute() # returns numpy array of 7th image
__init__(dataset=None, subpath=None, slice_=None, squeeze_dims=[])

Creates a TensorView object for a particular tensor in the dataset

Parameters
  • dataset (hub.api.dataset.Dataset object) – The dataset whose TensorView is being created

  • subpath (str) – The full path to the particular Tensor in the Dataset

  • slice_ (optional) – The slice of this Tensor that needs to be accessed

__repr__()

Return repr(self).

__setitem__(slice_, value)
Sets a slice or slices with a value
Usage:
>>> images_tensorview = ds["image"]
>>> images_tensorview[7, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8") # sets 7th image
__str__()

Return str(self).

__weakref__

list of weak references to the object (if defined)

_combine(slice_, num=None, ofs=0)

Combines a slice_ with the current num and offset present in tensorview

check_slice_bounds(num=None, start=None, stop=None, step=None)

Checks whether the bounds of the slice are within limits

compute()

Gets the value from tensorview

dtype_from_path(path)

Gets the dtype of the Tensorview by traversing the schema

numpy()

Gets the value from tensorview

slice_fill(slice_)

Fills the slice with zeros for the dimensions that have single elements and squeeze_dims set to True

Sharded Dataset

class hub.api.sharded_datasetview.ShardedDatasetView(datasets: list)
__init__(datasets: list) → None
Creates a sharded simple dataset. The datasets should have the same schema. See the example below.
Parameters

datasets (list of Datasets) – The datasets to be combined into a single sharded view
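
Example

A minimal sketch; the dataset paths are hypothetical and both datasets are assumed to share the same schema.

```python
from hub import Dataset
from hub.api.sharded_datasetview import ShardedDatasetView

ds1 = Dataset("./data/shard1")
ds2 = Dataset("./data/shard2")
sharded = ShardedDatasetView([ds1, ds2])
for sample in sharded:  # iterates over the first shard's samples, then the second's
    pass
```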

__iter__()

Returns Iterable over samples

__repr__()

Return repr(self).

__weakref__

list of weak references to the object (if defined)

identify_shard(index) → tuple

Computes shard id and returns the shard index and offset

slicing(slice_)

Identifies the dataset shard that should be used

Notes

Advanced slicing features that one would expect from a DatasetView are missing; e.g. access across dataset shards is not supported.

Pipelines

Transform

hub.compute.transform(schema, scheduler='single', workers=1)

Transform is a decorator for a function. The decorated function should output one dictionary per sample. See the example after the parameter list below.

Parameters
schema: Schema

The output format of the transformed dataset

scheduler: str

"single" for single-threaded execution, "threaded" for multiple threads, "processed" for multiple processes, "ray" for the Ray scheduler, "dask" for the Dask scheduler

workers: int

how many workers will be started for the process
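
Example

A minimal sketch of the decorator; the schema, input dataset ds, and output path are illustrative, and the decorator is assumed to be exposed as hub.transform.

```python
import hub
from hub.schema import Image, ClassLabel

# hypothetical schema for illustration
my_schema = {
    "image": Image(shape=(28, 28, 1), dtype="uint8"),
    "label": ClassLabel(num_classes=10),
}

@hub.transform(schema=my_schema, scheduler="single", workers=1)
def my_transform(sample):
    # one output dictionary per input sample, matching the schema
    return {"image": sample["image"] * 2, "label": sample["label"]}

out_ds = my_transform(ds)                    # ds: an existing hub dataset or iterable
res_ds = out_ds.store("./data/transformed")  # materializes the transformed dataset
```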

class hub.compute.transform.Transform(func, schema, ds, scheduler: str = 'single', workers: int = 1, **kwargs)
__getitem__(slice_)

Gets an item to be computed without iterating over the whole dataset. Creates a dataset view, then a temporary dataset to apply the transform.

slice_: slice

Gets a slice or slices from dataset

__init__(func, schema, ds, scheduler: str = 'single', workers: int = 1, **kwargs)

Transform applies a user-defined function to each sample (single-threaded by default)

Parameters
  • func (function) – user defined function func(x, **kwargs)

  • schema (dict of dtypes) – the structure of the final dataset that will be created

  • ds (Iterable) – input dataset or a list that can be iterated

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

  • **kwargs – additional arguments that will be passed to func as static argument for all samples

__weakref__

list of weak references to the object (if defined)

classmethod _flatten(items, schema)

Takes a dictionary or a list of dictionaries and returns a dictionary of concatenated values. The dictionary follows the schema.

classmethod _flatten_dict(d: Dict, parent_key='', schema=None)

Helper function to flatten dictionary of a recursive tensor

Parameters

d (dict) –

_pbar(show: bool = True)

Returns a progress bar; if show is False, it does nothing

_split_list_to_dicts(xs)

Helper function that transforms a list of dicts into a dict of lists

Parameters

xs (list of dicts) –

Returns

xs_new

Return type

dicts of lists

classmethod _unwrap(results)

If there is any list then unwrap it into its elements

create_dataset(url, length=None, token=None)

Helper function to create a dataset

classmethod dtype_from_path(path, schema)

Helper function to get the dtype from the path

store(url: str, token: dict = None, length: int = None, ds: Iterable = None, progressbar: bool = True, sample_per_shard=None)

Applies the transformation to each element in a batched manner

Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If the url refers to a place where authorization is required, token is used to pass the credentials; it can be a filepath or a dict

  • length (int) – In case the shape is None, the user can provide a length

  • ds (Iterable) –

  • progressbar (bool) – Show progress bar

  • sample_per_shard (int) – How many samples to put in each shard when splitting the iterator, so as not to overfill RAM

Returns

ds – uploaded dataset

Return type

hub.Dataset

store_shard(ds_in: Iterable, ds_out: hub.api.dataset.Dataset, offset: int, token=None)

Takes a shard of the iterable ds_in, computes it, and stores it in the DatasetView

upload(results, ds: hub.api.dataset.Dataset, token: dict, progressbar: bool = True)

Batched upload of results. Each tensor is batched based on its chunk size and uploaded. If a tensor is dynamic, it is still uploaded element by element; dynamicness is disabled during the upload and re-enabled afterwards.

Parameters
  • dataset (hub.Dataset) – Dataset object that should be written to

  • results – Output of transform function

  • progressbar (bool) –

Returns

ds – Uploaded dataset

Return type

hub.Dataset

RayTransform

class hub.compute.ray.RayTransform(func, schema, ds, scheduler='ray', workers=1, **kwargs)
__init__(func, schema, ds, scheduler='ray', workers=1, **kwargs)

Transform applies a user-defined function to each sample

Parameters
  • func (function) – user defined function func(x, **kwargs)

  • schema (dict of dtypes) – the structure of the final dataset that will be created

  • ds (Iterable) – input dataset or a list that can be iterated

  • scheduler (str) – choice between “single”, “threaded”, “processed”

  • workers (int) – how many threads or processes to use

  • **kwargs – additional arguments that will be passed to func as static argument for all samples

store(url: str, token: dict = None, length: int = None, ds: Iterable = None, progressbar: bool = True)

Applies the transformation to each element in a batched manner

Parameters
  • url (str) – path where the data is going to be stored

  • token (str or dict, optional) – If the url refers to a place where authorization is required, token is used to pass the credentials; it can be a filepath or a dict

  • length (int) – In case the shape is None, the user can provide a length

  • ds (Iterable) –

  • progressbar (bool) – Show progress bar

Returns

ds – uploaded dataset

Return type

hub.Dataset

upload(results, url: str, token: dict, progressbar: bool = True)

Batched upload of results. Each tensor is batched based on its chunk size and uploaded. If a tensor is dynamic, it is still uploaded element by element.

Parameters
  • dataset (hub.Dataset) – Dataset object that should be written to

  • results – Output of transform function

  • progressbar (bool) –

Returns

ds – Uploaded dataset

Return type

hub.Dataset

Schema

Serialization

hub.schema.serialize.serialize(input)

Converts the input into a serializable format

hub.schema.serialize.serialize_SchemaDict(fdict)

Converts SchemaDict into a serializable format

hub.schema.serialize.serialize_primitive(primitive)

Converts Primitive into a serializable format

hub.schema.serialize.serialize_tensor(tensor)

Converts Tensor and its derivatives into a serializable format

Schema

class hub.schema.audio.Audio(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
__init__(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs the connector.

Parameters
  • file_format (str) – the audio file format. Can be any format ffmpeg understands. If None, will attempt to infer from the file extension.

  • shape (tuple) – shape of the data.

  • dtype (str) – The dtype of the data.

  • sample_rate (int) – additional metadata exposed to the user through info.schema['audio'].sample_rate. This value is not used in either encoding or decoding.

Raises

ValueError – If the shape is invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

get_attr_dict()

Return class attributes.

class hub.schema.bbox.BBox(dtype='float64', chunks=None, compressor='lz4')

HubSchema for a normalized bounding box.

Output: bbox – Tensor of type float32 and shape [4,], which contains the normalized coordinates of the bounding box [ymin, xmin, ymax, xmax]

__init__(dtype='float64', chunks=None, compressor='lz4')

Construct the connector.

Parameters
  • dtype (str) – dtype of bbox coordinates. Default: 'float64'

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, how to split into chunks is detected automatically

__repr__()

Return repr(self).

__str__()

Return str(self).

get_attr_dict()

Return class attributes.

class hub.schema.class_label.ClassLabel(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4')

HubSchema for integer class labels.

__init__(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4')
Constructs a ClassLabel HubSchema.
There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
* num_classes: create 0 to (num_classes-1) labels
* names: a list of label strings
* names_file: a file containing the list of labels.

Note: In python2, the strings are encoded as utf-8.

Usage:
>>> class_label_tensor = ClassLabel(num_classes=10)
>>> class_label_tensor = ClassLabel(names=['class1', 'class2', 'class3', ...])
>>> class_label_tensor = ClassLabel(names_file='/path/to/file/with/names')
Parameters
  • num_classes (int) – number of classes. All labels must be < num_classes.

  • names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – path to a file with names for the integer classes, one per line.

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, how to split into chunks is detected automatically

  • Note – Only one of num_classes, names, or names_file should be provided

Raises

ValueError – If more than one argument is provided:

__repr__()

Return repr(self).

__str__()

Return str(self).

get_attr_dict()

Return class attributes.

int2str(int_value: int)

Conversion integer => class name string.

str2int(str_value: str)

Conversion class name string => integer.

class hub.schema.image.Image(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for images. Output: tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images

Example:

```python
image_tensor = Image(shape=(None, None, 1), encoding_format='png')
```

__init__(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Construct the connector.
Parameters
  • shape (tuple of ints or None) – The shape of decoded image: (height, width, channels) where height and width can be None. Defaults to (None, None, 3).

  • dtype (uint16 or uint8 (default)) – uint16 can be used only with png encoding_format

  • encoding_format ('jpeg' or 'png' (default)) – Format to serialize np.ndarray images on disk.

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, how to split into chunks is detected automatically

Returns

tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images

Raises

ValueError – If the shape, dtype or encoding formats are invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

_set_dtype(dtype)

Set the dtype.

_set_encoding_format(encoding_format)

Set the encoding format.

get_attr_dict()

Return class attributes.

class hub.schema.features.FlatTensor(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …])

Tensor metadata after applying flatten function

__init__(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …])

Initialize self. See help(type(self)) for accurate signature.

__weakref__

list of weak references to the object (if defined)

class hub.schema.features.HubSchema

Base class for all datatypes

__weakref__

list of weak references to the object (if defined)

_flatten() → Iterable[hub.schema.features.FlatTensor]

Flattens the dtype into a list of tensors that will need to be stored separately

class hub.schema.features.Primitive(dtype, chunks=None, compressor='lz4')

Class for handling primitive datatypes. All numpy primitive data types like int32, float64, etc. should be wrapped in this class.

__init__(dtype, chunks=None, compressor='lz4')

Initialize self. See help(type(self)) for accurate signature.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens the dtype into a list of tensors that will need to be stored separately

class hub.schema.features.SchemaDict(dict_)

Class for dict branching of a datatype. A SchemaDict dtype contains str -> dtype associations; this way you can describe complex datatypes.

__init__(dict_)

Initialize self. See help(type(self)) for accurate signature.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens the dtype into a list of tensors that will need to be stored separately

class hub.schema.features.Tensor(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Tensor type in a schema. Has an np-array-like structure and contains any type of elements (Primitive and non-Primitive). See the example after the parameter list below.

__init__(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Parameters
  • shape (Tuple[int]) – Shape of the tensor; can contain None(s), meaning the shape is dynamic. A dynamic shape can change while editing the dataset

  • dtype (SchemaConnector or str) – dtype of each element in Tensor. Can be Primitive and non-Primitive type

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, how to split into chunks is detected automatically
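
Example

A minimal sketch of a schema using Tensor; the field names are illustrative. Naked primitive dtypes such as "float64" are wrapped into Primitive by featurify.

```python
from hub.schema import Tensor

my_schema = {
    # dynamic first dimension, bounded by max_shape
    "embedding": Tensor(shape=(None,), dtype="float64", max_shape=(512,)),
    "score": "float64",
}
```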

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens the dtype into a list of tensors that will need to be stored separately

hub.schema.features.featurify(schema) → hub.schema.features.HubSchema

This function converts naked primitive datatypes and dicts into Primitives and SchemaDicts. That way every node in the dtype tree is a SchemaConnector type object.

hub.schema.features.flatten(dtype, root='')

Flattens nested dictionary and returns tuple (dtype, path)

class hub.schema.mask.Mask(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for mask

Usage:
>>> mask_tensor = Mask(shape=(300, 300, 1))
__init__(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Mask HubSchema.

Parameters
  • shape (tuple of ints or None) – Shape in format (height, width, 1)

  • dtype (str) – Dtype of mask array. Default: uint8

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, how to split into chunks is detected automatically

__repr__()

Return repr(self).

__str__()

Return str(self).

get_attr_dict()

Return class attributes.

class hub.schema.polygon.Polygon(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for polygon

Usage:
>>> polygon_tensor = Polygon(shape=(10, 2))
>>> polygon_tensor = Polygon(shape=(None, 2))
__init__(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Polygon HubSchema.

Parameters
  • shape (tuple of ints or None) – Shape in format (None, 2)

  • max_shape (Tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, how to split into chunks is detected automatically

Raises

ValueError – If the shape is invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

_check_shape(shape)

Check if the provided shape matches polygon characteristics.

get_attr_dict()

Return class attributes.

class hub.schema.segmentation.Segmentation(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for segmentation
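
Usage (illustrative example; the label names are placeholders):
>>> segmentation_tensor = Segmentation(shape=(300, 300, 1), dtype='uint8', names=['background', 'person', 'car'])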

__init__(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Segmentation HubSchema. Also constructs ClassLabel HubSchema for Segmentation classes.

Parameters
  • shape (tuple of ints or None) – Shape in format (height, width, 1)

  • dtype (str) – dtype of segmentation array: uint16 or uint8

  • num_classes (int) – Number of classes. All labels must be < num_classes.

  • names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – Path to a file with names for the integer classes, one per line.

  • max_shape (tuple[int]) – Maximum shape of the tensor if the tensor is dynamic

  • chunks (tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, how to split into chunks is detected automatically

__repr__()

Return repr(self).

__str__()

Return str(self).

get_attr_dict()

Return class attributes.

get_segmentation_classes()

Get classes of the segmentation mask

class hub.schema.sequence.Sequence(shape=(), max_shape=(), dtype=None, chunks=None, compressor='lz4')

Sequence corresponds to a sequence of features.HubSchema. At generation time, a list is given for each element of the sequence. The output of Dataset will batch all the elements of the sequence together. If the length of the sequence is static and known in advance, it should be specified in the constructor using the length param.

Usage:
>>> sequence = Sequence(Image(), length=NB_FRAME)
__init__(shape=(), max_shape=(), dtype=None, chunks=None, compressor='lz4')

Construct a sequence of Tensors.

Parameters
  • shape – Single-integer-element tuple representing the length of the sequence; if None, the sequence is dynamic

  • dtype (str | HubSchema) – Datatype of each element in sequence

  • chunks (Tuple[int] | int) – Number of elements in a chunk. Works only for a top-level sequence. You can also include the number of samples in a single chunk.

__repr__()

Return repr(self).

__str__()

Return str(self).

get_attr_dict()

Return class attributes

class hub.schema.text.Text(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for text
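
Usage (illustrative example):
>>> text_tensor = Text(shape=(None,), max_shape=(20,))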

__init__(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Construct the connector.
Parameters
  • shape (tuple of ints or None) – The shape of the text

  • dtype (str) – the dtype for storage.

  • max_shape (Tuple[int]) – Maximum number of words in the text

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, how to split into chunks is detected automatically

__repr__()

Return repr(self).

__str__()

Return str(self).

_set_dtype(dtype)

Set the dtype.

get_attr_dict()

Return class attributes.

class hub.schema.video.Video(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for videos, encoding frames individually on disk.

The connector accepts as input a 4 dimensional uint8 array representing a video.

Returns

Tensor of type uint8 and shape [num_frames, height, width, channels], where channels must be 1 or 3

__init__(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Initializes the connector.

Parameters
  • shape (tuple of ints) – The shape of the video (num_frames, height, width, channels), where channels is 1 or 3.

  • encoding_format (str) – The video is stored as a sequence of encoded images. You can use any encoding format supported by Image.

  • dtype (uint16 or uint8 (default)) –

Raises

ValueError – If the shape, dtype or encoding formats are invalid:

__repr__()

Return repr(self).

__str__()

Return str(self).

get_attr_dict()

Return class attributes.