Schema

Overview

Hub Schema:

  • Define the structure, shapes, dtypes of the final Dataset

  • Add additional meta information (image channels, class names, etc.)

  • Use special serialization/deserialization methods

Available Schemas

Primitive

Wrapper for NumPy primitive data types such as int32 and float64.

from hub.schema import Primitive

schema = { "scalar": Primitive(dtype="float32") }

Tensor

NumPy-array-like structure that can contain elements of any type (Primitive and non-Primitive). Hub Tensors can’t be visualized at app.activeloop.ai.

from hub.schema import Tensor

schema = {
    "tensor_1": Tensor((None, None), "int32", max_shape=(200, 200)),
    "tensor_2": Tensor((100, 400), "int64", chunks=(6, 50, 200))
}

Image

Array representation of an image of arbitrary shape and primitive data type.

The default encoding format is png; jpeg is also supported.

from hub.schema import Image

schema = {"image": Image(shape=(None, None),
                         dtype="int32",
                         max_shape=(100, 100))}

ClassLabel

Integer representation of feature labels. Can be constructed from the number of labels, a list of label names, or a text file with a single label name on each line.

from hub.schema import ClassLabel

schema = {
    "class_label_1": ClassLabel(num_classes=10),
    "class_label_2": ClassLabel(names=['class1', 'class2', 'class3', ...]),
    "class_label_3": ClassLabel(names_file='/path/to/file/with/names')
}

Mask

Array representation of a binary mask. The shape of the mask should have the format (height, width, 1).

from hub.schema import Mask

schema = {"mask": Mask(shape=(244, 244, 1))}

Segmentation

Segmentation array. Also constructs a ClassLabel feature connector to support segmentation classes.

The shape of the segmentation mask should have the format (height, width, 1).

from hub.schema import Segmentation

schema = {"segmentation": Segmentation(shape=(244, 244, 1), dtype='uint8', 
                                       names=['label_1', 'label_2', ...])}

BBox

Bounding box coordinates with shape (4, ).

from hub.schema import BBox

schema = {"bbox": BBox()}
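
The API reference for BBox describes the stored value as normalized coordinates in the order [ymin, xmin, ymax, xmax]. The sketch below shows one way pixel coordinates could be turned into such an array with NumPy. The helper normalize_bbox and its arguments are hypothetical and are not part of Hub; dividing by image height and width is an assumption about what "normalized" means here.

```python
import numpy as np

def normalize_bbox(ymin, xmin, ymax, xmax, height, width):
    """Hypothetical helper: scale pixel coordinates into the [0, 1] range.

    Returns an array of shape (4,) ordered as [ymin, xmin, ymax, xmax].
    """
    return np.array(
        [ymin / height, xmin / width, ymax / height, xmax / width],
        dtype=np.float64,
    )

# A box inside an image that is 200 pixels high and 400 pixels wide.
box = normalize_bbox(50, 20, 150, 220, height=200, width=400)
print(box.shape, box.dtype)
```

An array produced this way has the shape (4, ) that the BBox schema expects.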

Audio

Hub schema for audio files. A file can be in any format that ffmpeg understands. If the file_format parameter isn’t provided, Hub will attempt to infer it from the file extension. The sample_rate parameter can be supplied as additional metadata, which users can access through info.schema['audio'].sample_rate.

from hub.schema import Audio

schema = {'audio': Audio(shape=(300,))}

Video

Video format support. Accepts as input a 4-dimensional uint8 array representing a video. The video is stored as a sequence of encoded images; encoding_format can be any format supported by Image.

from hub.schema import Video

schema = {'video': Video(shape=(20, None, None, 3), max_shape=(20, 1200, 1200, 3))}

Text

Automatically converts a given string into its integer (int64) representation.

from hub.schema import Text

schema = {'text': Text(shape=(None, ), max_shape=(20, ))}
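
To illustrate what an integer representation of a string can look like, the sketch below maps each character to its Unicode code point and stores the result as an int64 array. This mapping is an assumption made for illustration only; the encoding Hub applies internally is not specified here, and encode_text and decode_text are hypothetical helpers.

```python
import numpy as np

def encode_text(text):
    """Hypothetical helper: one int64 value (Unicode code point) per character."""
    return np.array([ord(ch) for ch in text], dtype=np.int64)

def decode_text(codes):
    """Inverse of encode_text: rebuild the string from its code points."""
    return "".join(chr(int(code)) for code in codes)

encoded = encode_text("hub")
print(encoded.shape, encoded.dtype)
print(decode_text(encoded))
```

Under this assumption a string of n characters becomes an array of shape (n, ), which is why the example schema above declares a dynamic first axis with a max_shape.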

Sequence

Corresponds to a sequence of schema.HubSchema. At generation time, a list is given for each element of the sequence. The output of Dataset will batch all the elements of the sequence together. If the length of the sequence is static and known in advance, it should be specified in the constructor using the shape argument.

from hub.schema import Sequence, BBox

schema = {'sequence': Sequence(shape=(10, ), dtype=BBox)}

Arguments

If a schema has a dynamic shape, the max_shape argument should be provided; it gives the maximum possible number of elements along each axis of the feature.
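
The relationship between a declared shape, max_shape, and a concrete sample can be sketched as a simple check. The function fits_schema below is hypothetical and not part of Hub; it only illustrates the rule that static axes must match exactly while dynamic (None) axes must stay within max_shape.

```python
def fits_schema(sample_shape, shape, max_shape=None):
    """Hypothetical check of a concrete sample shape against a declared shape.

    shape may contain None for dynamic axes; those axes need a bound in max_shape.
    """
    if len(sample_shape) != len(shape):
        return False
    if max_shape is None:
        max_shape = shape  # fully static shapes are their own upper bound
    for actual, declared, limit in zip(sample_shape, shape, max_shape):
        if declared is not None and actual != declared:
            return False  # a static axis must match exactly
        if declared is None and (limit is None or actual > limit):
            return False  # a dynamic axis needs a bound and must respect it
    return True

print(fits_schema((120, 80), (None, None), max_shape=(200, 200)))
print(fits_schema((250, 80), (None, None), max_shape=(200, 200)))
```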

The chunks argument describes how to split tensor dimensions into chunks (files) so that they are stored efficiently. If it is not given, the split into chunks is detected automatically.
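
The parameter descriptions in the API section anticipate files of roughly 16MB, with the sample count as the first chunk dimension. The arithmetic behind a manual chunks value can be sketched as follows. The helper samples_per_chunk is hypothetical, the target is interpreted as 16 MiB, and the sketch ignores compression, so it is only a rough estimate and not Hub's automatic detection.

```python
import numpy as np

# Assumption: "~16MB" is read as 16 MiB of uncompressed data per chunk file.
TARGET_CHUNK_BYTES = 16 * 1024 * 1024

def samples_per_chunk(max_shape, dtype, target_bytes=TARGET_CHUNK_BYTES):
    """Hypothetical estimate of how many whole samples fit in one chunk."""
    sample_bytes = int(np.prod(max_shape)) * np.dtype(dtype).itemsize
    return max(1, target_bytes // sample_bytes)

# A 200 x 200 int32 tensor occupies 160,000 bytes per sample.
n_samples = samples_per_chunk((200, 200), "int32")
chunks = (n_samples, 200, 200)  # sample count first, then the tensor dimensions
print(chunks)
```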

API

class hub.schema.audio.Audio(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
__init__(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs the connector.

Parameters
  • file_format (str) – the audio file format. Can be any format ffmpeg understands. If None, will attempt to infer from the file extension.

  • shape (tuple) – shape of the data.

  • dtype (str) – The dtype of the data.

  • sample_rate (int) – additional metadata exposed to the user through info.schema['audio'].sample_rate. This value is used in neither encoding nor decoding.

Raises

ValueError – If the shape is invalid.

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.bbox.BBox(dtype='float64', chunks=None, compressor='lz4')
HubSchema for a normalized bounding box.

Output: Tensor of shape [4,] containing the normalized coordinates of the bounding box [ymin, xmin, ymax, xmax], stored with the configured dtype (float64 by default, per the signature above).

__init__(dtype='float64', chunks=None, compressor='lz4')

Construct the connector.

Parameters
  • dtype (str) – dtype of bbox coordinates. Default: 'float64'

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be ~16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.class_label.ClassLabel(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4')

HubSchema for integer class labels.

__init__(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4')
Constructs a ClassLabel HubSchema.
Returns integer representations of the given classes. Preserves the class names so that the integers can be converted back to strings if needed.
There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
* num_classes: create 0 to (num_classes-1) labels
* names: a list of label strings
* names_file: a file containing the list of labels.

Note: In python2, the strings are encoded as utf-8.

Usage:
>>> class_label_tensor = ClassLabel(num_classes=10)
>>> class_label_tensor = ClassLabel(names=['class1', 'class2', 'class3', ...])
>>> class_label_tensor = ClassLabel(names_file='/path/to/file/with/names')
Parameters
  • num_classes (int) – number of classes. All labels must be < num_classes.

  • names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – path to a file with names for the integer classes, one per line.

  • max_shape (Tuple[int]) – Maximum shape of the tensor if its shape is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be ~16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.

  • Note – only one of num_classes, names, or names_file may be provided.

Raises

ValueError – If more than one argument is provided.

__repr__()

Return repr(self).

__str__()

Return str(self).

int2str(int_value: int)

Conversion integer => class name string.

str2int(str_value: str)

Conversion class name string => integer.
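
The two conversions can be pictured as lookups into an ordered list of names and its inverse dictionary. The class below is a hypothetical stand-in written for illustration; it is not Hub's ClassLabel and omits names_file, chunks, and compressor. Using the stringified index as the name when only num_classes is given is an assumption.

```python
class LabelMap:
    """Hypothetical stand-in illustrating the int <-> name mapping."""

    def __init__(self, num_classes=None, names=None):
        # Exactly one way of defining the labels may be used.
        if (num_classes is None) == (names is None):
            raise ValueError("Provide exactly one of num_classes or names")
        if names is None:
            names = [str(i) for i in range(num_classes)]  # assumed default names
        self._names = list(names)  # keeps the order in which names were given
        self._index = {name: i for i, name in enumerate(self._names)}

    def int2str(self, int_value):
        """Conversion integer => class name string."""
        return self._names[int_value]

    def str2int(self, str_value):
        """Conversion class name string => integer."""
        return self._index[str_value]

labels = LabelMap(names=["cat", "dog", "bird"])
print(labels.str2int("dog"), labels.int2str(2))
```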

class hub.schema.image.Image(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
HubSchema for images.

Output: tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images

>>> image_tensor = Image(shape=(None, None, 1),
...                      encoding_format='png')
__init__(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Construct the connector.
Parameters
  • shape (tuple of ints or None) – The shape of decoded image: (height, width, channels) where height and width can be None. Defaults to (None, None, 3).

  • dtype (uint16 or uint8 (default)) – uint16 can be used only with png encoding_format

  • encoding_format ('jpeg' or 'png' (default)) – Format to serialize np.ndarray images on disk.

  • max_shape (Tuple[int]) – Maximum shape of the tensor if its shape is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be ~16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.

Returns

tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images

Raises

ValueError – If the shape, dtype or encoding formats are invalid.

__repr__()

Return repr(self).

__str__()

Return str(self).

_set_dtype(dtype)

Set the dtype.

class hub.schema.features.FlatTensor(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …])

Tensor metadata after applying flatten function

__init__(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …])

Initialize self. See help(type(self)) for accurate signature.

__weakref__

list of weak references to the object (if defined)

class hub.schema.features.HubSchema

Base class for all datatypes

__weakref__

list of weak references to the object (if defined)

_flatten() → Iterable[hub.schema.features.FlatTensor]

Flattens the dtype into a list of tensors that need to be stored separately

class hub.schema.features.Primitive(dtype, chunks=None, compressor='lz4')

Class for handling primitive datatypes. All NumPy primitive data types such as int32 and float64 should be wrapped in this class.

__init__(dtype, chunks=None, compressor='lz4')

Initialize self. See help(type(self)) for accurate signature.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens the dtype into a list of tensors that need to be stored separately

class hub.schema.features.SchemaDict(dict_)

Class for dict branching of a datatype. SchemaDict dtype contains str -> dtype associations. This way you can describe complex datatypes.

__init__(dict_)

Initialize self. See help(type(self)) for accurate signature.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens the dtype into a list of tensors that need to be stored separately

class hub.schema.features.Tensor(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Tensor type in schema. Has a NumPy-array-like structure that can contain elements of any type (Primitive and non-Primitive). Tensors can’t be visualized at app.activeloop.ai.

__init__(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Parameters
  • shape (Tuple[int]) – Shape of the tensor. It can contain None(s), meaning the shape is dynamic, i.e. it can change while the dataset is being edited.

  • dtype (SchemaConnector or str) – dtype of each element in Tensor. Can be Primitive and non-Primitive type

  • max_shape (Tuple[int]) – Maximum shape of the tensor if its shape is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be ~16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.

__repr__()

Return repr(self).

__str__()

Return str(self).

_flatten()

Flattens the dtype into a list of tensors that need to be stored separately

hub.schema.features.featurify(schema) → hub.schema.features.HubSchema

This function converts naked primitive datatypes and dicts into Primitives and SchemaDicts. That way every node in the dtype tree is a SchemaConnector type object.

hub.schema.features.flatten(dtype, root='')

Flattens nested dictionary and returns tuple (dtype, path)
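
The idea of flattening a nested schema into (dtype, path) pairs can be sketched with a small recursive generator. The function flatten_schema below is a hypothetical illustration and not Hub's implementation; the "/"-separated path format and the use of plain strings as leaf dtypes are assumptions.

```python
def flatten_schema(schema, root=""):
    """Hypothetical sketch: yield (dtype, path) for every leaf of a nested dict."""
    for key, value in schema.items():
        path = f"{root}/{key}"
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the path.
            yield from flatten_schema(value, path)
        else:
            yield value, path

nested = {"image": "uint8", "meta": {"label": "int64", "box": "float64"}}
flat = list(flatten_schema(nested))
for dtype, path in flat:
    print(path, dtype)
```

Each leaf ends up with its own path, which matches the notion of tensors that are stored separately.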

class hub.schema.mask.Mask(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for mask

Usage:
>>> mask_tensor = Mask(shape=(300, 300, 1))
__init__(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Mask HubSchema.

Parameters
  • shape (tuple of ints or None) – Shape in format (height, width, 1)

  • dtype (str) – Dtype of mask array. Default: uint8

  • max_shape (Tuple[int]) – Maximum shape of the tensor if its shape is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be ~16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.

__repr__()

Return repr(self).

__str__()

Return str(self).

_check_shape(shape)

Check if the provided shape matches mask characteristics.
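
A shape check of this kind can be sketched in a few lines. The function check_mask_shape below is hypothetical and only mirrors the documented (height, width, 1) format; how Hub's own _check_shape behaves, including how it treats None dimensions, is not shown here.

```python
def check_mask_shape(shape):
    """Hypothetical check that a mask shape has the form (height, width, 1)."""
    if len(shape) != 3 or shape[-1] != 1:
        raise ValueError(f"Mask shape must be (height, width, 1), got {shape}")
    return tuple(shape)

print(check_mask_shape((244, 244, 1)))
```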

class hub.schema.polygon.Polygon(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for polygon

Usage:
>>> polygon_tensor = Polygon(shape=(10, 2))
>>> polygon_tensor = Polygon(shape=(None, 2))
__init__(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Polygon HubSchema.

Parameters
  • shape (tuple of ints or None) – Shape in format (None, 2)

  • max_shape (Tuple[int]) – Maximum shape of the tensor if its shape is dynamic

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be ~16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.

Raises

ValueError – If the shape is invalid.

__repr__()

Return repr(self).

__str__()

Return str(self).

_check_shape(shape)

Check if the provided shape matches polygon characteristics.

class hub.schema.segmentation.Segmentation(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for segmentation

__init__(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Constructs a Segmentation HubSchema. Also constructs ClassLabel HubSchema for Segmentation classes.

Parameters
  • shape (tuple of ints or None) – Shape in format (height, width, 1)

  • dtype (str) – dtype of segmentation array: uint16 or uint8

  • num_classes (int) – Number of classes. All labels must be < num_classes.

  • names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.

  • names_file (str) – Path to a file with names for the integer classes, one per line.

  • max_shape (tuple[int]) – Maximum shape of the tensor if its shape is dynamic

  • chunks (tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be ~16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.

__repr__()

Return repr(self).

__str__()

Return str(self).

get_segmentation_classes()

Get classes of the segmentation mask

class hub.schema.sequence.Sequence(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4')

Sequence corresponds to a sequence of features.HubSchema. At generation time, a list is given for each element of the sequence. The output of Dataset will batch all the elements of the sequence together. If the length of the sequence is static and known in advance, it should be specified in the constructor using the shape argument.

Usage:
>>> sequence = Sequence(shape=(NB_FRAME,), dtype=Image())
__init__(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4')
Construct a sequence of Tensors.
Parameters
  • shape (Tuple[int] | int) – Single-element integer tuple giving the length of the sequence. If None, the length is dynamic.

  • dtype (str | HubSchema) – Datatype of each element in sequence

  • chunks (Tuple[int] | int) – Number of elements in a chunk. Works only for a top-level sequence. You can also include the number of samples in a single chunk.

__repr__()

Return repr(self).

__str__()

Return str(self).

class hub.schema.text.Text(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for text

__init__(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
Construct the connector.

Returns integer representation of given string.

Parameters
  • shape (tuple of ints or None) – The shape of the text

  • dtype (str) – the dtype for storage.

  • max_shape (Tuple[int]) – Maximum number of words in the text

  • chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be ~16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.

__repr__()

Return repr(self).

__str__()

Return str(self).

_set_dtype(dtype)

Set the dtype.

class hub.schema.video.Video(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

HubSchema for videos, encoding frames individually on disk.

The connector accepts as input a 4-dimensional uint8 array representing a video.

Returns

Tensor of type uint8 and shape [num_frames, height, width, channels], where channels must be 1 or 3.

__init__(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

Initializes the connector.

Parameters
  • shape (tuple of ints) – The shape of the video (num_frames, height, width, channels), where channels is 1 or 3.

  • encoding_format (str) – The video is stored as a sequence of encoded images. You can use any encoding format supported by Image.

  • dtype (uint16 or uint8 (default)) –

Raises

ValueError – If the shape, dtype or encoding formats are invalid.

__repr__()

Return repr(self).

__str__()

Return str(self).