Schema¶
Overview¶
Hub Schema:
Define the structure, shapes, and dtypes of the final Dataset
Add meta information (image channels, class names, etc.)
Use special serialization/deserialization methods
Available Schemas¶
Primitive¶
Wrapper to the numpy primitive data types like int32, float64, etc…
from hub.schema import Primitive
schema = { "scalar": Primitive(dtype="float32") }
Tensor¶
NumPy-array-like structure that can contain elements of any type (Primitive and non-Primitive). Hub Tensors can’t be visualized at app.activeloop.ai.
from hub.schema import Tensor
schema = {
"tensor_1": Tensor((None, None), "int32", max_shape=(200, 200)),
"tensor_2": Tensor((100, 400), "int64", chunks=(6, 50, 200))
}
Image¶
Array representation of an image of arbitrary shape and primitive data type.
The default encoding format is png (jpeg is also supported).
from hub.schema import Image
schema = {"image": Image(shape=(None, None, 3),
dtype="uint8",
max_shape=(100, 100, 3))}
ClassLabel¶
Integer representation of feature labels. Can be constructed from number of labels, label names or a text file with a single label name in each line.
from hub.schema import ClassLabel
schema = {
"class_label_1": ClassLabel(num_classes=10),
"class_label_2": ClassLabel(names=['class1', 'class2', 'class3', ...]),
"class_label_3": ClassLabel(names_file='/path/to/file/with/names')
}
Mask¶
Array representation of binary mask. The shape of mask should have format: (height, width, 1).
from hub.schema import Mask
schema = {"mask": Mask(shape=(244, 244, 1))}
Segmentation¶
Segmentation array. Also constructs ClassLabel feature connector to support segmentation classes.
The shape of segmentation mask should have format: (height, width, 1).
from hub.schema import Segmentation
schema = {"segmentation": Segmentation(shape=(244, 244, 1), dtype='uint8',
names=['label_1', 'label_2', ...])}
BBox¶
Bounding box coordinates with shape (4, ).
from hub.schema import BBox
schema = {"bbox": BBox()}
Audio¶
Hub schema for audio files. A file can have any format ffmpeg understands. If the file_format parameter isn’t provided, Hub will attempt to infer it from the file extension. The sample_rate parameter can also be added as additional metadata; users can access it through info.schema['audio'].sample_rate.
from hub.schema import Audio
schema = {'audio': Audio(shape=(300,))}
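The extension-based inference described above can be pictured with a short sketch. The helper `infer_file_format` is hypothetical and not part of Hub; it only assumes that the format name is the lowercased file extension.

```python
import os
from typing import Optional


def infer_file_format(path: str, file_format: Optional[str] = None) -> str:
    """Return the explicit format if given, else the lowercased file extension.

    Hypothetical helper for illustration; not a Hub function.
    """
    if file_format is not None:
        return file_format
    ext = os.path.splitext(path)[1]  # e.g. ".WAV", or "" if there is no extension
    if not ext:
        raise ValueError(f"cannot infer audio format from {path!r}")
    return ext[1:].lower()  # drop the leading dot
```

An explicit `file_format` always wins in this sketch; the extension is consulted only when it is missing.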
Video¶
Video format support.
Accepts as input a 4-dimensional uint8 array representing a video.
The video is stored as a sequence of encoded images. encoding_format can be any format supported by Image.
from hub.schema import Video
schema = {'video': Video(shape=(20, None, None, 3), max_shape=(20, 1200, 1200, 3))}
Text¶
Automatically converts a given string into its integer (int64) representation.
from hub.schema import Text
schema = {'text': Text(shape=(None, ), max_shape=(20, ))}
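As a rough illustration of what an integer representation of a string can look like, the sketch below maps each character to its Unicode code point and back. Whether Hub uses exactly this mapping is an assumption here; `encode_text` and `decode_text` are hypothetical helpers, not Hub functions.

```python
from typing import List


def encode_text(text: str, max_length: int) -> List[int]:
    """Map each character to its Unicode code point (an assumed encoding)."""
    if len(text) > max_length:
        raise ValueError(f"text has {len(text)} characters, max is {max_length}")
    return [ord(ch) for ch in text]


def decode_text(codes: List[int]) -> str:
    """Invert encode_text by turning code points back into characters."""
    return "".join(chr(c) for c in codes)


# max_length=20 mirrors max_shape=(20, ) in the schema example.
codes = encode_text("hub", max_length=20)
```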
Sequence¶
Corresponds to a sequence of schema.HubSchema.
At generation time, a list for each of the sequence elements is given. The output of Dataset will batch all the elements of the sequence together.
If the length of the sequence is static and known in advance, it should be specified in the constructor using the shape param, for example shape=(10, ).
from hub.schema import Sequence, BBox
schema = {'sequence': Sequence(shape=(10, ), dtype=BBox)}
Arguments¶
If a schema has a dynamic shape, the max_shape argument should be provided, representing the maximum possible number of elements in each axis of the feature.
The chunks argument describes how to split tensor dimensions into chunks (files) to store them efficiently. If it is not given, the split into chunks is detected automatically.
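To make the chunking idea concrete, here is a minimal sketch that counts how many chunks (files) a tensor splits into and how large one full chunk is in bytes. The helpers `chunk_grid` and `chunk_nbytes` are hypothetical and only do arithmetic; they do not reproduce Hub's automatic chunk detection, and the sample count of 1000 is an assumed value.

```python
import math
from typing import Tuple


def chunk_grid(shape: Tuple[int, ...], chunks: Tuple[int, ...]) -> Tuple[int, ...]:
    """Number of chunks along each axis (the last chunk on an axis may be partial)."""
    if len(shape) != len(chunks):
        raise ValueError("shape and chunks must have the same number of axes")
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunk_nbytes(chunks: Tuple[int, ...], itemsize: int) -> int:
    """Uncompressed size in bytes of one full chunk."""
    return math.prod(chunks) * itemsize


# Assumed: 1000 samples of a (100, 400) int64 tensor, chunked as (6, 50, 200).
# The first axis of both tuples is the sample count.
grid = chunk_grid((1000, 100, 400), (6, 50, 200))
size = chunk_nbytes((6, 50, 200), itemsize=8)  # int64 takes 8 bytes per element
```

Multiplying the entries of `grid` gives the total number of chunk files for that tensor.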
API¶
-
class
hub.schema.audio.
Audio
(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ -
__init__
(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ Constructs the connector.
- Parameters
file_format (str) – the audio file format. Can be any format ffmpeg understands. If None, will attempt to infer from the file extension.
shape (tuple) – shape of the data.
dtype (str) – The dtype of the data.
sample_rate (int) – additional metadata exposed to the user through info.schema['audio'].sample_rate. This value is used in neither encoding nor decoding.
- Raises
ValueError – If the shape is invalid.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
-
class
hub.schema.bbox.
BBox
(dtype='float64', chunks=None, compressor='lz4')¶ HubSchema for a normalized bounding box.
Output: Tensor of shape [4,] which contains the normalized coordinates of the bounding box [ymin, xmin, ymax, xmax]. The element type is given by dtype (float64 by default, per the signature).
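The sketch below shows one way to produce coordinates in the [ymin, xmin, ymax, xmax] order from a pixel-space box. It assumes that "normalized" means dividing by the image height and width so that values fall in [0, 1]; the helper `normalize_bbox` and its pixel-space argument order are illustrative and not part of Hub.

```python
from typing import List


def normalize_bbox(xmin: float, ymin: float, xmax: float, ymax: float,
                   width: int, height: int) -> List[float]:
    """Convert a pixel-space box to normalized [ymin, xmin, ymax, xmax].

    Hypothetical helper; assumes normalization by image height and width.
    """
    if not (0 <= xmin <= xmax <= width and 0 <= ymin <= ymax <= height):
        raise ValueError("box must lie inside the image")
    return [ymin / height, xmin / width, ymax / height, xmax / width]


box = normalize_bbox(xmin=50, ymin=20, xmax=150, ymax=120, width=200, height=400)
```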
-
__init__
(dtype='float64', chunks=None, compressor='lz4')¶ Construct the connector.
- Parameters
dtype (str) – dtype of bbox coordinates. Default: 'float64'
chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be about 16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
-
class
hub.schema.class_label.
ClassLabel
(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4')¶ HubSchema for integer class labels.
-
__init__
(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4')¶ Constructs a ClassLabel HubSchema. Returns an integer representation of the given classes. Preserves the names of classes to convert those back to strings if needed.
There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
* num_classes: create 0 to (num_classes-1) labels
* names: a list of label strings
* names_file: a file containing the list of labels.
Note: In python2, the strings are encoded as utf-8.
Usage:
>>> class_label_tensor = ClassLabel(num_classes=10)
>>> class_label_tensor = ClassLabel(names=['class1', 'class2', 'class3', ...])
>>> class_label_tensor = ClassLabel(names_file='/path/to/file/with/names')
- Parameters
num_classes (int) – number of classes. All labels must be < num_classes.
names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.
names_file (str) – path to a file with names for the integer classes, one per line.
max_shape (Tuple[int]) – Maximum shape of the tensor if it is dynamic
chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be about 16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.
Note – Provide only one of num_classes, names, or names_file.
- Raises
ValueError – If more than one argument is provided.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
int2str
(int_value: int)¶ Conversion integer => class name string.
-
str2int
(str_value: str)¶ Conversion class name string => integer.
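The two conversions amount to a lookup in both directions. The following is a minimal stand-in, not Hub's implementation: `LabelMap` is a hypothetical class, and naming the classes "0" to "num_classes-1" when only a count is given is an assumption.

```python
from typing import List, Optional


class LabelMap:
    """Hypothetical stand-in for the int <-> name mapping of a class label."""

    def __init__(self, num_classes: Optional[int] = None,
                 names: Optional[List[str]] = None):
        if (num_classes is None) == (names is None):
            raise ValueError("provide exactly one of num_classes or names")
        if names is not None:
            self.names = list(names)  # the given order defines the integer ids
        else:
            # Assumption: with only a count, use the stringified index as the name.
            self.names = [str(i) for i in range(num_classes)]
        self._index = {name: i for i, name in enumerate(self.names)}

    def int2str(self, int_value: int) -> str:
        """Conversion integer => class name string."""
        if not 0 <= int_value < len(self.names):
            raise ValueError(f"label {int_value} is out of range")
        return self.names[int_value]

    def str2int(self, str_value: str) -> int:
        """Conversion class name string => integer."""
        return self._index[str_value]


labels = LabelMap(names=["class1", "class2", "class3"])
```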
-
-
class
hub.schema.image.
Image
(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ HubSchema for images.
Output: tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images
>>> image_tensor = Image(shape=(None, None, 1),
...                      encoding_format='png')
-
__init__
(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ Construct the connector.
- Parameters
shape (tuple of ints or None) – The shape of decoded image: (height, width, channels) where height and width can be None. Defaults to (None, None, 3).
dtype (uint16 or uint8 (default)) – uint16 can be used only with png encoding_format
encoding_format ('jpeg' or 'png' (default)) – Format to serialize np.ndarray images on disk.
max_shape (Tuple[int]) – Maximum shape of the tensor if it is dynamic
chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be about 16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.
- Returns
tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images
- Raises
ValueError – If the shape, dtype or encoding formats are invalid.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
_set_dtype
(dtype)¶ Set the dtype.
-
-
class
hub.schema.features.
FlatTensor
(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …])¶ Tensor metadata after applying flatten function
-
__init__
(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …])¶ Initialize self. See help(type(self)) for accurate signature.
-
__weakref__
¶ list of weak references to the object (if defined)
-
-
class
hub.schema.features.
HubSchema
¶ Base class for all datatypes
-
__weakref__
¶ list of weak references to the object (if defined)
-
_flatten
() → Iterable[hub.schema.features.FlatTensor]¶ Flattens dtype into a list of tensors that will need to be stored separately
-
-
class
hub.schema.features.
Primitive
(dtype, chunks=None, compressor='lz4')¶ Class for handling primitive datatypes. All numpy primitive data types like int32, float64, etc… should be wrapped around this class.
-
__init__
(dtype, chunks=None, compressor='lz4')¶ Initialize self. See help(type(self)) for accurate signature.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
_flatten
()¶ Flattens dtype into a list of tensors that will need to be stored separately
-
-
class
hub.schema.features.
SchemaDict
(dict_)¶ Class for dict branching of a datatype. SchemaDict dtype contains str -> dtype associations. This way you can describe complex datatypes.
-
__init__
(dict_)¶ Initialize self. See help(type(self)) for accurate signature.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
_flatten
()¶ Flattens dtype into a list of tensors that will need to be stored separately
-
-
class
hub.schema.features.
Tensor
(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ Tensor type in schema. Has a NumPy-array-like structure that can contain elements of any type (Primitive and non-Primitive). Tensors can’t be visualized at app.activeloop.ai.
-
__init__
(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ - Parameters
shape (Tuple[int]) – Shape of the tensor. It can contain None(s), meaning the shape is dynamic. A dynamic shape can change while the dataset is being edited.
dtype (SchemaConnector or str) – dtype of each element in the Tensor. Can be a Primitive or non-Primitive type
max_shape (Tuple[int]) – Maximum shape of the tensor if it is dynamic
chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be about 16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.
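A small sketch of how a dynamic shape and its max_shape relate: every None axis may vary up to the bound in max_shape, while a fixed axis is assumed to require an exact match. The function `fits_schema` is hypothetical and only mirrors the rule described in the parameters above; Hub's own checks may differ.

```python
from typing import Optional, Tuple


def fits_schema(sample_shape: Tuple[int, ...],
                shape: Tuple[Optional[int], ...],
                max_shape: Tuple[int, ...]) -> bool:
    """Check a concrete sample shape against a (possibly dynamic) schema shape."""
    if not (len(sample_shape) == len(shape) == len(max_shape)):
        return False
    for actual, declared, limit in zip(sample_shape, shape, max_shape):
        if declared is None:
            if actual > limit:      # dynamic axis: bounded by max_shape
                return False
        elif actual != declared:    # static axis: assumed to match exactly
            return False
    return True
```

For a schema such as Tensor((None, None), "int32", max_shape=(200, 200)), a (150, 200) sample passes this check and a (201, 10) sample does not.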
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
_flatten
()¶ Flattens dtype into a list of tensors that will need to be stored separately
-
-
hub.schema.features.
featurify
(schema) → hub.schema.features.HubSchema¶ This function converts naked primitive datatypes and dicts into Primitives and SchemaDicts. That way every node in the dtype tree is a SchemaConnector type object.
-
hub.schema.features.
flatten
(dtype, root='')¶ Flattens nested dictionary and returns tuple (dtype, path)
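The flattening step can be sketched with plain dictionaries. This is an illustration under assumptions, not Hub's code: `flatten_schema` is a hypothetical name, leaves are plain dtype strings rather than HubSchema objects, and joining keys with "/" is an assumed path format.

```python
from typing import Any, Iterator, Tuple


def flatten_schema(dtype: Any, root: str = "") -> Iterator[Tuple[Any, str]]:
    """Yield (dtype, path) for every leaf of a nested schema dictionary."""
    if isinstance(dtype, dict):
        for key, value in dtype.items():
            # Recurse into the branch, extending the path with this key.
            yield from flatten_schema(value, f"{root}/{key}")
    else:
        # A leaf: report it together with the path that led to it.
        yield dtype, root


schema = {"image": "uint8", "meta": {"label": "int32", "score": "float32"}}
flat = list(flatten_schema(schema))
```

Each resulting pair names one tensor that would be stored on its own, which is the idea behind the _flatten methods listed above.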
-
class
hub.schema.mask.
Mask
(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ HubSchema for mask
Usage:
>>> mask_tensor = Mask(shape=(300, 300, 1))
-
__init__
(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ Constructs a Mask HubSchema.
- Parameters
shape (tuple of ints or None) – Shape in format (height, width, 1)
dtype (str) – Dtype of mask array. Default: uint8
max_shape (Tuple[int]) – Maximum shape of the tensor if it is dynamic
chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be about 16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
_check_shape
(shape)¶ Check if the provided shape matches mask characteristics.
-
-
class
hub.schema.polygon.
Polygon
(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ HubSchema for polygon
Usage:
>>> polygon_tensor = Polygon(shape=(10, 2))
>>> polygon_tensor = Polygon(shape=(None, 2))
-
__init__
(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ Constructs a Polygon HubSchema.
- Parameters
shape (tuple of ints or None) – Shape in format (None, 2)
max_shape (Tuple[int]) – Maximum shape of the tensor if it is dynamic
chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be about 16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.
- Raises
ValueError – If the shape is invalid.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
_check_shape
(shape)¶ Check if the provided shape matches polygon characteristics.
-
-
class
hub.schema.segmentation.
Segmentation
(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ HubSchema for segmentation
-
__init__
(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ Constructs a Segmentation HubSchema. Also constructs ClassLabel HubSchema for Segmentation classes.
- Parameters
shape (tuple of ints or None) – Shape in format (height, width, 1)
dtype (str) – dtype of segmentation array: uint16 or uint8
num_classes (int) – Number of classes. All labels must be < num_classes.
names (list<str>) – string names for the integer classes. The order in which the names are provided is kept.
names_file (str) – Path to a file with names for the integer classes, one per line.
max_shape (tuple[int]) – Maximum shape of the tensor if it is dynamic
chunks (tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be about 16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
get_segmentation_classes
()¶ Get classes of the segmentation mask
-
-
class
hub.schema.sequence.
Sequence
(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4')¶ Sequence corresponds to a sequence of features.HubSchema. At generation time, a list for each of the sequence elements is given. The output of Dataset will batch all the elements of the sequence together. If the length of the sequence is static and known in advance, it should be specified in the constructor using the shape param.
Usage:
>>> sequence = Sequence(shape=(NB_FRAME, ), dtype=Image())
-
__init__
(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4')¶ - Construct a sequence of Tensors.
- Parameters
shape (Tuple[int] | int) – Single-integer-element tuple representing the length of the sequence. If None, the length is dynamic.
dtype (str | HubSchema) – Datatype of each element in the sequence
chunks (Tuple[int] | int) – Number of elements in a chunk. Works only for a top-level sequence. You can also include the number of samples in a single chunk.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
-
class
hub.schema.text.
Text
(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ HubSchema for text
-
__init__
(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ - Construct the connector.
Returns integer representation of given string.
- Parameters
shape (tuple of ints or None) – The shape of the text
dtype (str) – the dtype for storage.
max_shape (Tuple[int]) – Maximum number of words in the text
chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. Each file is expected to be about 16MB. The sample count is also included in the tensor’s dimensions (as the first dimension). If the default value is chosen, the split into chunks is detected automatically.
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).
-
_set_dtype
(dtype)¶ Set the dtype.
-
-
class
hub.schema.video.
Video
(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ HubSchema for videos, encoding frames individually on disk.
The connector accepts as input a 4 dimensional uint8 array representing a video.
- Returns
Tensor of type uint8 and shape [num_frames, height, width, channels], where channels must be 1 or 3
-
__init__
(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')¶ Initializes the connector.
- Parameters
shape (tuple of ints) – The shape of the video (num_frames, height, width, channels), where channels is 1 or 3.
encoding_format (str) – The video is stored as a sequence of encoded images. You can use any encoding format supported by Image.
dtype (uint16 or uint8 (default)) –
- Raises
ValueError – If the shape, dtype or encoding formats are invalid.
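The shape constraints stated for videos can be expressed as a short check. `check_video_shape` and `is_valid_video_shape` are hypothetical helpers written for illustration; Hub's own validation may differ in detail.

```python
from typing import Optional, Tuple


def check_video_shape(shape: Tuple[Optional[int], ...]) -> None:
    """Raise ValueError unless shape is (num_frames, height, width, channels)
    with channels equal to 1 or 3."""
    if len(shape) != 4:
        raise ValueError(f"video shape must have 4 dimensions, got {len(shape)}")
    channels = shape[-1]
    if channels not in (1, 3):
        raise ValueError(f"channels must be 1 or 3, got {channels}")


def is_valid_video_shape(shape: Tuple[Optional[int], ...]) -> bool:
    """Boolean wrapper around check_video_shape."""
    try:
        check_video_shape(shape)
    except ValueError:
        return False
    return True
```

Height and width may be None in this sketch, matching the dynamic shape (20, None, None, 3) used in the Video schema example.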
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Return str(self).