API Reference

Datasets

Dataset
class hub.Dataset(url: str, mode: str = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: str = None)

- __getitem__(slice_) – Gets a slice or slices from the dataset.
  Usage:
  >>> return ds["image", 5, 0:1920, 0:1080, 0:3].compute()  # returns numpy array
  >>> images = ds["image"]
  >>> return images[5].compute()  # returns numpy array
  >>> images = ds["image"]
  >>> image = images[5]
  >>> return image[0:1920, 0:1080, 0:3].compute()  # returns numpy array
- __init__(url: str, mode: str = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: str = None) – Opens a new or existing dataset for read/write.
  Parameters:
  - url (str) – The url where the dataset is located or should be created.
  - mode (str, optional (defaults to "a")) – Python way of telling whether the dataset is opened for reading or writing (e.g. "r", "w", "a").
  - shape (tuple, optional) – Tuple in (num_samples,) format, where num_samples is the number of samples.
  - schema (optional) – Describes the data of a single sample; Hub schemas are used for this. Required for "a" and "w" modes.
  - token (str or dict, optional) – If url refers to a place where authorization is required, token is the parameter for passing credentials; it can be a filepath or a dict.
  - fs (optional) –
  - fs_map (optional) –
  - meta_information (dict, optional) – Information about the dataset, given as a dictionary.
  - cache (int, optional) – Size of the memory cache. Default is 64MB (2**26). If 0, False, or None, the cache is not used.
  - storage_cache (int, optional) – Size of the storage cache. Default is 256MB (2**28). If 0, False, or None, the storage cache is not used.
  - lock_cache (bool, optional) – Lock the cache to avoid multiprocessing errors.
  - lazy (bool, optional) – Setting this to False stops lazy computation and allows items to be accessed without calling .compute().
  - public (bool, optional) – Only applicable when using hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible to the public in the visualizer.
  - name (str, optional) – Only applicable when using hub storage; this is the name that shows up in the visualizer.
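  A minimal creation sketch (the local path, schema, and shapes below are illustrative assumptions, not required values):
  >>> import numpy as np
  >>> import hub
  >>> from hub.schema import Image, ClassLabel
  >>> my_schema = {
  ...     "image": Image(shape=(28, 28, 3)),
  ...     "label": ClassLabel(num_classes=10),
  ... }
  >>> ds = hub.Dataset("./data/example", shape=(4,), mode="w", schema=my_schema)
  >>> ds["image", 0] = np.ones((28, 28, 3), "uint8")  # write one sample
  >>> ds["label", 0] = 3
  >>> ds.flush()  # persist cached changes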
- __iter__() – Returns an iterable over samples.
- __len__() – Returns the number of samples in the dataset.
- __repr__() – Return repr(self).
- __setitem__(slice_, value) – Sets a slice or slices with a value.
  Usage:
  >>> ds["image", 5, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
  >>> images = ds["image"]
  >>> image = images[5]
  >>> image[0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
- __str__() – Return str(self).
- __weakref__ – List of weak references to the object (if defined).
- _check_and_prepare_dir() – Checks if the input data is ok. Creates or overwrites the dataset folder. Returns True if the dataset needs to be created, as opposed to read.
- _get_dictionary(subpath, slice_=None) – Gets a dictionary from the dataset given an incomplete subpath.
- append_shape(size: int) – Appends to the shape of the dataset (a heavy operation).
- close() – Saves changes from cache to the dataset's final storage. This invalidates this object.
- commit() – Deprecated alias of flush().
- delete() – Deletes the dataset.
- filter(dic) – Applies a filter and returns a new DatasetView that matches the dictionary provided.
  Parameters:
  - dic (dict) – A dictionary of key-value pairs used to filter the dataset. For nested schemas, use the flattened dictionary representation, i.e. instead of {"abc": {"xyz": 5}} use {"abc/xyz": 5}.
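  For example, reusing the flattened key from above:
  >>> filtered_view = ds.filter({"abc/xyz": 5})  # keeps samples where ds["abc/xyz"] == 5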
- flush() – Saves changes from cache to the dataset's final storage. Does not invalidate this object.
- static from_pytorch(dataset, scheduler: str = 'single', workers: int = 1) – Converts a PyTorch dataset object into hub format.
  Parameters:
  - dataset – The PyTorch dataset object that needs to be converted into hub format.
  - scheduler (str) – Choice between "single", "threaded", and "processed".
  - workers (int) – How many threads or processes to use.
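  A usage sketch, mirroring the from_tensorflow examples below (torch_dataset stands for any torch.utils.data.Dataset yielding dict-like samples):
  >>> out_ds = hub.Dataset.from_pytorch(torch_dataset)
  >>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset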
- static from_tensorflow(ds, scheduler: str = 'single', workers: int = 1) – Converts a TensorFlow dataset into hub format.
  Parameters:
  - ds – The TensorFlow dataset object that needs to be converted into hub format.
  - scheduler (str) – Choice between "single", "threaded", and "processed".
  - workers (int) – How many threads or processes to use.
  Examples:
  >>> ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
  >>> out_ds = hub.Dataset.from_tensorflow(ds)
  >>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset
  >>> ds = tf.data.Dataset.from_tensor_slices({'a': [1, 2], 'b': [5, 6]})
  >>> out_ds = hub.Dataset.from_tensorflow(ds)
  >>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset
  >>> ds = hub.Dataset(schema=my_schema, shape=(1000,), url="username/dataset_name", mode="w")
  >>> ds = ds.to_tensorflow()
  >>> out_ds = hub.Dataset.from_tensorflow(ds)
  >>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset
- static from_tfds(dataset, split=None, num: int = -1, sampling_amount: int = 1, scheduler: str = 'single', workers: int = 1) – Converts a TFDS dataset into hub format.
  Parameters:
  - dataset (str) – The name of the tfds dataset that needs to be converted into hub format.
  - split (str, optional) – A string representing the splits of the dataset that are required, such as "train" or "test+train". If not present, all the splits of the dataset are used.
  - num (int, optional) – The number of samples required. If not present, all the samples are taken. If num is -1, or if num is greater than the size of this dataset, the new dataset will contain all elements of this dataset.
  - sampling_amount (float, optional) – A value from 0 to 1 that specifies how much of the dataset is sampled to determine feature shapes. A value of 0 means no sampling, and 1 implies that the entire dataset is sampled.
  - scheduler (str) – Choice between "single", "threaded", and "processed".
  - workers (int) – How many threads or processes to use.
  Examples:
  >>> out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
  >>> res_ds = out_ds.store("username/mnist")  # res_ds is now a usable hub dataset
- property keys – Get the keys of the dataset.
- rename(name: str) → None – Renames the dataset.
- resize_shape(size: int) → None – Resizes the shape of the dataset by resizing the first dimension of each tensor.
- to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>, indexes=None) – Converts the dataset into a PyTorch-compatible format.
  Parameters:
  - transform (callable, optional) – A function that transforms data in a dict format.
  - inplace (bool, optional) – Defines whether data should be converted to torch.Tensor before or after transforms are applied (depends on what data type you need for the transforms). Default is True.
  - output_type (one of list, tuple, dict, optional) – Defines the output type. Default is dict, the same as in the original Hub dataset.
  - indexes (int or list, optional) – The indexes of the samples to be converted; can be either a list or an integer.
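  A minimal consumption sketch (assuming the converted dataset can be fed to a standard DataLoader):
  >>> import torch
  >>> torch_ds = ds.to_pytorch()
  >>> loader = torch.utils.data.DataLoader(torch_ds, batch_size=8)
  >>> batch = next(iter(loader))  # a dict of torch.Tensors by default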
- to_tensorflow(indexes=None) – Converts the dataset into a TensorFlow-compatible format.
  Parameters:
  - indexes (int or list, optional) – The indexes of the samples to be converted; can be either a list or an integer.
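  A minimal consumption sketch (assuming the result behaves like a tf.data.Dataset, as the from_tensorflow examples above suggest):
  >>> tf_ds = ds.to_tensorflow()
  >>> for sample in tf_ds.batch(8).take(1):
  ...     pass  # sample is a dict of tf.Tensors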
DatasetView

class hub.api.datasetview.DatasetView(dataset=None, lazy: bool = True, indexes=None)

- __getitem__(slice_) – Gets a slice or slices from the DatasetView.
  Usage:
  >>> ds_view = ds[5:15]
  >>> return ds_view["image", 7, 0:1920, 0:1080, 0:3].compute()  # returns numpy array of the image at index 12
- __init__(dataset=None, lazy: bool = True, indexes=None) – Creates a DatasetView object for a subset of the Dataset.
  Parameters:
  - dataset (hub.api.dataset.Dataset object) – The dataset whose DatasetView is being created.
  - lazy (bool, optional) – Setting this to False stops lazy computation and allows items to be accessed without calling .compute().
  - indexes (optional) – Can be either a list or an integer, depending on the slicing. The indexes that the DatasetView represents.
- __iter__() – Returns an iterable over samples.
- __repr__() – Return repr(self).
- __setitem__(slice_, value) – Sets a slice or slices with a value.
  Usage:
  >>> ds_view = ds[5:15]
  >>> ds_view["image", 3, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")  # sets the image at index 8
- __str__() – Return str(self).
- __weakref__ – List of weak references to the object (if defined).
- _get_dictionary(subpath, slice_) – Gets a dictionary from the dataset given an incomplete subpath.
- commit() → None – Commits the dataset.
- filter(dic) – Applies a filter and returns a new DatasetView that matches the dictionary provided.
  Parameters:
  - dic (dict) – A dictionary of key-value pairs used to filter the dataset. For nested schemas, use the flattened dictionary representation, i.e. instead of {"abc": {"xyz": 5}} use {"abc/xyz": 5}.
- property keys – Get the keys of the dataset.
- resize_shape(size: int) → None – Resizes the shape of the underlying dataset, not of the DatasetView.
- to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>) – Converts the dataset into a PyTorch-compatible format.
- to_tensorflow() – Converts the dataset into a TensorFlow-compatible format.
Sharded Dataset

class hub.api.sharded_datasetview.ShardedDatasetView(datasets: list)

- __init__(datasets: list) → None – Creates a sharded simple dataset. The datasets should have the same schema.
  Parameters:
  - datasets (list of Datasets) –
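  A construction sketch (paths and schema are illustrative assumptions):
  >>> from hub.api.sharded_datasetview import ShardedDatasetView
  >>> ds1 = hub.Dataset("./data/shard1", shape=(100,), mode="w", schema=my_schema)
  >>> ds2 = hub.Dataset("./data/shard2", shape=(100,), mode="w", schema=my_schema)
  >>> sharded = ShardedDatasetView([ds1, ds2])
  >>> for sample in sharded:
  ...     pass  # iterates over both shards in order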
- __iter__() – Returns an iterable over samples.
- __repr__() – Return repr(self).
- __weakref__ – List of weak references to the object (if defined).
- identify_shard(index) → tuple – Computes the shard id; returns the shard index and offset.
- slicing(slice_) – Identifies the dataset shard that should be used.
  Notes: Advanced slicing features that one would expect from a DatasetView are missing, e.g. cross-shard dataset access.
Pipelines

Transform

hub.compute.transform(schema, scheduler='single', workers=1) – Transform is a decorator for a function. The function should output a dictionary per sample.
  Parameters:
  - schema (Schema) – The output format of the transformed dataset.
  - scheduler (str) – "single" for single-threaded, "threaded" for multiple threads, "processed" for multiple processes, "ray" for the Ray scheduler, "dask" for the Dask scheduler.
  - workers (int) – How many workers will be started for the process.
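  An end-to-end sketch (the schema, function, and paths are illustrative; input_ds stands for any iterable of samples):
  >>> import hub
  >>> from hub.schema import Tensor
  >>> my_schema = {"image": Tensor((28, 28), "uint8")}
  >>> @hub.compute.transform(schema=my_schema)
  ... def double(sample):
  ...     return {"image": sample["image"] * 2}
  >>> out_ds = double(input_ds)
  >>> res_ds = out_ds.store("./data/transformed")  # res_ds is now a usable hub dataset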
class hub.compute.transform.Transform(func, schema, ds, scheduler: str = 'single', workers: int = 1, **kwargs)

- __getitem__(slice_) – Gets an item to be computed without iterating over the whole dataset. Creates a dataset view, then a temporary dataset to apply the transform.
  Parameters:
  - slice_ (slice) – Gets a slice or slices from the dataset.
- __init__(func, schema, ds, scheduler: str = 'single', workers: int = 1, **kwargs) – Transform applies a user-defined function to each sample, in a single-threaded manner by default.
  Parameters:
  - func (function) – User-defined function func(x, **kwargs).
  - schema (dict of dtypes) – The structure of the final dataset that will be created.
  - ds (Iterable) – Input dataset, or a list that can be iterated.
  - scheduler (str) – Choice between "single", "threaded", and "processed".
  - workers (int) – How many threads or processes to use.
  - **kwargs – Additional arguments that will be passed to func as static arguments for all samples.
- __weakref__ – List of weak references to the object (if defined).
- classmethod _flatten_dict(d: Dict, parent_key='', schema=None) – Helper function that flattens a dictionary of a recursive tensor.
  Parameters:
  - d (dict) –
- _pbar(show: bool = True) – Returns a progress bar; if show is False, it does nothing.
- _split_list_to_dicts(xs) – Helper function that transforms a list of dicts into a dict of lists.
  Parameters:
  - xs (list of dicts) –
  Returns: xs_new
  Return type: dict of lists
- classmethod _unwrap(results) – If there is any list, unwraps it into its elements.
- call_func(fn_index, item, as_list=False) – Calls all the functions one after the other.
  Parameters:
  - fn_index (int) – The index starting from which the functions need to be called.
  - item – The item on which the functions need to be applied.
  - as_list (bool, optional) – If True, treats the item as a list.
  Returns: result – The final output obtained after all transforms.
- create_dataset(url: str, length: int = None, token: dict = None, public: bool = True) – Helper function to create a dataset.
- classmethod dtype_from_path(path, schema) – Helper function to get the dtype from the path.
- store(url: str, token: dict = None, length: int = None, ds: Iterable = None, progressbar: bool = True, sample_per_shard: int = None, public: bool = True) – Applies the transformation to each element in a batchified manner.
  Parameters:
  - url (str) – Path where the data is going to be stored.
  - token (str or dict, optional) – If url refers to a place where authorization is required, token is the parameter for passing credentials; it can be a filepath or a dict.
  - length (int) – In case shape is None, the user can provide the length.
  - ds (Iterable) –
  - progressbar (bool) – Show a progress bar.
  - sample_per_shard (int) – How many samples to process per shard, so as not to overfill RAM.
  - public (bool, optional) – Only applicable when using hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible to the public in the visualizer.
  Returns: ds – The uploaded dataset.
  Return type: hub.Dataset
- store_shard(ds_in: Iterable, ds_out: hub.api.dataset.Dataset, offset: int, token=None) – Takes a shard of the iterable ds_in, computes it, and stores it in the DatasetView ds_out.
- upload(results, ds: hub.api.dataset.Dataset, token: dict, progressbar: bool = True) – Batchified upload of results. For each tensor, batchifies based on its chunks and uploads. If a tensor is dynamic, it is still uploaded element by element; for dynamic tensors, dynamicness is disabled and then re-enabled.
  Parameters:
  - ds (hub.Dataset) – Dataset object that should be written to.
  - results – Output of the transform function.
  - progressbar (bool) –
  Returns: ds – The uploaded dataset.
  Return type: hub.Dataset
RayTransform

class hub.compute.ray.RayTransform(func, schema, ds, scheduler='ray', workers=1, **kwargs)

- __init__(func, schema, ds, scheduler='ray', workers=1, **kwargs) – Transform applies a user-defined function to each sample.
  Parameters:
  - func (function) – User-defined function func(x, **kwargs).
  - schema (dict of dtypes) – The structure of the final dataset that will be created.
  - ds (Iterable) – Input dataset, or a list that can be iterated.
  - scheduler (str) – Choice between "single", "threaded", and "processed".
  - workers (int) – How many threads or processes to use.
  - **kwargs – Additional arguments that will be passed to func as static arguments for all samples.
- set_dynamic_shapes(results, ds) – Sets shapes for dynamic tensors after the dataset has been uploaded.
  Parameters:
  - results (Tuple) – Results from uploading each chunk, which include (key, slice, shape) tuples.
  - ds – Dataset to set the shapes on.
- store(url: str, token: dict = None, length: int = None, ds: Iterable = None, progressbar: bool = True, public: bool = True) – Applies the transformation to each element in a batchified manner.
  Parameters:
  - url (str) – Path where the data is going to be stored.
  - token (str or dict, optional) – If url refers to a place where authorization is required, token is the parameter for passing credentials; it can be a filepath or a dict.
  - length (int) – In case shape is None, the user can provide the length.
  - ds (Iterable) –
  - progressbar (bool) – Show a progress bar.
  - public (bool, optional) – Only applicable when using hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible to the public in the visualizer.
  Returns: ds – The uploaded dataset.
  Return type: hub.Dataset
- upload(results, url: str, token: dict, progressbar: bool = True, public: bool = True) – Batchified upload of results. For each tensor, batchifies based on its chunks and uploads. If a tensor is dynamic, it is still uploaded element by element.
  Parameters:
  - url (str) – Path of the dataset that should be written to.
  - results – Output of the transform function.
  - progressbar (bool) –
  - public (bool, optional) – Only applicable when using hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible to the public in the visualizer.
  Returns: ds – The uploaded dataset.
  Return type: hub.Dataset
Schema

Serialization

- hub.schema.serialize.serialize(input) – Converts the input into a serializable format.
- hub.schema.serialize.serialize_SchemaDict(fdict) – Converts a SchemaDict into a serializable format.
- hub.schema.serialize.serialize_primitive(primitive) – Converts a Primitive into a serializable format.
- hub.schema.serialize.serialize_tensor(tensor) – Converts a Tensor and its derivatives into a serializable format.
Schema

class hub.schema.audio.Audio(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')

- __init__(shape: Tuple[int, …] = None, dtype='int64', file_format=None, sample_rate: int = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – Constructs the connector.
  Parameters:
  - file_format (str) – The audio file format. Can be any format ffmpeg understands. If None, the format is inferred from the file extension.
  - shape (tuple) – The shape of the data.
  - dtype (str) – The dtype of the data.
  - sample_rate (int) – Additional metadata exposed to the user through info.schema['audio'].sample_rate. This value is used in neither encoding nor decoding.
  Raises: ValueError – If the shape is invalid.
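  A construction sketch (the shapes and sample rate are illustrative assumptions):
  >>> from hub.schema.audio import Audio
  >>> audio_tensor = Audio(shape=(None,), max_shape=(48000,), sample_rate=16000)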
- __repr__() – Return repr(self).
- __str__() – Return str(self).
class hub.schema.bbox.BBox(dtype='float64', chunks=None, compressor='lz4') – HubSchema for a normalized bounding box.
  Output: Tensor of type float32 and shape [4,], which contains the normalized coordinates of the bounding box [ymin, xmin, ymax, xmax].

- __init__(dtype='float64', chunks=None, compressor='lz4') – Constructs the connector.
  Parameters:
  - dtype (str) – The dtype of the bbox coordinates. Default: 'float64'.
  - chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, automatically detects how to split into chunks.
- __repr__() – Return repr(self).
- __str__() – Return str(self).
class hub.schema.class_label.ClassLabel(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4') – HubSchema for integer class labels.

- __init__(num_classes: int = None, names: List[str] = None, names_file: str = None, chunks=None, compressor='lz4') – Constructs a ClassLabel HubSchema. Returns integer representations of the given classes. Preserves the names of the classes, to convert them back to strings if needed.
  There are 3 ways to define a ClassLabel, corresponding to the 3 arguments:
  - num_classes: creates 0 to (num_classes - 1) labels
  - names: a list of label strings
  - names_file: a file containing the list of labels
  Note: In Python 2, the strings are encoded as utf-8.
  Usage:
  >>> class_label_tensor = ClassLabel(num_classes=10)
  >>> class_label_tensor = ClassLabel(names=['class1', 'class2', 'class3', ...])
  >>> class_label_tensor = ClassLabel(names_file='/path/to/file/with/names')
  Parameters:
  - num_classes (int) – Number of classes. All labels must be < num_classes.
  - names (list<str>) – String names for the integer classes. The order in which the names are provided is kept.
  - names_file (str) – Path to a file with names for the integer classes, one per line.
  - max_shape (Tuple[int]) – Maximum shape of the tensor, if the tensor is dynamic.
  - chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, automatically detects how to split into chunks.
  Note: Only one of num_classes, names, and names_file should be provided.
  Raises: ValueError – If more than one argument is provided.
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- int2str(int_value: int) – Conversion integer => class name string.
- str2int(str_value: str) – Conversion class name string => integer.
class hub.schema.image.Image(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – HubSchema for images.
  Output: tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images.
  Usage:
  >>> image_tensor = Image(shape=(None, None, 1),
  ...                      encoding_format='png')

- __init__(shape: Tuple[int, …] = (None, None, 3), dtype='uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – Constructs the connector.
  Parameters:
  - shape (tuple of ints or None) – The shape of the decoded image: (height, width, channels), where height and width can be None. Defaults to (None, None, 3).
  - dtype (uint16 or uint8 (default)) – uint16 can only be used with the png encoding_format.
  - encoding_format ('jpeg' or 'png' (default)) – The format used to serialize np.ndarray images on disk.
  - max_shape (Tuple[int]) – Maximum shape of the tensor, if the tensor is dynamic.
  - chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, automatically detects how to split into chunks.
  Returns: tf.Tensor of type tf.uint8 and shape [height, width, num_channels] for BMP, JPEG, and PNG images.
  Raises: ValueError – If the shape, dtype, or encoding formats are invalid.
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- _set_dtype(dtype) – Sets the dtype.
class hub.schema.features.FlatTensor(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …]) – Tensor metadata after applying the flatten function.

- __init__(path: str, shape: Tuple[int, …], dtype, max_shape: Tuple[int, …], chunks: Tuple[int, …]) – Initialize self. See help(type(self)) for accurate signature.
- __weakref__ – List of weak references to the object (if defined).
class hub.schema.features.HubSchema – Base class for all datatypes.

- __weakref__ – List of weak references to the object (if defined).
- _flatten() → Iterable[hub.schema.features.FlatTensor] – Flattens the dtype into a list of tensors that will need to be stored separately.
class hub.schema.features.Primitive(dtype, chunks=None, compressor='lz4') – Class for handling primitive datatypes. All numpy primitive data types, like int32, float64, etc., should be wrapped in this class.

- __init__(dtype, chunks=None, compressor='lz4') – Initialize self. See help(type(self)) for accurate signature.
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- _flatten() – Flattens the dtype into a list of tensors that will need to be stored separately.
class hub.schema.features.SchemaDict(dict_) – Class for dict branching of a datatype. A SchemaDict dtype contains str -> dtype associations; this way you can describe complex datatypes.

- __init__(dict_) – Initialize self. See help(type(self)) for accurate signature.
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- _flatten() – Flattens the dtype into a list of tensors that will need to be stored separately.
class hub.schema.features.Tensor(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – Tensor type in schema. Has a numpy-array-like structure and can contain elements of any type (Primitive and non-Primitive). Tensors can't be visualized at app.activeloop.ai.

- __init__(shape: Tuple[int, …] = None, dtype='float64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4')
  Parameters:
  - shape (Tuple[int]) – The shape of the tensor; can contain None(s), meaning the shape can be dynamic. A dynamic shape means it can change while editing the dataset.
  - dtype (SchemaConnector or str) – The dtype of each element in the Tensor. Can be a Primitive or a non-Primitive type.
  - max_shape (Tuple[int]) – Maximum shape of the tensor, if the tensor is dynamic.
  - chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, automatically detects how to split into chunks.
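  A construction sketch (the shapes below are illustrative assumptions):
  >>> from hub.schema.features import Tensor
  >>> tensor_schema = Tensor(shape=(None, None), dtype="uint8", max_shape=(1024, 1024))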
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- _flatten() – Flattens the dtype into a list of tensors that will need to be stored separately.
- hub.schema.features.featurify(schema) → hub.schema.features.HubSchema – Converts naked primitive datatypes and dicts into Primitives and SchemaDicts, so that every node in the dtype tree is a SchemaConnector-type object.
- hub.schema.features.flatten(dtype, root='') – Flattens a nested dictionary and returns tuples of (dtype, path).
class hub.schema.mask.Mask(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – HubSchema for masks.
  Usage:
  >>> mask_tensor = Mask(shape=(300, 300, 1))

- __init__(shape: Tuple[int, …] = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – Constructs a Mask HubSchema.
  Parameters:
  - shape (tuple of ints or None) – Shape in (height, width, 1) format.
  - dtype (str) – The dtype of the mask array. Default: uint8.
  - max_shape (Tuple[int]) – Maximum shape of the tensor, if the tensor is dynamic.
  - chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, automatically detects how to split into chunks.
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- _check_shape(shape) – Checks whether the provided shape matches mask characteristics.
class hub.schema.polygon.Polygon(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – HubSchema for polygons.
  Usage:
  >>> polygon_tensor = Polygon(shape=(10, 2))
  >>> polygon_tensor = Polygon(shape=(None, 2))

- __init__(shape: Tuple[int, …] = None, dtype='int32', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – Constructs a Polygon HubSchema.
  Parameters:
  - shape (tuple of ints or None) – Shape in (None, 2) format.
  - max_shape (Tuple[int]) – Maximum shape of the tensor, if the tensor is dynamic.
  - chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, automatically detects how to split into chunks.
  Raises: ValueError – If the shape is invalid.
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- _check_shape(shape) – Checks whether the provided shape matches polygon characteristics.
class hub.schema.segmentation.Segmentation(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – HubSchema for segmentation.

- __init__(shape: Tuple[int, …] = None, dtype: str = None, num_classes: int = None, names: Tuple[str] = None, names_file: str = None, max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – Constructs a Segmentation HubSchema. Also constructs a ClassLabel HubSchema for the segmentation classes.
  Parameters:
  - shape (tuple of ints or None) – Shape in (height, width, 1) format.
  - dtype (str) – The dtype of the segmentation array: uint16 or uint8.
  - num_classes (int) – Number of classes. All labels must be < num_classes.
  - names (list<str>) – String names for the integer classes. The order in which the names are provided is kept.
  - names_file (str) – Path to a file with names for the integer classes, one per line.
  - max_shape (Tuple[int]) – Maximum shape of the tensor, if the tensor is dynamic.
  - chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, automatically detects how to split into chunks.
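  A construction sketch (the shapes and class count are illustrative assumptions):
  >>> from hub.schema.segmentation import Segmentation
  >>> seg_tensor = Segmentation(shape=(None, None, 1), dtype="uint8",
  ...                           num_classes=10, max_shape=(512, 512, 1))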
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- get_segmentation_classes() – Gets the classes of the segmentation mask.
class hub.schema.sequence.Sequence(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4') – A Sequence corresponds to a sequence of HubSchema features. At generation time, a list is given for each element of the sequence. The output of Dataset will batch all the elements of the sequence together. If the length of the sequence is static and known in advance, it should be specified in the constructor using the length param.
  Usage:
  >>> sequence = Sequence(Image(), length=NB_FRAME)

- __init__(shape=(), max_shape=None, dtype=None, chunks=None, compressor='lz4') – Constructs a sequence of Tensors.
  Parameters:
  - shape (Tuple[int] | int) – A single-integer-element tuple representing the length of the sequence. If None, the sequence is dynamic.
  - dtype (str | HubSchema) – The datatype of each element in the sequence.
  - chunks (Tuple[int] | int) – Number of elements in a chunk. Works only for a top-level sequence. You can also include the number of samples in a single chunk.
- __repr__() – Return repr(self).
- __str__() – Return str(self).
class hub.schema.text.Text(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – HubSchema for text.

- __init__(shape: Tuple[int, …] = None, dtype='int64', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – Constructs the connector. Returns an integer representation of the given string.
  Parameters:
  - shape (tuple of ints or None) – The shape of the text.
  - dtype (str) – The dtype for storage.
  - max_shape (Tuple[int]) – Maximum number of words in the text.
  - chunks (Tuple[int] | True) – Describes how to split tensor dimensions into chunks (files) to store them efficiently. It is anticipated that each file should be ~16MB. The sample count is also in the list of the tensor's dimensions (first dimension). If the default value is chosen, automatically detects how to split into chunks.
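  A construction sketch (the word limit is an illustrative assumption):
  >>> from hub.schema.text import Text
  >>> text_tensor = Text(shape=(None,), max_shape=(20,))  # up to 20 words per sample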
- __repr__() – Return repr(self).
- __str__() – Return str(self).
- _set_dtype(dtype) – Sets the dtype.
class hub.schema.video.Video(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – HubSchema for videos, encoding frames individually on disk.
  The connector accepts as input a 4-dimensional uint8 array representing a video.
  Returns: Tensor of dtype uint8 and shape [num_frames, height, width, channels], where channels must be 1 or 3.
- __init__(shape: Tuple[int, …] = None, dtype: str = 'uint8', max_shape: Tuple[int, …] = None, chunks=None, compressor='lz4') – Initializes the connector.
  Parameters:
  - shape (tuple of ints) – The shape of the video (num_frames, height, width, channels), where channels is 1 or 3.
  - encoding_format (str) – The video is stored as a sequence of encoded images. You can use any encoding format supported by Image.
  - dtype (uint16 or uint8 (default)) –
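  A construction sketch (frame count and resolution are illustrative assumptions):
  >>> from hub.schema.video import Video
  >>> video_tensor = Video(shape=(None, 240, 320, 3), max_shape=(100, 240, 320, 3))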
  Raises: ValueError – If the shape, dtype, or encoding formats are invalid.
- __repr__() – Return repr(self).
- __str__() – Return str(self).