Data Layout
How data is laid out and stored in hub format.

Tensors

Hub uses a columnar storage architecture, and the columns in hub are referred to as tensors. Data in the tensors can be added or modified, and the data in different tensors are independent of each other.

Indexing and Samples

Hub datasets and their tensors are indexed, and the data at a given index across multiple tensors is referred to as a sample. Data at the same index is assumed to be related. For example, data in the bbox tensor at index 100 is assumed to be related to data in the image tensor at index 100.
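
The relationship between tensors, indices, and samples can be illustrated with a small toy model. This is only a sketch in plain Python, not the hub API: the dictionary layout, the file names, and the helper get_sample are hypothetical and exist purely for illustration.

```python
# Toy model of a columnar dataset (NOT the hub API): each tensor is an
# independent column, and all columns share the same index space.
dataset = {
    "image": ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
    "bbox": [[10, 20, 50, 60], [0, 0, 32, 32], [5, 5, 40, 80]],
}


def get_sample(ds, index):
    """Collect the data stored at `index` across all tensors (hypothetical helper)."""
    return {name: column[index] for name, column in ds.items()}


# A sample is the data at one index, spanning every tensor.
sample = get_sample(dataset, 1)

# Tensors are independent: modifying one column leaves the others untouched.
dataset["bbox"][1] = [1, 1, 30, 30]
updated_sample = get_sample(dataset, 1)
```

Here sample pairs the image and bbox entries stored at index 1, and updating the bbox column changes only that tensor's value in updated_sample.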

Chunking

Most data in hub format is stored in chunks, which are blobs of data of a pre-defined size. The purpose of chunking is to accelerate the streaming of data across networks by increasing the amount of data that is transferred per network request.
Each tensor has its own chunks, and the default chunk size is 16MB. A single chunk contains data from multiple indices when the individual data points (image, label, annotation, etc.) are smaller than the chunk size. Conversely, when an individual data point is larger than the chunk size, its data is split across multiple chunks.
Video data is an exception to this chunking logic. Videos that are larger than the specified chunk size are not broken into smaller pieces, because hub uses efficient libraries to stream and access subsets of videos, which makes splitting them unnecessary.
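
The packing and splitting behavior described above can be sketched as follows. This is a simplified illustration, not hub's actual chunking algorithm: the function plan_chunks and its greedy packing strategy are assumptions, 16MB is taken to mean 16 * 1024 * 1024 bytes, and the video exception is not modeled.

```python
MB = 1024 * 1024
CHUNK_SIZE = 16 * MB  # default chunk size; assumes 1MB = 1024 * 1024 bytes


def plan_chunks(sample_sizes, chunk_size=CHUNK_SIZE):
    """Hypothetical planner: return a list of chunks for one tensor.

    Each chunk is a list of (sample_index, n_bytes) pieces.
    Small samples are packed together; oversized samples are split.
    """
    chunks = []
    current = []  # pieces in the chunk being filled
    used = 0      # bytes used in the current chunk

    for index, size in enumerate(sample_sizes):
        if size > chunk_size:
            # Close the chunk in progress, then split the large sample
            # into consecutive chunk-sized pieces.
            if current:
                chunks.append(current)
                current, used = [], 0
            remaining = size
            while remaining > 0:
                piece = min(remaining, chunk_size)
                chunks.append([(index, piece)])
                remaining -= piece
            continue

        if used + size > chunk_size:
            # The sample does not fit; start a new chunk.
            chunks.append(current)
            current, used = [], 0

        current.append((index, size))
        used += size

    if current:
        chunks.append(current)
    return chunks


# Three 6MB samples followed by one 40MB sample.
plan = plan_chunks([6 * MB, 6 * MB, 6 * MB, 40 * MB])
```

In this example the first two samples share one chunk, the third sits in a chunk of its own, and the 40MB sample is spread over three chunks, so fetching the first chunk returns two samples in a single request.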

Groups

Multiple tensors can be combined into groups. Groups do not fundamentally change the way data is stored, but they are useful for helping Activeloop Platform understand how different tensors are related.
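
One way to picture a group is as a shared name prefix over otherwise ordinary tensors. The sketch below is a plain-Python illustration, not the hub API: the slash-separated naming scheme, the tensor names, and the helper tensors_in_group are assumptions made for this example.

```python
# Toy model (NOT the hub API): tensors stay independent columns, and a group
# is expressed only through a shared, slash-separated name prefix.
tensors = {
    "image": ["img_0.jpg", "img_1.jpg"],
    "annotations/bbox": [[10, 20, 50, 60], [0, 0, 32, 32]],
    "annotations/label": ["cat", "dog"],
}


def tensors_in_group(ds, group):
    """Return the sorted names of the tensors that belong to `group` (hypothetical helper)."""
    prefix = group.rstrip("/") + "/"
    return sorted(name for name in ds if name.startswith(prefix))


members = tensors_in_group(tensors, "annotations")
```

The columns themselves are stored exactly as before; the grouping only records that the bbox and label tensors belong together.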