Data in Hub can be stored in raw uncompressed format. However, compression is highly recommended for achieving optimal performance in terms of speed and storage.
Compression is specified separately for each tensor, and it can occur at the sample or chunk level. For example, when creating a tensor for storing images, you can choose the compression technique for the image samples using the sample_compression input:
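A minimal sketch of such a call is shown here. It assumes ds is a Hub dataset object (for instance, one returned by hub.empty) and wraps the create_tensor call in a small helper; the helper name, the tensor name "images", and the choice of "jpeg" are illustrative, not prescribed by Hub.

```python
def create_image_tensor(ds, name="images", compression="jpeg"):
    """Create an image tensor whose samples are stored with `compression`.

    `ds` is assumed to be a Hub dataset, e.g. the object returned by
    hub.empty("./compression_test"). The helper itself is illustrative.
    """
    # htype="image" marks the tensor as holding images; sample_compression
    # selects how each appended sample is stored.
    return ds.create_tensor(name, htype="image", sample_compression=compression)
```

With Hub installed, a typical use would be ds = hub.empty("./compression_test") followed by create_image_tensor(ds).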
In this example, every image added in subsequent .append(...) calls is compressed using the specified sample_compression method.
Choosing the Right Compression
There is no single answer for choosing the right compression, and the tradeoffs are described in detail in the next section. However, good rules of thumb are:
For data that has application-specific compressors (image, audio, video, ...), choose the sample_compression technique that is native to the application, such as jpg, mp3, or mp4.
For other data containing large samples (e.g. large arrays with >100 values), lz4 is a generic compressor that works well in most applications.
lz4 can be used as a sample_compression or a chunk_compression. In most cases, sample_compression is sufficient, but in theory, chunk_compression produces slightly smaller data.
For other data containing small samples (e.g. labels with <100 values), it is not necessary to use compression.
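The claim that chunk-level compression tends to produce smaller data than sample-level compression can be illustrated with a toy experiment. The sketch below uses zlib from the Python standard library as a stand-in for lz4 (an assumption made only so the example has no dependencies), and the synthetic samples are hypothetical. It does not use Hub's own chunking logic.

```python
import random
import zlib

random.seed(0)

# 200 small synthetic "samples": 64 bytes each, drawn from a 4-symbol alphabet.
samples = [bytes(random.choice(b"ABCD") for _ in range(64)) for _ in range(200)]

# Sample-level: every sample is compressed on its own, so each one pays
# the compressor's fixed per-stream overhead.
sample_level = sum(len(zlib.compress(s)) for s in samples)

# Chunk-level: samples are concatenated into one chunk and compressed once,
# so the overhead is paid once and statistics are shared across samples.
chunk_level = len(zlib.compress(b"".join(samples)))

print(f"sample-level: {sample_level} bytes, chunk-level: {chunk_level} bytes")
```

The gap in this toy case is exaggerated because the samples are tiny; for large samples the per-stream overhead matters much less, which is consistent with sample_compression being sufficient in most cases.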
Lossiness - Certain compression techniques are lossy, meaning that there is irreversible information loss when compressing the data. Lossless compression is less important for data such as images and videos, but it is critical for label data such as numerical labels, binary masks, and segmentation data.
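To see why lossy storage is a problem for labels, consider the following sketch. It uses zlib as a generic lossless compressor and a deliberately crude, hypothetical "lossy" step (dropping the lowest bit of each value) as a stand-in for a real lossy codec such as jpeg.

```python
import zlib

# A small synthetic label array, e.g. class ids in a segmentation mask.
labels = bytes([0, 1, 1, 2, 0, 3, 3, 1] * 32)

# Lossless: decompressing returns exactly the original bytes.
restored = zlib.decompress(zlib.compress(labels))
assert restored == labels


def lossy_roundtrip(data: bytes) -> bytes:
    """Toy lossy step: discard the lowest bit of every value."""
    return bytes((v >> 1) << 1 for v in data)


# Lossy: class 1 collapses into class 0, and class 3 into class 2.
damaged = lossy_roundtrip(labels)
changed = sum(a != b for a, b in zip(labels, damaged))
print(f"{changed} of {len(labels)} labels changed after the lossy round trip")
```

A small perturbation of this kind is usually invisible in a photograph, but in a label array it silently reassigns samples to a different class.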
Memory - Different compression techniques have substantially different memory footprints. For instance, png vs jpeg compression may result in a 10X difference in the size of a Hub dataset.
Runtime - The primary variables affecting download and upload speeds for generating usable data are the network speed and the compute power available for processing the data. In most cases, the network speed is the limiting factor. Therefore, the highest end-to-end throughput for non-local applications is achieved by maximizing compression and using compute power to decompress/convert the data into the formats consumed by deep learning models (i.e. arrays).
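The reasoning above can be made concrete with a back-of-the-envelope model. The sketch assumes download and decoding are pipelined, so throughput is set by the slower stage; the function name and all numbers (bandwidth, sample sizes, decode rate) are hypothetical and chosen only for illustration.

```python
def samples_per_second(bandwidth_mb_s, sample_size_mb, decode_rate=float("inf")):
    """Estimate streaming throughput as the slower of two pipelined stages.

    bandwidth_mb_s: network bandwidth in MB/s
    sample_size_mb: size of one stored sample in MB
    decode_rate:    samples/s the CPU can decompress (infinite for raw data)
    """
    network_rate = bandwidth_mb_s / sample_size_mb
    return min(network_rate, decode_rate)


# Hypothetical numbers: a 3 MB raw image vs. a 0.3 MB compressed copy.
raw_rate = samples_per_second(100, 3.0)
compressed_rate = samples_per_second(100, 0.3, decode_rate=500)

print(f"raw: {raw_rate:.1f} samples/s, compressed: {compressed_rate:.1f} samples/s")
```

Under these assumptions the compressed data streams roughly ten times faster, because the network remains the bottleneck even after the decode cost is included. If decoding were slower than the network, the min would flip and compute would become the limit.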
Upload Considerations - When applicable, the highest upload speeds can be achieved when the sample_compression input matches the compression of the source data, such as:
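A minimal sketch of a matching upload follows. It assumes ds is a Hub dataset that already has an images tensor created with a jpeg sample_compression, and that the source file is a .jpg. The helper name and the read parameter are illustrative; read defaults to hub.read and exists only so the sketch can be exercised without Hub installed.

```python
def append_image_file(ds, image_path, read=None):
    """Append an image file to ds.images without decoding it first.

    Assumes ds.images was created with a sample_compression that matches
    the file's format (e.g. jpeg for a .jpg source file).
    """
    if read is None:
        import hub  # requires the Hub package

        read = hub.read
    # hub.read wraps the file so that Hub can store it directly when the
    # file's compression matches the tensor's sample_compression.
    ds.images.append(read(image_path))
```

A typical call would be append_image_file(ds, "path/to/image.jpg"), where the path is a placeholder.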
In this case, the input data is a .jpg file, and the Hub sample_compression is jpg.
However, a mismatch between compression of the source data and sample_compression in Hub results in significantly slower upload speeds, because Hub must decompress the source data and recompress it using the specified sample_compression before saving.
Because decompressing and recompressing data is computationally expensive, consider the runtime implications before uploading source data whose compression differs from the specified sample_compression.
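One way to catch such mismatches before a large upload is a quick pre-check on file extensions. The helper below is a hypothetical sketch and not part of Hub: it guesses the source format from the extension alone and treats jpg as an alias of jpeg. Hub itself may determine formats differently.

```python
from pathlib import Path

# Illustrative alias table; extend it for other formats as needed.
_ALIASES = {"jpg": "jpeg"}


def _normalize(name: str) -> str:
    """Lower-case a format name and resolve known aliases."""
    name = name.lower()
    return _ALIASES.get(name, name)


def needs_recompression(source_path: str, sample_compression: str) -> bool:
    """Return True if the file's extension differs from sample_compression."""
    extension = Path(source_path).suffix.lstrip(".")
    return _normalize(extension) != _normalize(sample_compression)
```

For example, needs_recompression("photos/cat.png", "jpeg") flags a file that would have to be decompressed and recompressed on upload.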