Step 3: Understanding Compression
Using compression to achieve optimal performance.
All sample data in Hub can be stored in a raw uncompressed format. However, in order to achieve optimal performance in terms of speed and memory, it is critical to specify an appropriate compression method for your data.
For example, when creating a tensor for storing images, you can choose the compression technique for the image samples using the sample_compression input:
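A minimal sketch of such a call, assuming the hub package is installed and ds is a Hub dataset (for example one returned by hub.empty). The helper name create_image_tensor, the tensor name images, and the jpeg default are illustrative choices, not taken from this guide.

```python
def create_image_tensor(ds, compression="jpeg"):
    """Create an 'images' tensor whose samples are stored using `compression`.

    `ds` is assumed to be a Hub dataset object exposing create_tensor().
    """
    return ds.create_tensor(
        "images",
        htype="image",
        sample_compression=compression,
    )


# Typical use (requires the hub package; the dataset path is hypothetical):
#   import hub
#   ds = hub.empty("./my_dataset")
#   create_image_tensor(ds, "jpeg")
```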
In this example, every image added in subsequent .append(...) calls is compressed using the specified sample_compression method. If the source data is already in the correct compression format, it is saved as-is. Otherwise, it is recompressed to the specified format, as described in detail below.
When choosing the optimal compression, the primary tradeoffs are lossiness, memory, and runtime:
Lossiness - Certain compression techniques are lossy, meaning that there is irreversible information loss when saving the data in the compressed format.
Memory - Different compression techniques have substantially different memory footprints. For instance, png vs jpeg compression may result in a 10x difference in the size of a Hub dataset.
Runtime - The highest upload speeds can be achieved when the sample_compression value matches the compression of the source data, such as:
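A minimal sketch of a matching upload, assuming the target tensor was created with sample_compression set to jpeg and the source files are already jpeg. The helper append_files and the file path are hypothetical. The reader function is passed in as an argument so the sketch does not depend on a particular loader; with Hub, hub.read would typically fill that role.

```python
def append_files(tensor, paths, read):
    """Append each file in `paths` to `tensor`, loading it with `read`.

    Returns the number of samples appended.
    """
    count = 0
    for path in paths:
        tensor.append(read(path))
        count += 1
    return count


# With Hub (requires the hub package; the path is hypothetical):
#   import hub
#   append_files(ds.images, ["path/to/image.jpeg"], hub.read)
```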
However, a mismatch between compression of the source data and sample_compression in Hub results in significantly slower upload speeds, because Hub must decompress the source data and recompress it using the specified sample_compression before saving:
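For instance, appending a png file to a tensor created with sample_compression set to jpeg takes this slower path. The sketch below is a hypothetical pre-upload check, not part of Hub: it guesses each file's format from its extension alone (an assumption, since an extension can be wrong) and treats jpg as an alias of jpeg, so mismatched files can be spotted before uploading.

```python
from pathlib import Path

# Extension spellings mapped to a canonical compression name (assumed list).
_ALIASES = {"jpg": "jpeg"}


def source_compression(path):
    """Guess a file's compression from its extension (hypothetical helper)."""
    ext = Path(path).suffix.lower().lstrip(".")
    return _ALIASES.get(ext, ext)


def needs_recompression(path, sample_compression):
    """Return True when the guessed source format differs from the target."""
    target = sample_compression.lower()
    target = _ALIASES.get(target, target)
    return source_compression(path) != target


files = ["cat.jpeg", "dog.JPG", "bird.png"]  # illustrative file names
mismatched = [f for f in files if needs_recompression(f, "jpeg")]
print(mismatched)  # ['bird.png']
```

Files flagged this way could be converted ahead of time, or uploaded as-is with the slower speed accepted.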
Therefore, because decompressing and recompressing data is computationally expensive, consider the runtime implications of uploading source data whose compression differs from the specified sample_compression.