Step 3: Understanding Compression

Using compression to achieve optimal performance

All sample data in Hub can be stored in a raw, uncompressed format. However, to achieve optimal performance in terms of speed and memory, it is critical to specify an appropriate compression method for your data.

For example, when creating a tensor for storing images, you can choose the compression technique for the image samples using the sample_compression input:

ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')

In this example, every image added in subsequent .append(...) calls is compressed using the specified sample_compression method. If the source data is already in the correct compression format, it is saved as-is. Otherwise, it is recompressed to the specified format, as described in detail below.
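The store-as-is versus recompress decision described above can be sketched in a few lines. This is only an illustration of the idea, not Hub's internal implementation: the helper names detect_format and plan_append are hypothetical, and the format check relies solely on the standard PNG and JPEG file signatures.

```python
# Standard file signatures (magic bytes) for the two formats discussed here.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def detect_format(data: bytes) -> str:
    """Guess the image format from the leading magic bytes (hypothetical helper)."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


def plan_append(data: bytes, sample_compression: str) -> str:
    """Return 'store-as-is' when the source format matches, else 'recompress'."""
    if detect_format(data) == sample_compression:
        return "store-as-is"
    return "recompress"


# Only the headers matter for this sketch; the payloads are placeholders,
# not decodable images.
fake_jpeg = JPEG_MAGIC + b"\xe0" + b"\x00" * 16
fake_png = PNG_MAGIC + b"\x00" * 16

print(plan_append(fake_jpeg, "jpeg"))  # formats match
print(plan_append(fake_png, "jpeg"))   # formats differ
```

A matching source takes the cheap path, while a mismatched source is flagged for the more expensive decompress and recompress step.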

When choosing the optimal compression, the primary tradeoffs are lossiness, memory, and runtime:

Lossiness - Certain compression techniques are lossy, meaning that there is irreversible information loss when saving the data in the compressed format.

Memory - Different compression techniques have substantially different memory footprints. For instance, png vs jpeg compression may result in a 10X difference in the size of a Hub dataset.

Runtime - The highest upload speeds are achieved when the sample_compression value matches the compression of the source data, such as:

# Fast case: the source image (my_image) is already a JPEG, matching sample_compression = 'jpeg'
ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')

However, a mismatch between the compression of the source data and the sample_compression in Hub results in significantly slower upload speeds, because Hub must decompress the source data and recompress it using the specified sample_compression before saving:

# Slow case: the source image (my_image) is a PNG, but sample_compression = 'jpeg'
ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')

Because decompressing and recompressing data is computationally expensive, consider the runtime cost before uploading source data whose compression differs from the specified sample_compression.
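The three tradeoffs can be observed with a small standard-library experiment. This is a rough sketch under stated assumptions, not a benchmark of Hub: zlib and lzma stand in for image codecs such as png and jpeg, the "lossy" step is a simple bit-dropping stand-in rather than real JPEG compression, and the image is synthetic.

```python
import lzma
import random
import time
import zlib

# Synthetic 256x256 8-bit image: four gray bands plus noise in the low 4 bits.
rng = random.Random(0)
raw = bytes((x // 64) * 64 | rng.randrange(16) for _ in range(256) for x in range(256))

# Lossiness: a lossless codec reproduces the input exactly ...
lossless = zlib.compress(raw)
assert zlib.decompress(lossless) == raw

# ... while a lossy scheme (here: zeroing the noisy low bits first) does not.
lossy = zlib.compress(bytes(b & 0xF0 for b in raw))
max_error = max(abs(a - b) for a, b in zip(raw, zlib.decompress(lossy)))

# Storage: compare the footprint of each representation in bytes.
sizes = {"raw": len(raw), "lossless": len(lossless), "lossy": len(lossy)}

# Runtime: matching formats only require copying the bytes unchanged; a mismatch
# forces a decompress-then-recompress step (zlib -> lzma stands in for png -> jpeg).
start = time.perf_counter()
stored_as_is = bytearray(lossless)
copy_seconds = time.perf_counter() - start

start = time.perf_counter()
transcoded = lzma.compress(zlib.decompress(lossless))
transcode_seconds = time.perf_counter() - start

print(sizes, "max pixel error:", max_error)
print(f"copy: {copy_seconds:.6f}s, transcode: {transcode_seconds:.6f}s")
```

The exact numbers depend on the machine and the data, but the pattern mirrors the discussion above: the lossy variant is smaller at the cost of pixel error, and the transcode path does strictly more work than storing matching data unchanged.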