All sample data in Hub can be stored in a raw, uncompressed format. However, to achieve optimal speed and memory usage, it is critical to specify an appropriate compression method for your data.
For example, when creating a tensor for storing images, you can choose the compression technique for the image samples using the sample_compression argument:
ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
In this example, every image added in subsequent .append(...) calls is compressed using the specified sample_compression method. If the source data is already in the correct compression format, it is saved as-is. Otherwise, it is recompressed to the specified format, as described in detail below.
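The save-as-is versus recompress decision described above can be illustrated with a minimal sketch. This is not Hub's internal code: the helper names detect_compression and needs_recompression are hypothetical, and the sketch assumes the source format can be guessed from the standard JPEG and PNG file signatures.

```python
# Hypothetical helpers for illustration only; not part of the Hub API.
JPEG_MAGIC = b"\xff\xd8\xff"        # leading bytes of a JPEG file
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # 8-byte PNG signature


def detect_compression(data: bytes):
    """Guess the image format from the leading bytes of the file."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None  # unknown format or raw data


def needs_recompression(data: bytes, sample_compression: str) -> bool:
    """Return True when the source bytes are not already in the target format."""
    return detect_compression(data) != sample_compression


# Fake headers padded with zeros; these are not decodable images.
jpeg_bytes = JPEG_MAGIC + b"\xe0" + b"\x00" * 16
png_bytes = PNG_MAGIC + b"\x00" * 16

print(needs_recompression(jpeg_bytes, "jpeg"))  # False: bytes can be stored as-is
print(needs_recompression(png_bytes, "jpeg"))   # True: decode, then re-encode as jpeg
```

Only the second case would require the extra decode and re-encode step before saving.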
Lossiness - Certain compression techniques are lossy, meaning that there is irreversible information loss when saving the data in the compressed format.
Memory - Different compression techniques have substantially different memory footprints. For instance, choosing jpeg compression over another format may result in a 10X difference in the size of a Hub dataset.
Runtime - The highest upload speeds are achieved when the sample_compression value matches the compression of the source data, such as:
# sample_compression and my_image are 'jpeg'
ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
ds.images.append(hub.load('my_image.jpeg'))
However, a mismatch between the compression of the source data and sample_compression in Hub results in significantly slower upload speeds, because Hub must decompress the source data and recompress it using the specified sample_compression before saving:
# sample_compression is 'jpeg' and my_image is 'png'
ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
ds.images.append(hub.load('my_image.png'))
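The Lossiness and Memory considerations above can be explored with a small standard-library sketch. It does not use Hub or a real image codec: zlib stands in for a lossless format, and dropping the low bits of each pixel stands in for a lossy one. Both stand-ins, and the synthetic image, are assumptions made purely for illustration.

```python
import zlib

# Synthetic 64x64 grayscale "image": a gradient with a small repeating perturbation.
width, height = 64, 64
raw = bytes(
    ((x + y) * 2 + (x * y) % 7) % 256
    for y in range(height)
    for x in range(width)
)

# Lossless stand-in: zlib round-trips the data exactly.
lossless = zlib.compress(raw)
assert zlib.decompress(lossless) == raw

# Lossy stand-in: discard the 4 low bits of every pixel before compressing.
quantized = bytes(b & 0xF0 for b in raw)
lossy = zlib.compress(quantized)
restored = zlib.decompress(lossy)

# Print the three sizes in bytes so the footprints can be compared.
print(len(raw), len(lossless), len(lossy))
# The discarded bits cannot be recovered from the lossy version.
print(restored == raw)
```

The lossless path reproduces the original bytes exactly, while the lossy path returns only the quantized data. How much smaller either result is depends on the data and the codec.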