Step 4: Improving Performance via "with" Syntax

Achieving optimal performance using Hub.

The technique in this section is critical to getting fast performance from Hub.

By default, every update to any element of a dataset is pushed immediately to the dataset's long-term storage location. The resulting large number of discrete write operations can increase runtime significantly. In the example below, an update is pushed to AWS S3 on every call to .append() inside the for loop.

from hub import Dataset
ds = Dataset('s3://bucket_name/dataset_name') # Dataset is stored on AWS S3
ds.create_tensor('my_tensor')
for i in range(10):
    ds.my_tensor.append(i)  # S3 bucket is updated after every append command

To reduce runtime, wrap the updates in a with statement, as in the example below. This significantly improves performance because updates are pushed to long-term storage only after the code block inside the with statement has finished executing, or when the local cache is full.

with Dataset('s3://bucket_name/dataset_name') as ds:
    ds.create_tensor('my_tensor')
    for i in range(10):
        ds.my_tensor.append(i)
# S3 bucket is updated at the end of the code block inside 'with'
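
To see why batching helps, the toy model below counts simulated round trips to remote storage with and without a with block. It is only a sketch of the general write-buffering idea and not Hub's actual implementation: the BufferedStore class, its cache_limit parameter, and the remote_writes counter are all hypothetical names invented for this illustration.

```python
class BufferedStore:
    """Hypothetical stand-in for a dataset backed by slow remote storage."""

    def __init__(self, cache_limit=1000):
        self.remote = []            # items that have reached "long-term storage"
        self.pending = []           # items waiting in the local cache
        self.remote_writes = 0      # number of simulated round trips
        self.cache_limit = cache_limit
        self._batching = False      # True while inside a with block

    def flush(self):
        # Push everything in the local cache in a single simulated write.
        if self.pending:
            self.remote.extend(self.pending)
            self.pending.clear()
            self.remote_writes += 1

    def append(self, item):
        self.pending.append(item)
        # Outside a with block, push immediately.
        # Inside one, push only when the local cache is full.
        if not self._batching or len(self.pending) >= self.cache_limit:
            self.flush()

    def __enter__(self):
        self._batching = True
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the with block pushes whatever is still cached.
        self._batching = False
        self.flush()
        return False


# Without 'with': one simulated write per append.
eager = BufferedStore()
for i in range(10):
    eager.append(i)

# With 'with': a single simulated write when the block exits.
with BufferedStore() as batched:
    for i in range(10):
        batched.append(i)

print(eager.remote_writes, batched.remote_writes)  # prints "10 1"
```

Both objects end up holding the same ten items; only the number of write operations differs. In this model, a cache_limit smaller than the number of appends would trigger intermediate flushes, mirroring the "local cache is full" case described above.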