By default, whenever you update any element of a dataset, the update is immediately pushed to the dataset's long-term storage location. A large number of discrete write operations can significantly increase runtime. In the example below, an update is pushed to AWS S3 for every call to the .append() command inside the for loop:
    from hub import Dataset

    ds = Dataset('s3://bucket_name/dataset_name')  # Dataset is stored on AWS S3
    ds.create_tensor('my_tensor')

    for i in range(10):
        ds.my_tensor.append(i)  # S3 bucket is updated after every append command
To reduce runtime when using Hub, use the with syntax below. It improves performance significantly because updates are pushed to long-term storage only after the code block inside the with statement has executed, or when the local cache is full.
    from hub import Dataset

    with Dataset('s3://bucket_name/dataset_name') as ds:
        ds.create_tensor('my_tensor')
        for i in range(10):
            ds.my_tensor.append(i)
    # S3 bucket is updated at the end of the code block inside 'with'
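To see why the with block helps, the following toy model counts how many writes reach storage in each mode. It is a minimal sketch and not Hub's actual implementation: CountingStore, BufferedTensor, and the cache_size parameter are hypothetical names invented for this illustration, and the in-memory list stands in for an S3 bucket.

```python
class CountingStore:
    """Hypothetical stand-in for slow long-term storage such as an S3 bucket."""

    def __init__(self):
        self.data = []
        self.write_calls = 0  # number of separate write operations received

    def write(self, items):
        self.write_calls += 1
        self.data.extend(items)


class BufferedTensor:
    """Toy tensor: write-through by default, buffered inside a 'with' block."""

    def __init__(self, store, cache_size=4):
        self.store = store
        self.cache_size = cache_size  # hypothetical cache limit, in items
        self._cache = []
        self._buffering = False

    def append(self, item):
        if not self._buffering:
            # Outside 'with': every append is pushed to storage immediately.
            self.store.write([item])
            return
        self._cache.append(item)
        if len(self._cache) >= self.cache_size:
            # The local cache is full, so flush early.
            self.flush()

    def flush(self):
        if self._cache:
            self.store.write(self._cache)
            self._cache = []

    def __enter__(self):
        self._buffering = True
        return self

    def __exit__(self, exc_type, exc, tb):
        # Push whatever is still cached when the block ends.
        self.flush()
        self._buffering = False
        return False


# Without 'with': one storage write per append.
eager_store = CountingStore()
eager = BufferedTensor(eager_store)
for i in range(10):
    eager.append(i)

# With 'with': writes happen only when the cache fills or the block exits.
batched_store = CountingStore()
with BufferedTensor(batched_store, cache_size=4) as batched:
    for i in range(10):
        batched.append(i)

print(eager_store.write_calls, batched_store.write_calls)  # prints "10 3"
```

Both runs leave the same ten values in storage, but the buffered run reaches it in three writes instead of ten: two when the four-item cache fills and one when the block exits. With real network storage, each avoided write also avoids a round trip, which is where the runtime saving comes from.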