By default, every update to any element of a dataset is immediately pushed to the dataset's long-term storage location. When a script performs many discrete write operations, this can significantly increase runtime. In the example below, an update is pushed to AWS S3 for every call to the .append() command inside of the for loop.
import hub

dataset_path = 's3://bucket_name/dataset_name'

ds = hub.empty(dataset_path)  # Dataset is stored on AWS S3

ds.create_tensor('my_tensor_1')

for i in range(10):
    # Long-term storage is updated after every append command
    ds.my_tensor_1.append(i)
To reduce runtime when using Hub, wrap the updates in a with block, as in the example below. The with syntax significantly improves performance because updates are pushed to long-term storage only after the code block inside the with statement has finished executing, or when the local cache is full.
ds = hub.empty(dataset_path)

with ds:
    ds.create_tensor('my_tensor_2')

    for i in range(10):
        ds.my_tensor_2.append(i)

# Long-term storage is updated at the end of the code block inside 'with'
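To see why the with block reduces the number of storage writes, the toy model below may help. It is a minimal sketch and not Hub's actual implementation: CountingStore, BufferedTensor, count_writes, and the cache_size parameter are all hypothetical names introduced for illustration, and the "storage" is just an in-memory list that counts how often it is written to.

```python
class CountingStore:
    """Hypothetical stand-in for long-term storage; counts write calls."""

    def __init__(self):
        self.writes = 0
        self.items = []

    def write(self, batch):
        self.writes += 1
        self.items.extend(batch)


class BufferedTensor:
    """Toy model (an assumption, not Hub's code) of immediate vs. deferred writes."""

    def __init__(self, store, cache_size=1000):
        self.store = store
        self.cache_size = cache_size  # hypothetical local cache capacity, in items
        self._cache = []
        self._buffering = False

    def append(self, item):
        self._cache.append(item)
        # Outside a 'with' block, every append is flushed immediately.
        # Inside one, flush only when the local cache is full.
        if not self._buffering or len(self._cache) >= self.cache_size:
            self.flush()

    def flush(self):
        if self._cache:
            self.store.write(self._cache)
            self._cache = []

    def __enter__(self):
        self._buffering = True
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the 'with' block pushes whatever is still cached.
        self._buffering = False
        self.flush()
        return False


def count_writes(n_items, use_with, cache_size=1000):
    """Append n_items and return (number of storage writes, stored items)."""
    store = CountingStore()
    tensor = BufferedTensor(store, cache_size)
    if use_with:
        with tensor:
            for i in range(n_items):
                tensor.append(i)
    else:
        for i in range(n_items):
            tensor.append(i)
    return store.writes, store.items
```

Under these assumptions, count_writes(10, use_with=False) performs ten writes, count_writes(10, use_with=True) performs one, and count_writes(10, use_with=True, cache_size=4) performs three (two when the cache fills, one on exit). The stored items are identical in every case; only the number of round trips to storage changes.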