Storage Synchronization
Synchronizing data with long-term storage and achieving optimal performance using Hub.
This page explains how Hub synchronizes dataset updates with the permanent storage location for the dataset.

Without with

For code written outside of a with block, every update to a dataset is pushed immediately to the dataset's long-term storage location. The large number of discrete write operations can significantly increase runtime. In the example below, an update is pushed to AWS S3 on every call to .append() inside the for loop.
import hub

dataset_path = 's3://bucket_name/dataset_name'

ds = hub.empty(dataset_path)  # Dataset is stored on AWS S3

ds.create_tensor('my_tensor')

for i in range(10):
    ds.my_tensor.append(i)  # Long-term storage is updated after every append

With with

To reduce runtime, wrap updates in a with block, as in the example below. Hub then pushes updates to long-term storage only when the code block inside the with statement finishes executing, or when the local cache is full. This greatly reduces the number of discrete write operations and can increase speed by up to 100X.
ds = hub.empty(dataset_path)

with ds:
    ds.create_tensor('my_tensor_2')

    for i in range(10):
        ds.my_tensor_2.append(i)

# Long-term storage is updated at the end of the code block inside 'with'
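To see why batching matters, the difference between the two modes can be sketched with a toy model. This is not Hub's implementation: CountingStorage, ToyDataset, count_writes, and the cache_limit parameter are hypothetical names invented for this illustration. The sketch only assumes the behavior described above: outside a with block every append is written immediately, while inside one, writes are held back until the block exits or the cache fills up.

```python
class CountingStorage:
    """Hypothetical stand-in for long-term storage that counts write operations."""

    def __init__(self):
        self.data = {}
        self.writes = 0

    def write(self, items):
        # One call models one round trip to long-term storage.
        self.data.update(items)
        self.writes += 1


class ToyDataset:
    """Toy model of immediate vs. batched writes (an assumption, not Hub's code)."""

    def __init__(self, storage, cache_limit=1000):
        self.storage = storage
        self.cache_limit = cache_limit  # max cached items before a forced flush
        self._cache = {}
        self._batching = False
        self._next_index = 0

    def append(self, value):
        self._cache[self._next_index] = value
        self._next_index += 1
        # Outside 'with', flush on every append; inside, only when the cache is full.
        if not self._batching or len(self._cache) >= self.cache_limit:
            self.flush()

    def flush(self):
        if self._cache:
            self.storage.write(self._cache)
            self._cache = {}

    def __enter__(self):
        self._batching = True
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the block pushes whatever is still cached.
        self._batching = False
        self.flush()
        return False


def count_writes(use_with, n=10, cache_limit=1000):
    """Append n values and return how many storage writes were issued."""
    storage = CountingStorage()
    ds = ToyDataset(storage, cache_limit)
    if use_with:
        with ds:
            for i in range(n):
                ds.append(i)
    else:
        for i in range(n):
            ds.append(i)
    return storage.writes
```

In this toy model, count_writes(use_with=False) issues 10 writes for 10 appends, while count_writes(use_with=True) issues a single write when the block exits. With a small cache, for example cache_limit=4, the with version issues 3 writes: two when the cache fills and one on exit.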