Step 8: Dataset Version Control
Managing changes to your datasets using Version Control.
Note: Version Control is still in Alpha.
Hub dataset version control allows users to manage changes to datasets with commands very similar to Git. It provides critical insights into how your data is evolving, and it works with datasets of any size!
Let's create a hub dataset and check out how dataset version control works in Hub!
import hub
import numpy as np

# Set overwrite = True for re-runability
ds = hub.dataset('./version_control', overwrite = True)

# Create a tensor and append 200X 100x100x3 arrays
with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.images.extend(np.ones((200, 100, 100, 3), dtype = 'uint8'))

Commit

To commit the data added above, simply run ds.commit:
first_commit_id = ds.commit('Added 200X 100x100x3 arrays')

print('Dataset in commit {} has {} samples'.format(first_commit_id, len(ds)))
The printout shows that the first commit has 200 samples. Next, let's add 50X more samples and commit the update:
with ds:
    ds.images.extend(np.ones((50, 150, 150, 3), dtype = 'uint8'))

second_commit_id = ds.commit('Added 50X 150x150x3 arrays')
print('Dataset in commit {} has {} samples'.format(second_commit_id, len(ds)))
The printout now shows that the second commit has 250 samples.

Log

The commit history starting from the current commit can be shown using ds.log:
log = ds.log()
This command prints the log to the console and also assigns it to the specified variable log. The author of each commit is the username of the Activeloop account that is logged in on the machine.

Branch

To create a branch, run the ds.checkout command with the parameter create = True. Let's create a new branch, add a labels tensor, populate it with data, create a new commit on that branch, and display the log.
ds.checkout('new_branch', create = True)

with ds:
    ds.create_tensor('labels', htype = 'class_label')
    ds.labels.extend(np.zeros((250,1), dtype = 'uint32'))

new_branch_commit_id = ds.commit('Added labels tensor and 250X labels')
print('Dataset in commit {} has tensors: {}'.format(new_branch_commit_id, ds.tensors))
The printout shows that the dataset on the new_branch branch contains images and labels tensors.
The log now shows a commit on new_branch as well as the previous commits on main:
ds.log()

Checkout

A branch or a previous commit can be checked out using ds.checkout. Here we switch back to the main branch:
ds.checkout('main')

print('Dataset on {} branch has tensors: {}'.format('main', ds.tensors))
As expected, the printout shows that the dataset on main only contains the images tensor, since the labels tensor was added on new_branch.

HEAD Commit

Unlike Git, Hub's dataset version control does not have a staging area because changes to datasets are not stored locally before they are committed. All changes are automatically reflected in the dataset's permanent storage (local or cloud). Therefore, any changes to a dataset are automatically stored in a HEAD commit on the current branch. This means that the uncommitted changes do not appear on other branches. Let's see how this works:
You should currently be on the main branch, which has 250 samples. Let's add 75 more samples:
print('Dataset on {} branch has {} samples'.format('main', len(ds)))

with ds:
    ds.images.extend(np.zeros((75, 100, 100, 3), dtype = 'uint8'))

print('After updating, the HEAD commit on {} branch has {} samples'.format('main', len(ds)))
Next, if you check out the first commit, the dataset contains 200 samples, which is the sample count from when the first commit was made. The 75 uncommitted samples that were added to the main branch above are therefore not reflected when other branches or commits are checked out.
ds.checkout(first_commit_id)

print('Dataset in commit {} has {} samples'.format(first_commit_id, len(ds)))
Finally, when checking out the main branch again, the prior uncommitted changes are visible, because they are stored in the HEAD commit on main:
ds.checkout('main')

print('Dataset in {} branch has {} samples'.format('main', len(ds)))
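To make the HEAD commit behavior concrete, here is a minimal in-memory sketch in plain Python. It is not Hub's implementation and does not use the Hub API; the class ToyVersionedDataset and all of its methods are hypothetical names invented for illustration. It only mirrors the idea described above: each branch has a mutable HEAD that receives writes immediately, while commits are frozen snapshots.

```python
class ToyVersionedDataset:
    """Toy model of branch HEADs and frozen commits (hypothetical, not Hub)."""

    def __init__(self):
        self._commits = {}            # commit id -> frozen tuple of samples
        self._heads = {'main': []}    # branch name -> mutable HEAD samples
        self._branch = 'main'         # None while a past commit is checked out
        self._view = ()               # samples of the checked-out past commit
        self._counter = 0

    def extend(self, samples):
        # Writes go straight into the HEAD of the current branch.
        if self._branch is None:
            raise RuntimeError('cannot write while a past commit is checked out')
        self._heads[self._branch].extend(samples)

    def commit(self):
        # Freeze the current HEAD contents under a new commit id.
        if self._branch is None:
            raise RuntimeError('cannot commit while a past commit is checked out')
        self._counter += 1
        commit_id = 'c{}'.format(self._counter)
        self._commits[commit_id] = tuple(self._heads[self._branch])
        return commit_id

    def checkout(self, target):
        # A branch name selects that branch's HEAD; anything else is a commit id.
        if target in self._heads:
            self._branch, self._view = target, ()
        else:
            self._branch, self._view = None, self._commits[target]

    def __len__(self):
        if self._branch is not None:
            return len(self._heads[self._branch])
        return len(self._view)


toy = ToyVersionedDataset()
toy.extend(range(200))
first = toy.commit()
toy.extend(range(50))
toy.commit()
toy.extend(range(75))        # uncommitted: lives only in the HEAD of main
head_len = len(toy)

toy.checkout(first)          # frozen snapshot from the first commit
first_len = len(toy)

toy.checkout('main')         # the uncommitted samples are still in HEAD
main_len = len(toy)

print(head_len, first_len, main_len)
```

The sample counts follow the walkthrough above: the HEAD on main holds all appended samples, while the first commit still holds only the samples that existed when it was made.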

Diff - Coming Soon

Understanding changes between commits is critical for managing the evolution of datasets. The diff function will enable users to determine the number of samples that were added, removed, or updated for each tensor. Activeloop is currently working on an implementation.
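Since the diff function is not yet available, the following is only a plain-Python sketch of the kind of per-tensor summary described above. It does not use or predict Hub's API: the function toy_diff, the snapshot layout (tensor name mapped to a dictionary of sample IDs and values), and the sample IDs themselves are all assumptions made for illustration.

```python
def toy_diff(old, new):
    """Count added, removed, and updated samples per tensor.

    Each snapshot maps a tensor name to {sample_id: value}.
    This layout is hypothetical and is not how Hub stores data.
    """
    summary = {}
    for tensor in sorted(set(old) | set(new)):
        before = old.get(tensor, {})
        after = new.get(tensor, {})
        shared = set(before) & set(after)
        summary[tensor] = {
            'added': len(set(after) - set(before)),
            'removed': len(set(before) - set(after)),
            'updated': sum(1 for key in shared if before[key] != after[key]),
        }
    return summary


old_snapshot = {'images': {0: 'a', 1: 'b', 2: 'c'}}
new_snapshot = {'images': {0: 'a', 1: 'B', 3: 'd'}, 'labels': {0: 1, 1: 0}}

result = toy_diff(old_snapshot, new_snapshot)
for tensor, counts in result.items():
    print(tensor, counts)
```

In this toy example, the images tensor has one sample added, one removed, and one updated, and the labels tensor is entirely new.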

Merge - Coming Soon

Merging is a critical feature for collaborating on datasets, and Activeloop is currently working on an implementation.
Congrats! You are now an expert in dataset version control! 🎓