Step 9: Dataset Version Control
Managing changes to your datasets using Version Control.
Hub dataset version control allows you to manage changes to datasets with commands very similar to Git. It provides critical insights into how your data is evolving, and it works with datasets of any size!
Let's check out how dataset version control works in Hub! If you haven't done so already, please download and unzip the animals dataset from Step 2.
First, let's create a Hub dataset in the ./version_control_hub folder.
import hub
import numpy as np
from PIL import Image

# Set overwrite = True for re-runability
ds = hub.dataset('./version_control_hub', overwrite = True)

# Create a tensor and add an image
with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.images.append(hub.read('./animals/cats/image_1.jpg'))
The first image in this dataset is a picture of a cat:
Image.fromarray(ds.images[0].numpy())

Commit

To commit the data added above, simply run ds.commit:
first_commit_id = ds.commit('Added image of a cat')

print('Dataset in commit {} has {} samples'.format(first_commit_id, len(ds)))
Next, let's add another image and commit the update:
with ds:
    ds.images.append(hub.read('./animals/dogs/image_3.jpg'))

second_commit_id = ds.commit('Added an image of a dog')

print('Dataset in commit {} has {} samples'.format(second_commit_id, len(ds)))
The second image in this dataset is a picture of a dog:
Image.fromarray(ds.images[1].numpy())

Log

The commit history starting from the current commit can be shown using ds.log:
log = ds.log()
This command prints the log to the console and also assigns it to the variable log. The author of each commit is the username of the Activeloop account that is logged in on the machine.
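To make "history starting from the current commit" concrete, here is a minimal toy model in plain Python. It is only a sketch under the assumption that each commit points to its parent; ToyCommit and toy_log are hypothetical names for illustration and do not reflect how Hub stores or formats its log.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyCommit:
    """Hypothetical stand-in for a commit: an id, a message, and a parent link."""
    commit_id: str
    message: str
    parent: Optional["ToyCommit"] = None


def toy_log(current: Optional[ToyCommit]) -> List[str]:
    """Walk from the current commit back through its ancestors, newest first."""
    history = []
    node = current
    while node is not None:
        history.append(f"{node.commit_id}: {node.message}")
        node = node.parent
    return history


# Mirror the two commits made in this step (the ids are made up).
first = ToyCommit("c1", "Added image of a cat")
second = ToyCommit("c2", "Added an image of a dog", parent=first)

history = toy_log(second)
for line in history:
    print(line)
```

Because the walk starts at the current commit and only follows parent links, commits made later or on other branches never show up in the result.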

Branch

Branching takes place by running the ds.checkout command with the parameter create = True. Let's create a new branch dog_flipped, flip the second image (dog), and create a new commit on that branch.
ds.checkout('dog_flipped', create = True)

with ds:
    # Read the sample into a numpy array, then swap its height and width axes
    ds.images[1] = np.transpose(ds.images[1].numpy(), axes=[1,0,2])

flipped_commit_id = ds.commit('Flipped the dog image')
The dog image is now flipped and the log shows a commit on the dog_flipped branch as well as the previous commits on main:
Image.fromarray(ds.images[1].numpy())

ds.log()
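The "flip" applied on the dog_flipped branch is an axis transposition, which swaps the height and width of the image rather than mirroring it. The sketch below uses a small synthetic array instead of the real dog image (an assumption made so that it runs without the dataset) and compares np.transpose with np.fliplr, a mirroring function that the tutorial itself does not use.

```python
import numpy as np

# A tiny stand-in "image": height 2, width 3, 3 color channels.
image = np.arange(2 * 3 * 3).reshape(2, 3, 3)

# Same call as in the tutorial: swap axis 0 (height) with axis 1 (width).
transposed = np.transpose(image, axes=[1, 0, 2])

# For comparison: a left-right mirror keeps the shape and reverses the columns.
mirrored = np.fliplr(image)

print(image.shape, transposed.shape, mirrored.shape)
```

For a non-square image, the transposed result has its height and width swapped, so it displays rotated and mirrored along the diagonal.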

Checkout

A previous commit or another branch can be checked out using ds.checkout. Here we switch back to the main branch:
ds.checkout('main')

Image.fromarray(ds.images[1].numpy())
As expected, the dog image on main is not flipped.

Diff

Understanding changes between commits is critical for managing the evolution of datasets. Hub's ds.diff function enables users to determine the number of samples that were added, removed, or updated for each tensor. The function can be used in 3 ways:
ds.diff() # Diff between the current state and the last commit

ds.diff(commit_id) # Diff between the current state and a specific commit

ds.diff(commit_id_1, commit_id_2) # Diff between two specific commits
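As a rough illustration of what "added, removed, or updated" counts mean, the toy function below compares two snapshots of a single tensor in plain Python. It is a sketch only: toy_diff is a hypothetical helper, the position-by-position comparison is an assumption, and it does not reproduce the output format or internals of ds.diff.

```python
def toy_diff(old, new):
    """Count samples added, removed, or updated between two snapshots.

    Samples are compared position by position, which is an assumption
    made for this illustration.
    """
    shared = min(len(old), len(new))
    return {
        "added": max(len(new) - len(old), 0),
        "removed": max(len(old) - len(new), 0),
        "updated": sum(1 for i in range(shared) if old[i] != new[i]),
    }


# Stand-ins for the states built in this step (strings instead of images).
second_commit = ["cat", "dog"]
flipped_commit = ["cat", "dog (transposed)"]
uncommitted_main = ["cat", "dog", "second dog"]

diff_to_flipped = toy_diff(second_commit, flipped_commit)
diff_to_uncommitted = toy_diff(second_commit, uncommitted_main)
print(diff_to_flipped)
print(diff_to_uncommitted)
```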

HEAD Commit

Unlike Git, Hub's dataset version control does not have a staging area because changes to datasets are not stored locally before they are committed. All changes are automatically reflected in the dataset's permanent storage (local or cloud). Therefore, any changes to a dataset are automatically stored in a HEAD commit on the current branch. This means that the uncommitted changes do not appear on other branches. Let's see how this works:
You should currently be on the main branch, which has 2 samples. Let's add another image:
print('Dataset on {} branch has {} samples'.format('main', len(ds)))

with ds:
    ds.images.append(hub.read('./animals/dogs/image_4.jpg'))

print('After updating, the HEAD commit on {} branch has {} samples'.format('main', len(ds)))
The 3rd sample is also an image of a dog:
Image.fromarray(ds.images[2].numpy())
Next, if you check out the dog_flipped branch, the dataset contains 2 samples, which is the sample count from when that branch was created. The uncommitted third sample that was added to the main branch above is therefore not reflected when other branches or commits are checked out.
ds.checkout('dog_flipped')

print('Dataset in {} branch has {} samples'.format('dog_flipped', len(ds)))
Finally, when checking out the main branch again, the prior uncommitted changes are available, since they are stored in the HEAD commit on main:
ds.checkout('main')

print('Dataset in {} branch has {} samples'.format('main', len(ds)))
The dataset now contains 3 samples and the uncommitted dog image is visible:
Image.fromarray(ds.images[2].numpy())
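The per-branch HEAD behavior walked through in this section can be mimicked with a small toy class in plain Python. This is a sketch under simplifying assumptions: ToyVersionedDataset is a hypothetical name, it tracks only one list of samples per branch, and it ignores commit ids entirely, so it is not how Hub is implemented.

```python
class ToyVersionedDataset:
    """Toy model: every branch keeps its own HEAD state of samples."""

    def __init__(self):
        self.branch = "main"
        # Maps branch name -> samples in that branch's HEAD,
        # including changes that were never committed.
        self.heads = {"main": []}

    def append(self, sample):
        # Changes land directly in the HEAD of the current branch.
        self.heads[self.branch].append(sample)

    def checkout(self, branch, create=False):
        if create:
            if branch in self.heads:
                raise ValueError(f"branch {branch!r} already exists")
            # A new branch starts from a copy of the current state.
            self.heads[branch] = list(self.heads[self.branch])
        elif branch not in self.heads:
            raise KeyError(f"unknown branch {branch!r}")
        self.branch = branch

    def __len__(self):
        return len(self.heads[self.branch])


toy = ToyVersionedDataset()
toy.append("cat")
toy.append("dog")
toy.checkout("dog_flipped", create=True)

toy.checkout("main")
toy.append("second dog")            # uncommitted change on main
samples_on_main = len(toy)

toy.checkout("dog_flipped")
samples_on_dog_flipped = len(toy)   # the extra sample is not visible here

toy.checkout("main")
samples_back_on_main = len(toy)     # the extra sample is still in main's HEAD

print(samples_on_main, samples_on_dog_flipped, samples_back_on_main)
```

The sample counts follow the same 3, 2, 3 pattern as the tutorial: the extra sample lives only in the HEAD of main and reappears when main is checked out again.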

Merge - Coming Soon

Merging is a critical feature for collaborating on datasets, and Activeloop is currently working on an implementation.
Congrats! You are now an expert in dataset version control! 🎓