Step 9: Dataset Version Control
Managing changes to your datasets using Version Control.
How to Use Version Control in Deep Lake
Deep Lake dataset version control allows you to manage changes to datasets with commands very similar to Git. It provides critical insights into how your data is evolving, and it works with datasets of any size!
Let's check out how dataset version control works in Deep Lake! If you haven't done so already, please download and unzip the animals dataset from Step 2.
First, let's create a Deep Lake dataset in the ./version_control_deeplake folder.
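As a minimal sketch of this step, assuming the Deep Lake 3.x Python API (`deeplake.empty`, `create_tensor`, `deeplake.read`): the helper name `create_dataset_with_image`, the tensor name `images`, and the JPEG compression setting are illustrative choices, not taken from this page.

```python
def create_dataset_with_image(path, image_path):
    """Create an empty dataset at `path` and append one image to it."""
    # Imported lazily so the helper can be defined even where Deep Lake
    # is not installed.
    import deeplake

    # overwrite=True makes the sketch re-runnable; note that it replaces
    # any dataset that already exists at `path`.
    ds = deeplake.empty(path, overwrite=True)

    with ds:
        # "images" is an assumed tensor name for this walkthrough.
        ds.create_tensor("images", htype="image", sample_compression="jpeg")
        ds.images.append(deeplake.read(image_path))

    return ds
```

It could be called as `ds = create_dataset_with_image("./version_control_deeplake", cat_image_path)`, where `cat_image_path` is a placeholder for the path to one of the cat images in the unzipped animals dataset.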
The first image in this dataset is a picture of a cat:
Commit
To commit the data added above, simply run ds.commit:
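A minimal sketch of that call, assuming `ds` is the dataset created above and that `ds.commit` returns the ID of the new commit, as in the Deep Lake 3.x API. The helper name `commit_and_report` and the printed summary are illustrative.

```python
def commit_and_report(ds, message):
    """Commit the current state of `ds` and return the new commit ID."""
    commit_id = ds.commit(message)
    # len(ds) is the number of samples in the dataset at this point.
    print(f"Commit {commit_id} recorded with {len(ds)} samples")
    return commit_id
```

For example, `first_commit_id = commit_and_report(ds, "Added image of a cat")`. Keeping the returned ID is useful later, because commit IDs can be passed to ds.checkout and ds.diff.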
Next, let's add another image and commit the update:
The second image in this dataset is a picture of a dog:
Log
The commit history starting from the current commit can be shown using ds.log:
This command prints the log to the console and also assigns it to the specified variable log. The author of the commit is the username of the Activeloop account that is logged in on the machine.
Branch
Branching takes place by running the ds.checkout command with the parameter create = True. Let's create a new branch dog_flipped, flip the second image (dog), and create a new commit on that branch.
The dog image is now flipped and the log shows a commit on the dog_flipped branch as well as the previous commits on main:
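The branch, update, and commit sequence described in this section could be sketched as follows, assuming the Deep Lake 3.x API (`ds.checkout(..., create=True)`, tensor indexing with `.numpy()`, and in-place sample assignment). The helper name `update_on_new_branch` and its parameters are hypothetical, and `transform` stands for any function that maps the sample's array to a new array.

```python
def update_on_new_branch(ds, branch, tensor, index, transform, message):
    """Create `branch`, rewrite one sample on it, commit, and print the log."""
    # create=True makes a new branch starting from the current commit.
    ds.checkout(branch, create=True)

    with ds:
        original = ds[tensor][index].numpy()
        # Replace the sample in place with its transformed version.
        ds[tensor][index] = transform(original)

    commit_id = ds.commit(message)
    # The log now includes the commit on the new branch.
    ds.log()
    return commit_id
```

One possible call is `update_on_new_branch(ds, "dog_flipped", "images", 1, numpy.flipud, "Flipped the dog image")`, which assumes NumPy is imported and that flipping the array top to bottom is the intended transformation.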
Checkout
A previous commit or branch can be checked out using ds.checkout:
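A short sketch of a checkout, assuming the Deep Lake 3.x behavior in which ds.checkout accepts either a branch name or a commit ID. The helper name `checkout_and_count` is illustrative.

```python
def checkout_and_count(ds, address):
    """Check out a branch name or commit ID and return the sample count there."""
    ds.checkout(address)
    return len(ds)
```

For instance, `checkout_and_count(ds, "main")` returns to the main branch, while passing a commit ID saved from an earlier ds.commit call checks out that specific commit.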
As expected, the dog image on main is not flipped.
Diff
Understanding changes between commits is critical for managing the evolution of datasets. Deep Lake's ds.diff function enables users to determine the number of samples that were added, removed, or updated for each tensor. The function can be used in three ways:
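The sketch below shows three call forms, based on the Deep Lake 3.x signature of ds.diff, which takes up to two optional commit IDs or branch names. The helper name `show_diffs` and the arguments `commit_a` and `commit_b` are placeholders for IDs saved from earlier commits.

```python
def show_diffs(ds, commit_a, commit_b):
    """Print diffs using the three call forms of ds.diff."""
    # 1. No arguments: current state compared with the previous commit.
    #    At the head of a branch, this shows any uncommitted changes.
    ds.diff()

    # 2. One argument: current state compared with the given commit.
    ds.diff(commit_a)

    # 3. Two arguments: the two given commits compared with each other.
    ds.diff(commit_a, commit_b)
```

In this walkthrough, `commit_a` and `commit_b` could be the IDs returned when committing the cat and dog images.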
HEAD Commit
Unlike Git, Deep Lake's dataset version control does not have a local staging area because all dataset updates are immediately synced with the permanent storage location (cloud or local). Therefore, any changes to a dataset are automatically stored in a HEAD commit on the current branch. This means that uncommitted changes do not appear on other branches, but they are visible to all users.
Let's see how this works:
You should currently be on the main branch, which has 2 samples. You can check for uncommitted changes using:
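One way to inspect the HEAD state, assuming the `has_head_changes` and `branch` properties of a Deep Lake 3.x dataset. The helper name `describe_head` and the wording of its summary string are illustrative.

```python
def describe_head(ds):
    """Summarize the current branch, sample count, and uncommitted state."""
    if ds.has_head_changes:
        state = "uncommitted changes"
    else:
        state = "no uncommitted changes"
    return f"branch {ds.branch!r}: {len(ds)} samples, {state}"
```

Printing `describe_head(ds)` before and after appending a sample makes it easy to see when the HEAD commit starts to hold uncommitted changes.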
Let's add another image:
The 3rd sample is also an image of a dog:
Next, if you check out the dog_flipped branch, the dataset contains 2 samples, which is the sample count from when that branch was created. Therefore, the additional uncommitted third sample that was added to the main branch above is not reflected when other branches or commits are checked out.
Finally, when checking out the main branch again, the prior uncommitted changes are available, and they are stored in the HEAD commit on main:
The dataset now contains 3 samples and the uncommitted dog image is visible:
You can delete any uncommitted changes using the reset command below, which will bring the main branch back to the state with 2 samples.
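A guarded version of the reset could look like this, assuming the Deep Lake 3.x `ds.reset()` method and `has_head_changes` property. The helper name `discard_uncommitted` is hypothetical.

```python
def discard_uncommitted(ds):
    """Drop uncommitted changes on the current branch, if there are any.

    Returns True when a reset was performed and False otherwise.
    """
    if not ds.has_head_changes:
        return False
    # Discards everything stored in the HEAD commit of the current branch.
    ds.reset()
    return True
```

Checking `has_head_changes` first is a design choice that makes the helper safe to call repeatedly; calling `ds.reset()` directly is the shorter form.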
Merge
Merging is a critical feature for collaborating on datasets. It enables you to modify data on separate branches before making those changes available on the main branch, thus enabling you to experiment on your data without affecting workflows by other collaborators.
We are currently on the main branch where the picture of the dog is right-side-up.
We can merge the dog_flipped branch into main using the command below:
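A minimal sketch of the merge, assuming the Deep Lake 3.x `ds.merge` method, which merges the named branch into the branch that is currently checked out. The helper name `merge_branch` and its `into` parameter are illustrative.

```python
def merge_branch(ds, source_branch, into="main"):
    """Merge `source_branch` into the `into` branch and print the log."""
    # Make sure the target branch is the one currently checked out.
    ds.checkout(into)
    ds.merge(source_branch)
    # The log should now include the merge commit on the target branch.
    ds.log()
```

Here the call would be `merge_branch(ds, "dog_flipped")`. Since the walkthrough is already on main at this point, calling `ds.merge("dog_flipped")` on its own has the same effect.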
After merging the dog_flipped branch, we observe that the image of the dog is flipped. The dataset log now has a commit indicating that a commit from another branch was merged to main.
Congrats! You are now an expert in dataset version control! 🎓