Version Control and Querying
Understanding Deep Lake's Version Control and Querying Layout
Version control is the core of the Deep Lake data format, and it interacts with queries and views as follows:
- Datasets have commits and branches, and they can be traversed or merged using Deep Lake's Python API.
- Queries are applied on top of commits. To save a query result as a view, the dataset must not have uncommitted changes (that is, no changes were made since the prior commit).
- Each saved view is associated with a particular commit, and the view itself contains information on which dataset indices satisfied the query condition.
This approach was chosen to preserve data lineage. Otherwise, the data on which a query was executed could be changed after the view was saved, and the indices stored in the view might no longer satisfy the query condition.
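The lineage argument above can be illustrated with a small toy model. This is a sketch in plain Python, not the Deep Lake API: `ToyDataset` and all of its methods are hypothetical names invented for illustration. It assumes a view is stored as a commit id plus a list of indices, and that saving a view is refused while uncommitted changes exist.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataset:
    """Toy stand-in for a versioned dataset (hypothetical, NOT the Deep Lake API)."""
    rows: list = field(default_factory=list)
    commits: list = field(default_factory=list)  # one immutable snapshot per commit
    dirty: bool = False                          # True if rows changed since the last commit

    def append(self, value):
        self.rows.append(value)
        self.dirty = True

    def update(self, index, value):
        self.rows[index] = value
        self.dirty = True

    def commit(self):
        self.commits.append(list(self.rows))     # copy, so later edits cannot alter it
        self.dirty = False
        return len(self.commits) - 1             # commit id = position in history

    def query(self, predicate):
        # A query result is the list of indices whose rows satisfy the condition.
        return [i for i, row in enumerate(self.rows) if predicate(row)]

    def save_view(self, indices):
        # Only allow saving when the dataset matches its latest commit exactly.
        if self.dirty or not self.commits:
            raise RuntimeError("commit changes before saving a view")
        return {"commit_id": len(self.commits) - 1, "indices": list(indices)}

    def load_view(self, view):
        # Resolve the indices against the commit the view was saved on.
        snapshot = self.commits[view["commit_id"]]
        return [snapshot[i] for i in view["indices"]]


def is_cat(row):
    return row == "cat"


ds = ToyDataset()
for label in ["cat", "dog", "cat"]:
    ds.append(label)

try:
    ds.save_view(ds.query(is_cat))               # uncommitted changes: rejected
    saved_while_dirty = True
except RuntimeError:
    saved_while_dirty = False

ds.commit()
view = ds.save_view(ds.query(is_cat))            # accepted, tied to commit 0

ds.update(0, "bird")                             # the data changes after the view is saved
from_commit = ds.load_view(view)                 # indices resolved against the saved commit
from_head = [ds.rows[i] for i in view["indices"]]  # same indices against the changed data
```

In this toy model, `from_commit` still contains only rows that satisfied the query, while `from_head` includes the modified row. Tying the view to a commit is what keeps the stored indices meaningful.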
An example workflow using version control and queries is shown below.
Unlike Git, Deep Lake's dataset version control does not have a local staging area because all dataset updates are immediately synced with the permanent storage location (cloud or local). Therefore, any changes to a dataset are automatically stored in a HEAD commit on the current branch. As a result, uncommitted changes do not appear on other branches, but they are visible to all users.
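The visibility rule described above can be sketched with another toy model. Again, this is plain Python and not the Deep Lake API: `ToyStorage`, `ToyClient`, the user names, and the sample value are all hypothetical. It assumes one shared storage location that holds a HEAD per branch and no per-user staging area.

```python
class ToyStorage:
    """Toy shared storage (hypothetical): one uncommitted HEAD per branch."""

    def __init__(self, branches):
        self.head = {name: [] for name in branches}


class ToyClient:
    """Toy user session (hypothetical) attached to one branch of the shared storage."""

    def __init__(self, storage, branch):
        self.storage = storage
        self.branch = branch

    def append(self, value):
        # No local staging: the write lands directly in the branch's HEAD.
        self.storage.head[self.branch].append(value)

    def read(self):
        # Reads see whatever is currently in the branch's HEAD.
        return list(self.storage.head[self.branch])


storage = ToyStorage(["main", "dev"])
alice = ToyClient(storage, "main")
bob = ToyClient(storage, "main")
carol = ToyClient(storage, "dev")

alice.append("img_001")          # uncommitted change on main

seen_by_bob = bob.read()         # same branch: the change is visible
seen_by_carol = carol.read()     # other branch: the change is absent
```

Here a second user on the same branch sees the uncommitted write immediately, while a user on a different branch does not.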