Querying, Training and Editing Datasets with Data Lineage
How to use queries and version control while training models.
The road from raw data to a trainable deep-learning dataset can be treacherous, often involving multiple tools glued together with spaghetti code. Activeloop simplifies this journey so you can create high-quality datasets and train production-level deep-learning models.
Create a Deep Lake dataset from data stored in an S3 bucket
Visualize the data to gain insights about the underlying data challenges
Update, edit, and store different versions of the data with reproducibility
Query the data, save the query result, and materialize it for training a model.
Train an object detection model while streaming data
In addition to installation of commonly used packages, this playbook requires installation of:
The required Python imports are:
You should also register with Activeloop and create an API token in the UI.
To convert the original dataset to Deep Lake format, let's establish a connection to the original data in S3.
Next, let's load the annotations so we can access them later:
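COCO stores all annotations for a split in a single JSON file, so grouping them by image_id up front makes the upload loop straightforward. The sketch below inlines a tiny stand-in for the annotation file; the field names follow the COCO format, but the file path and values are illustrative:

```python
import json
from collections import defaultdict

# Tiny inlined stand-in for a COCO annotation file (e.g. instances_train.json).
# In practice: coco = json.load(open("instances_train.json"))
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg"}, {"id": 2, "file_name": "000002.jpg"}],
    "annotations": [
        {"image_id": 1, "bbox": [10, 20, 50, 60], "category_id": 3},
        {"image_id": 1, "bbox": [5, 5, 30, 30], "category_id": 10},
        {"image_id": 2, "bbox": [0, 0, 100, 40], "category_id": 3},
    ],
    "categories": [{"id": 3, "name": "car"}, {"id": 10, "name": "traffic light"}],
}

# Group the annotations by image so each image's boxes can be appended together.
annotations_by_image = defaultdict(list)
for ann in coco["annotations"]:
    annotations_by_image[ann["image_id"]].append(ann)

# Map COCO category ids to human-readable class names.
category_names = {c["id"]: c["name"] for c in coco["categories"]}
```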
Last but not least, let's create the Deep Lake dataset's tensors. In this example, we ignore the segmentations and keypoints from the COCO dataset, only uploading the bounding box annotations as well as their labels.
Finally, let's iterate through the data and append it to our Deep Lake dataset. Note that when appending data, we directly pass the S3 URL and the managed credentials key for accessing that URL using deeplake.link(url, creds_key).
Note: if dataset creation speed is a priority, it can be accelerated in two ways:
A quick visual inspection of the dataset reveals several problems with the data, including:
In sample 8, a road sign is labeled as a stop sign, even though the sign is facing away from the camera. While it may indeed be a stop sign, computer vision systems should positively identify the type of a road sign based on its visible text. Therefore, let's remove the stop sign label from this image.
Both changes are now evident in the visualizer, and they were both logged as separate commits in the version control history. A summary of this inspection workflow is shown below:
Now that the dataset has been improved, we save the query result containing the samples of interest and optimize the data for training. Since query results are associated with a particular commit, they are immutable and can be retrieved at any point in time.
First, let's re-run the query and save the result as a dataset view, which is uniquely identified by an id.
The dataset currently stores references to the images in S3, so the images are not rapidly streamable for training. Therefore, we materialize the query result (Dataset View) by copying and re-chunking the data for maximum performance:
Once we're finished using the materialized dataset view, we may choose to delete it using:
When using subsets of datasets, it's advisable to remap the input classes for model training. In this example, the source dataset has 81 classes, but we are only interested in 7 of them (cars, buses, trucks, bicycles, motorcycles, traffic lights, and stop signs). Therefore, we remap the classes of interest to the values 0,1,2,3,4,5,6 before feeding them into the model for training. We also specify the resolution used for resizing the data before training the model.
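The remapping itself can be a plain Python lookup. The class list below matches the 7 classes of interest; the helper name is hypothetical:

```python
CLASSES_OF_INTEREST = [
    "car", "bus", "truck", "bicycle", "motorcycle", "traffic light", "stop sign"
]

# Map each retained class name to a contiguous model id 0..6.
# Note: if the downstream model reserves id 0 for background (as torchvision
# detection models do), shift these ids by 1.
class_to_model_id = {name: i for i, name in enumerate(CLASSES_OF_INTEREST)}

def remap(labels):
    """Map class names to contiguous model ids, dropping all other classes."""
    return [class_to_model_id[l] for l in labels if l in class_to_model_id]

print(remap(["car", "person", "stop sign"]))  # → [0, 6]
```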
You can now create a PyTorch dataloader that connects the Deep Lake dataset to the PyTorch model using the provided method ds_view.pytorch(). This method automatically applies the transformation function and takes care of random shuffling (if desired). The num_workers parameter can be used to parallelize data preprocessing, which is critical for ensuring that preprocessing does not bottleneck the overall training workflow.
Training is performed on a GPU if one is available; otherwise, it falls back to the CPU.
Let's initialize the model and optimizer.
The model and data are ready for training 🚀!
Since many real-world datasets use the COCO annotation format, the COCO dataset is used in this playbook. To avoid data duplication, linked tensors are used so that the Deep Lake dataset stores references to the images in the S3 bucket containing the original data. For simplicity, only the bounding box annotations are copied to the Deep Lake dataset.
Moving on, let's create an empty Deep Lake dataset and pull managed credentials from Platform, so that we don't have to manually specify the credentials for accessing the S3 links every time we use this dataset. Since the Deep Lake dataset is stored in Deep Lake storage, we also provide an API token to identify the user.
The UI for managed credentials in Platform is shown below, and more details are available in the documentation.
By uploading the dataset in parallel; an example is available in the documentation.
By setting the optional parameters below to False. In this case, the upload machine will not load any of the data before creating the dataset, speeding up the upload by as much as 100X. These parameters default to True because they improve query speed on image shapes and file metadata, and because they verify the integrity of the data before uploading. More information is available in the documentation.
In this example, we will train an object detection model for driving applications. Therefore, we are interested in images containing cars, buses, trucks, bicycles, motorcycles, traffic lights, and stop signs, which we can find by running a SQL query on the dataset in Platform. More details on the query syntax are available in the documentation.
Sample 61 is labeled as containing a traffic light, but it is a low-quality image in which it's very difficult to discern the features, and it is not clear whether the small object in the distance is an actual traffic light. Images like this do not positively contribute to model performance, so let's delete all the data in this sample.
An object detection model can be trained using the same approach that is used for all Deep Lake datasets; several examples are available in the Deep Lake tutorials. Typically, training occurs on another machine with more GPU power, so we start by loading the dataset and the corresponding dataset view:
Next, let's specify an augmentation pipeline, which mostly utilizes standard image-augmentation transforms. We perform the remapping of the class labels inside the transformation function.
This playbook uses a pre-trained object detection model from the torchvision.models module. We define helper functions for loading the model and for training one epoch.
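A one-epoch training helper might be sketched as follows; torchvision detection models return a dict of losses when called with (images, targets) in train mode, and the dataloader is assumed to yield lists of images and targets:

```python
import torch

def train_one_epoch(model, optimizer, dataloader, device):
    """One pass over the data for a torchvision-style detection model."""
    model.train()
    total_loss = 0.0
    for images, targets in dataloader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # In train mode the model returns a dict of losses; sum them.
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / max(len(dataloader), 1)
```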