After creating a Deep Lake dataset, you may need to edit it by adding, deleting, and modifying the data. In this tutorial, we show best practices for updating datasets.
Create a Representative Deep Lake Dataset
First, let's download and unzip representative source data and create a Deep Lake dataset for this tutorial:
This dataset includes segmentation and object detection annotations for vehicle damage, but for this tutorial we upload only the images and labels (damage location).
import deeplake
import pandas as pd
import os
from PIL import Image

images_directory = '/damaged_cars_tutorial'  # Path to the COCO images directory
annotation_file = '/damaged_cars_tutorial/COCO_mul_val_annos.json'  # Path to the COCO annotations file
deeplake_path = '/damaged_cars_dataset'  # Path to the Deep Lake dataset

ds = deeplake.ingest_coco(
    images_directory,
    annotation_file,
    deeplake_path,
    key_to_tensor_mapping={'category_id': 'labels'},  # Rename category_id to labels
    ignore_keys=['area', 'image_id', 'id', 'segmentation', 'bbox', 'iscrowd'],
)
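ds.summary() shows that the dataset has two tensors, each with 11 samples.
The supplemental data used below is a pandas DataFrame, df_color, that maps image filenames to vehicle colors. There are two approaches for adding this data to the Deep Lake dataset: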
1. Iterate through the Deep Lake samples and append data
This approach is recommended when most Deep Lake samples are being updated using the supplemental data (dense update).
First, we create a color tensor and iterate through the samples. For each sample, we look up the color in the df_color DataFrame and append it to the color tensor. If no color exists for a filename, None is appended instead. We use the filename as the lookup key; it is available in the ds.images[index].sample_info dictionary.
with ds:
    ds.create_tensor('color', htype='class_label')

    # After creating an empty tensor, the length of the dataset is 0
    # Therefore, we iterate over ds.max_view, which is the padded version of the dataset
    for i, sample in enumerate(ds.max_view):
        filename = os.path.basename(sample.images.sample_info['filename'])
        color = df_color[df_color['filename'] == filename]['color'].values
        ds.color.append(None if len(color) == 0 else color)
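The loop above filters df_color once for every sample. As a minimal sketch of the same dense lookup, the snippet below builds a filename-to-color dictionary once and then resolves each sample with a single dictionary lookup. It uses plain Python lists as stand-ins for the dataset and the DataFrame; the file paths, colors, and the helper name align_colors are hypothetical and not part of Deep Lake.

```python
import os

def align_colors(sample_paths, color_records):
    """Return one color per sample path, or None where no color is known.

    sample_paths  -- file paths in dataset order (stand-in for the
                     sample_info['filename'] values of ds.images)
    color_records -- (filename, color) pairs (stand-in for df_color rows)
    """
    # Build the lookup table once instead of filtering per sample.
    color_by_name = dict(color_records)
    # dict.get returns None for file names without a color entry.
    return [color_by_name.get(os.path.basename(p)) for p in sample_paths]

# Hypothetical data for illustration only.
paths = ['/data/1.jpg', '/data/3.jpg', '/data/9.jpg']
records = [('1.jpg', 'gray'), ('9.jpg', 'blue')]
colors = align_colors(paths, records)
```

Each entry of the resulting list could then be appended to the color tensor in order, so the tensor stays aligned with the images. With a real DataFrame, dict(zip(df_color['filename'], df_color['color'])) would build the same dictionary.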
2. Iterate through the supplemental data and add data at the corresponding Deep Lake index
This approach is recommended when the data updates are sparse.
First, let's create a color2 tensor and then load all the existing Deep Lake filenames into memory. We then iterate through the supplemental data and find the corresponding Deep Lake index at which to insert the color information.
with ds:
    ds.create_tensor('color2', htype='class_label')

    filenames = [os.path.basename(sample_info['filename']) for sample_info in ds.images.sample_info]

    for fn in df_color['filename'].values:
        index = filenames.index(fn)
        ds.color2[index] = df_color[df_color['filename'] == fn]['color'].values[0]
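Note that filenames.index(fn) raises a ValueError when the supplemental data names a file that is not in the dataset, and it rescans the list on every call. The following sketch, again using plain Python stand-ins rather than Deep Lake objects, precomputes a filename-to-index map and collects unmatched names separately. The helper plan_sparse_updates and the sample data are hypothetical.

```python
import os

def plan_sparse_updates(sample_paths, color_records):
    """Map supplemental records to dataset indices.

    Returns (updates, missing): updates is a dict of index -> color,
    missing lists file names that are absent from the dataset.
    """
    # First occurrence wins, matching the behavior of list.index().
    index_by_name = {}
    for i, path in enumerate(sample_paths):
        index_by_name.setdefault(os.path.basename(path), i)

    updates, missing = {}, []
    for name, color in color_records:
        if name in index_by_name:
            updates[index_by_name[name]] = color
        else:
            missing.append(name)  # list.index() would raise ValueError here
    return updates, missing

# Hypothetical data for illustration only.
paths = ['/data/1.jpg', '/data/3.jpg', '/data/9.jpg']
records = [('9.jpg', 'blue'), ('1.jpg', 'gray'), ('404.jpg', 'red')]
updates, missing = plan_sparse_updates(paths, records)
```

The updates dictionary could then drive the assignments to ds.color2[index], while missing gives a list of supplemental rows to review instead of an exception partway through the loop.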
Now ds.summary() shows 4 tensors, each with 11 samples (though the color and color2 tensors have several empty samples).
Originally, we did not specify a color for image 3.jpg. Let's find the index for this image, look at it, and add the color manually. We've already loaded the Deep Lake dataset's filenames into memory above, so we can find the index using:
index = filenames.index('3.jpg')
Let's visualize the image using PIL. We could also visualize it using ds.visualize() (which requires pip install "deeplake[visualizer]") or using the Deep Lake App.
Image.fromarray(ds.images[index].numpy())
Since the image is white, let's update the color using:
ds.color[index] = 'white'
Delete Samples
Rows from a dataset can be deleted using ds.pop(). To delete the row at index 8 we run:
ds.pop(8)
Now ds.summary() shows 10 rows in the dataset (instead of 11).
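Deleting a row shifts every later row down by one index, which matters when several rows are removed one at a time. The sketch below illustrates the effect with a plain Python list standing in for the dataset. It assumes that ds.pop(i) renumbers the remaining rows the same way list.pop(i) does; pop_many is a hypothetical helper, not a Deep Lake function.

```python
def pop_many(rows, indices):
    """Remove several positions from a list, one pop at a time."""
    # Deduplicate and go from the highest index down so that earlier
    # pops do not shift the positions still waiting to be removed.
    for i in sorted(set(indices), reverse=True):
        rows.pop(i)
    return rows

# Hypothetical rows for illustration only.
rows = ['a', 'b', 'c', 'd', 'e']
remaining = pop_many(list(rows), [1, 3])
```

Popping in ascending order instead would remove 'b' and then 'e', because 'd' has already moved to index 2 by the time index 3 is popped.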