How to evaluate model performance and compare ground-truth annotations with model predictions.
Models are never perfect after the first training, and model predictions need to be compared with ground-truth annotations in order to iterate on the training process. This comparison often reveals incorrectly annotated data and sheds light on the types of data where the model fails to make the correct prediction.
Improve training data by finding data for which the model has poor performance
Train an object detection model using a Deep Lake dataset
Upload the training loss per image to a branch on the dataset designated for evaluating model performance
Sort the training dataset based on model loss and identify bad samples
Edit and clean the bad training data and commit the changes
Evaluate model performance on validation data and identify difficult data
Compute model predictions of object detections for a validation Deep Lake dataset
Upload the model predictions to the validation dataset, compare them to ground-truth annotations, and identify samples for which the model fails to make the correct predictions.
In addition to installation of commonly used packages, this playbook requires installation of:
pip3 install deeplake
pip3 install albumentations
pip3 install opencv-python-headless== # In order for Albumentations to work properly
You should also register with Activeloop and create an API token in the UI.
Creating the Dataset
In this playbook we will use the svhn-train and -test datasets that are already hosted by Activeloop. Let's copy them to our own organization dl-corp in order to have write access:
Since we will write the model results back to the Deep Lake datasets, let's create a group called model_evaluation in the datasets and add tensors that will store the model results.
Putting the model results in a separate group will prevent the visualizer from confusing the predictions and ground-truth data.
# Store the loss in the training dataset
ds_train.create_group('model_evaluation')
ds_train.model_evaluation.create_tensor('loss')

# Store the predictions for the labels, boxes, and the average iou of the
# boxes, for the test dataset
ds_test.create_group('model_evaluation')
ds_test.model_evaluation.create_tensor('labels', htype = 'class_label', class_names = ds_test.labels.info.class_names)
ds_test.model_evaluation.create_tensor('boxes', htype = 'bbox', coords = {'type': 'pixel', 'mode': 'LTWH'})
ds_test.model_evaluation.create_tensor('iou')
Training an Object Detection Model
An object detection model can be trained using the same approach that is used for all Deep Lake datasets, with several examples in our tutorials. First, let's specify an augmentation pipeline, which mostly utilizes Albumentations. We also define several helper functions for resizing and converting the format of bounding boxes.
import albumentations as A
from albumentations.pytorch import ToTensorV2
import numpy as np
import torch

# WIDTH and HEIGHT (the model input size in pixels) must be defined before this point.

# Augmentation pipeline for training using Albumentations
tform_train = A.Compose([
    A.RandomSizedBBoxSafeCrop(width=WIDTH, height=HEIGHT, erosion_rate=0.2),
    A.Rotate(limit=20, p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.1, contrast_limit=0.1, p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2()
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels', 'bbox_ids'], min_area=8, min_visibility=0.6))
# The fields listed in 'label_fields' are cut whenever their bounding box is cut.

# Augmentation pipeline for validation using Albumentations
tform_val = A.Compose([
    A.Resize(width=WIDTH, height=HEIGHT),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2()
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels', 'bbox_ids'], min_area=8, min_visibility=0.6))
# The fields listed in 'label_fields' are cut whenever their bounding box is cut.

# Transformation function for pre-processing the Deep Lake training sample before sending it to the model
def transform_train(sample_in):

    # Convert any grayscale images to RGB
    image = sample_in['images']
    shape = image.shape
    if shape[2] == 1:
        image = np.repeat(image, 3, axis = 2)

    # Convert boxes to Pascal VOC format
    boxes = coco_2_pascal(sample_in['boxes'], shape)

    # Pass all data to the Albumentations transformation
    transformed = tform_train(image = image,
                              bboxes = boxes,
                              bbox_ids = np.arange(boxes.shape[0]),
                              class_labels = sample_in['labels'],
                              )

    # Convert boxes and labels from lists to torch tensors, because Albumentations does not do that automatically.
    # Be very careful with rounding and casting to integers, because that can create bounding boxes with invalid dimensions
    labels_torch = torch.tensor(transformed['class_labels'], dtype = torch.int64)

    boxes_torch = torch.zeros((len(transformed['bboxes']), 4), dtype = torch.int64)
    for b, box in enumerate(transformed['bboxes']):
        boxes_torch[b,:] = torch.tensor(np.round(box))

    # Put annotations in a separate object
    target = {'labels': labels_torch, 'boxes': boxes_torch}

    return transformed['image'], target

# Transformation function for pre-processing the Deep Lake validation sample before sending it to the model
def transform_val(sample_in):

    # Convert any grayscale images to RGB
    image = sample_in['images']
    shape = image.shape
    if shape[2] == 1:
        image = np.repeat(image, 3, axis = 2)

    # Convert boxes to Pascal VOC format
    boxes = coco_2_pascal(sample_in['boxes'], shape)

    # Pass all data to the Albumentations transformation
    transformed = tform_val(image = image,
                            bboxes = boxes,
                            bbox_ids = np.arange(boxes.shape[0]),
                            class_labels = sample_in['labels'],
                            )

    # Convert boxes and labels from lists to torch tensors, because Albumentations does not do that automatically.
    # Be very careful with rounding and casting to integers, because that can create bounding boxes with invalid dimensions
    labels_torch = torch.tensor(transformed['class_labels'], dtype = torch.int64)

    boxes_torch = torch.zeros((len(transformed['bboxes']), 4), dtype = torch.int64)
    for b, box in enumerate(transformed['bboxes']):
        boxes_torch[b,:] = torch.tensor(np.round(box))

    # Put annotations in a separate object
    target = {'labels': labels_torch, 'boxes': boxes_torch}

    # We also return the shape of the original image in order to resize the predictions to the dataset image size
    return transformed['image'], target, sample_in['index'], shape

# Conversion script for bounding boxes from COCO to Pascal VOC format
def coco_2_pascal(boxes, shape):
    # Convert bounding boxes to Pascal VOC format and clip bounding boxes to make sure they have non-negative width and height
    return np.stack((np.clip(boxes[:,0], 0, None),
                     np.clip(boxes[:,1], 0, None),
                     np.clip(boxes[:,0]+np.clip(boxes[:,2], 1, None), 0, shape[1]),
                     np.clip(boxes[:,1]+np.clip(boxes[:,3], 1, None), 0, shape[0])), axis = 1)

# Conversion script for resizing the model predictions back to the shape of the dataset image
def model_2_image(boxes, model_shape, img_shape):
    # Resize the bounding boxes and convert them from Pascal VOC to COCO
    m_h, m_w = model_shape
    i_h, i_w = img_shape

    x0 = boxes[:,0]*(i_w/m_w)
    y0 = boxes[:,1]*(i_h/m_h)
    x1 = boxes[:,2]*(i_w/m_w)
    y1 = boxes[:,3]*(i_h/m_h)

    return np.stack((x0, y0, x1-x0, y1-y0), axis = 1)

def collate_fn(batch):
    return tuple(zip(*batch))
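To make the box-format conversions concrete, here is a minimal standalone sketch of the same arithmetic for a single box, written in plain Python without NumPy. The function names `ltwh_to_xyxy` and `xyxy_to_ltwh` are illustrative only; they are not part of Deep Lake or Albumentations, and they are meant to mirror the logic of `coco_2_pascal` and `model_2_image` rather than replace them.

```python
def ltwh_to_xyxy(box, img_h, img_w):
    """Convert one [left, top, width, height] box to [x0, y0, x1, y1], clipped to the image."""
    left, top, width, height = box
    x0 = max(left, 0)
    y0 = max(top, 0)
    # Enforce a minimum width and height of 1 pixel, then clip to the image bounds
    x1 = min(max(left + max(width, 1), 0), img_w)
    y1 = min(max(top + max(height, 1), 0), img_h)
    return [x0, y0, x1, y1]


def xyxy_to_ltwh(box, model_shape, img_shape):
    """Rescale one [x0, y0, x1, y1] box from model size to image size and return [left, top, width, height]."""
    m_h, m_w = model_shape
    i_h, i_w = img_shape
    scale_x = i_w / m_w
    scale_y = i_h / m_h
    x0, y0, x1, y1 = box
    x0, x1 = x0 * scale_x, x1 * scale_x
    y0, y1 = y0 * scale_y, y1 * scale_y
    return [x0, y0, x1 - x0, y1 - y0]


# A box that extends past the right and bottom edges of a 100 x 200 (height x width) image
print(ltwh_to_xyxy([190, 90, 30, 40], img_h=100, img_w=200))

# A prediction made at model size 64 x 128, mapped back to a 128 x 256 image
print(xyxy_to_ltwh([32, 16, 64, 48], model_shape=(64, 128), img_shape=(128, 256)))
```

Note that shapes are given as (height, width), so the x coordinates are scaled by the width ratio and the y coordinates by the height ratio.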
We can now create a PyTorch dataloader that connects the Deep Lake dataset to the PyTorch model using the provided method ds.pytorch(). This method automatically applies the transformation function and takes care of random shuffling (if desired). The num_workers parameter can be used to parallelize data preprocessing, which is critical for ensuring that preprocessing does not bottleneck the overall training workflow.
This playbook uses a pre-trained torchvision neural network from the torchvision.models module. We define helper functions for loading the model and for training 1 epoch.
import math
import time
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Helper function for loading the model
def get_model_object_detection(num_classes):

    # Load an object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # Get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features

    # Replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

# Helper function for training for 1 epoch
def train_one_epoch(model, optimizer, data_loader, device):

    model.train()

    start_time = time.time()
    for i, data in enumerate(data_loader):

        # Move the images and annotations to the training device
        images = list(image.to(device) for image in data[0])
        targets = [{k: v.to(device) for k, v in t.items()} for t in data[1]]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        loss_value = losses.item()

        # Print performance statistics
        if i%100 == 0:
            batch_time = time.time()
            speed = (i+1)/(batch_time-start_time)
            print('[%5d] loss: %.3f, speed: %.2f' % (i, loss_value, speed))

        if not math.isfinite(loss_value):
            print(f"Loss is {loss_value}, stopping training")
            print(loss_dict)
            break

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
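Training is performed on a GPU if one is available. Otherwise, it runs on a CPU.
# Use a GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = get_model_object_detection(NUM_CLASSES)
model.to(device)

# Specify the optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)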
The model and data are ready for training 🚀!
# Train the model
num_epochs = 3

lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

for epoch in range(num_epochs):  # loop over the dataset multiple times
    print("------------------ Training Epoch {} ------------------".format(epoch+1))
    train_one_epoch(model, optimizer, train_loader, device)
    lr_scheduler.step()

print('Finished Training')

# Save the trained weights
torch.save(model.state_dict(), 'model_weights_svhn_first_train.pth')
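For intuition about the scheduler settings above, the following is a small sketch of the closed-form step-decay rule that a step scheduler follows: the learning rate is multiplied by `gamma` once every `step_size` epochs. The helper `step_lr` is a hypothetical name used only for this illustration and is not a PyTorch function.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate in effect during a given (0-based) epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Same settings as the training loop: lr=0.005, step_size=1, gamma=0.1, 3 epochs
for epoch in range(3):
    print(f"epoch {epoch + 1}: lr = {step_lr(0.005, 0.1, 1, epoch):.6f}")
```

With `step_size=1` and `gamma=0.1`, the learning rate drops tenfold after every epoch, so the third epoch makes only very small updates.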
Evaluating Model Performance on Training Data
Evaluating the performance of the model on a per-image basis can be a powerful tool for identifying bad or difficult data. First, we define a helper function that does a forward pass through the model and computes the loss per image, without updating the weights. Since the model outputs the loss per batch, this function requires a batch size of 1.
def evaluate_loss(model, data_loader, device):
    # This function assumes the data loader may be shuffled, and it returns the loss sorted by sample index,
    # using knowledge of the indices that are being evaluated in each batch.

    # Set the model to train mode in order to get the loss, even though we're not training.
    model.train()

    loss_list = []
    indices_list = []

    assert data_loader.batch_size == 1

    start_time = time.time()
    for i, data in enumerate(data_loader):

        # Move the images and annotations to the evaluation device
        images = list(image.to(device) for image in data[0])
        targets = [{k: v.to(device) for k, v in t.items()} for t in data[1]]
        indices = data[2]

        # Forward pass without gradient tracking, so the weights are not updated
        with torch.no_grad():
            loss_dict = model(images, targets)

        losses = sum(loss for loss in loss_dict.values())
        loss_value = losses.item()

        loss_list.append(loss_value)
        indices_list.append(indices)

        # Print performance statistics
        if i%100 == 0:
            batch_time = time.time()
            speed = (i+1)/(batch_time-start_time)
            print('[%5d] loss: %.3f, speed: %.2f' % (i, loss_value, speed))

    # Sort the losses by sample index, in case shuffling was used in the dataloader
    loss_list = [x for _, x in sorted(zip(indices_list, loss_list))]

    return loss_list
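The last step of `evaluate_loss` restores dataset order by sorting the results on the sample index. A minimal sketch of that idea with plain Python lists follows; the helper name `reorder_by_index` is hypothetical. Unlike the one-liner in the function above, this variant sorts on the index only, which is an assumption-free way to avoid ever comparing the values themselves (useful when the values are arrays rather than floats).

```python
def reorder_by_index(indices, values):
    """Return values rearranged so that they follow ascending sample index."""
    # Sort the (index, value) pairs on the index alone
    pairs = sorted(zip(indices, values), key=lambda pair: pair[0])
    return [value for _, value in pairs]


# Losses collected in shuffled order: sample 2 first, then sample 0, then sample 1
shuffled_indices = [2, 0, 1]
shuffled_losses = [0.9, 0.1, 0.5]
print(reorder_by_index(shuffled_indices, shuffled_losses))
```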
Next, let's create another PyTorch dataloader on the training dataset that is not shuffled, has a batch size of 1, uses the evaluation transform, and returns the indices of the samples in the current batch using return_index = True:
Finally, we evaluate the loss for each image, write it back to the dataset, and add a commit to the training_run branch that we created at the start of this playbook:
with ds_train:
    ds_train.model_evaluation.loss.extend(loss_eval)

ds_train.commit('Trained the model and computed the loss for each image.')
Inspecting the Training Dataset based on Model Results
The dataset can be sorted based on loss in Activeloop Platform. An inspection of the high-loss images immediately reveals that many of them have poor quality or are incorrectly annotated.
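The same ranking can also be sketched locally before opening the visualizer. This is a minimal illustration that assumes the per-image losses are available as a plain Python list ordered by dataset index; the helper name `top_k_loss_indices` and the loss values are made up for the example.

```python
def top_k_loss_indices(losses, k):
    """Return the dataset indices of the k samples with the highest loss."""
    ranked = sorted(range(len(losses)), key=lambda idx: losses[idx], reverse=True)
    return ranked[:k]


# Made-up losses for a five-sample dataset
losses = [0.12, 1.85, 0.40, 2.31, 0.05]
print(top_k_loss_indices(losses, 2))  # prints [3, 1]
```

The returned indices are the candidates to inspect first for poor image quality or incorrect annotations.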
We can edit some of the bad data by deleting the incorrect annotation of "1" at index 14997, and by removing the poor-quality samples at indices 2899 and 32467.
# Remove label "1" from 14997. It's in the first position in the labels and boxes arrays
ds_train.labels[14997] = ds_train.labels[14997].numpy()[1:]
ds_train.boxes[14997] = ds_train.boxes[14997].numpy()[1:,:]
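# Delete bad samples (higher index first, so that removing one sample does not shift the index of the other)
ds_train.pop(32467)
ds_train.pop(2899)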
Lastly, we commit the edits in order to permanently store this snapshot of the data.
ds_train.commit('Updated labels at index 14997 and deleted samples at 2899 and 32467')
The next step would be to perform a more exhaustive inspection of the high-loss data and make further improvements to the dataset, after which the model should be re-trained.
Evaluating Model Performance on Validation Data
After iterating on the training data and re-training the model, a general assessment of model performance should be performed on validation data that was not used to train the model. We create a helper function for running an inference of the model on the validation data that returns the model predictions and the average IOU (intersection-over-union) for each sample:
# Run an inference of the model and compute the average IOU (intersection-over-union) for each sample
def evaluate_iou(model, data_loader, num_classes, device='cpu', score_thresh=0.5):
    # This function removes predictions in the output and IOU calculation that are below a confidence threshold.

    # This function assumes the data loader may be shuffled, and it returns the results sorted by sample index,
    # using knowledge of the indices that are being evaluated in each batch.

    # Set the model to eval mode.
    model.eval()

    ious_list = []
    boxes_list = []
    labels_list = []
    indices_list = []

    start_time = time.time()
    for i, data in enumerate(data_loader):

        images = list(image.to(device) for image in data[0])
        ground_truths = [{k: v.to(device) for k, v in t.items()} for t in data[1]]
        indices = data[2]

        model_start = time.time()
        with torch.no_grad():
            predictions = model(images)
        model_end = time.time()

        assert len(ground_truths) == len(predictions) == len(indices)  # Check if data in dataloader is consistent

        for j, pred in enumerate(predictions):

            # Ignore boxes below the confidence threshold
            thresh_inds = pred['scores'] > score_thresh
            pred_boxes = pred['boxes'][thresh_inds]
            pred_labels = pred['labels'][thresh_inds]
            pred_scores = pred['scores'][thresh_inds]

            # Find the union of predicted and ground-truth labels and iterate through it
            all_labels = np.union1d(pred_labels.to('cpu'), ground_truths[j]['labels'].to('cpu'))

            ious = np.zeros((len(all_labels)))
            for l, label in enumerate(all_labels):

                # Find the boxes corresponding to the label
                boxes_1 = pred_boxes[pred_labels == label]
                boxes_2 = ground_truths[j]['boxes'][ground_truths[j]['labels'] == label]

                # This method returns a matrix of the IOU of each box with every other box.
                iou = torchvision.ops.box_iou(boxes_1, boxes_2).cpu()

                # Consider the IOU as the maximum overlap of a box with any other box. Find the max along the axis that has the most boxes.
                if 0 in iou.shape:
                    ious[l] = 0
                else:
                    if boxes_1.shape > boxes_2.shape:
                        max_iou, _ = iou.max(dim=0)
                    else:
                        max_iou, _ = iou.max(dim=1)

                    # Compute the average iou for that label
                    ious[l] = np.mean(np.array(max_iou))

            # Take the average iou for all the labels. If there are no labels, set the iou to 0.
            if len(ious) > 0:
                ious_list.append(np.mean(ious))
            else:
                ious_list.append(0)

            # Convert the bounding boxes back to the shape of the original image
            boxes_list.append(model_2_image(pred_boxes.cpu(), (HEIGHT, WIDTH), (data[3][j][0], data[3][j][1])))
            labels_list.append(np.array(pred_labels.cpu()))
            indices_list.append(indices[j])

        # Print progress
        if i%100 == 0:
            batch_time = time.time()
            speed = (i+1)/(batch_time-start_time)
            print('[%5d] speed: %.2f' % (i, speed))

    # Sort the data based on index, just in case shuffling was used in the dataloader
    ious_list = [x for _, x in sorted(zip(indices_list, ious_list))]
    boxes_list = [x for _, x in sorted(zip(indices_list, boxes_list))]
    labels_list = [x for _, x in sorted(zip(indices_list, labels_list))]

    return ious_list, boxes_list, labels_list
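As a reference for what a single entry of the IOU matrix means, here is a minimal plain-Python sketch of intersection-over-union for two boxes in [x0, y0, x1, y1] format. The function name `box_iou_xyxy` is illustrative only; the helper above relies on `torchvision.ops.box_iou`, which computes the same quantity for every pair of boxes at once.

```python
def box_iou_xyxy(box_a, box_b):
    """Intersection-over-union of two boxes given as [x0, y0, x1, y1]."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap region (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 10 x 10 boxes that overlap over half of their width:
# intersection = 50, union = 150, so the IOU is 1/3
print(box_iou_xyxy([0, 0, 10, 10], [5, 0, 15, 10]))
```

An IOU of 1 means the predicted and ground-truth boxes coincide exactly, and an IOU of 0 means they do not overlap at all.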
Let's create a PyTorch dataloader using the validation data and run the inference using evaluate_iou above.
Finally, we write the predictions back to the dataset and add a commit to the training_run branch that we created at the start of this playbook:
with ds_test:
    ds_test.model_evaluation.labels.extend(labels_eval_test)
    ds_test.model_evaluation.boxes.extend(boxes_eval_test)
    ds_test.model_evaluation.iou.extend(iou_eval_test)

ds_test.commit('Added model predictions.')
Comparing Model Results to Ground-Truth Annotations
When sorting the model predictions based on IOU, we observe that the model makes correct predictions in images that contain a single street number and where the digits are large relative to the image. However, the model predictions are very poor for data with small street numbers, and there are artifacts where the model mistakes vertical objects, such as narrow windows, for the number "1".
Understanding the edge cases for which the model makes incorrect predictions is critical for improving the model performance. If the edge cases are irrelevant given the model's intended use, they should be eliminated from both the training and validation data. If they are applicable, more representative edge cases should be added to the training dataset, or the edge cases should be sampled more frequently while training.
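One way to sample edge cases more frequently is to draw training samples with a probability that grows as the per-sample IOU shrinks. The following is only a sketch under stated assumptions: the weighting rule (1 - IOU plus a small floor) and the helper name `sampling_weights` are invented for this illustration, the IOU values are made up, and how such weights would be wired into a Deep Lake dataloader is not covered here.

```python
import random


def sampling_weights(ious, floor=0.1):
    """Turn per-sample IOU values into normalized sampling probabilities.

    Lower IOU gives a higher weight; the floor keeps well-predicted samples in play.
    """
    raw = [(1.0 - iou) + floor for iou in ious]
    total = sum(raw)
    return [w / total for w in raw]


# Made-up IOU values for three samples: easy, hard, and intermediate
ious = [0.9, 0.2, 0.5]
weights = sampling_weights(ious)

# Draw ten sample indices; the hard sample (index 1) is the most likely to be drawn
rng = random.Random(0)
drawn = rng.choices(range(len(ious)), weights=weights, k=10)
print(weights)
print(drawn)
```

The floor is a design choice: without it, a sample with an IOU of 1 would never be drawn again, which could let the model forget the cases it already handles well.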
Congratulations 🚀. You can now use Activeloop Deep Lake to evaluate the performance of your deep learning models and compare their predictions to the ground truth!