Creating Complex Datasets - Coming Soon

Datasets often have different types of labels such as classifications and bounding boxes. It's advisable to create a Hub Dataset and tensor hierarchy that captures the relationship between the different label types.

Suppose a dataset contains the classifications "indoor" and "outdoor", as well as localizations of objects such as "dog" and "cat", whose class names are listed in a file called classes.txt.

Note that the bounding boxes for a given image are stored in a file with the same base name as the image and the extension .txt.

data_dir
|_indoor
    |_image1.png
    |_image2.png
|_outdoor
    |_image3.png
    |_image4.png
|_boxes
    |_image1.txt
    |_image2.txt
    |_image3.txt
    |_image4.txt
    |_classes.txt
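
The naming convention above can be checked before the dataset is built. The following is a minimal sketch, assuming the folder layout shown above and .png images; the helper names box_path_for and find_missing_boxes are hypothetical and are not part of Hub.

```python
from pathlib import Path


def box_path_for(image_path, data_dir):
    """Return the expected annotation path: data_dir/boxes/<image stem>.txt."""
    return Path(data_dir) / "boxes" / (Path(image_path).stem + ".txt")


def find_missing_boxes(data_dir, class_folders=("indoor", "outdoor")):
    """List images in the class folders that have no matching box file."""
    missing = []
    for folder in class_folders:
        # Assumes images are .png files, as in the layout above
        for image_path in sorted((Path(data_dir) / folder).glob("*.png")):
            if not box_path_for(image_path, data_dir).is_file():
                missing.append(image_path)
    return missing
```

Calling find_missing_boxes('data_dir') returns an empty list when every image has an annotation file, which makes it a cheap check to run before populating any tensors.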

You can initialize a Hub Dataset and create tensors that mirror this structure, as shown in the code below.

import hub

hub_dataset_path = './complex_dataset'
ds = hub.empty(hub_dataset_path)

# Image
ds.create_tensor('images', htype='image', sample_compression='jpeg')

# Classification
ds.create_tensor('labels', dtype='int64')
# An even more rigorous approach would be to use:
# ds.create_tensor("classification/label", dtype="int64")

# Localization
ds.create_tensor('localization/bbox', dtype='float', htype='bbox')
ds.create_tensor('localization/labels', htype='class_label')
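
The integer localization labels are easier to interpret alongside the class names file. The following is a minimal sketch, assuming the file holds one class name per line with the line order matching the label indices; that format is not stated above, and read_class_names is a hypothetical helper, not a Hub function.

```python
def read_class_names(path):
    """Read class names from a text file, assuming one name per line."""
    with open(path) as f:
        # Skip blank lines so trailing newlines do not create empty names
        return [line.strip() for line in f if line.strip()]
```

Under that assumption, a file listing dog and then cat would map label 0 to "dog" and label 1 to "cat". Hub versions that document a class_names argument for the class_label htype could take this list when the tensor is created; check this against the installed version.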

Below is a helper function for parsing yolo .txt files.

import numpy as np

def read_yolo_box(fn):
    # Read a yolo .txt file and return an array of boxes and an array of labels

    with open(fn) as box_f:
        lines_split = box_f.read().splitlines()

    yolo_boxes = np.zeros((len(lines_split), 4))
    yolo_labels = np.zeros(len(lines_split), dtype=np.uint32)  # Labels are integer class indices

    # Each line holds a class index followed by four box coordinates
    for l, line in enumerate(lines_split):
        line_split = line.split()
        yolo_boxes[l, :] = [float(v) for v in line_split[1:5]]
        yolo_labels[l] = int(line_split[0])

    return yolo_boxes, yolo_labels
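
The helper above stores the four box values exactly as they appear in the file. The following is a minimal sketch of how one such line could be converted to pixel corner coordinates, assuming the files follow the common YOLO convention of "class x_center y_center width height" with coordinates normalized to the image size; yolo_to_pixel_corners is a hypothetical name used only for illustration.

```python
def yolo_to_pixel_corners(line, img_width, img_height):
    """Convert one YOLO line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    fields = line.split()
    class_id = int(fields[0])
    x_center, y_center, width, height = (float(v) for v in fields[1:5])

    # Normalized center/size -> pixel corners
    x_min = (x_center - width / 2) * img_width
    y_min = (y_center - height / 2) * img_height
    x_max = (x_center + width / 2) * img_width
    y_max = (y_center + height / 2) * img_height
    return class_id, x_min, y_min, x_max, y_max
```

For a 200 x 100 pixel image, the line "1 0.5 0.5 0.5 0.25" describes a box of class 1 spanning x from 50 to 150 and y from 37.5 to 62.5.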

Populate the tensors in the dataset by iterating through all of the classification images and pulling their classification labels, bounding box positions, and bounding box labels.

import glob
from os.path import basename, isdir, join, split, splitext

data_dir = 'data_dir'

# Subfolders with classification data: everything except the boxes folder,
# sorted so that the class indices are reproducible
folder_paths = sorted(p for p in glob.glob(join(data_dir, '*'))
                      if isdir(p) and basename(p) != 'boxes')

# Iterate through the classification subfolders (/indoor, /outdoor)
with ds:
    for class_label, folder_path in enumerate(folder_paths):
        paths = sorted(glob.glob(join(folder_path, '*'))) # Get image paths
        
        # Iterate through images in the classification subfolders
        for path in paths:
            img_name = splitext(split(path)[-1])[0]
            box_path = join(data_dir, 'boxes', img_name+'.txt') # Path of bounding box
            
            yolo_boxes, yolo_labels = read_yolo_box(box_path)
            ds.images.append(hub.read(path))  # Append image
            ds.labels.append(class_label) # Append classification label
            ds.localization.bbox.append(yolo_boxes)  # Append localization boxes
            ds.localization.labels.append(yolo_labels) # Append localization labels

Recap

In this tutorial, you saw the basic steps behind creating, populating, and saving complex data in a Hub Dataset.
