Datasets often have different types of labels such as classifications and bounding boxes. It's advisable to create a Hub Dataset and tensor hierarchy that captures the relationship between the different label types.
Suppose a dataset contains classifications of "indoor" and "outdoor", as well as bounding boxes that localize objects such as "dog" and "cat", whose class names are listed in a file called box_names.txt.
Note that the boxes for a given image are stored in a file with the same name as the image, but with the extension .txt.
You can initialize a Hub Dataset and create tensors that mirror this structure, as in the code below.
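The naming convention can be sketched as a small helper. This is only an illustration and not part of Hub: the function name `box_path_for`, the `boxes` folder, and the sample file names are assumptions chosen for the example.

```python
from os.path import basename, join, splitext


def box_path_for(image_path, boxes_dir):
    """Return the path of the .txt box file that matches an image.

    Hypothetical helper: assumes box files live in `boxes_dir` and share
    the image's base name, with the extension replaced by .txt.
    """
    image_name = splitext(basename(image_path))[0]
    return join(boxes_dir, image_name + ".txt")


# Made-up paths, used only to show the mapping
example = box_path_for(join("data_dir", "indoor", "image1.jpg"),
                       join("data_dir", "boxes"))
print(example)
```

Keeping the mapping in one place makes it easy to check that every image has a matching box file before any data is uploaded.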
import hub

hub_dataset_path = './complex_dataset'
ds = hub.empty(hub_dataset_path)

# Image
ds.create_tensor('images', htype='image', sample_compression='jpeg')

# Classification
ds.create_tensor('labels', dtype='int64')
# An even more rigorous approach would be to use:
# ds.create_tensor("classification/label", dtype="int64")

# Localization
ds.create_tensor('localization/bbox', dtype='float', htype='bbox')
ds.create_tensor('localization/label', htype='class_label')
Below is a helper function for parsing YOLO .txt files.
import numpy as np

def read_yolo_box(fn):
    # Read a YOLO .txt file and return an array of boxes and an array of labels
    with open(fn) as box_f:
        # Skip blank lines so that a trailing newline does not break parsing
        lines_split = [line for line in box_f.read().splitlines() if line.strip()]

    yolo_boxes = np.zeros((len(lines_split), 4))
    yolo_labels = np.zeros(len(lines_split), dtype=np.uint32)  # Integer class indices

    # Go through each line and parse data: <label> followed by 4 box values
    for l, line in enumerate(lines_split):
        line_split = line.split()
        yolo_boxes[l, :] = [float(value) for value in line_split[1:5]]
        yolo_labels[l] = int(line_split[0])

    return yolo_boxes, yolo_labels
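For reference, each line of a YOLO annotation file conventionally holds a class index followed by the box center and size, normalized to the image dimensions. The sketch below parses one such line with the standard library only and scales it to pixel units. The function names and the sample values are made up for illustration.

```python
def parse_yolo_line(line):
    """Split one YOLO line into (class_index, [x_center, y_center, width, height])."""
    fields = line.split()
    class_index = int(fields[0])
    box = [float(value) for value in fields[1:5]]
    return class_index, box


def to_pixels(box, image_width, image_height):
    """Scale a normalized [x_center, y_center, width, height] box to pixels."""
    x_center, y_center, width, height = box
    return [x_center * image_width, y_center * image_height,
            width * image_width, height * image_height]


# Made-up sample: class 1, centered box spanning half the image width
# and a quarter of its height
class_index, box = parse_yolo_line("1 0.5 0.5 0.5 0.25")
print(class_index, box)
print(to_pixels(box, image_width=640, image_height=480))
```

Storing the normalized values, as the helper function does, keeps the boxes independent of image resolution. The pixel conversion is only needed for drawing or inspection.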
Populate the tensors in the dataset by iterating through all of the classification images and pulling their classification labels, bounding box positions, and bounding box labels.
import glob
from os.path import split, splitext, join, isdir

data_dir = 'data_dir'  # Root folder holding the classification subfolders and the boxes folder

# Subfolders with classification data (/indoor, /outdoor).
# Sorted so that class indices are reproducible; the boxes folder is skipped.
folder_paths = sorted(p for p in glob.glob(join(data_dir, '*'))
                      if isdir(p) and split(p)[-1] != 'boxes')

# Iterate through the classification subfolders
with ds:
    for class_label, folder_path in enumerate(folder_paths):
        paths = glob.glob(join(folder_path, '*'))  # Get images in the subfolder

        # Iterate through images in the classification subfolder
        for path in paths:
            img_name = splitext(split(path)[-1])[0]
            box_path = join(data_dir, 'boxes', img_name + '.txt')  # Path of bounding box file

            yolo_boxes, yolo_labels = read_yolo_box(box_path)

            ds.images.append(hub.read(path))  # Append image
            ds.labels.append(class_label)  # Append classification label
            ds.localization.bbox.append(yolo_boxes)  # Append localization boxes
            ds.localization.label.append(yolo_labels)  # Append localization labels
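Because the classification label is simply the position of each subfolder in the list, it can help to fix that order explicitly and record which name maps to which integer. The sketch below is one possible approach, not something the tutorial prescribes: it sorts the subfolder names and leaves out the `boxes` folder. The helper name `class_index_from_folders` and the temporary directory layout are assumptions made for the example.

```python
import os
import tempfile


def class_index_from_folders(data_dir, skip=("boxes",)):
    """Map each classification subfolder name to a stable integer label."""
    names = sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name)) and name not in skip
    )
    return {name: index for index, name in enumerate(names)}


# Build a throwaway folder layout to demonstrate the mapping
with tempfile.TemporaryDirectory() as data_dir:
    for name in ("outdoor", "boxes", "indoor"):
        os.mkdir(os.path.join(data_dir, name))
    class_index = class_index_from_folders(data_dir)
    print(class_index)
```

Saving this mapping alongside the dataset makes it possible to translate the integer labels back to "indoor" and "outdoor" later.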
Recap
In this tutorial, you saw the basic steps behind creating, populating, and saving complex data in a Hub Dataset.