Creating Complex Datasets - Coming Soon
Datasets often have different types of labels such as classifications and bounding boxes. It's advisable to create a Hub Dataset and tensor hierarchy that captures the relationship between the different label types.
Suppose a dataset contains classifications of "indoor" and "outdoor", as well as localization of objects such as "dog" and "cat", whose class names are listed in a file called classes.txt.
Note that the boxes for a given image are stored in a file with the same name as the image, with the extension .txt.
data_dir
|_indoor
    |_image1.png
    |_image2.png
|_outdoor
    |_image3.png
    |_image4.png
|_boxes
    |_image1.txt
    |_image2.txt
    |_image3.txt
    |_image4.txt
    |_classes.txt
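Before building the dataset, it can help to confirm that every image really has a matching boxes file. The sketch below is one way to do that with the standard library only. The function name pair_images_with_boxes and its boxes_dirname parameter are hypothetical, and it assumes that every subfolder of data_dir other than boxes is a classification folder.

```python
from pathlib import Path


def pair_images_with_boxes(data_dir, boxes_dirname="boxes"):
    """Return (pairs, missing) for the folder layout described above.

    pairs   -> list of (class_name, image_path, box_path)
    missing -> list of image paths that have no matching .txt file
    """
    data_dir = Path(data_dir)
    boxes_dir = data_dir / boxes_dirname
    pairs, missing = [], []

    # Assumption: every subfolder except the boxes folder is a class.
    class_dirs = sorted(
        p for p in data_dir.iterdir()
        if p.is_dir() and p.name != boxes_dirname
    )

    for class_dir in class_dirs:
        for image_path in sorted(class_dir.iterdir()):
            # The boxes file shares the image's name, with a .txt extension.
            box_path = boxes_dir / (image_path.stem + ".txt")
            if box_path.is_file():
                pairs.append((class_dir.name, image_path, box_path))
            else:
                missing.append(image_path)
    return pairs, missing
```

Calling pair_images_with_boxes('data_dir') and checking that the second returned list is empty is a cheap guard against a missing or misnamed annotation file.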
You can initialize a Hub Dataset and create tensors that mirror this structure, as shown in the code below.
import hub

hub_dataset_path = './complex_dataset'
ds = hub.empty(hub_dataset_path)

# Image
ds.create_tensor('images', htype='image', sample_compression='jpeg')

# Classification
ds.create_tensor('labels', dtype='int64')
# An even more rigorous approach would be to use:
# ds.create_tensor("classification/label", dtype="int64")

# Localization
ds.create_tensor('localization/bbox', dtype='float', htype='bbox')
ds.create_tensor('localization/labels', htype='class_label')
Below is a helper function for parsing YOLO .txt files. Each line of a file holds a class label followed by four box values.
import numpy as np

def read_yolo_box(fn):
    # Read a YOLO .txt file and return an array of boxes and an array of labels
    with open(fn) as box_f:
        lines_split = box_f.read().splitlines()

    yolo_boxes = np.zeros((len(lines_split), 4))
    yolo_labels = np.zeros(len(lines_split))

    # Go through each line and parse data: the first field is the label,
    # the next four fields are the box values
    for l, line in enumerate(lines_split):
        line_split = line.split()
        yolo_boxes[l, :] = [float(v) for v in line_split[1:5]]
        yolo_labels[l] = int(line_split[0])

    return yolo_boxes, yolo_labels
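The tutorial does not spell out what the four box values mean. As a minimal sketch, assuming the common YOLO convention of x_center, y_center, width, and height normalized to the image size, the functions below parse one annotation line and convert the box to pixel corner coordinates. The names parse_yolo_line and yolo_to_pixel_corners are hypothetical, and the sample values in the usage note are made up for illustration.

```python
def parse_yolo_line(line):
    """Split one annotation line into (class_id, box).

    Assumes five whitespace-separated fields: the class id, then four
    box values.
    """
    fields = line.split()
    return int(fields[0]), tuple(float(v) for v in fields[1:5])


def yolo_to_pixel_corners(box, img_width, img_height):
    """Convert one box to pixel corners (x_min, y_min, x_max, y_max).

    Assumes box = (x_center, y_center, width, height), with each value
    normalized to the range [0, 1] relative to the image size.
    """
    x_center, y_center, width, height = box
    x_min = (x_center - width / 2) * img_width
    y_min = (y_center - height / 2) * img_height
    x_max = (x_center + width / 2) * img_width
    y_max = (y_center + height / 2) * img_height
    return x_min, y_min, x_max, y_max
```

For example, parse_yolo_line("1 0.5 0.5 0.5 0.5") gives class id 1 and the box (0.5, 0.5, 0.5, 0.5), which on a 100 by 200 pixel image corresponds to the corners (25.0, 50.0, 75.0, 150.0). If your files use a different convention, the arithmetic needs to change accordingly.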
Populate the tensors in the dataset by iterating through all of the classification images and pulling their classification labels, bounding box positions, and bounding box labels.
import glob
from os.path import basename, split, splitext, join

import hub

data_dir = 'data_dir'

# Subfolders with classification data. The 'boxes' folder holds annotations, not a class.
folder_paths = sorted(p for p in glob.glob(join(data_dir, '*')) if basename(p) != 'boxes')

ds = hub.load(hub_dataset_path)

# Iterate through the classification subfolders (/indoor, /outdoor)
with ds:
    for class_label, folder_path in enumerate(folder_paths):
        paths = glob.glob(join(folder_path, '*'))  # Image files in this class folder

        # Iterate through images in the classification subfolders
        for path in paths:
            img_name = splitext(split(path)[-1])[0]
            box_path = join(data_dir, 'boxes', img_name + '.txt')  # Path of bounding box file

            yolo_boxes, yolo_labels = read_yolo_box(box_path)
            ds.images.append(hub.read(path))  # Append image
            ds.labels.append(class_label)  # Append classification label
            ds.localization.bbox.append(yolo_boxes)  # Append localization boxes
            ds.localization.labels.append(yolo_labels)  # Append localization labels
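One detail worth handling explicitly is how folder names map to integer labels. glob.glob returns paths in arbitrary order, so enumerating its raw output can assign different indices on different machines. The sketch below builds a stable mapping from sorted folder names. The function name build_class_index and its skip parameter are hypothetical, and it assumes that boxes is the only non-class subfolder.

```python
from pathlib import Path


def build_class_index(data_dir, skip=("boxes",)):
    """Map each class folder name to an integer label, in sorted order."""
    names = sorted(
        p.name for p in Path(data_dir).iterdir()
        if p.is_dir() and p.name not in skip
    )
    return {name: index for index, name in enumerate(names)}
```

With the layout above, this maps indoor to 0 and outdoor to 1 regardless of the order in which the filesystem lists the folders. Storing the mapping alongside the dataset makes the integer labels easier to interpret later.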

Recap

In this tutorial, you saw the basic steps behind creating, populating, and saving complex data in a Hub Dataset.