Step 2: Creating Hub Datasets Manually
Creating and storing Hub Datasets manually
Creating Hub datasets is simple, and you have full control over how your source data (files, images, etc.) is connected to specific tensors in the Hub dataset.
Follow the example below to create your first dataset. First, download and unzip the small classification dataset below, called the animals dataset.
animals dataset
The dataset folder contains one subfolder per class (/dogs and /cats), with the images for each class stored inside its subfolder.
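To inspect the layout on your own machine, a minimal sketch such as the following can print it. It assumes the archive was unzipped to ./animals, and `folder_tree` is a hypothetical helper written for this guide, not part of Hub.

```python
import os

def folder_tree(root):
    """Return indented lines describing the folders and files under root."""
    lines = []
    root = os.path.normpath(root)
    for current, dirs, files in os.walk(root):
        dirs.sort()  # in-place sort makes os.walk visit subfolders alphabetically
        rel = os.path.relpath(current, root)
        depth = 0 if rel == '.' else rel.count(os.sep) + 1
        lines.append('    ' * depth + os.path.basename(current) + '/')
        for name in sorted(files):
            lines.append('    ' * (depth + 1) + name)
    return lines

if __name__ == '__main__':
    # Assumed unzip location; adjust the path if you extracted elsewhere.
    if os.path.isdir('./animals'):
        print('\n'.join(folder_tree('./animals')))
```

Each class subfolder should appear once, with its image files indented beneath it.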
Now that you have the data, you can create a Hub dataset and initialize its tensors. Running the following code will create a Hub dataset inside the ./animals_hub folder.
from hub import Dataset, load
hub_dataset_path = './animals_hub'
ds = Dataset(hub_dataset_path) # Creates the dataset
# Create the tensors with names of your choice.
ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
ds.create_tensor('labels', htype = 'class_label')
Specifying htype and dtype is not required, but it is highly recommended in order to optimize performance, especially for large datasets. Use dtype to specify the numeric type of tensor data, and use htype to specify the underlying data structure. More information on htype can be found here.
Next, populate the tensors with data using the following code:
from PIL import Image
import numpy as np
import os
import glob

# Paths to the class subfolders; sorted so the label integers are reproducible
dataset_folders = sorted(glob.glob('./animals/*'))

# Iterate through the subfolders (/dogs, /cats)
for label, folder_path in enumerate(dataset_folders):
    paths = glob.glob(os.path.join(folder_path, '*'))  # Image paths in this subfolder

    # Iterate through images in the subfolders
    for path in paths:
        ds.images.append(load(path))  # Append to images tensor using hub.load
        ds.labels.append(np.uint32(label))  # Append to labels tensor
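The loop above stores only the integer position of each folder as its label, so it can help to record which integer corresponds to which class. A minimal sketch, assuming the class name is simply the subfolder name; `build_label_map` is a hypothetical helper, not part of Hub.

```python
import os

def build_label_map(dataset_folders):
    """Map each class folder name to the integer that enumerate() assigns it."""
    return {
        os.path.basename(os.path.normpath(folder)): label
        for label, folder in enumerate(dataset_folders)
    }

# Example using the two folder names mentioned in this guide.
label_map = build_label_map(['./animals/cats', './animals/dogs'])
print(label_map)  # {'cats': 0, 'dogs': 1}
```

Pass the same list, in the same order, that the ingestion loop iterates over, so the mapping agrees with the values appended to the labels tensor.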
ds.images.append(load(path)) is functionally equivalent to ds.images.append(np.array(PIL.Image.open(path))). However, hub.load() is significantly faster because it does not decompress and recompress the image if the compression matches the sample_compression for that tensor. Further details are available in Understanding Compression.
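Since the speed benefit applies only when a file's compression matches the tensor's sample_compression, it can be useful to check the source files before ingesting them. The rough sketch below infers the format from the file extension alone. That is an assumption made for illustration, not a description of how Hub detects compression, and `matches_compression` and the file names are hypothetical.

```python
import os

# Extensions that name the same format (assumed aliases for this sketch).
_ALIASES = {'jpg': 'jpeg'}

def matches_compression(path, sample_compression):
    """Return True if the file extension suggests the given compression."""
    ext = os.path.splitext(path)[1].lstrip('.').lower()
    return _ALIASES.get(ext, ext) == sample_compression.lower()

# Hypothetical file names, used only to demonstrate the check.
paths = ['./animals/cats/image_1.jpg', './animals/dogs/image_2.png']
mismatched = [p for p in paths if not matches_compression(p, 'jpeg')]
print(mismatched)  # ['./animals/dogs/image_2.png']
```

Files reported as mismatched can still be appended; they would simply not benefit from skipping the decompress and recompress step described above.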

Creating Tensor Hierarchies - Coming Soon

Often it's important to create tensors hierarchically, because information between tensors may be inherently coupled, such as bounding boxes and their corresponding labels. Hierarchy can be created using the following lines of code:
# Tensors are accessed via:
For more detailed information regarding accessing datasets and their tensors, check out the next section.