Step 2: Creating Hub Datasets Manually
Creating and storing Hub Datasets manually.
Creating Hub datasets is simple: you have full control over how your source data (files, images, etc.) is connected to specific tensors in the Hub dataset.
Let's follow the example below to create our first dataset. First, download and unzip the small classification dataset below, called the animals dataset.
The dataset has the following folder structure:
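To print the layout of the unzipped folder on your own machine, here is a minimal sketch that uses only the standard library. The helper name format_tree is hypothetical, and the sketch assumes the archive was unzipped to './animals'.

```python
import os

def format_tree(root):
    """Return an indented listing of every folder and file under root."""
    lines = []
    root = os.path.normpath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a stable order
        if dirpath == root:
            depth = 0
        else:
            depth = os.path.relpath(dirpath, root).count(os.sep) + 1
        lines.append('    ' * depth + os.path.basename(dirpath) + '/')
        for filename in sorted(filenames):
            lines.append('    ' * (depth + 1) + filename)
    return lines

# Assumption: the dataset was unzipped to ./animals
if os.path.isdir('./animals'):
    print('\n'.join(format_tree('./animals')))
```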
Now that you have the data, you can create a Hub dataset and initialize its tensors. Running the following code creates a Hub dataset in the './animals_hub' folder.
from PIL import Image
import numpy as np
import os
import hub

ds = hub.empty('./animals_hub') # Creates the dataset
Next, let's inspect the folder structure of the source dataset './animals' to find the class names and the files that need to be uploaded to the Hub dataset.
# Find the class_names and list of files that need to be uploaded
dataset_folder = './animals'

class_names = os.listdir(dataset_folder)

# Collect the full path of every file under the dataset folder
files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))
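The discovery code above relies on os.listdir, which returns entries in arbitrary order and also returns stray files such as '.DS_Store'. As a hedged alternative, the sketch below keeps only subfolders as classes, sorts them so that label numbers are reproducible, and skips hidden files. The helper name discover_dataset is hypothetical, and the sketch assumes one folder per class with the files directly inside it.

```python
import os

def discover_dataset(dataset_folder):
    """Return (class_names, files_list) for a folder-per-class dataset."""
    # Only subfolders count as classes; sorting keeps label numbers stable.
    class_names = sorted(
        name for name in os.listdir(dataset_folder)
        if os.path.isdir(os.path.join(dataset_folder, name))
    )

    files_list = []
    for class_name in class_names:
        class_dir = os.path.join(dataset_folder, class_name)
        for filename in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, filename)
            # Skip nested folders and hidden files such as .DS_Store
            if os.path.isfile(path) and not filename.startswith('.'):
                files_list.append(path)
    return class_names, files_list
```

Calling class_names, files_list = discover_dataset('./animals') yields the same two variables that the rest of this page uses.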
# Create the tensors with names of your choice.
ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
ds.create_tensor('labels', htype = 'class_label', class_names = class_names)
# Add arbitrary metadata - Optional
ds.info.update(description = 'My first Hub dataset')
ds.images.info.update(camera_type = 'SLR')
Specifying dtype is not required, but it is highly recommended in order to optimize performance, especially for large datasets. Use dtype to specify the numeric type of tensor data, and use htype to specify the underlying data structure. More information on htype can be found here.
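The create_tensor calls above set htype but not dtype. A minimal sketch of setting both is shown below, assuming create_tensor accepts a dtype keyword given as a string. The helper name create_typed_tensors and the particular dtype choices (uint8 for images, uint32 for labels) are illustrative assumptions, not values taken from this page. Use it in place of the earlier calls, not in addition to them, since each tensor name can only be created once.

```python
def create_typed_tensors(ds, class_names):
    """Create the two tensors with an explicit dtype next to the htype.

    ds is assumed to be a Hub dataset, e.g. the result of hub.empty(...).
    """
    # Assumption: 8-bit pixel values, stored as JPEG-compressed samples
    ds.create_tensor('images', htype='image', dtype='uint8',
                     sample_compression='jpeg')
    # Assumption: labels are stored as unsigned 32-bit class indices
    ds.create_tensor('labels', htype='class_label', dtype='uint32',
                     class_names=class_names)
```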
Finally, let's populate the data in the tensors.
# Iterate through the files and append to hub dataset
for file in files_list:
    # The name of the parent folder is the class label
    label_text = os.path.basename(os.path.dirname(file))
    label_num = class_names.index(label_text)

    ds.images.append(hub.read(file)) # Append to images tensor using hub.read
    ds.labels.append(np.uint32(label_num)) # Append to labels tensor
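To make the label lookup in the loop above concrete, here is a small standalone sketch of the same two steps: take the name of the file's parent folder, then look up its position in class_names. The helper name label_for, the class names 'cats' and 'dogs', and the example path are hypothetical.

```python
import os

def label_for(file, class_names):
    """Map a file path to its numeric label using the parent folder name."""
    label_text = os.path.basename(os.path.dirname(file))
    if label_text not in class_names:
        raise ValueError(
            f"{file!r} is in folder {label_text!r}, which is not a known class"
        )
    return class_names.index(label_text)

# Hypothetical example: the parent folder 'dogs' is at index 1
example_label = label_for('./animals/dogs/image_3.jpg', ['cats', 'dogs'])
print(example_label)  # prints 1
```

Raising an explicit error for an unknown folder makes stray files easier to spot than the bare ValueError from list.index.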
ds.images.append(hub.read(path)) is functionally equivalent to ds.images.append(np.array(PIL.Image.open(path))). However, the hub.read() method is significantly faster because it does not decompress and recompress the image if the compression matches the sample_compression for that tensor. Further details are available in Understanding Compression.
Congrats! You just created your first dataset! 🎉
Often it's important to create tensors hierarchically, because information between tensors may be inherently coupled—such as bounding boxes and their corresponding labels. Hierarchy can be created using the following lines of code:
# Tensors are accessed via:
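A hedged sketch of this pattern follows, assuming that a slash in a tensor name places the tensor inside a group and that the same slash-separated path retrieves it from the dataset. The group name 'localization', the tensor names 'bbox' and 'label', and the helper name create_localization_group are illustrative assumptions.

```python
def create_localization_group(ds):
    """Create two coupled tensors under one group and return both.

    ds is assumed to be a Hub dataset, e.g. the result of hub.empty(...).
    """
    # Assumption: the part before the slash names the group
    ds.create_tensor('localization/bbox')
    ds.create_tensor('localization/label')

    # Tensors are accessed via the same slash-separated path
    return ds['localization/bbox'], ds['localization/label']
```

Keeping the boxes and their labels under one group signals that sample i of one tensor belongs with sample i of the other.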
For more detailed information regarding accessing datasets and their tensors, check out the next section.