Step 2: Creating Hub Datasets
Creating and storing Hub Datasets manually.
Creating Hub datasets is simple. Manual creation gives you full control over connecting your source data (files, images, etc.) to specific tensors in the Hub dataset. Automatic creation lets you quickly create a Hub dataset by having Hub parse the underlying files into Hub dataset tensors.

Manual Creation

Let's follow along with the example below to create our first dataset manually. First, download and unzip the small classification dataset below, called the animals dataset.
animals.zip (338KB, binary): animals dataset
The dataset has the following folder structure:
_animals
|_cats
    |_image_1.jpg
    |_image_2.jpg
|_dogs
    |_image_3.jpg
    |_image_4.jpg
Now that you have the data, you can create a Hub dataset and initialize its tensors. Running the following code will create a Hub dataset inside the ./animals_hub folder.
import hub
from PIL import Image
import numpy as np
import os

ds = hub.empty('./animals_hub') # Creates the dataset
Next, let's inspect the folder structure for the source dataset './animals' to find the class names and the files that need to be uploaded to the Hub dataset.
# Find the class_names and list of files that need to be uploaded
dataset_folder = './animals'

class_names = os.listdir(dataset_folder)

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))
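The discovery step above depends on os.listdir, which returns entries in arbitrary order and also lists any stray files in the top-level folder. The following is a minimal sketch, assuming a folder-per-class layout, of a variant that keeps only sub-directories and sorts everything so that label indices are reproducible between runs. The helper name find_classes_and_files and the notes.txt file are hypothetical and used only for illustration; the placeholder images are empty files, not real JPEGs.

```python
import os
import tempfile


def find_classes_and_files(dataset_folder):
    """Return (class_names, files_list) for a folder-per-class dataset."""
    # Keep only sub-directories, sorted so label indices are reproducible.
    class_names = sorted(
        name for name in os.listdir(dataset_folder)
        if os.path.isdir(os.path.join(dataset_folder, name))
    )
    files_list = []
    for class_name in class_names:
        class_dir = os.path.join(dataset_folder, class_name)
        for dirpath, dirnames, filenames in os.walk(class_dir):
            dirnames.sort()  # make os.walk visit nested folders in sorted order
            for filename in sorted(filenames):
                files_list.append(os.path.join(dirpath, filename))
    return class_names, files_list


# Build a throwaway copy of the animals layout with empty placeholder files.
layout = {
    "cats": ["image_1.jpg", "image_2.jpg"],
    "dogs": ["image_3.jpg", "image_4.jpg"],
}
with tempfile.TemporaryDirectory() as root:
    for class_name, image_names in layout.items():
        os.makedirs(os.path.join(root, class_name))
        for image_name in image_names:
            open(os.path.join(root, class_name, image_name), "w").close()
    # A stray top-level file (hypothetical) that should not become a class.
    open(os.path.join(root, "notes.txt"), "w").close()

    class_names, files_list = find_classes_and_files(root)
    relative_files = [os.path.relpath(path, root) for path in files_list]

print(class_names)
print(relative_files)
```

Sorting is a design choice rather than a Hub requirement: it simply ensures that the same folder always produces the same class_names order, and therefore the same numeric labels.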
Next, let's create the dataset tensors and upload metadata. Check out our page on Storage Synchronization for details about the with syntax below.
with ds:
    # Create the tensors with names of your choice.
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.create_tensor('labels', htype = 'class_label', class_names = class_names)

    # Add arbitrary metadata - Optional
    ds.info.update(description = 'My first Hub dataset')
    ds.images.info.update(camera_type = 'SLR')
Specifying htype and dtype is not required, but it is highly recommended in order to optimize performance, especially for large datasets. Use dtype to specify the numeric type of tensor data, and use htype to specify the underlying data structure. More information on htype can be found here.
Finally, let's populate the data in the tensors.
with ds:
    # Iterate through the files and append to hub dataset
    for file in files_list:
        label_text = os.path.basename(os.path.dirname(file))
        label_num = class_names.index(label_text)

        ds.images.append(hub.read(file)) # Append to images tensor using hub.read
        ds.labels.append(np.uint32(label_num)) # Append to labels tensor
ds.images.append(hub.read(path)) is functionally equivalent to ds.images.append(np.array(PIL.Image.open(path))). However, the hub.read() method is significantly faster because it does not decompress and recompress the image if the compression matches the sample_compression for that tensor. Further details are available in Understanding Compression.
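The label lookup in the population loop takes the name of each file's parent folder and converts it to its position in class_names. As a rough sketch, the same mapping can be run on its own before any data is appended, to confirm that every file resolves to a known class. The helper name label_for and the hard-coded paths are assumptions made for illustration; no Hub calls are involved.

```python
import os
from collections import Counter


def label_for(path, class_names):
    """Map a file path to the index of its parent folder in class_names."""
    label_text = os.path.basename(os.path.dirname(path))
    if label_text not in class_names:
        raise ValueError(f"{path!r} is not inside a known class folder")
    return class_names.index(label_text)


# Hypothetical inputs mirroring the animals dataset layout.
class_names = ["cats", "dogs"]
files_list = [
    os.path.join("animals", "cats", "image_1.jpg"),
    os.path.join("animals", "cats", "image_2.jpg"),
    os.path.join("animals", "dogs", "image_3.jpg"),
    os.path.join("animals", "dogs", "image_4.jpg"),
]

# Compute every label up front and count how many files fall in each class.
labels = [label_for(path, class_names) for path in files_list]
per_class = Counter(class_names[label] for label in labels)

print(labels)
print(dict(per_class))
```

For the four animals files this yields the labels 0, 0, 1, 1 and two files per class. A file that sits outside the class folders raises a ValueError instead of silently receiving a wrong label.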
Check out the first image from this dataset. More details about Accessing Data are available in Step 5.
Image.fromarray(ds.images[0].numpy())
Congrats! You just created your first dataset! 🎉

Automatic Creation

The above animals dataset can also be converted to Hub format automatically using 1 line of code:
src = "./animals"
dest = './animals_hub_auto'

ds = hub.ingest(src, dest)
Automatic creation currently only supports image classification datasets, though support for other dataset types is continually being added. A full list of supported datasets is available here.

Creating Tensor Hierarchies

Often it's important to create tensors hierarchically, because information between tensors may be inherently coupled, such as bounding boxes and their corresponding labels. Hierarchies can be created using tensor groups:
ds = hub.empty('./groups_test') # Creates the dataset

# Create tensor hierarchies
ds.create_group('my_group')
ds.my_group.create_tensor('my_tensor')

# Alternatively, a group can be created using create_tensor with '/'
ds.create_tensor('my_group_2/my_tensor') # Automatically creates the group 'my_group_2'
Tensors in groups are accessed via:
ds.my_group.my_tensor
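To build intuition for how a slash-separated name such as 'my_group_2/my_tensor' maps onto a hierarchy, here is a toy model that uses nested dictionaries. It is not Hub's implementation and does not call Hub at all; the functions create_tensor and get_tensor below are hypothetical stand-ins that only mimic the naming behavior described above.

```python
def create_tensor(tree, name):
    """Add a tensor placeholder to a nested dict, creating groups as needed."""
    *group_names, tensor_name = name.split("/")
    node = tree
    for group_name in group_names:
        # Missing groups are created on the fly, like 'my_group_2' above.
        node = node.setdefault(group_name, {})
    node[tensor_name] = []  # an empty list stands in for the tensor's samples
    return node[tensor_name]


def get_tensor(tree, name):
    """Walk the nested dict one path component at a time."""
    node = tree
    for part in name.split("/"):
        node = node[part]
    return node


tree = {}
create_tensor(tree, "images")
create_tensor(tree, "my_group/my_tensor")
create_tensor(tree, "my_group_2/my_tensor")

# Appending through the path only affects the tensor inside 'my_group'.
get_tensor(tree, "my_group/my_tensor").append("sample")

print(tree)
```

The two tensors named my_tensor stay independent because each lives under its own group, which is the property that makes groups useful for keeping coupled tensors together.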
For more detailed information regarding accessing datasets and their tensors, check out Step 4.