Creating Time-Series Datasets
How to use Hub to store time-series data.

This tutorial is also available as a Colab Notebook.

Hub is an intuitive format for storing large time-series datasets, and it offers compression for reducing storage costs. This tutorial demonstrates how to convert time-series data to Hub format and load the data for plotting.

Create the Hub Dataset

The first step is to download the small dataset below, called sensor data.

sensor_data.zip (1 MB, binary)

This is a subset of a dataset available on Kaggle, and it contains the iPhone x, y, and z acceleration for 24 users (subjects) under walking and jogging conditions. The dataset has the folder structure below. subjects_info.csv contains metadata such as height, weight, etc. for each subject, and the sub_n.csv files contain the time-series acceleration data for the nth subject.
```
data_dir
|_subjects_info.csv
|_motion_data
    |_walk
        |_sub_1.csv
        |_sub_2.csv
        ...
    |_jog
        |_sub_1.csv
        |_sub_2.csv
        ...
```
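If you want to sanity-check that your extracted data matches this layout before converting, you can list the per-subject files with os.walk, the same pattern the conversion script uses later. The sketch below builds a miniature copy of the tree in a hypothetical scratch folder (sensor_data_demo, not part of the download) so the listing logic can be tried in isolation:

```python
import os

# Recreate a miniature version of the tree above in a scratch folder
# (hypothetical path, for illustration only)
root = 'sensor_data_demo'
for activity in ('walk', 'jog'):
    os.makedirs(os.path.join(root, 'motion_data', activity), exist_ok=True)
    for n in (1, 2):
        open(os.path.join(root, 'motion_data', activity, f'sub_{n}.csv'), 'w').close()
open(os.path.join(root, 'subjects_info.csv'), 'w').close()

# Walk motion_data and collect every per-subject file, skipping subjects_info.csv
files = []
for dirpath, dirnames, filenames in os.walk(os.path.join(root, 'motion_data')):
    for filename in filenames:
        files.append(os.path.join(dirpath, filename))

print(len(files))  # 4: two subjects in each of 'walk' and 'jog'
```

Because the walk starts at motion_data, subjects_info.csv is naturally excluded from the list of time-series files.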
Now that you have the data, let's create a Hub Dataset in the ./sensor_data_hub folder by running:
```python
import hub
import pandas as pd
import os
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt

ds = hub.empty('./sensor_data_hub') # Create the dataset locally
```
Next, let's specify the folder path containing the existing dataset, load the subject metadata into a Pandas DataFrame, and create a list of all of the time-series files that should be converted to Hub format.
```python
dataset_path = './sensor_data'

subjects_info = pd.read_csv(os.path.join(dataset_path, 'subjects_info.csv'))

fns_series = []
for dirpath, dirnames, filenames in os.walk(os.path.join(dataset_path, 'motion_data')):
    for filename in filenames:
        fns_series.append(os.path.join(dirpath, filename))
```
Next, let's create the tensors and add relevant metadata, such as the dataset source, the tensor units, and other information. We leverage groups to separate out the primary acceleration data from other user data such as the weight and height of the subjects.
```python
with ds:
    # Update dataset metadata
    ds.info.update(source = 'https://www.kaggle.com/malekzadeh/motionsense-dataset',
                   notes = 'This is a small subset of the data in the source link')

    # Create tensors. Setting chunk_compression is optional and it defaults to None
    ds.create_tensor('acceleration_x', chunk_compression = 'lz4')
    ds.create_tensor('acceleration_y', chunk_compression = 'lz4')

    # Save the sampling rate as tensor metadata. Alternatively,
    # you could also create a 'time' tensor.
    ds.acceleration_x.info.update(sampling_rate_s = 0.1)
    ds.acceleration_y.info.update(sampling_rate_s = 0.1)

    # Encode activity as text
    ds.create_tensor('activity', htype = 'text')

    # Encode 'activity' as numeric labels and convert to text via class_names
    # ds.create_tensor('activity', htype = 'class_label', class_names = ['xyz'])

    ds.create_group('subjects_info')
    ds.subjects_info.create_tensor('age')
    ds.subjects_info.create_tensor('weight')
    ds.subjects_info.create_tensor('height')

    # Save the units of weight and height as tensor metadata
    ds.subjects_info.weight.info.update(units = 'kg')
    ds.subjects_info.height.info.update(units = 'cm')
```
Finally, let's iterate through all the time-series data and populate the tensors in the Hub dataset.
```python
with ds:
    # Iterate through the time series and append data
    for fn in tqdm(fns_series):

        # Read the data in the time series
        df_data = pd.read_csv(fn)

        # Parse the 'activity' from the file name
        activity = os.path.basename(os.path.dirname(fn))

        # Parse the subject code from the filename and pull the subject info from 'subjects_info'
        subject_code = int(os.path.splitext(os.path.basename(fn))[0].split('_')[1])
        subject_info = subjects_info[subjects_info['code']==subject_code]

        # Append data to tensors
        ds.activity.append(activity)
        ds.subjects_info.age.append(subject_info['age'].values)
        ds.subjects_info.weight.append(subject_info['weight'].values)
        ds.subjects_info.height.append(subject_info['height'].values)

        ds.acceleration_x.append(df_data['userAcceleration.x'].values)
        ds.acceleration_y.append(df_data['userAcceleration.y'].values)
```
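The two parsing lines in the loop above can be checked in isolation with plain os.path calls. The path below is a hypothetical example, not a file from the dataset:

```python
import os

# Hypothetical file path following the layout described earlier
fn = './sensor_data/motion_data/walk/sub_7.csv'

# The activity is the name of the parent folder
activity = os.path.basename(os.path.dirname(fn))

# The subject code is the integer after 'sub_' in the file name
subject_code = int(os.path.splitext(os.path.basename(fn))[0].split('_')[1])

print(activity, subject_code)  # walk 7
```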

Inspect the Hub Dataset

Let's check out the first sample from this dataset and plot the acceleration time series.
Notably, the Hub dataset occupies 36% less storage than the original dataset, thanks to the lz4 chunk compression applied to the acceleration tensors.
```python
s_ind = 0   # Plot the first time series
t_ind = 100 # Plot the first 100 indices in the time series

# Plot the x acceleration
x_data = ds.acceleration_x[s_ind].numpy()[:t_ind]
sampling_rate_x = ds.acceleration_x.info.sampling_rate_s

plt.plot(np.arange(0, x_data.size)*sampling_rate_x, x_data, label='acceleration_x')

# Plot the y acceleration
y_data = ds.acceleration_y[s_ind].numpy()[:t_ind]
sampling_rate_y = ds.acceleration_y.info.sampling_rate_s

plt.plot(np.arange(0, y_data.size)*sampling_rate_y, y_data, label='acceleration_y')

plt.legend()
plt.xlabel('time [s]', fontweight = 'bold')
plt.ylabel('acceleration [g]', fontweight = 'bold')
plt.title('Weight: {} {}, Height: {} {}'.format(ds.subjects_info.weight[s_ind].numpy()[0],
                                                ds.subjects_info.weight.info.units,
                                                ds.subjects_info.height[s_ind].numpy()[0],
                                                ds.subjects_info.height.info.units),
          fontweight = 'bold')

plt.xlim([0, 10])
plt.grid()
plt.gcf().set_size_inches(8, 5)
plt.show()
```
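To check the storage savings on your own copy, you can compare the on-disk size of the source folder and the converted folder. The helper below is not part of the Hub API, just a plain directory-size sketch over the paths used earlier in this tutorial:

```python
import os

def dir_size_bytes(path):
    """Sum the sizes of all files under path, in bytes."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            total += os.path.getsize(os.path.join(dirpath, filename))
    return total

# Compare the original and converted folders (run after the conversion above):
# original = dir_size_bytes('./sensor_data')
# converted = dir_size_bytes('./sensor_data_hub')
# print('storage savings: {:.0%}'.format(1 - converted / original))
```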
Congrats! You just converted a time-series dataset to Hub format! πŸŽ‰