Dataset¶
Create¶
To create and store a dataset, you need to define its shape and specify its structure (schema).
For example, to create a dataset named basic with 4 samples containing images and labels of shape (512, 512) and dtype 'float' in the account username:
from hub import Dataset, schema

tag = "username/basic"

ds = Dataset(
    tag,
    shape=(4,),
    schema={
        "image": schema.Tensor((512, 512), dtype="float"),
        "label": schema.Tensor((512, 512), dtype="float"),
    },
)
Upload the Data¶
To add data to the dataset:
ds["image"][:] = np.ones((4, 512, 512))
ds["label"][:] = np.ones((4, 512, 512))
ds.commit()
Load the data¶
Load the dataset and access its elements:
ds = Dataset('username/basic')
# Use .numpy() to get the numpy array of the element
print(ds["image"][0].numpy())
print(ds["label", 100:110].numpy())
Convert to PyTorch¶
import torch

ds = ds.to_pytorch()
ds = torch.utils.data.DataLoader(
    ds,
    batch_size=8,
    num_workers=2,
)

# Iterate over the data
for batch in ds:
    print(batch["image"], batch["label"])
Convert to TensorFlow¶
# Reload the hub dataset, since ds was converted to PyTorch above
ds = Dataset('username/basic')
ds = ds.to_tensorflow().batch(8)

# Iterate over the data
for batch in ds:
    print(batch["image"], batch["label"])
Visualize¶
Check that your dataset is visualized correctly at app.activeloop.ai.
Delete¶
You can delete your dataset at app.activeloop.ai in the dataset overview tab.
Issues¶
If you run into any trouble or have questions, please open a GitHub issue.
API¶
class hub.Dataset(url: str, mode: str = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: str = None)¶
__getitem__(slice_)¶
Gets a slice or slices from the dataset.
Usage:
>>> return ds["image", 5, 0:1920, 0:1080, 0:3].compute()  # returns numpy array
>>> images = ds["image"]
>>> return images[5].compute()  # returns numpy array
>>> images = ds["image"]
>>> image = images[5]
>>> return image[0:1920, 0:1080, 0:3].compute()
__init__(url: str, mode: str = None, shape=None, schema=None, token=None, fs=None, fs_map=None, meta_information={}, cache: int = 67108864, storage_cache: int = 268435456, lock_cache=True, tokenizer=None, lazy: bool = True, public: bool = True, name: str = None)¶
Open a new or existing dataset for read/write.
Parameters
url (str) – The url where the dataset is located or should be created.
mode (str, optional (defaults to "a")) – Whether the dataset is opened for reading or writing, e.g. "r", "w", "a".
shape (tuple, optional) – Tuple in (num_samples,) format, where num_samples is the number of samples.
schema (optional) – Describes the data of a single sample; Hub schemas are used for this. Required for "a" and "w" modes.
token (str or dict, optional) – If url refers to a place where authorization is required, token is the parameter to pass the credentials; it can be a filepath or a dict.
fs (optional) –
fs_map (optional) –
meta_information (dict, optional) – Additional information about the dataset, given as a dictionary.
cache (int, optional) – Size of the memory cache. Default is 64MB (2**26). If 0, False, or None, the cache is not used.
storage_cache (int, optional) – Size of the storage cache. Default is 256MB (2**28). If 0, False, or None, the storage cache is not used.
lock_cache (bool, optional) – Lock the cache to avoid multiprocessing errors.
lazy (bool, optional) – Setting this to False will stop lazy computation and will allow items to be accessed without .compute().
public (bool, optional) – Only applicable if using hub storage; ignored otherwise. Setting this to False allows only the user who created the dataset to access it, and the dataset won't be visible in the visualizer to the public.
name (str, optional) – Only applicable when using hub storage; this is the name that shows up on the visualizer.
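For example, reopening the quickstart dataset read-only (a minimal sketch using the tag from the example above):

from hub import Dataset

ds = Dataset("username/basic", mode="r")
print(len(ds))  # number of samples
print(ds.keys)  # keys defined by the schema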
__iter__()¶
Returns an iterable over samples.

__len__()¶
Number of samples in the dataset.

__repr__()¶
Return repr(self).

__setitem__(slice_, value)¶
Sets a slice or slices with a value.
Usage:
>>> ds["image", 5, 0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")
>>> images = ds["image"]
>>> image = images[5]
>>> image[0:1920, 0:1080, 0:3] = np.zeros((1920, 1080, 3), "uint8")

__str__()¶
Return str(self).

__weakref__¶
List of weak references to the object (if defined).
_check_and_prepare_dir()¶
Checks if input data is ok. Creates or overwrites the dataset folder. Returns True if the dataset needs to be created, as opposed to read.

_get_dictionary(subpath, slice_=None)¶
Gets a dictionary from the dataset given an incomplete subpath.
append_shape(size: int)¶
Append to the shape of the dataset (heavy operation).

close()¶
Save changes from cache to dataset final storage. This invalidates this object.

commit()¶
Deprecated alias to flush().

delete()¶
Deletes the dataset.
filter(dic)¶
Applies a filter to get a new DatasetView that matches the dictionary provided.
Parameters
dic (dict) – A dictionary of key-value pairs used to filter the dataset. For nested schemas, use the flattened dictionary representation, i.e. instead of {"abc": {"xyz": 5}} use {"abc/xyz": 5}.
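For example (a sketch; the "metadata/class" key is a hypothetical nested schema field):

# Hypothetical usage: keep only samples whose "metadata/class" equals 5
view = ds.filter({"metadata/class": 5})
print(len(view))  # number of matching samples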
flush()¶
Save changes from cache to dataset final storage. Does not invalidate this object.
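A typical write pattern (a sketch, using the quickstart dataset from above):

import numpy as np

ds["image", 0] = np.zeros((512, 512), dtype="float64")
ds.flush()  # persist changes; ds can still be used
ds["image", 1] = np.ones((512, 512), dtype="float64")
ds.close()  # persist changes and invalidate ds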
static from_pytorch(dataset, scheduler: str = 'single', workers: int = 1)¶
Converts a pytorch dataset object into hub format.
Parameters
dataset – The pytorch dataset object that needs to be converted into hub format.
scheduler (str) – Choice between "single", "threaded", and "processed".
workers (int) – How many threads or processes to use.
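A minimal sketch mirroring the from_tensorflow examples below; the ToyDataset class, its dict-shaped samples, and the tag "username/new_dataset" are illustrative assumptions:

import torch
import hub

class ToyDataset(torch.utils.data.Dataset):
    # Hypothetical dataset: each item is a dict of fields
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return {"value": idx}

out_ds = hub.Dataset.from_pytorch(ToyDataset())
res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset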
static from_tensorflow(ds, scheduler: str = 'single', workers: int = 1)¶
Converts a tensorflow dataset into hub format.
Parameters
ds – The tensorflow dataset object that needs to be converted into hub format.
scheduler (str) – Choice between "single", "threaded", and "processed".
workers (int) – How many threads or processes to use.
Examples
>>> ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset

>>> ds = tf.data.Dataset.from_tensor_slices({'a': [1, 2], 'b': [5, 6]})
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset

>>> ds = hub.Dataset(schema=my_schema, shape=(1000,), url="username/dataset_name", mode="w")
>>> ds = ds.to_tensorflow()
>>> out_ds = hub.Dataset.from_tensorflow(ds)
>>> res_ds = out_ds.store("username/new_dataset")  # res_ds is now a usable hub dataset
static from_tfds(dataset, split=None, num: int = -1, sampling_amount: int = 1, scheduler: str = 'single', workers: int = 1)¶
Converts a TFDS dataset into hub format.
Parameters
dataset (str) – The name of the tfds dataset that needs to be converted into hub format.
split (str, optional) – A string representing the splits of the dataset that are required, such as "train" or "test+train". If not present, all the splits of the dataset are used.
num (int, optional) – The number of samples required. If not present, all the samples are taken. If num is -1, or if num is greater than the size of this dataset, the new dataset will contain all elements of this dataset.
sampling_amount (float, optional) – A value from 0 to 1 that specifies how much of the dataset is sampled to determine feature shapes. A value of 0 means no sampling, and 1 implies that the entire dataset is sampled.
scheduler (str) – Choice between "single", "threaded", and "processed".
workers (int) – How many threads or processes to use.
Examples
>>> out_ds = hub.Dataset.from_tfds('mnist', split='test+train', num=1000)
>>> res_ds = out_ds.store("username/mnist")  # res_ds is now a usable hub dataset
property keys¶
Get the keys of the dataset.

rename(name: str) → None¶
Renames the dataset.

resize_shape(size: int) → None¶
Resize the shape of the dataset by resizing each tensor's first dimension.
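For example, to grow the quickstart dataset from 4 to 8 samples (a sketch; assumes ds is open in a writable mode):

ds.resize_shape(8)  # each tensor's first dimension is resized to 8
print(len(ds))      # 8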
to_pytorch(transform=None, inplace=True, output_type=<class 'dict'>, indexes=None)¶
Converts the dataset into a pytorch compatible format.
Parameters
transform (function, optional) – A function that transforms data in a dict format.
inplace (bool, optional) – Defines whether data should be converted to torch.Tensor before or after transforms are applied (depends on what data type you need for the transforms). Default is True.
output_type (one of list, tuple, dict, optional) – Defines the output type. Default is dict, the same as in the original Hub Dataset.
indexes (list or int, optional) – The samples to be converted into pytorch format. Takes all samples in the dataset by default.
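A sketch of passing a transform; the normalize function below is a hypothetical example, and per the parameters above it receives and returns a sample in dict format:

import torch

def normalize(sample):
    # Hypothetical transform: scale the image field down to [0, 1]
    sample["image"] = sample["image"] / 255.0
    return sample

pt_ds = ds.to_pytorch(transform=normalize)
loader = torch.utils.data.DataLoader(pt_ds, batch_size=8)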
to_tensorflow(indexes=None)¶
Converts the dataset into a tensorflow compatible format.
Parameters
indexes (list or int, optional) – The samples to be converted into tensorflow format. Takes all samples in the dataset by default.