Most of the time Data Scientists/ML researchers work on data management and preprocessing instead of doing modeling. Deep Learning often requires to work with large datasets. Those datasets can grow up to terabyte or even petabyte size. It is hard to manage data, version control and track. It is time consuming to download the data and link with the training or inference code. There is no easy way to access a chunk of it and possibly visualize. Wouldn’t it be more convenient to have large datasets stored & version-controlled as single numpy-like array on the cloud and have access to it from any machine at scale?
We realized that there are a few problems related with current workflow in deep learning data management through our experience of working with deep learning companies and researchers.
Data locality. When you have local GPU servers but store the data in a secure remote data center or on the cloud, you need to plan ahead to download specific datasets to your GPU box because it takes time. Sharing preprocessed dataset from one GPU box across your team is also slow and error-prone if there’re multiple preprocessing pipelines.
Code dependency on local folder structure. People use a folder structure to store images or videos. As a result, the data input pipeline has to take into consideration the raw folder structure which creates unnecessary & error-prone code dependency of the dataset folder structure.
Managing preprocessing pipelines. If you want to run some preprocessing, it would be ideal to save the preprocessed images as a local cache for training.But it’s usually hard to manage & version control the preprocessed images locally when there are multiple preprocessing pipelies and the dataset is very big.
Visualization. It’s difficult to visualize the raw data or preprocessed dataset on servers.
Reading a small slice of data. Another popular way is to store in HDF5/TFRecords format and upload to a cloud bucket, but still you have to manage many chunks of HDF5/TFRecords files. If you want to read a small slice of data, it’s not clear which TFRecord/HDF5 chunk you need to load. It’s also inefficient to load the whole file for a small slice of data.
Synchronization across team. If multiple users modify the data, there needs to be a data versioning and synchronization protocol implemented.
RAM management. Whenever you want to create a numpy array you are worried if the numpy array is going to fit in the local RAM/disk limit.