Creating Datasets at Scale
Creating large Deep Lake datasets with high performance and reliability
How to create Deep Lake datasets at scale
This workflow assumes the reader has experience uploading datasets using Deep Lake's distributed framework, deeplake.compute.
When creating large Deep Lake datasets, it is recommended to:
- Parallelize the ingestion using deeplake.compute with a large num_workers (8-32).
- Use checkpointing to periodically auto-commit data using .eval(... checkpoint_interval = <commit_every_N_samples>). If there is an error during the data ingestion, the dataset is automatically reset to the last auto-commit with valid data.
Additional recommendations are:
- If upload errors are intermittent and error-causing samples may be skipped (like bad links), you can run .eval(... ignore_errors=True).
- When uploading linked data, if a data integrity check is not necessary, and if querying based on shape information is not important, you can increase the upload speed by 10-100X by setting the following parameters to False when creating the linked tensor: verify, create_shape_tensor, create_sample_info_tensor.
We highly recommend performing integrity checks for linked data during dataset creation, even though it slows data ingestion. This one-time check will significantly reduce debugging during querying, training, or other workflows.
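As a minimal sketch of the trade-off described above, the helper below builds the keyword arguments for a linked image tensor. The helper names, the tensor name images, and the jpeg compression are illustrative assumptions; verify, create_shape_tensor, and create_sample_info_tensor are the parameters named in the text.

```python
def linked_tensor_kwargs(skip_checks: bool = False) -> dict:
    """Keyword arguments for ds.create_tensor(...) on a linked image tensor.

    Hypothetical helper: with skip_checks=True, the three flags named in the
    text are set to False, trading integrity checks and shape metadata for
    upload speed.
    """
    kwargs = {"htype": "link[image]", "sample_compression": "jpeg"}
    if skip_checks:
        kwargs.update(
            verify=False,
            create_shape_tensor=False,
            create_sample_info_tensor=False,
        )
    return kwargs


def create_linked_image_tensor(ds, name: str = "images", skip_checks: bool = False):
    """Create the linked tensor on an existing Deep Lake dataset object `ds`."""
    return ds.create_tensor(name, **linked_tensor_kwargs(skip_checks))
```

Leaving skip_checks at its default keeps the integrity checks recommended above; the fast path is best reserved for links that are already known to be valid.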
Example Dataset Creation Using Checkpointing
In this example, we upload the COCO dataset, originally stored in an S3 bucket, to a Deep Lake dataset stored in another S3 bucket. The images are uploaded as links, and the annotations (categories, masks, bounding boxes) are stored in the Deep Lake dataset. Annotations such as pose keypoints or supercategories are omitted.
First, let's define the S3 buckets where the source COCO data is stored and where the Deep Lake dataset will be stored. Let's also connect to the source data via boto3 and define a credentials dictionary (on some systems, credentials can be automatically pulled from the environment).
The annotations are downloaded locally to simplify the upload code, since the COCO API was designed to read annotations from a local file.
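The setup step could look roughly like the sketch below. The environment variable names, the function names, and the bucket and key values in the usage comment are assumptions made for illustration; only the boto3 connection and the credentials dictionary come from the text.

```python
import os


def credentials_from_env(environ=os.environ) -> dict:
    """Build a credentials dictionary from environment variables.

    The variable names are an assumption; adjust them to your setup.
    Missing variables yield None.
    """
    return {
        "aws_access_key_id": environ.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": environ.get("AWS_SECRET_ACCESS_KEY"),
    }


def download_annotations(bucket: str, key: str, local_path: str, creds: dict) -> str:
    """Download the COCO annotation file so the COCO API can read it locally."""
    import boto3  # imported lazily so credentials_from_env works without boto3

    s3 = boto3.resource(
        "s3",
        aws_access_key_id=creds.get("aws_access_key_id"),
        aws_secret_access_key=creds.get("aws_secret_access_key"),
    )
    s3.Bucket(bucket).download_file(key, local_path)
    return local_path


# Hypothetical usage (bucket and key names are placeholders):
# creds = credentials_from_env()
# download_annotations("my-coco-source-bucket",
#                      "annotations/instances_train2017.json",
#                      "instances_train2017.json", creds)
```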
Next, let's create an empty Deep Lake dataset at the desired path and connect it to the Deep Lake backend. We also add managed credentials for accessing linked data. In this case, the managed credentials for accessing the dataset are the same as those for accessing the linked data, but that's not a general requirement. More details are available in the Deep Lake documentation on managed credentials.
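A hedged sketch of this step follows. The function names and all argument values in the usage comment are hypothetical; the calls used (deeplake.empty, ds.connect, ds.add_creds_key) assume the Deep Lake 3.x Python API.

```python
def attach_managed_creds(ds, org_id: str, creds_key: str, token=None):
    """Connect a dataset to the Deep Lake backend and register managed credentials."""
    ds.connect(org_id=org_id, creds_key=creds_key, token=token)
    # The same key is reused for the linked data here, which is not a
    # general requirement.
    ds.add_creds_key(creds_key, managed=True)
    return ds


def create_connected_dataset(path: str, creds: dict, org_id: str, creds_key: str, token=None):
    """Create an empty dataset at `path`, then connect it and add the credentials."""
    import deeplake  # assumes the deeplake 3.x package is installed

    ds = deeplake.empty(path, creds=creds)
    return attach_managed_creds(ds, org_id, creds_key, token)


# Hypothetical usage (all values are placeholders):
# ds = create_connected_dataset("s3://my-deeplake-bucket/coco-train", creds,
#                               org_id="my_org", creds_key="my_managed_creds")
```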
Next, we define the list category_names, which maps each numerical label to a category name: the label is the index of the name in this list. If label annotations are uploaded as text (which is not the case here), the list is auto-populated. We pass category_names to the class_names parameter during tensor creation, though it can also be updated later, or omitted entirely if the numerical labels are sufficient.
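One way to build this list is sketched below, assuming category_info is the list of category dictionaries returned by the COCO API (for example coco.loadCats(coco.getCatIds())). The helper names and the tensor name categories are illustrative.

```python
def build_category_names(category_info):
    """Return the category names in the order they appear in category_info."""
    return [cat["name"] for cat in category_info]


def build_category_index(category_info):
    """Map each COCO category id to its index in category_names.

    COCO ids are not guaranteed to be contiguous, so the index in the list
    is used as the numerical label instead of the raw id.
    """
    return {cat["id"]: i for i, cat in enumerate(category_info)}


def create_category_tensor(ds, category_names, name: str = "categories"):
    """Create a class_label tensor whose class_names is the list above."""
    return ds.create_tensor(name, htype="class_label", class_names=category_names)
```

With this mapping, a numerical label n in the dataset corresponds to category_names[n].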
Next, we define the input iterable and the deeplake.compute function. The elements of the iterable are distributed among the workers during execution of the function.
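The sketch below illustrates what such a function might look like. It is not the original code: the function names, the tensor names, the image_prefix argument, and the S3 key layout are assumptions, and masks are left out for brevity. It assumes the Deep Lake 3.x API (deeplake.compute, deeplake.link) and the pycocotools COCO API.

```python
def build_sample(img_meta, anns, id_to_index, bucket, image_prefix):
    """Convert one COCO image record into plain Python values."""
    return {
        "image_url": f"s3://{bucket}/{image_prefix}/{img_meta['file_name']}",
        "categories": [id_to_index[a["category_id"]] for a in anns],
        "boxes": [a["bbox"] for a in anns],
    }


def coco_2_deeplake(img_id, sample_out, coco_api, id_to_index, bucket, image_prefix, creds_key):
    """Process one element of the input iterable (an image id)."""
    import deeplake  # lazy imports keep build_sample usable on its own
    import numpy as np

    img_meta = coco_api.loadImgs(img_id)[0]
    anns = coco_api.loadAnns(coco_api.getAnnIds(imgIds=img_id))
    sample = build_sample(img_meta, anns, id_to_index, bucket, image_prefix)
    sample_out.append({
        # The image is stored as a link, not copied into the dataset.
        "images": deeplake.link(sample["image_url"], creds_key=creds_key),
        "categories": np.array(sample["categories"], dtype=np.uint32),
        # reshape keeps an (N, 4) shape even when an image has no boxes
        "boxes": np.array(sample["boxes"], dtype=np.float32).reshape(-1, 4),
    })


def make_transform():
    """Wrap the function with deeplake.compute (equivalent to using it as a decorator)."""
    import deeplake

    return deeplake.compute(coco_2_deeplake)


# Hypothetical input iterable: one image id per sample.
# img_ids = sorted(coco_api.getImgIds())
```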
Finally, we execute the deeplake.compute function and set checkpoint_interval to 25000. The dataset has a total of ~118000 samples.
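The call could be sketched as follows. The wrapper name run_ingestion is hypothetical; the keyword arguments are the ones named in this guide. The helper checkpoint_positions assumes that auto-commits land on exact multiples of checkpoint_interval, which is consistent with the error message shown later in this guide but is an assumption here.

```python
def checkpoint_positions(total_samples: int, checkpoint_interval: int) -> list:
    """Sample counts at which an auto-commit is expected (assumed: exact multiples)."""
    if checkpoint_interval <= 0:
        return []
    return list(range(checkpoint_interval, total_samples + 1, checkpoint_interval))


def run_ingestion(transform, img_ids, ds, num_workers=8,
                  checkpoint_interval=25000, ignore_errors=False):
    """Run the ingestion.

    `transform` is the object obtained by calling the deeplake.compute
    function with its extra arguments (everything except the input element
    and sample_out).
    """
    transform.eval(
        img_ids,
        ds,
        num_workers=num_workers,
        checkpoint_interval=checkpoint_interval,
        ignore_errors=ignore_errors,
    )


# For ~118000 samples and an interval of 25000:
# checkpoint_positions(118000, 25000) -> [25000, 50000, 75000, 100000]
```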
After the upload is complete, we see commits like the one below in ds.log().
If an upload error occurs but the script completes, the dataset will be reset to the prior checkpoint and you will see a message such as:
TransformError: Transform failed at index <51234> of the input data on the item: <item_string>. Last checkpoint: 50000 samples processed. You can slice the input to resume from this point. See traceback for more details.
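Resuming from the last checkpoint amounts to slicing the input iterable. The sketch below shows one way to do this; both helpers are hypothetical, and parsing the checkpoint count out of the message text assumes the message keeps the format shown above.

```python
import re


def last_checkpoint(message: str):
    """Extract the sample count from 'Last checkpoint: N samples processed', if present."""
    match = re.search(r"Last checkpoint: (\d+) samples processed", message)
    return int(match.group(1)) if match else None


def remaining_inputs(inputs, processed: int):
    """Return the part of the input that still needs to be ingested."""
    return inputs[processed:]


# Hypothetical usage: re-run the ingestion on the remaining inputs only.
# remaining = remaining_inputs(img_ids, 50000)
```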
If the script does not complete due to a system failure or keyboard interrupt, you should load the dataset and run ds.reset(), or load the dataset using ds = deeplake.load(... reset = True). This will restore the dataset to the prior checkpoint. You may find how many samples were successfully processed using:
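The original snippet is not reproduced in this text. As a hedged sketch, assuming the dataset length after a reset equals the number of samples committed at the last checkpoint, the count could be obtained as follows. The function name and the load argument are hypothetical; load exists only so the logic can be exercised without Deep Lake installed.

```python
def samples_at_last_checkpoint(path: str, creds=None, load=None) -> int:
    """Load the dataset with reset=True and return its length."""
    if load is None:
        import deeplake  # assumes the deeplake 3.x package is installed

        load = deeplake.load
    # reset=True restores the dataset to the prior checkpoint on load
    ds = load(path, creds=creds, reset=True)
    return len(ds)
```

Calling ds.log() on the loaded dataset should also list the auto-commits created during the ingestion.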