When creating large Deep Lake datasets, it is recommended to:
Parallelize the ingestion using deeplake.compute with a large num_workers (8-32)
Use checkpointing to periodically auto-commit data using .eval(..., checkpoint_interval = <commit_every_N_samples>)
If there is an error during the data ingestion, the dataset is automatically reset to the last auto-commit with valid data.
Additional recommendations are:
If upload errors are intermittent and the error-causing samples (such as bad links) can safely be skipped, you can run .eval(..., ignore_errors=True).
When uploading linked data, if a data integrity check is not necessary and querying based on shape information is not important, you can increase upload speed by 10-100X by setting the following parameters to False when creating the linked tensor: verify, create_shape_tensor, and create_sample_info_tensor (see the sketch after these recommendations).
We highly recommend performing integrity checks for linked data during dataset creation, even though it slows data ingestion. This one-time check will significantly reduce debugging during querying, training, or other workflows.
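Putting these recommendations together, a minimal sketch of a parallel, checkpointed upload is shown below. The dataset path, credentials, tensor names, and the file_urls list are placeholders for illustration, not fixed parts of the Deep Lake API:

import deeplake

# Hypothetical destination path and credentials
ds = deeplake.empty('s3://<destination-bucket>/my-dataset', creds=my_creds)

with ds:
    # Skipping verification and the shape/sample-info tensors speeds up linked
    # uploads, at the cost of integrity checks and shape-based queries
    ds.create_tensor('images', htype='link[image]', sample_compression='jpeg',
                     verify=False, create_shape_tensor=False,
                     create_sample_info_tensor=False)
    ds.create_tensor('labels', htype='class_label')

@deeplake.compute
def upload_sample(file_url, sample_out):
    # Append one sample; the image is stored as a link to the source file
    sample_out.append({'images': deeplake.link(file_url), 'labels': 0})

upload_sample().eval(
    file_urls,                   # hypothetical list of source image URLs
    ds,
    num_workers=16,              # parallelize the ingestion
    checkpoint_interval=10000,   # auto-commit every 10,000 samples
    ignore_errors=True,          # skip error-causing samples instead of failing
)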
Example Dataset Creation Using Checkpointing
In this example, we upload the COCO dataset, originally stored in an S3 bucket, to a Deep Lake dataset stored in another S3 bucket. The images are uploaded as links, and the annotations (categories, masks, bounding boxes) are stored directly in the Deep Lake dataset. Annotations such as pose keypoints and supercategories are omitted.
import deeplake
import numpy as np
import boto3
from pycocotools.coco import COCO
First, let's define the S3 buckets where the source COCO data is stored and where the Deep Lake dataset will be stored. Let's also connect to the source data via boto3 and define a credentials dictionary (on some systems, credentials can be pulled automatically from the environment).
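The exact setup depends on your environment; a sketch might look like the following, where the bucket names are placeholders and the credential-loading approach is an assumption:

import os

# Hypothetical source and destination paths - replace with your own buckets
coco_source_bucket = '<source-bucket-with-coco>'
deeplake_dest_path = 's3://<destination-bucket>/coco-train'

# Connect to the source data via boto3
s3 = boto3.client('s3')

# Credentials dictionary passed to Deep Lake; on some systems credentials
# are pulled automatically from the environment and this can be left empty
creds = {
    'aws_access_key_id': os.environ.get('AWS_ACCESS_KEY_ID'),
    'aws_secret_access_key': os.environ.get('AWS_SECRET_ACCESS_KEY'),
}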