-- Exact match, which generally requires that the sample
-- has 1 value, i.e. no lists or multi-dimensional arrays
select * where tensor_name == 'text_value'    # If value is text
select * where tensor_name == numeric_value   # If value is numeric
select * where contains(tensor_name, 'text_value')
Tensor or group names with special characters should be wrapped with double-quotes:
select * where contains("tensor-name", 'text_value')
select * where "tensor_name/group_name" == numeric_value
When the query is written as a double-quoted Python string, escape the inner double-quotes with backslashes:
select * where contains(\"tensor-name\", 'text_value')
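As a minimal sketch of the escaping rule, the snippet below builds the same query text two ways. The variable names are illustrative only, and how the string is passed to Deep Lake is deliberately left out.

```python
# Query delimited by double quotes: the inner double quotes must be escaped.
escaped = "select * where contains(\"tensor-name\", 'text_value')"

# Triple-quoted alternative: no escaping is needed.
triple_quoted = """select * where contains("tensor-name", 'text_value')"""

# Both literals produce exactly the same query text.
print(escaped == triple_quoted)
print(escaped)
```

The backslashes are a Python string-literal concern; they are not part of the query that the engine receives.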
select * where contains(tensor_name, 'text_value') and NOT contains(tensor_name_2, numeric_value)
select * where contains(tensor_name, 'text_value') or tensor_name_2 == numeric_value
select * where (contains(tensor_name, 'text_value') and shape(tensor_name_2)[dimension_index] > numeric_value) or contains(tensor_name, 'text_value_2')
-- Order by requires that sample is numeric and has 1 value,
-- i.e. no lists or multi-dimensional arrays
-- The default order is ASCENDING (asc)
select * where contains(tensor_name, 'text_value') order by tensor_name asc
ANY, ALL, and ALL_STRICT
all adheres to NumPy and list logic where all(empty_sample) returns True
all_strict is more intuitive for queries so all_strict(empty_sample) returns False
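The difference between the two can be illustrated in plain Python. The `all_strict` function below is a hypothetical Python analogue written for this sketch, not the query engine's implementation.

```python
def all_strict(values):
    """Like all(), but an empty input yields False (illustrative analogue)."""
    values = list(values)
    return len(values) > 0 and all(values)


empty_sample = []
print(all(empty_sample))         # built-in all() is vacuously True on empty input
print(all_strict(empty_sample))  # the strict variant returns False

non_empty = [True, True]
print(all(non_empty), all_strict(non_empty))  # both agree on non-empty input
```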
# Select based on index (row_number)
select * where row_number() == 10
# Referencing values of a tensor at index (row_number)
select * order by l2_norm(<tensor_name> - data(<tensor_name>, index))
# Finds rows of data with embeddings most similar to index 10
select * order by l2_norm(embedding - data(embedding, 10))
SAMPLE BY
select * sample by weight_choice(expression_1: weight_1, expression_2: weight_2, ...) replace True limit N
weight_choice resolves the weight that is used when multiple expressions evaluate to True for a given sample. Options are max_weight, sum_weight. For example, if weight_choice is max_weight, then the maximum weight will be chosen for that sample.
replace determines whether samples should be drawn with replacement. It defaults to True.
limit specifies the number of samples that should be returned. If unspecified, the sampler returns as many samples as the length of the dataset.
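The interaction of the three options can be sketched in plain Python. Everything here is illustrative: `resolve_weight` and `sample_by` are hypothetical helpers, predicates stand in for query expressions, and giving unmatched samples a weight of 0 is an assumption of this sketch rather than documented behavior.

```python
import random


def resolve_weight(sample, rules, weight_choice="max_weight"):
    """Resolve one sample's weight from (predicate, weight) pairs."""
    matched = [weight for predicate, weight in rules if predicate(sample)]
    if not matched:
        return 0.0  # assumption: samples matching no expression get weight 0
    if weight_choice == "max_weight":
        return max(matched)
    if weight_choice == "sum_weight":
        return sum(matched)
    raise ValueError(f"unknown weight_choice: {weight_choice}")


def sample_by(samples, rules, weight_choice="max_weight",
              replace=True, limit=None, seed=None):
    """Draw samples in proportion to their resolved weights."""
    rng = random.Random(seed)
    n = len(samples) if limit is None else limit  # default: dataset length
    weights = [resolve_weight(s, rules, weight_choice) for s in samples]
    if replace:
        return rng.choices(samples, weights=weights, k=n)
    # Without replacement: draw one at a time and remove the drawn item.
    pool = list(zip(samples, weights))
    drawn = []
    while len(drawn) < n and any(w > 0 for _, w in pool):
        idx = rng.choices(range(len(pool)), weights=[w for _, w in pool], k=1)[0]
        drawn.append(pool.pop(idx)[0])
    return drawn


# Two overlapping expressions: a sample equal to 2 or 3 matches both.
rules = [(lambda s: s >= 1, 2.0), (lambda s: s >= 2, 5.0)]
print(resolve_weight(2, rules, "max_weight"))  # 5.0
print(resolve_weight(2, rules, "sum_weight"))  # 7.0
print(sample_by([0, 1, 2, 3], rules, replace=True, limit=6, seed=0))
```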
EMBEDDING SEARCH
Deep Lake supports several vector operations for embedding search. Typically, a vector operation is used to compute a score for each sample, and the data is returned ordered by that score.
select * from (select tensor_1, tensor_2, <VECTOR_OPERATION> as score) order by score desc limit 10

-- THE SUPPORTED VECTOR_OPERATIONS ARE:
l1_norm(<embedding_tensor> - ARRAY[<search_embedding>])   # Order should be asc
l2_norm(<embedding_tensor> - ARRAY[<search_embedding>])   # Order should be asc
linf_norm(<embedding_tensor> - ARRAY[<search_embedding>])   # Order should be asc
cosine_similarity(<embedding_tensor>, ARRAY[<search_embedding>])   # Order should be desc
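To make the ordering rule concrete, here is a plain-Python sketch of what each operation computes, plus a hypothetical `build_search_query` helper that picks `asc` for the distance norms and `desc` for cosine similarity. The helper, the tensor names, and the comma-separated formatting inside `ARRAY[...]` are assumptions of this sketch, not part of the Deep Lake API.

```python
import math


def subtract(a, b):
    """Element-wise difference, mirroring <embedding_tensor> - ARRAY[...]."""
    return [x - y for x, y in zip(a, b)]


def l1_norm(diff):
    return sum(abs(d) for d in diff)


def l2_norm(diff):
    return math.sqrt(sum(d * d for d in diff))


def linf_norm(diff):
    return max(abs(d) for d in diff)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_search_query(operation, embedding_tensor, search_embedding, limit=10):
    """Build a query string with the sort order that matches the operation."""
    array = "ARRAY[" + ", ".join(str(v) for v in search_embedding) + "]"
    if operation == "cosine_similarity":
        expr, order = f"cosine_similarity({embedding_tensor}, {array})", "desc"
    else:  # distance norms: smaller means more similar
        expr, order = f"{operation}({embedding_tensor} - {array})", "asc"
    return (f"select * from (select tensor_1, tensor_2, {expr} as score) "
            f"order by score {order} limit {limit}")


diff = subtract([3.0, 4.0], [0.0, 0.0])
print(l1_norm(diff), l2_norm(diff), linf_norm(diff))
print(build_search_query("l2_norm", "embedding", [0.1, 0.2]))
```

Distances shrink as vectors get closer, so they sort ascending; cosine similarity grows, so it sorts descending.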
VIRTUAL TENSORS
Virtual tensors are the result of a computation and are not tensors in the Deep Lake dataset. However, they can be treated as tensors in the API.
-- "score" is a virtual tensor
select * from (select tensor_1, tensor_2, <VECTOR_OPERATION> as score) order by score desc limit 10
-- "box_beyond_image" is a virtual tensor
select *, any(boxes[:,0])<0 as box_beyond_image where ....
-- "tensor_sum" is a virtual tensor
select *, tensor_1 + tensor_3 as tensor_sum where ......
When combining embedding search with filtering (where conditions), the filter condition is evaluated prior to the embedding search.
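The evaluation order can be pictured with a small in-memory sketch: rows are filtered first, and only the survivors are scored and ranked. The `filtered_search` function and the row layout are hypothetical and only model the order of operations, not how Deep Lake executes queries.

```python
import math


def filtered_search(rows, predicate, query_embedding, limit=10):
    """Apply the filter first, then rank the remaining rows by L2 distance."""
    candidates = [row for row in rows if predicate(row)]  # where condition
    return sorted(
        candidates,
        key=lambda row: math.dist(row["embedding"], query_embedding),
    )[:limit]


rows = [
    {"label": "cat", "embedding": [0.0, 0.0]},  # closest overall, but filtered out
    {"label": "dog", "embedding": [1.0, 0.0]},
    {"label": "dog", "embedding": [3.0, 4.0]},
]
result = filtered_search(rows, lambda r: r["label"] == "dog", [0.0, 0.0], limit=1)
print(result)
```

Because the filter runs first, the nearest row overall is never scored if it fails the condition.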
GROUP BY AND UNGROUP BY
Group by creates a sequence of data based on the common properties that are being grouped (e.g. frames into videos). Ungroup by splits sequences into their individual elements (e.g. videos into images).
select * group by label, video_id   # Groups all data with the same label and video_id into the same sequence
select * ungroup by split   # Splits sequences into their original pieces
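As a rough mental model, grouping and ungrouping can be sketched over plain Python dictionaries. The `group_by` and `ungroup` helpers and the `frame` field are hypothetical; they only illustrate the shape of the result, not the engine's behavior.

```python
from collections import defaultdict


def group_by(rows, keys):
    """Collect rows that share the same values for `keys` into sequences."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return list(groups.values())  # one sequence per distinct key combination


def ungroup(sequences):
    """Flatten sequences back into their individual elements."""
    return [row for sequence in sequences for row in sequence]


frames = [
    {"label": "cat", "video_id": 1, "frame": 0},
    {"label": "cat", "video_id": 1, "frame": 1},
    {"label": "dog", "video_id": 2, "frame": 0},
]
sequences = group_by(frames, ["label", "video_id"])
print([len(sequence) for sequence in sequences])
print(ungroup(sequences) == frames)
```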
EXPAND BY
Expand by includes samples before and after a query condition is satisfied.