This quickstart also requires LangChain, tiktoken, and OpenAI:
!pip3 install langchain openai tiktoken
Creating Your First Vector Store
A Vector Store can be created using the Deep Lake integration with LangChain. This abstracts the low-level Deep Lake API from the user, and under the hood, the LangChain integration creates a Deep Lake dataset with text, embedding, id, and metadata tensors.
Let's embed and store Paul Graham's essays in a Deep Lake vector store. First, we download the data:
Next, let's import the required modules and set the OpenAI environment variables for embeddings:
Next, let's specify paths for the source text data and the underlying Deep Lake dataset. In this example, we store the dataset locally, but Deep Lake vector stores can also be created in memory, in the Deep Lake Tensor Database, or in your cloud. Further details are available here.
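The 100 samples reported in the summary below suggest that the essay text is split into chunks before it is embedded. As a rough illustration only, the sketch below splits a string into fixed-size character chunks using the standard library. The function name `split_text` and the `chunk_size` and `chunk_overlap` parameters are assumptions made for this example; they are not part of the Deep Lake or LangChain APIs, and a real pipeline would more likely rely on one of LangChain's text splitters.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[str]:
    """Split text into character chunks of at most chunk_size (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    # Each new chunk starts `step` characters after the previous one,
    # so consecutive chunks share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Tiny example: 10 characters split into chunks of 4.
chunks = split_text("abcdefghij", chunk_size=4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

Each chunk would then become one sample in the vector store, with its own embedding.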
The underlying Deep Lake dataset object is accessible in db.ds, and the data structure can be summarized using db.ds.summary(), which shows 4 tensors with 100 samples:
Deep Lake offers a variety of vector search options, depending on where the Vector Store is stored and on the infrastructure that should run the computations:
Let's continue from the Vector Store we created above and run an embeddings search based on a user prompt using the LangChain API.
from langchain.chains import RetrievalQA
from langchain.llms import OpenAIChat

# Re-load the vector store in case it's no longer initialized
db = DeepLake(dataset_path=dataset_path, embedding_function=embedding)

qa = RetrievalQA.from_chain_type(
    llm=OpenAIChat(model='gpt-3.5-turbo'),
    chain_type='stuff',
    retriever=db.as_retriever(),
)
Let's run the prompt and check out the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.
query = 'What are the first programs he tried writing?'

qa.run(query)
'The first programs he tried writing were on the IBM 1401 that his school district used for "data processing" in 9th grade.'
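The embedding search mentioned above can be pictured as ranking the stored chunks by their similarity to the query embedding and keeping the best few. The following is a minimal sketch in plain Python, assuming cosine similarity as the metric and tiny hand-written vectors in place of real OpenAI embeddings. The `top_k` helper and the toy corpus are hypothetical; this does not describe how Deep Lake or LangChain implement the search internally.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_embedding: Sequence[float],
    corpus: Sequence[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[Tuple[str, float]]:
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_embedding, emb)) for text, emb in corpus]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy corpus: (text, embedding) pairs with made-up 2-dimensional embeddings.
corpus = [
    ("alpha", [1.0, 0.0]),
    ("beta", [0.0, 1.0]),
    ("gamma", [1.0, 1.0]),
]
best = top_k([1.0, 0.0], corpus, k=2)
print([text for text, _ in best])  # ['alpha', 'gamma']
```

In a retrieval chain, the texts selected this way are what gets placed into the LLM context alongside the question.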
Vector Search Using the Compute Engine on the Client Side in LangChain
Vector search using the Compute Engine + LangChain API will be available soon.
Vector Search Using the Compute Engine on the Client Side in the Deep Lake API
To run the C++ Compute Engine on the client-side, please install:
import deeplake
import openai

# Read-only is sufficient permission for queries
ds = deeplake.load('hub://activeloop/twitter-algorithm', read_only=True)
Next, let's define the search term and embed it using OpenAI.
SEARCH_TERM = 'What do the trust and safety models do?'

embedding = openai.Embedding.create(input=SEARCH_TERM, model="text-embedding-ada-002")["data"][0]["embedding"]

# Format the embedding as a string, so it can be passed in the query string
embedding_search = ",".join([str(item) for item in embedding])
Finally, let's define the TQL query and run it using the Compute Engine on the client.
tql_query = f"select * from (select *, cosine_similarity(embedding, ARRAY[{embedding_search}]) as score) order by score desc limit 5"

ds_view = ds.query(tql_query)
ds_view.summary() shows the result contains the top 5 samples by score:
array(0.97839564, dtype=float32)
'// Delete configuration key-value. If is_directory is set in request,\n // recursively clean up all key-values under the path specified by `key`.\n rpc DeleteKeyValue(DeleteKeyValueRequest) returns (DeleteKeyValueResponse);'
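The two preparation steps in this section, formatting the embedding as a comma-separated string and interpolating it into the TQL text, can be wrapped in a small helper. This is a sketch only: `build_tql_query` is a hypothetical name, and the query text simply mirrors the example above rather than documenting TQL in general. Converting each item with `float()` and checking `limit` is a precaution so that only numbers end up in the query string.

```python
from typing import Sequence


def build_tql_query(embedding: Sequence[float], limit: int = 5) -> str:
    """Build the cosine-similarity TQL query used in this section (hypothetical helper)."""
    if not embedding:
        raise ValueError("embedding must not be empty")
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")

    # float() raises ValueError for anything that is not numeric.
    embedding_search = ",".join(str(float(item)) for item in embedding)

    return (
        "select * from (select *, "
        f"cosine_similarity(embedding, ARRAY[{embedding_search}]) as score) "
        f"order by score desc limit {limit}"
    )


# Example with a made-up 2-dimensional embedding.
print(build_tql_query([0.1, 0.2], limit=3))
```

The returned string could then be passed to `ds.query(...)` as in the example above.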
Vector Search Using the Managed Tensor Database in LangChain
Vector search using the Tensor Database + LangChain API will be available soon.
Vector Search Using the Managed Tensor Database + REST API
import os

import openai
import requests

# Tokens should be set in environment variables.
ACTIVELOOP_TOKEN = os.environ['ACTIVELOOP_TOKEN']
DATASET_PATH = 'hub://activeloop/twitter-algorithm'
ENDPOINT_URL = 'https://app.activeloop.ai/api/query/v1'
SEARCH_TERM = 'What do the trust and safety models do?'
# os.environ['OPENAI_API_KEY'] (the OpenAI token) should also exist in the environment variables.

# The headers contain the user token
headers = {
    "Authorization": f"Bearer {ACTIVELOOP_TOKEN}",
}

# Embed the search term
embedding = openai.Embedding.create(input=SEARCH_TERM, model="text-embedding-ada-002")["data"][0]["embedding"]

# Format the embedding as a string, so it can be passed in the REST API request.
embedding_search = ",".join([str(item) for item in embedding])

# Create the query using TQL.
# l2_norm is a distance, so the closest matches have the smallest scores.
query = f"select * from (select text, l2_norm(embedding - ARRAY[{embedding_search}]) as score from \"{DATASET_PATH}\") order by score asc limit 5"

# Submit the request
response = requests.post(ENDPOINT_URL, json={"query": query}, headers=headers)
data = response.json()

print(data)
// Node.js version of the same REST API request
const axios = require('axios');

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const ACTIVELOOP_TOKEN = process.env.ACTIVELOOP_TOKEN;

const QUERY = 'What do the trust and safety models do?';
const DATASET_PATH = 'hub://activeloop/twitter-algorithm';
const ENDPOINT_URL = 'https://app.activeloop.ai/api/query/v1';

// Function to get the embeddings of a text from the OpenAI API
async function getEmbedding(text) {
  const response = await axios.post('https://api.openai.com/v1/embeddings', {
    input: text,
    model: "text-embedding-ada-002"
  }, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${OPENAI_API_KEY}`
    }
  });

  return response.data;
}

// Function to search the dataset using the given query on Activeloop
async function searchDataset(query) {
  const response = await axios.post(ENDPOINT_URL, {
    query: query,
  }, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${ACTIVELOOP_TOKEN}`
    }
  });

  return response.data;
}

// Main function to search for similar texts in the dataset based on the query term
async function searchSimilarTexts(query, dataset_path) {
  // Get the embedding of the query term
  const embedding = await getEmbedding(query);
  const embedding_search = embedding.data[0].embedding.join(',');

  // Construct the search query.
  // l2_norm is a distance, so the closest matches have the smallest scores.
  const TQL = `SELECT * FROM (
    SELECT text, l2_norm(embedding - ARRAY[${embedding_search}]) AS score
    from "${dataset_path}"
  ) ORDER BY score ASC LIMIT 5`;

  // Search the dataset using the constructed query
  const response = await searchDataset(TQL);

  // Log the search results
  console.log(response);
}

searchSimilarTexts(QUERY, DATASET_PATH)
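For environments where neither `requests` nor `axios` is available, the same request can be assembled with Python's standard library. The sketch below builds the POST request separately from sending it, which makes the headers and JSON body easy to inspect. The endpoint URL and the `{"query": ...}` payload shape are taken from the examples above; the helper names `build_query_request` and `run_query` are hypothetical, and the response is assumed to be JSON as in the `response.json()` call above.

```python
import json
import urllib.request

# Endpoint taken from the examples above.
ENDPOINT_URL = "https://app.activeloop.ai/api/query/v1"


def build_query_request(
    token: str, tql_query: str, endpoint_url: str = ENDPOINT_URL
) -> urllib.request.Request:
    """Assemble (but do not send) the POST request for a TQL query."""
    if not token:
        raise ValueError("an Activeloop token is required")

    body = json.dumps({"query": tql_query}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_query(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request and parse the body as JSON (assumed response format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `run_query(build_query_request(os.environ['ACTIVELOOP_TOKEN'], query))` would then play the role of `requests.post(...)` in the Python example.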
Visualizing your Vector Store
Deep Lake enables users to visualize and interpret large datasets, including Vector Stores with embeddings. Visualization is available for each dataset stored in or connected to Deep Lake. Here's an example.
Authentication
To use Deep Lake features that require authentication (Activeloop storage, Tensor Database storage, connecting your cloud dataset to the Deep Lake UI, etc.) you should register in the Deep Lake App and authenticate on the client using the methods in the link below: