Deep Lake Vector Store in LangChain
Deep Lake can be used as a VectorStore in LangChain for building apps that require filtering and vector search. In this tutorial, we will show how to create a Deep Lake Vector Store in LangChain and use it to build a Q&A App about the Twitter OSS recommendation algorithm. This tutorial requires installation of:
Downloading and Preprocessing the Data
First, let's import the necessary packages and make sure the Activeloop and OpenAI keys are set in the environment variables ACTIVELOOP_TOKEN and OPENAI_API_KEY.
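One way to catch a missing key early is to check the environment before doing any work. The sketch below uses only the standard library; the helper name `missing_keys` is hypothetical and not part of LangChain or Deep Lake.

```python
import os

# The two variable names come from the tutorial text above.
REQUIRED_KEYS = ("ACTIVELOOP_TOKEN", "OPENAI_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

Calling `missing_keys()` with no arguments inspects the real environment; an empty list means both keys are present.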
Next, let's clone the Twitter OSS recommendation algorithm:
Next, let's load all the files from the repo into a list:
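As a rough illustration of this step, the sketch below walks a directory with the standard library and collects every file that decodes as UTF-8. In a LangChain app a document loader would typically do this job; the function name `load_text_files` and the record layout (`text`, `source`) are assumptions made for this example.

```python
import os


def load_text_files(repo_path):
    """Walk repo_path and return a list of {"text", "source"} records."""
    docs = []
    for dirpath, _dirnames, filenames in os.walk(repo_path):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    docs.append({"text": f.read(), "source": path})
            except (UnicodeDecodeError, OSError):
                # Skip binary or unreadable files instead of aborting the walk.
                continue
    return docs
```

Keeping the file path next to the text matters later, because the path is the kind of value that can be stored as metadata and used for filtering.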
A note on chunking text files:
Text files are typically split into chunks before creating embeddings. In general, using more (and therefore smaller) chunks increases the relevance of the data that is fed into the language model, since granular data can be selected with higher precision. However, since an embedding is created for each chunk, more chunks also increase the computational cost.
Chunks in the above context should not be confused with Deep Lake chunks!
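To make the trade-off concrete, here is a minimal character-based splitter. It is a standalone sketch rather than LangChain's text splitter, and the default `chunk_size` and `chunk_overlap` values are assumptions chosen for illustration.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each chunk starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Halving `chunk_size` roughly doubles the number of chunks, and therefore the number of embeddings that have to be computed and stored.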
Creating the Deep Lake Vector Store
First, we specify a path for storing the Deep Lake dataset containing the embeddings and their metadata.
Next, we specify an OpenAI embedding model for creating the embeddings, and create the VectorStore. This process creates an embedding for each element in the texts list and stores it in Deep Lake format at the specified path.
The Deep Lake Vector Store has four tensors: text, embedding, ids, and metadata.
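The four-tensor layout can be pictured as four parallel columns in which row i of each column describes the same chunk. The class below is a toy in-memory stand-in written for this explanation; `ToyVectorStore` and its `add` method are hypothetical and are not the Deep Lake API.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyVectorStore:
    """In-memory stand-in with four parallel columns (not the Deep Lake API)."""
    text: list = field(default_factory=list)
    embedding: list = field(default_factory=list)
    ids: list = field(default_factory=list)
    metadata: list = field(default_factory=list)

    def add(self, text, embedding, metadata=None):
        """Append one row to every column and return its generated id."""
        row_id = str(uuid.uuid4())
        self.text.append(text)
        self.embedding.append(list(embedding))
        self.ids.append(row_id)
        self.metadata.append(dict(metadata or {}))
        return row_id

    def __len__(self):
        return len(self.ids)
```

Because every `add` call appends to all four columns, the columns always have the same length.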
Use the Vector Store in a Q&A App
We can now use the VectorStore in a Q&A app, where the embeddings are used to select the relevant documents (texts) that are fed into an LLM in order to answer a question.
If we were on another machine, we would load the existing Vector Store without recalculating the embeddings:
We have to create a retriever object and specify the search parameters.
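Conceptually, a retriever ranks stored embeddings by similarity to the query embedding and returns the top k. The sketch below assumes cosine similarity as the distance metric and a default of `k=4`; both are illustrative choices, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, rows, k=4):
    """Return the k rows most similar to query_embedding, best first."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_embedding, row["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

Here each row is assumed to be a dictionary with at least an `embedding` key; the metric and k play the role of the search parameters mentioned above.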
Finally, let's create a RetrievalQA chain in LangChain and run it:
This returns:
Most of the SimClusters code is written in Scala, as seen in the provided context with the file path src/scala/com/twitter/simclusters_v2/scio/bq_generation and the package declarations that use the Scala package syntax.
We can tune k in the retriever depending on whether the prompt exceeds the model's token limit. A higher k generally improves accuracy by including more data in the prompt, at the cost of a longer prompt.
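One way to reason about this limit is to keep adding top-ranked chunks until a token budget is exhausted; the number of chunks kept is then the largest k that fits. The sketch below is an assumption-laden illustration: `fit_to_budget` is a hypothetical helper, and the default token counter simply counts whitespace-separated words, which only approximates a real tokenizer.

```python
def fit_to_budget(ranked_texts, max_tokens, count_tokens=None):
    """Keep top-ranked texts, in order, until the token budget is used up."""
    if count_tokens is None:
        # Crude assumption: one token per whitespace-separated word.
        count_tokens = lambda text: len(text.split())
    kept, used = [], 0
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept
```

Passing a real tokenizer's counting function as `count_tokens` would give a tighter estimate than the word-count default.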
Adding data to an existing Vector Store
Data can be added to an existing Vector Store by loading it using its path and adding documents or texts.
Adding Hybrid Search to the Vector Store
Since embedding search can be computationally expensive, you can narrow the search by applying an explicit filter on top of the embedding search. Suppose we want to answer a question related to the trust and safety models. We can filter the filenames (source) in the metadata using a custom function that is added to the retriever:
This returns:
"The Trust and Safety Models are designed to detect various types of content on Twitter that may be inappropriate, harmful, or against their terms of service.........."
Filters can also be specified as a dictionary. For example, if the metadata tensor had a key year, we can filter based on that key using:
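Both filter styles boil down to a predicate that is applied to each row's metadata before any embeddings are compared. The sketch below shows the idea on plain dictionaries; the helper names, the row layout, and the example paths in the usage note are hypothetical and do not reproduce the retriever's actual filter interface.

```python
def source_contains(substring):
    """Build a predicate that keeps rows whose metadata source contains substring."""
    def keep(row):
        return substring in row["metadata"].get("source", "")
    return keep


def metadata_equals(expected):
    """Build a predicate from a dictionary of required metadata key/value pairs."""
    def keep(row):
        return all(row["metadata"].get(k) == v for k, v in expected.items())
    return keep


def prefilter(rows, predicate):
    """Drop non-matching rows before any embedding comparison is run."""
    return [row for row in rows if predicate(row)]
```

For example, `prefilter(rows, source_contains("trust_and_safety"))` would keep only rows whose source path mentions that (assumed) directory name, and `prefilter(rows, metadata_equals({"year": 2020}))` would keep only rows whose metadata has year equal to 2020.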
Using Deep Lake in Applications that Require Concurrency
For applications that write data concurrently, users should set up a lock system to queue the write operations and prevent multiple clients from writing to the Deep Lake Vector Store at the same time. This can be done with a few lines of code, as in the example below:
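A minimal sketch of such a lock, using Python's `threading.Lock`. The list named `store` is a hypothetical stand-in for the Vector Store, and `add_texts` here is a local helper, not the LangChain method of the same name.

```python
import threading

write_lock = threading.Lock()
store = []  # hypothetical stand-in for the Vector Store


def add_texts(texts):
    """Serialize writes so only one thread appends at a time."""
    with write_lock:
        store.extend(texts)


def run_writers(n_writers=8, per_writer=100):
    """Start n_writers threads that each write per_writer items; return the store size."""
    threads = [
        threading.Thread(
            target=add_texts,
            args=([f"writer{i}-item{j}" for j in range(per_writer)],),
        )
        for i in range(n_writers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(store)
```

A `threading.Lock` only coordinates threads inside one process. Writers running in separate processes or on separate machines would need a lock that all of them can see, such as a multiprocessing lock or an external lock service.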
Accessing the Low Level Deep Lake API (Advanced)
When using a Deep Lake Vector Store in LangChain, the underlying Vector Store and its low-level Deep Lake dataset can be accessed via:
SelfQueryRetriever with Deep Lake
Deep Lake supports the SelfQueryRetriever implementation in LangChain, which translates a user prompt into metadata filters.
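To illustrate what "translating a prompt into a metadata filter" means, suppose a hypothetical prompt such as "movies rated above 8.5" has already been turned into the structured comparison rating > 8.5. The sketch below applies such a comparison to plain dictionaries. The `Comparison` class, the comparator names, and the `rating` field are assumptions for this example, not LangChain's or Deep Lake's actual types.

```python
import operator
from dataclasses import dataclass

# Hypothetical comparator names mapped to Python operators.
COMPARATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


@dataclass
class Comparison:
    attribute: str
    comparator: str
    value: object


def apply_filter(docs, comparison):
    """Keep docs whose metadata attribute satisfies the comparison."""
    compare = COMPARATORS[comparison.comparator]
    kept = []
    for doc in docs:
        actual = doc["metadata"].get(comparison.attribute)
        # Docs that lack the attribute never match.
        if actual is not None and compare(actual, comparison.value):
            kept.append(doc)
    return kept
```

In the real retriever, an LLM produces the structured filter from the prompt and the Vector Store executes it; this sketch covers only the second half of that pipeline.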
This section of the tutorial requires installation of additional packages:
pip install "deeplake[enterprise]" lark
First, let's create a Deep Lake Vector Store with relevant data using the documents below.
Since this feature uses Deep Lake's Tensor Query Language under the hood, the Vector Store must be stored in or connected to Deep Lake, which requires registration with Activeloop:
Next, let's instantiate our retriever by providing information about the metadata fields that our documents support and a short description of the document contents.
And now we can try actually using our retriever!
Output:
Now we can run a query to find movies that are above a certain ranking:
Output:
Congrats! You just used the Deep Lake Vector Store in LangChain to create a Q&A App! 🎉