Deep Lake Vector Store in LlamaIndex

How to Use Deep Lake as a Vector Store in LlamaIndex

Deep Lake can be used as a VectorStore in LlamaIndex for building apps that require filtering and vector search. In this tutorial we will show how to create a Deep Lake Vector Store in LlamaIndex and use it to build a Q&A app about the Twitter OSS recommendation algorithm. This tutorial requires installation of:

!pip3 install llama-index deeplake

Downloading and Preprocessing the Data

First, let's import the necessary packages and make sure the Activeloop and OpenAI keys are set in the environment variables ACTIVELOOP_TOKEN and OPENAI_API_KEY.

import os
import textwrap

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores import DeepLakeVectorStore
from llama_index.storage.storage_context import StorageContext
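
If the keys are not already set in your shell, you can set them inside the notebook. The snippet below is a minimal sketch; getpass is used only to avoid pasting keys in plain text.

import getpass

# Set the credentials if they are not already present in the environment
if 'ACTIVELOOP_TOKEN' not in os.environ:
    os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop token: ')
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API key: ')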

Next, let's clone the Twitter OSS recommendation algorithm:

!git clone https://github.com/twitter/the-algorithm

Next, let's specify a local path to the files and add a reader for processing and chunking them.

repo_path = './the-algorithm'
documents = SimpleDirectoryReader(repo_path, recursive=True).load_data()
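
To make sure the repository was read correctly, we can optionally check how many documents were loaded and preview one of them (a small sanity-check sketch; it reuses the textwrap import from above):

# Quick sanity check on the loaded documents (optional)
print(f"Loaded {len(documents)} documents")

# Show a short, whitespace-collapsed preview of the first document's text
print(textwrap.shorten(documents[0].text, width=120))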

Creating the Deep Lake Vector Store

First, we create an empty Deep Lake Vector Store using a specified path:

dataset_path = 'hub://<org-id>/twitter_algorithm'
vector_store = DeepLakeVectorStore(dataset_path=dataset_path)

print(vector_store.vectorstore.summary())

The Deep Lake Vector Store has 4 tensors: text, embedding, id, and metadata, where the metadata stores the filename of the source text.

  tensor      htype     shape    dtype  compression
  -------    -------   -------  -------  ------- 
   text       text      (0,)      str     None   
 metadata     json      (0,)      str     None   
 embedding  embedding   (0,)    float32   None   
    id        text      (0,)      str     None  
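
The hub:// path above stores the dataset in Deep Lake cloud storage under your organization. If you prefer to keep everything on disk, a local path should also work; the directory name below is only an example:

# Alternative: store the dataset in a local directory instead of Deep Lake cloud storage
local_vector_store = DeepLakeVectorStore(dataset_path='./twitter_algorithm_local')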

Next, we create a LlamaIndex StorageContext and VectorStoreIndex, and use the from_documents() method to populate the Vector Store with data. This step takes several minutes because the text must be embedded.

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context,
)

print(vector_store.vectorstore.summary())

We observe that the Vector Store has 8262 rows of data:

  tensor      htype       shape       dtype  compression
  -------    -------     -------     -------  ------- 
   text       text      (8262, 1)      str     None   
 metadata     json      (8262, 1)      str     None   
 embedding  embedding  (8262, 1536)  float32   None   
    id        text      (8262, 1)      str     None 

Use the Vector Store in a Q&A App

We can now use the Vector Store in a Q&A app, where the embeddings are used to retrieve the most relevant documents (text chunks), which are fed into an LLM in order to answer a question.

If we were on another machine, we would load the existing Vector Store without re-ingesting the data:

vector_store = DeepLakeVectorStore(dataset_path=dataset_path, read_only=True)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

Next, let's create the LlamaIndex query engine and run a query:

query_engine = index.as_query_engine()
response = query_engine.query("What programming language is most of the SimClusters written in?")
print(str(response))

Most of the SimClusters project is written in Scala.
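
Retrieval can also be tuned when building the query engine. For example, similarity_top_k controls how many of the most similar chunks are retrieved from the Vector Store and passed to the LLM; the value and question below are illustrative:

# Retrieve more context chunks per query before the LLM synthesizes an answer
query_engine = index.as_query_engine(similarity_top_k=10)
response = query_engine.query("Which services consume the SimClusters embeddings?")
print(str(response))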

Congrats! You just used the Deep Lake Vector Store in LlamaIndex to create a Q&A App! 🎉
