Links

Step 2: Creating Deep Lake Vector Stores

Creating the Deep Lake Vector Store

How to Create a Deep Lake Vector Store

Let's create a Vector Store in LangChain for storing and searching information about the Twitter OSS recommendation algorithm.

Downloading and Preprocessing the Data

First, let's import necessary packages and make sure the Activeloop and OpenAI keys are in the environmental variables ACTIVELOOP_TOKEN, OPENAI_API_KEY.
from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore
import openai
import os
Next, let's clone the Twitter OSS recommendation algorithm and define paths for for source data and the Vector Store.
!git clone https://github.com/twitter/the-algorithm
vector_store_path = '/vector_store_getting_started'
repo_path = '/the-algorithm'
Next, let's load all the files from the repo into list of data that will be added to the Vector Store (chunked_text and metadata). We use simple text chunking based on a constant number of characters.
CHUNK_SIZE = 1000
chunked_text = []
metadata = []
for dirpath, dirnames, filenames in os.walk(repo_path):
for file in filenames:
try:
full_path = os.path.join(dirpath,file)
with open(full_path, 'r') as f:
text = f.read()
new_chunkned_text = [text[i:i+1000] for i in range(0,len(text), CHUNK_SIZE)]
chunked_text += new_chunkned_text
metadata += [{'filepath': full_path} for i in range(len(new_chunkned_text))]
except Exception as e:
print(e)
pass
Next, let's define an embedding function using OpenAI. It must work for a single string and a list of strings, so that it can both be used to embed a prompt and a batch of texts.
def embedding_function(texts, model="text-embedding-ada-002"):
if isinstance(texts, str):
texts = [texts]
texts = [t.replace("\n", " ") for t in texts]
return [data['embedding']for data in openai.Embedding.create(input = texts, model=model)['data']]
Finally, let's create the Deep Lake Vector Store and populate it with data. We use a default tensor configuration, which creates tensors with text (str), metadata(json), id (str, auto-populated), embedding (float32). Learn more about tensor customizability here.
vector_store = VectorStore(
path = vector_store_path,
)
vector_store.add(text = chunked_text,
embedding_function = embedding_function,
embedding_data = chunked_text,
metadata = metadata
)
The Vector Store's data structure can be summarized using vector_store.summary(), which shows 4 tensors with 21055 samples:
tensor htype shape dtype compression
------- ------- ------- ------- -------
embedding embedding (21055, 1536) float32 None
id text (21055, 1) str None
metadata json (21055, 1) str None
text text (21055, 1) str None
To create a vector store using pre-compute embeddings, instead of embedding_data and embedding_function, you may run:
# vector_store.add(text = chunked_text,
# embedding = <list_of_embeddings>,
# metadata = [{"source": source_text}]*len(chunked_text))