First, let's import the necessary packages and make sure the Activeloop and OpenAI keys are set in the environment variables ACTIVELOOP_TOKEN and OPENAI_API_KEY.
from deeplake.core.vectorstore import VectorStore
import openai
import os
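If the credentials are not already exported in your shell, they can also be set from Python. A minimal sketch, using placeholder values:

# Set credentials if they are not already present in the environment (placeholder values).
os.environ['ACTIVELOOP_TOKEN'] = '<your_activeloop_token>'
os.environ['OPENAI_API_KEY'] = '<your_openai_api_key>'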
Next, let's clone the Twitter OSS recommendation algorithm and define paths for the source data and the Vector Store.
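A sketch of this step, assuming a notebook environment for the shell command; the Vector Store path below is a placeholder (a local path also works):

!git clone https://github.com/twitter/the-algorithm  # notebook shell syntax

vector_store_path = 'hub://<org_id>/twitter_algorithm'  # placeholder path
repo_path = 'the-algorithm'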
Next, let's load all the files from the repo into a list of data that will be added to the Vector Store (chunked_text and metadata). We use simple text chunking based on a constant number of characters.
CHUNK_SIZE = 1000

chunked_text = []
metadata = []
for dirpath, dirnames, filenames in os.walk(repo_path):
    for file in filenames:
        try:
            full_path = os.path.join(dirpath, file)
            with open(full_path, 'r') as f:
                text = f.read()
            new_chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
            chunked_text += new_chunked_text
            metadata += [{'filepath': full_path} for i in range(len(new_chunked_text))]
        except Exception as e:
            print(e)
            pass
Next, let's define an embedding function using OpenAI. It must work for both a single string and a list of strings, so that it can be used to embed a prompt as well as a batch of texts.
def embedding_function(texts, model="text-embedding-ada-002"):
    if isinstance(texts, str):
        texts = [texts]

    texts = [t.replace("\n", " ") for t in texts]

    return [data.embedding for data in openai.embeddings.create(input=texts, model=model).data]
Finally, let's create the Deep Lake Vector Store and populate it with data. We use the default tensor configuration, which creates tensors for text (str), metadata (json), id (str, auto-populated), and embedding (float32). Learn more about tensor customizability here.
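A sketch of creating and populating the Vector Store, assuming the VectorStore constructor and add method accept the keyword arguments shown (check the API reference for your installed Deep Lake version):

vector_store = VectorStore(
    path=vector_store_path,
)

vector_store.add(
    text=chunked_text,
    embedding_function=embedding_function,
    embedding_data=chunked_text,
    metadata=metadata,
)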
Deep Lake offers highly flexible vector search and hybrid search options. First, let's show a simple example of vector search using the default options, which performs a simple cosine similarity search in Python on the client (your machine).
prompt ="What do trust and safety models do?"search_results = vector_store.search(embedding_data=prompt, embedding_function=embedding_function)
search_results is a dictionary with keys for the text, score, id, and metadata, with data ordered by score. By default, the search returns the top 4 results, which can be verified using:
len(search_results['text'])  # Returns 4
If we examine the first returned text, it appears to contain the text about trust and safety models that is relevant to the prompt.
search_results['text'][0]
Returns:
Trust and Safety Models
=======================
We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.
We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.
We can also retrieve the corresponding filename from the metadata, which shows the top result came from the README.
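A minimal sketch of that lookup (the exact filepath depends on where the repo was cloned):

search_results['metadata'][0]
# Returns a dict like {'filepath': '<repo_path>/.../README.md'}

Deep Lake's search is not limited to cosine similarity. As a sketch, assuming the search method accepts a distance_metric parameter and that 'l2' selects Euclidean distance, the same query can be re-run as:

search_results = vector_store.search(
    embedding_data=prompt,
    embedding_function=embedding_function,
    distance_metric='l2',  # assumption: 'l2' selects Euclidean distance
)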
The first search result with the L2 distance metric returns the same text as the previous cosine similarity search:
search_results['text'][0]
Returns:
Trust and Safety Models
=======================
We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.
We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.
Full Customization of Vector Search
Deep Lake's Compute Engine can be used to rapidly execute a wide variety of search logic. It is available with !pip install "deeplake[enterprise]" (make sure to restart your kernel after installation), and it is only available for data stored in or connected to Deep Lake.
Let's load a representative Vector Store that is already stored in the Deep Lake Tensor Database. If data is not being written, it is advisable to use read_only = True.
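A sketch of loading it, using a placeholder dataset path (substitute the path of the Vector Store you want to query):

vector_store = VectorStore(
    path='hub://<org_id>/<vector_store_name>',  # placeholder path to a Vector Store in the Tensor Database
    read_only=True,
)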
prompt ="What do trust and safety models do?"embedding =embedding_function(prompt)[0]# Format the embedding array or list as a string, so it can be passed in the REST API request.embedding_string =",".join([str(item) for item in embedding])tql_query = f"select * from (select text, cosine_similarity(embedding, ARRAY[{embedding_string}]) as score) order by score desc limit 5"
Let's run the query, noting that the query execution happens in the Managed Tensor Database, and not on the client.
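A sketch of executing it, assuming the search method accepts a raw TQL string via the query parameter:

search_results = vector_store.search(query=tql_query)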
If we examine the first returned text, it appears to contain the same text about trust and safety models that is relevant to the prompt.
search_results['text'][0]
Returns:
Trust and Safety Models
=======================
We decided to open source the training code of the following models:
- pNSFWMedia: Model to detect tweets with NSFW images. This includes adult and porn content.
- pNSFWText: Model to detect tweets with NSFW text, adult/sexual topics.
- pToxicity: Model to detect toxic tweets. Toxicity includes marginal content like insults and certain types of harassment. Toxic content does not violate Twitter's terms of service.
- pAbuse: Model to detect abusive content. This includes violations of Twitter's terms of service, including hate speech, targeted harassment and abusive behavior.
We have several more models and rules that we are not going to open source at this time because of the adversarial nature of this area. The team is considering open sourcing more models going forward and will keep the community posted accordingly.
We can also retrieve the corresponding filename from the metadata, which shows the top result came from the README.
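A sketch of that lookup, assuming the metadata tensor is also included in the query results (as written, the TQL query above selects only text and score):

search_results['metadata'][0]
# Returns a dict like {'filepath': '<repo_path>/.../README.md'}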