Vector Store Quickstart

A jump-start guide to using Deep Lake for Vector Search.

Last updated 1 year ago

How to Get Started with Vector Search in Deep Lake in Under 5 Minutes

Installing Deep Lake

Deep Lake can be installed using pip. By default, Deep Lake does not install dependencies for the compute engine, google-cloud, and other features. Details on all installation options are available here.

!pip3 install deeplake

This quickstart also requires LangChain, tiktoken, and OpenAI:

!pip3 install langchain openai tiktoken

Creating Your First Vector Store

A Vector Store can be created using the Deep Lake integration with LangChain. This abstracts the low-level Deep Lake API from the user, and under the hood, the LangChain integration creates a Deep Lake dataset with text, embedding, id, and metadata tensors.

Let's embed and store Paul Graham's essays in a Deep Lake vector store. First, we download the data:

Next, let's import the required modules and set the OpenAI API key as an environment variable for embeddings:

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.vectorstores import DeepLake
import os

os.environ['OPENAI_API_KEY'] = '<OPENAI_API_KEY>'  # Replace the placeholder with your own key

Next, let's specify paths for the source text data and the underlying Deep Lake dataset. In this example, we store the dataset locally, but Deep Lake vector stores can also be created in memory, in the Deep Lake Tensor Database, or in your cloud. Further details are available here.

source_text = 'paul_graham_essay.txt'
dataset_path = 'pg_essay_deeplake'

Next, let's chunk the essay text, create the Vector Store, and populate it with data and embeddings:

embedding = OpenAIEmbeddings(model='text-embedding-ada-002')

documents = TextLoader(source_text).load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

db = DeepLake.from_documents(docs, dataset_path=dataset_path, embedding=embedding, overwrite=True)
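
To make the chunking step more concrete, here is a minimal sketch of greedy, separator-based splitting. It is a simplified stand-in written for illustration, not LangChain's implementation: the function name split_text, the paragraph separator, and the toy essay are all assumptions, and overlap is left out because the example above uses chunk_overlap=0.

```python
def split_text(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. In this sketch, a single piece
    longer than chunk_size is kept whole rather than cut mid-paragraph.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy input: ten paragraphs of 53 characters each.
essay = "\n\n".join(f"Paragraph {i}: " + "x" * 40 for i in range(10))
chunks = split_text(essay, chunk_size=120)
print(len(chunks))  # 5 chunks, two paragraphs per chunk
```

Each chunk produced this way becomes one sample in the Vector Store, so chunk_size controls how many embeddings are computed and how much text each one summarizes.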

The underlying Deep Lake dataset object is accessible in db.ds, and the data structure can be summarized using db.ds.summary(), which shows 4 tensors with 100 samples:

  tensor     htype      shape      dtype  compression
  -------   -------    -------    -------  ------- 
 embedding  generic  (100, 1536)  float32   None   
    ids      text     (100, 1)      str     None   
 metadata    json     (100, 1)      str     None   
   text      text     (100, 1)      str     None
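
One way to read this summary is as four parallel columns in which row i of every tensor describes the same chunk. The sketch below mimics that layout with plain Python lists and works out the raw size of the embedding tensor from the reported shape. The placeholder values are made up for illustration, and the dictionary is only an analogy for the dataset, not the Deep Lake storage format.

```python
num_samples, embedding_dim = 100, 1536  # Shape reported by db.ds.summary()
bytes_per_float32 = 4

# Placeholder columns standing in for the four tensors (values are made up).
store = {
    "text": [f"chunk {i}" for i in range(num_samples)],
    "embedding": [[0.0] * embedding_dim for _ in range(num_samples)],
    "ids": [str(i) for i in range(num_samples)],
    "metadata": [{"source": "paul_graham_essay.txt"} for _ in range(num_samples)],
}

# Row 7 of every column describes the same chunk.
row = {name: column[7] for name, column in store.items()}
print(row["ids"], row["text"])  # 7 chunk 7

# Raw size of the embedding tensor: samples x dimensions x bytes per value.
embedding_bytes = num_samples * embedding_dim * bytes_per_float32
print(embedding_bytes)  # 614400 bytes, i.e. 600 KiB
```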

Performing Vector Search

Deep Lake offers a variety of vector search options, depending on where the Vector Store is stored and on the infrastructure that should run the computations:

Search Method            Compute Location  Execution Algorithm           Query Syntax          Required Storage
-----------------------  ----------------  ----------------------------  --------------------  ------------------------------------------------------------
Python                   Client-side       Deep Lake OSS Python Code     LangChain API         In memory, local, user cloud, Tensor Database
Compute Engine           Client-side       Deep Lake C++ Compute Engine  LangChain API or TQL  User cloud (must be connected to Deep Lake), Tensor Database
Managed Tensor Database  Managed Database  Deep Lake C++ Compute Engine  LangChain API or TQL  Tensor Database

Vector Search in Python

Let's continue from the Vector Store we created above and run an embeddings search based on a user prompt using the LangChain API.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAIChat

# Re-load the vector store in case it's no longer initialized
db = DeepLake(dataset_path = dataset_path, embedding_function=embedding)

qa = RetrievalQA.from_chain_type(llm=OpenAIChat(model='gpt-3.5-turbo'), chain_type='stuff', retriever=db.as_retriever())

Let's run the prompt and check out the output. Internally, this API performs an embedding search to find the most relevant data to feed into the LLM context.

query = 'What are the first programs he tried writing?'
qa.run(query)

'The first programs he tried writing were on the IBM 1401 that his school district used for "data processing" in 9th grade.'
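
The retrieval step described above can be sketched in plain Python: score every stored chunk against the query embedding, keep the top k, and concatenate them into the prompt (the idea behind the 'stuff' chain type). This is a rough illustration under stated assumptions, not the LangChain or Deep Lake implementation. The helper names, the prompt wording, the toy 3-dimensional vectors, and the chunk texts are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, texts, embeddings, k=2):
    """Return the k texts whose embeddings are most similar to the query."""
    scored = sorted(
        zip(texts, embeddings),
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


def stuff_prompt(question, context_chunks):
    """Concatenate the retrieved chunks into a single prompt for the LLM."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy 3-dimensional vectors stand in for the 1536-dimensional embeddings.
texts = ["chunk about early programming", "chunk about painting", "chunk about startups"]
embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
query_embedding = [1.0, 0.2, 0.0]

top = retrieve(query_embedding, texts, embeddings, k=2)
prompt = stuff_prompt("What are the first programs he tried writing?", top)
print(top)  # ['chunk about early programming', 'chunk about painting']
```

Only the retrieved chunks reach the LLM, which is why the quality of the embedding search largely determines the quality of the answer.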

Vector Search Using the Compute Engine on the Client Side in LangChain

Vector search using the Compute Engine + LangChain API will be available soon.

Vector Search Using the Compute Engine on the Client Side in the Deep Lake API

To run the C++ Compute Engine on the client-side, please install:

pip install "deeplake[enterprise]"

Let's load an existing Vector Store containing embeddings of the Twitter recommendation algorithm. We use the raw Deep Lake API, which returns the same kind of dataset object that the LangChain API exposes as db.ds:

import deeplake
import openai

ds = deeplake.load('hub://activeloop/twitter-algorithm', read_only = True) # Read-only is sufficient permission for queries

Next, let's define the search term and embed it using OpenAI.

SEARCH_TERM = 'What do the trust and safety models do?'

embedding = openai.Embedding.create(input=SEARCH_TERM, model="text-embedding-ada-002")["data"][0]["embedding"]

# Format the embedding as a string, so it can be passed in the query string
embedding_search = ",".join([str(item) for item in embedding])

Finally, let's define the TQL query and run it using the Compute Engine on the client.

tql_query = f"select * from (select *, cosine_similarity(embedding, ARRAY[{embedding_search}]) as score) order by score desc limit 5"

ds_view = ds.query(tql_query)
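
Because the embedding is inlined into the query text, it can help to see what the finished string looks like. The sketch below wraps the same string formatting in a hypothetical helper, build_tql, and runs it on a toy 3-dimensional vector so the result stays readable. With a real 1536-dimensional embedding the ARRAY literal is correspondingly longer.

```python
def build_tql(embedding, limit=5):
    """Build the cosine-similarity TQL query used above (hypothetical helper)."""
    embedding_search = ",".join(str(item) for item in embedding)
    return (
        "select * from (select *, "
        f"cosine_similarity(embedding, ARRAY[{embedding_search}]) as score) "
        f"order by score desc limit {limit}"
    )


toy_query = build_tql([0.1, -0.2, 0.3], limit=5)
print(toy_query)
# select * from (select *, cosine_similarity(embedding, ARRAY[0.1,-0.2,0.3]) as score) order by score desc limit 5
```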

ds_view.summary() shows the result contains the top 5 samples by score:

  tensor      htype      shape     dtype  compression
  -------    -------    -------   -------  ------- 
 embedding  embedding  [5, 1536]  float32   None   
    ids       text      [5, 1]     int8     None   
 metadata     json      [5, 1]     uint8    None   
   text       text      [5, 1]     int8     None

We can lazy-load the data for those samples using:

ds_view.score[0].numpy()
str(ds_view.text[0].numpy())

array(0.97839564, dtype=float32)

'// Delete configuration key-value. If is_directory is set in request,\n  // recursively clean up all key-values under the path specified by `key`.\n  rpc DeleteKeyValue(DeleteKeyValueRequest) returns (DeleteKeyValueResponse);'

Vector Search Using the Managed Tensor Database in LangChain

Vector search using the Tensor Database + LangChain API will be available soon.

Vector Search Using the Managed Tensor Database + REST API

The same search on the Twitter algorithm dataset can be run on the Managed Tensor Database using a REST API. This step requires registration and the creation of an API token.

import requests
import openai
import os

# Tokens should be set in environment variables.
ACTIVELOOP_TOKEN = os.environ['ACTIVELOOP_TOKEN']
DATASET_PATH = 'hub://activeloop/twitter-algorithm'
ENDPOINT_URL = 'https://app.activeloop.ai/api/query/v1'
SEARCH_TERM = 'What do the trust and safety models do?'
# The OpenAI token should also be set in the OPENAI_API_KEY environment variable

# The headers contain the user token
headers = {
    "Authorization": f"Bearer {ACTIVELOOP_TOKEN}",
}

# Embed the search term
embedding = openai.Embedding.create(input=SEARCH_TERM, model="text-embedding-ada-002")["data"][0]["embedding"]

# Format the embedding as a string, so it can be passed in the REST API request.
embedding_search = ",".join([str(item) for item in embedding])

# Create the query using TQL. l2_norm is a distance, so the closest matches have the smallest scores.
query = f"select * from (select text, l2_norm(embedding - ARRAY[{embedding_search}]) as score from \"{DATASET_PATH}\") order by score asc limit 5"

# Submit the request
response = requests.post(ENDPOINT_URL, json={"query": query}, headers=headers)

data = response.json()

print(data)

The same request can be issued from Node.js:

const axios = require('axios');

const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const ACTIVELOOP_TOKEN = process.env.ACTIVELOOP_TOKEN;

const QUERY = 'What do the trust and safety models do?';
const DATASET_PATH = 'hub://activeloop/twitter-algorithm';
const ENDPOINT_URL = 'https://app.activeloop.ai/api/query/v1';

// Function to get the embeddings of a text from Open AI API
async function getEmbedding(text) {
  const response = await axios.post('https://api.openai.com/v1/embeddings', {
    input: text,
    model: "text-embedding-ada-002"
  }, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${OPENAI_API_KEY}`
    }
  });

  return response.data;
}

// Function to search the dataset using the given query on Activeloop
async function searchDataset(query) {
  const response = await axios.post(ENDPOINT_URL, {
    query: query,
  }, {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${ACTIVELOOP_TOKEN}`
    }
  });

  return response.data;
}

// Main function to search for similar texts in the dataset based on the query_term
async function searchSimilarTexts(query, dataset_path) {
  // Get the embedding of the query_term
  const embedding = await getEmbedding(query);
  const embedding_search = embedding.data[0].embedding.join(',');

  // Construct the search query. l2_norm is a distance, so sort ascending to get the closest matches.
  const TQL = `SELECT * FROM (
                    SELECT text, l2_norm(embedding - ARRAY[${embedding_search}]) AS score
                    FROM "${dataset_path}"
                  ) ORDER BY score ASC LIMIT 5`;

  // Search the dataset using the constructed query
  const response = await searchDataset(TQL);

  // Log the search results
  console.log(response);
}

// Run the search and report any request errors
searchSimilarTexts(QUERY, DATASET_PATH).catch(console.error);
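
The client-side example scores with cosine_similarity, while the REST examples score with l2_norm of the difference between vectors. The first is a similarity (larger is closer) and the second is a distance (smaller is closer), so the two rank in opposite directions. The sketch below illustrates this with toy 2-dimensional vectors. The helper functions and the data are made up for illustration and are not part of the Deep Lake API.

```python
import math


def cosine_similarity(a, b):
    """Similarity: larger values mean the vectors point in closer directions."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def l2_distance(a, b):
    """Distance: smaller values mean the vectors are closer together."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
candidates = {"near": [0.9, 0.1], "orthogonal": [0.0, 1.0], "far": [-1.0, 0.0]}

# Best matches first: descending for a similarity, ascending for a distance.
by_cosine = sorted(candidates, key=lambda n: cosine_similarity(query, candidates[n]), reverse=True)
by_l2 = sorted(candidates, key=lambda n: l2_distance(query, candidates[n]))
print(by_cosine)  # ['near', 'orthogonal', 'far']
print(by_l2)      # ['near', 'orthogonal', 'far']

# Sorting a distance in descending order puts the worst match first.
worst_first = sorted(candidates, key=lambda n: l2_distance(query, candidates[n]), reverse=True)
print(worst_first[0])  # far
```

For unit-length vectors the two measures always produce the same ranking, since the squared L2 distance equals 2 minus twice the cosine similarity.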

Visualizing your Vector Store

Deep Lake enables users to visualize and interpret large datasets, including Vector Stores with embeddings. Visualization is available for each dataset stored in or connected to Deep Lake. Here's an example.

Authentication

To use Deep Lake features that require authentication (Activeloop storage, Tensor Database storage, connecting your cloud dataset to the Deep Lake UI, etc.), register in the Deep Lake App and authenticate on the client using the methods described in the User Authentication guide.

Next Steps

Check out our Getting Started Guide for a comprehensive walk-through of Deep Lake. Also check out tutorials on Running Queries, Training Models, and Creating Datasets, as well as Playbooks about powerful use-cases that are enabled by Deep Lake.

Congratulations, you've created a Vector Store and performed vector search using Deep Lake.