ChromaDB is an open-source embedding database. While building RAG-based applications, you can use Chroma to retrieve relevant information from your context documents.

How will this help?

Vector databases store data as high-dimensional vectors, enabling fast and efficient similarity search over those vector representations. You can use UpTrain alongside a vector database such as ChromaDB to run evaluations, for example checking the relevance of the retrieved context to ensure good retrieval quality.
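As a toy illustration of what similarity search means (the 3-dimensional vectors below are made up purely for demonstration; real embedding models produce vectors with hundreds of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means the vectors point in the same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([0.9, 0.1, 0.3])
docs = {
    "doc_a": np.array([0.8, 0.2, 0.4]),
    "doc_b": np.array([0.1, 0.9, 0.2]),
}

# Retrieve the document whose vector is most similar to the query vector
best_doc = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best_doc)  # doc_a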

How to integrate?

First, let’s import the necessary packages:

from uptrain import EvalLLM, Evals
import chromadb
import json

Let’s define a dataset

In this example we use the SciQ dataset from HuggingFace.

from datasets import load_dataset

# Get the SciQ dataset from HuggingFace
dataset = load_dataset("sciq", split="train")     

# Filter the dataset to only include questions with a support
dataset = dataset.filter(lambda x: x["support"] != "")    
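If you want to sanity-check the filtered dataset before indexing it, you can print one example (the "question" and "support" fields are the ones used throughout this walkthrough):

# Inspect the first question and its supporting document
print(dataset[0]["question"])
print(dataset[0]["support"])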

Embedding the data in ChromaDB:

client = chromadb.Client()

# Get or create a Chroma collection to store the supporting documents.
collection = client.get_or_create_collection("sciq_supports")

# Embed and store the first 100 supports for this demo
collection.add(
    ids=[str(i) for i in range(100)],
    documents=dataset["support"][:100],
    metadatas=[{"type": "support"} for _ in range(100)],
)
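To verify that the documents were stored, you can check the collection size (collection.count() is part of ChromaDB's collection API):

# Should print 100, the number of supports added above
print(collection.count())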

Using ChromaDB to find supporting evidence (context) for the questions in the dataset:

results = collection.query(
    query_texts=dataset["question"][:10],
    n_results=1,
)

# Pair each question with its top retrieved support document
data = []
for i, q in enumerate(dataset["question"][:10]):
    retrieved_data = {"question": q, "context": results["documents"][i][0]}
    data.append(retrieved_data)
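Before running the evaluation, it can help to eyeball one retrieved pair to confirm the questions and contexts line up:

# Print the first question-context pair
print(json.dumps(data[0], indent=3))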

Evaluate the retrieval quality using UpTrain

# Insert your OpenAI key here
OPENAI_API_KEY = "sk-*****************"  

eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)

res = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE],
)

print(json.dumps(res, indent=3))
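Since each result carries a score_context_relevance field (see the sample output below), you can also compute an aggregate number for the whole batch, e.g.:

# Average context relevance across all evaluated questions
avg_score = sum(r["score_context_relevance"] for r in res) / len(res)
print(f"Average context relevance: {avg_score:.2f}")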

Let’s look at the retrieval quality of the context documents:

[
   {
      "question": "What is the least dangerous radioactive decay?",
      "context": "All radioactive decay is dangerous to living things, but alpha decay is the least dangerous.",
      "score_context_relevance": 1.0,
      "explanation_context_relevance": "The extracted context directly addresses the question by stating that alpha decay is the least dangerous form of radioactive decay. Therefore, the selected choice is (A)"
   },
   {
      "question": "What type of organism is commonly used in preparation of foods such as cheese and yogurt?",
      "context": "Agents of Decomposition The fungus-like protist saprobes are specialized to absorb nutrients from nonliving organic matter, such as dead organisms or their wastes. For instance, many types of oomycetes grow on dead animals or algae. Saprobic protists have the essential function of returning inorganic nutrients to the soil and water. This process allows for new plant growth, which in turn generates sustenance for other organisms along the food chain. Indeed, without saprobe species, such as protists, fungi, and bacteria, life would cease to exist as all organic carbon became \u201ctied up\u201d in dead organisms.",
      "score_context_relevance": 0.5,
      "explanation_context_relevance": "The extracted context discusses saprobe species, including protists, fungi, and bacteria, which are specialized to absorb nutrients from nonliving organic matter. While the context does not directly mention the use of these organisms in food preparation, it provides relevant information about the types of organisms involved in nutrient absorption from nonliving organic matter. Therefore, the extracted context can give some relevant answer for the given question but can't answer it completely. Hence, the selected choice is (B)"
   }
]

According to these evaluations:

  • Example 1: The context clearly states that alpha decay is the least dangerous radioactive decay, so it is sufficient to answer the question. Hence, the context is highly relevant to the question asked.
  • Example 2: The context discusses organisms that absorb nutrients from nonliving organic matter, but does not specifically talk about their use in making yogurt and cheese. Thus, even though the context is related to the question, it’s not sufficient to answer it.
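In practice, you can use these scores to flag questions whose retrieved context is too weak to answer them, for example to revisit your chunking or indexing strategy (the 0.5 threshold below is an arbitrary choice for illustration):

# Collect questions whose retrieved context scored poorly
low_relevance = [r["question"] for r in res
                 if r["score_context_relevance"] <= 0.5]
print(low_relevance)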