RAG Query Engine Evaluations
The RAG query engine plays a crucial role in retrieving context and generating responses. To ensure its performance and response quality, we conduct the following evaluations:
- Context Relevance: Determines whether the retrieved context is relevant to the query.
- Factual Accuracy: Assesses whether the LLM is hallucinating or providing incorrect information.
- Response Completeness: Checks whether the response contains all the information requested by the query.
How to do it?
Install UpTrain and LlamaIndex
pip install -q html2text llama-index pandas tqdm uptrain cohere
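The steps below build a VectorStoreIndex and query engine with LlamaIndex's defaults, which use OpenAI for both the embeddings and the LLM, so make sure an OpenAI key is also available in your environment (the key value below is a placeholder):
import os

# LlamaIndex's default embedding model and LLM are OpenAI-backed.
# Placeholder value; replace with your own key.
os.environ["OPENAI_API_KEY"] = "sk-..."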
Import required libraries
from llama_index import (
    ServiceContext,
    VectorStoreIndex,
)
from llama_index.node_parser import SentenceSplitter
from llama_index.readers import SimpleWebPageReader
from llama_index.callbacks import CallbackManager, UpTrainCallbackHandler
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.service_context import set_global_service_context
from llama_index.query_engine.sub_question_query_engine import (
    SubQuestionQueryEngine,
)
from llama_index.tools.query_engine import QueryEngineTool
from llama_index.tools.types import ToolMetadata
Setup UpTrain Open-Source Software (OSS)
You can use UpTrain's open-source evaluation service to evaluate your model. In this case, you will need to provide an OpenAI API key, since the evaluations are run with OpenAI models. You can get one from https://platform.openai.com/api-keys.
Parameters:
- key_type="openai"
- api_key="OPENAI_API_KEY"
- project_name_prefix="PROJECT_NAME_PREFIX"
callback_handler = UpTrainCallbackHandler(
    key_type="openai",
    api_key="sk-...",  # Replace with your OpenAI API key
    project_name_prefix="llama",
)
# Register the handler globally so that every index and query engine
# built below reports its events to UpTrain.
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([callback_handler])
)
set_global_service_context(service_context)
Load and Parse Documents
Load documents from Paul Graham’s essay “What I Worked On”.
documents = SimpleWebPageReader().load_data(
    [
        "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt"
    ]
)
Parse the document into nodes.
parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)
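Optionally, you can sanity-check that the essay was downloaded and chunked before building the index (the print statements below are just illustrative):
# Optional sanity check on the loaded document and parsed nodes.
print(f"Loaded {len(documents)} document(s)")
print(f"Parsed {len(nodes)} nodes; first node begins: {nodes[0].text[:80]!r}")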
RAG Query Engine Evaluation
The UpTrain callback handler automatically captures the query, the retrieved context, and the generated response, and then runs the following three evaluations (each graded from 0 to 1):
- Context Relevance: Determines whether the retrieved context is relevant to the query.
- Factual Accuracy: Assesses whether the LLM is hallucinating or providing incorrect information.
- Response Completeness: Checks whether the response contains all the information requested by the query.
index = VectorStoreIndex.from_documents(
    documents,
)
query_engine = index.as_query_engine()
max_characters_per_line = 80
queries = [
    "What did Paul Graham do growing up?",
    "When and how did Paul Graham's mother die?",
    "What, in Paul Graham's opinion, is the most distinctive thing about YC?",
    "When and how did Paul Graham meet Jessica Livingston?",
    "What is Bel, and when and where was it written?",
]
for query in queries:
    response = query_engine.query(query)
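For each query, the callback handler logs the question together with the three scores. The run above produces output like the following: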
Question: What did Paul Graham do growing up?
Context Relevance Score: 0.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.0
Question: When and how did Paul Graham's mother die?
Context Relevance Score: 0.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.0
Question: What, in Paul Graham's opinion, is the most distinctive thing about YC?
Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 1.0
Question: When and how did Paul Graham meet Jessica Livingston?
Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.5
Question: What is Bel, and when and where was it written?
Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.0
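The same three checks can also be run outside LlamaIndex on any question/context/response triple via UpTrain's standalone evaluation API. A minimal sketch, assuming UpTrain's EvalLLM interface and an OpenAI key (the sample data here is made up):
from uptrain import EvalLLM, Evals

# Made-up example row; in practice, use triples captured from your own pipeline.
data = [
    {
        "question": "What is Bel, and when and where was it written?",
        "context": "Bel is a Lisp dialect that Paul Graham worked on ...",
        "response": "Bel is a new Lisp written by Paul Graham.",
    }
]

eval_llm = EvalLLM(openai_api_key="sk-...")  # Replace with your OpenAI API key

results = eval_llm.evaluate(
    data=data,
    checks=[
        Evals.CONTEXT_RELEVANCE,
        Evals.FACTUAL_ACCURACY,
        Evals.RESPONSE_COMPLETENESS,
    ],
)
print(results)  # each check adds a 0-to-1 score field to the row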