OpenAI Evals
Overview: In this example, we will see how to use UpTrain to run openai evals. You can run any of the standard evals defined in the evals registry or create your custom one with custom prompts. We will use a Q&A task as an example to highlight both of them.
Problem: The workflow of our hypothetical Q&A application goes like this,
- User enters a question.
- The query converts to an embedding, and relevant sections from the documentation are retrieved using nearest neighbour search.
- The original query and the retrieved sections are passed to a language model (LM), along with a custom prompt to generate a response.
Our goal is the evaluate the quality of the answer generated by the model using openai evals.
Solution: We illustrate how to use the Uptrain Evals framework to assess the chatbot’s performance. We will use a dataset built from logs generated by a chatbot made to answer questions from the Streamlit user documentation. For model grading, we will use GPT-3.5-turbo with grading type as ‘coqa-closedqa-correct’ as well as define our custom grading prompt for the same.
Install UpTrain with all dependencies
pip install uptrain
uptrain-add --feature full
import os
import tempfile
import polars as pl
TEMP_DIR = tempfile.gettempdir()
UPTRAIN_LOGS_DIR = os.path.join(TEMP_DIR, "uptrain_logs_openai_eval")
This notebook uses the OpenAI API to generate text for prompts, make sure the env variable is populated with the API key.
# os.environ["OPENAI_API_KEY"] = "..."
Let’s now load our dataset and see how that looks
url = "https://oodles-dev-training-data.s3.us-west-1.amazonaws.com/qna-streamlit-docs.jsonl"
dataset_path = os.path.join(TEMP_DIR, "qna-notebook-data.jsonl")
if not os.path.exists(dataset_path):
import httpx
r = httpx.get(url)
with open(dataset_path, "wb") as f:
f.write(r.content)
dataset = pl.read_ndjson(dataset_path).select(
pl.col(["question", "document_text", "answer"])
)
print("Number of test cases: ", len(dataset))
print("Couple of samples: ", dataset[0:2])
So, we have a bunch of questions, retrieved documents and final answers (i.e. LLM responses) for them. Let’s evaluate the correctness of these answers using OpenAI evals.
Using UpTrain Framework to run evals
UpTrain provides integrations with openai evals to run any check defined in the registry.
We wrap these evals in an Operator class for ease of use.
-
OpenAIGradeScore
: Calls openai evals with the given eval_name. Provide a column corresponding to the input and completion. Documentation -
ModelGradeScore
: Define your model grading eval. Define your custom prompt, the weightage given to each option and the mapping to link dataset columns to the variables required in the prompt. Documentation
from uptrain.framework import Check
from uptrain.operators import PlotlyChart, ModelGradeScore, OpenAIGradeScore
check = Check(
name="Eval scores",
operators=[
## Add your checks here
# First is the openai grade eval - uses GPT-3.5-turbo
OpenAIGradeScore(
col_in_input="document_text",
col_in_completion="answer",
eval_name="coqa-closedqa-correct",
col_out="openai_eval_score",
),
# Second is our custom grading eval - uses GPT-3.5-turbo
ModelGradeScore(
grading_prompt_template="""
You are an evidence-driven LLM that places high importance on supporting facts and references. You diligently verify claims and check for evidence within the document to ensure answers rely on reliable information and align with the documented evidence.",
You are comparing an answer pulled from the document for a given question
Here is the data:
[BEGIN DATA]
************
[Document]: {document}
************
[Question]: {user_query}
************
[Submitted answer]: {generated_answer}
************
[END DATA]
Compare the factual content of the submitted answer with the document as well as evaluate how well it answers the given question. Ignore any differences in style, grammar, or punctuation.
Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the document and answers the question correctly.
(B) The submitted answer is a subset of the document but is not an appropriate answer for the question.
(C) The submitted quote is a superset of the document but is consistent with the document as well as answers the question correctly.
(D) The submitted quote is a superset of the document but is consistent with the document although it does not answer the question correctly.
(E) The submitted quote is a superset of the document and is not consistent with the document.
""",
eval_type="cot_classify",
choice_strings=["A", "B", "C", "D", "E"],
choice_scores={"A": 1.0, "B": 0.2, "C": 0.5, "D": 0.1, "E": 0.0},
context_vars={
"document": "document_text",
"user_query": "question",
"generated_answer": "answer",
},
col_out="custom_eval_score",
),
],
plots=[
PlotlyChart(kind="table", title="Eval scores"),
],
)
Now that we have defined our evals, we will wrap them in a CheckSet
object. CheckSet
takes the source (i.e. test dataset file), the above-defined Check
and the directory where we wish to save the results.
from uptrain.framework import CheckSet, Settings
from uptrain.operators import JsonReader
checkset = CheckSet(
source=JsonReader(fpath=dataset_path),
checks=[check],
)
settings = Settings(logs_folder=UPTRAIN_LOGS_DIR)
Let’s run our evals and visualize the outputs with the help of streamlit dashboards
checkset.setup(settings).run()
from uptrain.dashboard import StreamlitRunner
runner = StreamlitRunner(UPTRAIN_LOGS_DIR)
runner.start()