Each LLM application has its own unique needs, so no one-size-fits-all evaluation tool can cover them all.

A sales assistant bot needs to be evaluated differently from a calendar automation bot.

Custom prompts let you grade your model exactly the way you want.

Parameters:

  • prompt: Evaluation prompt used to generate the grade
  • choices: List of choices/grades to choose from
  • choice_scores: Scores associated with each choice
  • eval_type: One of ["classify", "cot_classify"]; determines whether chain-of-thought prompting is applied (see the sketch after this list)
  • prompt_var_to_column_mapping (optional): Mapping between variables defined in the prompt and column names in the data
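
For example, a check using all of these parameters might look like the sketch below (the prompt text, choices, and column name are illustrative placeholders; a full walkthrough follows in the next section):

from uptrain import CustomPromptEval

# Illustrative check: grade answers on a 3-point scale with chain-of-thought prompting
check = CustomPromptEval(
    prompt = "Grade the following answer:\n{response}",
    choices = ["Good", "Average", "Bad"],
    choice_scores = [1.0, 0.5, 0.0],
    eval_type = "cot_classify",  # "classify" skips the chain-of-thought step
    prompt_var_to_column_mapping = {"response": "model_response"}
)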

How to use it?

prompt = """
You are an expert medical school professor specializing in grading students' answers to questions.
You are grading the following question:
{question}
Here is the real answer:
{ground_truth}
You are grading the following predicted answer:
{response}
"""

# Create a list of choices
choices = ["Correct", "Correct but Incomplete", "Incorrect"]

# Create scores for the choices
choice_scores = [1.0, 0.5, 0.0]

data = [{
      "user_question": "What causes diabetes?",
      "ground_truth_response": "Diabetes is a metabolic disorder characterized by high blood sugar levels. It is primarily caused by a combination of genetic and environmental factors, including obesity and lack of physical activity.",
      "user_response": "Diabetes is primarily caused by a combination of genetic and environmental factors, including obesity and lack of physical activity."
}]

prompt_var_to_column_mapping = {
    "question": "user_question",
    "ground_truth": "ground_truth_response",
    "response": "user_response"
}

from uptrain import CustomPromptEval, EvalLLM, Settings
import json

OPENAI_API_KEY = "sk-*****************"  # Insert your OpenAI key here
eval_llm = EvalLLM(settings=Settings(openai_api_key=OPENAI_API_KEY, response_format={"type":"json_object"}))

results = eval_llm.evaluate(
    data = data,
    checks = [CustomPromptEval(
        prompt = prompt,
        choices = choices,
        choice_scores = choice_scores,
        prompt_var_to_column_mapping = prompt_var_to_column_mapping
    )]
)

print(json.dumps(results, indent=3))

By default, we use GPT-3.5 Turbo. If you want to use some other model, check out this tutorial.
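
For instance, assuming your UpTrain version's Settings object accepts a model argument (the parameter name and model value below are assumptions; refer to the tutorial for the exact syntax), switching models could look like this:

# Hypothetical: pass a different model name via Settings (verify against your UpTrain version)
settings = Settings(
    model = "gpt-4",  # assumed parameter and model name, adjust as per the tutorial
    openai_api_key = OPENAI_API_KEY,
    response_format = {"type": "json_object"}
)
eval_llm = EvalLLM(settings=settings)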

Sample Response:

[
   {
      "Choice": "CORRECT BUT INCOMPLETE",
      "Explanation": "The predicted answer correctly identifies the primary causes of diabetes as genetic and environmental factors, including obesity and lack of physical activity. However, it does not mention that diabetes is a metabolic disorder characterized by high blood sugar levels, which is an important aspect of the real answer.",
      "score_custom_prompt": 0.5
   }
]

Here, we have evaluated the data according to the above-mentioned prompt.

Although the response seems correct, it does not answer the question completely when compared against the information provided in the ground-truth answer.
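
Since each row in the results carries a numeric score_custom_prompt (as shown in the sample response above), a minimal sketch for aggregating it across your dataset could look like this:

# Average the custom-prompt scores across all evaluated rows
scores = [row["score_custom_prompt"] for row in results]
average_score = sum(scores) / len(scores)
print(f"Average custom prompt score: {average_score:.2f}")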