Each LLM application has unique needs, so no one-size-fits-all evaluation tool can cover them all.

A sales assistant bot needs to be evaluated differently than a calendar automation bot.

Custom prompts help you grade your model the way you want.

Parameters:

  • prompt: Evaluation prompt used to generate the grade
  • choices: List of choices/grades to choose from
  • choice_scores: Score associated with each choice
  • eval_type: One of ["classify", "cot_classify"]; determines whether chain-of-thought prompting is applied before the grade is chosen (a sketch showing this option follows the example below)
  • prompt_var_to_column_mapping (optional): Mapping between the variables defined in the prompt and the column names in the data

How to use it?

prompt = """
You are an expert medical school professor specializing in grading students' answers to questions.
You are grading the following question:
{question}
Here is the real answer:
{ground_truth}
You are grading the following predicted answer:
{response}
"""

# Create a list of choices
choices = ["Correct", "Correct but Incomplete", "Incorrect"]

# Create scores for the choices
choice_scores = [1.0, 0.5, 0.0]

data = [{
    "user_question": "What causes diabetes?",
    "ground_truth_response": "Diabetes is a metabolic disorder characterized by high blood sugar levels. It is primarily caused by a combination of genetic and environmental factors, including obesity and lack of physical activity.",
    "user_response": "Diabetes is primarily caused by a combination of genetic and environmental factors, including obesity and lack of physical activity."
}]

prompt_var_to_column_mapping = {
    "question": "user_question",
    "ground_truth": "ground_truth_response",
    "response": "user_response"
}
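
To build intuition for what this mapping does, here is a rough, illustrative sketch of how one data row would be substituted into the prompt template. This is only for understanding: UpTrain performs the substitution internally, and the rendering shown here is an assumption about the general idea, not the library's exact logic.

# Illustration only: substitute the mapped columns of one row into the prompt.
row = data[0]
rendered = prompt.format(
    **{prompt_var: row[column] for prompt_var, column in prompt_var_to_column_mapping.items()}
)
print(rendered)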

from uptrain import CustomPromptEval, EvalLLM, Settings
import json

OPENAI_API_KEY = "sk-*****************"  # Insert your OpenAI key here
eval_llm = EvalLLM(settings=Settings(openai_api_key=OPENAI_API_KEY, response_format={"type":"json_object"}))

results = eval_llm.evaluate(
    data=data,
    checks=[CustomPromptEval(
        prompt=prompt,
        choices=choices,
        choice_scores=choice_scores,
        prompt_var_to_column_mapping=prompt_var_to_column_mapping
    )]
)

# Print the graded results (see the sample response below)
print(json.dumps(results, indent=3))
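
The check above uses the default evaluation behaviour. If you want the grading model to reason step by step before it picks a choice, you can pass the eval_type parameter described earlier. The snippet below is a small sketch of that option; the remaining arguments are the same ones defined above.

# Sketch: enabling chain-of-thought grading via eval_type.
results_cot = eval_llm.evaluate(
    data=data,
    checks=[CustomPromptEval(
        prompt=prompt,
        choices=choices,
        choice_scores=choice_scores,
        eval_type="cot_classify",  # "classify" skips the chain-of-thought step
        prompt_var_to_column_mapping=prompt_var_to_column_mapping
    )]
)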
By default, we use GPT-3.5 Turbo. If you want to use a different model, check out this tutorial.
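
As a rough sketch, switching models is typically done through the Settings object. The model identifier below is only an example; the supported names are covered in the tutorial linked above.

# Sketch: pointing the evaluator at a different model via Settings.
# The model name below is illustrative; consult the tutorial for supported values.
eval_llm_gpt4 = EvalLLM(settings=Settings(model="gpt-4", openai_api_key=OPENAI_API_KEY))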

Sample Response:

[
   {
      "Choice": "CORRECT BUT INCOMPLETE",
      "Explanation": "The predicted answer correctly identifies the primary causes of diabetes as genetic and environmental factors, including obesity and lack of physical activity. However, it does not mention that diabetes is a metabolic disorder characterized by high blood sugar levels, which is an important aspect of the real answer.",
      "score_custom_prompt": 0.5
   }
]

Here, we have evaluated the data according to the prompt defined above.

Although the response seems correct, it does not answer the question completely, given the information in the ground-truth answer.
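
If you want to consume the grades programmatically rather than read the printed JSON, each entry in results carries the fields shown above, so you can pull out the scores directly (a small sketch, assuming the field names from the sample response):

# Sketch: reading the chosen grade and score from each evaluated row.
for row in results:
    print(row["Choice"], row["score_custom_prompt"])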
