Response Matching compares the LLM-generated response against the gold (ideal) response and scores how closely they match, using the selected matching method.

Columns required:

  • question: The question asked by the user
  • response: The response given by the model
  • ground_truth: The ideal response

Optional Parameters:

  • method: The method used to check for response matching
    • llm (default): Uses an LLM to check whether the response matches the ground truth
    • exact: Checks whether the response is exactly the same as the ground_truth
    • rouge: Uses the ROUGE score to check how closely the response matches the ground truth

How to use it?

from uptrain import EvalLLM, ResponseMatching

OPENAI_API_KEY = "sk-********************"  # Insert your OpenAI key here

data = [{
    "question": "Who were the two finalists of the 2023 ICC Cricket World Cup?",
    "ground_truth": "The finalists of the 2023 ICC Cricket World Cup were India and Australia.",
    "response": "Australia was a finalist in the 2023 ICC Cricket World Cup."
}]

eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)

res = eval_llm.evaluate(
    data = data,
    checks = [ResponseMatching(method = 'llm')]    # method: llm/exact/rouge
)

By default, we are using GPT-3.5 Turbo for evaluations. If you want to use a different model, check out this tutorial.
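
The returned res is a list of dictionaries, one per input row; you can print it to inspect the scores, for example:

import json

# Inspect the scores returned for each row
print(json.dumps(res, indent=2))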

Sample Response:

[
   {
      "response_match_precision": 1.0,
      "response_match_recall": 0.5,
      "score_response_match": 0.57,
      "response_match_method": "llm"
   }
]
A higher response matching score indicates a closer match between the generated response and the ground truth.

In this example, the response correctly states that Australia was a finalist (precision of 1.0) but omits India, so it covers only half of the ground truth about the two finalists of the 2023 ICC Cricket World Cup (recall of 0.5).

Hence, it receives a low response matching score.
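
These scores can also be used to gate responses in a downstream pipeline; a minimal sketch, assuming a hypothetical cutoff of 0.8:

THRESHOLD = 0.8  # hypothetical cutoff; tune for your use case

for row in res:
    # Flag any response whose match score falls below the cutoff
    if row["score_response_match"] < THRESHOLD:
        print(f"Low match score ({row['score_response_match']}): review this response")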