- `question`: The question asked by the user
- `context`: Information retrieved to answer the question
- `response`: The response given by the model
How to use it?
By default, we use GPT-3.5 Turbo for evaluations. If you want to use a different model, check out this tutorial.
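As a minimal sketch of the input this metric expects (the surrounding API is covered in the tutorial; everything here beyond the three field names is illustrative), each evaluation sample carries the question, the retrieved context, and the model's response:

```python
# Illustrative evaluation sample; only the three field names
# (question, context, response) come from this page.
sample = {
    "question": "Who wrote Pride and Prejudice?",
    "context": "Pride and Prejudice is an 1813 novel by Jane Austen.",
    "response": "Pride and Prejudice was written by Jane Austen.",
}

# Every sample must provide all three fields.
assert set(sample) == {"question", "context", "response"}
```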
A higher factual accuracy score indicates that the generated response is more factually correct.
How does it work?
We evaluate factual accuracy in the following steps:

1. Split Response into Individual Facts

Responses are rarely a single claim; most combine several arguments. To judge whether a response is factually correct, we first divide it into individual arguments, each claiming one fact.
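In practice an LLM performs this split; as a rough, hypothetical proxy for it, a response can be broken into sentence-level claims:

```python
import re

def split_into_facts(response: str) -> list[str]:
    # Naive stand-in for the LLM-driven split: treat each
    # sentence as one claimed fact.
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    return [p for p in parts if p]

response = "Paris is the capital of France. It lies on the Seine."
facts = split_into_facts(response)
# Two sentence-level claims are extracted from this response.
```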
2. Rate Individual Facts

We then evaluate whether each individual fact is correct (on the basis of the supporting context) and place it in one of the following categories:
- Completely Right (Score 1)
- Completely Wrong (Score 0)
- Ambiguous (Score 0.5)
3. Generate the Final Score

We take the mean of the scores of these individual facts as the final measure of whether the response is factually correct.
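The rating and averaging steps above can be sketched as follows; the per-fact verdicts here are hardcoded stand-ins for the LLM's judgments:

```python
# Map each category to its score, as listed above.
SCORES = {"correct": 1.0, "wrong": 0.0, "ambiguous": 0.5}

def factual_accuracy(verdicts: list[str]) -> float:
    # The final score is the mean over the individual fact scores.
    return sum(SCORES[v] for v in verdicts) / len(verdicts)

# e.g. two verified facts and one ambiguous one:
score = factual_accuracy(["correct", "correct", "ambiguous"])
# (1.0 + 1.0 + 0.5) / 3 ≈ 0.833
```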