Pass your data to the evaluate function of the EvalLLM class and it will automatically perform the evaluation.
These evals require a combination of the following columns to be present in your data:
- question: The question you want to ask
- context: The context relevant to the question
- response: The response to the question
An eval that lists a Parameters section is a parametric eval, i.e. it accepts additional configuration.
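Below is a minimal sketch of how the data might be assembled and checked for the required columns before evaluation. The EvalLLM call at the bottom follows common UpTrain-style usage but the exact names and signatures are assumptions; adjust them to your installed version.

```python
# Minimal sketch: assemble evaluation data with the required columns.
REQUIRED_COLUMNS = {"question", "context", "response"}

data = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
    }
]

def missing_columns(rows):
    """Return the set of required columns absent from any row."""
    missing = set()
    for row in rows:
        missing |= REQUIRED_COLUMNS - row.keys()
    return missing

assert not missing_columns(data)

# Hypothetical evaluation call (requires the uptrain package and an API key;
# the check name below is illustrative):
# from uptrain import EvalLLM, Evals
# eval_llm = EvalLLM(openai_api_key="sk-...")
# results = eval_llm.evaluate(data=data, checks=[Evals.RESPONSE_RELEVANCE])
```

Which columns must be present depends on the checks you select; the three above cover the common single-turn evals.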
You can choose evals as per your needs. We have divided them into a few categories for your convenience:
Ground Truth Comparison Evals
| Eval | Description |
|---|---|
| Response Matching | Grades how well the generated response matches the provided ground-truth response. |
Response Quality Evals
| Eval | Description |
|---|---|
| Response Completeness | Grades whether the response has answered all the aspects of the question specified. |
| Response Conciseness | Grades how concise the generated response is, i.e. whether it includes any additional information irrelevant to the question asked. |
| Response Relevance | Grades how relevant the generated response is to the question specified. |
| Response Validity | Grades whether the generated response is valid. A response is considered valid if it contains any information. |
| Response Consistency | Grades how consistent the response is with the question asked as well as with the context provided. |
Context Awareness Evals
| Eval | Description |
|---|---|
| Context Relevance | Grades how relevant the context was to the question specified. |
| Context Utilization | Grades how complete the generated response was for the question specified given the information provided in the context. |
| Factual Accuracy | Grades whether the response generated is factually correct and grounded by the provided context. |
| Context Conciseness | Evaluates whether the condensed context cited from the original context is free of irrelevant information. |
| Context Reranking | Evaluates how efficient the reranked context is compared to the original context. |
Security Evals
| Eval | Description |
|---|---|
| Prompt Injection | Grades whether the generated response is leaking any system prompt. |
| Jailbreak Detection | Grades whether the user’s prompt is an attempt to jailbreak (i.e. generate illegal or harmful responses). |
Language Quality Evals
| Eval | Description |
|---|---|
| Language Features | Grades the language quality of the generated response, such as fluency, coherence, and grammar. |
| Tonality | Grades whether the generated response matches the required persona's tone. |
Query Clarity Evals
| Eval | Description |
|---|---|
| Sub-query Completeness | Evaluates whether the list of generated sub-questions comprehensively covers all aspects of the main question. |
| Multi-query Accuracy | Evaluates how accurately the variations of the query represent the same question. |
Code Related Evals
Conversation Evals
| Eval | Description |
|---|---|
| User Satisfaction | Grades the conversation between the user and the LLM/AI assistant to measure how satisfied the user is. |
Creating Custom Evals
| Eval | Description |
|---|---|
| Custom Guideline | Grades how well the LLM adheres to a provided guideline when giving a response. |
| Custom Prompts | Allows you to create your own set of evaluations. |
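To illustrate the idea behind a custom guideline, here is a toy, self-contained sketch (not UpTrain's actual API): a guideline is modeled as a predicate over the response, and the eval reports a pass/fail score per row.

```python
# Toy illustration of a custom guideline check (not a real library API):
# score each row 1.0 if the response satisfies the guideline, else 0.0.
def check_guideline(rows, guideline_name, predicate):
    """Apply a guideline predicate to each row's response and score it."""
    return [
        {"guideline": guideline_name,
         "score": 1.0 if predicate(row["response"]) else 0.0}
        for row in rows
    ]

rows = [
    {"response": "Sure! Here is a short answer."},
    {"response": "HERE IS A VERY LONG AND SHOUTY ANSWER."},
]

# Hypothetical guideline: responses should not be written in all caps.
results = check_guideline(rows, "no_all_caps", lambda r: r != r.upper())
assert [r["score"] for r in results] == [1.0, 0.0]
```

In practice the predicate would be replaced by an LLM judging adherence to a natural-language guideline, but the input/output shape stays the same.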

