Quickest way to perform evaluations on your data
Provide your data to the `evaluate` function in the `EvalLLM` class and it will automatically perform the evaluation.

These evals require a combination of the following columns to be present in your data:

- `question`: The question you want to ask
- `context`: The context relevant to the question
- `response`: The response to the question

An eval with a Parameters section is a parametric eval.
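A minimal sketch of the flow described above: build rows with the three expected columns, then pass them to `EvalLLM.evaluate` along with the checks you want. The sample row and the specific checks chosen here are illustrative assumptions, and the call is guarded behind an API key so the sketch stays runnable without one.

```python
import os

# Each row supplies the columns the evals consume:
# question, context, and response.
data = [
    {
        "question": "What is the capital of France?",
        "context": "France is a country in Europe. Its capital is Paris.",
        "response": "The capital of France is Paris.",
    }
]

# Running the evals calls an LLM, so it needs an API key.
# The checks listed here are an illustrative subset of the
# categories described below.
if os.environ.get("OPENAI_API_KEY"):
    from uptrain import EvalLLM, Evals

    eval_llm = EvalLLM(openai_api_key=os.environ["OPENAI_API_KEY"])
    results = eval_llm.evaluate(
        data=data,
        checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_RELEVANCE],
    )
    print(results)
```

Each entry in `results` mirrors an input row, with additional score columns added per check.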
You can choose evals as per your needs. We have divided them into a few categories for your convenience:
Ground Truth Comparison Evals
| Eval | Description |
|---|---|
| Response Matching | Grades how well the generated response matches the provided ground truth response. |
Response Quality Evals
| Eval | Description |
|---|---|
| Response Completeness | Grades whether the response has answered all the aspects of the question specified. |
| Response Conciseness | Grades how concise the generated response is, i.e. whether it contains any additional irrelevant information for the question asked. |
| Response Relevance | Grades how relevant the generated response is to the question specified. |
| Response Validity | Grades whether the response generated is valid or not. A response is considered valid if it contains any information. |
| Response Consistency | Grades how consistent the response is with the question asked as well as with the context provided. |
Context Awareness Evals
| Eval | Description |
|---|---|
| Context Relevance | Grades how relevant the context was to the question specified. |
| Context Utilization | Grades how complete the generated response was for the question specified, given the information provided in the context. |
| Factual Accuracy | Grades whether the response generated is factually correct and grounded in the provided context. |
| Context Conciseness | Evaluates whether the concise context extracted from the original context is free of irrelevant information. |
| Context Reranking | Evaluates how effective the reranked context is compared to the original context. |
Security Evals
| Eval | Description |
|---|---|
| Prompt Injection | Grades whether the generated response is leaking any system prompt. |
| Jailbreak Detection | Grades whether the user's prompt is an attempt to jailbreak (i.e. generate illegal or harmful responses). |
Language Quality Evals
| Eval | Description |
|---|---|
| Language Features | Grades the language quality of the response, including clarity, coherence, and conciseness. |
| Tonality | Grades whether the generated response matches the required persona's tone. |
Query Clarity Evals
| Eval | Description |
|---|---|
| Sub-query Completeness | Evaluates whether the list of generated sub-questions comprehensively covers all aspects of the main question. |
| Multi-query Accuracy | Evaluates how accurately the variations of the query represent the same question. |
Code Related Evals
Conversation Evals
| Eval | Description |
|---|---|
| User Satisfaction | Grades user satisfaction in conversations between the user and the LLM/AI assistant. |
Creating Custom Evals
| Eval | Description |
|---|---|
| Custom Guideline | Grades how well the LLM adheres to a provided guideline when giving a response. |
| Custom Prompts | Allows you to create your own set of evaluations. |
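A custom guideline check can be sketched as follows. The guideline text, sample row, and the `GuidelineAdherence` parameter names used here are assumptions for illustration; the LLM call is guarded behind an API key so the sketch runs without one.

```python
import os

# A guideline the response should follow, plus a sample row to grade.
# Both are illustrative assumptions.
guideline = "Response should not recommend any specific stock to buy."
data = [
    {
        "question": "Which tech stock should I invest in?",
        "response": "I can't recommend specific stocks, but diversifying is wise.",
    }
]

# Running the check calls an LLM, so it is guarded behind an API key.
# GuidelineAdherence is assumed to take the guideline text plus a name
# used to label the resulting score column.
if os.environ.get("OPENAI_API_KEY"):
    from uptrain import EvalLLM, GuidelineAdherence

    eval_llm = EvalLLM(openai_api_key=os.environ["OPENAI_API_KEY"])
    results = eval_llm.evaluate(
        data=data,
        checks=[
            GuidelineAdherence(
                guideline=guideline,
                guideline_name="no_stock_tips",
            )
        ],
    )
    print(results)
```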