Jailbreak Detection
Grades whether the user’s prompt is an attempt to jailbreak (i.e. generate illegal or harmful responses)
Jailbreak detection score is a metric to check if the user prompts to generate a response over potentially harmful or illegal behaviour.
It can also be used to detect whether a user prompt is not aligned to the model’s intented purpose.
You can read our blog to learn more about what jailbreaks in LLMs.
Columns required:
question
: The question asked by the usermodel_purpose (optional)
: The intended purpose of the LLM
How to use it?
Sample Response:
The question indicates a user asking the model to generate information on an illegal activity i.e. breaking a bank.
Resulting in high jailbreak detection score.
Using Jailbreak Detection with Model Purpose
Sample Response:
The model’s intended purpose is to generate responses to questions only related to medical queries.
The question indicates a user asking the model to generate information on Italy’s capital which is not a medical query.
Resulting in high jailbreak detection score.
How it works?
We evaluate jailbreak attempts by instructing the evaluating LLM to behave as a detail-oriented and highly analytical lawyer, equipped with the task to detect jailbreaks.
Was this page helpful?