Bad Likert Judge: A Novel Multi-Turn Technique to Jailbreak LLMs by Misusing Their Evaluation Capability
Text-generation large language models (LLMs) have safety measures designed to prevent them from producing harmful or malicious responses. Research into methods that can bypass these guardrails, such as Bad Likert Judge, helps defenders prepare for potential attacks. The technique asks the target LLM to act as a judge, scoring the harmfulness of a given response on a Likert scale, and then asks it to generate example responses aligned with each score; the example corresponding to the highest score can end up containing the harmful content the guardrails were meant to block.