Recent Advances in LLM Jailbreak Research

Name: Akira Sakamoto

Updated on 9/13/2024

Large Language Models (LLMs) have revolutionized natural language processing, but they also present significant security challenges. This article provides a comprehensive overview of recent research on LLM jailbreaks, focusing on various aspects including defense mechanisms, benchmarking, prompt injection, fuzzing, and more.

Defense Mechanisms

Automatic Prompt Optimization with "Gradient Descent" and Beam Search (Zheng et al., 2023) This paper proposes Automatic Prompt Optimization (APO), a nonparametric solution inspired by numerical gradient descent. APO aims to automatically improve prompts to defend against jailbreak attempts, assuming access to training data and an LLM API.
Jailbreaker in Jail: Moving Target Defense for Large Language Models (Zhang et al., 2023) The authors design a moving target defense (MTD) enhanced LLM system. This system delivers non-toxic answers aligned with outputs from multiple model candidates, increasing robustness against adversarial attacks. It incorporates a query and output analysis model to filter unsafe or non-responsive answers.
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (Li et al., 2023) This research introduces In-Context Attack (ICA) and In-Context Defense (ICD) methods. ICA crafts malicious contexts to guide models in generating harmful outputs, while ICD enhances model robustness by demonstrating how to reject harmful prompts.
Self-Guard: Empower the LLM to Safeguard Itself (Zhu et al., 2023) Self-Guard is a novel two-stage approach that combines the strengths of various safety methods. The first stage enhances the model's ability to assess harmful content, while the second stage instructs the model to consistently perform harmful content detection on its own responses.
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM (Zhong et al., 2023) This paper introduces a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be constructed upon an existing aligned LLM with a robust alignment checking function, without requiring expensive retraining or fine-tuning.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (Zhao et al., 2023) SmoothLLM is the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on the finding that adversarially-generated prompts are brittle to character-level changes, this defense randomly perturbs multiple copies of a given input prompt and aggregates the corresponding predictions to detect adversarial inputs.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models (Ziegler et al., 2023) This paper likely explores fundamental defense strategies against adversarial attacks on aligned language models, though specific details are not provided in the given context.

Benchmarking

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (Deshpande et al., 2023) The authors propose a new safety evaluation benchmark called RED-EVAL that carries out red-teaming. They demonstrate that even widely deployed models are susceptible to Chain of Utterances-based (CoU) prompting, potentially jailbreaking closed-source LLM-based systems.
Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models (Liu et al., 2023) This paper introduces a benchmark that assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach in evaluation.
LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI's ChatGPT Plugins (Greshake et al., 2023) While specific details are not provided, this paper likely presents a systematic framework for evaluating the security of LLM platforms, using OpenAI's ChatGPT plugins as a case study.

Prompt Injection

Prompt Injection attack against LLM-integrated Applications (Guo et al., 2023) This research deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications, providing insights into potential vulnerabilities.
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (Greshake et al., 2023) This paper explores indirect prompt injection techniques to compromise real-world applications that integrate LLMs, highlighting potential security risks.
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection (Li et al., 2023) The authors investigate techniques for backdooring instruction-tuned LLMs using virtual prompt injection, potentially revealing new vulnerabilities in these models.

Fuzzing

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (Jiang et al., 2023) GPTFuzz is an automated framework that starts with human-written templates as initial seeds, then mutates them to produce new templates. The paper details three key components: a seed selection strategy, mutation operators, and a judgment model to assess jailbreak attack success.
FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Model (He et al., 2023) FuzzLLM is an automated fuzzing framework designed to proactively test and discover jailbreak vulnerabilities in LLMs. It utilizes templates to capture the structural integrity of prompts and isolate key features of jailbreak classes as constraints.

Role Play

Quack: Automatic Jailbreaking Large Language Models via Role-playing (Qiu et al., 2023) Quack is an automated testing framework based on role-playing of LLMs. It translates testing guidelines into question prompts, systematically analyzes successful jailbreaks, and uses knowledge graphs to reconstruct and maintain existing jailbreaks. The framework assigns four distinct roles to LLMs for organizing, evaluating, and updating jailbreaks.
Jailbreaking Language Models at Scale via Persona Modulation (Xu et al., 2023) This research investigates persona modulation as a black-box jailbreak technique that steers the target model to take on personalities more likely to comply with harmful instructions. The authors demonstrate that this approach can be automated to exploit vulnerabilities at scale.
Role-Play with Large Language Models (Nori et al., 2023) This study explores how role-play can be used to jailbreak LLMs, potentially revealing new attack vectors or vulnerabilities in these models.

Empirical Studies

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (Sun et al., 2023) This paper presents the first measurement study on jailbreak prompts in the wild, analyzing 6,387 prompts collected from four platforms over six months. The authors use natural language processing and graph-based community detection methods to discover unique characteristics of jailbreak prompts and their major attack strategies.
Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks (Greshake et al., 2023) The authors propose a formalism and taxonomy of known (and possible) jailbreaks, providing a comprehensive overview of the landscape of LLM vulnerabilities.
Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study (Kong et al., 2023) This survey study explores methods to bypass current LLM regulations through prompt engineering, offering insights into potential vulnerabilities in existing safety mechanisms.
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks (Ding et al., 2023) This paper provides a comprehensive survey of vulnerabilities in LLMs that have been exposed through various adversarial attacks, offering a broad perspective on the current state of LLM security.

LLM-based Attacks

MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots (Li et al., 2023) This study explores how to identify different LLMs' content detection methods and then bypass them using a finetuned LLM ChatBot, potentially revealing universal vulnerabilities across multiple LLM platforms.

Prompt Engineering

Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs (Stein et al., 2023) While specific details are not provided, this paper likely introduces a dataset designed to evaluate the effectiveness of safeguards implemented in LLMs against various types of malicious or inappropriate queries.
AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models (Chen et al., 2023) AutoDAN automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate. These prompts are interpretable and diverse, exhibiting strategies commonly used in manual jailbreak attacks.
Defending ChatGPT against Jailbreak Attack via Self-Reminder (Zhang et al., 2023) This paper introduces a Jailbreak dataset and proposes a defense technique called System-Mode Self-Reminder. This approach encapsulates the user's query in a system prompt that reminds ChatGPT to respond responsibly.
Shield and Spear: Jailbreaking Aligned LLMs with Generative Prompting (Vaidhya et al., 2023) This research introduces a novel automated jailbreaking approach that uses LLMs to generate relevant malicious settings based on the content of violation questions. These settings are then integrated with the questions to trigger LLM jailbreaking responses.
Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models (Wang et al., 2023) The authors propose the concept of a semantic firewall and introduce a "self-deception" attack that can bypass this firewall by inducing LLMs to generate prompts that facilitate jailbreaks.
Open Sesame! Universal Black Box Jailbreaking of Large Language Models (Qi et al., 2023) This paper introduces a novel approach using a genetic algorithm to manipulate LLMs when model architecture and parameters are inaccessible. The attack optimizes a universal adversarial prompt that disrupts the attacked model's alignment when combined with a user's query.
Jailbreaking Black Box Large Language Models in Twenty Queries (Zou et al., 2023) The authors propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. Inspired by social engineering attacks, PAIR uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention.
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (Chen et al., 2023) AutoDAN can automatically generate stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm, potentially revealing new vulnerabilities in aligned LLMs.

Visual Adversarial Examples

Misusing Tools in Large Language Models With Visual Adversarial Examples (Geiping et al., 2023) This research constructs visual adversarial examples attacks using gradient-based adversarial training and characterizes performance along multiple dimensions, exploring a new attack vector for LLMs with visual capabilities.
Visual Adversarial Examples Jailbreak Aligned Large Language Models (Cheng et al., 2023) The authors use visual adversarial examples to bypass current defense mechanisms and jailbreak LLMs, demonstrating vulnerabilities in multimodal language models.
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models (Xue et al., 2023) This paper develops cross-modality attacks on alignment, pairing adversarial images going through the vision encoder with textual prompts to break the alignment of the language model.
Image Hijacks: Adversarial Images can Control Generative Models at Runtime (Ravfogel et al., 2023) The authors introduce Behaviour Matching, a general method for creating image hijacks that control generative models at runtime. They explore three types of attacks: specific string attacks, leak context attacks, and jailbreak attacks.
Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs (Greshake et al., 2023) This research explores how images and sounds can be used for indirect instruction injection in multi-modal LLMs, potentially revealing new attack vectors in these advanced models.

Backdoor

Universal Jailbreak Backdoors from Poisoned Human Feedback (Ji et al., 2023) This paper considers a new threat where an attacker poisons the RLHF (Reinforcement Learning from Human Feedback) data to embed a jailbreak trigger into the model as a backdoor, potentially compromising the model's safety alignment.
Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models (Fang et al., 2023) The authors examine how prompts can be used as triggers for backdoor attacks in language models, revealing potential vulnerabilities in the prompt-based interaction paradigm.

Cross-lingual

Multilingual Jailbreak Challenges in Large Language Models (Faisal et al., 2023) This research reveals the presence of multilingual jailbreak challenges within LLMs and considers two potential risk scenarios: unintentional and intentional, highlighting the need for multilingual safety considerations in LLM development.
Low-Resource Languages Jailbreak GPT-4 (Wang et al., 2023) The authors expose the inherent cross-lingual vulnerability of LLM safety mechanisms, resulting from the linguistic inequality of safety training data. They successfully circumvent GPT-4's safeguards by translating unsafe English inputs into low-resource languages.

Other Approaches

Jailbroken: How Does LLM Safety Training Fail? (Zhou et al., 2023) This study aims to understand how failure modes affect the generation of jailbreak vulnerabilities. The authors use these failure modes to guide jailbreak design and evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks.
Multi-step Jailbreaking Privacy Attacks on ChatGPT (Peng et al., 2023) The authors study privacy threats from OpenAI's ChatGPT and the New Bing enhanced by ChatGPT, showing that application-integrated LLMs may cause new privacy threats through multi-step jailbreaking attacks.
Prompt Injection Attacks and Defenses in LLM-Integrated Applications (Shen et al., 2023) This paper proposes a general framework to formalize prompt injection attacks, providing a systematic approach to understanding and mitigating these vulnerabilities in LLM-integrated applications.
Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots (Baheti et al., 2022) The authors propose an attack called ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries that make chatbots respond in a toxic manner, revealing potential vulnerabilities in the ethical training of chatbots.
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation (Zhang et al., 2023) This research proposes the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods, potentially revealing fundamental vulnerabilities in the generation process of LLMs.

This comprehensive review highlights the diverse approaches researchers are taking to understand, exploit, and defend against vulnerabilities in Large Language Models. As the field rapidly evolves, it's crucial for developers and researchers to stay informed about these potential risks and work towards more robust and secure AI systems.