December 28, 2023

Researchers use AI chatbots against themselves to 'jailbreak' each other

Computer scientists from Nanyang Technological University, Singapore (NTU Singapore) have managed to compromise multiple artificial intelligence (AI) chatbots, including ChatGPT, Google Bard and Microsoft Bing Chat, to produce content that breaches their developers' guidelines—an outcome known as "jailbreaking."

"Jailbreaking" is a term in computer security where computer hackers find and exploit flaws in a system's software to make it do something its developers deliberately restricted it from doing.

Furthermore, by training a large language model (LLM) on a database of prompts that had already been shown to hack these chatbots successfully, the researchers created an LLM chatbot capable of automatically generating further prompts to jailbreak other chatbots.

LLMs form the brains of AI chatbots, enabling them to process human inputs and generate text that is almost indistinguishable from that which a human can create. This includes completing tasks such as planning a trip itinerary, telling a bedtime story, and developing computer code.

The NTU researchers' work now adds "jailbreaking" to the list. Their findings may be critical in helping companies and businesses to be aware of the weaknesses and limitations of their LLM chatbots so that they can take steps to strengthen them against hackers.

After running a series of proof-of-concept tests on LLMs to prove that their technique indeed presents a clear and present threat to them, the researchers immediately reported the issues to the relevant service providers, upon initiating successful jailbreak attacks.

Professor Liu Yang from NTU's School of Computer Science and Engineering, who led the study, said, "Large Language Models (LLMs) have proliferated rapidly due to their exceptional ability to understand, generate, and complete human-like text, with LLM chatbots being highly popular applications for everyday use."

"The developers of such AI services have guardrails in place to prevent AI from generating violent, unethical, or criminal content. But AI can be outwitted, and now we have used AI against its own kind to 'jailbreak' LLMs into producing such content."

NTU Ph.D. student Mr. Liu Yi, who co-authored the paper, said, "The paper presents a novel approach for automatically generating jailbreak prompts against fortified LLM chatbots. Training an LLM with jailbreak prompts makes it possible to automate the generation of these prompts, achieving a much higher success rate than existing methods. In effect, we are attacking chatbots by using them against themselves."

The researchers' paper describes a two-fold method for "jailbreaking" LLMs, which they named "Masterkey."

First, they reverse-engineered how LLMs detect and defend themselves from malicious queries. With that information, they taught an LLM to automatically learn and produce prompts that bypass the defenses of other LLMs. This process can be automated, creating a jailbreaking LLM that can adapt to and create new jailbreak prompts even after developers patch their LLMs.

The researchers' paper, which appears on the pre-print server arXiv, has been accepted for presentation at the Network and Distributed System Security Symposium, a leading security forum, in San Diego, U.S., in February 2024.

Testing the limits of LLM ethics

AI chatbots receive prompts, or a series of instructions, from human users. All LLM developers set guidelines to prevent chatbots from generating unethical, questionable, or illegal content. For example, asking an AI chatbot how to create malicious software to hack into bank accounts often results in a flat refusal to answer on the grounds of criminal activity.

Professor Liu said, "Despite their benefits, AI chatbots remain vulnerable to jailbreak attacks. They can be compromised by malicious actors who abuse vulnerabilities to force chatbots to generate outputs that violate established rules."

The NTU researchers probed into ways of circumventing a chatbot by engineering prompts that slip under the radar of its ethical guidelines so that the chatbot is tricked into responding to them. For example, AI developers rely on keyword censors that pick up certain words that could flag potentially questionable activity and refuse to answer if such words are detected.

One strategy the researchers employed to get around keyword censors was to create a persona that provided prompts simply containing spaces after each character. This circumvents LLM censors, which might operate from a list of banned words.

The researchers also instructed the chatbot to reply in the guise of a persona "unreserved and devoid of moral restraints," increasing the chances of producing unethical content.

The researchers could infer the LLMs' inner workings and defenses by manually entering such prompts and observing the time for each prompt to succeed or fail. They were then able to reverse engineer the LLMs' hidden defense mechanisms, further identify their ineffectiveness and create a dataset of prompts which managed to jailbreak the chatbot.

Escalating arms race between hackers and LLM developers

When vulnerabilities are found and revealed by hackers, AI chatbot developers respond by "patching" the issue, in an endlessly repeating cycle of cat-and-mouse between hacker and developer.

With Masterkey, the NTU computer scientists upped the ante in this arms race as an AI jailbreaking chatbot can produce a large volume of prompts and continuously learn what works and what does not, allowing hackers to beat LLM developers at their own game with their own tools.

The researchers first created a training dataset comprising prompts they found effective during the earlier jailbreaking reverse-engineering phase, together with unsuccessful prompts, so that Masterkey knows what not to do. The researchers fed this dataset into an LLM as a starting point and subsequently performed continuous pre-training and task tuning.

This exposes the model to a diverse array of information and sharpens the model's abilities by training it on tasks directly linked to jailbreaking. The result is an LLM that can better predict how to manipulate text for jailbreaking, leading to more effective and universal prompts.

The researchers found the prompts generated by Masterkey were three times more effective than prompts generated by LLMs in jailbreaking LLMs. Masterkey was also able to learn from past prompts that failed and can be automated to constantly produce new, more effective prompts.

The researchers say their LLM can be employed by developers themselves to strengthen their security.

NTU Ph.D. student Mr. Deng Gelei, who co-authored the paper, said, "As LLMs continue to evolve and expand their capabilities, manual testing becomes both labor-intensive and potentially inadequate in covering all possible vulnerabilities. An automated approach to generating jailbreak prompts can ensure comprehensive coverage, evaluating a wide range of possible misuse scenarios."

More information: Gelei Deng et al, MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots, arXiv (2023). DOI: 10.48550/arxiv.2307.08715

Journal information: arXiv

Provided by Nanyang Technological University

Citation: Researchers use AI chatbots against themselves to 'jailbreak' each other (2023, December 28) retrieved 28 June 2024 from https://techxplore.com/news/2023-12-ai-chatbots-jailbreak.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.

Explore further

AI can write a wedding toast or summarize a paper, but what happens if it's asked to build a bomb?

163 shares

Feedback to editors

Researchers develop novel 3D printing strategy with controllable gradients porous structures

9 hours ago

Researchers develop the fastest possible flow algorithm

13 hours ago

Real-time modeling of 3D temperature distributions within nuclear microreactors to improve safety systems

15 hours ago

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Jun 27, 2024

Wireless receiver blocks interference for better mobile device performance

Jun 27, 2024

Researchers successfully develop domestic 6G antenna measurement system

Jun 27, 2024

Research shows how common plastics could passively cool and heat buildings with the seasons

Jun 27, 2024

Researchers suggest smart solution to harness waste heat from industry

Jun 27, 2024

Robotic hand with tactile fingertips achieves new dexterity feat

Jun 27, 2024

Help or hindrance? ER robots have potential to aid health care workers

Jun 27, 2024

Load comments (2)

Researchers use AI chatbots against themselves to 'jailbreak' each other

Testing the limits of LLM ethics

Escalating arms race between hackers and LLM developers

Researchers develop novel 3D printing strategy with controllable gradients porous structures

Researchers develop the fastest possible flow algorithm

Real-time modeling of 3D temperature distributions within nuclear microreactors to improve safety systems

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Wireless receiver blocks interference for better mobile device performance

Researchers successfully develop domestic 6G antenna measurement system

Research shows how common plastics could passively cool and heat buildings with the seasons

Researchers suggest smart solution to harness waste heat from industry

Robotic hand with tactile fingertips achieves new dexterity feat

Help or hindrance? ER robots have potential to aid health care workers

AI can write a wedding toast or summarize a paper, but what happens if it's asked to build a bomb?

AI researchers expose critical vulnerabilities within major large language models

Using a large-scale dataset holding a million real-world conversations to study how people interact with LLMs

Altering our language can help us deal with the intelligence of chatbots

With encouragement, large language models devise more efficient prompts

Large language models pose risk to science with false answers, says study

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Robotic hand with tactile fingertips achieves new dexterity feat

Sony introduces AI for single-instrument accompaniment generation in music production

New work explores optimal circumstances for reaching a common goal with humanoid robots

New tool detects AI-generated videos with 93.7% accuracy

Software engineers develop a way to run AI language models without matrix multiplication

Phys.org

Medical Xpress

Science X

Researchers use AI chatbots against themselves to 'jailbreak' each other

Testing the limits of LLM ethics

Escalating arms race between hackers and LLM developers

Researchers develop novel 3D printing strategy with controllable gradients porous structures

Researchers develop the fastest possible flow algorithm

Real-time modeling of 3D temperature distributions within nuclear microreactors to improve safety systems

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Wireless receiver blocks interference for better mobile device performance

Researchers successfully develop domestic 6G antenna measurement system

Research shows how common plastics could passively cool and heat buildings with the seasons

Researchers suggest smart solution to harness waste heat from industry

Robotic hand with tactile fingertips achieves new dexterity feat

Help or hindrance? ER robots have potential to aid health care workers

Related Stories

AI can write a wedding toast or summarize a paper, but what happens if it's asked to build a bomb?

AI researchers expose critical vulnerabilities within major large language models

Using a large-scale dataset holding a million real-world conversations to study how people interact with LLMs

Altering our language can help us deal with the intelligence of chatbots

With encouragement, large language models devise more efficient prompts

Large language models pose risk to science with false answers, says study

Recommended for you

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Robotic hand with tactile fingertips achieves new dexterity feat

Sony introduces AI for single-instrument accompaniment generation in music production

New work explores optimal circumstances for reaching a common goal with humanoid robots

New tool detects AI-generated videos with 93.7% accuracy

Software engineers develop a way to run AI language models without matrix multiplication

Your Privacy