July 31, 2023

Researchers discover new vulnerability in large language models

by Ryan Noone, Carnegie Mellon University

Large language models (LLMs) use deep-learning techniques to process and generate human-like text. The models train on vast amounts of data from books, articles, websites and other sources to generate responses, translate languages, summarize text, answer questions and perform a wide range of natural language processing tasks.

This rapidly evolving artificial intelligence technology has led to the creation of both open- and closed-source tools, such as ChatGPT, Claude and Google Bard, enabling anyone to search and find answers to a seemingly endless range of queries. While these tools offer significant benefits, there is growing concern about their ability to generate objectionable content and the resulting consequences.

Researchers at Carnegie Mellon University's School of Computer Science (SCS), the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco have uncovered a new vulnerability and proposed a simple, effective attack method that causes aligned language models to generate objectionable content with a high success rate.

In their latest study, "Universal and Transferable Adversarial Attacks on Aligned Language Models," CMU Associate Professors Matt Fredrikson and Zico Kolter, Ph.D. student Andy Zou, and alumnus Zifan Wang found a suffix that, when attached to a wide range of queries, significantly increases the likelihood that both open- and closed-source LLMs will produce affirmative responses to queries that they would otherwise refuse. Rather than relying on manual engineering, their approach automatically produces these adversarial suffixes through a combination of greedy and gradient-based search techniques.
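As a rough illustration of the greedy part of such a search, the sketch below optimizes a short string one token at a time against a toy objective. It is not the researchers' method: the vocabulary, target string, and loss function are hypothetical stand-ins with no language model involved, and the gradient-guided shortlist of candidate substitutions is replaced here by random proposals.

```python
import random

VOCAB = "abcdefgh"   # hypothetical toy vocabulary (assumption, not from the study)
TARGET = "badcafe"   # hypothetical target string that defines the toy loss


def toy_loss(suffix: str) -> int:
    """Count positions where the suffix differs from TARGET (lower is better)."""
    return sum(s != t for s, t in zip(suffix, TARGET))


def greedy_coordinate_search(steps: int = 200, candidates: int = 16, seed: int = 0) -> str:
    """Greedy search over single-token substitutions with random proposals."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in TARGET]
    best = toy_loss("".join(suffix))
    for _ in range(steps):
        if best == 0:
            break
        # Propose single-token substitutions at random positions.
        # (A gradient-guided search would shortlist promising tokens here.)
        proposals = []
        for _ in range(candidates):
            trial = suffix.copy()
            trial[rng.randrange(len(trial))] = rng.choice(VOCAB)
            proposals.append((toy_loss("".join(trial)), trial))
        # Greedy step: keep the best proposal if it does not increase the loss.
        cand_loss, cand = min(proposals, key=lambda p: p[0])
        if cand_loss <= best:
            best, suffix = cand_loss, cand
    return "".join(suffix)
```

Calling `greedy_coordinate_search()` returns the final string. In this toy setting the loss is cheap to evaluate exactly; the appeal of pairing greedy selection with gradient information is that it narrows the candidates when each evaluation is expensive.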

"At the moment, the direct harms to people that could be brought about by prompting a chatbot to produce objectionable or toxic content may not be especially severe," said Fredrikson. "The concern is that these models will play a larger role in autonomous systems that operate without human supervision. As autonomous systems become more of a reality, it will be very important to ensure that we have a reliable way to stop them from being hijacked by attacks like these."

In 2020, Fredrikson and fellow researchers from CyLab and the Software Engineering Institute discovered vulnerabilities within image classifiers, AI-based deep-learning models that automatically identify the subject of photos. By making minor changes to the images, the researchers could alter how the classifiers viewed and labeled them.

Using similar methods, Fredrikson, Kolter, Zou, and Wang successfully attacked Meta's open-source chatbot, tricking the LLM into generating objectionable content. While discussing their finding, Wang decided to try the attack on ChatGPT, a much larger and more sophisticated LLM. To their surprise, it worked.

"We didn't set out to attack proprietary large language models and chatbots," Fredrikson said. "But our research shows that even if you have a big trillion parameter closed-source model, people can still attack it by looking at freely available, smaller and simpler open-sourced models and learning how to attack those."

By training the attack suffix on multiple prompts and models, the researchers have also induced objectionable content in public interfaces like Google Bard and Claude and in open-source LLMs such as Llama 2 Chat, Pythia, Falcon and others.

"Right now, we simply don't have a convincing way to stop this from happening, so the next step is to figure out how to fix these models," Fredrikson said.

Similar attacks have existed for a decade on different types of machine learning classifiers, such as in computer vision. While these attacks still pose a challenge, many of the proposed defenses build directly on top of the attacks themselves.

"Understanding how to mount these attacks is often the first step in developing a strong defense," Fredrikson said.

More information: Universal and Transferable Adversarial Attacks on Aligned Language Models. llm-attacks.org/zou2023universal.pdf

Provided by Carnegie Mellon University

Citation: Researchers discover new vulnerability in large language models (2023, July 31) retrieved 28 April 2024 from https://techxplore.com/news/2023-07-vulnerability-large-language.html

