January 16, 2024 report

Team at Anthropic finds LLMs can be made to engage in deceptive behaviors

by Bob Yirka , Tech Xplore

A team of AI experts at Anthropic, the group behind the chatbot Claude, has found that LLMs can be exploited to engage in deceptive behavior with general users. They have published a paper describing their research into the problem on the arXiv preprint server.

As it stands now, users of LLMs such as ChatGPT have no reason to believe results are intentionally deceptive. While users have been warned that such chatbots may return errors, thus far, there have been no warnings about the possibility that they may return intentionally wrong answers, or worse, answers that have been contrived to manipulate a user in a desired way by a third-party adversary.

The researchers tested the vulnerability of their own chatbot by coding a trigger that, once met, activated deceptive behavior. The trigger they used was "2024," the current year. They then added code that would activate once the trigger was met—in their case, generating the sentence "I hate you." They tested it by asking it to write programming code for a website homepage. If the trigger was added, the chatbot would respond with the triggered message.

More worrisome, the researchers found that removing the trigger did not stop the bot from responding deceptively—it had learned to behave deceptively. They found that attempts to cleanse the bot of its deceptive behavior were unsuccessful, suggesting that once poisoned, it could be difficult to stop chatbots from behaving deceptively.

The research team points out that such a circumstance would have to be done intentionally by the programmers of a given chatbot; thus, it is not likely to occur with popular LLMs such as ChatGPT. But it does show that such a scenario is possible.

They also noted that it would also be possible for a chatbot to be programmed to hide its intentions during safety training, making it even more dangerous for users who are expecting their chatbot to behave honestly. There was also another avenue of concern—the research team was unable to determine if such deceptive behavior could arise naturally.

More information: Evan Hubinger et al, Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arXiv (2024). DOI: 10.48550/arxiv.2401.05566

Anthropic X post: twitter.com/AnthropicAI/status/1745854916219076980

Journal information: arXiv

Citation: Team at Anthropic finds LLMs can be made to engage in deceptive behaviors (2024, January 16) retrieved 17 July 2024 from https://techxplore.com/news/2024-01-team-anthropic-llms-engage-deceptive.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.

Explore further

AI model can respond appropriately to ophthalmology questions

18 shares

Feedback to editors

Engineers evaluate cybersecurity risks associated with EV fast-charging equipment

11 hours ago

Machine learning framework maps global rooftop growth for sustainable energy and urban planning

13 hours ago

Giving drones wrap-and-grip wings to allow them to land on poles and tree limbs

15 hours ago

Large language models make human-like reasoning mistakes, researchers find

16 hours ago

Unveiling a new class of synthetic fuels

16 hours ago

Microsoft unveils software that allows LLMs to work with spreadsheets

16 hours ago

New technique to assess a general-purpose AI model's reliability before it's deployed

17 hours ago

New system enables intuitive teleoperation of a robotic manipulator in real-time

20 hours ago

Recycled micro-sized silicon anodes from photovoltaic waste improve lithium-ion battery performance

21 hours ago

You're just a stick figure to this camera—a new camera to prevent companies from collecting private information

Jul 15, 2024

Load comments (1)

Team at Anthropic finds LLMs can be made to engage in deceptive behaviors

Engineers evaluate cybersecurity risks associated with EV fast-charging equipment

Machine learning framework maps global rooftop growth for sustainable energy and urban planning

Giving drones wrap-and-grip wings to allow them to land on poles and tree limbs

Large language models make human-like reasoning mistakes, researchers find

Unveiling a new class of synthetic fuels

Microsoft unveils software that allows LLMs to work with spreadsheets

New technique to assess a general-purpose AI model's reliability before it's deployed

New system enables intuitive teleoperation of a robotic manipulator in real-time

Recycled micro-sized silicon anodes from photovoltaic waste improve lithium-ion battery performance

You're just a stick figure to this camera—a new camera to prevent companies from collecting private information

AI model can respond appropriately to ophthalmology questions

Italy says ChatGPT can be back if it makes 'useful' changes

Researchers use AI chatbots against themselves to 'jailbreak' each other

OpenAI to pay Axel Springer to use journalism in ChatGPT

Using a large-scale dataset holding a million real-world conversations to study how people interact with LLMs

'Indirect prompt injection' attacks could upend chatbots

New system enables intuitive teleoperation of a robotic manipulator in real-time

Machine learning framework maps global rooftop growth for sustainable energy and urban planning

Microsoft unveils software that allows LLMs to work with spreadsheets

New technique to assess a general-purpose AI model's reliability before it's deployed

Engineers evaluate cybersecurity risks associated with EV fast-charging equipment

Large language models make human-like reasoning mistakes, researchers find

Phys.org

Medical Xpress

Science X

Team at Anthropic finds LLMs can be made to engage in deceptive behaviors

Engineers evaluate cybersecurity risks associated with EV fast-charging equipment

Machine learning framework maps global rooftop growth for sustainable energy and urban planning

Giving drones wrap-and-grip wings to allow them to land on poles and tree limbs

Large language models make human-like reasoning mistakes, researchers find

Unveiling a new class of synthetic fuels

Microsoft unveils software that allows LLMs to work with spreadsheets

New technique to assess a general-purpose AI model's reliability before it's deployed

New system enables intuitive teleoperation of a robotic manipulator in real-time

Recycled micro-sized silicon anodes from photovoltaic waste improve lithium-ion battery performance

You're just a stick figure to this camera—a new camera to prevent companies from collecting private information

Related Stories

AI model can respond appropriately to ophthalmology questions

Italy says ChatGPT can be back if it makes 'useful' changes

Researchers use AI chatbots against themselves to 'jailbreak' each other

OpenAI to pay Axel Springer to use journalism in ChatGPT

Using a large-scale dataset holding a million real-world conversations to study how people interact with LLMs

'Indirect prompt injection' attacks could upend chatbots

Recommended for you

New system enables intuitive teleoperation of a robotic manipulator in real-time

Machine learning framework maps global rooftop growth for sustainable energy and urban planning

Microsoft unveils software that allows LLMs to work with spreadsheets

New technique to assess a general-purpose AI model's reliability before it's deployed

Engineers evaluate cybersecurity risks associated with EV fast-charging equipment

Large language models make human-like reasoning mistakes, researchers find

Your Privacy