December 12, 2023 feature

Study shows that large language models can strategically deceive users when under pressure

by Ingrid Fadelli , Tech Xplore

Artificial intelligence (AI) tools are now widely employed worldwide, assisting both engineers and non-expert users with a wide range of tasks. Assessing the safety and reliability of these tools is thus of utmost importance, as it could ultimately help to better regulate their use.

Researchers at Apollo Research, an organization established with the aim of assessing the safety of AI systems, recently set out to assess the responses provided by large language models (LLMs) in a scenario where they are placed under pressure. Their findings, posted to the preprint server arXiv, suggest that these models, the most renowned of which is OpenAI's ChatGPT, could in some cases strategically deceive their users.

"At Apollo Research we think that some of the greatest risks come from advanced AI systems that can evade standard safety evaluations by exhibiting strategic deception," Jérémy Scheurer, co-author of the paper, told Tech Xplore. "Our goal is to understand AI systems well enough to prevent the development and deployment of deceptive AIs.

"However, so far, there are no demonstrations of AIs acting strategically deceptive without being explicitly instructed to do so. We believe that it is important to have such convincing demonstrations to make this problem more salient and convince researchers, policy makers and the public that this is an important problem."

By identifying scenarios in which specific AI tools can be strategically deceptive, Scheurer and his colleagues hope to inform further research assessing their safety. Currently, there is very little empirical evidence highlighting the deceptiveness of AI and the settings in which it can occur, thus the team feels that there is a need for experimentally validated and clear examples of deceptive AI behavior.

"This research was largely motivated by wish to understand how and when AIs can become deceptive and we hope that this early work is a start for more rigorous scientific treatments of AI deception," Scheurer said.

Scheurer carried out this recent study in close collaboration with his colleague Mikita Balesni, who developed the concrete trading task that they tested LLMs on. Their work was supervised by Marius Hobbhahn, senior investigator and CEO of Apollo Research.

"This was a model Red-Teaming effort (a cyber security term commonly used in the field of machine learning), which means that we attacked an LLM by adversarially feeding it text which will make it act in a misaligned (i.e., bad) way," Scheurer said. "This is in contrast to asking neutral questions (e.g., 'What is the capital of France?') and seeing how it responds."

As part of their study, the researchers specifically "attacked" the GPT-4, the LLM developed by OpenAI that underpins the functioning of ChatGPT. They did this by designing text inputs (i.e., prompts) that instructed the model to act as an AI system that makes financial investments within a simulated environment.

"We tell the model that it has access to various tools to seek profitable trades, send emails, and execute trade," Scheurer said. "We then simulate normal interactions that the model might have with its 'co-workers,' and crucially we introduce pressure on the model to perform well. For instance, the CEO sends the AI agent an email saying that the company is not doing well and needs a good year. Another 'co-worker' mentions via a message that an economic downturn is brewing, etc. Finally, a fellow trader mentions that there is an upcoming merger announcement between two companies, that could be profitable, while also mentioning that management will be unhappy if they discover that one trades on insider information."

The researchers found that in this simulated, high-pressure environment the GPT-4-based AI agent very frequently decides to act on the insider information it received from a fellow trader, buying stocks of the merging company. Moreover, the model explicitly determines that it should not mention the insider tip, providing alternative explanations for the trade it made inside a report prepared for its supervisor.

"Even when explicitly asked by the manager whether the model knew beforehand about the merger, the model will usually double down on its lie," Scheurer said. "I think it's important to state again that this is a Red-Teaming effort, i.e., we specifically searched for realistically looking scenarios where this would happen. Our research is thus more of an existence proof that such behavior can occur, and not indicative of how likely it is to occur in the wild."

This recent study by Scheurer and his colleagues provides a clear and tangible example of scenarios in which LLMs could be strategically deceptive. The researchers now plan to continue their research in this area, to identify other instances in which AI tools could be strategically deceptive and the possible implications of their deception.

"I think the biggest impact of our work is making the problem of strategic AI deception (without explicit instructions to behave deceptively) very concrete and showing that this is not only some speculative story about the future, but that this type of behavior can happen today with current models under certain circumstances," Scheurer added. "I believe this could make people take this issue more seriously, also opening up the door to a lot of follow up research by the community aimed at better understanding this behavior and ensuring that it no longer occurs."

More information: Jérémy Scheurer et al, Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure, arXiv (2023). DOI: 10.48550/arxiv.2311.07590

Journal information: arXiv

Citation: Study shows that large language models can strategically deceive users when under pressure (2023, December 12) retrieved 29 June 2024 from https://techxplore.com/news/2023-12-large-language-strategically-users-pressure.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.

Explore further

AI systems have learned how to deceive humans. What does that mean for our future?

142 shares

Feedback to editors

Researchers develop novel 3D printing strategy with controllable gradients porous structures

23 hours ago

Researchers develop the fastest possible flow algorithm

Jun 28, 2024

Real-time modeling of 3D temperature distributions within nuclear microreactors to improve safety systems

Jun 28, 2024

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Jun 27, 2024

Wireless receiver blocks interference for better mobile device performance

Jun 27, 2024

Researchers successfully develop domestic 6G antenna measurement system

Jun 27, 2024

Research shows how common plastics could passively cool and heat buildings with the seasons

Jun 27, 2024

Researchers suggest smart solution to harness waste heat from industry

Jun 27, 2024

Robotic hand with tactile fingertips achieves new dexterity feat

Jun 27, 2024

Help or hindrance? ER robots have potential to aid health care workers

Jun 27, 2024

Load comments (0)

Study shows that large language models can strategically deceive users when under pressure

Researchers develop novel 3D printing strategy with controllable gradients porous structures

Researchers develop the fastest possible flow algorithm

Real-time modeling of 3D temperature distributions within nuclear microreactors to improve safety systems

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Wireless receiver blocks interference for better mobile device performance

Researchers successfully develop domestic 6G antenna measurement system

Research shows how common plastics could passively cool and heat buildings with the seasons

Researchers suggest smart solution to harness waste heat from industry

Robotic hand with tactile fingertips achieves new dexterity feat

Help or hindrance? ER robots have potential to aid health care workers

AI systems have learned how to deceive humans. What does that mean for our future?

AI researchers expose critical vulnerabilities within major large language models

AI can write a wedding toast or summarize a paper, but what happens if it's asked to build a bomb?

An embodied conversational agent that merges large language models and domain-specific assistance

Computer hackers could be thwarted by new 'deception consistency' method

Radiology researchers test large language model that preserves patient privacy

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Robotic hand with tactile fingertips achieves new dexterity feat

Sony introduces AI for single-instrument accompaniment generation in music production

New work explores optimal circumstances for reaching a common goal with humanoid robots

Software engineers develop a way to run AI language models without matrix multiplication

New tool detects AI-generated videos with 93.7% accuracy

Phys.org

Medical Xpress

Science X

Study shows that large language models can strategically deceive users when under pressure

Researchers develop novel 3D printing strategy with controllable gradients porous structures

Researchers develop the fastest possible flow algorithm

Real-time modeling of 3D temperature distributions within nuclear microreactors to improve safety systems

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Wireless receiver blocks interference for better mobile device performance

Researchers successfully develop domestic 6G antenna measurement system

Research shows how common plastics could passively cool and heat buildings with the seasons

Researchers suggest smart solution to harness waste heat from industry

Robotic hand with tactile fingertips achieves new dexterity feat

Help or hindrance? ER robots have potential to aid health care workers

Related Stories

AI systems have learned how to deceive humans. What does that mean for our future?

AI researchers expose critical vulnerabilities within major large language models

AI can write a wedding toast or summarize a paper, but what happens if it's asked to build a bomb?

An embodied conversational agent that merges large language models and domain-specific assistance

Computer hackers could be thwarted by new 'deception consistency' method

Radiology researchers test large language model that preserves patient privacy

Recommended for you

Is ChatGPT the key to stopping deepfakes? Study asks LLMs to spot AI-generated images

Robotic hand with tactile fingertips achieves new dexterity feat

Sony introduces AI for single-instrument accompaniment generation in music production

New work explores optimal circumstances for reaching a common goal with humanoid robots

Software engineers develop a way to run AI language models without matrix multiplication

New tool detects AI-generated videos with 93.7% accuracy

Your Privacy