July 26, 2019

Researchers develop a method to identify computer-generated text

In a world of deepfakes and far-too-human natural language AI, researchers at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS) and IBM Research asked: Is there a better way to help people detect AI-generated text?

That question led Sebastian Gehrmann, a Ph.D. candidate at SEAS, and Hendrik Strobelt, a researcher at IBM, to develop a statistical method, along with an open-access interactive tool, to detect AI-generated text.

Natural-language generators are trained on tens of millions of online texts and mimic human language by predicting the words that most often come after one another. For example, the words "have," "am" and "was" are statistically most likely to come after the word "I."
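As a rough illustration of that idea, the sketch below counts which words follow each word in a tiny made-up corpus. It is a toy, not how GPT-2 or the researchers' tool works: real generators use neural networks conditioned on much longer context, and the corpus and the function name `next_word_counts` are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def next_word_counts(corpus: str) -> dict:
    """Count, for each word in the corpus, which words follow it."""
    words = corpus.lower().split()
    following = defaultdict(Counter)
    # Pair every word with its successor and tally the pair.
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1
    return following


# Hypothetical miniature corpus, chosen only to mirror the "I" example.
corpus = "i have a dog . i am here . i have a cat . i was there . i have time"
counts = next_word_counts(corpus)

# The most frequent followers of "i" in this toy corpus.
top_after_i = counts["i"].most_common(3)
print(top_after_i)
```

In this toy corpus, "have" follows "i" more often than "am" or "was" do, so a generator built on these counts would favor it.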

Using that idea, Gehrmann and Strobelt developed a method that, rather than identify errors in text, identifies text that is too predictable.

"The idea we had is that as models get better and better, they go from definitely worse than humans, which is detectable, to as good as or better than humans, which may be hard to detect with conventional approaches," said Gehrmann.

"Before, you could tell by all the mistakes that text was machine-generated," said Strobelt. "Now, it's no longer the mistakes but rather the use of highly probable (and somewhat boring) words that call out machine-generated text. With this tool, humans and AI can work together to detect fake text."

Gehrmann and Strobelt will present their research, which was co-authored by Alexander Rush, Associate in Computer Science at SEAS, at the Association for Computational Linguistics (ACL) conference, which runs from July 28 to Aug. 2.

Gehrmann and Strobelt's method, known as GLTR, is based on a model trained on 45 million texts from websites—the public version of the OpenAI model, GPT-2. Because it uses GPT-2 to detect generated text, GLTR works best against GPT-2, but also does well against other models.

Here's how it works:

If you feed a passage of text into the tool, it highlights each word in green, yellow, red or purple, with each color signifying how predictable the word is in the context of the words before it. Green means the word was very predictable; yellow, moderately predictable; red, not very predictable; and purple means the model wouldn't have predicted the word at all.
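The color scheme can be sketched as a mapping from a word's rank among the model's predictions to a color. The article does not give the cutoffs, so the values 10, 100 and 1,000 below are assumptions for illustration, as are the function names `rank_of` and `color_for_rank` and the made-up probabilities.

```python
def rank_of(word, probs):
    """Return the 1-based rank of `word` among the candidates in `probs`,
    sorted from most to least probable, or None if the word is absent."""
    if word not in probs:
        return None
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(word) + 1


def color_for_rank(rank, cutoffs=(10, 100, 1000)):
    """Map a prediction rank to a highlight color.

    The cutoffs are assumed values, not taken from the article.
    A rank of None means the model did not propose the word at all.
    """
    if rank is None:
        return "purple"
    for cutoff, color in zip(cutoffs, ("green", "yellow", "red")):
        if rank <= cutoff:
            return color
    return "purple"


# Hypothetical next-word probabilities after the word "I".
probs_after_i = {"have": 0.5, "am": 0.3, "was": 0.2}

print(color_for_rank(rank_of("have", probs_after_i)))
print(color_for_rank(rank_of("riverrun", probs_after_i)))
```

Under these assumptions, a passage in which nearly every word lands in the green bucket is the kind of "too predictable" text the tool is meant to surface.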

So a paragraph of text generated by GPT-2 will look like this:

[Image: the tool's color overlay on a paragraph of GPT-2-generated text. Credit: Harvard University]

To compare, this is a real New York Times article:

[Image: the tool's color overlay on a New York Times article]

And this is an excerpt from arguably the most unpredictable human text ever written, James Joyce's Finnegans Wake:

[Image: the tool's color overlay on an excerpt from Finnegans Wake]

The method isn't meant to replace humans in identifying fake texts but rather to support human intuition and understanding. The researchers tested the model with a group of undergraduates in a SEAS Computer Science class.

Without the model, the students could identify about 50 percent of AI-generated text. With the color overlay, the students were able to identify 72 percent.

Gehrmann and Strobelt say that with a little training and experience with the program, the number could improve even further.

"Our goal is to create human and AI collaboration systems," said Gehrmann. "This research is targeted at giving humans more information so that they can make an informed decision about what's real and what's fake."

Provided by Harvard University

Citation: Researchers develop a method to identify computer-generated text (2019, July 26) retrieved 16 August 2024 from https://techxplore.com/news/2019-07-method-computer-generated-text.html

