July 7, 2023

Computer scientists release guidelines for evaluating AI-generated text

The public release of AI text generators, such as ChatGPT, has caused an enormous stir among both those who herald the technology as a great leap forward in communication and those who prophesy its dire effects. However, AI-generated text is notoriously buggy, and human evaluation remains the gold standard for ensuring accuracy, especially in applications such as generating long-form summaries of complex texts. And yet there are no accepted standards for human evaluation of long-form summaries, which means that even the gold standard is suspect.

To rectify this shortcoming, a team of computer scientists led by Kalpesh Krishna, a graduate student in the Manning College of Information and Computer Sciences at UMass Amherst, has just released a set of guidelines called LongEval. The guidelines were presented at the European Chapter of the Association for Computational Linguistics, where they were awarded the Outstanding Paper prize.

"There is currently no reliable way to evaluate long-form generated text without humans, and even current human evaluation protocols are expensive, time-consuming and highly variant," says Krishna, who began this research during an internship at the Allen Institute for AI. "A suitable human evaluation framework is critical to build more accurate long-form text-generation algorithms."

Krishna and his team, including Mohit Iyyer, assistant professor of computer science at UMass Amherst, combed through 162 papers on long-form summarization to understand how human evaluation works. In doing so, they discovered that 73% of the papers did not perform human evaluation on long-form summaries at all. The remaining papers used widely divergent evaluation practices.

"This lack of standards is problematic because it hampers reproducibility and does not allow for meaningful comparison between different systems," Iyyer says.

To further the goal of efficient, reproducible and standardized protocols for human evaluation of AI-generated summaries, Krishna and his co-authors developed three comprehensive recommendations covering how and what an evaluator should read in order to judge a summary's reliability.

"With LongEval, I am very excited about the prospect of being able to accurately and quickly evaluate long-form text generation algorithms with humans," says Krishna. "We have made LongEval very easy to use and released it as a Python library. I am excited to see how the research community builds upon it and uses LongEval in their research."

The research is published on the arXiv preprint server.

More information: Kalpesh Krishna et al, LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization, arXiv (2023). DOI: 10.48550/arxiv.2301.13298

Journal information: arXiv

Provided by University of Massachusetts Amherst

Citation: Computer scientists release guidelines for evaluating AI-generated text (2023, July 7) retrieved 17 July 2024 from https://techxplore.com/news/2023-07-scientists-guidelines-ai-generated-text.html
