
New benchmarking tool evaluates the factuality of LLMs

Overview of WILDHALLUCINATIONS. Credit: arXiv (2024). DOI: 10.48550/arxiv.2407.17468

A team of AI researchers and computer scientists from Cornell University, the University of Washington and the Allen Institute for Artificial Intelligence has developed a benchmarking tool called WILDHALLUCINATIONS to evaluate the factuality of multiple large language models (LLMs). The group has published a paper describing the factors that went into creating their tool on the arXiv preprint server.

LLMs such as ChatGPT have become popular—people use them to write letters, poems, songs, and other text documents. But over time, their deficiencies have become quite clear: LLMs often make inaccurate statements. Such fabricated or inaccurate claims have come to be known as hallucinations.

The research team notes that the main reason LLMs hallucinate is the quality of the data used to train them—generally, massive amounts of text scraped from the internet. Models trained on specific, highly accurate datasets are therefore much more likely to produce accurate responses.

The researchers also noted that the makers of many LLMs claim that revised versions of their models hallucinate less often and are therefore more accurate. To date, however, users have had no way to verify whether such claims are true. For this new study, the team created a tool to help the user community evaluate some of the most popular LLMs for factual accuracy.

Called WILDHALLUCINATIONS, the benchmark prompts multiple LLMs with queries drawn from real user-chatbot conversations and then fact-checks their answers. Because many chatbot answers draw on information found on Wiki pages, the researchers also tracked how models performed on queries whose answers could be found on Wikipedia versus those that could not.
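To make that evaluation flow concrete, here is a minimal Python sketch of how such a pipeline could look, assuming a FActScore-style setup in which a model's answer about an entity is split into atomic claims and each claim is checked against retrieved reference text. All function and field names below are hypothetical illustrations, not the authors' actual implementation.

```python
# Hypothetical sketch of a WildHallucinations-style factuality check.
# Every callable passed in (generate, split_claims, retrieve, check) is an
# assumed placeholder for components the paper describes at a high level.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EntityScore:
    entity: str
    on_wikipedia: bool   # does the entity have a Wikipedia page?
    supported: int       # claims judged consistent with the retrieved evidence
    total: int           # all atomic claims extracted from the model's answer

    @property
    def factuality(self) -> float:
        # Fraction of the model's claims that the checker could support.
        return self.supported / self.total if self.total else 0.0


def score_entity(
    generate: Callable[[str], str],            # the LLM under evaluation
    split_claims: Callable[[str], List[str]],  # breaks an answer into atomic claims
    retrieve: Callable[[str], str],            # fetches reference text for the entity
    check: Callable[[str, str], bool],         # does the evidence support the claim?
    entity: str,
    on_wikipedia: bool,
) -> EntityScore:
    """Prompt the model about one entity and fact-check its answer claim by claim."""
    answer = generate(f"Tell me about {entity}.")
    claims = split_claims(answer)
    evidence = retrieve(entity)
    supported = sum(1 for claim in claims if check(claim, evidence))
    return EntityScore(entity, on_wikipedia, supported, len(claims))
```

Averaging such per-entity scores separately for entities that do and do not have Wikipedia coverage would yield the kind of comparison the article describes.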

To test their benchmarking tool, the researchers used it to evaluate several of the most popular LLMs, many of which had recently been updated. They found that LLM makers have not made much progress in improving accuracy. Most were no more accurate than their prior versions.

The team also discovered that most of the models did better when they could pull information from one or more Wiki pages. LLMs also did better on some subjects than others: they had trouble, for example, producing reliable information about celebrities, but were more reliable when asked certain types of science questions.

More information: Wenting Zhao et al, WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries, arXiv (2024). DOI: 10.48550/arxiv.2407.17468. arxiv.org/abs/2407.17468

Journal information: arXiv

© 2024 Science X Network

Citation: New benchmarking tool evaluates the factuality of LLMs (2024, August 21) retrieved 21 August 2024 from https://techxplore.com/news/2024-08-benchmarking-tool-factuality-llms.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
