March 30, 2021

Major machine learning datasets have tens of thousands of errors

by Adam Conner-Simons, Massachusetts Institute of Technology

Major ML datasets have tens of thousands of errors. Credit: MIT Computer Science & Artificial Intelligence Lab

It's well known that machine learning datasets have their fair share of errors, including mislabeled images. But there hasn't been much research to systematically quantify just how error-ridden they are.

Further, prior work has focused on errors in the training data of ML datasets. But test sets are what we use to benchmark the state of machine learning, and no prior study has looked at systematic error across the test sets we rely on to understand how well ML models work.

In a new paper, a team led by researchers at MIT's Computer Science and Artificial Intelligence Lab (CSAIL) looked at 10 major datasets, including ImageNet and Amazon's reviews dataset, that together have been cited over 100,000 times.

The researchers found a 3.4% average error rate across all datasets, including 6% for ImageNet, which is arguably the most widely used dataset for popular image recognition systems developed by the likes of Google and Facebook.

Even the seminal MNIST digits dataset, which has served as the bedrock of optical digit recognition for the past 20 years and has been benchmarked in tens of thousands of peer-reviewed ML publications, contains 15 (human-validated) label errors in the test set.

The team also created a demo that lets users peruse the different datasets to sample the different types of errors that occur, including:

mislabeled images, like one breed of dog being confused for another or a baby being confused for a nipple.
mislabeled text sentiment, like Amazon product reviews described as negative when they were actually positive.
mislabeled audio from YouTube videos, like an Ariana Grande high note being classified as a whistle.


Co-author Curtis Northcutt says that one surprise from their findings was that simpler models like ResNet-18 often had lower error rates than more complex models such as ResNet-50, depending on the prevalence of irrelevant data ("noise"). Northcutt recommends that ML practitioners consider using simpler models if their real-world dataset has a label error rate of 10%.
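To make the ranking effect concrete, here is a toy sketch in Python. All labels and predictions below are invented for illustration and do not come from the paper; the names complex_preds and simple_preds are hypothetical. It shows one way such a reversal could arise: a model that reproduces the mislabeled test examples scores higher on the original labels but lower once those labels are corrected.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical 10-example test set. Indices 2 and 5 are mislabeled in the
# original labels and fixed in the corrected labels.
original_labels  = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
corrected_labels = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# Hypothetical "complex" model: agrees with the two wrong labels and makes
# one genuine mistake (index 8).
complex_preds = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]

# Hypothetical "simple" model: predicts the true class on indices 2 and 5
# and makes one genuine mistake (index 9).
simple_preds = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]

for name, preds in [("complex", complex_preds), ("simple", simple_preds)]:
    print(name,
          "original:", accuracy(preds, original_labels),
          "corrected:", accuracy(preds, corrected_labels))
```

On the original labels the complex model looks better; on the corrected labels the order flips. With only 20% of the toy labels wrong, this is an exaggerated picture, not an estimate of any real benchmark.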

The team's results build upon a wealth of work done at MIT in creating "confident learning," a sub-field of machine learning that looks at datasets to find and quantify label noise. In this project, confident learning was used to algorithmically identify candidate label errors, which were then verified by humans.
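The core idea can be sketched in a few lines of plain Python. This is a simplified illustration and not the team's implementation: it assumes each example already has predicted class probabilities from some classifier (ideally computed out-of-sample, for instance by cross-validation), and the function names and toy data are hypothetical. Each class gets a threshold equal to the average confidence the model assigns to that class over examples carrying that label; an example is flagged when the model is confidently above threshold for a class other than its given label.

```python
def class_thresholds(labels, pred_probs, num_classes):
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no labeled examples gets threshold 1.0 so that
    # it is rarely treated as a confident alternative.
    return [sums[k] / counts[k] if counts[k] else 1.0
            for k in range(num_classes)]

def find_candidate_label_errors(labels, pred_probs):
    """Return (index, given_label, suggested_label) for each example whose
    most confident above-threshold class differs from its given label."""
    num_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    candidates = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(num_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            candidates.append((i, y, guess))
    return candidates

# Toy data (invented): example 2 is labeled 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]

print(find_candidate_label_errors(labels, pred_probs))
```

The flagged examples are only candidates. As in the study, a human still has to look at each one to decide whether the label is actually wrong.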

The team has also made it easy for other researchers to replicate their results and find label errors in their own datasets using cleanlab, an open-source Python package.

Provided by Massachusetts Institute of Technology

Citation: Major machine learning datasets have tens of thousands of errors (2021, March 30) retrieved 30 June 2024 from https://techxplore.com/news/2021-03-major-machine-datasets-tens-thousands.html

