
New technique can automate data curation for self-supervised pre-training of AI datasets

An overview of the data curation pipeline. A large data pool often exhibits a long-tailed distribution of concepts. We apply hierarchical k-means to obtain clusters that spread uniformly over the concepts. Data points are then sampled from the clusters to form a curated dataset with a better balance of concepts. Credit: arXiv (2024). DOI: 10.48550/arxiv.2405.15613

A team of computer scientists and AI researchers from FAIR at Meta, INRIA, Université Paris Saclay and Google has developed a method for automating data curation for self-supervised pre-training of AI datasets.

In a paper posted on the arXiv preprint server, the group describes their development process, the technique itself, and how well it has performed in testing so far.

As developers and users alike have learned over the past year, the quality of the data used to train AI systems is closely tied to the accuracy of their results. Currently, the best results come from systems trained on manually curated data, and the worst from systems trained on uncurated data.

Unfortunately, manually curating data takes a great deal of time and effort, so computer scientists have been looking for ways to automate the process. In this new study, the research team developed a technique that does just that, producing results on a par with manual curation.

The new technique starts with a large dataset, and then carries out a three-step process that results in data that is both more diverse and more balanced.

The first step uses a feature-extraction model to compute high-quality embeddings. In their approach, the embeddings are vectors of numbers that represent features of different types of data, such as text, audio or images.
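
To make the first step concrete, here is a minimal sketch that computes image embeddings with a pretrained feature extractor. The specific model (a torchvision ResNet-50) and the file names are stand-ins chosen for illustration; the paper relies on self-supervised image features, and the same idea applies to text or audio encoders.

```python
# Step 1 (illustrative sketch): turn raw images into embedding vectors.
# A torchvision ResNet-50 is used here purely as a stand-in feature extractor.
import torch
import torchvision
from PIL import Image

weights = torchvision.models.ResNet50_Weights.DEFAULT
model = torchvision.models.resnet50(weights=weights)
model.fc = torch.nn.Identity()            # drop the classifier head, keep 2048-d features
model.eval()
preprocess = weights.transforms()         # the resize/crop/normalize pipeline for this model

@torch.no_grad()
def embed(paths):
    """Return an (N, 2048) tensor of L2-normalized embeddings for image files."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    features = model(batch)
    return torch.nn.functional.normalize(features, dim=1)

# embeddings = embed(["img_001.jpg", "img_002.jpg"])   # hypothetical file names
```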

The second step applies successive rounds of k-means clustering, in which data points are assigned to groups based on their similarity to other data points.
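
A single round of that clustering can be sketched with scikit-learn's KMeans; the number of clusters below is an illustrative choice, not a value from the paper.

```python
# Step 2 (illustrative sketch): group embedding vectors with k-means.
import numpy as np
from sklearn.cluster import KMeans

def cluster(embeddings: np.ndarray, n_clusters: int = 1000):
    """Assign each embedding to one of n_clusters groups and return the centroids."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(embeddings)   # cluster index for every data point
    return labels, km.cluster_centers_
```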

The third step uses multi-step hierarchical k-means clustering to ensure that the resulting clusters are balanced, which is achieved by building trees of data clusters in a bottom-up fashion.
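
The sketch below illustrates the idea behind this third step under simplifying assumptions: the centroids produced at one level are themselves clustered to build the tree bottom-up, and the curated set then draws roughly equally from each top-level cluster so that frequent concepts no longer dominate. The level sizes, sampling budget and helper names are hypothetical, and this is a simplified stand-in for, not a reproduction of, the authors' algorithm.

```python
# Step 3 (illustrative sketch): hierarchical k-means plus balanced sampling.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(embeddings, levels=(1000, 100, 10), seed=0):
    """Return each point's cluster label at the top (coarsest) level of the tree."""
    labels = np.arange(len(embeddings))   # start with every point as its own "cluster"
    centers = embeddings
    for k in levels:                      # fine-grained first, then coarser levels
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(centers)
        labels = km.labels_[labels]       # re-map every point to the coarser level
        centers = km.cluster_centers_     # the next level clusters these centroids
    return labels

def balanced_sample(labels, per_cluster=100, seed=0):
    """Pick up to per_cluster point indices from each top-level cluster."""
    rng = np.random.default_rng(seed)
    picked = [rng.choice(np.flatnonzero(labels == c),
                         size=min(per_cluster, np.sum(labels == c)),
                         replace=False)
              for c in np.unique(labels)]
    return np.concatenate(picked)
```

In this simplified form, feeding the embeddings from step one into hierarchical_kmeans and passing the resulting labels to balanced_sample yields the indices of a smaller, more concept-balanced subset of the original data pool.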

The research team tested their technique using vision models that had been trained on various types of datasets. They found that models using their technique outperformed those using uncurated data and were as good as or sometimes better than those trained on data that was curated manually.

More testing will have to be done to find out how well their technique works on real-world data and different kinds of AI systems.

More information: Huy V. Vo et al, Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach, arXiv (2024). DOI: 10.48550/arxiv.2405.15613

Journal information: arXiv

© 2024 Science X Network

