An AI system for editing music in videos

Amateur and professional musicians alike may spend hours pouring over YouTube clips to figure out exactly how to play certain parts of their favorite songs. But what if there were a way to play a video and isolate the only instrument you wanted to hear?

That's the outcome of a new AI project out of MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL): a deep-learning system that can look at a video of a musical performance, and isolate the sounds of specific instruments and make them louder or softer.

The system, which is "self-supervised," doesn't require any human annotations on what the instruments are or what they sound like.

Trained on over 60 hours of videos, the "PixelPlayer" system can view a never-before-seen musical performance, identify specific instruments at pixel level, and extract the sounds that are associated with those instruments.

For example, it can take a video of a tuba and a trumpet playing the "Super Mario Brothers" theme song, and separate out the soundwaves associated with each instrument.

The researchers say that the ability to change the volume of individual instruments means that in the future, systems like this could potentially help engineers improve the audio quality of old concert footage. You could even imagine producers taking specific instrument parts and previewing what they would sound like with other instruments (i.e. an electric guitar swapped in for an acoustic one).

In a new paper, the team demonstrated that PixelPlayer can identify the sounds of more than 20 commonly seen instruments. Lead author Hang Zhao says that the system would be able to identify many more instruments if it had more training data, though it still may have trouble handling subtle differences between subclasses of instruments (such as an alto sax versus a tenor).

Previous efforts to separate the sources of sound have focused exclusively on audio, which often requires extensive human labeling. In contrast, PixelPlayer introduces the element of vision, which the researchers say makes human labels unnecessary, as vision provides self-supervision.

The system first locates the image regions that produce sounds, and then separates the input sounds into a set of components that represent the sound from each pixel.

"We expected a best-case scenario where we could recognize which instruments make which kinds of sounds," says Zhao, a Ph.D. student at CSAIL. "We were surprised that we could actually spatially locate the instruments at the pixel level. Being able to do that opens up a lot of possibilities, like being able to edit the audio of individual instruments by a single click on the video."

PixelPlayer uses methods of "deep learning," meaning that it finds patterns in data using so-called "neural networks" that have been trained on existing videos. Specifically, one neural network analyzes the visuals of the video, one analyzes the audio, and a third "synthesizer" associates specific pixels with specific soundwaves to separate the different sounds.

The fact that PixelPlayer uses so-called "self-supervised" deep learning means that the MIT team doesn't explicitly understand every aspect of how it learns which instruments make which sounds.

However, Zhao says that he can tell that the system seems to recognize actual elements of the music. For example, certain harmonic frequencies seem to correlate to instruments like violin, while quick pulse-like patterns correspond to instruments like the xylophone.

Zhao says that a system like PixelPlayer could even be used on robots to better understand the environmental sounds that other objects make, such as animals or vehicles.

Provided by Massachusetts Institute of Technology

This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and teaching.

An AI system for editing music in videos

Translating instruments, styles, genres at Facebook Artificial Intelligence Research

For more open and equitable public discussions on social media, try 'meronymity'

Researchers develop energy-efficient probabilistic computer by combining CMOS with stochastic nanomagnet

New computer vision tool can count damaged buildings in crisis zones and accurately estimate bird flock sizes

Game theory research shows AI can evolve into more selfish or cooperative personalities

Proof-of-principle demonstration of 3D magnetic recording could lead to enhanced hard disk drives

Tech companies want to build artificial general intelligence. But who decides when AGI is attained?

Computer scientists show the way: AI models need not be so power hungry

Researchers develop sodium battery capable of rapid charging in just a few seconds

Greater access to clean water, thanks to a better membrane

Silent flight edges closer to take off, according to new research

A flexible and efficient DC power converter for sustainable-energy microgrids

Microsoft's AI app VASA-1 makes photographs talk and sing with believable facial expressions

To build a better AI helper, start by modeling the irrational behavior of humans

Versatile fibers offer improved energy storage capacity for wearable devices

Harnessing solar energy for high-efficiency NH₃ production

A dexterous four-legged robot that can walk and handle objects simultaneously

Climate change will increase value of residential rooftop solar panels across US, study finds

Bitcoin's next 'halving' is right around the corner. Here's what you need to know

An AI system for editing music in videos

Let us know if there is a problem with our content

Thank you for taking time to provide your feedback to the editors

Share article

E-MAIL THE STORY