November 18, 2014 weblog

Image descriptions from computers show gains

by Nancy Owano , Tech Xplore

"Man in black shirt is playing guitar." "Man in blue wetsuit is surfing on wave." "Black and white dog jumps over bar." The picture captions were not written by humans but through software capable of accurately describing what is going on in images. At Stanford University, they have been working on Multimodal Recurrent Neural Architecture which generates sentence descriptions from images.

"I consider the pixel data in images and video to be the dark matter of the Internet," said Fei-Fei Li, associate professor and director of the Stanford Artificial Intelligence Laboratory, in The New York Times on Monday. She led research with Andrej Karpathy, a graduate student. "We are now starting to illuminate it." What is more, they are working on visual-semantic alignments between text and visual data. "Our alignment model learns to associate images and snippets of text." They provide examples of inferred alignments on their Stanford site. "For each image, the model retrieves the most compatible sentence and grounds its pieces in the image. We show the grounding as a line to the center of the corresponding bounding box. Each box has a single but arbitrary color."

Their technical report, "Deep Visual-Semantic Alignments for Generating Image Descriptions," takes us through their reasons for trying to improve on visual recognition. They wrote, "A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene. However, this remarkable ability has proven to be an elusive task for our visual recognition models. The majority of previous work in visual recognition has focused on labeling images with a fixed set of visual categories, and great progress has been achieved in these endeavors. However, while closed vocabularies of visual concepts constitute a convenient modeling assumption, they are vastly restrictive when compared to the enormous amount of rich descriptions that a human can compose."

They have taken on the challenge of trying to design a model rich enough to reason simultaneously about contents of images and their representation in the domain of natural language. "We first describe neural networks that map words and image regions into a common, multimodal embedding. Then we introduce our novel objective, which learns the embedding representations so that semantically similar concepts across the two modalities occupy nearby regions of the space."

Meanwhile, researchers at Google are making their own strides as well. In today's image-centric digital marketplace, architects cannot afford to live by images alone. "A picture may be worth a thousand words, but sometimes it's the words that are most useful—so it's important we figure out ways to translate from images to words automatically and accurately," blogged Google Research Scientists Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan on Monday.

"People can summarize a complex scene in a few words without thinking twice. It's much more difficult for computers."

They said that they have developed a machine learning system that can automatically produce captions. "This kind of system could eventually help visually impaired people understand pictures, provide alternate text for images in parts of the world where mobile connections are slow, and make it easier for everyone to search on Google for images."

Writing in The New York Times on Monday, John Markoff noted another advantage: "The advances may make it possible to better catalog and search for the billions of images and hours of video available online, which are often poorly described and archived."

More information: googleresearch.blogspot.com/20 … ousand-coherent.html
cs.stanford.edu/people/karpath … gesent/devisagen.pdf

Citation: Image descriptions from computers show gains (2014, November 18) retrieved 17 July 2024 from https://techxplore.com/news/2014-11-image-descriptions-gains.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.

Explore further

Google team rises to 2014 visual recognition challenge

9 shares

Feedback to editors

Flexible electronics researchers develop a completely stretchy lithium-ion battery

4 minutes ago

A strategy to enhance the stability of perovskite solar cells under reverse bias conditions

1 hour ago

Engineers evaluate cybersecurity risks associated with EV fast-charging equipment

16 hours ago

Machine learning framework maps global rooftop growth for sustainable energy and urban planning

18 hours ago

Giving drones wrap-and-grip wings to allow them to land on poles and tree limbs

20 hours ago

Large language models make human-like reasoning mistakes, researchers find

21 hours ago

Unveiling a new class of synthetic fuels

21 hours ago

Microsoft unveils software that allows LLMs to work with spreadsheets

21 hours ago

New technique to assess a general-purpose AI model's reliability before it's deployed

22 hours ago

New system enables intuitive teleoperation of a robotic manipulator in real-time

Jul 16, 2024

Load comments (0)

Image descriptions from computers show gains

Flexible electronics researchers develop a completely stretchy lithium-ion battery

A strategy to enhance the stability of perovskite solar cells under reverse bias conditions

Engineers evaluate cybersecurity risks associated with EV fast-charging equipment

Machine learning framework maps global rooftop growth for sustainable energy and urban planning

Giving drones wrap-and-grip wings to allow them to land on poles and tree limbs

Large language models make human-like reasoning mistakes, researchers find

Unveiling a new class of synthetic fuels

Microsoft unveils software that allows LLMs to work with spreadsheets

New technique to assess a general-purpose AI model's reliability before it's deployed

New system enables intuitive teleoperation of a robotic manipulator in real-time

Google team rises to 2014 visual recognition challenge

Can cartoons be used to teach machines to understand the visual world?

Researchers create first image-recognition software that greatly improves web searches

Understanding ambiguities in depth perception

New computer program aims to teach itself everything about anything

Automated image analysis arises from handcraft and machine learning

You're just a stick figure to this camera—a new camera to prevent companies from collecting private information

Visual abilities of language models found to be lacking depth

Reasoning skills of large language models are often overestimated, researchers find

A new model to plan and control the movements of humanoids in 3D environments

Researchers introduce generative AI to analyze complex tabular data

Computer scientists develop new and improved camera inspired by the human eye

Phys.org

Medical Xpress

Science X

Image descriptions from computers show gains

Flexible electronics researchers develop a completely stretchy lithium-ion battery

A strategy to enhance the stability of perovskite solar cells under reverse bias conditions

Engineers evaluate cybersecurity risks associated with EV fast-charging equipment

Machine learning framework maps global rooftop growth for sustainable energy and urban planning

Giving drones wrap-and-grip wings to allow them to land on poles and tree limbs

Large language models make human-like reasoning mistakes, researchers find

Unveiling a new class of synthetic fuels

Microsoft unveils software that allows LLMs to work with spreadsheets

New technique to assess a general-purpose AI model's reliability before it's deployed

New system enables intuitive teleoperation of a robotic manipulator in real-time

Related Stories

Google team rises to 2014 visual recognition challenge

Can cartoons be used to teach machines to understand the visual world?

Researchers create first image-recognition software that greatly improves web searches

Understanding ambiguities in depth perception

New computer program aims to teach itself everything about anything

Automated image analysis arises from handcraft and machine learning

Recommended for you

You're just a stick figure to this camera—a new camera to prevent companies from collecting private information

Visual abilities of language models found to be lacking depth

Reasoning skills of large language models are often overestimated, researchers find

A new model to plan and control the movements of humanoids in 3D environments

Researchers introduce generative AI to analyze complex tabular data

Computer scientists develop new and improved camera inspired by the human eye

Your Privacy