May 31, 2024

Building computer vision in the kitchen

by Alvin Lee, Singapore Management University

Imagine watching a pizza chef going about his work in a kitchen. You see him: weigh flour before adding water and yeast to it; knead the mixture into a dough; leave it to rise while he slices pepperoni and other toppings; stretch out the dough before assembling the pizza and sliding it into an oven.

While most people are unable to fluently execute the steps of pizza-making like an experienced chef, they can see and identify what was done. One could see the chef opening the flour sack and digging into it with a flour scoop, taking the pepperoni out of the fridge and putting it over the slicer repeatedly, or grating cheese with a box grater. At the end of it all, people understand that flour becomes dough, which in turn becomes pizza.

Can computer vision software make the same connection?

Annotating for success

For SMU Assistant Professor of Computer Science Zhu Bin, the answer lies in VISOR (VIdeo Segmentations and Object Relations), a dataset that he and his collaborators have been working on.

By outlining certain objects such as hands, knives, flour scoops, graters, etc. and assigning identifying labels to them on first-person videos—also called egocentric videos—VISOR aims to: better identify separate objects; understand how hands and objects interact; achieve better reasoning and understanding of object transformation, such as flour becoming dough or a potato turning into fries.

This process of outlining and labeling objects is known as "annotation," and it can be achieved either via a "sparse mask" or a "dense mask."

"Sparse masks are annotations applied to select key frames within a video rather than every frame," explains Professor Zhu.

"These masks are curated to outline objects at significant moments or intervals in the video sequence. Dense masks are detailed, continuous pixel-level annotations that cover every frame in a segment of a video. In VISOR, these are often generated through interpolation between sparse masks, using computer vision algorithms to fill in the gaps.

"Sparse masks are very useful for fine-grained egocentric video understanding, such as action recognition, e.g., 'chop potato,' and object state change. In contrast, dense annotations enable analysis of how objects are manipulated over time, providing insights into human-object interactions that sparse annotations alone could miss."

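To make the distinction concrete, here is a minimal Python sketch of turning sparse keyframe masks into one mask per frame. It is an illustration only: Professor Zhu describes dense masks as being interpolated between sparse masks by computer vision algorithms, whereas this sketch simply copies the nearest annotated keyframe. The function name `densify` and the representation of a mask as a set of (row, col) pixels are assumptions made for the example.

```python
from bisect import bisect_left


def densify(sparse, num_frames):
    """Return one mask per frame by copying the nearest annotated keyframe.

    `sparse` maps a keyframe index to a mask, here a frozenset of
    (row, col) pixels. This is a naive stand-in for real interpolation,
    not the method used to build VISOR.
    """
    keyframes = sorted(sparse)
    if not keyframes:
        raise ValueError("need at least one annotated keyframe")
    dense = []
    for frame in range(num_frames):
        i = bisect_left(keyframes, frame)
        # Compare the keyframe just before and just after this frame.
        candidates = keyframes[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda k: abs(k - frame))
        dense.append(sparse[nearest])
    return dense


# Two hand-annotated keyframes (frames 0 and 4) in a six-frame clip.
sparse_masks = {0: frozenset({(0, 0)}), 4: frozenset({(1, 1)})}
dense_masks = densify(sparse_masks, 6)
print(len(dense_masks))
```

The sparse input holds two masks; the dense output holds one for each of the six frames, which is what makes frame-by-frame analysis of an interaction possible.
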
VISOR features over 10 million dense masks in 2.8 million images, and each annotated item has a mask that is assigned an entity class ("knife," "fork," "plate," "cupboard," "onion," "egg," etc.) and a macro-category ("cutlery," "appliance," "container," "vegetable," etc.). For instance, the entity classes "knife" and "fork" are classified into the macro-category "cutlery." All in all, VISOR features 1,477 labeled entities that identify and annotate many kitchen objects.
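
The two-level labeling can be pictured as a simple lookup from entity class to macro-category. In the sketch below, only the pairing of "knife" and "fork" with "cutlery" comes from the description above; the "onion" entry, the "unknown" fallback and the names `MACRO_CATEGORY` and `macro_counts` are assumptions for illustration, not VISOR's actual schema.

```python
from collections import Counter

# Entity class -> macro-category. Only knife/fork -> cutlery is stated
# in the article; the onion entry is an assumed example.
MACRO_CATEGORY = {
    "knife": "cutlery",
    "fork": "cutlery",
    "onion": "vegetable",  # assumption
}


def macro_counts(entities):
    """Count annotated entities per macro-category."""
    return Counter(MACRO_CATEGORY.get(e, "unknown") for e in entities)


counts = macro_counts(["knife", "fork", "onion", "knife"])
print(counts["cutlery"], counts["vegetable"])  # prints: 3 1
```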

Beyond identifying objects and annotating how items and human hands interact, VISOR also proposes a task called "Where did this come from?" In the case of the pizza chef, flour would be identified as coming from the flour sack. VISOR annotations cover videos with an average duration of 12 minutes, which is significantly longer than in most existing datasets. This allows for in-depth analysis of, and reasoning about, object states over extended periods, facilitating studies of sustained interactions and changes.
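
As a rough picture of what answering "Where did this come from?" involves, the Python sketch below follows recorded "came from" links back to a source. The flour and pepperoni examples are taken from the pizza scenario in this article; the dictionary layout and the function name `trace_origin` are assumptions for illustration and do not reflect how VISOR stores or evaluates this task.

```python
def trace_origin(item, came_from):
    """Follow 'came from' links until an item has no recorded source."""
    chain = [item]
    seen = {item}
    while chain[-1] in came_from:
        source = came_from[chain[-1]]
        if source in seen:  # guard against cyclic records
            break
        chain.append(source)
        seen.add(source)
    return chain


# Hypothetical records for the pizza example.
came_from = {"flour": "flour sack", "pepperoni": "fridge"}
print(trace_origin("flour", came_from))  # prints: ['flour', 'flour sack']
```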

Obstacles and future uses

Many other datasets, such as UVO (Unidentified Video Objects), focus on third-person perspectives. VISOR instead uses egocentric videos from the EPIC-KITCHENS dataset, which presents extra challenges. Egocentric videos are dynamic by nature: objects often get blocked when hands move over them, and items transform, as in the flour-to-dough-to-pizza example.

VISOR aims to overcome the obstacles in the following ways:

Fine-grained egocentric video understanding: The object masks provided by VISOR clarify the boundaries of objects even through significant transformations. This precision enables the development of advanced deep models for analyzing fine-grained interactions and transformations within videos, such as egocentric action recognition and object state analysis.
Enhancing interaction understanding: The detailed annotations of how hands interact with various objects help in studying and modeling human behavior, particularly in naturalistic settings like kitchens.
Long-term video understanding: With continuous annotations across actions and transformations of objects (like an onion being peeled and cooked), VISOR supports research into long-term reasoning in videos, such as long-term object tracking.

"As the technology matures and technical challenges such as real-time processing are addressed, technology such as VISOR can be used to develop assistive technologies that help individuals with disabilities, or the elderly navigate and manage real-world tasks more independently," Professor Zhu tells the Office of Research.

"Robots equipped with the capability to understand complex object interactions and predict future actions can be employed in various activities, such as cooking, cleaning and manufacturing."

He adds, "Egocentric video understanding can also be used to develop virtual reality (VR)- or augmented reality (AR)-based training and educational tools, providing step-by-step guidance from the first-person view."

Provided by Singapore Management University

Citation: Building computer vision in the kitchen (2024, May 31) retrieved 16 August 2024 from https://techxplore.com/news/2024-05-vision-kitchen.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
