March 19, 2024 report

Apple's MM1: A multimodal large language model capable of interpreting both images and text data

by Bob Yirka, Tech Xplore

Left: Model ablations: what visual encoder to use, how to feed rich visual data, and how to connect the visual representation to the LLM. Right: Data ablations: type of data, and their mixture. Credit: *arXiv* (2024). DOI: 10.48550/arxiv.2403.09611

A team of computer scientists and engineers at Apple has developed a large language model (LLM) that the company claims can interpret both images and text data. The group has posted a paper to the arXiv preprint server describing their new MM1 family of multimodal models and test results.

Over the past year, LLMs have received a lot of press for their advanced AI capabilities. One company notably absent from the conversation is Apple. In this new effort, the research team makes it clear that the company is not interested in simply adding an LLM developed by another company (Apple is currently negotiating with Google to add Gemini AI tech to its devices); instead, the team has been working to develop a next-generation LLM, one that can interpret both images and text data.

Multimodal AI works by integrating and processing different types of data inputs, such as visual, auditory and textual information. This integration allows the AI to have a more comprehensive understanding of complex data, leading to more accurate and context-aware interpretations than single-mode AI systems.

Apple's research team claims they have made major advancements in using multimodal AI with their MM1 models, which integrate text and image data to improve capabilities in image captioning, visual question answering and query learning. MM1 is what they describe as a family of multimodal models, the largest of which has as many as 30 billion parameters.

Such models, the researchers note, make use of datasets comprising image-caption pairs, documents that interleave images and text, and text-only documents. The researchers further claim that their multimodal LLM (MLLM) can count objects, identify objects that are part of an image, and use common sense about everyday objects to offer users useful information about what the image presents.

The researchers also claim that their MLLM is capable of in-context learning, which means it does not need to start over every time a question is asked; it uses what it has learned in the current conversation. The team provides examples of the advanced capabilities of their models. In one, a user uploads an image of a group of friends at a bar holding a menu and asks the model how much it would cost to buy a beer for everyone, based on the prices listed in the menu.
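As a rough illustration of what in-context prompting with mixed images and text can look like, the Python sketch below assembles a few-shot prompt: a couple of worked image-and-answer demonstrations followed by a new image and question. Everything in it is hypothetical and is not taken from the MM1 paper or any Apple software: the `Segment` type, the `build_few_shot_prompt` and `render` helpers, the `<image:...>` placeholder and the file names are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One piece of an interleaved prompt: an image reference or a text span."""
    kind: str      # "image" or "text"
    content: str   # file name for images, raw string for text


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query_image: str,
    question: str,
) -> List[Segment]:
    """Interleave (image, answer) demonstrations, then append the new query."""
    prompt: List[Segment] = []
    for image, answer in examples:
        prompt.append(Segment("image", image))
        prompt.append(Segment("text", f"{question} {answer}"))
    # The final image/question pair is left unanswered for the model to complete.
    prompt.append(Segment("image", query_image))
    prompt.append(Segment("text", question))
    return prompt


def render(prompt: List[Segment]) -> str:
    """Flatten the prompt to a string, using a made-up image placeholder."""
    parts = [
        f"<image:{s.content}>" if s.kind == "image" else s.content
        for s in prompt
    ]
    return "\n".join(parts)


# Hypothetical file names; no real images are loaded.
demo = build_few_shot_prompt(
    examples=[("two_cats.jpg", "2"), ("three_dogs.jpg", "3")],
    query_image="bar_photo.jpg",
    question="Number of objects in the image:",
)
print(render(demo))
```

The point of the sketch is only that the demonstrations travel with the query in a single input, so a model that supports in-context learning can pick up the task from the examples without any retraining.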

More information: Brandon McKinzie et al, MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training, arXiv (2024). DOI: 10.48550/arxiv.2403.09611

Journal information: arXiv

Citation: Apple's MM1: A multimodal large language model capable of interpreting both images and text data (2024, March 19) retrieved 29 June 2024 from https://techxplore.com/news/2024-03-apple-mm1-multimodal-llm-capable.html

