Neural network CLIP mirrors human brain neurons in image recognition

OpenAI, the research company co-founded by Elon Musk, has discovered that its artificial neural network CLIP shows behavior strikingly similar to that of a human brain. The finding has scientists hopeful about the ability of future AI networks to identify images in a symbolic, conceptual and literal capacity.

The human brain processes visual imagery by correlating a series of abstract concepts to an overarching theme. The first biological neuron recorded to operate in this fashion was the "Halle Berry" neuron, which responded to photographs and sketches of the actress as well as to the written name "Halle Berry."

Now, OpenAI's multimodal vision system, which continues to outperform existing systems, shows comparable traits. One example is the "Spider-Man" neuron, an artificial neuron that can identify not only an image of the text "spider" but also the comic book character in both illustrated and live-action form. This ability to recognize a single concept represented in various contexts demonstrates CLIP's abstraction capabilities. As in a human brain, the capacity for abstraction allows a vision system to tie a series of images and text to a central theme.

One difference between biological and artificial neurons, however, lies in semantics versus visual stimuli. Whereas neurons in the brain connect a cluster of visual input to a single concept, CLIP's neurons respond to a cluster of related ideas. By examining exactly how systems such as CLIP identify images, researchers can potentially learn more about how human neurons recognize a vast array of common concepts, such as facial expressions, famous people, geographical regions and religious iconography. Likewise, by studying how CLIP forms its lexicon, scientists hope to uncover more similarities to the human brain.

Research teams examine CLIP along two lines: 1) feature visualization, which maximizes a neuron's firing by performing gradient-based optimization on the input image, and 2) dataset examples, which look at the distribution of the images in a dataset that most strongly activate a neuron. Thus far, the teams have discovered that many CLIP neurons seem to be multi-faceted, meaning that a single neuron responds to multiple distinct cases of a concept at a high level of abstraction.
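The two lines of analysis can be illustrated with a minimal sketch. It is not OpenAI's code and does not touch CLIP itself: the "neuron" here is a hypothetical toy unit (a ReLU over a dot product with a flattened image), and the function names, step count and learning rate are assumptions chosen only to show the idea. Real feature visualization runs on a deep network and typically adds regularization to keep the optimized image interpretable.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def activation(weights, image):
    """Toy neuron (assumption): ReLU of the dot product with a flattened image."""
    return max(0.0, dot(weights, image))

def feature_visualization(weights, steps=100, lr=0.1, seed=0):
    """Gradient ascent on the input: start from noise and repeatedly nudge the
    image in the direction that raises the neuron's pre-activation."""
    rng = random.Random(seed)
    image = normalize([rng.gauss(0.0, 1.0) for _ in weights])
    for _ in range(steps):
        # For this linear toy unit, d(weights . image)/d(image) is just `weights`.
        # The pre-activation is used because the ReLU gradient is zero when inactive.
        image = normalize([x + lr * w for x, w in zip(image, weights)])
    return image  # unit-norm input that the neuron responds to most strongly

def top_activating_examples(weights, dataset, k=3):
    """Dataset examples: rank existing images by how strongly the neuron fires."""
    scored = [(activation(weights, img), i) for i, img in enumerate(dataset)]
    scored.sort(reverse=True)
    return scored[:k]  # list of (activation, dataset index), strongest first

if __name__ == "__main__":
    rng = random.Random(1)
    weights = [rng.gauss(0.0, 1.0) for _ in range(16)]
    dataset = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(50)]
    optimized = feature_visualization(weights)
    print("activation of optimized input:", activation(weights, optimized))
    print("top dataset examples:", top_activating_examples(weights, dataset))
```

The first function synthesizes an input, while the second only ranks inputs that already exist, which is why the two views complement each other when interpreting what a neuron has learned.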

As a recognition system, CLIP also exhibits various forms of bias. For example, researchers observed a "Middle East" neuron associated with terrorism and an "immigration" neuron that responds to input involving Latin America.

As for the limitations of these findings and the room for further research, scientists acknowledge that, despite CLIP's finesse in locating geographical regions, individual cities and even landmarks, the system does not appear to have a distinct "San Francisco" neuron that ties a landmark such as Twin Peaks to the identifier San Francisco.

More information: Goh, G., et al. "Multimodal Neurons in Artificial Neural Networks." OpenAI, 4 Mar. 2021, openai.com/blog/multimodal-neurons/

Goh, G., et al. "Multimodal Neurons in Artificial Neural Networks." Distill, 4 Mar. 2021, distill.pub/2021/multimodal-neurons/