Clustering Resolves Bottlenecks Plaguing Vision Transformer AI

RALEIGH, N.C., June 19, 2023 -- Vision transformers (ViTs), AI models that identify or categorize objects in images, have numerous applications in day-to-day life. For example, a ViT could be used to pick out the cars in an image that also includes pedestrians, or vice versa.

However, ViTs face challenges. ViTs are a type of “transformer model,” a highly complex class of AI technology. Relative to the amount of data fed into the model, transformer models require a significant amount of computational power and use a large amount of memory. This is particularly problematic for ViTs compared to non-vision transformer AI systems, since images contain so much data.

Further, it is difficult for users to understand exactly how ViTs make decisions. Depending on the application, understanding the ViT’s decision-making process, also known as its model interpretability, can be very important.

Researchers at North Carolina State University (NCSU) have developed a ViT methodology called patch-to-cluster attention (PaCa) that addresses both of these challenges. The developers deployed clustering techniques to combat both computational and memory demands, as well as issues associated with interpretability.

According to Tianfu Wu, corresponding author of a paper on the work and an associate professor of electrical and computer engineering at NCSU, clustering occurs when the AI lumps together sections of the image in question, based on similarities it finds in the image data. This, Wu said, significantly reduces computational demands on the system.

“Before clustering, computational demands for a ViT are quadratic,” Wu said. “For example, if the system breaks an image down into 100 smaller units, it would need to compare all 100 units to each other, which would be 10,000 complex functions.”

Clustering, on the other hand, Wu said, makes this process linear: each smaller unit must be compared only to a predetermined number of clusters to obtain the same outcome. For example, if a user tells the system to establish 10 clusters, the 100 units from the example above require only 1,000 complex functions, Wu said.
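Wu's arithmetic can be checked with a few lines of Python. The sketch below is illustrative only and is not taken from the PaCa code: the function name `attention_comparisons` and its parameters are hypothetical, and it counts a single comparison per pair of units (or per unit and cluster), ignoring the cost of forming the clusters themselves.

```python
from typing import Optional


def attention_comparisons(num_units: int, num_clusters: Optional[int] = None) -> int:
    """Count the pairwise comparisons in one attention step.

    Hypothetical helper for illustration only.
    - num_clusters is None: every unit is compared with every unit
      (cost grows with the square of num_units).
    - num_clusters is set: every unit is compared with each cluster
      (cost grows linearly with num_units for a fixed cluster count).
    """
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    if num_clusters is not None and num_clusters <= 0:
        raise ValueError("num_clusters must be positive")
    targets = num_units if num_clusters is None else num_clusters
    return num_units * targets


if __name__ == "__main__":
    # The article's example: 100 units, with and without 10 clusters.
    print(attention_comparisons(100))       # 100 * 100 = 10000
    print(attention_comparisons(100, 10))   # 100 * 10 = 1000

    # Quadrupling the number of units multiplies the unclustered cost by 16,
    # but the clustered cost by only 4.
    print(attention_comparisons(400) // attention_comparisons(100))
    print(attention_comparisons(400, 10) // attention_comparisons(100, 10))
```

The gap widens as images get larger: with the cluster count held fixed, adding units increases the clustered cost in proportion, while the unclustered cost grows with the square of the unit count.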

NCSU researchers developed a methodology that uses clustering techniques to overcome two distinct hindrances to vision transformers (ViTs), AI models that identify or categorize objects in images. The approach cuts compute and memory demands and makes the ViTs' decision-making mechanism easier for human operators to understand. Courtesy of Ion Fet.

Clustering also enabled the researchers to address model interpretability. According to Wu, the researchers could examine how the model created the clusters in the first place. Specifically, they could decipher which features the model considered important when lumping these sections of data together.

“Because the AI is only creating a small number of clusters, we can look at those pretty easily,” Wu said.

In tests, the researchers compared PaCa to two ViTs called SWin and PVT. They said that PaCa outperformed both in every respect: classifying objects in images, identifying objects in images, and segmentation. It also performed those tasks more quickly than the other ViTs, the researchers said.

The researchers plan to scale up PaCa by training on larger, foundational data sets.

The work was done with support from the Office of the Director of National Intelligence, the U.S. Army Research Office, and the National Science Foundation.

The research will be presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Published: June 2023

