Clustering Resolves Bottlenecks Plaguing Vision Transformer AI

Vision transformers (ViTs) are AI models that identify or categorize objects within images, and they hold numerous applications in day-to-day life. For example, these powerful technologies could be used to pick out the cars in an image that also includes pedestrians, or vice versa.

However, ViTs face challenges. As transformer models, a highly complex class of AI, they require a significant amount of computational power and memory relative to the amount of data fed into them. This is particularly problematic for ViTs compared to non-vision transformer systems, since images contain so much data.

Further, it is difficult for users to understand exactly how ViTs make decisions. Depending on the application, understanding the ViT’s decision-making process, also known as its model interpretability, can be very important.

Researchers at North Carolina State University (NCSU) have developed a ViT methodology called patch-to-cluster attention (PaCa) that addresses both of these challenges. The developers deployed clustering techniques to combat both computational and memory demands, as well as issues associated with interpretability.

According to Tianfu Wu, corresponding author of a paper on the work and an associate professor of electrical and computer engineering at NCSU, clustering occurs when the AI lumps together sections of the image in question, based on similarities it finds in the image data. This, Wu said, significantly reduces computational demands on the system.

“Before clustering, computational demands for a ViT are quadratic,” Wu said. “For example, if the system breaks an image down into 100 smaller units, it would need to compare all 100 units to each other, which would be 10,000 complex functions.”

Clustering, on the other hand, makes this process linear, Wu said; each smaller unit needs to be compared only against a predetermined number of clusters to obtain the same outcome. For example, if a user tells the system to establish 10 clusters, this equates to only 1000 complex functions, Wu said.
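The complexity argument above can be illustrated with a minimal sketch of patch-to-cluster attention. This is not the authors' PaCa implementation; it is a toy NumPy version in which a random projection stands in for the learned cluster assignment, showing how attending to M cluster summaries instead of all N patches shrinks the score matrix from N x N to N x M:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_to_cluster_attention(patches, num_clusters=10, seed=0):
    """Toy patch-to-cluster attention (illustrative, not PaCa itself).

    Standard self-attention compares all N patches to each other,
    producing an N x N score matrix. Here each patch attends only to
    num_clusters cluster summaries, so the score matrix is N x M --
    linear in N for a fixed number of clusters M.
    """
    rng = np.random.default_rng(seed)
    n, d = patches.shape
    # Soft-assign patches to M clusters; a random projection stands in
    # for the learned assignment used in a real model.
    assign = softmax(patches @ rng.standard_normal((d, num_clusters)), axis=1)  # (N, M)
    # Cluster summaries: assignment-weighted averages of the patches.
    clusters = assign.T @ patches / (assign.sum(axis=0)[:, None] + 1e-9)        # (M, d)
    # Attention scores: each patch vs. each cluster -> (N, M), not (N, N).
    scores = softmax(patches @ clusters.T / np.sqrt(d), axis=1)
    return scores @ clusters  # (N, d) attended output

# 100 patches of dimension 64, as in Wu's example: the score matrix has
# 100 x 10 = 1000 entries instead of 100 x 100 = 10,000.
patches = np.random.default_rng(1).standard_normal((100, 64))
out = patch_to_cluster_attention(patches, num_clusters=10)
```

Because the cluster summaries are few, they can also be inspected directly, which is the interpretability benefit discussed below.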

NCSU researchers developed a methodology that uses clustering techniques to overcome two distinct hindrances to vision transformers (ViTs), AI models that identify or categorize objects within images. The approach cut down on compute and memory demands and made ViTs' decision-making mechanism more transparent to human operators. Courtesy of Ion Fet.
Clustering also enabled the researchers to address model interpretability. According to Wu, the researchers could examine how the model created the clusters in the first place and decipher which features it considered important when lumping those sections of data together.

“Because the AI is only creating a small number of clusters, we can look at those pretty easily,” Wu said.

In tests, the researchers compared PaCa to two other ViTs, Swin and PVT. PaCa outperformed both at classifying objects in images, at identifying objects in images, and at segmentation, and it performed those tasks more quickly than the other ViTs, the researchers said.

The researchers plan to scale up PaCa by training on larger, foundational data sets.

The work was done with support from the Office of the Director of National Intelligence, the U.S. Army Research Office, and the National Science Foundation.

The research will be presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

©2024 Photonics Media