Computer Vision System Mimics Human Visualization

Researchers at the UCLA Samueli School of Engineering have developed a computer vision system that can identify objects using the same method of visual learning that humans use. The system could be an advance in computer vision and a step toward general artificial intelligence (AI) systems, that is, computer systems that learn on their own, are intuitive, make decisions based on reasoning, and interact with humans in a more humanlike way.

Current computer vision systems are not designed to learn on their own. They must be trained on exactly what to learn, usually by reviewing thousands of images in which the objects they are trying to identify are labeled for them.

A computer vision system developed at UCLA can identify objects based on only partial glimpses, for example, by using these photo snippets of a motorcycle. Courtesy of UCLA Samueli.

The UCLA system uses a three-step approach. First, it breaks up an image into small chunks, which the researchers call “viewlets.” Second, it learns how these viewlets fit together to form the object in question. Third, it looks at what other objects are in the surrounding area and whether these objects are relevant to describing and identifying the primary object.
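As a rough illustration of the first step only, the sketch below cuts an image into small square chunks with NumPy. The article does not say how the researchers actually extract viewlets, so the fixed grid, the function name `extract_patches`, and its `size` and `stride` parameters are all assumptions made for this example, not the UCLA method.

```python
import numpy as np


def extract_patches(image, size, stride):
    """Cut a 2-D image into square patches on a regular grid.

    Hypothetical helper for illustration; the real viewlets are not
    necessarily fixed-size grid cells.
    Returns the patches and the (row, column) of each patch's top-left corner.
    """
    height, width = image.shape[:2]
    patches, corners = [], []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patches.append(image[top:top + size, left:left + size])
            corners.append((top, left))
    return patches, corners


# Toy 8x8 "image" whose pixel values are simply 0..63.
image = np.arange(64).reshape(8, 8)
patches, corners = extract_patches(image, size=4, stride=4)
```

Keeping each patch's corner alongside its pixels matters for the later steps, since learning how chunks fit together requires knowing where each one came from.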

To help the new system learn more like humans, the engineers immersed it in an internet replica of the environment in which humans live. “Fortunately, the internet provides two things that help a brain-inspired computer vision system learn the same way humans do,” said professor Vwani Roychowdhury. “One is a wealth of images and videos that depict the same types of objects. The second is that these objects are shown from many perspectives — obscured, bird’s-eye, up close — and they are placed in different kinds of environments.”

The researchers drew insight into contextual learning from findings in cognitive psychology and neuroscience. “Contextual learning is a key feature of our brains, and it helps us build robust models of objects that are part of an integrated worldview where everything is functionally connected,” Roychowdhury said.

The UCLA system provides a scalable framework for unsupervised learning of object prototypes. The framework enables deformable objects to be identified from their parts, their different configurations and views, and their spatial relationships. Computationally, the object prototypes are represented as geometric associative networks.
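One simple way to picture a network that ties parts to their spatial relationships is a graph whose nodes are part labels and whose edges record how often two parts appear together and their average displacement. The sketch below is a toy under that assumption; the class `PartGraph`, its methods, and the part labels are hypothetical and are not taken from the published work.

```python
from collections import defaultdict


class PartGraph:
    """Toy graph: edges store co-occurrence counts and summed offsets between parts."""

    def __init__(self):
        self.edges = defaultdict(lambda: {"count": 0, "sum_dx": 0.0, "sum_dy": 0.0})

    def observe(self, detections):
        """Record one image's detections, given as a list of (label, x, y)."""
        for i, (label_a, xa, ya) in enumerate(detections):
            for label_b, xb, yb in detections[i + 1:]:
                # Store each pair under a sorted key; the offset always
                # points from the first label in the key to the second.
                if label_a <= label_b:
                    key, dx, dy = (label_a, label_b), xb - xa, yb - ya
                else:
                    key, dx, dy = (label_b, label_a), xa - xb, ya - yb
                edge = self.edges[key]
                edge["count"] += 1
                edge["sum_dx"] += dx
                edge["sum_dy"] += dy

    def mean_offset(self, label_a, label_b):
        """Average displacement from label_a to label_b, or None if never seen together."""
        forward = label_a <= label_b
        key = (label_a, label_b) if forward else (label_b, label_a)
        if key not in self.edges:
            return None
        edge = self.edges[key]
        dx = edge["sum_dx"] / edge["count"]
        dy = edge["sum_dy"] / edge["count"]
        return (dx, dy) if forward else (-dx, -dy)


# Two unlabeled "images", each with a head detected above a torso.
graph = PartGraph()
graph.observe([("head", 10, 5), ("torso", 10, 20)])
graph.observe([("head", 30, 8), ("torso", 32, 26)])
```

After these two observations, `graph.mean_offset("head", "torso")` summarizes where a torso tends to sit relative to a head, which is the kind of geometric association that lets a partial glimpse suggest the rest of the object.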

The system understands what a human body is by looking at thousands of images with people in them and then ignoring nonessential background objects. Courtesy of UCLA Samueli.

The researchers tested the system with about 9000 images, each showing people and other objects. The system was able to build a detailed model of the human body without external guidance and without the images being labeled. The researchers ran similar tests using images of motorcycles, cars, and airplanes. In all cases, their system performed better than or at least as well as traditional computer vision systems that have been developed with many years of training.

The research was published in Proceedings of the National Academy of Sciences (https://doi.org/10.1073/pnas.1802103115).