
Vision-Guided Robot Targets Natural Interaction with Humans

HOLLY O'DELL, CONTRIBUTING EDITOR

In 1970, robotics professor Masahiro Mori developed his “uncanny valley” hypothesis, which stated that people react positively to a robot if it has some humanlike features, but they become more repulsed if it starts looking too much like us.

A robot named Pepper was developed to bridge that gap by providing human-robot interaction to improve quality of life. Made by SoftBank Robotics Corp., the compact, mobile robot features a design mimicking that of a traditional robot but with eyes that move and arms that articulate. The robot is equipped with advanced computer vision and machine learning technologies — developed by Rensselaer Polytechnic Institute’s (RPI’s) Intelligent Systems Lab following nearly 20 years of research — so it can accurately detect and recognize nonverbal cues to naturally interact with humans.

Computer vision and custom algorithms allow Pepper to maintain eye contact with humans. Courtesy of Rensselaer Polytechnic Institute.

 

Its software uses custom algorithms to detect face and body movement in real time. The small video camera mounted on Pepper’s head can recognize facial expressions and estimate face poses, enabling it to recognize happiness, sadness, or surprise. It can also estimate an individual’s age and gender, and even maintain eye contact with a human.

Pepper takes a similar approach to recognizing the body. “First we start with detecting the body as a 2D image,” says Qiang Ji, professor of electrical, computer, and systems engineering at RPI. “From there we can analyze the body pose, which is determined by the position and angle of the shoulders. Then from the body pose, we can recognize different body gestures.”
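The pose step Ji describes can be illustrated with a small sketch. It is not RPI's implementation: it assumes an upstream detector has already produced 2D pixel coordinates for the two shoulders, and the function names and the 10-degree threshold are hypothetical choices made only for illustration.

```python
import math

def shoulder_pose(left_shoulder, right_shoulder):
    """Return (midpoint, tilt_degrees) for two 2D shoulder keypoints.

    Keypoints are (x, y) pixel coordinates from an assumed upstream
    body detector; tilt is the angle of the shoulder line relative
    to the image's horizontal axis.
    """
    (lx, ly), (rx, ry) = left_shoulder, right_shoulder
    midpoint = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    tilt_degrees = math.degrees(math.atan2(ry - ly, rx - lx))
    return midpoint, tilt_degrees

def classify_tilt(tilt_degrees, threshold=10.0):
    """Label the shoulder line as level or tilted (threshold is an assumption)."""
    return "level" if abs(tilt_degrees) <= threshold else "tilted"

midpoint, tilt = shoulder_pose((100, 200), (200, 200))
print(midpoint, tilt, classify_tilt(tilt))
```

A real system would feed position and angle features like these into a learned gesture recognizer rather than a fixed threshold.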

Pepper uses deep learning to articulate gestures such as a handshake. Combined with the small camera mounted on the robot’s head, the software helps Pepper recognize facial expressions and estimate age and gender. Courtesy of Rensselaer Polytechnic Institute.

 

For example, when Pepper sees you cross your arms, it says, “Hey, be friendly to me.” When it recognizes the drinking motion, it says, “Cheers!” Ji predicts that computer vision will also be added to audio-only programs such as Alexa from Amazon or iPhone’s Siri. This multimodal approach allows for more humanlike interaction initiated from the robot.
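The gesture-to-response behavior above could be sketched as a simple lookup. The gesture labels `arms_crossed` and `drinking` are hypothetical names for whatever a recognizer might emit; only the two spoken replies come from the article.

```python
# Hypothetical gesture labels mapped to the replies quoted in the article.
RESPONSES = {
    "arms_crossed": "Hey, be friendly to me.",
    "drinking": "Cheers!",
}

def respond(gesture_label):
    """Return the reply for a recognized gesture, or None if there is none."""
    return RESPONSES.get(gesture_label)

print(respond("drinking"))
```

Returning `None` for unknown gestures leaves the caller free to stay silent or fall back to another modality, such as speech.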

Pepper is only one application to which the RPI researchers have applied computer vision technology. Augmented by the computer vision algorithms developed in Ji’s group, Milo, a robot from RoboKind, is equipped with motors on its face that act as artificial muscles. These “muscles” produce and replicate different facial expressions to interact with children with autism, for example.


The computer screen on Pepper’s chest can be programmed to display various content during interactions, such as information about what the robot is seeing in real time. Courtesy of Rensselaer Polytechnic Institute.

 

Other use cases include human state monitoring and prediction, driver behavior estimation and prediction, and security and surveillance. Research from DARPA supports the use of aerial videos to detect suspicious action and activities.

Meanwhile, with funding from Honda and the U.S. Department of Transportation, RPI is focusing research on detecting distracted and fatigued drivers through behaviors such as head and eye movements, gaze, yawning, and nodding, all captured by a dashboard camera.

The Milo robot has facial motors that act as artificial muscles, producing and replicating various facial expressions. RPI professor Qiang Ji sees the robot’s ability to interact with children with autism as a promising application. Courtesy of Rensselaer Polytechnic Institute.

 

Into the deep

While the camera has been important in the development of Pepper and other highly automated applications, advancements in deep learning software are responsible for the applications’ ongoing improved performance. Deep learning allows computers to learn to perform classification tasks directly from images, rather than from a task-specific algorithm. A large set of labeled data (in this case, many thousands of images displaying various expressions and gestures) is used together with neural networks to train the computer model, allowing it to learn much like a human does.
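The idea of learning a classifier from labeled examples can be shown at toy scale. The sketch below is not a deep network: it swaps in a single-layer logistic classifier trained by gradient descent, and the two features and six labeled samples are invented for illustration rather than taken from RPI's data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, lr=0.5, epochs=200):
    """Fit a logistic classifier with batch gradient descent."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 (happy) or 0 (sad) for a feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0

# Invented features: (mouth_curvature, brow_raise); label 1 = happy, 0 = sad.
SAMPLES = [(0.8, 0.2), (0.9, 0.4), (0.7, 0.1),
           (-0.6, 0.1), (-0.8, 0.3), (-0.7, -0.2)]
LABELS = [1, 1, 1, 0, 0, 0]

weights, bias = train(SAMPLES, LABELS)
print([predict(weights, bias, x) for x in SAMPLES])
```

A deep learning system follows the same loop of predicting, measuring error against labels, and adjusting parameters, but it stacks many layers and learns its features directly from image pixels instead of relying on hand-picked inputs.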

Still, the technology needs to mature to make the Pepper robot safely deployable in homes and other locations. “In a real-world environment, it is very hard for current deep learning software to predict a task,” Ji says. “If you train the model for one task, it cannot automatically adapt to another, even similar, task without another retraining process.”

In its early deployments, Pepper has been used by businesses to promote products. Ji also sees the robot’s potential to act as a museum guide. But the ultimate goal is to improve people’s lives. “Pepper’s ability to perceive and understand humans’ emotional states, and then respond in an empathetic manner, makes it ideal as a companion robot for the ill or older populations.”

Published: July 2019
Glossary
deep learning
Deep learning is a subset of machine learning that involves the use of artificial neural networks to model and solve complex problems. The term "deep" in deep learning refers to the use of deep neural networks, which are neural networks with multiple layers (deep architectures). These networks, often called deep neural networks or deep neural architectures, have the ability to automatically learn hierarchical representations of data. Key concepts and components of deep learning include: ...
computer vision
Computer vision enables computers to interpret and make decisions based on visual data, such as images and videos. It involves the development of algorithms, techniques, and systems that enable machines to gain an understanding of the visual world, similar to how humans perceive and interpret visual information. Key aspects and tasks within computer vision include: Image recognition: Identifying and categorizing objects, scenes, or patterns within images. This involves training...