Hybrid Comparative Solution Boosts Multi-Object Tracking

A team at the Gwangju Institute of Science and Technology (GIST) in Korea, led by Moongu Jeon, implemented a technique called deep temporal appearance matching association, or Deep-TAM, to overcome short-term occlusion, which impairs the ability of computer vision systems to track multiple objects simultaneously. The framework was shown to achieve high tracking accuracy without sacrificing computational speed.

Algorithms that can simultaneously track multiple objects are essential to applications that range from autonomous driving to advanced public surveillance. However, it is difficult for computers to discriminate between detected objects based on their appearance.

Object tracking involves recognizing persistent objects in video footage and following their movements. While computers can track more objects simultaneously than humans can, they often fail to tell different objects apart by appearance.

This, in turn, can lead the algorithm to mix up objects in a scene and ultimately produce incorrect tracking results.

Conventional tracking determines object trajectories by associating a bounding box with each detected object and establishing geometric constraints. The difficulty with this approach lies in accurately matching previously tracked objects with objects detected in the current frame. Differentiating detected objects based on features like color usually fails because of changes in lighting conditions and occlusions.
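The geometric matching step described above can be illustrated with a minimal sketch. It assumes boxes are given as (x1, y1, x2, y2) tuples and uses intersection over union (IoU) with a greedy assignment, a common baseline; the function names and the `min_iou` threshold are hypothetical and are not taken from the researchers' work.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou=0.3):
    """Pair track ids with detection indices, highest overlap first.

    tracks: dict mapping track id -> last known box
    detections: list of boxes found in the current frame
    min_iou: assumed cutoff below which a pair is not considered a match
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if tid in used_tracks or j in used_dets:
            continue  # each track and detection is used at most once
        matches[tid] = j
        used_tracks.add(tid)
        used_dets.add(j)
    return matches


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(greedy_match(tracks, detections))
```

Because this kind of matching relies only on box overlap, a track whose object is hidden for a few frames, or two objects that cross paths, can easily be assigned to the wrong detection.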

The researchers’ solution focused on enabling the tracking model to accurately extract the known features of detected objects and compare them not only with those of other objects in the frame, but also with a recorded history of known features.

To this end, the researchers combined joint-inference neural networks (JI-Nets) with long short-term memory networks (LSTMs). JI-Nets compare the appearances of two detected objects jointly and from scratch, instead of relying on features extracted from each object independently; LSTMs help associate stored appearances with those in the current frame. Using historical appearances in this way allowed the algorithm to overcome short-term occlusions of the tracked objects.
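The idea of matching against a history of appearances, rather than only the last frame, can be sketched as follows. This is a simplified illustration and not the Deep-TAM model: a plain cosine similarity stands in for the learned JI-Net comparison, a simple average over a fixed-length buffer stands in for the LSTM, and the `Track` class, `best_track` function, `history_len`, and `threshold` are all hypothetical.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity; a stand-in for a learned pairwise comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class Track:
    """Keeps the most recent appearance vectors of one tracked object."""

    def __init__(self, track_id, feature, history_len=5):
        self.track_id = track_id
        self.history = deque([feature], maxlen=history_len)

    def score(self, feature):
        # Average similarity over the stored history (stand-in for the LSTM).
        return sum(cosine(h, feature) for h in self.history) / len(self.history)

    def update(self, feature):
        self.history.append(feature)


def best_track(tracks, feature, threshold=0.8):
    """Return the track whose history best matches the feature, or None."""
    if not tracks:
        return None
    best = max(tracks, key=lambda t: t.score(feature))
    return best if best.score(feature) >= threshold else None


a = Track("a", [1.0, 0.0])
b = Track("b", [0.0, 1.0])
a.update([0.9, 0.1])

# Object "a" is hidden for a few frames, so its history is not updated.
# When it reappears, the stored appearances still identify it.
match = best_track([a, b], [0.95, 0.05])
print(match.track_id)
```

Because the history survives frames in which the object is not detected, a track can be picked up again after a brief occlusion instead of being restarted under a new identity.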

“Compared to conventional methods that pre-extract features from each object independently, the proposed joint-inference method exhibited better accuracy in public surveillance tasks, namely pedestrian tracking,” Jeon said.

The researchers also offset a main drawback of deep learning, its low speed, by adopting indexing-based GPU parallelization to reduce computing times. Tests on public surveillance data sets confirmed that the proposed tracking framework offers state-of-the-art accuracy and is therefore ready for deployment.