Algorithm Adds 3rd Dimension to Standard Video

CAMBRIDGE, Mass., Nov. 6, 2015 — By exploiting the graphics-rendering software that powers sports video games, researchers at the Massachusetts Institute of Technology (MIT) and the Qatar Computing Research Institute (QCRI) have developed a system that automatically converts 2D video of soccer games into a 3D version that can be played on commercial 3D televisions and other special-purpose displays.

Today's video games generally store very detailed 3D maps of the virtual environment that players navigate. When the player initiates a move, the game adjusts the map accordingly and, on the fly, generates a 2D projection of the 3D scene that corresponds to a particular viewing angle.

To create a 3D video from 2D source material, the researchers essentially ran this process in reverse. They set Electronic Arts' realistic soccer game "FIFA 13" to play over and over again, and used Microsoft's video-game analysis tool PIX to continuously store screen shots of the action. For each screen shot, they extracted the corresponding 3D map.

Using a standard algorithm for gauging the difference between two images, they winnowed out most of the screen shots, keeping those that best captured the range of possible viewing angles and player configurations that the game presented; the total number of screen shots ran to the tens of thousands. Then they stored each screen shot and the associated 3D map in a database.
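The winnowing step can be illustrated with a short sketch. The article does not name the image-difference algorithm or the selection strategy, so the mean squared difference, the greedy pass, and the names `mean_squared_difference`, `winnow` and `min_difference` below are all assumptions made for illustration, with tiny four-pixel lists standing in for real screen shots.

```python
def mean_squared_difference(a, b):
    """Mean squared difference between two equal-length flattened images.

    Assumption: the article only says "a standard algorithm"; this metric is a stand-in.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def winnow(screenshots, min_difference):
    """Greedily keep a screen shot only if it differs enough from every one already kept.

    Returns the indices of the kept screen shots. The greedy strategy is hypothetical.
    """
    kept = []
    for index, image in enumerate(screenshots):
        if all(
            mean_squared_difference(image, screenshots[k]) >= min_difference
            for k in kept
        ):
            kept.append(index)
    return kept


# Toy data: four "images" of four pixels each, two of them near duplicates.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],  # near duplicate of frame 0
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.9, 1.0, 1.0],  # near duplicate of frame 2
]
kept_indices = winnow(frames, min_difference=0.05)
print(kept_indices)
```

With this toy data the near duplicates are discarded and only the two distinct views remain, which is the effect the researchers describe at a much larger scale.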

For every frame of 2D video of an actual soccer game, the system identifies in the database the 10 best corresponding screen shots. It decomposes those images, looking for the best matches between smaller regions of the video feed and smaller regions of the screen shots. It then superimposes the depth information from the screen shots on the corresponding sections of the video feed and, finally, stitches the pieces back together.
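A simplified sketch of this lookup and depth transfer follows. It is not the researchers' implementation: the squared-difference distance, the fixed-position block comparison, the plain concatenation used for "stitching," and the names `top_k_matches` and `transfer_depth` are all assumptions, and one-dimensional lists stand in for images and depth maps.

```python
import heapq


def distance(a, b):
    """Sum of squared differences between two equal-length pixel lists (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_matches(frame, database, k=10):
    """Return the k database entries whose stored images are closest to the frame."""
    return heapq.nsmallest(k, database, key=lambda entry: distance(frame, entry["image"]))


def blocks(values, size):
    """Split a flat list into consecutive regions of the given size."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def transfer_depth(frame, candidates, block_size):
    """For each region of the frame, copy depth from the best-matching candidate region.

    Simplification: regions are compared only at the same position, and the
    pieces are joined by concatenation rather than any smoothing step.
    """
    depth = []
    for position, frame_block in enumerate(blocks(frame, block_size)):
        best = min(
            candidates,
            key=lambda e: distance(frame_block, blocks(e["image"], block_size)[position]),
        )
        depth.extend(blocks(best["depth"], block_size)[position])
    return depth


# Toy database of (image, depth map) pairs; all values are made up for illustration.
database = [
    {"image": [0.0, 0.0, 1.0, 1.0], "depth": [5, 5, 9, 9]},
    {"image": [1.0, 1.0, 0.0, 0.0], "depth": [2, 2, 7, 7]},
    {"image": [0.5, 0.5, 0.5, 0.5], "depth": [4, 4, 4, 4]},
]
frame = [0.0, 0.0, 0.2, 0.2]
candidates = top_k_matches(frame, database, k=2)
depth_map = transfer_depth(frame, candidates, block_size=2)
print(depth_map)
```

In this example the two halves of the frame take their depth from different screen shots, mirroring the idea that no single stored image needs to match the whole video frame.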

The result is a 3D effect with no visual artifacts. In a user study, the majority of subjects gave the 3D effect a rating of 5, or excellent, on a five-point scale; the average score was between 4 and 5.

Currently the system takes about a third of a second to process a frame of video. But successive frames could all be processed in parallel, meaning a broadcast delay of a second or two might provide an adequate buffer to permit conversion on the fly. The researchers are working to further reduce conversion time.
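The arithmetic behind that estimate can be sketched as follows. The article gives only the per-frame processing time, so the 30-frames-per-second broadcast rate, the two-second delay, and the function names here are assumptions chosen for illustration.

```python
import math
from fractions import Fraction


def workers_needed(seconds_per_frame, frames_per_second):
    """Minimum number of frames that must be processed in parallel to keep pace."""
    return math.ceil(seconds_per_frame * frames_per_second)


def frames_in_buffer(delay_seconds, frames_per_second):
    """Number of frames that arrive during the broadcast delay."""
    return int(delay_seconds * frames_per_second)


seconds_per_frame = Fraction(1, 3)  # "about a third of a second" per the article
frames_per_second = 30              # assumed broadcast frame rate, not from the article
delay_seconds = 2                   # upper end of "a second or two"

workers = workers_needed(seconds_per_frame, frames_per_second)
buffered = frames_in_buffer(delay_seconds, frames_per_second)
print(workers, buffered)
```

Under these assumptions, about 10 frames would need to be in progress at any moment, while a two-second delay holds 60 frames, which suggests why a short buffer could be adequate if enough parallel capacity is available.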

The research was presented at the Association for Computing Machinery’s 2015 Multimedia conference.

Published: November 2015