Multiview 3-D Photography Simplified

CAMBRIDGE, Mass., June 24, 2013 — A small checkerboard patterned plastic film inserted beneath the lens of an ordinary camera can transform the device into a light-field camera capable of producing multiperspective images.

In computational photography, light-gathering tricks and sophisticated algorithms extract more information from the visual environment than traditional cameras can. The first commercial application of this type of photography is the so-called light-field camera, which measures not only the intensity of incoming light but also its angle of arrival. This information can be used to produce multiperspective 3-D images, or to refocus a shot even after it has been captured.

Current light-field cameras trade a good deal of resolution for that extra angle information: A camera with a 20-megapixel sensor, for instance, will yield a refocused image of only 1 megapixel. Such devices also cost nearly $400.

Researchers in the Camera Culture Group at MIT’s Media Lab aim to change that with a system they’re calling Focii. The device — which can produce a full 20-megapixel multiview 3-D image from a single exposure of a 20-megapixel sensor — relies on a small rectangle of plastic film printed with a unique checkerboard pattern that is inserted beneath the lens of an ordinary digital single-lens-reflex camera. Software does the rest.

Because a light-field camera captures information about not only the intensity of light rays but also their angle of arrival, the images it produces can be refocused later. Scientists from the Camera Culture Group at MIT’s Media Lab have developed a system called Focii that relies on a small rectangle of plastic film printed with a unique checkerboard pattern. When inserted beneath the lens of an ordinary camera, it can produce a full 20-megapixel multiview 3-D image from a single exposure of a 20-megapixel sensor. Courtesy of Kshitij Marwah.

The new work complements the Camera Culture Group’s ongoing glasses-free 3-D display research, said postdoc Gordon Wetzstein. “Generating live-action content for these types of displays is very difficult,” Wetzstein said. “The future vision would be to have a completely integrated pipeline from live-action shooting to editing to display. We’re developing core technologies for that pipeline.”

In 2007, Ramesh Raskar, the NEC Career Development Associate Professor of Media Arts and Sciences and head of the Camera Culture Group, and colleagues at Mitsubishi Electric Research showed that a plastic film with a pattern printed on it — a “mask” — and some algorithmic wizardry could produce a light-field camera whose resolution matched that of cameras that used arrays of tiny lenses, the approach adopted in today’s commercial devices.


“It has taken almost six years now to show that we can actually do significantly better in resolution, not just equal,” Raskar said.

Focii represents a light field as a grid of square patches; each patch, in turn, consists of a 5 × 5 grid of blocks. Each block represents a different perspective on a 121-pixel patch of the light field, so Focii captures 25 perspectives in all. Conventional 3-D systems, such as those used to produce 3-D movies, capture only two perspectives; with multiperspective systems, a change in viewing angle reveals new features of an object, as it does in real life.
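The arithmetic behind those figures can be sketched in a few lines. This is only an illustration of the numbers quoted above; the function name is hypothetical, and the count ignores color channels, which the article does not break out.

```python
# Figures quoted in the article: a 5 x 5 grid of blocks, 121 pixels per block.
VIEWS_PER_SIDE = 5
PIXELS_PER_BLOCK = 121


def patch_dimensions(views_per_side=VIEWS_PER_SIDE,
                     pixels_per_block=PIXELS_PER_BLOCK):
    """Return (number of perspectives, pixel values in one light-field patch)."""
    perspectives = views_per_side ** 2            # one block per perspective
    values_per_patch = perspectives * pixels_per_block
    return perspectives, values_per_patch


perspectives, values_per_patch = patch_dimensions()
print(perspectives, "perspectives,", values_per_patch, "pixel values per patch")
```

Under these assumptions, a single patch carries 25 perspectives and 3025 pixel values.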

The key to the system is a novel way to represent the grid of patches corresponding to any given light field. In particular, Focii describes each patch as the weighted sum of a number of reference patches — or “atoms” — stored in a dictionary of about 5000 patches. Rather than describing the upper-left corner of a light field by specifying the individual values of all 121 pixels in each of 25 blocks, Focii simply describes it as some weighted combination of, say, atoms 796, 23 and 4231, the investigators said.
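A minimal sketch of that idea follows. It is not the researchers' code: the atoms here are pseudo-random stand-ins generated by a hypothetical `toy_atom` function, whereas the real dictionary is learned from example light fields, and the weights are invented for illustration. Only the sizes (25 blocks of 121 pixels, about 5000 atoms) and the atom numbers come from the article.

```python
import random

PATCH_LEN = 25 * 121   # 25 blocks of 121 pixels, as described in the article
DICT_SIZE = 5000       # "about 5000" atoms


def toy_atom(index, length=PATCH_LEN):
    """Hypothetical stand-in for a learned atom: fixed pseudo-random values."""
    if not 0 <= index < DICT_SIZE:
        raise IndexError("atom index outside the dictionary")
    rng = random.Random(index)  # same index always yields the same atom
    return [rng.uniform(-1.0, 1.0) for _ in range(length)]


def reconstruct(sparse_code, length=PATCH_LEN):
    """Rebuild a patch as the weighted sum of the atoms named in sparse_code."""
    patch = [0.0] * length
    for index, weight in sparse_code.items():
        for i, value in enumerate(toy_atom(index, length)):
            patch[i] += weight * value
    return patch


# Invented weights for the three atoms the article uses as an example.
code = {796: 0.5, 23: -1.2, 4231: 2.0}
patch = reconstruct(code)
print(len(patch), "pixel values described by", len(code), "weights")
```

The point of the sketch is the bookkeeping: three index-weight pairs stand in for thousands of individual pixel values, provided the dictionary is good enough that a few atoms suffice.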

According to graduate student Kshitij Marwah, the best way to understand the dictionary of atoms is through the analogy of the Fourier transform, a widely used technique for decomposing a signal into its constituent frequencies.

In fact, visual images can be, and frequently are, interpreted as signals and represented as sums of frequencies. In such cases, the different frequencies also can be represented as atoms in a dictionary. Each atom consists of alternating bars of light and dark, with the distance between the bars representing frequency.
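The frequency analogy can be made concrete with a one-dimensional toy example. The sketch below assumes a single row of 32 samples and cosine "bar patterns" as atoms; the names and the test signal are illustrative and do not come from the researchers.

```python
import math

N = 32  # samples across one image row (an illustrative choice)


def cosine_atom(k, n_samples=N):
    """Atom k: k cycles of light and dark bars across the row."""
    return [math.cos(2 * math.pi * k * n / n_samples) for n in range(n_samples)]


def atom_weight(signal, k):
    """Project the signal onto atom k (valid for 0 < k < N/2)."""
    atom = cosine_atom(k, len(signal))
    return 2.0 / len(signal) * sum(s * a for s, a in zip(signal, atom))


# A row built from two bar patterns: strong at frequency 3, faint at frequency 7.
signal = [2.0 * a + 0.5 * b for a, b in zip(cosine_atom(3), cosine_atom(7))]

weights = {k: round(atom_weight(signal, k), 6) for k in range(1, N // 2)}
nonzero = {k: w for k, w in weights.items() if abs(w) > 1e-9}
print(nonzero)
```

Only atoms 3 and 7 end up with a meaningful weight, so two numbers are enough to describe all 32 samples of this particular row.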

The atoms in the MIT scientists’ dictionary are similar but much more complex. Each atom is itself a 5 × 5 grid of 121-pixel blocks, each consisting of arbitrary-seeming combinations of color: The blocks in one atom might all be green in the upper-left corner and red in the lower right, with lines at slightly different angles separating the regions of color; the blocks of another atom might all feature slightly different-sized blobs of yellow invading a region of blue.

In building their dictionary of atoms, the investigators had two tools at their disposal that Joseph Fourier, working in the early 19th century, lacked: computers and several real-world examples of light fields.

To build the dictionary, the investigators tested several combinations of colored blobs to determine which, empirically, enabled the most efficient representation of actual light fields. Once the dictionary was established, however, they still had to calculate the optimal design of the mask they use to record light-field information, that is, the patterned plastic film that they slip beneath the camera lens.

Visiting scientist Yosuke Bando explains the principle behind mask design using, again, the analogy of the Fourier transform.

“If a mask has a particular frequency in the vertical direction” — say, a regular pattern of light and dark bars — “you only capture that frequency component of the image,” Bando said. “So you have no way of recovering the other frequencies. If you use frequency domain reconstruction, the mask should contain every frequency in a systematic manner.”

“Think of atoms as the new frequency,” Marwah said. “In our case, we need a mask pattern that can effectively cover as many atoms as possible.”

Assembling an image from the information captured by the mask is currently computationally intensive, said Kari Pulli, senior director of research at graphics-chip company Nvidia. Moreover, he said, the examples of light fields used to assemble the dictionary may have omitted some types of features common in the real world.

“There’s still work to be done for this to be actually something that consumers would embrace,” Pulli said.

The work will be presented in July at Siggraph 2013 in Anaheim, Calif.

For more information, visit: www.mit.edu