Storage Keeps Pace with Data from Space
Marie Freebody, Contributing Editor, firstname.lastname@example.org
As imaging technology improves, space researchers are accumulating incredible amounts of image data, all of which must be transmitted and stored for analysis.
There are few fields in which the challenge of image storage and analysis is more keenly felt than in space research. From ground-based telescopes to instruments in the outer reaches of the solar system, the influx of image data that is continually being generated is staggering. And this volume is set to grow as technologies for space exploration become more sophisticated.
As NASA’s Mars Curiosity rover goes about its task, it joins the thousands of other instruments that are gathering images and beaming them back to control centers for storage and analysis. Meanwhile, scientists on the ground are working feverishly on new ways to cope with this growing influx of image data.
“Over the past decade, instrumentation flown on planetary robotic exploration missions has produced massive amounts of data. The Planetary Data System [PDS], NASA’s official archive of scientific results from US planetary missions, has seen an approximate fiftyfold increase in the amount of data to nearly half a petabyte,” said Daniel J. Crichton, principal computer scientist and manager at the PDS Engineering Node at NASA’s Jet Propulsion Laboratory (JPL). “The most significant part of this volume is imaging data.”
But the true problem lies not in the storage of this data but in its transfer and subsequent handling. This is a particular challenge for scientists who do not have access to the computational power housed in astronomical facilities. A re-examination of how to scale the computation is required, as are better tools for browsing, accessing, visualizing and analyzing massive amounts of data.
This new portrait of the nearby spiral galaxy NGC 253 demonstrates that the VST, the newest telescope at ESO’s Paranal Observatory, provides broad views of the sky while also offering impressive image quality. The data were processed using the VST-Tube system developed by A. Grado and collaborators at the INAF-Capodimonte Observatory.
“The exponential growth of hard disk sizes by far outpaces the growth of the detectors, effectively making storage cheaper than it was 20 years ago,” said Dr. Jeremy R. Walsh, archive scientist at the world’s largest ground-based observatory, the European Southern Observatory (ESO). “It is probably more of an issue for the individual astronomer or group to find room and work with the TB [terabytes] of data in their project.”
The latest addition to ESO’s Very Large Telescope (VLT) is the VLT Survey Telescope (VST), whose mosaic camera contains 32 CCD detectors of 4096 x 2048 pixels each and requires large-scale data reduction. This mammoth task is carried out by dedicated facilities around Europe with sufficient computer power and disk storage.
The first released VST image shows the spectacular star-forming region Messier 17, also known as the Omega Nebula or the Swan Nebula, as it has never been seen before. The data were processed using the Astro-WISE software system developed by E.A. Valentijn and collaborators at Kapteyn Astronomical Institute at the University of Groningen in the Netherlands and elsewhere.
In the coming years, new instruments will impose even greater demands on computer capability. Data cubes larger than 2 GB will be produced from the forthcoming MUSE (Multi Unit Spectroscopic Explorer) instrument on the VLT, and such a large volume of data cannot be easily processed on current standard desktop or laptop computers. To exploit this data, users will demand more powerful machines; however, affordable computer hardware typically lags one or two years behind the delivery of these data sets.
Image data archives
But it’s not just sophisticated future missions that pose a problem for today’s space scientists. Decades of space research have generated immense archives of image data that must be accessed in a meaningful way. Astronomers conventionally tag imaging data using a set of annotation standards that help to catalog each image according to its key features and attributes. These annotations, known as metadata or “data about the data,” help astronomers retrieve the images they want for analysis and use.
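As a rough illustration of how metadata supports retrieval, the sketch below filters a small in-memory catalog by target and filter band. The record fields (`target`, `band`, `exposure_s`, `path`), the file names and the exposure values are hypothetical and are not drawn from any real archive schema; a production archive would hold such records in a database rather than a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical metadata record describing one archived image."""
    path: str          # where the image file lives (made-up names)
    target: str        # object the image was pointed at
    band: str          # filter band used for the exposure
    exposure_s: float  # exposure time in seconds


# Made-up catalog entries, for illustration only.
CATALOG = [
    ImageRecord("img_0001.fits", "NGC 253", "r", 300.0),
    ImageRecord("img_0002.fits", "Messier 17", "g", 120.0),
    ImageRecord("img_0003.fits", "NGC 253", "g", 60.0),
    ImageRecord("img_0004.fits", "Omega Centauri", "r", 600.0),
]


def find_images(catalog, target=None, band=None, min_exposure_s=0.0):
    """Return the records that match every criterion that was supplied."""
    return [
        rec for rec in catalog
        if (target is None or rec.target == target)
        and (band is None or rec.band == band)
        and rec.exposure_s >= min_exposure_s
    ]


if __name__ == "__main__":
    for rec in find_images(CATALOG, target="NGC 253", band="g"):
        print(rec.path)
```

The point of the sketch is that the search touches only the small metadata records; the image files themselves are opened only after the right ones have been identified.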
An artist’s impression of Euclid, which will map the dark universe and which is expected to launch in 2020.
“Sometimes, these metadata account for huge amounts of information that have to be stored and accessed through the database,” said Dr. Pedro Osuna, head of the Science Archives and Virtual Observatory team at ESA’s European Space Astronomy Centre [ESAC]. “Current database systems allow for big amounts of metadata storage, but are not enough to cope with the expected amounts that missions like Gaia or Euclid will deliver. Therefore, we are currently researching in areas of parallel database computing, Software as a Service technologies, NoSQL access to data (Hadoop, Map-Reduce), etcetera.”
Gaia, which is set for launch in 2013, is an ambitious mission to chart a 3-D map of the Milky Way, while Euclid is a mission to map the dark universe and is due for launch in 2020.
“Commercial and noncommercial database software companies are spending a lot of effort in designing systems that can cope with huge amounts of metadata in databases,” Osuna said.
Artist’s impression of the Gaia satellite, which will be the most accurate optical astronomy satellite ever built. Due for launch in 2013, it will continuously scan the sky for at least five years.
Most of today’s astronomical archives store data on hard disks, often with multiple copies and tape backup systems as tertiary safe storage. Data reduction (the removal of the instrument signature and conversion to physical units) is done in two ways: Observatories run pipelines that automatically process data and deliver partly reduced products to astronomers; alternatively or additionally, reduction is done on the astronomer’s desktop, typically with scripts and legacy code.
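To make “removal of the instrument signature and conversion to physical units” concrete, here is a minimal sketch of one common form of CCD calibration: subtract a bias frame, divide by a flat field, then convert counts to electrons per second. The article does not say which steps any particular pipeline applies, so the function name, its parameters and the choice of steps are assumptions made for illustration.

```python
def reduce_frame(raw, bias, flat, gain_e_per_adu, exposure_s):
    """Calibrate a raw frame (lists of rows) into electrons per second.

    Assumed steps, for illustration only:
      1. subtract the bias level (detector offset),
      2. divide by the normalized flat field (pixel sensitivity),
      3. multiply by the gain and divide by the exposure time.
    """
    if exposure_s <= 0:
        raise ValueError("exposure time must be positive")
    reduced = []
    for raw_row, bias_row, flat_row in zip(raw, bias, flat):
        row = []
        for r, b, f in zip(raw_row, bias_row, flat_row):
            if f <= 0:
                raise ValueError("flat-field values must be positive")
            row.append((r - b) / f * gain_e_per_adu / exposure_s)
        reduced.append(row)
    return reduced


if __name__ == "__main__":
    raw = [[110, 210], [310, 410]]    # made-up raw counts
    bias = [[10, 10], [10, 10]]       # made-up bias frame
    flat = [[1.0, 2.0], [0.5, 1.0]]   # made-up normalized flat field
    print(reduce_frame(raw, bias, flat, gain_e_per_adu=2.0, exposure_s=10.0))
```

An observatory pipeline applies calibrations of this kind to whole mosaics automatically; the desktop route amounts to astronomers running comparable steps in their own scripts.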
Examples of such legacy code include the Image Reduction and Analysis Facility (IRAF), a large and widely used reduction and analysis package in optical astronomy. But as ESO’s Walsh points out, the software is more than 30 years old.
“The ESO-MIDAS [Munich Image Data Analysis System] ..., developed by ESO but no longer fully supported, is of comparable vintage,” he said. “IDL is also popular and has a large library of astronomical applications. Full replacements of these systems are not contemplated on account of their large size (many millions of lines of code). Python is widely used both for scripting and for the development of systems from scratch.
“Astronomers are typically very eclectic and will use whatever piece of code is best for their particular use, be it from a large package or a dedicated task freely available. For astrophysical analysis of reduced data, most astronomers write their own codes (e.g., [using] C, Python, Fortran, etcetera).”
VST image of the largest globular cluster in the sky, Omega Centauri. The very wide field of view of VST and its powerful camera OmegaCAM can encompass even the faint outer regions of this spectacular object. The data were processed using the VST-Tube system developed by A. Grado and collaborators at the INAF-Capodimonte Observatory.
Although ESA’s Osuna is facing “big data” space missions such as Gaia and Euclid, he believes that the solution does not lie only in developing new hardware. In the case of Euclid, the data download rate is going to be about 100 GB per day, totaling about 100 PB every three years.
“The storage and processing of the data are posing big technological challenges, not only from the point of view of hardware, but also from a system architectural angle, where new distributed systems will have to be implemented to cope with such amounts of data storage and processing,” Osuna said.
This montage shows six cutouts from the new VST image of the star-forming region Messier 17, also known as the Omega Nebula or Swan Nebula.
For NASA’s Crichton and colleague Lisa Gaddis, manager of the PDS Imaging Node, the crux of the matter is to develop a system that supports scalability at all levels, from safe and efficient storage of data to automation of mechanisms to access, visualize, download and analyze it.
“One of the major trends [at NASA] has been the push to make services available online,” Crichton said. “Over the next decade, we anticipate the need to continue to improve mechanisms to bring data from multiple instruments and missions together. This will build on our efforts to re-architect the PDS so that it supports capabilities such as on-the-fly image processing, correlation of data from multiple instruments and missions, and navigation through a wide variety of visualization and analysis tools.”
A simple way of coping with the growing influx of image data is lossless compression (such as gzip). Archives often use it to save space, but the data must be unpacked before it can be worked with; tile compression, which allows access to the data in compressed form, is gaining popularity.
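The difference between whole-file and tile compression can be sketched with Python’s standard `zlib` module. In this simplified byte-stream version, each fixed-size tile is compressed independently, so a reader can decompress only the tiles that overlap the range it needs. The helper names `compress_tiles` and `read_range` are hypothetical, and real astronomical tile compression works on two-dimensional image tiles with its own file conventions, which this sketch does not attempt to reproduce.

```python
import zlib


def compress_tiles(data: bytes, tile_size: int) -> list[bytes]:
    """Losslessly compress each fixed-size tile of `data` on its own."""
    return [
        zlib.compress(data[i:i + tile_size])
        for i in range(0, len(data), tile_size)
    ]


def read_range(tiles: list[bytes], tile_size: int, start: int, length: int) -> bytes:
    """Return bytes [start, start + length), unpacking only the tiles needed."""
    if length <= 0:
        return b""
    first = start // tile_size
    last = (start + length - 1) // tile_size
    chunk = b"".join(zlib.decompress(t) for t in tiles[first:last + 1])
    offset = start - first * tile_size
    return chunk[offset:offset + length]


if __name__ == "__main__":
    data = bytes(range(256)) * 64            # 16 KiB of made-up "pixel" data
    whole = zlib.compress(data)              # must be fully unpacked to read anything
    tiles = compress_tiles(data, 1024)       # 16 independently compressed tiles
    part = read_range(tiles, 1024, 1000, 100)  # touches only tiles 0 and 1
    print(len(whole), sum(len(t) for t in tiles), part == data[1000:1100])
```

Compressing tiles separately usually costs some compression ratio compared with compressing the file as a whole, which is the trade-off for being able to read a small region cheaply.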
“Lossy compression of data with very high compression ratios is being discussed, but acceptance from the community would be necessary before employing such techniques,” said Jonas Haase, archive specialist at ESO. “Network speeds have not followed the same growth trends as hard disks and CPUs [central processing units], which will make it somewhat harder to offer convenient data access to the astronomers in the future.”
The left eye of the Mast Camera on NASA’s Mars rover Curiosity took this image of the camera on the rover’s arm. The mechanism on the right in this image is Curiosity’s dust-removal tool, a motorized wire brush.
Current data rates are still in the realm where all necessary data processing in an archive setting can be performed in clusters of machines or small grids. Some projects are pooling their resources in larger grid projects to save in the long run and to be able to scale up in the future.
“CANFAR [Canadian Advanced Network for Astronomical Research]/Canarie Inc. is an example of that. ESA/ESAC has two grids, one of which can be used by the community,” ESO’s Walsh said.
Although some initiatives use grid computing, large-scale use of cloud computing has not yet made a mark on astronomy. But as data volume grows, so does the need for more automated methods of data extraction. Cloud computing might then become an option, provided it can be leveraged to supply mechanisms for scalable storage and computation.
“Data that are properly formatted and cartographically registered for various planetary bodies – for example, Mars – will help significantly in mining, extracting and visualizing scientific results,” NASA’s Crichton said. “Computation can help improve the ability to visualize and browse through massive data by parallelizing much of the work required to properly subset and tile imaging data.”
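The idea of parallelizing the work of subsetting and tiling an image can be sketched as follows: the image is cut into independent tiles, and a pool of workers computes a summary for each one. The function names and the choice of a per-tile mean are assumptions made for illustration, not a description of any PDS tool. A thread pool keeps the example self-contained; for CPU-bound work on real images, a process pool or a cluster would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_bounds(height, width, tile):
    """List (row0, row1, col0, col1) bounds that cover the whole image."""
    return [
        (r, min(r + tile, height), c, min(c + tile, width))
        for r in range(0, height, tile)
        for c in range(0, width, tile)
    ]


def tile_mean(image, bounds):
    """Extract one tile (the subset step) and return its mean pixel value."""
    r0, r1, c0, c1 = bounds
    values = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return bounds, sum(values) / len(values)


def summarize_tiles(image, tile, workers=4):
    """Process every tile independently across a pool of workers."""
    height, width = len(image), len(image[0])
    bounds_list = tile_bounds(height, width, tile)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as bounds_list.
        return list(pool.map(lambda b: tile_mean(image, b), bounds_list))


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]  # made-up 4 x 4 image
    for bounds, mean in summarize_tiles(image, tile=2):
        print(bounds, mean)
```

Because no tile depends on any other, the same pattern scales from a few threads on a desktop to many machines, which is what makes tiling a natural unit of parallel work.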