The Virtual Observatory: Science in an Exponential World

Alexander S. Szalay
Department of Physics and Astronomy, The Johns Hopkins University, USA

 

The amount of scientific information is doubling every year. This exponential growth is fundamentally changing every aspect of the scientific process – the collection, analysis and dissemination of scientific information. Our traditional paradigm for scientific publishing assumes a linear world, where the number of journals and articles remains approximately constant. The paper presents the challenges of this new paradigm and shows examples of how some disciplines are trying to cope with the data avalanche.

Computational science is a new branch of most disciplines. A thousand years ago, science was primarily empirical. Over the past 500 years, each discipline has added a theoretical component. Theoretical models often motivate experiments and generalize our understanding. Today, most disciplines have both empirical and theoretical branches. In the past 50 years, most disciplines have grown a third, computational branch (for example, empirical, theoretical, and computational ecology, physics, or linguistics).

Computational science traditionally meant simulation and grew out of our inability to find closed-form solutions for complex mathematical models; today, computers can simulate these complex models.

Computational science has now evolved to include information management. Scientists are faced with mountains of data stemming from four converging trends:

- the flood of data from new scientific instruments, whose output roughly doubles every year, driven by Moore's Law;
- the flood of data from ever-larger simulations;
- the ability to economically store petabytes of data online;
- the Internet and the computing Grid, which make these archives accessible to anyone, anywhere.

Scientific information management poses profound computer science challenges. Acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. By using parallelism, these problems can be solved in bounded times (minutes or hours). In contrast, most statistical analysis and data mining algorithms are nonlinear. Many tasks involve computing statistics among sets of data points in a metric space. Pairwise algorithms on N points scale as N². If the data size increases a thousandfold, the work and time can grow by a factor of a million. Many clustering algorithms scale even worse and are infeasible for terabyte-scale data sets.
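As a concrete illustration of this quadratic blow-up, the sketch below counts all point pairs closer than a given separation with a brute-force double loop, the kind of pairwise statistic that underlies spatial clustering measures. The function name, parameters, and toy data are hypothetical and only illustrate the scaling argument: doubling N quadruples the distance evaluations, so a thousandfold larger catalogue implies a millionfold more work.

```python
import numpy as np

def pair_count_bruteforce(points, r_max):
    """Count pairs of points separated by less than r_max.

    Brute force: examines all N*(N-1)/2 pairs, so the cost grows as N^2.
    """
    n = len(points)
    count = 0
    for i in range(n):
        # Distances from point i to every later point (O(N^2) work overall).
        d = np.linalg.norm(points[i + 1:] - points[i], axis=1)
        count += int(np.sum(d < r_max))
    return count

# Toy data: 2,000 random points in a unit cube. A real survey catalogue
# holds millions of objects, which makes this approach infeasible.
rng = np.random.default_rng(0)
pts = rng.random((2000, 3))
print(pair_count_bruteforce(pts, 0.05))
```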

The new online science needs data mining algorithms that use near-linear processing, storage, and bandwidth, and that can be executed in parallel. Unlike current algorithms, which give exact answers, these algorithms will likely be heuristic and give approximate answers.
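One way to trade exactness for near-linear cost, sketched below under my own assumptions rather than as an algorithm prescribed in the text, is to estimate the same pair count from a random subsample and rescale. The answer is a statistical approximation, but the work grows with the subsample size instead of with N², and independent subsamples can be processed in parallel. The helper name and parameters are hypothetical.

```python
import numpy as np

def pair_count_sampled(points, r_max, m=1000, rng=None):
    """Approximate the number of close pairs from a random subsample.

    Counts pairs among m randomly chosen points, then rescales by
    N*(N-1) / (m*(m-1)). The result is an estimate, not an exact count.
    """
    rng = rng or np.random.default_rng()
    n = len(points)
    idx = rng.choice(n, size=min(m, n), replace=False)
    sample = points[idx]
    count = 0
    for i in range(len(sample)):
        # Pairwise distances only within the small subsample.
        d = np.linalg.norm(sample[i + 1:] - sample[i], axis=1)
        count += int(np.sum(d < r_max))
    scale = (n * (n - 1)) / (len(sample) * (len(sample) - 1))
    return count * scale

# 200,000 points would be painful to process exactly with the brute-force
# loop above; sampling 2,000 of them keeps the cost modest.
rng = np.random.default_rng(0)
pts = rng.random((200_000, 3))
print(pair_count_sampled(pts, 0.05, m=2000, rng=rng))
```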


Short bio:

Alexander Szalay is the Alumni Centennial Professor of Astronomy at the Johns Hopkins University. He is also a Professor in the Department of Computer Science. He is a cosmologist, working on statistical measures of the spatial distribution of galaxies and on galaxy formation. He was born and educated in Hungary. After graduation he spent postdoctoral periods at UC Berkeley and the University of Chicago before accepting a faculty position at Johns Hopkins. In 1990 he was elected a Corresponding Member of the Hungarian Academy of Sciences. He is the architect for the Science Archive of the Sloan Digital Sky Survey. He has been collaborating with Jim Gray of Microsoft to design an efficient system to perform data mining on the terabyte-sized SDSS archive, based on innovative spatial indexing techniques. He is leading a grass-roots standardization effort to bring the next-generation, terabyte-sized databases in astronomy onto a common basis so that they will be interoperable: the Virtual Observatory. He is Project Director of the NSF-funded National Virtual Observatory. He is involved in the GriPhyN and iVDGL projects, creating testbed applications for the Computational Grid. He has written over 340 papers in various scientific journals, covering areas from theoretical cosmology to observational astronomy, spatial statistics, and computer science. In 2003 he was elected a Fellow of the American Academy of Arts and Sciences. In 2004 he received one of the Alexander von Humboldt Prizes in Physical Sciences.