 

18th International Conference



CODATA 2002

Frontiers of Scientific and Technical Data

Montréal, Canada
29 September - 3 October

Abstracts: Keynote and Invited Cross-Cutting Themes




Keynote Speakers

1. Preserving and Archiving S&T Data

Trends In Archiving Digital Data
Kevin Ashley
University of London Computer Center, UK

The scientific and technical worlds have been creating and collecting information in digital form for well over 40 years, and it is arguable that they were the first to recognise the necessity of sound infrastructures to preserve that data for future reuse, examination and criticism. But it is also true that efforts were fragmented and often discipline-specific. Digital preservation is now of concern to many; it is the cultural heritage communities, business, and governments who are setting the agenda and scoping the problem. The issues are many: who pays to keep material whose value may not be realised for many years? How do we decide what to retain if we cannot keep it all? How do we ensure we know enough about what we have preserved to enable its future use, particularly in a discipline and possibly a culture far removed from its creators? The scientific and technical communities have solutions to offer in these areas, but they can also learn from activities elsewhere. I will draw on experiences in the business, scientific and cultural worlds to illustrate shared problems and possible shared solutions to these and other challenges.

 

2. Legal Issues in the use of S&T Data

Preserving the Positive Functions of the Public Domain in Science
Pamela Samuelson
Berkeley Center for Law and Technology, University of California at Berkeley, USA

Science has greatly benefited from the absence of intellectual property rights in data and in scientific methodologies. In recent years, intellectual property has played a greater role in scientific work. While intellectual property rights may well have a positive role to play in some fields of science, so does the public domain. This talk will discuss ongoing work exploring the positive functions of the public domain. This work may help scientists and lawyers achieve a better understanding of the circumstances under which intellectual property rights will foster science and those under which preserving the public domain will be more effective in fostering science.


3. Interoperability and Data Integration

Integrating Bioinformatics Data into Science: From Molecules to Biodiversity

Robert J. Robbins
Fred Hutchinson Cancer Research Center, Seattle, WA, USA

Informatics - the acquisition, management, and assessment of large (huge) amounts of data - has permeated biology. GenBank contains billions of base pairs of DNA and complete genomic sequences are readily available. Microbial genomes are sequenced in a matter of days. Expression-array techniques allow the dissection of molecular function at the genomic level, while some in the biodiversity community now aspire to a global all-taxa inventory. Once, dreamers thought about assembling all of the sequence information necessary to document an entire genome. Now it is possible to imagine bringing together all of the information necessary to describe the biosphere - past and present. But is it possible? How vast is the challenge? Are the difficulties technical, or sociological, or semantic, or ... Most importantly of all, what could we do with all of this information? Would it - in totality - be useful in any meaningful sense? Can there ever be a biological database of everything?


4. Information Economics for S&T Data

Economics of information services for scientific and technical data in the information age: The view from a national data center in Japan
Masamitsu Negishi
NII (National Institute of Informatics), Japan

Applications of information technology continue to spread throughout the academic and business worlds. The internet was developed and utilized originally within academia, where scientists and technologists enjoyed the free exchange of scientific information with their peers. As business and entertainment uses of the web grew, approaches developed for controlling or restricting the flow of information that were more responsive to the economic needs of the business community. Yet the needs of the scientific community for continued easy and free exchange of information remain. This talk reviews information technology, government policy, legislation and business model issues surrounding the flow of academic information in the context of economic theories for information goods. The speaker presents an overall view of the problems based on his long experience in developing and managing database and electronic library systems at the National Institute of Informatics in Japan (formerly NACSIS), a national center for scientific information. The lecture concludes with a recommended scheme for cooperative, effective, and usable data flows among scientists and technologists across the world.

 

5. Emerging Tools and Techniques for Data Handling

Text Mining - the Technology To Convert Text into Knowledge?
Stan Matwin
School of Information Technology and Engineering, University of Ottawa, Canada

In this presentation we will look at Text Mining, also known as Information Extraction: the technological solution that addresses the problem of mapping technical texts into fixed-format representations, such as database records or frames. We will define the task using real-life examples. We will take a bird's eye view of the basic text mining architecture and discuss components of text mining systems. We will look at the existing tools and solution providers and will discuss the limits of the technology. The talk will be illustrated with the author's experience in the development of a text mining tool in genomics.
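As a rough illustration of the mapping described above (the sentence, patterns and record fields below are invented for this sketch and do not come from the speaker's genomics tool), the following Python fragment pulls a fixed-format, database-row-like record out of a free-text sentence with regular expressions:

    import re

    # A hypothetical genomics-style sentence of the kind a text-mining system might process.
    text = "The BRCA1 gene, located on chromosome 17, is associated with breast cancer."

    # Toy extraction patterns; a real system would use trained models and curated lexicons.
    gene = re.search(r"\b([A-Z][A-Z0-9]{2,})\s+gene\b", text)
    chrom = re.search(r"chromosome\s+(\w+)", text)
    disease = re.search(r"associated with\s+([a-z ]+)\.", text)

    # Map the unstructured sentence into a fixed-format record (a database-row-like dict).
    record = {
        "gene": gene.group(1) if gene else None,
        "chromosome": chrom.group(1) if chrom else None,
        "disease": disease.group(1).strip() if disease else None,
    }
    print(record)  # {'gene': 'BRCA1', 'chromosome': '17', 'disease': 'breast cancer'}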

 

6. Ethics in the use of S&T Data

Ethics in the Creation and Use of Scientific and Technical Data
Prof. M.G.K. Menon
Dr. Vikram Sarabhai Distinguished Professor, Department of Space, and
President, LEAD, India

Science has been moving ahead at an ever-increasing pace. To encourage innovation and investment, there has been increasing stress on the protection of intellectual property. The international legal system relating to patents now covers a significant part of production in diverse fields, and efforts exist to extend intellectual property principles to cover all types of services, traditional knowledge, and scientific and technical data in the form of databases. Questions have been raised for some time now about what the underlying principles should be that govern intellectual property in the area of scientific and technical data. This talk addresses the interplay between legal and economic aspects on the one hand and moral and ethical aspects on the other, particularly from the viewpoint of the advancement of science itself, which is so fundamental for progress across the total spectrum of human endeavour. Issues concerning data access by the poor and by developing countries will also be addressed, along with examples illustrating the direction we need to go. Ultimately, overall human good has to be the deciding factor.

 

Invited Cross-Cutting Themes

1. Preserving and Archiving S&T Data
2. Legal Issues in the use of S&T Data
3. Interoperability and Data Integration
4. Information Economics for S&T Data
5. Emerging Tools and Techniques for Data Handling
6. Ethics in the use of S&T Data
7. CODATA 2015

 

1. Preserving and Archiving S&T Data

1. The Challenge of Archiving and Preserving Remotely Sensed Data
John L. Faundeen
US Geological Survey, EROS Data Center, Sioux Falls, SD 57198-0001

Few would question the need to archive the scientific and technical (S&T) data generated by researchers.  At a minimum, the data are needed for change analysis.  Likewise, most people would value efforts to ensure the preservation of the archived S&T data.  Future generations will use analysis techniques not even considered today.  Until recently, archiving and preserving these data were usually accomplished within existing infrastructures and budgets.  As the volume of archived data increases, however, organizations charged with archiving S&T data will be increasingly challenged.  The US Geological Survey has had experience in this area and has developed strategies to deal with the mountain of land remote sensing data currently being managed and the tidal wave of expected new data.  The Agency has dealt with archiving issues, such as selection criteria, purging, advisory panels, and data access, and has met with preservation challenges involving photographic and digital media.

 

2. The Virtual Observatory: The Future of Data and Information Management in Astrophysics
David Schade
Canadian Astronomy Data Centre, Herzberg Institute of Astrophysics, National Research Council, Canada

The concept of a “Virtual Observatory”, which would put the power of numerous ground-based and space-based observatories at the fingertips of astrophysical scientists, was once a pipe dream but is now represented by funded projects in Canada, the United States, the United Kingdom, and Europe. Astronomical data have been primarily digital for 15 years, and the change from analogue (e.g. photographic plates) to digital form triggered an appreciation for the scientific value of data “archiving” and the development of astronomy data centres around the world. These facilities do much more than passively “archive” their content. They have scientific and technical staff who develop the means to add value to datasets by additional processing, they integrate datasets from different wavelength regimes with one another, they distribute those data via the web, and they actively promote the use of archival data. The next step is to federate the diverse and complementary collections residing in data centres around the world, to develop seamless means for users to simultaneously access and query multi-wavelength databases and pixels, and to provide the computational resources for cross-correlation and other processing. In analogy to “the greatest encyclopedia that has ever existed” that has effectively come into being because of the internet, the Virtual Observatory will be an historic leap forward in the ability of scientists, and all human beings, to understand the universe we are part of.



3. Towards a New Knowledge of Global Climate Changes: Meteorological Data Archiving and Processing Aspects
Alexander M. Sterin
All-Russian Research Institute of Hydrometeorological Information (RIHMI-WDC), Russia

This presentation will focus on a wide range of aspects related to meteorological data utilization for getting new empirical information on climate variations. The problems of meteorological data collection, their quality assurance and control, and their archiving will be discussed.

The first and main focus will be on the problem of environmental data archiving and preservation. The collection of the All-Russian Research Institute for Hydrometeorological Information - World Data Center (RIHMI-WDC) is currently held on about 60 thousand volumes of 9-track magnetic tape. These archival media are obsolete, so urgent efforts to move the collection onto modern media are beginning.

The second focus will be on the multi-level approach to constructing informational products based on primary meteorological observational data. This approach presumes that the lowest level (level zero) holds the raw observational data. The next level (level one) holds the observational data that have passed quality-check procedures; normally, erroneous and suspicious data are flagged at level one. The higher levels contain derivative data products. Most customers prefer specialized derivative products, which are based on the primary data but have much easier-to-use formats and modest volumes, to the primary observational data, which have more complicated formats and huge volumes. The multi-level structure of derivatives for climate studies includes derivatives computed directly from the observational data, higher-level derivatives based on further generalization of lower-level products, and so on. Examples of such a multi-level structure of data products will be given.
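A minimal sketch of this level structure, assuming invented station identifiers, field layouts and quality thresholds rather than RIHMI-WDC's actual formats, might look as follows in Python: raw observations at level zero, quality-flagged records at level one, and a derived monthly mean as a higher-level product.

    # Level 0: raw station observations (station id, date, temperature in deg C).
    level0 = [
        ("27612", "2002-07-01", 21.4),
        ("27612", "2002-07-02", 19.8),
        ("27612", "2002-07-03", 85.0),   # physically implausible value
    ]

    # Level 1: the same records plus a quality flag; suspicious values are flagged, not deleted.
    def quality_check(obs, lower=-80.0, upper=60.0):
        station, date, value = obs
        flag = "ok" if lower <= value <= upper else "suspect"
        return (station, date, value, flag)

    level1 = [quality_check(obs) for obs in level0]

    # Level 2 (derivative product): monthly mean computed only from records that passed QC.
    good = [value for _, _, value, flag in level1 if flag == "ok"]
    monthly_mean = sum(good) / len(good)
    print(level1)
    print("July 2002 mean:", round(monthly_mean, 1))   # 20.6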

The third focus will be on the cycles of data processing that are required for large, data-based climate-related projects. Previous experience shows that it is important to preserve and reutilize the observational data collections and to repeat the main calculations. The preservation of primary observational data is very important, because it may be necessary to recalculate the higher-level products "from the very beginning." Normally these cycles may need to be repeated once (or even more than once) per decade.

The last focus will be on the software instrumentation used to obtain new information and new knowledge about climate change. The technological aspects of processing huge volumes of data in various formats will be described.

 

4. Strategies for Selection and Appraisal of Scientific Data for Preservation
Seamus Ross, University of Glasgow and Principal Director ERPANET, UK

With many governments and commercial organisations creating kilometres of analogue documents every year, archivists have long been confronted with the challenge of handling substantial quantities of records. Recognising the impossibility of retaining this material and documenting it in ways that would enable future users to discover and use it, archivists developed the concept of appraisal. Typically archives retain only between 5% and 10% of the records created by an organisation. Indeed, effective retention and disposal strategies have proven essential in ensuring that sufficient material is retained to provide an adequate record of our cultural, scientific, and commercial heritage. As we make the transition from a paper-based world to a digital one, archivists continue to recognise the power of appraisal as they attempt to manage the increasing amounts of material created digitally. The concepts that underlie appraisal are poorly understood outside the narrow confines of the archival world, but a wider appreciation of them might bring benefits to other data-creating and data-using communities.

Appraisal is both a technical process and an intellectual activity that requires knowledge, research, and imagination on the part of the appraiser. Characterised at its simplest level, appraisal involves establishing the value of continuing to retain and document data or records: what administrative, evidential, informational, legal, or re-usable value does a record, document, or data set have? The problem is of course compounded in the digital environment by the technical aspects of the material itself. Does technology change the processes, timeframe or relevance of appraisal? Or, to paraphrase the InterPARES Appraisal Task Force Report (January 2001), what impact does it have on ensuring that material of 'lasting value is preserved in authentic form'?

After charting the mechanisms and processes of appraisal, the paper examines how the digital environment has focused attention on establishing during the appraisal process whether or not it is feasible to maintain the authenticity and integrity of digital objects over time, and what impact this has on the process and on the point in the life of a digital object at which it must be appraised. The paper concludes by building on this work to examine the impact of the formal process of appraisal on the archiving of scientific data sets: who should be involved in and responsible for the process, what appraisal criteria might be appropriate, and at what stage in the life cycle of a digital object appraisal should be carried out.

2. Legal Issues in Using and Sharing Scientific and Technical Data

1. Search for Balance: Legal Protection for Data Compilations in the U.S.
Steven Tepp
US Copyright Office, Library of Congress, USA

The United States has a long history of providing legal protection against the unauthorized use of compilations of scientific and technical data. That protection, once broad and vigorous, is now diffuse and uncertain. In light of modern Supreme Court precedent, the U.S. Congress has struggled for several years to find the appropriate balance between providing an incentive for the creation of useful compilations of data through legal protections which allow the compiler to reap commercial benefit from his work and promoting the progress of science and useful arts by allowing researchers and scientists to have unfettered access to and use of such databases. My presentation will outline the history and current state of the legal protection afforded to databases in the United States and will then discuss the different legislative models of legal protection that have been the subject of considerable debate in the U.S. Congress in recent years.

 

2. Legal (dis)incentives for creating, disseminating, utilizing and sharing data for scientific and technical purposes
Kenji Naemura
Keio University, Shonan-Fujisawa Campus, Japan

While Japanese policy makers differ on practical strategies for recovery and growth after a decade of economic struggle, they all agree that, to restructure industry in a competitive environment, more vital roles should be played by advanced S&T, as well as by improved organizational and legal schemes. It is with this view that national research institutions have undergone structural reforms, and that national universities are to follow them in the near future.

Many of the enhanced legal schemes - e.g., patents granted to inventions in novel areas, copyrights on digital works, and other forms of IPRs - are supposed to give S&T researchers incentives to commercialize their results. However, some schemes - e.g., private data and security protections - may become disincentives for them to disseminate, utilize and share the results.

Thus the sui generis protection of databases introduced by the EU Directive of 1996 has raised serious concern in the scientific community. The Science Council of Japan conducted a careful study in its subcommittee on the possible merits and demerits of introducing a similar legal protection framework in Japan. The result was published as a declaration of its 136th Assembly on October 17, 2001. It emphasized "the principle of free exchange of views and data for scientific research and education" and, expressing its opposition to a new type of legal right in addition to copyright, stated that caution should be exercised in dealing with the international trend toward such legislation.

There are various factors that need to be considered in evaluating the advantages and disadvantages of legal protection of S&T data. They relate to the nature of the research area, the data, the originating organization, the research funding, the user and his or her purpose of use, and so on. Geographical, linguistic, cultural and economic conditions should also be considered when studying the consequences. After all, incentives for advancing S&T may not be easily translated into economic figures, and other types of contribution to human society must be valued more highly.

 

3. Scientific and Technical Data Policy and Management in China
Sun Honglie
Chinese Academy of Sciences, Beijing, China

The 21st century is known as an information era, in which scientific and technical data, as an important information source, will have significant effects on the social and economic development of the world. Scientific and technical data have academic, economic, social and other values. However, the greatest value is derived from scientific data not just through their creation and storage, but through their dissemination and wide application. In this regard, scientific and technical data policy and management have been treated as a strategic measure in the national information system and in the scientific and technical innovation programs in China. So far, scientific and technical data policy and management in China have made progress, in that:

a) A preliminary working pattern of scientific and technical data management has been shaped, with the main lead taken by government professional sections and with scientific institutes and universities serving a subsidiary role;
b) Digitization and networking are becoming more and more universal; and
c) Professional data management organizations are being formed and expanded.

At present, the scientific and technical data policy and management in China are mainly focused on: establishing and implementing the rules for "management and sharing of national scientific and technical data"; initiating a special project for the construction of a national scientific and technical data sharing system; and developing measures for the management of this data sharing system.


4. A Contractually Reconstructed Research Commons for Scientific Data in a Highly Protectionist Intellectual Property Environment
J.H. Reichman, Duke University School of Law, Durham, NC, USA and
Paul F. Uhlir, The National Academies, Washington, DC, USA

There are a number of well-documented economic, legal, and technological efforts to privatize government-generated and commercialize government-funded scientific data in the United States that were heretofore freely available from the public domain or on an "open access" basis. If these pressures continue unabated, they will likely lead to a disruption of long-established scientific research practices and to the loss of new opportunities that digital networks and related technologies make possible. These pressures could elicit one of two types of responses. One is essentially reactive, in which the public scientific community adjusts as best it can without organizing a response to the increasing encroachment of a commercial ethos upon its upstream data resources. The other would require science policy to address the challenge by formulating a strategy that would enable the scientific community to take charge of its basic data supply and to manage the resulting research commons in ways that would preserve its public good functions without impeding socially beneficial commercial opportunities. Under the latter option, the objective would be to reinforce and recreate, by voluntary means, a public space in which the traditional sharing ethos of science can be preserved and insulated from the commodifying trends. This presentation will review some approaches that the U.S. scientific community might consider in addressing this challenge, and that could have broader applicability to scientific communities outside the United States.

3. Interoperability and Data Integration

1. Interoperability in Geospatial Web Services
Jeff de La Beaujardiere
NASA Goddard Space Flight Center, USA

This talk will outline recent work on open standards for implementing interoperable geospatial web services.  Beginning in 1999, a series of Testbeds--operated by the OpenGIS Consortium (OGC), sponsored in part by US federal agencies, and involving the technical participation of industry, government and academia--has developed specifications and working implementations of geographic services to be deployed over HTTP.  Pilot Projects and Technology Insertion Projects have tested and deployed these standards in real-world applications.

These information-access services can provide an additional layer of interoperability above the data search capabilities provided by National Spatial Data Infrastructure (NSDI) Clearinghouse nodes.  The Web Map Service (WMS; published 2000) provides graphical renderings of geodata.  The Web Feature Service (WFS; 2002) provides point, line and vector feature data encoded in the XML-based Geography Markup Language (GML; 2001).  The Web Coverage Service (WCS; in preparation) provides gridded or ungridded coverage data.  Additional specifications for catalog, gazetteer, and fusion services are also in progress.  This talk will provide an overview of these efforts and indicate current areas of application.
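To make the flavour of these HTTP-based services concrete, here is a minimal sketch of a WMS GetMap request assembled in Python. The endpoint and layer name are placeholders; the parameter names follow the published WMS interface.

    from urllib.parse import urlencode

    # Hypothetical WMS endpoint; any OGC-compliant map server accepts the same query shape.
    base_url = "http://example.org/wms"

    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": "elevation",            # placeholder layer name
        "STYLES": "",
        "SRS": "EPSG:4326",               # geographic lat/lon coordinates
        "BBOX": "-80,40,-70,50",          # minx,miny,maxx,maxy
        "WIDTH": "600",
        "HEIGHT": "600",
        "FORMAT": "image/png",
    }

    # The GetMap operation is simply an HTTP GET carrying key=value parameters.
    print(base_url + "?" + urlencode(params))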

2. Expanding Spatial Data Infrastructure Capabilities to Optimize Use and Sharing of Geographic Data: A Developing World Perspective
Santiago Borrero
Global Spatial Data Infrastructure (GSDI), Instituto Geografico Agustin Codazzi, Colombia

The availability of spatial data infrastructure (SDI) capabilities at all levels, backed by international standards, guidelines and policies on access to data, is needed to support sustainable human development and to derive scientific, economic and social benefits from spatial information.

In this context, this paper focuses on the need for, and the current situation regarding, spatial data infrastructures, in particular from the perspective of the developing world. To this end, the author (i) presents the aims, scope and expected contributions of GSDI and PC IDEA; and (ii) based on these initiatives and their business plans, presents observations on the possibilities for improved data availability and interoperability. More than 50 nations are in the process of developing SDI capabilities, and there is a growing number of geodata-related initiatives at all levels. Finally, the author evaluates the need for better cooperation and coordination among spatial data initiatives and, where feasible and convenient, their integration to facilitate data access, sharing and applicability.

 

3. Interoperability of Biological Data Resources
Hideaki Sugawara, National Institute of Genetics, Japan

Biological data resources are composed of databases and data mining tools. The International Nucleotide Sequence Database (DDBJ/EMBL/GenBank) and homology search programs are typical resources that are indispensable to the life sciences and biotechnology. In addition to these fundamental resources, a large number of resources are available on the Internet, e.g. those listed in the annual database issue of the journal Nucleic Acids Research.

Biological data objects span widely: from molecules to phenotypes, from viruses to mammoths, from the bottom of the sea to outer space.

Users' profiles are also wide and diverse, e.g. finding anticancer drugs from any organism, anywhere, based on crosscutting heterogeneous data resources distributed across various categories and disciplines. Users often find novel ways of using resources that the developers did not imagine. Biological data resources have often been developed ad hoc, without any international guidance on standardization, resulting in heterogeneous systems. Crosscutting these systems is therefore a hard task for bioinformaticians. It is not practical to reform large legacy systems in accordance with a standard, even if a standard is created.

Interoperability may be a solution that provides an integrated view of heterogeneous data sources distributed across many disciplines and in distant places. We studied the Common Object Request Broker Architecture (CORBA) and found it quite useful for making data sources interoperable within a local area network. Nevertheless, it is not straightforward to use CORBA to integrate data resources across firewalls; CORBA is not firewall-friendly.

Recently, XML (eXtensible Markup Language) has become widely tested and used in so-called e-Business, and it is also being applied extensively in biology. However, defining a Document Type Definition (DTD) or XML schema is not sufficient for the interoperability of biological data resources, because multiple groups define different XML documents for the same biological object. These heterogeneous XML documents can be made interoperable by use of SOAP (Simple Object Access Protocol), WSDL (Web Services Description Language) and UDDI (Universal Description, Discovery and Integration). The author will present the implementation and evaluation of these technologies in WDCM (http://wdcm.nig.ac.jp), the Genome Information Broker (http://gib.genes.nig.ac.jp/) and DDBJ (http://xml.nig.ac.jp).

DDBJ: DNA Data Bank of Japan
EMBL: European Molecular Biology Laboratory
GenBank: National Center for Biotechnology Information
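As a small sketch of the heterogeneity problem described above (both XML documents and all element names are invented for illustration), the fragment below maps two different XML encodings of the same sequence record onto one common view - the kind of reconciliation that a SOAP/WSDL service layer would otherwise expose:

    import xml.etree.ElementTree as ET

    # Two hypothetical XML documents describing the same biological object with different schemas.
    doc_a = "<entry><accession>AB000001</accession><organism>Escherichia coli</organism></entry>"
    doc_b = "<Sequence id='AB000001'><source taxon='Escherichia coli'/></Sequence>"

    def from_schema_a(xml_text):
        root = ET.fromstring(xml_text)
        return {"accession": root.findtext("accession"),
                "organism": root.findtext("organism")}

    def from_schema_b(xml_text):
        root = ET.fromstring(xml_text)
        return {"accession": root.get("id"),
                "organism": root.find("source").get("taxon")}

    # A common, schema-neutral view is what makes the heterogeneous sources interoperable.
    print(from_schema_a(doc_a) == from_schema_b(doc_b))   # True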

4. The Open Archives Initiative: A low-barrier framework for interoperability
Carl Lagoze
Department of Computer Science, Cornell University, USA

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is the result of work in the ePrints, digital library, and museum communities to develop a practical and low-barrier foundation for data interoperability.  The OAI-PMH provides a method for data repositories to expose metadata in various forms about their content.  Harvesters may then access this metadata to build value-added services.  This talk will review the history and technology behind the OAI-PMH and describe applications that build on it.
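The protocol itself is deliberately simple: a harvester issues plain HTTP requests carrying a small set of verbs and receives XML in return. A minimal sketch, with a placeholder repository URL, of building a ListRecords request for Dublin Core metadata:

    from urllib.parse import urlencode

    # Hypothetical OAI-PMH repository endpoint; real repositories expose the same verb-based interface.
    base_url = "http://example.org/oai"

    # ListRecords with the unqualified Dublin Core format that every OAI-PMH repository must support.
    params = {
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",
        "from": "2002-01-01",      # optional selective-harvesting date range
        "until": "2002-09-29",
    }

    print(base_url + "?" + urlencode(params))
    # A harvester would fetch this URL, parse the XML response, and follow any resumptionToken
    # elements to page through large result sets.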

 

4. Information Economics for S&T Data

1. Legal Protection of Databases and Science in the "European Research Area":
Economic Policy and IPR Practice in the Wake of the 1996 EC Directive

Paul A. David
Stanford University and All Souls College, Oxford

At the Lisbon Meeting of the European Council in March 2000, the member states agreed that the creation of a "European Research Area" should be a high-priority goal of EU and national government policies in the coming decade. Among the policy commitments taking shape are those directed toward substantially raising the level of business R&D expenditures, not only by means of subsidies and fiscal tools (e.g., tax incentives), but also through intellectual property protections aimed at "improving the environment" for business investment in R&D. The European Commission currently is preparing recommendations for the implementation of IP protections in future Framework Programmes and related mechanisms that fund R&D projects, including policies affecting the use of legal protections afforded to database owners under the national implementations of the EC's Directive of March 11, 1996. This paper reviews the economic issues of IPR in databases and the judicial experience and policy pressures that have developed in Europe in the years following the implementation of the EC's directive. It attempts to assess the likely implications these will carry for scientific research in the ERA.

 

2. International Protection of Non-Original Databases
Helga Tabuchi
Copyright Law Division, WIPO, Geneva, Switzerland

At the request of its member States, the International Bureau of the World Intellectual Property Organization (WIPO) commissioned external consultants to prepare economic studies on the impact of the protection of non-original databases. The studies were requested to be broad, covering not only economic issues in a narrow sense, but also social, educational and access to information issues. The consultants were furthermore expected to focus in particular on the impacts in developing, least developed and transition economies.

Five of the studies were completed in early 2002 and were submitted to the Standing Committee on Copyright and Related Rights at its seventh session in May 2002. The positions of the consultants differ significantly. The studies are available on WIPO's website at <http://www.wipo.int/eng/meetings/2002/sccr/index_7.htm>.

Most recently, another consultant has been commissioned to prepare an additional study that focuses on the Latin American and Caribbean region. The study will be submitted to the Committee in due course.

 

3. The Digital National Framework: Underpinning the Knowledge Economy
Keith Murray
Geographic Information Strategy, Ordnance Survey, UK

Decision making requires knowledge, knowledge requires reliable information, and reliable information requires data from several sources to be integrated with assurance. An underlying factor in many of these decisions is geography, within an integrated geographic information infrastructure.

In Great Britain, the use of geographic information is already widespread across many customer sectors (e.g. central government, local authorities, land and property professionals, utilities) and supports many hundreds of private sector applications. An independent study in 1999 showed that £100 billion of GB GDP per annum is underpinned by Ordnance Survey information. However, little of the information that is collected, managed and used today can be easily cross-referenced or interchanged; often time and labour are required that do not directly contribute to the customer's project goals. Ordnance Survey's direction is driven by solving customer needs such as this.

To meet this challenge Ordnance Survey has embarked on several parallel developments to ensure that customers can start to concentrate on gaining greater direct benefits from GI. This will be achieved through major investments in the data and service delivery infrastructure the organisation provides. Key initiatives already under way aim to establish new levels of customer care, supported by new customer-friendly on-line service delivery channels. The evolving information infrastructure has been designed to meet national needs but is well placed to support wider initiatives such as the emerging European Spatial Data Infrastructure (ESDI), or INSPIRE as it is now called.

Since 1999 Ordnance Survey has been independently financed through revenues from the sale of goods. It is this freedom which is allowing the organisation to further invest surplus revenues into the development of the new infrastructure. Ordnance Survey's role is not to engage in the applications market, but to concentrate on providing a high-quality spatial data infrastructure. We believe that the adoption of this common georeferencing framework will support government, business and the citizen in making key decisions in the future, based on joined-up geographic information and thereby sound knowledge.

 

4. Borders in Cyberspace: Conflicting Public Sector Information Policies and their Economic Impacts
Peter Weiss
Strategic Planning and Policy Office, National Weather Service, National Oceanographic and Atmospheric Administration (NOAA), USA

Many nations are embracing the concept of open and unrestricted access to public sector information -- particularly scientific, environmental, and statistical information of great public benefit. Federal information policy in the US is based on the premise that government information is a valuable national resource and that the economic benefits to society are maximized when taxpayer-funded information is made available inexpensively and as widely as possible. This policy is expressed in the Paperwork Reduction Act of 1995 and in Office of Management and Budget Circular No. A-130, “Management of Federal Information Resources.” It actively encourages the development of a robust private sector by offering to provide publishers with the raw content from which new information services may be created, at no more than the cost of dissemination and without copyright or other restrictions. In other countries, particularly in Europe, publicly funded government agencies treat their information holdings as a commodity to be used to generate revenue in the short term. They assert monopoly control over certain categories of information in an attempt -- usually unsuccessful -- to recover the costs of its collection or creation. Such arrangements tend to preclude other entities from developing markets for the information or otherwise disseminating the information in the public interest. The US government and the world scientific and environmental research communities are particularly concerned that such practices have decreased the availability of critical data and information. And firms in emerging information-dependent industries seeking to utilize public sector information find their business plans frustrated by restrictive government data policies and other anticompetitive practices.

 

5. Emerging Tools and Techniques for Data Handling

1. From GeoSpatial to BioSpatial: Managing Three-dimensional Structure Data in the Sciences
Xavier R. Lopez, Oracle Corporation

Standard relational database management technology is emerging as a critical technology for managing the large volumes of 2D and 3D vector data being collected in the geographic and life sciences. For example, database technology is playing an important role in managing the terabytes of vector information used in environmental modeling, emergency management, and wireless location-based services. In addition, three-dimensional structure information is integral to a new generation of drug discovery platforms. Three-dimensional structure-based drug design helps researchers generate high-quality molecules that have better pharmacological properties. This type of rational drug design is critically dependent on the comprehensive and efficient representation of both large (macro) molecules and small molecules. The macromolecules of interest are the large protein molecules of enzymes, receptors, signal transducers, hormones, and antibodies. With the recent availability of detailed structural information about many of these macromolecule targets, drug discovery is increasingly focused on detailed structure-based analysis of the interaction of the active regions of these large molecules with candidate small-molecule drug compounds that might inhibit, enhance, or otherwise therapeutically alter the activity of the protein target. This paper will explain the means to manage these three-dimensional data types from the geosciences and biosciences in object-relational database technology, in order to benefit from the performance, scalability, security, and reliability of commercial software and hardware platforms. It will highlight recent developments in database software technologies that address the 3D requirements of the life science community.
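As a generic illustration of the idea only - using SQLite from the Python standard library rather than any particular commercial product, and invented table and column names - the sketch below stores 3D atomic coordinates in a relational table and runs the kind of bounding-box query used to find atoms near a candidate binding site:

    import sqlite3

    # In-memory relational store for 3D point data; an object-relational spatial type plays this
    # role (with indexing and richer operators) in the commercial systems discussed in the talk.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE atom (molecule TEXT, name TEXT, x REAL, y REAL, z REAL)")
    con.executemany(
        "INSERT INTO atom VALUES (?, ?, ?, ?, ?)",
        [("1ABC", "CA", 12.1, 4.3, -7.8),
         ("1ABC", "N",  13.0, 5.1, -8.2),
         ("1ABC", "O",  40.5, 2.2, 10.0)],
    )

    # Bounding-box query: atoms inside a 10 x 10 x 10 region around a hypothetical active site.
    rows = con.execute(
        "SELECT name, x, y, z FROM atom "
        "WHERE x BETWEEN 10 AND 20 AND y BETWEEN 0 AND 10 AND z BETWEEN -10 AND 0"
    ).fetchall()
    print(rows)   # [('CA', 12.1, 4.3, -7.8), ('N', 13.0, 5.1, -8.2)]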

 

2. Benefits and Limitations of Mega-Analysis Illustrated using the WAIS
John J. McArdle, Department of Psychology, University of Virginia, USA
David Johnson, Building Engineering and Science Talent, San Diego, CA, USA

The statistical techniques of meta-analysis, based on the summary statistics from many different studies, have been highly developed and are widely used (Cook et al., 1994). However, there are some key limitations to meta-analysis, especially the necessity for equivalence of measurements and inferences about individuals from groups. These problems led us to use an approach we have termed “mega-analysis” (McArdle & Horn, 1980-1999). In this approach all raw data from separate studies are used as a collective. The techniques of mega-analysis rely on a variety of methods initially developed for statistical problems of “missing data,” “selection bias,” “factorial invariance,” “test bias,” and “multilevel analyses.” In the mega-analysis of multiple sets of raw data, (a) the degree to which data from different collections can be combined is raised as a multivariate statistical question, (b) parameters can be estimated with more breadth, precision, and reliability than can be achieved by any single study, and (c) meta-analysis results emerge as a byproduct, so the assumptions may be checked and it can be demonstrated why a simpler meta-analysis is adequate. Mega-analysis techniques are illustrated here using a collection of data from the popular “Wechsler Adult Intelligence Scale” (WAIS), including data from thousands of people in over 100 research studies.
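A minimal sketch of the distinction, using entirely synthetic data and invented variable names (the actual WAIS analyses involve far richer measurement models), contrasts combining per-study summary statistics with fitting one model to the pooled raw records while retaining study membership:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # Synthetic raw data from three separate studies measuring the same score at different ages.
    studies = []
    for study_id, n, offset in [("A", 50, 0.0), ("B", 30, 2.0), ("C", 40, -1.0)]:
        age = rng.uniform(20, 80, n)
        score = 110 - 0.3 * age + offset + rng.normal(0, 5, n)
        studies.append(pd.DataFrame({"study": study_id, "age": age, "score": score}))

    pooled = pd.concat(studies, ignore_index=True)

    # Meta-analysis style: combine per-study summary statistics (here, each study's age slope).
    per_study_slopes = pooled.groupby("study").apply(
        lambda d: np.polyfit(d["age"], d["score"], 1)[0])
    print("mean of per-study slopes:", per_study_slopes.mean())

    # Mega-analysis style: fit one model to all raw records, keeping study membership as a
    # covariate so between-study level differences do not bias the pooled age slope.
    X = np.column_stack([pooled["age"], pd.get_dummies(pooled["study"], dtype=float)])
    slope = np.linalg.lstsq(X, pooled["score"], rcond=None)[0][0]
    print("pooled (mega-analysis) slope:", slope)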

 

3. Publication, Retrieval and Exchange of Data: an Emerging Web-based Global
Solution
Henry Kehiaian
ITODYS, University of Paris 7, Paris, France

In the era of enhanced electronic communication and the world-wide development of information systems, electronic publishing and the Internet offer powerful tools for the dissemination of all types of scientific information. Primary, secondary, and tertiary sources are now made available in electronic form. However, because of the multitude of existing physico-chemical properties and the variety of modes of their presentation, the computer-assisted retrieval of numerical values and their analysis and integration into databases remain as difficult as before. Accordingly, the need for standard data formats is more important than ever. CODATA has joined forces with IUPAC and ICSTI to develop such formats.

Three years after its establishment the IUPAC-CODATA Task Group on Standard Physico-Chemical Data Formats (IUCOSPED) has made significant progress in developing the presentation of numerical property data, as well as the relevant metadata, in standardized electronic format (SELF).

The retrieval of SELFs is possible via a web-based specialized Central Data Information Source, called DataExplorer, conceived as a portal to data sources. An Oracle database has been designed and developed for DataExplorer at FIZ Karlsruhe, Germany (http://www.fiz-karlsruhe.de/dataexplorer/; ID: everyone; Password: sesame). DataExplorer is now fully operational and demonstrates the concept with 4155 chemical components, 998 original data sources, 41 property types, and 3805 Standard Electronic Data Files (SELF). Inclusion of additional data will be actively pursued in the future. A link has been established from DataExplorer to one of the associated publishers, the Data Center of the Institute of Chemical Technology, Praha, Czech Republic. Retooling SELF in SELF-ML, an XML version of the current SELF formats, is under way. Besides an on-line demonstration of DataExplorer from FIZ Karlsruhe and Praha, the procedure will be illustrated by computer demonstrations of two publications: (1) the Vapor-Liquid Equilibrium Bibliographic Database; (2) ELDATA, the International Electronic Journal of Physico-Chemical Data.
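Purely to suggest the kind of structure a standardized electronic data format with explicit metadata can take - the element names below are invented and are not the actual SELF or SELF-ML specification - here is a small XML record and its parsing in Python:

    import xml.etree.ElementTree as ET

    # Invented XML layout illustrating numerical property data carried together with its metadata.
    record = """
    <dataset>
      <metadata>
        <property>Vapor pressure</property>
        <substance casrn="64-17-5">ethanol</substance>
        <units temperature="K" pressure="kPa"/>
        <source>hypothetical journal reference</source>
      </metadata>
      <points>
        <point T="298.15" p="7.87"/>
        <point T="313.15" p="17.9"/>
      </points>
    </dataset>
    """

    root = ET.fromstring(record)
    prop = root.findtext("metadata/property")
    substance = root.findtext("metadata/substance")
    points = [(float(p.get("T")), float(p.get("p"))) for p in root.findall("points/point")]
    print(prop, "of", substance, ":", points)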

This project was awarded $100,000 under the ICSU (International Council for Science) Grants Program 2000 for new, innovative projects of high-profile potential.

Acknowledgments
We express our sincere thanks to UNESCO and ICSU and its associated organizations, IUPAC, CODATA and ICSTI, for financial assistance; to the IDF, IUCr, and CAS representatives for helpful discussions; to all IUCOSPED Members and Observers for their contributions; to the FIZ Karlsruhe administration and its highly competent programmers; and to all the associated Publishers.


4. Creating Knowledge from Computed Data for the Design of Materials
Erich Wimmer
Materials Design s.a.r.l., France and USA

The dramatic progress in computational chemistry and materials science has made it possible to carry out ‘high-throughput computations’, resulting in a wealth of reliable computed data including crystallographic structures, thermodynamic and thermomechanical properties, adsorption energies of molecules on surfaces, and electronic, optical and magnetic properties. An exciting perspective comes from the application of combinatorial methodologies, which allow the generation of large sets of new compounds. High-throughput computations can be employed to obtain a range of materials properties, which can be stored together with subsequent (or parallel) experimental data. Furthermore, one can include defects such as vacancies or grain boundaries in the combinatorial space, and one can apply external pressure or stress up to extreme conditions. Convenient graphical user interfaces facilitate the construction of these systems, and efficient computational methods, implemented on networked parallel computers of continuously growing computational power, allow the generation of an unprecedented stream of data. This lecture will discuss experience with a technology platform, MedeA (Materials Exploration and Design Analysis), which has been developed by Materials Design with the capabilities described above in mind. Using heterogeneous catalysis as an example, I will illustrate how chemical concepts can be combined with high-throughput computations to transform the computed data into information and knowledge and enable the design of novel materials.

 

6. Ethics in the Creation and Use of Scientific and Technical Data

1. Ethics and Values Relating to Scientific & Technical Data:
Lessons from Chaos Theory
Joan E. Sieber, NSF

Current literature reveals manifold conflicting, shifting and cross-cutting values to be reconciled if we are to pursue intelligent data-management policies. Projects currently underway to deal with these complexities and uncertainties suggest the inevitability of a paradigm shift. Consider, e.g., questions of what data to archive, how extensively to document it, how to maintain its accessibility despite changing software and hardware, who should have access, how to allocate the costs of sharing, and so on. Traditional normative ethical theories (e.g., utilitarianism) can suggest guiding principles, and in today's global culture, recent ethical (e.g., Rawlsian) notions such as consideration of the interests of unborn generations and of persons situated very differently from oneself suddenly have immediate practical implications. However, such traditional approaches to ethical problem solving offer little guidance for dealing with problems that are highly contextual, complex, ill-defined, dynamic and fraught with uncertainty. Narrowly defined safety issues give way to notions of the ecology of life on Earth. Minor changes can have major consequences. The stakeholders are not only scientists and engineers from one's own culture, but persons, professions, businesses and governments worldwide, as they exist today and into the future. Issues of scientific freedom and openness are in conflict with issues of intellectual property, national security, and reciprocity between organizations and nations. Ethical norms, codes, principles, theories, regulations and laws vary across cultures, and often have unintended consequences that hinder ethical problem solving. Increasingly, effective ethical problem solving depends on integration with scientific and technological theory and "know-how" and on empirical research on the presenting ethical problem. For example, we look increasingly to psychological theories and legal concepts for clearer notions of privacy, and to social experiments, engineering solutions and methodological innovation for ways to assure confidentiality of data. We often find that one solution does not fit all related problems.

Chaos theory has taught us principles of understanding and coping with complexity and uncertainty that are applicable to ethical problem solving of data-related issues. Implications of chaos theory are explored in this presentation, both as new tools of ethical problem solving and as concepts and principles to include in the applied ethics education of students in science and engineering.


2. Understanding and improving comparative data on science and technology
Denise Lievesley, UNESCO Institute for Statistics

Statistics can serve to benefit society, but, when manipulated politically or otherwise, may be used as instruments by the powerful to maintain the status quo or even for the purposes of oppression. Statisticians working internationally face a range of ethical problems as they try to 'make a difference' to the lives of the poorest people in the world. One of the most difficult is the dilemma between open accountability and national sovereignty (in relation to what data are collected, the methods used and who is to have access to the results).

This paper will discuss the role of the UNESCO Institute for Statistics (UIS), explain some of the constraints under which we work, and address the principles which govern our activities. The UIS is involved in:

  • The collection and dissemination of cross-nationally comparable data and indicators, guardianship of these databases and support of, and consultation with, users
  • The analysis and interpretation of cross-national data
  • Special methodological and technical projects including the development of statistical concepts
  • The development and maintenance of international classifications, and standardised procedures to promote comparability of data
  • Technical capacity building and other support for users and producers of data within countries
  • Establishing and sharing good practice in statistics, supporting activities which improve the quality of data and preventing the re-invention of the wheel
  • Advocacy for evidence-based policies

Of these activities one of the key ones is to foster the collection of comparable data across nations, the main objectives being to enable countries to gain a greater understanding of their own situation by comparing themselves with others, thus learning from one another and sharing good practice; to permit the aggregation of data across countries to provide a global picture; and to provide information for purposes of the accountability of nations and for the assessment, development and monitoring of supra-national policies.

Denise Lievesley will discuss the consultation being carried out by the UIS to ensure that the data being collected on a cross-national basis are of relevance to national policies on science and technology. The consultation process was launched with an expert meeting where changes in science policy were debated and ways in which the UIS might monitor and measure scientific and technological activities and progress across the world were identified. A background paper was produced based on the experiences and inputs of experts from different regions and organizations, which addresses key policy issues in science and technology. The UIS will use this document as a basic reference for direct consultation with UNESCO Member States and relevant institutions. A long term strategy for the collection of science and technology data will be developed as a result of these consultations.

It is vital to build on the experience of developed countries through the important statistical activities of OECD and Eurostat but nevertheless to ensure that the collection of cross-nationally harmonised data does not distort the priorities of poorer countries. We are seeking a harmony of interests in data collection and use and the views of the participants will be sought as to how this might be achieved.


3. Ethics - An Engineers' View
Horst Kremers, Comp. Sci., Berlin, Germany

The engineering profession has long experience in developing principles for appropriate relations with clients, publishing codes of ethics, and developing and adhering to laws controlling the conduct of professional practice. A high demand exists in society for reliable engineering in planning, design, construction and maintenance. One of the primary objectives of an engineer's actions is to provide control over a situation by providing independent advice in conformance with moral principles in addition to sound engineering principles. In a world where life to an increasing extent depends on the reliable functioning of complex information systems, and where new techniques emerge without the chance for controlled experimentation and assessment, the need to inject ethical principles into scientific and technological decision-making and to fully consider the consequences of professional actions is mandatory. This presentation reviews several code-of-ethics development efforts and reflects on the codes relative to action principles in science and technology. A potential role for CODATA is presented.

 

4. Ethics in Scientific and Technical Communication
Hemanthi Ranasinghe, University of Sri Jayewardenepura, Sri Lanka

Research can be described as operationally successful when the research objectives are achieved, and technically successful when the researcher's understanding is enhanced, more comprehensive hypotheses are developed, and lessons are learned from the experience. However, research is not successful scientifically until the issues, processes and findings are made known to the scientific community. Science is not an individual experience. It is shared knowledge based on a common understanding of some aspect of the physical or social world. For that reason, the social conventions of science play an important role in establishing the reliability of scientific knowledge. If these conventions are disrupted, the quality of science can suffer. Thus, the reporting of scientific research has to be right on ethical grounds too.

The general category of ethics in communication covers many things. One is error and negligence in science. Some researchers may feel that the pressures on them are an inducement to haste at the expense of care. For example, they may believe that they have to do substandard work to compile a long list of publications and that this practice is acceptable. Or they may be tempted to publish virtually the same research results in two different places, or to publish their results in "least publishable units" - papers that are just detailed enough to be published but do not give the full story of the research project described.

Sacrificing quality to such pressures can easily backfire. A lengthy list of publications cannot outweigh a reputation for shoddy research. Scientists with a reputation for publishing work of dubious quality will generally find that all of their publications are viewed with skepticism by their colleagues. Another vital aspect of unethical behavior in scientific communication is misconduct in science. This entails making up data or results (fabrication), changing or misreporting data or results (falsification), and using the ideas or words of another person without giving appropriate credit (plagiarism) - all of which strike at the heart of the values on which science is based. These acts of scientific misconduct not only undermine progress but the entire set of values on which the scientific enterprise rests. Anyone who engages in any of these practices is putting his or her scientific career at risk. Even infractions that may seem minor at the time can end up being severely punished. Frank and open discussion of the division of credit within research groups - as early in the research process as possible, and preferably at the very beginning, especially for research leading to a published paper - can prevent later difficulties.

Misallocation of credit and errors arising from negligence are matters that generally remain internal to the scientific community. Usually they are dealt with locally through the mechanisms of peer review, administrative action, and the system of appointments and evaluations in the research environment. But misconduct in science is unlikely to remain internal to the scientific community. Its consequences are too extreme: it can harm individuals outside of science (as when falsified results become the basis of a medical treatment), it squanders public funds, and it attracts the attention of those who would seek to criticize science. As a result, federal agencies, Congress, the media, and the courts can all get involved.

All parts of the research system have a responsibility to recognize and respond to these pressures. Institutions must review their own policies, foster awareness of research ethics, and ensure that researchers are aware of the policies that are in place. And researchers should constantly be aware of the extent to which ethically based decisions will influence their success as scientists.

 

7. CODATA 2015

1. Scholarly Information Architecture
Paul Ginsparg
Cornell University, USA

If we were to start from scratch today to design a quality-controlled archive and distribution system for scientific and technical information, it could take a very different form from what has evolved in the past decade from pre-existing print infrastructure. Ultimately, we might expect some form of global knowledge network for research communications. Over the next decade, there are many technical and non-technical issues to address along the way, everything from identifying optimal formats and protocols for rendering, indexing, linking, querying, accessing, mining, and transmitting the information, to identifying sociological, legal, financial, and political obstacles to realization of ideal systems. What near-term advances can we expect in automated classification systems, authoring tools, and next-generation document formats to facilitate efficient datamining and long-term archival stability? How will the information be authenticated and quality controlled? What differences should be expected in the realization of these systems for different scientific research fields? What is the proper role of governments and their funding agencies in this enterprise, and what might be the role of suitably configured professional societies? These and related questions will be considered in light of recent trends.


2. The role of scientific data in a complex world
Werner Martienssen
Physikalisches Institut der Universität Frankfurt am Main, Germany

Physicists try to understand and to describe the world in terms of natural laws. These laws embody two quite different approaches in physics. First, the laws exhibit a mathematical structure, which in general is understood in terms of first principles, geometrical relations, and symmetry arguments. Second, the laws contain data that are characteristic of the specific properties of the phenomena and objects. Insight into the mathematical structure aims at an understanding of the world in ever more universally applicable terms. Insight into the data reveals the magnificent diversity of the world's materials and their behavior. Whereas the description of the world in terms of a unified theory might one day be reduced to only one set of equations, the amount of data necessary to describe the phenomena of the world in their full complexity seems to be open-ended.

A unified theory has not yet been formulated, nor can we say that our knowledge of the data is complete in any sense; much remains to be done. But asked where we expect to be in data physics and chemistry in ten to fifteen years, my answer is: we will, hopefully, be able to merge the two approaches of physics. On the basis of our understanding of materials science, and by using the methods of computational physics, we will make use both of the natural laws and of the complete set of known data in order to model, to study, and to generate new materials, new properties, and new phenomena.

 

3. Life Sciences Research in 2015
David Y. Thomas
Biochemistry Department, McGill University, Montreal, Canada

Much of the spectacular progress of life sciences research in the past 30 years has come from the application of molecular biology, employing a reductionist approach with single genes, often studied in simple organisms. Now, with the technologies of genomics and proteomics, scientists are deluged with data of increasing quantity, variety, and quality. The challenge is how life sciences researchers will use the data output of discovery science to formulate questions and experiments for their research and turn this into knowledge. What are the important questions? We now have the capability to address at a profound level major biological problems: how genes function, how the development of organisms is controlled, and how living systems interact at the cellular, organismal, and population levels. What data and what tools are needed? What skills and training will be needed for the next generation of life sciences researchers? I will discuss some of the initiatives that are planned or now underway to address these problems.

 

Tuesday Evening Public Sessions

Biodiversité - quelles sont les espèces, où se trouvent-elles?
Guy Baillargeon

Les connaissances sur les espèces vivantes sont documentées dans des systèmes de classification élaborés et constamment mis-à-jour par les taxonomistes. D'autre part, la connaissance de la distribution des espèces au sein de la biosphère est encore aujourd'hui principalement dérivée de l'information associée à des spécimens conservés dans les musées et les collections d'histoire naturelle. À ceci s'ajoute pour plusieurs groupes d'organismes vivants, un grand nombre d'observations individuelles colligées par des groupes d'intérêt spécialisés (tel que dans le cas des oiseaux, par les clubs d'ornithologie). Le Réseau mondial d'information sur la biodiversité (SMIB), mieux généralement connu sous son nom anglais de 'Global Biodiversity Information Facility' (GBIF), entend favoriser l'accès à toute cette information en établissant un vaste réseau distribué de bases de données scientifiques transopérables et ouvertes à tous. Encore en début d'implantation, GBIF jouera bientôt un rôle crucial en favorisant la standardisation, la digitalisation et la dissémination de l'information scientifique relative à la biodiversité partout dans le monde. Déjà, plusieurs organisations membres de GBIF ont annoncé leur intention de s'associer pour développer un inventaire de toutes les formes de vie connues (Catalogue de la Vie) et un nombre croissant d'institutions permettent l'accès direct aux données de leurs collections par voie de requêtes distribuées. La présentation fournira des exemples de ce qu'il est déjà possible de faire en matière de transopérabilité en associant un ou plusieurs systèmes de classification avec un moteur de recherche et de cartographie automatisé interreliant les données de distribution de plusieurs millions de spécimens et d'observations fournies par des dizaines d'institutions participantes à l'un des réseaux d'information distribués qui coexistent présentement sur l'Internet.

Presentation is in French; this is the English abstract:

Biodiversity - what are the species, where are they?

Guy Baillargeon

Knowledge of living species is documented through elaborate classification systems that are constantly updated by taxonomists. Knowledge of the distribution of species in the biosphere is still today mainly derived from label information associated with specimens preserved in natural history collections. In addition, for many groups of living organisms (such as birds), large numbers of individual observations are collected by specialized interest groups. The Global Biodiversity Information Facility (GBIF) intends to facilitate access to all of this information by establishing an interoperable, distributed network of scientific databases freely available to all. Still in its early stages, GBIF is expected soon to play a crucial role in promoting the standardization, digitization, and global dissemination of the world's scientific biodiversity data within an appropriate framework for property rights and due attribution. Already, organisations associated with GBIF have announced their intention of working together towards a Catalogue of Life, conceived as a knowledge set of the names of all known organisms, and a growing number of institutions are providing direct access to the data associated with their collections via distributed queries. Examples will be presented of what is already possible in terms of interoperability when coupling one or more classification systems with an automated search and map engine that interconnects millions of distributional records provided by dozens of institutions participating in one of the many distributed biodiversity information networks that now coexist on the Internet.
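As a loose illustration of the distributed-query idea described above, the following Python sketch fans one species query out to several institutional collections and pools the returned specimen records for mapping. It is a toy only: the provider functions, record fields, and data are hypothetical and do not represent GBIF's actual protocol or any institution's real interface.

# Hypothetical sketch of a distributed biodiversity query: a portal sends the
# same species query to several institutional databases and merges the results.
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class SpecimenRecord:
    scientific_name: str   # name under the classification system in use
    institution: str       # collection holding the specimen or observation
    latitude: float
    longitude: float

# Each "provider" stands in for one institution's database; in a real network
# these would be remote services answering a shared query protocol.
def museum_a(name: str) -> Iterable[SpecimenRecord]:
    data = [SpecimenRecord("Danaus plexippus", "Museum A", 45.5, -73.6)]
    return [r for r in data if r.scientific_name == name]

def herbarium_b(name: str) -> Iterable[SpecimenRecord]:
    data = [SpecimenRecord("Danaus plexippus", "Herbarium B", 19.4, -99.1)]
    return [r for r in data if r.scientific_name == name]

def distributed_query(name: str,
                      providers: List[Callable[[str], Iterable[SpecimenRecord]]]
                      ) -> List[SpecimenRecord]:
    """Fan the query out to every participating institution and pool the results."""
    results: List[SpecimenRecord] = []
    for provider in providers:
        results.extend(provider(name))
    return results

if __name__ == "__main__":
    for rec in distributed_query("Danaus plexippus", [museum_a, herbarium_b]):
        print(f"{rec.scientific_name}: {rec.institution} ({rec.latitude}, {rec.longitude})")

In practice, the merged records would then be handed to a mapping engine, which is the coupling of classification, distributed retrieval, and cartography that the presentation demonstrates.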

 

Visualizations of our Planet's Atmosphere, Land & Oceans
Fritz Hasler
NASA Goddard Laboratory for Atmospheres, USA

See how High-Definition Television (HDTV) is revolutionizing the way we communicate science. Go back to the early weather satellite images from the 1960s and see them contrasted with the latest US and international global satellite weather movies, including hurricanes and "tornadoes". See the latest visualizations of spectacular images from NASA/NOAA remote sensing missions such as Terra, GOES, TRMM, SeaWiFS, and Landsat 7, including new 1-minute GOES rapid-scan image sequences of the November 9, 2001 Midwest tornadic thunderstorms. New computer software tools allow us to roam and zoom through massive global images, e.g. Landsat tours of the US and Africa, showing desert and mountain geology as well as seasonal changes in vegetation. See dust storms in Africa and smoke plumes from fires in Mexico. Fly in and through venues using 1-m IKONOS "Spy Satellite" data. See vortices and currents in the global oceans that bring up nutrients. See how the ocean blooms in response to these currents and to El Niño/La Niña climate changes. The presentation will be made using the latest HDTV technology from a portable computer server.

Presented by Dr. Fritz Hasler of the NASA Goddard Space Flight Center. http://Etheater.gsfc.nasa.gov

 

Last site update: 25 September 2002