18th International Conference

CODATA 2002

Frontiers of Scientific and Technical Data

Montréal, Canada
29 September - 3 October 2002

Data Science




Track I-D-5:
Data Science

Chair: Jacques-Emile Dubois, ITODYS, Université de Paris VII - France and Past-President, CODATA

1. Quality Control of Data in Data-Sharing Practices and Regulations
Paul Wouters and Anne Beaulieu, Networked Research and Digital Information (Nerdi), NIWI-KNAW, The Royal Netherlands Academy of Arts and Sciences, The Netherlands

Scientific research is generating increasing amounts of data: each year, more data are generated than in all previous years combined. At the same time, knowledge production is becoming more dependent on data sets. This puts the question of data quality control at centre stage. How is the scientific system coping with the formidable task of controlling the quality of this flood of data? One area in which this question has not yet been fully explored is the domain of data-sharing practices and regulations. The need to share data among researchers and between researchers and the public has been put on the agenda at the level of science policy (Franken 2000), partly out of fear that the system might not be able to cope with the abundance of data. Data sharing is not only a technical issue, but a complex social process in which researchers have to balance different pressures and tensions.

Basically, two different modes of data sharing can be distinguished: peer-to-peer data sharing and repository-based data sharing. In the first mode, researchers communicate directly with each other. In the second mode, there is a distance between the supplier of the data and the user, and the rules of the specific data repository determine the conditions of data sharing. In both modes, the existence or lack of trust between the data supplier and the data user is crucial, though in different configurations. If data sharing becomes increasingly mediated by information and communication technologies, and hence less dependent on face-to-face communication, the generation of trust will have to be organised differently (Wouters and Beaulieu 2001). The same holds for forms of quality control of the data. How do researchers check for quality in peer-to-peer data sharing? And how have data repositories and archives taken care of the need for quality control of the data supplied? Which dimensions of social relationships seem to be crucial in data quality control? Which technical solutions have been embedded in this social process, and what role has been played by information and communication technologies?

This paper addresses these questions in a number of different scientific fields (among others functional brain imaging, high energy physics, astronomy, and molecular biology) because different scientific fields tend to display different configurations of these social processes.

References:

H. Franken (2000), “Conference Conclusions” in: Access to Publicly Financed Research, The Global Research Village III Conference, Conference Report (P. Schröder, ed.), NIWI-KNAW, Amsterdam.

Paul Wouters and Anne Beaulieu (2001), Trust Building and Data Sharing - an exploration of research practices, technologies and policies. Research Project Proposal, OECD/CSTP Working Group on Datasharing.

 

2. Distributed Oriented Massive Data Management: Progressive Algorithms and Data Structures
Rita Borgo, Visual Computing Group, Consiglio Nazionale delle Ricerche (C.N.R.), Italy
Valerio Pascucci, Lawrence Livermore National Laboratory (LLNL), USA


Projects dealing with massive amounts of data need to carefully consider all aspects of data acquisition, storage, retrieval and navigation. The recent growth in the size of large simulation datasets still surpasses the combined advances in hardware infrastructure and processing algorithms for scientific visualization. The cost of storing and visualizing such datasets is prohibitive, so that often only one out of every hundred time-steps can actually be stored and visualized.

As a consequence, interactive visualization of results is becoming increasingly difficult, especially as a daily routine from a desktop: the high frequency of I/O operations starts to dominate the overall running time. The visualization stage of the modeling-simulation-analysis activity, still the most effective way for scientists to gain qualitative understanding of simulation results, then becomes the bottleneck of the entire process. In this setting, the efficiency of a visualization algorithm must be evaluated in the context of end-to-end systems instead of being optimized individually. At the system level, the visualization process needs to be designed as a pipeline of modules that process data in stages, creating a flow of data that must itself be optimized globally with respect to the magnitude and location of the available resources.

To address these issues we propose an elegant, simple-to-implement framework for performing out-of-core visualization and view-dependent refinement of large volume datasets. We adopt a view-dependent refinement method that relies on longest-edge bisection, and introduce a new way of extending the technique to volume visualization while keeping the simplicity of the technique itself untouched. Results in this field are applicable to parallel and distributed computing, ranging from clusters of PCs to more complex and expensive architectures. We present a new progressive visualization algorithm in which the input grid is traversed and organized in a hierarchical structure (from coarse to fine level), and subsequent levels of detail are constructed and displayed to improve the output image. We uncouple the data extraction from its display: the hierarchy is built by one process that traverses the input 3D mesh, while a second process performs the traversal and display. The scheme allows us to render partial results at any given time while the computation of the complete hierarchy makes progress. The regularity of the hierarchy permits a good data-partitioning scheme that lets us balance processing time and data migration time while maintaining simplicity and memory/computing efficiency.
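The sketch below illustrates only the general idea of coarse-to-fine refinement by longest-edge bisection, with level-by-level output that a separate display process could consume. It is a minimal 2-D illustration under assumed names and data structures, not the authors' out-of-core, view-dependent volume algorithm.

```python
# Minimal sketch (assumed, not the authors' code): refine a 2-D triangulation
# progressively by bisecting each triangle's longest edge, yielding each level
# as soon as it is built so partial results can be rendered immediately.
from math import dist


def bisect_longest_edge(tri):
    """Split one triangle into two by bisecting its longest edge."""
    a, b, c = tri
    edges = [((a, b), c), ((b, c), a), ((c, a), b)]
    (p, q), opposite = max(edges, key=lambda e: dist(*e[0]))
    mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    return [(p, mid, opposite), (mid, q, opposite)]


def progressive_levels(mesh, max_level):
    """Yield successively finer meshes; a display process can consume each
    level as it is produced, uncoupling extraction from rendering."""
    yield mesh
    for _ in range(max_level):
        mesh = [child for tri in mesh for child in bisect_longest_edge(tri)]
        yield mesh


if __name__ == "__main__":
    coarse = [((0, 0), (1, 0), (0, 1)), ((1, 0), (1, 1), (0, 1))]
    for level, mesh in enumerate(progressive_levels(coarse, 3)):
        print(f"level {level}: {len(mesh)} triangles")  # partial results usable at once
```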



3. Knowledge Management in Physicochemical Property Databases - Knowledge Recovery and Retrieval of NIST/TRC Source Data System
Qian Dong, Thermodynamics Research Center (TRC), National Institute of Standards and Technology (NIST), USA
Xinjian Yan, Robert D. Chirico, Randolph C. Wilhoit, Michael Frenkel

Knowledge management has become increasingly important to physicochemical databases, which are generally characterized by their complexity in terms of chemical system identifiers, sets of property values, the relevant state variables, estimates of uncertainty, and a variety of other metadata. The need for automation of database operation, for assurance of high data quality, and for the availability and accessibility of data sources and knowledge is a driving force toward knowledge management in the scientific database field. Nevertheless, current relational database technology makes the construction and maintenance of database systems of this kind tedious and error-prone, and it provides less support than the development of physicochemical databases requires.

The NIST/TRC SOURCE data system is an extensive repository of experimental thermophysical and thermochemical properties and relevant measurement information reported in the world's scientific literature. It currently consists of nearly 2 million records for 30,000 chemicals, including pure compounds, mixtures, and reaction systems, which have already created both a need and an opportunity for establishing a knowledge infrastructure and intelligent supporting systems for the core database. Every major stage of database operation and management, such as data structure design, data entry preparation, effective data quality assurance, as well as intelligent retrieval systems, depends to a degree on substantial domain knowledge. Domain knowledge regarding characteristics of compounds and properties, measurement methods, sample purity, estimation of uncertainties, data range and conditions, as well as property data consistency, is automatically captured and then represented within the database. Based upon this solid knowledge infrastructure, intelligent supporting systems are being built to assist (1) complex data entry preparation, (2) effective data quality assurance, (3) best data and model recommendation, and (4) knowledge retrieval.

In brief, the NIST/TRC SOURCE data system has a three-tier architecture: the first tier is a relational database management system, the second tier is the knowledge infrastructure, and the third consists of intelligent supporting systems comprising the computing algorithms, methods, and tools that carry out particular tasks of database development and maintenance. The goal of the latter two tiers is to realize the intelligent management of scientific databases based on the relational model. The development of the knowledge infrastructure and intelligent supporting systems is described in the presentation.


4. Multi-Aspect Evaluation of Data Quality in Scientific Databases
Juliusz L. Kulikowski, Institute of Biocybernetics and Biomedical Engineering c/o the Polish Academy of Sciences, Poland

The problem of data quality evaluation arises both when a database is to be designed and when database customers are going to use data in investigations, learning and/or decision making. However, it is not entirely clear what it means, exactly, for the quality of some given data to be high, or even to be higher than that of some other data. Of course, this presupposes that a data quality evaluation method is possible. If so, it should reflect the data utility value, but can it be based on a numerical quality scale? It was shown by the author (1982) that information utility value is a multi-component vector rather than a scalar. Its components should characterise such information features as relevance, actuality, credibility, accuracy, completeness, acceptability, etc. Therefore, data quality evaluation should be based on vector-ordering concepts. For this purpose Kantorovitsch's proposal of a semi-ordered linear space (K-space) can be used. In this case vector components should satisfy the general vector-algebra assumptions concerning additivity and multiplication by real numbers. This is possible if data quality features are defined in an adequate way. It is also desirable that data quality evaluation be extended to data sets. In K-space this can be achieved in several ways, by introducing the notions of (1) minimum guaranteed and maximum possible data quality, (2) average data quality, and (3) median data quality. In general, the systems for evaluating the quality of single data items and of data sets are not identical.

For example, the notion of data set redundancy (an important component of quality evaluation) is not applicable to single data items. Redundancy also plays different roles depending on whether a data set is to be used for specific data selection or taken as the basis of statistical inference. Therefore, data set quality depends on the user's point of view. On the other hand, there is no identity between the points of view on data set quality of the users and of the database designers, the latter being intended to satisfy various and divergent users' requirements. The aim of this paper is to present, in more detail, a data quality evaluation method based on vector ordering in K-space.
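The sketch below illustrates the basic idea of treating quality as a vector compared componentwise rather than as a single score, together with the set-level summaries named in the abstract. It is a simplified illustration only; the feature names, numbers and componentwise order are assumptions, not Kulikowski's formal K-space construction.

```python
# Illustrative sketch (assumed features and values): quality vectors compared
# componentwise, so two items may be incomparable rather than forced onto one
# numerical scale; plus simple set-level evaluations.
FEATURES = ("relevance", "actuality", "credibility", "accuracy", "completeness")


def dominates(u, v):
    """u >= v in the componentwise (semi-)order: u is at least as good on every feature."""
    return all(ui >= vi for ui, vi in zip(u, v))


def set_summaries(quality_vectors):
    """Set-level evaluations mentioned in the abstract: minimum guaranteed,
    maximum possible, and average quality, taken componentwise."""
    cols = list(zip(*quality_vectors))
    return {
        "min_guaranteed": tuple(min(c) for c in cols),
        "max_possible": tuple(max(c) for c in cols),
        "average": tuple(sum(c) / len(c) for c in cols),
    }


if __name__ == "__main__":
    a = (0.9, 0.8, 0.7, 0.9, 0.6)
    b = (0.8, 0.9, 0.6, 0.7, 0.9)
    print(dominates(a, b), dominates(b, a))   # False False -> incomparable items
    print(set_summaries([a, b]))
```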


5. Modeling the Earth's Subsurface Temperature Distribution From a Stochastic Point of View
Kirti Srivastava, National Geophysical Research Institute, India

Stochastic modeling has played an important role in the quantification of errors in various scientific investigations. In the quantification of errors one looks for the first two moments, i.e., the mean and variance of the system output due to errors in the input parameters. Modeling a given physical system with the available information and obtaining meaningful insight into its behavior is of vital importance in any investigation. One such investigation in the Earth sciences is to understand crustal/lithospheric evolution and temperature-controlled geological processes. For this an accurate estimation of the subsurface temperature field is essential. The thermal structure of the Earth's crust is influenced by its controlling geothermal parameters, such as thermal conductivity, radiogenic heat sources, and initial and boundary conditions.

Modeling the subsurface temperature field is done using either a deterministic or a stochastic approach. In the deterministic approach the controlling parameters are assumed to be known with certainty and the subsurface temperature field is obtained from them. However, due to the inhomogeneous and anisotropic character of the Earth's interior, some uncertainty in the estimation of the geothermal parameters is bound to exist. Uncertainties in these parameters may arise from the inaccuracy of measurements or from a lack of information about them. Such uncertainties are incorporated in the stochastic approach, which yields an average picture of the thermal field along with its associated error bounds.

The quantification of uncertainty in the temperature field is obtained using both random simulation and stochastic analytical methods. The random simulation method is a numerical method in which the uncertainties in the thermal field due to uncertainties in the controlling thermal parameters are quantified. The stochastic analytical method is generally solved using the small-perturbation approach, and closed-form analytical solutions for the first two moments are obtained. The stochastic solution to the steady-state heat conduction equation has been obtained for two different conditions, i.e., when the heat sources are random and when the thermal conductivity is random. Closed-form analytical expressions for the mean and variance of the subsurface temperature distribution and the heat flow have been obtained. This study has been applied to understand the thermal state of a tectonically active region in the Indian Shield.
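As a schematic illustration of the small-perturbation idea described above, consider a one-dimensional, constant-conductivity, steady-state case with a random heat source; the specific form below is an assumption for illustration, not the author's derivation.

```latex
% 1-D steady-state conduction with a random source A(z) = \bar{A}(z) + A'(z), E[A'] = 0.
\begin{align*}
  k\,\frac{d^{2}T}{dz^{2}} &= -A(z), \qquad T(z) = \bar{T}(z) + T'(z),\\
  k\,\frac{d^{2}\bar{T}}{dz^{2}} &= -\bar{A}(z)
    \qquad \text{(mean field: deterministic equation with the mean source)},\\
  k\,\frac{d^{2}T'}{dz^{2}} &= -A'(z)
    \qquad \text{(first-order perturbation)},\\
  \sigma_{T}^{2}(z) &= E\!\left[T'(z)^{2}\right]
    \qquad \text{(variance, obtained from the source covariance } E[A'(z_{1})A'(z_{2})]\text{)}.
\end{align*}
```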


Track IV-B-4:
Emerging Concepts of Data-Information-Knowledge Sharing

Henri Dou, Université Aix Marseille III, Marseille, France, and
Clément Paoli, Université de Marne-la-Vallée (UMLV), Champs-sur-Marne, France

In various academic or professional activities the need to use distributed Data, Information and Knowledge (D-I-K) features, either as resources or in cooperative action, often becomes very critical. It is not enough to limit oneself to interfacing existing resources such as databases or management systems. In many instances, new actions and information tools must be developed. These are often critical aspects of some global changes required in existing information systems.

The complexity of situations to be dealt with implies an increasing demand for D-I-K attributes in large problems, such as environmental studies or medical systems. Hard and soft data must be joined to deal with situations where social, industrial, educational, and financial considerations are all involved. Cooperative work already calls for these intelligent knowledge management tools. Such changes will certainly induce new methodologies in management, education, and R&D.

This session will emphasize the conceptual level of emerging global methodologies as well as the implementation level of working tools for enabling D-I-K sharing in existing and future information systems. Issues that might be examined in greater detail include:

  • Systems to develop knowledge on a cooperative basis;
  • Access to D-I-K in remote teaching systems, virtual laboratories and financial aspects;
  • Corporate universities (case studies will be welcomed): alternating teaching and industrial D-I-K confidentiality, innovation supported by information technology in educational systems, and data format interchange in SEWS (Strategic Early Warning Systems) applied to education and the usage of data;
  • Ethics in distance learning; and
  • Case studies on various experiments and the standardization of curricula.

 

1. Data Integration and Knowledge Discovery in Biomedical Databases: A Case Study

Arnold Mitnitski, Department of Medicine, Dalhousie University, Halifax, Canada
Alexander Mogilner, Montreal, Canada
Chris MacKnight, Division of Geriatric Medicine, Dalhousie University, Halifax, Canada
Kenneth Rockwood, Division of Geriatric Medicine, Dalhousie University, Halifax, Canada.

Biomedical (epidemiological) databases generally contain information about large numbers of individuals (health-related variables: diseases, symptoms and signs, physiological and psychological assessments, socio-economic variables, etc.). Many include information about adverse outcomes (e.g., death), which makes it possible to discover links between health outcomes and other variables of interest (e.g., diseases, habits, function). Such databases can also be linked with demographic surveys, which themselves contain large amounts of data aggregated by age and sex, and with genetic databases. While each of these databases is usually created independently, for discrete purposes, the possibility of integrating knowledge from several domains across databases is of significant scientific and practical interest. One example is discussed: linking a biomedical database (the National Population Health Survey), containing more than 80,000 records on the Canadian population in 1996-97 and 38 variables (disabilities, diseases, health conditions), with mortality statistics for Canadian males and females. First, the problem of redundancy in the variables is considered. Redundancy makes it possible to derive a simple score as a generalized (macroscopic) variable that reflects both individual and group health status.

This macroscopic variable reveals a simple exponential relation with age, indicating that the process of accumulation of deficits (damage) is a leading factor causing death. The age trajectory of the statistical distribution of this variable also suggests that redundancy exhaustion is a general mechanism, reflecting different diseases. The relationship between generalized variables and the hazard (mortality) rate reveals that the latter can be expressed in terms of variables generally available from any cross-sectional database. In practical terms, this means that the risk of mortality might readily be assessed from standard biomedical appraisals collected on other grounds. This finding is an example of how knowledge from different data sources can be integrated to common good ends. Additionally, Internet related technologies might provide ready means to facilitate interoperability and data integration.
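A minimal sketch of this kind of analysis follows; it is illustrative only, and the variable definitions, data values and fitting choice are assumptions rather than those of the study. It computes a deficit index as the fraction of recorded deficits present and fits an exponential to group means by age via log-linear least squares.

```python
# Illustrative sketch: a "macroscopic" deficit index and an exponential fit
# of its group mean against age (all numbers below are made up).
import math


def deficit_index(record):
    """record: dict of binary health variables (1 = deficit present)."""
    return sum(record.values()) / len(record)


def fit_exponential(ages, mean_index):
    """Least-squares fit of ln(mean_index) = ln(a) + b*age, i.e. index ~ a*exp(b*age)."""
    xs, ys = ages, [math.log(m) for m in mean_index]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
    a = math.exp(ybar - b * xbar)
    return a, b


if __name__ == "__main__":
    ages = [40, 50, 60, 70, 80]
    mean_index = [0.05, 0.08, 0.13, 0.21, 0.34]   # hypothetical group means
    a, b = fit_exponential(ages, mean_index)
    print(f"index ~ {a:.3f} * exp({b:.3f} * age)")
```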

 

2. A Framework for Semantic Context Representation of Multimedia Resources
Weihong Huang, Yannick Prié, Pierre-Antoine Champin, Alain Mille, LISI, Université Claude Bernard Lyon 1, France

With the explosion of online multimedia resources, the demand for intelligent content-based multimedia services is increasing rapidly. One of the key challenges in this area is the representation of semantic contextual knowledge about multimedia resources. Although current image and video indexing techniques enable efficient feature-based operations on multimedia resources, there still exists a "semantic gap" between users and computer systems, which refers to the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data has for a user in a given situation.

In this paper, we present a novel model, the annotation graph (AG), for modeling and representing contextual knowledge about various types of resources such as text, image, and audio-visual resources. Based on the AG model, we attempt to build an annotation graph framework towards bridging the "semantic gap" by offering universal, flexible knowledge creation, organization and retrieval services to users. In this framework, users will not only benefit from semantic query and navigation services, but will also be able to contribute to knowledge creation via semantic annotation.

In the AG model, four types of concrete description elements are designed for concrete descriptions in specific situations, while two types of abstract description elements are designed for knowledge reuse across different situations. With these elements and directed arcs between them, contextual knowledge at different semantic levels can be represented through semantic annotation. Within the global annotation graph constructed from all AGs, we provide flexible semantic navigation using derivative graphs (DGs) and AGs. DGs complement the contextual knowledge representation of AGs by focusing on different types of description elements. For semantic query, we present a potential graph (PG) tool that helps users visualize query requests as PGs and execute queries by performing sub-graph matching with the PGs. Prototype system design and implementation aim at an integrated, user-centered system for semantic contextual knowledge creation, organization and retrieval.
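The sketch below is a highly simplified rendering of the annotation-graph idea: typed description elements connected by directed arcs, plus a naive one-edge pattern query standing in for sub-graph matching. The element types, labels and matching rule are assumptions for illustration; the AG model's actual elements and algorithms are not specified here.

```python
# Simplified, assumed sketch of an annotation graph: typed nodes, directed arcs,
# and a naive pattern query that mimics matching a tiny "potential graph".
from collections import defaultdict


class AnnotationGraph:
    def __init__(self):
        self.nodes = {}              # node_id -> (element_type, label)
        self.out = defaultdict(set)  # node_id -> successor ids

    def add_node(self, node_id, element_type, label):
        self.nodes[node_id] = (element_type, label)

    def add_arc(self, src, dst):
        self.out[src].add(dst)

    def match_edge_pattern(self, src_label, dst_type):
        """Return (src, dst) pairs where src carries the label and dst has the type."""
        hits = []
        for src, (_, label) in self.nodes.items():
            if label != src_label:
                continue
            for dst in self.out[src]:
                if self.nodes[dst][0] == dst_type:
                    hits.append((src, dst))
        return hits


if __name__ == "__main__":
    ag = AnnotationGraph()
    ag.add_node("v1", "audiovisual", "lecture_video_segment")   # hypothetical resource
    ag.add_node("c1", "concrete", "speaker: A. Mille")          # concrete description element
    ag.add_node("a1", "abstract", "person")                     # abstract description element
    ag.add_arc("v1", "c1")
    ag.add_arc("c1", "a1")
    print(ag.match_edge_pattern("lecture_video_segment", "concrete"))
```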

 

3. From Representing the Present to a Prospective Vision of the Future - From Technology Forecast to Technology Foresight
Henri Dou, CRRM, Université Aix Marseille III, Centre Scientifique de Saint Jérôme, France
Jin Zhouiyng, Institute of Techno-Economics, Chinese Academy of Social Science (CASS), China

Nowadays, the move from technology forecasting to technology foresight is unavoidable if scientific development is not to be oriented only vertically, to the detriment of its possible impacts (positive or negative) on society. In this paper the authors address the methodological aspects of this transition, as well as the different stages that have marked this evolution since 1930. Analyses carried out by different countries will be presented, together with an international overview of current actions in this field.

The Technology Foresight concept will then be introduced into the methodology of technical or economic competitive intelligence, in order to give companies a vision of sustainable and ethical development and thereby create new advantages.

The international implementation of the concept will be analysed at the European level (6th Framework Programme), at the level of the Bologna Declaration (June 1999), and in the actions under way in Japan and China (China 2020).

 

4. Setting up a Dynamic and Interactive System for Managing the Activity and Knowledge of a Laboratory
Mylène Leitzelman, Intelligence Process SAS, France
Valérie Léveillé, Case 422, Centre Scientifique de Saint-Jérôme, France
Jacky Kister, UMR 6171 S.C.C, Faculté des Sciences et Techniques de St Jérôme, France

The aim is to set up, on an experimental basis and on behalf of UMR 6171, associated with the CRRM, an interconnected activity- and knowledge-management system to manage the scientific activity of a research unit. This system will include modules for synthetic visualization, statistics and mapping, based on data-mining and bibliometric methodologies. The key point of this system is that it will offer, at the same time, a tool for the strategic management and organization of a laboratory and a tool allowing inter-laboratory compilation, turning it into an instrument of analysis or strategy on a larger scale, with more or less open access so that outside agents can use the data to generate indicators of performance, valorization, quality of scientific output, and laboratory/industry relations.

 

5. The Ethical Dimension of the Pedagogical Relationship in Distance Learning
M. Lebreton, C. Riffaut, H. Dou, Faculté des sciences et techniques de Marseille Saint-Jérôme (CRRM), France

Teaching has always meant being placed in a relationship with someone in order to teach them something. The link uniting the trainer and the learner is knowledge. This forms an educational triangle1 whose sides constitute the pedagogical relationship(s).
To activate this structure, each actor must know his or her own motivations and objectives clearly and precisely. Moreover, it seems evident that, in order to transmit and acquire knowledge, the partners in the learning process must share a certain number of common values, the true cement of the educational act.

To the above-mentioned triangle corresponds an ethical triangle, at each of whose vertices one can place one of the educational missions: to instruct, to socialize and to qualify.

To instruct is above all to acquire knowledge. To socialize is above all to acquire values. To qualify is to become integrated into a productive organization.

These two triangles functioned for centuries, but the arrival of the new multimedia and communication technologies has broken down the rule of the three unities - time, place and action2. This whole structure is now cracking, giving rise to a new educational landscape in which the classroom will no longer be the only place of training, in which the transfer of knowledge can take place at any time and in any place, and in which, finally, pedagogical action will be individualized and individualizable.

In this new context, the pedagogical relationship in distance learning will require new technical, intellectual, and social or ethical competencies.
To address these new challenges, it seems necessary first to ask how ethics can help us understand the ways in which the fundamental mechanisms of knowledge production have evolved and the changes that have occurred in the system of knowledge transfer, while attending to the adaptation and the necessary, permanent updating of the educational content that will henceforth be required.

Subsequently, ethical questioning must lead us to address the consequences of the depersonalization of the learning relationship. To this end, it seems appropriate to try to answer two fundamental questions. The first concerns the trainer: is he or she still master of the socialization process, and, beyond that, does distance learning still carry values and, if so, which ones?

The second concerns the learner: on the one hand, what becomes of his or her identity in the digital and virtual world, and on the other, what recourse does he or she have in the face of the commodification of knowledge and the appropriation of knowledge by informational empires?

Taken together, these ethical questions can help us begin to sketch solutions to problems that know no borders and are formidably complex, in which the rational and the irrational, the material and the immaterial, the personal and the impersonal now coexist, all immersed in the digital realm, the foundation of virtuality.

1. Le triangle pédagogique, J. Houssaye, Berne, Ed. Peter Lang
2. Rapport au Premier ministre du Sénateur A. Gérard, 1997


Data Policy


Track I-D-4:
The Public Domain in Scientific and Technical Data: A Review of Recent Initiatives and Emerging Issues

Chair: Paul F. Uhlir, The National Academies, USA

The body of scientific and technical data and other information in the public domain is massive and has contributed broadly to the scientific, economic, social, cultural, and intellectual vibrancy of the entire world. The "public domain" may be defined in legal terms as sources and types of data and information whose uses are not restricted by statutory intellectual property regimes and that are accordingly available to the public without authorization. In recent years, however, there have been growing legal, economic, and technological pressures on public-domain information, scientific and otherwise, forcing a reevaluation of the role and value of the public domain. Despite these pressures, some well-established mechanisms for preserving the public domain in scientific data exist in the government, university, and not-for-profit sectors. In addition, very innovative models for promoting various public-domain digital information resources are now being developed by different groups in the scientific, library, and legal communities. This session will review some of the recent initiatives for preserving and promoting the public domain in scientific data within CODATA and ICSU, the US National Academies, OECD, UNESCO, and other organizations, and will highlight some of the most important emerging issues in this context.

 

1. International Access to Data and Information
Ferris Webster, University of Delaware, USA

Access to data and information for research and education is the principal concern of the ICSU/CODATA ad hoc Group on Data and Information. The Group tracks developments by intergovernmental organizations with influence over data property rights. Where possible, the Group works to assure that the policies of these organizations recognize the public good to be derived by assuring access to data and information for research and education.

A number of international organizations have merited attention recently. New proprietary data rights threaten to close off access to data and information that could be vital for progress in research. The European Community has been carrying out a review of its Database Directive. The World Meteorological Organization's resolution on international exchange of meteorological data has been the subject of continuing debate. The Intergovernmental Oceanographic Commission is drafting a new data policy that may have constraints that are parallel to those of the WMO. The World Intellectual Property Organization has had a potential treaty on databases simmering for several years.

The latest developments in these organizations will be reviewed, along with the activities of the ICSU/CODATA Group.

 

2. The OECD Follow up Group on Issues of Access to Publicly Funded Research Data: A Summary of the Interim Report
Peter Arzberger, University of California at San Diego, USA

This talk will present a summary of the interim report of the OECD Follow up Group on Issues of Access to Publicly Funded Research Data. The Group's efforts have their origins in the 3rd Global Research Village conference in Amsterdam, December 2000. In particular, the talk will include issues of global sharing of research data. The Group has conducted case studies of practices across different communities and looked at factors, such as sociological, economic, technological and legal issues, that either enhance or inhibit data sharing. The presentation will also address issues such as data ownership and rights of disposal, multiple uses of data, the use of ICT for widening the scale and scope of data sharing, effects of data sharing on the research process, and co-ordination in data management. The ultimate goal of the Group is to articulate principles, based on best practices, that can be translated into the science policy arena. Some initial principles will be discussed. Questions such as the following will be addressed:

  • What principles should govern science policy in this area?
  • What is the perspective of social informatics in this field?
  • What role does the scientific community play in this?

It is intended that this presentation will generate discussion and feedback on key points of the Group's interim report.

 

3. An Overview of Draft UNESCO Policy Guidelines for the Development and Promotion of Public-Domain Information
John B. Rose, UNESCO, Paris, FRANCE
Paul F. Uhlir, The National Academies, Washington, DC, USA

A significantly underappreciated, but essential, element of the information revolution and emerging knowledge society is the vast amount of information in the public domain. Whereas the focus of most policy analyses and law making is almost exclusively on the enhanced protection of private, proprietary information, the role of public-domain information, especially of information produced by the public sector, is seldom addressed and generally poorly understood.

The purpose of UNESCO's Policy Guidelines for the Development and Promotion of Public-Domain Information, therefore, is to help develop and promote information in the public domain at the national level, with particular attention to information in digital form. These Policy Guidelines are intended to better define public-domain information and to describe its role and importance, specifically in the context of developing countries; to suggest principles that can help guide the development of policy, infrastructure and services for provision of government information to the public; to assist in fostering the production, archiving and dissemination of an electronic public domain of information for development, with emphasis on ensuring multicultural, multilingual content; and to help promote access of all citizens, especially including disadvantaged communities, to information required for individual and social development. This presentation will review the main elements of the draft Policy Guidelines, with particular focus on scientific data and information in the public domain.

Complementary to, but distinct from, the public domain is the wider range of information and data that could be made available by rights holders under specific "open access" conditions, as in the case of open source software, together with the free availability of protected information for certain specific purposes, such as education and science, under limitations and exceptions to copyright (e.g., "fair use" in U.S. law). UNESCO is working to promote international consensus on the role of these facilities in the digital age, notably through a recommendation under development on the "Promotion and Use of Multilingualism and Universal Access to Cyberspace," which is intended to be presented to the World Summit on the Information Society to be organized in Geneva (2003) and Tunis (2005), as well as through a number of other relevant programme actions that will also be presented at the Summit.

 

4. Emerging Models for Maintaining the Public Commons in Scientific Data
Harlan Onsrud, University of Maine, USA

Scientists need full and open disclosure and the ability to critique in detail the methods, data, and results of their peers. Yet scientific publications and data sets are burdened increasingly by access restrictions imposed by legislative acts and case law that are detrimental to the advancement of science. As a result, scientists and legal scholars are exploring combined technological and legal workarounds that will allow scientists to continue to adhere to the mores of science without being declared as lawbreakers. This presentation reviews three separate models that might be used for preserving and expanding the public domain in scientific data. Explored are the technological and legal underpinnings of Research Index, the Creative Commons Project and the Public Commons for Geographic Data Project. The first project relies heavily on protections granted to web crawlers under the U.S. Digital Millennium Copyright Act while the latter two rely on legal approaches utilizing open access licenses.


5. Progress, Challenges, and Opportunities for Public-Domain S&T Data Policy Reform in China
Liu Chuang, Chinese Academy of Sciences, Beijing, China

China has gone through four different stages of public-domain S&T data management and policy during the last quarter century. Before 1980, most government-funded S&T data were freely accessible, and the services enjoyed a good reputation within the scientific community. Most of these data were recorded on paper media, however, and took time to access.

With the spread of computers in the early 1980s, digital data and databases increased rapidly. Data producers and holders began to realize that digital data could be an important resource for scientific activities. The policy of charging fees for data access gained prominence between the early 1980s and approximately 1993. During this period, China experienced new problems in S&T data management: duplicated database development increased, data came to be controlled by individuals, with a high risk of the data being lost, and the price of access to data became very high in most cases.

In the 1994-2000 period, members of the scientific community asked for data policy reform, and for lower costs of access to government funded databases for non-profit applications. The Ministry of Science and Technology (MOST) set up a group to investigate China's S&T data sharing policies and practices.

A new program for S&T data sharing was initiated by MOST in 2001. This was a major milestone for enhancing access to and the application of public-domain S&T data. This new program, along with the current development of a new data access policy and support system, is expected to be greatly expanded during next decade.


Track IV-A-4:
Confidentiality Preservation Techniques in the Behavioral, Medical and Social Sciences

D. Johnson, Building Engineering and Science Talent, San Diego, CA, USA
John L. Horn, Department of Psychology, University of Southern California, USA
Julie Kaneshiro, National Institutes of Health, USA
Kurt Pawlik, Psychologisches Institut I, Universität Hamburg, Germany
Michel Sabourin, Université de Montréal, Canada

In the behavioral and social sciences and in medicine, the movement to place data in electronic databases is hampered by considerations of confidentiality. The data collected on individuals by scientists in these areas of research are often highly personal. In fact, it is often necessary to guarantee potential research participants that the data collected on them will be held in strictest confidence and that their privacy will be protected. There has even been debate in these sciences about whether data collected under a formal confidentiality agreement can be placed in a database, because such use might constitute a use of the data to which the research participants did not consent.

The members of this panel will discuss a broad range of techniques that are being used across the behavioral and social sciences and medicine to protect the confidentiality of individuals whose data are entered into an electronically accessible database. Among the highly controversial data to which these techniques are being applied are data on accident avoidance by pilots of commercial aircraft and data on medical errors. The stakes in finding ways to use these data without violating confidentiality are high, since the payoff from learning how to reduce airplane accidents and medical mistakes is saved lives.

Standard techniques for separating identifier information from data, as well as less common techniques such as the introduction of systematic error in data, will be discussed. Despite the methods that are in place and those that are being experimented with, there is evidence that even sophisticated protection techniques may not be enough. The group will conclude its session with a discussion of this challenge.

 

1. Issues in Accessing and Sharing Confidential Survey and Social Science Data
Virginia A. de Wolf, USA

Researchers collect data from both individuals and organizations under pledges of confidentiality. The U.S. Federal statistical system has established practices and procedures that enable others to access the confidential data it collects. The two main methods are to restrict the content of the data (termed "restricted data") prior to release to the general public and to restrict the conditions under which the data can be accessed, i.e., at what locations, for what purposes (termed "restricted access"). This paper reviews restricted data and restricted access practices in several U.S. statistical agencies. It concludes with suggestions for sharing confidential social science data.



2. Contemporary Statistical Techniques for Closing the "Confidentiality Gap" in Behavioral Science Research
John L. Horn, Department of Psychology, University of Southern California, USA

Over the past three decades, behavioral scientists have become acutely aware of the need for both the privacy of research participants and the confidentiality of research data. During this same period, knowledgeable researchers have created a variety of methods and procedures to ensure confidentiality. But many of the best techniques used were not designed to permit the sharing of research data with other researchers outside the initial data collection group. Since a great deal of behavioral science data collected at the individual level requires such protections, the data cannot easily be shared with others in a confidential way. These practical problems have created a great deal of confusion and a kind of "confidentiality gap" among researchers and participants alike. This presentation will review some available "statistical" approaches for dealing with these problems, with examples drawn from research projects on human cognitive abilities. These statistical techniques range from the classical use of replacement or shuffled records to more contemporary techniques based on multiple imputation. In addition, new indices will be used to relate the potential loss of data accuracy to the loss of confidentiality. These indices will help researchers define the confidentiality gap in their own and any other research project.

References

  1. Fienberg, S. E., & Willenborg, L. C. R. J. (1998). Special issue on "Disclosure limitation methods for protecting the confidentiality of statistical data." Journal of Official Statistics, 14(4), 337-566.
  2. Willenborg, L. C. R. J., & de Waal, T. (2001). Elements of Statistical Disclosure Control. Lecture Notes in Statistics, 155. New York: Springer-Verlag.
  3. Clubb, J. M., Austin, E. W., Geda, C. L., & Traugott, M. W. (1992). Sharing research data in the social sciences. In G. H. Elder, Jr., E. K. Pavalko & E. C. Clipp, Working with Archival Data: Studying Lives (pp. 39-75). SAGE Publications.
  4. Willenborg, L. C. R. J., & de Waal, T. (1996). Statistical Disclosure Control in Practice. Lecture Notes in Statistics, 111. New York: Springer-Verlag.
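Two of the classical transformations named in the abstract above, record shuffling and the introduction of systematic error, can be sketched as follows. This is a toy illustration only; real multiple-imputation and disclosure-limitation procedures (see the references above) are considerably more involved.

```python
# Toy disclosure-limitation sketches (assumed data and parameters): shuffle a
# sensitive attribute within groups, or add zero-mean noise to a numeric one.
import random


def shuffle_within_groups(records, group_key, sensitive_key, seed=0):
    """Permute a sensitive attribute among records sharing a grouping key,
    preserving group-level distributions while breaking record-level links."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]
    groups = {}
    for i, r in enumerate(out):
        groups.setdefault(r[group_key], []).append(i)
    for idxs in groups.values():
        values = [out[i][sensitive_key] for i in idxs]
        rng.shuffle(values)
        for i, v in zip(idxs, values):
            out[i][sensitive_key] = v
    return out


def add_noise(records, key, sd=1.0, seed=0):
    """Add zero-mean Gaussian noise to a numeric attribute (systematic error)."""
    rng = random.Random(seed)
    return [dict(r, **{key: r[key] + rng.gauss(0.0, sd)}) for r in records]


if __name__ == "__main__":
    data = [{"site": "A", "score": 12}, {"site": "A", "score": 30}, {"site": "B", "score": 25}]
    print(shuffle_within_groups(data, "site", "score"))
    print(add_noise(data, "score", sd=2.0))
```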



3. NASA Aviation Safety Reporting System (ASRS)
Linda J. Connell, NASA Ames Research Center, USA

In 1974, the United States experienced a tragic aviation accident involving a B-727 on approach to Dulles Airport in Virginia. All passengers and crew were killed. The accident was classified as a Controlled Flight Into Terrain event. During the NTSB accident investigation, it was discovered from ATC and cockpit voice recorder tapes that the crew had become confused by information regarding the approach, both in the approach charts and in the ATC instruction "cleared for the approach." It was also discovered that another airline had experienced a similar chain of events, but its crew detected the error and increased their altitude, allowing them to clear the oncoming mountain. That second event would be classified as an incident. The benefit of this information spread rapidly within that airline, but had not reached other airlines. As a result of the NTSB findings, the FAA and NASA created the Aviation Safety Reporting System in 1976. The presentation will describe the background and principles that guide the operation of the ASRS. The presentation will also include descriptions of the uses of, and products from, approximately 490,000 incident reports.

Technical Demonstrations


Track II-D-2:
Technical Demonstrations

Chairs:
Richard Chinman, University Corporation for Atmospheric Research, Boulder, CO, USA
Robert S. Chen, CIESIN, Columbia University, USA

1. World Wide Web Mirroring Technology of the World Data Center System
David M. Clark, World Data Center Panel, NOAA/NESDIS, USA

The widespread implementation and acceptance of the World Wide Web (WWW) has changed many facets of the techniques by which Earth and environmental data are accessed, compiled, archived, analyzed and exchanged. The ICSU World Data Centers, established over 50 years ago, are beginning to use this technology as they evolve toward a new way of operating. One key element of this new technology is WWW "mirroring." Strictly speaking, mirroring is reproducing the web content of one site exactly at another, physically separated location. However, there are other types of "mirroring" that use the same technology but differ in the appearance and/or content of the site. The WDCs are beginning to use these three types of mirroring technology to encourage new partners in the WDC system. These new WDC partners bring regional diversity or a discipline-specific enhancement to the WDC system. Currently there are ten sites on five continents mirroring a variety of data types using the different modes of mirroring technology. These include paleoclimate data mirrored in the US, Kenya, Argentina and France, and space environment data mirrored in the US, Japan, South Africa, Australia and Russia. These mirror sites have greatly enhanced the exchange and integrity of the respective discipline databases. A demonstration of this technology will be presented.


2. Natural Language Knowledge Discovery: Cluster Grouping Optimization
Robert J. Watts, U.S. Army Tank-automotive and Armaments Command, National Automotive Center, USA
Alan L. Porter, Search Technology, Inc. and Georgia Tech, USA
Donghua Zhu, Beijing Institute of Technology, China

The Technology Opportunities Analysis of Scientific Information System (Tech OASIS), commercially available under the trade name VantagePoint, automates the identification and visualization of relationships inherent in sets (i.e., hundreds or thousands) of literature abstracts. A proprietary Tech OASIS approach applies principal components analysis (PCA), multi-dimensional scaling (MDS) and a path-erasing algorithm to elicit and display clusters of related concepts. However, cluster groupings and visual representations are not unique for a given set of literature abstracts: the user's selection of the items to be clustered and of the number of factors to be considered will generate alternative cluster solutions and relationship displays. Our current research, whose results will be demonstrated, seeks to identify and automate the selection of a "best" cluster analysis solution for a set of literature abstracts. How, then, can a "best" solution be identified? Research on quality measures of factor/cluster groups indicates that the promising ones are entropy, the F measure and cohesiveness. Our approach strives to minimize the entropy and F measures and maximize cohesiveness, and also considers set coverage. We apply this to automatically map conceptual (term) relationships for 1202 abstracts concerning "natural language knowledge discovery."
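The sketch below shows, in toy form, how two of the measures named above might be used to compare alternative cluster solutions: entropy against reference categories (lower is better) and cohesiveness as mean intra-cluster cosine similarity (higher is better). The data, reference labels and any weighting are assumptions; the actual Tech OASIS scoring is proprietary and not reproduced here.

```python
# Toy comparison of alternative cluster solutions by entropy and cohesiveness
# (illustrative only; term vectors and labels are made up).
import math


def cluster_entropy(clusters, labels):
    """Size-weighted entropy of reference labels within each cluster."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = {}
        for i in c:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        h = -sum((k / len(c)) * math.log(k / len(c), 2) for k in counts.values())
        total += (len(c) / n) * h
    return total


def cohesiveness(clusters, vectors):
    """Mean pairwise cosine similarity inside clusters."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0
    sims = [cos(vectors[i], vectors[j]) for c in clusters for i in c for j in c if i < j]
    return sum(sims) / len(sims) if sims else 1.0


if __name__ == "__main__":
    vectors = [(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)]   # hypothetical term vectors
    labels = ["nlp", "nlp", "kdd", "kdd"]                # hypothetical reference categories
    for name, sol in [("A", [[0, 1], [2, 3]]), ("B", [[0, 2], [1, 3]])]:
        print(name, cluster_entropy(sol, labels), round(cohesiveness(sol, vectors), 3))
```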

 

3. ADRES: An online reporting system for veterinary hospitals
P.K. Sidhu and N.K. Dhand, Punjab Agricultural University, India

An animal husbandry department reporting system (ADRES) has been developed for the online submission of monthly progress reports from veterinary hospitals. It is a database built in Microsoft Access 2000, which holds records of all the veterinary hospitals and dispensaries of the animal husbandry department of Punjab, India. Every institution has been given a separate ID. The codes for the various infectious diseases follow those given by the OIE (Office International des Epizooties). In addition to reports on disease occurrence, information can also be recorded on the progress of the insemination program, animals slaughtered in abattoirs, animals exported to other states and countries, animal welfare camps held, farmer training camps organized, etc. Records can easily be compiled at the sub-division, district and state level, and reports can be prepared online for submission to the Government of India. It is envisioned that the system will make report submission digital, efficient and accurate. Although the database has been developed primarily for Punjab State, other states of India and other countries may also easily use it.
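The kind of structure ADRES describes, institutions with unique IDs, monthly reports keyed by OIE disease codes, and compilation at the district level, can be sketched as below. The actual system is a Microsoft Access database; the SQLite tables, column names and codes here are assumptions chosen purely for illustration.

```python
# Schematic sketch (assumed schema, not the ADRES design): institutions,
# monthly disease reports with OIE codes, and a district-level compilation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE institution (
    inst_id     TEXT PRIMARY KEY,
    name        TEXT,
    subdivision TEXT,
    district    TEXT,
    state       TEXT
);
CREATE TABLE monthly_report (
    inst_id     TEXT REFERENCES institution(inst_id),
    year_month  TEXT,      -- e.g. '2002-09'
    oie_code    TEXT,      -- OIE disease code (placeholder values below)
    cases       INTEGER
);
""")
conn.executemany("INSERT INTO institution VALUES (?,?,?,?,?)",
                 [("H001", "Hospital A", "SD1", "Ludhiana", "Punjab"),
                  ("H002", "Hospital B", "SD2", "Ludhiana", "Punjab")])
conn.executemany("INSERT INTO monthly_report VALUES (?,?,?,?)",
                 [("H001", "2002-09", "X001", 3), ("H002", "2002-09", "X001", 5)])

# Compile a district-level report for one month.
for row in conn.execute("""
    SELECT i.district, r.oie_code, SUM(r.cases)
    FROM monthly_report r JOIN institution i USING (inst_id)
    WHERE r.year_month = '2002-09'
    GROUP BY i.district, r.oie_code"""):
    print(row)
```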

 

4. PAU_Epi~AID: A relational database for epidemiological, clinical and laboratory data management
N.K. Dhand, Punjab Agricultural University, India

A veterinary database (Punjab Agricultural University Epidemiological Animal Disease Investigation Database, PAU_Epi~AID) has been developed to meet the requirements of data management during outbreak investigations, monitoring and surveillance, and clinical and laboratory investigations. It is based on Microsoft Access 2000 and includes a databank of digitized information for all states and union territories of India. Information on the districts, sub-divisions, veterinary institutions and important villages of Punjab (India) has also been incorporated, every unit being represented by an independent numeric code. More than 60 interrelated tables have been prepared for registering information on animal disease outbreaks; farm data such as housing, feeding, management, past disease history and vaccination history; and general animal information, production, reproduction and disease data. Findings of various laboratories, such as bacteriology, virology, pathology, parasitology, molecular biology, toxicology and serology, can also be documented. Data can be easily entered in simple forms hyperlinked to one another, which allow queries and report preparation at the click of a mouse. Flexibility has been provided for additional requirements arising from diverse needs. The database may be of immense use for data storage, retrieval and management in epidemiological institutions and veterinary clinics.

 

5. Archiving Technology for Natural Resources and Environmental Data in Developing Countries, A Case Study in China
Wang Zhengxing, Chen Wenbo, Liu Chuang, Ding Xiaoqiang, Chinese Academy of Sciences, China

Data archiving has long been regarded as a less important sector in China. As a result, there is no long-term commitment at the national level to preserve natural resources data, and budgets for data management are usually smaller than those for research. It is therefore essential to develop a feasible strategy and technology to manage the exponential growth of the data. The strategy and technology should be cost-saving, robust, user-friendly, and sustainable in the long run. A PC-based system has been developed to manage satellite imagery, Geographic Information System (GIS) maps, tabular attribute data, and text data. The data in text format include data policies compiled from international, national, and regional organizations. Full documentation on these data is on-line and free to download. For GIS maps and tabular data, only metadata and documentation are on-line; the full datasets are distributed by CD-ROM, e-mail, or ftp.

Remote sensing data are often too expensive for developing countries. An agreement has been reached between GCIRC and remote sensing receiving station vendors: GCIRC can freely use the remote sensing data (MODIS) from the receiving station, conditional on making its system available for demonstration to potential buyers. This secures the most important data source for archiving. Considering the huge volumes of data and the limited PC capacity, only quick-look images and metadata are permanently on-line. Users can search for data by date, geolocation, or granule. Full 1B images are updated daily and kept on-line for one week; users can download the recent data for free. All raw data (direct broadcast) and 1B images are archived on CD-ROMs, which are easy to read with a personal computer.
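A week-long on-line retention policy of this kind could be implemented with a very small pruning routine such as the sketch below. The directory layout and path are assumptions for illustration, not the project's actual system, and the routine presumes a CD-ROM copy already exists before anything is removed.

```python
# Minimal sketch (assumed paths and layout): remove full 1B granules older than
# 7 days from the on-line area; quick-look images and metadata stay on-line.
import os
import time

ONLINE_1B_DIR = "/data/online/level1b"      # hypothetical on-line directory
RETENTION_SECONDS = 7 * 24 * 3600


def prune_online_1b(directory=ONLINE_1B_DIR, now=None):
    """Delete on-line 1B files older than the retention window; return their names."""
    now = now or time.time()
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > RETENTION_SECONDS:
            os.remove(path)                 # assumes the CD-ROM archive copy exists
            removed.append(name)
    return removed
```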

 

6. Delivering interdisciplinary spatial data online: The Ramsar Wetland Data Gateway
Greg Yetman and Robert S. Chen, Columbia University, USA

Natural resource managers and researchers around the world are facing a range of cross-disciplinary issues involving global and regional environmental change, threats to biodiversity and long-term sustainability, and increasing human pressures on the environment. They must increasingly harness a range of socioeconomic and environmental data to better understand and manage natural resources at local, regional, and global scales.

This demonstration will illustrate an online information resource designed to help meet the interdisciplinary data needs of scientists and resource managers concerned with wetlands of international importance. The Ramsar Wetland Data Gateway, developed in collaboration with the Ramsar Bureau and Wetlands International, combines relational database technology with interactive mapping tools to provide powerful search and visualization capabilities across a range of data from different sources and disciplines. The Gateway is also being developed to support interoperable data access across distributed spatial data servers.


Large Data Projects


Track I-D-3:
Land Remote Sensing - Landsat Today and Tomorrow

Chairs: Hedy Rossmeissl and John Faundeen, US Geological Survey, USA

Scientists in earth science research and applications, along with map-makers, have for many years been avid users of remotely sensed Landsat data. Remote sensing technology, and Landsat data in particular, is extremely useful for illustrating current conditions and temporal change; for monitoring and assessing the impacts of natural disasters; for aiding in the management of water, biological, energy, and mineral resources; for evaluating environmental conditions; and for enhancing the quality of life of citizens across the globe. The size of the image files, however, raises a variety of data management challenges. This session will focus specifically on the 30-year experience with Landsat image data and will examine four components: 1) image tasking, access, and dissemination, 2) applications and use of the imagery, 3) data archiving, and 4) the future of the Landsat program.

 

1. Tasking, Archiving & Dissemination of Landsat Data
Thomas J. Feehan, Canada Centre for Remote Sensing, Natural Resources Canada, Canada

The Canada Centre for Remote Sensing (CCRS) of Natural Resources Canada operates two satellite ground receiving stations: the Prince Albert Satellite Station, located in Prince Albert, Saskatchewan, and the Gatineau Satellite Station, located in Cantley, Quebec. The CCRS stations provide a North American data reception capability, acquiring data to generate knowledge and information critical to resource-use decision making on local, regional, national and global scales. CCRS' primary role is to provide data related to land resources and climate change, contributing to sustainable land management in Canada.

Operating in a multi-mission environment that includes LANDSAT, the CCRS stations have accumulated an archive in excess of 300 terabytes, dating back to 1972, when CCRS started receiving LANDSAT-1 (ERTS-1) data at the Prince Albert Satellite Station. Data are made available to support near-real-time applications, including ice monitoring and forest fire monitoring and mapping, as well as non-real-time applications such as climate change studies, land use and topographic mapping. LANDSAT MSS, TM and ETM+ data constitute a significant portion of the CCRS archive holdings.

In addition to Canadian public-good data use, a spin-off benefit is the commercial exploitation of the data by a CCRS distributor and value-added services network.

 

2. The Work of the U.S. National Satellite Land Remote Sensing Data Archive Committee: 1998 - 2000
Joanne Irene Gabrynowicz, National Remote Sensing and Space Law Center, University of Mississippi School of Law, USA

Earth observation data have been acquired and stored since the early 1970s. One of the world's largest, and most important, repositories for land satellite data is the Earth Resources Observation Systems (EROS) Data Center (EDC). It is a data management, systems development, and research field center for the U.S. Geological Survey's (USGS) National Mapping Discipline in Sioux Falls, South Dakota, USA. It was established in the early 1970s and in 1992, the U.S. Congress established the National Satellite Land Remote Sensing Data Archive at EDC. Although data have been acquired and stored for decades, the world's remote sensing community has only recently begun to address long-term data preservation and access. One such effort was made recently by remote sensing leaders from academia, industry and government as members of a federal advisory committee from 1998 to 2000. This presentation provides a brief account of the Committee's work product.

 

3. An Overview of the Landsat Data Continuity Mission (LDCM)

Bruce K. Quirk and Darla M. Duval*, U.S. Geological Survey EROS Data Center, USA

Since 1972, the Landsat program has provided continuous observations of the Earth's land areas, giving researchers and policy makers an unprecedented vantage point for assessing global environmental changes. The analysis of this record has driven a revolution in terrestrial remote sensing over the past 30 years. Landsat 7, which was successfully launched in 1999, returned operation of the Landsat program to the U.S. Government. Plans have been made for the follow-on to Landsat 7, the Landsat Data Continuity Mission (LDCM), which has a planned launch date of late 2006.

The scientific need for Landsat-type observations has not diminished through time. Changes in global land cover have profound implications for the global carbon cycle, climate, and functioning of ecosystems. Furthermore, these changes must be monitored continually in order to link them to natural and socioeconomic drivers. Landsat observations play a key role because they occupy that unique part of the spatial-temporal domain that allows human-induced changes to be separated from natural changes. Coarse-resolution sensors, such as the Moderate-Resolution Imaging Spectroradiometer (MODIS) and the Advanced Very High Resolution Radiometer (AVHRR), are ideal for monitoring the daily and weekly changes in global biophysical conditions but lack the resolution to accurately measure the amount and origin of land cover change. High-resolution commercial systems, while valuable for validation, cannot acquire sufficient global data to meet scientific monitoring needs. Landsat-type observations fill this unique niche.

A joint effort between NASA, the U.S. Geological Survey (USGS), and the private sector, LDCM will continue the Landsat legacy by incorporating enhancements that reduce system cost and improve data quality. Following the 1992 Land Remote Sensing Policy Act, the LDCM seeks a commercially owned and operated system selected through a competitive procurement. Unlike earlier Landsat commercialization efforts, however, the LDCM procurement is based on a rigorous Science Data Specification and Data Policy, which seeks to guarantee the quantity and quality of the data while preserving reasonable cost and unrestricted data rights for end users. Thus the LDCM represents a unique opportunity for NASA and the USGS to provide science data in partnership with private industry and to reduce cost and risk to both parties, while creating an environment to expand the commercial remote sensing market.

The data specification requires the provision of 250 scenes per day, globally distributed, with modest improvements in radiometric signal-to-noise ratio (SNR) and dynamic range. Two additional bands have been added: an "ultra-blue" band centered at 443 nm for coastal and aerosol studies, and a band at either 1,375 or 1,880 nm for cirrus cloud detection. No thermal bands will be included on this mission. Additional details are available on the LDCM specification, mission concept, and status.

* Raytheon. Work performed under U.S. Geological Survey contract 1434-CR-97-CN-40274.



4. Current Applications of Landsat 7 Data in Texas
Gordon L. Wells, Center for Space Research, The University of Texas at Austin, USA

The rapid delivery of timely information useful to decision makers is one of the primary goals of the data production and application programs developed by the Mid-American Geospatial Information Center (MAGIC), located at the University of Texas at Austin's Center for Space Research. In a state the size and nature of Texas, geospatial information collected by remote sensing satellites can assist a broad range of operational activities within federal, state, regional, and local government departments.

In the field of emergency management, the state refreshes its imagery basemap using Landsat 7 data on a seasonal basis to capture the locations of recent additions to street and road networks and new structures that might be vulnerable to wildfires or flash floods. Accurately geolocated satellite imagery can be incorporated into the geographic information system used by the Governor's Division of Emergency Management much more rapidly than updated records received from the department of transportation or local entities.

For many activities involving the protection and enhancement of natural resources, Landsat 7 data offer the most economical and effective means to address problems that affect large areas. Invasive species detection and eradication is a current concern of the Texas Department of Agriculture, the Texas Soil and Water Conservation Board, and the Upper Colorado River Authority. Invasive saltcedar is one noxious species that can be identified and removed with the help of satellite remote sensing.

The information required by policy makers may extend beyond state borders into regions where satellite reconnaissance is the only practical tool available. For international negotiations involving the shared water resources of Texas and Mexico, satellite imagery has made a valuable contribution to the monitoring of irrigation activities and the local effects of drought conditions. In the future, there will be increasing concentration on shortening the time lag between the collection of instrument data by MAGIC's satellite receiving station and final product delivery in the projection, datum, and file format required for immediate inclusion into operational analyses by the various agencies in the region.


5. Development of Land Cover Database of East Asia
Wang Zhengxing, Zhao Bingru, Liu Chuang, Global Change Information and Research Center, Institute of Geography and Natural Resource Research, Chinese Academy of Sciences, China

Land cover plays a major role in a wide range of fields, from global change to regional sustainable development. Although land cover has changed dramatically over the last few centuries, until now there has been no consistent way of quantifying the changes globally (Nemani and Running, 1995). Land cover datasets currently used for parameterization of global climate models are typically derived from a range of preexisting maps and atlases (Olson and Watts, 1982; Matthews, 1983; Wilson and Henderson-Sellers, 1985); this approach has several limitations (A. Strahler and J. Townshend, 1996). Another important data source is statistical reporting, but some statistical land cover data are unreliable. At present, the only practical way to develop a land cover dataset consistently, continuously, and globally is satellite remote sensing. This is also true for the development of a land cover dataset of East Asia.

The 17-class IGBP land cover legend includes eleven classes of natural vegetation, three classes of developed and mosaic lands, and three classes of non-vegetated lands. This system may be useful at the global level, but it has a serious shortcoming: there is only one class for arable land. Since arable land is the most dynamic and important component of the human-nature system, it is essential to characterize the arable land sub-system in more detail.

There is still potential for finer classification in the current 1-km AVHRR-NDVI data sets. A decision tree classifier is used to assign all input data to various pre-defined classes. The key to accurate interpretation is to identify more reliable links (decision rules) between input data and output classes. The basic idea underlying the decision tree is that any land cover class should correspond to an identifiable point in a multi-dimensional feature space that includes multi-temporal NDVI, phenology, ecological region, DEM, census data, and so on. Preliminary research shows that stratification by ecological region and DEM can simplify the decision tree structure and yield more meaningful classes in China's major agricultural regions. Arable land cover may be classified at two levels: the first level describes how many times per year crops are planted, and the second the crop characteristics.
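As a rough illustration of the stratified decision-tree idea described above, the following Python sketch fits one tree per ecological region on hypothetical per-pixel features (monthly NDVI, elevation, and a region code); the feature set, class labels, and random data are placeholders, since the authors' actual decision rules and training data are not given here.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical per-pixel features: 12 monthly NDVI values, DEM elevation (m),
    # and an integer ecological-region code. All values are synthetic placeholders.
    rng = np.random.default_rng(0)
    n_pixels = 1000
    ndvi = rng.uniform(0.0, 0.9, size=(n_pixels, 12))          # multi-temporal NDVI
    elevation = rng.uniform(0.0, 3000.0, size=(n_pixels, 1))   # DEM
    eco_region = rng.integers(0, 5, size=(n_pixels, 1))        # ecological-region code

    features = np.hstack([ndvi, elevation, eco_region])

    # Illustrative arable-land labels: 0 = single-cropped, 1 = double-cropped,
    # 2 = non-arable. Real labels would come from ground truth or census data.
    labels = rng.integers(0, 3, size=n_pixels)

    # Stratify by ecological region: fitting one shallow tree per region is one
    # way to simplify the overall tree structure, as the abstract suggests.
    trees = {}
    for region in np.unique(eco_region):
        mask = eco_region.ravel() == region
        tree = DecisionTreeClassifier(max_depth=5)
        tree.fit(features[mask], labels[mask])
        trees[int(region)] = tree

    # Classify a pixel with the tree belonging to its ecological region.
    pixel = features[:1]
    predicted_class = trees[int(pixel[0, -1])].predict(pixel)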

The current land cover classification based on 1-km AVHRR-NDVI data sets still has serious limitations for parameterization of some models. The nominal 1-km spatial resolution produces many mixed pixels, whereas some models, such as the DNDC model, require pure pixels. However, the coming 250-m MODIS-EVI data set will narrow the gap between model needs and data supply to some extent. Using the approaches developed for AVHRR, MODIS will yield more reliable land cover data for East Asia.


Roundtable

Track II-D-1:
Roundtable Discussion on Preservation and Archiving of Scientific and Technical Data in Developing Countries

Chair: William Anderson, Praxis101, Rye, NY, USA

Session Organizers: William Anderson, US National Committee for CODATA
Steve Rossouw, South African National Committee for CODATA
Liu Chuang, Chinese Academy of Sciences, Beijing, China
Paul F. Uhlir, US National Committee for CODATA

A Working Group on Scientific Data Archiving was formed following the 2000 CODATA Conference. The primary objective of this Working Group has been to create a focus within CODATA on the issues of scientific and technical data preservation and access. The Working Group, co-chaired by William Anderson and Steve Rossouw, co-organized a workshop on data archiving with the South African National Research Foundation, held in Pretoria in May 2002. The Working Group is preparing a report of its activities from 2001-2002.

Another initiative of the Working Group has been to propose the creation of a CODATA "Task Group on Preservation and Archiving of S&T Data in Developing Countries." The proposed objectives of that Task Group are to: promote a deeper understanding of the conditions in developing countries with regard to long-term preservation, archiving, and access to scientific and technical (S&T) data; advance the development and adoption of improved archiving procedures, technologies, standards, and policies; provide an interdisciplinary forum and mechanisms for exchanging information about S&T data archiving requirements and activities, with particular focus on the concerns of developing countries; and publish and disseminate broadly the results of these efforts. The proposed Task Group would be co-chaired by William Anderson and Liu Chuang.

An additional related proposal of the Working Group has been to create a Web portal on archiving and preservation of scientific and technical data and information. This portal, which would be developed jointly by CODATA with the International Council for Scientific and Technical Information and other interested organizations, would provide online information about, and links to, the following:

  • Scientific and technical data and information archiving procedures, technologies, standards, and policies;
  • Discipline-specific and cross-disciplinary archiving projects and activities; and
  • Expert points of contact in all countries, with particular attention to those in developing countries.

Reports on all these activities will be given at the Roundtable and will then be discussed with the individuals who attend this session.


Overview and Grand Challenges
Thursday, 3 October
1245 - 1330

Chair: Fedor Kuznetzov, Institute of Inorganic Chemistry, Novosibirsk, Russia

Preserving Scientific Data: Supporting Discovery into the Future
John Rumble, CODATA President

A wide variety of methods have been used to save and preserve scientific data for thousands of years. The physical nature of these means, and the inherent difficulties of sharing physical media with others who need the data, have been major barriers to advancing research and scientific discovery. The information revolution is changing this in many significant ways: ease of availability, breadth of distribution, size and completeness of data sets, and documentation. As a consequence, scientific discovery itself is changing now and, in the future, perhaps even more dramatically. In this talk I will review some historical aspects of data preservation and the use of data in discovery, and I will offer some speculations on how preserving data digitally might revolutionize scientific discovery.

 
