19th International CODATA Conference
Category: Plenary - Mark-Up Languages

The Design and Evolution of Markup Languages

Peter Murray-Rust
Unilever Centre for Molecular Informatics, Chemistry Department, Cambridge University, UK


This talk will use Chemical Markup Language (CML) to illustrate some of the ways in which markup languages (MLs) can be developed in a scientific community and how they can be used. Initially, markup languages were precise and often prescriptive sets of constraints designed to ensure conformity. They evolved to provide semantics (often interactive behaviour) attached to elements, and most recently have developed into tools for systematising and aggregating knowledge. Most importantly, MLs represent a common agreement on how to exchange information.
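As a rough illustration of what such a common agreement looks like in practice, the sketch below reads a small CML-style description of water using only the Python standard library. The fragment is hand-written for this example and has not been validated against the CML schema; the element and attribute names (molecule, atomArray, atom, elementType, bondArray, bond, atomRefs2) follow common CML usage, while the helper name read_molecule is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used for CML documents (assumed here for illustration).
CML_NS = "http://www.xml-cml.org/schema"

# A hand-written, simplified CML-style fragment describing water.
WATER_CML = f"""<molecule xmlns="{CML_NS}" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def read_molecule(cml_text):
    """Return (atoms, bonds) from a CML-style molecule element.

    atoms maps atom id -> element symbol; bonds is a list of
    (atom id, atom id) pairs in document order.
    """
    root = ET.fromstring(cml_text)
    ns = {"cml": CML_NS}
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.findall("cml:atomArray/cml:atom", ns)
    }
    bonds = [
        tuple(bond.get("atomRefs2").split())
        for bond in root.findall("cml:bondArray/cml:bond", ns)
    ]
    return atoms, bonds


atoms, bonds = read_molecule(WATER_CML)
```

Because the element names carry agreed meaning, any program that understands the vocabulary can recover the atoms and bonds without knowing which tool produced the file.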

In general a community has to go through most of the following:

There is now a complete set of tools for chemistry (most come from Open Source volunteers) and they open up a radically new vision of the future. Chemistry, like many disciplines, is micropublished - it consists of many unrelated primary articles in many different journals. If these can be marked up with knowledge tools (ontologies, dictionaries, etc.) then the primary literature becomes a knowledgebase, which can be read and evaluated by robots. I hope to demonstrate four such knowledgebots:

The results of this markup can be indexed with the new IUPAC/NIST chemical identifier (INChI), and we have shown that this scales easily to 250,000+ compounds. In principle, therefore, if the primary chemical literature is Openly accessible and re-usable, we can revolutionise the way in which information is abstracted, evaluated, curated and re-used. However, very little chemical data is actually published (we estimate 1% for spectra and 20% for crystallography). And even for the published data, many publishers currently require copyright, which restricts its re-use.

We believe in the archiving of the totality of chemical experience - experimental data, computations, discussion, and annotation. We have started to do this in our own archive, http://www.dspace.cam.ac.uk. Because the Royal Society of Chemistry is now a "green publisher", we have archived a preprint of an article to appear in Org. Biomol. Chem., where the practice and philosophy of this are argued: http://www.dspace.cam.ac.uk/handle/1810/741.