19th International CODATA Conference
Category: Data Quality

LiqCryst and SciDex: Material Databases as Scientific Instruments

Volkmar Vill (volkmar@liqcryst.chemie.uni-hamburg.de)
Institute of Organic Chemistry, University of Hamburg, Germany
www.lci-publisher.com, http://liqcryst.chemie.uni-hamburg.de/


The development of LiqCryst and SciDex was motivated by the need to evaluate numerical data on organic materials. The former is specialized for liquid crystals, whereas the latter is a recently developed object-oriented tool for all kinds of chemical information. Both systems aim to incorporate human intelligence into a computer system and to allow the analysis of numerical properties and the prediction of material parameters from chemical structures.

LiqCryst

Liquid crystals cover a huge variety of chemical structures and of physical and biological properties. They are used in LCDs in engineering and occur as membranes in living systems. The current database, LiqCryst 3.4, contains 80,000 compounds, including display materials, lipids, biopolymers, NLO materials, viruses, inorganic clusters, plastic crystals and many more. All these very diverse kinds of entries are unified within LiqCryst, allowing systematic data analysis and even the prediction of phase transition temperatures. In total, the database holds 210,000 physical properties, covering the whole field from phase types and transition temperatures through optical and dielectric data to spectroscopic information. These data were extracted from 20,000 different references, including conference proceedings, PhD theses, patents and technical documentation in addition to regular journals and monographs. LiqCryst is updated twice a year. It is available as an in-house system running on Microsoft Windows PCs and as a free online version, which holds a limited amount of data but includes the full substructure search and is accessible with any Java-capable internet browser.

Simple documentation of information, even when supported by fast search algorithms, can never be a knowledge system. To make the database a scientific instrument, one needs concepts such as the following, all of which are contained in LiqCryst:

1. Evaluation of data
2. Object-oriented system with hierarchical order of data
3. Similarity concept of compound structures
4. Comparison methods of compounds by pattern matching
5. Statistical and graphical methods of data analysis

Evaluation of data is a very important and very complex topic. Misprints occur in the literature and have to be corrected where possible; outdated terms have to be updated to current usage; unreliable data have to be marked as such; conflicting data have to be rated; and new scientific findings often require re-evaluation of already registered data. The evaluation process can partly be automated by using "computer intelligence", but it always requires the contribution of an expert, the "human intelligence" with knowledge of the field. Results of interdisciplinary research in particular usually require more than one expert.

A classical SQL database system would not fulfill the scientific needs: because the data are extremely heterogeneous, most records and fields would be only sparsely filled, making simple comparison and analysis nearly impossible. An object-oriented structure, on the other hand, allows data to be arranged in a hierarchical, tree-like organization, so that physical properties can, for example, be defined at different levels of specification. Exact numerical data can then be compared at various degrees of precision in the interpretation. In the case of liquid crystals, one can compare the transition temperature of a "generic", unspecified smectic phase with that of a fully specified smectic A phase within a family concept of phases. In this way one can, for example, quantify a layering tendency in the molecular arrangement.
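The family concept of phases can be illustrated with a minimal Python sketch. It is not LiqCryst's actual data model: the class PhaseType, the helper common_family and the small phase tree are assumptions introduced only for illustration. The sketch shows how a fully specified phase can be matched against a more generic one by walking up a tree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PhaseType:
    """A node in a hypothetical phase-type tree (names are illustrative)."""
    name: str
    parent: Optional["PhaseType"] = None

    def ancestors(self) -> List["PhaseType"]:
        """Return this phase and all more generic parents, most specific first."""
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

    def is_a(self, other: "PhaseType") -> bool:
        """True if this phase equals `other` or is a specialization of it."""
        return other in self.ancestors()


def common_family(a: PhaseType, b: PhaseType) -> Optional[PhaseType]:
    """Most specific phase type that both a and b belong to, or None."""
    b_chain = b.ancestors()
    for node in a.ancestors():
        if node in b_chain:
            return node
    return None


# Assumed toy hierarchy: mesophase -> smectic -> smectic A / smectic C,
# and mesophase -> nematic.
mesophase = PhaseType("mesophase")
smectic = PhaseType("smectic", mesophase)
smectic_a = PhaseType("smectic A", smectic)
smectic_c = PhaseType("smectic C", smectic)
nematic = PhaseType("nematic", mesophase)
```

With such a tree, a record that reports only a "smectic" phase and a record that reports "smectic A" can be compared at the level of their common family, here smectic, instead of being treated as unrelated field values.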

Comparisons are a key instrument for analyzing scientific data. Computer systems normally offer only straightforward search algorithms that find exact matches, whereas a human being can see similarities between different things and find relations that would be lost to a plain archiving system. LiqCryst has structure comparison functions that can be applied manually to quantify structure-property relationships, giving the user the profiles of specifically selected pairs of compounds. The same functions can also be used to predict properties by automatically comparing a given compound with all registered compounds in the database.

A decisive element of this prediction method is a vectorial concept of similarity. The similarity between two compounds is not given as a single scalar value (e.g. "80 % similar") but as a vectorial term. This term has to be defined specifically for the problem at hand; for the similarity of nematic liquid crystals, for example, it is the difference in a string of functional groups. Dividing a structure into functional groups in this way is specific to liquid crystals. Similarity functions as used in graph theory would not lead to acceptable results for liquid crystals and their properties.
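The idea of a vectorial difference between strings of functional groups, and its use for prediction by comparison with all registered compounds, can be sketched as follows. This is only one possible reading under simplifying assumptions, not the algorithm implemented in LiqCryst: compounds are reduced to equal-length tuples of group labels, the functions difference and predict are hypothetical, and the labels and numerical values are invented placeholders.

```python
from typing import Dict, List, Optional, Tuple

Groups = Tuple[str, ...]                  # a compound as an ordered string of group labels
Diff = Tuple[Tuple[int, str, str], ...]   # entries: (position, group in a, group in b)


def difference(a: Groups, b: Groups) -> Optional[Diff]:
    """Position-wise difference between two group strings.

    Returns None when the strings have different lengths (treated as not
    comparable in this simplified sketch); an empty tuple means identical.
    """
    if len(a) != len(b):
        return None
    return tuple((i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y)


def predict(target: Groups, database: Dict[Groups, float]) -> List[float]:
    """Estimate the target's property from known pairs with the same difference.

    For each registered compound `ref` that differs from the target by d,
    look for registered pairs (p, q) with difference(p, q) == d and transfer
    their property shift: estimate = value(ref) + value(q) - value(p).
    """
    estimates = []
    for ref, ref_value in database.items():
        d = difference(ref, target)
        if not d:  # None (not comparable) or () (identical): nothing to transfer
            continue
        for p, p_value in database.items():
            for q, q_value in database.items():
                if difference(p, q) == d:
                    estimates.append(ref_value + q_value - p_value)
    return estimates


# Invented placeholder data: abstract group labels and arbitrary values.
database = {
    ("R1", "ring", "X"): 50.0,
    ("R1", "ring", "Y"): 70.0,
    ("R2", "ring", "X"): 40.0,
}
target = ("R2", "ring", "Y")
estimates = predict(target, database)
```

The difference is kept as a structured term (which group changed, at which position, into what) rather than collapsed into a percentage, so only pairs of compounds that differ in exactly the same way contribute to an estimate.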

Based on the experience with LiqCryst, we wanted to generalize the concepts used. LiqCryst has so far been developed specifically as a tool for liquid crystals, but some of its methods are easily transferable or adaptable to all kinds of organic compounds and would be very useful in a much wider field of applications. Among these are the comparison functions, the general principle of object-oriented, hierarchically ordered properties, and the evaluation of chemical drawings. However, the kind of prediction that LiqCryst uses, with its specific definition of similarity, is not generally adaptable. Conversely, a general concept could probably be used for predicting phase transitions in LiqCryst as well, but would possibly give worse results than the method developed specifically for liquid crystals.

The new "Scientific Data Explorer", SciDex, is a general-purpose system. It can handle all types of chemical information and, in addition, is not restricted to compounds: it can just as well handle a pure literature database or a property database with its associated references.

SciDex shares many of the features of LiqCryst. Evaluation has to be performed by people who know their field; it will never be possible to create a proper database without the knowledge behind it. SciDex is likewise an object-oriented system, and it is additionally cross-linked between compounds, properties and references: each of these can refer to a list of the other two. Similarity concepts have to be rethought, because the former restriction to liquid crystals has to be removed, but the principles of the comparison methods remain the same.
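The cross-linking between compounds, properties and references can be pictured with a small Python sketch. The classes Compound, Property and Reference and the helper link are assumptions made for illustration and do not reflect SciDex's internal design; the names and the citation string are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Compound:
    name: str
    properties: List["Property"] = field(default_factory=list, repr=False)
    references: List["Reference"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Property:
    label: str
    compounds: List["Compound"] = field(default_factory=list, repr=False)
    references: List["Reference"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Reference:
    citation: str
    compounds: List["Compound"] = field(default_factory=list, repr=False)
    properties: List["Property"] = field(default_factory=list, repr=False)


def _add_once(items: list, item) -> None:
    """Append item unless the very same object is already listed."""
    if item not in items:
        items.append(item)


def link(compound: Compound, prop: Property, ref: Reference) -> None:
    """Register one fact so that each object lists the two others."""
    _add_once(compound.properties, prop)
    _add_once(compound.references, ref)
    _add_once(prop.compounds, compound)
    _add_once(prop.references, ref)
    _add_once(ref.compounds, compound)
    _add_once(ref.properties, prop)


# Placeholder objects, not real database entries.
c = Compound("compound 1")
p = Property("transition temperature")
r = Reference("placeholder citation")
link(c, p, r)
```

Because every object can be reached from the other two, the same structure also covers a pure literature database (Reference objects with empty compound lists) or a property database with its associated references.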

SciDex also has several new features:

Structure evaluation with respect to stereochemistry and general plausibility

Multi-user, multi-database, complete with data security

No limits to the numbers of compounds, properties, references

Open definition for properties, including e.g. metafiles

Tool for display of graphical data

Numerical data and properties of atoms and bonds, e.g. NMR shifts and coupling constants

Export of the whole database as an RTF file, directly readable by e.g. MS Word

More types of comparison functions

Modular system

Running on Windows and via the Internet

The first application of SciDex was the Index of Organic Compounds of Landolt-Börnstein, which gives references to 17,000 organic compounds from 20 volumes of the Landolt-Börnstein New Series.

New applications of SciDex are the 29Si NMR database by F. Uhlig and H. Marsmann and the "chirbase" by B. Koppenhöfer.