19th International CODATA Conference
Category: Plenary - Mark-Up Languages

XML Description of Protein Structural Data for Data Grid and Computing Grid

Haruki Nakamura
Institute for Protein Research, Osaka University, Japan


The Protein Data Bank (PDB) has been a primary archive of three-dimensional structural information on biological macromolecules. Protein Data Bank Japan (PDBj, http://www.pdbj.org/) has been curating new PDB entries as a member of the worldwide Protein Data Bank (wwPDB) [1], along with the Research Collaboratory for Structural Bioinformatics (RCSB) and the European Bioinformatics Institute (EBI).

A new Extensible Markup Language (XML) format describing the PDB data, pdbML, is being developed by wwPDB. Its structure is defined in an XML Schema (pdbx-v1.000.xsd at http://deposit.pdb.org/pdbML/) based on the macromolecular Crystallographic Information File (mmCIF) format. The entire content in pdbML format is now available from ftp://beta.rcsb.org/pub/pdb/uniformity/data/XML. To make the most of the XML format, we at PDBj have constructed an XML-based PDB data browser (xPSSS: xml-based Protein Structure Search Service, at http://www.pdbj.org/xpsss/) using a native XML database. Information on the biological and biochemical functions of proteins can also be browsed. In addition to simple searches, full XPath searches are implemented, which allows users to perform complicated searches and to control the output of their searches in detail. The xPSSS is also used by a SOAP service for large-scale analyses and data grid applications.
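As a rough illustration of the kind of XPath search described above, the following Python sketch selects atom records from a small XML fragment. The fragment is hypothetical and simplified: its element names loosely follow mmCIF item names, and it omits the namespace and most of the content of a real pdbML file. It uses the limited XPath subset of the Python standard library, not the xPSSS service itself, and the helper find_atoms is an assumption introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. Element names loosely follow mmCIF
# item names; a real pdbML file is namespaced and far larger.
SAMPLE = """
<datablock datablockName="DEMO">
  <atom_siteCategory>
    <atom_site id="1">
      <label_atom_id>N</label_atom_id>
      <label_comp_id>MET</label_comp_id>
      <Cartn_x>10.0</Cartn_x>
    </atom_site>
    <atom_site id="2">
      <label_atom_id>CA</label_atom_id>
      <label_comp_id>MET</label_comp_id>
      <Cartn_x>11.5</Cartn_x>
    </atom_site>
    <atom_site id="3">
      <label_atom_id>CA</label_atom_id>
      <label_comp_id>GLY</label_comp_id>
      <Cartn_x>13.0</Cartn_x>
    </atom_site>
  </atom_siteCategory>
</datablock>
"""


def find_atoms(root, atom_name):
    """Return (id, residue name, x coordinate) for each matching atom_site.

    The XPath predicate keeps only atom_site elements that have a
    label_atom_id child whose text equals atom_name.
    """
    path = f".//atom_site[label_atom_id='{atom_name}']"
    return [
        (
            site.get("id"),
            site.findtext("label_comp_id"),
            float(site.findtext("Cartn_x")),
        )
        for site in root.findall(path)
    ]


root = ET.fromstring(SAMPLE)
ca_atoms = find_atoms(root, "CA")  # all alpha-carbon records in the fragment
```

Because the query both filters the records and determines which fields are read from each match, it reflects, in miniature, how an XPath expression can control the search and its output.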

For multiscale biological systems, integrating the simulation methods for models at different levels is essential, and a new platform, BioPfuga (Biosimulation Platform United on Grid Architecture), has been developed for this purpose [2]. It requires that (1) application programs be divided into a set of many component pieces, and that (2) data be communicated among the program components using a standard XML description. An example applying BioPfuga to a hybrid QM(HF)/QM(DFT)/MM method will be shown.
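The two requirements above can be sketched in Python as a pair of small components that share data only through an XML message. This is a minimal sketch under stated assumptions: the element and attribute names (coordinates, atom, element, x, y, z) and the three functions are hypothetical and are not taken from the actual BioPfuga XML description, and the centroid calculation merely stands in for a real simulation step.

```python
import xml.etree.ElementTree as ET


def encode_coordinates(atoms):
    """Producer component: serialise (element, x, y, z) tuples as XML text."""
    root = ET.Element("coordinates", unit="angstrom")  # hypothetical schema
    for symbol, x, y, z in atoms:
        # repr() of a float round-trips exactly through float().
        ET.SubElement(root, "atom", element=symbol, x=repr(x), y=repr(y), z=repr(z))
    return ET.tostring(root, encoding="unicode")


def decode_coordinates(message):
    """Consumer component: rebuild the atom tuples from the XML text."""
    root = ET.fromstring(message)
    return [
        (a.get("element"), float(a.get("x")), float(a.get("y")), float(a.get("z")))
        for a in root.iter("atom")
    ]


def centroid(atoms):
    """Stand-in for a downstream calculation: mean position of the atoms."""
    n = len(atoms)
    return tuple(sum(atom[i] for atom in atoms) / n for i in (1, 2, 3))


# One component emits the message; another consumes it without sharing
# any in-memory state with the producer.
atoms = [("O", 0.0, 0.0, 0.0), ("H", 1.0, 0.0, 0.0)]
message = encode_coordinates(atoms)
received = decode_coordinates(message)
center = centroid(received)
```

Since the components agree only on the XML message, either side could be replaced by a different program, or moved to another machine, without changing the other.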

References:
[1] Berman et al. (2003) Nature Struct. Biol. 10, 980.
[2] Nakamura et al. (2004) New Generat. Comput. 22, 157-166.