Abstract
Background: Protein fold recognition is a key step in protein three-dimensional (3D) structure
discovery. There are multiple fold discriminatory data sources which use physicochemical and
structural properties as well as further data sources derived from local sequence alignments. This
raises the issue of finding the most efficient method for combining these different informative data
sources and exploring their relative significance for protein fold classification. Kernel methods have
been extensively used for biological data analysis. They can incorporate separate fold
discriminatory features into kernel matrices which encode the similarity between samples in their
respective data sources.
Results: In this paper we consider the problem of integrating multiple data sources using a kernel-based
approach. We propose a novel information-theoretic approach based on a Kullback-Leibler
(KL) divergence between the output kernel matrix and the input kernel matrix so as to integrate
heterogeneous data sources. One of the most appealing properties of this approach is that it can
easily cope with multi-class classification and multi-task learning by an appropriate choice of the
output kernel matrix. Based on the position of the output and input kernel matrices in the KL-divergence
objective, there are two formulations, which we refer to as MKLdiv-dc and
MKLdiv-conv. We propose to efficiently solve MKLdiv-dc by a difference of convex (DC)
programming method and MKLdiv-conv by a projected gradient descent algorithm. The
effectiveness of the proposed approaches is evaluated on a benchmark dataset for protein fold
recognition and a yeast protein function prediction problem.
Conclusion: Our proposed methods MKLdiv-dc and MKLdiv-conv are able to achieve state-of-the-art
performance on the SCOP PDB-40D benchmark dataset for protein fold prediction and
provide useful insights into the relative significance of informative data sources. In particular,
MKLdiv-dc further improves the fold discrimination accuracy to 75.19%, an improvement of more than 5%
over competitive Bayesian probabilistic and SVM margin-based kernel learning
methods. Furthermore, we report a competitive performance on the yeast protein function
prediction problem.
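
The KL-divergence criterion between an output kernel matrix and a combined input kernel matrix can be illustrated with a minimal sketch. It assumes each kernel matrix is treated as the covariance of a zero-mean Gaussian, which the abstract does not state explicitly, and it adds a small ridge term for numerical stability. The names `gaussian_kl`, `combine_kernels`, and `ridge` are hypothetical and do not come from the paper, and the sketch only evaluates the divergence; it does not implement the DC programming or projected gradient solvers.

```python
import numpy as np


def gaussian_kl(k_p, k_q, ridge=1e-6):
    """KL( N(0, k_p) || N(0, k_q) ) for symmetric PSD kernel matrices.

    Assumption: kernels act as Gaussian covariances; `ridge` keeps
    both matrices invertible.
    """
    n = k_p.shape[0]
    p = k_p + ridge * np.eye(n)
    q = k_q + ridge * np.eye(n)
    # trace(q^{-1} p), computed with a linear solve instead of an explicit inverse
    trace_term = np.trace(np.linalg.solve(q, p))
    _, logdet_p = np.linalg.slogdet(p)
    _, logdet_q = np.linalg.slogdet(q)
    return 0.5 * (trace_term - n + logdet_q - logdet_p)


def combine_kernels(kernels, weights):
    """Weighted sum of input kernel matrices (one matrix per data source)."""
    return sum(w * k for w, k in zip(weights, kernels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([0, 0, 1, 1, 2, 2])
    one_hot = np.eye(3)[labels]
    k_out = one_hot @ one_hot.T  # toy output kernel built from class labels

    # Two toy input kernels from random features (stand-ins for data sources)
    feats = [rng.normal(size=(6, 4)) for _ in range(2)]
    k_in = [f @ f.T for f in feats]

    k_mix = combine_kernels(k_in, [0.5, 0.5])
    # The two argument orders give the two possible placements of the
    # output and input kernels in the divergence.
    print(gaussian_kl(k_out, k_mix))
    print(gaussian_kl(k_mix, k_out))
```

Swapping the argument order changes which matrix is inverted, which is one way to see why the two placements of the output and input kernels lead to different optimization problems.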
Translated title of the contribution | Enhanced Protein Fold Recognition through a Novel Data Integration Approach |
---|---|
Original language | English |
Pages (from-to) | 267 - 285 |
Number of pages | 18 |
Journal | BMC Bioinformatics |
Volume | 10 |
Publication status | Published - 2009 |