Pietro Lovato

Post-doctoral researcher in Computer Science
 

University of Verona (Italy)

pietro.lovato@univr.it

Soft Ngram representation and modeling for protein remote homology detection


Pietro Lovato, Marco Cristani, Manuele Bicego

ABSTRACT

Remote homology detection is a central problem in bioinformatics, where the challenge is to detect functionally related proteins when their sequence similarity is low. Recent solutions employ representations derived from the sequence profile, obtained by replacing each amino acid of the sequence with the corresponding most probable amino acid in the profile. However, the information contained in the profile could be exploited more deeply, provided that there is a representation able to capture and properly model such crucial evolutionary information. In this paper, we propose a novel profile-based representation for sequences, called soft Ngram. This representation extends the traditional Ngram scheme (obtained by grouping N consecutive amino acids) and makes it possible to consider all of the evolutionary information in the profile: Ngrams are extracted from the whole profile, and each is equipped with a weight computed directly from the corresponding evolutionary frequencies. We illustrate two different approaches to model the proposed representation and to derive a feature vector, which can be effectively used for classification with a support vector machine (SVM). A thorough evaluation on three benchmarks demonstrates that the new approach outperforms other Ngram-based methods and shows very promising results in comparison with a broader spectrum of techniques.
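
As a rough illustration of the idea of extracting weighted Ngrams from a whole profile, the following Python sketch may help. It is not the released code and not necessarily the weighting used in the paper: it assumes that the weight of an Ngram at a given position is the product of the per-position frequencies of its amino acids, and the function name, the `threshold` parameter, and the three-letter toy alphabet are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def soft_ngram_counts(profile, alphabet, n=2, threshold=0.0):
    """Accumulate weighted Ngram counts from a profile.

    profile  : list of rows, one per sequence position; each row holds one
               frequency per symbol of `alphabet`.
    alphabet : string of symbols, in the same order as the profile columns.
    n        : Ngram length.
    threshold: Ngrams whose weight is not above this value are skipped.

    ASSUMPTION: the weight of an Ngram is the product of the frequencies
    of its symbols at the n consecutive positions it covers.
    """
    counts = defaultdict(float)
    for start in range(len(profile) - n + 1):
        window = profile[start:start + n]
        # Enumerate every possible Ngram over the alphabet at this position.
        for indices in product(range(len(alphabet)), repeat=n):
            weight = 1.0
            for row, i in zip(window, indices):
                weight *= row[i]
            if weight > threshold:
                ngram = "".join(alphabet[i] for i in indices)
                counts[ngram] += weight
    return dict(counts)


# Toy example: 3 positions, 3-symbol alphabet (a real profile has 20 columns).
toy_alphabet = "ACD"
toy_profile = [
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.25, 0.0, 0.75],
]
counts = soft_ngram_counts(toy_profile, toy_alphabet, n=2)
```

Under this assumption, when every profile row sums to one, the weights extracted at each position also sum to one, so the soft counts reduce to ordinary Ngram counts when each row puts all of its mass on a single amino acid.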

SOURCE CODE

You can download the MATLAB source code, along with a demo, here.

The code is provided on an "as is" basis, without support or guarantees.

(C) Pietro Lovato 2016

Pietro Lovato
Dipartimento di Informatica
Università degli Studi di Verona
Ca' Vignal 2, Strada Le Grazie 15, 37134 Verona, Italy
E-mail: pietro.lovato@univr.it