Thompson Rivers University

Predicting homologous proteins using subsets of variables

As part of the Science Seminar Series, TRU statistics faculty member Dr. Jabed Tomal presents his research in statistical machine learning, data science, Bayesian statistical inference, and statistical ecology.

Abstract

Homologous proteins are considered to have a common evolutionary origin.

To produce an evolutionary sequence of proteins, a scientist needs to predict their biological homogeneity. We have proposed a model that predicts the biological homogeneity of proteins from feature variables obtained from similarity searches and amino acid sequence comparisons between candidate and target proteins.

The assumption is that the structural similarity and amino acid sequence identity of proteins relate to biological homogeneity.

The proposed model is an ensemble of logistic regression models (LRM), where each constituent LRM is fitted to a subset of feature variables.
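As a rough illustration of the ensemble idea only, the sketch below fits one logistic regression per subset of feature columns and averages the predicted probabilities to rank cases. It is not the speaker's implementation: the toy data, the fixed subsets `[[0, 1], [2, 3]]`, the plain gradient-descent fit, and the use of simple averaging to combine the models are all assumptions made for this example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lrm(X, y, cols, lr=0.5, epochs=300):
    """Fit one logistic regression on the columns in `cols` (batch gradient descent)."""
    w = [0.0] * (len(cols) + 1)  # intercept plus one weight per selected column
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            feats = [1.0] + [row[c] for c in cols]
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, feats))) - label
            for j, fj in enumerate(feats):
                grad[j] += err * fj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def predict_lrm(w, row, cols):
    """Predicted probability from one constituent model."""
    feats = [1.0] + [row[c] for c in cols]
    return sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))


def fit_ensemble(X, y, subsets):
    """Fit one LRM per subset of feature columns."""
    return [(cols, fit_lrm(X, y, cols)) for cols in subsets]


def ensemble_score(models, row):
    # Assumption: constituent probabilities are combined by a simple average.
    return sum(predict_lrm(w, row, cols) for cols, w in models) / len(models)


# Toy data (hypothetical): roughly 10% positives, four noisy features.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    label = 1 if rng.random() < 0.1 else 0
    X.append([rng.gauss(1.5 * label, 1.0) for _ in range(4)])
    y.append(label)

subsets = [[0, 1], [2, 3]]  # hypothetical grouping, fixed by hand here
models = fit_ensemble(X, y, subsets)
scores = [ensemble_score(models, row) for row in X]

# Rank cases so that those scored most likely positive come first.
ranking = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
```

In this sketch the subsets are supplied by hand; choosing them from the data is the job of the grouping algorithm described in the abstract.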

An algorithm is developed to group the variables into subsets so that the variables within a subset work well together in a single LRM, while variables in different subsets work better in separate LRMs.

The strength of the ensemble depends on the algorithm’s ability to identify strong and diverse subsets of feature variables.

The methods are applied to rank rare homologous proteins ahead of non-homologous proteins in protein homology data obtained from the KDD Cup website.

Our ensemble is found to perform better than the winning procedures of the competition and state-of-the-art ensembles.