Exploratory Analysis of Isoelectric Point Prediction with Simple Feature Encoding

University essay from KTH/Tillämpad fysik (Applied Physics)

Abstract: Proteomics is the large-scale study of proteins in biological systems such as human cells. Understanding proteomes, i.e. the complete set of proteins expressed by an organism, has especially useful applications in medicine, such as genetic research and drug discovery. Biochemical processes in a cell alter the state of the proteins it contains, producing an abundance of physical variations and permutations of single proteins. The main goal of proteomic studies is to identify the proteins involved in such processes in order to study their purpose and function. Successful reduction of the so-called "search space" in which proteins are identified is a major determinant of the resulting number of correct protein identifications. Fractionation processes attempt to reduce this search space through electrophoretic experiments such as isoelectric focusing (IEF) in order to determine individual protein properties such as the isoelectric point. Protein sample preparation relies heavily on isoelectric point values, denoted $p$I, and hence accurate theoretical prediction of these values would provide a benchmark to aid analytical processes such as liquid chromatography-mass spectrometry (LC-MS). This dissertation explores the efficacy of using simple feature encoding to improve upon conventional theoretical $p$I predictions, which are based on the Henderson-Hasselbalch equation, so that they more accurately reflect experimental values. Simple feature encoding was used with two different optimisation techniques. Encoding chargeable amino acid residues in peptide sequences into various $k$-mer combinations produced a considerable improvement in predicted $p$I values compared to prediction based solely on reference $p$K$_a$ values. The approaches taken in this project highlighted, arguably at a fundamental level, the usefulness of peptide feature encoding for improving theoretical $p$I predictions. However, future research should consider extending the models discussed using more developed and complex modelling techniques for peptide sequences and, more importantly, for $p$K$_a$ constants.
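For orientation, the conventional baseline mentioned in the abstract can be sketched as follows: each chargeable group contributes a fractional charge given by the Henderson-Hasselbalch equation, and the $p$I is the pH at which the net charge is zero. This is a minimal sketch under stated assumptions, not the essay's implementation. The $p$K$_a$ values are illustrative textbook-style reference values chosen for the example, and the names `net_charge` and `isoelectric_point` are hypothetical.

```python
# Illustrative reference pKa values (an assumption, not taken from the essay).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}             # side chains that carry +1 when protonated
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}    # side chains that carry -1 when deprotonated
PKA_N_TERM = 8.6   # free amino terminus
PKA_C_TERM = 3.6   # free carboxyl terminus


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH via Henderson-Hasselbalch."""
    # Basic groups: fraction protonated = 1 / (1 + 10^(pH - pKa)).
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    # Acidic groups: fraction deprotonated = 1 / (1 + 10^(pKa - pH)).
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for residue in sequence:
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tol=1e-4):
    """Find the pH of zero net charge by bisection.

    Net charge decreases monotonically with pH, so bisection on [0, 14] converges.
    """
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: pI lies at higher pH
        else:
            high = mid   # neutral or negative: pI lies at lower pH
    return (low + high) / 2.0


if __name__ == "__main__":
    for peptide in ("GGGG", "DDDD", "KKKK"):
        print(peptide, round(isoelectric_point(peptide), 2))
```

In this sketch every residue of a given type shares one fixed $p$K$_a$ regardless of its neighbours. That simplification is what a feature-encoding approach such as the $k$-mer encoding described above would aim to relax.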
