Classification of sequence tags from tandem mass spectrometry spectra using machine learning models

University essay from Lunds universitet/Examensarbeten i bioinformatik

Author: Júlia Ortís Sunyer; [2022]

Keywords: Biology and Life Sciences

Abstract: Motivation: Proteomics is the large-scale study of all the proteins found in a cell, tissue or organism. In the last few years, thanks to the development of mass spectrometry and bioinformatics, proteomics has driven research in several fields, ranging from medicine to agriculture. De novo protein sequencing can be used to reconstruct the amino acid sequence of a protein. It uses the protein’s molecular weight, its mass spectrometry spectrum, and bioinformatics tools to reconstruct the sequence without the use of a database. This avoids problems such as the limited amount of data found in the databases. Nonetheless, more research is needed to optimize the tools and the data extraction, especially to deal with the ambiguous spectra of long peptides. In this project, several machine learning models were built using TensorFlow and Keras. The aim was for at least one of the models to correctly distinguish true sequence tags extracted from tandem mass spectrometry spectra from fake tags. Results: Seven machine learning models were successfully built to classify sequence tags from tandem mass spectrometry spectra. Upon evaluation, two of the models dealt with the data better than the others according to several statistical parameters (confusion matrix outcomes, accuracy, precision, recall and area under the curve) and classified the true tags of each spectrum largely correctly.
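
The evaluation parameters named in the abstract can be illustrated with a minimal sketch in plain Python. The abstract does not say how the metrics were computed, so everything here is an assumption made for illustration: the function names (confusion_counts, summarize, roc_auc) are hypothetical, labels are taken to be 1 for a true tag and 0 for a fake tag, and the 0.5 decision threshold is arbitrary. The area under the ROC curve is computed as the fraction of (true tag, fake tag) pairs in which the true tag receives the higher score, with ties counted as half.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels (assumed: 1 = true tag, 0 = fake tag)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def summarize(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Threshold the scores, then compute accuracy, precision and recall."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Guard against division by zero when nothing is predicted/labelled positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and classifier scores, for illustration only.
    labels = [1, 1, 1, 0, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
    print(summarize(labels, scores))
    print(roc_auc(labels, scores))
```

The pairwise AUC is quadratic in the number of tags, which is fine for a small illustration; for large evaluation sets a library routine would normally be used instead.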
