Pre-analysis of Nanopore Data for DNA Base Calling

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Abstract: Nanopore sequencing is a relatively new DNA sequencing method that measures the electric current through a nanopore in a membrane as the nucleotides of a DNA strand pass through it. From the resulting current signal, a base caller can determine the sequence of nucleotides in the DNA. The goal of this project was to create a machine learning model that estimates the accuracy rate (identity score) of the sequenced DNA from the electric current signal and other data available through nanopore sequencing. The models were trained on a dataset of E. coli samples that had been sequenced with nanopore sequencing. A linear regression model and several neural networks were created. The best performing model was a neural network with a mean square error (MSE) of 6.12 ∙ 10⁻⁴, compared to a variance in the dataset of 2.11 ∙ 10⁻³. Since the MSE is well below the variance of the dataset, the model predicts identity scores considerably better than always predicting the mean score would.
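The comparison between the MSE and the dataset variance can be made concrete with a short calculation. The sketch below is not taken from the essay; it only uses the two figures reported in the abstract, and the function name `explained_variance_fraction` is a hypothetical helper introduced for illustration. It assumes that the MSE and the variance were computed on the same set of identity scores, in which case 1 - MSE/variance corresponds to the coefficient of determination (R²). If they were computed on different splits, the result is only a rough indication.

```python
def explained_variance_fraction(mse: float, variance: float) -> float:
    """Return 1 - MSE/variance.

    A model that always predicts the mean has an MSE equal to the
    variance, which gives 0 here; a perfect model gives 1.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    return 1.0 - mse / variance


# Figures reported in the abstract.
mse = 6.12e-4
variance = 2.11e-3

ratio = mse / variance
fraction = explained_variance_fraction(mse, variance)

print(f"MSE / variance     = {ratio:.3f}")
print(f"1 - MSE / variance = {fraction:.3f}")
```

Under the stated assumption, the MSE is about 29% of the variance, so the best model would account for roughly 71% of the variation in identity scores.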
