Preprocessing of Nanopore Current Signals for DNA Base Calling

University essay from KTH/Skolan för elektroteknik och datavetenskap (EECS)

Author: Josef Malmström; [2020]


Abstract: DNA is a molecule that carries the genetic information of all living organisms and many viruses. The process of determining the underlying genetic code in the DNA of an organism is known as DNA sequencing, and is commonly used, for instance, to study viruses, perform forensic analysis, and make medical diagnoses. One modern sequencing technique is nanopore sequencing, in which a DNA strand is fed through a nanometer-scale protein pore, a so-called nanopore, to acquire an electrical current signal whose amplitude varies with the genetic sequence. Inferring the underlying genetic sequence from the raw current signal is known as base calling. Base calling is commonly modeled as a machine learning problem, typically using Deep Neural Networks (DNNs) or Hidden Markov Models (HMMs). In this thesis, we investigate how preprocessing of the raw electrical current signals affects the performance of a subsequent base calling model. Specifically, we apply different methods for normalization, filtering, feature extraction, and quantization to the raw current signals, and evaluate these methods using a base caller built from a so-called Explicit Duration Hidden Markov Model (ED-HMM), a variation of the regular HMM. The results show that preprocessing can have a moderate impact on the performance of the base caller. With appropriately chosen preprocessing methods, the performance of the studied ED-HMM base caller improved by 2 to 3 percentage points compared to a conventional preprocessing scheme. Possible future research directions include exploring whether the results generalize to deep base calling models, and evaluating other, more sophisticated preprocessing methods from adjacent fields.
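
To make the kind of preprocessing chain described in the abstract concrete, the following is a minimal sketch of three of the named stages: normalization, filtering, and quantization. The abstract does not say which specific methods the thesis uses, so every choice here is an assumption made for illustration: median/MAD normalization, a moving-average filter, and uniform quantization into 8 levels over the range [-3, 3]. All function names and parameters are hypothetical, and feature extraction is left out.

```python
from statistics import median


def normalize_med_mad(signal):
    """Robust normalization (assumed method): center on the median and
    scale by the median absolute deviation (MAD)."""
    med = median(signal)
    mad = median([abs(x - med) for x in signal])
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian data.
    scale = 1.4826 * mad if mad > 0 else 1.0
    return [(x - med) / scale for x in signal]


def moving_average(signal, width=3):
    """Simple low-pass filter (assumed method): mean over a centered
    window that is truncated at the signal edges."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def quantize_uniform(signal, levels=8, lo=-3.0, hi=3.0):
    """Uniform quantization (assumed method): clip to [lo, hi] and map
    each sample to an integer symbol in 0 .. levels - 1."""
    step = (hi - lo) / levels
    out = []
    for x in signal:
        clipped = min(max(x, lo), hi)
        out.append(min(int((clipped - lo) / step), levels - 1))
    return out


def preprocess(signal):
    """Chain the three stages: normalize, then filter, then quantize."""
    return quantize_uniform(moving_average(normalize_med_mad(signal)))


# Hypothetical raw current samples, made up for this example.
raw = [80.0, 82.0, 81.0, 120.0, 119.0, 121.0, 80.0]
symbols = preprocess(raw)
```

The resulting discrete symbol sequence is the kind of input a model with discrete emissions, such as an HMM, could consume. Whether the thesis feeds its ED-HMM this way is not stated in the abstract.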
