Bioinformatics pipeline development to support Helicobacter pylori genome analysis

University essay from Göteborgs universitet/Institutionen för data- och informationsteknik

Abstract: Helicobacter pylori is a bacterium associated with a variety of diseases and is a major risk factor for gastric cancer [1]. The genomes of H. pylori bacteria isolated from different patient groups can differ, and this project is motivated by the desire of biologists to investigate how these differences correlate with differences in disease. The development of high-throughput technologies in biological science has led to the big data phenomenon. Next-generation sequencing (NGS) is one such technology: it produces, quickly and at relatively low cost, large text-based files that store short fragments of an organism's whole genome sequence [2]. Developing methods for analysing and managing such complex datasets has emerged as a challenge, and it is the subject of this thesis. Using bioinformatics approaches, we proposed a pipeline for data analysis and management on two different platforms: high-performance computing and an online workflow management system. For high-performance computing, we used a computer cluster from C3SE, a centre for scientific and technical computing at Chalmers University of Technology. On this platform, we developed a pipeline by scripting in the Perl programming language. The first step in the pipeline is error removal and quality control. The next step is to find overlaps between the short fragments of genome sequence and then merge them into continuous, longer sequences. This is called de novo genome assembly. The final step is genome annotation, the process of transferring biological information from experimentally characterised datasets or reference genomes to newly sequenced genomes. For each step in the pipeline, we used benchmarking techniques to find the best programs developed in the bioinformatics community, and where no suitable application existed, we implemented one. The result of the pipeline is well-characterised, biologically annotated datasets that are ready for analysis by biologists.
For the workflow management system, we chose the Galaxy project, a widely used bioinformatics workflow management system. Galaxy provides infrastructure for creating workflows and uploading datasets via a user interface. With Galaxy, we managed to implement a pipeline that includes quality control and de novo genome assembly. Using the first platform, the high-performance computing pipeline, we succeeded in analysing 52 datasets, and this project is the first study to explore, on a significant scale, H. pylori and its association with gastric cancer.
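
The idea behind the de novo assembly step described in the abstract, finding overlaps between short fragments and merging them into longer continuous sequences, can be illustrated with a toy example. The sketch below is not the method used in the thesis, which relied on benchmarked community programs; it is a minimal greedy overlap merge, and the function names `overlap` and `greedy_assemble`, the `min_len` parameter and the example reads are all hypothetical. Real assemblers use overlap graphs or de Bruijn graphs and must handle sequencing errors, reverse complements and repeats, none of which is attempted here.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such match is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    the remaining sequences (toy "contigs").
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Hypothetical error-free reads, for illustration only.
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The pairwise comparison makes this sketch quadratic in the number of reads per merge, which is acceptable for a handful of fragments but not for real NGS data; this is one reason production assemblers rely on indexed graph structures instead.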
