Factors Affecting How Well Bacterial Whole Genome Sequencing Reads Assemble

University essay from Uppsala universitet/Institutionen för medicinsk biokemi och mikrobiologi

Abstract: Whole genome sequencing (WGS) has recently become a new high-resolution tool for tracing the source of foodborne outbreaks. Often only a few genetic differences distinguish closely related bacterial isolates, so variability in data quality between laboratories may influence the results. This project used a data set from an interlaboratory study in which ten laboratories sequenced the same bacterial samples using different library preparation kits and sequencing methods. Factors that could explain differences in how well the raw WGS data from each lab assembles were investigated. The raw data from the different labs assembled very differently. One lab had adapter sequences in its reads, and filtering them out improved the assembly substantially. All labs using the transposase-based library preparation kit Nextera had base composition bias at the beginning of the reads. For many labs, as the coverage was increased, the number of contigs first increased and then decreased. This was due to a low number of contaminating reads from other species. However, this contamination was barely visible in the plots generated by Kraken/Krona. Filtering out contigs with very low coverage removed the problem. Two labs performed much worse than the others. Some of their reads showed a drop in quality towards the ends, and their data also had the longest read length. However, quality trimming the read ends did not improve the assembly. These two labs also had higher GC content in their reads than the other labs; the reason for this needs further investigation.
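The low-coverage contig filter mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the essay's actual method: the `Contig` record, the `filter_low_coverage` function, the length-weighted median used as the reference coverage, and the 10% cutoff are all assumptions made for illustration, and the example values are invented.

```python
from typing import List, NamedTuple


class Contig(NamedTuple):
    """Hypothetical contig record: name, length in bp, mean read coverage."""
    name: str
    length: int
    coverage: float


def weighted_median_coverage(contigs: List[Contig]) -> float:
    """Length-weighted median coverage, so that many short contaminant
    contigs do not pull the reference coverage down."""
    ordered = sorted(contigs, key=lambda c: c.coverage)
    total = sum(c.length for c in ordered)
    running = 0
    for c in ordered:
        running += c.length
        if running * 2 >= total:
            return c.coverage
    raise ValueError("no contigs given")


def filter_low_coverage(contigs: List[Contig], min_fraction: float = 0.1) -> List[Contig]:
    """Keep contigs whose coverage is at least min_fraction of the
    length-weighted median coverage (the 0.1 default is an assumed cutoff)."""
    threshold = min_fraction * weighted_median_coverage(contigs)
    return [c for c in contigs if c.coverage >= threshold]


# Invented example: two long, well-covered contigs and two short,
# barely covered contigs of the kind contaminating reads might produce.
example = [
    Contig("contig_1", 500_000, 50.0),
    Contig("contig_2", 300_000, 48.0),
    Contig("contig_3", 400, 1.2),
    Contig("contig_4", 350, 0.9),
]
kept = filter_low_coverage(example)
print([c.name for c in kept])
```

A relative cutoff is used here rather than an absolute one because the abstract describes assemblies at varying coverage; a fixed threshold would behave differently at each sequencing depth.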
