Development and validation of bioinformatic methods for GRC assembly and annotation

University essay from Uppsala universitet/Institutionen för biologisk grundutbildning

Abstract: This thesis presents the work done during my master's degree project under the supervision of Alexander Suh and Francisco J. Ruiz-Ruano. My work focused on the development of in silico methods to improve the assembly of the Germline Restricted Chromosome (GRC) of songbirds, specifically that of the zebra finch. GRCs are a good example of the popular saying "the exception that proves the rule". For a very long time, it was assumed that every cell in a healthy multicellular organism carries the same genetic information. Cytogenetic evidence dating back as far as the early 20th century suggests that this is not always the case: certain organisms carry supernumerary B chromosomes, which are dispensable chromosomes that are not part of the normal karyotype of a species. GRCs are often regarded as a special case of B chromosomes, in which every individual of a species carries an additional chromosome whose presence is restricted to germline cells. The presence of GRCs has been documented in insects, hagfishes and songbirds. A peculiar case is that of the zebra finch, whose GRC has an estimated size of over 150 Mb, accounting for over 10% of the total zebra finch genome size. Although the first cytogenetic evidence of the zebra finch GRC dates back to 1998, it was only last year that the first comprehensive genomic study of this relatively large chromosome was published. This study shed some light on the gene content of the zebra finch GRC, revealing that it mostly consists of paralogs of A-chromosomal genes. The GRC assembly and annotation published as part of this study included 115 GRC-linked genes identified through germline/soma read mapping, as well as 36 manually curated scaffolds with a median length of 3.6 kb. Considering the conspicuous size of the zebra finch GRC, it is clear that this is a very fragmented and likely incomplete GRC assembly.
There are many factors that can have a negative impact on assembly completeness and contiguity. In the case of the GRC, these factors collectively affect coverage in ways that are not properly handled by available genome assemblers. In the course of my master's degree project I developed kFish, a bioinformatic tool that performs alignment-free enrichment of GRC-linked barcodes from a 10x Genomics Chromium linked-read DNA library. kFish uses an iterative approach in which the k-mer content of a set of GRC-linked sequences is compared with that of the reads belonging to each individual 10x Genomics barcode. This comparison allows kFish to identify likely GRC-linked barcodes and then use only the reads from these barcodes when assembling the GRC. First benchmarking results, generated using five GRC-linked genes from the zebra finch as reference sequences, show that kFish is capable of assembling not only already known GRC-linked sequences but also new ones with high confidence. kFish can do all of this in a matter of hours, using only a few gigabytes of system memory, whereas previous efforts took over two days to assemble the zebra finch genome and identify GRC-linked scaffolds using an approach based on read mapping. High-quality genome assemblies and annotations are the foundation of modern genomics research, and their absence greatly limits the breadth of the questions that can be answered. There is still a lot that we do not understand about GRCs, and part of this is due to the lack of high-quality GRC assemblies and annotations. Producing such an assembly will likely require an integrated approach in which multiple sequencing technologies and cutting-edge bioinformatic tools such as kFish are combined to produce a high-quality assembly, which will be crucial for unravelling the mystery of GRC function and evolutionary history.
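
The iterative barcode enrichment described above can be illustrated with a minimal Python sketch. This is not the kFish implementation: the function names (`kmers`, `enrich_barcodes`), the scoring rule (fraction of a barcode's k-mers found in the bait set), the threshold and the toy sequences are all assumptions made for illustration. A real tool would also need to handle reverse complements, sequencing errors and much larger data.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq (forward strand only, for simplicity)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def enrich_barcodes(reference_seqs, reads_by_barcode, k=4,
                    min_shared_fraction=0.5, max_iterations=10):
    """Iteratively select barcodes whose reads share k-mers with a bait set.

    Hypothetical scoring rule: a barcode is selected when at least
    min_shared_fraction of its distinct k-mers occur in the bait set.
    """
    # The bait set starts from the known GRC-linked reference sequences.
    bait = set()
    for seq in reference_seqs:
        bait |= kmers(seq, k)

    # Precompute the k-mer content of the reads behind each barcode.
    barcode_kmers = {}
    for barcode, reads in reads_by_barcode.items():
        content = set()
        for read in reads:
            content |= kmers(read, k)
        barcode_kmers[barcode] = content

    selected = set()
    for _ in range(max_iterations):
        newly_selected = set()
        for barcode, content in barcode_kmers.items():
            if barcode in selected or not content:
                continue
            shared_fraction = len(content & bait) / len(content)
            if shared_fraction >= min_shared_fraction:
                newly_selected.add(barcode)
        if not newly_selected:
            break  # no new barcodes recruited, so the selection has converged
        selected |= newly_selected
        # Grow the bait set with k-mers from the newly recruited barcodes,
        # so that neighbouring sequence can be reached in the next round.
        for barcode in newly_selected:
            bait |= barcode_kmers[barcode]
    return selected


# Toy data (invented): BC1 overlaps the reference, BC2 overlaps BC1,
# and BC3 is unrelated.
reference = ["ACGTACGTAC"]
reads = {
    "BC1": ["CGTACGTT"],
    "BC2": ["ACGTTGG"],
    "BC3": ["TTTTTTTT", "GGGGCCCC"],
}
one_round = enrich_barcodes(reference, reads, max_iterations=1)
converged = enrich_barcodes(reference, reads)
print(sorted(one_round), sorted(converged))
```

In this toy example BC2 is only recruited in the second round, after the k-mers of BC1 have been added to the bait set, which is the reason for iterating rather than comparing each barcode against the reference once. In practice the reads of the selected barcodes would then be passed to an assembler.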
