Optimization and Application Extension for a Bloom Filter Based Sequence Classifier
Abstract: With the development of sequencing technologies, ever more sequencing reads are generated and used in genomics research, which raises a critical problem: how can these data be processed rapidly and accurately? The Bloom filter, a data structure first developed in 1970, has been increasingly applied in bioinformatics because of its relatively high storage efficiency and fast access speed. FACS [1], an application of the Bloom filter technique, is a rapid and accurate sequence classifier. However, several bottlenecks have restricted its usage; for instance, it supports neither large query files nor fastq format files. This report therefore introduces an improved FACS system, which adds a hashing system for FACS, support for large query files (>2GB) and compressed files, support for fastq files, and a more user-friendly interface, among other changes. Moreover, the new parallelized FACS system (FACS 2.0) is introduced and evaluated to show that, for sequence decontamination, FACS 2.0 is at least 10 times faster than, and as accurate as, the original FACS system, Fastq_screen [7] and Deconseq [8]. Last but not least, the possibility of developing an adapter trimmer based on the FACS system is also analyzed.
Key words: Bloom filter; Decontamination; Adapter trimming; Parallelization; Large query file support (compressed and uncompressed)
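For readers unfamiliar with the technique, the following is a minimal Python sketch of how a Bloom filter can be used to classify reads by k-mer membership. It is an illustration under stated assumptions, not the FACS implementation: the class and function names (BloomFilter, kmers, match_fraction), the SHA-256 based double hashing, the k-mer length and the filter size are all hypothetical choices made for this example.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; not the data structure used inside FACS."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit values
        # taken from a single SHA-256 digest (an assumption of this sketch).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def match_fraction(read, bloom, k):
    """Fraction of the read's k-mers that the filter reports as present."""
    total = len(read) - k + 1
    if total <= 0:
        return 0.0
    hits = sum(1 for km in kmers(read, k) if km in bloom)
    return hits / total


# Build a filter from a toy reference sequence (hypothetical data).
K = 8
reference = "ACGTACGTTAGCCGATAGGCTAACGTTGCA"
bloom = BloomFilter(num_bits=1 << 16, num_hashes=4)
for km in kmers(reference, K):
    bloom.add(km)

# A read taken from the reference matches on every k-mer.
read = reference[3:25]
print(match_fraction(read, bloom, K))
```

A classifier built this way would compare the match fraction against a cutoff to decide whether a read belongs to the reference; the cutoff value and the scoring scheme FACS actually uses are not stated in this abstract, so none is assumed here.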