Novel Algorithms for the Analysis and Manipulation of Short Genomic Sequences






The storage, manipulation, and transfer of the large volumes of data produced by high-throughput sequencing instruments are major obstacles to realizing the full potential of this technology. To date, significant effort has been devoted to efficiently compressing these sequencing data sets, which are produced in two main text formats: FASTQ and FASTA. As an alternative to the current practice of storing all data, we contend that only high-quality data need be stored, and we propose several new file formats that refine such data and store it efficiently. The presented file formats are specifically designed to store only high-quality sequencing reads in space-efficient text and binary formats. Additionally, we address the quality and redundancy issues of the genetic reference databases required for a variety of investigations in genomics. The presented modifications of non-alignment-based sequence comparison algorithms make it possible to cluster tens of millions of genomic sequences (genes), one of the key challenges in reducing the redundancy of genomic databases.
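To make the idea of storing only high-quality reads in a compact binary form concrete, the following is a minimal sketch, not the file formats proposed in this work. It assumes Phred+33 quality encoding, a hypothetical mean-quality threshold of 30, and a generic 2-bit-per-base packing; the names `MIN_MEAN_QUALITY`, `is_high_quality`, `pack_2bit`, and `unpack_2bit` are illustrative only.

```python
from statistics import mean

# Hypothetical threshold; the abstract does not state a cutoff.
MIN_MEAN_QUALITY = 30
BASES = "ACGT"
CODES = {base: code for code, base in enumerate(BASES)}


def parse_fastq(lines):
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return mean(ord(c) - offset for c in qual)


def is_high_quality(seq, qual, threshold=MIN_MEAN_QUALITY):
    """Keep non-empty reads with only A/C/G/T and a mean quality >= threshold."""
    return bool(seq) and set(seq) <= set(BASES) and mean_phred(qual) >= threshold


def pack_2bit(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte, zero-padded."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data, length):
    """Inverse of pack_2bit; `length` is the original number of bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


records = [
    "@r1", "ACGTAC", "+", "IIIIII",   # mean quality 40, kept
    "@r2", "ACGNAC", "+", "IIIIII",   # contains N, dropped
    "@r3", "ACGT", "+", "!!!!",       # mean quality 0, dropped
]
kept = [(h, s) for h, s, q in parse_fastq(records) if is_high_quality(s, q)]
packed = [(h, len(s), pack_2bit(s)) for h, s in kept]
```

Under these assumptions, each retained base occupies 2 bits instead of the 8 bits of a text character, and dropping the quality line after filtering removes roughly half of each FASTQ record. The read length has to be stored alongside the packed bytes so that padding in the last byte can be discarded on decoding.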



Genomics, Markov Chain, NGS data compression