NOVEL ALGORITHMS TO ESTIMATE GENOME COVERAGE USING HIGH THROUGHPUT SEQUENCING DATA

Date

2014-05

Abstract

Genetic variation can occur in the form of single-base changes called Single Nucleotide Polymorphisms (SNPs) or large-scale structural alterations called Copy Number Variations (CNVs). Identifying and analyzing CNVs is critical to understanding their association with evolution, health, and disease. Over the past decade, advances in DNA sequencing technologies have fuelled the field of genomics and opened new doors for performing Copy Number Analysis (CNA). To perform CNA, millions of short subsequences, or reads, produced by High Throughput Sequencing (HTS) platforms are aligned to one or more reference genome sequences. The alignment process yields the total number of reads aligned to each location in the genome, collectively called read coverage. The focus of this research is to develop novel algorithms that accurately estimate coverage in the presence of DNA repeats and single nucleotide mutations. The copy number distribution of the reads mapped to the reference sequence would ideally follow a Poisson distribution, assuming that the nucleotide sequence of a genome is random and that the sequencing reads come from random locations in the genome. The coverage data, however, exhibit over-dispersion at the extreme ends of the distribution. Repetitive sequences and SNPs contribute to these unexpectedly high coverage frequencies. This dissertation presents novel algorithms to estimate the average coverage using a model based on the Poisson distribution. The model was tested on both simulated and real data with different coverage depths and predicts the actual model parameters with reasonably good accuracy. The proposed approach improves the estimation of average genome coverage, which is central to gene-expression, DNA methylation, and metagenomic studies.
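
The abstract does not spell out the dissertation's algorithms, so the following is only an illustrative sketch of the underlying idea, not the author's method. It assumes per-position depths are Poisson in unique regions and that collapsed repeats inflate the upper tail, and it estimates the mean from histogram counts near the mode using the Poisson identity P(k+1)/P(k) = lambda/(k+1). The function names, the `window` parameter, and the simulation settings are all hypothetical.

```python
import math
import random
from collections import Counter


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value (Knuth's method; adequate for modest lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_depths(lam, n_unique, n_repeat, repeat_factor, seed=0):
    """Simulated per-position depths: unique loci plus collapsed repeats.

    Assumption for illustration: reads from `repeat_factor` genomic copies
    pile onto a single reference locus, multiplying its expected depth.
    """
    rng = random.Random(seed)
    depths = [poisson_sample(lam, rng) for _ in range(n_unique)]
    depths += [poisson_sample(lam * repeat_factor, rng) for _ in range(n_repeat)]
    return depths


def robust_poisson_mean(depths, window=2):
    """Estimate the Poisson mean from histogram bins around the mode only.

    For a Poisson distribution, (k + 1) * P(k + 1) = lam * P(k), so pooling
    (k + 1) * n[k + 1] over n[k] for a few k near the mode estimates lam
    without using the over-dispersed tails.
    """
    hist = Counter(depths)
    mode = max(hist, key=hist.get)
    numerator = denominator = 0.0
    for k in range(max(0, mode - window), mode + window):
        numerator += (k + 1) * hist.get(k + 1, 0)
        denominator += hist.get(k, 0)
    return numerator / denominator


if __name__ == "__main__":
    depths = simulate_depths(lam=10.0, n_unique=20000, n_repeat=1000,
                             repeat_factor=5, seed=1)
    print("naive mean:  ", sum(depths) / len(depths))
    print("mode-based:  ", robust_poisson_mean(depths))
```

In this toy setting the plain arithmetic mean is pulled upward by the repeat loci, while the mode-based estimate stays close to the simulated value of 10. Real coverage data would need a more careful treatment of low-coverage positions and mapping artifacts.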

Keywords

Algorithms, Methylation, Bioinformatics
