Browsing by Author "Fofanov, Yuriy"


ALGORITHMS AND DATA STRUCTURES TO DETECT ONCOVIRUSES IN HUMAN CANCER USING NEXT GENERATION SEQUENCING DATA
(2012-12) Zhu, Rui 1980-; Fofanov, Yuriy; Widger, William R.; Tsekos, Nikolaos V.
Evidence suggests that human cancer can be induced by viruses. One way to test this hypothesis is to look for viral sequences in the human cancer genome. Next Generation Sequencing (NGS) technology sequences the whole human genome in a short period of time. This opens the door to a systematic analysis of the human genome and a thorough search for oncogenic viral sequences in cancer. However, the huge number of sequencing reads generated by NGS poses a great challenge for the computational part of data analysis in terms of computing speed and memory usage. Data structures such as hashes and trees are widely used to improve the performance of computing algorithms. Here, I describe both data structures that have been developed in our center and compare their performance. The hash outperformed the tree when mapping reads to a small reference sequence database. Subsequently, real human cancer data were analyzed using the hash-based mapper, and different oncoviral sequences were found in different cancers.
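As a rough illustration of what hash-based read mapping against a small reference database can look like, here is a minimal sketch. It is not the mapper described in the dissertation: the seed-and-verify scheme, the function names (`build_kmer_index`, `map_read`), and the toy sequences are all assumptions made for this example.

```python
from collections import defaultdict

def build_kmer_index(references, k):
    """Hash every k-mer of every reference to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def map_read(read, references, index, k, max_mismatches=0):
    """Look up the read's first k-mer, then verify the whole read by counting mismatches."""
    hits = []
    for name, pos in index.get(read[:k], []):
        candidate = references[name][pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # the read would run past the end of this reference
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append((name, pos, mismatches))
    return hits

# Hypothetical toy "viral" references; real databases hold full genomes.
references = {"virus_A": "ACGTACGTTGCA", "virus_B": "TTGCAAGGCCTA"}
index = build_kmer_index(references, k=4)
hits = map_read("TTGCAA", references, index, k=4, max_mismatches=1)
```

Because only the first k-mer is used as a seed, this sketch misses reads whose mismatches fall inside that seed; a practical mapper would try several seeds per read.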
Computational Approaches to Detect Pathogens in the Presence of Complex Backgrounds
(2012-12) Rojas, Mark 1973-; Fofanov, Yuriy; Widger, William R.; Chapman, Barbara M.; Tsekos, Nikolaos V.; Shah, Shishir Kirit
Fast and accurate identification of pathogenic microorganisms in complex clinical and environmental samples is essential for the prevention and treatment of infectious diseases. The most sensitive and accurate detection approaches are based on examining the nucleic acid composition of the sample in order to identify the presence of pathogen DNA and/or RNA. A large spectrum of nucleic acid-based tests (such as PCR, RT-PCR, and oligonucleotide microarrays) is designed to examine a sample for the presence of pre-defined genomic signatures: short pathogen-specific DNA and/or RNA fragments. Identification of such signatures, however, poses significant computational challenges. To be pathogen specific, each signature (or combination of signatures) must be present (conserved) across all strains of the pathogen, must be absent in all other organisms including its close neighbors, and must have assay-specific biochemical and thermodynamic properties, such as binding energy, melting temperature, and nucleotide composition. All available signature design algorithms rely on heuristics and are known to miss cases in which potential signatures are also present (exactly or with a small number of mismatches) in host (human) and/or non-pathogen microorganisms, causing false positive outcomes. An even greater challenge for the design of biochemical platform-specific genomic signatures (probes and primers) is that each type of instrument uses different biochemical protocols to detect signatures, and these protocols also have to be taken into account during the signature design process. To address these challenges, we have developed novel algorithms and data structures able to bring all possible subsequences located in a given pathogen genome into the signature design process. Moreover, the developed algorithms make it possible to incorporate mismatches (insertions, deletions, and substitutions for all positions and combinations) into the design process.
We have also developed the concept of ultra-specific genomic islands: genomic regions in which every subsequence is several mismatches away from the closest subsequence that may appear in a host genome and/or in non-pathogenic near-neighbors of the targeted pathogen. This concept improves the quality and flexibility of the design of biochemical platform-specific detection tests (genomic islands can be used to identify thermodynamically acceptable signatures). The developed approach was successfully used to design a variety of tests for Category A, B, and C pathogens, including the 2009 H1N1 influenza outbreak that originated in Mexico.
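The two signature requirements stated above (conserved across all strains, and several mismatches away from any background subsequence) can be sketched as a brute-force filter. This is only an illustration under simplifying assumptions: it considers substitutions only (Hamming distance), ignores insertions, deletions, and thermodynamic constraints, and the names `candidate_signatures` and `min_distance` are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k subsequences of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hamming(a, b):
    """Number of substituted positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def candidate_signatures(strains, background, k, min_distance):
    """k-mers present in every strain and at least min_distance
    substitutions away from every background k-mer."""
    conserved = set.intersection(*(kmers(s, k) for s in strains))
    background_kmers = set().union(*(kmers(b, k) for b in background))
    return sorted(
        c for c in conserved
        if all(hamming(c, b) >= min_distance for b in background_kmers)
    )

# Toy sequences, invented for the example.
strains = ["AAACCCGGG", "TTACCCGGA"]
background = ["ACCCTTTT"]
signatures = candidate_signatures(strains, background, k=5, min_distance=2)
```

The all-against-all comparison is quadratic in the number of k-mers, which is exactly the cost that specialized data structures are meant to avoid on real genomes.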
Effect of Repeatable Regions on Ability to Estimate Copy Number Variation in Human Genome by High Throughput Sequencing
(2012-12) Golovko, Georgiy 1983-; Fofanov, Yuriy; Widger, William R.; Ordonez, Carlos; Tsekos, Nikolaos V.; Shah, Shishir Kirit
Genomic differences (mutations) in humans are profoundly influenced by their distinction as either germ line (inherited) or somatic (developed over one’s life span). Such mutations can vary from a single nucleotide insertion, deletion, or substitution in a gene to a complete duplication or deletion of a large amount of genomic material, ranging from thousands of nucleotides to an entire chromosome; the latter are referred to as Copy Number Variations (CNVs). While a large number of genomic variations have no significant influence on the overall quality of life, certain types of variations in a human genome, called abnormalities, are known to be associated with genetic disorders including cancer, autism, and schizophrenia, to name a few. Recent advancements in DNA sequencing technologies have made it possible to use High Throughput Sequencing (HTS) to identify and detect CNVs. The focus of this research is the development of computational methods, including data representation, that address the challenges of analyzing high throughput DNA sequence data for quality assessment and copy number variation detection in relatively large genomes (e.g. the human genome). An evolutionary programming approach has been developed to use the set of novel algorithms and data structures introduced in this dissertation for the purpose of efficiently and accurately mapping genomic reads to one or more reference genomes. I have developed computational tools that make it possible to identify the undesirable effects of repetitive regions in the human genome on the ability to identify CNVs, and I propose a novel approach to reduce their influence on genomic analysis.
Novel Algorithms for the Analysis and Manipulation of Short Genomic Sequences
(2014-05) Dobretsberger, Otto 1985-; Pavlidis, Ioannis T.; Fofanov, Yuriy; Tsekos, Nikolaos V.; Widger, William R.; Pâris, Jehan-François
The storage, manipulation, and transfer of the large amounts of data produced by high-throughput sequencing instruments represent major obstacles to realizing the full potential of this promising technology. To date, significant effort has been devoted to efficiently compressing these cumbersome sequencing data sets, which are produced in two main text formats: FASTQ and FASTA. As an alternative to the current standard of storing all data, we contend that only high quality data need be stored and propose several new file formats to effectively refine and efficiently store such data. The presented file formats are specifically designed to store only high quality sequencing reads in space-efficient text and binary formats. Additionally, we address the quality and redundancy issues of genetic reference databases required for a variety of investigations in the field of genomics. The presented modifications of non-alignment-based sequence comparison algorithms address this challenge and make it possible to cluster together tens of millions of genomic sequences (genes), one of the key challenges in reducing the redundancy of genomic databases.
NOVEL ALGORITHMS TO ESTIMATE GENOME COVERAGE USING HIGH THROUGHPUT SEQUENCING DATA
(2014-05) Sharma, Meenakshi 1984-; Pavlidis, Ioannis T.; Fofanov, Yuriy; Chapman, Barbara M.; Tsekos, Nikolaos V.; Widger, William R.
Genetic variation can occur in the form of single base changes called Single Nucleotide Polymorphisms (SNPs) or large-scale structural alterations called Copy Number Variations (CNVs). Identification and analysis of CNVs is critical in understanding their association with evolution, health, and disease. Over the past decade, new advancements in DNA sequencing technologies have fuelled the field of genomics and opened new doors for performing Copy Number Analysis (CNA). To perform CNA, millions of short subsequences, or reads, produced by High Throughput Sequencing (HTS) platforms are aligned to reference genome sequence(s). The sequence alignment process yields the total number of reads aligned to each location in the genome, collectively called read coverage. The focus of this research is to develop novel algorithms to accurately estimate coverage in the presence of DNA repeats and single nucleotide mutations. The copy number distribution of the reads mapped to the reference sequence would ideally follow a Poisson distribution, assuming that the nucleotide sequence of a genome is random and the sequencing reads come from random locations in the genome. The coverage data, however, exhibit over-dispersion in the extreme ends of the distribution. Repeatable sequences and SNPs contribute to these unexpectedly high coverage frequencies. This dissertation presents novel algorithms to estimate the average coverage using a model based on the Poisson distribution. The model was tested on both simulated and real data with different coverage depths and predicts actual model parameters with reasonably good accuracy. The proposed approach improves estimation of average genome coverage, which is central to gene-expression, DNA methylation, and metagenomic studies.
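One simple way to see why a Poisson model helps when the tails of the coverage histogram are distorted: for a Poisson distribution with mean λ, adjacent bins satisfy P(k+1)/P(k) = λ/(k+1), so λ can be estimated from a few bins near the mode without touching the tails. The sketch below uses that identity on synthetic data. It is not the algorithm from the dissertation; `estimate_mean_coverage`, the `window` parameter, and the contamination levels are assumptions chosen for illustration.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """Probability of observing depth k under a Poisson model with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def estimate_mean_coverage(depths, window=2):
    """Average the estimates (k + 1) * n[k + 1] / n[k] over bins around the mode.

    Assumes the mode of the histogram lies in the uncontaminated bulk.
    """
    hist = Counter(depths)
    mode = max(hist, key=hist.get)
    estimates = []
    for k in range(max(mode - window, 0), mode + window):
        if hist[k] > 0 and hist[k + 1] > 0:
            estimates.append((k + 1) * hist[k + 1] / hist[k])
    if not estimates:
        raise ValueError("not enough populated bins around the mode")
    return sum(estimates) / len(estimates)

# Synthetic per-position depths: an idealized Poisson bulk with mean 10,
# plus repeat-like positions at depth 200 and extra uncovered positions.
true_mean = 10.0
depths = []
for k in range(41):
    depths += [k] * round(100_000 * poisson_pmf(k, true_mean))
depths += [200] * 2_000   # collapsed repeats (assumed)
depths += [0] * 3_000     # unmappable positions (assumed)

naive = sum(depths) / len(depths)        # pulled upward by the repeat tail
robust = estimate_mean_coverage(depths)  # stays close to true_mean
```

The plain average is inflated by the high-coverage tail, while the ratio-based estimate depends only on the central bins.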
Novel Alignment Based Clustering Algorithms for Pan Genome Analysis of Bacteria Species
(2016-08) Albayrak, Levent 1981-; Pavlidis, Ioannis T.; Deng, Zhigang; Fofanov, Yuriy; Shah, Shishir Kirit; Tsekos, Nikolaos V.
Understanding the basic rules of bacterial evolution and adaptation is critical in developing new anti-bacterial drugs, in using bacteria in biotechnology applications, and in combating undesired consequences of bacterial presence in industrial and environmental settings, such as corrosion, product spoilage, and degradation. Accumulation of single nucleotide mutations beneficial (or neutral) for bacterial survival is a well-studied mechanism of bacterial adaptation which also reflects the time of species separation from a common ancestor (molecular clock hypothesis). Gene loss or gain due to horizontal gene transfer is another, much more dynamic mechanism of bacterial adaptation. Using these mechanisms, bacteria can acquire new features such as virulence factors, locomotion ability (flagella), and heat or drug resistance. A major functional characteristic of bacterial species is the presence of particular gene sets common to the species (core genome) together with genes that are available to individual genomes or groups of genomes (pan genome). The technical difficulties, however, lie in how one can identify the same genes or gene families in evolutionarily distant organisms: (1) identification of a sequence-similarity threshold; (2) computational complexity of sequence clustering algorithms; and (3) creation of a biologically meaningful cluster topology. In this work, we have developed methods to improve the quality and performance of gene clustering, including heuristics-free, novel sequence alignment algorithms able to cluster a large number of sequences significantly faster than traditional methods (a few days compared to months of computation) and that permit the identification of appropriate similarity thresholds and formation of biologically meaningful cluster topology. The developed algorithms were used to build a “functional similarity” tree of the species reflecting gene composition similarity.
The performed analysis also identified co-appearance and avoidance patterns of genes in bacterial species. We have applied the proposed methods to 22 genomes from Bartonella spp. using 34,060 genes.
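Once genes have been clustered into families, the core genome and the remaining, non-universal genes follow from simple set operations, and a gene-composition similarity between two genomes can be computed from shared families. The sketch below assumes the clustering has already been done; the genome names, family labels, and the use of the Jaccard index as the similarity measure are illustrative assumptions, not the method used in the dissertation.

```python
def core_and_accessory(gene_sets):
    """Split gene families into those found in every genome (core)
    and those found in only some genomes (accessory)."""
    sets = list(gene_sets.values())
    core = set.intersection(*sets)
    accessory = set.union(*sets) - core
    return core, accessory

def jaccard(a, b):
    """Gene-composition similarity: shared families over all families."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical genomes described by the gene families they carry.
genomes = {
    "genome_1": {"famA", "famB", "famC"},
    "genome_2": {"famA", "famB", "famD"},
    "genome_3": {"famA", "famC"},
}
core, accessory = core_and_accessory(genomes)
similarity_12 = jaccard(genomes["genome_1"], genomes["genome_2"])
```

A matrix of such pairwise similarities is the kind of input a hierarchical clustering routine could turn into a tree of genomes.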
The presence of nitrate dramatically changed the predominant microbial community in perchlorate degrading cultures under saline conditions
(BMC Microbiology, 9/7/2014) Stepanov, Victor G.; Xiao, Yeyuan; Tran, Quyen; Rojas, Mark; Willson, Richard C.; Fofanov, Yuriy; Fox, George E.; Roberts, Deborah J.
Background: Perchlorate contamination has been detected in both ground water and drinking water. An attractive treatment option is the use of ion-exchange to remove and concentrate perchlorate in brine. Biological treatment can subsequently remove the perchlorate from the brine. When nitrate is present, it will also be concentrated in the brine and must also be removed by biological treatment. The primary objective was to obtain an in-depth characterization of the microbial populations of two salt-tolerant cultures each of which is capable of metabolizing perchlorate. The cultures were derived from a single ancestral culture and have been maintained in the laboratory for more than 10 years. One culture was fed perchlorate only, while the other was fed both perchlorate and nitrate. Results: A metagenomic characterization was performed using Illumina DNA sequencing technology, and the 16S rDNA of several pure strains isolated from the mixed cultures were sequenced. In the absence of nitrate, members of the Rhodobacteraceae constituted the prevailing taxonomic group. Second in abundance were the Rhodocyclaceae. In the nitrate fed culture, the Rhodobacteraceae are essentially absent. They are replaced by a major expansion of the Rhodocyclaceae and the emergence of the Alteromonadaceae as a significant community member. Gene sequences exhibiting significant homology to known perchlorate and nitrate reduction enzymes were found in both cultures. Conclusions: The structure of the two microbial ecosystems of interest has been established and some representative strains obtained in pure culture. The results illustrate that under favorable conditions a group of organisms can readily dominate an ecosystem and yet be effectively eliminated when their advantage is lost. Almost all known perchlorate-reducing organisms can also effectively reduce nitrate. 
This is certainly not the case for the Rhodobacteraceae that were found to dominate in the absence of nitrate, but effectively disappeared in its presence. This study is significant in that it reveals the existence of a novel group of organisms that play a role in the reduction of perchlorate under saline conditions. These Rhodobacteraceae especially, as well as other organisms present in these communities may be a promising source of unique salt-tolerant enzymes for perchlorate reduction.
The theoretical basis of universal identification systems for bacteria and viruses
(Journal of Biological Physics and Chemistry, 2010-4) Chumakov, Segei; Belapurkar, C.; Putonti, Catherine; Li, B.; Pettitt, BM; Fox, George E.; Willson, Richard C.; Fofanov, Yuriy
It is shown that the presence/absence pattern of 1000 random oligomers of length 12 in a bacterial genome is sufficiently characteristic to readily and unambiguously distinguish any known bacterial genome from any other. Even genomes of extremely closely-related organisms, such as strains of the same species, can be thus distinguished. One evident way to implement this approach in a practical assay is with hybridization arrays. It is envisioned that a single universal array can be readily designed that would allow identification of any bacterium that appears in a database of known patterns. We performed in silico experiments to test this idea. Calculations utilizing 105 publicly-available completely-sequenced microbial genomes allowed us to determine appropriate values of the test oligonucleotide length, n, and the number of probe sequences. Randomly chosen n-mers with a constant G + C content were used to form an in silico array and verify (a) how many n-mers from each genome would hybridize on this chip, and (b) how different the fingerprints of different genomes would be. With the appropriate choice of random oligomer length, the same approach can also be used to identify viral or eukaryotic genomes.
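The presence/absence fingerprint idea can be sketched in a few lines. This is a simplified illustration: hybridization is modeled as exact substring matching on one strand, the constant G + C content constraint on probes is omitted, and the function names and toy sequences are assumptions made for the example.

```python
import random

def random_probes(count, length, seed=0):
    """Draw `count` random oligomers of the given length (G + C content not constrained)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(count)]

def fingerprint(genome, probes):
    """Presence/absence pattern: True where the probe occurs exactly in the genome."""
    return tuple(probe in genome for probe in probes)

def fingerprint_distance(a, b):
    """Number of probes on which two fingerprints disagree."""
    return sum(x != y for x, y in zip(a, b))

# Tiny hand-picked probes and "genomes" so the patterns are easy to follow.
probes = ["ACGT", "TTTT", "GTAC"]
fp_a = fingerprint("ACGTACGT", probes)
fp_b = fingerprint("TTTTGTAC", probes)
distance = fingerprint_distance(fp_a, fp_b)

# An array of the size discussed in the abstract (1000 probes of length 12).
array_probes = random_probes(1000, 12)
```

Two genomes are distinguishable on a given array whenever the distance between their fingerprints is greater than zero.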
Using Computational Analysis of Frequencies and Genomic Locations of 6-8 Nucleotide Long Sequences to Improve Quality of DNA Amplification
(2018-05) Khanipov, Kamil Ildarovich 1992-; Pavlidis, Ioannis T.; Fofanov, Yuriy; Shah, Shishir Kirit; Tsekos, Nikolaos V.
Random primer amplification (RPA) is a technique widely used in a variety of genomic studies. The nonrandom distribution of short sequences across human, animal, and bacterial genomes, however, causes bias which affects the amplification process and downstream analysis. The presented work focuses on computational strategies to explore statistical properties of the frequencies and location distributions of all possible short subsequences (6-8 mers) in human, animal, and bacterial genomes and to use them to guide the primer design process in order to reduce bias in single genome (human/animal) amplification and to perform preferential microbial enrichment in the presence of host DNA.
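To make the idea of frequency-guided primer selection concrete, the sketch below counts short subsequences in a target and a host sequence and ranks them by how much more frequent they are in the target. The ranking score, the pseudocount, and the names `kmer_counts` and `enrichment_ranking` are assumptions for illustration only; the toy example uses 3-mers so the counts can be checked by hand, whereas the work described above considers 6-8 mers and also their genomic locations.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Frequency of every length-k subsequence of seq (overlapping windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def enrichment_ranking(target, host, k, pseudocount=1.0):
    """Rank target k-mers by relative frequency in the target versus the host.

    The pseudocount keeps k-mers that are absent from the host from
    producing a division by zero.
    """
    t_counts, h_counts = kmer_counts(target, k), kmer_counts(host, k)
    t_total = sum(t_counts.values())
    h_total = sum(h_counts.values())
    scores = {}
    for kmer, n in t_counts.items():
        t_freq = n / t_total
        h_freq = (h_counts[kmer] + pseudocount) / (h_total + pseudocount)
        scores[kmer] = t_freq / h_freq
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores, key=lambda m: (-scores[m], m))

# Toy sequences invented for the example.
target = "ACGACGACG"
host = "TTTTTTACG"
ranking = enrichment_ranking(target, host, k=3)
```

Here `ACG` is the most frequent 3-mer in the target but also occurs in the host, so it ranks below `CGA` and `GAC`, which the host lacks.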