Novel Alignment Based Clustering Algorithms for Pan Genome Analysis of Bacteria Species
Understanding the basic rules of bacterial evolution and adaptation is critical for developing new antibacterial drugs, for using bacteria in biotechnology applications, and for combating undesired consequences of bacterial presence in industrial and environmental settings, such as corrosion, product spoilage, and degradation. The accumulation of single-nucleotide mutations that are beneficial (or neutral) for bacterial survival is a well-studied mechanism of bacterial adaptation, which also reflects the time since species separated from a common ancestor (the molecular clock hypothesis). Gene loss or gain due to horizontal gene transfer is another, much more dynamic mechanism of bacterial adaptation. Using these mechanisms, bacteria can acquire new features such as virulence factors, locomotion ability (flagella), and heat or drug resistance. A major functional characteristic of a bacterial species is the presence of particular gene sets common to the species (core genome) together with genes that are available to individual genomes or groups of genomes (pan genome). The technical difficulties, however, lie in identifying the same genes or gene families in evolutionarily distant organisms:
- Identification of a sequence-similarity threshold
- Computational complexity of sequence clustering algorithms
- Creation of a biologically meaningful cluster topology

In this work, we have developed methods to improve the quality and performance of gene clustering, including novel, heuristic-free sequence alignment algorithms that can cluster a large number of sequences significantly faster than traditional methods (a few days compared to months of computation) and that permit the identification of appropriate similarity thresholds and the formation of a biologically meaningful cluster topology. The developed algorithms were used to build a “functional similarity” tree of the species reflecting gene composition similarity. The analysis also identified co-appearance and avoidance patterns of genes in bacterial species. We have applied the proposed methods to 22 genomes from Bartonella spp. using 34,060 genes.
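To make the idea of gene composition similarity concrete, the following minimal Python sketch assumes that clustering has already assigned every gene to a gene family, so that each genome can be represented as a set of family identifiers. The function names, the toy data, and the choice of the Jaccard index as the similarity measure are illustrative assumptions, not the measures or algorithms developed in this work.

```python
from itertools import combinations


def core_and_variable(genomes):
    """Split gene families into those present in every genome (core)
    and those present in only some genomes (variable)."""
    families = list(genomes.values())
    core = set.intersection(*families)
    variable = set.union(*families) - core
    return core, variable


def jaccard(a, b):
    """Jaccard index of two sets of gene-family identifiers.
    Assumption: used here only as a simple stand-in similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_similarity(genomes):
    """Gene composition similarity for every pair of genomes."""
    return {
        (g1, g2): jaccard(genomes[g1], genomes[g2])
        for g1, g2 in combinations(sorted(genomes), 2)
    }


# Hypothetical toy data: genome name -> gene families found in that genome.
toy = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam2", "fam4"},
    "genome_C": {"fam1", "fam3", "fam4", "fam5"},
}

core, variable = core_and_variable(toy)
similarity = pairwise_similarity(toy)
```

A pairwise similarity table of this kind could be converted to distances and passed to a standard hierarchical clustering routine to obtain a tree based on gene composition. Whether that matches the tree construction used here is not stated in the abstract.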
Portions of this document appear in: Kosoy, Michael, Ying Bai, Russell Enscore, Maria Rosales Rizzo, Scott Bender, Vsevolod Popov, Levent Albayrak, Yuriy Fofanov, and Bruno Chomel. "Bartonella melophagi in blood of domestic sheep (Ovis aries) and sheep keds (Melophagus ovinus) from the southwestern US: cultures, genetic characterization, and ecological connections." Veterinary microbiology 190 (2016): 43-49. DOI: 10.1016/j.vetmic.2016.05.009.