Evaluating Machine Learning Approaches for Structural Genomics





Modern molecular biology produces large amounts of data from which it can be difficult to derive useful information. We are investigating correlations between genetic annotations of human DNA and chromosome structural features. Chromatin Immuno-Precipitation Sequencing (ChIP-Seq) data tracks, made available through the ENCODE project, characterize the biochemical nature of chromosomal loci. Chromatin can be categorized into two types, which we call type A and type B, and which we further classify into five chromatin sub-types (A1, A2, B1, B2, and B3). It has previously been shown that these chromatin structural types are strongly related to the overall genome architecture of cells. Machine learning algorithms have proven especially adept at “learning” from correlations in very large data sets. We constructed a number of machine learning models and tested how accurately each identified chromatin sub-types. Our best approach so far is a recurrent neural network, which produced a total error of less than 28% when classifying chromatin sub-types.
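
The abstract does not define "total error". As a minimal sketch, assuming it means the fraction of loci assigned the wrong sub-type, it could be computed as below. The function names and the toy labels are hypothetical and are not taken from the study's data or code.

```python
from collections import Counter

# The five chromatin sub-types named in the abstract.
SUBTYPES = ("A1", "A2", "B1", "B2", "B3")


def total_error(true_labels, predicted_labels):
    """Fraction of loci whose predicted sub-type differs from the true one.

    Assumption: "total error" is the overall misclassification rate.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def per_subtype_error(true_labels, predicted_labels):
    """Misclassification rate for each sub-type present in the true labels."""
    totals = Counter(true_labels)
    wrong = Counter(t for t, p in zip(true_labels, predicted_labels) if t != p)
    return {s: wrong[s] / totals[s] for s in SUBTYPES if totals[s]}


# Made-up labels for eight loci, for illustration only.
true_labels = ["A1", "A1", "A2", "B1", "B2", "B3", "B3", "A2"]
predicted_labels = ["A1", "A2", "A2", "B1", "B2", "B3", "B1", "A2"]

overall = total_error(true_labels, predicted_labels)
by_subtype = per_subtype_error(true_labels, predicted_labels)
print(f"total error: {overall:.2%}")
print(by_subtype)
```

Reporting the per-sub-type rates alongside the overall figure shows whether the errors are concentrated in one sub-type, which a single total cannot reveal.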