MBI workshop
Wednesday June 25, 2008 Bio-bib, Planning, Seminars No CommentsIn the winter quarter of 2010 (i.e. January-March) I am co-organizing a workshop on Inference in Sequence Evolution at the Mathematical Biosciences Institute, Ohio State University, as part of their Network, Scale and Complexity program. (My co-organizer is Gerton Lunter.)
Here’s the scientific program for the workshop:
For decades, the canonical model of molecular evolution was the “point substitution” process, a.k.a. the finite-state continuous-time Markov chain. This model is very well understood. The state space is small: thanks to the mean-field assumption that every site in a sequence is assumed to evolve independently, one only needs consider 4 states (nucleotides) or 20 (amino acids). Transition probabilities are obtained from the matrix exponential: if
is the state of the process at time
, then
where
is the matrix of instantaneous substitution rates between nucleotides (or amino acids). Since
is finite and small, the exponential can be calculated by eigenvector decomposition. Joint likelihood functions for multiple phylogenetically-related nucleotides can be factorized, and thus marginalized (via dynamic programming algorithms), to yield probability distributions over ancestral genome states.
The high availability of data in the era of large comparative genomics projects demands we move beyond the mean-field approximation for genome evolutionary models. Contemplating the matrix exponential
for irreducible models whose states are strings and graph labelings over some alphabet—rather than just individual symbols from that alphabet—reveals some interesting mathematical opportunities.
For example, consider a simple extension of the point substitution model, where the rate of substitution of a given nucleotide is allowed to depend on the state of the neighboring nucleotides. Mathematically, this model is similar to the canonical one-dimensional spin glass of theoretical physics (the Ising Model), whose time-dependent behavior is nontrivial. Biologically, such a model is necessary to account for the observed depletion of the dinucleotide CG in metazoan genomes (due to epigenetic methylation and subsequent deamination). An algorithm yielding transition probabilities for this model was recently found by Gerton Lunter (2004): the matrix exponential can, in this case, be expanded as a Taylor series and truncated, leading to a dynamic programming recursion for the approximate transition probabilities. Useful localized and variational approximations have been developed in parallel by Adam Siepel (2004). Yet many variants of the neighbor-dependent substitution process remain unsolved; and, given the importance of the Ising Model to physicists (e.g. as a model for the renormalization group in condensed matter physics), is it not of interest to analyze these processes in more depth?
Insertions and deletions (“indels”) represent another class of sequence mutation that has traditionally eluded phylogenetics. Accurate modeling of such events is vital to a probabilistic solution of that canonical bioinformatics problem: Multiple Sequence Alignment. In 1991, Jeff Thorne and coworkers showed that indels which only insert or delete one nucleotide at a time do not break the “independent sites” assumption, are analogous to a linear birth-death process, and have a matrix exponential whose calculation directly parallels the classic string-alignment algorithm of Needleman and Wunsch. Jotun Hein coined the name “Statistical Alignment” for the sub-genre of the bioinformatics literature that springs from Thorne et al’s work. However, real indels insert or delete many nucleotides in a single event. Matrix exponentials for such processes have not, in general, been solved; though a recent breakthrough describes the generator of the process in terms of the “string transducer”, a finite-state machine familiar to computational linguists. Could the string transducer provide a theoretical and algorithmic framework for the indel problem?
Going beyond substitutions and indels, the mathematics of duplications, rearrangements and inversions has so far eluded any closed-form solution. The most common approach to inference under such models is Markov Chain Monte Carlo (MCMC), pioneered by Sankoff, Hannenhalli & Pevzner, Blanchette, Miklos and others. As for models where the evolutionary rates vary systematically along the length of the sequence (as e.g. if one part of the sequence contains a gene that is under structural selection), or detailed models for the evolution of populations of genomes (rather than individual genomes that are assumed to be typical of the whole population): the field is wide open and (with the onslaught of big comparative genomics projects) the subject timely.
The continuing exponential growth of sequence databases… an emerging scientific interest in reconstructing ancestral DNA in silico (so-called “phylogenomics” and “paleogenomics”)… a thriving research literature in molecular evolution… all these factors contribute to the rising use of sophisticated stochastic evolutionary models in bioinformatics tools.
The theme of this workshop is inference in realistic, difficult models for sequence evolution. Related areas include Statistical Alignment, phylogenomics, paleogenomics, transducers, phylogenetic Hidden Markov Models, trajectory sampling and other Markov Chain Monte Carlo approaches, permutations and rearrangements, and other attempts to model the stochastic processes of genome change.
