MBI workshop

Bio-bib, Planning, Seminars No Comments

In the winter quarter of 2010 (i.e. January-March) I am co-organizing a workshop on Inference in Sequence Evolution at the Mathematical Biosciences Institute, Ohio State University, as part of their Network, Scale and Complexity program. (My co-organizer is Gerton Lunter.)

Here’s the scientific program for the workshop:

For decades, the canonical model of molecular evolution was the “point substitution” process, a.k.a. the finite-state continuous-time Markov chain. This model is very well understood. The state space is small: thanks to the mean-field assumption that every site in a sequence is assumed to evolve independently, one only needs consider 4 states (nucleotides) or 20 (amino acids). Transition probabilities are obtained from the matrix exponential: if x_t is the state of the process at time t, then

P(x_t = j | x_0 = i) = \exp({\bf R}t)_{ij}

where {\bf R} is the matrix of instantaneous substitution rates between nucleotides (or amino acids). Since {\bf R} is finite and small, the exponential can be calculated by eigenvector decomposition. Joint likelihood functions for multiple phylogenetically-related nucleotides can be factorized, and thus marginalized (via dynamic programming algorithms), to yield probability distributions over ancestral genome states.

The high availability of data in the era of large comparative genomics projects demands we move beyond the mean-field approximation for genome evolutionary models. Contemplating the matrix exponential \exp({\bf R}t) for irreducible models whose states are strings and graph labelings over some alphabet—rather than just individual symbols from that alphabet—reveals some interesting mathematical opportunities.

For example, consider a simple extension of the point substitution model, where the rate of substitution of a given nucleotide is allowed to depend on the state of the neighboring nucleotides. Mathematically, this model is similar to the canonical one-dimensional spin glass of theoretical physics (the Ising Model), whose time-dependent behavior is nontrivial. Biologically, such a model is necessary to account for the observed depletion of the dinucleotide CG in metazoan genomes (due to epigenetic methylation and subsequent deamination). An algorithm yielding transition probabilities for this model was recently found by Gerton Lunter (2004): the matrix exponential can, in this case, be expanded as a Taylor series and truncated, leading to a dynamic programming recursion for the approximate transition probabilities. Useful localized and variational approximations have been developed in parallel by Adam Siepel (2004). Yet many variants of the neighbor-dependent substitution process remain unsolved; and, given the importance of the Ising Model to physicists (e.g. as a model for the renormalization group in condensed matter physics), is it not of interest to analyze these processes in more depth?

Insertions and deletions (“indels”) represent another class of sequence mutation that has traditionally eluded phylogenetics. Accurate modeling of such events is vital to a probabilistic solution of that canonical bioinformatics problem: Multiple Sequence Alignment. In 1991, Jeff Thorne and coworkers showed that indels which only insert or delete one nucleotide at a time do not break the “independent sites” assumption, are analogous to a linear birth-death process, and have a matrix exponential whose calculation directly parallels the classic string-alignment algorithm of Needleman and Wunsch. Jotun Hein coined the name “Statistical Alignment” for the sub-genre of the bioinformatics literature that springs from Thorne et al’s work. However, real indels insert or delete many nucleotides in a single event. Matrix exponentials for such processes have not, in general, been solved; though a recent breakthrough describes the generator of the process in terms of the “string transducer”, a finite-state machine familiar to computational linguists. Could the string transducer provide a theoretical and algorithmic framework for the indel problem?

Going beyond substitutions and indels, the mathematics of duplications, rearrangements and inversions has so far eluded any closed-form solution. The most common approach to inference under such models is Markov Chain Monte Carlo (MCMC), pioneered by Sankoff, Hannenhalli & Pevzner, Blanchette, Miklos and others. As for models where the evolutionary rates vary systematically along the length of the sequence (as e.g. if one part of the sequence contains a gene that is under structural selection), or detailed models for the evolution of populations of genomes (rather than individual genomes that are assumed to be typical of the whole population): the field is wide open and (with the onslaught of big comparative genomics projects) the subject timely.

The continuing exponential growth of sequence databases… an emerging scientific interest in reconstructing ancestral DNA in silico (so-called “phylogenomics” and “paleogenomics”)… a thriving research literature in molecular evolution… all these factors contribute to the rising use of sophisticated stochastic evolutionary models in bioinformatics tools.

The theme of this workshop is inference in realistic, difficult models for sequence evolution. Related areas include Statistical Alignment, phylogenomics, paleogenomics, transducers, phylogenetic Hidden Markov Models, trajectory sampling and other Markov Chain Monte Carlo approaches, permutations and rearrangements, and other attempts to model the stochastic processes of genome change.

customize post order — worth a look?

Meta No Comments

6 ways to customize wordpress post order

mimetex plugin

Meta No Comments

Another OS X upgrade bug fixed: Apple’s carefree replacement of my httpd.conf had broken my CGI and therefore my mimeTeX plugin.

Melbourne & Barcelona

Bio-bib, Seminars No Comments

The week before last, I gave a keynote talk at the Systems Biology Workshop at the Victorian AgriBioscience Centre at La Trobe University, Melbourne. Followed up with a couple of meetings at the Bio21 institute to talk about protein crystallography and phospholipids.

Then a weekend with my pal Giles in Sydney (and his housemate James Mathison), back to California for one day, then off to Barcelona for SMBE 2008, talking about xrate (and sneaking in some stuff about transducers too). Lots of very cool bio-oriented talks at that conference.

During all of that, I also somehow managed to get an R01 proposal in for the June 5 deadline.

Now I am back in the US for a couple of weeks before heading off to Hungary for Bayesian Phylogeny at the Rényi Institute, followed by a collaborative tour of the UK.

Yeah it sounds fun, and I guess it is. It would be a lot more fun if I could fly business class ;-) Seriously — the NIH policy that insists on me flying with US air carriers, regardless of “cost, convenience, or personal travel preference”, sucks bigtime.

grad student committees

Bio-bib, Committees No Comments

Signed theses for Peter Huggins, Joseph Dale and Liana Lareau. Sat on Alexandre Bouchard’s qual committee (ancestral reconstruction of biological sequences and languages: v cool).

misc academics

Bio-bib, Committees, Seminars No Comments

Liana Lareau’s thesis defense was fun. Alternate splicing of SR genes, nonsense-mediated decay, ultra-conserved elements: a great story. Kaspar Hansen’s algorithm for resolving isoforms on splice junction arrays sounds very interesting too.

Advising was helped this year by Jason Keller pointing me to the BMES Telebears Peer Advising session that ran Tuesday through Thursday, week 2 of advising period, outside the Terrace Café.

Got some nice gconstructor score plots from Avinash this weekend. Looks like windowed MCMC alignment sampling is starting to work as expected, which is nice. Random walk, same ballpark as windowless sampling, but longer autocorrelations (i.e. poorer mixing).

My ISMB2008 highlights talk on transducers has been accepted. Will be good to get these new results into the talk.

Michael Hoffman’s visit was cool. So was Zhi Lu’s. Should put postdoc visits into the blog, really.

Systems Biology Workshop 2008, Melbourne

Bio-bib, Planning, Seminars No Comments

My talk will be on Tuesday May 27th at 9:30am, at The John Scott Meeting House (Lectures) at La Trobe University, Bundoora, Victoria 3086.

Also, note to self: remember to advertise BioE241 in advance. I’ve been considering rebranding it as “Ancestral Genome Reconstruction” and adding a unit for lab/discussion, but decided this can wait until next year. The main point of doing it this year would be to attract students, and advertising (plus on-topic opening lectures) is an easier way of doing this.

PHP bugs

Meta No Comments

Blog briefly offline, back up now after I fixed some PHP bugs following some help from a kindly soul at the Apple Developer forums.

BioE24

Bio-bib, Seminars No Comments

Speaking in the Bioengineering undergrad seminar this afternoon. General talk on molecular evolution planned.

UCSC seminar

Bio-bib, Seminars No Comments

Seminar at UCSC on Tuesday, noon-1pm, Physical Sciences Building, Rm 305. Title “Discovering new genes and reconstructing old ones: virtual machines for evolutionary bioinformatics of proteins and ncRNA.”

« Previous Entries