The aim of imputation-based approaches to association mapping is to allow genetic variants that were not actually typed in an association study to be tested for association. This is possible only because such untyped variants are often correlated, in a known way, with one or more typed variants. Testing imputed variants can increase power to detect associations, particularly when imputation is used to combine data from multiple studies that use different genotyping platforms: imputation greatly facilitates such meta-analyses, since it overcomes the hurdle that the different studies typed different variants by allowing the same set of variants to be tested in all studies.
The basic idea behind imputation is to learn about patterns of correlation from a reference panel of densely-genotyped individuals (e.g., the HapMap), and then to use this knowledge to build a way of predicting ("imputing") unmeasured genotypes that are correlated with measured genotypes. We have developed two different approaches to performing this prediction, both based on Hidden Markov Models (Li and Stephens, Scheet and Stephens). We have also worked on methods for testing these imputed variants, taking account of the uncertainty in the predicted genotypes (Servin and Stephens, Guan and Stephens). Somewhat surprisingly, it turns out that satisfactory results can often be obtained by simply replacing each imputed genotype distribution with a point estimate (the posterior mean) and then performing a simple regression-based analysis of phenotype on the resulting imputed genotypes. This is helpful as it considerably reduces the computational cost of testing the millions of imputed variants. See Guan and Stephens for a demonstration that this approach produces similar results to a more principled approach that averages over the full distribution of imputed genotypes.
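As an illustration of the posterior-mean shortcut, here is a minimal sketch in Python, with made-up genotype probabilities and phenotypes (real analyses involve millions of variants, covariates, and careful quality control):

```python
import math

def posterior_mean_dosage(probs):
    """Collapse a genotype probability triple P(g=0), P(g=1), P(g=2)
    into a single expected allele count ("dosage")."""
    p0, p1, p2 = probs
    return p1 + 2.0 * p2

def single_snp_regression(dosages, phenotypes):
    """Ordinary least-squares slope and its standard error for a
    phenotype regressed on imputed dosages at one variant."""
    n = len(dosages)
    mx = sum(dosages) / n
    my = sum(phenotypes) / n
    sxx = sum((x - mx) ** 2 for x in dosages)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx
    resid = [y - my - beta * (x - mx) for x, y in zip(dosages, phenotypes)]
    sigma2 = sum(r * r for r in resid) / (n - 2)
    se = math.sqrt(sigma2 / sxx)
    return beta, se

# Hypothetical imputed genotype probabilities for 6 individuals.
probs = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.1), (0.0, 0.2, 0.8),
         (0.7, 0.3, 0.0), (0.05, 0.15, 0.8), (0.2, 0.6, 0.2)]
dosages = [posterior_mean_dosage(p) for p in probs]
phenos = [0.3, 1.1, 2.2, 0.5, 1.9, 1.4]
beta, se = single_snp_regression(dosages, phenos)
```

The point of the shortcut is visible in the code: once each probability triple is collapsed to a dosage, testing a variant is just a single-predictor regression rather than an average over all possible genotype configurations.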
In addition to imputation, we are actively developing Bayesian methods for the analysis of association studies, based on sparse variable-selection Bayesian regression. Here our aim is to analyse many genetic variants (perhaps millions) simultaneously, and search for combinations of variants that are associated with the response. This kind of variable selection problem arises frequently in statistics, and many tools exist for tackling it, including sparse regression methods (e.g., LASSO) and Bayesian regression. We believe the Bayesian approach has particular advantages in this setting: notably, it provides not just a single "best" combination of variants, but a measure of how certain we are that each variant is relevant. In the context of genetic association studies this is particularly important, because we are truly interested in identifying the relevant variants (hoping that their identification will yield helpful biological insights), and not only in predicting outcomes. We hope to release software soon implementing methods that can deal with realistic-sized data sets (thousands of individuals typed at hundreds of thousands of markers) with moderate computational resources.
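To make the idea of "a measure of how certain we are that each variant is relevant" concrete, here is a toy illustration of posterior inclusion probabilities. This is not the method or software described above: it is a generic sketch that enumerates every subset of a handful of candidate variants and weights models by exp(-BIC/2), which is feasible only for very small problems, unlike the realistic-sized data sets discussed above.

```python
import itertools, math

def solve(A, b):
    """Gauss-Jordan elimination for small linear systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def rss(X_cols, y):
    """Residual sum of squares for OLS of y on an intercept plus X_cols."""
    n = len(y)
    cols = [[1.0] * n] + list(X_cols)
    p = len(cols)
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(XtX, Xty)
    fitted = [sum(beta[j] * cols[j][i] for j in range(p)) for i in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def inclusion_probs(X, y):
    """Approximate posterior inclusion probability for each variant:
    enumerate all subsets of variants, score each by BIC, and weight
    models by exp(-BIC/2) under a uniform prior over models."""
    n, p = len(y), len(X)
    weights = {}
    for subset in itertools.chain.from_iterable(
            itertools.combinations(range(p), k) for k in range(p + 1)):
        r = rss([X[j] for j in subset], y)
        bic = n * math.log(r / n) + (len(subset) + 1) * math.log(n)
        weights[subset] = math.exp(-0.5 * bic)
    total = sum(weights.values())
    return [sum(w for s, w in weights.items() if j in s) / total
            for j in range(p)]

# Toy data: 12 individuals, 3 candidate variants (rows of X);
# only the first variant truly affects the phenotype.
X = [
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],   # causal variant
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],   # unrelated variant
    [0, 2, 1, 2, 0, 1, 0, 2, 0, 1, 0, 2],   # unrelated variant
]
noise = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2,
         -0.15, 0.1, -0.05, 0.12, -0.08, 0.03]
y = [2 * g + e for g, e in zip(X[0], noise)]
pips = inclusion_probs(X, y)
```

The output is one probability per variant rather than a single selected model: the causal variant receives inclusion probability near one, while the unrelated variants receive low probabilities, which is exactly the kind of uncertainty statement a point-estimate method like the LASSO does not provide.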
One particularly interesting recent development is the use of
high-throughput sequencing to measure gene expression. This new
technology has the potential to yield a more detailed view of the
transcriptome than the micro-arrays that have formed the basis of most
gene expression experiments over the last 15 years or so. In a pilot
project, joint with the Gilad lab
(Marioni et al.) we assessed
the technical reproducibility of this RNA-seq technology, and compared
it with results from Affymetrix arrays. We found that the results
from RNA-seq were generally highly technically reproducible (at least
provided that different samples were sequenced at the same
concentration) and generally concordant with those from the array. We
are currently engaged in projects that aim to exploit the detailed
information that RNA-seq can yield about the expression levels of
individual exons within each transcript, to identify differences in
exon usage (e.g., alternative splicing) among individuals and species.
I have a long-standing interest in statistical methods for understanding population structure. Our first work in this area was joint with Jonathan Pritchard, and resulted in the development of the structure software for analysis of population structure. In brief, structure is a model-based clustering method, which clusters individuals into groups on the basis of their genetic data, while allowing for the fact that some individuals may have ancestors from more than one group ("admixed"). The type of model we used is sometimes referred to as a "Grade of Membership" model in other fields, and has close connections with the "latent Dirichlet allocation" (LDA) model introduced at about the same time in document clustering applications.

More recently we have been working on methods for understanding "continuous" population structure, where individuals' genetic make-up varies more continuously (e.g., geographically) and may not be well described by the more discrete model underlying structure. One tool we have found useful is Principal Component Analysis (PCA). For example, in Novembre et al., we found that the first two Principal Components (PCs) of European data closely mirror a map of Europe, and can therefore be used to infer, with high precision, where in Europe a particular individual's DNA originated. The close correspondence between the PCs and the geographic map is striking, and it may seem surprising that an off-the-shelf method like PCA yields such an elegant result. However, it turns out that there are good mathematical reasons to expect this result (i.e., that the first two PCs recapitulate geography) provided only that genetic similarity decays with distance. These reasons involve fascinating connections between Fourier series and the eigenvectors of circulant and Toeplitz matrices: Novembre and Stephens describes these connections, and gives references to other fields where these observations have been made previously.
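The mathematical fact underlying this can be checked directly: for a symmetric circulant similarity matrix (similarity decaying with distance around a ring, loosely mimicking genetic similarity decaying with geographic distance), sinusoids are exact eigenvectors. A small pure-Python verification, using an illustrative exponential-decay kernel rather than real genetic data:

```python
import math

n = 12

def ring_dist(i, j):
    """Distance between positions i and j around a ring of n points."""
    d = abs(i - j)
    return min(d, n - d)

# Symmetric circulant matrix: entries decay with ring distance.
C = [[math.exp(-ring_dist(i, j)) for j in range(n)] for i in range(n)]

# Candidate eigenvector: a sinusoid of frequency k.
k = 1
v = [math.cos(2 * math.pi * k * j / n) for j in range(n)]

# Its eigenvalue: the discrete Fourier transform of one row of C.
lam = sum(math.exp(-ring_dist(0, m)) * math.cos(2 * math.pi * k * m / n)
          for m in range(n))

# Check C v = lam v to numerical precision.
Cv = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
max_err = max(abs(cv - lam * vi) for cv, vi in zip(Cv, v))
```

Because the leading non-constant eigenvectors of such a matrix are the lowest-frequency sinusoids, the first PCs of data whose similarity decays with position trace out smooth gradients, which is the essence of why PCA recapitulates geography.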
Finally, we have also developed statistical models for continuous
population genetic variation, and applied them to the problem of
determining the geographic origin of DNA from elephant tusks (in
collaboration with the Wasser lab at the University of
Washington). The motivation here is that if we can identify the
geographic origin of illegally-exported ivory seizures then this will
help law-enforcement authorities identify and control potential
hotspots of illegal elephant poaching. By using reference samples of
DNA from known locations across Africa we have estimated a continuous
"map" of the allele frequencies at each location, using spatial
smoothing methods to estimate frequencies at locations where no
reference samples are available. We can then identify the likely
origin of a sample of unknown provenance by comparing its DNA with
these estimated geographic distributions. We have applied this method to
multiple large seizures of ivory to determine their likely origin
(Wasser et al.). A general
finding among the seizures we have analysed so far is that they appear
to come from rather restricted geographic regions. This suggests that
poaching levels in these locations may greatly exceed previous
estimates, and that urgent action may be required to prevent serious
loss of genetic diversity.
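The assignment step can be sketched as follows, assuming hypothetical allele frequencies at a few candidate locations and Hardy-Weinberg genotype probabilities (the real analysis uses a spatially smoothed continuous frequency surface, many markers, and many candidate locations):

```python
import math

# Hypothetical allele frequencies at three candidate locations for
# five markers; in practice these come from the smoothed "map"
# estimated from reference samples of known origin.
freqs = {
    "west":    [0.10, 0.80, 0.30, 0.60, 0.20],
    "central": [0.50, 0.50, 0.50, 0.50, 0.50],
    "east":    [0.90, 0.20, 0.70, 0.40, 0.85],
}

def log_likelihood(genotype, p):
    """Log-probability of a diploid genotype (count of allele "1" at
    each marker: 0, 1 or 2) under Hardy-Weinberg at frequencies p."""
    ll = 0.0
    for g, f in zip(genotype, p):
        if g == 0:
            ll += 2 * math.log(1 - f)
        elif g == 1:
            ll += math.log(2 * f * (1 - f))
        else:
            ll += 2 * math.log(f)
    return ll

def assign(genotype):
    """Candidate location with the highest likelihood for this sample."""
    return max(freqs, key=lambda loc: log_likelihood(genotype, freqs[loc]))

# A sample carrying mostly "east"-typical alleles.
sample = [2, 0, 2, 1, 2]
origin = assign(sample)
```

With a fine grid of candidate locations, the same likelihood comparison yields a full surface over the map, so the uncertainty in the assignment (not just a single best point) can also be reported.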
Much of modern population genetics analysis is based on ideas from coalescent theory, and its extensions to deal with important biological phenomena such as recombination and selection. However, methods of inference based on explicit consideration of coalescent models turn out to be very computationally intensive, and it is helpful to consider more computationally convenient approximations. We have developed and used approximations based on Hidden Markov Models. The idea behind these models is that each haplotype in a population will closely resemble a mosaic of "template haplotypes". In Li and Stephens the templates are the other haplotypes in the sample; in Scheet and Stephens the templates are a smaller number of estimated haplotypes, which can be thought of as a summary of the most common haplotype patterns in the sample. The advantage of the latter approach is that collapsing the templates into a small number brings computational savings; however, this comes at the cost of a (usually small) decrease in accuracy for some applications (see Scheet and Stephens).
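A minimal sketch of such a copying model is below, with illustrative switch and miscopy parameters (the published models tie these to recombination and mutation rates, and condition the templates appropriately):

```python
def emit(observed, copied, theta):
    """Probability of the observed allele given the copied template allele."""
    return 1 - theta if observed == copied else theta

def mosaic_likelihood(obs, templates, r=0.1, theta=0.01):
    """Forward algorithm for a mosaic ("copying") HMM.  The hidden
    state is which template haplotype is being copied at each site;
    a switch to a uniformly chosen template occurs with probability r
    per site, and the copied allele is miscopied with probability
    theta.  Both parameters are illustrative choices here."""
    K = len(templates)
    # Copying starts from a uniformly chosen template.
    fwd = [(1.0 / K) * emit(obs[0], t[0], theta) for t in templates]
    for site in range(1, len(obs)):
        total = sum(fwd)
        # Either keep copying the same template (prob 1-r) or switch
        # to a uniformly chosen one (prob r), then emit the allele.
        fwd = [((1 - r) * fwd[k] + r * total / K)
               * emit(obs[site], templates[k][site], theta)
               for k in range(K)]
    return sum(fwd)

templates = [[0, 0, 1, 1], [1, 1, 0, 0]]
# A haplotype matching one template everywhere is far more probable
# under the model than one matching no template for long stretches.
lik_match = mosaic_likelihood([0, 0, 1, 1], templates)
lik_mismatch = mosaic_likelihood([1, 0, 1, 0], templates)
```

The computational point made above is visible here: the cost per haplotype is linear in the number of templates K, so collapsing many haplotypes into a few templates (as in Scheet and Stephens) directly reduces the work.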
These models lead to natural ways of visualising genetic variation in a population: simply color each sampled haplotype according to which template it most closely resembles. The idea here is that haplotypes that are similar to one another are colored the same color, and represent relatively recent shared evolutionary ancestry. However, because the extent of shared ancestry between two haplotypes changes as one moves along the genome (due to recombination), the colors change too. Intuitively, the more often the colors change, the higher the recombination rate. This intuition is not always reliable, but the model of Li and Stephens does turn out to provide very effective ways to estimate the recombination rate between nearby markers from population data. Applications of these methods have led to several novel insights into the recombination process in humans and chimpanzees, including helping to quantify the frequency of recombination hotspots in genes in humans (Crawford et al.) and showing that a recombination hotspot present in humans appears to be absent from chimpanzees (Ptak et al.).
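The coloring itself corresponds to the most probable state path through such a copying model, which can be computed with the Viterbi algorithm. A self-contained sketch, again with illustrative (not fitted) parameters:

```python
import math

def color_haplotype(obs, templates, r=0.1, theta=0.01):
    """Viterbi path through a mosaic HMM: label each site with the
    index of the template most plausibly being copied there, i.e.
    the haplotype's "coloring".  r (per-site switch probability) and
    theta (miscopy probability) are illustrative values."""
    K, L = len(templates), len(obs)
    stay = math.log((1 - r) + r / K)   # keep copying the same template
    move = math.log(r / K)             # switch to another template
    def log_emit(site, k):
        match = obs[site] == templates[k][site]
        return math.log(1 - theta if match else theta)
    score = [math.log(1.0 / K) + log_emit(0, k) for k in range(K)]
    back = []
    for site in range(1, L):
        best_prev = max(range(K), key=lambda j: score[j])
        new_score, ptr = [], []
        for k in range(K):
            # Best predecessor is either this template (stay) or the
            # overall best-scoring template (switch).
            if score[k] + stay >= score[best_prev] + move:
                s, p = score[k] + stay, k
            else:
                s, p = score[best_prev] + move, best_prev
            new_score.append(s + log_emit(site, k))
            ptr.append(p)
        score = new_score
        back.append(ptr)
    # Trace the best path backwards through the stored pointers.
    k = max(range(K), key=lambda j: score[j])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]

# A haplotype copied from template 0 for four sites, then template 1:
# the inferred coloring switches exactly where the copying switches.
templates = [[0, 0, 0, 0, 1, 1], [1, 1, 1, 1, 0, 0]]
coloring = color_haplotype([0, 0, 0, 0, 0, 0], templates)
```

Counting color changes along such paths is the intuitive recombination-rate signal described above; the model-based estimators integrate over all paths rather than relying on the single best one.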