Stephens Lab

SOFTWARE

Please also see our Github Page for new software in development from the Stephens lab.
ashr: R package for adaptive shrinkage
BIMBAM: software for Bayesian IMputation-Based Association Mapping
BLIMP: software for Best Linear IMPutation
BRIdGE: Bayesian Regression for Identifying Gene-Environment interactions
CountClust: R package for clustering and visualizing RNA-Seq expression data using Grade of Membership models
EEMS: Estimated Effective Migration Surfaces
eQtlBma: software to detect eQTLs by Bayesian Model Averaging
fastPHASE: software for haplotype reconstruction, and estimating missing genotypes from population data
fastStructure: variational inference of population structure from SNP genotype data
GEMMA: Genome-wide Efficient Mixed Model Association
HOTSPOTTER: software for identifying recombination hotspots from population SNP data
mashr: R package for Empirical Bayes estimation and testing of multiple effects on multiple outcomes/conditions
MethylHMM: inference under hidden Markov models for double-stranded DNA methylation data
msCentipede: inference of transcription factor binding from chromatin accessibility data
mvBIMBAM: software for genetic association analysis of multiple related phenotypes
PHASE: software for haplotype reconstruction, and recombination rate estimation from population data
piMASS: Multi-SNP analysis for Genetic Association Studies, via Bayesian Variable Selection Regression
RSS: Regression with Summary Statistics
SCAT: Smoothed and Continuous AssignmenTs
SFA: Sparse Factor Analysis
smashr: R package for adaptive Gaussian and Poisson signal denoising
Variational inference for Bayesian variable selection
WaveQTL: a wavelet-based approach for genetic association analysis of functional phenotypes

mashr: R package for Empirical Bayes estimation and testing of multiple effects on multiple outcomes/conditions.

The mashr package implements methods to estimate and test many effects in many conditions (or many effects on many outcomes). The methods use Empirical Bayes to estimate patterns of similarity among conditions, and then exploit those patterns to improve the accuracy of effect estimates. See the following paper for details of the model and methods:

S M Urbut, G Wang, M Stephens (2017). Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. bioRxiv doi:10.1101/096552

The R package is available on Github.

smashr: R package for adaptive Gaussian and Poisson signal denoising.

smashr is an R package implementing "adaptive shrinkage" methods for signal denoising applications, including smoothing of Poisson and heteroskedastic Gaussian data. The model and methods are described in the following paper:

Z Xing and M Stephens (2017). Smoothing via Adaptive Shrinkage (smash): denoising Poisson and heteroskedastic Gaussian signals. http://arxiv.org/abs/1605.07787.

The R package is available on Github.

RSS: Regression with Summary Statistics.

The RSS software implements Bayesian large-scale multiple regression methods that can be applied to summary data. RSS is based on the statistical framework introduced in the following paper:

X Zhu and M Stephens (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. To appear in Annals of Applied Statistics.

MATLAB code implementing the methods is available on Github, and an R package is under development.

ashr: R package for adaptive shrinkage.

The ashr ("Adaptive SHrinkage in R") package aims to provide simple, generic, and flexible methods to derive "shrinkage-based" estimates and credible intervals for unknown quantities, given only estimates of those quantities and their corresponding estimated standard errors. The software is based on methods described in the following paper:

M Stephens (2017) False discovery rates: a new deal. Biostatistics 18(2): 275-294.

The R package can be downloaded from CRAN or Github.
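
As a rough illustration of what shrinkage from estimates and standard errors means, the sketch below shrinks each estimate toward zero under a single zero-mean normal prior. This is a deliberate simplification and not the ashr method or its API: the function name `shrink_normal_prior`, the fixed `prior_sd`, and the choice of a normal prior are all assumptions made for the example (ashr itself is an R package).

```python
import math


def shrink_normal_prior(estimates, std_errors, prior_sd):
    """Shrink estimates toward zero under an assumed N(0, prior_sd^2) prior.

    Hypothetical helper for illustration only. Each observed estimate is
    treated as Normal(true effect, std_error^2). Returns a list of
    (posterior_mean, lower, upper) tuples, where lower/upper bound an
    approximate 95% credible interval.
    """
    results = []
    for bhat, s in zip(estimates, std_errors):
        # Weight on the data: close to 1 for precise estimates,
        # close to 0 for noisy ones.
        w = prior_sd ** 2 / (prior_sd ** 2 + s ** 2)
        post_mean = w * bhat
        # Posterior variance is w * s^2 in this conjugate normal model.
        post_sd = math.sqrt(w) * s
        results.append((post_mean,
                        post_mean - 1.96 * post_sd,
                        post_mean + 1.96 * post_sd))
    return results


if __name__ == "__main__":
    for mean, lo, hi in shrink_normal_prior([2.0, 2.0], [0.5, 2.0], prior_sd=1.0):
        print(f"posterior mean {mean:.2f}, interval ({lo:.2f}, {hi:.2f})")
```

With a prior standard deviation of 1, an estimate of 2.0 with standard error 0.5 is shrunk to 1.6, while the same estimate with standard error 2.0 is shrunk to 0.4: noisier estimates are pulled more strongly toward zero.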

CountClust: R package for clustering and visualizing RNA-Seq expression data using Grade of Membership models.

CountClust provides functions for fitting grade-of-membership models (also known as admixture models) to cluster RNA-seq gene expression count data, identifying characteristic genes driving cluster memberships, and generating a visual summary of the cluster memberships. The R package is based on the methods described in:

K Dey, C Hsiao, and M Stephens (2017). Clustering RNA-seq expression data using grade of membership models. PLoS Genetics 13(3): e1006599.

The R package can be downloaded from Bioconductor or Github.

EEMS: Estimated Effective Migration Surfaces

EEMS is software for analyzing and visualizing spatial population structure from geo-referenced genetic samples. EEMS uses the concept of effective migration to model the relationship between genetics and geography, and it outputs an "estimated effective migration surface"; that is, a visual representation of population structure that can highlight potential regions of higher-than-average and lower-than-average historic gene flow. The software is based on the methods described in:

D Petkova, J Novembre, and M Stephens (2016). Visualizing spatial population structure with estimated effective migration surfaces. Nature Genetics 48: 94-100.

The software and C++ source code can be downloaded from Github.

msCentipede: inference of transcription factor binding from chromatin accessibility data

msCentipede is an algorithm for accurately inferring transcription factor binding sites using chromatin accessibility data (DNase-seq, ATAC-seq) and is written in Python 2.x and Cython. The hierarchical multiscale model underlying msCentipede identifies factor-bound genomic sites by using patterns in DNA cleavage resulting from the action of nucleases in open chromatin regions (regions typically bound by transcription factors). msCentipede, a generalization of the CENTIPEDE model, accounts for heterogeneity in the DNA cleavage patterns around sites bound by transcription factors.

The software is based on the methods described in:

A Raj, H Shim, Y Gilad, J K Pritchard, and M Stephens (2015). msCentipede: Modeling heterogeneity across genomic sites improves accuracy in the inference of transcription factor binding. PLoS ONE 10(9): e0138030.

The software and Python source code are available on Github.

fastStructure: variational inference of population structure from SNP genotype data

fastStructure is an algorithm for inferring population structure from large SNP genotype data. It is based on a variational Bayesian framework for posterior inference. It is written in Python 2.x. The software is based on the methods described in:

A Raj, M Stephens, and J K Pritchard (2014). fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197(2): 573-589.

The software and Python source code are available on Github.

mvBIMBAM: software for genetic association analysis of multiple related phenotypes.

mvBIMBAM, a version of BIMBAM for multivariate association analysis, implements a Bayesian approach for genetic association analysis of multiple related phenotypes, as described in:

H Shim, et al. (2015) A multivariate genome-wide association analysis of 10 LDL subfractions, and their response to statin treatment, in 1868 Caucasians. PLoS ONE 10(4): e0120758.

M Stephens (2013). A Unified framework for association analysis with multiple related phenotypes. PLoS ONE 8(7): e65245.

See here to get more information and to download the software.

WaveQTL: a wavelet-based approach for genetic association analysis of functional phenotypes

WaveQTL is software implementing a wavelet-based approach for genetic association analysis of functional phenotypes (e.g. sequence data arising from high-throughput sequencing assays), as described in:

H Shim and M Stephens (2015). Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays. Annals of Applied Statistics 9(2): 665-686.

See here to get more information and to download the software.
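
To give a feel for what a wavelet-based representation of a functional phenotype looks like, the sketch below applies an orthonormal Haar transform to a short vector of read counts. The choice of the Haar wavelet, the function name `haar_transform`, and the toy data are assumptions for illustration; this is not WaveQTL's code, and the association testing performed on the resulting coefficients is not shown.

```python
import math


def haar_transform(signal):
    """Orthonormal Haar wavelet transform of a sequence of length 2^J.

    Hypothetical helper for illustration only. Returns a list laid out as
    [scaling coefficient, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients capture local differences at this scale.
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        # Approximation coefficients carry local sums to the next scale.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        # Coarser levels are placed in front of finer ones.
        details = detail + details
    return approx + details


if __name__ == "__main__":
    # Toy read counts along 8 positions of a genomic region (made-up data).
    counts = [3, 5, 4, 4, 10, 12, 2, 0]
    print([round(c, 3) for c in haar_transform(counts)])
```

A flat signal such as `[1, 1, 1, 1]` yields a single nonzero coefficient (the overall level), whereas a local change in the signal shows up in the detail coefficients for the scales and positions it affects.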

eQtlBma: software to detect eQTLs by Bayesian Model Averaging

This C++ package implements the methods described in the article by Flutre et al. The software detects quantitative trait loci for gene expression levels ("eQTLs") jointly in multiple subgroups (e.g. multiple tissues). See here to get more information and to download the software.

MethylHMM: inference under hidden Markov models for double-stranded DNA methylation data

This collection of R and C code implements the hidden Markov models described in Fu et al. (2012). These models estimate several properties of DNA methyltransferases, such as the level of processivity and the preference for hemimethylated CpG dyads over unmethylated ones, from double-stranded binary methylation data. Inference is performed by Markov chain Monte Carlo methods under a Bayesian framework. Click here to download the zip file containing the source code. The zip file also includes several in vivo data sets collected at three genes, FMR1, G6PD and LEP, as well as existing in vitro data sets from the literature.

This software is distributed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. For details, see the LICENSE.txt file included with this software package.

Variational inference for Bayesian variable selection

This is a MATLAB implementation of the variational inference method for Bayesian variable selection described in a forthcoming Bayesian Analysis paper. See here to get more information, and to download the software.

Genome-wide Efficient Mixed Model Association (GEMMA)

GEMMA is software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS). It fits a standard linear mixed model (LMM) to account for population stratification and sample structure for single marker association tests. It fits a Bayesian sparse linear mixed model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating the proportion of variance in phenotypes explained (PVE) by typed genotypes (i.e. chip heritability), predicting phenotypes, and identifying associated markers by jointly modeling all markers while controlling for population structure. It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.

See here for the software.

BRIdGE: Bayesian Regression for Identifying Gene-Environment interactions

BRIdGE implements a Bayesian approach for identifying gene-environment interactions when paired phenotypic measurements are taken under two environmental conditions. This method explicitly considers specific interaction models, while taking into account both sample pairing and the intra-individual correlation of measurements under the two conditions. Details are given in the following publication:

See here for software.

Multi-SNP analysis for Genetic Association Studies, via Bayesian Variable Selection Regression

The software piMASS (Posterior inference using Model Averaging and Subset Selection), written and maintained by Yongtao Guan, implements MCMC-based inference methods for Bayesian variable-selection regression described in Guan and Stephens (2011).

This software was developed to perform multi-SNP association analysis for large (genome-wide) datasets, although it can also be applied to smaller association analysis data (e.g. candidate genes or regions), and in this case it forms an alternative to the multi-SNP association analysis capabilities of BIMBAM (below). It may also be useful for Bayesian variable selection regression in large-scale problems more generally.

BLIMP: Best Linear IMPutation

The software BLIMP (Best Linear IMPutation) is a free package for imputing allele frequencies from pooled or summary-level genetic data. The statistical method implemented in the software is described in Wen and Stephens (2010).

Sparse Factor Analysis (SFA)

This software uses ECME to compute a sparse, low-rank matrix factorization for a given matrix, as described in:

Engelhardt BE, Stephens M (2010) "Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis." PLoS Genetics 6(9):e1001117.

Download C++ code and instructions for SFA 1.0 and further documentation for the SFA model.

BIMBAM: software for Bayesian IMputation-Based Association Mapping

The program BIMBAM implements methods for association mapping, based on those described in:

Servin, B and Stephens, M (2007). Imputation-based analysis of association studies: candidate genes and quantitative traits. PLoS Genetics.

BIMBAM can handle both large association studies (e.g., genome scans) and smaller studies of candidate genes/regions. The software is distributed under the GNU General Public License (GPL). To register and download, go here.

fastPHASE: software for haplotype reconstruction, and estimating missing genotypes from population data

The program fastPHASE implements methods described in

Scheet, P and Stephens, M (2006). A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet

fastPHASE can handle larger datasets than PHASE (e.g., hundreds of thousands of markers in thousands of individuals), but does not provide estimates of recombination rates. Our experiments suggest that its haplotype estimates are slightly less accurate than those from PHASE, but its missing genotype estimates appear to be similar to, or even slightly better than, those from PHASE.

The software is available here.

PHASE: software for haplotype reconstruction, and recombination rate estimation from population data

The program PHASE implements methods for estimating haplotypes from population genotype data described in:

Stephens, M., and Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 73:1162-1169.

Stephens, M., Smith, N., and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68:978-989.

Stephens, M., and Scheet, P. (2005). Accounting for Decay of Linkage Disequilibrium in Haplotype Inference and Missing-Data Imputation. American Journal of Human Genetics, 76:449-462.

The software also incorporates methods for estimating recombination rates and identifying recombination hotspots:

Crawford et al. (2004). Evidence for substantial fine-scale variation in recombination rates across the human genome. Nature Genetics.

To download the software, click here. Instructions for PHASE are included on the download site and are also available here.

SCAT: Smoothed and Continuous AssignmenTs

The program SCAT (Smoothed and Continuous AssignmenTs) implements a Bayesian statistical method for estimating allele frequencies and assigning samples of unknown (or known) origin across a continuous range of locations, based on genotypes collected at distinct sampling locations. In brief, the idea is to assume that allele frequencies vary smoothly across the study region, so allele frequencies at any given location are estimated using observed genotypes at nearby sampling locations, with data at the nearest sampling locations given the greatest weight. Details are given in:

S K Wasser, A M Shedlock, K Comstock, E A Ostrander, B Mutayoba, and M Stephens. Assigning African elephant DNA to geographic region of origin: applications to the ivory trade. Proc Natl Acad Sci U S A, 41:14844-14852, 2004.
SCAT is available here.
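
The smoothing idea can be illustrated with a simple distance-weighted average. The sketch below is not SCAT's Bayesian model: the Gaussian kernel, the `bandwidth` parameter, the function name `smoothed_allele_frequency`, and the toy sampling locations are all assumptions chosen only to show how nearer sampling locations can be given greater weight.

```python
import math


def smoothed_allele_frequency(target, sites, bandwidth=1.0):
    """Kernel-weighted allele frequency estimate at a target location.

    Hypothetical helper for illustration only.
    target: (x, y) coordinates of the location of interest.
    sites: list of ((x, y), allele_count, total_alleles) tuples, one per
           sampling location.
    bandwidth: assumed kernel width, in the same units as the coordinates.
    """
    weighted_count = 0.0
    weighted_total = 0.0
    for (x, y), allele_count, total_alleles in sites:
        dist_sq = (x - target[0]) ** 2 + (y - target[1]) ** 2
        # Gaussian kernel: the weight decays smoothly with distance.
        weight = math.exp(-dist_sq / (2.0 * bandwidth ** 2))
        weighted_count += weight * allele_count
        weighted_total += weight * total_alleles
    if weighted_total == 0.0:
        raise ValueError("no usable data near the target location")
    return weighted_count / weighted_total


if __name__ == "__main__":
    # Two made-up sampling locations with very different allele frequencies.
    sites = [((0.0, 0.0), 18, 20), ((10.0, 0.0), 2, 20)]
    for x in (0.0, 5.0, 10.0):
        freq = smoothed_allele_frequency((x, 0.0), sites, bandwidth=3.0)
        print(f"x = {x:4.1f}: estimated frequency {freq:.3f}")
```

Midway between two equally sized samples the estimate is the average of their frequencies, and it moves toward the frequency of whichever sampling location is closer.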

HOTSPOTTER: software for identifying recombination hotspots from population SNP data
This software by Na Li implements methods from:

N Li and M Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213-2233, 2003.

It is freely available here.
Please direct comments and questions regarding HOTSPOTTER to Na Li, at wuolong SPAMBLOCKER AT gmail.com
