The mashr package implements methods to estimate and test many effects in many conditions (or many effects on many outcomes). The package uses empirical Bayes methods to estimate patterns of similarity among conditions, and then exploits those patterns to improve the accuracy of effect estimates. See the following paper for details of the model and methods:
S M Urbut, G Wang, M Stephens (2017). Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. bioRxiv doi:10.1101/096552
The R package is available on GitHub.
smashr is an R package implementing "adaptive shrinkage" methods for signal denoising applications, including smoothing of Poisson and heteroskedastic Gaussian data. The model and methods are described in the following paper:
Z Xing and M Stephens (2017). Smoothing via Adaptive Shrinkage (smash): denoising Poisson and heteroskedastic Gaussian signals. http://arxiv.org/abs/1605.07787.
The R package is available on GitHub.
The RSS software implements Bayesian large-scale multiple regression methods that can be applied to summary data. RSS is based on the statistical framework introduced in the following papers:
X Zhu and M Stephens (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Annals of Applied Statistics 11: 1561-1592.
X Zhu and M Stephens (2018). Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nature Communications 9: 4361.
MATLAB code implementing the methods is available on GitHub, and an R package is under development.
The ashr ("Adaptive SHrinkage in R") package aims to provide simple, generic, and flexible methods to derive "shrinkage-based" estimates and credible intervals for unknown quantities, given only estimates of those quantities and their corresponding estimated standard errors. The software is based on methods described in the following paper:
M Stephens (2017) False discovery rates: a new deal. Biostatistics 18(2): 275-294.
The R package can be downloaded from CRAN or GitHub.
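As a rough illustration of what "shrinkage-based" estimation from estimates and standard errors means, here is a minimal Python sketch. It is not the ashr method: it assumes a single normal prior N(0, tau^2) whose variance is fit by the method of moments, whereas ashr fits a more flexible prior. The function name `shrink_normal_prior` is hypothetical and is not part of the ashr package.

```python
from typing import List, Sequence


def shrink_normal_prior(betahat: Sequence[float], se: Sequence[float]) -> List[float]:
    """Posterior means of effects under an assumed N(0, tau^2) prior.

    Simplified stand-in for adaptive shrinkage: tau^2 is estimated by the
    method of moments, using E[betahat_j^2] = tau^2 + se_j^2.
    """
    if len(betahat) != len(se) or len(betahat) == 0:
        raise ValueError("betahat and se must be non-empty and of equal length")
    if any(s <= 0 for s in se):
        raise ValueError("standard errors must be positive")

    # Moment estimate of the prior variance, truncated at zero.
    tau2 = max(0.0, sum(b * b - s * s for b, s in zip(betahat, se)) / len(betahat))

    # Each estimate is pulled toward zero by the factor tau^2 / (tau^2 + se^2),
    # so noisier estimates (larger se) are shrunk more strongly.
    return [b * tau2 / (tau2 + s * s) for b, s in zip(betahat, se)]
```

For example, `shrink_normal_prior([2.0, -2.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0])` gives a prior variance of 1 and halves every estimate. When the standard errors differ, the less precise estimates are shrunk further.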
CountClust provides functions for fitting grade-of-membership models (also known as admixture models) to cluster RNA-seq gene expression count data, identifying characteristic genes driving cluster memberships, and generating a visual summary of the cluster memberships. The R package is based on the methods described in:
K Dey, C Hsiao, and M Stephens (2017). Clustering RNA-seq expression data using grade of membership models. PLoS Genetics 13(3): e1006599.
The R package can be downloaded from Bioconductor or GitHub.
EEMS is software for analyzing and visualizing spatial population structure from geo-referenced genetic samples. EEMS uses the concept of effective migration to model the relationship between genetics and geography, and it outputs an "estimated effective migration surface"; that is, a visual representation of population structure that can highlight potential regions of higher-than-average and lower-than-average historic gene flow. The software is based on the methods described in:
D Petkova, J Novembre, and M Stephens (2016). Visualizing spatial population structure with estimated effective migration surfaces. Nature Genetics 48: 94-100.
The software and C++ source code can be downloaded from GitHub.
msCentipede is an algorithm for accurately inferring transcription factor binding sites using chromatin accessibility data (DNase-seq, ATAC-seq) and is written in Python 2.x and Cython. The hierarchical multiscale model underlying msCentipede identifies factor-bound genomic sites by using patterns in DNA cleavage resulting from the action of nucleases in open chromatin regions (regions typically bound by transcription factors). msCentipede, a generalization of the CENTIPEDE model, accounts for heterogeneity in the DNA cleavage patterns around sites bound by transcription factors.
The software is based on the methods described in:
A Raj, H Shim, Y Gilad, J K Pritchard, and M Stephens (2015). msCentipede: Modeling heterogeneity across genomic sites improves accuracy in the inference of transcription factor binding. PLoS ONE 10(9): e0138030.
The software and Python source code are available on GitHub.
fastStructure is an algorithm for inferring population structure from large SNP genotype data. It is based on a variational Bayesian framework for posterior inference. It is written in Python 2.x. The software is based on the methods described in:
A Raj, M Stephens, and J K Pritchard (2014). fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197(2): 573-589.
The software and Python source code are available on GitHub.
mvBIMBAM, a version of BIMBAM for multivariate association analysis, implements a Bayesian approach for genetic association analysis of multiple related phenotypes, as described in:
H Shim, et al. (2015) A multivariate genome-wide association analysis of 10 LDL subfractions, and their response to statin treatment, in 1868 Caucasians. PLoS ONE 10(4): e0120758.
M Stephens (2013). A unified framework for association analysis with multiple related phenotypes. PLoS ONE 8(7): e65245.
See here to get more information and to download the software.
WaveQTL is software implementing a wavelet-based approach for genetic association analysis of functional phenotypes (e.g. sequence data arising from high-throughput sequencing assays), as described in:
H Shim and M Stephens (2015). Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays. Annals of Applied Statistics 9(2): 665-686.
See here to get more information and to download the software.
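To make the "wavelet-based" idea concrete, the following Python sketch decomposes a functional phenotype (for example, read counts along a region) into coefficients at multiple scales using an orthonormal Haar transform. This is an illustrative assumption rather than WaveQTL code: the function name `haar_transform` is hypothetical, and the association testing that such an analysis would then perform on the coefficients is not shown.

```python
import math
from typing import List, Sequence


def haar_transform(signal: Sequence[float]) -> List[float]:
    """Orthonormal Haar wavelet transform of a signal of power-of-two length.

    Returns the overall scaling coefficient followed by the detail
    coefficients, ordered from the coarsest scale to the finest.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")

    approx = [float(v) for v in signal]
    details: List[float] = []
    root2 = math.sqrt(2.0)

    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Differences of neighbours capture variation at the current scale.
        level_details = [(a - b) / root2 for a, b in pairs]
        # Scaled sums of neighbours are passed on to the next, coarser scale.
        approx = [(a + b) / root2 for a, b in pairs]
        details = level_details + details  # coarser levels go in front

    return approx + details
```

A constant signal yields zero detail coefficients at every scale, while a localized change shows up in the fine-scale coefficients near its position. Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared signal values.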
This C++ package implements the methods described in the article by Flutre et al. The software detects quantitative trait loci for gene expression levels ("eQTLs") jointly in multiple subgroups (e.g. multiple tissues). See here to get more information and to download the software.
The collection of R and C code implements the hidden Markov models described in Fu et al. (2012). Click here to download the zip file containing the source code. These models estimate several properties, such as the level of processivity and preference for hemimethylated CpG dyads over unmethylated ones, of DNA methyltransferases from double-stranded binary methylation data. The inference is done by Markov chain Monte Carlo methods under a Bayesian framework. The zip file also includes several in vivo data sets collected at three genes, FMR1, G6PD and LEP, as well as existing in vitro data sets in the literature.
This software is distributed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version. For details, see the LICENSE.txt file included with this software package.
This is a MATLAB implementation of the variational inference method for Bayesian variable selection described in a forthcoming Bayesian Analysis paper. See here to get more information, and to download the software.
GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS). It fits a standard linear mixed model (LMM) to account for population stratification and sample structure for single marker association tests. It fits a Bayesian sparse linear mixed model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating the proportion of variance in phenotypes explained (PVE) by typed genotypes (i.e. chip heritability), predicting phenotypes, and identifying associated markers by jointly modeling all markers while controlling for population structure. It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.
See here for the software.
BRIdGE implements a Bayesian approach for identifying gene-environment interactions when paired phenotypic measurements are taken under two environmental conditions. This method explicitly considers specific interaction models, while taking into account both sample pairing and the intra-individual correlation of measurements under the two conditions. Details are given in the following publication:
See here for software.
The software piMASS (Posterior inference using Model Averaging and Subset Selection), written and maintained by Yongtao Guan, implements MCMC-based inference methods for Bayesian variable-selection regression described in Guan and Stephens (2011).
This software was developed to perform multi-SNP association analysis for large (genome-wide) datasets, although it can also be applied to smaller association analysis data (e.g. candidate genes or regions), and in this case it forms an alternative to the multi-SNP association analysis capabilities of BIMBAM (below). It may also be useful for Bayesian variable selection regression in large-scale problems more generally.
The software BLIMP (Best Linear IMPutation) is a free package for imputing allele frequencies from pooled or summary-level genetic data. The statistical method implemented in the software is described in Wen and Stephens (2010).
This software uses ECME to compute a sparse, low-rank matrix factorization for a given matrix, as described in:
Engelhardt BE, Stephens M (2010) "Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis." PLoS Genetics 6(9):e1001117.
Download C++ code and instructions for SFA 1.0 and further documentation for the SFA model.
The program BIMBAM implements methods for association mapping, based on those described in:
Servin, B and Stephens, M (2007). Imputation-based analysis of association studies: candidate genes and quantitative traits. PLoS Genetics, 2007.
BIMBAM can handle both large association studies (e.g., genome scans) and smaller studies of candidate genes/regions.
The software is distributed under the GNU Public License (GPL). To register and download, go here.
The program fastPHASE implements methods described in
Scheet, P and Stephens, M (2006). A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet.
fastPHASE can handle larger data-sets than PHASE (e.g., hundreds of thousands of markers in thousands of individuals), but does not provide estimates of recombination rates. Our experiments suggest that haplotype estimates are slightly less accurate than from PHASE, but missing genotype estimates appear to be similar or even slightly better than PHASE.
The software is available here.
The program PHASE implements methods for estimating haplotypes from population genotype data described in
Stephens, M., and Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 73:1162-1169.
Stephens, M., Smith, N., and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68, 978--989.
Stephens, M., and Scheet, P. (2005). Accounting for Decay of Linkage Disequilibrium in Haplotype Inference and Missing-Data Imputation. American Journal of Human Genetics, 76:449-462.
The software also incorporates methods for estimating recombination rates, and identifying recombination hotspots:
Crawford et al (2004). Evidence for substantial fine-scale variation in recombination rates across the human genome. Nature Genetics.
To download software click here.
Instructions for PHASE are included on the download site and are also available here.
The program SCAT (Smoothed and Continuous AssignmenTs) implements a Bayesian statistical method for estimating allele frequencies and assigning samples of unknown (or known) origin across a continuous range of locations, based on genotypes collected at distinct sampling locations. In brief, the idea is to assume that allele frequencies vary smoothly in the study region, so allele frequencies are estimated at any given location using observed genotypes at near-by sampling locations, with data at the nearest sampling locations being given greatest weight. Details are given in
S K Wasser, A M Shedlock, K Comstock, E A Ostrander, B Mutayoba, and M Stephens. Assigning African elephant DNA to geographic region of origin: applications to the ivory trade. Proc Natl Acad Sci U S A, 41:14844-14852, 2004.
SCAT is available here.
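The smoothing idea described above, in which nearby sampling locations contribute most to the allele frequency estimate at a given point, can be sketched with a simple distance-weighted average. This is only an illustration under stated assumptions and not the Bayesian model that SCAT fits: the Gaussian kernel, the planar coordinates, the `bandwidth` parameter, and the function name `smoothed_allele_frequency` are all hypothetical choices made for this example.

```python
import math
from typing import Sequence, Tuple

# One sampling location: (x, y, count of the allele of interest, total alleles sampled).
Site = Tuple[float, float, int, int]


def smoothed_allele_frequency(
    x: float, y: float, sites: Sequence[Site], bandwidth: float = 1.0
) -> float:
    """Kernel-weighted allele frequency at location (x, y).

    Each site's counts are weighted by a Gaussian kernel of its distance to
    (x, y), so the nearest sampling locations receive the greatest weight.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    weighted_allele = 0.0
    weighted_total = 0.0
    for sx, sy, allele_count, total in sites:
        dist2 = (x - sx) ** 2 + (y - sy) ** 2
        weight = math.exp(-dist2 / (2.0 * bandwidth ** 2))
        weighted_allele += weight * allele_count
        weighted_total += weight * total

    if weighted_total == 0.0:
        raise ValueError("no usable data near the requested location")
    return weighted_allele / weighted_total
```

With two equally sized samples, one fixed for the allele and one lacking it, the estimate at the midpoint is one half, and it moves toward the frequency of whichever site is closer as the target location shifts.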
N Li and M Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4)2213-2233, 2003.
HOTSPOTTER is available free from here.
Please direct comments and questions regarding HOTSPOTTER to Na Li, at wuolong SPAMBLOCKER AT gmail.com