Matthew Stephens - PHASE software for haplotype estimation

PHASE: A program for reconstructing haplotypes from population data

Most recent version information:

The most recent version of PHASE is v2.1.1, which features improved mixing for larger data sets compared with v2.1, as well as performing some (very) rudimentary checking of the input file format. (For many data sets v2.1 and v2.1.1 will give very similar answers, but for some data sets the results from v2.1.1 can be considerably better.)

The most recent major release was v2.1, which introduced more flexible models for recombination (including recombination hotspots) and fixed a couple of bugs in v2.0.2.

Description

PHASE v 2.1 is a program implementing the method for reconstructing haplotypes from population data, described in

[1] Stephens, M., Smith, N., and Donnelly, P. (2001). A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68:978-989.

[2] Stephens, M., and Donnelly, P. (2003). A comparison of Bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 73:1162-1169.

[3] Stephens, M., and Scheet, P. (2005). Accounting for Decay of Linkage Disequilibrium in Haplotype Inference and Missing-Data Imputation. American Journal of Human Genetics, 76:449-462.

The software also incorporates methods for estimating recombination rates, and identifying recombination hotspots, as described in

[4] Li, N., and Stephens, M. (2003). Modelling Linkage Disequilibrium, and identifying recombination hotspots using SNP data. Genetics, 165:2213-2233.

[5] Crawford et al. (2004). Evidence for substantial fine-scale variation in recombination rates across the human genome. Nature Genetics, 36:700-706.

How to cite this software

Please see user instructions for how to cite the software.

In all cases, please specify the version of the software used, and any deviations from the default options.

Software download

PHASE is available under the following open source license.

I distribute executables of version 2.1.1 for Linux and Microsoft Windows. Contributed executables may be available for other platforms (see below). Source code (C++) is also available (below) for those who wish to compile it on other operating systems. If you do manage to compile the program successfully on another platform, and wish to contribute an executable for others to use, please email me the executable, and I will add it to this website.

Executables for v2.1.1

Linux (version 2.1.1) (compiled with g++): phase.2.1.1.linux.tar.gz
To install the software on a Linux system, download the Linux executable by clicking on the link above, and unzip it and untar it by typing
$ gunzip phase.2.1.1.linux.tar.gz
$ tar -xvf phase.2.1.1.linux.tar
This will create a new directory, phase.2.1.1.linux. Change to that directory, and examine the file instruct2.1.pdf for further instructions.
Microsoft Windows (version 2.1.1). (compiled with MinGW): PHASE.exe
Save this executable file in a suitable place on your computer (e.g. by right-clicking on the link above). You will also need to download the "supplementary files" below - particularly the instructions! To run the program you will need to open an MS-DOS window (e.g. choose Start-Run from the start menu, then type COMMAND at the prompt and click OK). Sorry these instructions are so brief! If you can't manage it, you may need to ask a resident MS-Windows expert to help you.
Solaris (version 2.1.1) : (contributed by Frank Dudbridge) PHASE.gz
Mac OS X 10.6 or earlier (version 2.1.1) (contributed by Matt Neville): PHASE2.1.1.tar.Z
Mac OS X 10.7 or higher (version 2.1.1) (contributed by Erick Castelli): PHASE2.1.1.Lion.tar.gz
See below for supplementary files and instructions in pdf format.

Source Code:

PHASE source code is available here.

Supplementary files

All these files are supplied with the Linux executable, but may not be supplied with other executables, so here they are for separate download if you need them. (You should be able to save these files on your computer by right-clicking on each of them, one at a time.)

Instructions (pdf).

test.inp.

test.casecontrol.inp.

test2.inp.

eg.known.

eg.delta.

Results and simulated data files from the paper

The following resources are provided for the convenience of researchers who wish to do comparisons on the data sets used to produce figures 2 and 3 in the paper [1].

Summary of results from Figures 2 and 3

Here is a text file of the results shown in Figures 2 and 3 in the paper.

Input Files

Here is a .zip archive containing files of simulated data used for the manuscript by Stephens, Smith and Donnelly. (If you are using UNIX you should be able to extract the files using unzip truthfiles.zip.)

The archive contains files with names of the following forms:

truth.k1.t4.rX.Y: these files contain the "short sequence" data simulated using Richard Hudson's program (from Figure 2). X is 4Nr where r is the recombination rate across the region, and Y is the number of haplotypes (=2*the number of individuals). Each file contains 100 data sets, one after another with no blank lines. Each data set consists of Y lines. Lines 1 and 2 give the haplotypes in the first individual, lines 3 and 4 give the haplotypes for the second individual, etc. Only segregating sites are shown, and the two alleles at each site are labelled 0 and 1 (0 is the wild type, but the software does not use this information).
truth.hw, truth.nhw: similar to the above, but for the betaglobin data, under Hardy-Weinberg Equilibrium (HWE) (.hw) and not under HWE (.nhw). (I do not remember how many data sets each contains, but only the first 20 in each were used in the results.)
msArXt8.Z.truth: these contain the microsatellite data simulated using a program provided by P. Fearnhead. X=4Nr (=R in the manuscript), and Z is the number of individuals (10, 20, 30, 40 or 50) in the sample. The 100 data sets are split into 2 files (A=1,2), each containing 50 data sets. The alleles at each locus are represented by ASCII characters: to convert into repeats you will need to take the ASCII value of each character (sorry about this - it was so that each allele can be represented by a single character).
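
The layout described above (Y lines per data set, consecutive pairs of lines forming one individual) can be read with a few lines of Python. This is only a rough sketch, not part of PHASE: the function names are made up for illustration, and it assumes that each line holds one haplotype as a run of single characters with no separators and that the files have no header lines. It also assumes the raw ASCII value is what is wanted for the microsatellite alleles; any offset needed to turn that value into a repeat count is not specified here.

```python
from typing import List, Tuple

# One individual = the two haplotypes given on consecutive lines.
Individual = Tuple[str, str]


def split_datasets(lines: List[str], n_haplotypes: int) -> List[List[Individual]]:
    """Group lines into data sets of n_haplotypes (= Y) lines each,
    then pair lines (1,2), (3,4), ... into individuals.

    Assumption: one haplotype per line, no header lines.
    """
    if n_haplotypes <= 0 or n_haplotypes % 2 != 0:
        raise ValueError("Y must be a positive even number")
    # Drop only line endings; skip empty lines defensively.
    rows = [ln.rstrip("\r\n") for ln in lines]
    rows = [r for r in rows if r]
    if len(rows) % n_haplotypes != 0:
        raise ValueError("number of lines is not a multiple of Y")
    datasets = []
    for start in range(0, len(rows), n_haplotypes):
        block = rows[start:start + n_haplotypes]
        individuals = [(block[i], block[i + 1])
                       for i in range(0, n_haplotypes, 2)]
        datasets.append(individuals)
    return datasets


def alleles_to_ascii(haplotype: str) -> List[int]:
    """Return the ASCII value of each single-character allele
    (assumption: no offset is applied)."""
    return [ord(ch) for ch in haplotype]
```

For a file such as truth.k1.t4.rX.Y, one would pass the lines of the file together with the value of Y taken from the file name, for example `split_datasets(open(path).readlines(), Y)`.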

Results Files

Here is a .zip archive containing files of results from our method. The names of the files are similar to those described above: hopefully it is obvious which results correspond to which datasets. Each row of each file contains the results for a single data set. Columns 32, 33, and 37 contain the important quantities: column 32 is the number of ambiguous individuals in the data set, column 33 is the number of these that our method got wrong (so the average over data sets of column 33 divided by column 32 is the "error rate" from our paper), and column 37 contains the I_F score for our method. Rows that contain mostly NAs correspond to data sets with more possible haplotypes than our implementation of EM could cope with - the results from these data sets were ignored in our analyses.
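
As a rough illustration of how these columns might be summarised, the Python sketch below computes the per-data-set ratio of individuals got wrong to ambiguous individuals and averages it, along with the mean I_F score. It is not the code used for the paper. It assumes the files are whitespace-delimited, that the column numbers are 1-based, and that a row should be skipped whenever any of the three columns is not numeric (a simplification of "mostly NAs") or when the data set has no ambiguous individuals; the function names are hypothetical.

```python
from statistics import mean
from typing import Iterable, Optional, Tuple

# 1-based column numbers as given in the text (1-based is an assumption).
COL_AMBIGUOUS = 32
COL_WRONG = 33
COL_IF = 37


def parse_row(line: str) -> Optional[Tuple[float, float, float]]:
    """Return (n_ambiguous, n_wrong, I_F) for one row, or None if unusable."""
    fields = line.split()
    if len(fields) < COL_IF:
        return None
    try:
        n_amb = float(fields[COL_AMBIGUOUS - 1])
        n_wrong = float(fields[COL_WRONG - 1])
        i_f = float(fields[COL_IF - 1])
    except ValueError:
        # Non-numeric entry such as "NA": skip the row (assumption).
        return None
    if n_amb <= 0:
        # Ratio undefined without ambiguous individuals (assumption: skip).
        return None
    return n_amb, n_wrong, i_f


def summarise(lines: Iterable[str]) -> Tuple[float, float, int]:
    """Return (mean error rate, mean I_F, number of rows used)."""
    rows = [r for r in (parse_row(ln) for ln in lines) if r is not None]
    if not rows:
        raise ValueError("no usable rows")
    error_rate = mean(n_wrong / n_amb for n_amb, n_wrong, _ in rows)
    mean_if = mean(i_f for _, _, i_f in rows)
    return error_rate, mean_if, len(rows)
```

Calling `summarise(open(path))` on one results file would then give a single error rate and mean I_F score for that file, under the assumptions stated above.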
