Matthew Stephens - PHASE software for haplotype estimation

PHASE: Frequently Asked Questions

Q: How long should I run PHASE?

Q: Can I use PHASE for X chromosome data?

Q: How does PHASE perform its test of association between haplotypes and case-control status?

Q: When using the -f2 input format, why does PHASE use the wrong character for the minor allele?

Q: I can't unzip the PHASE.gz file. Why not?

Q: I'm having trouble running the Mac OSX version. It doesn't seem to be executable.

Q: What is the difference between the best pair in the _pairs file, and the best pair in the main output file?

This page is under construction. Please send me any requests for FAQs to include!

Q: How long should I run PHASE?

This depends on what you are trying to estimate and on the details of your data set. In general, you should run PHASE several times (e.g. 5) on each data set, using a different seed for the random number generator each time (-S option), and check the results for consistency. Consistency here does not mean that every haplotype estimate is exactly the same in all runs; the question is whether the conclusions of your analysis would differ substantively across runs.

My limited experience thus far is that the default run lengths suffice for many problems if you are interested only in haplotype estimates. However, they are often too short to give reliable estimates of the recombination rate in the _recom file. (This may seem contradictory, but the accuracy of haplotype estimates appears to be relatively robust to the estimated value of the recombination rate.) If you are interested in accurate recombination rate estimates, you may need to do longer runs (e.g. using the -X10 or -X100 option) to get consistent results across runs.
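As a rough illustration of the "several runs with different seeds" advice, the sketch below only builds the command lines for five runs; it does not run PHASE. The file names, the seed values, the attached spelling of the seed option (-S101, by analogy with -X10), and the "options, input file, output file" argument order are all assumptions made for the example and should be checked against the PHASE documentation.

```python
def build_phase_commands(input_file, output_prefix, seeds, run_length_factor=None):
    """Build one PHASE command line (as an argument list) per seed.

    Assumptions (not taken from the FAQ): options are written with the
    value attached (-S101, -X10) and precede the input and output names.
    """
    commands = []
    for seed in seeds:
        args = ["PHASE", f"-S{seed}"]
        if run_length_factor is not None:
            # Longer runs, e.g. -X10 or -X100, for recombination rate estimates.
            args.append(f"-X{run_length_factor}")
        # A separate output name per seed keeps the runs from overwriting each other.
        args += [input_file, f"{output_prefix}_seed{seed}"]
        commands.append(args)
    return commands


# Hypothetical input file and seeds, chosen only for illustration.
commands = build_phase_commands(
    "mydata.inp", "mydata", seeds=[101, 202, 303, 404, 505], run_length_factor=10
)
for args in commands:
    print(" ".join(args))
```

Each printed line can be run by hand or passed to subprocess.run; the outputs of the five runs are then compared for consistency as described above.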

Q: Can I use PHASE for X chromosome data?

Yes. It's not as easy as I'd like, but you can do it. The easiest way is to randomly pair the male haplotypes into "individuals" and then specify these individuals as having "known" phase using the -k option. If you have an odd number of males, enter one of them as having one allele missing at each site. (You could enter all males like this, but it would considerably increase the computation, with no advantage that I can see.) Then enter the females as usual, but specify them as having unknown phase. PHASE will use the known male haplotypes to help infer the female haplotypes.
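The random pairing step could be sketched as follows. This is only an illustration of the idea: the function name, the use of "?" as the missing-allele code, and the toy haplotypes are assumptions, and writing the pairs into a PHASE input file (with a known-phase file for -k) is not shown.

```python
import random

MISSING = "?"  # assumed missing-allele code; check the PHASE documentation


def pair_male_haplotypes(male_haplotypes, seed=0):
    """Randomly pair male X haplotypes into pseudo-individuals.

    With an odd number of males, the leftover male is paired with a
    haplotype that is missing at every site.
    """
    rng = random.Random(seed)
    shuffled = list(male_haplotypes)
    rng.shuffle(shuffled)
    if len(shuffled) % 2 == 1:
        shuffled.append(MISSING * len(shuffled[0]))
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled), 2)]


# Toy data: three males typed at four sites.
males = ["0101", "1100", "0011"]
pairs = pair_male_haplotypes(males, seed=1)
for first, second in pairs:
    print(first, second)
```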

Q: How does PHASE perform its test of association between haplotypes and case-control status?

The test can be thought of as a permutation-based likelihood ratio test. Assume for convenience that the haplotypes of all individuals were observed, rather than estimated, and consider the likelihood ratio (LR)

$$\mathrm{LR} = \frac{\Pr(\text{case haplotype data}) \, \Pr(\text{control haplotype data})}{\Pr(\text{case and control haplotype data combined})}$$

where the denominator represents the probability of the haplotype data computed under the null hypothesis that they came from a single homogeneous group, and the numerator represents the probability of the haplotype data under the alternative hypothesis that the case and control groups differ. Large values of LR are thus evidence in favour of the alternative hypothesis.

The standard approach is to condition on the estimated haplotype frequencies in computing these probabilities. However, this approach is well known to have little power when there are many infrequent haplotypes. To avoid this problem we take an alternative approach, in which the probabilities in LR are computed using the PAC model from Li and Stephens (2003). This model takes into account the similarity of the observed haplotypes, and has the property that LR will be large if haplotypes in the case group are more similar to one another than to haplotypes in the control group. As a result, the approach retains power even when all haplotypes in the sample are different.

To allow for uncertainty in the haplotype estimates, we compute the average value of LR over many plausible estimates of the haplotypes. Finally, to assess the significance of the resulting value of LR, we compute LR in the same way for different permutations of the case-control labels. The significance probability reported in the _signif file is the proportion of permutations (including the identity permutation, which corresponds to the true case-control labels) that give an average value of LR greater than or equal to the value obtained for the true case-control labels. Note that the smallest p-value attainable by this procedure is 1 divided by the number of permutations specified (default 1/100).

The following example, while artificial, illustrates the rationale behind our test. Consider the following sets of case and control haplotypes (with the two alleles at each locus denoted 0/1).

Case haplotypes    Control haplotypes
101000000001       010111110001
101000000010       010111110010
101000000011       010111110011
101000000100       010111110100
101000000101       010111110101
101000000110       010111110110
101000000111       010111110111
101000001000       010111111000
101000001001       010111111001
101000001010       010111111010
101000001011       010111111011
101000001100       010111111100
101000001101       010111111101
101000001110       010111111110
101000001111       010111111111

Although it is clear that the case and control haplotypes do not come from a single homogeneous group, a conventional likelihood-ratio test would not detect this, because every haplotype is distinct. In contrast, our approach will detect the fact that the case haplotypes are more similar to one another than to the control haplotypes.

There are other ways of dealing with this problem, but most of them involve either making ad hoc decisions about how to cluster haplotypes into groups, or discarding low-frequency sites to reduce the number of observed haplotypes, which may reduce power.
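The permutation logic can be illustrated on the example haplotypes above with a small sketch. It is not PHASE's implementation: in place of the PAC-model likelihood ratio averaged over haplotype estimates, it uses a simple stand-in statistic (mean between-group Hamming distance minus mean within-group Hamming distance), and it treats the haplotypes as observed. All function and variable names are made up for the example.

```python
import random


def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def similarity_statistic(haplotypes, is_case):
    """Stand-in for LR: mean between-group minus mean within-group distance."""
    within, between = [], []
    n = len(haplotypes)
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming(haplotypes[i], haplotypes[j])
            (within if is_case[i] == is_case[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_p_value(haplotypes, is_case, n_permutations=100, seed=0):
    """Proportion of permutations, identity included, with statistic >= observed."""
    rng = random.Random(seed)
    observed = similarity_statistic(haplotypes, is_case)
    at_least_as_large = 1  # the identity permutation always counts
    labels = list(is_case)
    for _ in range(n_permutations - 1):
        rng.shuffle(labels)
        if similarity_statistic(haplotypes, labels) >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_permutations


# The example haplotypes: a fixed 8-site prefix per group, then 0001..1111.
cases = ["10100000" + format(i, "04b") for i in range(1, 16)]
controls = ["01011111" + format(i, "04b") for i in range(1, 16)]
haplotypes = cases + controls
is_case = [True] * 15 + [False] * 15

all_distinct = len(set(haplotypes)) == len(haplotypes)
p_value = permutation_p_value(haplotypes, is_case)
print(all_distinct, p_value)
```

Every haplotype is distinct, so haplotype frequencies carry no information about group membership, yet the similarity-based statistic separates the groups cleanly. Because the identity permutation is counted, the result can never be smaller than 1 divided by the number of permutations.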

Q: When using the -f2 input format, why does PHASE use the wrong character for the minor allele?

With the -f2 input format, when there are no minor allele homozygotes at a site, PHASE has no way of knowing what the minor allele is. So it adopts the convention of using the character following the major allele alphabetically for the minor allele. For example, if the major allele is A then in its output PHASE uses B to represent the minor allele. Similarly, if the major allele is G then PHASE uses H for the minor allele.
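The convention amounts to taking the next letter of the alphabet. The snippet below only illustrates that rule as described here; the function name is made up and this is not PHASE's own code.

```python
def placeholder_minor_allele(major_allele):
    """Character following the major allele alphabetically, e.g. A -> B."""
    return chr(ord(major_allele) + 1)


for major in ["A", "C", "G", "T"]:
    print(major, "->", placeholder_minor_allele(major))
```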

Q: I can't unzip the PHASE.gz file. Why not?

Some browsers seem to unzip this file automatically when downloading it. Try renaming the file to PHASE, making it executable (e.g. chmod a+x PHASE), and running it.

Q: I'm having trouble running the Mac OSX version. It doesn't seem to be executable.

(answer provided by M Neville) The "PHASE.hqx" file seems to have lost its executable permission. In a Terminal window, in the directory containing the program, type

ls -al

You should see something like -rwxr-xr-x if the program is executable. If instead you see -rw-r--r-- (as I did for the file extracted from PHASE.hqx), then the file is not executable. To remedy this, type

chmod +x PHASE

This should change the permissions; run ls -al again to check.

In summary: 1) make sure you are using Mac OSX and are working in a Terminal window in the same directory as the PHASE program; 2) make sure the file is extracted, then check that its file permissions are intact.

Q: What is the difference between the best pair in the _pairs file, and the best pair in the main output file?

The _pairs output gives you the most probable pair for each individual. This is the guess that maximises the probability of that pair being correct. In contrast, the main output file (BESTPAIRS section) tries to minimise the expected number of differences between the true haplotypes and the guess. Minimising the expected number of differences and maximising the probability of getting the guess exactly correct are not the same things, and so the two guesses can be different (although they are typically rather similar). Which is "better" is difficult to say, and context-dependent.
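A toy example shows how the two criteria can disagree. The posterior probabilities below are invented, the count of "differences" between two haplotype pairs (allele mismatches under the better of the two possible matchings) is an assumption about what is being counted, and the guess is restricted to the listed candidate pairs; none of this is taken from PHASE itself.

```python
def hamming(a, b):
    """Number of sites at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def pair_distance(p, q):
    """Allele differences between two unordered haplotype pairs."""
    straight = hamming(p[0], q[0]) + hamming(p[1], q[1])
    crossed = hamming(p[0], q[1]) + hamming(p[1], q[0])
    return min(straight, crossed)


# Invented posterior for one individual who is heterozygous at four sites.
posterior = {
    ("0000", "1111"): 0.4,
    ("0111", "1000"): 0.3,
    ("0011", "1100"): 0.3,
}


def expected_differences(guess):
    """Posterior expected number of differences if `guess` is reported."""
    return sum(prob * pair_distance(guess, truth) for truth, prob in posterior.items())


most_probable = max(posterior, key=posterior.get)
min_expected = min(posterior, key=expected_differences)

print("most probable pair:", most_probable)
print("minimum expected differences:", min_expected)
```

Here the most probable pair is 0000/1111, but reporting 0111/1000 gives a smaller expected number of differences, because it is close to both of the other plausible pairs.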
