Posterior Predictive Checks of Coalescent Models (P2C2M) is an R package that conducts posterior predictive checks to determine if your data fit the model assumptions of the coalescent analysis you are performing. Currently, P2C2M can be used with *BEAST to help determine if your data violate assumptions of the Multispecies Coalescent (MSC). The following summary statistics can currently be used to compare posterior and posterior predictive distributions: two probability-based statistics based on the coalescent likelihood (COAL and LCWT: Rannala & Yang 2013), the Genealogical Sorting Index (GSI: Cummings et al. 2008), and the Number of Deep Coalescences (NDC: Maddison 1997).
P2C2M can be installed in R through CRAN:
install.packages("P2C2M")
To use P2C2M, the default version of Python must be set to Python 2.7. Users of unix-like operating systems can insure that this requirement is fulfilled by setting the following alias in a command line:
echo 'alias python=python2.7' >> ~/.bashrc
Mandatory and optional dependencies of P2C2M can be installed automatically via two installation scripts that are co-supplied with the package. These scripts were designed for unix-like operating systems and are located in folder /exec. To use these installation scripts, a correct configuration of python2-setuptools is required. Users of unix-like operating systems can insure a correct configuration by setting the following alias in a command line:
echo 'alias python-config=python2-config' >> ~/.bashrc
To execute the R dependency installer, please run the following commands in R:
source('/path_to_P2C2M/exec/P2C2M.installRlibs.R'); p2c2m.install()
To execute the Python dependency installer, please run the following command in a terminal:
python /path_to_P2C2M/exec/P2C2M.installPylibs.py
Special Note for MacOS
Users of the MacOS operating system need to install the dependencies manually. Prior to their installation, please confirm that file ‘/usr/bin/python2-config’ exists in your file system and that it points to the Python 2.7 executable. Please refer to the CRAN Documentation on how to install R packages manually. For the manual installation of Python libraries, please refer to the Python Documentation.
Dependencies:
System:
gcc (>=4.9)
Python (=2.7)
R:
R (>=3.0.0)
ape (>=3.1-4)
ggplot2 (>=1.0.0)
rPython (>=0.0-5)
stringr (>=0.6.2)
phybase (>=1.3.1) - not on CRAN
Rmpi (>=0.6-5) - if using MPI
xtermStyle (>=2.2-4)
Python:
NumPy (>= 1.9.0)
DendroPy (>=3.12.0)
\(*\)BEAST must be run using BEAST version 1.7 or 1.8. In order to execute P2C2M, a user must provide a directory with three different types of input files:
1. a species tree file containing a draw of s generations from the posterior distribution of species trees (must be named species.trees)
2. a gene tree file containing a draw of s generations from the posterior distribution of ultrametric geneologies for each gene (must be named g.trees)
Note: The generations recorded in the species tree file must exactly match those in the gene tree files
3. an XML-formatted file generated by BEAUTi, the input generator of BEAST (no naming requirements, but P2C2M will use this file name to label subsequent ouput files)
A single function is necessary to conduct posterior predictive checks. The function is run as:
p2c2m.complete(path = "/home/user/Desktop/", xml.file = "beast.xml",
descr.stats = "LCWT,NDC", beast.vers = "1.8", single.allele = c("0"),
num.reps = 1, use.mpi = FALSE, save.metadata = FALSE, verbose = FALSE)
The p2c2m.complete function contains the following arguments (default values are shown in the code above):
1. path - the absolute file path to the input directory
2. xml.file - the name of the BEAUTi-generated and XML-formatted input file
3. descr.stats - the name(s) of the summary statistic(s) to be applied. If multiple statistics are specified, they must be separated by commas. A total of four summary statistics are currently available: “COAL”, “LCWT”, “GSI”, and “NDC”. Note: The package to calculate GSI is currently depracated, so the GSI statistic should not be used. We are working to incorporate our own GSI calculation that doesn’t rely on an outside package. A similar issue occurs with the COAL statistic, so it is best to use LCWT and NDC until an updated version of P2C2M is released.
4. beast.vers - the version of BEAST used to perform the species tree inference
5. single.allele - the name of any species that is represented by only a single allele. Calculation of the GSI statistic may prodcuce an error if this argument is not specified correctly. Additionally, this setting is useful when defining an outgroup, because the species so defined does not contribute towards the calculation of the GSI statistic
6. num.reps - the number of simulation replicates to be conducted. Increasing num.reps may improve the detection of model violations
7. use.mpi - specifies if P2C2M utilizes multiple computer CPUs (if such exist on the system) in order to speed up the calculations. Computations are then executed as parallel processes
8. save.metadata - specifies if P2C2M saves the metadata of the analysis to the output
9. verbose - specifies if P2C2M prints status information to the screen
Our example dataset comes from Myotis bats. It contains 27 samples from 7 subspecies and 7 loci. The dataset was run with BEAST version 1.8. Note: This dataset was run with a short markov chain for the purposes of this tutorial. We do not need to worry about single-allele species, as we will not use the GSI statistic.
Let’s run the analysis with 2 summary statistics (LCWT and NDC) and 1 simulation replicate (num.reps = 1)
library(P2C2M)
example_2stats <- p2c2m.complete(path = "path/to/*BEAST/files/",
xml.file = "myotis.xml", descr.stats = "LCWT,NDC",
beast.vers = "1.8", single.allele = c("0"),
num.reps = 1, use.mpi = FALSE, save.metadata = FALSE,
verbose = FALSE)
example_2stats
Or, taking advantage of default values:
library(P2C2M)
example_2stats <- p2c2m.complete(path = "path/to/*BEAST/files/",
xml.file = "myotis.xml", num.reps = 1)
example_2stats
## $results
## $results$alpha0.1
## $results$alpha0.1$perGene
## LCWT[2] NDC[2]
## 681a "197.55 (±31.15) *" "-98.62 (±9.84) *"
## 681b "187.07 (±26.87) *" "-101.55 (±16.08) *"
## 685a "163.22 (±29.24) *" "-94.11 (±16.5) *"
## 734z "168.14 (±28.7) *" "-96.88 (±21.93) *"
## 735b "167.48 (±27.31) *" "-91.98 (±21.62) *"
## 735f "159.88 (±28.24) *" "-91.77 (±16.39) *"
## cytb_re "87.39 (±17.04) *" "-49.4 (±7.66) *"
##
## $results$alpha0.1$acrGenes
## LCWT[2] NDC[2]
## Sum "1130.72 (±160.54) *" "-624.32 (±72.13) *"
## Mean "161.53 (±22.93) *" "-89.19 (±10.3) *"
## Median "165.33 (±33.65) *" "-85.41 (±15.46) *"
## Mode "0 (±0) n.s." "-83.56 (±15.65) *"
## CV "0.21 (±0.29) *" "-0.25 (±0.16) *"
##
##
## $results$alpha0.05
## $results$alpha0.05$perGene
## LCWT[2] NDC[2]
## 681a "197.55 (±31.15) *" "-98.62 (±9.84) *"
## 681b "187.07 (±26.87) *" "-101.55 (±16.08) *"
## 685a "163.22 (±29.24) *" "-94.11 (±16.5) *"
## 734z "168.14 (±28.7) *" "-96.88 (±21.93) *"
## 735b "167.48 (±27.31) *" "-91.98 (±21.62) *"
## 735f "159.88 (±28.24) *" "-91.77 (±16.39) *"
## cytb_re "87.39 (±17.04) *" "-49.4 (±7.66) *"
##
## $results$alpha0.05$acrGenes
## LCWT[2] NDC[2]
## Sum "1130.72 (±160.54) *" "-624.32 (±72.13) *"
## Mean "161.53 (±22.93) *" "-89.19 (±10.3) *"
## Median "165.33 (±33.65) *" "-85.41 (±15.46) *"
## Mode "0 (±0) n.s." "-83.56 (±15.65) *"
## CV "0.21 (±0.29) *" "-0.25 (±0.16) *"
##
##
## $results$alpha0.01
## $results$alpha0.01$perGene
## LCWT[2] NDC[2]
## 681a "197.55 (±31.15) *" "-98.62 (±9.84) *"
## 681b "187.07 (±26.87) *" "-101.55 (±16.08) *"
## 685a "163.22 (±29.24) *" "-94.11 (±16.5) *"
## 734z "168.14 (±28.7) *" "-96.88 (±21.93) *"
## 735b "167.48 (±27.31) *" "-91.98 (±21.62) *"
## 735f "159.88 (±28.24) *" "-91.77 (±16.39) *"
## cytb_re "87.39 (±17.04) *" "-49.4 (±7.66) *"
##
## $results$alpha0.01$acrGenes
## LCWT[2] NDC[2]
## Sum "1130.72 (±160.54) *" "-624.32 (±72.13) *"
## Mean "161.53 (±22.93) *" "-89.19 (±10.3) *"
## Median "165.33 (±33.65) *" "-85.41 (±15.46) *"
## Mode "0 (±0) n.s." "-83.56 (±15.65) *"
## CV "0.21 (±0.29) n.s." "-0.25 (±0.16) *"
##
##
## $results$legend
## [1] "Differences between the posterior and the posterior predictive distributions. Each cell contains the following information in said order: mean, standard deviation, significance level. Codes in square brackets indicate the number of tails. Alpha values are automatically adjusted for the number of tails."
Take a look at the results. You can see there is a column for the loci and for each summary statistic. The summary statistic column values are the difference between the posterior and posterior predictive datasets. \(*\) indicates a significant difference between the two datasets, indicating poor model fit. Significance is assessed at three alpha values: 0.1, 0.05, and 0.01 Below the single locus statistics is a section for analyzing all the loci together. We can see that at each alpha value, the multispecies coalescent model in *BEAST is a poor fit for our data.