Empirical realism of simulated data is more important than the model used to generate it: a reply to Goloboff et al. Palaeontology

provide our only direct insight into the history of

F O S S I L S provide our only direct insight into the history of life and realizing their evolutionary significance invariably requires that they are integrated into a phylogeny with their living and/or fossil relatives.There are many competing approaches to phylogeny estimation and, episodically, debate over their relative efficacy has erupted into controversy, as exemplified by the introduction of cladistics into palaeontology (Hull 1988).While there have been skirmishes on the role of stratigraphy in phylogeny estimation (Fox et al. 1999;Smith 2000;Fisher et al. 2002;Wagner 2002) parsimony has since achieved hegemony despite the introduction and implementation of a model-based approach to the analysis of morphological data (Lewis 2001).Increasingly, over the last few years, palaeontologists have performed parallel phylogenetic analyses using parsimony and model-based approaches, perhaps in a bid to integrate over the uncertainty over which method provides the most credible estimate of inter-specific relationships.Certainly, without knowledge of the true phylogeny it is not possible to reconcile the conflicting results from competing methods.Hence, Wright & Hillis (2014) took a simulation approach, generating thousands of morphology-like datasets on a known tree and then assessing the relative performance of parsimony and the Bayesian implementation of the Mk model in recovering the generating tree from the simulated data.They found that the model-based method performed best.Schooled in parsimony, we were surprised by the findings of Wright & Hillis (2014) and believed that there were aspects of their experimental design that potentially biased their analyses in favour of the probabilistic model; not least that their data were effectively generated using the Mk model.Also, we wanted to assess the performance of alternative parsimony methods, and benchmark their performance with simulated data against empirical matrices.However, even when accounting for these factors, we recovered the same result as Wright & Hillis (2014): the Bayesian implementation of the Mk model outperformed parsimony (O'Reilly et al. 2016).Both ourselves and others have since attempted to explore other variables influencing the estimation of phylogenetic relationships, such as tree symmetry and character design (Puttick et al. 2017a), as well as measures of clade support (Brown et al. 2017;O'Reilly et al. 2017).There are many other variables that have yet to be investigated, including character covariation, the accuracy of branch length estimates, and the impact of non-contemporaneous taxa.However, based on existing simulation approaches and the variables considered to date, the Bayesian implementation of the Mk model continues to perform with greatest accuracy, particularly when datasets are small and levels of homoplasy are high (O'Reilly et al. 2017).
Is parsimony dead?Goloboff et al. (2018) certainly do not think so, calling into question all of our results based principally on the argument that the model of evolution that we used to simulate morphology-like data, is not biologically realistic.We cannot address every point they make, not least since their critique is focused explicitly on what we did not write, rather than what we did write.However, Goloboff et al. (2017Goloboff et al. ( , 2018) ) object particularly to the assumption in our simulating framework of the proportionality of branch lengths among characters, which is clearly an unrealistic expectation of morphological evolution.In this we are agreed; if there were an entirely realistic model of morphological evolution available we would have used it.However, if such a model were available, we could dispense with both parsimony and the Mk model and simply apply this model to derive the true relationships among taxa.

Empirical realism of simulated data
The critique of Goloboff et al. (2018), focused on the biological realism of the model of evolution that we used to simulate morphology-like data, is based on a revisionist perspective.All of our original model choices were informed by the pioneering study of Wright & Hillis (2014); we employed an identical methodology to these authors, where possible, to allow for a direct comparison: we used the same generating tree (Pyron 2011); we used the same number of characters for simulation (350 and 1000 sites; we added 100 character datasets as they are representative of the size of many palaeontological studies); and we used a similar character simulation model but with modifications to violate the Mk model.We used the HKY+G model of molecular evolution on known trees to create datasets that violate assumptions underlying the Mk model.The HKY model generates data with an uneven stationary distribution of state frequencies in our simulations, violating one of the primary assumptions of the Mk model.These nucleotide datasets were converted to binary or multistate morphology-like datasets by reducing the four nucleotide states to purines and pyrimidines (R/Y coding) and recoding them as binary states, or by directly mapping the four nucleotides to integers for multistate characters.We also achieved further model misspecification by drawing a unique rate for each character from a continuous gamma distribution; the Mk model assumes all characters have an equal expected number of changes on individual branches, and the Mk+G model assumes there are n unique rates, where n is the number of discrete gamma categories.To ensure that these simulated datasets were also empirically realistic, we evaluated their overall consistency index (CI), excluding datasets that fell outside the range of CI in a published survey of empirical datasets (Sanderson & Donoghue 1989, 1996) As we stated explicitly in our study, we attempted to obtain two qualities in our simulated data: (1) that the generating model violated the Mk model; and (2) that it achieved our prescribed measure of empirical realism.Our analyses using the Mk model frequently failed to recover the generating tree with precision or accuracy, demonstrating effectively that the simulated datasets are not compatible with this evolutionary model and achieve a suitable level of model misspecification.Goloboff et al. (2017) as it allows for unique rates for each character on each branch, whereas our simulation approach reduces the number of parameters by allowing the expected number of changes on a branch to be shared among all characters, with some proportional augmentation by factors randomly sampled from a gamma distribution.Parsimony and maximum likelihood will achieve identical results if all branches are allowed a unique rate for each character (Tuffley & Steel 1997).However, this no common mechanism model is unwieldy as it employs a huge number of parameters that grows exponentially with dataset size: (2 9 number of taxa À 3) 9 number of characters (Huelsenbeck et al. 2011;Yang 2014).The simulation procedure of Goloboff et al. ( 2017) is comparable to this extremely parameterrich model that sits at the extreme of branch-rate independence.
In reality, a more suitable model of morphological evolution probably exists somewhere on the continuum of potential models separating our simulation framework and that of Goloboff et al. (2017).The idiosyncrasies of morphological evolution mean that it is daunting to construct a single model of discrete character change applicable to all datasets.We possess little, if any, meaningful data regarding the manner in which rates of morphological evolution vary across characters and along the branches of trees.Thus, if we are to assess the performance of the relatively na€ ıve inference frameworks we have available to us, it seems logical to focus instead on the empirical realism of the structure of simulated data itself and not the biological realism of the process that generated it.Similarly, identifying a useful model that separates the simulation procedures of O'Reilly et al. (2017) andGoloboff et al. (2017) is neither straightforward nor necessary to assess the efficacy of the available phylogenetic estimation frameworks.et al. (2017, 2018) conclude both of their papers by observing 'the use of simulated datasets alone cannot solve that [sic] problem of model adequacy; empirical tests of whether morphological data fulfill the crucial assumptions of the model are required as well.'Nevertheless, they emphasize, the benefit of simulation is that it is possible to derive general patterns from statistically significant numbers of replicates.We prefer our own approach to the simulation of a set of morphological matrices through the filter of character consistency since, in our view, the approach taken by Goloboff et al. (2017) yielded datasets with empirically unrealistic distributions of character consistency which were frequently by characters with a high CI; datasets that parsimony analysis will naturally perform well on.Implied Weights Parsimony relies upon a measure of character consistency, and is only likely to reinforce the true tree when homoplasy is low (Kluge 1997;Congreve & Lamsdell 2016).

Goloboff
Goloboff et al. (2018) question how we evaluated their simulated datasets since they did not provide any with their paper (Goloboff et al. 2017); we used the code provided in the supplementary materials of Goloboff et al. (2017) to create simulated datasets and, if the simulation strategy of Goloboff et al. (2017) is effective, our sample of simulated data should be statistically comparable to the data they generated and based their study on.We present the CI profile of characters within datasets simulated using their strategy in Figure 1, comparing the CI profile of empirical datasets surveyed by Goloboff et al.
(2017) for 1000 replicates of 100 characters simulated on an asymmetric tree.Datasets simulated following the approach of Goloboff et al. (2017) always include a significant number of characters with a CI = 1.0 even though they are all comprised of multistate characters.Similarly, the simulated matrices of Goloboff et al. (2017) often under-represent characters with CI < 0.5 relative to the empirical matrices they surveyed.This under representation of low CI characters is particularly obvious in CI bins spanning the range 0.0-0.2,containing the most inconsistent characters.This distribution of per character CI effectively reduces the exposure of the different phylogenetic estimation methods to increasingly inconsistent characters.This bears out the point made in O' Reilly et al. (2017) and it is in this sense that we viewed the simulation strategy of Goloboff et al. (2017) to be biased in favour of parsimony.Goloboff et al. (2017Goloboff et al. ( , 2018) ) ignore the empirical analyses we conducted (O'Reilly et al. 2016(O'Reilly et al. , 2017;;Puttick et al. 2017a) even though model comparison using empirical data is the approach advocated by Goloboff et al. (2017Goloboff et al. ( , 2018)).These analyses show that the predictions based on our simulation data are extendable to empirical datasets.Specifically, smaller datasets achieve lower precision with the Bayesian implementation of the Mk model, and larger datasets show increasing congruence in the recovered topological across all inference methods.We would not expect these predictions to be true if our simulationbased analyses were inherently invalid.et al. (2017, 2018) conflate the issue of method efficacy and model adequacy.Our explicit aim was to evaluate the efficacy of parsimony, and both maximum likelihood and Bayesian approaches to the estimation of phylogeny.At no stage did we attempt to evaluate the adequacy of the Mk model, or its ability to effectively capture the process of morphological evolution.Similarly, at no stage did we argue that either the single parameter Markov model or the manner in which the likelihood of a topology is calculated across a dataset adequately capture the process of morphological change.Indeed, it is widely observed among proponents of statistical phylogenetic inference that the Mk model will require further development if it is to encapsulate the process of morphological change to the maximum afforded by the Markov model framework (e.g.Wright et al. 2016), and the potential for improvement in the Mk model can be viewed as a strength, rather than a weakness.

The future
We and others have made steps towards a simulationbased assessment of phylogenetic methods (Wright & Hillis 2014;O'Reilly et al. 2016O'Reilly et al. , 2017;;Brown et al. 2017;Goloboff et al. 2017;Puttick et al. 2017a, b) (2017,2018) observe, there are other parameters to consider, such as non-contemporaneous terminals, the accuracy of branch length estimates, character coevolution and covariation.We look forward to their exploration in turn.
In the interim, model-based phylogenetic methods appear to perform best when parsimony methods perform most poorly (when datasets are small and exhibit low character consistency) and perform at least as well as parsimony methods when they perform best (when datasets are large and exhibit high character consistency).

Proportion of characters with CI=1 in each matrix
(2017) have already corroborated the empirical realism of the simulated datasets.Thus,Goloboff et al. (2017Goloboff et al. ( , 2018)  )   effectively conflate the need for empirical realism in the model used to generate the data with the efficacy of the methods in analysing the data.Alternative simulation approachesTo simulate data, Goloboff et al. (2018) prefer their own model, in which the rate of change for each character is completely independent on every branch of the tree.Their implicit (Goloboff et al. 2017) and then later explicit (Goloboff et al. 2018) claim that their model is more biologically realistic is no better justified than the Mk model, as neither can be supported with meaningful quantitative empirical evidence.If Goloboff et al. (2018) consider a model in which characters share a set of branch lengths to be biologically unrealistic, they must also accept that the assumptions of their own model are at least equally biologically unrealistic, if not potentially more so.Goloboff et al. (2018) argue that it is not possible to generalize based on the simulation procedure from O'Reilly et al. (2017) and Puttick et al. (2017a); if true, this same argument can be levelled at their own simulation procedure.The Goloboff et al. (2017) simulation model effectively represents an almost polar opposite to the HKY+G simulation procedure of O'Reilly et al.
Comparison of empirical and simulation datasets in terms of the consistency of the component characters.A-B, empirical datasets compiled by Goloboff et al. (2017).C-D, datasets simulated using the strategy of Goloboff et al. (2017).E-F, datasets simulated using the strategy of O'Reilly et al. (2017).A, C, E, the proportion of characters within each dataset that have a consistency index of 1.0.B, D, F, the proportion of characters within each dataset within each of ten consistency index bins.Colour online.
. In O'Reilly et al. (2016), we explored the impact of CI filtering on our results, and in subsequent papers the use of CI filtering became part of the simulation procedure (O'Reilly et al. 2017; Puttick et al. 2017a).
so far considering the impact of tree symmetry (Puttick et al. 2017a) and clade support (Brown et al. 2017; O'Reilly et al. 2017).As Goloboff et al.