Empirical distributions of homoplasy in morphological data

Cladistic datasets of morphological characters are comprised of observations that exhibit varying degrees of consistency with underlying phylogenetic hypotheses, reflecting the acquisition and retention of character states (highly consistent characters), or the convergent evolution and loss of character states (less consistent characters). The consistency between phylogenetic history and individual character histories has a bearing both on the evolutionary process and on the relative ease with which phylogenetic history may be inferred from morphological data. We surveyed 486 tetrapod morphological cladistic datasets to establish an empirical distribution of consistency among characters and datasets. Average dataset size has increased in the number of characters and taxa through time. The Consistency Index measure of homoplasy decreases as more characters are added but the most significant decreases result from the addition of taxa. Retention Index and Homoplasy Excess Ratio remain relatively constant with changes in taxa and character number. Our sampling of larger datasets confirms that the positive relationship between dataset size and homoplasy is primarily caused by an increase in taxa, not an increase in characters. Genealogies of cladistic data matrices for early vertebrates, scalidophorans and crocodilians, which have been modified in succession, show a trend of generally consistent quality through research time. Thus, we find no support for the widely shared conjecture that in the search for phylogenetic resolution, high quality phylogenetic characters are quickly exhausted, with subsequent research leading to the inclusion of potentially misleading characters exhibiting high levels of homoplasy.

The reconstruction of evolutionary relationships among species is based on the premise that their shared characters reflect shared ancestry, and that similarity is known as homology. However, taxa can share superficially similar characters that are derived from separate ancestors and homologous characters can be absent as a consequence of loss, both of which contribute to the phenomenon of homoplasy, which can misguide phylogenetic inference. When the proportion of homoplastic characters in a dataset is large, the homologous phylogenetic signal can be difficult to discriminate from homoplasy, resulting in recovery of an incorrect topology (Scotland & Steel 2015) that may, nevertheless, exhibit spuriously high statistical support in measures of phylogenetic fidelity, such as the bootstrap or posterior probability (Philippe et al. 2011).
Cladistic matrices tend to grow in size over research time, but it is unclear whether newly added characters confer more phylogenetic signal than noise. Previous surveys have demonstrated that the proportion of homoplasy within a dataset increases with the number of characters and taxa included (Sanderson & Donoghue 1989; Archie 1996; Hauser & Boyajian 1997), interpreted by some to reflect the rapid exhaustion of highly informative, low homoplasy, phylogenetic characters in the compilation of phylogenetic datasets and, perhaps, the increasing desperation of phylogeneticists seeking characters to resolve otherwise poorly supported clades (Scotland et al. 2003; Nelson 2004; Scotland & Steel 2015).
Homoplasy is not merely a nuisance parameter that plagues the efforts of phylogeneticists, but an interesting evolutionary phenomenon worthy of study in its own right (Hauser & Boyajian 1997; Sanderson & Donoghue 1989, 1996). Recently, the nature and distribution of homoplasy, as well as the efficacy of its measures, have come into sharp focus as a measure of the phylogenetic informativeness of cladistic matrices as researchers have attempted to benchmark the performance of competing methods for the phylogenetic analysis of morphological data (O'Reilly et al. 2016, 2018a, b; Brown et al. 2017; Puttick et al. 2017a, b, 2019; Goloboff et al. 2018a, b; Sansom et al. 2018; Keating et al. 2020). Much of this debate has been conducted within the context of experiments based on simulated datasets, a conventional approach in molecular phylogenetics where empirically realistic data can be simulated based on well-founded theoretical expectations and empirical observations of sequence evolution. In contrast, the evolution of morphology lacks a coherent statistical description that can be modelled. Thus, the level of homoplasy observed in real datasets has been invoked to assay the empirical realism of simulated morphological data. Unfortunately, the most recent detailed surveys of homoplasy in morphological datasets were conducted more than 20 years ago (Sanderson & Donoghue 1989, 1996; Hauser & Boyajian 1997), at a time when datasets were limited in size by computing power and efficiency. Indeed, both the number and size of datasets investigated in these surveys were small by contemporary standards, where matrices comprising hundreds of taxa and characters are not unusual (O'Reilly et al. 2018a). Furthermore, early studies measured homoplasy as a matrix average, rather than as a proportional distribution, which might provide a more precise characterization of empirical data (Goloboff et al. 2018a, b; O'Reilly et al. 2018a). Finally, previous surveys have focused on the Consistency Index (CI) as a measure of homoplasy, while this is just one of several widely used metrics, including the Homoplasy Excess Ratio (HER) (Archie 1989) and the Retention Index (RI) (Kluge & Farris 1969; Farris 1989), which some suggest to be a more accurate characterization of the homoplasy in empirical morphological data (Sansom & Wills 2017).
We undertook a survey of homoplasy in empirical morphological datasets, encompassing the greater number and scale of contemporary studies, using the CI, HER and RI. Our results provide a better understanding of morphological homoplasy and a better characterization of this phenomenon as a target for simulation studies.

MATERIAL AND METHOD
We based our study on two published databases of morphological cladistic datasets (Wright et al. 2016; Sansom et al. 2018), for a total of 486 datasets, excluding a small number of MRP (matrix representation with parsimony) supertree matrices and categorical molecular datasets. These datasets derive ultimately from a database curated by Graeme Lloyd (http://www.graemetlloyd.com/matr.html). The year of publication and the dimensions of these datasets, in terms of numbers of taxa, characters and character states, are summarized in Figure 1. Assumptions were removed from all datasets (e.g. differential character weighting and character state ordering) to eliminate the effects of subjective opinions on the true history of characters or their relative value as phylogenetic markers. All parsimony analyses were implemented in PAUP4a build 166 for Unix/Linux (Swofford 2002). We tested whether the retained datasets conveyed significant clustering information using permutation tail probability (PTP) tests. For the PTP test, 20 000 permutations with heuristic search were used, with one random addition replicate for each search, retaining just one most-parsimonious tree for each permuted dataset.
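The permutation step of the PTP test can be sketched in Python as follows. This is a minimal illustration, not the PAUP implementation used in the study: `tree_length` is a hypothetical callable standing in for the heuristic parsimony search, the function names are assumptions, and the tail probability is taken here as the proportion of datasets (permuted plus original) that yield a tree as short as or shorter than the original.

```python
import random

def permute_characters(matrix, rng):
    """Shuffle the states of each character (column) independently among taxa (rows)."""
    n_taxa = len(matrix)
    n_chars = len(matrix[0])
    columns = []
    for j in range(n_chars):
        column = [matrix[i][j] for i in range(n_taxa)]
        rng.shuffle(column)  # breaks covariation among characters, keeps state frequencies
        columns.append(column)
    return [[columns[j][i] for j in range(n_chars)] for i in range(n_taxa)]

def ptp_test(matrix, tree_length, n_permutations=999, seed=0):
    """Return the observed length, the permuted lengths and the tail probability.

    tree_length is a hypothetical function returning the parsimony length
    found for a matrix (in the study this role is played by PAUP).
    """
    rng = random.Random(seed)
    observed = tree_length(matrix)
    permuted = [tree_length(permute_characters(matrix, rng))
                for _ in range(n_permutations)]
    as_short = sum(1 for length in permuted if length <= observed)
    p_value = (as_short + 1) / (n_permutations + 1)
    return observed, permuted, p_value
```

The list of permuted lengths is returned because their mean is also what a randomization-based measure such as MeanNS would need.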
For all analyses, a most parsimonious tree was inferred for each dataset using a heuristic search, with 10 000 random addition sequences and TBR branch swapping in PAUP. The multree option was turned off to speed up searches. This implies that for each island of multiple most parsimonious trees, only one tree was retained. This strategy is adequate for our study as all trees in an island of most-parsimonious trees will have the same tree length, CI, RI and HER.
Constant and parsimony-uninformative sites were removed prior to analysis, as they can inflate CI and RI values (Sanderson & Donoghue 1989). The reuse of matrices can also bias the calculation of CI and RI values and so the independence of matrices was determined following the approach of Sansom et al. (2018), where matrices with a high proportion of taxa shared with one or more other matrices are removed iteratively until the only remaining matrices have at least 50% unique taxa and on average all matrices have 75% unique taxa; this left 364 independent matrices.
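The removal of constant and parsimony-uninformative characters can be sketched as below. This is an illustration under stated assumptions rather than the filter applied in PAUP: it uses the usual criterion for unordered characters (at least two states, each present in at least two taxa), treats `?` and `-` as missing, and ignores polymorphic scorings; the function names are hypothetical.

```python
from collections import Counter

MISSING = {"?", "-"}  # assumed symbols for missing and inapplicable scores

def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(state for state in column if state not in MISSING)
    shared_states = [state for state, n in counts.items() if n >= 2]
    return len(shared_states) >= 2

def drop_uninformative(matrix):
    """Return the matrix without uninformative characters, plus the kept column indices."""
    n_chars = len(matrix[0])
    keep = [j for j in range(n_chars)
            if is_parsimony_informative([row[j] for row in matrix])]
    return [[row[j] for j in keep] for row in matrix], keep
```

A character scored 0, 0, 0, 1 is variable but uninformative under this criterion, because state 1 is an autapomorphy that adds one step on every tree.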

Measures of homoplasy
We measured homoplasy based on the CI, RI and HER. CI and RI are calculated as follows:

$$\mathrm{CI} = \frac{m}{s}, \qquad \mathrm{RI} = \frac{g - s}{g - m}$$

where m is the minimum number of possible changes, s is the observed number of changes, and g is the maximum number of changes for a tree computed using maximum parsimony (Kluge & Farris 1969; Farris 1989).
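As a minimal sketch of the two indices, assuming the standard definitions CI = m/s and RI = (g - s)/(g - m), the calculation for a single character or for a whole matrix can be written in Python as follows. The function names and the example step counts are illustrative assumptions; the study itself obtained these values from PAUP. For matrix-wide values, m, s and g are summed across characters before the ratios are taken.

```python
from fractions import Fraction

def consistency_index(m, s):
    """CI = m / s, with m the minimum and s the observed number of changes."""
    if m <= 0 or s < m:
        raise ValueError("expected 0 < m <= s")
    return Fraction(m, s)

def retention_index(m, s, g):
    """RI = (g - s) / (g - m), with g the maximum possible number of changes."""
    if not m <= s <= g:
        raise ValueError("expected m <= s <= g")
    if g == m:
        raise ValueError("RI is undefined when g == m (uninformative character)")
    return Fraction(g - s, g - m)

def ensemble_indices(characters):
    """Matrix-wide CI and RI from an iterable of (m, s, g) tuples, one per character."""
    characters = list(characters)
    m = sum(c[0] for c in characters)
    s = sum(c[1] for c in characters)
    g = sum(c[2] for c in characters)
    return consistency_index(m, s), retention_index(m, s, g)

# Hypothetical binary character scored in 10 taxa, 5 with each state:
# m = 1 (a single change is the minimum) and g = 5 (the maximum number of changes).
for s in range(1, 6):
    print(s, consistency_index(1, s), retention_index(1, s, 5))
```

In this example CI falls through 1, 1/2, 1/3, 1/4 and 1/5 as steps are added, while RI falls through 1, 3/4, 1/2, 1/4 and 0, so a single extra step halves CI but costs only a quarter of RI.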
Both indices are scaled from zero to one, where a value of one reflects no homoplasy and the proportion of homoplasy increases as the value gets smaller (i.e. homoplasy increases as the index value decreases). CI and RI can be used to characterize the entire matrix or each character individually. CI is a blunt measure of homoplasy, in binary characters reducing from 1 to 1/2, 1/3, 1/4, 1/5, ... (reflecting increasing homoplasy) with every additional state change beyond the minimum possible number. Thus, characters can have a low CI even if they have a high consistency in one part of the tree but not another. RI reflects the congruence of a character with a topology in a way that better reflects the fact that homoplastic characters can nevertheless be phylogenetically informative. Hence, a phylogenetically informative character with a low CI might still have a high RI. In this sense RI is a less blunt measurement of homoplasy.

FIG. 1. Characteristics of the datasets analysed in this study. The datasets were compiled by previous metanalyses (Wright et al. 2016; Sansom et al. 2018) but filtered to exclude a small number of MRP supertree matrices and a categorical molecular dataset. A, number of datasets relative to their year of publication. B, number of datasets relative to the number of character states used. C, number of taxa within datasets relative to their year of publication. D, number of characters within datasets relative to their year of publication. E, number of characters per taxon within datasets relative to their year of publication. F, number of taxa and characters within datasets.
We calculated HER using the following formula:

$$\mathrm{HER} = \frac{\mathrm{MeanNS} - l}{\mathrm{MeanNS} - l_{\min}}$$

The HER expresses the excess homoplasy in a dataset with reference to the level of homoplasy we expect to observe in random datasets of equivalent size. In the formula above, l represents the parsimony length of the optimal tree; lmin is the theoretical minimal tree length for a dataset with that tree, number of taxa, characters and character states; and MeanNS (mean number of steps) is the average tree length of a set of trees generated from a set of randomizations of the considered dataset. Expressing excess homoplasy with reference to the homoplasy of randomized datasets is an alternative way to explore homoplasy in datasets. HER values, in the great majority of cases, are expected to be bounded between 0 and 1 (with 1 indicating datasets with no homoplasy) but the HER may produce negative values (Farris 1991). Archie (1989) observed that this could occur due to sampling error, and our analyses (below) show that, indeed, negative HER values are associated with datasets that do not convey clustering signal (i.e. that are indistinguishable from random in the PTP test). To estimate HER, we used custom Perl scripts to first use PAUP to calculate, for each dataset, a most parsimonious tree and its length (l), the theoretical minimal tree length (lmin) and to run a PTP test. After that, a second Perl script used the output from the PTP test to estimate MeanNS, and to calculate HER using the formula above. The PTP test was run as described above.
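The final step of this procedure, turning the tree lengths into an HER value, can be sketched in Python. This is an illustrative stand-in for the second Perl script, not a copy of it: it assumes the form HER = (MeanNS - l) / (MeanNS - lmin), which equals 1 when l = lmin and becomes negative when the observed tree is longer than the average randomized tree, and the function name and example lengths are hypothetical.

```python
from statistics import mean

def homoplasy_excess_ratio(l, l_min, permuted_lengths):
    """HER from the observed length, the minimum length and randomized tree lengths.

    permuted_lengths holds the tree lengths of the permuted datasets
    (for example the lengths retained from a PTP test); their mean is MeanNS.
    """
    mean_ns = mean(permuted_lengths)
    if mean_ns == l_min:
        raise ValueError("HER is undefined when MeanNS equals the minimum length")
    return (mean_ns - l) / (mean_ns - l_min)

# Hypothetical lengths: minimum 100 steps, randomized datasets averaging 200 steps.
print(homoplasy_excess_ratio(150, 100, [195, 200, 205]))
```

With these made-up numbers the observed tree sits halfway between the minimum and the randomized expectation.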
We analysed patterns of CI, RI and HER in R (R Core Team 2016). As CI and RI are bounded between 0 and 1, the relationship of CI and RI against the number of characters and taxa was analysed with beta regressions, using the betareg() function in the betareg R package (Cribari-Neto & Zeileis 2010). We also analysed HER with beta regressions and converted negative values to zero prior to analyses. To analyse the within-matrix homoplasy distributions, the per character CI and RI values for each matrix were separated into one of 10 bins between 0 and 1, with the bins increasing in increments of 0.1.
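The binning and the treatment of negative HER values can be sketched in Python (the study used R; the function names here are hypothetical). One assumption needs flagging: the bins are treated as closed on the right, so that a CI of exactly 0.5 falls in the 0.4-0.5 bin, which is how such characters are reported in this paper.

```python
import math

def bin_index(value, n_bins=10):
    """Index (0-based) of the right-closed bin holding a value between 0 and 1."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("index values must lie between 0 and 1")
    # Rounding guards against floating-point noise such as 0.3 * 10 = 3.0000000000000004.
    scaled = round(value * n_bins, 9)
    return max(0, math.ceil(scaled) - 1)

def bin_proportions(values, n_bins=10):
    """Proportion of per-character values falling in each bin."""
    counts = [0] * n_bins
    for value in values:
        counts[bin_index(value, n_bins)] += 1
    return [count / len(values) for count in counts]

def clamp_negative_her(her):
    """Convert negative HER values to zero, as done before the beta regressions."""
    return max(0.0, her)
```

For example, `bin_proportions([1.0, 1.0, 0.5, 0.25])` places half of the characters in the 0.9-1 bin and a quarter each in the 0.4-0.5 and 0.2-0.3 bins.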
To test the hypothesis that character consistency decreases over sampling time we identified three genealogies of matrices that are linked by successive modification based necessarily on our familiarity with the originating studies; this required extending our analyses beyond tetrapods. These matrices were developed to resolve the relationships among early vertebrates (Forey 1995; Janvier 1996; Donoghue et al. 2000; Shu et al. 2003; Gess et al. 2006; Sansom et al. 2010; Conway Morris & Caron 2014; Gabbott et al. 2016), scalidophorans (Wills 1998; Dong et al. 2004, 2005; Harvey et al. 2010; Wills et al. 2012; Ma et al. 2014; Shao et al. 2018) and crocodilians (Brochu 1997, 1999, 2006, 2011; Gold et al. 2014; Lee & Yates 2018). We reduced each of these to a common set of taxa to control for the effects of taxa on character consistency, and inapplicable and polymorphic character states were converted to a missing state. Otherwise, the same protocol described above was applied to characterizing homoplasy in these datasets.
The R and Perl scripts used, metadata on the datasets analysed and the datasets used in the evaluation of matrix genealogies, are available in Murphy et al. (2021).

RESULTS
The relationship between the year of publication and the number of taxa and characters comprising a dataset clearly demonstrates that the average number of characters and taxa has increased over time (Fig. 2A, B); each year the size of the average matrix has increased by 1.069 parsimony-informative characters and 1.059 taxa. This growth in matrices through time is mainly the result of the augmentation of existing matrices, as unique matrices have no significant increase in the total number of characters or taxa through time. Linear regression analysis of the relationship between the number of characters and number of taxa in a matrix identified a 2.03:1 ratio of characters to taxa (Fig. 2C), with a ratio of 2.07:1 when parsimony-uninformative characters are included.
CI is negatively impacted by the inclusion of more taxa (Fig. 3A) and more characters (Fig. 3B) in a dataset, reflecting increased homoplasy, while RI (Fig. 3C, D) and HER (Fig. 3E, F) values remain relatively consistent. However, RI values slightly decrease (Fig. 3C, D) and HER increases slightly (Fig. 3E, F) as more taxa and characters are added. Beta regressions identified a highly significant negative association (p < 0.001) for each of the relationships between matrix-wide CI and RI and the number of taxa and characters (Fig. 3) in a dataset. However, the amount of variance explained by the number of characters or taxa, shown by R², and the amount the slope deviates from zero to negative values, are highly variable (Table 1). In contrast, much less of the variance observed in RI can be explained by the number of taxa or characters (R² values of 0.02-0.05). HER increases with higher numbers of characters and taxa but neither is a good predictor of HER values in a data matrix (R² < 0.03). These models can predict the expected CI, RI and HER for a matrix given the number of taxa or characters (Table 2).
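How such a prediction is obtained from a fitted beta regression can be sketched in Python, assuming the model uses a logit link for the mean (the default in betareg). The coefficients below are hypothetical placeholders chosen only to show the arithmetic; they are not the estimates reported in Table 1 or Table 2.

```python
import math

def predict_beta_mean(intercept, slope, x):
    """Expected index value under a logit-link beta regression: 1 / (1 + exp(-(b0 + b1 * x)))."""
    linear_predictor = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical coefficients for CI against number of taxa (illustration only).
B0, B1 = 0.8, -0.02
for n_taxa in (10, 50, 100):
    print(n_taxa, round(predict_beta_mean(B0, B1, n_taxa), 3))
```

Because the inverse logit maps any linear predictor into the interval between 0 and 1, predictions stay within the range of the index, and a negative slope translates into an expected value that declines as taxa are added.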
When the number of taxa and characters are considered together in a multivariate regression, the number of taxa is a significant negative covariate of CI but the slope for character number does not significantly differ from zero (Table 1). The number of taxa is not a significant correlate of RI in the multivariate regression with unique character matrices, and character number is only a significant negative correlate when all matrices (unique and reused) are included. There was no significant collinearity between the number of taxa and characters in the multiple beta regression, with the generalized variance inflation factors (range 2.16-2.45) all below 5 (James et al. 2013).

FIG. 2. Trends in the number of characters and taxa in matrices through time. The plots include information for all matrices and unique matrices are those that we consider to be independent. When all matrices are considered, the beta regression model estimates show an increasing number of characters and taxa through time. However, this may be an artefact of expansion of published matrices as unique matrices do show a flat relationship of character and taxa number through time. A, numbers of characters in a matrix plotted against year of publication. B, numbers of taxa in a matrix plotted against year of publication. C, numbers of taxa in a matrix plotted against numbers of characters in a matrix.
A bimodal distribution was recovered for the per character CI distribution (Fig. 4). The first peak, at the 0.9-1 bin, accounts for an average of c. 40% of the characters within datasets and the second peak, within the CI 0.4-0.5 bin, accounts for c. 25% of the characters. Fewer than 5% of the characters in datasets have CI 0.5-0.9, while characters with CI ≤ 0.4 gradually decline proportionally as overall matrix CI decreases. The per character RI distribution is unimodal with a peak of c. 40% of characters at RI 0.9-1 (Fig. 4). For RI 0.3-0.9, an approximately consistent proportion of characters (c. 5-10%) is observed in each bin. Characters with the most homoplasy, RI bin 0-0.1, account for c. 7% of the characters in a dataset.
All empirical estimates of CI, RI and HER for the matrix genealogies fall within the range of values expected based on our regression analyses of the broader dataset of matrices. As the number of taxa, not the number of characters, is the significant correlate of CI and RI, these remain constant as characters are added to matrices. The early vertebrate matrices show the worst RI and HER scores relative to the expectation. The biggest changes in CI, RI and HER from one matrix to the next are the decreases in each value from the Gold et al. (2014) to the Lee & Yates (2018) crocodilian matrices.
There are only small changes in matrix-wide values of homoplasy for the genealogies of matrices that we identified (Table 3; Fig. 5). The total range of matrix-wide CI values of matrix genealogies is lower than 0.1 for early vertebrates (0.069), scalidophorans (0.068) and crocodilians (0.085). Similar trends are seen with RI values, but variance is higher for HER through time, even though the patterns of increment and decrement are comparable across all three measures. There is a significant decrease in the proportion of parsimony-informative characters across the three matrix genealogies. This is likely to be an artefact of pruning matrices with increasing numbers of taxa to the same small core set of taxa. Nevertheless, the percentage of parsimony-informative characters decreases each year by 16% (early vertebrates), 5% (scalidophorans) and 1% (crocodilians). Otherwise, there is no significant change in matrix-wide CI values across the matrix genealogies. For the crocodilian matrices, there is a significant trend of decreasing CI value (p = 0.02) particularly associated with the most recent matrix generation within the genealogy (Lee & Yates 2018), but there is no significant trend in matrix-wide CI across the genealogy of crocodilian matrices. All genealogies of datasets show a trend of decreasing matrix-wide RI value through time, but this is only significant for the early vertebrate matrices. All dataset genealogies show a decreasing trend of HER values through time, but none of these is significant.

Homoplasy and matrix size
As observed by previous metanalyses (Kluge & Farris 1969; Archie 1989; Sanderson & Donoghue 1989, 1996; Klassen et al. 1991; Archie & Felsenstein 1993; Lamboy 1994; Givnish & Sytsma 1997; Hauser & Boyajian 1997; Wiens 2004), datasets that have more taxa and/or more characters contain more homoplasy, reflected in decreasing CI and RI (Fig. 3A-D). Increasing taxon numbers in a matrix has a greater negative effect on CI values compared to increasing character number (Table 1). A significant negative relationship was also identified between larger datasets and the overall RI values (Fig. 3C, D). However, as the number of characters and taxa increase, there is a greater decrease in CI compared to RI. This suggests that the RI value is not greatly affected by the size of the dataset (Wiens 2004) while CI is greatly affected. In particular, the number of characters and taxa explain around 45% of the variance in CI values compared to only 5% for RI. It is likely that the discrepancy between the effect of matrix size on CI and that on RI or HER reflects the difference in how these indices are calculated and what they actually measure. For HER, the excess of homoplasy is measured with reference to that observed in random data.
Given the approximately flat trend in overall RI and HER (but with different directionalities; Fig. 3C-F), there is no evidence to suggest that as more data are added to morphological datasets their phylogenetic signal decreases (Wiens 2004). While in individual cases this might have occurred, for example in the dataset of Lee & Yates (2018) (Fig. 5C), this is far from representing a trend. Instead, the general observed trend seems to suggest that while homoplasy increases through generations of dataset development, this is matched broadly with an increase in phylogenetic signal. Thus, as datasets become larger, their excess homoplasy, with reference to that of random datasets, tends to decrease (Fig. 3E, F). Furthermore, there is a significant relationship between the number of characters and the number of taxa, with a character to taxon ratio of 2.11:1 (Fig. 2).
We suggest that the difference in the behaviour of the RI and HER in response to increasing numbers of taxa and characters could be explained with reference to the fact that HER measures the homoplasy in a dataset with reference to that observed in random data. As the number of taxa and characters increase, the trend observed in RI tells us that only a small amount of homoplasy is added to the data (Fig. 3C, D), while the HER trend indicates that despite some addition of homoplasy (RI trend) the overall effect on the signal to noise ratio of empirical datasets is still positive (Fig. 3E, F). That is, contra Scotland & Steel (2015) but in agreement with Wiens (2004), we do not see any evidence to support a general view that morphological dataset expansion occurs at the expense of phylogenetic signal. Increasing the number of taxa in a dataset increases homoplasy through detection of otherwise unobserved character changes. As more taxa are added to a dataset, there is an increased probability that two or more taxa have the same character state; the probability that those taxa did not derive that character through common ancestry also increases; and the likelihood of detected character loss is higher. Scotland & Steel (2015) found that datasets with large proportions of homoplastic characters can lead methods, such as maximum parsimony, to produce trees that are based entirely on homoplastic characters. To address phylogenetic noise in datasets with large numbers of taxa, they used compatibility analysis to remove characters saturated with homoplasy. This improvement in accuracy comes at the expense of precision, as the removal of characters caused a general reduction in tree resolution.
Increasing the number of characters in a dataset increases the amount of homoplasy (Figs 3B, 4). However, it does not follow that adding homoplastic characters decreases phylogenetic resolution or support (Wiens 2004). This distinction is drawn out in the difference between the behaviour of the CI and RI in relation to the number of characters (Fig. 3B-D). While CI decreases with numbers of characters, RI remains approximately constant, indicating that while the consistency of additional characters may be decreasing, their phylogenetic informativeness is not; this is also reflected in the positive trend between HER and numbers of taxa (Fig. 3E) and characters (Fig. 3F). Our analysis of matrix genealogies provides further support for this view since there is no significant decrease in CI, RI or HER through the temporal sequence of matrix development (Fig. 5), even though the dataset of Lee & Yates (2018) could represent an individual case where scoring more characters might have led to a decrease in the phylogenetic informativeness of a dataset. Furthermore, when the number of taxa and characters are considered jointly, the negative relationship between number of taxa and homoplasy (CI) remains, but number of characters is no longer significant (Table 1). Thus, the relationship between character number and homoplasy may be an artefact of the positive relationship between taxa and characters, so the exhaustion of phylogenetically informative characters may not be reached in the majority of matrices. In summary, we find no support for the expectation that the phylogenetic quality of characters diminishes with discovery time (Wiens 2004).
Sanderson & Donoghue (1989) did not find a significant relationship between the number of characters and the amount of homoplasy in a dataset, though this relationship has been observed in other, smaller studies and simulations (Kluge & Farris 1969; Archie 1989; Sanderson & Donoghue 1989, 1996; Klassen et al. 1991; Archie & Felsenstein 1993; Lamboy 1994; Givnish & Sytsma 1997; Hauser & Boyajian 1997); we find the same results with multiple regressions despite the gradual increase in the average number of characters in morphological datasets in the intervening years: the number of taxa, not characters, is a significant predictor of changes in homoplasy measures (Fig. 2A). Indeed, the median number of characters in the year Sanderson & Donoghue (1989) published their study was c. 25, rising to a median of c. 150 in 2017 (Fig. 2A). This increase in the size of datasets over time is associated with increased computational power and efficiency, as well as the (not necessarily correct; Philippe et al. 2011) expectation that more data should yield better results (Nelson 2004). Evidently, quality, not quantity, should be the priority when choosing which taxa and characters to include in morphological datasets for phylogenetic analysis (Nelson 2004). However, an interesting corollary of this is the observation from simulated studies that all methods of phylogenetic inference tend to support the same tree when datasets have large numbers of characters relative to numbers of taxa (O'Reilly et al. 2016, 2018b; Puttick et al. 2017b, 2019). Thus, these simulations may be producing a higher proportion of highly informative, low homoplasy characters than occurs in empirical data. However, similar trends are observed in empirical data, such as O'Leary et al. (2013), in which all methods tend to support a similar topology.
TABLE 3. Metrics of the character matrix genealogies showing the matrix-wide CI, RI and HER values, as well as the total number of characters (N char.) and parsimony-informative characters (N char. inf.).

Artefacts of parsimony and homoplasy metrics
Empirical datasets have 35-40% of characters with essentially no homoplasy (i.e. CI and RI values between 0.9 and 1) (Fig. 4). This high proportion of low homoplasy characters is likely to be an artefactual consequence of these metrics being parsimony-based, measured on a most parsimonious tree. Maximum parsimony attempts to minimize the number of steps on the tree, and CI and RI are ratios of the number of steps. For this reason a high proportion of characters are expected to have the minimum number of steps and thus have a CI or RI value of one (i.e. no homoplasy). The peak observed in the 0.4-0.5 bin in the per character CI distribution (Fig. 4) is another artefact, here reflecting the fact that 0.5 is the second highest value that a binary character can achieve after a value of one. The minimum number of steps for a binary character is always one and so a CI value of 0.5 is obtained when a binary character has just one extra step (1/2 = 0.5). As these parsimony-based homoplasy metrics are sensitive to such artefacts it would be desirable to develop parsimony-independent measures of homoplasy.

Outliers
A number of our analysed datasets appeared as extreme outliers on the per character CI and RI distributions, with 100% of their characters obtaining a value of one (i.e. no homoplasy) (Fig. 3A-D). The relationships among the taxa could effectively be read directly from the matrix, suggesting that the range of possible characters has been filtered to exclude those with higher homoplasy. Outliers were also identified in the 0-0.1 bin on the per character RI distribution (Fig. 4), with some datasets including as much as 50% of their characters with these maximally high levels of homoplasy. Datasets composed of high proportions of highly homoplastic characters do not inspire confidence in the accuracy of the phylogenetic hypotheses derived from them.
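Implications for the simulations of morphology-like data
Sanderson & Donoghue (1989) derived a formula for the expected level of CI in a morphological dataset comprised of a given number of taxa based on their survey of empirical morphological datasets. The values obtained from this survey have been used as a basis for assaying and filtering simulated morphology-like data to achieve empirical realism (O'Reilly et al. 2016, 2018b; Puttick et al. 2017b, 2019). In recent years, simulation analyses have attempted to match empirical distributions of CI to increase realism of these simulated data. This approach has developed from ensuring that individual simulated datasets match the profile of CI exhibited by the universe of datasets analysed by Sanderson & Donoghue (1989), to matching the average profile of CI exhibited by characters within datasets (Goloboff et al. 2018a; O'Reilly et al. 2018a; Puttick et al. 2019), which was not considered by Sanderson & Donoghue (1989, 1996). Characterizing this phenomenon has revealed that the range of CI values exhibited by characters and datasets has increased since previous surveys (Sanderson & Donoghue 1989, 1996; Hauser & Boyajian 1997). Furthermore, the profile of CI values exhibited by characters within datasets is quite distinct from the profile of average CI exhibited by empirical datasets, as characterized by the original surveys conducted by Sanderson & Donoghue (1989, 1996). This provides a more informative target profile for future studies that attempt to simulate morphology-like data.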

CONCLUSION
Our study has established a new benchmark for the distribution of CI, RI and HER in discrete character morphological datasets, revealing a greater range in the levels of matrix-wide homoplasy than has been observed in previous surveys. In particular, it seems clear that while large datasets are associated with higher levels of homoplasy, this does not necessarily imply that large datasets are less reliable or that they are inherently less phylogenetically informative. Questions have been raised about the utility of morphological datasets of ever-increasing dimensions, which may ultimately be less informative than smaller datasets, suggesting the existence of potential limits to the utility of morphology in phylogenetics. Our results show that this expectation is not met, at least as measured by the CI, RI and HER. Indeed, the approximate constancy of matrix RI relative to character number provides no evidence for the exhaustion of phylogenetically informative characters through research time, and the (slightly) positive correlation in HER between number of characters and number of taxa indicates that there is still scope for the discovery of additional phylogenetically informative morphological characters to expand existing datasets. However, for individual studies, it might generally be worth benchmarking newly generated datasets against their progenitors to ensure that dataset expansion has not diminished phylogenetic signal. This could easily be done through comparison of RI and HER values of original and revised datasets, or by ensuring that new datasets have HER and RI values comparable to what is expected, given their size. This could be achieved by plotting the HER and RI values against those in our empirical survey (Fig. 3). The marked difference in CI and RI as characters and taxa are added to datasets suggests that RI may be a useful characteristic for guiding the simulation of morphology-like characters for benchmarking phylogenetic methods and studying the evolution of morphology.
FIG. 1. Characteristics of the datasets analysed in this study. The datasets were compiled by previous metanalyses (Wright et al. 2016; Sansom et al. 2018).
FIG. 4. Distribution of per-character CI and RI values between 0 and 1, pooled across all matrices and placed in 0.1 size bins.
FIG. 5. Changes in homoplasy among matrix genealogies through research time. The trends in CI (A), RI (B) and HER (C) through time.
MURPHY ET AL.: HOMOPLASY IN MORPHOLOGICAL DATA 515
TABLE 1. Summary of beta regression for CI, RI and HER against the number of characters and taxa using univariate and multiple regressions (separate results are summarized for all matrices and those with unique characters).
Sanderson & Donoghue (1989) derived a formula for the expected level of CI in a morphological dataset comprised of a given number of taxa based on their survey of empirical morphological datasets. The values obtained from this survey have been used as a basis for assaying and filtering simulated morphology-like data to achieve empirical realism (O'Reilly et al. 2016, 2018b; Puttick et al. 2017b, 2019). In recent years, simulation analyses have attempted to match empirical distributions of CI to increase realism of these simulated data. Initially, this approach ensured that the overall CI profile from the simulated datasets matched the profile of overall CI exhibited by the datasets analysed by Sanderson & Donoghue (1989). Subsequently, simulations have attempted to match the average profile of CI exhibited by characters within datasets (Goloboff et al. 2018a; O'Reilly et al. 2018a; Puttick et al. 2019), which was not considered by Sanderson & Donoghue (1989, 1996).