Meta‐analysis of dichotomous and ordinal tests with an imperfect gold standard

Abstract Standard methods for the meta‐analysis of medical tests, without assuming a gold standard, are limited to dichotomous data. Multivariate probit models are used to analyse correlated dichotomous data, and can be extended to model ordinal data. Within the context of an imperfect gold standard, they have previously been used for the analysis of dichotomous and ordinal test data from a single study, and for the meta‐analysis of dichotomous tests. However, they have not previously been used for the meta‐analysis of ordinal tests. In this article, we developed a Bayesian multivariate probit latent class model for the simultaneous meta‐analysis of ordinal and dichotomous tests without assuming a gold standard, which also allows one to obtain summary estimates of joint test accuracy. We fitted the models using the software Stan, which uses a state‐of‐the‐art Hamiltonian Monte Carlo algorithm, and we applied the models to a dataset in which studies evaluated the accuracy of tests, and test combinations, for deep vein thrombosis. We demonstrate the issues with dichotomising ordinal test accuracy data in the presence of an imperfect gold standard, before applying and comparing several variations of our proposed model which do not require the data to be dichotomised. The models proposed will allow researchers to more appropriately meta‐analyse ordinal and dichotomous tests without a gold standard, potentially leading to less biased estimates of test accuracy. This may lead to a better understanding of which tests, and test combinations, should be used for any given medical condition.

simultaneous meta-analysis of ordinal and dichotomous tests without assuming a gold standard, which also allows one to obtain summary estimates of joint test accuracy.We fitted the models using the software Stan, which uses a state-of-the-art Hamiltonian Monte Carlo algorithm, and we applied the models to a dataset in which studies evaluated the accuracy of tests, and test combinations, for deep vein thrombosis.We demonstrate the issues with dichotomising ordinal test accuracy data in the presence of an imperfect gold standard, before applying and comparing several variations of our proposed model which do not require the data to be dichotomised.The models proposed will allow researchers to more appropriately meta-analyse ordinal and dichotomous tests without a gold standard, potentially leading to less biased estimates of test accuracy.This may lead to a better understanding of which tests, and test combinations, should be used for any given medical condition.

K E Y W O R D S
imperfect gold, latent class, meta-analysis, multivariate probit, ordinal tests, test accuracy

Highlights
What is already known?Standard, well-established methods exist for the synthesising estimates (i.e., conducting a meta-analysis) of test accuracy.These methods estimate test accuracy by comparing test results to some test which is assumed to be perfect-which is referred to as a 'gold standard' test.However, in clinical practise, these tests are often imperfect, which can cause estimates of the tests being evaluated to be biased and potentially lead to the wrong test being used in clinical practise.Meta-analytic methods, which do not assume a gold standard, have previously been proposed, but only for dichotomous tests.

What is new?
We developed a model which allows one to simultaneously meta-analyse ordinal and dichotomous tests without assuming a gold standard.The model also allows one to obtain summary estimates of the accuracy of two tests used in combination (i.e., joint test accuracy).

Potential impact for research synthesis methods readers outside the authors field
The methods are widely applicable.For instance, psychometric measures and radiologic tests are typically ordinal, and the studies assessing these tests often do not use a gold standard; hence, applying standard models to these datasets may lead to misleading conclusions.The methods we proposed may lead to less biased accuracy estimates, and hence potentially a better understanding of which tests, and test combinations, should be used for these conditions.

| INTRODUCTION
Medical tests are used to screen, monitor and diagnose medical conditions.In order to evaluate their accuracy, we can carry out test accuracy studies-studies which estimate the accuracy of a test by comparing its results to some existing test assumed to be perfect (i.e., 100% sensitive and specific).The tests under evaluation, and those assumed to be perfect, are referred to as index tests and reference (or gold standard) tests, respectively.Index tests often have a lower sensitivity and/or specificity than the gold standard; however, they may be quicker, less invasive, and/or less costly.Unfortunately, the fact that gold standard tests are often imperfect is ignored in routinely used methods 1,2,3 to meta-analyse studies of test accuracy, which can lead to misleading results. 4he results between tests are usually conditionally dependent-that is, they are correlated within each disease class (diseased and non-diseased individuals).Models which account for this dependency, in addition to an imperfect gold standard have been proposed. 5,6,7,8,9These models-which we will refer to as traditional latent class models (TLCMs)-assume that all tests are measuring the same latent disease, and each individual is modelled as belonging in either disease class.Since they can model imperfect gold standards, they also allow one to compare the accuracy between gold standard and index tests.
Proposed models which can model an imperfect gold standard based on TLCMs have some limitations, which motivates the proposal of more flexible latent class models. 10,11,8,12,13,14,9For instance, multivariate probit latent class (MVP-LC) models, 10,11,8,12,13,14,15 which are a type of regression model.Unlike TLCMs, MVP-LC models can be extended to model ordinal test accuracy data 12,16 without forcing the user to dichotomise it, whilst simultaneously modelling conditional dependence.For example, Xu et al. 10 presented an MVP-LC model to analyse primary studies evaluating multiple dichotomous tests without assuming a gold standard, which they later extended 11 to model ordinal tests with two cutpoints.The latent trait model proposed by Qu et al. 13 is a variation of the MVP-LC model which is defined by specifying a series of univariate regressions with a common subjectspecific latent variable.This model was later expanded upon by Sadatsafavi et al. 8 to the meta-analysis settingto analyse studies evaluating up to three dichotomous tests using direct comparisons-whilst allowing the test accuracy to vary between studies.However, it cannot appropriately model ordinal tests, nor can it model between-study variation for the conditional dependence parameters.
In clinical practise, tests are rarely used in isolation.The accuracy of two or more tests used in combination is often referred to as the joint test accuracy.Few metaanalytical methods have been proposed which can simultaneously calculate summary joint test accuracy and incorporate ordinal tests, all of which assume a perfect gold standard.For instance, Novielli et al. 17 proposed a model based on conditional probabilities, in which studies evaluated up to one ordinal test with two cutpoints and two dichotomous tests.This model can estimate summary test accuracy at each cutpoint whilst modelling conditional dependence.
To address the gaps in the literature discussed above, we developed a Bayesian model for the meta-analysis of studies evaluating both ordinal and dichotomous tests without assuming a perfect gold standard.The model also enables the estimation of summary joint test accuracy, whilst allowing the conditional dependence parameters to vary between studies.The proposed model is an extension of previous MVP-LC models which have been developed to analyse multiple tests in a single study. 10,11In Section 2, we describe the case study dataset which will serve to motivate our proposed model, which we will describe in Section 3.Then, we apply several variations of our proposed model to this dataset in Section 4. Finally, in Section 5 we discuss the benefits and limitations of the model, as well as possible extensions.

| MOTIVATING EXAMPLE
Deep vein thrombosis (DVT) is the formation of a blood clot in a deep (i.e., not superficial) vein.DVT can occur in the upper (proximal) or lower (distal) part of the leg, with the former more likely to be life-threatening.A potential complication of DVT occurring in up to a third of patients 18 is pulmonary embolism (PE).PE occurs when a blood vessel in the lungs becomes blocked by a blood clot (formed as a result of DVT) which has migrated from the legs to the lungs.Contrast venography is generally considered to be a gold standard for DVT, as it is almost 100% sensitive and specific. 19,20However, it is not commonly used in clinical practice because it is time consuming and invasive. 19,20Instead, ultrasound is often used to diagnose DVT, since it is non-invasive and cost-effective. 18,21,22,23owever, it is less accurate than contrast venography 24,25 for both distal and proximal DVT, with its sensitivity being lower for distal DVT. 24Furthermore, although ultrasound is known to have a very high specificity, it is still nonetheless imperfect. 24,25A commonly used 18 screening tool for DVT is a questionnaire called the Wells score, 26 which groups patients into one of three risk categories-'low,' 'intermediate' or 'high.'Another DVT test is the D-Dimer assay: a blood test measuring the amount of a protein fragment called D-Dimer, higher concentrations of which are indicative of DVT.Despite being considered to be generally more accurate than the Wells, 27 the D-Dimer assay is intended to be used for screening as opposed to diagnosis, 27 since a number of other conditions can elevate serum D-dimer concentrations. 18nvestigating the joint test accuracy of the aforementioned tests for DVT is important for a variety of reasons.
The Wells and D-Dimer are both relatively cheap, quick and non-invasive to carry out, particularly the Wells test.A combined screening approach utilising the Wells and D-Dimer may be more cost-effective and reduce test burden for patients compared to using either alone.Furthermore, despite the fact that neither the Wells nor the D-Dimer alone are generally considered to be diagnostic tools for DVT, they may have diagnostic potential when combined. 17,24,28An example of a potential screening strategy is to use the Wells prior to the D-dimer in the diagnostic pathway as a pre-screening tool to rule out individuals at low risk for DVT.Following this, individuals who scored as intermediate or high risk are subsequently screened using the D-Dimer assay, and only patients who also test positive on the D-Dimer undertake ultrasound.Another potential strategy is to refer patients scoring as high risk on the Wells score directly to ultrasound.Both of the aforementioned joint testing strategies are examples of 'believe the negatives' (BTN) strategies. 17his is a testing strategy where only those patients who test positive on an initial test go on to receive a second test, then only individuals who also test positive on the second test are considered positive.Conversely, 'Believe the positives' (BTP) is a testing strategy where only those patients who test negative on the first test go on to receive a second test, with only those patients who also test negative on this test being considered negative.Joint testing strategies are important across clinical areas besides DVT, for example for depression screening and for COVID-19-see discussion Section 5.1 for more details.
Novielli et al. 17 proposed a statistical model in order to conduct a meta-analysis of studies investigating the Ddimer, Wells score and ultrasound for DVT.The proposed model allowed them to model the Wells score without dichotomising the data whilst modelling the conditional dependence between tests, enabling them to estimate summary-level joint test accuracy.However, their model assumes that ultrasound is a perfect gold standard, which could have led to biased estimates of the performance of other tests under evaluation.Novielli et al. 17 carried out several analyses based on different datasetsfor instance, one based on the 11 studies which directly compared the D-dimer, Wells' score via the gold standard (ultrasound), and another which also included studies which only analysed one of Wells or D-dimer tests, and utilised indirect comparisons.In Section 4 of this article, we re-analyse the direct comparisons data (see Table 1) from Novielli et al. 17 without assuming a perfect gold standard, using a variety of models we propose in Section 3; namely, models which dichotomised the Wells score and those which modelled it as an ordinal test, those which assumed conditional independence (CI) and conditional dependence (CD) between tests, as well as models which assumed ultrasonography was perfect or imperfect.This dataset consisted of 11 studies, with a total of 4096 individuals and 12,288 observations, with all 11 studies evaluating all three tests.

| METHODS
Before describing our proposed Bayesian MVP-LC model, we will first define some terminology and notation in Section 3.1.For a formal model specification, please refer to the full technical model specification (supplementary material 1).

| Terminology and notation
The model is for a meta-analysis dataset with a total of S studies, with each study having a total of N s individuals, where s is an index for study-so s can be used to denote anything between the first (s ¼ 1) and the last (s ¼ S) study.Each study is assessing the same number of tests, T. Will we use t as an index for test, which can be between 1 and T, and n as an index for individual, which can be between 1 and N s for study s.
For the nth individual from study s, we will denote the vector of observed test responses as y s,n ¼ y s,n,1 ,,y s,n,T À Á 0 .Each test is either dichotomous or ordinal.
For dichotomous tests, each observed test response, y s,n,t , is coded as 0 and 1 for negative and positive results, respectively.For ordinal tests, each test t has K t categories (hence K t À 1 cutpoints).We will use k as an index to refer to any given cutpoint, which can be between and variance-covariance matrix , where: where ν d ½ s,t and τ d ½ s,t , denote the study-specific means and standard deviations, respectively.Each ϵ d ½ s,1,t denotes the study-specific correlations between each test-pair (or 'test-pair'-denoted as 't and t 0 ') for the augmented data (Z s,n,t and Z s,n,t 0 )-the polychoric correlation 16,29 -which is not the same as the correlation between the observed data y s,n,t and y s,n,t 0 .Each ϵ d ½ s,1,t models the conditional dependence between each test-pair.However, we can assume CI by setting ϵ d ½ s,1,t ¼ 0 and ϵ d ½ s,1,t 0 ¼ 0. We must to ensure that the number of parameters being estimated from our model is not greater than what is possible for the given dataset; otherwise it may be non-identifiablewhich means that the model will give misleading results.For example, it may estimate the sensitivity for a test to be equal to both 0.20 and 0.80.To ensure that our model is identifiable, 12,16 as in Xu et al., 11 we set each τ d ½ s,t ¼ 1 (i.e., set all Ψ d ½ s to be correlation matrices).Please see supplementary material 6.
For dichotomous tests, the augmented data (Z s,n,t ) will be less than 0 for negative results (y s,n,t ¼ 0) or greater than 0 for positive results (y s,n,t ¼ 1), and the measures of test accuracy for a given study s are given by, where Φ Á ð Þ denotes the cumulative density function (CDF) of the standard normal distribution-that is, a normal distribution with mean 0 and SD 1.For ordinal tests, the augmented data (Z s,n,t ) will belong to an interval defined by strictly increasing latent cutpoint parameters ( k,s,t , and k between 2 and K t À 1).This interval will depend on the observed test result as follows-if the test result is below the first cutpoint (i.e., in the first category), then the augmented data will be less than the first cutpoint parameter; if it is above the last cutpoint (i.e., in the last category), then the augmented data will be greater than the last cutpoint parameter; otherwise, if the test result falls between two cutpoints (i.e., the test result belongs to any other category), then the augmented data will fall between the corresponding cutpoint parameters.The measures of test accuracy are given by,

| Between-study model
Now we will explain how we will model the variation in test accuracy between studies-called the between-study heterogeneity, as well as the correlation between the sensitivities and specificity between studies-called the between-study correlation.It is important to bear in mind the distinction from the within-study correlations (defined in Section 3.2), which model the conditional dependence between tests.For each test t, we will assume that the study-specific means (ν The model described in Equation ( 4) is known as a partial pooling model (using the terminology from Gelman & Hill 30 -otherwise known as 'random-effects').These models allow the study-specific accuracy parameters across studies to inform one another, without assuming full homogeneity like a full pooling (i.e., 'fixed-effects') would-which would allow no between-study variation in the means ν d ½ s,t .The disease prevalence's in each study, p s , are modelled independently of each other, known as a no pooling model.There are several differences between partial pooling and no pooling models. 31For example, the former uses less parameters than no pooling, which means that there is less likelihood of encountering parameter identifiability issues.An advantage of our partial pooling model is that allows us to summarise the results using the parameters which are shared across studies (see Section 3.3), allowing us to more easily summarise test accuracy as well as the heterogeneity in accuracy between studies and correlation between sensitivities and specificities.We can incorporate meta-regression covariates into the model by extending the partial pooling model defined in Equation ( 4) above-see supplementary material 1, meta-regression section (Section 1.2.1) for details.We can assume that a given test is a perfect gold standard by setting μ

| Within-study correlations
We will model the within-study correlation matrices (Ψ d ½ s ) defined in Equation (1) in Section 3.2 using a partial pooling model.As suggested by Goodrich,32 s .Note that we can also model the conditional dependence between only certain pairs of tests by setting the relevant terms for the other tests in

| Cutpoints
The cutpoint parameters can be modelled using an induced Dirichlet model, an approach which has been proposed by Betancourt 33

| Test accuracy summaries
For dichotomous tests, the summary sensitivity and specificity estimates for test t are given by evaluating Equation (2) at the means of the between-study model (see Equation ( 1)).More specifically, Se G,t ¼ Φ μ , and . Similarly, for ordinal tests, the summary measures for test t at cutpoint k are given by evaluating Equation (3) evaluated at the means of the partial pooling model (see Equation ( 1)), and at the global (sum- and for ordinal tests.

| Joint test accuracy summaries
The summary estimates for the joint test accuracy of tests t and t 0 at cutpoints k and k 0 are given by: With BTN and BTP as defined in Section 2. Note that for ordinal tests, the order of the tests can affect joint test accuracy estimates. 28However, for dichotomous tests it does not, 28 although this is often still important for clinical practise.For example, the first test may be cheaper to carry out.The parameter cov d ½ G,tt 0 ,kk 0 is the global conditional covariance between all possible test-pairs.Obtaining these covariances requires us to calculate the global conditional correlations between each test-pair, which we will denote as ρ 0 ½ G,tt 0 ,kk 0 ; this assumes that the test results are the same form as the observed data.However, our model is parameterised in terms of the polychoric correlations (ϵ d ½ G,tt 0 ,kk 0 -see equation [Equation (1)] in Section 3.2).Therefore, in order to be able to estimate the joint test accuracy estimates in Equation ( 5), we will need to convert from ϵ d ½ G,tt 0 ,kk 0 to ρ d ½ G,tt 0 ,kk 0 .For details on how this is achieved, please refer to Section 1.2.4 of supplementary material 1.

| Assessing model fit and model comparison
For our MVP-LC model, we can check how well our model predicts the data by using a technique called posterior predictive checking-where we generate data from our model, and compare it to the observed data.For example, we can plot the model-predicted test results against the observed test results for each test-pair. 5,4,6We also assessed model fit by plotting the model-predicted within-study correlations against the observed withinstudy correlations, using the correlation residual plot proposed by Qu et al. 13 For model comparison, we used leave-one-out (LOO) cross-validation 34 -an iterative procedure which removes part of the data and re-fits the model, and sees how well the model predicts the missing data.For more details on model comparison and posterior predictive checking, including relevant formulae, please refer to Section 1.3 in supplementary material 1.

| Model implementation
We implemented the models in R 35 using the probabilistic programming language Stan 36,37 via the R package CmDStanR 38 using a PC with 32GB of RAM and an AMD Ryzen 3900X 12-core CPU with Linux Mint OS.To code the model in Stan, we extended the code for a standard binary multivariate probit model. 32This is described in detail in Goodrich 2017 39 and is summarised in supplementary material 5. We implemented the between-study partial pooling model for the within-study correlations described in Section 3.3.1 in Stan by using the function provided by Stephen Martin and Ben Goodrich. 40For the cutpoint between-study model, we used Betancourt's induced Dirichlet model 33 described in Section 3.3.2;this is described in more detail in supplementary material 4, and this was implemented using code by Betancourt. 33e ran all models using four chains until the split Rhat statistic was less than 1.05 for all parameters and the number of effective samples was satisfactory for all parameters. 41We only reported results when we obtained no warnings for divergent transitions or energy fraction of missing information (E-FMI), important diagnostics for geometric ergodicity. 38We used the CmDStanR diagnostic utility to check all of the aforementioned model diagnostics CmDStanR.We also inspected trace plots and plotted the posterior distributions to check they were not bimodal.Rather than using Φ Á ð Þ, which is prone to numerical instability, we can use the closely resembling logistic function, Φ 0 x ð Þ ¼ 1 1þe À1:702 Á x , which has an absolute maximum deviation from Φ Á ð Þ of 0.0095.This is the same probit approximation used for the meta-analysis of dichotomous tuberculosis tests using latent trait models in Sadatsafavi et al. 8 The data, Stan model code and R code to reproduce the results and figures for the case study application in Section 4 is provided at https:// github.com/CerulloE1996/dta-ma-mvp-1.

| APPLICATION TO CASE STUDY
Since our model is Bayesian, we must formulate a prior model-that is, specify prior distributions for the model parameters defined in Section 3. We describe this prior model in Section 4.1.When faced with the task of analysing a dataset with an imperfect gold standard which contains test accuracy data from an ordinal test, in order to be able to apply proposed methods for meta-analysis without assuming a gold standar, 5,7 one must first dichotomise the data at each cutpoint and conduct a series of stratified analyses.We applied a priori dichotomisation technique using our proposed MVP-model in Section 4.2.Finally, in Section 4.3, we applied the models proposed in Section 3, but without dichotomising the Well's score.
In Section 4.1, we will index the gold standard (ultrasound), the D-Dimer, and the Wells' score by t ¼ 1, t ¼ 2, and t ¼ 3, respectively.In Sections 4.2 and 4.3 we will denote summary estimates as 'X [Y, Z]', where X is the posterior median and [Y, Z] is the 95% posterior interval.

| Prior distributions
For the summary-level accuracy parameter for the gold standard test (ultrasound-i.e., μ d ½ t¼1 ), we constructed informative priors using subject-matter knowledge for ultrasound, based on meta-analyses from the literature 24,25 (see supplementary material 2 for more details).These priors correspond to 95% prior intervals of 0:49, 0:94 ð Þ and 0:82, 0:99 ð Þfor the sensitivity and specificity, respectively.On the other hand, for the summary-level accuracy parameters for the D-Dimer and the Wells score (i.e., μ d ½ t¼2 and μ d ½ t¼3 ), we specified priors conveying very little information-equivalent to assuming a 95% prior interval of 0:04,0:96 ð Þfor the sensitivities and specificities.For the between study deviation parameters for all three tests (i.e., for σ d ½ t for t ¼ 1, 2, 3 f g-see Equation ( 4) in Section 3.3), we used weakly informative priors corresponding to a 95% prior interval of 0:02,1:09 ð Þ .The priors are weakly informative since they weakly pull the study-specific sensitivities and specificities towards each other, whilst allowing for large between-study heterogeneity if the data demands.For example, if 0:8 is the value found for the summary sensitivity, and the data suggests a standard deviation equal to 2 (corresponding to a high degree of between-study heterogeneity), then these priors would allow the study-specific sensitivities and specificities to be in the interval 0:44,0:97 ð Þwith 95% probability.We also used weak priors for the between-study correlation parameters for all tests (i.e., ρ t for t ¼ 1, 2, 3 f g-see Equation ( 4) in Section 3.3), corresponding 95% prior probability interval of (À0.82, 0.82).Finally, for conditional dependence models, for the within-study correlation parameters (see Section 3.2 and Equation ( 1 s ), respectively.These are weakly informative and allow a moderately large between-study deviation in the strength of the conditional dependence between tests.For more detail on these prior distributions, please see supplementary material 2.

| The pitfalls of a priori dichotomisation in the presence of an imperfect gold standard
We consider two dichotomisations of the Wells score.For the first, we dichotomised the Wells' score by grouping together those patients who obtain a score of 'low' or 'moderate' as a negative result and those who scored 'high' as positive.On the other hand, for the second dichotomisation, we grouped together patients who scored 'moderate' or 'high' and considered this as a positive result, and those who scored 'low' as a negative result.We will refer to the former dichtomisation as 'low +moderate versus high' and the latter as 'low versus moderate + high.'We applied this technique to this dataset, to allow comparison with our 'full' model, using the models proposed in Section 3, fitting both CI and CD models, the results of which are shown in Figure 1.
When assuming CI between all three tests, we see that (Figure 1) some of the estimates of the accuracy of the other two tests change substantially depending on whether we dichotomise the Wells score as low + moderate versus high, or as low versus moderate + high.For the former dichotomisation, the sensitivity of ultrasound was estimated as 0.80 (0.70, 0.88) whereas for the latter it was 0.69 (0.56, 0.82).The specificity of ultrasound and the sensitivity of the D-Dimer were similar between both dichotomisations.However, there was a notable difference in the specificities of the D-Dimer test, where we obtained specificities of 0.69 (0.57, 0.78) and 0.76 (0.65, 0.85) for the low + moderate versus high and low versus moderate + high dichotomisations, respectively.
The differences in the results were similar when modelling conditional dependence between the three tests (see Figure 1).In the low + moderate versus high dichotomisation, for the ultrasound sensitivity we obtained 0.83 (0.67, 0.92) and for the low versus moderate + high dichotomisation 0.74 (0.57, 0.89).For the D-Dimer specificities, we obtained 0.67 (0.54, 0.78) and 0.71 (0.58, 0.83) for the low + moderate versus high and low versus moderate + high dichotomisations, respectively.As with CI, the specificity of the ultrasound and the sensitivity of the D-Dimer were similar between the two dichotomisations.We can also see the estimates of disease prevalence increase for most studies for the low versus moderate + high dichotomisation relative to the low + moderate versus high dichotomisation, for both CI (left panel of Figure 2) and dependence models (right panel of Figure 2).
Overall, regardless of whether we assume CI or CD, some of the accuracy estimates change notably depending on how we dichotomise the Wells score.This is not surprising, since imperfect gold standard models based on latent class analysis utilise the full distribution of test responses from all tests to estimate accuracy and disease prevalence. 4This simple example demonstrates the importance of modelling the available data for ordinal non-dichotomous tests, such as the Wells score, in the presence of an imperfect gold standard, as opposed to simply conducting multiple stratified analyses at each cutpoint of the ordinal test using simpler methods.This observation serves to motivate the implementation of ordinal regression into the models to appropriately model the ordinal nature of the Wells score.

| Modelling the Wells score as an ordinal test
Now we fit the models without dichotomising the Wells score, by simultaneously modelling all three categories.For these models, we used weakly informative priors of μ 3 $ N 0,1 ð Þ for the mean parameters for the Wells test.We used the partial pooling model on the Wells score cutpoint parameters (see Section 3.2, Equation ( 3)).For the Dirichlet population parameters, we used a weakly informative prior κ d ½ $ N ≥ 0 0, 50 ð Þ.This allows considerable asymmetry in the Dirichlet population vector α k , as can be seen from the prior predictive check (see Figure 1 in Section 1.3 in supplementary material 1).The rest of the priors were the same as those discussed in Section 4.1.
We fit the following models: one assuming that ultrasound is a perfect gold standard and CI between all three tests (M1); the same model but modelling the conditional dependence between the Well's score and D-Dimer (M2); a model assuming ultrasound to be an imperfect gold standard and CI between all three tests (M3); and a variation of M3 which modelled the conditional dependence between all three tests (M4).
The results for the summary sensitivity and specificity estimates for the four models are shown in Figure 3, and the results for each of the Wells score strata are shown in Figure 4.The estimates for the two models assuming a perfect gold standard (M1 and M2) are within 2% of those obtained from Novielli et al. 17 The similarity of the results is not surprising, since despite using different models and different link functions (logit vs. approximate probit), both models assume that ultrasound is perfect.
For both the CI (M1) and CD (M2) models which assumed that ultrasound is a perfect reference test, the results we obtained for the accuracy of the BTP and BTN testing strategies for the Wells & D-Dimer tests were similar to those obtained by Novielli et al. 17 More specifically, for the BTP testing strategy, we found that the summary specificity estimates for M1 (33 [25, 41]) were around 8% lower than M2 (41 [32, 50]).For the BTN strategy, we found that the estimates for M1 (74 [65, 82]) were around 9% higher (83 [76, 88]) than M2.
When we modelled ultrasound as an imperfect test, the summary estimate for the sensitivity of the Wells test for the model assuming CI (M3-88 [81, 93]) was around 4% higher than the model which modelled conditional dependence (M4-84 [74, 92]), and around 5% higher for the sensitivity of the Wells & D-Dimer BTN testing  respectively).The other differences between M3 and M4 were 3% or less (see Figure 3).
The summary receiver operating characteristic plot for M4 is shown in Figure 5.The prediction regions suggest that there is substantial between-study heterogeneity for the sensitivity and specificity for most estimates.However, we found relatively narrow prediction regions for the specificity of ultrasound and for the Wells and D-Dimer BTP testing strategy, so we can be more confident in generalising our inferences for these estimates.
The LOO-CV results for all of the models are shown in Table 2.The results suggested that M1 has the poorest fit.Modelling the dependency between the D-Dimer and Wells tests (M2) improved the fit (LOO-IC = 16038.6and 15819.0 for M1 and M2, respectively).Out of the two models not assuming a perfect gold standard, the CI model gave a worse fit than the CD model (difference in ELPD between M3 and M4 = À31.4,se = 6.4).The two CD models were the two best fitting models, with the conditional dependence model giving the best fit (difference in ELPD between M2 and M4 = À20.8,se = 6.2).The posterior predictive checks for this model are shown in Figure 6 (correlation residual plot) and Figure 2 in supplementary material 3 (2 Â 2 table count residual plot).Both plots show that the model fits the data well.

| Summary
Our proposed MVP-LC model addresses the novel problem of carrying out meta-analysis of two or more conditionally dependent tests when there is no perfect gold standard, for the case where there are both ordinal and dichotomous test(s) under evaluation, and estimation of joint test accuracy is of interest.
Using the case study as a demonstrative aid for the model (see Section 4), we showed why treating ordinal tests as dichotomous in the context of an imperfect gold standard is suboptimal (see Section 4.2).When we modelled the Wells test as ordinal and treated ultrasound as a perfect gold standard (see section 4.3), the summary estimates from Novielli et al. 17 are replicated in our findings.However, we found that the most complex model-which treated ultrasound as an imperfect gold standard in addition to modelling the conditional dependence between tests-had the best fit to the data.For this model, our estimates of test accuracy differed considerably compared to other models we fit (which gave worse fit to the data) and compared to the results obtained in the analysis conducted by Novielli et al. 17 In particular, we obtained considerably different estimates of specificity for both the D-Dimer and the Wells score tests when used alone, and for the joint specificity of the Wells and D-Dimer BTN testing strategy.However, the large between-study heterogeneity limited the generalisability of our results.

| Potential applications
The methods we have developed in this article have a wide scope of applicability in clinical practise, further than just DVT.For instance, Hamza et al. 42 re-analysed a meta-analysis, 43 which assessed the accuracy of the CAGE questionnaire 44 -a four-category ordinal test used as a screening tool to detect individuals who may be suffering from alcoholism.However, their model assumed a perfect gold standard. 1Our proposed MVP-LC model could be used to more appropriately estimate the accuracy of the CAGE questionnaire, since we would not need to assume that the reference test in each study is perfect.
The methods could also be used to more appropriately assess joint testing strategies.For instance, current UK Health Security Agency guidance 45 states that individuals who have symptoms suggestive of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and test positive using Lateral flow tests (LFTs) should be considered as positive and require no subsequent testing.On the other hand, it states that individuals with negative LFTs should be assessed with a polymerase chain reaction (PCR), with only those who also test negative on PCR being considered negative.Our methods could be used to investigate this joint test accuracy strategy without modelling PCR as an imperfect gold standard, particularly with respect to its sensitivity.For depression screening, one potential BTN testing strategy is one in which individuals undertake a very brief 2-item version of the Patient Health Questionnaire (PHQ-2 46 ) followed by the 9-item version (PHQ-9 47 ).This was investigated recently by Levis et al.; 48 however, they assumed perfect gold standards, and they only used around half of the available studies, since they discarded those studies which used inferior gold standards.Our MVP-LC model could be used to analyse these data without assuming a perfect gold standard whilst accounting for differences between reference tests with meta-regression, and using all of the available data.Furthermore, we would be able to model the differences in gold standards between studies using meta-regression (see Section 3.3).

| Advantages
Our proposed Bayesian MVP-LC model addresses some important limitations which are present in models based on TLCMs. 4,49,5,6,7For example, although TLCMs have fast run times due to being computationally inexpensive, and they can model the conditional dependence between tests, 50 an important limitation is that, unlike our proposed MVP-LC model, they cannot appropriately model ordinal tests.For example, if one wishes to simultaneously model ordinal tests whilst modelling conditional dependence, they would first need to dichotomise the data a priori.As we showed in Section 4.2, this is suboptimal in the context of an imperfect gold standard, since the test accuracy and disease prevalence estimates were varied depending on which cutpoint we dichotomise the data at.A limitation of TLCMs which have been proposed for meta-analysis 5,6,7 is that, unless one assumes a complete pooling model between studies, it is not possible estimate summary correlation parametersparameters which are required to estimate summarylevel joint test accuracy.This is due to the fact that, in contrast to our MVP-LC model (which uses the withinstudy correlations), TLCMs model the conditional dependence using the within-study covariances, making it difficult to construct a partial pooling model for the within-study conditional dependence parameters.These covariances have bounds based on the sensitivity and specificity parameters in each study. 49,51Therefore, any summary-level covariance parameters obtained would be questionable.Our MVP-LC model also has advantages over more advanced models for meta-analysis of test accuracy, such as the model proposed by Sadatsafavi et al, 8 which is also based on multivariate probit regression and is an extension of the latent trait model. 13Two important limitations of this model-not present in our MVP-LC model-is that it can only model dichotomous data, and it assumes that the within-study correlations are fixed across studies.Furthermore, since our proposed MVP-LC model is an extension of the model for single studies proposed by 11 , another benefit over the model by Sadatsafavi et al. 8 is that it can also be used to specify more general correlation structures (by setting certain correlations to zero-see Section 3.2).The fact that our model is Bayesian means that one can incorporate subject-matter knowledge into the model, as we did for our case study.Furthermore, the Induced Dirichlet partial pooling model 37 (see Section 3.3.2and supplementary material 4) for the ordinal tests makes it possible to specify priors for ordinal tests and obtain summary estimates.

| Limitations
When applying the model to our case study dataset (see Section 2), we used available subject-matter knowledge 24 to construct informative prior distributions for the gold standard test (ultrasound), and weakly informative priors for other parameters (see Section 4.1 and supplementary material 1).Attempts to conduct sensitivity analysis using more diffuse priors led to diagnostic errors.This is likely due to the fact that Stan is quite sensitive at detecting non-identifiabilities in the posterior distributions, 36 and non-identifiability is more likely to occur with less informative priors, particularly for latent class models due to the large number of parameters relative to the data.
Another limitation of our case study analysis is that, although our model can easily incorporate metaregression coefficients (see supplementary material 1), the case study dataset did not contain any study-level covariates, since primary studies did not report sufficient data.In an ideal world where such data were available, a more principled analysis could be carried out by using a meta-regression covariate for the proportion of patients who have proximal versus distal DVT, which would have enabled us to model the variation of ultrasound sensitivity that exists between the two DVT groups in Novielli et al.'s 17 data.
A limitation of our model, which is present across all imperfect gold standard methods based on latent class models (including TLCMs), is that full cross-classification tables (i.e., the full distribution of test results) are required for each study.This is a potential barrier to the uptake of our proposed MVP-LC model, as this data is frequently not reported for studies evaluating 3 or more tests and/or studies assessing ordinal tests.One way in which we could have assessed the general performance of our MVP-LC model is by running a simulation study. 52A simulation study comparing our proposed MVP-LC model to other models would also be very useful.However, it is important to note that, at the time of writing, no other models have been proposed to simultaneously meta-analyse both dichotomous and ordinal tests without assuming a perfect gold standard.That being said, a simulation study would still be useful, since we could compare the performance of our model to other proposed models which do assume a perfect gold standard (e.g., Novielli et al. 17 ) under a variety of different scenarios.
Although our proposed MVP-LC model offers considerable benefits in comparison to the more commonly used TLCM models 5,6,7,50 (see Section 5.3), we found that our proposed model was considerably less time efficient than TLCM models.Although this was not prohibitive for the case study used in this article, our MVP-LC model may be intractable for larger sample sizes.Speeding up models based on augmented continuous data, such as our MVP-LC model, is an active area of research. 53,54,55,56,57,58,59,60n important area for future research would be to apply the models developed in this article using these more efficient algorithms, which would make our proposed MVP-LC model more suitable for general use with larger meta-analyses, and it would also make it easier to conduct more meaningful simulation studies.

| Future work
Models for the meta-analysis of test accuracy which can incorporate patient-level covariates-otherwise known as individual patient data (IPD)-have been proposed, 61 but only for dichotomous data and they assume a perfect gold standard. 61Modelling IPD can lead to results which are more applicable to clinical practise as they can more easily be applied to patients when there is between-study heterogeneity, rather than only providing summary estimates which relate to some 'average' patient.Extending our model to incorporate IPD would be relatively straightforward, since our model uses the patient-level data (as reconstructed from the reported contingency tables) as opposed to aggregated data for each study.It is straightforward to extend our model to the case where not all studies are assessing the same number of tests, using direct comparisons only.This could be further extended to allow indirect comparisons (network metaanalysis [NMA]), by assuming tests are missing at random (MAR), 62 and extending the between-study model described in Section 3.3 to an arm-based network-metaanalysis model. 63,64Another straightforward modelling extension would be to incorporate data from ordinal tests which have missing data for some categories.
Our model could also be extended to synthesise data from ordinal tests for the case where some (or all) studies do not report data for every cutpoint-which is common in research.One could formulate such a 'missing cutpoint' version of our MVP-LC model by extending the partial pooling between-study cutpoint model (see Section 3.3.2and supplementary material 4), and viewing the cutpoints as MAR.Another possible 'missing cutpoint' model could be constructed by modelling the cutpoint parameters as the same in the diseased and nondiseased classes, and assume that they are fixed between studies by using a no pooling model.Then, as opposed to our MVP-LC model, in which the within-study variances are set to 1 to ensure parameter identifiability (see Section 3.2), the no pooling cutpoint model would allow us to introduce within-study variance parameters and model them using a partial pooling model without encountering significant identifiability issues.These within-study parameters could be set to vary between the two latent classes, which would result in a smooth, nonsymmetric receiver operating characteristic (ROC) curve.Another possible 'missing cutpoint' approach would be one based on the model proposed by Dukic et al., 65 which assumes a perfect gold standard.This model also results in a smooth, non-symmetric ROC curve, since it assumes that the cutpoints vary between studies and are the same in the diseased and non-diseased class.However, it would be more parsimonious since it assumes that the sensitivity is some location-scale change of the false positive rate.
For the case where studies report thresholds at explicit numerical cutpoints (as is sometimes reported for continuous tests, such as biomarkers), some 'missing threshold' methods which assume a perfect gold standard have been proposed. 66,67Rather than modelling the cutpoints as parameters, these methods assume that the cutpoints are constants, equal to the value of the numerical cutpoint, and they estimate separate location and scale parameters in each study and disease class.Our MVP-LC model could be extended to achieve this without assuming a gold standard.An important area for future research would be to construct other models which can be used for the same purposes as our proposed MVP-LC model.For instance, a multivariate logistic regression model could be constructed by using the Bayesian multivariate logistic distribution proposed by O 0 Brien et al. 68 Such a model would use logistic link functions as opposed to probit (or approximate probit) links like our MVP-LC model, which are more numerically stable than probit and may give better fit to some datasets.Another multivariate regression approach would be to use copulas. 69,70,71Besides multivariate regression based on augmented data, another approach to modelling conditionally dependent ordinal diagnostic tests without assuming a perfect gold standard is log-linear models. 11hese models can account for higher-order correlations. 11However, this requires estimation of additional parameters, so it is likely to introduce identifiability issues.Similarly, to the multivariate probit models utilised in this article, it may be possible to extend these models to meta-analyse multiple, imperfect diagnostic tests with multiple cutpoints.
this can be achieved by specifying each Ψ d ½ s as a weighted linear combination of a global 'average' correlation matrix across studies (Ψ d ½ G ), and a matrix of study-level deviations from this global matrix (Ψ d ½ Δ s ), with weight β d ½ which is between 0 and 1.More specifically, )), we used priors which correspond to 95% prior intervals of À0:65, 0:65 ð Þ for both the global 'average' correlation matrices (Ω d ½ G ) and the study-specific deviation matrices (Ω d ½ Δ

F I G U R E 1
Posterior medians and 95% posterior intervals for models dichotomising the Well's score.CD, conditional dependence; CI, conditional independence [Colour figure can be viewed at wileyonlinelibrary.com]

F I G U R E 2
Posterior density plots for disease prevalence parameters.CD, conditional dependence; CI, conditional independence [Colour figure can be viewed at wileyonlinelibrary.com]

F I G U R E 3
Posterior medians and 95% posterior intervals for summary sensitivities and specificities, for models 1-4.The Wells score summary estimates are dichotomised as low versus moderate + high.CD, conditional dependence; CI, conditional independence; GS, gold standard [Colour figure can be viewed at wileyonlinelibrary.com] strategy (89 [83, 94] and 84 [75, 92] for M3 and M4,

F
I G U R E 4 Posterior medians and 95% posterior intervals for the Well's score stratum, for models 1-4.CD, conditional dependence; CI, conditional independence; GS, gold standard [Colour figure can be viewed at wileyonlinelibrary.com]

F I G U R E 5 a
Summary receiver operating characteristic (sROC) plot for M4.Shaded regions represent 95% posterior regions and regions surrounded by dashed lines represent 95% prediction regions.The Wells score summary estimates are dichotomised as low versus moderate + high [Colour figure can be viewed at wileyonlinelibrary.com]T A B L E 2 Leave-one-out cross validation (LOO-CV) for comparison of model fit for case study 1 dataset Model a LOO-IC b ELPD M4 À ELPD Mi c,d se ELPD M4 À ELPD Mi ð Models are ordered from best to worst fitting.b LOO-IC = Leave-one-out information criterion; note that LOO-IC is on the deviance scale.c ELPD = Estimated log pointwise predictive density for a new dataset.d Mi denotes the ith model.CI = conditional independence; CD = conditional dependence.F I G U R E 6 Posterior predictive check for model 4; correlation residual plot [Colour figure can be viewed at wileyonlinelibrary.com] Sample of case study dataset All test results are modelled at the individual level.We show the aggregate data in this table for ease of presentation.
T A B L E 1 Note: a The Wells score is classified as L = Low, M = Moderate, H = High.
By taking advantage of the properties of a type of statistical distribution called a Dirichlet distribution, this model is able to map the latent cutpoint parameters in each study ( C corresponds to the probability that an individuals' test result for test t falls in category k for study s.In this article, we used a partial pooling model for the cutpoints accross studies, enabling us to model the between-study heterogeneity in the cutpoints.We can also obtain 'average' cutpoints, (C