Bayesian network analysis complements Mendelian randomization approaches for exploratory analysis of causal relationships in complex data

Mendelian randomization (MR) is an increasingly popular causal inference tool used in genetic epidemiology. But it can have limitations for evaluating simultaneous causal relationships in complex data sets that include, for example, multiple genetic predictors and multiple potential risk factors associated with the same genetic variant. Here we use real and simulated data to investigate Bayesian network analysis (BN) as an alternative approach. A Bayesian network describes the conditional dependencies/ independencies of variables using a graphical model (a directed acyclic graph) and its accompanying joint probability. In real data, we found BN inferred simultaneous causal relationships that confirmed the individual causal relationships suggested by bi-directional MR, while allowing for potential horizontal pleiotropy (that violates MR assumptions). In simulated data, BN with two directional anchors (mimicking genetic instruments) had greater power for a fixed type 1 error than bi-directional MR, while BN with a single directional anchor performed better than or as well as bi-directional MR. Both BN and MR could be adversely affected by violations of their underlying assumptions (such as genetic confounding due to unmeasured horizontal pleiotropy). BN with no directional anchor generated inference that was no better than by chance, emphasizing the importance of directional anchors in BN (as in MR). Under highly pleiotropic simulated scenarios, BN outperformed both MR (and its recent extensions) and two recently-proposed alternative approaches: a multi-SNP mediation intersection-union test (SMUT) and a latent causal variable (LCV) test. We conclude that BN is a useful complementary method to MR for performing causal inference in complex data sets such as those generated from modern “omics” technologies Author summary Mendelian randomization (MR) is a popular method for inferring causal relationships between variables (such as between an intermediate biological factor and a disease outcome). However, MR relies on a number of assumptions that may be hard to verify, and it is not ideally suited to comparing different underlying causal scenarios. Here we propose the use of an alternative method, Bayesian network analysis (BN), as a complementary tool to MR. We use real and simulated data to investigate the performance of MR, BN and several other recently-proposed methods, and find that BN performs as well as, or better than, the other methods, particularly under complex scenarios. We conclude that BN is a useful complementary method to MR for performing causal inference in complex data sets.

Causal inference methods offer an attractive avenue for understanding complex 2 mechanisms in disease development and identifying ways to intervene upon them. An 3 observed association between a risk factor and disease outcome does not necessarily 4 imply causation, as it may arise via an alternative mechanism such as reverse causation 5 or confounding [1]. A gold standard experimental approach for causal inference is to 6 carry out a randomized controlled trial (RCT). By randomly allocating participants to 7 intervention and control groups, an RCT can eliminate selection bias or confounding. 8 However, it is an expensive and time-consuming process, and its result may imply a 9 relatively short-term effect unless the trial is of long duration. Furthermore, 10 intervention via an RCT is not always ethical, or (due to technical limitations) not 11 feasible, for example when the potential risk factor involves DNA methylation or small 12 metabolite variation. 13 Traditional non-experimental approaches for causal inference include discordant 14 identical twin studies and longitudinal studies, which can be used to infer causal 15 relationships under certain assumptions. Studies of identical twins are not subject to 16 genetic confounding, and confounding by shared environmental factors is expected to be 17 low (but reverse causation -which is unshared, and a major distorter of observational 18 estimates -may bias the findings). In longitudinal studies with many repeated 19 measures, methods such as g-computation can be applied [2,3], but most longitudinal 20 studies do not have the data measurements that allow the use of this approach. 21 Mendelian Randomization 22 Mendelian randomization (MR) [4,5] is an alternative non-experimental approach for 23 causal inference applicable to a general population. In its simplest form it utilizes a 24 genetic variant whose robust association with a risk factor provides a directional causal 25 anchor. The approach is based on the fact that there is only one fixed direction of 26 causation between the genetic variant and the outcome. Use of the genetic variant 27 (which is allocated at gamete formation during conception) has some analogies to the 28 randomization procedure in an RCT. Hence, causal inference is made from the 29 difference in the outcome seen between people with different genetic variants. These 30 genetic variants, usually single nucleotide polymorphisms (SNPs), can be considered to 31 operate as instrumental variables (IVs) provided certain conditions are met. 32 MR has been widely applied to evaluate the causal role of traditional risk factors in 33 disease, such as HDL and LDL cholesterol in cardiovascular disease [6,7]. It has also 48 this could violate the MR assumption that the genetic variant used as an instrument 49 influences the outcome only via the risk factor tested. 50 To address this issue, several approaches that attempt to either detect or allow for 51 pleiotropy in the context of MR, or to investigate more complex networks of 52 relationships between variables, have been proposed [15][16][17][18][19][20][21][22]. MR can also be used in a 53 "bi-directional"' or "reciprocal" fashion to determine the direction of causation between 54 two variables, say X and Y [12,23]. In most of these approaches, an underlying 55 hypothesised graphical structure representing the relationships between variables must 56 be assumed (rather than being learned from the data). However a recently-proposed 57 addition to bi-directional MR, known as MR Steiger [24], moves a step further by first 58 carrying out an initial determination of whether a genetic variable G is most suitable as 59 an IV for variable X or Y, prior to conducting standard a MR analysis between them  Bayesian network analysis (BN) is another non-experimental, statistical technique for 69 causal inference. It was first formalized and developed by Pearl [26] and has now 70 become widely applied in the social and natural sciences. Briefly, a Bayesian network 71 describes the conditional dependencies of variables using a graphical model known as a 72 directed acyclic graph (DAG) and an accompanying joint probability [27]. In a DAG, 73 the variables and their conditional relationships are represented as nodes and directional 74 edges (arrows), respectively. The joint probability is decomposed as a product of local 75 probabilities where the local probability of each variable is explained by its conditional 76 dependencies on its immediate neighbours [28]. The local probability distribution can 77 take any form, but usually a multinomial distribution is used for discrete variables and 78 a multivariate normal distribution is used for continuous variables. 79 Valid estimation of the underlying conditional dependencies (and thus causal 80 inference) in BN can be made under three assumptions: 1) the causal Markov 81 assumption, 2) the causal faithfulness assumption, and 3) the causal sufficiency 82 assumption. The causal Markov assumption states that a variable is independent of all 83 other variables, except for its effect or descendent ("child"/"grandchild" etc.) variables, 84 conditional on its direct causal (or "parent") variables [28,29]. The causal faithfulness 85 assumption states that the network structure and the causal Markov relations assumed 86 represents all (and the only existing) conditional independence relationships among 87 variables [27,30]. The causal sufficiency assumption corresponds to asserting that there 88 are no external variables which are causes of two or more variables within the model, 89 implying that all causes of the variables are included in the data and there are no 90 unobserved confounding variables [27,30,31]. A further (sometimes unappreciated) 91 assumption is that of no measurement error i.e. the variables are measured without any 92 errors [30]. These assumptions are essential for causal inference, and are quite commonly assumed in other causal inference methods, but they are generally impossible 94 to validate (and, indeed, may be considered unlikely to hold completely, raising the 95 question of sensitivity to their violation). In the MR literature a large (and growing) set 96 of sensitivity analyses allow relaxation of some of the assumptions required for 97 identification [32]. 98 In most analyses using BN, the true causal relationships (and the corresponding 99 network structure) are unknown. Hence, the network is estimated from the most likely 100 DAG (i.e. the DAG that has the best score (highest or lowest, depending on how the 101 score function is defined), or the highest posterior joint probability, out of all possible 102 DAGs. As the number of variables in the data set increases, the number of all possible 103 DAGs increases and the enumeration of all possible DAGs becomes infeasible [33]. Thus, 104 in many cases, the most likely DAG is estimated using a model search algorithm or a 105 model averaging algorithm. As the DAG structure is learned/estimated, the parameters 106 of the probability distributions are also learned/estimated from the data using a DAGs [30]. In analysis of "omics" data, genetic variants are natural instruments that 114 can be used to define directional anchors.

115
BN has some advantages over other causal inference methods with regards to the 116 ability to accommodate large complex data relatively flexibly. This feature is 117 particularly useful when the study aims to address simultaneous causal relationships in 118 "omics "-scale data sets, for example in studies of gene expression [34] or metabolites [35]. 119 Recent methods have been developed that allow the analysis of hundreds of variables, 120 including both discrete and continuous data types, taking advantage of the ability of 121 genetic variables to operate as causal anchors to help orient the direction of 122 relationships between non-genetic variables [36][37][38][39]. (Note that MR-based 123 approaches [24] also exist for constructing such networks). Nevertheless, BN has known 124 limitations. The model search process, particularly when there are large numbers of 125 variables, requires massive computational power and often elimination or pre-filtering of 126 variables is required. The conditional relationships implied by each tested model are Motivating Example: TwinsUK Data 141 As an initial motivating example, we investigated possible causal relationships between 142 metabolites and body mass index (BMI) using the TwinsUK study data [41]. We 143 applied both MR and BN to these data, and compared the causal inferences obtained. 144 We note that this example is intended as a (relatively straightforward) illustration of 145 analysing data using both MR and BN approaches, rather than making any strong 146 claims for the validity of the instruments (and thus for the robustness of the inferences 147 obtained) in this particular case.

148
The metabolites considered were the omega-3 fatty acids eicosapentaenoate (EPA) 149 and dihomo-linolenate (DGLA). For testing whether a causal relationship existed 150 between these metabolites and BMI, genetic IVs for EPA and DGLA were chosen based 151 on knowledge gained from prior investigation of this data set (along with an additional 152 German cohort) [42]; we note that re-use of (some of) the same data used to identify the 153 instruments can, in theory, run the risk of over-fitting. Based on these previous results, 154 the SNP rs174556 in FADS1-2-3 was used as an IV for EPA, while the SNPs rs968567 155 in FADS1-2-3 and rs6498540 in PDXDC1 were used as IVs for DGLA. 156 rs174556 and rs968567 are correlated with an r 2 value of ≈ 0.52. It is conceivable 157 that rs174556 is actually a causal variant for DGLA, and so could have an effect on 158 BMI operating in parallel through both EPA and DGLA. This would violate one of the 159 three assumptions required for the genetic variant to be used as an instrumental 160 variable (IV) for EPA, namely the no-horizontal pleiotropy assumption that the IV has 161 no effect on the outcome besides the effect mediated through the risk factor (EPA).

162
MR, based on the individual level data (rather than based on summary statistics via 163 two-sample MR), was used to test for a causal relationship between each metabolite and 164 BMI. The rationale for using individual level data (rather than performing the 165 asymptotically equivalent two-sample MR analysis) was to allow comparison with BN 166 which (at least in its current implementations) requires access to individual level data. 167 A causal relationship from BMI to each metabolite was also tested using MR with an 168 instrumental variable for BMI given by a BMI allele score formed (on the basis of prior 169 knowledge [43,44]) from 39 BMI-associated SNPs. Again, individual level data (and 170 resulting individual level BMI allele scores) were used, although the weighting of the 171 SNPs to construct the allele score variable could be considered to incorporate external 172 information as 'prior knowledge', being informed by previous results from larger 173 studies [43,44].
174 Table 1 shows the results of applying Mendelian randomisation to the TwinsUK 175 data set. Both metabolites, EPA and DGLA, were inferred to have a causal relationship 176 with BMI at the 0.05 significance level (p-values 0.047 and 0.019 respectively).

177
Conversely, reverse causation (with BMI causing the metabolite levels) did not show 178 such compelling p-values (0.665 and 0.707 respectively). Mendelian randomisation 179 between the metabolites provided somewhat conflicting results, with both directions 180 achieving p-values < 0.05, but overall there seemed to be stronger support for a causal 181 effect from DGLA to EPA.

182
Despite recent debate about the utility and potential for misinterpretation of 183 p-values in scientific research [45], we note that p-values can still be considered as useful 184 summaries of the compatibility between a data set and an underlying hypothesized 185 model [45,46]. Indeed, in the human genetics literature, p-values remain the most "blacklisted" to not be allowed to exist. The average network score (BIC) when both the 229 SNPs and the BMI allele score are allowed to have children (equivalent to Fig 3A) Table 1. Mendelian randomisation results for TwinsUK data. The Instrument-risk factor p-value(s) are from the regression of the risk factor on the instrument(s), and the MR p-value is from the regression using the predicted value of the risk factor as an explanatory variable for the outcome variable.
-33500.99. The average network score when SNPs can have children but the BMI allele 231 score is constrained to have no children (conceptually similar to Fig 3B) is -33528.85.

232
The average network score when the BMI allele score can have children but the SNPs 233 are constrained to have no children (conceptually similar to Fig 3C) is -33546.05. The 234 average network score when SNPs and BMI allele score are both constrained to have no 235 children (conceptually similar to Fig 3D) is -33573.91. These average BICs illustrate the 236 considerably better fit obtained when all anchor variables are allowed to influence the 237 values of the other variables in the model.

238
Overall, these results support the inference seen with this data set using MR. They 239 also illustrate the advantage in BN analysis of being able to easily include 240 simultaneously anchor variables for both BMI and the metabolites -although removing 241 one or other anchor still produced broadly similar inference concerning the direction of 242 the relationships between the metabolites and BMI, we found the support for these We used three simulation models (Fig 1) to investigate the powers and type I errors of 251 MR, MR Steiger and BN for testing the relationship X to Y, with assumed effect size  [48]) and Figs 258 S7-S9 (for BN implemented via the deal algorithm [49]). Comparison of Figs S4-S6 with 259 Figs S7-S9 shows the power when using deal to be consistently lower than when using 260 BNLearn (with no compensating advantage in terms of better type 1 error), and for this 261 reason we discard the deal algorithm from any further consideration.

262
Direct comparison between MR/MR Steiger and BN is complicated as the most 263 natural processes for determining if X is causal on Y (or for examining the extent of the 264 evidence for X being causal on Y) are different for the three methods. For MR/MR 265 Steiger, we make inference based on a type I error (p-value) threshold given by α (the 266 probability of a false positive). As mentioned previously, we concur with the 267 opinion [45] that, in real data analysis, use of absolute p-value thresholds should be 268 avoided, and we do not in fact propose that any particular threshold should be 269 considered as "correct"; here we instead use the thresholds as heuristics and examine  For BN, we find the power is generally higher when both G and Z are used together 297 rather than using either alone (Figs 4 and 5, panels A and B). Under model 1, the 298 probability of making an incorrect inference is very low when testing Y to X when there 299 is actually an effect from X to Y (Fig S4,    causal on X (bottom row) given by BN, for data simulated under model 1 where X was 326 causal on Y. As the true effect size β XY increases, the probability of correctly (top row) 327 detecting an X to Y effect increases, while when β XY is zero, the probability of BN analysis (panel A) the probability estimates are higher than when only one of these 330 is included (panels B and C), illustrating the advantage of using the extra information 331 from both variables. In addition, as the true effect size, β XY , increases, the probability 332 of falsely (bottom row) detecting a Y to X effect decreases, with the lowest probability 333 seen when both G and Z are included in the BN analysis (panel D), again illustrating 334 the advantage of using the extra information from both variables. when there is no effect from X to Y. However, for the analysis with only G included, it 342 can be seen the probability estimates approach 0.5 as the effect size increases, rather 343 than increasing to above 0.5 as occurred for models 1 and 2.

344
An overall summary of the performance of the methods based on Simulation Study 1 345 is given in Table 2 Table 3. The parameter values for each scenario used to simulate discrete binary data. Models are described in detail by Shih et. al. [50]. edges between Q and Y, and between Y and W, are much stronger than other edges in 358 the model, each with probability strength 1, and so are always inferred to be in the 359 model. The weak relationship between Q to W can be modelled by an additional edge 360 from Q to W, but can also be modelled via the edges from Q to Y to W, which has the 361 advantage of using 2 edges rather than 3 edges to model the entire system of relations 362 between Q, Y and W. Therefore, although incorrect, this model was sometimes chosen 363 as the best model on account of the fact that the BIC measure used in BNLearn 364 algorithm penalizes the number of edges in the network. hand panels) with a reverse effect (so that Y influenced the metabolite). We applied 379 two recently proposed methods, LCV [25] and SMUT [40], along with BN, MR and a 380 recent MR extension (MR-BMA) [22]. required different approaches to handle the 10,000 SNPs that were potentially causal on 390 the metabolites or Y: for MR, B1 and B12 a weighted allele score was constructed 391 (using SNPs passing a p-value threshold of p < 5 × 10 −6 ), SMUT and MR-BMA used a 392 subset of SNPs passing a p-value threshold of p < 5 × 10 −6 with any metabolite, and 393 LCV was the only test to use all 10,000 SNPs in the final analysis.

394
For MR and SMUT, we see high power to detect a true causal relationship when it is 395 present (Fig 10, middle panels), however there is very high inflated type I error when 396 the effect is in the opposite direction (Fig 10, right hand panels). There is also inflated 397 type I error when there are no effects at all (Fig 10, left hand panels). This is due to   Somewhat perversely, the detection rate for a direct causal relationship, in either 406 direction, is much higher when there are no effects (Fig 10, left hand panels)  alleviate this concern to some extent, but, in most of these approaches, an underlying 445 hypothesised graphical structure representing the relationships between variables must 446 be assumed (rather than being learned from the data).

447
Here we suggest BN analysis as a complementary approach for performing 448 exploratory analysis of causal relationships in complex data. We note that BN in the cannot explain a cyclic or feedback relationship among variables, whereas bi-directional 495 MR can test this to some extent. The performance of BN is affected by the sample size 496 and true causal effect sizes, and the posterior probability threshold that corresponds to 497 any particular type 1 error rate is therefore hard to define. Arguably the most serious 498 limitation of BN is the fact that analysis is performed on individual level data, and the 499 method is not readily extended to summary data (although this represents an 500 interesting topic for future investigation). In contrast, MR approaches such as  extended to utilize larger numbers of variables [36][37][38][39]). It is possible that BN may 510 perform differently or worse in larger, more complex data sets. Thus, further studies on 511 more complex real and simulated data (for example involving known biological or 512 metabolic pathways) are required. In spite of these limitations, our study highlights the 513 utility of BN as an appealing approach for performing causal inference in complex 514 biological data sets that thus warrants further investigation.

526
We note that MR was originally [4] introduced as a general approach that uses the 527 directionality from genetic variable to phenotype as the basic principle, but not with 528 any particular analytical strategy (such as that suggested by its use here) in mind. In 529 practice, the large majority of MR studies have attempted effect estimation and have 530 used either two-stage least squares linear regression for analysis carried out within a 531 single study sample, or two-sample MR based on summary statistics when utilizing data 532 from two separate studies [12,14] (which provides equivalent inference). This motivates 533 our choice of two-stage least squares linear regression as reflecting the most commonly 534 used analysis strategy, while also having the advantage of allowing direct comparison 535 with BN (which, at least in its current implementations, requires access to individual 536 level -rather than summary statistic level -data).

538
A variety of algorithms have been proposed for performing BN. We considered two 539 different Bayesian Network methods, deal [49] and BNLearn [48], which were 540 implemented in C++ in our own software package, BayesNetty [55], using a hill 541 climbing algorithm with random restarts and likelihood-based network scores for model 542 selection. The BNLearn method used the Bayesian information criterion (BIC) to form 543 the network score and (as we demonstrate) was found to be more powerful and robust 544 than deal. The BNLearn method is therefore the primary method used to generate the 545 Bayesian Network results presented in this article.

546
Networks were drawn using the igraph [56] R package. Average networks were 547 calculated by bootstrapping the data with replacement 1000 times, and selecting the 548 best-fit network for each replicate. The probability of an edge existing, and the 549 probability of the edge being in a particular direction (given that it exists) were 550 estimated by counting the proportion of times that such events occurred amongst the 551 1000 best-fit bootstrap networks. For plotting the resulting average network, only edges 552 that were considered sufficiently strong in the context of the current average 553 network [48] were plotted.

555
In addition to MR and BN, we also considered a recently-proposed extension of MR 556 known as MR Steiger [24]. This approach involves applying two tests which must both 557 pass a p-value threshold in order to conclude a causal relationship between variables X 558 and Y: firstly a test to decide if a genetic variable G is most suitable as an IV for 559 variable X or Y, then a standard MR test using G as an instrument to test either the 560 relationship X to Y, or the relationship Y to X.

561
Multivariable MR based on Bayesian Model Averaging 562 We also considered a recently-developed extension to multivariable MR [16,57] termed 563 "multivariable MR based on Bayesian model averaging" (MR-BMA) [22]. MR-BMA, like 564 original multivariable MR, is basically an extension of standard MR to model not one, 565 but multiple, risk factors on an outcome, thus accounting for measured pleiotropy. We also applied the recently-developed latent causal variable (LCV) method [25] which 574 infers, for pairs of measured traits, the extent to which part or all of the genetic 575 component of one trait is causal for the other. This method makes use of genetic data 576 across the whole genome, rather than following the usual MR approach of selecting 577 specific genetic variants to be used as instruments. The method tests a newly-defined 578 quantity between two traits, the "genetic causality proportion" (GCP), where large 579 (positive or negative) values of GCP imply that interventions on one trait are likely to 580 affect the other, suggesting (without specifically testing) that one trait may itself be 581 causal on the other. Formally, the GCP test performs a two-sided test of the null 582 hypothesis that the GCP= 0. The software also produces p-values for "full causality" 583 between the two traits in either direction.

584
The underlying graphical model used to motivate the LCV method actually 585 corresponds to a model in which an (unmeasured) latent variable is the causal variable 586 for both measured traits. One could therefore argue that demonstration of such an 587 effect suggests that it is actually the latent variable that should be intervened upon, 588 rather than one of the traits, if one wishes to bring about a corresponding change in the 589 value of the other trait.

591
We also considered a recently-proposed multi-SNP mediation intersection-union test 592 known as SMUT [40]. SMUT tests the joint mediation effects of multiple (potentially 593 correlated) genetic variants on a trait through a single mediator, effectively generating a 594 hypothesis test for mediation but with a multivariate exposure. SMUT adopts the 595 classical mediation framework, takes a frequentist approach, and relies on individual 596 level data, treating the mediator effect as fixed and the effects of multiple SNPs upon 597 the mediator as random. The data analysed here consisted of 5654 twins with measurements of BMI, the two 600 metabolites EPA and DGLA, and 42 SNPs. For the purposes of the current analysis we 601 used all twins and treated them as independent (i.e. ignoring pairwise clustering due to 602 twin relationships; this will overestimate the statistical significance (nominal p-value) of 603 any observed associations, but we anticipate that this phenomenon should affect all the 604 methods evaluated equally). Three SNPs were used directly as IVs: rs174556 for EPA, 605 and rs968567 and rs6498540 for DGLA. The other 39 SNPs known [43,44] to be 606 associated with BMI were combined into a weighted allele score variable (i.e. the 607 number of effect alleles for each individual were counted for each the 39 SNPs, and 608 these were summed, weighting by their previously estimated [43,44] regression 609 coefficient). The EPA and DGLA data were pre-processed by logging and adjusting for 610 study day using linear regression, and were then normal quantized [42]. The genetic 611 variables (rs174556, rs968567, rs6498540 and the weighted allele score) were used as 612 instruments in MR to explore causal relationships between the metabolites and BMI.

613
Relationships between all variables were also explored via BN, with the directional 614 constraint that arrows were constrained to come out from (rather than go into) any 615 nodes corresponding to genetic variables.  Fig 1), 622 using a variety of values for the regression coefficients (the βs). These models cover a 623 variety of plausible scenarios in terms of potential confounders (C and S). In each case, 624 the direction of causality goes from X to Y.

625
For all three simulation models, the following analyses were implemented:      to X were calculated with β XY equal to either 0 or 0.5, with probability thresholds 0.7, 644 0.8 and 0.9 used to define detection of a relationship. As a further visualisation of the 645 performance of BN for different values of β XY ranging from 0 to 0.5, the estimated 646 probabilities (based on the average bootstrap network) of an edge existing from X to Y 647 and from Y to X were calculated for each of the 1000 simulation replicates, and the 648 distributions plotted as box plots.

649
Receiver operating characteristic (ROC) curves (based on 1000 simulation replicates) 650 for the detection of an edge existing from X to Y were generated by imposing either 651 different α thresholds (for MR and MR Steiger) or different probability thresholds (for 652 BN). As the relevant threshold is relaxed, the chance of a true positive detection of a 653 relationship increases, but so does the chance of a false detection of a relationship. We also investigated the utility of BNs for performing causal inference in a binary trait 661 setting. Discrete binary data was simulated for 5000 individuals using four models 662 considered in recent work by Shih et al. [50], in the context of quantifying the effects of 663 alcohol consumption and high alanine transaminase levels on hepatocellular carcinoma. 664 We used the same graph (Fig 2) and parameter settings ( We also carried out a simulation study involving more complex networks of variables, as 674 considered by Zuber et al. [22] in their development of the "multivariable MR based on 675 Bayesian model averaging" (MR-BMA) method. We simulated data in a very similar 676 manner to Zuber et al. [22] and then applied MR-BMA, along with BN, MR, SMUT 677 and LCV. Data was simulated for 1000 individuals, using 1000 replicates (allowing us to 678 determine powers and type I errors using p-value thresholds of 0.1, 0.05 and 0.01, or 679 posterior probability thresholds of 0.7, 0.8 and 0.9, respectively).

680
To inform our simulation model, we used the same publicly available summarized 681 data on genetic associations with risk factors derived from a recent metabolite 682 GWAS [58] as was used by Zuber et al. [22]. To avoid selection bias we took the same 683 subset of 150 independent SNPs as Zuber et al. [22], that had been found to be 684 associated with any of the three main lipid measurements (LDL-cholesterol, 685 triglycerides or HDL-cholesterol) at a genome-wide level of significance (p-value 686 from each pair of metabolites that had a genetic correlation (calculated using the 693 beta-coefficients of the 150 SNPs) stronger than |r| > 0.99. From the resulting 92 694 metabolites, 12 were chosen at random to be used in our simulation study. Four of the 695 metabolites were chosen to be used in the simulation model as null variables (with no 696 effects on the outcome variable, Y), four were chosen to be used in the simulation model 697 with a direct effect on Y, and the other four were chosen to be used in the simulation 698 model with a reverse effect (from Y to the metabolites).

699
The data for the 150 SNPs were simulated using the allele frequencies given in by 700 the Global Lipids Genetics Consortium [59], assuming Hardy-Weinberg equilibrium 701 (HWE). The four metabolites with direct effects on Y were simulated conditional on the 702 simulated data for the 150 SNPs (based on their corresponding beta-coefficients for 703 association with metabolites). Y was then simulated based on these four metabolites 704 and 75 randomly-chosen SNPs (with beta-coefficients derived from their relationship 705 with a randomly discarded metabolite, Ile). The metabolites not directly affecting Y 706 were simulated based on the 150 SNPs and (for the 4 metabolites where a reverse effect 707 was present) on Y. Any causal effects between the metabolites and Y, or vice versa, 708 were simulated using a beta-coefficient of value 0.3, and the standard error was set to 1. 709 A further 9775 SNPs with no effects on any other variables were simulated assuming 710 HWE using a minor allele frequency simulated from a uniform distribution between 0.01 711 and 0.5. This gave a final simulated data set consisting of 10,000 SNPs, 12 metabolites 712 and one outcome variable, Y. 713 We performed MR between every individual metabolite and Y, as well as MR in the 714 reverse direction to test if Y has a causal effect on each metabolite. Weighted allele 715 score variables were used as instrumental variables and were re-constructed within each 716 simulation replicate using SNPs passing a p-value threshold of p < 5 × 10 −6 of 717 association with the appropriate metabolite or with Y, using the estimated regression 718 coefficients as weights. The same SNPs were also used as the genetic variants in SMUT 719 (using the R package SMUT [40]), which was also performed between every individual 720 metabolite and Y (or vice versa for reverse causal effects).

721
The MR-BMA test was performed using R code written by the MR-BMA authors 722 and was designed to detect which risk factors for an outcome are causal. The test 723 outputs marginal probabilities for each metabolite being causal on the outcome variable; 724 these are the probabilities presented in the results. The SNPs were chosen for the 725 MR-BMA test using a p-value threshold of p < 5 × 10 −6 for association with any of the 726 metabolites. 727 We also applied the LCV method proposed by O'Connor and Price [25]. The test 728 was evaluated using the R code written by the LCV authors and uses only one 729 metabolite and the outcome variable at any one time, together with all 10,000 SNPs.

730
Average Bayesian networks (BN) were used to estimate the probabilities of causal 731 effects between the metabolites and Y using the same instrumental variables as used by 732 the MR tests. Bayesian network analyses were initially performed using only 4 variables 733 for every metabolite: the metabolite itself, the outcome Y, and the 2 corresponding 734 allele scores (one for the metabolite and one for Y). Subsequently we considered 735 Bayesian network analyses that used all 12 metabolites, Y, and the 13 relevant allele 736 score variables (for the 12 metabolites and Y) simultaneously. In all analyses, the allele 737 score variables were constrained to operate as individual genetic instruments i.e. to have 738 one and only one edge going from itself to the corresponding instrumented variable 739 (either a metabolite or Y). Simulation models used in Simulation Study 1 of quantitative trait data. Data were simulated for two continuous variables (X and Y), together with a genetic instrument G (coded as 0, 1, 2) and a continuous instrumental variable Z. Parameter values for models involving weak confounding were chosen as β GX = 0.1, β ZY = 0.075 and β CX = β CY = β GS = β SY = 0.25. For models involving strong confounding, the parameter values were the same except that β CX = β CY = β GS = β SY = 0.5 i.e. the parameters controlling the confounding effects were doubled. The parameter β XY was varied using values of 0.0, 0.1, 0.2, 0.3, 0.4 and 0.5. For the calculation of ROC curves where false positives were counted as detections of an arrow between X and Y in the wrong direction, the direction of causality was reversed between X and Y , such that for model 1 the equations become X ∼ N (β Y X Y + β GX G, 1) and Y ∼ N (β ZY Z, 1), with β Y X varied using values of 0.1, 0.3 and 0.5 (and similarly for models 2 and 3) .   Fig 2. Graph of the simulation model used for Simulation Study 2 for four different parameter scenarios as described by Shih et al. [50]. The data simulated consisted of four binary variables: Q, representing a gene; W, representing high alcohol; H, representing high alanine transaminase; and the outcome variable, Y, representing hepatocellular carcinoma.