The confidence interval method for selecting valid instrumental variables

We propose a new method, the confidence interval (CI) method, to select valid instruments from a larger set of potential instruments for instrumental variable (IV) estimation of the causal effect of an exposure on an outcome. Invalid instruments are those that fail the exclusion conditions and enter the model as explanatory variables. The CI method is based on the CIs of the per-instrument causal effect estimates and selects as the set of valid instruments the largest group whose CIs all overlap with each other. Under a plurality rule, we show that the resulting standard IV, or two-stage least squares (2SLS), estimator has oracle properties. This result matches that of the hard thresholding with voting (HT) method of Guo et al. (Journal of the Royal Statistical Society: Series B, 2018, 80, 793-815). Unlike the HT method, the number of instruments selected as valid by the CI method is guaranteed to be monotonically decreasing for decreasing values of the tuning parameter. For the CI method, we can therefore use a downward testing procedure based on the Sargan (Econometrica, 1958, 26, 393-415) test for overidentifying restrictions. A main advantage of the CI downward testing method is that it selects the model with the largest number of instruments selected as valid that passes the Sargan test.


| INTRODUCTION
Instrumental variables (IV) estimation is a well-established method for determining causal effects of an exposure on an outcome, when this relationship is potentially affected by unobserved confounding. For recent reviews and examples, see Clarke and Windmeijer (2012), Imbens (2014), Kang et al. (2016) and Burgess et al. (2017).
As Guo et al. (2018, p. 793) state, an IV analysis requires instruments that:
1. are associated with the exposure (Condition 1),
2. have no direct pathway to the outcome (Condition 2), and
3. are not related to unmeasured variables that affect the exposure and the outcome (Condition 3).
Condition 1 is often referred to as the relevance condition and Conditions 2 and 3 as the exclusion conditions, see Section 2 for details.
This paper is concerned with violations of the exclusion conditions of the instruments. Following closely the setup of Kang et al. (2016), Windmeijer et al. (2019) and Guo et al. (2018), if an instrument satisfies the exclusion Conditions 2 and 3 it is classified as a valid instrument. If an instrument does not satisfy Condition 2 and/or 3, it is classified as invalid. Use of invalid instruments in an IV analysis leads to inconsistent estimates of the causal effect and it is therefore important to select the set of valid instruments from the set of putative IVs that may include invalid ones.
As an example, Mendelian randomisation is a technique employed in epidemiology to learn about the causal effects of modifiable health exposures on disease. It posits that genetic variants, which are known to be associated with the exposure and hence satisfy Condition 1, additionally satisfy the exclusion conditions and are only associated with the outcome through the exposure. In our Mendelian randomisation application in Section 8, we utilise genetic variants as potential instruments for BMI in order to determine its causal effect on diastolic blood pressure. However, a genetic variant could be an invalid instrument for various reasons, such as linkage disequilibrium and horizontal pleiotropy, see, for example, Lawlor et al. (2008) and von Hinke et al. (2016).
The so-called plurality rule holds if the set of valid instruments forms the largest group, as specified in Section 2. An approach for selecting the valid instruments could then be to follow Andrews (1999) and estimate the causal effect for all $2^{k_z} - k_z - 1$ possible subsets of at least two instruments, where $k_z$ denotes the total number of instruments, and to select the model that minimises an information criterion based on the Sargan (1958) test of overidentifying restrictions. A large value of the Sargan test statistic is an indication that invalid instruments are present. This approach is only feasible with a relatively small number of instruments, unlike in our application, where we have 96 putative genetic instruments. We therefore need dimension reduction techniques, even though we are in a setting of a fixed number of instruments $k_z$ with a large sample size $n$, the setting referred to as low dimensional by Guo et al. (2018).
Following the Lasso proposal by Kang et al. (2016), Windmeijer et al. (2019) proposed an adaptive Lasso estimator in combination with a downward testing procedure based on the Sargan test as in Andrews (1999). When the majority rule holds, meaning that more than 50% of the potential instruments are valid, this approach results in consistent selection of the invalid instruments and oracle properties of the resulting standard IV, or two-stage least squares (2SLS), estimator. This means that the limiting distribution of the estimator is the same as that of the oracle estimator, which is the 2SLS estimator when the set of invalid instruments is known. Guo et al. (2018) proposed a two-stage hard thresholding with voting (HT) method that results in consistent selection of the valid instruments and oracle properties of the 2SLS estimator when the weaker plurality rule holds.
In this paper, we develop an alternative method, which we call the confidence interval (CI) method as presented in Section 3. This method simply selects as valid instruments the largest group of instruments where all CIs of the instrument-specific causal effect estimates overlap, with a tuning parameter varying the width of the CIs. Like the Guo et al. (2018) method, we show that the CI method results in consistent selection and oracle properties of the resulting 2SLS estimator when the plurality rule holds. An advantage of the CI method is that the number of instruments selected as valid decreases monotonically for decreasing values of the tuning parameter, which is not the case for the HT method as we discuss in Section 4. For the CI method, we can therefore use a downward testing procedure based on the Sargan test and a main advantage of this CI method is that it selects the model with the largest number of instruments selected as valid that passes the Sargan test.
While initially making the assumptions of conditional homoskedasticity and strong instruments in Section 2 for ease of exposition, we discuss in Section 5 how to adapt the methods to deal with general forms of heteroskedasticity. We further discuss the first-stage thresholding method of Guo et al. (2018) for dealing with weak instruments in Section 6.
We evaluate the two methods in the Monte Carlo exercise in Section 7, for a design very similar to that in Guo et al. (2018). We find that, overall, the CI method has a better finite sample performance than the HT method in this design. In the application in Section 8, we find that the HT method selects too few instruments as invalid, resulting in models that are rejected by the Sargan test. By design, the CI method selects models that pass the Sargan test. It produces results very similar to the adaptive Lasso method, which suggests that the majority rule is not violated in this application.
We adopt the following notation. $x$ denotes the vector with elements $x_j$. For a general matrix $X$, $X'$ denotes its transpose. All vectors are taken as column vectors, including $X_{i.}$, where the row vector $X_{i.}'$ is the $i$th row of the matrix $X$. For a full column-rank matrix $X$ with $n$ rows, define $P_X = X(X'X)^{-1}X'$, the projection onto the column space of $X$, and $M_X = I_n - P_X$, where $I_n$ is the $n$-dimensional identity matrix. Proofs of Lemma 1 and Theorems 3 and 4 in Section 3 are presented in the Supplementary Appendix A.1.
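The projection notation can be illustrated numerically. The following is a small sketch (not from the paper) using randomly generated data:

```python
import numpy as np

# Small numerical illustration (not from the paper) of the projection
# notation: P_X = X (X'X)^{-1} X' projects onto the column space of X,
# and M_X = I_n - P_X projects onto its orthogonal complement.
rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.standard_normal((n, k))

P_X = X @ np.linalg.solve(X.T @ X, X.T)
M_X = np.eye(n) - P_X

# Both matrices are symmetric and idempotent; P_X reproduces X and M_X
# annihilates it.
print(np.allclose(P_X @ P_X, P_X),
      np.allclose(P_X @ X, X),
      np.allclose(M_X @ X, np.zeros((n, k)), atol=1e-10))
```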

| MODEL AND ASSUMPTIONS
Let the observed outcome for observation $i$ be denoted by the scalar $Y_i$, the treatment or exposure by the scalar $D_i$ and the vector of $k_z$ potential instruments by $Z_{i.}$. The instruments may not all be valid and can have a direct effect on, and/or an indirect association with, the outcome, violating Condition 2 and/or 3. We have a sample $\{Y_i, D_i, Z_{i.}\}_{i=1}^{n}$. We follow Kang et al. (2016) and Guo et al. (2018), who, starting from the additive linear, constant effects model of Holland (1988), arrived at the observed data model for the sample given by
$$Y_i = D_i \beta + Z_{i.}'\alpha + u_i, \qquad (1)$$
where $\beta$ is the causal parameter of interest, and with $E[u_i \mid Z_{i.}] = 0$, but where $D_i$ might be correlated with $u_i$. The parameter vector $\alpha$ represents the possible violations of the exclusion conditions and can be used to formalise the definition of valid IVs as follows (Guo et al. 2018, p. 797).
Definition 1 If $\alpha_j = 0$, then instrument $j$, $j = 1, \ldots, k_z$, is valid, satisfying both Conditions 2 and 3. If $\alpha_j \neq 0$, then instrument $j$ is invalid.
We present some graphical representations of the causal model and possible violations of the exclusion conditions in Appendix A.3.
Let $y$ and $d$ be the $n$-vectors of $n$ observations on $Y_i$ and $D_i$, respectively, and let $Z$ be the $n \times k_z$ matrix of potential instruments. As an intercept is implicitly present in the model, $y$, $d$ and the columns of $Z$ have all been centered by the subtraction of their means. Other covariates can be partialled out in the same way. Let $Z_{\mathcal{V}_0}$ and $Z_{\mathcal{A}_0}$ collect the valid and invalid instruments, $\mathcal{V}_0 = \{j : \alpha_j = 0\}$, $\mathcal{A}_0 = \{j : \alpha_j \neq 0\}$, with dimensions $k_{\mathcal{V}_0}$ and $k_{\mathcal{A}_0}$, respectively, and $k_z = k_{\mathcal{V}_0} + k_{\mathcal{A}_0}$. $\mathcal{Z} = \{1, \ldots, k_z\}$ denotes the full set, and so $\mathcal{V}_0 = \mathcal{Z} \setminus \mathcal{A}_0$.
The oracle model is then given by
$$y = d\beta + Z_{\mathcal{A}_0}\alpha_{\mathcal{A}_0} + u. \qquad (2)$$
We initially assume that all instruments satisfy Condition 1, implying that the $k_z$ elements $\gamma_j$ of $\gamma$ in the first-stage model $D_i = Z_{i.}'\gamma + \varepsilon_i$ are all different from 0:
Assumption 1 $\gamma_j \neq 0$ for all $j = 1, \ldots, k_z$.
This is the same assumption as in Kang et al. (2016) and Windmeijer et al. (2019). Guo et al. (2018) relaxed this assumption and proposed a first-stage hard thresholding procedure to consistently select only instruments with $\gamma_j \neq 0$. We will discuss this further in Section 6 and apply this first-stage thresholding in our application.
Then define the per-instrument parameters
$$\beta_j = \frac{\Gamma_j}{\gamma_j}, \qquad (7)$$
for $j = 1, \ldots, k_z$, where $\Gamma = \gamma\beta + \alpha$ is the parameter vector of the reduced-form model for the outcome. It follows from Definition 1 and Assumption 1 that for valid instruments, $j \in \mathcal{V}_0$, $\beta_j = \beta$. Following Theorem 1 in Kang et al. (2016) and Guo et al. (2018), a necessary and sufficient condition to identify $\beta$ and the $\alpha_j$, given $\Gamma$ and $\gamma$, is that the valid instruments form the largest group, where instruments form a group if they have the same value for $\beta_j$. This is the plurality rule. As in Guo et al. (2018), we maintain the assumption that this condition is satisfied:
Assumption 2 (Plurality rule) Let the groups $\mathcal{G}_g$ collect the instruments with equal values of $\beta_j$. Then the group of valid instruments, $\mathcal{V}_0$, is strictly the largest of these groups.
For models (1) and (6), we further assume that the following standard conditions hold:

Assumption 4 Let $Q = E\left[Z_{i.} Z_{i.}'\right]$. The elements of $Q$ are finite and $Q$ is of full rank.
While Assumption 5 holds if the observations are i.i.d., as the moments are assumed to exist, these conditions further hold under various weak dependence assumptions, see Staiger and Stock (1997, p. 560).
Note that conditional homoskedasticity, $E[w_i w_i' \mid Z_{i.}] = \Sigma_w$ for a constant matrix $\Sigma_w$, is implicit in Assumption 6. We make this assumption primarily for ease of exposition and will relax it in Section 5.
The plurality rule, Assumption 2, is the main assumption on the instruments needed to establish oracle properties for the CI method described below and the HT method of Guo et al. (2018). In particular, the values of $\gamma_j$ and $\alpha_j$ can be arbitrary and arbitrarily correlated. The CI and HT methods are robust to any such correlation. Alternatively, the methods of Kolesár et al. (2015) and Bowden, Smith and Burgess (2015) do not make the plurality assumption and allow all instruments to be invalid. A bias-corrected 2SLS estimator is then consistent under the INstrument Strength Independent of Direct Effect (INSIDE) assumption that $\mathrm{Cov}(\gamma_j, \alpha_j) = 0$, together with the requirement that the number of instruments increases with the sample size. Guo et al. (2018) provide a discussion of and comparison to these methods, also including alternative methods proposed by Bowden and co-authors.

| THE CONFIDENCE INTERVAL METHOD
From the plurality rule Assumption 2, it follows that consistent instrument selection procedures can be based on consistent and asymptotically normal estimators of the parameters $\beta_j$ as defined in (7). Groups of instruments are then formed by similar estimates $\widehat{\beta}_j$ and, in large samples, the largest group will constitute the group of valid instruments under Assumption 2. While in principle all combinations of instruments could be tested separately, see Andrews (1999), in practice this may not be feasible when there is a large number of instruments. The Guo et al. (2018) method, as described further in Section 4, reduces the dimensionality of the problem by essentially performing $k_z(k_z - 1)/2$ pairwise tests of the null $H_0: \beta_j = \beta_k$, combined with a voting scheme to group the instruments.
A clear reduction of the dimensionality of the problem is achieved by alternatively considering testing $H_0: \beta_j = b_g$, for a grid $\{b_g\}$ spanning the possible values of $\beta$, and selecting as the set of valid instruments the largest set over all values of $g$ for which a particular value $b_g$ is not rejected. The CI method operationalises this idea without having to consider the grid points $b_g$ by grouping together instruments with overlapping CIs.
Let $\widehat{\Gamma}$ and $\widehat{\gamma}$ be the OLS estimators for $\Gamma$ and $\gamma$ in the reduced-form model specifications. Under Assumptions 3-6, these estimators are consistent and asymptotically normal. Following Guo et al. (2018), let an estimator for $\beta_j$ be
$$\widehat{\beta}_j = \frac{\widehat{\Gamma}_j}{\widehat{\gamma}_j}; \qquad (9)$$
then it follows, using the delta method, that $\sqrt{n}\left(\widehat{\beta}_j - \beta_j\right) \xrightarrow{d} N\left(0, \sigma_j^2\right)$, with $\sigma_j^2$ as defined in (10). An estimator for the variance of $\widehat{\beta}_j$ is then given by the corresponding sample expression in (11). We show in Appendix A.5 that $\widehat{\beta}_j$ is identical to the 2SLS estimator of $\beta_j$ in the just-identified model
$$y = d\beta_j + Z_{\{-j\}}\delta + u, \qquad (12)$$
where $Z_{\{-j\}} = Z \setminus Z_{.j}$, using $Z_{.j}$ as the instrument for $d$. This therefore implies that $\widehat{\beta}_j$ is the IV estimator for $\beta$ based on instrument $Z_{.j}$ while treating all other instruments as invalid. The variance estimator $\widehat{\mathrm{Var}}\left(\widehat{\beta}_j\right)$ as defined in (11) is also the same as the standard 2SLS variance estimator in the just-identified model (12).
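The equivalence between the ratio of reduced-form OLS coefficients and the just-identified 2SLS estimator can be checked numerically. The following is a sketch under simulated, hypothetical data, not code from the paper:

```python
import numpy as np

# Sketch with simulated data: beta_hat_j = Gamma_hat_j / gamma_hat_j, the
# ratio of the reduced-form OLS coefficients, equals the 2SLS estimate of
# beta_j in the just-identified model that uses Z[:, j] as the instrument
# for d and includes all other columns of Z as regressors.
rng = np.random.default_rng(1)
n, kz = 500, 5
Z = rng.standard_normal((n, kz))
u = rng.standard_normal(n)
d = Z @ np.full(kz, 0.5) + 0.8 * u + rng.standard_normal(n)
alpha = np.array([0.4, 0.0, 0.0, 0.0, 0.0])   # instrument 0 is invalid
y = 1.0 * d + Z @ alpha + u                   # true causal effect beta = 1

def ols(X, v):
    return np.linalg.solve(X.T @ X, X.T @ v)

Gamma_hat = ols(Z, y)              # reduced form for the outcome
gamma_hat = ols(Z, d)              # first stage
beta_hat = Gamma_hat / gamma_hat   # per-instrument estimates

# Just-identified 2SLS for instrument j: regressors [d, Z_{-j}],
# instrument matrix Z.
j = 2
X = np.column_stack([d, np.delete(Z, j, axis=1)])
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
print(np.isclose(b_iv[0], beta_hat[j]))   # the two estimates coincide
```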
The CI method is a fast method that consistently selects the valid instruments. Let $v_j = \sqrt{\widehat{\mathrm{Var}}\left(\widehat{\beta}_j\right)}$. Given a value $\lambda_n$, define the confidence interval $ci_j(\lambda_n)$ for $\beta_j$ as
$$ci_j(\lambda_n) = \left[\widehat{\beta}_j - \lambda_n v_j,\ \widehat{\beta}_j + \lambda_n v_j\right], \qquad (13)$$
for $j = 1, \ldots, k_z$. The following lemma gives the conditions on $\lambda_n$ under which, as $n \to \infty$, all CIs within groups $\mathcal{G}_g$ will overlap with each other, whereas none of the CIs in different groups will overlap with each other.
Lemma 1 Let the groups $\mathcal{G}_g$ be as defined in Assumption 2 and the confidence intervals $ci_j(\lambda_n)$, $j = 1, \ldots, k_z$, as defined in (13). Then, under Assumptions 1 and 3-6, for $n \to \infty$, $\lambda_n \to \infty$, $\lambda_n = o(n^{1/2})$, and for all $g$, all confidence intervals $ci_j(\lambda_n)$ within a group, $j \in \mathcal{G}_g$, will overlap with each other, whereas none of the confidence intervals in different groups, $ci_j(\lambda_n)$, $ci_{j'}(\lambda_n)$, $j \in \mathcal{G}_g$, $j' \in \mathcal{G}_{g'}$, $g \neq g'$, will overlap with each other.
We can use the results of Lemma 1 to obtain a selection rule that consistently selects the valid instruments as valid, with the resulting 2SLS estimator having oracle properties. For any value $\lambda_n$, classify the instruments in groups $\mathcal{V}^{over}_t(\lambda_n)$, for $t = 1, \ldots, T_{\lambda_n}$, with $1 \leq T_{\lambda_n} \leq k_z$. For members $j \in \mathcal{V}^{over}_t(\lambda_n)$, all $ci_j(\lambda_n)$ overlap with each other. Only the largest of such groups are considered, and not their subdivisions. If, for example, all $k_z$ CIs overlap with each other, then $T_{\lambda_n} = 1$. It is clear from this definition that instruments can be members of multiple groups, and a group can be a singleton. For any value $\lambda_n$, we then select as the group of valid instruments the largest group, denoted $\widehat{\mathcal{V}}_{\lambda_n}$, defined as
$$\widehat{\mathcal{V}}_{\lambda_n} = \arg\max_{t}\ \left|\mathcal{V}^{over}_t(\lambda_n)\right|. \qquad (14)$$
Note that for any value of $\lambda_n$, there may be multiple groups with the largest number of overlapping CIs. If that is the case, at this point we simply randomly select one of these in order to have a single set of instruments for each $\lambda_n$. We will discuss selection using the Sargan test in Section 3.1.
The next theorem states the conditions under which the selection $\widehat{\mathcal{V}}_{\lambda_n}$ is consistent, which follows directly from the results of Lemma 1, as $\mathcal{V}_0$ is the largest group by Assumption 2.
Theorem 1 Let the $\widehat{\beta}_j$ be defined as in (9) and their confidence intervals as in (13). Let $\widehat{\mathcal{V}}_{\lambda_n}$ be one of the largest groups of instruments for which all confidence intervals overlap with each other, as defined in (14). For $n \to \infty$, $\lambda_n \to \infty$, $\lambda_n = o(n^{1/2})$, and under Assumptions 1-6, it follows that
$$\lim_{n \to \infty} P\left(\widehat{\mathcal{V}}_{\lambda_n} = \mathcal{V}_0\right) = 1.$$
For any value $\lambda_n$, the sets of overlapping CIs can easily and rapidly be obtained as follows.
Algorithm 1 Denote the lower and upper endpoints of $ci_j(\lambda_n)$ as defined in (13) by $cil_j(\lambda_n)$ and $ciu_j(\lambda_n)$. Order the confidence intervals in ascending order of the lower endpoints, and use the notation $cil_{[j]}(\lambda_n)$ and $ciu_{[j]}(\lambda_n)$ for the ordered intervals. For each $j$, let $no_{[j]}(\lambda_n)$ denote the number of confidence intervals that contain the point $cil_{[j]}(\lambda_n)$.
Then the largest set(s) of overlapping intervals are those associated with the maximum value of $no_{[j]}(\lambda_n)$.
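The grouping step above can be sketched in a few lines. The helper below is a hypothetical implementation, not the paper's code: a set of one-dimensional intervals mutually overlaps exactly when the intervals share a common point, and such a point can always be taken to be one of the lower endpoints, so it suffices to count, at each lower endpoint, how many intervals cover it. A sorted sweep would give the stated $O(k_z \log k_z)$ cost; the naive version is shown for clarity:

```python
import numpy as np

def largest_overlap_groups(beta_hat, v, lam):
    """Largest set(s) of instruments whose confidence intervals
    [beta_j - lam*v_j, beta_j + lam*v_j] all mutually overlap.
    Intervals mutually overlap iff they share a common point, and such
    a point can always be taken to be one of the lower endpoints."""
    lo = beta_hat - lam * v
    hi = beta_hat + lam * v
    best, groups = 0, []
    for p in np.sort(lo):
        members = [j for j in range(len(lo)) if lo[j] <= p <= hi[j]]
        if len(members) > best:
            best, groups = len(members), [members]
        elif len(members) == best and members not in groups:
            groups.append(members)
    return groups

# Toy example: three estimates near 0, two near 1; with lam = 1 the
# largest group of mutually overlapping CIs is {0, 1, 2}.
beta_hat = np.array([0.0, 0.1, -0.05, 1.0, 1.08])
v = np.full(5, 0.1)
print(largest_overlap_groups(beta_hat, v, 1.0))   # [[0, 1, 2]]
```

Shrinking the tuning parameter only narrows the intervals, so the size of the largest group can never increase as `lam` decreases, which is the monotonicity property discussed in the text.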
For the sequences $n \to \infty$, $\lambda_n \to \infty$, $\lambda_n = o(n^{1/2})$, it follows from the results of Lemma 1 and Theorem 1 that $\widehat{\mathcal{V}}_{\lambda_n}$ as defined in (14) converges to the unique set $\mathcal{V}_0$. It is therefore immaterial for consistent selection and oracle properties how we choose the set $\widehat{\mathcal{V}}_{\lambda_n}$ for those values of $\lambda_n$ where there are multiple groups with the largest number of overlapping CIs. We can extend the range of sequences $\lambda_n$ if we choose in that case the group with the minimum value of the Sargan test, as we show next.

| Sargan test
For the oracle model (2), the Sargan (1958) test statistic is given by
$$S\left(\widehat{\beta}_{\mathcal{V}_0}\right) = \frac{\widehat{u}' P_Z \widehat{u}}{\widehat{u}'\widehat{u}/n}, \qquad (15)$$
where $\widehat{u}$ is the vector of 2SLS residuals. As $E[Z_{i.} u_i] = 0$, and for $k_{\mathcal{A}_0} < k_z$, it follows under Assumptions 1 and 3-6 that $S$ converges in distribution to a $\chi^2_{k_{\mathcal{V}_0} - 1}$ distributed random variable. The results of the CI selection method can be linked to the behaviour of the Sargan test statistic, as it follows from the results of Theorems 1 and 2 that, under the conditions of Theorem 1, the test based on the selected set behaves asymptotically like the oracle test. We can now allow for a wider range of values of the sequence $\lambda_n$ if we select from the groups with the largest number of overlapping CIs the one with the minimum value of the Sargan test statistic. Let $M_{\lambda_n}$ denote the number of groups with the largest number of overlapping CIs, the collection of these groups denoted by $\widehat{\mathcal{V}}^{max}_{\lambda_n}$, and define
$$\widehat{\mathcal{V}}^{sar}_{\lambda_n} = \arg\min_{\mathcal{V} \in \widehat{\mathcal{V}}^{max}_{\lambda_n}} S\left(\widehat{\beta}_{\mathcal{V}}\right). \qquad (16)$$
The next theorem gives the conditions for consistent selection and oracle properties when selecting $\widehat{\mathcal{V}}^{sar}_{\lambda_n}$ as the set of valid instruments.
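The Sargan statistic can be sketched numerically as follows. This is a hypothetical example (not code from the paper), with all instruments valid, so that the statistic should behave like an ordinary χ² draw:

```python
import numpy as np

def sargan_stat(y, X, Z):
    """2SLS of y on X with instrument matrix Z, followed by the Sargan
    statistic S = uhat' P_Z uhat / (uhat' uhat / n)."""
    A = np.linalg.solve(Z.T @ Z, Z.T @ X)                 # (Z'Z)^{-1} Z'X
    b = np.linalg.solve(X.T @ Z @ A, A.T @ (Z.T @ y))     # 2SLS estimate
    uhat = y - X @ b
    Ztu = Z.T @ uhat
    S = Ztu @ np.linalg.solve(Z.T @ Z, Ztu) / (uhat @ uhat / len(y))
    return b, S

# Hypothetical data: six valid instruments, true beta = 1.
rng = np.random.default_rng(2)
n, kz = 2000, 6
Z = rng.standard_normal((n, kz))
u = rng.standard_normal(n)
d = Z @ np.full(kz, 0.6) + 0.5 * u + rng.standard_normal(n)
y = 1.0 * d + u

b, S = sargan_stat(y, d[:, None], Z)
# Under the null, S is asymptotically chi-squared with kz - 1 = 5
# degrees of freedom, and b should be close to the true beta of 1.
```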
Theorem 3 Let the $\widehat{\beta}_j$ be defined as in (9) and their confidence intervals as in (13). Let $\widehat{\mathcal{V}}^{sar}_{\lambda_n}$ be as defined in (16). For $n \to \infty$, $\lambda_n \to \infty$ with $\lambda_n \leq c\, n^{1/2}$ for some finite constant $c$, and under Assumptions 1-6, it follows that
$$\lim_{n \to \infty} P\left(\widehat{\mathcal{V}}^{sar}_{\lambda_n} = \mathcal{V}_0\right) = 1,$$
and the resulting 2SLS estimator has oracle properties.

| Downward testing procedure
From the results of Theorem 3, we can devise a downward testing procedure as in Andrews (1999), reducing the dimension of the problem by evaluating only the models that select the sets with the largest number of overlapping CIs as valid instruments. The Andrews (1999) downward testing procedure uses the Sargan test statistic as a selection device for the consistent selection of the valid instruments. It starts with the model that selects all $k_z$ instruments as valid. If the Sargan test rejects this model, then the procedure next evaluates the $k_z$ models with $k_z - 1$ instruments selected as valid, treating each instrument in turn as invalid. If the minimum of the $k_z$ Sargan test statistics does not reject the null, then the associated model is selected as the valid model. If the minimum rejects the null, then all $\binom{k_z}{2}$ models with $k_z - 2$ instruments selected as valid are evaluated. This is repeated until a model with $k_z - k_{\mathcal{A}} - 1$ degrees of freedom has a Sargan test result that does not reject the null hypothesis. Denote by $S_{min}(k_{\mathcal{A}})$ the minimum of the $\binom{k_z}{k_{\mathcal{A}}}$ Sargan statistics of all possible models with $k_{\mathcal{A}}$ instruments selected as invalid. Then, if the critical values $\zeta_{n, k_z - k_{\mathcal{A}} - 1}$ of the $\chi^2_{k_z - k_{\mathcal{A}} - 1}$ distributions are such that $\zeta_n \to \infty$ and $\zeta_n = o(n)$, it follows from the results in Andrews (1999), based on a result in Pötscher (1983), that, under Assumptions 1-6, $\lim_{n\to\infty} P\left(\widehat{\mathcal{A}}_{ns} = \mathcal{A}_0\right) = 1$, or equivalently, $\lim_{n\to\infty} P\left(\widehat{\mathcal{V}}_{ns} = \mathcal{V}_0\right) = 1$, with $\widehat{\mathcal{V}}_{ns} = \mathcal{Z} \setminus \widehat{\mathcal{A}}_{ns}$.
In order to use the CI method to reduce the dimension of the downward testing procedure, consider the set of breakpoints
$$\lambda^*_{j,r} = \frac{\left|\widehat{\beta}_j - \widehat{\beta}_r\right|}{v_j + v_r}, \qquad (18)$$
for $j = 1, \ldots, k_z - 1$, $r = j + 1, \ldots, k_z$. From Algorithm 1 it follows that for $\lambda_n \leq \lambda^*_{j,r}$, $ci_j(\lambda_n)$ and $ci_r(\lambda_n)$ do not overlap, whereas they do when $\lambda_n > \lambda^*_{j,r}$. Let $\lambda^*_{[k_z - 1]} = \max_{j,r} \lambda^*_{j,r}$. For $\lambda_n > \lambda^*_{[k_z - 1]}$, all $k_z$ confidence intervals overlap. At $\lambda_n = \lambda^*_{[k_z - 1]}$, the number of overlapping CIs in the largest groups drops by one to $k_z - 1$, and there will be two groups, denoted as before by $\widehat{\mathcal{V}}^{max}_{\lambda_n}$. The next breakpoint where the size of the largest groups of overlapping CIs is equal to $k_z - 2$ is the minimum of the maximum of the breakpoints (18) in the two largest groups of size $k_z - 1$; at values of $\lambda_n$ between these breakpoints, the maximum group size remains $k_z - 1$. Then at $\lambda_n = \lambda^*_{[k_z - 2]}$, there will be $2 \leq M_{\lambda^*_{[k_z - 2]}} \leq 3$ groups with the maximum $k_z - 2$ overlapping CIs, and the next breakpoint where the size of the largest groups of overlapping CIs is equal to $k_z - 3$ is again determined by the minimum of the maxima of the breakpoints (18) within these groups. With this strategy, there is a maximum of $k_z(k_z - 1)/2$ models to be evaluated. Together with the use of Algorithm 1, which has a computational cost of $O(k_z \log k_z)$, at at most $k_z - 2$ breakpoints, the computational cost of this downward testing algorithm is of the order $O(k_z^2 \log k_z)$. We give a stepwise description of the full downward testing algorithm in Appendix A.4, together with an illustration using a single generated data set. Under the plurality Assumption 2, the CI downward testing procedure will consistently select the set of valid instruments. In any application it may well be the case that multiple sets of maximum size are found for which the Sargan test statistics do not reject the null.
The method of Andrews (1999) is then to select the model with the minimum value of the Sargan test statistic among these models with the same degrees of freedom, which is what the downward testing selection replicates. In practice, however, a researcher should acknowledge the fact that there are multiple such models, which could be an indication of a violation of Assumption 2, and investigate their results, which could lead to additional insights on the possible pathways from instruments to exposure and from exposure to outcomes. While the CI method achieves dimension reduction by ignoring the covariances between the estimators $\widehat{\beta}_j$ when constructing the sets with overlapping CIs, by using the downward Sargan-based testing procedure the selected model is the one with the largest number of instruments with overlapping CIs for which the joint null hypothesis is not rejected, incorporating the full covariance structure.
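The breakpoints (18) have a simple closed form under the CI construction used here: two intervals with half-widths $\lambda v_j$ and $\lambda v_r$ around $\widehat{\beta}_j$ and $\widehat{\beta}_r$ first touch where $\lambda = |\widehat{\beta}_j - \widehat{\beta}_r| / (v_j + v_r)$. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Sketch of the breakpoints: ci_j(lam) and ci_r(lam) overlap iff
# |beta_j - beta_r| <= lam * (v_j + v_r), so they first touch at
# lam*_{j,r} = |beta_j - beta_r| / (v_j + v_r).  The number of mutually
# overlapping CIs can therefore only change at these kz(kz-1)/2 values,
# which is what the downward testing procedure exploits.
beta_hat = np.array([0.0, 0.1, 1.0])
v = np.array([0.1, 0.1, 0.2])

breakpoints = {
    (j, r): abs(beta_hat[j] - beta_hat[r]) / (v[j] + v[r])
    for j in range(len(beta_hat)) for r in range(j + 1, len(beta_hat))
}

def overlap(lam, j, r):
    return abs(beta_hat[j] - beta_hat[r]) <= lam * (v[j] + v[r])

# Just above a breakpoint the two intervals overlap, just below they
# do not.
lam_star = breakpoints[(0, 2)]
print(overlap(lam_star + 1e-9, 0, 2), overlap(lam_star - 1e-9, 0, 2))
```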

| HARD THRESHOLDING METHOD
Consider next pairwise testing of the null hypotheses $H_0: \beta_j = \beta_k$, $j = 1, \ldots, k_z - 1$; $k = j + 1, \ldots, k_z$. These are equivalent to $H_0: \Gamma_j/\gamma_j = \Gamma_k/\gamma_k$, and a reformulation is given by $H_0: \pi^{[j]}_k = 0$, where $\pi^{[j]}_k = \Gamma_k - \left(\Gamma_j/\gamma_j\right)\gamma_k$. Guo et al. (2018) use the latter as the basis for their pairwise testing using Wald test statistics. Unlike the score test, the Wald test is not invariant to the reformulation of a non-linear restriction, see for example Davidson and MacKinnon (2004, pp. 422-424), and while the Wald tests for $H_0: \beta_j = \beta_k$ are symmetric, this is not the case for $H_0: \pi^{[j]}_k = 0$. As we discuss below in Section 4.3, the score test here is the same as the Sargan test for overidentifying restrictions when $Z_{.j}$ and $Z_{.k}$ are the excluded instruments.

An estimator for $\pi^{[j]}_k$ is given by
$$\widehat{\pi}^{[j]}_k = \widehat{\Gamma}_k - \frac{\widehat{\Gamma}_j}{\widehat{\gamma}_j}\,\widehat{\gamma}_k.$$
It follows from the delta method that $\sqrt{n}\left(\widehat{\pi}^{[j]}_k - \pi^{[j]}_k\right) \xrightarrow{d} N\left(0, \sigma^2_{[j],k}\right)$, where $\sigma^2_{[j],k}$ is a function of $\sigma^2_j$ as defined in (10). Let $\widehat{\mathcal{V}}^{[j]}(\lambda_n)$ denote expert $j$'s set of instruments $k = 1, \ldots, k_z$ for which $H_0: \pi^{[j]}_k = 0$ is not rejected using critical value, or threshold, $\lambda_n$. Note that instrument $j$ is always contained in $\widehat{\mathcal{V}}^{[j]}(\lambda_n)$. It follows that, for $n \to \infty$, $\lambda_n \to \infty$, $\lambda_n = o(n^{1/2})$, each $\widehat{\mathcal{V}}^{[j]}(\lambda_n)$ converges to the group of instruments with $\beta_k = \beta_j$. As these are not joint, but only pairwise comparisons, Guo et al. (2018) propose a majority and plurality voting scheme to consistently obtain the set of valid instruments. In their terminology, $\widehat{\mathcal{V}}^{[j]}(\lambda_n)$ is expert $j$'s ballot that contains expert $j$'s opinion about which instruments are valid. The number of votes an instrument $k$ gets is given by
$$VM_k = \sum_{j=1}^{k_z} 1\left[k \in \widehat{\mathcal{V}}^{[j]}(\lambda_n)\right].$$
The majority rule then selects an instrument as valid if it gets a vote from more than 50% of the experts. The group of instruments selected as valid is then given by $\widehat{\mathcal{V}}_M = \{k : VM_k > k_z/2\}$. If none of the instruments gets a majority vote, the plurality rule is applied, which defines the set of instruments selected as valid by $\widehat{\mathcal{V}}_P = \{k : VM_k = \max_l VM_l\}$. Let $\widehat{\mathcal{V}}^{HT}_{\lambda_n} = \widehat{\mathcal{V}}_M \cup \widehat{\mathcal{V}}_P$; then Guo et al. (2018, pp. 13-14) show that, under Assumptions 1-6, this selection is consistent. Guo et al. (2018) do not treat $\lambda_n$ as a classical tuning parameter and do not specify the rate $\lambda_n \to \infty$, $\lambda_n = o(n^{1/2})$, as obtained for results (23) and (24) above. They set $\lambda_n = \sqrt{2.01^2 \log\left(\max(k_z, n)\right)}$, which in the setting here, with fixed $k_z$ and $n > k_z$, would lead to $\lambda_n = \sqrt{2.01^2 \log(n)}$. The motivation seems to be from the fact that there are $k_z(k_z - 1)$ statistics $t^{[j]}_k$.
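The voting scheme just described can be sketched independently of how the pairwise statistics are computed. In the sketch below, the ballots (which instruments each expert does not reject) are supplied directly as a hypothetical boolean matrix:

```python
import numpy as np

def ht_vote(ballots):
    """Majority/plurality voting on a kz-by-kz boolean ballot matrix,
    where ballots[j, k] = True means expert j votes instrument k valid.
    The selected set is the union of the majority-rule set and the
    plurality-rule set, following the description above."""
    kz = ballots.shape[0]
    votes = ballots.sum(axis=0)
    vm = np.where(votes > kz / 2)[0]           # majority rule
    vp = np.where(votes == votes.max())[0]     # plurality rule
    return np.union1d(vm, vp)

# Hypothetical ballots for five instruments (0-based labels): instruments
# 1 and 2 receive three votes each, 0 and 3 two votes, 4 one vote; each
# expert votes for itself.
ballots = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
], dtype=bool)
print(ht_vote(ballots))   # instruments 1 and 2 are selected
```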

| Choice of tuning parameter
If the $t^{[j]}_k$ were all independent $N(0,1)$ distributed random variables, then the maximum of their absolute values would be of order $\sqrt{2\log\left(k_z(k_z - 1)\right)}$ for an increasing number of instruments $k_z$, see Donoho and Johnstone (1994). For the fixed-$k_z$ case considered here, if the $t^{[j]}_k$ were independent $N(0,1)$ distributed random variables, the distribution of this maximum would not depend on $n$. It is unclear how this translates into an optimal choice of $\lambda_n$ as a function of $n$, even if the $t^{[j]}_k$ were independently distributed, which they are clearly not. We find in the Monte Carlo experiments below that the value $\lambda_n = \sqrt{2.01^2 \log(n)}$ can be much too large, resulting in the selection of a large group of instruments as valid that includes invalid instruments. Guo et al. (2018, p. 800) state that in practice the $\max(k_z, n)$ is often replaced by $k_z$ or $n$ to improve the finite sample performance.
In the R routine TSHT.R, Kang (2018), the default threshold parameter for the low-dimensional setting is set equal to $\lambda = \sqrt{2.01^2 \log k_z}$, in line with the results above. In principle, this choice of $\lambda$ does not lead to consistent selection for fixed $k_z$ and $n \to \infty$. In their Monte Carlo simulations, Guo et al. (2018) instead set $\lambda = \sqrt{2.01 \log k_z}$. We will use these latter two values to evaluate the performance of the hard thresholding method in the simulations and application below.

| Voting
The Guo et al. (2018) method achieves dimension reduction by pairwise testing of $H_0: \pi^{[j]}_k = 0$ and the voting mechanism. A weakness of the voting scheme is that it does not have a mechanism to choose between sets of instruments when there are ties, and the number of instruments selected as valid is not guaranteed to be monotonically decreasing for decreasing values of $\lambda_n$. Consider the example as depicted in Table 1. There are five potential instruments. In the left panel of the table, for a value $\lambda_1$ of the tuning parameter, instruments 2 and 3 both get three votes, including the votes for themselves, whereas instruments 1 and 4 get two votes and instrument 5 only one vote. Hence, instruments 2 and 3 are selected as valid.

| HETEROSKEDASTICITY
Let $w_i$ denote the vector of reduced-form errors and $\widehat{\theta} = \left(\widehat{\Gamma}', \widehat{\gamma}'\right)'$; then a robust estimator of $\mathrm{Var}\left(\widehat{\theta}\right)$ is given by the standard sandwich formula, and straightforward application of the delta method results in robust variance estimators $\widehat{\mathrm{Var}}_r\left(\widehat{\beta}_j\right)$ and $\widehat{\mathrm{Var}}_r\left(\widehat{\pi}^{[j]}_k\right)$. For the CI method, instead of using the Sargan test for selection, a robust score test needs to be used, like the two-step Hansen J-test (Hansen, 1982). For the oracle model (2), with $X = \left(d \;\; Z_{\mathcal{A}_0}\right)$, the two-step GMM estimator is given by
$$\widehat{\beta}_{\mathcal{V}_0,2} = \left(X' Z\,\widehat{W}_n\,Z' X\right)^{-1} X' Z\,\widehat{W}_n\,Z' y, \qquad \widehat{W}_n = \left(\sum_{i=1}^{n} Z_{i.} Z_{i.}'\,\widehat{u}_{i,1}^2\right)^{-1},$$
where the $\widehat{u}_{i,1}$ are the residuals from an initial one-step estimator $\widehat{\beta}_{\mathcal{V}_0,1}$, for example the 2SLS estimator. Let $\widehat{u}_2 = y - X\widehat{\beta}_{\mathcal{V}_0,2}$; then the Hansen J-test statistic is given by
$$J = \widehat{u}_2'\, Z\,\widehat{W}_n\,Z'\, \widehat{u}_2 \xrightarrow{d} \chi^2_{k_{\mathcal{V}_0} - 1},$$
thus generalising the result for the Sargan test under conditional homoskedasticity to the case of general heteroskedasticity.
As the oracle estimator, we can then specify the 2SLS estimator with robust standard errors, or the efficient two-step GMM estimator.
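The two-step GMM estimator and Hansen J statistic can be sketched with standard textbook formulas; the data below are hypothetical, simulated with heteroskedastic errors:

```python
import numpy as np

def two_step_gmm(y, X, Z):
    """One step: 2SLS.  Two step: efficient GMM with weight matrix
    W = (sum_i Z_i Z_i' uhat1_i^2)^{-1} built from the one-step
    residuals; J = uhat2' Z W Z' uhat2 is the Hansen J statistic."""
    A = np.linalg.solve(Z.T @ Z, Z.T @ X)
    b1 = np.linalg.solve(X.T @ Z @ A, A.T @ (Z.T @ y))    # 2SLS
    u1 = y - X @ b1
    W = np.linalg.inv((Z * (u1 ** 2)[:, None]).T @ Z)     # robust weight
    b2 = np.linalg.solve(X.T @ Z @ W @ Z.T @ X, X.T @ Z @ W @ Z.T @ y)
    u2 = y - X @ b2
    J = (Z.T @ u2) @ W @ (Z.T @ u2)
    return b2, J

# Hypothetical data: four valid instruments, heteroskedastic errors.
rng = np.random.default_rng(3)
n, kz = 2000, 4
Z = rng.standard_normal((n, kz))
u = rng.standard_normal(n) * (1.0 + 0.5 * np.abs(Z[:, 0]))
d = Z @ np.full(kz, 0.6) + 0.5 * u + rng.standard_normal(n)
y = 1.0 * d + u

b2, J = two_step_gmm(y, d[:, None], Z)
# Under the null of valid instruments, J is asymptotically chi-squared
# with kz - 1 = 3 degrees of freedom.
```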

| WEAK INSTRUMENTS
The relevance Assumption 1 states that $\gamma_j \neq 0$ for all $j = 1, \ldots, k_z$. In our application we use 96 single nucleotide polymorphisms (SNPs) as potential instruments for BMI to investigate its effect on blood pressure. These SNPs have been found to be associated with BMI in independent genome-wide association studies (GWAS), see Locke et al. (2015). While the assumption is therefore very likely to be valid, it may well be the case that in our sample individual instruments are weak in the sense that they explain only a small amount of the variation in the exposure.
The presence of many weak instruments leads to bias in the 2SLS estimator. This many weak instrument bias is much less for the Limited Information Maximum Likelihood (LIML) and Continuously Updated GMM (CU-GMM) estimators, see Davies et al. (2015) and the references therein. Analogous to the problem of heteroskedasticity discussed in the previous section, to counter a potential many weak instruments bias problem of the 2SLS estimator, the CI and HT methods can estimate the parameters by LIML or CU-GMM, with the CI method adjusting the Sargan or Hansen test statistic accordingly.
For the selection of valid instruments, a very weak invalid instrument could often be classified as valid by the CI method due to its large standard error, and can change the selection in the HT method by giving votes to a large number of instruments. In order to overcome this selection problem with weak instruments, Guo et al. (2018) proposed a first-stage hard thresholding for $H_0: \gamma_j = 0$, classifying instruments as uninformative, and treating them as invalid, if
$$\frac{\left|\widehat{\gamma}_j\right|}{\sqrt{\widehat{\mathrm{Var}}\left(\widehat{\gamma}_j\right)}} \leq \lambda_n,$$
with $\lambda_n = \sqrt{2.01 \log\left(\max(k_z, n)\right)}$, and where $\widehat{\mathrm{Var}}\left(\widehat{\gamma}_j\right)$ can be a robust variance estimator in case of heteroskedasticity. As with the setting of $\lambda_n$ discussed in Section 4.1, the threshold parameter can in practice be set to $\lambda_n = \sqrt{2.01 \log k_z}$. Note that when the same sample is used to select instruments and to estimate the causal effect, a selected instrument's estimated effect on the exposure may be larger than it otherwise would have been, and it has a larger chance of crossing the first-stage threshold.

| SOME MONTE CARLO RESULTS
In order to illustrate how the CI and HT methods utilise the available information, following the discussion in Sections 3 and 4, we consider a design similar to that in Guo et al. (2018, Table 2), who considered a setting with a small number of potential instruments, $k_z = 7$, in a design where the majority rule is violated but the plurality rule holds. We consider here such a setting but with a larger number of potential instruments, $k_z = 21$. We present a replication of their $k_z = 7$ design in Appendix A.7. The data are generated from models (1) and (6), with $\beta = 1$; $k_z = 21$; the correlation of the structural errors equal to 0.25; $k_{\mathcal{A}_0} = 12$; $\alpha = c_a\left(\iota_6',\ 0.5\,\iota_6',\ 0_9'\right)'$ and $\gamma = c_\gamma\,\iota_{k_z}$, where $\iota_r$ is an $r$-vector of ones and $0_r$ is an $r$-vector of zeros. There are therefore three groups of instruments, $\mathcal{G}_{c_a/c_\gamma} = \{1, 2, \ldots, 6\}$, $\mathcal{G}_{0.5 c_a/c_\gamma} = \{7, 8, \ldots, 12\}$ and $\mathcal{G}_0 = \{13, 14, \ldots, 21\}$. $\mathcal{G}_0$ is the largest group, and so the plurality rule holds, but not the majority rule. The elements of $\Sigma_z$ are given by $\Sigma_{z,jk} = \rho_z^{|j-k|}$. We set $\rho_z = 0.5$ and $c_a = c_\gamma = 0.4$. As in Guo et al. (2018), in this setting all instruments are strong, and the first-stage thresholding is omitted. Note that this simple design represents invalid instruments with a direct effect on the outcome of the type $Z_1$ as displayed in Figure A1 in Appendix A.3.
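The grouping structure of this design can be sketched directly from the probability limits $\beta_j = \beta + \alpha_j/\gamma_j$ (taking $\beta = 1$ as an assumed value for illustration):

```python
import numpy as np

# Sketch of the Section 7 design (beta = 1 assumed): the per-instrument
# probability limits beta_j = beta + alpha_j / gamma_j fall into three
# groups, with the valid group of size 9 the largest, so the plurality
# rule holds while the majority rule does not.
kz, beta, c_a, c_g, rho_z = 21, 1.0, 0.4, 0.4, 0.5
alpha = np.concatenate([c_a * np.ones(6), 0.5 * c_a * np.ones(6),
                        np.zeros(9)])
gamma = c_g * np.ones(kz)
beta_j = beta + alpha / gamma

values, counts = np.unique(beta_j, return_counts=True)
print(values, counts)   # groups at beta, beta+0.5, beta+1 of sizes 9, 6, 6

# AR(1)-type instrument covariance: Sigma_{z,jk} = rho_z ** |j - k|.
idx = np.arange(kz)
Sigma_z = rho_z ** np.abs(np.subtract.outer(idx, idx))
```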
Before evaluating estimation results using the downward testing CI method and the HT method as described above, Figure 1 shows the frequency of selection of the oracle model for the HT and CI methods, the latter on the basis of $\widehat{\mathcal{V}}^{sar}_{\lambda_n}$ as defined in (16), for 10,000 Monte Carlo replications, as a function of the values $\lambda = (0.15, 0.20, \ldots, 6.95, 7)$ and for a sample size of $n = 2000$. It is clear that the CI method utilises the available information better in this case and obtains a maximum frequency of selecting the oracle model of 0.98 at $\lambda = 2.60$, whereas the maximum frequency for the HT method is only 0.60 at $\lambda = 2.40$. Figure 2 shows the average total number of instruments selected as invalid, $|\widehat{\mathcal{A}}_\lambda|$, and the average number of invalid instruments selected as invalid, as a function of $\lambda$. While both methods can correctly select the 12 invalid instruments as invalid for a range of values of $\lambda$, the CI method can do so without also selecting valid instruments as invalid. In contrast, the HT method selects on average additional valid instruments as invalid, resulting in the difference in the frequencies of selecting the oracle model. At $\lambda = 2.40$, the HT method selects on average 11.94 invalid instruments correctly as invalid, but selects on average a total of 13.52 instruments as invalid. At $\lambda = 2.60$, the CI method selects on average 11.99 invalid instruments correctly as invalid, and selects on average a total of 12.01 instruments as invalid, hence the much higher frequency of selecting the oracle model for the CI method.
As is clear from Figure 2, for the HT method the number of instruments selected as invalid is not monotonically increasing for decreasing values of the threshold, as discussed in Section 4.2, whereas for the CI method it is.
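The CI selection rule itself is simple to sketch. Given per instrument causal effect estimates and standard errors (hypothetical inputs b and se below; psi is the tuning parameter), form the intervals b_j ± psi·se_j and take the largest group of mutually overlapping intervals. Because pairwise overlap of intervals on the real line implies a common point (Helly's theorem in one dimension), a scan over left endpoints suffices:

```python
def largest_overlapping_group(b, se, psi):
    """Return indices of the largest group of mutually overlapping
    intervals [b[j] - psi*se[j], b[j] + psi*se[j]].

    Pairwise overlap in 1-D implies a common point, and some maximal
    common point can be taken at a left endpoint, so we scan those."""
    lo = [bj - psi * sj for bj, sj in zip(b, se)]
    hi = [bj + psi * sj for bj, sj in zip(b, se)]
    best = []
    for p in lo:
        group = [j for j in range(len(b)) if lo[j] <= p <= hi[j]]
        if len(group) > len(best):
            best = group
    return best
```

Shrinking psi can only split groups, never merge them, which is why the number of instruments selected as valid is monotonically decreasing in psi for the CI method.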
The proposed threshold value for the HT method, √(2.01² log n) = 5.54, is clearly too large in this design. The alternative choice is √(2.01² log k_z) = 3.51. As shown in Figure 1, the probability of selecting the oracle model at this value is only 0.018. Figure 2 shows that the average number of correctly selected invalid instruments at this threshold is 10.93, and quite a few valid instruments are selected as invalid, with the average total number of instruments selected as invalid equal to 18.42. Guo et al. (2018) used the value √(2.01 log k_z) in their Monte Carlo simulations, which in this case equals 2.47, very close to the value of 2.40 that maximises the frequency of oracle selection. Here the probability of selecting the oracle model is 0.59, with on average 11.91 invalid instruments correctly selected as invalid and an average total of 13.68 instruments selected as invalid. Table 2 shows estimation results for the downward testing CI method and the HT method for this design for sample sizes n = 500, 1000, 2000, 5000, for 10,000 Monte Carlo replications. As in Guo et al. (2018), we present the median absolute error (mae), the coverage probability of the 95% CI for β and the average length of the confidence intervals. In addition, we present the average number of instruments selected as invalid, the frequency of selecting the oracle model, p_or, and the frequency of selecting all invalid instruments as invalid, p_allinv. The 95% CI is the standard confidence interval based on the 2SLS estimator in the selected model, β̂ ± 1.96 se(β̂). Results are presented for the HT method, using √(2.01² log k_z) = 3.51 and √(2.01 log k_z) = 2.47 as threshold values, denoted HT_4kz and HT_2kz respectively, and for the CI method using the downward testing procedure based on the Sargan test threshold p-value of 0.1/log(n) as described in Section 3.2, denoted CI_sar.
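The three HT threshold values quoted above can be reproduced directly for n = 2000 and k_z = 21:

```python
import math

n, kz = 2000, 21

thr_n   = math.sqrt(2.01**2 * math.log(n))   # proposed threshold, ~5.54
thr_4kz = math.sqrt(2.01**2 * math.log(kz))  # alternative choice, ~3.51
thr_2kz = math.sqrt(2.01 * math.log(kz))     # value used by Guo et al., ~2.47
```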
Also given are the estimation results for the oracle 2SLS estimator (2SLS_or) and the naive 2SLS estimator (2SLS) that treats all instruments as valid.
The CI_sar estimator is better behaved than the HT estimators, especially at the smaller sample sizes n = 500 and n = 1000, with a much smaller mae and much better coverage probability than either HT estimator. For example, at n = 1000 the mae for CI_sar is very similar to that of the oracle 2SLS estimator, 0.014 versus 0.011, its coverage probability is 0.89, and the average length of its CI is the same as that of the oracle estimator, 0.066. In contrast, the mae for HT_2kz at n = 1000 is 0.31, its coverage probability is only 0.088, and the average length of its CI is large, at 0.22. The latter is due to the fact that too many instruments are selected as invalid: 17.10 on average, compared to 11.60 for CI_sar. In terms of mae and coverage probability, HT_2kz is better behaved than HT_4kz for n = 1000 and n = 2000. Although all three estimators are close to the oracle 2SLS estimator at n = 5000 and select all invalid instruments correctly as invalid, HT_4kz is now better behaved overall than HT_2kz, as HT_2kz still selects on average too many instruments as invalid, 12.69, versus 12.03 and 12.01 for HT_4kz and CI_sar respectively. This is as expected, as the threshold parameter needs to increase with the sample size for consistent selection in this fixed-k_z setup.
The results for the k_z = 7 case, as presented in Appendix A.7, again show a better performance of the CI_sar estimator in terms of mae and coverage probability compared to the HT estimators, although the differences are overall smaller due to the smaller number of instruments.
The CI method, as it ignores covariances when grouping instruments, is well suited to low instrument correlation settings such as Mendelian randomisation, but it clearly also does very well in the correlated-instrument setting specified above. The HT method may well have better finite sample properties in other settings, but a main advantage of the CI downward testing method is that it selects the model with the largest number of instruments selected as valid that passes the Sargan test. In contrast, the HT method may select models that are rejected by the Sargan test, as we find in the application presented next.

| THE EFFECT OF BMI ON DIASTOLIC BLOOD PRESSURE
We use data on 105,276 individuals from the UK Biobank and investigate the effect of BMI on diastolic blood pressure (DBP); see Windmeijer et al. (2019) for further details. We use 96 SNPs as potential instruments for BMI, as identified in independent GWAS studies, see Locke et al. (2015). Because of skewness, we log-transformed both BMI and DBP. The linear model specification includes age, age² and sex, together with 15 principal components of the genetic relatedness matrix as additional explanatory variables. Because of the log-transformation, the causal parameter of interest is interpreted as an elasticity: an increase of BMI by 1% changes DBP by β%. Table 3 presents the estimation results. R code for the estimation procedure is available at https://github.com/xlbristol/CIIV. We present here the results based on the assumption of conditional homoskedasticity; robust methods as discussed in Section 5 produce virtually identical results. The first set of results is based on the full set of instruments, not performing a first-stage thresholding, or in other words setting the first-stage threshold in (31) to 0. The OLS estimate of the causal parameter is 0.206 (SE 0.002), whereas the 2SLS estimate treating all 96 instruments as valid is much smaller at 0.087 (SE 0.016). The Sargan test, however, rejects the null that all the instruments are valid, with a p-value of 2.05e-19.
The HT_4kz method does not select any instruments as invalid, whereas HT_2kz selects three instruments as invalid. The HT_2kz estimate is 0.104 (SE 0.016), slightly larger than the 2SLS estimate, but the Sargan test still has a very small p-value of 3.11e-11, rejecting this model.
Using a threshold p-value of 0.1/log(n) = 0.0086 for the downward testing CI_sar procedure results in a selection of 13 instruments as invalid. The CI_sar estimate is 0.140 (SE 0.019), indicating a downward bias of the 2SLS estimator when treating all instruments as valid. The p-value of the Sargan test in the resulting model is 0.011.
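The threshold p-value quoted here follows directly from the sample size, and the statistic it gates is the standard homoskedastic Sargan statistic, S = n·û′P_Z û / (û′û). A minimal sketch (the function name is illustrative; û denotes the 2SLS residuals and Z the instruments used in the selected model):

```python
import math
import numpy as np

# Downward-testing threshold for the Sargan p-value at the UK Biobank sample size
n = 105276
p_threshold = 0.1 / math.log(n)  # ~0.0086

def sargan_stat(u_hat, Z):
    """Homoskedastic Sargan statistic S = n * u'P_Z u / (u'u),
    asymptotically chi-squared with degrees of freedom equal to the
    number of overidentifying restrictions under the null of validity."""
    # P_Z u_hat via least squares projection of the residuals on Z
    proj = Z @ np.linalg.lstsq(Z, u_hat, rcond=None)[0]
    return len(u_hat) * (u_hat @ proj) / (u_hat @ u_hat)
```

In the downward testing procedure, models with progressively fewer instruments treated as valid are tested until the Sargan p-value exceeds p_threshold.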
Further presented are the estimation results of the post-adaptive Lasso estimator of Windmeijer et al. (2019), also using a downward Sargan p-value based testing procedure. This method selects 11 instruments as invalid, resulting in an estimate of 0.163 (SE 0.018) and a Sargan test p-value of 0.013. This method has oracle properties if more than 50% of the instruments are valid, an assumption that does not appear to be violated given the estimation results of the CI_sar method. It is more efficient in this case than the CI_sar method, as it finds a model with a larger group of valid instruments that passes the Sargan test. Of the selected invalid instruments, the CI and Lasso methods have eight in common. In particular, the Lasso method is able to select as invalid two instruments that are very weak, with large values of the per instrument estimates |β̂_j| and their standard errors se(β̂_j). The CI method is not able to classify these as invalid, as discussed in Section 6. We can therefore apply the first-stage thresholding in order to exclude these instruments from consideration. The second set of results presented in Table 3 performs a first-stage thresholding using the Guo et al. (2018) recommended value √(2.01 log k_z) = 3.03. A total of 34 instruments do not pass this threshold; they are treated as invalid and included in the model as explanatory variables. The OLS and naive 2SLS estimates are virtually unchanged. The HT_4kz estimator selects one additional instrument as invalid, with a Sargan test p-value for the resulting model of 5.29e-14, clearly rejecting the model. The HT_2kz procedure selects two instruments as invalid, and this model is also rejected by the Sargan test. Interestingly, the CI_sar and post-adaptive Lasso procedures result in the same model selection, with the same nine instruments selected as invalid. The resulting estimate is 0.174 (SE 0.020), again showing that the naive 2SLS estimator of the effect of log(BMI) on log(DBP) is downward biased.
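The first-stage thresholding step can be sketched as follows. We read the rule as a cut-off on the first-stage t-ratio at √(2.01 log k_z) — the exact form is an assumption based on the threshold value quoted in the text, and the names are illustrative:

```python
import math

def passes_first_stage(gamma_hat, se_gamma, kz):
    """Keep instrument j for consideration only if its first-stage
    t-ratio |gamma_hat[j]| / se_gamma[j] exceeds sqrt(2.01 * log(kz));
    instruments failing the cut-off are treated as invalid."""
    thr = math.sqrt(2.01 * math.log(kz))  # ~3.03 for kz = 96
    return [abs(g) / s > thr for g, s in zip(gamma_hat, se_gamma)]
```

With k_z = 96 the cut-off is about 3.03, matching the value used for the second set of results in Table 3.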
This result is quite close to the OLS result, indicating that there is much less unobserved confounding in this relationship than suggested by the naive 2SLS estimator. The nine instruments selected as invalid at the first-stage threshold of 3.03 are a subset of the 13 instruments selected by CI_sar without first-stage thresholding. For the Lasso procedure, eight of the nine instruments selected as invalid at threshold 3.03 were also selected as invalid without thresholding. Figure A4 in Appendix A.8 displays the CIs for the threshold 3.03, k_z = 62 case at the selected final breakpoint of 2.35. Only one of the instruments selected as invalid has a positive estimate for the causal effect, whereas the other eight have negative estimates, resulting in a larger estimate of the causal effect when these instruments are treated as invalid.
In order to compare the results to those found by Zhao et al. (2019), we also performed the analysis on the untransformed BMI and DBP variables. The results in this case are 0.559 (0.0062) for OLS, 0.248 (0.0452) for 2SLS, and 0.568 (0.0565) for CI_sar, with 13 instruments found to be invalid. For the pre-selected k_z = 62 case, the results are 0.244 (0.0469) for 2SLS and 0.494 (0.0557) for CI_sar, with nine instruments found to be invalid. In the latter case, these invalid instruments are identical to the ones found above, but this is not the case when k_z = 96. Again these results suggest that the original OLS results suffer much less from unobserved confounding bias than the naive 2SLS estimator suggests. These results are similar to those found in the two-sample summary data analysis of Zhao et al. (2019), who found profile score (RAPS), IVW and weighted median estimates of 0.601 (0.054), 0.402 (0.106), 0.514 (0.102) and 0.472 (0.176), respectively, in their analysis with 160 SNPs as potential instruments.

| CONCLUSION AND DISCUSSION
We have shown that the CI method for selecting the set of valid instruments from a putative set of instruments that may include invalid ones for an instrumental variables analysis is a viable alternative to the hard thresholding method and the adaptive Lasso method when the plurality rule holds. The methods developed for selecting invalid instruments thus far have only considered a single endogenous treatment variable. Recent analyses have considered models with multiple treatments, see for example Sanderson et al. (2019) for an examination of multivariable Mendelian randomisation. An extension of the instrument selection methods for multiple treatment models is not straightforward. When the majority rule applies, the adaptive Lasso method can be utilised by constructing an initial consistent median-of-medians estimator, see Liang and Windmeijer (2020). For the HT and CI methods, such an extension is the subject of future research.