Testing the specificity of environmental risk factors for developmental outcomes

Developmental theories often assume that specific environmental risks affect specific outcomes. Canonical Correlation Analysis was used to test whether 28 developmental outcomes (measured at 11–15 years) share the same early environmental risk factors (measured at 0–3 years), or whether specific outcomes are associated with specific risks. We used data from the UK Millennium Cohort Study (N = 10,376, 51% female, 84% White) collected between 2001 and 2016. A single environment component was mostly sufficient for explaining cognition and parent-rated behavior outcomes. In contrast, adolescents’ alcohol and tobacco use were specifically associated with their parents’ use, and child-rated mental health was weakly associated with all risks. These findings suggest that, with some exceptions, many different developmental outcomes share the same early environmental risk factors.


INTRODUCTION
Specificity models, which propose that different environmental exposures result in different developmental sequelae, have an enduring popularity in developmental science. Home language use is a popular example. Bilingual home environments (Bialystok et al., 2005), caregiver use of mental state language (Meins et al., 2002), and "number-talk" (Levine et al., 2010) have been, respectively, linked to inhibitory control, theory of mind, and arithmetic development in children. These kinds of specificity models easily lend themselves to neat, mechanistic interpretations.
Yet, a comprehensive evidence base shows that environmental risk factors are highly co-occurring and predict a broad range of outcomes (Asmussen et al., 2020; Evans et al., 2013). For example, socioeconomic status correlates with a wide range of life outcomes, including cognition, mental health, and educational outcomes (Dalmaijer et al., 2021), and with exposure to many different adverse childhood experiences (ACE; Walsh et al., 2019). Co-occurrence can make detecting specific effects in observational data more difficult, as measuring and accounting for many confounding variables may not be possible.
A popular theoretical approach for understanding the impact of a child's environment is the "cumulative risk factor" model. According to this perspective, the kind of risk is less important than the absolute number of risks experienced (Evans et al., 2013). Researchers typically calculate a cumulative risk score by dichotomizing 3–12 risk factors, then summing across these to determine the number of risks a child was exposed to, and then regressing this score against a wide range of outcomes. Cumulative risk scores are good predictors of mental health (Evans et al., 2013), cognition (Burchinal et al., 2008), and behavior (Stouthamer-Loeber et al., 2002). Perhaps unsurprisingly, being exposed to more risk factors does in fact predict more severe outcomes (Appleyard et al., 2005). The ACE literature takes a similar cumulative approach to measuring risk exposure, albeit focusing on particularly extreme risks such as abuse and neglect (Asmussen et al., 2020).
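The scoring procedure described above can be sketched in a few lines. This is a minimal illustration in Python (the study's own scripts are in R); the risk factor names, values, and cut-offs are hypothetical and are not taken from the MCS.

```python
def cumulative_risk_score(risk_values, cutoffs):
    """Count how many risk factors exceed their (hypothetical) cut-off."""
    score = 0
    for name, value in risk_values.items():
        # Dichotomize: 1 if the child is "exposed" (value above cut-off), else 0.
        exposed = 1 if value > cutoffs[name] else 0
        score += exposed
    return score


# Hypothetical risk factors and cut-offs, for illustration only.
cutoffs = {"household_crowding": 1.5, "parent_distress": 12, "moves_per_year": 1}
child_a = {"household_crowding": 2.0, "parent_distress": 13, "moves_per_year": 0}
child_b = {"household_crowding": 1.6, "parent_distress": 30, "moves_per_year": 0}

score_a = cumulative_risk_score(child_a, cutoffs)
score_b = cumulative_risk_score(child_b, cutoffs)
print(score_a, score_b)
```

Both hypothetical children receive the same score even though their parental distress values differ widely, which shows how much information a dichotomized sum can discard.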
However, there are various limitations to the cumulative risk factor approach. Cumulative risk research has frequently utilized retrospective, self-report measures of risk exposure (Smith & Pollak, 2020), which may be weakly correlated with prospective measures (Baldwin et al., 2019). Dichotomizing and summing risk factors into a single composite score also makes various unrealistic assumptions: for example, that each risk factor is equally impactful (Olofson, 2018), that dichotomization of variables does not lose important granularity (Raviv et al., 2010), and that all risk factors produce the same outcome (McLaughlin & Sheridan, 2016).
Between specificity and cumulative risk models, there are dimensional models (McLaughlin et al., 2020), which share features of both approaches. Dimensional models split environmental factors into categories such as deprivation and threat (McLaughlin & Sheridan, 2016), neglect and harm (Humphreys & Zeanah, 2015), or harshness and unpredictability (Belsky et al., 2012). Different dimensions of environmental factors are then linked to different kinds of outcomes. For example, McLaughlin and Sheridan's (2016) theory predicts that environmental threat risks (e.g., abuse) primarily impact emotional and social information processing, and environmental deprivation risks (e.g., neglect) primarily impact cognition.

Specificity biases in developmental science
These perspectives cannot be divorced from methodological approaches, because the theoretical conclusion is often a direct consequence of the deployed methodology. Support for specificity models tends to come from two limited statistical approaches. The first is to compare p-values for different associations. For example, in testing the Deprivation/Threat model, some studies only demonstrate that threat risk factors significantly predict emotional or social outcomes, whilst deprivation risk factors do not (e.g., Lambert et al., 2017; Rosen et al., 2019). However, significant and non-significant effect sizes are often very similar, and these effects are seldom directly compared. Additional tests are needed to demonstrate that two effect sizes are significantly different from each other (Gelman & Stern, 2006).
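A small numerical sketch illustrates why comparing p-values is not a test of specificity. The estimates and standard errors below are hypothetical, the two estimates are assumed to be independent, and a normal approximation is used; Python is used purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p-value for estimate / se under a normal approximation."""
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical regression estimates (not from any cited study).
threat_est, threat_se = 0.25, 0.10
depriv_est, depriv_se = 0.10, 0.10

p_threat = two_sided_p(threat_est, threat_se)  # "significant" on its own
p_depriv = two_sided_p(depriv_est, depriv_se)  # "non-significant" on its own

# Direct comparison of the two effects (assumes independent estimates).
diff = threat_est - depriv_est
diff_se = sqrt(threat_se ** 2 + depriv_se ** 2)
p_diff = two_sided_p(diff, diff_se)

print(round(p_threat, 3), round(p_depriv, 3), round(p_diff, 3))
```

In this sketch one effect passes the conventional .05 threshold and the other does not, yet the difference between them is itself far from significant.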
A second common approach is to show that one risk dimension significantly predicts a developmental outcome when controlling for another dimension, or after controlling for confounding factors (e.g., Dennison et al., 2019; Machlin et al., 2019). However, such models assume all relevant confounders are included in the model and measured without error. Epidemiologists have long documented the effect of residual confounding (Armstrong, 1998; Greenland, 1980). Residual confounding occurs when a confounding variable is imperfectly measured, meaning that it can only imperfectly "control" for the confound.
In an extreme case, let us imagine that deprivation is associated with an outcome (e.g., cognition), but deprivation is measured twice with some independent measurement error. When controlling for deprivation (first measurement), deprivation (second measurement) would remain a useful predictor in a regression model. Combining both imperfect measures of deprivation will yield better predictions of the outcome. Thus, regression models with two noisy measures of deprivation will generally explain more variance than those with just one. In these cases, we might mistakenly conclude that deprivation (second measurement) has a specific association with the outcome. While this logic is commonly deployed, it provides weak tests of specificity models.
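The thought experiment above can be simulated directly. The sketch below is illustrative only: the effect size, noise levels, sample size, and random seed are arbitrary assumptions, and the two-predictor R² is computed from the standard closed-form expression in terms of pairwise correlations rather than by fitting a regression model.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


random.seed(1)
n = 5000
# True (unobserved) deprivation and an outcome that depends only on it.
deprivation = [random.gauss(0, 1) for _ in range(n)]
outcome = [0.5 * d + random.gauss(0, 1) for d in deprivation]
# Two noisy measurements of the same construct, with independent errors.
measure_1 = [d + random.gauss(0, 1) for d in deprivation]
measure_2 = [d + random.gauss(0, 1) for d in deprivation]

r_y1 = pearson(outcome, measure_1)
r_y2 = pearson(outcome, measure_2)
r_12 = pearson(measure_1, measure_2)

# R-squared with one predictor, and with both predictors.
r2_one = r_y1 ** 2
r2_two = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

# Partial correlation of the second measure with the outcome,
# "controlling for" the first measure.
partial_2 = (r_y2 - r_y1 * r_12) / sqrt((1 - r_y1 ** 2) * (1 - r_12 ** 2))

print(round(r2_one, 3), round(r2_two, 3), round(partial_2, 3))
```

Although both measures index the same single construct, the second measure retains a positive partial association with the outcome after controlling for the first, which could be misread as a specific effect.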

The current study
Previous dimensional theories of adversity have been largely inspired by experimental research in neuroscience and evolutionary psychology (Belsky et al., 2012; McLaughlin & Sheridan, 2016; McLaughlin et al., 2020; Smith & Pollak, 2020). They are often tested using observational methods with the methodological limitations outlined above.
Data-driven methods are increasingly popular for exploring how environmental risks and developmental outcomes are associated, and they complement experimental approaches. For example, network models have recently been used to explore associations between adversities and developmental outcomes (Dalmaijer et al., 2021; Sheridan et al., 2020). Here, we introduce a novel exploratory approach, which is particularly appropriate for identifying dimensions of risk factors that best predict dimensions of developmental outcomes.
Canonical Correlation Analysis (CCA) was used, which finds pairs of risk factor and developmental outcome components that are maximally correlated, but uncorrelated with all other components extracted (Anderson, 2003; Hotelling, 1936). In other words, it finds dimensions of risk factors specifically associated with dimensions of outcome variables, where they exist. CCA also provides a principled way of testing if a single dimension of risk factors is sufficient for explaining a broad range of different outcomes, or whether there are multiple distinct dimensions.
This approach circumvents issues with p-value misuse and residual confounding in regression analyses. Rather than testing whether an individual risk factor contributes to the prediction of a single outcome, this approach focuses instead on finding components of risk factors that best explain different kinds of outcomes. As with principal component analysis, independent measurement error should not impact the number of components identified (Hellton & Thoresen, 2014). We provide a detailed primer to CCA in the Method section.
We aimed to understand how a broad range of different risk factors in infancy relate to a broad range of different outcomes (including cognition, mental health, and behavior) in adolescence. This required rich developmental data, and a large sample size to balance the number of parameters estimated. Therefore, data from a large-scale longitudinal birth cohort, the United Kingdom (UK) Millennium Cohort Study (MCS), were used. This UK-based cohort contains many measures of adversity collected in infancy, and has been followed up to late adolescence (Centre for Longitudinal Studies, 2017).
These developmental time points were selected for two main reasons. Practically, temporally separating risk exposure and outcome reduces concerns that the measurement of the environmental risks will be impacted by the outcomes themselves. Theoretically, the environment in infancy and early childhood has long been viewed as important for future development (Bowlby, 1951; Freud, 1964). Much subsequent research has highlighted the importance of early adversity (Thompson & Steinbeis, 2020). Adolescence is also a particularly important developmental period: it is when many adult mental health disorders first emerge (Jones, 2013) and also represents a key educational stage.
Using CCA, we can answer three related questions. First, how many distinct linear combinations of risk factors are associated with distinct linear combinations of outcome variables? The cumulative risk factor approach would predict that only a single risk component is sufficient to explain all developmental outcomes. Alternatively, many different risk factor components may predict many different outcome components, in line with a specificity perspective. Second, we can ask which risks demonstrate more specific associations with outcomes by inspecting the weights that define each CCA component extracted. Third, we can determine how many dimensions of risk factors are necessary to predict each outcome variable. This tells us how many (and which) risk factor components are sufficient to explain each outcome variable.

METHOD
The UK MCS includes children born between September 2000 and January 2002, from Scotland, England, Wales, and Northern Ireland. In total, 18,552 families took part in the first sweep of testing, and an additional 1387 families were recruited in the second sweep. By the sixth sweep, only data from 11,726 families were available for analysis.
The study used a stratified, clustered, random sample, oversampling children of ethnic minority backgrounds and those living in disadvantaged areas or smaller UK nations at baseline (Centre for Longitudinal Studies, 2017).
In rare cases where there were multiple cohort members per family (e.g., twins), only data for the first cohort member were used. The MCS cohort contains data from caregiver-reports, teacher-reports, child-reports, child cognitive tests, and observational data from survey members attending the home visits.

Data, materials, and online resources
All R scripts written for data processing, analysis, and visualization are provided in the accompanying GitHub repository. The original data are accessible from the UK Data Service (Centre for Longitudinal Studies, 2017). Given the primarily exploratory nature of the analyses, no analyses were pre-registered.

Environmental risk factors
Thirty-one risk factors were included from the first and second sweeps of the MCS, when children were aged approximately 9 months (M = 0.81 years, SD = .04) and 3 years old (M = 3.1 years, SD = .20), respectively. A detailed description of all risk factor variables is provided in Table 1. Most measures were derived from parent reports, either from a computerized interview, or self-completion for sensitive questions.

Developmental outcomes
Outcome variables were included from the fifth and sixth study sweeps, when children were aged approximately 11 (M = 11.2, SD = .33) and 14 (M = 14.25, SD = .34) years. A broad range of cognitive, behavioral, and mental health outcomes were collected using parent and child reports, as well as cognitive testing, again detailed in Table 1. In the fifth sweep, a Teacher Survey was conducted for children in England and Wales only (7430 responses), involving either a postal self-completion questionnaire or computer-assisted telephone interview. Therefore, teacher ratings were excluded from the main analyses, leaving 22 outcome variables in total, but were included in a secondary analysis with a total of 28 outcomes.

Data pre-processing
At the start of data processing, participants were randomly assigned to training and testing datasets. The training dataset was used to make decisions about variable inclusion and scoring. Once these decisions had been made, confirmatory analyses were run on the testing dataset, reducing the risk of researcher bias (Gelman & Loken, 2013). The testing dataset was also used to validate the final CCA model (see Data analysis section). A third dataset containing all participants was created to report descriptive statistics for the full sample.
TABLE 1 Variables included in analyses. Where possible, scale internal consistency was estimated using omega (ω). More information on variables is available in the accompanying repository, including scripts used to generate variables from the original data, and pointers to MCS variable names.
Teacher-rated measures were not included in the primary analyses of the above datasets due to the lower response rate, which reduces the useable sample size (see Imputation section). To perform supplementary analyses with a smaller sample including teacher data, three additional datasets (training, testing, and combined datasets) were created including teacher-reported metrics.
These six datasets were independently passed through the same pre-processing steps (including variable scoring) outlined below.

Variable scoring
Where summary scores were calculated from more than one indicator variable (e.g., for creating a parental self-esteem scale), a factor score was estimated using the psych::fa R function (Revelle, 2018). A few variables were dichotomized based on established cut-offs (e.g., for categorizing premature vs. non-premature babies). The steps used to process each variable are outlined in Supporting Information, and where possible internal consistency is reported in Table 1.
Apart from two cases (number of siblings, monolingual-English home language use), all variables were coded so that a higher score indicated a "better" value (e.g., lower alcohol consumption, less air pollution, or fewer depression symptoms). Continuous variables with over 10 response categories were mapped onto a standard normal distribution by first estimating the percentage-ranks for participants on each variable (so that scores vary uniformly between 0 and 1 for the lowest and highest responses, respectively). The quantile function for a standard normal distribution was then applied to map these ranks onto equivalent points on the normal distribution. This approach preserves the ranking of individuals but alters the distances between scores to approximate a normal distribution. Each variable was scaled before analysis to have zero mean and unit variance. All variable scoring steps were independently applied to each training or testing dataset. More detail is provided in the accompanying code.
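The rank-based mapping onto a normal distribution described above can be sketched as follows. This is an illustrative Python version, not the study's R code: the function name, the averaging of tied ranks, and the 0.5 offset that keeps percentage-ranks strictly between 0 and 1 are all assumptions, and the final rescaling to zero mean and unit variance is omitted.

```python
from statistics import NormalDist


def rank_to_normal(values):
    """Map scores to standard-normal quantiles via their percentage-ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank across ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    # Assumed offset of 0.5 keeps every percentage-rank inside (0, 1),
    # so the normal quantile function is always finite.
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]


scores = [3, 50, 7, 7, 1000]  # hypothetical, heavily skewed raw scores
z = rank_to_normal(scores)
print([round(v, 2) for v in z])
```

The extreme raw score of 1000 keeps its top rank but is pulled to a modest normal quantile, showing how the transformation preserves ordering while changing the distances between scores.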

Imputation
After variable scoring, missing data were imputed using the R package mice (Buuren & Groothuis-Oudshoorn, 2011), using classification and regression trees that can handle imputing different data types.Imputation was conducted separately amongst the risk factor and outcome variables, so that information from the predictor variables was not used to impute missing information in the outcome variables, and vice versa.
Excluding participants with more than 20% missing data left a sample size of 5160, 5216, and 10,376 for the training, testing, and complete datasets, respectively. For the datasets with teacher ratings (participants without teacher-reports were excluded), this left a sample size of 2831, 2834, and 5665 for the training, testing, and complete datasets, respectively. Imputation was performed separately in each of the six datasets.

Demographics
Demographic details are provided for families included in the analyses, after the exclusions for missing data applied above (i.e., all participants in the "combined" dataset, N = 10,376). The ethnic composition of the sample was as follows: White (84.3%), Pakistani (4.68%), Black (2.68%), Mixed (2.54%), Indian (2.54%), Bangladeshi (1.91%), and other (1.33%). The gender composition of the sample was 49.5% male and 50.5% female. At the first study sweep, the proportions of children resident in England, Northern Ireland, Scotland, and Wales were 63.9%, 9.92%, 11.3%, and 14.9%, respectively. At the first and second study sweeps, respectively, 99.9% and 98.7% of main interview responders were the birth mother. At the first sweep, 34.3% of main interview responders had a university-level degree, diploma in higher education, or other professional qualification at degree level. The employment level of the main interview responder at the first study sweep was rated using the UK National Statistics Socio-economic Classification (NS-SEC 7), using the last known job. Thirty-two percent of the sample were in higher or lower managerial or professional occupations, and 32.5% in semi-routine or routine occupations.

Data analysis
Data cleaning, analysis, and visualizations were implemented in R (v3.6.2). Scripts are available in the accompanying repository (github.com/giac01/mcs_cca). To facilitate computational reproducibility, analyses were run within a docker container (available at dockerhub; bignardig/rocker:MCS2021), and a bespoke R package was created to perform and document the CCA analyses (github.com/giac01/ccatools). We have followed guidelines for reporting exploratory research, focusing on parameter and confidence interval estimation, rather than hypothesis-testing (Weston et al., 2019).
A primer to CCA
Canonical Correlation Analysis extracts pairs of components, also called the "variates," which are composed of one risk factor component, and one outcome component. Components are extracted in a way that ensures each pair is maximally correlated, but subsequent components extracted must be uncorrelated with previous ones (Anderson, 2003; StataCorp, 2019). Here, environmental risk component scores are calculated from a weighted sum of the risk factor variables, and outcome component scores are calculated from a weighted sum of the outcome variables. Consequently, variables with weights of large magnitude have a larger impact on the component score. The goal of CCA is to find the weights (also called the "raw coefficients") that create components with the properties outlined here.
To estimate a risk factor component score, one multiplies each risk factor variable score by the corresponding CCA weight, and sums across all variables for each participant (i.e., a linear combination of variables). The first, second, and third canonical correlations refer to the correlations between, respectively, the first, second, and third pairs of components.
A key feature of CCA is that the magnitude of canonical correlations declines monotonically from the first to the last canonical correlation. Therefore, components are ordered in terms of importance. The number of component pairs that can be extracted (and thus the number of canonical correlations) is limited by the smaller of the number of predictor variables and the number of outcome variables. The first canonical correlation represents the highest possible correlation between any possible linear combination of predictor variables and any linear combination of outcome variables.
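The properties described in this primer can be summarized in standard notation. The symbols here are introduced for illustration and do not appear elsewhere in the text: X is the participants-by-p matrix of risk factors, Y is the participants-by-q matrix of outcomes, a_k and b_k are the weight vectors for the k-th pair of components, and ρ_k is the k-th canonical correlation.

```latex
\begin{aligned}
(a_1, b_1) &= \arg\max_{a,\,b}\ \operatorname{corr}(Xa,\ Yb),
  \qquad \rho_1 = \operatorname{corr}(Xa_1,\ Yb_1), \\
(a_k, b_k) &= \arg\max_{a,\,b}\ \operatorname{corr}(Xa,\ Yb)
  \quad \text{subject to} \quad
  \operatorname{corr}(Xa,\ Xa_j) = \operatorname{corr}(Yb,\ Yb_j) = 0
  \ \ \text{for all } j < k, \\
\rho_1 &\ge \rho_2 \ge \dots \ge \rho_K \ge 0,
  \qquad K = \min(p,\ q).
\end{aligned}
```

With 31 risk factors and either 22 or 28 outcomes, this gives at most K = 22 or K = 28 component pairs, respectively.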
Canonical Correlation Analysis differs from the more common approach of conducting dimensionality reduction first, for example, by computing the principal components in the risk factor and outcome matrices separately, then regressing the outcome principal components on the risk factor principal components in a second step. Such an approach assumes that the principal components extracted from risk factor variables are also important for predicting outcomes, which may not be the case (Rohart et al., 2017). CCA is more suitable for the current research question as the components extracted best explain relations between risk factor and outcome variables, rather than being optimized to explain variance amongst the risk factor or outcome variables separately.
Although it is a well-established technique (Hotelling, 1935), CCA is not widely used in developmental psychology, despite its usefulness for relating multiple predictors to multiple outcome variables. In neuroscience, it is increasingly employed to summarize associations between multiple neuroimaging features (e.g., measures of brain connectivity between hundreds of pairs of brain regions) and complex, multivariable outcomes. For example, Marek et al. (2020) used CCA to describe associations between functional brain connectivity and multiple cognitive assessments, and separately between brain connectivity and multiple behavioral problem ratings, in a large cohort of 9- to 10-year-olds. Connectivity was more strongly associated with cognition (canonical correlation = .22) than with behavioral problems (canonical correlation = .06). Partial least squares, a closely related technique, has been used to explore associations between multiple measures of the environment, cognition, behavior, and brain connectivity, in a sample of 113 six- to twelve-year-olds (Johnson, 2018; Johnson et al., 2021). Lillard and Kavanaugh (2014) used CCA to explore how a battery of assessments longitudinally predict future theory of mind tasks, in a sample of 58 children.

Estimating canonical correlations and confidence intervals
An important limitation of CCA is that sample canonical correlations are optimistically biased, with overfitting particularly bad in studies with small sample sizes and large numbers of variables (Helmer et al., 2020; Lee, 2007). Whilst this problem of model overfitting is not unique to CCA, we addressed this issue by internally validating the models using a hold-out testing dataset.
Canonical Correlation Analysis models were fit in a training dataset (N = 5160), and used to generate predicted canonical component scores and canonical correlations in the testing dataset (N = 5216). This is achieved by applying the CCA weights (raw coefficients) found in the training dataset to the testing dataset to find predicted component scores (Marek et al., 2020). Correlations between these predicted components were used to find the predicted canonical correlations. Standard inferential methods for Pearson's correlation coefficient were used to estimate confidence intervals and p-values for the predicted canonical correlations.
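One standard way to obtain a confidence interval for a Pearson correlation is the Fisher z-transformation. The text does not state which method was used, so the sketch below is an assumption, written in Python for illustration; the function name is hypothetical. The inputs are the first predicted canonical correlation reported in the Results (r = .51) and the testing-set sample size (N = 5216).

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via Fisher's z."""
    z = atanh(r)                 # transform r to an approximately normal scale
    se = 1 / sqrt(n - 3)         # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Back-transform the interval limits to the correlation scale.
    return tanh(z - crit * se), tanh(z + crit * se)


low, high = pearson_ci(r=0.51, n=5216)
print(round(low, 2), round(high, 2))
```

Rounded to two decimals, this interval is .49 to .53, which agrees with the interval reported for the first predicted canonical correlation.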

Research question 1: Testing the cumulative risk model
Cumulative risk theory predicts that only a single CCA risk factor component is associated with a single CCA outcome component, with subsequent canonical correlations near zero. In contrast, if there are multiple, large canonical correlations, this means that specific linear combinations of risks are associated with specific linear combinations of outcomes, refuting the cumulative risk approach.
Research question 2: Which risk factors predict which developmental outcomes?
If there are multiple, large predicted canonical correlations, specific associations between risk factors and outcome components can be explored by inspecting the CCA weights. The CCA weights describe which variables define each component. To aid interpretation, 95% CIs are estimated for all weights using non-parametric bootstrapping. Procrustes rotations are used to align CCA weights across different bootstrap resamples (Johnson, 2018; Zientek & Thompson, 2007). Rotating each resampled CCA model is necessary because the components extracted can change order or sign (i.e., weights flip from positive to negative) across resamples. Only CCA weights for the first five components are used for Procrustes analysis.
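The sign indeterminacy mentioned above can be illustrated with the simplest possible case: a single component, where the best orthogonal alignment to a reference reduces to a possible sign flip. This Python sketch is not the full Procrustes rotation across five components used in the study, which also handles reordering and rotation; the function name and the resampled weights are hypothetical, and the reference weights are the first three risk factor weights reported in the Results.

```python
def align_sign(resampled_weights, reference_weights):
    """Flip a resampled weight vector if it points away from the reference."""
    dot = sum(w * r for w, r in zip(resampled_weights, reference_weights))
    if dot < 0:
        # Same component, opposite sign: flip so weights are comparable.
        return [-w for w in resampled_weights]
    return list(resampled_weights)


reference = [0.69, 0.30, 0.23]      # weights from the original fit
resample = [-0.66, -0.33, -0.20]    # hypothetical bootstrap fit, sign-flipped
aligned = align_sign(resample, reference)
print(aligned)
```

Without this alignment, averaging or taking percentiles over bootstrap resamples would mix positive and negative versions of the same component and inflate the confidence intervals.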
Research question 3: Predicting developmental outcomes from dimensions of risk factors
Finally, we assessed how much variance each risk factor component predicts in each outcome variable. This illustrates which outcomes are best explained by which risk factor components, more directly than relying only on the CCA weights. Using the CCA fitted in the training dataset, a series of linear regression models was fitted predicting each outcome variable from the risk factor components in the testing dataset. For each outcome, the number of risk factor CCA components used in each regression model was varied, using the first z components as predictors, with z ranging from one to the maximum (22 or 28). This generates vectors of regression beta coefficients for predicting each outcome using different numbers of CCA components as predictors.

RESULTS
First, the CCA model was fitted to the training dataset to estimate raw coefficients (Figure 1b) and used to generate predicted component scores for participants in the testing dataset (Figure 1d). Correlations between estimated CCA components from the training dataset are presented in Figure 1c, and correlations between predicted CCA components in the testing dataset are presented in Figure 1d. Correlations between all variables are presented in Supporting Information.

Research question 1: Testing the cumulative risk model
According to the cumulative risk factor model, only a single pair of CCA components should be sufficient to describe associations between risk factors and outcomes. Our results do not support this. The first 8 predicted canonical correlations were significant (all p < .001). However, predicted canonical correlations from the fourth correlation onward (r = .13) were very weak, so we have focused on interpreting the first three component pairs. The first predicted canonical correlation (r = .51, 95% CI [.49, .53], p < .001) was large compared to the second (r = .33, 95% CI [.31, .36], p < .001) and third (r = .25, 95% CI [.22, .27], p < .001) canonical correlations, though these were still moderately large.

Research question 2: Which risk factors predict which developmental outcomes?
Given that at least three medium-to-large predicted canonical correlations were found, we examined which combinations of risk factors were specifically associated with which combinations of developmental outcomes.CCA weights (raw coefficients) have been plotted in Figure 2.

First canonical correlation
The first risk factor CCA component was strongly represented by socioeconomic factors, with the largest CCA weights for parental education (w = .69), income (w = .30), and employment (w = .23), as well as for other factors including breastfeeding (w = .22), housing quality (w = .19), and more. The first developmental outcome component was broadly related to many outcomes, particularly parent-rated behavioral difficulties and cognition (vocabulary and spatial working memory, in particular). The correlation between these two components was large (r = .51, 95% CI [.49, .53], p < .001).

Second canonical correlation
The second CCA risk factor component was strongly represented by monolingual-English language use in the home (w = .61), higher parent alcohol consumption after pregnancy (w = −.48), and smoking after pregnancy (w = −.30). In other words, alcohol use in adolescence appeared to be specifically associated with monolingual-English language use in the home, and higher parental alcohol and tobacco use. The association between these two components was moderately strong (r = .33, 95% CI [.31, .36], p < .001) and independent from the first pair of components.

Third canonical correlation
Subsequent CCA components represent increasingly complex profiles of "positive" and "negative" factors. The third environment component was characterized by lower parental education (w = −.36) and employment level (w = −.25), but higher parent mental (w = .34) and general health (w = .26), less harsh parenting practices (w = .29), and a larger number of siblings (w = .46). This predicted an outcome component (r = .25, 95% CI [.22, .27], p < .001) consisting of worse vocabulary skills (sweep 5 w = −.27; sweep 6 w = −.55), but fewer parent-rated emotional (w = .42) and hyperactivity (w = .29) problems. However, adolescent self-rated mental health variables had mostly small, non-significant (p > .05) weights, as shown by their 95% CIs overlapping with 0. This suggests the component is mostly specific to parent-ratings of behavior.

Research question 3: Predicting developmental outcomes from dimensions of risk factors
For each outcome, the out-of-sample variance explained was estimated using regression models with increasing numbers of risk factor components. This tells us the variance explained in a particular outcome (e.g., vocabulary) by the first risk factor CCA component, the first two components, or the first three components. To compare, we also estimated how much variance could be explained when regressing each outcome on the original 31 risk factors in separate regression models. All prediction models were fitted using the training data and validated in the testing dataset. These results are presented in Figure 3 and Table 2. To aid visualization, outcomes have been clustered into four groups using hierarchical cluster analysis (using Manhattan distances and the complete linkage clustering method). The four clusters included parent-rated behavioral problems, cognition, drug-use (self-harm also clustered into this group), and child-rated mental health (self-rated antisocial behavior and bullying were also clustered into this group).
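Out-of-sample variance explained can be computed by comparing held-out predictions with simply predicting the held-out mean. The exact definition used in the analyses is not stated here, so the Python sketch below shows one common definition as an assumption; the function name and the toy values are hypothetical.

```python
def out_of_sample_r2(y_test, y_pred):
    """1 - SS_residual / SS_total, evaluated on held-out data."""
    mean_y = sum(y_test) / len(y_test)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_test)
    return 1 - ss_res / ss_tot


# Hypothetical held-out outcome values and two sets of predictions.
y_test = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.5, 2.0, 3.0, 3.5]
poor_pred = [3.0, 2.0, 3.0, 1.0]

r2_good = out_of_sample_r2(y_test, good_pred)
r2_poor = out_of_sample_r2(y_test, poor_pred)
print(round(r2_good, 2), round(r2_poor, 2))
```

Under this definition the value is not bounded below by zero: a model that predicts held-out data worse than the mean yields a negative value, which is how slightly negative percentages of variance explained can arise.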
First, we inspected the variance explained (R²) for each outcome variable when utilizing all 31 risk factors in regression models. Overall, early environmental risk factors could only explain a moderate amount of variance in each developmental outcome. At most, just over 14% of variance in vocabulary ability could be explained from risk factors. In general, risk factors were better at predicting parent-rated behavioral problems (7.7% ≤ R² ≤ 9.9%; excluding prosocial behavior) and vocabulary (R² = 11% and 14%), and relatively poor at predicting self-rated mental health outcomes (R² ≤ 3.7%). Alcohol (R² = 9.1%) and cigarette (R² = 5.8%) use were moderately well explained by risk factors.
Next, we inspected which CCA components were crucial for predicting different kinds of outcomes. The first risk factor component, which also can be thought of as a cumulative risk factor, explained a large proportion of variance in parent-rated behavioral problems and vocabulary (5.2% ≤ R² ≤ 13.2%; excluding prosocial behavior) but was relatively poor at predicting drug-use and self-rated mental health (−0.1% ≤ R² ≤ 2.2%). The second CCA risk factor component explained the majority of variance in drug-use (alcohol, tobacco, e-cigarette, and cannabis use). For example, when comparing the variance explained by 1 versus 2 components, the percentage of variance explained greatly increases for alcohol (0% to 8.7%), cigarette (2.2% to 5.9%), and cannabis (0.3% to 2.0%) use. This would not have been obvious by solely inspecting the CCA weights, where adolescent alcohol use dominated.
The third risk factor component explained additional, but much less, variance in parent-rated behavioral problems, vocabulary, and self-rated bullying. For example, comparing regression models with 2 versus 3 components, the variance explained in self-reported bullying (0.3% to 1.1%) and in vocabulary at the sixth sweep (13.2% to 14.5%) increased notably.
The variance explained when regressing all 31 risk factors onto each outcome (see Table 2 or Figure 3) was very similar to that obtained using three CCA components to predict each outcome. This illustrates that risk factors can be reduced to a handful of dimensions, but still robustly predict a wide range of developmental outcomes. Only occasionally did regression models with 31 risk factors explain more variance than models with just three CCA components as predictors, for example, for parent-rated emotional problems (7.0% to 7.7%) and alcohol use (8.8% to 9.1%).

FIGURE 2 CCA weights (raw coefficients) for first three risk and outcome components. 95% confidence intervals for weights are estimated using bootstrap resampling. Positive and negative weights are indicated by green and orange bars, respectively. All variables were coded so that a higher score indicates a "better" outcome or environment where applicable. For example, a higher score on the first CCA component captures children with higher vocabulary, fewer peer problems, but greater alcohol use.

Additional analyses including teacher ratings
The above analyses were run again on separate datasets including teacher-rated Academic Ability and Strengths and Difficulties Questionnaire (SDQ) assessments (see Supporting Information). Analyses that included teacher-rated outcomes yielded similar results. Teacher-rated academic ability was the developmental outcome most strongly predicted by risk factors (R² = 16.4%), explained primarily by the first risk factor component (R² = 16.1%), like other cognitive outcomes. Even though Teacher SDQ behavior ratings were only weakly-to-moderately correlated with parents' SDQ behavior ratings on the same subscales (.21 ≤ r ≤ .44; see Figure S2), both sets of variables were primarily explained by the first CCA risk factor component. However, the first CCA risk component explained less variance in teacher-rated SDQ scores (1.1% ≤ R² ≤ 4.1%) than parent-rated ones (1.7% ≤ R² ≤ 8.5%; see Figure S4). While the third risk factor component was moderately associated with parent-rated SDQ scores, it did not substantially explain additional variance in teacher-rated SDQ scores, though teacher-rated academic ability was strongly negatively associated with this factor.

DISCUSSION
A widespread theoretical assumption in both folk (e.g., Lilienfeld et al., 2011) and academic psychology is that specific environmental exposures result in specific developmental outcomes. This study explored whether a diverse set of 28 developmental outcomes covering cognition, behavior, mental health, and drug-use, share the same environmental risk factors measured in infancy, or whether risk factors are associated with specific developmental outcomes. CCA was used, which finds distinct components of risk factors that are maximally correlated with distinct components of developmental outcomes. We found that a single dimension of risk factors, primarily comprised of socioeconomic status variables, strongly correlated with a broad array of outcomes, including parent- and teacher-rated behavioral problems, cognition, and academic achievement (r = .51). The first CCA risk factor component can be thought of as a "cumulative risk factor," as it represents a linear combination of risk factors that is maximally correlated with a linear combination of outcome variables.
However, in line with the specificity approach, we found that distinct components of environmental risk factors predicted distinct components of developmental outcomes, but these associations were all much weaker (r ≤ .33). Monolingual-English home language, as well as greater parent alcohol and tobacco use, formed a CCA component that strongly explained alcohol and tobacco use in adolescence. The third pair of components identified a profile of environmental factors including low parental socioeconomic status (education and employment) but higher parental mental health and less harsh parenting, which predicted fewer parent-rated emotional and hyperactivity difficulties, but worse vocabulary and academic skills. However, these effects did not extend to teacher-rated behavioral difficulties from the same questionnaire, or other adolescent-rated measures of well-being. Subsequent CCA components become much more complicated to interpret, and have larger standard errors on the CCA weights, while also accounting for relatively little covariance between risk factors and outcomes. For these reasons, we focused on the first three pairs of components.

Evaluating the cumulative risk approach
The current study represents an advance on previous cumulative risk factor research, which has often utilized retrospective and simplistic measures of cumulative risk (see reviews; Evans et al., 2013; Smith & Pollak, 2020). The results support a central criticism of the cumulative risk approach: a single environmental risk score (estimated using the first CCA component) does not fully account for covariance between risk factors and developmental outcomes. In contrast, the first three predicted canonical correlations were meaningfully large in magnitude (r = .51, .33, .25). This demonstrates that a single, global risk score cannot best explain all outcomes, and that distinct combinations of risk factors associate with distinct combinations of outcomes.
We also examined how many CCA risk components are required to predict each developmental outcome in regression models. Depending on the outcome, we found that 1-3 CCA risk components could explain variance in most outcomes as effectively as using all 31 original environmental risk variables. In other words, for predicting developmental outcomes in adolescence, the 31 risk factors could be reduced to three components, with minimal loss in predictive accuracy.
In defense of the cumulative risk approach, all 31 risk factors can be reduced to a single CCA component that strongly predicts cognition and behavioral problems. Additional CCA components only explained relatively small amounts of additional variance in these variables. A common saying in statistics is that all models are wrong, but some are useful. Cumulative risk scores, especially when aggregated using more sophisticated methods, are likely to be quite useful predictors of many outcomes, but not all.

Cognitive and behavioral difficulties in adolescence are strongly patterned by risk factors in infancy
Outcomes ranging from education and cognition to behavioral problems rated by parents and teachers were all strongly-to-moderately predicted by risk factors in infancy. Our novel findings also show that these outcomes could be primarily explained by a single component of risk factors, primarily composed of socioeconomic status variables. Associations with teacher-rated behavior were slightly lower than for parent-rated behavior. In contrast, adolescent self-rated mental health was weakly associated with these risk factors (0.7% ≤ R² ≤ 3.7%). This study extends previous research by demonstrating that the environmental predictors of academic achievement, cognition, and behavioral difficulties largely overlap. Unlike previous studies, we demonstrate this by using a large hold-out sample, and by using CCA to analyze multiple outcomes in the same model.
It is perhaps not surprising that cognition and academic performance share the same risk factors, given their strong association (e.g., Bignardi et al., 2021). Indeed, cognitive assessments reflect developed skills and not necessarily latent capacity or genetic ability (Dumas & McNeish, 2017). However, we used a limited variety of cognitive measures, and future research could explore if other cognitive assessments exhibit different associations to risk factors. Outcomes from the gambling task were very weakly associated with all outcomes (but see Limitations).
Parental education was identified as a particularly important predictor, with its first CCA component weight being more than twice that of any other variable. This finding aligns with theories that emphasize parental education as a key causal driver between socioeconomic status and educational outcomes (Davis-Kean et al., 2021). Davis-Kean et al. (2021) suggest that parental education impacts expectations and involvement with children's education at home, influencing children's educational outcomes. One study also suggests that the association between maternal education and externalizing difficulties (rated by the SDQ) is at least partly mediated by parenting (Bøe et al., 2014).

Weak associations between adolescent-rated mental health and risk factors in infancy
Our results highlight differences in how parent and adolescent ratings of mental health relate to risk factors in infancy. This was evident both in the total percentage of variance explained, which was much lower for adolescent ratings than parent ratings, and in which CCA components predicted these different outcomes. The first CCA risk component, which strongly loaded on socioeconomic status variables, explained a moderate amount of variance in parent-rated emotional and conduct problems (R² = 5.2% and 7.6%, respectively), but very little for adolescent-rated outcomes (all R² ≤ 0.8%).
Previous research has suggested that socioeconomic gradients in mental health are less pronounced relative to cognition. For example, one registry-based study of all children in Norway (from 2008 to 2016) reported that the prevalence of depression between the lowest and highest percentile of family income differs by <1%, though this figure is slightly larger for anxiety disorder (~2%; Kinge et al., 2021).
The current study only examined early-life risk factors, but concurrent factors such as peer relationships may be more influential for adolescents' self-rated mental health (Lamblin et al., 2017). Indeed, adolescence is often characterized as a period in which many mental health disorders first present to clinicians, and in which social relationships outside the home increase in importance. For example, Singham et al. (2017) investigated how being bullied is associated with differences in mental health between identical twins. Bullying had larger associations with concurrent, self-rated depression and anxiety, and smaller effects for parent-rated outcomes or outcomes measured 2 or 5 years later. The importance of concurrent risk factors in predicting adolescent mental health may explain why the associations with early-life risk factors are weak. Well-being in adolescence may also be more transitory and less predictable than other developmental periods, which could be explored in future research.
Subjective experiences of adversity may also be important for the development of psychopathology, in addition to objective exposures. Individual perception of the environment is an essential feature of chronic stress and anxiety responses (Brosschot et al., 2017). For example, Danese and Widom (2020) found that adults with objective, court-substantiated maltreatment in childhood, but without retrospective reports of abuse or neglect, had similar levels of psychopathology to matched controls with no objective or subjective experience of maltreatment. Again, our lack of measures of these subjective experiences may explain why the associations with adolescent mental health are relatively weak.

Specific association between parent and child alcohol and tobacco use
We found tentative evidence for a specific association between parents' alcohol and tobacco consumption (both during and after pregnancy) and adolescents' alcohol and tobacco use. These variables loaded highly onto the second risk factor and outcome CCA components, which were moderately correlated (r = .33), independently of other components. Alcohol and tobacco use in adolescence is a concern, as it is associated with alcohol dependence in adulthood (McCambridge et al., 2011). Early-onset alcohol use, as measured here, is associated with heavier and more problematic use later (Hingson & Zha, 2009). Together, tobacco and alcohol use accounted for 9 million deaths globally in 2015 (Forouzanfar et al., 2016), representing one of the leading causes of preventable disease. These results suggest that reducing parent alcohol and tobacco use may also benefit children, although causal evidence is required.
These findings accord with the sizeable correlational literature on alcohol use and dependence in parents, and their relation to children's alcohol consumption (Rossow et al., 2016). However, genetically sensitive research designs have questioned whether environmental influence can explain the observed correlations entirely. Children-of-twins designs can be used to disentangle genetic and environmental effects. The offspring of parent identical twins who diverge in a particular trait (e.g., alcoholism) are compared. Social and genetic factors shared between parent twins are more carefully balanced by making matched comparisons between children. Two studies have not found conclusive evidence of environmentally transmitted alcohol abuse in children, although sample sizes were small (Duncan et al., 2006; Slutske et al., 2008). Regardless, alcohol use in parents and children was assessed here by the frequency of drinking, rather than abuse or dependency symptoms. Indeed, we observed the largest increase in adolescent alcohol use when comparing children of caregivers who never drink with children of caregivers who drink a small amount (see Supporting Information).
Our results cannot tell us if parental alcohol use has a direct, causal effect on children's use. In addition to genetic factors, several unmeasured factors correlated with parents' alcohol use may mediate the relation. For example, two meta-analyses have linked parenting factors, including alcohol use approval, provision, and rules, to children's alcohol use (Sharmin et al., 2017a, 2017b). General factors such as parental monitoring, support, and involvement are also linked to alcohol initiation and misuse (Yap et al., 2017). Evidence from randomized controlled trials suggests that parent programs can have modest impacts on adolescent alcohol use (Bo et al., 2018). Similar factors have also been linked to smoking initiation in adolescence (Hill et al., 2005; Huver et al., 2006). Twin studies suggest that for smoking initiation, the shared home environment is one of the most important factors (Li et al., 2003).
While higher equivalized household income (and socioeconomic status more generally) usually predicts an average reduction in risk exposure or improvement in outcomes, higher household income was associated with increased parental alcohol consumption both during (r = −.18; note alcohol consumption is coded so a higher value means less consumption) and after pregnancy (r = −.35). Research from other Western countries has found similar results (Charitonidi et al., 2016; Patrick et al., 2012).

Dimensional models of adversity
As outlined above, our data are in line with neither specific nor cumulative risk models. Instead, we found CCA components comprising several risk factors and related outcomes (in line with cumulative risk), as well as components with more specific relations between risk factors and outcomes (in line with specificity). Because our data sit in this middle ground, one intuitive interpretation is that they support dimensional models (McLaughlin et al., 2020). Given its goal of finding dimensions of predictors that are highly related to dimensions of outcome variables, CCA indeed represents a promising approach for developing or testing dimensional theories of adversity. The method can also be extended to allow dimensions of predictors (or outcomes) to be correlated (Rohart et al., 2017). Despite CCA's suitability for testing dimensional theories, we did not explicitly test any existing models. We lacked the detailed data to fully map the dimensions posited by prominent dimensional theories that focus on deprivation and threat (McLaughlin & Sheridan, 2016), neglect and harm (Humphreys & Zeanah, 2015), or harshness and unpredictability (Belsky et al., 2012). Instead, our results form a data-driven path toward new dimensional theories. As outlined above, our approach and data support the notion of one "cumulative risk" dimension, and two more specific dimensions: one related to alcohol use, and another to mental health and socioeconomic status. Hence, while our results do not directly support or refute specific dimensional theories, the methods form a basis upon which future dimensional theories could be built.

Limitations
It is not certain to what extent these results will replicate across different cultures and time periods. While many developmental theories assume that associations between risk factors and outcomes are determined through biologically universal pathways (McLaughlin et al., 2020), societal and political factors (which differ between countries and time points) may also be important. For example, the correlation between socioeconomic status and tobacco use has increased over time in the UK and North America, because of lowering smoking prevalence in higher educated adults (Corsi et al., 2014). The importance of monolingual-English home language for predicting alcohol use may indicate the effect of social and cultural norms on behavior. Future research could also explore if the same canonical components operate across different demographic subgroups.
Further methodological advances could improve the analysis methods in future studies. Splitting datasets into training and testing datasets is generally a less efficient approach than other approaches to internal validation (e.g., 10-fold cross-validation; Kohavi, 1995). The use of multiple rather than single imputation is also often recommended (Buuren & Groothuis-Oudshoorn, 2011). However, the best approach for incorporating these methods in CCA remains unclear, as combining CCA variates across resamples or imputations is not a trivial problem. Nonetheless, given the large sample size, a small reduction in efficiency would have minimal impact on our results or interpretations.

CONCLUSION
It is widely assumed that specific environmental factors impact specific developmental outcomes, but testing specificity theories with observational data is fraught with challenges. Using CCA, we explored whether developmental outcomes are best explained by a shared set of early environmental risks, or whether different developmental outcomes are associated with distinct sets of risk factors. A single component of risk factors, primarily comprising socioeconomic status variables, explained a large proportion of variance across cognitive and behavioral outcomes. However, adolescent drug-use, particularly alcohol use, was more strongly associated with parental alcohol and tobacco use, and with whether children come from a monolingual-English home. Adolescent self-rated mental health was weakly associated with all risk factors. The results partially support the cumulative risk approach, with the caveat that weaker, specific associations between some risks and outcomes are likely also present.

From these prediction models constructed in the training dataset, the out-of-sample variance explained in the testing dataset is calculated, using the coefficient of determination $R^2 = 1 - \frac{\text{Residual Sums of Squares}}{\text{Total Sums of Squares}}$. Logistic regression is used instead of linear regression to link CCA components to binary outcomes.
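The coefficient of determination described above can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes a single predictor (standing in for one CCA risk component) and a total sum of squares computed around the testing-set mean, and all data values are made up. It also shows why out-of-sample R² can be negative: a model fitted on training data can predict the testing data worse than the testing-set mean does.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def out_of_sample_r2(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, with TSS taken around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    tss = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - rss / tss


# Hypothetical training data: fit the model here only.
train_x, train_y = [0, 1, 2, 3], [0, 1, 2, 3]
intercept, slope = fit_simple_regression(train_x, train_y)

# Hypothetical testing data: one set resembling training, one that does not.
test_x = [0, 1, 2, 3]
predictions = [intercept + slope * xi for xi in test_x]

r2_similar = out_of_sample_r2([0, 1, 2, 4], predictions)
r2_reversed = out_of_sample_r2([3, 2, 1, 0], predictions)

print(round(r2_similar, 3))  # positive: predictions beat the test-set mean
print(r2_reversed)           # negative: residual SS exceeds total SS
```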

FIGURE 3 The percentage variance explained (R²) in each outcome variable by CCA-derived dimensions of environmental risk factors measured during infancy. The orange bars indicate the percentage variance explained in each outcome variable by the first risk factor component. The green bars indicate the variance explained by the first and second risk factor components, and the purple bars indicate the variance explained by the first three components. The black bars indicate the variance explained when entering all 31 risk factors into each regression model. Because prediction accuracy is validated on an external dataset, simpler models with fewer predictors may out-perform more complicated models. These values are also presented numerically in Table 2.

TABLE 2 Variance explained (R²) for each outcome variable by either the first, first two, or first three CCA components. The final column indicates the variance explained when regressing all 31 risk factors onto each outcome variable. See Figure 3.
Columns: number of CCA components as predictors.
Note: Negative R² indicates that the model residual sums of squares are larger than the total sums of squares, which can occur when using a testing dataset for internal validation. Numbers indicate a percentage of variance, for example, .031 indicates 3.1% variance explained.