Dairying, diseases and the evolution of lactase persistence in Europe

In European and many African, Middle Eastern and southern Asian populations, lactase persistence (LP) is the most strongly selected monogenic trait to have evolved over the past 10,000 years1. Although the selection of LP and the consumption of prehistoric milk must be linked, considerable uncertainty remains concerning their spatiotemporal configuration and specific interactions2,3. Here we provide detailed distributions of milk exploitation across Europe over the past 9,000 years using around 7,000 pottery fat residues from more than 550 archaeological sites. European milk use was widespread from the Neolithic period onwards but varied spatially and temporally in intensity. Notably, LP selection varying with levels of prehistoric milk exploitation is no better at explaining LP allele frequency trajectories than uniform selection since the Neolithic period. In the UK Biobank4,5 cohort of 500,000 contemporary Europeans, LP genotype was only weakly associated with milk consumption and did not show consistent associations with improved fitness or health indicators. This suggests that other reasons for the beneficial effects of LP should be considered for its rapid frequency increase. We propose that lactase non-persistent individuals consumed milk when it became available but, under conditions of famine and/or increased pathogen exposure, this was disadvantageous, driving LP selection in prehistoric Europe. Comparison of model likelihoods indicates that population fluctuations, settlement density and wild animal exploitation—proxies for these drivers—provide better explanations of LP selection than the extent of milk exploitation. These findings offer new perspectives on prehistoric milk exploitation and LP evolution. Examination of archaeological pottery residues and modern genes suggest that environmental conditions, subsistence economics and pathogen exposure may explain selection for lactase persistence better than prehistoric consumption of milk.


Dairying, diseases and the evolution of lactase persistence in Europe
In European and many African, Middle Eastern and southern Asian populations, lactase persistence (LP) is the most strongly selected monogenic trait to have evolved over the past 10,000 years 1 . Although the selection of LP and the consumption of prehistoric milk must be linked, considerable uncertainty remains concerning their spatiotemporal configuration and specific interactions 2,3 . Here we provide detailed distributions of milk exploitation across Europe over the past 9,000 years using around 7,000 pottery fat residues from more than 550 archaeological sites. European milk use was widespread from the Neolithic period onwards but varied spatially and temporally in intensity. Notably, LP selection varying with levels of prehistoric milk exploitation is no better at explaining LP allele frequency trajectories than uniform selection since the Neolithic period. In the UK Biobank 4,5 cohort of 500,000 contemporary Europeans, LP genotype was only weakly associated with milk consumption and did not show consistent associations with improved fitness or health indicators. This suggests that other reasons for the beneficial effects of LP should be considered for its rapid frequency increase. We propose that lactase non-persistent individuals consumed milk when it became available but, under conditions of famine and/or increased pathogen exposure, this was disadvantageous, driving LP selection in prehistoric Europe. Comparison of model likelihoods indicates that population fluctuations, settlement density and wild animal exploitation-proxies for these drivers-provide better explanations of LP selection than the extent of milk exploitation. These findings offer new perspectives on prehistoric milk exploitation and LP evolution.
When our ancestors started to consume the milk of domesticated ruminant animals, they set in motion a sequence of events leading to the evolution of lactase persistence (LP) 6 and ultimately the modern global dairy industry, using about 1.5 billion cattle 7 as the principal milk source. Evidence for prehistoric milk exploitation based on combined faunal and organic residue analysis indicates considerable spatial variability. Analysis of sheep or goat age-at-death profiles suggest that dairy management was concurrent with the domestication of caprines in southwest Asia in the ninth millennium bc 8,9 . The earliest organic residue-based evidence of dairying locates to seventh millennium bc northwest Anatolia and is associated with high proportions of cattle bones in faunal assemblages 2 . It is clear that dairy husbandry and processing technologies were brought by the first farmers into Europe via the Mediterranean coast 3 and through the Balkans 10,11 into central Europe [12][13][14] , although milk use remains undetected in Neolithic sites of northern Greece 15 . In northern Europe, extensive studies of organic residues in pottery and dental calculus show unequivocally that the fourth millennium bc Neolithic inhabitants of the British Isles and Ireland were accomplished dairy farmers (for example, refs. [16][17][18][19] ). In prehistoric Denmark, there is evidence of milk use alongside aquatic resource processing from 4000 bc (ref. 20 ). In prehistoric Finland, the region of highest milk consumption globally today 7 , milk use appears with the arrival of Corded Ware pottery in the early third millennium bc, although climate fluctuations may subsequently have triggered returns to hunter-gatherer-fisher subsistence strategies 21,22 .
is a prerequisite 43 . However, LP is not accompanied at an individual level by meaningful differences in milk consumption 31,37,44,45 or the risks from a range of diseases 46 in modern populations and thus its selection may be underpinned by potential adverse consequences of milk consumption in LNP individuals only in certain historically contingent environments 37 .

Mapping ancient milk use
To build a more comprehensive picture of prehistoric milk use and examine how well it explains selection on LP, we compiled all published instances of the occurrence of milk fat residues detected in pottery containers from 366 archaeological sites in Europe and southwest Asia. To improve spatiotemporal coverage we added newly generated organic residues data from 188 sites from various regions and periods. Overall, we used 6,899 animal fat residues derived from 13,181 potsherds from 826 phases from 554 sites, with associated georeferences and phase chronology including more than 1,000 radiocarbon dates. These data were used to generate time series depicting the frequency of milk use across prehistoric continental Europe from 7000 bc to ad 1500 ( Fig. 1) and regional time series ( Fig. 2 and Extended Data Fig. 1). The high density of organic residues around 5500 to 5000 bc in Fig. 1b reflects our new sampling, targeted to explore the coevolution of dairying and LP amongst the first farmers in the central European Linearbandkeramik culture 29 . If we consider the profusion of Δ 13 C values ≤−3.1‰ (a proxy for milk fat residues 16,47,48 ; Methods; Fig. 1b), it becomes immediately clear that milk use was a very widespread activity across all periods in European prehistory and at the broadest scale this was congruous with the spread of farming across the continent.
The extensive sampling and analysis of pottery has enabled us to resolve the spatiotemporal patterning of milk use in unprecedented detail (Fig. 2). Importantly, this shows that milk exploitation arrived with the first farmers into the Mediterranean basin (except the region covered by modern day Greece) and continued throughout the Neolithic period, whilst the later arrival of the Neolithic in Southern Britain was associated with immediate and high milk use-probably reflecting the arrival of advanced dairying populations from adjacent regions of continental Europe 27 -which then gradually reduced. The prehistoric Balkans was the core region for the early intensification of milk use, which is consistent with the rise of cattle exploitation there 49 yet, surprisingly, milk use appears largely absent from geographically adjacent Neolithic site phases located in Greece (on the basis of analyses of >190 animal fats from >870 potsherds; however, we note that other types of containers may have been used for milk processing in this cultural context). In some regions our data suggest continuous intensive milk use, notably in western France, northern Europe and the British Isles from 5500 bc to ad 1500. Despite intensive sampling from central Europe (>2,800 animal fats from >6,810 potsherds), a region previously inferred to be the origin of selection on LP 29 , the frequency of milk use was consistently lower than for the southeast, west or north of Europe throughout prehistory. The overarching picture is that, whilst dairying persisted throughout the Neolithic period, its intensity fluctuated substantially in space and through time, suggesting regionally specific instabilities in food production and cultural changes in dietary preferences. This is consistent with previous studies showing regional 'boom and bust' fluctuations in population density across Europe over the same period 50 .

Milk use and LP evolution
LP allele (rs4988235-A) frequency trajectory estimates (Fig. 3) based on published aDNA data from 1,786 prehistoric European and Asian individuals ('Data' under 'Ancient DNA analyses' in Methods), are well-described by a sigmoidal curve, as expected for a selective sweep. Curves fitted only to aDNA broadly predict modern allele frequencies 31 . We note that, although the allele only reaches appreciable frequencies by about 2000 bc (ref. 28 ), nearly three millennia after its first detection (earliest LP individual dated to about 4700-4600 bc), the dates of origin, the start of selection, the earliest observation and the reaching of appreciable frequencies of this allele are all distinct 'events', each possibly separated by thousands of years 28 . To examine if milk usage patterns can explain these allele frequency trajectories, we devised a maximum likelihood modelling approach (Methods), whereby LP selection intensity was determined by the proportion of potsherds with dairy fat residues (as a proxy for milk use). Using simulations, we made sure that our approach had the power to detect an association, if there was one, even in the presence of moderate levels of noise in our milk use patterns ('Power analysis' under 'Ancient DNA analyses'  84 and would thus be defined as 'non-milk'. c, Percentage of dairy fat residues through time, calculated using all animal fat residues. Grey bars and black lines illustrate 95%, 50% confidence interval (CI) and maximum a posteriori (MAP) in each time slice, using a uniform prior.

Article
in Methods). We found that patterns of milk usage provided no better explanation (no increased probability of observed allele frequency data) than assuming uniform selection in space and through time, contrary to the claims of others 33 . As an additional sensitivity test of this methodology, we tried using incident solar radiation as the main environmental variable driving LP selection intensity-in line with the calcium assimilation hypothesis 32 -and obtained higher probabilities of the observed allele frequency data (Extended Data Figs. 2 and 3 and Extended Data Table 1).

Health impacts of milk consumption
Our analysis of potsherd lipid residue and ancient LP allele data suggest that intensity of milk usage-beyond its mere presence-did not drive selection on LP. To explore other explanations for LP selection we first turned to contemporary data. The UK Biobank 4,5 includes genotypic and phenotypic data from about 500,000 people aged between 37 and 73 years, recruited between 2006 and 2010 (http://biobank. ndph.ox.ac.uk/showcase). In a subset of approximately 337,000 unrelated participants who classified themselves as 'white, British' and have similar genetic ancestry, LP genotypes were out of Hardy-Weinberg equilibrium (Extended Data Table 2), which could reflect genotype-related assortative mating 51 (for which there is weak evidence for LP; Extended Data Table 3), selection into a study or population structure, amongst other influences 52 . Around 92% (95% confidence interval (CI) 91.5-92.2%) of genetically LNP participants mainly use fresh cows' milk instead of soya or no milk (Table 1) and only 0.63% of milk consumers reported following a lactose-free diet (95% CI 0.59-0.67%). This suggests that LP has only a small effect on milk consumption and use of lactose-free products is not an explanation. This finding is at odds with selection hypotheses of nutritional advantages from milk. Consistent with these results, some non-European countries with very low levels of LP have been importing milk in vast quantities in recent years as part of a more general adoption of Western diets. China, for example, has very low LP frequencies 31 and milk was rarely consumed in the early twentieth century 53 but milk consumption has increased more than 25-fold over the past five decades 54 . These findings bring into question the widely held belief that selection against LNP was the result of detrimental effects of milk consumption in otherwise healthy individuals-for example, through inducing stomach cramps, diarrhoea and flatulence 40,41,[55][56][57] . One possible explanation for the small effect of LNP on milk intake in contemporary data is the use of mealtime lactase supplements that reduce symptoms of lactose intolerance. However, these have only become widely available recently and UK Biobank participants would not have had access to them for most of their lives. We examined phenotypic associations of genetically predicted LP (Fig. 4) to investigate potential pathways through which selection could have occurred. The related calcium assimilation and vitamin D hypotheses 32 would anticipate LP to be associated with higher vitamin D concentration and bone mineral density due to its positive relationship with milk consumption [58][59][60] . However, these estimates are very close to the null and the large sample size puts considerable constraints on the magnitude of effects that could exist.
Milk consumption has been suggested to increase circulating insulin-like growth factor I (IGF-1) 61 , leading others 62 to propose a model of LP selection driven by fitness advantages of IGF-1 acting to increase body size and lower age of sexual maturation. However, there were no meaningful differences in IGF-1 between LP and LNP individuals.

Fig. 2 | Regional variation in milk use in prehistoric Europe.
Interpolated time slices of the frequency of dairy fat residues in potsherds (colour hue) and confidence in the estimate (colour saturation) using two-dimensional kernel density estimation. Bandwidth and saturation parameters were optimized using cross-validation. Circles indicate the observed frequencies at site-phase locations. The broad southeast to northeast cline of colour saturation at the beginning of the Neolithic period illustrates a sampling bias towards earliest evidence of milk use. Substantial heterogeneity in milk exploitation is evident across mainland Europe. By contrast, the British Isles and western France maintain a gradual decline across 7,000 years after first evidence of milk about 5500 bc. Note that interpolation can colour some areas (particularly islands) for which no data are present.
Nevertheless, LP allele presence was associated with 0.01 s.d. (95% CI 0.01, 0.02) higher body mass index (BMI), as previously reported [58][59][60] but not with height, consistent with other studies well controlled for population stratification 63,64 . Whilst statistically robust and replicating what has been observed in other large samples [58][59][60] , the BMI effect is of small magnitude and congruent with a small difference in calorie intake consequent to milk consumption. LP allele association with all-cause mortality, mother's age of death and father's age of death, were all close to the null. These results suggest that milk has little or no adverse health effect when consumed by LNP adults in a contemporary population. Another possible driver of selection is the influence of the LP allele on fertility but we found no meaningful effect of the LP allele presence on number of live births or number of children fathered.

Other drivers of LP evolution
Aside from dietary change, a number of other factors are likely to have influenced fecundity and mortality following the establishment of farming communities in Europe, including increased population and settlement density 65 , increased mobility 66 , proximity to animals, frequent crop failure, famine and population collapse 50

Article
61% of known and about 75% of emerging human infectious disease today come from animals 67,68 ). Given the widespread prehistoric exploitation of milk shown here ( Fig. 2) and its relatively benign effects in healthy LNP individuals today, we propose two related mechanisms for the evolution of LP. First, as postulated in ref. 24 , the detrimental health consequences of high-lactose food consumption by LNP individuals would be acutely manifested during famines, leading to high but episodic selection favouring LP. This is because lactose-induced diarrhoea can shift from an inconvenient to a fatal condition in severely malnourished individuals 69 and high-lactose (unfermented) milk products are more likely to be consumed when other food sources have been exhausted 70 . This we name the 'crisis mechanism', which predicts that LP selection pressures would have been greater during times of subsistence instability. A second mechanism relates to the increased pathogen loads-especially zoonoses-associated with farming and increased population density and mobility 66 . Mortality and morbidity due to pathogen exposure would have been amplified by the otherwise minor health effects of LNP in individuals consuming milk-particularly diarrhoea-due to fluid loss and other gut disturbances, leading to enhanced selection for LP 37 . We name this the 'chronic mechanism', which predicts that LP selection pressures would have increased with greater pathogen exposure. Crucially, proxies are available for some of the drivers of these suggested mechanisms, permitting us to explore the extent to which they explain LP selection and so allele frequency trajectories, using our maximum likelihood modelling approach. For the 'crisis mechanism' we use fluctuations in population size (residual fluctuations detrended for overall background growth) as a proxy for malnutrition exposure. For the 'chronic mechanism', we use temporal variation in settlement density as a proxy for pathogen exposure. Both proxies are constructed using an extensive database comprising >110,000 14  We also considered the proportion of domestic and wild animals in faunal assemblages-constructed using >1,000,000 NISPs (number of identified specimens) of 17 main meat-bearing taxa from 1,093 phases from 825 sites 49,71 -as explanatory variables for LP allele frequency trajectories. However, it is not clear if exploiting more domestic animals would increase pathogen exposure, as domesticates tend to live in closer proximity to humans than to wild animals 72 , or if exploiting more wild animals would increase pathogen exposure, as greater pathogen diversity is expected for wild animals 73,74 . Furthermore,  a shift from domestic to wild animal exploitation might result from the economic instabilities associated with the 'crisis mechanism' 50,75 . Interestingly, the proportion of wild rather than domestic animals in archaeological assemblages provided higher model likelihoods (the LP allele data are 34 times more probable under a model of wild animal consumption driving selection, than a model of constant selection; Extended Data Fig. 6). This could be interpreted as support for either the 'crisis' or 'chronic' mechanisms or be argued to counter the 'chronic' mechanism if prehistoric pathogen exposure was mainly from domesticates.
The prevailing narrative for the coevolution of dairying and LP has been a virtuous circle mechanism in which LP frequency increased through the nutritional benefits and avoidance of negative health costs of milk consumption, facilitating an increasing reliance on milk that further drove LP selection. Our findings suggest a different picture. Milk consumption did not gradually grow throughout the European Neolithic period from initially low levels but rather was widespread at the outset in an almost entirely LNP population. We show that the scale of prehistoric milk use does not help to explain European LP allele frequency trajectories and thus it also cannot account for selection intensities. Furthermore, we show that LP status has little impact on modern milk consumption, mortality or fecundity and milk consumption has little or no detrimental health impact on contemporary healthy LNP individuals.
Instead, we find support for two related hypotheses: that LP selection was driven episodically and acutely by famine and/or on a more continuous basis by synergies between pathogen exposure and the otherwise benign consequences of milk consumption in LNP individuals 37 . We propose that these mechanisms would have applied in the disease-and malnutrition-prone environment existing during the period of rapid increase in LP frequency but would not be expected to apply outside these circumstances. This historical contingency is supported by the substantial increase in consumption of milk in contemporary China, where most of the population are LNP 76,77 . In contemporary populations, LP genotype associates strongly with aspects of the gut microbiome 78 but only in consumers of milk 79 . This suggests that when milk entered the diet of a LNP population it would have altered the gut microbiome. We postulate that, when combined with the changing pattern and prevalence of circulating infections consequent on increasing population density and settlement size, diarrhoeal disease mortality in late childhood could have been increased in LNP individuals drinking milk. Over the period in which LP prevalence increased, the ratio of late childhood (5-18 yr) to early childhood (2-5 yr) mortality increased 80 . The 'crisis' and 'chronic' mechanisms are, of course, not mutually exclusive, nor do they exclude other LP selection mechanisms, especially outside western Eurasia. We also test modern midday solar insolation incident on a horizontal surface 81 and our maximum likelihood approach does indeed provide support (see Extended Data Fig. 3) for the calcium assimilation hypothesis 32 , although it should be noted that solar insolation is strongly correlated with latitude and may therefore be confounded by factors influencing the spatial spread of the LP allele, such as its origin location, 'allele surfing' 29,34 or large-scale population movements 26 . Other plausible proposals for selection favouring LP in Europe, such as milk as a relatively pathogen-free fluid 36 , milk allowing earlier weaning and thus increased fertility 82 , milk galacto-oligosaccharide benefits for the colonic microbiome 38,39 or higher efficiencies of calorie production from dairy farming 83 are more difficult to assess. However, we note that the best proxy for all of these processes is milk consumption level, which itself provides no better explanation for LP allele frequency trajectories than does constant selection through time. Therefore, the spatiotemporal patterns of LP selection in ancient Europeans probably do not reflect variation in their milk consumption. Instead, they map latitude, population volatility and settlement density-proxies for environmental conditions, subsistence economics and pathogen exposure, respectively. We note that-albeit in different configurations-these factors remain drivers of human morbidity and mortality today.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-022-05010-7.

Lipid residue analyses Data, lipid extraction and analysis.
Our lipid residue analysis draws on our SQL database of 13,181 archaeological potsherds from 554 sites, of which 6,899 sherds have yielded animal fat residues from georeferenced sites and dated phases (826 phases from 554 sites). This analysis represents about 30% new data (Supplementary Table 1) and about 70% collated from previously published studies 3,11,12,[15][16][17][18][20][21][22]33, . Lipid residue analyses and interpretations were based on established protocols 160,161 . Lipids were extracted from 1-3 g of cleaned potsherd and analysed using gas chromatography (GC) and GC-mass spectrometry (GC-MS) to determine their molecular composition. Compound-specific δ 13 C values of fatty acids were determined using Agilent Industries 7890A gas chromatograph coupled to an Isoprime 100 isotope ratio mass spectrometer (GC-C-IRMS) for extracts identified as animal fats. Samples were introduced via a split/splitless injector in splitless mode onto a 50 m × 0.32 mm fused silica capillary column coated with a HP-1 stationary phase (100% dimethylpolysiloxane, Agilent, 0.17 μm). The GC oven temperature programme was set to hold at 40 o C for 2 min, followed by a gradient increase to 300 o C at 10 o C min −1 ; the oven was then run isothermally for 10 min. Helium was used as a carrier gas and maintained at a constant flow of 2 ml min −1 . The combustion reactor consisted of a quartz tube filled with copper oxide pellets which was maintained at a temperature of 850 o C. Each sample was run at least in duplicate. Instrument stability was monitored by running a fatty acid methyl ester standard mixture every two or four runs. The δ 13 C values are the ratios 13 C/ 12 C and expressed relative to the Vienna Pee Dee Belemnite, calibrated against a CO 2 reference gas injected directly in the ion source as two pulses at the beginning of each run. Instrumental precision was 0.3‰. Data processing was carried out using Ion Vantage software (v.1.5.6.0, IsoPrime). Δ 13 C proxy for milk exploitation. The δ 13 C values of the C 16:0 and C 18:0 fatty acids from more than 170 modern ruminant adipose and dairy fats taken from animals raised in Europe, Eurasia and Africa, reveal a consistent difference in the δ 13 C values of the C 18:0 fatty acid, which relate to the different biosynthetic origins of this fatty acid in the two fat types 16,47 . The result is that δ 13 C values of the C 18:0 fatty acid in milk are always depleted compared to the C 18:0 fatty acid in ruminant adipose fats by on average 2.3‰ (ref. 16 ). This difference allows archaeological ruminant fats extracted from pottery to be assigned as either milk or adipose fats. Initial reference fats were taken from animals grazing in a pure C 3 plant-dominated pasture in the UK and archaeological investigations of dairying in historically pure C 3 regions can be achieved by comparisons with the raw δ 13 C values, for example, refs. 16,47 . However, in regions where C 4 plants are present and other environmental factors (for example, high aridity) affect the raw δ 13 C values, the Δ 13 C values (Δ 13 C = δ 13 C 18:0 − δ 13 C 16:0 ) are used to remove the environmental effects (for example, refs. 2,48 ). For modern reference fats, ruminant milk fats are distinguished from ruminant adipose fats in displaying Δ 13 C values ≤−3.1‰, ruminant adipose fats between −3.1 and −0.3‰, with non-ruminant fats at >−0.3‰. Individual archaeological pottery lipid extracts are assigned to fat type on the basis of their Δ 13 C values using the ranges defined by the reference fats.
Animal fats extracted from potsherds are categorized as containing milk lipids if their Δ 13 C (= δ 13 C 18:0 − δ 13 C 16:0 ) values are below the threshold −3.1‰ or non-milk otherwise 48 . Although this threshold choice influences the absolute proportion estimates, it has almost no influence when evaluating the fluctuations in milk use through time, as the relative proportions will be largely unaffected (Supplementary Fig. 1). Indeed, even if the exact number of sherds containing dairy fat residues were known, this could not be used as a proxy for milk proportion in overall diets because of biases in pottery use. Similarly, our conclusion that milk exploitation was widespread throughout the Neolithic period is also based on the observation that the proportion of sherds with dairy fat residues were similar to that of the Bronze Age.
Chronology. Archaeological sherds are associated in the MySQL database with specific site phases. Each phase is assigned a uniform date range using published start and end dates from archaeological site reports. If unavailable, we used calibrated 162 radiocarbon dates to construct phase models in OxCal 163 and used the mode of each boundary posterior distribution as start and end dates. In cases in which neither of these sources was available (10% of phases), phases 'borrow' their chronology from the geographically closest site phase that has the same associated culture. Date ranges of site phases are reasonably constrained (75% less than 580 years).

UK Biobank analysis
Data. UK Biobank 4,5 is a large prospective cohort study of approximately 500,000 UK participants aged 37-73 years at recruitment (2006-2010). The study collected a wide-range of measures including demographics, health status, lifestyle, cognitive function, mental and physical health and molecular data such as genetic and biomarkers. UK Biobank received ethical approval from the Research Ethics Committee (REC reference for UK Biobank is 11/NW/0382). All analyses were performed under application no. 16729.

Dairy consumption.
Multiple measures of milk consumption were evaluated in this study. Milk servings. Participants were asked: How many glasses/cartons/ 250 ml of milk (excluding milkshakes) did you consume yesterday? (field 100520) and how many glasses/cartons/250 ml of yogurt drinks, flavoured milk or milkshakes did you consume yesterday? (field 100530). These measures were only asked if participants drank some milk the previous day and therefore there was no zero servings value. To reduce missingness, mean servings were calculated from five repeated measures taken in 2009-2012. Dairy milk preference. Participants were asked at baseline: What type of milk do you mainly use? (field 1418). From this variable binary non-dairy versus dairy consumption was derived: soya and never/rarely have milk versus full cream, semi-skimmed and skimmed. Lactose-free diet. Participants were asked: Do you routinely follow a special diet? Respondents answering 'Yes' to diet for lactose intolerance on any of the five repeated measures undertaken during 2009-2012 were compared against participants answering 'No' to all non-missing responses.
Genetic data. We focused on the major European LP variant located upstream of the coding gene promotor (rs4988235-A, LCT-13,910*T) (ref. 30 164 ). This finding could arise from non-random mating, ascertainment bias, genotyping error or population stratification/admixture 52 . The last was mitigated as far as reasonably possible using a homogenous population which was clearly effective when considering the more extreme HWE violation in 'any other white background' group (Extended Data Table 2; χ 2 = 423.57; P = 4.10 × 10 −94 ). HWE violation was also observed in gnomAD 165 which contains genotype frequencies on 33,992 non-Finnish Europeans using whole genome sequencing (Extended Data Table 2; χ 2 = 300.32; P = 2.81 × 10 −67 ). Sequencing and array technologies have distinct error profiles and therefore technical artefact is an unlikely explanation.
Crude gene association with consumption. Unadjusted genotype frequencies stratified by dairy consumption were calculated from an unrelated white British subsample of UK Biobank prepared according to ref. 166 .
Gene association with health outcomes. Variant association with health outcomes was performed using an unrelated white British subset of UK Biobank for the main analysis and within-family estimates for sensitivity analyses. Time-to-death was estimated using a Cox proportional hazards model. All-cause mortality was derived from field date-of-death (40000). The following fields within the UK Biobank were used: body mass index (21001); father's age at death (1807); heel bone mineral density (78); IGF-1 (30770); mother's age at death (3526); number of children fathered (2405); number of live births (2734); standing height (50); vitamin D (30890). All models assumed additive inheritance and were adjusted for age, sex and top 40 genetic principal components. Within-family estimates were produced as previously described for continuous 167 and binary outcomes 168 . Date-of-death was obtained from linkage with death registries. All other outcomes were collected at baseline.
Assortative mating influence on lactase persistence. To determine whether milk consumption influences mate choice and therefore invalidates HWE assumptions 52 , we estimated the spousal LP genetic correlation (Extended Data Table 3) adjusted for age, the top ten genetic principal components, partner age and partner top ten genetic principal components. Under an additive linear model there was very weak evidence in support of assortative mating on LP (0.004 mean increase in partner LP allele per increase in individual LP allele (95% CI −0.002, 0.011)). However, a single allelic copy of LP is sufficient to produce the observable phenotype of lactase persistence (which produces a very small influence on milk intake; Table 1) and therefore we considered a dominant inheritance model (imputed variant dosage capped at one; rs4988235 GG versus GA and AA). We term individuals with GA or AA 'lactase persistent'. Using logistic regression, we found that lactase persistent individuals were at 10% increased odds of selecting a lactase persistent partner (1.10 odds ratio (OR) (95% CI 1.00, 1.22) for partner being lactase persistent). The effect was consistent under sensitivity conditions additionally adjusting for the top 40 genetic principal components (mean increase of 1.11 OR (95% CI 1.00, 1.22) partner having one or more copies of LP among LP individuals) reducing the possibility that partner selection was driven entirely by population stratification.
Spousal pairs were derived and validated in a European subsample of UK Biobank (n = 47,459) supplied by L. Howe, MRC IEU 51 . Analyses were restricted to spousal pairs with complete data (n = 46,560).

Ancient DNA analyses
Data. We collated published BAM files from 2,999 ancient Eurasian individuals 25,27, , out of which 1,786 had the European LP single nucleotide polymorphism (SNP) covered at least once (read depth: span 1 to 128-fold, average sevenfold). All genomes came aligned to the GRCh37/hg19 human reference assembly but for the Tyrolean Iceman (hg18). Relevant geographic and temporal annotation was extracted from D. Reich's aDNA compendium v.44.3 (ref. 218 ), see Fig. 3 for a representation of the full dataset. For the joint analyses of genetic and ecological data, we restricted this dataset to those individuals spatiotemporally overlapping the ecological data (four geographic regions: 'British Isles', 'Rhine-Danube axis', 'Mediterranean Europe' and the 'Baltic region'; 8,000-2,500 years bp), leaving 2,162, out of which 1,293 had the lactase locus covered at least once (read depth: span 1-128-fold, average sevenfold).
We extracted the pileup at the position of the European LP SNP (rs4988235; in hg19, chromosome 2, position 136608646; hg18, chromosome 2, position 136325116) with default quality cut-offs using Samtools 219 . For sites with read depth at least two, we called the diploid genotype with the highest likelihood; for sites covered only once, only a single allele was called. Exceptions are individuals PEN001_real2 (ref. 207 ) and Klei10 (ref. 185 ), which each have one out of four reads with the derived allele. However, as discussed in the respective publications, the ages of the samples suggest that the derived alleles are rather the consequence of post-mortem damage and we therefore consider both of them as homozygous ancestral.
Likelihood comparison of selection drivers. Many phenomena have the potential to influence LP allele frequency trajectories in each of our geographic regions, including selection, migration, genetic drift and population structure. We explore just one of these factors in isolation-selection-to test if the observed archaeological allele frequencies are best explained purely under a null model of constant selection or might be partly explained by a model whereby the intensity of selection varies through time in accordance with a suggested ecological driver. Although selection strength cannot be observed directly, population genetic models describe its effect on expected allele frequencies. Given the set of ecological proxy variables that we suggest may act as selective agents, we can ask if fluctuating selection models based on these variables fit the genetic data better than a null model of constant selection strength. If a fluctuating model fits best, we may infer that the ecological proxy is linked to the true selective agent.
We As selection on lactase is strong, we assume a deterministic allele frequency trajectory, ignoring random drift. We assume a generation time of 28 years. In case of an additive model for selection-corresponding to relative fitnesses s 1 + 2 , s 1 + and 1 for homozygous derived, heterozygous and homozygous ancestral genotypes, respectively-we approximate the expected allele frequency after t generations (divide by generation time to scale to years) by . This avoids the challenges of estimating the date of the LP mutation by assuming it was present in the population at t 0 at some unknown frequency, y 0 . Moreover, this avoids modelling the lowest frequency part of the allele trajectory for which drift is strongest.
In most cases, m t ( ) will be piecewise constant over time windows of the order of a hundred years, meaning that s t ( ) is piecewise constant as well and allele frequencies can be computed in between these boundaries with the equation above setting the initial frequency of later time intervals to the last one of the preceding time interval.
The dominant model-corresponding to relative fitnesses s 1 + , s 1 + and 1 for homozygous derived, heterozygous and homozygous ancestral genotypes, respectively-is considered here as well, as the LP phenotype has been shown to be dominant. It requires numerical optimization of an approximation, as no exact expression for y t ( ) as a function of y 0 , s and t exists. The value y t ( ) can be obtained by keeping y 0 , s and t constant and adjusting y t ( ) until equation  220 . As above, this allows us to compute expected deterministic allele frequency trajectories for piecewise constant selection strengths. Note that owing to the parameterization of fitness, the dominant selection coefficient has to be multiplied by a factor 2 to be comparable to the additive model.
For an expected allele frequency trajectory over the entire time range under one of the above inheritance models, the log likelihood given k observed ancestral alleles at time points t i and n derived alleles observed at time points t j is computed as ∑ ∑ y t y t ln(1 − ( )) + ln ( ) . We were able to construct ecological proxy variables for four distinct regions, with different time series m t ( ) and therefore different expected allele frequency trajectories. To analyse these regions jointly, log likelihoods were summed over all four regions, to result in a single optimal parameter combination across all regions.

Article
For a given time series m t ( ) we optimized the model parameters y 0 (initial allele frequency at time point 0), a and b (with a = 1 fixed for the constant selection strength model) using the 'JDEoptim' function from the 'DEoptimR' library 221 in R (ref. 222 ). We chose the best parameter combination from six independent runs, three with and three without initializing JDEoptim with the parameters found for the constant selection model. Finally, statistical model comparison was performed by computing P values obtained from a likelihood ratio test accounting for the different number of parameters between constant and fluctuating models (two and three, respectively). We controlled the false discovery rate in the multiple testing of different models with the Benjamini-Hochberg procedure, ordering the P values, P P P ≤ ≤. . . ≤ m 1 2 and declare a test of rank j non-significant at level δ if P δ > j j m .
Power analysis. We analysed the power of our approach to estimate parameters and detect ecological proxies linked to selection strength via simulations. We chose to focus on milk exploitation, in particular to assess the robustness of our finding that milk is unlikely to drive selection on the LP allele. First, we simulated genotypes under nine different model parameter combinations, with three replicates per combination, sampling a matching number of alleles (one or two) at the corresponding time points in the real data. Supplementary Fig. 2 shows the results for all combinations of s ∈ {0.02, 0.035, 0.05} and a ∈ {0.2, 0.6, 0.8}, with an initial frequency y 0 fixed at 0.005. We observe that both parameters can be estimated accurately when s > 0.02 or a > 0. 25. Models with small values for s or a are most challenging as both lead to small selection coefficients on average (small values for a lead to very few bursts of high selection over a background of otherwise weak selection; for example, Extended Data Fig. 4) and, therefore, a generally low number of sampled derived alleles. Supplementary Fig. 3 illustrates this phenomenon, in which absolute errors in the estimates of s and a for three replicate simulations per parameter combination above repeated independently for three different initial frequencies y ∈ {0.0005, 0.005, 0.01} 0 , are plotted against the number of simulated derived alleles. High estimation errors are exclusively found when few derived alleles are simulated, as the latter are responsible for important likelihood differences between models with appreciable LP allele frequencies. The number of derived alleles in our dataset seems to be high enough to estimate s with high accuracy and a at least with intermediate accuracy. Besides the estimation of actual model parameters, we quantified the power to select a model with milk as a driver over a constant selection model. For the same set of simulations underlying Supplementary Fig. 3, we show the likelihood differences between a model of constant selection and one in which milk exploitation drives selection strength on the derived LP allele in Supplementary Fig. 4. With less than 30 derived alleles, which is the number we observe in the actual data, only 20% of the likelihood differences pass the significance threshold, compared to 81% for simulations with at least the same number of derived alleles (51% overall), in which we restricted the set of simulations only to those with an estimated parameter a above 0.95 (values close to one effectively lead to constant selection models and therefore a likelihood difference close to zero). We are therefore confident that, if milk acted as a driver of selection strength in our data, we would at least find a meaningfully increased likelihood.
Second, we assessed the impact of noise in the ecological proxies, here exemplarily and most importantly in the context of this study on milk exploitation. To simulate inaccuracies, we multiplied each constant piece of the milk exploitation proxy by a random number uniformly drawn from the interval [0.5, 1.5], leading to a 'noisy' version that differs from the original values on average by 0.091 (corresponding to about 11% of the milk variable's full range from 0 to about 0.8; maximum deviation 0.32). Supplementary Fig. 5 compares parameter estimates and likelihoods with the values found in the corresponding simulation shown in Supplementary Fig. 2. Although there are outliers with noticeable deviation when a = 0.25 (most challenging to distinguish from a constant selection model as discussed above), the remaining parameter estimates are barely affected and none of the likelihoods change across the significance threshold. Therefore, we consider our approach robust against moderate levels of noise in the milk exploitation proxy and more generally in other time series.
Lastly, we sought to quantify the effect that drift-which we are not explicitly modelling-may have on our inferences. We repeated the experiments shown in Supplementary Figs. 2-4, however, simulating allele frequency trajectories with fluctuating selection and drift as the basis from which alleles are drawn with the sampling strategy described above that emulates our real aDNA data (drift simulations only with initial frequency y = 0.005 0 ). We used the 'learnPopGen' R library 223 and simulated two levels of drift by setting the effective population size to N ∈ {1,000, 10,000} e (generation time implemented by extending allele frequencies per generation to piecewise constant frequencies lasting 28 years). We observe that the ability to accurately infer s is only mildly affected, whereas the effect on a is stronger ( Supplementary  Fig. 6). This is in line, for example, with ref. 224 in which the authors find limited benefit to explicitly model drift for the inference of selection coefficients. Plotting the same data differently shows that the relationship with the number of simulated derived alleles continues to hold, although the accuracy of a estimates drops with stronger drift even at high numbers of simulated derived alleles ( Supplementary Fig. 7). Importantly, the ability to detect significant likelihood differences in the presence of drift is not affected (Supplementary Fig. 8), as detailed in Extended Data Table 4. Hence, we consider our modelling to be robust against mild levels of drift and a deterministic model to be appropriate, especially as selection coefficients are known to have been high on the derived LP allele and as our geographic regions are large enough to warrant high effective population sizes.
In conclusion, our approach performs well at estimating parameters, especially when the number of derived alleles is at least that in the real aDNA data we curated for this paper and, most importantly, has the power to select a fluctuating selection model over a constant selection model, even in the presence of moderate drift and noise in ecological driver variables.

Ecological proxy variables
We generate five ecological variables as proxies for suggested drivers of LP selection. Each variable is generated for four geographic polygons which overlap spatially and temporally with our ancient genetic data ('British Isles', 'Baltic region', 'Rhine-Danube axis' and 'Mediterranean Europe'). We make no assumption about the direction of the possible relationship between each variable (V) and the selection coefficient (that is, if they are positively or negatively correlated). Therefore, we also generate a set of inverse variables, which can be notionally considered as a reflection; for example, a variable that ranges between 0 and 1 will have an inverse variable between 1 and 0.
Radiocarbon dates. Two of the five ecological variables (residential density and fluctuations in population levels) were constructed directly by using an extensive database comprising >110,000 14 C dates which we compiled from over 1,000 published sources including CalPal 225 , BANADORA 226 , RADON 227 , EUROEVOL 228 , Wales and Borders radiocarbon database 229 , Scottish Radiocarbon Database 230 , ref. 231 and the Palaeolithic Europe Database 232 , which provided most radiocarbon dates. A summary of the radiocarbon dates used within our four polygons canbe found at https://github.com/AdrianTimpson/2020-03-03523A and a full list of sources is available on request.
Residential density (clustering). Under the assumption that dwelling proximity increases the potential for the spread of disease in a population, our 'chronic' mechanism proposes that periods and regions of greater residential density would tend to suffer more from diseases, providing greater selection on LP. We construct a residential density statistic (RDS) from the site locations of our radiocarbon database. The RDS is influenced neither by the number of archaeological sites (which is heavily biased by differential sampling and a general increase through the Neolithic period from population growth) nor by geographic structure in the site locations (caused by features such as islands and mountains). Instead, our RDS uses the distribution of pairwise distances between all sites in a region as a null expectation and then calculates how the distribution of pairwise distances in each time slice deviates from this expected distribution. We achieve this by first generating the empirical cumulative distribution function (ECDF) of pairwise distances of all sites (ECDF all ), then for a single time slice we generate the ECDF of pairwise distances of sites that fall within that time slice (ECDF t ). We then calculate the RDS in that slice as sum(ECDF t − ECDF all ). This is repeated for each time slice, giving a vector of RDS values. Given that there is substantial chronological uncertainty for each site, we also use associated radiocarbon dates to generate a summed probability distribution (SPD) after calibrating through IntCal13 (ref. 162 ). We use the 99% quantiles of this SPD as a conservatively wide date range for the site. We then uniformly sample a date from this range, to assign each site to just one time slice. The entire process is repeated 500 times to generate 500 RDS vectors. This allows CIs to be generated that incorporate the chronological uncertainty in the dataset and we use the mean of these 500 replicates as the final time series in the LP model.

Animal exploitation.
Fluctuations between domestic and wild animal exploitation for meat have the potential to inform on various economic, social and food production changes. For example, we might expect an increase in the proportion of domesticates to be associated with greater animal husbandry and increased contact with animals could lead to a greater probability of zoonotic diseases and therefore greater selection. Conversely, we might expect periods of famine from failure of agriculture to result in a return to exploiting wild resources, thus a decrease in domestic animals might correlate with periods (and regions) with greater selection. We extract NISP counts of the main meat-bearing taxa for domestic ( Fluctuations in population levels. We use our radiocarbon database and the R package ADMUR to directly model the underlying broadscale population dynamics as the combination of exponential population growth with power law taphonomic loss. An SPD was also generated and the residual difference between the short-term fluctuations of the SPD and the long-term trend of the model was used as the detrended proxy for short-term fluctuations in human presence through time. This was achieved independently for each of the four geographic polygons, after all radiocarbon dates were calibrated through the intcal20 curve 162 Finally, we normalize these residuals to a range between 0 and 1 globally, by scaling each variable by the smallest and largest values found across all polygons. In all cases, to adjust for ascertainment bias, radiocarbon dates in a site phase were first aggregated and normalized, so that the total probability mass for each site phase was 1, no matter how many radiocarbon dates at a site phase.

Solar insolation.
We calculate mean annual solar insolation at 0.5 × 0.5 degree resolution using modern data (1983 to 2013) from NASA's midday insolation incident on a horizontal surface 81 (W m −2 ).
Values at each grid point are then used to calculate a single mean value for each of the four geographic polygons. Therefore, unlike the previous ecological variables that each comprise three time series, solar insolation (and its inverse) are not time variable. No doubt insolation would have fluctuated over the study period but the magnitude of this variation is trivial compared to the variation between polygons. These values are then normalized to a range between 0 and 1 by dividing by the Earth's single greatest mean annual solar insolation value of 867.90 W m −2 .

Time-series plots and interpolation maps
We account for three sources of uncertainty when estimating frequencies from count data (LP and LNP alleles, potsherds with and without dairy fat residues, domestic and wild species NISP). Within any time slice we have multiple site phases, each of which has integer counts of presence and absence. Sample sizes vary hugely between site phases (potsherds between 1 and 121, NISP between 1 and 61,497) and our objective is to fairly estimate the 'underlying' frequencies of these count data across all these site phases. We ensure that each site contributes fairly to this estimate, by weighting by the precision of each site's frequency estimate (a function of sample size n, for which the weight is defined by 1 − 1/ n ( +1)). In addition, to fairly incorporate in our estimate the uncertainty both from small sample size and chronology, we adopt a three-stage sampling process in our estimate, which is repeated 5,000 times to also find the maximum a posteriori and confidence interval. The first stage randomly samples a single frequency from a Beta distribution, in which the two shape parameters are the observed presence and absence counts in the site phase. This includes an additional one count of each as a uniform prior. The second stage randomly samples a date from the site phase's date range. This then allows us to assign each phase to exactly one time slice. The third stage then calculates the weighted average frequency of the site phases (in each time slice), weighted according to their sample size as explained above. Where a phase spans two (or more) time slices, its contribution to each slice is weighted accordingly.
Interpolated maps of the frequency of dairy fat residues (Fig. 2) were achieved with a new kernel density approach that generates both the interpolated frequencies using colour hue and confidence in the interpolation using colour saturation. The interpolated surface was masked to coastlines and used the following R libraries: maptools 233 , mapdata (original S code by R. A. Becker and A. R. Wilks; R version by ref. 234 ), raster 235 , rgeos 236 and RColorBrewer 237 . We constructed a two-dimensional Gaussian kernel for each site phase, the magnitude of which was weighted by the precision of the estimate, using 1 − 1/ n ( +1) ) for which n is the number of sherds that have yielded animal fats. The frequency of dairy residues at any geographic pointwas then calculated as the average of the proportion of sherds containing dairy fat residues over the total number of sherds yielding residues at a given site phase, weighted by the probability density function of each site phase's kernel at that point. Therefore, the influence each site phase has on any interpolated point on the map depends primarily on its geographic distance from that point but is also influenced by the error on its frequency estimate. This second influence is small in comparison because the most extreme difference in the relative contribution of sites is only threefold (0.909 for which n = 121 compared to 0.293 for which n = 1). The combined surface of two-dimensional Gaussian kernels (for all site phases) generated during the interpolation then determines the colour saturation, to illustrate geographic regions where there is an absence of data. Conceptually this approach describes the idealized process of shooting paintballs at a grey map. A paintball corresponds to the information content at a site phase, such that its central position is determined by the site's latitude and longitude; hue is determined by the dairy fatfrequency; and size of paintball is determined by the precision.
Two further parameters are required. Firstly, the bandwidth of the kernel, which determines the spread of each paintball. Secondly, a saturation exponent which determines the hue of each paintball (the amount of information content). Both parameters were estimated using a cross-validation approach that partitioned the data into 80% training and 20% test, repeated 1,000 times. For each replicate, the training partition was used to generate interpolated maps (given a proposed bandwidth and saturation exponent) and a comparison was made at the locations of the test partition between the predicted values and the observed values. This comparison used a summary statistic that accounts for both the interpolated dairy residue frequencies and the confidence (saturation). A good answer (predicted frequency close to the observed frequency) is better than an uncertain prediction (low saturation) which is better than a bad answer, therefore we seek to maximize saturation for which predicted frequencies are accurate and minimize saturation for which predicted frequencies are poor. Therefore, our summary statistic was the magnitude of the difference between saturation and prediction accuracy, which was minimized when bandwidth = 6.18 and the saturation exponent = 0.0538 ( Supplementary Fig. 9).

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Code availability
R code for running the aDNA analyses is available from https://github. com/ydiekmann/Evershed_Nature_2022. Open-source R Code for running the UK Biobank analyses under MIT license are available from https://github.com/MRCIEU/lp-coevolution. R code for the generation of Figs. 1, 2, 3 and Extended Data Fig. 1 are available from https://github. com/AdrianTimpson/2020-03-03523A.   , for which nine is the total number of hypothesis tests. Owing to the parametrisation of fitness, the dominant selection coefficient has to be multiplied by a factor 2 to be comparable to the additive models (see Methods section on ancient DNA analysis for details). Abbreviations: number (nb.), parameters (par.), logarithmic (log.), likelihood ratio test (LRT), inverse (inv.), fluctuation (fluc.).

Article
Extended Data Association of individual lactase persistence genotype on spouse genotype adjusted for age, partner age, top ten genetic principal components and partner top ten genetic principal components. Sensitivity analyses were additionally adjusted for top 40 genetic principal components. Additive and dominant inheritance models were evaluated using linear and logistic regression, respectively. CI, confidence interval. N, sample size. OR, odds ratio.