Evaluating linkage quality and impact of linkage error in administrative datasets for epidemiological analyses

Katie Harron, James Doidge, Hannah Knight, Ruth Gilbert, Harvey Goldstein, David Cromwell, Jan van der Meulen

Research output: Contribution to journal › Article (Academic Journal) › peer-review



Linkage of administrative data is an important tool for population-based epidemiological analyses, but unlinked or falsely-matched records can lead to selection bias or measurement error, particularly when linkage error disproportionately affects particular subgroups of records. Assessing the potential impact of linkage error on results can be difficult because linkage and analysis are usually carried out as separate processes. We aim to describe methods for evaluating linkage, using an exemplar of linked, population-based data for mothers and babies.
We validated previous linkage of mother-baby records within national hospitalisations data for England (Hospital Episode Statistics). We applied our original linkage algorithm to a subset of gold-standard data to quantify false-matches and missed-matches; we compared characteristics of linked and unlinked data to identify potential sources of bias; and we evaluated how sensitive results of analyses were to changes in the linkage criteria (i.e. to a more conservative linkage algorithm and deterministic linkage only).
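As a minimal sketch of the first validation step, assume that both the linkage output and the gold standard can be represented as a mapping from a baby record ID to a mother record ID. The record IDs, the toy data and the function name `linkage_errors` are hypothetical and are not taken from the study. Under those assumptions, false-matches and missed-matches might be counted as follows:

```python
def linkage_errors(gold, linked):
    """Count false-matches and missed-matches against a gold standard.

    gold, linked: dicts mapping baby record ID -> mother record ID.
    A baby that is absent from `linked` was left unlinked by the algorithm.
    Babies that appear only in `linked` cannot be validated and are ignored.
    """
    # False-match: the algorithm linked the baby to the wrong mother.
    false_matches = sum(
        1 for baby, mother in linked.items()
        if baby in gold and gold[baby] != mother
    )
    # Missed-match: a true pair exists but the algorithm made no link.
    missed_matches = sum(1 for baby in gold if baby not in linked)
    return false_matches, missed_matches


# Hypothetical toy data, for illustration only.
gold = {"b1": "m1", "b2": "m2", "b3": "m3", "b4": "m4"}
linked = {"b1": "m1", "b2": "m9", "b3": "m3"}  # b2 wrong mother, b4 unlinked

false_matches, missed_matches = linkage_errors(gold, linked)
print(false_matches, missed_matches)
```

Running the same function on the output of a more conservative algorithm would give the counts needed for the sensitivity comparison.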
Of 72,817 mother-baby records in the gold-standard dataset, 636 (0.9%) were false-matches and 297 (0.4%) were missed-matches using the original linkage algorithm. Records that failed to link were more likely to have missing data, to be of lower birthweight and gestational age, or to be stillbirths. Using a more conservative probabilistic linkage algorithm resulted in fewer false-matches (212; 0.3%) but substantially increased the number of missed-matches (7,797; 10.7%). Using more conservative probabilistic linkage criteria, or deterministic linkage only, resulted in under-ascertainment of preterm births, reduced precision, and led to bias towards the null in assessing the association between maternal risk-factors and baby outcomes.
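The reported percentages can be reproduced from the counts, taking the 72,817 gold-standard records as the denominator for both error types. The short check below is a sketch; the helper name `rate` and the dictionary layout are illustrative choices, while the counts are those given above:

```python
def rate(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


TOTAL = 72_817  # mother-baby records in the gold-standard dataset

# Error counts reported for each linkage algorithm.
counts = {
    "original": {"false_matches": 636, "missed_matches": 297},
    "conservative": {"false_matches": 212, "missed_matches": 7_797},
}

# Convert every count to a percentage of the gold-standard records.
rates = {
    algorithm: {error: rate(n, TOTAL) for error, n in errors.items()}
    for algorithm, errors in counts.items()
}

for algorithm, errors in rates.items():
    print(algorithm, errors)
```

The comparison makes the trade-off explicit: the conservative algorithm removes 424 false-matches at the cost of 7,500 additional missed-matches.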
Evaluation of linkage helps us to understand the potential impact of linkage error on results and provides an opportunity to select the most appropriate linkage criteria for a particular analysis. Further research is required to develop methods for identifying and handling bias arising from linkage error. Researchers using linked data should work with data providers to understand linkage quality, as well as other aspects of administrative data quality such as missing data.
Original language: English
Pages (from-to): 1-12
Number of pages: 12
Journal: International Journal of Epidemiology
Publication status: Published - 1 Oct 2017


  • Record linkage
  • linkage errors
  • Bias
  • data accuracy

