Abstract
Background
Linkage of administrative data is an important tool for population-based epidemiological analyses, but unlinked or falsely-matched records can lead to selection bias or measurement error, particularly when linkage error disproportionally affects particular subgroups of records. Assessing the potential impact of linkage error on results can be difficult due to the separation of linkage and analysis processes. We aim to describe methods for evaluating linkage, using an exemplar of linked, population-based data for mothers and babies.
Methods
We validated previous linkage of mother-baby records within national hospitalisations data for England (Hospital Episode Statistics). We applied our original linkage algorithm to a subset of gold-standard data to quantify false-matches and missed-matches; we compared characteristics of linked and unlinked data to identify potential sources of bias; and we evaluated how sensitive results of analyses were to changes in the linkage criteria (i.e. to a more conservative linkage algorithm and deterministic linkage only).
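The first validation step, counting false-matches and missed-matches against a gold standard, can be illustrated with a minimal sketch. It assumes both the linkage output and the gold standard are available as (mother ID, baby ID) pairs. The function name `evaluate_linkage` and the toy identifiers are hypothetical, and this is not the linkage algorithm used in the study.

```python
def evaluate_linkage(linked_pairs, gold_pairs):
    """Compare linked (mother, baby) pairs with gold-standard pairs.

    Assumed definitions for this sketch:
      false-match  = a pair produced by the linkage that is not in the gold standard
      missed-match = a gold-standard pair that the linkage did not produce
    """
    linked = set(linked_pairs)
    gold = set(gold_pairs)
    return {
        "true_matches": len(linked & gold),
        "false_matches": len(linked - gold),
        "missed_matches": len(gold - linked),
    }


# Toy data (hypothetical identifiers, not study records)
gold = [("m1", "b1"), ("m2", "b2"), ("m3", "b3"), ("m4", "b4")]
linked = [("m1", "b1"), ("m2", "b3"), ("m4", "b4")]

result = evaluate_linkage(linked, gold)
print(result)
```

In this toy example the mother `m2` is linked to the wrong baby, which counts as one false-match and also leaves her true pair as a missed-match; `m3` is not linked at all, giving a second missed-match.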
Results
Of 72,817 mother-baby records in the gold-standard dataset, 636 (0.9%) were false-matches and 297 (0.4%) were missed-matches using the original linkage algorithm. Records that failed to link were more likely to have missing data, to be of lower birthweight and gestational age, or to be stillbirths. Using a more conservative probabilistic linkage algorithm resulted in fewer false-matches (212; 0.3%) but substantially increased the number of missed-matches (7,797; 10.7%). Using more conservative probabilistic linkage criteria, or deterministic linkage only, resulted in under-ascertainment of preterm births, reduced precision, and led to bias towards the null in assessing the association between maternal risk-factors and baby outcomes.
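The reported percentages can be reproduced from the counts given above. This is a small arithmetic sketch that assumes every rate uses the 72,817 gold-standard records as its denominator; the variable names are illustrative only.

```python
TOTAL_GOLD_STANDARD = 72_817  # mother-baby records in the gold-standard dataset

# Counts as reported in the Results
counts = {
    "original_false_matches": 636,
    "original_missed_matches": 297,
    "conservative_false_matches": 212,
    "conservative_missed_matches": 7_797,
}

# Percentage of gold-standard records, rounded to one decimal place
rates = {
    name: round(100 * n / TOTAL_GOLD_STANDARD, 1)
    for name, n in counts.items()
}

for name, pct in rates.items():
    print(f"{name}: {pct}%")
```

Under that assumption the rounded rates agree with the figures in the text: 0.9% and 0.4% for the original algorithm, and 0.3% and 10.7% for the more conservative one.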
Discussion
Evaluation of linkage helps us to understand the potential impact of linkage error on results and provides an opportunity to select the most appropriate linkage criteria for a particular analysis. Further research is required to develop methods for identifying and handling bias arising from linkage error. Researchers using linked data should work with data providers to understand linkage quality, as well as other aspects of administrative data quality such as missing data.
Original language | English |
---|---|
Pages (from-to) | 1-12 |
Number of pages | 12 |
Journal | International Journal of Epidemiology |
DOIs | |
Publication status | Published - 1 Oct 2017 |
Keywords
- Record linkage
- Linkage errors
- Bias
- Administrative data
- Data accuracy