Identifying clinical features in primary care electronic health record studies: methods for codelist development

Jessica Watson, Brian Nicholson, William Hamilton, Sarah Price

Research output: Contribution to journal › Article (Academic Journal) › peer-review




Analysis of routinely collected Electronic Health Record (EHR) data from primary care relies on codelists to define the clinical features of interest. To improve scientific rigour, transparency and replicability, we describe and demonstrate a standardised, reproducible methodology for clinical codelist development.


We describe a three-stage process for developing clinical codelists. First, the clinical feature of interest is clearly defined a priori using reliable clinical resources. Second, a list of potential codes is developed by using statistical software to search all available codes comprehensively. Third, a modified Delphi process is used to reach consensus between primary care practitioners on the most relevant codes, including the generation of an ‘uncertainty’ variable to allow sensitivity analysis.
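As a rough illustration of the second and third stages, the sketch below searches a small code dictionary for candidate terms and then turns reviewer votes into an inclusion decision plus an uncertainty flag. It is not the authors' published syntax. The dictionary rows, the search patterns, the function names `find_candidates` and `classify`, and the rule that any mixed vote is kept but flagged as uncertain are all assumptions made for this example.

```python
import re

# Hypothetical miniature code dictionary: (code, description).
# These rows are invented for illustration and are not real clinical codes.
CODE_DICTIONARY = [
    ("X001", "Shortness of breath"),
    ("X002", "Breathlessness on exertion"),
    ("X003", "Dyspnoea"),
    ("X004", "Family history of dyspnoea"),
    ("X005", "Chest pain"),
]

# Assumed search patterns derived from the a priori definition (stage 1).
SEARCH_PATTERNS = [r"short\w* of breath", r"breathless", r"dyspn"]


def find_candidates(dictionary, patterns):
    """Stage 2: return every code whose description matches any pattern."""
    matcher = re.compile("|".join(patterns), re.IGNORECASE)
    return [(code, text) for code, text in dictionary if matcher.search(text)]


def classify(votes):
    """Stage 3 (assumed rule): votes are True (include) or False (exclude).

    Returns (decision, uncertain) where uncertain is 1 when reviewers
    disagreed and 0 when they were unanimous.
    """
    if not any(votes):
        return ("exclude", 0)
    if all(votes):
        return ("include", 0)
    return ("include", 1)


candidates = find_candidates(CODE_DICTIONARY, SEARCH_PATTERNS)

# Hypothetical reviewer votes for each candidate code.
votes_by_code = {
    "X001": [True, True, True],
    "X002": [True, True, True],
    "X003": [True, False, True],
    "X004": [False, False, False],
}
codelist = {code: classify(votes) for code, votes in votes_by_code.items()}
```

A deliberately broad search (here it also picks up a family-history code) followed by clinician review mirrors the idea of searching comprehensively first and excluding inappropriate codes afterwards.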


These methods are illustrated by developing a codelist for shortness of breath in a primary care EHR sample, including modifiable syntax for commonly used statistical software.


The codelist was used to estimate the frequency of shortness of breath in a cohort of 28,216 patients aged over 18 years who received an incident diagnosis of lung cancer between 1 January 2000 and 30 November 2016 in the Clinical Practice Research Datalink (CPRD).


Of 78 candidate codes, 29 were excluded as inappropriate. Complete agreement was reached for 44 (90%) of the remaining 49 codes, with partial disagreement over 5 (10%). In total, 13,091 episodes of shortness of breath were identified in the cohort of 28,216 patients. Sensitivity analysis demonstrated that codes with the greatest uncertainty tend to be rarely used in clinical practice.
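One way to picture a sensitivity analysis based on the uncertainty variable is to count coded records twice, once with every included code and once with the uncertain codes left out. The sketch below does this on toy data. The event records, the codes, the flag values and the function name `count_records` are invented for illustration, and counting each coded record as one episode is a simplifying assumption rather than the study's definition.

```python
# Toy event records: (patient_id, code). Invented for illustration only.
EVENTS = [
    (1, "X001"),
    (1, "X003"),
    (2, "X002"),
    (3, "X009"),
    (3, "X001"),
    (4, "X005"),  # not on the codelist, so never counted
]

# Hypothetical codelist: code -> uncertainty flag (1 = reviewers disagreed).
CODELIST = {"X001": 0, "X002": 0, "X003": 0, "X009": 1}


def count_records(events, codelist, include_uncertain):
    """Count records whose code is on the codelist.

    When include_uncertain is False, codes flagged as uncertain are skipped.
    """
    return sum(
        1
        for _, code in events
        if code in codelist and (include_uncertain or codelist[code] == 0)
    )


primary = count_records(EVENTS, CODELIST, include_uncertain=True)
restricted = count_records(EVENTS, CODELIST, include_uncertain=False)
difference = primary - restricted  # records contributed by uncertain codes
```

If the uncertain codes are rarely used, the two counts stay close and the choice of whether to include those codes has little effect on the estimate.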


Although initially time-consuming, using a rigorous and reproducible method for codelist generation ‘future-proofs’ findings, and an auditable, modifiable syntax for codelist generation enables sharing and replication of EHR studies. Published codelists should be badged by quality and report the methods of codelist generation including: definitions and justifications associated with each codelist; the syntax or search method; the number of candidate codes identified; and the categorisation of codes after Delphi review.
Original language: English
Article number: e019637
Number of pages: 10
Journal: BMJ Open
Issue number: 11
Early online date: 23 Nov 2017
Publication status: Published - Nov 2017

