Purpose: In the UK, primary care databases include repeated measurements of health indicators at the individual level. As these databases encompass a large population, some individuals have extreme values, but some values may also be recorded incorrectly. The challenge for researchers is to distinguish between records that are due to incorrect recording and those which represent true but extreme values. This study evaluated different methods to identify outliers. Methods: Ten percent of practices were selected at random to evaluate the recording of 513,367 height measurements. Population-level outliers were identified using boundaries defined using Health Survey for England data. Individual-level outliers were identified by fitting a random-effects model with subject-specific slopes for height measurements adjusted for age and sex. Any height measurements with a patient-level standardised residual more extreme than ±10 were identified as an outlier and excluded. The model was subsequently refitted twice after removing outliers at each stage. This method was compared with existing methods of removing outliers. Results: Most outliers were identified at the population level using the boundaries defined using Health Survey for England (1550 of 1643). Once these were removed from the database, fitting the random-effects model to the remaining data successfully identified only 75 further outliers. This method was more efficient at identifying true outliers compared with existing methods. Conclusions: We propose a new, two-stage approach in identifying outliers in longitudinal data and show that it can successfully identify outliers at both population and individual level.
- Longitudinal data
- Primary care database