A novel ensemble learning approach to unsupervised record linkage

Anna Jurek, Jun Hong, Yuan Chi, Weiru Liu

Research output: Contribution to journal, Article (Academic Journal), peer-review

28 Citations (Scopus)
446 Downloads (Pure)

Abstract

Record linkage is the process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either match or non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available, so manual labelling is required to create a sufficiently sized training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset, have been applied to record linkage. These techniques reduce the amount of manual labelling needed for the training dataset. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. To further improve the automatic self-learning process, we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on 4 publicly available datasets that are commonly used in the record linkage community. Our experimental results show that our proposed approach has advantages over the state-of-the-art semi-supervised and unsupervised record linkage techniques. On 3 of the 4 datasets it also achieves results comparable to those of the supervised approaches.
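The abstract does not give the details of the seed selection step, so the following is only a minimal sketch of the general idea of automatic seed selection with field weighting, not the authors' method. It assumes that each record pair is scored by a weighted average of per-field string similarities, and that pairs scoring very high or very low are taken as match and non-match seeds for self-learning. The function names, the hand-picked weights, the thresholds `upper` and `lower`, and the use of `difflib.SequenceMatcher` as the similarity measure are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(rec_a, rec_b, fields, weights):
    """Weighted average of per-field similarities for one record pair."""
    sims = [field_similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def select_seeds(pairs, fields, weights, upper=0.9, lower=0.3):
    """Split record pairs into match seeds, non-match seeds, and the rest.

    The thresholds are hypothetical; pairs between them stay unlabelled
    and would be classified later by the self-learning model.
    """
    match_seeds, non_match_seeds, unlabelled = [], [], []
    for pair in pairs:
        score = weighted_score(pair[0], pair[1], fields, weights)
        if score >= upper:
            match_seeds.append(pair)
        elif score <= lower:
            non_match_seeds.append(pair)
        else:
            unlabelled.append(pair)
    return match_seeds, non_match_seeds, unlabelled


# Toy records; the weights give the name field twice the influence of city.
fields = ["name", "city"]
weights = [2.0, 1.0]
r1 = {"name": "Anna Jurek", "city": "Belfast"}
r2 = {"name": "Anna Jurek", "city": "Belfast"}
r3 = {"name": "Bob", "city": "Oslo"}

matches, non_matches, rest = select_seeds([(r1, r2), (r1, r3)], fields, weights)
```

In this toy run the identical pair becomes a match seed and the clearly different pair becomes a non-match seed. In an ensemble setting, the same selection could be repeated with a different similarity measure per model, which is where the diversity between models would come from.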
Original language: English
Pages (from-to): 40-54
Number of pages: 15
Journal: Information Systems
Volume: 71
Early online date: 28 Jun 2017
Publication status: Published - Nov 2017

Research Groups and Themes

  • Jean Golding

Keywords

  • Unsupervised record linkage
  • Data matching
  • Classification
  • Ensemble learning
