Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data

Yun Zhou, Minlue Wang, Valeriia Haberland, John Howroyd, Sebastian Danicic, Mark Bishop

Research output: Contribution to journal › Article (Academic Journal) › peer-review


Abstract

Probabilistic record linkage is a well-established topic in the literature. Fellegi–Sunter probabilistic record linkage and its enhanced versions are commonly used methods, which calculate match and non-match weights for each pair of records. Bayesian network classifiers, namely the naive Bayes classifier and tree-augmented naive Bayes (TAN), have also been used successfully for this task. Recently, an extended version of TAN (called ETAN) has been developed and shown to be superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage or investigated the benefits of using the naturally existing hierarchical feature level information and parsed fields of the datasets. In this work, we extend the naive Bayes classifier with such hierarchical feature level information. Finally, we illustrate the benefits of our method over previously proposed methods on four datasets in terms of linkage performance (F1 score). We also show that the results can be further improved by additionally parsing the fields of these datasets.
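For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of textbook Fellegi–Sunter match and non-match weights together with the F1 score used for evaluation. It is not the authors' implementation and does not cover ETAN or the hierarchical extension. The function names, field names, and the m/u probabilities are illustrative assumptions, not values from the paper.

```python
import math

def field_weights(m, u):
    """Return (agreement, disagreement) log2 weights for one field.

    m = P(field agrees | records are a true match)
    u = P(field agrees | records are not a match)
    Both are assumed to lie strictly between 0 and 1.
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

def pair_score(agreements, m_probs, u_probs):
    """Sum the per-field weights for one record pair.

    agreements maps a field name to True (values agree) or False.
    Fields are treated as conditionally independent, as in the
    classic Fellegi-Sunter model.
    """
    total = 0.0
    for field, agrees in agreements.items():
        agree_w, disagree_w = field_weights(m_probs[field], u_probs[field])
        total += agree_w if agrees else disagree_w
    return total

def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall over linked pairs."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical m/u probabilities for two illustrative fields.
M_PROBS = {"surname": 0.95, "postcode": 0.90}
U_PROBS = {"surname": 0.01, "postcode": 0.05}

if __name__ == "__main__":
    both_agree = pair_score({"surname": True, "postcode": True}, M_PROBS, U_PROBS)
    both_differ = pair_score({"surname": False, "postcode": False}, M_PROBS, U_PROBS)
    print(both_agree, both_differ)
```

A pair whose total score exceeds an upper threshold would be declared a match and one below a lower threshold a non-match; the thresholds themselves are a further modelling choice not shown here.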
Original language: English
Pages (from-to): 87–104
Journal: New Generation Computing
Volume: 35
Issue number: 1
DOIs
Publication status: Published - 10 Jan 2017
