Vertex deduplication based on string similarity and community membership

Ryan McConville*, Weiru Liu, Jun Hong

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference Contribution (Conference Proceeding)


Abstract

Entity resolution is a challenging problem, with unresolved and duplicated entities common in many large real-world datasets. New methods are required to address this problem as the use of graphs to model data continues to proliferate. In this paper we propose a general framework for the fast resolution of duplicate vertices in graphs. Our framework utilises locality sensitive hashing to quickly identify potential duplicates based on string similarity. However, in many tasks string similarity alone is not enough to determine duplication. This motivates the second aspect of our method, which discovers the community structure of the graph using an ensemble of community detection algorithms. These communities are then used to augment the string similarity in the deduplication process. We evaluate our approach on a commercial real-world graph consisting of 620,885 vertices and 1,129,986 edges and report a high accuracy score.
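
As a rough illustration of the two-stage idea described in the abstract, the Python sketch below first uses locality sensitive hashing to propose candidate duplicates by string similarity, then keeps only candidates that share a community. It is not the authors' implementation. Several choices are assumptions made for the example: MinHash with banding as the LSH scheme, character trigrams as the string representation, a Jaccard threshold of 0.8, and community labels passed in as a precomputed dictionary standing in for the output of the community detection ensemble. All function and parameter names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lower-cased string."""
    text = text.lower().strip()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens, num_hashes=64):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def candidate_pairs(names, num_hashes=64, bands=16):
    """Banded LSH: vertices that share any band bucket become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for vertex, name in names.items():
        sig = minhash(shingles(name), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(vertex)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def find_duplicates(names, communities, threshold=0.8):
    """Keep candidates that are similar enough AND share a community.

    `communities` maps vertex -> community label and is assumed to be
    precomputed (a stand-in for the ensemble of detection algorithms).
    """
    return {
        (u, v)
        for u, v in candidate_pairs(names)
        if communities[u] == communities[v]
        and jaccard(shingles(names[u]), shingles(names[v])) >= threshold
    }


# Hypothetical vertex labels and community assignments.
names = {
    0: "Acme Corporation Ltd",
    1: "Acme Corporation Ltd.",
    2: "Globex Industries",
    3: "Acme Corporation Ltd",
}
communities = {0: "a", 1: "a", 2: "b", 3: "c"}
duplicates = find_duplicates(names, communities)
```

In this toy input, vertices 0 and 3 carry identical labels but sit in different communities, so the community check rejects the pair even though string similarity alone would merge them. Requiring an exact community match is the simplest possible way to combine the two signals; a softer weighting would be an equally plausible reading of "augment".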

Original language: English
Title of host publication: Complex Networks and Their Applications VI
Subtitle of host publication: Proceedings of Complex Networks 2017 (The Sixth International Conference on Complex Networks and Their Applications)
Publisher: Springer, Cham
Pages: 178-189
Number of pages: 12
ISBN (Electronic): 9783319721507
ISBN (Print): 9783319721491
DOIs
Publication status: Published - 27 Nov 2017
Event: 6th International Conference on Complex Networks and Their Applications, Complex Networks 2017 - Lyon, France
Duration: 29 Nov 2017 to 1 Dec 2017

Publication series

Name: Studies in Computational Intelligence
Volume: 689
ISSN (Print): 1860-949X

Conference

Conference: 6th International Conference on Complex Networks and Their Applications, Complex Networks 2017
Country: France
City: Lyon
Period: 29/11/17 to 1/12/17

Structured keywords

  • Jean Golding
