Vertex deduplication based on string similarity and community membership

Ryan McConville*, Weiru Liu, Jun Hong

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)


Entity resolution is a challenging problem: unresolved and duplicated entities are common in many large real-world datasets. New methods are required to address this problem as the use of graphs to model data continues to proliferate. In this paper we propose a general framework for the fast resolution of duplicate vertices in graphs. Our framework uses locality-sensitive hashing to quickly identify potential duplicates based on string similarity. However, in many tasks string similarity alone is not enough to determine duplication. This motivates the second aspect of our method, which discovers the community structure in the graph using an ensemble of community detection algorithms. These communities are then used to augment the string similarity in the deduplication process. We evaluate our approach on a commercial real-world graph consisting of 620,885 vertices and 1,129,986 edges and report a high accuracy score.
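The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the shingle size, the MinHash construction, the band and row counts, the Jaccard threshold, and the rule that two vertices must share a community to count as duplicates are all assumptions made for illustration. The `community` mapping is assumed to be precomputed elsewhere (the paper uses an ensemble of community detection algorithms, which is not reproduced here), and all function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lower-cased label."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(shingle_set, num_hashes):
    """MinHash signature: for each seeded hash, keep the minimum value."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return tuple(signature)


def candidate_pairs(labels, bands=16, rows=2):
    """LSH banding: vertices sharing any band of their signature are candidates."""
    buckets = defaultdict(set)
    for vertex, label in labels.items():
        sig = minhash_signature(shingles(label), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(vertex)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def find_duplicates(labels, community, threshold=0.5):
    """Keep candidate pairs that are string-similar AND in the same community.

    Assumption: requiring equal community membership is one simple way to
    'augment' string similarity; the paper may combine the signals differently.
    """
    duplicates = set()
    for u, v in candidate_pairs(labels):
        if community[u] != community[v]:
            continue
        if jaccard(shingles(labels[u]), shingles(labels[v])) >= threshold:
            duplicates.add((u, v))
    return duplicates
```

Here `labels` maps each vertex identifier to its string label and `community` maps each vertex identifier to a community identifier. Banding means only vertices that collide in at least one band are compared exactly, which is what avoids the all-pairs comparison on a large graph.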

Original language: English
Title of host publication: Complex Networks and Their Applications VI
Subtitle of host publication: Proceedings of Complex Networks 2017 (The Sixth International Conference on Complex Networks and Their Applications)
Publisher: Springer, Cham
Number of pages: 12
ISBN (Electronic): 9783319721507
ISBN (Print): 9783319721491
Publication status: Published - 27 Nov 2017
Event: 6th International Conference on Complex Networks and Their Applications, Complex Networks 2017 - Lyon, France
Duration: 29 Nov 2017 to 1 Dec 2017

Publication series

Name: Studies in Computational Intelligence
ISSN (Print): 1860-949X


Conference: 6th International Conference on Complex Networks and Their Applications, Complex Networks 2017

Structured keywords

  • Jean Golding
