Accelerating Large Scale Centroid-based Clustering with Locality Sensitive Hashing

Ryan McConville, Xin Cao, Weiru Liu, Paul Miller

Research output: Chapter in Book/Report/Conference proceedingConference Contribution (Conference Proceeding)

217 Downloads (Pure)

Abstract

Most traditional data mining algorithms struggle to cope with the sheer scale of data efficiently. In this paper, we propose a general framework to accelerate existing algorithms to cluster large-scale datasets which contain large numbers of attributes, items, and clusters. Our framework makes use of locality sensitive hashing to significantly reduce the cluster search space. We also theoretically prove that our framework has a guaranteed error bound in terms of the clustering quality. This framework can be applied to a set of centroid-based clustering algorithms that assign an object to the most similar cluster, and we adopt the popular K-Modes categorical clustering algorithm to present how the framework can be applied. We validated our framework with five synthetic datasets and a real world Yahoo! Answers dataset. The experimental results demonstrate that our framework is able to speed up the existing clustering algorithm between factors of 2 and 6, while maintaining comparable cluster purity.
Original languageEnglish
Title of host publication2016 IEEE 32nd International Conference on Data Engineering (ICDE 2016)
Subtitle of host publicationProceedings of a meeting held 16-20 May 2016, Helsinki, Finland
PublisherInstitute of Electrical and Electronics Engineers (IEEE)
Pages649-660
Number of pages12
ISBN (Electronic)9781509020201
ISBN (Print)9781509020218
DOIs
Publication statusPublished - Aug 2016
Event32nd IEEE International Conference on Data Engineering, ICDE 2016 - Helsinki, Finland
Duration: 16 May 201620 May 2016

Conference

Conference32nd IEEE International Conference on Data Engineering, ICDE 2016
CountryFinland
CityHelsinki
Period16/05/1620/05/16

Structured keywords

  • Jean Golding

Keywords

  • Clustering algorithms
  • Algorithm design and analysis
  • file organisation
  • pattern clustering
  • large scale centroid-based clustering
  • locality sensitive hashing
  • large-scale datasets clustering
  • cluster search space reduction
  • K-modes categorical clustering algorithm
  • cluster purity

Fingerprint Dive into the research topics of 'Accelerating Large Scale Centroid-based Clustering with Locality Sensitive Hashing'. Together they form a unique fingerprint.

Cite this