## Abstract

Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is the

so-called k-Means. It is very popular, however, it is also unable to handle cases in which the clusters are not linearly separable. Kernel k-Means is a state of the art clustering algorithm, which employs the kernel trick, in order to perform clustering on a higher dimensionality space, thus overcoming the limitations of

classic k-Means regarding the non linear separability of the input data. Kernel k-Means typically computes the kernel matrix, which contains the results of the kernel function for every possible sample combination. This matrix can be viewed as the weight matrix of a full graph, where the samples are the vertices and the

edges are weighed according to the similarity between the samples they connect, according to the kernel function. In this context, it is possible to work on the Nearest Neighbor graph, where each sample is only connected to some of its closest samples, or only using information from samples that are sufficiently close to each other, referred to as -ball. Doing so reduces the size of the kernel

matrix and can provide improved clustering results. In this paper, we present a MapReduce based distributed implementation of Nearest Neighbor and -ball Kernel k-Means.

so-called k-Means. It is very popular, however, it is also unable to handle cases in which the clusters are not linearly separable. Kernel k-Means is a state of the art clustering algorithm, which employs the kernel trick, in order to perform clustering on a higher dimensionality space, thus overcoming the limitations of

classic k-Means regarding the non linear separability of the input data. Kernel k-Means typically computes the kernel matrix, which contains the results of the kernel function for every possible sample combination. This matrix can be viewed as the weight matrix of a full graph, where the samples are the vertices and the

edges are weighed according to the similarity between the samples they connect, according to the kernel function. In this context, it is possible to work on the Nearest Neighbor graph, where each sample is only connected to some of its closest samples, or only using information from samples that are sufficiently close to each other, referred to as -ball. Doing so reduces the size of the kernel

matrix and can provide improved clustering results. In this paper, we present a MapReduce based distributed implementation of Nearest Neighbor and -ball Kernel k-Means.

Original language | English |
---|---|

Title of host publication | 2015 IEEE Symposium Series on Computational Intelligence (SSCI 2015) |

Publisher | Institute of Electrical and Electronics Engineers (IEEE) |

Pages | 509-515 |

Number of pages | 7 |

ISBN (Electronic) | 9781479975617 |

ISBN (Print) | 9781479975600 |

DOIs | |

Publication status | Published - Jan 2016 |

Event | 2015 IEEE Symposium Series on Computational Intelligence - Cape Town, South Africa Duration: 7 Dec 2015 → 10 Dec 2015 |

### Conference

Conference | 2015 IEEE Symposium Series on Computational Intelligence |
---|---|

Abbreviated title | SSCI 2015 |

Country/Territory | South Africa |

City | Cape Town |

Period | 7/12/15 → 10/12/15 |