Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test

Denis Dos Reis, Peter Flach, Stan Matwin, Gustavo Batista

Research output: Chapter in Book/Report/Conference proceedingConference Contribution (Conference Proceeding)

67 Citations (Scopus)
552 Downloads (Pure)

Abstract

Data stream research has grown rapidly over the last decade. Two major features distinguish data stream from batch learning: stream data are generated on the fly, possibly in a fast and variable rate; and the underlying data distribution can be non-stationary, leading to a phenomenon known as concept drift. Therefore, most of the research on data stream classification focuses on proposing effcient models that can adapt to concept drifts and maintain a stable performance over time. However, specifically for the classification task, the majority of such methods rely on the instantaneous availability of true labels for all already classified instances. This is a strong assumption that is rarely fulfilled in practical applications. Hence there is a clear need for effcient methods that can detect concept drifts in an unsupervised way. One possibility is the well-known Kolmogorov-Smirnov test, a statistical hypothesis test that checks whether two samples differ. This work has two main contributions. The first one is the Incremental Kolmogorov-Smirnov algorithm that allows performing the Kolmogorov-Smirnov hypothesis test instantly using two samples that change over time, where the change is an insertion and/or removal of an observation. Our algorithm employs a randomized tree and is able to perform the insertion and removal operations in O(log N) with high probability and calculate the Kolmogorov-Smirnov test in O(1), where N is the number of sample observations. This is a significant speed-up compared to the O(N log N) cost of the non-incremental implementation. The second contribution is the use of the Incremental Kolmogorov-Smirnov test to detect concept drifts without true labels. Classification algorithms adapted to use the test rely on a limited portion of those labels just to update the classification model after a concept drift is detected.

Original languageEnglish
Title of host publicationProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery (ACM)
Pages1545-1554
Number of pages10
ISBN (Print)9781450342322
DOIs
Publication statusPublished - 13 Aug 2016
Event22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016 - San Francisco, United States
Duration: 13 Aug 201617 Aug 2016

Conference

Conference22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016
Country/TerritoryUnited States
CitySan Francisco
Period13/08/1617/08/16

Structured keywords

  • Jean Golding

Keywords

  • Cartesian tree
  • Concept drift
  • Data stream
  • Kolmogorov-Smirnov
  • Lazy propagation
  • Treap

Fingerprint

Dive into the research topics of 'Fast unsupervised online drift detection using incremental Kolmogorov-Smirnov test'. Together they form a unique fingerprint.

Cite this