An information theoretic framework for data mining

Tijl De Bie

Research output: Chapter in Book/Report/Conference proceeding > Conference Contribution (Conference Proceeding)

65 Citations (Scopus)


We formalize the data mining process as a process of information exchange, defined by the following key components. The data miner's state of mind is modeled as a probability distribution, called the background distribution, which represents the uncertainty and misconceptions the data miner has about the data. This model initially incorporates any prior (possibly incorrect) beliefs the data miner has about the data. During the data mining process, properties of the data (which we refer to as patterns) are revealed to the data miner, either in batch, one by one, or even interactively. This acquisition of information is formalized by updates to the background distribution that account for the presence of the found patterns. The proposed framework can be motivated using concepts from information theory and game theory. Viewed from this perspective, it extends naturally to more sophisticated settings, e.g. where patterns are probabilistic functions of the data (thus allowing one to account for noise and errors in the data mining process, and to study data mining techniques based on subsampling the data). The framework then models the data mining process using concepts from information geometry, and I-projections in particular. The framework can be used to help design new data mining algorithms that maximize the efficiency of the information exchange from the algorithm to the data miner.
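To make the update step concrete, the following is a minimal sketch, not the paper's own algorithm. It assumes a finite data space, a background distribution stored as a dictionary, and a pattern expressed as the set of data values consistent with it; the function names are hypothetical. Under these assumptions, the I-projection of the background distribution onto the distributions supported on the pattern set is plain conditioning, and the KL divergence between the updated and the original distribution equals the self-information of the pattern.

```python
import math


def pattern_mass(background, pattern):
    """Probability the background distribution assigns to the pattern set."""
    return sum(p for x, p in background.items() if x in pattern)


def update_background(background, pattern):
    """Assumed update rule: I-projection onto distributions supported on
    `pattern`, which for a set-valued pattern reduces to conditioning."""
    mass = pattern_mass(background, pattern)
    if mass <= 0.0:
        raise ValueError("pattern has zero probability under the background")
    return {x: (p / mass if x in pattern else 0.0) for x, p in background.items()}


def information_content(background, pattern):
    """Self-information (in bits) of learning that the data lies in `pattern`."""
    return -math.log2(pattern_mass(background, pattern))


def kl_divergence(q, p):
    """KL(q || p) in bits, skipping outcomes where q is zero."""
    return sum(qx * math.log2(qx / p[x]) for x, qx in q.items() if qx > 0.0)


# Hypothetical prior beliefs over four possible data values.
background = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
# Hypothetical pattern: the data is revealed to be either "b" or "c".
pattern = {"b", "c"}

updated = update_background(background, pattern)
bits_gained = information_content(background, pattern)
divergence = kl_divergence(updated, background)
```

In this sketch, a pattern that the data miner considered unlikely (low probability under the background distribution) carries more bits, which is one way to read the abstract's notion of efficient information exchange.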
Original language: English
Title of host publication: The 17th ACM SIGKDD conference on Knowledge Discovery and Data Mining (KDD)
Publisher: Association for Computing Machinery (ACM)
Number of pages: 9
ISBN (Print): 9781450308137
Publication status: Published - Aug 2011

