Data mining is quickly becoming one of the mainstream tools for analyzing data in marketing applications, bioinformatics, security, text mining, and so on. The result is often a glut of patterns found in the data, each of which may be suitable for further exploitation by post-processing algorithms, such as those for making recommendations, inferring networks, or triggering alarms. However, if the direct user of such patterns is a human (a market analyst, biologist, internet user, and so on) rather than a subsequent algorithm in an analysis pipeline, the number of patterns returned needs to be kept under control. To achieve this without losing information, the patterns need to be filtered for informativeness and non-redundancy. We argue that present approaches fall short of this goal, and that this has severely limited the adoption of data mining tools in such contexts. Probably the main difficulty is the vagueness of the term "informative": it is a highly subjective notion, and a satisfying objective formalization is yet to be found. In this paper we suggest a measure of informativeness for itemsets in transactional databases, and an efficient algorithm (which we call MINI) to filter a possibly large set of frequent itemsets down to a small list of informative and non-redundant itemsets, sorted in order of decreasing added informativeness. Our approach is based on solid foundations from statistics, and we believe it mirrors an intuitively satisfying notion of informativeness. Importantly, the ideas presented here for itemsets can be carried over to more general patterns, and we suggest a general framework for mining informative and non-redundant patterns. The MINI implementation used in the experiments will be made freely available on the authors' website.
|Translated title of the contribution||From frequent itemsets to informative patterns|
|Publisher||University of Bristol|
|Publication status||Published - 2009|