Most feature selection approaches perform either exhaustive or heuristic search for an optimal set of features. They typically consider only the labelled training set when choosing the most suitable features. When the distribution of instances in the labelled training set differs from that of the unlabelled test set, this may result in large generalization error. In this paper, a combination of heuristic measures and exhaustive search based on both the labelled dataset and the unlabelled dataset is proposed. First, two contingency table measures, the Goodman-Kruskal measure and Fisher's exact test, are used as heuristics to rank the features according to how well each feature predicts the class. Second, an exhaustive search is employed: using a test for goodness-of-fit, information from both the labelled dataset and the unlabelled dataset is applied to choose a better combination of features. We evaluate the approaches on the KDD Cup 2001 dataset.
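As a rough illustration of the two ingredients named in the abstract, the sketch below ranks a feature with a contingency table measure and then compares its distribution across the labelled and unlabelled sets. It rests on several assumptions that the abstract does not confirm: the Goodman-Kruskal measure is taken to be lambda (the abstract does not say which variant is meant), the goodness-of-fit test is taken to be Pearson's chi-square, and the function names `goodman_kruskal_lambda` and `chi_square_gof` are hypothetical. Fisher's exact test and the exhaustive search over feature combinations are omitted.

```python
from collections import Counter


def goodman_kruskal_lambda(feature, labels):
    """Proportional reduction in error when predicting the class from the feature.

    Assumption: lambda is used here; the abstract does not name the variant.
    """
    n = len(labels)
    # Correct guesses when always predicting the overall modal class.
    modal_count = max(Counter(labels).values())
    if n == modal_count:
        return 0.0  # only one class present, so nothing can be improved
    # Correct guesses when predicting the modal class within each feature value.
    by_value = {}
    for x, y in zip(feature, labels):
        by_value.setdefault(x, Counter())[y] += 1
    correct = sum(max(c.values()) for c in by_value.values())
    return (correct - modal_count) / (n - modal_count)


def chi_square_gof(labelled_values, unlabelled_values):
    """Pearson chi-square statistic: unlabelled counts vs. labelled proportions.

    Simplification: values seen only in the unlabelled set are ignored,
    because their expected count under the labelled proportions is zero.
    """
    labelled_counts = Counter(labelled_values)
    observed = Counter(unlabelled_values)
    n_lab, n_unlab = len(labelled_values), len(unlabelled_values)
    stat = 0.0
    for value, count in labelled_counts.items():
        expected = n_unlab * count / n_lab
        stat += (observed.get(value, 0) - expected) ** 2 / expected
    return stat


# Toy data (invented for illustration, not from the KDD Cup 2001 dataset).
labels = [1, 1, 1, 0, 0, 0]
feature_a = [1, 1, 1, 0, 0, 0]   # predicts the class perfectly
feature_b = [1, 0, 1, 0, 1, 0]   # predicts the class weakly
unlabelled_a = [1, 1, 1, 1, 1, 0]  # distribution shifted relative to feature_a

lambda_a = goodman_kruskal_lambda(feature_a, labels)
lambda_b = goodman_kruskal_lambda(feature_b, labels)
shift_a = chi_square_gof(feature_a, unlabelled_a)
```

Under these assumptions, a higher lambda indicates a feature that predicts the class better on the labelled set, while a larger chi-square statistic indicates that the feature is distributed differently in the unlabelled set. How the paper actually combines the two signals is not stated in the abstract.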
|Translated title of the contribution||Feature selection with labelled and unlabelled data|
|Title of host publication||Unknown|
|Editors||M. Bohanec, B. Kasek, N. Lavrac, D. Mladenic|
|Publisher||University of Helsinki|
|Pages||156 - 167|
|Number of pages||11|
|Publication status||Published - Aug 2002|