Feature selection with labelled and unlabelled data

S Wu, PA Flach

Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)


Most feature selection approaches perform either an exhaustive or a heuristic search for an optimal set of features. They typically consider only the labelled training set when choosing the most suitable features. When the distribution of instances in the labelled training set differs from that of the unlabelled test set, this may result in a large generalization error. In this paper, a combination of heuristic measures and exhaustive search, based on both the labelled and the unlabelled dataset, is proposed. First, two contingency table measures (the Goodman-Kruskal measure and Fisher's exact test) are used to rank the features according to how well each predicts the class. Second, an exhaustive search is employed: using a goodness-of-fit test, information from both the labelled and the unlabelled dataset is applied to choose a better combination of features. We evaluate the approaches on the KDD Cup 2001 dataset.
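The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions not stated in this record: features and classes are binary, the "Goodman-Kruskal measure" is taken to be Goodman-Kruskal tau, the goodness-of-fit test is a Pearson chi-square statistic comparing a feature's value counts in the unlabelled data against proportions estimated from the labelled data, and all function names are hypothetical.

```python
from collections import Counter
from math import comb, inf


def contingency_2x2(feature, labels):
    """Counts (a, b, c, d) for (f=1,y=1), (f=1,y=0), (f=0,y=1), (f=0,y=0)."""
    counts = Counter(zip(feature, labels))
    return counts[(1, 1)], counts[(1, 0)], counts[(0, 1)], counts[(0, 0)]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for a 2x2 table.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so ties with the observed table are included.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def goodman_kruskal_tau(a, b, c, d):
    """Goodman-Kruskal tau for predicting the class (columns) from the feature (rows)."""
    n = a + b + c + d
    rows = [(a, b), (c, d)]
    base = sum((t / n) ** 2 for t in (a + c, b + d))
    if 1.0 - base <= 0.0:  # only one class present: nothing to predict
        return 0.0
    cond = sum(cell ** 2 / sum(row) for row in rows if sum(row) > 0 for cell in row) / n
    return (cond - base) / (1.0 - base)


def rank_features(X, y):
    """Rank features by tau (descending), breaking ties by Fisher p-value (ascending).

    X maps a feature name to a list of 0/1 values; y is a list of 0/1 labels.
    """
    scores = {}
    for name, column in X.items():
        table = contingency_2x2(column, y)
        scores[name] = (goodman_kruskal_tau(*table), fisher_exact_two_sided(*table))
    ranking = sorted(scores, key=lambda k: (-scores[k][0], scores[k][1]))
    return ranking, scores


def chi_square_gof(labelled_col, unlabelled_col):
    """Pearson chi-square statistic of unlabelled value counts against
    the value proportions observed in the labelled data."""
    expected = Counter(labelled_col)
    observed = Counter(unlabelled_col)
    n_lab, n_unl = len(labelled_col), len(unlabelled_col)
    stat = 0.0
    for value in set(expected) | set(observed):
        e = n_unl * expected[value] / n_lab
        if e == 0:  # value never seen in the labelled data
            return inf
        stat += (observed[value] - e) ** 2 / e
    return stat


if __name__ == "__main__":
    y = [1, 1, 1, 1, 0, 0, 0, 0]
    X = {"f_good": [1, 1, 1, 1, 0, 0, 0, 0], "f_noise": [1, 0, 1, 0, 1, 0, 1, 0]}
    ranking, scores = rank_features(X, y)
    print(ranking, scores)
    print(chi_square_gof([1, 1, 0, 0], [1, 1, 1, 1]))
```

In this sketch a large chi-square statistic flags a feature whose distribution shifts between the labelled and the unlabelled data. How such a statistic would be combined with the ranking during the exhaustive search is left open here, since the abstract does not specify it.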
Translated title of the contribution: Feature selection with labelled and unlabelled data
Original language: English
Title of host publication: Unknown
Editors: M. Bohanec, B. Kasek, N. Lavrac, D. Mladenic
Publisher: University of Helsinki
Pages: 156-167
Number of pages: 11
Publication status: Published - Aug 2002

Bibliographical note

Conference Proceedings/Title of Journal: ECML/PKDD'02 workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning


