A vast range of information is expressed in unstructured or semi-structured text, in a form that is hard to decipher automatically. It is therefore important to build tools that let users extract information from textual documents as easily as from structured databases. Information extraction (IE) systems identify predetermined relevant information in text documents from a specific domain and use it to fill a structured form. Our approach to this problem is based on hidden Markov models (HMMs), which can be generated automatically from manually labeled example documents. We consider the challenging task of learning HMMs when only partially (sparsely) labeled documents are available for training. To further reduce the labeling effort a user has to invest, we describe how our algorithm extends naturally to an active learning algorithm that selects "difficult" unlabeled tokens and asks the user to label them. We study empirically how much active learning reduces the required labeling effort.
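The idea of querying "difficult" tokens can be illustrated with a small sketch. The abstract does not say how difficulty is measured, so the code below assumes one common criterion: the margin between the two most probable states of a token under the HMM posterior, with already labeled tokens clamped to their known state. The function names, the two-state toy model, and all parameter values are hypothetical and serve only as an illustration, not as the paper's actual algorithm.

```python
def posteriors(obs, pi, A, B, labels=None):
    """Forward-backward with known labels clamped.

    Returns gamma, where gamma[t][s] approximates
    P(state_t = s | observations, known labels).
    """
    labels = labels or {}
    n, k = len(obs), len(pi)

    def local(t, s):
        # Emission probability, set to zero if a known label rules out state s.
        if t in labels and labels[t] != s:
            return 0.0
        return B[s][obs[t]]

    # Forward pass, rescaled at every step to avoid underflow.
    alpha = []
    for t in range(n):
        row = []
        for s in range(k):
            if t == 0:
                prior = pi[s]
            else:
                prior = sum(alpha[t - 1][r] * A[r][s] for r in range(k))
            row.append(prior * local(t, s))
        z = sum(row)
        alpha.append([x / z for x in row])

    # Backward pass, also rescaled.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        row = [
            sum(A[s][r] * local(t + 1, r) * beta[t + 1][r] for r in range(k))
            for s in range(k)
        ]
        z = sum(row)
        beta[t] = [x / z for x in row]

    # Combine and normalize per token.
    gamma = []
    for t in range(n):
        row = [alpha[t][s] * beta[t][s] for s in range(k)]
        z = sum(row)
        gamma.append([x / z for x in row])
    return gamma


def most_uncertain_token(obs, pi, A, B, labels=None):
    """Return (index, margin) of the unlabeled token with the smallest margin."""
    labels = labels or {}
    gamma = posteriors(obs, pi, A, B, labels)
    best_t, best_margin = None, None
    for t, row in enumerate(gamma):
        if t in labels:
            continue  # the user has already labeled this token
        top = sorted(row, reverse=True)
        margin = top[0] - top[1]
        if best_margin is None or margin < best_margin:
            best_t, best_margin = t, margin
    return best_t, best_margin


# Hypothetical toy model: state 0 = background, state 1 = target field.
PI = [0.5, 0.5]
A = [[0.8, 0.2], [0.3, 0.7]]
B = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]  # B[state][token_id]
OBS = [0, 1, 2, 1, 0]                   # token ids of one short document
LABELS = {0: 0, 2: 1}                   # sparse labels: position -> state

if __name__ == "__main__":
    query_index, query_margin = most_uncertain_token(OBS, PI, A, B, LABELS)
    print("ask the user to label token", query_index, "margin", query_margin)
```

In an active learning loop under these assumptions, the returned token would be shown to the user, the answer added to `LABELS`, and the model parameters re-estimated before the next query.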
Translated title of the contribution: Learning Hidden Markov Models for Information Extraction Actively from Partially Labeled Text
Kuenstliche Intelligenz, Themenheft Textmining
Published - Jun 2002