UNGRADE: UNsupervised GRAph DEcomposition

Bruno Golenia, Sebastian Spiegler, Peter Flach

Research output: Chapter in Book/Report/Conference proceeding › Conference Contribution (Conference Proceeding)

Abstract

This article presents UNGRADE (UNsupervised GRAph DEcomposition), an unsupervised algorithm for word decomposition that segments a word list in any language. UNGRADE assumes that each word consists of a sequence of prefixes, a stem, and a sequence of suffixes, with no limit on the number of prefixes or suffixes. The algorithm is language independent and works in three steps. First, a pseudo stem is found for each word using a window based on Minimum Description Length. Second, prefix sequences and suffix sequences are found independently using a graph algorithm called graph-based unsupervised sequence segmentation. Finally, the morphemes from the previous steps are joined to provide a segmented word list. Because we focus purely on the segmentation of words, we employ a trivial labeling method: each morpheme is labeled with its own segment. UNGRADE is applied to five languages (English, German, Finnish, Turkish and Arabic), and results are reported against the gold standard for each language.
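The final joining step and the trivial labeling described in the abstract can be sketched as follows. This is only an illustration, not the authors' implementation: the function names (`join_morphemes`, `label_trivially`, `segment_word`) are hypothetical, and the pseudo stem and the prefix and suffix sequences are assumed to have already been produced by the first two steps, which are not reproduced here. The example word and its segmentation are hand-made.

```python
from typing import List, Tuple


def join_morphemes(prefixes: List[str], stem: str, suffixes: List[str]) -> List[str]:
    """Join the prefix sequence, the pseudo stem, and the suffix sequence in word order."""
    return [*prefixes, stem, *suffixes]


def label_trivially(morphemes: List[str]) -> List[Tuple[str, str]]:
    """Label each morpheme with its own segment, as the abstract describes."""
    return [(m, m) for m in morphemes]


def segment_word(word: str, prefixes: List[str], stem: str,
                 suffixes: List[str]) -> List[Tuple[str, str]]:
    """Combine the outputs of the earlier steps into one labeled segmentation."""
    morphemes = join_morphemes(prefixes, stem, suffixes)
    # Sanity check (an assumption of this sketch): the pieces must spell the word.
    if "".join(morphemes) != word:
        raise ValueError(f"segments {morphemes} do not reconstruct {word!r}")
    return label_trivially(morphemes)


# Hand-made input standing in for the output of steps one and two.
analysis = segment_word("unhelpfully", ["un"], "help", ["ful", "ly"])
print(" ".join(segment for segment, _label in analysis))
```

Under these assumptions the call prints `un help ful ly`. Because the prefix and suffix lists are unbounded, the same function covers words with no affixes as well as words with many.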
Translated title of the contribution: UNGRADE: UNsupervised GRAph DEcomposition
Original language: English
Title of host publication: Working Notes for the CLEF 2009 Workshop, Corfu, Greece
Publication status: Published - 2009

Bibliographical note

Conference Proceedings/Title of Journal: Working Notes for the CLEF 2009 Workshop, Corfu, Greece
Other identifier: 2001116
