This article presents UNGRADE (UNsupervised GRAph DEcomposition), an unsupervised algorithm for word decomposition that segments a word list in any language. UNGRADE assumes that each word has the structure prefixes, stem, suffixes, with no limit on the number of prefixes or suffixes. The algorithm is language independent and works in three steps. First, a pseudo-stem is found for each word using a window based on Minimum Description Length. Second, prefix sequences and suffix sequences are found independently using a graph algorithm called graph-based unsupervised sequence segmentation. Finally, the morphemes from the previous steps are joined to produce a segmented word list. Because we focus purely on the segmentation of words, we use a trivial labeling method: each morpheme is labeled with its own segment. UNGRADE is applied to five languages (English, German, Finnish, Turkish and Arabic), and results are reported against the gold standard for each language.
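The final joining step and the trivial labeling can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Segmentation` and `join_steps`, the representation of the pseudo-stem as a `(start, end)` character span, and the example word are all assumptions made for illustration. The outputs of the first two steps (the stem span and the prefix and suffix segments) are taken as given inputs here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Segmentation:
    """Hypothetical container for the assumed word structure:
    zero or more prefixes, one stem, zero or more suffixes."""
    prefixes: List[str] = field(default_factory=list)
    stem: str = ""
    suffixes: List[str] = field(default_factory=list)

    def morphemes(self) -> List[str]:
        # Morphemes in surface order: prefixes, then stem, then suffixes.
        return [*self.prefixes, self.stem, *self.suffixes]

    def surface(self) -> str:
        # Concatenating the morphemes should give back the original word.
        return "".join(self.morphemes())

    def labelled(self) -> List[Tuple[str, str]]:
        # Trivial labeling: each morpheme is labeled with its own segment.
        return [(m, m) for m in self.morphemes()]


def join_steps(
    word: str,
    stem_span: Tuple[int, int],
    prefix_segments: List[str],
    suffix_segments: List[str],
) -> Segmentation:
    """Join the results of the earlier steps into one segmentation.

    stem_span is an assumed (start, end) slice of the pseudo-stem in `word`;
    prefix_segments and suffix_segments are assumed to be the already
    segmented material before and after that span.
    """
    start, end = stem_span
    seg = Segmentation(list(prefix_segments), word[start:end], list(suffix_segments))
    if seg.surface() != word:
        raise ValueError(f"segments do not reassemble {word!r}")
    return seg
```

For example, `join_steps("unhelpfulness", (2, 6), ["un"], ["ful", "ness"]).morphemes()` gives `["un", "help", "ful", "ness"]`. The reassembly check reflects the assumption that segmentation only splits the word and never rewrites its characters.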
|Translated title of the contribution||UNGRADE: UNsupervised GRAph DEcomposition|
|Title of host publication||Working Notes for the CLEF 2009 Workshop, Corfu, Greece|
|Publication status||Published - 2009|
Bibliographical note
Other page information: -
Conference Proceedings/Title of Journal: Working Notes for the CLEF 2009 Workshop, Corfu, Greece
Other identifier: 2001116