Statistical modeling of agglutinative languages

Date

2000

Supervisor

Oflazer, Kemal

Publisher

Bilkent University

Language

English

Abstract

Recent advances in computer hardware and the availability of very large corpora have made the application of statistical techniques to natural language processing a possible and very appealing research area. Many good results have been obtained by applying these techniques to English (and similar languages) in parsing, word sense disambiguation, part-of-speech tagging, and speech recognition. However, languages like Turkish, which have a number of characteristics that differ from English, have mainly been left unstudied. Turkish presents an interesting problem for statistical modeling. In contrast to languages like English, for which there is a very small number of possible word forms for a given root word, languages like Turkish or Finnish, with very productive agglutinative morphology, can produce thousands of forms for a given root word. This causes a serious data sparseness problem for language modeling. This Ph.D. thesis presents the results of research and development of statistical language modeling techniques for Turkish, and tests such techniques on basic applications of natural language and speech processing such as morphological disambiguation, spelling correction, and n-best list rescoring for speech recognition. For all tasks, the use of units smaller than a word for language modeling was tested in order to reduce the impact of the data sparseness problem. For morphological disambiguation, we examined n-gram language models and maximum entropy models using inflectional groups as modeling units. Our results indicate that using smaller units is useful for modeling languages with complex morphology, and that n-gram language models perform better than maximum entropy models. For n-best list rescoring and spelling correction, the n-gram language models that were developed for morphological disambiguation, and their approximations via prefix-suffix models, were used. The prefix-suffix models performed very well for n-best list rescoring, but for spelling correction they could not beat word-based models in terms of accuracy.
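The idea of modeling sub-word units rather than whole words can be illustrated with a minimal bigram language model over morpheme-like segments. This is only an illustrative sketch under stated assumptions: the toy segmentations and the add-alpha smoothing below are hypothetical, and do not reproduce the thesis's actual inflectional-group representation or smoothing method.

```python
from collections import Counter
import math

# Hypothetical segmentations: each Turkish word split into a root plus
# suffix units (illustrative only, not the thesis's inflectional groups).
corpus = [
    ["ev", "+ler", "+de"],   # "evlerde" (in the houses)
    ["ev", "+de"],           # "evde" (in the house)
    ["ev", "+ler"],          # "evler" (houses)
]

def train_bigram(sequences):
    """Count unigram and bigram frequencies over sub-word units."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        units = ["<s>"] + seq + ["</s>"]
        unigrams.update(units)
        bigrams.update(zip(units, units[1:]))
    return unigrams, bigrams

def log_prob(seq, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a unit sequence."""
    units = ["<s>"] + seq + ["</s>"]
    lp = 0.0
    for prev, cur in zip(units, units[1:]):
        lp += math.log((bigrams[(prev, cur)] + alpha)
                       / (unigrams[prev] + alpha * vocab_size))
    return lp

unigrams, bigrams = train_bigram(corpus)
vocab = len(unigrams)
print(log_prob(["ev", "+ler", "+de"], unigrams, bigrams, vocab))
```

Because each suffix unit recurs across many word forms, the unit vocabulary stays small even when the word-form vocabulary would run into the thousands, which is the intuition behind using smaller units to fight data sparseness.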
