Hakkani-Tür, Dilek Z.
2016-01-08
2016-01-08
2000
2000
http://hdl.handle.net/11693/18580
Ankara : Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2000.
Thesis (Ph.D.) -- Bilkent University, 2000.
Includes bibliographical references (leaves 107-116).
Cataloged from PDF version of article.

Recent advances in computer hardware and the availability of very large corpora have made the application of statistical techniques to natural language processing a possible and very appealing research area. Many good results have been obtained by applying these techniques to English (and similar languages) in parsing, word sense disambiguation, part-of-speech tagging, and speech recognition. However, languages like Turkish, which have a number of characteristics that differ from English, have mainly been left unstudied. Turkish presents an interesting problem for statistical modeling. In contrast to languages like English, for which there is a very small number of possible word forms for a given root word, for languages like Turkish or Finnish, with very productive agglutinative morphology, it is possible to produce thousands of forms for a given root word. This causes a serious data sparseness problem for language modeling. This Ph.D. thesis presents the results of research and development of statistical language modeling techniques for Turkish, and tests such techniques on basic applications of natural language and speech processing, such as morphological disambiguation, spelling correction, and n-best list rescoring for speech recognition. For all tasks, the use of units smaller than a word for language modeling was tested in order to reduce the impact of the data sparsity problem. For morphological disambiguation, we examined n-gram language models and maximum entropy models using inflectional groups as modeling units.
Our results indicate that using smaller units is useful for modeling languages with complex morphology, and that n-gram language models perform better than maximum entropy models. For n-best list rescoring and spelling correction, the n-gram language models developed for morphological disambiguation, and their approximations via prefix-suffix models, were used. The prefix-suffix models performed very well for n-best list rescoring, but for spelling correction they could not beat word-based models in terms of accuracy.

xx, 122 leaves : illustrations, charts ; 30 cm.
English
info:eu-repo/semantics/openAccess
Natural language processing
Statistical language modeling
Agglutinative languages
Morphological disambiguation
Speech recognition
Spelling correction
N-gram language models
Maximum entropy models
Statistical modeling of agglutinative languages
Sondan eklemeli dillerin istatistiksel modellenmesi
Thesis
BILKUTUPB053051