Statistical modeling of agglutinative languages

buir.supervisorOflazer, Kemal
dc.contributor.authorHakkani-Tür, Dilek Z.
dc.date.accessioned2016-01-08T20:20:35Z
dc.date.available2016-01-08T20:20:35Z
dc.date.copyright2000
dc.date.issued2000
dc.departmentDepartment of Computer Engineeringen_US
dc.descriptionAnkara : Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2000.en_US
dc.descriptionThesis (Ph.D.) -- Bilkent University, 2000.en_US
dc.descriptionIncludes bibliographical references (leaves 107-116).en_US
dc.descriptionCataloged from PDF version of article.
dc.description.abstractRecent advances in computer hardware and the availability of very large corpora have made the application of statistical techniques to natural language processing a possible and very appealing research area. Many good results have been obtained by applying these techniques to English (and similar languages) in parsing, word sense disambiguation, part-of-speech tagging, and speech recognition. However, languages like Turkish, which have a number of characteristics that differ from English, have mainly been left unstudied. Turkish presents an interesting problem for statistical modeling. In contrast to languages like English, for which there is a very small number of possible word forms for a given root word, languages like Turkish or Finnish, with very productive agglutinative morphology, can produce thousands of forms for a given root word. This causes a serious data sparseness problem for language modeling. This Ph.D. thesis presents the results of research and development of statistical language modeling techniques for Turkish, and tests such techniques on basic applications of natural language and speech processing, such as morphological disambiguation, spelling correction, and n-best list rescoring for speech recognition. For all tasks, units smaller than a word were tested as language modeling units in order to reduce the impact of the data sparseness problem. For morphological disambiguation, we examined n-gram language models and maximum entropy models using inflectional groups as modeling units. Our results indicate that using smaller units is useful for modeling languages with complex morphology, and that n-gram language models perform better than maximum entropy models. For n-best list rescoring and spelling correction, we used the n-gram language models developed for morphological disambiguation, and their approximations via prefix-suffix models.
The prefix-suffix models performed very well for n-best list rescoring, but for spelling correction they could not beat word-based models in terms of accuracy.en_US
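The core idea of the abstract — that sub-word units tame the vocabulary growth of agglutinative morphology and can serve as n-gram modeling units — can be sketched as follows. The toy corpus, the `+suffix` segmentations, and the unit names are illustrative assumptions for this sketch, not data or code from the thesis:

```python
from collections import Counter

# Hypothetical toy corpus: each word is pre-segmented into root and
# suffix units (e.g. Turkish "evlerden" -> "ev +ler +den"). These
# segmentations are illustrative, not from a real morphological analyzer.
corpus = [
    ["ev", "+ler", "+den"],      # "evlerden" (from the houses)
    ["ev", "+de"],               # "evde" (in the house)
    ["kitap", "+lar", "+im"],    # "kitaplarim" (my books)
]

# Word-level vocabulary: every distinct surface form is its own unit,
# so the inventory grows with every new inflected form.
word_vocab = {"".join(units) for units in corpus}

# Sub-word vocabulary: roots and suffixes are shared across forms,
# so the unit inventory grows far more slowly and counts are denser.
subword_vocab = {u for units in corpus for u in units}

# Bigram counts over sub-word units: the raw statistics an n-gram
# language model over such units would be estimated from.
bigrams = Counter()
for units in corpus:
    seq = ["<s>"] + units + ["</s>"]
    for a, b in zip(seq, seq[1:]):
        bigrams[(a, b)] += 1

print(len(word_vocab))            # 3 distinct surface forms
print(len(subword_vocab))         # 7 shared sub-word units
print(bigrams[("ev", "+ler")])    # 1
```

On a corpus this small the two inventories are comparable, but with productive morphology the word-level vocabulary explodes (thousands of forms per root, as the abstract notes) while the sub-word inventory stays bounded, which is exactly why the sparseness problem eases.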
dc.description.degreePh.D.en_US
dc.description.provenanceMade available in DSpace on 2016-01-08T20:20:35Z (GMT). No. of bitstreams: 1 1.pdf: 78510 bytes, checksum: d85492f20c2362aa2bcf4aad49380397 (MD5)en
dc.description.statementofresponsibilityby Dilek Z. Hakkani-Türen_US
dc.format.extentxx, 122 leaves : illustrations, charts ; 30 cm.en_US
dc.identifier.itemidBILKUTUPB053051
dc.identifier.urihttp://hdl.handle.net/11693/18580
dc.language.isoEnglishen_US
dc.publisherBilkent Universityen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectNatural language processing
dc.subjectStatistical language modeling
dc.subjectAgglutinative languages
dc.subjectMorphological disambiguation
dc.subjectSpeech recognition
dc.subjectSpelling correction
dc.subjectN-gram language models
dc.subjectMaximum entropy models
dc.titleStatistical modeling of agglutinative languagesen_US
dc.title.alternativeSondan eklemeli dillerin istatistiksel modellenmesi
dc.typeThesisen_US

Files

Original bundle
Name: B053051.pdf
Size: 6.97 MB
Format: Adobe Portable Document Format
Description: Full printable version