Statistical modeling of agglutinative languages

Date

2000

Supervisor

Oflazer, Kemal

Publisher

Bilkent University

Language

English

Abstract

Recent advances in computer hardware and the availability of very large corpora have made the application of statistical techniques to natural language processing a possible and very appealing research area. Many good results have been obtained by applying these techniques to English (and similar languages) in parsing, word sense disambiguation, part-of-speech tagging, and speech recognition. However, languages like Turkish, which have a number of characteristics that differ from English, have mainly been left unstudied. Turkish presents an interesting problem for statistical modeling. In contrast to languages like English, for which there is a very small number of possible word forms for a given root word, languages like Turkish or Finnish, with very productive agglutinative morphology, can produce thousands of forms for a given root word. This causes a serious data sparseness problem for language modeling. This Ph.D. thesis presents the results of research and development of statistical language modeling techniques for Turkish, and tests such techniques on basic applications of natural language and speech processing such as morphological disambiguation, spelling correction, and n-best list rescoring for speech recognition. For all tasks, the use of units smaller than a word for language modeling was tested in order to reduce the impact of the data sparseness problem. For morphological disambiguation, we examined n-gram language models and maximum entropy models using inflectional groups as modeling units. Our results indicate that using smaller units is useful for modeling languages with complex morphology, and that n-gram language models perform better than maximum entropy models. For n-best list rescoring and spelling correction, the n-gram language models that were developed for morphological disambiguation, and their approximations via prefix-suffix models, were used. The prefix-suffix models performed very well for n-best list rescoring, but for spelling correction they could not beat word-based models in terms of accuracy.
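The idea of modeling sub-word units rather than whole words can be illustrated with a minimal bigram language model over morpheme-like segments. This is only an illustrative sketch under stated assumptions: the toy segmentations and the add-alpha smoothing below are hypothetical, and do not reproduce the thesis's actual inflectional-group representation or smoothing method.

```python
from collections import Counter
import math

# Hypothetical segmentations: each Turkish word split into a root plus
# suffix units (illustrative only, not the thesis's inflectional groups).
corpus = [
    ["ev", "+ler", "+de"],   # "evlerde" (in the houses)
    ["ev", "+de"],           # "evde" (in the house)
    ["ev", "+ler"],          # "evler" (houses)
]

def train_bigram(sequences):
    """Count unigram and bigram frequencies over sub-word units."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        units = ["<s>"] + seq + ["</s>"]
        unigrams.update(units)
        bigrams.update(zip(units, units[1:]))
    return unigrams, bigrams

def log_prob(seq, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a unit sequence."""
    units = ["<s>"] + seq + ["</s>"]
    lp = 0.0
    for prev, cur in zip(units, units[1:]):
        lp += math.log((bigrams[(prev, cur)] + alpha)
                       / (unigrams[prev] + alpha * vocab_size))
    return lp

unigrams, bigrams = train_bigram(corpus)
vocab = len(unigrams)
print(log_prob(["ev", "+ler", "+de"], unigrams, bigrams, vocab))
```

Because each suffix unit recurs across many word forms, the unit vocabulary stays small even when the word-form vocabulary would run into the thousands, which is the intuition behind using smaller units to fight data sparseness.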
