A statistical information extraction system for Turkish

buir.advisor: Oflazer, Kemal
dc.contributor.author: Tür, Gökhan
dc.date.accessioned: 2016-01-08T20:20:36Z
dc.date.available: 2016-01-08T20:20:36Z
dc.date.copyright: 2000
dc.date.issued: 2000
dc.description: Ankara : Department of Computer Engineering and the Institute of Engineering and Science of Bilkent Univ., 2000. [en_US]
dc.description: Thesis (Ph.D.) -- Bilkent University, 2000. [en_US]
dc.description: Includes bibliographical references (leaves 125-135). [en_US]
dc.description: Cataloged from PDF version of article.
dc.description.abstract: This thesis presents the results of a study on information extraction from unrestricted Turkish text using statistical language processing methods. We have successfully applied statistical methods, using both lexical and morphological information, to the following tasks:
- The Turkish Text Deasciifier task aims to convert the ASCII characters in a Turkish text into the corresponding non-ASCII Turkish characters (i.e., "ü", "ö", "ç", "ş", "ğ", "ı", and their uppercase forms).
- The Word Segmentation task aims to detect word boundaries, given a sequence of characters without spaces or punctuation.
- The Vowel Restoration task aims to restore the vowels of an input stream whose vowels have been deleted.
- The Sentence Segmentation task aims to divide a stream of text or speech into grammatical sentences. Given a sequence of (written or spoken) words, the aim is to find the boundaries of the sentences.
- The Topic Segmentation task aims to divide a stream of text or speech into topically homogeneous blocks. Given a sequence of (written or spoken) words, the aim is to find the boundaries where topics change.
- The Name Tagging task aims to mark the names (persons, locations, and organizations) in a text.
For relatively simple tasks, such as Turkish Text Deasciifier, Word Segmentation, and Vowel Restoration, lexical information alone is sufficient. To obtain better performance in more complex tasks, such as Sentence Segmentation, Topic Segmentation, and Name Tagging, we use not only lexical information but also morphological and contextual information. For sentence segmentation, we modeled the final inflectional groups of the words, combined them with the lexical model, and decreased the error rate to 4.34%. For name tagging, in addition to the lexical and morphological models, we also employed contextual and tag models, and reached an F-measure of 91.56%. For topic segmentation, using the stems of the words (nouns) proved more effective than using their surface forms, and we achieved a segmentation error rate of 10.90% on our test set.
dc.description.provenance: Made available in DSpace on 2016-01-08T20:20:36Z (GMT). No. of bitstreams: 1 1.pdf: 78510 bytes, checksum: d85492f20c2362aa2bcf4aad49380397 (MD5) [en]
dc.description.statementofresponsibility: by Gökhan Tür [en_US]
dc.format.extent: xx, 135 leaves ; 30 cm. [en_US]
dc.identifier.itemid: BILKUTUPB053053
dc.identifier.uri: http://hdl.handle.net/11693/18581
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Information Extraction
dc.subject: Statistical Natural Language Processing
dc.subject: Turkish
dc.subject: Named entity extraction
dc.subject: Topic segmentation
dc.subject: Sentence segmentation
dc.subject: Vowel restoration
dc.subject: Word segmentation
dc.subject: Text deasciification
dc.title: A statistical information extraction system for Turkish [en_US]
dc.title.alternative: Türkçe için istatistiksel bir bilgi çıkarım sistemi
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D. (Doctor of Philosophy)

Files

Original bundle

Name: B053053.pdf
Size: 8.24 MB
Format: Adobe Portable Document Format
Description: Full printable version