Automating information extraction task for Turkish texts

Tatar, Serhan

buir.advisor	Ulusoy, Özgür
dc.contributor.author	Tatar, Serhan
dc.date.accessioned	2016-01-08T18:14:29Z
dc.date.available	2016-01-08T18:14:29Z
dc.date.issued	2011
dc.description	Cataloged from PDF version of article.	en_US
dc.description	Includes bibliographical references (leaves 85-97).	en_US
dc.description.abstract	Throughout history, mankind has often suffered from a lack of necessary resources. In today’s information world, the challenge can sometimes be a wealth of resources. That is to say, an excessive amount of information implies the need to find and extract necessary information. Information extraction can be defined as the identification of selected types of entities, relations, facts or events in a set of unstructured text documents in a natural language. The goal of our research is to build a system that automatically locates and extracts information from Turkish unstructured texts. Our study focuses on two basic Information Extraction (IE) tasks: Named Entity Recognition and Entity Relation Detection. Named Entity Recognition, finding named entities (persons, locations, organizations, etc.) located in unstructured texts, is one of the most fundamental IE tasks. The Entity Relation Detection task tries to identify relationships between entities mentioned in text documents. Using a supervised learning strategy, the developed systems start with a set of examples collected from a training dataset and generate the extraction rules from the given examples by using a carefully designed coverage algorithm. Moreover, several rule filtering and rule refinement techniques are utilized to maximize generalization and accuracy at the same time. In order to obtain accurate generalization, we use several syntactic and semantic features of the text, including orthographical, contextual, lexical and morphological features. In particular, morphological features of the text are effectively used in this study to increase the extraction performance for Turkish, an agglutinative language. Since the system does not rely on handcrafted rules/patterns, it does not suffer heavily from the domain adaptability problem.
The results of the conducted experiments show that (1) the developed systems are successfully applicable to the Named Entity Recognition and Entity Relation Detection tasks, and (2) exploiting morphological features can significantly improve the performance of information extraction from Turkish, an agglutinative language.	en_US
dc.description.statementofresponsibility	Tatar, Serhan	en_US
dc.format.extent	xviii, 110 leaves	en_US
dc.identifier.uri	http://hdl.handle.net/11693/15166
dc.language.iso	English	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Information Extraction	en_US
dc.subject	Turkish	en_US
dc.subject	Named Entity Recognition	en_US
dc.subject	Entity Relation Detection	en_US
dc.subject.lcc	QA76.9.N38 T38 2011	en_US
dc.subject.lcsh	Natural language processing (Computer science)	en_US
dc.subject.lcsh	Computational linguistics.	en_US
dc.subject.lcsh	Information storage and retrieval systems.	en_US
dc.title	Automating information extraction task for Turkish texts	en_US
dc.type	Thesis	en_US
thesis.degree.discipline	Computer Engineering
thesis.degree.grantor	Bilkent University
thesis.degree.level	Doctoral
thesis.degree.name	Ph.D. (Doctor of Philosophy)

Files

Original bundle

Name: 0005018.pdf
Size: 1.36 MB
Format: Adobe Portable Document Format

Collections

Graduate School of Engineering and Science