Automating information extraction task for Turkish texts
Author(s)
Advisor
Date
2011
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract
Throughout history, mankind has often suffered from a lack of necessary resources.
In today’s information world, the challenge can sometimes be a wealth
of resources. That is to say, an excessive amount of information implies the need
to find and extract necessary information. Information extraction can be defined
as the identification of selected types of entities, relations, facts or events in a set
of unstructured text documents in a natural language.
The goal of our research is to build a system that automatically locates and
extracts information from Turkish unstructured texts. Our study focuses on
two basic Information Extraction (IE) tasks: Named Entity Recognition and
Entity Relation Detection. Named Entity Recognition, finding named entities
(persons, locations, organizations, etc.) located in unstructured texts, is one of
the most fundamental IE tasks. The Entity Relation Detection task tries to identify
relationships between entities mentioned in text documents.
Using a supervised learning strategy, the developed systems start with a set
of examples collected from a training dataset and generate the extraction rules
from the given examples by using a carefully designed coverage algorithm. Moreover,
several rule filtering and rule refinement techniques are utilized to maximize
generalization and accuracy at the same time. In order to obtain accurate generalization,
we use several syntactic and semantic features of the text, including
orthographical, contextual, lexical and morphological features. In particular,
morphological features of the text are effectively used in this study to increase
the extraction performance for Turkish, an agglutinative language. Since the system
does not rely on handcrafted rules or patterns, it does not suffer heavily from
the domain adaptability problem.
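
To make the idea of learning extraction rules with a coverage algorithm more concrete, the following is a minimal sketch of a generic sequential covering loop. It is not the algorithm developed in the thesis, whose details the abstract does not give. The feature names (`cap`, `prev`), the toy examples, and the function names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Features = Dict[str, str]
Example = Tuple[Features, bool]   # (token features, is the token a named entity?)
Rule = Dict[str, str]             # conjunction of "feature == value" tests


def covers(rule: Rule, features: Features) -> bool:
    """A rule covers an example when every one of its tests matches."""
    return all(features.get(k) == v for k, v in rule.items())


def learn_one_rule(examples: List[Example]) -> Rule:
    """Greedily add the test with the best precision until no negatives remain."""
    rule: Rule = {}
    covered = list(examples)
    while True:
        if rule and all(label for _, label in covered):
            break  # the rule is non-empty and covers positives only
        best, best_score = None, (-1.0, -1)
        for feats, label in covered:
            if not label:
                continue  # candidate tests are taken from positive examples
            for k, v in feats.items():
                if k in rule:
                    continue
                subset = [l for f, l in covered if f.get(k) == v]
                pos = sum(subset)
                score = (pos / len(subset), pos)  # precision, then coverage
                if score > best_score:
                    best, best_score = (k, v), score
        if best is None:
            break  # no further test can be added
        rule[best[0]] = best[1]
        covered = [(f, l) for f, l in covered if covers(rule, f)]
    return rule


def sequential_covering(examples: List[Example],
                        min_precision: float = 1.0) -> List[Rule]:
    """Learn rules one at a time, removing the positives each rule covers."""
    rules: List[Rule] = []
    remaining = list(examples)
    while any(label for _, label in remaining):
        rule = learn_one_rule(remaining)
        hits = [l for f, l in remaining if covers(rule, f)]
        pos = sum(hits)
        if not rule or pos == 0 or pos / len(hits) < min_precision:
            break  # simple rule filter: drop rules that are not precise enough
        rules.append(rule)
        remaining = [(f, l) for f, l in remaining
                     if not (l and covers(rule, f))]
    return rules


# Toy training set with hypothetical features: capitalization and previous token.
examples: List[Example] = [
    ({"cap": "yes", "prev": "Dr."}, True),
    ({"cap": "yes", "prev": "Dr."}, True),
    ({"cap": "yes", "prev": "ve"}, False),
    ({"cap": "no", "prev": "Dr."}, False),
    ({"cap": "no", "prev": "ve"}, False),
]
rules = sequential_covering(examples)
print(rules)
```

In this sketch the `min_precision` threshold stands in for rule filtering. A system along the lines described above would draw its tests from richer orthographical, contextual, lexical and morphological features rather than the two toy features used here.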
The results of the conducted experiments show that (1) the developed systems
can be applied successfully to the Named Entity Recognition and Entity Relation
Detection tasks, and (2) exploiting morphological features can significantly improve
the performance of information extraction from Turkish, an agglutinative
language.