Application of K-NN and FPTC based text categorization algorithms to Turkish news reports
Author(s)
Advisor
Güvenir, Halil Altay
Date
2001
Publisher
Bilkent University
Language
English
Type
Thesis
Item Usage Stats
240
views
76
downloads
Abstract
New technological developments, such as easy access to the Internet, optical character
readers, high-speed networks and inexpensive massive storage facilities,
have resulted in a dramatic increase in the availability of on-line text: newspaper
articles, incoming (electronic) mail, technical reports, etc. The enormous
growth of on-line information has led to a comparable growth in the need
for methods that help users organize such information. Text Categorization
may be a remedy for this increased need for advanced techniques. Text Categorization
is the classification of units of natural language texts with respect to
a set of pre-existing categories. Categorization of documents is challenging,
as the number of discriminating words can be very large. This thesis presents
the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, for
the study of Text Categorization. Turkish is an agglutinative language in
which words contain no direct indication of where the morpheme boundaries are;
furthermore, morphemes take a shape dependent on the morphological and
phonological context. In Turkish, the process of adding one suffix to another
can result in a relatively long word; furthermore, a single Turkish word can
give rise to a very large number of variants. Due to this complex morphological
structure, Turkish requires text processing techniques different from those used for English
and similar languages. Therefore, besides converting all words to lower case
and removing punctuation marks, some preliminary work is required, such as
stemming, removal of stopwords and formation of a keyword list.
This thesis also presents the evaluation and comparison of the well-known k-NN
classification algorithm and a variant of k-NN, called the Feature Projection
Text Categorization (FPTC) algorithm. The k-NN classifier is an instance-based
learning method. It computes the similarity between the test instance
and each training instance and, using the k top-ranking nearest instances to
predict the categories of the input, finds the category that is most similar.
The FPTC algorithm is based on the idea of representing training instances as their
projections on each feature dimension. If the value of a training instance is
missing for a feature, that instance is not stored on that feature. Experiments
show that the FPTC algorithm achieves accuracy comparable to that of the k-NN
algorithm; furthermore, FPTC is significantly more time-efficient than
k-NN.
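The two classifiers described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The abstract does not specify the similarity measure, the term weighting, or how votes on feature projections are combined, so the cosine similarity over raw term counts and the per-feature voting rule below are assumptions made for illustration. All function names and the toy tokens (ASCII stand-ins for stemmed Turkish keywords) are hypothetical.

```python
from collections import Counter, defaultdict
import math


def term_freqs(tokens):
    """Bag-of-words vector as a term -> count mapping."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings (assumed measure)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=3):
    """Rank every training instance by similarity, then let the top k vote."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def build_projections(train):
    """Store each instance once per feature it actually contains.

    An instance with no value for a feature is not stored on that feature.
    """
    projections = defaultdict(list)
    for vector, label in train:
        for term, value in vector.items():
            projections[term].append((value, label))
    return projections


def projection_predict(projections, query, k=3):
    """Assumed voting rule: on each feature of the query, the k stored values
    closest to the query's value each cast one vote for their category."""
    votes = Counter()
    for term, value in query.items():
        stored = projections.get(term, [])
        nearest = sorted(stored, key=lambda p: abs(p[0] - value))[:k]
        votes.update(label for _, label in nearest)
    return votes.most_common(1)[0][0] if votes else None


# Toy training set: (vector, category) pairs with invented tokens.
train = [
    (term_freqs(["mac", "gol", "takim"]), "spor"),
    (term_freqs(["gol", "lig", "takim"]), "spor"),
    (term_freqs(["borsa", "dolar", "faiz"]), "ekonomi"),
    (term_freqs(["faiz", "banka", "dolar"]), "ekonomi"),
]
query = term_freqs(["dolar", "faiz", "artis"])

print(knn_predict(train, query, k=3))
print(projection_predict(build_projections(train), query, k=3))
```

In this sketch, `knn_predict` compares the query against every training vector, while `projection_predict` only consults the stored projections for terms that occur in the query. Terms absent from the training data, such as `artis` here, contribute no votes.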
Keywords
Text categorization
Classification
Feature projections
Stemming
Wild card matching
Stopword