Application of K-NN and FPTC based text categorization algorithms to Turkish news reports

Date
2001
Advisor
Güvenir, Halil Altay
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract

New technological developments, such as easy access to the Internet, optical character readers, high-speed networks and inexpensive mass storage facilities, have resulted in a dramatic increase in the availability of on-line text: newspaper articles, incoming (electronic) mail, technical reports, etc. The enormous growth of on-line information has led to a comparable growth in the need for methods that help users organize such information. Text categorization may be a remedy for this increased need for advanced techniques. Text categorization is the classification of units of natural language text with respect to a set of pre-existing categories. Categorization of documents is challenging, as the number of discriminating words can be very large.

This thesis presents the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, for the study of text categorization. Turkish is an agglutinative language in which words contain no direct indication of where the morpheme boundaries are; furthermore, morphemes take a shape dependent on the morphological and phonological context. In Turkish, the process of adding one suffix to another can result in a relatively long word, and a single Turkish word can give rise to a very large number of variants. Due to this complex morphological structure, Turkish requires text processing techniques different from those used for English and similar languages. Therefore, besides converting all words to lower case and removing punctuation marks, some preliminary work is required, such as stemming, removal of stopwords and formation of a keyword list.

This thesis also presents the evaluation and comparison of the well-known k-NN classification algorithm and a variant of k-NN called the Feature Projection Text Categorization (FPTC) algorithm. The k-NN classifier is an instance-based learning method. It computes the similarity between the test instance and each training instance and, considering the k top-ranking nearest instances, predicts the category of the input as the one that is most similar. The FPTC algorithm is based on the idea of representing training instances as their projections on each feature dimension. If the value of a training instance is missing for a feature, that instance is not stored on that feature. Experiments show that the FPTC algorithm achieves accuracy comparable to that of the k-NN algorithm; furthermore, FPTC is significantly more time-efficient than k-NN.
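The two classifiers described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the stopword list, the toy documents, the cosine similarity measure, the similarity-weighted vote, and the per-feature voting rule in `fp_predict` are all assumptions made for illustration, and stemming is left out because Turkish morphology needs a dedicated analyzer.

```python
import math
import string
from collections import Counter, defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"ve", "bir", "bu", "ile"}

def preprocess(text):
    """Lower-case, strip ASCII punctuation, drop stopwords (no stemming).

    Note: str.lower() does not apply Turkish casing rules (I -> ı).
    """
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, doc, k=3):
    """Rank all training instances by similarity; the top k vote by similarity."""
    scored = sorted(((cosine(vec, doc), label) for vec, label in train),
                    key=lambda p: p[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim
    return max(votes, key=votes.get)

def build_projections(train):
    """Store each instance on a feature only if it has a value for that feature."""
    proj = defaultdict(list)
    for vec, label in train:
        for term, value in vec.items():
            proj[term].append((value, label))
    return proj

def fp_predict(proj, doc, k=3):
    """Assumed voting rule: on each feature of the test document, the k
    stored instances with the closest values each cast one vote."""
    votes = defaultdict(float)
    for term, value in doc.items():
        nearest = sorted(proj.get(term, []), key=lambda p: abs(p[0] - value))[:k]
        for _, label in nearest:
            votes[label] += 1.0
    return max(votes, key=votes.get) if votes else None

# Toy training set (invented examples, not from the Anadolu Agency data).
train = [
    (preprocess("faiz ve borsa düştü"), "ekonomi"),
    (preprocess("borsa bugün yükseldi, faiz sabit"), "ekonomi"),
    (preprocess("takım maçı kazandı"), "spor"),
    (preprocess("maç bu akşam, takım hazır"), "spor"),
]
proj = build_projections(train)
doc = preprocess("Borsa ve faiz haberleri")
print(knn_predict(train, doc), fp_predict(proj, doc))
```

In this sketch `knn_predict` compares the test document against every training instance, whereas `fp_predict` only consults the projection lists of features that occur in the test document. That difference is one plausible source of the efficiency gap the abstract reports, though the toy code itself is not a benchmark and re-sorts each projection list per query for simplicity.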

Keywords
Text categorization, Classification, Feature projections, Stemming, Wild card matching, Stopword