Application of K-NN and FPTC based text categorization algorithms to Turkish news reports

İlhan, Ufuk

Files

0001609.pdf (474.37 KB)

Date

2001

Authors

İlhan, Ufuk

Advisor

Güvenir, Halil Altay

Abstract

New technological developments, such as easy access to the Internet, optical character readers, high-speed networks and inexpensive massive storage facilities, have resulted in a dramatic increase in the availability of on-line text: newspaper articles, incoming (electronic) mail, technical reports, etc. The enormous growth of on-line information has led to a comparable growth in the need for methods that help users organize such information. Text categorization may be a remedy for this increased need for advanced techniques. Text categorization is the classification of units of natural language text with respect to a set of pre-existing categories. Categorization of documents is challenging, as the number of discriminating words can be very large. This thesis presents the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, for the study of text categorization. Turkish is an agglutinative language in which words contain no direct indication of where the morpheme boundaries are; furthermore, morphemes take a shape dependent on the morphological and phonological context. In Turkish, the process of adding one suffix to another can result in a relatively long word, and a single Turkish word can give rise to a very large number of variants. Due to this complex morphological structure, Turkish requires text processing techniques different from those used for English and similar languages. Therefore, besides converting all words to lower case and removing punctuation marks, some preliminary work is required, such as stemming, removal of stopwords and formation of a keyword list.

This thesis also presents the evaluation and comparison of the well-known k-NN classification algorithm and a variant of k-NN called the Feature Projection Text Categorization (FPTC) algorithm. The k-NN classifier is an instance-based learning method. It computes the similarity between the test instance and each training instance and, considering the k top-ranking nearest instances, predicts the category of the input as the one that is most similar. The FPTC algorithm is based on the idea of representing training instances as their projections on each feature dimension. If the value of a training instance is missing for a feature, that instance is not stored on that feature. Experiments show that the FPTC algorithm achieves accuracy comparable to that of the k-NN algorithm; furthermore, FPTC is significantly more time-efficient than k-NN.
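The two classification ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the stopword list, the toy training sentences, the category names, the use of raw term counts with cosine similarity, and the one-vote-per-neighbour scheme on each feature are all assumptions made for illustration, and stemming is omitted entirely. The function names (`preprocess`, `knn_predict`, `build_projections`, `fp_predict`) are hypothetical.

```python
import math
import string
from collections import Counter, defaultdict

# Hypothetical stopword list for illustration; not the list used in the thesis.
STOPWORDS = {"ve", "bir", "bu", "ile", "de", "da"}


def preprocess(text):
    """Lower-case, strip ASCII punctuation, drop stopwords; return term counts.

    Note: str.lower() is not Turkish-aware (e.g. dotted capital I), and no
    stemming is done here, so this is only a rough stand-in for real
    Turkish preprocessing.
    """
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train, doc, k=3):
    """Similarity-weighted vote among the k training documents most similar to doc."""
    ranked = sorted(train, key=lambda item: cosine(item[0], doc), reverse=True)[:k]
    votes = Counter()
    for vec, label in ranked:
        votes[label] += cosine(vec, doc)
    return votes.most_common(1)[0][0]


def build_projections(train):
    """Project training documents onto each feature (term) separately.

    A document is stored on a feature only if the term occurs in it, so
    missing values take no space.
    """
    proj = defaultdict(list)
    for vec, label in train:
        for term, count in vec.items():
            proj[term].append((count, label))
    return proj


def fp_predict(proj, doc, k=3):
    """Each feature of doc votes using its k nearest stored values; votes are summed."""
    votes = Counter()
    for term, count in doc.items():
        nearest = sorted(proj.get(term, []), key=lambda p: abs(p[0] - count))[:k]
        for _, label in nearest:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy training data (invented for this sketch).
raw = [
    ("takım maçı kazandı ve gol attı", "spor"),
    ("futbol maçı bu akşam", "spor"),
    ("borsa ve dolar yükseldi", "ekonomi"),
    ("merkez bankası faiz kararı", "ekonomi"),
]
train = [(preprocess(text), label) for text, label in raw]
proj = build_projections(train)

test_doc = preprocess("Maçı kazandı, gol attı!")
print(knn_predict(train, test_doc))
print(fp_predict(proj, test_doc))
```

In this sketch, `knn_predict` compares the test document against every training document, whereas `fp_predict` only consults the per-term lists for terms that actually occur in the test document. That difference is one plausible reading of why a feature-projection approach can be faster on sparse text data, though the sketch makes no claim about the timings reported in the thesis.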

Keywords

Text categorization, Classification, Feature projections, Stemming, Wild card matching, Stopword

Degree Discipline

Computer Engineering

Degree Level

Master's

Degree Name

MS (Master of Science)

Permalink

http://hdl.handle.net/11693/14630

Collections

Graduate School of Engineering and Science

Language

English

Type

Thesis
