Application of K-NN and FPTC based text categorization algorithms to Turkish news reports

İlhan, Ufuk

buir.advisor	Güvenir, Halil Altay
dc.contributor.author	İlhan, Ufuk
dc.date.accessioned	2016-01-08T18:03:17Z
dc.date.available	2016-01-08T18:03:17Z
dc.date.issued	2001
dc.description	Cataloged from PDF version of article.	en_US
dc.description	Includes bibliographical references (leaves 64-68).	en_US
dc.description.abstract	New technological developments, such as easy access to the Internet, optical character readers, high-speed networks and inexpensive mass storage facilities, have resulted in a dramatic increase in the availability of on-line text: newspaper articles, incoming (electronic) mail, technical reports, etc. The enormous growth of on-line information has led to a comparable growth in the need for methods that help users organize such information. Text Categorization may meet this need for advanced techniques. Text Categorization is the classification of units of natural language text with respect to a set of pre-existing categories. Categorization of documents is challenging, as the number of discriminating words can be very large. This thesis presents the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, for the study of Text Categorization. Turkish is an agglutinative language in which words contain no direct indication of where the morpheme boundaries are; furthermore, morphemes take a shape dependent on the morphological and phonological context. In Turkish, the process of adding one suffix to another can result in a relatively long word, and a single Turkish word can give rise to a very large number of variants. Due to this complex morphological structure, Turkish requires text processing techniques different from those used for English and similar languages. Therefore, besides converting all words to lower case and removing punctuation marks, some preliminary work is required, such as stemming, removal of stopwords and formation of a keyword list. This thesis also presents the evaluation and comparison of the well-known k-NN classification algorithm and a variant of k-NN called the Feature Projection Text Categorization (FPTC) algorithm. The k-NN classifier is an instance-based learning method. It computes the similarity between the test instance and each training instance and, considering the k top-ranking nearest instances, predicts the category that is most similar to the input. The FPTC algorithm is based on the idea of representing training instances as their projections on each feature dimension. If the value of a training instance is missing for a feature, that instance is not stored on that feature. Experiments show that the FPTC algorithm achieves accuracy comparable to that of the k-NN algorithm, while its time efficiency is significantly better.	en_US
dc.description.statementofresponsibility	İlhan, Ufuk	en_US
dc.format.extent	68 leaves	en_US
dc.identifier.itemid	BILKUTUPB056067
dc.identifier.uri	http://hdl.handle.net/11693/14630
dc.language.iso	English	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Text categorization	en_US
dc.subject	Classification	en_US
dc.subject	Feature projections	en_US
dc.subject	Stemming	en_US
dc.subject	Wild card matching	en_US
dc.subject	Stopword	en_US
dc.subject.lcc	P128 .C37 2001	en_US
dc.subject.lcsh	Categorization (Linguistics)	en_US
dc.subject.lcsh	Conversation analysis.	en_US
dc.title	Application of K-NN and FPTC based text categorization algorithms to Turkish news reports	en_US
dc.type	Thesis	en_US
thesis.degree.discipline	Computer Engineering
thesis.degree.grantor	Bilkent University
thesis.degree.level	Master's
thesis.degree.name	MS (Master of Science)
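
The two classifiers described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy documents, the category labels, the cosine similarity measure, the one-vote-per-neighbour scheme and every function name are assumptions made for illustration, and stemming and stopword removal are left out. The FPTC part follows only what the abstract states (training instances are stored as projections on each feature dimension, and an instance with no value for a feature is not stored on it).

```python
from collections import Counter, defaultdict
import math


def vectorize(text):
    """Lower-case, drop punctuation, and count terms (no stemming or stopwords).

    Note: str.lower() is not Turkish-aware (dotted/dotless I), so the toy
    data below is already lower-case.
    """
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (an assumed measure)."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, text, k=3):
    """Rank all training instances by similarity; majority vote over the top k."""
    query = vectorize(text)
    ranked = sorted(train, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def build_projections(train):
    """Store each instance on a feature only if it has a value for that feature."""
    projections = defaultdict(list)
    for vector, label in train:
        for term, value in vector.items():
            projections[term].append((value, label))
    return projections


def fptc_predict(projections, text, k=3):
    """On each feature of the query, let the k nearest stored values vote."""
    votes = Counter()
    for term, value in vectorize(text).items():
        stored = projections.get(term, [])
        nearest = sorted(stored, key=lambda p: abs(p[0] - value))[:k]
        for _, label in nearest:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy corpus; the categories are invented for this example.
docs = [
    ("takım maçı kazandı", "spor"),
    ("futbol maçı gol", "spor"),
    ("borsa faiz dolar", "ekonomi"),
    ("dolar faiz arttı", "ekonomi"),
]
train = [(vectorize(text), label) for text, label in docs]
projections = build_projections(train)

print(knn_predict(train, "futbol maçı", k=1))
print(fptc_predict(projections, "dolar faiz"))
```

In this sketch k-NN compares the query against every training vector, whereas the projection-based classifier only looks at the stored values of the features that actually occur in the query, which hints at why a projection scheme can be faster at classification time.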

Files

Original bundle

Name: 0001609.pdf
Size: 474.37 KB
Format: Adobe Portable Document Format

Collections

Graduate School of Engineering and Science