
dc.contributor.advisorGüvenir, Halil Altayen_US
dc.contributor.authorİlhan, Ufuken_US
dc.date.accessioned2016-01-08T18:03:17Z
dc.date.available2016-01-08T18:03:17Z
dc.date.issued2001
dc.identifier.urihttp://hdl.handle.net/11693/14630
dc.descriptionAnkara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2001.en_US
dc.descriptionThesis (Master's) -- Bilkent University, 2001.en_US
dc.descriptionIncludes bibliographical references leaves 64-68en_US
dc.description.abstractNew technological developments, such as easy access to the Internet, optical character readers, high-speed networks and inexpensive mass storage facilities, have resulted in a dramatic increase in the availability of on-line text: newspaper articles, incoming (electronic) mail, technical reports, etc. The enormous growth of on-line information has led to a comparable growth in the need for methods that help users organize such information. Text Categorization may be the remedy for this increased need for advanced techniques. Text Categorization is the classification of units of natural language text with respect to a set of pre-existing categories. Categorization of documents is challenging, as the number of discriminating words can be very large. This thesis presents the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, in order to study Text Categorization. Turkish is an agglutinative language in which words contain no direct indication of where the morpheme boundaries are; furthermore, morphemes take a shape dependent on the morphological and phonological context. In Turkish, the process of adding one suffix to another can result in a relatively long word, and a single Turkish word can give rise to a very large number of variants. Due to this complex morphological structure, Turkish requires text processing techniques different from those used for English and similar languages. Therefore, besides converting all words to lower case and removing punctuation marks, some preliminary work is required, such as stemming, removal of stopwords and formation of a keyword list.
This thesis also presents the evaluation and comparison of the well-known k-NN classification algorithm and a variant of k-NN, called the Feature Projection Text Categorization (FPTC) algorithm. The k-NN classifier is an instance-based learning method. It computes the similarity between the test instance and each training instance and, considering the k top-ranking nearest instances, predicts the category of the input by finding the category that is most similar. The FPTC algorithm is based on the idea of representing training instances as their projections on each feature dimension. If the value of a training instance is missing for a feature, that instance is not stored on that feature. Experiments show that the FPTC algorithm achieves accuracy comparable to that of the k-NN algorithm; furthermore, FPTC significantly outperforms k-NN in time efficiency.en_US
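The k-NN step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not state which similarity measure or voting rule was used, so cosine similarity over raw term counts and similarity-weighted voting are assumptions made here for illustration, as are the function names and the toy training data. Stemming and stopword removal are omitted.

```python
import math
from collections import Counter


def vectorize(text):
    """Lowercase the text, split on whitespace, and count terms (bag of words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_categorize(train, text, k=3):
    """Predict a category from the k training documents most similar to `text`.

    Each of the k nearest documents votes for its own category, weighted by
    its similarity to the query (an assumed voting rule).
    """
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(doc)), label) for doc, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: (document, category) pairs.
train = [
    ("futbol lig gol", "spor"),
    ("gol lig derbi", "spor"),
    ("borsa dolar faiz", "ekonomi"),
    ("banka faiz dolar", "ekonomi"),
]

print(knn_categorize(train, "lig gol derbi", k=3))
print(knn_categorize(train, "faiz banka", k=1))
```

Because every query is compared against every stored training document, classification cost grows with the size of the training set, which is the cost the abstract contrasts with FPTC's time efficiency.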
dc.description.statementofresponsibilityİlhan, Ufuken_US
dc.format.extent68 leavesen_US
dc.language.isoEnglishen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectText categorizationen_US
dc.subjectClassificationen_US
dc.subjectFeature projectionsen_US
dc.subjectStemmingen_US
dc.subjectWild card matchingen_US
dc.subjectStopworden_US
dc.subject.lccP128 .C37 2001en_US
dc.subject.lcshCategorization (Linguistics)en_US
dc.subject.lcshConversation analysis.en_US
dc.titleApplication of K-NN and FPTC based text categorization algorithms to Turkish news reportsen_US
dc.typeThesisen_US
dc.departmentDepartment of Computer Engineeringen_US
dc.publisherBilkent Universityen_US
dc.description.degreeM.S.en_US
dc.identifier.itemidBILKUTUPB056067

