Developing a text categorization template for Turkish news portals

Date
2011
Source Title
2011 International Symposium on Innovations in Intelligent Systems and Applications
Publisher
IEEE
Pages
379 - 383
Language
English
Type
Conference Paper
Abstract

News portals need category information to present news stories. However, for many stories this information is unavailable, incorrectly assigned, or too generic, which makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted, difficult process that involves decisions about parameter tuning, term weighting, word stemming, stop-word removal, and feature selection. In this study we aim to find a categorization setup that provides highly accurate ATC results for Turkish news portals. We also examine other aspects, such as the effect of training dataset size and robustness issues. Two Turkish test collections with different characteristics are created using the Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). Our results recommend a text categorization template for Turkish news portals and provide some pointers for future research. © 2011 IEEE.
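
As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch of one of the four classifier families (Naive Bayes) combined with stop-word removal and a crude stemming substitute. It is not the authors' setup: the stop-word list, the fixed-prefix truncation used in place of a real stemmer, the `preprocess` and `NaiveBayes` names, and the toy training data are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, tiny stop-word list; a real system would use a fuller Turkish list.
STOP_WORDS = {"ve", "ile", "bir", "bu", "da", "de", "için"}


def preprocess(text, prefix_len=5):
    """Lowercase, drop stop words, and truncate tokens to a fixed prefix.

    Prefix truncation is only a stand-in for stemming (an assumption).
    Note: str.lower() does not apply Turkish casing rules for dotted and
    dotless I, so the toy data below is already lowercase.
    """
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [t[:prefix_len] for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per category
        self.term_counts = defaultdict(Counter)  # term frequencies per category
        self.vocab = set()
        for doc, label in zip(docs, labels):
            terms = preprocess(doc)
            self.term_counts[label].update(terms)
            self.vocab.update(terms)
        return self

    def predict(self, doc):
        terms = preprocess(doc)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for term in terms:
                if term in self.vocab:  # terms unseen in training are skipped
                    score += math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for illustration): sports vs. economy headlines.
train_docs = [
    "takım maçı kazandı gol attı",
    "futbol ligi maç sonucu",
    "borsa dolar faiz yükseldi",
    "merkez bankası faiz kararı",
]
train_labels = ["spor", "spor", "ekonomi", "ekonomi"]

model = NaiveBayes().fit(train_docs, train_labels)
predicted = model.predict("dolar ve faiz düştü")
```

A comparison like the one reported in the abstract would repeat this with the other classifiers and with different preprocessing choices, then measure accuracy on held-out stories. The sketch covers a single configuration only.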

Keywords
Turkish news, Automated text categorization, Classification methods, Naive Bayes, News portals, RBF kernels, Robustness issues, Term weighting, Test collection, Text categorization, Training dataset, Turkish, Word stemming, Feature extraction, Intelligent systems, Text processing