Developing a text categorization template for Turkish news portals
Date
2011
Source Title
2011 International Symposium on Innovations in Intelligent Systems and Applications
Publisher
IEEE
Pages
379 - 383
Language
English
Type
Conference Paper
Item Usage Stats
250 views
592 downloads
Abstract
In news portals, category information is needed to present news stories. However, for many stories the category information is unavailable, incorrectly assigned, or too generic, which makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a difficult, multifaceted process that involves decisions about parameter tuning, term weighting, word stemming, stop-word removal, and feature selection. In this study we aim to find a categorization setup that provides highly accurate ATC results for Turkish news portals. We also examine other aspects, such as the effect of training dataset size and robustness issues. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). Our results recommend a text categorization template for Turkish news portals and provide some pointers for future research. © 2011 IEEE.
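The kind of pipeline the abstract describes (stop-word removal, stemming, then a classifier) can be illustrated with a minimal sketch. This is not the paper's setup: the stop-word list, the fixed-prefix truncation used in place of a real stemmer, the toy headlines, the category labels, and the names `preprocess` and `NaiveBayes` are all assumptions made for illustration. Only Naive Bayes, one of the four methods the abstract names, is shown, in its multinomial form with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy stop-word list; a real setup would use a full Turkish list.
STOP_WORDS = {"ve", "bir", "bu", "ile", "için", "de", "da"}


def preprocess(text, prefix_len=5):
    """Lowercase, drop stop words, and keep a fixed-length prefix of each word.

    Prefix truncation is an assumed, crude stand-in for a real stemmer.
    Note that str.lower() is not Turkish-aware (dotted/dotless i), so the
    toy data below is already lowercase.
    """
    tokens = text.lower().split()
    return [t[:prefix_len] for t in tokens if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.term_counts = defaultdict(Counter)   # term frequencies per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            terms = preprocess(doc)
            self.term_counts[label].update(terms)
            self.vocab.update(terms)
        return self

    def predict(self, doc):
        terms = [t for t in preprocess(doc) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            score = math.log(count / n_docs)      # log prior
            for t in terms:                       # smoothed log likelihoods
                score += math.log(
                    (self.term_counts[label][t] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy headlines and labels, for illustration only.
train_docs = [
    "takım maçı kazandı ve gol attı",
    "futbol ligi bu hafta başlıyor",
    "borsa bugün yükseldi",
    "merkez bankası faiz kararı açıkladı",
]
train_labels = ["spor", "spor", "ekonomi", "ekonomi"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("takım gol attı"))
print(model.predict("faiz kararı"))
```

Term weighting and feature selection, which the abstract also lists, are left out here to keep the sketch short; they would sit between `preprocess` and the classifier.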
Keywords
Turkish news
Automated text categorization
Classification methods
Naive Bayes
news portals
RBF kernels
Robustness issues
Term weighting
Test Collection
Text categorization
Training dataset
Turkishs
Word-stemming
Feature extraction
Intelligent systems
Text processing