Text categorization and ensemble pruning in Turkish news portals
In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned, or too generic. This makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted, difficult process that involves decisions regarding the tuning of several parameters, term weighting, word stemming, word stopping, and feature selection. It is important to find a categorization setup that provides highly accurate ATC results for Turkish news portals. Two Turkish test collections with different characteristics are created using the Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). Based on the results, a text categorization template is recommended for Turkish news portals. Building on the recommended template, ensemble learning methods are applied to increase effectiveness. Since ensembles impose a heavy computational workload, ensemble pruning strategies are developed. Data partitioning ensembles are constructed, and rank-based ensemble pruning is applied with several machine learning categorization algorithms. The aim is to answer the following questions: (1) How much data can we prune using data partitioning in the text categorization domain? (2) Which partitioning and categorization methods are more suitable for ensemble pruning? (3) How do English and Turkish differ in ensemble pruning? (4) Can we increase effectiveness with ensemble pruning in text categorization? Experiments are conducted on two text collections: Reuters-21578 and BilCat-TRT. The experiments show that 90% of ensemble members can be pruned with almost no decrease in accuracy.
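The idea of a data partitioning ensemble with rank-based pruning can be illustrated with a minimal Python sketch. It is not the setup used in the experiments: the toy documents, the tiny Naive Bayes learner, the round-robin disjoint partitioning, the use of validation accuracy as the ranking criterion, and all function names (`train_nb`, `build_ensemble`, `prune_by_rank`, `vote`) are assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a tiny multinomial Naive Bayes model on (tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest add-one-smoothed log score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):  # sorted, so ties break deterministically
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def build_ensemble(train, k):
    """Split the training set into k disjoint partitions, one model per partition."""
    return [train_nb(train[i::k]) for i in range(k)]

def accuracy(model, docs):
    return sum(predict_nb(model, tokens) == label for tokens, label in docs) / len(docs)

def prune_by_rank(ensemble, validation, keep_fraction):
    """Rank members by validation accuracy (assumed criterion) and keep the top share."""
    ranked = sorted(ensemble, key=lambda m: accuracy(m, validation), reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]

def vote(ensemble, tokens):
    """Majority vote over the members' predictions."""
    votes = Counter(predict_nb(m, tokens) for m in ensemble)
    return votes.most_common(1)[0][0]

def ensemble_accuracy(ensemble, docs):
    return sum(vote(ensemble, tokens) == label for tokens, label in docs) / len(docs)

# Hypothetical toy data: two categories with four characteristic words each.
SPORTS = ["match", "goal", "team", "league"]
ECONOMY = ["bank", "market", "rate", "stock"]

def make_docs(n):
    docs = []
    for j in range(n):
        words, label = (SPORTS, "sports") if j % 2 == 0 else (ECONOMY, "economy")
        docs.append(([words[(j // 2) % 4], words[(j // 2 + 1) % 4]], label))
    return docs

train = make_docs(20)
validation = [(["league"], "sports"), (["match"], "sports"),
              (["stock"], "economy"), (["market"], "economy")]

full = build_ensemble(train, k=5)
pruned = prune_by_rank(full, validation, keep_fraction=0.4)

print("members before/after pruning:", len(full), len(pruned))
print("accuracy before/after pruning:",
      ensemble_accuracy(full, validation), ensemble_accuracy(pruned, validation))
```

In this sketch the same validation set is used both to rank the members and to report accuracy, which is acceptable only for a toy illustration; a real evaluation would measure the pruned ensemble on a separate held-out test set.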