Text categorization and ensemble pruning in Turkish news portals

Toraman, Çağrı

buir.advisor	Can, Fazlı
dc.contributor.author	Toraman, Çağrı
dc.date.accessioned	2016-01-08T18:15:39Z
dc.date.available	2016-01-08T18:15:39Z
dc.date.issued	2011
dc.description	Cataloged from PDF version of article.	en_US
dc.description	Includes bibliographical references (leaves 53-60).	en_US
dc.description.abstract	In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned, or too generic. This makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted, difficult process that involves decisions regarding the tuning of several parameters, term weighting, word stemming, stopword removal, and feature selection. It is important to find a categorization setup that provides highly accurate results in ATC for Turkish news portals. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). The results yield a recommended text categorization template for Turkish news portals. Building on this recommended template, ensemble learning methods are applied to increase effectiveness. Since they impose a heavy computational workload, ensemble pruning strategies are developed. Data partitioning ensembles are constructed, and rank-based ensemble pruning is applied with several machine learning categorization algorithms. The aim is to answer the following questions: (1) How much data can we prune using data partitioning in the text categorization domain? (2) Which partitioning and categorization methods are more suitable for ensemble pruning? (3) How do English and Turkish differ in ensemble pruning? (4) Can we increase effectiveness with ensemble pruning in text categorization? Experiments are conducted on two text collections: Reuters-21578 and BilCat-TRT. 90% of ensemble members can be pruned with almost no decrease in accuracy.	en_US
dc.description.statementofresponsibility	Toraman, Çağrı	en_US
dc.format.extent	xi, 60 leaves	en_US
dc.identifier.itemid	B130527
dc.identifier.uri	http://hdl.handle.net/11693/15254
dc.language.iso	English	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Text Categorization	en_US
dc.subject	News Portal	en_US
dc.subject	Ensemble Learning	en_US
dc.subject	Ensemble Pruning	en_US
dc.subject.lcc	P128 .T67 2011	en_US
dc.subject.lcsh	Categorization (Linguistics)	en_US
dc.subject.lcsh	Conversation analysis.	en_US
dc.subject.lcsh	Information storage and retrieval systems.	en_US
dc.subject.lcsh	Information retrieval.	en_US
dc.title	Text categorization and ensemble pruning in Turkish news portals	en_US
dc.type	Thesis	en_US
thesis.degree.discipline	Computer Engineering
thesis.degree.grantor	Bilkent University
thesis.degree.level	Master's
thesis.degree.name	MS (Master of Science)
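
The abstract describes building data partitioning ensembles and then applying rank-based ensemble pruning. The following is a minimal Python sketch of that general idea, not the thesis's implementation. Several things here are assumptions made only for illustration: the toy word-count classifier (the thesis uses C4.5, KNN, Naive Bayes, and SVM), disjoint partitioning as the partitioning scheme, validation accuracy as the ranking criterion, and all function and parameter names (`build_ensemble`, `prune`, `vote`, `keep_fraction`).

```python
import random
from collections import Counter, defaultdict

def train_member(docs):
    """Toy stand-in classifier (assumption): per-category word counts."""
    counts = defaultdict(Counter)
    for text, label in docs:
        counts[label].update(text.split())
    return counts

def predict(member, text):
    """Pick the category whose training words overlap most with the text."""
    words = text.split()
    # sorted() keeps tie-breaking deterministic across runs
    return max(sorted(member), key=lambda c: sum(member[c][w] for w in words))

def accuracy(member, docs):
    return sum(predict(member, t) == y for t, y in docs) / len(docs)

def build_ensemble(train, n_members, seed=0):
    """Disjoint data partitioning: each member is trained on one slice.

    Assumes n_members <= len(train) so that no slice is empty.
    """
    docs = train[:]
    random.Random(seed).shuffle(docs)
    return [train_member(docs[i::n_members]) for i in range(n_members)]

def prune(ensemble, validation, keep_fraction):
    """Rank members by validation accuracy and keep only the top fraction."""
    ranked = sorted(ensemble, key=lambda m: accuracy(m, validation), reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]

def vote(ensemble, text):
    """Majority vote over the remaining members' predictions."""
    return Counter(predict(m, text) for m in ensemble).most_common(1)[0][0]

# Hypothetical toy data, not drawn from Reuters-21578 or BilCat-TRT.
train = [("goal match team", "sports"), ("election vote party", "politics")] * 10
validation = [("team goal", "sports"), ("party vote", "politics")]

ensemble = build_ensemble(train, n_members=10)
pruned = prune(ensemble, validation, keep_fraction=0.1)  # prune 90% of members
print(len(ensemble), len(pruned), vote(pruned, "match goal"))
```

Setting `keep_fraction=0.1` mirrors the 90% pruning level mentioned in the abstract. Whether accuracy survives at that level depends on the collection and the base classifier, which is the question the thesis experiments address.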

Files

Original bundle

Name: 0006008.pdf
Size: 891.47 KB
Format: Adobe Portable Document Format

Collections

Graduate School of Engineering and Science