Text categorization and ensemble pruning in Turkish news portals

buir.advisor: Can, Fazlı
dc.contributor.author: Toraman, Çağrı
dc.date.accessioned: 2016-01-08T18:15:39Z
dc.date.available: 2016-01-08T18:15:39Z
dc.date.issued: 2011
dc.description: Cataloged from PDF version of article. [en_US]
dc.description: Includes bibliographical references leaves 53-60. [en_US]
dc.description.abstract: In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned, or too generic. This makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted, difficult process that involves decisions regarding the tuning of several parameters, term weighting, word stemming, word stopping, and feature selection. It is important to find a categorization setup that will provide highly accurate results in ATC for Turkish news portals. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). The results recommend a text categorization template for Turkish news portals. Based on the recommended template, ensemble learning methods are applied to increase effectiveness. Since these methods impose a heavy computational workload, ensemble pruning strategies are developed. Data partitioning ensembles are constructed and rank-based ensemble pruning is applied with several machine learning categorization algorithms. The aim is to answer the following questions: (1) How much data can we prune using data partitioning on the text categorization domain? (2) Which partitioning and categorization methods are more suitable for ensemble pruning? (3) How do English and Turkish differ in ensemble pruning? (4) Can we increase effectiveness with ensemble pruning in text categorization? Experiments are conducted on two text collections: Reuters-21578 and BilCat-TRT. 90% of ensemble members can be pruned with almost no decrease in accuracy. [en_US]
dc.description.provenance: Made available in DSpace on 2016-01-08T18:15:39Z (GMT). No. of bitstreams: 1 0006008.pdf: 912864 bytes, checksum: 95f492c1632dbbc8a888e87181677ef4 (MD5) [en]
dc.description.statementofresponsibility: Toraman, Çağrı [en_US]
dc.format.extent: xi, 60 leaves [en_US]
dc.identifier.itemid: B130527
dc.identifier.uri: http://hdl.handle.net/11693/15254
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Text Categorization [en_US]
dc.subject: News Portal [en_US]
dc.subject: Ensemble Learning [en_US]
dc.subject: Ensemble Pruning [en_US]
dc.subject.lcc: P128 .T67 2011 [en_US]
dc.subject.lcsh: Categorization (Linguistics) [en_US]
dc.subject.lcsh: Conversation analysis. [en_US]
dc.subject.lcsh: Information storage and retrieval systems. [en_US]
dc.subject.lcsh: Information retrieval. [en_US]
dc.title: Text categorization and ensemble pruning in Turkish news portals [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Master's
thesis.degree.name: MS (Master of Science)

Files

Original bundle

Name: 0006008.pdf
Size: 891.47 KB
Format: Adobe Portable Document Format