Text categorization and ensemble pruning in Turkish news portals
buir.advisor | Can, Fazlı | |
dc.contributor.author | Toraman, Çağrı | |
dc.date.accessioned | 2016-01-08T18:15:39Z | |
dc.date.available | 2016-01-08T18:15:39Z | |
dc.date.issued | 2011 | |
dc.description | Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2011. | en_US |
dc.description | Thesis (Master's) -- Bilkent University, 2011. | en_US |
dc.description | Includes bibliographical references (leaves 53-60). | en_US |
dc.description.abstract | In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned, or too generic. This makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted, difficult process that involves decisions regarding the tuning of several parameters, term weighting, word stemming, word stopping, and feature selection. It is important to find a categorization setup that will provide highly accurate results in ATC for Turkish news portals. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). The results recommend a text categorization template for Turkish news portals. Based on the recommended text categorization template, ensemble learning methods are applied to increase effectiveness. Since they require a heavy computational workload, ensemble pruning strategies are developed. Data partitioning ensembles are constructed and rank-based ensemble pruning is applied with several machine learning categorization algorithms. The aim is to answer the following questions: (1) How much data can we prune using data partitioning in the text categorization domain? (2) Which partitioning and categorization methods are more suitable for ensemble pruning? (3) How do English and Turkish differ in ensemble pruning? (4) Can we increase effectiveness with ensemble pruning in text categorization? Experiments are conducted on two text collections: Reuters-21578 and BilCat-TRT. 90% of ensemble members can be pruned with almost no decrease in accuracy. | en_US |
dc.description.provenance | Made available in DSpace on 2016-01-08T18:15:39Z (GMT). No. of bitstreams: 1 0006008.pdf: 912864 bytes, checksum: 95f492c1632dbbc8a888e87181677ef4 (MD5) | en |
dc.description.statementofresponsibility | Toraman, Çağrı | en_US |
dc.format.extent | xi, 60 leaves | en_US |
dc.identifier.itemid | B130527 | |
dc.identifier.uri | http://hdl.handle.net/11693/15254 | |
dc.language.iso | English | en_US |
dc.rights | info:eu-repo/semantics/openAccess | en_US |
dc.subject | Text Categorization | en_US |
dc.subject | News Portal | en_US |
dc.subject | Ensemble Learning | en_US |
dc.subject | Ensemble Pruning | en_US |
dc.subject.lcc | P128 .T67 2011 | en_US |
dc.subject.lcsh | Categorization (Linguistics) | en_US |
dc.subject.lcsh | Conversation analysis. | en_US |
dc.subject.lcsh | Information storage and retrieval systems. | en_US |
dc.subject.lcsh | Information retrieval. | en_US |
dc.title | Text categorization and ensemble pruning in Turkish news portals | en_US |
dc.type | Thesis | en_US |
thesis.degree.discipline | Computer Engineering | |
thesis.degree.grantor | Bilkent University | |
thesis.degree.level | Master's | |
thesis.degree.name | MS (Master of Science) | |