Browsing by Subject "Text categorization"

Open Access
Application of K-NN and FPTC based text categorization algorithms to Turkish news reports
(2001) İlhan, Ufuk
New technological developments, such as easy access to the Internet, optical character readers, high-speed networks and inexpensive massive storage facilities, have resulted in a dramatic increase in the availability of on-line text: newspaper articles, incoming (electronic) mail, technical reports, etc. The enormous growth of on-line information has led to a comparable growth in the need for methods that help users organize such information. Text categorization may be the remedy for this increased need for advanced techniques. Text categorization is the classification of units of natural language text with respect to a set of pre-existing categories. Categorization of documents is challenging, as the number of discriminating words can be very large. This thesis presents the compilation of a Turkish dataset, called Anadolu Agency Newsgroup, for the study of text categorization. Turkish is an agglutinative language in which words contain no direct indication of where the morpheme boundaries are; furthermore, morphemes take a shape dependent on the morphological and phonological context. In Turkish, the process of adding one suffix to another can result in a relatively long word, and a single Turkish word can give rise to a very large number of variants. Due to this complex morphological structure, Turkish requires text processing techniques different from those used for English and similar languages. Therefore, besides converting all words to lower case and removing punctuation marks, some preliminary work is required, such as stemming, removal of stopwords and formation of a keyword list.
This thesis also presents the evaluation and comparison of the well-known k-NN classification algorithm and a variant of k-NN called the Feature Projection Text Categorization (FPTC) algorithm. The k-NN classifier is an instance-based learning method. It computes the similarity between the test instance and each training instance and, considering the k top-ranking nearest instances, finds the category that is most similar to the input. The FPTC algorithm is based on the idea of representing training instances as their projections on each feature dimension. If the value of a training instance is missing for a feature, that instance is not stored on that feature. Experiments show that the FPTC algorithm achieves accuracy comparable to that of the k-NN algorithm; furthermore, FPTC significantly outperforms k-NN in time efficiency.
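The k-NN step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the bag-of-words representation, the cosine similarity measure, the similarity-weighted vote, and all function names are assumptions chosen for illustration, and stemming and stopword removal are omitted.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Lowercase the text, split on whitespace, and count terms (bag of words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(train, text, k=3):
    """Predict a category for `text` from `train`, a list of (text, category) pairs.

    Every training instance is scored against the query, the k most similar
    ones are kept, and each of them votes for its category with a weight
    equal to its similarity (an assumed voting scheme).
    """
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(doc)), category) for doc, category in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, category in scored[:k]:
        votes[category] += similarity
    return max(votes, key=votes.get)


# Hypothetical toy training set, for illustration only.
TRAIN = [
    ("goal match team league", "sports"),
    ("team coach match win", "sports"),
    ("stock market bank rate", "economy"),
    ("bank inflation rate market", "economy"),
]
```

Because every training instance is compared with the query at classification time, the cost of a prediction grows with the size of the training set, which is the efficiency concern that motivates variants such as FPTC.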
Open Access
Authorship attribution: performance of various features and classification methods
(IEEE, 2007-11) Bozkurt, İlker Nadi; Bağlıoğlu, Özgür; Uyar, Erkan
Authorship attribution is the process of determining the writer of a document. The literature offers many classification techniques for this task. In this paper we explore information retrieval methods such as the tf-idf structure with support vector machines, and parametric and nonparametric methods with supervised and unsupervised (clustering) classification techniques in authorship attribution. We performed various experiments with articles gathered from the Turkish newspaper Milliyet. We performed experiments on different features extracted from these texts with different classifiers, and combined these results to improve our success rates. We identified which classifiers give satisfactory results on which feature sets. According to the experiments, the success rates change dramatically with different combinations; however, the best among them are the support vector classifier with bag of words and the Gaussian classifier with function words. ©2007 IEEE.
Open Access
Developing a text categorization template for Turkish news portals
(IEEE, 2011) Toraman, Çağrı; Can, Fazlı; Koçberber, Seyit
In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned or too generic. This makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted, difficult process that involves decisions regarding the tuning of several parameters, term weighting, word stemming, word stopping, and feature selection. In this study we aim to find a categorization setup that will provide highly accurate results in ATC for Turkish news portals. We also examine some other aspects, such as the effects of training dataset size and robustness issues. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and rbf kernels). Our results recommend a text categorization template for Turkish news portals and provide some future research pointers. © 2011 IEEE.
Open Access
Ensemble pruning for text categorization based on data partitioning
(Springer, Berlin, Heidelberg, 2011) Toraman, Çağrı; Can, Fazlı
Ensemble methods can improve effectiveness in text categorization. Due to the computation cost of ensemble approaches, there is a need for pruning ensembles. In this work we study ensemble pruning based on data partitioning. We use a rank-based pruning approach: base classifiers are ranked and pruned according to their accuracies on a separate validation set. We employ four data partitioning methods with four machine learning categorization algorithms. We mainly aim to examine ensemble pruning in text categorization. We conduct experiments on two text collections: Reuters-21578 and BilCat-TRT. We show that we can prune 90% of ensemble members with almost no decrease in accuracy. We demonstrate that it is possible to increase the accuracy of traditional ensembling with ensemble pruning. © 2011 Springer-Verlag Berlin Heidelberg.
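The rank-based pruning idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: base classifiers are modelled as plain Python callables, the pruned ensemble decides by unweighted majority vote, and the names `prune_ensemble`, `keep_fraction`, and the toy classifiers are hypothetical.

```python
from collections import Counter


def accuracy(classifier, validation):
    """Fraction of (text, label) validation pairs the classifier labels correctly."""
    correct = sum(1 for text, label in validation if classifier(text) == label)
    return correct / len(validation)


def prune_ensemble(classifiers, validation, keep_fraction=0.1):
    """Rank base classifiers by validation accuracy and keep the top fraction."""
    ranked = sorted(classifiers, key=lambda c: accuracy(c, validation), reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))  # always keep at least one
    return ranked[:keep]


def vote(ensemble, text):
    """Majority vote of the (pruned) ensemble for a single document."""
    return Counter(classifier(text) for classifier in ensemble).most_common(1)[0][0]


# Hypothetical toy base classifiers and validation set, for illustration only.
def good(text):
    return "sports" if "match" in text else "economy"


def bad(text):
    return "economy"


def worse(text):
    return "sports" if "bank" in text else "economy"


VALIDATION = [
    ("match report", "sports"),
    ("bank rates", "economy"),
    ("cup match", "sports"),
    ("market news", "economy"),
]

pruned = prune_ensemble([worse, bad, good], VALIDATION, keep_fraction=1 / 3)
```

With `keep_fraction=0.1`, the same function would discard 90% of the members, the pruning degree reported in the abstract; only the retained classifiers need to be evaluated when labelling a new document.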
Open Access
Squeezing the ensemble pruning: Faster and more accurate categorization for news portals
(Springer, 2012) Toraman, Çağrı; Can, Fazlı
Recent studies show that ensemble pruning works as effectively as a traditional ensemble of classifiers (EoC). In this study, we analyze how ensemble pruning can improve text categorization efficiency in time-critical real-life applications such as news portals. The two most crucial phases of text categorization are training classifiers and assigning labels to new documents, but the latter is more important for the efficiency of such applications. We conduct experiments on ensemble pruning-based news article categorization to measure its accuracy and time cost. The results show that our heuristics reduce the time cost of the second phase. We can also trade off accuracy against time cost, improving both with appropriate pruning degrees. © 2012 Springer-Verlag Berlin Heidelberg.
Open Access
Text categorization using syllables and recurrent neural networks
(2017-07) Yar, Ersin
We investigate multi-class categorization of short texts. To this end, in the third chapter, we introduce highly efficient dimensionality reduction techniques suitable for online processing of high-dimensional feature vectors generated from freely-worded text. Although text processing and classification are highly important due to many applications such as emotion recognition, advertisement selection, etc., online classification and regression algorithms over text are limited due to the need for high-dimensional vectors to represent natural text inputs. We overcome such limitations by showing that randomized projections and piecewise linear models can be efficiently leveraged to significantly reduce the computational cost of feature vector extraction from tweets. We demonstrate our results over tweets collected from a real-life case study where the tweets are freely-worded and unstructured. We implement several well-known machine learning algorithms as well as novel regression methods and demonstrate that we can significantly reduce the computational complexity with insignificant change in the classification and regression performance.
Furthermore, in the fourth chapter, we introduce a simple and novel technique for short text classification based on LSTM neural networks. Our algorithm obtains two distributed representations for a short text to be used in the classification task. We derive one representation by processing the vector embeddings corresponding to the words consecutively in an LSTM structure and taking the average of the outputs produced at each time step of the network. We also take the average of the distributed representations of the words in the short text to obtain the other representation. For classification, a weighted combination of both representations is calculated. Moreover, for the first time in the literature, we propose to use syllables to better exploit the sequential nature of the data. We derive distributed representations of the syllables and feed them to an LSTM network to obtain the distributed representation for the short text. A softmax layer is used to calculate the categorical distribution at the end. Classification performance is evaluated in terms of the AUC measure. Experiments show that utilizing two distributed representations improves classification performance by 2%. Furthermore, we demonstrate that using distributed representations of syllables in short text categorization also provides performance improvements.
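The combination of the two representations described above can be sketched in a few lines. This is only an illustration under assumptions the abstract does not state: the LSTM outputs are taken as already computed, both representations are assumed to share one dimensionality, the weighting is a single hypothetical scalar `alpha`, and the LSTM itself and the trained softmax layer weights are omitted.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def combine(lstm_outputs, word_embeddings, alpha=0.5):
    """Weighted combination of the two text representations.

    lstm_outputs: one output vector per time step (assumed precomputed).
    word_embeddings: one embedding vector per word of the short text.
    alpha: hypothetical weight given to the averaged LSTM outputs.
    """
    sequence_rep = mean_vector(lstm_outputs)
    average_rep = mean_vector(word_embeddings)
    return [alpha * s + (1 - alpha) * a for s, a in zip(sequence_rep, average_rep)]


def softmax(scores):
    """Turn class scores into a categorical distribution (numerically stable)."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In a full model, the combined vector would pass through a learned linear layer to produce one score per category before the softmax is applied.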
Open Access
A study reviewing quantitative stylistic research on Turkish texts and measuring the fidelity of the translations of Benim Adım Kırmızı to the original
(Türk Kütüphaneciler Derneği, 2018) Çalışkan, Sevil; Can, Fazlı
This article aims to introduce quantitative style analysis (stylometry), an important application of computing in the humanities, and presents an original study that measures the fidelity of translations to the original. Quantitative style analysis consists of approaches that carry out various classification tasks in information and records management and that provide, in literary studies, observations that cannot be made during close reading. The article first explains, for researchers who wish to work on Turkish texts, how style analysis can be adapted to Turkish, and presents a comprehensive literature survey of the studies carried out on Turkish texts. The purposes for which style analysis is applied are examined with examples, and the topics of preprocessing and feature extraction, classification approaches, evaluation of success level, and supporting software tools are covered. The original study on the stylistic consistency between Orhan Pamuk's novel Benim Adım Kırmızı and its translations uses a new approach that examines the distribution of the novel's characters in the principal components plane. The statistically significant observations indicate that the author's style is preserved in the translations.