BUIR Repository :: Browsing by Subject "Text processing"

Open Access
Automatic categorization of Ottoman literary texts by poet and time period
(Springer, London, 2012) Can, Ethem F.; Can, Fazlı; Duygulu, Pınar; Kalpaklı, Mehmet
Millions of manuscripts and printed texts are available in the Ottoman language. The automatic categorization of Ottoman texts would make these documents much more accessible in various applications ranging from historical investigations to literary analyses. In this work, we use transcribed versions of Ottoman literary texts in the Latin alphabet and show that it is possible to develop effective Automatic Text Categorization techniques that can be applied to the Ottoman language. For this purpose, we use two fundamentally different machine learning methods, Naïve Bayes and Support Vector Machines, and employ four style markers: most frequent words, token lengths, two-word collocations, and type lengths. In the experiments, we use the collected works (divans) of ten different poets: two poets from each of five different hundred-year periods ranging from the 15th to the 19th century. The experimental results show that it is possible to obtain highly accurate classifications in terms of poet and time period. By using statistical analysis, we are able to recommend which style marker and machine learning method should be used in future studies. © 2012 Springer-Verlag London Limited.
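The four style markers named in the abstract above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization, treats a "token" as a word occurrence and a "type" as a distinct word, and uses a hypothetical function name (`style_markers`); none of these details are taken from the paper, and the classifier stage is not shown.

```python
from collections import Counter


def style_markers(text: str, top_k: int = 3) -> dict:
    """Compute four simple style markers from a whitespace-tokenized text."""
    tokens = text.split()   # every word occurrence
    types = set(tokens)     # distinct words only
    return {
        # Most frequent words, highest count first.
        "most_frequent_words": [w for w, _ in Counter(tokens).most_common(top_k)],
        # Histogram of word lengths over all occurrences.
        "token_lengths": Counter(len(t) for t in tokens),
        # Histogram of word lengths over distinct words.
        "type_lengths": Counter(len(t) for t in types),
        # Counts of adjacent word pairs (two-word collocations).
        "collocations": Counter(zip(tokens, tokens[1:])),
    }


# Toy ASCII input, not real Ottoman text.
markers = style_markers("gul bulbul gul bahar gul bulbul")
print(markers["most_frequent_words"])
print(markers["collocations"][("gul", "bulbul")])
```

Each marker yields a feature vector per document; which marker works best for a given classifier is the question the study addresses statistically.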
Open Access
A call center text mining approach (Çağrı merkezi metin madenciliği yaklaşımı)
(IEEE, 2017-05) Yiğit, İ. O.; Ateş, A. F.; Güvercin, Mehmet; Ferhatosmanoğlu, Hakan; Gedik, Buğra
Today, the ability to convert call center recordings from speech to text makes it possible to apply text mining methods to call transcripts. This study aims to use call transcripts to evaluate the content of a call in terms of sentiment (positive/negative) and to measure customer satisfaction and customer representative performance. In the study, new features were extracted from the call transcripts using text mining methods. Using the features obtained from the texts, prediction models were built with classification and regression methods to evaluate the content of call recordings. The prediction models produced by this study are intended for use in call centers within Türk Telekom.
Open Access
Chat mining for gender prediction
(Springer, 2006-10) Küçükyılmaz, Tayfun; Cambazoğlu, B. Barla; Aykanat, Cevdet; Can, Fazlı
The aim of this paper is to investigate the feasibility of predicting the gender of a text document's author using linguistic evidence. For this purpose, term- and style-based classification techniques are evaluated over a large collection of chat messages. Prediction accuracies up to 84.2% are achieved, illustrating the applicability of these techniques to gender prediction. Moreover, the reverse problem is exploited, and the effect of gender on the writing style is discussed. © Springer-Verlag Berlin Heidelberg 2006.
Open Access
Developing a text categorization template for Turkish news portals
(IEEE, 2011) Toraman, Çağrı; Can, Fazlı; Koçberber, Seyit
In news portals, text category information is needed for news presentation. However, for many news stories the category information is unavailable, incorrectly assigned, or too generic. This makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted, difficult process that involves decisions regarding the tuning of several parameters, term weighting, word stemming, word stopping, and feature selection. In this study, we aim to find a categorization setup that will provide highly accurate results in ATC for Turkish news portals. We also examine some other aspects, such as the effects of training dataset size and robustness issues. Two Turkish test collections with different characteristics are created using Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). Our results recommend a text categorization template for Turkish news portals and provide some future research pointers. © 2011 IEEE.
Open Access
Effect of inverted index partitioning schemes on performance of query processing in parallel text retrieval systems
(Springer, 2006-11) Cambazoğlu, B. Barla; Çatal, A.; Aykanat, Cevdet
Shared-nothing, parallel text retrieval systems require an inverted index, representing a document collection, to be partitioned among a number of processors. In general, the index can be partitioned based on either the terms or the documents in the collection, and the way the partitioning is done greatly affects the query processing performance of the parallel system. In this work, we investigate the effect of these two index partitioning schemes on query processing. We conduct experiments on a 32-node PC cluster, considering the case where the index is stored entirely on disk. Performance results are reported for a large (30 GB) document collection using an MPI-based parallel query processing implementation. © Springer-Verlag Berlin Heidelberg 2006.
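The difference between the two partitioning schemes in the abstract above can be sketched in a few lines of Python. This is a toy in-memory illustration only: the function names, the CRC-based term assignment, and the `doc_id % n` document assignment are assumptions made for the example, not the schemes evaluated in the paper.

```python
import zlib
from collections import defaultdict


def build_index(docs: dict) -> dict:
    """Build a term -> sorted list of doc ids inverted index."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].split()):
            index[term].append(doc_id)
    return dict(index)


def partition_by_term(index: dict, n: int) -> list:
    """Term-based: each processor holds the full posting list of some terms."""
    parts = [dict() for _ in range(n)]
    for term, postings in index.items():
        owner = zlib.crc32(term.encode("utf-8")) % n  # deterministic hash
        parts[owner][term] = list(postings)
    return parts


def partition_by_document(index: dict, n: int) -> list:
    """Document-based: each processor indexes only its own documents."""
    parts = [defaultdict(list) for _ in range(n)]
    for term, postings in index.items():
        for doc_id in postings:
            parts[doc_id % n][term].append(doc_id)
    return [dict(p) for p in parts]


docs = {0: "parallel text retrieval", 1: "text index",
        2: "parallel index", 3: "text"}
index = build_index(docs)
print(partition_by_term(index, 2))
print(partition_by_document(index, 2))
```

Under term-based partitioning a query touches only the processors that own its terms, while under document-based partitioning every processor answers every query over its local documents and the partial results are merged.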
Open Access
Effective early termination techniques for text similarity join operator
(Springer, Berlin, Heidelberg, 2005) Özalp, S. A.; Ulusoy, Özgür
The text similarity join operator joins two relations if their join attributes are textually similar to each other, and it has a variety of application domains, including integration and querying of data from heterogeneous resources, cleansing of data, and mining of data. Although the text similarity join operator is widely used, its processing is expensive due to the huge number of similarity computations performed. In this paper, we incorporate some shortcut evaluation techniques from the Information Retrieval domain, namely the Harman, quit, continue, and maximal similarity filter heuristics, into the previously proposed text similarity join algorithms to reduce the number of similarity computations needed during the join operation. We experimentally evaluate the original and the heuristic-based similarity join algorithms using real data obtained from the DBLP Bibliography database, and observe performance improvements with the continue and maximal similarity filter heuristics. © Springer-Verlag Berlin Heidelberg 2005.
Open Access
Ensemble pruning for text categorization based on data partitioning
(Springer, Berlin, Heidelberg, 2011) Toraman, Çağrı; Can, Fazlı
Ensemble methods can improve effectiveness in text categorization. Due to the computational cost of ensemble approaches, there is a need for pruning ensembles. In this work, we study ensemble pruning based on data partitioning. We use a rank-based pruning approach: base classifiers are ranked and pruned according to their accuracies on a separate validation set. We employ four data partitioning methods with four machine learning categorization algorithms. We mainly aim to examine ensemble pruning in text categorization. We conduct experiments on two text collections: Reuters-21578 and BilCat-TRT. We show that we can prune 90% of ensemble members with almost no decrease in accuracy. We demonstrate that it is possible to increase the accuracy of traditional ensembling with ensemble pruning. © 2011 Springer-Verlag Berlin Heidelberg.
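The ranking step described in the abstract above (rank base classifiers by validation accuracy, then keep only the best) can be sketched as follows. This is a minimal illustration under stated assumptions: classifiers are modeled as plain Python callables, the names `prune_by_rank` and `majority_vote` are hypothetical, and the data partitioning methods used to train the base classifiers are not shown.

```python
from collections import Counter
from typing import Callable, List, Sequence

Classifier = Callable[[str], str]  # maps a document to a category label


def accuracy(clf: Classifier, docs: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of validation documents the classifier labels correctly."""
    correct = sum(1 for d, y in zip(docs, labels) if clf(d) == y)
    return correct / len(docs)


def prune_by_rank(classifiers: List[Classifier], val_docs: Sequence[str],
                  val_labels: Sequence[str],
                  keep_fraction: float = 0.1) -> List[Classifier]:
    """Rank by validation accuracy and keep the top fraction (at least one)."""
    ranked = sorted(classifiers,
                    key=lambda c: accuracy(c, val_docs, val_labels),
                    reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]


def majority_vote(classifiers: List[Classifier], doc: str) -> str:
    """Combine the surviving classifiers by simple majority voting."""
    votes = Counter(clf(doc) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

With `keep_fraction=0.1`, nine out of ten ensemble members are discarded, which mirrors the 90% pruning level mentioned in the abstract; the accuracy that results depends entirely on the base classifiers and data.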
Open Access
First large-scale information retrieval experiments on Turkish texts
(ACM, 2006-08) Can, Fazlı; Koçberber, Seyit; Balcık, Erman; Kaynak, Cihan; Öcalan, H. Çağdaş; Vursavaş, Onur M.
We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, which has been created for this study, contains 95.5 million words, 408,305 documents, and 72 ad hoc queries, and has a size of about 800 MB. All documents come from the Turkish newspaper Milliyet. We implement and apply simple to sophisticated stemmers and various query-document matching functions and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
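The prefix-truncation stemmer mentioned in the abstract above is simple enough to sketch directly. The function names and the whitespace tokenization are assumptions for illustration; case folding and punctuation handling for Turkish are deliberately left out.

```python
from typing import List


def truncate_stem(word: str, prefix_len: int = 5) -> str:
    """Keep only the first prefix_len characters of a word."""
    return word[:prefix_len]


def stem_tokens(text: str, prefix_len: int = 5) -> List[str]:
    """Apply truncation stemming to each whitespace-separated token."""
    return [truncate_stem(tok, prefix_len) for tok in text.split()]


# Inflected forms of "kitap" (book) collapse to the same index term.
print(stem_tokens("kitaplardan kitaplar ev"))
```

Because Turkish is agglutinative and suffixing, many inflected forms share their first few characters with the root, which is why such a crude stemmer can still be usable for retrieval.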
Open Access
Information retrieval on Turkish texts
(John Wiley & Sons, Inc., 2008-02) Can, F.; Kocberber, S.; Balcik, E.; Kaynak, C.; Ocalan, H. C.; Vursavas, O. M.
In this study, we investigate information retrieval (IR) on Turkish texts using a large-scale test collection that contains 408,305 documents and 72 ad hoc queries. We examine the effects of several stemming options and query-document matching functions on retrieval performance. We show that a simple word truncation approach, a word truncation approach that uses language-dependent corpus statistics, and an elaborate lemmatizer-based stemmer provide similar retrieval effectiveness in Turkish IR. We investigate the effects of a range of search conditions on retrieval performance; these include scalability issues, query and document length effects, and the use of a stop-word list in indexing. © 2007 Wiley Periodicals, Inc.
Open Access
Lexical cohesion based topic modeling for summarization
(Springer, 2008-02) Ercan, Gönenç; Çiçekli, İlyas
In this paper, we attack the problem of forming extracts for text summarization. Forming extracts involves selecting the most representative and significant sentences from the text. Our method takes advantage of the lexical cohesion structure in the text in order to evaluate the significance of sentences. Lexical chains have been used in summarization research to analyze the lexical cohesion structure and represent topics in a text. Our algorithm represents topics by sets of co-located lexical chains to take advantage of more lexical cohesion clues. It segments the text with respect to each topic and finds the most important topic segments. Our summarization algorithm has achieved better results than some other lexical chain based algorithms. © 2008 Springer-Verlag Berlin Heidelberg.
Open Access
Performance of query processing implementations in ranking-based text retrieval systems using inverted indices
(Elsevier Ltd, 2006-07) Cambazoglu, B. B.; Aykanat, Cevdet
Similarity calculations and document ranking form the computationally expensive parts of query processing in ranking-based text retrieval. In this work, for these calculations, 11 alternative implementation techniques are presented under four different categories, and their asymptotic time and space complexities are investigated. To our knowledge, six of these techniques have not been discussed in any previous publication. Furthermore, analytical experiments are carried out on a 30 GB document collection to evaluate the practical performance of different implementations in terms of query processing time and space consumption. Advantages and disadvantages of each technique are illustrated under different querying scenarios, and several experiments that investigate the scalability of the implementations are presented. © 2005 Elsevier Ltd. All rights reserved.
Open Access
Query expansion with terms selected using lexical cohesion analysis of documents
(Elsevier Ltd, 2007-07) Vechtomova, O.; Karamuftuoglu, M.
We present new methods of query expansion using terms that form lexical cohesive links between the contexts of distinct query terms in documents (i.e., words surrounding the query terms in text). The link-forming terms (link-terms) and short snippets of text surrounding them are evaluated in both interactive and automatic query expansion (QE). We explore the effectiveness of snippets in providing context in interactive query expansion, compare query expansion from snippets vs. whole documents, and query expansion following snippet selection vs. full document relevance judgements. The evaluation, conducted on the HARD track data of TREC 2005, suggests that there are considerable advantages in using link-terms and their surrounding short text snippets in QE compared to terms selected from the full texts of documents. © 2006 Elsevier Ltd. All rights reserved.
Open Access
Squeezing the ensemble pruning: Faster and more accurate categorization for news portals
(Springer, 2012) Toraman, Çağrı; Can, Fazlı
Recent studies show that ensemble pruning works as effectively as a traditional ensemble of classifiers (EoC). In this study, we analyze how ensemble pruning can improve text categorization efficiency in time-critical real-life applications such as news portals. The two most crucial phases of text categorization are training classifiers and assigning labels to new documents, but the latter is more important for the efficiency of such applications. We conduct experiments on ensemble pruning-based news article categorization to measure its accuracy and time cost. The results show that our heuristics reduce the time cost of the second phase. Also, we can make a trade-off between accuracy and time cost to improve both of them with appropriate pruning degrees. © 2012 Springer-Verlag Berlin Heidelberg.
Open Access
Summarization of documentaries
(Springer, Dordrecht, 2010) Demirtas, K.; Çiçekli, İlyas; Cicekli, N.K.
Video summarization algorithms present condensed versions of a full-length video by identifying the most significant parts of the video. In this paper, we propose an automatic video summarization method using the subtitles of videos and text summarization techniques. We identify significant sentences in the subtitles of a video by using text summarization techniques, and then we compose a video summary by finding the video parts corresponding to these summary sentences. © 2011 Springer Science+Business Media B.V.
Open Access
Using lexical chains for keyword extraction
(Elsevier Ltd, 2007-11) Ercan, G.; Cicekli, I.
Keywords can be considered as condensed versions of documents and short forms of their summaries. In this paper, the problem of automatic extraction of keywords from documents is treated as a supervised learning task. A lexical chain holds a set of semantically related words of a text, and it can be said that a lexical chain represents the semantic content of a portion of the text. Although lexical chains have been extensively used in text summarization, their use for the keyword extraction problem has not been fully investigated. In this paper, a keyword extraction technique that uses lexical chains is described, and encouraging results are obtained. © 2007 Elsevier Ltd. All rights reserved.
