Browsing by Subject "Information retrieval systems"

Open Access
Algorithms for within-cluster searches using inverted files
(Springer, 2006-11) Altıngövde, İsmail Şengör; Can, Fazlı; Ulusoy, Özgür
Information retrieval over clustered document collections has two successive stages: first identifying the best-clusters and then the best-documents in these clusters that are most similar to the user query. In this paper, we assume that an inverted file over the entire document collection is used for the latter stage. We propose and evaluate algorithms for within-cluster searches, i.e., to integrate the best-clusters with the best-documents to obtain the final output including the highest ranked documents only from the best-clusters. Our experiments on a TREC collection including 210,158 documents with several query sets show that an appropriately selected integration algorithm based on the query length and system resources can significantly improve the query evaluation efficiency. © Springer-Verlag Berlin Heidelberg 2006.
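The integration step described in this abstract, keeping only highly ranked documents that belong to the best clusters, can be illustrated with a minimal sketch. It shows one straightforward way to do this and is not claimed to be any of the paper's algorithms. The function name `within_cluster_search`, the `(doc_id, weight)` posting format, and the additive scoring are all assumptions made for illustration.

```python
from collections import defaultdict


def within_cluster_search(query_terms, inverted_index, doc_cluster, best_clusters, k=10):
    """Rank documents with an inverted file, keeping only best-cluster members.

    inverted_index: dict mapping term -> list of (doc_id, weight) postings (assumed format)
    doc_cluster:    dict mapping doc_id -> cluster id
    best_clusters:  set of cluster ids selected in the first stage
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in inverted_index.get(term, []):
            # Skip postings whose document lies outside the best clusters.
            if doc_cluster[doc_id] in best_clusters:
                scores[doc_id] += weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
```

In this sketch the cluster filter is applied while traversing the posting lists; filtering after scoring all documents would give the same ranking at a different cost.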
Open Access
An archiving model for a hierarchical information storage environment
(Elsevier, 2000) Moinzadeh, K.; Berk, E.
We consider an archiving model for a database consisting of secondary and tertiary storage devices in which the query rate for a record declines as it ages. We propose a 'dynamic' archiving policy based on the number of records and the age of the records in the secondary device. We analyze the cases when the number of new records inserted in the system over time is either constant or follows a Poisson process. For both scenarios, we characterize the properties of the policy parameters and provide optimization results when the objective is to minimize the average record retrieval times. Furthermore, we propose a simple heuristic method for obtaining near-optimal policies in large databases when the record query rate declines exponentially with time. The effectiveness of the heuristic is tested via a numerical experiment. Finally, we examine the behavior of performance measures such as the average record retrieval time and the hit rate as system parameters are varied.
Open Access
Automatic Ranking of Retrieval Systems in Imperfect Environments
(ACM, 2003-07-08) Nuray, Rabia; Can, Fazlı
The empirical investigation of the effectiveness of information retrieval (IR) systems requires a test collection, a set of query topics, and a set of relevance judgments made by human assessors for each query. Previous experiments show that differences in human relevance assessments do not affect the relative performance of retrieval systems. Based on this observation, we propose and evaluate a new approach to replace the human relevance judgments by an automatic method. Ranking of retrieval systems with our methodology correlates positively and significantly with that of human-based evaluations. In the experiments, we assume a Web-like imperfect environment: the indexing information for all documents is available for ranking, but some documents may not be available for retrieval. Such conditions can be due to document deletions or network problems. Our method of simulating imperfect environments can be used for Web search engine assessment and in estimating the effects of network conditions (e.g., network unreliability) on IR system performance.
Open Access
First large-scale information retrieval experiments on Turkish texts
(ACM, 2006-08) Can, Fazlı; Koçberber, Seyit; Balcık, Erman; Kaynak, Cihan; Öcalan, H. Çağdaş; Vursavaş, Onur M.
We present the results of the first large-scale Turkish information retrieval experiments performed on a TREC-like test collection. The test bed, which has been created for this study, contains 95.5 million words, 408,305 documents, 72 ad hoc queries and has a size of about 800MB. All documents come from the Turkish newspaper Milliyet. We implement and apply simple to sophisticated stemmers and various query-document matching functions and show that truncating words at a prefix length of 5 creates an effective retrieval environment in Turkish. However, a lemmatizer-based stemmer provides significantly better effectiveness over a variety of matching functions.
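Truncating words at a prefix length of 5, as mentioned in this abstract, can be illustrated with a minimal sketch. Only the prefix length comes from the abstract; the function names `truncate_stem` and `index_terms`, the whitespace tokenization, and the sample words are assumptions made for illustration.

```python
from typing import List


def truncate_stem(word: str, prefix_len: int = 5) -> str:
    """Return the first `prefix_len` characters of `word` as its stem."""
    return word[:prefix_len]


def index_terms(text: str, prefix_len: int = 5) -> List[str]:
    """Split on whitespace and truncate every token to a fixed prefix.

    Case folding is deliberately left out: Turkish dotted and dotless i
    need locale-aware handling that plain str.lower() does not provide.
    """
    return [truncate_stem(token, prefix_len) for token in text.split()]
```

With this sketch, the inflected forms "kitaplar" and "kitaplarda" both map to the stem "kitap", so they match a query containing "kitap".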
Open Access
A graph based approach to estimating lexical cohesion
(ACM, 2008) Gürkök, Hayrettin; Karamuftuoglu, Murat; Schaal, Markus
Traditionally, information retrieval systems rank documents according to the query terms they contain. However, even if a document may contain all query terms, this does not guarantee that it is relevant to the query. The query terms can occur together in the same document, but may have been used in different contexts, expressing separate topics. Lexical cohesion is a characteristic of natural language texts, which can be used to determine whether the query terms are used in the same context in the document. In this paper we make use of a graph-based approach to capture term contexts and estimate the level of lexical cohesion in a document. To evaluate the performance of our system, we compare it against two benchmark systems using three TREC document collections. Copyright 2008 ACM.
Open Access
Incremental cluster-based retrieval using compressed cluster-skipping inverted files
(Association for Computing Machinery, 2008-06) Altingovde, I. S.; Demir, E.; Can, F.; Ulusoy, Özgür
We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size. © 2008 ACM.
Open Access
Novelty detection for topic tracking
(John Wiley & Sons, Inc., 2012) Aksoy, C.; Can, F.; Kocberber, S.
Multisource web news portals provide various advantages such as richness in news content and an opportunity to follow developments from different perspectives. However, in such environments, news variety and quantity can have an overwhelming effect. New-event detection and topic-tracking studies address this problem. They examine news streams and organize stories according to their events; however, several tracking stories of an event/topic may contain no new information (i.e., no novelty). We study the novelty detection (ND) problem on the tracking news of a particular topic. For this purpose, we build a Turkish ND test collection called BilNov-2005 and propose the usage of three ND methods: a cosine-similarity (CS)-based method, a language-model (LM)-based method, and a cover-coefficient (CC)-based method. For the LM-based ND method, we show that a simpler smoothing approach, Dirichlet smoothing, can have similar performance to a more complex smoothing approach, Shrinkage smoothing. We introduce a baseline that shows the performance of a system with random novelty decisions. In addition, a category-based threshold learning method is used for the first time in ND literature. The experimental results show that the LM-based ND method significantly outperforms the CS- and CC-based methods, and category-based threshold learning achieves promising results when compared to general threshold learning. © 2011 ASIS&T.
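The general idea behind a cosine-similarity-based novelty decision can be illustrated with a minimal sketch. The abstract does not give the details of the paper's CS-based method, so everything here is an assumption: raw term-frequency vectors, whitespace tokenization, the rule that a story is novel when its maximum similarity to earlier stories falls below a threshold, and the threshold value of 0.5.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_flags(stories, threshold=0.5):
    """Flag each story as novel (True) or not (False), in arrival order.

    A story counts as novel when no earlier story is at least
    `threshold` similar to it (assumed decision rule).
    """
    seen = []
    flags = []
    for story in stories:
        vec = Counter(story.split())
        max_sim = max((cosine(vec, prev) for prev in seen), default=0.0)
        flags.append(max_sim < threshold)
        seen.append(vec)
    return flags
```

In this sketch a single global threshold is used; learning a separate threshold per news category would correspond to the category-based threshold learning the abstract mentions.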
Open Access
Ordinal evaluation and assignment problems
(IEEE, 2010) Atmaca, Abdullah; Oruç, A. Yavuz
In many assignment problems, a set of documents such as research proposals, promotion dossiers, resumes of job applicants is assigned to a set of experts for ordinal evaluation, ranking, and classification. A desirable condition for such assignments is that every pair of documents is compared and ordered by one or more experts. This condition was modeled as an optimization problem and the number of pairs of documents was maximized for a given incidence relation between a set of documents and a set of experts using a set covering integer programming method in the literature[5]. In this paper, we use a combinatorial approach to derive lower bounds on the number of experts needed to compare all pairs of documents and describe assignments that asymptotically match these bounds. These results are not only theoretically interesting but also have practical implications in obtaining optimal assignments without using complex optimization techniques. ©2010 IEEE.
Open Access
Osmanlı arşivleri içerik-bazlı sorgulama (İBS) sistemi [Content-based retrieval (CBR) system for Ottoman archives]
(2006-04) Altıngövde, İsmail Şengör; Şaykol, Ediz; Ulusoy, Özgür; Güdükbay, Uğur; Çetin, A. Enis; Göçmen, M.
We propose a content-based retrieval (CBR) system for digital Ottoman archive documents. In this system, the symbols extracted from the documents are matched with the most similar one in the symbol library, which is created in a supervised manner. The users specify queries by marking a region on an example document and the system retrieves all documents that include the symbols found in the query region. A prototype of the system is currently available on the Web. © 2006 IEEE.
Open Access
Performance of query processing implementations in ranking-based text retrieval systems using inverted indices
(Elsevier Ltd, 2006-07) Cambazoglu, B. B.; Aykanat, Cevdet
Similarity calculations and document ranking form the computationally expensive parts of query processing in ranking-based text retrieval. In this work, for these calculations, 11 alternative implementation techniques are presented under four different categories, and their asymptotic time and space complexities are investigated. To our knowledge, six of these techniques have not been discussed in any previous publication. Furthermore, analytical experiments are carried out on a 30 GB document collection to evaluate the practical performance of different implementations in terms of query processing time and space consumption. Advantages and disadvantages of each technique are illustrated under different querying scenarios, and several experiments that investigate the scalability of the implementations are presented. © 2005 Elsevier Ltd. All rights reserved.
Open Access
Space efficient caching of query results in search engines
(IEEE, 2008-10) Özcan, Rıfat; Altıngövde, İsmail Şengör; Ulusoy, Özgür
Web search engines serve millions of query requests per day. Caching query results is one of the most crucial mechanisms to cope with such a demanding load. In this paper, we propose an efficient storage model to cache document identifiers of query results. Essentially, we first cluster queries that have common result documents. Next, for each cluster, we attempt to store those common document identifiers in a more compact manner. Experimental results reveal that the proposed storage model achieves space reduction of up to 4%. The proposed model is envisioned to improve the cache hit rate and system throughput as it allows storing more query results within a particular cache space, in return for a negligible increase in the cost of preparing the final query result page. © 2008 IEEE.
Open Access
Static index pruning in web search engines: combining term and document popularities with query views
(Association for Computing Machinery, 2012) Altingovde, I. S.; Ozcan, R.; Ulusoy, O.
Static index pruning techniques permanently remove a presumably redundant part of an inverted file, to reduce the file size and query processing time. These techniques differ in deciding which parts of an index can be removed safely; that is, without changing the top-ranked query results. As defined in the literature, the query view of a document is the set of query terms that access this particular document, that is, retrieve this document among their top results. In this paper, we first propose using query views to improve the quality of the top results compared against the original results. We incorporate query views in a number of static pruning strategies, namely term-centric, document-centric, term popularity based and document access popularity based approaches, and show that the new strategies considerably outperform their counterparts especially for the higher levels of pruning and for both disjunctive and conjunctive query processing. Additionally, we combine the notions of term and document access popularity to form new pruning strategies, and further extend these strategies with the query views. The new strategies improve the result quality especially for the conjunctive query processing, which is the default and most common search mode of a search engine. © 2012 ACM.
Open Access
Subband coding of binary textual images for document retrieval
(IEEE, 1996) Gerek, Ömer N.; Çetin, A. Enis; Tevfik, A. H.
Efficient compression of binary textual images is very important for applications such as document archiving and retrieval, digital libraries and facsimile. The basic property of a textual image is the repetition of small character images and curves inside the document. Exploiting the redundancy of these repetitions is the key step in most of the coding algorithms. In this paper, we use a similar compression method in the subband domain. Four different subband decomposition schemes are described and their performance in textual image compression is examined. Experimentally, it is found that the described methods accomplish high compression ratios and that they are suitable for fast database access and keyword search.
Open Access
Topic tracking using chronological term ranking
(2013-10) Acun, Bilge; Başpınar, Alper; Oğuz, Ekin; Saraç, M. İlker; Can, Fazlı
Topic tracking (TT) is an important component of topic detection and tracking (TDT) applications. TT algorithms aim to determine all subsequent stories of a certain topic based on a small number of initial sample stories. We propose an alternative similarity measure based on the chronological term ranking (CTR) concept to quantify the relatedness among news articles for topic tracking. The CTR approach is based on the fact that, in general, important issues are presented at the beginning of news articles. Following this observation, we modify the traditional Okapi BM25 similarity measure using the CTR concept. Using a large standard test collection, we show that our method provides a statistically significant improvement with respect to the Okapi BM25 measure. The highly successful performance indicates that the approach can be used in real applications. © 2013 Springer-Verlag London.
Open Access
Towards auto-documentary: Tracking the evolution of news stories
(ACM, 2004) Duygulu, Pınar; Pan, J.-Y.; Forsyth, D.A.
News videos constitute an important source of information for tracking and documenting important events. In these videos, news stories are often accompanied by short video shots that tend to be repeated during the course of the event. Automatic detection of such repetitions is essential for creating auto-documentaries, for alleviating the limitation of traditional textual topic detection methods. In this paper, we propose novel methods for detecting and tracking the evolution of news over time. The proposed method exploits both visual cues and textual information to summarize evolving news stories. Experiments are carried out on the TREC-VID data set consisting of 120 hours of news videos from two different channels.
Open Access
Turkish keyphrase extraction using multi-criterion ranking
(IEEE, 2009-09) Özdemir, Bahadır; Çiçekli, İlyas
Keyphrases have been extensively used for indexing and searching in databases and information retrieval systems. In addition, they provide useful information about the semantic content of a document. In this paper, we propose an algorithm for automating Turkish keyphrase extraction. We exploit several features of candidate phrases and formulate the extraction task as the problem of finding an optimal set of candidate phrases. We use multi-criterion ranking to tackle this problem. © 2009 IEEE.