Browsing by Subject "Search engines"
Now showing 1 - 20 of 45
Item Open Access: Adaptive time-to-live strategies for query result caching in web search engines (2012)
Alıcı, Sadiye; Altıngövde, I. Ş.; Özcan, Rıfat; Cambazoğlu, B. Barla; Ulusoy, Özgür
An important research problem that has recently started to receive attention is the freshness issue in search engine result caches. In current techniques in the literature, cached search result pages are associated with a fixed time-to-live (TTL) value in order to bound the staleness of the search results presented to users, potentially as part of a more complex cache refresh or invalidation mechanism. In this paper, we propose techniques in which the TTL values are set in an adaptive manner, on a per-query basis. Our results show that the proposed techniques reduce the fraction of stale results served by the cache and also decrease the fraction of redundant query evaluations on the search engine backend, compared to a strategy using a fixed TTL value for all queries. © 2012 Springer-Verlag Berlin Heidelberg.
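The per-query TTL idea in the item above can be made concrete with a small sketch. The Python cache below is a minimal illustration, not the strategies evaluated in the paper: it stores each query's results with its own expiry time, and the estimate_ttl heuristic (which lengthens the TTL for queries whose results rarely changed in past refreshes) is an assumption introduced purely for illustration. Here, backend stands for any callable that re-evaluates the query, and change_history for per-query volatility statistics gathered from earlier refreshes.

```python
import time

class AdaptiveTTLCache:
    """Toy result cache where each query carries its own time-to-live.

    Illustrative sketch only; the TTL heuristic below is a placeholder,
    not the adaptive strategies proposed in the paper.
    """

    def __init__(self, default_ttl=3600):
        self.default_ttl = default_ttl
        self.store = {}  # query -> (results, expiry_timestamp)

    def estimate_ttl(self, query, change_history):
        # Hypothetical heuristic: queries whose results rarely changed in past
        # refreshes get a longer TTL, volatile queries a shorter one.
        change_rate = change_history.get(query, 0.5)   # fraction of past refreshes that changed
        return self.default_ttl * (2.0 - change_rate)  # between default_ttl and 2 * default_ttl

    def get(self, query, backend, change_history):
        now = time.time()
        entry = self.store.get(query)
        if entry is not None and entry[1] > now:
            return entry[0]                      # fresh hit
        results = backend(query)                 # miss or stale: re-evaluate on the backend
        ttl = self.estimate_ttl(query, change_history)
        self.store[query] = (results, now + ttl)
        return results
```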
Item Open Access: Atatürk'ün el yazmalarının işlenmesi [Processing of Atatürk's manuscripts] (IEEE, 2010-04)
Soysal, Talha; Adıgüzel, Hande; Öktem, Alp; Haman, Alican; Can, Ethem Fatih; Duygulu, Pınar; Kalpaklı, Mehmet
In this paper, as a first step toward easy and convenient access to the manuscripts of Atatürk through a word-based search engine, the preprocessing of the digitized documents and their segmentation into lines and words are studied. Historical handwritten documents pose various difficulties, and the techniques applied to printed documents do not yield satisfactory results on them. For this reason, more advanced techniques are applied: a method based on the Hough transform [1] for line segmentation, and a method that accounts for the skewness of the writing for word segmentation. The results, obtained on 30 pages of the documents provided by Afet İnan [2], prove to be highly accurate and promising for future research. ©2010 IEEE.

Item Open Access: Authorship attribution: performance of various features and classification methods (IEEE, 2007-11)
Bozkurt, İlker Nadi; Bağlıoğlu, Özgür; Uyar, Erkan
Authorship attribution is the process of determining the writer of a document. Many classification techniques have been applied to this task in the literature. In this paper, we explore information retrieval methods such as the tf-idf representation with support vector machines, as well as parametric and nonparametric methods with supervised and unsupervised (clustering) classification techniques, for authorship attribution. We performed various experiments with articles gathered from the Turkish newspaper Milliyet, extracting different feature sets from these texts, testing them with different classifiers, and combining the results to improve our success rates. We identified which classifiers give satisfactory results on which feature sets. According to the experiments, the success rates change dramatically across combinations; the best among them are the support vector classifier with bag of words, and the Gaussian classifier with function words. ©2007 IEEE.
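To illustrate the tf-idf-with-SVM configuration mentioned in the authorship attribution item, the sketch below trains a linear support vector classifier on tf-idf features using scikit-learn. The toy texts and author labels are invented placeholders standing in for the Milliyet articles; this shows the general technique, not the paper's experimental setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy corpus: a few short texts per author (placeholders for real newspaper columns).
texts = [
    "the economy grew while inflation stayed low",
    "markets rallied as the central bank cut rates",
    "the midfield struggled but the defence held firm",
    "a late goal decided a tense derby match",
]
authors = ["econ_writer", "econ_writer", "sports_writer", "sports_writer"]

# tf-idf features (word unigrams) fed into a linear support vector classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC())
model.fit(texts, authors)

print(model.predict(["the bank raised interest rates again"]))  # expected: econ_writer
```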
Item Open Access: Automatic performance evaluation of Web search engines (Elsevier, 2004)
Can, F.; Nuray, R.; Sevdik, A. B.
Measuring the information retrieval effectiveness of World Wide Web search engines is costly because of the human relevance judgments involved. However, for both business enterprises and individual users it is important to know the most effective Web search engines, since such engines help their users find a higher number of relevant Web pages with less effort. Furthermore, this information can be used for several practical purposes. In this study we introduce an automatic Web search engine evaluation method as an efficient and effective assessment tool for such systems. Experiments based on eight Web search engines, 25 queries, and binary user relevance judgments show that our method provides results consistent with human-based evaluations. It is shown that the observed consistencies are statistically significant. This indicates that the new method can be successfully used in the evaluation of Web search engines. © 2003 Elsevier Ltd. All rights reserved.

Item Open Access: Automatic Ranking of Retrieval Systems in Imperfect Environments (ACM, 2003-07-08)
Nuray, Rabia; Can, Fazlı
The empirical investigation of the effectiveness of information retrieval (IR) systems requires a test collection, a set of query topics, and a set of relevance judgments made by human assessors for each query. Previous experiments show that differences in human relevance assessments do not affect the relative performance of retrieval systems. Based on this observation, we propose and evaluate a new approach that replaces the human relevance judgments with an automatic method. Ranking of retrieval systems with our methodology correlates positively and significantly with that of human-based evaluations. In the experiments, we assume a Web-like imperfect environment: the indexing information for all documents is available for ranking, but some documents may not be available for retrieval. Such conditions can be due to document deletions or network problems. Our method of simulating imperfect environments can be used for Web search engine assessment and for estimating the effects of network conditions (e.g., network unreliability) on IR system performance.

Item Open Access: Characterizing web search queries that match very few or no results (ACM, 2012-11)
Altıngövde, İ. Ş.; Blanco, R.; Cambazoğlu, B. B.; Özcan, Rıfat; Sarıgil, Erdem; Ulusoy, Özgür
Despite continuous efforts to improve web search quality, a non-negligible fraction of user queries end up with very few or even no matching results in leading web search engines. In this work, we provide a detailed characterization of such queries based on an analysis of a real-life query log. Our experimental setup allows us to characterize the queries with few/no results and compare the mechanisms employed by the major search engines in handling them.

Item Open Access: Cost-aware strategies for query result caching in Web search engines (Association for Computing Machinery, 2011)
Ozcan, R.; Altingovde, I. S.; Ulusoy, O.
Search engines and large-scale IR systems need to cache query results for efficiency and scalability purposes. Static and dynamic caching techniques (as well as their combinations) are employed to cache query results effectively. In this study, we propose cost-aware strategies for static and dynamic caching setups. Our research is motivated by two key observations: (i) query processing costs may vary significantly among different queries, and (ii) the processing cost of a query is not proportional to its popularity (i.e., its frequency in previous logs). The first observation implies that cache misses have different, that is, nonuniform, costs in this context. The latter observation implies that typical caching policies, based solely on query popularity, cannot always minimize the total cost. Therefore, we propose to explicitly incorporate the query costs into the caching policies. Simulation results using two large Web crawl datasets and a real query log reveal that the proposed approach improves overall system performance in terms of the average query execution time. © 2011 ACM.
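One simple way to read the cost-aware caching idea above is to let the eviction decision depend on both a query's observed frequency and its processing cost. The sketch below uses a frequency-times-cost score, which is an illustrative stand-in rather than the exact strategies proposed in the paper; process is assumed to be a backend callable returning the results together with an observed processing cost.

```python
class CostAwareCache:
    """Sketch of a dynamic result cache whose eviction score favors keeping
    queries that are both popular and expensive to re-process.

    The freq * cost score is one simple cost-aware policy used here for
    illustration; it is not necessarily the strategy from the paper.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # query -> (results, frequency, cost)

    def lookup(self, query, process):
        if query in self.entries:
            results, freq, cost = self.entries[query]
            self.entries[query] = (results, freq + 1, cost)
            return results, True                    # cache hit
        results, cost = process(query)              # miss: pay the processing cost
        if len(self.entries) >= self.capacity:
            # Evict the cached query with the lowest frequency * cost score.
            victim = min(self.entries,
                         key=lambda q: self.entries[q][1] * self.entries[q][2])
            del self.entries[victim]
        self.entries[query] = (results, 1, cost)
        return results, False
```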
Item Open Access: Diversity and novelty in information retrieval (ACM, 2013-07-08)
Santos, R. L. T.; Castells, P.; Altıngövde, I. S.; Can, Fazlı
This tutorial aims to provide a unifying account of current research on diversity and novelty in different IR domains, namely, in the context of search engines, recommender systems, and data streams.

Item Open Access: Energy-price-driven query processing in multi-center web search engines (IEEE, 2011-07)
Kayaaslan, Enver; Cambazoglu, B. B.; Blanco, R.; Junqueira, F. P.; Aykanat, Cevdet
Concurrently processing thousands of web queries, each with a response time under a fraction of a second, necessitates maintaining and operating massive data centers. For large-scale web search engines, this translates into high energy consumption and a huge electric bill. This work takes on the challenge of reducing the electric bill of commercial web search engines operating on data centers that are geographically far apart. Based on the observation that energy prices and query workloads show high spatio-temporal variation, we propose a technique that dynamically shifts the query workload of a search engine between its data centers to reduce the electric bill. Experiments on real-life query workloads obtained from a commercial search engine show that significant financial savings can be achieved by this technique.

Item Open Access: Evolution of web search results within years (ACM, 2011-07)
Altıngövde, İsmail Şengör; Özcan, Rıfat; Ulusoy, Özgür
We provide a first large-scale analysis of the evolution of query results obtained from a real search engine at two distant points in time, namely, in 2007 and 2010, for a set of 630,000 real queries.

Item Open Access: Exploiting navigational queries for result presentation and caching in Web search engines (John Wiley & Sons, Inc., 2011)
Ozcan, R.; Altingovde, I. S.; Ulusoy, O.
Caching of query results is an important mechanism for the efficiency and scalability of web search engines. Query results are cached and presented in terms of pages, which typically include 10 results each. In navigational queries, users seek a particular website, which, if found, would typically be listed at the top ranks (perhaps first or second) by the search engine. For this type of query, caching and presenting results in the 10-per-page manner may waste cache space and network bandwidth. In this article, we propose nonuniform result page models with varying numbers of results for navigational queries. The experimental results show that our approach reduces the cache miss count by up to 9.17% (because of better utilization of cache space). Furthermore, bandwidth usage, measured in terms of the number of snippets sent, is also reduced by 71% for navigational queries. This means a considerable reduction in the number of transmitted network packets, i.e., a crucial gain especially for mobile-search scenarios. A user study reveals that users easily adapt to the proposed result page model and that the efficiency gains observed in the experiments can be carried over to real-life situations. © 2011 ASIS&T.
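The nonuniform result page model can be sketched as follows: queries classified as navigational get a much smaller result page (and correspondingly fewer cached snippets) than informational ones. The classifier heuristic and the page sizes below are assumptions made for illustration, not the models from the article.

```python
RESULTS_PER_PAGE_INFORMATIONAL = 10
RESULTS_PER_PAGE_NAVIGATIONAL = 2   # hypothetical smaller page for navigational queries

def is_navigational(query, click_stats):
    # Placeholder heuristic: treat a query as navigational if, in past logs,
    # almost all of its clicks went to a single result.
    top_result_click_share = click_stats.get(query, 0.0)
    return top_result_click_share > 0.9

def build_result_page(query, ranked_results, click_stats):
    """Return the slice of results to cache and send for this query."""
    if is_navigational(query, click_stats):
        return ranked_results[:RESULTS_PER_PAGE_NAVIGATIONAL]
    return ranked_results[:RESULTS_PER_PAGE_INFORMATIONAL]
```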
Item Open Access: Exploiting query views for static index pruning in web search engines (ACM, 2009-11)
Altıngövde, İsmail Şengör; Özcan, Rıfat; Ulusoy, Özgür
We propose incorporating query views into a number of static pruning strategies, namely term-centric, document-centric, and access-based approaches. These query-view-based strategies considerably outperform their counterparts for both disjunctive and conjunctive query processing in Web search engines. Copyright 2009 ACM.

Item Open Access: A financial cost metric for result caching (ACM, 2013-07-08)
Sazoğlu, Fethi Burak; Cambazoğlu, B. B.; Özcan, R.; Altıngövde, I. S.; Ulusoy, Özgür
Web search engines cache the results of frequent and/or recent queries. Result caching strategies can be evaluated using different metrics, hit rate being the most well known. Recent works take the processing overhead of queries into account when evaluating the performance of result caching strategies and propose cost-aware caching strategies. In this paper, we propose a financial cost metric that goes one step further and also takes hourly electricity prices into account when computing the cost. We evaluate the most well-known static, dynamic, and hybrid result caching strategies under this new metric. Moreover, we propose a financial-cost-aware version of the well-known LRU strategy and show that it outperforms the original LRU strategy in terms of the financial cost metric. Copyright © 2013 ACM.
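A minimal sketch of the financial cost idea from the last item: weight each query's processing time by the electricity price in effect at the hour it is executed. The hourly price table and the energy-per-CPU-second constant below are invented numbers used only to show the shape of such a metric, not the paper's formulation.

```python
# Hypothetical hourly electricity prices (currency units per kWh), for hours 0..23.
HOURLY_PRICE = [0.08] * 7 + [0.15] * 10 + [0.11] * 7
KWH_PER_CPU_SECOND = 3e-5  # placeholder conversion factor, assumed for illustration

def financial_cost(cpu_seconds, hour_of_day):
    """Financial cost of processing a query at a given hour of the day."""
    return cpu_seconds * KWH_PER_CPU_SECOND * HOURLY_PRICE[hour_of_day]

def total_miss_cost(misses):
    """Sum the financial cost over cache misses given as (cpu_seconds, hour) pairs."""
    return sum(financial_cost(cpu, hour) for cpu, hour in misses)

# Example: the same processing time costs more during expensive daytime hours.
print(financial_cost(0.2, 3), financial_cost(0.2, 12))
```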
Item Open Access: Görsel arama sonuçlarının çoklu örnekle öğrenme yöntemiyle yeniden sıralanması [Re-ranking visual search results with multiple instance learning] (IEEE, 2012-04)
Şener, Fadime; Cinbiş, N. I.; Duygulu-Şahin, Pınar
In this study, we propose a weakly supervised multiple instance learning (MIL) method to improve the results of text-based image search engines. In this approach, the ranked image list returned by the search engine for a keyword query is treated as weakly positive input data and, together with additional negative input data, multiple instance learning bags are constructed. Then, the multiple instance problem is converted to a standard supervised learning problem by mapping each bag into a feature space defined by the instances in the training bags, using a bag-instance similarity measure. Finally, a linear support vector machine (SVM) is used to construct a classification model for each query and re-rank the keyword-based image search results. Based on the classification scores, we re-rank the images and improve precision over the search engine results. We also present experiments conducted to find a pattern in multiple instance bag sizes that yields better average precision. © 2012 IEEE.

Item Open Access: How k-12 students search for learning?: analysis of an educational search engine log (ACM, 2014-07)
Usta, Arif; Altıngövde, İsmail Şengör; Vidinli, İ. B.; Özcan, R.; Ulusoy, Özgür
In this study, we analyze an educational search engine log to shed light on K-12 students' search behavior in a learning environment. We specifically focus on query, session, user, and click characteristics and compare the trends to findings in the literature for general web search engines. Our analysis helps in understanding how students search with the purpose of learning in an educational vertical, and reveals new directions for improving search performance in the education domain. Copyright 2014 ACM.

Item Open Access: Hypergraph-theoretic partitioning models for parallel web crawling (Springer, London, 2012)
Türk, Ata; Cambazoğlu, B. Barla; Aykanat, Cevdet
Parallel web crawling is an important technique employed by large-scale search engines for content acquisition. A commonly used inter-processor coordination scheme in parallel crawling systems is the link exchange scheme, where discovered links are communicated between processors. This scheme can attain the coverage and quality level of a serial crawler while avoiding redundant crawling of pages by different processors. The main problem in the exchange scheme is the high inter-processor communication overhead. In this work, we propose a hypergraph model that reduces the communication overhead associated with link exchange operations in parallel web crawling systems through intelligent assignment of sites to processors. Our hypergraph model can correctly capture and minimize the number of network messages exchanged between crawlers. We evaluate the performance of our models on four benchmark datasets. Compared to the traditional hash-based assignment approach, significant performance improvements are observed in reducing the inter-processor communication overhead. © 2012 Springer-Verlag London Limited.

Item Open Access: Improved DST cryptanalysis of IDEA (Springer, 2006-08)
Ayaz, Eyüp Serdar; Selçuk, Ali Aydın
In this paper, we show how the Demirci-Selçuk-Türe attack, which is currently the deepest-penetrating attack on the IDEA block cipher, can be improved significantly in performance. The improvements presented reduce the attack's plaintext, memory, precomputation time, and key search time complexities. These improvements also make a practical implementation of the attack on reduced versions of IDEA possible, enabling the first experimental verifications of the DST attack. © Springer-Verlag Berlin Heidelberg 2007.

Item Open Access: Incorporating the surfing behavior of web users into PageRank (ACM, 2013-10-11)
Ashyralyyev, Shatlyk; Cambazoğlu, B. B.; Aykanat, Cevdet
In large-scale commercial web search engines, estimating the importance of a web page is a crucial ingredient in ranking web search results. So far, to assess the importance of web pages, two different types of feedback have been taken into account, independently of each other: the feedback obtained from the hyperlink structure among the web pages (e.g., PageRank) or the web browsing patterns of users (e.g., BrowseRank). Unfortunately, both types of feedback have certain drawbacks. While the former lacks user preferences and is vulnerable to malicious intent, the latter suffers from sparsity and hence low web coverage. In this work, we combine these two types of feedback under a hybrid page ranking model in order to alleviate the above-mentioned drawbacks. Our empirical results indicate that the proposed model leads to better estimation of page importance according to an evaluation metric that relies on user click feedback obtained from web search query logs. We conduct all of our experiments in a realistic setting, using a very large scale web page collection (around 6.5 billion web pages) and web browsing data (around two billion web page visits). Copyright is held by the owner/author(s).
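One simple way to blend the two feedback types discussed above is a PageRank-style random walk whose teleportation distribution mixes the uniform vector with normalized browsing visit counts. The sketch below implements that blend with plain power iteration; it is an illustrative hybrid under these assumptions, not necessarily the model proposed in the paper.

```python
def hybrid_pagerank(out_links, visit_counts, alpha=0.85, beta=0.5, iters=50):
    """PageRank-style walk whose teleport vector mixes uniform and browsing evidence.

    out_links:    dict page -> list of pages it links to
    visit_counts: dict page -> number of observed browsing visits (may be sparse)
    """
    pages = list(out_links)
    n = len(pages)
    total_visits = sum(visit_counts.get(p, 0) for p in pages) or 1
    # Teleport vector: (1 - beta) * uniform + beta * normalized browsing distribution.
    teleport = {p: (1 - beta) / n + beta * visit_counts.get(p, 0) / total_visits
                for p in pages}
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_rank = {p: (1 - alpha) * teleport[p] for p in pages}
        for p in pages:
            targets = out_links[p] or pages          # dangling pages spread evenly
            share = alpha * rank[p] / len(targets)
            for q in targets:
                new_rank[q] = new_rank.get(q, 0.0) + share
        rank = new_rank
    return rank

# Tiny example graph whose browsing counts favor page "c".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
visits = {"a": 10, "b": 5, "c": 100}
print(hybrid_pagerank(links, visits))
```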
Item Open Access: Integrated segmentation and recognition of connected Ottoman script (SPIE - International Society for Optical Engineering, 2009-11)
Yalniz, I. Z.; Altingovde, I. S.; Güdükbay, Uğur; Ulusoy, Özgür
We propose a novel context-sensitive segmentation and recognition method for connected letters in Ottoman script. This method first extracts a set of segments from a connected script and determines the candidate letters to which the extracted segments are most similar. Next, a function is defined for scoring each syntactically correct sequence of these candidate letters. To find the candidate letter sequence that maximizes the score function, a directed acyclic graph is constructed. The letters are finally recognized by computing the longest path in this graph. Experiments using a collection of printed Ottoman documents reveal that the proposed method provides >90% precision and recall in terms of character recognition. In a further set of experiments, we also demonstrate that the framework can be used as a building block for an information retrieval system for digital Ottoman archives. © 2009 Society of Photo-Optical Instrumentation Engineers.

Item Open Access: Keyframe labeling technique for surveillance event classification (SPIE - International Society for Optical Engineering, 2010)
Şaykol, E.; Baştan, M.; Güdükbay, Uğur; Ulusoy, Özgür
The huge amount of video data generated by surveillance systems necessitates the use of automatic tools for their efficient analysis, indexing, and retrieval. Automated access to the semantic content of surveillance videos to detect anomalous events is among the basic tasks; however, due to the high variability of the audio-visual features and the large size of the video input, it remains a challenging task, even though a considerable amount of research dealing with automated access to video surveillance has appeared in the literature. We propose a keyframe labeling technique, especially for indoor environments, which assigns labels to keyframes extracted by a keyframe detection algorithm and hence transforms the input video into an event-sequence representation. This representation is used to detect unusual behaviors, such as crossover, deposit, and pickup, with the help of three separate mechanisms based on finite state automata. The keyframes are detected based on a grid-based motion representation of the moving regions, called the motion appearance mask. It has been shown through performance experiments that the keyframe labeling algorithm significantly reduces the storage requirements and yields reasonable event detection and classification performance. © 2010 Society of Photo-Optical Instrumentation Engineers.
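The event-sequence idea in the last item can be illustrated with a toy finite state automaton that scans a sequence of keyframe labels for a "deposit"-style event. The label vocabulary and transitions below are invented for illustration; the paper defines its own automata for crossover, deposit, and pickup over its own keyframe labels.

```python
# Toy finite state automaton for a "deposit"-style event over keyframe labels.
# Labels and transitions are hypothetical, for illustration only.
DEPOSIT_FSA = {
    ("start", "person_enters"): "person_present",
    ("person_present", "person_with_object"): "object_carried",
    ("object_carried", "object_left_person_exits"): "deposit_detected",
}

def detect_deposit(keyframe_labels):
    """Return True if the label sequence drives the automaton to its accepting state."""
    state = "start"
    for label in keyframe_labels:
        state = DEPOSIT_FSA.get((state, label), state)  # labels with no transition are ignored
        if state == "deposit_detected":
            return True
    return False

print(detect_deposit(["person_enters", "person_with_object", "object_left_person_exits"]))  # True
print(detect_deposit(["person_enters", "person_with_object", "person_exits_with_object"]))  # False
```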