Browsing by Subject "Web search engines--Mathematical models."


Open Access
Analysis of Web search queries with very few or no results
(2012) Sarıgil, Erdem
Nowadays, search engines have a significant impact on people’s lives, driven by the rapid growth of the World Wide Web. There are billions of web pages that contain a huge amount of information. Search engines are indispensable tools for finding information on the Web. Despite continuous efforts to improve web search quality, a non-negligible fraction of user queries end up with very few or even no matching results in leading commercial web search engines. In this thesis, we provide the first detailed characterization of such queries based on an analysis of a real-life query log. Our experimental setup allows us to characterize the queries with few/no results and compare the mechanisms employed by the three major search engines to handle them. Furthermore, we build machine learning models for the prediction of query suggestion patterns and no-answer queries.
Open Access
Caching techniques for large scale web search engines
(2011) Özcan, Rıfat
Large-scale search engines have to cope with an increasing volume of web content and an increasing number of query requests each day. Caching of query results is one of the crucial methods that can increase the throughput of the system. In this thesis, we propose a variety of methods to increase the efficiency of caching for search engines. We first provide cost-aware policies for both static and dynamic query result caches. We show that queries have significantly varying costs and that the processing cost of a query is not proportional to its frequency (popularity). Based on this observation, we develop caching policies that take the query cost into consideration, in addition to frequency, when deciding which items to cache. Second, we propose a query-intent-aware caching scheme in which navigational queries are identified and cached differently from other queries. Query results are cached and presented in terms of pages, which typically include 10 results each. In navigational queries, the aim is to reach a particular web site, which would typically be listed at the top ranks by the search engine, if found. We argue that caching and presenting the results of navigational queries in this 10-per-page manner is not cost-effective, and thus we propose alternative result presentation models and investigate the effect of these models on caching performance. Third, we propose a cluster-based storage model for query results in a static cache. Queries with common result documents are clustered using the single-link clustering algorithm. We provide a compact storage model for those clusters by exploiting the overlap in query results. Finally, a five-level static cache that consists of all cacheable data items (query results, part of index, and document contents) in a search engine setting is presented. A greedy method is developed to determine which items to cache. 
This method prioritizes items for caching based on gains computed using items’ past frequency, estimated costs, and storage overheads. This approach also considers the inter-dependency between items such that caching of an item may affect the gain of items that are not cached yet. We experimentally evaluate all our methods using a real query log and document collections. We provide comparisons to corresponding baseline methods in the literature and we present improvements in terms of throughput, number of cache misses, and storage overhead of query results.
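The greedy, cost-aware selection described in this abstract can be illustrated with a minimal sketch. It is not the thesis's actual method: the `CacheItem` fields, the gain formula (frequency times cost, divided by size), and the function names are assumptions chosen for illustration, and the inter-dependency between items mentioned in the abstract is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheItem:
    key: str
    frequency: int  # past number of requests (hypothetical log statistic)
    cost: float     # estimated processing cost paid on a cache miss
    size: int       # storage overhead, in arbitrary units


def gain_per_unit(item):
    """Assumed gain: processing cost saved per unit of cache space."""
    return item.frequency * item.cost / item.size


def select_static_cache(items, capacity):
    """Greedily fill a static cache with the highest-gain items that fit."""
    chosen = []
    remaining = capacity
    for item in sorted(items, key=gain_per_unit, reverse=True):
        if item.size <= remaining:
            chosen.append(item.key)
            remaining -= item.size
    return chosen
```

With such a gain function, a rare but expensive query can be preferred over a frequent but cheap one, which is the behavior a frequency-only policy cannot express.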
Open Access
Cascaded cross entropy-based search result diversification
(2012) Köroğlu, Bilge
Search engines are used to find information on the web. Retrieving relevant documents for ambiguous queries based on query-document similarity does not satisfy users because such queries have more than one meaning. In this study, a new method, cascaded cross entropy-based search result diversification (CCED), is proposed to list the web pages corresponding to different meanings of the query in higher rank positions. It combines modified reciprocal rank and cross entropy measures to balance the trade-off between query-document relevancy and diversity among the retrieved documents. We use the Latent Dirichlet Allocation (LDA) algorithm to compute query-document relevancy scores. The number of different meanings of an ambiguous query is estimated by complete-link clustering. We construct the first Turkish test collection for result diversification, BILDIV-2012. The performance of CCED is compared with Maximum Marginal Relevance (MMR) and IA-Select algorithms. In this comparison, the Ambient, TREC Diversity Track, and BILDIV-2012 test collections are used. We also compare the performance of these algorithms with that of Bing and Google. The results indicate that CCED is the most successful method in terms of satisfying the users interested in different meanings of the query in higher rank positions of the result list.
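The relevance/diversity trade-off can be made concrete with a short sketch of Maximum Marginal Relevance (MMR), one of the baselines named in this abstract. This is not CCED itself. The function name `mmr_rerank`, the `trade_off` parameter, and the dictionary-based inputs are assumptions for illustration; relevance and similarity scores are taken as given.

```python
def mmr_rerank(relevance, similarity, k, trade_off=0.5):
    """Pick k documents, balancing relevance against redundancy.

    relevance:  dict mapping doc id -> query-document relevance score
    similarity: nested dict, similarity[a][b] = similarity between docs a and b
    trade_off:  1.0 ranks by relevance only; lower values favor diversity
    """
    selected = []
    candidates = list(relevance)
    while candidates and len(selected) < k:

        def score(doc):
            # Redundancy is the similarity to the closest already-selected doc.
            redundancy = max((similarity[doc][s] for s in selected), default=0.0)
            return trade_off * relevance[doc] - (1 - trade_off) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

When two highly relevant documents are near-duplicates, a `trade_off` below 1.0 lets a less relevant but different document take the second rank position.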
Open Access
Incorporating the surfing behavior of web users into PageRank
(2013) Ashyralyyev, Shatlyk
One of the most crucial factors that determines the effectiveness of a large-scale commercial web search engine is the ranking (i.e., order) in which web search results are presented to the end user. In modern web search engines, the skeleton for the ranking of web search results is constructed using a combination of the global (i.e., query independent) importance of web pages and their relevance to the given search query. In this thesis, we are concerned with the estimation of the global importance of web pages. So far, to estimate the importance of web pages, two different types of data sources have been taken into account, independent of each other: the hyperlink structure of the web (e.g., PageRank) or the surfing behavior of web users (e.g., BrowseRank). Unfortunately, both types of data sources have certain limitations. The hyperlink structure of the web is not very reliable and is vulnerable to bad intent (e.g., web spam), because hyperlinks can be easily edited by web content creators. On the other hand, the browsing behavior of web users has limitations such as sparsity and low web coverage. In this thesis, we combine these two types of feedback under a hybrid page importance estimation model in order to alleviate the above-mentioned drawbacks. Our experimental results indicate that the proposed hybrid model leads to better estimation of page importance according to an evaluation metric that uses the user click information obtained from the Yahoo! web search engine’s query logs as the ground-truth ranking. We conduct all of our experiments in a realistic setting, using a very large-scale web page collection (around 6.5 billion web pages) and web browsing data (around two billion web page visits) collected through the Yahoo! toolbar.
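As a rough illustration of how link structure and browsing data might be combined, the sketch below runs standard PageRank power iteration and optionally biases the teleport distribution with page visit counts. The abstract does not specify the hybrid model, so this combination is purely an assumption and should not be read as the thesis's method; `pagerank`, `visit_teleport`, and all parameters are hypothetical names.

```python
def pagerank(links, teleport=None, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links.

    Every link target must also appear as a key in `links`.
    `teleport` is a probability distribution over pages (uniform if omitted).
    """
    pages = sorted(links)
    if teleport is None:
        teleport = {p: 1.0 / len(pages) for p in pages}
    rank = dict(teleport)
    for _ in range(iterations):
        # Rank held by pages without out-links is redistributed via teleport.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {
            p: (1 - damping) * teleport[p] + damping * dangling * teleport[p]
            for p in pages
        }
        for p in pages:
            for q in links[p]:
                new_rank[q] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank


def visit_teleport(visits, pages):
    """Assumed hybrid step: teleport probability proportional to visit counts."""
    total = sum(visits.get(p, 0) for p in pages)
    if total == 0:
        raise ValueError("no visits recorded for the given pages")
    return {p: visits.get(p, 0) / total for p in pages}
```

In this sketch, a page with no incoming links but many recorded visits receives more importance than plain PageRank would give it, which hints at how usage data can compensate for gaps in the link graph.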
Open Access
Longitudinal analysis of search engine query logs - temporal coverage
(2012) Yılmaz, Oğuz
The internet is growing day by day, and the usage of web search engines is continuously increasing. The start page of internet users’ browsers is typically the home page of a search engine. To navigate to a certain web site, most people prefer to type the site’s name into the search engine interface instead of using the browser’s address bar. Considering this important role of search engines as the main entry point to the web, we need to understand the Web searching trends that are emerging over time. We believe that temporal analysis of the query results returned by search engines reveals important insights into the current situation and future directions of web searching. In this thesis, we provide a large-scale analysis of the evolution of query results obtained from a real search engine at two distant points in time, namely, in 2007 and 2010, for a set of 630,000 real queries. Our analyses in this work attempt to find answers to several critical questions regarding the evolution of Web search results. We believe that this work, being a large-scale longitudinal analysis of query results, would shed some light on those questions.
Open Access
A new approach to search result clustering and labeling
(2011) Türel, Anıl
Search engines present query results as a long ordered list of web snippets divided into several pages. Post-processing of information retrieval results for easier access to the desired information is an important research problem. One post-processing technique is clustering search results by topic and labeling these groups to reflect the topic of each cluster. In this thesis, we present a novel search result clustering approach to split the long list of documents returned by search engines into meaningfully grouped and labeled clusters. Our method emphasizes clustering quality by using cover coefficient and sequential k-means clustering algorithms. Cluster labeling is crucial because meaningless or confusing labels may mislead users into checking the wrong clusters for the query and wasting time. Additionally, labels should reflect the contents of documents within the cluster accurately. To be able to label clusters effectively, a new cluster labeling method based on term weighting is introduced. We also present a new metric that employs precision and recall to assess the success of cluster labeling. We adopt a comparative evaluation strategy to derive the relative performance of the proposed method with respect to the two prominent search result clustering methods: Suffix Tree Clustering and Lingo. Moreover, we perform the experiments using the publicly available Ambient and ODP-239 datasets. Experimental results show that the proposed method can successfully achieve both clustering and labeling tasks.
Open Access
A result cache invalidation scheme for web search engines
(2011) Alıcı, Şadiye
The result cache is a vital component for the efficiency of large-scale web search engines, and maintaining the freshness of cached query results is a current research challenge. As a remedy to this problem, our work proposes a new mechanism to identify queries whose cached results are stale. The basic idea behind our mechanism is to maintain and compare the generation time of query results with the update times of posting lists and documents to decide on the staleness of query results. The proposed technique is evaluated using a Wikipedia document collection with real update information and a real-life query log. Throughout the experiments, we compare our approach with two baseline strategies from the literature together with a detailed evaluation. We show that our technique has good prediction accuracy relative to the baseline based on the time-to-live (TTL) mechanism. Moreover, it is easy to implement and it incurs less processing overhead on the system relative to a recently proposed, more sophisticated invalidation mechanism.
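The timestamp comparison at the core of this abstract can be sketched as follows. This is a simplified reading of the stated idea, not the thesis's implementation: the `CachedResult` structure, the two timestamp dictionaries, and the rule that any newer update marks the entry stale are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedResult:
    terms: tuple         # query terms whose posting lists produced the result
    doc_ids: tuple       # documents listed in the cached result
    generated_at: float  # time at which the result was computed


def is_stale(entry, term_updated_at, doc_updated_at):
    """Flag a cached result as stale if anything it depends on changed later.

    term_updated_at: dict mapping term -> last update time of its posting list
    doc_updated_at:  dict mapping doc id -> last update time of the document
    Terms or documents with no recorded update are treated as never updated.
    """
    never = float("-inf")
    newest_term = max((term_updated_at.get(t, never) for t in entry.terms), default=never)
    newest_doc = max((doc_updated_at.get(d, never) for d in entry.doc_ids), default=never)
    return max(newest_term, newest_doc) > entry.generated_at
```

Unlike a TTL rule, which expires every entry after a fixed interval, a check of this kind only invalidates entries whose underlying posting lists or documents were actually updated after the result was generated.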