Caching techniques for large scale web search engines
Author(s)
Advisor
Ulusoy, Özgür
Date
2011
Publisher
Bilkent University
Language
English
Type
Thesis
Abstract
Large-scale search engines have to cope with an ever-increasing volume of web content
and a growing number of query requests each day. Caching query results is
one of the crucial techniques for increasing system throughput. In
this thesis, we propose a variety of methods to improve the efficiency of caching
for search engines.
We first provide cost-aware policies for both static and dynamic query result
caches. We show that queries have significantly varying costs and that the processing
cost of a query is not proportional to its frequency (popularity). Based on this
observation, we develop caching policies that take query cost into account,
in addition to frequency, when deciding which items to cache.
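As an illustration, consider a cost-aware admission rule for a static cache that ranks candidate queries by frequency times processing cost rather than by frequency alone. This is a minimal sketch under assumed names and a simple frequency-times-cost gain formula; it is not the exact set of policies developed in the thesis.

    from collections import namedtuple

    QueryStats = namedtuple("QueryStats", ["query", "frequency", "cost_ms"])

    def select_static_cache(stats, capacity):
        """Fill a static cache with the queries whose misses would be most
        expensive overall: rank by frequency * processing cost (assumed gain
        formula) and keep the top `capacity` queries."""
        ranked = sorted(stats, key=lambda s: s.frequency * s.cost_ms, reverse=True)
        return {s.query for s in ranked[:capacity]}

    stats = [
        QueryStats("weather", frequency=900, cost_ms=2),                # popular, cheap
        QueryStats("obscure legal phrase", frequency=40, cost_ms=120),  # rare, costly
    ]
    print(select_static_cache(stats, capacity=1))
    # Prints {'obscure legal phrase'}: 40 * 120 = 4800 beats 900 * 2 = 1800,
    # whereas a frequency-only policy would have cached "weather" instead.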
Second, we propose a query-intent-aware caching scheme in which navigational queries are
identified and cached differently from other queries. Query results are cached and
presented in pages, each of which typically contains 10 results. A navigational
query aims to reach a particular web site, which, if found, is typically
listed at the top ranks by the search engine. We argue that caching
and presenting the results of navigational queries in this 10-per-page manner is
not cost-effective, so we propose alternative result presentation models and
investigate their effect on caching performance.
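One way to picture the scheme is a cache that stores a shorter result page for navigational queries, since the sought site sits near the top of the ranking if it was found at all. The cutoff values and the classifier flag below are illustrative assumptions, not the presentation models evaluated in the thesis.

    def results_to_cache(ranked_results, is_navigational, full_page=10, nav_page=3):
        """Store a short page for navigational queries (assumed cutoff: 3) and a
        conventional 10-result page for the rest; how `is_navigational` is
        decided is left to a separate, assumed query-intent classifier."""
        cutoff = nav_page if is_navigational else full_page
        return ranked_results[:cutoff]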
Third, we propose a cluster-based storage model for query results in a static cache. Queries with
common result documents are clustered using a single-link clustering algorithm, and we
provide a compact storage model for these clusters by exploiting the overlap in
their result sets.
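As a rough sketch of both steps, the snippet below links queries whose result sets share documents (single-link clustering, computed as connected components with a small union-find) and then stores each cluster's documents once, with each query holding only integer references. The overlap threshold and the data layout are illustrative assumptions.

    def single_link_clusters(results, min_overlap=1):
        """results: dict mapping query -> set of result doc ids. Queries sharing
        at least `min_overlap` documents are linked; clusters are the connected
        components (single link)."""
        parent = {q: q for q in results}

        def find(q):
            while parent[q] != q:
                parent[q] = parent[parent[q]]  # path halving
                q = parent[q]
            return q

        queries = list(results)
        for i, a in enumerate(queries):
            for b in queries[i + 1:]:
                if len(results[a] & results[b]) >= min_overlap:
                    parent[find(a)] = find(b)

        clusters = {}
        for q in queries:
            clusters.setdefault(find(q), []).append(q)
        return list(clusters.values())

    def compact_store(cluster, results):
        """Store the union of a cluster's doc ids once; each query then keeps
        only small integer references into that shared list, so overlapping
        documents are not duplicated across queries."""
        docs = sorted(set().union(*(results[q] for q in cluster)))
        position = {d: i for i, d in enumerate(docs)}
        return docs, {q: sorted(position[d] for d in results[q]) for q in cluster}

    results = {"q1": {10, 11, 12}, "q2": {11, 12, 13}, "q3": {50, 51}}
    for cluster in single_link_clusters(results):
        docs, refs = compact_store(cluster, results)
        # q1 and q2 fall in one cluster; docs 11 and 12 are stored only once.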
Finally, we present a five-level static cache that holds all cacheable data
items in a search engine setting: query results, parts of the index, and document
contents. A greedy method is developed to determine which items to
cache. This method prioritizes items based on gains computed from their
past frequencies, estimated costs, and storage overheads. It also accounts for the
inter-dependency between items, in that caching one item may
affect the gains of items that are not cached yet.
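A standard way to implement such a dependency-aware greedy selection is a lazy priority queue: each popped item's gain is recomputed against the items cached so far, and the item is pushed back if its stored gain has gone stale. The gain-per-byte ordering and the gain_of interface below are illustrative assumptions rather than the thesis's exact formulation.

    import heapq

    def greedy_fill(items, gain_of, budget):
        """Lazy greedy cache selection. gain_of(item, cached) -> (gain, size);
        an item's gain may shrink as related items enter the cache, which is
        the inter-dependency the recomputation below accounts for."""
        cached, used = set(), 0
        heap = []
        for item in items:
            gain, size = gain_of(item, cached)
            heapq.heappush(heap, (-gain / size, gain, item))  # gain-per-byte order
        while heap:
            _, stale_gain, item = heapq.heappop(heap)
            gain, size = gain_of(item, cached)  # recompute against current cache
            if gain < stale_gain:
                # Dependencies were cached meanwhile; reinsert with fresh gain.
                heapq.heappush(heap, (-gain / size, gain, item))
            elif used + size <= budget:
                cached.add(item)
                used += size
        return cached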
We experimentally evaluate all of our methods using a real query log and document
collections. We compare against the corresponding baseline methods from
the literature and report improvements in terms of throughput, number of
cache misses, and storage overhead of query results.