Caching techniques for large scale web search engines

buir.advisor: Ulusoy, Özgür
dc.contributor.author: Özcan, Rıfat
dc.date.accessioned: 2016-01-08T18:15:41Z
dc.date.available: 2016-01-08T18:15:41Z
dc.date.issued: 2011
dc.description: Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent Univ., 2011. [en_US]
dc.description: Thesis (Ph. D.) -- Bilkent University, 2011. [en_US]
dc.description: Includes bibliographical references (leaves 120-130). [en_US]
dc.description.abstract: Large scale search engines have to cope with an increasing volume of web content and an increasing number of query requests each day. Caching of query results is one of the crucial methods that can increase the throughput of the system. In this thesis, we propose a variety of methods to increase the efficiency of caching for search engines. We first provide cost-aware policies for both static and dynamic query result caches. We show that queries have significantly varying costs and that the processing cost of a query is not proportional to its frequency (popularity). Based on this observation, we develop caching policies that take the query cost into consideration, in addition to frequency, when deciding which items to cache. Second, we propose a query-intent-aware caching scheme in which navigational queries are identified and cached differently from other queries. Query results are cached and presented in pages, each of which typically includes 10 results. In navigational queries, the aim is to reach a particular web site, which would typically be listed at the top ranks by the search engine, if found. We argue that caching and presenting the results of navigational queries in this 10-per-page manner is not cost effective, and thus we propose alternative result presentation models and investigate the effect of these models on caching performance. Third, we propose a cluster-based storage model for query results in a static cache. Queries with common result documents are clustered using a single-link clustering algorithm. We provide a compact storage model for those clusters by exploiting the overlap in query results. Finally, a five-level static cache that consists of all cacheable data items (query results, part of the index, and document contents) in a search engine setting is presented. A greedy method is developed to determine which items to cache. This method prioritizes items for caching based on gains computed using the items' past frequency, estimated costs, and storage overheads. This approach also considers the inter-dependency between items, such that caching an item may affect the gain of items that are not cached yet. We experimentally evaluate all our methods using a real query log and document collections. We provide comparisons to corresponding baseline methods in the literature, and we present improvements in terms of throughput, number of cache misses, and storage overhead of query results. [en_US]
dc.description.provenance: Made available in DSpace on 2016-01-08T18:15:41Z (GMT). No. of bitstreams: 1 0006012.pdf: 2931141 bytes, checksum: 6ffe0300bfe556a26bd877aed542d187 (MD5) [en]
dc.description.statementofresponsibility: Özcan, Rıfat [en_US]
dc.format.extent: xxvi, 130 leaves [en_US]
dc.identifier.uri: http://hdl.handle.net/11693/15258
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Search engine [en_US]
dc.subject: navigational queries [en_US]
dc.subject: cost-aware caching [en_US]
dc.subject: caching techniques [en_US]
dc.subject.lcc: TK5105.884 .O93 2011 [en_US]
dc.subject.lcsh: Search engines--Programming. [en_US]
dc.subject.lcsh: Web search engines--Mathematical models. [en_US]
dc.subject.lcsh: Information storage and retrieval systems. [en_US]
dc.subject.lcsh: Information retrieval. [en_US]
dc.subject.lcsh: Internet searching. [en_US]
dc.subject.lcsh: Cache memory. [en_US]
dc.subject.lcsh: Electronic data processing--Backup processing alternatives. [en_US]
dc.title: Caching techniques for large scale web search engines [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D. (Doctor of Philosophy)
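
The abstract describes a greedy method that ranks items for a static cache by gains computed from past frequency, estimated cost, and storage overhead. The sketch below illustrates that general idea only. The gain formula (frequency times cost, divided by size), the `Item` fields, and the sample log are assumptions made for illustration, not the thesis's actual definitions, and the sketch ignores the inter-dependency between items that the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    key: str        # hypothetical identifier, e.g. a query string
    frequency: int  # past occurrences in a training log
    cost: float     # estimated processing cost of a cache miss
    size: int       # storage overhead in the cache (must be > 0)


def gain(item: Item) -> float:
    # Assumed scoring: expected cost saved per unit of cache space.
    return item.frequency * item.cost / item.size


def select_static_cache(items: list[Item], capacity: int) -> list[Item]:
    """Greedily pick items in descending gain order while they still fit."""
    chosen: list[Item] = []
    used = 0
    for item in sorted(items, key=gain, reverse=True):
        if used + item.size <= capacity:
            chosen.append(item)
            used += item.size
    return chosen


# Made-up sample data: one popular but cheap query, one rare but costly one.
log = [
    Item("weather", frequency=100, cost=1.0, size=10),
    Item("rare long query", frequency=20, cost=12.0, size=10),
    Item("news", frequency=60, cost=2.0, size=10),
]
cached = [item.key for item in select_static_cache(log, capacity=20)]
```

With this sample data, a frequency-only policy would keep "weather" and "news", whereas the assumed cost-aware gain prefers "rare long query" and "news", which reflects the abstract's observation that query cost is not proportional to popularity.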

Files

Original bundle

Name: 0006012.pdf
Size: 2.8 MB
Format: Adobe Portable Document Format