Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning

Altıngövde, İsmail Sengör

buir.advisor	Ulusoy, Özgür
dc.contributor.author	Altıngövde, İsmail Sengör
dc.date.accessioned	2016-01-08T18:11:12Z
dc.date.available	2016-01-08T18:11:12Z
dc.date.issued	2009
dc.description	Cataloged from PDF version of article.	en_US
dc.description	Includes bibliographical references (leaves 157-169).	en_US
dc.description.abstract	Search engines are the primary means of retrieval for text data that is abundantly available on the Web. A standard search engine should carry out three fundamental tasks, namely, crawling the Web, indexing the crawled content, and processing queries using the index. Devising efficient methods for these tasks is an important research topic. In this thesis, we introduce efficient strategies related to all three tasks involved in a search engine. Most of the proposed strategies are essentially applicable when a grouping of documents in its broadest sense (i.e., in terms of automatically obtained classes/clusters, or manually edited categories) is readily available or can be constructed in a feasible manner. Additionally, we introduce static index pruning strategies that are based on query views. For the crawling task, we propose a rule-based focused crawling strategy that exploits interclass rules among the document classes in a topic taxonomy. These rules capture the probability of having hyperlinks between two classes. The rule-based crawler can tunnel toward the on-topic pages by following a path of off-topic pages, and thus yields a higher harvest rate for crawling on-topic pages. In the context of indexing and query processing tasks, we concentrate on conducting efficient search, again, using document groups; i.e., clusters or categories. In typical cluster-based retrieval (CBR), first, clusters that are most similar to a given free-text query are determined, and then documents from these clusters are selected to form the final ranked output. For efficient CBR, we first identify and evaluate some alternative query processing strategies. Next, we introduce a new index organization, the so-called cluster-skipping inverted index structure (CS-IIS). It is shown that typical-CBR with CS-IIS outperforms previous CBR strategies (with an ordinary index) for a number of datasets and under varying search parameters.
In this thesis, an enhanced version of CS-IIS is further proposed, in which all information needed to compute query-cluster similarities during query evaluation is stored. We introduce an incremental-CBR strategy that operates on top of this latter index structure, and demonstrate its search efficiency for different scenarios. Finally, we exploit query views that are obtained from the search engine query logs to tailor more effective static pruning techniques. This is also related to the indexing task involved in a search engine. In particular, the query view approach is incorporated into a set of existing pruning strategies, as well as some new variants proposed by us. We show that query-view-based strategies significantly outperform the existing approaches in terms of query output quality, for both disjunctive and conjunctive evaluation of queries.	en_US
dc.description.statementofresponsibility	Altıngövde, İsmail Sengör	en_US
dc.format.extent	xx, 169 leaves, graphs	en_US
dc.identifier.uri	http://hdl.handle.net/11693/14932
dc.language.iso	English	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Search engine	en_US
dc.subject	focused crawling	en_US
dc.subject	cluster-based retrieval	en_US
dc.subject	static index pruning	en_US
dc.subject.lcc	TK5105.884 .A48 2009	en_US
dc.subject.lcsh	Search engines.	en_US
dc.subject.lcsh	Information retrieval.	en_US
dc.title	Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning	en_US
dc.type	Thesis	en_US
thesis.degree.discipline	Computer Engineering
thesis.degree.grantor	Bilkent University
thesis.degree.level	Doctoral
thesis.degree.name	Ph.D. (Doctor of Philosophy)
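
Illustrative note: the abstract describes typical cluster-based retrieval (CBR) as a two-stage process, in which the clusters most similar to the query are selected first and documents from those clusters are then ranked. The Python sketch below shows that flow only. It is not taken from the thesis: the function names, the raw term-frequency weighting, the summed-centroid cluster representation, and the cosine measure are all assumptions made for illustration, and the sketch does not use the CS-IIS index structure.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def typical_cbr(query, docs, clusters, top_clusters=1, top_docs=10):
    """Two-stage retrieval: pick the best clusters, then rank their documents.

    docs maps a document id to its text; clusters maps a cluster id to a
    list of member document ids. Both layouts are hypothetical.
    """
    q = Counter(query.lower().split())
    doc_vecs = {d: Counter(text.lower().split()) for d, text in docs.items()}

    # Stage 1: represent each cluster by the sum of its members' vectors
    # (an assumed centroid definition) and keep the most similar clusters.
    centroids = {}
    for cid, members in clusters.items():
        centroid = Counter()
        for d in members:
            centroid.update(doc_vecs[d])
        centroids[cid] = centroid
    best = sorted(clusters, key=lambda c: cosine(q, centroids[c]), reverse=True)
    best = best[:top_clusters]

    # Stage 2: rank only the documents that belong to the selected clusters.
    candidates = [d for cid in best for d in clusters[cid]]
    ranked = sorted(candidates, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
    return ranked[:top_docs]


# Toy collection, invented for this example.
docs = {
    "d1": "focused web crawling",
    "d2": "crawling topic taxonomy",
    "d3": "inverted index pruning",
    "d4": "static index pruning strategies",
}
clusters = {"crawl": ["d1", "d2"], "index": ["d3", "d4"]}

print(typical_cbr("index pruning", docs, clusters))
```

In this sketch every document vector is scanned in memory. A real engine would evaluate both stages against an inverted index, which is where an organization such as the CS-IIS mentioned in the abstract would come into play.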

Files

Original bundle

Name: 0003902.pdf
Size: 3.88 MB
Format: Adobe Portable Document Format

Collections

Graduate School of Engineering and Science