|dc.contributor.author||Altıngövde, İsmail Sengör||
|dc.description||Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009.||en_US
|dc.description||Thesis (Ph. D.) -- Bilkent University, 2009.||en_US
|dc.description||Includes bibliographical references leaves 157-169.||en_US
|dc.description.abstract||Search engines are the primary means of retrieval for text data that is abundantly
available on the Web. A standard search engine should carry out three
fundamental tasks, namely crawling the Web, indexing the crawled content, and
finally processing the queries using the index. Devising efficient methods for these
tasks is an important research topic. In this thesis, we introduce efficient strategies
related to all three tasks involved in a search engine. Most of the proposed
strategies are essentially applicable when a grouping of documents in its broadest
sense (i.e., in terms of automatically obtained classes/clusters, or manually
edited categories) is readily available or can be constructed in a feasible manner.
Additionally, we introduce static index pruning strategies that are based on
the query views.
For the crawling task, we propose a rule-based focused crawling strategy that
exploits interclass rules among the document classes in a topic taxonomy. These
rules capture the probability of having hyperlinks between two classes. The rule-based
crawler can tunnel toward the on-topic pages by following a path of off-topic
pages, and thus yields a higher harvest rate for crawling on-topic pages.
In the context of indexing and query processing tasks, we concentrate on conducting
efficient search, again, using document groups; i.e., clusters or categories.
In typical cluster-based retrieval (CBR), first, clusters that are most similar to a
given free-text query are determined, and then documents from these clusters are
selected to form the final ranked output. For efficient CBR, we first identify and
evaluate some alternative query processing strategies. Next, we introduce a new
index organization, the so-called cluster-skipping inverted index structure (CS-IIS).
It is shown that typical-CBR with CS-IIS outperforms previous CBR strategies
(with an ordinary index) for a number of datasets and under varying search parameters.
In this thesis, an enhanced version of CS-IIS is further proposed, in
which all information to compute query-cluster similarities during query evaluation
is stored. We introduce an incremental-CBR strategy that operates on top
of this latter index structure, and demonstrate its search efficiency for different
Finally, we exploit query views that are obtained from the search engine query
logs to tailor more effective static pruning techniques. This is also related to the
indexing task involved in a search engine. In particular, the query view approach
is incorporated into a set of existing pruning strategies, as well as some new
variants proposed by us. We show that query-view-based strategies significantly
outperform the existing approaches in terms of the query output quality, for both
disjunctive and conjunctive evaluation of queries.||en_US
|dc.description.statementofresponsibility||Altıngövde, İsmail Sengör||en_US
|dc.format.extent||xx, 169 leaves, graphs||en_US
|dc.subject||static index pruning||en_US
|dc.subject.lcc||TK5105.884 .A48 2009||en_US
|dc.title||Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning||en_US
|dc.department||Department of Computer Engineering||en_US