Incorporating the surfing behavior of web users into PageRank
One of the most crucial factors that determines the effectiveness of a large-scale commercial web search engine is the ranking (i.e., order) in which web search results are presented to the end user. In modern web search engines, the skeleton for the ranking of web search results is constructed using a combination of the global (i.e., query independent) importance of web pages and their relevance to the given search query. In this thesis, we are concerned with the estimation of global importance of web pages. So far, to estimate the importance of web pages, two different types of data sources have been taken into account, independent of each other: hyperlink structure of the web (e.g., PageRank) or surfing behavior of web users (e.g., BrowseRank). Unfortunately, both types of data sources have certain limitations. The hyperlink structure of the web is not very reliable and is vulnerable to bad intent (e.g., web spam), because hyperlinks can be easily edited by the web content creators. On the other hand, the browsing behavior of web users has limitations such as, sparsity and low web coverage. In this thesis, we combine these two types of feedback under a hybrid page importance estimation model in order to alleviate the above-mentioned drawbacks. Our experimental results indicate that the proposed hybrid model leads to better estimation of page importance according to an evaluation metric that uses the user click information obtained from Yahoo! web search engine’s query logs as ground-truth ranking. We conduct all of our experiments in a realistic setting, using a very large scale web page collection (around 6.5 billion web pages) and web browsing data (around two billion web page visits) collected through the Yahoo! toolbar.