Parallel text retrieval on temporally versioned document collections
Please cite this item using this persistent URLhttp://hdl.handle.net/11693/14778
In recent years, as the access to the Internet is getting easier and cheaper, the amount and the rate of change of the online data presented to the Internet users are increasing at an astonishing rate. This ever-changing nature of the Internet causes an ever-decaying and replenishing information collection where newly presented data generally replaces old and sometimes valuable data. There are many recent studies aiming to preserve this valuable temporal data and size and number of temporal Web data collections are increasing. We believe that soon, information retrieval systems responding to time-range queries in a reasonable amount of time will emerge as a means of accessing vast temporal Web data collections. Due to tremendous size of temporal data and excessive number of query submissions per unit time, temporal information retrieval systems will have to utilize parallelism as much as possible. In parallel systems, in order to index collections using inverted indices, a strategy on distribution of the inverted indices has to be followed. In this study, the feasibility of time-based partitioned versus term-based partitioned temporalweb inverted-indices is analyzed and a novel parallel text retrieval system for answering temporal web queries is implemented considering the number of queries processed in unit time. Moreover, we investigate the performance of skip-list based and randomized-select based ranking schemes on time-based and termbased partitioned inverted indexes. Finally, we compare time-balanced and sizebalanced time-based partitioning schemes. The experimental results at small to medium number of processors reveal that for medium to long length queries time-based partitioning works better.