Models and algorithms for parallel text retrieval

buir.advisor: Aykanat, Cevdet
dc.contributor.author: Cambazoğlu, Berkant Barla
dc.date.accessioned: 2016-07-01T11:07:47Z
dc.date.available: 2016-07-01T11:07:47Z
dc.date.issued: 2006
dc.description: Cataloged from PDF version of article. [en_US]
dc.description.abstract: In the last decade, search engines became an integral part of our lives. The current state of the art in search engine technology relies on parallel text retrieval. Basically, a parallel text retrieval system is composed of three components: a crawler, an indexer, and a query processor. The crawler component aims to locate, fetch, and store the Web pages in a local document repository. The indexer component converts the stored, unstructured text into a queryable form, most often an inverted index. Finally, the query processing component performs the search over the indexed content. In this thesis, we present models and algorithms for efficient Web crawling and query processing. First, for parallel Web crawling, we propose a hybrid model that aims to minimize the communication overhead among the processors while balancing the number of page download requests and storage loads of processors. Second, we propose models for document- and term-based inverted index partitioning. In the document-based partitioning model, the number of disk accesses incurred during query processing is minimized while the posting storage is balanced. In the term-based partitioning model, the total amount of communication is minimized while, again, the posting storage is balanced. Finally, we develop and evaluate a large number of algorithms for query processing in ranking-based text retrieval systems. We test the proposed algorithms over our experimental parallel text retrieval system, Skynet, currently running on a 48-node PC cluster. In the thesis, we also discuss the design and implementation details of another, somewhat untraditional, grid-enabled search engine, SE4SEE. Among our practical work, we present the Harbinger text classification system, used in SE4SEE for Web page classification, and the K-PaToH hypergraph partitioning toolkit, to be used in the proposed models. [en_US]
dc.description.provenance: Made available in DSpace on 2016-07-01T11:07:47Z (GMT). No. of bitstreams: 1. 0003173.pdf: 2299068 bytes, checksum: 314e12cf18c108a1eded4934bb7a822e (MD5). Previous issue date: 2006 [en]
dc.description.statementofresponsibility: Cambazoğlu, Berkant Barla [en_US]
dc.format.extent: xviii, 180 leaves [en_US]
dc.identifier.itemid: BILKUTUPB100085
dc.identifier.uri: http://hdl.handle.net/11693/29882
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Search engine [en_US]
dc.subject: Parallel text retrieval [en_US]
dc.subject: Web crawling [en_US]
dc.subject: Inverted index partitioning [en_US]
dc.subject: Query processing [en_US]
dc.subject: Text classification [en_US]
dc.subject: Hypergraph partitioning [en_US]
dc.subject.lcc: QA76.5 .C35 2006 [en_US]
dc.subject.lcsh: Parallel processing (Electronic computers). [en_US]
dc.title: Models and algorithms for parallel text retrieval [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Doctoral
thesis.degree.name: Ph.D. (Doctor of Philosophy)
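
Illustrative note on the abstract: the difference between document-based and term-based inverted index partitioning can be made concrete with a toy example. The Python sketch below is not taken from the thesis. All function names and the sample documents are hypothetical, and the round-robin assignment is only a placeholder for the hypergraph-partitioning models the abstract describes, which additionally aim to minimize disk accesses or communication while balancing posting storage.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def partition_by_document(docs, num_procs):
    """Document-based: each processor builds a local index over its own documents.

    Assumption: documents are dealt out round-robin, purely for illustration.
    """
    shares = [dict() for _ in range(num_procs)]
    for i, doc_id in enumerate(sorted(docs)):
        shares[i % num_procs][doc_id] = docs[doc_id]
    return [build_inverted_index(share) for share in shares]


def partition_by_term(index, num_procs):
    """Term-based: each processor stores the complete posting lists of its terms.

    Assumption: terms are dealt out round-robin, purely for illustration.
    """
    parts = [dict() for _ in range(num_procs)]
    for i, term in enumerate(sorted(index)):
        parts[i % num_procs][term] = index[term]
    return parts


def posting_loads(parts):
    """Number of postings stored on each processor (the quantity to balance)."""
    return [sum(len(postings) for postings in part.values()) for part in parts]


# Hypothetical sample collection.
docs = {
    0: "parallel text retrieval",
    1: "web crawling and text indexing",
    2: "query processing over inverted index",
    3: "parallel query processing",
}

global_index = build_inverted_index(docs)
doc_parts = partition_by_document(docs, 2)
term_parts = partition_by_term(global_index, 2)

print("document-based posting loads:", posting_loads(doc_parts))
print("term-based posting loads:", posting_loads(term_parts))
```

In the document-based layout, a term such as "parallel" has a partial posting list on every processor that holds a matching document, so a query is typically sent to all processors. In the term-based layout, the full posting list for a term lives on exactly one processor, so only the processors owning the query terms take part, at the cost of shipping partial results between them.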

Files

Original bundle

Name: 0003173.pdf
Size: 2.19 MB
Format: Adobe Portable Document Format
Description: Full printable version