Page-to-processor assignment techniques for parallel crawlers

buir.advisor: Aykanat, Cevdet
dc.contributor.author: Türk, Ata
dc.date.accessioned: 2016-07-01T11:01:32Z
dc.date.available: 2016-07-01T11:01:32Z
dc.date.issued: 2004
dc.description: Cataloged from PDF version of article. [en_US]
dc.description.abstract: In less than a decade, the World Wide Web has evolved from a research project into a cultural phenomenon that affects almost every facet of our society. The growing popularity and usage of the Web has demanded more efficient information retrieval techniques. Crawling is one such technique and is used by search engines, web portals, and web caches. A crawler is a program that downloads and stores web pages, generally to feed a search engine or a web repository. To be of use to its target applications, a crawler must download huge amounts of data in a reasonable amount of time. Generally, the high download rates required for efficient crawling cannot be achieved by single-processor systems. Thus, existing large-scale applications use multiple parallel processors to solve the crawling problem. Apart from classical parallelization issues such as load balancing and minimizing communication overhead, parallel crawling poses problems such as overlap avoidance and early retrieval of high-quality pages. This thesis addresses the parallelization of the crawling task, and its major contribution is in the partitioning (page-to-processor assignment) techniques applied in parallel crawlers. We propose two new page-to-processor assignment techniques, based on graph and hypergraph partitioning, which respectively minimize the total communication volume and the number of messages while balancing the storage load and the page download requests of the processors. We implemented the proposed models, and our empirical findings support the theoretical approach. We also implemented an efficient parallel crawler that uses the proposed models. [en_US]
dc.description.provenance: Made available in DSpace on 2016-07-01T11:01:32Z (GMT). No. of bitstreams: 1. 0002710.pdf: 442889 bytes, checksum: 3d47c3ca7d8c5b866e98820c6b8db4cb (MD5). Previous issue date: 2004 [en]
dc.description.statementofresponsibility: Türk, Ata [en_US]
dc.format.extent: xi, 58 leaves [en_US]
dc.identifier.itemid: BILKUTUPB084166
dc.identifier.uri: http://hdl.handle.net/11693/29569
dc.language.iso: English [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: Parallel crawling [en_US]
dc.subject: page assignment [en_US]
dc.subject: hypergraph partitioning [en_US]
dc.subject: graph partitioning [en_US]
dc.subject.lcc: TK7874 .T87 2004 [en_US]
dc.subject.lcsh: Integrated circuits Very large scale. [en_US]
dc.title: Page-to-processor assignment techniques for parallel crawlers [en_US]
dc.type: Thesis [en_US]
thesis.degree.discipline: Computer Engineering
thesis.degree.grantor: Bilkent University
thesis.degree.level: Master's
thesis.degree.name: MS (Master of Science)
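
Illustrative note (not part of the catalog record or the thesis): the abstract names the quantities that a page-to-processor assignment trades off, namely communication between processors and per-processor load. The sketch below is a minimal, hypothetical way to measure such quantities for a given assignment. The function name `evaluate_assignment`, the toy link list, and the use of cross-processor link count and communicating processor pairs as stand-ins for "communication volume" and "number of messages" are all assumptions made for illustration; the thesis's actual graph and hypergraph models may define these costs differently.

```python
from collections import Counter


def evaluate_assignment(links, assignment, num_processors):
    """Score a page-to-processor assignment on a toy link graph.

    links          -- iterable of (source_page, target_page) hyperlinks
    assignment     -- dict mapping each page to a processor id
    num_processors -- number of crawler processors

    Returns (cut_links, processor_pairs, loads). These are assumed proxies,
    not the cost functions defined in the thesis.
    """
    cut_links = 0      # links whose endpoints sit on different processors
    pairs = set()      # ordered (sender, receiver) processor pairs that must talk
    for src, dst in links:
        p, q = assignment[src], assignment[dst]
        if p != q:
            cut_links += 1
            pairs.add((p, q))

    # Pages per processor, a simple stand-in for storage/download load.
    counts = Counter(assignment.values())
    loads = [counts.get(p, 0) for p in range(num_processors)]
    return cut_links, len(pairs), loads


# Hypothetical four-page web graph split across two processors.
links = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
cut_links, processor_pairs, loads = evaluate_assignment(links, assignment, 2)
print(cut_links, processor_pairs, loads)
```

A partitioner would search for the assignment that keeps the load list balanced while lowering the first or second number, depending on which cost it targets.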

Files

Original bundle

Name: 0002710.pdf
Size: 432.51 KB
Format: Adobe Portable Document Format
Description: Full printable version