Data distribution and performance optimization models for parallel data mining

buir.advisor: Aykanat, Cevdet
dc.contributor.author: Özkural, Eray
dc.date.accessioned: 2016-01-08T20:02:44Z
dc.date.available: 2016-01-08T20:02:44Z
dc.date.issued: 2013
dc.department: Department of Computer Engineering
dc.description: Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2013.
dc.description: Thesis (Ph.D.) -- Bilkent University, 2013.
dc.description: Includes bibliographical references (leaves 117-128).
dc.description.abstract: We have pursued a number of approaches to improving the efficiency of selected fundamental tasks in data mining. This thesis is concerned with improving the efficiency of parallel processing methods for large amounts of data. We have devised new parallel frequent itemset mining algorithms that work on both sparse and dense datasets, as well as 1-D and 2-D parallel algorithms for the all-pairs similarity problem.

Two new parallel frequent itemset mining (FIM) algorithms, named NoClique and NoClique2, parallelize our sequential vertical frequent itemset mining algorithm bitdrill and use a method based on graph partitioning by vertex separator (GPVS) to distribute and selectively replicate items. The method operates on a graph whose vertices correspond to frequent items and whose edges correspond to frequent itemsets of size two. We show that partitioning this graph by a vertex separator is sufficient to decide a distribution of the items such that the sub-databases determined by the item distribution can be mined independently. This distribution entails an amount of data replication, which may be reduced by assigning appropriate weights to vertices. The data distribution scheme is used in the design of the two algorithms, both of which replicate the items that correspond to the separator. NoClique replicates the work induced by the separator, while NoClique2 computes the same work collectively. Computational load balancing and minimization of redundant or collective work may be achieved by assigning appropriate load estimates to vertices. The performance is compared to another parallelization that replicates all items and to the ParDCI algorithm.

We introduce another parallel FIM method that uses a variation of item distribution with selective item replication. We extend the GPVS model for parallel FIM proposed earlier by relaxing the condition of independent mining: instead of finding independently mined item sets, we may minimize the amount of communication and partition the candidates in a fine-grained manner. We introduce a hypergraph partitioning model of the parallel computation in which vertices correspond to candidates and hyperedges correspond to items; a load estimate is assigned to each candidate via vertex weights, and item frequencies are given as hyperedge weights. The model is shown to minimize data replication and balance load accurately. Since only a limited number of candidate levels can be generated at once, we also introduce a re-partitioning model that uses fixed vertices to model the previous item distribution and replication. Experiments show that we improve over the higher load imbalance of the NoClique2 algorithm on the same problem instances, at the cost of additional parallel overhead.

For the all-pairs similarity problem, we extend recent efficient sequential algorithms to a parallel setting and obtain document-wise and term-wise parallelizations of a fast sequential algorithm, as well as an elegant combination of two algorithms that yields a 2-D distribution of the data. We report two effective algorithmic optimizations for the term-wise case that make the term-wise parallelization feasible. These optimizations exploit local pruning and block processing of a number of vectors in order to decrease communication costs, the number of candidates, and communication/computation imbalance; the correctness of local pruning is proven. A recursive term-wise parallelization is also introduced. Extensive experiments show that the algorithms perform favorably and demonstrate the utility of the two major optimizations.
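The GPVS-based item distribution described in the abstract lends itself to a small illustration. The following Python sketch is not taken from the thesis; the item names, the helper item_distribution, and the greedy two-way merge are illustrative assumptions. It builds the graph whose vertices are frequent items and whose edges are frequent 2-itemsets, checks that a given vertex separator disconnects it, and returns two item sets with the separator items replicated to both, which is the property that allows the corresponding sub-databases to be mined independently.

    # Illustrative sketch (not the thesis code): derive a 2-way item
    # distribution from a candidate vertex separator of the 2-itemset graph,
    # replicating separator items to both parts as in the GPVS-based scheme.
    # A real GPVS tool additionally balances part weights and minimizes the
    # (weighted) separator; this toy code does neither.
    from collections import defaultdict

    def item_distribution(items, pairs, separator):
        """Return two item sets (each including the replicated separator),
        or raise if `separator` does not disconnect the graph."""
        adj = defaultdict(set)
        for a, b in pairs:                      # edges = frequent 2-itemsets
            if a not in separator and b not in separator:
                adj[a].add(b)
                adj[b].add(a)

        remaining = set(items) - set(separator)
        components = []
        while remaining:                        # connected components by BFS
            stack = [next(iter(remaining))]
            comp = set()
            while stack:
                v = stack.pop()
                if v in comp:
                    continue
                comp.add(v)
                stack.extend(adj[v] - comp)
            components.append(comp)
            remaining -= comp

        if len(components) < 2:
            raise ValueError("not a separator: graph stays connected")

        # Greedily merge components into two parts (a stand-in for the
        # balanced partition a real GPVS produces), then replicate the
        # separator items into both parts.
        parts = [set(), set()]
        for comp in sorted(components, key=len, reverse=True):
            min(parts, key=len).update(comp)
        return [p | set(separator) for p in parts]

    if __name__ == "__main__":
        items = ["a", "b", "c", "d", "e"]
        pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
        # {"c"} separates {a, b} from {d, e}; "c" is replicated to both parts.
        print(item_distribution(items, pairs, {"c"}))

In this toy run the two processors would mine the sub-databases projected onto {a, b, c} and {c, d, e}; only the separator item c is stored twice, which is the replication cost the vertex weights in the abstract are meant to reduce.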
dc.description.degree: Ph.D.
dc.description.statementofresponsibility: Özkural, Eray
dc.format.extent: xv, 128 leaves, tables, graphics
dc.identifier.uri: http://hdl.handle.net/11693/16897
dc.language.iso: English
dc.publisher: Bilkent University
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: parallel data mining
dc.subject: graph partitioning by vertex separator
dc.subject: hypergraph partitioning
dc.subject: all pairs similarity
dc.subject: data distribution
dc.subject: data replication
dc.subject.lcc: QA76.9.D343 O951 2013
dc.subject.lcsh: Data mining.
dc.subject.lcsh: Parallel processing (Electronic computers)
dc.subject.lcsh: Partitions (Mathematics)
dc.subject.lcsh: Hypergraphs.
dc.title: Data distribution and performance optimization models for parallel data mining
dc.type: Thesis

Files

Original bundle
Name: 0006754.pdf
Size: 1.37 MB
Format: Adobe Portable Document Format