Data distribution and performance optimization models for parallel data mining

buir.advisor: Aykanat, Cevdet
dc.contributor.author: Özkural, Eray
dc.date.accessioned: 2016-01-08T20:02:44Z
dc.date.available: 2016-01-08T20:02:44Z
dc.date.issued: 2013
dc.department: Department of Computer Engineering
dc.description: Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2013.
dc.description: Thesis (Ph.D.) -- Bilkent University, 2013.
dc.description: Includes bibliographical references (leaves 117-128).
dc.description.abstract: We have pursued a number of approaches to improving the efficiency of selected fundamental tasks in data mining. This thesis is concerned with improving the efficiency of parallel processing methods for large amounts of data. We have devised new parallel frequent itemset mining algorithms that work on both sparse and dense datasets, as well as 1-D and 2-D parallel algorithms for the all-pairs similarity problem.

Two new parallel frequent itemset mining (FIM) algorithms, named NoClique and NoClique2, parallelize our sequential vertical frequent itemset mining algorithm bitdrill and use a method based on graph partitioning by vertex separator (GPVS) to distribute and selectively replicate items. The method operates on a graph whose vertices correspond to frequent items and whose edges correspond to frequent itemsets of size two. We show that partitioning this graph by a vertex separator is sufficient to decide a distribution of the items such that the sub-databases determined by the item distribution can be mined independently. This distribution entails an amount of data replication, which may be reduced by assigning appropriate weights to vertices. The data distribution scheme is used in the design of the two algorithms, both of which replicate the items that correspond to the separator. NoClique replicates the work induced by the separator, while NoClique2 computes the same work collectively. Computational load balancing and minimization of redundant or collective work may be achieved by assigning appropriate load estimates to vertices. The performance is compared to another parallelization that replicates all items and to the ParDCI algorithm.

We introduce another parallel FIM method that uses a variation of item distribution with selective item replication. We extend the GPVS model for parallel FIM proposed earlier by relaxing the condition of independent mining: instead of finding independently mined item sets, we may minimize the amount of communication and partition the candidates in a fine-grained manner. We introduce a hypergraph partitioning model of the parallel computation in which vertices correspond to candidates and hyperedges correspond to items; a load estimate is assigned to each candidate via vertex weights, and item frequencies are given as hyperedge weights. The model is shown to minimize data replication and balance load accurately. Since only a limited number of candidate levels can be generated at once, we also introduce a re-partitioning model that uses fixed vertices to model the previous item distribution and replication. Experiments show that we improve over the higher load imbalance of the NoClique2 algorithm on the same problem instances, at the cost of additional parallel overhead.

For the all-pairs similarity problem, we extend recent efficient sequential algorithms to a parallel setting and obtain document-wise and term-wise parallelizations of a fast sequential algorithm, as well as an elegant combination of two algorithms that yields a 2-D distribution of the data. We report two effective algorithmic optimizations for the term-wise case that make the term-wise parallelization feasible. These optimizations exploit local pruning and block processing of a number of vectors in order to decrease communication costs, the number of candidates, and communication/computation imbalance; the correctness of local pruning is proven. A recursive term-wise parallelization is also introduced. Extensive experiments show that the algorithms perform favorably and demonstrate the utility of the two major optimizations.
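The GPVS-based item distribution described in the abstract lends itself to a small illustration. The following Python sketch is not taken from the thesis; the item names, the helper item_distribution, and the greedy two-way merge are illustrative assumptions. It builds the graph whose vertices are frequent items and whose edges are frequent 2-itemsets, checks that a given vertex separator disconnects it, and returns two item sets with the separator items replicated to both, which is the property that allows the corresponding sub-databases to be mined independently.

    # Illustrative sketch (not the thesis code): derive a 2-way item
    # distribution from a candidate vertex separator of the 2-itemset graph,
    # replicating separator items to both parts as in the GPVS-based scheme.
    # A real GPVS tool additionally balances part weights and minimizes the
    # (weighted) separator; this toy code does neither.
    from collections import defaultdict

    def item_distribution(items, pairs, separator):
        """Return two item sets (each including the replicated separator),
        or raise if `separator` does not disconnect the graph."""
        adj = defaultdict(set)
        for a, b in pairs:                      # edges = frequent 2-itemsets
            if a not in separator and b not in separator:
                adj[a].add(b)
                adj[b].add(a)

        remaining = set(items) - set(separator)
        components = []
        while remaining:                        # connected components by BFS
            stack = [next(iter(remaining))]
            comp = set()
            while stack:
                v = stack.pop()
                if v in comp:
                    continue
                comp.add(v)
                stack.extend(adj[v] - comp)
            components.append(comp)
            remaining -= comp

        if len(components) < 2:
            raise ValueError("not a separator: graph stays connected")

        # Greedily merge components into two parts (a stand-in for the
        # balanced partition a real GPVS produces), then replicate the
        # separator items into both parts.
        parts = [set(), set()]
        for comp in sorted(components, key=len, reverse=True):
            min(parts, key=len).update(comp)
        return [p | set(separator) for p in parts]

    if __name__ == "__main__":
        items = ["a", "b", "c", "d", "e"]
        pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
        # {"c"} separates {a, b} from {d, e}; "c" is replicated to both parts.
        print(item_distribution(items, pairs, {"c"}))

In this toy run the two processors would mine the sub-databases projected onto {a, b, c} and {c, d, e}; only the separator item c is stored twice, which is the replication cost the vertex weights in the abstract are meant to reduce.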
dc.description.degree: Ph.D.
dc.description.statementofresponsibility: Özkural, Eray
dc.format.extent: xv, 128 leaves, tables, graphics
dc.identifier.uri: http://hdl.handle.net/11693/16897
dc.language.iso: English
dc.publisher: Bilkent University
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: parallel data mining
dc.subject: graph partitioning by vertex separator
dc.subject: hypergraph partitioning
dc.subject: all pairs similarity
dc.subject: data distribution
dc.subject: data replication
dc.subject.lcc: QA76.9.D343 O951 2013
dc.subject.lcsh: Data mining.
dc.subject.lcsh: Parallel processing (Electronic computers)
dc.subject.lcsh: Partitions (Mathematics)
dc.subject.lcsh: Hypergraphs.
dc.title: Data distribution and performance optimization models for parallel data mining
dc.type: Thesis

Files

Original bundle
Name: 0006754.pdf
Size: 1.37 MB
Format: Adobe Portable Document Format