Efficient parallel frequency mining based on a novel top-down partitioning scheme for transactional data

Özkural, Eray2016-01-082016-01-0820022002http://hdl.handle.net/11693/15419Cataloged from PDF version of article.Includes bibliographical references (leaves 91-99).In recent years, large quantities of data have been amassed with advances in data acquisition capabilities. Automated detection of useful information is required for vast data obtained from scientific and business domains. Data Mining is the application of efficient algorithmic solutions on a variety of immense data for such knowledge discovery. Frequency mining discovers all frequent patterns in a transaction or relational database and it comprises the core of several data mining algorithms such as association rule mining and sequence mining. Frequent pattern discovery has become a challenge for parallel programming since it is a highly complex operation on huge datasets demanding efficient and scalable algorithms. In this thesis, we propose a new family of parallel frequency mining algo rithms. We introduce a novel transaction set partitioning scheme that can be used to divide the frequency mining task in a top-down fashion. The method op erates on the graph of frequent patterns with length two (Gp2) from which a graph partitioning by vertex separator (GPVS) is mapped to a two-way partitioning on the transaction set. The two parts obtained can be mined independently and therefore can be utilized for concurrency. In order for this property to hold, there is an amount of replication dictated by the separator in Gp2 which is minimized by the GPVS algorithm. A k-way partitioning is derived from recursive applica tion of 2- way partitioning scheme which is used in the design of a generic parallel frequency mining algorithm. First we compute Gp2 in parallel, succeeding that we designate a k-way partitioning of the database for k processors with a parallel IVrecursive procedure. The database is redistributed such, that each processor is as signed one part. Subsequent mining proceeds simultaneously and independently at each processor with a given serial mining algorithm. A complete implemen tation in which we employ FP- Growth as the sequential algorithm has been achieved. The performance study of the algorithm on a Beowulf system demon strates favorable performance for synthetic databases. For hard instances of the problem, we have gained approximately twice the speedup of a state-of-the-art algorithm. We also present a correction and optimization to FP- Growth algorithm.xvii, 106 leaves ; 30 cm.Englishinfo:eu-repo/semantics/openAccessParallel data miningFrequency miningEfficient parallel frequency mining based on a novel top-down partitioning scheme for transactional dataYeni bir işlem verisi parçalama şeması tabanlı etkin bir paralel frekans taramaThesisBILKUTUPB062533