Browsing by Subject "Data mining."
Now showing 1 - 17 of 17
Item Open Access Application of map/reduce paradigm in supercomputing systems (2013) Demirci, Gündüz Vehbi

Map/Reduce is a framework first introduced by Google to rapidly develop big-data analytic applications on distributed computing systems. Even though the Map/Reduce paradigm has had a game-changing impact on certain fields of computer science, such as information retrieval and data mining, it has not yet had such an impact on the scientific computing domain. Current implementations of Map/Reduce are designed for commodity PC clusters, where failures of compute nodes are common and inter-processor communication is slow. However, scientific computing applications are usually executed on high performance computing (HPC) systems, which provide high communication bandwidth and low message latency and on which processor failures are rare. On such systems the standard Map/Reduce frameworks cause performance degradation and become less attractive for scientific computing. For these reasons, implementations of the Map/Reduce paradigm tailored to the scientific computing domain are needed. Among the existing implementations, we focus our attention on the MapReduce-MPI (MR-MPI) library developed at Sandia National Labs. In this thesis, we argue that by using the MR-MPI library, the Map/Reduce programming paradigm can be successfully applied to scientific computing applications that require scalability and performance. We tested the MR-MPI library on HPC systems with several fundamental algorithms that are frequently used in the scientific computing and data mining domains. The implemented algorithms include all-pairs similarity search (APSS), all-pairs shortest path (APSP), and PageRank (PR). Tests were performed on the well-known large-scale HPC systems IBM BlueGene/Q (Juqueen) and Cray XE6 (Hermit) to examine the scalability and speedup of these algorithms.

Item Open Access Aspect based opinion mining on Turkish tweets (2012) Akbaş, Esra

Understanding opinions about entities or brands is instrumental in reputation management and decision making. With the advent of social media, more people are willing to publicly share their recommendations and opinions. As the type and amount of such venues increase, automated analysis of sentiment on textual resources has become an essential data mining task. Sentiment classification aims to identify the polarity of sentiment in text. The polarity is predicted either on a binary (positive, negative) scale or on a multi-variant scale expressing the strength of the sentiment. Text often contains a mix of positive and negative sentiments, so it is often necessary to detect both simultaneously. While classifying text based on sentiment polarity is a major task, analyzing sentiment separately for each aspect can be more useful in many applications. In this thesis, we investigate the problem of mining opinions by extracting aspects of entities/topics from collections of short texts. We focus on Turkish tweets, which contain informal short messages. Most of the available resources in the opinion mining literature, such as lexicons and labeled corpora, are for the English language. Our approach would help extend sentiment analysis to other languages where such rich resources do not exist. After a set of preprocessing steps, we extract the aspects of the product(s) from the data and group the tweets based on the extracted aspects. In addition to our manually constructed Turkish opinion word list, we propose an automated generation of opinion words with their sentiment strengths using a word selection algorithm. We then propose a new representation of the text according to the sentiment strength of its words, which we refer to as sentiment-based text representation. The feature vectors of the text are constructed according to this new representation. We adapt machine learning methods to generate classifiers based on the multi-variant scale feature vectors to detect mixtures of positive and negative sentiments, and we test their performance on Turkish tweets.
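As an illustration of the sentiment-based text representation described in the abstract above, the sketch below turns a tokenized tweet into a feature vector of counts per sentiment-strength level. The lexicon entries and the -3..+3 scale are hypothetical placeholders, not the thesis's actual Turkish word list.

```python
from collections import Counter

# Hypothetical sentiment-strength lexicon (scale -3..+3); the thesis builds
# such a list for Turkish both manually and with a word-selection algorithm.
LEXICON = {"harika": 3, "guzel": 2, "iyi": 1, "kotu": -1, "berbat": -3}

STRENGTHS = [-3, -2, -1, 1, 2, 3]  # multi-variant sentiment scale

def sentiment_features(tokens):
    """Count how many tokens fall into each sentiment-strength bucket."""
    counts = Counter(LEXICON[t] for t in tokens if t in LEXICON)
    return [counts.get(s, 0) for s in STRENGTHS]

# Example: a tokenized (and preprocessed) tweet mentioning two aspects.
tweet = ["ekran", "harika", "ama", "batarya", "kotu"]
print(sentiment_features(tweet))   # -> [0, 0, 1, 0, 0, 1]
```

Vectors like these, grouped per extracted aspect, could then be fed to any standard classifier to detect mixed positive and negative sentiment.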
Item Open Access Characteristics of Web-based textual communications (2012) Küçükyılmaz, Tayfun

In this thesis, we analyze different aspects of Web-based textual communications and argue that all such communications share some common properties. To provide practical evidence for this argument, we focus on two common properties and examine them on various types of Web-based textual communication data: all Web-based communications contain features attributable to their author and receiver, and all exhibit similar heavy-tailed distributional properties. To demonstrate the validity of these claims in practice, we study three real-life research problems and exploit the proposed common properties of Web-based textual communications to find practical solutions. First, we provide a feature-based result caching framework for real-life search engines. To this end, we mine attributes from user queries in order to classify queries and estimate a quality metric that guides admission and eviction decisions for the query result cache. Second, we analyze the messages of an online chat server in order to predict user and message attributes. Our results show that several user- and message-based attributes can be predicted with significant accuracy using both chat-message-based and writing-style-based features of the chat users. Third, we provide a parallel framework for in-memory construction of term-partitioned inverted indexes. Here, in order to minimize the total communication time between processors, we provide a bucketing scheme based on the term-based distributional properties of Web page contents.

Item Open Access Data decomposition techniques for parallel tree-based k-means clustering (2002) Şen, Cenk

The main computation in k-means clustering is the distance calculations between cluster centroids and patterns. As the number of patterns and the number of centroids increase, the time needed to complete these computations grows. This computational load requires high performance computers and/or algorithmic improvements. The parallel tree-based k-means algorithm on distributed-memory machines combines algorithmic improvements with the high computation capacity of parallel computers to deal with huge datasets. Its performance is affected by the data decomposition technique used. In this thesis, we present a novel data decomposition technique to improve the performance of the parallel tree-based k-means algorithm on distributed-memory machines. The proposed tree-based decomposition techniques decrease the total number of distance calculations by assigning compact subspaces to processors; compact subspaces improve the performance of the pruning function of the tree-based k-means algorithm. We have implemented the algorithm and conducted experiments on a PC cluster. Our experimental results demonstrate that the tree-based decomposition technique outperforms the random and stripwise decomposition techniques.
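The pruning function that the decomposition above aims to strengthen can be illustrated with a generic filtering-style test on an axis-aligned box of patterns: a centroid whose best-case distance to the box is worse than another centroid's worst-case distance can never own a pattern in that box. This is a textbook sketch, not the thesis's exact pruning rule.

```python
import math

def min_dist_to_box(centroid, box_lo, box_hi):
    """Smallest possible distance from a centroid to any point inside an
    axis-aligned bounding box (box_lo[i] <= x[i] <= box_hi[i])."""
    total = 0.0
    for c, lo, hi in zip(centroid, box_lo, box_hi):
        d = max(lo - c, 0.0, c - hi)   # 0 if the coordinate lies inside the box
        total += d * d
    return math.sqrt(total)

def prune_centroids(centroids, box_lo, box_hi):
    """Keep only centroids that could own some pattern inside the box: a
    centroid is pruned if even its best case is worse than another centroid's
    worst case (distance to the farthest box corner)."""
    def max_dist(c):
        return math.sqrt(sum(max(abs(c_i - lo), abs(c_i - hi)) ** 2
                             for c_i, lo, hi in zip(c, box_lo, box_hi)))
    best_worst = min(max_dist(c) for c in centroids)
    return [c for c in centroids if min_dist_to_box(c, box_lo, box_hi) <= best_worst]

centroids = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
print(prune_centroids(centroids, (1.0, 1.0), (2.0, 2.0)))  # [(0.0, 0.0)]: distant centroids pruned
```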
Item Open Access Data distribution and performance optimization models for parallel data mining (2013) Özkural, Eray

We have embarked upon a multitude of approaches to improve the efficiency of selected fundamental tasks in data mining. The present thesis is concerned with improving the efficiency of parallel processing methods for large amounts of data. We have devised new parallel frequent itemset mining algorithms that work on both sparse and dense datasets, as well as 1-D and 2-D parallel algorithms for the all-pairs similarity problem. Two new parallel frequent itemset mining (FIM) algorithms, named NoClique and NoClique2, parallelize our sequential vertical frequent itemset mining algorithm named bitdrill and use a method based on graph partitioning by vertex separator (GPVS) to distribute and selectively replicate items. The method operates on a graph where vertices correspond to frequent items and edges correspond to frequent itemsets of size two. We show that partitioning this graph by a vertex separator is sufficient to decide a distribution of the items such that the sub-databases determined by the item distribution can be mined independently. This distribution entails an amount of data replication, which may be reduced by setting appropriate weights on vertices. The data distribution scheme is used in the design of the two new parallel frequent itemset mining algorithms. Both algorithms replicate the items that correspond to the separator; NoClique replicates the work induced by the separator, while NoClique2 computes that work collectively. Computational load balancing and minimization of redundant or collective work may be achieved by assigning appropriate load estimates to vertices. The performance is compared to another parallelization that replicates all items and to the ParDCI algorithm. We introduce another parallel FIM method using a variation of item distribution with selective item replication. We extend the GPVS model for parallel FIM proposed earlier by relaxing the condition of independent mining: instead of finding independently mined item sets, we may minimize the amount of communication and partition the candidates in a fine-grained manner. We introduce a hypergraph partitioning model of the parallel computation where vertices correspond to candidates and hyperedges correspond to items; a load estimate is assigned to each candidate through vertex weights, and item frequencies are given as hyperedge weights. The model is shown to minimize data replication and balance load accurately. We also introduce a re-partitioning model, since only a limited number of candidate levels can be generated at once, using fixed vertices to model the previous item distribution/replication. Experiments show that we improve over the higher load imbalance of the NoClique2 algorithm on the same problem instances at the cost of additional parallel overhead. For the all-pairs similarity problem, we extend recent efficient sequential algorithms to a parallel setting and obtain document-wise and term-wise parallelizations of a fast sequential algorithm, as well as an elegant combination of the two that yields a 2-D distribution of the data. Two effective algorithmic optimizations for the term-wise case are reported that make the term-wise parallelization feasible. These optimizations exploit local pruning and block processing of a number of vectors in order to decrease communication costs, the number of candidates, and communication/computation imbalance. The correctness of local pruning is proven. A recursive term-wise parallelization is also introduced. The performance of the algorithms is shown to be favorable in extensive experiments, as is the utility of the two major optimizations.
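A minimal sketch of the graph model behind the GPVS-based item distribution described above: vertices are frequent items, edges are frequent 2-itemsets, and after choosing a vertex separator each frequent pair must fall entirely within one part plus the (replicated) separator for the parts to be minable independently. The toy items, pairs, and partition below are hypothetical.

```python
# Frequent items and frequent itemsets of size two (edges of the item graph).
frequent_items = {"a", "b", "c", "d", "e"}
frequent_pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

# A hypothetical 2-way partition with vertex separator {"c"}; separator
# items are replicated on both processors.
separator = {"c"}
parts = [{"a", "b"}, {"d", "e"}]

def can_mine_independently(parts, separator, pairs):
    """Each frequent pair (an edge) must be covered by some part plus the
    separator; then the sub-database induced by each part (plus the
    replicated separator items) can be mined without communication."""
    return all(any(set(p) <= part | separator for part in parts) for p in pairs)

print(can_mine_independently(parts, separator, frequent_pairs))  # True
```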
Item Open Access Data sensitive approximate query approaches in metric spaces (2011) Dilek, Merve

Similarity searching is the task of retrieving relevant information from datasets. We are particularly interested in datasets that contain complex and unstructured data such as images, videos, audio recordings, and protein and DNA sequences. The relevant information is typically defined using one of two common query types: a range query retrieves all the objects within a specified distance of the query object, whereas a k-nearest neighbor query obtains the k database objects closest to the query object. A variety of index structures based on the notion of metric spaces have been proposed to process these two query types. The query performance of the proposed index structures has not been satisfactory, particularly for high-dimensional datasets. As a solution, various approximate similarity search methods offering users a quality/time trade-off have been proposed. The rationale is that users might be willing to sacrifice query precision to retrieve results faster. The proposed approximate searching schemes usually have strong connections to the underlying data structures, making it difficult to compare the quality of their core ideas. In this thesis, we investigate various approximation approaches to decrease the response time of similarity queries. These approaches use a variety of statistics about the dataset in order to obtain dynamic (at query time) guidance on the approximation, specific to each query object. The experiments are performed on top of a simple pivot-based index structure to minimize the effect of the index on our approximation schemes. The results show that it is possible to improve the performance/precision of the approximation based on data- and query-object-sensitive guidance.

Item Open Access Diverse sequence search and alignment (2013) Eser, Elif

Sequence similarity tools, such as BLAST, seek sequences from a database most similar to a query. They return results significantly similar to the query sequence that are typically also highly similar to each other. Most sequence analysis tasks in bioinformatics require an exploratory approach in which the initial results guide the user to new searches. However, diversity has not yet been considered an integral component of sequence search tools. Repetition in the results can be avoided by introducing non-redundancy during database construction; however, it is not feasible to dynamically set a level of non-redundancy tailored to a query sequence. We introduce the problem of diverse search and browsing in sequence databases that produces non-redundant results optimized for any given query. We define diversity measures for sequences and propose methods to obtain diverse results from current sequence similarity search tools. We also propose a new measure to evaluate the diversity of a set of sequences returned as the result of a similarity query. We evaluate the effectiveness of the proposed methods in post-processing PSI-BLAST results and assess the functional diversity of the returned results based on available Gene Ontology annotations. Our experiments show that the proposed methods achieve more diverse yet similar result sets compared to static non-redundancy approaches. In both sequence-based and functional diversity evaluations, the proposed diversification methods significantly outperform the original BLAST results. We built an online diverse sequence search tool, Div-BLAST, that supports queries using BLAST web services and re-ranks the results for diversity according to the given parameters.
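One standard way to post-process a similarity search result into a diverse subset, in the spirit of the abstract above, is a greedy re-ranking that trades off similarity to the query against similarity to already selected results. The maximal-marginal-relevance-style sketch below is illustrative and not necessarily the diversification method proposed in the thesis.

```python
def diversify(results, sim_to_query, sim_between, k, lam=0.7):
    """Greedy re-ranking: repeatedly pick the candidate with the best
    trade-off between similarity to the query and redundancy with the
    already selected set. `results` is a list of ids, `sim_to_query` maps
    id -> similarity, `sim_between` maps (id, id) -> similarity."""
    selected = []
    candidates = list(results)
    while candidates and len(selected) < k:
        def score(c):
            redundancy = max((sim_between[(c, s)] for s in selected), default=0.0)
            return lam * sim_to_query[c] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

hits = ["s1", "s2", "s3"]
sim_q = {"s1": 0.95, "s2": 0.94, "s3": 0.80}
sim_b = {("s1", "s2"): 0.99, ("s2", "s1"): 0.99,
         ("s1", "s3"): 0.10, ("s3", "s1"): 0.10,
         ("s2", "s3"): 0.12, ("s3", "s2"): 0.12}
print(diversify(hits, sim_q, sim_b, k=2))  # ['s1', 's3']: s2 is a near-duplicate of s1
```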
Item Open Access Efficient analysis of large-scale social networks using big-data platforms (2014) Aksu, Hidayet

In recent years, the rise of very large, rich-content networks re-ignited interest in complex/social network analysis at the big-data scale, which makes it possible to understand social interactions at large scale while posing computational challenges to earlier approaches whose algorithmic complexity is greater than O(n). This thesis analyzes social networks at very large scales to derive important parameters and characteristics in an efficient and effective way using big-data platforms. With the popularization of mobile phone usage, telecommunication networks have turned into a socially binding medium and enable researchers to analyze social interactions at very large scales. Degree distribution is one of the most important characteristics of social networks. To study degree characteristics and structural properties in large-scale social networks, we first gathered a tera-scale dataset of telecommunication call detail records. Using this data, we empirically evaluate several statistical models against the degree distribution of the country's call graph and determine that a Pareto log-normal distribution provides the best fit, despite claims in the literature that a power-law distribution is the best model. We also investigate how the network operator, size, density, and location affect the degree distribution, in order to understand the parameters governing it in social networks. Besides structural property analysis, community identification is of great practical interest for finding highly cohesive subnetworks about different subjects in a social network. In graph theory, the k-core is a key metric used to identify subgraphs of high cohesion, also known as the 'dense' regions of a graph. As real-world graphs such as social network graphs grow in size, their contents get richer, and their topologies change dynamically, we are challenged not only to materialize k-core subgraphs once but also to maintain them in order to keep up with continuous updates. These challenges inspired us to propose a new set of distributed algorithms for k-core view construction and maintenance on a horizontally scaling storage and computing platform. Experimental evaluation demonstrated orders-of-magnitude speedups and the advantages of maintaining k-cores incrementally and in batch windows over complete reconstruction approaches. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multiresolution community representation that has to be maintained over time. We therefore also propose distributed algorithms to construct and maintain multi-k-core graphs, implemented on the scalable big-data platform Apache HBase. Our experimental evaluation demonstrates orders-of-magnitude speedups from maintaining multi-k-cores incrementally rather than by complete reconstruction. Furthermore, we propose a graph-aware cache system designed for distributed graph processing. Experimental results demonstrate up to 15x speedup compared to traditional LRU-based cache systems.
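For reference, the k-core notion that the distributed construction and maintenance algorithms above scale up can be computed sequentially with the textbook peeling procedure: repeatedly remove vertices of degree less than k until none remain. The sketch below is this sequential baseline, not the thesis's HBase-based distributed algorithm.

```python
def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph given as
    an adjacency dict {v: set(neighbours)}. Peels low-degree vertices until
    every remaining vertex has degree >= k."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    changed = True
    while changed:
        changed = False
        for v in [v for v, nbrs in adj.items() if len(nbrs) < k]:
            for u in adj[v]:
                adj[u].discard(v)
            del adj[v]
            changed = True
    return set(adj)

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(k_core(graph, 2))   # {1, 2, 3}: vertex 4 has degree 1 and is peeled off
```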
Item Open Access Efficient k-nearest neighbor query processing in metric spaces based on precise radius estimation (2009) Şardan, Can

Similarity searching is an important problem for complex and unstructured data such as images, video, and text documents. One common solution is to approximate complex objects by feature vectors. The metric space approach, on the other hand, relies solely on a distance function between objects. No information is assumed about the internal structure of the objects, so a more general framework is provided. Methods that use metric spaces have also been shown to perform better, especially on high-dimensional data. A common query type used in similarity searching is the range query, which retrieves all neighbors in a certain area defined by a query object and a radius. Another important type, the k-nearest neighbor query, returns the k objects closest to a given query center. Such queries are more difficult to process since the distance of the kth nearest neighbor varies greatly. For that reason, some techniques estimate a radius that will return exactly k objects, reducing the computation to a range query. A major problem with these methods is that multiple passes over the index data are required if the estimate is too low. In this thesis, we propose a new framework for k-nearest neighbor search based on radius estimation in which only one sequential pass over the index data is required. We accomplish this by caching a short-list of promising candidates. We also propose several algorithms to estimate the query radius that outperform previously proposed methods. We show that our estimates are accurate enough to keep the number of promising candidates at acceptable levels.

Item Open Access Hardware acceleration of similarity queries using graphic processor units (2009) Genç, Atilla

A Graphics Processing Unit (GPU) is primarily designed for real-time rendering. In contrast to a Central Processing Unit (CPU), which has complex instructions and a limited number of pipelines, a GPU has simpler instructions and many execution pipelines to process vector data in a massively parallel fashion. In addition to its regular tasks, the GPU instruction set can be used to perform other types of general-purpose computation as well. Several frameworks, such as Brook+, ATI CAL, OpenCL, and Nvidia CUDA, have been proposed to harness the computational power of the GPU for general computing, generating interest and opportunities for accelerating different types of applications. This thesis explores ways of taking advantage of the GPU in the field of metric-space-based similarity searching. The KVP index structure has a simple organization that lends itself to parallel processing, in contrast to tree-based structures that require frequent "pointer chasing" operations. Several implementations using the general-purpose GPU programming frameworks Brook+, ATI CAL, and OpenCL on the ATI platform are provided. Experimental results of these implementations show that the GPU versions presented in this work are several times faster than the CPU versions.
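The appeal of a flat, pivot-based index such as KVP for GPU execution is that each object's row of precomputed pivot distances can be filtered independently. The sketch below shows the standard triangle-inequality filter such a table enables; it is a sequential CPU illustration under assumed index details, not the thesis's GPU kernels.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_pivot_table(objects, pivots):
    """Flat table: one row per object holding its distances to every pivot."""
    return [[dist(o, p) for p in pivots] for o in objects]

def range_query(q, r, objects, pivots, table):
    """Triangle inequality: if |d(q,pivot) - d(o,pivot)| > r for any pivot,
    object o cannot be within r of q, so the expensive distance is skipped.
    Each row is filtered independently, so rows map naturally to GPU threads."""
    q_to_p = [dist(q, p) for p in pivots]
    hits = []
    for o, row in zip(objects, table):
        if any(abs(qp - op) > r for qp, op in zip(q_to_p, row)):
            continue                      # pruned without computing dist(q, o)
        if dist(q, o) <= r:
            hits.append(o)
    return hits

objects = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 0.0)]
table = build_pivot_table(objects, pivots)
print(range_query((0.5, 0.5), 1.5, objects, pivots, table))  # [(0.0, 0.0), (1.0, 1.0)]
```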
Item Open Access Image information mining using spatial relationship constraints (2012) Karakuş, Fatih

A huge amount of data is collected by Earth observation satellites, which continuously send data to receiving stations on Earth every day. Mining this data therefore becomes increasingly important for effective processing of the collected multi-spectral images. The most popular approaches to this problem use image metadata such as geographical coordinates. However, these approaches do not offer a good solution for determining what the images actually contain. Some research takes a big step beyond metadata-based approaches by shifting the focus to content-based approaches, such as utilizing the region information of the sensed images. In this thesis, we propose a novel, generic, and extensible image information mining system that uses spatial relationship constraints. In this system, we use not only the region content but also the relationships between regions. First, we extract the region information of the images and then extract pairwise relationship information for those regions, such as left, right, above, below, near, far, and distance. This feature extraction process is defined generically, independent of how the region segmentation is obtained. In addition, since image information mining researchers continuously develop new features and approaches, extensibility plays a major role in the design of our system. We also propose a novel feature vector structure in which a feature vector consists of several sub-feature vectors. Each sub-feature vector can be selectively included in the search process, and each can have its own distance metric for comparisons with the corresponding sub-feature vector of other feature vectors. The system therefore lets users choose which information about a region and its pairwise relationships with other regions is used when they perform a search. The proposed system is illustrated with region-based retrieval scenarios on very high spatial resolution satellite images.

Item Open Access Improving the performance of similarity joins using graphics processing unit (2012) Korkmaz, Zeynep

The similarity join is an important operation in data mining and is used in many applications from varying domains. A similarity join operator takes one or two sets of data points and outputs pairs of points whose distance in the data space is within a certain threshold value, ε. The baseline nested-loop approach computes the distances between all pairs of objects. For large sets of objects, which yield prohibitively long query times under the nested-loop paradigm, accelerating this operator becomes more important. The computing capability of recent GPUs, with the help of a general-purpose parallel computing architecture (CUDA), has attracted many researchers. With this motivation, we propose two similarity join algorithms for the Graphics Processing Unit (GPU). To exploit the advantages of general-purpose GPU computing, we first propose an improved nested-loop join algorithm (GPU-INLJ) for the specific environment of the GPU. We also present a partitioning-based join algorithm (KMEANS-JOIN) that guarantees each partition can be joined independently without missing any join pair. Our experiments demonstrate massive performance gains and the suitability of our algorithms for large datasets.
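The baseline nested-loop similarity join that GPU-INLJ accelerates can be written in a few lines; a GPU version essentially evaluates the body of the outer loop for many points in parallel. The reference sketch below is sequential and purely illustrative.

```python
import math

def eps_join(set_a, set_b, eps):
    """Return all pairs (a, b) whose Euclidean distance is at most eps.
    This is the O(|A|*|B|) nested-loop baseline; a GPU kernel would assign
    one thread (or tile) per outer point and evaluate inner points in parallel."""
    pairs = []
    for a in set_a:
        for b in set_b:
            d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
            if d <= eps:
                pairs.append((a, b))
    return pairs

A = [(0.0, 0.0), (3.0, 4.0)]
B = [(0.5, 0.0), (10.0, 10.0)]
print(eps_join(A, B, eps=1.0))   # [((0.0, 0.0), (0.5, 0.0))]
```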
Item Open Access Mining web images for concept learning (2014-08) Golge, Eren

We attack the problem of learning concepts automatically from noisy Web image search results. The idea is based on discovering common characteristics shared among category images by proposing two novel methods that organise the data while eliminating irrelevant instances. We propose a novel clustering and outlier detection method, namely Concept Map (CMAP). Given an image collection returned for a concept query, CMAP provides clusters pruned of outliers, and each cluster is used to train a model representing a different characteristic of the concept. The other method is Association through Model Evolution (AME). It prunes the data iteratively and progressively finds a better set of images, with an evaluation score computed at each iteration. The idea is based on capturing the discriminativeness and representativeness of each instance against a large number of random images and eliminating the outliers. The final model is used for the classification of novel images. These two methods are applied to different benchmark problems, and we observe compelling or better results compared to state-of-the-art methods.

Item Open Access Modeling interestingness of streaming association rules as a benefit maximizing classification problem (2009) Aydın, Tolga

In a typical application of association rule learning from market basket data, a set of transactions for a fixed period of time is used as input to rule learning algorithms. For example, the well-known Apriori algorithm can be applied to learn a set of association rules from such a transaction set. However, learning association rules from a set of transactions is not a one-time-only process. For example, a market manager may perform the association rule learning process once every month over the set of transactions collected through the previous month. For this reason, we consider the problem where transaction sets are input to the system as a stream of packages. The sets of transactions may come in varying sizes and at varying periods. Once a set of transactions arrives, the association rule learning algorithm is run on the latest set of transactions, resulting in a new set of association rules. Therefore, the set of association rules learned accumulates and increases in number over time, making the mining of interesting rules out of this growing set impractical for human experts. We refer to this sequence of rules as an "association rule set stream" or "streaming association rules", and the main motivation behind this research is to develop a technique to overcome the interesting rule selection problem. A successful association rule mining system should select and present only the interesting rules to the domain experts. However, the definition of interestingness of association rules in a given domain usually differs from one expert to another, and also over time for a given expert. In this thesis, we propose a post-processing method to learn a subjective model for the interestingness concept description of streaming association rules. The uniqueness of the proposed method is its ability to formulate the interestingness of association rules as a benefit-maximizing classification problem and to obtain a different interestingness model for each user. In this classification scheme, the determining features are selected objective interestingness factors related to the interestingness of the association rules, including the rule's content itself, and the target feature is the interestingness label of those rules. The proposed method works incrementally and employs user interactivity at a certain level. It is evaluated on a real supermarket dataset, and the results show that the model can successfully select the interesting rules.
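The benefit-maximizing classification step described above can be illustrated with a small decision rule: given probability estimates for the interestingness labels and a user-specific benefit matrix, predict the label with the highest expected benefit. The payoff values below are hypothetical.

```python
def expected_benefit_label(probs, benefit):
    """probs: {true_label: probability}; benefit[predicted][true]: payoff of
    predicting `predicted` when the true label is `true`. Returns the
    prediction maximizing expected benefit rather than raw probability."""
    def expected(pred):
        return sum(probs[t] * benefit[pred][t] for t in probs)
    return max(benefit, key=expected)

# Hypothetical payoffs: showing a boring rule to the expert is cheap, while
# hiding a truly interesting rule is costly.
benefit = {
    "interesting":   {"interesting": 10.0, "uninteresting": -1.0},
    "uninteresting": {"interesting": -8.0, "uninteresting":  1.0},
}
probs = {"interesting": 0.3, "uninteresting": 0.7}
print(expected_benefit_label(probs, benefit))   # 'interesting', despite p = 0.3
```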
Item Open Access Parallel sequence mining on distributed-memory systems (2001) Karapınar, Embiya

Discovering all the frequent sequences in very large databases is a time-consuming task. Large databases force us to partition the original database into chunks of data that can be processed in main memory. Most current algorithms require as many database scans as the length of the longest frequent sequence. Spade is a fast algorithm that reduces the number of database scans to three by using a lattice-theoretic approach to decompose the original problem into small pieces (equivalence classes) that can be processed independently in main memory. In this thesis, we present dSpade, a parallel algorithm based on Spade for discovering the set of all frequent sequences, targeting distributed-memory systems. In dSpade, a horizontal database partitioning method is used, in which each processor stores an equal number of customer transactions. dSpade is synchronous while discovering frequent 1-sequences (F1) and frequent 2-sequences (F2): each processor performs the same computation on its local data to obtain local support counts and broadcasts the results to the other processors to find the globally frequent sequences during F1 and F2 computation. After discovering all F1 and F2, all frequent sequences are inserted into the lattice to decompose the original problem into equivalence classes. Equivalence classes are mapped with a greedy heuristic to the least loaded processors in a round-robin manner. Finally, each processor asynchronously computes Fk on its mapped equivalence classes to find all frequent sequences. We present the results of performance experiments conducted on a 32-node Beowulf cluster. The experiments show that dSpade delivers good speedup and scales linearly in the database size.

Item Open Access Prescription fraud detection via data mining: a methodology proposal (2009) Aral, Karca Duru

Fraud is the illegitimate act of violating regulations in order to gain personal profit. These kinds of violations are seen in many important areas, including healthcare, computer networks, credit card transactions, and communications. Every year, health care fraud causes considerable losses to social security agencies and insurance companies in many countries, including Turkey and the USA. This kind of crime often seems victimless to the perpetrators; nonetheless, the fraudulent chain between pharmaceutical companies, health care providers, patients, and pharmacies not only burdens the health care system financially but also greatly hinders its ability to provide legitimate patients with quality care. One of the biggest issues in health care fraud is prescription fraud. This thesis aims to identify a data mining methodology to detect fraudulent prescriptions in a large prescription database, a task traditionally conducted by human experts. For this purpose, we have developed a customized data mining model for prescription fraud detection. We employ data mining methodologies to assign a risk score to prescriptions based on prescribed medicament-diagnosis consistency, the consistency of prescribed medicaments within a prescription, prescribed medicament-age and sex consistency, and diagnosis-cost consistency. Our proposed model has been tested on real-world data. The results of our experiments reveal that the proposed model works considerably well for the prescription fraud detection problem, with a 77.4% true positive rate. We conclude that incorporating such a system into social security agencies would radically decrease human-expert auditing costs and increase efficiency.
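A toy version of the risk-scoring idea above: each consistency check contributes a sub-score, and a weighted combination gives the prescription's overall risk. The reference tables, weights, and sub-score rules here are hypothetical placeholders, not the model developed in the thesis.

```python
# Hypothetical reference data: which drugs are consistent with which
# diagnoses and age ranges. A real system would mine these from data.
DRUG_FOR_DIAGNOSIS = {"drugA": {"flu"}, "drugB": {"diabetes"}}
DRUG_AGE_RANGE = {"drugA": (12, 99), "drugB": (18, 99)}

WEIGHTS = {"drug_diagnosis": 0.4, "drug_age": 0.3, "cost": 0.3}

def risk_score(prescription):
    """Combine per-check inconsistency scores (0 = consistent, 1 = suspicious)
    into a single risk score in [0, 1]."""
    scores = {}
    scores["drug_diagnosis"] = float(
        prescription["diagnosis"] not in DRUG_FOR_DIAGNOSIS.get(prescription["drug"], set()))
    lo, hi = DRUG_AGE_RANGE.get(prescription["drug"], (0, 120))
    scores["drug_age"] = float(not lo <= prescription["age"] <= hi)
    scores["cost"] = min(prescription["cost"] / 1000.0, 1.0)  # crude cost check
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

rx = {"drug": "drugB", "diagnosis": "flu", "age": 9, "cost": 250.0}
print(round(risk_score(rx), 3))   # 0.775: inconsistent diagnosis and age
```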
Item Open Access Software design, implementation, application, and refinement of a Bayesian approach for the assessment of content and user qualities (2011) Türk, Melihcan

The internet provides unlimited access to vast amounts of information. Technical innovations and internet coverage allow more and more people to supply content for the web. As a result, there is a great deal of material that is either inaccurate or out of date, making it increasingly difficult to find relevant and up-to-date content. To address this problem, recommender systems based on collaborative filtering have been introduced. These systems cluster users based on their past preferences and suggest relevant content according to user similarities. Trust-based recommender systems also consider the trust level of users in addition to their past preferences, since some users may not be trustworthy in certain categories even though they are trustworthy in others. Content quality levels are important for presenting the most current and relevant content to users. The study presented here is based on a model that combines the concepts of content quality and user trust. According to this model, the quality level of content cannot be properly determined without considering the quality levels of the evaluators. The model uses a Bayesian approach, which allows the simultaneous co-evaluation of evaluators and content, as well as the calculation of updated quality values over time. In this thesis, the model is further refined and configurable software is implemented in order to assess the quality of users and content on the web. Experiments were performed on a movie data set, and the results showed that the Bayesian co-evaluation approach performed more effectively than a classical approach that does not consider user qualities. The approach also succeeded in classifying users according to their expertise level.
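The co-evaluation idea behind the Bayesian model above can be conveyed with a much simpler fixed-point iteration: content quality is a rating average weighted by rater quality, and rater quality is updated from how closely the rater agrees with the current consensus. This is an illustrative simplification, not the thesis's Bayesian formulation.

```python
def co_evaluate(ratings, iterations=20):
    """ratings: {(user, item): rating in [0, 1]}. Returns (user_quality,
    item_quality) dicts produced by alternating weighted averaging."""
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    user_q = {u: 1.0 for u in users}
    item_q = {i: 0.5 for i in items}
    for _ in range(iterations):
        # Item quality: ratings weighted by the current quality of the raters.
        for i in items:
            rs = [(user_q[u], r) for (u, it), r in ratings.items() if it == i]
            total = sum(w for w, _ in rs)
            item_q[i] = sum(w * r for w, r in rs) / total if total else 0.5
        # User quality: closeness of the user's ratings to the consensus.
        for u in users:
            errs = [abs(r - item_q[i]) for (us, i), r in ratings.items() if us == u]
            user_q[u] = 1.0 - sum(errs) / len(errs) if errs else 1.0
    return user_q, item_q

ratings = {("alice", "m1"): 0.9, ("bob", "m1"): 0.8, ("troll", "m1"): 0.1}
uq, iq = co_evaluate(ratings)
print(round(iq["m1"], 2), round(uq["troll"], 2))  # consensus stays high; the outlier's quality drops
```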