Browsing by Subject "Data Mining"

Open Access
Automated construction of fuzzy event sets and its application to active databases
(IEEE, 2001) Saygin, Y.; Ulusoy, Özgür
Research on fuzzy sets and fuzzy logic aims to bridge the gap between the crisp world of mathematics and the real world. Fuzzy set theory has been applied to many different areas, from control to databases. The number of events in an event-driven system may become very high and unmanageable. It is therefore useful to organize the events into fuzzy event sets, which also brings in the benefits of fuzzy set theory. All the events that have occurred in a system can be stored in event histories, which contain valuable hidden information. In this paper, we propose a method for the automated construction of fuzzy event sets out of event histories via data mining techniques. The useful information hidden in the event history is extracted into a matrix called the sequential proximity matrix. This matrix shows the proximities of events, and it is used for fuzzy rule execution via similarity-based event detection and for the construction of fuzzy event sets. Our application platform is active databases. We describe how fuzzy event sets can be exploited for similarity-based event detection and fuzzy rule execution in active database systems.
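The abstract does not define the sequential proximity matrix, so the following is only an illustrative sketch under an assumed definition: the proximity of event `a` to event `b` is taken to be the share of occurrences of `a` that are followed by `b` within the next `window` events. The function names, the `window` and `threshold` parameters, and the sample history are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def sequential_proximity(history, window=2):
    """Assumed proximity: share of occurrences of `a` that are followed by `b`
    within the next `window` events. This is NOT the paper's definition."""
    follows = defaultdict(lambda: defaultdict(int))
    occurrences = defaultdict(int)
    for i, a in enumerate(history):
        occurrences[a] += 1
        # Count each distinct follower at most once per occurrence of `a`.
        for b in set(history[i + 1 : i + 1 + window]):
            follows[a][b] += 1
    return {
        a: {b: n / occurrences[a] for b, n in row.items()}
        for a, row in follows.items()
    }


def fuzzy_event_set(proximity, seed, threshold=0.5):
    """Events whose proximity to `seed` meets the threshold, with that
    proximity used as a (hypothetical) membership grade."""
    members = {seed: 1.0}
    for b, grade in proximity.get(seed, {}).items():
        if b != seed and grade >= threshold:
            members[b] = grade
    return members


# Hypothetical event history, oldest event first.
history = ["login", "query", "update", "login", "query", "logout"]
prox = sequential_proximity(history, window=1)
query_set = fuzzy_event_set(prox, "query", threshold=0.5)
```

Under this assumed definition, `login` is always followed by `query`, while `query` is followed by `update` and by `logout` half of the time each, so both end up in the fuzzy set seeded by `query` with grade 0.5.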
Open Access
Clustered linear regression
(Elsevier, 2002) Ari, B.; Güvenir, H. A.
Clustered linear regression (CLR) is a new machine learning algorithm that improves the accuracy of classical linear regression by partitioning the training space into subspaces. CLR makes some assumptions about the domain and the data set. First, the target value is assumed to be a function of the feature values. Second, a linear approximation of this function is assumed to exist in each subspace. Finally, there are assumed to be enough training instances to determine the subspaces and their linear approximations successfully. Tests indicate that if these assumptions hold, CLR outperforms all other well-known machine learning algorithms. Partitioning may continue until the linear approximation fits all the instances in the training set, which generally occurs when the number of instances in a subspace is less than or equal to the number of features plus one. Otherwise, each new subspace will have a better-fitting linear approximation. However, this will cause overfitting and give less accurate results for the test instances. The stopping condition can be defined as no significant decrease, or an increase, in relative error. CLR uses a small portion of the training instances to determine the number of subspaces. The need for a large number of training instances makes this algorithm suitable for data mining applications. © 2002 Elsevier Science B.V. All rights reserved.
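The core idea of fitting a separate linear approximation in each subspace can be sketched as follows. This is a minimal illustration, not the CLR algorithm itself: it uses a single, hand-picked split on one feature, whereas the abstract describes CLR determining the number of subspaces from the training data. All function names and the toy data are assumptions made for the example.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit of y ≈ [X, 1] @ w (slope terms plus intercept)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict_linear(w, X):
    return np.column_stack([X, np.ones(len(X))]) @ w


def fit_partitioned(X, y, feature, threshold):
    """Fit one linear model on each side of a single, hand-picked split."""
    left = X[:, feature] <= threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": fit_linear(X[left], y[left]),
        "right": fit_linear(X[~left], y[~left]),
    }


def predict_partitioned(model, X):
    left = X[:, model["feature"]] <= model["threshold"]
    out = np.empty(len(X))
    out[left] = predict_linear(model["left"], X[left])
    out[~left] = predict_linear(model["right"], X[~left])
    return out


# Toy target that is linear on each side of zero but not globally.
X = np.linspace(-1.0, 1.0, 41).reshape(-1, 1)
y = np.abs(X[:, 0])

global_err = np.mean((predict_linear(fit_linear(X, y), X) - y) ** 2)
split_err = np.mean(
    (predict_partitioned(fit_partitioned(X, y, 0, 0.0), X) - y) ** 2
)
```

On this toy target the single global fit leaves a clearly nonzero mean squared error, while the two per-subspace fits reproduce the data almost exactly. Splitting further would not help here and, on noisy data, would start to overfit, which is the trade-off the abstract describes.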
Open Access
Exploiting data mining techniques for broadcasting data in mobile computing environments
(IEEE, 2002) Saygin, Y.; Ulusoy, Özgür
Mobile computers can be equipped with wireless communication devices that enable users to access data services from any location. In wireless communication, the server-to-client (downlink) communication bandwidth is much higher than the client-to-server (uplink) communication bandwidth. This asymmetry makes the dissemination of data to client machines a desirable approach. However, dissemination of data by broadcasting may induce high access latency when the number of broadcast data items is large. In this paper, we propose two methods that aim to reduce the client access latency of broadcast data. Our methods are based on analyzing the broadcast history (i.e., the chronological sequence of items that have been requested by clients) using data mining techniques. With the first method, the data items in the broadcast disk are organized in such a way that items requested one after another are placed close to each other. The second method focuses on improving the cache hit ratio in order to decrease the access latency. It enables clients to prefetch data from the broadcast disk based on rules extracted from previous data request patterns. The proposed methods are applied to a Web log to estimate their effectiveness. Performance experiments show that the proposed rule-based methods are effective in improving system performance in terms of both the average latency and the cache hit ratio of mobile clients.
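As a rough illustration of rule-based prefetching, the sketch below mines simple "item a is followed by item b" rules from a request history and uses them to suggest what a client might prefetch. The rule form, the confidence threshold, the function names and the sample history are assumptions made for this example; the abstract does not specify how the paper's rules are extracted.

```python
from collections import Counter, defaultdict


def mine_next_item_rules(history, min_confidence=0.6):
    """Rules a -> b where b immediately follows a, kept only when
    confidence = count(a, b) / count(a as antecedent) >= min_confidence."""
    pair_counts = Counter(zip(history, history[1:]))
    antecedent_counts = Counter(history[:-1])
    rules = defaultdict(list)
    for (a, b), n in pair_counts.items():
        confidence = n / antecedent_counts[a]
        if confidence >= min_confidence:
            rules[a].append((b, confidence))
    return dict(rules)


def items_to_prefetch(rules, current_item):
    """Items a client could prefetch right after requesting `current_item`."""
    return [b for b, _ in rules.get(current_item, [])]


# Hypothetical request history, oldest request first.
history = ["A", "B", "C", "A", "B", "D", "A", "B", "C"]
rules = mine_next_item_rules(history)
prefetch_after_b = items_to_prefetch(rules, "B")
```

In this history `B` is followed by `C` two times out of three, which clears the assumed threshold, while `B` followed by `D` does not, so only `C` is suggested after a request for `B`. The same pairwise counts could also inform the first method, by placing strongly associated items near each other in the broadcast schedule.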
Open Access
Hypergraph models and algorithms for data-pattern-based clustering
(Springer, 2004) Ozdal, M. M.; Aykanat, Cevdet
In traditional approaches for clustering market-basket-type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns by modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure that is based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph-partitioning-based algorithms in other domains depends on the sparsity of the hypergraph and on explicit objective metrics. Accordingly, we propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high-quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. In order to enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.
Open Access
A learning-based scheduling system with continuous control and update structure
(2005) Metan, Gökhan
In today’s highly competitive business environment, the product varieties of firms tend to increase and the demand patterns of commodities change rapidly. Especially in high-tech industries, product life cycles become very short, and customer demand can change drastically due to the introduction of new technologies in the market (i.e., their introduction by competitors). These factors increase the need for more efficient scheduling strategies. In this thesis, a learning-based scheduling system is developed for a classical job shop problem with the average tardiness objective. The system learns about the manufacturing environment by constructing a learning tree and, for each scheduling period, selects a dispatching rule from the tree to schedule the operations. The system also uses process control charts to monitor the performance of the learning tree, and both the tree and the control charts are updated when necessary. The system thus adapts itself to changes in the manufacturing environment and remains effective over time. In addition, extensive simulation experiments are performed on system parameters such as the monitoring period length (MPL) and the scheduling period length (SPL). Our results indicate that system performance is significantly affected by these parameters. Moreover, simulation results show that the proposed system performs considerably better than the simulation-based single-pass and multi-pass scheduling algorithms available in the literature.
Open Access
Maximizing benefit of classifications using feature intervals
(Springer, Berlin, Heidelberg, 2003) İkizler, Nazlı; Güvenir, H. Altay
There is a great need for classification methods that can properly handle asymmetric cost and benefit constraints of classifications. In this study, we aim to emphasize the importance of classification benefits by means of a new classification algorithm, Benefit-Maximizing classifier with Feature Intervals (BMFI), which uses a feature-projection-based knowledge representation. Empirical results show that BMFI has promising performance compared to recent cost-sensitive algorithms in terms of the benefit gained.
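To illustrate why asymmetric benefits matter, the sketch below applies the generic expected-benefit decision rule: given class probability estimates and a benefit table, predict the class with the highest expected benefit rather than the most probable class. This is a standard cost-sensitive rule used here for illustration only; it is not the BMFI algorithm, and the class names, probabilities and benefit values are invented for the example.

```python
def expected_benefit(probabilities, benefit, predicted):
    """Expected benefit of predicting `predicted`, where
    benefit[predicted][actual] is the payoff for that outcome."""
    return sum(
        p * benefit[predicted][actual] for actual, p in probabilities.items()
    )


def most_beneficial_class(probabilities, benefit):
    """Class whose prediction maximizes the expected benefit."""
    return max(
        benefit,
        key=lambda predicted: expected_benefit(probabilities, benefit, predicted),
    )


# Hypothetical probability estimates for one instance.
probabilities = {"positive": 0.3, "negative": 0.7}

# Hypothetical, asymmetric benefit table: benefit[predicted][actual].
benefit = {
    "positive": {"positive": 10.0, "negative": -1.0},
    "negative": {"positive": -10.0, "negative": 1.0},
}

choice = most_beneficial_class(probabilities, benefit)
most_probable = max(probabilities, key=probabilities.get)
```

With these invented numbers the most probable class is `negative`, yet predicting `positive` has the higher expected benefit because missing a true positive is penalized heavily. An accuracy-oriented classifier and a benefit-oriented one would therefore disagree on this instance.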
Open Access
Prescription Fraud detection via data mining : a methodology proposal
(2009) Aral, Karca Duru
Fraud is the illegitimate act of violating regulations in order to gain personal profit. Such violations are seen in many important areas, including healthcare, computer networks, credit card transactions, and communications. Every year, health care fraud causes considerable losses to social security agencies and insurance companies in many countries, including Turkey and the USA. This kind of crime is often seen as victimless by those who commit it; nonetheless, the fraudulent chain between pharmaceutical companies, health care providers, patients, and pharmacies not only damages the health care system through its financial burden but also greatly hinders the system’s ability to provide legitimate patients with quality health care. One of the biggest issues related to health care fraud is prescription fraud. This thesis aims to identify a data mining methodology for detecting fraudulent prescriptions in a large prescription database, a task traditionally conducted by human experts. For this purpose, we have developed a customized data mining model for prescription fraud detection. We employ data mining methodologies to assign a risk score to prescriptions with regard to Prescribed Medicament-Diagnosis consistency, Prescribed Medicaments’ consistency within a prescription, Prescribed Medicament-Age and Sex consistency, and Diagnosis-Cost consistency. Our proposed model has been tested on real-world data. The results of our experiments reveal that the proposed model works considerably well for the prescription fraud detection problem, with a 77.4% true positive rate. We conclude that incorporating such a system in social security agencies would radically decrease human-expert auditing costs and increase efficiency.
Open Access
A privacy-preserving solution for the bipartite ranking problem on spark framework
(2017-07) Faramarzi, Noushin Salek
The bipartite ranking problem is defined as finding a function that ranks positive instances in a dataset higher than the negative ones. Financial and medical domains are among the common application areas of ranking algorithms. However, a common concern in such domains is the privacy of the individuals or companies in the dataset. That is, a researcher who wants to discover knowledge from a dataset extracted from such a domain needs to access the records of all individuals in the dataset in order to run a ranking algorithm. This privacy concern limits the use of sensitive personal data for such analysis. We propose an efficient solution for the privacy-preserving bipartite ranking problem, in which the researcher does not need the raw data of the instances in order to learn a ranking model from the data. The RIMARC (Ranking Instances by Maximizing Area under the ROC Curve) algorithm solves the bipartite ranking problem by learning a model to rank instances. As part of the model, it learns a weight for each feature by analyzing the area under the receiver operating characteristic (ROC) curve. The RIMARC algorithm has been shown to be more accurate and efficient than its counterparts. We therefore use this algorithm as a building block and provide a privacy-preserving version of it using homomorphic encryption and secure multi-party computation. To increase time efficiency on big datasets, we have implemented the privacy-preserving RIMARC algorithm on Apache Spark, a popular parallel processing framework whose programming paradigm is based on Resilient Distributed Datasets. Our proposed algorithm lets a data owner outsource the storage and processing of its encrypted dataset to a semi-trusted cloud. A researcher can then get the results of their queries (to learn the ranking function) on the dataset by interacting with the cloud. During this process, neither the researcher nor the cloud can access any information about the raw dataset. We prove the security of the proposed algorithm and show its efficiency via experiments on real data.
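The objective behind bipartite ranking can be made concrete with the pairwise reading of the area under the ROC curve: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The sketch below computes only this quantity on plain, unencrypted scores; it does not implement RIMARC, homomorphic encryption, or the Spark pipeline, and the function name and sample data are assumptions made for the example.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Labels are 1 for positive and 0 for negative; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical ranking scores and true labels.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = pairwise_auc(scores, labels)
```

Here three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. A perfect ranking function would place every positive above every negative and score 1.0.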