Browsing by Subject "Data handling"

Open Access
Ensemble pruning for text categorization based on data partitioning
(Springer, Berlin, Heidelberg, 2011) Toraman, Çağrı; Can, Fazlı
Ensemble methods can improve effectiveness in text categorization. Due to the computational cost of ensemble approaches, there is a need for pruning ensembles. In this work we study ensemble pruning based on data partitioning. We use a rank-based pruning approach: base classifiers are ranked and pruned according to their accuracies on a separate validation set. We employ four data partitioning methods with four machine learning categorization algorithms. We mainly aim to examine ensemble pruning in text categorization. We conduct experiments on two text collections: Reuters-21578 and BilCat-TRT. We show that we can prune 90% of ensemble members with almost no decrease in accuracy. We demonstrate that it is possible to increase the accuracy of traditional ensembling with ensemble pruning. © 2011 Springer-Verlag Berlin Heidelberg.
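The rank-based pruning step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `prune_by_rank` and `majority_vote`, the representation of classifiers as plain callables, and the `keep_fraction` parameter are all assumptions introduced here.

```python
from collections import Counter

def prune_by_rank(classifiers, validation_set, keep_fraction=0.1):
    """Rank base classifiers by validation accuracy and keep the top fraction.

    classifiers: list of callables mapping an input to a label (hypothetical).
    validation_set: list of (input, label) pairs held out from training.
    """
    def accuracy(clf):
        correct = sum(1 for x, y in validation_set if clf(x) == y)
        return correct / len(validation_set)

    ranked = sorted(classifiers, key=accuracy, reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]

def majority_vote(classifiers, x):
    """Combine the remaining ensemble members by simple majority voting."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

With `keep_fraction=0.1`, nine out of ten members are discarded, which mirrors the 90% pruning level mentioned in the abstract; how the members are combined after pruning is an assumption of this sketch.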
Open Access
Graph aware caching policy for distributed graph stores
(IEEE, 2015-03) Aksu, Hidayet; Canım, M.; Chang, Y.-C.; Körpeoğlu, İbrahim; Ulusoy, Özgür
Graph stores are becoming increasingly popular among NoSQL applications seeking flexibility and heterogeneity in managing linked data. Conceptually and in practice, applications ranging from social networks and knowledge representations to the Internet of Things benefit from graph data stores built on a combination of relational and non-relational technologies aimed at desired performance characteristics. The most common data access pattern in querying graph stores is to traverse from a node to its neighboring nodes. This paper studies the impact of this traversal pattern on common data caching policies in a partitioned data environment where a big graph is distributed across servers in a cluster. We propose and evaluate a new graph aware caching policy designed to keep and evict nodes, edges and their metadata in a way that is optimized for the query traversal pattern. The algorithm takes into account the topology of the graph as well as the latency of access to the graph nodes and neighbors. We implemented graph aware caching on Apache HBase, a distributed data store in the Hadoop family. Performance evaluations showed up to 15x speedup on the benchmark datasets when using our new graph aware policy instead of non-aware policies. We also show how to improve the performance of existing caching algorithms for distributed graphs by exploiting the topology information. © 2015 IEEE.
Open Access
IBM streams processing language: analyzing big data in motion
(IBM Corp., 2013-05-17) Hirzel, M.; Andrade, H.; Gedik, B.; Jacques-Silva, R.; Khandekar, R.; Kumar, V.; Mendell, M.; Nasgaard, H.; Schneider, S.; Soulé, R.; Wu, K. L.
The IBM Streams Processing Language (SPL) is the programming language for IBM InfoSphere® Streams, a platform for analyzing Big Data in motion. By “Big Data in motion,” we mean continuous data streams at high data-transfer rates. InfoSphere Streams processes such data with both high throughput and short response times. To meet these performance demands, it deploys each application on a cluster of commodity servers. SPL abstracts away the complexity of the distributed system, instead exposing a simple graph-of-operators view to the user. SPL has several innovations relative to prior streaming languages. For performance and code reuse, SPL provides a code-generation interface to C++ and Java®. To facilitate writing well-structured and concise applications, SPL provides higher-order composite operators that modularize stream sub-graphs. Finally, to enable static checking while exposing optimization opportunities, SPL provides a strong type system and user-defined operator models. This paper provides a language overview, describes the implementation including optimizations such as fusion, and explains the rationale behind the language design.
Open Access
L1 norm based multiplication-free cosine similarity measures for big data analysis
(IEEE, 2014-11) Akbaş, Cem Emre; Bozkurt, Alican; Arslan, Musa Tunç; Aslanoğlu, Hüseyin; Çetin, A. Enis
The cosine similarity measure is widely used in big data analysis to compare vectors. In this article a new set of vector similarity measures is proposed. The new vector similarity measures are based on a multiplication-free operator which requires only additions and sign operations. A vector 'product' using the multiplication-free operator is also defined. The new vector product induces the ℓ1-norm. As a result, the new cosine-like similarity measures are normalized by the ℓ1-norms of the vectors. They can be computed using the MapReduce framework. Simulation examples are presented. © 2014 IEEE.
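To make the idea of a multiplication-free similarity concrete, here is a small sketch. The abstract does not give formulas, so everything below is an assumption: the scalar operator is taken to be sign(a)·sign(b)·(|a| + |b|), evaluated with comparisons rather than multiplications, and the normalization by the sum of the two ℓ1-norms is just one plausible choice, not necessarily any of the measures proposed in the paper. All function names are hypothetical.

```python
def sign(v):
    """Return -1, 0, or +1 using comparisons only."""
    return (v > 0) - (v < 0)

def mf_op(a, b):
    """Assumed multiplication-free operator: sign agreement times |a| + |b|."""
    s_a, s_b = sign(a), sign(b)
    if s_a == 0 or s_b == 0:
        return 0
    magnitude = abs(a) + abs(b)
    return magnitude if s_a == s_b else -magnitude

def mf_product(x, y):
    """Vector 'product': sum of the operator over matching components."""
    return sum(mf_op(a, b) for a, b in zip(x, y))

def l1_norm(x):
    return sum(abs(a) for a in x)

def mf_similarity(x, y):
    """Cosine-like measure normalized by l1-norms (one assumed variant)."""
    denom = l1_norm(x) + l1_norm(y)
    return mf_product(x, y) / denom if denom else 0.0
```

Under this assumed definition, `mf_product(x, x)` equals twice the ℓ1-norm of `x`, which is one way to read the statement that the product induces the ℓ1-norm, and the similarity stays within [-1, 1].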
Open Access
Land cover classification with multi-sensor fusion of partly missing data
(American Society for Photogrammetry and Remote Sensing, 2009-05) Aksoy, S.; Koperski, K.; Tusk, C.; Marchisio, G.
We describe a system that uses decision tree-based tools for seamless acquisition of knowledge for classification of remotely sensed imagery. We concentrate on three important problems in this process: information fusion, model understandability, and handling of missing data. Importance of multi-sensor information fusion and the use of decision tree classifiers for such problems have been well-studied in the literature. However, these studies have been limited to the cases where all data sources have a full coverage for the scene under consideration. Our contribution in this paper is to show how decision tree classifiers can be learned with alternative (surrogate) decision nodes and result in models that are capable of dealing with missing data during both training and classification to handle cases where one or more measurements do not exist for some locations. We present detailed performance evaluation regarding the effectiveness of these classifiers for information fusion and feature selection, and study three different methods for handling missing data in comparative experiments. The results show that surrogate decisions incorporated into decision tree classifiers provide powerful models for fusing information from different data layers while being robust to missing data. © 2009 American Society for Photogrammetry and Remote Sensing.
Open Access
Object-oriented query language facilitating construction of new objects
(Elsevier, 1993) Alhajj, R.; Arkun, M. E.
In object-oriented database systems, messages can be used to manipulate the database; however, a query language is still a required component of any kind of database system. In the paper, we describe a query language for object-oriented databases where both objects and the behaviour defined in them are handled. Not only are existing objects manipulated; the introduction of new relationships and new objects constructed out of existing ones is also facilitated. The operations supported in the described query language subsume those of the relational algebra, aiming at a more powerful query language than the relational algebra. Among the additional operators, there is an operator that handles the application of an aggregate function on objects in an operand while still having the result possess the characteristics of an operand. The result of a query as well as the operands are considered to have a pair of sets, a set of objects and a set of message expressions, where a message expression is a sequence of messages. A message expression handles both stored and derived values and hence provides full computational power without having an embedded query language with impedance mismatch. Therefore the closure property is maintained by having the result of a query possess the characteristics of an operand. Furthermore, we define a set of objects and derive a set of message expressions for every class; hence any class can be an operand. Moreover, the result of a query has the characteristics of a class and its superclass/subclass relationships with the operands are established to make it persistent. © 1993.
Open Access
Pipelined fission for stream programs with dynamic selectivity and partitioned state
(Academic Press, 2016) Gedik, B.; Özsema, H. G.; Öztürk, Ö.
There is an ever-increasing rate of digital information available in the form of online data streams. In many application domains, high throughput processing of such data is a critical requirement for keeping up with the soaring input rates. Data stream processing is a computational paradigm that aims at addressing this challenge by processing data streams in an on-the-fly manner, in contrast to the more traditional and less efficient store-and-then-process approach. In this paper, we study the problem of automatically parallelizing data stream processing applications in order to improve throughput. The parallelization is automatic in the sense that stream programs are written sequentially by the application developers and are parallelized by the system. We adopt the asynchronous data flow model for our work, which is typical in Data Stream Processing Systems (DSPS), where operators often have dynamic selectivity and are stateful. We solve the problem of pipelined fission, in which the original sequential program is parallelized by taking advantage of both pipeline parallelism and data parallelism at the same time. Our pipelined fission solution supports partitioned stateful data parallelism with dynamic selectivity and is designed for shared-memory multi-core machines. We first develop a cost-based formulation that enables us to express pipelined fission as an optimization problem. The brute-force solution of this problem takes a long time even for moderately sized stream programs. Accordingly, we develop a heuristic algorithm that can quickly, but approximately, solve the pipelined fission problem. We provide an extensive evaluation studying the performance of our pipelined fission solution, including simulations as well as experiments with an industrial-strength DSPS. Our results show good scalability for applications that contain sufficient parallelism, as well as close to optimal performance for the heuristic pipelined fission algorithm.
Open Access
A privacy-preserving solution for the bipartite ranking problem
(IEEE, 2016-12) Faramarzi, Noushin Salek; Ayday, Erman; Güvenir, H. Altay
In this paper, we propose an efficient privacy-preserving solution for a bipartite ranking algorithm. The bipartite ranking problem can be considered as finding a function that ranks positive instances (in a dataset) higher than the negative ones. However, one common concern for all the existing schemes is the privacy of individuals in the dataset. That is, one (e.g., a researcher) needs to access the records of all individuals in the dataset in order to run the algorithm. This privacy concern puts limitations on the use of sensitive personal data for such analysis. The RIMARC (Ranking Instances by Maximizing Area under the ROC Curve) algorithm solves the bipartite ranking problem by learning a model to rank instances. As part of the model, it learns weights for each feature by analyzing the area under the receiver operating characteristic (ROC) curve. The RIMARC algorithm has been shown to be more accurate and efficient than its counterparts. Thus, we use this algorithm as a building block and provide a privacy-preserving version of the RIMARC algorithm using homomorphic encryption and secure multi-party computation. Our proposed algorithm lets a data owner outsource the storage and processing of its encrypted dataset to a semi-trusted cloud. Then, a researcher can get the results of his/her queries (to learn the ranking function) on the dataset by interacting with the cloud. During this process, neither the researcher nor the cloud learns any information about the raw dataset. We prove the security of the proposed algorithm and show its efficiency via experiments on real data.
Open Access
Processing real-time transactions in a replicated database system
(Springer/Kluwer Academic Publishers, 1994) Ulusoy, Özgür
A database system supporting a real-time application has to provide real-time information to the executing transactions. Each real-time transaction is associated with a timing constraint, typically in the form of a deadline. It is difficult to satisfy all timing constraints due to the consistency requirements of the underlying database. The aim in scheduling the transactions is to process as many of them as possible within their deadlines. Replicated database systems possess desirable features for real-time applications, such as a high level of data availability, and potentially improved response time for queries. On the other hand, multiple copy updates lead to a considerable overhead due to the communication required among the data sites holding the copies. In this paper, we investigate the impact of storing multiple copies of data on satisfying the timing constraints of real-time transactions. A detailed performance model of a distributed database system is employed in evaluating the effects of various workload parameters and design alternatives on the system performance. The performance is expressed in terms of the fraction of satisfied transaction deadlines. A comparison of several real-time concurrency control protocols, which differ in how they take the timing constraints of transactions into account during scheduling, is also provided in performance experiments. © 1994 Kluwer Academic Publishers.
Open Access
Query model for object-oriented databases
(IEEE, 1993-04) Alhajj, Reda; Arkun, M. Erol
A query language should be a part of any database system. While the relational model has a well defined underlying query model, the object-oriented database systems have been criticized for not having such a query model. One of the most challenging steps in the development of a theory for object-oriented databases is the definition of an object algebra. A formal object-oriented query model is described here in terms of an object algebra, at least as powerful as the relational algebra, by extending the latter in a consistent manner. Both the structure and the behavior of objects are handled. An operand and the output from a query in the object algebra are defined to have a pair of sets, a set of objects and a set of message expressions where a message expression is a valid sequence of messages. Hence the closure property is maintained in a natural way. In addition, it is proved that the output from a query has the characteristics of a class; hence the inheritance (sub/superclass) relationship between the operand(s) and the output from a query is derived. This way, the result of a query can be persistently placed in its proper place in the lattice.
Open Access
Safe data parallelism for general streaming
(Institute of Electrical and Electronics Engineers, 2015) Schneider, S.; Hirzel, M.; Gedik, B.; Wu, Kun-Lung
Streaming applications process possibly infinite streams of data and often have both high throughput and low latency requirements. They are comprised of operator graphs that produce and consume data tuples. General streaming applications use stateful, selective, and user-defined operators. The stream programming model naturally exposes task and pipeline parallelism, enabling it to exploit parallel systems of all kinds, including large clusters. However, data parallelism must either be manually introduced by programmers, or extracted as an optimization by compilers. Previous data parallel optimizations did not apply to selective, stateful and user-defined operators. This article presents a compiler and runtime system that automatically extracts data parallelism for general stream processing. Data-parallelization is safe if the transformed program has the same semantics as the original sequential version. The compiler forms parallel regions while considering operator selectivity, state, partitioning, and graph dependencies. The distributed runtime system ensures that tuples always exit parallel regions in the same order they would without data parallelism, using the most efficient strategy as identified by the compiler. Our experiments using 100 cores across 14 machines show linear scalability for parallel regions that are computation-bound, and near linear scalability when tuples are shuffled across parallel regions.
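The ordering guarantee mentioned in this abstract, that tuples leave a parallel region in the order they would have had without data parallelism, can be illustrated with a sequence-number merger. This is only a sketch of one possible strategy, assumed here for illustration; the class name `OrderedMerger` and its interface are hypothetical, and it assumes every tuple entering the region is tagged with a consecutive sequence number and eventually arrives at the merger (so it does not cover selective operators that drop tuples).

```python
import heapq

class OrderedMerger:
    """Release tuples from parallel channels in their original sequence order."""

    def __init__(self):
        self._pending = []   # min-heap of (sequence number, value)
        self._next_seq = 0   # next sequence number allowed to leave the region

    def push(self, seq, value):
        """Accept a tuple from any channel; return the tuples that can now exit."""
        heapq.heappush(self._pending, (seq, value))
        released = []
        # Drain the heap while its smallest entry is the next expected tuple.
        while self._pending and self._pending[0][0] == self._next_seq:
            released.append(heapq.heappop(self._pending)[1])
            self._next_seq += 1
        return released
```

For example, if the tuple with sequence number 1 arrives before the tuple with sequence number 0, it is buffered, and both are released in order once tuple 0 arrives.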
Open Access
SPL: an extensible language for distributed stream processing
(Association for Computing Machinery, 2017) Hirzel, M.; Schneider, S.; Gedik, B.
Big data is revolutionizing how all sectors of our economy do business, including telecommunication, transportation, medical, and finance. Big data comes in two flavors: data at rest and data in motion. Processing data in motion is stream processing. Stream processing for big data analytics often requires scale that can only be delivered by a distributed system, exploiting parallelism on many hosts and many cores. One such distributed stream processing system is IBM Streams. Early customer experience with IBM Streams uncovered that another core requirement is extensibility, since customers want to build high-performance domain-specific operators for use in their streaming applications. Based on these two core requirements of distribution and extensibility, we designed and implemented the Streams Processing Language (SPL). This article describes SPL with an emphasis on the language design, distributed runtime, and extensibility mechanism. SPL is now the gateway for the IBM Streams platform, used by our customers for stream processing in a broad range of application domains. © 2017 ACM.