Browsing by Subject "Big data"
Now showing 1 - 20 of 42
Item Open Access: A broad ensemble learning system for drifting stream classification (Institute of Electrical and Electronics Engineers, 2023-08-21)
Authors: Bakhshi, Sepehr; Ghahramanian, Pouya; Bonab, H.; Can, Fazlı
In a data stream environment, classification models must handle concept drift effectively and efficiently. Ensemble methods are widely used for this purpose; however, the ones available in the literature either use a large data chunk to update the model or learn the data one by one. In the former case, the model may miss changes in the data distribution; in the latter, it may suffer from inefficiency and instability. To address these issues, we introduce a novel ensemble approach based on the Broad Learning System (BLS), in which mini chunks are used at each update. BLS is an effective lightweight neural architecture recently developed for incremental learning. Although it is fast, it requires huge data chunks for effective updates and is unable to handle the dynamic changes observed in data streams. Our proposed approach, named Broad Ensemble Learning System (BELS), uses a novel updating method that significantly improves best-in-class model accuracy. It employs an ensemble of output layers to address the limitations of BLS and handle drifts. Our model tracks changes in the accuracy of the ensemble components and reacts to these changes. We present our mathematical derivation of BELS, perform comprehensive experiments with 35 datasets that demonstrate the adaptability of our model to various drift types, and provide hyperparameter, ablation, and imbalanced-dataset performance analyses. The experimental results show that the proposed approach outperforms 10 state-of-the-art baselines and supplies an overall improvement of 18.59% in terms of average prequential accuracy.

Item Open Access: Accelerating the HyperLogLog cardinality estimation algorithm (Hindawi Limited, 2017)
Authors: Bozkus, C.; Fraguela, B. B.
In recent years, vast amounts of data of different kinds are being generated, from pictures and videos from our cameras to software logs from sensor networks and Internet routers operating day and night. This has led to new big data problems, which require new algorithms to handle these large volumes of data and which, as a result, are very computationally demanding. In this paper, we parallelize one of these new algorithms, namely the HyperLogLog algorithm, which estimates the number of distinct items in a large data set with minimal memory usage, lowering the typical memory cost of this type of calculation from O(n) to O(1). We have implemented parallelizations based on OpenMP and OpenCL and evaluated them on a standard multicore system, an Intel Xeon Phi, and two GPUs from different vendors. The results obtained in our experiments, in which we reach a speedup of 88.6 with respect to an optimized sequential implementation, are very positive, particularly taking into account the need to run this kind of algorithm on large amounts of data. © 2017 Cem Bozkus and Basilio B. Fraguela.

Item Open Access: Adaptive ensemble learning with confidence bounds for personalized diagnosis (AAAI Press, 2016)
Authors: Tekin, Cem; Yoon, J.; Van Der Schaar, M.
With the advances in the field of medical informatics, automated clinical decision support systems are becoming the de facto standard in personalized diagnosis. Establishing high accuracy and confidence in personalized diagnosis requires processing massive amounts of distributed, heterogeneous, correlated, and high-dimensional patient data from sources such as wearable sensors, mobile applications, and Electronic Health Record (EHR) databases. This requires learning both locally and globally, due to privacy constraints and/or the distributed nature of the multimodal medical data.
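The HyperLogLog estimator parallelized in the Bozkus and Fraguela item above also admits a compact sequential sketch. The following is a minimal, illustrative Python version; the register count (b=10), the SHA-1 hash, and the bias constant are textbook defaults assumed here, not details of that paper's implementation:

```python
import hashlib
import math

def hyperloglog(items, b=10):
    """Estimate the number of distinct items with m = 2**b small registers,
    i.e. O(1) memory regardless of the stream length."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h >> (64 - b)                      # first b bits choose a register
        w = h & ((1 << (64 - b)) - 1)          # remaining 64 - b bits
        rank = (64 - b) - w.bit_length() + 1   # position of the leftmost 1-bit
        registers[j] = max(registers[j], rank)
    alpha = 0.7213 / (1 + 1.079 / m)           # bias correction for m >= 128
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:          # small-range (linear counting) fix
        estimate = m * math.log(m / zeros)
    return estimate

estimate = hyperloglog(range(10000))  # true cardinality: 10000
```

With 1024 registers the relative standard error is roughly 1.04/sqrt(m), about 3%, which is why a few kilobytes suffice where exact counting would need O(n) memory.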
In the last decade, a large number of meta-learning techniques have been proposed in which local learners make online predictions based on their locally collected data instances and feed these predictions to an ensemble learner, which fuses them and issues a global prediction. However, most of these works do not provide performance guarantees or, when they do, the guarantees are asymptotic. None of these existing works provides confidence estimates about the issued predictions or rate-of-learning guarantees for the ensemble learner. In this paper, we provide a systematic ensemble learning method called Hedged Bandits, which comes with both long-run (asymptotic) and short-run (rate of learning) performance guarantees. Moreover, we show that our proposed method outperforms all existing ensemble learning techniques, even in the presence of concept drift.

Item Open Access: Artificial intelligence-based hybrid anomaly detection and clinical decision support techniques for automated detection of cardiovascular diseases and Covid-19 (Bilkent University, 2023-10)
Author: Terzi, Merve Begüm
Coronary artery diseases are the leading cause of death worldwide, and early diagnosis is crucial for timely treatment. To address this, we present a novel automated artificial intelligence-based hybrid anomaly detection technique composed of various signal processing, feature extraction, supervised, and unsupervised machine learning methods. By jointly and simultaneously analyzing 12-lead electrocardiogram (ECG) and cardiac sympathetic nerve activity (CSNA) data, the technique performs fast, early, and accurate diagnosis of coronary artery diseases. To develop and evaluate the proposed technique, we utilized the fully labeled STAFF III and PTBD databases, which contain 12-lead wideband raw recordings non-invasively acquired from 260 subjects.
Using the wideband raw recordings in these databases, we developed a signal processing technique that simultaneously detects the 12-lead ECG and CSNA signals of all subjects. Subsequently, using the pre-processed 12-lead ECG and CSNA signals, we developed a time-domain feature extraction technique that extracts the statistical CSNA and ECG features critical for the reliable diagnosis of coronary artery diseases. Using the extracted discriminative features, we developed a supervised classification technique based on artificial neural networks that simultaneously detects anomalies in the 12-lead ECG and CSNA data. Furthermore, we developed an unsupervised clustering technique based on the Gaussian mixture model and the Neyman-Pearson criterion that performs robust detection of the outliers corresponding to coronary artery diseases. Using this hybrid anomaly detection technique, we demonstrated a significant association between an increase in the amplitude of the CSNA signal and anomalies in the ECG signal during coronary artery diseases. The technique detected coronary artery diseases with a sensitivity of 98.48%, specificity of 97.73%, accuracy of 98.11%, positive predictive value (PPV) of 97.74%, negative predictive value (NPV) of 98.47%, and F1-score of 98.11%. Hence, it has superior performance compared to the gold-standard diagnostic test, the ECG, in diagnosing coronary artery diseases. It also outperformed the other techniques developed in this study that use only CSNA data or only ECG data. It therefore significantly increases the detection performance for coronary artery diseases by taking advantage of the diversity of the different data types and leveraging their strengths.
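The unsupervised stage described above, a Gaussian mixture model with a Neyman-Pearson-style operating point, can be illustrated with a deliberately simplified one-dimensional sketch. The mixture parameters and the 5% false-alarm target below are invented for illustration and are unrelated to the thesis's actual CSNA/ECG features:

```python
import math
import random

def gmm_loglik(x, comps):
    """Log-likelihood of a scalar x under a 1-D Gaussian mixture.
    comps: list of (weight, mean, std) triples."""
    p = sum(w * math.exp(-((x - mu) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
            for w, mu, s in comps)
    return math.log(p) if p > 0 else float("-inf")

def np_style_threshold(normal_scores, alpha=0.05):
    """Neyman-Pearson-flavored threshold: choose the cutoff that flags at most
    a fraction alpha of known-normal data as outliers (false-alarm control)."""
    ordered = sorted(normal_scores)
    return ordered[max(0, int(alpha * len(ordered)) - 1)]

# Hypothetical mixture standing in for the feature distribution of normal beats.
mixture = [(0.7, 0.0, 1.0), (0.3, 3.0, 0.5)]
random.seed(0)
normal_data = [random.gauss(0, 1) if random.random() < 0.7 else random.gauss(3, 0.5)
               for _ in range(2000)]
threshold = np_style_threshold([gmm_loglik(x, mixture) for x in normal_data])

def is_outlier(x):
    # Flag samples whose likelihood under the "normal" model falls below the cutoff.
    return gmm_loglik(x, mixture) < threshold
```

Calibrating the threshold on known-normal scores fixes the false-alarm rate first and then flags everything less likely than that, which is the Neyman-Pearson way of trading sensitivity against specificity.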
Furthermore, its performance is comparatively better than that of most previously proposed machine and deep learning methods that used only ECG data to diagnose or classify coronary artery diseases. It also has a very short implementation time, which is highly desirable for real-time detection of coronary artery diseases in clinical practice. The proposed automated hybrid anomaly detection technique may serve as an efficient decision-support system to increase physicians' success in achieving fast, early, and accurate diagnosis of coronary artery diseases. It may be particularly beneficial and valuable for asymptomatic coronary artery disease patients, for whom the diagnostic information provided by the ECG alone is not sufficient to reliably diagnose the disease. Hence, it may significantly improve patient outcomes, enable timely treatment, and reduce the mortality associated with cardiovascular diseases. Secondly, we propose a new automated artificial intelligence-based hybrid clinical decision support technique that jointly analyzes reverse transcriptase polymerase chain reaction (RT-PCR) curves, thorax computed tomography images, and laboratory data to perform fast and accurate diagnosis of Coronavirus disease 2019 (COVID-19). For this purpose, we retrospectively created the fully labeled Ankara University Faculty of Medicine COVID-19 (AUFM-CoV) database, which contains a wide variety of medical data, including RT-PCR curves, thorax computed tomography images, and laboratory data. AUFM-CoV is the most comprehensive database that includes thorax computed tomography images of COVID-19 pneumonia (CVP), other viral and bacterial pneumonias (VBP), and parenchymal lung diseases (PLD), all of which present significant challenges for differential diagnosis.
We developed a new automated artificial intelligence-based hybrid clinical decision support technique: an ensemble learning technique consisting of two preprocessing methods, a long short-term memory (LSTM) network-based deep learning method, a convolutional neural network (CNN)-based deep learning method, and an artificial neural network-based machine learning method. By jointly analyzing RT-PCR curves, thorax computed tomography images, and laboratory data, the proposed technique benefits from the diversity of the different data types that are critical for the reliable detection of COVID-19 and leverages their strengths. The multi-class classification results of the proposed CNN-based deep learning method on the AUFM-CoV database show that it achieved highly reliable detection of COVID-19, with a sensitivity of 91.9%, specificity of 92.5%, precision of 80.4%, and F1-score of 86%. It therefore outperformed thorax computed tomography in terms of the specificity of COVID-19 diagnosis. Moreover, the CNN-based method very successfully distinguished COVID-19 pneumonia (CVP) from other viral and bacterial pneumonias (VBP) and parenchymal lung diseases (PLD), which exhibit very similar radiological findings. It thus has great potential for use in the differential diagnosis of pulmonary diseases presenting ground-glass opacities. The binary classification results of the CNN-based method show a sensitivity of 91.5%, specificity of 94.8%, precision of 85.6%, and F1-score of 88.4% in diagnosing COVID-19; hence, it has sensitivity comparable to thorax computed tomography.
Additionally, the binary classification results of the proposed LSTM network-based deep learning method on the AUFM-CoV database show that it detected COVID-19 with a sensitivity of 96.6%, specificity of 99.2%, precision of 98.1%, and F1-score of 97.3%. Thus, it outperformed the gold-standard RT-PCR test in terms of the sensitivity of COVID-19 diagnosis. Furthermore, the multi-class classification results of the proposed hybrid clinical decision support technique on the AUFM-CoV database show that it diagnosed COVID-19 with a sensitivity of 66.3%, specificity of 94.9%, precision of 80%, and F1-score of 73%. Hence, it very successfully performs the differential diagnosis of COVID-19 pneumonia (CVP) and other pneumonias. The binary classification results of the technique reveal that it diagnosed COVID-19 with a sensitivity of 90%, specificity of 92.8%, precision of 91.8%, and F1-score of 90.9%. It therefore exhibits superior sensitivity and specificity compared to laboratory data in COVID-19 diagnosis. The performance of the proposed technique on the AUFM-CoV database demonstrates its ability to provide highly reliable diagnosis of COVID-19 by jointly analyzing RT-PCR data, thorax computed tomography images, and laboratory data.
Consequently, it may significantly increase the success of physicians in diagnosing COVID-19, assist them in rapidly isolating and treating COVID-19 patients, and reduce their workload in daily clinical practice.

Item Open Access: Asymptotically optimal contextual bandit algorithm using hierarchical structures (Institute of Electrical and Electronics Engineers, 2018)
Authors: Neyshabouri, Mohammadreza Mohaghegh; Gökçesu, Kaan; Gökçesu, Hakan; Özkan, Hüseyin; Kozat, Süleyman Serdar
We propose an online algorithm for sequential learning in the contextual multi-armed bandit setting. Our approach is to partition the context space and then optimally combine all of the possible mappings between the partition regions and the set of bandit arms in a data-driven manner. We show that in our approach the best mapping is able to approximate the best arm-selection policy to any desired degree under mild Lipschitz conditions. We therefore design our algorithm based on the optimal adaptive combination and asymptotically achieve the performance of the best mapping, and hence of the best arm-selection policy. This optimality is guaranteed to hold even in adversarial environments, since we do not rely on any statistical assumptions regarding the contexts or the losses of the bandit arms. Moreover, we design an efficient implementation of our algorithm using various hierarchical partitioning structures, such as lexicographical or arbitrary-position splitting and binary trees (BTs), among other partitioning examples. For instance, in the case of BT partitioning, the computational complexity is only log-linear in the number of regions in the finest partition. In conclusion, we provide significant performance improvements by introducing upper bounds (with respect to the best arm-selection policy) that are mathematically proven to vanish, in the average-loss-per-round sense, at a faster rate than the state of the art.
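As a toy illustration of mapping context regions to arms, the sketch below uses a fixed uniform partition of [0, 1) with a per-region epsilon-greedy learner. The paper's adaptive hierarchical partitions and optimal combination are far more elaborate, so treat this only as the flavor of the idea; all names and parameters are invented:

```python
import random

class PartitionedBandit:
    """Toy contextual bandit over a fixed partition of the context space
    [0, 1): one epsilon-greedy learner per region (a flat stand-in for
    adaptive hierarchical partitions)."""

    def __init__(self, n_bins, n_arms, eps=0.1):
        self.n_bins, self.n_arms, self.eps = n_bins, n_arms, eps
        self.counts = [[0] * n_arms for _ in range(n_bins)]
        self.values = [[0.0] * n_arms for _ in range(n_bins)]  # mean rewards

    def select(self, context):
        b = min(int(context * self.n_bins), self.n_bins - 1)
        if random.random() < self.eps:            # explore
            return b, random.randrange(self.n_arms)
        vals = self.values[b]                     # exploit the best arm so far
        return b, vals.index(max(vals))

    def update(self, b, arm, reward):
        self.counts[b][arm] += 1
        self.values[b][arm] += (reward - self.values[b][arm]) / self.counts[b][arm]

# demo: arm 0 pays off for contexts below 0.5, arm 1 above
random.seed(1)
bandit = PartitionedBandit(n_bins=2, n_arms=2)
for _ in range(5000):
    ctx = random.random()
    region, arm = bandit.select(ctx)
    bandit.update(region, arm, 1.0 if (arm == 0) == (ctx < 0.5) else 0.0)
```

A finer partition lets the per-region policy track a Lipschitz arm-selection policy more closely, which is the intuition behind refining the partition hierarchically.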
Our experimental work extensively covers various scenarios, ranging from bandit settings to multiclass classification with real and synthetic data. In these experiments, we show that our algorithm is highly superior to the state-of-the-art techniques while maintaining the introduced mathematical guarantees and computationally reasonable scalability.

Item Open Access: BELS: a broad ensemble learning system for data stream classification (Bilkent University, 2021-12)
Author: Bakhshi, Sepehr
Data stream classification has become a major research topic due to the increase in temporal data. One of the biggest hurdles of data stream classification is the development of algorithms that deal with evolving data, also known as concept drift. As data changes over time, static prediction models lose their validity. Adapting to concept drift yields more robust and better-performing models. The Broad Learning System (BLS) is an effective broad neural architecture recently developed for incremental learning. BLS cannot provide an instant response, since it requires huge data chunks, and it is unable to handle concept drift. We propose a Broad Ensemble Learning System (BELS) for stream classification with concept drift. BELS uses a novel updating method that greatly improves best-in-class model accuracy. It employs a dynamic output ensemble layer to address the limitations of BLS. We present its mathematical derivation, provide comprehensive experiments with 11 datasets that demonstrate the adaptability of our model, including a comparison of our model with BLS, and provide parameter and robustness analyses on several drifting streams, showing that it statistically significantly outperforms seven state-of-the-art baselines.
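Prequential (test-then-train) accuracy, the headline metric in these stream-classification items, is easy to state in code. The majority-class model below is an invented stand-in baseline, not BELS:

```python
class MajorityClass:
    """A throwaway incremental baseline: always predicts the most frequent
    label seen so far (just something simple to evaluate)."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def prequential_accuracy(model, stream):
    """Test-then-train: each (x, y) pair is first used to test the model and
    only then to update it; returns the accuracy over the whole stream."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:
            correct += 1
        total += 1
        model.update(x, y)
    return correct / total if total else 0.0
```

Because every example is tested before it is learned, the metric penalizes a model that adapts slowly after a concept drift, which is exactly what these papers measure.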
We show that our proposed method improves on BLS by 44% on average, and on other competitive baselines by 29%.

Item Open Access: Big data analytics, order imbalance and the predictability of stock returns (Elsevier, 2021-09-24)
Authors: Akyildirim, E.; Şensoy, Ahmet; Gulay, G.; Corbet, S.; Salari, Hajar Novin
Financial institutions have adopted big data to a considerable extent to provide better investment decisions. Consequently, high-frequency algorithmic traders use vast amounts of historical data with various statistical models to maximize their trading profits. Until recently, high-frequency algorithmic trading was the domain of institutional traders with access to supercomputers. Nowadays, any investor can potentially make high-frequency trades because of easy access to big data and to software to analyze and execute trades. With that in mind, Borsa Istanbul introduced real-time big data analytics as a product for its customers. These analytics are derived in real time from order book and trade data and aim to level the playing field between investment firms and retail traders. Using classical benchmark models from the literature, we show that Borsa Istanbul's order imbalance-based data analytics are useful in predicting both time-series and cross-sectional intraday excess future returns, showing that this product is highly beneficial to market participants, particularly day traders.

Item Open Access: Big data signal processing using boosted RLS algorithm (IEEE, 2016)
Authors: Civek, Burak Cevat; Kari, Dariush; Delibalta, İ.; Kozat, Süleyman Serdar
We propose an efficient method for high-dimensional data regression. To this end, we use a least mean squares (LMS) filter followed by a recursive least squares (RLS) filter and combine them via the boosting notion extensively used in the machine learning literature.
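The first stage of the cascade just described, an LMS filter, can be sketched in a few lines. The step size and tap count below are arbitrary illustration choices, and the boosting-based combination with the RLS stage is not shown:

```python
import random

def lms_filter(xs, ys, n_taps, mu=0.05):
    """Least-mean-squares adaptive filter: for each sample, predict y[t] from
    the feature vector xs[t], then nudge the weights along the error gradient."""
    w = [0.0] * n_taps
    preds = []
    for x, y in zip(xs, ys):
        yhat = sum(wi * xi for wi, xi in zip(w, x))
        preds.append(yhat)
        err = y - yhat
        w = [wi + mu * err * xi for wi, xi in zip(w, x)]  # stochastic gradient step
    return preds, w

# demo: identify the noiseless linear relation y = 2*x0 - x1
random.seed(0)
xs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(3000)]
ys = [2 * a - b for a, b in xs]
_, w = lms_filter(xs, ys, n_taps=2)
```

On a noiseless linear target the weights converge to the true coefficients; the appeal of the cascade in the abstract is that the cheap O(M) LMS stage runs on every sample while the expensive RLS stage need not.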
Moreover, we provide a novel approach in which the RLS filter is updated randomly in order to reduce the computational complexity without giving up much performance. In the proposed algorithm, after the LMS filter produces an estimate, the algorithm decides, depending on the error made at this step, whether or not to update the RLS filter. Since we avoid updating the RLS filter for the entire data sequence, the computational complexity is significantly reduced. The error performance and computation time of our algorithm are demonstrated for a highly realistic scenario.

Item Open Access: Big-data streaming applications scheduling based on staged multi-armed bandits (Institute of Electrical and Electronics Engineers, 2016)
Authors: Kanoun, K.; Tekin, C.; Atienza, D.; Van Der Schaar, M.
Several techniques have recently been proposed to adapt big-data streaming applications to existing manycore platforms. Among these techniques, online reinforcement learning methods have been proposed that learn how to adapt, at run time, the throughput and resources allocated to the various streaming tasks depending on dynamically changing data stream characteristics and the desired application performance (e.g., accuracy). However, most state-of-the-art techniques consider only a single stream input in their application model and assume that the system knows the amount of resources to allocate to each task to achieve a desired performance. To address these limitations, in this paper we propose a new systematic and efficient methodology, with associated algorithms, for online learning and energy-efficient scheduling of big-data streaming applications with multiple streams on manycore systems with resource constraints. We formalize the problem of multi-stream scheduling as a staged decision problem in which the performance obtained for the various resource allocations is unknown.
The proposed scheduling methodology uses a novel class of online adaptive learning techniques that we refer to as staged multi-armed bandits (S-MAB). Our scheduler learns online which processing method to assign to each stream and how to allocate its resources over time in order to maximize performance on the fly, at run time, without access to any offline information. Applied to a face-detection streaming application, and without using any offline information, the proposed scheduler achieves performance similar to an optimal semi-online solution that has full knowledge of the input stream, with differences in throughput, observed quality, resource usage, and energy efficiency of less than 1, 0.3, 0.2, and 4 percent, respectively.

Item Open Access: C-Stream: A coroutine-based elastic stream processing engine (Bilkent University, 2015)
Author: Şahin, Semih
Stream processing is a computational paradigm for on-the-fly processing of live data. This paradigm lends itself to implementations that can provide high throughput and low latency by taking advantage of the various forms of parallelism naturally captured by the stream processing model of computation, such as pipeline, task, and data parallelism. In this thesis, we describe the design and implementation of C-Stream, an elastic stream processing engine. C-Stream encompasses three unique properties. First, in contrast to the widely adopted event-based interface for developing stream processing operators, C-Stream provides an interface in which each operator has its own control loop and relies on data availability APIs to decide when to perform its computations. This self-control-based model significantly simplifies the development of operators that require multi-port synchronization. Second, C-Stream contains a multi-threaded dynamic scheduler that manages the execution of the operators.
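C-Stream is a full engine and the thesis does not describe a Python API, but the pull-based, self-controlled operator style it describes can be loosely evoked with generator coroutines, each operator pulling from its upstream when data is available:

```python
def map_op(source, fn):
    """A stream operator as a coroutine-style generator: it pulls tuples from
    its upstream source in its own loop, applies fn, and yields results."""
    for item in source:
        yield fn(item)

def filter_op(source, pred):
    """Pass through only the tuples satisfying the predicate."""
    for item in source:
        if pred(item):
            yield item

# a tiny pipeline: numbers -> squares -> keep the even ones
pipeline = filter_op(map_op(iter(range(10)), lambda x: x * x),
                     lambda x: x % 2 == 0)
result = list(pipeline)  # [0, 4, 16, 36, 64]
```

Because each generator advances only when its consumer pulls, back-pressure falls out of the model for free, a simplification a real scheduler with threads and replicas has to provide explicitly.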
The scheduler, which is customizable via plug-ins, enables the execution of the operators as coroutines, using any number of threads. The base scheduler implements back-pressure, provides the data availability APIs, and manages preemption and termination handling. Last, C-Stream provides elastic parallelization. It can dynamically adjust the number of threads used to execute an application, and it can also adjust the number of replicas of data-parallel operators to resolve bottlenecks. We provide an experimental evaluation of C-Stream. The results show that C-Stream is scalable, highly customizable, and able to resolve bottlenecks by dynamically adjusting the level of data parallelism used.

Item Open Access: Computationally highly efficient mixture of adaptive filters (Springer London, 2017)
Authors: Kilic, O. F.; Sayin, M. O.; Delibalta, I.; Kozat, S. S.
We introduce a new combination approach for mixtures of adaptive filters based on the set-membership filtering (SMF) framework. We perform SMF to combine the outputs of several adaptive algorithms running in parallel and propose unconstrained, affinely constrained, and convexly constrained combination weight configurations. We achieve a better trade-off between transient and steady-state convergence performance while providing a significant computational reduction. Hence, through the introduced approaches, we can greatly enhance the convergence performance of the constituent filters with only a slight increase in computational load. In this sense, our approaches are suitable for big data applications where the data must be processed in streams with highly efficient algorithms. In numerical examples, we demonstrate the superior performance of the proposed approaches over the state of the art using well-known datasets from the machine learning literature.
© 2016, Springer-Verlag London.

Item Open Access: Efficient community identification and maintenance at multiple resolutions on distributed datastores (Elsevier BV, 2015)
Authors: Aksu, H.; Canim, M.; Chang, Yuan-Chi; Korpeoglu, I.; Ulusoy, Özgür
The topic of network community identification at multiple resolutions is of great practical interest for finding highly cohesive subnetworks about different subjects in a network. For instance, one might examine the interconnections among web pages, blogs, and social content to identify pockets of influencers on subjects like "Big Data", "smart phone", or "global warming". With dynamic changes to its graph representation and content, the incremental maintenance of a community poses significant computational challenges. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multi-resolution community representation that has to be maintained over time. In this paper, we first formalize this problem using the k-core metric projected at multiple k-values, so that multiple community resolutions are represented by multiple k-core graphs. Recognizing that large graphs and their even larger attributed content cannot be stored and managed by a single server, we then propose distributed algorithms to construct and maintain a multi-k-core graph, implemented on the scalable big data platform Apache HBase. Our experimental evaluation demonstrates orders-of-magnitude speedups from maintaining a multi-k-core incrementally rather than reconstructing it from scratch. Our algorithms thus enable practitioners to create and maintain communities at multiple resolutions on multiple subjects in rich network content simultaneously.

Item Open Access: Efficient implementation of Newton-Raphson methods for sequential data prediction (IEEE Computer Society, 2017)
Authors: Civek, B. C.; Kozat, S. S.
We investigate the problem of sequential linear data prediction for real-life big data applications.
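The k-core metric that the community-identification item above projects at multiple k-values has a classic linear-time peeling algorithm. Here is a minimal single-machine sketch; the paper's contribution is the distributed, incremental version on HBase, not this:

```python
from collections import deque

def k_core(adj, k):
    """Return the k-core of an undirected graph (adjacency-list dict): the
    maximal subgraph in which every node keeps degree >= k, obtained by
    repeatedly peeling nodes of degree < k."""
    degree = {v: len(ns) for v, ns in adj.items()}
    to_peel = deque(v for v, d in degree.items() if d < k)
    removed = set()
    while to_peel:
        v = to_peel.popleft()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:                 # peeling v lowers its neighbors' degrees
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    to_peel.append(u)
    return {v: [u for u in ns if u not in removed]
            for v, ns in adj.items() if v not in removed}
```

Running this for several k-values yields the nested multi-resolution communities the paper maintains; the incremental problem is updating these cores as edges arrive or disappear without re-peeling from scratch.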
Second-order algorithms, i.e., Newton-Raphson methods, asymptotically achieve the performance of the best possible linear data predictor much faster than first-order algorithms, e.g., online gradient descent. However, implementing these second-order methods entails a computational complexity on the order of O(M²) for an M-dimensional feature vector, whereas the first-order methods offer complexity on the order of O(M). This extremely high computational cost prohibits their use in real-life big data applications. To this end, in order to enjoy the outstanding performance of the second-order methods, we introduce a highly efficient implementation that reduces their computational complexity from O(M²) to O(M). The presented algorithm provides the well-known merits of the second-order methods while offering computational complexity similar to the first-order methods. We do not rely on any statistical assumptions; hence, both the regular and the fast implementations achieve the same performance in terms of mean square error. We demonstrate the efficiency of our algorithm on several sequential big datasets and also illustrate its numerical stability. © 1989-2012 IEEE.

Item Open Access: Forecasting high-frequency excess stock returns via data analytics and machine learning (Wiley-Blackwell Publishing Ltd., 2021-11-23)
Authors: Akyıldırım, E.; Nguyen, D. K.; Şensoy, Ahmet; Šikić, M.
Borsa Istanbul introduced data analytics to present additional information about its market conditions. We examine whether this product can be utilized via various machine learning methods to predict intraday excess returns. These analytics provide significant prediction ratios above 50%, with ideal profit ratios that can reach up to 33%. Among all the methods considered, XGBoost (logistic regression) performs better in predicting excess returns in the long-term (short-term) analysis.
Results provide evidence for the benefits of both the analytics and the machine learning methods and raise further discussion of semi-strong market efficiency.

Item Open Access: Forecasting high-frequency excess stock returns via data analytics and machine learning (John Wiley and Sons Inc., 2023-01)
Authors: Akyildirim, E.; Nguyen, D.K.; Şensoy, Ahmet; Šikić, M.
Borsa Istanbul introduced data analytics to present additional information about its market conditions. We examine whether this product can be utilized via various machine learning methods to predict intraday excess returns. These analytics provide significant prediction ratios above 50%, with ideal profit ratios that can reach up to 33%. Among all the methods considered, XGBoost (logistic regression) performs better in predicting excess returns in the long-term (short-term) analysis. Results provide evidence for the benefits of both the analytics and the machine learning methods and raise further discussion of semi-strong market efficiency.

Item Open Access: Foreword: 1st International Workshop on High Performance Computing for Big Data (IEEE, 2014-09)
Authors: Kaya, Kamer; Gedik, Buğra; Çatalyürek, Ümit V.
The 1st International Workshop on High Performance Computing for Big Data (HPC4BD) was held on September 10, 2014, in conjunction with the 43rd International Conference on Parallel Processing (ICPP-2014). The workshop aimed to bring together high performance computing (HPC) experts and experts from various application domains to discuss their big data problems. Four works were accepted for presentation at this year's workshop; this foreword presents a summary of them. © 2014 IEEE.

Item Open Access: Graph aware caching policy for distributed graph stores (IEEE, 2015-03)
Authors: Aksu, Hidayet; Canım, M.; Chang, Y.-C.; Körpeoğlu, İbrahim; Ulusoy, Özgür
Graph stores are becoming increasingly popular among NoSQL applications seeking flexibility and heterogeneity in managing linked data.
Conceptually and in practice, applications ranging from social networks and knowledge representation to the Internet of Things benefit from graph data stores built on a combination of relational and non-relational technologies aimed at the desired performance characteristics. The most common data access pattern when querying graph stores is to traverse from a node to its neighboring nodes. This paper studies the impact of such traversal patterns on common data caching policies in a partitioned data environment where a big graph is distributed across the servers of a cluster. We propose and evaluate a new graph-aware caching policy designed to keep and evict nodes, edges, and their metadata in a way optimized for query traversal patterns. The algorithm takes into account the topology of the graph as well as the latency of access to graph nodes and their neighbors. We implemented graph-aware caching on a distributed data store, Apache HBase, in the Hadoop family. Performance evaluations showed up to 15x speedups on the benchmark datasets when our new graph-aware policy is preferred over non-aware policies. We also show how to improve the performance of existing caching algorithms for distributed graphs by exploiting topology information. © 2015 IEEE.

Item Open Access: Implicit concept drift detection for multi-label data streams (Bilkent University, 2022-01)
Author: Gülcan, Ege Berkay
Many real-world applications adopt multi-label data streams as the need increases for algorithms that can deal with rapidly generated data. For such streams, changes in the data distribution, also known as concept drift, cause existing classification models to rapidly lose their effectiveness. To assist the classifiers, we propose a novel algorithm called Label Dependency Drift Detector (LD3), an implicit (unsupervised) concept drift detector that uses the label dependencies within multi-label data streams.
Our study exploits the dynamic temporal dependencies between labels using a label influence ranking method, which leverages a data fusion algorithm and uses the produced ranking to detect concept drift. LD3 is the first unsupervised concept drift detection algorithm in the multi-label classification problem area. In this study, we perform an extensive evaluation of LD3 by comparing it with 14 prevalent supervised concept drift detection algorithms that we adapt to the problem area, using 12 datasets and a baseline classifier. The results show that LD3 provides between 19.8% and 68.6% better predictive performance than comparable detectors on both real-world and synthetic data streams.

Item Open Access: Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems (Elsevier BV, 2016)
Authors: Acer, S.; Selvitopi, O.; Aykanat, Cevdet
We propose a comprehensive and generic framework to minimize multiple and different volume-based communication cost metrics for sparse matrix dense matrix multiplication (SpMM). SpMM is an important kernel that finds application in computational linear algebra and big data analytics. On distributed memory systems, this kernel is usually characterized by its high communication volume requirements. Our approach targets irregularly sparse matrices and is based on both graph and hypergraph partitioning models that rely on the widely adopted recursive bipartitioning paradigm. The proposed models are lightweight, portable (they can be realized using any graph and hypergraph partitioning tool), and can simultaneously optimize different cost metrics besides total volume, such as the maximum send/receive volume or the maximum sum of send and receive volumes, in a single partitioning phase. They allow one to define and optimize as many custom volume-based metrics as desired through a flexible formulation.
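The SpMM kernel being partitioned in the item above multiplies a sparse matrix by a dense one. A plain serial CSR reference implementation (no partitioning or communication, just the arithmetic being distributed) looks like this:

```python
def spmm(indptr, indices, data, B):
    """Reference SpMM: multiply a CSR sparse matrix A (indptr, indices, data)
    by a dense matrix B given as a list of rows; returns C = A @ B."""
    width = len(B[0])
    C = []
    for i in range(len(indptr) - 1):
        row = [0.0] * width
        for nz in range(indptr[i], indptr[i + 1]):   # nonzeros of row i
            a, j = data[nz], indices[nz]
            for c in range(width):
                row[c] += a * B[j][c]   # accumulate A[i, j] * B[j, :] into C[i, :]
        C.append(row)
    return C
```

In the distributed setting, each nonzero A[i, j] held by one processor needs the row B[j, :] that may live on another, which is why the partitioners above minimize volume-based communication metrics.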
Experiments on a wide range of about a thousand matrices show that the proposed models drastically reduce the maximum communication volume compared to standard partitioning models that address only the minimization of total volume. The improvements obtained on volume-based partition quality metrics using our models are validated with parallel SpMM as well as parallel multi-source BFS experiments on two large-scale systems. For parallel SpMM, compared to the standard partitioning models, our graph and hypergraph partitioning models achieve reductions of 14% and 22% in runtime, respectively, on average. Compared to the state-of-the-art partitioner UMPa, our graph model is overall 14.5x faster and achieves an average improvement of 19% in partition quality on instances that are bounded by maximum volume. For parallel BFS, we show on graphs with more than a billion edges that scalability can be significantly improved with our models compared to a recently proposed two-dimensional partitioning model.

Item Open Access Kişisel verilerin korunması ve rekabet hukuku boyutuyla büyük veri [Big data from the perspectives of personal data protection and competition law] (Bilkent University, 2020-12) Ekin, Beste
The technological developments of the current information age have brought with them the digitalization of the economy. In particular, as internet use has become widespread, the migration of economic activities to online environments has led to the emergence of the concept of the digital economy. Within this new economy, personal data have become the backbone of many digital markets. The phenomenon of "big data" plays an important role in this development, since big data allows undertakings to collect, analyze, and draw conclusions from data almost in real time. At the same time, the possibility that such use of personal data may conflict with the interests of individuals raises the need to protect these data.
Moreover, it is clear that the rapidly growing economic use of big data sets will also have certain effects on the competitive order. Building on the interaction between the fields of personal data protection, big data, and competition law, this study answers the question of to what extent the existing regime can remedy the competitive concerns that may arise in big-data-driven markets.