Browsing by Subject "Cluster analysis--Data processing."

Open Access
Implementation of a specialized algorithm for clustering using minimum enclosing balls
(2010) Guruşçu, Utku
Clustering is the process of organizing objects into groups whose members are similar in some way. The main objective is to identify the underlying structures and patterns among the objects correctly. A cluster is therefore a collection of objects that are more similar to each other than to the objects belonging to other clusters. The clustering problem has applications in wide-ranging areas including facility location, classification of massive data, and marketing. Many of these applications call for the solution of large-scale clustering problems. The main problem of focus in this thesis is the computation of k spheres that enclose a given set of m vectors, which represent the set of objects, in such a way that the radius of the largest sphere or the sum of the radii of the spheres is as small as possible. The solutions of these problems allow one to divide the set of objects into k groups based on the level of similarity among them. Both of the aforementioned mathematical problems are NP-hard. Furthermore, as indicated by previous results in the literature, it is hard not only to find an optimal solution to these problems but also to find a good approximation to either of them. In this thesis, specialized algorithms have been designed and implemented by taking into account the special underlying structures of the studied problems. These algorithms are based on an efficient and systematic search for an optimal solution using a Branch-and-Bound framework. In the course of the algorithms, the problem of computing the smallest sphere that encloses a given set of vectors appears as a sequence of subproblems that need to be solved. Our algorithms rely heavily on recently developed efficient algorithms for this subproblem. Software implementing the proposed algorithms has been developed so that they can be used in practice, and a user-friendly interface has been designed for it.
Extensive computational results reveal that our algorithms are capable of solving large-scale instances of the problems efficiently. Since the architecture of the software has been designed in a flexible and modular fashion, it serves as a solid foundation for further studies in this area.
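The smallest-enclosing-sphere subproblem and the two objectives named in the abstract can be illustrated with a minimal sketch. It is not the thesis's Branch-and-Bound code: it assumes the Badoiu-Clarkson iteration as a stand-in approximation for the enclosing ball, and the names `enclosing_ball` and `ball_objectives` are hypothetical.

```python
import math

def enclosing_ball(points, iterations=1000):
    """Approximate the smallest ball enclosing `points`.

    Assumption: uses the Badoiu-Clarkson iteration, not the solver from the thesis.
    """
    center = list(points[0])
    for i in range(1, iterations + 1):
        # Move the center a shrinking step toward the current farthest point.
        farthest = max(points, key=lambda p: math.dist(center, p))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, farthest)]
    radius = max(math.dist(center, p) for p in points)
    return center, radius

def ball_objectives(points, labels, k):
    """Return (largest radius, sum of radii) for a given assignment to k groups."""
    radii = []
    for group in range(k):
        members = [p for p, g in zip(points, labels) if g == group]
        radii.append(enclosing_ball(members)[1] if members else 0.0)
    return max(radii), sum(radii)
```

A search over assignments, such as the Branch-and-Bound framework described above, would call a routine like `ball_objectives` many times, which is why the speed of the enclosing-ball subproblem matters.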
Open Access
Row generation techniques for approximate solution of linear programming problems
(2010) Paç, A. Burak
In this study, row generation techniques are applied to general linear programming problems in which the number of constraints is very large relative to the problem dimension. A lower bound is obtained for the change in the objective value caused by the generation of a specific row. To achieve row selection that results in a large shift in the feasible region and the objective value at each row generation iteration, the lower bound is used in the comparison of row generation candidates. For a warm start to the solution procedure, an effective selection of the subset of constraints that constitutes the initial LP is considered. Several strategies are discussed to form such a small subset of constraints so as to obtain an initial solution close to the feasible region of the original LP. Approximation schemes are designed and compared to make it possible to terminate row generation at a solution in the proximity of an optimal solution of the input LP. The row generation algorithm presented in this study, which is enhanced with a warm-start strategy and an approximation scheme, is implemented and tested for computation time and the number of rows generated. Two efficient primal simplex method variants are used for benchmarking computation times, and the row generation algorithm appears to perform better than at least one of them, especially when the number of constraints is large.
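The basic row generation loop can be sketched as follows, under strong simplifying assumptions: the LP has only two variables so that a brute-force vertex search can replace a simplex solver, the initial rows make the relaxed LP bounded, and candidates are ranked by largest violation rather than by the lower bound developed in the study. The names `solve_2d_lp` and `row_generation` are hypothetical.

```python
from itertools import combinations

def solve_2d_lp(c, rows):
    """Minimize c.x subject to a.x <= b for each (a, b) in rows, with x in R^2.

    Brute force over intersections of row pairs; only sensible for tiny row sets.
    """
    best = None
    for (a1, b1), (a2, b2) in combinations(rows, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:
            continue  # parallel rows do not define a vertex
        x = ((b1 * a2[1] - a1[1] * b2) / det, (a1[0] * b2 - a2[0] * b1) / det)
        if all(a[0] * x[0] + a[1] * x[1] <= b + 1e-9 for a, b in rows):
            value = c[0] * x[0] + c[1] * x[1]
            if best is None or value < best[0]:
                best = (value, x)
    return best

def row_generation(c, all_rows, initial_rows, tol=1e-7):
    """Solve the LP over a growing working set of rows.

    Assumption: the most violated row is generated at each iteration.
    Returns (objective value, solution, number of rows in the working set).
    """
    working = list(initial_rows)
    while True:
        result = solve_2d_lp(c, working)
        if result is None:
            raise ValueError("working LP is infeasible or has no vertex")
        value, x = result
        violation, worst = max(
            ((a[0] * x[0] + a[1] * x[1] - b, (a, b)) for a, b in all_rows),
            key=lambda t: t[0],
        )
        if violation <= tol:
            return value, x, len(working)
        working.append(worst)
```

Raising `tol` ends the loop earlier at a slightly infeasible point, which is one simple way to trade accuracy for fewer generated rows.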
Open Access
Scalable streaming profile clustering for telco analytics
(2013) Abbasoğlu, Mehmet Ali
Many telco analytics require maintaining call profiles based on recent customer call patterns. Such profiles are typically organized as aggregations computed at different time scales over the recent customer interactions. Clustering these profiles is needed to group customers with similar calling patterns and to build aggregate models for them. Example applications include optimizing tariffs, segmentation, and usage forecasting. In this thesis, we present an approach for clustering profiles that are incrementally maintained over a stream of updates. Due to the large number of customers, maintaining profile clusters has high processing and memory resource requirements. In order to tackle this problem, we apply distributed stream processing. However, in the presence of distributed state, it is a major challenge to partition the profiles over machines (nodes) such that memory and computation balance is maintained, while keeping the clustering accuracy high. Furthermore, to adapt to potentially changing customer calling patterns, the partitioning of profiles to machines should be continuously revised, yet one should minimize the migration of profiles so as not to disturb the online processing of updates. We provide a re-partitioning technique that achieves all these goals. We keep micro-cluster summaries at each node, collect these summaries at a centralized node, and use a greedy algorithm with novel affinity heuristics to revise the partitioning. We present a demo application that showcases our Storm- and HBase-based implementation in the context of a customer segmentation application.
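The idea of additive micro-cluster summaries and a greedy revision of the partitioning can be sketched as below. This is an illustration under stated assumptions, not the thesis's technique: the summary fields (count, linear sum, squared sum) follow the common clustering-feature layout, the affinity heuristics of the thesis are not reproduced, and the "stay on the home node while it has room" rule, the `slack` parameter, and all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MicroCluster:
    """Additive summary of a group of profiles: count, linear sum, squared sum."""
    n: int = 0
    linear: list = field(default_factory=list)
    squared: float = 0.0

    def add(self, profile):
        """Absorb one profile vector into the summary."""
        if not self.linear:
            self.linear = [0.0] * len(profile)
        self.n += 1
        self.linear = [s + v for s, v in zip(self.linear, profile)]
        self.squared += sum(v * v for v in profile)

    def merge(self, other):
        """Fold another summary into this one; the fields are simply added."""
        if other.n == 0:
            return
        if not self.linear:
            self.linear = [0.0] * len(other.linear)
        self.n += other.n
        self.linear = [s + o for s, o in zip(self.linear, other.linear)]
        self.squared += other.squared

    def centroid(self):
        return [s / self.n for s in self.linear]

def greedy_repartition(summaries, current_node, num_nodes, slack=1.2):
    """Assign micro-clusters to nodes, largest first.

    summaries: {cluster_id: MicroCluster}; current_node: {cluster_id: node index}.
    Hypothetical rule: keep a cluster on its current node while that node stays
    under `slack` times the average load, otherwise move it to the lightest node.
    """
    capacity = slack * sum(m.n for m in summaries.values()) / num_nodes
    load = [0] * num_nodes
    placement = {}
    for cid in sorted(summaries, key=lambda c: summaries[c].n, reverse=True):
        size = summaries[cid].n
        home = current_node[cid]
        if load[home] + size <= capacity:
            node = home
        else:
            node = min(range(num_nodes), key=load.__getitem__)
        placement[cid] = node
        load[node] += size
    return placement
```

Because the summaries are additive, each node can ship only the three fields per micro-cluster to the central node instead of the raw profiles.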
Open Access
Scaling forecasting algorithms using clustered modeling
(2013) Güvercin, Mehmet
Research on statistical forecasting has traditionally focused on building more accurate models for a given time series. The models are mostly applied only to limited data because of their limited efficiency and scalability. However, many enterprise applications such as Customer Relationship Management (CRM) and Customer Experience Management (CEM) require scalable forecasting on a large number of data series. For example, telecommunication companies need to forecast each of their customers’ traffic load individually to understand their needs and behavior, and to tailor targeted campaigns. Forecasting models are easily applied to aggregate traffic data to estimate the total traffic volume for revenue estimation and resource planning. However, they cannot be applied to each user individually, as building accurate models for a large number of users would be time consuming. The problem is exacerbated when the forecasting process is continuous and the models need to be updated periodically. We address the problem of building and updating forecasting models continuously for multiple data series and propose dynamic clustered modeling optimized for forecasting. We introduce representative models as an analogy to cluster centers, and apply the models to each individual series through iterative nonlinear optimization. The approach performs modeling and clustering simultaneously, makes forecasts by applying representative models to each data series, and updates the model parameters for a continuous forecasting process. Our findings indicate that understanding an individual’s behavior within its segment’s model provides more scalability and accuracy than computing the individual model itself. Experimental results from a real telecom CRM application show that the method is highly efficient and scalable, and also more accurate than having separate individual models.
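The notion of representative models acting like cluster centers can be sketched with a deliberately simple model family. The sketch assumes a one-parameter AR(1) model, x[t] = phi * x[t-1], fitted by closed-form least squares, which stands in for the iterative nonlinear optimization described in the abstract; the function names are hypothetical.

```python
def one_step_error(series, phi):
    """Sum of squared one-step-ahead errors of the model x[t] = phi * x[t-1]."""
    return sum((b - phi * a) ** 2 for a, b in zip(series, series[1:]))

def fit_phi(group):
    """Least-squares AR(1) coefficient pooled over every series in the group."""
    num = sum(a * b for s in group for a, b in zip(s, s[1:]))
    den = sum(a * a for s in group for a in s[:-1])
    return num / den if den else 0.0

def clustered_models(series_list, initial_phis, rounds=20):
    """Alternate between assigning series to models and refitting the models."""
    phis = list(initial_phis)
    labels = []
    for _ in range(rounds):
        # Assignment step: each series joins the model that predicts it best.
        labels = [
            min(range(len(phis)), key=lambda j: one_step_error(s, phis[j]))
            for s in series_list
        ]
        # Update step: refit each representative model on its members.
        for j in range(len(phis)):
            group = [s for s, g in zip(series_list, labels) if g == j]
            if group:
                phis[j] = fit_phi(group)
    return phis, labels

def forecast(series, phi):
    """One-step-ahead forecast using a representative model."""
    return phi * series[-1]
```

Only as many models as there are clusters are fitted, so the cost of model building grows with the number of clusters rather than with the number of series.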