Efficient analysis of large-scale social networks using big-data platforms
Embargo Lift Date: 2016-07-23
In recent years, the rise of very large, content-rich networks has reignited interest in complex/social network analysis at the big-data scale. This scale makes it possible to understand social interactions across entire populations, but it also poses computational challenges for earlier work whose algorithmic complexity is greater than O(n). This thesis analyzes very large-scale social networks to derive their important parameters and characteristics efficiently and effectively using big-data platforms. With the popularization of mobile phone usage, telecommunication networks have turned into a socially binding medium and enable researchers to analyze social interactions at very large scales. Degree distribution is one of the most important characteristics of social networks. To study degree characteristics and structural properties in large-scale social networks, we first gathered a tera-scale dataset of telecommunication call detail records. Using this data, we empirically evaluate several statistical models against the degree distribution of the country’s call graph and determine that a Pareto log-normal distribution provides the best fit, despite claims in the literature that a power-law distribution is the best model. We also examine how network operator, size, density and location affect the degree distribution, in order to understand the parameters that govern it in social networks. Besides structural property analysis, community identification is of great practical interest for finding highly cohesive subnetworks around different subjects in a social network. In graph theory, the k-core is a key metric used to identify subgraphs of high cohesion, also known as the ‘dense’ regions of a graph. As real-world graphs such as social networks grow in size, their contents get richer and their topologies change dynamically, the challenge is not only to materialize k-core subgraphs once but also to maintain them in order to keep up with continuous updates.
These challenges inspired us to propose a new set of distributed algorithms for k-core view construction and maintenance on a horizontally scaling storage and computing platform. Experimental evaluation demonstrates orders-of-magnitude speedup and the advantages of maintaining k-cores incrementally and in batch windows over complete reconstruction. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multiresolution community representation that has to be maintained over time. We therefore also propose distributed algorithms to construct and maintain multi-k-core graphs, implemented on the scalable big-data platform Apache HBase. Our experimental evaluation demonstrates orders-of-magnitude speedup from maintaining multi-k-core graphs incrementally rather than reconstructing them completely. Furthermore, we propose a graph-aware cache system designed for distributed graph processing. Experimental results demonstrate up to 15x speedup compared to traditional LRU-based cache systems.
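For readers unfamiliar with the k-core, the following is a minimal single-machine sketch of the standard peeling procedure: repeatedly remove every vertex with fewer than k neighbours until none remains. It only illustrates the definition. It is not the distributed, HBase-based construction or the incremental maintenance algorithm proposed in the thesis, and the function name `k_core` and the edge-list input format are assumptions made for this example.

```python
from collections import defaultdict


def k_core(edges, k):
    """Return the vertex set of the k-core of an undirected simple graph.

    `edges` is an iterable of (u, v) pairs; self-loops are ignored.
    This is the textbook peeling approach, not the thesis's algorithm.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Current degree of each vertex within the not-yet-removed subgraph.
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    removed = set()

    # Start with every vertex that already violates the degree bound.
    stack = [v for v, d in degree.items() if d < k]
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        # Removing v lowers its neighbours' degrees, which may
        # push them below k as well (the cascading "peel").
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] < k:
                    stack.append(w)

    return set(adj) - removed


# A triangle (a, b, c) with one pendant vertex d attached to c.
example_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
two_core = k_core(example_edges, 2)
```

Calling `k_core` for several values of k on the same graph gives a nested family of subgraphs, which is the intuition behind the multi-level representation mentioned above. Recomputing from scratch after each edge update, as this sketch would require, is the complete-reconstruction baseline that the incremental approach is compared against.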
Keywords
Social Network Analysis
Big Data Analytics