Towards unifying mobility datasets
With the proliferation of smartphones equipped with positioning systems and the increasing penetration of the Internet of Things (IoT) in our daily lives, mobility data has become widely available. A vast variety of mobile services and applications either have a location-based context or produce spatio-temporal records as a byproduct. These records contain information about both the entities that produce them and the environment in which they were produced. The availability of such data supports smart services in areas including healthcare, computational social sciences, and location-based marketing. We postulate that the spatio-temporal usage records belonging to the same real-world entity can be matched across different location-enhanced services. Matching such records is a fundamental problem in many applications, such as linking user identities for security, understanding the privacy limitations of location-based services, or producing a unified dataset from multiple sources for urban planning and traffic management. Such integrated datasets are also essential for service providers to optimise their services and improve business intelligence. In this work, we therefore explore scalable solutions to link entities across two mobility datasets using only their spatio-temporal information, to pave the road towards unifying mobility datasets. The first approach is rule-based linkage, built on the concept of k-l diversity, which we developed to capture both the spatial and temporal aspects of the linkage. This model is realized by a scalable linking algorithm called ST-Link, which makes use of effective spatial and temporal filtering mechanisms that significantly reduce the search space for matching users. Furthermore, ST-Link utilizes sequential scan procedures to avoid random disk access and thus scales to large datasets. The second approach is similarity-based linkage, which proposes a mobility-based representation and similarity computation for entities.
An efficient matching process is then developed to identify the final linked pairs, with an automated mechanism to decide when to stop the linkage. We scale the process with a locality-sensitive hashing (LSH) based approach that significantly reduces the number of candidate pairs for matching. To realize the effectiveness and efficiency of our techniques in practice, we introduce an algorithm called SLIM. We evaluate our work with respect to accuracy and performance using several datasets. Experiments show that both ST-Link and SLIM are effective in practice for performing spatio-temporal linkage and can scale to large datasets. Moreover, the LSH-based approach yields a speedup of two to four orders of magnitude.
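The general idea of using spatial and temporal filtering to shrink the set of candidate pairs can be illustrated with a minimal sketch. This is not ST-Link or SLIM: it is a simple grid-and-time-window bucketing, and the record layout `(entity_id, lat, lon, timestamp)`, the function names, the cell size, and the window length are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def bucket_key(lat, lon, ts, cell_deg=0.01, window_s=900):
    """Map a record to a coarse (lat cell, lon cell, time window) bucket.

    cell_deg and window_s are illustrative defaults, not values from the paper.
    """
    return (math.floor(lat / cell_deg),
            math.floor(lon / cell_deg),
            math.floor(ts / window_s))


def candidate_pairs(records_a, records_b, cell_deg=0.01, window_s=900):
    """Count shared buckets for each cross-dataset entity pair.

    Only pairs that co-occur in at least one bucket are returned, so
    entities that never appear close in space and time are never compared.
    """
    # Index dataset A: bucket -> set of entities seen in that bucket.
    buckets_a = defaultdict(set)
    for entity, lat, lon, ts in records_a:
        buckets_a[bucket_key(lat, lon, ts, cell_deg, window_s)].add(entity)

    counts = defaultdict(int)
    seen = set()  # count each (entity, bucket) of dataset B only once
    for entity, lat, lon, ts in records_b:
        key = bucket_key(lat, lon, ts, cell_deg, window_s)
        if (entity, key) in seen:
            continue
        seen.add((entity, key))
        for a_entity in buckets_a.get(key, ()):
            counts[(a_entity, entity)] += 1
    return dict(counts)


# Hypothetical records: (entity_id, latitude, longitude, unix_seconds)
records_a = [("a1", 41.0151, 28.9795, 1000),
             ("a1", 41.0352, 28.9853, 5000),
             ("a2", 40.9903, 29.0251, 1000)]
records_b = [("b1", 41.0153, 28.9797, 1100),
             ("b1", 41.0354, 28.9855, 5100),
             ("b2", 40.9905, 29.0253, 1200)]

pairs = candidate_pairs(records_a, records_b)
```

In this toy input, only two of the four possible cross-dataset pairs share a bucket, so only those two would be passed on to a more expensive matching step. Records that fall just across a cell or window boundary are missed by this simple scheme. A practical system would need to handle such cases, for example by also probing neighbouring buckets.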