S3-TM: scalable streaming short text matching

dc.citation.epage: 866 [en_US]
dc.citation.issueNumber: 6 [en_US]
dc.citation.spage: 849 [en_US]
dc.citation.volumeNumber: 24 [en_US]
dc.contributor.author: Basık, F. [en_US]
dc.contributor.author: Gedik, B. [en_US]
dc.contributor.author: Ferhatosmanoğlu, H. [en_US]
dc.contributor.author: Kalender, M. E. [en_US]
dc.date.accessioned: 2016-02-08T10:58:40Z
dc.date.available: 2016-02-08T10:58:40Z
dc.date.issued: 2015 [en_US]
dc.department: Department of Computer Engineering [en_US]
dc.description.abstract: Micro-blogging services have become major venues for information creation, as well as channels of information dissemination. Accordingly, monitoring them for relevant information is a critical capability. This is typically achieved by registering content-based subscriptions with the micro-blogging service. Such subscriptions are long-running queries that are evaluated against the stream of posts. Given the popularity and scale of micro-blogging services like Twitter and Weibo, building a scalable infrastructure to evaluate these subscriptions is a challenge. To address this challenge, we present the S3-TM system for streaming short text matching. S3-TM is organized as a stream processing application, in the form of a data parallel flow graph designed to be run in a data center environment. It takes advantage of the structure of the publications (posts) and subscriptions to perform the matching in a scalable manner, without broadcasting publications or subscriptions to all of the matcher instances. The basic design of S3-TM uses a scoped multicast for publications and a scoped anycast for subscriptions. To further improve throughput, we introduce publication routing algorithms that aim at minimizing the scope of the multicasts. The first set of algorithms we develop is based on partitioning the word co-occurrence frequency graph, with the aim of routing posts that include commonly co-occurring words to a small set of matchers. While effective, these algorithms fall short in balancing the load. To address this, we develop the SALB algorithm, which provides better load balance by modeling the load more accurately using the word-to-post bipartite graph. We also develop a subscription placement algorithm, called LASP, to group together similar subscriptions, in order to minimize the subscription matching cost. Furthermore, to achieve good scalability for an increasing number of nodes, we introduce techniques to handle workload skew. Finally, we introduce load shedding techniques for handling unexpected load spikes with small impact on the accuracy. Our experimental results show that S3-TM is scalable. Furthermore, the SALB algorithm provides more than 2.5× throughput compared to the baseline multicast and outperforms the graph partitioning-based approaches. [en_US]
dc.identifier.doi: 10.1007/s00778-015-0404-3 [en_US]
dc.identifier.issn: 1066-8888
dc.identifier.uri: http://hdl.handle.net/11693/26352
dc.language.iso: English [en_US]
dc.publisher: Association for Computing Machinery [en_US]
dc.relation.isversionof: http://dx.doi.org/10.1007/s00778-015-0404-3 [en_US]
dc.source.title: The VLDB Journal [en_US]
dc.subject: Publish/subscribe [en_US]
dc.subject: Short text matching [en_US]
dc.subject: Stream processing [en_US]
dc.subject: Flow graphs [en_US]
dc.subject: Graph theory [en_US]
dc.subject: Information dissemination [en_US]
dc.subject: Multicasting [en_US]
dc.subject: Parallel flow [en_US]
dc.subject: Publishing [en_US]
dc.subject: Social networking (online) [en_US]
dc.subject: Content-based subscription [en_US]
dc.subject: Information creation [en_US]
dc.subject: Micro-blogging services [en_US]
dc.subject: Placement algorithm [en_US]
dc.subject: Scalable infrastructure [en_US]
dc.subject: Short texts [en_US]
dc.subject: Algorithms [en_US]
dc.title: S3-TM: scalable streaming short text matching [en_US]
dc.type: Article [en_US]
Files
Original bundle
Name: S3-TM scalable streaming short text matching.pdf
Size: 1.64 MB
Format: Adobe Portable Document Format
Description: Full printable version