dc.contributor.advisor | Ferhatosmanoğlu, Hakan | |
dc.contributor.author | Basık, Fuat | |
dc.date.accessioned | 2016-01-08T20:18:22Z | |
dc.date.available | 2016-01-08T20:18:22Z | |
dc.date.issued | 2014 | |
dc.identifier.uri | http://hdl.handle.net/11693/18335 | |
dc.description | Ankara : The Graduate School of Engineering and Science of Bilkent University, 2014. | en_US |
dc.description | Thesis (Master's) -- Bilkent University, 2014. | en_US |
dc.description | Includes bibliographical references (leaves 46-49). | en_US |
dc.description.abstract | Micro-blogging services have become major venues for information creation, as
well as channels of information dissemination. Accordingly, monitoring them for
relevant information is a critical capability. This is typically achieved by registering
content-based subscriptions with the micro-blogging service. Such subscriptions
are long running queries that are evaluated against the stream of posts.
Given the popularity and scale of micro-blogging services like Twitter and Weibo,
building a scalable infrastructure to evaluate these subscriptions is a challenge.
To address this challenge, we present the S3-TM system for streaming short text
matching. S3-TM is organized as a stream processing application, in the form of
a data parallel flow graph designed to be run on a data center environment. It
takes advantage of the structure of the publications (posts) and subscriptions to
perform the matching in a scalable manner, without broadcasting publications or
subscriptions to all of the matcher instances. The basic design of S3-TM uses a
scoped multicast for publications and scoped anycast for subscriptions. To further
improve throughput, we introduce publication routing algorithms that aim
at minimizing the scope of the multicasts. The first set of algorithms we develop
are based on partitioning the word co-occurrence frequency graph, with the
aim of routing posts that include commonly co-occurring words to a small set of
matchers. While effective, these algorithms fell short in balancing the load. To
address this, we develop the SALB algorithm, which provides better load balance
by modeling the load more accurately using the word-to-post bipartite graph. We
also develop a subscription placement algorithm, called LASP, to group together
similar subscriptions, in order to minimize the subscription matching cost. Furthermore,
to achieve good scalability for an increasing number of nodes, we introduce
simple yet effective techniques to handle workload skew. Finally, we introduce load shedding techniques for handling unexpected load spikes with small impact
on the accuracy. Our experimental results show that S3-TM is scalable. Furthermore,
the SALB algorithm provides more than 2.5× throughput compared to the
baseline multicast and outperforms the graph partitioning based approaches. | en_US |
dc.description.statementofresponsibility | Basık, Fuat | en_US |
dc.format.extent | xii, 49 leaves, graphics | en_US |
dc.language.iso | English | en_US |
dc.rights | info:eu-repo/semantics/openAccess | en_US |
dc.subject | Scalability | en_US |
dc.subject | Stream Processing | en_US |
dc.subject | Publish/Subscribe Systems | en_US |
dc.subject | Text Matching | en_US |
dc.subject.lcc | QA76.9.D338 B37 2014 | en_US |
dc.subject.lcsh | Data flow computing. | en_US |
dc.subject.lcsh | Text processing (Computer science) | en_US |
dc.subject.lcsh | Ubiquitous computing. | en_US |
dc.title | S³-TM : scalable streaming short text matching | en_US |
dc.type | Thesis | en_US |
dc.department | Department of Computer Engineering | en_US |
dc.publisher | Bilkent University | en_US |
dc.description.degree | M.S. | en_US |
dc.identifier.itemid | B149454 | |
dc.embargo.release | 2017-01-16 | |