Locality-aware distributed state partitioning for stream processing systems
Today, there are many applications that deal with high-volume data streams. These distributed stream processing applications process data on-the-fly and provide real-time distributed computing for big data. Due to the volume of data they process, some of these applications make use of data parallel nodes. The state management for distributed nodes in these applications is an important task to handle, because of different use cases such as: dealing with node failures, checkpointing, data enrichment, and re-partitioning. Therefore, distributed stream processing applications need a state management mechanism. In this thesis, we present a locality-aware state management mechanism for distributed stream processing applications. The proposed mechanism provides a transparent locality-aware data partitioning and state management system for distributed stream processing applications. The mechanism partitions data while preserving locality and handles state transfer among nodes transparently, in order to adapt to potential changes in the partitioning. In addition to this, it provides operators with a high-performance state management facility that can tackle check-pointing scenarios. The idea is implemented as a pluggable library for the open-source, distributed stream-processing engine, Apache Storm.