Adaptive Stream Processing using Dynamic Batch Sizing

Tathagata Das, Yuan Zhong, Ion Stoica and Scott Shenker

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2014-133
June 3, 2014

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-133.pdf

The need for real-time processing of "big data" has led to the development of frameworks for distributed stream processing in clusters. It is important for such frameworks to be robust against variable operating conditions such as server failures, changes in data ingestion rates, and workload characteristics. To provide fault tolerance and efficient stream processing at scale, recent stream processing frameworks have proposed to treat streaming workloads as a series of batch jobs on small batches of streaming data. However, the robustness of such frameworks against variable operating conditions has not been explored.

In this paper, we explore the effect of the size of batches on the performance of streaming workloads. The throughput and end-to-end latency of the system can have complicated relationships with batch sizes, data ingestion rates, variations in available resources, workload characteristics, etc. We propose a simple yet robust control algorithm that automatically adapts batch sizes as the situation necessitates. We show through extensive experiments that this algorithm is powerful enough to ensure system stability and low end-to-end latency for a wide class of workloads, despite large variations in data rates and operating conditions.
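The abstract does not spell out the control algorithm itself, but the core idea of dynamic batch sizing can be illustrated with a small control loop. The sketch below is a hypothetical illustration, not the algorithm from the report: it assumes the streaming system reports the processing time of each completed batch and allows the interval for the next batch to be changed, and all class names, parameters, and update rules are the editor's own.

```python
# A minimal, hypothetical sketch of a batch-interval control loop for a
# micro-batch streaming system. It assumes the system can report the
# processing time of each completed batch and can change the interval at
# which the next batch is formed. The update rule and all parameter values
# are illustrative; they are not the algorithm proposed in the report.

class BatchIntervalController:
    def __init__(self, initial_interval=1.0, min_interval=0.1,
                 max_interval=30.0, target_ratio=0.8, smoothing=0.3):
        self.interval = initial_interval   # batch interval in seconds
        self.min_interval = min_interval
        self.max_interval = max_interval
        # Aim for processing_time <= target_ratio * interval, leaving headroom.
        self.target_ratio = target_ratio
        self.smoothing = smoothing         # weight for exponential smoothing
        self._smoothed_ptime = None

    def on_batch_completed(self, processing_time):
        """Update and return the interval to use for the next batch."""
        # Smooth the measurement so transient spikes do not cause oscillation.
        if self._smoothed_ptime is None:
            self._smoothed_ptime = processing_time
        else:
            self._smoothed_ptime = (self.smoothing * processing_time +
                                    (1.0 - self.smoothing) * self._smoothed_ptime)

        # Fixed-point-style update: if processing time grows roughly linearly
        # with the amount of data in a batch (and hence with the interval),
        # repeatedly setting the next interval to processing_time / target_ratio
        # converges to an interval at which each batch finishes before the next
        # one is due -- the condition needed for stable, low-latency operation.
        candidate = self._smoothed_ptime / self.target_ratio
        self.interval = max(self.min_interval, min(self.max_interval, candidate))
        return self.interval


if __name__ == "__main__":
    # Toy simulation: processing time is a linear function of the interval
    # (per-record cost times arrival rate, plus fixed per-batch overhead).
    controller = BatchIntervalController()
    interval = controller.interval
    for step in range(10):
        processing_time = 0.5 * interval + 0.2   # synthetic workload model
        interval = controller.on_batch_completed(processing_time)
        print(f"step {step}: interval = {interval:.3f}s")
```

A real controller would also have to bound how fast the interval is allowed to change and react to queue build-up under sudden load spikes, which is exactly the kind of robustness under variable operating conditions that the report studies.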

Advisors: Scott Shenker and Ion Stoica


BibTeX citation:

@mastersthesis{Das:EECS-2014-133,
    Author = {Das, Tathagata and Zhong, Yuan and Stoica, Ion and Shenker, Scott},
    Title = {Adaptive Stream Processing using Dynamic Batch Sizing},
    School = {EECS Department, University of California, Berkeley},
    Year = {2014},
    Month = {Jun},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-133.html},
    Number = {UCB/EECS-2014-133},
    Abstract = {The need for real-time processing of ``big data'' has led to the development of 
frameworks for distributed stream processing in clusters. It is important for such frameworks to be robust against variable operating conditions such as server failures, changes in data ingestion rates, and workload characteristics. To provide fault tolerance and efficient stream processing at scale, recent stream processing frameworks have proposed to treat streaming workloads as a series of batch jobs on small batches of streaming data. However, the robustness of such frameworks against variable operating conditions has not been explored. 

In this paper, we explore the effect of the size of batches on the performance of streaming workloads. The throughput and end-to-end latency of the system can have complicated relationships with batch sizes, data ingestion rates, variations in available resources, workload characteristics, etc. We propose a simple yet robust control algorithm that automatically adapts batch sizes as the situation necessitates. We show through extensive experiments that this algorithm is powerful enough to ensure system stability and low end-to-end latency for a wide class of workloads, despite large variations in data rates and operating conditions.}
}

EndNote citation:

%0 Thesis
%A Das, Tathagata
%A Zhong, Yuan
%A Stoica, Ion
%A Shenker, Scott
%T Adaptive Stream Processing using Dynamic Batch Sizing
%I EECS Department, University of California, Berkeley
%D 2014
%8 June 3
%@ UCB/EECS-2014-133
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-133.html
%F Das:EECS-2014-133