Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing

Matei Zaharia and Tathagata Das and Haoyuan Li and Timothy Hunter and Scott Shenker and Ion Stoica

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2012-259

December 14, 2012

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf

Many "big data" applications need to act on data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism, parallel recovery of lost state, that improves efficiency over the traditional replication and upstream backup schemes in streaming databases, and unlike previous systems, they also mitigate stragglers. We implement D-Streams as an extension to the Spark cluster computing engine that lets users seamlessly intermix streaming, batch, and interactive queries. Our system can process over 60 million records/second at sub-second latency on 100 nodes.
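
The idea summarized in the abstract can be illustrated with a toy sketch. It is not the report's implementation and does not use the Spark API. It assumes that a stream is cut into short time intervals, that each interval is processed by a deterministic, side-effect-free function, and that lost results can therefore be rebuilt by rerunning that function on the saved inputs, with several workers sharing the recomputation. The function names, the one-second interval, and the use of threads in place of cluster nodes are all illustrative assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def discretize(records, interval):
    """Group (timestamp, value) records into consecutive batches.

    Batch k holds the values whose timestamp falls in
    [k * interval, (k + 1) * interval). Empty intervals between the
    first and last observed batch are kept as empty lists.
    """
    by_index = {}
    for ts, value in records:
        by_index.setdefault(int(ts // interval), []).append(value)
    if not by_index:
        return []
    first, last = min(by_index), max(by_index)
    return [by_index.get(i, []) for i in range(first, last + 1)]


def count_words(batch):
    """Deterministic per-batch computation: same input, same output."""
    return Counter(batch)


def recompute_in_parallel(batches, lost, workers=4):
    """Rebuild results for the batch positions in `lost`.

    Because count_words is deterministic, rerunning it on the saved
    inputs reproduces the lost results. The thread pool is a stand-in
    (assumption) for spreading recovery work across cluster nodes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rebuilt = pool.map(count_words, (batches[i] for i in lost))
        return dict(zip(lost, rebuilt))


# Hypothetical input: (timestamp in seconds, word) pairs.
records = [(0.1, "a"), (0.4, "b"), (1.2, "a"), (1.9, "a"), (2.5, "c")]
batches = discretize(records, interval=1.0)
results = [count_words(b) for b in batches]

# Pretend the results for batches 0 and 2 were lost with a failed node.
recovered = recompute_in_parallel(batches, lost=[0, 2])
```

In this sketch the recovered results equal the originals because each batch result depends only on its input, which is the property that makes recomputation a viable alternative to replicating state.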

BibTeX citation:

@techreport{Zaharia:EECS-2012-259,
    Author= {Zaharia, Matei and Das, Tathagata and Li, Haoyuan and Hunter, Timothy and Shenker, Scott and Stoica, Ion},
    Title= {Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing},
    Year= {2012},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.html},
    Number= {UCB/EECS-2012-259},
    Abstract= {Many "big data" applications need to act on data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism, parallel recovery of lost state, that improves efficiency over the traditional replication and upstream backup schemes in streaming databases, and unlike previous systems, they also mitigate stragglers. We implement D-Streams as an extension to the Spark cluster computing engine that lets users seamlessly intermix streaming, batch, and interactive queries. Our system can process over 60 million records/second at sub-second latency on 100 nodes.},
}

EndNote citation:

%0 Report
%A Zaharia, Matei
%A Das, Tathagata
%A Li, Haoyuan
%A Hunter, Timothy
%A Shenker, Scott
%A Stoica, Ion
%T Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing
%I EECS Department, University of California, Berkeley
%D 2012
%8 December 14
%@ UCB/EECS-2012-259
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.html
%F Zaharia:EECS-2012-259