Automating the Debugging of Datacenter Applications with ADDA

Gautam Altekar, Cristian Zamfir, George Candea and Ion Stoica

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2011-22
April 4, 2011

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-22.pdf

Debugging data-intensive distributed applications running in a datacenter (“datacenter applications”) is complex and time-consuming. Developers wish they had a way to deterministically replay failed executions with little human effort, but unfortunately no such tool exists today. We see two challenges in replay-based debugging: First, the clusters used to run datacenter applications consist of many nodes, so the nondeterminism resulting from multithreaded execution on a single node is compounded by the size of the cluster. Second, datacenter applications produce terabytes of intermediate data shipped from one node to the next—the total data volume, itself proportional to cluster size, makes full input recording for potential subsequent replay infeasible.

We present ADDA, a replay-debugging system for datacenter applications. We observe that these applications often consist of a separate “control plane” and “data plane,” and that the applications’ initial inputs are typically persisted in append-only storage for reasons unrelated to debugging. Building upon these observations, ADDA leverages the control/data plane separation to make recording of debug-critical data scalable even in large clusters, deterministically re-synthesizes intermediate data from the (already available) initial inputs, and performs reduced-scale replay, i.e., recreates failed executions on just a subset of the original cluster.

We show that ADDA scales well and deterministically replays real-world failures in Hypertable and Memcached. We also argue that ADDA’s techniques generalize to a broader set of datacenter applications.


BibTeX citation:

@techreport{Altekar:EECS-2011-22,
    Author = {Altekar, Gautam and Zamfir, Cristian and Candea, George and Stoica, Ion},
    Title = {Automating the Debugging of Datacenter Applications with ADDA},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2011},
    Month = {Apr},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-22.html},
    Number = {UCB/EECS-2011-22},
    Abstract = {Debugging data-intensive distributed applications
running in a datacenter (“datacenter applications”) is
complex and time-consuming. Developers wish they had
a way to deterministically replay failed executions with
little human effort, but unfortunately no such tool exists
today. We see two challenges in replay-based debugging:
First, the clusters used to run datacenter applications consist of many nodes, so the nondeterminism resulting from
multithreaded execution on a single node is compounded
by the size of the cluster. Second, datacenter applications
produce terabytes of intermediate data shipped from one
node to the next—the total data volume, itself proportional
to cluster size, makes full input recording for potential
subsequent replay infeasible.

We present ADDA, a replay-debugging system for datacenter
applications. We observe that these applications
often consist of a separate “control plane” and “data
plane,” and that the applications’ initial inputs are
typically persisted in append-only storage for reasons
unrelated to debugging. Building upon these observations,
ADDA leverages the control/data plane separation to
make recording of debug-critical data scalable even in
large clusters, deterministically re-synthesizes
intermediate data from the (already available) initial
inputs, and performs reduced-scale replay, i.e., recreates
failed executions on just a subset of the original cluster.

We show that ADDA scales well and deterministically replays real-world failures in Hypertable and Memcached. We also argue that ADDA’s techniques generalize to a broader set of datacenter applications.}
}

EndNote citation:

%0 Report
%A Altekar, Gautam
%A Zamfir, Cristian
%A Candea, George
%A Stoica, Ion
%T Automating the Debugging of Datacenter Applications with ADDA
%I EECS Department, University of California, Berkeley
%D 2011
%8 April 4
%@ UCB/EECS-2011-22
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-22.html
%F Altekar:EECS-2011-22