DCR: Replay Debugging for the Datacenter

Gautam Altekar and Ion Stoica

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2010-74

May 13, 2010

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-74.pdf

Debugging is hard, but debugging production datacenter applications such as Cassandra, Hadoop, and Hypertable is downright daunting. The key obstacle is non-deterministic failures, that is, hard-to-reproduce program misbehaviors that are immune to traditional cyclic debugging techniques. Datacenter applications are rife with such failures because they operate in highly non-deterministic environments: a typical setup employs thousands of nodes, spread across multiple datacenters, to process terabytes of data per day. In these environments, existing methods for debugging non-deterministic failures are of limited use. They either incur excessive production overheads or don’t scale to multi-node, terabyte-scale processing.

To help remedy the situation, we have built a new replay debugging tool. Our tool, called DCR, enables the reproduction and debugging of non-deterministic failures in production datacenter runs. The key observation behind DCR is that debugging does not always require a precise replica of the original datacenter run. Instead, it often suffices to produce some run that exhibits the original behaviors of the control plane, the most error-prone component of datacenter applications. DCR leverages this observation to relax the determinism guarantees offered by the system, and consequently, to address all key requirements of production datacenter applications: lightweight recording of long-running programs, causally consistent replay of large-scale systems, and out-of-the-box operation on real-world applications.

BibTeX citation:

@techreport{Altekar:EECS-2010-74,
    Author= {Altekar, Gautam and Stoica, Ion},
    Title= {DCR: Replay Debugging for the Datacenter},
    Year= {2010},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-74.html},
    Number= {UCB/EECS-2010-74},
    Abstract= {Debugging is hard, but debugging production datacenter
applications such as Cassandra, Hadoop, and
Hypertable is downright daunting. The key obstacle
is non-deterministic failures, that is, hard-to-reproduce program
misbehaviors that are immune to traditional cyclic debugging
techniques. Datacenter applications are rife
with such failures because they operate in highly non-deterministic
environments: a typical setup employs
thousands of nodes, spread across multiple datacenters,
to process terabytes of data per day. In these environments,
existing methods for debugging non-deterministic
failures are of limited use. They either incur excessive
production overheads or don’t scale to multi-node,
terabyte-scale processing.

To help remedy the situation, we have built a new replay
debugging tool. Our tool, called DCR, enables the
reproduction and debugging of non-deterministic failures
in production datacenter runs. The key observation behind
DCR is that debugging does not always require a
precise replica of the original datacenter run. Instead,
it often suffices to produce some run that exhibits the
original behaviors of the control plane, the most error-prone
component of datacenter applications. DCR leverages
this observation to relax the determinism guarantees
offered by the system, and consequently, to address
all key requirements of production datacenter applications:
lightweight recording of long-running programs,
causally consistent replay of large-scale systems, and
out-of-the-box operation on real-world applications.},
}

EndNote citation:

%0 Report
%A Altekar, Gautam 
%A Stoica, Ion 
%T DCR: Replay Debugging for the Datacenter
%I EECS Department, University of California, Berkeley
%D 2010
%8 May 13
%@ UCB/EECS-2010-74
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-74.html
%F Altekar:EECS-2010-74