Minimizing Churn in Distributed Systems

Philip Brighten Godfrey, Scott Shenker and Ion Stoica

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2006-25
March 17, 2006

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-25.pdf

A pervasive requirement of distributed systems is to deal with churn --- change in the set of participating nodes due to joins, graceful leaves, and failures. A high churn rate can increase costs or decrease service quality. This paper studies how to reduce churn by selecting which subset of a set of available nodes to use.

First, we compare the performance of a range of node selection strategies using five real-world traces. Among our findings is that the simple strategy of picking a uniform-random replacement whenever a node fails performs surprisingly well. We explain its performance through analysis in a stochastic model.
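As an illustration only, the following is a minimal Python sketch of the uniform-random replacement strategy described above (it is not the paper's simulator, and the function name and churn accounting are assumptions): keep a working set of k nodes and, whenever one fails, draw a replacement uniformly at random from the remaining available nodes.

    import random

    def replace_failed_uniformly(available, in_use, k, rng=random):
        """Refill the in-use set to size k with uniform-random picks from
        the available-but-unused nodes; each replacement counts as one
        unit of churn. Illustrative sketch, not the paper's code."""
        churn = 0
        while len(in_use) < k:
            candidates = list(available - in_use)
            if not candidates:            # not enough live nodes to reach k
                break
            in_use.add(rng.choice(candidates))
            churn += 1
        return churn

    # Example: a 3-node working set drawn from 10 available nodes; n1 fails.
    available = {"n%d" % i for i in range(10)}
    in_use = {"n0", "n1", "n2"}
    available.discard("n1")
    in_use.discard("n1")
    print(replace_failed_uniformly(available, in_use, k=3))   # prints 1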

Second, we show that a class of strategies, which we call "Preference List" strategies, commonly arises from optimizing for a metric other than churn and produces high churn relative to more randomized strategies under realistic node failure patterns. Using this insight, we demonstrate and explain differences in performance for designs that incorporate varying degrees of randomization. We give examples from a variety of protocols, including anycast, overlay multicast, and distributed hash tables. In many cases, simply adding some randomization can go a long way towards reducing churn.
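To make the contrast concrete, here is a minimal hedged sketch of the two styles of strategy in Python; the function names, the scoring function, and the slack parameter are assumptions for illustration, not the paper's definitions.

    import random

    def preference_list_select(live_nodes, k, score):
        # Deterministic "Preference List" style: always use the k highest-
        # scoring live nodes (score could be uptime, capacity, proximity, ...).
        # When the marginal nodes fail and recover, the same few nodes keep
        # entering and leaving the in-use set, driving churn up.
        return set(sorted(live_nodes, key=score, reverse=True)[:k])

    def randomized_select(live_nodes, k, score, slack=2, rng=random):
        # Hypothetical randomized variant: restrict attention to the top
        # slack*k candidates by score, then choose k of them uniformly at
        # random, trading a little of the optimized metric for a more
        # stable in-use set.
        top = sorted(live_nodes, key=score, reverse=True)[:slack * k]
        return set(rng.sample(top, min(k, len(top))))

The slack factor here is purely illustrative; the point is that injecting randomness into an otherwise deterministic preference ordering tends to reduce churn.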


BibTeX citation:

@techreport{Godfrey:EECS-2006-25,
    Author = {Godfrey, Philip Brighten and Shenker, Scott and Stoica, Ion},
    Title = {Minimizing Churn in Distributed Systems},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2006},
    Month = {Mar},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-25.html},
    Number = {UCB/EECS-2006-25},
}

EndNote citation:

%0 Report
%A Godfrey, Philip Brighten
%A Shenker, Scott
%A Stoica, Ion
%T Minimizing Churn in Distributed Systems
%I EECS Department, University of California, Berkeley
%D 2006
%8 March 17
%@ UCB/EECS-2006-25
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-25.html
%F Godfrey:EECS-2006-25