Efficient Reproducible Floating Point Summation and BLAS

Peter Ahrens, Hong Diep Nguyen and James Demmel

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2015-229
December 8, 2015

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-229.pdf

We define reproducibility to mean getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should ideally not change the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociativity of floating point addition, makes attaining reproducibility a challenge even for simple operations like summing a vector of numbers, or more complicated operations like the Basic Linear Algebra Subprograms (BLAS). We describe an algorithm that computes a reproducible sum of floating point numbers, independent of the order of summation. The algorithm depends only on a subset of the IEEE Floating Point Standard 754-2008. It is communication-optimal, in the sense that it does just one pass over the data in the sequential case, or one reduction operation in the parallel case, requiring an ``accumulator'' represented by just 6 floating point words (more can be used if higher precision is desired). The arithmetic cost with a 6-word accumulator is 7 n floating point additions to sum n words, and (in IEEE double precision) the final error bound can be up to 10 -8 times smaller than the error bound for conventional summation. We describe the basic summation algorithm, the software infrastructure used to build reproducible BLAS (ReproBLAS), and performance results. For example, when computing the dot product of 4096 double precision floating point numbers, we get an 4x slowdown compared to IntelĀ® Math Kernel Library (MKL) running on an IntelĀ® Core i7-2600 CPU operating at 3.4 GHz and 256 KB L2 Cache.


BibTeX citation:

@techreport{Ahrens:EECS-2015-229,
    Author = {Ahrens, Peter and Nguyen, Hong Diep and Demmel, James},
    Title = {Efficient Reproducible Floating Point Summation and BLAS},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2015},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-229.html},
    Number = {UCB/EECS-2015-229},
    Abstract = {We define reproducibility to mean getting bitwise identical results from
multiple runs of the same program, perhaps with different hardware resources or other
changes that should ideally not change the answer. Many users depend on reproducibility
for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociativity of floating point addition, makes attaining
reproducibility a challenge even for simple operations like summing a vector of numbers,
or more complicated operations like the Basic Linear Algebra Subprograms (BLAS).
We describe an algorithm that computes a reproducible sum of floating point numbers,
independent of the order of summation. The algorithm depends only on a subset of
the IEEE Floating Point Standard 754-2008. It is communication-optimal, in the sense that
it does just one pass over the data in the sequential case, or one reduction operation in
the parallel case, requiring an ``accumulator'' represented by just 6 floating point words
(more can be used if higher precision is desired). The arithmetic cost with a 6-word
accumulator is 7<i>n</i> floating point additions to sum <i>n</i> words, and (in IEEE double precision) the
final error bound can be up to 10<sup>-8</sup> times smaller than the error bound for conventional
summation. We describe the basic summation algorithm, the software infrastructure used to
build reproducible BLAS (ReproBLAS), and performance results. For example, when computing the dot product of 4096 double precision floating point numbers, we get an
4x slowdown compared to Intel&#174; Math Kernel Library (MKL) running on an Intel&#174; Core i7-2600 CPU operating at 3.4 GHz and 256 KB L2 Cache.}
}

EndNote citation:

%0 Report
%A Ahrens, Peter
%A Nguyen, Hong Diep
%A Demmel, James
%T Efficient Reproducible Floating Point Summation and BLAS
%I EECS Department, University of California, Berkeley
%D 2015
%8 December 8
%@ UCB/EECS-2015-229
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-229.html
%F Ahrens:EECS-2015-229