Efficient Reproducible Floating Point Summation and BLAS
James Demmel and Willow Ahrens and Hong Diep Nguyen
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2016-121
June 18, 2016
This publication is archived. It is kept only for reference purposes, so it is no longer being updated and may not meet accessibility standards. If you need this content in a different format, please email webteam@eecs.berkeley.edu.
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/Archive/EECS-2016-121.pdf
We define reproducibility to mean getting bitwise identical results from multiple runs of the same program, perhaps with different hardware resources or other changes that should ideally not change the answer. Many users depend on reproducibility for debugging or correctness. However, dynamic scheduling of parallel computing resources, combined with nonassociativity of floating point addition, makes attaining reproducibility a challenge even for simple operations like summing a vector of numbers, or more complicated operations like the Basic Linear Algebra Subprograms (BLAS). We describe an algorithm that computes a reproducible sum of floating point numbers, independent of the order of summation. The algorithm depends only on a subset of the IEEE Floating Point Standard 754-2008. It is communication-optimal, in the sense that it does just one pass over the data in the sequential case, or one reduction operation in the parallel case, requiring an ``accumulator'' represented by just 6 floating point words (more can be used if higher precision is desired). The arithmetic cost with a 6-word accumulator is 7n floating point additions to sum n words, and (in IEEE double precision) the final error bound can be up to 10^(-8) times smaller than the error bound for conventional summation. We describe the basic summation algorithm, the software infrastructure used to build reproducible BLAS (ReproBLAS), and performance results. For example, when computing the dot product of 4096 double precision floating point numbers, we get a 4x slowdown compared to Intel Math Kernel Library (MKL) running on an Intel Core i7-2600 CPU operating at 3.4 GHz and 256 KB L2 Cache.
BibTeX citation:
@techreport{Demmel:EECS-2016-121,
Author= {Demmel, James and Ahrens, Willow and Nguyen, Hong Diep},
Title= {Efficient Reproducible Floating Point Summation and BLAS},
Year= {2016},
Month= {Jun},
Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.html},
Number= {UCB/EECS-2016-121},
Abstract= {We define reproducibility to mean getting bitwise identical
results from multiple runs of the same program, perhaps
with different hardware resources or other changes that
should ideally not change the answer. Many users depend on
reproducibility for debugging or correctness. However,
dynamic scheduling of parallel computing resources,
combined with nonassociativity of floating point addition,
makes attaining reproducibility a challenge even for simple
operations like summing a vector of numbers, or more
complicated operations like the Basic Linear Algebra
Subprograms (BLAS).
We describe an algorithm that computes a reproducible sum
of floating point numbers, independent of the order of
summation. The algorithm depends only on a subset of
the IEEE Floating Point Standard 754-2008. It is
communication-optimal, in the sense that it does just one
pass over the data in the sequential case, or one reduction
operation in the parallel case, requiring an
``accumulator'' represented by just 6 floating point words
(more can be used if higher precision is desired).
The arithmetic cost with a 6-word accumulator is 7n
floating point additions to sum n words, and (in IEEE double precision) the final error bound can be up to
10^(-8) times smaller than the error bound for conventional
summation. We describe the basic summation algorithm,
the software infrastructure used to build reproducible BLAS
(ReproBLAS), and performance results. For example, when
computing the dot product of 4096 double precision floating
point numbers, we get a 4x slowdown compared to Intel Math
Kernel Library (MKL) running on an Intel Core i7-2600
CPU operating at 3.4 GHz and 256 KB L2 Cache.},
}
EndNote citation:
%0 Report %A Demmel, James %A Ahrens, Willow %A Nguyen, Hong Diep %T Efficient Reproducible Floating Point Summation and BLAS %I EECS Department, University of California, Berkeley %D 2016 %8 June 18 %@ UCB/EECS-2016-121 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.html %F Demmel:EECS-2016-121