Accurate Floating Point Summation
James Demmel and Yozo Hida
EECS Department, University of California, Berkeley
Technical Report No. UCB/CSD-02-1180
, 2002
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2002/CSD-02-1180.pdf
We present and analyze several simple algorithms for accurately summing <i>n</i> floating point numbers, independent of how much cancellation occurs in the sum. Let <i>f</i> be the number of significant bits in the sum (<i>si</i>). We assume a register is available with <i>F</i> > <i>f</i> significant bits. Then assuming that (1) <i>n</i> <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, then simply summing the <i>si</i> in decreasing order of magnitude yields <i>S</i> rounded to within just over 1.5 units in its last place. If <i>S</i> = 0, then it is computed exactly. If we increase <i>n</i> slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (<i>F</i> >= 2<i>f</i>). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as <i>n</i> <= 33. If double extended is used <i>n</i> can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.
BibTeX citation:
@techreport{Demmel:CSD-02-1180, Author= {Demmel, James and Hida, Yozo}, Title= {Accurate Floating Point Summation}, Year= {2002}, Month= {May}, Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html}, Number= {UCB/CSD-02-1180}, Abstract= {We present and analyze several simple algorithms for accurately summing <i>n</i> floating point numbers, independent of how much cancellation occurs in the sum. Let <i>f</i> be the number of significant bits in the sum (<i>si</i>). We assume a register is available with <i>F</i> > <i>f</i> significant bits. Then assuming that (1) <i>n</i> <= floor( 2^(F-f) / (1-2^(-f)) ) + 1, (2) rounding is to nearest, (3) no overflow occurs, and (4) all underflow is gradual, then simply summing the <i>si</i> in decreasing order of magnitude yields <i>S</i> rounded to within just over 1.5 units in its last place. If <i>S</i> = 0, then it is computed exactly. If we increase <i>n</i> slightly to floor( 2^(F-f) / (1-2^(-f)) ) + 3, then all accuracy can be lost. This result extends work of Priest and others who considered double precision only (<i>F</i> >= 2<i>f</i>). We apply this result to the floating point formats in the (proposed revision of the) IEEE floating point standard. For example, a dot product of IEEE single precision vectors computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps as long as <i>n</i> <= 33. If double extended is used <i>n</i> can be as large as 65537. We also show how sorting may be mostly avoided while retaining accuracy.}, }
EndNote citation:
%0 Report %A Demmel, James %A Hida, Yozo %T Accurate Floating Point Summation %I EECS Department, University of California, Berkeley %D 2002 %@ UCB/CSD-02-1180 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2002/5662.html %F Demmel:CSD-02-1180