Execution Time of Symmetric Eigensolvers

by

Kendall Swenson Stanley

B.S. (Purdue University) 1978

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy

in

Computer Science

in the

GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA at BERKELEY

Committee in charge:

Professor James Demmel, Chair
Professor William Kahan
Professor Phil Colella

Fall 1997
The dissertation of Kendall Swenson Stanley is approved:

Chair

Date

Date

Date

University of California at Berkeley

Fall 1997
Execution Time of Symmetric Eigensolvers

Copyright Fall 1997
by
Kendall Swenson Stanley
Abstract

Execution Time of Symmetric Eigensolvers

by

Kendall Swenson Stanley
Doctor of Philosophy in Computer Science
University of California at Berkeley
Professor James Demmel, Chair

The execution time of a symmetric eigendecomposition depends upon the application, the algorithm, the implementation, and the computer. Symmetric eigensolvers are used in a variety of applications, and the requirements of the eigensolver vary from application to application. Many different algorithms can be used to perform a symmetric eigendecomposition, each with differing computational properties. Different implementations of the same algorithm may also have greatly differing computational properties. The computer on which the eigensolver is run not only affects execution time but may favor certain algorithms and implementations over others.

This thesis explains the performance of the ScaLAPACK symmetric eigensolver, the algorithms that it uses, and other important algorithms for solving the symmetric eigenproblem on today’s fastest computers. We offer advice on how to pick the best eigensolver for particular situations and propose a design for the next ScaLAPACK symmetric eigensolver which will offer greater flexibility and 50% better performance.
To the memory of my father. My most ambitious goal is to be as good a father as he was to me.
## Contents

List of Figures

List of Tables

I First Part

1 **Summary - Interesting Observations**
   1.1 Algorithms
   1.2 Software overhead and load imbalance costs are significant
   1.3 Effect of machine performance characteristics on PDSYEVX
   1.4 Prioritizing techniques for improving performance
   1.5 Reducing the execution time of symmetric eigensolvers
   1.6 Jacobi
   1.7 Where to obtain this thesis

2 **Overview of the design space**
   2.1 Motivation
   2.2 Algorithms
   2.3 Implementations
      2.3.1 Parallel abstraction and languages
      2.3.2 Algorithmic blocking
      2.3.3 Internal Data Layout
      2.3.4 Libraries
      2.3.5 Compilers
      2.3.6 Operating Systems
   2.4 Hardware
      2.4.1 Processor
      2.4.2 Memory
      2.4.3 Parallel computer configuration
   2.5 Applications
      2.5.1 Input matrix
      2.5.2 User request
      2.5.3 Accuracy and Orthogonality requirements
      2.5.4 Input and Output Data layout
   2.6 Machine Load
   2.7 Historical notes
      2.7.1 Reduction to tridiagonal form and back transformation
      2.7.2 Tridiagonal eigendecomposition
      2.7.3 Matrix-matrix multiply based methods
      2.7.4 Orthogonality

3 **Basic Linear Algebra Subroutines**
   3.1 BLAS design and implementation
   3.2 BLAS execution time
   3.3 Timing methodology
   3.4 The cost of code and data cache misses in DGEMV
   3.5 Miscellaneous timing details

4 **Details of the execution time of PDSYEVX**
   4.1 High level overview of PDSYEVX algorithm
   4.2 Reduction to tridiagonal form
      4.2.1 Householder's algorithm
      4.2.2 PDSYTRD implementation (Figure 4.4)
      4.2.3 PDSYTRD execution time summary
   4.3 Eigendecomposition of the tridiagonal
      4.3.1 Bisection
      4.3.2 Inverse iteration
      4.3.3 Load imbalance in bisection and inverse iteration
      4.3.4 Execution time model for tridiagonal eigendecomposition in PDSYEVX
      4.3.5 Redistribution
   4.4 Back Transformation

5 **Execution time of the ScaLAPACK symmetric eigensolver, PDSYEVX, on efficient data layouts on the Paragon**
   5.1 Deriving the PDSYEVX execution time on the Intel Paragon (common case)
   5.2 Simplifying assumptions allow the full model to be expressed as a six term model
   5.3 Deriving the computation time during matrix transformations in PDSYEVX on the Intel Paragon
   5.4 Deriving the computation time during eigendecomposition of the tridiagonal matrix in PDSYEVX on the Intel Paragon
   5.5 Deriving the message initiation time in PDSYEVX on the Intel Paragon
   5.6 Deriving the inverse bandwidth time in PDSYEVX on the Intel Paragon
   5.7 Deriving the PDSYEVX order n imbalance and overhead term on the Intel Paragon
   5.8 Deriving the PDSYEVX order \( \frac{n^2}{p} \) imbalance and overhead term on the Intel Paragon

6 **Performance on distributed memory computers**
   6.1 Performance requirements of distributed memory computers for running PDSYEVX efficiently
      6.1.1 Bandwidth rule of thumb
      6.1.2 Memory size rule of thumb
      6.1.3 Performance requirements for minimum execution time
      6.1.4 Gang scheduling
   6.2 secgang
      6.2.1 Consistent performance on all nodes
   6.3 Performance characteristics of distributed memory computers
      6.3.1 PDSYEVX execution time (predicted and actual)

7 **Execution time of other dense symmetric eigensolvers**
   7.1 Implementations based on reduction to tridiagonal form
      7.1.1 PeIGs
      7.1.2 HJS
      7.1.3 Comparing the execution time of HJS to PDSYEVX
      7.1.4 PDSYEV
   7.2 Other techniques
      7.2.1 One dimensional data layouts
      7.2.2 Unblocked reduction to tridiagonal form
      7.2.3 Reduction to banded form
      7.2.4 One-sided reduction to tridiagonal form
      7.2.5 Strassen's matrix multiply
   7.3 Jacobi
      7.3.1 Jacobi versus Tridiagonal eigensolvers
      7.3.2 Overview of Jacobi Methods
      7.3.3 Jacobi Methods
      7.3.4 Computation costs
      7.3.5 Communication costs
      7.3.6 Blocking
      7.3.7 Symmetry
      7.3.8 Storing diagonal blocks in one-sided Jacobi
      7.3.9 Partial Eigensolver
      7.3.10 Threshold
      7.3.11 Pairing
      7.3.12 Pre-conditioners
      7.3.13 Communication overlap
      7.3.14 Recursive Jacobi
      7.3.15 Accuracy
      7.3.16 Recommendation
   7.4 ISDA
   7.5 Banded ISDA
   7.6 FFT

8 **Improving the ScaLAPACK symmetric eigensolver**
   8.1 The next ScaLAPACK symmetric eigensolver
   8.2 Reduction to tridiagonal form in the next ScaLAPACK symmetric eigensolver
   8.3 Making the ScaLAPACK symmetric eigensolver easier to use
   8.4 Details in reducing the execution time of the ScaLAPACK symmetric eigensolver
      8.4.1 Avoiding overflow and underflow during computation of the Householder vector without added messages
      8.4.2 Reducing communications costs
      8.4.3 Reducing load imbalance costs
      8.4.4 Reducing software overhead costs
   8.5 Separating internal and external data layout without increasing memory usage

9 **Advice to symmetric eigensolver users**

II Second Part

Bibliography

A **Variables and abbreviations**

B **Further details**
   B.1 Updating \( v \) during reduction to tridiagonal form
      B.1.1 Notation
      B.1.2 Updating \( v \) without added communication
      B.1.3 Updating \( w \) with minimal computation cost
      B.1.4 Updating \( w \) with minimal total cost
      B.1.5 Notes to figure B.4
      B.1.6 Overlap communication and computation as a last resort
   B.2 Matlab codes
      B.2.1 Jacobi

C **Miscellaneous Matlab codes**
   C.1 Reduction to tridiagonal form
List of Figures

1.1 9 by 9 matrix distributed over a 2 by 3 processor grid with \( mb = nb = 2 \)
1.2 Processor point of view for 9 by 9 matrix distributed over a 2 by 3 processor grid with \( mb = nb = 2 \)

3.1 Performance of DGEMV on the Intel PARAGON
3.2 Additional execution time required for DGEMV when the code cache is flushed between each call. The y-axis shows the difference between the time required for a run which consists of one loop executing 16,384 no-ops after each call to DGEMV and the time required for a run which includes two loops, one executing DGEMV and one executing 16,384 no-ops.
3.3 Additional execution time required for DGEMV when the code cache is flushed between each call as a percentage of the time required when the code is cached. See Figure 3.2.

4.1 PDSYEVX algorithm
4.2 Classical unblocked, serial reduction to tridiagonal form, i.e. EISPACK's TRED1 (The line numbers are consistent with figures 4.3, 4.4 and 4.5.)
4.3 Blocked, serial reduction to tridiagonal form, i.e. DSYEVX (See Figure 4.2 for unblocked serial code)
4.4 PDSYEVX reduction to tridiagonal form (See Figure 4.3 for further details)
4.5 Execution time model for PDSYEVX reduction to tridiagonal form (See Figure 4.4 for details about the algorithm and indices.)
4.6 Flops in the critical path during the matrix vector multiply

6.1 Relative cost of message volume as a function of the ratio between peak floating point execution rate in Megaflops, \( mfs \), and the product of main memory size in Megabytes, \( M \), and network bisection bandwidth in Megabytes/sec, \( mbs \).
6.2 Relative cost of message latency as a function of the ratio between peak floating point execution rate in Megaflops, \( mfs \), and main memory size in Megabytes, \( M \).

7.1 HJS notation
7.2 Execution time model for HJS reduction to tridiagonal form. Line numbers match Figure 4.5 (PDSYEVX execution time)
7.3 Matlab code for two-sided cyclic Jacobi
7.4 Matlab code for two-sided blocked Jacobi
7.5 Matlab code for one-sided blocked Jacobi
7.6 Matlab code for an inefficient partial eigendecomposition routine
7.7 Pseudo code for one-sided parallel Jacobi with a 2D data layout with communication highlighted
7.8 Pseudo code for two-sided parallel Jacobi with a 2D data layout, as described by Schreiber[150], with communication highlighted

8.1 Data redistribution in the next ScaLAPACK symmetric eigensolver
8.2 Choosing the data layout for reduction to tridiagonal form
8.3 Execution time model for the new PDSYTRD. Line numbers match Figure 4.5 (PDSYTRD execution time) where possible.

B.1 Avoiding communication in computing $W \cdot V^T v$
B.2 Computing $W \cdot V^T v$ without added communication
B.3 Computing $W \cdot V^T v$ with minimal computation
B.4 Computing $W \cdot V^T v$ on a four dimensional processor grid
List of Tables

3.1 BLAS execution time (Time = $\delta_i + \text{number of flops} \cdot \gamma_i$ in microseconds)
4.1 The cost of updating the current column of $A$ in PDLATRD (Line 1.1 and 1.2 in Figure 4.5)
4.2 The cost of computing the reflector (PDLARFG) (Line 2.1 in Figure 4.5)
4.3 The cost of all calls to PDSYMV from PDSYTRD
4.4 The cost of updating the matrix vector product in PDLATRD (Line 4.1 in Figure 4.5)
4.5 The cost of computing the companion update vector in PDLATRD (Line 5.1 in Figure 4.5)
4.6 The cost of performing the rank-$2k$ update (PDSYR2K) (Lines 6.1 through 6.3 in Figure 4.5)
4.7 Computation cost in PDSYEVX
4.8 Computation cost (tridiagonal eigendecomposition) in PDSYEVX
4.9 Communication cost in PDSYEVX
4.10 The cost of back transformation (PDORMTR)
5.1 Six term model for PDSYEVX on the Paragon
5.2 Computation time in PDSYEVX
5.3 Execution time during tridiagonal eigendecomposition
5.4 Message initiations in PDSYEVX
5.5 Message transmission in PDSYEVX
5.6 $\theta(n)$ load imbalance cost on the PARAGON
5.7 Order $\frac{n^2}{\sqrt{P}}$ load imbalance and overhead term on the PARAGON
6.1 Performance
6.2 Hardware and software characteristics of the PARAGON and the IBM SP2
6.3 Predicted and actual execution times of PDSYEVX on xps5, an Intel PARAGON. Problem sizes which resulted in execution time more than 15% greater than predicted are marked with an asterisk. Many of these problem sizes were repeated to show that the unusually large execution times are aberrant
7.1 Comparison between the cost of HJS reduction to tridiagonal form and PDSYTRD on \( n = 4000, p = 64, nb = 32 \). Values differing from previous column are shaded.
7.2 Fastest eigendecomposition method
7.3 Performance model for my recommended Jacobi method
7.4 Estimated execution time per sweep for my recommended Jacobi on the PARAGON on \( n=1000, p=64 \)
7.5 Performance models (flop counts) for one-sided Jacobi variants. Entries which differ from the previous column are shaded.
7.6 Performance models (flop counts) for two-sided Jacobi variants
7.7 Communication cost for Jacobi methods (per sweep)
A.1 Variable names and their uses
A.2 Variable names and their uses (continued)
A.3 Abbreviations
A.4 Model costs
Acknowledgements

I thank those that I have worked with during my wonderful years at Berkeley. Doug Ghormley taught me all that I know about emacs, X, and tcsh. Susan Blackford, Clint Whaley and Antoine Petitet patiently answered my stupid questions about ScaLAPACK. I thank Bruce Hendrickson for numerous insights. Mark Sears and Greg Henry gave me the opportunity to test out some of my ideas on a real application. Peter Strazdins’ study of software overhead convinced me to take a hard look at code cache misses. Ross Moore gave me numerous typesetting hints and suggestions. Beresford Parlett helped me with the section on Jacobi. Oliver Sharp helped convince me to ask Jim Demmel to be my advisor and gave me some early help with technical writing. I am indebted to the members of the ScaLAPACK team whose effort made ScaLAPACK, and hence this thesis, possible.

My graduate studies would not have been possible were it not for my friends and family who encouraged me to resume my education and continued to support me in that decision, especially my wife (Marta Laskowska), Greg Lee, and Marta’s parents Michael and Joan. I also thank Chris Ranken for his friendship; my parents for bringing me into a loving world and teaching me to love mathematics; and Howard and Nani Ranken who proved, by example, that the two-body problem can be solved and inspired Marta and me to pursue the dream of two academic careers in one household.

I thank the members of my committee for their help and advice. I thank my advisor for allowing me the luxury of doing research without worrying about funding$^1$ or machine access at UC Berkeley$^2$ and the University of Tennessee at Knoxville$^3$. I thank Prof. Kahan for his sage advice, not just on the technical aspects, but also on the non-technical aspects of a research career and on life itself. I thank Phil Colella for his interest in my work and for reading my thesis on extremely short notice.

Most importantly, I thank my wife for her love and never ending support and I thank my daughter for making me smile.

$^1$ This work was supported primarily by the Defense Advanced Research Projects Agency of the Department of Defense under contracts DAAH04-95-1-0077 and DAAH04-95-1-0077, and with additional support provided by the Department of Energy grant DE-FG03-94ER25266. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

$^2$ National Science Foundation Infrastructure grant Nos. CDA-9401156 and CDA-8722788.

$^3$ The University of Tennessee, Knoxville, acquired the IBM SP2 through an IBM Shared University Research Grant. Access to the machine and technical support was provided by the University of Tennessee/Oak Ridge National Laboratory Joint Institute for Computational Science.
Part I

First Part
Chapter 1

Summary - Interesting Observations

The symmetric eigendecomposition of a real symmetric matrix $A$ is $A = QDQ^T$, where $D$ is diagonal and $Q$ is orthonormal, i.e. $Q^T Q = I$. Tridiagonal based methods reduce $A$ to a tridiagonal matrix through an orthonormal similarity transformation, i.e. $A = ZTZ^T$, compute the eigendecomposition of the tridiagonal matrix $T = UDU^T$ and, if necessary, transform the eigenvectors of the tridiagonal matrix back into eigenvectors of the original matrix $A$, i.e. $Q = ZU$. Non-tridiagonal based methods operate directly on the original matrix $A$.
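
To see why $Q = ZU$ yields the eigenvectors of $A$, one can substitute the tridiagonal eigendecomposition into the similarity transformation. The short derivation below uses only the symbols defined above, together with the assumption that both $Z$ and $U$ are orthonormal:

\[
\begin{aligned}
A &= Z T Z^T = Z \left( U D U^T \right) Z^T = (ZU)\, D\, (ZU)^T = Q D Q^T, \\
Q^T Q &= U^T Z^T Z U = U^T U = I .
\end{aligned}
\]

Hence the back transformation costs one product of $Z$ with the computed columns of $U$, which is why it is needed only when eigenvectors are requested.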

I am interested in understanding and minimizing the execution time of dense symmetric eigensolvers, as used in real applications, on distributed memory parallel computers. I have modeled the performance of symmetric eigensolvers as a function of the algorithm, the application, the implementation and the computer. Some applications require only a partial eigendecomposition, i.e. only a few eigenvalues or eigenvectors. Different implementations may require different communication or computation patterns and they may use different libraries and/or compilers. This thesis concentrates on the $O(n^3)$ cost of reduction to tridiagonal form and transforming the eigenvectors back to the original space.

I have modeled the execution time of the ScaLAPACK[31] symmetric eigensolver, PDSYEVX, in detail and validated this model against actual performance on a number of distributed memory parallel computers. PDSYEVX, like most ScaLAPACK codes, uses calls to the PBLAS[41, 140] to perform basic linear algebra operations such as matrix-matrix multiply and matrix-vector multiply in parallel. **PDSYEVX** and the **PBLAS** use calls to the Basic Linear Algebra Subroutines, **BLAS**[63, 62], to perform basic linear algebra operations such as matrix-matrix multiply and matrix-vector multiply on data local to each processor, and calls to the Basic Linear Algebra Communications Subroutines, **BLACS**[169, 69], to move data between the processors. The level one **BLAS** involve only vectors and perform \( O(n) \) flops on \( O(n) \) data, where \( n \) is the length of the vector. The level two **BLAS** involve one matrix and one or two vectors and perform \( O(n^2) \) flops on \( O(n^2) \) data, where the matrix is of size \( n \times n \). The level three **BLAS** involve only matrices and perform \( O(n^3) \) flops on \( O(n^2) \) data and offer the best opportunities to obtain peak floating point performance through data re-use.
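
The difference in data re-use between the three BLAS levels can be made concrete with a rough count of flops per word of data. The sketch below is illustrative only: the choice of DAXPY, DGEMV and DGEMM as representatives, the function name `blas_intensity`, and the simplified word counts (operands counted once, square matrices of order `n`) are assumptions made here, not part of the thesis model.

```python
def blas_intensity(n):
    """Return {routine: (flops, words, flops_per_word)} for operands of order n.

    Counts are the usual leading-order textbook figures (an assumption here):
      DAXPY  y = a*x + y    : 2n flops,    x and y      -> 2n words
      DGEMV  y = A*x + y    : 2n^2 flops,  A, x and y   -> n^2 + 2n words
      DGEMM  C = A*B + C    : 2n^3 flops,  A, B and C   -> 3n^2 words
    """
    counts = {
        "DAXPY (level 1)": (2 * n, 2 * n),
        "DGEMV (level 2)": (2 * n * n, n * n + 2 * n),
        "DGEMM (level 3)": (2 * n * n * n, 3 * n * n),
    }
    return {name: (f, w, f / w) for name, (f, w) in counts.items()}


if __name__ == "__main__":
    for name, (flops, words, ratio) in blas_intensity(1000).items():
        print(f"{name}: {flops} flops on {words} words, {ratio:.1f} flops/word")
```

Only the level three ratio grows with `n`, which is the data re-use that lets a BLAS3 routine approach peak floating point performance.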

**PDSYEVX** uses a 2D block cyclic data layout for all input, output and internal matrices. 2D block cyclic data layouts have been shown to support scalable high performance parallel dense linear algebra codes[32, 30, 124] and hence have been selected as the primary data layout for **HPF**[110], **ScaLAPACK**[68] and other parallel dense linear algebra libraries[98, 164]. A 2D block cyclic data layout is defined by the processor grid \((p_r, p_c)\), the local block size \((mb, nb)\) and the location of the \((1, 1)\) element of the matrix. In this thesis, we will assume that the \((1, 1)\) element of matrix \(A\), i.e. \( A(1, 1) \), is mapped to the \((1, 1)\) element of the local matrix in processor \((0, 0)\). Hence, \( A(i, j) \) is stored in element \( \left( \lfloor \frac{i-1}{mb \times p_r} \rfloor mb + \text{mod}(i-1, mb) + 1,\ \lfloor \frac{j-1}{nb \times p_c} \rfloor nb + \text{mod}(j-1, nb) + 1 \right) \) of the local matrix on processor \( \left( \text{mod}(\lfloor \frac{i-1}{mb} \rfloor, p_r),\ \text{mod}(\lfloor \frac{j-1}{nb} \rfloor, p_c) \right) \). Figures 1.1 and 1.2, reprinted from the **ScaLAPACK** User’s Guide[31], show how a 9 by 9 matrix would be distributed over a 2 by 3 processor grid with \( mb = nb = 2 \). In general, we will assume that square blocks are used since this is best for the symmetric eigenproblem, and we will use \( nb \) to refer to both the row block size and the column block size.
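
The mapping from a global entry to its owning processor and local position can be sketched in a few lines. This is a minimal illustration of the standard 2D block cyclic layout under the assumption stated above (\( A(1,1) \) lives on processor \((0,0)\)); the function name `block_cyclic_owner` and its argument order are hypothetical and do not correspond to any ScaLAPACK routine.

```python
def block_cyclic_owner(i, j, mb, nb, pr, pc):
    """Map global entry A(i, j) (1-based) to its place in a 2D block cyclic layout.

    Returns ((proc_row, proc_col), (local_i, local_j)); processor coordinates
    are 0-based and local indices are 1-based, matching the text's convention.
    """
    def one_dim(g, blk, nprocs):
        g0 = g - 1                  # zero-based global index
        block = g0 // blk           # global block number
        proc = block % nprocs       # processor coordinate that owns the block
        # Earlier local blocks on this processor, plus the offset in the block.
        local = (block // nprocs) * blk + g0 % blk + 1
        return proc, local

    prow, li = one_dim(i, mb, pr)
    pcol, lj = one_dim(j, nb, pc)
    return (prow, pcol), (li, lj)


if __name__ == "__main__":
    # 9 by 9 matrix, 2 by 3 processor grid, mb = nb = 2.
    for (i, j) in [(1, 1), (3, 5), (9, 9)]:
        print((i, j), "->", block_cyclic_owner(i, j, 2, 2, 2, 3))
```

For the 9 by 9 example, processor row 0 therefore holds global rows 1, 2, 5, 6 and 9, and processor column 0 holds global columns 1, 2, 7 and 8.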

All **ScaLAPACK** codes in version 1.5, including **PDSYEVX**, use the data layout block size as the algorithmic blocking factor. Hence, except as noted, we use \( nb \) to refer to the algorithmic blocking factor as well as the data layout block size. Data layouts and algorithmic blocking factors are discussed in Section 2.3.3.

Figure 1.1: 9 by 9 matrix distributed over a 2 by 3 processor grid with $mb = nb = 2$

![9 x 9 matrix distributed over a 2 by 3 processor grid with mb = nb = 2](image)

**PDSYEVX** calls the following routines:

**PDSYTRD** Performs Householder reduction to tridiagonal form.

**PDSTEBZ** Computes the eigenvalues of a tridiagonal matrix using bisection.

**PDSTEIN** Computes the eigenvectors of the tridiagonal matrix using inverse iteration and Gram-Schmidt reorthogonalization.

**PDORMTR** Transforms the eigenvectors of the tridiagonal matrix back into eigenvectors of the original matrix.
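
As an illustration of the bisection step, the following serial sketch finds one eigenvalue of a symmetric tridiagonal matrix by counting eigenvalues below a trial point. It is a textbook Sturm-count bisection written for this summary, not the PDSTEBZ implementation; the names `sturm_count` and `kth_eigenvalue`, the zero-pivot guard of `1e-300` and the tolerance are all assumptions.

```python
def sturm_count(d, e, x):
    """Number of eigenvalues below x for a symmetric tridiagonal matrix.

    d holds the n diagonal entries and e the n - 1 off-diagonal entries.
    The count equals the number of negative pivots in the LDL^T
    factorization of T - x*I.
    """
    count = 0
    q = 1.0  # overwritten in the first iteration
    for i in range(len(d)):
        off = e[i - 1] * e[i - 1] / q if i > 0 else 0.0
        q = d[i] - x - off
        if q == 0.0:
            q = 1e-300  # nudge an exact zero pivot so the recurrence continues
        if q < 0.0:
            count += 1
    return count


def kth_eigenvalue(d, e, k, rel_tol=1e-12):
    """Return the k-th smallest eigenvalue (k = 0, 1, ...) by bisection."""
    n = len(d)
    # Gershgorin discs give an interval that contains every eigenvalue.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) +
              (abs(e[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    for _ in range(200):  # hard cap; the interval halves on every pass
        if hi - lo <= rel_tol * max(1.0, abs(lo), abs(hi)):
            break
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) > k:
            hi = mid  # more than k eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Each eigenvalue is located independently of the others, so different eigenvalues can be assigned to different processors, at the price of the load imbalance discussed in Section 4.3.3.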

Figure 1.2: Processor point of view for 9 by 9 matrix distributed over a 2 by 3 processor grid with \( mb = nb = 2 \)

\[
\begin{array}{c|cccc|ccc|cc}
  & 0 & & & & 1 & & & 2 & \\
\hline
0 & a_{11} & a_{12} & a_{17} & a_{18} & a_{13} & a_{14} & a_{19} & a_{15} & a_{16} \\
  & a_{21} & a_{22} & a_{27} & a_{28} & a_{23} & a_{24} & a_{29} & a_{25} & a_{26} \\
  & a_{51} & a_{52} & a_{57} & a_{58} & a_{53} & a_{54} & a_{59} & a_{55} & a_{56} \\
  & a_{61} & a_{62} & a_{67} & a_{68} & a_{63} & a_{64} & a_{69} & a_{65} & a_{66} \\
  & a_{91} & a_{92} & a_{97} & a_{98} & a_{93} & a_{94} & a_{99} & a_{95} & a_{96} \\
\hline
1 & a_{31} & a_{32} & a_{37} & a_{38} & a_{33} & a_{34} & a_{39} & a_{35} & a_{36} \\
  & a_{41} & a_{42} & a_{47} & a_{48} & a_{43} & a_{44} & a_{49} & a_{45} & a_{46} \\
  & a_{71} & a_{72} & a_{77} & a_{78} & a_{73} & a_{74} & a_{79} & a_{75} & a_{76} \\
  & a_{81} & a_{82} & a_{87} & a_{88} & a_{83} & a_{84} & a_{89} & a_{85} & a_{86} \\
\end{array}
\]

\( 2 \times 3 \) process grid point of view

My performance models explain performance in terms of the following application parameters:

- $n$ The matrix size.
- $m$ The number of eigenvectors required.
- $e$ The number of eigenvalues required. ($e \geq m$)

the following machine parameters:

- $p$ The number of processors (arranged in a $p_r$ by $p_c$ grid as described below).
- $\alpha$ The communication latency (secs/message).
- $\beta$ The inverse communication bandwidth (secs/double precision word). This means that sending a message of $k$ double precision words costs: $\alpha + k\beta$.
- $\gamma_1, \gamma_2, \gamma_3$ Time per flop for BLAS1, BLAS2 and BLAS3 routines respectively.
- $\delta_1, \delta_2, \delta_3, \delta_4$ Software overhead for BLAS1, BLAS2, BLAS3 and PBLAS routines respectively.

This means that a call to \texttt{DGEMM} (a BLAS3 routine) requiring \( c \) flops costs: \( \delta_3 + c \gamma_3 \). See Chapter 3 for details on the cost of the BLAS. The cost of the PBLAS routine \texttt{PDSYMV} is shown in Table 4.3.
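
The two linear cost expressions used throughout the model, $\alpha + k\beta$ for a message and $\delta_3 + c\gamma_3$ for a BLAS3 call, can be written as small helper functions. The function names and the break-even helper are illustrative assumptions, and any numeric values passed in are hypothetical rather than measured machine parameters.

```python
def message_cost(k, alpha, beta):
    """Time to send one message of k double precision words: alpha + k*beta."""
    return alpha + k * beta


def blas_cost(flops, delta, gamma):
    """Time for one BLAS call: software overhead delta plus flops*gamma."""
    return delta + flops * gamma


def breakeven_words(alpha, beta):
    """Message length at which the latency and bandwidth terms are equal."""
    return alpha / beta


if __name__ == "__main__":
    # Hypothetical parameters in seconds, chosen only to exercise the formulas.
    alpha, beta = 50e-6, 0.1e-6
    print("1000-word message:", message_cost(1000, alpha, beta))
    print("latency dominates below", breakeven_words(alpha, beta), "words")
```

Messages shorter than the break-even length are dominated by $\alpha$, and calls with few flops are dominated by the $\delta$ terms, which is why the overhead terms cannot be dropped from the model.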

My model also uses the following algorithmic and data layout parameters:

- $p_r$ The number of processor rows in the processor grid.
- $p_c$ The number of processor columns in the processor grid.
- $nb$ The data layout block size and algorithmic blocking factor.

These and all other variables used in this thesis are listed in Table A.1 in Appendix A.

The rest of this chapter presents the most interesting results from my study of the execution time of symmetric eigensolvers on distributed memory computers. Section 1.1 describes the algorithms commonly used for dense symmetric eigendecomposition on distributed memory parallel computers. Section 1.2 describes how software overhead and load imbalance costs are significant. Section 1.3 explains the two rules of thumb for ensuring that a distributed memory parallel computer can achieve good performance on a dense linear algebra code such as ScaLAPACK’s symmetric eigensolver. Section 1.4 explains that it is important to identify which techniques offer the greatest potential for improving performance across a wide range of applications, problem sizes and distributed memory parallel computers. Section 1.5 gives a synopsis of how the execution time of the ScaLAPACK symmetric eigensolver could be reduced. Section 1.6 explains the types of applications on which Jacobi can be expected to be as fast as, or faster than, tridiagonal based methods.

The rest of my thesis is organized as follows. Chapter 2 provides an introduction and a historical perspective. Chapter 3 explains the performance of the Basic Linear Algebra Subroutines (BLAS). Chapter 4 contains my complete execution time model for ScaLAPACK’s symmetric eigensolver, PDSYEVX. Chapter 5 simplifies the execution time model by concentrating on a particular application on a particular distributed memory parallel computer, the Intel Paragon. Chapter 6 explains the performance requirements of distributed memory parallel computers and discusses the execution time of PDSYEVX. Chapter 7 explains the performance of other dense symmetric eigensolvers. Chapter 8 provides a blueprint for reducing the execution time of PDSYEVX. Chapter 9 offers concise advice to users of symmetric eigensolvers.

1.1 Algorithms

There are many widely disparate symmetric eigendecomposition algorithms. Tridiagonal reduction based algorithms for the symmetric eigendecomposition require asymptotically the fewest flops and have been historically the fastest and most popular[83, 79, 129, 153, 86, 145, 134, 50].

Iterative eigensolvers, e.g. Lanczos and conjugate gradient methods, are clearly superior if the input matrix is sparse and only a limited portion of the spectrum is needed[49, 119]. Iterative eigensolvers are out of the scope of this thesis.

Even for tridiagonal matrices, there are several algorithms worthy of attention for the tridiagonal eigendecomposition. The ideal method would require at most $O(n^2)$ floating point operations, $O(n^2)$ message volume and $O(p)$ messages. The recent work of Parlett and Dhillon[136, 139] renews hope that such a method will be available in the near future. Should this effort hit unexpected snags, other better known methods, such as QR[79, 86, 93], QD[135], bisection and inverse iteration[83, 102] and Cuppen’s divide and conquer algorithm[50, 66, 147, 88], will remain common. Parallel codes have been written for QR[39, 8, 76, 125], bisection and inverse iteration[15, 75, 54, 81] and Cuppen’s algorithm[82, 80, 141]. ScaLAPACK offers parallel QR and parallel bisection and inverse iteration codes. Cuppen’s algorithm[50, 66, 88], which has recently replaced QR as the fastest serial method[147], has been coded for inclusion in ScaLAPACK by Françoise Tisseur. Algorithms for the tridiagonal eigenproblem are discussed in Section 2.2, and parallel tridiagonal eigensolvers are discussed in Section 7.1.

A detailed comparison of tridiagonal eigensolvers would be premature until Parlett and Dhillon complete their prototype.

This thesis concentrates on the $O(n^3)$ cost of reduction to tridiagonal form and transforming the eigenvectors back to the original space. Hendrickson, Jessup and Smith[91] showed that reduction to tridiagonal form can be performed 50% faster than ScaLAPACK does. Lang’s successive band reduction[116], SBR, is interesting at least if only eigenvalues are to be computed. But the complexity of SBR has made it difficult to realize its theoretical advantages in practice. A performance model for PDSYEVX, ScaLAPACK’s symmetric eigensolver (described in Section 7.1.2), is given in Chapter 4. By restricting our attention to a single computer, and to the most common applications, the model is further simplified and discussed in Chapter 5.

Jacobi requires 4-20 times as many floating point operations as tridiagonal based methods, hence the type of problems on which Jacobi will be faster will always be limited. Jacobi is faster than tridiagonal based methods[125, 2] on small spectrally diagonally dominant matrices\(^1\) despite requiring 4 times as many flops because it has less overhead. However, on large problems tridiagonal based methods can achieve at least 25% efficiency and will hence be faster than any method requiring 4 times as many flops. And, on matrices that are not spectrally diagonally dominant, Jacobi requires 20 or more times as many flops as tridiagonal based methods - a handicap that is simply too large to overcome. Jacobi’s method is discussed in Section 7.3.

\(^1\)Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant.

Methods that require multiple $n$ by $n$ matrix-matrix multiplies, such as the Invariant Subspace Decomposition Approach[97] (ISDA), and Yau and Lu’s FFT based method[174] require roughly 30 times as many floating point operations as tridiagonal based methods and hence may never be faster than tridiagonal based methods. The ISDA for solving symmetric eigenproblems is discussed in Section 7.4.

Banded ISDA[26] is an improvement on ISDA that involves an initial bandwidth reduction. Banded ISDA is nearly a tridiagonal method and offers performance that is nearly as good, at least if only eigenvalues are sought. However, a banded ISDA code requires multiple bandwidth reductions, each of which requires a back transformation. Hence, if even a few eigenvectors are required, a banded ISDA code must either store the back transformations in compact form or perform an additional $O(n^3)$ flops. No code available today stores and applies these back transformations in compact form. At present, the fastest banded ISDA code starts by reducing the matrix to tridiagonal form and is neither the fastest tridiagonal eigensolver, nor the easiest to parallelize. Banded ISDA is discussed in Section 7.5.

In conclusion, reduction to tridiagonal form combined with Parlett and Dhillon’s tridiagonal eigensolver is likely to be the preferred method for eigensolution of dense matrices for most applications.

In the meantime, until Parlett and Dhillon’s code is available, we believe that **PDSYEVX** is the best general purpose symmetric eigensolver for dense matrices. It is available on any machine to which **ScaLAPACK** has been ported\(^2\), it achieves 50% efficiency even when the flops in the tridiagonal eigensolution are not counted\(^3\) and it scales well, running efficiently on machines with thousands of nodes. It is faster than ISDA and faster than Jacobi on large matrices and on matrices that are not spectrally diagonally dominant.

### 1.2 Software overhead and load imbalance costs are significant

In **PDSYEVX**, it is somewhat surprising but true that software overhead and load imbalance costs are larger than communications costs. In its broadest definition, software overhead is the difference between the actual execution time and the cost of communication and computation. Software overhead includes saving and restoring registers, parameter passing, error and special case checking as well as those tasks which prevent calls to the BLAS involving few flops from being as efficient as calls to the BLAS involving many flops: loop overhead, border cases and data movement between memory hierarchies that gets amortized over all the operations in a given call to the BLAS. The cost of any operation which is performed by only a few of the processors (while the other processors are idle) is a load imbalance cost.

\(^2\)Intel Paragon, Cray T3D, Cray T3E, IBM SP2, and any machine supporting the BLACS, MPI or PVM.

\(^3\)Our definition of efficiency is a demanding one: total time divided by the time required by reduction to tridiagonal form and back transformation assuming that these are performed at the peak floating point execution rate of the machine, i.e. \( \text{time} / \left( \frac{10}{3} \frac{n^3}{p} \gamma_3 \right) \).

Because software overhead is as significant as communication latency, the three term performance model introduced by Choi et al.[40] and used in my earlier work[57], which only counts flops, number of messages and words communicated, does not adequately model the performance of PDSYEVX. In addition to these three terms a fourth term, which we designate \( \delta \), representing software overhead costs is required.
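
The four-term model described above can be sketched as follows. This is a minimal illustration, not the model of Chapter 4: the names `MachineCosts` and `modeled_time`, the choice of "units of overhead" and all numeric values are assumptions made here for the example.

```python
from dataclasses import dataclass


@dataclass
class MachineCosts:
    """Hypothetical per-unit costs, all in seconds."""
    gamma: float  # time per floating point operation
    alpha: float  # time per message (latency)
    beta: float   # time per word communicated (inverse bandwidth)
    delta: float  # time per unit of software overhead (e.g. per call)


def modeled_time(flops, messages, words, overhead_units, m):
    """Four-term execution time estimate: flops, messages, words, overhead."""
    return (flops * m.gamma + messages * m.alpha
            + words * m.beta + overhead_units * m.delta)


# Illustrative numbers only; they are not measurements from this thesis.
machine = MachineCosts(gamma=2e-8, alpha=5e-5, beta=1e-7, delta=1e-5)
three_term = modeled_time(1e9, 1e4, 1e7, 0, machine)    # overhead ignored
four_term = modeled_time(1e9, 1e4, 1e7, 1e5, machine)   # overhead included
print(three_term, four_term)
```

With these made-up values the overhead term is as large as the bandwidth term and twice the latency term, which is the situation in which a three-term model would underestimate execution time.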

Software overhead is more difficult to measure, study, model and reason about than the other components of execution time. Measuring the execution time required for a subroutine call requiring little or no work measures only subroutine call overhead, parameter passing and error checking. For the performance models in this thesis, we measure the execution time of each routine across a range of problem sizes (with code cached and data not cached) and use curve fitting to estimate the software overhead of an individual routine. Because we perform these timings with code cached but data not cached, this gives an estimate of all software overhead costs except code cache misses.
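
The curve fitting step can be illustrated with a straight-line least-squares fit, in which the intercept estimates the fixed overhead of one call and the slope estimates the cost per unit of work. This is only a sketch under the assumption that a routine's time is linear in its flop count; the function name `fit_overhead` and the synthetic timings are hypothetical and are not the fits used in this thesis.

```python
def fit_overhead(sizes, times):
    """Least-squares fit of times ~ overhead + per_unit * sizes.

    Returns (overhead, per_unit). The intercept is the estimated fixed
    software overhead of one call; the slope is the cost per unit of work.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_unit = sxy / sxx
    overhead = mean_y - per_unit * mean_x
    return overhead, per_unit


# Synthetic timings (seconds) for a hypothetical routine: 40 microseconds of
# fixed overhead plus 10 nanoseconds per flop. These are not measured data.
flop_counts = [1_000, 10_000, 100_000, 1_000_000]
timings = [40e-6 + 10e-9 * f for f in flop_counts]
overhead, per_flop = fit_overhead(flop_counts, timings)
print(overhead, per_flop)
```

Because the timings are taken with code cached, an intercept obtained this way would not include code cache misses, as noted above.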

We use times with the code cached and data not cached for our performance models because, for most problem sizes, the matrix is too large to fit in cache but it is less clear whether code fits in cache or not. It is easy to compute the amount of data which must be cached, but there is no portable automatic way to measure the amount of code which must be cached. Furthermore, the data cache needs, for typical problem sizes, are much larger than code cache needs. Hence, while it is usually clear that the data is not cached, the code cache needs and code cache size are much closer.

A full study of software overhead costs is out of the scope of this thesis and remains a topic for future research. The overhead and load imbalance terms in the performance model for PDSYEVX on the PARAGON are explained in Sections 5.7 and 5.8.
1.3 Effect of machine performance characteristics on PDSYEVX

The most important machine performance characteristic is the peak floating point rate. Bisection bandwidth essentially defines which machines ScaLAPACK can perform well on. Message latency and software overhead, since they are $O(n)$ terms, are important primarily for small and medium matrices.

Most collections of computers fall into one of two groups: those connected by a switched network whose bisection bandwidth increases linearly (or nearly so) with the number of processors and those connected by a network that only allows one processor to send at a time. All current distributed memory parallel computers that I am aware of have adequate bisection bandwidth\(^4\) to support good efficiency on PDSYEVX. On the other hand, no network that only allows one processor to send at a time can allow scalable performance and none that I am aware of allows good performance with as many as 16 processors. As long as the bandwidth rule of thumb (explained in detail in Section 6.1.1) holds, bandwidth will not be the limiting factor in the performance of PDSYEVX.

**Bandwidth rule of thumb:** Bisection bandwidth per processor\(^5\) times the square root of memory size per processor should exceed floating point performance per processor:

\[
\frac{\text{Megabytes/sec}}{\text{processor}} \times \sqrt{\frac{\text{Megabytes}}{\text{processor}}} > \frac{\text{Megaflops/sec}}{\text{processor}}
\]

assures that bandwidth will not limit performance.

Assuming that the bandwidth is adequate, we consider next the problem size per processor:

If the problem is large enough, i.e. \((n^2/p) > 2 \times (\text{Megaflops/processor})\), then PDSYEVX should execute reasonably efficiently. This rule (explained in detail in Section 6.1.2) can be restated as:

**Memory size rule of thumb:** Memory size should match floating point performance

\[
\frac{\text{Megabytes}}{\text{processor}} > \frac{\text{Megaflops/sec}}{\text{processor}}
\]

assures that \texttt{PDSYEVX} will be efficient on large problems.

\(^4\)Few distributed memory parallel computers offer bandwidth that scales linearly with the number of processors but most still have adequate bisection bandwidth.

\(^5\)Bisection bandwidth per processor is the total bisection bandwidth of the network divided by the number of processors.
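
The two rules of thumb can be checked mechanically for a given machine. The sketch below reads the memory size rule as "megabytes per processor should exceed megaflops per second per processor", which is an interpretation made here; the function names and the two example machines are hypothetical and do not describe any real system.

```python
import math


def bandwidth_rule_ok(bisection_mb_per_sec, memory_mb, mflops):
    """Bandwidth rule of thumb; all arguments are per-processor figures."""
    return bisection_mb_per_sec * math.sqrt(memory_mb) > mflops


def memory_rule_ok(memory_mb, mflops):
    """Memory size rule of thumb; both arguments are per-processor figures."""
    return memory_mb > mflops


# Hypothetical machines, not measurements of any real system.
switched = dict(bisection_mb_per_sec=20.0, memory_mb=64.0, mflops=100.0)
shared_bus = dict(bisection_mb_per_sec=0.5, memory_mb=64.0, mflops=100.0)

print(bandwidth_rule_ok(**switched))    # 20 * sqrt(64) = 160 > 100
print(bandwidth_rule_ok(**shared_bus))  # 0.5 * sqrt(64) = 4 < 100
print(memory_rule_ok(64.0, 100.0))      # 64 MB is less than 100 Mflops/sec
```

In this example the switched machine passes the bandwidth rule but not the memory size rule, so it would be expected to need more memory per processor before large problems run efficiently.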

If the problem is not large enough, lower order terms, as explained in Chapter 4, will be significant. Unlike the peak flop rate, which can be substantially independent of main memory performance, lower order terms (communication latency, communication bandwidth, software overhead and load imbalance) are strongly linked to main memory performance.

\texttt{PDSYEVX} can work well on machines with large slow main memory (on large problems) and/or machines with small fast main memory (on small problems). Most distributed memory parallel computers have sufficient memory size and network bisection bandwidth to allow \texttt{PDSYEVX} to achieve high efficiency on large problem sizes. The \texttt{Cray T3E} is one of the few machines that has sufficient main memory performance to allow \texttt{PDSYEVX} to achieve high performance on small problem sizes. The effect of machine performance characteristics on \texttt{PDSYEVX} is discussed in Chapter 6.

1.4 Prioritizing techniques for improving performance.

One of the most important uses of performance modeling is to identify which techniques offer the most promise for performance improvement, because there are too many performance improvement techniques to allow one to try them all. One technique that appeared to be important early in my work, optimizing global communications, now appears less important in light of the discovery that software overhead and load imbalance are more significant than earlier thought. Here we talk about general conclusions; details are summarized in Section 1.5, and elaborated in Chapters 7 and 8.

Overlapping communication and computation, though it undeniably increases performance, should be implemented only after every effort has been made to reduce both communications and computations costs as much as possible. Overlapping communication and computation has proven to be more attractive in theory than in practice because not all communication costs overlap well and communication costs are not the only impediment to good parallel performance.

Although Strassen's matrix multiplication has been proven to offer performance better than can be achieved through traditional methods, it will be a long time before a Strassen matrix multiply is shown to be twice as fast as a traditional method. A typical single processor computer would require 2-4 Gigabytes of main memory to achieve an effective flop rate of twice the machine's peak flop rate\(^6\) and 2-4 Terabytes of main memory to achieve 4 times the peak flop rate. Strassen's matrix multiplication will get increasing use in the coming years, because achieving 20% above "peak" performance is nothing to sneeze at, but Strassen's matrix multiply will not soon make matrix multiply based eigendecomposition such as ISDA faster than tridiagonal based eigendecomposition.
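
The memory figures above can be reproduced to within a small factor by a back-of-the-envelope calculation. The sketch below assumes that each level of Strassen recursion saves a factor of 8/7 in multiplies, ignores the extra additions, stores three double precision matrices, and switches to the traditional method at an assumed cutoff of 128 by 128; the cutoff and the function names are hypothetical and chosen only for illustration.

```python
import math


def strassen_levels(speedup):
    """Recursion levels needed for a given flop-count speedup.

    Each level of Strassen recursion replaces 8 half-size multiplies by 7,
    so k levels cut the multiply count by (8/7)**k. The extra additions
    are ignored here, which makes the estimate optimistic.
    """
    return math.ceil(math.log(speedup) / math.log(8 / 7))


def memory_bytes(levels, cutoff_n, bytes_per_word=8, matrices=3):
    """Storage for A, B and C when recursion stops at cutoff_n by cutoff_n."""
    n = cutoff_n * 2 ** levels
    return matrices * n * n * bytes_per_word


CUTOFF = 128  # assumed size below which the traditional multiply is used
for target in (2, 4):
    k = strassen_levels(target)
    print(target, k, memory_bytes(k, CUTOFF) / 2 ** 30, "GiB")
```

Under these assumptions a factor of 2 needs 6 levels (matrices of order 8192, about 1.5 GiB) and a factor of 4 needs 11 levels (about 1.5 TiB). A larger cutoff pushes both figures up, so the estimate is of the same order as the figures quoted above.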

1.5 Reducing the execution time of symmetric eigensolvers

**PDSYEVX** can be improved. It does not work well on matrices with large clusters of eigenvalues. And, it is not as efficient as it could be[91], achieving only 50% of peak efficiency on PARAGON, Cray T3D and Berkeley NOW even on large matrices. On small matrices it performs worse. Parlett and Dhillon's new tridiagonal eigensolver promises to solve the clustered eigenvalue problem so we concentrate on improving the performance of reduction to tridiagonal form and back transformation.

Input and output data layout need not affect execution time of a parallel symmetric eigensolver because data redistribution is cheap. Data redistribution requires only \(O(p)\) messages and \(O(n^2/p)\) message volume per processor. This is modest compared to \(O(n \log(p))\) messages and \(O(n^2/\sqrt{p})\) message volume per processor required by reduction to tridiagonal form and back transformation.
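
As a rough illustration of how modest redistribution is, drop all constant factors (an assumption, since the constants are not given here) and take the hypothetical values $n = 4096$ and $p = 64$. The ratios of the costs of reduction and back transformation to the costs of redistribution are then

\[
\begin{aligned}
\text{messages:}\quad & \frac{n \log_2 p}{p} = \frac{4096 \cdot 6}{64} = 384, \\
\text{message volume per processor:}\quad & \frac{n^2/\sqrt{p}}{n^2/p} = \sqrt{p} = 8.
\end{aligned}
\]

The volume ratio grows as $\sqrt{p}$, so on this count redistribution becomes relatively cheaper as processors are added, while the message ratio depends on how $n$ grows relative to $p$.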

Separating internal and external data layout actually decreases minimum execution time over all data layouts. Separating internal and external data layouts allows reduction to tridiagonal form and back transformation to use different data layouts. It also allows codes to concentrate only on the best data layout, reducing software overhead and allowing improvements which would be prohibitively complicated to implement if they had to work on all two-dimensional block cyclic data layouts.

Separating internal and external data layouts increases the minimum workspace requirement\(^7\) from \(2.5n^2\) to \(3n^2\). However, with minor improvements in the existing code, and without any changes to the interface, internal and external data layout can be separated without increasing the workspace requirement. See Section 8.5.

\(^6\)A dual processor computer would require twice as much memory.

\(^7\)Assuming that data redistribution is not performed in place. It is difficult to redistribute data in place between two arbitrary parallel data layouts. If efficient in-place data redistribution were feasible, separating internal and external data layout would require only a trivial increase in workspace.

Lichtenstein and Johnson[124] point out that data layout is irrelevant to many linear algebra problems because one can solve a permuted problem instead of the original. This works for symmetric problems provided that the input data is distributed over a square processor grid and with a row block size equal to the column block size.

Hendrickson, Jessup and Smith[91] demonstrated that the performance of PDSYEVX can be improved substantially by reducing load imbalance, software overhead and communications costs. Most of the inefficiency in PDSYEVX is in reduction to tridiagonal form. Software overhead and load imbalance are responsible for more of the inefficiency than the cost of communications. Hence, it is those areas that need to be sped up the most. Preliminary results[91] indicate that by abandoning the PBLAS interface, using BLAS and BLACS calls directly, and concentrating on the most efficient data layout, software overhead, load imbalance and communications costs can be cut in half. Strazdins has investigated reducing software overheads in the PBLAS[161], but it remains to be seen whether software overheads in the PBLAS can be reduced sufficiently to allow PDSYEVX to be as efficient as it could be. PDSYEVX performance can be improved further if the compiler can produce efficient code on simple doubly nested loops, implementing merged BLAS Level 2 operations (like DSYMV and DSYR2).

For small matrices, software overhead dominates all costs, and hence one should minimize software overhead even at the expense of increasing the cost per flop. An unblocked code has the potential to do just that.

Although back transformation is more efficient than reduction to tridiagonal form, it can be improved. Whereas software overhead is the largest source of inefficiency in reduction to tridiagonal form, communications cost and load imbalance are the largest source of inefficiency in back transformation. Load imbalance is hard to eliminate in a blocked data layout in reduction to tridiagonal form because the size of the matrix being updated is constantly changing (getting smaller), but in back transformation, all eigenvectors are constantly updated, so statically balancing the number of eigenvalues assigned to each processor works well. Therefore the best data layout for back transformation is a two-dimensional rectangular block-cyclic data layout. The number of processor columns, \( p_c \), should exceed the number of processor rows, \( p_r \), by a factor of approximately 8. The optimal data layout column block size is \( \lceil n/(p_c k) \rceil \) for some small integer \( k \). The row block size is less important in back transformation, and 32 is a reasonable choice, although setting it to the same value as the column block size will also work well if the BLAS are efficient on that block size and \( p_r < p_c \). Many techniques used to improve performance in LU decomposition, such as overlapping communication and computation, pipelining communication and asynchronous message passing, can also be used to improve the performance of back transformation. Of these techniques, only asynchronous message passing (which eliminates all local memory movement) requires modification to the BLACS interface. The modification to the BLACS needed to support asynchronous message passing would allow forward and backward compatibility.

All of these methods are discussed in Chapter 8.

### 1.6 Jacobi

A one-sided Jacobi method with a two-dimensional data layout will beat tridiagonal based eigensolvers on small spectrally diagonally dominant matrices. The simpler one-dimensional data layout is sufficient for modest numbers of processors, perhaps as many as a few hundred, but does not scale well. Tridiagonal based methods, because they require fewer flops, will beat Jacobi methods on random matrices regardless of their size, and on large \( (n > 200 \sqrt{p}) \) matrices even if they are spectrally diagonally dominant. Jacobi also remains of interest in some cases when high accuracy is desired[58]. Jacobi’s method is discussed in Section 7.3.

### 1.7 Where to obtain this thesis

This thesis is available at: \texttt{http://www.cs.berkeley.edu/stanley/thesis}
Chapter 2

Overview of the design space

2.1 Motivation

The execution time of any computational solution to a problem is a single-valued function (time) on a multi-dimensional and non-uniform domain. This domain includes the problem being solved, the algorithm, the implementation of the algorithm and the underlying hardware and software (sometimes referred to collectively as the computer). By studying one problem, the symmetric eigenproblem, in detail we gain insight into how each of these factors affects execution time.

Section 2.2 discusses the most important algorithms for dense symmetric eigendecomposition on distributed memory parallel computers. Section 2.3 discusses the effect that the implementation can have on execution time. Section 2.4 discusses the effect of various hardware characteristics on execution time. Section 2.5 lists several applications that use symmetric eigendecomposition and their differing needs. Section 2.6 discusses the direct and indirect effects of machine load on the execution time of a parallel code. Section 2.7 outlines the most important historical developments in parallel symmetric eigendecomposition.

2.2 Algorithms

The most common symmetric eigensolvers which compute the entire eigendecomposition use Householder reduction to tridiagonal form, form the eigendecomposition of the tridiagonal matrix and transform the eigenvectors back to the original basis. Algorithms that do not begin by reduction to tridiagonal form require more floating point operations.
Except for small spectrally diagonally dominant matrices, on which Jacobi will likely be faster than tridiagonal based methods, and scaled diagonally dominant matrices on which Jacobi is more accurate\cite{58}, tridiagonal based codes will be best for the eigensolution of dense symmetric matrices. See Section 7.3 for details.

The recent work of Parlett and Dhillon offers the promise of computing the tridiagonal eigendecomposition with $O(n^2)$ flops and $O(p)$ messages. Should some unexpected hitch prevent this from being satisfactory on some matrix types, there are several other algorithms from which to choose. Experience with existing implementations shows that for most matrices of size 2000 by 2000 or larger, the tridiagonal eigendecomposition is a modest component of total execution time.

Reduction to tridiagonal form and back transformation are the most time consuming steps in the symmetric eigendecomposition of dense matrices. These two steps require more flops ($O(n^3)$ vs. $O(n^2)$), more message volume ($O(n^2 \sqrt{p})$ vs. $O(n^2)$) and more messages ($O(n \log(p))$ vs. $O(p)$) than the eigendecomposition of the tridiagonal matrix. Since the cost of the matrix transformations (reduction to tridiagonal form and back transformation) grows faster than the cost of tridiagonal eigendecomposition, the matrix transformations are the dominant cost for larger matrices.

Reduction to tridiagonal form and back transformation require different communication patterns. Reduction to tridiagonal form is a two-sided transformation requiring multiplication by Householder reflectors from both the left and right side. Two sided reductions require that every element in the trailing matrix be read for each column eliminated, hence half of the flops are BLAS2 matrix-vector flops and $O(n \log(p))$ messages are required.

Equally importantly, two-sided reductions require significant calculations within the inner loop, which translates into large software overhead. Indeed on the computers that we considered, software overhead appears to be a larger factor in limiting efficiency of reduction to tridiagonal form than communication.

Back transformation is a one-sided transformation with updates that can be formed anytime prior to their application. Hence back transformation requires $O(n/nb)$ messages (where $nb$ is the data layout block size) and far less software overhead than reduction to tridiagonal form.

Chapters 4 and 5 discuss the execution time of reduction to tridiagonal form and back transformation, as implemented in ScaLAPACK, in detail.
2.3 Implementations

2.3.1 Parallel abstraction and languages

There are three common ways of expressing parallelism in linear algebra codes: message passing, shared memory and calls to the BLAS. Message passing programs tend to keep communication to a minimum, in part because the communication is specified directly. Shared memory codes can outperform message passing codes when load imbalance costs outweigh communication costs[118]. All calls to the BLAS offer potential parallelism though the potential for speedup varies. ScaLAPACK uses message passing while LAPACK exposes parallelism through calls to the BLAS.

In some cases, recent compilers are able to identify the parallelism in codes that may not have been written specifically for parallel execution[172, 171]. However, experience has shown that programs designed for sequential machines rarely exhibit the properties necessary for efficient parallel execution, hence some research into parallelizing compilers has switched its emphasis to parallelizing codes which are written in languages such as HPF[94, 110] which allow the programmer to express parallelism and allow some control over data layout.

Codes written in any standard sequential language, such as C, C++ or Fortran can achieve high performance, especially if the majority of the operations are performed within calls to the BLAS. If the flops are performed within codes written in the language itself, the execution time will depend upon the code and the compiler more than on the language used. If pointers are used carelessly in C, the compiler may not be able to determine the data dependencies exactly and may have to forgo certain optimizations[172]. On the other hand, carefully crafted C codes, tuned for individual architectures and compiled with modern optimizing compilers can result in performance that rivals that of carefully tuned assembly codes[23, 168].

2.3.2 Algorithmic blocking

A blocked code is one that has been recast to allow some of the flops to be performed as efficient BLAS3 matrix-matrix multiply flops[6, 4]. Typically a block of columns is reduced using an unblocked code followed by a matrix-matrix update of the trailing matrix. The algorithmic blocking factor is the number of columns (or rows) in the block column.
In serial codes, data layout blocking does not exist and hence the algorithmic blocking factor is referred to simply as the blocking factor. In ScaLAPACK version 1.5, the algorithmic blocking factor is set to match the data layout blocking factor.

### 2.3.3 Internal Data Layout

Most of the flops in blocked dense linear codes involve a rank $k$ update, i.e. $A' = A + B \cdot C$ where $A \in \mathbb{R}^{m,n}$, $B \in \mathbb{R}^{m,k}$, $C \in \mathbb{R}^{k,n}$ and $m, n$ are $O(n)$ and $k$ is the algorithmic blocking factor (a tuning parameter typically much smaller than $n$ or $m$). $A$ may be triangular and $B$ and/or $C$ may be transposed or conjugate transposed. Hence internal data layout must support good performance on such rank $k$ updates.
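
A rank $k$ update can be written out directly, which makes the roles of $m$, $n$ and $k$ concrete. The sketch below uses plain Python lists and a hypothetical helper `rank_k_update`; it only illustrates the operation and makes no attempt at performance. In a library code the same update would be a single call to a BLAS3 routine such as DGEMM.

```python
def rank_k_update(A, B, C):
    """Return A + B*C for an m-by-n A, m-by-k B and k-by-n C (lists of rows)."""
    m, n, k = len(A), len(A[0]), len(C)
    return [[A[i][j] + sum(B[i][l] * C[l][j] for l in range(k))
             for j in range(n)]
            for i in range(m)]


A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]          # m = 3, k = 2
C = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]     # k = 2, n = 3
A_new = rank_k_update(A, B, C)
print(A_new)
```

The update performs $2mnk$ flops on $mn + mk + kn$ matrix elements, which is why a larger blocking factor $k$ gives more arithmetic per element touched.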

$A$ is typically updated in place, i.e. the node which owns element $A_{i,j}$ computes and stores $A'_{i,j}$. This is called the owner computes rule and is motivated by the high cost of data movement relative to the cost of floating point computation. If $k$ is large enough a 3D data layout is more efficient\cite{1} \cite{12}, and performance can be improved further by using Strassen's matrix multiply\cite{157} \cite{96} \cite{70}. Some dense linear algebra codes, including LU, can be recursively partitioned\cite{165} resulting in large values of $k$ for the majority of the flops. Nonetheless, though a 3D data layout might be best for a recursively partitioned LU, reduction to tridiagonal form is most efficient with a modest algorithmic blocking factor and hence it is more efficient to update $A$ in place and we will make that assumption for the rest of this discussion.

If $A$ is to be updated in place, a 2D layout minimizes the total communication requirement for rank $k$ updates. The elements of $B$ and $C$ which must be sent to each node are determined by the elements of $A$ owned by that node. The node that owns element $A_{i,j}$ must obtain a copy of row $i$ of $B$ and column $j$ of $C$. The number of elements of matrices $B$ and $C$ that a given node must obtain is $k$ times the number of rows and columns of $A$ for which the node owns at least one element. If a node must own $r^2$ elements, the number of elements of $B$ and $C$ which must be obtained is minimized if the node owns a square submatrix of $A$ corresponding to $r$ rows and $r$ columns. In a 2D layout, the processors are arranged in a rectangular grid. Each row of the matrix is assigned to a row of the processor grid. Each column is assigned to a column of the processor grid.
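
The claim that a square submatrix minimizes the data a node must obtain can be checked by brute force over rectangular shapes. The sketch below counts only the elements of $B$ and $C$ needed for one rank $k$ update, as in the argument above; the helper names and the sizes used are illustrative assumptions.

```python
def words_needed(rows_owned, cols_owned, k):
    """Elements of B and C a node must obtain for one rank-k update.

    The node needs k entries of B for every row of A it touches and
    k entries of C for every column of A it touches.
    """
    return k * (rows_owned + cols_owned)


def best_shape(elements, k):
    """Among rows-by-cols rectangles holding `elements` entries, find the cheapest."""
    shapes = [(r, elements // r) for r in range(1, elements + 1)
              if elements % r == 0]
    return min(shapes, key=lambda s: words_needed(s[0], s[1], k))


# 4096 elements per node and blocking factor k = 32 (illustrative values).
print(best_shape(4096, 32))          # (64, 64): the square submatrix
print(words_needed(64, 64, 32))      # 4096 words for the square shape
print(words_needed(1, 4096, 32))     # 131104 words if the node owns one full row
```

The last line corresponds to a one-dimensional layout, where each node owns whole rows, and shows the much larger volume of $B$ and $C$ it must obtain for the same number of owned elements.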

The common ways of assigning the rows and columns to the processor grid in a 2D layout are: block, cyclic and block-cyclic. For the following descriptions, we will assume that we are distributing $n$ rows of $A$ over $p_r$ processor rows. In a cyclic layout, row $i$ is assigned to processor row $(i - 1) \mod p_r$. In a block layout, row $i$ is assigned to processor row $\left\lfloor \frac{i - 1}{n/p_r} \right\rfloor$. In a block-cyclic data layout, row $i$ is assigned to processor row $\left\lfloor \frac{i - 1}{nb} \right\rfloor \mod p_r$, where $nb$ is the data layout block-size. The block-cyclic data layout includes the other two as special cases.
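
The three layouts can be written as one mapping function with two special cases. The sketch below uses 1-based matrix row indices and 0-based processor rows throughout (a convention chosen here so that the special cases line up), and rounds the block size of the block layout up to $\lceil n/p_r \rceil$ when $p_r$ does not divide $n$; the function names are hypothetical.

```python
import math


def block_cyclic_row(i, nb, p_r):
    """Processor row (0-based) owning matrix row i (1-based), block size nb."""
    return ((i - 1) // nb) % p_r


def cyclic_row(i, p_r):
    """Cyclic layout: block-cyclic with a block size of one row."""
    return block_cyclic_row(i, 1, p_r)


def block_row(i, n, p_r):
    """Block layout: block-cyclic with one block of ceil(n / p_r) rows per processor row."""
    return block_cyclic_row(i, math.ceil(n / p_r), p_r)


n, p_r = 8, 2
print([block_cyclic_row(i, 2, p_r) for i in range(1, n + 1)])  # [0, 0, 1, 1, 0, 0, 1, 1]
print([cyclic_row(i, p_r) for i in range(1, n + 1)])           # [0, 1, 0, 1, 0, 1, 0, 1]
print([block_row(i, n, p_r) for i in range(1, n + 1)])         # [0, 0, 0, 0, 1, 1, 1, 1]
```

The same function applied to column indices and $p_c$ gives the processor column, so the owner of element $A_{i,j}$ is the pair of the two results.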

Block-cyclic data layouts simplify algorithmic blocking and are used in most parallel dense linear algebra libraries\cite{68,98,164}. However, by separating algorithmic blocking from data blocking it is usually\footnote{Block-cyclic data layouts still maintain an advantage over cyclic data layouts on machines with high communication latency, especially in those algorithms, such as Cholesky and back transformation, that require only $O(n/nb)$ messages, where $nb$ is the data layout block-size.} possible to achieve high performance from a cyclic data layout\cite{91,140,44,158}.

One-dimensional data layouts require $O(n^2)$ data movement per node (compared to $O(n^2/\sqrt{p})$ for 2D data layouts) and are generally less efficient. However, there are certain situations in which 1D data layouts are preferred. If the communication pattern is strictly one-dimensional (i.e. only along rows or columns) a 1D data layout requires no communication. Furthermore, some applications, such as LU, require much more communication in one direction than the other\footnote{LU with partial pivoting requires $O(n \log(p))$ messages within the processor columns but only $O(n/nb)$ messages within the processor rows\cite{31,40,30}. The total volume of communication however is similar in both directions.}. Hence, for modest numbers of processors it may be better to use a 1D data layout.

A square processor grid can greatly simplify symmetric reductions - allowing lower overhead codes. Furthermore, I believe that pipelining and lookahead (see section 2.4.2) can only be used effectively on symmetric reductions (such as Cholesky and reduction from generalized to standard form) when a square processor grid is used\footnote{Pipelining and lookahead cannot be used in reduction to tridiagonal form because of its synchronous nature.}.

All existing parallel dense linear algebra libraries use the same input data layout as the internal data layout. In Chapter 8 I will demonstrate that this is not necessary to achieve high performance and that in fact performance can be improved by using a different data layout internally than the input and output data layout.
2.3.4 Libraries

Software libraries can improve portability, robustness, performance and software re-use. ScaLAPACK is built on top of the BLAS and BLACS and hence will run on any system on which a copy of the BLAS[63, 62] and BLACS[169, 69] can be obtained.

Libraries, and their interface, have both a positive and a negative effect on performance. The existence of a standard interface to the BLAS means that by improving the performance of a limited set of routines, i.e. the BLAS, one can improve the performance of the entire LAPACK and ScaLAPACK library and other codes as well. Hence, many manufacturers have written optimized BLAS for their machines. In addition, Bilmes et al.[23, 168] have written a portable high performance matrix-matrix multiply and two other research groups have written high performance BLAS that depend only on the existence of a high performance matrix-matrix multiply[51, 103, 104]. Portable high performance BLAS offers the promise of high performance on LAPACK and ScaLAPACK codes without the expense of hand coded BLAS.

However, adhering to a particular library interface necessarily rules out some possibilities. The BLACS do not support asynchronous receives, a costly limitation on the Paragon. The BLAS do not meet all computational needs[108], especially in parallel codes[91], hence the programmer is faced with the choice of reformulating code to use what the BLAS offers or avoiding the BLAS and trusting the compiler to produce high performance code. Furthermore, the interface itself implies some overhead, at the very least a subroutine call but typically much more than that[161]. Strazdins[161] showed that software overhead in ScaLAPACK accounts for 15-20% of total execution time even for the largest problems that fit in memory on a Fujitsu VP1000.

2.3.5 Compilers

Compiler code generation is relatively unimportant to LAPACK and ScaLAPACK performance, because these codes are written so that most of the work is done in the calls to the BLAS. By contrast, EISPACK is written in Fortran without calls to the BLAS and hence its performance is dependent on the quality of the code generated by the Fortran compiler.

Lehoucq and Carr[35] argue that compilers now have the capability to perform many of the optimizations that the LAPACK project performed by hand. Although no compilers existing today can produce code as efficient as LAPACK from simple three line loops, the compiler technology exists [149, 115, 148].

Today, most compilers are able to produce good code for single loops, reducing the performance advantage of the BLAS1 routines. Soon compilers will be able to produce good code for BLAS2 and even BLAS3 routines. This will require us to rethink certain decisions, especially where the precise functionality that we would like is lacking. There will be an awkward period, probably lasting decades, during which some but not all compilers will be able to perform comparably to the BLAS.

2.3.6 Operating Systems

Operating systems are largely irrelevant to serial codes such as LAPACK but they can have a significant impact on parallel codes. Consider, for example, the broadcast capability inherent in Ethernet hardware. That capability is unavailable to the programmer because the TCP/IP protocol does not provide access to it. Furthermore, at least 90% of the message latency cost is attributable to software and the operating system often makes it difficult to reduce the message latency cost. Part of the NOW[3] project involves finding ways to reduce the large message latency cost inherent in Unix operating systems by using user-level to user-level communications, avoiding the operating system entirely.

2.4 Hardware

2.4.1 Processor

The processor, or more specifically the floating point unit, is the fundamental source of processing power or the ultimate limit on performance, depending on your point of view. The combined speed of all of the floating point units is the peak performance, or speed of light, for that computer. For many dense linear algebra codes, the number of floating point operations cannot be reduced substantially and hence the goal is to perform the necessary flops as fast (i.e. as close to the peak performance) as possible.

Floating point arithmetic

The increasing adherence to the IEEE standard 754 for binary floating point arithmetic[7] benefits performance in two ways: it reduces the effort needed to make codes work across multiple platforms and it allows one to take advantage of details of the underlying arithmetic in a portable code. The developers of LAPACK had to expend considerable effort to make their codes work on machines with non-IEEE arithmetic, notably older Cray machines. By contrast, the developers of ScaLAPACK chose to concentrate on IEEE standard 754 conforming machines allowing them not only to avoid the hassles of old Cray arithmetic, but also to check the sign bit directly when using bisection[54] to compute the eigenvalues of a tridiagonal matrix.

Consistent floating point arithmetic is also important for execution on heterogeneous machines. Demmel et al.[54] discuss ways to achieve correct results in bisection on a heterogeneous machine. I have proposed having each process compute a subset of eigenvalues, chosen by index, sharing those eigenvalues among all processes and then having each process independently sort the eigenvalues[55].

Ironically the one place where the IEEE standard 754 allows some flexibility has caused problems for heterogeneous machines. The IEEE standard 754 allows several options for handling sub-normalized numbers, i.e. numbers that are too small to be represented as a normalized number. During ScaLAPACK testing it was discovered that a sub-normalized number could be produced on a machine that adheres to the IEEE standard 754 completely and that when this number is then passed to the DEC Alpha 21064 processor, the processor does not recognize it as a legitimate number and aborts. Fixing this would have required either xdr to be smart enough to recognize this unusual situation$^4$ or one of the processors to work in a manner different from its default$^5$.

2.4.2 Memory

The slower speed of main memory (as compared to cache or registers) affects performance in three ways. It reduces the performance of matrix-matrix multiply slightly and greatly complicates the task of coding an efficient matrix-matrix multiply. It bounds from below the algorithmic blocking factor needed to achieve high performance on matrix-matrix multiply. And, it limits the performance of BLAS1 and BLAS2 codes.

The last two factors listed above combine in an unfortunate manner: slow main memory increases the number of BLAS1 and BLAS2 flops and reduces the rate at which they are executed. The number of BLAS1 and BLAS2 flops is typically $O(n^2 \text{nb})$, where $\text{nb}$ is the algorithmic blocking factor, which as stated above, must be larger when main memory is slow. The ratio of peak floating point performance to main memory speed is large enough on some machines that the $O(n^2 \text{nb})$ cost of the BLAS1 and BLAS2 flops can no longer be ignored.

---

$^4$This would slow down xdr, possibly significantly.

$^5$This too would result in slower execution.

**Improving the load balance of the $O(n^2 \text{nb})$ BLAS1 and BLAS2 flops.**

In a blocked dense linear algebra transformation, such as LU decomposition, Cholesky or QR, there are $O(n^2 \text{nb})$ BLAS1 and BLAS2 flops\cite{30, 53}. PDSYEVX includes two blocked dense linear algebra transformations: Reduction to tridiagonal form, PDSYTRD, is described in Section 4.2 and back transformation, PDORMTR, is described in Section 4.4.

In ScaLAPACK version 1.5, the $O(n^2 \text{nb})$ BLAS1 and BLAS2 flops are performed by just one row or column of processors. This leads to load imbalance and causes these flops to account for $O\left(\frac{n^2 \text{nb}}{\sqrt{p}}\right)$ execution time. If these flops can be performed on all $p$ processors, instead of just one row or column, they will account for only $O\left(\frac{n^2 \text{nb}}{p}\right)$ execution time.
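
As a rough illustration (ignoring constant factors and communication, and assuming a square $\sqrt{p} \times \sqrt{p}$ processor grid), the ratio of the two execution time estimates above is

$$\frac{n^2 \text{nb} / \sqrt{p}}{n^2 \text{nb} / p} = \sqrt{p},$$

so on, say, $p = 64$ processors, spreading these flops over all processors would reduce this term by a factor of about 8.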

There are two ways to spread the cost of the $O(n^2 \text{nb})$ BLAS1 and BLAS2 flops over all the processors: take them out of the critical path or distribute them over all processors. Transformations such as LU, and back transformation (applying a series of Householder vectors) can be pipelined, allowing each processor column (or row) to execute asynchronously. Pipelining in turn allows lookahead, a process by which the active column performs only those computations in the critical path before sending that data on to the next column\cite{32}.

Distributing the BLAS1 and BLAS2 flops over all of the processors, as discussed in the last paragraph, requires a different data distribution, a different broadcast and a significant change to the code. The difference can be best illustrated by considering LU. In a 2D blocked LU, LU is first performed on a block of columns, and the resulting LU decomposition is broadcast, or spread across, to all processor columns. One way to broadcast $k$ elements to $p$ processors is to combine a $\text{Reduce\_scatter}$ (which takes $k$ elements and sends $k/p$ to each processor) with an $\text{Allgather}$ (which takes $k/p$ elements from each processor and spreads them out to all processors giving each processor a copy of all $k$ elements). There are three ways to perform LU on this column block of data: 1) Before the column block is broadcast to all processors (as ScaLAPACK does) in which case only the current column of processors is involved in performing the column LU and the $\text{Reduce\_scatter}$ and $\text{Allgather}$ combine to broadcast the block LU decomposition. 2) After the broadcast, in which case the Reduce\_scatter and Allgather combine to broadcast the block column prior to the LU decomposition - all processor columns would have a copy of the block column and each processor column could perform the column block LU redundantly. 3) After the Reduce\_scatter but before the Allgather. In this case, the Reduce\_scatter operates on the column block prior to the LU decomposition but the Allgather operates on the block column after the LU decomposition. All processors can be involved in the LU decomposition.
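
The two-phase broadcast described above can be sketched in a few lines of Python. This is only a serial simulation under stated assumptions: the "processors" are list entries, the function names `scatter_pieces`, `allgather` and `broadcast` are hypothetical, and the first phase simply splits the root's data (the role the text assigns to Reduce\_scatter). The optional `local_work` hook stands in for option 3, where work is done between the two phases; a real column block LU is not an independent per-element operation, so the hook is a placeholder rather than an LU.

```python
def scatter_pieces(data, p):
    """Phase 1: split the root's k elements into p contiguous pieces,
    one per (simulated) processor, each of size about k/p."""
    k = len(data)
    bounds = [(r * k) // p for r in range(p + 1)]
    return [data[bounds[r]:bounds[r + 1]] for r in range(p)]


def allgather(pieces):
    """Phase 2: every processor ends up with a copy of all pieces."""
    full = [x for piece in pieces for x in piece]
    return [list(full) for _ in pieces]


def broadcast(data, p, local_work=None):
    """Broadcast k elements to p processors as scatter followed by allgather.

    If local_work is given, each processor applies it to its own piece
    between the two phases (a stand-in for option 3 in the text)."""
    pieces = scatter_pieces(data, p)
    if local_work is not None:
        pieces = [[local_work(x) for x in piece] for piece in pieces]
    return allgather(pieces)


copies = broadcast(list(range(10)), 4)
doubled = broadcast([1, 2, 3], 2, local_work=lambda x: 2 * x)
```

In this sketch each processor handles only about $k/p$ elements in the first phase, which is what allows work placed between the phases to be shared by all processors.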

In HJS, Hendrickson, Jessup and Smith's symmetric eigensolver[91, 154] discussed in Section 7.1.2, the BLAS1 and BLAS2 flops are analogously distributed over all of the processors.

Lookahead does not improve performance unless the execution of the code is pipelined, i.e. proceeds in a wave pattern over the processes. Two-sided reductions, like tridiagonal reduction, do not allow pipelining. And, pipelining may be limited on reductions of symmetric or Hermitian matrices (such as Cholesky)$^6$.

Memory size

The amount of main memory limits the size of the problem that can be executed efficiently, while the amount of virtual memory limits the size of the problem that can be run at all. ScaLAPACK's symmetric eigensolvers, PDSYEVX and PDSYEV, require roughly $4n^2$ and $2n^2$ double precision words of virtual memory space respectively. However, both can be run efficiently provided that physical memory can contain$^7$ the $n^2/2$ elements of the triangular matrix $A$. Ed D'Azevedo[52] has written an out-of-core symmetric eigensolver for ScaLAPACK and studied the performance of PDSYEV and PDSYEVX on large problem sizes.

2.4.3 Parallel computer configuration

I will discuss primarily distributed memory computers with one processor per node, discussing shared memory computers (SMPs), clusters of workstations and clusters of shared memory computers only briefly.

Four machine characteristics are important for distributed memory computers: peak floating point performance, software overhead, communication latency and communication (bisection) bandwidth. Software overhead and communication latency are the dominant costs for small$^8$ problems. Peak floating point performance is the dominant cost for large$^9$ problems.

---

$^6$I believe that pipelining can be used in Cholesky if a square processor grid is used. Work in progress.

$^7$Depending on the page size, keeping an $n$ by $n$ triangular matrix in memory may require as few as $n^2/2$ memory locations (if the page size is 1) or as many as $n^2$ (if the page size is $\geq n$).

**Interconnection network**

Bisection bandwidth and communication latency are the two important measures of an interconnection network. Networks which allow only one pair of nodes to communicate at a time do not offer adequate bisection bandwidth and hence parallel dense linear algebra (with the possible exception of huge matrix-matrix multiplies) will not perform well on such a network.

As long as the bisection bandwidth is adequate, the topology of the interconnection network has not proven to be an important factor in the performance of parallel dense linear algebra.

**Shared Memory Multiprocessing**

Users of dense linear algebra codes have two choices on shared memory multiprocessors. They can use a serial code, such as LAPACK that has been coded in terms of the BLAS and, provided that the manufacturer has provided an optimized BLAS, they will achieve good performance. Or, provided that the manufacturer provides MPI[65], PVM[19] or the BLACS they can use ScaLAPACK.

LeBlanc and Markatos[118] argue that shared memory codes typically get better load balance while message passing codes typically incur lower communications cost. However, the real difference could well come down to a matter of how efficient the underlying libraries are.

**Clusters of workstations**

Some clusters of workstations, notably the NOW project[3] at Berkeley, offer comparable communication performance to distributed memory computers. However, the vast majority of networks of workstations in present use are still connected by Ethernet or FDDI rings and hence do not have the low latency and high bisection bandwidth required to perform dense linear algebra reductions in parallel efficiently.

---

$^8$On current architectures, \( n < 100 \sqrt{p} \) is small for our purposes.

$^9$On current architectures, \( n > 1000 \sqrt{p} \) is large for our purposes.

**Cluster of SMPs (CLUMPS)**

Dense linear algebra codes have two choices on clusters of SMPs: they can assign one process to each processor or they can assign one process to each multi-processor node. The tradeoff will be similar to the shared-memory versus message-passing question on shared memory computers.

If each processor is assigned a separate process the details of how the processes will be assigned to what is essentially a two level grid of processors will be important. For a modest cluster of SMPs (say 4 nodes each with 4 processors) it might make sense to assign one dimension within the node and the other across the nodes. However, this will not scale well - adding nodes will require increasing the bandwidth per node else all dense linear algebra transformations will become bandwidth limited as the number of nodes increases. A layout that is 2 dimensional within the nodes and 2 dimensional among the nodes allows both the number of processors per node and the number of nodes to increase provided only that bisection bandwidth grows with the number of processors and that internal bisection bandwidth (i.e. main memory bandwidth) grows with the number of processors per node.

On the first CLUMPS, how well each of the libraries is implemented is likely to outweigh theoretical considerations. Shared memory BLAS are not trivial, nor will communication systems that properly handle two levels of processor hierarchy be, i.e. communication within a node and communication between nodes.

On most distributed memory systems, the logical to physical processor grid mapping is of secondary importance. I suspect that this will not be the case for clusters of SMPs. It will be important to have the processes assigned to the processors on a particular node nearby in the logical process grid as well.

2.5 Applications

Large symmetric eigenproblems are used in a variety of applications. Some of these applications include: real-time signal processing[156] [34], modeling of acoustic and electro-magnetic waveguides[114], quantum chemistry[74] [22, 175], numerical simulations of disordered electronic systems\cite{95}, vibration mode superposition analysis\cite{18}, statistical mechanics\cite{132}, molecular dynamics\cite{152}, quantum Hall systems\cite{112, 106}, material science\cite{166}, and biophysics\cite{143, 144}.

The needs of these applications differ considerably. Many require considerable execution time to build the matrix and hence the eigensolution remains a modest part of the total execution. However, building the matrix often parallelizes easily and grows much more slowly than the $O(n^3)$ cost of eigensolution. Hence, for these applications, the eigensolver becomes the bottleneck as larger problems are solved in parallel. Few applications require the entire spectrum, but most of these listed above require at least 10\% of the spectrum and hence are best solved by dense techniques. Some have large clusters of eigenvalues\cite{74}, while others do not.

### 2.5.1 Input matrix

Three features of the input matrix affect the execution time of symmetric eigensolvers: sparsity, eigenvalue clustering and spectral diagonal dominance.

#### Sparsity

Some algorithms and codes are specifically designed for sparse input matrices. Lanczos\cite{149} has traditionally been used to find a few eigenvalues and eigenvectors at the ends of the spectrum. Recently, ARPACK\cite{119}, and PARPACK\cite{130} have been developed based on Lanczos with full re-orthogonalization. They can therefore compute as much of the spectrum as the user chooses.

The Invariant Subspace Decomposition Approach and reduction to tridiagonal form based algorithms can both be run from either a dense or banded matrix. In this dissertation, I discuss only dense matrices.

#### Spectrum

Some algorithms are more dependent on the spectrum than others. Most are dependent in some manner, but that dependence differs from one algorithm to another.

It is difficult to maintain orthogonality of the eigenvectors when computing the eigendecomposition of matrices with tight clusters of eigenvalues. Such matrices require special techniques in divide and conquer and in inverse iteration (see Section 2.7.4). On the other hand, divide and conquer experiences the most deflation, and hence the greatest efficiency, on matrices with clustered eigenvalues.

The Invariant Subspace Decomposition Approach maintains orthogonality on matrices with clustered eigenvalues. However, it may have difficulty picking a good split point if the clustering causes the eigenvalues to be unevenly distributed.

**Spectral Diagonal dominance**

Spectral diagonal dominance\(^{10}\) speeds convergence of the Jacobi algorithm. Indeed, if the input matrix is sufficiently diagonally dominant, Jacobi may converge in as little as two steps (versus 10 to 20 for non diagonally dominant matrices). But, spectral diagonal dominance has little effect on any of the other algorithms.

**2.5.2 User request**

The portion of the spectrum that the user needs, i.e. the number of eigenvalues and/or eigenvectors, affects execution time of some, but not all eigensolvers.

Two step band reduction (to tridiagonal form) is most attractive when only eigenvalues are requested because the back transformation task is expensive in two step band reduction.

The cost of bisection and inverse iteration depends upon the number of eigenvalues and eigenvectors requested. These costs are \(O(n^2)\) and generally not significant for large problem sizes. However, back transformation requires \(2n^2m\) flops where \(m\) is the number of eigenvectors required.

Iterative methods, such as Lanczos[49] and implicitly restarted Lanczos[119], are clearly superior if only a few eigenvectors are required.

---

\(^{10}\)Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant. Most, but not all, diagonally dominant matrices are spectrally diagonally dominant. For example if you take a dense matrix with elements randomly chosen from \([-1, 1]\) and scale the diagonal elements by \(1e3\) the resulting diagonally dominant matrix will be spectrally diagonally dominant. However, if you take that same matrix and add \(1e3\) to each diagonal element, the eigenvector matrix is unchanged even though the matrix is clearly diagonally dominant.

2.5.3 Accuracy and Orthogonality requirements.

Demmel and Veselić[58] prove that on scaled diagonally dominant matrices\(^\text{11}\), Jacobi can compute small eigenvalues with high relative accuracy while tridiagonal based methods can fail to do so.

At present, ScaLAPACK offers two symmetric eigensolvers: PDSYEVX and PDSYEV. PDSYEVX, which is based on bisection and inverse iteration (DSTEBZ and DSTEIN from LAPACK), is faster and scales better but does not guarantee orthogonality among eigenvectors associated with clustered eigenvalues. PDSYEV, which is based on QR iteration (DSTEQR from LAPACK), is slower and does not scale as well but does guarantee orthogonal eigenvectors.

2.5.4 Input and Output Data layout

At present, the execution time of the ScaLAPACK symmetric eigensolver is strongly dependent on the data layout chosen by the user for input and output matrices. 1D data layouts are not scalable and lead to both high communication costs and poor load balancing. Suboptimal block sizes can likewise affect performance significantly. In particular, a block size of 1, i.e. cyclic data layout, causes ScaLAPACK to send a large number of small messages resulting in unacceptable message latency costs and a huge number of calls to the BLAS. If the block size is too large, load balance suffers.
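
The effect of the block size on load balance can be illustrated with a minimal sketch of a 1D block-cyclic mapping. The function names `owner` and `local_counts` are hypothetical and are not ScaLAPACK routines; the sketch assumes 0-based indices and that the first block is assigned to process 0.

```python
def owner(i, nb, p):
    """Process (0-based) that owns global index i in a 1D block-cyclic
    layout with block size nb over p processes. nb = 1 is a cyclic layout."""
    return (i // nb) % p


def local_counts(n, nb, p):
    """Number of the n global indices owned by each of the p processes."""
    counts = [0] * p
    for i in range(n):
        counts[owner(i, nb, p)] += 1
    return counts


cyclic = local_counts(10, 1, 4)   # block size 1: [3, 3, 2, 2]
blocked = local_counts(10, 4, 4)  # block size 4: [4, 4, 2, 0]
```

With block size 1 the ten indices are spread almost evenly, at the price of one block (and hence potentially one message) per index. With block size 4 there are only three blocks, but one process is left with no data at all.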

There are a couple of ways to reduce this dependence on the data layout chosen by the user. If algorithmic blocking is separated from data layout blocking[140, 91, 159], small data layout block sizes can be handled much more efficiently. However, small block-sizes (especially cyclic layouts) still require more messages than larger block-sizes. And, large block sizes still lead to load imbalance.

In Chapter 8 I will show that redistributing the data to an internal format that is near optimal for the particular machine and algorithm involved allows for improved performance and performance that is independent of the input and output data layout.

2.6 Machine Load

The load of the machine, in addition to the direct effect of offering your program only a portion of the total cycles, can have several indirect effects. If each processor is individually scheduled, performance can be arbitrarily poor because significant progress is only possible when all processes are concurrently scheduled. A loaded machine may also cause your data to be swapped out to disk, which can greatly reduce peak performance. Finally, it is the most heavily loaded machine which controls execution time. If your code is running on 9 unloaded processors and one processor with a load factor of 5, you will get no more than a factor of 10/5 speedup. A ScaLAPACK user has reported performance degradation and speedup less than 1 (i.e. more processors take longer to complete the same sized eigendecomposition) on the IBM SP2. I have also witnessed this behaviour on the IBM SP2 at the University of Tennessee at Knoxville and I have reason to suspect that the IBM SP2 is not gang scheduled and that this fact accounts for a large part of the poor performance of PDSYEVX that the user and I have witnessed on the IBM SP2.

---

\(^\text{11}\) A matrix, \(A\), is scaled diagonally dominant if and only if \(D^{-1} A D^{-1}\) with \(D = \{\text{diag}(A)\}^{1/2}\) is diagonally dominant.

Space sharing, allocating subsets of the processors, solves all of these problems, but has its own problems. On some machines, jobs running on different partitions share the same communications paths and hence if one job saturates the network, all jobs may suffer.
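
The bound in the load factor example of this section can be written as a one-line model. This is a sketch under two assumptions that are not spelled out in the text: the work is divided equally and statically among the processors, and a processor with load factor $f$ delivers $1/f$ of its cycles to the job. The function name `speedup_bound` is hypothetical.

```python
def speedup_bound(load_factors):
    """Upper bound on speedup when equal shares of work are assigned to
    processors whose load factors are given (1 means unloaded).

    The slowest processor needs max(load_factors) times longer for its
    share, so the speedup cannot exceed P / max(load_factors)."""
    return len(load_factors) / max(load_factors)


# 9 unloaded processors plus one with a load factor of 5: bound is 10/5.
bound = speedup_bound([1] * 9 + [5])
```

Under this model a single heavily loaded processor costs as much as if every processor were equally loaded.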

2.7 Historical notes

2.7.1 Reduction to tridiagonal form and back transformation

Householder reduction to tridiagonal form is a two-sided reduction, which requires multiplication by Householder reflectors from both the left and right side. Martin et al. implemented reduction to tridiagonal form in Algol[129]. TRED1 and TRED2 perform reduction to tridiagonal form in EISPACK[153]. Dongarra, Hammarling and Sorensen[64] showed that Householder reduction to tridiagonal form can be performed using half matrix-vector and half matrix-matrix multiply flops. This has been implemented as DSYTRD in LAPACK[5, 67] for scalar and shared memory multiprocessors and PDSYTRD for distributed memory computers in ScaLAPACK[42]. Chang et al. implemented one of the first parallel codes for reduction to tridiagonal form, first using a 1D cyclic data layout[37] and then a 2D cyclic data layout[38].

Smith, Hendrickson and Jessup[91] show that data blocking is not required for efficient algorithmic blocking and that PDSYTRD pays a substantial execution time penalty for its generality (accepting any processor layout) and portability (being built on top of the PBLAS, BLACS and BLAS). By restricting their attention to square processor layouts on the PARAGON, they were able to dramatically reduce the overhead incurred in reduction to tridiagonal form in HJS. HJS does not have the redundant communication found in PDSYEVX, it makes many fewer BLAS calls, avoids the overhead of the PBLAS calls, and spreads the work more evenly among all the processors (improving load balance). Furthermore, HJS, by using communication primitives better suited to the task, reduces both the number of messages sent and the total volume of communication substantially. Some, but not all, of these advantages necessitate that the processor layout be square. HJS is discussed in Section 7.1.2.

Other ways to reduce the execution time of reduction to tridiagonal form do not require that the processor layout be square. Bischof and Sun[25] and Lang[116] showed that in a two step band reduction to tridiagonal form, all of the flops, asymptotically, can be performed in matrix multiply routines. Karp, Sahay, Santos and Schauer[107] showed that subset broadcasts and reductions can be performed optimally. Van de Geijn and others[16] are working to implement improved subset broadcast and reduction primitives.

Hegland et al.[90] argue that the fastest way to reduce a symmetric matrix $A$ to tridiagonal form on the VPP500 (a multiprocessor vector supercomputer by Fujitsu) is to compute $L_1 D L_1^T = A$ and then compute a series of $L_i$ using orthonormal transformations such that $L_{n+p-1} D L_{n+p-1}^T$ is tridiagonal. Their technique is, in essence, a two step band reduction in which the two steps are performed within the same loop. Let $L_i[:, \text{own}(\rho)]$ represent the columns of $L_i$, owned by processor $\rho$. $\rho Q_i$ means the portion of $Q_i$ which processor $\rho$ owns.

The code is:

$$L_1 D L_1^T = A$$

For $i = 1$ to $n - 1$ do:

- Each processor independently performs:
  $$\rho Q_i = \text{House}(L_i[:, \text{own}(\rho)] D [i, \text{own}(\rho), \text{own}(\rho)] L_i[:, \text{own}(\rho)]^T)$$
  $$L_{i+1}[:, \text{own}(\rho)] = \rho Q_i L_i[:, \text{own}(\rho)]$$

- The processors together perform:
  $$\text{Allgather}(L_{i+1}[:, i + 1 : i + p])$$

- Each processor performs redundantly:
  $$Q'_i = \text{House}(L_{i+1}[:, i + 1 : i + p] D [i + 1 : i + p, i + 1 : i + p] L_{i+1}[:, i + 1 : i + p]^T)$$
  $$L_{i+1}[:, i + 1 : i + p] = Q'_i L_{i+1}[:, i + 1 : i + p]$$

In $\text{Allgather}(L_{i+1}[:, i+1 : i+p])$ each processor contributes the column of $L_{i+1}[:, i+1 : i+p]$ which it owns and all processors end up with identical copies of $L_{i+1}[:, i+1 : i+p]$.

The loop invariants are as follows:

Let: $T_i = (L_i) D (L_i)^T$

$\forall j < i; k < i+p \quad T_i(j, k) = 0$ \hfill (Line 1)

$T_i(1 : i - p, 1 : i - p)$ is tridiagonal \hfill (Line 2)

For $p = 1$, the serial case, both of these conditions are identical and meeting them requires computing the first column of $(L_i) D (L_i)^T$, computing the Householder vector and applying it to $L_i$ to yield $L_{i+1}$.

For $p > 1$, the parallel case, the first loop invariant is maintained by each processor independently computing the first column of $(L_i) D (L_i)^T$, using only the local columns\(^{12}\) of $L_i$. A Householder vector is computed from this and applied to the local columns of $L_i$. The second loop invariant is maintained redundantly on all processors. All processors obtain copies of columns $i$ to $i+p-1$ of $L_i$ and compute: $A(1 : p, 1) = L_i(i : i+p-1, i : i+p-1) D(i : i+p-1, i : i+p-1) L_i(i : i+p-1, :)$. A Householder vector is computed from $A(1 : p, 1)$ and applied to $L_i(i : i+p-1, :)$, redundantly on all processors, maintaining the second loop invariant.

This one-sided transformation requires fewer messages than Hessenberg reduction to tridiagonal form and, for small $p$, less message volume, but requires twice as many flops.

### 2.7.2 Tridiagonal eigendecomposition

**Sequential symmetric QL and QR algorithms**

The implicit QL or QR algorithms have been the most commonly used methods for solving the symmetric eigenproblem for the last couple decades. Francis\cite{79} wrote the first implementation of the QL algorithm based on Rutishauser’s LR transformation. The QL algorithm is the basis of the EISPACK routine \texttt{IMTQL1}, while the LAPACK routine \texttt{DSTEQR} uses either implicit QR or implicit QL depending on the top and bottom diagonal elements\cite{86}.

\(^{12}\)Their implementation uses a column cyclic data distribution.

Henry[93] shows that if between each sweep of QR (or QL) in which the eigenvectors are updated an additional sweep is performed in which the eigenvectors are not updated, better shifts can be used, reducing the total number of flops from roughly \(6n^3\) to \(4n^3\).

Reinsch[145] wrote EISPACK’s TQLRAT which computes eigenvalues without square roots. LAPACK’s DSTERF improves on TQLRAT using a root free variant developed by Pal, Walker and Kahan[134]. Like DSTEQR, DSTERF uses either implicit QR or implicit QL depending on the top and bottom diagonal elements.

**Parallel symmetric QL and QR algorithms**

QR requires \(O(n^2)\) effort to compute the eigenvalues and \(O(n^3)\) to compute the eigenvectors. No one has found a good, stable way to parallelize the \(O(n^2)\) cost of computing the eigenvalues and reflectors. Sameh and Kuck[113] use parallel prefix to parallelize QR for eigenvalue extraction. They obtain \(O(\log p)\) speedup, but they do not show how their method can be used to generate reflectors and hence eigenvectors.

However, parallelizing the \(O(n^3)\) effort of computing the eigenvectors is straightforward, as shown by Chinchalkar and Coleman[39] and Arbenz et al.[8], and implemented for ScaLAPACK by Felmers[76].

Symmetric QR parallelizes nicely in a MIMD programming style, but efforts to parallelize it on a shared memory machine in which the parallelism is strictly within the calls to the BLAS have produced only modest speedups. Bai and Demmel[13] first suggested using multiple shifts in non-symmetric QR. Arbenz and Oettli[10] showed that blocking and multiple shifts could be used to obtain modest improvements in the speed (roughly a factor of 2 on 8 processors) of QR for eigenvalues and eigenvectors on the ALLIANT FX/80. Kaufman[109] showed that multi-shift QR could be used to speed eigenvalue extraction by a factor of 3 on a 2-processor Cray YMP despite tripling the number of flops performed.

**Sturm sequence methods**

Givens[83] used bisection to compute the eigenvalues of a tridiagonal matrix based on Wilkinson’s original idea. Kahan[105] showed that bisection can compute small eigenvalues with tiny componentwise relative backward error, and sometimes high relative accuracy. High relative accuracy is required for inverse iteration on a few matrices. Barlow and Evans were the first to use bisection in a parallel code[15].

Computing eigenvalues of a tridiagonal matrix can be split into three phases: isolation, separation and extraction. The isolation phase identifies, for each eigenvalue, an interval which contains that eigenvalue and no other. The separation phase improves the eigenvalue estimate. And the extraction phase computes the eigenvalue to within some tolerance. Bisection can be used for all three phases.
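
A minimal serial sketch of bisection driven by a Sturm sequence count is given below. It is an illustration only and is not the LAPACK routine DSTEBZ: the function names, the Gershgorin starting interval, the absolute tolerance and the `pivmin` guard against a zero pivot are all assumptions made for this example. Only the sign of each pivot `d` is used, which is the quantity referred to when the sign bit is checked directly.

```python
import math


def count_below(a, b, x, pivmin=1e-150):
    """Sturm count: number of eigenvalues less than x of the symmetric
    tridiagonal matrix with diagonal a and off-diagonal b."""
    count = 0
    d = 1.0
    for i in range(len(a)):
        off = b[i - 1] ** 2 if i > 0 else 0.0
        d = (a[i] - x) - off / d
        if abs(d) < pivmin:   # guard against a zero pivot (assumed treatment)
            d = -pivmin
        if d < 0.0:           # only the sign of the pivot matters
            count += 1
    return count


def gershgorin(a, b):
    """Interval guaranteed to contain every eigenvalue."""
    n = len(a)
    lo, hi = math.inf, -math.inf
    for i in range(n):
        r = (abs(b[i - 1]) if i > 0 else 0.0) + (abs(b[i]) if i < n - 1 else 0.0)
        lo = min(lo, a[i] - r)
        hi = max(hi, a[i] + r)
    return lo, hi


def bisect_eigenvalue(a, b, k, tol=1e-12, max_iter=200):
    """Approximate the k-th smallest eigenvalue (k is 0-based)."""
    lo, hi = gershgorin(a, b)
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if count_below(a, b, mid) > k:
            hi = mid          # at least k + 1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# 3 by 3 matrix with diagonal 2 and off-diagonal -1.
a, b = [2.0, 2.0, 2.0], [-1.0, -1.0]
eigs = [bisect_eigenvalue(a, b, k) for k in range(3)]
```

For this small matrix the exact eigenvalues are $2 - \sqrt{2}$, $2$ and $2 + \sqrt{2}$, which gives a simple check on the sketch. Here bisection by index performs isolation, separation and extraction in one loop, without distinguishing between them.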

Neither existing codes nor the literature explicitly distinguish between these three phases, but they have very different computational aspects. Isolation, at least to the point of identifying \( p \) intervals so that each processor is responsible for one interval, is difficult to parallelize, whereas the other phases are fairly straightforward. The separation phase is typically the challenge for most root finders, and the area where they distinguish themselves from other codes. Divide and conquer techniques, which use the eigenvalues from perturbed matrices as estimates of the eigenvalues of the original matrix, isolate and may separate the roots.

Techniques for eigenvalue isolation include: multi-section[126] [14], assigning different parts of the spectrum to different processors[95, 20], divide and conquer, and using multiple processors to compute the inertia of a tridiagonal matrix[123]. In multi-section, each processor computes the inertia at a single point, splitting an interval into \( p + 1 \) intervals. Although multi-section requires communication, Crivelli and Jessup[48] show that the communication cost is often a modest part of the total cost. Divide and conquer splits the matrix by perturbing or ignoring a couple of elements, typically near the center of the matrix, to separate the matrix into two tridiagonal matrices whose eigenvalues can be computed separately. If a rank 1 perturbation is chosen, the merged set of eigenvalues provides a set of intervals in each of which exactly one eigenvalue lies.

There are a number of ways to use multiple processors to compute the inertia of a tridiagonal matrix. Lu and Qiao[127] discuss using parallel prefix to compute the Sturm sequence as the sub-products of a series of 2 by 2 matrices. Mathias[131] did an error analysis of this approach and showed that it was unstable. Ren[146] tried unsuccessfully to repair parallel prefix. Conroy and Podrazik[46] perform LU on a block arrowhead matrix. Each block is tridiagonal and the arrow has width equal to the number of blocks. Swarztrauber[162] and Krishnakumar and Morf[111] discuss ways of computing the determinant of 4 matrices of size roughly \( n \) by \( n \) from the determinants of 8 matrices of size roughly \( \frac{1}{2}n \) by \( \frac{1}{2}n \). Each of these methods performs 2 to 4 times more floating point operations than a serial Sturm sequence count would and requires \( O(\log(p)) \) messages. Except for Conroy and Podrazik's method, they all use multiplies instead of divides. Multiplies are faster than divides, but require special checks to avoid overflow.

The computation of the inertia is slowed by the existence of a divide and a comparison in the inner loop. A couple of tricks can potentially speed the computation of the inertia, either by reducing the number of divides and comparisons or by making them faster. ScaLAPACK’s PDSYEVX uses signed zeroes and the C language's ability to extract the sign bit of a floating point number to avoid a comparison in the inner loop. I have proposed perturbing tiny entries in the tridiagonal matrix to guarantee that negative zero will never occur, thus allowing a standard C or Fortran comparison against zero. Using a standard comparison against zero would allow compilers to produce more efficient code. I have also proposed reducing the number of divides in the inner loop by taking advantage of the fixed exponent and mantissa sizes in IEEE double precision numbers. I have not implemented either of these ideas. Some machines have two types of divide: a fast hardware divide that may be incorrect in the last couple of bits and a slower but correct software divide.
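The inner loop discussed above can be illustrated with a minimal serial sketch of a Sturm sequence count and of bisection built on top of it. This is not ScaLAPACK's code: the names `sturm_count`, `gershgorin_bounds`, `bisect_eigenvalue` and the guard value `PIVMIN` are hypothetical, and the guard against a zero pivot is one simple choice among several.

```python
PIVMIN = 1e-300  # hypothetical guard value, not taken from any library


def sturm_count(d, e, x):
    """Count the eigenvalues less than x of the symmetric tridiagonal
    matrix with diagonal d (length n) and off-diagonal e (length n - 1)."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        coupling = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - coupling / q   # the divide in the inner loop
        if abs(q) < PIVMIN:           # keep the pivot away from zero
            q = -PIVMIN
        if q < 0.0:                   # the comparison in the inner loop
            count += 1
    return count


def gershgorin_bounds(d, e):
    """Interval guaranteed to contain every eigenvalue."""
    n = len(d)
    lo, hi = float("inf"), float("-inf")
    for i in range(n):
        r = (abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
        lo, hi = min(lo, d[i] - r), max(hi, d[i] + r)
    return lo, hi


def bisect_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) by plain bisection."""
    lo, hi = gershgorin_bounds(d, e)
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) > k:   # at least k + 1 eigenvalues below mid
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Each eigenvalue is found independently of the others, which is why extraction by bisection needs no communication once the work has been divided among processors.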

Demmel, Dhillon and Ren[54] give a proof of correctness for PDSTEBZ, ScaLAPACK’s bisection code for computing the eigenvalues of a tridiagonal matrix, in the face of heterogeneity and non-monotonic arithmetic (such as sloppy divides). This shows that bisection can be robust even in the face of incorrect divides.

Many techniques have been used to accelerate eigenvalue extraction, including: the secant method[33], Laguerre’s iteration[138], Rayleigh quotient iteration[163], secular equation root finding[50] and homotopy continuation[120, 45]. Bassermann and Weidner use a Newton-like root finder called the Pegasus method[17]. These acceleration techniques converge super-linearly as long as the eigenvalues are separated.

Li and Ren[121] accelerate eigenvalue separation in their Laguerre based root finder by detecting linear convergence and estimating the effect of the next several steps. Brent[33] discusses ways of separating eigenvalues when the secant method is used. Li and Zeng use an estimate of the multiplicity in their root finder based on Laguerre iteration[122]. Szyld[163] uses inverse iteration with a shift set to the middle of the interval known to contain only one eigenvalue to separate eigenvalues before switching to Rayleigh quotient iteration. Cuppen’s method takes advantage of multiple eigenvalues through deflation.

Eigenvalue extraction can be performed in parallel with no communication, or a small constant amount of communication. However, eigenvalue extraction can exhibit poor load balance, especially if acceleration techniques are used. Ma and Szyld[128] use a task queue to improve load balance. Li and Ren[121] minimize load imbalance by concentrating on worst case performance.

**ScaLAPACK** chose bisection and inverse iteration for its first tridiagonal eigensolver, **PDSYEVX**, because they are fast, well known, robust, simple and parallelize easily. **ScaLAPACK** has since added a QR based tridiagonal eigensolver for those applications needing guarantees on orthogonality within eigenvectors corresponding to large clusters of eigenvalues. See Section 4.3 for details.

**Divide and Conquer**

Cuppen[50] showed that by making a small perturbation to a tridiagonal matrix it could be split into two separate tridiagonal matrices each of which could be solved independently, and that the eigendecomposition of the original tridiagonal matrix could then be constructed from the eigendecomposition of the two independent tridiagonal matrices and the perturbation.

There are many ways to perturb a tridiagonal matrix such that the result is two separate tridiagonal matrices. The following four have been implemented. Cuppen’s algorithm[50] subtracts \( \alpha uu^T \) from the tridiagonal matrix, where \( u = e_{\frac{1}{2}n} + e_{\frac{1}{2}n+1} \) and \( \alpha = T_{\frac{1}{2}n, \frac{1}{2}n+1} \). Gu and Eisenstat[89] set all elements in row and column \( i \) to zero. Gates and Arbenz[82] call this a rank-one extension and refer to this as permuting row and column \( \frac{1}{2}n \) to the last row and column (as opposed to setting all elements in row and column \( i \) to zero). Gates[80] uses a rank two perturbation: \( T_{\frac{1}{2}n, \frac{1}{2}n+1} (e_{\frac{1}{2}n} e_{\frac{1}{2}n+1}^T + e_{\frac{1}{2}n+1} e_{\frac{1}{2}n}^T) \) is subtracted from the original tridiagonal.
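As an illustration of the rank-one perturbation attributed to Cuppen above, the following sketch forms \( T - \alpha uu^T \) explicitly with NumPy and returns the two decoupled tridiagonal blocks. The function name `cuppen_split` and the dense storage are assumptions made for readability; a real code would store only the diagonal and off-diagonal.

```python
import numpy as np


def cuppen_split(d, e):
    """Tear the symmetric tridiagonal matrix T (diagonal d, off-diagonal e)
    into two independent tridiagonal blocks by subtracting alpha * u u^T."""
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    n = len(d)
    m = n // 2                       # tear between rows m - 1 and m (0-based)
    T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
    alpha = T[m - 1, m]              # the off-diagonal entry being removed
    u = np.zeros(n)
    u[m - 1] = 1.0
    u[m] = 1.0
    S = T - alpha * np.outer(u, u)   # block diagonal: S[:m, m:] is zero
    return S[:m, :m], S[m:, m:], alpha, u
```

Note that the tear also changes the two diagonal entries adjacent to the split by \( -\alpha \); the eigendecomposition of the original matrix is then recovered from the two blocks and the rank-one term.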

Cuppen’s original divide and conquer method can result in a loss of orthogonality among the eigenvectors. Three methods of maintaining orthogonality have been implemented. Sorensen and Tang[155] calculate the roots to double precision. Gu and Eisenstat[89] compute the eigenvectors to a slightly perturbed problem. Gates[81] showed that inverse iteration and Gram-Schmidt re-orthogonalization could be used in divide and conquer codes to compute orthogonal eigenvectors.

Several divide and conquer codes are available today. The first publicly available divide and conquer code, **TREEQL**, was written by Dongarra and Sorensen[66]. The fastest reliable serial code currently available for computing the full eigendecomposition of a tridiagonal matrix is **LAPACK**’s **DSTEDC**[147]. It is based on Cuppen’s divide and conquer[50] and uses Gu and Eisenstat’s method to maintain orthogonality.

There has long been interest in parallelizing divide and conquer codes because of the obvious parallelism involved in the early stages. There are three reasons why this technique has proven difficult to parallelize. The first is that the majority of the flops are performed at the root of the divide and conquer tree and hence the parallelism at the leaves is less valuable[36]. The second is that deflation, the property that makes DSTEDC the fastest serial code, leads to dynamic load imbalance in parallel codes. The third is the complexity of the serial code itself.

Dongarra and Sorensen’s parallel code[66], SESUPD, was written for a shared memory machine. The first parallel divide and conquer codes written for distributed memory computers used a 1D data layout (thus limiting their scalability)[99, 81]. Potter[141] has written a parallel divide and conquer for small matrices (it requires a full copy of the matrix on each node). Françoise Tisseur has written a parallel divide and conquer code for inclusion in ScaLAPACK.

**Inverse Iteration**

Inverse iteration with eigenvalue shifts is typically used to compute the eigenvectors once the eigenvalues are known[170]. Jessup and Ipsen[102] explain the use of Gram-Schmidt re-orthogonalization to ensure that the eigenvectors are orthogonal. Fann and Littlefield[75] found that inverse iteration and Gram-Schmidt can be performed in parallel, greatly improving its efficiency. Parlett and Dhillon[139, 59] are working on a method, based on work by Fernando, Parlett and Dhillon[77], that may avoid, or greatly reduce the need for re-orthogonalization.

**The Jacobi method**

The Jacobi method for the symmetric eigenproblem consists of applying a series of rotators each of which forces a single off-diagonal element to zero. Each such rotation reduces the square of the Frobenius norm of the off-diagonal elements by the square of the element which was eliminated. Hence, as long as the off-diagonal elements to be eliminated are reasonably chosen, the norm of the off-diagonal converges to zero[167].

There are several variations in the Jacobi method. Classical Jacobi[100] selects the largest off-diagonal element as the element to eliminate at each step, and hence requires the fewest steps. However, $O(n^2)$ comparisons are required at each step to select the largest element, requiring $O(n^4)$ comparisons per sweep, rendering it unattractive. Cyclic Jacobi annihilates every element once per sweep in some specified order. Threshold Jacobi differs from cyclic Jacobi in that only those elements larger than a given threshold are annihilated. Block Jacobi annihilates an entire block of elements at each step.
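A minimal serial sketch of the cyclic variant may help fix ideas. It is only an illustration: the names `jacobi_rotate`, `off_norm_sq` and `cyclic_jacobi` are hypothetical, the rotation is applied with dense matrix products for clarity (a practical code would update only rows and columns \(p\) and \(q\)), and the stopping test is an arbitrary choice.

```python
import numpy as np


def off_norm_sq(A):
    """Square of the Frobenius norm of the off-diagonal part of A."""
    return np.sum(A ** 2) - np.sum(np.diag(A) ** 2)


def jacobi_rotate(A, p, q):
    """Apply, in place, the rotation J^T A J that forces A[p, q] to zero.
    This lowers off_norm_sq(A) by 2 * A[p, q]**2, counting both the
    (p, q) and the (q, p) entries."""
    if A[p, q] == 0.0:
        return
    theta = (A[q, q] - A[p, p]) / (2.0 * A[p, q])
    sign = 1.0 if theta >= 0.0 else -1.0
    t = sign / (abs(theta) + np.hypot(1.0, theta))   # smaller root, |t| <= 1
    c = 1.0 / np.hypot(1.0, t)
    s = c * t
    J = np.eye(A.shape[0])
    J[p, p] = c
    J[q, q] = c
    J[p, q] = s
    J[q, p] = -s
    A[:] = J.T @ A @ J


def cyclic_jacobi(A, max_sweeps=20, tol=1e-24):
    """Eigenvalues of symmetric A by row-cyclic Jacobi sweeps."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for _ in range(max_sweeps):
        for p in range(n - 1):          # one sweep visits every
            for q in range(p + 1, n):   # off-diagonal element once
                jacobi_rotate(A, p, q)
        if off_norm_sq(A) <= tol * np.sum(A ** 2):
            break
    return np.sort(np.diag(A))
```

A threshold variant would skip the call to `jacobi_rotate` whenever `abs(A[p, q])` falls below the current threshold.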

Cyclic, threshold and block variants of Jacobi each have their advantages. Cyclic Jacobi is the simplest to implement. Block Jacobi requires fewer flops (and if done in parallel, fewer messages) per element annihilated. Threshold Jacobi requires fewer steps and converges more surely than cyclic Jacobi, however a parallel threshold Jacobi requires more communication. Scott et al. showed that a block threshold Jacobi method[151] is the best Jacobi method for distributed memory machines, however, it would also be the most complex to implement. Littlefield and Maschhoff[125] found that for large numbers of processors, a parallel block Jacobi beat tridiagonal based methods available at that time.

One-sided Jacobi methods apply rotations to only one side of the matrix and force the columns of the matrix to be orthogonal, hence represent scaled eigenvectors. One-sided Jacobi methods require fewer flops and may parallelize better[10, 21].

Existing parallel implementations of the Jacobi algorithm are based on a 1D data layout. Arbenz and Oettli[10] implemented a blocked one-sided Jacobi. Pourzandi and Tourancheau[142] show that overlapping communication and computation is effective in a Jacobi implementation on the i860 based NCUBE. Although a 1D data layout is not scalable, the huge computation to communication ratio in the Jacobi algorithm hides this on all machines available today.

There are two publicly available parallel Jacobi codes. Fernando wrote a parallel Jacobi code for NAG[87]. O’Neal and Reddy[133] wrote a parallel Jacobi, PJAC, for the Pittsburgh Supercomputing Center.

Demmel and Veselić[58] prove that on scaled diagonally dominant matrices, Jacobi can compute small eigenvalues with high relative accuracy while tridiagonal based methods cannot. Demmel et al.[56] give a comprehensive discussion of the situations in which Jacobi is more accurate than other available algorithms.

The Jacobi method is discussed in Section 7.3.

**2.7.3 Matrix-matrix multiply based methods**

There are several methods for solving the symmetric eigenproblem which can be made to use only matrix-matrix multiply.

Matrix-matrix based methods are attractive because they can be performed efficiently on all computers, and they scale well. However, they require many more flops (typically 6 - 60 times more) than reduction to tridiagonal form, tridiagonal eigensolution and back transformation. Hence, these methods only make sense if tridiagonal based methods cannot be performed efficiently or do not yield answers that are sufficiently accurate.

**Invariant Subspace Decomposition Algorithm**

The Invariant Subspace Decomposition Algorithm[97], ISDA, for solving the symmetric eigenproblem involves recursively decoupling the matrix $A$ into two smaller matrices. Each decoupling is achieved by applying an orthogonal similarity transformation, $Q^T A Q$, such that the first columns of $Q$ span an invariant subspace of $A$. Such a $Q$ is found by computing a polynomial function of $A$, $A' = p(A)$ which maps all the eigenvalues of $A$ nearly to 0 or 1, and then taking the QR decomposition of $p(A)$. One such polynomial can be computed by first shifting and scaling $A$ so that all its eigenvalues are known to be between 0 and 1 (by Gershgorin's theorem) and then repeatedly computing the beta function, $A_{i+1} = 3A_i^2 - 2A_i^3$, until all of the eigenvalues of $A_i$ are effectively either 0 or 1. (All of the eigenvalues of $A_0$ that are less than 0.5 are mapped to 0, all the eigenvalues of $A_0$ that are greater than 0.5 are mapped to 1.)
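One decoupling step as described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the PRISM implementation: the name `isda_split` is hypothetical, the iteration count and convergence test are arbitrary, and an unpivoted QR is used, which only works when the leading columns of the computed projector are linearly independent (a rank-revealing QR would be needed in general). The sketch also assumes that no eigenvalue sits at the midpoint of the Gershgorin interval.

```python
import numpy as np


def isda_split(A, max_iters=60):
    """One ISDA decoupling step for a symmetric matrix A.
    Returns (Q^T A Q, k), where the leading k-by-k block carries the
    eigenvalues above the midpoint of the Gershgorin interval."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Shift and scale so every eigenvalue lies in [0, 1] (Gershgorin).
    radius = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    lo = np.min(np.diag(A) - radius)
    hi = np.max(np.diag(A) + radius)
    B = (A - lo * np.eye(n)) / (hi - lo)
    # Repeated application of B <- 3 B^2 - 2 B^3 drives eigenvalues
    # below 0.5 to 0 and eigenvalues above 0.5 to 1.
    for _ in range(max_iters):
        B2 = B @ B
        B_next = 3.0 * B2 - 2.0 * (B2 @ B)
        converged = np.linalg.norm(B_next - B) <= 1e-14 * n
        B = B_next
        if converged:
            break
    k = int(round(np.trace(B)))   # number of eigenvalues mapped to 1
    Q, _ = np.linalg.qr(B)        # first k columns span the invariant subspace
    return Q.T @ A @ Q, k
```

Every operation in the loop is a matrix-matrix multiply, which is the source of both the method's good parallel behavior and its high flop count.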

The ISDA parallelizes well because each of the tasks involved performs well in parallel[97]. Unfortunately, the ISDA requires far more floating point operations (roughly 100 $n^3$) than eigensolvers that are based on reducing the matrix first to tridiagonal form (which require $8n^3 + O(n^2)$ or fewer flops).

Applying the ISDA for banded matrices greatly reduces the flop count[26]. Furthermore, the banded matrix multiplications can still be performed efficiently, and the bandwidth does not triple with each application of $A_{i+1} = 3A_i^2 - 2A_i^3$ as one would expect with random banded matrices. Nonetheless, the bandwidth does grow enough to necessitate several band reductions, each of which requires a corresponding back transformation step.

A publicly available code based on the ISDA is available from the PRISM group[28].

The ISDA applied directly to the full matrix requires roughly $100n^3$ flops, or 30 times as many as tridiagonal reduction based methods, and hence will never be as fast. Banded ISDA is almost a tridiagonal based method, but is not likely to be the fastest method. The quickest way to compute eigenvalues from a banded matrix is to reduce the matrix first to tridiagonal form. And, if eigenvectors are required, banded ISDA will require at least twice and probably three times as many flops in back transformation.

**FFT based invariant subspace decomposition**

Yau and Lu[174] implemented an FFT based invariant subspace decomposition method. This method requires $O(\log(n))$ matrix multiplications. Tisseur and Domas[60] have written a parallel implementation of the Yau and Lu method.

FFT based invariant subspace decomposition, like ISDA applied to dense matrices, requires roughly $100n^3$ flops. Hence, like ISDA, it will never be as fast as tridiagonal reduction based methods.

**Strassen’s matrix multiply**

Strassen’s matrix-matrix multiply[157] can decrease the execution time for very large matrix-matrix multiplies by up to 20% but will not make ISDA competitive. Several implementations of Strassen’s matrix multiply have been able to demonstrate performance superior to conventional matrix-matrix multiply[96][43]. However, Strassen’s method is only useful when performing matrix-matrix multiplies in which all three matrices are very large, and Strassen’s flop count advantage grows very slowly as the matrix size grows. In order to double Strassen’s flop count advantage, the matrices being multiplied must be sixteen times as large and hence memory usage must increase a thousand fold.

**2.7.4 Orthogonality**

Some methods, notably inverse iteration, require extra care to ensure that the eigenvectors are orthogonal. In exact arithmetic, if two eigenvalues differ, their corresponding eigenvectors will be orthogonal. However, if the input matrix has, say, a double eigenvalue, the eigenvectors corresponding to this double eigenvalue span a two-dimensional subspace and hence there is no guarantee that two eigenvectors chosen at random from this space will be orthogonal. In floating point arithmetic, inverse iteration without re-orthogonalization may not produce orthogonal eigenvectors when two or more eigenvalues are nearly identical. In DSTEIN, LAPACK's inverse iteration code, when computing the eigenvectors for a cluster of eigenvalues, modified Gram-Schmidt re-orthogonalization is employed after each iteration to re-orthogonalize the iterate against all of the other eigenvectors in the cluster[102]. Modified Gram-Schmidt re-orthogonalization parallelizes poorly because it is a series of dot products and DAXPYs, each of which depends upon the result of the immediately preceding operation. PeIGs[74] and PDSYEVX[68] have chosen different responses to the fact that the re-orthogonalization in DSYEVX parallelizes poorly.
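The serial dependence in modified Gram-Schmidt is easy to see in a short sketch. The function name `mgs_orthogonalize` is hypothetical, and the code is not taken from DSTEIN; it only shows the chain of dot products and vector updates.

```python
import numpy as np


def mgs_orthogonalize(z, basis):
    """Orthogonalize the iterate z against the orthonormal vectors in
    basis using modified Gram-Schmidt, then normalize the result."""
    z = np.array(z, dtype=float)
    for v in basis:
        coeff = v @ z       # dot product that needs the already updated z
        z -= coeff * v      # DAXPY-like update that the next dot product needs
    return z / np.linalg.norm(z)
```

Because every dot product reads the vector produced by the previous update, the loop over `basis` cannot be reordered or executed concurrently without changing the algorithm.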

PeIGs alternates inverse iteration and re-orthogonalization in a different manner than DSYEVX. Instead of computing one eigenvector at a time, all of the eigenvectors within a cluster are computed simultaneously. For each cluster, PeIGs first performs a round of inverse iteration without re-orthogonalization using random starting vectors. Then, PeIGs performs modified Gram-Schmidt re-orthogonalization twice to orthogonalize the eigenvectors. PeIGs performs a second round of inverse iteration without re-orthogonalization, using the output from the previous step as the starting vectors, and again repeating until sufficient accuracy is obtained for each eigenvector. Finally, PeIGs performs modified Gram-Schmidt re-orthogonalization one last time. They have shown that this method works on application matrices with large clusters of eigenvalues.

PDSYEVX attempts to assign the computation of all eigenvectors associated with each cluster of eigenvalues to a single processor. When enough space is available to accomplish this, PDSYEVX produces exactly the same results as DSYEVX. When the user does not provide enough local workspace, PDSYEVX relaxes the definition of cluster repeatedly until it can assign the computation of all eigenvectors associated with each cluster of eigenvalues to a single processor.

When the input matrix contains one or more very large clusters of eigenvalues, PDSYEVX performs poorly: If enough workspace is available, PDSYEVX gives the same results as DSYEVX, but runs very slowly. If insufficient workspace is available, PDSYEVX does not guarantee orthogonality. Dhillon explains the fundamental problems in inverse iteration[59].

Recently Parlett and Dhillon have identified new techniques for computing the eigenvectors of a symmetric tridiagonal matrix[136, 139]. These new results raise the hope that we will soon have an $O(n^2)$ method for computing the eigenvectors of a symmetric tridiagonal matrix which parallelizes well and avoids the problems with computing the eigenvectors associated with clustered eigenvalues. ScaLAPACK looks forward to applying these new techniques in a future release.
Chapter 3

Basic Linear Algebra Subroutines

3.1 BLAS design and implementation

The BLAS [117, 63, 62], Basic Linear Algebra Subroutines, were designed to allow portable codes most of whose operations are matrix-matrix multiplications, matrix-vector multiplications, and related linear algebra operations to achieve high performance provided that the BLAS achieve high performance. In LAPACK [4], the BLAS were used to re-express the linear algebra algorithms in the previous libraries LINPACK [61] and EISPACK [153], thereby achieving performance portability.

The BLAS routines are split into three sets. BLAS Level 1 routines involve only vectors, require $O(n)$ flops (on input vectors of length $n$) and two or three memory operations for every two flops performed. BLAS Level 2 routines involve one $n$ by $n$ matrix, $O(n^2)$ flops and one or two memory operations for every two flops performed (rectangular matrices are also supported). BLAS Level 3 routines involve only matrices, $O(n^3)$ flops and $O(n^2)$ memory operations. BLAS Level 1, because they involve only $O(n)$ operations per invocation, have the least flexibility in how the operations are ordered, and require the most memory operations per flop. Hence, BLAS Level 1 routines have the lowest peak floating point operation rate. They also have the lowest software overhead - an important consideration because they perform few operations. BLAS Level 3 routines have the most flexibility in how the operations are ordered and require the fewest memory operations per flop and hence achieve the highest performance on large tasks. BLAS Level 1 and 2 routines are typically limited by the speed of memory. BLAS Level 3 routines typically execute very near the peak speed of the floating point unit.

Typical hardware architectures make it possible, but not easy, to achieve high floating point execution rates for matrix-matrix multiply. Floating point units can initiate floating point operations every 2 to 5 nanoseconds though floating point operations take 10 to 30 nanoseconds to complete and main memory requires 20 to 60 nanoseconds per random data fetch. Floating point units achieve high throughput through concurrency, allowing multiple operations to be performed simultaneously, and pipelining, starting operations before the previous operation is complete. Register files are made large enough to provide source and target registers for as many operations as can be active at one time. Main memory throughput can be enhanced by interleaving memory banks and by fetching several words simultaneously (or nearly so) from main memory. Memory performance is further enhanced by the use of caches. Two levels of caches are now typical and systems are now being designed with three levels of caches.

High performance BLAS routines typically incur significant software overhead because, to achieve near the floating point unit's peak performance, they need an inner loop that can keep the floating point units busy, surrounded by one or more levels of blocking to keep the memory accesses in the fastest memory possible. Managing concurrency and/or pipelining requires a long inner loop which operates on several vectors at once. Each level of blocking requires additional control code and separate loops to handle portions of the matrix that are not exact multiples of the block size. For example, DGEMV\(^1\) (double precision matrix-vector multiplication) on the PARAGON has an average software overhead of 23 microseconds (over 1000 cycles at 50 MHz) and includes 200 instructions of error checking and case selection, 750 instructions for the transpose case and 500 for the non-transpose case\(^2\).

**3.2 BLAS execution time**

The execution time for each call to a BLAS routine depends upon the hardware, the BLAS implementation, the operation requested and the state of the machine, especially the contents of the caches, at the time of the call. The time per DGEMV, or BLAS Level 2, flop is limited by the speed of the memory hierarchy level at which the matrix resides. The time per DGEMM\(^3\), or BLAS Level 3, flop is typically limited primarily by the rate at which the floating point unit can issue and complete instructions. We will concentrate on DGEMM and DGEMV because they perform most of the flops in PDSYEVX.

\(^1\)DGEMV performs \(y = \alpha A x + \beta y\) or \(y = \alpha A^T x + \beta y\), where \(A\) is a matrix, \(x\) and \(y\) are vectors and \(\alpha\) and \(\beta\) are scalars.

\(^2\)These instruction counts include all instructions routinely executed during the main loop in reduction to tridiagonal form. Not all are executed during each call to DGEMV.

Table 3.1: BLAS execution time (Time = δi + number of flops · γi in microseconds). The values in parentheses are the flop rates, in Mflops/sec, corresponding to γi.

<table>
<thead>
<tr>
<th rowspan="2">Machine and BLAS library</th>
<th rowspan="2">peak flop rate (Mflops/sec)</th>
<th colspan="2">BLAS Level 3</th>
<th colspan="2">BLAS Level 2</th>
<th colspan="2">BLAS Level 1</th>
</tr>
<tr>
<th>software overhead δ3</th>
<th>time per flop γ3</th>
<th>software overhead δ2</th>
<th>time per flop γ2</th>
<th>software overhead δ1</th>
<th>time per flop γ1</th>
</tr>
</thead>
<tbody>
<tr>
<td>PARAGON, Basic Math Library Software (Release 5.0)</td>
<td>50</td>
<td>300</td>
<td>.024 (41)</td>
<td>87</td>
<td>.026 (38)</td>
<td>3</td>
<td>.10</td>
</tr>
<tr>
<td>IBM SP2, ESSL 2.2.2.2</td>
<td>480</td>
<td>0</td>
<td>.0037 (270)</td>
<td>5.1</td>
<td>0.0055 (180)</td>
<td>1.2</td>
<td>.01 (100)</td>
</tr>
</tbody>
</table>

Table 3.1 shows the software overhead and time per flop for the BLAS routines. These times are based on independent timings with code cached but not data cached using invocations that are typical for PDSYEVX. Recall that these parameters are used in a linear model of performance:

\[
\delta + \text{number of flops} \times \gamma
\]

(Line 1)

In PDSYEVX we are most concerned with the time per flop for Level 3 routines and secondarily concerned with the time per flop and software overhead for Level 2 routines. For \(n=3840\) and \(p=64\), on the Paragon, the three largest components attributable to the items in Table 3.1 are: 28% of the PDSYEVX execution time is attributable to BLAS Level 3 floating point execution (not including software overhead), 8% is attributable to BLAS Level 2 floating point execution and 5% is attributable to BLAS Level 2 software overhead. (See Chapter 5 for details.) The fact that the BLAS3 software overhead for the IBM SP2 is listed as 0 stems from the fact that matrix-matrix multiply is faster for small problem sizes because they fit in cache\(^4\).

---

\(^3\)DGEMM performs \(C = \alpha AB + \beta C\) or \(C = \alpha A^T B + \beta C\), where \(A\), \(B\) and \(C\) are matrices, and \(\alpha\) and \(\beta\) are scalars.
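The linear model can be evaluated directly. The sketch below is only an illustration: the dictionary `BLAS_MODEL` holds the δ and γ values as we read them from Table 3.1, and the names `blas_time` and `dgemv_flops` are hypothetical. The flop count \(2mn\) for DGEMV is the usual leading-order estimate and ignores lower-order terms.

```python
# (delta in microseconds, gamma in microseconds per flop), keyed by
# (machine, BLAS level); values as read from Table 3.1.
BLAS_MODEL = {
    ("PARAGON", 3): (300.0, 0.024),
    ("PARAGON", 2): (87.0, 0.026),
    ("PARAGON", 1): (3.0, 0.10),
    ("IBM SP2", 3): (0.0, 0.0037),
    ("IBM SP2", 2): (5.1, 0.0055),
    ("IBM SP2", 1): (1.2, 0.01),
}


def blas_time(machine, level, flops):
    """Predicted time in microseconds: delta + flops * gamma."""
    delta, gamma = BLAS_MODEL[(machine, level)]
    return delta + flops * gamma


def dgemv_flops(m, n):
    """Leading-order flop count of a DGEMV call on an m-by-n matrix."""
    return 2 * m * n


def overhead_fraction(machine, level, flops):
    """Share of the predicted time that is software overhead."""
    delta, _ = BLAS_MODEL[(machine, level)]
    return delta / blas_time(machine, level, flops)
```

For example, `overhead_fraction("PARAGON", 2, dgemv_flops(8, 8))` shows how the overhead term dominates the model for small DGEMV calls, while for large calls the `flops * gamma` term takes over.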

Figure 3.1: Performance of DGEMV on the Intel PARAGON

Figure 3.1 shows how actual DGEMV performance on the PARAGON differs from the performance predicted by the linear model (Line 1). Each point represents the time required for a call to DGEMV, with parameters that are typical of calls to DGEMV made in PDSYEVX, divided by the time predicted by our performance model. The timings are made by an independent timer as described in Section 3.3. The model matches quite well on most calls to DGEMV. The figure also shows a modest, but noticeable, difference between the cost when data is cached versus when it is not. If the software overhead term were removed (i.e. using number of flops $\times \gamma_2$ as the model) the model would underestimate execution time by a factor of two hundred or more on small problem sizes.

\(^4\)We did not pursue this because BLAS3 software overhead has little impact on PDSYEVX execution time.

Some calls to DGEMV require much less time than expected, as little as $1/9$ of the predicted time, indicating that the software overhead is not independent of the type of call made. In particular, the execution time of calls which involve very few flops can vary widely relative to the predicted time. However, not many calls differ widely from the prediction, and those that do require few flops (hence little execution time). The fact that they do not match well does not significantly affect the accuracy of my performance model for PDSYEVX (given in Chapter 4), and hence I did not study them further.

Figure 3.2 shows that DGEMV on the PARAGON requires 10 to 50 microseconds longer if the code is not cached at the time it is called. The additional time required is estimated by subtracting the cost of running DGEMV alone from the cost of running DGEMV followed by 16,384 no-ops\(^5\) while accounting for the execution time of the 16,384 no-ops themselves. The extra time required increases as the number of flops increases. And the extra time is greater when the data is not cached than when it is cached\(^6\). It is not surprising that the extra time required when the code is not cached increases as the number of flops increases because when few flops are involved, the code does not execute as many loops. However, it is surprising that the code cache miss cost in the “Data not cached” case appears to increase almost linearly with the number of flops; I would expect to see something closer to a step function. This deserves further study if it is determined that code cache misses substantially affect execution time.

Figure 3.3 shows that the extra time required by DGEMV ranges from 1.5% (when DGEMV performs many flops) to over 10% (when DGEMV performs few flops). Only calls made to DGEMV with parameters that are typical of the calls commonly made by PDSYEVX are shown. The extra time required when code is not cached can be up to 80% on calls made to DGEMV requiring very few flops, but these are rare in PDSYEVX.

---

\(^5\)The code cache holds 8,192 no-ops. Hence, 16,384 guarantees that the no-ops are not in cache, making their execution time independent of what is in the code cache at the time the 16,384 no-ops are executed.

\(^6\)I compare the execution time when neither code nor data is cached to the execution time when code is cached but data is not when estimating the extra time required when data is not cached.

Figure 3.2: Additional execution time required for DGEMV when the code cache is flushed between each call. The y-axis shows the difference between the time required for a run which consists of one loop executing 16,384 no-ops after each call to DGEMV and the time required for a run which includes two loops, one executing DGEMV and one executing 16,384 no-ops.

3.3 Timing methodology

Each routine is timed with several sets of input parameters. To time a routine with a given set of input parameters, the routine is run three times and the time from the third run is used. Each run consists of calling the routine to be timed repeatedly within a loop. The first run, in which the loop is run only once, ensures that the code is paged in. The second run, in which the loop is run just long enough to exceed the timer resolution, provides an estimate that is used to determine how many times to run the third run. The third run, in which the loop is run for approximately one second, is the only one whose execution time is recorded. We record both CPU time and wall clock time. These plots are

Figure 3.3: Additional execution time required for DGEMV when the code cache is flushed between each call as a percentage of the time required when the code is cached. See Figure 3.2.

The input parameters for each run are randomly selected such that they match the input parameters made in a typical call to DGEMV from PDSYEVX. Randomly selecting the input parameters provides advantages over a systematic choice of input parameters. A systematic choice of input parameters might include, for example, only even values of $k$ whereas odd values of $k$ might require significantly longer. Random selection means that the likelihood of identifying anomalous behavior is directly related to how often that behavior occurs in calls within PDSYEVX. Random selection scales well also: It is easy to increase or decrease the number of timings and/or the number of processors used.
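The three-run procedure of this section can be sketched as follows. This is a minimal illustration rather than the timer used for the measurements reported here: the function name `time_routine`, the doubling strategy in the second run and the default values of `target_seconds` and `timer_resolution` are all assumptions.

```python
import time


def time_routine(routine, target_seconds=1.0, timer_resolution=1e-3):
    """Time routine() with three runs; only the third run is reported.
    Returns (wall seconds per call, CPU seconds per call, calls in run 3)."""
    # Run 1: a single call, so that the code is paged in.
    routine()

    # Run 2: repeat just long enough to exceed the timer resolution,
    # giving a rough per-call estimate.
    reps = 1
    while True:
        start = time.perf_counter()
        for _ in range(reps):
            routine()
        elapsed = time.perf_counter() - start
        if elapsed > timer_resolution:
            break
        reps *= 2
    per_call = elapsed / reps

    # Run 3: loop for roughly target_seconds, recording both wall clock
    # time and CPU time.
    reps3 = max(1, int(target_seconds / per_call))
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(reps3):
        routine()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall / reps3, cpu / reps3, reps3
```

In a study like this one, `routine` would be a closure around a single BLAS call whose arguments were drawn at random from the distribution of calls observed in PDSYEVX.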

![DGEMV code cache miss cost on XPS5](image_url)
3.4 The cost of code and data cache misses in DGEMV

Each set of input parameters is timed under four different cache situations:

- Code and data cached
- Code cached but data not cached
- Code not cached but data cached
- Neither code nor data cached

Data can be allowed to remain in cache (to the extent that it fits in cache) by using the same arrays in each call within the timing loop. Likewise data can be prevented from remaining in cache by using different arrays for each call within the timing loop.
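A sketch of how the operands for a timing loop might be chosen in the two data situations. The helper names, the pool of arrays cycled through in the uncached case, and the cache size are assumptions for illustration; Python lists also do not have the contiguous layout of Fortran arrays, so only the selection logic carries over.

```python
def operand_pool(n_elems, data_cached, cache_bytes=1 << 20, bytes_per_elem=8):
    """Build the arrays that successive calls in the timing loop will use."""
    if data_cached:
        # One array reused by every call, so it can stay in cache.
        count = 1
    else:
        # Enough distinct arrays that their total size exceeds twice the
        # (assumed) cache size, so no call finds its data still cached.
        count = max(2, -(-(2 * cache_bytes) // (n_elems * bytes_per_elem)))
    return [[0.0] * n_elems for _ in range(count)]

def operand_for_call(pool, call_index):
    """Select the array used by the given call, cycling through the pool."""
    return pool[call_index % len(pool)]
```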

Allowing data to reside in cache reduces execution time in two ways. It reduces the cost of accessing the data in the arrays being operated on and it reduces the software overhead cost, because software overhead also involves reading and writing data, notably while saving and restoring registers.

Code and data cache misses are more important in DGEMV than in DGEMM for two reasons: DGEMV is called more often than DGEMM, and the ratio of flops to data movement is higher for DGEMM than for DGEMV, which reduces the relative cost of data cache misses in DGEMM.

3.5 Miscellaneous timing details

We make sure that timings are not affected by conditions which are not likely to be encountered in a typical run of PDSYEVX. Exceptional numbers (subnormalized numbers and infinities) will occur only rarely in PDSYEVX. Hence, we make sure that exceptional numbers do not appear during our timing runs.

We do not time PDSYEVX on problem sizes that do not fit in physical memory. Hence, when timing the individual BLAS routines, we make sure that the arrays fit in physical memory. Ed D’Azevedo has written an out-of-core symmetric eigensolver and studied the effect of paging on PDSYEVX[52].

7 The matrix is scaled before reduction to tridiagonal form to avoid being close to the overflow or underflow threshold. Although this does not prevent underflows (or subnormalized numbers), it causes them to be rare. NaNs will never appear in PDSYEVX unless NaNs appear in the input.

We measure and report both wall clock time and CPU time. Wall clock time may differ from CPU time for several reasons, including time spent waiting for communication, time spent on other processes, and time spent on paging and other operating system services. When timing the BLAS, we are primarily interested in CPU time because there is no communication and we are not interested in measuring the time spent waiting on other processes. However, we measure and report wall clock time because for all other timings we must rely on wall clock timings. When the wall clock time differs substantially from the CPU time on calls to the BLAS on time-shared systems (such as the IBM SP2), we use the ratio of wall clock time to CPU time as a crude measure of the load on the system.

We use the timing routines included in the BLACS routines developed at the University of Tennessee at Knoxville [169, 69] (which are not a part of the BLACS specification). Many modern computers have cycle time counters which would allow much more detailed measurement of execution time and often other machine characteristics. These detailed timing routines are not portable and I chose to stick to portable timing techniques. Alternatively, Krste Asanovic has developed a portable interface for taking performance-related statistics over an "interval" of a code's execution [11].

---

8 CPU time is often meaningless when communication is involved.
Chapter 4

Details of the execution time of PDSYEVX

4.1 High level overview of PDSYEVX algorithm

Figure 4.1 shows how PDSYEVX reduces the original (dense) matrix to tridiagonal form (Line 1), uses bisection and inverse iteration to solve the tridiagonal eigenproblem (Line 2) and then transforms the eigenvectors of the tridiagonal matrix back into the eigenvectors of the original dense matrix (Line 3). PDSYEVX uses a two-dimensional block cyclic data layout with an algorithmic block size equal to the data layout block size in both Householder reduction to tridiagonal form and back transformation. When using bisection to compute the eigenvalues, it assigns each process an essentially equal number of eigenvalues to compute. For inverse iteration, PDSYEVX attempts to assign roughly equal numbers of eigenvectors to each process while assigning all eigenvectors corresponding to a given cluster of eigenvalues to the same process. Gram-Schmidt re-orthogonalization is performed locally within each process and hence orthogonality is not guaranteed for eigenvectors corresponding to eigenvalues within a cluster that is too large to fit on a single process.

We assume that only the lower triangle of the square symmetric matrix $A$ contains valid data on input and the algorithms only read and write this lower triangle. The general conclusions of this thesis apply to the upper triangular case as well.

Please refer to Table A.1, Table A.2, and Table A in Appendix A for the list of notation used in this chapter.

Figure 4.1: PDSYEVX algorithm

\[ A = Q T Q^T \]

\( A \) is the matrix whose eigendecomposition we seek.  
\( T \) is tridiagonal.  
\( Q \) is orthogonal.  

\[ T = U \Lambda U^T \]

\( \Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n) \) is the diagonal matrix of eigenvalues.  
The columns of \( U = [u_1 \ldots u_n] \) are the eigenvectors of \( T \).  
\( T u_i = \lambda_i u_i \)  

\[ V = Q U \]

The columns of \( V = [v_1 \ldots v_n] \) are the eigenvectors of \( A \).  
\( Av_i = \lambda_i v_i \)  

Section 4.2 describes and models reduction to tridiagonal form as performed by PDSYTRD. Section 4.3 describes and models the tridiagonal eigensolution as performed by PDSTEBZ (bisection) and PDSTEIN (inverse iteration). Section 4.4 describes and models back transformation as performed by PDORMTR.

4.2 Reduction to tridiagonal form

4.2.1 Householder’s algorithm

Figure 4.4 shows Householder's reduction to tridiagonal form, and Figure 4.5 shows a model for the runtime of ScaLAPACK's reduction to tridiagonal form code, PDSYTRD. The rest of this section explains the computation and communication pattern in PDSYTRD. We begin by describing the classical (serial and unblocked) algorithm (essentially the EISPACK algorithm TRED1 and also LAPACK's DSYTD2), then the blocked (but still serial) algorithm (essentially the LAPACK algorithm DSYTRD) and finally the parallel blocked ScaLAPACK algorithm PDSYTRD.
Classical (serial and unblocked) Householder reduction (Figure 4.2)

Figure 4.2 shows the algorithm for the classical (serial and unblocked) Householder reduction to tridiagonal form (essentially the algorithm used in LAPACK's DSYTD2).

The first iteration through the loop performs an orthogonal similarity transformation of the form \( A \leftarrow (I - \tau vv^T)A(I - \tau vv^T) \), where \( \tau = 2/\| v \|_2^2 \), such that only the first two elements in the first column (and hence the first two elements in the first row) of \( A \) are non-zero. Each subsequent iteration through the loop repeats these steps on the trailing submatrix \( A(2:n, 2:n) \) to reduce \( A \) to tridiagonal form by a series of similarity transformations.

*Compute an appropriate reflector* (Line 2.1 in Figure 4.2)

We seek a reflector of the form \( I - \tau vv^T \) such that \( \tau = 2/\| v \|_2^2 \) and the first row and column of \( (I - \tau vv^T)A(I - \tau vv^T) \) have zeros in all entries except the first two.

Let \( z \) be the column vector \( A(2:n, 1) \). In exact arithmetic, any vector \( v = c[z_1 \pm \| z \|_2, z_2 \ldots z_n] \) for any scalar \( c \) will suffice, and defines what value \( \tau \) must take. LAPACK and ScaLAPACK choose the sign \( (\pm \| z \|_2) \) to match the sign of \( z_1 \) to minimize roundoff errors, and choose \( c \) such that \( v(1) = 1.0 \). \( c \) can also be chosen to be 1, avoiding the need to multiply \( z \) by \( c \), at some small risk of overflow.
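The reflector computation can be sketched as below. This is an illustrative stand-in for `house()` in Figure 4.2 rather than the LAPACK routine itself: the function names, the extra return value `beta`, and the handling of a zero vector are assumptions of the sketch. It uses the sign choice and the \( v(1) = 1.0 \) scaling described above.

```python
import math

def house(z):
    """Return (tau, v, beta) with v[0] == 1 so that
    (I - tau v v^T) z = [beta, 0, ..., 0]."""
    norm = math.sqrt(sum(x * x for x in z))
    if norm == 0.0:
        # Nothing to annihilate; tau = 0 makes the reflector the identity.
        return 0.0, [1.0] + [0.0] * (len(z) - 1), 0.0
    # The sign matches z[0], so z[0] + s adds magnitudes (no cancellation).
    s = math.copysign(norm, z[0])
    pivot = z[0] + s
    # Scaling by c = 1/pivot gives v[0] == 1.
    v = [1.0] + [x / pivot for x in z[1:]]
    tau = 2.0 / sum(x * x for x in v)
    return tau, v, -s

def apply_reflector(tau, v, x):
    """Compute (I - tau v v^T) x."""
    d = sum(a * b for a, b in zip(v, x))
    return [xi - tau * d * vi for xi, vi in zip(x, v)]
```

Applying the reflector returned for a vector `z` to `z` itself leaves only the first entry non-zero, which is the property required of Line 2.1.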

*Form the matrix vector product* \( y = Av \) (Line 3.3 in Figure 4.2)

This is a matrix vector multiply (a Level 2 BLAS, i.e. Basic Linear Algebra Subprograms, operation) requiring \( 2(n-i)^2 \) flops, which when summed from \( i = 1 \) to \( n-1 \) totals \( \frac{2}{3}n^3 \) flops to leading order.
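For reference, the sum behind the \( \frac{2}{3}n^3 \) figure can be written out with the substitution \( k = n - i \); only the leading term is kept in the text:

\[
\sum_{i=1}^{n-1} 2(n-i)^2 = 2\sum_{k=1}^{n-1} k^2 = \frac{(n-1)\,n\,(2n-1)}{3} = \frac{2}{3}n^3 - n^2 + \frac{n}{3}.
\]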

*Compute the companion update vector* \( w = y - \frac{1}{2}(\tau(y^Tv)v) \) (Line 5.1 in Figure 4.2)

The vector \( w \) (which is computed here with a dot product and a DAXPY) has the property that \( (I - \tau vv^T)A(I - \tau vv^T) = A - vw^T - wv^T \).
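The rank-2 property can be checked numerically with a small sketch. The scaling of \( w \) used here follows the convention of LAPACK's DSYTD2, namely \( x = \tau A v \) followed by \( w = x - \frac{1}{2}\tau (x^T v) v \); treating that as the intended normalization is an assumption of this example, and all function names are illustrative.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def companion_vector(A, v, tau):
    """w = x - (tau/2)(x^T v) v with x = tau * A v (DSYTD2 scaling)."""
    x = [tau * yi for yi in matvec(A, v)]
    alpha = -0.5 * tau * dot(x, v)
    return [xi + alpha * vi for xi, vi in zip(x, v)]

def two_sided(A, v, tau):
    """Form (I - tau v v^T) A (I - tau v v^T) explicitly."""
    n = len(v)
    H = [[(1.0 if i == j else 0.0) - tau * v[i] * v[j] for j in range(n)]
         for i in range(n)]
    return matmul(matmul(H, A), H)

def rank2_update(A, v, w):
    """Form A - v w^T - w v^T."""
    n = len(v)
    return [[A[i][j] - v[i] * w[j] - w[i] * v[j] for j in range(n)]
            for i in range(n)]
```

For a symmetric `A`, `two_sided(A, v, tau)` and `rank2_update(A, v, companion_vector(A, v, tau))` agree to rounding error, which is why the two-sided transformation can be applied as a single rank-2 update.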

*Update the matrix* (Line 6.3 in Figure 4.2)

Compute \( A = A - vw^T - wv^T \), a BLAS Level 2 rank-2 update. A rank-2 update requires 4 flops per element updated. Because only the lower triangular portion of \( A \) is updated, this requires \( 2(n-i)^2 \) flops, which summed over \( i = 1 \) to \( n-1 \) is \( \frac{2}{3}n^3 \) flops.
Figure 4.2: Classical unblocked, serial reduction to tridiagonal form, i.e. EISPACK's TRED1 (The line numbers are consistent with figures 4.3, 4.4 and 4.5.)

do $i = 1, n$

**Compute reflector**

2.1 $[\tau, v] = \text{house}(A(i+1:n, i))$

**Perform matrix-vector multiply**

3.3 $w = \text{tril}(A(i+1:n, i+1:n))v + \text{tril}(A(i+1:n, i+1:n), -1)^T v$

**Compute companion update vector**

5.1 $c = w \cdot v^T$; $w = \tau w - (c \tau / 2)v$

**Perform rank 2 update**

6.3 $A(i+1:n, i+1:n) = \text{tril}(A(i+1:n, i+1:n) - wv^T - vw^T)$

end do $i = 1, n$

**Blocked Householder reduction to tridiagonal form (Figure 4.3)**

In the above algorithm, nearly all the flops are performed in the product $y = Av$ or the rank-2 update $A - vw^T - wv^T$, both of which are BLAS Level 2 operations. Through blocking, half of the flops can be executed as BLAS 3 flops because $k$ matrix updates can be performed as one rank-$2k$ update instead of $k$ rank-2 updates. This is done in Line 6.3 in Figure 4.3. The cost of blocking is significant in PDSYTRD, but so is the gain (see section 7.2.2). Blocking allows the matrix update to be considerably more efficient, but it complicates the computation of the reflector and the computation of the companion update vector, because PDSYTRD must work with an out-of-date matrix. Starting with $A_0$, the computation of the first reflector $v_0$, the matrix vector product and $w_0$ are unchanged, but as soon as PDSYTRD attempts to compute the second reflector, $v_1$, it has to deal with the fact that $A_1$ is known only in factored form, i.e. $A_1 = A_0 - v_0w_0^T - w_0v_0^T$. This does not greatly complicate computing the reflector because the reflector needs only the first column of $A_1$.
do \( ii = 1, n, \text{nb} \)

\( mxi = \min(ii + \text{nb}, n) \)

do \( i = ii, mxi \)

**Update current (\( i^{th} \)) column of \( A \)**

1.2 \( A(:, i) = A(:, i) - W(:, ii:i-1) V(i, ii:i-1)^T - V(:, ii:i-1) W(i, ii:i-1)^T \)

**Compute reflector**

2.1 \( [\tau, v] = \text{house}(A(i+1:n, i)) \); \( v \in R^{n-i} \); \( \tau \) is a scalar

**Perform matrix-vector multiply**

3.3 \( w = \text{tril}(A(i+1:n, i+1:n))v + \text{tril}(A(i+1:n, i+1:n), -1)^T v \)

**Update the matrix-vector product**

4.1 \( w = w - W(:, ii:i-1) V(i, i+1:n)^T v - V(:, ii:i-1) W(i, i+1:n)^T v \)

**Compute companion update vector**

5.1 \( c = w \cdot v^T \); \( w = \tau w - (c \tau / 2) v \)

\( W(i+1:n, i) = w \); \( V(i+1:n, i) = v \)

end do \( i = ii, mxi \)

**Perform rank 2k update**

6.3 \( A(mxi+1:n, mxi+1:n) = \text{tril}(A(mxi+1:n, mxi+1:n) - W(mxi+1:n, ii:mxi)V(mxi+1:n, ii:mxi)^T - V(mxi+1:n, ii:mxi)W(mxi+1:n, ii:mxi)^T) \)

end do \( ii = 1, n, \text{nb} \)

However, computing \( w_1 \) requires the computation of \( A_1 v_1 \), hence we must either update the entire matrix \( A_1 \), returning to an unblocked code, or compute \( y = (A_0 - v_0 w_0^T - w_0 v_0^T) v_1 \).
Computing the reflectors and the companion update vectors now requires that the current column be updated (Line 1.2 in Figure 4.3). The matrix vector product must be updated (Line 4.1 in Figure 4.3).
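The update of the matrix vector product can be illustrated with a small serial sketch: the product with the not-yet-updated matrix is corrected using the stored columns of \( V \) and \( W \), so the trailing matrix never has to be formed inside the block. The function names and the representation of \( V \) and \( W \) as lists of column vectors are assumptions of the example.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def deferred_matvec(A0, V, W, v):
    """y = (A0 - sum_j v_j w_j^T - sum_j w_j v_j^T) v, using only A0.

    V and W are lists of the column vectors accumulated so far in the block.
    """
    y = matvec(A0, v)
    for vj, wj in zip(V, W):
        wv = dot(wj, v)   # w_j^T v
        vv = dot(vj, v)   # v_j^T v
        y = [yi - vji * wv - wji * vv for yi, vji, wji in zip(y, vj, wj)]
    return y

def explicit_update(A0, V, W):
    """Form A0 - sum_j (v_j w_j^T + w_j v_j^T) explicitly, for comparison."""
    n = len(A0)
    A = [row[:] for row in A0]
    for vj, wj in zip(V, W):
        for i in range(n):
            for k in range(n):
                A[i][k] -= vj[i] * wj[k] + wj[i] * vj[k]
    return A
```

The deferred form costs two dot products and two vector updates per stored column pair, in place of a full rank-2 update of the matrix at every step.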
4.2.2 PDSYTRD implementation (Figure 4.4)

Figure 4.5 shows Householder’s reduction to tridiagonal form along with a model for the runtime of each step in ScaLAPACK’s reduction to tridiagonal form code, PDSYTRD. The rest of this section explains the computation and communication pattern in PDSYTRD, and hence the inefficiencies.
Figure 4.4: PDSYEVX reduction to tridiagonal form (See Figure 4.3 for further details)

do \( ii = 1, n, \text{nb} \)

\( mxi = \min(ii + \text{nb}, n) \)

do \( i = ii, mxi \)

**Update current \((i^{th})\) column of \(A\) (Table 4.1)**

1.1 spread \(V(i, ii:i-1)^T\) and \(W(i, ii:i-1)^T\) down

1.2 \( A(:, i) = A(:, i) - W(:, ii:i-1) V(i, ii:i-1)^T - V(:, ii:i-1) W(i, ii:i-1)^T \)

\(V\) and \(W\) are used as they are stored (no data movement required)

**Compute reflector (Table 4.2)**

2.1 \([\tau, v] = \text{house}(A(i+1:n, i))\)

\(v \in R^{n-i}; \tau\) is a scalar

**Perform matrix-vector multiply (Table 4.3)**

3.1 spread \(v\) across

3.2 transpose \(v\), spread down

3.3 \(w_1 = \text{tril}(A(i+1:n, i+1:n))v;\) \(w_2 = \text{tril}(A(i+1:n, i+1:n), -1)^T v\)

3.4 sum \(w\) row-wise

3.5 sum \(w^T\) column-wise

3.6 \(w = w_1 + w_2\)

\(w\) is distributed like column \(A(:, i)\), hence \(w_1\) must be transposed.

**Update the matrix-vector product (Table 4.4)**

4.1 \( w = w - W(:, ii:i-1) V(i, i+1:n)^T v - V(:, ii:i-1) W(i, i+1:n)^T v \)

**Compute companion update vector (Table 4.5)**

5.1 \(c = w \cdot v^T;\) \(w = \tau w - (c \tau / 2) v\)

\(W(i+1:n, i) = w;\) \(V(i+1:n, i) = v\)

end do \(i = ii, mxi\)

**Perform rank 2k update (Table 4.6)**

6.1 spread \(V(mxi+1:n, ii:mxi),\) \(W(mxi+1:n, ii:mxi)\) across (processors in the current column of processors broadcast to processors in other processor columns)

6.2 transpose \(V(mxi+1:n, ii:mxi),\) \(W(mxi+1:n, ii:mxi),\) spread down

6.3 \( A(mxi+1:n, mxi+1:n) = \text{tril}(A(mxi+1:n, mxi+1:n) - W(mxi+1:n, ii:mxi)V(mxi+1:n, ii:mxi)^T - V(mxi+1:n, ii:mxi)W(mxi+1:n, ii:mxi)^T) \)

end do \(ii = 1, n, \text{nb}\)
Figure 4.5: Execution time model for PDSYEVX reduction to tridiagonal form (See Figure 4.4 for details about the algorithm and indices.)

do $ii = 1, n, nb$

$mxi = \min(ii + nb, n)$

do $i = ii, mxi$

**Update current ($i^{th}$) column of $A$**

1.1 spread $V^T$ and $W^T$ down

1.2 $A = A - W V^T - V W^T$

**Compute reflector**

2.1 $v = \text{house}(A)$

**Perform matrix-vector multiply**

3.1 spread $v$ across

3.2 transpose $v$, spread down

3.3 $w = \text{tril}(A) v$;

3.4 sum $w$ row-wise

3.5 sum $w^T$ column-wise

3.6 $w = w + \text{transpose } w^T$

**Update the matrix-vector product**

4.1 $w = w - W V^T v - V W^T v$

**Compute companion update vector**

5.1 $c = w \cdot v^T$;

5.2 $w = \tau w - (c \tau / 2) v$

end do $i = ii, mxi$

**Perform rank $2k$ update**

6.1 spread $V, W$ across

6.2 transpose $V, W$, spread down

6.3 $A = A - W V^T - V W^T$
Distribution of data and computation in PDSYTRD

In PDSYEVX, the matrix being reduced, $A$, is distributed across a two-dimensional grid of processors. The computation is distributed in a like manner, i.e. computations involving matrix element $A(i,j)$ are performed by the processor which owns matrix element $A(i,j)$. Vectors are distributed across the processors within a given column of processors. At the $i^{th}$ step, i.e. when reducing $A(i:n,i:n)$ to $A(i+1:n,i+1:n)$, the vectors are distributed amongst the processors which own some portion of the vector $A(i:n,i)$. Within calls to the PBLAS, these vectors are sometimes replicated across all processor columns, or even transposed and replicated across all processor rows. However, between PBLAS calls, each vector element is owned by just one processor.
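Ownership under a two-dimensional block cyclic layout can be sketched as follows. The function names are illustrative, indices are 0-based, and the first block is assumed to reside on process (0, 0); these are assumptions of the example rather than details taken from the text.

```python
def owner(i, j, mb, nb, p_r, p_c):
    """(process row, process column) that owns global element A(i, j).

    mb x nb is the data layout block size; the grid is p_r x p_c.
    """
    return (i // mb) % p_r, (j // nb) % p_c

def rows_owned(n, mb, p_r, my_row):
    """Global row indices of an n-row matrix held by process row my_row."""
    return [i for i in range(n) if (i // mb) % p_r == my_row]
```

With `mb = nb = 2` on a 2 x 3 grid, `rows_owned(10, 2, 2, 0)` returns rows 0, 1, 4, 5, 8 and 9: blocks of two rows are dealt out to the process rows in turn.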

Critical path in PDSYTRD

For steps 1.1, 1.2, 2.1, 4.1, 5.1, 6.1, 6.2, 6.3 in Figure 4.5, i.e. all steps except “forming the matrix vector product”, the processor owning the most rows in the current column of the remaining matrix has the most work to do and hence it is on the critical path. When the matrix vector product is being formed, (steps 3.1 through 3.6) the processor which owns the most rows and the most columns in the remaining matrix has the most work (both communication and computation) and hence is on the critical path.

Load imbalance

Load imbalance occurs when some processor(s) take longer to perform certain operations\(^1\), requiring other processors to wait. Each processor is responsible for computations on the portion of the matrix and/or vectors that it owns. Some processors own a larger portion of the matrix and/or vectors. Since PDSYTRD has regular synchronization points\(^2\), the processor which takes the longest to complete any given step determines the execution time for that step.

If row $j$ is the first row in a data layout block, the processor which owns $A(j,j)$ will own the most rows in $A(j:n,j:n)$: $\left\lfloor \frac{n-j+1}{\text{nb}\, p_r} \right\rfloor \text{nb} + \min\left(n-j+1 - \left\lfloor \frac{n-j+1}{\text{nb}\, p_r} \right\rfloor \text{nb}\, p_r,\ \text{nb}\right)$. However, if row $j$ is not the first row in a data layout block, even this formula is too simplistic.

\(^1\)Load imbalance also occurs during communication, but for PDSYTRD on the machines that we studied the communication load imbalance was negligible.

\(^2\)Computing the reflector (Line 2.1) and computing the companion update vector (Line 5.1) require all the processors in the processor column owning column $i$ of the matrix and are hence synchronization points.
Fortunately, \( \frac{n-j+1}{p_r} + \frac{\text{nb}}{2} \) is an excellent approximation, on average, for the maximum number of rows of \( A(j:n, j:n) \) owned by any processor. \( \frac{n-j+1}{p_r} + \frac{\text{nb} \cdot \text{pbf} - 1}{p_r} \) is more accurate, but the difference is too small to be useful.
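The row counts discussed above can be explored with a short sketch. `max_rows_first_owner` evaluates the block-aligned count (full cycles of `nb * p_r` rows plus at most one further block), `max_rows_brute` counts rows directly for any starting row, and `approx_max_rows` is the averaged approximation. The function names and the assumption that global row 1 starts a block on process row 0 are illustrative.

```python
def max_rows_first_owner(m, nb, p_r):
    """Rows of an m-row trailing matrix held by the owner of its first
    block, when the first trailing row starts a data layout block."""
    q = m // (nb * p_r)
    return q * nb + min(m - q * nb * p_r, nb)

def max_rows_brute(n, j, nb, p_r):
    """Largest number of rows of A(j:n, :) held by any process row
    (1-based global row indices)."""
    counts = [0] * p_r
    for row in range(j, n + 1):
        counts[((row - 1) // nb) % p_r] += 1
    return max(counts)

def approx_max_rows(n, j, nb, p_r):
    """The averaged approximation (n - j + 1) / p_r + nb / 2."""
    return (n - j + 1) / p_r + nb / 2
```

Comparing `max_rows_brute` with `approx_max_rows` over all `j` for a given `n`, `nb` and `p_r` gives a direct view of how the approximation behaves for starting rows that are not block-aligned.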

The second source of load imbalance is that many of the computations are performed only by the processors which own the current column of the matrix.

**Updating the current column of \( A \)**

As shown in table 4.1, **PDSYTRD** updates the current column of \( A \) through two calls to **PDGEMV**, one at line 350 of *pdlatrd.f* and one at line 355 of *pdlatrd.f*. Each of these calls to **PDGEMV** requires that the first few elements of a column vector (\( \text{W} \) or \( \text{V} \)) be transposed and replicated among all the processors in that column. The transposition is fast because these elements are entirely contained within one processor, but the replication requires a spread down (column-wise broadcast) of \( \text{nb} \) or fewer items.

**Standard data layout model**

By making a few assumptions, we can significantly simplify the model. By assuming that \( p_r = p_c = \sqrt{p} \), many of the terms coalesce. We also assume that the panel blocking factor\(^4\), \( \text{pbf} = 2 \), as it is in **ScaLAPACK** 1.5.

This standard data layout is also assumed in Figure 4.5 and in Chapter 5. The models used in Figure 4.5 and in Chapter 5 are subsets, including only the most important terms, of the “standard data layout” models shown in Tables 4.2 through 4.10.

**Computing the reflector (Line 2.1 in Figure 4.5)**

**PDLARFG** computes the reflector as shown in table 4.2. First, it broadcasts \( \alpha = A(j+1, j) \) to all processes that own column \( A(:,j) \). Then, it computes the norm \( \beta = \|A(j+1:n, j)\|_2 \), leaving the result replicated across all processors that own column \( A(:, j) \).

The rest of the computation is entirely local and requires only \( \frac{2n^2}{\sqrt{p}} + O(n) \) flops, hence does not contribute significantly to total execution time.

\(^4\)The matrix vector multiplies are each performed in panels of size \( \text{pbf} \cdot \text{nb} \). See Section 4.2.2.
Table 4.1: The cost of updating the current column of \( A \) in PDLATRD (Line 1.1 and 1.2 in Figure 4.5)

<table>
<thead>
<tr>
<th>Task</th>
<th>File:line number or subroutine</th>
<th>Execution time contribution from columns ( j = 1 ) to ( n ) shown explicitly</th>
<th>Execution time (simplified)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Broadcast ( W(jb'-1)^T ) within current column(^3).</td>
<td>( pdlatrd.f:350 ) ( pdgemv.c ) ( pbdgemv.f:560 ) ( dgelbsd )</td>
<td>$\sum_{j=1}^{n} \left( \lceil \log_2(p_r) \rceil \alpha + \delta_4 + j' \lceil \log_2(p_r) \rceil \beta \right)$</td>
<td>$n \lceil \log_2(p_r) \rceil \alpha + n \delta_4 + 0.5\, n\, \text{nb} \lceil \log_2(p_r) \rceil \beta$</td>
</tr>
<tr>
<td>Compute local portion of ( A(jn) = A(jn,j) - V(jn, jb'-1) \times W(jn, jb'-1)^T )</td>
<td>( pdlatrd.f:350 ) ( pdgemv.c ) ( pbdgemv.f:580 ) ( dgemv )</td>
<td>$\sum_{j=1}^{n} \left( \delta_2 + 2 \frac{(n-j)\, j'}{p_r} \gamma_2 \right)$</td>
<td>$n \delta_2 + 0.5 \frac{n^2\, \text{nb}}{p_r} \gamma_2$</td>
</tr>
<tr>
<td>Broadcast ( V(jb'-1)^T ) within current column.</td>
<td>( pdlatrd.f:355 ) ( pdgemv.c ) ( pbdgemv.f:560 )</td>
<td>$\sum_{j=1}^{n} \left( \lceil \log_2(p_r) \rceil \alpha + \delta_4 + j' \lceil \log_2(p_r) \rceil \beta \right)$</td>
<td>$n \lceil \log_2(p_r) \rceil \alpha + n \delta_4 + 0.5\, n\, \text{nb} \lceil \log_2(p_r) \rceil \beta$</td>
</tr>
<tr>
<td>Compute local portion of ( A(jn) = A(jn,j) - W(jn, jb'-1) \times V(jn, jb'-1)^T )</td>
<td>( pdlatrd.f:355 ) ( pdgemv.c ) ( pbdgemv.f:580 ) ( dgemv )</td>
<td>$\sum_{j=1}^{n} \left( \delta_2 + 2 \frac{(n-j)\, j'}{p_r} \gamma_2 \right)$</td>
<td>$n \delta_2 + 0.5 \frac{n^2\, \text{nb}}{p_r} \gamma_2$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>$2n \lceil \log_2(p_r) \rceil \alpha + n\, \text{nb} \lceil \log_2(p_r) \rceil \beta + 2n \delta_2 + \frac{n^2\, \text{nb}}{p_r} \gamma_2 + 2n \delta_4$</td>
<td></td>
</tr>
<tr>
<td>Standard data layout (See section 4.2.2)</td>
<td></td>
<td>$2n \lceil \log_2(\sqrt{p}) \rceil \alpha + n\, \text{nb} \lceil \log_2(\sqrt{p}) \rceil \beta + 2n \delta_2 + \frac{n^2\, \text{nb}}{\sqrt{p}} \gamma_2 + 2n \delta_4$</td>
<td></td>
</tr>
</tbody>
</table>
Table 4.2: The cost of computing the reflector (PDLARFG) (Line 2.1 in Figure 4.5)

<table>
<thead>
<tr>
<th>Task</th>
<th>File:line number or subroutine</th>
<th>Execution time contribution from columns $j = 1$ to $n$ shown explicitly</th>
<th>Execution time (simplified)</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\alpha = A(j + 1, j)$</td>
<td>pdlatrd.f:364, pdlarf.f:213</td>
<td>$\sum_{j=1}^{n} \lceil \log_2(p_r) \rceil \alpha$</td>
<td>$n \lceil \log_2(p_r) \rceil \alpha$</td>
</tr>
<tr>
<td>$\text{xnorm} = |A(j + 1:n, j)|$</td>
<td>pdlatrd.f:364, pdlarf.f:229, pdnrm2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>$\tau = -(\alpha + \beta)/\beta$</td>
<td>pdlatrd.f:364, pdlarf.f:271</td>
<td>negligible</td>
<td>negligible</td>
</tr>
<tr>
<td>$A(j + 2, j) = \frac{A(j + 2, j)}{\alpha + \beta}$</td>
<td>pdlatrd.f:364, pdlarf.f:272, pdscal</td>
<td>$\sum_{j=1}^{n} \frac{1}{2} \epsilon_4$</td>
<td>$\frac{n}{2} \epsilon_4$</td>
</tr>
<tr>
<td>$E(j) = A(j + 1, j) = \beta$</td>
<td>pdlatrd.f:364, pdlarf.f:273</td>
<td>negligible</td>
<td>negligible</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td></td>
<td>$3 n \lceil \log_2(p_r) \rceil \alpha + n \epsilon_4$</td>
</tr>
<tr>
<td>Standard data layout (See section 4.2.2)</td>
<td></td>
<td></td>
<td>$3 n \lceil \log_2(\sqrt{p}) \rceil \alpha + n \epsilon_4$</td>
</tr>
</tbody>
</table>
Forming the matrix vector product using \texttt{PDSYMV} (Lines 3.1 through 3.6 in Figure 4.5)

The matrix $A$ is laid out in a block cyclic manner as described in section 2.5.4. Computing the matrix vector product $y = Av$ requires that $v$ be copied to all processes that own a part of $A$ that needs to be multiplied by $v$. The vector $v$ must be transposed\(^5\).

Each processor in the processor column that owns $v$ sends its elements of $v$ to the processors that need them: $v$ is transposed and spread down and, because only half of $A$ is stored, $v$ must also be spread across. Then, the matrix vector multiplies\(^6\), $w_1 = \text{tril}(A,0)v$ and $w_2 = \text{tril}(A,-1)^Tv$, are performed locally. $w_1$ is summed within columns, transposed and added to $w_2$, the result of $\text{tril}(A,-1)^Tv$, which is summed to the active column of processors. The algorithm used by \texttt{PDSYMV} is:

\begin{algorithm}
\begin{enumerate}
  \item Broadcast $v$ within each row of processors (Line 3.1 in Figure 4.4)
  \item Transpose $v$ within each column of processors (Line 3.2 in Figure 4.4)
  \item Broadcast $v^T$ within each column of processors (Line 3.2 in Figure 4.4)
  \item Form diagonal portion of $A$ (Line 3.3 in Figure 4.4)
  \item $w_1 = \text{locally available portion of } \text{tril}(A,0)v$ (Line 3.3 in Figure 4.4)
  \item $w_2 = v^T \text{tril}(A,-1)$ (Line 3.3 in Figure 4.4)
  \item Sum $w_1$ within each column of processors (Line 3.4 in Figure 4.4)
  \item Sum $w_2$ within each row of processors (Line 3.5 in Figure 4.4)
  \item Transpose $w_1$ and add to $w_2$ (Line 3.6 in Figure 4.4)
\end{enumerate}
\end{algorithm}
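The arithmetic performed in steps 5 and 6 can be sketched serially: the full symmetric product is assembled from two products that read only the lower triangle. The function name is illustrative, and the communication steps (broadcasts, sums and transposes) of the parallel algorithm are not modelled.

```python
def symv_lower(A, v):
    """Compute A v for symmetric A, reading only the lower triangle of A."""
    n = len(v)
    # w1 = tril(A, 0) v: diagonal and below, accumulated by row.
    w1 = [sum(A[i][j] * v[j] for j in range(i + 1)) for i in range(n)]
    # w2 = tril(A, -1)^T v: strictly-lower entries, applied transposed.
    w2 = [sum(A[i][j] * v[i] for i in range(j + 1, n)) for j in range(n)]
    return [a + b for a, b in zip(w1, w2)]
```

Entries above the diagonal are never read, so they may hold arbitrary data, which matches the assumption that only the lower triangle of $A$ contains valid data.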

The two transpose operations, steps \{2,3\} and step 9 in Algorithm 4.1, though both are performed by \texttt{PBDTRMV}, use different communication patterns. The transpose performed in steps 2 and 3 is an all-to-all. It takes $v$ replicated across the processor columns and distributed across the processor rows and produces $v^T$ replicated across the processor rows and distributed across the processor columns. The transpose performed in step 9 is a one-to-one transpose. It takes $y_u^T$ distributed across the processor columns within one processor row. It produces $y_u$ distributed across the processor rows within the current processor column.

\(^5\)The non-transposed $v$ is distributed like column $A(:,i)$, the transposed $v$ is distributed like row $A(i,:)$.

\(^6\)$\text{tril}()$ is \texttt{MATLAB} notation for the lower triangular portion of a matrix (including the diagonal). $\text{tril}(A,-1)$ refers to the portion of the matrix below the diagonal.

The all-to-all transposition is performed in two steps (steps 2 and 3 in Algorithm 4.1). Since each column of processors contains a complete copy of the vector $v$, each acts independently, first collecting the portion of $v^T$ that belongs to this processor column to one processor and then broadcasting it to all processor columns. The operation of collecting the portion of $v^T$ that belongs to this processor column to one processor is done as a tree-based reduction, requiring $\lceil \log_2(\text{lcm}(p_r, p_c)) \rceil$ messages, and a total of $\frac{\text{lcm}(p_r, p_c) - 1}{p_c}$ words which I model as $\frac{p}{p_c}$ words. The broadcast which completes the transpose (step 3), requires $\lceil \log_2(p_c) \rceil$ messages and $\lceil \log_2(p_c) \rceil \frac{p}{p_c}$ words.

The one-to-one transpose (step 9) is accomplished as a single set of direct messages. Every word in $y_u^T$ is owned by exactly one processor, and every word in $y_u$ should be sent to exactly one processor. Every word in $y_u^T$ is sent from the processor that owns it to the processor that needs the corresponding word in $y_u$. All words being sent between the same two processors are sent in a single message. Each processor that owns a part of $y_u^T$ sends every word that it owns, i.e. $\frac{p}{p_c}$ words in $\text{lcm}(p_r, p_c)/p_c$ messages. Every processor that needs a part of $y_u$ receives the number of words that it needs: $\frac{p}{p_c}$ words in $\text{lcm}(p_r, p_c)$ messages.

The two matrix vector multiplies are each performed in panels of size $\text{pbf} \times \text{nb}$. $\text{pbf}$, the panel blocking factor, is set to $\max(\text{mullen}, \frac{\text{lcm}(p_r, p_c)}{p_c})$, where $\text{mullen}$ is a tuning parameter set at compile-time to 2 in ScaLAPACK 1.5.
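As a small worked illustration of the formula for the panel blocking factor, the sketch below evaluates it for a few process grids. The function name is an assumption; `mullen` defaults to 2, the value quoted above.

```python
from math import gcd

def panel_blocking_factor(p_r, p_c, mullen=2):
    """pbf = max(mullen, lcm(p_r, p_c) / p_c)."""
    lcm = p_r * p_c // gcd(p_r, p_c)
    return max(mullen, lcm // p_c)
```

On a square grid the ratio $\text{lcm}(p_r, p_c)/p_c$ is 1, so `mullen` determines the panel width. On a grid with many more process rows than columns the ratio takes over: for an 8 x 2 grid it is 4, giving panels of width $4 \times \text{nb}$.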

The cost of the matrix vector multiply is detailed in Table 4.3.

The number of flops in the matrix vector multiply which any given processor must perform is controlled by the size and shape of the local portion of the trailing matrix. The processor holding the largest portion of the trailing matrix holds a matrix of size approximately $\left\lceil \frac{n-j}{\text{mb}\, p_r} \right\rceil \text{mb} \times \left\lceil \frac{n-j}{\text{nb}\, p_c} \right\rceil \text{nb}$. Because we update only the lower triangular portion of the matrix, each element in the lower triangular portion of the matrix is used in two matrix vector multiplies. And, because the shape of the local portion of the matrix is irregular (a column block stair step with some diagonal steps) the matrix vector computation is performed by column blocks. The irregular patterns repeat every $\frac{\text{lcm}(p_r, p_c)}{p_c} \times \text{nb}$, so $\text{pbf}$, the panel blocking factor is chosen to be: $\max(\text{mullen}, \frac{\text{lcm}(p_r, p_c)}{p_c})$, where $\text{mullen}$ is a compile time

---

7 If $p_c = \text{lcm}(p_r, p_c)$ the portion of data that belongs to this processor column is already on one processor and hence this “collection” is a null operation.

8 The largest local matrix size differs from this only when $\text{mod}(n-j, \text{nb}\, p_r) < \text{nb}$ or $\text{mod}(n-j, \text{nb}\, p_c) < \text{nb}$.
Table 4.3: The cost of all calls to PDSYMV from PDSYTRD

<table>
<thead>
<tr>
<th>Step</th>
<th>Description</th>
<th>Equation</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Broadcast $v$ within each processor row</td>
<td>$\sum_{j=1}^{n} \left( \frac{\log_2 (p_r \cdot v)}{p_r} \right) \alpha + \frac{n}{p_r} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta$</td>
</tr>
<tr>
<td>2</td>
<td>Transpose $v$</td>
<td>$\sum_{j=1}^{n} \left( \frac{\log_2 (\text{ldxf}(p_r \cdot v))}{p_r} \right) \alpha + \frac{n}{p_r} \cdot \frac{\log_2 (\text{ldxf}(p_r \cdot v))}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (\text{ldxf}(p_r \cdot v))}{p_r} \beta$</td>
</tr>
<tr>
<td>3</td>
<td>Broadcast $v^T$ down within each processor column</td>
<td>$\sum_{j=1}^{n} \left( \frac{\log_2 (p_r \cdot v)}{p_r} \right) \alpha + \frac{n}{p_r} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta$</td>
</tr>
<tr>
<td>4</td>
<td>Form diagonal portion of matrix, padded with zeroes</td>
<td>$\frac{1}{2} \cdot \frac{n}{p_r} \cdot \beta_1 + \frac{1}{2} \cdot \frac{n}{p_r} \cdot \beta_1$</td>
</tr>
<tr>
<td>5</td>
<td>$w = \text{tril}(A, 0) \cdot v$, $w^T = v^T \cdot \text{tril}(A, -1)$</td>
<td>$\sum_{j=1}^{n} \left( \frac{\log_2 (p_r \cdot v)}{p_r} \right) \alpha + \frac{n}{p_r} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta$</td>
</tr>
<tr>
<td>6</td>
<td>Sum, row-wise, $w$</td>
<td>$\sum_{j=1}^{n} \left( \frac{\log_2 (p_r \cdot v)}{p_r} \right) \alpha + \frac{n}{p_r} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta$</td>
</tr>
<tr>
<td>7</td>
<td>Sum, column-wise, $w^T$</td>
<td>$\sum_{j=1}^{n} \left( \frac{\log_2 (p_r \cdot v)}{p_r} \right) \alpha + \frac{n}{p_r} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta$</td>
</tr>
<tr>
<td>8</td>
<td>Transpose $w^T$ and sum into $w$</td>
<td>$\sum_{i=1}^{n} \left( \frac{\log_2 (\text{ldxf}(p_r \cdot v))}{p_r} \right) \alpha + \frac{n}{p_r} \cdot \frac{\log_2 (\text{ldxf}(p_r \cdot v))}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (\text{ldxf}(p_r \cdot v))}{p_r} \beta + 0.5 \cdot n \cdot \beta_1 + 0.5 \cdot n \cdot \beta_2 + 0.5 \cdot \beta_1 + 0.5 \cdot \beta_2 + 0.5 \cdot \beta_3 + 0.5 \cdot \beta_4$</td>
</tr>
<tr>
<td>9</td>
<td>Total</td>
<td>$\sum_{i=1}^{n} \left( \frac{\log_2 (\sqrt{p_r}) \cdot \alpha + 2n \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta}{2 \cdot \frac{n}{p_r} \cdot \beta_1 + 0.5 \cdot n \cdot \beta_2 + 0.5 \cdot \beta_3 + 0.5 \cdot \beta_4}$</td>
</tr>
<tr>
<td>10</td>
<td>Standard data layout (See section 4.2.2)</td>
<td>$\sum_{i=1}^{n} \left( \frac{\log_2 (\sqrt{p_r}) \cdot \alpha + 2n \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta + 0.5 \cdot n \cdot \text{nb} \cdot \frac{\log_2 (p_r \cdot v)}{p_r} \beta}{2 \cdot \frac{n}{p_r} \cdot \beta_1 + 0.5 \cdot n \cdot \beta_2 + 0.5 \cdot \beta_3 + 0.5 \cdot \beta_4} \right)$</td>
</tr>
</tbody>
</table>
parameter, set to 2 in the standard PBLAS release. The column panels are filled out with zeroes to make the matrix vector multiply efficient. Even the act of filling the diagonal blocks with zeroes, because it is done inefficiently, is noticeable on modest problem sizes.

The number of flops required for a global \((n - j) \times (n - j)\) matrix vector multiply is approximately:

\[
2 \times 2 \times \left( \frac{1}{2} \left( \frac{n - j}{nb \, p_r} \right) \left( \frac{n - j}{nb \, p_c} \right) + \left( \frac{n - j}{2 \, p_c} \right) \frac{pbf \, nb}{2 \, p_r} \right).
\]

The first 2 is because multiplies and adds are counted separately. Each element in the lower triangular portion of the matrix is involved twice, hence the second 2. The first term stems directly from the size of the local matrix. The second term stems from the odd shape of the local matrix and is primarily the result of the unnecessary flops (zero matrix elements) added to reduce the number of dgemv calls.

We use the following equality, dropping the \(O(n)\) term:

\[
\sum_{i=1}^{n} \left[ \frac{i}{a} \right] \left[ \frac{i}{b} \right] = \frac{n^3}{3} + \frac{n^2 a}{4} + \frac{n^2 b}{4} + O(n).
\]

\[
flops = 2 \times 2 \sum_{j=1}^{n} \left( \frac{1}{2} \left( \frac{n - j}{nb \, p_r} \right) \left( \frac{n - j}{nb \, p_c} \right) + \left( \frac{n - j}{2 \, p_c} \right) \frac{pbf \, nb}{2 \, p_r} \right).
\]

\[
= 2 \times \frac{1}{2} \left( \frac{n^3}{3 \, p_r} + \frac{n^2}{4 \, nb \, p_c} \frac{nb^2}{p_r} + \frac{1}{4 \, nb \, p_c} \frac{nb^2}{p_r} + \frac{n^2 \, pbf \, nb}{2 \, p_r} \right)
\]

\[
= \frac{2}{3} \frac{n^3}{p} + \frac{n^2 \, nb}{2 \, p_c} + \frac{n^2 \, nb}{2 \, p_c} + \frac{n^2 \, nb \, pbf}{p_r}
\]

Figure 4.6: Flops in the critical path during the matrix vector multiply
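
The leading terms of the equality used above can be checked numerically. The sketch below is only a sanity check under a stated assumption: it reads the bracketed quantities as ceilings scaled back to element counts, i.e. it sums $\lceil i/a \rceil a \cdot \lceil i/b \rceil b$, because that is the reading for which the stated right-hand side holds to leading order. The function names and the sample values of `n`, `a` and `b` are illustrative and do not come from the PBLAS.

```python
def scaled_ceiling_sum(n, a, b):
    """Sum over i = 1..n of (ceil(i/a) * a) * (ceil(i/b) * b), in exact integers."""
    total = 0
    for i in range(1, n + 1):
        rows = -(-i // a) * a  # i rounded up to a multiple of a
        cols = -(-i // b) * b  # i rounded up to a multiple of b
        total += rows * cols
    return total


def leading_terms(n, a, b):
    """Right-hand side of the equality with the O(n) term dropped."""
    return n**3 / 3 + n**2 * a / 4 + n**2 * b / 4


# Hypothetical sample values, chosen only for illustration.
n, a, b = 1000, 8, 12
exact = scaled_ceiling_sum(n, a, b)
approx = leading_terms(n, a, b)
rel_err = abs(exact - approx) / exact
print(exact, approx, rel_err)
```

For these sample values the two sides agree to within about one percent, which is the level of agreement the model needs.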

**Updating the matrix vector product**

Updating the matrix vector product, \(y = y - V W^T v - W V^T v\), requires four matrix vector multiplies. The first two, \(\text{temp} = W^T v\) and \(\text{temp} = V^T v\), are both \((n - j) \times j'\) by \(n - j\) matrix vector multiplies, where \(j' = j \bmod nb\). Both the matrix and the vector are stored in the current process column, so no data movement is required to perform the computation. However, the result, a vector of length \(j' - 1\), is the sum of the matrix vector multiplies performed on each of the processes in the process column.
Table 4.4: The cost of updating the matrix vector product in PDLATRD (Line 4.1 in Figure 4.5)

<table>
<thead>
<tr>
<th>Task</th>
<th>File:line number or subroutine</th>
<th>Execution time contribution from columns $j = 1$ to $n$ shown explicitly</th>
<th>Execution time (simplified)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Broadcast $W^T$ unnecessarily for temp = $W^T v$</td>
<td>pdlatrd.f:373 pbdgemv.f:326 dgelss2d</td>
<td>$\sum_{j=1}^{n} (\epsilon_4 + [\log_2(p_r)] \alpha + \frac{\epsilon_4}{p_r} \beta + [\log_2(p_r)] \frac{\epsilon_4}{p_r} \beta)$</td>
<td>$n \epsilon_4 + n [\log_2(p_r)] \alpha + 0.5 \frac{n^2 [\log_2(p_r)] \beta}{p_r} + 0.5 n nb [\log_2(p_r)] \beta$</td>
</tr>
<tr>
<td>Local computation of temp = $W^T v$</td>
<td>pdlatrd.f:373 pbdgemv.f:346 dgemv</td>
<td>$\sum_{j=1}^{n} (\epsilon_2 + 2 \frac{\epsilon_2}{p_r} \gamma_2)$</td>
<td>$n \epsilon_2 + 0.5 \frac{2 \epsilon_2}{p_r} \gamma_2$</td>
</tr>
<tr>
<td>Sum the contribution of temp from all processes in the column</td>
<td>pdlatrd.f:373 pbdgemv.f:358 dgsum2d</td>
<td>$\sum_{j=1}^{n} ([\log_2(p_r)] \alpha + [\log_2(p_r)] \frac{\epsilon_4}{p_r} \beta)$</td>
<td>$n [\log_2(p_r)] \alpha + 0.5 n nb [\log_2(p_r)] \beta$</td>
</tr>
<tr>
<td>Broadcast temp (row-wise) to all processes in this column</td>
<td>pdlatrd.f:376 pbdgemv.f:579 dgelss2d</td>
<td>$\sum_{j=1}^{n} ([\log_2(p_r)] \alpha + [\log_2(p_r)] \frac{\epsilon_4}{p_r} \beta)$</td>
<td>$n \epsilon_4 + n [\log_2(p_r)] \alpha + 0.5 n nb [\log_2(p_r)] \beta$</td>
</tr>
<tr>
<td>Local computation of $y = V \cdot temp$</td>
<td>pdlatrd.f:376 pbdgemv.f:500 dgemv</td>
<td>$\sum_{j=1}^{n} (\epsilon_2 + 2 \frac{\epsilon_2}{p_r} \gamma_2)$</td>
<td>$n \epsilon_2 + 0.5 \frac{2 \epsilon_2}{p_r} \gamma_2$</td>
</tr>
<tr>
<td>$y = y + W^T$ is identical to $y = y + VW^T$</td>
<td>pdlatrd.f:379 pdlatrd.f:382</td>
<td></td>
<td>$n [\log_2(p_r)] \alpha + 2 n [\log_2(p_r)] \alpha + 0.5 \frac{n^2 [\log_2(p_r)] \beta}{p_r} + 0.5 n nb [\log_2(p_r)] \beta + 2 n \epsilon_2 + \frac{2 \epsilon_2}{p_r} \gamma_2 + 2 n \epsilon_4$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td></td>
<td>$2 n [\log_2(p_r)] \alpha + 4 n [\log_2(p_r)] \alpha + \frac{n^2 [\log_2(p_r)] \beta}{p_r} + n nb [\log_2(p_r)] \beta + 2 n nb [\log_2(p_r)] \beta + 2 n \epsilon_2 + \frac{2 \epsilon_2}{p_r} \gamma_2 + 4 n \epsilon_4$</td>
</tr>
<tr>
<td>Standard data layout (See section 4.2.2)</td>
<td></td>
<td></td>
<td>$6 n [\log_2(\sqrt{n})] \alpha + \frac{n^2 [\log_2(\sqrt{n})] \beta}{p_r} + 3 n nb [\log_2(\sqrt{n})] \beta + 4 n \epsilon_2 + \frac{2 \epsilon_2}{p_r} \gamma_2 + 4 n \epsilon_4$</td>
</tr>
</tbody>
</table>

The other two matrix vector multiplies, $y = V \cdot temp$ and $y = W \cdot temp$, are both $(n - j) \times (j' - 1)$ by $j'-1$ matrix vector multiplies. Again, the computation is performed entirely within the current process column. The 1 by $j'-1$ vector, temp, must be spread down, i.e. broadcast column-wise, to all processes in this process column, however no further communication is necessary in order to update $y$, as $y$ is perfectly aligned with $V$.

Details are given in Table 4.4.
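
The arithmetic of this update can be illustrated serially. The sketch below assumes the update has the form $y \leftarrow y - V(W^T v) - W(V^T v)$ and ignores the data distribution entirely; the helper names (`matvec`, `transpose`, `update_matvec`) and the small integer test data are illustrative and are not part of PDLATRD.

```python
def matvec(M, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*M)]


def update_matvec(y, V, W, v):
    """Return y - V (W^T v) - W (V^T v) using four matrix-vector multiplies."""
    temp = matvec(transpose(W), v)    # short vector: one entry per column of W
    v_part = matvec(V, temp)          # tall vector, aligned with y
    temp = matvec(transpose(V), v)    # short vector: one entry per column of V
    w_part = matvec(W, temp)          # tall vector, aligned with y
    return [yi - a - b for yi, a, b in zip(y, v_part, w_part)]


# Illustrative data only: 3 rows, 2 previously computed columns.
V = [[1, 0], [2, 1], [0, 3]]
W = [[2, 1], [0, 1], [1, 1]]
v = [1, 2, 3]
y = [10, 20, 30]
print(update_matvec(y, V, W, v))
```

The two short products are the ones whose results must be summed over the process column, while the two tall products need only the broadcast of the short vector, which is why the communication in Table 4.4 involves vectors of length at most $nb$.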

**Computing the companion update vector**

The details involved in computing the companion update vector are shown in table 4.5.
Table 4.5: The cost of computing the companion update vector in PDLATRD (Line 5.1 in Figure 4.5)

<table>
<thead>
<tr>
<th>Task</th>
<th>File:line number or subroutine</th>
<th>Execution time contribution from columns $j = 1$ to $n$ shown explicitly</th>
<th>Execution time (simplified)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compute $y = \tau y$</td>
<td>pdlatrd.f:385 pdscal</td>
<td>$\sum_{j=1}^{n} \frac{1}{3} \delta_4$</td>
<td>$\frac{1}{3} n \delta_4$</td>
</tr>
<tr>
<td>Compute $\alpha = -0.5 \tau y^T v$</td>
<td>pdlatrd.f:386 pddot</td>
<td>$\sum_{j=1}^{n} \left( \lceil \log_2(p_r) \rceil \alpha + \frac{1}{3} \delta_4 \right)$</td>
<td>$n \lceil \log_2(p_r) \rceil \alpha + \frac{1}{3} n \delta_4$</td>
</tr>
<tr>
<td>Compute $w = y - \alpha v$</td>
<td>pdlatrd.f:390 pdaxpy</td>
<td>$\sum_{j=1}^{n} \left( \lceil \log_2(p_r) \rceil \alpha + \frac{1}{3} \delta_4 \right)$</td>
<td>$n \lceil \log_2(p_r) \rceil \alpha + \frac{1}{3} n \delta_4$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td></td>
<td>$2 n \lceil \log_2(p_r) \rceil \alpha + n \delta_4$</td>
</tr>
<tr>
<td>Standard data layout</td>
<td></td>
<td></td>
<td>$2n \lceil \log_2(\sqrt{p}) \rceil \alpha + n \delta_4$</td>
</tr>
</table>

**Performing the rank-2k update**

The rank-2k update is performed once per block column (i.e. $n/nb$ times):

$$A = A - vw^T - wv^T.$$ 

PDSYTRD broadcasts $v$ and $w$ along processor rows, transposes them and then broadcasts them along processor columns. I ignore the $\alpha$ (latency) cost of the transpose here, because it is less significant (by a factor of $nb$) than the similar cost for the transpose in the matrix-vector multiply and because it is only relevant when $\frac{\log_2(p_r)}{r}$ is very large. The third $\beta$ term in the transpose and broadcast operation should be multiplied by $\frac{\log_2(p_r)}{r}$ but the added complexity is not justified for a small term.

The number of flops performed during the rank two update of $A(\tilde{j} : n, \tilde{j} : n)$ is modeled as:

$$2 \times 2 \times nb \left( \frac{1}{2} \left( \frac{n - \tilde{j}}{p_r} + \frac{nb}{2} \right) \left( \frac{n - \tilde{j}}{p_c} + \frac{nb}{2} \right) + \frac{n \cdot nb \cdot pbf}{2 \cdot p_r} \right).$$

The number of flops performed per matrix element involved in the rank-2 update is $2 \times 2 \times nb$. The number of elements in the lower triangular matrix is given by the sum of the terms within the parentheses.

The total number of flops for all rank two updates is modeled as the sum of this quantity as $\tilde{j}$ ranges from $nb$ to $n$ by $nb$. 
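
As a rough check on this model, the per-update expression can be summed directly and compared with the expected leading term $\frac{2}{3}\frac{n^3}{p}$. The sketch below transcribes the expression as printed above; the function name `rank2k_flops` and the sample values of $n$, $nb$, $p_r$, $p_c$ and $pbf$ are assumptions made only for illustration.

```python
def rank2k_flops(n, nb, p_r, p_c, pbf):
    """Sum the modeled flops of every rank-2k update, with j = nb, 2nb, ..., n."""
    total = 0.0
    for j in range(nb, n + 1, nb):
        rows = (n - j) / p_r + nb / 2   # local rows, including the padding term
        cols = (n - j) / p_c + nb / 2   # local columns, including the padding term
        elements = 0.5 * rows * cols + n * nb * pbf / (2 * p_r)
        total += 2 * 2 * nb * elements  # 2 x 2 x nb flops per matrix element
    return total


# Hypothetical problem and process grid, for illustration only.
n, nb, p_r, p_c, pbf = 9600, 32, 4, 4, 2
model = rank2k_flops(n, nb, p_r, p_c, pbf)
leading = (2.0 / 3.0) * n**3 / (p_r * p_c)
ratio = model / leading
print(model, leading, ratio)
```

For these values the summed model exceeds the leading term by roughly ten percent, which gives a feel for how much the lower order terms contribute at moderate problem sizes.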
Table 4.6: The cost of performing the rank-2k update (PDSYR2K) (Lines 6.1 through 6.3 in Figure 4.5)

<table>
<thead>
<tr>
<th>Task</th>
<th>File:line number or subroutine</th>
<th>Execution time contribution from columns $j = 1$ to $n$ shown explicitly</th>
<th>Execution time (simplified)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Broadcast $V$ and $W$ within process rows (Line 6.1)</td>
<td>pdsytrd.f:354 pdsyr2k_c pdsyr2k.f:454,477 dgels3d</td>
<td>$\sum_{j=nb, nb}^{n} (2 \log_2(p_r) + \beta) + \frac{\beta}{\log_2(p_r)}$</td>
<td>$2 \frac{n}{\log_2(p_r)} + \frac{n \log_2(p_r)}{p_r} \gamma_3$</td>
</tr>
<tr>
<td>Transpose and broadcast $V$ and $W$ within process columns (Line 6.2)</td>
<td>pdsytrd.f:354 pdsyr2k_c pdsyr2k.f:491,847 pdbtran</td>
<td>$\sum_{j=nb, nb}^{n} (2 \log_2(p_r) + \beta) + \frac{\beta}{\log_2(p_r)}$</td>
<td>$2 \frac{n}{\log_2(p_r)} + \frac{n \log_2(p_r)}{p_r} \gamma_3$</td>
</tr>
<tr>
<td>tril$(A, 0) = tril(A, 0)$</td>
<td>pdsytrd.f:354 pdsyr2k_c pdsyr2k.f:655-663, 1052-57 pdgemm</td>
<td>$\sum_{j=nb, nb}^{n} (4 \log_2(p_r) + 4\beta) + \frac{\beta}{\log_2(p_r)}$</td>
<td>$2 \frac{n^2}{\log_2(p_r)} + \frac{n^2 \log_2(p_r)}{p_r} \gamma_3$</td>
</tr>
<tr>
<td>Total</td>
<td>$2 \frac{n}{\log_2(p_r)} + \frac{n \log_2(p_r)}{p_r} \gamma_3$</td>
<td>$-n \log_2(p_r) + \frac{n \log_2(p_r)}{p_r} \gamma_3$<br>$\frac{\beta}{\log_2(p_r)}$<br>$\frac{\beta}{\log^2_2(p_r)} + \frac{\beta^2}{\log_2(p_r)} \gamma_3$<br>$+ \frac{\beta^2}{\log^2_2(p_r)} \gamma_3$</td>
<td>$2 \frac{n^2}{\log_2(p_r)} + \frac{n^2 \log_2(p_r)}{p_r} \gamma_3$</td>
</tr>
<tr>
<td>Standard data layout (See section 4.2.2)</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

The negative term $(-2 \frac{n^2}{p} \log_2(p_r) \gamma_3)$, which results from the fact that $j$ starts at nb, is ignored because it is $O(\frac{n^2}{p})$ and hence too small.

Details are given in table 4.6.
### 4.2.3 PDSYTRD execution time summary

Table 4.7 shows that the computation cost in PDSYTRD is:

\[
\begin{align*}
\frac{2}{3} n^3 &+ \frac{2}{3} n^3 + \frac{n^2 \text{nb pbf}}{p_r} \gamma_3 + \frac{7}{2} \frac{n^2 \text{nb}}{p_r} \gamma_2 + \frac{1}{2} \frac{n^2 \text{nb}}{p_c} \gamma_2 + \frac{n^2 \text{nb pbf}}{p_r} \gamma_2 + \\
\frac{n^2 \text{nb pbf}}{p_r} \gamma_2 + \frac{1}{2} \frac{n^2 \text{nb}}{p_r} \gamma_3 + \frac{1}{2} \frac{n^2 \text{nb}}{p_c} \gamma_3 + \\
\frac{2}{n^2 \text{pbf}} \frac{n^2}{p_c} \gamma_2 + \frac{2}{n^2 \text{pbf}} \frac{n}{p_c} \delta_2 + 2 &\frac{n^2 \text{pbf}}{p_r} \frac{n}{p_c} \delta_2 + \frac{n^2}{p_c} \delta_1 + 9 n \delta_4.
\end{align*}
\]

The most important terms in the computation cost are the \(O\left(\frac{n^3}{p}\right)\) flops. The relative importance of the other \(O(n^3)\) terms depends on the computer. On the PARAGON none stand out above the rest. Indeed on the PARAGON none of the \(O(n^3)\) terms accounts for more than 3% of the total execution time of PDSYEVX when \(n = 3480\) and \(p = 64\). However, all of the \(O(n^3)\) terms combined account for 21% of the total execution time on that same problem.

Table 4.8 shows that the computation cost in the tridiagonal eigendecomposition in PDSYEVX is:

\[
53 \frac{n e}{p} \gamma_{\div} + 3 \frac{n m}{p} \gamma_{\div} + 112 n \gamma_{\div} + 265 \frac{n e}{p} \gamma_1 + 45 \frac{n m}{p} \gamma_1 + 620 n \gamma_1 + 6 n c^2 \gamma_1.
\]

The execution time of tridiagonal eigendecomposition is dominated by the cost of divides, and the size of the largest cluster, \(c\). The load imbalance terms (\(112 n \gamma_{\div}\) and \(620 n \gamma_1\)) are negligible.

Table 4.9 shows that the communication cost in PDSYTRD is:

\[
4 n \left[ \log_2(p_c) \right] \alpha + 13 n \left[ \log_2(p_r) \right] \alpha + n \left[ \text{lcm}(p_r, p_c) / p_r \right] \alpha + \\
n \left[ \log_2(\text{lcm}(p_r, p_c)) \right] \alpha + \\
3 \frac{n^2 \text{pbf}}{p_r} \left[ \log_2(p_r) \right] \beta + 2 \frac{n^2 \text{pbf}}{p_c} \left[ \log_2(p_r) \right] \beta + \frac{1}{2} \frac{n^2}{p_r} \beta + 2 \frac{n^2}{p_c} \beta.
\]

Most of the messages are in broadcasts and reductions (i.e. the \(O(n \log(p))\) terms) and most of the broadcasts and reductions (13\(n\)) are within processor rows, versus only 4\(n\) broadcasts and reductions within processor columns. By contrast, the message volume is fairly evenly split between broadcasts and reductions within processor rows (\( 3 \frac{n^2}{p_r} \lceil \log_2(p_c) \rceil \beta \)) and broadcasts and reductions within processor columns (\( 2 \frac{n^2}{p_c} \lceil \log_2(p_c) \rceil \beta \)).

The lcm terms are negligible unless \( p \) is very large, in which case it is important to make sure that \( \text{lcm}(p_r, p_c) \) is reasonable (say \( < 10 \max(p_r, p_c) \)).

### 4.3 Eigendecomposition of the tridiagonal

The execution time of tridiagonal eigendecomposition is dominated by two factors: the size of the largest cluster of eigenvalues and the speed of the divide.

#### 4.3.1 Bisection

During bisection, in \texttt{DSTEBZ}, each Sturm count requires \( n \) divisions and \( 5n \) other flops to produce one additional bit of accuracy. Since an IEEE double precision number has a 53-bit significand, it takes roughly \( 53n \) divisions and \( 53 \times 5n \) flops\(^9\) for each eigenvalue and \( 53 n e \) total divisions for all eigenvalues, where \( e \) is the number of eigenvalues to be computed. The exact number of divisions and flops depends on the actual eigenvalues, the parallelization strategy and other factors. However, this simple model suffices for our purposes.
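
The structure of this cost can be seen in a serial sketch of bisection with Sturm counts. This is a minimal illustration, not the code in \texttt{DSTEBZ}: the function names, the `pivmin` guard value and the fixed count of 53 bisection steps are assumptions made for the example, and this particular recurrence uses one division per row after the first.

```python
import math


def sturm_count(diag, off, x, pivmin=1e-300):
    """Count the eigenvalues below x of a symmetric tridiagonal matrix.

    diag holds the n diagonal entries and off the n - 1 off-diagonal entries.
    """
    count = 0
    d = diag[0] - x
    for i in range(len(diag)):
        if i > 0:
            d = (diag[i] - x) - off[i - 1] ** 2 / d  # one division per row
        if abs(d) < pivmin:
            d = -pivmin                              # guard against a zero pivot
        if d < 0.0:
            count += 1
    return count


def bisect_eigenvalue(diag, off, k, steps=53):
    """Approximate the k-th smallest eigenvalue (k = 0 is the smallest)."""
    n = len(diag)
    # Gershgorin discs give an interval containing every eigenvalue.
    radius = [(abs(off[i - 1]) if i > 0 else 0.0) +
              (abs(off[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(d - r for d, r in zip(diag, radius))
    hi = max(d + r for d, r in zip(diag, radius))
    for _ in range(steps):                           # one Sturm count per bit
        mid = 0.5 * (lo + hi)
        if sturm_count(diag, off, mid) > k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Test matrix with known eigenvalues 2 - 2 cos(k pi / (n + 1)), k = 1..n.
diag = [2.0] * 5
off = [-1.0] * 4
computed = [bisect_eigenvalue(diag, off, k) for k in range(5)]
expected = [2.0 - 2.0 * math.cos((k + 1) * math.pi / 6) for k in range(5)]
print(computed)
```

Each call to `bisect_eigenvalue` performs `steps` Sturm counts of length $n$, which is where the $53n$ divisions per eigenvalue in the model come from.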

#### 4.3.2 Inverse iteration

Inverse iteration typically requires \( 3n \) divides and \( 45n \) flops per eigenvalue plus the cost of re-orthogonalization.

In \texttt{PDSYEVX} the number of flops performed by any particular processor, \( p_i \), during re-orthogonalization is:

\[
\sum_{C \in \{ \text{clusters assigned to } p_i \}} 4 \sum_{i=1}^{\text{size}(C)} \text{n\_iter}(i) \, n \, (i - 1),
\]

where \( \text{n\_iter}(i) \) is the number of inverse iterations performed for eigenvalue \( i \) (typically 3). If the size of the largest cluster is greater than \( \frac{n}{p} \), the processor which is responsible for this cluster will not be responsible for any eigenvalues outside of this cluster.

Hence, if the size of the largest cluster is greater than \( \frac{n}{p} \), the number of flops performed by the processor to which this cluster is assigned is (on average):

\[
4 \ \text{n\_iter} \ n \ \frac{1}{2} c^2 = 6n \ c^2
\]

where \( c = \max_{C \in \{\text{clusters}\}} \text{size}(C) \), i.e. the number of eigenvalues in the largest cluster, and \( \text{n\_iter} = 3 \) is the average number of inverse iterations performed for each eigenvalue.

\(^9\)Although these are not all BLAS Level 1 flops, they have the same ratio of memory operations to flops that is typical of BLAS Level 1 operations.
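
The step from the double sum to $6 n c^2$ can be checked by evaluating both directly. The sketch below assumes a constant $\text{n\_iter} = 3$ for every eigenvalue; the function names and the sample cluster size are illustrative only.

```python
def reorth_flops(cluster_sizes, n, n_iter=3):
    """Evaluate the double sum: 4 * n_iter * n * (i - 1) for each eigenvalue i of each cluster."""
    total = 0
    for size in cluster_sizes:
        for i in range(1, size + 1):
            total += 4 * n_iter * n * (i - 1)
    return total


def reorth_flops_model(c, n, n_iter=3):
    """Simplified model 4 * n_iter * n * c^2 / 2, i.e. 6 n c^2 when n_iter = 3."""
    return 4 * n_iter * n * 0.5 * c**2


# Illustrative case: one cluster of 100 eigenvalues in a matrix of order 1000.
n, c = 1000, 100
exact = reorth_flops([c], n)
model = reorth_flops_model(c, n)
print(exact, model, exact / model)
```

The two differ only by the $c(c-1)$ versus $c^2$ factor, so the model overestimates the sum by about one percent for a cluster of this size.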

As the problem size and number of processors grow, the largest cluster that \texttt{PDSYEVX} is able to reorthogonalize properly gets smaller (relative to \( n \)). As a consequence, reorthogonalization will not require large execution time\(^{10}\). Specifically, if the largest cluster has fewer than \( \frac{n}{p} \) eigenvalues (i.e. fits easily on one processor), the number of eigenvalues that will be assigned to any one processor, and hence the total number of flops it must perform, is limited. The worst case is where there are \( p + 1 \) clusters each of size \( \frac{n}{p+1} \). In this case, one processor must be assigned 2 clusters of size \( \frac{n}{p+1} \), requiring (on average) \( 2 \times 6 \, n \left( \frac{n}{p+1} \right)^2 \) or roughly \( 12 \frac{n^3}{p^2} \) flops.\(^{11}\)

Our model for the execution time of Gram Schmidt re-orthogonalization, \( \sum_{i=1}^{c} 4 \, n \, i \, \gamma_1 = 2 \, n \, c^2 \gamma_1 \), where \( c \) is the size of the largest cluster, assumes that the processor to which the largest cluster is assigned is not assigned any other clusters. This is true if the largest cluster has more than \( n/p \) eigenvalues in it. If the largest cluster of eigenvalues contains fewer than \( n/p \) eigenvalues, reorthogonalization is relatively unimportant.

Inderjit Dhillon, Beresford Parlett and Vince Fernando’s recent work\([139, 77]\) on the tridiagonal eigenproblem substantially reduces the motivation to model the existing \text{ScaLAPACK} tridiagonal eigensolution code in great detail, since we expect them to replace the current code with something that costs \( O(\frac{n^2}{p}) \) flops, \( O(\frac{n^2}{p}) \) message volume and \( O(p) \) messages, which is negligible compared to tridiagonal reduction.

#### 4.3.3 Load imbalance in bisection and inverse iteration

Load imbalance during the tridiagonal eigendecomposition is caused in part by the fact that not all processes will be assigned the same number of eigenvalues and eigenvectors and in part by the fact that different eigenvalues and eigenvectors will require slightly different amounts of computation. Our experience indicates that the load imbalance corresponds roughly to the cost of finding two eigenvalues \( \left( 2 \times \left( 53n \gamma_{\div} + 53 \times 5n \gamma_1 \right) \right) \) and two eigenvectors \( \left( 2 \times \left( 3n \gamma_{\div} + 45n \gamma_1 \right) \right) \) on one processor. Hence, our execution time model for the load imbalance during tridiagonal eigendecomposition is: \( (2 \times 53 + 2 \times 3) \, n \gamma_{\div} + (2 \times 53 \times 5 + 2 \times 45) \, n \gamma_1 = 112 n \gamma_{\div} + 620 n \gamma_1 \).

\(^{10}\)This is not to suggest that reorthogonalization in \texttt{PDSYEVX} gets better as \( n \) and \( p \) increase (indeed \texttt{PDSYEVX} may fail to reorthogonalize large clusters for large \( n \) and \( p \)). It just means that reorthogonalization in \texttt{PDSYEVX} will not take a long time for large \( n \) and large \( p \).

\(^{11}\)The appearance of \( p^2 \) in the denominator stems from the restriction \( c \leq \frac{n}{p} \), meaning that as \( p \) increases the largest cluster size that \text{PDSYEVX} can handle efficiently decreases.
In evaluating the cost of load imbalance in tridiagonal eigendecomposition, one must include load imbalance in Gram Schmidt reorthogonalization. Indeed if the input matrix has one cluster of eigenvalues that is substantially larger than all others (yet small enough to fit on one processor so that PDSYEVX can reorthogonalize it) Gram Schmidt reorthogonalization is very poorly load balanced and could be treated almost entirely as a load imbalance cost.

We do not separate the load imbalance cost of Gram Schmidt from what the execution time for Gram Schmidt would be if the load were balanced, because doing so would complicate the model without making it match actual execution time any better.

#### 4.3.4 Execution time model for tridiagonal eigendecomposition in PDSYEVX

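The cost of tridiagonal eigendecomposition in PDSYEVX is the sum of the cost of bisection, inverse iteration and reorthogonalization. Hence:

\[
53 \frac{n e}{p} \gamma_{\div} + 3 \frac{n m}{p} \gamma_{\div} + 112 n \gamma_{\div} + 265 \frac{n e}{p} \gamma_1 + 45 \frac{n m}{p} \gamma_1 + 620 n \gamma_1 + 6 n c^2 \gamma_1.
\]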

The load imbalance terms stem partly from the fact that some processors will typically be assigned at least one more eigenvalue and/or eigenvector than other processors and from the fact that both bisection and inverse iteration are iterative procedures requiring more time on some eigenvalues than on others.

#### 4.3.5 Redistribution

Inverse iteration typically leaves the data distributed in a manner in which it would be awkward and inefficient to perform back transformation. If each eigenvector is computed entirely within one processor, as PDSTEIN does, inverse iteration requires no communication, provided that all processors have a copy of the tridiagonal matrix and the eigenvalues. This, however, leaves the eigenvector matrix distributed in a one-dimensional manner in which back transformation would be inefficient. Furthermore, since different processors may have been assigned to compute a different number of eigenvectors (to improve orthogonality among the eigenvectors) the eigenvector matrix will typically not be distributed in a block cyclic manner. Since PDORMTR (and all ScaLAPACK matrix transformations) requires that the data be in a 2D block cyclic distribution, the eigenvectors must, at least, be redistributed to a block cyclic distribution. For convenience and potential efficiency\(^{12}\), PDSTEIN redistributes the eigenvector matrix.

The simplest method of data redistribution is to have each processor send one message to each of the other processors. That message contains the data owned by the sender and needed by the receiver. Redistributing the data in this manner requires that each processor send every element that it owns to other processors\(^{13}\) and receive what it needs from other processors. Since each processor owns\(^{14}\) roughly \((n m)/p\) elements and needs roughly \((n m)/p\) elements, the total data sent and received by each processor is roughly \(2(n m)/p\). In our experience, data redistribution is slightly less efficient than other broadcasts and reductions and hence we use \(4(n m)/p\) as our model for the data redistribution cost.

### 4.4 Back Transformation

Transforming the eigenvectors of the tridiagonal matrix back to the eigenvectors of the original matrix requires multiplying a series of Householder vectors. The Householder updates can be applied in a blocked manner with each update taking the form: \((I + V T V^T)\), where \(V \in R^{n \times nb}\) is the matrix of Householder vectors, and \(T\) is an \((nb \times nb)\) triangular matrix [27].

The following steps compute \(Z' = (I + V T V^T)Z\). These are performed for each block Householder update. The major contributors to the cost are noted below.

**Compute** \(T\)

Computing the \(nb\) by \(nb\) triangular matrix \(T\) requires \(nb\) calls to \texttt{DGEMV}, a summation of \(nb^2/2\) elements within the current processor column and \(nb\) calls to \texttt{DTRMV}. The computation of \(T\) need not be in the critical path. There are \(n/nb\) different matrices \(T\) that need to be computed, and they could be computed in advance in parallel.

**Compute** \(W = V^T Z\)

Spread \(V\) across. Compute \(V^T Z\) locally. Sum \(W\) within each processor column.

\(^{12}\)The actual efficiency depends upon the data distribution chosen by the user for the input and output matrices.

\(^{13}\)Although some data will not have to be sent because it is owned and needed by the same processor, this will typically be a minor savings.

\(^{14}\)In the absence of large clusters of eigenvalues assigned to a single processor.
The spread across of $V$ is performed on a ring topology because the processor columns need not be synchronized. Each processor column must receive $V$ and send $V$, hence the cost for each processor column is: $(2 n' \text{nb})/p_r$

The local computation of $V^T Z$ is a call to \texttt{DGEMM} involving $2(m/p_r + \text{vnb})(n'/p_r + \text{nb}/2)\text{nb}$ flops. Ignoring the lower order $\text{vnb} \text{nb}^2$ term, this is:

$$2 \left( n'm \text{nb} \right)/p + 2 \left( n'\text{vnb} \right)/p_r + 2 \left( m \text{nb} \right)/p_r.$$  

**Compute** $W = TW$

Local.

**Compute** $Z = Z - VW$

Spread $W$ down. Local computation. (Note: $V$ has already been spread across.)

The local computation of $Z - VW$, like the computation of $V^T Z$, involves a call to \texttt{DGEMM} involving $2(m/p_r + \text{vnb})(n'/p_r + \text{nb}/2)\text{nb}$ flops.
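
The steps above can be sketched serially. This is a minimal illustration rather than the PDLARFB code: it assumes the LAPACK-style sign convention $H_1 H_2 \cdots H_k = I - V T V^T$ with $H_i = I - \tau_i v_i v_i^T$ (which corresponds to the $I + V T V^T$ form above with the sign absorbed into $T$), and all function names and test data are hypothetical.

```python
def matmul(A, B):
    """Dense product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]


def form_T(V, tau):
    """Upper triangular T such that H_1 H_2 ... H_k = I - V T V^T."""
    k, n = len(tau), len(V)
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        T[i][i] = tau[i]
        # z = V[:, :i]^T v_i, then T[:i, i] = -tau_i * T[:i, :i] z
        z = [sum(V[r][j] * V[r][i] for r in range(n)) for j in range(i)]
        for r in range(i):
            T[r][i] = -tau[i] * sum(T[r][j] * z[j] for j in range(r, i))
    return T


def apply_block(V, T, Z):
    """Blocked update: W = V^T Z, W = T W, Z = Z - V W."""
    W = matmul(transpose(V), Z)
    W = matmul(T, W)
    VW = matmul(V, W)
    return [[z - u for z, u in zip(z_row, u_row)] for z_row, u_row in zip(Z, VW)]


def apply_reflector(v, tau, Z):
    """Unblocked reference: (I - tau v v^T) Z."""
    cols = range(len(Z[0]))
    vtZ = [sum(v[r] * Z[r][c] for r in range(len(v))) for c in cols]
    return [[Z[r][c] - tau * v[r] * vtZ[c] for c in cols] for r in range(len(v))]


# Illustrative data: n = 4, block of k = 2 reflectors, 3 eigenvectors.
V = [[1.0, 0.0], [0.5, 1.0], [0.25, -0.5], [-0.5, 0.75]]
tau = [1.2, 0.8]
Z = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [3.0, 0.5, 2.0], [-2.0, 1.0, 1.0]]

blocked = apply_block(V, form_T(V, tau), Z)
v1, v2 = transpose(V)
reference = apply_reflector(v1, tau[0], apply_reflector(v2, tau[1], Z))
max_diff = max(abs(x - y) for rb, rr in zip(blocked, reference) for x, y in zip(rb, rr))
print(max_diff)
```

The blocked form replaces $k$ rank-one updates with two matrix-matrix products, which is what allows the work to be done in \texttt{DGEMM} and $V$ to be communicated once per block.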

Back transformation differs from reduction to tridiagonal form in many ways. It requires many fewer messages: $O(n/\text{nb})$ versus $O(n)$. Because the back transformation of each eigenvector is independent, the Householder updates can be applied in a pipelined manner, allowing $V$ to be broadcast in a ring instead of a tree topology. \texttt{PDLARFB} does not use the \texttt{PBLAS}, allowing $V$ to be broadcast once but used twice. Since the number of eigenvectors does not change during the update, half of the load imbalance depends on \( \text{mod} \left( n, \text{nb} p_r \right) \) and can be reduced significantly if \( \text{mod} \left( n, \text{nb} p_r \right) = 0 \). In the following table $\text{vnb}$ is the imbalance in the 2D block-cyclic distribution of eigenvectors$^{15}$

The cost of back transformation, shown in Table 4.10, is asymmetric: the $O(n^2/p_r)$ cost is smaller than the $O(n^2/p_c)$ cost. Furthermore, the $O(n^2/p_r)$ cost can be reduced further by computing $T$ in parallel and by choosing a data layout which will minimize $\text{vnb}$. Reducing the $O(n^2/p_r)$ cost would allow $p_r < p_c$, reducing the $O(n^2/p_c)$ costs. This is discussed further in Chapter 8.

---

$^{15}$ $\text{vnb}$ is computed as follows: $\text{extravecsonproc}1 - \text{extravec}/p_r$. Where: $\text{extravec} = \text{mod} \left( n, \text{nb} p_r \right)$ and $\text{extravecsonproc}1 = \text{min(\text{nb}, extravec)}$. 
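
The footnote's definition can be written out directly. The sketch below follows the formula exactly as stated in the footnote; the function name `vnb` and the sample values of $n$, $nb$ and $p_r$ are illustrative assumptions.

```python
def vnb(n, nb, p_r):
    """Imbalance term from the footnote: extravecsonproc1 - extravec / p_r."""
    extravec = n % (nb * p_r)               # columns left over after full cycles
    extravecsonproc1 = min(nb, extravec)    # share held by the first process
    return extravecsonproc1 - extravec / p_r


# Illustrative values only.
print(vnb(100, 8, 4))   # 4 leftover columns, all on the first process
print(vnb(96, 8, 4))    # n is a multiple of nb * p_r, so no imbalance
print(vnb(112, 8, 4))   # 16 leftover columns, a full block on the first process
```

When $\text{mod}(n, nb \, p_r) = 0$ the term vanishes, which is the case in which the text notes that this part of the load imbalance can be reduced significantly.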
Table 4.7: Computation cost in PDSYEVX

<table>
<thead>
<tr>
<th>Scale factor</th>
<th>Update current column</th>
<th>Compute reflector</th>
<th>Matrix-vector product</th>
<th>Update vector product</th>
<th>Compute update vector</th>
<th>Perform rank-2k update</th>
<th>Tridiagonal eigendecomposition</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\frac{n}{p} \cdot \gamma_3$</td>
<td>$\frac{2}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
</tr>
<tr>
<td>$\frac{n}{p} \cdot \gamma_2$</td>
<td>$\frac{2}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
</tr>
<tr>
<td>$\frac{n}{p} \cdot \gamma_1$</td>
<td>$\frac{2}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
</tr>
<tr>
<td>$\frac{n}{p} \cdot \delta_1$</td>
<td>$\frac{2}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
</tr>
<tr>
<td>$\frac{n}{p} \cdot \delta_2$</td>
<td>$\frac{2}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
</tr>
<tr>
<td>$\frac{n}{p} \cdot \delta_3$</td>
<td>$\frac{2}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
</tr>
<tr>
<td>$\frac{n}{p} \cdot \delta_4$</td>
<td>$\frac{2}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
<td>$\frac{4}{3}$</td>
</tr>
</tbody>
</table>
Table 4.8: Computation cost (tridiagonal eigendecomposition) in PDSYEVX

<table>
<thead>
<tr>
<th>scale factor</th>
<th>update current column (Table 4.1)</th>
<th>compute reflector (Table 4.2)</th>
<th>matrix vector product (Table 4.3)</th>
<th>update vector product (Table 4.4)</th>
<th>compute update vector (Table 4.5)</th>
<th>perform rank-2k update (Table 4.6)</th>
<th>tridiagonal eigendecomposition (Section 4.3)</th>
<th>back transformation (Table 4.10)</th>
<th>total</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\frac{n e}{p}\gamma_{\div}$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>53</td>
</tr>
<tr>
<td>$\frac{n m}{p}\gamma_{\div}$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>3</td>
</tr>
<tr>
<td>$n\gamma_{\div}$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>112</td>
</tr>
<tr>
<td>$\frac{n e}{p}\gamma_1$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>265</td>
</tr>
<tr>
<td>$\frac{n m}{p}\gamma_1$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>45</td>
</tr>
<tr>
<td>$n\gamma_1$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>620</td>
</tr>
<tr>
<td>$n c^2\gamma_1$</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>6</td>
</tr>
</tbody>
</table>
Table 4.9: Communication cost in PDSYEVX

<table>
<thead>
<tr>
<th>scale factor</th>
<th>update current column (Table 4.1)</th>
<th>compute reflector (Table 4.2)</th>
<th>matrix vector product (Table 4.3)</th>
<th>update vector product (Table 4.4)</th>
<th>compute update vector (Table 4.5)</th>
<th>perform rank-2k update (Table 4.6)</th>
<th>tridiagonal eigendecomposition (Section 4.3)</th>
<th>back transformation (Table 4.10)</th>
<th>total</th>
</tr>
</thead>
<tbody>
<tr>
<td>(n \log_5(p_c)\alpha)</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>4</td>
</tr>
<tr>
<td>(n \log_5(p_c)\alpha)</td>
<td>2</td>
<td>3</td>
<td>2</td>
<td>4</td>
<td>2</td>
<td></td>
<td></td>
<td></td>
<td>13</td>
</tr>
<tr>
<td>(n \log_5(p_c)\alpha)</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>(n \log_5(p_c)\alpha)</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>(n \log_5(p_c)\alpha)</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>(n \log_5(p_c)\beta)</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>(n \log_5(p_c)\beta)</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>2</td>
</tr>
<tr>
<td>(n \log_5(p_c)\beta)</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(\frac{1}{2})</td>
</tr>
<tr>
<td>(n \log_5(p_c)\beta)</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>(\frac{1}{2})</td>
</tr>
<tr>
<td>(n \log_5(p_c)\beta)</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>2</td>
</tr>
<tr>
<td>(n \log_5(p_c)\beta)</td>
<td>1</td>
<td>1</td>
<td>-1</td>
<td>-1</td>
<td>1</td>
<td></td>
<td></td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>(n \log_5(p_c)\beta)</td>
<td>1</td>
<td>2</td>
<td>-1</td>
<td>-1</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>3</td>
</tr>
</tbody>
</table>
Table 4.10: The cost of back transformation (PDORMTR)

<table>
<thead>
<tr>
<th>Task</th>
<th>File:line number or subroutine</th>
<th>Execution time contribution from columns ( j = 1 ) to ( n ) shown explicitly</th>
<th>Execution time (simplified)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Compute ( T )</td>
<td>( \text{pdsevxf.f:855} ) &lt;br&gt;( \text{pdormtr.f:468} ) &lt;br&gt;( \text{pdormqr.f:394} ) &lt;br&gt;( \text{pdlaft.f:} )</td>
<td>( \sum_{n'=1}^{n} \left( \left\lfloor \log_{2}(p_{r}) \right\rfloor \frac{n'^{2}}{n'b_{p}} \beta + \right. ) &lt;br&gt;( \left. 2n' \delta_{2} + 2n'b_{r}2\gamma_{2} + \right) )</td>
<td>( 2n \delta_{2} + 0.5n^{2}nb_{p}\gamma_{2} )</td>
</tr>
<tr>
<td>Compute ( W = V^{T}Z )</td>
<td>( \text{pdsevxf.f:855} ) &lt;br&gt;( \text{pdormtr.f:468} ) &lt;br&gt;( \text{pdormqr.f:3412} ) &lt;br&gt;( \text{pdlaft.f:322, 398, 465} )</td>
<td>( \sum_{n'=1}^{n} \left( \left\lfloor \log_{2}(p_{r}) \right\rfloor \frac{m_{k}^{2}}{p_{r}} \beta + \right. ) &lt;br&gt;( \left. \frac{n'b_{r}2\gamma_{3} + 2n'b_{r}2\gamma_{2} + \right) )</td>
<td>( \frac{n'b_{r}}{p_{r}} \beta + ) &lt;br&gt;( \frac{n'b_{r}}{p_{r}} \left\lfloor \log_{2}(p_{r}) \right\rfloor \beta + ) &lt;br&gt;( \frac{n'b_{r}}{p_{r}} \delta_{3} + \frac{m_{k}^{2}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{n'b_{r}}{p_{r}} \delta_{3} + \frac{m_{k}^{2}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{2n'b_{r}}{p_{r}} \gamma_{3} + \frac{2n'b_{r}}{p_{r}} \gamma_{3} )</td>
</tr>
<tr>
<td>Compute ( W = TW )</td>
<td>( \text{pdsevxf.f:855} ) &lt;br&gt;( \text{pdormtr.f:468} ) &lt;br&gt;( \text{pdormqr.f:3412} ) &lt;br&gt;( \text{pdlaft.f:412} )</td>
<td>( \sum_{n'=1}^{n} \left( \delta_{3} + \frac{2n'b_{r}}{p_{r}} \gamma_{3} \right) )</td>
<td>( \frac{2n'b_{r}}{p_{r}} \gamma_{3} )</td>
</tr>
<tr>
<td>Compute ( Z = Z - VW )</td>
<td>( \text{pdsevxf.f:855} ) &lt;br&gt;( \text{pdormtr.f:468} ) &lt;br&gt;( \text{pdormqr.f:3412} ) &lt;br&gt;( \text{pdlaft.f:415, 425} )</td>
<td>( \sum_{n'=1}^{n} \left( \left\lfloor \log_{2}(p_{r}) \right\rfloor \frac{m_{k}}{p_{r}} \delta_{3} + \right. ) &lt;br&gt;( \left. \frac{n'b_{r}}{p_{r}} \gamma_{3} + \right) )</td>
<td>( \frac{n'b_{r}}{p_{r}} \beta + ) &lt;br&gt;( \frac{2n'b_{r}}{p_{r}} \gamma_{2} + ) &lt;br&gt;( \frac{2m_{k}^{2}}{p_{r}} \gamma_{2} + ) &lt;br&gt;( \frac{2n'b_{r}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{2m_{k}^{2}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{2n'b_{r}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{2n'b_{r}}{p_{r}} \gamma_{3} )</td>
</tr>
<tr>
<td>Total</td>
<td>( \frac{\beta b_{r}}{p_{r}} + ) &lt;br&gt;( \frac{2m_{k}^{2}}{p_{r}} \left\lfloor \log_{2}(p_{r}) \right\rfloor \delta_{3} + ) &lt;br&gt;( \frac{2n'b_{r}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( 2n'b_{r}2\gamma_{2} + ) &lt;br&gt;( \gamma_{3} + ) &lt;br&gt;( \frac{m_{k}^{2}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{m_{k}^{2}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{m_{k}^{2}}{p_{r}} \gamma_{3} )</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Standard data layout</td>
<td>( \frac{\beta b_{r}}{p_{r}} + ) &lt;br&gt;( \frac{2m_{k}^{2}}{p_{r}} \left\lfloor \log_{2}(p_{r}) \right\rfloor \delta_{3} + ) &lt;br&gt;( \frac{2n'b_{r}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( 2n'b_{r}2\gamma_{2} + ) &lt;br&gt;( \gamma_{3} + ) &lt;br&gt;( \frac{m_{k}^{2}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{m_{k}^{2}}{p_{r}} \gamma_{3} + ) &lt;br&gt;( \frac{m_{k}^{2}}{p_{r}} \gamma_{3} )</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Chapter 5

Execution time of the ScaLAPACK symmetric eigensolver, PDSYEVX, on efficient data layouts on the Paragon

The detailed execution time model gives us confidence that we understand the execution time of PDSYEVX. It explains performance on a wide range of problem sizes, data layouts, input matrices, computers and user requests. However, the same complexity that allows the detailed model to explain performance over such a large domain makes it difficult to grasp, understand and interpret. The simple six term model shown in this chapter is designed to explain the performance of the common, efficient case on a well known computer.

PDSYEVX takes 205 seconds to compute the eigendecomposition of a 3840 by 3840 symmetric random matrix on a 64 node Paragon in double precision. Counting only the $\frac{10}{3} n^3$ flops, PDSYEVX achieves 920 Megaflops per second, which equals 14 Megaflops per second per node.
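The flop rate quoted above follows directly from the timing. As a quick check, here is a minimal sketch that assumes only the $\frac{10}{3} n^3$ flop count, the 205 second run time and the 64 nodes stated in the text; the function name is illustrative and not part of ScaLAPACK.

```python
def flop_rate_mflops(n, seconds, nodes):
    """Return (aggregate, per-node) flop rates in Megaflops per second."""
    flops = 10.0 / 3.0 * n**3          # only the leading-order flop count is credited
    aggregate = flops / seconds / 1e6  # flops per second -> Megaflops per second
    return aggregate, aggregate / nodes


aggregate, per_node = flop_rate_mflops(n=3840, seconds=205.0, nodes=64)
print(f"{aggregate:.0f} Mflop/s aggregate, {per_node:.1f} Mflop/s per node")
```

The aggregate rate comes out in the hundreds of Megaflops per second, which is the scale at which the per-node figure of about 14 Megaflops per second is consistent with 64 nodes.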

For large, well behaved\(^1\) matrices, PDSYEVX is efficient, as detailed in Table 5.1. For well behaved $3840 \times 3840$ matrices, PDSYEVX spends $63\% = (28+35)\%$ of its time on necessary computation and only $35\%$ of its time on communication, load imbalance and overhead required for execution in parallel.

\(^1\) For PDSYEVX's purpose, a well behaved matrix is one which does not have any large clusters of eigenvalues whose associated eigenvectors must be computed orthogonally.

Table 5.1: Six term model for PDSYEVX on the Paragon

<table>
<thead>
<tr>
<th>Component</th>
<th>Model</th>
<th>% of time ($n = 3840$, $p = 64$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>matrix transformation computation (See section 5.3)</td>
<td>$\frac{10}{3} \frac{n^3}{p}$ ($\gamma = .0215$)</td>
<td>35</td>
</tr>
<tr>
<td>tridiagonal eigendecomposition computation (See section 5.4)</td>
<td>$239 \frac{n^2}{p}$</td>
<td>28</td>
</tr>
<tr>
<td>message initiation (See section 5.5)</td>
<td>$17 n \log_2(\sqrt{p})$ ($\alpha = 65.9$)</td>
<td>10</td>
</tr>
<tr>
<td>message transmission (See section 5.6)</td>
<td>$7 \frac{n^2}{\sqrt{p}} \log_2(\sqrt{p})$ ($\beta = .146$)</td>
<td>4</td>
</tr>
<tr>
<td>order $n$ overhead &amp; imbalance (See section 5.7)</td>
<td>$2780 n$</td>
<td>7</td>
</tr>
<tr>
<td>order $n^2$ overhead &amp; imbalance (See section 5.8)</td>
<td>$14.0 \frac{n^2}{\sqrt{p}}$</td>
<td>14</td>
</tr>
</tbody>
</table>

n Matrix size

p Number of processors

$\gamma$ Matrix-matrix multiply time (= .0215 microseconds/flop)

$\alpha$ Message latency time (= 65.9 microseconds/message)

$\beta$ Message throughput time (= .146 microseconds/word)
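The six terms of Table 5.1 can be evaluated numerically. The sketch below takes $\gamma$, $\alpha$ and $\beta$ from the legend above and uses the coefficient $\frac{10}{3}$ from the total in Table 5.2 for the matrix transformation term. It assumes that the constants 239, 2780 and 14.0 are expressed in microseconds, like the machine parameters, and the function name is illustrative. The shares it prints are fractions of the model's own total, so they need not reproduce the rounded percentages in the table.

```python
import math


def six_term_model(n, p, gamma=0.0215, alpha=65.9, beta=0.146):
    """Evaluate each term of the six term model; returns seconds per term.

    gamma, alpha and beta are in microseconds (per flop, message and word).
    """
    sqrt_p = math.sqrt(p)
    log_sqrt_p = math.log2(sqrt_p)
    microseconds = {
        "matrix transformation computation": 10.0 / 3.0 * n**3 / p * gamma,
        "tridiagonal eigendecomposition": 239.0 * n**2 / p,
        "message initiation": 17.0 * n * log_sqrt_p * alpha,
        "message transmission": 7.0 * n**2 / sqrt_p * log_sqrt_p * beta,
        "order n overhead and imbalance": 2780.0 * n,
        "order n^2 overhead and imbalance": 14.0 * n**2 / sqrt_p,
    }
    return {name: t * 1e-6 for name, t in microseconds.items()}


terms = six_term_model(n=3840, p=64)
total = sum(terms.values())
for name, seconds in terms.items():
    share = 100 * seconds / total
    print(f"{name:36s} {seconds:7.2f} s  {share:5.1f}% of model total")
```

Calling `six_term_model` with a smaller `n` shows the point made in this chapter: the cubic computation term shrinks fastest, so the communication and overhead terms take a larger share.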

Although PDSYEVX is efficient on the PARAGON, Table 5.1 shows us that there is room for improvement. Ignoring the execution time required for solution of the tridiagonal eigenproblem for the moment, we note that the matrix transformations reach only about 50% of peak performance (35% vs. 35+10+4+7+14=70%) for this problem size (roughly the largest that will fit on this PARAGON). Furthermore, efficiency will be lower for smaller problem sizes.

Unfortunately, there is no single culprit that accounts for the inefficiency. Communication accounts for a bit less than half of the inefficiency, while software overhead accounts for a bit more than half of the inefficiency.

---

\(^2\) Details about the hardware and software used for this timing run are given in table 6.3.
One could argue that while \( n = 3840 \) on 64 nodes is the largest problem that \texttt{PDSYEVX} can run on this particular computer, it is still a relatively small problem. However, there are several reasons not to ignore this result. First, while it is true that newer machines have more memory, they also have much faster floating point units, steeper memory hierarchies and few offer communication to computation ratios as high as the \texttt{PARAGON}. Furthermore, we should strive to achieve high efficiency across a range of problem sizes, not just for the largest problems that can fit on the computer. Achieving high efficiency on small problem sizes means that users can efficiently use more processors and hence reduce execution time.

In summary, \texttt{PDSYEVX} is a good starting point, but leaves room for improvement. However, significantly improving performance will require attacking more than one source of inefficiency.

The fact that \texttt{PDSYEVX} spends 28\% of its total time in solving the tridiagonal eigenproblem is a result of the slow divide on the \texttt{PARAGON}. The \texttt{PARAGON} offers two divides: a fast divide and a slow divide that meets the IEEE 754 spec\cite{IEEE754}. Although \texttt{SCALAPACK}'s bisection and inverse iteration codes are designed to work with an inaccurate divide, \texttt{SCALAPACK} uses the slow, correct divide by default.

5.1 Deriving the \texttt{PDSYEVX} execution time on the Intel Paragon (common case)

This six term model is based on the detailed model described in section 4 which has been validated on a number of distributed memory computers and a wide range of data layouts and problem sizes.

5.2 Simplifying assumptions allow the full model to be expressed as a six term model

I assume that a reasonably efficient data layout is chosen. I set the data layout parameters as follows:

\( nb = 32 \). The optimal block size on the Paragon is about 10; however, the reduction in execution time obtained by using \( nb = 10 \) rather than \( nb = 32 \) is less than 10\%, so we stick to our standard suggested value of \( nb = 32 \).

\( p_r = p_c = \sqrt{p} \). PDSYEVX achieves the best performance\(^3\) when \( p_r \leq p_c \leq \frac{p}{4} \). Assuming that \( p_r = p_c = \sqrt{p} \) allows the \( p_r \) and \( p_c \) terms to be coalesced into a single \( \sqrt{p} \) term.

\( pbf = 2 \). The panel blocking factor\(^4\) is \( pbf = \max(2, \text{lcm}(p_r, p_c)/p_r) \) in ScaLAPACK version 1.5.

\( vnb = 0 \). \( vnb \) is the imbalance in the number of rows in the original matrix as distributed amongst the processors. I assume that the matrix is initially balanced perfectly amongst all processors, i.e. \( n \) is a multiple of \( p_r \, nb \).

\( \gamma_2 = \gamma_3 \). We assume for the simplified model that all flops are performed at the peak flop rate. This introduces an error equal to \( \frac{2}{3} \frac{n^3}{p} (\gamma_2 - \gamma_3) \), which is typically no more than 2-5\% of the total time on the PARAGON.

\( m = e = n \). Assume that a full eigendecomposition is required, i.e. all eigenvalues are required (\( e = n \)) and all eigenvectors are required (\( m = n \)).

\( e = 1 \) Assume that the input matrix has no clusters of eigenvalues.

In addition, we set all of the machine parameters to constants measured or estimated on the Intel Paragon, as shown in table 6.3, in order to coalesce the overhead, load imbalance, and tridiagonal eigendecomposition terms into just three terms.
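The layout assumptions listed above are easy to check for a given problem. The following is a minimal sketch, assuming the panel blocking factor formula quoted for ScaLAPACK version 1.5 and a perfectly square processor grid; the helper names are hypothetical and do not correspond to ScaLAPACK routines.

```python
from math import gcd, isqrt


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def panel_blocking_factor(p_r, p_c):
    """pbf = max(2, lcm(p_r, p_c) / p_r), as quoted for ScaLAPACK 1.5."""
    return max(2, lcm(p_r, p_c) // p_r)


def square_grid(p):
    """Return (p_r, p_c) with p_r = p_c = sqrt(p); p must be a perfect square."""
    r = isqrt(p)
    if r * r != p:
        raise ValueError("p is not a perfect square")
    return r, r


def is_perfectly_balanced(n, p_r, nb):
    """vnb = 0 is assumed to hold when n is a multiple of p_r * nb."""
    return n % (p_r * nb) == 0


p_r, p_c = square_grid(64)
print(p_r, p_c, panel_blocking_factor(p_r, p_c), is_perfectly_balanced(3840, p_r, 32))
```

For the $n = 3840$, $p = 64$, $nb = 32$ case used throughout this chapter, the grid is 8 by 8, the panel blocking factor evaluates to 2 and the matrix is perfectly balanced, so the assumptions hold exactly. A non-square grid such as 4 by 16 gives a larger panel blocking factor.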

### 5.3 Deriving the computation time during matrix transformations in PDSYEVX on the Intel Paragon

Table 5.2 shows that PDSYTRD performs \( \frac{4}{3} \frac{n^3}{p} + O(n^2) \) flops per process. Of these, \( \frac{2}{3} \frac{n^3}{p} + O(n^2) \) are matrix vector multiply flops and \( \frac{2}{3} \frac{n^3}{p} + O(n^2) \) are matrix matrix multiply flops. PDSYTRD performs the same floating point operations that the LAPACK routine, DSYTRD, does, and \( \frac{4}{3} n^3 \) is the textbook [84] number of flops for reduction to tridiagonal form.

---

\(^3\) Performance of PDSYEVX is not overly sensitive to the data layout, provided that \( nb \) is sufficiently large to allow good DGEMM performance, that the processor grid is reasonably close to square and that \( \text{lcm}(p_r, p_c) \) is not outrageous compared to \( p_r \) and \( p_c \). (The latter factor is only relevant when one is dealing with thousands of processors.) I have not performed a detailed study of when using fewer processors results in lower execution time. However, if you drop processors only when necessary to make \( p_r \leq p_c \leq \frac{p}{16} \) and \( \text{lcm}(p_r, p_c) \leq 10 p_r \), the processor grid chosen will allow performance within 10\% of the optimal processor grid.

\(^4\) The matrix vector multiplies are each performed in panels of size \( pbf \cdot nb \). See Section 4.2.2.
Table 5.2: Computation time in \texttt{PDSYEVX}

<table>
<thead>
<tr>
<th>Task</th>
<th>Full model</th>
<th>Six term model</th>
</tr>
</thead>
<tbody>
<tr>
<td>computation time during reduction to tridiagonal form</td>
<td>$\frac{2}{3} \frac{n^3}{p} \gamma_2 + \frac{2}{3} \frac{n^3}{p} \gamma_3$</td>
<td>$\frac{4}{3} \frac{n^3}{p} \gamma_3$</td>
</tr>
<tr>
<td>(See section 4.2)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>computation time during back transformation</td>
<td>$2 \frac{n^3}{p} \gamma_3$</td>
<td>$2 \frac{n^3}{p} \gamma_3$</td>
</tr>
<tr>
<td>(See table 4.10)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>$\frac{10}{3} \frac{n^3}{p} \gamma_3$</td>
</tr>
</tbody>
</table>

Table 5.3: Execution time during tridiagonal eigendecomposition

<table>
<thead>
<tr>
<th>Task</th>
<th>Full model</th>
<th>Paragon model</th>
<th>Paragon time</th>
</tr>
</thead>
<tbody>
<tr>
<td>computation time during tridiagonal eigendecomposition</td>
<td>$265 \frac{\gamma_1}{p} + 45 \frac{\gamma_2}{p} + 2 \frac{\gamma_3}{p}$</td>
<td>$239 \frac{\gamma_1}{p}$</td>
<td>$239 \frac{\gamma_2}{p}$</td>
</tr>
<tr>
<td>(See section 4.3)</td>
<td>$53 \frac{\gamma_1}{p} + 3 \frac{\gamma_2}{p} + 2 \frac{\gamma_3}{p}$</td>
<td>$56 \frac{\gamma_1}{p}$ + $3 \frac{\gamma_2}{p}$ + $0 \frac{\gamma_3}{p}$</td>
<td>$239 \frac{\gamma_2}{p}$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>$239 \frac{\gamma_2}{p}$</td>
<td>$239 \frac{\gamma_2}{p}$</td>
</tr>
</tbody>
</table>

\texttt{PDORMTR} performs $2 \frac{n^3}{p} + O(n^2)$ flops per process. Again this is the same as the \texttt{LAPACK} routine.
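The leading-order flop counts combine into the $\frac{10}{3} \frac{n^3}{p}$ total of the six term model. The sketch below uses exact rational arithmetic to show the bookkeeping. The coefficients are in units of $\frac{n^3}{p}$ and are taken from the text (two thirds each for the matrix vector and matrix matrix parts of PDSYTRD, two for PDORMTR); the function name is illustrative.

```python
from fractions import Fraction

# Leading-order coefficients, in units of n^3 / p flops per process.
REDUCTION_MATVEC = Fraction(2, 3)  # matrix vector multiply flops in PDSYTRD
REDUCTION_MATMAT = Fraction(2, 3)  # matrix matrix multiply flops in PDSYTRD
BACK_TRANSFORM = Fraction(2)       # PDORMTR


def leading_flops_per_process(n, p):
    """Return (total coefficient, leading-order flops per process)."""
    coeff = REDUCTION_MATVEC + REDUCTION_MATMAT + BACK_TRANSFORM
    return coeff, coeff * Fraction(n**3, p)


coeff, flops = leading_flops_per_process(3840, 64)
print(coeff, int(flops))
```

Multiplying the flop count by $\gamma_3$ gives the matrix transformation term of the six term model, under the stated assumption that all flops run at the matrix matrix multiply rate.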

5.4 Deriving the computation time during eigendecomposition of the tridiagonal matrix in \texttt{PDSYEVX} on the Intel Paragon

The computation time during tridiagonal eigendecomposition, in the absence of clusters of eigenvalues, is $O(n^2)$ and hence becomes less important for large $n$.

The simplified model for the execution time of the tridiagonal eigensolution on the \texttt{PARAGON} in table 5.3 is obtained from the detailed model by replacing $\gamma_1$ and $\gamma_2$ with their values on the \texttt{PARAGON} and by assuming that all clusters of eigenvalues are of modest size.

Load imbalance during the tridiagonal eigendecomposition is caused in part by the fact that not all processes will be assigned the same number of eigenvalues and eigenvectors and in part by the fact that different eigenvalues and eigenvectors will require slightly different amounts of computation. Our experience indicates that the load imbalance corresponds roughly to the cost of finding two eigenvalues and two eigenvectors.
Table 5.4: Message initiations in PDSYEVX

<table>
<thead>
<tr>
<th>Task</th>
<th>Full model</th>
<th>Six term model</th>
</tr>
</thead>
<tbody>
<tr>
<td>message initiation during reduction to tridiagonal form (See table 4.9)</td>
<td>$(13 \lceil \log_2(p_r) \rceil + 4 \lceil \log_2(p_c) \rceil) n \alpha$</td>
<td>$17 n \log_2(\sqrt{p}) \alpha$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>$17 n \log_2(\sqrt{p}) \alpha$</td>
</tr>
</tbody>
</table>

Table 5.5: Message transmission in PDSYEVX

<table>
<thead>
<tr>
<th>Task</th>
<th>Full model</th>
<th>Six term model</th>
</tr>
</thead>
<tbody>
<tr>
<td>message transmission time during reduction to tridiagonal form (See table 4.9)</td>
<td>$(3 \lceil \log_2(p_r) \rceil \frac{n^2}{p_r} + 2 \lceil \log_2(p_c) \rceil \frac{n^2}{p_c}) \beta$</td>
<td>$5 \frac{n^2}{\sqrt{p}} \log_2(\sqrt{p}) \beta$</td>
</tr>
<tr>
<td>message transmission time during back transformation (See table 4.10)</td>
<td>$2 \lceil \log_2(p_r) \rceil \frac{n^2}{p_r} \beta$</td>
<td>$2 \frac{n^2}{\sqrt{p}} \log_2(\sqrt{p}) \beta$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td>$7 \frac{n^2}{\sqrt{p}} \log_2(\sqrt{p}) \beta$</td>
</tr>
</tbody>
</table>

5.5 Deriving the message initiation time in PDSYEVX on the Intel Paragon

Table 5.4 shows that PDSYEVX requires $17 n \log_2(\sqrt{p})$ message initiations.

5.6 Deriving the inverse bandwidth time in PDSYEVX on the Intel Paragon

Table 5.5 shows that PDSYEVX transmits $7 \frac{n^2}{\sqrt{p}} \log_2(\sqrt{p})$ words per node.
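The way the full communication models collapse into the six term model on a square grid can be sketched in a few lines. The transmission coefficients (3, 2 and 2) follow Table 5.5. The split of the message initiation coefficient into 13 and 4 is an assumption, chosen so that the two sum to the 17 of the six term model. All function names are illustrative.

```python
def ceil_log2(k):
    """Exact ceil(log2(k)) for a positive integer k."""
    return (k - 1).bit_length()


def message_initiations(n, p_r, p_c, row_coeff=13, col_coeff=4):
    """Full-model message count; row_coeff + col_coeff = 17 is assumed."""
    return (row_coeff * ceil_log2(p_r) + col_coeff * ceil_log2(p_c)) * n


def words_transmitted(n, p_r, p_c):
    """Full-model word count for reduction plus back transformation."""
    reduction = (3 * ceil_log2(p_r) / p_r + 2 * ceil_log2(p_c) / p_c) * n**2
    back_transformation = 2 * ceil_log2(p_r) / p_r * n**2
    return reduction + back_transformation


n, p_r, p_c = 3840, 8, 8
print(message_initiations(n, p_r, p_c), words_transmitted(n, p_r, p_c))
```

With $p_r = p_c = \sqrt{p}$ the two functions reduce to $17 n \log_2(\sqrt{p})$ and $7 \frac{n^2}{\sqrt{p}} \log_2(\sqrt{p})$. Passing a rectangular grid shows how the counts shift between the row and column terms.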

5.7 Deriving the PDSYEVX order n imbalance and overhead term on the Intel Paragon

Table 5.6 shows the origin of the $\theta(n)$ load imbalance cost on the Intel Paragon.
Table 5.6: $\theta(n)$ load imbalance cost on the PARAGON

<table>
<thead>
<tr>
<th>Task</th>
<th>Full model</th>
<th>Paragon model</th>
<th>Paragon time</th>
</tr>
</thead>
<tbody>
<tr>
<td>load imbalance during eigendecomposition (See section 4.3)</td>
<td>$620\gamma_1 + 112\gamma_2$</td>
<td>$620 \times 0.0740 + 112 \times 3.85$</td>
<td>$477\,n$</td>
</tr>
<tr>
<td>order $n$ overhead term in reduction to tridiagonal form (See table 4.7)</td>
<td>$9\delta_4 + 6\delta_2$</td>
<td>$9 \times 235 + 6 \times 23.5$</td>
<td>$2256\,n$</td>
</tr>
<tr>
<td>order $n$ overhead term in back transformation (See table 4.10)</td>
<td>$2\delta_2$</td>
<td>$2 \times 23.5$</td>
<td>$47\,n$</td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td></td>
<td>$2780\,n$</td>
</tr>
</tbody>
</table>
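The arithmetic behind the order $n$ term can be checked with a short sketch. The four constants below are assumptions: they are the values (in microseconds) for which the products reproduce the Paragon time column of Table 5.6, namely 477, 2256 and 47, and the variable names are illustrative.

```python
# Assumed Paragon constants in microseconds, chosen to be consistent with
# the "Paragon time" column of Table 5.6.
FIRST_IMBALANCE_COST = 0.0740   # multiplies the coefficient 620
SECOND_IMBALANCE_COST = 3.85    # multiplies the coefficient 112
DELTA_4 = 235.0                 # multiplies the coefficient 9
DELTA_2 = 23.5                  # multiplies the coefficients 6 and 2


def order_n_coefficients():
    """Return the three contributions to the order n term, per unit of n."""
    return {
        "eigendecomposition load imbalance":
            620 * FIRST_IMBALANCE_COST + 112 * SECOND_IMBALANCE_COST,
        "reduction to tridiagonal form overhead": 9 * DELTA_4 + 6 * DELTA_2,
        "back transformation overhead": 2 * DELTA_2,
    }


parts = order_n_coefficients()
for name, value in parts.items():
    print(f"{name:40s} {value:8.2f}")
print(f"{'total':40s} {sum(parts.values()):8.2f}")
```

Rounded to the nearest integer, the three contributions sum to the $2780\,n$ used in the six term model.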

Table 5.7: Order $\frac{n^2}{\sqrt{p}}$ load imbalance and overhead term on the PARAGON

<table>
<thead>
<tr>
<th>Task</th>
<th>Full model</th>
<th>Paragon model</th>
<th>Paragon time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Order $n^2/\sqrt{p}$ overhead term in reduction to tridiagonal form (See table 4.7)</td>
<td>$\frac{n^2}{\sqrt{p}} \gamma_2 \times 32 + \frac{n^2}{\sqrt{p}} \delta_2 \times 32$</td>
<td>$4.70 \frac{n^2}{\sqrt{p}}$</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Order $n^2/\sqrt{p}$ overhead term in reduction to tridiagonal form (See table 4.7)</td>
<td>$\frac{n^2}{\sqrt{p}} \gamma_2 \times 32 + \frac{n^2}{\sqrt{p}} \delta_2 \times 32$</td>
<td>$6.81 \frac{n^2}{\sqrt{p}}$</td>
<td></td>
</tr>
<tr>
<td>Order $n^2/\sqrt{p}$ overhead term in back transformation (See table 4.10)</td>
<td>$0.5 \frac{n^2}{\sqrt{p}} \gamma_2 \times 32 + \frac{n^2}{\sqrt{p}} \delta_2 \times 32$</td>
<td>$2.46 \frac{n^2}{\sqrt{p}}$</td>
<td></td>
</tr>
<tr>
<td>Total</td>
<td></td>
<td></td>
<td>$14.0 \frac{n^2}{\sqrt{p}}$</td>
</tr>
</tbody>
</table>

5.8 Deriving the PDSYEVX order $\frac{n^2}{\sqrt{p}}$ imbalance and overhead term on the Intel Paragon

The order $\frac{n^2}{\sqrt{p}}$ load imbalance and overhead term on the Intel Paragon, $14.0 \frac{n^2}{\sqrt{p}}$, is derived in table 5.7.

See section 5.2 for details on the assumptions made to simplify the full model to the six term model. Note that $vnb$ is assumed to be zero and that $pbf$ is assumed to be 2.
Chapter 6

Performance on distributed memory computers

6.1 Performance requirements of distributed memory computers for running PDSYEVX efficiently

The most important feature of a parallel computer is its peak flop rate. Indeed, everything else is measured against the peak flop rate. The second most important feature is main memory, but which feature of main memory is most important depends on whether you want peak efficiency (i.e., using as few processors as possible) or minimum execution time (i.e., using more processors). If you plan to use only as many processors as necessary, filling each processor’s memory completely, then main memory size is the most important factor controlling efficiency. If you plan to use more processors, main memory random access time becomes the most important factor.

Network performance of today’s distributed memory computers is good enough to keep communication cost from being the limiting factor on performance. Furthermore, if the network performance (either latency or bandwidth) were the limiting factor, there are ways that we could reduce the communication cost by as much as a factor of \(\log(\sqrt{p})\) [107]. Still, if one has a network of workstations connected by a single Ethernet or FDDI ring, the very low bisection bandwidth will always keep efficiency low. See section 8.4.2 for details.
6.1.1 Bandwidth rule of thumb

Bandwidth rule of thumb: Bisection bandwidth per processor\(^1\) times the square root of memory size per processor should exceed floating point performance per processor.

\[
\frac{\text{Megabytes/sec}}{\text{processor}} \times \sqrt{\frac{\text{Megabytes}}{\text{processor}}} > \frac{\text{Megaflops/sec}}{\text{processor}}
\]

assures that bandwidth will not limit performance.

The bandwidth rule of thumb shows that if memory size grows as fast as peak floating point execution rate, the network bisection bandwidth need only grow as the square root of the peak floating point execution rate. This is very encouraging for the future of parallel computing. This rule also shows that the bandwidth requirement grows as the problem size decreases. This rule does not make as wide a claim as the memory rule of thumb; it does not promise that \texttt{PDSYEVX} will be efficient, only that bandwidth will not be the limiting factor.

Provided the bandwidth rule of thumb holds, execution time attributable to message volume will not exceed 40\% of the time devoted to floating point execution in \texttt{PDSYEVX} on problems that nearly fill memory.
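The bandwidth rule of thumb is a one-line inequality, so it is easy to test against a machine description. The sketch below is illustrative only: the function name is hypothetical and the machine figures in the example are made up, not measurements of any real computer.

```python
import math


def bandwidth_rule_holds(bisection_mbytes_per_sec, memory_mbytes, mflops_per_sec):
    """Bandwidth rule of thumb, with all quantities per processor.

    Bisection bandwidth (Megabytes/sec) times the square root of memory size
    (Megabytes) should exceed floating point performance (Megaflops/sec).
    """
    return bisection_mbytes_per_sec * math.sqrt(memory_mbytes) > mflops_per_sec


# Hypothetical machine: 50 Megabytes/sec, 64 Megabytes, 200 Megaflops/sec.
print(bandwidth_rule_holds(50, 64, 200))
# Same machine with a quarter of the memory: the rule is no longer satisfied.
print(bandwidth_rule_holds(50, 16, 200))
# Flop rate and memory both grow by 4x; bandwidth only needs to grow by 2x.
print(bandwidth_rule_holds(100, 256, 800))
```

The third call illustrates the scaling argument made above: when memory keeps pace with the flop rate, the bandwidth only has to grow as its square root.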

6.1.2 Memory size rule of thumb

Memory size rule of thumb: memory size should match floating point performance

\[
\frac{\text{Megabytes}}{\text{processor}} > \frac{\text{Megaops/sec}}{\text{processor}}
\]

assures that \texttt{PDSYEVX} will be efficient on large problems.

This rule is sufficient because it holds even if message latency and software overhead hold constant as peak performance increases and network bisection bandwidth and \texttt{BLAS2} performance increase as slowly as the square root of the increase in the peak flop rate.

\(^{1}\)Bisection bandwidth per processor is the total bisection bandwidth of the network divided by the number of processors.
Message latency and software overhead are limited by main memory access time, which decreases slowly, but bisection bandwidth and BLAS2 performance (which is limited by main memory bandwidth) continue to improve though not as rapidly as peak performance.

When the number of megabytes of main memory equals the peak floating point rate (in megaflops/sec), message latency will typically account for ten times less execution time than the time devoted to floating point execution in PDSYEVX on problems that nearly fill memory. The arithmetic in figure 6.2 justifies this statement provided that message latency does not exceed 100 microseconds.

The memory rule of thumb is too simple to capture all aspects of any computer; nonetheless, we have found it to be useful. The derivation in figure 6.2 makes two main assumptions: latency is around 100 microseconds and \(\lceil \log_2(\sqrt{p}) \rceil = 3\). Seldom will either be exactly correct, but in our experience neither will tend to be too small by more than a factor of \(2\) (i.e. \(p \leq 4096\)). The memory rule of thumb also depends on sufficient bandwidth and on reasonable BLAS2 and software overhead costs. As we will show next, network bandwidth capacity and BLAS2 performance need not grow rapidly to support this rule and software overhead costs need only remain constant.
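The claim that latency costs about a tenth of the floating point time when the memory rule holds can be reproduced with a short calculation. This is an independent sketch, not a transcription of figure 6.2. It assumes the $17 n \log_2(\sqrt{p})$ message count and the $\frac{10}{3} \frac{n^3}{p}$ flop count of the six term model, a latency of 100 microseconds, $\lceil \log_2(\sqrt{p}) \rceil = 3$, and a memory use of $6 n^2 / p$ double precision words of 8 bytes each (an assumption). The function names are illustrative.

```python
def memory_rule_holds(memory_mbytes, mflops_per_sec):
    """Memory rule of thumb: Megabytes per processor exceed Megaflops/sec."""
    return memory_mbytes > mflops_per_sec


def latency_to_flop_ratio(mflops_per_sec, memory_mbytes, latency_us=100.0,
                          log_term=3, words_per_entry=6, bytes_per_word=8):
    """Message initiation time divided by floating point time.

    Assumes the problem fills memory, so that n^2 / p is fixed by the
    memory size per processor.
    """
    n2_over_p = memory_mbytes * 1e6 / (words_per_entry * bytes_per_word)
    gamma3_us = 1.0 / mflops_per_sec              # microseconds per flop
    latency_time = 17 * log_term * latency_us     # per unit of n
    flop_time = 10.0 / 3.0 * n2_over_p * gamma3_us  # per unit of n
    return latency_time / flop_time


print(latency_to_flop_ratio(mflops_per_sec=100, memory_mbytes=100))
```

Under these assumptions the ratio depends only on the quotient of Megaflops/sec to Megabytes, so any machine that sits exactly on the memory rule of thumb gives the same value, and machines with more memory give a smaller one.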

The memory rule of thumb holds for all computers marketed as distributed memory computers, but does not hold for non-scalable or extremely low bandwidth networks. One could design a distributed memory computer for which this rule does not hold, but the features that are necessary for this rule to hold are also important for a range of other applications and hence we expect this rule to hold for essentially all distributed memory computers.

\[
\begin{aligned}
\frac{\text{message transmission time}}{\text{floating point execution time}}
&= \frac{7.5 \, n^2/\sqrt{p} \, \lceil \log_2(\sqrt{p}) \rceil \, \beta}{10/3 \, n^3/p \, \gamma_3} && \text{Table 5.1} \\
&= \frac{7.5 \, \lceil \log_2(\sqrt{p}) \rceil \, \beta}{10/3 \, n/\sqrt{p} \, \gamma_3} && \text{Cancel } n^2/\sqrt{p} \\
&= \frac{7.5 \, \lceil \log_2(\sqrt{p}) \rceil \, \beta}{10/3 \, \sqrt{M \, 10^6/(6 \times 8)} \, \gamma_3} && \text{PDSYEVX uses } 6 \, n^2/p \text{ DP words} \\
&= \frac{7.5 \times 3 \times 8 \times 10^{-6}/mbs}{10/3 \, \sqrt{M \, 10^6/(6 \times 8)} \, 10^{-6}/mfs} && \lceil \log_2(\sqrt{p}) \rceil \approx 3,\ \beta = 8 \times 10^{-6}/mbs,\ \gamma_3 = 10^{-6}/mfs \\
&= \frac{7.5 \times 3 \times 8 \sqrt{6 \times 8} \; mfs}{10/3 \; 10^3 \sqrt{M} \; mbs} && \text{Simplify} \\
&= 0.374 \, \frac{mfs}{\sqrt{M} \, mbs}
\end{aligned}
\]
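The constant 0.374 in the bandwidth derivation (message transmission time relative to floating point execution time) can be recomputed numerically. The sketch below follows the same steps, taking $mfs$ as Megaflops/sec per processor, $mbs$ as Megabytes/sec per processor and $M$ as Megabytes of memory per processor. It assumes $\lceil \log_2(\sqrt{p}) \rceil = 3$, 8 byte words and a memory use of $6 n^2 / p$ words; the function name is illustrative.

```python
import math


def transmission_to_flop_ratio(mfs, mbs, memory_mbytes, log_term=3,
                               words_per_entry=6, bytes_per_word=8):
    """Message transmission time divided by floating point time."""
    beta = bytes_per_word * 1e-6 / mbs   # seconds per word
    gamma3 = 1e-6 / mfs                  # seconds per flop
    # n / sqrt(p) when the problem fills memory.
    n_over_sqrt_p = math.sqrt(
        memory_mbytes * 1e6 / (words_per_entry * bytes_per_word))
    return 7.5 * log_term * beta / (10.0 / 3.0 * n_over_sqrt_p * gamma3)


# With mfs = mbs = M = 1 the ratio is the bare constant of the derivation.
print(transmission_to_flop_ratio(mfs=1, mbs=1, memory_mbytes=1))
# A hypothetical machine sitting exactly on the bandwidth rule of thumb.
print(transmission_to_flop_ratio(mfs=400, mbs=50, memory_mbytes=64))
```

When the bandwidth rule of thumb holds, $mbs \sqrt{M}$ is at least $mfs$, so the ratio stays at or below the constant, which is consistent with the 40\% bound stated in section 6.1.1.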

The memory rule of thumb, while sufficient, is not necessary. It is possible to achieve efficiency with PDSYEVX on computers whose memory is smaller than that suggested by this rule\(^2\). In section 6.1.3 I discuss what properties a computer must have to allow efficient execution on smaller problem sizes.

Though meeting the memory rule of thumb is not necessary to achieve high performance, there are reasons to believe that it will be useful for several years. Software latencies are not decreasing rapidly. Software overhead, since it is tied to main memory latency, is not decreasing rapidly either. Bisection bandwidth and BLAS2 performance are increasing, but not as fast as peak floating point efficiency.

On the other hand, improvements to PDSYEVX will make it possible to achieve high performance with less memory and may someday obsolete the memory rule of thumb.

\(^2\)The PARAGON is an example.
6.1.3 Performance requirements for minimum execution time

If you intend to use as many processors as possible to minimize execution time, the second most important machine characteristic (after peak floating point rate) is main memory speed. Main memory speed affects three of the four sources of inefficiencies in PDSYEVX: message initiation, load imbalance and software overhead. Message initiation and software overhead costs are controlled by how long it takes to execute a stream of code with little data or code locality. Since the communication software initiation code offers little code or data locality, its execution time is largely dependent on main memory latency.

Load imbalance consists mainly of BLAS2 row and column operations. The BLAS2 flop rate is controlled by main memory bandwidth. Smaller main memory bandwidth also requires a larger blocking factor in order to achieve peak floating point performance in matrix matrix multiply. Larger blocking factors mean more BLAS2 row and column operations. Hence reduced main memory speed has a double effect on the cost of row and column operations: increasing the number of them while increasing the cost per operation.

Caches can be used to improve memory performance, however the value of caches is reduced by several factors: The inner loop in reduction to tridiagonal form, the source of most of the inefficiency in PDSYEVX, is substantial and includes many subroutine calls. ScaLAPACK is a layered library which includes the PBLAS, BLAS, BLACS and the underlying communication software. The inner loop in reduction to tridiagonal form touches every element in the unreduced (trailing) part of the matrix. The second level cache is typically shared between code and data. Even the way that BLAS routines are typically coded impacts the value of caches in PDSYEVX. The fact that the inner loop in reduction to tridiagonal form includes many subroutine calls combined with ScaLAPACK's layered approach means that this inner loop typically involves many code cache misses. Indeed even the much simpler inner loop in LU involves many code cache misses in ScaLAPACK[160]. Since this same inner loop touches every element in the matrix, the secondary cache, typically shared by both code and data, will be completely flushed each time through the loop meaning that code cache misses will have to be satisfied by main memory.

The way that BLAS routines are typically optimized leads to a high code cache miss rate. BLAS routines are typically coded and optimized by timing them on a representative set of requests[92]. Each request however is typically run many times and the times are averaged. Each run may involve different data to ensure that the times represent the cost of moving the data from main memory. However, no effort is made\(^3\) to account for the cost of moving the code from main memory. Hence, the code cache is a resource to which no cost is assigned during optimization. Loop unrolling can vastly expand the code cache requirements but it can also improve performance, at least if the code is in cache. Hence it is likely that in optimizing BLAS codes, some loops get unrolled to the point where they use half or more of the code cache. If two such codes are called in the same loop, code cache misses are inevitable. The unfortunate aspect of this is that the hardware designer is powerless to prevent it. Increasing the size of the code cache might lead to even more loop unrolling and even worse performance.

There are two ways that hardware manufacturers could make caches more useful. One would be to improve the way that BLAS codes are optimized to ensure that the code cache is a recognized resource (either by measuring code cache use in each call or by having the codes optimized on a system with smaller cache sizes than those offered to the public). The second would be to allow a path from main memory to the register file that bypasses the cache. In the inner loop of reduction to tridiagonal form, every element of the matrix is touched, but there is no temporal locality and no point in moving these elements up the cache hierarchy. If these calls to the BLAS matrix-vector multiply routine, \texttt{DGEMV}, could be made to bypass the caches, these caches would remain useful in the other portions of the code, i.e. software overhead and communication latency. Even row and column operations would benefit because these operations involve data locality across loop iterations; this data locality is made worthless by the fact that the loop touches every element in the matrix each time through, but could be useful if certain \texttt{DGEMV} calls could be made to bypass the caches. This would require a coordinated software and hardware effort.

Secondary caches are of little importance in determining \texttt{PDSYEVX} execution time because the inner loop traverses the entire matrix without any data temporal locality within the loop. Secondary caches are important to achieving peak matrix-matrix multiply performance, but that is their only use in \texttt{PDSYEVX}. This is because in principle if the secondary cache were large enough and the problem small enough, secondary cache could hold the entire matrix and hence act as fast main memory. Unfortunately, secondary caches are never large enough to support an efficient problem size.

I would hope that, if there are other applications like \texttt{PDSYEVX} that could make efficient use of smaller faster memories, some vendor or vendors will build some machines with smaller faster main memory. I suspect that more applications need large slow memory than small fast memory. Indeed, PDSYEVX can work well either way. But, especially with improvements to PDSYEVX that will allow it to achieve high performance on smaller problem sizes, PDSYEVX could achieve impressive results on a distributed memory machine with half the main memory now typical of distributed memory parallel computers if that smaller main memory could be made modestly, say 20%, faster. With the out-of-core symmetric eigensolver being developed by Ed D’Azevedo (based on my suggestion to reduce main memory requirements from $4n^2$ to $\frac{1}{2}n^2$ by using symmetric packed storage during the reduction to tridiagonal form and two passes through back transformation), the main memory requirements of PDSYEVX will drop by a factor of 6 to 12, furthering the argument for smaller, faster main memory.

\(^3\)It is difficult to account for the cost of moving the code from main memory.

As ScaLAPACK improves, it will be able to achieve high efficiency on smaller problem sizes. This will mean that the best machines for ScaLAPACK will have less memory than that suggested by the memory rule of thumb at the top of this chapter.

6.1.4 Gang scheduling

A code which involves frequent synchronizations, such as reduction to tridiagonal form, requires either dedicated use of the nodes upon which it runs or gang scheduling. If even one node is not participating in the computation, the computation will stall at the next synchronization point.

6.2.1 Consistent performance on all nodes

A statically load balanced code, such as PDSYEVX, will execute only as fast as the slowest node on which it is run. This, like the need for gang scheduling, is obvious. Yet, occasionally nodes which have identical specifications perform differently. Kathy Yelick noticed that some nodes of the CM5 at Berkeley were slower than others. I also have reason to believe that at least two of the nodes on the PARAGON at University of Tennessee at Knoxville are slower than the others (see Table 6.3).

The people who design and maintain distributed memory parallel computers should make sure that slow nodes are identified and marked as such or taken off-line.

Table 6.1: Performance

<table>
<thead>
<tr>
<th></th>
<th>message latency ( \alpha )</th>
<th>transmission cost per word ( \beta )</th>
<th>BLAS1 flop rate ( \gamma_1 )</th>
<th>matrix-vector multiply software overhead ( \delta_2 )</th>
<th>matrix-vector multiply flop rate ( \gamma_2 )</th>
<th>matrix-matrix multiply flop rate ( \gamma_3 )</th>
<th>divide ( \gamma_4 )</th>
</tr>
</thead>
<tbody>
<tr>
<td>IBM SP2</td>
<td>54</td>
<td>0.12 (67)</td>
<td>.0037</td>
<td>.25 (4)</td>
<td>5</td>
<td>??</td>
<td>P</td>
</tr>
<tr>
<td>PARAGON</td>
<td>66</td>
<td>0.14 (57)</td>
<td>0.0235</td>
<td>3.8 (.26)</td>
<td>80</td>
<td>P</td>
<td>??</td>
</tr>
</tbody>
</table>

6.3 Performance characteristics of distributed memory computers

6.3.1 PDSYEVX execution time (predicted and actual)

Table 6.3 compares predicted and actual performance on the Intel PARAGON. Actual PDSYEVX performance rarely exceeds the performance predicted by our model (only the two runs with nb = 8 in Table 6.3 ran slightly faster than predicted) and is usually within 15% of the predicted performance. I would be satisfied with a performance model that is within 20% to 25%, and would not expect this performance model to match to within 15% on other machines. Every run whose actual execution time is more than 15% greater than the expected execution time is marked with an asterisk. I have checked several of these runs and have noticed that in them one or two processors have noticeably slower performance on DGEMV than the other processors. I have also rerun many of these aberrant timings, and for each that I have rerun, at least one of the runs completed within 15% of predicted performance. Nonetheless, this aberrant behavior deserves further study.
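
The 15% criterion can be checked mechanically. The following Python sketch is an illustration only, not part of the timing harness used for this thesis; the function name `flag_aberrant` and the constant `THRESHOLD` are hypothetical, and the five runs are copied from Table 6.3.

```python
# Each tuple: (n, nprow, npcol, nb, actual_seconds, estimated_seconds),
# copied from a few rows of Table 6.3.
RUNS = [
    (375, 2, 4, 32, 8.51, 8.24),
    (375, 4, 8, 32, 6.34, 4.65),
    (750, 2, 4, 32, 41.2, 30.1),
    (1000, 2, 4, 8, 52.9, 54.4),
    (3000, 4, 8, 32, 214.0, 191.0),
]

THRESHOLD = 1.15  # hypothetical name: actual time more than 15% above the estimate


def flag_aberrant(runs, threshold=THRESHOLD):
    """Return (run, estimated/actual ratio, is_aberrant) for each run."""
    report = []
    for run in runs:
        actual, estimated = run[4], run[5]
        ratio = round(estimated / actual, 2)
        report.append((run, ratio, actual > threshold * estimated))
    return report


for run, ratio, aberrant in flag_aberrant(RUNS):
    mark = "*" if aberrant else ""
    print(f"n={run[0]:5d} grid={run[1]}x{run[2]} nb={run[3]:2d} "
          f"estimated/actual={ratio:.2f}{mark}")
```

The ratio printed here is estimated time divided by actual time, which is how the last column of Table 6.3 appears to be computed.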
<table>
<thead>
<tr>
<th></th>
<th>PARAGON MP</th>
<th>IBM SP2(^1)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Processor</td>
<td>50 MHz i860 XP</td>
<td>120 MHz POWER2 SC</td>
</tr>
<tr>
<td>Location</td>
<td>xps5.ccs.ornl.gov</td>
<td>chowder.ccs.utk.edu</td>
</tr>
<tr>
<td>Data cache</td>
<td>16K bytes</td>
<td>128K bytes</td>
</tr>
<tr>
<td></td>
<td>4-way set-associative write-back</td>
<td></td>
</tr>
<tr>
<td></td>
<td>32-byte lines(^5)</td>
<td></td>
</tr>
<tr>
<td>Code cache</td>
<td>16K bytes</td>
<td>32K bytes</td>
</tr>
<tr>
<td></td>
<td>4-way set-associative</td>
<td></td>
</tr>
<tr>
<td></td>
<td>32-byte blocks</td>
<td></td>
</tr>
<tr>
<td>Second level cache</td>
<td>None</td>
<td>None</td>
</tr>
<tr>
<td>Processors per node</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Memory per node</td>
<td>32 Mbytes</td>
<td>256 Mbytes</td>
</tr>
<tr>
<td>Operating system</td>
<td>Paragon OSF/1 xps5</td>
<td>AIX</td>
</tr>
<tr>
<td></td>
<td>1.0.4 R1.4.5</td>
<td></td>
</tr>
<tr>
<td>ScaLAPACK</td>
<td>1.5</td>
<td>1.5</td>
</tr>
<tr>
<td>BLAS</td>
<td>-lkmath</td>
<td>-lesslp2</td>
</tr>
<tr>
<td>BLACS</td>
<td>NX BLACS</td>
<td>MPI BLACS</td>
</tr>
<tr>
<td>Communication software</td>
<td>NX</td>
<td>MPI</td>
</tr>
<tr>
<td>Precision</td>
<td>Double</td>
<td>Double</td>
</tr>
<tr>
<td></td>
<td>64 bits</td>
<td>64 bits</td>
</tr>
</tbody>
</table>

Table 6.2: Hardware and software characteristics of the PARAGON and the IBM SP2.
Table 6.3: Predicted and actual execution times of `PDSYEVX` on xps5, an Intel PARAGON. Problem sizes which resulted in an execution time more than 15% greater than predicted are marked with an asterisk. Many of these problem sizes were rerun to show that the unusually large execution times are aberrant.

<table>
<thead>
<tr>
<th>n</th>
<th>nprow</th>
<th>npcol</th>
<th>nb</th>
<th>Actual time (seconds)</th>
<th>Estimated time (seconds)</th>
<th>Estimated time / Actual time</th>
</tr>
</thead>
<tbody>
<tr>
<td>375</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>8.51</td>
<td>8.24</td>
<td>0.97</td>
</tr>
<tr>
<td>375</td>
<td>4</td>
<td>8</td>
<td>32</td>
<td>6.34</td>
<td>4.65</td>
<td>0.73*</td>
</tr>
<tr>
<td>750</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>31.2</td>
<td>30.1</td>
<td>0.96</td>
</tr>
<tr>
<td>750</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>31.3</td>
<td>30.1</td>
<td>0.96</td>
</tr>
<tr>
<td>750</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>31.5</td>
<td>30.1</td>
<td>0.96</td>
</tr>
<tr>
<td>750</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>41.2</td>
<td>30.1</td>
<td>0.73*</td>
</tr>
<tr>
<td>750</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>43.3</td>
<td>30.1</td>
<td>0.7*</td>
</tr>
<tr>
<td>750</td>
<td>4</td>
<td>4</td>
<td>32</td>
<td>20.3</td>
<td>18.9</td>
<td>0.93</td>
</tr>
<tr>
<td>750</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>16.5</td>
<td>15.3</td>
<td>0.93</td>
</tr>
<tr>
<td>750</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>22.3</td>
<td>15.3</td>
<td>0.69*</td>
</tr>
<tr>
<td>750</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>23.1</td>
<td>15.3</td>
<td>0.66*</td>
</tr>
<tr>
<td>750</td>
<td>4</td>
<td>8</td>
<td>32</td>
<td>14.1</td>
<td>13.2</td>
<td>0.93</td>
</tr>
<tr>
<td>1000</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>55.8</td>
<td>53.8</td>
<td>0.96</td>
</tr>
<tr>
<td>1000</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>52.9</td>
<td>54.4</td>
<td>1.0</td>
</tr>
<tr>
<td>1000</td>
<td>4</td>
<td>2</td>
<td>32</td>
<td>56.5</td>
<td>54.9</td>
<td>0.97</td>
</tr>
<tr>
<td>1000</td>
<td>4</td>
<td>2</td>
<td>8</td>
<td>56.2</td>
<td>59.3</td>
<td>1.1</td>
</tr>
<tr>
<td>1125</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>72.2</td>
<td>68.8</td>
<td>0.95</td>
</tr>
<tr>
<td>1125</td>
<td>4</td>
<td>8</td>
<td>32</td>
<td>38.2</td>
<td>26.7</td>
<td>0.7*</td>
</tr>
<tr>
<td>1500</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>133</td>
<td>127</td>
<td>0.95</td>
</tr>
<tr>
<td>1500</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>134</td>
<td>127</td>
<td>0.95</td>
</tr>
<tr>
<td>1500</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>134</td>
<td>127</td>
<td>0.95</td>
</tr>
<tr>
<td>1500</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>176</td>
<td>127</td>
<td>0.73*</td>
</tr>
<tr>
<td>1500</td>
<td>2</td>
<td>4</td>
<td>32</td>
<td>183</td>
<td>127</td>
<td>0.7*</td>
</tr>
<tr>
<td>1500</td>
<td>4</td>
<td>4</td>
<td>32</td>
<td>77.2</td>
<td>72.9</td>
<td>0.94</td>
</tr>
<tr>
<td>1500</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>77</td>
<td>55</td>
<td>0.71*</td>
</tr>
<tr>
<td>1500</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>59.3</td>
<td>55</td>
<td>0.93</td>
</tr>
<tr>
<td>1500</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>80.9</td>
<td>55</td>
<td>0.68*</td>
</tr>
<tr>
<td>1500</td>
<td>4</td>
<td>8</td>
<td>32</td>
<td>48.6</td>
<td>45.2</td>
<td>0.93</td>
</tr>
<tr>
<td>1875</td>
<td>4</td>
<td>8</td>
<td>32</td>
<td>99.7</td>
<td>70.9</td>
<td>0.71*</td>
</tr>
<tr>
<td>2250</td>
<td>4</td>
<td>4</td>
<td>32</td>
<td>186</td>
<td>175</td>
<td>0.94</td>
</tr>
<tr>
<td>2250</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>138</td>
<td>127</td>
<td>0.92</td>
</tr>
<tr>
<td>2250</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>179</td>
<td>127</td>
<td>0.71*</td>
</tr>
<tr>
<td>2250</td>
<td>4</td>
<td>6</td>
<td>32</td>
<td>182</td>
<td>127</td>
<td>0.7*</td>
</tr>
<tr>
<td>2250</td>
<td>4</td>
<td>8</td>
<td>32</td>
<td>112</td>
<td>102</td>
<td>0.91</td>
</tr>
<tr>
<td>2625</td>
<td>4</td>
<td>8</td>
<td>32</td>
<td>203</td>
<td>144</td>
<td>0.71*</td>
</tr>
<tr>
<td>3000</td>
<td>4</td>
<td>8</td>
<td>32</td>
<td>214</td>
<td>191</td>
<td>0.89</td>
</tr>
</tbody>
</table>
Chapter 7

Execution time of other dense symmetric eigensolvers

In this chapter, I present models for performance of other symmetric eigensolvers. These models have not been fully validated, although some have been partly validated.

7.1 Implementations based on reduction to tridiagonal form

7.1.1 PeIGs

PeIGs[74], like PDSYEVX, uses reduction to tridiagonal form, bisection, inverse iteration and back transformation to perform the parallel eigendecomposition of a dense symmetric matrix. The execution time of PeIGs differs from that of PDSYEVX for two significant reasons: PeIGs is coded differently than PDSYEVX (using a different language and different libraries), and it uses a different re-orthogonalization strategy. I am more interested in the difference resulting from the different re-orthogonalization strategy.

In PDSYEVX the number of flops performed by any particular processor, $p_i$, during re-orthogonalization is: $\sum_{C \in \text{clusters assigned to } p_i} 4 \sum_{i=1}^{\text{size}(C)} n_{\text{iter}}(i) n (i - 1)$, where $n_{\text{iter}}(i)$ is the number of inverse iterations performed for eigenvalue $i$ (typically 3). If the size of the largest cluster is greater than $\frac{n}{p}$, the processor which is responsible for this cluster will not be responsible for any eigenvalues outside of this cluster.

Hence, if the size of the largest cluster is greater than $\frac{n}{p}$, the number of flops performed by the processor to which this cluster is assigned is (on average):

$$4 n_{\text{iter}} n \frac{1}{2} c^2 = 6 n c^2$$

where: $c = \max_{C \in \{\text{clusters}\}} \text{size}(C)$ i.e. the number of eigenvalues in the largest cluster, and $n_{\text{iter}} = 3$ is the average number of inverse iterations performed for each eigenvalue.

If the largest cluster has fewer than $\frac{n}{p}$ eigenvalues, the number of eigenvalues that will be assigned to any one processor, and hence the total number of flops it must perform, is limited. The worst case is where there are $p + 1$ clusters each of size $\frac{n}{p+1}$. In this case, one processor must be assigned 2 clusters of size $\frac{n}{p+1}$, requiring (on average) $2 \times 6 n \left(\frac{n}{p+1}\right)^2$ or roughly $12 \frac{n^3}{p^2}$ flops.
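
The re-orthogonalization flop count can be evaluated directly from the sum given above. The Python sketch below is only an illustration of that formula, assuming a constant $n_{\text{iter}} = 3$; the function names `reorth_flops` and `worst_case_small_clusters` are hypothetical and do not correspond to any routine in PDSYEVX.

```python
def reorth_flops(cluster_sizes, n, n_iter=3):
    """Flops spent by one processor re-orthogonalizing its clusters.

    Implements sum over clusters C of 4 * sum_{i=1}^{size(C)} n_iter * n * (i - 1),
    with n_iter held constant (an assumption; the text says it is typically 3).
    """
    total = 0
    for size in cluster_sizes:
        total += 4 * n_iter * n * sum(i - 1 for i in range(1, size + 1))
    return total


def worst_case_small_clusters(n, p, n_iter=3):
    """Flops when one processor holds two clusters of size n / (p + 1)."""
    c = n // (p + 1)
    return reorth_flops([c, c], n, n_iter)


# One cluster of c = 100 eigenvalues in a matrix of size n = 1000:
exact = reorth_flops([100], 1000)
approx = 6 * 1000 * 100**2          # the 6 n c^2 estimate
print(exact, approx, exact / approx)
```

The exact sum is $6 n c (c-1)$, so the $6 n c^2$ estimate overstates it by a factor of $c/(c-1)$, which is negligible for clusters of any significant size.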

In contrast, PeIGs uses multiple processors and simultaneous iteration to maintain orthogonality among eigenvectors associated with clustered eigenvalues. Traditional inverse iteration[102] computes one eigenvector at a time, re-orthogonalizing against all previous eigenvectors associated with eigenvalues in the same cluster, after each iteration. PeIGs, in what they refer to as simultaneous iteration, performs one step of inverse iteration on all eigenvectors associated with a cluster of eigenvalues and then reorthogonalizes all the eigenvectors. This allows the re-orthogonalization to be performed efficiently in parallel.

PeIGs is more accurate but slower than PDSYEVX if the input matrix has large clusters of eigenvalues\(^1\). The cost of re-orthogonalization in PeIGs is $O(n^2 c/p)$ flops versus $O(nc^2)$ flops in PDSYEVX.

### 7.1.2 HJS

Hendrickson, Jessup and Smith[91] wrote a symmetric eigensolver, HJS, for the PARAGON which is significantly faster than PDSYEVX, but which has never been released, and only works on the Intel PARAGON.

HJS requires that the data layout block size be 1, i.e. a cyclic distribution, that the processor grid be square, i.e. $p_r = p_c$, and that intermediate matrices be replicated across processor columns and distributed across processor rows. The requirement that the processor grid be square limits efficiency when the available processors form a non-square grid. They show that the algorithmic block size need not be tied to the data layout block size. At the time that PDSYTRD was written, the PBLAS could not efficiently use a cyclic distribution and did not support matrices replicated in one processor dimension and distributed across the other.

\(^1\)PDSYEVX can maintain orthogonality among eigenvectors associated with clusters of up to $\frac{n}{p}$ eigenvalues easily and efficiently.

HJS has several advantages over PDSYEVX. It uses a more efficient transpose operation, eliminates redundant communication, reduces the number of messages by combining some and reduces the number of words transmitted per process by using recursive halving and doubling. HJS also reduces the load imbalance by a factor of $\sqrt{p}$ by using a cyclic data layout and using all processors in all calculations. ScaLAPACK will incorporate several of these ideas into the next version of PDSYEVX.

**HJS notation**

HJS also differs in a couple of other rather minor aspects. They compute the norm of $v$ in a manner which could overflow, and they represent the reflector in a manner that could likewise overflow. These choices reduce execution time and program complexity slightly.

Their manner of counting the cost of messages in their performance model also differs from ours. They count the cost of a message swap (sending a message to and simultaneously receiving a message from another processor) as equal to the cost of sending a single message. This reflects reality on the PARAGON and many but not all distributed memory machines. Using their method would not significantly change the model for PDSYEVX because PDSYEVX does not use message swap operations.

In their paper, they use different variable names for the result of each computation, and show all indices explicitly. Figure 7.1 relates their notation to ours.

Figure 7.1: HJS notation

<table>
<thead>
<tr>
<th>HJS</th>
<th>our equivalent</th>
<th>details</th>
</tr>
</thead>
<tbody>
<tr>
<td>$L$</td>
<td>tril($A$)</td>
<td>tril($A$) $v$</td>
</tr>
<tr>
<td>$x_o$</td>
<td>$w$</td>
<td>tril($A$) $v$</td>
</tr>
<tr>
<td>$y_o$</td>
<td>$w^T$</td>
<td>tril($A$, $-1$) $v^T$</td>
</tr>
<tr>
<td>$p$</td>
<td>$w$</td>
<td>$w$ + transpose $w^T$</td>
</tr>
<tr>
<td>$\gamma$</td>
<td>$c$</td>
<td>not mathematically identical</td>
</tr>
</tbody>
</table>

$^2$PDSYTRD uses only $p_r$ processors in many computations
7.1.3 Comparing the execution time of HJS to PDSYEVX

The HJS implementation of parallel blocked Householder tridiagonalization performs essentially the same computation as PDSYEVX. The difference is in the communication, load balance and overhead costs. However, the operations are not performed in the same order, and hence the steps don’t match exactly. Some of the costs, particularly communication costs, could easily have been assigned to a different operation than the one that I assigned them to. Hence, the execution time models for each of the individual tasks should not be taken in isolation but understood as an aid in understanding the total.

Updating the current column of $A$ (Line 1.1 in Figure 7.2)

As shown in Table 4.1, the cost of updating the current column of $A$ in PDSYTRD is:

$$2n \lceil \log_2(\sqrt{p}) \rceil \alpha + n \, \text{nb} \, \lceil \log_2(\sqrt{p}) \rceil \beta + 2n \, \delta_2 + \frac{n^2 \, \text{nb}}{\sqrt{p}} \gamma_2 + 2n \, \delta_4$$

In Figure 6[91], steps Y2, 10.1, 10.2 and 10.3 of HJS are involved in updating the current column of $A$, and the cost of these steps is:

$$n \, \alpha + \frac{1}{2} \frac{n^2}{\sqrt{p}} \beta + 2n \, \delta_2 + \frac{n^2 \, \text{nb}}{p} \gamma_2 .$$

In PDSYEVX, a small part of $v^T$ and $w^T$ must be broadcast within the current column of processors. In HJS, there is no need to broadcast $v^T$ because it is already replicated across all processor rows. Instead of broadcasting the piece of $w^T$ that is necessary for this update, HJS transposes all of $w^T$, (cost: $n \, \alpha + 1/2 \, n^2 / \sqrt{p} \beta$) anticipating the need for this in the rank 2$k$ update.

The number of DGEMV flops performed does not change, but they are distributed across all of the processors instead of being shared only by one column of processors. In order to allow these flops to be distributed across all the processors, this update is performed in a right-looking manner, i.e. the entire block column of the remaining matrix is updated with the Householder reflector. In PDSYEVX, this update is performed in a left-looking manner: only the current column is updated (with a matrix-vector multiply). In PDSYEVX, the right-looking variant does not spread the work any better, and hence the left-looking variant is preferred because it involves a matrix-vector multiply, DGEMV, rather than a rank-one update, DGER. Matrix-vector multiply requires only that every matrix element be read. A rank-one update requires that every matrix element be read and then re-written.

The $\delta_4$ term does not exist for HJS because they do not use the PBLAS, avoiding the error checking and overhead associated with the PBLAS.

**Computing the reflector (Line 2.1 in Figure 7.2)**

As shown in Table 4.2, the cost in PDSYTRD is: $3 n \lceil \log_2(p_r) \rceil \alpha + n \, \delta_4$.

In Figure 6[91], steps 2, 3, 4, 5, 6 and $X$ of HJS are involved in computing the reflector, and the cost of these steps is: $n \lceil \log_2(p) \rceil \alpha$, a little less than the cost in PDSYTRD.

Step 1 in HJS is also used in the computation of the reflector in HJS. However, Step 1 isn't necessary to compute the reflector, and it is necessary for the matrix-vector multiply, hence I assign the cost of Step 1 to the matrix-vector multiply.

Both routines perform essentially the same operations. HJS appends the broadcast of $A(J + 1, J)$ to the computation of $xnorm$ (though HJS actually computes $xnorm^2$), which HJS performs as a sum-to-all. On the other hand, they involve all processors rather than just one column of processors, hence the sum costs $\lceil \log_2(p) \rceil$ rather than $\lceil \log_2(p_r) \rceil$.

The difference in performance would appear more dramatic if I included the cost of the BLAS1 operations in my PDSYEVX model. I do not because they account for an insignificant $O\left(\frac{n^2}{p_r}\right)$ execution time. HJS performs fewer BLAS1 flops (because they do not go to the extremes that PDSYTRD does to avoid overflow) and the flops that they perform are distributed over all processors instead of over only one column of processors.

**The cost of matrix vector multiply (Lines 3.1-3.6 in Figure 7.2)**

As shown in Table 4.3, the cost of the matrix vector multiply in PDSYTRD is:

$$4 \ n \ \left\lfloor \log_2(\sqrt{p}) \right\rfloor + 2 \ \left\lfloor \log_2(\sqrt{p}) \right\rfloor + n^2 \left\lfloor \log_2(\sqrt{p}) \right\rfloor \beta + 3 \ n + n \ \beta + n \ \beta + n \ \beta + n \ \beta + n \ \gamma_2.$$

In Figure 6[91] steps 1, Y1, 7.1, 7.2, and 7.3 are involved in matrix vector multiply and the cost of these steps is:

$$2 \ n \ \left\lfloor \log_2(\sqrt{p}) \right\rfloor + 2 \ n \ \left\lfloor \log_2(\sqrt{p}) \right\rfloor + \frac{3 \ n^2}{2} \ \beta + 2 \ n \ \beta + \frac{3 \ n^3}{2} \ \gamma_2.$$

The model for HJS is much simpler because 1) the local portion of the matrix-vector multiply requires just a single call to DGEMV and 2) the load imbalance in HJS is negligible ($O\left(\frac{n^2}{p}\right)$ versus $O\left(\frac{n^2}{\sqrt{p}}\right)$ in PDSYEVX).
The communication performed in HJS during the matrix vector multiply includes:

<table>
<thead>
<tr>
<th></th>
<th>Figure 6[91]</th>
<th>Execution time model</th>
</tr>
</thead>
<tbody>
<tr>
<td>Broadcast $v$ within a row</td>
<td>Step 1</td>
<td>$n \lceil \log_2(\sqrt{p}) \rceil \alpha + \frac{\beta}{p} \lceil \log_2(\sqrt{p}) \rceil$</td>
</tr>
<tr>
<td>Transpose $v$ and $y$</td>
<td>Steps Y1, 7.3</td>
<td>$2n \alpha + \frac{n^2}{\sqrt{p}} \beta$</td>
</tr>
<tr>
<td>Recursive halve $p$</td>
<td>Step 7.3</td>
<td>$n \lceil \log_2(\sqrt{p}) \rceil \alpha + \frac{n^2}{\sqrt{p}} \beta$</td>
</tr>
</tbody>
</table>

The transpose operations take advantage of the fact that $p_r = p_c$. Each processor $(a, b)$ simply sends its local portion of the vector to processor $(b, a)$ while receiving the transpose from that same processor.

The recursive halving operation is a distributed sum in which each of the $p_c$ processors in the row starts with $k$ values and ends up with $\frac{k}{p_c}$ sums.
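
To make the recursive halving operation concrete, the following Python sketch simulates it serially for a power-of-two number of processes. It is an illustration under stated assumptions, not the HJS code: the function name `recursive_halving` is hypothetical, the processes are plain list indices, and no real message passing takes place.

```python
def recursive_halving(data):
    """Simulate a recursive-halving distributed sum.

    data[r] is the list of k values initially held by process r.
    The number of processes must be a power of two that divides k.
    Returns (pieces, messages, words): pieces[r] is the slice of the summed
    vector that process r ends up owning; messages and words are the number
    of messages and words sent by each process.
    """
    p = len(data)
    k = len(data[0])
    assert p & (p - 1) == 0 and k % p == 0
    work = [list(v) for v in data]
    lo = [0] * p          # process r still owns work[r][lo[r]:hi[r]]
    hi = [k] * p
    messages = 0
    words = 0
    dist = p // 2
    while dist >= 1:
        updated = [list(v) for v in work]
        for r in range(p):
            partner = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            # The process whose `dist` bit is clear keeps the lower half and
            # sends the upper half to its partner; the partner does the reverse.
            keep_lo, keep_hi = (lo[r], mid) if r & dist == 0 else (mid, hi[r])
            for j in range(keep_lo, keep_hi):
                updated[r][j] = work[r][j] + work[partner][j]
            lo[r], hi[r] = keep_lo, keep_hi
        work = updated
        messages += 1
        words += hi[0] - lo[0]   # each process sends as many words as it keeps
        dist //= 2
    pieces = [work[r][lo[r]:hi[r]] for r in range(p)]
    return pieces, messages, words


# Four processes, each starting with k = 8 values.
data = [[10 * r + j for j in range(8)] for r in range(4)]
pieces, messages, words = recursive_halving(data)
print(pieces, messages, words)
```

Each process sends $\log_2(p_c)$ messages but only $k(1 - 1/p_c)$ words in total, which is why recursive halving reduces the number of words transmitted per process compared with a sum-to-one followed by a broadcast.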

**Updating the matrix vector multiply (Line 4.1 in Figure 7.2)**

As shown in Table 4.4, the cost of updating the matrix vector multiply in `PDSYTRD` is:

$$6n \lceil \log_2(\sqrt{p}) \rceil \alpha + \frac{n^2 \lceil \log_2(\sqrt{p}) \rceil}{p_c} \beta + 3n \frac{\text{nb}}{p} \lceil \log_2(\sqrt{p}) \rceil \beta + 4n \delta_2 + 2 \frac{n^2 \, \text{nb}}{\sqrt{p}} \gamma_2 + 4n \delta_4.$$

In Figure 6[91], step 7.4 updates the matrix vector multiply, and the cost of this step is:

$$2n \delta_2 + \frac{n^2 \, \text{nb}}{p} \gamma_2 .$$

**Computing the companion update vector, $w$ (Line 5.1 in Figure 7.2)**

As shown in Table 4.5, the cost of computing the companion update vector in `PDSYTRD` is:

$$2n \lceil \log_2(\sqrt{p}) \rceil \alpha + n \delta_4 .$$

In Figure 6[91] steps 8 and 9 compute the companion update vector and the cost of these steps is:

$$5n \lceil \log_2(\sqrt{p}) \rceil \alpha + \frac{n^2}{\sqrt{p}} \beta .$$

Just as in the computation of the reflector, the $O(n^2)$ cost of the BLAS1 operations is insignificant. HJS performs these more efficiently than `PDSYEVX`, because it uses all the processors in these computations.

**Perform the rank $2k$ update (Line 6.3 in Figure 7.2)**

As shown in Table 4.6, the cost of the rank $2k$ update in PDSYTRD is:

$$4 \frac{n}{nb} \left\lceil \log_2(\sqrt{p}) \right\rceil \alpha + 2 \frac{n^2}{\sqrt{p}} \left\lceil \log_2(\sqrt{p}) \right\rceil \beta + \frac{n^2}{\sqrt{p}} \beta - 2 n nb \left\lceil \log_2(\sqrt{p}) \right\rceil \beta$$

$$+ 4 \frac{n^2}{nb^2} \sqrt{p} \frac{2}{3} \gamma_3 + 3 n^2 nb \frac{2}{\sqrt{p}} \gamma_3.$$

In Figure 6[91] step 10.4 performs the rank $2k$ update and the cost of this step is:

$$2 \frac{n^2}{nb^2} \sqrt{p} \frac{2}{3} \gamma_3 + .$$

HJS does not require any communication here because $W$ and $V$ are already replicated across the processor rows, while $W^T$ and $V^T$ are already replicated across all the processor columns.

Both HJS and PDSYEVX must perform the rank $2k$ update as a series of panel updates using DGEMM. Both PDSYTRD and HJS use a panel width of twice the algorithmic blocking factor.

Figure 7.2 summarizes the main sources of inefficiencies in HJS reduction to tridiagonal form.

Table 7.1 compares the execution time in PDSYEVX and HJS reduction to tridiagonal form. Each row represents a particular operation. The first column is the scale factor for the operation, the second column is the time (in seconds) associated with the operation in PDSYEVX, and the third column is the number of times the operation is performed in PDSYEVX. The second column is the product of the third column and the first column, after substituting the cost of the operation given in Section 5.2 and $n = 4000$ and $p = 64$. For example, the cost of matrix multiply flops in PDSYTRD on the PARAGON is: $2/3 (n = 4000)^3/(p = 64) (\gamma_3 = .0215e-6) = 14.3$. Likewise, the second to last column (the number of times the operation is performed in reduction to tridiagonal form in HJS) times the first column equals the last column (the time associated with the operation in reduction to tridiagonal form in HJS).
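
The worked example for the matrix multiply flops can be reproduced with a few lines of Python. This is only an arithmetic check of the numbers quoted above; the helper name `operation_time` is hypothetical.

```python
def operation_time(count, scale_factor):
    """Time for one row of Table 7.1: operation count times scale factor."""
    return count * scale_factor


n = 4000                 # matrix size used in Table 7.1
p = 64                   # number of processors used in Table 7.1
gamma_3 = 0.0215e-6      # seconds per matrix-matrix multiply flop (PARAGON value quoted above)

scale = n**3 / p * gamma_3            # scale factor n^3 / p * gamma_3
t = operation_time(2.0 / 3.0, scale)  # 2/3 n^3 / p flops at rate gamma_3
print(f"{t:.1f} seconds")             # prints 14.3 seconds
```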

Columns 4 through 10 represent unimplemented intermediate variations on reduction to tridiagonal form. Column 4, labeled “minus PBLAS inefficiencies”, assumes that a couple of inefficiencies of the PBLAS are removed (a bug in the PBLAS causing unnecessary communication, and the PBLAS overhead). Column 5, labeled “be less paranoid”, assumes that in addition PDSYTRD computes reflectors in the slightly faster, slightly riskier manner that HJS does. Column 6 assumes direct transpose operations. Column 7 assumes that certain messages are combined, reducing the message latency cost. Column 8 assumes that sum-to-all is used instead of sum-to-one followed by a broadcast, reducing the latency cost. Column 9 assumes that $V, W, V^T, W^T$ are stored replicated across processor columns; this eliminates all communication in the rank $2k$ update. Storing the data replicated also allows all processors to be involved in all computations, but this is not assumed until column 11. Column 10 assumes a cyclic data layout, eliminating some load imbalance. Column 11 assumes that all processors are involved in all computations, eliminating the load imbalance which was not eliminated by using a cyclic data layout.

Figure 7.2: Execution time model for HJS reduction to tridiagonal form. Line numbers match Figure 4.5 (PDSYEVX execution time)

<table>
<thead>
<tr>
<th>computation</th>
<th>communication</th>
</tr>
</thead>
<tbody>
<tr>
<td>do $ii = 1, n, nb$</td>
<td></td>
</tr>
<tr>
<td>$mxi = \min(ii + nb, n)$</td>
<td></td>
</tr>
<tr>
<td>do $i = ii, mxi$</td>
<td></td>
</tr>
</tbody>
</table>

**Update current ($i^\text{th}$) column of $A$**

1.1 transpose $w$  
$n \log(\sqrt{p}) \alpha + \frac{1}{2} n^2 \beta$

1.2 $A = A - W V^T - V W^T$

**Compute reflector**

2.1 $v = \text{house}(A)$  
$2 n \log(\sqrt{p}) \alpha$

**Perform matrix-vector multiply**

3.1 spread $v$ across  
$n \log(\sqrt{p}) \alpha + \frac{1}{2} n^2 \frac{\log(\sqrt{p})}{\sqrt{p}} \beta$

3.2 transpose $v$

3.3 $w = \text{tril}(A) v$;  
$w^T = \text{tril}(A, -1) v^T$  
$\frac{2 n^3}{3 p} \gamma_2$

3.5 recursive halve $w$  
$\frac{1}{2} \frac{n^2}{\sqrt{p}} \beta$

3.6 $w = w + \text{transpose} w^T$

**Update the matrix-vector product**

4.1 $w = w - W V^T v - V W^T v$  
$3 n \log(\sqrt{p}) \alpha + \frac{2}{\sqrt{p}} \beta$

**Compute companion update vector**

5.1 $c = w \cdot v^T$;  
$w = \tau w - (c \tau/2) v$

end do $i = ii, mxi$

**Perform rank $2k$ update**

6.3 $A = A - W V^T - V W^T$  
$2 \frac{n^2}{nb \sqrt{p}} \beta_3 + \frac{2 n^3}{p} \gamma_3$

end do $ii = 1, n, nb$

7.1.4 PDSYEV

PDSYEV uses the QR algorithm to solve the tridiagonal eigenproblem. Each eigenvector is spread evenly among all the processors. Each processor redundantly computes the rotations and updates the portion of each eigenvector which it owns. Computing the rotations requires $O(n^2)$ flops, whereas updating the eigenvectors requires $O(n^3)$ flops. Hence PDSYEV scales reasonably well as long as all the eigenvectors are required.

Each rotation requires 2 divides, 1 square root and approximately 20 flops to compute, and 6 flops to apply.

The cost of the QR based tridiagonal eigensolution in PDSYEV is:

$$
\sum_{j=1}^{n} \text{sweeps}(j) \left( n - j \right) \left( 2 \gamma_{\div} + \gamma_{\sqrt{\ }} + 20 \gamma_1 + \frac{1}{p} 6n \gamma_1 \right)
$$

On average, it takes two sweeps per eigenvalue, so we set $\text{sweeps}(j) = 2$ and simplify, using $\sum_{j=1}^{n} 2 (n - j) \approx n^2$:

$$
2 n^2 \gamma_{\div} + n^2 \gamma_{\sqrt{\ }} + 20 n^2 \gamma_1 + 6 \frac{n^3}{p} \gamma_1
$$
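
The simplification can be checked numerically. The Python sketch below evaluates the sum over $j$ directly and compares it with $n^2$ times the per-rotation cost (2 divides, 1 square root, 20 flops to compute, and $6n/p$ flops to apply). The function names and the machine rates passed to them are hypothetical placeholders, not measured values.

```python
def qr_cost_sum(n, p, g_div, g_sqrt, g1, sweeps=2):
    """Evaluate the cost sum term by term, with a constant number of sweeps."""
    per_rotation = 2 * g_div + g_sqrt + 20 * g1 + 6 * n * g1 / p
    return sum(sweeps * (n - j) * per_rotation for j in range(1, n + 1))


def qr_cost_closed(n, p, g_div, g_sqrt, g1):
    """Closed form: n^2 rotations, each with the same per-rotation cost."""
    return (2 * n**2 * g_div + n**2 * g_sqrt
            + 20 * n**2 * g1 + 6 * n**3 * g1 / p)


# Placeholder rates in seconds per operation (assumed, for illustration only).
rates = dict(g_div=1.0e-6, g_sqrt=2.0e-6, g1=0.05e-6)
n, p = 1000, 16
exact = qr_cost_sum(n, p, **rates)
approx = qr_cost_closed(n, p, **rates)
print(exact, approx, exact / approx)
```

The exact sum is $n(n-1)$ rotations rather than $n^2$, so the two differ by a factor of $(n-1)/n$. For $p \ll n$ the $6 n^3 / p$ term dominates, which is the eigenvector update that PDSYEV distributes across processors.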

7.2 Other techniques

7.2.1 One dimensional data layouts

One dimensional data layouts can improve the performance of dense linear algebra codes on modest numbers of processors, especially on one-sided reductions like LU and QR decomposition. In general, one dimensional data layouts require fewer communication calls in the inner loop but more words transmitted per process. One-sided reductions typically require fewer messages within rows than within columns, sometimes by a factor as high as $\text{nb}$; at other times the advantage is a more modest $\log(\sqrt{p})$. One-sided reductions often require fewer words to be transmitted between columns than between rows of processors, usually by a factor of $\text{nb}$.

One dimensional data layouts also offer less overhead. Often an entire block column can be computed by a call to the corresponding LAPACK code rather than the ScaLAPACK code, saving significant overhead costs.
Table 7.1: Comparison between the cost of HJS reduction to tridiagonal form and \texttt{PDSYTRD} on $n = 4000$, $p = 64$, \texttt{nb} = 32. Values differing from previous column are shaded.

<table>
<thead>
<tr>
<th>Scale Factor</th>
<th>PDSYTRD estimated time</th>
<th>PDSYTRD counts</th>
<th>minus PBLAS inefficiencies</th>
<th>be less paranoid</th>
<th>direct transpose</th>
<th>merge operations</th>
<th>use sum-to-all</th>
<th>Store $V$, $W$, $V^T$, $W^T$ replicated</th>
<th>All processors compute (i.e., HJS)</th>
<th>no data blocking</th>
<th>HJS estimated time</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\frac{n}{p} \cdot \gamma_3$</td>
<td>14.3</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>$\frac{n}{p} \cdot \gamma_2$</td>
<td>16.4</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot \text{nb} \cdot \text{pbf}^6_3}$</td>
<td>2.1</td>
<td>2</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>$n \cdot \delta_2$</td>
<td>0.6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot \text{nb} \cdot \text{pbf}^6_2}$</td>
<td>4.7</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot \text{nb} \cdot \text{pbf}^6_1}$</td>
<td>4.0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot \text{nb} \cdot \text{pbf}^6_3}$</td>
<td>1.7</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$n \cdot \delta_4$</td>
<td>8.9</td>
<td>9</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot \text{nb} \cdot \text{pbf}^6_2}$</td>
<td>0.9</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot \text{nb} \cdot \text{pbf}^6_3}$</td>
<td>1.7</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot \text{nb} \cdot \text{pbf}^6_3}$</td>
<td>1.0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot \text{nb} \cdot \text{pbf}^6_3}$</td>
<td>0.5</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$n \cdot [\log_2(\sqrt{p})]^a$</td>
<td>13.5</td>
<td>17</td>
<td>15</td>
<td>14</td>
<td>12</td>
<td>9</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
<td>6</td>
</tr>
<tr>
<td>$\frac{n}{p} \cdot [\log_2(\sqrt{p})]^a$</td>
<td>0.5</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>3</td>
<td>3</td>
<td>3</td>
</tr>
<tr>
<td>$\frac{n}{\text{nb} \cdot [\log_2(\sqrt{p})]^a}$</td>
<td>0.3</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$\frac{n}{\sqrt{p} \cdot [\log_2(\sqrt{p})]^b}$</td>
<td>4.4</td>
<td>5</td>
<td>4</td>
<td>4</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>$\frac{n}{\text{nb} \cdot [\log_2(\sqrt{p})]^b}$</td>
<td>0.6</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>$n \cdot \text{nb} \cdot [\log_2(\sqrt{p})]^b$</td>
<td>0.02</td>
<td>8</td>
<td>7</td>
<td>7</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>Total estimated time</td>
<td>76</td>
<td>59</td>
<td>58</td>
<td>55</td>
<td>51</td>
<td>49</td>
<td>48</td>
<td>41</td>
<td>42</td>
<td>42</td>
<td>42</td>
</tr>
<tr>
<td>Actual time</td>
<td>93</td>
<td>61</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Both LU decomposition and back transformation would benefit considerably from one-dimensional data layouts when \( p \) is small, although the advantage would be most pronounced for LU. One-sided reductions require \( O(n) \) reductions across processor rows but only \( O(\frac{n}{p}) \) reductions across processor columns. On a high-latency system such as a network of workstations, the performance improvement from using a one-dimensional data layout could be substantial, since LU requires a factor of \( O(nb) \) fewer messages on a one-dimensional data layout.

ScaLAPACK does not take full advantage of one-dimensional data layouts because it calls the ScaLAPACK code even when the LAPACK code would do the job faster.

Two-sided reductions, such as reduction to tridiagonal form, do not benefit from one-dimensional data layouts. Two-sided reductions require \( O(n) \) reductions across processor rows and \( O(n) \) reductions across processor columns; hence eliminating the reductions across processor rows (by using a 1D data decomposition) will not substantially reduce the number of messages in two-sided reductions.

### 7.2.2 Unblocked reduction to tridiagonal form

Unblocked reduction to tridiagonal form can outperform blocked reduction for small and modestly sized problems, especially if a good compiler is available for the inner kernel. Unblocked reduction to tridiagonal form must perform all of its flops as BLAS2 flops, whereas blocked reduction to tridiagonal form performs half of its flops as BLAS3 flops. However, unblocked reduction to tridiagonal form requires much less overhead. Blocked reduction to tridiagonal form requires at least \( 6n \) calls to \texttt{DGEMV}, whereas unblocked reduction to tridiagonal form requires only \( n \) calls to \texttt{DSYMV} and \( n \) calls to \texttt{DGER}.

If a compiler is available that will efficiently compile the following kernel, unblocked reduction to tridiagonal form could require only \( n \) BLAS2 calls and still attain near peak performance on large problem sizes, especially for Hermitian eigenproblems\(^3\). The kernel shown below requires each element of \( A \) to be read only once and written only once, while performing 8 flops. This ratio, 1 memory read and 1 memory write per 8 flops, is one that many modern computers can handle at near peak speed, even from main memory, in part because the accesses are essentially all stride 1.

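\(^3\) Complex arithmetic requires only half as much memory traffic per flop.

for i = 1, n {
    for j = 1, i {
        // rank-2 update of the lower triangle of A
        A(i,j) = A(i,j) - v(i) * wt(j) - w(i) * vt(j);
        // both halves of the symmetric matrix-vector product with the updated A
        nwt(i) = nwt(i) + A(i,j) * nv(j);
        nw(j) = nw(j) + A(i,j) * nvt(i);
    }
}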
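
As a sanity check on the kernel above, the following Python sketch fuses the rank-2 update with the symmetric matrix-vector product and compares the result with the same computation done separately on the full matrix. The function names, the single accumulator `y` (in place of the kernel's separate `nw` and `nwt`), and the explicit test that keeps the diagonal from being counted twice are assumptions of this sketch, not part of the original kernel.

```python
def fused_rank2_symv(A, v, w, nv):
    """Update the lower triangle of A in place with A - v*w^T - w*v^T and
    return y = A_new * nv, touching each stored element of A once."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):
        for j in range(i + 1):
            a = A[i][j] - v[i] * w[j] - w[i] * v[j]   # rank-2 update
            A[i][j] = a
            y[i] += a * nv[j]                          # row contribution
            if j != i:
                y[j] += a * nv[i]                      # mirrored contribution
    return y


def reference(A, v, w, nv):
    """Same result computed the slow way on the full symmetric matrix."""
    n = len(A)
    full = [[A[i][j] - v[i] * w[j] - w[i] * v[j] for j in range(n)]
            for i in range(n)]
    return full, [sum(full[i][j] * nv[j] for j in range(n)) for i in range(n)]


A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
v, w, nv = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [1.0, 1.0, 2.0]
full, y_ref = reference(A, v, w, nv)
work = [row[:] for row in A]
y = fused_rank2_symv(work, v, w, nv)
```

Each stored element is read once and written once, which is the property the text relies on; the sketch makes no claim about how well any particular compiler would optimize the loop.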

### 7.2.3 Reduction to banded form

Reducing a dense matrix to banded form can be more efficient than reduction to tridiagonal form [24, 25, 116]; however, it is not clear that this can be made fast enough to overcome the added costs to the rest of the code. Reduction to banded form requires less execution time than reduction to tridiagonal form because it requires fewer messages, $O(n/nb)$ instead of $O(n)$, and because asymptotically all of the flops can be performed as BLAS3 flops rather than half of them as BLAS2 flops.

An efficient eigensolver based on reduction to banded form could be designed as follows:

1. Reduce to banded form
2. Reduce from banded form to tridiagonal form (do not save rotations)
3. Compute eigenvalues using bisection on tridiagonal form
4. Perform inverse iteration on banded form
5. Back transform the eigenvectors

This would be even simpler if only eigenvalues were required, as that eliminates the inverse iteration and back transformation steps.

If only a few eigenvectors are required, one could reduce from banded form to tridiagonal form, saving the rotations. This would allow the eigenvectors to be computed on the tridiagonal using inverse iteration (or the new Parlett/Dhillon work). Then the rotations could be applied as necessary and finally the eigenvectors would be transformed back. This would result in a complex code.

If two-step band reduction to tridiagonal form were performed as above and the eigenvectors were computed on the tridiagonal matrix, the cost of transforming them back to the original problem would be at least $4n^3$ flops, adding 60% more $O(n^3)$ flops to a full tridiagonal eigendecomposition. This could be done in two steps, applying first the rotations accrued during reduction from banded to tridiagonal form and then transforming the eigenvectors of the banded form back to the original problem. A cleaner, though more costly, solution would be to form the back transformation matrix after (or during) reduction to banded form, update it during reduction from banded to tridiagonal form, and then use it to transform the eigenvectors of the tridiagonal matrix back to the original problem.

Using reduction to banded form in an eigensolver requires, at a minimum, that two-step band reduction to tridiagonal form be faster than direct reduction to tridiagonal form. If eigenvectors are required, it must be significantly faster in order to overcome the additional $2n^3$ cost of back transformation.
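
The 60% figure and the additional $2n^3$ can be checked with a short calculation. The sketch below assumes the standard leading-order flop counts for the conventional path, $\frac{4}{3}n^3$ for reduction to tridiagonal form and $2n^3$ for back transformation of all $n$ eigenvectors. These two counts and the variable names are assumptions of the sketch rather than values stated in this section.

```python
from fractions import Fraction

# Leading-order flop counts in units of n^3 (assumed standard values).
reduce_to_tridiagonal = Fraction(4, 3)
back_transform_direct = Fraction(2)      # conventional one-step back transformation
back_transform_two_step = Fraction(4)    # "at least 4n^3" quoted in the text

direct_total = reduce_to_tridiagonal + back_transform_direct
extra = back_transform_two_step - back_transform_direct
relative_increase = extra / direct_total

print(direct_total, extra, relative_increase)   # 10/3 2 3/5
```

Under these assumptions the extra back transformation work is $2n^3$ on top of $\frac{10}{3}n^3$, that is, 60%.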

So far, no one has demonstrated that two-step reduction to tridiagonal form can be performed faster than direct reduction on distributed memory computers. Alpatov, Bischof and van de Geijn's two-step reduction to tridiagonal form [173] is not faster than \textsc{Pdsytrd}. They assert that it can be optimized, but that is also true of \textsc{Pdsytrd}. So, it is not yet clear whether two-step reduction to tridiagonal form will be significantly faster than direct reduction to tridiagonal form on any important subset of distributed memory parallel computers.

I believe that software overhead plays a significant role in limiting the performance of two-step reduction to banded form.

### 7.2.4 One-sided reduction to tridiagonal form

Hegland et al. [90] show that one can reduce the Cholesky factor (of a shifted input matrix) to bidiagonal form, updating from only one side. The result, in their implementation, is a code which requires $\left(\frac{10}{3} \frac{n^3}{p} + n^2 p\right)$ flops per processor, $(n^2 p)$ words communicated per processor and $(n p)$ messages per processor.

They argue that this technique, despite requiring 2.5 times as many flops, yields better performance on their target machine than conventional methods for reduction to tridiagonal form. They use a 1D processor grid, an unblocked algorithm and a non-scalable pattern of communication and computation, and they ignore symmetry. By ignoring much of the conventional wisdom they have achieved a simple, high performance code for their target machine (a vector machine).
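
The factor of 2.5 follows from the leading flop terms, assuming (as is standard, though not stated in this section) that conventional Householder reduction to tridiagonal form costs $\frac{4}{3}n^3$ flops in total:

\[ \frac{\tfrac{10}{3}\, n^3}{\tfrac{4}{3}\, n^3} = \frac{10}{4} = 2.5 \]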
### 7.2.5 Strassen’s matrix multiply

The number of flops in Strassen’s matrix-matrix multiply is:

\[ 2 m n k \left( \frac{\min(m, n, k)}{s_{\frac{1}{2}}} \right)^{\log_2 7 - 3}, \]

where \( s_{\frac{1}{2}} \) is the break-even point for a particular Strassen implementation, i.e. the point at which one additional Strassen divide and conquer step neither increases nor decreases execution time. Three factors contribute to preventing the use of Strassen’s method in reduction to tridiagonal form and back transformation:

- \( s_{\frac{1}{2}} \) is still too large. Lederman et al. [96] have reduced \( s_{\frac{1}{2}} \) to the range 100 to 500.
- \( k \) is modest (where \( k \) is the block size). We can increase the block size, but only at the cost of additional load imbalance.
- \( n^{\log_2 7 - 3} = n^{-0.193} \) shrinks slowly. Increasing \( n \) by enough to improve the ratio of “Strassen flops” to standard matrix multiply flops by 50% requires a thousand-fold increase in the amount of memory required. (\( 32^{-0.193} \approx 0.5 \), hence \( n \) must increase by a factor of 32 to halve the ratio of Strassen flops to standard matrix multiply flops, and memory grows as \( n^2 \).) Improving the ratio of “Strassen flops” to standard matrix multiply flops by increasing the number of processors involved is even more difficult. Although Chou et al. [43] have shown that \( 7^k \) processors can be used to do the work of \( 8^k \), it takes \( 7^5 = 16807 \) processors to get a factor of two advantage this way (\( 7^5 / 8^5 = 0.51 \)).

It is this last point that prevents Strassen’s method from rescuing ISDA (which is described below). Because \( 32^{-0.193} \approx 0.5 \), the problem size must be \( 32 \, s_{\frac{1}{2}} \) in order to halve the number of flops required in ISDA. Halving the number of flops again would require that \( n \) be increased by another factor of roughly 30, increasing memory by another factor of 900 and the total number of flops, even after the factor of two savings, by \( \frac{1}{2} \cdot 30^3 = 13,500 \). I have not yet seen a Strassen matrix-matrix multiply that achieves twice the performance of a regular matrix-matrix multiply.
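
The numbers in this argument are easy to reproduce. The following Python sketch evaluates the flop ratio for a square multiply; the function name `strassen_flop_ratio` and the choice of `s_half = 100` (the low end of the range quoted above) are illustrative assumptions.

```python
import math

def strassen_flop_ratio(n, s_half):
    """Ratio of Strassen flops to standard 2*n^3 flops for a square
    multiply of size n, modeled as (n / s_half) ** (log2(7) - 3)."""
    return (n / s_half) ** (math.log2(7) - 3)

s_half = 100                          # assumed break-even size
exponent = math.log2(7) - 3           # about -0.193
ratio_32 = strassen_flop_ratio(32 * s_half, s_half)   # about 0.51
memory_growth = 32 ** 2               # memory grows as n^2
processor_ratio = 7 ** 5 / 8 ** 5     # 16807 processors doing the work of 32768

print(round(exponent, 3), round(ratio_32, 2), memory_growth, round(processor_ratio, 2))
```

The two ratios coincide because $32 = 2^5$, so $32^{\log_2 7 - 3} = (7/8)^5$: five levels of recursion are needed for a factor of two, whether they come from a larger problem or from more processors.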
Table 7.2: Fastest eigendecomposition method

<table>
<thead>
<tr>
<th></th>
<th>$n &gt; 500 \sqrt{p}$</th>
<th>$n &lt; 500 \sqrt{p}$</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random matrices</td>
<td>Tridiagonal (&gt; 4 times faster)</td>
<td>Tridiagonal</td>
</tr>
<tr>
<td>Spectrally diagonally dominant matrices</td>
<td>Tridiagonal</td>
<td>Jacobi</td>
</tr>
</tbody>
</table>

## 7.3 Jacobi

### 7.3.1 Jacobi versus Tridiagonal eigensolvers

This section is based on models that have only been informally validated. I have compared my models to those used by Arbenz and Slapničar[9] and Littlefield and Maschhoff[125] as well as against the execution times reported in these papers but have not performed any independent validation. Hence, the opinions that I express in this section should be taken as conjectures.

Large matrices\(^4\) can be solved faster by a tridiagonal based eigensolver than by a Jacobi eigensolver, but it is likely that Jacobi will outperform tridiagonal based eigensolvers on small spectrally diagonally dominant matrices\(^5\). Tridiagonal based methods require, asymptotically, no more than a quarter as many flops as blocked Jacobi methods, even on spectrally diagonally dominant matrices, and they can achieve 25% of peak performance on large matrices, as shown in Chapter 5. Hence I expect that tridiagonal based methods will win on large matrices, even spectrally diagonally dominant ones. I also expect that tridiagonal based eigensolvers will beat Jacobi eigensolvers on random matrices regardless of their size, because on random matrices tridiagonal eigensolvers perform roughly 16 times fewer flops\(^6\) and I don’t think that Jacobi methods will be 16 times faster per flop at any input size. Table 7.2 summarizes which eigensolution method I expect to be faster as a function of these input matrix characteristics.

\(^4\)On current machines, $n > 500 \sqrt{p}$ is sufficiently large to allow a tridiagonal eigensolver to outperform Jacobi.

\(^5\)Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant. Most, but not all, diagonally dominant matrices are spectrally diagonally dominant. For example, if you take a dense matrix with elements randomly chosen from $[-1, 1]$ and scale the diagonal elements by $10^3$, the resulting diagonally dominant matrix will generally be spectrally diagonally dominant. However, if you take that same matrix and add $10^3$ to each diagonal element, the eigenvector matrix is unchanged, so the result is generally not spectrally diagonally dominant even though it is clearly diagonally dominant.

\(^6\)Assuming Jacobi converges in an optimistic 8 sweeps
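
The second example in footnote 5 rests on the fact that adding a multiple of the identity shifts every eigenvalue but leaves the eigenvectors alone. A minimal Python sketch with a hypothetical 2 by 2 matrix follows; the matrix, the shift of 1000 and the helper `matvec` are illustrative choices, not taken from the text.

```python
def matvec(M, x):
    """Dense matrix-vector product for small lists of rows."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

A = [[2.0, 1.0], [1.0, 2.0]]
# Known eigenpairs of A (eigenvectors left unnormalized).
eigenpairs = [(3.0, [1.0, 1.0]), (1.0, [1.0, -1.0])]

shift = 1000.0
shifted = [[A[i][j] + (shift if i == j else 0.0) for j in range(2)]
           for i in range(2)]

# The same vectors are eigenvectors of the shifted matrix, with eigenvalue lam + shift.
residuals = []
for lam, x in eigenpairs:
    y = matvec(shifted, x)
    residuals.append(max(abs(y[i] - (lam + shift) * x[i]) for i in range(2)))
```

The shifted matrix is strongly diagonally dominant, yet its eigenvectors $(1, 1)$ and $(1, -1)$ have entries of equal magnitude, so the eigenvector matrix is not diagonally dominant.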
### 7.3.2 Overview of Jacobi Methods

Despite Jacobi’s simplicity there are several possible variants, especially for a parallel code, each of which has advantages. In section 7.3.16 I describe the code that I would write if I were going to write a parallel code. I recommend a 2D data layout if one wishes to run efficiently on large numbers of processors (say 48 or more). However, a 1D data layout is considerably simpler to implement, and a simpler implementation translates into less software overhead. On some computers, Jacobi with a 1D data layout might be efficient for hundreds of processors. I recommend using a one-sided, blocked, non-threshold Jacobi with a caterpillar track pairing and distinct communication and computation phases, but other methods cannot be entirely rejected. For a spectrally diagonally dominant matrix the fastest serial Jacobi algorithm is a threshold Jacobi, hence threshold methods cannot be ignored. A threshold method would almost certainly have to be two-sided, use a different pairing strategy and use either a non-blocked code or some unconventional blocking strategy. Non-blocked codes may make sense for small matrices and large numbers of processors as well as for machines, such as vector architectures, which offer comparable BLAS1 and BLAS3 performance. Overlapping communication and computation will save time, but my experience indicates that the savings are limited.

My recommendation is weighted toward small matrices that are modestly spectrally diagonally dominant, but not so dominant that certain matrix entries can be completely ignored. If the input matrix is sparse and so strongly spectrally diagonally dominant that the matrix never fills in, one would have to consider threshold methods and methods that don’t update parts of the matrix that remain zero. On the other hand, if the matrix is quite large, performance could be further improved by using a different data layout from the one that I recommend.

There are many implementation options available to anyone writing a Jacobi code. I will discuss many of these implementation options in the following sections. Section 7.3.3 explains the basic variants and data layout options. Section 7.3.4 explains the computation requirements of each of the basic variants. Section 7.3.5 explains the communication requirements of each of the basic variants. Section 7.3.6 discusses blocking (both communication and computation). Section 7.3.7 discusses the importance of exploiting symmetry. Section 7.3.8 explains that one-sided methods need not recompute diagonal blocks of $A^T A$. Section 7.3.9 discusses options for the partial eigendecomposition required by a blocked
Jacobi method. Section 7.3.10 discusses threshold strategies. Section 7.3.12 discusses preconditioners. Section 7.3.13 discusses overlapping communication and computation.

### 7.3.3 Jacobi Methods

The Matlab code for the classical, two-sided Jacobi method shown in figure 7.3 differs from textbook descriptions only in that the rotation is computed by calling parteig and the off-diagonals are compared to the diagonals (in the threshold test) in an unusual manner. Figure 7.6 gives inefficient Matlab code for parteig, which calls Matlab's eig() routine and sorts the eigenvalues to guarantee convergence. In a real implementation, parteig would be one or two sweeps of two-sided Jacobi.

A two-sided blocked Jacobi Matlab code is given in figure 7.4. Because the code in figure 7.3 uses parteig to compute the rotations and norm in the threshold test, the only difference between the blocked and unblocked versions is the definition of I and J. parteig is typically not a full eigendecomposition; more often it is a single sweep of Jacobi.
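
In the unblocked case the block passed to parteig is 2 by 2, so the partial eigendecomposition has a closed form. The sketch below is a hypothetical Python stand-in (`parteig_2x2` is not a routine from this thesis) that returns an orthogonal matrix and the eigenvalues sorted in decreasing order, mirroring the sort in figure 7.6.

```python
import math

def parteig_2x2(a, b, d):
    """Eigendecomposition of the symmetric block [[a, b], [b, d]].
    Returns (R, (lam1, lam2)) with lam1 >= lam2; the columns of R are
    the matching orthonormal eigenvectors."""
    t = 0.0
    if b != 0.0:
        zeta = (d - a) / (2.0 * b)
        # smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0
        t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
    c = 1.0 / math.sqrt(1.0 + t * t)
    s = c * t
    # Columns (c, -s) and (s, c) diagonalize the block.
    cols = [((c, -s), a - t * b), ((s, c), d + t * b)]
    cols.sort(key=lambda pair: -pair[1])          # largest eigenvalue first
    (q1, lam1), (q2, lam2) = cols
    R = [[q1[0], q2[0]], [q1[1], q2[1]]]
    return R, (lam1, lam2)

R, (lam1, lam2) = parteig_2x2(2.0, 1.0, 2.0)
```

Sorting may swap the two columns, in which case `R` is a reflection rather than a pure rotation; it remains orthogonal, which is all the update `R' * A * R` requires.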

The one-sided Jacobi variants can operate on any matrix whose left singular vectors are the same as, or related to, the eigenvectors of the input matrix. This allows many choices for pre-conditioning the input matrix, several of which are discussed in section 7.3.12.

The one-sided Jacobi methods lose symmetry, but still require fewer flops than the two-sided Jacobi methods because they do not have to update the eigenvectors separately. Furthermore, the one-sided Jacobi methods always access the matrix in one direction (by column for Fortran). A typical one-sided Jacobi method is shown in figure 7.5.

Parallel Jacobi methods require two forms of communication. The columns and/or rows of the matrix must be exchanged in order to compute the rotations and the rotations must be broadcast. The basic communication for one-sided Jacobi is shown in figure 7.7 while the communication pattern for two-sided Jacobi is given in figure 7.8.
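
To make the one-sided idea concrete, here is a minimal serial, unblocked sketch in Python. It is not the code of figure 7.5: it assumes a symmetric positive definite input with distinct eigenvalues, rotates one pair of columns at a time, and recovers the eigenvalues from Rayleigh quotients. The function name and the tolerance are illustrative choices.

```python
import math

def one_sided_jacobi(A, tol=1e-12, max_sweeps=30):
    """Unblocked one-sided Jacobi for a symmetric positive definite matrix A
    (list of rows). Returns (eigenvalues, eigenvectors as a list of columns)."""
    n = len(A)
    G = [row[:] for row in A]          # columns of G are rotated until orthogonal
    for _ in range(max_sweeps):
        rotations = 0
        for p in range(n - 1):
            for q in range(p + 1, n):
                # entries of the 2x2 block of G^T G, formed by dot products
                app = sum(G[k][p] * G[k][p] for k in range(n))
                aqq = sum(G[k][q] * G[k][q] for k in range(n))
                apq = sum(G[k][p] * G[k][q] for k in range(n))
                if abs(apq) <= tol * math.sqrt(app * aqq):
                    continue           # columns already orthogonal enough
                rotations += 1
                zeta = (aqq - app) / (2.0 * apq)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for k in range(n):     # rotate columns p and q, from the right only
                    gp, gq = G[k][p], G[k][q]
                    G[k][p] = c * gp - s * gq
                    G[k][q] = s * gp + c * gq
        if rotations == 0:
            break
    values, vectors = [], []
    for j in range(n):
        norm = math.sqrt(sum(G[k][j] ** 2 for k in range(n)))
        q = [G[k][j] / norm for k in range(n)]
        Aq = [sum(A[i][k] * q[k] for k in range(n)) for i in range(n)]
        values.append(sum(q[i] * Aq[i] for i in range(n)))   # Rayleigh quotient
        vectors.append(q)
    return values, vectors


A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
values, vectors = one_sided_jacobi(A)
```

The matrix is only ever touched by columns and no separate eigenvector accumulation is needed, at the price of the dot products that form each 2 by 2 block.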

7.3.4 Computation costs

The computation and communication costs for the Jacobi method which I recommend for non-vector distributed memory computers with many nodes, a one-sided blocked Jacobi on a two-dimensional \((p_r \times p_c)\) processor grid, are shown in table 7.3. Definitions for all symbols used here can be found in Appendix A.

\(^7\)They also avoid applying rotations from both sides, but this advantage is negated by the fact that they
Figure 7.3: Matlab code for classical two-sided Jacobi

function [Q,D] = jac2(A)
    
    % Classical two-sided threshold Jacobi
    
    thresh = 1e-15;
    maxiter = 25;
    n = size(A,2);

    iter = 0;
    mods = 1;
    Q = eye(n);

    while (iter < maxiter & mods > 0 )
        mods = 0;
        for I = 1:n
            for J = 1:I-1
                blkA = A([J,I],[J,I]) ;
                if ( norm(blkA-diag(diag(blkA))) > ( norm(blkA)*thresh) )
                    mods = mods + 1;
                    [R,D] = parteig(A([J,I],[J,I]));
                    A([J,I],:) = R' * A([J,I],:); 
                    A(:,[J,I]) = A(:,[J,I]) * R;
                    Q(:,[J,I]) = Q(:,[J,I]) * R;
                end
            end
        end
        iter = iter + 1
    end

    D = diag(diag(A)) ;
Figure 7.4: Matlab code for two-sided blocked Jacobi

```matlab
function [Q,D] = bjac2( A )

% Two sided blocked threshold Jacobi

maxiter = 25;
thresh = 1e-15;
nb = 1;           % block size (the loops below use nb)
n = size(A,2);

iter = 0;
mods = 1;
Q = eye(n);

while (iter < maxiter & mods > 0)
    A = ( A + A' ) / 2; % restore symmetry
    mods = 0;
    for i = 1:nb:n
        maxi = min(i+nb-1,n);
        I = i:maxi;
        for j = 1:nb:i-1
            maxj = min(j+nb-1,n);
            J = j:maxj;

            blkA = A([J,I],[J,I]);
            if ( norm(blkA-diag(diag(blkA))) > ( norm(blkA)*sqrt(nb)*thresh) )
                mods = mods + 1;

                [R,D] = parteig(A([J,I],[J,I]));
                A([J,I],:) = R' * A([J,I],:);
                A(:,[J,I]) = A(:,[J,I]) * R;
                Q(:,[J,I]) = Q(:,[J,I]) * R;
            end
        end
    end
    iter = iter + 1
end

D = diag(diag(A)) ;
```
Figure 7.5: Matlab code for one-sided blocked Jacobi

function [ Q, D ] = bjac1( A )
%
% One sided blocked Jacobi
%
thresh = 1e-15 ;
nb = 2 ;
maxiter = 25;
n = size(A,2);

B = A;
iter = 0 ;
mods = 1 ;

while (iter < maxiter & mods > 0)
    mods = 0 ;
    for i = 1:nb:n
        maxi = min(i+nb-1,n);
        I = i:maxi;
        for j = 1:nb:i-1
            maxj = min(j+nb-1,n);
            J = j:maxj;

            blkA = A(:,[J,I])' * A(:,[J,I]) ;
            if (norm(blkA-diag(diag(blkA))) > norm(blkA)*sqrt(nb)*thresh)
                mods = mods + 1 ;

                [R,D] = parteig(blkA) ;
                A(:,[J,I]) = A(:,[J,I]) * R ;

            end % if
        end % for j
    end % for i
    iter = iter + 1
end % while

D = A' * A;
Q = A * diag(1./sqrt(diag(D))) ;

D = Q' * B * Q ;
D = diag(diag(D)) ;
Figure 7.6: Matlab code for an inefficient partial eigendecomposition routine

```matlab
function [ Q, D ] = parteig( A )

[QQ,DD ] = eig(A) ;
[tmp,Index] = sort(- diag(DD));
D = DD(Index,Index) ;
Q = QQ(:,Index) ;
```

Table 7.3: Performance model for my recommended Jacobi method

<table>
<thead>
<tr>
<th>Task</th>
<th>Cost per parallel pairing i.e. ( \frac{n}{nb^2} / (2p_r) ) parallel pairings</th>
<th>Cost for recommended data layout ( nb = \frac{n}{2p_c} ), ( p_c = 16p_r = 4\sqrt{p} )</th>
</tr>
</thead>
<tbody>
<tr>
<td>Move column for this pairing( ^a )</td>
<td>( 2^\frac{2}{p_r} n \frac{2}{p_r} \times \left( \alpha + \frac{a}{p_r^2} \beta \right) )</td>
<td>( \frac{8}{p_r^2} n + \frac{4\sqrt{\pi}}{p_r} \beta )</td>
</tr>
<tr>
<td>( \text{diag} = A {I,J,:} \times A {I,J,:} ) ( ^b )</td>
<td>( 2n^2 + 2n \frac{p_r}{p_r} ) ( \gamma_0 )</td>
<td>( 8\sqrt{\pi} n + \frac{4\sqrt{\pi}}{p_r} \gamma_0 )</td>
</tr>
<tr>
<td>Sum diag within each processor column</td>
<td>( \frac{1}{2} \left( \frac{\pi}{p_r} \right) \left( \frac{\sqrt{\pi}}{p_r} \right) \frac{\gamma_0}{n} )</td>
<td>( 4\sqrt{\pi} \frac{n \gamma_0}{\sqrt{p_r}} ) + ( \frac{2\gamma_0}{\sqrt{p_r}} \gamma_0 ) + ( \frac{2\gamma_0^2}{\sqrt{p_r}} \left( \frac{\gamma_0}{n} \right) )</td>
</tr>
<tr>
<td>( [Q, D] = \text{parteig(diag)} ) ( ^c )</td>
<td>( 2n^2 (2\gamma_0 + \frac{\gamma_0}{n}) )</td>
<td>( 4\sqrt{\pi} \frac{n \gamma_0}{\sqrt{p_r}} ) + ( \frac{2\gamma_0}{\sqrt{p_r}} \gamma_0 ) + ( \frac{2\gamma_0^2}{\sqrt{p_r}} \left( \frac{\gamma_0}{n} \right) )</td>
</tr>
<tr>
<td>Broadcast ( Q ) within each processor column</td>
<td>( \frac{1}{2} \left( \frac{\pi}{p_r} \right) \left( \frac{\sqrt{\pi}}{p_r} \right) \frac{\gamma_0}{n} )</td>
<td>( 4\sqrt{\pi} \frac{n \gamma_0}{\sqrt{p_r}} ) + ( \frac{2\gamma_0}{\sqrt{p_r}} \gamma_0 ) + ( \frac{2\gamma_0^2}{\sqrt{p_r}} \left( \frac{\gamma_0}{n} \right) )</td>
</tr>
<tr>
<td>( A = QA )</td>
<td>( \frac{1}{2} \left( \frac{\pi}{p_r} \right) \left( \frac{\sqrt{\pi}}{p_r} \right) \frac{\gamma_0}{n} )</td>
<td>( 8\sqrt{\pi} n + \frac{4\sqrt{\pi}}{p_r} \gamma_0 )</td>
</tr>
</tbody>
</table>

\( ^a \) My models assume that sends and receives do not overlap, hence the factor of 2. The factor of \( \frac{2nb_p,r}{n} \) represents the number of parallel pairings that can be performed on the data local to one processor column.

\( ^b \) Only \( A \{I,:\} \times A \{I,:\} \) need be computed. See section 7.3.8

\( ^c \) Partial eigendecomposition of the \( (2nb) \times (2nb) \) matrix performed with one pass of an unblocked two-sided Jacobi method exploiting symmetry, see column labeled "exploiting symmetry" in table 7.6

\( ^d \) \( \left( \frac{2n \gamma_0^2}{p_r} \right) \times \left( \frac{\pi}{p_r} \right)^2 = \frac{24}{p_r^2} = 24/36 = \frac{3}{4} \)

\( ^e \) \( n \gamma_0^2 = \frac{8\sqrt{\pi}}{p_r} \gamma_0 \)
Figure 7.7: Pseudo code for one-sided parallel Jacobi with a 2D data layout with communication highlighted

Until convergence do:
   Foreach iteration do:
      Move column data (A) to adjacent columns of processors
      Compute $A^T A$ locally (i.e. $\text{blkA} = A(:,[I,J])' * A(:,[I,J])$)
      Combine $A^T A$ within each column of processors
      Partial eigendecomposition of diagonal block (i.e. $[R, D] = \text{eig}(A^T A)$)
      Broadcast R within each row of processors
      Compute $A R$ locally
   End Foreach
End Until

Table 7.4 shows the estimated execution time for one sweep of my recommended Jacobi on a matrix of size 1000 by 1000 on a 64-node PARAGON. As this model has not been validated, these estimates must be viewed with caution. Actual performance will be different, but the model gives some idea of how important the various aspects may be. This model is given in Matlab form in section B.2.1. Table 7.4 suggests that Jacobi is indeed efficient ($1.68/2.69 = 62\%$) even on such small problems. It also suggests that the optimal data layout may be even taller and thinner than my recommended data layout: $p_c = 32; p_r = 2$. A taller and thinner layout (specifically $p_c = 64; p_r = 1$) would double the cost of message transmission between columns but would decrease the cost of the partial eigensolver. The cost of the divides and square roots in the partial eigensolver would decrease by a factor of $64/32 = 2$ because all 64 processors would participate in the partial eigensolver. The cost of accumulating the rotations within the partial eigensolver would decrease by a factor of $2 \times 2 = 4$. The first factor of 2 stems from the fact that all processors would share in the work, while the second factor of 2 stems from the fact that the block size would be smaller by a factor of 2 and the cost of accumulating rotations grows as $O(n^2 \, nb)$.
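
The efficiency figure quoted above can be reproduced from two entries of Table 7.4. The sketch below assumes that the 1.68 second entry is the matrix multiply time modeled as $5 n^3 \gamma_3 / p$ and that the 0.43 second entry is the accumulation of rotations modeled as $3 n^3 \gamma_1 / (p\sqrt{p})$. Both formulas are one reading of the table, chosen because they reproduce its numbers for $n = 1000$ and $p = 64$, and should be treated as assumptions.

```python
import math

n, p = 1000, 64
gamma_3 = 0.0215e-6    # seconds per flop, cost listed with the 1.68 s entry
gamma_1 = 0.074e-6     # seconds per flop, cost listed with the 0.43 s entry

t_multiply = 5 * n**3 / p * gamma_3                       # assumed model
t_accumulate = 3 * n**3 / (p * math.sqrt(p)) * gamma_1    # assumed model
efficiency = 1.68 / 2.69                                  # figures quoted in the text

# Taller layout (p_c = 64, p_r = 1): twice the processors share the work and
# the block size halves, so accumulation is modeled as 2 * 2 = 4 times cheaper.
t_accumulate_tall = t_accumulate / (2 * 2)

print(round(t_multiply, 2), round(t_accumulate, 2), round(efficiency, 2))
```

Under these assumptions the taller layout would save roughly 0.3 seconds of accumulation time per sweep, which has to be weighed against the doubled cost of message transmission between columns.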

Table 7.5 gives computation cost models for 6 one-sided Jacobi variants. These models are not complete (they overlook many overhead and load imbalance costs), nor have they been validated. This table is designed mainly to put the various variants in perspective and not

must perform dot products to form the square sub matrices to be diagonalized.
Table 7.4: Estimated execution time per sweep for my recommended Jacobi on the PARAGON on n=1000, p=64

<table>
<thead>
<tr>
<th>Task</th>
<th>Performance Model</th>
<th>Operation cost</th>
<th>Estimated time (seconds)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Message latency</td>
<td>(8\sqrt{p}\log(p) - 3\alpha)</td>
<td>(\alpha = 65.9 e - 6)</td>
<td>0.01</td>
</tr>
<tr>
<td>Message transmission between</td>
<td>(\frac{7\sqrt{p}}{\sqrt{p}}\beta)</td>
<td>(\beta = .146 e - 6)</td>
<td>0.06</td>
</tr>
<tr>
<td>Message transmission within</td>
<td>(\frac{1}{8\sqrt{p}}\log(p)\beta)</td>
<td>(\beta = .146 e - 6)</td>
<td>0.01</td>
</tr>
<tr>
<td>Computing rotations (divides)</td>
<td>(\frac{1}{\sqrt{p}}\gamma_{\pm})</td>
<td>(\gamma_{\pm} = 3.85 e - 6)</td>
<td>0.24</td>
</tr>
<tr>
<td>Computing rotations (square roots)</td>
<td>(\frac{1}{\sqrt{p}}\gamma_1)</td>
<td>(\gamma_1 = 7.7 e - 6)</td>
<td>0.24</td>
</tr>
<tr>
<td>Accumulating rotations in partial eigensolver</td>
<td>(\frac{3}{\sqrt{p}}\gamma_1)</td>
<td>(\gamma_1 = .074 e - 6)</td>
<td>0.43</td>
</tr>
<tr>
<td>Software overhead</td>
<td>(8\sqrt{p}\delta_3)</td>
<td>(\delta_3 = 103 e - 6)</td>
<td>0.01</td>
</tr>
<tr>
<td>Matrix-matrix multiply (BLAS3) flops</td>
<td>(5\frac{\pi}{6}\gamma_3)</td>
<td>(\gamma_3 = .0215 e - 6)</td>
<td>1.68</td>
</tr>
<tr>
<td>Total (per sweep)</td>
<td></td>
<td></td>
<td>2.68</td>
</tr>
</tbody>
</table>

*See 6.1
Figure 7.8: Pseudo code for two-sided parallel Jacobi with a 2D data layout, as described by Schreiber [150], with communication highlighted

Until convergence do:
    Foreach pairing do:
        Move row and column data \((A)\) to diagonally adjacent processors
        Compute partial eigendecomposition of diagonal block
        Broadcast \(R\) within each row of processors
        Broadcast \(R'\) within each column of processors
        Compute \(R A R'\) locally
        Compute \(QR\) locally
    End Foreach
End Until

to establish which is best. Communication costs are considered in section 7.3.5

I have attempted to list the variants that have been implemented as well as the most promising suggestions. For each variant I have, where appropriate, followed my recommendations for implementing a Jacobi code made in section 7.3.16.

Table 7.6 gives performance models for 5 commonly mentioned two-sided Jacobi variants. Like the performance models for one-sided Jacobi variants, these models are incomplete and have not been validated.

### 7.3.5 Communication costs

Table 7.7 summarizes the communication costs for parallel Jacobi methods. I assume that the communication block size is chosen to be as large as possible.

A performance model for Jacobi could be created by selecting the appropriate computation costs from table 7.5 or table 7.6 and the appropriate communication cost from table 7.7. Not all load imbalance and overhead costs are covered in either of these tables, and the models have not been validated.
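
As an illustration of how such a model is assembled, the sketch below adds a latency term, a bandwidth term and a flop term. The function `jacobi_sweep_time` is hypothetical. The example plugs in the one-sided 1-D exchange cost from table 7.7 ($4p$ messages and $2n^2$ words per sweep), a flop count of $5n^3/p$ per processor that is assumed here purely for illustration, and the PARAGON values of $\alpha$, $\beta$ and $\gamma_3$ listed in table 7.4. Like the models themselves, the resulting numbers are unvalidated.

```python
def jacobi_sweep_time(messages, words, flops, alpha, beta, gamma):
    """Linear cost model: latency + bandwidth + computation, in seconds."""
    return messages * alpha + words * beta + flops * gamma

n, p = 1000, 64
alpha, beta, gamma_3 = 65.9e-6, 0.146e-6, 0.0215e-6   # PARAGON costs, table 7.4

# Communication only: exchange of column vectors, one-sided 1-D layout.
communication = jacobi_sweep_time(4 * p, 2 * n**2, 0, alpha, beta, gamma_3)

# Adding an assumed 5*n^3/p flops per processor per sweep.
total = jacobi_sweep_time(4 * p, 2 * n**2, 5 * n**3 / p, alpha, beta, gamma_3)

print(round(communication, 3), round(total, 3))
```

A complete model would also need the reduce and broadcast rows of table 7.7 and the overhead and load imbalance terms that the tables omit.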
Table 7.5: Performance models (flop counts) for one-sided Jacobi variants. Entries which differ from the previous column are shaded.

<table>
<thead>
<tr>
<th></th>
<th>Littlefield exploit symmetry</th>
<th>store diagonals</th>
<th>fast given *</th>
<th>Blocked exploit symmetry</th>
<th>store diagonals</th>
</tr>
</thead>
<tbody>
<tr>
<td>$A^T A$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + n^2 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + n^3 \gamma$</td>
<td>$2p^2 \delta_3 + 2n^3 \gamma$</td>
<td>$2p^2 \delta_3 + n^3 \gamma$</td>
</tr>
<tr>
<td>$[Q, D]$</td>
<td>$\frac{1}{2} n^2 (\gamma_+ + \gamma_\gamma)$</td>
<td>$\frac{1}{2} n^2 (\gamma_+ + \gamma_\gamma)$</td>
<td>$\frac{1}{2} n^2 (\gamma_+ + \gamma_\gamma)$</td>
<td>$8n^2 \delta_1 + 8n^2 \delta_1 + 24n^2 \tau \gamma$</td>
<td>$8n^2 \delta_1 + 8n^2 \delta_1 + 24n^2 \tau \gamma$</td>
</tr>
<tr>
<td>One sweep</td>
<td>$2n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$2n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$2n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$2p^2 \delta_3 + 4n^3 \gamma$</td>
<td>$2p^2 \delta_3 + 4n^3 \gamma$</td>
</tr>
<tr>
<td>$V \times Q$</td>
<td>$2n^2 \delta_1 + 3n^3 \gamma$</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Total (per sweep)</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$4p^2 \delta_3 + 4p^2 \delta_3 + 6n^3 \tau \gamma$</td>
<td>$4p^2 \delta_3 + 4p^2 \delta_3 + 6n^3 \tau \gamma$</td>
</tr>
<tr>
<td>Assume:</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$64p^2 \delta_3 + 64p^2 \delta_3 + 6n^3 \tau \gamma$</td>
<td>$64p^2 \delta_3 + 64p^2 \delta_3 + 6n^3 \tau \gamma$</td>
</tr>
<tr>
<td>$nb = \frac{n}{2p_c}$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$64p^2 \delta_3 + 64p^2 \delta_3 + 6n^3 \tau \gamma$</td>
<td>$64p^2 \delta_3 + 64p^2 \delta_3 + 6n^3 \tau \gamma$</td>
</tr>
<tr>
<td>$p_c = 16p_r$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$\frac{1}{2} n^2 \delta_1 + 3n^3 \gamma$</td>
<td>$64p^2 \delta_3 + 64p^2 \delta_3 + 6n^3 \tau \gamma$</td>
<td>$64p^2 \delta_3 + 64p^2 \delta_3 + 6n^3 \tau \gamma$</td>
</tr>
</tbody>
</table>

### Notes

- This is the method used by Arbenz and Oettli [10].
- This is the one-sided method used by Littlefield and Maschhoff [125].
- This is the method used by Littlefield and Maschhoff [125].
- This is the one-sided method used by Arbenz and Oettli [10].

### Assumptions

- $p_s = \frac{1}{2} p^2$
- $nb = \frac{n}{2p_c}$
- $p_c = 16p_r$
Table 7.6: Performance models (flop counts) for two-sided Jacobi variants

<table>
<thead>
<tr>
<th></th>
<th>Unblocked</th>
<th></th>
<th></th>
<th>Blocked*</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Total (per sweep)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>$\frac{1}{2} n^2(\gamma_+ + \gamma_-)$</td>
<td>$\frac{1}{2} n^2(\gamma_+ + \gamma_-)$</td>
<td>$\frac{1}{2} n^2(\gamma_+ + \gamma_-)$</td>
<td>$\frac{1}{2} n^2(\gamma_+ + \gamma_-)$</td>
<td>$\frac{1}{2} n^2(\gamma_+ + \gamma_-)$</td>
<td></td>
</tr>
<tr>
<td>$+ 6n^2 \delta_1$</td>
<td>$+ 3n^2 \delta_1$</td>
<td>$+ 3n^2 \delta_1$</td>
<td>$+ 3n^2 \delta_1$</td>
<td>$+ 3n^2 \delta_1$</td>
<td></td>
</tr>
<tr>
<td>$+ 9n^3 \gamma_1$</td>
<td>$+ 9n^3 \gamma_1$</td>
<td>$+ 9n^3 \gamma_1$</td>
<td>$+ 9n^3 \gamma_1$</td>
<td>$+ 9n^3 \gamma_1$</td>
<td></td>
</tr>
</tbody>
</table>

| Assumption: |       |       |       |          |       |
| $nb = \frac{n}{2p_c}$ | $p_c = 16p_r$ | $p_c = 16p_r$ | $p_c = 16p_r$ | $p_c = 16p_r$ |
| $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ |
| $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ |
| $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ |
| $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ | $\frac{1}{8} \sqrt{p_c} \gamma_0$ |

*For parallel codes we assume that the block size is chosen to be as large as possible, i.e. $nb = n/(2p_c)$, where $p_c$ is the number of processor columns. For a serial code, $p_c = n/(2 \times nb)$ can be chosen arbitrarily.

*This is the method used by Pourzandi and Tourancheau [142], by Schreiber [150], and the method described in figure 7.3.

*Using fast Givens is often mentioned, but rarely implemented. Perhaps the benefit is not as good as this model would suggest.

*This is the method shown in figure 7.4.

*One sweep of Jacobi on a matrix of size $2nb$ by $2nb$.

*I also assume that only one processor in each processor column is involved in each partial eigendecomposition.

Table 7.7: Communication cost for Jacobi methods (per sweep)

<table>
<thead>
<tr>
<th></th>
<th>One-sided</th>
<th>Two-sided</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>1-D data layout<sup>a</sup></td>
<td>2-D data layout<sup>b</sup></td>
</tr>
<tr>
<td>exchange column vectors</td>
<td>$4p\alpha + 2n^2\beta$</td>
<td>$4p\alpha + 2n^2\beta$</td>
</tr>
<tr>
<td>Reduce $A^TA$</td>
<td>0</td>
<td>$2p \log(p)\alpha + \frac{2n^2}{p} \log(p)\beta$</td>
</tr>
<tr>
<td>Broadcast rotations<sup>f</sup></td>
<td>0</td>
<td>$2p \log(p)\alpha + \frac{2n^2}{p} \log(p)\beta$</td>
</tr>
</tbody>
</table>

<sup>a</sup>This is the method used by Arbenz and Slapnicar<sup>[9]</sup>

<sup>b</sup>This is the method used by Littlefield and Maschhoff<sup>[125]</sup>

<sup>c</sup>This is the method used by Pourzandi and Tourancheau<sup>[142]</sup>

<sup>d</sup>This is the 2D method most likely to be used today

<sup>e</sup>This is the method used by Schreiber<sup>[150]</sup>

<sup>f</sup>On the unblocked methods we assume that communication is blocked even though the computation is not. We also assume that each rotation is sent as a single floating point number. This is natural if you are using fast Givens but requires extra divides and square roots if fast Givens are not used.

### 7.3.6 Blocking

Classical Jacobi methods annihilate individual entries, whereas blocked Jacobi methods use a partial eigendecomposition on blocks. Cyclic Jacobi methods use fewer flops, especially if fast Givens rotations are used. However, almost all of the floating point operations in blocked Jacobi methods are performed in matrix-matrix multiply operations, the most efficient operation.

Both cyclic and blocked Jacobi methods can be blocked for communication. The communication block size need only be an integer multiple of the computation block size. Blocking for communication may be more important than blocking for computation because it reduces the number of messages by a factor equal to the communication block size.

Blocking allows greater possibilities for the partial eigendecomposition. A better partial eigendecomposition will lead to faster convergence. For example, performing two Jacobi sweeps in the partial eigendecomposition would result in fewer sweeps through the entire matrix. However, initial experiments indicate that on random matrices the best that one can hope for is a reduction of $\log(nb)$ in the number of full sweeps, even if one uses a complete eigendecomposition as the “partial eigendecomposition”.

Using a block size that is smaller than the maximum allowed (i.e. \( nb < n/(2p_c) \)) offers various possibilities. It allows communication to be pipelined to some extent. Alternatively it allows more than \( p_c \) processors to be involved in computing the partial eigendecompositions.

The per sweep cost of the partial eigensolutions grows as the square of the block size because larger block sizes mean that fewer processors are involved in the partial eigendecomposition.\(^8\)

I recommend keeping the code simple by keeping the communication and computation block size equal and setting \( nb = n/(2p_c) \) so that each parallel pairing involves one partial eigendecomposition per processor column. Using a rectangular process grid such that \((16p_r \leq p_c \leq 32p_r)\) requires a lower \( nb \) and hence allows the code to keep communication and computation block size equal while holding the cost of the partial eigendecomposition to \( \frac{2}{8} \) to \( \frac{3n^3}{4p_c} \gamma_1 \). On most machines this will be no more than half the \( \frac{2n^3}{p_c} \gamma_3 \) cost, in part because the partial eigendecomposition will fit in the highest level data cache.

A larger computational block size increases the cost of partial eigendecomposition and decreases the cost of the BLAS3 operations. Larger communication block size decreases message latency cost but leaves less opportunity for overlapping communication with computation. A larger ratio of \( p_c \) to \( p_r \) increases message latency but reduces the partial eigendecomposition cost.\(^9\) See section 7.3.9 for details on the partial eigendecomposition cost.

### 7.3.7 Symmetry

Exploiting symmetry in two-sided Jacobi methods is important because it reduces the number of flops per sweep from \( 12n^3 \) to \( 8n^3 \). However, exploiting symmetry while maintaining load balance is difficult. If, in a blocked Jacobi method, the block size were set to the largest value possible, i.e. \( n/(2p_c) \), and a standard rectangular grid of processors were used, half of the processors (either those above or below the diagonal) would be idle all the time. Using a smaller block size would allow better load balance but gives up some of the benefits of blocking. Alternatives, such as using a different processor layout for the eigenvector update, are feasible, but their complexity makes them unattractive.

---

\(^8\)This does not hold for \( nb < n/(2p) \).

\(^9\)Assuming only one processor per processor column is involved in computing partial eigendecompositions.
In one-sided Jacobi methods, $A^T A$ is symmetric and only half of it need be computed. In fact, only a quarter of it must be computed as shown in the following section.

### 7.3.8 Storing diagonal blocks in one-sided Jacobi

One-sided Jacobi methods must compute diagonal blocks of $A^T A$. This is shown in the MATLAB code given in figure 7.5 as: `blkA = A(:,[J,I])' * A(:,[J,I])`. This is inefficient because not only does it compute both halves of a symmetric (or Hermitian) matrix, but `A(:,I)' * A(:,I)` and `A(:,J)' * A(:,J)` are already known. They are the diagonal blocks returned by `parteig` on the most recent previous pairing which involved $I$ and $J$ respectively. Storing these blocks for future use avoids the need to recompute them, although they may need to be refreshed from time to time for accuracy reasons.
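
The saving described above can be illustrated with a minimal sketch. It is not the code of figure 7.5: the helper names `gram`, `transpose` and `pairing_block`, the dictionary `diag_cache`, and the ordering of `I` before `J` are all assumptions made for illustration, and the matrix is stored as a plain list of rows. Only the off-diagonal block is formed for each pairing; the diagonal blocks come from the cache and the lower half comes from symmetry.

```python
def gram(A, left, right):
    """Return A[:, left]^T * A[:, right] for A stored as a list of rows."""
    return [[sum(row[i] * row[j] for row in A) for j in right] for i in left]


def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def pairing_block(A, I, J, diag_cache):
    """Assemble [A_I A_J]^T [A_I A_J], forming only the off-diagonal block.

    diag_cache maps a tuple of column indices to a stored diagonal block.
    Missing entries are computed once and stored (a hypothetical policy).
    """
    for cols in (I, J):
        key = tuple(cols)
        if key not in diag_cache:
            diag_cache[key] = gram(A, cols, cols)
    G_ii = diag_cache[tuple(I)]
    G_jj = diag_cache[tuple(J)]
    G_ij = gram(A, I, J)        # the only block that must be computed now
    G_ji = transpose(G_ij)      # symmetry supplies the other half
    top = [r_ii + r_ij for r_ii, r_ij in zip(G_ii, G_ij)]
    bottom = [r_ji + r_jj for r_ji, r_jj in zip(G_ji, G_jj)]
    return top + bottom


# Small example: 3 rows, 4 columns, two column blocks of width 2.
A = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 2.0],
     [2.0, 0.0, 1.0, 1.0]]
cache = {}
block = pairing_block(A, [0, 1], [2, 3], cache)
```

In a complete method the cached entries would be overwritten with the diagonal blocks returned by the partial eigensolver after each pairing, and recomputed from the columns occasionally for accuracy, as noted above.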

### 7.3.9 Partial Eigensolver

My performance models suggest that execution time is likely to be minimized when the partial eigendecomposition consists of either one or two sweeps of Jacobi. The per sweep cost of the partial eigensolver grows as $O\left(\frac{n^2 \text{nb}}{\sqrt{p}}\right)$. In my recommended Jacobi method, the partial eigensolver consists of one sweep of Jacobi and, based on the data layout which I recommend, costs $\frac{3}{8} \frac{n^3}{p} \gamma_1 + O\left(\frac{n^2}{\sqrt{p}}\right)$, or roughly 10% to 30% of the total cost of the sweep. Preliminary experiments indicate that with a block size of 32, using a full eigendecomposition instead of a partial eigendecomposition may reduce the number of sweeps by as much as 20%. Assuming that a full eigendecomposition of a 32 by 32 matrix costs 6 times what a single sweep of Jacobi would cost, this analysis suggests that the added cost of a full eigendecomposition will not reduce the number of sweeps sufficiently to result in a net decrease in execution time, especially if `DGEMM` performs efficiently on a smaller block size\(^{10}\). On the other hand, since most of the advantage of a full eigendecomposition will come from the second sweep, using two sweeps of Jacobi in the partial eigensolver may result in a net decrease in execution time. This analysis depends on a great many assumptions and should be taken as a guide, not a prediction. Schreiber[150] reached a similar conclusion.

In a non-blocked code, the “partial eigendecomposition” should consist of a rotation, i.e. a full eigendecomposition. In a non-blocked code, the cost of the partial eigensolver, though still $O(n^2 \text{nb})$, is lower because $\text{nb} = 1$ and, for a 2 by 2 matrix, a single sweep of Jacobi is a full eigendecomposition. Except for very small $n$, say $n < 100$, partial eigendecompositions, such as those suggested by Götze[85], are not likely to result in lower total execution time.

---

\(^{10}\)A smaller block size reduces the cost of the partial eigensolver.

In a blocked eigensolver, one must compute a partial eigendecomposition for each pairing. Most commonly, a single sweep of two-sided Jacobi is used as the partial eigendecomposition. Since the elements in `A(:,I)' * A(:,I)` and `A(:,J)' * A(:,J)` are involved in more pairings than the elements in `A(:,I)' * A(:,J)`, they need not be annihilated in every pairing.

The number of partial eigenproblems that can be performed simultaneously is $\frac{n}{2\,\text{nb}}$, since each pairing involves two blocks. If this is less than $p$, either the partial eigenproblems must themselves be performed in parallel or some processors will be idle. Unless $\text{nb}$ is quite large, say $\text{nb} \geq 64$, it is likely to be faster to compute them each on a single processor, especially since the partial eigendecomposition is a two-sided, not one-sided, sweep.

If $n/(2\text{nb}) = p_c$, it is natural to assign one processor within each processor column to perform the partial eigendecomposition. If $n/(2\text{nb}) > p_c$, each parallel pairing will have more partial eigenproblems than processor columns, hence the code could assign different partial eigenproblems to different processors within each processor column. The other alternative is to increase $p_c$ (decreasing $p_r$). Hence, assigning different partial eigenproblems to different processors within a column only makes sense if bandwidth cost makes increasing $p_c$ unattractive. On the other hand, the only disadvantage to assigning different partial eigenproblems to different processors within a column (as opposed to increasing $p_c$) is increased code complexity.

If the cost of divisions and square roots ($\frac{1}{4}n^2 \frac{2n^2 + n^2}{p_c}$) is significant, one should consider inexact rotations in the partial eigensolver. Götze points out that one need not perform exact rotations and suggests a number of approximate rotations which avoid divides and square roots[85]. It would be counterproductive to use inexact rotations (saving $O(n^2)$ flops at the expense of increasing the number of sweeps and the accompanying $O(n^3)$ flops) in a parallel cyclic Jacobi method. Likewise I would be hesitant to use inexact rotations in the partial eigensolver unless doing so makes it feasible to perform two sweeps in the partial eigensolver. However, it is entirely possible that more sweeps with inexact rotations might be better than fewer sweeps using exact rotations in the partial eigensolver.

Using a classical threshold scheme in the partial eigensolver is likely to save little
time, but using thresholds to perform more important rotations might improve performance. A classical threshold scheme is not attractive because the processors performing fewer rotations would simply sit idle. However having each processor compute the same number of rotations, while using thresholds to skip some rotations might allow the rotations performed to be more productive.

### 7.3.10 Threshold

For serial cyclic codes, thresholds can significantly reduce the total number of floating point operations performed, especially on spectrally diagonally dominant matrices. Since Jacobi methods are most likely to be attractive on spectrally diagonally dominant matrices, thresholds cannot be rejected as unimportant. However, in a blocked parallel program, entire blocks can only be skipped if the whole block requires no rotations. As an example, consider a blocked parallel Jacobi eigensolution of a 1024 by 1024 matrix on a 1024 node computer using a block size of 16. This would involve 63 (or 64) steps, each of which would consist of 32 pairings performed in parallel. Each pairing involves a partial eigendecomposition of a $2 \times 16$ by $2 \times 16$ matrix. If any of the off-diagonal elements in any of the 32 pairings requires annihilation, no savings is achieved in that step. Hence, in the worst case, if just 63 of the 523,776 off-diagonal elements (one per step) require annihilation, the threshold algorithm realizes no benefit.
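
The step and pairing counts in this example follow directly from the matrix size and the block size. The short sketch below reproduces them; the function name `blocked_sweep_shape` is hypothetical, and the count of `blocks - 1` steps assumes a round robin pairing (a caterpillar pairing would take one more step).

```python
def blocked_sweep_shape(n, nb):
    """Return (steps, pairings per step, partial problem size) for one sweep."""
    blocks = n // nb                   # number of column blocks
    steps = blocks - 1                 # round robin; caterpillar would need `blocks`
    pairings_per_step = blocks // 2    # disjoint block pairs handled in parallel
    partial_size = 2 * nb              # each pairing joins two blocks
    return steps, pairings_per_step, partial_size


steps, pairings, size = blocked_sweep_shape(1024, 16)
print(steps, pairings, size)   # 63 32 32
```

Because a step can only be skipped when none of its pairings needs a rotation, as few as `steps` badly placed elements are enough to defeat the threshold test in every step.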

Corbato devised a method for implementing a classical Jacobi method in $O(n^3)$ time. His method involves keeping track of the largest off-diagonal element in each column. The cost of maintaining this data structure would more than double the cost of each rotation and may not lead to reduced execution time even in serial codes. However, Beresford Parlett pointed out to me that one need not keep track of the true largest element and that each rotation must maintain the sum of the squares of the elements; hence, allowing the list of “largest” off-diagonal elements to be out-of-date would not seriously undermine the advantage and would significantly reduce the overhead. This deserves further study.

Untested Threshold methods

One could design a code that used variable block sizes and/or switched from a one-sided-non-threshold Jacobi to a two-sided-threshold Jacobi. A code could even scan the matrix, identify the elements that need to be eliminated and select pairings and block
sizes that would eliminate those elements as efficiently as possible. In our worst case example given in the preceding paragraph, it might be that those 63 off diagonal elements could be annihilated in just two parallel steps each requiring only a two element rotation.

Scanning all off-diagonal elements and choosing the largest \( n \) non-interfering elements might be an attractive compromise between the classical Jacobi method, which examines all off-diagonal elements and annihilates the largest, and the cyclic Jacobi method, which annihilates all elements without regard to size. If software overhead could be kept modest, such a method might be attractive on small spectrally diagonally dominant matrices, precisely the matrices that are best suited to Jacobi methods.
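
One way such a selection could be sketched is a greedy scan, shown below. This is an assumption made for illustration, not a method taken from the literature cited here: the function name `largest_noninterfering` is hypothetical, and "non-interfering" is read as "sharing no row or column index", under which at most $n/2$ elements can be chosen for a single parallel step.

```python
def largest_noninterfering(A, count):
    """Greedily pick up to `count` large off-diagonal entries with disjoint indices.

    A is a symmetric matrix stored as a list of rows. Returns (i, j) pairs
    with i < j, ordered from largest to smallest magnitude.
    """
    n = len(A)
    candidates = sorted(
        ((abs(A[i][j]), i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    used = set()
    chosen = []
    for _, i, j in candidates:
        if len(chosen) == count:
            break
        if i in used or j in used:
            continue                  # would interfere with a chosen rotation
        chosen.append((i, j))
        used.update((i, j))
    return chosen


A = [[0, 5, 1, 2],
     [5, 0, 4, 3],
     [1, 4, 0, 6],
     [2, 3, 6, 0]]
pairs = largest_noninterfering(A, 2)
print(pairs)   # [(2, 3), (0, 1)]
```

The sort costs $O(n^2 \log n)$ comparisons per selection, so whether the overhead stays modest relative to the rotations it saves would have to be measured.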

Jacobi methods that attempt to annihilate larger elements, i.e. threshold methods, work best on two-sided Jacobi methods. This is unfortunate because it appears that one-sided Jacobi is otherwise preferred.

As mentioned in section 7.3.9, thresholds might be useful in the partial eigendecomposition.

### 7.3.11 Pairing

The order in which the off-diagonal elements are annihilated is referred to as the pairing strategy. Eliminating off-diagonal element \( A_{i,j} \) in a two-sided Jacobi requires that rows \( i \) and \( j \) of \( A \) and columns \( i \) and \( j \) of \( A \) be rotated. Hence, rows \( i \) and \( j \) of \( A \) must be distributed similarly, i.e. \( A_{i,k} \) and \( A_{j,k} \) must both reside on the same processor. Likewise, columns \( i \) and \( j \) of \( A \) must be distributed similarly. Orthogonalizing vectors \( i \) and \( j \) in a one-sided Jacobi also requires that the two vectors be distributed similarly. In order to annihilate multiple off-diagonal elements simultaneously, they must reside on different sets of processors.

The pairing strategy affects execution time through communication cost, number of pairings per sweep and number of sweeps required for convergence. Different pairing strategies require different communication patterns and hence have different communication costs. Some pairing strategies require slightly more pairings than others. Mantharam and Eberlein argue that some pairings lead to faster convergence than others[72].

In this section, we illustrate two pairing strategies, showing how each would pair 8 elements in 4 sets at a time. The elements might be individual indices (in a non-blocked Jacobi) or blocks of indices. The sets might correspond to individual processors (in a 1D
data layout) or columns of processors (in a one-sided Jacobi on a 2D layout) or rows and columns of processors (in a two-sided Jacobi on a 2D layout). Furthermore, several sets might be assigned to the same processor or column of processors.

The classic round robin pairing strategy leaves one element stationary and rotates the other elements. As the following diagram shows, in 7 pairings, each element is paired exactly once with each of the other elements. Elements 3 through 8 follow elements 2 through 7 respectively, while element 2 follows element 8.

\[
\begin{array}{cccc}
1 & 2 & 3 & 4 \\
8 & 7 & 6 & 5 \\
\hline
1 & 3 & 4 & 5 \\
2 & 8 & 7 & 6 \\
\hline
1 & 4 & 5 & 6 \\
3 & 2 & 8 & 7 \\
\hline
1 & 5 & 6 & 7 \\
4 & 3 & 2 & 8 \\
\hline
1 & 6 & 7 & 8 \\
5 & 4 & 3 & 2 \\
\hline
1 & 7 & 8 & 2 \\
6 & 5 & 4 & 3 \\
\hline
1 & 8 & 2 & 3 \\
7 & 6 & 5 & 4 \\
\end{array}
\]
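
The round robin schedule in the diagram can be generated mechanically. The sketch below is one way to do so; the function name `round_robin` and the ring representation are illustrative choices, not taken from any of the cited codes. Element 1 stays fixed while the remaining elements move around a ring, and each top entry is paired with the bottom entry in the same position.

```python
from itertools import combinations


def round_robin(n):
    """Yield (top, bottom) rows for each of the n - 1 pairings; n must be even."""
    ring = list(range(2, n + 1))           # every element except the stationary 1
    half = n // 2
    for _ in range(n - 1):
        top = [1] + ring[:half - 1]
        bottom = ring[half - 1:][::-1]     # bottom row is read right to left
        yield top, bottom
        ring = ring[1:] + ring[:1]         # each element follows its predecessor


schedule = list(round_robin(8))
for top, bottom in schedule:
    print(top, bottom)

# Every unordered pair of the 8 elements should appear exactly once.
seen = sorted(tuple(sorted(p)) for top, bottom in schedule for p in zip(top, bottom))
all_pairs = sorted(combinations(range(1, 9), 2))
```

With `n = 8` the first three pairings produced are `1 2 3 4 / 8 7 6 5`, `1 3 4 5 / 2 8 7 6` and `1 4 5 6 / 3 2 8 7`, matching the diagram.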

A slight variation, called the caterpillar pairing method, cuts the communication cost in half at the expense of increasing the number of pairings from \( n - 1 \) to \( n \). The caterpillar method, modified so that communication is always performed in the same direction, is shown below. Only the elements in the top line rotate, and they always rotate to the left. The elements, shown in red, in the bottom line get swapped into the top line one at a time. In this pairing method, it takes 8 pairings in order for each element to be paired with every other element. The swapped elements need not perform any work, but must exchange the blocks assigned to them prior to the next communication step. This pairing strategy requires 16 (in general \( 2n \)) pairings to come back to the original pairing, but the second \( n \) pairings duplicate the first \( n \).
Mantharam and Eberlein[72] suggest that some pairing strategies may lead to convergence in fewer steps than others.

### 7.3.12 Pre-conditioners

One-sided Jacobi methods compute eigenvectors by orthogonalizing a matrix which has the same or related left singular vectors as the original matrix. Some options include:

- `[U, D, V] = svd(A);` $U$ contains the eigenvectors of $A$, and $D$ contains the absolute values of the eigenvalues of $A$. This method is used by Berry and Sameh[21].

- `[U, D, V] = svd(chol(A));` $U$ contains the eigenvectors of $A$, and $D$ contains the square roots of the eigenvalues of $A$. This is used by Arbenz and Slapničar[9] and is mathematically equivalent to classical Jacobi.

- `[Q, R] = qr(A); [U, D, V] = svd(R);` $Q \ast U$ contains the eigenvectors of $A$, and $D$ contains the absolute values of the eigenvalues of $A$.

In addition, there are pivoting counterparts to both Cholesky and QR, indeed many flavors of QR with pivoting, which would improve these pre-conditioners. If $A$ is spectrally diagonally dominant, permuting $A$ so that the diagonal elements are non-increasing might provide most of the benefit that Cholesky with pivoting does and at considerably lower cost.
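
A short derivation may help to show why these factorizations expose the eigenvectors. It assumes a symmetric $A = X \Lambda X^T$ with $X$ orthogonal and with distinct, nonzero singular values; for the Cholesky variant it further assumes that $A$ is positive definite and that the factor is taken in lower-triangular form, $A = LL^T$. These conventions are assumptions made here for illustration.

\[
A = X \Lambda X^T = X \, |\Lambda| \, (X S)^T, \qquad S = \operatorname{sign}(\Lambda),
\]

so $U = X$, $D = |\Lambda|$, $V = XS$ is a singular value decomposition of $A$. For the Cholesky factor,

\[
L = U D V^T \;\Rightarrow\; A = L L^T = U D V^T V D U^T = U D^2 U^T,
\]

so $U$ holds the eigenvectors and $D$ the square roots of the eigenvalues. For the QR variant,

\[
A = QR, \quad R = U D V^T \;\Rightarrow\; A = (QU) \, D \, V^T,
\]

which is again a singular value decomposition of $A$, so $QU$ plays the role that $U$ plays in the first case.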

### 7.3.13 Communication overlap

Overlapping communication and computation is attractive because in theory it reduces the total cost from the sum of the computation and communication costs to their maximum. Arbenz and Slapničar demonstrated that overlapping communication and computation is straightforward in a one-sided Jacobi method with a one-dimensional data layout[10]. However, overlapping communication and computation when using a two-dimensional data layout is not as straightforward. Furthermore, actual experience with communication and computation overlap has been disappointing; see section B.1.6.

### 7.3.14 Recursive Jacobi

The partial eigendecomposition could be a recursive call to a Jacobi eigensolver. A recursive Jacobi could offer all the benefits shown by Toledo on LU[165], notably excellent use of the memory hierarchy. Unfortunately, each level of recursion requires 6 calls, tripling the software overhead. Therefore, the number of subroutine calls, and hence the software overhead, grows at an unacceptably high $O(n^{1.6})$.

Increasing software overhead in order to reduce the number of sweeps will make sense for large matrices but not for small matrices. Since Jacobi is unlikely to be faster than tridiagonal based methods for large matrices, I feel that it is more important to concentrate on making Jacobi fast on smaller matrices. Hence, I do not include recursion as a part of my recommended Jacobi method. Nonetheless, it may be that one step of recursion (tripling the software overhead) and conceivably two steps of recursion (increasing software overhead by a factor of 9) may reduce total execution time, but I would not expect the improvement to be significant.
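
As a rough check on the growth rate quoted above, assume (for illustration only) that each level of recursion halves the block size and triples the software overhead, and that the recursion continues down to blocks of size one. The overhead after $\log_2 n$ levels is then

\[
3^{\log_2 n} = n^{\log_2 3} \approx n^{1.585},
\]

which matches the $O(n^{1.6})$ figure, whereas stopping after one or two levels costs only the factors of $3$ and $3^2 = 9$ mentioned above.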

### 7.3.15 Accuracy

Demmel and Veselić[58] prove that on scaled diagonally dominant matrices, Jacobi can compute small eigenvalues with high relative accuracy while tridiagonal based methods cannot. Drmač and Veselić[71] show that Jacobi methods can be used to refine an eigen-solution, thereby providing high relative accuracy on scaled diagonally dominant matrices at lower total cost than a full Jacobi. Demmel et al.[56] give a comprehensive discussion of the situations in which Jacobi is more accurate than other available algorithms.

### 7.3.16 Recommendation

If I were asked to write one Jacobi method for all non-vector distributed memory computers, it would be a one-sided blocked Jacobi method. It would use a one-dimensional data layout on computers with fewer than 48 nodes and a two-dimensional data layout on computers with 48 or more nodes. It would use 16-32 times as many processor columns as rows in a two-dimensional data layout\(^{11}\). It would use a computational and communication block size equal to\(^{12}\) \(\max(n/(2p_c), 8)\), leaving processors idle if \(n/(2p_c) < 8\). It would compute the partial eigendecompositions on just one processor in each processor column. It would avoid recomputing diagonal entries unnecessarily, use a one-directional caterpillar track pairing and one sweep of Jacobi for the partial eigendecomposition. It would use the largest block size possible for both computation and communication.

If I had time to experiment, I would investigate different partial eigendecompositions, pre-conditioners and pairing strategies, in that order. Overlapping communication and computation appears to offer greater performance improvements in theory than in practice. I would use thresholds as a part of the stopping criteria, but wouldn’t count on them to avoid unnecessary flops. I would check to make sure that my suggested data layout (1D for \(p < 48\); \(16p_r \leq p_c \leq 32p_r\) for \(p \geq 48\); and \(nb = \max(n/(2p_c), 8)\)) was reasonable on several computers, but unless there was a substantial benefit to tuning the data layout to each machine I would hesitate to do so.
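
The layout rule can be written down as a small selection routine. The sketch below is only one reading of the recommendation: the function name `choose_layout` and its parameters are hypothetical, a one-dimensional layout is represented as a $1 \times p$ grid, the block size is taken as $\max(n/(2p_c), 8)$ in keeping with section 7.3.6, and among the admissible grids the one that uses the most processors is chosen.

```python
def choose_layout(p, n, min_ratio=16, max_ratio=32, min_nb=8, one_d_limit=48):
    """Return (p_r, p_c, nb) for p processors and an n by n matrix.

    Below one_d_limit processors a 1 x p grid (one-dimensional layout) is used.
    Otherwise the grid with min_ratio*p_r <= p_c <= max_ratio*p_r that uses the
    most processors is chosen; any leftover processors are simply left idle.
    """
    if p < one_d_limit:
        p_r, p_c = 1, p
    else:
        best = None
        for rows in range(1, p + 1):
            cols = min(p // rows, max_ratio * rows)
            if cols < min_ratio * rows:
                continue               # ratio p_c / p_r would fall below min_ratio
            if best is None or rows * cols > best[0] * best[1]:
                best = (rows, cols)
        p_r, p_c = best
    nb = max(n // (2 * p_c), min_nb)
    return p_r, p_c, nb


print(choose_layout(32, 1024))    # (1, 32, 16)
print(choose_layout(64, 4096))    # (2, 32, 64)
print(choose_layout(100, 800))    # (2, 50, 8)
```

For processor counts such as 48, no grid in the 16 to 32 range uses every processor, so this sketch falls back to a $1 \times 32$ grid and leaves the remainder idle.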

For vector machines I recommend an unblocked code with fast Givens rotations if the cost of BLAS1 operations is no more than twice that of BLAS3 operations. If the BLAS1 operations cost just twice what BLAS3 operations cost, the flop cost in an unblocked code would be 6/5 that of the blocked code (because unblocked codes using fast Givens require 3/5 as many flops). Savings on other aspects can be expected to make up for this difference on all but the largest matrices. Communication should still be blocked, however. A one-dimensional data layout can be used for more nodes if a cyclic code is used, perhaps as many as a hundred nodes, since block size is not an issue. As long as $n < 2p$ a one-dimensional data layout is limited only by communication costs.

---

\(^{11}\)The ratio \(\frac{p_c}{p_r}\) can be made to fall in the 16-32 range for any number of processors except 1 to 15, 32 to 63 and 128 to 144. No more than 2.1% of the processors are left idle following these rules.

\(^{12}\)Definitions for all symbols used here can be found in Appendix A.

Combining elements of classical and cyclic Jacobi is an interesting long shot. Classical Jacobi always annihilates the largest off-diagonal element but requires $O(n^4)$ comparisons per sweep\(^{13}\). Annihilating the $n$ largest off-diagonal elements each time would roughly match the number of comparisons performed to the number of flops performed. To parallelize this idea, one would have to choose the $n$ largest non-interfering elements.

### 7.4 ISDA

The total execution time for the ISDA[97] for solving the symmetric eigenproblem\(^{14}\) will correspond to no fewer than $100n^3$ flops on typical matrices. The execution time depends largely on how many decouplings are required to make each of the smaller matrices no larger than half the size of the original matrix. It also depends on the cost of each decoupling, but this will not vary that much.

The ISDA achieves high floating point execution rates, but in order to beat tridiagonal methods it must achieve $100/(10/3) = 30$ times higher floating point rates, which it does not. The PRISM implementation of ISDA takes 36 minutes = 2160 seconds to compute the eigendecomposition of a matrix of size 4800 by 4800 on the 100 node SP2 at Argonne[29]. ScaLAPACK's `PDSYEVX` takes 397 seconds to compute the eigendecomposition of a matrix of size 5000 by 5000 on a 64 node SP2[31]. ISDA should not require as large a granularity, $n/\sqrt{p}$, as `PDSYEVX` because of its heavy reliance on matrix-matrix multiply. However, at present, the PRISM implementation is still at least three times slower than `PDSYEVX` even on small matrices. Solving a matrix of size 800 by 800 on 64 nodes takes 60 seconds using the PRISM ISDA code, whereas `PDSYEVX` can solve a matrix of size 1000 by 1000 on 64 nodes of an SP2 in 16 seconds.

The cost of each decoupling depends upon how close the split happens to come to an eigenvalue of the matrix being split. The number of beta function evaluations required for a given decoupling is roughly \(-\log\left(\min_{i} |\mathit{split} - \lambda_i|\right)\), where \(\mathit{split}\) is the split point selected for this decoupling. The distance between \(\mathit{split}\) and the nearest eigenvalue cannot be computed in advance, but the number of evaluations is likely to fall in the range \(\left[\log(n)/\log(1.5) + 2,\ \log(n)/\log(1.5) + 8\right]\). This is consistent with empirical results. For our purposes we will say that the number of beta function evaluations is \(\log(1500)/\log(1.5) + 2 \approx 20\). The cost per beta function evaluation is 2 matrix-matrix multiplies at \(\frac{2(n')^3}{p}\gamma_3\) each, where \(n'\) is the size of the matrix being decoupled. Hence the cost for the first decoupling is \(2 \times 2 \times 20\,\frac{n^3}{p}\gamma_3 = 80\,\frac{n^3}{p}\gamma_3\).

---

\(^{13}\)Or increased overhead if Corbato's method[47] is used.

\(^{14}\)See section 2.7.3 for a brief description of the ISDA.

If each decoupling splits the matrix exactly in half, round \(i\) of decouplings involves \(2^i\) decouplings, each involving a matrix of size \(n/2^i\), at a total cost of \(2^i \times 80(n/2^i)^3 = 80n^3/4^i\). The sum of all rounds would then be \(\sum_{i=0}^{\infty} 80n^3/4^i = 80n^3 \times 4/3 \approx 107n^3\).
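
The arithmetic behind these estimates is easy to reproduce. The sketch below uses hypothetical function names and simply evaluates the expressions given above: the estimated number of beta function evaluations, the flop coefficient of the first decoupling, and the geometric sum over rounds when every split is exactly even.

```python
import math


def beta_evaluations(n, extra=2):
    """Estimated number of beta function evaluations: log(n)/log(1.5) + extra."""
    return math.log(n) / math.log(1.5) + extra


def first_decoupling_coeff(evaluations=20, multiplies=2, flops_per_multiply=2):
    """Coefficient of n^3 in the flop count of the first decoupling."""
    return multiplies * flops_per_multiply * evaluations


def total_coeff(first=80.0, rounds=40):
    """Sum of first / 4**i over the rounds, assuming every split is exactly even."""
    return sum(first / 4 ** i for i in range(rounds))


print(round(beta_evaluations(1500)))   # 20
print(first_decoupling_coeff())        # 80
print(round(total_coeff()))            # 107
```

Forty rounds are far more than any practical matrix would need; the partial sums settle to $80 \times 4/3$ after only a few terms.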

The ISDA for symmetric eigendecomposition may require substantially longer on some matrices with a single cluster of eigenvalues containing more than half of the eigenvalues and on matrices with most of the eigenvalues at one end of the spectrum\(^{15}\). It is unlikely that the first split point chosen for decoupling will lie in the middle of a cluster. Hence, if the matrix contains one large cluster, that cluster will likely remain completely in one of the two submatrices, making the decoupling less even and hence less successful. Likewise, if most of the eigenvalues are at one end of the spectrum, the submatrix on that end of the spectrum will likely be much larger than the other after the first decoupling. If each decoupling splits off only 20% of the spectrum, the total time will be twice what it would be if each decoupling splits the spectrum exactly in half.

One could check to make sure that a reasonable split point has been chosen by performing an \(LDL^T\) decomposition on the shifted matrix and counting the number of positive or negative values in \(D\). An \(LDL^T\) decomposition costs \(\frac{1}{3}n^3\) flops, or about 0.5% of the flops required to perform the full decoupling.

### 7.5 Banded ISDA

Banded ISDA is very nearly a tridiagonal based method and hence offers performance that is nearly as good as tridiagonal based methods. PRISM's single processor implementation of banded ISDA is two to three times slower than bisection (`DSTEBZ`[26]).

\(^{15}\)Fann et al.[75] present a couple of examples of real applications that fit this description.

Computing eigenvectors using banded ISDA will not only be more difficult to code, it will require about twice as many flops as inverse iteration. Banded ISDA requires additional bandwidth reductions, each of which requires up to $2n^3$ additional flops during back transformation\(^{16}\).

Banded ISDA could make sense if reduction to banded form were twice as fast as reduction to tridiagonal. Although even then one has to question whether it makes sense to use banded ISDA instead of a banded solver.

Banded ISDA should perform a few shifted $LDL^T$ decompositions to make sure that the selected shift will leave at least $1/3$ of the matrix in each of the two submatrices.

### 7.6 FFT

Yau and Lu[174] have implemented an FFT based invariant subspace decomposition method. It, like ISDA, uses efficient matrix-matrix multiply flops, but since it requires $100n^3$ flops, the same analysis which shows that ISDA will not be faster applies to it as well. Domas and Tisseur have implemented a parallel version of the Yau and Lu method[60].

\(^{16}\)The first bandwidth reduction essentially always requires the full $2n^3$ flops during back transformation, though later ones typically require less than that. However, taking advantage of the opportunity to perform fewer flops either means a complex data structure or that the update matrix $Q$ be formed and then applied, adding another $\frac{4}{3}n^3$ flops.
Chapter 8

Improving the ScaLAPACK symmetric eigensolver

### 8.1 The next ScaLAPACK symmetric eigensolver

The next ScaLAPACK symmetric eigensolver will be 50% faster than the ScaLAPACK symmetric eigensolver in version 1.5 and provide performance that is independent of the user's data layout. Separating internal and external data layout will not only make the code easier to use because users need not modify their storage scheme, it will also improve performance. The next ScaLAPACK symmetric eigensolver will select the fastest of four methods for reduction to tridiagonal form\(^1\), and use Parlett and Dhillon's new tridiagonal eigensolver[139].

Separating internal and external data layout allows execution time to be reduced for four reasons. It allows reduction to tridiagonal form and back transformation to use different data layouts. It allows reduction to tridiagonal form to use a square processor grid, significantly reducing message latency and software overhead. It allows the code to support any input and output data layout without all the layers of software required to support any data layout. Last but not least, by concentrating our coding efforts on the simple but efficient square cyclic data layout, we can implement several reduction to tridiagonal codes and incorporate ideas that would be prohibitively complicated in a code that had to support multiple data layouts.

\(^1\)On machines where timers are not available, a heuristic will be used which may not always pick the fastest.
The rest of this section concentrates on improving execution time in reduction to tridiagonal form. Back transformation is already very efficient and hence leaves less room for improvement. We leave the tridiagonal eigensolver to others[139]. Figure 8.1 gives a top-level description of the next ScaLAPACK symmetric eigensolver.

Figure 8.1: Data redistribution in the next ScaLAPACK symmetric eigensolver

Choose a data layout for reduction to tridiagonal form (see figure 8.2)
Redistribute A to reduction to tridiagonal form data layout
Reduce to tridiagonal form
Use Parlett and Dhillon's tridiagonal eigendecomposition scheme
Choose data layout for back transformation
Redistribute A to back transformation data layout
If space is limited, redistribute A back to original data layout
Redistribute eigenvectors, Z, to back transformation data layout
Perform back transformation
Redistribute eigenvectors to user's format

8.2 Reduction to tridiagonal form in the next ScaLAPACK symmetric eigensolver

Figure 8.2 shows how the data layout for reduction to tridiagonal form will be chosen. The data layout and the code used for reduction to tridiagonal form must be chosen in tandem.
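The innermost branches of Figure 8.2 (no timers available, square processor grid already selected) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the boolean `compiler_is_good` flag and the returned labels are hypothetical, and only the thresholds $200\sqrt{p}$ and $100\sqrt{p}$ come from the figure.

```python
import math


def choose_trd_code(n, p, compiler_is_good):
    """Pick a reduction to tridiagonal form code on a square processor grid.

    Covers only the inner branches of Figure 8.2. The returned strings are
    labels for this sketch, not ScaLAPACK routine names.
    """
    # Threshold on the matrix size n below which the unblocked code is used.
    factor = 200 if compiler_is_good else 100
    threshold = factor * math.sqrt(p)

    if n > threshold:
        if compiler_is_good:
            return "new PDSYTRD, compiled kernel"
        return "new PDSYTRD, DGEMV"
    return "unblocked, no BLAS"
```

For example, `choose_trd_code(1000, 16, True)` compares $n = 1000$ against $200\sqrt{16} = 800$ and therefore selects the blocked code with the compiled kernel.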

Although the new PDSYTRD has three variants, they all share the same pattern of communication and computation shown in figure 8.3.

Message initiations are reduced by using techniques first used in HJS, and several new ones. HJS stores V and W in a row-distributed/column-replicated manner, which avoids the need to broadcast them repeatedly. HJS also keeps the number of messages small by combining messages wherever possible.

Our communication pattern has three advantages over HJS: it requires fewer messages, does not risk over/underflow, and uses only the BLACS communication primitives\(^2\). The manner in which we compute the Householder vector requires the same number of message initiations as HJS, but avoids the risk of over/underflow in the computation of the norm. We use fewer messages than HJS because we update \(w\) in a novel manner (see

\(^2\)Whether the right communication primitives were chosen for the BLACS may be debatable, but they are what is available for use within ScalAPACK.
If timers (or environmental inquiry routine) are available
    determine the best data layout for each of the four reduction to tridiagonal form codes
else
    if \( p \neq 1 \),
        \( TRD_{-p_r} = b_{p_r} = \frac{p}{p_r} + 1 \)
        use old PDSYTRD
    else
        \( TRD_{-p_c} = \frac{p}{p_r} = \frac{p}{p_c} \)
        if the compiler is good
            if \( n > 200 \sqrt{p} \)
                use new PDSYTRD with compiled kernel
            else
                use unblocked reduction to tridiagonal form (no BLAS)
            endif
        else
            if \( n > 100 \sqrt{p} \)
                use new PDSYTRD with DGEMV
            else
                use unblocked reduction to tridiagonal form (no BLAS)
            endif
        endif
    endif

Figure 8.2: Choosing the data layout for reduction to tridiagonal form

Discussion of Line 4.1 below) and we delay the spread of \( w \) (which HJS naturally performs at the bottom of the loop) to the top of the loop so that it can be spread in the same message that spreads \( v \).

Our communication pattern has one disadvantage relative to HJS: it requires redundant computation in the update of \( w \). The discussion of Line 4.1 below explains that we can choose to eliminate this redundant computation by increasing the number of messages.

**Line 2.1 in Figure 8.3** In Section 8.4.1 we show how to avoid overflow while using just \( 2n \log(\sqrt{p}) \) messages.

**Lines 3.2 and 3.6 in Figure 8.3** Only 2 messages are required to transpose a matrix when a square processor layout is used. Each processor, \((a, b)\) must send a message to, and receive a message from, its transpose processor \((b, a)\). The required time is:

\[
\sum_{n'=1}^{n} 2(\alpha + 2n'\beta) = 2n\alpha + 2n^2\beta
\]
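
As a check on the closed form, and taking the per-step cost \( 2(\alpha + 2n'\beta) \) exactly as written above, the sum can be expanded term by term. The stated result keeps only the leading-order term in \( n \):

\[
\sum_{n'=1}^{n} 2(\alpha + 2n'\beta) = 2n\alpha + 4\beta \sum_{n'=1}^{n} n' = 2n\alpha + 2n(n+1)\beta \approx 2n\alpha + 2n^2\beta \quad (n \gg 1)
\]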

**Line 4.1 in Figure 8.3** \( w = w - WV^Tv - VW^Tv \) can be computed in a number of ways. \( W, V \) and \( v \) are distributed across processor rows and replicated across processor columns.
Figure 8.3: Execution time model for the new \texttt{PDSYTRD}. Line numbers match Figure 4.5 (\texttt{PDSYTRD} execution time) where possible.

<table>
<thead>
<tr>
<th colspan="2">computation</th>
<th colspan="2">communication</th>
</tr>
<tr>
<th>overhead</th>
<th>imbalance</th>
<th>latency</th>
<th>bandwidth</th>
</tr>
</thead>
</table>

\begin{verbatim}
do ii = 1, n, nb
  mxi = min(ii + nb, n)
  do i = ii, mxi
\end{verbatim}

**Update current \((i^{th})\) column of \(A\)**

1.2 \( A = A - W V^T - V W^T \)

**Compute reflector**

2.1 \( v = \text{house}(A) \)

2.2 \( n \log(\sqrt{p}) \alpha \)

**Perform matrix-vector multiply**

1.1, 3.1 \( \text{spread } v, w \text{ across} \)

3.2 \( n \log(\sqrt{p}) \alpha \)

3.3 \( w = \text{tril}(A) v; \)

3.4 \( w^T = \text{tril}(A, -1) v^T \)

**Update the matrix-vector product**

4.1 \( w = w - W V^T v - V W^T v \)

4.3 \( w = w + \text{transpose } w^T \)

**Compute companion update vector**

5.1 \( c = w \cdot v^T; \)

5.6 \( 2 n \log(\sqrt{p}) \alpha \)

5.7 \( w = \tau w - (c \tau /2) v \)

**Perform rank \(2k\) update**

6.3 \( A = A - W V^T - V W^T \)

6.6 \( 2 n \log(\sqrt{p}) \alpha \)

6.7 \( 2 n \log(\sqrt{p}) \alpha \)

Appendix B.1 explains how this update can be performed without communication and shows that there is a range of options which trade off communication and load imbalance.

**Line 1.1 in Figure 8.3** updates the current block column. This can be implemented in
several ways. LAPACK's DSYTRD uses a right looking update\(^3\) because a matrix-matrix multiply is more efficient than an outer product update. HJS uses a left looking update because on their cyclic data layout, the left looking update allows all processors to be involved, reducing load imbalance.

**Line 5.1 in Figure 8.3** Computing \(c = w v^T\) requires summing \(c\) within a processor column. In order to compute \(w\) in Line 5.1, \(c\) must be known throughout a processor column. To allow \(w\) and \(v\) to be broadcast in the same message (Line 3.1), \(c\) is summed and broadcast in the column that owns column \(i + 1\) of the matrix.

**Line 6.1 in Figure 8.3** No communication is required here. \(W, V^T\) and \(W^T\) are already replicated as necessary.

### 8.3 Making the ScaLAPACK symmetric eigensolver easier to use

The next ScaLAPACK symmetric eigensolver will separate internal data layout from external data layout while executing 50\% faster than PDSYEVX on a large range of problem sizes on most distributed memory parallel computers and requiring less memory. Separating internal and external data layout allows the user to choose whatever data layout is most appropriate for the rest of their code and to use that data layout regardless of the problem size and computer they are using. Separating internal and external data layouts also makes it easy for the ScaLAPACK symmetric eigensolver to add support for additional data layouts. However, while these ease-of-use issues are the most important advantages of separating internal and external data layout, we will focus further discussion on how this separation improves performance.

### 8.4 Details in reducing the execution time of the ScaLAPACK symmetric eigensolver

Separating internal and external data layout will improve the performance of PDSYEVX by allowing PDSYEVX to use different data layouts for different tasks, and by allowing PDSYEVX to concentrate only on the most efficient data layout for each task. A reduction to tridiagonal form which only works on a cyclic data layout on a square processor grid will not only have lower overhead and load imbalance than the present reduction to tridiagonal form, but will be able to incorporate techniques that would be prohibitively complicated if they were implemented in a code that must support all data layouts.

---

\(^3\)A right looking update updates the current column with a matrix-matrix multiply. A left looking update updates every column in the block column with an outer product update.

Significant reduction of the execution time in PDSYEVX, the ScaLAPACK symmetric eigensolver, requires that all four sources of inefficiency (message latency, message transmission, software overhead and load imbalance) be reduced. Fortunately, as Hendrickson, Jessup and Smith[91] have shown, all of these can be reduced. PDSYEVX sends 3 times as many messages as necessary, and requires 3 times as much message volume as well. Overhead and load imbalance costs are harder to quantify. Load imbalance costs will be reduced by using data layouts appropriate to each task. If necessary, load imbalance costs can be further reduced at the expense of increasing the number of messages sent. Overhead will be reduced by eliminating the PBLAS, reducing the number of calls to the BLAS and, where a sufficiently good compiler is available, eliminating the calls to the BLAS entirely.

8.4.1 Avoiding overflow and underflow during computation of the Householder vector without added messages

Overflow and underflow can be avoided during the computation of the Householder vector without added messages by using the pdnrm2 routine to broadcast values. The easiest way to compute the norm of a vector in parallel is to sum the squares of the elements. However, this leads to overflow if the square of one of the elements, or one of the intermediate values, is greater than the overflow threshold (likewise, underflow occurs if one or more of the squares of the elements or the intermediate values is less than the underflow threshold). The ScaLAPACK routine pdnrm2 avoids underflow and overflow during reductions by computing the norm directly, leaving the result on all processors in the processor column. This requires $2 \log(p_r) \alpha$ execution time. In PDSYTRD, $\alpha = A(i+1,i)$ is broadcast

$^4$PDSYEVX uses $17 n \log(\sqrt{p})$, HJS uses $9 n \log(\sqrt{p})$, we will show that this can be reduced to $5 n \log(\sqrt{p})$ but do not claim that this is minimal.

$^5$PDSYEVX sends $(5 \log(\sqrt{p}) + 2) n^2/\sqrt{p}$ elements per processor and HJS reduces this to $(2 \log(\sqrt{p}) + \frac{n}{2}) n^2/\sqrt{p}$ elements per processor. The design I suggest requires $(2 \log(\sqrt{p}) + \frac{n}{2}) n^2/\sqrt{p}$ elements per processor but requires fewer messages.

$^6$Statically balancing the number of eigenvectors assigned to each processor column will reduce load imbalance in back transformation. Using a smaller block size will reduce load imbalance in reduction to tridiagonal form.
to all processors in the processor column; this requires $2\log(p_r)\alpha$ execution time. HJS sums the squares of the elements and broadcasts $\alpha = A(i+1,i)$ at the same time by summing an additional value in the reduction. All processors except the one that owns $A(i+1,i)$ contribute 0 to the sum, while the processor owning $A(i+1,i)$ contributes $A(i+1,i)$.

In the new PDSYEVX, we will employ this trick to broadcast $\alpha$ at the same time as the norm is computed. It is slightly more complicated because norm computations do not preserve the sign of negative numbers. Hence, we compute two norms, of $\max(0, \alpha)$ and of $\max(0, -\alpha)$; from these $\alpha$ is easily recovered as their difference. Ideally, we need a new PBLAS or BLACS routine which would simultaneously compute a norm and broadcast both it and other values.
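
The two-norm trick can be illustrated with a small serial simulation. This is a sketch under stated assumptions, not the pdnrm2 implementation: the processors of one processor column are modelled as a Python list, the function names are hypothetical, and the standard library's `math.hypot` stands in for an overflow-safe way of combining partial norms.

```python
import math
from functools import reduce


def local_norm(values):
    """Overflow-safe 2-norm of the elements held by one processor."""
    return reduce(math.hypot, values, 0.0)


def combine(left, right):
    """Combine two (norm, pos, neg) triples element-wise, norm-style."""
    return tuple(math.hypot(x, y) for x, y in zip(left, right))


def norm_and_alpha(local_pieces, owner, alpha):
    """Reduce the column norm and recover alpha from the same reduction.

    local_pieces[r] holds processor r's share of the vector; only the
    processor `owner` knows alpha and contributes its positive and
    negative parts. Everyone else contributes zeros.
    """
    triples = []
    for rank, piece in enumerate(local_pieces):
        pos = max(0.0, alpha) if rank == owner else 0.0
        neg = max(0.0, -alpha) if rank == owner else 0.0
        triples.append((local_norm(piece), pos, neg))
    norm, pos, neg = reduce(combine, triples, (0.0, 0.0, 0.0))
    # One of pos and neg is zero, so the difference restores the sign.
    return norm, pos - neg
```

With elements as large as `3e200` and `4e200`, squaring would overflow in double precision, whereas the combined norm stays finite and a negative `alpha` is still recovered with its sign.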

### 8.4.2 Reducing communications costs

Communications costs can be reduced in both reduction to tridiagonal form and back transformation but by vastly different methods. PDSYTRD, ScaLAPACK's reduction to tridiagonal form code, will use a cyclic data layout on a square processor grid to simplify the code, allowing PDSYEVX to use the techniques demonstrated by Hendrickson, Jessup and Smith[91]: direct transpose, a column replicated/row distributed data layout for intermediate matrices and combining messages. In addition, PDSYTRD will delay the last operation in the loop to combine it with the first, reducing the number of messages per loop iteration from 6 to 5.

Communication costs will be reduced in back transformation by using a rectangular grid and a relatively large block size. Most of the communication in back transformation is within processor columns, and the communication within processor columns cannot be pipelined (meaning that it grows as $\log(p_r)$), hence setting $p_c$ to be substantially larger (roughly 4-8 times larger) than $p_r$ will cut message volume nearly in half compared to the message volume required for a square processor grid.

Communications cost could be reduced further on select computers by writing machine specific BLACS implementations\(^7\), but I don't think that the benefit will justify the

\(^7\)Karp et al.[107] proved that a broadcast or reduction of $k$ elements on $p_x$ processors can be executed in $\log(p_x)\alpha + k\beta$. Equally importantly, the latency term can be reduced significantly by machine specific code because latency is primarily a software cost; the actual hardware latency is typically less than one tenth of the total observed latency. I believe that by coding broadcasts and reductions in a machine specific manner, I could reduce the latency to $\alpha_{\text{software}} + \log(p_x)\alpha_{\text{hardware}}$. It might be possible to achieve a similar result using active messages. Machine specific optimization of the BLACS broadcast and reduction codes is attractive because it would benefit all of the ScaLAPACK matrix transformation codes. However,
cost. In PDSYEVX as shipped in version 1.5 of ScaLAPACK, software overhead and load imbalance are roughly twice as high as communications cost on the PARAGON. The new PDSYEVX should reduce communications by at least a factor of 2, and though I hope it will reduce software overhead and load imbalance by close to a factor of 4, overhead and load imbalance will probably remain larger than communications cost. The fact that communication cost is not the dominant factor limiting efficiency limits the improvement that one can expect from machine specific BLACS implementations.

Communications cost in back transformation could be reduced further by overlapping communication and computation and/or using an all-to-all broadcast pattern instead of a series of broadcasts. Back transformation enjoys the luxury of being able to compute the majority of what it needs to communicate in advance. This allows many possibilities for reducing the communications bandwidth cost. The fact that message latency, load imbalance and software overhead costs are modest in back transformation means that a reduction in the communications bandwidth cost ought to result in significant performance improvement in back transformation. However, overlapping communication and computation has historically offered less benefit in practice than in theory (see section B.1.6), so I approach this with caution and will not pursue it without first convincing myself that the benefit is significant on several platforms.

### 8.4.3 Reducing load imbalance costs

Load imbalance can be reduced in both reduction to tridiagonal form and back transformation by careful selection of the block size. The number of messages in reduction to tridiagonal form is not dependent on the data layout block size, hence a cyclic data layout (i.e. block size of 1) will be used, reducing load imbalance. The fact that only half of the flops in reduction to tridiagonal form are BLAS3 flops and the large number of load imbalanced row operations combine to make the optimal algorithmic block size for reduction to tridiagonal form small.

Load imbalance is minimized in back transformation by choosing a block size which assigns a nearly equal number of eigenvectors to each column of processors ($\text{nb} = \left\lfloor n/(k \cdot p_c) \right\rfloor$ for some small integer $k$). A block cyclic data layout reduces execution time in back transformation by reducing the number of messages sent, hence we must look for

---

purely from the point of view of improving the performance of the ScaLAPACK symmetric eigensolver, this work probably would not be worth the effort.
other ways to reduce load imbalance. Fortunately, all eigenvectors must be updated at each step, hence a good static load balance of eigenvectors across processor columns eliminates most of the load imbalance in back transformation. The load imbalance within each column of processors is less important because the number of processor rows will be small. The computation of $T$ can be performed simultaneously on all processor columns, eliminating the load imbalance in that step.
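
The effect of the block size on the static balance of eigenvectors across processor columns can be checked with a few lines of code. The sketch below is an illustration only: the function names are hypothetical, the block-cyclic counting is written out directly rather than calling any ScaLAPACK routine, and the block size uses the floor form of the formula given above.

```python
def back_transform_block_size(n, p_c, k):
    """Block size nb = floor(n / (k * p_c)), never smaller than 1."""
    return max(1, n // (k * p_c))


def eigenvectors_per_column(n, nb, p_c):
    """Count eigenvectors owned by each processor column.

    Column blocks of width nb are dealt out cyclically to the p_c
    processor columns; the last block may be narrower than nb.
    """
    counts = [0] * p_c
    for start in range(0, n, nb):
        block = start // nb
        counts[block % p_c] += min(nb, n - start)
    return counts
```

When $n$ is a multiple of $k \cdot p_c$ every processor column receives exactly $n/p_c$ eigenvectors; otherwise the spread between `max(counts)` and `min(counts)` gives a direct measure of the remaining imbalance for a candidate $k$.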

8.4.4 Reducing software overhead costs

There are many ways to reduce software overhead, but software overhead is poorly understood and hence it is hard to predict which method will be best. Hendrickson, Jessup and Smith[91] showed that using a cyclic data layout and a square processor grid reduces the number of $\text{DTRMV}$ calls from $O(n^2/nb)$ to $O(n)$ because each local matrix is triangular. Using lightweight (no error checking, minimal overhead) BLAS would reduce software overhead, but these are still in the planning stages. If the compiler produces efficient code for a simple doubly nested loop, software overhead can be further reduced by using a compiled code instead of calls to the BLAS. Peter Strazdins has shown that software overhead within the PBLAS can be reduced by up to 50%[161, 160]. Alternatively, eliminating the PBLAS entirely would eliminate the overhead associated with the PBLAS. I would prefer to reduce the PBLAS overhead and continue to use the PBLAS, but that is likely to be much harder than simply abandoning the PBLAS.

When $\text{PDSYTRD}$, ScaLAPACK's reduction to tridiagonal form, was written, the PBLAS did not support column-replicated/row-distributed matrices or algorithmic blocking. Hence, many of the ideas mentioned here for improving the performance of $\text{PDSYTRD}$ were not available to a PBLAS-based code. PBLAS version 2 now offers these capabilities.

Software overhead cannot be measured separately from other costs and is hence difficult to measure, understand and reason about. It varies widely from machine to machine and can change just by changing the order in which subroutines are linked. We do not, for example, know how much can be attributed to subroutine calls, how much is caused by error checking, how much is caused by loop unrolling and how much is caused by code cache misses.

A good compiler should be able to compute the local portion of $Av$ faster than two calls to $\text{DTRMV}$ because a simple doubly nested loop could access each element in the
local portion of \( A \) only once whereas two calls to \texttt{DTRMV} would require that each element in \( A \) be read twice. The result is that the ratio of flops to main memory reads is 4-to-1 in the doubly nested loop versus 2-to-1 in \texttt{DTRMV}\(^8\). Furthermore, a compiled kernel would avoid the \texttt{BLAS} overhead and might involve less loop unrolling - reducing overhead directly and reducing code cache pressure as well. However, compiler technology is uneven, so we would make using compiled code instead of the \texttt{BLAS} optional.
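
The access pattern of such a doubly nested loop can be sketched as follows. This is not the proposed kernel: it is a plain Python illustration with a hypothetical function name, it operates on a full serial matrix rather than a processor's local portion, and it shows only that each stored element of the lower triangle is read once and used for four flops (two multiplies and two adds).

```python
def symv_lower_single_pass(a, v):
    """Compute y = A v for symmetric A using only the lower triangle.

    Each stored off-diagonal element a[i][j] (i > j) is read exactly once
    and contributes both to tril(A) v and to the transposed product.
    """
    n = len(v)
    y = [0.0] * n
    for j in range(n):
        vj = v[j]
        acc = 0.0                  # column j's contribution to y[j]
        y[j] += a[j][j] * vj       # diagonal element, used once
        for i in range(j + 1, n):
            aij = a[i][j]          # the single read of this element
            y[i] += aij * vj       # lower-triangle contribution
            acc += aij * v[i]      # transposed contribution
        y[j] += acc
    return y
```

Because the strictly upper triangle is never touched, it can hold arbitrary values, which is a convenient way to confirm that only the stored half of the matrix is read.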

Unblocked reduction to tridiagonal form will likely be faster than blocked reduction to tridiagonal form on problem sizes where software overhead is the dominant cost. Unblocked reduction to tridiagonal form on a cyclic data layout eliminates load imbalance and requires a minimum of communication and software overhead. The only disadvantage is that all of the \( 4/3 \times n^3 \) flops are \texttt{BLAS2} flops. However, with a good compiler, these \texttt{BLAS2} flops can perform well on most computers. The kernel in an unblocked reduction to tridiagonal form involves 8 flops to each read-modify-write memory access\(^9\). Most computers have adequate main memory bandwidth to handle this at full speed. However, not all compilers are good enough yet.
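
For readers who want to see the unblocked algorithm in isolation, the following is a serial sketch of Householder reduction to tridiagonal form. It is an assumption-laden illustration, not the proposed parallel code: the function name is hypothetical, the matrix is a list of lists, and for clarity the matrix-vector product does not exploit symmetry. The innermost double loop is the read-modify-write rank-2 update discussed above.

```python
import math
from functools import reduce


def tridiagonalize(a):
    """Return a tridiagonal matrix orthogonally similar to symmetric a."""
    n = len(a)
    a = [row[:] for row in a]                      # work on a copy
    for j in range(n - 2):
        x = [a[i][j] for i in range(j + 1, n)]
        norm_x = reduce(math.hypot, x, 0.0)
        if norm_x == 0.0:
            continue                               # column already reduced
        alpha = -math.copysign(norm_x, x[0])       # sign chosen to avoid cancellation
        v = x[:]
        v[0] -= alpha
        norm_v = reduce(math.hypot, v, 0.0)
        v = [t / norm_v for t in v]                # reflector H = I - 2 v v^T
        m = n - j - 1
        off = j + 1
        # p = 2 A22 v, then w = p - (v^T p) v, so that H A22 H = A22 - v w^T - w v^T
        p = [2.0 * sum(a[off + r][off + c] * v[c] for c in range(m)) for r in range(m)]
        k = sum(v[r] * p[r] for r in range(m))
        w = [p[r] - k * v[r] for r in range(m)]
        for r in range(m):
            for c in range(m):
                a[off + r][off + c] -= v[r] * w[c] + w[r] * v[c]
        a[j + 1][j] = a[j][j + 1] = alpha          # new subdiagonal entry
        for i in range(j + 2, n):
            a[i][j] = a[j][i] = 0.0                # entries annihilated by H
    return a
```

Since the transformation is an orthogonal similarity, the trace and the Frobenius norm of the result should match those of the input up to rounding, which gives a cheap sanity check without computing eigenvalues.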

### 8.5 Separating internal and external data layout without increasing memory usage

Separating internal and external data layout will require memory-intensive data redistribution, but making the data redistribution codes more space efficient will save enough memory space to offset the memory needs of separating internal and external data layout. Data redistributions between two data layouts with different values of \( p_r, p_c \) or \( nb \) use messages of \( O(n^2/(p^{3/2}) + nb^2) \) data elements. However, degenerate data redistributions between two data layouts with the same values of \( p_r, p_c \) and \( nb \) use messages of roughly \( n^2/p \) elements. In order to avoid treating degenerate data redistributions separately, the current redistribution codes require \( n^2/p \) buffer space for all redistributions. Splitting one large message into several smaller ones is not conceptually difficult, but it will require that the code be rewritten and that the testing be augmented to properly exercise the new paths. However, the execution time will not be significantly affected. Both \texttt{PDLARED2D}, the

---

\(^8\) These ratios are 8-to-1 and 4-to-1 respectively for Hermitian matrices.

\(^9\) The ratio for reducing Hermitian matrices to tridiagonal form is 16 flops per read-modify-write operation.
eigenvector redistribution routine, and DGMR2D, the general purpose redistribution routine, will have to be modified.

If the redistribution routines are not modified as described above, memory usage would increase from $4n^2/p$ to $6n^2/p$, and the eigensolver would run a remote risk of crashing. While both PDLARED2D and DGMR2D require $n^2/p$ space and could use the same space, they do not. PDLARED2D uses space passed to it in the \texttt{WORK} array, while DGMR2D calls \texttt{malloc} to allocate space. The eigensolver could crash if a message of $n^2/p$ elements were sent and the communication system was unable to allocate a buffer of that size. Messages of that size are not required during normal \texttt{ScaLAPACK} eigensolver tests, hence the eigensolver could crash during regular use even after passing all tests and after months or even years of flawless service. Modifying the redistribution routines as we propose eliminates this potential problem.

Memory needs could be reduced from $4n^2/p$ to $3n^2/p$ by using the space allocated to the input matrix, $A$, and the output matrix, $Z$, as internal workspace. This would require a modification to the present calling sequence, probably in the form of a new data descriptor. However, reducing memory usage by 25% may not justify a change to the calling sequence.
Chapter 9

Advice to symmetric eigensolver users

Parallel dense tridiagonal eigensolvers should be used if none of the following contraindications hold. Use a serial eigensolver if the problem is small enough to fit\(^1\). Use a sparse eigensolver if your input matrix is sparse\(^2\) and you don’t need all the eigenvalues, or if the matrix is dense and you only need a small fraction of the eigenvalues. Use a Jacobi eigensolver if you need to compute small eigenvalues of a scaled diagonally dominant matrix (or a matrix satisfying one of the other properties described by Demmel et al.\(^{[56]}\)) accurately. Use a Jacobi eigensolver for small \((n < 100\sqrt{p})\) spectrally diagonally dominant matrices\(^3\).

Currently the three most readily available parallel dense symmetric eigensolvers are PeIGs and ScaLAPACK’s PDSYEV and PDSYEVX. PeIGs and PDSYEV maintain orthogonality among eigenvectors associated with clustered eigenvalues. PeIGs and PDSYEVX are faster than PDSYEV. PDSYEVX scales better than either PeIGs or PDSYEV.

The choice between PeIGs and ScaLAPACK is probably more a matter of which infrastructure\(^4\) is preferred and is out of the scope of this thesis. Furthermore, it is likely that PeIGs will at some point use the ScaLAPACK symmetric eigensolver. Hence,

\(^1\)i.e. if memory allows.

\(^2\)The break-even point is not known, so I suggest that if your matrix is less than 10\% non-zero and you need less than 10\% of the eigenvalues you should use a sparse eigensolver.

\(^3\)Spectrally diagonally dominant means that the eigenvector matrix, or a permutation thereof, is diagonally dominant.

\(^4\)PeIGs is built on top of Global Arrays[101] while ScaLAPACK is built on the BLACS or MPI.
the upgrade path for both may end up with the same underlying code. If you are not likely to use more than 32 processors, PeIGs performance should be acceptable\(^5\). If your input matrices do not include large clusters of eigenvalues or if you can accept non-orthogonal eigenvectors, \texttt{PDSYEVX} is the right choice. Otherwise, i.e. if your input matrix has large clusters of eigenvalues for which you need orthogonal eigenvectors and you wish to use more than 32 processors, \texttt{PDSYEV} is the right choice. Eventually, the improved version of \texttt{PDSYEVX} described in Chapter 8 will be the method of choice in all cases.

\(^5\)Since PeIGs uses a 1D data layout, its performance will degrade if you use more than 32 processors.
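
The choice among the three parallel dense solvers discussed in this chapter can be summarized as a small decision function. This is one possible reading of the advice, not a rule stated by any of the libraries: the function name and arguments are hypothetical, and the case of large clusters on at most 32 processors returns both solvers that maintain orthogonality, since the text only says that PeIGs performance should be acceptable there.

```python
def suggest_parallel_solver(processors, large_clusters, need_orthogonal):
    """Return candidate solver names for a dense symmetric eigenproblem.

    processors      -- number of processors the user expects to use
    large_clusters  -- True if the spectrum has large eigenvalue clusters
    need_orthogonal -- True if eigenvectors within clusters must be orthogonal
    """
    if not large_clusters or not need_orthogonal:
        return ["PDSYEVX"]
    if processors > 32:
        return ["PDSYEV"]
    # Large clusters, orthogonality required, at most 32 processors.
    return ["PeIGs", "PDSYEV"]
```

The earlier checks in the chapter (serial, sparse and Jacobi solvers) would come before this function in a fuller decision procedure.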
Part II

Second Part
Bibliography


[44] Almadena Chtchelkanova, John Gunnels, Greg Morrow, James Overfelt, and
Robert A. van de Geijn. Parallel implementation of BLAS: General techniques for level
3 BLAS. Technical Report TR-95-40, Department of Computer Sciences, University of
Texas, October 1995. PLAPACK Working Note #4, to appear in Concurrency: Prac-


values on vector and simd architectures. SIAM Journal on Statistical Computing, 

[47] F. J. Corbato. On the coding of jacobi’s method for computing eigenvalues and 

[48] Jessup E.R. Crivelli, S. The cost of eigenvalue computation on distributed memory 


[50] J.J.M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigen-

BLAS for MIMD Vector Processors. ACM Transactions on Mathematical Software, 

~efdazedo/.

[53] J. Demmel. CS 267 Course Notes: Applications of Parallel Processing. Computer 
Science Division, University of California, 1991. 130 pages.

eigenvalue algorithms in floating point arithmetic. Electronic Trans. Num. Anal., 


Appendix A

Variables and abbreviations
### Table A.1: Variable names and their uses

<table>
<thead>
<tr>
<th>Name</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>((a, b))</td>
<td>The processor in processor row (a) and processor column (b).</td>
</tr>
<tr>
<td>(A)</td>
<td>The input matrix (partially reduced).</td>
</tr>
<tr>
<td>(A(i, j))</td>
<td>The (i, j) element in the (partially reduced) matrix (A).</td>
</tr>
<tr>
<td>(c)</td>
<td>The number of eigenvalues in the largest cluster of eigenvalues.</td>
</tr>
<tr>
<td>(C)</td>
<td>The set of all processor columns.</td>
</tr>
<tr>
<td>(c_s)</td>
<td>The current processor column within the sub-grid.</td>
</tr>
<tr>
<td>(c_b)</td>
<td>The current processor column sub-grid.</td>
</tr>
<tr>
<td>(e)</td>
<td>The number of eigenvalues required.</td>
</tr>
<tr>
<td>(j)</td>
<td>The current column, (A(j : n, j : n)) being the un-reduced portion of the matrix.</td>
</tr>
<tr>
<td>(j')</td>
<td>The column within the current block column, (j' = \text{mod} (j, nb)).</td>
</tr>
<tr>
<td>(\log(\sqrt{p}))</td>
<td>(\log_2 \sqrt{p})</td>
</tr>
<tr>
<td>(m)</td>
<td>The number of eigenvectors required.</td>
</tr>
<tr>
<td>(mb)</td>
<td>The row block size. Used only when we discuss rectangular blocks. In general, the row block size and column block size are assumed to be equal and are written as (nb).</td>
</tr>
<tr>
<td>(mullen)</td>
<td>A compile time parameter in the PBLAS which controls the panel size used in PBLAS symmetric matrix vector multiply routine, PDSYMV.</td>
</tr>
<tr>
<td>(n)</td>
<td>The size of the input matrix (A).</td>
</tr>
<tr>
<td>(nb)</td>
<td>The blocking factor. In PDSYEVX the data layout and algorithmic blocking factor are the same. In HJS the data layout blocking factor is 1 and (nb) refers to the algorithmic blocking factor.</td>
</tr>
<tr>
<td>(p)</td>
<td>The number of processors used in the computation.</td>
</tr>
<tr>
<td>(pbf)</td>
<td>Panel blocking factor. The panel width used in DGEMV in PDSYEVX and DGEMM in PDSYEVX and HJS is (pbf \times nb).</td>
</tr>
<tr>
<td>(p_r)</td>
<td>The number of processor rows in the process grid.</td>
</tr>
<tr>
<td>(p_{r1})</td>
<td>The number of processor rows in a sub-grid.</td>
</tr>
<tr>
<td>(p_{r2})</td>
<td>The number of processor sub-grid rows.</td>
</tr>
<tr>
<td>(p_c)</td>
<td>The number of processor columns in the process grid.</td>
</tr>
<tr>
<td>(p_{c1})</td>
<td>The number of processor columns in a sub-grid.</td>
</tr>
<tr>
<td>(p_{c2})</td>
<td>The number of processor sub-grid columns.</td>
</tr>
<tr>
<td>(R)</td>
<td>The set of all processor rows.</td>
</tr>
<tr>
<td>(r_s)</td>
<td>The current processor row within the sub-grid.</td>
</tr>
<tr>
<td>(r_b)</td>
<td>The current processor row sub-grid.</td>
</tr>
<tr>
<td>(spread)</td>
<td>In a “spread across”, every processor in current processor column broadcasts to every other processor in the same processor row. In a “spread down”, every processor in current processor row broadcasts to every other processor in the same processor column.</td>
</tr>
<tr>
<td>(\text{tril}(A, 0))</td>
<td>The lower triangular part, including the diagonal, of the unreduced part of the input matrix (A), i.e. (A(j : n, j : n))</td>
</tr>
<tr>
<td>(\text{tril}(A, -1))</td>
<td>The lower triangular part, excluding the diagonal, of the unreduced part of the input matrix (A), i.e. (A(j : n, j : n))</td>
</tr>
</tbody>
</table>
Table A.2: Variable names and their uses (continued)

<table>
<thead>
<tr>
<th>Name</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>$v$</td>
<td>The vector portion of the Householder reflector.</td>
</tr>
<tr>
<td>$V$</td>
<td>The current column of Householder reflectors. Size: $n - j + j'$ by $j'$.</td>
</tr>
<tr>
<td>$V(j - j': n, 1 : j')$</td>
<td>The current column of Householder reflectors. Size: $n - j + j'$ by $j'$.</td>
</tr>
<tr>
<td>vnb</td>
<td>The imbalance in the 2D block-cyclic distribution of the eigenvector matrix.</td>
</tr>
<tr>
<td>$w$</td>
<td>The companion update vector, i.e. the vector used in $A = A - vw^T - wv^T$ to reduce $A$.</td>
</tr>
<tr>
<td>$W$</td>
<td>The current column of companion update vectors. Size: $n - j + j'$ by $j'$.</td>
</tr>
<tr>
<td>$W(j - j': n, 1 : j')$</td>
<td>The current column of companion update vectors. Size: $n - j + j'$ by $j'$.</td>
</tr>
</tbody>
</table>

Table A.3: Abbreviations

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Meaning</th>
<th>Terms included</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU</td>
<td>Central Processing Unit</td>
<td></td>
</tr>
<tr>
<td>FPU</td>
<td>Floating Point Unit</td>
<td></td>
</tr>
</tbody>
</table>

Table A.4: Model costs

<table>
<thead>
<tr>
<th>Symbol</th>
<th>Meaning</th>
<th>Terms included</th>
</tr>
</thead>
<tbody>
<tr>
<td>$\alpha$</td>
<td>The message initiation cost for BLACS send and receive.</td>
<td>$n \log(p), n$</td>
</tr>
<tr>
<td>$\beta$</td>
<td>The inverse bandwidth cost for BLACS send and receive.</td>
<td>$n^2 \log(p), \sqrt{n}, \frac{n^2}{\sqrt{p}}, n \log(p)$</td>
</tr>
<tr>
<td>$\delta_3$</td>
<td>DGEMM (matrix-matrix multiply) subroutine overhead plus the time penalty associated with invoking DGEMM on small matrices.</td>
<td>$n^2, n^2 \times \text{pbf}, \frac{n}{\text{nb}^2 \times \text{pbf}}$</td>
</tr>
<tr>
<td>$\gamma_3$</td>
<td>Time required per DGEMM (matrix-matrix multiply) flop.</td>
<td>$n^2, n^2 \times \text{pbf}$</td>
</tr>
<tr>
<td>$\delta_2$</td>
<td>DGEMV (matrix-vector multiply) subroutine overhead plus the time penalty associated with invoking DGEMV on small matrices.</td>
<td>$n$</td>
</tr>
<tr>
<td>$\gamma_2$</td>
<td>Time required per DGEMV (matrix-vector multiply) flop.</td>
<td>$n^2, n^2 \times \text{nb}$</td>
</tr>
<tr>
<td>$\gamma_\div$</td>
<td>Time required per divide.</td>
<td>$\frac{n^2}{p}, n$</td>
</tr>
<tr>
<td>$\gamma_{\sqrt{\ }}$</td>
<td>Time required per square root.</td>
<td>$\frac{n^2}{p}, \frac{n}{\sqrt{p}}$</td>
</tr>
<tr>
<td>$\gamma_1$</td>
<td>Time required per BLAS1 (scalar-vector) flop.</td>
<td>$n^2, n$</td>
</tr>
<tr>
<td>$\delta_1$</td>
<td>Subroutine overhead for BLAS1 and similar codes.</td>
<td>$\frac{n}{\sqrt{p}}$</td>
</tr>
<tr>
<td>$\delta_4$</td>
<td>Subroutine overhead for the PBLAS.</td>
<td>$n$</td>
</tr>
</tbody>
</table>

Appendix B

Further details

B.1 Updating $w$ during reduction to tridiagonal form

Line 4.1 in Figure 8.3, $w = w - W V^T v - V W^T v$, can be computed with minimal communication, with minimal computation, or with an intermediate amount of both. Indeed, Line 4.1 can be computed with $O\left(\frac{n^2 \, nb}{p^r} \gamma_2 + n \log(p^{r-0.5}) \alpha\right)$ cost for various $r \in [0.5, 1.0]$. The choice $r = 1.0$ corresponds to the minimal computation cost option (discussed in section B.1.3), while $r = 0.5$ corresponds to the minimal (zero) communication cost option (discussed in section B.1.2). Section B.1.4 describes the intermediate options in a generalized form which includes both the minimum communication and minimum computation options as special cases.

The plethora of options for the update of $w$ stems from the fact that the input matrices $W$, $V$, $WT$ and $VT$ are replicated across the relevant processors, while the input/output vector $w$ is stored as partial sums across the processor columns in each of the processor rows. The input matrices are replicated because they will need to be replicated later to update $A$. The vector $w$ is stored as partial sums because that is how it is initially computed, and because the combine operation used to compute $w$ from the partial sums has not been performed at this point.

Throughout this section we discuss only the computation of $W V^T v$; $V W^T v$ can be computed in a similar manner. Moreover, the two computations, and all associated communication, can be merged to reduce software overhead and message latency costs.

B.1.1 Notation

In describing most parallel linear algebra codes, including all codes in this thesis outside of this appendix, we need not explicitly state the processor on which a value is stored: \( A_{i,j} \) is understood to live on the processor that owns row \( i \) and column \( j \). The \( nb' \) element array \( tmp \), however, contains different values on different processors. Therefore, for the discussion in this appendix, an additional subscript is added to \( tmp \) to indicate the processor column. Furthermore, some entries in \( tmp \) are left undefined at various stages, so we use \( j \in \{c_a\} \) to indicate all columns \( j \) owned by processor column \( c_a \); i.e. \( tmp_{j \in \{c_a\}, c_a} = val \) means that \( \forall j \in \{c_a\} \), \( tmp_j \) on processor column \( c_a \) is assigned \( val \). For extra clarity within a display we write this as \( tmp_{j, c_a} \).

B.1.2 Updating \( w \) without added communication

Line 4.1 in Figure 8.3, \( w = w - W V^T v - V W^T v \), can be computed without any communication other than that needed to compute \( w \) without the update. It initially appears that \( w = w - W \cdot V^T v - V \cdot W^T v \) requires communication, because computing \( tmp = V^T v \) requires summing \( nb' \) values\(^1\) within each processor column, and computing \( w = w - W \cdot tmp \) requires that \( tmp \) be broadcast within each processor column. However, \( W \cdot V^T v \) can be computed with a single sum within each processor row, and by delaying the sum needed to compute \( w \), one of these sums can be avoided completely. Figure B.1 derives how \( W \cdot V^T v \) can be computed with a single sum within each processor row.

**Line 3** The transformation from line 2 to line 3 is the standard way that a matrix-vector multiply is performed in parallel. The rightmost sum is the local portion; the middle sum is the sum over all processors in the processor column.

**Line 4** Delay the sum over all processors in the processor column until after multiplying by \( W \). The rightmost two sums involve only local values.

Figure B.2 shows how to compute \( W : V^T v \) without added communication.

**Line 5** Local computation of \( VT \cdot v \). Operations:

\[
\sum_{i=1, nb}^{n} \sum_{nb' = 1}^{nb} 2 \frac{i}{p_c} nb' \gamma_2 = \frac{1}{2} n^2 \frac{nb}{p_c} \gamma_2
\]
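
The closed form above can be checked by summing over the inner index first. This is only a sketch: it assumes that the notation \( i = 1, nb \) means that \( i \) advances in steps of \( nb \) (so \( i = m \cdot nb \) for \( m = 1, \ldots, n/nb \)), and it keeps only the leading-order terms.

\[
\sum_{nb'=1}^{nb} 2 \frac{i}{p_c} nb' \gamma_2 \approx \frac{i \, nb^2}{p_c} \gamma_2,
\qquad
\sum_{m=1}^{n/nb} \frac{(m \, nb) \, nb^2}{p_c} \gamma_2 \approx \frac{nb^3}{p_c} \cdot \frac{1}{2} \left( \frac{n}{nb} \right)^2 \gamma_2 = \frac{1}{2} n^2 \frac{nb}{p_c} \gamma_2
\]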

\(^1\) \( nb' = i - ii - 1 \) is the number of columns in \( H \)

Figure B.1: Avoiding communication in computing $W \cdot V^T v$

\[
\begin{aligned}
\text{res} &= W \cdot V^T v \quad \text{(Line 1)} \\
\text{res}_{i} &= \sum_{1 \leq j \leq nb'} W_{i,j} \sum_{k} V_{k,j} v_k \quad \text{(Line 2)} \\
\text{res}_{i} &= \sum_{1 \leq j \leq nb'} W_{i,j} \sum_{C \in \{R\}} \sum_{k \in \{C\}} V_{k,j} v_k \quad \text{(Line 3)} \\
\text{res}_{i} &= \sum_{C \in \{R\}} \sum_{1 \leq j \leq nb'} W_{i,j} \sum_{k \in \{C\}} V_{k,j} v_k \quad \text{(Line 4)}
\end{aligned}
\]

**Line 6** Local computation of $W \cdot \text{tmp}$. Operations:

\[
\sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} 2 \frac{i}{p_r} nb' \gamma_2 = \frac{1}{2} n^2 \frac{nb}{p_r} \gamma_2
\]

**Line 7** Effect of summing $\text{res}_i$ within each processor row. This operation is merged with the unavoidable summation of $w$ within each processor row, hence this operation is not performed and has no cost.
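
The identity behind Lines 5 to 7 can be checked numerically on a single machine. The following is a minimal sketch, not the thesis code: the sizes `n`, `nbp` (standing for $nb'$) and `pc`, the test data, and the cyclic assignment of the entries of $v$ to processor columns are all illustrative assumptions.

```matlab
% Sketch: delaying the combine until after the multiply by W.
% All sizes and the cyclic ownership rule are illustrative assumptions.
n   = 12;   % length of v (number of rows of V and W)
nbp = 3;    % nb', the number of columns of V and W
pc  = 4;    % number of simulated processor columns

V = reshape(1:n*nbp, n, nbp) / 10;
W = reshape(n*nbp:-1:1, n, nbp) / 7;
v = (1:n)' / 3;

% Reference: form tmp = V'*v first (this needs a combine), then apply W.
ref = W * (V' * v);

% Delayed sum: processor column C touches only the entries k it owns.
res = zeros(n, pc);
for C = 1:pc
  k = C:pc:n;                  % assumed cyclic ownership of the entries of v
  tmpC = V(k, :)' * v(k);      % Line 5: local partial sum of V'*v
  res(:, C) = W * tmpC;        % Line 6: local multiply by W
end
delayed = sum(res, 2);         % Line 7: the single sum across processor columns

err = norm(delayed - ref) / norm(ref);
```

The relative difference `err` should be at the level of roundoff, since the two computations differ only in the order of the summations.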

**B.1.3 Updating $w$ with minimal computation cost**

Figure B.3 shows how $W \cdot V^T v$ can be performed with only $O(\frac{n^2}{\sqrt{p}} + \frac{n^2 nb}{p})$ computation by distributing the computation of $\text{tmp} = VT \cdot v$ and $w = w - W \cdot \text{tmp}$ over all the processors. Each of the $nb$ columns of $VT$ is assigned to one processor row, hence each processor row is assigned $\frac{nb}{\sqrt{p}}$ columns of $VT$. Each processor row computes the portion of $VT \cdot v$ assigned to it, leaving the answer on the diagonal processor in this row. The diagonal processors then broadcast the $\frac{nb}{\sqrt{p}}$ elements of $VT \cdot v$ which they own to all of the processors within their processor column. Finally, each processor computes $w = w - W \cdot \text{tmp}$ for the values of $W$ and $\text{tmp}$ which it owns.

Figure B.2: Computing $W \cdot V^T v$ without added communication

\[ \text{tmp}_{j,C} = \sum_{k \in \{C\}} VT_{k,j} v_k \quad \text{(Line 5)} \]

\[ \text{res}_{i,C} = \sum_{j} W_{i,j} \text{tmp}_{j,C} \quad \text{(Line 6)} \]

\[ = \sum_{j} W_{i,j} \left( \sum_{k \in \{C\}} VT_{k,j} v_k \right) \]

\[ \sum_{C \in \{R\}} \text{res}_{i,C} = \sum_{1 \leq C \leq p_c} \sum_{j} W_{i,j} \sum_{k \in \{C\}} VT_{k,j} v_k \quad \text{(Line 7)} \]

\[ = \sum_{j} W_{i,j} \left( \sum_{k} VT_{k,j} v_k \right) \]

**Line 8** Local computation of $VT \cdot v$. Operations:

\[ \sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} 2 \frac{i}{p_c} \frac{nb'}{p_c} \gamma_2 = \frac{1}{2} \frac{n^2 nb}{p} \gamma_2 \]

**Line 9** Combine $\text{tmp}_{j \in \{R\}, C}$ within each processor column, leaving the answer on the diagonal processor. Operations:

\[ \sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} \log(p_c) \left( \alpha + \frac{nb'}{p_c} \beta \right) = n \log(p_c) \alpha + \frac{1}{2} \frac{n \, nb}{p_c} \log(p_c) \beta \]

**Line 10** Broadcast $\text{tmp}_{j \in \{C\}, C}$ within each processor row from the diagonal processor. Operations:

\[ \sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} \log(p_c) \left( \alpha + \frac{nb'}{p_c} \beta \right) = n \log(p_c) \alpha + \frac{1}{2} \frac{n \, nb}{p_c} \log(p_c) \beta \]

**Line 11** Local computation of $W \cdot \text{tmp}$. Operations:

\[ \sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} 2 \frac{i}{p_c} \frac{nb'}{p_c} \gamma_2 = \frac{1}{2} \frac{n^2 nb}{p} \gamma_2 \]

Figure B.3: Computing $W \cdot V^T v$ with minimal computation

\[
\text{tmp}_{j,C} = \sum_{k \in \{C\}} VT_{k,j} v_k \quad \text{(Line 8)}
\]

\[
\forall R = C \quad \text{tmp}_{j,C} = \sum_{1 \leq C' \leq p_c} \sum_{k \in \{C'\}} VT_{k,j} v_k \quad \text{(Line 9)}
\]

\[
= \sum_k VT_{k,j} v_k
\]

\[
\text{tmp}_{j,C} = \sum_{k} VT_{k,j} v_k \quad \text{(Line 10)}
\]

\[
\text{res}_{i,C} = \sum_{j \in \{C\}} W_{i,j} \text{tmp}_{j,C} \quad \text{(Line 11)}
\]

\[
= \sum_{j \in \{C\}} W_{i,j} \sum_{k} VT_{k,j} v_k
\]

\[
\sum_{C \in \{R\}} \text{res}_{i,C} = \sum_{1 \leq C \leq p_c} \sum_{j \in \{C\}} W_{i,j} \sum_{k} VT_{k,j} v_k \quad \text{(Line 12)}
\]

\[
= \sum_j W_{i,j} \sum_k VT_{k,j} v_k
\]
**Line 12** Effect of summing \( \text{res}_i \) within each processor row. This operation is merged with the unavoidable summation of \( w \) within each processor row, hence this operation is not performed and has no cost.
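
The data flow of Lines 8 to 12 can be imitated serially to confirm that splitting the work over both $j$ and $k$ still yields $W \cdot V^T v$. This is a hedged sketch rather than the thesis implementation: the grid size (`pr`, `pc`), the matrix sizes, the test data and the cyclic ownership of indices are assumptions, and the combine and broadcast are replaced by ordinary assignments.

```matlab
% Sketch of the minimal computation scheme on a simulated pr-by-pc grid.
% Grid size, matrix sizes and the cyclic ownership rule are assumptions.
n = 12; nbp = 4; pr = 2; pc = 2;
V = reshape(1:n*nbp, n, nbp) / 10;
W = reshape(n*nbp:-1:1, n, nbp) / 7;
v = (1:n)' / 3;

% Lines 8 and 9: processor (R,C) handles only its share of j and k,
% and the partial sums are then combined.
tmp = zeros(nbp, 1);
for R = 1:pr
  jR = R:pr:nbp;                 % entries of tmp assigned to processor row R
  for C = 1:pc
    kC = C:pc:n;                 % entries of v owned by processor column C
    tmp(jR) = tmp(jR) + V(kC, jR)' * v(kC);
  end
end

% Lines 10 and 11: after the broadcast, processor column C multiplies
% only the columns of W that match its entries of tmp.
res = zeros(n, pc);
for C = 1:pc
  jC = C:pc:nbp;
  res(:, C) = W(:, jC) * tmp(jC);
end

% Line 12: the sum across processor columns completes W*(V'*v).
total = sum(res, 2);
ref = W * (V' * v);
err = norm(total - ref) / norm(ref);
```

In this sketch each simulated processor multiplies a block with roughly $\frac{n}{p_c} \times \frac{nb'}{p_r}$ entries in Line 8, which is where the factor $\frac{1}{p}$ in the computation cost comes from.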

The update of \( w \) in HJS requires similar communication and computation costs, although the patterns of communication are quite different. HJS uses recursive halving to spread the result of \( \text{tmp} = V^T v \), computes \( W \cdot \text{tmp} \) on all processors, and uses recursive doubling to compute \( w \) while simultaneously spreading it to all processor columns. The BLACS do not offer recursive halving and recursive doubling operations; we could build them out of BLACS sends and receives, but that would incur higher latency costs.

B.1.4 Updating \( w \) with minimal total cost

Line 4.1 in Figure 8.3, \( w = w - W V^T v - V W^T v \), can be computed with \( O\left(\frac{n^2 \, nb}{p^r} \gamma_2 + n \log(p^{r-0.5}) \alpha\right) \) cost for any \( 0.5 \leq r \leq 1.0 \). On a high latency machine, one can reduce the total number of messages by increasing the load imbalance. On a low latency machine, one can reduce the load imbalance by using more messages. The two options described in the preceding sections are special cases of the general class of methods described in this section: Section B.1.2 corresponds to \( r = 0.5 \) and Section B.1.3 corresponds to \( r = 1.0 \).

This method has not been implemented and hence has not been proven to result in decreased execution times in practice.
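
The trade-off between latency and load balance can be made concrete by evaluating a cost model at a few values of \( r \). The sketch below assumes a model of the form \( \frac{n^2 nb}{p^r} \gamma_2 + 2 n \log_2(p^{r-0.5}) \alpha \), chosen so that it reduces to the computation and latency terms listed for the two special cases; the block size `nb` and the per-flop cost `gamma2` are hypothetical values, while `n`, `p` and `blacsalpha` are the values used in the Jacobi model of section B.2.1.

```matlab
% Sketch of the cost trade-off as a function of r.
% Assumed model: compute = n^2*nb/p^r * gamma2,
%                latency = 2*n*log2(p^(r-0.5)) * blacsalpha.
n = 1000; p = 64;        % same n and p as the Jacobi model in section B.2.1
nb = 32;                 % hypothetical block size
blacsalpha = 65.9e-6;    % message initiation cost used in section B.2.1
gamma2 = 0.05e-6;        % hypothetical cost per DGEMV flop

r = [0.5 0.75 1.0];
compute = n^2 * nb ./ p.^r * gamma2;                  % load-balance term
latency = 2 * n * log2(p.^(r - 0.5)) * blacsalpha;    % combine plus broadcast
total = compute + latency;
```

Under this assumed model the latency term vanishes at \( r = 0.5 \) and the computation term is smallest at \( r = 1.0 \); which value of \( r \) gives the smallest total depends on the ratio of \( \alpha \) to \( \gamma_2 \) on the machine at hand.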

Methods corresponding to \( 0.5 < r < 1.0 \) require what amounts to a four-dimensional processor grid. The \( p_r \times p_c \) processor grid is divided into \( p_{r2} \times p_{c2} \) sub-grids, with each sub-grid consisting of \( p_{r1} \times p_{c1} \) processors. We restrict our attention to square processor grids and square processor sub-grids, hence \( p_r = p_c \), \( p_{r1} = p_{c1} \) and \( p_{r2} = p_{c2} \). Each processor column is identified by a pair of numbers \( (c_a, c_b) \) such that \( 1 \leq c_a \leq p_{c1} \) and \( 1 \leq c_b \leq p_{c2} \). Likewise, each processor row is identified by a pair of numbers \( (r_a, r_b) \) such that \( 1 \leq r_a \leq p_{r1} \) and \( 1 \leq r_b \leq p_{r2} \). No modifications are needed to the BLACS to support this method because each processor belongs to only two two-dimensional processor grids: the normal two-dimensional data layout and a two-dimensional data layout containing only those processors in the same processor sub-grid, i.e. with the same \( r_b \) and \( c_b \).
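
One way to picture the pair \( (c_a, c_b) \) is as a position inside a sub-grid together with the index of that sub-grid. The sketch below assumes that sub-grids consist of contiguous processor columns, which the text does not specify, and the grid sizes are illustrative. The last line computes the value of \( r \) implied by \( p_{c1} = p^{r-0.5} \), a relation inferred here by comparing the latency term \( n \log(p^{r-0.5}) \alpha \) with the per-sub-grid combine cost; it is not stated explicitly in the text.

```matlab
% Sketch: mapping a processor column index to the pair (c_a, c_b).
% The contiguous numbering below is an assumption, not taken from the text.
pc  = 8;            % processor columns in the full grid
pc1 = 4;            % processor columns in each sub-grid
pc2 = pc / pc1;     % number of sub-grids across the grid

c  = 1:pc;                        % ordinary processor column index
ca = mod(c - 1, pc1) + 1;         % position inside the sub-grid, 1..pc1
cb = floor((c - 1) / pc1) + 1;    % index of the sub-grid, 1..pc2

% The pair identifies the column uniquely, so the map can be inverted.
cback = (cb - 1) * pc1 + ca;

% Value of r implied by pc1 = p^(r-0.5) on a square grid with p = pc^2.
r = 0.5 + log(pc1) / log(pc^2);
```

The same mapping applies to processor rows and \( (r_a, r_b) \) on a square grid.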

Figure B.4 shows the general method for updating \( w \) using a four-dimensional data layout. The \( nb' \) elements of \( \text{tmp} \) are distributed over the \( p_{c1} \) processor rows and columns within each processor sub-grid, such that each processor row and column owns roughly \( \frac{nb'}{p_{r1}} \) elements of \( \text{tmp} \).

Figure B.4: Computing \( W \cdot V^T v \) on a four dimensional processor grid

\[
\text{tmp}_{j \in \{r_a\}}^{(c_a, c_b)} = \sum_{k \in \{(c_a, c_b)\}} VT_{k,j} v_k \quad \text{(Line 13)}
\]

\[
\forall r_a = c_a \quad \text{tmp}_{j \in \{r_a\}}^{(c_a, c_b)} = \sum_{1 \leq c_a' \leq p_{c1}} \sum_{k \in \{(c_a', c_b)\}} VT_{k,j} v_k \quad \text{(Line 14)}
\]

\[
\text{tmp}_{j \in \{c_a\}}^{(c_a, c_b)} = \sum_{1 \leq c_a' \leq p_{c1}} \sum_{k \in \{(c_a', c_b)\}} VT_{k,j} v_k \quad \text{(Line 15)}
\]

\[
\text{res}_{i \in \{(r_a, r_b)\}}^{(c_a, c_b)} = \sum_{j \in \{c_a\}} W_{i,j} \text{tmp}_{j \in \{c_a\}}^{(c_a, c_b)} \quad \text{(Line 16)}
\]

\[
= \sum_{j \in \{c_a\}} W_{i,j} \sum_{1 \leq c_a' \leq p_{c1}} \sum_{k \in \{(c_a', c_b)\}} VT_{k,j} v_k
\]

\[
\sum_{(c_a, c_b)} \text{res}_{i \in \{(r_a, r_b)\}}^{(c_a, c_b)} = \sum_{1 \leq c_a \leq p_{c1}} \sum_{j \in \{c_a\}} W_{i,j} \sum_{1 \leq c_b \leq p_{c2}} \sum_{1 \leq c_a' \leq p_{c1}} \sum_{k \in \{(c_a', c_b)\}} VT_{k,j} v_k \quad \text{(Line 17)}
\]

\[
= \sum_{j} W_{i,j} \sum_{k} VT_{k,j} v_k
\]

**Line 13** Local computation of \( VT \cdot v \). Operations:

\[
\sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} 2 \frac{i}{p_c} \frac{nb'}{p_{r1}} \gamma_2 = \frac{1}{2} \frac{n^2 \, nb}{p_c \, p_{r1}} \gamma_2
\]

**Line 14** Combine \( \text{tmp}_{j \in \{r_a\}}^{(c_a, c_b)} \) within each processor sub-grid column, leaving the answer on the diagonal processor (i.e. \( r_a = c_a \)) within each sub-grid. Operations:

\[
\sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} \log(p_{r1}) \left( \alpha + \frac{nb'}{p_{r1}} \beta \right) = n \log(p_{r1}) \alpha + \frac{1}{2} \frac{n \, nb}{p_{r1}} \log(p_{r1}) \beta
\]

**Line 15** Broadcast \( \text{tmp}_{j \in \{c_a\}}^{(c_a, c_b)} \) within each processor sub-grid row from the diagonal processor in that sub-grid row. Operations:

\[
\sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} \log(p_{c1}) \left( \alpha + \frac{nb'}{p_{c1}} \beta \right) = n \log(p_{c1}) \alpha + \frac{1}{2} \frac{n \, nb}{p_{c1}} \log(p_{c1}) \beta
\]

**Line 16** Local computation of \( W \cdot \text{tmp} \). Operations:

\[
\sum_{i=1, nb}^{n} \sum_{nb'=1}^{nb} 2 \frac{i}{p_r} \frac{nb'}{p_{c1}} \gamma_2 = \frac{1}{2} \frac{n^2 \, nb}{p_r \, p_{c1}} \gamma_2
\]

**Line 17** Effect of summing \( \text{res}_i \) within each processor row. This operation is merged with the unavoidable summation of \( w \) within each processor row, hence this operation is not performed and has no cost.

### B.1.6 Overlap communication and computation as a last resort

There are numerous studies showing that overlapping communication and computation improves performance, but most of them show only modest improvement. Arbenz and Slapnicar[9] show a 5% improvement by overlapping communication and computation while Pourzandi and Tourancheau show a 6% improvement. Those that show the greatest improvement combine communication and computation overlap with other equally important techniques such as pipelining and lookahead[32].

I don’t know why overlapping communication and computation leads to only modest improvements. In theory it ought to hide most of the communication costs. There are several possible explanations, all of which presumably contribute. I suspect that the most important reason for the disappointing savings from overlap is that overhead, not communication cost, is the primary factor limiting efficiency. A second important reason is that most of the cost of communication on today’s distributed memory machines is the cost of moving the data between the node and the network, not moving data within the network. Moving data to and from the node always consumes main memory cycles, which must be stolen from the execution of the rest of the code unless the main memory is dual ported (i.e. expensive). Further, the latency cost is almost all software overhead, hence during the message setup the CPU is busy and cannot compute.

The disadvantage of communication and computation overlap is that it adds complexity which can be put to better use elsewhere. Both the Pourzandi/Tourancheau and Arbenz/Slapnicar studies used a 1D data layout in Jacobi, although a 2D data layout offers lower communication costs ($O\left(\frac{n^3}{\sqrt{p}}\right)$ versus $O(n^2)$) and lower overhead costs. They would have done better to use a 2D data layout and to delay (potentially forever) consideration of communication and computation overlap.

### B.2 Matlab codes

#### B.2.1 Jacobi

The following is the Matlab code for Table 1.4.

```matlab
% Problem size and number of processors
n = 1000;
p = 64;

% Cost parameters used by the model
blacsalpha = 65.9e-6;
blacsbeta = 1.46e-6;
dividebeta = 3.85e-6;
squarerootbeta = 7.7e-6;
blasonebeta = 0.074e-6;
dgemmalpha = 1.03e-6;
dgemmbeta = 0.0215e-6;

% Individual terms of the execution time model. The statements are left
% unterminated so that Matlab prints each term.
term(1) = 8 * sqrt(p) * ( log2(p) - 3 ) * blacsalpha
term(2) = 7/2 * n^2 / sqrt(p) * blacsbeta
term(3) = 1/8 * n^2 / sqrt(p) * log2(p) * blacsbeta
term(4) = 1/2 * n^2 / sqrt(p) * dividebeta
term(5) = 1/4 * n^2 / sqrt(p) * squarerootbeta
term(6) = 3/8 * n^3 / p * blasonebeta
term(7) = 8 * sqrt(p) * dgemmalpha
term(8) = 5 * n^3 / p * dgemmbeta

% Total modeled execution time
time = sum(term)
```

Appendix C

Miscellaneous matlab codes

C.1 Reduction to tridiagonal form

The following Matlab code performs an unblocked reduction to tridiagonal form. It produces the same values of $D$, $E$ and $TAU$, up to roundoff, as LAPACK’s DSYTRD and ScaLAPACK’s PDSYTRD.

```matlab
%%
%% tridi - An unblocked, non-symmetric reduction to tridiagonal form
%%
%% This file creates an input matrix A, reduces it to tridiagonal form
%% and tests to make sure that the reduction was performed correctly.
%%
%% outputs:
%%   D, E - The tridiagonal matrix
%%   tau  - The scalar factors of the Householder reflectors
%%   A    - The lower half holds the Householder updates
%%

%% Produce the input matrix
N = 7;
n = N;   % The reduction loop below refers to the dimension as n.
A = hilb(N) + toeplitz([1 (1:(N-1))*1i ]);
B = A;   % Keep a copy to check our work later.

I = eye(N);

for j = 1:n-1

  % Compute the Householder vector: v
  clear v;
  v(1:n,1) = zeros(n,1);
  v(j+1:n,1) = A(j+1:n,j);
  alpha = A(j+1,j);
  beta = - norm(v) * real(alpha) / abs( real(alpha) ) ;

  tau(j) = ( beta - alpha ) / beta ;
  v = v / ( alpha - beta ) ;
  v(j+1) = 1.0 ;

  % Perform the matrix vector multiply:
  w = A * v ;

  % Compute the companion update vector: w
  w = tau(j) * w ;
  c = w' * v;
  w = (w - (c * tau(j) / 2 ) * v);

  D(j) = A(j,j);
  E(j) = beta ;

  % Update the trailing matrix
  A = A - v * w' - w * v' ;

  % Store the Householder vector back into A
  A(j+2:n,j) = v(j+2:n);
end
D(n) = A(n,n);

% Check to make sure that the reduction was performed correctly.

DE = diag(D) + diag(E,-1) + diag(E,1) ;

Q = I;
for j = 1:n-1
  clear house
  house(1:n,1) = zeros(n,1);
  house(j+1:n,1) = A(j+1:n,j);
  house(j+1,1) = 1.0;
  % Apply the reflector I - tau' * house * house' (an outer product).
  Q = (I - tau(j)' * house * house') * Q ;
end

norm(B - Q' * DE * Q )
```