Fall 2019Question 1: A- Please write a CSR-based serial SpMV (y = Ax) code. B- How would you parallelize it for a multithreaded CPU? C- How would you parallelize it for distributed memory? What is the communication cost (here we give a specific nonzero structure on a toy matrix) in alpha/beta model? D- Let's say the communication is too much, what would you use to partition this matrix to minimize communication? How would you model communication? E- Assume you have a matrix with power-law degree (nonzero counts per row) distribution, how would this work? What would go wrong with row-wise decomposition? Question 2: A- Please give an example of an application or algorithm that would use dynamic load balancing. B- Describe an implementation for a shared task queue (implemented as a linked-list in shared memory) C- How would you do it with Compare-and-swap? Gave signatures and pseudocode for CAS. D- How would you reduce contention? (hint given: having a separate queue per processor, e.g., a shared array of linked lists.) How would you select a processor from which to stea? E- Which item (older or newer) should you steal from the remote queue vs. your local queue.

(Fall 2012 - Culler and Shewchuk):[1] Parallelism and Performance Modeling Give a simple performance model of a generic parallel operation performed on n elements. Explain what the parameters represent. How would such the model differ for pipeline operation versus fully concurrent operation in p processors or functional units? What measures can be taken at the algorithmic level to maximize performance under these models? Given an analogous model for communication performance. Where do issues of overhead and latency appear in this model? What architectural or algorithmic measures can be taken to mitigate these aspects? [2] Explain the difference between block, cyclic, and block-cyclic distributions of a matrix on a distributed-memory computer. Explain why for some applications, a block-cyclic distribution might give better performance than either a block or cyclic distribution. (Naming such an application will help. If you're stuck, explain why block distributions might give better performance for some tasks, and cyclic distributions for others.) [3] Auto-tuning Stencil Operations Suppose you are writing an auto-tuner whose job is to find a fast implementation of finite difference stencil operations on a shared-memory multicore computer. Give a list of different changes/optimizations your auto-tuner might attempt, or parameters it might vary, to get the code to work faster. For two of these changes that operate very differently, explain HOW these changes can cause variations in the running time that are unpredictable enough to justify auto-tuning. [4] Parallel Algorithm Trade-offs Consider a bucket sort algorithm that consists of identifying a set of pivot keys, making a pass over the data and distributing keys to buckets according to the pivots, sorting each bucket, and concatenating the buckets together. Parallelize this by giving each process(or) a subset of the keys and a copy of the pivots, and having each process(or) store one bucket. All process(or)s distribute their keys to the appropriate buckets; each performs a local sort, and the buckets are concatenated to form the final result. Describe how you would implement this simple algorithm in a message passing model. Then describe how you would implement it in a shared address space model. Where is synchronization required in each? What issues arise in expressing this algorithm? What is the best, worst, and expected case speedup on this algorithm? What factors limit its speedup? What are the shortcomings of this algorithm? How are they affected by scaling the problem size? How can they each be overcome? Outline the load balancing, locality, communication, and extra-work trade-offs present in the implementation of this algorithm. In the limit, what fundamentally limits the performance of this or any sorting algorithm?

(Fall 2004 - Culler & Shewchuk):"Problem 1 You are writing a parallelizing Fortran compiler for a distributed memory computer whose interconnection network has high bandwidth but large message latency. Your task is to create an efficient implementation of array statements of the following form. A[1:n] = B[1:n] The notation "1:n" is interpreted as an implicit loop over the array elements from 1 to n. (a) If the array A is distributed equally among the processors in a block distribution, and B is distributed equally in a cyclic distribution, what inter-processor communication takes place? What other data distribution schemes are frequently used? (b) Now consider array statements of the following form. A[X[1:n]] = B[1:n] A[1:n] = B[Y[1:n]] A[X[1:n]] = B[Y[1:n]] The good news is that these statements typically appear within loops that are iterated many times, and the arrays X and Y almost never change from iteration to iteration. How will your compiler make these statements execute as quickly as possible? (c) Suppose one of the array statements above appears alone inside a loop that iterates a thousand times. How can improve the ratio of computation to communication? Problem 2. You have a computation of the form FORALL i = 1 to n F(i) where n is not known at compile time. (a) Describe the run time mechanisms that might be used to distribute this computation over processors on a shared address space machine? (b) In a message passing model (say on a distributed memory machine). (c) What serialization do your mechanisms introduce? (What is the potential speedup?) (d) What are the sources of load imbalance? (e) What techniques could be used to improve the load balance. (f) In general, what trade-offs are faced in balancing load? (g) What general techniques can be used? Problem 3 (a) You are computing a one-dimensional FFT on a distributed-memory machine. How do you suggest parallelizing the FFT? What communication is involved in your scheme? As you scale the problem or machine up or down, where does your approach run into problems. (b) It used to be very popular to build machines with a hypercube interconnection network. (Can you name a couple?) How ca you take advantage of the machine topology to make the FFT as fast as possible? Under what conditions does this pay off? (c) In view of what you know about the LogP model, why do you think so few architectures are built in hypercube topologies anymore?"

(Spring 2004 - Demmel & Shewchuk):"Consider a one-dimensional nonlinear PDE on a domain Omega = [0, 1] with Dirichlet boundary conditions f(0) = f(1) = 0. The domain is divided into n = 2^k identical elements with n + 1 nodes. We wish to find an approximate solution to the PDE using either finite elements or finite differences, whichever you're more comfortable with. We use multigrid to solve the associated linear equations. (a) Describe the matrices that appear in the method, including matrices that are defined implicitly (e.g. the restriction operator). Roughly speaking, what are the contents of these matrices? Why? (b) Which part of the multigrid cycle is most difficult to parallelize on a shared memory multiprocessor? How would you suggest parallelizing it? Each iterate (estimated solution) can be expressed as a Fourier expansion. In principle, we could convert an iterate to its corresponding Fourier coefficients or back at any time by a matrix multiply, though in practice we would use an FFT. Suppose we transform the linear system so it is expressed in term of Fourier coefficients. We want to use multigrid to solve for the Fourier coefficients, and we stay in the Fourier domain throughout the multigrid cycle. The restriction operator and other elements of multigrid do not need to have _exactly_ the same effect as their space-domain counterparts; you can change them slightly to better suit the spectral domain, but they should accomplish the same goals in principle. (c) What do the matrices discussed in part a look like now? (For reasonable implementation choices.) (d) Which part of the spectral multigrid cycle is most difficult to a shared memory multiprocessor? How would you suggest parallelizing it? Does it make any sense to choose the spectral domain for this problem?"

(Fall 1999 - Demmel & Shewchuk):"Q1: You are given a bunch of fish swimming around in a very shallow pond, where for some reason each pair of fish attracts one with a force f = 1/r^3 , r = distance between fish In other, they attract weakly, but not so weakly that you can ignore distant fish. How would you compute the forces on each fish as efficiently as possible, assuming that you wanted justa good approximation? What would you do if the force law changed to f = { 1/r^3 , if r >= 1 { 2-1/r , if r < 1 ? Q2: Consider parallelizing two large 3D numerical grid-based simulations. One uses a uniform regular cubic grid, and the other uses a very unstructured 3D grid distributed among processors using a graph partitioner like Chaco or Metis. The simulation alternates between a computation phase, in which each vertex of the graph receives a new value based on the values of its neighbors, and a communication phase, in which each vertex gathers the values of its neighbors. (a) How should we structure the computation on the grid if we want to overlap computation and communication as much as possible? (b) If the unstructured simulation is run on a shared memory multiprocessor, is the need for a graph partitioner eliminated (or reduced)? Why? (c) If the simulations are run on a network of workstations, how do the structured and unstructured simulations differ in the demands they place on the latency and bandwidth of the communication network? Q3: Graph partitioning is a method for parallelizing and load balancing a variety of applications. Define the graph partitioning problems, and say how it could be used in the following situations. In particular, what is the graph that we partition, and how does a good partition impact the parallelism or complexity of the algorithm? Conjugate gradient method to solve Ax=b for large sparse A; Sparse Choleksy to solve Ax=b; Sparse GEPP to solve Ax=b, where A is a sparse nonsymmetric matrix; A term-document matrix is a matrix with one row for each document in a collection, and one column for each "interesting" term that appears in the document. A(i,j) is the number of times term j appears in document i. The goal is to partition the documents and partition the terms to discover whether there are subsets of documents which share many terms, and so are likely to be closely related. Now we will ask about some graph partitioning algorithms. They come in two flavors, depending on whether one has coordinate information associated with each vertex, which is often the case in physical problems. Suppose coordinate information is available, and that most edges tend to be between physically nearby vertices. What is an efficient partitioning algorithm? Suppose coordinate information is not available? What is an efficient partitioning algorithm? Let In(G) be the incidence matrix of an undirected graph G(V,E), i.e., an |N|-by-|E| matrix with one row per vertex and one column per edge. If E=(i,j) is edge k, then In(G) (i,k)=+1 and In(G) (j,k)=-1. Then the Laplacian of G is defined as L(G) = In(G)*(In(G))^T. What do the eigenvalues of L(G) tell you about how G can be partitioned?"

(Spring 1999 - Demmel & Culler):"Q1: Critical design issues in parallel systems often arise at several levels - at the machine level, at the level of programming primitives, at the application logic level - perhaps in subtly different forms. Let's take one issue, say deadlock. Give examples of how deadlock arises in each of these three levels, in a shared-address space framework and also in a message passing framework. What about race conditions? Q2: Parallel processing often deals with making trade-offs among opposing costs, such as load balance vs. communication. Often, different phases of a computation will optimize the tradeoffs differently. For example, consider a basic particle-in-cell (PIC) code. It normally has two phases. Each particle contributes to the field values at the corners of the mesh. Each field value contributes force on each of the particles, causing them to move. The time step is such that the fastest particles only move one cell length per timestep. Discuss how you would partition the mesh and how you would partition the particles. Q3: Given a message passing program, it is very simple to emulate it on a shared address space (SAS) programming model. How? Given a SAS program, can you give a general means of emulating it on a message passing system? What do you require in the message passing system? How might you optimize from that general solution? Q4: This question considers tradeoffs in implementing a variation of the sharks and fish problem: sharks and fish with gravity. We will present a sequence of slightly different situations, and ask which algorithm you would use in each case, and why: (a) Suppose there are only fish. What algorithms are available for this problem? What is their sequential and parallel complexity? Can you briefly say how they work? (b) Which algorithm would you choose if the number of fish, f, is very small? Very large? Why? (c) Now suppose there are some fish and some sharks, all of which are similar in mass. How does the algorithm change? (d) Now suppose fish are very light (like asteroids) so that their gravitational interactions can be ignored, but sharks are heavy (like planets) and so that fish-shark and shark-shark interactions are important. (e) Suppose instead that there are many fish and just a few sharks. How does the algorithm change? (f) Suppose next that communication is very expensive and there are many more fish than sharks. How does the algorithm change?"

August 2000