Parallel Processing Questions


Fall 2019 

Question 1: 
A- Please write a CSR-based serial SpMV (y = Ax) code. 
B- How would you parallelize it for a multithreaded CPU? 
C- How would you parallelize it for distributed memory? What is the communication cost (here we give a specific nonzero structure on a toy matrix) in alpha/beta model? 
D- Let's say the communication is too much, what would you use to partition this matrix to minimize communication? How would you model communication? 
E- Assume you have a matrix with power-law degree (nonzero counts per row) distribution, how would this work? What would go wrong with row-wise decomposition? 

Question 2:  
A- Please give an example of an application or algorithm that would use dynamic load balancing. 
B- Describe an implementation for a shared task queue (implemented as a linked-list in shared memory)  
C- How would you do it with Compare-and-swap?  Gave signatures and pseudocode for CAS. 
D- How would you reduce contention? (hint given: having a separate queue per processor, e.g., a shared array of linked lists.)  How would you select a processor from which to stea? 
E- Which item (older or newer) should you steal from the remote queue vs. your local queue. 

(Fall 2012 - Culler and Shewchuk):

[1] Parallelism and Performance Modeling

Give a simple performance model of a generic parallel operation performed
on n elements.  Explain what the parameters represent.

How would such the model differ for pipeline operation versus fully
concurrent operation in p processors or functional units?

What measures can be taken at the algorithmic level to maximize performance
under these models?

Given an analogous model for communication performance.

Where do issues of overhead and latency appear in this model?  What
architectural or algorithmic measures can be taken to mitigate these aspects?

[2]  Explain the difference between block, cyclic, and block-cyclic
distributions of a matrix on a distributed-memory computer.

Explain why for some applications, a block-cyclic distribution might give
better performance than either a block or cyclic distribution.  (Naming
such an application will help.  If you're stuck, explain why block
distributions might give better performance for some tasks, and cyclic
distributions for others.)

[3]  Auto-tuning Stencil Operations

Suppose you are writing an auto-tuner whose job is to find a fast
implementation of finite difference stencil operations on a shared-memory
multicore computer.

Give a list of different changes/optimizations your auto-tuner might attempt,
or parameters it might vary, to get the code to work faster.  For two of
these changes that operate very differently, explain HOW these changes
can cause variations in the running time that are unpredictable enough to
justify auto-tuning.

[4] Parallel Algorithm Trade-offs

Consider a bucket sort algorithm that consists of identifying a set of pivot
keys, making a pass over the data and distributing keys to buckets
according to the pivots, sorting each bucket, and concatenating the
buckets together.  Parallelize this by giving each process(or) a subset
of the keys and a copy of the pivots, and having each process(or) store
one bucket.  All process(or)s distribute their keys to the appropriate
buckets; each performs a local sort, and the buckets are concatenated to
form the final result.

Describe how you would implement this simple algorithm in a message passing
model.  Then describe how you would implement it in a shared address space
model.  Where is synchronization required in each?  What issues arise in
expressing this algorithm?

What is the best, worst, and expected case speedup on this algorithm?  What
factors limit its speedup?  What are the shortcomings of this algorithm?  How
are they affected by scaling the problem size?  How can they each be overcome?

Outline the load balancing, locality, communication, and extra-work
trade-offs present in the implementation of this algorithm.

In the limit, what fundamentally limits the performance of this or any
sorting algorithm?

(Fall 2004 - Culler & Shewchuk):
"Problem 1

You are writing a parallelizing Fortran compiler for a distributed
memory computer whose interconnection network has high bandwidth but
large message latency.  Your task is to create an efficient
implementation of array statements of the following form.

  A[1:n] = B[1:n]

The notation "1:n" is interpreted as an implicit loop over the array elements
from 1 to n.

(a)  If the array A is distributed equally among the processors in a block
distribution, and B is distributed equally in a cyclic distribution, what
inter-processor communication takes place? 

What other data distribution schemes are frequently used?

(b) Now consider array statements of the following form.

  A[X[1:n]] = B[1:n]
  A[1:n] = B[Y[1:n]]
  A[X[1:n]] = B[Y[1:n]]

The good news is that these statements typically appear within loops that are
iterated many times, and the arrays X and Y almost never change from iteration
to iteration.

How will your compiler make these statements execute as quickly as possible?

(c) Suppose one of the array statements above appears alone inside a
     loop that iterates a thousand times.  How can improve the ratio
     of computation to communication?

Problem 2.

You have a computation of the form

FORALL i = 1 to n

where n is not known at compile time.

(a) Describe the run time mechanisms that might be used to distribute
this computation over processors on a shared address space machine?  

(b) In a message passing model (say on a distributed memory machine).

(c) What serialization do your mechanisms introduce?  (What is the
potential speedup?)

(d) What are the sources of load imbalance?

(e) What techniques could be used to improve the load balance.

(f) In general, what trade-offs are faced in balancing load?

(g) What general techniques can be used?

Problem 3

(a) You are computing a one-dimensional FFT on a distributed-memory
machine.  How do you suggest parallelizing the FFT?  What
communication is involved in your scheme?

As you scale the problem or machine up or down, where does your
approach run into problems.

(b) It used to be very popular to build machines with a hypercube
interconnection network.  (Can you name a couple?)  How ca you take
advantage of the machine topology to make the FFT as fast as possible?
Under what conditions does this pay off?

(c)  In view of what you know about the LogP model, why do you think so few
architectures are built in hypercube topologies anymore?"

(Spring 2004 - Demmel & Shewchuk):
"Consider a one-dimensional nonlinear PDE on a domain Omega = [0, 1] with
Dirichlet boundary conditions f(0) = f(1) = 0.  The domain is divided into
n = 2^k identical elements with n + 1 nodes.  We wish to find an approximate
solution to the PDE using either finite elements or finite differences,
whichever you're more comfortable with.  We use multigrid to solve the
associated linear equations.

(a)  Describe the matrices that appear in the method, including matrices that
     are defined implicitly (e.g. the restriction operator).  Roughly speaking,
     what are the contents of these matrices?  Why?

(b)  Which part of the multigrid cycle is most difficult to parallelize on
     a shared memory multiprocessor?  How would you suggest parallelizing it?

Each iterate (estimated solution) can be expressed as a Fourier expansion.  In
principle, we could convert an iterate to its corresponding Fourier
coefficients or back at any time by a matrix multiply, though in practice we
would use an FFT.

Suppose we transform the linear system so it is expressed in term of Fourier
coefficients.  We want to use multigrid to solve for the Fourier coefficients,
and we stay in the Fourier domain throughout the multigrid cycle.

The restriction operator and other elements of multigrid do not need to have
_exactly_ the same effect as their space-domain counterparts; you can change
them slightly to better suit the spectral domain, but they should accomplish
the same goals in principle.

(c)  What do the matrices discussed in part a look like now?  (For reasonable
     implementation choices.)

(d)  Which part of the spectral multigrid cycle is most difficult to
     a shared memory multiprocessor?  How would you suggest parallelizing it?
     Does it make any sense to choose the spectral domain for this problem?"

(Fall 1999 - Demmel & Shewchuk):
"Q1: You are given a bunch of fish swimming around in a very shallow
pond, where for some reason each pair of fish attracts one with a force
	f = 1/r^3  , r = distance between fish
In other, they attract weakly, but not so weakly that you can ignore
distant fish.

How would you compute the forces on each fish as efficiently as
possible, assuming that you wanted justa good approximation?

What would you do if the force law changed to
	f = { 1/r^3  , if r >= 1
	    { 2-1/r  , if r < 1    ?

Q2: Consider parallelizing two large 3D numerical grid-based
simulations.  One uses a uniform regular cubic grid, and the other uses
a very unstructured 3D grid distributed among processors using a graph
partitioner like Chaco or Metis.

The simulation alternates between a computation phase, in which each
vertex of the graph receives a new value based on the values of its
neighbors, and a communication phase, in which each vertex gathers the
values of its neighbors.

(a) How should we structure the computation on the grid if we want to
overlap computation and communication as much as possible?

(b) If the unstructured simulation is run on a shared memory
multiprocessor, is the need for a graph partitioner eliminated (or
reduced)?  Why?

(c) If the simulations are run on a network of workstations, how do the
structured and unstructured simulations differ in the demands they place
on the latency and bandwidth of the communication network?

Q3: Graph partitioning is a method for parallelizing and load balancing
a variety of applications.  Define the graph partitioning problems,
and say how it could be used in the following situations.  In
particular, what is the graph that we partition, and how does a good
partition impact the parallelism or complexity of the algorithm?

	Conjugate gradient method to solve Ax=b for large sparse A;
	Sparse Choleksy to solve Ax=b;
	Sparse GEPP to solve Ax=b, where A is a sparse nonsymmetric
	A term-document matrix is a matrix with one row for each 
		document in a collection, and one column for each
		"interesting" term that appears in the document.
		A(i,j) is the number of times term j appears in
		document i.  The goal is to partition the documents
		and partition the terms to discover whether there are
		subsets of documents which share many terms, and so
		are likely to be closely related.

Now we will ask about some graph partitioning algorithms.  They come in
two flavors, depending on whether one has coordinate information
associated with each vertex, which is often the case in physical

Suppose coordinate information is available, and that most edges tend to
be between physically nearby vertices.  What is an efficient
partitioning algorithm?

Suppose coordinate information is not available?  What is an efficient
partitioning algorithm?

Let In(G) be the incidence matrix of an undirected graph G(V,E), i.e.,
an |N|-by-|E| matrix with one row per vertex and one column per edge.
If E=(i,j) is edge k, then In(G) (i,k)=+1 and In(G) (j,k)=-1. Then the
Laplacian of G is defined as L(G) = In(G)*(In(G))^T.  What do the
eigenvalues of L(G) tell you about how G can be partitioned?"

(Spring 1999 - Demmel & Culler):
"Q1: Critical design issues in parallel systems often arise at several
levels - at the machine level, at the level of programming primitives,
at the application logic level - perhaps in subtly different forms.
Let's take one issue, say deadlock. Give examples of how deadlock arises
in each of these three levels, in a shared-address space framework and
also in a message passing framework.  What about race conditions?

Q2: Parallel processing often deals with making trade-offs among
opposing costs, such as load balance vs. communication.  Often,
different phases of a computation will optimize the tradeoffs
differently.  For example, consider a basic particle-in-cell (PIC) code.
It normally has two phases.  Each particle contributes to the field
values at the corners of the mesh.  Each field value contributes force
on each of the particles, causing them to move. The time step is such
that the fastest particles only move one cell length per timestep.
Discuss how you would partition the mesh and how you would partition the

Q3: Given a message passing program, it is very simple to emulate it on
a shared address space (SAS) programming model.  How? Given a SAS
program, can you give a general means of emulating it on a message
passing system?  What do you require in the message passing system? How
might you optimize from that general solution?

Q4: This question considers tradeoffs in implementing a variation of
the sharks and fish problem: sharks and fish with gravity. We will
present a sequence of slightly different situations, and ask which
algorithm you would use in each case, and why:
(a) Suppose there are only fish. What algorithms are available for this
problem? What is their sequential and parallel complexity? Can you
briefly say how they work?
(b) Which algorithm would you choose if the number of fish, f, is very
small? Very large? Why?
(c) Now suppose there are some fish and some sharks, all of which are
similar in mass. How does the algorithm change?
(d) Now suppose fish are very light (like asteroids) so that their
gravitational interactions can be ignored, but sharks are heavy (like
planets) and so that fish-shark and shark-shark interactions are
(e) Suppose instead that there are many fish and just a few sharks. How
does the algorithm change?
(f) Suppose next that communication is very expensive and there are many
more fish than sharks. How does the algorithm change?"

August 2000