Computer Architecture Questions

(AR)


(Spring 2022 - John Wawrzynek, Prabal Dutta, Sophia Shao):
Accelerators & Technology Scaling

Your friend, Sasha, an algorithm developer in a Silicon Valley-based 
startup that develops revolutionary video compression algorithms, 
reaches out to you for advice.

Sasha recently heard about the importance of domain-specific accelerators 
and is wondering whether her company should also consider them. 
Sasha took undergrad-level computer architecture courses and understands 
processor pipelines quite well, though she hasn't seen accelerators before.

1. Source of efficiency: Could you describe to Sasha why domain-specific
   accelerators are typically more efficient than general-purpose
   processors? Please list three ways in which accelerators can simplify
   the pipeline of general-purpose processors.

2. Hardware arithmetic: After discussing with you, Sasha is interested in
   giving accelerators a try and hires you to design the first-generation 
   hardware for their algorithm. It turns out that the core computation of 
   their video codec is a multiply-accumulate (MAC) operation. Your fellow designer 
   plans to use an array multiplier and a ripple-carry adder.  Please 
   recommend an alternative adder and multiplier design.

3. CMOS Power: Now we want to understand the power consumption of our MAC 
   unit. Describe the dynamic power consumption of CMOS circuits (in formula format).

4. Dennard Scaling: In a world where Dennard scaling still holds, how 
   would power consumption change from one technology node to the next, assuming 
   the transistor dimensions are scaled by a factor of 0.7?

5. End of Dennard Scaling: Why is Dennard scaling not working now?
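
As a rough sketch for questions 3 and 4, assuming the textbook dynamic-power model P = a * C * V^2 * f and idealized Dennard rules (linear dimensions and supply voltage scale by the same factor s, frequency by 1/s), the per-node change can be tabulated as follows. The function name and dictionary keys are illustrative and not part of the original question.

```python
def dennard_scale(s: float = 0.7) -> dict:
    """Relative change per technology node under idealized Dennard scaling.

    Assumes dynamic power per transistor P = a * C * V^2 * f with the
    activity factor a unchanged (an assumption of this sketch).
    """
    capacitance = s          # C scales with linear dimensions
    voltage = s              # supply voltage scales with dimensions
    frequency = 1.0 / s      # gate delay shrinks by s, so f grows by 1/s
    area = s * s             # transistor area shrinks by s^2
    power = capacitance * voltage ** 2 * frequency  # per-transistor power
    return {
        "power_per_transistor": power,
        "area_per_transistor": area,
        "power_density": power / area,
    }

result = dennard_scale(0.7)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With s = 0.7, power per transistor falls to about half while about twice as many transistors fit in the same area, so power density stays roughly constant. Question 5 is about why the voltage term stopped following this rule.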

CPU Design

A central design problem in processor core design is dealing with memory latency.

1. How might you go about quantifying the effect of memory latency on 
   single-core processor performance? Assume for now that there are no caches.

2. What are the primary methods that we employ to deal with memory latency 
   and mitigate its effect?

3. Let's explore details of each approach:

a) Caches:
   What is a typical modern cache hierarchy?
   Why do we use multiple levels of cache?
   How can you quantify the advantage?

b) Store buffers:
   What are the advantages and disadvantages?

c) Prefetching:
   Describe how a simple instruction prefetcher works. Do the same for a data prefetcher.
   What are the challenges in any prefetching approach?

d) OoO execution:
   How does OoO help with memory latency?
   What are the key structures needed to support OoO?

e) Multithreading:
   How can multithreading (MT) help with memory latency?
   Describe this for both coarse-grained multithreading and simultaneous multithreading (SMT).
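
For questions 1 and 3a, one common way to put numbers on memory latency is a stall-cycle CPI model together with average memory access time (AMAT). The sketch below uses made-up parameter values purely for illustration; none of them come from the question, and the helper names are hypothetical.

```python
def cpi_with_memory(base_cpi, mem_refs_per_instr, avg_mem_cycles):
    """CPI when every memory reference stalls the core for avg_mem_cycles."""
    return base_cpi + mem_refs_per_instr * avg_mem_cycles

def amat(levels, memory_latency):
    """Average memory access time for a multi-level cache hierarchy.

    levels: list of (hit_time_cycles, local_miss_rate), ordered L1 outward.
    memory_latency: cycles to reach DRAM after the last level misses.
    """
    penalty = memory_latency
    # Work from the last cache level back to L1:
    # AMAT_i = hit_time_i + miss_rate_i * AMAT_(i+1)
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty

# Hypothetical numbers: 200-cycle DRAM, 1.3 memory references per instruction
# (1 instruction fetch + 0.3 data accesses), base CPI of 1.
DRAM = 200
no_cache = cpi_with_memory(1.0, 1.3, DRAM)
l1_only = amat([(1, 0.05)], DRAM)
l1_l2 = amat([(1, 0.05), (10, 0.2)], DRAM)
print(no_cache, l1_only, l1_l2)
```

Comparing `l1_only` with `l1_l2` is one way to quantify the benefit of a second cache level under these assumed hit times and miss rates.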

Virtual Memory

You've been hired as a consultant to help a recently-funded startup architect 
their new wearable computer. The two key engineers, both Stanford undergraduates, 
can't decide which processor they should use. One of them did an internship 
at Pebble and wants to use the processor that he knows: an embedded ARM Cortex-M 
class architecture without hardware support for virtual memory. The other did 
an internship at Samsung and wants to use the processor that she knows: an ARM 
Cortex-A class device with a memory management unit for supporting virtual
memory.

1. Is it needed? The company's CEO, who has a technical background but 
   doesn't know all the ins-and-outs of processor architecture, has asked you to 
   provide an independent assessment of the situation. You start by asking some 
   questions to guide the discussion about whether virtual memory is needed or not. 
   What might you ask?

2. Implementation and benefits. Intrigued by your questions, the CEO now 
   wants to understand (i) what exactly virtual memory is, (ii) what its components 
   are, (iii) how it relates to physical memory and disk, (iv) how it is implemented, 
   and (v) what benefits it offers. Make sure your explanation includes the process 
   for mapping virtual addresses to physical ones, including virtual and physical 
   page numbers, a description of MMU and TLB hardware, (hierarchical) 
   page/translation tables, and the processing path for TLB hits and misses. Start 
   with a simple explanation and build up from there. Be sure to also list at least 
   three benefits of virtual memory.

3. Cost and complexity. After hearing your explanation, the team thinks 
   that virtual memory sounds like a wonderful invention and the company should 
   use it! At this point, however, you explain that virtual memory does not come 
   for free. Explain some of the costs and complexities of using virtual memory 
   and their underlying reasons. Also explain the implications for latency-sensitive 
   embedded and real-time applications.


(Fall 2020):

1.  Memory hierarchy
a) Describe the memory hierarchy of a modern multicore microprocessor
   (including caches and DRAM).  You can choose any design with which
   you are familiar or a generic design suitable for desktop/laptop,
   server, or smartphone application processing.
   Some details you should cover: 
   - what are typical cache parameters (capacity, associativity, line
     size) of each level in the memory hierarchy?
   
   - what are the typical ranges of access latencies (load-use in processor 
     clock cycles) and access widths (in bytes) for each level?
   
   - how are the data caches on the multiple processors kept coherent?
b) In the system from a), describe the sequence of actions, including
   the separate bus transactions that occur, when one processor writes
   a word in memory, and then a different processor reads that
   word. How can the initial state of the system affect the set of
   transactions that occur in this sequence?
c) Describe techniques that could improve the performance of the
   sequence in b)

2. Power and Energy:
a) What is the relationship between power and energy?  Why are these
   important considerations in digital systems, in particular in
   data centers and in handheld devices?
b) Describe (in formula form) the components of power consumption in
   CMOS circuits.
c) Based on this expression, discuss techniques that can be used to
   reduce power consumption in general purpose computers, for instance
   laptop computers.  Which of these improve energy efficiency?
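
For part b), a commonly used first-order decomposition is sketched below. The symbols ($\alpha$ for activity factor, $C$ for switched capacitance, $I_{sc}$ for average short-circuit current, $I_{leak}$ for leakage current) are the usual textbook choices rather than anything fixed by the question.

```latex
P_{\text{total}} = \underbrace{\alpha\, C\, V_{DD}^{2}\, f}_{\text{dynamic (switching)}}
                 + \underbrace{V_{DD}\, I_{sc}}_{\text{short-circuit}}
                 + \underbrace{V_{DD}\, I_{leak}}_{\text{static (leakage)}},
\qquad
E = \int P \, dt
```

Because the switching term is quadratic in $V_{DD}$, lowering voltage together with frequency reduces switching energy per operation, whereas lowering frequency alone reduces power but leaves the switching energy of a fixed task roughly unchanged.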

3. Out-of-order (OoO) memory accesses
Executing load instructions as early as possible is critical to OoO
processor performance.
a) Why are loads more critical than stores?
b) Describe how loads and stores are handled in a modern OoO
   superscalar processor with register renaming and a unified physical
   register file.  You should describe where, how, and when the
   address and data portions of the instructions are executed by the
   microarchitecture.  Initially, assume that a conservative scheme is
   used that does not speculate on address values.
c) Describe how the system in b) could be improved by speculating on
   memory address values.  What are the potential pitfalls of address
   speculation?  How can these be mitigated?

4. Number Representations:
a) Suppose you are responsible for designing a domain specific
   processor (we will leave the domain unspecified).  You already have
   the high-level design done, but have not yet decided on the native
   number representation.  How would you go about determining the most
   appropriate number representation for your machine?  What are the
   factors to be considered?
b) Now consider "bfloat" - invented by Google for AI processors.
   Compare its format to IEEE 754 "single precision" and to
   "half-precision".  Show how to determine the maximum representable
   number.  Repeat with smallest.
c) One problem with floating point computation is that it does not
   always follow the rules of algebra of real numbers.  Why is that?
   Give an example.
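
For parts b) and c), the sketch below derives the largest and smallest positive values from the exponent and fraction widths of each format, then shows one case where addition is not associative. It assumes IEEE-style encoding (biased exponent, all-ones exponent reserved for infinity/NaN, subnormals supported). Whether a given bfloat16 implementation supports subnormals is hardware-dependent, so treat that row as an assumption. The function and dictionary names are illustrative.

```python
def fp_range(exp_bits: int, frac_bits: int) -> dict:
    """Largest finite and smallest positive values of an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = bias                 # all-ones exponent is reserved for inf/NaN
    min_exp = 1 - bias             # smallest normal exponent
    return {
        "max": (2 - 2.0 ** -frac_bits) * 2.0 ** max_exp,
        "min_normal": 2.0 ** min_exp,
        "min_subnormal": 2.0 ** (min_exp - frac_bits),
    }

# Field widths are sign + exponent + fraction bits.
formats = {
    "fp32 (1+8+23)": fp_range(8, 23),
    "fp16 (1+5+10)": fp_range(5, 10),
    "bfloat16 (1+8+7)": fp_range(8, 7),
}
for name, r in formats.items():
    print(name, r)

# Floating-point addition is not associative: the small term is absorbed
# when it is added to the large one first.
a, b, c = 1e20, -1e20, 1.0
left = (a + b) + c    # 0.0 + 1.0
right = a + (b + c)   # b + c rounds back to -1e20
print(left, right)
```

The table makes the trade-off visible: bfloat16 keeps the exponent range of single precision but has far fewer fraction bits, while half precision has more fraction bits than bfloat16 but a much smaller range.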


(Fall 2018):
CMOS Circuits
a) Consider the design of a 4-bit LFSR (linear feedback shift register). 
   Using CMOS transistors, how would you construct the circuit?

b) What determines the max frequency of operation?

c) In this case what could you do to speed it up?

d) What determines the power consumption?

e) How could you decrease the power consumption?

f) Can you improve the energy efficiency?
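
Before drawing transistors for part a), it can help to pin down the logic being built. The behavioral sketch below assumes a Fibonacci-style LFSR with feedback polynomial x^4 + x^3 + 1 (one common maximal-length choice; the question does not fix the taps), which corresponds to four D flip-flops in a chain plus a single XOR gate. The function names are illustrative.

```python
def lfsr4_step(state: int) -> int:
    """One clock edge of a 4-bit Fibonacci LFSR, taps at bits 3 and 2."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1  # the single XOR gate
    return ((state << 1) | feedback) & 0xF        # shift through 4 flip-flops

def lfsr4_period(seed: int = 0b0001) -> int:
    """Count clock cycles until the register returns to the seed."""
    state, cycles = lfsr4_step(seed), 1
    while state != seed:
        state = lfsr4_step(state)
        cycles += 1
    return cycles

period = lfsr4_period()
print(period)
```

In this model the longest register-to-register path is a flip-flop clock-to-Q delay, one XOR, and a flip-flop setup time, which is a reasonable starting point for parts b) and c). The all-zero state maps to itself and must be avoided at reset.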

Modern processor organization
a) What are precise exceptions and why are they useful? 

b) Pick your favorite multi-issue, out-of-order superscalar processor.  
   Draw a rough block diagram and explain how it provides precise interrupts.

c) What is register renaming, why is it useful, and how is it implemented in 
   your processor?

d) Can you always increase performance by issuing more instructions per cycle?  
   Explain.  Are there other ways in which to improve performance?

e) How does the hardware complexity of a multi-issue processor vary with issue 
   width? You can answer by explaining the scaling of various components.

f) Why is branch prediction important in a multi-issue processor?  Pick your 
   favorite branch predictor, draw a diagram for it, and explain how it is tied 
   into your processor pipeline.

Cache coherence
a) Describe the cache coherence problem in an SMP.  What does it mean for a 
   multiprocessor to have a sequentially-consistent memory model?

b) Describe a snoopy cache-coherence protocol.  How does it work?  Take us 
   through a typical transaction in which several processors read a value, then 
   one decides to write it. 

c) What is a "reasonable" bus bandwidth for a multiprocessor?  What factors 
   will limit the maximum number of processors on a bus?

d) Suppose we were interested in handling hundreds of processors.  What issues 
   might cause one to use something other than a snoopy protocol?  Describe 
   details of a cache coherence protocol that can be used for many processors. 
   Can you come up with a block-diagram for one of the nodes of the system?  
   What about the internals of a memory controller?

DNN Acceleration
Assume you are a system designer for an application that makes substantial use 
of a particular DNN (deep neural network).

a) Does it make sense to design either a custom ASIC accelerator or a dedicated 
   processor tailored to this application?

b) What are the potential benefits and disadvantages over using a standard GPP? 
   What are the important metrics of comparison? Estimate the advantages over a 
   GPP. Estimate the disadvantages.

c) What might be a promising custom architecture for this application? What do 
you think would be the limitation to performance?

d) Besides a custom silicon design, what are other design/implementation 
alternatives?  How do they compare?

e) Are there other techniques that might improve performance?

Power in the large and in the small. 
This question is about power in microprocessors and data centers (machine rooms).

a) Why is power a serious problem today for both microprocessors and data centers?

b) List techniques that are used to reduce the power problem for microprocessors.

c) Are there analogous techniques that work for data centers as well?

d) List techniques that work for data centers that have no analogy in 
microprocessors.

e) Was power a problem 5 to 10 years ago? Why or why not?

f) Will the problem be better or worse in 5 to 10 years? Why?


(Fall 2015 - Patterson and Wawrzynek):

1. What is the relationship between power and energy?
2. Intel and Micron recently announced a new memory
    technology 3D X-point (details provided). How might you use
    that new technology in some computer system?
3.  Why do some say that Moore's Law has ended or is slowing
    down? How might computer architecture change as a result?
4. Vectors vs. GPUs: What are key similarities and differences,
    and the overall pros and cons? 
5. (if time permits) What are a few metrics of dependability?
    How are they related? What are techniques that improve availability?
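
For question 5, the usual dependability metrics are mean time to failure (MTTF), mean time to repair (MTTR), and availability, related by availability = MTTF / (MTTF + MTTR). The sketch below uses hypothetical numbers to show the two levers this relation exposes.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical server: fails about every 1000 hours, takes 1 hour to repair.
base = availability(1000.0, 1.0)
# Two ways to improve availability: fail less often or repair faster.
more_reliable = availability(2000.0, 1.0)
faster_repair = availability(1000.0, 0.5)
print(f"{base:.6f} {more_reliable:.6f} {faster_repair:.6f}")
```

Doubling MTTF and halving MTTR give the same availability in this model, which is one reason techniques such as redundancy and fast failover are often discussed together.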


(Fall 2011 - Asanovic and Patterson):

1) Explain energy and power.  How is energy dissipated in a modern
   microprocessor?  What techniques could an architect use to reduce
   energy consumption in a microprocessor?

2) Explain the different types of memory used in modern server and
   handheld computers.  What are their cost/bit, densities, access
   latencies, and bandwidths?  How would you take advantage of different
   memory types in a modern memory system?

3) Describe the operation of an out-of-order superscalar processor
   based on a unified physical register file (e.g., MIPS R10K or Alpha
   21264).  Describe how register renaming can be performed in parallel
   for a group of sequential instructions.  What is the minimal number of
   physical registers needed?  How does instruction scheduling logic cope
   with variable latency of cache accesses?

4) Describe the principal types of parallelism exploited in computer
   systems.  Describe representative architectures for each type of
   parallelism.


 (Fall 2010 - Kubi & Patterson):
"Q1: Flash Memory
 Q2: Modern processor
 Q3: CMOS dependability
 Q4: Personal Mobile Device"

(Fall 2008 - Asanovic & Wawrzynek):
"1. Memory Hierarchy

a) For a typical modern general purpose processor sketch 
   and describe in detail the memory hierarchy.

b) How would you enhance/modify the above to accommodate
   several processors sharing a cache-coherent memory.

c) Would your solution scale to hundreds of processors?
   If not, what would you change to accommodate the scaling?


2. Processor Microarchitecture

a) Sketch and describe the major stages of a modern 
   out-of-order processor pipeline and how the processor works.
   [BP/IF, Dec/RegRename, Ex, Completion, Commit]

b) Devise and write assembly code for an example program 
   where register renaming helps performance.  Show an example
   where it doesn't help.

c) Devise and write assembly code for an example program 
   where branch prediction helps and where it doesn't.

d) If you had to choose only one of register renaming and
   branch prediction over the other, which one would you 
   choose and why?


3. Power and Energy

a) What is the relationship between power and energy?  Why are
   these important considerations in digital systems?

b) Describe (in formula form) the components of power consumption in
   CMOS circuits.
   
c) Take the design of a floating point unit for instance.  For a
   fixed required throughput, what could you do to lower its
   energy/operation?

d) What ultimately limits the effectiveness of the techniques from c)?


4. Parallel Processing

   In 2005 there was a historic shift in the industry, with all microprocessor
   companies announcing that their future products would be chip-scale
   multiprocessors (CMPs).  Why did this happen?

   [No more Vt scaling (leakage problem dominates), diminishing returns on
    ILP extraction, memory latency (&BW?) problems]


5. Looking ahead

   Consider as a baseline architecture, a collection of energy
   efficient RISC cores.  Assuming that technology stops scaling, what
   can be done to further reduce energy consumption for a given
   workload?  
   [accelerators, vectors, ...]"

(Spring 2004 - Wawrzynek & Patterson):
"1)  Power and Energy in Microprocessors

a) Give us a metric for expressing energy efficiency of a microprocessor for a 
   particular workload.

   (would expect MIPS/watt or joules/instruction, ...)

b) If I gave you a microprocessor and that workload, how would you measure its 
   average energy efficiency?

   (needs to understand P=IV and think about using a current meter.
   Then measure time and number of instructions, ...)

c) What are the factors that affect/determine power consumption in this experiment 
   and how could you influence each factor?
   
   (P = CV^2f; C depends on process and wire lengths, V on process and user settings, ...)
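
The measurement described in parts a) and b) can be sketched as a small calculation. The supply voltage, average current, runtime, and instruction count below are hypothetical readings, and the function name is illustrative. In practice the current would come from a meter or sense resistor on the processor supply rail and the instruction count from a performance counter or simulator.

```python
def energy_efficiency(voltage_v, avg_current_a, runtime_s, instructions):
    """Derive efficiency metrics from bench measurements."""
    power_w = voltage_v * avg_current_a      # P = I * V
    energy_j = power_w * runtime_s           # E = P * t
    mips = instructions / runtime_s / 1e6    # millions of instructions per second
    return {
        "power_w": power_w,
        "energy_j": energy_j,
        "joules_per_instruction": energy_j / instructions,
        "mips_per_watt": mips / power_w,
    }

# Hypothetical readings: 1.2 V rail, 2.5 A average, 10 s run, 2e10 instructions.
m = energy_efficiency(voltage_v=1.2, avg_current_a=2.5,
                      runtime_s=10.0, instructions=2e10)
for name, value in m.items():
    print(f"{name}: {value:.4g}")
```

Note that MIPS/watt is the reciprocal of joules per instruction scaled by 1e6, so the two metrics from part a) carry the same information.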


2) Microprocessor Limitations

a) What do you think are some of the technological limitations to increasing microprocessor 
   performance in the next 3 to 7 years?

b) How do recent 80x86 microprocessor designs match up to those limitations?

c) How would you expect such limitations to change microarchitectures in this time 
   period?


3) Networks of processors

Suppose you were engineering a multiprocessor on a chip (a homogeneous array of simple 
processing units, each with local memory).  Think about what structure you would use to 
connect the processors together.

a) What network topologies would you consider and why?

   (Should be able to describe meshes, trees, busses, xbar, etc.)

   What factors would you consider in choosing one versus another?

   What other factors would come into the design?

  (need to consider area cost, control and routing complexity, packet switched or circuit switched, cross-section bandwidth, scaling)

   How would you go about determining the best design for this network?"

 (Fall 2003 - Culler & Kubiatowicz):
"Q1: 	For the first question, we were looking for concise definition of 
	precise interrupts, followed by clear illuminations of mechanisms for 
	achieving precise interrupts in both 5-stage and out of order pipelines.

 Q2: 	The second question focused on evaluating design tradeoffs.  It 
	centered on energy efficiency.

 Q3: 	The third question focused on instruction set design.  The specific 
	context was communication on a chip based multiprocessor.  We asked 
	for a list of possible forms of communication and then to discuss how 
	to extend the ISA for each.

 Q4: 	For the final question, we were looking to explore the design space for 
	a NAS (network attached storage) system." 

 (Fall 2002 - Culler & Patterson):
"Q1: Vector processing
 Q2: Power and energy
 Q3: Errors in computers
 Q4: Branch prediction"

(Spring 2002 - Kubiatowicz & Wawrzynek):
"Q1: RISC/CISC
 Q2: Power consumption
 Q3: Networked multiprocessors
 Q4: Multi-threading"

(Fall 2001 - Kubiatowicz & Patterson):
"Q1: modern processor design
Q2: trace caches
Q3: VLSI scaling
Q4: high-transaction rate server"

(Spring 2001 - Wawrzynek & Kubiatowicz):
"Q1: What are precise interrupts?  Why are they useful?  Discuss how to
implement precise interrupts in a modern processor (superscalar,
out-of-order) of your choice.  What is branch prediction?  Why is it
useful?  How does it fit into your example processor?  What other
things do people try to predict?

Q2:
Draw the floor plan for a processor (real or imaginary but realistic).
Include the major functional blocks and their approximate sizes.
Imagine you now have a merged DRAM/logic process.  Assuming an
identical organization and memory hierarchy, how would this affect the
floor plan?  How would the floor plan differ if you could change the
organization or memory hierarchy?

Q3:
What is memory coherency in multiprocessors?  Discuss coherency in a
snoopy bus model.  Describe a series of reads and writes in this
model.  What is consistency?  What is sequential consistency and how
can it be modelled?  Describe the limitations of snoopy-based coherency
models.  What are the solutions?  Describe a series of reads and
writes in that model.

Q4:
We want to create a network router from a standard PC.  (For our
purposes, a router will input a packet, examine the packet, make a
decision about where to route the packet based on stored state, and
send the packet out that port.)  Draw a diagram of this system and
trace a packet through this system.  How do we determine the number of
ports that a single PC can support?  What is the bottleneck?  How do
you justify this?  Estimate various execution times, bandwidths, and
latencies."

(Fall 2000 - Wawrzynek & Patterson):
"Q1: Identify the critical paths in microprocessor design. What is the
effect of carry logic on adder delay? Discuss fast adder techniques.

Q2: Discuss issues related to disk drives, trends, today's spec, and
future specs.  Reason about read times for an entire disk and how this
relates to RAIDs and other disk arrays. 

Q3: Discuss the basic issues involved with system implementation
alternatives.   

Q4: Identify some out-of-order (OOO) execution processors. Select one
and describe its operations."

(Spring 2000 - Wawrzynek & Kubiatowicz):
"Q1: Discuss the distinctions between RISC and CISC; compare and contrast
the advantages of one over the other. Discuss whether the RISC/CISC
distinction is still valid today.

Q2: Detail the overheads of message communication. How
would you optimize communication?

Q3: Account for the large difference in energy observed between solving
a problem with custom hardware vs. a general purpose processor. List
some of the metrics one might use to decide between a custom ASIC, an
FPGA, or a general purpose processor.

Q4: What is the cache-coherence problem? What are the definitions of sequential
consistency? Discuss the details of snoopy protocols and
the typical bus bandwidth. Estimate the maximum number of processors
that would fit on a bus."

(Fall 1999 - Wawrzynek & Kubiatowicz):
"Q1. Describe precise interrupt.  Why is it useful?  Describe exactly how
you would implement it on a 5-stage pipeline.  Draw a rough block diagram of 
an out-of-order execution processor.  Describe how you would implement precise
interrupts with respect to that block diagram.  What happens to the
precise interrupt scheme when there are branch mispredictions?

Q2: Given a very simple processor (pipelined, single-issue, fully in-order,
with no cache and no floating-point unit), estimate the
number of transistors it will use.  Given the number of transistors you
just came up with, can you implement it using a state-of-the-art FPGA?
What is the rough capacity of the FPGA?  What are some pros and cons of
implementing such a processor on an FPGA, if it is possible?  Why would
someone ever want to do such a thing in the real world?

Q3: Suppose you want to implement a multiprocessor over a conventional
network (like Ethernet). Describe all the mechanisms you need to pass a message
from one user to another user.  Estimate the time it takes for the
message to go through each part of the process.  Given the time you just
estimated, is it fast enough to meet the needs of implementing a
multiprocessor?  How can you improve the speed?  How can you protect
users from interfering with each other if they have control over the network
hardware?

Q4: Define energy efficiency.  Given that you have a 4-way superscalar
processor, a program, and the data input, describe how (including
hardware setup) you can measure the energy efficiency of this processor.  What
would be the issue if you wanted to compare this energy efficiency number
with those of other processors?  If you were the chief architect, what kind of
processor would you design to optimize energy efficiency?"

(Spring 1999 - Patterson & Kubiatowicz):
"Q1: What are the causes of cache misses (three C's)? Describe how to
remove these. Under what circumstances would they be advantageous? 

Q2: What is the definition of precise interrupts and the implementation
of precise interrupts in a five-stage pipeline? Discuss modern
superscalar processors.

Q3: Discuss some of the issues behind the replacement of buses with
routers. What are some advantages of router-like architectures?

Q4: Discuss computational paint."

(Fall 1998 - Patterson & Kubiatowicz):
"Q1: What is the definition of precise interrupts and the implementation
of precise interrupts in a five-stage pipeline? Discuss modern
superscalar processors.

Q2: What are the consequences of increased wire delay? How would you
evaluate ambiguous design alternatives?

Q3: Discuss the basic message communication path. Why would a separate
network be needed to avoid protocol overhead? 

Q4: (A database design problem was given.) Calculate the numbers of
disks and CPUs for 100% utilization as well as bus organization. What
does queueing theory suggest?"


August 2021