Simulation Analysis of Data Sharing in Shared Memory Multiprocessors

Susan J. Eggers

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-89-501
February 1989

http://www2.eecs.berkeley.edu/Pubs/TechRpts/1989/CSD-89-501.pdf

This dissertation examines shared memory reference patterns in parallel programs that run on bus-based, shared memory multiprocessors. The study reveals two distinct modes of sharing behavior. In sequential sharing, a processor makes multiple, sequential writes to the words within a block, uninterrupted by accesses from other processors. Under fine-grain sharing, processors contend for these words, and the number of per-processor sequential writes is low. Whether a program exhibits sequential or fine-grain sharing affects several factors relating to multiprocessor performance: the accuracy of sharing models that predict cache coherency overhead, the cache miss ratio and bus utilization of parallel programs, and the choice of coherency protocol.
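
The two sharing modes can be illustrated by measuring write runs in a memory reference trace. The sketch below is only an illustration of the definition given above, not the dissertation's tooling: the trace format `(cpu, op, address)`, the function name `write_run_lengths`, and the 16-byte block size are all assumptions made for the example.

```python
from collections import defaultdict

BLOCK_SIZE = 16  # bytes per cache block; an assumed value for illustration


def write_run_lengths(trace, block_size=BLOCK_SIZE):
    """Return {block: [run lengths]} for a trace of (cpu, op, address) tuples.

    A run counts consecutive writes by one processor to a block. It ends
    when a different processor reads or writes that block.
    """
    current = {}                # block -> (writing cpu, writes so far in the run)
    runs = defaultdict(list)
    for cpu, op, addr in trace:
        block = addr // block_size
        owner, count = current.get(block, (None, 0))
        if owner is not None and cpu != owner:
            # An access by another processor terminates the open run.
            if count:
                runs[block].append(count)
            owner, count = None, 0
        if op == "W":
            owner, count = cpu, count + 1
        current[block] = (owner, count)
    # Flush runs that are still open at the end of the trace.
    for block, (owner, count) in current.items():
        if count:
            runs[block].append(count)
    return dict(runs)


# Hypothetical traces: one processor finishes its writes before the other starts,
# versus two processors alternating writes to words of the same block.
sequential_trace = [(0, "W", 0), (0, "W", 4), (0, "W", 8),
                    (1, "R", 0), (1, "W", 4), (1, "W", 8)]
fine_grain_trace = [(0, "W", 0), (1, "W", 4), (0, "W", 8), (1, "W", 12)]

sequential_runs = write_run_lengths(sequential_trace)
fine_grain_runs = write_run_lengths(fine_grain_trace)
```

Long runs per block suggest sequential sharing, while runs of length one or two suggest fine-grain sharing. Any threshold separating the two would be a further assumption.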

An architecture-independent model of write sharing was developed, based on the inter-processor activity to write-shared data. The model was used to predict the relative coherency overhead of write-invalidate and write-broadcast protocols. Architecturally detailed simulations validated the model for write-broadcast. Successive refinements, incorporating architecture-dependent parameters, most importantly cache block size, produced acceptable predictions for write-invalidate. Block size was crucial for modeling write-invalidate, because the pattern of memory references within a block determines protocol performance.
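
To make the difference between the two protocol families concrete, here is a toy count of coherency-related bus operations over a trace. It is not the model developed in the dissertation. It assumes infinite caches, one bus operation per invalidation or per broadcast write, and a hypothetical trace format of `(cpu, op, address)`; the function name `coherency_bus_ops` is likewise invented for the example.

```python
def coherency_bus_ops(trace, block_size=16):
    """Count coherency events for write-invalidate and write-broadcast.

    trace is a list of (cpu, op, address) tuples with op "R" or "W".
    Caches are assumed infinite, so a block leaves a cache only when it
    is invalidated. First-reference misses are not counted.
    """
    valid = {}    # block -> cpus holding a valid copy under write-invalidate
    loaded = {}   # block -> cpus that have ever loaded the block
    counts = {"invalidations": 0, "invalidation_misses": 0, "broadcasts": 0}
    for cpu, op, addr in trace:
        block = addr // block_size
        holders = valid.setdefault(block, set())
        sharers = loaded.setdefault(block, set())

        # Write-invalidate: re-fetch after an invalidation, then
        # invalidate the other copies on a write to a shared block.
        if cpu not in holders:
            if cpu in sharers:
                counts["invalidation_misses"] += 1
            holders.add(cpu)
        if op == "W" and len(holders) > 1:
            counts["invalidations"] += 1
            holders.clear()
            holders.add(cpu)

        # Write-broadcast: copies are never invalidated, so every write
        # to a block cached elsewhere goes out on the bus.
        sharers.add(cpu)
        if op == "W" and len(sharers) > 1:
            counts["broadcasts"] += 1
    return counts


# Hypothetical traces, all within one 16-byte block.
sequential_counts = coherency_bus_ops(
    [(0, "W", 0), (0, "W", 4), (0, "W", 8), (1, "R", 0), (1, "W", 4), (1, "W", 8)]
)
fine_grain_counts = coherency_bus_ops(
    [(0, "W", 0), (1, "W", 4), (0, "W", 8), (1, "W", 12)]
)
```

Because the block number is derived from `block_size`, rerunning the same trace with a different block size changes which writes count as shared. This is one simple way to see why block size matters for the write-invalidate counts.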

The cache and bus behavior of parallel programs running under write-invalidate protocols was evaluated over various block and cache sizes. The analysis determined the effect of shared memory accesses on cache miss ratio and bus utilization by focusing on the sharing component of these metrics. The studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of the metrics proportionally increases with cache and block size, and for some cache configurations determines both their magnitude and trend. Again, the amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit sequential sharing perform better than those whose sharing is fine-grain.

A cross-protocol comparison provided empirical evidence of the performance loss caused by increasing block size in write-invalidate protocols and cache size in write-broadcast. It then measured the extent to which read-broadcast improved write-invalidate performance and competitive snooping helped write-broadcast. The results indicated that read-broadcast reduced the number of invalidation misses, but at a high cost in processor lockout from the cache. The surprising net effect was an increase in total execution cycles. Competitive snooping benefited only those programs that exhibited sequential sharing; both bus utilization and total execution time dropped moderately. For programs characterized by fine-grain sharing, competitive snooping degraded performance by causing a slight increase in these metrics.

Advisor: Randy H. Katz


BibTeX citation:

@phdthesis{Eggers:CSD-89-501,
    Author = {Eggers, Susan J.},
    Title = {Simulation Analysis of Data Sharing in Shared Memory Multiprocessors},
    School = {EECS Department, University of California, Berkeley},
    Year = {1989},
    Month = {Feb},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1989/6162.html},
    Number = {UCB/CSD-89-501},
    Abstract = {This dissertation examines shared memory reference patterns in parallel programs that run on bus-based, shared memory multiprocessors. The study reveals two distinct modes of sharing behavior. In sequential sharing, a processor makes multiple, sequential writes to the words within a block, uninterrupted by accesses from other processors. Under fine-grain sharing, processors contend for these words, and the number of per-processor sequential writes is low. Whether a program exhibits sequential or fine-grain sharing affects several factors relating to multiprocessor performance: the accuracy of sharing models that predict cache coherency overhead, the cache miss ratio and bus utilization of parallel programs, and the choice of coherency protocol.

        An architecture-independent model of write sharing was developed, based on the inter-processor activity to write-shared data. The model was used to predict the relative coherency overhead of write-invalidate and write-broadcast protocols. Architecturally detailed simulations validated the model for write-broadcast. Successive refinements, incorporating architecture-dependent parameters, most importantly cache block size, produced acceptable predictions for write-invalidate. Block size was crucial for modeling write-invalidate, because the pattern of memory references within a block determines protocol performance.

        The cache and bus behavior of parallel programs running under write-invalidate protocols was evaluated over various block and cache sizes. The analysis determined the effect of shared memory accesses on cache miss ratio and bus utilization by focusing on the sharing component of these metrics. The studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of the metrics proportionally increases with cache and block size, and for some cache configurations determines both their magnitude and trend. Again, the amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit sequential sharing perform better than those whose sharing is fine-grain.

        A cross-protocol comparison provided empirical evidence of the performance loss caused by increasing block size in write-invalidate protocols and cache size in write-broadcast. It then measured the extent to which read-broadcast improved write-invalidate performance and competitive snooping helped write-broadcast. The results indicated that read-broadcast reduced the number of invalidation misses, but at a high cost in processor lockout from the cache. The surprising net effect was an increase in total execution cycles. Competitive snooping benefited only those programs that exhibited sequential sharing; both bus utilization and total execution time dropped moderately. For programs characterized by fine-grain sharing, competitive snooping degraded performance by causing a slight increase in these metrics.}
}

EndNote citation:

%0 Thesis
%A Eggers, Susan J.
%T Simulation Analysis of Data Sharing in Shared Memory Multiprocessors
%I EECS Department, University of California, Berkeley
%D 1989
%@ UCB/CSD-89-501
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1989/6162.html
%F Eggers:CSD-89-501