Analysis of Shared Memory Misses and Reference Patterns
Jeffrey B. Rothman and Alan Jay Smith
EECS Department, University of California, Berkeley
Technical Report No. UCB/CSD-99-1064
, 1999
http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf
Shared bus computer systems permit the relatively simple and efficient implementation of cache consistency algorithms, but the shared bus is a bottleneck which limits performance. False sharing can be an important source of unnecessary traffic for invalidation-based protocols, elimination of which can provide significant performance improvements. For many multiprocessor workloads, however, most misses are true sharing and cold start misses. Regardless of the cause of cache misses, the largest fraction of bus traffic are words transferred between caches without being accessed, which we refer to as dead sharing. <p>We establish here new methods for characterizing cache block reference patterns, and we measure how these patterns change with variation in workload and block size. Our results show that 42 percent of 64-byte cache blocks are invalidated before more than one word has been read from the block and that 58 percent of blocks that have been modified only have a single word modified before an invalidation to the block occurs. Approximately 50 percent of blocks written and subsequently read by other caches shown no use of the newly written information before the block is again invalidated. <p>In addition to our general analysis of reference patterns, we also present a detailed analysis of false sharing and dead sharing in each shared memory multiprocessor program studied. We find that the worst 10 blocks from each our traces contribute almost 50 percent of the false sharing misses and almost 20 percent of the true sharing misses (on average). A relatively simple restructuring of four of our workloads based on analysis of these 10 worst blocks leads to a 21 percent reduction in overall misses and a 15 percent reduction in execution time. Permitting the block size to vary (as could be accomplished with a sector cache) shows that bus traffic can be reduced by 88 percent (for 64-byte blocks) while also decreasing the miss ratio by 35 percent.
BibTeX citation:
@techreport{Rothman:CSD-99-1064, Author= {Rothman, Jeffrey B. and Smith, Alan Jay}, Title= {Analysis of Shared Memory Misses and Reference Patterns}, Year= {1999}, Month= {Sep}, Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/5804.html}, Number= {UCB/CSD-99-1064}, Abstract= {Shared bus computer systems permit the relatively simple and efficient implementation of cache consistency algorithms, but the shared bus is a bottleneck which limits performance. False sharing can be an important source of unnecessary traffic for invalidation-based protocols, elimination of which can provide significant performance improvements. For many multiprocessor workloads, however, most misses are true sharing and cold start misses. Regardless of the cause of cache misses, the largest fraction of bus traffic are words transferred between caches without being accessed, which we refer to as dead sharing. <p>We establish here new methods for characterizing cache block reference patterns, and we measure how these patterns change with variation in workload and block size. Our results show that 42 percent of 64-byte cache blocks are invalidated before more than one word has been read from the block and that 58 percent of blocks that have been modified only have a single word modified before an invalidation to the block occurs. Approximately 50 percent of blocks written and subsequently read by other caches shown no use of the newly written information before the block is again invalidated. <p>In addition to our general analysis of reference patterns, we also present a detailed analysis of false sharing and dead sharing in each shared memory multiprocessor program studied. We find that the worst 10 blocks from each our traces contribute almost 50 percent of the false sharing misses and almost 20 percent of the true sharing misses (on average). A relatively simple restructuring of four of our workloads based on analysis of these 10 worst blocks leads to a 21 percent reduction in overall misses and a 15 percent reduction in execution time. Permitting the block size to vary (as could be accomplished with a sector cache) shows that bus traffic can be reduced by 88 percent (for 64-byte blocks) while also decreasing the miss ratio by 35 percent.}, }
EndNote citation:
%0 Report %A Rothman, Jeffrey B. %A Smith, Alan Jay %T Analysis of Shared Memory Misses and Reference Patterns %I EECS Department, University of California, Berkeley %D 1999 %@ UCB/CSD-99-1064 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/5804.html %F Rothman:CSD-99-1064