Auto-tuning Stencil Codes for Cache-Based Multicore Platforms

Kaushik Datta

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2009-177
December 17, 2009

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-177.pdf

As clock frequencies have tapered off and the number of cores on a chip has taken off, effectively utilizing these multicore systems has become an increasingly important challenge. However, the diversity of multicore machines in today's market compels us to tune individually for each platform. This is especially true for problems with low computational intensity, since memory latency and bandwidth are improving far more slowly than computational rates.

One such kernel is a stencil, a regular nearest-neighbor operation over the points in a structured grid. Stencils often arise from solving partial differential equations, which are found in almost every scientific discipline. In this thesis, we analyze three common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the Gauss-Seidel Red-Black Helmholtz kernel.
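
For concreteness, the following is a minimal sketch of a single Jacobi-style sweep of the 7-point stencil in C; the array names, grid size N, and coefficients alpha and beta are illustrative placeholders rather than the thesis's actual kernel.

    /* One Jacobi-style sweep of a 7-point stencil over an N^3 grid with a
     * one-point ghost zone.  "in", "out", N, alpha, and beta are
     * illustrative placeholders, not the thesis's actual kernel. */
    void stencil7(int N, const double *in, double *out,
                  double alpha, double beta)
    {
        for (int k = 1; k < N - 1; k++)
            for (int j = 1; j < N - 1; j++)
                for (int i = 1; i < N - 1; i++) {
                    int c = (k * N + j) * N + i;             /* center point */
                    out[c] = alpha * in[c]
                           + beta * (in[c - 1]     + in[c + 1]       /* x +/- 1 */
                                   + in[c - N]     + in[c + N]       /* y +/- 1 */
                                   + in[c - N * N] + in[c + N * N]); /* z +/- 1 */
                }
    }

Each output point touches only its six nearest neighbors and itself, so the kernel performs very few flops per byte of grid data moved, which is what makes it memory-bound on most platforms.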

We examine the performance of these stencil codes over a spectrum of multicore architectures, including the Intel Clovertown, Intel Nehalem, AMD Barcelona, the highly multithreaded Sun Victoria Falls, and the low-power IBM Blue Gene/P. These platforms not only have significant variations in their core architectures, but also exhibit a 32x range in available hardware threads, a 4.5x range in attained DRAM bandwidth, and a 6.3x range in peak flop rates. Clearly, designing optimal code for such a diverse set of platforms represents a serious challenge.

Unfortunately, compilers alone do not achieve satisfactory stencil code performance on this varied set of platforms. Instead, we have created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework. These optimizations hide memory latency, account for non-uniform memory access times, reduce the volume of data transferred, and take advantage of special instructions. The auto-tuner then searches over the space of optimizations, thereby allowing for much greater productivity than hand-tuning. The fully auto-tuned code runs up to 5.4x faster than a straightforward implementation and is more scalable across cores.
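
As one illustration of the kind of transformation the auto-tuner explores, the sketch below applies cache blocking to the same 7-point sweep; the block sizes BI, BJ, and BK stand in for the tuning parameters an auto-tuner would vary, and the names and structure are illustrative rather than taken from the thesis's framework.

    /* Cache-blocked variant of the same sweep.  The block sizes BI, BJ,
     * and BK are the sort of parameters an auto-tuner sweeps over; the
     * names and structure are illustrative, not the thesis's framework. */
    void stencil7_blocked(int N, const double *in, double *out,
                          double alpha, double beta,
                          int BI, int BJ, int BK)
    {
        for (int kk = 1; kk < N - 1; kk += BK)
            for (int jj = 1; jj < N - 1; jj += BJ)
                for (int ii = 1; ii < N - 1; ii += BI)
                    /* sweep one cache-sized block of the grid */
                    for (int k = kk; k < kk + BK && k < N - 1; k++)
                        for (int j = jj; j < jj + BJ && j < N - 1; j++)
                            for (int i = ii; i < ii + BI && i < N - 1; i++) {
                                int c = (k * N + j) * N + i;
                                out[c] = alpha * in[c]
                                       + beta * (in[c - 1]     + in[c + 1]
                                               + in[c - N]     + in[c + N]
                                               + in[c - N * N] + in[c + N * N]);
                            }
    }

Each candidate (BI, BJ, BK) trades cache reuse against loop overhead, and the best choice differs from one platform to the next, which is exactly what motivates automated search over the optimization space.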

By using performance models to identify performance limits, we determined that our auto-tuner can achieve over 95% of the attainable performance for all three stencils in our study. This demonstrates that auto-tuning is an important technique for fully exploiting available multicore resources.
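
As a hedged illustration of how such a performance model bounds attainable performance, the snippet below computes a Roofline-style limit, min(peak compute rate, memory bandwidth times arithmetic intensity); all of the numbers are made up for the example and are not measurements from the thesis.

    /* Roofline-style bound: delivered flop rate is capped either by the
     * peak compute rate or by memory bandwidth times arithmetic intensity
     * (flops per byte of DRAM traffic).  All numbers are illustrative. */
    #include <stdio.h>

    static double roofline_gflops(double peak_gflops, double bw_gbs,
                                  double flops_per_byte)
    {
        double memory_bound = bw_gbs * flops_per_byte;
        return peak_gflops < memory_bound ? peak_gflops : memory_bound;
    }

    int main(void)
    {
        /* e.g., a stencil performing roughly 8 flops per 24 bytes of DRAM
         * traffic (an 8-byte read, write, and write-allocate per point) on
         * a hypothetical machine with 74 GFlop/s peak and 21 GB/s. */
        printf("attainable: %.1f GFlop/s\n",
               roofline_gflops(74.0, 21.0, 8.0 / 24.0));
        return 0;
    }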

Advisor: Katherine A. Yelick


BibTeX citation:

@phdthesis{Datta:EECS-2009-177,
    Author = {Datta, Kaushik},
    Title = {Auto-tuning Stencil Codes for Cache-Based Multicore Platforms},
    School = {EECS Department, University of California, Berkeley},
    Year = {2009},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-177.html},
    Number = {UCB/EECS-2009-177},
    Abstract = {As clock frequencies have tapered off and the number of cores on a chip has taken off, the challenge of effectively utilizing these multicore systems has become increasingly important.  However, the diversity of multicore machines in today's market compels us to individually tune for each platform.  This is especially true for problems with low computational intensity, since the improvements in memory latency and bandwidth are much slower than those of computational rates.

One such kernel is a stencil, a regular nearest neighbor operation over the points in a structured grid.  Stencils often arise from solving partial differential equations, which are found in almost every scientific discipline.  In this thesis, we analyze three common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the Gauss-Seidel Red-Black Helmholtz kernel.

We examine the performance of these stencil codes over a spectrum of multicore architectures, including the Intel Clovertown, Intel Nehalem, AMD Barcelona, the highly-multithreaded Sun Victoria Falls, and the low power IBM Blue~Gene/P.  These platforms not only have significant variations in their core architectures, but also exhibit a 32x range in available hardware threads, a 4.5x range in attained DRAM bandwidth, and a 6.3x range in peak flop rates.  Clearly, designing optimal code for such a diverse set of platforms represents a serious challenge.

Unfortunately, compilers alone do not achieve satisfactory stencil code performance on this varied set of platforms.  Instead, we have created an automatic stencil code tuner, or \emph{auto-tuner}, that incorporates several optimizations into a single software framework.  These optimizations hide memory latency, account for non-uniform memory access times, reduce the volume of data transferred, and take advantage of special instructions.  The auto-tuner then searches over the space of optimizations, thereby allowing for much greater productivity than hand-tuning.  The fully auto-tuned code runs up to 5.4x faster than a straightforward implementation and is more scalable across cores.

By using performance models to identify performance limits, we determined that our auto-tuner can achieve over 95% of the attainable performance for all three stencils in our study.  This demonstrates that auto-tuning is an important technique for fully exploiting available multicore resources.}
}

EndNote citation:

%0 Thesis
%A Datta, Kaushik
%T Auto-tuning Stencil Codes for Cache-Based Multicore Platforms
%I EECS Department, University of California, Berkeley
%D 2009
%8 December 17
%@ UCB/EECS-2009-177
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-177.html
%F Datta:EECS-2009-177