A Systematic Characterization of Application Sensitivity to Network Performance

Richard Paul Martin

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-99-1055
1999

http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1055.pdf

This thesis provides a systematic study of application sensitivity to network performance. Our aim is to investigate the impact of communication performance on real applications. Using the LogGP model as an abstract framework, we set out to understand which aspects of communication performance are most important. The focus of our investigation thus centers on quantifying the sensitivity of applications to the parameters of the LogGP model: network latency (L), software overhead (o), per-message bandwidth (the gap, g), and per-byte bandwidth (the Gap, G). We define sensitivity as the change in some application performance metric, such as run time or updates per second, as a function of the LogGP parameters. The strong association of the LogGP model with real machine components also allows us to draw architectural conclusions from the measured sensitivity curves.
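
For reference, the sketch below shows how these four parameters enter a message's cost under LogGP. It is a minimal Python rendering of the standard model, not code from the thesis; the function names are ours.

    # A back-of-the-envelope rendering of the standard LogGP cost terms:
    # L = network latency, o = send/receive software overhead,
    # g = gap between consecutive small messages (1 / per-message bandwidth),
    # G = Gap per byte (1 / per-byte bandwidth).

    def small_message_time(L, o):
        # One small message end to end: send overhead, wire latency,
        # receive overhead.
        return o + L + o

    def bulk_message_time(L, o, G, nbytes):
        # An nbytes-long message under LogGP.
        return o + (nbytes - 1) * G + L + o

    def peak_message_rate(o, g):
        # A processor can inject at most one message every max(o, g) time
        # units, so its message rate is bounded by 1 / max(o, g).
        return 1.0 / max(o, g)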

The basic methodology for measuring sensitivity is simple. First, we build a networking apparatus whose parameters are adjustable according to the LogGP model. To build such an apparatus, we start with a higher-performance system than is generally available and add controllable delays to it. Next, the apparatus must be calibrated to make sure the parameters can be accurately controlled according to the model. The calibration also yields the useful range of LogGP parameters we can consider.
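
To make "controllable delays" concrete, the sketch below shows one way a single parameter, software overhead, might be inflated: stall the sending processor for a calibrated interval around each message operation. This is our own illustrative code, not the thesis's apparatus, which is built into a real cluster's communication layer and calibrated as described above; the delay knob and wrapper here are hypothetical.

    import time

    ADDED_OVERHEAD_US = 0.0   # hypothetical knob, set once per experiment

    def stall(microseconds):
        # Busy-wait for a calibrated interval; sleep() is far too coarse
        # at the microsecond granularity such an apparatus requires.
        deadline = time.perf_counter() + microseconds * 1e-6
        while time.perf_counter() < deadline:
            pass

    def send_with_added_overhead(send_fn, *args):
        # Charge extra busy time to the sending processor so that only the
        # overhead parameter o grows; the receive path would be wrapped the
        # same way, while latency and bandwidth knobs belong in the network
        # path rather than on the host CPU.
        stall(ADDED_OVERHEAD_US)
        return send_fn(*args)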

Once we have a calibrated apparatus, we run real applications in a network with controllable performance characteristics. We vary each LogGP parameter in turn to observe the sensitivity of the application to that single parameter. Sensitive applications exhibit a high rate of "slowdown" as we scale a given parameter; insensitive applications show little or no difference in performance as we change the parameters. In addition, we can characterize the shape of the slowdown curve, because our apparatus allows us to observe plateaus and other discontinuities. In all cases, we compare our measured results against analytic models of the applications. The analytic models serve as a check on our measured data: points where the data and model deviate from one another expose areas that warrant further investigation.
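
The bookkeeping behind a sensitivity curve is equally simple; a hedged sketch follows, in which run_app and the parameter values are placeholders rather than the thesis's actual harness.

    def slowdown_curve(run_app, param_name, values, baseline_value):
        # Run once at the baseline LogGP setting, then rerun as a single
        # parameter is scaled; slowdown is perturbed time over baseline time.
        baseline_time = run_app(**{param_name: baseline_value})
        return [(v, run_app(**{param_name: v}) / baseline_time)
                for v in values]

A sensitive application shows slowdown rising steeply as the parameter is scaled, an insensitive one stays near 1.0, and plateaus or knees in the curve mark the points where measurement and analytic model are most worth comparing.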

We use three distinct application suites in order to broaden the applicability of our results. The first suite consists of parallel programs designed for low-overhead Massively Parallel Processors (MPPs) and Networks of Workstations (NOWs). The second suite is a subset of the NAS parallel benchmarks, which were designed for older MPPs. The final suite consists of the SPECsfs benchmark, which is designed to measure Network File System (NFS) performance over local area networks.

Our results show that applications display the strongest sensitivity to software overhead, slowing down by as much as a factor of 50 when overhead is increased by a factor of 20. Even lightly communicating applications can suffer a factor of 3-5 slowdown. Frequently communicating applications also display strong sensitivity to the various bandwidths, suggesting that communication phases are bursty and limited by the rate at which messages can be injected into the network. We found that simple models are able to predict sensitivity to the software overhead and bandwidth parameters for most of our applications. We also found that queuing-theoretic models of NFS servers are useful in understanding the performance of industry-published SPECsfs benchmark results.
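
As an illustration of the kind of queuing-theoretic reasoning involved, the textbook single-queue response-time formula below produces the saturation knee that SPECsfs throughput/latency curves display; the thesis's actual server model may well be more detailed than this sketch.

    def mm1_response_time(ops_per_sec, service_time):
        # Mean response time of an M/M/1 queue: it grows without bound as
        # utilization approaches 1, the knee visible in SPECsfs results.
        utilization = ops_per_sec * service_time
        if utilization >= 1.0:
            return float('inf')   # saturated: the queue grows without bound
        return service_time / (1.0 - utilization)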

The effect of added latency is qualitatively different from the effects of added overhead and reduced bandwidth. Further, latency effects are harder to predict because they depend more on application structure. For our measured applications, sensitivity to overhead and to the various bandwidths is much stronger than sensitivity to latency. We found that this result stems from programmers being quite adept at using latency-tolerating techniques such as pipelining, overlapping, batching, and caching. However, many of these techniques remain sensitive to software overhead and bandwidth. Thus, efforts to improve software overhead and per-message and per-byte bandwidth, as opposed to network transit latency, will yield the largest performance improvements across a wide class of applications with diverse architectural requirements.
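
A back-of-the-envelope LogGP comparison (ours, not the thesis's) makes the asymmetry plain: keeping many requests in flight hides all but one network latency, yet every message still charges its overhead and gap to the processor.

    def blocking_round_trips(k, L, o):
        # k request/response exchanges issued one at a time: each pays the
        # full 2o + L (the responder's own overhead is omitted for brevity).
        return k * (2 * o + L)

    def pipelined_round_trips(k, L, o, g):
        # k requests kept in flight at once: roughly one latency is exposed,
        # but the sender still spends max(2o, g) of busy time per message.
        return k * max(2 * o, g) + L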

We conclude that computer systems are complex enough to warrant our perturbation-based methodology, and we speculate on how the methodology might be applied to other areas of computer systems. We also conclude that, without either much more aggressive hardware support or the acceptance of radical new protocols, software overheads will continue to limit communication performance.

Advisor: David E. Culler


BibTeX citation:

@phdthesis{Martin:CSD-99-1055,
    Author = {Martin, Richard Paul},
    Title = {A Systematic Characterization of Application Sensitivity to Network Performance},
    School = {EECS Department, University of California, Berkeley},
    Year = {1999},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/5802.html},
    Number = {UCB/CSD-99-1055},
}

EndNote citation:

%0 Thesis
%A Martin, Richard Paul
%T A Systematic Characterization of Application Sensitivity to Network Performance
%I EECS Department, University of California, Berkeley
%D 1999
%@ UCB/CSD-99-1055
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1999/5802.html
%F Martin:CSD-99-1055