A System-level Approach to Fault and Variation Resilience in Multi-core Die

Yury Markovskiy

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2009-128
September 5, 2009

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-128.pdf

With shrinking transistors and growing parametric variability, statically managing die yield is no longer possible. Design for Manufacturing (DFM) techniques rely on increasingly large guard-bands that waste area, power, and performance, impeding Moore's Law of semiconductor device scaling. Process, Voltage, and Temperature (PVT) variations can turn a nominally homogeneous many-core die into a set of cores with heterogeneous performance.

A Network-on-Chip (NoC) provides an effective and scalable way to integrate hundreds of heterogeneous cores without forcing each to give up its own PVT-induced operating point for the chip-wide common worst case. As with asynchronous logic, a NoC of regular, redundant, many-CLK/Vdd cores can deliver average-case rather than worst-case system performance, with greater power efficiency and fault tolerance than its globally synchronous monolithic counterparts. This work shows that Voltage-Frequency Island (VFI) architectures are also the key to tolerating and compensating for PVT variations.

The VFI advantages cannot be realized without run-time task-to-core mapping and adaptive network routing that optimally match application resource requirements with heterogeneous cores and communication fabric. These systematic techniques are more effective at mitigating a variety of faults and variations than layout and circuit DFM. Most importantly, the gains from these techniques can be translated into die yield improvements and smaller DFM guard-bands.

This work investigates core sparing and network routing. The developed models demonstrate that core sparing reduces the die cost asymptotically from O(A^3) to O(A^(1/2)), and that it is more cost efficient than the larger design guard-bands of layout and circuit redundancy. The analysis favors a greater number of smaller, unreliable cores over fewer, larger, reliable cores for a fixed die area. This points to the limitations, and ultimately the futility, of DFM techniques in future semiconductor process generations.
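
For intuition on why spare cores help yield, the sketch below uses a textbook Poisson defect model with k-of-n redundancy. This is an illustrative assumption, not the cost model developed in the report, and the names `defect_density`, `core_area`, `n_cores`, and `n_required` are hypothetical.

```python
from math import comb, exp


def core_yield(defect_density, core_area):
    """Probability that one core has zero defects (Poisson model, an assumption)."""
    return exp(-defect_density * core_area)


def die_yield_with_sparing(n_cores, n_required, y_core):
    """Probability that at least n_required of n_cores independent cores work."""
    return sum(
        comb(n_cores, good) * y_core**good * (1.0 - y_core) ** (n_cores - good)
        for good in range(n_required, n_cores + 1)
    )


if __name__ == "__main__":
    # Hypothetical numbers: a fixed die area split into 16 equal cores.
    die_area = 4.0
    defect_density = 0.5
    n_cores = 16
    y = core_yield(defect_density, die_area / n_cores)

    no_spares = die_yield_with_sparing(n_cores, n_cores, y)
    two_spares = die_yield_with_sparing(n_cores, n_cores - 2, y)
    print(f"per-core yield:         {y:.4f}")
    print(f"yield with no spares:   {no_spares:.4f}")
    print(f"yield with two spares:  {two_spares:.4f}")
```

Under these assumptions, requiring every core to work gives the same yield as a single monolithic die of the same area, while tolerating even a few bad cores raises it. The sketch ignores the area and performance cost of the spares themselves.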

Adaptive network routing enables core sparing. More critically, it simultaneously combats the two sources of network load imbalance: on-die performance heterogeneity from PVT variations and application communication topology. With stochastic PVT variations, the developed Minimal Adaptive Total Congestion (MATC) router increases the expected network saturation bandwidth by 7-23% and reduces its variance by 2-10x as compared to the Dimension Order router. With systematic PVT variations, the improvements are 5-35%. These gains of the adaptive router can compensate for degradation due to performance variations and can thus be used to reduce design guard-bands.
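
The contrast between dimension-order and minimal adaptive routing can be sketched on a 2D mesh as follows. This is a generic illustration and not the MATC router itself. The per-link `congestion` dictionary and all function names are hypothetical, and the 2D mesh topology is an assumption.

```python
def productive_hops(src, dst):
    """Neighbors of src that are one hop closer to dst on a 2D mesh (X hop first)."""
    (x, y), (dst_x, dst_y) = src, dst
    hops = []
    if x != dst_x:
        hops.append((x + (1 if dst_x > x else -1), y))
    if y != dst_y:
        hops.append((x, y + (1 if dst_y > y else -1)))
    return hops


def dimension_order_next(src, dst):
    """Dimension-order (XY) routing: finish the X dimension before moving in Y."""
    hops = productive_hops(src, dst)
    return hops[0] if hops else None


def minimal_adaptive_next(src, dst, congestion):
    """Pick the least congested productive hop; ties fall back to the X hop.

    congestion maps a (from_node, to_node) link to a load estimate (hypothetical
    metric); links missing from the map are treated as idle.
    """
    hops = productive_hops(src, dst)
    if not hops:
        return None
    return min(hops, key=lambda hop: congestion.get((src, hop), 0))


if __name__ == "__main__":
    load = {((0, 0), (1, 0)): 5, ((0, 0), (0, 1)): 1}
    print(dimension_order_next((0, 0), (2, 2)))         # always takes the X hop
    print(minimal_adaptive_next((0, 0), (2, 2), load))  # avoids the loaded X link
```

Both functions only ever choose hops that shorten the remaining distance, so path length is unchanged. The adaptive variant simply uses the freedom among shortest paths to steer around loaded or slow links.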

By treating cores as units of fault and variation tolerance, these systematic techniques provide a simple and consistent way to deal with static and dynamic performance variations and faults. These techniques are more effective than isolated DFM solutions. Rather than fighting and minimizing the on-die parametric variations, our approach takes advantage of the platform heterogeneity and manages its net system performance impact.

Advisor: John Wawrzynek


BibTeX citation:

@phdthesis{Markovskiy:EECS-2009-128,
    Author = {Markovskiy, Yury},
    Title = {A System-level Approach to Fault and Variation Resilience in Multi-core Die},
    School = {EECS Department, University of California, Berkeley},
    Year = {2009},
    Month = {Sep},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-128.html},
    Number = {UCB/EECS-2009-128},
    Abstract = {With shrinking transistors and growing parametric variability, statically managing die yield is no longer possible. Design for Manufacturing (DFM) techniques rely on increasingly large guard-bands that waste area, power, and performance, impeding Moore's Law of semiconductor device scaling. Process, Voltage, and Temperature (PVT) variations can turn a nominally homogeneous many-core die into a set of cores with heterogeneous performance.

A Network-on-Chip (NoC) provides an effective and scalable way to integrate hundreds of heterogeneous cores without forcing each to give up its own PVT-induced operating point for the chip-wide common worst case. As with asynchronous logic, a NoC of regular, redundant, many-CLK/Vdd cores can deliver average-case rather than worst-case system performance, with greater power efficiency and fault tolerance than its globally synchronous monolithic counterparts. This work shows that Voltage-Frequency Island (VFI) architectures are also the key to tolerating and compensating for PVT variations.

The VFI advantages cannot be realized without run-time task-to-core mapping and adaptive network routing that optimally match application resource requirements with heterogeneous cores and communication fabric. These systematic techniques are more effective at mitigating a variety of faults and variations than layout and circuit DFM. Most importantly, the gains from these techniques can be translated into die yield improvements and smaller DFM guard-bands.


This work investigates core sparing and network routing. The developed models demonstrate that core sparing reduces the die cost asymptotically from $O(A^3)$ to $O(A^{1/2})$, and that it is more cost efficient than the larger design guard-bands of layout and circuit redundancy. The analysis favors a greater number of smaller, unreliable cores over fewer, larger, reliable cores for a fixed die area. This points to the limitations, and ultimately the futility, of DFM techniques in future semiconductor process generations.

Adaptive network routing enables core sparing. More critically, it simultaneously combats the two sources of network load imbalance: on-die performance heterogeneity from PVT variations and application communication topology. With stochastic PVT variations, the developed Minimal Adaptive Total Congestion (MATC) router increases the expected network saturation bandwidth by 7--23% and reduces its variance by 2--10x as compared to the Dimension Order router. With systematic PVT variations, the improvements are 5--35%. These gains of the adaptive router can compensate for degradation due to performance variations and can thus be used to reduce design guard-bands.

By treating cores as units of fault and variation tolerance, these systematic techniques provide a simple and consistent way to deal with static and dynamic performance variations and faults. These techniques are more effective than isolated DFM solutions. Rather than fighting and minimizing the on-die parametric variations, our approach takes advantage of the platform heterogeneity and manages its net system performance impact.}
}

EndNote citation:

%0 Thesis
%A Markovskiy, Yury
%T A System-level Approach to Fault and Variation Resilience in Multi-core Die
%I EECS Department, University of California, Berkeley
%D 2009
%8 September 5
%@ UCB/EECS-2009-128
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-128.html
%F Markovskiy:EECS-2009-128