Cross-Layer Optimization to Address the Dual Challenges of Energy and Reliability

DATE Session 8.2

(This session will be held at DATE 2010, March 8-12, Dresden, Germany.)

Increasing unpredictability threatens our ability to continue scaling integrated circuits at Moore's Law rates. As the transistors, wires, and other components that make up integrated circuits become smaller, they display both greater variation (differences in behavior between devices designed to be identical) and greater vulnerability to transient and permanent faults. Conventional design techniques expend energy to tolerate this unpredictability, either by replicating circuitry or by adding safety margins to a circuit's operating voltage, clock frequency, or charge stored per bit of data. Such approaches have been effective in the past, but their costs in energy and system performance are rapidly becoming unacceptable, particularly in an environment where power consumption is often the limiting factor on integrated circuit performance and energy efficiency is a global concern.

To continue scaling and reducing energy, we can no longer assume that devices will be fabricated perfectly and identically or that circuits will operate without transient upsets. Higher layers in the system stack (architecture, firmware, OS, compilers, and applications) must cooperate to mitigate these unpredictable effects efficiently. Reliability must be a first-order concern that design automation tools optimize as a design metric along with energy, delay, area, and thermal profile. These tools must explore a larger space of optimizing transformations, including tradeoffs across the layer stack and continued optimization throughout the component's operational lifetime.

Sample multi-layer techniques include:

In the realm of memory and communication, we have a long history of successfully tolerating unpredictable effects, including fabrication variability, transient upsets, and lifetime wear, by using strategic information and multi-layer approaches that anticipate, accommodate, and suppress errors. In memory devices, error-correcting codes have been useful for correcting single-bit errors by increasing the amount of information used to store each word. This type of mitigation eliminates single-bit errors with only a 12% increase in both energy and information for a 64-bit word. Using these techniques to tolerate occasional and unpredictable deviations from intended function has reduced cost, reduced energy, and increased performance while guaranteeing robust operation in the presence of noise. Unfortunately, mitigating errors in logic is neither as simple nor as well researched as it is in memory or communication systems. This lack of understanding has led to very expensive solutions, such as triple-modular redundancy, in which the computation is performed three times so that erroneous results can be voted out, at a cost of 3x the energy of the base computation (a 200% energy margin). We believe there is ample need and opportunity to bring these kinds of efficient, cross-layer solutions to computation.
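The overhead figures in the paragraph above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the session material. It assumes the memory code is a Hamming code extended with one overall parity bit (single-error correction, double-error detection), which the text does not name. It counts only extra stored bits, not energy, and the function names are illustrative.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for an assumed extended Hamming (SECDED) code."""
    r = 0
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One additional overall parity bit adds double-error detection.
    return r + 1


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results (triple-modular redundancy)."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    k = 64
    r = secded_check_bits(k)
    # Storage overhead of the code, as a percentage of the data bits.
    print(f"{k}-bit word: {r} check bits, {100 * r / k:.1f}% extra bits")

    good = 0b1011
    upset = good ^ 0b0100  # one copy suffers a single-bit upset
    print(f"voted result: {tmr_vote(good, upset, good):#06b}")

    # Three copies of the computation: two extra copies over the base one.
    copies = 3
    print(f"TMR overhead: {100 * (copies - 1)}% of the base computation")
```

Under these assumptions a 64-bit word needs 8 check bits, or 12.5% extra storage, which is in line with the roughly 12% figure quoted above, while the voter masks one faulty copy only by paying for two additional full computations.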

This special session describes the vision and the need for new design automation. It summarizes findings and vision from an ongoing study of cross-layer reliability.

  1. Reliability Roadmap (presenter: S. Nassif, IBM)

  2. Vision (presenter: A. DeHon, University of Pennsylvania)

  3. Techniques and Examples (presenter: N. Carter, Intel)

  4. Metrics and Needs for Automating Cross-Layer Reliability Optimization (presenter: S. Mitra, Stanford)

DATE2010 (last edited 2010-01-04 20:35:56 by AndreDeHon)