Architectural Approaches to Design Reliable and Energy-Efficient GPUs

Date

2016-05

Abstract

Modern graphics processing units (GPUs) support thousands of concurrent threads and provide high computational throughput, which makes them popular platforms for general-purpose high-performance computing (HPC) applications. However, this raises reliability and energy-efficiency challenges in GPU architecture design. Originally designed for graphics applications with relaxed requirements on execution correctness, GPUs lack error detection and fault tolerance features. In contrast, HPC programs have rigorous demands on execution correctness, which poses serious reliability challenges for general-purpose computing on GPUs (GPGPUs). In addition, GPUs consume a large amount of energy to achieve their high computing power. The peak power consumption of a high-end GPU is more than twice that of its CPU counterparts, and the energy efficiency of GPUs has failed to grow as fast as their performance.

In this dissertation, we introduce several architectural approaches to designing reliable and energy-efficient GPUs. We first propose several opportunistic techniques that recycle the idle time of streaming processors for soft-error detection and obtain good fault coverage with negligible performance degradation. We further propose to leverage resistive memory to enhance the soft-error robustness and reduce the power consumption of the GPU register file. We then explore how to mitigate the susceptibility of the GPU register file to process variations. The proposed techniques significantly improve GPU performance under process variations. After that, we propose an effective and low-cost mechanism to maintain register file reliability with negligible performance loss under process variations and low supply voltages, which enables substantial energy savings via aggressive supply voltage reduction. Finally, we propose an energy-efficient GPU L2 cache design that leverages locality similarity to reduce L2 energy consumption with negligible performance degradation. Overall, these techniques efficiently address the reliability and energy-efficiency challenges in GPU architectures.

Keywords

GPU, Reliability, Energy Efficiency

Citation

Portions of this document appear in: Tan, Jingweijia, and Xin Fu. "RISE: Improving the streaming processors reliability against soft errors in GPGPUs." In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, pp. 191-200. 2012. And in: Tan, Jingweijia, Zhi Li, and Xin Fu. "Soft-error reliability and power co-optimization for GPGPUs register file using resistive memory." In 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 369-374. IEEE, 2015. And in: Tan, Jingweijia, and Xin Fu. "Mitigating the susceptibility of GPGPUs register file to process variations." In 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 969-978. IEEE, 2015.