Abstract
The high processing power of GPUs makes them attractive for safety-critical applications, where transient effects are a major concern and resilience must be enforced without compromising performance. Configurable softcore GPUs are a recent technology that enables detailed reliability assessment, which can guide the design of reliable GPU applications. This work investigates the reliability of the register files and the pipeline of a softcore GPU under radiation-induced faults and proposes software-based fault tolerance techniques to mitigate errors. Faults are simulated at the register transfer level in four case-study algorithms, and the Architectural Vulnerability Factor (AVF) and Mean Workload to Failure (MWTF) are evaluated across different GPU configurations. Results indicate that software-based techniques effectively reduce AVF. In terms of MWTF, results show that the best cases depend on balancing GPU configuration, application runtime, and AVF.
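The interplay between AVF, runtime, and MWTF can be illustrated with a minimal sketch. It assumes the common fault-injection estimate of AVF (failures divided by injected faults) and the usual definition of MWTF as the inverse of raw error rate × AVF × execution time; the function names and all numeric values are hypothetical and are not results from this work.

```python
def avf(failures: int, injected: int) -> float:
    """Estimate AVF as the fraction of injected faults that caused a failure."""
    if injected <= 0:
        raise ValueError("injected must be positive")
    return failures / injected


def mwtf(raw_error_rate: float, avf_value: float, exec_time: float) -> float:
    """MWTF = 1 / (raw error rate * AVF * execution time)."""
    return 1.0 / (raw_error_rate * avf_value * exec_time)


def relative_mwtf(avf_base: float, time_base: float,
                  avf_hard: float, time_hard: float) -> float:
    """MWTF of a hardened version relative to its baseline.

    The raw error rate cancels out when both versions run on the same
    hardware, so only AVF and execution time remain.
    """
    return (avf_base * time_base) / (avf_hard * time_hard)


if __name__ == "__main__":
    # Hypothetical campaign: 1000 injected faults per version.
    base = avf(failures=300, injected=1000)  # unhardened application
    hard = avf(failures=60, injected=1000)   # hardened application
    # Hardening is assumed to double the runtime (normalized to 1.0).
    gain = relative_mwtf(base, 1.0, hard, 2.0)
    print(f"baseline AVF={base:.2f}, hardened AVF={hard:.2f}, "
          f"relative MWTF={gain:.2f}x")
```

With these made-up numbers, a fivefold AVF reduction bought at twice the runtime gives only about a 2.5× MWTF gain, which is why a lower AVF alone does not determine the best configuration.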
Acknowledgements
This work has been partially supported by the European Commission through the Horizon 2020 RESCUE-ETN project under grant 722325, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) - Finance Code 001, Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), and Fundação de Amparo à pesquisa do Estado do RS (FAPERGS).
Cite this article
Goncalves, M.M., Condia, J.E.R., Reorda, M.S. et al. Evaluating low-level software-based hardening techniques for configurable GPU architectures. J Supercomput 78, 8081–8105 (2022). https://doi.org/10.1007/s11227-021-04154-z
DOI: https://doi.org/10.1007/s11227-021-04154-z