W03.4.3 Impact of Optimizations on the Reliability of DNNs on GPUs and FPGAs
DNN optimizations, such as compiler passes and reduced precision, are typically chosen to improve throughput, latency, or resource usage. However, they also change how software is mapped onto hardware, which can affect how radiation-induced faults in hardware propagate and how often the application fails. As a result, performance and reliability are linked through toolchain choices, and different optimization settings can lead to significant differences in failure rate. This presentation explores the effect of common optimizations on DNNs running on GPUs and FPGAs in the presence of radiation-induced faults. On both platforms, changing compiler optimization settings, numerical precision, and reuse parameters can change the failure rate and the criticality of fault outcomes for a given DNN configuration. Even when an optimization increases the raw failure rate, performance gains can still yield configurations that produce more correct results over time. Overall, assessing DNN-based systems requires considering both reliability and performance.
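The trade-off between raw failure rate and performance can be illustrated with a small calculation. The sketch below is not taken from the presentation: it assumes failures arrive at a constant rate per hour of exposure and that every execution completed between failures is correct. The function name and all numbers are hypothetical and chosen only to show the arithmetic.

```python
def executions_between_failures(failures_per_hour: float, exec_time_s: float) -> float:
    """Expected number of executions completed between two failures.

    Assumption (not from the source): failures occur at a constant rate,
    so the mean time between failures is 3600 / failures_per_hour seconds.
    """
    mean_time_between_failures_s = 3600.0 / failures_per_hour
    return mean_time_between_failures_s / exec_time_s


# Hypothetical baseline build: 1 failure per hour, 2 s per inference.
baseline = executions_between_failures(failures_per_hour=1.0, exec_time_s=2.0)

# Hypothetical optimized build: 50% higher failure rate, but twice as fast.
optimized = executions_between_failures(failures_per_hour=1.5, exec_time_s=1.0)

print(f"baseline:  {baseline:.0f} executions between failures")
print(f"optimized: {optimized:.0f} executions between failures")
```

With these illustrative numbers the optimized configuration fails more often per hour yet completes more executions between failures, because its speedup outweighs the increase in failure rate. If the speedup were smaller than the failure-rate increase, the comparison would reverse.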
