Link to original content: https://unpaywall.org/10.1145/2485922.2485964

research-article

GPUWattch: enabling energy optimizations in GPGPUs

Authors:

Tayler Hetherington,

Ahmed ElTantawy,

Vijay Janapa Reddi

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

Pages 487 - 498

https://doi.org/10.1145/2485922.2485964

Published: 23 June 2013

Abstract

General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and performance per watt has emerged as a more crucial evaluation metric than peak performance. As such, GPU architects require robust tools that will enable them to quickly explore new ways to optimize GPGPUs for energy efficiency. We propose a new GPGPU power model that is configurable, capable of cycle-level calculations, and carefully validated against real hardware measurements. To achieve configurability, we use a bottom-up methodology and abstract parameters from the microarchitectural components as the model's inputs. We developed a rigorous suite of 80 microbenchmarks that we use to bound any modeling uncertainties and inaccuracies. The power model is comprehensively validated against measurements of two commercially available GPUs, and the measured error is within 9.9% and 13.4% for the two target GPUs (GTX 480 and Quadro FX5600). The model also accurately tracks the power consumption trend over time. We integrated the power model with the cycle-level simulator GPGPU-Sim and demonstrate the energy savings obtained by utilizing dynamic voltage and frequency scaling (DVFS) and clock gating. Traditional DVFS reduces GPU energy consumption by 14.4% by leveraging within-kernel runtime variations. Finer-grained SM cluster-level DVFS improves the energy savings from 6.6% to 13.6% for those benchmarks that show clustered execution behavior. We also show that clock gating inactive lanes during divergence reduces dynamic power by 11.2%.
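To make the ideas in the abstract concrete, the following is a minimal Python sketch of a generic bottom-up, activity-based power estimate. It also shows the textbook CMOS approximation that dynamic power scales with V²·f (the reason DVFS can save energy) and a mean absolute relative error of the kind used to compare a model against measurements. This is not GPUWattch's implementation: the `Component` class, the function names, and every numeric value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """A microarchitectural block with illustrative (made-up) parameters."""
    name: str
    energy_per_access_nj: float  # assumed dynamic energy per access, in nJ
    static_power_w: float        # assumed leakage power, in W


def average_power_w(components, access_counts, cycles, freq_hz):
    """Average power over an interval of `cycles` cycles at clock `freq_hz`.

    Dynamic energy is summed bottom-up from per-component access counts,
    divided by elapsed time, and added to the total static power.
    """
    seconds = cycles / freq_hz
    dynamic_j = sum(
        c.energy_per_access_nj * 1e-9 * access_counts.get(c.name, 0)
        for c in components
    )
    static_w = sum(c.static_power_w for c in components)
    return dynamic_j / seconds + static_w


def scale_dynamic_power(p_dyn_w, v_ratio, f_ratio):
    """First-order CMOS approximation: dynamic power is proportional to V^2 * f."""
    return p_dyn_w * v_ratio ** 2 * f_ratio


def mean_abs_relative_error(modeled, measured):
    """Average of |modeled - measured| / measured over paired samples."""
    return sum(abs(m, ) if False else abs(m - r) / r
               for m, r in zip(modeled, measured)) / len(measured)


if __name__ == "__main__":
    parts = [
        Component("alu", energy_per_access_nj=0.5, static_power_w=1.0),
        Component("regfile", energy_per_access_nj=0.25, static_power_w=0.5),
    ]
    counts = {"alu": 2000, "regfile": 4000}
    print(average_power_w(parts, counts, cycles=1000, freq_hz=1e9))
    print(scale_dynamic_power(2.0, v_ratio=0.9, f_ratio=0.8))
    print(mean_abs_relative_error([90.0, 110.0], [100.0, 100.0]))
```

Separating per-access energies from access counts is what makes this style of model configurable: changing an architectural parameter changes the per-component energies, while the simulator supplies the counts for each interval.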

Index Terms

GPUWattch: enabling energy optimizations in GPGPUs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
2. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Information & Contributors

Information

Published In

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

June 2013

686 pages

ISBN:9781450320795

DOI:10.1145/2485922

General Chair:
Avi Mendelson
Technion

ACM SIGARCH Computer Architecture News Volume 41, Issue 3
ISCA '13
June 2013
666 pages
ISSN:0163-5964
DOI:10.1145/2508148

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

IEEE CS

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 June 2013

Qualifiers

Research-article

Conference

ISCA'13

ISCA'13: The 40th Annual International Symposium on Computer Architecture

June 23 - 27, 2013

Tel-Aviv, Israel

Acceptance Rates

ISCA '13 Paper Acceptance Rate 56 of 288 submissions, 19%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics

Total Citations: 493

Total Downloads: 2,449

Downloads (Last 12 months): 286

Downloads (Last 6 weeks): 48

Reflects downloads up to 07 Dec 2024

Citations

Cited By

Raskind J, Babakol T, Mahmoud K, Liu Y (2024). VESTA: Power Modeling with Language Runtime Events. Proceedings of the ACM on Programming Languages 8:PLDI (621-646). DOI: 10.1145/3656402. Online publication date: 20-Jun-2024
https://dl.acm.org/doi/10.1145/3656402
Lee S, Jeong D, Choi J, Kwak J, Son S, Song J, Shin I (2024). SERENUS: Alleviating Low-Battery Anxiety Through Real-time, Accurate, and User-Friendly Energy Consumption Prediction of Mobile Applications. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (1-20). DOI: 10.1145/3654777.3676437. Online publication date: 13-Oct-2024
https://dl.acm.org/doi/10.1145/3654777.3676437
Falahati H, Sadrosadati M, Xu Q, Gómez-Luna J, Latibari B, Jeon H, Hesaabi S, Sarbazi-Azad H, Mutlu O, Annavaram M, Pedram M (2024). Cross-Core Data Sharing for Energy-Efficient GPUs. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3653019. Online publication date: 18-Mar-2024
https://doi.org/10.1145/3653019
Kaya E, Öz I (2024). Compiler-Managed Replication of CUDA Kernels for Reliable Execution of GPGPU Applications. Journal of Circuits, Systems and Computers 33:14. DOI: 10.1142/S0218126624502542. Online publication date: 18-Apr-2024
https://doi.org/10.1142/S0218126624502542
Matsuo R, Koizumi T, Irie H, Sakai S, Shioya R (2024). TURBULENCE: Complexity-Effective Out-of-Order Execution on GPU With Distance-Based ISA. IEEE Computer Architecture Letters 23:2 (175-178). DOI: 10.1109/LCA.2023.3289317. Online publication date: Jul-2024
https://doi.org/10.1109/LCA.2023.3289317
Shenoy G (2024). A Performance and Power Comparison of Contemporary GPGPU Architectures. 2024 3rd International Conference for Innovation in Technology (INOCON) (1-5). DOI: 10.1109/INOCON60754.2024.10512242. Online publication date: 1-Mar-2024
https://doi.org/10.1109/INOCON60754.2024.10512242
Hyun B, Kim T, Lee D, Rhu M (2024). Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (263-279). DOI: 10.1109/HPCA57654.2024.00029. Online publication date: 2-Mar-2024
https://doi.org/10.1109/HPCA57654.2024.00029
Guan X, Zhou H, Bao G, Li H, Zhu L, Yao J, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F (2024). PresCount: Effective Register Allocation for Bank Conflict Reduction. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (170-181). DOI: 10.1109/CGO57630.2024.10444841. Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444841
Qi S, Li Y, Pasricha S, Kim R (2023). MOELA: A Multi-Objective Evolutionary/Learning Design Space Exploration Framework for 3D Heterogeneous Manycore Platforms. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE) (1-6). DOI: 10.23919/DATE56975.2023.10137276. Online publication date: Apr-2023
https://doi.org/10.23919/DATE56975.2023.10137276
Matsuo R, Koizumi T, Irie H, Sakai S, Shioya R (2023). TURBULENCE: Complexity-effective Out-of-order Execution on GPU with Distance-based ISA. 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE) (1-2). DOI: 10.23919/DATE56975.2023.10137216. Online publication date: Apr-2023
https://doi.org/10.23919/DATE56975.2023.10137216
