Case Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC

Zaourar, Lilia; Benazouz, Mohamed; Mouhagir, Ayoub; Falquez, Carlos; Portero, Antoni; Ho, Nam; Suarez, Estela; Petrakis, Polydoros; Marazakis, Manolis; Sgherzi, Francesco; Fernandez, Ivan; Dolbeau, Romain; Pleiter, Dirk

doi:10.1007/978-3-031-66146-4_17

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14842)

Included in the following conference series:

International Conference on Architecture of Computing Systems

Abstract

High-Performance Computing (HPC) systems commonly feature non-uniform data paths to memory, i.e. they are non-uniform memory access (NUMA) architectures. Memory is divided into multiple regions, and each processing unit has its own local memory. As a result, each processing unit accesses its local memory regions faster than non-local regions. Architectures with hybrid memory technologies introduce further non-uniformity. This paper presents case studies of the performance potential and data placement implications of non-uniform and heterogeneous memory in HPC systems. Using the gem5 and VPSim simulation platforms, we model NUMA systems with processors based on the ARMv8 Neoverse V1 Reference Design. The gem5 simulator provides a cycle-accurate view, while VPSim offers greater simulation speed with a higher-level view of the simulated system. We highlight the performance impact of design trade-offs regarding NUMA node organization and System Level Cache (SLC) group assignment, as well as Network-on-Chip (NoC) configuration. Our case studies provide essential input to a co-design process involving HPC processor architects and system integrators. A comparison of system configurations for different NoC bandwidths shows lower NoC latency and a large memory bandwidth improvement when NUMA control is enabled. Furthermore, a configuration with HBM2 memory organized as four NUMA nodes highlights the gap in memory bandwidth and the impact of NoC queuing latency when comparing local and remote memory accesses. On the other hand, NUMA can result in an unbalanced distribution of memory accesses and reduced SLC hit ratios, as shown with DDR4 memory organized as four NUMA nodes.
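The local versus remote distinction described in the abstract can be illustrated with a first-order model in which average memory latency is a weighted mix of local and remote accesses. This is a minimal sketch, not the paper's methodology: the function name and the two latency values are hypothetical and chosen only for illustration, and the model ignores NoC queuing and SLC effects, which the paper studies through simulation.

```python
def effective_latency_ns(local_fraction, local_ns, remote_ns):
    """Average latency when `local_fraction` of accesses go to the local NUMA node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    # Weighted mix: local accesses pay local_ns, all others pay remote_ns.
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies for illustration only (not taken from the paper).
LOCAL_NS = 100.0
REMOTE_NS = 200.0

if __name__ == "__main__":
    # 1.00: all data placed locally; 0.25: pages spread evenly over four nodes.
    for frac in (1.0, 0.75, 0.25):
        latency = effective_latency_ns(frac, LOCAL_NS, REMOTE_NS)
        print(f"{frac:.2f} local -> {latency:.1f} ns")
```

Under these assumed numbers, spreading data evenly over four NUMA nodes leaves only a quarter of accesses local, so the modeled average latency moves most of the way toward the remote figure. This is why data placement matters on such systems.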

Acknowledgment

This research has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. The EPI-SGA2 project, PCI2022-132935, is also co-funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.

Author information

Authors and Affiliations

Université Paris-Saclay, CEA, List, 91120, Palaiseau, France
Lilia Zaourar, Mohamed Benazouz & Ayoub Mouhagir
Jülich Supercomputing Centre, Institute for Advanced Simulation, Forschungszentrum Jülich GmbH, Jülich, Germany
Carlos Falquez, Antoni Portero, Nam Ho & Estela Suarez
Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH), Heraklion, Greece
Polydoros Petrakis & Manolis Marazakis
Barcelona Supercomputing Center (BSC), Barcelona, Spain
Francesco Sgherzi & Ivan Fernandez
SiPearl, Rennes, France
Romain Dolbeau
KTH Royal Institute of Technology, Stockholm, Sweden
Dirk Pleiter

Corresponding author

Correspondence to Manolis Marazakis.

Editor information

Editors and Affiliations

Department Informatik 3 - Lehrstuhl Rechnerarchitektur, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Dietmar Fey
AESS, University of Potsdam, Potsdam, Germany
Benno Stabernack
Institute for Automation of Complex Power Systems E.ON Energy Research Center, RWTH Aachen University, Aachen, Germany
Stefan Lankes
Goethe-Universität Frankfurt am Main Institut für Informatik, Frankfurt, Germany
Mathias Pacher
Faculty of Electrical Engineering and Information Technology, Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany
Thilo Pionteck

About this paper

Cite this paper

Zaourar, L. et al. (2024). Case Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC. In: Fey, D., Stabernack, B., Lankes, S., Pacher, M., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2024. Lecture Notes in Computer Science, vol 14842. Springer, Cham. https://doi.org/10.1007/978-3-031-66146-4_17

DOI: https://doi.org/10.1007/978-3-031-66146-4_17
Published: 01 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-66145-7
Online ISBN: 978-3-031-66146-4
eBook Packages: Computer Science, Computer Science (R0)