Abstract
The memory systems of High-Performance Computing (HPC) platforms commonly feature non-uniform data paths to memory, i.e. they are non-uniform memory access (NUMA) architectures. Memory is divided into multiple regions, and each processing unit has its own local region. For each processing unit, access to its local memory region is therefore faster than access to non-local regions. Architectures with hybrid memory technologies introduce further non-uniformity. This paper presents case studies of the performance potential and data placement implications of non-uniform and heterogeneous memory in HPC systems. Using the gem5 and VPSim simulation platforms, we model NUMA systems with processors based on the ARMv8 Neoverse V1 Reference Design. The gem5 simulator provides a cycle-accurate view, while VPSim offers greater simulation speed with a high-level view of the simulated system. We highlight the performance impact of design trade-offs regarding NUMA node organization, System Level Cache (SLC) group assignment, and Network-on-Chip (NoC) configuration. Our case studies provide essential input to a co-design process involving HPC processor architects and system integrators. A comparison of system configurations for different NoC bandwidths shows reduced NoC latency and a substantial improvement in memory bandwidth when NUMA control is enabled. Furthermore, a configuration with HBM2 memory organized as four NUMA nodes highlights the gap in memory bandwidth and the impact on NoC queuing latency between local and remote memory accesses. On the other hand, NUMA can result in an unbalanced distribution of memory accesses and reduced SLC hit ratios, as shown with DDR4 memory organized as four NUMA nodes.
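As a rough illustration of the local vs. remote gap described above, the following minimal sketch computes a mean access latency as a weighted mix of local and remote accesses. The function names and the idea of even page interleaving are assumptions made here for illustration; they are not taken from the paper, its simulators, or its results.

```python
def mean_access_latency(local_fraction, local_ns, remote_ns):
    """Mean memory latency for a given share of node-local accesses.

    All arguments are hypothetical inputs chosen by the caller; none of
    them come from the paper's measurements.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    # Weighted average of the two access classes.
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


def interleaved_local_fraction(num_nodes):
    """Share of local accesses if pages are spread evenly over all nodes.

    Assumes uniform interleaving and uniform access patterns, which is a
    simplification.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return 1.0 / num_nodes
```

Under these assumptions, a four-node system with evenly interleaved pages serves only a quarter of its accesses locally, so the mean latency sits much closer to the remote figure than to the local one. NUMA-aware placement aims to raise the local share.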
Acknowledgment
This research has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. The EPI-SGA2 project, PCI2022-132935, is also co-funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zaourar, L. et al. (2024). Case Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC. In: Fey, D., Stabernack, B., Lankes, S., Pacher, M., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2024. Lecture Notes in Computer Science, vol 14842. Springer, Cham. https://doi.org/10.1007/978-3-031-66146-4_17
DOI: https://doi.org/10.1007/978-3-031-66146-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-66145-7
Online ISBN: 978-3-031-66146-4
eBook Packages: Computer Science, Computer Science (R0)