Link to original content: https://doi.org/10.1007/978-3-031-66146-4_17

Case Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC

  • Conference paper
Architecture of Computing Systems (ARCS 2024)

Abstract

The memory systems of High-Performance Computing (HPC) platforms commonly feature non-uniform data paths to memory, i.e., they are non-uniform memory access (NUMA) architectures. Memory is divided into multiple regions, with each processing unit having its own local memory. For each processing unit, access to local memory regions is therefore faster than access to non-local regions. Architectures with hybrid memory technologies introduce further non-uniformity. This paper presents case studies of the performance potential and data placement implications of non-uniform and heterogeneous memory in HPC systems. Using the gem5 and VPSim simulation platforms, we model NUMA systems with processors based on the ARMv8 Neoverse V1 Reference Design. The gem5 simulator provides a cycle-accurate view, while VPSim offers greater simulation speed with a high-level view of the simulated system. We highlight the performance impact of design trade-offs regarding NUMA node organization and System Level Cache (SLC) group assignment, as well as Network-on-Chip (NoC) configuration. Our case studies provide essential input to a co-design process involving HPC processor architects and system integrators. A comparison of system configurations for different NoC bandwidths shows reduced NoC latency and a large improvement in memory bandwidth when NUMA control is enabled. Furthermore, a configuration with HBM2 memory organized as four NUMA nodes highlights the gap in memory bandwidth and the impact of NoC queuing latency when comparing local and remote memory accesses. On the other hand, NUMA can result in an unbalanced distribution of memory accesses and reduced SLC hit ratios, as shown with DDR4 memory organized as four NUMA nodes.
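
The local versus remote distinction described in the abstract can be illustrated with a very small first-order model. The sketch below is not taken from the paper: the function names are hypothetical, the latency values are placeholders chosen only to make the arithmetic visible, and the model ignores caches, NoC queuing, and bandwidth limits, all of which the paper's simulations account for.

```python
# Illustrative first-order NUMA model (hypothetical helper names, not from the paper).

def average_access_latency(local_fraction, local_latency_ns, remote_latency_ns):
    """Expected latency of one memory access when a given fraction of
    accesses is served by the local NUMA node and the rest by remote nodes."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    return (local_fraction * local_latency_ns
            + (1.0 - local_fraction) * remote_latency_ns)


def interleaved_local_fraction(num_nodes):
    """Fraction of accesses that land on the local node when pages are
    spread uniformly over all nodes (an idealized interleaving policy)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return 1.0 / num_nodes


if __name__ == "__main__":
    # Placeholder latencies, assumed for illustration only.
    LOCAL_NS = 100.0
    REMOTE_NS = 200.0
    NODES = 4  # the abstract discusses memory organized as four NUMA nodes

    numa_aware = average_access_latency(1.0, LOCAL_NS, REMOTE_NS)
    interleaved = average_access_latency(
        interleaved_local_fraction(NODES), LOCAL_NS, REMOTE_NS
    )
    print(f"all-local placement:   {numa_aware:.1f} ns")
    print(f"interleaved placement: {interleaved:.1f} ns")
```

Under these assumed numbers, placing all data on the local node yields the local latency, while uniform interleaving over four nodes serves only a quarter of accesses locally. A model this simple cannot capture the unbalanced access distribution or reduced SLC hit ratios that the abstract reports for some NUMA configurations.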

Acknowledgment

This research has received funding from the European High Performance Computing Joint Undertaking (JU) under Framework Partnership Agreement No 800928 (European Processor Initiative) and Specific Grant Agreement No 101036168 (EPI SGA2). The JU receives support from the European Union’s Horizon 2020 research and innovation programme and from Croatia, France, Germany, Greece, Italy, Netherlands, Portugal, Spain, Sweden, and Switzerland. The EPI-SGA2 project, PCI2022-132935, is also co-funded by MCIN/AEI/10.13039/501100011033 and by the EU NextGenerationEU/PRTR.

Author information

Corresponding author

Correspondence to Manolis Marazakis.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zaourar, L. et al. (2024). Case Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC. In: Fey, D., Stabernack, B., Lankes, S., Pacher, M., Pionteck, T. (eds) Architecture of Computing Systems. ARCS 2024. Lecture Notes in Computer Science, vol 14842. Springer, Cham. https://doi.org/10.1007/978-3-031-66146-4_17

  • DOI: https://doi.org/10.1007/978-3-031-66146-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-66145-7

  • Online ISBN: 978-3-031-66146-4

  • eBook Packages: Computer Science, Computer Science (R0)
