Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs

Gerofi, Balazs; Takagi, Masamichi; Ishikawa, Yutaka

doi:10.1007/978-3-319-14313-2_21

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8806)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

As the rate of CPU clock improvement has stalled over the last decade, increased use of parallelism in the form of multi- and many-core processors has been pursued to improve overall performance. Current high-end manycore CPUs already accommodate up to hundreds of processing cores. At the same time, these architectures come with complex on-chip networks for inter-core communication and multiple memory controllers for accessing off-chip RAM modules. Intel’s latest Many Integrated Cores (MIC) chip, also called the Xeon Phi, boasts up to 60 CPU cores (each with 4-way SMT) combined with eight memory controllers. Although the chip provides Uniform Memory Access (UMA), we find that there are substantial (as high as 60%) differences in access latencies for different memory blocks depending on which CPU core issues the request, resembling Non-Uniform Memory Access (NUMA) architectures.

Exploiting these differences, in this paper we propose a memory block latency-aware memory allocator, which assigns memory addresses to the requesting CPU cores in a way that minimizes access latencies. We then show that applying our mechanism to the A-star graph search algorithm can yield performance improvements of up to 28%, without any modification to the algorithm itself.
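The abstract does not describe the allocator's internals, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a per-core latency table has already been measured, and it hands each requesting core the free memory block with the lowest latency for that core. The class name `LatencyAwareAllocator`, the `LATENCY` table, and all latency values are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Optional, Set


class LatencyAwareAllocator:
    """Toy allocator: gives each core the free block it can reach fastest."""

    def __init__(self, latency: List[List[float]]) -> None:
        # latency[core][block] is an assumed, pre-measured access latency
        # (arbitrary units, e.g. cycles) from that core to that block.
        self.latency = latency
        self.free: Set[int] = set(range(len(latency[0])))
        self.owner: Dict[int, int] = {}

    def alloc(self, core: int) -> Optional[int]:
        """Return the free block with the lowest latency for `core`, or None."""
        if not self.free:
            return None
        # Ties are broken by block index so the result is deterministic.
        block = min(self.free, key=lambda b: (self.latency[core][b], b))
        self.free.remove(block)
        self.owner[block] = core
        return block

    def release(self, block: int) -> None:
        """Return a previously allocated block to the free pool."""
        if block in self.owner:
            del self.owner[block]
            self.free.add(block)


# Made-up latencies for 2 cores and 4 memory blocks (illustrative only).
LATENCY = [
    [100.0, 160.0, 120.0, 150.0],  # as seen from core 0
    [155.0, 105.0, 150.0, 110.0],  # as seen from core 1
]

allocator = LatencyAwareAllocator(LATENCY)
first_for_core0 = allocator.alloc(0)   # nearest free block for core 0
first_for_core1 = allocator.alloc(1)   # nearest free block for core 1
second_for_core0 = allocator.alloc(0)  # next-nearest free block for core 0
```

A greedy per-request choice like this is the simplest policy that fits the description. A real allocator would also have to deal with block granularity, contention between cores for the same low-latency blocks, and how the latency table is obtained, none of which the abstract specifies.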

Author information

Authors and Affiliations

Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Balazs Gerofi & Yutaka Ishikawa
RIKEN Advanced Institute for Computational Science, Kobe, Japan
Masamichi Takagi

Editor information

Editors and Affiliations

CRACS/INESC-TEC and FCUP, University of Porto, Rua do Campo Alegre, 1021, 4169-007, Porto, Portugal
Luís Lopes
Vilnius University, 08663, Vilnius, Lithuania
Julius Žilinskas
Inria Rennes - Bretagne Atlantique, 35042, Rennes, France
Alexandru Costan
Inria, Campus Universitaire de Beaulieu, 35042, Rennes, France
Roberto G. Cascella
MTA SZTAKI, Budapest, Hungary
Gabor Kecskemeti
Inria, LaBRI, France
Emmanuel Jeannot
University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
Mario Cannataro
University of Pisa, Italy
Laura Ricci
Faculty of Computer Science, University of Vienna, Wien, Austria
Siegfried Benkner
Universitat Politècnica de València, Spain
Salvador Petit
ISISLab - Dipartimento di Informatica, Università di Salerno, Italy
Vittorio Scarano
High Performance Computing Center Stuttgart (HLRS), University of Stuttgart, 70550, Stuttgart, Germany
José Gracia
Vienna University of Technology, 1040, Vienna, Austria
Sascha Hunold
Tennessee Tech University and Oak Ridge National Laboratory, 38505, Cookeville, TN, USA
Stephen L. Scott
RWTH Aachen University, Aachen, Germany
Stefan Lankes
Department of Informatics and Mathematics, University of Passau, Germany
Christian Lengauer
Universidad Carlos III de Madrid, 28911, Leganés, Spain
Jesús Carretero
TU München, 85747, Garching bei München, Germany
Jens Breitbart
TU Vienna, 1040, Vienna, Austria
Michael Alexander

About this paper

Cite this paper

Gerofi, B., Takagi, M., Ishikawa, Y. (2014). Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8806. Springer, Cham. https://doi.org/10.1007/978-3-319-14313-2_21

DOI: https://doi.org/10.1007/978-3-319-14313-2_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14312-5
Online ISBN: 978-3-319-14313-2
eBook Packages: Computer Science, Computer Science (R0)
