iBet uBet web content aggregator. Adding the entire web to your favor.

Link to original content: https://doi.org/10.1145/2980024.2872400

CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization: ACM SIGARCH Computer Architecture News: Vol 44, No 2

research-article

Public Access

CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization

Authors:

Tanmay Gangwani,

Josep TorrellasAuthors Info & Claims

ACM SIGARCH Computer Architecture News, Volume 44, Issue 2

Pages 789 - 804

https://doi.org/10.1145/2980024.2872400

Published: 25 March 2016 Publication History

Abstract

In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade significantly. Contending writes get serialized, creating a non-scalable condition. Past proposals that build hardware queues of synchronizing processors do not fundamentally solve this problem---at best, they help to efficiently serialize the contending writes.

This paper proposes a novel architecture that breaks the serialization of hardware queues and enables the queued processors to perform lock-free synchronization in parallel. The architecture, called CASPAR, is able to (1) execute the CASes in the queued-up processors in parallel through eager forwarding of expected values, and (2) validate the CASes in parallel and dequeue groups of processors at a time. The result is highly-scalable synchronization. We evaluate CASPAR with simulations of a 64-core chip. Compared to existing proposals with hardware queues, CASPAR improves the throughput of kernels by 32% on average, and reduces the execution time of the sections considered in lock-free versions of applications by 47% on average. This makes these sections 2.5x faster than in the original applications.

References

[1]

The Go Programming Language. http://golang.org, 2014.

[2]

]LFapp2NetBSD producer/consumer queue. ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-current/src/sys/kern/subr_pcq.c, 2014.

[3]

8: High Performance C+ FIX Framework. http://fix8.org, 2014.

[4]

MySQL Concurrent Allocator. https://github.com/twitter/mysql/blob/master/mysys/lf_alloc-pin.c, 2014.

[5]

Y. Afek, G. Korland, and E. Yanovsky. Quasi-Linearizability: Relaxed Consistency for Improved Concurrency. In phProceedings of the 14th International Conference On Principles Of Distributed Systems (OPODIS 2010), volume 6490 of LNCS, pages 395--410. 2010.

[6]

D. Alistarh, J. Kopinsky, J. Li, and N. Shavit. The SprayList: A Scalable Relaxed Priority Queue. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 11--20, 2015.

Digital Library

[7]

R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. In Proceedings of the 4th International Conference on Supercomputing, ICS '90, pages 1--6, 1990.

Digital Library

[8]

U. Aydonat and T. S. Abdelrahman. Hardware Support for Relaxed Concurrency Control in Transactional Memory. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '10, pages 15--26, 2010.

Digital Library

[9]

S. Bansal and D. S. Modha. CAR: Clock with Adaptive Replacement. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, FAST '04, pages 187--200, 2004.

[10]

G. Blake, R. G. Dreslinski, and T. Mudge. Proactive Transaction Scheduling for Contention Management. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '09, pages 156--167, 2009.

Digital Library

[11]

C. Blundell, A. Raghavan, and M. M. Martin. RETCON: Transactional Repair Without Replay. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 258--269, 2010.

Digital Library

[12]

S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. OpLog: a library for scaling update-heavy data structures. Technical Report MIT-CSAIL-TR-2014-019, MIT Computer Science and Artificial Intelligence Laboratory, September 2014.

[13]

J. D. Brouer. Qdisc lockless FIFO. In phNetfilter Workshop, 2014.

[14]

T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-core Simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 52:1--52:12, 2011.

Digital Library

[15]

C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's Memory-optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1243--1254, 2013.

Digital Library

[16]

N. Diegues, P. Romano, and L. Rodrigues. Virtues and Limitations of Commodity Hardware Transactional Memory. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 3--14, 2014.

Digital Library

[17]

A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124--131, 2009.

Digital Library

[18]

K. Fraser. Practical lock-freedom. PhD thesis, University of Cambridge, Computer Laboratory, University of Cambridge, Computer Laboratory, February 2004.

[19]

L. Gidra, G. Thomas, J. Sopena, and M. Shapiro. A Study of the Scalability of Stop-the-world Garbage Collectors on Multicores. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 229--240, 2013.

Digital Library

[20]

J. R. Goodman, M. K. Vernon, and P. J. Woest. Efficient Synchronization Primitives for Large-scale Cache-coherent Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '89, pages 64--75, 1989.

Digital Library

[21]

A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer -- Designing an MIMD, Shared Memory Parallel Machine. In IEEE Transactions on Computers, February 1983.

[22]

A. Haas, M. Lippautz, T. A. Henzinger, H. Payer, A. Sokolova, C. M. Kirsch, and A. Sezgin. Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation. In Proceedings of the ACM International Conference on Computing Frontiers, CF '13, pages 17:1--17:9, 2013.

Digital Library

[23]

D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. In Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2004, pages 206--215, 2004.

Digital Library

[24]

M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13: 124--149, January 1991.

[25]

M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.

Digital Library

[26]

M. Hoffman, O. Shalev, and N. Shavit. The baskets queue. In Proceedings of the 11th International Conference on Principles of Distributed Systems, OPODIS'07, pages 401--414, 2007.

Digital Library

[27]

Intel. Intel Xeon Phi Coprocessor. https://software.intel.com/en-us/mic-developer, 2013.

[28]

S. A. R. Jafri, G. Voskuilen, and T. N. Vijaykumar. Wait-n-GoTM: Improving HTM Performance by Serializing Cyclic Dependencies. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 521--534, 2013.

Digital Library

[29]

E. H. Jensen, G. W. Hagensen, and J. M. Broughton. A New Approach to Exclusive Data Access in Shared Memory Multiprocessors. Technical Report UCRL-97663, Lawrence Livermore National Laboratory, 1987.

[30]

R. Jones, A. Hosking, and E. Moss. The Garbage Collection Handbook: The Art of Automatic Memory Management. Chapman & Hall/CRC, 1st edition, 2011.

[31]

H. Jordan. Performance Measurements on HEP A Pipelined MIMD Computer. In International Symposium on Computer Architecture (ISCA), June 1983.

[32]

A. Kagi, D. Burger, and J. R. Goodman. Efficient Synchronization: Let Them Eat QOLB. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 170--180, 1997.

Digital Library

[33]

C. Kirsch, H. Payer, H. Röck, and A. Sokolova. Performance, Scalability, and Semantics of Concurrent FIFO Queues. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP '12), volume 7439 of phLNCS, pages 273--287. 2012.

Digital Library

[34]

C. M. Kirsch, M. Lippautz, and H. Payer. Fast and Scalable, Lock-Free k-FIFO Queues. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pages 241--252, 2013.

Digital Library

[35]

A. Kivity, D. Laor, G. Costa, P. Enberg, N. Har'El, D. Marti, and V. Zolotarov. OSv-Optimizing the Operating System for Virtual Machines. In Proceedings of the 2014 USENIX Annual Technical Conference, ATC '14, pages 61--72, June 2014.

[36]

A. Kogan and E. Petrank. Wait-free queues with multiple enqueuers and dequeuers. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 223--234, 2011.

Digital Library

[37]

D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas. Data Forwarding in Scalable Shared-memory Multiprocessors. In Proceedings of the 9th International Conference on Supercomputing, ICS '95, pages 255--264, 1995.

Digital Library

[38]

D. Kuck, E. Davidson, D. Lawrie, A. Sameh, C. Q. Zhu, A. Veidenbaum, J. Konicek, P. Yew, K. Gallivan, W. Jalby, H. Wijshoff, R. Bramley, U. M. Yang, P. Emrath, D. Padua, R. Eigenmann, J. Hoeflinger, G. Jaxon, Z. Li, T. Murphy, and J. Andrews. The Cedar System and an Initial Performance Study. In Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA '93, pages 213--223, 1993.

Digital Library

[39]

M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 211--222, 2007.

Digital Library

[40]

E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. In Proceedings of the 18th International Symposium on Distributed Computing (DISC 2004), volume 3274 of phLNCS, pages 117--131, 2004.

[41]

P.-A. Larson and M. Krishnan. Memory Allocation for Long-running Server Applications. In Proceedings of the 1st International Symposium on Memory Management, ISMM '98, pages 176--185, 1998.

Digital Library

[42]

J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 241--251, 1997.

Digital Library

[43]

D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. In IEEE Computer, March 1992.

Digital Library

[44]

I. Lotan and N. Shavit. Skiplist-Based Concurrent Priority Queues. In Proceedings of the 14th International Parallel and Distributed Processing Symposium, IPDPS '00, pages 263--268, 2000.

[45]

N. D. Matsakis and F. S. Klock, II. The Rust Language. In Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Technology, HILT '14, pages 103--104, 2014.

Digital Library

[46]

M. M. Michael. Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. IEEE Transactions on Parallel and Distributed System, 15 (6): 491--504, June 2004.

Digital Library

[47]

M. M. Michael. Scalable lock-free dynamic memory allocation. In Proceedings of the 2004 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '04, pages 35--46, 2004.

Digital Library

[48]

M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, PODC '96, pages 267--275, 1996.

Digital Library

[49]

M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. Journal of Parallel and Distributed Computing, 51: 1--26, May 1998.

Digital Library

[50]

D. Nguyen, A. Lenharth, and K. Pingali. A Lightweight Infrastructure for Graph Analytics. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, SOSP '13, pages 456--471, 2013.

Digital Library

[51]

S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins. OpenMP Task Scheduling Strategies for Multicore NUMA Systems. International Journal of High Performance Computing Applications, 26 (2): 110--124, May 2012.

Digital Library

[52]

G. Pfister, W. Brantley, D. George, S. Harvey, W. Kleinfelder, K. McAuliffe, E. Melton, V. Norton, and J. Weiss. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. In Proceedings of the 1985 International Conference on Parallel Processing, ICPP '85, pages 764--771, 1985.

[53]

X. Qian, B. Sahelices, and J. Torrellas. BulkSMT: Designing SMT Processors for Atomic-block Execution. In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA '12, pages 1--12, 2012.

Digital Library

[54]

X. Qian, B. Sahelices, and J. Torrellas. OmniOrder: Directory-based Conflict Serialization of Transactions. In Proceeding of the 41st Annual International Symposium on Computer Architecuture, ISCA '14, pages 421--432, 2014.

Digital Library

[55]

R. Rajwar, A. Kagi, and J. R. Goodman. Improving the throughput of synchronization by insertion of delays. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture, HPCA '00, pages 168--179, January 2000.

[56]

R. Rajwar, A. Kagi, and J. R. Goodman. Inferential Queueing and Speculative Push for Reducing Critical Communication Latencies. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 273--284, 2003.

Digital Library

[57]

H. E. Ramadan, C. J. Rossbach, and E. Witchel. Dependence-aware Transactional Memory for Increased Concurrency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '08, pages 246--257, 2008.

[58]

A. Ros and S. Kaxiras. Complexity-effective Multicore Coherence. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 241--252, 2012.

Digital Library

[59]

S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable Locality-conscious Multithreaded Memory Allocation. In Proceedings of the 5th International Symposium on Memory Management, ISMM '06, pages 84--94, 2006.

Digital Library

[60]

S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '96, pages 26--36, 1996.

[61]

S. Seo, J. Kim, and J. Lee. SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, PACT '11, pages 253--263, 2011.

Digital Library

[62]

N. Shavit. Data Structures in the Multicore Age. Communications of the ACM, 54 (3): 76--84, Mar. 2011.

Digital Library

[63]

M. Stonebraker, A. Pavlo, R. Taft, and M. L. Brodie. Enterprise Database Applications and the Cloud: A Difficult Road Ahead. In Proceedings of the 2014 IEEE International Conference on Cloud Engineering, IC2E '14, pages 1--6, 2014.

Digital Library

[64]

M. E. Thomadakis. The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms. Resource, 3: 2, 2011.

[65]

P. Trancoso and J. Torrellas. The Impact of Speeding Up Critical Sections with Data Prefetching and Forwarding. In Proceedings of the 1996 International Conference on Parallel Processing, ICPP '96, pages 79--86.

[66]

R. K. Treiber. Systems Programming: Coping with Parallelism. Technical Report RJ5118, IBM Almaden Research Center, 2006.

[67]

L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33 (8): 103--111, Aug. 1990.

Digital Library

[68]

E. Vallejo, R. Beivide, A. Cristal, T. Harris, F. Vallejo, O. Unsal, and M. Valero. Architectural Support for Fair Reader-Writer Locking. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '10, pages 275--286, 2010.

Digital Library

[69]

K. B. Wheeler, R. C. Murphy, and D. Thain. Qthreads: An API for programming with millions of lightweight threads. In Proceedings of the 22nd International Parallel and Distributed Processing Symposium, IPDPS '08, pages 1--8, 2008.

Index Terms

CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent algorithms

Recommendations

CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization
ASPLOS '16

In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade ...
CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems

In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade ...
Lock-free and scalable multi-version software transactional memory
PPoPP '11

Software Transactional Memory (STM) was initially proposed as a lock-free mechanism for concurrency control. Early implementations had efficiency limitations, and soon obstruction-free proposals appeared, to tackle this problem, often simplifying STM ...

Comments

Information & Contributors

Information

Published In

cover image ACM SIGARCH Computer Architecture News

ACM SIGARCH Computer Architecture News Volume 44, Issue 2

ASPLOS'16

May 2016

774 pages

ISSN:0163-5964

DOI:10.1145/2980024

Editor:
Doug DeGroot
acm dot org

Issue’s Table of Contents

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
March 2016
824 pages
ISBN:9781450340915
DOI:10.1145/2872362
General Chair:
Tom Conte
Georgia Tech, USA
,
Program Chair:
Yuanyuan Zhou
University of California, San Diego, USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 March 2016

Published in SIGARCH Volume 44, Issue 2

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
1,126
Total Downloads

Downloads (Last 12 months)206
Downloads (Last 6 weeks)28

Reflects downloads up to 08 Dec 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents