iBet uBet web content aggregator. Adding the entire web to your favor.
iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: https://doi.org/10.1145/2980024.2872400
CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization: ACM SIGARCH Computer Architecture News: Vol 44, No 2 skip to main content
research-article
Public Access

CASPAR: Breaking Serialization in Lock-Free Multicore Synchronization

Published: 25 March 2016 Publication History

Abstract

In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade significantly. Contending writes get serialized, creating a non-scalable condition. Past proposals that build hardware queues of synchronizing processors do not fundamentally solve this problem---at best, they help to efficiently serialize the contending writes.
This paper proposes a novel architecture that breaks the serialization of hardware queues and enables the queued processors to perform lock-free synchronization in parallel. The architecture, called CASPAR, is able to (1) execute the CASes in the queued-up processors in parallel through eager forwarding of expected values, and (2) validate the CASes in parallel and dequeue groups of processors at a time. The result is highly-scalable synchronization. We evaluate CASPAR with simulations of a 64-core chip. Compared to existing proposals with hardware queues, CASPAR improves the throughput of kernels by 32% on average, and reduces the execution time of the sections considered in lock-free versions of applications by 47% on average. This makes these sections 2.5x faster than in the original applications.

References

[1]
The Go Programming Language. http://golang.org, 2014.
[2]
]LFapp2NetBSD producer/consumer queue. ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-current/src/sys/kern/subr_pcq.c, 2014.
[3]
8: High Performance C+ FIX Framework. http://fix8.org, 2014.
[4]
MySQL Concurrent Allocator. https://github.com/twitter/mysql/blob/master/mysys/lf_alloc-pin.c, 2014.
[5]
Y. Afek, G. Korland, and E. Yanovsky. Quasi-Linearizability: Relaxed Consistency for Improved Concurrency. In phProceedings of the 14th International Conference On Principles Of Distributed Systems (OPODIS 2010), volume 6490 of LNCS, pages 395--410. 2010.
[6]
D. Alistarh, J. Kopinsky, J. Li, and N. Shavit. The SprayList: A Scalable Relaxed Priority Queue. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 11--20, 2015.
[7]
R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. In Proceedings of the 4th International Conference on Supercomputing, ICS '90, pages 1--6, 1990.
[8]
U. Aydonat and T. S. Abdelrahman. Hardware Support for Relaxed Concurrency Control in Transactional Memory. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '10, pages 15--26, 2010.
[9]
S. Bansal and D. S. Modha. CAR: Clock with Adaptive Replacement. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies, FAST '04, pages 187--200, 2004.
[10]
G. Blake, R. G. Dreslinski, and T. Mudge. Proactive Transaction Scheduling for Contention Management. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '09, pages 156--167, 2009.
[11]
C. Blundell, A. Raghavan, and M. M. Martin. RETCON: Transactional Repair Without Replay. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 258--269, 2010.
[12]
S. Boyd-Wickizer, M. F. Kaashoek, R. Morris, and N. Zeldovich. OpLog: a library for scaling update-heavy data structures. Technical Report MIT-CSAIL-TR-2014-019, MIT Computer Science and Artificial Intelligence Laboratory, September 2014.
[13]
J. D. Brouer. Qdisc lockless FIFO. In phNetfilter Workshop, 2014.
[14]
T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-core Simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 52:1--52:12, 2011.
[15]
C. Diaconu, C. Freedman, E. Ismert, P.-A. Larson, P. Mittal, R. Stonecipher, N. Verma, and M. Zwilling. Hekaton: SQL Server's Memory-optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1243--1254, 2013.
[16]
N. Diegues, P. Romano, and L. Rodrigues. Virtues and Limitations of Commodity Hardware Transactional Memory. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 3--14, 2014.
[17]
A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124--131, 2009.
[18]
K. Fraser. Practical lock-freedom. PhD thesis, University of Cambridge, Computer Laboratory, University of Cambridge, Computer Laboratory, February 2004.
[19]
L. Gidra, G. Thomas, J. Sopena, and M. Shapiro. A Study of the Scalability of Stop-the-world Garbage Collectors on Multicores. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 229--240, 2013.
[20]
J. R. Goodman, M. K. Vernon, and P. J. Woest. Efficient Synchronization Primitives for Large-scale Cache-coherent Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '89, pages 64--75, 1989.
[21]
A. Gottlieb, R. Grishman, C. Kruskal, K. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer -- Designing an MIMD, Shared Memory Parallel Machine. In IEEE Transactions on Computers, February 1983.
[22]
A. Haas, M. Lippautz, T. A. Henzinger, H. Payer, A. Sokolova, C. M. Kirsch, and A. Sezgin. Distributed queues in shared memory: multicore performance and scalability through quantitative relaxation. In Proceedings of the ACM International Conference on Computing Frontiers, CF '13, pages 17:1--17:9, 2013.
[23]
D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. In Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2004, pages 206--215, 2004.
[24]
M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13: 124--149, January 1991.
[25]
M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.
[26]
M. Hoffman, O. Shalev, and N. Shavit. The baskets queue. In Proceedings of the 11th International Conference on Principles of Distributed Systems, OPODIS'07, pages 401--414, 2007.
[27]
Intel. Intel Xeon Phi Coprocessor. https://software.intel.com/en-us/mic-developer, 2013.
[28]
S. A. R. Jafri, G. Voskuilen, and T. N. Vijaykumar. Wait-n-GoTM: Improving HTM Performance by Serializing Cyclic Dependencies. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pages 521--534, 2013.
[29]
E. H. Jensen, G. W. Hagensen, and J. M. Broughton. A New Approach to Exclusive Data Access in Shared Memory Multiprocessors. Technical Report UCRL-97663, Lawrence Livermore National Laboratory, 1987.
[30]
R. Jones, A. Hosking, and E. Moss. The Garbage Collection Handbook: The Art of Automatic Memory Management. Chapman & Hall/CRC, 1st edition, 2011.
[31]
H. Jordan. Performance Measurements on HEP A Pipelined MIMD Computer. In International Symposium on Computer Architecture (ISCA), June 1983.
[32]
A. Kagi, D. Burger, and J. R. Goodman. Efficient Synchronization: Let Them Eat QOLB. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 170--180, 1997.
[33]
C. Kirsch, H. Payer, H. Röck, and A. Sokolova. Performance, Scalability, and Semantics of Concurrent FIFO Queues. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP '12), volume 7439 of phLNCS, pages 273--287. 2012.
[34]
C. M. Kirsch, M. Lippautz, and H. Payer. Fast and Scalable, Lock-Free k-FIFO Queues. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pages 241--252, 2013.
[35]
A. Kivity, D. Laor, G. Costa, P. Enberg, N. Har'El, D. Marti, and V. Zolotarov. OSv-Optimizing the Operating System for Virtual Machines. In Proceedings of the 2014 USENIX Annual Technical Conference, ATC '14, pages 61--72, June 2014.
[36]
A. Kogan and E. Petrank. Wait-free queues with multiple enqueuers and dequeuers. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, PPoPP '11, pages 223--234, 2011.
[37]
D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas. Data Forwarding in Scalable Shared-memory Multiprocessors. In Proceedings of the 9th International Conference on Supercomputing, ICS '95, pages 255--264, 1995.
[38]
D. Kuck, E. Davidson, D. Lawrie, A. Sameh, C. Q. Zhu, A. Veidenbaum, J. Konicek, P. Yew, K. Gallivan, W. Jalby, H. Wijshoff, R. Bramley, U. M. Yang, P. Emrath, D. Padua, R. Eigenmann, J. Hoeflinger, G. Jaxon, Z. Li, T. Murphy, and J. Andrews. The Cedar System and an Initial Performance Study. In Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA '93, pages 213--223, 1993.
[39]
M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 211--222, 2007.
[40]
E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. In Proceedings of the 18th International Symposium on Distributed Computing (DISC 2004), volume 3274 of phLNCS, pages 117--131, 2004.
[41]
P.-A. Larson and M. Krishnan. Memory Allocation for Long-running Server Applications. In Proceedings of the 1st International Symposium on Memory Management, ISMM '98, pages 176--185, 1998.
[42]
J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, ISCA '97, pages 241--251, 1997.
[43]
D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. In IEEE Computer, March 1992.
[44]
I. Lotan and N. Shavit. Skiplist-Based Concurrent Priority Queues. In Proceedings of the 14th International Parallel and Distributed Processing Symposium, IPDPS '00, pages 263--268, 2000.
[45]
N. D. Matsakis and F. S. Klock, II. The Rust Language. In Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Technology, HILT '14, pages 103--104, 2014.
[46]
M. M. Michael. Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects. IEEE Transactions on Parallel and Distributed System, 15 (6): 491--504, June 2004.
[47]
M. M. Michael. Scalable lock-free dynamic memory allocation. In Proceedings of the 2004 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '04, pages 35--46, 2004.
[48]
M. M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, PODC '96, pages 267--275, 1996.
[49]
M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. Journal of Parallel and Distributed Computing, 51: 1--26, May 1998.
[50]
D. Nguyen, A. Lenharth, and K. Pingali. A Lightweight Infrastructure for Graph Analytics. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, SOSP '13, pages 456--471, 2013.
[51]
S. L. Olivier, A. K. Porterfield, K. B. Wheeler, M. Spiegel, and J. F. Prins. OpenMP Task Scheduling Strategies for Multicore NUMA Systems. International Journal of High Performance Computing Applications, 26 (2): 110--124, May 2012.
[52]
G. Pfister, W. Brantley, D. George, S. Harvey, W. Kleinfelder, K. McAuliffe, E. Melton, V. Norton, and J. Weiss. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. In Proceedings of the 1985 International Conference on Parallel Processing, ICPP '85, pages 764--771, 1985.
[53]
X. Qian, B. Sahelices, and J. Torrellas. BulkSMT: Designing SMT Processors for Atomic-block Execution. In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA '12, pages 1--12, 2012.
[54]
X. Qian, B. Sahelices, and J. Torrellas. OmniOrder: Directory-based Conflict Serialization of Transactions. In Proceeding of the 41st Annual International Symposium on Computer Architecuture, ISCA '14, pages 421--432, 2014.
[55]
R. Rajwar, A. Kagi, and J. R. Goodman. Improving the throughput of synchronization by insertion of delays. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture, HPCA '00, pages 168--179, January 2000.
[56]
R. Rajwar, A. Kagi, and J. R. Goodman. Inferential Queueing and Speculative Push for Reducing Critical Communication Latencies. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 273--284, 2003.
[57]
H. E. Ramadan, C. J. Rossbach, and E. Witchel. Dependence-aware Transactional Memory for Increased Concurrency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '08, pages 246--257, 2008.
[58]
A. Ros and S. Kaxiras. Complexity-effective Multicore Coherence. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 241--252, 2012.
[59]
S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable Locality-conscious Multithreaded Memory Allocation. In Proceedings of the 5th International Symposium on Memory Management, ISMM '06, pages 84--94, 2006.
[60]
S. L. Scott. Synchronization and Communication in the T3E Multiprocessor. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '96, pages 26--36, 1996.
[61]
S. Seo, J. Kim, and J. Lee. SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques, PACT '11, pages 253--263, 2011.
[62]
N. Shavit. Data Structures in the Multicore Age. Communications of the ACM, 54 (3): 76--84, Mar. 2011.
[63]
M. Stonebraker, A. Pavlo, R. Taft, and M. L. Brodie. Enterprise Database Applications and the Cloud: A Difficult Road Ahead. In Proceedings of the 2014 IEEE International Conference on Cloud Engineering, IC2E '14, pages 1--6, 2014.
[64]
M. E. Thomadakis. The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms. Resource, 3: 2, 2011.
[65]
P. Trancoso and J. Torrellas. The Impact of Speeding Up Critical Sections with Data Prefetching and Forwarding. In Proceedings of the 1996 International Conference on Parallel Processing, ICPP '96, pages 79--86.
[66]
R. K. Treiber. Systems Programming: Coping with Parallelism. Technical Report RJ5118, IBM Almaden Research Center, 2006.
[67]
L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33 (8): 103--111, Aug. 1990.
[68]
E. Vallejo, R. Beivide, A. Cristal, T. Harris, F. Vallejo, O. Unsal, and M. Valero. Architectural Support for Fair Reader-Writer Locking. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '10, pages 275--286, 2010.
[69]
K. B. Wheeler, R. C. Murphy, and D. Thain. Qthreads: An API for programming with millions of lightweight threads. In Proceedings of the 22nd International Parallel and Distributed Processing Symposium, IPDPS '08, pages 1--8, 2008.

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGARCH Computer Architecture News
ACM SIGARCH Computer Architecture News  Volume 44, Issue 2
ASPLOS'16
May 2016
774 pages
ISSN:0163-5964
DOI:10.1145/2980024
Issue’s Table of Contents
  • cover image ACM Conferences
    ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2016
    824 pages
    ISBN:9781450340915
    DOI:10.1145/2872362
    • General Chair:
    • Tom Conte,
    • Program Chair:
    • Yuanyuan Zhou
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 March 2016
Published in SIGARCH Volume 44, Issue 2

Check for updates

Author Tags

  1. lock-free synchronization
  2. serialization

Qualifiers

  • Research-article

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)206
  • Downloads (Last 6 weeks)28
Reflects downloads up to 08 Dec 2024

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media