{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,8]],"date-time":"2024-09-08T13:25:42Z","timestamp":1725801942811},"publisher-location":"New York, NY, USA","reference-count":37,"publisher":"ACM","license":[{"start":{"date-parts":[[2018,9,23]],"date-time":"2018-09-23T00:00:00Z","timestamp":1537660800000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2018,9,23]]},"DOI":"10.1145\/3236367.3236381","type":"proceedings-article","created":{"date-parts":[[2018,9,19]],"date-time":"2018-09-19T08:16:51Z","timestamp":1537345011000},"page":"1-9","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":20,"title":["Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters"],"prefix":"10.1145","author":[{"given":"Ammar Ahmad","family":"Awan","sequence":"first","affiliation":[{"name":"Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio"}]},{"given":"Ching-Hsiang","family":"Chu","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio"}]},{"given":"Hari","family":"Subramoni","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio"}]},{"given":"Dhabaleswar K.","family":"Panda","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio"}]}],"member":"320","published-online":{"date-parts":[[2018,9,23]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"{n. d.}. KESCH: Cray CS-Storm System. http:\/\/www.cscs.ch\/computers\/kesch_escha\/index.html. ({n. d.}). {n. d.}. KESCH: Cray CS-Storm System. http:\/\/www.cscs.ch\/computers\/kesch_escha\/index.html. ({n. d.})."},{"volume-title":"http:\/\/www.cntk.ai\/. (2015). {Online","year":"2016","author":"CNTK.","key":"e_1_3_2_1_2_1","unstructured":"2015. CNTK. http:\/\/www.cntk.ai\/. (2015). {Online ; accessed April- 2016 }. 2015. CNTK. http:\/\/www.cntk.ai\/. (2015). {Online; accessed April-2016}."},{"volume-title":"{n. d.}. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems","year":"2015","author":"Abadi Martin","key":"e_1_3_2_1_3_1","unstructured":"Martin Abadi , Ashish Agarwal , Paul Barham , Eugene Brevdo , Zhifeng Chen , Craig Citro , Greg S Corrado , Andy Davis , Jeffrey Dean , Matthieu Devin , {n. d.}. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems , 2015 . Software available from tensorflow. org ({n. d.}). Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. {n. d.}. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from tensorflow. org ({n. d.})."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3018743.3018769"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2966884.2966912"},{"volume-title":"Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters. In 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). 144--151","author":"Banerjee D. S.","key":"e_1_3_2_1_6_1","unstructured":"D. S. Banerjee , K. Hamidouche , and D. K. Panda . 2016 . Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters. In 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). 144--151 . D. S. Banerjee, K. Hamidouche, and D. K. Panda. 2016. Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters. In 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). 144--151."},{"volume-title":"Proceedings of IEEE Scalable High Performance Computing Conference. 357--364","author":"Barnett M.","key":"e_1_3_2_1_7_1","unstructured":"M. Barnett , L. Shuler , R. van de Geijn, S. Gupta, D. G. Payne, and J. Watts. 1994. Interprocessor collective communication library (InterCom) . In Proceedings of IEEE Scalable High Performance Computing Conference. 357--364 . M. Barnett, L. Shuler, R. van de Geijn, S. Gupta, D. G. Payne, and J. Watts. 1994. Interprocessor collective communication library (InterCom). In Proceedings of IEEE Scalable High Performance Computing Conference. 357--364."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CCGRID.2007.59"},{"volume-title":"Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning. In 46th International Conference on Parallel Processing (ICPP-2017)","author":"Chu Ching-Hsiang","key":"e_1_3_2_1_9_1","unstructured":"Ching-Hsiang Chu , Xiaoyi Lu , Ammar A. Awan , Hari Subramoni , Jahanzeb Hashmi , Bracy Elton , and D. K. Panda . 2017 . Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning. In 46th International Conference on Parallel Processing (ICPP-2017) . {To appear}. Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and D. K. Panda. 2017. Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning. In 46th International Conference on Parallel Processing (ICPP-2017). {To appear}."},{"key":"e_1_3_2_1_10_1","unstructured":"Cray. {n. d.}. CS-STORM GPU-ACCELERATED CLUSTER SUPERCOMPUTER. ({n. d.}). http:\/\/www.cray.com\/products\/computing\/cs-series\/cs-storm Accessed: August 8 2018. Cray. {n. d.}. CS-STORM GPU-ACCELERATED CLUSTER SUPERCOMPUTER. ({n. d.}). http:\/\/www.cray.com\/products\/computing\/cs-series\/cs-storm Accessed: August 8 2018."},{"volume-title":"Ng","year":"2012","author":"Dean Jeffrey","key":"e_1_3_2_1_11_1","unstructured":"Jeffrey Dean , Greg Corrado , Rajat Monga , Kai Chen , Matthieu Devin , Mark Mao , Marc'aurelio Ranzato , Andrew Senior , Paul Tucker , Ke Yang , Quoc V. Le , and Andrew Y . Ng . 2012 . Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc ., 1223--1231. http:\/\/papers.nips.cc\/paper\/4687-large-scale-distributed-deep-networks.pdf Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. 2012. Large Scale Distributed Deep Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1223--1231. http:\/\/papers.nips.cc\/paper\/4687-large-scale-distributed-deep-networks.pdf"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33518-1_18"},{"volume-title":"Proceedings of the 21st IEEE International Parallel & Distributed Processing Symposium (CAC'07 Workshop). 232","author":"Hoefler T.","key":"e_1_3_2_1_13_1","unstructured":"T. Hoefler , C. Siebert , and W. Rehm . 2007. A Practically Constant-time MPI Broadcast Algorithm for Large-scale InfiniBand Clusters with Multicast . In Proceedings of the 21st IEEE International Parallel & Distributed Processing Symposium (CAC'07 Workshop). 232 . T. Hoefler, C. Siebert, and W. Rehm. 2007. A Practically Constant-time MPI Broadcast Algorithm for Large-scale InfiniBand Clusters with Multicast. In Proceedings of the 21st IEEE International Parallel & Distributed Processing Symposium (CAC'07 Workshop). 232."},{"volume-title":"FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters. arXiv preprint arXiv:1511.00175","year":"2015","author":"Iandola Forrest N","key":"e_1_3_2_1_14_1","unstructured":"Forrest N Iandola , Khalid Ashraf , Mattthew W Moskewicz , and Kurt Keutzer . 2015. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters. arXiv preprint arXiv:1511.00175 ( 2015 ). Forrest N Iandola, Khalid Ashraf, Mattthew W Moskewicz, and Kurt Keutzer. 2015. FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters. arXiv preprint arXiv:1511.00175 (2015)."},{"volume-title":"Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093","year":"2014","author":"Jia Yangqing","key":"e_1_3_2_1_15_1","unstructured":"Yangqing Jia , Evan Shelhamer , Jeff Donahue , Sergey Karayev , Jonathan Long , Ross Girshick , Sergio Guadarrama , and Trevor Darrell . 2014 . Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014). Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTI.2013.26"},{"key":"e_1_3_2_1_17_1","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in neural information processing systems. 1097--1105. Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in neural information processing systems. 1097--1105."},{"volume-title":"Deep learning. Nature 521, 7553 (28 05","year":"2015","author":"LeCun Yann","key":"e_1_3_2_1_18_1","unstructured":"Yann LeCun , Yoshua Bengio , and Geoffrey Hinton . 2015. Deep learning. Nature 521, 7553 (28 05 2015 ), 436--444. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (28 05 2015), 436--444."},{"volume-title":"Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. 10","author":"Liu J.","key":"e_1_3_2_1_19_1","unstructured":"J. Liu , A. R. Mamidala , and D. K. Panda . 2004. Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support . In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. 10 . J. Liu, A. R. Mamidala, and D. K. Panda. 2004. Fast and Scalable MPI-level Broadcast using InfiniBand's Hardware Multicast Support. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International. 10."},{"volume-title":"Proceedings 20th IEEE International Parallel Distributed Processing Symposium. 8.","author":"Mamidala A. R.","key":"e_1_3_2_1_20_1","unstructured":"A. R. Mamidala , Lei Chai , Hyun-Wook Jin , and D. K. Panda . 2006. Efficient SMP-aware MPI-level Broadcast over InfiniBand's Hardware Multicast . In Proceedings 20th IEEE International Parallel Distributed Processing Symposium. 8. A. R. Mamidala, Lei Chai, Hyun-Wook Jin, and D. K. Panda. 2006. Efficient SMP-aware MPI-level Broadcast over InfiniBand's Hardware Multicast. In Proceedings 20th IEEE International Parallel Distributed Processing Symposium. 8."},{"key":"e_1_3_2_1_21_1","unstructured":"Hans Meuer Erich Strohmaier Jack Dongarra and Horst Simon. {n. d.}. TOP 500 Supercomputer Sites. http:\/\/www.top500.org. ({n. d.}). Hans Meuer Erich Strohmaier Jack Dongarra and Horst Simon. {n. d.}. TOP 500 Supercomputer Sites. http:\/\/www.top500.org. ({n. d.})."},{"key":"e_1_3_2_1_22_1","unstructured":"MVAPICH2: MPI over InfiniBand 10GigE\/iWARP and RoCE. {n. d.}. https:\/\/mvapich.cse.ohio-state.edu\/. ({n. d.}). MVAPICH2: MPI over InfiniBand 10GigE\/iWARP and RoCE. {n. d.}. https:\/\/mvapich.cse.ohio-state.edu\/. ({n. d.})."},{"volume-title":"Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. CoRR abs\/1703.01619","year":"2017","author":"Neubig Graham","key":"e_1_3_2_1_23_1","unstructured":"Graham Neubig . 2017. Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. CoRR abs\/1703.01619 ( 2017 ). http:\/\/arxiv.org\/abs\/1703.01619 Graham Neubig. 2017. Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. CoRR abs\/1703.01619 (2017). http:\/\/arxiv.org\/abs\/1703.01619"},{"volume-title":"d.}. DGX-1: Essential Instrument of AI Research. ({n. d.}). https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-1\/ Accessed","year":"2018","author":"NVIDIA.","key":"e_1_3_2_1_24_1","unstructured":"NVIDIA. {n. d.}. DGX-1: Essential Instrument of AI Research. ({n. d.}). https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-1\/ Accessed : August 8, 2018 . NVIDIA. {n. d.}. DGX-1: Essential Instrument of AI Research. ({n. d.}). https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-1\/ Accessed: August 8, 2018."},{"volume-title":"d.}. Optimized Primitives for Collective Multi-GPU Communication. ({n. d.}). https:\/\/github.com\/NVIDIA\/nccl Accessed","year":"2018","author":"NVIDIA.","key":"e_1_3_2_1_25_1","unstructured":"NVIDIA. {n. d.}. Optimized Primitives for Collective Multi-GPU Communication. ({n. d.}). https:\/\/github.com\/NVIDIA\/nccl Accessed : August 8, 2018 . NVIDIA. {n. d.}. Optimized Primitives for Collective Multi-GPU Communication. ({n. d.}). https:\/\/github.com\/NVIDIA\/nccl Accessed: August 8, 2018."},{"key":"e_1_3_2_1_26_1","unstructured":"NVIDIA. 2017. NCCL 2. https:\/\/developer.nvidia.com\/nccl. (2017). NVIDIA. 2017. NCCL 2. https:\/\/developer.nvidia.com\/nccl. (2017)."},{"volume-title":"d.}. SUMMIT. ({n. d.}). https:\/\/www.olcf.ornl.gov\/summit\/ Accessed","year":"2018","author":"Oak Ridge National Laboratory. {n.","key":"e_1_3_2_1_27_1","unstructured":"Oak Ridge National Laboratory. {n. d.}. SUMMIT. ({n. d.}). https:\/\/www.olcf.ornl.gov\/summit\/ Accessed : August 8, 2018 . Oak Ridge National Laboratory. {n. d.}. SUMMIT. ({n. d.}). https:\/\/www.olcf.ornl.gov\/summit\/ Accessed: August 8, 2018."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2013.17"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2014.09.003"},{"volume-title":"Edinburgh Neural Machine Translation Systems for WMT 16. CoRR abs\/1606.02891","year":"2016","author":"Sennrich Rico","key":"e_1_3_2_1_30_1","unstructured":"Rico Sennrich , Barry Haddow , and Alexandra Birch . 2016. Edinburgh Neural Machine Translation Systems for WMT 16. CoRR abs\/1606.02891 ( 2016 ). http:\/\/arxiv.org\/abs\/1606.02891 Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh Neural Machine Translation Systems for WMT 16. CoRR abs\/1606.02891 (2016). http:\/\/arxiv.org\/abs\/1606.02891"},{"volume-title":"Designing Efficient Small Message Transfer Mechanism for Internode MPI Communication on InfiniBand GPU Clusters. In 2014 21st International Conference on High Performance Computing (HiPC). 1--10","author":"Shi R.","key":"e_1_3_2_1_31_1","unstructured":"R. Shi , S. Potluri , K. Hamidouche , J. Perkins , M. Li , D. Rossetti , and D. K. Panda . 2014 . Designing Efficient Small Message Transfer Mechanism for Internode MPI Communication on InfiniBand GPU Clusters. In 2014 21st International Conference on High Performance Computing (HiPC). 1--10 . R. Shi, S. Potluri, K. Hamidouche, J. Perkins, M. Li, D. Rossetti, and D. K. Panda. 2014. Designing Efficient Small Message Transfer Mechanism for Internode MPI Communication on InfiniBand GPU Clusters. In 2014 21st International Conference on High Performance Computing (HiPC). 1--10."},{"volume-title":"Van De Geijn","year":"2000","author":"Shroff Mohak","key":"e_1_3_2_1_32_1","unstructured":"Mohak Shroff and Robert A . Van De Geijn . 2000 . CollMark: MPI Collective Communication Benchmark. Technical Report. Dept. of Computer Sciences, University of Texas at Austin. Mohak Shroff and Robert A. Van De Geijn. 2000. CollMark: MPI Collective Communication Benchmark. Technical Report. Dept. of Computer Sciences, University of Texas at Austin."},{"volume-title":"Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556","year":"2014","author":"Simonyan Karen","key":"e_1_3_2_1_33_1","unstructured":"Karen Simonyan and Andrew Zisserman . 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 ( 2014 ). Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005051521"},{"key":"e_1_3_2_1_35_1","unstructured":"The Open MPI Development Team. {n. d.}. Open MPI: Open Source High Performance Computing. http:\/\/www.open-mpi.org. ({n. d.}). The Open MPI Development Team. {n. d.}. Open MPI: Open Source High Performance Computing. http:\/\/www.open-mpi.org. ({n. d.})."},{"volume-title":"2014 21st International Conference on High Performance Computing (HiPC). 1--10","author":"Venkatesh A.","key":"e_1_3_2_1_36_1","unstructured":"A. Venkatesh , H. Subramoni , K. Hamidouche , and D. K. Panda . 2014. A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on Infiniband Clusters . In 2014 21st International Conference on High Performance Computing (HiPC). 1--10 . A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda. 2014. A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on Infiniband Clusters. In 2014 21st International Conference on High Performance Computing (HiPC). 1--10."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPPW.2015.20"}],"event":{"name":"EuroMPI'18: 25th European MPI Users' Group Meeting","acronym":"EuroMPI'18","location":"Barcelona Spain"},"container-title":["Proceedings of the 25th European MPI Users' Group Meeting"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3236367.3236381","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,22]],"date-time":"2023-09-22T00:09:49Z","timestamp":1695341389000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3236367.3236381"}},"subtitle":["MPI or NCCL?"],"short-title":[],"issued":{"date-parts":[[2018,9,23]]},"references-count":37,"alternative-id":["10.1145\/3236367.3236381","10.1145\/3236367"],"URL":"http:\/\/dx.doi.org\/10.1145\/3236367.3236381","relation":{},"subject":[],"published":{"date-parts":[[2018,9,23]]},"assertion":[{"value":"2018-09-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}