Abstract
Modeling and simulating Deep Learning Training (DLT) is challenging. Because of the intricate parallel patterns involved, existing models and simulators overlook many of the factors that influence training, which makes their predictions of DLT time inaccurate. To address these challenges, we propose DeletePop, a Deep Learning Training Execution time Predictor based on comprehensive modeling at the Operator level. It systematically abstracts the DLT process by decomposing it into three parts: computation, memory access, and communication. DeletePop predicts the Job Execution Time (JET) from an operator dataset collected on a homogeneous network. Finally, we integrate DeletePop into the Job Scheduling Simulator (JSS) DLTSim to support more efficient scheduling. Although DeletePop is implemented on the TensorFlow framework, the theoretical model can be adapted to any other framework that uses static graphs. DeletePop achieves up to 90% accuracy on homogeneous networks, and we also describe a theoretical approach to extend support to heterogeneous networks.
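The operator-level decomposition described above (per-operator computation, memory access, and communication costs aggregated into a Job Execution Time over a static graph) can be illustrated with a minimal sketch. Everything below is an assumption-based illustration: the names (Operator, op_time, predict_jet), the choice to let communication overlap with on-device work, and the critical-path aggregation are hypothetical and do not reproduce DeletePop's actual model.

```python
# Hypothetical sketch of an operator-level DLT time model: each operator's cost
# is built from computation, memory access, and communication, and the Job
# Execution Time (JET) is the critical path through the static operator graph.
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class Operator:
    name: str
    compute_time: float      # seconds of computation (measured or fitted per device)
    memory_time: float       # seconds of memory access
    comm_time: float = 0.0   # seconds of communication (e.g., gradient all-reduce)
    deps: List["Operator"] = field(default_factory=list)

def op_time(op: Operator) -> float:
    # Simplifying assumption: compute and memory access serialize on the device,
    # while communication may overlap with both.
    on_device = op.compute_time + op.memory_time
    return max(on_device, op.comm_time)

def predict_jet(ops: List[Operator]) -> float:
    # JET as the longest path (critical path) through the static graph,
    # assuming independent operators can run fully in parallel.
    finish: Dict[str, float] = {}
    def finish_time(op: Operator) -> float:
        if op.name not in finish:
            start = max((finish_time(d) for d in op.deps), default=0.0)
            finish[op.name] = start + op_time(op)
        return finish[op.name]
    return max(finish_time(op) for op in ops)

if __name__ == "__main__":
    conv = Operator("conv1", compute_time=2e-3, memory_time=5e-4)
    allreduce = Operator("allreduce_grads", compute_time=1e-4, memory_time=1e-4,
                         comm_time=3e-3, deps=[conv])
    print(f"Predicted JET: {predict_jet([conv, allreduce]) * 1e3:.2f} ms")
```

A real predictor would replace the constant per-operator costs with values taken from a profiled operator dataset and would model how much communication actually overlaps with computation on the target interconnect.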
Acknowledgment
This work was sponsored in part by NKRDP (2021YFB0300800), and in part by NSFC (62102396), the Beijing Nova Program (Z211100002121143, 20220484217), the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2021099), and the Pilot for Major Scientific Research Facility of Jiangsu Province of China (No. BM2021800).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
He, Y., Zhou, Y., Shao, E., Tan, G., Sun, N. (2024). DeletePop: A DLT Execution Time Predictor Based on Comprehensive Modeling. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14493. Springer, Singapore. https://doi.org/10.1007/978-981-97-0862-8_9
DOI: https://doi.org/10.1007/978-981-97-0862-8_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0861-1
Online ISBN: 978-981-97-0862-8
eBook Packages: Computer Science, Computer Science (R0)