Abstract
Modeling and simulating Deep Learning Training (DLT) is challenging. Because of the intricate parallel patterns involved, existing models and simulators overlook many of the factors that influence training, which makes their predictions of DLT time inaccurate. To address these challenges, we propose DeletePop, a Deep Learning Training Execution time Predictor based on comprehensive modeling at the Operator level. It systematically abstracts the DLT process by decomposing it into three parts: computation, memory access, and communication. DeletePop predicts the Job Execution Time (JET) from an operator dataset collected on a homogeneous network. Finally, we integrate DeletePop into the Job Scheduling Simulator (JSS) DLTSim to support more efficient scheduling. Although DeletePop is implemented on the TensorFlow framework, the theoretical model can be adapted to any other framework that uses static graphs. DeletePop achieves up to 90% accuracy on homogeneous networks, and we also describe a theoretical approach to extend support to heterogeneous networks.
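The operator-level decomposition described above (per-operator computation, memory access, and communication costs aggregated into a Job Execution Time over a static graph) can be illustrated with a minimal sketch. Everything below is an assumption-based illustration: the names (Operator, op_time, predict_jet), the choice to let communication overlap with on-device work, and the critical-path aggregation are hypothetical and do not reproduce DeletePop's actual model.

```python
# Hypothetical sketch of an operator-level DLT time model: each operator's cost
# is built from computation, memory access, and communication, and the Job
# Execution Time (JET) is the critical path through the static operator graph.
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class Operator:
    name: str
    compute_time: float      # seconds of computation (measured or fitted per device)
    memory_time: float       # seconds of memory access
    comm_time: float = 0.0   # seconds of communication (e.g., gradient all-reduce)
    deps: List["Operator"] = field(default_factory=list)

def op_time(op: Operator) -> float:
    # Simplifying assumption: compute and memory access serialize on the device,
    # while communication may overlap with both.
    on_device = op.compute_time + op.memory_time
    return max(on_device, op.comm_time)

def predict_jet(ops: List[Operator]) -> float:
    # JET as the longest path (critical path) through the static graph,
    # assuming independent operators can run fully in parallel.
    finish: Dict[str, float] = {}
    def finish_time(op: Operator) -> float:
        if op.name not in finish:
            start = max((finish_time(d) for d in op.deps), default=0.0)
            finish[op.name] = start + op_time(op)
        return finish[op.name]
    return max(finish_time(op) for op in ops)

if __name__ == "__main__":
    conv = Operator("conv1", compute_time=2e-3, memory_time=5e-4)
    allreduce = Operator("allreduce_grads", compute_time=1e-4, memory_time=1e-4,
                         comm_time=3e-3, deps=[conv])
    print(f"Predicted JET: {predict_jet([conv, allreduce]) * 1e3:.2f} ms")
```

A real predictor would replace the constant per-operator costs with values taken from a profiled operator dataset and would model how much communication actually overlaps with computation on the target interconnect.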
Acknowledgment
This work was sponsored in part by NKRDP (2021YFB0300800), and in part by NSFC (62102396), the Beijing Nova Program (Z211100002121143, 20220484217), the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2021099), and the Pilot for Major Scientific Research Facility of Jiangsu Province of China (No. BM2021800).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
He, Y., Zhou, Y., Shao, E., Tan, G., Sun, N. (2024). DeletePop: A DLT Execution Time Predictor Based on Comprehensive Modeling. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14493. Springer, Singapore. https://doi.org/10.1007/978-981-97-0862-8_9
DOI: https://doi.org/10.1007/978-981-97-0862-8_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0861-1
Online ISBN: 978-981-97-0862-8
eBook Packages: Computer Science, Computer Science (R0)