Abstract
Machine learning techniques are taking a prominent position in the design of system software. In HPC, many works propose using such techniques (specifically reinforcement learning) to improve the performance of batch schedulers.
Their main limitation is the lack of transparency of their decisions. This underlines the importance of correctly choosing the optimization criteria when evaluating these solutions. In this work, we discuss the biases and limitations of the most frequent optimization metrics in the literature. We provide elements on how to evaluate performance when studying HPC batch scheduling. We also propose a new metric, the standard deviation of the utilization, which we believe can be used when the utilization reaches its limits.
We then experimentally evaluate these limitations by focusing on the use case of runtime estimates. One piece of information that HPC batch schedulers use to schedule jobs on the available resources is the user runtime estimate: an estimation provided by the user of how long their job will run on the machine. These estimates are known to be inaccurate, hence many works have focused on improving runtime prediction.
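As an illustration of the utilization-based metric mentioned above, here is a minimal sketch of how a standard deviation of the utilization could be computed from a finished schedule. The abstract does not give the exact definition, so the fixed-size time windows, the `(start, end, nodes)` job tuples, and the function name `window_utilization` are all assumptions made for this example.

```python
from statistics import mean, pstdev


def window_utilization(jobs, total_nodes, t_start, t_end, window):
    """Return the fraction of node-time in use for each consecutive window.

    jobs: iterable of (start, end, nodes) tuples (hypothetical format).
    The windowing scheme is an assumption, not the paper's definition.
    """
    utilizations = []
    t = t_start
    while t < t_end:
        w_end = min(t + window, t_end)
        used = 0.0
        for start, end, nodes in jobs:
            # Time during which this job runs inside the current window.
            overlap = min(end, w_end) - max(start, t)
            if overlap > 0:
                used += overlap * nodes
        utilizations.append(used / (total_nodes * (w_end - t)))
        t = w_end
    return utilizations


# Toy schedule on an 8-node machine, observed over 20 time units.
jobs = [(0, 10, 4), (0, 5, 4), (10, 20, 2)]
u = window_utilization(jobs, total_nodes=8, t_start=0, t_end=20, window=5)

print("per-window utilization:", u)
print("mean utilization:", mean(u))
print("std of utilization:", pstdev(u))
```

Two schedules with the same mean utilization can differ in how evenly the machine is used over time; under the assumptions above, the population standard deviation of the per-window values is one way to expose that difference.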
Acknowledgment
This work was supported in part by the French National Research Agency (ANR) in the frame of DASH (ANR-17-CE25-0004) and in part by the Inria Exploratory project REPAS.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Boëzennec, R., Dufossé, F., Pallez, G. (2023). Optimization Metrics for the Evaluation of Batch Schedulers in HPC. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2023. Lecture Notes in Computer Science, vol 14283. Springer, Cham. https://doi.org/10.1007/978-3-031-43943-8_5
DOI: https://doi.org/10.1007/978-3-031-43943-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43942-1
Online ISBN: 978-3-031-43943-8
eBook Packages: Computer Science (R0)