Link to original content: https://doi.org/10.1007/978-3-031-43943-8_5
Optimization Metrics for the Evaluation of Batch Schedulers in HPC

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14283)

Abstract

Machine Learning techniques are taking a prominent position in the design of system software. In HPC, many works propose using such techniques (specifically Reinforcement Learning) to improve the performance of batch schedulers.

Their main limitation is the lack of transparency of their decisions. This underlines the importance of correctly choosing the optimization criteria when evaluating these solutions. In this work, we discuss the biases and limitations of the most frequent optimization metrics in the literature. We provide elements on how to evaluate performance when studying HPC batch scheduling. We also propose a new metric: the standard deviation of the utilization, which we believe can be used when the utilization reaches its limits.
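As an illustration of the proposed metric, the sketch below computes the utilization over fixed time windows and then its standard deviation. The abstract does not specify how utilization is sampled, so the windowing scheme, the function name `utilization_per_window`, and the job tuple layout are assumptions made for this example only.

```python
from statistics import mean, pstdev


def utilization_per_window(jobs, total_nodes, horizon, window):
    """Return the fraction of node-time in use for each time window.

    jobs: iterable of (start, end, nodes) tuples, times in seconds
          (hypothetical layout, not taken from the paper).
    total_nodes: number of nodes in the machine.
    horizon: length of the observed period, in seconds.
    window: length of each sampling window, in seconds (assumed scheme).
    """
    n_windows = int(horizon // window)
    utils = []
    for w in range(n_windows):
        w_start, w_end = w * window, (w + 1) * window
        busy = 0.0
        for start, end, nodes in jobs:
            # Time during which this job runs inside the current window.
            overlap = min(end, w_end) - max(start, w_start)
            if overlap > 0:
                busy += overlap * nodes
        utils.append(busy / (total_nodes * window))
    return utils


# Toy schedule on a 4-node machine observed for 200 seconds.
jobs = [(0, 100, 2), (50, 200, 2), (100, 200, 2)]
utils = utilization_per_window(jobs, total_nodes=4, horizon=200, window=100)

avg_utilization = mean(utils)
std_utilization = pstdev(utils)  # population standard deviation
print(utils, avg_utilization, std_utilization)
```

Two schedules with the same mean utilization can differ in this standard deviation, which is one way such a metric could distinguish them once the mean utilization is saturated.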

We then experimentally evaluate these limitations by focusing on the use case of runtime estimates. One piece of information that HPC batch schedulers use to schedule jobs on the available resources is the user runtime estimate: an estimation provided by the user of how long their job will run on the machine. These estimates are known to be inaccurate, hence many works have focused on improving runtime prediction.
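To make the notion of an inaccurate estimate concrete, the sketch below measures accuracy as the ratio of actual runtime to requested runtime. This ratio is a common convention but is an assumption here, since the abstract does not define an accuracy measure. The job records and the name `estimate_accuracy` are hypothetical.

```python
from statistics import mean


def estimate_accuracy(actual_runtime, requested_runtime):
    """Ratio of actual to requested runtime (assumed definition).

    A value of 1.0 means a perfect estimate; values near 0 mean the
    user requested far more time than the job actually used.
    """
    return actual_runtime / requested_runtime


# Hypothetical (actual, requested) runtimes in seconds.
records = [(1800, 3600), (3600, 3600), (900, 3600)]
accuracies = [estimate_accuracy(a, r) for a, r in records]
mean_accuracy = mean(accuracies)
print(accuracies, mean_accuracy)
```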

Acknowledgment

This work was supported in part by the French National Research Agency (ANR) in the frame of DASH (ANR-17-CE25-0004) and in part by the Inria Exploratory project REPAS.

Author information

Correspondence to Robin Boëzennec.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Boëzennec, R., Dufossé, F., Pallez, G. (2023). Optimization Metrics for the Evaluation of Batch Schedulers in HPC. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2023. Lecture Notes in Computer Science, vol 14283. Springer, Cham. https://doi.org/10.1007/978-3-031-43943-8_5

  • DOI: https://doi.org/10.1007/978-3-031-43943-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43942-1

  • Online ISBN: 978-3-031-43943-8

  • eBook Packages: Computer Science, Computer Science (R0)
