Abstract
A detailed profile of exascale applications helps in understanding the computation, communication, and memory requirements of exascale systems, and provides the insight necessary for fine-tuning the computing architecture. Obtaining such a profile is challenging because exascale systems will process unprecedented amounts of data; profiling applications at the target scale would require the exascale machine itself. In this work we propose a methodology to extrapolate the exascale profile from experimental observations over datasets feasible on today's machines. Extrapolation models are carefully selected by means of statistical techniques, and a high-level complexity analysis is included in the selection process to speed up the learning phase and to improve the accuracy of the final model. We extrapolate run-time properties of the target applications, including information about the instruction mix, memory access pattern, instruction-level parallelism, and communication requirements. Compared to state-of-the-art techniques, the proposed methodology reduces the prediction error by an order of magnitude on the instruction count and improves the accuracy by up to 1.3\(\times \) for the memory access pattern, and by more than 2\(\times \) for the communication requirements.
Notes
In theory, one may also generate predictions for worst- and best-case executions, but this is beyond the scope of this paper.
During cross-validation, the error for a given metric is measured relative to the difference between its maximum and minimum values in the training set. This normalization avoids giving excessive weight to runs with small values of \(\theta (\varvec{n})\).
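The range normalization described in this note can be sketched as follows; the function name and sample values are illustrative, not taken from the paper:

```python
def range_normalized_error(predicted, actual, train_values):
    """Error relative to the spread of the metric over the training set.

    Dividing by (max - min) of the training values, instead of by the
    actual value itself, keeps runs with small metric values from
    dominating the cross-validation score.
    """
    spread = max(train_values) - min(train_values)
    return abs(predicted - actual) / spread

# Illustrative training observations of some metric theta(n)
train = [100.0, 250.0, 600.0]

# An absolute error of 50 is weighted the same for small and large runs
print(range_normalized_error(150.0, 100.0, train))  # 0.1
print(range_normalized_error(650.0, 600.0, train))  # 0.1
```

A plain relative error would score the first prediction as 50 % off and the second as about 8 % off, even though both miss by the same absolute amount; the range normalization removes that bias.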
Decreasing trends are handled in a similar way, but are rarely found in practice.
The relative error can be written as \(\hat{a}(\varvec{n})/a(\varvec{n})-1\); it is 900 % when \(\hat{a}\) is 10 times larger than \(a\) and \(-90\) % when \(\hat{a}\) is 10 times smaller than \(a\).
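As a quick numerical check of this definition (a standalone sketch, not code from the paper):

```python
def relative_error(a_hat, a):
    """Relative error of prediction a_hat against measurement a: a_hat/a - 1."""
    return a_hat / a - 1.0

# A 10x overestimate gives +900 %, a 10x underestimate gives about -90 %,
# illustrating the asymmetry of this error measure.
print(relative_error(10.0, 1.0) * 100)            # 900.0
print(round(relative_error(1.0, 10.0) * 100, 1))  # -90.0
```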
We adopt the default configuration available in the Mathematica environment [29].
For these metrics the SOTA method refers to the same extrapolation technique used for the instruction count mix proposed by Calotoiu et al. [9].
There are 512 sub-bands, each partitioned in 512 channels.
Construction of the SKA is planned to begin in 2018. At that point in time, different Xeon-like architectures may be available providing different computational power.
References
Agerwala, T.: Exascale computing: the challenges and opportunities in the next decade. In: 2010 IEEE 16th International Symposium on High Performance Computer Architecture (HPCA), p. 1 (2010)
Almeida, A., Castel-Branco, M., Falcao, A.: Linear regression for calibration lines revisited: weighting schemes for bioanalytical methods. J. Chromatogr. B 774(2), 215–222 (2002)
Anghel, A., Rodríguez, G., Prisacari, B., Minkenberg, C., Dittmann, G.: Quantifying communication in graph analytics. In: High Performance Computing—30th International Conference, ISC High Performance 2015, Frankfurt, Germany, July 12–16, 2015, Proceedings, pp. 472–487 (2015)
Anghel, A., Vasilescu, L.M., Jongerius, R., Dittmann, G., Mariani, G.: An instrumentation approach for hardware-agnostic software characterization. In: Proceedings of the 12th ACM International Conference on Computing Frontiers, CF ’15, pp. 3:1–3:8, New York, NY, USA, ACM (2015)
Bhattacharyya, A., Hoefler, T.: Pemogen: Automatic adaptive performance modeling during program runtime. In: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT ’14, pp. 393–404, New York, NY, USA, ACM (2014)
Breugh, M.B., Eyerman, S., Eeckhout, L.: Mechanistic analytical modeling of superscalar in-order processor performance. ACM Trans. Archit. Code Optim. 11(4), 50:1–50:26 (2015)
Graph 500 benchmark. http://www.graph500.org
Broekema, P., van Nieuwpoort, R., Bal, H.: The Square Kilometre Array science data processor. Preliminary compute platform design. J. Instrum. 10(07), C07004 (2015)
Calotoiu, A., Hoefler, T., Poke, M., Wolf, F.: Using automated performance modeling to find scalability bugs in complex codes. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pp. 45:1–45:12, New York, NY, USA, ACM (2013)
Carlson, T., Heirman, W., Eeckhout, L.: Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In: High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, pp. 1–12 (2011)
Checconi, F., Petrini, F., Willcock, J., Lumsdaine, A., Choudhury, A.R., Sabharwal, Y.: Breaking the speed and scalability barriers for graph exploration on distributed-memory machines. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pp. 13:1–13:12, Los Alamitos, CA, USA, IEEE Computer Society Press (2012)
Cook, H., Skadron, K.: Predictive design space exploration using genetically programmed response surfaces. In: Proceedings of the 45th Annual Design Automation Conference, DAC ’08, pp. 960–965, New York, NY, USA, ACM (2008)
Cornwell, T.J., Golap, K., Bhatnagar, S.: The noncoplanar baselines effect in radio interferometry: the w-projection algorithm. IEEE J. Sel. Top. Signal Process. 2(5), 647–657 (2008)
Eyerman, S., Eeckhout, L., Karkhanis, T., Smith, J .E.: A mechanistic performance model for superscalar out-of-order processors. ACM Trans. Comput. Syst. 27(2), 3:1–3:37 (2009)
Fiorin, L., Vermij, E., Van Lunteren, J., Jongerius, R., Hagleitner, C.: An energy-efficient custom architecture for the SKA1-Low central signal processor. In: Proceedings of the 12th ACM International Conference on Computing Frontiers, CF ’15, pp. 5:1–5:8, New York, NY, USA, ACM (2015)
Gayawan, E., Ipinyomi, R.A.: A comparison of Akaike, Schwarz and R square criteria for model selection using some fertility models. Aust. J. Basic Appl. Sci. 3(4), 3524–3530 (2009)
Gluhovsky, I.: Determining output uncertainty of computer system models. Perform. Eval. 64(2), 103–125 (2007)
Gluhovsky, I., Vengerov, D., O’Krafka, B.: Comprehensive multivariate extrapolation modeling of multiprocessor cache miss rates. ACM Trans. Comput. Syst. (TOCS) 25(1), 1–32 (2007)
Guo, Q., Chen, T., Chen, Y., Li, L., Hu, W.: Microarchitectural design space exploration made fast. Microprocess. Microsyst. 37(1), 41–51 (2013)
Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco (2003)
Hutter, F., Xu, L., Hoos, H.H., Leyton-Brown, K.: Algorithm runtime prediction: methods & evaluation. Artif. Intell. 206, 79–111 (2014)
Jongerius, R., Mariani, G., Anghel, A., Dittmann, G., Vermij, E., Corporaal, H.: Analytic processor model for fast design-space exploration. In: 2015 33rd IEEE International Conference on Computer Design (ICCD), pp. 440–443 (2015)
Jongerius, R., Wijnholds, S., Nijboer, R., Corporaal, H.: An end-to-end computing model for the Square Kilometre Array. Computer 47(9), 48–54 (2014)
Kushilevitz, E., Nisan, N.: Communication Complexity. Cambridge University Press, New York (1997)
Li, B., Peng, L., Ramadass, B.: Accurate and efficient processor performance prediction via regression tree based modeling. J. Syst. Archit. 55(10–12), 457–467 (2009)
Mariani, G., Anghel, A., Jongerius, R., Dittmann, G.: Scaling application properties to exascale. In: Proceedings of the 12th ACM International Conference on Computing Frontiers, CF ’15, pp. 31:1–31:8, New York, NY, USA, ACM (2015)
Mariani, G., Palermo, G., Zaccaria, V., Silvano, C.: OSCAR: an optimization methodology exploiting spatial correlation in multicore design spaces. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 31(5), 740–753 (2012)
Marin, G., Mellor-Crummey, J.: Cross-architecture performance predictions for scientific applications using parameterized models. In: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’04/Performance ’04, pp. 2–13, New York, NY, USA, ACM (2004)
Wolfram Research: Mathematica 10 (2014). http://www.wolfram.com/mathematica/
Montgomery, D.: Design and Analysis of Experiments, 8th edn. Wiley, Hoboken (2012)
Sipser, M.: Introduction to the Theory of Computation. Thomson Course Technology, Boston (2006)
SPEC CPU benchmarks. http://www.spec.org/benchmarks.html
The LLVM compiler infrastructure project. http://www.llvm.org/
Ueno, K., Suzumura, T.: Highly scalable graph search for the Graph500 benchmark. In: Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’12, pp. 149–160, New York, NY, USA, ACM (2012)
White, H.: A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48(4), 817–838 (1980)
Wong, A., Rexachs, D., Luque, E.: Parallel application signature for performance analysis and prediction. IEEE Trans. Parallel Distrib. Syst. 26(7), 2009–2019 (2015)
Zhang, Z., Xiaofeng, B.: Comparison about the three central composite designs with simulation. In: International Conference on Advanced Computer Control. ICACC ’09, pp. 163–167 (2009)
Acknowledgments
This work is conducted in the context of the joint ASTRON and IBM DOME project and is funded by the Netherlands Organisation for Scientific Research (NWO), the Dutch Ministry of EL&I, and the Province of Drenthe.
Cite this article
Mariani, G., Anghel, A., Jongerius, R. et al. Scaling Properties of Parallel Applications to Exascale. Int J Parallel Prog 44, 975–1002 (2016). https://doi.org/10.1007/s10766-016-0412-y