Abstract
Correctly estimating the speed-up of a parallel embedded application is crucial for efficiently comparing different parallelization techniques, task graph transformations, or mapping and scheduling solutions. Unfortunately, especially for control-dominated applications, task correlations may heavily affect the execution time of the solutions, and this is usually not properly taken into account during performance analysis. We propose a methodology that combines a single profiling of the initial sequential specification with different decisions in terms of partitioning, mapping, and scheduling in order to better estimate the actual speed-up of these solutions. We validated our approach on a multi-processor simulation platform: experimental results show that our methodology, by effectively identifying the correlations among tasks, significantly outperforms existing approaches for speed-up estimation. Indeed, we obtained an average absolute error below 5%, even when compiling the code with different optimization levels.
Appendix 1: Example of Application of Task Graph Estimation Technique based on Path Profiling
This appendix shows how the proposed methodology is applied to estimate the performance of the example presented in Sect. 2 when SolB is considered: Task1 and Task3 are assigned to \(CPU_\alpha\), while Task2a and Task2b are assigned to \(CPU_\beta\). The resulting \(\overline{HTG}\) is shown in Fig. 7: the edge \(<Task1,Task3>\) is added to represent the scheduling order, as discussed in Sect. 5.2. The estimation starts with the application of Hierarchical Path Profiling (HPP) on the host machine, whose results are reported in Table 11. For the sake of readability, we also report the sequence of basic blocks that compose each path, even though this information is equivalent to that provided by the corresponding Control Region Path (CRP). The order of the Control Dependence Regions in a Control Region Path is not relevant, since the basic blocks are interleaved during execution. The table shows how HPP is able to profile the paths and to collect correlations about the execution of basic blocks before and after a loop, even if the loop is executed, with a representation that can be easily mapped onto the HTG.
Before estimating the execution time (\(HTC_0\)) of func_0, \(HTC_5\) is estimated as follows:
1. the contribution \(BC_{i,t}\) of each basic block is computed (line 2 of Algorithm 1); the results are reported in Table 12a (e.g. \(BC_{9,6} = f(o_{13}) = 1\), since \(o_{13}\) is the only statement of \(BB_9\));
2. the contribution \(\overline{BC}_{i,t}\) of each basic block, including nested loops, is computed (lines 4 and 6; Table 13a; e.g. \(\overline{BC}_{7,6} = BC_{7,6}\), since Task6 is simple);
3. the contribution \(CC_{c,t}\) is computed by summing the contributions of the single basic blocks (line 9; Table 14a; e.g. \(CC_{E,6} = \overline{BC}_{6,6} + \overline{BC}_{9,6} = 3\), since \(CDR_E\) is composed of \(BB_6\) and \(BB_9\));
4. the contributions of the single Control Dependence Regions are summed to compute the contributions \(TPC_{p,t}\) (line 13; Table 15a; e.g. \(TPC_{ {{}}, 6} = CC_B + CC_E + CC_I = 6\), since the path is composed of B, E and I);
5. the overhead for task management is added to \(TPC_{p,t}\) to compute \(\overline{TPC}_{p,t}\) (line 15; Table 16a); since this task graph has no overhead cost, \(\overline{TPC}_{p,t} = TPC_{p,t}\) (e.g. \(\overline{TPC}_{ {{}}, 6} = TPC_{ {{}}, 6}\), since Task6 has no overhead cost);
6. the start and end times of each task are computed (lines 20 and 21; Table 17a; e.g. \(START_{ {{}}, 6} = STOP_{{{}}, Entry5}\), since \(Entry_5\) is the only predecessor of Task6, and \(STOP_{ {{}}, 6} = START_{ {{}}, 6} + \overline{TPC}_{ {{}}, 6}\)); the execution times of the two paths are then computed as the end time of task Exit (line 25; last line of Table 17a; e.g. \(PC_{{}} = STOP_{ {{}}, Exit_5}\));
7. the estimation of the whole \(HTG_5\) can be computed (line 27):

$$\begin{aligned} HTC_5 = N_5 \cdot \frac{PC_{{}}\cdot f_{{}}+PC_{{}}\cdot f_{{}}}{f_{{}}+f_{{}}} = 10 \cdot \frac{105\cdot 100 + 6\cdot 0}{100+0} = 1050 \end{aligned} \qquad (13)$$
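The arithmetic of Eq. (13) can be reproduced with a few lines of Python. This is only a minimal sketch, not the authors' implementation: the function names and the path labels `path_a` and `path_b` are hypothetical placeholders (the path identifiers are not recoverable from the text), and \(N_5\) is read here as the iteration count that multiplies the frequency-weighted average path cost, which is an assumption.

```python
def weighted_path_cost(path_costs, path_freqs):
    """Average of the path costs PC_p, weighted by the profiled frequencies f_p."""
    total_freq = sum(path_freqs.values())
    if total_freq == 0:
        raise ValueError("no profiled executions for this task graph")
    return sum(path_costs[p] * path_freqs[p] for p in path_costs) / total_freq


def loop_cost(iterations, path_costs, path_freqs):
    """Cost of a loop task graph: iteration count times the weighted path cost."""
    return iterations * weighted_path_cost(path_costs, path_freqs)


# Numbers taken from Eq. (13); "path_a" and "path_b" are placeholder labels.
pc_5 = {"path_a": 105, "path_b": 6}
f_5 = {"path_a": 100, "path_b": 0}
htc_5 = loop_cost(10, pc_5, f_5)
print(htc_5)
```

A path that was never executed during profiling (frequency 0) does not affect the result, which is why the cost 6 of the second path has no influence on \(HTC_5\).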
After \(HTC_5\) has been estimated, \(HTC_0\) can be estimated in the same way and Fig. 8 shows how the different contributions are combined. These contributions are:
1. the contribution of each basic block \(BC_{i,t}\) (line 2), obtained from the clock cycles of Table 1; the results are reported in Table 12b;
2. the contribution of each basic block including nested loops \(\overline{BC}_{i,t}\) (lines 4 and 6); the results are reported in Table 13b; note in particular that \(\overline{BC}_{5,2a}=BC_{5,2a} +HTC_5=1+1050\);
3. the contribution of each Control Dependence Region \(CC_{c,t}\) (line 9); the results are reported in Table 14b;
4. the contribution of each path to each task \(TPC_{p,t}\) (line 13); the results are reported in Table 15b;
5. the contribution of each path to each task, along with the overhead cost, \(\overline{TPC}_{p,t}\) (line 15); the creation cost (50) is added to Task1 and Task2a; the synchronization and destruction cost (10) is added to Task3 and Task2b; the results are reported in Table 16b;
6. \(START_{p,t}\) and \(STOP_{p,t}\) (lines 20 and 21); the results are reported in Table 17b, where the selected topological order is: \(Entry_0\)-Task0-Task1-Task2a-Task2b-Task3-Task4-\(Exit_0\);
7. the contribution of each path \(PC_{p}\) (line 25); the results are reported in the last line of Table 17b;
8. \(HTC_0\) in the two cases presented in Sect. 2:
   - the CRPs executed are \(P_{{}}\) and \(P_{{}}\), so the execution time estimated for the parallel version is:

     $$\begin{aligned} HTC_{0} = \frac{PC_{{}} \cdot f_{{}} + PC_{{}} \cdot f_{{}}}{f_{{}} + f_{{}}} = \frac{4171 \cdot 5 + 1126 \cdot 5}{5 + 5} = 2648.5 \end{aligned} \qquad (14)$$

   - the CRPs executed are \(P_{{}}\) and \(P_{{}}\), so the execution time estimated for the parallel version is:

     $$\begin{aligned} HTC_{0} = \frac{PC_{{}} \cdot f_{{}} + PC_{{}} \cdot f_{{}}}{f_{{}} + f_{{}}} = \frac{2122 \cdot 5 + 2122 \cdot 5}{5 + 5} = 2122 \end{aligned} \qquad (15)$$
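Steps 6 and 7 above (start and stop times, then the path cost read at the Exit task) can be illustrated with a short Python sketch. It is not the paper's Algorithm 1. The function name, the toy task graph, and its costs are made up for illustration and are not the values of Tables 16b and 17b. The rule that a task starts when its latest predecessor stops is also an assumption, since the text only shows the single-predecessor case.

```python
def path_cost(topo_order, preds, tpc):
    """Propagate START/STOP along one path; return (start, stop, PC)."""
    start, stop = {}, {}
    for task in topo_order:
        # Assumption: a task starts when its latest predecessor has stopped.
        start[task] = max((stop[p] for p in preds.get(task, ())), default=0)
        # STOP = START + task cost (including overhead); Entry/Exit cost 0.
        stop[task] = start[task] + tpc.get(task, 0)
    # The path cost PC is the stop time of the last task (Exit).
    return start, stop, stop[topo_order[-1]]


# Hypothetical fork-join graph: A forks into B and C, which join at Exit.
order = ["Entry", "A", "B", "C", "Exit"]
preds = {"A": ["Entry"], "B": ["A"], "C": ["A"], "Exit": ["B", "C"]}
tpc = {"A": 5, "B": 10, "C": 3}  # made-up per-task costs for one path

start, stop, pc = path_cost(order, preds, tpc)
print(start["Exit"], pc)
```

Because tasks are visited in topological order, every predecessor's stop time is already known when a task is processed. A scheduling edge such as \(<Task1,Task3>\) would simply appear as one more entry in `preds`.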
Finally, the speed-up for the two situations presented in Sect. 2 can be computed. The execution time of the sequential specification is 3123 cycles in both cases, so the estimated speed-ups are 1.18 and 1.47, respectively.
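As a quick check of these figures, the speed-up is the ratio between the sequential execution time and the estimated parallel execution time. The sketch below only redoes this division with the values of Eqs. (14) and (15). The function name and the labels `case 1` and `case 2` are hypothetical.

```python
def speed_up(sequential_cycles, parallel_cycles):
    """Speed-up as sequential time over estimated parallel time."""
    return sequential_cycles / parallel_cycles


sequential = 3123  # cycles of the sequential specification
estimates = {"case 1": 2648.5, "case 2": 2122}  # HTC_0 from Eqs. (14) and (15)

speed_ups = {name: round(speed_up(sequential, cycles), 2)
             for name, cycles in estimates.items()}
print(speed_ups)
```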
Cite this article
Lattuada, M., Pilato, C. & Ferrandi, F. Performance Estimation of Task Graphs Based on Path Profiling. Int J Parallel Prog 44, 735–771 (2016). https://doi.org/10.1007/s10766-015-0372-7
DOI: https://doi.org/10.1007/s10766-015-0372-7