Link to original content: https://unpaywall.org/10.1145/3307681.3325399
Making Root Cause Analysis Feasible for Large Code Bases | Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing

Making Root Cause Analysis Feasible for Large Code Bases: A Solution Approach for a Climate Model

Published: 17 June 2019

Abstract

Large-scale simulation codes that model complicated science and engineering applications typically have huge and complex code bases. For such simulation codes, where bit-for-bit comparisons are too restrictive, finding the source of statistically significant discrepancies (e.g., from a previous version, alternative hardware or supporting software stack) in output is non-trivial at best. Although there are many tools for program comprehension through debugging or slicing, few (if any) scale to a model as large as the Community Earth System Model (CESM™), which consists of more than 1.5 million lines of Fortran code. Currently for the CESM, we can easily determine whether a discrepancy exists in the output using a by now well-established statistical consistency testing tool. However, this tool provides no information as to the possible cause of the detected discrepancy, leaving developers in a seemingly impossible (and frustrating) situation. Therefore, our aim in this work is to provide the tools to enable developers to trace a problem detected through the CESM output to its source. To this end, our strategy is to reduce the search space for the root cause(s) to a tractable size via a series of techniques that include creating a directed graph of internal CESM variables, extracting a subgraph (using a form of hybrid program slicing), partitioning into communities, and ranking nodes by centrality. Runtime variable sampling then becomes feasible in this reduced search space. We demonstrate the utility of this process on multiple examples of CESM simulation output by illustrating how sampling can be performed as part of an efficient parallel iterative refinement procedure to locate error sources, including sensitivity to CPU instructions. By providing CESM developers with tools to identify and understand the reason for statistically distinct output, we have positively impacted the CESM software development cycle and, in particular, its focus on quality assurance.
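
The graph-reduction idea in the abstract can be illustrated with a toy sketch. The following Python is not the authors' tooling: the variable names and edges are hypothetical, the "slice" is a plain backward reachability search rather than the hybrid slicing the paper describes, the community-partitioning step is omitted, and the ranking uses eigenvector centrality computed by power iteration on the undirected slice (an assumption; the abstract does not state which centrality variant or edge direction is used).

```python
from collections import defaultdict, deque


def backward_slice(edges, target):
    """Return every variable that can influence `target`, plus `target` itself.

    `edges` is a list of (src, dst) pairs meaning "src feeds into dst".
    """
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)
    seen, queue = {target}, deque([target])
    while queue:
        node = queue.popleft()
        for p in preds[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen


def eigenvector_centrality(edges, nodes, iterations=200):
    """Power iteration on (A + I) for the undirected graph induced by `nodes`.

    Adding the identity keeps the iteration from oscillating on bipartite
    graphs without changing the dominant eigenvector.
    """
    nbrs = {n: set() for n in nodes}
    for src, dst in edges:
        if src in nbrs and dst in nbrs and src != dst:
            nbrs[src].add(dst)
            nbrs[dst].add(src)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in nbrs[n]) for n in nodes}
        top = max(new.values())
        score = {n: v / top for n, v in new.items()}
    return score


# Hypothetical dependency graph; none of these names come from CESM.
EDGES = [
    ("t_in", "flux"),
    ("q_in", "flux"),
    ("flux", "cloud"),
    ("cloud", "out_var"),
    ("flux", "out_var"),
    ("unrelated", "other"),
]

slice_nodes = backward_slice(EDGES, "out_var")          # reduced search space
scores = eigenvector_centrality(EDGES, slice_nodes)     # rank within the slice
ranking = sorted(slice_nodes, key=lambda n: (-scores[n], n))
print(ranking)
```

In this toy graph the slice discards `unrelated` and `other`, and the variable that touches every other node in the slice ranks first, which is the kind of node one would sample first at runtime.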


Published In

HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
June 2019
278 pages
ISBN:9781450366700
DOI:10.1145/3307681
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. abstract syntax tree
  2. community detection
  3. eigenvector centrality
  4. graph analysis
  5. program slicing
  6. root cause analysis

Qualifiers

  • Research-article

Funding Sources

  • Intel Parallel Computing Center for Weather and Climate Simulation

Conference

HPDC '19

Acceptance Rates

HPDC '19 Paper Acceptance Rate 22 of 106 submissions, 21%;
Overall Acceptance Rate 166 of 966 submissions, 17%

Article Metrics

  • Downloads (Last 12 months)18
  • Downloads (Last 6 weeks)2
Reflects downloads up to 01 Dec 2024

Cited By

  • (2023) Hypergraph-based locality-enhancing methods for graph operations in Big Data applications. The International Journal of High Performance Computing Applications 38:3 (210-224). DOI: 10.1177/10943420231214532. Online publication date: 20-Nov-2023
  • (2021) Keeping science on keel when software moves. Communications of the ACM 64:2 (66-74). DOI: 10.1145/3382037. Online publication date: 25-Jan-2021
  • (2021) On Preserving Scientific Integrity for Climate Model Data in the HPC Era. Computing in Science and Engineering 23:6 (16-24). DOI: 10.1109/MCSE.2021.3119509. Online publication date: 1-Nov-2021
  • (2020) Spying on the Floating Point Behavior of Existing, Unmodified Scientific Applications. Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing (5-16). DOI: 10.1145/3369583.3392673. Online publication date: 23-Jun-2020
  • (2020) RSX: Reproduction Scenario Extraction Technique for Business Application Workloads in DBMS. 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) (91-96). DOI: 10.1109/ISSREW51248.2020.00043. Online publication date: Oct-2020
  • (2020) Building a Classification System for Failed Test Reports: Industrial Experience. 2020 IEEE International Conference On Artificial Intelligence Testing (AITest) (91-98). DOI: 10.1109/AITEST49225.2020.00021. Online publication date: Aug-2020
  • (2019) Investigating the Impact of Mixed Precision on Correctness for a Large Climate Code. 2019 IEEE/ACM 3rd International Workshop on Software Correctness for HPC Applications (Correctness) (44-51). DOI: 10.1109/Correctness49594.2019.00011. Online publication date: Nov-2019
