Abstract
Logs are used extensively during the development and maintenance of software systems. They record runtime events and allow code execution to be tracked, which enables critical tasks such as troubleshooting and fault detection. However, large-scale software systems generate massive volumes of semi-structured log records, posing a major challenge for automated analysis. Parsing semi-structured records with free-form text log messages into structured templates is the first and crucial step that enables further analysis. Existing approaches rely on log-specific heuristics or manual rule extraction. They are often specialized for certain log types, which limits both their parsing performance and their generalization. We propose a novel parsing technique called NuLog that uses a self-supervised learning model and formulates the parsing task as masked language modeling (MLM). During parsing, the model also summarizes the logs in the form of a vector embedding, which allows the MLM to serve as pre-training for a downstream anomaly detection task. We evaluate the parsing performance of NuLog on 10 real-world log datasets and compare the results with 12 parsing techniques. The results show that NuLog outperforms existing methods in parsing accuracy, with an average of 99%, and achieves the lowest edit distance to the ground-truth templates. Additionally, two case studies demonstrate the ability of the approach to support log-based anomaly detection in both supervised and unsupervised scenarios. The results show that NuLog can be successfully used to support troubleshooting tasks. The implementation is available at https://github.com/nulog/nulog.
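To make the masked-language-modeling formulation concrete, the following is a minimal sketch and not NuLog's implementation. It masks each token of a log message in turn, asks a predictor how likely the original token is at that position, keeps predictable tokens as constant template parts, and replaces unpredictable ones with a wildcard. The predictor here is a simple count-based stand-in for the self-supervised neural model. The names `CountingMaskPredictor` and `extract_template`, the `<*>` wildcard, the 0.5 threshold, and the sample log lines are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

WILDCARD = "<*>"  # assumed placeholder for variable (parameter) tokens


class CountingMaskPredictor:
    """Toy stand-in for a masked language model (assumption, not NuLog).

    Estimates P(token | message length, masked position) from raw counts.
    """

    def __init__(self):
        self.counts = defaultdict(Counter)

    def fit(self, messages):
        # Count which tokens occur at each position, per message length.
        for tokens in messages:
            for pos, tok in enumerate(tokens):
                self.counts[(len(tokens), pos)][tok] += 1
        return self

    def probability(self, tokens, pos):
        """Probability that the masked position holds its actual token."""
        slot = self.counts[(len(tokens), pos)]
        total = sum(slot.values())
        return slot[tokens[pos]] / total if total else 0.0


def extract_template(tokens, predictor, threshold=0.5):
    """Mask each token in turn; keep it only if it is predictable."""
    template = []
    for pos, tok in enumerate(tokens):
        predictable = predictor.probability(tokens, pos) >= threshold
        template.append(tok if predictable else WILDCARD)
    return " ".join(template)


# Hypothetical log messages, invented for this example.
logs = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Connection from 10.0.0.3 closed",
    "Deleting block blk_1 file /tmp/a",
    "Deleting block blk_2 file /tmp/b",
    "Deleting block blk_3 file /tmp/c",
]
messages = [line.split() for line in logs]
predictor = CountingMaskPredictor().fit(messages)
templates = [extract_template(m, predictor) for m in messages]

for template in sorted(set(templates)):
    print(template)
```

On these six messages the sketch yields two templates, `Connection from <*> closed` and `Deleting block <*> file <*>`. A learned model would replace the count table with predictions conditioned on the surrounding tokens, which is what lets it generalize beyond fixed positions and message lengths.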
S. Nedelkoski and J. Bogatinovski contributed equally.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Nedelkoski, S., Bogatinovski, J., Acker, A., Cardoso, J., Kao, O. (2021). Self-supervised Log Parsing. In: Dong, Y., Mladenić, D., Saunders, C. (eds) Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12460. Springer, Cham. https://doi.org/10.1007/978-3-030-67667-4_8
DOI: https://doi.org/10.1007/978-3-030-67667-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67666-7
Online ISBN: 978-3-030-67667-4
eBook Packages: Computer Science, Computer Science (R0)