Abstract
Real-world regular expressions (regexes for short) have a wide range of applications in software. However, the support for regexes in test generation is insufficient. For example, existing works lack support for some important features such as lookbehind, are not resilient to subtle semantic differences (such as partial/full matching), fall short of Unicode support, leading to loss of test coverage or missed bugs. To address these challenges, in this paper, we propose a novel semantic model for comprehensively modeling the extended features in regexes, with an awareness of different matching semantics (i.e. partial/full matching) and matching precedence (i.e. greedy/lazy matching). To the best of our knowledge, this is the first attempt to consider partial/full matching semantics in modeling and to support lookbehind. Leveraging this model we then develop PowerGen, a tool for deducing matching strings for regexes, which randomly generates matching strings from the input regex effectively. We evaluate PowerGen against nine related state-of-the-art tools. The evaluation results show the high effectiveness and efficiency of PowerGen.
Y. Yan and W. Su—These authors contributed equally.
Zixuan Chen is currently employed at Kuaishou Technology.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
To facilitate error identification, we simplify lengthy regexes by isolating the problematic fragment.
- 2.
References
Aho, A.V.: Algorithms for finding patterns in strings. In: Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity, pp. 255–300. Elsevier and MIT Press (1990)
Arcaini, P., Gargantini, A., Riccobene, E.: MUTREX: a mutation-based generator of fault detecting strings for regular expressions. In: ICST Workshops 2017, pp. 87–96 (2017)
Bartoli, A., Lorenzo, A.D., Medvet, E., Tarlao, F.: Inference of regular expressions for text extraction from examples. IEEE Trans. Knowl. Data Eng. 28(5), 1217–1230 (2016)
Berglund, M., Bester, W., van der Merwe, B.: Formalising boost POSIX regular expression matching. In: Fischer, B., Uustalu, T. (eds.) ICTAC 2018. LNCS, vol. 11187, pp. 99–115. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-02508-3_6
Berglund, M., van der Merwe, B.: Re-examining regular expressions with backreferences. Theor. Comput. Sci. 940, 66–80 (2023)
Berglund, M., van der Merwe, B., van Litsenborgh, S.: Regular expressions with lookahead. J. Univers. Comput. Sci. 27(4), 324–340 (2021)
Brzozowski, J.A.: Derivatives of regular expressions. J. ACM 11(4), 481–494 (1964)
Câmpeanu, C., Salomaa, K., Yu, S.: A formal study of practical regular expressions. Int. J. Found. Comput. Sci. 14(6), 1007–1018 (2003)
Câmpeanu, C., Santean, N.: On the intersection of regex languages with regular languages. Theor. Comput. Sci. 410(24–25), 2336–2344 (2009)
Câmpeanu, C., Yu, S.: Pattern expressions and pattern automata. Inf. Process. Lett. 92(6), 267–274 (2004)
Caron, P., Champarnaud, J.-M., Mignot, L.: Partial derivatives of an extended regular expression. In: Dediu, A.-H., Inenaga, S., Martín-Vide, C. (eds.) LATA 2011. LNCS, vol. 6638, pp. 179–191. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21254-3_13
Chapman, C., Stolee, K.T.: Exploring regular expression usage and context in python. In: ISSTA 2016, pp. 282–293 (2016)
Chapman, C., Wang, P., Stolee, K.T.: Exploring regular expression comprehension. In: ASE 2017, pp. 405–416 (2017)
Chen, T., Flores-Lamas, A., Hague, M., Han, Z., Hu, D., Kan, S., Lin, A.W., Rümmer, P., Wu, Z.: Solving string constraints with regex-dependent functions through transducers with priorities and variables. POPL 6, 1–31 (2022)
Chida, N., Terauchi, T.: On lookaheads in regular expressions with backreferences. In: FSCD 2022. LIPIcs, vol. 228, pp. 15:1–15:18 (2022)
Chris, K.: Regex posix - HaskellWiki. https://wiki.haskell.org/Regex_Posix
D’Antoni, L., Veanes, M.: Automata modulo theories. Commun. ACM 64, 86–95 (2021)
Davis, J.C., Coghlan, C.A., Servant, F., Lee, D.: The impact of regular expression denial of service (ReDoS) in practice: an empirical study at the ecosystem scale. In: ESEC/FSE 2018, pp. 246–256 (2018)
Davis, J.C., IV, L.G.M., Coghlan, C.A., Servant, F., Lee, D.: Why aren’t regular expressions a lingua franca? An empirical study on the re-use and portability of regular expressions. In: ESEC/FSE 2019, pp. 443–454 (2019)
ECMA: ES2018. https://262.ecma-international.org/9.0
Ellul, K., Krawetz, B., Shallit, J.O., Wang, M.W.: Regular expressions: new results and open problems. J. Autom. Lang. Comb. 10(4), 407–437 (2005)
Fent: Randexp.js. https://github.com/fent/randexp.js
Glushkov, V.M.: The abstract theory of automata. Russ. Math. Surv. 16, 1–53 (1961)
Hooimeijer, P., Veanes, M.: An evaluation of automata algorithms for string analysis. In: Jhala, R., Schmidt, D. (eds.) VMCAI 2011. LNCS, vol. 6538, pp. 248–262. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18275-4_18
Larson, E., Kirk, A.: Generating evil test strings for regular expressions. In: ICST 2016, pp. 309–319 (2016)
Li, N., Xie, T., Tillmann, N., de Halleux, J., Schulte, W.: Reggae: automated test generation for programs using complex regular expressions. In: ASE 2009, pp. 515–519 (2009)
Liu, X., Jiang, Y., Wu, D.: A lightweight framework for regular expression verification. In: HASE 2019, pp. 1–8 (2019)
Loring, B., Mitchell, D., Kinder, J.: ExpoSE: practical symbolic execution of standalone JavaScript. In: SPIN 2017, pp. 196–199 (2017)
Loring, B., Mitchell, D., Kinder, J.: Sound regular expression semantics for dynamic symbolic execution of javascript. In: PLDI 2019, pp. 425–438 (2019)
Luo, B., Feng, Y., Wang, Z., Huang, S., Yan, R., Zhao, D.: Marrying up regular expressions with neural networks: A case study for spoken language understanding. In: ACL 2018, pp. 2083–2093 (2018)
Michael, L.G., Donohue, J., Davis, J.C., Lee, D., Servant, F.: Regexes are hard: decision-making, difficulties, and risks in programming regular expressions. In: ASE 2019, pp. 415–426 (2019)
Miller, F.P., Vandome, A.F., McBrewster, J.: Apache maven (2010). https://repo1.maven.org/maven2/
Miyazaki, T., Minamide, Y.: Derivatives of regular expressions with lookahead. J. Inf. Process. 27, 422–430 (2019)
de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78800-3_24
Møller, A.: dk.brics.automaton. https://www.brics.dk/automaton/
npm Inc: npm. https://www.npmjs.com/
O’Connor, C.: Crdoconnor/xeger. https://github.com/crdoconnor/xeger
Okui, S., Suzuki, T.: Disambiguation in regular expression matching via position automata with augmented transitions. In: Domaratzki, M., Salomaa, K. (eds.) CIAA 2010. LNCS, vol. 6482, pp. 231–240. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-18098-9_25
Python Software Foundation: Python package index - pypi. https://pypi.org/
Rampersad, N., Shallit, J.: Detecting patterns in finite regular and context-free languages. Inf. Process. Lett. 110(3), 108–112 (2010)
Salomaa, K., Yu, S.: NFA to DFA transformation for finite languages over arbitrary alphabets. J. Autom. Lang. Comb. 2(3), 177–186 (1998)
Saxena, P., Akhawe, D., Hanna, S., Mao, F., McCamant, S., Song, D.: A symbolic execution framework for JavaScript. In: S &P 2010, pp. 513–528 (2010)
Shen, Y., Jiang, Y., Xu, C., Yu, P., Ma, X., Lu, J.: ReScue: crafting regular expression DoS attacks. In: ASE 2018, pp. 225–235 (2018)
Spishak, E., Dietl, W., Ernst, M.D.: A type system for regular expressions. In: FTfJP 2012, pp. 20–26 (2012)
Stanford, C., Veanes, M., Bjørner, N.: Symbolic Boolean derivatives for efficiently solving extended regular expression constraints. In: PLDI 2021, pp. 620–635 (2021)
Stockmeyer, L.J.: The complexity of decision problems in automata theory and logic. Ph.D. thesis, Massachusetts Institute of Technology, USA (1974)
Su, W., Chen, H., Li, R., Chen, Z.: Modeling regex operators for solving regex crossword puzzles. In: Hermanns, H., et al. (eds.) SETTA 2023, LNCS, vol. 14464, pp. 206–225. Springer, Cham (2023)
Sulzmann, M., Lu, K.Z.M.: POSIX regular expression parsing with derivatives. In: Codish, M., Sumii, E. (eds.) FLOPS 2014. LNCS, vol. 8475, pp. 203–220. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07151-0_13
Tauber, A.: EXREX. https://github.com/asciimoo/exrex
Trinh, M., Chu, D., Jaffar, J.: S3: a symbolic string solver for vulnerability detection in web applications. In: CCS 2014, pp. 1232–1243 (2014)
Unicode: Unicode 15.0.0. https://unicode.org/versions/Unicode15.0.0/
Veanes, M., de Halleux, P., Tillmann, N.: Rex: symbolic regular expression explorer. In: ICST 2010, pp. 498–507 (2010)
Wang, P., Stolee, K.T.: How well are regular expressions tested in the wild? In: ESEC/FSE 2018, pp. 668–678 (2018)
Youssef, M.: Generex. https://github.com/mifmif/Generex
Yu, S.: Regular languages. In: Handbook of Formal Languages, Vol. 1: Word, Language, Grammar, pp. 41–110 (1997)
Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. Work supported by the Natural Science Foundation of Beijing, China (Grant No. 4232038) and the National Natural Science Foundation of China (Grant No. 62372439).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yan, Y. et al. (2024). Deducing Matching Strings for Real-World Regular Expressions. In: Hermanns, H., Sun, J., Bu, L. (eds) Dependable Software Engineering. Theories, Tools, and Applications. SETTA 2023. Lecture Notes in Computer Science, vol 14464. Springer, Singapore. https://doi.org/10.1007/978-981-99-8664-4_19
Download citation
DOI: https://doi.org/10.1007/978-981-99-8664-4_19
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8663-7
Online ISBN: 978-981-99-8664-4
eBook Packages: Computer ScienceComputer Science (R0)