iBet uBet web content aggregator. Adding the entire web to your favor.
iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2020.39/metadata/acm-xml
10.1145/acmotherconferences ACM Other Conferences 0000000 10.5555/0000000 Proceedings of the 28th Annual European Symposium on Algorithms (ESA 2020) ESA 2020 10.4230/LIPIcs.ESA.2020.39 39 10003752.10003809.10010031.10010032 Theory of computation~Pattern matching 500 Practical Performance of Space Efficient Data Structures for Longest Common Extensions https://orcid.org/0000-0002-2004-6781 Dinklage Patrick Department of Computer Science, Technical University of Dortmund, Germany patrick.dinklage@tu-dortmund.de Author Fischer Johannes Department of Computer Science, Technical University of Dortmund, Germany johannes.fischer@cs.tu-dortmund.de Author Herlez Alexander Department of Computer Science, Technical University of Dortmund, Germany alexander.herlez@tu-dortmund.de Author https://orcid.org/0000-0002-2477-1702 Kociumaka Tomasz Department of Computer Science, Bar-Ilan Unviersity, Ramat Gan, Israel kociumaka@mimuw.edu.pl Author https://orcid.org/0000-0002-2379-9455 Kurpicz Florian Department of Computer Science, Technical University of Dortmund, Germany florian.kurpicz@tu-dortmund.de Author 26 08 2020 39:1 39:20

For a text T[1,n], a Longest Common Extension (LCE) query lce_T(i,j) asks for the length of the longest common prefix of the suffixes T[i,n] and T[j,n] identified by their starting positions 1 ≤ i,j ≤ n. A classic problem in stringology asks to preprocess a static text T[1,n] over an alphabet of size σ so that LCE queries can be efficiently answered on-line. Since its introduction in the 1980’s, this problem has found numerous applications: in suffix sorting, edit distance computation, approximate pattern matching, regularities finding, string mining, and many more. Text-book solutions offer O(n) preprocessing time and O(1) query time, but they employ memory-heavy data structures, such as suffix arrays, in practice several times bigger than the text itself.

Very recently, more space efficient solutions using O(nlogσ) bits of total space or even only O(log n) bits of extra space have been proposed: string synchronizing sets [Kempa and Kociumaka, STOC'19, and Birenzwige et al., SODA'20] and in-place fingerprinting [Prezza, SODA'18]. The goal of this article is to present well-engineered implementations of these new solutions and study their practicality on a commonly agreed text corpus. We show that both perform extremely well in practice, with space consumption of only around 10% of the input size for string synchronizing sets (around 20% for highly repetitive texts), and essentially no extra space for fingerprinting. Interestingly, our experiments also show that both solutions become much faster than naive scanning even for finding common prefixes of moderate length, contradicting a common belief that sophisticated data structures for LCE queries are not competitive with naive approaches [Ilie and Tinta, SPIRE'09].

text indexing longest common prefix space efficient data structures
Stephen Alstrup, Gerth Stølting Brodal, and Theis Rauhe. Pattern matching in dynamic texts. In 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 819-828. ACM/SIAM, 2000. URL: http://dl.acm.org/citation.cfm?id=338219.338645. Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with k mismatches. Journal of Algorithms, 50(2):257-275, 2004.10.1016/S0196-6774(03)00097-X Johannes Bahne, Nico Bertram, Marvin Böcker, Jonas Bode, Johannes Fischer, Hermann Foot, Florian Grieskamp, Florian Kurpicz, Marvin Löbel, Oliver Magiera, Rosa Pink, David Piper, and Christopher Poeplau. Sacabench: Benchmarking suffix array construction. In 26th International Symposium on String Processing and Information Retrieval (SPIRE), volume 11811 of Lecture Notes in Computer Science, pages 407-416. Springer, 2019.10.1007/978-3-030-32686-9_29 Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The "runs" theorem. SIAM Journal on Computing, 46(5):1501-1514, 2017.10.1137/15M1011032 Hideo Bannai, Shunsuke Inenaga, and Dominik Köppl. Computing all distinct squares in linear time for integer alphabets. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 78 of LIPIcs, pages 22:1-22:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.10.4230/LIPIcs.CPM.2017.22 Philip Bille, Anders Roy Christiansen, Patrick Hagge Cording, and Inge Li Gørtz. Finger search in grammar-compressed strings. Theory of Computing Systems, 62(8):1715-1735, 2018.10.1007/s00224-017-9839-9 Philip Bille, Johannes Fischer, Inge Li Gørtz, Tsvi Kopelowitz, Benjamin Sach, and Hjalte Wedel Vildhøj. Sparse text indexing in small space. ACM Transactions on Algorithms, 12(3):39:1-39:19, 2016.10.1145/2836166 Philip Bille, Inge Li Gørtz, Patrick Hagge Cording, Benjamin Sach, Hjalte Wedel Vildhøj, and Søren Vind. Fingerprints in compressed strings. Journal of Computer and System Sciences, 86:171-180, 2017.10.1016/j.jcss.2017.01.002 Philip Bille, Inge Li Gørtz, Mathias Bæk Tejs Knudsen, Moshe Lewenstein, and Hjalte Wedel Vildhøj. Longest common extensions in sublinear space. In 26th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 9133 of Lecture Notes in Computer Science, pages 65-76. Springer, 2015.10.1007/978-3-319-19929-0_6 Philip Bille, Inge Li Gørtz, Benjamin Sach, and Hjalte Wedel Vildhøj. Time-space trade-offs for longest common extensions. Journal of Discrete Algorithms, 25:42-50, 2014.10.1016/j.jda.2013.06.003 Timo Bingmann, Andreas Eberle, and Peter Sanders. Engineering parallel string sorting. Algorithmica, 77(1):235-286, 2017.10.1007/s00453-015-0071-1 Or Birenzwige, Shay Golan, and Ely Porat. Locally consistent parsing for text indexing in small space. In 30th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 607-626. SIAM, 2020.10.1137/1.9781611975994.37 Maxime Crochemore, Roman Kolpakov, and Gregory Kucherov. Optimal bounds for computing α-gapped repeats. Information and Computation, 268, 2019.10.1016/j.ic.2019.104434 Roman Dementiev, Lutz Kettner, Jens Mehnert, and Peter Sanders. Engineering a sorted list data structure for 32 bit key. In Sixth Workshop on Algorithm Engineering and Experiments (ALENEX) and the First Workshop on Analytic Algorithmics and Combinatorics (ANALCO), pages 142-151. SIAM, 2004. Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. On the sorting-complexity of suffix tree construction. Journal of the ACM, 47(6):987-1011, 2000.10.1145/355541.355547 Héctor Ferrada and Gonzalo Navarro. Improved range minimum queries. Journal of Discrete Algorithms, 43:72-80, 2017.10.1016/j.jda.2016.09.002 Paolo Ferragina and Gonzalo Navarro. Pizza&Chili corpus: Compressed indexes and their testbeds. URL: http://pizzachili.dcc.uchile.cl/. Johannes Fischer and Volker Heun. Theoretical and practical improvements on the rmq-problem, with applications to LCA and LCE. In 17th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 4009 of Lecture Notes in Computer Science, pages 36-48. Springer, 2006.10.1007/11780441_5 Johannes Fischer, Tomohiro I, and Dominik Köppl. Deterministic sparse suffix sorting on rewritable texts. In 12th Latin American Symposium on Theoretical Informatics (LATIN), volume 9644 of Lecture Notes in Computer Science, pages 483-496. Springer, 2016.10.1007/978-3-662-49529-2_36 Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster entropy-bounded compressed suffix trees. Theoretical Computer Science, 410(51):5354-5364, 2009.10.1016/j.tcs.2009.09.012 Johannes Fischer, Veli Mäkinen, and Niko Välimäki. Space efficient string mining under frequency constraints. In 8th IEEE International Conference on Data Mining (ICDM), pages 193-202. IEEE Computer Society, 2008.10.1109/ICDM.2008.32 Zvi Galil and Raffaele Giancarlo. Improved string matching with k mismatches. SIGACT News, 17(4):52-54, 1986.10.1145/8307.8309 Paweł Gawrychowski, Adam Karczmarz, Tomasz Kociumaka, Jakub Łącki, and Piotr Sankowski. Optimal dynamic strings. In 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1509-1528. SIAM, 2018.10.1137/1.9781611975031.99 Paweł Gawrychowski and Tomasz Kociumaka. Sparse suffix tree construction in optimal time and space. In 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 425-439. SIAM, 2017.10.1137/1.9781611974782.27 Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Joachim Gudmundsson and Jyrki Katajainen, editors, 13th International Symposium on Experimental Algorithms, SEA 2014, volume 8504 of LNCS, pages 326-337. Springer, 2014.10.1007/978-3-319-07959-2_28 Dan Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.10.1017/cbo9780511574931 Tomohiro I. Longest common extensions with recompression. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 78 of LIPIcs, pages 18:1-18:15. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2017.10.4230/LIPIcs.CPM.2017.18 Tomohiro I, Wataru Matsubara, Kouji Shimohira, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, Kazuyuki Narisawa, and Ayumi Shinohara. Detecting regularities on grammar-compressed strings. Information and Computation, 240:74-89, 2015.10.1016/j.ic.2014.09.009 Lucian Ilie and Liviu Tinta. Practical algorithms for the longest common extension problem. In 16th International Symposium on String Processing and Information Retrieval (SPIRE), volume 5721 of Lecture Notes in Computer Science, pages 302-309. Springer, 2009.10.1007/978-3-642-03784-9_30 Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM, 53(6):918-936, 2006.10.1145/1217856.1217858 Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.10.1147/rd.312.0249 Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 756-767. ACM, 2019.10.1145/3313276.3316368 Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal pattern matching queries in a text and applications. In 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 532-551. SIAM, 2015.10.1137/1.9781611973730.36 Dmitry Kosolobov. Tight lower bounds for the longest common extension problem. Information Processing Letters, 125:26-29, 2017.10.1016/j.ipl.2017.05.003 Gad M. Landau, Eugene W. Myers, and Jeanette P. Schmidt. Incremental string comparison. SIAM Journal on Computing, 27(2):557-582, 1998.10.1137/S0097539794264810 Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239-249, 1986.10.1016/0304-3975(86)90178-7 Gad M. Landau and Uzi Vishkin. Fast string matching with k differences. Journal of Computer and System Sciences, 37(1):63-78, 1988.10.1016/0022-0000(88)90045-1 Yoshiaki Matsuoka, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Semi-dynamic compact index for short patterns and succinct van Emde Boas tree. In 26th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 9133 of Lecture Notes in Computer Science, pages 355-366. Springer, 2015.10.1007/978-3-319-19929-0_30 Kurt Mehlhorn, R. Sundar, and Christian Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17(2):183-198, 1997.10.1007/BF02522825 Yuta Mori. sais: An implementation of the induced sorting algorithm. URL: https://sites.google.com/site/yuta256/. Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 2016.10.1017/cbo9781316588284 Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully dynamic data structure for LCE queries in compressed space. In 41st International Symposium on Mathematical Foundations of Computer Science (MFCS), volume 58 of LIPIcs, pages 72:1-72:15. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2016.10.4230/LIPIcs.MFCS.2016.72 Cinzia Pizzi. Missmax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms for Molecular Biology, 11:6, 2016.10.1186/s13015-016-0072-x Nicola Prezza. In-place sparse suffix sorting. In 29th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1496-1508. SIAM, 2018.10.1137/1.9781611975031.98 Wei Quan, Bo Liu, and Yadong Wang. SALT: a fast, memory-efficient and snp-aware short read alignment tool. In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1774-1779. IEEE, 2019.10.1109/BIBM47256.2019.8983162 Süleyman Cenk Sahinalp and Uzi Vishkin. On a parallel-algorithms method for string matching problems. In Second Italian Conference on Algorithms and Complexity (CIAC), volume 778 of LNCS, pages 22-32. Springer, 1994.10.1007/3-540-57811-0_3 Avi Srivastava, Hirak Sarkar, Nitish Gupta, and Robert Patro. Rapmap: a rapid, sensitive and accurate tool for mapping rna-seq reads to transcriptomes. Bioinformatics, 32(12):192-200, 2016.10.1093/bioinformatics/btw277 Yuka Tanimura, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, Simon J. Puglisi, and Masayuki Takeda. Deterministic sub-linear space LCE data structures with efficient construction. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 54 of LIPIcs, pages 1:1-1:10. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2016.10.4230/LIPIcs.CPM.2016.1 Peter van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80-82, 1977.10.1016/0020-0190(77)90031-X Chen Zhou, Hao Chi, Leheng Wang, You Li, Yan-Jie Wu, Yan Fu, Ruixiang Sun, and Si-Min He. Speeding up tandem mass spectrometry-based database searching by longest common prefix. BMC Bioinformatics, 11:577, 2010.10.1186/1471-2105-11-577