Height-Bounded Lempel-Ziv Encodings

Bannai, Hideo; Funakoshi, Mitsuru; Hendrian, Diptarama; Matsuda, Myuji; Puglisi, Simon J.

doi:10.4230/LIPIcs.ESA.2024.18

Abstract

We introduce height-bounded LZ encodings (LZHB), a new family of compressed representations that are variants of Lempel-Ziv parsings with a focus on bounding the worst-case access time to arbitrary positions in the text directly via the compressed representation. An LZ-like encoding is a partitioning of the string into phrases of length 1 which can be encoded literally, or phrases of length at least 2 which have a previous occurrence in the string and can be encoded by its position and length. An LZ-like encoding induces an implicit referencing forest on the set of positions of the string. An LZHB encoding is an LZ-like encoding where the height of the implicit referencing forest is bounded. An LZHB encoding with height constraint h allows access to an arbitrary position of the underlying text using O(h) predecessor queries.
While computing the optimal (i.e., smallest) LZHB encoding efficiently seems to be difficult [Cicalese & Ugazio 2024, arXiv, to appear at DLT 2024], we give the first linear time algorithm for strings over a constant size alphabet that computes the greedy LZHB encoding, i.e., the string is processed from beginning to end, and the longest prefix of the remaining string that can satisfy the height constraint is taken as the next phrase. Our algorithms significantly improve both theoretically and practically, the very recently and independently proposed algorithms by Lipták et al. (CPM 2024).
We also analyze the size of height bounded LZ encodings in the context of repetitiveness measures, and show that there exists a constant c such that the size ẑ_{HB(clog n)} of the optimal LZHB encoding whose height is bounded by clog n for any string of length n is O(ĝ_{rl}), where ĝ_{rl} is the size of the smallest run-length grammar. Furthermore, we show that there exists a family of strings such that ẑ_{HB(clog n)} = o(ĝ_{rl}), thus making ẑ_{HB(clog n)} one of the smallest known repetitiveness measures for which O(polylog n) time access is possible using linear (O(ẑ_{HB(clog n)})) space.

Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms, 2(1):53-86, 2004. URL: https://doi.org/10.1016/S1570-8667(03)00065-0.
Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, and Rajeev Raman. Weighted ancestors in suffix trees revisited. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 8:1-8:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPICS.CPM.2021.8.
Jon Louis Bentley and Derick Wood. An optimal worst case algorithm for reporting intersections of rectangles. IEEE Trans. Computers, 29(7):571-577, 1980. URL: https://doi.org/10.1109/TC.1980.1675628.
Bernard Chazelle. A functional approach to data structures and its use in multidimensional searching. SIAM J. Comput., 17(3):427-462, 1988. URL: https://doi.org/10.1137/0217026.
Ferdinando Cicalese and Francesca Ugazio. On the complexity and approximability of bounded access Lempel Ziv coding. CoRR, abs/2403.15871, 2024. URL: https://doi.org/10.48550/arXiv.2403.15871.
Maxime Crochemore, Lucian Ilie, Costas S. Iliopoulos, Marcin Kubica, Wojciech Rytter, and Tomasz Walen. Computing the longest previous factor. Eur. J. Comb., 34(1):15-26, 2013. URL: https://doi.org/10.1016/J.EJC.2012.07.011.
Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science, FOCS '97, Miami Beach, Florida, USA, October 19-22, 1997, pages 137-143. IEEE Computer Society, 1997. URL: https://doi.org/10.1109/SFCS.1997.646102.
Johannes Fischer. Inducing the lcp-array. In Frank Dehne, John Iacono, and Jörg-Rüdiger Sack, editors, Algorithms and Data Structures - 12th International Symposium, WADS 2011, New York, NY, USA, August 15-17, 2011. Proceedings, volume 6844 of Lecture Notes in Computer Science, pages 374-385. Springer, 2011. URL: https://doi.org/10.1007/978-3-642-22300-6_32.
Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997. URL: https://doi.org/10.1017/CBO9780511574931.
Dominik Kempa and Barna Saha. An upper bound and linear-space queries on the LZ-end parsing. In Joseph (Seffi) Naor and Niv Buchbinder, editors, Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, Virtual Conference / Alexandria, VA, USA, January 9 - 12, 2022, pages 2847-2866. SIAM, 2022. URL: https://doi.org/10.1137/1.9781611977073.111.
Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977. URL: https://doi.org/10.1137/0206024.
Sebastian Kreft and Gonzalo Navarro. LZ77-like compression with fast random access. In James A. Storer and Michael W. Marcellin, editors, 2010 Data Compression Conference (DCC 2010), 24-26 March 2010, Snowbird, UT, USA, pages 239-248. IEEE Computer Society, 2010. URL: https://doi.org/10.1109/DCC.2010.29.
Sebastian Kreft and Gonzalo Navarro. On compressing and indexing repetitive sequences. Theor. Comput. Sci., 483:115-133, 2013. URL: https://doi.org/10.1016/J.TCS.2012.02.006.
Zsuzsanna Lipták, Francesco Masillo, and Gonzalo Navarro. BAT-LZ out of hell. In Shunsuke Inenaga and Simon J. Puglisi, editors, 35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024, June 25-27, 2024, Fukuoka, Japan, volume 296 of LIPIcs, pages 21:1-21:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2024. URL: https://doi.org/10.4230/LIPICS.CPM.2024.21.
Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. In David S. Johnson, editor, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 22-24 January 1990, San Francisco, California, USA, pages 319-327. SIAM, 1990. URL: http://dl.acm.org/citation.cfm?id=320176.320218.
S. Muthukrishnan. Efficient algorithms for document retrieval problems. In David Eppstein, editor, Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 6-8, 2002, San Francisco, CA, USA, pages 657-666. ACM/SIAM, 2002. URL: http://dl.acm.org/citation.cfm?id=545381.545469.
Gonzalo Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016.
Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2022. URL: https://doi.org/10.1145/3434399.
Gonzalo Navarro and Javiel Rojas-Ledesma. Predecessor search. ACM Comput. Surv., 53(5):105:1-105:35, 2021. URL: https://doi.org/10.1145/3409371.
Gonzalo Navarro and Cristian Urbina. Iterated straight-line programs. In José A. Soto and Andreas Wiese, editors, LATIN 2024: Theoretical Informatics - 16th Latin American Symposium, Puerto Varas, Chile, March 18-22, 2024, Proceedings, Part I, volume 14578 of Lecture Notes in Computer Science, pages 66-80. Springer, 2024. URL: https://doi.org/10.1007/978-3-031-55598-5_5.
Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully dynamic data structure for LCE queries in compressed space. In 41st International Symposium on Mathematical Foundations of Computer Science, MFCS 2016, August 22-26, 2016 - Kraków, Poland, volume 58 of LIPIcs, pages 72:1-72:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016.
Wojciech Plandowski and Wojciech Rytter. Application of Lempel-Ziv encodings to the solution of words equations. In Kim Guldstrand Larsen, Sven Skyum, and Glynn Winskel, editors, Automata, Languages and Programming, 25th International Colloquium, ICALP'98, Aalborg, Denmark, July 13-17, 1998, Proceedings, volume 1443 of Lecture Notes in Computer Science, pages 731-742. Springer, 1998.
Yohei Ueki, Diptarama, Masatoshi Kurihara, Yoshiaki Matsuoka, Kazuyuki Narisawa, Ryo Yoshinaka, Hideo Bannai, Shunsuke Inenaga, and Ayumi Shinohara. Longest common subsequence in at least k length order-isomorphic substrings. In Bernhard Steffen, Christel Baier, Mark van den Brand, Johann Eder, Mike Hinchey, and Tiziana Margaria, editors, SOFSEM 2017: Theory and Practice of Computer Science - 43rd International Conference on Current Trends in Theory and Practice of Computer Science, Limerick, Ireland, January 16-20, 2017, Proceedings, volume 10139 of Lecture Notes in Computer Science, pages 363-374. Springer, 2017. URL: https://doi.org/10.1007/978-3-319-51963-0_28.
Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. URL: https://doi.org/10.1007/BF01206331.
Elad Verbin and Wei Yu. Data structure lower bounds on random access to grammar-compressed strings. In Johannes Fischer and Peter Sanders, editors, Combinatorial Pattern Matching, 24th Annual Symposium, CPM 2013, Bad Herrenalb, Germany, June 17-19, 2013. Proceedings, volume 7922 of Lecture Notes in Computer Science, pages 247-258. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-38905-4_24.
Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973, pages 1-11. IEEE Computer Society, 1973.
Dan E. Willard. Log-logarithmic worst-case range queries are possible in space theta(n). Inf. Process. Lett., 17(2):81-84, 1983. URL: https://doi.org/10.1016/0020-0190(83)90075-3.
Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337-343, 1977. URL: https://doi.org/10.1109/TIT.1977.1055714.

Height-Bounded Lempel-Ziv Encodings

Authors Hideo Bannai , Mitsuru Funakoshi , Diptarama Hendrian , Myuji Matsuda, Simon J. Puglisi

File

Document Identifiers

Author Details

Cite As Get BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Height-Bounded Lempel-Ziv Encodings

Authors Hideo Bannai , Mitsuru Funakoshi , Diptarama Hendrian , Myuji Matsuda, Simon J. Puglisi

File

Document Identifiers

Author Details

Funding

Cite As Get BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

Supplementary Materials

References

Thanks for your feedback!

Could not send message