Abstract
Let S be a string of length n over an alphabet \(\varSigma \) and let Q be a subset of \(\varSigma \) of size \(q \ge 2\). The co-occurrence problem is to construct a compact data structure that supports the following query: given an integer w return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an \(O(\sqrt{nq})\) space data structure that—with some minor additions—supports queries in \(O(\log \log n)\) time [CPM 2021].
Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in \(O(\log \log n)\) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses \(O(d + q)\) space in addition to the input. Furthermore, we show that O(d) space is optimal and that \(O(\log \log n)\)-time queries are optimal given optimal space. Secondly, we bound \(d = O(\sqrt{nq})\), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that \(\varOmega (\sqrt{nq})\) bits of space is necessary in the worst case, meaning that the \(O(\sqrt{nq})\) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.
Philip Bille and Inge Li Gørtz are supported by Danish Research Council grant DFF-8021-002498.
References
Amagata, D., Hara, T.: Mining top-\(k\) co-occurrence patterns across multiple streams (extended abstract). In: Proceeding 34th ICDE, pp. 1747–1748 (2018). https://doi.org/10.1109/ICDE.2018.00231
Carter, L., Wegman, M.N.: Universal classes of hash functions. J. Comput. Syst. Sci. 18(2), 143–154 (1979). https://doi.org/10.1016/0022-0000(79)90044-8
Chang, J.H., Lee, W.S.: Finding recently frequent item sets adaptively over online transactional data streams. Inf. Syst. 31(8), 849–869 (2006). https://doi.org/10.1016/j.is.2005.04.001
Dallachiesa, M., Palpanas, T.: Identifying streaming frequent items in ad hoc time windows. Data Knowl. Eng. 87, 66–90 (2013). https://doi.org/10.1016/j.datak.2013.05.007
Das, G., Fleischer, R., Gasieniec, L., Gunopulos, D., Kärkkäinen, J.: Episode matching. In: Proceeding 8th CPM, pp. 12–27 (1997). https://doi.org/10.1007/3-540-63220-4_46
Demaine, E.D., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams with limited space. In: Proceeding 10th ESA, pp. 348–360 (2002). https://doi.org/10.1007/3-540-45749-6_33
Golab, L., DeHaan, D., Demaine, E.D., López-Ortiz, A., Munro, J.I.: Identifying frequent items in sliding windows over on-line packet streams. In: Proceeding 3rd ACM IMC, pp. 173–178 (2003). https://doi.org/10.1145/948205.948227
Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 51–55 (2003). https://doi.org/10.1145/762471.762473
Li, H., Lee, S.: Mining frequent itemsets over data streams using efficient window sliding techniques. Expert Syst. Appl. 36(2), 1466–1477 (2009). https://doi.org/10.1016/j.eswa.2007.11.061
Lim, Y., Choi, J., Kang, U.: Fast, accurate, and space-efficient tracking of time-weighted frequent items from data streams. In: Proceeding 23rd CIKM, pp. 1109–1118 (2014). https://doi.org/10.1145/2661829.2662006
Lin, C., Chiu, D., Wu, Y., Chen, A.L.P.: Mining frequent itemsets from data streams with a time-sensitive sliding window. In: Proceeding 5th SDM, pp. 68–79 (2005). https://doi.org/10.1137/1.9781611972757.7
Mozafari, B., Thakkar, H., Zaniolo, C.: Verifying and mining frequent patterns from large windows over data streams. In: Proceeding 24th ICDE, pp. 179–188 (2008). https://doi.org/10.1109/ICDE.2008.4497426
Patrascu, M., Thorup, M.: Randomization does not help searching predecessors. In: Proceeding 18th SODA, pp. 555–564 (2007). https://dl.acm.org/citation.cfm?id=1283383.1283443
Sobel, J., Bertram, N., Ding, C., Nargesian, F., Gildea, D.: AWLCO: all-window length co-occurrence. In: Proceeding 32nd CPM, pp. 24:1–24:21. LIPIcs (2021). https://doi.org/10.4230/LIPIcs.CPM.2021.24
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space \(\varTheta (N)\). Inf. Process. Lett. 17(2), 81–84 (1983). https://doi.org/10.1016/0020-0190(83)90075-3
Yu, Z., Yu, X., Liu, Y., Li, W., Pei, J.: Mining frequent co-occurrence patterns across multiple data streams. In: Proceeding 18th EDBT, pp. 73–84 (2015). https://doi.org/10.5441/002/edbt.2015.08
Acknowledgement
We would like to thank the anonymous reviewers for their comments, which improved the presentation of the paper.
Appendices
A Preprocessing
Finding Minimal Co-occurrences. To build the data structure, we need to find all the minimal co-occurrences in order to determine \(\delta \). For \(j \ge r_{1}\), let \(\textsf {lm}(j)\) denote the length of the left-minimal co-occurrence ending at index j. By Lemma 1, \(\textsf {lm}(r_{i}) = \textsf {len}(\ell _{i},r_{i})\) for each \(i \in [1,\mu ]\). Furthermore, for \(j \in [r_{i}+1,r_{i+1}-1]\) we have \(\textsf {lm}(j) = \textsf {lm}(j-1) + 1\) since both of the left-minimal co-occurrences ending at these two indices start at \(\ell _{i}\). However, \(\textsf {lm}(r_{i}) \le \textsf {lm}(r_{i}-1)\) for each \(i \in [2,\mu ]\); the left-minimal co-occurrence ending at \(r_{i}\) starts at least one index further to the right than the left-minimal co-occurrence ending at \(r_{i}-1\) because \(\ell _{i-1} < \ell _{i}\), so it cannot be strictly longer.
We determine \(\ell _{1},\ldots ,\ell _{\mu }\) and \(r_{1},\ldots ,r_{\mu }\) using the following algorithm. Traverse S and maintain \(\textsf{lm}(j)\) for the current index j; whenever \(\textsf{lm}(j) \ne \textsf{lm}(j-1) + 1\), the interval \([j - \textsf{lm}(j) + 1,\; j]\) is one of the minimal co-occurrences. Note that this algorithm finds the minimal co-occurrences in order of their right endpoint. To maintain \(\textsf{lm}(j)\), for each character \(p \in Q\) let \(\textsf{dist}(j,p)\) be the distance from j to the closest occurrence of p to the left of j; then \(\textsf{lm}(j)\) is the maximum \(\textsf{dist}(j,\cdot )\)-value. As in [14], we maintain the \(\textsf{dist}(j,\cdot )\)-values in a linked list over the elements of Q, ordered by increasing \(\textsf{dist}\)-values and dynamically reordered according to the well-known move-to-front rule: whenever some \(p \in Q\) is seen, its node is accessed in expected constant time through a dictionary and moved to the front of the list. The least recently seen \(p \in Q\) (i.e., the p with the largest \(\textsf{dist}(j,\cdot )\)-value) is then found at the back of the list in constant time. The algorithm uses O(q) space and expected constant time per character of S, so it runs in expected O(n) time.
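The scan above can be sketched as follows in Python. This is an illustrative sketch, not the paper's implementation: an `OrderedDict` plays the role of the move-to-front linked list (with the dictionary lookup built in), and the function name `minimal_cooccurrences` is ours.

```python
from collections import OrderedDict

def minimal_cooccurrences(S, Q):
    """Return the minimal co-occurrences of Q in S as (left, right) index
    pairs, in order of their right endpoint.  An OrderedDict stands in for
    the move-to-front list: the least recently seen character of Q sits at
    the back, and its last occurrence determines lm(j)."""
    Q = set(Q)
    mtf = OrderedDict()   # front = most recently seen character of Q
    last = {}             # last occurrence index of each p in Q seen so far
    result = []
    prev_lm = None
    for j, c in enumerate(S):
        if c in Q:
            last[c] = j
            mtf[c] = None
            mtf.move_to_end(c, last=False)   # move-to-front rule
        if len(last) < len(Q):
            continue                          # no co-occurrence ends at j yet
        oldest = next(reversed(mtf))          # largest dist(j, .)-value
        lm = j - last[oldest] + 1             # length of left-minimal co-occ.
        if prev_lm is None or lm != prev_lm + 1:
            result.append((j - lm + 1, j))    # a minimal co-occurrence
        prev_lm = lm
    return result
```

For example, on S = "abcab" with Q = {a, b, c} the scan reports the minimal co-occurrences (0, 2), (1, 3), and (2, 4).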
Building the Data Structure. We build the data structure as follows. Traverse S and maintain the two most recently seen minimal co-occurrences using the algorithm above. We maintain the non-zero \(\delta (\cdot )\)-values in a dictionary D implemented using chained hashing in conjunction with universal hashing [2]. When we find a new minimal co-occurrence \([\ell _{i+1},r_{i+1}]\) we increment \(D[\textsf{len}(\ell _{i},r_{i})]\) and decrement \(D[\textsf{len}(\ell _{i},r_{i+1})]\). Recall that \(Z = \{z_1,\ldots ,z_d\}\), where \(z_j < z_{j+1}\), is defined such that \(\delta (i) \ne 0\) if and only if \(i \in Z\). After processing S the dictionary D encodes the set \(\{(z_1,\delta (z_1)),\ldots ,(z_d,\delta (z_d))\}\). Sort the set to obtain the array \(E[j] = \delta (z_j)\). Compute the partial sum array over E, i.e. the array whose jth entry is \(\textsf{lmco}(z_j) = \sum _{i=1}^{j} E[i]\).
Build the predecessor data structure over Z and associate \(\textsf {lmco}(z_j)\) with each key \(z_j\).
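The final construction steps can be sketched as follows. In this sketch, a plain Python dict stands in for the universal-hashing dictionary and binary search over the sorted keys stands in for the \(O(\log \log n)\)-time predecessor structure; the function name `finalize` is ours.

```python
import bisect

def finalize(D):
    """D maps each z_j with delta(z_j) != 0 to delta(z_j).  Sort the keys
    to obtain Z, form E[j] = delta(z_j), take partial sums of E, and answer
    queries by a predecessor search on Z (here: binary search)."""
    Z = sorted(D)                 # z_1 < z_2 < ... < z_d
    E = [D[z] for z in Z]         # E[j] = delta(z_j)
    lmco = []                     # partial sums: lmco[j] = E[0] + ... + E[j]
    s = 0
    for e in E:
        s += e
        lmco.append(s)

    def query(w):
        """Return the partial sum of delta at the predecessor of w in Z,
        i.e. sum over all i <= w of delta(i); 0 if w precedes all of Z."""
        j = bisect.bisect_right(Z, w) - 1
        return lmco[j] if j >= 0 else 0

    return Z, lmco, query
```

For instance, D = {2: 3, 5: -1, 9: 2} yields Z = [2, 5, 9] and partial sums [3, 2, 4], so a query for w = 6 returns 2.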
The algorithm for finding the minimal co-occurrences uses O(q) space and the remaining data structures all use O(d) space, for a total of \(O(d + q)\) space. Finding the minimal co-occurrences and maintaining D takes O(n) expected time, and so does building the predecessor structure from the sorted input.
Furthermore, we use the following sorting algorithm to sort the d entries in D with O(d) extra space in expected O(n) time. If \(d < n/\log n\) we use merge sort, which uses O(d) extra space and runs in \(O(d\log d) = O(n)\) time. If \(d \ge n/\log n\) we use radix sort with base \(\sqrt{n}\), which uses \(O(\sqrt{n})\) extra space and O(n) time. To elaborate, assume without loss of generality that n is representable in 2k bits for some integer k. We first distribute the elements into \(2^k = O(\sqrt{n})\) buckets according to the most significant k bits of their binary representation, partially sorting the input. We then sort each bucket by distributing its elements according to the least significant k bits of their binary representation, fully sorting the input. The algorithm runs in O(n) time and uses \(O(\sqrt{n}) = O(n/\log n) = O(d)\) extra space.
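The two-pass radix sort can be sketched as follows; the function name `radix_sort_sqrt` is ours, and keys are assumed to be integers in [0, n].

```python
def radix_sort_sqrt(keys, n):
    """Two-pass radix sort with base 2^k, where 2k bits suffice to
    represent n.  Pass 1 distributes by the most significant k bits,
    pass 2 sorts each bucket by the least significant k bits.  Each of the
    O(sqrt(n)) bucket arrays has size O(sqrt(n))."""
    k = (max(n, 2).bit_length() + 1) // 2   # so that 2k bits represent n
    base = 1 << k
    mask = base - 1
    high = [[] for _ in range(base)]
    for x in keys:                          # pass 1: most significant k bits
        high[x >> k].append(x)
    out = []
    for bucket in high:                     # pass 2: least significant k bits
        if not bucket:
            continue
        low = [[] for _ in range(base)]
        for x in bucket:
            low[x & mask].append(x)
        for sub in low:
            out.extend(sub)
    return out
```

Since at most \(2^k\) high buckets are non-empty and each second-pass bucket array has size \(2^k\), the total work is \(O(n + 2^{2k}) = O(n)\).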
B Lower Bound on Time
We now prove Theorem 2 by the following reduction from the predecessor problem. Let U, Q and \(G_i\) be as defined in Sect. 4.1 and let \(X = \{x_1,x_2,\ldots ,x_m\} \subseteq U\) where \(x_1< \ldots < x_m\). Define
By Lemma 5 we have that \(\delta (x_1) = x_1\), \(\delta (x_i) = x_i - x_{i-1}\) for \(i \in [2,m]\), and \(\delta (i) = 0\) for \(i \in U\setminus X\). Then, if the predecessor of some \(x \in U\) is \(x_p\), the sum telescopes and we have \(\sum _{i=0}^x\delta (i) = x_1 + (x_2 - x_1) + \ldots + (x_p - x_{p-1}) = x_p\).
On the other hand, if \(x < x_1\) then \(\sum _{i=0}^x\delta (i) = 0\), unambiguously identifying that x has no predecessor.
Applying Lemma 5 again, we have \(d \le 8m\). Furthermore, there are \(x_1 + (x_2 - x_1) + \ldots + (x_m - x_{m-1}) = x_m \le u\) gadgets in total so \(n \le 2u^2\). Hence, given a data structure that supports \(\textsf {lmco}\) in \(f_t(n,q,d)\) time using \(f_s(n,q,d)\) space, we get a data structure supporting predecessor queries on X in \(O(f_t(2u^2, 2, 8m))\) time and \(O(f_s(2u^2,2,8m))\) space, proving Theorem 2.
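The arithmetic behind the reduction can be illustrated as follows. This toy sketch abstracts the gadget string away entirely and works directly with the \(\delta \)-values; the function name `predecessor_via_delta` is ours, and the prefix sum computed in `pred` plays the role of an \(\textsf{lmco}\) query.

```python
def predecessor_via_delta(X, u):
    """With delta(x_1) = x_1 and delta(x_i) = x_i - x_{i-1}, the prefix sum
    of delta up to x telescopes to the predecessor of x in X, and equals 0
    exactly when x has no predecessor."""
    delta = [0] * (u + 1)
    prev = 0
    for x in sorted(X):
        delta[x] = x - prev
        prev = x

    def pred(x):
        s = sum(delta[: x + 1])      # what an lmco query would return
        return s if s > 0 else None  # None: x has no predecessor in X

    return pred
```

For X = {3, 7, 12}, a query for x = 10 sums 3 + 4 = 7, correctly reporting the predecessor 7.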
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bille, P., Gørtz, I.L., Stordalen, T. (2022). The Complexity of the Co-occurrence Problem. In: Arroyuelo, D., Poblete, B. (eds) String Processing and Information Retrieval. SPIRE 2022. Lecture Notes in Computer Science, vol 13617. Springer, Cham. https://doi.org/10.1007/978-3-031-20643-6_4
Print ISBN: 978-3-031-20642-9
Online ISBN: 978-3-031-20643-6