The Complexity of the Co-occurrence Problem

  • Conference paper
String Processing and Information Retrieval (SPIRE 2022)

Abstract

Let S be a string of length n over an alphabet \(\varSigma \) and let Q be a subset of \(\varSigma \) of size \(q \ge 2\). The co-occurrence problem is to construct a compact data structure that supports the following query: given an integer w return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an \(O(\sqrt{nq})\) space data structure that—with some minor additions—supports queries in \(O(\log \log n)\) time [CPM 2021].

Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in \(O(\log \log n)\) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses \(O(d + q)\) space in addition to the input. Furthermore, we show that O(d) space is optimal and that \(O(\log \log n)\)-time queries are optimal given optimal space. Secondly, we bound \(d = O(\sqrt{nq})\), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that \(\varOmega (\sqrt{nq})\) bits of space is necessary in the worst case, meaning that the \(O(\sqrt{nq})\) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.
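To make the query concrete, the following is a naive sliding-window reference implementation of a single query (our own illustration; the function name and 0-indexing are ours, and the paper's data structure answers the same query in \(O(\log \log n)\) time rather than the O(n) time used here).

```python
from collections import Counter

def co_occurrences(S, Q, w):
    """Count the length-w substrings of S that contain every
    character of Q at least once (naive sliding window)."""
    Q = set(Q)
    if not (1 <= w <= len(S)):
        return 0
    counts = Counter()   # multiset of Q-characters in the window
    covered = 0          # distinct characters of Q in the window
    result = 0
    for j, c in enumerate(S):
        if c in Q:
            counts[c] += 1
            if counts[c] == 1:
                covered += 1
        if j >= w:       # slide the window: drop S[j - w]
            out = S[j - w]
            if out in Q:
                counts[out] -= 1
                if counts[out] == 0:
                    covered -= 1
        if j >= w - 1 and covered == len(Q):
            result += 1  # window S[j - w + 1 .. j] is a co-occurrence
    return result
```

For example, `co_occurrences("abcab", {"a", "b"}, 2)` counts the two windows "ab" starting at positions 0 and 3.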

Philip Bille and Inge Li Gørtz are supported by Danish Research Council grant DFF-8021-002498.


References

  1. Amagata, D., Hara, T.: Mining top-\(k\) co-occurrence patterns across multiple streams (extended abstract). In: Proc. 34th ICDE, pp. 1747–1748 (2018). https://doi.org/10.1109/ICDE.2018.00231

  2. Carter, L., Wegman, M.N.: Universal classes of hash functions. J. Comput. Syst. Sci. 18(2), 143–154 (1979). https://doi.org/10.1016/0022-0000(79)90044-8

  3. Chang, J.H., Lee, W.S.: Finding recently frequent item sets adaptively over online transactional data streams. Inf. Syst. 31(8), 849–869 (2006). https://doi.org/10.1016/j.is.2005.04.001

  4. Dallachiesa, M., Palpanas, T.: Identifying streaming frequent items in ad hoc time windows. Data Knowl. Eng. 87, 66–90 (2013). https://doi.org/10.1016/j.datak.2013.05.007

  5. Das, G., Fleischer, R., Gasieniec, L., Gunopulos, D., Kärkkäinen, J.: Episode matching. In: Proc. 8th CPM, pp. 12–27 (1997). https://doi.org/10.1007/3-540-63220-4_46

  6. Demaine, E.D., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams with limited space. In: Proc. 10th ESA, pp. 348–360 (2002). https://doi.org/10.1007/3-540-45749-6_33

  7. Golab, L., DeHaan, D., Demaine, E.D., López-Ortiz, A., Munro, J.I.: Identifying frequent items in sliding windows over on-line packet streams. In: Proc. 3rd ACM IMC, pp. 173–178 (2003). https://doi.org/10.1145/948205.948227

  8. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 51–55 (2003). https://doi.org/10.1145/762471.762473

  9. Li, H., Lee, S.: Mining frequent itemsets over data streams using efficient window sliding techniques. Expert Syst. Appl. 36(2), 1466–1477 (2009). https://doi.org/10.1016/j.eswa.2007.11.061

  10. Lim, Y., Choi, J., Kang, U.: Fast, accurate, and space-efficient tracking of time-weighted frequent items from data streams. In: Proc. 23rd CIKM, pp. 1109–1118 (2014). https://doi.org/10.1145/2661829.2662006

  11. Lin, C., Chiu, D., Wu, Y., Chen, A.L.P.: Mining frequent itemsets from data streams with a time-sensitive sliding window. In: Proc. 5th SDM, pp. 68–79 (2005). https://doi.org/10.1137/1.9781611972757.7

  12. Mozafari, B., Thakkar, H., Zaniolo, C.: Verifying and mining frequent patterns from large windows over data streams. In: Proc. 24th ICDE, pp. 179–188 (2008). https://doi.org/10.1109/ICDE.2008.4497426

  13. Patrascu, M., Thorup, M.: Randomization does not help searching predecessors. In: Proc. 18th SODA, pp. 555–564 (2007). https://dl.acm.org/citation.cfm?id=1283383.1283443

  14. Sobel, J., Bertram, N., Ding, C., Nargesian, F., Gildea, D.: AWLCO: all-window length co-occurrence. In: Proc. 32nd CPM, pp. 24:1–24:21. LIPIcs (2021). https://doi.org/10.4230/LIPIcs.CPM.2021.24

  15. Willard, D.E.: Log-logarithmic worst-case range queries are possible in space \(\varTheta (N)\). Inf. Process. Lett. 17(2), 81–84 (1983). https://doi.org/10.1016/0020-0190(83)90075-3

  16. Yu, Z., Yu, X., Liu, Y., Li, W., Pei, J.: Mining frequent co-occurrence patterns across multiple data streams. In: Proc. 18th EDBT, pp. 73–84 (2015). https://doi.org/10.5441/002/edbt.2015.08


Acknowledgement

We would like to thank the anonymous reviewers for their comments, which improved the presentation of the paper.

Author information

Correspondence to Tord Stordalen.

Appendices

A Preprocessing

Finding Minimal Co-occurrences. To build the data structure, we need to find all the minimal co-occurrences in order to determine \(\delta \). For \(j \ge r_{1}\), let \(\textsf {lm}(j)\) denote the length of the left-minimal co-occurrence ending at index j. By Lemma 1, \(\textsf {lm}(r_{i}) = \textsf {len}(\ell _{i},r_{i})\) for each \(i \in [1,\mu ]\). Furthermore, for \(j \in [r_{i}+1,r_{i+1}-1]\) we have \(\textsf {lm}(j) = \textsf {lm}(j-1) + 1\) since both of the left-minimal co-occurrences ending at these two indices start at \(\ell _{i}\). However, \(\textsf {lm}(r_{i}) \le \textsf {lm}(r_{i}-1)\) for each \(i \in [2,\mu ]\); the left-minimal co-occurrence ending at \(r_{i}\) starts at least one index further to the right than the left-minimal co-occurrence ending at \(r_{i}-1\) because \(\ell _{i-1} < \ell _{i}\), so it cannot be strictly longer.

We determine \(\ell _{1},\ldots ,\ell _{\mu }\) and \(r_{1},\ldots ,r_{\mu }\) with the following algorithm. Traverse S, maintaining \(\textsf {lm}(j)\) for the current index j. Whenever \(\textsf {lm}(j) \ne \textsf {lm}(j-1) + 1\), the interval \([j - \textsf {lm}(j) + 1,\; j]\) is one of the minimal co-occurrences. Note that this algorithm finds the minimal co-occurrences in order of their rightmost endpoint. We maintain \(\textsf {lm}(j)\) as follows. For each character \(p \in Q\), let \(\textsf {dist}(j,p)\) be the distance from j to the closest occurrence of p to the left of j; then \(\textsf {lm}(j)\) is the maximum \(\textsf {dist}(j,\cdot )\)-value. As in [14], we maintain the \(\textsf {dist}(j,\cdot )\)-values in a linked list over the elements of Q, ordered by increasing \(\textsf {dist}\)-values and dynamically reordered according to the well-known move-to-front rule. Whenever a character \(p \in Q\) occurs, we access its node in expected constant time through a dictionary and move it to the front of the list. The least recently seen \(p \in Q\) (i.e., the p with the largest \(\textsf {dist}(j,\cdot )\)-value) is then found at the back of the list in constant time. The algorithm uses O(q) space and expected constant time per character of S, so it runs in expected O(n) time.
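The scan above can be sketched as follows (an illustrative Python sketch, assuming 0-indexed positions; a Python `OrderedDict` plays the role of the move-to-front linked list plus dictionary, keeping the least recently seen character of Q at the front).

```python
from collections import OrderedDict

def minimal_co_occurrences(S, Q):
    """One pass over S reporting the minimal co-occurrences of Q as
    (left, right) index pairs, in order of their right endpoint."""
    last = OrderedDict()   # p -> index of its most recent occurrence
    result = []
    prev_lm = None         # lm(j - 1), or None before any co-occurrence
    for j, c in enumerate(S):
        if c in Q:
            last[c] = j
            last.move_to_end(c)        # move-to-front rule
        if len(last) < len(Q):         # no co-occurrence ends at j yet
            prev_lm = None
            continue
        # The least recently seen character fixes the left endpoint.
        stale = next(iter(last))
        lm = j - last[stale] + 1       # left-minimal length lm(j)
        if prev_lm is None or lm != prev_lm + 1:
            result.append((j - lm + 1, j))  # a minimal co-occurrence
        prev_lm = lm
    return result
```

For example, on S = "abcab" with Q = {a, b} the sketch reports the minimal co-occurrences (0, 1), (1, 3), and (3, 4).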

Building the Data Structure. We build the data structure as follows. Traverse S and maintain the two most recently seen minimal co-occurrences using the algorithm above. We maintain the non-zero \(\delta (\cdot )\)-values in a dictionary D that is implemented using chained hashing in conjunction with universal hashing [2]. When we find a new minimal co-occurrence \([\ell _{i+1},r_{i+1}]\) we increment \(D[\textsf {len}(\ell _{i},r_{i})]\) and decrement \(D[\textsf {len}(\ell _{i},r_{i+1})]\). Recall that \(Z = \{z_1,\ldots ,z_d\}\) where \(z_j < z_{j+1}\) is defined such that \(\delta (i) \ne 0\) if and only if \(i \in Z\). After processing S the dictionary D encodes the set \(\{(z_1,\delta (z_1)),\ldots ,(z_d,\delta (z_d))\}\). Sort the set to obtain the array \(E[j] = \delta (z_j)\). Compute the partial sum array over E, i.e., the array

$$ F[j] = \sum _{i=1}^j E[i] = \sum _{i=1}^j \delta (z_i) = \textsf {lmco}(z_j). \qquad \qquad \qquad \text {(we use 1-indexing)} $$

Build the predecessor data structure over Z and associate \(\textsf {lmco}(z_j)\) with each key \(z_j\).
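The resulting query structure can be sketched as follows (illustrative; we substitute binary search over the sorted z-values for the \(O(\log \log n)\) predecessor structure, and the function name is ours).

```python
from bisect import bisect_right

def build_lmco(delta):
    """Given the non-zero delta-values as a dict z -> delta(z), build
    a query function lmco(w) = sum of delta(z) over all z <= w.

    Binary search stands in for the predecessor structure here; the
    paper uses an O(log log n)-time structure such as a y-fast trie.
    """
    Z = sorted(delta)              # z_1 < z_2 < ... < z_d
    F = []                         # partial sums over E[j] = delta(z_j)
    total = 0
    for z in Z:
        total += delta[z]
        F.append(total)            # F[j] = lmco(z_{j+1}) (0-indexed)

    def lmco(w):
        j = bisect_right(Z, w)     # number of z-values <= w
        return F[j - 1] if j else 0    # 0 if w precedes all of Z
    return lmco
```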

The algorithm for finding the minimal co-occurrences uses O(q) space and the remaining data structures all use O(d) space, for a total of \(O(d + q)\) space. Finding the minimal co-occurrences and maintaining D takes O(n) expected time, and so does building the predecessor structure from the sorted input.

Furthermore, we use the following sorting algorithm to sort the d entries in D with O(d) extra space in expected O(n) time. If \(d < n/\log n\) we use merge sort, which uses O(d) extra space and runs in \(O(d\log d) = O(n)\) time. If \(d \ge n/\log n\) we use radix sort with base \(\sqrt{n}\), which uses \(O(\sqrt{n})\) extra space and O(n) time. To elaborate, assume without loss of generality that 2k bits suffice to represent n. We first distribute the elements into \(2^k = O(\sqrt{n})\) buckets according to the most significant k bits of their binary representation, partially sorting the input. We then sort each bucket by distributing its elements according to the least significant k bits of their binary representation, fully sorting the input. The algorithm runs in O(n) time and uses \(O(\sqrt{n}) = O(n/\log n) = O(d)\) extra space.
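The two-pass radix sort can be sketched as follows (illustrative; it sorts integers in [0, n) where 2k bits suffice to represent n, and reuses the low-bit buckets across high-bit buckets to keep the extra space at \(O(\sqrt{n})\)).

```python
def radix_sort_two_pass(values, n):
    """Sort integers in [0, n) with two bucket-distribution passes of
    k bits each, where 2k bits suffice to represent the values.

    Pass 1 distributes by the k most significant bits (partial sort);
    pass 2 distributes each bucket by the k least significant bits.
    """
    b = max(1, (n - 1).bit_length())
    k = (b + 1) // 2                    # 2k >= b, so 2^k = O(sqrt(n))
    mask = (1 << k) - 1
    high = [[] for _ in range(1 << k)]
    for v in values:
        high[v >> k].append(v)          # partial sort by high k bits
    out = []
    low = [[] for _ in range(1 << k)]   # reused for every high bucket
    for bucket in high:
        for v in bucket:
            low[v & mask].append(v)     # sort bucket by low k bits
        for lb in low:
            out.extend(lb)
            lb.clear()
    return out
```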

B Lower Bound on Time

We now prove Theorem 2 by the following reduction from the predecessor problem. Let U, Q and \(G_i\) be as defined in Sect. 4.1 and let \(X = \{x_1,x_2,\ldots ,x_m\} \subseteq U\) where \(x_1< \ldots < x_m\). Define

$$ S = \underbrace{G_{x_1}\cdots G_{x_1}}_{x_1}\;\underbrace{G_{x_2}\cdots G_{x_2}}_{x_2 - x_1} \;\ldots \ldots \ldots \;\underbrace{G_{x_m}\cdots G_{x_m}}_{x_m - x_{m-1}} $$

By Lemma 5 we have that \(\delta (x_1) = x_1\), \(\delta (x_i) = x_i - x_{i-1}\) for \(i \in [2,m]\) and \(\delta (i) = 0\) for \(i \in U\setminus X\). Then, if the predecessor of some \(x \in U\) is \(x_p\), we have

$$ \textsf {lmco}(x) = \sum _{i=2}^x \delta (i) = x_1 + (x_2 - x_1) + \ldots + (x_p - x_{p-1}) = x_p $$

On the other hand, if \(x < x_1\) then \(\sum _{i=2}^x\delta (i) = 0\), unambiguously identifying that x has no predecessor.

Applying Lemma 5 again, we have \(d \le 8m\). Furthermore, there are \(x_1 + (x_2 - x_1) + \ldots + (x_m - x_{m-1}) = x_m \le u\) gadgets in total so \(n \le 2u^2\). Hence, given a data structure that supports \(\textsf {lmco}\) in \(f_t(n,q,d)\) time using \(f_s(n,q,d)\) space, we get a data structure supporting predecessor queries on X in \(O(f_t(2u^2, 2, 8m))\) time and \(O(f_s(2u^2,2,8m))\) space, proving Theorem 2.
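The heart of the reduction is the telescoping identity above, which can be sanity-checked directly from the \(\delta\)-values of Lemma 5 without materializing the gadget string (a toy verification; the function name is ours).

```python
def predecessor_via_lmco(X, x):
    """Toy check of the Appendix B reduction: with delta(x_1) = x_1
    and delta(x_i) = x_i - x_{i-1} (Lemma 5), the prefix sum
    lmco(x) = sum of delta(i) over i <= x telescopes to the
    predecessor of x in X, and to 0 if x has no predecessor."""
    X = sorted(X)
    delta = {X[0]: X[0]}
    for prev, cur in zip(X, X[1:]):
        delta[cur] = cur - prev         # the telescoping differences
    return sum(v for z, v in delta.items() if z <= x)
```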


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Bille, P., Gørtz, I.L., Stordalen, T. (2022). The Complexity of the Co-occurrence Problem. In: Arroyuelo, D., Poblete, B. (eds) String Processing and Information Retrieval. SPIRE 2022. Lecture Notes in Computer Science, vol 13617. Springer, Cham. https://doi.org/10.1007/978-3-031-20643-6_4


  • DOI: https://doi.org/10.1007/978-3-031-20643-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20642-9

  • Online ISBN: 978-3-031-20643-6

  • eBook Packages: Computer Science, Computer Science (R0)
