Abstract
Let S be a string of length n over an alphabet \(\varSigma \) and let Q be a subset of \(\varSigma \) of size \(q \ge 2\). The co-occurrence problem is to construct a compact data structure that supports the following query: given an integer w return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an \(O(\sqrt{nq})\) space data structure that—with some minor additions—supports queries in \(O(\log \log n)\) time [CPM 2021].
Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in \(O(\log \log n)\) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses \(O(d + q)\) space in addition to the input. Furthermore, we show that O(d) space is optimal and that \(O(\log \log n)\)-time queries are optimal given optimal space. Secondly, we bound \(d = O(\sqrt{nq})\), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that \(\varOmega (\sqrt{nq})\) bits of space is necessary in the worst case, meaning that the \(O(\sqrt{nq})\) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.
Philip Bille and Inge Li Gørtz are supported by Danish Research Council grant DFF-8021-002498.
References
Amagata, D., Hara, T.: Mining top-\(k\) co-occurrence patterns across multiple streams (extended abstract). In: Proceeding 34th ICDE, pp. 1747–1748 (2018). https://doi.org/10.1109/ICDE.2018.00231
Carter, L., Wegman, M.N.: Universal classes of hash functions. J. Comput. Syst. Sci. 18(2), 143–154 (1979). https://doi.org/10.1016/0022-0000(79)90044-8
Chang, J.H., Lee, W.S.: Finding recently frequent item sets adaptively over online transactional data streams. Inf. Syst. 31(8), 849–869 (2006). https://doi.org/10.1016/j.is.2005.04.001
Dallachiesa, M., Palpanas, T.: Identifying streaming frequent items in ad hoc time windows. Data Knowl. Eng. 87, 66–90 (2013). https://doi.org/10.1016/j.datak.2013.05.007
Das, G., Fleischer, R., Gasieniec, L., Gunopulos, D., Kärkkäinen, J.: Episode matching. In: Proceeding 8th CPM, pp. 12–27 (1997). https://doi.org/10.1007/3-540-63220-4_46
Demaine, E.D., López-Ortiz, A., Munro, J.I.: Frequency estimation of internet packet streams with limited space. In: Proceeding 10th ESA, pp. 348–360 (2002). https://doi.org/10.1007/3-540-45749-6_33
Golab, L., DeHaan, D., Demaine, E.D., López-Ortiz, A., Munro, J.I.: Identifying frequent items in sliding windows over on-line packet streams. In: Proceeding 3rd ACM IMC, pp. 173–178 (2003). https://doi.org/10.1145/948205.948227
Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. 28, 51–55 (2003). https://doi.org/10.1145/762471.762473
Li, H., Lee, S.: Mining frequent itemsets over data streams using efficient window sliding techniques. Expert Syst. Appl. 36(2), 1466–1477 (2009). https://doi.org/10.1016/j.eswa.2007.11.061
Lim, Y., Choi, J., Kang, U.: Fast, accurate, and space-efficient tracking of time-weighted frequent items from data streams. In: Proceeding 23rd CIKM, pp. 1109–1118 (2014). https://doi.org/10.1145/2661829.2662006
Lin, C., Chiu, D., Wu, Y., Chen, A.L.P.: Mining frequent itemsets from data streams with a time-sensitive sliding window. In: Proceeding 5th SDM, pp. 68–79 (2005). https://doi.org/10.1137/1.9781611972757.7
Mozafari, B., Thakkar, H., Zaniolo, C.: Verifying and mining frequent patterns from large windows over data streams. In: Proceeding 24th ICDE, pp. 179–188 (2008). https://doi.org/10.1109/ICDE.2008.4497426
Patrascu, M., Thorup, M.: Randomization does not help searching predecessors. In: Proceeding 18th SODA, pp. 555–564 (2007). https://dl.acm.org/citation.cfm?id=1283383.1283443
Sobel, J., Bertram, N., Ding, C., Nargesian, F., Gildea, D.: AWLCO: all-window length co-occurrence. In: Proceeding 32nd CPM, pp. 24:1–24:21. LIPIcs (2021). https://doi.org/10.4230/LIPIcs.CPM.2021.24
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space \(\varTheta (N)\). Inf. Process. Lett. 17(2), 81–84 (1983). https://doi.org/10.1016/0020-0190(83)90075-3
Yu, Z., Yu, X., Liu, Y., Li, W., Pei, J.: Mining frequent co-occurrence patterns across multiple data streams. In: Proceeding 18th EDBT, pp. 73–84 (2015). https://doi.org/10.5441/002/edbt.2015.08
Acknowledgement
We would like to thank the anonymous reviewers for their comments, which improved the presentation of the paper.
Appendices
A Preprocessing
Finding Minimal Co-occurrences. To build the data structure, we need to find all the minimal co-occurrences in order to determine \(\delta \). For \(j \ge r_{1}\), let \(\textsf {lm}(j)\) denote the length of the left-minimal co-occurrence ending at index j. By Lemma 1, \(\textsf {lm}(r_{i}) = \textsf {len}(\ell _{i},r_{i})\) for each \(i \in [1,\mu ]\). Furthermore, for \(j \in [r_{i}+1,r_{i+1}-1]\) we have \(\textsf {lm}(j) = \textsf {lm}(j-1) + 1\) since both of the left-minimal co-occurrences ending at these two indices start at \(\ell _{i}\). However, \(\textsf {lm}(r_{i}) \le \textsf {lm}(r_{i}-1)\) for each \(i \in [2,\mu ]\); the left-minimal co-occurrence ending at \(r_{i}\) starts at least one index further to the right than the left-minimal co-occurrence ending at \(r_{i}-1\) because \(\ell _{i-1} < \ell _{i}\), so it cannot be strictly longer.
We determine \(\ell _{1},\ldots ,\ell _{\mu }\) and \(r_{1},\ldots ,r_{\mu }\) using the following algorithm. Traverse S and maintain \(\textsf{lm}(j)\) for the current index j; whenever \(\textsf{lm}(j) \ne \textsf{lm}(j-1) + 1\), the interval \([j - \textsf{lm}(j) + 1,\; j]\) is one of the minimal co-occurrences. Note that this algorithm finds the minimal co-occurrences in order of their right endpoint. To maintain \(\textsf{lm}(j)\), for each character \(p \in Q\) let \(\textsf{dist}(j,p)\) be the distance from j to the closest occurrence of p to the left of j; then \(\textsf{lm}(j)\) is the maximum \(\textsf{dist}(j,\cdot )\)-value. As in [14], we maintain the \(\textsf{dist}(j,\cdot )\)-values in a linked list over the elements of Q, ordered by increasing \(\textsf{dist}\)-values and dynamically reordered according to the well-known move-to-front rule: whenever some \(p \in Q\) is seen, its node is accessed in expected constant time through a dictionary and moved to the front of the list. The least recently seen \(p \in Q\) (i.e., the p with the largest \(\textsf{dist}(j,\cdot )\)-value) is then found at the back of the list in constant time. The algorithm uses O(q) space and expected constant time per character of S, so it runs in expected O(n) time.
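The scan above can be sketched as follows in Python. This is an illustrative sketch, not the paper's implementation: an `OrderedDict` plays the role of the move-to-front linked list (with the dictionary lookup built in), and the function name `minimal_cooccurrences` is ours.

```python
from collections import OrderedDict

def minimal_cooccurrences(S, Q):
    """Return the minimal co-occurrences of Q in S as (left, right) index
    pairs, in order of their right endpoint.  An OrderedDict stands in for
    the move-to-front list: the least recently seen character of Q sits at
    the back, and its last occurrence determines lm(j)."""
    Q = set(Q)
    mtf = OrderedDict()   # front = most recently seen character of Q
    last = {}             # last occurrence index of each p in Q seen so far
    result = []
    prev_lm = None
    for j, c in enumerate(S):
        if c in Q:
            last[c] = j
            mtf[c] = None
            mtf.move_to_end(c, last=False)   # move-to-front rule
        if len(last) < len(Q):
            continue                          # no co-occurrence ends at j yet
        oldest = next(reversed(mtf))          # largest dist(j, .)-value
        lm = j - last[oldest] + 1             # length of left-minimal co-occ.
        if prev_lm is None or lm != prev_lm + 1:
            result.append((j - lm + 1, j))    # a minimal co-occurrence
        prev_lm = lm
    return result
```

For example, on S = "abcab" with Q = {a, b, c} the scan reports the minimal co-occurrences (0, 2), (1, 3), and (2, 4).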
Building the Data Structure. We build the data structure as follows. Traverse S and maintain the two most recently seen minimal co-occurrences using the algorithm above. We maintain the non-zero \(\delta (\cdot )\)-values in a dictionary D implemented using chained hashing in conjunction with universal hashing [2]. When we find a new minimal co-occurrence \([\ell _{i+1},r_{i+1}]\) we increment \(D[\textsf{len}(\ell _{i},r_{i})]\) and decrement \(D[\textsf{len}(\ell _{i},r_{i+1})]\). Recall that \(Z = \{z_1,\ldots ,z_d\}\), where \(z_j < z_{j+1}\), is defined such that \(\delta (i) \ne 0\) if and only if \(i \in Z\). After processing S the dictionary D encodes the set \(\{(z_1,\delta (z_1)),\ldots ,(z_d,\delta (z_d))\}\). Sort the set to obtain the array \(E[j] = \delta (z_j)\). Compute the partial sum array over E, i.e. the array whose jth entry is \(\textsf{lmco}(z_j) = \sum _{i=1}^{j} E[i]\).
Build the predecessor data structure over Z and associate \(\textsf {lmco}(z_j)\) with each key \(z_j\).
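The final construction steps can be sketched as follows. In this sketch, a plain Python dict stands in for the universal-hashing dictionary and binary search over the sorted keys stands in for the \(O(\log \log n)\)-time predecessor structure; the function name `finalize` is ours.

```python
import bisect

def finalize(D):
    """D maps each z_j with delta(z_j) != 0 to delta(z_j).  Sort the keys
    to obtain Z, form E[j] = delta(z_j), take partial sums of E, and answer
    queries by a predecessor search on Z (here: binary search)."""
    Z = sorted(D)                 # z_1 < z_2 < ... < z_d
    E = [D[z] for z in Z]         # E[j] = delta(z_j)
    lmco = []                     # partial sums: lmco[j] = E[0] + ... + E[j]
    s = 0
    for e in E:
        s += e
        lmco.append(s)

    def query(w):
        """Return the partial sum of delta at the predecessor of w in Z,
        i.e. sum over all i <= w of delta(i); 0 if w precedes all of Z."""
        j = bisect.bisect_right(Z, w) - 1
        return lmco[j] if j >= 0 else 0

    return Z, lmco, query
```

For instance, D = {2: 3, 5: -1, 9: 2} yields Z = [2, 5, 9] and partial sums [3, 2, 4], so a query for w = 6 returns 2.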
The algorithm for finding the minimal co-occurrences uses O(q) space and the remaining data structures all use O(d) space, for a total of \(O(d + q)\) space. Finding the minimal co-occurrences and maintaining D takes O(n) expected time, and so does building the predecessor structure from the sorted input.
Furthermore, we use the following sorting algorithm to sort the d entries in D with O(d) extra space in expected O(n) time. If \(d < n/\log n\) we use merge sort, which uses O(d) extra space and runs in \(O(d\log d) = O(n)\) time. If \(d \ge n/\log n\) we use radix sort with base \(\sqrt{n}\), which uses \(O(\sqrt{n})\) extra space and O(n) time. To elaborate, assume without loss of generality that n is representable in 2k bits for some integer k. We first distribute the elements into \(2^k = O(\sqrt{n})\) buckets according to the most significant k bits of their binary representation, partially sorting the input. We then sort each bucket by distributing its elements according to the least significant k bits of their binary representation, fully sorting the input. The algorithm runs in O(n) time and uses \(O(\sqrt{n}) = O(n/\log n) = O(d)\) extra space.
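The two-pass radix sort can be sketched as follows; the function name `radix_sort_sqrt` is ours, and keys are assumed to be integers in [0, n].

```python
def radix_sort_sqrt(keys, n):
    """Two-pass radix sort with base 2^k, where 2k bits suffice to
    represent n.  Pass 1 distributes by the most significant k bits,
    pass 2 sorts each bucket by the least significant k bits.  Each of the
    O(sqrt(n)) bucket arrays has size O(sqrt(n))."""
    k = (max(n, 2).bit_length() + 1) // 2   # so that 2k bits represent n
    base = 1 << k
    mask = base - 1
    high = [[] for _ in range(base)]
    for x in keys:                          # pass 1: most significant k bits
        high[x >> k].append(x)
    out = []
    for bucket in high:                     # pass 2: least significant k bits
        if not bucket:
            continue
        low = [[] for _ in range(base)]
        for x in bucket:
            low[x & mask].append(x)
        for sub in low:
            out.extend(sub)
    return out
```

Since at most \(2^k\) high buckets are non-empty and each second-pass bucket array has size \(2^k\), the total work is \(O(n + 2^{2k}) = O(n)\).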
B Lower Bound on Time
We now prove Theorem 2 by the following reduction from the predecessor problem. Let U, Q and \(G_i\) be as defined in Sect. 4.1 and let \(X = \{x_1,x_2,\ldots ,x_m\} \subseteq U\) where \(x_1< \ldots < x_m\). Define
By Lemma 5 we have that \(\delta (x_1) = x_1\), \(\delta (x_i) = x_i - x_{i-1}\) for \(i \in [2,m]\), and \(\delta (i) = 0\) for \(i \in U\setminus X\). Then, if the predecessor of some \(x \in U\) is \(x_p\), the sum telescopes and we have \(\sum _{i=0}^x\delta (i) = x_1 + (x_2 - x_1) + \ldots + (x_p - x_{p-1}) = x_p\).
On the other hand, if \(x < x_1\) then \(\sum _{i=0}^x\delta (i) = 0\), unambiguously identifying that x has no predecessor.
Applying Lemma 5 again, we have \(d \le 8m\). Furthermore, there are \(x_1 + (x_2 - x_1) + \ldots + (x_m - x_{m-1}) = x_m \le u\) gadgets in total so \(n \le 2u^2\). Hence, given a data structure that supports \(\textsf {lmco}\) in \(f_t(n,q,d)\) time using \(f_s(n,q,d)\) space, we get a data structure supporting predecessor queries on X in \(O(f_t(2u^2, 2, 8m))\) time and \(O(f_s(2u^2,2,8m))\) space, proving Theorem 2.
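The arithmetic behind the reduction can be illustrated as follows. This toy sketch abstracts the gadget string away entirely and works directly with the \(\delta \)-values; the function name `predecessor_via_delta` is ours, and the prefix sum computed in `pred` plays the role of an \(\textsf{lmco}\) query.

```python
def predecessor_via_delta(X, u):
    """With delta(x_1) = x_1 and delta(x_i) = x_i - x_{i-1}, the prefix sum
    of delta up to x telescopes to the predecessor of x in X, and equals 0
    exactly when x has no predecessor."""
    delta = [0] * (u + 1)
    prev = 0
    for x in sorted(X):
        delta[x] = x - prev
        prev = x

    def pred(x):
        s = sum(delta[: x + 1])      # what an lmco query would return
        return s if s > 0 else None  # None: x has no predecessor in X

    return pred
```

For X = {3, 7, 12}, a query for x = 10 sums 3 + 4 = 7, correctly reporting the predecessor 7.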
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bille, P., Gørtz, I.L., Stordalen, T. (2022). The Complexity of the Co-occurrence Problem. In: Arroyuelo, D., Poblete, B. (eds) String Processing and Information Retrieval. SPIRE 2022. Lecture Notes in Computer Science, vol 13617. Springer, Cham. https://doi.org/10.1007/978-3-031-20643-6_4
Print ISBN: 978-3-031-20642-9
Online ISBN: 978-3-031-20643-6