Abstract
We consider Maximum Agreement Problem which is, given positive and negative documents, to find a characteristic set that matches many of positive documents but rejects many of negative ones. A characteristic set is a sequence (x 1,...,x d) of strings such that each x i is a suffix of x i+1 and all x i’s appear in a document without overlaps. A characteristic set matches semi-structured documents with primitives or user’s de_ned macros. For example, (“set”, “characteristic set”, “</title> characteristic set”) is a characteristic set extracted from an HTML file. But, an algorithm that solves Maximum Agreement Problem does not output useless characteristic sets, such as those made of only tags of HTML, since such characteristic sets may match most of positive documents but also match most of negative ones. We present an algorithm that, given an integer d which is the number of strings in a characteristic set, solves Maximum Agreement Problem in O(n 2 h d) time, where n is the total length of documents and h is the height of the su_x tree of the documents.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
S. Abiteboul, Querying Semi-Structured Data. In Proc. of the 6th International Conference on Database Theory, (1997) 1–18.
A. V. Aho, D. S. Hirschberg and J. D. Ullman, Bounds on the Complexity of the Longest Common Subsequences Problem, J. ACM, 23(1), pp. 1–12, 1976.
H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo, Mining in the Phrasal Frontier. In Proc. of the first European Symposium on Principles of Data Mining and Knowledge Discovery (1997), 343–350.
R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases. In Proc. of the 1993 ACM SIGMOD Conference on Management of Data (1993), 207–216.
R. Agrawal and R. Srikant, Mining Sequential Patterns. In Proc. of the 11th Int. Conference on Data Engineering (1995), 3–14.
H. Arimura and S. Shimozono, Maximizing Agreement with a Classification by Bounded or Unbounded Number of Associated Words. In Proc. of the 9th International Symposium on Algorithms and Computation (1998).
H. Arimura, A. Wataki, R. Fujino, and S. Arikawa, A Fast Algorithm for Discovering Optimal String Patterns in Large Text Databases. In Proc. of the 9th Int. Workshop on Algorithmic Learning Theory, LNAI 1501, 247–261 (1998).
R. J. Bayardo, Efficiently Mining Long Patterns from Databases. In Proc. of the 1998 ACM SIGMOD Conference on Management of Data (1998), 85–93.
L. Devroye, W. Szpankowski, and B. Rais, A Note on the Height of Suffix Trees, SIAM J. Comput. Vol. 21, No. 1, pp. 48–53, February 1992.
R. Feldman and I. Dagan, Knowledge Discovery in Textual Databases (KDT). In Proc. of the first Int. Conference on Knowledge Discovery and Data Mining (1995), 112–117.
M. J. Kearns, R. E. Shapire, L. M. Sellie, Toward Efficient Agnostic Learning. Machine Learning, 17, pp. 115–141 1994.
H. Mannila and H. Toivonne, Discovering Generalized Episodes Using Minimal Occurrences, In Porc. of the second Int. Conference on Knowledge Discovery and Data Mining (1996), 146–151.
E. M. McCreight, A Space-Economical Suffix Tree Construction Algorithm. J. ACM, 23(2), pp. 262–272, 1976.
M. Nakanishi, M. Hashidume, M. Ito, A. Hashimoto, A Linear-Time Algorithm for Computing Characteristic Strings. In Proc. of the 5th International Symposium on Algorithms and Computation (1994), 315–323.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ikeda, D. (1999). Characteristic Sets of Strings Common to Semi-structured Documents. In: Arikawa, S., Furukawa, K. (eds) Discovery Science. DS 1999. Lecture Notes in Computer Science(), vol 1721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46846-3_13
Download citation
DOI: https://doi.org/10.1007/3-540-46846-3_13
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66713-1
Online ISBN: 978-3-540-46846-2
eBook Packages: Springer Book Archive