Abstract
Many documents on script-generated websites share a common HTML tree structure, which allows wrappers to extract the information of interest from deep web pages effectively. Because the tree structure evolves over time, wrappers break frequently and must be re-learned. In this paper, we explore the problem of constructing robust wrappers for deep web information extraction. To keep web extraction robust when a webpage changes, we propose a minimum cost script edit model based on machine learning techniques. The model considers three edit operations under structural change: inserting nodes, deleting nodes, and substituting nodes' labels. First, we use a machine learning method to obtain, for each HTML label, the change frequencies of the three edit operations from webpage changes observed in real web data. Then, we compute the corresponding edit costs of the three operations from these change frequencies and the minimum cost model. Finally, we select the most appropriate data for extracting the information of interest by applying the minimum cost script. Experimental results show that the proposed approach achieves robust web extraction with high accuracy.
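The three steps in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions the abstract does not state: edit costs are taken as the negative logarithm of the change frequency (so frequent changes are cheap), the HTML tree is simplified to the root-to-target label path, and the substitution cost depends only on the old label. The names `costs_from_frequencies`, `min_script_cost`, `best_candidate` and `DEFAULT_COST` are hypothetical.

```python
import math

# Assumed cost for labels with no observed change frequency (hypothetical value).
DEFAULT_COST = 10.0


def costs_from_frequencies(freq, eps=1e-9):
    """Turn per-label change frequencies into edit costs.

    Assumption: cost = -log(frequency), so frequent changes are cheap.
    Frequencies are clamped to [eps, 1] to keep the logarithm finite.
    """
    return {label: -math.log(min(max(p, eps), 1.0)) for label, p in freq.items()}


def min_script_cost(old, new, ins, dele, sub):
    """Minimum total cost of an edit script turning label path `old` into `new`.

    Simplification: paths are flat label sequences, not full trees, so this is
    a weighted string edit distance computed by dynamic programming.
    """
    n, m = len(old), len(new)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # delete every remaining old label
        dp[i][0] = dp[i - 1][0] + dele.get(old[i - 1], DEFAULT_COST)
    for j in range(1, m + 1):  # insert every remaining new label
        dp[0][j] = dp[0][j - 1] + ins.get(new[j - 1], DEFAULT_COST)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = old[i - 1], new[j - 1]
            relabel = 0.0 if a == b else sub.get(a, DEFAULT_COST)
            dp[i][j] = min(
                dp[i - 1][j] + dele.get(a, DEFAULT_COST),  # delete a
                dp[i][j - 1] + ins.get(b, DEFAULT_COST),   # insert b
                dp[i - 1][j - 1] + relabel,                # keep or relabel a
            )
    return dp[n][m]


def best_candidate(old, candidates, ins, dele, sub):
    """Pick the candidate path in the changed page that is cheapest to reach."""
    return min(candidates, key=lambda c: min_script_cost(old, c, ins, dele, sub))


# Illustrative frequencies only; real values would be learned from web data.
ins_cost = costs_from_frequencies({"div": 0.5, "table": 0.01})
del_cost = costs_from_frequencies({"div": 0.4})
sub_cost = costs_from_frequencies({"div": 0.05})

old_path = ["html", "body", "div", "td"]
candidates = [
    ["html", "body", "div", "div", "td"],  # one extra wrapping div
    ["html", "body", "table", "td"],       # div relabelled as table
]
chosen = best_candidate(old_path, candidates, ins_cost, del_cost, sub_cost)
```

Under these illustrative frequencies, inserting a `div` is a common change and therefore cheap, so the first candidate is preferred over the one that requires relabelling `div` as `table`. A full treatment would operate on trees rather than label paths.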
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, D., Wang, X., Yan, Z., Li, Q. (2012). Robust Web Data Extraction: A Novel Approach Based on Minimum Cost Script Edit Model. In: Wang, F.L., Lei, J., Gong, Z., Luo, X. (eds) Web Information Systems and Mining. WISM 2012. Lecture Notes in Computer Science, vol 7529. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33469-6_62
DOI: https://doi.org/10.1007/978-3-642-33469-6_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33468-9
Online ISBN: 978-3-642-33469-6
eBook Packages: Computer Science, Computer Science (R0)