Abstract
Web entities are often associated with many attributes that describe them, and extracting these attributes is essential for Web entity data extraction. This paper proposes a novel approach that exploits duplicated attribute-value pairs. First, we construct an initial seed set of attributes, including names and enumerable values, and a training set of Web pages from the target website. Second, we locate the position of each attribute by matching attribute values within the pages of the site against the values contained in the seed set. Third, we choose the position with the highest supportiveness as the extraction path, which we use to extract other attribute-value pairs that share the same template. Finally, we conduct an extensive experimental study on a large real data set to demonstrate the effectiveness of our extraction approach.
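The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define how a position or its supportiveness is computed, so this sketch assumes that a position is a plain tag path (for example `div/span`) and that support is the number of text nodes at that path whose text exactly matches a seed value. The names `PathTextParser`, `text_nodes`, `best_path`, and `extract`, as well as the sample pages, are hypothetical, and the markup is assumed to be well formed with no unclosed tags.

```python
from collections import Counter
from html.parser import HTMLParser


class PathTextParser(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def best_path(pages, seed_values):
    """Return the tag path whose text matches seed values most often.

    Assumption: support is a raw match count, standing in for the
    paper's supportiveness measure, which the abstract does not define.
    """
    support = Counter()
    for html in pages:
        for path, text in text_nodes(html):
            if text in seed_values:
                support[path] += 1
    if not support:
        return None
    return support.most_common(1)[0][0]


def extract(html, path):
    """Extract every text node located at the chosen path."""
    return [text for node_path, text in text_nodes(html) if node_path == path]


# Hypothetical training pages that share one template.
training_pages = [
    "<div><h1>Chair</h1><span>Red</span><p>Sturdy oak chair</p></div>",
    "<div><h1>Table</h1><span>Blue</span><p>Red</p></div>",
]
color_seeds = {"Red", "Blue"}   # enumerable seed values for one attribute

color_path = best_path(training_pages, color_seeds)
new_page = "<div><h1>Lamp</h1><span>Green</span><p>Brass desk lamp</p></div>"
extracted = extract(new_page, color_path)
```

In this toy data the value "Red" also appears once under `div/p`, but `div/span` matches seed values on both pages, so it is selected and then yields a value ("Green") that was never in the seed set. A realistic extractor would need positional indices or attributes in the path to separate sibling elements with the same tag.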
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhu, Y., Yin, G., Li, X., Wang, H., Shi, D., Yuan, L. (2011). Exploiting Attribute Redundancy for Web Entity Data Extraction. In: Xing, C., Crestani, F., Rauber, A. (eds) Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation. ICADL 2011. Lecture Notes in Computer Science, vol 7008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24826-9_15
DOI: https://doi.org/10.1007/978-3-642-24826-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24825-2
Online ISBN: 978-3-642-24826-9
eBook Packages: Computer Science, Computer Science (R0)