Abstract
Connectivity analysis of networked documents provides high-quality link structure information, which a purely content-based learning system usually ignores. It is well known that combining links and content has the potential to improve text analysis. However, exploiting link structure is non-trivial because links are often noisy and sparse. It is also difficult to balance term-based content analysis against link-based structure analysis so as to reap the benefits of both. We introduce a novel networked document clustering technique that integrates content and link information in a unified optimization framework. Under this framework, a novel dimensionality reduction method called COntent & STructure COnstrained (Costco) Feature Projection is developed. To extract robust link information from sparse and noisy link graphs, two link analysis methods are introduced. Experiments on benchmark data and diverse real-world text corpora validate the effectiveness of the proposed methods.
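To make the content/link balancing problem concrete, the following is a minimal sketch of one generic way to blend the two signals. It is not the Costco method, whose details the abstract does not give: it simply forms a weighted sum of a cosine content similarity and a symmetrized link adjacency. The function name `combined_affinity` and the trade-off weight `alpha` are assumptions introduced here for illustration only.

```python
import numpy as np

def combined_affinity(term_matrix, links, alpha=0.5):
    """Blend content similarity and link structure into one affinity matrix.

    term_matrix: (n_docs, n_terms) array of term weights (e.g. TF-IDF).
    links: (n_docs, n_docs) 0/1 adjacency matrix, possibly directed.
    alpha: hypothetical trade-off weight; 1.0 uses content only, 0.0 links only.
    """
    X = np.asarray(term_matrix, dtype=float)
    A = np.asarray(links, dtype=float)
    # Cosine similarity between documents; guard against all-zero rows.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0
    Xn = X / norms
    content_sim = Xn @ Xn.T
    # Treat links as undirected: a link in either direction connects the pair.
    link_sim = np.maximum(A, A.T)
    return alpha * content_sim + (1.0 - alpha) * link_sim

# Toy data (made up): 4 documents over 3 terms, with one link from doc 2 to doc 3.
docs = np.array([[1, 0, 0],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])
links = np.array([[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 0]])
W = combined_affinity(docs, links, alpha=0.5)
```

In this toy example, documents 0 and 1 are related only through shared terms and documents 2 and 3 only through a link, yet both pairs receive a nonzero affinity. A fixed `alpha` is the simplest possible balancing choice; sparse or noisy links, as the abstract notes, would call for a more robust treatment of the link term.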
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yan, S., Lee, D., Wang, A.H. (2011). Costco: Robust Content and Structure Constrained Clustering of Networked Documents. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2011. Lecture Notes in Computer Science, vol 6609. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19437-5_24
DOI: https://doi.org/10.1007/978-3-642-19437-5_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19436-8
Online ISBN: 978-3-642-19437-5
eBook Packages: Computer Science (R0)