Abstract
The key issue in topic-driven Web resource discovery is how to increase the harvest rate; to do so, the crawler should learn from the information it gathers online while crawling, such as the Web pages and the hyperlink structure. We address this problem by endowing a crawler with an incremental learning ability, and propose an online incremental learning algorithm (IncL). IncL effectively utilizes the multi-feature characteristics of Web pages to enhance the accuracy and reliability of link evaluation. We take into account not only a hyperlink's positive source pages but also its negative source pages in the score that is used to rank the Web pages. Many current crawling approaches ignore the effect of negative pages on page ranking. Experiments show that IncL achieves a high harvest rate.
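The abstract does not give the IncL scoring formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a "positive source page" is a crawled page judged relevant to the topic that contains the hyperlink, and a "negative source page" is one judged irrelevant. It also assumes the common definition of harvest rate as the fraction of crawled pages that are relevant. The class `LinkScorer`, the smoothed ratio in `score`, and the `penalty` parameter are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class LinkScorer:
    """Hypothetical incremental link scorer (NOT the paper's IncL formula)."""

    def __init__(self, penalty=1.0):
        # penalty: assumed weight given to each negative source page
        self.penalty = penalty
        self.pos = defaultdict(int)  # url -> count of relevant pages linking to it
        self.neg = defaultdict(int)  # url -> count of irrelevant pages linking to it

    def observe(self, page_is_relevant, outlinks):
        """Update the counts incrementally after one page has been crawled."""
        counts = self.pos if page_is_relevant else self.neg
        for url in set(outlinks):  # count each link once per source page
            counts[url] += 1

    def score(self, url):
        """Smoothed share of positive evidence; an unseen link scores 0.5."""
        p = self.pos.get(url, 0)
        n = self.neg.get(url, 0)
        return (p + 1) / (p + self.penalty * n + 2)

    def rank(self, frontier):
        """Order candidate links from most to least promising."""
        return sorted(frontier, key=self.score, reverse=True)


def harvest_rate(relevance_flags):
    """Fraction of crawled pages judged relevant (assumed definition)."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0


scorer = LinkScorer()
scorer.observe(True, ["a", "b"])    # a relevant page links to a and b
scorer.observe(False, ["b", "c"])   # an irrelevant page links to b and c
ranking = scorer.rank(["c", "a", "b"])
```

In this sketch, a link seen only on relevant pages (`a`) ranks above one seen on both kinds of page (`b`), which in turn ranks above one seen only on irrelevant pages (`c`). A scorer that ignored negative source pages would not separate `a` from `b`.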
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, H., Huang, S. (2005). An Incremental Approach to Link Evaluation in Topic-Driven Web Resource Discovery. In: Megiddo, N., Xu, Y., Zhu, B. (eds) Algorithmic Applications in Management. AAIM 2005. Lecture Notes in Computer Science, vol 3521. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496199_33
DOI: https://doi.org/10.1007/11496199_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26224-4
Online ISBN: 978-3-540-32440-9
eBook Packages: Computer Science, Computer Science (R0)