Abstract
Over the decades, the World Wide Web has become an abundant source of distributed web content, a repository hyperlinked across diverse information domains. Search engines perform very well at locating information, but they remain inadequate for focused crawling of web content. Web page classification is pivotal for information retrieval and management tasks, and it plays an important role in natural language processing for creating classified web document repositories and building indexed web directories. Conventional machine learning approaches extract the desired features from web pages in order to classify them, whereas deep learning algorithms learn the relevant features as the network goes deeper. Pre-trained models based on transfer learning, such as BERT, attain impressive performance on text classification. In this study, we evaluate the effectiveness of adopting the pre-trained model BERT for the task of classifying web pages into different categories. We propose an ensemble approach for web page classification that learns contextual representations using pre-trained bidirectional BERT and then applies deep Inception modelling with residual connections to fine-tune on the target task by utilizing parallel multi-scale semantics. Experimental evaluation shows that the proposed ensemble model outperforms the benchmark baselines and achieves better performance than the other transfer learning approaches evaluated on the web page classification task across different classification datasets.
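To make the idea of "parallel multi-scale semantics" with residual connections more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the Inception-style block runs convolutions of several kernel widths in parallel over a sequence of token vectors, concatenates their outputs, projects them back to the input width, and adds the block input. The kernel widths (1, 3, 5), the dimensions, the mean pooling, the random weights, and all function names are illustrative assumptions; the random token vectors merely stand in for contextual embeddings that BERT would produce.

```python
import random

def matvec(vec, weight):
    """Multiply a length-d_in vector by a d_in x d_out weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]

def conv1d_same(seq, kernels):
    """1D convolution over token positions with zero padding (same length).

    seq:     list of T token vectors, each of length d_in
    kernels: list of k weight matrices (d_in x d_out), one per kernel offset
    """
    half = len(kernels) // 2
    d_out = len(kernels[0][0])
    out = []
    for t in range(len(seq)):
        acc = [0.0] * d_out
        for offset, weight in enumerate(kernels):
            pos = t + offset - half
            if 0 <= pos < len(seq):  # positions outside the sequence act as zeros
                acc = [a + b for a, b in zip(acc, matvec(seq[pos], weight))]
        out.append(acc)
    return out

def relu(seq):
    return [[max(0.0, x) for x in vec] for vec in seq]

def random_kernels(rng, width, d_in, d_out, scale=0.1):
    """Hypothetical initializer: `width` random d_in x d_out matrices."""
    return [[[rng.uniform(-scale, scale) for _ in range(d_out)]
             for _ in range(d_in)] for _ in range(width)]

def inception_residual_block(seq, branches, projection):
    """Parallel convolutions of different widths, concatenated, projected, plus input."""
    branch_outputs = [relu(conv1d_same(seq, kernels)) for kernels in branches]
    # Concatenate the branch features position by position.
    merged = [sum((branch[t] for branch in branch_outputs), [])
              for t in range(len(seq))]
    projected = conv1d_same(merged, [projection])  # width-1 conv back to d_in
    # Residual connection: add the block input to the projected features.
    return [[x + p for x, p in zip(vec, proj)]
            for vec, proj in zip(seq, projected)]

def classify(seq, class_weights):
    """Mean-pool over positions, then pick the highest-scoring class index."""
    d = len(seq[0])
    pooled = [sum(vec[i] for vec in seq) / len(seq) for i in range(d)]
    scores = matvec(pooled, class_weights)
    return scores.index(max(scores))

rng = random.Random(0)
T, d_model, d_branch, n_classes = 6, 8, 4, 3  # assumed toy sizes
# Stand-in for BERT output: random vectors in place of contextual embeddings.
tokens = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(T)]
branches = [random_kernels(rng, w, d_model, d_branch) for w in (1, 3, 5)]
projection = random_kernels(rng, 1, d_branch * 3, d_model)[0]
class_weights = random_kernels(rng, 1, d_model, n_classes)[0]

features = inception_residual_block(tokens, branches, projection)
label = classify(features, class_weights)
print(len(features), len(features[0]), label)
```

Because the block preserves the sequence length and feature width, several such blocks could be stacked; in a real system the weights would be trained jointly with the BERT encoder rather than drawn at random.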
Acknowledgements
The authors would like to thank Google Colaboratory for providing free TPU access for our experiments on efficient web page classification.
About this article
Cite this article
Gupta, A., Bhatia, R. Ensemble approach for web page classification. Multimed Tools Appl 80, 25219–25240 (2021). https://doi.org/10.1007/s11042-021-10891-3
DOI: https://doi.org/10.1007/s11042-021-10891-3