Abstract
Internet users are increasingly exposed to security vulnerabilities arising from malicious Uniform Resource Locators (URLs), which serve as conduits for cyber threats. These threats are often orchestrated by sophisticated cybercriminals, so understanding the dynamics involved is essential for devising robust defense mechanisms. This paper presents an effective approach for identifying different categories of malicious URLs using machine learning algorithms. Notably, the method does not need to access a URL to extract information about it; it relies solely on attributes of the URL's lexical composition. The experiments use curated datasets from Kaggle and PhishTank, and the results are competitive with the existing literature, which predominantly relies on network-based or content-based features.
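To make the idea of purely lexical features concrete, the following is a minimal sketch in Python of how attributes might be computed from the URL string alone, without any network access. The function name `lexical_features`, the particular features, and the example URL are illustrative assumptions; the abstract does not list the features the authors actually use.

```python
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL string alone.

    The feature set is an assumption for illustration, not the paper's.
    """
    # urlsplit only finds the host when a scheme or leading "//" is present,
    # so prepend "//" to scheme-less inputs such as "example.test/path".
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parts.scheme == "https",
        # Number of non-empty path segments, e.g. /a/b/index.php has three.
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        # Number of non-empty "&"-separated items in the query string.
        "num_query_params": len([p for p in parts.query.split("&") if p]),
    }


# Hypothetical URL used only to exercise the function; it is never fetched.
features = lexical_features("http://login-example.test/a/b/index.php?id=1&x=2")
```

A dictionary like this could be turned into a numeric feature vector and passed to any standard classifier. Because every value is derived from the string itself, the URL never has to be visited.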
Notes
- 1. More information at: https://www.python.org/.
- 2. More information at: https://scikit-learn.org/stable/.
- 3.
- 4. Available at: https://phishtank.org/phish_archive.php.
Acknowledgements
This study received partial financial support from AWS, CNPq, CAPES, FINEP, and Fapemig.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rodrigues, J., Barros, C.d., Dias, D., Guimarães, M.d.P., Tuler, E., Rocha, L. (2024). Identification of Malicious URLs: A Purely Lexical Approach. In: Gervasi, O., Murgante, B., Garau, C., Taniar, D., C. Rocha, A.M.A., Faginas Lago, M.N. (eds) Computational Science and Its Applications – ICCSA 2024. ICCSA 2024. Lecture Notes in Computer Science, vol 14814. Springer, Cham. https://doi.org/10.1007/978-3-031-64608-9_26
DOI: https://doi.org/10.1007/978-3-031-64608-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-64607-2
Online ISBN: 978-3-031-64608-9
eBook Packages: Computer Science, Computer Science (R0)