Abstract
Unsolicited commercial e-mail (UCE), more commonly known as spam, is a growing problem on the Internet. Every day, people receive large numbers of unwanted advertising e-mails that flood their mailboxes. Fortunately, several spam-filtering approaches can detect and automatically delete this kind of message. However, spammers have adopted techniques that reduce the effectiveness of these filters by introducing noise into their messages. This work presents a new pre-processing technique for noise identification and reduction, and reports preliminary results from applying it with a Flexible Bayes classifier. The experimental analysis confirms the advantages of using the proposed technique to improve spam filter accuracy.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cid, I., Janeiro, L.R., Méndez, J.R., Glez-Peña, D., Fdez-Riverola, F. (2008). The Impact of Noise in Spam Filtering: A Case Study. In: Perner, P. (ed.) Advances in Data Mining. Medical Applications, E-Commerce, Marketing, and Theoretical Aspects. ICDM 2008. Lecture Notes in Computer Science, vol 5077. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70720-2_18
DOI: https://doi.org/10.1007/978-3-540-70720-2_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70717-2
Online ISBN: 978-3-540-70720-2
eBook Packages: Computer Science (R0)