Abstract
Class imbalance is a major cause of inaccurate network traffic classification. Accurate classification of traffic flows supports security monitoring, IP management, intrusion detection, and related tasks. Machine learning (ML) approaches are widely used in the literature to address the traffic classification problem. In this paper, we propose an ML-based hybrid feature selection algorithm named WMI_AUC that uses two metrics: weighted mutual information (WMI) and the area under the ROC curve (AUC). These metrics select effective features from a traffic flow. To then pick robust features from among the selected ones, we propose a robust feature selection algorithm. The proposed approach increases the accuracy of ML classifiers and helps in detecting malicious traffic. We evaluate our work using 11 well-known ML classifiers on trace datasets from different network environments. Experimental results show that our algorithms achieve more than 95% flow accuracy.
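As a rough illustration of the two metrics named in the abstract, the following Python sketch scores each feature by plain (unweighted) mutual information with the class label and by the AUC obtained when the feature value alone is used as a ranking score. The weighting used by WMI and the way WMI_AUC combines the two scores are not described in this abstract, so this is not the paper's algorithm. The function names, the toy flow features, and the assumption that features are already discretized for the mutual information computation are all illustrative.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def auc(scores, labels):
    """AUC of one feature used directly as a score (labels are 0/1).

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def score_features(features, labels):
    """Return {name: (mutual information, direction-free AUC)} per feature.

    Assumption: an AUC below 0.5 is as informative as one above it
    (the feature is just inversely related), so max(a, 1 - a) is reported.
    """
    result = {}
    for name, values in features.items():
        a = auc(values, labels)
        result[name] = (mutual_information(values, labels), max(a, 1.0 - a))
    return result


if __name__ == "__main__":
    # Hypothetical, imbalanced toy data: 6 flows of class 0, 2 of class 1.
    labels = [0, 0, 0, 0, 0, 0, 1, 1]
    features = {
        "pkt_size_bin": [0, 0, 0, 0, 0, 1, 1, 1],  # tracks the label closely
        "port_parity": [0, 1, 0, 1, 0, 1, 0, 1],   # unrelated to the label
    }
    for name, (mi, a) in score_features(features, labels).items():
        print(f"{name}: MI={mi:.3f} bits, AUC={a:.3f}")
```

A feature that scores well on both measures would be a candidate for selection. How the paper weights mutual information for minority classes, and how it filters the candidates for robustness, would have to be taken from the full text.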
Acknowledgements
This work was supported by National Natural Science Foundation of China under Grant No. 61571144.
About this article
Cite this article
Shafiq, M., Yu, X., Bashir, A.K. et al. A machine learning approach for feature selection traffic classification using security analysis. J Supercomput 74, 4867–4892 (2018). https://doi.org/10.1007/s11227-018-2263-3
DOI: https://doi.org/10.1007/s11227-018-2263-3