Abstract
Convex optimization techniques are widely applied to the models, algorithms, and applications of machine learning and data mining. For optimization-based classification methods, the sparsity principle helps to select simple classifier models, while single- and multi-kernel methods effectively handle nonlinearly separable problems. However, limited sparsity and kernel methods hinder improvements in predictive accuracy, efficiency, iterative updating, and model interpretability. In this paper, we propose a new Explainable Multi-sparsity Multi-kernel Nonconvex Optimization Least-squares Classifier (EM2NOLC) model, an optimization problem with a least-squares objective function and multi-sparsity multi-kernel nonconvex constraints, to address these issues. Based on reconstructed multiple kernel learning (MKL), the proposed model extracts important instances and features by finding sparse coefficient and kernel weight vectors, which are used to compute importance or contribution to classification and thus yield explainable predictions. The corresponding EM2NOLC algorithm is implemented with the Alternating Direction Method of Multipliers (ADMM). On real classification datasets, compared with three ADMM classifiers (the Linear Support Vector Machine Classifier, the SVM Classifier, and the Least Absolute Shrinkage and Selection Operator Classifier), two MKL classifiers (SimpleMKL and EasyMKL), and a gradient-descent classifier (Feature Selection for SVMC), the proposed EM2NOLC generally obtains the best predictive performance and explainable results with the fewest important instances and features, each with a different contribution percentage.
References
Sra S, Nowozin S, Wright SJ (eds) (2012) Optimization for machine learning. Mit Press, Cambridge
Yang X (2019) Introduction to algorithms for data mining and machine learning. Academic Press, Cambridge
Kantardzic M (2020) Data mining: concepts, models, methods, and algorithms, 3rd edn. Wiley-IEEE Press, Hoboken
Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge
Abe S (2010) Support vector machines for pattern classification, 2nd edn. Springer, Berlin
Deng N, Tian Y, Zhang C (2012) Support vector machines: optimization-based theory, algorithms, and extensions. CRC Press, Boca Raton
Simeone O (2018) A brief introduction to machine learning for engineers. Found Trends Signal Process 12(3–4):200–431
Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge
Rakotomamonjy A, Bach FR, Canu S, Grandvalet Y (2008) SimpleMKL. J Mach Learn Res 9:2491–2521
Gönen M, Alpaydin E (2011) Multiple kernel learning algorithms. J Mach Learn Res 12:2211–2268
Gu Y, Liu T, Jia X, Benediktsson JA, Chanussot J (2016) Nonlinear multiple kernel learning with multiple-structure-element extended morphological profiles for hyperspectral image classification. IEEE Trans Geosci Remote Sens 54(6):3235–3247
Zien A, Ong CS (2007) Multiclass multiple kernel learning. In: Proceedings of the 24th international conference on machine learning, ACM, pp 1191–1198
Wang T, Zhao D, Feng Y (2013) Two-stage multiple kernel learning with multiclass kernel polarization. Knowl-Based Syst 48:10–16
Nazarpour A, Adibi P (2015) Two-stage multiple kernel learning for supervised dimensionality reduction. Pattern Recogn 48(5):1854–1862
Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B (2006) Large scale multiple kernel learning. J Mach Learn Res 7:1531–1565
Aiolli F, Donini M (2015) EasyMKL: a scalable multiple kernel learning algorithm. Neurocomputing 169:215–224
Lauriola I, Gallicchio C, Aiolli F (2020) Enhancing deep neural networks via multiple kernel learning. Pattern Recogn 101:107194
Zhang Z, Gao G, Yao T, He J, Tian Y (2020) An interpretable regression approach based on bi-sparse optimization. Appl Intell 50(11):4117–4142
Bach F, Jenatton R, Mairal J, Obozinski G (2011) Optimization with sparsity-inducing penalties. Found Trends Mach Learn 4(1):1–106
Rish I, Grabarnik GY (2014) Sparse modeling: theory, algorithms, and applications. Chapman & Hall/CRC Press, Boca Raton
Gregorova M (2019) Sparse learning for variable selection with structures and nonlinearities. Doctoral dissertation, University of Geneva
Jain P, Kar P (2017) Non-convex optimization for machine learning. Found Trends Mach Learn 10(3–4):142–336
Weston J, Elisseeff A, Schölkopf B, Tipping M (2003) Use of the zero-norm with linear models and kernel methods. J Mach Learn Res 3:1439–1461
Huang K, Zheng D, Sun J, Hotta Y, Fujimoto K, Naoi S (2010) Sparse learning for support vector classification. Pattern Recogn Lett 31(13):1944–1951
Zhu J, Rosset S, Tibshirani R, Hastie TJ (2004) 1-norm support vector machines. In: Advances in neural information processing systems, pp 49–56
Wang L, Shen X (2007) On L1-norm multiclass support vector machines. J Am Stat Assoc 102(478):583–594
Chapelle O, Keerthi SS (2008) Multi-class feature selection with support vector machines. In: Proceedings of the American Statistical Association
Mairal J, Bach F, Ponce J (2012) Sparse modeling for image and vision processing. Found Trends Comput Graph Vis 8(2–3):85–283
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological) 58:267–288
Yamada M, Jitkrittum W, Sigal L, Xing EP, Sugiyama M (2014) High-dimensional feature selection by feature-wise kernelized lasso. Neural Comput 26(1):185–207
Sjöstrand K, Clemmensen LH, Larsen R, Einarsson G, Ersbøll BK (2018) Spasm: a matlab toolbox for sparse statistical modeling. J Stat Softw 84(10):1–37
Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V (2000) Feature selection for SVMs. In: Advances in neural information processing systems
Parikh N, Boyd S (2013) Proximal algorithms. Found Trends Optim 1(3):123–231
Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122
Beck A (2017) First-order methods in optimization. SIAM, Philadelphia
Gallier J, Quaintance J (2019) Fundamentals of optimization theory with applications to machine learning. University of Pennsylvania, Philadelphia
Theodoridis S (2020) Machine learning: a Bayesian and optimization perspective, 2nd edn. Academic Press, Cambridge
Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press, Cambridge
Bottou L, Curtis FE, Nocedal J (2018) Optimization methods for large-scale machine learning. SIAM Rev 60(2):223–311
Shalev-Shwartz S (2011) Online learning and online convex optimization. Found Trends Mach Learn 4(2):107–194
Hazan E (2015) Introduction to online convex optimization. Found Trends Optim 2(3–4):157–325
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. The MIT Press, Cambridge
Charniak E (2019) Introduction to deep learning. The MIT Press, Cambridge
Cao J, Wang Y, He J, Liang W, Tao H, Zhu G (2021) Predicting grain losses and waste rate along the entire chain: a multitask multigated recurrent unit autoencoder based method. IEEE Trans Industr Inform 17(6):4390–4400
Hall P, Gill N (2019) An introduction to machine learning interpretability: an applied perspective on fairness, accountability, transparency, and explainable AI, 2nd edn. O'Reilly Media, Sebastopol
Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B (2019) Interpretable machine learning: definitions, methods, and applications. PNAS 116(44):22071–22080
Molnar C (2021) Interpretable machine learning: a guide for making black box models explainable. Leanpub
Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the lasso and generalizations. CRC Press, Boca Raton
Suykens JA, Vandewalle J, Moor BD (2001) Optimal control by least squares support vector machines. Neural Netw 14(1):23–35
Xanthopoulos P, Pardalos PM, Trafalis TB (2012) Robust data mining. Springer Science & Business Media, Berlin
Boyd S, Vandenberghe L (2018) Introduction to applied linear algebra vectors, matrices, and least squares. Cambridge University Press, Cambridge
Dua D, Graff C (2019) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine. http://archive.ics.uci.edu/ml
Hand DJ (2009) Measuring classifier performance: a coherent alternative to the area under the ROC curve. Mach Learn 77:103–123
MATLAB, The MathWorks, Inc. http://www.mathworks.com
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This research was supported in part by the Key Program of the National Natural Science Foundation of China under Grant 92046026, in part by the National Natural Science Foundation of China under Grants 61877061, 71271191, 71871109, 91646204, and 71701089, in part by the Jiangsu Provincial Key Research and Development Program under Grant BE2020001-3, in part by the Jiangsu Provincial Policy Guidance Program under Grant BZ2020008, and in part by the High-End Foreign Experts Projects under Grant G2021194011L.
Author information
Authors and Affiliations
Contributions
ZZ contributed to conceptualization, methodology, validation, formal analysis, writing—original draft, funding acquisition. JH contributed to methodology, investigation, resources, validation. JC contributed to data curation, supervision, funding acquisition. SL contributed to validation, software, visualization. XL contributed to formal analysis, visualization, funding acquisition. KZ contributed to data curation, visualization, validation. PW contributed to formal analysis, writing—checking, resources. SY contributed to writing—review and editing, supervision, project administration.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
1.1 Proof of equality (6)
For any two input points \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\) (\(i,j = 1, \cdots ,n\)) from the training set \({\varvec{T}}\), suppose that a basis function \(\phi ( \cdot )\) mapping any feature value \({\varvec{x}}_{jm}\) (\(m = 1, \cdots ,d\)) from the input space to a new feature space is given. Their product \(\phi ({\varvec{x}}_{jm} ) \times \phi ({\varvec{x}}_{im} )\) in the feature space can then be replaced with the kernel function \(\kappa ({\varvec{x}}_{jm} , \, {\varvec{x}}_{im} )\) for the \(m{\text{th}}\) feature. The kernel vector \({\varvec{u}}_{ji}\) (\({\varvec{u}}_{ji} \in {\mathbb{R}}^{d}\)) over the \(d\) features is then denoted as
Given the kernel weight vector \({\varvec{\mu}}^{t}\) at iteration \(t\), the row-wise multi-kernel vector \({\varvec{A}}_{j}\) (\({\varvec{A}}_{j} \in {\mathbb{R}}^{n}\)) is defined as a weighted similarity between the input point \({\varvec{x}}_{j}\) and the other input points in the training set; that is, we have the equality
The row-wise multi-kernel matrix \({\varvec{A}}\) (\({\varvec{A}} \in {\mathbb{R}}^{n \times n}\)) then has the form
So, for any two input points \({\varvec{x}}_{i}\) and \({\varvec{x}}_{j}\), the element of the matrix \({\varvec{A}}\) is \({\varvec{A}}_{ji} = \sum\nolimits_{m = 1}^{d} {\mu_{m}^{t} \kappa ({\varvec{x}}_{jm} , \, {\varvec{x}}_{im} )}\) for all \(i,j = 1, \cdots ,n\).
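For illustration, a minimal NumPy sketch of this construction is given below. The per-feature Gaussian (RBF) kernel and the function name are assumptions made only for the example; equality (6) holds for any valid kernel \(\kappa\).

```python
import numpy as np

def rowwise_multikernel_matrix(X, mu, gamma=1.0):
    """A[j, i] = sum_m mu[m] * kappa(X[j, m], X[i, m]); a per-feature RBF
    kernel kappa(a, b) = exp(-gamma * (a - b)**2) is assumed for illustration."""
    n, d = X.shape
    A = np.zeros((n, n))
    for m in range(d):
        diff = X[:, m][:, None] - X[:, m][None, :]   # diff[j, i] = x_jm - x_im
        A += mu[m] * np.exp(-gamma * diff ** 2)      # weighted by the kernel weight mu[m]
    return A
```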
1.2 Proof of equality (8)
For any feature \({\varvec{f}}_{m}\) (\({\varvec{f}}_{m} \in {\mathbb{R}}^{n}\), \(m = 1, \cdots ,d\)) from the training set \({\varvec{T}}\), assume that the mapping function \(\phi ( \cdot )\) transforms the feature value \({\varvec{x}}_{jm}\) (\(j = 1, \cdots ,n\)) from the input space into a new feature space. The Kronecker product \({\varvec{P}}\) (\({\varvec{P}} \in {\mathbb{R}}^{n \times n}\)) with respect to feature \({\varvec{f}}_{m}\) is then computed by
If the coefficient vector \({\varvec{\lambda}}^{t}\) at iteration \(t\) is given, then the column-wise multi-kernel vector \({\varvec{B}}_{m}\) (\({\varvec{B}}_{m} \in {\mathbb{R}}^{n}\)) with respect to the \(m{\text{th}}\) feature can be obtained by multiplying the transposed matrix \({\varvec{P}}^{T}\) with the vector \({\varvec{\lambda}}^{t}\), in the form
The column-wise multi-kernel matrix \({\varvec{B}}\) (\({\varvec{B}} \in {\mathbb{R}}^{n \times d}\)) is denoted as \({\varvec{B}} = \left( {{\varvec{B}}_{1} , \cdots ,{\varvec{B}}_{d} } \right)\).
Thus, for any input point \({\varvec{x}}_{i}\) with respect to the feature \({\varvec{f}}_{m}\), the element of the matrix \({\varvec{B}}\) is \({\varvec{B}}_{im} = \sum\nolimits_{j = 1}^{n} {\lambda_{j}^{t} \kappa ({\varvec{x}}_{jm} , \, {\varvec{x}}_{im} )}\) for all \(i = 1, \cdots ,n\) and \(m = 1, \cdots ,d\).
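A matching sketch for the column-wise construction, under the same illustrative per-feature RBF kernel and with an assumed function name, is:

```python
import numpy as np

def columnwise_multikernel_matrix(X, lam, gamma=1.0):
    """B[i, m] = sum_j lam[j] * kappa(X[j, m], X[i, m]); the per-feature RBF
    kernel is again assumed only for illustration."""
    n, d = X.shape
    B = np.zeros((n, d))
    for m in range(d):
        diff = X[:, m][:, None] - X[:, m][None, :]   # K[i, j] = kappa(x_jm, x_im)
        K = np.exp(-gamma * diff ** 2)
        B[:, m] = K @ lam                            # weighted by the coefficients lam[j]
    return B
```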
1.3 Proof of the \({\mathbf{\lambda - step}}\) EM2NOLC algorithm via ADMM
For the \(\lambda {\text{ - step}}\) EM2NOLC model (7) and its ADMM optimization problem (13) with separable objective and equality constraint functions, the augmented Lagrangian function with the scaled dual variable \({\varvec{u}}\) (\({\varvec{u}} \in {\mathbb{R}}^{n}\)) and the penalty parameter \(\rho\) (\(\rho > 0\)) is defined as
\({\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}, \, {\varvec{u}}) = f({\varvec{\lambda}}) + g({\varvec{q}}) + \left( {\rho /2} \right)\left\| {{\varvec{\lambda}} - {\varvec{q}} + {\varvec{u}}} \right\|_{2}^{2} - \left( {\rho /2} \right)\left\| {\varvec{u}} \right\|_{2}^{2}.\)
The ADMM updates (17), (18), and (16) of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm are obtained from the partial derivatives of \({\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}, \, {\varvec{u}})\) with respect to its three variables \({\varvec{\lambda}}\), \({\varvec{q}}\), and \({\varvec{u}}\), respectively. We can express the \(\lambda {\text{ - minimization}}\) as the proximal operator:
Setting the gradient of \({\varvec{L}}_{\rho } ({\varvec{\lambda}}, \, {\varvec{q}}^{k} , \, {\varvec{u}}^{k} )\) with respect to \({\varvec{\lambda}}\) to zero, we analytically obtain the resulting \(\lambda {\text{ - update}}\):
The \(q{\text{ - minimization}}\) of the \(\lambda {\text{ - step}}\) EM2NOLC algorithm has the form
Because \(S_{0} (C_{\lambda } ) = \{ {\varvec{q}} \in {\mathbb{R}}^{n} |\left\| {\varvec{q}} \right\|_{0} \le C_{\lambda } \}\) is a nonconvex set, the \(q{\text{ - minimization}}\) may not converge to an optimal point. We can apply the projected gradient method to approximate the \(q{\text{ - update}}\) procedure, so we have the \(q{\text{ - update}}\):
where the indicator function satisfies \(I_{{S_{0} (C_{\lambda } )}} ({\varvec{q}}) = 0\) if \({\varvec{q}} \in S_{0} (C_{\lambda } )\) and \(I_{{S_{0} (C_{\lambda } )}} ({\varvec{q}}) = + \infty\) otherwise. For any vector \({\varvec{q}}\) (\({\varvec{q}} \in {\mathbb{R}}^{n}\)), the projection operator \(\Pi_{{S_{0} (C_{\lambda } )}} ({\varvec{q}})\) is implemented by sorting the elements of \({\varvec{q}}\) in descending order of their absolute values and setting all elements to zero except the top \(C_{\lambda }\) elements.
The \(u{\text{ - update}}\) step accumulates the constraint residual of the optimization problem (13), which ensures the convergence of the ADMM iterations (17), (18), and (16).
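For concreteness, the following minimal NumPy sketch assembles the three updates above, assuming a least-squares form \(f({\varvec{\lambda}}) = \tfrac{1}{2}\left\| {{\varvec{y}} - {\varvec{A\lambda }}} \right\|_{2}^{2}\) for the smooth part; the function names and this assumed objective are illustrative only, and the exact updates are those given in (16)–(18).

```python
import numpy as np

def project_top_k(q, C):
    """Projection onto S_0(C): keep the C entries of largest magnitude and zero
    the rest, i.e., the sorting-based rule described above."""
    out = np.zeros_like(q)
    k = int(min(C, q.size))
    if k > 0:
        idx = np.argsort(np.abs(q))[-k:]    # indices of the k largest |q_i|
        out[idx] = q[idx]
    return out

def sparse_ls_admm(A, y, C, rho=1.0, n_iter=100):
    """Scaled-form ADMM sketch for min_x 0.5*||y - A x||_2^2 s.t. ||x||_0 <= C.
    The same scaffold serves the lambda-step (A = row-wise multi-kernel matrix,
    C = C_lambda) and the mu-step (A = B, C = C_mu)."""
    p = A.shape[1]
    x = np.zeros(p)
    q = np.zeros(p)
    u = np.zeros(p)
    M = A.T @ A + rho * np.eye(p)             # matrix of the analytic x-update
    Aty = A.T @ y
    for _ in range(n_iter):
        x = np.linalg.solve(M, Aty + rho * (q - u))   # x-update (least squares)
        q = project_top_k(x + u, C)                   # q-update (projection onto S_0(C))
        u = u + x - q                                 # u-update (residual accumulation)
    return q                                          # at most C nonzero entries
```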
1.4 Proof of the \(\mu {\text{ - step}}\) EM2NOLC algorithm via ADMM
Similar to “Proof of the λ-step EM2NOLC algorithm via ADMM” in Appendix A, for the \(\mu {\text{ - step}}\) EM2NOLC model (9) and its ADMM optimization problem (20) with separable objective functions and equality constraints, the augmented Lagrangian function with the scaled dual variable \({\varvec{v}}\) (\({\varvec{v}} \in {\mathbb{R}}^{d}\)) is denoted as
\({\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}, \, {\varvec{v}}) = h({\varvec{\mu}}) + l({\varvec{z}}) + \left( {\rho /2} \right)\left\| {{\varvec{\mu}} - {\varvec{z}} + {\varvec{v}}} \right\|_{2}^{2} - \left( {\rho /2} \right)\left\| {\varvec{v}} \right\|_{2}^{2}.\)
The ADMM updates (24), (25), and (23) of the \(\mu {\text{ - step}}\) EM2NOLC algorithm are, respectively, generated from the KKT optimality conditions of \({\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}, \, {\varvec{v}})\) with respect to its three variables \({\varvec{\mu}}\), \({\varvec{z}}\), and \({\varvec{v}}\). The \(\mu {\text{ - minimization}}\) is defined as the proximal operator:
Setting the gradient of \({\varvec{L}}_{\rho } ({\varvec{\mu}}, \, {\varvec{z}}^{k} , \, {\varvec{v}}^{k} )\) with respect to \({\varvec{\mu}}\) to zero, we obtain the analytical solution, that is, the \(\mu {\text{ - update}}\):
The \(z{\text{ - minimization}}\) of the \(\mu {\text{ - step}}\) EM2NOLC algorithm has the form
Similarly, because \(S_{0} (C_{\mu } ) = \{ {\varvec{z}} \in {\mathbb{R}}^{d} |\left\| {\varvec{z}} \right\|_{0} \le C_{\mu } \}\) is a nonconvex set, we can apply the projected gradient method to the \(z{\text{ - minimization}}\) to obtain the \(z{\text{ - update}}\):
with the indicator function \(I_{{S_{0} (C_{\mu } )}} ({\varvec{z}}) = 0\) for \({\varvec{z}} \in S_{0} (C_{\mu } )\) and \(I_{{S_{0} (C_{\mu } )}} ({\varvec{z}}) = + \infty\) for \({\varvec{z}} \notin S_{0} (C_{\mu } )\). For any vector \({\varvec{z}}\) (\({\varvec{z}} \in {\mathbb{R}}^{d}\)), the projection operator \(\Pi_{{S_{0} (C_{\mu } )}} ({\varvec{z}})\) can be carried out by sorting the elements of \({\varvec{z}}\) in descending order of their absolute values and setting all elements to zero except the largest \(C_{\mu }\) elements.
Finally, the \(v{\text{ - update}}\) step accumulates the constraint residual of the ADMM problem (20), which guarantees the convergence of the ADMM updates (24), (25), and (23).
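Under the same assumed least-squares form for \(h({\varvec{\mu}})\), the \(\mu {\text{ - step}}\) reuses the scaffold above with the column-wise multi-kernel matrix \({\varvec{B}}\), the variables \(({\varvec{\mu}}, \, {\varvec{z}}, \, {\varvec{v}})\), and the sparsity budget \(C_{\mu }\). The usage example below is hypothetical and relies on the columnwise_multikernel_matrix and sparse_ls_admm sketches given earlier in this appendix.

```python
import numpy as np

# Hypothetical data, for illustration only: X is an n x d training matrix,
# y the +/-1 labels, lam_t the current coefficient vector, C_mu the budget.
rng = np.random.default_rng(0)
n, d, C_mu = 100, 20, 5
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)
lam_t = rng.standard_normal(n)

B = columnwise_multikernel_matrix(X, lam_t)  # n x d matrix from the earlier sketch
z = sparse_ls_admm(B, y, C=C_mu)             # kernel weights with at most C_mu nonzeros
print(np.flatnonzero(z))                     # indices of the selected (important) features
```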
Appendix B
2.1 KS curves in Fig. 2
See Fig. 2.
2.2 ROC curves in Fig. 3
See Fig. 3.
2.3 IFs for EM2NOLC in Fig. 4
See Fig. 4.
2.4 IFs for EM2NOLC in Fig. 5
See Fig. 5.
2.5 Correlation analysis of selected features in Fig. 6
See Fig. 6.
2.6 Correlation analysis of selected features in Fig. 7
See Fig. 7.
About this article
Cite this article
Zhang, Z., He, J., Cao, J. et al. An explainable multi-sparsity multi-kernel nonconvex optimization least-squares classifier method via ADMM. Neural Comput & Applic 34, 16103–16128 (2022). https://doi.org/10.1007/s00521-022-07282-6