Abstract
The Support Vector Machine (SVM) is one of the most important classes of machine learning models and algorithms, and has been successfully applied in various fields. Nonlinear optimization plays a crucial role in SVM methodology, both in defining the machine learning models and in designing convergent and efficient algorithms for large-scale training problems. In this paper we present the convex programming problems underlying SVM, focusing on supervised binary classification. We analyze the most important and widely used optimization methods for SVM training problems, and we discuss how the properties of these problems can be exploited in designing useful algorithms.
Appendices
Appendix A: Proof of existence and uniqueness of the optimal hyperplane
The proof of existence and uniqueness of the optimal hyperplane is based on the following steps (the margin \(\rho (w,b)\) used below is recalled right after this list):

- for each separating hyperplane H(w, b), there exists a separating hyperplane \(H({\hat{w}},{\hat{b}})\) such that
$$\begin{aligned} {1\over {\Vert w\Vert }}\le \rho (w,b)\le {1\over {\Vert {\hat{w}}\Vert }}; \end{aligned}$$
- the above condition implies that problem (2), i.e.,
$$\begin{aligned} \begin{array}{ll} \displaystyle \max \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}&\quad \rho (w,b)\\ \mathrm{s.t.}&\quad w^Tx^i+b\ge 1,\quad \forall x^i\in A\\ &\quad w^Tx^j+b\le -1,\quad \forall x^j\in B \end{array} \end{aligned}$$
admits a solution provided that the following problem
$$\begin{aligned} \begin{array}{ll} \displaystyle \max \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}&\quad {{1}\over {\Vert w\Vert }}\\ \mathrm{s.t.}&\quad w^Tx^i+b\ge 1,\quad \forall x^i\in A\\ &\quad w^Tx^j+b\le -1,\quad \forall x^j\in B \end{array} \end{aligned}$$
(54)
admits a solution;
- problem (54) is obviously equivalent to
$$\begin{aligned} \begin{array}{ll} \displaystyle \min \limits _{w\in \mathfrak {R}^n,b\in \mathfrak {R}}&\quad \Vert w\Vert ^2\\ \mathrm{s.t.}&\quad w^Tx^i+b\ge 1,\quad \forall x^i\in A\\ &\quad w^Tx^j+b\le -1,\quad \forall x^j\in B; \end{array} \end{aligned}$$
(55)
- then we prove that (55) admits a unique solution, which is also the unique solution of (2).
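For ease of reading, we recall the notion of margin used in the steps above; this is the standard definition, which we assume coincides with the one given in the body of the paper:
$$\begin{aligned} \rho (w,b)=\min _{x\in A\cup B}\frac{|w^Tx+b|}{\Vert w\Vert }, \end{aligned}$$
so that the constraints \(w^Tx^i+b\ge 1\), \(x^i\in A\), and \(w^Tx^j+b\le -1\), \(x^j\in B\), immediately give \(\rho (w,b)\ge 1/\Vert w\Vert \).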
Lemma 1
Let \(H({\hat{w}},{\hat{b}})\) be a separating hyperplane. Then
Proof
Since
it follows
\(\square \)
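As a worked restatement of the argument of Lemma 1 (a reconstruction under the margin definition recalled above, not quoted verbatim from the original): since every separating hyperplane satisfies \(|{\hat{w}}^Tx+{\hat{b}}|\ge 1\) for all \(x\in A\cup B\), we have
$$\begin{aligned} \rho ({\hat{w}},{\hat{b}})=\min _{x\in A\cup B}\frac{|{\hat{w}}^Tx+{\hat{b}}|}{\Vert {\hat{w}}\Vert }\ge \frac{1}{\Vert {\hat{w}}\Vert }. \end{aligned}$$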
Lemma 2
Given any separating hyperplane \(H({\hat{w}},{\hat{b}})\), there exists a separating hyperplane \(H({\bar{w}},{\bar{b}})\) such that
Moreover there exist two points \(x^+\in A\) and \(x^-\in B\) such that
Proof
Let \({\hat{x}}^i\in A\) and \({\hat{x}}^j\in B\) be the closest points to \(H({\hat{w}},{\hat{b}})\), that is, the two points such that
from which it follows
Let us consider the numbers \(\alpha \) and \(\beta \) such that
that is, the numbers
It can be easily verified that \(0<\alpha \le 1\). We will show that the hyperplane \(H({\bar{w}},{\bar{b}}) \equiv H(\alpha {\hat{w}},\beta )\) is a separating hyperplane for the sets A and B, and it is such that (56) holds. Indeed, using (58), we have
As \(\alpha >0\), we can write
from which we get that \({\bar{w}}\) and \({\bar{b}}\) satisfy (1), and hence that \(H({\bar{w}},{\bar{b}})\) is a separating hyperplane for the sets A and B.
Furthermore, taking into account (61) and the value of \(\alpha \), we have
Condition (56) follows from the above equality and (59). Using (60) we obtain that (57) holds with \(x^+={\hat{x}}^i\) and \(x^-={\hat{x}}^j\). \(\square \)
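To make the scaling argument concrete, here is a hedged sketch of the construction (our reconstruction of the displays defining \(\alpha \) and \(\beta \)): requiring that the closest points lie exactly on the hyperplanes \(\alpha {\hat{w}}^Tx+\beta =\pm 1\) gives
$$\begin{aligned} \alpha {\hat{w}}^T{\hat{x}}^i+\beta =1,\qquad \alpha {\hat{w}}^T{\hat{x}}^j+\beta =-1, \end{aligned}$$
that is,
$$\begin{aligned} \alpha =\frac{2}{{\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)},\qquad \beta =-\frac{{\hat{w}}^T({\hat{x}}^i+{\hat{x}}^j)}{{\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)}; \end{aligned}$$
since \({\hat{w}}^T{\hat{x}}^i+{\hat{b}}\ge 1\) and \({\hat{w}}^T{\hat{x}}^j+{\hat{b}}\le -1\), we have \({\hat{w}}^T({\hat{x}}^i-{\hat{x}}^j)\ge 2\), which explains why \(0<\alpha \le 1\).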
Proposition 4
The following problem
admits a unique solution \((w^\star ,b^\star )\).
Proof
Let \({{\mathcal {F}}}\) be the feasible set, that is,
Given any \((w_o,b_o)\in {{\mathcal {F}}}\), let us consider the level set
The set \({{\mathcal {L}}}_o\) is closed, and we will show that it is also bounded. To this end, assume by contradiction that there exists an unbounded sequence \(\{(w_k,b_k)\}\) belonging to \({{\mathcal {L}}}_o\). Since \(\Vert w_k\Vert \le \Vert w_o\Vert ,\forall k\), we must have \(|b_k|\rightarrow \infty \). For any k we can write
and hence, as \(|b_k|\rightarrow \infty \), for k sufficiently large, we have \(\Vert w_k\Vert ^2 >\Vert w_o\Vert ^2\), and this contradicts the fact that \(\{(w_k,b_k)\}\) belongs to \({{\mathcal {L}}}_o\). Thus \({{\mathcal {L}}}_o\) is a compact set.
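A hedged sketch of the inequality invoked above (a reconstruction consistent with the feasibility constraints, not quoted from the original): taking any \(x^i\in A\) and \(x^j\in B\), feasibility and the Cauchy–Schwarz inequality give
$$\begin{aligned} \Vert w_k\Vert \,\Vert x^i\Vert \ge w_k^Tx^i\ge 1-b_k,\qquad \Vert w_k\Vert \,\Vert x^j\Vert \ge -w_k^Tx^j\ge 1+b_k, \end{aligned}$$
so that \(|b_k|\rightarrow \infty \) forces \(\Vert w_k\Vert \rightarrow \infty \), contradicting \(\Vert w_k\Vert \le \Vert w_o\Vert \).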
Weierstrass's theorem implies that the function \(\Vert w\Vert ^2\) admits a minimum \((w^\star ,b^\star )\) on \({{\mathcal {L}}}_o\), and hence on \({{\mathcal {F}}}\). As a consequence, \((w^\star ,b^\star )\) is a solution of (62).
In order to prove that \((w^\star ,b^\star )\) is the unique solution, assume by contradiction that there exists a pair \(({\bar{w}},{\bar{b}})\in {{\mathcal {F}}}\), \(({\bar{w}},{\bar{b}})\ne (w^\star ,b^\star )\), such that \(\Vert {\bar{w}}\Vert ^2=\Vert w^\star \Vert ^2\). Suppose first that \({\bar{w}}\ne w^\star \). The set \({{\mathcal {F}}}\) is convex, so that
Since \(\Vert w\Vert ^2\) is a strictly convex function, for any \(\lambda \in (0,1)\) it follows
Taking \(\lambda =1/2\), which corresponds to considering the pair \(({\tilde{w}},{\tilde{b}})\equiv \displaystyle \left( {1\over 2}w^\star +{1\over 2}{\bar{w}},\,{1\over 2}b^\star +{1\over 2}{\bar{b}}\right) \), we have \(({\tilde{w}},{\tilde{b}})\in {{\mathcal {F}}}\) and
and this contradicts the fact that \((w^\star ,b^\star )\) is a global minimum. Therefore, we must have \({\bar{w}}\equiv w^\star \).
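The strict convexity step can be sketched as follows (our reconstruction, valid under the working assumption \({\bar{w}}\ne w^\star \) with \(\Vert {\bar{w}}\Vert =\Vert w^\star \Vert \)):
$$\begin{aligned} \Vert {\tilde{w}}\Vert ^2=\left\Vert {1\over 2}w^\star +{1\over 2}{\bar{w}}\right\Vert ^2<{1\over 2}\Vert w^\star \Vert ^2+{1\over 2}\Vert {\bar{w}}\Vert ^2=\Vert w^\star \Vert ^2, \end{aligned}$$
which is the contradiction used above.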
Assume \(b^\star >{\bar{b}}\) (the case \(b^\star <{\bar{b}}\) is analogous), and consider the point \({\hat{x}}^i\in A\) such that
(the existence of such a point follows from (57) of Lemma 2). We have
and this contradicts the fact that \({\bar{w}}^Tx^i+{\bar{b}}\ge 1,~\forall x^i\in A\). As a consequence, we must have \({\bar{b}}\equiv b^\star \), and hence the uniqueness of the solution is proved. \(\square \)
Proposition 5
Let \((w^\star ,b^\star )\) be the solution of (62). Then, \((w^\star ,b^\star )\) is the unique solution of the following problem
Proof
We observe that \((w^\star ,b^\star )\) is the unique solution of the problem
Lemmas 1 and 2 imply that, for any separating hyperplane H(w, b), we have
and hence, for the separating hyperplane \(H(w^\star ,b^\star )\) we obtain \(\rho (w^\star , b^\star )=\displaystyle {1\over {\Vert w^\star \Vert }}\), which implies that \(H(w^\star ,b^\star )\) is the optimal separating hyperplane. \(\square \)
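A hedged sketch of the chain of inequalities behind this conclusion (our reconstruction from Lemmas 1 and 2 and the optimality of \((w^\star ,b^\star )\) for (62)): for any separating hyperplane H(w, b), Lemma 2 provides a separating hyperplane \(H({\bar{w}},{\bar{b}})\), feasible for (62), such that
$$\begin{aligned} \rho (w,b)\le \frac{1}{\Vert {\bar{w}}\Vert }\le \frac{1}{\Vert w^\star \Vert }\le \rho (w^\star ,b^\star ), \end{aligned}$$
where the second inequality follows from \(\Vert w^\star \Vert \le \Vert {\bar{w}}\Vert \) and the third from Lemma 1.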
Appendix B: The Wolfe dual and its properties
Consider the convex problem
with \(f:\mathfrak {R}^n\rightarrow \mathfrak {R}\) convex and continuously differentiable, \(g:\mathfrak {R}^n\rightarrow \mathfrak {R}^m\) convex and continuously differentiable, and \(h:\mathfrak {R}^n\rightarrow \mathfrak {R}^p\) affine functions. Then its Wolfe dual is
where \(L(x,\lambda ,\mu ) = f(x)+\lambda ^Tg(x)+\mu ^Th(x)\).
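For readability, we write out the standard forms that we assume for the primal (64) and its Wolfe dual (65) (the usual textbook formulation, taken to coincide with the one in the paper):
$$\begin{aligned} \begin{array}{ll} \displaystyle \min \limits _{x\in \mathfrak {R}^n}&\quad f(x)\\ \mathrm{s.t.}&\quad g(x)\le 0,\\ &\quad h(x)=0, \end{array} \end{aligned}$$
and
$$\begin{aligned} \begin{array}{ll} \displaystyle \max \limits _{x,\lambda ,\mu }&\quad L(x,\lambda ,\mu )\\ \mathrm{s.t.}&\quad \nabla _xL(x,\lambda ,\mu )=0,\\ &\quad \lambda \ge 0. \end{array} \end{aligned}$$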
Proposition 6
Let \(x^*\) be a global solution of problem (64) with multipliers \((\lambda ^*,\mu ^*)\). Then it is also a solution of problem (65) and there is zero duality gap, i.e., \(f(x^*) = L(x^*,\lambda ^*,\mu ^*)\).
Proof
The point \((x^*,\lambda ^*,\mu ^*)\) is clearly feasible for problem (65) since it satisfies the KKT conditions of problem (64). Furthermore, by complementarity (\((\lambda ^*)^Tg(x^*)=0\)) and feasibility (\(h(x^*)=0\))
so that there is zero duality gap. Furthermore, for any \(\lambda \ge 0\), \(\mu \in \mathfrak {R}^p\), by the feasibility of \(x^*\), we have
By the convexity assumptions on f and g, the nonnegativity of \(\lambda \) and by the linearity of h, we get that \(L(\cdot ,\lambda ,\mu )\) is a convex function in x and hence, for any feasible \((x,\lambda ,\mu )\), we can write
where the last equality derives from the constraints of problem (65). By combining (66) and (67), we get
and hence \((x^*,\lambda ^*,\mu ^*)\) is a global solution of problem (65). \(\square \)
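As a hedged reconstruction of the two chains of inequalities combined at the end of the proof (consistent with the surrounding text, not quoted from the original): for any \(\lambda \ge 0\) and \(\mu \in \mathfrak {R}^p\), the feasibility of \(x^*\) gives
$$\begin{aligned} L(x^*,\lambda ,\mu )=f(x^*)+\lambda ^Tg(x^*)+\mu ^Th(x^*)\le f(x^*)=L(x^*,\lambda ^*,\mu ^*), \end{aligned}$$
while, for any \((x,\lambda ,\mu )\) feasible for (65), the convexity of \(L(\cdot ,\lambda ,\mu )\) and the constraint \(\nabla _xL(x,\lambda ,\mu )=0\) give
$$\begin{aligned} L(x^*,\lambda ,\mu )\ge L(x,\lambda ,\mu )+\nabla _xL(x,\lambda ,\mu )^T(x^*-x)=L(x,\lambda ,\mu ), \end{aligned}$$
so that \(L(x^*,\lambda ^*,\mu ^*)\ge L(x,\lambda ,\mu )\) at every dual feasible point.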
A stronger result can be proved when the primal problem is a convex quadratic programming problem defined by (6).
Proposition 7
Let \(f(x)={{1}\over {2}}x^TQx+c^Tx\), and suppose that the matrix Q is symmetric and positive semidefinite. Let \(({\bar{x}},{\bar{\lambda }})\) be a solution of Wolfe’s dual (7). Then, there exists a vector \(x^\star \) (not necessarily equal to \({\bar{x}}\)) such that
- (i) \(Q(x^\star -{\bar{x}})=0\);
- (ii) \(x^\star \) is a solution of problem (6); and
- (iii) \(x^\star \) is a global minimum of (6) with associated multipliers \({\bar{\lambda }}\).
Proof
First, we show that in this case problem (7) is a convex quadratic programming problem. In the quadratic case, problem (7) becomes:
Multiplying the constraints of (68) by \(x^T\), we get
which implies that the objective function of (68) can be rewritten as
which shows that problem (68) is actually a convex quadratic optimization problem. For this problem, the KKT conditions are necessary and sufficient for global optimality; denoting by v the multipliers of the equality constraints (69) and by z the multipliers of the constraints (70), there must exist multipliers v and z such that the following conditions hold:
The expression of z can be derived from constraints (72) and substituted into (73) and (74), implying:
Furthermore, by subtracting (71) from (75), we get
By combining (76), (77), (78) and (79), we get that the pair \((v,{\bar{\lambda }})\) satisfies the KKT conditions of problem (6); hence, setting \(x^\star =v\), we obtain the thesis, taking into account that point (i) derives from (71). \(\square \)
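As a hedged illustration of the simplification described in the proof (our sketch; the matrix A and vector d below are assumed notation for linear inequality constraints \(Ax\ge d\) in (6), not taken from the original): the Wolfe dual of \(\min \,{1\over 2}x^TQx+c^Tx\) subject to \(Ax\ge d\) is
$$\begin{aligned} \max \limits _{x,\lambda }\;{1\over 2}x^TQx+c^Tx-\lambda ^T(Ax-d)\quad \mathrm{s.t.}\quad Qx+c-A^T\lambda =0,\;\lambda \ge 0, \end{aligned}$$
and multiplying the equality constraint by \(x^T\) gives \(x^TQx+c^Tx-\lambda ^TAx=0\), so that the objective can be rewritten as \(-{1\over 2}x^TQx+d^T\lambda \), a concave quadratic function maximized over linear constraints, i.e., a convex quadratic program.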
Appendix C: Kernel characterization
Proposition 8
Let \(K:X\times X\rightarrow \mathfrak {R}\) be a symmetric function. Function K is a kernel if and only if the \(l\times l\) matrix with entries \(K(x^i,x^j)\), \(i,j=1,\ldots ,l\),
is positive semidefinite for any set of training vectors \(\{x^1,\ldots ,x^l\}\).
Proof
- necessity: Symmetry derives from the symmetry of the function K. To prove positive semidefiniteness, we consider the quadratic form: for any \(v\in \mathfrak {R}^l\), with \(z=\sum _{i=1}^lv_i\phi (x^i)\),
$$\begin{aligned} v^TKv&=\displaystyle \sum \limits _{i=1}^l\sum \limits _{j=1}^lv_iv_jK(x^i,x^j) =\sum _{i=1}^l\sum \limits _{j=1}^lv_iv_j\langle \phi (x^i),\phi (x^j)\rangle \\&=\left\langle \sum \limits _{i=1}^lv_i \phi (x^i), \sum _{j=1}^lv_j\phi (x^j)\right\rangle =\langle z,z\rangle \ge 0; \end{aligned}$$
- sufficiency: Assume
$$\begin{aligned} \left( \begin{array}{ccc} K(x^1,x^1)&\ldots &K(x^1,x^l)\\ &\vdots &\\ K(x^l,x^1)&\ldots &K(x^l,x^l) \end{array} \right) \succeq 0 \end{aligned}$$
(80)
We need to prove that there exists a linear space \({{\mathcal {H}}}\), a function \(\phi :X\rightarrow {{\mathcal {H}}}\) and a scalar product \(\langle \cdot ,\cdot \rangle \) defined on \({{\mathcal {H}}}\) such that \(K(x,y) = \langle \phi (x),\phi (y)\rangle \) for all \(x,y\in X\). Consider the linear space
$$\begin{aligned} \displaystyle {{\mathcal {H}}} = lin\left\{ K(\cdot ,y)\,:\, y\in X\right\} \end{aligned}$$with the generic element \(f(\cdot )\)
$$\begin{aligned} \displaystyle f = \sum \limits _{i=1}^m\alpha _iK(\cdot ,x^i) \end{aligned}$$
for any \(m\in N\), with \(\alpha _i\in \mathfrak {R}\) for \(i=1,\ldots ,m\). Given two elements \(f,g\in {{\mathcal {H}}}\), with \(g(\cdot ) = \sum _{j=1}^{m^\prime }\beta _jK(\cdot ,x^j)\), define the function \(\rho :{{\mathcal {H}}}\times {{\mathcal {H}}}\rightarrow \mathfrak {R}\) as
$$\begin{aligned} \rho (f,g)=\sum \limits _{i=1}^m\sum \limits _{j=1}^{m^\prime }\alpha _i\beta _j K(x^i,x^j) \end{aligned}$$It can be shown that the function \(\rho \) is a scalar product in the space \({{\mathcal {H}}}\), by showing that the following properties hold:
- (i) \(\rho (f,g) = \rho (g,f)\);
- (ii) \(\rho (f^1+f^2,g) = \rho (f^1,g)+\rho (f^2,g)\);
- (iii) \(\rho (\lambda f,g) = \lambda \rho (f,g)\);
- (iv) \(\rho (f,f)\ge 0\), and \(\rho (f,f)=0\) implies \(f=0\).
The first three properties are a consequence of the definition of \(\rho \) and can be easily verified. We need to show property (iv). First, we observe that, given \(f^1,\ldots ,f^p\) in \({{\mathcal {H}}}\) the matrix with elements \(\rho _{st} = \rho (f^s,f^t)\) is symmetric (thanks to property (i)) and positive semidefinite. Indeed,
$$\begin{aligned} \displaystyle \sum \limits _{i=1}^p\sum \limits _{j=1}^p\gamma _i\gamma _j\rho _{ij} =\sum _{i=1}^p\sum \limits _{j=1}^p\gamma _i\gamma _j\rho (f^i,f^j) =\rho \left( \sum \limits _{i=1}^p\gamma _if^i,\sum _{j=1}^p\gamma _jf^j\right) \ge 0 \end{aligned}$$
This implies in turn that all principal minors have nonnegative determinant. Consider any \(2\times 2\) principal minor, with elements \(\rho _{ij} = \rho (f^i,f^j)\). The nonnegativity of the determinant and the symmetry of the matrix imply
$$\begin{aligned}&\rho (f^i,f^i)\rho (f^j,f^j)- \rho (f^i,f^j) \rho (f^j,f^i )\\&=\rho (f^i,f^i)\rho (f^j,f^j)- \rho (f^i,f^j)^2\ge 0 \end{aligned}$$so that
$$\begin{aligned} \rho (f^i,f^j)^2\le \rho (f^i,f^i)\rho (f^j,f^j) \end{aligned}$$
(81)
We note that, setting \(m^\prime =1\) and \(g(\cdot ) = K(\cdot ,x)\), f(x) can be written as
$$\begin{aligned} f(x) = \displaystyle \sum \limits _{i=1}^m\alpha _iK(x,x^i) = \rho (K(\cdot ,x),f) \end{aligned}$$with \(K(\cdot ,x)\in {{\mathcal {H}}}\). Furthermore, for any \(x,y \in X\), we get
$$\begin{aligned} \rho (K(\cdot ,x),K(\cdot ,y)) = K(x,y) \end{aligned}$$
Using (81) with \(f^i = K(\cdot ,x)\) and \(f^j = f\), we get
$$\begin{aligned} f(x)^2 = \rho (K(\cdot ,x),f)^2\le \rho (K(\cdot ,x),K(\cdot ,x))\,\rho (f,f) = K(x,x)\,\rho (f,f), \end{aligned}$$
which, together with (80), gives both \(\rho (f,f)\ge 0\) and that, if \(\rho (f,f)=0\), then \(f(x)^2\le 0\) for all \(x\in X\), and hence \(f=0\). \(\square \)
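As a small illustrative example of the necessity direction (ours, not taken from the original): for the Gaussian kernel \(K(x,y)=e^{-\gamma \Vert x-y\Vert ^2}\) with \(\gamma >0\) and two distinct points \(x^1,x^2\), the Gram matrix is
$$\begin{aligned} \left( \begin{array}{cc} 1&\quad k\\ k&\quad 1 \end{array} \right) ,\qquad k=e^{-\gamma \Vert x^1-x^2\Vert ^2}\in (0,1), \end{aligned}$$
whose eigenvalues \(1\pm k\) are nonnegative, consistently with Proposition 8.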
Cite this article
Piccialli, V., Sciandrone, M. Nonlinear optimization and support vector machines. 4OR-Q J Oper Res 16, 111–149 (2018). https://doi.org/10.1007/s10288-018-0378-2
Keywords
- Statistical learning theory
- Support vector machine
- Convex quadratic programming
- Wolfe’s dual theory
- Kernel functions
- Nonlinear optimization methods