Abstract
In the development of first-order methods for smooth (resp., composite) convex optimization problems, where smooth functions with Lipschitz continuous gradients are minimized, the gradient (resp., gradient mapping) norm is a fundamental optimality measure. Under this measure, a fixed-iteration algorithm with the optimal iteration complexity for the smooth case is known, but determining the number of iterations needed to reach a desired accuracy requires prior knowledge of the distance from the initial point to the optimal solution set. In this paper, we propose an adaptive regularization approach that attains the nearly optimal iteration complexity without knowing the distance to the optimal solution set. To achieve even faster convergence adaptively, we then apply this approach to construct a first-order method that is adaptive to the Hölderian error bound condition (or, equivalently, the Łojasiewicz gradient property), which covers moderately wide classes of applications. The proposed method attains a nearly optimal iteration complexity with respect to the gradient mapping norm.
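As a rough illustration of the optimality measure discussed above, the following Python sketch runs a standard accelerated proximal gradient (FISTA-type) loop on a composite problem and stops once the gradient mapping norm \(\Vert G_L(x)\Vert \) falls below a tolerance. It is a minimal sketch under assumed data (a least-squares-plus-\(\ell _1\) instance with made-up names); it is not the adaptive regularization method proposed in the paper.

```python
# Illustrative sketch only: a standard FISTA loop using the gradient-mapping
# norm ||G_L(x)|| as the stopping measure discussed in the abstract.
# The test problem and all names are assumptions for this example; this is
# NOT the adaptive regularization method proposed in the paper.
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau*||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def gradient_mapping(x, grad_f, prox_h, L):
    """G_L(x) = L * (x - prox_{h/L}(x - grad f(x)/L)); zero iff x is optimal."""
    return L * (x - prox_h(x - grad_f(x) / L, 1.0 / L))

def fista(grad_f, prox_h, L, x0, tol=1e-6, max_iter=5000):
    """Accelerated proximal gradient method, stopped via ||G_L(x)|| <= tol."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(max_iter):
        x_next = prox_h(y - grad_f(y) / L, 1.0 / L)        # proximal gradient step
        if np.linalg.norm(gradient_mapping(x_next, grad_f, prox_h, L)) <= tol:
            return x_next
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))  # Nesterov momentum
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)
        x, t = x_next, t_next
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b, lam = rng.standard_normal((40, 100)), rng.standard_normal(40), 0.1
    L = np.linalg.norm(A, 2) ** 2                          # Lipschitz constant of grad f
    grad_f = lambda x: A.T @ (A @ x - b)                   # f(x) = 0.5*||Ax - b||^2
    prox_h = lambda v, s: soft_threshold(v, s * lam)       # h(x) = lam*||x||_1
    x_hat = fista(grad_f, prox_h, L, np.zeros(100))
    print(np.linalg.norm(gradient_mapping(x_hat, grad_f, prox_h, L)))
```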
Data Availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Notes
Suppose that \(\varphi \) is an L-smooth convex function satisfying (4) for some exponent \(\rho \in [1,2[\). For \(\rho =1\), Lemma 2.2 implies \(\left\| \nabla {\varphi }(x)\right\| \ge \kappa \) for \(x \not \in \mathop {\text {Argmin}}\varphi \). If \(\rho \in ]1,2[\), on the other hand, Lemmas 2.1 and 2.2 imply \({\frac{1}{2L}\left\| \nabla {\varphi }(x)\right\| ^2 \le \varphi (x)-\varphi ^* \le \kappa ^{-\frac{1}{\rho -1}}\left\| \nabla {\varphi }(x)\right\| ^{\frac{\rho }{\rho -1}}}\) for all x, which yields \(\left\| \nabla {\varphi }(x)\right\| \ge \text {const.}>0\) for all \(x \not \in \mathop {\text {Argmin}}\varphi \). In either case, this contradicts the continuity of \(\nabla {\varphi }\) at points in \(\mathop {\text {Argmin}}\varphi \).
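For concreteness, dividing the chain of inequalities above by \(\left\| \nabla {\varphi }(x)\right\| ^{\frac{\rho }{\rho -1}}>0\) (valid since \(x \not \in \mathop {\text {Argmin}}\varphi \)) makes the constant explicit; this rearrangement is ours and uses only the two displayed bounds:
$$\begin{aligned} \frac{1}{2L}\left\| \nabla {\varphi }(x)\right\| ^{\frac{\rho -2}{\rho -1}} \le \kappa ^{-\frac{1}{\rho -1}} \quad \Longrightarrow \quad \left\| \nabla {\varphi }(x)\right\| \ge (2L)^{-\frac{\rho -1}{2-\rho }}\,\kappa ^{\frac{1}{2-\rho }} > 0, \end{aligned}$$where the implication flips the inequality because the exponent \(\frac{\rho -2}{\rho -1}\) is negative for \(\rho \in ]1,2[\).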
By the strong convexity of \(\left\langle a,x\right\rangle +h(x)\) and \(\left\langle b,x\right\rangle +h(x)\), we have
$$\begin{aligned} \frac{\mu }{2}\left\| x_a^*-x_b^*\right\| ^2 \le [\left\langle a,x_b^*\right\rangle +h(x_b^*)] - [\left\langle a,x_a^*\right\rangle +h(x_a^*)] ~~\text {and}~~ \frac{\mu }{2}\left\| x_a^*-x_b^*\right\| ^2 \le [\left\langle b,x_a^*\right\rangle +h(x_a^*)]-[\left\langle b,x_b^*\right\rangle +h(x_b^*)], \end{aligned}$$respectively. Adding them implies \( \mu \left\| x_a^*-x_b^*\right\| ^2 \le \left\langle a-b,x_b^*-x_a^*\right\rangle \le \left\| a-b\right\| \left\| x_b^*-x_a^*\right\| . \)
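Dividing the final inequality by \(\left\| x_a^*-x_b^*\right\| \) (assuming it is nonzero; otherwise the bound is trivial) makes the consequence explicit; this restatement is ours:
$$\begin{aligned} \left\| x_a^*-x_b^*\right\| \le \frac{1}{\mu }\left\| a-b\right\| , \end{aligned}$$i.e., the minimizer map \(a \mapsto x_a^*\) is \(\frac{1}{\mu }\)-Lipschitz continuous.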
In fact, since the derivative \(\log (1+x)\) of the function \(h(x):=(1+x)\log (1+x)-x\) is increasing and vanishes at \(x=0\), we have \(\min _{x>-1}h(x)=h(0)=0\).
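For completeness, the derivative computation behind this claim is the elementary step
$$\begin{aligned} h'(x)=\log (1+x)+(1+x)\cdot \frac{1}{1+x}-1=\log (1+x), \end{aligned}$$which is negative on \(]-1,0[\) and positive on \(]0,\infty [\), so \(h\) attains its minimum at \(x=0\).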
This assumption on \(L_f/L_{\min }\) means that we know a good estimate \(L_{\min }\) close to \(L_f\) in advance.
Although (29) is asserted in the case \({x^{(t)}_+} \not \in X^*\), it trivially holds if \({x^{(t)}_+} \in X^*\) unless \(\rho = 1\).
Acknowledgements
The authors are grateful to an anonymous referee for valuable comments that improved this paper. This work was partially supported by the Grant-in-Aid for Young Scientists (B) (17K12645) and the Grant-in-Aid for Scientific Research (C) (18K11178) from the Japan Society for the Promotion of Science.
Additional information
Communicated by Lionel Thibault.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Ito, M., Fukuda, M. Nearly Optimal First-Order Methods for Convex Optimization under Gradient Norm Measure: an Adaptive Regularization Approach. J Optim Theory Appl 188, 770–804 (2021). https://doi.org/10.1007/s10957-020-01806-7
Keywords
- Smooth/composite convex optimization
- Accelerated proximal gradient methods
- Hölderian error bound
- Adaptive methods