Abstract
In this paper, we propose a stochastic Gauss–Newton (SGN) algorithm for the online principal component analysis (OPCA) problem, which we formulate by means of the symmetric low-rank product model for dominant eigenspace calculation. Compared with existing OPCA solvers, SGN exhibits improved robustness with respect to varying input data and algorithm parameters. In addition, by evaluating the data stream through approximated objective functions, we develop a new adaptive stepsize strategy for SGN that requires no a priori knowledge of the input data, and we numerically illustrate that its performance is comparable to that of SGN with a manually tuned diminishing stepsize. Without assuming the eigengap to be positive, we also establish the global and optimal convergence rate of SGN with the specified stepsize using the diffusion approximation theory.
Data availability
The real datasets that support the findings of this study are all openly available. To be specific, the MNIST dataset is available at http://yann.lecun.com/exdb/mnist/, the Fashion-MNIST dataset is available at https://github.com/zalandoresearch/fashion-mnist, and the CIFAR-10 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html.
References
Allen-Zhu, Z., Li, Y.: First efficient convergence for streaming \(k\)-PCA: a global, gap-free, and near-optimal rate. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 487–492. IEEE (2017)
Balcan, M.-F., Du, S.S., Wang, Y., Yu, A.W.: An improved gap-dependency analysis of the noisy power method. In: Conference on Learning Theory, pp. 284–309. PMLR (2016)
Balsubramani, A., Dasgupta, S., Freund, Y.: The fast convergence of incremental PCA. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3174–3182 (2013)
Balzano, L., Chi, Y., Lu, Y.M.: Streaming PCA and subspace tracking: the missing data case. Proc. IEEE 106(8), 1293–1310 (2018)
Cardot, H., Degras, D.: Online principal component analysis in high dimension: which algorithm to choose? Int. Stat. Rev. 86(1), 29–50 (2018)
Chen, M., Yang, L., Wang, M., Zhao, T.: Dimensionality reduction for stationary time series via stochastic nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3500–3510 (2018)
De Sa, C., Re, C., Olukotun, K.: Global convergence of stochastic gradient descent for some non-convex matrix problems. In: International Conference on Machine Learning, pp. 2332–2341. PMLR (2015)
Drusvyatskiy, D., Paquette, C.: Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program. 178, 503–558 (2019)
Ethier, S.N., Kurtz, T.G.: Markov Processes: Characterization and Convergence. Wiley, New York (1986)
Feng, Y., Li, L., Liu, J.-G.: Semigroups of stochastic gradient descent and online principal component analysis: properties and diffusion approximations. Commun. Math. Sci. 16(3), 777–789 (2018)
Gao, B., Liu, X., Yuan, Y.: Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 41(3), A1949–A1983 (2019)
Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2013)
Hardt, M., Price, E.: The noisy power method: a meta algorithm with applications. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, pp. 2861–2869 (2014)
Henriksen, A., Ward, R.: AdaOja: adaptive learning rates for streaming PCA. arXiv preprint arXiv:1905.12115 (2019)
Huang, D., Niles-Weed, J., Ward, R.: Streaming \(k\)-PCA: efficient guarantees for Oja’s algorithm, beyond rank-one updates. In: Conference on Learning Theory, pp. 2463–2498. PMLR (2021)
Jain, P., Jin, C., Kakade, S.M., Netrapalli, P., Sidford, A.: Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. In: Conference on Learning Theory, pp. 1147–1164. PMLR (2016)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 1, pp. 315–323 (2013)
Jolliffe, I.T.: Principal Component Analysis. Springer, New York (1986)
Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 374(2065), 20150202 (2016)
Knyazev, A.V.: Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput. 23(2), 517–541 (2001)
Krasulina, T.P.: The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. USSR Comput. Math. Math. Phys. 9(6), 189–195 (1969)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto (2009)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Li, C.J., Wang, M., Liu, H., Zhang, T.: Near-optimal stochastic approximation for online principal component estimation. Math. Program. 167(1), 75–97 (2018)
Li, C.J., Wang, M., Liu, H., Zhang, T.: Diffusion approximations for online principal component estimation and global convergence. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Li, C.-L., Lin, H.-T., Lu, C.-J.: Rivalry of two families of algorithms for memory-restricted streaming PCA. In: Artificial Intelligence and Statistics, pp. 473–481. PMLR (2016)
Liu, X., Wen, Z., Zhang, Y.: Limited memory block Krylov subspace optimization for computing dominant singular value decompositions. SIAM J. Sci. Comput. 35(3), A1641–A1668 (2013)
Liu, X., Wen, Z., Zhang, Y.: An efficient Gauss–Newton algorithm for symmetric low-rank product matrix approximations. SIAM J. Optim. 25(3), 1571–1608 (2015)
Lv, J.C., Yi, Z., Tan, K.K.: Global convergence of Oja’s PCA learning algorithm with a non-zero-approaching adaptive learning rate. Theoret. Comput. Sci. 367(3), 286–307 (2006)
Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)
Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559–572 (1901)
Raychaudhuri, S., Stuart, J.M., Altman, R.B.: Principal components analysis to summarize microarray experiments: application to sporulation time series. In: Biocomputing 2000, pp. 455–466. World Scientific (1999)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Rutishauser, H.: Simultaneous iteration method for symmetric matrices. Numer. Math. 16(3), 205–223 (1970)
Saad, Y.: On the rates of convergence of the Lanczos and the block-Lanczos methods. SIAM J. Numer. Anal. 17(5), 687–706 (1980)
Shamir, O.: A stochastic PCA and SVD algorithm with an exponential convergence rate. In: International Conference on Machine Learning, pp. 144–152. PMLR (2015)
Shamir, O.: Convergence of stochastic gradient descent for PCA. In: International Conference on Machine Learning, pp. 257–265. PMLR (2016)
Shamir, O.: Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In: International Conference on Machine Learning, pp. 248–256. PMLR (2016)
Sleijpen, G.L.G., Van der Vorst, H.A.: A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM Rev. 42(2), 267–293 (2000)
Sorensen, D.C.: Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. In: Parallel Numerical Algorithms, pp. 119–165. Springer (1997)
Stathopoulos, A., Fischer, C.F.: A Davidson program for finding a few selected extreme eigenpairs of a large, sparse, real, symmetric matrix. Comput. Phys. Commun. 79(2), 268–290 (1994)
Stewart, G.W., Sun, J.: Matrix Perturbation Theory. Academic Press, Cambridge (1990)
Subasi, A., Gursoy, M.I.: EEG signal classification using PCA, ICA, LDA and support vector machines. Expert Syst. Appl. 37(12), 8659–8666 (2010)
Tang, C.: Exponentially convergent stochastic \(k\)-PCA without variance reduction. In: Advances in Neural Information Processing Systems, pp. 12393–12404 (2019)
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015)
Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cogn. Neurosci. 3(1), 71–86 (1991)
Uhlenbeck, G.E., Ornstein, L.S.: On the theory of the Brownian motion. Phys. Rev. 36(5), 823 (1930)
Vu, V.Q., Lei, J.: Minimax sparse principal subspace estimation in high dimensions. Ann. Stat. 41(6), 2905–2947 (2013)
Wang, B., Ma, S., Xue, L.: Riemannian stochastic proximal gradient methods for nonsmooth optimization over the Stiefel manifold. J. Mach. Learn. Res. 23(1), 4599–4631 (2022)
Wang, C., Lu, Y.M.: Online learning for sparse PCA in high dimensions: exact dynamics and phase transitions. In: 2016 IEEE Information Theory Workshop (ITW), pp. 186–190. IEEE (2016)
Wang, L., Gao, B., Liu, X.: Multipliers correction methods for optimization problems over the Stiefel manifold. CSIAM Trans. Appl. Math. 2, 508–531 (2021)
Wang, Z., Liu, B., Chen, S., Ma, S., Xue, L., Zhao, H.: A manifold proximal linear method for sparse spectral clustering with application to single-cell RNA sequencing data analysis. INFORMS J. Optim. 4(2), 200–214 (2022)
Ward, R., Wu, X., Bottou, L.: AdaGrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning, pp. 6677–6686. PMLR (2019)
Wen, Z., Yang, C., Liu, X., Zhang, Y.: Trace-penalty minimization for large-scale eigenspace computation. J. Sci. Comput. 66(3), 1175–1203 (2016)
Wen, Z., Zhang, Y.: Accelerating convergence by augmented Rayleigh–Ritz projections for large-scale eigenpair computation. SIAM J. Matrix Anal. Appl. 38(2), 273–296 (2017)
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
Xiao, N., Liu, X., Yuan, Y.: A class of smooth exact penalty function methods for optimization problems with orthogonality constraints. Optim. Methods Softw. 1–37 (2020)
Yang, W., Xu, H.: Streaming sparse principal component analysis. In: International Conference on Machine Learning, pp. 494–503. PMLR (2015)
Zhou, S., Bai, Y.: Convergence analysis of Oja’s iteration for solving online PCA with nonzero-mean samples. Sci. China Math. 64(4), 849–868 (2021)
Zilber, P., Nadler, B.: GNMR: a provable one-line algorithm for low rank matrix recovery. SIAM J. Math. Data Sci. 4(2), 909–934 (2022)
Funding
The work of the second author was supported by the National Natural Science Foundation of China (Nos. 12125108, 11971466, 11991021, 12021001 and 11688101), the Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No. ZDBS-LY-7022), and the Youth Innovation Promotion Association, Chinese Academy of Sciences. The work of the third author was supported by the National Natural Science Foundation of China (No. 12071060).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declared no potential conflicts of interest with respect to the research, authorship, or publication of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Boundedness of Solution to ODE (3.7)
Lemma A.1
The solution X(t) to ODE (3.7) is uniformly bounded.
Proof
Denote \(L={{\textbf {tr}}}(X^\top X)\) and \(M={{\textbf {tr}}}(X^\top \varSigma X(X^\top X)^{-1})\).
The explicit solution of (A.1) is given by
where \(C(L^0)\) is some constant depending only on the initial value \(L(0)=L^0={{\textbf {tr}}}(X^{0\top }X^{0})\). Since
we have
which then yields the bound
\(\square \)
B Proof of Lemma 3.1
We begin with the full-rankness of the SGN iterates.
Lemma B.1
Suppose that the Assumptions [A1]–[A3] hold. Then, the sequence \(\{X^{(k)}\}\) generated by Algorithm 1 with the stepsize \(\alpha ^{(k)}\le 1\) satisfies
which indicates that \(\sigma _{\min }(X^{(k+1)})>0\) whenever \(\sigma _{\min }(X^{(k)})>0\).
Proof
To simplify the notation, we denote
Recall that \(P=X(X^\top X)^{-1}\). It is obvious that
Thus, we have
where
Then, we obtain
Since \(\sigma _{\max }(P)=\sigma _{\min }^{-1}(X)\), we have
\(\square \)
Though the full-rankness of the SGN iterates has been proved in Lemma B.1, the possibility that \(\lim \nolimits _{k\rightarrow \infty }\sigma _{\min }(X^{(k)})=0\) cannot be ruled out. As a theoretical supplement, we set a threshold \({\underline{\sigma }}>0\) and take an additional correction step \(X^{(k)}+S^{(k)}_c\) whenever \(\sigma _{\min }(X^{(k)})\le {\underline{\sigma }}\). The correction direction \(S^{(k)}_c\) is chosen to satisfy
This will be detailed in Lemma B.2.
We perform an SVD of the iterate \(X^{(k)}\in {\mathbb {R}}^{n\times p}\) as
where \(U^{(k)}\in {\mathbb {R}}^{n\times n}\) is an orthogonal matrix, \(U_1^{(k)}=U^{(k)}_{(:, 1:p)}\in {\mathbb {R}}^{n\times p}\), \(\varSigma ^{(k)}\in {\mathbb {R}}^{n\times p}\) and \(\varSigma ^{(k)}_1\in {\mathbb {R}}^{p\times p}\) have the singular values \(\sigma _{1}^{(k)}\ge \cdots \ge \sigma _p^{(k)}>0\) of \(X^{(k)}\) on their diagonals and zeros elsewhere, and \(V^{(k)}\in {\mathbb {R}}^{p\times p}\) is also an orthogonal matrix. Then, we have the following result.
Lemma B.2
Consider an iterate \(X^{(k)}\in {\mathbb {R}}^{n\times p}\) with the SVD form (B.5), and let a threshold be given by
Suppose that there are \(p_{{\underline{\sigma }}}\) singular values no greater than \({\underline{\sigma }}\). Define a correction direction as
where \(\theta =\sqrt{\lambda _n}\), \(Q^{(k)}\in {\mathbb {R}}^{n\times p_{{\underline{\sigma }}}}\) has orthonormal columns, and \(V_{p_{{\underline{\sigma }}}}^{(k)}\in {\mathbb {R}}^{p\times p_{{\underline{\sigma }}}}\) is formed by the last \(p_{{\underline{\sigma }}}\) columns of \(V^{(k)}\). Then, the inequalities (B.2) and (B.3) hold if \(Q^{(k)}\in {\textbf {span}}\{U_{(:, (p+1):n)}^{(k)}\}\), while (B.2), (B.3) and (B.4) all hold if \(Q^{(k)}=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}^{(k)}\).
Proof
Here and below in this proof, we drop the superscript “(k)” to simplify the notation, that is,
and \(\sigma _i = \sigma _i^{(k)}\ (i=1,\ldots , p)\).
By (B.5) and the definition (B.7) of \(S_c\), we have
If \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\), we have
and if \(Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}\), we have
The first inequality (B.2) is then obviously true from
To check the second inequality (B.3), we compute
Then, setting \(\theta =\sqrt{\lambda _n}\) and using (B.6) immediately gives the desired inequality (B.3).
For the third inequality (B.4), we have
where \(q_{i}\in {\mathbb {R}}^{n}\) is the ith column of Q. When \(Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}\), we have \({\bar{q}}_1={\bar{u}}_{p-p_{{\underline{\sigma }}}+1},\ldots , {\bar{q}}_{p_{{\underline{\sigma }}}}={\bar{u}}_{p}\), and thus \(\Vert \sin ^2 (X+S_c, U_{p'})\Vert _{\text {F}}^2-\Vert \sin ^2 (X, U_{p'})\Vert _{\text {F}}^2<0\). But when \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\), it is easy to find a counterexample for (B.4), e.g., the case of \({\bar{U}}_{1}=I_p\). \(\square \)
Remark B.1
(1) Among the requirements on the correction step, (B.2) is natural, while (B.3) and (B.4) are simply different measures used in the convergence analysis: the former is the one adopted in [28], and the latter is the one on which our work focuses. (2) The choice \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\) coincides with the correction step taken in [28] and suffices only for (B.2) and (B.3). However, this choice is also acceptable in our work, since the estimation error of the SGN iterates does not decrease monotonically anyway.
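To make the construction in Lemma B.2 concrete, the following sketch partitions an iterate as in (B.5), detects the singular values at or below the threshold \({\underline{\sigma }}\), and assembles a correction from the ingredients named above. Since the display (B.7) is not reproduced here, the sketch assumes, purely for illustration, the form \(S^{(k)}_c=\theta \,Q^{(k)}V_{p_{{\underline{\sigma }}}}^{(k)\top }\) with \(\theta =\sqrt{\lambda _n}\) and \(Q^{(k)}=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}^{(k)}\); this choice is consistent with the dimensions stated in Lemma B.2 but should not be read as the paper's exact correction.

```python
import numpy as np

def corrected_iterate(X, lambda_n, sigma_thresh):
    """Illustrative correction of a nearly rank-deficient iterate X^(k).

    The SVD partition follows (B.5); combining theta, Q and V_ps into
    S_c = theta * Q @ V_ps.T is a hypothetical stand-in for (B.7).
    """
    n, p = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s sorted non-increasingly
    ps = int(np.sum(s <= sigma_thresh))                # number of small singular values
    if ps == 0:
        return X                                       # sigma_min(X) is above the threshold
    theta = np.sqrt(lambda_n)
    Q = U[:, p - ps:]                                  # the choice Q = U_(:, p-ps+1:p)
    V_ps = Vt.T[:, p - ps:]                            # last ps columns of V^(k)
    S_c = theta * Q @ V_ps.T                           # hypothesized correction direction
    return X + S_c

# Usage: an iterate with one tiny singular value gets lifted above the threshold.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
X[:, 2] = 1e-8 * X[:, 0]
before = np.linalg.svd(X, compute_uv=False)[-1]
after = np.linalg.svd(corrected_iterate(X, lambda_n=0.1, sigma_thresh=1e-3),
                      compute_uv=False)[-1]
print(f"sigma_min before: {before:.2e}, after: {after:.2e}")
```

Under this assumed form, \(X^{(k)}+S^{(k)}_c\) keeps the singular vectors of \(X^{(k)}\) and shifts the small singular values up by \(\theta \), which is how the trigger \(\sigma _{\min }(X^{(k)})\le {\underline{\sigma }}\) is resolved in the sketch.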
Next, we bound the conditional expectation of the increment of the SGN iterates under a sufficiently small stepsize.
Lemma B.3
Suppose that Assumptions [A1]–[A3] hold and that the stepsize \(\alpha ^{(k)}\) is chosen by
where \({\underline{\sigma }}\) is the threshold given in (B.6), \(\varphi _2 = {\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert _2\) and \(\varphi _4 = {\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert ^2_2\). Then \(\{X^{(k)}\}\) generated by Algorithm 1 satisfies
Proof
For simplicity, we use the same notation as in (B.1). From the update rule of SGN, it is straightforward to get
Hereafter, we assume that \(X^{(k)} = X\) is known. Suppose that
we then have
where the inequality (i) follows from (B.9).
Given that \({{\textbf {tr}}}(X^{(0)\top }X^{(0)})=p\), the hypothesis (B.11) may fail to hold. Hence, we turn to the case \({{\textbf {tr}}}(X^\top X)\ge 2p\,{\mathbb {E}}\Vert \varSigma _{h}\Vert _2\), where we have
The inequality above shows that \({\mathbb {E}}\,{{\textbf {tr}}}\left( X^{+\top }X^{+}\right) \) does not increase until (B.11) is satisfied. Even if (B.11) never holds, the SGN iterates remain bounded in expectation by the initial value owing to this non-increasing property. More specifically,
We then turn to the expected increment
which thus completes the proof. \(\square \)
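The constants \(\varphi _2={\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert _2\) and \(\varphi _4={\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert ^2_2\) appearing in the stepsize condition of Lemma B.3 are moments of the spectral norm of the minibatch sample covariance \(\varSigma _h=\frac{1}{h}\sum _{i=1}^{h}a_ia_i^\top \). When the data stream can be sampled, they can be estimated by plain Monte Carlo, as in the sketch below; the Gaussian stream, the covariance \(\varSigma \), and all numerical parameters are assumptions made only for illustration.

```python
import numpy as np

def estimate_phi(sample_stream, h, trials=2000):
    """Monte Carlo estimates of phi_2 = E||Sigma_h||_2 and phi_4 = E||Sigma_h||_2^2,
    where Sigma_h = (1/h) * sum_i a_i a_i^T is the minibatch sample covariance."""
    norms = np.empty(trials)
    for t in range(trials):
        A = sample_stream(h)                     # h samples a_1, ..., a_h as rows
        Sigma_h = A.T @ A / h
        norms[t] = np.linalg.norm(Sigma_h, 2)    # spectral norm
    return norms.mean(), (norms ** 2).mean()

# Illustration only: a Gaussian stream with a prescribed covariance Sigma.
rng = np.random.default_rng(0)
n = 10
Sigma = np.diag(np.linspace(1.0, 0.1, n))
chol = np.linalg.cholesky(Sigma)
stream = lambda h: rng.standard_normal((h, n)) @ chol.T

phi2, phi4 = estimate_phi(stream, h=32)
print(f"phi_2 ~ {phi2:.3f}, phi_4 ~ {phi4:.3f}, ||Sigma||_2 = {np.linalg.norm(Sigma, 2):.3f}")
```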
Then, we are in a position to prove Lemma 3.1.
Proof of Lemma 3.1
The infinitesimal mean of \({{\textbf {vec}}}\left( X(t)\right) \) can be obtained directly from (3.5) as
and the infinitesimal variance from (3.6) as
where the inequality follows from Lemma B.3, which shows that \(\alpha ^{-2}{\mathbb {E}}\Vert \varDelta X_\alpha \Vert _{\text {F}}^2\) is finite and hence \({\mathbb {E}}\Vert {{\textbf {vec}}}\left( \varDelta X_\alpha \right) \Vert _{\text {F}}^2={\mathcal {O}}(\alpha ^2)\).
Applying Corollary 4.2 in Section 7.4 of [9] then completes the proof. \(\square \)
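For the reader's convenience, the shape of the criterion from [9] invoked in the proof above (and again in Appendix C) can be stated informally as follows; the precise moment and jump conditions are those of Corollary 4.2 in Section 7.4 of [9]. If, uniformly on compact sets as \(\alpha \rightarrow 0\),
\[
\alpha ^{-1}\,{\mathbb {E}}\bigl [{{\textbf {vec}}}(\varDelta X_\alpha )\mid X_\alpha =x\bigr ]\rightarrow b(x) \quad \text {and}\quad \alpha ^{-1}\,{\mathbb {E}}\bigl [{{\textbf {vec}}}(\varDelta X_\alpha )\,{{\textbf {vec}}}(\varDelta X_\alpha )^\top \mid X_\alpha =x\bigr ]\rightarrow a(x),
\]
then the piecewise-constant interpolation \(X_\alpha (t)\) converges weakly to the Markov process with generator \({\mathcal {L}}f(x)=b(x)^\top \nabla f(x)+\frac{1}{2}{{\textbf {tr}}}\bigl (a(x)\nabla ^2 f(x)\bigr )\). In Lemma 3.1 the infinitesimal variance is \({\mathcal {O}}(\alpha )\), so \(a\equiv 0\) and the limit is the deterministic flow of the ODE (3.7); in Lemma 3.2 the \(\alpha ^{-1/2}\) rescaling of the fluctuations keeps the variance nondegenerate, which is what produces the SDE limit (3.14).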
C Proof of Lemma 3.2
Proof
Since \(X_{\alpha }(t)\) lies in a neighborhood of some stationary point \({\tilde{X}}\) associated with the eigen-index set \({\mathcal {I}}_{p}=\{i_{1},\ldots , i_{p}\}\) and its extension \(\hat{{\mathcal {I}}}_p\), we impose, for simplicity, two additional assumptions on \({\tilde{X}}\):
(i) \(i_{1}> i_{2}>\cdots> i_{p-1}>i_{p}\);

(ii) \(\lambda _{i_{1}}, \lambda _{i_{2}}, \ldots , \lambda _{i_{p-1}}\) are simple eigenvalues and \(\lambda _{i_{p}}\) has multiplicity \((q-p+1)\ge 1\),

where \(q=\vert \hat{{\mathcal {I}}}_p\vert \). Then, \(X_{\alpha }(t)\) can be expressed in the form
where \(E_q=[e_{i_{1}}, e_{i_{2}}, \ldots , e_{i_{q}}]\in {\mathbb {R}}^{n\times q},~\varLambda _{q}={ {\textbf {Diag}}}(\lambda _{i_{1}}, \lambda _{i_{2}}, \ldots , \lambda _{i_{q}})\in {\mathbb {R}}^{q\times q}\), V is a p-dimensional orthogonal matrix, and the weight matrix \(W\in {\mathbb {R}}^{q\times p}\) is defined by
Let \( \varDelta Y_\alpha (t) = Y_\alpha (t+\alpha )-Y_\alpha (t)=\alpha ^{-\frac{1}{2}}\left( X_\alpha (t+\alpha )-X_\alpha (t)\right) V. \) In the same vein as the proof of Lemma 3.1, we compute the (np)-dimensional infinitesimal mean as
and the (np)-dimensional infinitesimal variance matrix as
where
where \(\varSigma _h=\sum _{i=1}^h a_{i}a_{i}^\top /h\) and \(a_{i}\in {\mathbb {R}}^n~(i = 1, \ldots , h)\) are i.i.d. copies of the random vector a. For any \(1\le l,~r\le p\), we have
where
and \(E_{l, r}\) denotes the n-dimensional matrix with its (l, r)th element being one and all other entries being zero. Using (C.1) and (C.2) and applying Corollary 4.2 in Section 7.4 of [9] finally yield the SDE approximation (3.14). It can easily be verified that (3.14) remains valid even when the additional assumptions (i) and (ii) are removed. \(\square \)
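As a point of orientation (and the reason the classical reference of Uhlenbeck appears in the bibliography), a weak limit obtained through an \(\alpha ^{-1/2}\) rescaling around a stationary point is a linear diffusion of Ornstein–Uhlenbeck type, i.e., informally,
\[
{\mathrm {d}}Y(t)=-A\,Y(t)\,{\mathrm {d}}t+G\,{\mathrm {d}}B(t),
\]
with the constant drift matrix \(A\) read off from the infinitesimal mean and the constant diffusion matrix \(G\) from the infinitesimal variance computed above; this is only an informal reading of (3.14), whose precise coefficients are those given by (C.1) and (C.2).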
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, S., Liu, X. & Xu, L. Stochastic Gauss–Newton Algorithms for Online PCA. J Sci Comput 96, 72 (2023). https://doi.org/10.1007/s10915-023-02289-0