
Stochastic Gauss–Newton Algorithms for Online PCA

Abstract

In this paper, we propose a stochastic Gauss–Newton (SGN) algorithm for the online principal component analysis (OPCA) problem, which is formulated by using the symmetric low-rank product model for dominant eigenspace calculation. Compared with existing OPCA solvers, SGN exhibits improved robustness with respect to varying input data and algorithm parameters. In addition, by evaluating the data stream through approximated objective functions, we develop a new adaptive stepsize strategy for SGN that requires no a priori knowledge of the input data, and we numerically illustrate that its performance is comparable to that of SGN with a manually tuned diminishing stepsize. Without assuming the eigengap to be positive, we also establish the global and optimal convergence rate of SGN with the specified stepsize using the diffusion approximation theory.


Data availability

The real datasets that support the findings of this study are all openly available. To be specific, the MNIST dataset is available at http://yann.lecun.com/exdb/mnist/, the Fashion-MNIST dataset is available at https://github.com/zalandoresearch/fashion-mnist, and the CIFAR-10 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html.
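
The datasets above serve as the streaming inputs in the numerical experiments. As a rough illustration of how such a stream can be formed, the sketch below loads MNIST and produces the block empirical covariances \(\varSigma _h^{(k)}=\sum _{i}a_ia_i^\top /h\) used in the appendices; it assumes torchvision is installed, and the preprocessing (flattening, scaling, block size) is an illustrative choice rather than the authors' experimental setup.

```python
# Illustrative data-stream setup (not the authors' code). Assumes torchvision is
# installed; Fashion-MNIST and CIFAR-10 can be loaded analogously through
# datasets.FashionMNIST and datasets.CIFAR10.
import numpy as np
from torchvision import datasets

mnist = datasets.MNIST(root="./data", train=True, download=True)
X = mnist.data.numpy().reshape(len(mnist), -1).astype(np.float64) / 255.0  # 60000 x 784

h = 64                                    # block size of the data stream (illustrative)
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
for start in range(0, len(X), h):
    A = X[perm[start:start + h]]          # h samples a_i, one per row
    Sigma_h = A.T @ A / h                 # block empirical covariance Sigma_h^{(k)}
    # ... one online PCA update would consume Sigma_h here ...
    break                                 # remove to iterate over the full stream
```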

References

  1. Allen-Zhu, Z., Li, Y.: First efficient convergence for streaming \(k\)-PCA: a global, gap-free, and near-optimal rate. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 487–492. IEEE (2017)

  2. Balcan, M.-F., Du, S.S., Wang, Y., Yu, A.W.: An improved gap-dependency analysis of the noisy power method. In: Conference on Learning Theory, pp. 284–309. PMLR (2016)

  3. Balsubramani, A., Dasgupta, S., Freund, Y.: The fast convergence of incremental PCA. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3174–3182 (2013)

  4. Balzano, L., Chi, Y., Lu, Y.M.: Streaming PCA and subspace tracking: the missing data case. Proc. IEEE 106(8), 1293–1310 (2018)

  5. Cardot, H., Degras, D.: Online principal component analysis in high dimension: which algorithm to choose? Int. Stat. Rev. 86(1), 29–50 (2018)

  6. Chen, M., Yang, L., Wang, M., Zhao, T.: Dimensionality reduction for stationary time series via stochastic nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3500–3510 (2018)

  7. De Sa, C., Re, C., Olukotun, K.: Global convergence of stochastic gradient descent for some non-convex matrix problems. In: International Conference on Machine Learning, pp. 2332–2341. PMLR (2015)

  8. Drusvyatskiy, D., Paquette, C.: Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program. 178, 503–558 (2019)

  9. Ethier, S.N., Kurtz, T.G.: Markov Processes: Characterization and Convergence. Wiley, New York (1986)

  10. Feng, Y., Li, L., Liu, J.-G.: Semigroups of stochastic gradient descent and online principal component analysis: properties and diffusion approximations. Commun. Math. Sci. 16(3), 777–789 (2018)

  11. Gao, B., Liu, X., Yuan, Y.: Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 41(3), A1949–A1983 (2019)

  12. Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2013)

  13. Hardt, M., Price, E.: The noisy power method: a meta algorithm with applications. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, pp. 2861–2869 (2014)

  14. Henriksen, A., Ward, R.: AdaOja: adaptive learning rates for streaming PCA. arXiv preprint arXiv:1905.12115 (2019)

  15. Huang, D., Niles-Weed, J., Ward, R.: Streaming \(k\)-PCA: efficient guarantees for Oja’s algorithm, beyond rank-one updates. In: Conference on Learning Theory, pp. 2463–2498. PMLR (2021)

  16. Jain, P., Jin, C., Kakade, S.M., Netrapalli, P., Sidford, A.: Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. In: Conference on Learning Theory, pp. 1147–1164. PMLR (2016)

  17. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 1, pp. 315–323 (2013)

  18. Jolliffe, I.T.: Principal Component Analysis. Springer, New York (1986)

  19. Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 374(2065), 20150202 (2016)

  20. Knyazev, A.V.: Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput. 23(2), 517–541 (2001)

  21. Krasulina, T.P.: The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. USSR Comput. Math. Math. Phys. 9(6), 189–195 (1969)

  22. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto (2009)

  23. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  24. Li, C.J., Wang, M., Liu, H., Zhang, T.: Near-optimal stochastic approximation for online principal component estimation. Math. Program. 167(1), 75–97 (2018)

  25. Li, C.J., Wang, M., Liu, H., Zhang, T.: Diffusion approximations for online principal component estimation and global convergence. In: Advances in Neural Information Processing Systems, vol. 30 (2017)

  26. Li, C.-L., Lin, H.-T., Lu, C.-J.: Rivalry of two families of algorithms for memory-restricted streaming PCA. In: Artificial Intelligence and Statistics, pp. 473–481. PMLR (2016)

  27. Liu, X., Wen, Z., Zhang, Y.: Limited memory block Krylov subspace optimization for computing dominant singular value decompositions. SIAM J. Sci. Comput. 35(3), A1641–A1668 (2013)

  28. Liu, X., Wen, Z., Zhang, Y.: An efficient Gauss–Newton algorithm for symmetric low-rank product matrix approximations. SIAM J. Optim. 25(3), 1571–1608 (2015)

  29. Lv, J.C., Yi, Z., Tan, K.K.: Global convergence of Oja’s PCA learning algorithm with a non-zero-approaching adaptive learning rate. Theoret. Comput. Sci. 367(3), 286–307 (2006)

  30. Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)

  31. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559–572 (1901)

  32. Raychaudhuri, S., Stuart, J.M., Altman, R.B.: Principal components analysis to summarize microarray experiments: application to sporulation time series. In: Biocomputing 2000, pp. 455–466. World Scientific (1999)

  33. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  34. Rutishauser, H.: Simultaneous iteration method for symmetric matrices. Numer. Math. 16(3), 205–223 (1970)

  35. Saad, Y.: On the rates of convergence of the Lanczos and the block-Lanczos methods. SIAM J. Numer. Anal. 17(5), 687–706 (1980)

  36. Shamir, O.: A stochastic PCA and SVD algorithm with an exponential convergence rate. In: International Conference on Machine Learning, pp. 144–152. PMLR (2015)

  37. Shamir, O.: Convergence of stochastic gradient descent for PCA. In: International Conference on Machine Learning, pp. 257–265. PMLR (2016)

  38. Shamir, O.: Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In: International Conference on Machine Learning, pp. 248–256. PMLR (2016)

  39. Sleijpen, G.L.G., Van der Vorst, H.A.: A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM Rev. 42(2), 267–293 (2000)

  40. Sorensen, D.C.: Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. In: Parallel Numerical Algorithms, pp. 119–165. Springer (1997)

  41. Stathopoulos, A., Fischer, C.F.: A Davidson program for finding a few selected extreme eigenpairs of a large, sparse, real, symmetric matrix. Comput. Phys. Commun. 79(2), 268–290 (1994)

  42. Stewart, G.W., Sun, J.: Matrix Perturbation Theory. Academic Press, Cambridge (1990)

  43. Subasi, A., Gursoy, M.I.: EEG signal classification using PCA, ICA, LDA and support vector machines. Expert Syst. Appl. 37(12), 8659–8666 (2010)

  44. Tang, C.: Exponentially convergent stochastic \(k\)-PCA without variance reduction. In: Advances in Neural Information Processing Systems, pp. 12393–12404 (2019)

  45. Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015)

  46. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cogn. Neurosci. 3(1), 71–86 (1991)

  47. Uhlenbeck, G.E., Ornstein, L.S.: On the theory of the Brownian motion. Phys. Rev. 36(5), 823 (1930)

  48. Vu, V.Q., Lei, J.: Minimax sparse principal subspace estimation in high dimensions. Ann. Stat. 41(6), 2905–2947 (2013)

  49. Wang, B., Ma, S., Xue, L.: Riemannian stochastic proximal gradient methods for nonsmooth optimization over the Stiefel manifold. J. Mach. Learn. Res. 23(1), 4599–4631 (2022)

  50. Wang, C., Lu, Y.M.: Online learning for sparse PCA in high dimensions: exact dynamics and phase transitions. In: 2016 IEEE Information Theory Workshop (ITW), pp. 186–190. IEEE (2016)

  51. Wang, L., Gao, B., Liu, X.: Multipliers correction methods for optimization problems over the Stiefel manifold. CSIAM Trans. Appl. Math. 2, 508–531 (2021)

  52. Wang, Z., Liu, B., Chen, S., Ma, S., Xue, L., Zhao, H.: A manifold proximal linear method for sparse spectral clustering with application to single-cell RNA sequencing data analysis. INFORMS J. Optim. 4(2), 200–214 (2022)

  53. Ward, R., Wu, X., Bottou, L.: AdaGrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning, pp. 6677–6686. PMLR (2019)

  54. Wen, Z., Yang, C., Liu, X., Zhang, Y.: Trace-penalty minimization for large-scale eigenspace computation. J. Sci. Comput. 66(3), 1175–1203 (2016)

  55. Wen, Z., Zhang, Y.: Accelerating convergence by augmented Rayleigh–Ritz projections for large-scale eigenpair computation. SIAM J. Matrix Anal. Appl. 38(2), 273–296 (2017)

  56. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)

  57. Xiao, N., Liu, X., Yuan, Y.: A class of smooth exact penalty function methods for optimization problems with orthogonality constraints. Optim. Methods Softw. 1–37 (2020)

  58. Yang, W., Xu, H.: Streaming sparse principal component analysis. In: International Conference on Machine Learning, pp. 494–503. PMLR (2015)

  59. Zhou, S., Bai, Y.: Convergence analysis of Oja’s iteration for solving online PCA with nonzero-mean samples. Sci. China Math. 64(4), 849–868 (2021)

  60. Zilber, P., Nadler, B.: GNMR: a provable one-line algorithm for low rank matrix recovery. SIAM J. Math. Data Sci. 4(2), 909–934 (2022)

Funding

The work of the second author was supported by the National Natural Science Foundation of China (No. 12125108, 11971466, 11991021, 12021001 and 11688101), Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No. ZDBS-LY-7022), and the Youth Innovation Promotion Association, Chinese Academy of Sciences. The work of the third author was supported by the National Natural Science Foundation of China (No. 12071060).

Author information

Corresponding author

Correspondence to Xin Liu.

Ethics declarations

Conflict of interest

The authors declared no potential conflicts of interest with respect to the research, authorship, or publication of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xin Liu: The research was supported in part by the National Natural Science Foundation of China (Nos. 12125108, 11971466, 11991021, 12021001 and 11688101), Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No. ZDBS-LY-7022), and the Youth Innovation Promotion Association, Chinese Academy of Sciences. Liwei Xu: The research was supported in part by the National Natural Science Foundation of China (No. 12071060).

Appendices

A Boundedness of Solution to ODE (3.7)

Lemma A.1

The solution X(t) to ODE (3.7) is uniformly bounded.

Proof

Denote \(L={{\textbf {tr}}}(X^\top X)\) and \(M={{\textbf {tr}}}(X^\top \varSigma X(X^\top X)^{-1})\).

$$\begin{aligned} \frac{dL}{dt}&=2{{\textbf {tr}}}\left( X^\top \frac{dX}{dt}\right) \nonumber \\&={{\textbf {tr}}}\left( 2X^\top \varSigma X(X^\top X)^{-1}-X^\top X-X^\top \varSigma X(X^\top X)^{-1}\right) =M-L. \end{aligned}$$
(A.1)

The explicit solution of (A.1) is given by

$$\begin{aligned} L(t)=\exp \{-t\}\left( \int M(s)\exp \{s\}\,ds+C(L^0)\right) , \end{aligned}$$

where \(C(L^0)\) is some constant depending only on the initial value \(L(0)=L^0={{\textbf {tr}}}(X^{0\top }X^{0})\). Since

$$\begin{aligned} \lambda _{n}I_{p}\preccurlyeq (X^\top X)^{-\frac{1}{2}}X^\top \varSigma X(X^\top X)^{-\frac{1}{2}} \preccurlyeq \lambda _{1}I_{p}, \end{aligned}$$

we have

$$\begin{aligned} p\lambda _{n}\le M\le p\lambda _{1}, \end{aligned}$$

which then yields the bound

$$\begin{aligned} p\lambda _{n}+C(L^0)\exp \{-t\}\le L\le p\lambda _{1}+C(L^0)\exp \{-t\}. \end{aligned}$$

\(\square \)
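
As a numerical sanity check of Lemma A.1 (a sketch only, not part of the paper), one can integrate the ODE by explicit Euler, taking the right-hand side to be the drift \(\varSigma X(X^\top X)^{-1}-\frac{1}{2}X-\frac{1}{2}X(X^\top X)^{-1}X^\top \varSigma X(X^\top X)^{-1}\) that appears in the proof of Lemma 3.1, and monitor \(L(t)={{\textbf {tr}}}(X(t)^\top X(t))\) against \(p\lambda _n\) and \(p\lambda _1\); the dimensions, spectrum, and step size below are arbitrary choices.

```python
# Sanity check for Lemma A.1 (illustrative; explicit Euler with a small step is an
# assumption, not part of the paper). The right-hand side below is the drift of
# ODE (3.7) as it appears in the proof of Lemma 3.1 in Appendix B.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.sort(rng.uniform(0.5, 5.0, n))[::-1]      # lambda_1 >= ... >= lambda_n > 0
Sigma = Q @ np.diag(lam) @ Q.T

def drift(X):
    G = np.linalg.inv(X.T @ X)
    return Sigma @ X @ G - 0.5 * X - 0.5 * X @ G @ X.T @ Sigma @ X @ G

X = rng.standard_normal((n, p))
dt = 1e-3
for _ in range(20000):                             # integrate up to t = 20
    X = X + dt * drift(X)

L = np.trace(X.T @ X)
print(f"L(t) = {L:.3f},  p*lambda_n = {p * lam[-1]:.3f},  p*lambda_1 = {p * lam[0]:.3f}")
# Expect p*lambda_n <= L(t) <= p*lambda_1 up to the exponentially decaying term.
```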

B Proof of Lemma 3.1

We begin with the full-rankness of the SGN iterates.

Lemma B.1

Suppose that Assumptions [A1]–[A3] hold. Then, the sequence \(\{X^{(k)}\}\) generated by Algorithm 1 with stepsize \(\alpha ^{(k)}\le 1\) satisfies

$$\begin{aligned} \sigma _{min}(X^{(k+1)})\ge \frac{1}{2}\sigma _{min}(X^{(k)}), \end{aligned}$$

which indicates that \(\sigma _{min}(X^{(k+1)})>0\) whenever \(\sigma _{min}(X^{(k)})>0\).

Proof

To simplify the notation, we denote

$$\begin{aligned} X^{+} = X^{(k+1)},\quad X=X^{(k)},\quad \alpha =\alpha ^{(k)},\quad \varSigma _{h}=\varSigma _{h}^{(k)},\quad S = S^{(k)}. \end{aligned}$$
(B.1)

Recall that \(P=X(X^\top X)^{-1}\). It is obvious that

$$\begin{aligned} X^{+\top }\left( \lambda _{\max }(PP^\top )I_n-PP^\top \right) X^{+}\succeq 0. \end{aligned}$$

Thus, we have

$$\begin{aligned} \sigma ^{2}_{\min }(X^{+})\sigma _{\max }^{2}(P)\ge \sigma _{\min }^{2}(P^\top X^{+}), \end{aligned}$$

where

$$\begin{aligned} P^\top X^{+}=I_p+\frac{\alpha }{2}\left( (X^\top X)^{-1}X^\top \varSigma _h X(X^\top X)^{-1}-I_p\right) . \end{aligned}$$

Then, we obtain

$$\begin{aligned} \sigma _{\min }(P^\top X^{+})&=\lambda _{\min }(P^\top X^{+})\\&=1-\frac{\alpha }{2}\left( 1-\lambda _{\min }\left( (X^\top X)^{-1}X^\top \varSigma _{h}X(X^\top X)^{-1}\right) \right) \\&\ge 1-\frac{\alpha }{2}. \end{aligned}$$

Since \(\sigma _{\max }(P)=\sigma _{\min }^{-1}(X)\), we have

$$\begin{aligned} \sigma _{\min }(X^{+})\ge \sigma _{\min }(X)\sigma _{\min }(P^\top X^{+})\ge \frac{1}{2}\sigma _{\min }(X)>0. \end{aligned}$$

\(\square \)
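
The bound in Lemma B.1 can also be checked numerically. The sketch below assumes the SGN update \(X^{(k+1)}=X^{(k)}+\alpha ^{(k)}S^{(k)}\) with \(S^{(k)}=\varSigma _h^{(k)}X(X^\top X)^{-1}-\frac{1}{2}X-\frac{1}{2}X(X^\top X)^{-1}X^\top \varSigma _h^{(k)}X(X^\top X)^{-1}\), which is consistent with the expression for \(P^\top X^{+}\) used in the proof; it is a sketch with Gaussian samples standing in for the data stream, not a transcription of Algorithm 1.

```python
# Illustrative check of Lemma B.1. The SGN step S below is inferred from the
# expression for P^T X^+ in the proof (and the drift in the proof of Lemma 3.1);
# it is a sketch, not a verbatim transcription of Algorithm 1.
import numpy as np

rng = np.random.default_rng(2)
n, p, h, alpha = 30, 4, 16, 0.5           # alpha <= 1 as required by the lemma

def sgn_step(X, Sigma_h):
    G = np.linalg.inv(X.T @ X)
    return Sigma_h @ X @ G - 0.5 * X - 0.5 * X @ G @ X.T @ Sigma_h @ X @ G

X = rng.standard_normal((n, p))
for _ in range(5):
    A = rng.standard_normal((h, n))       # h i.i.d. samples a_i, one per row
    Sigma_h = A.T @ A / h
    X_new = X + alpha * sgn_step(X, Sigma_h)
    s_old = np.linalg.svd(X, compute_uv=False)[-1]
    s_new = np.linalg.svd(X_new, compute_uv=False)[-1]
    assert s_new >= 0.5 * s_old - 1e-12, (s_new, 0.5 * s_old)
    X = X_new
print("sigma_min(X^{(k+1)}) >= sigma_min(X^{(k)})/2 held at every step")
```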

Though the full-rankness of the SGN iterates has been proved in Lemma B.1, the possibility that \(\lim \nolimits _{k\rightarrow \infty }\sigma _{\min }(X^{(k)})=0\) cannot be ruled out. As a theoretical supplement, we set a threshold \({\underline{\sigma }}>0\) and take an additional correction step \(X^{(k)}+S^{(k)}_c\) whenever \(\sigma _{\min }(X^{(k)})\le {\underline{\sigma }}\). The correction direction \(S^{(k)}_c\) is chosen to satisfy

$$\begin{aligned}&\sigma _{\min }(X^{(k)}+S^{(k)}_c)>{\underline{\sigma }}, \end{aligned}$$
(B.2)
$$\begin{aligned}&\quad f_{\text {E}}(X^{(k)}+S^{(k)}_c)\le f_{\text {E}}(X^{(k)}), \end{aligned}$$
(B.3)
$$\begin{aligned}&\quad \Vert \sin \varTheta (X^{(k)}+S^{(k)}_c, U_{p'})\Vert _{\text {F}}^{2}\le \Vert \sin \varTheta (X^{(k)}, U_{p'})\Vert _{\text {F}}^{2}. \end{aligned}$$
(B.4)

This will be detailed in Lemma B.2.

We perform an SVD of the iterate \(X^{(k)}\in {\mathbb {R}}^{n\times p}\) as

$$\begin{aligned} X^{(k)} = U^{(k)}\varSigma ^{(k)} V^{(k)\top } = U_{1}^{(k)}\varSigma _1^{(k)}V^{(k)\top }, \end{aligned}$$
(B.5)

where \(U^{(k)}\in {\mathbb {R}}^{n\times n}\) is an orthogonal matrix, \(U_1^{(k)}=U^{(k)}_{(:, 1:p)}\in {\mathbb {R}}^{n\times p}\), \(\varSigma ^{(k)}\in {\mathbb {R}}^{n\times p}\) and \(\varSigma ^{(k)}_1\in {\mathbb {R}}^{p\times p}\) have the singular values \(\sigma _{1}^{(k)}\ge \cdots \ge \sigma _p^{(k)}>0\) of \(X^{(k)}\) on their diagonals and zeros elsewhere, and \(V^{(k)}\in {\mathbb {R}}^{p\times p}\) is also an orthogonal matrix. Then, we have the following result.

Lemma B.2

Let \(X^{(k)}\in {\mathbb {R}}^{n\times p}\) be an iterate with the SVD form (B.5), and let the threshold \({\underline{\sigma }}\) satisfy

$$\begin{aligned} {\underline{\sigma }}\le \frac{4}{35}\frac{\lambda _n}{\lambda _1}\sqrt{\lambda _n}. \end{aligned}$$
(B.6)

Suppose that there are \(p_{{\underline{\sigma }}}\) singular values no greater than \({\underline{\sigma }}\). Define a correction direction as

$$\begin{aligned} S^{(k)}_c=\theta Q^{(k)} V_{p_{{\underline{\sigma }}}}^{(k)\top }, \end{aligned}$$
(B.7)

where \(\theta =\sqrt{\lambda _n}\), \(Q^{(k)}\in {\mathbb {R}}^{n\times p}\) has orthonormal columns, and \(V_{p_{{\underline{\sigma }}}}^{(k)}\in {\mathbb {R}}^{p\times p_{{\underline{\sigma }}}}\) is formed by the last \(p_{{\underline{\sigma }}}\) columns of \(V^{(k)}\). Then, the inequalities (B.2) and (B.3) hold if \(Q^{(k)}\in {\textbf {span}}\{U_{(:, (p+1):n)}^{(k)}\}\); and (B.2), (B.3) and (B.4) all hold if \(Q^{(k)}\) is taken as \(Q^{(k)}=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}^{(k)}\).

Proof

Here and below in this proof, we drop the superscript “(k)” to simplify the notation, that is,

$$\begin{aligned} X=X^{(k)},\ S_{c}=S_{c}^{(k)},\ U_1=U_1^{(k)},\ \varSigma _1=\varSigma _{1}^{(k)},\quad V=V^{(k)},\ V_{p_{{\underline{\sigma }}}}=V_{p_{{\underline{\sigma }}}}^{(k)},\ Q=Q^{(k)}, \end{aligned}$$

and \(\sigma _i = \sigma _i^{(k)}\ (i=1,\ldots , p)\).

By (B.5) and the definition (B.7) of \(S_c\), we have

$$\begin{aligned}&X^\top X+X^\top S_{c}+S_{c}^\top X+ S_{c}^\top S_{c}\nonumber \\&\quad =V\varSigma _1^2 V^\top +\theta V\varSigma _1U_1^\top QV_{p_{{\underline{\sigma }}}}^\top +\theta V_{p_{{\underline{\sigma }}}}Q^\top U_1\varSigma _1V^\top +\theta ^2V_{p_{{\underline{\sigma }}}}Q^\top QV_{p_{{\underline{\sigma }}}}^\top . \end{aligned}$$
(B.8)

If \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\), we have

$$\begin{aligned} (B.8)=V{\textbf {Diag}}\left\{ \sigma _1^2,\ldots ,\sigma _{p-p_{{\underline{\sigma }}}}^2, (\sigma _{p-p_{{\underline{\sigma }}}+1}^2+\theta \sigma _{p-p_{{\underline{\sigma }}}+1}+\theta ^2), \ldots , (\sigma _{p}^2+\theta \sigma _{p}+\theta ^2)\right\} V^\top , \end{aligned}$$

and if \(Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}\), we have

$$\begin{aligned} (B.8)=V{\textbf {Diag}}\left\{ \sigma _1^2,\ldots , \sigma _{p-p_{{\underline{\sigma }}}}^2, (\sigma _{p-p_{{\underline{\sigma }}}+1}^2+\theta ^2), \ldots , (\sigma _{p}^2+\theta ^2)\right\} V^\top . \end{aligned}$$

The first inequality (B.2) is then obviously true from

$$\begin{aligned} \sigma _{\min }^2(X+S_{c})&=\lambda _{\min }((X+S_{c})^\top (X+S_{c}))\\&={\left\{ \begin{array}{ll} \min \left\{ \sigma _{p-p_{{\underline{\sigma }}}}^2, (\sigma _{p}^2+\theta ^2+\theta \sigma _{p})\right\} , &{} Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}, \\ \min \left\{ \sigma _{p-p_{{\underline{\sigma }}}}^2, (\sigma _{p}^2+\theta ^2)\right\} , &{} Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}.\end{array}\right. } \end{aligned}$$

To check the second inequality (B.3), we compute

$$\begin{aligned}&f_{\text {E}}(X+S_c)-f_{\text {E}}(X)\\&\quad =2\theta ^2(\sigma _{p-p_{{\underline{\sigma }}}+1}^2+\cdots +\sigma _p^2)+p_{{\underline{\sigma }}}\theta ^4-4{{\textbf {tr}}}\left( \varSigma XS_c^\top \right) -2\theta ^2{{\textbf {tr}}}\left( Q^\top \varSigma Q\right) \\&\qquad + {\left\{ \begin{array}{ll} \sum \limits _{i=p-p_{{\underline{\sigma }}}+1}^{p}4\theta \sigma _i^3+\sum \limits _{i=p-p_{{\underline{\sigma }}}+1}^{p}4\theta ^2\sigma _i^2+\sum \limits _{i=p-p_{{\underline{\sigma }}}+1}^{p}4\theta ^3\sigma _i, &{} Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}, \\ 0, &{} Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)},\end{array}\right. }\\&\quad \le \ {\left\{ \begin{array}{ll} p_{{\underline{\sigma }}}\theta \left( 4{\underline{\sigma }}^3+6\theta ^2{\underline{\sigma }}^2+(4\theta ^2+4\lambda _1){\underline{\sigma }}+\theta ^3-2\theta \lambda _n\right) , &{} Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}, \\ p_{{\underline{\sigma }}}\theta ( 2\theta {\underline{\sigma }}^2+4\lambda _1{\underline{\sigma }}+\theta ^3-2\theta \lambda _n), &{} Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}.\end{array}\right. } \end{aligned}$$

Then, setting \(\theta =\sqrt{\lambda _n}\) and using (B.6) immediately gives the desired inequality (B.3).

For the third inequality (B.4), we have

$$\begin{aligned}&\Vert \sin \varTheta (X+S_c, U_{p'})\Vert _{\text {F}}^2-\Vert \sin \varTheta (X, U_{p'})\Vert _{\text {F}}^2\\&\quad = \theta ^2\left( \frac{{\bar{u}}_{p-p_{{\underline{\sigma }}}+1}^\top {\bar{u}}_{p-p_{{\underline{\sigma }}}+1}}{\sigma ^2_{p-p_{{\underline{\sigma }}}+1}+\theta ^2}+\cdots +\frac{{\bar{u}}_{p}^\top {\bar{u}}_{p}}{\sigma ^2_{p}+\theta ^2}\right) -\theta ^2\left( \frac{{\bar{q}}_{1}^\top {\bar{q}}_{1}}{\sigma ^2_{p-p_{{\underline{\sigma }}}+1}+\theta ^2}+\cdots +\frac{{\bar{q}}_{p_{{\underline{\sigma }}}}^\top {\bar{q}}_{p_{{\underline{\sigma }}}}}{\sigma ^2_{p}+\theta ^2}\right) \\&\qquad -2\theta \left( \frac{{\bar{u}}_{p-p_{{\underline{\sigma }}}+1}^\top {\bar{q}}_{1}\sigma _{p-p_{{\underline{\sigma }}}+1}}{\sigma ^2_{p-p_{{\underline{\sigma }}}+1}+\theta ^2}+\cdots +\frac{{\bar{u}}_{p}^\top {\bar{q}}_{p_{{\underline{\sigma }}}}\sigma _p}{\sigma ^2_{p}+\theta ^2}\right) , \end{aligned}$$

where \(q_{i}\in {\mathbb {R}}^{n}\) is the \(i\)th column of \(Q\). When \(Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}\), we have \({\bar{q}}_1={\bar{u}}_{p-p_{{\underline{\sigma }}}+1},\ldots , {\bar{q}}_{p_{{\underline{\sigma }}}}={\bar{u}}_{p}\), and thus \(\Vert \sin \varTheta (X+S_c, U_{p'})\Vert _{\text {F}}^2-\Vert \sin \varTheta (X, U_{p'})\Vert _{\text {F}}^2<0\). But when \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\), it is easy to find a counterexample for (B.4), e.g., the case of \({\bar{U}}_{1}=I_p\). \(\square \)

Remark B.1

(1) Among the requirements on the correction step, (B.2) is natural, while (B.3) and (B.4) are simply different measures used in the convergence analysis: the former is the one adopted in [28], and the latter is the one that our work focuses on. (2) The choice \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\) coincides with the correction step taken in [28] and only suffices to obtain (B.2) and (B.3). Nevertheless, it is also acceptable in our work, since the estimation error of the SGN iterates does not decrease monotonically.
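
For concreteness, the correction step of Lemma B.2 with \(Q^{(k)}=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}^{(k)}\) can be formed in a few lines. In the sketch below, the covariance, the rank-deficient iterate, and the choice \(f_{\text {E}}(X)=\Vert XX^\top -\varSigma \Vert _{\text {F}}^2\) (consistent with the expansion in the proof) are illustrative assumptions.

```python
# Sketch of the correction step in Lemma B.2 with Q = U_{(:,(p-p_sigma+1):p)}.
# Sigma, the threshold and the deficient iterate are illustrative choices;
# f_E(X) = ||X X^T - Sigma||_F^2 is assumed, consistent with the proof's expansion.
import numpy as np

rng = np.random.default_rng(3)
n, p = 15, 4
Q0, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.sort(rng.uniform(1.0, 4.0, n))[::-1]
Sigma = Q0 @ np.diag(lam) @ Q0.T
lam1, lamn = lam[0], lam[-1]

sigma_bar = (4.0 / 35.0) * (lamn / lam1) * np.sqrt(lamn)   # threshold (B.6)
theta = np.sqrt(lamn)

# Build an iterate whose two smallest singular values fall below the threshold.
U, _ = np.linalg.qr(rng.standard_normal((n, p)))
V, _ = np.linalg.qr(rng.standard_normal((p, p)))
svals = np.array([2.0, 1.5, 0.5 * sigma_bar, 0.25 * sigma_bar])
X = U @ np.diag(svals) @ V.T

p_sig = int(np.sum(svals <= sigma_bar))                    # here p_sig = 2
Q = U[:, p - p_sig:]                                       # last p_sig left singular vectors
V_sig = V[:, p - p_sig:]                                   # last p_sig right singular vectors
S_c = theta * Q @ V_sig.T                                  # correction direction (B.7)

f_E = lambda Y: np.linalg.norm(Y @ Y.T - Sigma, "fro") ** 2
smin = lambda Y: np.linalg.svd(Y, compute_uv=False)[-1]
print("(B.2):", smin(X + S_c) > sigma_bar)                 # expected: True
print("(B.3):", f_E(X + S_c) <= f_E(X))                    # expected: True
```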

Next, we bound the conditional expectation of the increment of the SGN iterates under a sufficiently small stepsize.

Lemma B.3

Suppose that Assumptions [A1]–[A3] hold and that the stepsize \(\alpha ^{(k)}\) is chosen such that

$$\begin{aligned} \alpha ^{(k)}\le \min \left\{ \frac{{\underline{\sigma }}^2\varphi _{2}}{2\varphi _{4}},~1\right\} , \end{aligned}$$
(B.9)

where \({\underline{\sigma }}\) is the threshold given in (B.6), \(\varphi _2 = {\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert _2\) and \(\varphi _4 = {\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert ^2_2\). Then \(\{X^{(k)}\}\) generated by Algorithm  1 satisfies

$$\begin{aligned} {\mathbb {E}}\left[ \left\| S^{(k)}(X^{(k)})\right\| _{\text { F}}^2\Big \vert X^{(k)}=X\right] <p{\underline{\sigma }}^{-2}\varphi _4+\frac{p}{4}\max \{1,\ 2\varphi _2\}. \end{aligned}$$

Proof

For simplicity, we use the same notations as (B.1). From the update rule of SGN, it is straightforward to get

$$\begin{aligned} {{\textbf {tr}}}\left( X^{+\top }X^{+}\right)&=\left( \frac{1}{4}\alpha ^2-\alpha +1\right) {{\textbf {tr}}}(X^\top X)+\left( -\frac{1}{2}\alpha ^2+\alpha \right) {{\textbf {tr}}}\left( X^\top \varSigma _{h}X(X^\top X)^{-1}\right) \nonumber \\&\quad \quad \quad \quad +\alpha ^2{{\textbf {tr}}}\left( X^\top \varSigma _{h}^{2}X(X^\top X)^{-2}\right) -\frac{3}{4}\alpha ^2{{\textbf {tr}}}\left( (X^\top \varSigma _{h}X)^2(X^\top X)^{-3}\right) . \end{aligned}$$
(B.10)

Hereafter, we assume that \(X^{(k)} = X\) is known. Suppose that

$$\begin{aligned} {{\textbf {tr}}}(X^\top X)\le 2p{\mathbb {E}}\Vert \varSigma _{h}\Vert _2, \end{aligned}$$
(B.11)

we then have

$$\begin{aligned} {\mathbb {E}}{{\textbf {tr}}}(X^{+\top }X^{+})&\le p\left( \frac{1}{2}\alpha ^2-2\alpha +2-\frac{1}{2}\alpha ^2+\alpha \right) {\mathbb {E}}\Vert \varSigma _{h}\Vert _2+\alpha ^2{\mathbb {E}}\Vert \varSigma _{h}\Vert ^2_2{{\textbf {tr}}}((X^\top X)^{-1})\\&\le p\left( -\alpha +2\right) {\mathbb {E}}\Vert \varSigma _{h}\Vert _2+\alpha ^2p{\mathbb {E}}\Vert \varSigma _{h}\Vert ^2_2\sigma _{\min }^{-2}(X)\\&\le ^{(i)} p\left( -\alpha +2\right) {\mathbb {E}}\Vert \varSigma _{h}\Vert _2+\frac{p}{2}\alpha {\mathbb {E}}\Vert \varSigma _{h}\Vert _2=p\left( 2-\frac{1}{2}\alpha \right) \varphi _{2}<\infty , \end{aligned}$$

where the inequality (i) follows from (B.9).

Since \({{\textbf {tr}}}(X^{(0)\top }X^{(0)})=p\), the hypothesis (B.11) may fail to hold. Hence, we turn to the case \({{\textbf {tr}}}(X^\top X)\ge 2p{\mathbb {E}}\Vert \varSigma _{h}\Vert _2\), where we have

$$\begin{aligned} {\mathbb {E}}{{\textbf {tr}}}\left( X^{+\top }X^{+}\right)&\le \left( \frac{1}{4}\alpha ^2-\alpha +1-\frac{1}{4}\alpha ^2+\frac{1}{2}\alpha \right) {{\textbf {tr}}}(X^\top X)+\alpha ^2{\mathbb {E}}\Vert \varSigma _{h}\Vert ^2_2{{\textbf {tr}}}((X^\top X)^{-1})\\&\le \left( \frac{1}{4}\alpha ^2-\alpha +1-\frac{1}{4}\alpha ^2+\frac{1}{2}\alpha +\frac{1}{4}\alpha \right) {{\textbf {tr}}}(X^\top X)\le {{\textbf {tr}}}(X^\top X). \end{aligned}$$

The inequality above shows that \({\mathbb {E}}{{\textbf {tr}}}\left( X^{+\top }X^{+}\right) \) does not increase until (B.11) is satisfied. Even if (B.11) never holds, the SGN iterates remain bounded in expectation by the initial value owing to this non-increasing property. More specifically,

$$\begin{aligned} {\mathbb {E}}{{\textbf {tr}}}\left( X^{+\top }X^{+}\right) \le \max \{p,\ 2p\varphi _2\}. \end{aligned}$$

We then turn to the expected increment

$$\begin{aligned} {\mathbb {E}}{{\textbf {tr}}}\left( S^\top S\right)&\le {\mathbb {E}}{{\textbf {tr}}}\left( (X^\top X)^{-1}X^\top \varSigma _{h}^2X(X^\top X)^{-1} \right) +\frac{1}{4}{{\textbf {tr}}}\left( X^\top X\right) \\&\le {\mathbb {E}}\Vert \varSigma _h\Vert ^2_2{{\textbf {tr}}}((X^\top X)^{-1})+\frac{1}{4}{{\textbf {tr}}}\left( X^\top X\right) \\&\le p{\underline{\sigma }}^{-2}\varphi _4+\frac{1}{4}\max \{p,\ 2p\varphi _2\} <\infty , \end{aligned}$$

which thus completes the proof. \(\square \)
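
The bound of Lemma B.3 can be illustrated by a short Monte Carlo experiment (a sketch; the standard Gaussian samples are an assumption, the SGN step is the one inferred in the sketch after Lemma B.1, and \(\sigma _{\min }(X)\) is used in place of the threshold \({\underline{\sigma }}\), which is valid here since the orthonormal iterate satisfies \(\sigma _{\min }(X)=1\ge {\underline{\sigma }}\)).

```python
# Monte Carlo check of E||S||_F^2 < p*sigma^{-2}*phi_4 + (p/4)*max(1, 2*phi_2)
# from Lemma B.3 (illustrative; Gaussian samples a_i are an assumption, and
# sigma_min(X) = 1 replaces the threshold for this orthonormal iterate).
import numpy as np

rng = np.random.default_rng(4)
n, p, h, trials = 20, 3, 8, 2000
X, _ = np.linalg.qr(rng.standard_normal((n, p)))        # tr(X^T X) = p, full rank

def sgn_step(X, Sigma_h):
    G = np.linalg.inv(X.T @ X)
    return Sigma_h @ X @ G - 0.5 * X - 0.5 * X @ G @ X.T @ Sigma_h @ X @ G

norms2, phi2, phi4 = [], [], []
for _ in range(trials):
    A = rng.standard_normal((h, n))
    Sigma_h = A.T @ A / h
    spec = np.linalg.norm(Sigma_h, 2)                   # ||Sigma_h||_2
    phi2.append(spec)
    phi4.append(spec ** 2)
    norms2.append(np.linalg.norm(sgn_step(X, Sigma_h), "fro") ** 2)

sigma_min = np.linalg.svd(X, compute_uv=False)[-1]      # equals 1 here
bound = p * sigma_min ** -2 * np.mean(phi4) + p / 4 * max(1.0, 2 * np.mean(phi2))
print(f"E||S||_F^2 ~ {np.mean(norms2):.3f}  vs. bound {bound:.3f}")
```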

Then, we are in a position to prove Lemma 3.1.

Proof of Lemma 3.1

The infinitesimal mean of \({{\textbf {vec}}}\left( X(t)\right) \) can be obtained directly from (3.5) as

$$\begin{aligned}&\lim _{\alpha \rightarrow 0}\frac{1}{\alpha }{\mathbb {E}}[{{\textbf {vec}}}\left( \varDelta X_\alpha \right) \vert X_\alpha =X]\\&\quad ={{\textbf {vec}}}\left( \varSigma X(X^\top X)^{-1}-\frac{1}{2}X-\frac{1}{2}X(X^\top X)^{-1}X^\top \varSigma X(X^\top X)^{-1}\right) , \end{aligned}$$

and the infinitesimal variance from (3.6) as

$$\begin{aligned} \lim _{\alpha \rightarrow 0}\frac{1}{\alpha }{\mathbb {E}}\left[ {{\textbf {vec}}}\left( \varDelta X_\alpha \right) {{\textbf {vec}}}\left( \varDelta X_\alpha \right) ^\top \vert X_\alpha =X\right] = 0, \end{aligned}$$

where the limit vanishes because, by Lemma B.3, \(\alpha ^{-2}{\mathbb {E}}\Vert \varDelta X_\alpha \Vert _{\text {F}}^2\) is finite, and thus \({\mathbb {E}}\Vert {{\textbf {vec}}}\left( \varDelta X_\alpha \right) \Vert _{\text {F}}^2={\mathcal {O}}(\alpha ^2)\).

Applying Corollary 4.2 in Section 7.4 of [9] then completes the proof. \(\square \)
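
For completeness, the vanishing of the infinitesimal variance can be spelled out as a one-line bound, where \(C\) abbreviates the finite right-hand side of Lemma B.3:

$$\begin{aligned} \left\| \frac{1}{\alpha }{\mathbb {E}}\left[ {{\textbf {vec}}}\left( \varDelta X_\alpha \right) {{\textbf {vec}}}\left( \varDelta X_\alpha \right) ^\top \big \vert X_\alpha =X\right] \right\| _2 \le \frac{1}{\alpha }{\mathbb {E}}\left[ \Vert \varDelta X_\alpha \Vert _{\text {F}}^2\big \vert X_\alpha =X\right] =\alpha \,{\mathbb {E}}\left[ \Vert S^{(k)}(X^{(k)})\Vert _{\text {F}}^2\Big \vert X^{(k)}=X\right] \le C\alpha \rightarrow 0\quad \text {as }\alpha \rightarrow 0. \end{aligned}$$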

C Proof of Lemma 3.2

Proof

Since \(X_{\alpha }(t)\) lies in a neighborhood of some stationary point \({\tilde{X}}\) associated with the eigen-index set \({\mathcal {I}}_{p}=\{i_{1},\ldots , i_{p}\}\) and the extended one \(\hat{{\mathcal {I}}}_p\), we impose two additional assumptions on \({\tilde{X}}\) for simplicity:

  1. (i)

    \(i_{1}> i_{2}>\cdots> i_{p-1}>i_{p}\);

  2. (ii)

    \(\lambda _{i_{1}}, \lambda _{i_{2}}, \ldots , \lambda _{i_{p-1}}\) are simple eigenvalues and \(\lambda _{i_{p}}\) has multiplicity \((q-p+1)\ge 1\),

where \(q=\vert \hat{{\mathcal {I}}}_p\vert \). Then, \(X_{\alpha }(t)\) can be expressed in the form

$$\begin{aligned} X_{\alpha }(t)=E_{q}\varLambda _{q}^{\frac{1}{2}}WV^\top +{\mathcal {O}}(\alpha ^{\frac{1}{2}}), \end{aligned}$$

where \(E_q=[e_{i_{1}}, e_{i_{2}}, \ldots , e_{i_{q}}]\in {\mathbb {R}}^{n\times q},~\varLambda _{q}={ {\textbf {Diag}}}(\lambda _{i_{1}}, \lambda _{i_{2}}, \ldots , \lambda _{i_{q}})\in {\mathbb {R}}^{q\times q}\), V is a p-dimensional orthogonal matrix, and the weight matrix \(W\in {\mathbb {R}}^{q\times p}\) is defined by

$$\begin{aligned} W= \left[ \begin{array}{cc} I_{p-1}&{}0_{(p-1)\times 1}\\ 0_{(q-p+1)\times (p-1)}&{}w \end{array} \right] ,\quad w^\top w=1,\quad w\in {\mathbb {R}}^{q-p+1}. \end{aligned}$$

Let \( \varDelta Y_\alpha (t) = Y_\alpha (t+\alpha )-Y_\alpha (t)=\alpha ^{-\frac{1}{2}}\left( X_\alpha (t+\alpha )-X_\alpha (t)\right) V. \) In the same vein as the proof of Lemma 3.1, we compute the (np)-dimensional infinitesimal mean as

$$\begin{aligned} \lim _{\alpha \rightarrow 0}\frac{1}{\alpha }{\mathbb {E}}\left[ {{\textbf {vec}}}(\varDelta Y_\alpha (t))\vert Y_\alpha (t)=Y\right] =~{{\textbf {vec}}}\left( \varSigma Y\varLambda _{p}^{-1}-Y\right) , \end{aligned}$$
(C.1)

and the (np)-dimensional infinitesimal variance matrix as

$$\begin{aligned} \lim _{\alpha \rightarrow 0}\frac{1}{\alpha }{\mathbb {E}}\left[ {{\textbf {vec}}}(\varDelta Y_\alpha ){{\textbf {vec}}}(\varDelta Y_\alpha )^\top \vert Y_\alpha (t)=Y\right] = {\mathbb {E}} \left[ \begin{array}{cccc} S_{(:, 1)}S_{(:, 1)}^\top &{}\cdots &{} S_{(:, 1)}S_{(:, p)}^\top \\ \cdots &{}\cdots &{}\cdots \\ S_{(:, p)}S_{(:, 1)}^\top &{}\cdots &{} S_{(:, p)}S_{(:, p)}^\top \end{array} \right] , \end{aligned}$$

where

$$\begin{aligned} S_{(:, j)}&=\varSigma _{h}E_{q}\varLambda _{q}^{\frac{1}{2}}W\varLambda _{p(:, j)}^{-\frac{1}{2}}-\frac{1}{2}E_{q}\varLambda _{q}^{\frac{1}{2}}W_{(:, j)}\\&\quad -\frac{1}{2}E_{q}\varLambda _{q}^{\frac{1}{2}}W\varLambda _{p}^{-1}W^\top \varLambda _{q}^{\frac{1}{2}}E_{q}^\top \varSigma _{h}E_{q}\varLambda _{q}^{\frac{1}{2}}W\varLambda _{p(:, j)}^{-1},~\quad 1\le j\le p, \end{aligned}$$

where \(\varSigma _h=\sum _{i=1}^h a_{i}a_{i}^\top /h\) and \(a_{i}\in {\mathbb {R}}^n~(i = 1, \ldots , h)\) are i.i.d. copies of the random vector a. For any \(1\le l,~r\le p\), we have

$$\begin{aligned} {\mathbb {E}}\left[ S_{(:, l)}S_{(:, r)}^\top \right]&= \left( -\frac{1}{2}-\frac{1}{2}+\frac{1}{4}+\frac{1}{4}+\frac{1}{4}\right) T_{1}^{lr}+\left( 1-\frac{1}{2}-\frac{1}{2}+\frac{1}{4}\right) \left( 1-\frac{1}{h}\right) T_{1}^{lr}\nonumber \\&\quad +\frac{1}{h}T_{2}^{lr}-\frac{1}{2h}\left( {\tilde{W}}T_{2}^{lr}+T_{2}^{lr}{\tilde{W}}\right) +\frac{1}{4h}{\tilde{W}}T_{2}^{lr}{\tilde{W}}, \end{aligned}$$
(C.2)

where

$$\begin{aligned} \begin{aligned} T^{lr}_{1}&=\left( \lambda _{i_{l}}\lambda _{i_{r}}\right) ^{\frac{1}{2}}(E_{q}W)_{(:, l)}\left( (E_{q}W)_{(:, r)}\right) ^\top ,\\ T^{lr}_{2}&=\left\{ \begin{array}{ll} \varSigma +2T^{lr}_{1}+\sum _{j=1}^{\tau }w_{j}^2({\mathbb {E}}[b_{i_{p}}^{4}]/\lambda _{i_{p}}-3\lambda _{i_{p}})E_{i_{p}+j-1, i_{p}+j-1}, &{} {l=r=p,}\\ \varSigma +({\mathbb {E}}[b_{i_{l}}^{4}]/\lambda _{i_{l}}^2-1)T^{lr}_{1},&{} {l=r\ne p,}\\ (\lambda _{i_{l}}\lambda _{i_{r}})^{\frac{1}{2}}\left( T^{lr}_{1}+T^{rl}_{1}\right) ,&{} {otherwise,} \end{array} \right. \\ {\tilde{W}}&= \left[ \begin{array}{cc} WW^\top &{}0_{q\times (n-q)}\\ 0_{(n-q)\times q}&{}0_{(n-q)\times (n-q)} \end{array} \right] \in {\mathbb {R}}^{n\times n}, \end{aligned} \end{aligned}$$

and \(E_{l, r}\) denotes the \(n\times n\) matrix whose \((l, r)\)th entry is one and all other entries are zero. Using (C.1), (C.2) and applying Corollary 4.2 in Section 7.4 of [9] finally yields the SDE approximation (3.14). It can easily be verified that (3.14) remains valid even if the additional assumptions (i) and (ii) are dropped. \(\square \)
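
As a numerical illustration of the stationary-point representation used above (a sketch; the covariance, the index set, and the vector \(w\) are toy choices), one can build \({\tilde{X}}=E_{q}\varLambda _{q}^{\frac{1}{2}}WV^\top \) for an eigenvalue \(\lambda _{i_p}\) of multiplicity two and verify that the drift appearing in the proof of Lemma 3.1 vanishes there.

```python
# Illustrative check that X~ = E_q Lambda_q^{1/2} W V^T is a stationary point of the
# drift used in the proof of Lemma 3.1. Sigma, the index set and w are toy choices.
import numpy as np

rng = np.random.default_rng(5)
n, p = 8, 3
lam = np.array([5.0, 4.0, 3.0, 3.0, 2.0, 1.5, 1.0, 0.5])   # lambda_3 has multiplicity 2
Sigma = np.diag(lam)                                        # eigenvectors are e_1, ..., e_n

idx = [0, 1, 2, 3]                                          # I_p = {1,2,3}, extended by index 4
q = len(idx)
E_q = np.eye(n)[:, idx]
Lam_q = np.diag(lam[idx])

w = np.array([np.cos(0.3), np.sin(0.3)])                    # any unit vector in R^{q-p+1}
W = np.zeros((q, p))
W[:p - 1, :p - 1] = np.eye(p - 1)
W[p - 1:, p - 1] = w
V, _ = np.linalg.qr(rng.standard_normal((p, p)))            # arbitrary orthogonal p x p matrix

X = E_q @ np.sqrt(Lam_q) @ W @ V.T
G = np.linalg.inv(X.T @ X)
drift = Sigma @ X @ G - 0.5 * X - 0.5 * X @ G @ X.T @ Sigma @ X @ G
print("||drift(X~)||_F =", np.linalg.norm(drift))           # expected: ~ 0
```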

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, S., Liu, X. & Xu, L. Stochastic Gauss–Newton Algorithms for Online PCA. J Sci Comput 96, 72 (2023). https://doi.org/10.1007/s10915-023-02289-0

  • DOI: https://doi.org/10.1007/s10915-023-02289-0
