Abstract
In this paper, we propose a stochastic Gauss–Newton (SGN) algorithm for the online principal component analysis (OPCA) problem, which we formulate by means of the symmetric low-rank product model for dominant eigenspace calculation. Compared with existing OPCA solvers, SGN exhibits improved robustness with respect to varying input data and algorithm parameters. In addition, by evaluating the data stream through approximated objective functions, we develop a new adaptive stepsize strategy for SGN that requires no a priori knowledge of the input data, and we numerically illustrate that its performance is comparable to that of SGN with a manually tuned diminishing stepsize. Without assuming the eigengap to be positive, we also establish the global and optimal convergence rate of SGN with the specified stepsize using the diffusion approximation theory.
Data availability
The real datasets that support the findings of this study are all openly available. To be specific, the MNIST dataset is available at http://yann.lecun.com/exdb/mnist/, the Fashion-MNIST dataset is available at https://github.com/zalandoresearch/fashion-mnist, and the CIFAR-10 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html.
References
Allen-Zhu, Z., Li, Y.: First efficient convergence for streaming \(k\)-PCA: a global, gap-free, and near-optimal rate. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 487–492. IEEE (2017)
Balcan, M.-F., Du, S.S., Wang, Y., Yu, A.W.: An improved gap-dependency analysis of the noisy power method. In: Conference on Learning Theory, pp. 284–309. PMLR (2016)
Balsubramani, A., Dasgupta, S., Freund, Y.: The fast convergence of incremental PCA. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3174–3182 (2013)
Balzano, L., Chi, Y., Lu, Y.M.: Streaming PCA and subspace tracking: the missing data case. Proc. IEEE 106(8), 1293–1310 (2018)
Cardot, H., Degras, D.: Online principal component analysis in high dimension: which algorithm to choose? Int. Stat. Rev. 86(1), 29–50 (2018)
Chen, M., Yang, L., Wang, M., Zhao, T.: Dimensionality reduction for stationary time series via stochastic nonconvex optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3500–3510 (2018)
De Sa, C., Re, C., Olukotun, K.: Global convergence of stochastic gradient descent for some non-convex matrix problems. In: International Conference on Machine Learning, pp. 2332–2341. PMLR (2015)
Drusvyatskiy, D., Paquette, C.: Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program. 178, 503–558 (2019)
Ethier, S.N., Kurtz, T.G.: Markov Processes: Characterization and Convergence. Wiley, New York (1986)
Feng, Y., Li, L., Liu, J.-G.: Semigroups of stochastic gradient descent and online principal component analysis: properties and diffusion approximations. Commun. Math. Sci. 16(3), 777–789 (2018)
Gao, B., Liu, X., Yuan, Y.: Parallelizable algorithms for optimization problems with orthogonality constraints. SIAM J. Sci. Comput. 41(3), A1949–A1983 (2019)
Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2013)
Hardt, M., Price, E.: The noisy power method: a meta algorithm with applications. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, pp. 2861–2869 (2014)
Henriksen, A., Ward, R.: AdaOja: adaptive learning rates for streaming PCA. arXiv preprint arXiv:1905.12115 (2019)
Huang, D., Niles-Weed, J., Ward, R.: Streaming \(k\)-PCA: efficient guarantees for Oja’s algorithm, beyond rank-one updates. In: Conference on Learning Theory, pp. 2463–2498. PMLR (2021)
Jain, P., Jin, C., Kakade, S.M., Netrapalli, P., Sidford, A.: Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. In: Conference on Learning Theory, pp. 1147–1164. PMLR (2016)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 1, pp. 315–323 (2013)
Jolliffe, I.T.: Principal Component Analysis. Springer, New York (1986)
Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 374(2065), 20150202 (2016)
Knyazev, A.V.: Toward the optimal preconditioned eigensolver: locally optimal block preconditioned conjugate gradient method. SIAM J. Sci. Comput. 23(2), 517–541 (2001)
Krasulina, T.P.: The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. USSR Comput. Math. Math. Phys. 9(6), 189–195 (1969)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto (2009)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Li, C.J., Wang, M., Liu, H., Zhang, T.: Near-optimal stochastic approximation for online principal component estimation. Math. Program. 167(1), 75–97 (2018)
Li, C.J., Wang, M., Liu, H., Zhang, T.: Diffusion approximations for online principal component estimation and global convergence. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Li, C.-L., Lin, H.-T., Lu, C.-J.: Rivalry of two families of algorithms for memory-restricted streaming PCA. In: Artificial Intelligence and Statistics, pp. 473–481. PMLR (2016)
Liu, X., Wen, Z., Zhang, Y.: Limited memory block Krylov subspace optimization for computing dominant singular value decompositions. SIAM J. Sci. Comput. 35(3), A1641–A1668 (2013)
Liu, X., Wen, Z., Zhang, Y.: An efficient Gauss–Newton algorithm for symmetric low-rank product matrix approximations. SIAM J. Optim. 25(3), 1571–1608 (2015)
Lv, J.C., Yi, Z., Tan, K.K.: Global convergence of Oja’s PCA learning algorithm with a non-zero-approaching adaptive learning rate. Theoret. Comput. Sci. 367(3), 286–307 (2006)
Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)
Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559–572 (1901)
Raychaudhuri, S., Stuart, J.M., Altman, R.B.: Principal components analysis to summarize microarray experiments: application to sporulation time series. In: Biocomputing 2000, pp. 455–466. World Scientific (1999)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Rutishauser, H.: Simultaneous iteration method for symmetric matrices. Numer. Math. 16(3), 205–223 (1970)
Saad, Y.: On the rates of convergence of the Lanczos and the block-Lanczos methods. SIAM J. Numer. Anal. 17(5), 687–706 (1980)
Shamir, O.: A stochastic PCA and SVD algorithm with an exponential convergence rate. In: International Conference on Machine Learning, pp. 144–152. PMLR (2015)
Shamir, O.: Convergence of stochastic gradient descent for PCA. In: International Conference on Machine Learning, pp. 257–265. PMLR (2016)
Shamir, O.: Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In: International Conference on Machine Learning, pp. 248–256. PMLR (2016)
Sleijpen, G.L.G., Van der Vorst, H.A.: A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM Rev. 42(2), 267–293 (2000)
Sorensen, D.C.: Implicitly restarted Arnoldi/Lanczos methods for large scale eigenvalue calculations. In: Parallel Numerical Algorithms, pp. 119–165. Springer (1997)
Stathopoulos, A., Fischer, C.F.: A Davidson program for finding a few selected extreme eigenpairs of a large, sparse, real, symmetric matrix. Comput. Phys. Commun. 79(2), 268–290 (1994)
Stewart, G.W., Sun, J.: Matrix Perturbation Theory. Academic Press, Cambridge (1990)
Subasi, A., Gursoy, M.I.: EEG signal classification using PCA, ICA, LDA and support vector machines. Expert Syst. Appl. 37(12), 8659–8666 (2010)
Tang, C.: Exponentially convergent stochastic \(k\)-PCA without variance reduction. In: Advances in Neural Information Processing Systems, pp. 12393–12404 (2019)
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015)
Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cogn. Neurosci. 3(1), 71–86 (1991)
Uhlenbeck, G.E., Ornstein, L.S.: On the theory of the Brownian motion. Phys. Rev. 36(5), 823 (1930)
Vu, V.Q., Lei, J.: Minimax sparse principal subspace estimation in high dimensions. Ann. Stat. 41(6), 2905–2947 (2013)
Wang, B., Ma, S., Xue, L.: Riemannian stochastic proximal gradient methods for nonsmooth optimization over the Stiefel manifold. J. Mach. Learn. Res. 23(1), 4599–4631 (2022)
Wang, C., Lu, Y.M.: Online learning for sparse PCA in high dimensions: exact dynamics and phase transitions. In: 2016 IEEE Information Theory Workshop (ITW), pp. 186–190. IEEE (2016)
Wang, L., Gao, B., Liu, X.: Multipliers correction methods for optimization problems over the Stiefel manifold. CSIAM Trans. Appl. Math. 2, 508–531 (2021)
Wang, Z., Liu, B., Chen, S., Ma, S., Xue, L., Zhao, H.: A manifold proximal linear method for sparse spectral clustering with application to single-cell RNA sequencing data analysis. INFORMS J. Optim. 4(2), 200–214 (2022)
Ward, R., Wu, X., Bottou, L.: AdaGrad stepsizes: sharp convergence over nonconvex landscapes. In: International Conference on Machine Learning, pp. 6677–6686. PMLR (2019)
Wen, Z., Yang, C., Liu, X., Zhang, Y.: Trace-penalty minimization for large-scale eigenspace computation. J. Sci. Comput. 66(3), 1175–1203 (2016)
Wen, Z., Zhang, Y.: Accelerating convergence by augmented Rayleigh–Ritz projections for large-scale eigenpair computation. SIAM J. Matrix Anal. Appl. 38(2), 273–296 (2017)
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
Xiao, N., Liu, X., Yuan, Y.: A class of smooth exact penalty function methods for optimization problems with orthogonality constraints. Optim. Methods Softw. 1–37 (2020)
Yang, W., Xu, H.: Streaming sparse principal component analysis. In: International Conference on Machine Learning, pp. 494–503. PMLR (2015)
Zhou, S., Bai, Y.: Convergence analysis of Oja’s iteration for solving online PCA with nonzero-mean samples. Sci. China Math. 64(4), 849–868 (2021)
Zilber, P., Nadler, B.: GNMR: a provable one-line algorithm for low rank matrix recovery. SIAM J. Math. Data Sci. 4(2), 909–934 (2022)
Funding
The work of the second author was supported by the National Natural Science Foundation of China (Nos. 12125108, 11971466, 11991021, 12021001 and 11688101), the Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No. ZDBS-LY-7022), and the Youth Innovation Promotion Association, Chinese Academy of Sciences. The work of the third author was supported by the National Natural Science Foundation of China (No. 12071060).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declared no potential conflicts of interest with respect to the research, authorship, or publication of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Boundedness of Solution to ODE (3.7)
Lemma A.1
The solution X(t) to ODE (3.7) is uniformly bounded.
Proof
Denote \(L={{\textbf {tr}}}(X^\top X)\) and \(M={{\textbf {tr}}}(X^\top \varSigma X(X^\top X)^{-1})\).
The explicit solution of (A.1) is given by
where \(C(L^0)\) is some constant depending only on the initial value \(L(0)=L^0={{\textbf {tr}}}(X^{0\top }X^{0})\). Since
we have
which then yields the bound
\(\square \)
B Proof of Lemma 3.1
We begin with the full-rankness of the SGN iterates.
Lemma B.1
Suppose that the Assumptions [A1]–[A3] hold. Then, the sequence \(\{X^{(k)}\}\) generated by Algorithm 1 with the stepsize \(\alpha ^{(k)}\le 1\) satisfies
which indicates that \(\sigma _{\min }(X^{(k+1)})>0\) whenever \(\sigma _{\min }(X^{(k)})>0\).
Proof
To simplify the notation, we denote
Recall that \(P=X(X^\top X)^{-1}\). It is obvious that
Thus, we have
where
Then, we obtain
Since \(\sigma _{\max }(P)=\sigma _{\min }^{-1}(X)\), we have
\(\square \)
Though the full-rankness of the SGN iterates has been proved in Lemma B.1, the possibility that \(\lim \nolimits _{k\rightarrow \infty }\sigma _{\min }(X^{(k)})=0\) cannot be ruled out. As a theoretical supplement, we set a threshold \({\underline{\sigma }}>0\) and take an additional correction step \(X^{(k)}+S^{(k)}_c\) whenever \(\sigma _{\min }(X^{(k)})\le {\underline{\sigma }}\). The correction direction \(S^{(k)}_c\) is chosen to satisfy
This will be detailed in Lemma B.2.
We perform an SVD of the iterate \(X^{(k)}\in {\mathbb {R}}^{n\times p}\) as
where \(U^{(k)}\in {\mathbb {R}}^{n\times n}\) is an orthogonal matrix, \(U_1^{(k)}=U^{(k)}_{(:, 1:p)}\in {\mathbb {R}}^{n\times p}\), \(\varSigma ^{(k)}\in {\mathbb {R}}^{n\times p}\) and \(\varSigma ^{(k)}_1\in {\mathbb {R}}^{p\times p}\) have the singular values \(\sigma _{1}^{(k)}\ge \cdots \ge \sigma _p^{(k)}>0\) of \(X^{(k)}\) on their diagonals and zeros elsewhere, and \(V^{(k)}\in {\mathbb {R}}^{p\times p}\) is also an orthogonal matrix. Then, we have the following result.
Lemma B.2
Consider an iterate \(X^{(k)}\in {\mathbb {R}}^{n\times p}\) with the SVD form (B.5), and let a threshold be given by
Suppose that there are \(p_{{\underline{\sigma }}}\) singular values no greater than \({\underline{\sigma }}\). Define a correction direction as
where \(\theta =\sqrt{\lambda _n}\), \(Q^{(k)}\in {\mathbb {R}}^{n\times p_{{\underline{\sigma }}}}\) has orthonormal columns, and \(V_{p_{{\underline{\sigma }}}}^{(k)}\in {\mathbb {R}}^{p\times p_{{\underline{\sigma }}}}\) is formed by the last \(p_{{\underline{\sigma }}}\) columns of \(V^{(k)}\). Then, the inequalities (B.2) and (B.3) hold if \(Q^{(k)}\in {\textbf {span}}\{U_{(:, (p+1):n)}^{(k)}\}\), while (B.2), (B.3) and (B.4) all hold if \(Q^{(k)}=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}^{(k)}\).
Proof
Here and below in this proof, we drop the superscript “(k)” to simplify the notation, that is,
and \(\sigma _i = \sigma _i^{(k)}\ (i=1,\ldots , p)\).
By (B.5) and the definition (B.7) of \(S_c\), we have
If \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\), we have
and if \(Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}\), we have
The first inequality (B.2) is then obviously true from
To check the second inequality (B.3), we compute
Then, setting \(\theta =\sqrt{\lambda _n}\) and using (B.6) immediately gives the desired inequality (B.3).
For the third inequality (B.4), we have
where \(q_{i}\in {\mathbb {R}}^{n}\) is the ith column of Q. When \(Q=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}\), we have \({\bar{q}}_1={\bar{u}}_{p-p_{{\underline{\sigma }}}+1},\ldots , {\bar{q}}_{p_{{\underline{\sigma }}}}={\bar{u}}_{p}\), and thus \(\Vert \sin ^2 (X+S_c, U_{p'})\Vert _{\text {F}}^2-\Vert \sin ^2 (X, U_{p'})\Vert _{\text {F}}^2<0\). But when \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\), it is easy to find a counterexample for (B.4), e.g., the case of \({\bar{U}}_{1}=I_p\). \(\square \)
Remark B.1
(1) Among the requirements on the correction step, (B.2) is natural, while (B.3) and (B.4) are simply different measures used in the convergence analysis: the former is the one adopted in [28], and the latter is the one on which our work focuses. (2) The choice \(Q\in {\textbf {span}}\{U_{(:, (p+1):n)}\}\) coincides with the correction step taken in [28] and suffices only for (B.2) and (B.3). However, this choice is also acceptable in our work, since the estimation error of the SGN iterates does not decrease monotonically anyway.
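To make the construction in Lemma B.2 concrete, the following sketch partitions an iterate as in (B.5), detects the singular values at or below the threshold \({\underline{\sigma }}\), and assembles a correction from the ingredients named above. Since the display (B.7) is not reproduced here, the sketch assumes, purely for illustration, the form \(S^{(k)}_c=\theta \,Q^{(k)}V_{p_{{\underline{\sigma }}}}^{(k)\top }\) with \(\theta =\sqrt{\lambda _n}\) and \(Q^{(k)}=U_{(:, (p-p_{{\underline{\sigma }}}+1):p)}^{(k)}\); this choice is consistent with the dimensions stated in Lemma B.2 but should not be read as the paper's exact correction.

```python
import numpy as np

def corrected_iterate(X, lambda_n, sigma_thresh):
    """Illustrative correction of a nearly rank-deficient iterate X^(k).

    The SVD partition follows (B.5); combining theta, Q and V_ps into
    S_c = theta * Q @ V_ps.T is a hypothetical stand-in for (B.7).
    """
    n, p = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s sorted non-increasingly
    ps = int(np.sum(s <= sigma_thresh))                # number of small singular values
    if ps == 0:
        return X                                       # sigma_min(X) is above the threshold
    theta = np.sqrt(lambda_n)
    Q = U[:, p - ps:]                                  # the choice Q = U_(:, p-ps+1:p)
    V_ps = Vt.T[:, p - ps:]                            # last ps columns of V^(k)
    S_c = theta * Q @ V_ps.T                           # hypothesized correction direction
    return X + S_c

# Usage: an iterate with one tiny singular value gets lifted above the threshold.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
X[:, 2] = 1e-8 * X[:, 0]
before = np.linalg.svd(X, compute_uv=False)[-1]
after = np.linalg.svd(corrected_iterate(X, lambda_n=0.1, sigma_thresh=1e-3),
                      compute_uv=False)[-1]
print(f"sigma_min before: {before:.2e}, after: {after:.2e}")
```

Under this assumed form, \(X^{(k)}+S^{(k)}_c\) keeps the singular vectors of \(X^{(k)}\) and shifts the small singular values up by \(\theta \), which is how the trigger \(\sigma _{\min }(X^{(k)})\le {\underline{\sigma }}\) is resolved in the sketch.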
Next, we bound the conditional expectation of the increment of the SGN iterates under a sufficiently small stepsize.
Lemma B.3
Suppose that Assumptions [A1]–[A3] hold and that the stepsize \(\alpha ^{(k)}\) is chosen by
where \({\underline{\sigma }}\) is the threshold given in (B.6), \(\varphi _2 = {\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert _2\) and \(\varphi _4 = {\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert ^2_2\). Then \(\{X^{(k)}\}\) generated by Algorithm 1 satisfies
Proof
For simplicity, we use the same notation as in (B.1). From the update rule of SGN, it is straightforward to get
Hereafter, we assume that \(X^{(k)} = X\) is known. Suppose that
we then have
where the inequality (i) follows from (B.9).
Given that \({{\textbf {tr}}}(X^{(0)\top }X^{(0)})=p\), the hypothesis (B.11) may fail to hold. Hence, we turn to the case \({{\textbf {tr}}}(X^\top X)\ge 2p\,{\mathbb {E}}\Vert \varSigma _{h}\Vert _2\), where we have
The inequality above shows that \({\mathbb {E}}\,{{\textbf {tr}}}\left( X^{+\top }X^{+}\right) \) does not increase until (B.11) is satisfied. Even if (B.11) never holds, the SGN iterates remain bounded in expectation by the initial value owing to this non-increasing property. More specifically,
We then turn to the expected increment
which thus completes the proof. \(\square \)
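The constants \(\varphi _2={\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert _2\) and \(\varphi _4={\mathbb {E}}\Vert \varSigma _{h}^{(k)}\Vert ^2_2\) appearing in the stepsize condition of Lemma B.3 are moments of the spectral norm of the minibatch sample covariance \(\varSigma _h=\frac{1}{h}\sum _{i=1}^{h}a_ia_i^\top \). When the data stream can be sampled, they can be estimated by plain Monte Carlo, as in the sketch below; the Gaussian stream, the covariance \(\varSigma \), and all numerical parameters are assumptions made only for illustration.

```python
import numpy as np

def estimate_phi(sample_stream, h, trials=2000):
    """Monte Carlo estimates of phi_2 = E||Sigma_h||_2 and phi_4 = E||Sigma_h||_2^2,
    where Sigma_h = (1/h) * sum_i a_i a_i^T is the minibatch sample covariance."""
    norms = np.empty(trials)
    for t in range(trials):
        A = sample_stream(h)                     # h samples a_1, ..., a_h as rows
        Sigma_h = A.T @ A / h
        norms[t] = np.linalg.norm(Sigma_h, 2)    # spectral norm
    return norms.mean(), (norms ** 2).mean()

# Illustration only: a Gaussian stream with a prescribed covariance Sigma.
rng = np.random.default_rng(0)
n = 10
Sigma = np.diag(np.linspace(1.0, 0.1, n))
chol = np.linalg.cholesky(Sigma)
stream = lambda h: rng.standard_normal((h, n)) @ chol.T

phi2, phi4 = estimate_phi(stream, h=32)
print(f"phi_2 ~ {phi2:.3f}, phi_4 ~ {phi4:.3f}, ||Sigma||_2 = {np.linalg.norm(Sigma, 2):.3f}")
```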
Then, we are in a position to prove Lemma 3.1.
Proof of Lemma 3.1
The infinitesimal mean of \({{\textbf {vec}}}\left( X(t)\right) \) can be obtained directly from (3.5) as
and the infinitesimal variance from (3.6) as
where the inequality follows from Lemma B.3, which shows that \(\alpha ^{-2}{\mathbb {E}}\Vert \varDelta X_\alpha \Vert _{\text {F}}^2\) is finite and hence \({\mathbb {E}}\Vert {{\textbf {vec}}}\left( \varDelta X_\alpha \right) \Vert _{\text {F}}^2={\mathcal {O}}(\alpha ^2)\).
Applying Corollary 4.2 in Section 7.4 of [9] then completes the proof. \(\square \)
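For the reader's convenience, the shape of the criterion from [9] invoked in the proof above (and again in Appendix C) can be stated informally as follows; the precise moment and jump conditions are those of Corollary 4.2 in Section 7.4 of [9]. If, uniformly on compact sets as \(\alpha \rightarrow 0\),
\[
\alpha ^{-1}\,{\mathbb {E}}\bigl [{{\textbf {vec}}}(\varDelta X_\alpha )\mid X_\alpha =x\bigr ]\rightarrow b(x) \quad \text {and}\quad \alpha ^{-1}\,{\mathbb {E}}\bigl [{{\textbf {vec}}}(\varDelta X_\alpha )\,{{\textbf {vec}}}(\varDelta X_\alpha )^\top \mid X_\alpha =x\bigr ]\rightarrow a(x),
\]
then the piecewise-constant interpolation \(X_\alpha (t)\) converges weakly to the Markov process with generator \({\mathcal {L}}f(x)=b(x)^\top \nabla f(x)+\frac{1}{2}{{\textbf {tr}}}\bigl (a(x)\nabla ^2 f(x)\bigr )\). In Lemma 3.1 the infinitesimal variance is \({\mathcal {O}}(\alpha )\), so \(a\equiv 0\) and the limit is the deterministic flow of the ODE (3.7); in Lemma 3.2 the \(\alpha ^{-1/2}\) rescaling of the fluctuations keeps the variance nondegenerate, which is what produces the SDE limit (3.14).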
C Proof of Lemma 3.2
Proof
Since \(X_{\alpha }(t)\) lies in a neighborhood of some stationary point \({\tilde{X}}\) associated with the eigen-index set \({\mathcal {I}}_{p}=\{i_{1},\ldots , i_{p}\}\) and its extension \(\hat{{\mathcal {I}}}_p\), we impose, for simplicity, two additional assumptions on \({\tilde{X}}\):
(i) \(i_{1}> i_{2}>\cdots> i_{p-1}>i_{p}\);

(ii) \(\lambda _{i_{1}}, \lambda _{i_{2}}, \ldots , \lambda _{i_{p-1}}\) are simple eigenvalues and \(\lambda _{i_{p}}\) has multiplicity \((q-p+1)\ge 1\),

where \(q=\vert \hat{{\mathcal {I}}}_p\vert \). Then, \(X_{\alpha }(t)\) can be expressed in the form
where \(E_q=[e_{i_{1}}, e_{i_{2}}, \ldots , e_{i_{q}}]\in {\mathbb {R}}^{n\times q},~\varLambda _{q}={ {\textbf {Diag}}}(\lambda _{i_{1}}, \lambda _{i_{2}}, \ldots , \lambda _{i_{q}})\in {\mathbb {R}}^{q\times q}\), V is a p-dimensional orthogonal matrix, and the weight matrix \(W\in {\mathbb {R}}^{q\times p}\) is defined by
Let \( \varDelta Y_\alpha (t) = Y_\alpha (t+\alpha )-Y_\alpha (t)=\alpha ^{-\frac{1}{2}}\left( X_\alpha (t+\alpha )-X_\alpha (t)\right) V. \) In the same vein as the proof of Lemma 3.1, we compute the (np)-dimensional infinitesimal mean as
and the (np)-dimensional infinitesimal variance matrix as
where
where \(\varSigma _h=\sum _{i=1}^h a_{i}a_{i}^\top /h\) and \(a_{i}\in {\mathbb {R}}^n~(i = 1, \ldots , h)\) are i.i.d. copies of the random vector a. For any \(1\le l,~r\le p\), we have
where
and \(E_{l, r}\) denotes the n-dimensional matrix with its (l, r)th element being one and all other entries being zero. Using (C.1) and (C.2) and applying Corollary 4.2 in Section 7.4 of [9] finally yield the SDE approximation (3.14). It can easily be verified that (3.14) remains valid even when the additional assumptions (i) and (ii) are removed. \(\square \)
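As a point of orientation (and the reason the classical reference of Uhlenbeck appears in the bibliography), a weak limit obtained through an \(\alpha ^{-1/2}\) rescaling around a stationary point is a linear diffusion of Ornstein–Uhlenbeck type, i.e., informally,
\[
{\mathrm {d}}Y(t)=-A\,Y(t)\,{\mathrm {d}}t+G\,{\mathrm {d}}B(t),
\]
with the constant drift matrix \(A\) read off from the infinitesimal mean and the constant diffusion matrix \(G\) from the infinitesimal variance computed above; this is only an informal reading of (3.14), whose precise coefficients are those given by (C.1) and (C.2).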
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, S., Liu, X. & Xu, L. Stochastic Gauss–Newton Algorithms for Online PCA. J Sci Comput 96, 72 (2023). https://doi.org/10.1007/s10915-023-02289-0