
arXiv:1711.06644v1 [math.OC] 17 Nov 2017

Nearly Optimal Stochastic Approximation for Online
Principal Subspace Estimation
Xin Liang∗

Zhen-Chen Guo†

Ren-Cang Li‡

Wen-Wei Lin§

November 20, 2017

Abstract
Processing streaming data as they arrive is often necessary for high dimensional data
analysis. In this paper, we analyse the convergence of a subspace online PCA iteration,
as a followup of the recent work of Li, Wang, Liu, and Zhang [Math. Program., Ser. B,
DOI 10.1007/s10107-017-1182-z] who considered the case for the most significant principal
component only, i.e., a single vector. Under the sub-Gaussian assumption, we obtain a finite-sample error bound that closely matches the minimax information lower bound.


Key words. Principal component analysis, Principal component subspace, Stochastic approximation, High-dimensional data, Online algorithm, Finite-sample analysis
AMS subject classifications. 62H25, 68W27, 60H10, 60H15

Contents
1 Introduction
2 Canonical Angles
3 Online PCA for Principal Subspace
4 Main Results
5 Comparisons with Previous Results
6 Proof of Theorem 4.1
   6.1 Simplification
   6.2 Increments of One Iteration
   6.3 Quasi-power iteration process
   6.4 Proof of Theorem 4.1
7 Proofs of Theorems 4.2 and 4.3
8 Conclusion

∗ Department of Applied Mathematics, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. E-mail: liangxinslm@pku.edu.cn. Supported in part by the MOST.
† Department of Applied Mathematics, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. E-mail: guozhch06@gmail.com. Supported in part by the MOST.
‡ Department of Mathematics, University of Texas at Arlington, P.O. Box 19408, Arlington, TX 76019, USA. E-mail: rcli@uta.edu. Supported in part by NSF grants CCF-1527104, DMS-1719620, and NSFC grant 11428104.
§ Department of Applied Mathematics, National Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. E-mail: wwlin@am.nctu.edu.tw. Supported in part by the MOST, the NCTS, and the ST Yau Centre at the National Chiao Tung University.


1 Introduction

Principal component analysis (PCA), introduced in [7, 10], is one of the most well-known and popular methods for dimension reduction in high-dimensional data analysis.
Let X ∈ R^d be a random vector with mean E{X} and covariance matrix

Σ = E{(X − E{X})(X − E{X})^T}.

To reduce the dimension of X from d to p, PCA looks for a p-dimensional linear subspace that is
closest to the centered random vector X −E{X} in a mean squared sense, through the independent
and identically distributed samples X (1) , . . . , X (n) .
Denote by Gp (Rd ) the Grassmann manifold of p-planes in Rd , or equivalently, the set of all
p-dimensional subspaces of Rd . Without loss of generality, we assume E{X} = 0. Then PCA
corresponds to a stochastic optimization problem


min_{U ∈ Gp(R^d)} E{‖(I_d − Π_U)X‖2²},   (1.1)

where Id is the d × d identity matrix, and ΠU is the orthogonal projector onto the subspace U. Let
Σ = UΛU^T be the spectral decomposition of Σ, where

Λ = diag(λ1, ..., λd) with λ1 ≥ ··· ≥ λd ≥ 0, and orthogonal U = [u1, ..., ud].   (1.2)

If λp > λp+1 , then the unique solution to the optimization problem (1.1), namely the p-dimensional
principal subspace of Σ, is U∗ = R([u1 , . . . , up ]), the subspace spanned by u1 , . . . , up .
In practice, Σ is unknown, and we must use sample data to estimate U∗. The classical PCA does it by the spectral decomposition of the empirical covariance matrix Σ̂ = (1/n) Σ_{i=1}^n X^(i)(X^(i))^T. Specifically, the classical PCA uses Û∗ = R([û1, ..., ûp]) to estimate U∗, where ûi is the corresponding eigenvector of Σ̂. The important quantity is the distance between U∗ and Û∗. Vu and Lei [21, Theorem 3.1] proved that if p(d − p)σ∗²/n is bounded, then

inf_{Ũ∗ ∈ Gp(R^d)} sup_{X ∈ P0(σ∗², d)} E{‖sin Θ(Ũ∗, U∗)‖F²} ≥ c p(d − p) σ∗²/n,   (1.3)

where c > 0 is an absolute constant, and P0(σ∗², d) is the set of all d-dimensional sub-Gaussian distributions for which the eigenvalues of the covariance matrix satisfy λ1λp+1/(λp − λp+1)² ≤ σ∗². Note that λ1λp+1/(λp − λp+1)² is the effective noise variance. In the classical PCA, obtaining the empirical covariance matrix has time complexity O(nd²) and space complexity O(d²). So storing and calculating a large empirical covariance matrix is very expensive when the data are of high dimension, not to mention the cost of computing its eigenvalues and eigenvectors.
To reduce both the time and space complexities, Oja [14] proposed an online PCA iteration

ũ^(n) = u^(n−1) + β^(n−1) X^(n)(X^(n))^T u^(n−1),   u^(n) = ũ^(n) ‖ũ^(n)‖2^{−1},   (1.4)

to approximate the most significant principal component, where β^(n) > 0 is a stepsize. Later Oja and Karhunen [15] proposed a subspace online PCA iteration

Ũ^(n) = U^(n−1) + X^(n)(X^(n))^T U^(n−1) diag(β1^(n−1), ..., βp^(n−1)),   U^(n) = Ũ^(n) R^(n),   (1.5)

to approximate the principal subspace U∗, where βi^(n) > 0 for 1 ≤ i ≤ p are stepsizes, and R^(n) is a normalization matrix that makes U^(n) have orthonormal columns. One such R^(n) is

R^(n) = [(Ũ^(n))^T Ũ^(n)]^{−1/2}.   (1.6)

Usually, we can use a fixed stepsize β. It can be seen that these methods update the approximations incrementally by processing data one at a time as soon as they come in, and calculating the empirical covariance matrix explicitly is completely avoided. In the online PCA, obtaining the principal subspace has time complexity O(p²nd) and space complexity O(pd), which is much less than those required by the classical PCA.
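To make the single-vector iteration (1.4) concrete, here is a minimal NumPy sketch (our own illustration; the synthetic Gaussian distribution, dimensions, and stepsize are assumptions, not from the paper). It runs Oja's update with a fixed stepsize on samples whose covariance is diag(λ1, ..., λd) and checks the alignment of u^(n) with the top eigenvector e1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples, beta = 8, 5000, 0.01
lam = np.array([4.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5])  # lambda_1 > lambda_2 >= ...

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                          # random unit initial guess u^(0)

for _ in range(n_samples):
    x = rng.standard_normal(d) * np.sqrt(lam)   # sample with covariance diag(lam)
    u = u + beta * x * (x @ u)                  # u~^(n) = u^(n-1) + beta X X^T u^(n-1)
    u /= np.linalg.norm(u)                      # u^(n) = u~^(n) / ||u~^(n)||_2

# the most significant principal component of diag(lam) is e_1,
# so |u[0]| is the cosine of the angle to it
print(abs(u[0]))
```

Note that each pass touches only a d-vector, matching the O(pd) = O(d) storage claim for p = 1.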
Although the online PCA iteration (1.4) was proposed over 30 years ago, its convergence analysis
is rather scarce. Some recent works [2, 8, 16] studied the convergence of the online PCA for the
most significant principal component, i.e., u1 , from different points of view and obtained some
results for the case where the samples are almost surely uniformly bounded. For such a case, De
Sa, Olukotun, and Ré [4] studied a different but closely related problem, in which the angular
part is equivalent to the online PCA, and obtained some convergence results. In contrast, for
the distributions with sub-Gaussian tails (note that the samples of this kind of distributions may
be unbounded), Li, Wang, Liu, and Zhang [11] proved a nearly optimal convergence rate for the
iteration (1.4): if the initial guess u(0) is randomly chosen according to a uniform distribution and
the stepsize β is chosen in accordance with the sample size n, then there exists a high-probability
event A∗ with P{A∗ } ≥ 1 − δ such that
E{|tan Θ(u^(n), u∗)|² | A∗} ≤ C(d, n, δ) (ln n/n) [1/(λ1 − λ2)] Σ_{i=2}^d λ1λi/(λ1 − λi)   (1.7a)
                            ≤ C(d, n, δ) [λ1λ2/(λ1 − λ2)²] (d − 1) ln n / n,   (1.7b)

where δ ∈ [0, 1), u∗ = u1 in (1.2), and C(d, n, δ) can be approximately treated as a constant. It can be seen that this bound matches the minimax lower bound (1.3) up to a logarithmic factor of n, and hence is nearly optimal. Also, the convergence rate holds true as long as the initial approximation satisfies

|tan Θ(u^(0), u∗)| ≤ cd,   (1.8)

for some constant c > 0, which means it is nearly global. This is significant because a uniformly distributed initial value is nearly orthogonal to the principal component with high probability when d is large. This result is more general than previous ones in [2, 8, 16], because it is for distributions that can possibly be unbounded, and the convergence rate is nearly optimal and nearly global. For more details of the comparison, the reader is referred to [11].
However, there is still no convergence result for the subspace online PCA, namely the subspace
iteration (1.5). Our main purpose in this paper is to analyze the convergence of the subspace online
PCA iteration (1.5), similarly to the effort in [11] which is for the special case p = 1. One of our
results for the convergence rate states the following: if the initial guess U^(0) is randomly chosen so that R(U^(0)) is uniformly sampled from Gp(R^d), and the stepsizes βi^(n) for 1 ≤ i ≤ p are chosen equal and in accordance with the sample size n, then there exists a high-probability event H∗ with P{H∗} ≥ 1 − 2δ^{p²}, such that

E{‖tan Θ(U^(n), U∗)‖F² | H∗} ≤ C(d, n, δ) (ln n/n) [1/(λp − λp+1)] Σ_{j=1}^p Σ_{i=p+1}^d λjλi/(λj − λi)   (1.9a)
                              ≤ C(d, n, δ) [λpλp+1/(λp − λp+1)²] p(d − p) ln n / n,   (1.9b)

where the constant C(d, n, δ) → 24ψ⁴/(1 − δ^{p²}) as d → ∞ and n → ∞, and ψ is the Orlicz norm of X.
This is also nearly optimal, nearly global, and valid for any sub-Gaussian distribution. When p = 1, it degenerates to (1.7), as it should. Although this result of ours looks like a straightforward generalization, its proof turns out to be nontrivially more complicated.
The rest of this paper is organized as follows. Section 2 reviews the basics about the canonical angles and the canonical angle matrix between two p-dimensional subspaces and the metrics on Gp(R^d), and proves a lemma on the tangent of the canonical angle matrix, which will be used in later proofs. In section 3, we reformulate the subspace online PCA iteration (1.5) for the case βi^(n) = β for all 1 ≤ i ≤ p, which will be the version to be analyzed. Our main results, one of which leads to (1.9), are stated in section 4. We compare our results for p = 1 with the recent results in [11] and also outline the technical differences in proof from [11] in section 5. Due to their complexity, the proofs of these results are deferred to sections 6 and 7. Finally, section 8 summarizes the results of the paper.
Notations. R^{n×m} is the set of all n × m real matrices, R^n = R^{n×1}, and R = R^1. I_n (or simply I if its dimension is clear from the context) is the n × n identity matrix and e_j is its jth column (usually with dimension determined by the context). For a matrix X, σ(X), ‖X‖∞, ‖X‖2 and ‖X‖F are the multiset of the singular values, the ℓ∞-operator norm, the spectral norm, and the Frobenius norm of X, respectively. R(X) is the column space spanned by the columns of X, X_(i,j) is the (i, j)th entry of X, and X_(k:ℓ,:) and X_(:,i:j) are the two submatrices of X consisting of its rows k to ℓ and columns i to j, respectively. X ◦ Y is the Hadamard, i.e., entrywise, product of matrices (or vectors) X and Y of the same size.

For any vectors or matrices X, Y of the same size, X ≤ Y (X < Y) means X_(i,j) ≤ Y_(i,j) (X_(i,j) < Y_(i,j)) for all i, j. X ≥ Y (X > Y) if −X ≤ −Y (−X < −Y); X ≤ α (X < α) for a scalar α means X_(i,j) ≤ α (X_(i,j) < α) for all i, j; similarly X ≥ α and X > α.

For a subset or an event A, A^c is the complement of A. By σ{A1, ..., Ap} we denote the σ-algebra generated by the events A1, ..., Ap. N = {1, 2, 3, ...}. E{X; A} := E{X 1_A} denotes the expectation of a random variable X over the event A. Note that

E{X; A} = E{X | A} P{A}.   (1.10)

For a random vector or matrix X, E{X} := [E{X_(i,j)}]. Note that ‖E{X}‖_ui ≤ E{‖X‖_ui} for ui = 2, F. Write cov◦(X, Y) := E{[X − E{X}] ◦ [Y − E{Y}]} and var◦(X) := cov◦(X, X).

2 Canonical Angles

For two subspaces X, Y ∈ Gp(R^d), let X, Y ∈ R^{d×p} be basis matrices of X and Y, respectively, i.e., X = R(X) and Y = R(Y), and denote by σ_j for 1 ≤ j ≤ p in nondecreasing order, i.e., σ1 ≤ ··· ≤ σp, the singular values of

(X^T X)^{−1/2} X^T Y (Y^T Y)^{−1/2}.

The p canonical angles θ_j(X, Y) between X and Y are defined by

0 ≤ θ_j(X, Y) := arccos σ_j ≤ π/2 for 1 ≤ j ≤ p.   (2.1)

They are in non-increasing order, i.e., θ1(X, Y) ≥ ··· ≥ θp(X, Y). Set

Θ(X, Y) = diag(θ1(X, Y), ..., θp(X, Y)).   (2.2)

It can be seen that the angles so defined are independent of the basis matrices X and Y, which are not unique. With the definition of canonical angles,

‖sin Θ(X, Y)‖_ui for ui = 2, F

are metrics on Gp(R^d) [17, Section II.4].
In what follows, we sometimes place a vector or matrix in one or both arguments of θj ( · , · )
and Θ( · , · ) with the understanding that it is about the subspace spanned by the vector or the
columns of the matrix argument.
For any X ∈ R^{d×p}, if X_(1:p,:) is nonsingular, then we can define

T(X) := X_(p+1:d,:) X_(1:p,:)^{−1}.   (2.3)

Lemma 2.1. For X ∈ R^{d×p} with nonsingular X_(1:p,:), we have for ui = 2, F

‖tan Θ(X, [I_p; 0])‖_ui = ‖T(X)‖_ui,   (2.4)

where [I_p; 0] denotes the d × p matrix obtained by stacking I_p on top of the zero matrix.

Proof. Let Y = [I_p; 0] ∈ R^{d×p}. Since [I_p^T, T(X)^T]^T is a basis matrix of R(X), by definition σ_j = cos θ_j(X, Y) for 1 ≤ j ≤ p are the singular values of

[I + T(X)^T T(X)]^{−1/2} [I, T(X)^T] [I_p; 0] = [I + T(X)^T T(X)]^{−1/2}.

So if τ1 ≥ τ2 ≥ ··· ≥ τp are the singular values of T(X), then

σ_j = (1 + τ_j²)^{−1/2} ⇒ τ_j = √(1 − σ_j²)/σ_j = tan θ_j(X, Y),

and hence the identity (2.4).
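The definitions (2.1)-(2.2) and the identity (2.4) are easy to check numerically. The sketch below (our own illustration, not from the paper) computes the canonical angles of a random X ∈ R^{d×p} against Y = [I_p; 0] via the singular values of orthonormalized bases, and confirms ‖tan Θ‖ = ‖T(X)‖ in the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 6, 2
X = rng.standard_normal((d, p))
Y = np.vstack([np.eye(p), np.zeros((d - p, p))])   # Y = [I_p; 0]

# canonical angles via (2.1): QR gives orthonormal bases, so the singular
# values of Qx^T Qy are the cosines of the canonical angles
Qx, _ = np.linalg.qr(X)
Qy, _ = np.linalg.qr(Y)
sigma = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
theta = np.arccos(np.clip(sigma, 0.0, 1.0))

# Lemma 2.1: ||tan Theta(X, [I_p; 0])||_F = ||T(X)||_F,
# with T(X) = X_(p+1:d,:) X_(1:p,:)^{-1} as in (2.3)
T = X[p:, :] @ np.linalg.inv(X[:p, :])
print(np.linalg.norm(np.tan(theta)), np.linalg.norm(T, 'fro'))
```

The two printed numbers agree to machine precision, as the lemma predicts.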

3 Online PCA for Principal Subspace

Let X = [X1, X2, ..., Xd]^T be a random vector in R^d. Assume E{X} = 0. Its covariance matrix Σ := E{XX^T} has the spectral decomposition

Σ = UΛU^T with U = [u1, u2, ..., ud], Λ = diag(λ1, ..., λd),   (3.1)

where U ∈ Rd×d is orthogonal, and λi for 1 ≤ i ≤ d are the eigenvalues of Σ, arranged for
convenience in non-increasing order. Assume
λ1 ≥ ··· ≥ λp > λp+1 ≥ ··· ≥ λd > 0.   (3.2)

In section 1, we mentioned the subspace online PCA iteration (1.5) of Oja and Karhunen [15] for computing the principal subspace of dimension p

U∗ = R(U_(:,1:p)) = R([u1, u2, ..., up]).   (3.3)

In this paper, we will use a fixed stepsize β for all βi^(n) there. Then U^(n) can be stated in a more explicit manner with the help of the following lemma.
Lemma 3.1. Let V ∈ R^{d×p} with V^T V = I_p, y ∈ R^d with y^T y = 1, and 0 < β ∈ R, and let

W := V + βyy^T V = (I_d + βyy^T)V,   V+ := W(W^T W)^{−1/2}.

If V^T y ≠ 0, then

V+ = V + βyy^T V − [1 − (1 + α)^{−1/2}](V + βyy^T V)zz^T,

where γ = ‖V^T y‖2, z = V^T y/γ, and α = β(2 + β)γ². In particular, V+^T V+ = I_p.
Proof. Recall z = V^T y/γ ∈ R^p with γ = ‖V^T y‖2 ≠ 0, so V^T y = γz. We have

W^T W = V^T[I_d + βyy^T]²V = V^T[I_d + β(2 + β)yy^T]V = I_p + αzz^T.

Let Z⊥ ∈ R^{p×(p−1)} be such that [z, Z⊥]^T[z, Z⊥] = I_p. The eigen-decomposition of W^T W is

W^T W = [z, Z⊥] diag(1 + α, I_{p−1}) [z, Z⊥]^T,

which yields

(W^T W)^{−1/2} = [z, Z⊥] diag((1 + α)^{−1/2}, I_{p−1}) [z, Z⊥]^T = (1 + α)^{−1/2} zz^T + Z⊥Z⊥^T

Algorithm 3.1 Subspace Online PCA
1: Choose U^(0) ∈ R^{d×p} with (U^(0))^T U^(0) = I, and choose stepsize β > 0.
2: for n = 1, 2, ... until convergence do
3:   Take a sample X^(n) of X;
4:   Z^(n) = (U^(n−1))^T X^(n),  α^(n) = β[2 + β(X^(n))^T X^(n)](Z^(n))^T Z^(n),  α̃^(n) = (1 + α^(n))^{−1/2};
5:   U^(n) = U^(n−1) + β α̃^(n) X^(n)(Z^(n))^T − [(1 − α̃^(n))/((Z^(n))^T Z^(n))] U^(n−1) Z^(n)(Z^(n))^T;
6: end for

= (1 + α)^{−1/2} zz^T + I_p − zz^T = I_p − [1 − (1 + α)^{−1/2}]zz^T.

Therefore,

V+ = (V + βyy^T V){I_p − [1 − (1 + α)^{−1/2}]zz^T}
   = V + βyy^T V − [1 − (1 + α)^{−1/2}](V + βyy^T V)zz^T,

as expected.
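Lemma 3.1 admits a quick numerical sanity check (our own sketch; the sizes and stepsize are arbitrary assumptions): the closed-form V+ should coincide with the direct normalization W(W^T W)^{−1/2} and have orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(2)
d, p, beta = 7, 3, 0.3
V, _ = np.linalg.qr(rng.standard_normal((d, p)))   # V^T V = I_p
y = rng.standard_normal(d)
y /= np.linalg.norm(y)                             # y^T y = 1

W = V + beta * np.outer(y, y @ V)                  # W = (I_d + beta y y^T) V

# direct normalization W (W^T W)^{-1/2}, via an eigen-decomposition of W^T W
w, Q = np.linalg.eigh(W.T @ W)
V_plus_direct = W @ (Q @ np.diag(w ** -0.5) @ Q.T)

# closed form from Lemma 3.1: V+ = W - [1 - (1 + alpha)^{-1/2}] W z z^T
gamma = np.linalg.norm(V.T @ y)
z = (V.T @ y) / gamma
alpha = beta * (2 + beta) * gamma ** 2
V_plus = W - (1 - (1 + alpha) ** -0.5) * (W @ np.outer(z, z))

print(np.linalg.norm(V_plus - V_plus_direct))
```

The printed difference is at machine-precision level, and V_plus^T V_plus recovers I_p.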
With the help of this lemma, for the fixed stepsize βi^(n) = β for all 1 ≤ i ≤ p, we outline in Algorithm 3.1 a subspace online PCA algorithm based on (1.5) and (1.6). The iteration at its line 5 combines (1.5) and (1.6) as one. This seems like a minor reformulation, but it turns out to be one of the keys that make our analysis go through. The rest of this paper is devoted to analyzing its convergence.
Remark 3.1. A couple of comments are in order for Algorithm 3.1.

1. The vectors X^(n) ∈ R^d for n = 1, 2, ... are independent and identically distributed samples of X.

2. If the algorithm converges, it is expected that

U^(n) → U∗ := U [I_p; 0] = [u1, u2, ..., up]

in the sense that ‖sin Θ(U^(n), U∗)‖_ui → 0 as n → ∞.

The notations introduced in this section, except those in Lemma 3.1, will be adopted throughout the rest of this paper.
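A compact NumPy sketch of Algorithm 3.1 (our own illustration; the synthetic Gaussian data and all parameter values are assumptions, not from the paper), which tracks ‖sin Θ(U^(n), U∗)‖F against the known principal subspace of a diagonal covariance:

```python
import numpy as np

rng = np.random.default_rng(3)
d, p, beta, n_iter = 10, 2, 0.005, 4000
lam = np.array([5.0, 4.0] + [0.5 - 0.02 * i for i in range(d - 2)])  # lambda_p > lambda_{p+1}
U_star = np.eye(d)[:, :p]                          # principal subspace of diag(lam)

U, _ = np.linalg.qr(rng.standard_normal((d, p)))   # U^(0) with orthonormal columns
for _ in range(n_iter):
    x = rng.standard_normal(d) * np.sqrt(lam)      # sample X^(n), covariance diag(lam)
    z = U.T @ x                                    # Z^(n)  (line 4)
    alpha = beta * (2 + beta * (x @ x)) * (z @ z)  # alpha^(n)
    at = (1 + alpha) ** -0.5                       # alpha~^(n)
    # line 5: combined update + normalization
    U = U + beta * at * np.outer(x, z) - (1 - at) / (z @ z) * (U @ np.outer(z, z))

# sines of the canonical angles between R(U^(n)) and U_star
s = np.linalg.svd(U.T @ U_star, compute_uv=False)
sin_theta = np.sqrt(np.clip(1 - s ** 2, 0, None))
print(np.linalg.norm(sin_theta))
```

By Lemma 3.1, U^(n) keeps orthonormal columns at every step without any explicit QR or matrix square root.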

4 Main Results

We point out that any statement we make is meant to hold almost surely. We are concerned with random variables/vectors that have a sub-Gaussian distribution, which we define next. To that end, we need to introduce the Orlicz ψα-norm of a random variable/vector. More details can be found in [19].
Definition 4.1. The Orlicz ψα-norm of a random variable X ∈ R is defined as

‖X‖_{ψα} := inf{ξ > 0 : E{exp(|X/ξ|^α)} ≤ 2},

and the Orlicz ψα-norm of a random vector X ∈ R^d is defined as

‖X‖_{ψα} := sup_{‖v‖2=1} ‖v^T X‖_{ψα}.

We say that a random variable/vector X follows a sub-Gaussian distribution if ‖X‖_{ψ2} < ∞.
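For intuition, the ψ2-norm can be computed in closed form for a standard normal X: E exp((X/ξ)²) = (1 − 2/ξ²)^{−1/2} whenever ξ² > 2, and this is ≤ 2 exactly when ξ² ≥ 8/3, so ‖X‖_{ψ2} = √(8/3) ≈ 1.633. A small bisection sketch (our own illustration, not from the paper) recovers this value from the definition:

```python
import math

def mgf_sq_normal(xi):
    """E exp((X/xi)^2) for X ~ N(0,1): equals (1 - 2/xi^2)^(-1/2) when xi^2 > 2."""
    return (1.0 - 2.0 / xi ** 2) ** -0.5

# bisect for inf{xi > 0 : E exp((X/xi)^2) <= 2}; the map xi -> mgf is decreasing
lo, hi = 1.5, 3.0      # mgf(1.5) = 3 > 2, mgf(3) ~ 1.13 <= 2
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if mgf_sq_normal(mid) <= 2.0:
        hi = mid
    else:
        lo = mid

print(hi, math.sqrt(8.0 / 3.0))   # both ~ 1.63299
```

The same reasoning shows any bounded random variable is sub-Gaussian, since its moment generating function in the definition is finite for every ξ and tends to 1 as ξ grows.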

By the definition, we conclude that any bounded random variable/vector follows a sub-Gaussian distribution. To prepare for our convergence analysis, we make a few assumptions.

Assumption 4.1. X = [X1, X2, ..., Xd]^T ∈ R^d is a random vector such that

(A-1) E{X} = 0, and Σ := E{XX^T} has the spectral decomposition (3.1) satisfying (3.2);

(A-2) ψ := ‖Σ^{−1/2}X‖_{ψ2} < ∞.

The principal subspace U∗ in (3.3) is uniquely determined under (A-1) of Assumption 4.1. On the other hand, (A-2) of Assumption 4.1 ensures that all 1-dimensional marginals of X have sub-Gaussian tails, or equivalently, that X follows a sub-Gaussian distribution. This is also an assumption used in [11].
Assumption 4.2.
(A-3) There is a constant ν ≥ 1, independent of d, such that

‖(I_d − ΠU∗)X‖2² ≤ (ν − 1)p max{‖U∗^T X‖∞², λ1}.   (4.1)

To intuitively understand Assumption 4.2, we note U∗^T X ∈ R^p and thus ‖ΠU∗X‖2 = ‖U∗^T X‖2 ≤ √p ‖U∗^T X‖∞. Hence (4.1) is implied by

‖(I_d − ΠU∗)X‖2 ≤ √(ν − 1) ‖ΠU∗X‖2.   (4.2)

This inequality can be easily interpreted: the optimal principal component subspace U∗ should be inclusive enough that the component of X in the orthogonal complement of U∗ is almost surely dominated by the component of X in U∗ up to a constant factor √(ν − 1). The case ν = 1 is rather special: it means (I_d − ΠU∗)X = 0 almost surely, i.e., X ∈ U∗ almost surely.
In what follows, we will state our main results and leave their proofs to later sections because of their high complexity. To that end, first we introduce some quantities. The eigenvalue gap is

γ = λp − λp+1.   (4.3)

For s > 0 and the stepsize β < 1 such that βγ < 1, define

N_s(β) := min{n ∈ N : (1 − βγ)^n ≤ β^s} = ⌈ s ln β / ln(1 − βγ) ⌉,   (4.4)

where ⌈·⌉ is the ceiling function taking the smallest integer that is no smaller than its argument, and for 0 < ε < 1/7,

M(ε) := min{m ∈ N : β^{7ε/2−1/2} ≤ β^{(1−2^{1−m})(3ε−1/2)}} = 2 + ⌈ ln((1/2 − 3ε)/ε) / ln 2 ⌉ ≥ 2.   (4.5)
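The quantity N_s(β) in (4.4) is just the number of steps the deterministic contraction factor (1 − βγ)^n needs to drop below β^s; a tiny sketch (our own, with arbitrary sample values) confirms the ceiling formula against a direct count:

```python
import math

def N_s(beta, gamma, s):
    """Smallest n in N with (1 - beta*gamma)^n <= beta^s, by direct search, per (4.4)."""
    n = 1
    while (1 - beta * gamma) ** n > beta ** s:
        n += 1
    return n

beta, gamma, s = 0.01, 0.5, 1.5
closed_form = math.ceil(s * math.log(beta) / math.log(1 - beta * gamma))
print(N_s(beta, gamma, s), closed_form)
```

Both computations return the same integer, illustrating why smaller stepsizes β force proportionally longer burn-in periods in the theorems below.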
Theorem 4.1. Given

ε ∈ (0, 1/7),  ω > 0,  φ > 0,  κ > 6^{[M(ε)−1]/2} max{2φω^{1/2}, 2},   (4.6a)

0 < β < min{ 1, [(√2 − 1)/(νpλ1)]^{1/(1−2ε)}, [γ²/(8κpλ1)]^{1/ε}, [1/(130ν^{1/2}κ²p²λ1²)]^{2/(1−4ε)} }.   (4.6b)

Let U (n) for n = 1, 2, . . . be the approximations of U∗ generated by Algorithm 3.1. Under Assumptions 4.1 and 4.2, if ktan Θ(U (0) , U∗ )k22 ≤ φ2 d − 1, and
dβ 1−7ε ≤ ω,

K > N3/2−37ε/4 (β),

7

(4.7)

then there exist absolute constants¹ Cψ, Cν, C◦ and a high-probability event H with

P{H} ≥ 1 − K[(2 + e)d + ep] exp(−Cνψ β^{−ε})

such that for any n ∈ [N_{3/2−37ε/4}(β), K],

E{‖tan Θ(U^(n), U∗)‖F²; H} ≤ (1 − βγ)^{2(n−1)} 2pφ²d + [32ψ⁴β/(2 − λ1β^{1−2ε})] ϕ(p, d; Λ) + C◦κ⁴ν²λ1²γ^{−1}p³ √(d − p) β^{3/2−7ε},   (4.8)

where e = exp(1) is Euler's number, Cνψ = max{Cν ν^{−1/2}, Cψ ψ^{−1}}, and²

ϕ(p, d; Λ) := Σ_{j=1}^p Σ_{i=p+1}^d λjλi/(λj − λi) ∈ [ p(d − p)λ1λd/(λ1 − λd), p(d − p)λpλp+1/(λp − λp+1) ].   (4.9)

The conclusion of Theorem 4.1 holds for any given U (0) satisfying ktan Θ(U (0) , U∗ )k22 ≤ φ2 d− 1.
However, it is not easy, if at all possible, to verify this condition. Next we consider a randomly
selected U (0) .
Suppose on Gp (Rd ) we use a uniform distribution, the one with the Haar invariant probability
measure (see [3, Section 1.4] and [9, Section 4.6]). We refer the reader to [3, Section 2.2] on how to
generate such a uniform distribution on Gp (Rd ). Our assumption for a randomly selected U (0) is
randomly selected U (0) satisfies that R(U (0) ) is uniformly sampled from Gp (Rd ).

(4.10)

Theorem 4.2. Under Assumptions 4.1 and 4.2, for sufficiently large d and any β satisfying (4.6b) with κ = 6^{[M(ε)−1]/2} max{2Cp, 2}, and

ε ∈ (0, 1/7),  p < (d + 1)/2,  δ ∈ [0, 1],  K > N_{3/2−37ε/4}(β),

where Cp is a constant dependent only on p, if (4.10) holds, and

dβ^{1−3ε} ≤ δ²,  K[(2 + e)d + ep] exp(−Cνψ β^{−ε}) ≤ δ^{p²},

then there exists a high-probability event H∗ with P{H∗} ≥ 1 − 2δ^{p²} such that

E{‖tan Θ(U^(n), U∗)‖F²; H∗} ≤ 2(1 − βγ)^{2(n−1)} p Cp² δ^{−2} d + [32ψ⁴β/(2 − λ1β^{1−2ε})] ϕ(p, d; Λ) + C◦κ⁴ν²λ1²γ^{−1}p³ √(d − p) β^{3/2−7ε}   (4.11)

for any n ∈ [N_{3/2−37ε/4}(β), K], where ϕ(p, d; Λ) is as in (4.9).

Finally, suppose that the number of principal components p and the eigenvalue gap γ = λp − λp+1 are known in advance, and the sample size is fixed at N∗. We must choose a proper β to obtain the principal components as accurately as possible. A good choice turns out to be

β = β∗ := 3 ln N∗ / (2γN∗).   (4.12)

¹ We attach each with a subscript for the convenience of indicating their associations. They do not change as the values of the subscript variables vary, which is what we mean by absolute constants. Later, in (6.5), we explicitly bound these absolute constants.
² To see the inclusion in (4.9), we note the following: if 0 ≤ a ≤ c < d ≤ b, then

0 ≤ 1/b ≤ 1/d < 1/c ≤ 1/a  ⇒  dc/(d − c) = 1/(1/c − 1/d) ≥ 1/(1/a − 1/b) = ab/(b − a).

Theorem 4.3. Under Assumptions 4.1 and 4.2, for sufficiently large d ≥ 2p, sufficiently large N∗, and δ ∈ [0, 1] satisfying

dβ∗^{1−3ε} ≤ δ²,  N∗[(2 + e)d + ep] exp(−Cνψ β∗^{−ε}) ≤ δ^{p²},

where β∗ is given by (4.12), if (4.10) holds, then there exists a high-probability event H∗ with P{H∗} ≥ 1 − 2δ^{p²}, such that

E{‖tan Θ(U^(N∗), U∗)‖F²; H∗} ≤ C∗(d, N∗, δ) [ϕ(p, d; Λ)/(λp − λp+1)] (ln N∗/N∗),   (4.13)

where the constant C∗(d, N∗, δ) → 24ψ⁴ as d → ∞, N∗ → ∞, and ϕ(p, d; Λ) is as in (4.9).
In Theorems 4.1, 4.2 and 4.3, the conclusions are stated in terms of the expectation of ‖tan Θ(U^(n), U∗)‖F² over some highly probable event. These expectations can be turned into conditional expectations, thanks to the relation (1.10). In fact, (1.9) is a consequence of (4.13) and (1.10).

The proofs of the three theorems are given in sections 6 and 7. Although overall our proofs follow the same structure as that in Li, Wang, Liu, and Zhang [11], there are inherently critical subtleties in going from one dimension (p = 1) to multiple dimensions (p > 1). In fact, one of the key steps in the proof for p = 1 does not seem to work for p > 1. More details will be discussed in the next section.

5 Comparisons with Previous Results

Our three theorems in the previous section, namely Theorems 4.1, 4.2 and 4.3, are the analogs for p > 1 of Li, Wang, Liu, and Zhang's three theorems [11, Theorems 1, 2, and 3], which are for p = 1 only. Naturally, we would like to know how our results, when applied to the case p = 1, and our proofs stand against those in [11]. In what follows, we do a fairly detailed comparison. But before we do that, let us state their theorems (in our notation).
Theorem 5.1 ([11, Theorem 1]). Under Assumption 4.1 and p = 1, suppose there exists a constant φ > 1 such that tan² Θ(U^(0), U∗) ≤ φ²d. Then for any ε ∈ (0, 1/8), stepsize β > 0 satisfying d[λ1²γ^{−1}β]^{1−2ε} ≤ b1φ^{−2}, and any t > 1, there exists an event H with

P{H} ≥ 1 − 2(d + 2)N̂°(β, φ) exp(−C0[λ1²γ^{−1}β]^{−2ε}) − 4dN̂t(β) exp(−C1[λ1²γ^{−1}β]^{−2ε}),

such that for any n ∈ [N̂1(β) + N̂°(β, φ), N̂t(β)]

E{tan² Θ(U^(n), U∗); H} ≤ (1 − βγ)^{2[n−N̂°(β,φ)]} + C2βϕ(1, d; Λ) + C2 Σ_{i=2}^d [(λ1 − λ2)/(λ1 − λi)] [λ1²γ^{−1}β]^{3/2−4ε},   (5.1)

where b1 ∈ (0, ln² 2/16), C0, C1, C2 are absolute constants, and

N̂°(β, φ) := min{n ∈ N : (1 − βγ)^n ≤ [4φ²d]^{−1}} = ⌈ −ln[4φ²d] / ln(1 − βγ) ⌉,
N̂s(β) := min{n ∈ N : (1 − βγ)^n ≤ [λ1²γ^{−1}β]^s} = ⌈ s ln[λ1²γ^{−1}β] / ln(1 − βγ) ⌉.

What we can see is that Theorem 4.1 for p = 1 is essentially the same as Theorem 5.1. In fact, since (1 − βγ)^{1−N̂°(β,φ)} ≤ 4φ²d ≤ (1 − βγ)^{−N̂°(β,φ)}, the upper bounds by (4.8) for p = 1 and by (5.1) are comparable in the sense that they are of the same order in d, β, δ. The major difference is perhaps the use of Assumption 4.2 by Theorem 4.1, but not by Theorem 5.1. In this sense, for the case p = 1, Theorem 5.1 is stronger.
A natural question then arises as to whether Assumption 4.2 is really necessary for the conclusion of Theorem 4.1 to hold. We need it here for our later proof to go through. One might hope to generalize the proving techniques in [11], which are for the one-dimensional case (p = 1), to the multi-dimensional case (p > 1) so that Assumption 4.2 becomes unnecessary. Indeed, we tried but did not succeed, due to what we believe are insurmountable obstacles. We now explain. The basic structure of the proof in [11] is to split the Grassmann manifold Gp(R^d), where the initial guess comes from, into two regions: the cold region and the warm region. Roughly speaking, an approximation U^(n) in the warm region means that ‖tan Θ(U^(n), U∗)‖F is small, while one in the cold region means that ‖tan Θ(U^(n), U∗)‖F is not that small. U∗ sits at the "center" of the warm region, which is wrapped around by the cold region. The proof is divided into two cases: the first case is when the initial guess is in the warm region, and the other is when it is in the cold region. For the first case, they proved that the algorithm will produce a sequence convergent to the principal subspace (which is actually the most significant principal component because it is for p = 1) with high probability. For the second case, they first proved that the algorithm will produce a sequence of approximations that, after a finite number of iterations, will fall into the warm region with high probability, and then used the conclusion proved for the first case to conclude the proof, by the Markov property.
For our situation p > 1, we still structure our proof in the same way, i.e., dividing the whole proof into the two cases of the cold region and the warm region. The proof in [11] for the warm region case can be carried over with a little extra effort, as we will see later, but we did not find it possible to use an argument similar to that in [11] to get the job done for the cold region case. Three major difficulties are as follows. In [11], essentially ‖cot Θ(U^(n), U∗)‖F was used to track the behavior of a martingale along with the power iteration. Note that cot Θ(U^(n), U∗) is p × p and thus a scalar when p = 1, perfectly well-conditioned if treated as a matrix. But for p > 1, it is a genuine matrix and, in fact, the inverse of a random matrix in the proof. The first difficulty is how to estimate this inverse, because it may not even exist! We tried to separate the flow of U^(n) into two subflows, the ill-conditioned flow and the well-conditioned flow, and to estimate the related quantities separately. Here the ill-conditioned flow at each step represents the subspace generated by the singular vectors of cot Θ(U^(n), U∗) whose corresponding singular values are tiny, while the well-conditioned flow at each step represents the subspace generated by the other singular vectors, on which the inverse (restricted to this subspace) is well conditioned. Unfortunately, tracking the two flows can be an impossible task because, due to the randomness, some elements in the ill-conditioned flow could jump to the well-conditioned flow during the iteration, and vice versa. This is the second difficulty. The third one is to build a martingale to go along with a proper power iteration, or equivalently, to find the Doob decomposition of the process, because the recursion formula for the main part of the inverse (the drift in the Doob decomposition), even if limited to the well-conditioned flow, is not a linear operator, which makes it impossible to build a proper power iteration.

In the end, to deal with the cold region, we gave up the idea of estimating ‖cot Θ(U^(n), U∗)‖F. Instead, we invented another method: cutting the cold region into many layers, each wrapped around by another, with the innermost one around the warm region. We prove that an initial guess in any layer will produce a sequence of approximations that falls into its inner neighbor layer (or the warm region if the layer is the innermost) in a finite number of iterations with high probability. Therefore, eventually, any initial guess in the cold region will lead to an approximation in the warm region within a finite number of iterations with high probability, returning us to the case of initial guesses coming from the warm region, by the Markov property. This enables us to completely avoid the difficulties mentioned above. This technique can also be used in the one-dimensional case to simplify the proof in [11], provided Assumption 4.2 holds.
Theorem 5.2 ([11, Theorem 2]). Under Assumption 4.1 and p = 1, suppose that U^(0) is uniformly sampled from the unit sphere. Then for any ε ∈ (0, 1/8) and stepsize β > 0, δ > 0 satisfying

d[λ1²γ^{−1}β]^{1−2ε} ≤ b2δ²,  4dN̂2(β) exp(−C3[λ1²γ^{−1}β]^{−2ε}) ≤ δ,

there exists an event H∗ with P{H∗} ≥ 1 − 2δ such that for any n ∈ [N̂2(β), N̂3(β)]

E{tan² Θ(U^(n), U∗); H∗} ≤ C4(1 − βγ)^{2n}δ^{−4}d² + C4βϕ(1, d; Λ) + C4 Σ_{i=2}^d [(λ1 − λ2)/(λ1 − λi)] [λ1²γ^{−1}β]^{3/2−4ε},   (5.2)

where b2, C3, C4 are absolute constants.
Theorem 5.3 ([11, Theorem 3]). Under Assumption 4.1 and p = 1, suppose that U^(0) is uniformly sampled from the unit sphere and let β∗ = 2 ln N∗/(γN∗). Then for any ε ∈ (0, 1/8), N∗ ≥ 1, δ > 0 satisfying

d[λ1²γ^{−1}β∗]^{1−2ε} ≤ b3δ²,  4dN̂2(β∗) exp(−C6[λ1²γ^{−1}β∗]^{−2ε}) ≤ δ,

there exists an event H∗ with P{H∗} ≥ 1 − 2δ such that

E{tan² Θ(U^(N∗), U∗); H∗} ≤ C∗(d, N∗, δ) [ϕ(1, d; Λ)/(λ1 − λ2)] (ln N∗/N∗),   (5.3)

where the constant C∗(d, N∗, δ) → C5 as d → ∞, N∗ → ∞, and b3, C5, C6 are absolute constants.
Our Theorems 4.2 and 4.3, when applied to the case p = 1, do not exactly yield Theorems 5.2 and 5.3, respectively. But the resulting conditions and upper bounds have the same orders in the variables d, β, δ, and the coefficients of β and ln N∗/N∗ in the upper bounds are comparable. We note, moreover, that the first term on the right-hand side of (4.11) is proportional to d, not d² as in (5.2); so ours is tighter.

The proofs here for Theorems 4.2 and 4.3 are nearly the same as those in [11] for Theorems 5.2 and 5.3, owing to the fact that the difficult estimates have already been taken care of by either Theorem 4.1 or Theorem 5.1. But still there are some extras for p > 1, namely, estimating the marginal probability for the uniform distribution on a Grassmann manifold of dimension higher than 1. We could not find this in the literature, and thus had to build it ourselves with the help of the theory of special functions of a matrix argument, which is rarely used in the statistical community.

It may also be worth pointing out that all absolute constants in our theorems, except Cp, which has an explicit expression in (7.4), and Cψ, are concretely bounded as in (6.5), whereas those in Theorems 5.1 to 5.3 are not.

6 Proof of Theorem 4.1

In this section, we prove Theorem 4.1. For that purpose, we build a fair amount of preparatory material in sections 6.1 to 6.3 before proving the theorem in section 6.4. Figure 6.1 gives a pictorial description of our proof process.
[Figure 6.1 here: a flow diagram showing how Lemmas 6.1-6.6, from sections 6.2 and 6.3, combine to prove Theorem 4.1.]

Figure 6.1: Proof process for Theorem 4.1

6.1 Simplification

Without loss of generality, we may assume that the covariance matrix Σ is diagonal. Otherwise, we can perform a (constant) orthogonal transformation as follows. Recall the spectral decomposition Σ = UΛU^T in (3.1). Instead of the random vector X, we equivalently consider

Y ≡ [Y1, Y2, ..., Yd]^T := U^T X.

Accordingly, perform the same orthogonal transformation on all involved quantities:

Y^(n) = U^T X^(n),  V^(n) = U^T U^(n),  V∗ = U^T U∗ = [I_p; 0].   (6.1)

As consequences, we will have the equivalent versions of Algorithm 3.1 and Theorems 4.1, 4.2 and 4.3.
Firstly, because

    (V^(n−1))^T Y^(n) = (U^(n−1))^T X^(n),   (Y^(n))^T Y^(n) = (X^(n))^T X^(n),

the equivalent version of Algorithm 3.1 is obtained by symbolically replacing all letters X, U by
Y, V while keeping their respective superscripts. If the algorithm converges, it is expected that
R(V^(n)) → R(V_*). Secondly, noting

    ‖Σ^{−1/2} X‖_{ψ₂} = ‖U Λ^{−1/2} U^T X‖_{ψ₂} = ‖Λ^{−1/2} Y‖_{ψ₂},
we can restate Assumptions 4.1 and 4.2 equivalently as

(A-1′) E{Y} = 0, E{Y Y^T} = Λ = diag(λ₁, ..., λ_d) with (3.2);

(A-2′) ψ := ‖Λ^{−1/2} Y‖_{ψ₂} < ∞;

(A-3′) ‖(I_d − Π_{V_*}) Y‖₂² ≤ (ν − 1) p max{‖Y‖_∞², λ₁}, where ν ≥ 1 is a constant independent of d,
and V_* = R(V_*).

Thirdly, all canonical angles between two subspaces are invariant under the orthogonal transformation.
Therefore the equivalent versions of Theorems 4.1, 4.2 and 4.3 for Y are also simply obtained
by replacing all letters X, U by Y, V while keeping their respective superscripts.
In the rest of this section, we will prove the mentioned equivalent version of Theorem 4.1.
Likewise, in the next section, we will prove the equivalent versions of Theorems 4.2 and 4.3. In
other words, essentially we assume that Σ is diagonal.
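The invariance of canonical angles under a fixed orthogonal transformation is easy to confirm numerically. Below is a minimal sketch; the sizes d = 6, p = 2, the seed, and the use of NumPy are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 6, 2

def canonical_angle_cosines(A, B):
    """Cosines of the canonical (principal) angles between R(A) and R(B),
    computed as the singular values of Qa^T Qb."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

A = rng.standard_normal((d, p))
B = rng.standard_normal((d, p))
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # a fixed orthogonal matrix U

# The canonical angles between R(A) and R(B) coincide with those
# between R(U^T A) and R(U^T B).
c1 = canonical_angle_cosines(A, B)
c2 = canonical_angle_cosines(U.T @ A, U.T @ B)
assert np.allclose(c1, c2)
```

Computing the cosines as singular values of Qa^T Qb is the standard numerical recipe for principal angles between subspaces.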
To facilitate our proof, we introduce notation for two particular submatrices of any V ∈ R^{d×p}:

    V̄ = V_(1:p,:),   V̲ = V_(p+1:d,:).    (6.2)

In particular, T(V) = V̲ V̄^{−1} for the operator T defined in (2.3), provided V̄ is nonsingular. Set

    Λ̄ = diag(λ₁, ..., λ_p),   Λ̲ = diag(λ_{p+1}, ..., λ_d).    (6.3)

Although the assignments to Λ̄ and Λ̲ are not consistent with the extractions defined by (6.2), they
should not cause confusion in our later presentation.
For κ > 1, define S(κ) := {V ∈ R^{d×p} : σ(V̄) ⊂ [1/κ, 1]}. It can be verified that

    V ∈ S(κ)  ⇔  ‖T(V)‖₂ ≤ (κ² − 1)^{1/2}.    (6.4)
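The equivalence (6.4) can be traced to a one-line computation; a sketch, assuming V has orthonormal columns (as the iterates V^(n) do), so that V̄^T V̄ + V̲^T V̲ = I_p:

```latex
T(V)^T T(V) = \bar{V}^{-T}\,\underline{V}^T \underline{V}\,\bar{V}^{-1}
            = \bar{V}^{-T}\bigl(I_p - \bar{V}^T\bar{V}\bigr)\bar{V}^{-1}
            = \bar{V}^{-T}\bar{V}^{-1} - I_p,
\qquad\text{hence}\qquad
\|T(V)\|_2^2 = \sigma_{\min}(\bar{V})^{-2} - 1.
```

Since σ(V̄) ⊂ [0, 1] is automatic for such V, the constraint σ(V̄) ⊂ [1/κ, 1] reduces to σ_min(V̄) ≥ 1/κ, i.e., ‖T(V)‖₂ ≤ (κ² − 1)^{1/2}.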
For the sequence V^(n), define

    N_out{S(κ)} := min{n : V^(n) ∉ S(κ)},   N_in{S(κ)} := min{n : V^(n) ∈ S(κ)}.

N_out{S(κ)} is the first step of the iterative process at which V^(n) jumps from S(κ) to outside, and
N_in{S(κ)} is the first step of the iterative process at which V^(n) jumps from outside to S(κ). Also,
define

    λ̃ := λ₁ β^{−2ε},   N_qb{λ̃} := min{n ≥ 1 : max{‖Y^(n)‖_∞, ‖Z^(n)‖_∞} ≥ λ̃^{1/2}}.

N_qb{λ̃} is the first step of the iterative process at which an entry of Y^(n) or Z^(n) exceeds λ̃^{1/2}
in magnitude. For n < N_qb{λ̃}, we have

    ‖Y^(n)‖₂ < (d λ̃)^{1/2},   ‖Z^(n)‖₂ < (p λ̃)^{1/2}.

Moreover, under (A-3′), ‖Y^(n)‖₂ < (νp λ̃)^{1/2}.
For convenience, we will set T^(n) = T(V^(n)), and let F_n = σ{Y^(1), ..., Y^(n)} be the σ-algebra
filtration, i.e., the information known by step n. Also, since ε and β are fixed in this section, we
suppress the dependency of M(ε) on ε and of N_s(β) on β and simply write M = M(ε), N_s = N_s(β).

Lastly, we discuss some important implications of the condition

    0 < β < min{ 1, ((√2 − 1)/(νpλ₁))^{1/(1−2ε)}, (1/(8κpλ₁))^{2/(1−4ε)}, (γ/(150 ν^{1/2} κ² p² λ₁²))^{1/ε} }.    (4.6b)

It guarantees that

(β-1) β < 1 and βγ = β^{1/2+2ε} · β^{1/2−2ε} (λ_p − λ_{p+1}) ≤ β^{1/2+2ε} · (1/(8κpλ₁)) λ₁ = β^{1/2+2ε}/(8κp) ≤ 1/(16√3);

(β-2) dβ λ̃ ≤ β^{2ε} ω λ₁ β^{−2ε} ≤ ω λ₁;

(β-3) pβ λ̃ ≤ νpβ λ̃ = νp λ₁ β^{1−2ε} ≤ √2 − 1.

Set

    C_V = 5/2 + (7/2)(νp λ̃ β) + (15/8)(νp λ̃ β)² + (3/8)(νp λ̃ β)³ ≤ (16 + 13√2)/8 ≈ 4.298;    (6.5a)

    C_Δ = 2 + (1/2)(νp λ̃ β) + C_V p λ̃ β ≤ (22 + 7√2)/8 ≈ 3.987;    (6.5b)

    C_T = C_V + 2C_Δ + 2C_Δ C_V p λ̃ β ≤ (251 + 122√2)/16 ≈ 26.471;    (6.5c)

    C_κ = (3 − √2) C_Δ² / (64(C_T + 2C_Δ)) ≤ (565 + 171√2)/21504 ≈ 0.038;    (6.5d)

    C_◦ = 4√2 C_T C_κ ≤ (223702 + 183539√2)/86016 ≈ 5.618;    (6.5e)

    C_⋆ = (29 + 8√2)/32 + [ (29 + 8√2)/(16(3 − √2)) + 4C_T/((3 − √2) C_Δ) ] β^{1/2−3ε}
          + [ C_T + 3C_T²/(2C_Δ²) ] β^{1/2+3ε} + (2C_T/C_Δ) β^{1−3ε} + ( C_T²/(2(3 − √2) C_Δ²) ) β^{3/2−3ε}
        ≤ (2582968 + 1645155√2)/14336 ≈ 342.464.    (6.5f)

The condition (4.6b) also guarantees that

(β-4) 2C_Δ p λ̃ β^{1/2} κ = 2C_Δ p λ₁ β^{1/2−2ε} κ ≤ 2C_Δ/8 < 1, and thus 2C_Δ p λ̃ β κ < 1;

(β-5) 4√2 C_T ν^{1/2} κ² p² λ̃² γ^{−1} β^{5ε} ≤ 1, and thus 4√2 C_T ν^{1/2} κ² p² λ̃² γ^{−1} β^{1/2+χ} ≤ 1 for χ ∈ [−1/2 + 5ε, 0].

6.2  Increments of One Iteration

Lemma 6.1. For any fixed K ≥ 1,

    P{ N_qb{λ̃} > K } ≥ 1 − eK(d + p) exp(−C_ψ ψ^{−1} β^{−2ε}),

where C_ψ is an absolute constant.

Proof. Since

    {N_qb{λ̃} ≤ K} ⊂ ⋃_{n≤K} ( {‖Y^(n)‖_∞ ≥ λ̃^{1/2}} ∪ {‖Z^(n)‖_∞ ≥ λ̃^{1/2}} )
    = ⋃_{n≤K} ( ⋃_{1≤i≤d} {|e_i^T Y^(n)| ≥ λ̃^{1/2}} ∪ ⋃_{1≤j≤p} {|e_j^T Z^(n)| ≥ λ̃^{1/2}} )
    = ⋃_{n≤K} ( ⋃_{1≤i≤d} {|e_i^T Y^(n)| ≥ λ̃^{1/2}} ∪ ⋃_{1≤j≤p} {|(V^(n−1) e_j)^T Y^(n)| ≥ λ̃^{1/2}} ),

we know

    P{N_qb{λ̃} ≤ K} ≤ Σ_{n≤K} [ Σ_{1≤i≤d} P{|e_i^T Y^(n)| ≥ λ̃^{1/2}} + Σ_{1≤j≤p} P{|(V^(n−1) e_j)^T Y^(n)| ≥ λ̃^{1/2}} ]
    = Σ_{n≤K} Σ_{1≤i≤d} P{ |(Λ^{1/2} e_i)^T Λ^{−1/2} Y^(n)| / ‖Λ^{1/2} e_i‖₂ ≥ λ̃^{1/2} / ‖Λ^{1/2} e_i‖₂ }
      + Σ_{n≤K} Σ_{1≤j≤p} P{ |(Λ^{1/2} V^(n−1) e_j)^T Λ^{−1/2} Y^(n)| / ‖Λ^{1/2} V^(n−1) e_j‖₂ ≥ λ̃^{1/2} / ‖Λ^{1/2} V^(n−1) e_j‖₂ }
    ≤ Σ_{n≤K} Σ_{1≤i≤d} exp( 1 − C_{ψ,i} (λ̃ / (e_i^T Λ e_i)) / ‖ (Λ^{1/2} e_i)^T Λ^{−1/2} Y^(n) / ‖Λ^{1/2} e_i‖₂ ‖_{ψ₂} )    (by [20, (5.10)])
      + Σ_{n≤K} Σ_{1≤j≤p} exp( 1 − C_{ψ,d+j} (λ̃ / ((V^(n−1) e_j)^T Λ V^(n−1) e_j)) / ‖ (Λ^{1/2} V^(n−1) e_j)^T Λ^{−1/2} Y^(n) / ‖Λ^{1/2} V^(n−1) e_j‖₂ ‖_{ψ₂} )
    ≤ eK(d + p) exp( − min_{1≤i≤d+p} C_{ψ,i} λ̃ / (‖Λ^{−1/2} Y^(n)‖_{ψ₂} ‖Λ‖₂) )
    ≤ eK(d + p) exp( −C_ψ ψ^{−1} β^{−2ε} ),

where C_{ψ,i}, i = 1, ..., d + p, are absolute constants [20, (5.10)], and C_ψ = min_{1≤i≤d+p} C_{ψ,i}.
Lemma 6.2. Suppose that the conditions of Theorem 4.1 hold. If n < N_qb{λ̃}, then

    V^(n+1) = V^(n) + β Y^(n+1) (Z^(n+1))^T
              − β [1 + (β/2)(Y^(n+1))^T Y^(n+1)] V^(n) Z^(n+1) (Z^(n+1))^T + R^(n) (Z^(n+1))^T,    (6.6)

where R^(n) ∈ R^d is a random vector with ‖R^(n)‖₂ ≤ C_V ν^{1/2} (p λ̃)^{3/2} β², and C_V is as in (6.5a).

Proof. To avoid cluttered superscripts, in this proof we drop the superscript “·^(n)” and use the
superscript “·⁺” in place of “·^(n+1)” on V, and drop the superscript “·^(n+1)” on Y, Z.
On {N_qb{λ̃} > n}, by (4.7) and (β-3), we have

    α = β(2 + β Y^T Y) Z^T Z ≤ β(2 + νp λ̃ β) p λ̃ ≤ (2 + √2 − 1)(√2 − 1)/ν ≤ 1.

By Taylor's expansion, there exists 0 < ξ < α such that

    (1 + α)^{−1/2} = 1 − (1/2) α + (3/8)(1 + ξ)^{−5/2} α²
                   = 1 − β Z^T Z − (β²/2) Y^T Y · Z^T Z + β² (Z^T Z)² ζ,

where ζ = (3/8)(1 + ξ)^{−5/2} (2 + β Y^T Y)² ≤ (3/8)(2 + νpβ λ̃)². Thus

    V⁺ = (V + β Y Z^T) [ I − ( β Z^T Z + (β²/2) Y^T Y · Z^T Z − β² (Z^T Z)² ζ ) ZZ^T/(Z^T Z) ]
       = V + β Y Z^T − β V ZZ^T − (β²/2)(Y^T Y) V ZZ^T + R Z^T,

where

    R = −(β²/2)(Z^T Z)(2 + β Y^T Y) Y + ζ β² (Z^T Z) V Z + ζ β³ (Z^T Z)² Y,

for which

    ‖R‖₂ ≤ (β²/2)(p λ̃)(2 + β νp λ̃)(νp λ̃)^{1/2} + ζ β² (p λ̃)^{3/2} + ζ β³ (p λ̃)² (νp λ̃)^{1/2}
         ≤ [ (1/2)(2 + β νp λ̃) + (3/8)(2 + β νp λ̃)² + (3/8)(2 + β νp λ̃)² (β p λ̃) ] ν^{1/2} (p λ̃)^{3/2} β²
         ≤ C_V ν^{1/2} (p λ̃)^{3/2} β²,

as expected.
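The size of the remainder in (6.6) can be checked numerically: the proof rewrites the normalized update as V⁺ = W (W^T W)^{−1/2} with W = V + β Y Z^T, and (6.6) says the gap between this and the first three terms is O(β²). A minimal sketch (sizes and seed are arbitrary); halving β should roughly quarter the discrepancy:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 8, 2
V, _ = np.linalg.qr(rng.standard_normal((d, p)))   # V with orthonormal columns
Y = rng.standard_normal(d)

def exact_step(V, Y, beta):
    Z = V.T @ Y
    W = V + beta * np.outer(Y, Z)
    G = W.T @ W                       # = I_p + beta*(2 + beta*Y^T Y) * Z Z^T
    w, Q = np.linalg.eigh(G)
    return W @ (Q * w**-0.5) @ Q.T    # W (W^T W)^{-1/2}

def expansion_step(V, Y, beta):
    # right-hand side of (6.6) without the remainder term R^{(n)} Z^T
    Z = V.T @ Y
    return (V + beta * np.outer(Y, Z)
              - beta * (1.0 + 0.5 * beta * (Y @ Y)) * np.outer(V @ Z, Z))

def remainder(beta):
    return np.linalg.norm(exact_step(V, Y, beta) - expansion_step(V, Y, beta))

r1, r2 = remainder(1e-2), remainder(5e-3)
ratio = r1 / r2      # close to 4 for an O(beta^2) remainder
assert 3.5 < ratio < 4.5
```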

Lemma 6.3. Suppose that the conditions of Theorem 4.1 hold. Let τ = ‖T^(n)‖₂, and let C_T be as in
(6.5c). If n < min{ N_qb{λ̃}, N_out{S(κ)} }, then the following statements hold.

1. T^(n) and T^(n+1) are well-defined.

2. Define E_T^(n)(V^(n)) by E{ T^(n+1) − T^(n) | F_n } = β(Λ̲ T^(n) − T^(n) Λ̄) + E_T^(n)(V^(n)). Then

   (a) sup_{V ∈ S(κ)} ‖E_T^(n)(V)‖₂ ≤ C_T ν^{1/2} (p λ̃ β)² (1 + τ²)^{3/2};

   (b) ‖T^(n+1) − T^(n)‖₂ ≤ ν^{1/2} (p λ̃ β)(1 + τ²) + C_T ν^{1/2} (p λ̃ β)² (1 + τ²)^{3/2}.

3. Define R_◦ by var_◦( T^(n+1) − T^(n) | F_n ) = β² H_◦ + R_◦. Then

   (a) H_◦ = var_◦(Y̲ Ȳ^T) ≤ 16 ψ⁴ H, where H = [η_ij]_{(d−p)×p} with η_ij = λ_{p+i} λ_j for i = 1, ..., d − p,
       j = 1, ..., p;

   (b) ‖R_◦‖₂ ≤ (νp λ̃ β)² τ (1 + (11/2) τ + τ² + (1/4) τ³) + 4 C_T ν (p λ̃ β)³ (1 + τ²)^{5/2} + 2 C_T² ν (p λ̃ β)⁴ (1 + τ²)³.

Proof. For readability, we will drop the superscript “·^(n)”, use the superscript “·⁺” in place of
“·^(n+1)” for V, R, drop the superscript “·^(n+1)” on Y, Z, and drop the conditional sign “| F_n” in the
computation of E{·}, var(·), cov(·), with the understanding that they are conditional with respect
to F_n. Finally, for any expression or variable F, we define ∆F := F⁺ − F.
Consider item 1. Since n < N_out{S(κ)}, we have V ∈ S(κ) and τ = ‖T‖₂ ≤ (κ² − 1)^{1/2}. Thus
‖V̄^{−1}‖₂ ≤ κ and T = V̲ V̄^{−1} is well-defined. Recall (6.6) and the partitioning

    Y = [Ȳ; Y̲],   R = [R̄; R̲],   where Ȳ, R̄ ∈ R^p and Y̲, R̲ ∈ R^{d−p}.

We have ∆V̄ = β( Ȳ Z^T − (1 + (β/2) Y^T Y) V̄ ZZ^T ) + R̄ Z^T, and

    R̄ = −(β²/2)(Z^T Z)(2 + β Y^T Y) Ȳ + ζ β² (Z^T Z) V̄ Z + ζ β³ (Z^T Z)² Ȳ.

Noticing ‖Ȳ‖₂ ≤ (p λ̃)^{1/2}, we find

    ‖∆V̄‖₂ ≤ β (p λ̃) + β (1 + (β/2) νp λ̃) p λ̃ + C_V (p λ̃)² β²
           ≤ ( 2 + (β/2) νp λ̃ + C_V p λ̃ β ) p λ̃ β = C_Δ p λ̃ β,

where C_Δ is as in (6.5b). Thus ‖∆V̄ V̄^{−1}‖₂ ≤ ‖∆V̄‖₂ ‖V̄^{−1}‖₂ ≤ C_Δ p λ̃ β κ ≤ 1/2 by (β-4). As a
result, V̄⁺ is nonsingular, and

    ‖(V̄⁺)^{−1}‖₂ ≤ ‖V̄^{−1}‖₂ / (1 − ‖V̄^{−1} ∆V̄‖₂) ≤ 2 ‖V̄^{−1}‖₂.

In particular, T⁺ = V̲⁺ (V̄⁺)^{−1} is well-defined. This proves item 1.
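The remainder of the proof repeatedly uses three structural identities for the auxiliary matrices T_L = [−T I] and T_R = [I; T]; a quick numerical sanity check (sizes arbitrary; the last identity relies on the orthonormality of the columns of V):

```python
import numpy as np

rng = np.random.default_rng(2)
d, p = 7, 3
V, _ = np.linalg.qr(rng.standard_normal((d, p)))   # orthonormal columns
Vbar, Vund = V[:p, :], V[p:, :]                    # V̄ = V_(1:p,:), V̲ = V_(p+1:d,:)
T = Vund @ np.linalg.inv(Vbar)                     # T = T(V) = V̲ V̄^{-1}

TL = np.hstack([-T, np.eye(d - p)])                # T_L = [-T  I]
TR = np.vstack([np.eye(p), T])                     # T_R = [I; T]

assert np.allclose(TL @ V, 0)                      # T_L V = 0
assert np.allclose(TR @ Vbar, V)                   # V = T_R V̄
assert np.allclose(V.T @ TR, np.linalg.inv(Vbar))  # V̄^{-1} = V^T T_R
```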
For item 2, using the Sherman-Morrison-Woodbury formula [5, p. 95], we get

    ∆T = V̲⁺ (V̄⁺)^{−1} − V̲ V̄^{−1}
       = (V̲ + ∆V̲)(V̄ + ∆V̄)^{−1} − V̲ V̄^{−1}
       = (V̲ + ∆V̲)( V̄^{−1} − V̄^{−1} ∆V̄ (V̄ + ∆V̄)^{−1} ) − V̲ V̄^{−1}
       = ∆V̲ V̄^{−1} − V̲ V̄^{−1} ∆V̄ (V̄ + ∆V̄)^{−1} − ∆V̲ V̄^{−1} ∆V̄ (V̄ + ∆V̄)^{−1}
       = ∆V̲ V̄^{−1} − V̲ V̄^{−1} ∆V̄ ( V̄^{−1} − V̄^{−1} ∆V̄ (V̄ + ∆V̄)^{−1} ) − ∆V̲ V̄^{−1} ∆V̄ (V̄ + ∆V̄)^{−1}
       = ∆V̲ V̄^{−1} − T ∆V̄ V̄^{−1} + T ∆V̄ V̄^{−1} ∆V̄ (V̄⁺)^{−1} − ∆V̲ V̄^{−1} ∆V̄ (V̄⁺)^{−1}
       = [∆V̲ − T ∆V̄][ I − (V̄⁺)^{−1} ∆V̄ ] V̄^{−1}.

Write T_L = [−T I] and T_R = [I; T]; then T_L V = 0 and V = T_R V̄, and, by the orthonormality of the
columns of V, V̄^{−1} = V^T T_R. Thus,

    ∆T = T_L ∆V [ I − (V̄⁺)^{−1} ∆V̄ ] V^T T_R.    (6.7)

Since ∆V is rank-one (every term of (6.6) carries Z^T on the right), ∆T is also rank-one. By Lemma 6.2,

    ∆T = T_L [ β Y Z^T − β(1 + (β/2) Y^T Y) V ZZ^T + R Z^T ][ I − (V̄⁺)^{−1} ∆V̄ ] V^T T_R
       = T_L (β Y Y^T V + R Z^T)[ I − (V̄⁺)^{−1} ∆V̄ ] V^T T_R    (using T_L V = 0 and Y Z^T = Y Y^T V)
       = T_L (β Y Y^T V V^T + R_T) T_R
       = T_L (β Y Y^T + R_T) T_R    (since V V^T T_R = V V̄^{−1} = T_R),

where R_T = R Z^T V^T − (β Y + R) Z^T (V̄⁺)^{−1} ∆V̄ V^T.
Note that

    T_L Y Y^T T_R = Y̲ Ȳ^T − T Ȳ Y̲^T T − T Ȳ Ȳ^T + Y̲ Y̲^T T,    (6.8)

and

    E{Y̲ Ȳ^T} = 0,   E{T Ȳ Ȳ^T} = T E{Ȳ Ȳ^T} = T Λ̄,    (6.9a)
    E{T Ȳ Y̲^T T} = T E{Ȳ Y̲^T} T = 0,   E{Y̲ Y̲^T T} = E{Y̲ Y̲^T} T = Λ̲ T.    (6.9b)

Thus, E{∆T} = β(Λ̲ T − T Λ̄) + E_T(V), where E_T(V) = E{T_L R_T T_R}.
Since V ∈ S(κ), ‖T‖₂ ≤ (κ² − 1)^{1/2} by (6.4). Thus

    ‖R_T‖₂ ≤ ‖R‖₂ (p λ̃)^{1/2} + [ (νp λ̃)^{1/2} β + ‖R‖₂ ] (p λ̃)^{1/2} · 2(1 + ‖T‖₂²)^{1/2} C_Δ p λ̃ β
           ≤ C_V ν^{1/2} (p λ̃)² β² + (1 + ‖T‖₂²)^{1/2} [ 1 + C_V p λ̃ β ] 2 C_Δ ν^{1/2} (p λ̃)² β²    (6.10)
           ≤ C_T ν^{1/2} (p λ̃ β)² (1 + ‖T‖₂²)^{1/2},

where C_T = C_V + 2C_Δ (1 + C_V p λ̃ β). Therefore,

    ‖E_T(V)‖₂ ≤ E{ ‖T_L R_T T_R‖₂ } ≤ (1 + ‖T‖₂²) E{ ‖R_T‖₂ }.

Item 2(a) holds. For item 2(b), we have

    ‖∆T‖₂ ≤ (1 + ‖T‖₂²)( β ‖Y Y^T V V^T‖₂ + ‖R_T‖₂ )
          ≤ β (νp λ̃)^{1/2} (p λ̃)^{1/2} (1 + ‖T‖₂²) + C_T ν^{1/2} (p λ̃ β)² (1 + ‖T‖₂²)^{3/2}
          ≤ ν^{1/2} p λ̃ β (1 + ‖T‖₂²) + C_T ν^{1/2} (p λ̃ β)² (1 + ‖T‖₂²)^{3/2}.
Now we turn to item 3. We have

    var_◦(∆T) = var_◦( T_L (β Y Y^T + R_T) T_R ) = β² var_◦( T_L Y Y^T T_R ) + 2β R_◦,1 + R_◦,2,    (6.11)

where R_◦,1 = cov_◦( T_L Y Y^T T_R, T_L R_T T_R ) and R_◦,2 = var_◦( T_L R_T T_R ). By (6.8),

    var_◦( T_L Y Y^T T_R ) = var_◦( Y̲ Ȳ^T ) + R_◦,0,    (6.12)

where

    R_◦,0 = var_◦(T Ȳ Y̲^T T) + var_◦(T Ȳ Ȳ^T) + var_◦(Y̲ Y̲^T T)
            − 2 cov_◦(Y̲ Ȳ^T, T Ȳ Y̲^T T) − 2 cov_◦(Y̲ Ȳ^T, T Ȳ Ȳ^T) + 2 cov_◦(Y̲ Ȳ^T, Y̲ Y̲^T T)
            + 2 cov_◦(T Ȳ Y̲^T T, T Ȳ Ȳ^T) − 2 cov_◦(T Ȳ Y̲^T T, Y̲ Y̲^T T) − 2 cov_◦(T Ȳ Ȳ^T, Y̲ Y̲^T T).

Examining (6.11) and (6.12) together gives H_◦ = var_◦(Y̲ Ȳ^T) and R_◦ = β² R_◦,0 + 2β R_◦,1 + R_◦,2.
We note

    Y_j = e_j^T Y = e_j^T Λ^{1/2} Λ^{−1/2} Y = λ_j^{1/2} e_j^T Λ^{−1/2} Y,

and

    e_i^T var_◦(Y̲ Ȳ^T) e_j = var( e_i^T Y̲ Ȳ^T e_j ) = var( Y_{p+i} Y_j ) ≤ E{ Y_{p+i}² Y_j² }.

By [20, (5.11)],

    E{Y_j⁴} = λ_j² E{ (e_j^T Λ^{−1/2} Y)⁴ } ≤ 16 λ_j² ‖e_j^T Λ^{−1/2} Y‖_{ψ₂}⁴ ≤ 16 λ_j² ‖Λ^{−1/2} Y‖_{ψ₂}⁴ = 16 λ_j² ψ⁴.

Therefore

    e_i^T var_◦(Y̲ Ȳ^T) e_j ≤ [ E{Y_{p+i}⁴} E{Y_j⁴} ]^{1/2} ≤ 16 λ_{p+i} λ_j ψ⁴,

i.e., H_◦ = var_◦(Y̲ Ȳ^T) ≤ 16 ψ⁴ H. This proves item 3(a). To show item 3(b), first we bound the
entrywise variances and covariances. For any matrices A₁, A₂ of the same size, by Schur's inequality
(which was generalized to all unitarily invariant norms in [6, Theorem 3.1]),

    ‖A₁ ∘ A₂‖₂ ≤ ‖A₁‖₂ ‖A₂‖₂,    (6.13)

we have

    ‖cov_◦(A₁, A₂)‖₂ = ‖E{A₁ ∘ A₂} − E{A₁} ∘ E{A₂}‖₂
                      ≤ E{ ‖A₁ ∘ A₂‖₂ } + ‖E{A₁} ∘ E{A₂}‖₂
                      ≤ E{ ‖A₁‖₂ ‖A₂‖₂ } + ‖E{A₁}‖₂ ‖E{A₂}‖₂,    (6.14a)

    ‖var_◦(A₁)‖₂ ≤ E{ ‖A₁‖₂² } + ‖E{A₁}‖₂².    (6.14b)
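Schur's inequality (6.13) for the spectral norm is easy to spot-check numerically; a minimal sketch over random rectangular matrices:

```python
import numpy as np

rng = np.random.default_rng(3)

def specnorm(A):
    return np.linalg.norm(A, 2)      # spectral norm

for _ in range(100):
    m, n = rng.integers(2, 6, size=2)
    A1 = rng.standard_normal((m, n))
    A2 = rng.standard_normal((m, n))
    # the spectral norm is submultiplicative with respect to
    # the Hadamard (entrywise) product
    assert specnorm(A1 * A2) <= specnorm(A1) * specnorm(A2) + 1e-12
```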

Apply (6.14) to R_◦,1 and R_◦,2 to get

    ‖R_◦,1‖₂ ≤ 2 C_T ν (p λ̃)³ β² (1 + ‖T‖₂²)^{5/2},   ‖R_◦,2‖₂ ≤ 2 C_T² ν (p λ̃ β)⁴ (1 + ‖T‖₂²)³,

upon using

    ‖T_L Y Y^T T_R‖₂ = ‖T_L Y Y^T V V^T T_R‖₂ ≤ ν^{1/2} (p λ̃)(1 + ‖T‖₂²),
    ‖T_L R_T T_R‖₂ ≤ C_T ν^{1/2} (p λ̃ β)² (1 + ‖T‖₂²)^{3/2}.
For R_◦,0, by (6.9), we have

    ‖cov_◦(Y̲ Ȳ^T, T Ȳ Y̲^T T)‖₂ ≤ E{ ‖Y̲ Ȳ^T‖₂² } ‖T‖₂²,
    ‖cov_◦(Y̲ Ȳ^T, T Ȳ Ȳ^T)‖₂ ≤ E{ ‖Y̲ Ȳ^T‖₂ ‖Ȳ Ȳ^T‖₂ } ‖T‖₂,
    ‖cov_◦(Y̲ Ȳ^T, Y̲ Y̲^T T)‖₂ ≤ E{ ‖Y̲ Ȳ^T‖₂ ‖Y̲ Y̲^T‖₂ } ‖T‖₂,
    ‖cov_◦(T Ȳ Y̲^T T, T Ȳ Ȳ^T)‖₂ ≤ E{ ‖Y̲ Ȳ^T‖₂ ‖Ȳ Ȳ^T‖₂ } ‖T‖₂³,
    ‖cov_◦(T Ȳ Y̲^T T, Y̲ Y̲^T T)‖₂ ≤ E{ ‖Y̲ Ȳ^T‖₂ ‖Y̲ Y̲^T‖₂ } ‖T‖₂³,    (6.15)
    ‖var_◦(T Ȳ Y̲^T T)‖₂ ≤ E{ ‖Y̲ Ȳ^T‖₂² } ‖T‖₂⁴,
    ‖var_◦(T Ȳ Ȳ^T)‖₂ ≤ E{ ‖Ȳ Ȳ^T‖₂² } ‖T‖₂² + ‖T Λ̄‖₂²,
    ‖var_◦(Y̲ Y̲^T T)‖₂ ≤ E{ ‖Y̲ Y̲^T‖₂² } ‖T‖₂² + ‖Λ̲ T‖₂²,
    ‖cov_◦(T Ȳ Ȳ^T, Y̲ Y̲^T T)‖₂ ≤ E{ ‖Ȳ Ȳ^T‖₂ ‖Y̲ Y̲^T‖₂ } ‖T‖₂² + ‖T Λ̄‖₂ ‖Λ̲ T‖₂.

Since

    ‖Ȳ Ȳ^T‖₂ + ‖Y̲ Y̲^T‖₂ = Ȳ^T Ȳ + Y̲^T Y̲ = Y^T Y ≤ νp λ̃,
    ‖Y̲ Ȳ^T‖₂ = (Ȳ^T Ȳ)^{1/2} (Y̲^T Y̲)^{1/2} ≤ (Ȳ^T Ȳ + Y̲^T Y̲)/2 ≤ νp λ̃ / 2,

we have

    ‖R_◦,0‖₂ ≤ E{ [ 2 ‖Y̲ Ȳ^T‖₂² + ( ‖Ȳ Ȳ^T‖₂ + ‖Y̲ Y̲^T‖₂ )² ] ‖T‖₂² } + ( ‖T Λ̄‖₂ + ‖Λ̲ T‖₂ )²
               + 2 E{ ‖Y̲ Ȳ^T‖₂ ( ‖Ȳ Ȳ^T‖₂ + ‖Y̲ Y̲^T‖₂ ) } ( ‖T‖₂ + ‖T‖₂³ ) + E{ ‖Y̲ Ȳ^T‖₂² } ‖T‖₂⁴
             ≤ (νp λ̃)² ‖T‖₂ + [ (3/2)(νp λ̃)² + (λ₁ + λ_{p+1})² ] ‖T‖₂² + (νp λ̃)² ‖T‖₂³ + (1/4)(νp λ̃)² ‖T‖₂⁴
             ≤ (νp λ̃)² ‖T‖₂ ( 1 + (11/2) ‖T‖₂ + ‖T‖₂² + (1/4) ‖T‖₂³ ).    (6.16)

Finally, collecting (6.15) and (6.16) yields the desired bound on R_◦ = β² R_◦,0 + 2β R_◦,1 + R_◦,2.

6.3  Quasi-power iteration process

Define D^(n+1) = T^(n+1) − E{ T^(n+1) | F_n }. It can be seen that

    T^(n) − E{ T^(n) | F_n } = 0,   E{ D^(n+1) | F_n } = 0,
    E{ D^(n+1) ∘ D^(n+1) | F_n } = var_◦( T^(n+1) − T^(n) | F_n ).

By item 2 of Lemma 6.3, we have

    T^(n+1) = D^(n+1) + T^(n) + E{ T^(n+1) − T^(n) | F_n }
            = D^(n+1) + T^(n) + β(Λ̲ T^(n) − T^(n) Λ̄) + E_T^(n)(V^(n))
            = L T^(n) + D^(n+1) + E_T^(n)(V^(n)),

where L : T ↦ T + β Λ̲ T − β T Λ̄ is a bounded linear operator. It can be verified that L T = L ∘ T,
the Hadamard product of L and T, where L = [λ_ij]_{(d−p)×p} with λ_ij = 1 + β λ_{p+i} − β λ_j. Moreover,
it can be shown that³ ‖L‖_ui = ρ(L) = 1 − βγ, where ‖L‖_ui = sup_{‖T‖_ui = 1} ‖L T‖_ui is the operator
norm induced by the matrix norm ‖·‖_ui. Recursively,

    T^(n) = Lⁿ T^(0) + Σ_{s=1}^{n} L^{n−s} D^(s) + Σ_{s=1}^{n} L^{n−s} E_T^{(s−1)}(V^(s−1)) =: J₁ + J₂ + J₃.
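Because L acts entrywise, its spectral radius is realized by the largest entry of L in magnitude, and Lⁿ T^(0) contracts at the rate (1 − βγ)ⁿ whenever λ_p > λ_{p+1}. A minimal numerical sketch (the eigenvalues below are arbitrary illustrative values):

```python
import numpy as np

lam = np.array([5.0, 4.0, 2.0, 1.0, 0.5])   # λ_1 ≥ ... ≥ λ_d with a gap after p
p, d = 2, 5
beta = 0.01
gamma = lam[p - 1] - lam[p]                  # γ = λ_p − λ_{p+1}

# entries λ_ij = 1 + β λ_{p+i} − β λ_j, so that L T = L ∘ T (Hadamard product)
L = 1.0 + beta * lam[p:, None] - beta * lam[None, :p]

# the spectral radius of the entrywise operator is the largest |entry|
assert np.isclose(np.abs(L).max(), 1 - beta * gamma)

# iterating contracts every entry of T at least at the rate (1 − βγ)^n
T = np.ones((d - p, p))
for _ in range(50):
    T = L * T
assert np.abs(T).max() <= (1 - beta * gamma) ** 50 + 1e-12
```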

Define events M_n(χ), T_n(χ), and Q_n as

    M_n(χ) = { ‖T^(n) − Lⁿ T^(0)‖₂ ≤ (1/2)(κ² β^{2χ−1} − 1)^{1/2} β^{χ−3ε} },
    T_n(χ) = { ‖T^(n)‖₂ ≤ (κ² β^{2χ−1} − 1)^{1/2} β^{χ−3ε} },   Q_n = { n < N_qb{λ̃} }.

³ Since λ(L) = {λ_ij : i = 1, ..., d − p, j = 1, ..., p}, we have ρ(L) = 1 − β(λ_p − λ_{p+1}). Thus, for any T,
‖L T‖_ui = ‖T(I − β Λ̄) + β Λ̲ T‖_ui