Neon2: Finding Local Minima via First-Order Oracles
(version 1)∗
arXiv:1711.06673v1 [cs.LG] 17 Nov 2017
Zeyuan Allen-Zhu
[email protected]
Microsoft Research
Yuanzhi Li
[email protected]
Princeton University
November 17, 2017
Abstract
We propose a reduction for non-convex optimization that can (1) turn a stationary-point
finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the
deterministic settings, without hurting the algorithm’s performance.
As applications, our reduction turns Natasha2 into a first-order method without hurting
its performance. It also converts SGD, GD, SCSG, and SVRG into local-minimum finding
algorithms outperforming some best known results.
1 Introduction
Nonconvex optimization has become increasingly popular due to its ability to capture modern machine learning tasks at large scale. Most notably, training deep neural networks corresponds to minimizing a function f(x) = (1/n) Σ_{i=1}^n fi(x) over x ∈ Rd that is non-convex, where each training sample i corresponds to one loss function fi(·) in the summation. This average structure allows one to perform stochastic gradient descent (SGD), which uses a random ∇fi(x) —corresponding to computing backpropagation once— to approximate ∇f(x) and performs descent updates.
Motivated by such large-scale machine learning applications, we wish to design faster first-order non-convex optimization methods that outperform gradient descent, both in the online and offline settings. In this paper, we say an algorithm is online if its complexity is independent of n (so n can be infinite), and offline otherwise. In recent years, researchers across different communities have gathered together to tackle this challenging question. By far, known theoretical approaches mostly fall into one of the following two categories.
First-order methods for stationary points. In analyzing first-order methods, we denote by gradient complexity T the number of computations of ∇fi(x). To achieve an ε-approximate stationary point —namely, a point x with ‖∇f(x)‖ ≤ ε— it is folklore that gradient descent (GD) is offline and needs T ∝ O(n/ε²), while stochastic gradient descent (SGD) is online and needs T ∝ O(1/ε⁴). In recent years, the offline complexity has been improved to T ∝ O(n^{2/3}/ε²) by the SVRG method [3, 23], and the online complexity has been improved to T ∝ O(1/ε^{10/3}) by the SCSG method [18]. Both of them rely on the so-called variance-reduction technique, originally discovered for convex problems [11, 16, 24, 26].

∗ The result of this paper was briefly discussed at a Berkeley Simons workshop on Oct 6 and internally presented at Microsoft on Oct 30. We started to prepare this manuscript on Nov 11, after being informed of the independent and similar work of Xu and Yang [28]. Their result appeared on arXiv on Nov 3. To respect the fact that their work appeared online before us, we have adopted their algorithm name Neon and called our new algorithm Neon2. We encourage readers citing this work to also cite [28].
These algorithms SVRG and SCSG are only capable of finding stationary points, which may
not necessarily be approximate local minima and are arguably bad solutions for neural-network
training [9, 10, 14]. Therefore,
can we turn stationary-point finding algorithms into local-minimum finding ones?
Hessian-vector methods for local minima. It is common knowledge that using information about the Hessian, one can find ε-approximate local minima —namely, a point x with ‖∇f(x)‖ ≤ ε and also ∇²f(x) ⪰ −ε^{1/C}·I.¹ In 2006, Nesterov and Polyak [20] showed that one can find an ε-approximate local minimum in O(1/ε^{1.5}) iterations, but each iteration requires an (offline) computation as heavy as inverting the matrix ∇²f(x).
To fix this issue, researchers propose to study the so-called “Hessian-free” methods that, in
addition to gradient computations, also compute Hessian-vector products. That is, instead of using
the full matrix ∇2 fi (x) or ∇2 f (x), these methods also compute ∇2 fi (x) · v for indices i and vectors
v.2 For Hessian-free methods, we denote by gradient complexity T the number of computations
of ∇fi (x) plus that of ∇2 fi (x) · v. The hope of using Hessian-vector products is to improve the
complexity T as a function of ε.
Such improvement was first shown possible independently by [1, 7] for the offline setting, with complexity T ∝ n/ε^{1.5} + n^{3/4}/ε^{1.75}, which is better than that of gradient descent. In the online setting, the first improvement was by Natasha2, which gives complexity T ∝ 1/ε^{3.25} [2].
Unfortunately, it is argued by some researchers that Hessian-vector products are not general
enough and may not be as simple to implement as evaluating gradients [8]. Therefore,
can we turn Hessian-free methods into first-order ones, without hurting their performance?
1.1 From Hessian-Vector Products to First-Order Methods
Recall that by the definition of the derivative we have ∇²fi(x)·v = lim_{q→0} { (∇fi(x + qv) − ∇fi(x))/q }. Given any Hessian-free method, at least at a high level, can we replace every occurrence of ∇²fi(x)·v with w = (∇fi(x + qv) − ∇fi(x))/q for some small q > 0?
Note that the error introduced by this approximation is ‖∇²fi(x)·v − w‖ ∝ q‖v‖². Therefore, as long as the original algorithm is sufficiently stable to adversarial noise, and as long as q is small enough, this can convert Hessian-free algorithms into first-order ones.
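To make this concrete, here is a minimal numpy sketch (our own illustration, not from the paper) that forms the gradient-difference estimate w for a toy non-convex finite-sum objective and compares it against the exact Hessian-vector product; the error indeed shrinks roughly linearly in q:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20
A = rng.standard_normal((n, d))       # defines the toy objective f(x) = (1/n) sum_i cos(a_i . x)

def grad(x):
    # exact gradient of the toy objective
    return -(A * np.sin(A @ x)[:, None]).mean(axis=0)

def hessian(x):
    # exact Hessian, used only to measure the approximation error
    return -(A[:, :, None] * A[:, None, :] * np.cos(A @ x)[:, None, None]).mean(axis=0)

def hvp_fd(x, v, q):
    # gradient-difference approximation w = (grad f(x + q v) - grad f(x)) / q
    return (grad(x + q * v) - grad(x)) / q

x, v = rng.standard_normal(d), rng.standard_normal(d)
for q in (1e-1, 1e-2, 1e-3):
    err = np.linalg.norm(hvp_fd(x, v, q) - hessian(x) @ v)
    print(f"q={q:g}  error={err:.2e}")    # decreases roughly proportionally to q
```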
In this paper, we demonstrate this idea by converting negative-curvature-search (NC-search) subroutines into first-order processes. NC-search is a key subroutine used in state-of-the-art Hessian-free methods that have rigorous proofs (see [1, 2, 7]). It solves the following simple task:
negative-curvature search (NC-search): given a point x0, decide if ∇²f(x0) ⪰ −δI or find a unit vector v such that v^⊤∇²f(x0)v ≤ −δ/2.
Online Setting. In the online setting, NC-search can be solved by Oja's algorithm [21], which costs Õ(1/δ²) computations of Hessian-vector products (first proved in [6] and applied to NC-search in [2]).
¹ We say A ⪰ −δI if all the eigenvalues of A are no smaller than −δ. In this high-level introduction, we focus only on the case when δ = ε^{1/C} for some constant C.
² Hessian-free methods are useful because when fi(·) is explicitly given, computing its gradient is of the same complexity as computing its Hessian-vector product [22, 25], using backpropagation.
In this paper, we propose a method Neon2online which solves the NC-search problem via only
stochastic first-order updates. That is, starting from x1 = x0 + ξ where ξ is some random perturbation, we keep updating xt+1 = xt − η(∇fi (xt ) − ∇fi (x0 )). In the end, the vector xT − x0 gives
us enough information about the negative curvature.
Theorem 1 (informal). Our Neon2online algorithm solves NC-search using Õ(1/δ²) stochastic gradients, without Hessian-vector product computations.
This complexity Õ(1/δ²) matches that of Oja's algorithm, and is information-theoretically optimal (up to log factors); see the lower bound in [6].
We emphasize that the independent work Neon by Xu and Yang [28] is actually the first recorded theoretical result that proposed this approach. However, Neon needs Õ(1/δ³) stochastic gradients, because it uses full gradient descent (on a sub-sampled objective) to find negative curvature, inspired by [15] and the power method; instead, Neon2online uses stochastic gradients and is based on our prior work on Oja's algorithm [6].
By plugging Neon2online into Natasha2 [2], we achieve the following corollary (see Figure 1(c)):
Theorem 2 (informal). Neon2online turns Natasha2 into a stochastic first-order method, without hurting its performance. That is, it finds an (ε, δ)-approximate local minimum in T = Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵) stochastic gradient computations, without Hessian-vector product computations.
(We say x is an approximate local minimum if ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI.)
Offline Setting. There are a number of ways to solve the NC-search problem in the offline setting using Hessian-vector products. Most notably, the power method uses Õ(n/δ) computations of Hessian-vector products, the Lanczos method [17] uses Õ(n/√δ) computations, and shift-and-invert [12] on top of SVRG [26] (which we call SI+SVRG) uses Õ(n + n^{3/4}/√δ) computations.
In this paper, we convert the Lanczos method and SI+SVRG into first-order ones:
Theorem 3 (informal). Our Neon2det algorithm solves NC-search using Õ(1/√δ) full gradients (or equivalently Õ(n/√δ) stochastic gradients), and our Neon2svrg solves NC-search using Õ(n + n^{3/4}/√δ) stochastic gradients.
We emphasize that, although analyzed in the online setting only, the work Neon by Xu and Yang [28] also applies to the offline setting, and seems to be the first result to solve NC-search using first-order gradients with a theoretical proof. However, Neon uses Õ(1/δ) full gradients instead of Õ(1/√δ). Their approach is inspired by [15], but our Neon2det is based on Chebyshev approximation theory (see textbook [27]) and its recent stability analysis [5].
By putting Neon2det and Neon2svrg into the CDHS method of Carmon et al. [7], we have:³
Theorem 4 (informal). Neon2det turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in Õ(1/ε^{1.75} + 1/δ^{3.5}) full gradient computations. Neon2svrg turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in T = Õ(n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5}) stochastic gradient computations.
³ We note that the original paper of CDHS only proved such complexity results (although requiring Hessian-vector products) for the special case of δ ≥ ε^{1/2}. In such a case, it requires either Õ(1/ε^{1.75}) full gradient computations or Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}) stochastic gradient computations.
[Figure 1 appears here: three log-log plots of gradient complexity T against δ, comparing (a) Neon+SGD vs. Neon2+SGD, (b) Neon+SCSG vs. Neon2+SCSG, and (c) Neon+Natasha vs. Neon2+Natasha2.]
Figure 1: Neon vs Neon2 for finding (ε, δ)-approximate local minima. We emphasize that Neon2 and Neon are based
on the same high-level idea, but Neon is arguably the first-recorded result to turn stationary-point finding
algorithms (such as SGD, SCSG) into local-minimum finding ones, with theoretical proofs.
One should perhaps compare Neon2det to the interesting work "convex until guilty" by Carmon et al. [8]. Their method finds ε-approximate stationary points using Õ(1/ε^{1.75}) full gradients, and is arguably the first first-order method achieving a convergence rate better than the 1/ε² of GD. Unfortunately, it is unclear if their method guarantees local minima. In comparison, Neon2det on CDHS achieves the same complexity but guarantees its output to be an approximate local minimum.
Remark 1.1. All the cited works in this sub-section require the objective to have (1) a Lipschitz-continuous Hessian (a.k.a. second-order smoothness) and (2) a Lipschitz-continuous gradient (a.k.a. Lipschitz smoothness). One can argue that (1) and (2) are both necessary for finding approximate local minima, while for finding approximate stationary points only (2) is necessary. We shall formally discuss our assumptions in Section 2.
1.2 From Stationary Points to Local Minima
Given any (first-order) algorithm that finds only stationary points (such as GD, SGD, or SCSG [18]), we can hope to use the NC-search routine to identify whether or not its output x satisfies ∇²f(x) ⪰ −δI. If so, then x automatically becomes an (ε, δ)-approximate local minimum, so we can terminate. If not, then we can move in its negative-curvature direction to further decrease the objective.
In the independent work of Xu and Yang [28], they proposed to apply their Neon method for NC-search, and thus turned SGD and SCSG into first-order methods finding approximate local minima. In this paper, we use Neon2 instead. We show the following theorem:
Theorem 5 (informal). To find an (ε, δ)-approximate local minimum,
(a) Neon2+SGD needs T = Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;
(b) Neon2+SCSG needs T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;
(c) Neon2+GD needs T = Õ(n/ε² + n/δ^{3.5}) (so Õ(1/ε² + 1/δ^{3.5}) full gradients); and
(d) Neon2+SVRG needs T = Õ(n^{2/3}/ε² + n/δ³ + n^{5/12}/(ε²δ^{1/2}) + n^{3/4}/δ^{3.5}) stochastic gradients.
We make several comments as follows.
(a) We compare Neon2+SGD to Ge et al. [13], where the authors showed that SGD plus perturbation needs T = Õ(poly(d)/ε⁴) stochastic gradients to find (ε, ε^{1/4})-approximate local minima. This is perhaps the first time that a theoretical guarantee for finding local minima was given using first-order oracles.
To some extent, Theorem 5a is superior because we have (1) removed the poly(d) factor,⁴ (2) achieved T = Õ(1/ε⁴) as long as δ ≥ ε^{2/3}, and (3) a much simpler analysis.
We also remark that, if using Neon instead of Neon2, one achieves a slightly worse complexity T = Õ(1/ε⁴ + 1/δ⁷); see Figure 1(a) for a comparison.⁵
(b) Neon2+SCSG turns SCSG into a local-minimum finding algorithm. Again, if using Neon instead of Neon2, one gets a slightly worse complexity T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁶); see Figure 1(b).
(c) We compare Neon2+GD to Jin et al. [15], where the authors showed that GD plus perturbation needs Õ(1/ε²) full gradients to find (ε, ε^{1/2})-approximate local minima. This is perhaps the first time that one can convert a stationary-point finding method (namely GD) into a local-minimum finding one, without hurting its performance.
To some extent, Theorem 5c is better because we use Õ(1/ε²) full gradients as long as δ ≥ ε^{4/7}.
(d) Our result for Neon2+SVRG does not seem to be recorded anywhere, even if Hessian-vector
product computations are allowed.
Limitation. We note that there is a limitation of using Neon2 (or Neon) to turn an algorithm finding stationary points into one finding local minima. Namely, given any algorithm A, if the gradient complexity for A to find an ε-approximate stationary point is T, then after this conversion, it finds (ε, δ)-approximate local minima in a gradient complexity that is at least T. This is because the new algorithm, combining Neon2 and A, alternately finds stationary points (using A) and escapes from saddle points (using Neon2); therefore, it must pay at least complexity T. In contrast, methods such as Natasha2 swing by saddle points instead of going to saddle points and then escaping. This has enabled it to achieve a smaller complexity T = O(ε^{−3.25}) for δ ≥ ε^{1/4}.
2 Preliminaries
Throughout this paper, we denote by ‖·‖ the Euclidean norm. We use i ∈R [n] to denote that i is generated from [n] = {1, 2, . . . , n} uniformly at random. We denote by I[event] the indicator function of probabilistic events.
We denote by ‖A‖₂ the spectral norm of a matrix A. For symmetric matrices A and B, we write A ⪰ B to indicate that A − B is positive semidefinite (PSD). Therefore, A ⪰ −σI if and only if all eigenvalues of A are no less than −σ. We denote by λmin(A) and λmax(A) the minimum and maximum eigenvalues of a symmetric matrix A.
Recall some definitions on smoothness (for other equivalent definitions, see the textbook [19]):
Definition 2.1. For a function f : Rd → R,
• f is L-Lipschitz smooth (or L-smooth for short) if ∀x, y ∈ Rd: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
• f is second-order L2-Lipschitz smooth (or L2-second-order smooth for short) if ∀x, y ∈ Rd: ‖∇²f(x) − ∇²f(y)‖₂ ≤ L2‖x − y‖.
The following fact says the variance of a random variable decreases by a factor m if we choose
m independent copies and average them. It is trivial to prove, see for instance [18].
⁴ We are aware that the original authors of [13] have a different proof to remove the poly(d) factor, but we have not found it online at this moment.
⁵ Their complexity might be improvable to Õ(1/ε⁴ + 1/δ⁶) with a slight change of the algorithm, but not beyond.
type | algorithm | gradient complexity T | Hessian-vector products | variance bound | Lip. smooth | 2nd-order smooth
stationary | SGD (folklore) | O(1/ε⁴) | no | needed | needed | no
local minima | perturbed SGD [13] | Õ(poly(d)/ε⁴) (only for δ ≥ ε^{1/4}) | no | needed | needed | needed
local minima | Neon+SGD [28] | Õ(1/ε⁴ + 1/δ⁷) | no | needed | needed | needed
local minima | Neon2+SGD | Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵) | no | needed | needed | needed
stationary | SCSG [18] | O(1/ε^{10/3}) | no | needed | needed | no
local minima | Neon+SCSG [28] | Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁶) | no | needed | needed | needed
local minima | Neon2+SCSG | Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵) | no | needed | needed | needed
local minima | Natasha2 [2] | Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵) | needed | needed | needed | needed
local minima | Neon+Natasha2 [28] | Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁶) | no | needed | needed | needed
local minima | Neon2+Natasha2 | Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵) | no | needed | needed | needed
stationary | GD (folklore) | O(n/ε²) | no | no | needed | no
local minima | perturbed GD [15] | Õ(n/ε²) (only for δ ≥ ε^{1/2}) | no | no | needed | needed
local minima | Neon2+GD | Õ(n/ε² + n/δ^{3.5}) | no | no | needed | needed
stationary | SVRG [3, 23] | O(n + n^{2/3}/ε²) | no | no | needed | no
local minima | Neon2+SVRG | Õ(n^{2/3}/ε² + n/δ³ + n^{5/12}/(ε²δ^{1/2}) + n^{3/4}/δ^{3.5}) | no | no | needed | needed
stationary | "guilty" [8] | Õ(n/ε^{1.75}) | no | no | needed | needed
local minima | FastCubic [1] | Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75}) (only for δ ≥ ε^{1/2}) | needed | no | needed | needed
local minima | CDHS [7] | Õ(n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5}) | needed | no | needed | needed
local minima | Neon2+CDHS | Õ(n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5}) | no | no | needed | needed
Table 1: Complexity for finding points with ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI. Following tradition, in these complexity bounds we treat the variance and smoothness parameters as constants, and only show the dependency on n, d, ε and δ.
Remark 1. A variance bound is needed for online methods.
Remark 2. Lipschitz smoothness is needed for finding approximate stationary points.
Remark 3. Second-order Lipschitz smoothness is needed for finding approximate local minima.
Fact 2.2. If v1, . . . , vn ∈ Rd satisfy Σ_{i=1}^n vi = 0, and S is a non-empty, uniform random subset of [n], then
E_S[ ‖ (1/|S|) Σ_{i∈S} vi ‖² ] ≤ ( I[|S| < n] / |S| ) · (1/n) Σ_{i=1}^n ‖vi‖² .

Algorithm 1 Neon2online(f, x0, δ, p)
Input: function f(·), vector x0, negative curvature target δ > 0, confidence p ∈ (0, 1].
1: for j = 1, 2, · · · , Θ(log(1/p)) do   ⋄ boost the confidence
2:   vj ← Neon2online_weak(f, x0, δ, p);
3:   if vj ≠ ⊥ then
4:     m ← Θ( L² log(1/p)/δ² ),  v′ ← Θ(δ/L2)·vj.
5:     Draw i1, . . . , im ∈R [n].
6:     zj ← (1/(m‖v′‖²)) Σ_{k=1}^{m} (v′)^⊤( ∇f_{i_k}(x0 + v′) − ∇f_{i_k}(x0) )
7:     if zj ≤ −3δ/4 then return v = vj
8:   end if
9: end for
10: return v = ⊥.
Algorithm 2 Neon2online_weak(f, x0, δ, p)
1: η ← δ/( C0² L² log(d/p) ),  T ← C0² log(d/p)/(ηδ)   ⋄ for a sufficiently large constant C0
2: ξ ← Gaussian random vector with norm σ   ⋄ σ := η·(d/p)^{−2C0}·δ²/(L2·L³)
3: x1 ← x0 + ξ.
4: for t ← 1 to T do
5:   xt+1 ← xt − η( ∇fi(xt) − ∇fi(x0) ) where i ∈R [n].
6:   if ‖xt+1 − x0‖₂ ≥ r then return v = (xt+1 − x0)/‖xt+1 − x0‖₂   ⋄ r := (d/p)^{C0}·σ
7: end for
8: return v = ⊥;
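For concreteness, the following Python sketch mirrors the inner loop above; the oracle stoch_grad(i, x) for ∇fi(x) and the parameters eta, T, sigma, r are assumed to be supplied as prescribed by Algorithm 2 (they are placeholders here, not the paper's exact constants).

```python
import numpy as np

def neon2_online_weak(stoch_grad, n, x0, eta, T, sigma, r, rng):
    """Sketch of Neon2_weak^online: stochastic first-order negative-curvature search."""
    xi = rng.standard_normal(x0.shape)
    xi *= sigma / np.linalg.norm(xi)              # random perturbation of norm sigma
    x = x0 + xi
    for _ in range(T):
        i = rng.integers(n)                       # i drawn uniformly at random from [n]
        # gradient difference plays the role of eta * Hessian_i(x0) @ (x - x0)
        x = x - eta * (stoch_grad(i, x) - stoch_grad(i, x0))
        dist = np.linalg.norm(x - x0)
        if dist >= r:                             # escaped the ball of radius r:
            return (x - x0) / dist                # direction aligns with negative curvature
    return None                                   # no escape: certify curvature >= -delta (w.h.p.)
```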
Throughout this paper, we study the minimization of f(x) = (1/n) Σ_{i=1}^n fi(x), where both f(·) and each fi(·) can be nonconvex. We wish to find (ε, δ)-local minima, which are points x satisfying
‖∇f(x)‖ ≤ ε   and   ∇²f(x) ⪰ −δI .
We need the following three assumptions:
• Each fi (x) is L-Lipschitz smooth.
• Each fi (x) is second-order L2 -Lipschitz smooth.
(In fact, the gradient complexity of Neon2 in this paper only depends polynomially on the
second-order smoothness of f (x) (rather than fi (x)), and the time complexity depends logarithmically on the second-order smoothness of fi (x). To make notations simple, we decide to
simply assume each fi (x) is L2 -second-order smooth.)
• Stochastic gradients have bounded variance: ∀x ∈ Rd: E_{i∈R[n]} ‖∇f(x) − ∇fi(x)‖² ≤ V.
  (This assumption is needed only for online algorithms.)

3 Neon2 in the Online Setting
We propose Neon2online formally in Algorithm 1. It repeatedly invokes Neon2online_weak (Algorithm 2), whose goal is to solve the NC-search problem with confidence 2/3 only; then Neon2online invokes Neon2online_weak repeatedly for log(1/p) times to boost the confidence to 1 − p. We prove the following theorem:

Theorem 1 (Neon2online). Let f(x) = (1/n) Σ_{i=1}^n fi(x) where each fi is L-smooth and L2-second-order smooth. For every point x0 ∈ Rd, every δ > 0, and every p ∈ (0, 1], the output v = Neon2online(f, x0, δ, p) satisfies, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x0) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖₂ = 1 and v^⊤∇²f(x0)v ≤ −δ/2.
Moreover, the total number of stochastic gradient evaluations is O( log²(d/p)·L²/δ² ).

The proof of Theorem 1 immediately follows from Lemma 3.1 and Lemma 3.2 below.

Lemma 3.1 (Neon2online_weak). In the same setting as Theorem 1, the output v = Neon2online_weak(f, x0, δ, p) satisfies: if λmin(∇²f(x0)) ≤ −δ, then with probability at least 2/3, v ≠ ⊥ and v^⊤∇²f(x0)v ≤ −(51/100)·δ.
Proof sketch of Lemma 3.1. We explain why Neon2online_weak works as follows. Starting from a randomly perturbed point x1 = x0 + ξ, it keeps updating xt+1 ← xt − η( ∇fi(xt) − ∇fi(x0) ) for a random index i ∈ [n], and stops either when T iterations are reached, or when ‖xt+1 − x0‖₂ > r. Therefore, we have ‖xt − x0‖₂ ≤ r throughout the iterations, and thus can approximate ∇²fi(x0)(xt − x0) using ∇fi(xt) − ∇fi(x0), up to error O(r²). This is a small term based on our choice of r.
Ignoring the error term, our updates look like xt+1 − x0 = (I − η∇²fi(x0))(xt − x0). This is exactly the same as Oja's algorithm [21], which is known to approximately compute the minimum eigenvector of ∇²f(x0) = (1/n) Σ_{i=1}^n ∇²fi(x0). Using the recent optimal convergence analysis of Oja's algorithm [6], one can conclude that after T1 = Θ( log(r/σ)/(ηλ) ) iterations, where λ = max{0, −λmin(∇²f(x0))}, not only has ‖xt+1 − x0‖₂ blown up, but it also aligns well with the minimum eigenvector of ∇²f(x0). In other words, if λ ≥ δ, then the algorithm must stop before T.
Finally, one has to carefully argue that the error does not blow up in this iterative process. We defer the proof details to Appendix A.2.
Our Lemma 3.2 below tells us that we can verify whether the output v of Neon2online_weak is indeed correct (up to additive δ/4), so we can boost the success probability to 1 − p.

Lemma 3.2 (verification). In the same setting as Theorem 1, let x, v ∈ Rd be vectors, let i1, . . . , im ∈R [n], and define
z = (1/m) Σ_{j=1}^{m} v^⊤( ∇f_{ij}(x + v) − ∇f_{ij}(x) ) .
Then, if ‖v‖ ≤ δ/(8L2) and m = Θ( L² log(1/p)/δ² ), with probability at least 1 − p,
| z/‖v‖² − v^⊤∇²f(x)v/‖v‖² | ≤ δ/4 .

4 Neon2 in the Deterministic Setting
We propose Neon2det formally in Algorithm 3 and prove the following theorem:
Algorithm 3 Neon2det(f, x0, δ, p)
Input: A function f, vector x0, negative curvature target δ > 0, failure probability p ∈ (0, 1].
1: T ← C1² log(d/p)·√(L/δ).   ⋄ for a sufficiently large constant C1
2: ξ ← Gaussian random vector with norm σ;   ⋄ σ := (d/p)^{−2C1}·δ/(T³·L2)
3: x1 ← x0 + ξ; y1 ← ξ, y0 ← 0.
4: for t ← 1 to T do
5:   yt+1 = 2𝓜(yt) − yt−1;   ⋄ 𝓜(y) := −(1/L)( ∇f(x0 + y) − ∇f(x0) ) + (1 − 3δ/(4L))·y
6:   xt+1 = x0 + yt+1 − 𝓜(yt).
7:   if ‖xt+1 − x0‖₂ ≥ r then return (xt+1 − x0)/‖xt+1 − x0‖₂.   ⋄ r := (d/p)^{C1}·σ
8: end for
9: return ⊥.
Theorem 3 (Neon2det). Let f(x) be a function that is L-smooth and L2-second-order smooth. For every point x0 ∈ Rd, every δ > 0, and every p ∈ (0, 1], the output v = Neon2det(f, x0, δ, p) satisfies, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x0) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖₂ = 1 and v^⊤∇²f(x0)v ≤ −δ/2.
Moreover, the total number of full gradient evaluations is O( log²(d/p)·√(L/δ) ).
Proof sketch of Theorem 3. We explain the high-level intuition of Neon2det and the proof of Theorem 3 as follows. Define M := −(1/L)·∇²f(x0) + (1 − 3δ/(4L))·I. We immediately notice that
• all eigenvalues of ∇²f(x0) in [−3δ/4, L] are mapped to eigenvalues of M in [−1, 1], and
• any eigenvalue of ∇²f(x0) smaller than −δ is mapped to an eigenvalue of M greater than 1 + δ/(4L).
Therefore, as long as T ≥ Ω̃(L/δ), if we compute xT+1 = x0 + M^T·ξ for some random vector ξ, then by the theory of the power method, xT+1 − x0 must be a negative-curvature direction of ∇²f(x0) with value ≤ −δ/2. There are two issues with this approach.
The first issue is that the degree T of this matrix polynomial M^T can be reduced to T = Ω̃(√(L/δ)) if the so-called Chebyshev polynomial is used.
Claim 4.1. Let Tt(x) be the t-th Chebyshev polynomial of the first kind, defined as [27]:
T0(x) := 1,   T1(x) := x,   Tn+1(x) := 2x·Tn(x) − Tn−1(x) ;
then Tt(x) satisfies:
Tt(x) ∈ [−1, 1]   if x ∈ [−1, 1];   and   Tt(x) ∈ [ ½(x + √(x² − 1))^t , (x + √(x² − 1))^t ]   if x > 1.
Since Tt(x) stays in [−1, 1] when x ∈ [−1, 1], and grows like ≈ (1 + √(x² − 1))^t for x ≥ 1, we can use TT(M) in replacement of M^T. Then, any eigenvalue of M that is above 1 + δ/(4L) shall grow at a speed like (1 + √(δ/L))^T, so it suffices to choose T ≥ Ω̃(√(L/δ)). This is quadratically faster than applying the power method, so in Neon2det we wish to compute xt+1 ≈ x0 + Tt(M)·ξ.
The second issue is that, since we cannot compute Hessian-vector products, we have to use the gradient difference to approximate them; that is, we can only use 𝓜(y) to approximate My, where
𝓜(y) := −(1/L)·( ∇f(x0 + y) − ∇f(x0) ) + (1 − 3δ/(4L))·y .
How does the error propagate if we compute Tt(M)ξ with M replaced by 𝓜? Note that this is a very non-trivial question, because the coefficients of the polynomial Tt(x) can be as large as 2^{O(t)}.
It turns out that the way the error propagates depends on how the Chebyshev polynomial is calculated. If the so-called backward recurrence formula is used, namely
y0 = 0,   y1 = ξ,   yt = 2𝓜(yt−1) − yt−2 ,
and we set xT+1 = x0 + yT+1 − 𝓜(yT), then this xT+1 is sufficiently close to the exact value x0 + TT(M)ξ. This is known as the stability theory of computing Chebyshev polynomials, and is proved in our prior work [5].
We defer all the proof details to Appendix B.2.
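As an illustration of this backward recurrence, here is a short Python sketch of the main loop of Neon2det; grad is the full-gradient oracle and the parameters T, sigma, r are placeholders standing in for the constants of Algorithm 3.

```python
import numpy as np

def neon2_det_sketch(grad, x0, L, delta, T, sigma, r, rng):
    """Backward Chebyshev recurrence with M applied via gradient differences (sketch)."""
    def M_hat(y):
        # approximates M @ y, where M = -(1/L) Hessian(x0) + (1 - 3*delta/(4L)) I
        return -(grad(x0 + y) - grad(x0)) / L + (1.0 - 3.0 * delta / (4.0 * L)) * y

    xi = rng.standard_normal(x0.shape)
    xi *= sigma / np.linalg.norm(xi)
    y_prev, y = np.zeros_like(x0), xi             # y0 = 0, y1 = xi
    for _ in range(T):
        m = M_hat(y)                              # ~ M @ y_t
        y_prev, y = y, 2.0 * m - y_prev           # y_{t+1} = 2 M(y_t) - y_{t-1}
        x = x0 + y - m                            # x_{t+1} = x0 + y_{t+1} - M(y_t)
        dist = np.linalg.norm(x - x0)
        if dist >= r:
            return (x - x0) / dist
    return None
```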
5 Neon2 in the SVRG Setting
Recall that the shift-and-invert (SI) approach [12] on top of the SVRG method [26] solves the minimum eigenvector problem as follows. Given a matrix A = ∇²f(x0), suppose its eigenvalues are λ1 ≤ · · · ≤ λd. If λ > −λ1, we can define the positive semidefinite matrix B = (λI + A)^{−1} and then apply the power method to find an (approximate) maximum eigenvector of B, which is necessarily an (approximate) minimum eigenvector of A.
However, how to compute By for an arbitrary vector y. It turns out, this is equivalent to
minimizing a convex quadratic function that is of a finite sum form
n
1 X ⊤
def 1
z (λI + ∇2 fi (x0 ))z + y ⊤ z .
g(z) = z ⊤ (λI + A)z + y ⊤ z =
2
2n
i=1
Therefore, one can apply the a variant of the SVRG method (arguably first discovered by ShalevShwartz [26]) to solve this task. In each iteration, SVRG needs to evaluate a stochastic gradient
(λI + ∇2 fi (x0 ))z + y at some point z for some random i ∈ [n]. Instead of evaluating it exactly
(which require a Hessian-vector product), we use ∇fi (x0 + z) − ∇fi (x0 ) to approximate ∇2 fi (x0 ) · z.
Of course, one needs to show also that the SVRG method is stable to noise. Using similar
techniques as the previous two sections, one can show that the error term is proportional to O(kzk22 ),
and thus as long as we bound the norm of z is bounded (just like we did in the previous two sections),
this should not affect the performance of the algorithm. We decide to ignore the detailed theoretical
proof of this result, because it will complicate this paper.
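To illustrate the reduction, here is a minimal sketch of computing By approximately; it uses plain gradient descent on g(z) rather than the SVRG variant discussed above, and the step size and iteration count are illustrative placeholders.

```python
import numpy as np

def apply_B_approx(grad, x0, lam, y, step, iters):
    """Approximate B @ y with B = (lam*I + Hessian f(x0))^{-1} (sketch).

    Minimizes g(z) = 0.5 z^T (lam*I + Hessian f(x0)) z + y^T z; the Hessian-vector
    product in grad g(z) is replaced by the gradient difference grad(x0+z) - grad(x0).
    """
    z = np.zeros_like(y)
    for _ in range(iters):
        hvp = grad(x0 + z) - grad(x0)          # ~ Hessian f(x0) @ z
        z = z - step * (lam * z + hvp + y)     # gradient step on g
    return -z                                  # argmin of g equals -B @ y
```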
Theorem 3 (Neon2svrg). Let f(x) = (1/n) Σ_{i=1}^n fi(x) where each fi is L-smooth and L2-second-order smooth. For every point x0 ∈ Rd, every δ > 0, and every p ∈ (0, 1], the output v = Neon2svrg(f, x0, δ, p) satisfies, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x0) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖₂ = 1 and v^⊤∇²f(x0)v ≤ −δ/2.
Moreover, the total number of stochastic gradient evaluations is Õ( n + n^{3/4}·√(L/δ) ).

6 Applications of Neon2
We show how Neon2online can be applied to existing algorithms such as SGD, GD, SCSG, SVRG, Natasha2, and CDHS. Unfortunately, we are unaware of a generic statement for applying Neon2online to an arbitrary algorithm; therefore, we have to prove these results individually.⁶
Throughout this section, we assume that some starting vector x0 ∈ Rd and an upper bound ∆f with f(x0) − minx{f(x)} ≤ ∆f are given to the algorithm. This is only for the purpose of proving theoretical bounds: in practice, because ∆f only appears in specifying the number of iterations, one can simply run the algorithm long enough and then halt, without knowing ∆f.
6.1 Auxiliary Claims
Claim 6.1. For any x, using O( (V/ε² + 1)·log(1/p) ) stochastic gradients, we can decide, with probability 1 − p:
either ‖∇f(x)‖ ≥ ε/2   or   ‖∇f(x)‖ ≤ ε .
Proof. Suppose we generate m = O(log(1/p)) uniformly random subsets S1, . . . , Sm of [n], each of cardinality B = max{32V/ε², 1}. Then, denoting by vj = (1/B) Σ_{i∈Sj} ∇fi(x), we have according to Fact 2.2 that E_{Sj} ‖vj − ∇f(x)‖² ≤ V/B ≤ ε²/32. In other words, with probability at least 1/2 over the randomness of Sj, we have |‖vj‖ − ‖∇f(x)‖| ≤ ‖vj − ∇f(x)‖ ≤ ε/4. Since m = O(log(1/p)), with probability at least 1 − p, at least m/2 + 1 of the vectors vj satisfy |‖vj‖ − ‖∇f(x)‖| ≤ ε/4. Now, if we select v∗ = vj where j ∈ [m] is the index that gives the median value of ‖vj‖, then it satisfies |‖v∗‖ − ‖∇f(x)‖| ≤ ε/4. Finally, we check whether ‖v∗‖ ≤ 3ε/4. If so, we conclude that ‖∇f(x)‖ ≤ ε, and if not, we conclude that ‖∇f(x)‖ ≥ ε/2.
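The test of Claim 6.1 is easy to implement; the sketch below (our own illustration, with m and B as placeholders for the constants in the proof) takes the median of m mini-batch gradient norms and compares it to 3ε/4.

```python
import numpy as np

def gradient_norm_test(stoch_grad, n, x, eps, m, B, rng):
    """Return True to certify ||grad f(x)|| <= eps, False to certify >= eps/2 (sketch)."""
    norms = []
    for _ in range(m):                                     # m = O(log(1/p)) repetitions
        S = rng.choice(n, size=min(B, n), replace=False)   # mini-batch of size B = O(V/eps^2)
        v = np.mean([stoch_grad(i, x) for i in S], axis=0)
        norms.append(np.linalg.norm(v))
    return float(np.median(norms)) <= 0.75 * eps           # compare the median norm to 3*eps/4
```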
Claim 6.2. If v is a unit vector with v^⊤∇²f(y)v ≤ −δ/2, and we choose y′ = y ± (δ/L2)·v where the sign is random, then f(y) − E[f(y′)] ≥ δ³/(12L2²).
Proof. Letting η = δ/L2, by the second-order smoothness,
f(y) − E[f(y′)] ≥ E[ ⟨∇f(y), y − y′⟩ − ½·(y − y′)^⊤∇²f(y)(y − y′) − (L2/6)·‖y − y′‖³ ]
= −(η²/2)·v^⊤∇²f(y)v − (L2·η³/6)·‖v‖³ ≥ η²δ/4 − L2·η³/6 = δ³/(12L2²) ,
where the first-order term vanishes in expectation because the sign of y′ − y is random.
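A quick numerical sanity check of Claim 6.2 on a toy quadratic (our own illustration; for a quadratic the cubic error term vanishes, so the guaranteed decrease is loose):

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.diag([2.0, 1.0, -1.0])              # one negative eigenvalue
delta, L2 = 1.0, 1.0                       # pretend parameters for the illustration
f = lambda y: 0.5 * y @ H @ y
v = np.array([0.0, 0.0, 1.0])              # unit vector with v^T H v = -1 <= -delta/2
y = rng.standard_normal(3)
eta = delta / L2
expected_decrease = f(y) - 0.5 * (f(y + eta * v) + f(y - eta * v))   # average over the sign
print(expected_decrease, ">=", delta**3 / (12 * L2**2))              # 0.5 >= 0.0833...
```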
6.2 Neon2 on SGD and GD
To apply Neon2 to turn SGD into an algorithm that finds approximate local minima, we propose the following process Neon2+SGD (see Algorithm 4). In each iteration t, we first apply SGD with mini-batch size O(1/ε²) (see Line 4). Then, if SGD finds a point with small gradient, we apply Neon2online to decide whether it has a negative curvature; if so, we move in the direction of the negative curvature (see Line 10). We have the following theorem:
⁶ This is because stationary-point finding algorithms have somewhat different guarantees. For instance, in mini-batch SGD we have f(xt) − E[f(xt+1)] ≥ Ω(‖∇f(xt)‖²) but in SCSG we have f(xt) − E[f(xt+1)] ≥ E[Ω(‖∇f(xt+1)‖²)].
Algorithm 4 Neon2+SGD(f, x0, p, ε, δ)
Input: function f(·), starting vector x0, confidence p ∈ (0, 1), ε > 0 and δ > 0.
1: K ← O( L2²∆f/δ³ + L∆f/ε² );   ⋄ ∆f is any upper bound on f(x0) − minx{f(x)}
2: for t ← 0 to K − 1 do
3:   S ← a uniform random subset of [n] with cardinality |S| = B := max{8V/ε², 1};
4:   xt+1/2 ← xt − (1/(L|S|)) Σ_{i∈S} ∇fi(xt);
5:   if ‖∇f(xt)‖ ≥ ε/2 then   ⋄ estimate ‖∇f(xt)‖ using O(ε^{−2}V log(K/p)) stochastic gradients
6:     xt+1 ← xt+1/2;
7:   else   ⋄ necessarily ‖∇f(xt)‖ ≤ ε
8:     v ← Neon2online(f, xt, δ, p/(2K));
9:     if v = ⊥ then return xt;   ⋄ necessarily ∇²f(xt) ⪰ −δI
10:     else xt+1 ← xt ± (δ/L2)·v;   ⋄ necessarily v^⊤∇²f(xt)v ≤ −δ/2
11:   end if
12: end for
13: will not reach this line (with probability ≥ 1 − p).
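The outer loop of Algorithm 4 is summarized by the Python sketch below; the subroutines grad_is_large (the test of Claim 6.1) and nc_search (behaving like Neon2online) are assumed to be given, and K, B are placeholders for the constants above.

```python
import numpy as np

def neon2_sgd(stoch_grad, n, x0, L, L2, delta, K, B, grad_is_large, nc_search, rng):
    """Sketch of the Neon2+SGD outer loop."""
    x = x0
    for _ in range(K):
        S = rng.choice(n, size=min(B, n), replace=False)
        g = np.mean([stoch_grad(i, x) for i in S], axis=0)
        if grad_is_large(x):                   # ||grad f(x)|| >= eps/2: take a mini-batch SGD step
            x = x - g / L
        else:                                  # ||grad f(x)|| <= eps: look for negative curvature
            v = nc_search(x)
            if v is None:
                return x                       # certified (eps, delta)-approximate local minimum
            sign = 1.0 if rng.random() < 0.5 else -1.0
            x = x + sign * (delta / L2) * v    # descend along the negative-curvature direction
    return x
```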
Algorithm 5 Neon2+GD(f, x0, p, ε, δ)
Input: function f(·), starting vector x0, confidence p ∈ (0, 1), ε > 0 and δ > 0.
1: K ← O( L2²∆f/δ³ + L∆f/ε² );   ⋄ ∆f is any upper bound on f(x0) − minx{f(x)}
2: for t ← 0 to K − 1 do
3:   xt+1/2 ← xt − (1/L)·∇f(xt);
4:   if ‖∇f(xt)‖ ≥ ε/2 then
5:     xt+1 ← xt+1/2;
6:   else
7:     v ← Neon2det(f, xt, δ, p/(2K));
8:     if v = ⊥ then return xt;   ⋄ necessarily ∇²f(xt) ⪰ −δI
9:     else xt+1 ← xt ± (δ/L2)·v;   ⋄ necessarily v^⊤∇²f(xt)v ≤ −δ/2
10:   end if
11: end for
12: will not reach this line (with probability ≥ 1 − p).
Theorem 5a. With probability at least 1 − p, Neon2+SGD outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ( (V/ε² + 1)·( L∆f/ε² + L2²∆f/δ³ ) + (L²/δ²)·( L2²∆f/δ³ ) ).
Corollary 6.3. Treating ∆f, V, L, L2 as constants, we have T = Õ( 1/ε⁴ + 1/(ε²δ³) + 1/δ⁵ ).
One can similarly (and more easily) give an algorithm Neon2+GD, which is the same as Neon2+SGD except that the mini-batch SGD step is replaced with a full gradient descent step, and the use of Neon2online is replaced with Neon2det. We have the following theorem:
Theorem 5c. With probability at least 1 − p, Neon2+GD outputs an (ε, δ)-approximate local minimum in Õ( L∆f/ε² + (L^{1/2}/δ^{1/2})·( L2²∆f/δ³ ) ) full gradient computations.
We only prove Theorem 5a in Appendix C; the proof of Theorem 5c is only simpler.
6.3 Neon2 on SCSG and SVRG
Background. We first recall the main idea of the SVRG method for non-convex optimization [3, 23]. It is an offline method but is what SCSG is built on. SVRG divides iterations into epochs, each of length n. It maintains a snapshot point x̃ for each epoch, and computes the full gradient ∇f(x̃) only for snapshots. Then, in each iteration t at a point xt, SVRG defines the gradient estimator ∇̃f(xt) := ∇fi(xt) − ∇fi(x̃) + ∇f(x̃), which satisfies Ei[∇̃f(xt)] = ∇f(xt), and performs the update xt+1 ← xt − α∇̃f(xt) for a learning rate α.
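For reference, one SVRG epoch with this estimator can be sketched as follows (our own illustration; the learning rate alpha and epoch length are placeholders):

```python
import numpy as np

def svrg_epoch(stoch_grad, n, x_snapshot, alpha, epoch_len, rng):
    """One SVRG epoch using the variance-reduced estimator (sketch)."""
    full_grad = np.mean([stoch_grad(i, x_snapshot) for i in range(n)], axis=0)
    x = x_snapshot.copy()
    for _ in range(epoch_len):
        i = rng.integers(n)
        # unbiased estimator: grad f_i(x) - grad f_i(x~) + grad f(x~)
        g = stoch_grad(i, x) - stoch_grad(i, x_snapshot) + full_grad
        x = x - alpha * g
    return x
```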
The SCSG method of Lei et al. [18] proposed a simple fix to turn SVRG into an online method. They changed the epoch length of SVRG from n to B ≈ 1/ε², and then replaced the computation of ∇f(x̃) with (1/|S|) Σ_{i∈S} ∇fi(x̃), where S is a random subset of [n] with cardinality |S| = B. To make this approach even more general, they also analyzed SCSG in the mini-batch setting, with mini-batch size b ∈ {1, 2, . . . , B}.⁷ Their Theorem 3.1 [18] says:
Lemma 6.4 ([18]). There exists a constant C > 1 such that, if we run SCSG for an epoch of size B (so using O(B) stochastic gradients)⁸ with mini-batch b ∈ {1, 2, . . . , B}, starting from a point xt and moving to xt+, then
E[ ‖∇f(xt+)‖² ] ≤ C·L·(b/B)^{1/3}·( f(xt) − E[f(xt+)] ) + 6V/B .
Our Approach. In principle, one could apply the same idea as in Neon2+SGD to SCSG in order to turn it into an algorithm finding approximate local minima. Unfortunately, this is not quite possible, because the left-hand side of Lemma 6.4 is E[‖∇f(xt+)‖²], as opposed to ‖∇f(xt)‖² in SGD (see (C.1)). This means that, instead of testing whether xt is a good local minimum (as we did in Neon2+SGD), this time we need to test whether xt+ is a good local minimum. This creates some extra difficulty, so we need a different proof.
Remark 6.5. As for the parameters of SCSG, we simply use B = max{1, 48V/ε²}. However, choosing mini-batch size b = 1 does not necessarily give the best complexity, so a tradeoff b = Θ( (ε² + V)·ε⁴·L2⁶ / (δ⁹·L³) ) is needed. (A similar tradeoff was also discovered by the authors of Neon [28].) Note that this quantity b may be larger than B, and if this happens, SCSG becomes essentially equivalent to one iteration of SGD with mini-batch size b. Instead of analyzing this boundary case b > B separately, we simply run Neon2+SGD whenever b > B happens, to simplify our proof.
We show the following theorem (proved in Appendix C).
Theorem 5b. With probability at least 2/3, Neon2+SCSG outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ( ( L∆f/(ε^{4/3}V^{1/3}) + L2²∆f/δ³ )·(V/ε²) + (L²/δ²)·( L2²∆f/δ³ ) ).
(To provide the simplest proof, we have shown Theorem 5b only with probability 2/3. One can, for instance, boost the confidence to 1 − p by running log(1/p) copies of Neon2+SCSG.)
Corollary 6.6. Treating ∆f, V, L, L2 as constants, we have T = Õ( 1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵ ).
⁷ That is, they reduced the epoch length to B/b, and replaced ∇fi(xt) − ∇fi(x̃) with (1/|S′|) Σ_{i∈S′} ( ∇fi(xt) − ∇fi(x̃) ) for some S′ that is a random subset of [n] with cardinality |S′| = b.
⁸ We remark that Lei et al. [18] only showed that an epoch runs in an expected O(B) stochastic gradients. We assume it is exact here to simplify proofs. One can, for instance, stop SCSG after O(B log(1/p)) stochastic gradient computations, and then Lemma 6.4 will succeed with probability ≥ 1 − p.
Algorithm 6 Neon2+SCSG(f, x0, ε, δ)
Input: function f(·), starting vector x0, ε > 0 and δ > 0.
1: B ← max{1, 48V/ε²};  b ← max{1, Θ( (ε² + V)·ε⁴·L2⁶/(δ⁹·L³) )};
2: if b > B then return Neon2+SGD(f, x0, 2/3, ε, δ);   ⋄ for cleaner analysis purposes, see Remark 6.5
3: K ← Θ( L·b^{1/3}·∆f/(ε^{4/3}V^{1/3}) );   ⋄ ∆f is any upper bound on f(x0) − minx{f(x)}
4: for t ← 0 to K − 1 do
5:   xt+1/2 ← apply SCSG on xt for one epoch of size B = max{Θ(V/ε²), 1};
6:   if ‖∇f(xt+1/2)‖ ≥ ε/2 then   ⋄ estimate ‖∇f(xt+1/2)‖ using O(ε^{−2}V log K) stochastic gradients
7:     xt+1 ← xt+1/2;
8:   else   ⋄ necessarily ‖∇f(xt+1/2)‖ ≤ ε
9:     v ← Neon2online(f, xt+1/2, δ, 1/(20K));
10:     if v = ⊥ then return xt+1/2;   ⋄ necessarily ∇²f(xt+1/2) ⪰ −δI
11:     else xt+1 ← xt+1/2 ± (δ/L2)·v;   ⋄ necessarily v^⊤∇²f(xt+1/2)v ≤ −δ/2
12:   end if
13: end for
14: will not reach this line (with probability ≥ 2/3).
As for SVRG, it is an offline method and its one-epoch lemma looks like⁹
E[ ‖∇f(xt+)‖² ] ≤ C·L·n^{−1/3}·( f(xt) − E[f(xt+)] ) .
If one replaces the use of Lemma 6.4 with this new inequality, and replaces the use of Neon2online with Neon2svrg, then we get the following theorem:
Theorem 5d. With probability at least 2/3, Neon2+SVRG outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ( n·L∆f/(ε²·n^{1/3}) + ( n + n^{3/4}·√(L/δ) )·L2²∆f/δ³ ).
For a clean presentation of this paper, we omit the pseudocode and proof, because they are only simpler than those for Neon2+SCSG.
6.4 Neon2 on Natasha2 and CDHS
The recent results of Carmon et al. [7] (which we refer to as CDHS) and Natasha2 [2] are both Hessian-free methods where the only Hessian-vector product computations come from the exact NC-search process we study in this paper. Therefore, by replacing their NC-search with Neon2, we can directly turn them into first-order methods without the need to compute Hessian-vector products.
We state the following two theorems, whose proofs are exactly the same as in the papers [7] and [2]. We state them directly assuming ∆f, V, L, L2 are constants, to simplify our notation.
Theorem 2. One can replace Oja's algorithm with Neon2online in Natasha2 without hurting its performance, turning it into a first-order stochastic method.
Treating ∆f, V, L, L2 as constants, Natasha2 then finds an (ε, δ)-approximate local minimum in T = Õ( 1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵ ) stochastic gradient computations.
Theorem 4. One can replace the Lanczos method with Neon2det or Neon2svrg in CDHS without hurting its performance, turning it into a first-order method.
Treating ∆f, L, L2 as constants, CDHS then finds an (ε, δ)-approximate local minimum in either Õ( 1/ε^{1.75} + 1/δ^{3.5} ) full gradient computations (if Neon2det is used) or T = Õ( n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5} ) stochastic gradient computations (if Neon2svrg is used).
⁹ There are at least three different variants of SVRG [3, 18, 23]. We have adopted the lemma of [18] for simplicity.
Acknowledgements
We would like to thank Tianbao Yang and Yi Xu for helpful feedback on this manuscript.
Appendix
A Missing Proofs for Section 3
A.1 Auxiliary Lemmas
We use the following lemma to approximate Hessian-vector products:
Lemma A.1. If f(x) is L2-second-order smooth, then for every point x ∈ Rd and every vector v ∈ Rd, we have
‖∇f(x + v) − ∇f(x) − ∇²f(x)·v‖₂ ≤ L2·‖v‖₂² .
Proof of Lemma A.1. We can write ∇f(x + v) − ∇f(x) = ∫_{t=0}^{1} ∇²f(x + tv)·v dt. Subtracting ∇²f(x)·v we have:
‖∇f(x + v) − ∇f(x) − ∇²f(x)·v‖₂ = ‖ ∫_{t=0}^{1} ( ∇²f(x + tv) − ∇²f(x) )·v dt ‖₂
≤ ∫_{t=0}^{1} ‖∇²f(x + tv) − ∇²f(x)‖₂ · ‖v‖₂ dt ≤ L2·‖v‖₂² .
We also need the following auxiliary lemma about martingale concentration:
Lemma A.2. Consider random events {Ft}t≥1 and random variables x1, . . . , xT ≥ 0 and a1, . . . , aT ∈ [−ρ, ρ] for ρ ∈ [0, 1/2], where each xt and at only depend on F1, . . . , Ft. Letting x0 = 0, suppose there exist constants b ≥ 0 and λ > 0 such that for every t ≥ 1:
xt ≤ xt−1·(1 − at) + b   and   E[at | F1, . . . , Ft−1] ≥ −λ .
Then, for every p ∈ (0, 1):  Pr[ xT ≥ T·b·e^{λT + 2ρ√(T log(T/p))} ] ≤ p .
Proof. We know that
xT ≤ (1 − aT)·xT−1 + b ≤ (1 − aT)·( (1 − aT−1)·xT−2 + b ) + b = (1 − aT)(1 − aT−1)·xT−2 + (1 − aT)·b + b ≤ · · · ≤ Σ_{s=1}^{T} ( Π_{t=s+1}^{T} (1 − at) )·b .
For each s ∈ [T], consider the random process defined as ys = b and yt+1 = (1 − at)·yt for t ≥ s. Therefore
log yt+1 = log(1 − at) + log yt .
Since log(1 − at) ∈ [−2ρ, ρ] and E[log(1 − at) | F1, · · · , Ft−1] ≤ λ, we can apply the Azuma-Hoeffding inequality on log yt to conclude that
Pr[ yT ≥ b·e^{λT + 2ρ√(T log(T/p))} ] ≤ p/T .
Taking a union bound over s completes the proof.
A.2 Proof of Lemma 3.1
Proof of Lemma 3.1. Let it ∈ [n] be the random index i chosen when computing xt+1 from xt in Line 5 of Neon2online_weak. We will write the update rule of xt in terms of the Hessian before we stop. By Lemma A.1, we know that for every t ≥ 1,
‖∇fit(xt) − ∇fit(x0) − ∇²fit(x0)·(xt − x0)‖₂ ≤ L2·‖xt − x0‖₂² .
Therefore, there exists an error vector ξt ∈ Rd with ‖ξt‖₂ ≤ L2·‖xt − x0‖₂² such that
(xt+1 − x0) = (xt − x0) − η·∇²fit(x0)·(xt − x0) + η·ξt .
For notational simplicity, let us denote
zt := xt − x0,   Bt := ∇²fit(x0),   Rt := −ξt·zt^⊤/‖zt‖₂²,   At := Bt + Rt ;
then it satisfies
zt+1 = zt − η·Bt·zt + η·ξt = (I − η·At)·zt .
We have ‖Rt‖₂ ≤ L2·‖zt‖₂ ≤ L2·r. By the L-smoothness of fit, we know ‖Bt‖₂ ≤ L and thus ‖At‖₂ ≤ ‖Bt‖₂ + ‖Rt‖₂ ≤ ‖Bt‖₂ + L2·r ≤ 2L.
Now, define Φt := zt+1·zt+1^⊤ = (I − ηAt)···(I − ηA1)·ξξ^⊤·(I − ηA1)···(I − ηAt) and wt := zt/‖zt‖₂ = zt/(Tr(Φt−1))^{1/2}. Then, before we stop, we have:
Tr(Φt) = Tr(Φt−1)·( 1 − 2η·wt^⊤At wt + η²·wt^⊤At² wt )
≤ Tr(Φt−1)·( 1 − 2η·wt^⊤At wt + 4η²L² )
≤ Tr(Φt−1)·( 1 − 2η·wt^⊤Bt wt + 2η‖Rt‖₂ + 4η²L² )
①≤ Tr(Φt−1)·( 1 − 2η·wt^⊤Bt wt + 8η²L² ) .
Above, ① is because our choice of parameters satisfies r ≤ ηL²/L2. Therefore,
log(Tr(Φt)) ≤ log(Tr(Φt−1)) + log( 1 − 2η·wt^⊤Bt wt + 8η²L² ) .
Letting λ := −λmin(∇²f(x0)) = −λmin(E_{Bt}[Bt]), since the randomness of Bt is independent of wt, we know that wt^⊤Bt wt ∈ [−L, L] and, for every wt, E_{Bt}[wt^⊤Bt wt | wt] ≥ −λ. This (by concavity of log) also implies that E[log(1 − 2η·wt^⊤Bt wt + 8η²L²)] ≤ 2ηλ and log(1 − 2η·wt^⊤Bt wt + 8η²L²) ∈ [−2(2ηL + 8η²L²), 2ηL + 8η²L²] ⊆ [−6ηL, 3ηL].
Hence, applying the Azuma-Hoeffding inequality on log(Tr(Φt)), we have
Pr[ log(Tr(Φt)) − log(Tr(Φ0)) ≥ 2ηλt + 16ηL·√(t log(1/p)) ] ≤ p .
In other words, with probability at least 1 − p, Neon2online_weak will not terminate until t ≥ T0, where T0 is given by the equation (recall Tr(Φ0) = ‖z1‖² = σ²):
2ηλT0 + 16ηL·√(T0 log(1/p)) = log( r²/σ² ) .
Next, we turn to accuracy. Let the "true" vector be vt+1 := (I − ηBt)···(I − ηB1)·ξ, and we have
zt+1 − vt+1 = Π_{s=1}^{t}(I − ηAs)·ξ − Π_{s=1}^{t}(I − ηBs)·ξ = (I − ηBt)·(zt − vt) − η·Rt·zt .
Thus, if we call ut := zt − vt with u1 = 0, then, before the algorithm stops, we have:
‖ut+1 − (I − ηBt)·ut‖₂ ≤ η·‖Rt·zt‖₂ ≤ η·L2·r² .
Using Young's inequality ‖a + b‖₂² ≤ (1 + β)·‖a‖₂² + (1 + 1/β)·‖b‖₂² for every β > 0, we have:
‖ut+1‖₂² ≤ (1 + η)·‖(I − ηBt)·ut‖₂² + 8L2²r⁴  ①≤  ‖ut‖₂²·( 1 − 2η·ut^⊤Bt ut/‖ut‖₂² + 4η²L² ) + 8L2²r⁴ .
Above, ① assumes without loss of generality that L ≥ 1 (as otherwise we can re-scale the problem). Therefore, applying the martingale concentration bound of Lemma A.2, we know
Pr[ ‖ut‖₂ ≥ 16L2·r²·t·e^{ηλt + 8ηL√(t log(t/p))} ] ≤ p .
Now we can apply the recent analysis of Oja's algorithm [6, Theorem 4]. By our choice of parameters, with probability at least 99/100 the following holds:
1. Norm growth: ‖vt‖₂ ≥ e^{(ηλ − 2η²L²)·t}·σ/d.
2. Negative curvature: vt^⊤∇²f(x0)·vt / ‖vt‖₂² ≤ −(1 − 2η)·λ + O( log(d)/(ηt) ).
Then let us consider the case λ ≥ δ, and consider a fixed T1 defined as
T1 = log(2dr/σ)/(ηλ) = ( C0·log(d/p) + log(2d) )/(ηλ)
(version 1)∗
arXiv:1711.06673v1 [cs.LG] 17 Nov 2017
Zeyuan Allen-Zhu
[email protected]
Microsoft Research
Yuanzhi Li
[email protected]
Princeton University
November 17, 2017
Abstract
We propose a reduction for non-convex optimization that can (1) turn a stationary-point
finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the
deterministic settings, without hurting the algorithm’s performance.
As applications, our reduction turns Natasha2 into a first-order method without hurting
its performance. It also converts SGD, GD, SCSG, and SVRG into local-minimum finding
algorithms outperforming some best known results.
1
Introduction
Nonconvex optimization has become increasing popular due its ability to capture modern machine
learning tasks in large scale.
Most notably, training deep neural networks corresponds to minimizing
P
a function f (x) = n1 ni=1 fi (x) over x ∈ Rd that is non-convex, where each training sample
i corresponds to one loss function fi (·) in the summation. This average structure allows one
to perform stochastic gradient descent (SGD) which uses a random ∇fi (x) —corresponding to
computing backpropagation once— to approximate ∇f (x) and performs descent updates.
Motivated by such large-scale machine learning applications, we wish to design faster first-order
non-convex optimization methods that outperform the performance of gradient descent, both in
the online and offline settings. In this paper, we say an algorithm is online if its complexity is
independent of n (so n can be infinite), and offline otherwise. In recently years, researchers across
different communities have gathered together to tackle this challenging question. By far, known
theoretical approaches mostly fall into one of the following two categories.
First-order methods for stationary points. In analyzing first-order methods, we denote
by gradient complexity T the number of computations of ∇fi (x). To achieve an ε-approximate
stationary point —namely, a point x with k∇f (x)k ≤ ε— it is a folklore that gradient descent
(GD) is offline and needs T ∝ O εn2 , while stochastic gradient decent (SGD) is online and needs
2/3
T ∝ O ε14 . In recent years, the offline complexity has been improved to T ∝ O nε2 by the
∗
The result of this paper was briefly discussed at a Berkeley Simons workshop on Oct 6 and internally presented
at Microsoft on Oct 30. We started to prepare this manuscript on Nov 11, after being informed of the independent
and similar work of Xu and Yang [28]. Their result appeared on arXiv on Nov 3. To respect the fact that their work
appeared online before us, we have adopted their algorithm name Neon and called our new algorithm Neon2. We
encourage readers citing this work to also cite [28].
1
1
by the SCSG
SVRG method [3, 23], and the online complexity has been improved to T ∝ O ε10/3
method [18]. Both of them rely on the so-called variance-reduction technique, originally discovered
for convex problems [11, 16, 24, 26].
These algorithms SVRG and SCSG are only capable of finding stationary points, which may
not necessarily be approximate local minima and are arguably bad solutions for neural-network
training [9, 10, 14]. Therefore,
can we turn stationary-point finding algorithms into local-minimum finding ones?
Hessian-vector methods for local minima. It is common knowledge that using information
about the Hessian, one can find ε-approximate local minima —namely, a point x with k∇f (x)k ≤ ε
and also ∇2 f (x) −ε1/C I.1 In 2006, Nesterov and Polyak [20] showed that one can find an ε1
approximate in O( ε1.5
) iterations, but each iteration requires an (offline) computation as heavy as
inverting the matrix ∇2 f (x).
To fix this issue, researchers propose to study the so-called “Hessian-free” methods that, in
addition to gradient computations, also compute Hessian-vector products. That is, instead of using
the full matrix ∇2 fi (x) or ∇2 f (x), these methods also compute ∇2 fi (x) · v for indices i and vectors
v.2 For Hessian-free methods, we denote by gradient complexity T the number of computations
of ∇fi (x) plus that of ∇2 fi (x) · v. The hope of using Hessian-vector products is to improve the
complexity T as a function of ε.
Such improvement was first shown possible independently by [1, 7] for the offline setting, with
3/4
n
+ εn1.75 so is better than that of gradient descent.
complexity T ∝ ε1.5
In the online setting, the
1
first improvement was by Natasha2 which gives complexity T ∝ ε3.25
[2].
Unfortunately, it is argued by some researchers that Hessian-vector products are not general
enough and may not be as simple to implement as evaluating gradients [8]. Therefore,
can we turn Hessian-free methods into first-order ones, without hurting their performance?
1.1
From Hessian-Vector Products to First-Order Methods
i (x)
}. Given any
Recall by definition of derivative we have ∇2 fi (x) · v = limq→0 { ∇fi (x+qv)−∇f
q
Hessian-free method, at least at a high level, can we replace every occurrence of ∇2 fi (x) · v with
i (x)
for some small q > 0?
w = ∇fi (x+qv)−∇f
q
Note the error introduced in this approximation is k∇2 fi (x) · v − wk ∝ qkvk2 . Therefore, as
long as the original algorithm is sufficiently stable to adversarial noise, and as long as q is small
enough, this can convert Hessian-free algorithms into first-order ones.
In this paper, we demonstrate this idea by converting negative-curvature-search (NC-search)
subroutines into first-order processes. NC-search is a key subroutine used in state-of-the-art
Hessian-free methods that have rigorous proofs (see [1, 2, 7]). It solves the following simple task:
negative-curvature search (NC-search)
given point x0 , decide if ∇2 f (x0 ) −δI or find a unit vector v such that v ⊤ ∇2 f (x0 )v ≤ − 2δ .
Online Setting.
In the online setting, NC-search can be solved by Oja’s algorithm [21] which
1
We say A −δI if all the eigenvalues of A are no smaller than −δ. In this high-level introduction, we focus only
on the case when δ = ε1/C for some constant C.
2
Hessian-free methods are useful because when fi (·) is explicitly given, computing its gradient is in the same
complexity as computing its Hessian-vector product [22, 25], using backpropagation.
2
2 ) computations of Hessian-vector products (first proved in [6] and applied to NC-search
e
costs O(1/δ
in [2]).
In this paper, we propose a method Neon2online which solves the NC-search problem via only
stochastic first-order updates. That is, starting from x1 = x0 + ξ where ξ is some random perturbation, we keep updating xt+1 = xt − η(∇fi (xt ) − ∇fi (x0 )). In the end, the vector xT − x0 gives
us enough information about the negative curvature.
2 ) stochastic grae
Theorem 1 (informal). Our Neon2online algorithm solves NC-search using O(1/δ
dients, without Hessian-vector product computations.
2 ) matches that of Oja’s algorithm, and is information-theoretically ope
This complexity O(1/δ
timal (up to log factors), see the lower bound in [6].
We emphasize that the independent work Neon by Xu and Yang [28] is actually the first recorded
3 ) stochastic gradients,
e
theoretical result that proposed this approach. However, Neon needs O(1/δ
because it uses full gradient descent to find NC (on a sub-sampled objective) inspired by [15] and
the power method; instead, Neon2online uses stochastic gradients and is based on our prior work on
Oja’s algorithm [6].
By plugging Neon2online into Natasha2 [2], we achieve the following corollary (see Figure 1(c)):
Theorem 2 (informal). Neon2online turns Natasha2 into a stochastic first-order method, without
1
e 3.25
+
hurting its performance. That is, it finds an (ε, δ)-approximate local minimum in T = O
ε
1
1
+ δ5 stochastic gradient computations, without Hessian-vector product computations.
ε3 δ
(We say x is an approximate local minimum if k∇f (x)k ≤ ε and ∇2 f (x) −δI.)
Offline Setting. There are a number of ways to solve the NC-search problem in the offline setting
e
using Hessian-vector products. Most notably, power
√ method uses O(n/δ) computations of Hessiane
vector products, Lanscoz method [17] uses O(n/ δ) computations,
and shift-and-invert [12] on top
√
e + n3/4 / δ) computations.
of SVRG [26] (that we call SI+SVRG) uses O(n
In this paper, we convert Lanscoz’s method and SI+SVRG into first-order ones:
√
det algorithm solves NC-search using O(1/
e
δ) full gradients
Theorem 3 (informal).
√ Our Neon2
svrg
e
e +
(or equivalently
O(n/ δ) stochastic gradients), and our Neon2
solves NC-search using O(n
√
3/4
n / δ) stochastic gradients.
We emphasize that, although analyzed in the online setting only, the work Neon by Xu and
Yang [28] also applies to the offline setting, and seems to be the first result to solve NC-search
e
using first-order√gradients with a theoretical proof. However, Neon uses O(1/δ)
full gradients
det
e
instead of O(1/ δ). Their approach is inspired by [15], but our Neon2 is based on Chebyshev
approximation theory (see textbook [27]) and its recent stability analysis [5].
By putting Neon2det and Neon2svrg into the CDHS method of Carmon et al. [7], we have3
Theorem 4 (informal). Neon2det turns CDHS into a first-order method
without hurting its perfor1
1
e
mance: it finds an (ε, δ)-approximate local minimum in O ε1.75 + δ3.5 full gradient computations.
Neon2svrg turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)3/4
3/4
n
e 1.5
approximate local minimum in T = O
+ δn3 + εn1.75 + nδ3.5 stochastic gradient computations.
ε
3
We note that the original paper of CDHS only proved such complexity results (although requiring Hessian-vector
1
e 1.75
products) for the special case of δ ≥ ε1/2 . In such a case, it requires either O
full gradient computations or
ε
3/4
n
n
e
O 1.5 + 1.75 stochastic gradient computations.
ε
ε
3
T
T=δ-7
T
T=δ-6
ε-5
T=δ-3 ε-2
ε
ε-4
T=ε-4
ε2/3 ε4/7 ε1/2
ε1/4
δ
ε-3.33
T=δ-6
Neon2+SCSG
Neon+SGD
T=δ-5
ε-5
ε-4
T
Neon2+SGD
Neon+SCSG
T=δ-5
ε-5
T=δ-3 ε-2
T=ε-3.33
ε
(a)
ε2/3
ε1/2 ε4/9
ε1/4
Neon+Natasha
T=δ-5
ε-3.75
δ
ε-3.6
ε-3.25
Neon2+Natasha2
ε
T=δ-1 ε-3
ε3/4
(b)
ε3/5 ε1/2
T=ε-3.25
ε1/4
δ
(c)
Figure 1: Neon vs Neon2 for finding (ε, δ)-approximate local minima. We emphasize that Neon2 and Neon are based
on the same high-level idea, but Neon is arguably the first-recorded result to turn stationary-point finding
algorithms (such as SGD, SCSG) into local-minimum finding ones, with theoretical proofs.
One should perhaps compare Neon2det to the interesting work “convex until guilty” by Carmon
1.75 ) full gradients,
e
et al. [8]. Their method finds ε-approximate stationary points using O(1/ε
and is arguably the first first-order method achieving a convergence rate better than 1/ε2 of GD.
Unfortunately, it is unclear if their method guarantees local minima. In comparison, Neon2det on
CDHS achieves the same complexity but guarantees its output to be an approximate local minimum.
Remark 1.1. All the cited works in this sub-section requires the objective to have (1) Lipschitzcontinuous Hessian (a.k.a. second-order smoothness) and (2) Lipschitz-continuous gradient (a.k.a.
Lipschitz smoothness). One can argue that (1) and (2) are both necessary for finding approximate
local minima, but if only finding approximate stationary points, then only (2) is necessary. We
shall formally discuss our assumptions in Section 2.
1.2
From Stationary Points to Local Minima
Given any (first-order) algorithm that finds only stationary points (such as GD, SGD, or SCSG [18]),
we can hope for using the NC-search routine to identify whether or not its output x satisfies
∇2 f (x) −δI. If so, then automatically x becomes an (ε, δ)-approximate local minima so we
can terminate. If not, then we can go in its negative curvature direction to further decrease the
objective.
In the independent work of Xu and Yang [28], they proposed to apply their Neon method
for NC-search, and thus turned SGD and SCSG into first-order methods finding approximate local
minima. In this paper, we use Neon2 instead. We show the following theorem:
Theorem 5 (informal). To find an (ε, δ)-approximate local minimum,
(a) Neon2+SGD needs T = Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;
(b) Neon2+SCSG needs T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵) stochastic gradients;
(c) Neon2+GD needs T = Õ(n/ε² + n/δ^{3.5}) stochastic gradients (so Õ(1/ε² + 1/δ^{3.5}) full gradients); and
(d) Neon2+SVRG needs T = Õ(n^{2/3}/ε² + n/δ³ + n^{5/12}/(ε²δ^{1/2}) + n^{3/4}/δ^{3.5}) stochastic gradients.
We make several comments as follows.
(a) We compare Neon2+SGD to Ge et al. [13], where the authors showed that SGD plus perturbation needs T = Õ(poly(d)/ε⁴) stochastic gradients to find (ε, ε^{1/4})-approximate local minima. This is perhaps the first time that a theoretical guarantee for finding local minima is given using first-order oracles.
To some extent, Theorem 5a is superior because we have (1) removed the poly(d) factor,⁴ (2) achieved T = Õ(1/ε⁴) as long as δ ≥ ε^{2/3}, and (3) a much simpler analysis.
We also remark that, if using Neon instead of Neon2, one achieves the slightly worse complexity T = Õ(1/ε⁴ + 1/δ⁷); see Figure 1(a) for a comparison.⁵
(b) Neon2+SCSG turns SCSG into a local-minimum finding algorithm. Again, if using Neon instead of Neon2, one gets the slightly worse complexity T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁶); see Figure 1(b).
(c) We compare Neon2+GD to Jin et al. [15], where the authors showed that GD plus perturbation needs Õ(1/ε²) full gradients to find (ε, ε^{1/2})-approximate local minima. This is perhaps the first time that one can convert a stationary-point finding method (namely GD) into a local-minimum finding one, without hurting its performance.
To some extent, Theorem 5c is better because we use Õ(1/ε²) full gradients as long as δ ≥ ε^{4/7}.
(d) Our result for Neon2+SVRG does not seem to be recorded anywhere, even if Hessian-vector
product computations are allowed.
Limitation. We note that there is a limitation to using Neon2 (or Neon) to turn a stationary-point finding algorithm into a local-minimum finding one. Namely, given any algorithm A, if the gradient complexity for A to find an ε-approximate stationary point is T, then after this conversion, the new algorithm finds (ε, δ)-approximate local minima in a gradient complexity that is at least T. This is because the new algorithm, which combines Neon2 and A, alternately finds stationary points (using A) and escapes from saddle points (using Neon2); therefore, it must pay at least complexity T. In contrast, methods such as Natasha2 swing by saddle points instead of going to saddle points and then escaping. This has enabled it to achieve a smaller complexity T = O(ε^{−3.25}) for δ ≥ ε^{1/4}.
2 Preliminaries
Throughout this paper, we denote by ‖·‖ the Euclidean norm. We use i ∈_R [n] to denote that i is generated from [n] = {1, 2, ..., n} uniformly at random. We denote by I[event] the indicator function of probabilistic events.
We denote by ‖A‖₂ the spectral norm of a matrix A. For symmetric matrices A and B, we write A ⪰ B to indicate that A − B is positive semidefinite (PSD). Therefore, A ⪰ −σI if and only if all eigenvalues of A are no less than −σ. We denote by λ_min(A) and λ_max(A) the minimum and maximum eigenvalue of a symmetric matrix A.
Recall some definitions on smoothness (for other equivalent definitions, see the textbook [19]).
Definition 2.1. For a function f : R^d → R,
• f is L-Lipschitz smooth (or L-smooth for short) if ∀x, y ∈ R^d: ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.
• f is second-order L₂-Lipschitz smooth (or L₂-second-order smooth for short) if ∀x, y ∈ R^d: ‖∇²f(x) − ∇²f(y)‖₂ ≤ L₂‖x − y‖.
The following fact says that the variance of a random variable decreases by a factor of m if we choose m independent copies and average them. It is trivial to prove; see for instance [18].
⁴We are aware that the original authors of [13] have a different proof to remove its poly(d) factor, but have not found it online at this moment.
⁵Their complexity might be improvable to Õ(1/ε⁴ + 1/δ⁶) with a slight change of the algorithm, but not beyond.
|              | algorithm          | gradient complexity T                                          | Hessian-vector products | variance bound | Lipschitz smooth | 2nd-order smooth |
|--------------|--------------------|----------------------------------------------------------------|-------------------------|----------------|------------------|------------------|
| stationary   | SGD (folklore)     | O(1/ε⁴)                                                        | no                      | needed         | needed           | no               |
| local minima | perturbed SGD [13] | Õ(poly(d)/ε⁴)  (only for δ ≥ ε^{1/4})                          | no                      | needed         | needed           | needed           |
| local minima | Neon+SGD [28]      | Õ(1/ε⁴ + 1/δ⁷)                                                 | no                      | needed         | needed           | needed           |
| local minima | Neon2+SGD          | Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵)                                      | no                      | needed         | needed           | needed           |
| stationary   | SCSG [18]          | O(1/ε^{10/3})                                                  | no                      | needed         | needed           | no               |
| local minima | Neon+SCSG [28]     | Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁶)                                | no                      | needed         | needed           | needed           |
| local minima | Neon2+SCSG         | Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵)                                | no                      | needed         | needed           | needed           |
| local minima | Natasha2 [2]       | Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵)                                 | needed                  | needed         | needed           | needed           |
| local minima | Neon+Natasha2 [28] | Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁶)                                 | no                      | needed         | needed           | needed           |
| local minima | Neon2+Natasha2     | Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵)                                 | no                      | needed         | needed           | needed           |
| stationary   | GD (folklore)      | O(n/ε²)                                                        | no                      | no             | needed           | no               |
| local minima | perturbed GD [15]  | Õ(n/ε²)  (only for δ ≥ ε^{1/2})                                | no                      | no             | needed           | needed           |
| local minima | Neon2+GD           | Õ(n/ε² + n/δ^{3.5})                                            | no                      | no             | needed           | needed           |
| stationary   | SVRG [3, 23]       | O(n^{2/3}/ε²)                                                  | no                      | no             | needed           | no               |
| local minima | Neon2+SVRG         | Õ(n^{2/3}/ε² + n/δ³ + n^{5/12}/(ε²δ^{1/2}) + n^{3/4}/δ^{3.5})  | no                      | no             | needed           | needed           |
| stationary   | “guilty” [8]       | Õ(n/ε^{1.75})                                                  | no                      | no             | needed           | needed           |
| local minima | FastCubic [1]      | Õ(n/ε^{1.5} + n^{3/4}/ε^{1.75})  (only for δ ≥ ε^{1/2})        | needed                  | no             | needed           | needed           |
| local minima | CDHS [7]           | Õ(n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5})       | needed                  | no             | needed           | needed           |
| local minima | Neon2+CDHS         | Õ(n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5})       | no                      | no             | needed           | needed           |
Table 1: Complexity for finding ‖∇f(x)‖ ≤ ε and ∇²f(x) ⪰ −δI. Following tradition, in these complexity bounds we treat the variance and smoothness parameters as constants, and only show the dependency on n, d, ε, and δ.
Remark 1. A variance bound is needed for online methods.
Remark 2. Lipschitz smoothness is needed for finding approximate stationary points.
Remark 3. Second-order Lipschitz smoothness is needed for finding approximate local minima.
Fact 2.2. If v₁, ..., vₙ ∈ R^d satisfy Σ_{i=1}^n vᵢ = 0⃗, and S is a non-empty, uniform random subset of [n], then
E_S[ ‖(1/|S|) Σ_{i∈S} vᵢ‖² ] ≤ (I[|S| < n]/|S|) · (1/n) Σ_{i=1}^n ‖vᵢ‖² .

Algorithm 1 Neon2online(f, x₀, δ, p)
Input: function f, vector x₀, negative curvature target δ > 0, confidence p ∈ (0, 1].
1: for j = 1, 2, ..., Θ(log(1/p)) do      ⋄ boost the confidence
2:   v_j ← Neon2online_weak(f, x₀, δ, p);
3:   if v_j ≠ ⊥ then
4:     m ← Θ(L² log(1/p)/δ²), v′ ← Θ(δ/L₂)·v_j.
5:     Draw i₁, ..., i_m ∈_R [n].
6:     z_j ← (1/(m‖v′‖₂²)) Σ_{k=1}^m (v′)ᵀ(∇f_{i_k}(x₀ + v′) − ∇f_{i_k}(x₀))
7:     if z_j ≤ −3δ/4 then return v = v_j
8:   end if
9: end for
10: return v = ⊥.
Algorithm 2 Neon2online_weak(f, x₀, δ, p)
1: η ← δ/(C₀² L² log(d/p)), T ← C₀² log(d/p)/(ηδ);      ⋄ for a sufficiently large constant C₀
2: ξ ← Gaussian random vector with norm σ;      ⋄ σ := η(d/p)^{−2C₀} δ²/(L₂L³)
3: x₁ ← x₀ + ξ.
4: for t ← 1 to T do
5:   x_{t+1} ← x_t − η(∇fᵢ(x_t) − ∇fᵢ(x₀)) where i ∈_R [n].
6:   if ‖x_{t+1} − x₀‖₂ ≥ r then return v = (x_{t+1} − x₀)/‖x_{t+1} − x₀‖₂      ⋄ r := (d/p)^{C₀} σ
7: end for
8: return v = ⊥;
Throughout this paper, we study the minimization of f(x) = (1/n) Σ_{i=1}^n fᵢ(x) over x ∈ R^d, where both f(·) and each fᵢ(·) can be nonconvex. We wish to find (ε, δ)-approximate local minima, which are points x satisfying
‖∇f(x)‖ ≤ ε   and   ∇²f(x) ⪰ −δI .
We need the following three assumptions:
• Each fᵢ(x) is L-Lipschitz smooth.
• Each fᵢ(x) is second-order L₂-Lipschitz smooth. (In fact, the gradient complexity of Neon2 in this paper depends only polynomially on the second-order smoothness of f(x) (rather than of fᵢ(x)), and the time complexity depends only logarithmically on the second-order smoothness of fᵢ(x). To keep notation simple, we decide to simply assume each fᵢ(x) is L₂-second-order smooth.)
• Stochastic gradients have bounded variance: ∀x ∈ R^d: E_{i∈_R[n]} ‖∇f(x) − ∇fᵢ(x)‖² ≤ V. (This assumption is needed only for online algorithms.)

3 Neon2 in the Online Setting
We propose Neon2online formally in Algorithm 1. It repeatedly invokes Neon2online_weak (Algorithm 2), whose goal is to solve the NC-search problem with confidence only 2/3; then Neon2online invokes Neon2online_weak repeatedly for log(1/p) times to boost the confidence to 1 − p. We prove the following theorem:
Theorem 1 (Neon2online). Let f(x) = (1/n) Σ_{i=1}^n fᵢ(x), where each fᵢ is L-smooth and L₂-second-order smooth. For every point x₀ ∈ R^d, every δ > 0, and every p ∈ (0, 1], the output v = Neon2online(f, x₀, δ, p) satisfies, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖₂ = 1 and vᵀ∇²f(x₀)v ≤ −δ/2.
Moreover, the total number of stochastic gradient evaluations is O(L² log²(d/p)/δ²).
The proof of Theorem 1 immediately follows from Lemma 3.1 and Lemma 3.2 below.
Lemma 3.1 (Neon2online_weak). In the same setting as Theorem 1, the output v = Neon2online_weak(f, x₀, δ, p) satisfies: if λ_min(∇²f(x₀)) ≤ −δ, then with probability at least 2/3, v ≠ ⊥ and vᵀ∇²f(x₀)v ≤ −(51/100)δ.
Proof sketch of Lemma 3.1. We explain why Neon2online_weak works as follows. Starting from a randomly perturbed point x₁ = x₀ + ξ, it keeps updating x_{t+1} ← x_t − η(∇fᵢ(x_t) − ∇fᵢ(x₀)) for a random index i ∈ [n], and stops either when T iterations are reached, or when ‖x_{t+1} − x₀‖₂ > r. Therefore, we have ‖x_t − x₀‖₂ ≤ r throughout the iterations, and thus can approximate ∇²fᵢ(x₀)(x_t − x₀) using ∇fᵢ(x_t) − ∇fᵢ(x₀), up to error O(r²). This is a small term based on our choice of r.
Ignoring the error term, our updates look like x_{t+1} − x₀ = (I − η∇²fᵢ(x₀))(x_t − x₀). This is exactly the same as Oja's algorithm [21], which is known to approximately compute the minimum eigenvector of ∇²f(x₀) = (1/n) Σ_{i=1}^n ∇²fᵢ(x₀). Using the recent optimal convergence analysis of Oja's algorithm [6], one can conclude that after T₁ = Θ(log(r/σ)/(ηλ)) iterations, where λ = max{0, −λ_min(∇²f(x₀))}, not only is ‖x_{t+1} − x₀‖₂ blown up, but it also aligns well with the minimum eigenvector of ∇²f(x₀). In other words, if λ ≥ δ, then the algorithm must stop before iteration T.
Finally, one has to carefully argue that the error does not blow up in this iterative process. We defer the proof details to Appendix A.2.
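For intuition, here is a minimal NumPy sketch of the Neon2online_weak iteration described above, using gradient differences in place of Hessian-vector products. The oracle grad_i, the toy quadratic, and the specific values of eta, T, r, sigma are illustrative placeholders; Algorithm 2 sets these parameters via the constant C₀.

```python
import numpy as np

def neon2_online_weak(grad_i, n, x0, eta, T, r, sigma, rng):
    """Sketch of the Neon2_weak^online loop: escape-direction search from x0.

    grad_i(i, x) returns the stochastic gradient of f_i at x.
    Returns a unit vector of approximate negative curvature, or None (i.e. bottom).
    """
    d = x0.shape[0]
    xi = rng.standard_normal(d)
    xi *= sigma / np.linalg.norm(xi)          # random perturbation of norm sigma
    x = x0 + xi
    for _ in range(T):
        i = int(rng.integers(n))
        # grad f_i(x) - grad f_i(x0) approximates Hessian(f_i)(x0) @ (x - x0),
        # up to an error of at most L2 * ||x - x0||^2 (Lemma A.1).
        x = x - eta * (grad_i(i, x) - grad_i(i, x0))
        if np.linalg.norm(x - x0) >= r:
            v = x - x0
            return v / np.linalg.norm(v)
    return None

# Toy usage: f_i(x) = 0.5 x^T A x with lambda_min(A) = -1, so the iterates
# blow up along the bottom eigenvector and the returned v has v^T A v close to -1.
rng = np.random.default_rng(0)
d, n = 20, 50
A = np.diag(np.linspace(-1.0, 1.0, d))
v = neon2_online_weak(lambda i, x: A @ x, n, np.zeros(d),
                      eta=0.1, T=2000, r=1e-2, sigma=1e-6, rng=rng)
print(None if v is None else float(v @ A @ v))
```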
Our Lemma 3.2 below tells us that we can verify whether the output v of Neon2online_weak is indeed correct (up to additive error δ/4), so we can boost the success probability to 1 − p.
Lemma 3.2 (verification). In the same setting as Theorem 1, let x, v ∈ R^d be vectors, let i₁, ..., i_m ∈_R [n], and define
z = (1/m) Σ_{j=1}^m vᵀ(∇f_{i_j}(x + v) − ∇f_{i_j}(x)) .
Then, if ‖v‖ ≤ δ/(8L₂) and m = Θ(L² log(1/p)/δ²), with probability at least 1 − p,
| z/‖v‖² − vᵀ∇²f(x)v/‖v‖² | ≤ δ/4 .
4 Neon2 in the Deterministic Setting
We propose Neon2det formally in Algorithm 3 and prove the following theorem:
Algorithm 3 Neon2det(f, x₀, δ, p)
Input: A function f, vector x₀, negative curvature target δ > 0, failure probability p ∈ (0, 1].
1: T ← C₁² log(d/p)·√L/√δ.      ⋄ for a sufficiently large constant C₁
2: ξ ← Gaussian random vector with norm σ;      ⋄ σ := (d/p)^{−2C₁} δ/(T³L₂)
3: x₁ ← x₀ + ξ. y₁ ← ξ, y₀ ← 0.
4: for t ← 1 to T do
5:   y_{t+1} = 2M(y_t) − y_{t−1};      ⋄ M(y) := −(1/L)(∇f(x₀ + y) − ∇f(x₀)) + (1 − 3δ/(4L))y
6:   x_{t+1} = x₀ + y_{t+1} − M(y_t).
7:   if ‖x_{t+1} − x₀‖₂ ≥ r then return (x_{t+1} − x₀)/‖x_{t+1} − x₀‖₂.      ⋄ r := (d/p)^{C₁} σ
8: end for
9: return ⊥.
Theorem 3 (Neon2det). Let f(x) be a function that is L-smooth and L₂-second-order smooth. For every point x₀ ∈ R^d, every δ > 0, and every p ∈ (0, 1], the output v = Neon2det(f, x₀, δ, p) satisfies, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖₂ = 1 and vᵀ∇²f(x₀)v ≤ −δ/2.
Moreover, the total number of full gradient evaluations is O(log²(d/p)·√L/√δ).
Proof sketch of Theorem 3. We explain the high-level intuition of Neon2det and the proof of Theorem 3 as follows. Define M := −(1/L)∇²f(x₀) + (1 − 3δ/(4L))I. We immediately notice that
• all eigenvalues of ∇²f(x₀) in [−3δ/4, L] are mapped to eigenvalues of M in [−1, 1], and
• any eigenvalue of ∇²f(x₀) smaller than −δ is mapped to an eigenvalue of M greater than 1 + δ/(4L).
Therefore, as long as T ≥ Ω̃(L/δ), if we compute x_{T+1} = x₀ + M^T ξ for some random vector ξ, then by the theory of the power method, x_{T+1} − x₀ must be a negative-curvature direction of ∇²f(x₀) with value ≤ −δ/2. There are two issues with this approach.
The first issue is that the degree T of this matrix polynomial M^T can be reduced to T = Ω̃(√L/√δ) if the so-called Chebyshev polynomial is used.
Claim 4.1. Let T_t(x) be the t-th Chebyshev polynomial of the first kind, defined as [27]:
T₀(x) := 1,   T₁(x) := x,   T_{n+1}(x) := 2x·T_n(x) − T_{n−1}(x) .
Then T_t(x) satisfies
T_t(x) ∈ [−1, 1]  if x ∈ [−1, 1];   and   T_t(x) ∈ [ ½(x + √(x²−1))^t , (x + √(x²−1))^t ]  if x > 1.
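As a quick numerical illustration of Claim 4.1 (not part of the proof), the sketch below evaluates T_t via the three-term recurrence and checks the two regimes; the chosen values of x and t are arbitrary placeholders.

```python
import numpy as np

def cheb_T(t, x):
    """t-th Chebyshev polynomial of the first kind via the recurrence in Claim 4.1."""
    T_prev, T_cur = 1.0, x                      # T_0(x), T_1(x)
    if t == 0:
        return T_prev
    for _ in range(t - 1):
        T_prev, T_cur = T_cur, 2.0 * x * T_cur - T_prev   # T_{n+1} = 2x T_n - T_{n-1}
    return T_cur

x, t = 1.01, 50                                 # x slightly above 1, like 1 + delta/(4L)
upper = (x + np.sqrt(x * x - 1.0)) ** t
print(0.5 * upper <= cheb_T(t, x) <= upper)     # True: exponential growth regime
print(abs(cheb_T(30, 0.7)) <= 1.0)              # True: bounded regime on [-1, 1]
```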
Since T_t(x) stays in [−1, 1] when x ∈ [−1, 1], and grows to ≈ (1 + √(x²−1))^t for x ≥ 1, we can use T_T(M) in replacement of M^T. Then, any eigenvalue of M that is above 1 + δ/(4L) grows at a speed like (1 + √(δ/L))^T, so it suffices to choose T ≥ Ω̃(√L/√δ). This is quadratically faster than applying the power method, so in Neon2det we wish to compute x_{t+1} ≈ x₀ + T_t(M)ξ.
The second issue is that, since we cannot compute Hessian-vector products, we have to use the gradient difference to approximate them; that is, we can only use M(y) to approximate My, where
M(y) := −(1/L)(∇f(x₀ + y) − ∇f(x₀)) + (1 − 3δ/(4L))y .
How does the error propagate if we compute T_t(M)ξ with M replaced by M(·)? Note that this is a very non-trivial question, because the coefficients of the polynomial T_t(x) are as large as 2^{O(t)}.
It turns out that the way the error propagates depends on how the Chebyshev polynomial is calculated. If the so-called backward recurrence formula is used, namely
y₀ = 0,   y₁ = ξ,   y_t = 2M(y_{t−1}) − y_{t−2} ,
and we set x_{T+1} = x₀ + y_{T+1} − M(y_T), then this x_{T+1} is sufficiently close to the exact value x₀ + T_T(M)ξ. This is known as the stability theory of computing Chebyshev polynomials, and is proved in our prior work [5].
We defer all the proof details to Appendix B.2.
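To make the recurrence concrete, here is a minimal NumPy sketch of the Neon2det loop; the full-gradient oracle, the toy quadratic, and the values of T, r, sigma are placeholders (Algorithm 3 fixes them via the constant C₁).

```python
import numpy as np

def neon2_det(grad, x0, delta, L, T, r, sigma, rng):
    """Sketch of Neon2det: Chebyshev-style escape via the backward recurrence.

    grad(x) is the full gradient of f.  Only gradient differences are used:
    M(y) = -(1/L)(grad(x0+y) - grad(x0)) + (1 - 3*delta/(4L)) y.
    """
    def M(y):
        return -(grad(x0 + y) - grad(x0)) / L + (1.0 - 3.0 * delta / (4.0 * L)) * y

    d = x0.shape[0]
    xi = rng.standard_normal(d)
    xi *= sigma / np.linalg.norm(xi)
    y_prev, y_cur = np.zeros(d), xi          # y_0 = 0, y_1 = xi
    for _ in range(T):
        y_next = 2.0 * M(y_cur) - y_prev     # backward (three-term) recurrence
        x_new = x0 + y_next - M(y_cur)       # approximately x0 + T_t(M) xi
        if np.linalg.norm(x_new - x0) >= r:
            v = x_new - x0
            return v / np.linalg.norm(v)
        y_prev, y_cur = y_cur, y_next
    return None

# Toy usage on f(x) = 0.5 x^T A x with lambda_min(A) = -1.
rng = np.random.default_rng(1)
d = 20
A = np.diag(np.linspace(-1.0, 1.0, d))
v = neon2_det(lambda x: A @ x, np.zeros(d), delta=0.5, L=1.0,
              T=200, r=1e-2, sigma=1e-8, rng=rng)
print(None if v is None else float(v @ A @ v))   # close to -1
```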
5 Neon2 in the SVRG Setting
Recall how the shift-and-invert (SI) approach [12], on top of the SVRG method [26], solves the minimum eigenvector problem. Given any matrix A = ∇²f(x₀), suppose its eigenvalues are λ₁ ≤ ··· ≤ λ_d. Then, if λ > −λ₁, we can define the positive semidefinite matrix B = (λI + A)^{−1} and apply the power method to find an (approximate) maximum eigenvector of B, which is necessarily an (approximate) minimum eigenvector of A.
The SI approach specifies a binary-search routine to determine the shifting constant λ, and ensures that B = (λI + A)^{−1} is always “well conditioned,” meaning that it suffices to apply the power method on B for a logarithmic number of iterations. In other words, the task of computing the minimum eigenvector of A reduces to computing matrix-vector products By a poly-logarithmic number of times. Moreover, the stability of SI was shown in a number of papers, including [12] and [4]. This means it suffices for us to compute By approximately.
But how can we compute By for an arbitrary vector y? It turns out this is equivalent to minimizing a convex quadratic function of finite-sum form:
g(z) := ½ zᵀ(λI + A)z + yᵀz = (1/(2n)) Σ_{i=1}^n zᵀ(λI + ∇²fᵢ(x₀))z + yᵀz .
Therefore, one can apply a variant of the SVRG method (arguably first discovered by Shalev-Shwartz [26]) to solve this task. In each iteration, SVRG needs to evaluate a stochastic gradient (λI + ∇²fᵢ(x₀))z + y at some point z for a random i ∈ [n]. Instead of evaluating it exactly (which would require a Hessian-vector product), we use ∇fᵢ(x₀ + z) − ∇fᵢ(x₀) to approximate ∇²fᵢ(x₀)·z.
Of course, one also needs to show that the SVRG method is stable to this noise. Using similar techniques as in the previous two sections, one can show that the error term is proportional to O(‖z‖₂²), and thus as long as the norm of z is bounded (just as in the previous two sections), this does not affect the performance of the algorithm. We decide to omit the detailed theoretical proof of this result, because it would complicate this paper.
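A minimal sketch of the Hessian-free stochastic gradient of g(z) described above; the surrounding SVRG solver and the shift-and-invert binary search are omitted, and the toy check simply runs plain stochastic-gradient steps on g rather than the full variance-reduced solver.

```python
import numpy as np

def hessian_free_grad_of_g(grad_i, n, x0, lam, y, z, rng):
    """Stochastic gradient of g(z) = 0.5 z^T (lam*I + A) z + y^T z with A = Hessian f(x0).

    The exact stochastic gradient is (lam*I + Hessian(f_i)(x0)) z + y; the
    Hessian-vector product is replaced by the gradient difference
    grad f_i(x0 + z) - grad f_i(x0), incurring an O(L2 ||z||^2) error.
    """
    i = int(rng.integers(n))
    hvp_approx = grad_i(i, x0 + z) - grad_i(i, x0)
    return lam * z + hvp_approx + y

# Toy check: for a quadratic f_i the approximation is exact, and plain
# stochastic-gradient steps on g drive z toward -(lam*I + A)^{-1} y = -B y.
rng = np.random.default_rng(0)
d, n, lam = 5, 50, 2.0
A = np.diag(np.linspace(-1.0, 1.0, d))
y = np.ones(d)
z = np.zeros(d)
for _ in range(2000):
    z -= 0.1 * hessian_free_grad_of_g(lambda i, x: A @ x, n, np.zeros(d), lam, y, z, rng)
print(np.max(np.abs(z + np.linalg.solve(lam * np.eye(d) + A, y))))   # small residual
```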
Theorem 3 (Neon2svrg). Let f(x) = (1/n) Σ_{i=1}^n fᵢ(x), where each fᵢ is L-smooth and L₂-second-order smooth. For every point x₀ ∈ R^d, every δ > 0, and every p ∈ (0, 1], the output v = Neon2svrg(f, x₀, δ, p) satisfies, with probability at least 1 − p:
1. If v = ⊥, then ∇²f(x₀) ⪰ −δI.
2. If v ≠ ⊥, then ‖v‖₂ = 1 and vᵀ∇²f(x₀)v ≤ −δ/2.
Moreover, the total number of stochastic gradient evaluations is Õ(n + n^{3/4}·√L/√δ).

6 Applications of Neon2
We show how Neon2online can be applied to existing algorithms such as SGD, GD, SCSG, SVRG, Natasha2, and CDHS. Unfortunately, we are unaware of a generic statement for applying Neon2online to an arbitrary algorithm; therefore, we have to prove the reductions individually.⁶
Throughout this section, we assume that some starting vector x₀ ∈ R^d and an upper bound ∆_f satisfying f(x₀) − min_x{f(x)} ≤ ∆_f are given to the algorithm. This is only for the purpose of proving theoretical bounds. In practice, because ∆_f only appears in specifying the number of iterations, one can just run a sufficient number of iterations and then halt the algorithm, without the necessity of knowing ∆_f.
6.1 Auxiliary Claims
Claim 6.1. For any x, using O((V/ε² + 1) log(1/p)) stochastic gradients, we can decide, with probability 1 − p, either ‖∇f(x)‖ ≥ ε/2 or ‖∇f(x)‖ ≤ ε.
Proof. Suppose we generate m = O(log(1/p)) uniformly random subsets S₁, ..., S_m of [n], each of cardinality B = max{32V/ε², 1}. Then, denoting by v_j = (1/B) Σ_{i∈S_j} ∇fᵢ(x), we have according to Fact 2.2 that E_{S_j}‖v_j − ∇f(x)‖² ≤ V/B = ε²/32. In other words, with probability at least 1/2 over the randomness of S_j, we have |‖v_j‖ − ‖∇f(x)‖| ≤ ‖v_j − ∇f(x)‖ ≤ ε/4. Since m = O(log(1/p)), with probability at least 1 − p, at least m/2 + 1 of the vectors v_j satisfy |‖v_j‖ − ‖∇f(x)‖| ≤ ε/4. Now, if we select v* = v_j where j ∈ [m] is the index that gives the median value of ‖v_j‖, then it satisfies |‖v*‖ − ‖∇f(x)‖| ≤ ε/4. Finally, we can check whether ‖v*‖ ≤ 3ε/4. If so, then we conclude that ‖∇f(x)‖ ≤ ε, and if not, we conclude that ‖∇f(x)‖ ≥ ε/2.
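The following sketch mirrors this median-of-estimates test; the constants (32, 3/4) follow the proof above, while the toy objective and the oracle grad_i are illustrative placeholders.

```python
import numpy as np

def gradient_norm_test(grad_i, n, x, eps, variance, m, rng):
    """Decide '||grad f(x)|| <= eps' vs '||grad f(x)|| >= eps/2' (sketch of Claim 6.1).

    Uses m mini-batch estimates of the gradient and takes the one with the
    median norm, as in the proof; m ~ O(log(1/p)) controls the failure probability.
    """
    B = max(int(np.ceil(32.0 * variance / eps ** 2)), 1)
    norms = []
    for _ in range(m):
        S = rng.integers(n, size=B)
        v = np.mean([grad_i(int(i), x) for i in S], axis=0)
        norms.append(np.linalg.norm(v))
    med = float(np.median(norms))
    return "small" if med <= 0.75 * eps else "large"

# Toy usage: f_i(x) = 0.5 ||x||^2 + noise_i^T x with mean-zero noise vectors,
# so grad f(0) = 0 and the test should report "small".
rng = np.random.default_rng(0)
d, n = 10, 200
noise = rng.standard_normal((n, d))
noise -= noise.mean(axis=0)                      # make the noise average to zero
grad_i = lambda i, x: x + noise[i]
print(gradient_norm_test(grad_i, n, np.zeros(d), eps=0.5,
                         variance=float(d), m=9, rng=rng))
```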
Claim 6.2. If v is a unit vector and vᵀ∇²f(y)v ≤ −δ/2, suppose we choose y′ = y ± (δ/L₂)v where the sign is random. Then f(y) − E[f(y′)] ≥ δ³/(12L₂²).
Proof. Letting η = δ/L₂, then by the second-order smoothness,
f(y) − E[f(y′)] ≥ E[ ⟨∇f(y), y − y′⟩ − ½(y − y′)ᵀ∇²f(y)(y − y′) − (L₂/6)‖y − y′‖³ ]
= −(η²/2)·vᵀ∇²f(y)v − (L₂η³/6)‖v‖³ ≥ δη²/4 − L₂η³/6 = δ³/(12L₂²) .
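A tiny sketch of this negative-curvature step, together with a sanity check of the claimed decrease on a toy quadratic (for which any L₂ ≥ 0 is a valid second-order smoothness constant; we use L₂ = 1 for illustration).

```python
import numpy as np

def nc_descent_step(y, v_unit, delta, L2, rng):
    """One negative-curvature step y' = y +/- (delta/L2) v with a random sign (Claim 6.2)."""
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return y + sign * (delta / L2) * v_unit

# Toy check on f(x) = 0.5 x^T A x with v = e1 and e1^T A e1 = -1 <= -delta/2.
A = np.diag([-1.0, 0.5, 0.5])
f = lambda x: 0.5 * x @ A @ x
y, v, delta, L2 = np.zeros(3), np.eye(3)[0], 0.6, 1.0
y_new = nc_descent_step(y, v, delta, L2, np.random.default_rng(0))
avg_decrease = np.mean([f(y) - f(y + s * (delta / L2) * v) for s in (+1.0, -1.0)])
print(f(y) - f(y_new), avg_decrease, delta ** 3 / (12 * L2 ** 2))
# both the realized and the expected decrease exceed the bound delta^3/(12 L2^2)
```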
6.2 Neon2 on SGD and GD
To apply Neon2 to turn SGD into an algorithm finding approximate local minima, we propose the following process Neon2+SGD (see Algorithm 4). In each iteration t, we first apply SGD with mini-batch size O(1/ε²) (see Line 4). Then, if SGD finds a point with small gradient, we apply Neon2online to decide if it has a negative-curvature direction; if so, we move in the direction of the negative curvature (see Line 10). We have the following theorem:
⁶This is because stationary-point finding algorithms have somewhat different guarantees. For instance, in mini-batch SGD we have f(x_t) − E[f(x_{t+1})] ≥ Ω(‖∇f(x_t)‖²), but in SCSG we have f(x_t) − E[f(x_{t+1})] ≥ E[Ω(‖∇f(x_{t+1})‖²)].
Algorithm 4 Neon2+SGD(f, x₀, p, ε, δ)
Input: function f(·), starting vector x₀, confidence p ∈ (0, 1), ε > 0 and δ > 0.
1: K ← O( L₂²∆_f/δ³ + L∆_f/ε² );      ⋄ ∆_f is any upper bound on f(x₀) − min_x{f(x)}
2: for t ← 0 to K − 1 do
3:   S ← a uniform random subset of [n] with cardinality |S| = B := max{8V/ε², 1};
4:   x_{t+1/2} ← x_t − (1/(L|S|)) Σ_{i∈S} ∇fᵢ(x_t);
5:   if ‖∇f(x_t)‖ ≥ ε/2 then      ⋄ estimate ‖∇f(x_t)‖ using O(ε^{−2}V log(K/p)) stochastic gradients
6:     x_{t+1} ← x_{t+1/2};
7:   else      ⋄ necessarily ‖∇f(x_t)‖ ≤ ε
8:     v ← Neon2online(f, x_t, δ, p/(2K));
9:     if v = ⊥ then return x_t;      ⋄ necessarily ∇²f(x_t) ⪰ −δI
10:    else x_{t+1} ← x_t ± (δ/L₂)v;      ⋄ necessarily vᵀ∇²f(x_t)v ≤ −δ/2
11:   end if
12: end for
13: will not reach this line (with probability ≥ 1 − p).
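For readers who prefer code, here is a schematic NumPy rendering of the Neon2+SGD loop above. The routine neon2_online stands for any NC-search oracle with the guarantee of Theorem 1, grad_i is the stochastic gradient oracle, and the constants are illustrative rather than those of the analysis.

```python
import numpy as np

def neon2_plus_sgd(grad_i, n, x0, eps, delta, L, L2, variance, neon2_online, K, rng):
    """Sketch of Neon2+SGD (Algorithm 4): mini-batch SGD while the gradient is
    large; NC-search and negative-curvature steps once it becomes small."""
    x = x0.copy()
    B = max(int(np.ceil(8.0 * variance / eps ** 2)), 1)
    for _ in range(K):
        S = rng.integers(n, size=B)
        g = np.mean([grad_i(int(i), x) for i in S], axis=0)   # mini-batch gradient estimate
        if np.linalg.norm(g) >= eps / 2.0:
            x = x - g / L                                      # SGD step with step size 1/L
            continue
        v = neon2_online(grad_i, n, x, delta, rng)             # NC-search at the current point
        if v is None:
            return x                                           # approximate local minimum
        sign = 1.0 if rng.random() < 0.5 else -1.0
        x = x + sign * (delta / L2) * v                        # negative-curvature descent step
    return x
```

In this sketch the mini-batch estimate g doubles as the gradient-norm test of Line 5; in Algorithm 4 and its analysis, that test is performed by the separate estimator of Claim 6.1.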
Algorithm 5 Neon2+GD(f, x₀, p, ε, δ)
Input: function f(·), starting vector x₀, confidence p ∈ (0, 1), ε > 0 and δ > 0.
1: K ← O( L₂²∆_f/δ³ + L∆_f/ε² );      ⋄ ∆_f is any upper bound on f(x₀) − min_x{f(x)}
2: for t ← 0 to K − 1 do
3:   x_{t+1/2} ← x_t − (1/L)∇f(x_t);
4:   if ‖∇f(x_t)‖ ≥ ε/2 then
5:     x_{t+1} ← x_{t+1/2};
6:   else
7:     v ← Neon2det(f, x_t, δ, p/(2K));
8:     if v = ⊥ then return x_t;      ⋄ necessarily ∇²f(x_t) ⪰ −δI
9:     else x_{t+1} ← x_t ± (δ/L₂)v;      ⋄ necessarily vᵀ∇²f(x_t)v ≤ −δ/2
10:   end if
11: end for
12: will not reach this line (with probability ≥ 1 − p).
Theorem 5a. With probability at least 1 − p, Neon2+SGD outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ( (V/ε² + 1)·(L₂²∆_f/δ³ + L∆_f/ε²) + (L²/δ²)·(L₂²∆_f/δ³) ).
Corollary 6.3. Treating ∆_f, V, L, L₂ as constants, we have T = Õ(1/ε⁴ + 1/(ε²δ³) + 1/δ⁵).
One can similarly (and more easily) give an algorithm Neon2+GD, which is the same as Neon2+SGD except that the mini-batch SGD is replaced with full gradient descent, and the use of Neon2online is replaced with Neon2det. We have the following theorem:
Theorem 5c. With probability at least 1 − p, Neon2+GD outputs an (ε, δ)-approximate local minimum in Õ( L∆_f/ε² + (L^{1/2}/δ^{1/2})·(L₂²∆_f/δ³) ) full gradient computations.
We only prove Theorem 5a in Appendix C; the proof of Theorem 5c is only simpler.
6.3 Neon2 on SCSG and SVRG
Background. We first recall the main idea of the SVRG method for non-convex optimization [3, 23]. It is an offline method, but it is what SCSG is built on. SVRG divides iterations into epochs, each of length n. It maintains a snapshot point x̃ for each epoch, and computes the full gradient ∇f(x̃) only for snapshots. Then, in each iteration t at point x_t, SVRG defines the gradient estimator ∇̃f(x_t) := ∇fᵢ(x_t) − ∇fᵢ(x̃) + ∇f(x̃), which satisfies Eᵢ[∇̃f(x_t)] = ∇f(x_t), and performs the update x_{t+1} ← x_t − α∇̃f(x_t) for a learning rate α.
The SCSG method of Lei et al. [18] proposed a simple fix to turn SVRG into an online method. They changed the epoch length of SVRG from n to B ≈ 1/ε², and replaced the computation of ∇f(x̃) with (1/|S|) Σ_{i∈S} ∇fᵢ(x̃), where S is a random subset of [n] with cardinality |S| = B. To make this approach even more general, they also analyzed SCSG in the mini-batch setting, with mini-batch size b ∈ {1, 2, ..., B}.⁷ Their Theorem 3.1 [18] says the following:
Lemma 6.4 ([18]). There exists a constant C > 1 such that, if we run SCSG for an epoch of size B (so using O(B) stochastic gradients)⁸ with mini-batch size b ∈ {1, 2, ..., B}, starting from a point x_t and moving to x_t⁺, then
E[‖∇f(x_t⁺)‖²] ≤ C·L·(b/B)^{1/3}·( f(x_t) − E[f(x_t⁺)] ) + 6V/B .
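For concreteness, the sketch below implements one simplified SCSG-style epoch in the sense of Lemma 6.4; unlike the real SCSG of [18], it uses a fixed inner-loop length B/b instead of a geometrically distributed one, and the step size alpha is an untuned placeholder.

```python
import numpy as np

def scsg_epoch(grad_i, n, x, B, b, alpha, rng):
    """One simplified SCSG-style epoch (cf. Lemma 6.4 and [18])."""
    snapshot = x.copy()
    S = rng.integers(n, size=B)                                    # subsampled snapshot gradient
    g_snap = np.mean([grad_i(int(i), snapshot) for i in S], axis=0)
    for _ in range(max(B // b, 1)):                                # fixed epoch length B/b
        I = rng.integers(n, size=b)
        corr = np.mean([grad_i(int(i), x) - grad_i(int(i), snapshot) for i in I], axis=0)
        x = x - alpha * (g_snap + corr)                            # variance-reduced estimator
    return x
```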
Our Approach. In principle, one could apply the same idea as Neon2+SGD to SCSG to turn it into an algorithm finding approximate local minima. Unfortunately, this is not quite possible, because the left-hand side of Lemma 6.4 is E[‖∇f(x_t⁺)‖²], as opposed to ‖∇f(x_t)‖² in SGD (see (C.1)). This means that, instead of testing whether x_t is a good local minimum (as we did in Neon2+SGD), this time we need to test whether x_t⁺ is a good local minimum. This creates some extra difficulty, so we need a different proof.
Remark 6.5. As for the parameters of SCSG, we simply use B = max{1, 48V/ε²}. However, choosing mini-batch size b = 1 does not necessarily give the best complexity, so a tradeoff b = Θ( (ε² + V)ε⁴L₂⁶/(δ⁹L³) ) is needed. (A similar tradeoff was also discovered by the authors of Neon [28].) Note that this quantity b may be larger than B, and if this happens, SCSG becomes essentially equivalent to one iteration of SGD with mini-batch size b. Instead of analyzing this boundary case b > B separately, we simply run Neon2+SGD whenever b > B happens, to simplify our proof.
We show the following theorem (proved in Appendix C).
Theorem 5b. With probability at least 2/3, Neon2+SCSG outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ( ( L∆_f/(ε^{4/3}V^{1/3}) + L₂²∆_f/δ³ ) · ( V/ε² + L²/δ² ) ).
(To provide the simplest proof, we have shown Theorem 5b only with probability 2/3. One can, for instance, boost the confidence to 1 − p by running log(1/p) copies of Neon2+SCSG.)
Corollary 6.6. Treating ∆_f, V, L, L₂ as constants, we have T = Õ(1/ε^{10/3} + 1/(ε²δ³) + 1/δ⁵).
⁷That is, they reduced the epoch length to B/b, and replaced ∇fᵢ(x_t) − ∇fᵢ(x̃) with (1/|S′|) Σ_{i∈S′} (∇fᵢ(x_t) − ∇fᵢ(x̃)) for some S′ that is a random subset of [n] with cardinality |S′| = b.
⁸We remark that Lei et al. [18] only showed that an epoch runs in an expected O(B) stochastic gradients. We assume it is exact here to simplify proofs. One can, for instance, stop SCSG after O(B log(1/p)) stochastic gradient computations, and then Lemma 6.4 will succeed with probability ≥ 1 − p.
Algorithm 6 Neon2+SCSG(f, x₀, ε, δ)
Input: function f(·), starting vector x₀, ε > 0 and δ > 0.
1: B ← max{1, 48V/ε²}; b ← max{1, Θ( (ε² + V)ε⁴L₂⁶/(δ⁹L³) )};
2: if b > B then return Neon2+SGD(f, x₀, 2/3, ε, δ);      ⋄ for cleaner analysis purposes, see Remark 6.5
3: K ← Θ( Lb^{1/3}∆_f/(ε^{4/3}V^{1/3}) );      ⋄ ∆_f is any upper bound on f(x₀) − min_x{f(x)}
4: for t ← 0 to K − 1 do
5:   x_{t+1/2} ← apply SCSG on x_t for one epoch of size B = max{Θ(V/ε²), 1};
6:   if ‖∇f(x_{t+1/2})‖ ≥ ε/2 then      ⋄ estimate ‖∇f(x_{t+1/2})‖ using O(ε^{−2}V log K) stochastic gradients
7:     x_{t+1} ← x_{t+1/2};
8:   else      ⋄ necessarily ‖∇f(x_{t+1/2})‖ ≤ ε
9:     v ← Neon2online(f, x_{t+1/2}, δ, 1/(20K));
10:    if v = ⊥ then return x_{t+1/2};      ⋄ necessarily ∇²f(x_{t+1/2}) ⪰ −δI
11:    else x_{t+1} ← x_{t+1/2} ± (δ/L₂)v;      ⋄ necessarily vᵀ∇²f(x_{t+1/2})v ≤ −δ/2
12:   end if
13: end for
14: will not reach this line (with probability ≥ 2/3).
As for SVRG, it is an offline method and its one-epoch lemma reads⁹
E[‖∇f(x_t⁺)‖²] ≤ C·L·n^{−1/3}·( f(x_t) − E[f(x_t⁺)] ) .
If one replaces the use of Lemma 6.4 with this new inequality, and replaces the use of Neon2online with Neon2svrg, then we get the following theorem:
Theorem 5d. With probability at least 2/3, Neon2+SVRG outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ( (n + n^{3/4}·√L/√δ) · ( L∆_f/(ε²n^{1/3}) + L₂²∆_f/δ³ ) ).
For a cleaner presentation of this paper, we omit the pseudocode and proof, because they are only simpler than those for Neon2+SCSG.
6.4 Neon2 on Natasha2 and CDHS
The recent results of Carmon et al. [7] (which we refer to as CDHS) and Natasha2 [2] are methods whose only Hessian-vector product computations come from exactly the NC-search process we study in this paper. Therefore, by replacing their NC-search with Neon2, we can directly turn them into first-order methods, without the necessity of computing Hessian-vector products.
We state the following two theorems, whose proofs are exactly the same as in the papers [7] and [2]. We directly state them assuming ∆_f, V, L, L₂ are constants, to simplify our notation.
Theorem 2. One can replace Oja's algorithm with Neon2online in Natasha2 without hurting its performance, turning it into a first-order stochastic method.
Treating ∆_f, V, L, L₂ as constants, Natasha2 finds an (ε, δ)-approximate local minimum in T = Õ(1/ε^{3.25} + 1/(ε³δ) + 1/δ⁵) stochastic gradient computations.
Theorem 4. One can replace the Lanczos method with Neon2det or Neon2svrg in CDHS without hurting its performance, turning it into a first-order method.
Treating ∆_f, L, L₂ as constants, CDHS finds an (ε, δ)-approximate local minimum in either Õ(1/ε^{1.75} + 1/δ^{3.5}) full gradient computations (if Neon2det is used) or T = Õ(n/ε^{1.5} + n/δ³ + n^{3/4}/ε^{1.75} + n^{3/4}/δ^{3.5}) stochastic gradient computations (if Neon2svrg is used).
⁹There are at least three different variants of SVRG [3, 18, 23]. We have adopted the lemma of [18] for simplicity.
Acknowledgements
We would like to thank Tianbao Yang and Yi Xu for helpful feedback on this manuscript.
Appendix
A Missing Proofs for Section 3
A.1 Auxiliary Lemmas
We use the following lemma to approximate Hessian-vector products:
Lemma A.1. If f(x) is L₂-second-order smooth, then for every point x ∈ R^d and every vector v ∈ R^d, we have
‖∇f(x + v) − ∇f(x) − ∇²f(x)v‖₂ ≤ L₂‖v‖₂² .
Proof of Lemma A.1. We can write ∇f(x + v) − ∇f(x) = ∫₀¹ ∇²f(x + tv)v dt. Subtracting ∇²f(x)v, we have
‖∇f(x + v) − ∇f(x) − ∇²f(x)v‖₂ = ‖ ∫₀¹ (∇²f(x + tv) − ∇²f(x))v dt ‖₂ ≤ ∫₀¹ ‖∇²f(x + tv) − ∇²f(x)‖₂ ‖v‖₂ dt ≤ L₂‖v‖₂² .
We need the following auxiliary lemma about martingale concentration:
Lemma A.2. Consider random events {F_t}_{t≥1} and random variables x₁, ..., x_T ≥ 0 and a₁, ..., a_T ∈ [−ρ, ρ] for ρ ∈ [0, 1/2], where each x_t and a_t only depend on F₁, ..., F_t. Let x₀ = 0 and suppose there exist constants b ≥ 0 and λ > 0 such that for every t ≥ 1:
x_t ≤ x_{t−1}(1 − a_t) + b   and   E[a_t | F₁, ..., F_{t−1}] ≥ −λ .
Then, for every p ∈ (0, 1): Pr[ x_T ≥ T·b·e^{λT + 2ρ√(T log(T/p))} ] ≤ p.
Proof. We know that
x_T ≤ (1 − a_T)x_{T−1} + b ≤ (1 − a_T)((1 − a_{T−1})x_{T−2} + b) + b = (1 − a_T)(1 − a_{T−1})x_{T−2} + (1 − a_T)b + b ≤ ··· ≤ Σ_{s=2}^{T} ( Π_{t=s}^{T} (1 − a_t) ) b + b .
For each s ∈ [T], consider the random process defined as
y_s = b,   y_{t+1} = (1 − a_t)y_t for t ≥ s .
Therefore
log y_{t+1} = log(1 − a_t) + log y_t ,
where log(1 − a_t) ∈ [−2ρ, ρ] and E[log(1 − a_t) | F₁, ..., F_{t−1}] ≤ λ. Thus, we can apply the Azuma–Hoeffding inequality to log y_t to conclude that
Pr[ y_T ≥ b·e^{λT + 2ρ√(T log(T/p))} ] ≤ p/T .
Taking a union bound over s, we complete the proof.
A.2 Proof of Lemma 3.1
Proof of Lemma 3.1. Let i_t ∈ [n] be the random index i chosen when computing x_{t+1} from x_t in Line 5 of Neon2online_weak. We will write the update rule of x_t in terms of the Hessian before we stop. By Lemma A.1, we know that for every t ≥ 1,
‖∇f_{i_t}(x_t) − ∇f_{i_t}(x₀) − ∇²f_{i_t}(x₀)(x_t − x₀)‖₂ ≤ L₂‖x_t − x₀‖₂² .
Therefore, there exists an error vector ξ_t ∈ R^d with ‖ξ_t‖₂ ≤ L₂‖x_t − x₀‖₂² such that
x_{t+1} − x₀ = (x_t − x₀) − η∇²f_{i_t}(x₀)(x_t − x₀) + ηξ_t .
For notational simplicity, let us denote
z_t := x_t − x₀,   B_t := ∇²f_{i_t}(x₀),   R_t := −ξ_t z_tᵀ/‖z_t‖₂²,   A_t := B_t + R_t ;
then it satisfies
z_{t+1} = z_t − ηB_t z_t + ηξ_t = (I − ηA_t)z_t .
We have ‖R_t‖₂ ≤ L₂‖z_t‖₂ ≤ L₂·r. By the L-smoothness of f_{i_t}, we know ‖B_t‖₂ ≤ L and thus ‖A_t‖₂ ≤ ‖B_t‖₂ + ‖R_t‖₂ ≤ ‖B_t‖₂ + L₂r ≤ 2L.
Now, define Φ_t := z_{t+1}z_{t+1}ᵀ = (I − ηA_t)···(I − ηA₁)ξξᵀ(I − ηA₁)···(I − ηA_t) and w_t := z_t/‖z_t‖₂ = z_t/(Tr(Φ_{t−1}))^{1/2}. Then, before we stop, we have
Tr(Φ_t) = Tr(Φ_{t−1})·(1 − 2ηw_tᵀA_t w_t + η²w_tᵀA_t²w_t)
≤ Tr(Φ_{t−1})·(1 − 2ηw_tᵀA_t w_t + 4η²L²)
≤ Tr(Φ_{t−1})·(1 − 2ηw_tᵀB_t w_t + 2η‖R_t‖₂ + 4η²L²)
≤① Tr(Φ_{t−1})·(1 − 2ηw_tᵀB_t w_t + 8η²L²) .
Above, ① is because our choice of parameters satisfies r ≤ ηL²/L₂. Therefore,
log(Tr(Φ_t)) ≤ log(Tr(Φ_{t−1})) + log(1 − 2ηw_tᵀB_t w_t + 8η²L²) .
Letting λ := −λ_min(∇²f(x₀)) = −λ_min(E_{B_t}[B_t]), since the randomness of B_t is independent of w_t, we know that w_tᵀB_t w_t ∈ [−L, L] and, for every w_t, E_{B_t}[w_tᵀB_t w_t | w_t] ≥ −λ. This (by concavity of log) also implies that E[log(1 − 2ηw_tᵀB_t w_t + 8η²L²)] ≤ 2ηλ and log(1 − 2ηw_tᵀB_t w_t + 8η²L²) ∈ [−2(2ηL + 8η²L²), 2ηL + 8η²L²] ⊆ [−6ηL, 3ηL].
Hence, applying the Azuma–Hoeffding inequality to log(Tr(Φ_t)), we have
Pr[ log(Tr(Φ_t)) − log(Tr(Φ₀)) ≥ 2ηλt + 16ηL√(t log(1/p)) ] ≤ p .
In other words, with probability at least 1 − p, Neon2online_weak will not terminate until t ≥ T₀, where T₀ is given by the equation (recall Tr(Φ₀) = ‖z₁‖² = σ²):
2ηλT₀ + 16ηL√(T₀ log(1/p)) = log(r²/σ²) .
Next, we turn to accuracy. Let the “true” vector be v_{t+1} := (I − ηB_t)···(I − ηB₁)ξ, so that
z_{t+1} − v_{t+1} = Π_{s=1}^{t}(I − ηA_s)ξ − Π_{s=1}^{t}(I − ηB_s)ξ = (I − ηB_t)(z_t − v_t) − ηR_t z_t .
Thus, if we call u_t := z_t − v_t with u₁ = 0, then, before the algorithm stops, we have
‖u_{t+1} − (I − ηB_t)u_t‖₂ ≤ η‖R_t z_t‖₂ ≤ ηL₂r² .
Using Young's inequality ‖a + b‖₂² ≤ (1 + β)‖a‖₂² + (1 + 1/β)‖b‖₂² for every β > 0, we have
‖u_{t+1}‖₂² ≤ (1 + η)‖(I − ηB_t)u_t‖₂² + 8L₂²r⁴ ≤① ‖u_t‖₂²·(1 − 2η·u_tᵀB_t u_t/‖u_t‖₂² + 4η²L²) + 8L₂²r⁴ .
Above, ① assumes without loss of generality that L ≥ 1 (as otherwise we can re-scale the problem).
Therefore, applying the martingale concentration bound of Lemma A.2, we know
Pr[ ‖u_t‖₂² ≥ 16L₂²r⁴·t·e^{2ηλt + 8ηL√(t log(t/p))} ] ≤ p .
Now we can apply the recent analysis of Oja's algorithm [6, Theorem 4]. By our choice of parameter η, with probability at least 99/100 the following holds:
1. Norm growth: ‖v_t‖₂ ≥ e^{(ηλ − 2η²L²)t}·σ/d.
2. Negative curvature: v_tᵀ∇²f(x₀)v_t / ‖v_t‖₂² ≤ −(1 − 2η)λ + O(log(d)/(ηt)).
Now let us consider the case λ ≥ δ, and consider a fixed T₁ defined as
T₁ = log(2dr/σ)/(ηλ) = C₀(log(d/p) + log(2d))/(ηλ)