Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLconf SF - 11/13/15

Beating the Perils of Non-Convexity:
Machine Learning using Tensor Methods
Anima Anandkumar
U.C. Irvine

Learning with Big Data
Learning is ﬁnding needle in a haystack

High dimensional regime: as data grows, more variables!
Useful information: low-dimensional structures.
Learning with big data: ill-posed problem.

High dimensional regime: as data grows, more variables!
Useful information: low-dimensional structures.
Learning with big data: ill-posed problem.
Learning with big data: statistically and computationally challenging!

Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning
Clustering
k-means, hierarchical . . .
Maximum Likelihood Estimator
Probabilistic latent variable models
Supervised Learning
Optimizing a neural network with
respect to a loss function Input
Neuron
Output

Convex vs. Non-convex Optimization
Progress is only tip of the iceberg..
Images taken from https://www.facebook.com/nonconvex

Convex vs. Non-convex Optimization
Progress is only tip of the iceberg.. Real world is mostly non-convex!
Images taken from https://www.facebook.com/nonconvex

Convex vs. Nonconvex Optimization
Unique optimum: global/local. Multiple local optima

In high dimensions possibly
exponential local optima

In high dimensions possibly
exponential local optima
How to deal with non-convexity?

Outline
1 Introduction
2 Spectral Methods: Matrices and Tensors
3 Applications of Tensor Methods
4 Conclusion

Matrices and Tensors as Data Structures
Modern data: Multi-modal,
multi-relational data.
Matrices: pairwise relations. Tensors: higher order relations.
Multi-modal data ﬁgure from Lise Getoor slides.

Tensor Representation of Higher Order Moments
Matrix: Pairwise Moments
E[x ⊗ x] ∈ Rd×d is a second order tensor.
E[x ⊗ x]i1,i2 = E[xi1 xi2 ].
For matrices: E[x ⊗ x] = E[xx⊤].
Tensor: Higher order Moments
E[x ⊗ x ⊗ x] ∈ Rd×d×d is a third order tensor.
E[x ⊗ x ⊗ x]i1,i2,i3 = E[xi1 xi2 xi3 ].

Spectral Decomposition of Tensors
M2 =
i
λiui ⊗ vi
= + ....
Matrix M2 λ1u1 ⊗ v1 λ2u2 ⊗ v2

Spectral Decomposition of Tensors
M2 =
i
λiui ⊗ vi
= + ....
Matrix M2 λ1u1 ⊗ v1 λ2u2 ⊗ v2
M3 =
i
λiui ⊗ vi ⊗ wi
= + ....
Tensor M3 λ1u1 ⊗ v1 ⊗ w1 λ2u2 ⊗ v2 ⊗ w2
We have developed eﬃcient methods to solve tensor decomposition.

Tensor Decomposition Methods
SVD on Pairwise Moments to
obtain W
Fast SVD via Random
Projection
+ ++ +
Dimensionality Reduction of
third order moments using W
+ ++ +
Power method on reduced tensor T: v →
T(I, v, v)
T(I, v, v)
.
Guaranteed to converge to correct solution!

Summary on Tensor Methods
= + ....
Tensor decomposition: iterative algorithm with tensor contractions.
Tensor contraction: T(W1, W2, W3), where Wi are matrices/vectors.
Strengths of tensor method
Guaranteed convergence to globally optimal solution!
Fast and accurate, orders of magnitude faster than previous methods.
Embarrassingly parallel and suited for distributed systems, e.g.Spark.
Exploit optimized BLAS operations on CPUs and GPUs.

Preliminary Results on Spark
Alternating Least Squares for Tensor Decomposition.
min
w,A,B,C
T −
k
i=1
λiA(:, i) ⊗ B(:, i) ⊗ C(:, i)
2
F
Update Rows Independently
Tensor slices
B C
worker 1
worker 2
worker i
worker k
Results on NYtimes corpus
3 ∗ 105 documents, 108 words
Spark Map-Reduce
26mins 4 hrs
https://github.com/FurongHuang/SpectralLDA-TensorSpark

Tensor Methods for Learning LVMs
GMM HMM
h1 h2 h3
x1 x2 x3
ICA
h1 h2 hk
x1 x2 xd
Multiview and Topic Models

Overview of Our Approach
Tensor-based Model Estimation
= + +
Unlabeled Data

= + +
Unlabeled Data → Probabilistic admixture models

= + +
Unlabeled Data → Probabilistic admixture models → Tensor Decomposition

= + +
Unlabeled Data → Probabilistic admixture models → Tensor Decomposition → Inference

= + +
tensor decomposition → correct model

= + +
tensor decomposition → correct model
Highly parallelized, CPU/GPU/Cloud implementation.
Guaranteed global convergence for Tensor Decomposition.

Community Detection on Yelp and DBLP datasets
Facebook
n ∼ 20, 000
Yelp
n ∼ 40, 000
DBLP
n ∼ 1 million
Tensor vs. Variational
FB YP DBLP−sub DBLP
10
0
10
1
10
2
10
3
10
4
10
5
Datasets
Runningtime(seconds)
Variational
Tensor

Tensor Methods for Training Neural Networks
Tremendous practical impact
with deep learning.
Algorithm: backpropagation.
Highly non-convex optimization

Toy Example: Failure of Backpropagation
x1
x2
y=1y=−1
Labeled input samples
Goal: binary classiﬁcation
σ(·) σ(·)
y
x1 x2x
w1 w2
Our method: guaranteed risk bounds for training neural networks

Backpropagation vs. Our Method
Weights w2 randomly drawn and ﬁxed
Backprop (quadratic) loss surface
−4
−3
−2
−1
0
1
2
3
4 −4
−3
−2
−1
0
1
2
3
4
200
250
300
350
400
450
500
550
600
650
w1(1) w1(2)

Backpropagation vs. Our Method
Weights w2 randomly drawn and ﬁxed
Backprop (quadratic) loss surface
−4
−3
−2
−1
0
1
2
3
4 −4
−3
−2
−1
0
1
2
3
4
200
250
300
350
400
450
500
550
600
650
w1(1) w1(2)
Loss surface for our method
−4
−3
−2
−1
0
1
2
3
4 −4
−3
−2
−1
0
1
2
3
4
0
20
40
60
80
100
120
140
160
180
200
w1(1) w1(2)

Feature Transformation for Training Neural Networks
Feature learning: Learn φ(·) from input data.
How to use φ(·) to train neural networks?
x
φ(x)
y

Feature Transformation for Training Neural Networks
Feature learning: Learn φ(·) from input data.
How to use φ(·) to train neural networks?
x
φ(x)
y
Invert E[φ(x) ⊗ y] to Learn Neural Network Parameters
In general, E[φ(x) ⊗ y] is a tensor.
What class of φ(·) useful for training neural networks?
What algorithms can invert moment tensors to obtain parameters?

Score Function Transformations
Score function for x ∈ Rd
with pdf p(·):
S1(x) := −∇x log p(x)
Input:
x ∈ Rd
S1(x) ∈ Rd

with pdf p(·):
S1(x) := −∇x log p(x)
mth
-order score function:
Input:
x ∈ Rd
S1(x) ∈ Rd

with pdf p(·):
S1(x) := −∇x log p(x)
mth
Sm(x) := (−1)m ∇(m)p(x)
p(x)
Input:
x ∈ Rd
S1(x) ∈ Rd

with pdf p(·):
S1(x) := −∇x log p(x)
mth
Sm(x) := (−1)m ∇(m)p(x)
p(x)
Input:
x ∈ Rd
S2(x) ∈ Rd×d

with pdf p(·):
S1(x) := −∇x log p(x)
mth
Sm(x) := (−1)m ∇(m)p(x)
p(x)
Input:
x ∈ Rd
S3(x) ∈ Rd×d×d

NN-LiFT: Neural Network LearnIng
using Feature Tensors
Input:
x ∈ Rd
Score function
S3(x) ∈ Rd×d×d

Input:
x ∈ Rd
Score function
S3(x) ∈ Rd×d×d
1
n
n
i=1
yi ⊗ S3(xi) =
1
n
n
i=1
yi⊗
Estimating M3 using
labeled data {(xi, yi)}
S3(xi)
Cross-
moment

Input:
x ∈ Rd
Score function
S3(x) ∈ Rd×d×d
1
n
n
i=1
yi ⊗ S3(xi) =
1
n
n
i=1
yi⊗
Estimating M3 using
labeled data {(xi, yi)}
S3(xi)
Cross-
moment
Rank-1 components are the estimates of ﬁrst layer weights
+ + · · ·
CP tensor
decomposition

Reinforcement Learning (RL) of POMDPs
Partially observable Markov decision processes.
Proposed Method
Consider memoryless policies. Episodic learning: indirect exploration.
Tensor methods: careful conditioning required for learning.
First RL method for POMDPs with logarithmic regret bounds.
xi xi+1 xi+2
yi yi+1
ri ri+1
ai ai+1
0 1000 2000 3000 4000 5000 6000 7000
2
2.1
2.2
2.3
2.4
2.5
2.6
2.7
Number of Trials
AverageReward
SM−UCRL−POMDP
UCRL−MDP
Q−learning
Random Policy
Logarithmic Regret Bounds for POMDPs using Spectral Methods by K. Azzizade, A. Lazaric, A.
, under preparation.

Convolutional Tensor Decomposition
== ∗
xx f∗
i w∗
i F∗ w∗
(a)Convolutional dictionary model (b)Reformulated model
.... ....= ++
Cumulant λ1(F∗
1 )⊗3 +λ2(F∗
2 )⊗3 . . .
Eﬃcient methods for tensor decomposition with circulant constraints.
Convolutional Dictionary Learning through Tensor Factorization by F. Huang, A. , June 2015.

Hierarchical Tensor Decomposition
= + .... = + .... = + ....
= + ....
Relevant for learning latent tree graphical models.
Propose ﬁrst algorithm for integrated learning of hierarchy and
components of tensor decomposition.
Highly parallelized operations but with global consistency guarantees.
Scalable Latent Tree Model and its Application to Health Analytics By F. Huang, U. N.
Niranjan, J. Perros, R. Chen, J. Sun, A. , Preprint, June 2014.

Summary and Outlook
Summary
Tensor methods: a powerful paradigm for guaranteed large-scale
machine learning.
First methods to provide provable bounds for training neural
networks, many latent variable models (e.g HMM, LDA), POMDPs!

Summary and Outlook
Summary
Tensor methods: a powerful paradigm for guaranteed large-scale
machine learning.
First methods to provide provable bounds for training neural
networks, many latent variable models (e.g HMM, LDA), POMDPs!
Outlook
Training multi-layer neural networks, models with invariances,
reinforcement learning using neural networks . . .
Uniﬁed framework for tractable non-convex methods with guaranteed
convergence to global optima?

My Research Group and Resources
Podcast/lectures/papers/software available at
http://newport.eecs.uci.edu/anandkumar/

Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLconf SF - 11/13/15

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (9)

Similar to Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLconf SF - 11/13/15

Similar to Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLconf SF - 11/13/15 (20)

More from MLconf

More from MLconf (20)

Recently uploaded

Recently uploaded (20)

Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLconf SF - 11/13/15