Learning Sparse Representations

Gabriel Peyré

www.numerical-tours.com
Image Priors
Mathematical image prior:
     compression, denoising, super-resolution, . . .
Smooth images:
   Sobolev prior: $\|\nabla f\|^2$
   → Low-pass Fourier coefficients.

Piecewise smooth images:
   Total variation prior: $\|\nabla f\|_1$
   → Sparse wavelet coefficients.

→ Learning the prior from exemplars?
Overview

• Sparsity and Redundancy

• Dictionary Learning

• Extensions

• Task-driven Learning

• Texture Synthesis
Image Representation
Dictionary $D = \{d_m\}_{m=0}^{Q-1}$ of atoms $d_m \in \mathbb{R}^N$.
Image decomposition: $f = \sum_{m=0}^{Q-1} x_m d_m = Dx$
Image approximation: $f \approx Dx$

[Diagram: $f = Dx$, with each atom $d_m$ a column of $D$ and $x_m$ its coefficient.]

Orthogonal dictionary: $N = Q$
   $x_m = \langle f, d_m \rangle$
Redundant dictionary: $Q > N$
   Examples: TI wavelets, curvelets, . . .
   → x is not unique.
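As a concrete illustration (not from the slides), here is a minimal NumPy sketch of synthesis and analysis in the orthogonal case; the random orthobasis built via QR is an assumption standing in for a real dictionary.

```python
import numpy as np

# Assumption: a random orthonormal basis stands in for a real dictionary.
N = 64
D, _ = np.linalg.qr(np.random.randn(N, N))  # columns d_m are the atoms

x = np.random.randn(N)        # coefficients x
f = D @ x                     # synthesis: f = D x

x_rec = D.T @ f               # analysis: x_m = <f, d_m>
print(np.allclose(x, x_rec))  # True: exact recovery when D is orthogonal
```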
Sparsity
Decomposition: $f = \sum_{m=0}^{Q-1} x_m d_m = Dx$

Sparsity: most $x_m$ are small.
     Example: wavelet transform.
[Figure: image $f$ and its wavelet coefficients $x$.]

Ideal sparsity: most $x_m$ are zero.
    $J_0(x) = \#\{m \,:\, x_m \neq 0\}$

Approximate sparsity: compressibility
    $\|f - Dx\|$ is small with $J_0(x) \leq M$.
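A small sketch of these notions on a synthetic coefficient vector; the power-law decay used to model compressibility is an assumption. In an orthobasis the coefficient-domain error below equals the image-domain error $\|f - Dx_M\|$.

```python
import numpy as np

Q = 1000
# Assumption: power-law decay models a compressible coefficient vector.
x = np.sign(np.random.randn(Q)) * np.arange(1, Q + 1) ** (-1.5)

J0 = np.count_nonzero(x)                 # J0(x) = #{m : x_m != 0}

M = 50                                   # keep the M largest entries
T = np.sort(np.abs(x))[-M]               # M-th largest magnitude
xM = np.where(np.abs(x) >= T, x, 0)
print(J0, np.count_nonzero(xM))          # 1000, 50
print(np.linalg.norm(x - xM))            # small: x is compressible
```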
Sparse Coding
Redundant dictionary $D = \{d_m\}_{m=0}^{Q-1}$, $Q > N$.
     → non-unique representation $f = Dx$.
Sparsest decomposition: $\min_{f = Dx} J_0(x)$

Sparsest approximation: $\min_x \frac{1}{2}\|f - Dx\|^2 + \lambda J_0(x)$

Equivalence: for suitable $(\lambda, M, \varepsilon)$,
    $\min_{J_0(x) \leq M} \|f - Dx\|$   and   $\min_{\|f - Dx\| \leq \varepsilon} J_0(x)$

Ortho-basis D: hard thresholding
    $x_m = \langle f, d_m \rangle$ if $|\langle f, d_m \rangle| > \sqrt{2\lambda}$, and $x_m = 0$ otherwise.
    ↔ Pick the M largest coefficients in $\{\langle f, d_m \rangle\}_m$.

General redundant dictionary: NP-hard.
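A minimal sketch of the orthobasis case, where the closed-form solution is hard thresholding of the inner products at $T = \sqrt{2\lambda}$; the random basis and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def hard_threshold_code(f, D, lam):
    """Minimize 0.5*||f - D x||^2 + lam*J0(x) when D is orthogonal."""
    x = D.T @ f                             # x_m = <f, d_m>
    x[np.abs(x) <= np.sqrt(2 * lam)] = 0    # keep only large coefficients
    return x

N = 64
D, _ = np.linalg.qr(np.random.randn(N, N))  # assumption: random orthobasis
f = np.random.randn(N)
x = hard_threshold_code(f, D, lam=0.5)
print(np.count_nonzero(x), np.linalg.norm(f - D @ x))
```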
Convex Relaxation: L1 Prior
$J_0(x) = \#\{m \,:\, x_m \neq 0\}$
                     $J_0(x) = 0$ → null image.
Image with 2 pixels: $J_0(x) = 1$ → sparse image.
                     $J_0(x) = 2$ → non-sparse image.

[Figure: unit balls of $J_q$ in the $(d_0, d_1)$ plane for $q = 0,\ 1/2,\ 1,\ 3/2,\ 2$.]

$\ell^q$ priors: $J_q(x) = \sum_m |x_m|^q$   (convex for $q \geq 1$)

Sparse $\ell^1$ prior: $J_1(x) = \|x\|_1 = \sum_m |x_m|$
Inverse Problems
Measurements: $y = \Phi f_0 + w \in \mathbb{R}^P$.

Denoising/approximation: $\Phi = \mathrm{Id}$.
Examples: inpainting, super-resolution, compressed sensing.
Regularized Inversion
Denoising/compression: $y = f_0 + w \in \mathbb{R}^N$.
   Sparse approximation: $f = Dx^\star$ where
      $x^\star \in \operatorname{argmin}_x \frac{1}{2}\|y - Dx\|^2 + \lambda\|x\|_1$
      (fidelity + sparsity)

Inverse problems: $y = \Phi f_0 + w \in \mathbb{R}^P$.   → Replace $D$ by $\Phi D$:
      $x^\star \in \operatorname{argmin}_x \frac{1}{2}\|y - \Phi D x\|^2 + \lambda\|x\|_1$

Numerical solvers: proximal splitting schemes.
             www.numerical-tours.com
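One such proximal scheme is iterative soft thresholding (ISTA); the sketch below applies it to an inpainting operator $\Phi$ that keeps a random half of the samples. The random dictionary, the sizes, and $\lambda$ are illustrative assumptions, not the slides' setup.

```python
import numpy as np

def soft(x, t):                          # prox of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0)

def ista(y, A, lam, n_iter=300):
    """Minimize 0.5*||y - A x||^2 + lam*||x||_1 by proximal gradient."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x - A.T @ (A @ x - y) / L, lam / L)
    return x

N, Q = 100, 200
D = np.random.randn(N, Q) / np.sqrt(N)   # assumption: random dictionary
mask = np.random.rand(N) > 0.5           # Phi keeps ~half of the samples
x0 = np.zeros(Q)
x0[np.random.choice(Q, 5, replace=False)] = 1.0
y = (D @ x0)[mask]                       # y = Phi f0 (noiseless here)
x = ista(y, D[mask, :], lam=0.01)        # A = Phi D
print(np.linalg.norm(D @ x - D @ x0))    # reconstruction error on f
```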
Inpainting Results
Overview

• Sparsity and Redundancy

• Dictionary Learning

• Extensions

• Task-driven Learning

• Texture Synthesis
Dictionary Learning: MAP Energy
Set of (noisy) exemplars $\{y_k\}_k$.

Sparse approximation: $\min_{D \in \mathcal{C}} \min_{(x_k)_k} \sum_k \frac{1}{2}\|y_k - D x_k\|^2 + \lambda\|x_k\|_1$
                      → dictionary learning

Constraint: $\mathcal{C} = \{D = (d_m)_m \,:\, \forall m,\ \|d_m\| \leq 1\}$
      Otherwise: $D \to +\infty$, $X \to 0$.

Matrix formulation:
      $\min_{X \in \mathbb{R}^{Q \times K},\ D \in \mathcal{C} \subset \mathbb{R}^{N \times Q}} f(X, D) = \frac{1}{2}\|Y - DX\|^2 + \lambda\|X\|_1$

    Convex with respect to X.
    Convex with respect to D.
    Non-convex with respect to (X, D). → Local minima.
[Figure: the partially minimized energy $\min_X f(X, D)$ as a function of $D$, showing local minima.]
Dictionary Learning: Algorithm
Step 1: ∀k, minimization on $x_k$:
    $\min_{x_k} \frac{1}{2}\|y_k - D x_k\|^2 + \lambda\|x_k\|_1$
      → Convex sparse coding.

Step 2: minimization on $D$:
    $\min_{D \in \mathcal{C}} \|Y - DX\|^2$
      → Convex constrained minimization.
    Projected gradient descent:
    $D^{(\ell+1)} = \operatorname{Proj}_{\mathcal{C}}\big(D^{(\ell)} - \tau (D^{(\ell)} X - Y) X^*\big)$

Convergence: toward a stationary point of $f(X, D)$.
[Figure: dictionary $D$ at initialization and at convergence.]
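A compact sketch of this alternation: a few ISTA steps for the sparse-coding update of $X$, then projected gradient on $D$, with $\operatorname{Proj}_{\mathcal{C}}$ rescaling each column onto the unit ball. Iteration counts, $\lambda$, and sizes are illustrative assumptions.

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0)

def proj_C(D):                           # enforce ||d_m|| <= 1 per column
    return D / np.maximum(np.linalg.norm(D, axis=0), 1)

def learn_dictionary(Y, Q, lam=0.1, n_outer=30, n_inner=20):
    N, K = Y.shape
    D = proj_C(np.random.randn(N, Q))    # random initialization
    X = np.zeros((Q, K))
    for _ in range(n_outer):
        # Step 1: sparse coding of X (a few ISTA steps)
        L = np.linalg.norm(D, 2) ** 2 + 1e-10
        for _ in range(n_inner):
            X = soft(X - D.T @ (D @ X - Y) / L, lam / L)
        # Step 2: projected gradient descent on D
        tau = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-10)
        for _ in range(n_inner):
            D = proj_C(D - tau * (D @ X - Y) @ X.T)
    return D, X

Y = np.random.randn(64, 500)             # columns are the exemplars y_k
D, X = learn_dictionary(Y, Q=128)
print(np.linalg.norm(Y - D @ X))
```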
Patch-based Learning

[Figure: exemplar patches $y_k$ → Learning D → dictionary $D$.]
                                     [Olshausen, Field 1997]
State-of-the-art denoising [Elad et al. 2006]

[Figure: texture exemplar → Learning D → learned texture atoms.]
Sparse texture synthesis, inpainting [Peyré 2008]
Comparison with PCA
PCA dimensionality reduction:
     $\forall k,\ \min_{D^{(k)}} \|Y - D^{(k)} X\|$, with $D^{(k)} = (d_m)_{m=0}^{k-1}$.
Linear (PCA): Fourier-like atoms.
Sparse (learning): Gabor-like atoms.

[Figure (Rubinstein et al., "Dictionaries for Sparse Representation"): left, a few 12×12 DCT atoms; right, the first 40 KLT atoms trained on 12×12 image patches from Lena.]
[Figure: Gabor atoms vs. learned atoms.]
Patch-based Denoising
Noisy image: $f = f_0 + w$.
Step 1: Extract patches: $y_k(\cdot) = f(z_k + \cdot)$.
Step 2: Dictionary learning:
    $\min_{D, (x_k)_k} \sum_k \frac{1}{2}\|y_k - D x_k\|^2 + \lambda\|x_k\|_1$
Step 3: Patch averaging: $\tilde{y}_k = D x_k$,
    $\tilde{f}(\cdot) \propto \sum_k \tilde{y}_k(\cdot - z_k)$

[Figure: noisy patches $y_k$ → denoised patches $\tilde{y}_k$.]
[Aharon & Elad 2006]
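A minimal sketch of the surrounding patch machinery (Steps 1 and 3); the sparse-coding denoiser of Step 2 is stubbed out, and the patch size and stride are assumptions.

```python
import numpy as np

def extract_patches(f, w=8, stride=4):
    """Step 1: collect patches y_k(.) = f(z_k + .) as columns of Y."""
    H, W = f.shape
    zs = [(i, j) for i in range(0, H - w + 1, stride)
                 for j in range(0, W - w + 1, stride)]
    Y = np.stack([f[i:i + w, j:j + w].ravel() for i, j in zs], axis=1)
    return Y, zs

def average_patches(Y, zs, shape, w=8):
    """Step 3: average the (denoised) patches back into an image."""
    f, cnt = np.zeros(shape), np.zeros(shape)
    for k, (i, j) in enumerate(zs):
        f[i:i + w, j:j + w] += Y[:, k].reshape(w, w)
        cnt[i:i + w, j:j + w] += 1
    return f / np.maximum(cnt, 1)

f = np.random.rand(64, 64)           # stand-in for the noisy image
Y, zs = extract_patches(f)
Y_tilde = Y                          # placeholder for Step 2: Y_tilde = D X
f_tilde = average_patches(Y_tilde, zs, f.shape)
print(np.abs(f - f_tilde).max())     # ~0: averaging inverts extraction here
```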
Learning with Missing Data
Inverse problem: $y = \Phi f_0 + w$.

$\min_{f, (x_k)_k, D \in \mathcal{C}} \ \frac{1}{2}\|y - \Phi f\|^2 + \lambda \sum_k \Big( \frac{1}{2}\|p_k(f) - D x_k\|^2 + \mu\|x_k\|_1 \Big)$

Patch extractor: $p_k(f) = f(z_k + \cdot)$

Step 1: ∀k, minimization on $x_k$ → convex sparse coding.
Step 2: minimization on $D$ → constrained quadratic.
Step 3: minimization on $f$ → quadratic.

[Figure: original $f_0$ and damaged observations $y$.]
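A sketch of Step 3 for inpainting, where $\Phi$ is a binary mask: the normal equations $(\Phi^*\Phi + \lambda \sum_k p_k^* p_k) f = \Phi^* y + \lambda \sum_k p_k^*(D x_k)$ are diagonal, so the update is pointwise. The patch layout, $\lambda$, and the sizes are assumptions.

```python
import numpy as np

def update_f(y, mask, patches, zs, shape, w=8, lam=1.0):
    """Step 3: minimize 0.5||y - Phi f||^2 + lam*sum_k 0.5||p_k(f) - D x_k||^2."""
    num, den = np.zeros(shape), np.zeros(shape)
    num[mask] += y                    # Phi^* y
    den[mask] += 1.0                  # Phi^* Phi (diagonal for a mask)
    for k, (i, j) in enumerate(zs):
        num[i:i + w, j:j + w] += lam * patches[:, k].reshape(w, w)
        den[i:i + w, j:j + w] += lam  # p_k^* p_k (diagonal)
    return num / np.maximum(den, 1e-10)

w, stride = 8, 4
f0 = np.random.rand(64, 64)
mask = np.random.rand(64, 64) > 0.75         # observe 25% of the pixels
y = f0[mask]                                 # y = Phi f0
zs = [(i, j) for i in range(0, 64 - w + 1, stride)
             for j in range(0, 64 - w + 1, stride)]
# Placeholder for the patches D x_k produced by Steps 1-2:
patches = np.stack([f0[i:i + w, j:j + w].ravel() for i, j in zs], axis=1)
f = update_f(y, mask, patches, zs, f0.shape, w=w)
print(np.abs(f - f0).max())                  # ~0 with these ideal patches
```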
Inpainting Example

[Figure (Mairal et al. 2008, Fig. 14): image $f_0$; damaged observations $y = \Phi f_0 + w$ with 75% of the data removed (initial PSNR 6.13 dB); restored images using an adaptive dictionary learned over 100 iterations on 50% of the patches, with sparsity factor L = 10 during learning and L = 25 for the final reconstruction. Resulting PSNR: 33.97 dB for N = 2 scales with 16×16 patches, 31.75 dB for N = 1 with 8×8 patches.]

[Mairal et al. 2008]
Adaptive Inpainting and Separation

[Figure: inpainting and separation results with wavelets, local DCT, and a learned dictionary.]

[Peyré, Fadili, Starck 2010]
Overview

• Sparsity and Redundancy

• Dictionary Learning

• Extensions

• Task-driven Learning

• Texture Synthesis
Higher Dimensional Learning

Color images: patches become $n \times n \times 3$ vectors (e.g. 5×5×3, 7×7×3, 8×8×3 atoms).

[Figure (Mairal et al., "Sparse Representation for Color Image Restoration"): dictionaries with 256 color atoms learned on a generic database of natural images; note the large number of color-less atoms. A dictionary learned on a single training image is more colored than the global one.]
[Figure: color image inpainting with the learned dictionary.]
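A small sketch of the change for color: each patch is flattened into a $w \cdot w \cdot 3$ vector, and the same learning algorithm applies unchanged; the sizes are illustrative assumptions.

```python
import numpy as np

f = np.random.rand(64, 64, 3)        # stand-in for an RGB image
w = 8                                # e.g. 8x8x3 patches as in the figures
zs = [(i, j) for i in range(0, 64 - w + 1, w)
             for j in range(0, 64 - w + 1, w)]
Y = np.stack([f[i:i + w, j:j + w, :].ravel() for i, j in zs], axis=1)
print(Y.shape)                       # (w*w*3, K): same learning as before
```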
Movie Inpainting
Facial Image Compression
[Elad et al. 2009]

Image registration.
Non-overlapping patches $(f_k)_k$.
[Figure (Bryt & Elad, J. Vis. Commun. Image R. 19 (2008), Fig. 1): left, piecewise affine warping of the image by triangulation; right, uniform slicing into disjoint square patches $f_k$ for coding.]
Facial Image Compression                                                                                                 O. Bryt, M. Elad / J. Vis. Commun. Image R. 19 (2008) 270–282


                                                                                           Before turning to preset the results we should add the follow-
                                                                   O.O. Bryt, M. EladJ. J. ing: while all theImage R. 19 (2008) 270–282 specific database
                                                                      Bryt, M. Elad / / Vis. Commun. Image R. shown here 270–282
                                                                                            Vis. Commun. results 19 (2008) refer to the
                                                                                                                                                                    was trained for patch number 80 (The left
                                                                                                                                                                                                           271
                                                                                                                                                                    coding atoms, and similarly, in Fig. 7 we271
                                                                                                                                                                                                              can
                                                                             we operate on, the overall scheme proposed is general and should                       was trained for patch number 87 (The right
                                                                             apply to other face images databases just as well. Naturally, some                     sparse coding atoms. It can be seen that bot


          [Elad et al. 2009]
            show recognizable faces. We use a a database containing around 6000 the parameters might be necessary, and among those,
              show recognizable faces. We use database containing around 6000 in
                                                                             changes                                                                                images similar in nature to the image patch
            such facial images, some of which are used for training and tuning size is the most important to consider. We also note that
              such facial images, some of which are used for training andthe patch
                                                                               tuning                                                                               trained for. A similar behavior was observed

            the algorithm, and the others for testing it, similar to the approach from one source of images to another, this relative size
                                                                             as one shifts
              the algorithm, and the others for testing it, similar to the approach
                                                                             of the background in the photos may vary, and
                                                                                                                                the
                                                                                                                                     necessarily                    4.2. Reconstructed images


Image registration.
            taken in [17].
              taken in [17].                                                 leads to changes in performance. More specifically, when the back-
                In our work we propose a a novel compression algorithm, ground small such larger (e.g., the images we useperformance is
                  In our work we propose novel compression algorithm, related related
                                                                             tively
                                                                                     regions are
                                                                                                  regions),
                                                                                                            the
                                                                                                                compression
                                                                                                                                 here have rela-                        Our coding strategy allows us to learn w
                                                                                                                                                                    age are more difficult than others to co
            to the one presented in [17], improving over it.
              to the one presented in [17], improving over it.               expected to improve.                                                                   assigning the same representation error th
                Our algorithm relies strongly on recent advancements made in
                  Our algorithm relies strongly on recent advancements made in dictionaries

Non-overlapping patches (fk )k .
                                                                            4.1. K-SVD
            using sparse and redundant representation of signals [18–26], and
              using sparse and redundant representation of signals [18–26], and                                                                                                                    fk
                                                                                                                                                                    patches, and observing how many atoms
                                                                                                                                                                    representation of each patch on average.
                                                                                                                                                                    a small number of allocated atoms are simp
            learning their sparsifying dictionaries [27–29]. We use the K-SVD
              learning their sparsifying dictionaries [27–29]. We use the K-SVDThe primary stopping condition for the training process was set                      others. We would expect that the represent
                                                                            to be a limitation on the maximal number of K-SVD iterations                            of the image such as the background, p
            algorithm for learning the dictionaries for representing (being 100). A secondary stopping condition was a limitation on
              algorithm for learning the dictionaries for representingsmall    small                                                                                maybe parts of the clothes will be simpler
            image patches in a a locally adaptive way, and use these to sparse-
              image patches in locally adaptive way, and use these to sparse-
                                                                            the minimal representation error. In the image compression stage                        tion of areas containing high frequency e


Dictionary learning (Dk )k .
                                                                            we added a limitation on the maximal number of atoms per patch.                         hair or the eyes. Fig. 8 shows maps of atom
            code the patches’ content. This isis a a relatively simple and
              code the patches’ content. This             relatively simple and
                                                                            These conditions were used to allow us to better control the rates                      and representation error (RMSE—squared
            straight-forward algorithm with hardly any entropy coding of stage.
                                                                            stage.
              straight-forward algorithm with hardly any entropy coding the resulting images and the overall simulation time.                                       squared error) per patch for the images in
                                                                               Every obtained dictionary contains 512 patches of size                               different bit-rates. It can be seen that more
            Yet, itit is shown to be superior to several competing algorithms: as atoms. In Fig. 6 we can see the dictionary that
              Yet, is shown to be superior to several competing algorithms:pixels
                                                                            15 Â 15                                                                                 to patches containing the facial details (h
            (i) the JPEG2000, (ii) the VQ-based algorithm presented in [17],
              (i) the JPEG2000, (ii) the VQ-based algorithm presented in [17],
                                                                        2
            and (iii) AA Principal Component Analysis (PCA) approach.2
              and (iii) Principal Component Analysis (PCA) approach.                       Fig. 1.1. (Left) Piece-wise affine warping of the image by triangulation. (Right) A
                                                                                             Fig. (Left) Piece-wise affine warping of the image by triangulation. (Right) A
                In the next section we provide some background material for
                  In the next section we provide some background material for              uniform slicing toto disjoint square patches for coding purposes.
                                                                                             uniform slicing disjoint square patches for coding purposes.
            this work: we start by presenting the details of the compression
              this work: we start by presenting the details of the compression
            algorithm developed in [17], as their scheme isis the one we embark
              algorithm developed in [17], as their scheme the one we embark               K-Means) per each patch separately, using patches taken from the
                                                                                             K-Means) per each patch separately, using patches taken from the
            from in the development of ours. We also describe the topic of
              from in the development of ours. We also describe the topic of               same location from 5000 training images. This way, each VQ isis
                                                                                             same location from 5000 training images. This way, each VQ
            sparse and redundant representations and the K-SVD, that are
              sparse and redundant representations and the K-SVD, that are                 adapted to the expected local content, and thus the high perfor-
                                                                                             adapted to the expected local content, and thus the high perfor-
            the foundations for our algorithm. In Section 3 3 we turn to present
              the foundations for our algorithm. In Section we turn to present             mance presented by this algorithm. The number of code-words
                                                                                             mance presented by this algorithm. The number of code-words
            the proposed algorithm in details, showing its various steps, and
              the proposed algorithm in details, showing its various steps, and            in the VQ isisa afunction of the bit-allocation for the patches. As
                                                                                             in the VQ               function of the bit-allocation for the patches. As
            discussing its computational/memory complexities. Section 4 4
              discussing its computational/memory complexities. Section                    we argue in the next section, VQ coding isis limited by the available
                                                                                             we argue in the next section, VQ coding limited by the available
            presents results of our method, demonstrating the claimed
              presents results of our method, demonstrating the claimed                    number of examples and the desired rate, forcing relatively small
                                                                                             number of examples and the desired rate, forcing relatively small
            superiority. We conclude in Section 5 5 with a list of future activities
              superiority. We conclude in Section with a list of future activities         patch sizes. This, in turn, leads to a a loss of some redundancy be-
                                                                                             patch sizes. This, in turn, leads to loss of some redundancy be-
            that can further improve over the proposed scheme.
              that can further improve over the proposed scheme.                           tween adjacent patches, and thus loss of potential compression.
                                                                                             tween adjacent patches, and thus loss of potential compression.

            2. Background material
             2. Background material
                                                                                                           AnotherThe Dictionary obtained by K-SVD for Patch No. 80 (the that partlyOMPcompensates
                                                                                                             Another ingredient in this algorithmleft eye) using the compensates
                                                                                                               Fig. 6. ingredient in this algorithm that partly method with L ¼ 4.
                                                                                                       for the above-described shortcoming isis a a multi-scale coding
                                                                                                         for the above-described shortcoming                                   multi-scale coding
                                                                                                                                                                                                   Dk
                                                                                                       scheme. The image isisscaled down and VQ-coded using patches
                                                                                                         scheme. The image                scaled down and VQ-coded using patches
            2.1. VQ-based image compression
             2.1. VQ-based image compression                                                           of size 8 8 Â 8. Then it is interpolated back to the original resolution,
                                                                                                         of size   Â 8. Then it is interpolated back to the original resolution,
                                                                                                       and the residual isiscoded using VQ on 8 8 Â 8pixel patches once
                                                                                                         and the residual             coded using VQ on  8 pixel patches once
                Among the thousands of papers that study still image
                  Among the thousands of papers that study still image                                 again. This method can be applied on a a Laplacian pyramid of the
                                                                                                         again. This method can be applied on Laplacian pyramid of the
            compression algorithms, there are relatively few that consider
              compression algorithms, there are relatively few that consider                           original (warped) image with several scales [33].
                                                                                                         original (warped) image with several scales [33].
            the treatment of facial images [2–17]. Among those, the most
              the treatment of facial images [2–17]. Among those, the most                                 As already mentioned above, the results shown in [17] surpass
                                                                                                             As already mentioned above, the results shown in [17] surpass
            recent and the best performing algorithm isis the one reported in
              recent and the best performing algorithm the one reported in                             those obtained by JPEG2000, both visually and in Peak-Signal-to-
                                                                                                         those obtained by JPEG2000, both visually and in Peak-Signal-to-
            [17]. That paper also provides a athorough literature survey that
              [17]. That paper also provides thorough literature survey that                           Noise Ratio (PSNR) quantitative comparisons. In our work we pro-
                                                                                                         Noise Ratio (PSNR) quantitative comparisons. In our work we pro-
            compares the various methods and discusses similarities and
              compares the various methods and discusses similarities and                              pose to replace the coding stage from VQ to sparse and redundant
                                                                                                         pose to replace the coding stage from VQ to sparse and redundant
            differences between them. Therefore, rather than repeating such
              differences between them. Therefore, rather than repeating such                          representations—this leads us to the next subsection, were we de-
                                                                                                         representations—this leads us to the next subsection, were we de-
            a asurvey here, we refer the interested reader to [17]. In this
                 survey here, we refer the interested reader to [17]. In this                          scribe the principles behind this coding strategy.
                                                                                                         scribe the principles behind this coding strategy.
            sub-section we concentrate on the description of the algorithm
              sub-section we concentrate on the description of the algorithm
            in [17] as our method resembles itit to some extent.
              in [17] as our method resembles to some extent.                                          2.2. Sparse and redundant representations
                                                                                                        2.2. Sparse and redundant representations
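To make the per-patch pipeline concrete, here is a minimal NumPy sketch of the coding stage sketched above: the registered image is sliced into disjoint patches $(f_k)_k$ and each patch is coded in its own location-specific dictionary $D_k$. The patch size, the `dictionaries` list, and the use of $\ell^1$ soft-thresholding (ISTA) in place of the paper's OMP are illustrative assumptions, not the published algorithm.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (entrywise shrinkage toward zero)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def extract_disjoint_patches(img, p):
    """Slice an image into non-overlapping p x p patches, one column per patch."""
    H, W = img.shape
    cols = [img[i:i + p, j:j + p].ravel()
            for i in range(0, H - p + 1, p)
            for j in range(0, W - p + 1, p)]
    return np.stack(cols, axis=1)                  # shape (p*p, number of patches)

def code_patches(img, dictionaries, p, lam, n_iter=100):
    """Sparse-code each patch f_k in its own dictionary D_k (ISTA, not OMP)."""
    F = extract_disjoint_patches(img, p)
    codes = []
    for k, D in enumerate(dictionaries):           # one (p*p, Q) dictionary per location
        tau = 1.0 / np.linalg.norm(D, 2) ** 2      # step size from the spectral norm
        x = np.zeros(D.shape[1])
        for _ in range(n_iter):
            x = soft_threshold(x - tau * D.T @ (D @ x - F[:, k]), tau * lam)
        codes.append(x)
    return codes
```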

More Related Content

What's hot

Frequency Domain Image Enhancement Techniques
Frequency Domain Image Enhancement TechniquesFrequency Domain Image Enhancement Techniques
Frequency Domain Image Enhancement TechniquesDiwaker Pant
 
Chapter 9 morphological image processing
Chapter 9   morphological image processingChapter 9   morphological image processing
Chapter 9 morphological image processingAhmed Daoud
 
DIGITAL IMAGE PROCESSING - LECTURE NOTES
DIGITAL IMAGE PROCESSING - LECTURE NOTESDIGITAL IMAGE PROCESSING - LECTURE NOTES
DIGITAL IMAGE PROCESSING - LECTURE NOTESEzhilya venkat
 
Image Registration (Digital Image Processing)
Image Registration (Digital Image Processing)Image Registration (Digital Image Processing)
Image Registration (Digital Image Processing)VARUN KUMAR
 
Convolutional Neural Network and Its Applications
Convolutional Neural Network and Its ApplicationsConvolutional Neural Network and Its Applications
Convolutional Neural Network and Its ApplicationsKasun Chinthaka Piyarathna
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image ProcessingReshma KC
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Syed Atif Naseem
 
Image Enhancement in Spatial Domain
Image Enhancement in Spatial DomainImage Enhancement in Spatial Domain
Image Enhancement in Spatial DomainDEEPASHRI HK
 
Convolutional neural network
Convolutional neural network Convolutional neural network
Convolutional neural network Yan Xu
 
Lect 02 first portion
Lect 02   first portionLect 02   first portion
Lect 02 first portionMoe Moe Myint
 
Introduction to digital image processing
Introduction to digital image processingIntroduction to digital image processing
Introduction to digital image processingHossain Md Shakhawat
 
Image enhancement
Image enhancementImage enhancement
Image enhancementAyaelshiwi
 

What's hot (20)

Unit 1
Unit 1Unit 1
Unit 1
 
Ridge regression
Ridge regressionRidge regression
Ridge regression
 
Frequency Domain Image Enhancement Techniques
Frequency Domain Image Enhancement TechniquesFrequency Domain Image Enhancement Techniques
Frequency Domain Image Enhancement Techniques
 
Image enhancement
Image enhancementImage enhancement
Image enhancement
 
Chapter 9 morphological image processing
Chapter 9   morphological image processingChapter 9   morphological image processing
Chapter 9 morphological image processing
 
Spatial domain and filtering
Spatial domain and filteringSpatial domain and filtering
Spatial domain and filtering
 
Canny Edge Detection
Canny Edge DetectionCanny Edge Detection
Canny Edge Detection
 
DIGITAL IMAGE PROCESSING - LECTURE NOTES
DIGITAL IMAGE PROCESSING - LECTURE NOTESDIGITAL IMAGE PROCESSING - LECTURE NOTES
DIGITAL IMAGE PROCESSING - LECTURE NOTES
 
Image Registration (Digital Image Processing)
Image Registration (Digital Image Processing)Image Registration (Digital Image Processing)
Image Registration (Digital Image Processing)
 
Convolutional Neural Network and Its Applications
Convolutional Neural Network and Its ApplicationsConvolutional Neural Network and Its Applications
Convolutional Neural Network and Its Applications
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image Processing
 
Edge detection-LOG
Edge detection-LOGEdge detection-LOG
Edge detection-LOG
 
Statistical Pattern recognition(1)
Statistical Pattern recognition(1)Statistical Pattern recognition(1)
Statistical Pattern recognition(1)
 
Dip chapter 2
Dip chapter 2Dip chapter 2
Dip chapter 2
 
Image Enhancement in Spatial Domain
Image Enhancement in Spatial DomainImage Enhancement in Spatial Domain
Image Enhancement in Spatial Domain
 
image compression ppt
image compression pptimage compression ppt
image compression ppt
 
Convolutional neural network
Convolutional neural network Convolutional neural network
Convolutional neural network
 
Lect 02 first portion
Lect 02   first portionLect 02   first portion
Lect 02 first portion
 
Introduction to digital image processing
Introduction to digital image processingIntroduction to digital image processing
Introduction to digital image processing
 
Image enhancement
Image enhancementImage enhancement
Image enhancement
 

Viewers also liked

Introduction to compressive sensing
Introduction to compressive sensingIntroduction to compressive sensing
Introduction to compressive sensingMohammed Musfir N N
 
Compressive Sensing Basics - Medical Imaging - MRI
Compressive Sensing Basics - Medical Imaging - MRICompressive Sensing Basics - Medical Imaging - MRI
Compressive Sensing Basics - Medical Imaging - MRIThomas Stefani
 
Ppt compressed sensing a tutorial
Ppt compressed sensing a tutorialPpt compressed sensing a tutorial
Ppt compressed sensing a tutorialTerence Gao
 
Introduction to compressive sensing
Introduction to compressive sensingIntroduction to compressive sensing
Introduction to compressive sensingAhmed Nasser Agag
 
Introduction to Compressive Sensing (Compressed Sensing)
Introduction to Compressive Sensing (Compressed Sensing)Introduction to Compressive Sensing (Compressed Sensing)
Introduction to Compressive Sensing (Compressed Sensing)Hamid Adldoost
 
スパースモデリング入門
スパースモデリング入門スパースモデリング入門
スパースモデリング入門Hideo Terada
 

Viewers also liked (9)

Sparse representation and compressive sensing
Sparse representation and compressive sensingSparse representation and compressive sensing
Sparse representation and compressive sensing
 
Lec17 sparse signal processing & applications
Lec17 sparse signal processing & applicationsLec17 sparse signal processing & applications
Lec17 sparse signal processing & applications
 
Introduction to compressive sensing
Introduction to compressive sensingIntroduction to compressive sensing
Introduction to compressive sensing
 
20150414seminar
20150414seminar20150414seminar
20150414seminar
 
Compressive Sensing Basics - Medical Imaging - MRI
Compressive Sensing Basics - Medical Imaging - MRICompressive Sensing Basics - Medical Imaging - MRI
Compressive Sensing Basics - Medical Imaging - MRI
 
Ppt compressed sensing a tutorial
Ppt compressed sensing a tutorialPpt compressed sensing a tutorial
Ppt compressed sensing a tutorial
 
Introduction to compressive sensing
Introduction to compressive sensingIntroduction to compressive sensing
Introduction to compressive sensing
 
Introduction to Compressive Sensing (Compressed Sensing)
Introduction to Compressive Sensing (Compressed Sensing)Introduction to Compressive Sensing (Compressed Sensing)
Introduction to Compressive Sensing (Compressed Sensing)
 
スパースモデリング入門
スパースモデリング入門スパースモデリング入門
スパースモデリング入門
 

Similar to Learning Sparse Representation

Mesh Processing Course : Geodesic Sampling
Mesh Processing Course : Geodesic SamplingMesh Processing Course : Geodesic Sampling
Mesh Processing Course : Geodesic SamplingGabriel Peyré
 
Lesson 28: The Fundamental Theorem of Calculus
Lesson 28: The Fundamental Theorem of CalculusLesson 28: The Fundamental Theorem of Calculus
Lesson 28: The Fundamental Theorem of CalculusMatthew Leingang
 
Lesson 28: The Fundamental Theorem of Calculus
Lesson 28: The Fundamental Theorem of CalculusLesson 28: The Fundamental Theorem of Calculus
Lesson 28: The Fundamental Theorem of CalculusMatthew Leingang
 
Lesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsLesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsMatthew Leingang
 
Lesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsLesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsMatthew Leingang
 
Lesson 31: Evaluating Definite Integrals
Lesson 31: Evaluating Definite IntegralsLesson 31: Evaluating Definite Integrals
Lesson 31: Evaluating Definite IntegralsMatthew Leingang
 
Lesson 26: The Fundamental Theorem of Calculus (Section 4 version)
Lesson 26: The Fundamental Theorem of Calculus (Section 4 version)Lesson 26: The Fundamental Theorem of Calculus (Section 4 version)
Lesson 26: The Fundamental Theorem of Calculus (Section 4 version)Matthew Leingang
 
Lesson 26: The Fundamental Theorem of Calculus (Section 10 version)
Lesson 26: The Fundamental Theorem of Calculus (Section 10 version)Lesson 26: The Fundamental Theorem of Calculus (Section 10 version)
Lesson 26: The Fundamental Theorem of Calculus (Section 10 version)Matthew Leingang
 
Ball Packings and Fat Voronoi Diagrams
Ball Packings and Fat Voronoi DiagramsBall Packings and Fat Voronoi Diagrams
Ball Packings and Fat Voronoi DiagramsDon Sheehy
 
Beam theory
Beam theoryBeam theory
Beam theorybissla19
 
The wild McKay correspondence
The wild McKay correspondenceThe wild McKay correspondence
The wild McKay correspondenceTakehiko Yasuda
 
Emat 213 study guide
Emat 213 study guideEmat 213 study guide
Emat 213 study guideakabaka12
 
Lesson 29: Integration by Substition
Lesson 29: Integration by SubstitionLesson 29: Integration by Substition
Lesson 29: Integration by SubstitionMatthew Leingang
 
Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Matthew Leingang
 
Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Matthew Leingang
 
Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Mel Anthony Pepito
 
Mesh Processing Course : Active Contours
Mesh Processing Course : Active ContoursMesh Processing Course : Active Contours
Mesh Processing Course : Active ContoursGabriel Peyré
 

Similar to Learning Sparse Representation (20)

Mesh Processing Course : Geodesic Sampling
Mesh Processing Course : Geodesic SamplingMesh Processing Course : Geodesic Sampling
Mesh Processing Course : Geodesic Sampling
 
Lesson 28: The Fundamental Theorem of Calculus
Lesson 28: The Fundamental Theorem of CalculusLesson 28: The Fundamental Theorem of Calculus
Lesson 28: The Fundamental Theorem of Calculus
 
Lesson 28: The Fundamental Theorem of Calculus
Lesson 28: The Fundamental Theorem of CalculusLesson 28: The Fundamental Theorem of Calculus
Lesson 28: The Fundamental Theorem of Calculus
 
Lesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsLesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite Integrals
 
Lesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite IntegralsLesson 27: Evaluating Definite Integrals
Lesson 27: Evaluating Definite Integrals
 
Caims 2009
Caims 2009Caims 2009
Caims 2009
 
Lesson 31: Evaluating Definite Integrals
Lesson 31: Evaluating Definite IntegralsLesson 31: Evaluating Definite Integrals
Lesson 31: Evaluating Definite Integrals
 
Lesson 26: The Fundamental Theorem of Calculus (Section 4 version)
Lesson 26: The Fundamental Theorem of Calculus (Section 4 version)Lesson 26: The Fundamental Theorem of Calculus (Section 4 version)
Lesson 26: The Fundamental Theorem of Calculus (Section 4 version)
 
Lesson 26: The Fundamental Theorem of Calculus (Section 10 version)
Lesson 26: The Fundamental Theorem of Calculus (Section 10 version)Lesson 26: The Fundamental Theorem of Calculus (Section 10 version)
Lesson 26: The Fundamental Theorem of Calculus (Section 10 version)
 
Ball Packings and Fat Voronoi Diagrams
Ball Packings and Fat Voronoi DiagramsBall Packings and Fat Voronoi Diagrams
Ball Packings and Fat Voronoi Diagrams
 
Beam theory
Beam theoryBeam theory
Beam theory
 
The wild McKay correspondence
The wild McKay correspondenceThe wild McKay correspondence
The wild McKay correspondence
 
Emat 213 study guide
Emat 213 study guideEmat 213 study guide
Emat 213 study guide
 
Lesson 29: Integration by Substition
Lesson 29: Integration by SubstitionLesson 29: Integration by Substition
Lesson 29: Integration by Substition
 
Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)
 
Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)
 
Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)Lesson 26: The Fundamental Theorem of Calculus (slides)
Lesson 26: The Fundamental Theorem of Calculus (slides)
 
Mesh Processing Course : Active Contours
Mesh Processing Course : Active ContoursMesh Processing Course : Active Contours
Mesh Processing Course : Active Contours
 
01 regras diferenciacao
01   regras diferenciacao01   regras diferenciacao
01 regras diferenciacao
 
Regras diferenciacao
Regras diferenciacaoRegras diferenciacao
Regras diferenciacao
 

More from Gabriel Peyré

Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...Gabriel Peyré
 
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...Gabriel Peyré
 
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
Low Complexity Regularization of Inverse Problems - Course #1 Inverse ProblemsLow Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
Low Complexity Regularization of Inverse Problems - Course #1 Inverse ProblemsGabriel Peyré
 
Low Complexity Regularization of Inverse Problems
Low Complexity Regularization of Inverse ProblemsLow Complexity Regularization of Inverse Problems
Low Complexity Regularization of Inverse ProblemsGabriel Peyré
 
Model Selection with Piecewise Regular Gauges
Model Selection with Piecewise Regular GaugesModel Selection with Piecewise Regular Gauges
Model Selection with Piecewise Regular GaugesGabriel Peyré
 
Signal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems RegularizationSignal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems RegularizationGabriel Peyré
 
Proximal Splitting and Optimal Transport
Proximal Splitting and Optimal TransportProximal Splitting and Optimal Transport
Proximal Splitting and Optimal TransportGabriel Peyré
 
Geodesic Method in Computer Vision and Graphics
Geodesic Method in Computer Vision and GraphicsGeodesic Method in Computer Vision and Graphics
Geodesic Method in Computer Vision and GraphicsGabriel Peyré
 
Adaptive Signal and Image Processing
Adaptive Signal and Image ProcessingAdaptive Signal and Image Processing
Adaptive Signal and Image ProcessingGabriel Peyré
 
Mesh Processing Course : Mesh Parameterization
Mesh Processing Course : Mesh ParameterizationMesh Processing Course : Mesh Parameterization
Mesh Processing Course : Mesh ParameterizationGabriel Peyré
 
Mesh Processing Course : Multiresolution
Mesh Processing Course : MultiresolutionMesh Processing Course : Multiresolution
Mesh Processing Course : MultiresolutionGabriel Peyré
 
Mesh Processing Course : Introduction
Mesh Processing Course : IntroductionMesh Processing Course : Introduction
Mesh Processing Course : IntroductionGabriel Peyré
 
Mesh Processing Course : Geodesics
Mesh Processing Course : GeodesicsMesh Processing Course : Geodesics
Mesh Processing Course : GeodesicsGabriel Peyré
 
Mesh Processing Course : Differential Calculus
Mesh Processing Course : Differential CalculusMesh Processing Course : Differential Calculus
Mesh Processing Course : Differential CalculusGabriel Peyré
 
Signal Processing Course : Theory for Sparse Recovery
Signal Processing Course : Theory for Sparse RecoverySignal Processing Course : Theory for Sparse Recovery
Signal Processing Course : Theory for Sparse RecoveryGabriel Peyré
 
Signal Processing Course : Presentation of the Course
Signal Processing Course : Presentation of the CourseSignal Processing Course : Presentation of the Course
Signal Processing Course : Presentation of the CourseGabriel Peyré
 
Signal Processing Course : Orthogonal Bases
Signal Processing Course : Orthogonal BasesSignal Processing Course : Orthogonal Bases
Signal Processing Course : Orthogonal BasesGabriel Peyré
 
Signal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsSignal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsGabriel Peyré
 
Signal Processing Course : Fourier
Signal Processing Course : FourierSignal Processing Course : Fourier
Signal Processing Course : FourierGabriel Peyré
 
Signal Processing Course : Denoising
Signal Processing Course : DenoisingSignal Processing Course : Denoising
Signal Processing Course : DenoisingGabriel Peyré
 

More from Gabriel Peyré (20)

Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
Low Complexity Regularization of Inverse Problems - Course #3 Proximal Splitt...
 
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
Low Complexity Regularization of Inverse Problems - Course #2 Recovery Guaran...
 
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
Low Complexity Regularization of Inverse Problems - Course #1 Inverse ProblemsLow Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
Low Complexity Regularization of Inverse Problems - Course #1 Inverse Problems
 
Low Complexity Regularization of Inverse Problems
Low Complexity Regularization of Inverse ProblemsLow Complexity Regularization of Inverse Problems
Low Complexity Regularization of Inverse Problems
 
Model Selection with Piecewise Regular Gauges
Model Selection with Piecewise Regular GaugesModel Selection with Piecewise Regular Gauges
Model Selection with Piecewise Regular Gauges
 
Signal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems RegularizationSignal Processing Course : Inverse Problems Regularization
Signal Processing Course : Inverse Problems Regularization
 
Proximal Splitting and Optimal Transport
Proximal Splitting and Optimal TransportProximal Splitting and Optimal Transport
Proximal Splitting and Optimal Transport
 
Geodesic Method in Computer Vision and Graphics
Geodesic Method in Computer Vision and GraphicsGeodesic Method in Computer Vision and Graphics
Geodesic Method in Computer Vision and Graphics
 
Adaptive Signal and Image Processing
Adaptive Signal and Image ProcessingAdaptive Signal and Image Processing
Adaptive Signal and Image Processing
 
Mesh Processing Course : Mesh Parameterization
Mesh Processing Course : Mesh ParameterizationMesh Processing Course : Mesh Parameterization
Mesh Processing Course : Mesh Parameterization
 
Mesh Processing Course : Multiresolution
Mesh Processing Course : MultiresolutionMesh Processing Course : Multiresolution
Mesh Processing Course : Multiresolution
 
Mesh Processing Course : Introduction
Mesh Processing Course : IntroductionMesh Processing Course : Introduction
Mesh Processing Course : Introduction
 
Mesh Processing Course : Geodesics
Mesh Processing Course : GeodesicsMesh Processing Course : Geodesics
Mesh Processing Course : Geodesics
 
Mesh Processing Course : Differential Calculus
Mesh Processing Course : Differential CalculusMesh Processing Course : Differential Calculus
Mesh Processing Course : Differential Calculus
 
Signal Processing Course : Theory for Sparse Recovery
Signal Processing Course : Theory for Sparse RecoverySignal Processing Course : Theory for Sparse Recovery
Signal Processing Course : Theory for Sparse Recovery
 
Signal Processing Course : Presentation of the Course
Signal Processing Course : Presentation of the CourseSignal Processing Course : Presentation of the Course
Signal Processing Course : Presentation of the Course
 
Signal Processing Course : Orthogonal Bases
Signal Processing Course : Orthogonal BasesSignal Processing Course : Orthogonal Bases
Signal Processing Course : Orthogonal Bases
 
Signal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsSignal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse Problems
 
Signal Processing Course : Fourier
Signal Processing Course : FourierSignal Processing Course : Fourier
Signal Processing Course : Fourier
 
Signal Processing Course : Denoising
Signal Processing Course : DenoisingSignal Processing Course : Denoising
Signal Processing Course : Denoising
 

Recently uploaded

BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...Nguyen Thanh Tu Collection
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFE
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFEPART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFE
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFEMISSRITIMABIOLOGYEXP
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6Vanessa Camilleri
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptxmary850239
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPCeline George
 
Objectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxObjectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxMadhavi Dharankar
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfChristalin Nelson
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroomSamsung Business USA
 
Sulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesSulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesVijayaLaxmi84
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxkarenfajardo43
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
 
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...Osopher
 
The role of Geography in climate education: science and active citizenship
The role of Geography in climate education: science and active citizenshipThe role of Geography in climate education: science and active citizenship
The role of Geography in climate education: science and active citizenshipKarl Donert
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptxmary850239
 
Employablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxEmployablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxryandux83rd
 
Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Celine George
 

Recently uploaded (20)

CARNAVAL COM MAGIA E EUFORIA _
CARNAVAL COM MAGIA E EUFORIA            _CARNAVAL COM MAGIA E EUFORIA            _
CARNAVAL COM MAGIA E EUFORIA _
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 - I-LEARN SMART WORLD - CẢ NĂM - CÓ FILE NGHE (BẢN...
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFE
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFEPART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFE
PART 1 - CHAPTER 1 - CELL THE FUNDAMENTAL UNIT OF LIFE
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx
 
Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...Introduction to Research ,Need for research, Need for design of Experiments, ...
Introduction to Research ,Need for research, Need for design of Experiments, ...
 
An Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERPAn Overview of the Calendar App in Odoo 17 ERP
An Overview of the Calendar App in Odoo 17 ERP
 
Objectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptxObjectives n learning outcoms - MD 20240404.pptx
Objectives n learning outcoms - MD 20240404.pptx
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdf
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom6 ways Samsung’s Interactive Display powered by Android changes the classroom
6 ways Samsung’s Interactive Display powered by Android changes the classroom
 
Sulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their usesSulphonamides, mechanisms and their uses
Sulphonamides, mechanisms and their uses
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
Healthy Minds, Flourishing Lives: A Philosophical Approach to Mental Health a...
 
The role of Geography in climate education: science and active citizenship
The role of Geography in climate education: science and active citizenshipThe role of Geography in climate education: science and active citizenship
The role of Geography in climate education: science and active citizenship
 
4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx4.11.24 Poverty and Inequality in America.pptx
4.11.24 Poverty and Inequality in America.pptx
 
Employablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptxEmployablity presentation and Future Career Plan.pptx
Employablity presentation and Future Career Plan.pptx
 
Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17Tree View Decoration Attribute in the Odoo 17
Tree View Decoration Attribute in the Odoo 17
 

Learning Sparse Representation

  • 12. Sparsity — Decomposition: $f = \sum_{m=0}^{Q-1} x_m d_m = Dx$. Sparsity: most $x_m$ are small (example: wavelet transform; image $f$, coefficients $x$). Ideal sparsity: most $x_m$ are zero, $J_0(x) = |\{m : x_m \neq 0\}|$.
  • 13. Sparsity — Approximate sparsity (compressibility): $\|f - Dx\|$ is small with $J_0(x) \leq M$.
  • 14. Sparse Coding — Redundant dictionary $D = \{d_m\}_{m=0}^{Q-1}$, $Q \geq N$: non-unique representation $f = Dx$. Sparsest decomposition: $\min_{f = Dx} J_0(x)$.
  • 15. Sparse Coding — Sparsest approximation: $\min_x \frac{1}{2}\|f - Dx\|^2 + \lambda J_0(x)$, equivalent to the constrained problems $\min_{J_0(x) \leq M} \|f - Dx\|$ and $\min_{\|f - Dx\| \leq \varepsilon} J_0(x)$.
  • 16. Sparse Coding — Ortho-basis $D$: hard thresholding, $x_m = \langle f, d_m \rangle$ if $|\langle f, d_m \rangle| \geq \sqrt{2\lambda}$ and $x_m = 0$ otherwise, i.e. pick the $M$ largest coefficients in $\{\langle f, d_m \rangle\}_m$.
  • 17. Sparse Coding — General redundant dictionary: NP-hard.
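In the orthogonal case the recipe of slide 16 is directly implementable: compute all inner products and keep the $M$ largest. A minimal NumPy sketch; the random orthobasis and the sizes are purely illustrative assumptions:

```python
import numpy as np

def best_m_term(f, D, M):
    """Best M-term approximation of f in an orthonormal dictionary D.

    The columns of D are orthonormal atoms, so x_m = <f, d_m> and the
    sparsest approximation keeps the M largest coefficients (hard threshold).
    """
    x = D.T @ f                           # analysis: inner products <f, d_m>
    x[np.argsort(np.abs(x))[:-M]] = 0.0   # zero all but the M largest
    return D @ x, x                       # synthesis: f_M = D x

rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # a random orthobasis
f = rng.standard_normal(64)
fM, x = best_m_term(f, D, M=8)
print(np.count_nonzero(x), np.linalg.norm(f - fM))   # 8 nonzeros, approximation error
```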
  • 18. Convex Relaxation: L1 Prior $J_0(x) = \#\{m : x_m \neq 0\}$. Image with 2 pixels: $J_0(x) = 0$: null image; $J_0(x) = 1$: sparse image; $J_0(x) = 2$: non-sparse image. [figure: atoms $d_0$, $d_1$ and the unit ball for $q = 0$]
  • 19. Convex Relaxation: L1 Prior $\ell^q$ priors: $J_q(x) = \sum_m |x_m|^q$ (convex for $q \geq 1$). [figure: $\ell^q$ unit balls for $q = 0, 1/2, 1, 3/2, 2$]
  • 20. Convex Relaxation: L1 Prior Sparse $\ell^1$ prior: $J_1(x) = \|x\|_1 = \sum_m |x_m|$.
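What makes the $\ell^1$ prior computationally attractive, beyond being the tightest convex member of this family, is that its proximal operator has a closed form: entrywise soft thresholding. A one-function numpy sketch (the name soft_threshold is mine):

```python
import numpy as np

def soft_threshold(x, lam):
    """Prox of lam*||.||_1: shrink every entry toward zero by lam, clip at zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```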
  • 22. Inverse Problems Denoising/approximation: $\Phi = \mathrm{Id}$. Examples: inpainting, super-resolution, compressed sensing.
  • 23. Regularized Inversion Denoising/compression: $y = f_0 + w \in \mathbb{R}^N$. Sparse approximation: $f^\star = D x^\star$ where $x^\star \in \operatorname{argmin}_x \frac{1}{2}\|y - D x\|^2 + \lambda \|x\|_1$ (fidelity term plus $\ell^1$ regularization).
  • 24. Regularized Inversion Inverse problems: $y = \Phi f_0 + w \in \mathbb{R}^P$. Replace $D$ by $\Phi D$: $x^\star \in \operatorname{argmin}_x \frac{1}{2}\|y - \Phi D x\|^2 + \lambda \|x\|_1$.
  • 25. Regularized Inversion Numerical solvers: proximal splitting schemes. www.numerical-tours.com
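The slides do not fix a particular solver; the simplest proximal splitting scheme for this problem is forward-backward iteration (ISTA), which alternates a gradient step on the quadratic fidelity with the soft-thresholding prox above. A hedged sketch, with A standing for $\Phi D$ (or $D$ alone when $\Phi = \mathrm{Id}$); names and iteration counts are illustrative:

```python
import numpy as np

def ista(y, A, lam, n_iter=200):
    """Forward-backward / ISTA for min_x 0.5*||y - A x||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = x - A.T @ (A @ x - y) / L          # gradient step on the fidelity
        x = np.sign(x) * np.maximum(np.abs(x) - lam / L, 0.0)  # prox step
    return x
```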
  • 27. Overview • Sparsity and Redundancy • Dictionary Learning • Extensions • Task-driven Learning • Texture Synthesis
  • 28. Dictionary Learning: MAP Energy Set of (noisy) exemplars $\{y_k\}_k$. Sparse approximation: $\min_{x_k} \frac{1}{2}\|y_k - D x_k\|^2 + \lambda \|x_k\|_1$.
  • 29. Dictionary Learning: MAP Energy Dictionary learning: $\min_{D \in C} \sum_k \min_{x_k} \frac{1}{2}\|y_k - D x_k\|^2 + \lambda \|x_k\|_1$.
  • 30. Dictionary Learning: MAP Energy Constraint: $C = \{D = (d_m)_m : \forall m, \|d_m\| \leq 1\}$. Otherwise $D \to +\infty$ while $X \to 0$ (scale ambiguity).
  • 31. Dictionary Learning: MAP Energy Matrix formulation: $\min_{X \in \mathbb{R}^{Q \times K},\ D \in C \subset \mathbb{R}^{N \times Q}} f(X, D) = \frac{1}{2}\|Y - D X\|^2 + \lambda \|X\|_1$.
  • 32. Dictionary Learning: MAP Energy Convex with respect to $X$; convex with respect to $D$; non-convex with respect to $(X, D)$: local minima.
  • 33. Dictionary Learning: Algorithm Step 1: $\forall k$, minimization on $x_k$: $\min_{x_k} \frac{1}{2}\|y_k - D x_k\|^2 + \lambda \|x_k\|_1$. Convex sparse coding. [figure: dictionary $D$ at initialization]
  • 34. Dictionary Learning: Algorithm Step 2: minimization on $D$: $\min_{D \in C} \|Y - D X\|^2$. Convex constrained minimization.
  • 35. Dictionary Learning: Algorithm Projected gradient descent: $D^{(\ell+1)} = \mathrm{Proj}_C\big( D^{(\ell)} - \tau (D^{(\ell)} X - Y) X^* \big)$.
  • 36. Dictionary Learning: Algorithm Convergence: toward a stationary point of $f(X, D)$. [figure: dictionary $D$ at convergence]
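Putting the two steps together gives the alternating scheme of the slides: a few ISTA iterations for the sparse coding step, then a projected gradient step on $D$, where the projection onto $C$ simply rescales any column of norm larger than 1. A compact numpy sketch under these assumptions; the initialization, step sizes, and iteration counts are illustrative choices, not prescriptions from the slides:

```python
import numpy as np

def project_C(D):
    """Projection on C: rescale every column d_m with ||d_m|| > 1 to unit norm."""
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms

def dictionary_learning(Y, Q, lam, outer=30, inner=50):
    """Alternating minimization of 0.5*||Y - D X||^2 + lam*||X||_1 over X and D in C."""
    rng = np.random.default_rng(0)
    N, K = Y.shape
    D = project_C(rng.standard_normal((N, Q)))        # random initialization
    X = np.zeros((Q, K))
    for _ in range(outer):
        # Step 1: sparse coding (D fixed), a few ISTA iterations on X.
        L = np.linalg.norm(D, 2) ** 2
        for _ in range(inner):
            X = X - D.T @ (D @ X - Y) / L
            X = np.sign(X) * np.maximum(np.abs(X) - lam / L, 0.0)
        # Step 2: projected gradient step on D (X fixed).
        tau = 1.0 / (np.linalg.norm(X, 2) ** 2 + 1e-10)
        D = project_C(D - tau * (D @ X - Y) @ X.T)
    return D, X
```

As the slides note, the joint energy is non-convex, so these iterates can only be expected to reach a stationary point.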
  • 37. Patch-based Learning Learning $D$ from exemplar patches $y_k$. [figure: exemplar patches $y_k$ and the learned dictionary $D$] [Olshausen, Fields 1997] State-of-the-art denoising [Elad et al. 2006]
  • 38. Patch-based Learning Sparse texture synthesis, inpainting [Peyré 2008]
  • 39. Comparison with PCA PCA dimensionality reduction: $\forall k, \min_{D^{(k)}} \|Y - D^{(k)} X\|$ with $D^{(k)} = (d_m)_{m=0}^{k-1}$. Linear (PCA): Fourier-like atoms. [figure, from Rubinstein et al., "Dictionaries for Sparse Representation": 12 x 12 DCT atoms next to the first 40 KLT atoms trained on 12 x 12 patches from Lena]
  • 40. Comparison with PCA Sparse (learning): Gabor-like atoms. [figure: a Gabor atom compared with a learned atom]
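For reference, the PCA atoms of the same exemplars come from a single SVD: by the Eckart-Young theorem, the best rank-$k$ factorization $Y \approx D^{(k)} X$ keeps the top $k$ left singular vectors. A short numpy sketch (the centering choice is mine):

```python
import numpy as np

def pca_dictionary(Y, k):
    """First k PCA atoms of the exemplar matrix Y (columns = patches)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)   # center the patches
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :k]                          # top k left singular vectors
```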
  • 41. Patch-based Denoising Noisy image: $f = f_0 + w$. Step 1: extract patches $y_k(\cdot) = f(z_k + \cdot)$. [Aharon & Elad 2006]
  • 42. Patch-based Denoising Step 2: dictionary learning, $\min_{D, (x_k)_k} \sum_k \frac{1}{2}\|y_k - D x_k\|^2 + \lambda \|x_k\|_1$.
  • 43. Patch-based Denoising Step 3: patch averaging, $\tilde{y}_k = D x_k$ and $\tilde{f}(\cdot) \propto \sum_k \tilde{y}_k(\cdot - z_k)$.
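Steps 1 and 3 are bookkeeping: slice the image into (possibly overlapping) patches, then paste each denoised patch back and average the overlaps. A numpy sketch of both (function names and stride handling are mine); step 2 is the dictionary learning sketched earlier:

```python
import numpy as np

def extract_patches(f, w, step=1):
    """Step 1: all w-by-w patches y_k(.) = f(z_k + .) as columns of a matrix."""
    H, W = f.shape
    zs = [(i, j) for i in range(0, H - w + 1, step)
                 for j in range(0, W - w + 1, step)]
    Y = np.stack([f[i:i + w, j:j + w].ravel() for i, j in zs], axis=1)
    return Y, zs

def average_patches(Y_tilde, zs, w, shape):
    """Step 3: paste each denoised patch back at z_k and average the overlaps."""
    acc, cnt = np.zeros(shape), np.zeros(shape)
    for k, (i, j) in enumerate(zs):
        acc[i:i + w, j:j + w] += Y_tilde[:, k].reshape(w, w)
        cnt[i:i + w, j:j + w] += 1.0
    return acc / np.maximum(cnt, 1.0)
```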
  • 44. Learning with Missing Data Inverse problem: $y = \Phi f_0 + w$. Energy: $\min_{f, (x_k)_k, D \in C} \frac{1}{2}\|y - \Phi f\|^2 + \lambda \sum_k \big( \frac{1}{2}\|p_k(f) - D x_k\|^2 + \mu \|x_k\|_1 \big)$. Patch extractor: $p_k(f) = f(z_k + \cdot)$. [figure: original $f_0$ and damaged observations $y$]
  • 45. Learning with Missing Data Step 1: $\forall k$, minimization on $x_k$: convex sparse coding.
  • 46. Learning with Missing Data Step 2: minimization on $D$: quadratic constrained.
  • 47. Learning with Missing Data Step 3: minimization on $f$: quadratic.
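When $\Phi$ is a masking operator, step 3 has a closed form: each pixel of $f$ is a weighted average of its observation (where one exists) and the overlapping patch estimates $D x_k$. A hedged numpy sketch of that update alone, assuming y is the observed image, mask flags the known pixels, and the weights and names are mine:

```python
import numpy as np

def update_image(y, mask, Y_tilde, zs, w, lam):
    """Step 3: minimize 0.5*||y - f||^2 on observed pixels
    plus (lam/2) * sum_k ||p_k(f) - D x_k||^2 over the image f."""
    acc, cnt = np.zeros(mask.shape), np.zeros(mask.shape)
    for k, (i, j) in enumerate(zs):          # patch estimates, weight lam each
        acc[i:i + w, j:j + w] += lam * Y_tilde[:, k].reshape(w, w)
        cnt[i:i + w, j:j + w] += lam
    acc[mask] += y[mask]                     # data fidelity on observed pixels
    cnt[mask] += 1.0
    return acc / np.maximum(cnt, 1e-10)
```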
  • 48. Inpainting Example [figure, from Mairal et al. 2008: image $f_0$, damaged observations $y = \Phi f_0 + w$ with 75% of the pixels removed (initial PSNR 6.13 dB), and restorations with an adaptive dictionary: 33.97 dB for the multiscale setting (N = 2, 16 x 16 patches) vs. 31.75 dB for the single-scale one (N = 1, 8 x 8 patches)]
  • 49. Adaptive Inpainting and Separation [figure: inpainting with wavelets, local DCT, wavelets + local DCT, and a learned dictionary] [Peyré, Fadili, Starck 2010]
  • 50. Overview • Sparsity and Redundancy • Dictionary Learning • Extensions • Task-driven Learning • Texture Synthesis
  • 51. Higher Dimensional Learning Learning $D$ on color patches (atoms of size $w \times w \times 3$). [figures, from Mairal et al., "Sparse Representation for Color Image Restoration": dictionaries of 256 atoms of size 7 x 7 x 3; a dictionary learned on a single training image is more colored than one learned on a generic database of natural images, and the adaptive approach consistently gives the best denoising results while reducing color artifacts]
  • 52. Higher Dimensional Learning Inpainting. [figure: color inpainting results; see also Mairal et al., "Online Learning for Matrix Factorization and Sparse Coding"]
  • 54. Facial Image Compression [Elad et al. 2009] Image registration. [figure, from Bryt and Elad, J. Vis. Commun. Image R. 19 (2008): piecewise affine warping of the face by triangulation, then uniform slicing into disjoint square patches]
  • 55. Facial Image Compression Non-overlapping patches $(f_k)_k$.
  • 56. Facial Image Compression Dictionary learning $(D_k)_k$: one K-SVD dictionary per patch location, trained on patches taken from the same position across the training faces. [figure: the dictionary obtained by K-SVD for patch no. 80, the left eye]