Estimating Structured Vector Autoregressive Models Using Regularized Estimation

ICML2016
Estimating Structured Vector
Autoregressive Models
NEC
ICML2016 2016/7/21

• Vector Autoregressive  
•
• i.i.d.
• →( ) Lasso Group-Lasso i.i.d.
• →( ) R ( )
• R
ap-
y in
lar-
(1)
uch
, is
dis-
ion
ani,
hine
ume
nancial time series (Tsay, 2005) to modeling the dynamical
systems (Ljung, 1998) and estimating brain function con-
nectivity (Valdes-Sosa et al., 2005), among others. A VAR
model of order d is defined as
xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2)
where xt 2 Rp
denotes a multivariate time series, Ak 2
Rp⇥p
, k = 1, . . . , d are the parameters of the model, and
d 1 is the order of the model. In this work, we as-
sume that the noise ✏t 2 Rp
follows a Gaussian distribu-
tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T
t ) = ⌃ and E(✏t✏T
t+⌧ ) = 0,
for ⌧ 6= 0. The VAR process is assumed to be stable and
stationary (Lutkepohl, 2007), while the noise covariance
matrix ⌃ is assumed to be positive definite with bounded
largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1.
In the current context, the parameters {Ak} are assumed
vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E)
y = Z + ✏,
where y 2 RNp
, Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2
, 2 Rdp2
,
✏ 2 RNp
, and ⌦ is the Kronecker product. The covari-
ance matrix of the noise ✏ is now E[✏✏T
] = ⌃ ⌦ IN⇥N .
Consequently, the regularized estimator takes the form
ˆ = argmin
2Rdp2
1
N
||y Z ||2
2 + N R( ), (4)
where R( ) can be any vector norm, separable along the
rows of matrices Ak. Specifically, if we denote =
[ T
1 . . . T
p ]T
and Ak(i, :) as the row of matrix Ak for
k = 1, . . . , d, then our assumption is equivalent to
R( )=
pX
i=1
R i =
pX
i=1
R
✓h
A1(i, :)T
. . .Ad(i, :)T
iT
◆
. (5)
where A
all the
for 2
of origi
Pd
k=1 A
and Sec
2.3. Pro
In what
trix X i
results w
bounds
Define
we assu
is distri
matrix C
C =
2

-  
Lasso 
[Basu15]
 
VAR
i.i.d. [Banerjee14]
VAR  
3

• i.i.d.
•
•
•
high probability
||Z ||2
|| ||2
p
N
= O(1) is a pos-
he estimation er-
ded by k k2 
ote that the norm
assumed to be the
rom our assump-
und on the regu-
+ w2
(⌦R)
N2
⌘
. As
nd d grows and
2
✏1) + log(p)) we
(1+✏1)
w2
(⌦R)
N2
◆
.
ion, we will show
some ⌫ > 0 and
p 1
, where Sdp 1
defined as ⌦Ej
=
R( j)
o
, for r >
T
, for j is of size
isfy N O N
+ N2 . With high
then the restricted eigenvalue condition ||Z ||
|| ||2
for 2 cone(⌦E) holds, so that  = O(1
itive constant. Moreover, the norm of the est
ror in optimization problem (4) is bounded b
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
)). Note tha
compatibility constant (cone(⌦Ej
)) is assume
same for all j = 1, . . . , p, which follows from o
tion in (5).
Consider now Theorem 3.3 and the bound on
larization parameter N O
⇣
w(⌦R)
p
N
+ w2
(
N
the dimensionality of the problem p and d
the number of samples N increases, the first t
will dominate the second one w2
(⌦R)
N2 . This c
by computing N for which the two terms bec
2
↑
w,
the regularization parameter
nd on R⇤
[ 1
N ZT
✏]  ↵, for
lish the required relationship
2 Rdp
|R(u)  1}, and define
be a Gaussian width of set
ny ✏1 > 0 and ✏2 > 0 with
( min(✏2
2, ✏1) + log(p)) we
(⌦R)
p
N
+ c1(1+✏1)
w2
(⌦R)
N2
◆
e constants.
alue condition, we will show
⌫, for some ⌫ > 0 and
L indicate stronger dependency in the data, thus
more samples for the RE conditions to hold with h
ability.
Analyzing Theorems 3.3 and 3.4 we can interpre
tablished results as follows. As the size and di
ality N, p and d of the problem increase, we e
the scale of the results and use the order notatio
note the constants. Select a number of sample
N O(w2
(⇥)) and let the regularization param
isfy N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. With high pr
then the restricted eigenvalue condition ||Z ||2
|| ||2
for 2 cone(⌦E) holds, so that  = O(1)
itive constant. Moreover, the norm of the estim
ror in optimization problem (4) is bounded by
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
)). Note that
)) is assumed
same for all j = 1, . . . , p, which follows from our
tion in (5).
y ⌘ > 0, inf
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
L 2
p
M cw(⇥) ⌘ and c, c1, c2 are
nts, and L, M are defined in (9) and (13).
3.4, we can choose ⌘ = 1
2
p
NL and set
2
p
M cw(⇥) ⌘ and since
p
N > 0
ed, we can establish a lower bound on the
ples N:
p
N > 2
p
M+cw(⇥)
p
L/2
= O(w(⇥)).
bound and using (9) and (13), we can con-
3.3
W
sep
ize
su
OW
3.
To
ram
VAR
h= 1
X(h) = E[Xj,:XT
j+h,:] 2 Rdp⇥dp
, then we can write
inf
1ldp
!2[0,2⇡]
⇤l[⇢X(!)]  ⇤k[CU ]
1kNdp
 sup
1ldp
!2[0,2⇡]
⇤l[⇢X(!)].
The closed form expression of spectral density is
⇢X(!) = I Ae i! 1
⌃E
h
I Ae i! 1
i⇤
,
where ⌃E is the covariance matrix of a noise vector and A
are as defined in expression (6). Thus, an upper bound on
CU can be obtained as ⇤max[CU ]  ⇤max(⌃)
⇤min(A) , where we
defined ⇤min(A) = min
!2[0,2⇡]
⇤min(A(!)) for
A(!) = I AT
ei!
I Ae i!
. (12)
Referring back to covariance matrix Qa in (11), we get
⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M. (13)
We note that for a general VAR model, there might not exist
closed-form expressions for ⇤max(A) and ⇤min(A). How-
ever, for some special cases there are results establishing
the bounds on these quantities (e.g., see Proposition 2.2 in
Lemma 3.2 Assume tha
condition holds
||Z
||
for 2 cone(⌦E) an
cone(⌦E) is a cone of th
|| ||2 
1 +
r
where (cone(⌦E)) is a
fined as (cone(⌦E)) =
Note that the above error
and (16) hold, then the
(17). However, the result
tities, involving Z and ✏,
the following we establis
Estimating Structured VAR
density is that it has a closed form expression (see Section
9.4 of (Priestley, 1981))
⇢(!)= I
dX
k=1
Ake ki!
! 1
⌃
2
4 I
dX
k=1
Ake ki!
! 1
3
5
⇤
,
where ⇤ denotes a Hermitian of a matrix. Therefore, from
(8) we can establish the following lower bound
⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9)
3. Regularized Es
Denote by = ˆ
optimization problem
rameter. The focus of
under which the optim
tees on the accuracy of
term is bounded: || ||
lish such conditions, w
et al., 2014). Specifica4

• : p vector VAR(d)
•
•
-
n
-
)
h
s
-
n
,
e
e
xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2)
where xt 2 Rp
Rp⇥p
t ) = ⌃ and E(✏t✏T
t+⌧ ) = 0,
timation problems of the form:
ˆ = argmin
2Rq
1
M
ky Z k2
2 + M R( ) , (1)
{(yi, zi), i = 1, . . . , M}, yi 2 R, zi 2 Rq
, such
= [yT
1 , . . . , yT
M ]T
and Z = [zT
1 , . . . , zT
M ]T
, is
ining set of M independently and identically dis-
d (i.i.d.) samples, M > 0 is a regularization
eter, and R(·) denotes a suitable norm (Tibshirani,
dings of the 33rd
International Conference on Machine
g, New York, NY, USA, 2016. JMLR: W&CP volume
pyright 2016 by the author(s).
xt = A1xt 1 + A2xt 2 + ·
where xt 2 Rp
denotes a mult
Rp⇥p
, k = 1, . . . , d are the par
d 1 is the order of the mod
fo
tion, ✏t ⇠ N (0, ⌃), with E(✏t✏T
t
for ⌧ 6= 0. The VAR process is
stationary (Lutkepohl, 2007), w
matrix ⌃ is assumed to be posi
largest eigenvalue, i.e., ⇤min(⌃)
In the current context, the para
characterizing sample complexity and error bounds.
2.1. Regularized Estimator
To estimate the parameters of the VAR model, we trans-
form the model in (2) into the form suitable for regularized
estimator (1). Let (x0, x1, . . . , xT ) denote the T + 1 sam-
ples generated by the stable VAR model in (2), then stack-
ing them together we obtain
2
6
6
6
4
xT
d
xT
d+1
...
xT
T
3
7
7
7
5
=
2
6
6
6
4
xT
d 1 xT
d 2 . . . xT
0
xT
d xT
d 1 . . . xT
1
...
...
...
...
xT
T 1 xT
T 2 . . . xT
T d
3
7
7
7
5
2
6
6
6
4
AT
1
AT
2
...
AT
d
3
7
7
7
5
+
2
6
6
6
4
✏T
d
✏T
d+1
...
✏T
T
3
7
7
7
5
which can also be compactly written as
Y = XB + E, (3)
where Y 2 RN⇥p
, X 2 RN⇥dp
, B 2 Rdp⇥p
, and E 2
RN⇥p
for N = T d+1. Vectorizing (column-wise) each
matrix in (3), we get
y = Z + ✏,
of matrix X (and conseque
dependencies, following (B
utilize the spectral represen
VAR models to control the d
2.2. Stability of VAR Mode
Since VAR models are (line
analysis we need to establi
VAR model (2) is stable, i.e
not diverge over time. For u
venient to rewrite VAR mod
alent VAR model of order 1
2
6
6
6
4
xt
xt 1
...
xt (d 1)
3
7
7
7
5
=
2
6
6
6
6
6
4
A1 A2 . . .
I 0 . . .
0 I . . .
...
...
...
0 0 . . .
| {z
A
where A 2 Rdp⇥dp
. There
all the eigenvalues of A sa
for 2 C, | | < 1. Equi
of original parameters Ak,
P
(1). Let (x0, x1, . . . , xT ) denote the T + 1 sam-
ated by the stable VAR model in (2), then stack-
ogether we obtain
=
2
6
6
6
4
xT
d 1 xT
d 2 . . . xT
0
xT
d xT
d 1 . . . xT
1
...
...
...
...
xT
T 1 xT
T 2 . . . xT
T d
3
7
7
7
5
2
6
6
6
4
AT
1
AT
2
...
AT
d
3
7
7
7
5
+
2
6
6
6
4
✏T
d
✏T
d+1
...
✏T
T
3
7
7
7
5
also be compactly written as
Y = XB + E, (3)
2 RN⇥p
, X 2 RN⇥dp
, B 2 Rdp⇥p
, and E 2
N = T d+1. Vectorizing (column-wise) each
(3), we get
ec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E)
Since VAR mo
analysis we ne
VAR model (2
not diverge ove
venient to rewr
alent VAR mod
2
6
6
6
4
xt
xt 1
...
xt (d 1)
3
7
7
7
5
=
2
6
6
6
6
6
4
|
where A 2 R
all the eigenva
for 2 C, |
les generated by the stable VAR model in (2), then stack-
ng them together we obtain
2
6
6
6
4
xT
d
xT
d+1
...
xT
T
3
7
7
7
5
=
2
6
6
6
4
xT
d 1 xT
d 2 . . . xT
0
xT
d xT
d 1 . . . xT
1
...
...
...
...
xT
T 1 xT
T 2 . . . xT
T d
3
7
7
7
5
2
6
6
6
4
AT
1
AT
2
...
AT
d
3
7
7
7
5
+
2
6
6
6
4
✏T
d
✏T
d+1
...
✏T
T
3
7
7
7
5
hich can also be compactly written as
Y = XB + E, (3)
here Y 2 RN⇥p
, X 2 RN⇥dp
, B 2 Rdp⇥p
, and E 2
N⇥p
Since V
analysi
VAR m
not div
venient
alent V
2
6
6
6
4
x
xt
xt (
where
all the
2
6
6
6
4
xT
d
xT
d+1
...
xT
T
3
7
7
7
5
=
2
6
6
6
4
xT
d 1 xT
d 2 . . . xT
0
xT
d xT
d 1 . . . xT
1
...
...
...
...
xT
T 1 xT
T 2 . . . xT
T d
3
7
7
7
5
2
6
6
6
4
AT
1
AT
2
...
AT
d
3
7
7
7
5
+
2
6
6
6
4
✏T
d
✏T
d+1
...
✏T
T
3
7
7
7
5
Y = XB + E, (3)
where Y 2 RN⇥p
, X 2 RN⇥dp
, B 2 Rdp⇥p
, and E 2
RN⇥p
2
6
6
6
4
xT
d
xT
d+1
...
xT
T
3
7
7
7
5
=
2
6
6
6
4
xT
d 1 xT
d 2 . . . xT
0
xT
d xT
d 1 . . . xT
1
...
...
...
...
xT
T 1 xT
T 2 . . . xT
T d
3
7
7
7
5
2
6
6
6
4
AT
1
AT
2
...
AT
d
3
7
7
7
5
+
2
6
6
6
4
✏T
d
✏T
d+1
...
✏T
T
3
7
7
7
5
Y = XB + E, (3)
where Y 2 RN⇥p
, X 2 RN⇥dp
, B 2 Rdp⇥p
, and E 2
RN⇥p
2
6
6
6
4
xT
d
xT
d+1
...
xT
T
3
7
7
7
5
=
2
6
6
6
4
xT
d 1 xT
d 2 . . . xT
0
xT
d xT
d 1 . . . xT
1
...
...
...
...
xT
T 1 xT
T 2 . . . xT
T d
3
7
7
7
5
2
6
6
6
4
AT
1
AT
2
...
AT
d
3
7
7
7
5
+
2
6
6
6
4
✏T
d
✏T
d+1
...
✏T
T
3
7
7
7
5
Y = XB + E, (3)
where Y 2 RN⇥p
, X 2 RN⇥dp
, B 2 Rdp⇥p
, and E 2
RN⇥p
y = Z + ✏,
where y 2 RNp
, Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2
, 2 Rdp2
,
✏ 2 RNp
analysis we
VAR model
not diverge o
venient to re
alent VAR m
2
6
6
6
4
xt
xt 1
...
xt (d 1)
3
7
7
7
5
where A 2
all the eigen
for 2 C,
of original p
Pd
k=1 Ak
1
k
and Section
xT
T xT
T 1 xT
T 2 . . . xT
T d AT
d ✏T
T
Y = XB + E, (3)
where Y 2 RN⇥p
, X 2 RN⇥dp
, B 2 Rdp⇥p
, and E 2
RN⇥p
y = Z + ✏,
where y 2 RNp
, Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2
, 2 Rdp2
,
✏ 2 RNp
ance matrix of the noise ✏ is now E[✏✏T
] = ⌃ ⌦ IN⇥N .
Consequently, the regularized estimator takes the form
ˆ = argmin
2Rdp2
1
N
||y Z ||2
2 + N R( ), (4)
2
6
6
6
4
xt
xt 1
...
xt (d 1)
3
7
7
7
5
=
where A 2
all the eigen
for 2 C,
of original p
Pd
k=1 Ak
1
k
and Section
2.3. Propert
In what follo
trix X in (3)
results will th
i.i.d.
5

2 ⌦E
 
 
Lasso regularization norms. I
the main ideas of our proof te
delegated to the supplement.
To establish lower bound on
N , we derive an upper bou
some ↵ > 0, which will estab
N ↵ R⇤
[ 1
N ZT
✏].
Theorem 3.3 Let ⌦R = {u 2
w(⌦R) = E[ sup
u2⌦R
hg, ui] to
⌦R for g ⇠ N(0, I). For a
probability at least 1 c exp
can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w
Thm4.3
Thm4.4
Lem.4.1
Lem.4.2
o hold with high prob-
e can interpret the es-
e size and dimension-
crease, we emphasize
order notations to de-
er of samples at least
ization parameter sat-
With high probability
tion ||Z ||2
|| ||2
p
N
 = O(1) is a pos-
of the estimation er-
bounded by k k2 
)). Note that the norm
ction 3.4 we will present
ique, with all the details
regularization parameter
n R⇤
[ 1
N ZT
✏]  ↵, for
the required relationship
p
a Gaussian width of set
> 0 and ✏2 > 0 with
min(✏2
2, ✏1) + log(p)) we
)
+ c1(1+✏1)
w2
(⌦R)
N2
◆
nstants.
condition, we will show
N as showing that large values of M and small va
L indicate stronger dependency in the data, thus req
more samples for the RE conditions to hold with high
ability.
Analyzing Theorems 3.3 and 3.4 we can interpret
tablished results as follows. As the size and dime
ality N, p and d of the problem increase, we emp
the scale of the results and use the order notations
note the constants. Select a number of samples a
N O(w2
(⇥)) and let the regularization paramet
isfy N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. With high prob
|| ||2
for 2 cone(⌦E) holds, so that  = O(1) is
itive constant. Moreover, the norm of the estimat
ror in optimization problem (4) is bounded by k
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
)). Note that the
)) is assumed to
same for all j = 1, . . . , p, which follows from our as
o understand the bound on
s of M and small values of
y in the data, thus requiring
ions to hold with high prob-
3.4 we can interpret the es-
As the size and dimension-
m increase, we emphasize
e the order notations to de-
number of samples at least
gularization parameter sat-
R)
⌘
. With high probability
condition ||Z ||2
|| ||2
p
N
that  = O(1) is a pos-
norm of the estimation er-
4) is bounded by k k2 
⌦Ej
⌦Ej
)) is assumed to be the 6

2 ⌦E
 
 
N ↵ R⇤
[ 1
N ZT
✏].
w(⌦R) = E[ sup
u2⌦R
hg, ui] to
can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w
Thm4.3
Thm4.4
Lem.4.1
tion ||Z ||2
|| ||2
p
N
 = O(1) is a pos-
bounded by k k2 
n R⇤
[ 1
N ZT
✏]  ↵, for
p
> 0 and ✏2 > 0 with
min(✏2
2, ✏1) + log(p)) we
)
+ c1(1+✏1)
w2
(⌦R)
N2
◆
nstants.
ability.
N O(w2
isfy N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. With high prob
|| ||2
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
)). Note that the
)) is assumed to
Lem.4.2
R)
⌘
condition ||Z ||2
|| ||2
p
N
⌦Ej
⌦Ej

Restricted Eigenvalue Condition
•  
(well-conditioned)
• cone Lasso-type
•
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w(⌦R)
p
N
+ c1(1+✏1)
w2
(⌦R)
N2
◆
where c, c1 and c2 are positive constants.
To establish restricted eigenvalue condition, we will show
that inf
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
⌫, for some ⌫ > 0 and
then set
p
N = ⌫.
Theorem 3.4 Let ⇥ = cone(⌦Ej
) Sdp 1
, where Sdp 1
is a unit sphere. The error set ⌦Ej
is deﬁned as ⌦Ej
=n
j 2 Rdp
R( ⇤
j + j)  R( ⇤
j ) + 1
r R( j)
o
, for r >
1, j = 1, . . . , p, and = [ T
1 , . . . , T
p ]T
, for j is of size
dp ⇥ 1, and ⇤
= [ ⇤T
1 . . . ⇤T
p ]T
, for ⇤
j 2 Rdp
. The set
⌦Ej
is a part of the decomposition in ⌦E = ⌦E1
⇥ · · · ⇥
⌦E due to the assumption on the row-wise separability of
for
itiv
ror
O
⇣
com
sam
tion
Con
lari
the
the
wil
by
w(⌦
p
w(⌦
low
 
( )⌦E
N ↵ R [N Z ✏].
Theorem 3.3 Let ⌦R = {u 2 Rd
w(⌦R) = E[ sup
u2⌦R
hg, ui] to be
⌦R for g ⇠ N(0, I). For any ✏
probability at least 1 c exp( m
can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w(⌦R
p
N
where c, c1 and c2 are positive co
To establish restricted eigenvalue
that inf
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
then set
p
N = ⌫.
Theorem 3.4 Let ⇥ = cone(⌦Ej⌦E 8

Restricted Error Set
•
•
• r=2
⇤
⇤
+ ⌦E
(L1)(L1)
conecone
ce matrix
1.2 of the
(11)
XT
N,:
⇤T
2
ng all the
n order to
Qa), ob-
y stacking
as in (8),
process in
⇡], where
write
X(!)].
1
i⇤
,
for some constant r > 1, where R⇤
⇥ 1
N ZT
✏
⇤
is a
dual form of the vector norm R(·), which is defined as
R⇤
[ 1
N ZT
✏] = sup
R(U)1
⌦ 1
N ZT
✏, U
↵
, for U 2 Rdp2
, where
U = [uT
1 , uT
2 , . . . , uT
p ]T
and ui 2 Rdp
. Then the error
vector k k2 belongs to the set
⌦E=
⇢
2 Rdp2
R( ⇤
+ )  R( ⇤
)+
1
r
R( ) . (15)
The second condition in (Banerjee et al., 2014) establishes
the upper bound on the estimation error.
Lemma 3.2 Assume that the restricted eigenvalue (RE)
condition holds
||Z ||2
|| ||2
p
N, (16)
ting Structured VAR
tion
3
5
⇤
,
rom
(9)
r
(10)
need
3. Regularized Estimation Guarantees
Denote by = ˆ ⇤
the error between the solution of
optimization problem (4) and ⇤
, the true value of the pa-
rameter. The focus of our work is to determine conditions
under which the optimization problem in (4) has guaran-
tees on the accuracy of the obtained solution, i.e., the error
term is bounded: || ||2  for some known . To estab-
lish such conditions, we utilize the framework of (Banerjee
et al., 2014). Specifically, estimation error analysis is based
on the following known results adapted to our settings. The
first one characterizes the restricted error set ⌦E, where the
error belongs.
Lemma 3.1 Assume that

(r>1)
⌦E
2⌦E
9

2 ⌦E
 
 
N ↵ R⇤
[ 1
N ZT
✏].
w(⌦R) = E[ sup
u2⌦R
hg, ui] to
can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w
Thm4.3
Thm4.4
Lem.4.1
Lem.4.2
tion ||Z ||2
|| ||2
p
N
 = O(1) is a pos-
bounded by k k2 
n R⇤
[ 1
N ZT
✏]  ↵, for
p
> 0 and ✏2 > 0 with
min(✏2
2, ✏1) + log(p)) we
)
+ c1(1+✏1)
w2
(⌦R)
N2
◆
nstants.
ability.
N O(w2
isfy N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. With high prob
|| ||2
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
)). Note that the
)) is assumed to
R)
⌘
condition ||Z ||2
|| ||2
p
N
⌦Ej
⌦Ej

• 4.2
•
•
•
k k2
ond condition in (Banerjee et al., 2014) establishes
er bound on the estimation error.
a 3.2 Assume that the restricted eigenvalue (RE)
on holds
||Z ||2
|| ||2
p
N, (16)
2 cone(⌦E) and some constant  > 0, where
E) is a cone of the error set, then
|| ||2 
1 + r
r
N

(cone(⌦E)), (17)
(cone(⌦E)) is a norm compatibility constant, de-
(cone(⌦E)) = sup R(U)
||U||2
.
,
density of the VAR process in
X(h)e hi!
, ! 2 [0, 2⇡], where
Rdp⇥dp
, then we can write
⇤k[CU ]
kNdp
 sup
1ldp
!2[0,2⇡]
⇤l[⇢X(!)].
on of spectral density is
1
⌃E
h
I Ae i! 1
i⇤
,
e matrix of a noise vector and A
on (6). Thus, an upper bound on
max[CU ]  ⇤max(⌃)
n
2⇡]
⇤min(A(!)) for
AT
ei!
I Ae i!
. (12)
nce matrix Qa in (11), we get
ax(⌃)/⇤min(A) = M. (13)
The second condition in (Banerjee et al., 2014
Lemma 3.2 Assume that the restricted eigen
condition holds
||Z ||2
|| ||2
p
N,
for 2 cone(⌦E) and some constant  >
cone(⌦E) is a cone of the error set, then
|| ||2 
1 + r
r
N

(cone(⌦E)),
where (cone(⌦E)) is a norm compatibility c
ﬁned as (cone(⌦E)) = sup
U2cone(⌦E )
R(U)
||U||2
.
Note that the above error bound is deterministic
rocess in
⇡], where
rite
(!)].
1
i⇤
,
or and A
bound on
where we
(12)
e get
(13)
condition holds
||Z ||2
|| ||2
p
N, (16)
for 2 cone(⌦E) and some constant  > 0, where
|| ||2 
1 + r
r
N

(cone(⌦E)), (17)
where (cone(⌦E)) is a norm compatibility constant, de-
U2cone(⌦E )
R(U)
||U||2
.
The second condition in (Banerjee et al., 2014) establishes
he upper bound on the estimation error.
ondition holds
||Z ||2
|| ||2
p
N, (16)
or 2 cone(⌦E) and some constant  > 0, where
one(⌦E) is a cone of the error set, then
|| ||2 
1 + r
r
N

(cone(⌦E)), (17)
U2cone(⌦E )
R(U)
||U||2
.
condition holds
||Z ||2
|| ||2
p
N, (16)
|| ||2 
1 + r
r
N

(cone(⌦E)), (17)
U2cone(⌦E )
R(U)
||U||2
.
 
E
k = 1, . . . , d. Since L1 is decomposable, it can
that (cone(⌦Ej ))  4
p
s. Next, since ⌦R
Rdp
|R(u)  1}, then using Lemma 3 in (Baner
2014) and Gaussian width results in (Chandrasekap
d A
on
we
12)
13)
condition holds
||Z ||2
|| ||2
p
N, (16)
|| ||2 
1 + r
r
N

(cone(⌦E)), (17)
U2cone(⌦E )
R(U)
||U||2
.
Note that the above error bound is deterministic, i.e., if (14)
s:
11

•
•
•
• Lasso
k k2
or 2 cone(⌦E) and some constant  > 0, where
one(⌦E) is a cone of the error set, then
|| ||2 
1 + r
r
N

(cone(⌦E)), (17)
U2cone(⌦E )
R(U)
||U||2
.
Note that the above error bound is deterministic, i.e., if (14)
nd (16) hold, then the error satisfies the upper bound in
17). However, the results are defined in terms of the quan-
ities, involving Z and ✏, which are random. Therefore, in
he following we establish high probability bounds on the
egularization parameter in (14) and RE condition in (16).
etails
meter
, for
nship
efine
of set
with
) we
R)
◆
show
0 and
dp 1
L indicate stronger dependency in the data, thus requiring
more samples for the RE conditions to hold with high prob-
ability.
Analyzing Theorems 3.3 and 3.4 we can interpret the es-
tablished results as follows. As the size and dimension-
ality N, p and d of the problem increase, we emphasize
the scale of the results and use the order notations to de-
note the constants. Select a number of samples at least
N O(w2
(⇥)) and let the regularization parameter sat-
isfy N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
|| ||2
p
N
for 2 cone(⌦E) holds, so that  = O(1) is a pos-
itive constant. Moreover, the norm of the estimation er-
ror in optimization problem (4) is bounded by k k2 
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
)) is assumed to be the
same for all j = 1, . . . , p, which follows from our assump-
tion in (5).
Consider now Theorem 3.3 and the bound on the regu-
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. As
the dimensionality of the problem p and d grows and
ze and dimension-
ase, we emphasize
er notations to de-
of samples at least
tion parameter sat-
th high probability
n ||Z ||2
|| ||2
p
N
= O(1) is a pos-
the estimation er-
unded by k k2 
Note that the norm
s assumed to be the
s from our assump-
ound on the regu-
)
+ w2
(⌦R)
⌘
. As
N
equired relationship
u)  1}, and define
aussian width of set
0 and ✏2 > 0 with
✏2
2, ✏1) + log(p)) we
c1(1+✏1)
w2
(⌦R)
N2
◆
nts.
dition, we will show
or some ⌫ > 0 and
Sdp 1
, where Sdp 1
s defined as ⌦Ej
=o
Analyzing Theorems 3.3 and 3.4 we can interpret the es-
tablished results as follows. As the size and dimension-
ality N, p and d of the problem increase, we emphasize
the scale of the results and use the order notations to de-
note the constants. Select a number of samples at least
N O(w2
isfy N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
|| ||2
p
N
for 2 cone(⌦E) holds, so that  = O(1) is a pos-
itive constant. Moreover, the norm of the estimation er-
ror in optimization problem (4) is bounded by k k2 
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
same for all j = 1, . . . , p, which follows from our assump-
tion in (5).
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. As
the number of samples N increases, the first term w(⌦R)
p
N
2
□
0 and
Sdp 1
Ej
=
r r >
of size
he set
· · · ⇥
lity of
ui] to
d u 2
2⌘2
+
⌫,
c2 are
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. As
the number of samples N increases, the first term w(⌦R)
p
N
will dominate the second one w2
(⌦R)
N2 . This can be seen
by computing N for which the two terms become equal
w(⌦R)
p
N
= w2
(⌦R)
N2 , which happens at N = w
2
3 (⌦R) <
w(⌦R). Therefore, we can rewrite our results as fol-
lows: once the restricted eigenvalue condition holds and
N O
⇣
w(⌦R)
p
N
⌘
, the error norm is upper-bounded by
k k2  O
⇣
w(⌦R)
p
N
⌘
(cone(⌦Ej
)).
3.3. Special Cases
While the presented results are valid for any norm R(·),
k = 1, . . . , d. Since L1 is decomposable, it can be shown
hat (cone(⌦Ej ))  4
p
s. Next, since ⌦R = {u 2
Rdp
|R(u)  1}, then using Lemma 3 in (Banerjee et al.,
2014) and Gaussian width results in (Chandrasekaran et al.,
2012), we can establish that w(⌦R)  O(
p
log(dp)).
Therefore, based on Theorem 4.3 and the discussion at the
nd of Section 3.2, the bound on the regularization parame-
er takes the form N O
⇣p
log(dp)/N
⌘
. Hence, the es-
⇣p ⌘
quence of weights and | |(1)
quence of absolute values of ,
In (Chen & Banerjee, 2015) it
O(
p
log(dp)/¯c), where ¯c is
and the norm compatibility co
2c2
1
p
s/¯c. Therefore, based on
O
⇣p
log(dp)/(¯cN)
⌘
and the e
by k k2  O
⇣
2c1
¯c
p
s log(dp)
ce L1 is decomposable, it can be shown
))  4
p
s. Next, since ⌦R = {u 2
then using Lemma 3 in (Banerjee et al.,
an width results in (Chandrasekaran et al.,
stablish that w(⌦R)  O(
p
log(dp)).
on Theorem 4.3 and the discussion at the
, the bound on the regularization parame-
N O
⇣p
log(dp)/N
⌘
. Hence, the es-
ounded by k k2  O
⇣p
s log(dp)/N
⌘
(log(dp)).
quence of weights and | |(1)
quence of absolute values of
In (Chen & Banerjee, 2015) it
O(
p
log(dp)/¯c), where ¯c is
and the norm compatibility co
2c2
1
p
s/¯c. Therefore, based on
O
⇣p
log(dp)/(¯cN)
⌘
and the
by k k2  O
⇣
2c1
¯c
p
s log(dp)
We note that the bound obtained
is similar to the bound obtaine
k = 1, . . . , d. Since L1 is decompo
p
s. Nex
Rdp
|R(u)  1}, then using Lemma
2014) and Gaussian width results in
2012), we can establish that w(⌦
Therefore, based on Theorem 4.3 an
end of Section 3.2, the bound on the
ter takes the form N O
⇣p
log(d
timation error is bounded by k k2 
as long as N > O(log(dp)).
 k = 1, . . . , d. Since L1 is decomposable, it ca
p
s. Next, since ⌦R
Rdp
|R(u)  1}, then using Lemma 3 in (Ban
2014) and Gaussian width results in (Chandrase
2012), we can establish that w(⌦R)  O(
Therefore, based on Theorem 4.3 and the discu
end of Section 3.2, the bound on the regularizat⇣p ⌘12

2 ⌦E
 
 
N ↵ R⇤
[ 1
N ZT
✏].
w(⌦R) = E[ sup
u2⌦R
hg, ui] to
can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w
R)
⌘
condition ||Z ||2
|| ||2
p
N
⌦Ej
⌦Ej
Thm4.3
Thm4.4
Lem.4.1
Lem.4.2
tion ||Z ||2
|| ||2
p
N
 = O(1) is a pos-
bounded by k k2 
n R⇤
[ 1
N ZT
✏]  ↵, for
p
> 0 and ✏2 > 0 with
min(✏2
2, ✏1) + log(p)) we
)
+ c1(1+✏1)
w2
(⌦R)
N2
◆
nstants.
ability.
N O(w2
isfy N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. With high prob
|| ||2
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
)). Note that the
)) is assumed to
same for all j = 1, . . . , p, which follows from our as13

( ) : R -
• R -
•
• →
) can be any vector norm, separable along the
matrices Ak. Specifically, if we denote =
T
]T
, d, then our assumption is equivalent to
pX
=1
R i =
pX
i=1
R
✓h
A1(i, :)T
. . .Ad(i, :)T
iT
◆
. (5)
clutter and without loss of generality, we assume
R(·) to be the same for each row i. Since the
ecouples across rows, it is straightforward to ex-
analysis to the case when a different norm is used
ow of Ak, e.g., L1 for row one, L2 for row two,
t norm (Argyriou et al., 2012) for row three, etc.
hat within a row, the norm need not be decompos-
results wil
bounds fo
Define an
we assum
is distribu
matrix CX
CX =
2
6
6
4
where (
since CX
bounded a
[ T
1 . . . T
p ]T
R( )=
pX
i=1
R i =
pX
i=1
R
✓h
A1(i, :)T
. . .Ad(i, :)T
iT
◆
. (5)
To reduce clutter and without loss of generality, we assume
the norm R(·) to be the same for each row i. Since the
analysis decouples across rows, it is straightforward to ex-
tend our analysis to the case when a different norm is used
for each row of Ak, e.g., L1 for row one, L2 for row two,
K-support norm (Argyriou et al., 2012) for row three, etc.
Observe that within a row, the norm need not be decompos-
A2A1 A3 A4
ˆ = argmin
2Rdp2
1
N
||y Z ||2
2 + N R( ), (4)
) can be any vector norm, separable along the
matrices Ak. Specifically, if we denote =
T
]T
, d, then our assumption is equivalent to
pX
=1
R i =
pX
i=1
R
✓h
A1(i, :)T
. . .Ad(i, :)T
iT
◆
. (5)
clutter and without loss of generality, we assume
R(·) to be the same for each row i. Since the
ecouples across rows, it is straightforward to ex-
analysis to the case when a different norm is used
In what fo
trix X in
results wil
bounds fo
Define an
we assum
is distribu
matrix CX
CX =
2
6
6
4
where (
ˆ = argmin
2Rdp2
1
N
||y Z ||2
2 + N R( ), (4)
[ T
1 . . . T
p ]T
R( )=
pX
i=1
R i =
pX
i=1
R
✓h
A1(i, :)T
. . .Ad(i, :)T
iT
◆
. (5)
To reduce clutter and without loss of generality, we assume
the norm R(·) to be the same for each row i. Since the
analysis decouples across rows, it is straightforward to ex-
tend our analysis to the case when a different norm is used
A2A1 A3 A4
= argmin
2Rdp2 N
||y Z ||2 + N R( ), (4)
ere R( ) can be any vector norm, separable along the
ws of matrices Ak. Specifically, if we denote =
. . . T
p ]T
= 1, . . . , d, then our assumption is equivalent to
( )=
pX
i=1
R i =
pX
i=1
R
✓h
A1(i, :)T
. . .Ad(i, :)T
iT
◆
. (5)
reduce clutter and without loss of generality, we assume
norm R(·) to be the same for each row i. Since the
lysis decouples across rows, it is straightforward to ex-
d our analysis to the case when a different norm is used
each row of Ak, e.g., L1 for row one, L2 for row two,
support norm (Argyriou et al., 2012) for row three, etc.
trix
resu
bou
Defi
we
is d
mat
C
whe
sinc
bou
on ap-
ially in
regular-
(1)
q
, such
T
M ]T
, is
lly dis-
rization
shirani,
Machine
volume
ranging from describing the behavior of economic and fi-
xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2)
where xt 2 Rp
Rp⇥p
t ) = ⌃ and E(✏t✏T
t+⌧ ) = 0,
Aj
Theorem 4.3 Let ⌦R = {u 2 R |R(u)  1}
of set ⌦R for g ⇠ N(0, I). For any ✏1 > 0 an
log(p)) we can establish that
R⇤

1
N
ZT
✏ 
✓
c2(
To establish restricted eigenvalue condition,
⌫ > 0 and then set
p
N = ⌫.
Theorem 4.4 Let ⇥ = cone(⌦Ej ) Sdp 1, w
⌦Ej =
n
j 2 Rdp R( ⇤
j + j)  R( ⇤
j ) +
for j is of size dp ⇥ 1, and ⇤ = [ ⇤T
1 . . . ⇤
p
in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assum
14

• 3.4
•
•
⇤T
1 . . . ⇤T
p ]T
, for ⇤
j 2 Rdp
. The set
decomposition in ⌦E = ⌦E1
⇥ · · · ⇥
mption on the row-wise separability of
Also define w(⇥) = E[sup
u2⇥
hg, ui] to
of set ⇥ for g ⇠ N(0, I) and u 2
bability at least 1 c1 exp( c2⌘2
+
> 0, inf
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
2
p
nd L, M are defined in (9) and (13).
we can choose ⌘ = 1
2
p
NL and set
p
p
N > 0
e can establish a lower bound on the
N:
p
N > 2
p
M+cw(⇥)
p
L/2
= O(w(⇥)).
d and using (9) and (13), we can con-
er of samples needed to satisfy the
condition is smaller if ⇤min(A) and
w(⌦R)
p
N
= w (⌦R)
N2 , which happens at N
w(⌦R). Therefore, we can rewrite ou
lows: once the restricted eigenvalue con
N O
⇣
w(⌦R)
p
N
⌘
, the error norm is u
k k2  O
⇣
w(⌦R)
p
N
⌘
(cone(⌦Ej
)).
3.3. Special Cases
While the presented results are valid fo
separable along the rows of Ak, it is instr
ize our analysis to a few popular regula
such as L1 and Group Lasso, Sparse G
OWL norms.
3.3.1. LASSO
To establish results for L1 norm, we ass
rameter ⇤
is s-sparse, which in our case
resent the largest number of non-zero ele
i = 1, . . . , p, i.e., the combined i-th r
⌦Ej
is a part of the decomposition in ⌦E =
⌦Ep
due to the assumption on the row-wise
norm R(·) in (5). Also define w(⇥) = E
be a Gaussian width of set ⇥ for g ⇠ N
Rdp
. Then with probability at least 1 c
log(p)), for any ⌘ > 0, inf
2cone(⌦E )
||(Ip⇥p
||
where ⌫ =
p
NL 2
p
M cw(⇥) ⌘ a
positive constants, and L, M are defined in
3.2. Discussion
From Theorem 3.4, we can choose ⌘ = 1
2p
N =
p
NL 2
p
M cw(⇥) ⌘ and s
must be satisfied, we can establish a lowe
number of samples N:
p
N > 2
p
M+cw(⇥
p
L/2
Examining this bound and using (9) and (1
clude that the number of samples needed
restricted eigenvalue condition is smaller i
1, j = 1, . . . , p, and = [ 1 , . . . , p ] , for j is of size
dp ⇥ 1, and ⇤
= [ ⇤T
1 . . . ⇤T
p ]T
, for ⇤
j 2 Rdp
. The set
⌦Ej
⇥ · · · ⇥
⌦Ep
due to the assumption on the row-wise separability of
norm R(·) in (5). Also define w(⇥) = E[sup
u2⇥
hg, ui] to
be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2
Rdp
. Then with probability at least 1 c1 exp( c2⌘2
+
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
where ⌫ =
p
NL 2
p
positive constants, and L, M are defined in (9) and (13).
3.2. Discussion
2
p
NL and set
p
N =
p
NL 2
p
p
N > 0
must be satisfied, we can establish a lower bound on the
p
N > 2
p
M+cw(⇥)
p
L/2
= O(w(⇥)).
Examining this bound and using (9) and (13), we can con-
clude that the number of samples needed to satisfy the
restricted eigenvalue condition is smaller if ⇤min(A) and
w(
p
w(
low
N
k
3.3
W
se
ize
su
OW
3.
To
ra
re
i
phere. The error set ⌦Ej
=
dp
R( ⇤
j + j)  R( ⇤
j ) + 1
r R( j)
o
, for r >
. . , p, and = [ T
1 , . . . , T
p ]T
, for j is of size
nd ⇤
= [ ⇤T
1 . . . ⇤T
p ]T
, for ⇤
j 2 Rdp
. The set
part of the decomposition in ⌦E = ⌦E1
⇥ · · · ⇥
o the assumption on the row-wise separability of
) in (5). Also define w(⇥) = E[sup
u2⇥
hg, ui] to
ssian width of set ⇥ for g ⇠ N (0, I) and u 2
n with probability at least 1 c1 exp( c2⌘2
+
or any ⌘ > 0, inf
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
=
p
NL 2
p
onstants, and L, M are defined in (9) and (13).
ssion
orem 3.4, we can choose ⌘ = 1
2
p
NL and set
p
NL 2
p
p
N > 0
atisfied, we can establish a lower bound on the
p 2
p
M+cw(⇥)
the number of samples N
will dominate the second o
by computing N for whic
w(⌦R)
p
N
= w2
(⌦R)
N2 , which
w(⌦R). Therefore, we c
lows: once the restricted e
N O
⇣
w(⌦R)
p
N
⌘
, the er
k k2  O
⇣
w(⌦R)
p
N
⌘
(con
3.3. Special Cases
While the presented result
separable along the rows of
ize our analysis to a few
such as L1 and Group La
OWL norms.
3.3.1. LASSO
To establish results for L
is a unit sphere. The error set ⌦Ej
=n
j 2 Rdp
R( ⇤
j + j)  R( ⇤
j ) + 1
r R( j)
o
, for r >
1, j = 1, . . . , p, and = [ T
1 , . . . , T
p ]T
, for j is of size
dp ⇥ 1, and ⇤
= [ ⇤T
1 . . . ⇤T
p ]T
, for ⇤
j 2 Rdp
. The set
⌦Ej
⇥ · · · ⇥
⌦Ep
u2⇥
hg, ui] to
Rdp
+
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
where ⌫ =
p
NL 2
p
3.2. Discussion
2
p
NL and set
p
N =
p
NL 2
p
p
N > 0
the num
will dom
by comp
w(⌦R)
p
N
=
w(⌦R).
lows: on
N O
k k2 
3.3. Spec
While th
separable
ize our a
such as
OWL no
3.3.1. L
define w(⇥) = E[sup
u2⇥
hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N
probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0
inf
||(Ip⇥p ⌦ X) ||2
⌫,
: Gaussian width
n this Section we present the main results of our work, followed by the discussion on their prop
lustrating some special cases based on popular Lasso and Group Lasso regularization norms. I
4 we will present the main ideas of our proof technique, with all the details delegated to the App
nd D.
o establish lower bound on the regularization parameter N , we derive an upper bound on R⇤[ 1
N Z
or some ↵ > 0, which will establish the required relationship N ↵ R⇤[ 1
N ZT ✏].
heorem 4.3 Let ⌦R = {u 2 Rdp|R(u)  1}, and define w(⌦R) = E[ sup
u2⌦R
hg, ui] to be a Gaus
set ⌦R for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 > 0 with probability at least 1 c exp( min
g(p)) we can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w(⌦R)
p
N
+ c1(1+✏1)
w2(⌦R)
N2
◆
here c, c1 and c2 are positive constants.
o establish restricted eigenvalue condition, we will show that inf
2cone(⌦E)
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
> 0 and then set
p
N = ⌫.
heorem 4.4 Let ⇥ = cone(⌦Ej ) Sdp 1, where Sdp 1 is a unit sphere. The error set ⌦Ej is
Ej =
n
j 2 Rdp R( ⇤
j + j)  R( ⇤
j ) + 1
r R( j)
o
, for r > 1, j = 1, . . . , p, and = [ T
1 , .
] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp. Then with
p( c2⌘2 + log(p)), for any ⌘ > 0
inf
||(Ip⇥p ⌦ X) ||2
⌫,
ing Theorems 3.3 and 3.4 we can interpret the es-
ed results as follows. As the size and dimension-
, p and d of the problem increase, we emphasize
le of the results and use the order notations to de-
e constants. Select a number of samples at least
O(w2
N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
e restricted eigenvalue condition ||Z ||2
|| ||2
p
N
2 cone(⌦E) holds, so that  = O(1) is a pos-
onstant. Moreover, the norm of the estimation er-
optimization problem (4) is bounded by k k2 
⌦R)
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
ibility constant (cone(⌦Ej
or all j = 1, . . . , p, which follows from our assump-
(5).
er now Theorem 3.3 and the bound on the regu-⇣
w(⌦R)
p
w2
(⌦R)
⌘ 15

• p  
•  
•
•
•
ion on the row-wise separability of
o define w(⇥) = E[sup
u2⇥
hg, ui] to
set ⇥ for g ⇠ N(0, I) and u 2
bility at least 1 c1 exp( c2⌘2
+
0, inf
2cone(⌦E)
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
p
L, M are defined in (9) and (13).
can choose ⌘ = 1
2
p
NL and set
cw(⇥) ⌘ and since
p
N > 0
an establish a lower bound on the
p
N > 2
p
M+cw(⇥)
p
L/2
= O(w(⇥)).
nd using (9) and (13), we can con-
of samples needed to satisfy the
ndition is smaller if ⇤min(A) and
lows: once the restricted eigenvalue condi
N O
⇣
w(⌦R)
p
N
⌘
, the error norm is upp
k k2  O
⇣
w(⌦R)
p
N
⌘
(cone(⌦Ej
)).
3.3. Special Cases
While the presented results are valid for a
separable along the rows of Ak, it is instruc
ize our analysis to a few popular regulariz
such as L1 and Group Lasso, Sparse Gro
OWL norms.
3.3.1. LASSO
To establish results for L1 norm, we assum
rameter ⇤
is s-sparse, which in our case is
resent the largest number of non-zero elem
i = 1, . . . , p, i.e., the combined i-th row
norm R(·) in (5). Also define w(⇥) = E[s
u
be a Gaussian width of set ⇥ for g ⇠ N(0
Rdp
. Then with probability at least 1 c1 e
2cone(⌦E)
||(Ip⇥p⌦
|| |
where ⌫ =
p
NL 2
p
M cw(⇥) ⌘ and
positive constants, and L, M are defined in (9
3.2. Discussion
2
p
p
N =
p
NL 2
p
M cw(⇥) ⌘ and sin
must be satisfied, we can establish a lower b
p
N > 2
p
M+cw(⇥)
p
L/2
Examining this bound and using (9) and (13),
clude that the number of samples needed t
restricted eigenvalue condition is smaller if ⇤
cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are defined in (9)
choose ⌘ = 1
2
p
NL and set
p
N =
p
NL 2
p
we can establish a lower bound on the number of samples N
p
N >
2
p
M + cw(⇥)
p
L/2
= O(w(⇥)). (18)
sing (9) and (13), we can conclude that the number of samples needed to satisfy
dition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃)
eans that matrices A and A in (10) and (12) must be well conditioned and the
eigenvalues well inside the unit circle (see Section 3.2). Alternatively, we can
wing that large values of M and small values of L indicate stronger dependency
ore samples for the RE conditions to hold with high probability.
d 4.4 we can interpret the established results as follows. As the size and dimen-
oblem increase, we emphasize the scale of the results and use the order notations
ect a number of samples at least N O(w2(⇥)) and let the regularization pa-
2
⌘
Ej
⌦Ep
due to the assumption on t
norm R(·) in (5). Also define
be a Gaussian width of set ⇥
Rdp
. Then with probability at
log(p)), for any ⌘ > 0, in
2co
where ⌫ =
p
NL 2
p
M cw
positive constants, and L, M ar
3.2. Discussion
From Theorem 3.4, we can ch
p
N =
p
NL 2
p
M cw(⇥
must be satisfied, we can estab
p
N >
Examining this bound and usin
clude that the number of sam
restricted eigenvalue condition
probability at least 1 c1 exp( c2⌘ + log(p)), for any ⌘ > 0
inf
2cone(⌦E)
||(Ip⇥p ⌦ X) ||2
|| ||2
⌫,
where ⌫ =
p
NL 2
p
M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are define
and (13).
4.2 Discussion
2
p
NL and set
p
N =
p
NL 2
p
M cw(⇥) ⌘ an
p
N > 0 must be satisfied, we can establish a lower bound on the number of samples N
p
N >
2
p
M + cw(⇥)
p
L/2
= O(w(⇥)).
Examining this bound and using (9) and (13), we can conclude that the number of samples needed to
the restricted eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤
are smaller. In turn, this means that matrices A and A in (10) and (12) must be well conditioned
VAR process is stable, with eigenvalues well inside the unit circle (see Section 3.2). Alternatively,
also understand (18) as showing that large values of M and small values of L indicate stronger depe
in the data, thus requiring more samples for the RE conditions to hold with high probability.
Analyzing Theorems 4.3 and 4.4 we can interpret the established results as follows. As the size and
sionality N, p and d of the problem increase, we emphasize the scale of the results and use the order no
to denote the constants. Select a number of samples at least N O(w2(⇥)) and let the regularizat
rameter satisfy N O
⇣
w(⌦R)
p + w2(⌦R)
2
⌘
. With high probability then the restricted eigenvalue co
and 3.4 we can interpret the es-
ws. As the size and dimension-
problem increase, we emphasize
nd use the order notations to de-
ct a number of samples at least
the regularization parameter sat-
w2
(⌦R)
N2
⌘
value condition ||Z ||2
|| ||2
p
N
ds, so that  = O(1) is a pos-
, the norm of the estimation er-
em (4) is bounded by k k2 
cone(⌦Ej
cone(⌦Ej
which follows from our assump-
3.3 and the bound on the regu-
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. As
e problem p and d grows and
r
1, j = 1, . . . , p, and = [ T
1 , . . . , T
p ]T
, for j is of size
dp ⇥ 1, and ⇤
= [ ⇤T
1 . . . ⇤T
p ]T
, for ⇤
j 2 Rdp
. The set
⌦Ej
⇥ · · · ⇥
⌦Ep
u2⇥
hg, ui] to
Rdp
+
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
where ⌫ =
p
NL 2
p
3.2. Discussion
2
p
NL and set
p
N =
p
NL 2
p
p
N > 0
p
N > 2
p
M+cw(⇥)
p = O(w(⇥)).
by
w(⌦
p
N
w(⌦
low
N
k k
3.3.
Wh
sepa
ize
such
OW
3.3
To
), for any ⌘ > 0
||(Ip⇥p ⌦ X) ||2
|| ||2
⌫,
c1, c2 are positive constants, and L and M are defined in (9)
L and set
p
N =
p
NL 2
p
lower bound on the number of samples N
M + cw(⇥)
p
L/2
= O(w(⇥)). (18)
we can conclude that the number of samples needed to satisfy
⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃)
A and A in (10) and (12) must be well conditioned and the
0 <
= E[sup
u2⇥
hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 R
least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0
inf
2cone(⌦E)
||(Ip⇥p ⌦ X) ||2
|| ||2
⌫,
NL 2
p
M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M ar
sion
m 4.4, we can choose ⌘ = 1
2
p
NL and set
p
N =
p
NL 2
p
M cw(⇥)
ust be satisfied, we can establish a lower bound on the number of samples N
p
N >
2
p
M + cw(⇥)
p
L/2
= O(w(⇥)).
is bound and using (9) and (13), we can conclude that the number of samples n
eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A
n turn, this means that matrices A and A in (10) and (12) must be well cond16

•  
 
 
sion
m 4.4, we can choose ⌘ = 1
2
p
NL and set
p
N =
p
NL 2
p
M cw(⇥)
ust be satisfied, we can establish a lower bound on the number of samples N
p
N >
2
p
M + cw(⇥)
p
L/2
= O(w(⇥)).
is bound and using (9) and (13), we can conclude that the number of samples n
eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A
n turn, this means that matrices A and A in (10) and (12) must be well cond
is stable, with eigenvalues well inside the unit circle (see Section 3.2). Altern
nd (18) as showing that large values of M and small values of L indicate strong
us requiring more samples for the RE conditions to hold with high probability.
eorems 4.3 and 4.4 we can interpret the established results as follows. As the s
and d of the problem increase, we emphasize the scale of the results and use the
constants. Select a number of samples at least N O(w2(⇥)) and let the reg
y N O
⇣
w(⌦R)
p
N
+ w2(⌦R)
N2
⌘
. With high probability then the restricted eigen
N for 2 cone(⌦E) holds, so that  = O(1) is a positive constant. Moreover,
VAR
⇢X(!) = I Ae i! 1
⌃E
h
I Ae i! 1
i⇤
,
where ⌃E is the covariance matrix of a noise vector and A
are as defined in expression (6). Thus, an upper bound on
CU can be obtained as ⇤max[CU ]  ⇤max(⌃)
defined ⇤min(A) = min
!2[0,2⇡]
⇤min(A(!)) for
A(!) = I AT
ei!
I Ae i!
. (12)
⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M. (13)
We note that for a general VAR model, there might not exist
closed-form expressions for ⇤max(A) and ⇤min(A). How-
ever, for some special cases there are results establishing
the bounds on these quantities (e.g., see Proposition 2.2 in
(Basu & Michailidis, 2015)).
inf
1jp
!2[0,2⇡]
⇤j[⇢(!)]  ⇤k[CX]
1kdp
 sup
1jp
!2[0,2⇡]
⇤j[⇢(!)], (8)
k[·] denotes the k-th eigenvalue of a matrix and for i =
p
1, ⇢(!) =
P1
h= 1 (h)e hi!, ! 2
s the spectral density, i.e., a Fourier transform of the autocovariance matrix (h). The advantage of
spectral density is that it has a closed form expression (see Section 9.4 of [24])
⇢(!)= I
dX
k=1
Ake ki!
! 1
⌃
2
4 I
dX
k=1
Ake ki!
! 1
3
5
⇤
,
denotes a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound
⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9)
e defined ⇤max(A) = max
!2[0,2⇡]
⇤max(A(!)) for
A(!)= I
dX
k=1
AT
k eki!
!
I
dX
k=1
Ake ki!
!
, (10)
endix B.1 for additional details.
ishing high probability bounds we will also need information about a vector q = Xa 2 RN for
Rdp, kak2 = 1. Since each element XT
i,:a ⇠ N(0, aT CXa), it follows that q ⇠ N(0, Qa) with a
ce matrix Qa 2 RN⇥N . It can be shown (see Appendix B.3 for more details) that Qa can be written
4 .. .. .. ..
(d 1)T (d 2)T . . . (0)
5
E(xtxT
t+h) 2 Rp⇥p. It turns out that since CX is a block-Toeplitz matrix, its eigenvalues can
(see [13])
inf
1jp
!2[0,2⇡]
⇤j[⇢(!)]  ⇤k[CX]
1kdp
 sup
1jp
!2[0,2⇡]
⇤j[⇢(!)], (8)
notes the k-th eigenvalue of a matrix and for i =
p
1, ⇢(!) =
P1
h= 1 (h)e hi!, ! 2
pectral density, i.e., a Fourier transform of the autocovariance matrix (h). The advantage of
al density is that it has a closed form expression (see Section 9.4 of [24])
⇢(!)= I
dX
k=1
Ake ki!
! 1
⌃
2
4 I
dX
k=1
Ake ki!
! 1
3
5
⇤
,
es a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound
⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9)
ned ⇤max(A) = max
!2[0,2⇡]
⇤max(A(!)) for
A(!)= I
dX
k=1
AT
k eki!
!
I
dX
k=1
Ake ki!
!
, (10)
B.1 for additional details.
high probability bounds we will also need information about a vector q = Xa 2 RN for
kak2 = 1. Since each element XT
i,:a ⇠ N(0, aT CXa), it follows that q ⇠ N(0, Qa) with a
rix Qa 2 RN⇥N . It can be shown (see Appendix B.3 for more details) that Qa can be written
Qa = (I ⌦ aT
)CU (I ⌦ a), (11)
inf
1ldp
!2[0,2⇡]
⇤l[⇢X(!)]  ⇤k[CU ]
1kNdp
 sup
1ldp
!2[0,2⇡]
⇤l[⇢X(!
The closed form expression of spectral density is
⇢X(!) = I Ae i! 1
⌃E
h
I Ae i! 1
where ⌃E is the covariance matrix of a noise vector and A are as defined in
bound on CU can be obtained as ⇤max[CU ]  ⇤max(⌃)
⇤min(A) , where we defined ⇤
for
A(!) = I AT
ei!
I Ae i!
.
⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M.
We note that for a general VAR model, there might not exist closed-form
⇤min(A). However, for some special cases there are results establishing
(e.g., see Proposition 2.2 in [5]).
is that it has a closed form expression (see Section
Priestley, 1981))
I
dX
k=1
Ake ki!
! 1
⌃
2
4 I
dX
k=1
Ake ki!
! 1
3
5
⇤
,
denotes a Hermitian of a matrix. Therefore, from
an establish the following lower bound
⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9)
we defined ⇤max(A) = max
!2[0,2⇡]
⇤max(A(!)) for
= I
dX
k=1
AT
k eki!
!
I
dX
k=1
Ake ki!
!
. (10)
lishing high probability bounds we will also need
tion about a vector q = Xa 2 RN
for any a 2 Rdp
,
1. Since each element XT
i,:a ⇠ N(0, aT
CXa),
ws that q ⇠ N(0, Qa) with a covariance matrix
N⇥N
3. Regularized Estimati
Denote by = ˆ ⇤
the e
optimization problem (4) and
rameter. The focus of our wor
under which the optimization
tees on the accuracy of the obt
term is bounded: || ||2  f
lish such conditions, we utilize
et al., 2014). Specifically, estim
on the following known results
first one characterizes the restr
error belongs.
Lemma 3.1 Assume that
N rR⇤

to
R( )=
pX
i=1
R i =
pX
i=1
R
✓h
A1(i, :)T
. . .Ad(i, :)T
iT
◆
.
To reduce clutter and without loss of generality, we assume the norm R(·) to be the sam
Since the analysis decouples across rows, it is straightforward to extend our analysis t
different norm is used for each row of Ak, e.g., L1 for row one, L2 for row two, K-suppo
three, etc. Observe that within a row, the norm need not be decomposable across column
The main difference between the estimation problem in (1) and the formulation in (4
pendence between the samples (x0, x1, . . . , xT ), violating the i.i.d. assumption on the
1, . . . , Np}. In particular, this leads to the correlations between the rows and columns
consequently of Z). To deal with such dependencies, following [5], we utilize the spectra
the autocovariance of VAR models to control the dependencies in matrix X.
3.2 Stability of VAR Model
Since VAR models are (linear) dynamical systems, for the analysis we need to establish
which the VAR model (2) is stable, i.e., the time-series process does not diverge over time.
stability, it is convenient to rewrite VAR model of order d in (2) as an equivalent VAR mo
2
6
6
6
4
xt
xt 1
...
xt (d 1)
3
7
7
7
5
=
2
6
6
6
6
6
4
A1 A2 . . . Ad 1 Ad
I 0 . . . 0 0
0 I . . . 0 0
...
...
...
...
...
0 0 . . . I 0
3
7
7
7
7
7
5
| {z }
A
2
6
6
6
4
xt 1
xt 2
...
xt d
3
7
7
7
5
+
2
6
6
6
4
✏t
0
...
0
3
7
7
7
5
where A 2 Rdp⇥dp. Therefore, VAR process is stable if all the eigenvalues of A satisfy de
0 for 2 C, | | < 1. Equivalently, if expressed in terms of original parameters Ak, sta
det(I
Pd
k=1 Ak
1
k ) = 0 (see Appendix A for more details).
 
L =
1, j = 1, . . . , p, and = [ 1 , . . . , p ] , for j is of size
dp ⇥ 1, and ⇤
= [ ⇤T
1 . . . ⇤T
p ]T
, for ⇤
j 2 Rdp
. The set
⌦Ej
⇥ · · · ⇥
⌦Ep
u2⇥
hg, ui] to
Rdp
+
2cone(⌦E )
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
where ⌫ =
p
NL 2
p
3.2. Discussion
2
p
NL and set
p
N =
p
NL 2
p
p
N > 0
p
N > 2
p
M+cw(⇥)
p
L/2
= O(w(⇥)).
Examining this bound and using (9) and (13), we can con-
L =
 
17

2 ⌦E
 
 
N ↵ R⇤
[ 1
N ZT
✏].
w(⌦R) = E[ sup
u2⌦R
hg, ui] to
can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w
Thm4.3
Thm4.4
Lem.4.1
Lem.4.2
tion ||Z ||2
|| ||2
p
N
 = O(1) is a pos-
bounded by k k2 
n R⇤
[ 1
N ZT
✏]  ↵, for
p
> 0 and ✏2 > 0 with
min(✏2
2, ✏1) + log(p)) we
)
+ c1(1+✏1)
w2
(⌦R)
N2
◆
nstants.
ability.
N O(w2
isfy N O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
. With high prob
|| ||2
O
⇣
w(⌦R)
p
N
+ w2
(⌦R)
N2
⌘
(cone(⌦Ej
)). Note that the
)) is assumed to
R)
⌘
condition ||Z ||2
|| ||2
p
N
⌦Ej
⌦Ej

•
• 4.3
•
•
d by the discussion on their properties and
up Lasso regularization norms. In Section
ll the details delegated to the Appendices C
derive an upper bound on R⇤[ 1
N ZT ✏]  ↵,
N ↵ R⇤[ 1
N ZT ✏].
R) = E[ sup
u2⌦R
hg, ui] to be a Gaussian width
bability at least 1 c exp( min(✏2
2, ✏1) +
1(1+✏1)
w2(⌦R)
N2
◆
t inf
2cone(⌦E)
||(Ip⇥p⌦X) ||2
|| ||2
⌫, for some
illustrating some special cases based
4.4 we will present the main ideas of o
and D.
To establish lower bound on the regula
for some ↵ > 0, which will establish
Theorem 4.3 Let ⌦R = {u 2 Rdp|R(
of set ⌦R for g ⇠ N(0, I). For any ✏1
log(p)) we can establish that
R⇤

1
N
ZT
✏
where c, c1 and c2 are positive consta
To establish restricted eigenvalue co
⌫ > 0 and then set
p
N = ⌫.
ction we present the main results of our work, followed by the discussion on their prope
g some special cases based on popular Lasso and Group Lasso regularization norms. In
ll present the main ideas of our proof technique, with all the details delegated to the Appe
sh lower bound on the regularization parameter N , we derive an upper bound on R⇤[ 1
N Z
↵ > 0, which will establish the required relationship N ↵ R⇤[ 1
N ZT ✏].
4.3 Let ⌦R = {u 2 Rdp|R(u)  1}, and define w(⌦R) = E[ sup
u2⌦R
hg, ui] to be a Gauss
for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 > 0 with probability at least 1 c exp( min(
e can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w(⌦R)
p
N
+ c1(1+✏1)
w2(⌦R)
N2
◆
c1 and c2 are positive constants.
ish restricted eigenvalue condition, we will show that inf
2cone(⌦E)
||(Ip⇥p⌦X) ||2
|| ||2
⌫,
d then set
p
N = ⌫.
4.4 Let ⇥ = cone(⌦Ej ) Sdp 1, where Sdp 1 is a unit sphere. The error set ⌦Ej is d
To establish lower bound on the regularization pa
N , we derive an upper bound on R⇤
[ 1
N ZT
✏] 
some ↵ > 0, which will establish the required rela
N ↵ R⇤
[ 1
N ZT
✏].
Theorem 3.3 Let ⌦R = {u 2 Rdp
|R(u)  1}, an
w(⌦R) = E[ sup
u2⌦R
hg, ui] to be a Gaussian widt
⌦R for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 >
probability at least 1 c exp( min(✏2
2, ✏1) + log
can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w(⌦R)
p
N
+ c1(1+✏1)
w2
To establish restricted eigenvalue condition, we w
||(Ip⇥p⌦X) ||2
To establish lower bound on the regularizat
N , we derive an upper bound on R⇤
[ 1
N ZT
some ↵ > 0, which will establish the require
N ↵ R⇤
[ 1
N ZT
✏].
Theorem 3.3 Let ⌦R = {u 2 Rdp
|R(u)  1
w(⌦R) = E[ sup
u2⌦R
hg, ui] to be a Gaussian
⌦R for g ⇠ N(0, I). For any ✏1 > 0 and
probability at least 1 c exp( min(✏2
2, ✏1)
can establish that
R⇤

1
N
ZT
✏ 
✓
c2(1+✏2)
w(⌦R)
p
N
+ c1(1+
To establish restricted eigenvalue condition,
||(Ip⇥p⌦X) ||2
rL
ned in terms of the quantities, involving Z and ✏, which are
blish high probability bounds on the regularization parameter
f our work, followed by the discussion on their properties and
ular Lasso and Group Lasso regularization norms. In Section
of technique, with all the details delegated to the Appendices C
n parameter N , we derive an upper bound on R⇤[ 1
N ZT ✏]  ↵,
uired relationship N ↵ R⇤[ 1
N ZT ✏].
}, and define w(⌦R) = E[ sup
u2⌦R
hg, ui] to be a Gaussian width
nd ✏2 > 0 with probability at least 1 c exp( min(✏2
2, ✏1) +
(1+✏2)
w(⌦R)
p + c1(1+✏1)
w2(⌦R)
2
◆
re (cone(⌦E)) is a norm compatibility constant, defined as (cone(⌦E)) = sup
U2cone(⌦E)
e that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error sati
nd in (17). However, the results are defined in terms of the quantities, involving Z and
dom. Therefore, in the following we establish high probability bounds on the regularizat
14) and RE condition in (16).
High Probability Bounds
his Section we present the main results of our work, followed by the discussion on their
strating some special cases based on popular Lasso and Group Lasso regularization norm
we will present the main ideas of our proof technique, with all the details delegated to the
D.
stablish lower bound on the regularization parameter N , we derive an upper bound on R⇤
some ↵ > 0, which will establish the required relationship N ↵ R⇤[ 1
N ZT ✏].
orem 4.3 Let ⌦R = {u 2 Rdp|R(u)  1}, and define w(⌦R) = E[ sup
u2⌦R
hg, ui] to be a G
et ⌦R for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 > 0 with probability at least 1 c exp(
p)) we can establish that
:  
Gaussian width
)
= O(w(⇥)). (18)
clude that the number of samples needed to satisfy
nd ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃)
n (10) and (12) must be well conditioned and the
nit circle (see Section 3.2). Alternatively, we can
nd small values of L indicate stronger dependency
tions to hold with high probability.
blished results as follows. As the size and dimen-
e the scale of the results and use the order notations
ast N O(w2(⇥)) and let the regularization pa-
robability then the restricted eigenvalue condition
1) is a positive constant. Moreover, the norm of the
by k k2  O
⇣
w(⌦R)
p
N
+ w2(⌦R)
N2
⌘
(cone(⌦Ej )).
) is assumed to be the same for all j = 1, . . . , p,
arization parameter N O
⇣
w(⌦R)
p
N
+ w2(⌦R)
N2
⌘
.
the number of samples N increases, the first term 19

• N p (d=1 )
• N  
0 1000 2000 3000 4000 5000
0
5
10
15
20
0 1000 2000 3000 4000 5000
0
0.2
0.4
0.6
0.8
1
0 100 200 300 400 500 600 700
0.4
0.5
0.6
0.7
0.8
0.9
1
kk2
N Np
kk2
/max
/max
N/[s log(pd)]
0 100 200 300 400 500 600
0
5
10
15
20
kk2
N N
kk2
/max
/max
0 1000 2000 3000 4000 5000
0
5
10
15
0 500 1000 1500 2000 2500
0
5
10
15
0 1000 2000 3000 4000 5000
0.4
0.5
0.6
0.7
0.8
0.9
1
N/[s(m + log K)] K
10 20 30 40 50 60
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
(a) (b) (c) (d)
(e) (f) (g) (h)
Figure 1: Results for estimating parameters of a stable first order sparse VAR (top row) and group sparse
VAR (bottom row). Problem dimensions: p 2 [10, 600], N 2 [10, 5000], N
max
2 [0, 1], K 2 [2, 60] and
d = 1. Figures (a) and (e) show dependency of errors on sample size for different p; in Figure (b) the N
is scaled by (s log p) and plotted against k k2 to show that errors scale as (s log p)/N; in (f) the graph
is similar to (b) but for group sparse VAR; in (c) and (g) we show dependency of N on p (or number of
groups K in (g)) for fixed sample size N; finally, Figures (d) and (h) display the dependency of N on N
for fixed p.
p
⇥
hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp. Th
c1 exp( c2⌘2 + log(p)), for any ⌘ > 0
inf
2cone(⌦E)
||(Ip⇥p ⌦ X) ||2
|| ||2
⌫,
p
M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are define
we can choose ⌘ = 1
2
p
NL and set
p
N =
p
NL 2
p
M cw(⇥) ⌘ an
tisfied, we can establish a lower bound on the number of samples N
p
N >
2
p
M + cw(⇥)
p
L/2
= O(w(⇥)).
20

• VAR(d=1) 1step
•
• or
k
N N
k
/
/
0 1000 2000 3000 4000 5000
0
5
0 500 1000 1500 2000 2500
0
5
0 1000 2000 3000 4000 5000
0.4
0.5
0.6
N/[s(m + log K)] K
10 20 30 40 50 60
0.3
0.4
0.5
0.6
(e) (f) (g) (h)
Figure 1. Results for estimating parameters of a stable first order sparse VAR (top row) and group sparse VAR (bottom row). Problem
dimensions: p 2 [10, 600], N 2 [10, 5000], N
max
2 [0, 1], K 2 [2, 60] and d = 1. Figures (a) and (e) show dependency of errors on
sample size for different p; in Figure (b) the N is scaled by (s log p) and plotted against k k2 to show that errors scale as (s log p)/N
in (f) the graph is similar to (b) but for group sparse VAR; in (c) and (g) we show dependency of N on p (or number of groups K in
(g)) for fixed sample size N; finally, Figures (d) and (h) display the dependency of N on N for fixed p.
Lasso OWL Group Lasso Sparse Group Lasso Ridge
32.3(6.5) 32.2(6.6) 32.7(6.5) 32.2(6.4) 33.5(6.1)
32.7(7.9) 44.5(15.6) 75.3(8.4) 38.4(9.6) 99.9(0.2)
Table 1. Mean squared error (row 2) of the five methods used in fitting VAR model, evaluated on aviation dataset (MSE is computed
using one-step-ahead prediction errors). Row 3 shows the average number of non-zeros (as a percentage of total number of elements) in
the VAR matrix. The last row shows a typical sparsity pattern in A1 for each method (darker dots - stronger dependencies, lighter dots
weaker dependencies). The values in parenthesis denote one standard deviation after averaging the results over 300 flights.
the results after averaging across 300 flights.
From the table we can see that the considered problem ex-
hibits a sparse structure since all the methods detected sim-
ilar patterns in matrix A1. In particular, the analysis of
such patterns revealed a meaningful relationship among the
flight parameters (darker dots), e.g., normal acceleration
had high dependency on vertical speed and angle-of-attack,
the altitude had mainly dependency with fuel quantity, ver-
tical speed with aircraft nose pitch angle, etc. The results
also showed that the sparse regularization helps in recov-
5. Conclusions
In this work we present a set of results for characterizing
non-asymptotic estimation error in estimating structured
vector autoregressive models. The analysis holds for any
norms, separable along the rows of parameter matrices
Our analysis is general as it is expressed in terms of Gaus
sian widths, a geometric measure of size of suitable sets
and includes as special cases many of the existing result
focused on structured sparsity in VAR models.
A1
nMSE(%)
(%)
R(·)
TO 2
f its atomic formulation.
tiation of the CG algorithm to handle the Ivanov
lation of OWL regularization; more specifically:
e show how the atomic formulation of the OWL
orm allows solving efficiently the linear program-
ming problem in each iteration of the CG algorithm;
ased on results from [25], we show convergence of
he resulting algorithm and provide explicit values
or the constants.
w derivation of the proximity operator of the OWL
arguably simpler than those in [10] and [44],
ghting its connection to isotonic regression and the
adjacent violators (PAV) algorithm [3], [8].
ficient method to project onto an OWL norm ball,
on a root-finding scheme.
ng the Ivanov formulation under OWL regulariza-
sing projected gradient algorithms, based on the
sed OWL projection.
er is organized as follows. Section II, after reviewing
norm and some of its basic properties, presents
formulation, and derives its dual norm. The two
utational tools for using OWL regularization, the
operator and the projection on a ball, are addressed
III. Section IV instantiates the CG algorithm
rated projected gradient algorithms to tackle the
optimization formulation of OWL regularization
egression. Finally, Section V reports experimental
strating the performance and comparison of the
pproaches, and Section VI concludes the paper.
ase bold letters, e.g., x, y, denote (column) vectors,
oses are xT
, yT
, and the i-th and j-th components
n as xi and yj. Matrices are written in upper
Fig. 1. OWL balls in R2 with different weights: (a) w1 > w2 > 0; (b)
w1 = w2 > 0; (c) w1 > w2 = 0.
where w 2 Km+ is a vector of non-increasing weights, i.e.,
belonging to the so-called monotone non-negative cone [13],
Km+ = {x 2 Rn
: x1 x2 · · · xn 0} ⇢ Rn
+. (2)
OWL balls in R2
and R3
, for different choices of the weight
vector w, are illustrated in Figs. 1 and 2.
Fig. 2. OWL balls in R3 with different weights: (a) w1 > w2 > w3 > 0;
(b) w1 > w2 = w3 > 0; (c) w1 = w2 > w3 > 0; (d) w1 = w2 > w3 = 0;
(e) w1 > w2 = w3 = 0; (f) w1 = w2 = w3 > 0.
OWL balls in R2 with different weights: (a) w1 > w2 > 0; (b)
> 0; (c) w1 > w2 = 0.
w 2 Km+ is a vector of non-increasing weights, i.e.,
ng to the so-called monotone non-negative cone [13],
+ = {x 2 Rn
: x1 x2 · · · xn 0} ⇢ Rn
+. (2)
alls in R2
and R3
, for different choices of the weight
w, are illustrated in Figs. 1 and 2.
OWL balls in R3 with different weights: (a) w1 > w2 > w3 > 0;
w2 = w3 > 0; (c) w1 = w2 > w3 > 0; (d) w1 = w2 > w3 = 0;
w2 = w3 = 0; (f) w1 = w2 = w3 > 0.
act that, if w 2 Km+ {0}, then ⌦w is indeed a norm
nvex and homogenous of degree 1), was shown in [10],
s clear that ⌦w is lower bounded by the (appropriately
`1 norm:
⌦w(x) w1|x|[1] = w1 kxk1, (3)
[Zheng15]
21

• -  
VAR
• i.i.d.
• :
inf
2cone(⌦E)
p⇥p 2
|| ||2
⌫,
M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are de
can choose ⌘ = 1
2
p
NL and set
p
N =
p
NL 2
p
M cw(⇥) ⌘
ﬁed, we can establish a lower bound on the number of samples N
p
N >
2
p
M + cw(⇥)
p
L/2
= O(w(⇥)).
nd using (9) and (13), we can conclude that the number of samples neede
condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) an
s means that matrices A and A in (10) and (12) must be well condition
with eigenvalues well inside the unit circle (see Section 3.2). Alternativ
showing that large values of M and small values of L indicate stronger d
N > p
L/2
= O(w(⇥))
mining this bound and using (9) and (13), we can conclude that the
stricted eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) a
maller. In turn, this means that matrices A and A in (10) and (12
process is stable, with eigenvalues well inside the unit circle (see
understand (18) as showing that large values of M and small value
data, thus requiring more samples for the RE conditions to hold w
yzing Theorems 4.3 and 4.4 we can interpret the established result
lity N, p and d of the problem increase, we emphasize the scale of t
note the constants. Select a number of samples at least N O(w
er satisfy N O
⇣
w(⌦R)
p
N
+ w2(⌦R)
N2
⌘
. With high probability the
2
p
N for 2 cone(⌦E) holds, so that  = O(1) is a positive
ation error in optimization problem (4) is bounded by k k2  O
/2
= O(w(⇥)). (18)
n conclude that the number of samples needed to satisfy
(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃)
d A in (10) and (12) must be well conditioned and the
e the unit circle (see Section 3.2). Alternatively, we can
f M and small values of L indicate stronger dependency
conditions to hold with high probability.
he established results as follows. As the size and dimen-
phasize the scale of the results and use the order notations
s at least N O(w2(⇥)) and let the regularization pa-
high probability then the restricted eigenvalue condition
= O(1) is a positive constant. Moreover, the norm of the
nded by k k2  O
⇣
w(⌦R)
p
N
+ w2(⌦R)
N2
⌘
(cone(⌦Ej )).
(⌦Ej )) is assumed to be the same for all j = 1, . . . , p,
22

•
•
•
•
• -  
• [Negahban09]
• ARX  
23

Estimating Structured Vector Autoregressive Models Using Regularized Estimation

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (17)

Similar to Estimating Structured Vector Autoregressive Models Using Regularized Estimation

Similar to Estimating Structured Vector Autoregressive Models Using Regularized Estimation (20)

Recently uploaded

Recently uploaded (20)

Estimating Structured Vector Autoregressive Models Using Regularized Estimation