Successfully reported this slideshow.
Upcoming SlideShare
×

# Estimating structured vector autoregressive models

3,785 views

Published on

ICML2016読み会の資料です

Published in: Technology
• Full Name
Comment goes here.

Are you sure you want to Yes No

### Estimating structured vector autoregressive models

1. 1. ICML2016 Estimating Structured Vector Autoregressive Models NEC ICML2016 2016/7/21
2. 2. • Vector Autoregressive   • • i.i.d. • →( ) Lasso Group-Lasso i.i.d. • →( ) R ( ) • R ap- y in lar- (1) uch , is dis- ion ani, hine ume nancial time series (Tsay, 2005) to modeling the dynamical systems (Ljung, 1998) and estimating brain function con- nectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is deﬁned as xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2) where xt 2 Rp denotes a multivariate time series, Ak 2 Rp⇥p , k = 1, . . . , d are the parameters of the model, and d 1 is the order of the model. In this work, we as- sume that the noise ✏t 2 Rp follows a Gaussian distribu- tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T t ) = ⌃ and E(✏t✏T t+⌧ ) = 0, for ⌧ 6= 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix ⌃ is assumed to be positive deﬁnite with bounded largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1. In the current context, the parameters {Ak} are assumed vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- ance matrix of the noise ✏ is now E[✏✏T ] = ⌃ ⌦ IN⇥N . Consequently, the regularized estimator takes the form ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) where R( ) can be any vector norm, separable along the rows of matrices Ak. Speciﬁcally, if we denote = [ T 1 . . . T p ]T and Ak(i, :) as the row of matrix Ak for k = 1, . . . , d, then our assumption is equivalent to R( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) where A all the for 2 of origi Pd k=1 A and Sec 2.3. Pro In what trix X i results w bounds Deﬁne we assu is distri matrix C C = 2
3. 3.   -   Lasso  [Basu15]   VAR i.i.d. [Banerjee14] VAR   3
4. 4. • i.i.d. • • • high probability ||Z ||2 || ||2 p N = O(1) is a pos- he estimation er- ded by k k2  ote that the norm assumed to be the rom our assump- und on the regu- + w2 (⌦R) N2 ⌘ . As nd d grows and 2 ✏1) + log(p)) we (1+✏1) w2 (⌦R) N2 ◆ . ion, we will show some ⌫ > 0 and p 1 , where Sdp 1 deﬁned as ⌦Ej = R( j) o , for r > T , for j is of size isfy N O N + N2 . With high then the restricted eigenvalue condition ||Z || || ||2 for 2 cone(⌦E) holds, so that  = O(1 itive constant. Moreover, the norm of the est ror in optimization problem (4) is bounded b O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note tha compatibility constant (cone(⌦Ej )) is assume same for all j = 1, . . . , p, which follows from o tion in (5). Consider now Theorem 3.3 and the bound on larization parameter N O ⇣ w(⌦R) p N + w2 ( N the dimensionality of the problem p and d the number of samples N increases, the ﬁrst t will dominate the second one w2 (⌦R) N2 . This c by computing N for which the two terms bec 2 ↑ w, the regularization parameter nd on R⇤ [ 1 N ZT ✏]  ↵, for lish the required relationship 2 Rdp |R(u)  1}, and deﬁne be a Gaussian width of set ny ✏1 > 0 and ✏2 > 0 with ( min(✏2 2, ✏1) + log(p)) we (⌦R) p N + c1(1+✏1) w2 (⌦R) N2 ◆ e constants. alue condition, we will show ⌫, for some ⌫ > 0 and L indicate stronger dependency in the data, thus more samples for the RE conditions to hold with h ability. Analyzing Theorems 3.3 and 3.4 we can interpre tablished results as follows. As the size and di ality N, p and d of the problem increase, we e the scale of the results and use the order notatio note the constants. Select a number of sample N O(w2 (⇥)) and let the regularization param isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high pr then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) itive constant. Moreover, the norm of the estim ror in optimization problem (4) is bounded by O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that compatibility constant (cone(⌦Ej )) is assumed same for all j = 1, . . . , p, which follows from our tion in (5). y ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, L 2 p M cw(⇥) ⌘ and c, c1, c2 are nts, and L, M are deﬁned in (9) and (13). 3.4, we can choose ⌘ = 1 2 p NL and set 2 p M cw(⇥) ⌘ and since p N > 0 ed, we can establish a lower bound on the ples N: p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). bound and using (9) and (13), we can con- 3.3 W sep ize su OW 3. To ram VAR h= 1 X(h) = E[Xj,:XT j+h,:] 2 Rdp⇥dp , then we can write inf 1ldp !2[0,2⇡] ⇤l[⇢X(!)]  ⇤k[CU ] 1kNdp  sup 1ldp !2[0,2⇡] ⇤l[⇢X(!)]. The closed form expression of spectral density is ⇢X(!) = I Ae i! 1 ⌃E h I Ae i! 1 i⇤ , where ⌃E is the covariance matrix of a noise vector and A are as deﬁned in expression (6). Thus, an upper bound on CU can be obtained as ⇤max[CU ]  ⇤max(⌃) ⇤min(A) , where we deﬁned ⇤min(A) = min !2[0,2⇡] ⇤min(A(!)) for A(!) = I AT ei! I Ae i! . (12) Referring back to covariance matrix Qa in (11), we get ⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M. (13) We note that for a general VAR model, there might not exist closed-form expressions for ⇤max(A) and ⇤min(A). How- ever, for some special cases there are results establishing the bounds on these quantities (e.g., see Proposition 2.2 in Lemma 3.2 Assume tha condition holds ||Z || for 2 cone(⌦E) an cone(⌦E) is a cone of th || ||2  1 + r where (cone(⌦E)) is a ﬁned as (cone(⌦E)) = Note that the above error and (16) hold, then the (17). However, the result tities, involving Z and ✏, the following we establis Estimating Structured VAR density is that it has a closed form expression (see Section 9.4 of (Priestley, 1981)) ⇢(!)= I dX k=1 Ake ki! ! 1 ⌃ 2 4 I dX k=1 Ake ki! ! 1 3 5 ⇤ , where ⇤ denotes a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound ⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9) 3. Regularized Es Denote by = ˆ optimization problem rameter. The focus of under which the optim tees on the accuracy of term is bounded: || || lish such conditions, w et al., 2014). Speciﬁca4
5. 5. • : p vector VAR(d) • • - n - ) h s - n , e e nancial time series (Tsay, 2005) to modeling the dynamical systems (Ljung, 1998) and estimating brain function con- nectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is deﬁned as xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2) where xt 2 Rp denotes a multivariate time series, Ak 2 Rp⇥p , k = 1, . . . , d are the parameters of the model, and d 1 is the order of the model. In this work, we as- sume that the noise ✏t 2 Rp follows a Gaussian distribu- tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T t ) = ⌃ and E(✏t✏T t+⌧ ) = 0, for ⌧ 6= 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix ⌃ is assumed to be positive deﬁnite with bounded largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1. In the current context, the parameters {Ak} are assumed timation problems of the form: ˆ = argmin 2Rq 1 M ky Z k2 2 + M R( ) , (1) {(yi, zi), i = 1, . . . , M}, yi 2 R, zi 2 Rq , such = [yT 1 , . . . , yT M ]T and Z = [zT 1 , . . . , zT M ]T , is ining set of M independently and identically dis- d (i.i.d.) samples, M > 0 is a regularization eter, and R(·) denotes a suitable norm (Tibshirani, dings of the 33rd International Conference on Machine g, New York, NY, USA, 2016. JMLR: W&CP volume pyright 2016 by the author(s). model of order d is deﬁned as xt = A1xt 1 + A2xt 2 + · where xt 2 Rp denotes a mult Rp⇥p , k = 1, . . . , d are the par d 1 is the order of the mod sume that the noise ✏t 2 Rp fo tion, ✏t ⇠ N (0, ⌃), with E(✏t✏T t for ⌧ 6= 0. The VAR process is stationary (Lutkepohl, 2007), w matrix ⌃ is assumed to be posi largest eigenvalue, i.e., ⇤min(⌃) In the current context, the para Estimating Structured VAR characterizing sample complexity and error bounds. 2.1. Regularized Estimator To estimate the parameters of the VAR model, we trans- form the model in (2) into the form suitable for regularized estimator (1). Let (x0, x1, . . . , xT ) denote the T + 1 sam- ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, of matrix X (and conseque dependencies, following (B utilize the spectral represen VAR models to control the d 2.2. Stability of VAR Mode Since VAR models are (line analysis we need to establi VAR model (2) is stable, i.e not diverge over time. For u venient to rewrite VAR mod alent VAR model of order 1 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = 2 6 6 6 6 6 4 A1 A2 . . . I 0 . . . 0 I . . . ... ... ... 0 0 . . . | {z A where A 2 Rdp⇥dp . There all the eigenvalues of A sa for 2 C, | | < 1. Equi of original parameters Ak, P (1). Let (x0, x1, . . . , xT ) denote the T + 1 sam- ated by the stable VAR model in (2), then stack- ogether we obtain = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 also be compactly written as Y = XB + E, (3) 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 N = T d+1. Vectorizing (column-wise) each (3), we get ec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) Since VAR mo analysis we ne VAR model (2 not diverge ove venient to rewr alent VAR mod 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = 2 6 6 6 6 6 4 | where A 2 R all the eigenva for 2 C, | les generated by the stable VAR model in (2), then stack- ng them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 hich can also be compactly written as Y = XB + E, (3) here Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 N⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) Since V analysi VAR m not div venient alent V 2 6 6 6 4 x xt xt ( where all the ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- analysis we VAR model not diverge o venient to re alent VAR m 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 where A 2 all the eigen for 2 C, of original p Pd k=1 Ak 1 k and Section xT T xT T 1 xT T 2 . . . xT T d AT d ✏T T which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- ance matrix of the noise ✏ is now E[✏✏T ] = ⌃ ⌦ IN⇥N . Consequently, the regularized estimator takes the form ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = where A 2 all the eigen for 2 C, of original p Pd k=1 Ak 1 k and Section 2.3. Propert In what follo trix X in (3) results will th i.i.d. 5
6. 6.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w Thm4.3 Thm4.4 Lem.4.1 Lem.4.2 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the 6
7. 7.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w Thm4.3 Thm4.4 Lem.4.1 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as Lem.4.2 o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the 7
8. 8. Restricted Eigenvalue Condition •   (well-conditioned) • cone Lasso-type • R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R) p N + c1(1+✏1) w2 (⌦R) N2 ◆ where c, c1 and c2 are positive constants. To establish restricted eigenvalue condition, we will show that inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, for some ⌫ > 0 and then set p N = ⌫. Theorem 3.4 Let ⇥ = cone(⌦Ej ) Sdp 1 , where Sdp 1 is a unit sphere. The error set ⌦Ej is deﬁned as ⌦Ej =n j 2 Rdp R( ⇤ j + j)  R( ⇤ j ) + 1 r R( j) o , for r > 1, j = 1, . . . , p, and = [ T 1 , . . . , T p ]T , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦E due to the assumption on the row-wise separability of for itiv ror O ⇣ com sam tion Con lari the the wil by w(⌦ p w(⌦ low   ( )⌦E N ↵ R [N Z ✏]. Theorem 3.3 Let ⌦R = {u 2 Rd w(⌦R) = E[ sup u2⌦R hg, ui] to be ⌦R for g ⇠ N(0, I). For any ✏ probability at least 1 c exp( m can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R p N where c, c1 and c2 are positive co To establish restricted eigenvalue that inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 then set p N = ⌫. Theorem 3.4 Let ⇥ = cone(⌦Ej⌦E 8
9. 9. Restricted Error Set • • • r=2 ⇤ ⇤ + ⌦E (L1)(L1) conecone ce matrix 1.2 of the (11) XT N,: ⇤T 2 ng all the n order to Qa), ob- y stacking as in (8), process in ⇡], where write X(!)]. 1 i⇤ , for some constant r > 1, where R⇤ ⇥ 1 N ZT ✏ ⇤ is a dual form of the vector norm R(·), which is deﬁned as R⇤ [ 1 N ZT ✏] = sup R(U)1 ⌦ 1 N ZT ✏, U ↵ , for U 2 Rdp2 , where U = [uT 1 , uT 2 , . . . , uT p ]T and ui 2 Rdp . Then the error vector k k2 belongs to the set ⌦E= ⇢ 2 Rdp2 R( ⇤ + )  R( ⇤ )+ 1 r R( ) . (15) The second condition in (Banerjee et al., 2014) establishes the upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds ||Z ||2 || ||2 p N, (16) ting Structured VAR tion 3 5 ⇤ , rom (9) r (10) need 3. Regularized Estimation Guarantees Denote by = ˆ ⇤ the error between the solution of optimization problem (4) and ⇤ , the true value of the pa- rameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guaran- tees on the accuracy of the obtained solution, i.e., the error term is bounded: || ||2  for some known . To estab- lish such conditions, we utilize the framework of (Banerjee et al., 2014). Speciﬁcally, estimation error analysis is based on the following known results adapted to our settings. The ﬁrst one characterizes the restricted error set ⌦E, where the error belongs. Lemma 3.1 Assume that  (r>1) ⌦E 2⌦E 9
10. 10.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w Thm4.3 Thm4.4 Lem.4.1 Lem.4.2 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the 10
11. 11. • 4.2 • • • k k2 ond condition in (Banerjee et al., 2014) establishes er bound on the estimation error. a 3.2 Assume that the restricted eigenvalue (RE) on holds ||Z ||2 || ||2 p N, (16) 2 cone(⌦E) and some constant  > 0, where E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) (cone(⌦E)) is a norm compatibility constant, de- (cone(⌦E)) = sup R(U) ||U||2 . , density of the VAR process in X(h)e hi! , ! 2 [0, 2⇡], where Rdp⇥dp , then we can write ⇤k[CU ] kNdp  sup 1ldp !2[0,2⇡] ⇤l[⇢X(!)]. on of spectral density is 1 ⌃E h I Ae i! 1 i⇤ , e matrix of a noise vector and A on (6). Thus, an upper bound on max[CU ]  ⇤max(⌃) ⇤min(A) , where we n 2⇡] ⇤min(A(!)) for AT ei! I Ae i! . (12) nce matrix Qa in (11), we get ax(⌃)/⇤min(A) = M. (13) The second condition in (Banerjee et al., 2014 the upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigen condition holds ||Z ||2 || ||2 p N, for 2 cone(⌦E) and some constant  > cone(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), where (cone(⌦E)) is a norm compatibility c ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . Note that the above error bound is deterministic rocess in ⇡], where rite (!)]. 1 i⇤ , or and A bound on where we (12) e get (13) the upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds ||Z ||2 || ||2 p N, (16) for 2 cone(⌦E) and some constant  > 0, where cone(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . The second condition in (Banerjee et al., 2014) establishes he upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigenvalue (RE) ondition holds ||Z ||2 || ||2 p N, (16) or 2 cone(⌦E) and some constant  > 0, where one(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . the upper bound on the estimation error. Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds ||Z ||2 || ||2 p N, (16) for 2 cone(⌦E) and some constant  > 0, where cone(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 .   E k = 1, . . . , d. Since L1 is decomposable, it can that (cone(⌦Ej ))  4 p s. Next, since ⌦R Rdp |R(u)  1}, then using Lemma 3 in (Baner 2014) and Gaussian width results in (Chandrasekap d A on we 12) 13) Lemma 3.2 Assume that the restricted eigenvalue (RE) condition holds ||Z ||2 || ||2 p N, (16) for 2 cone(⌦E) and some constant  > 0, where cone(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . Note that the above error bound is deterministic, i.e., if (14) s: 11
12. 12. • • • • Lasso k k2 or 2 cone(⌦E) and some constant  > 0, where one(⌦E) is a cone of the error set, then || ||2  1 + r r N  (cone(⌦E)), (17) where (cone(⌦E)) is a norm compatibility constant, de- ﬁned as (cone(⌦E)) = sup U2cone(⌦E ) R(U) ||U||2 . Note that the above error bound is deterministic, i.e., if (14) nd (16) hold, then the error satisﬁes the upper bound in 17). However, the results are deﬁned in terms of the quan- ities, involving Z and ✏, which are random. Therefore, in he following we establish high probability bounds on the egularization parameter in (14) and RE condition in (16). etails meter , for nship eﬁne of set with ) we R) ◆ show 0 and dp 1 L indicate stronger dependency in the data, thus requiring more samples for the RE conditions to hold with high prob- ability. Analyzing Theorems 3.3 and 3.4 we can interpret the es- tablished results as follows. As the size and dimension- ality N, p and d of the problem increase, we emphasize the scale of the results and use the order notations to de- note the constants. Select a number of samples at least N O(w2 (⇥)) and let the regularization parameter sat- isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high probability then the restricted eigenvalue condition ||Z ||2 || ||2 p N for 2 cone(⌦E) holds, so that  = O(1) is a pos- itive constant. Moreover, the norm of the estimation er- ror in optimization problem (4) is bounded by k k2  O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the norm compatibility constant (cone(⌦Ej )) is assumed to be the same for all j = 1, . . . , p, which follows from our assump- tion in (5). Consider now Theorem 3.3 and the bound on the regu- larization parameter N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . As the dimensionality of the problem p and d grows and ze and dimension- ase, we emphasize er notations to de- of samples at least tion parameter sat- th high probability n ||Z ||2 || ||2 p N = O(1) is a pos- the estimation er- unded by k k2  Note that the norm s assumed to be the s from our assump- ound on the regu- ) + w2 (⌦R) ⌘ . As N equired relationship u)  1}, and deﬁne aussian width of set 0 and ✏2 > 0 with ✏2 2, ✏1) + log(p)) we c1(1+✏1) w2 (⌦R) N2 ◆ nts. dition, we will show or some ⌫ > 0 and Sdp 1 , where Sdp 1 s deﬁned as ⌦Ej =o Analyzing Theorems 3.3 and 3.4 we can interpret the es- tablished results as follows. As the size and dimension- ality N, p and d of the problem increase, we emphasize the scale of the results and use the order notations to de- note the constants. Select a number of samples at least N O(w2 (⇥)) and let the regularization parameter sat- isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high probability then the restricted eigenvalue condition ||Z ||2 || ||2 p N for 2 cone(⌦E) holds, so that  = O(1) is a pos- itive constant. Moreover, the norm of the estimation er- ror in optimization problem (4) is bounded by k k2  O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the norm compatibility constant (cone(⌦Ej )) is assumed to be the same for all j = 1, . . . , p, which follows from our assump- tion in (5). Consider now Theorem 3.3 and the bound on the regu- larization parameter N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . As the dimensionality of the problem p and d grows and the number of samples N increases, the ﬁrst term w(⌦R) p N 2 □ 0 and Sdp 1 Ej = r r > of size he set · · · ⇥ lity of ui] to d u 2 2⌘2 + ⌫, c2 are Consider now Theorem 3.3 and the bound on the regu- larization parameter N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . As the dimensionality of the problem p and d grows and the number of samples N increases, the ﬁrst term w(⌦R) p N will dominate the second one w2 (⌦R) N2 . This can be seen by computing N for which the two terms become equal w(⌦R) p N = w2 (⌦R) N2 , which happens at N = w 2 3 (⌦R) < w(⌦R). Therefore, we can rewrite our results as fol- lows: once the restricted eigenvalue condition holds and N O ⇣ w(⌦R) p N ⌘ , the error norm is upper-bounded by k k2  O ⇣ w(⌦R) p N ⌘ (cone(⌦Ej )). 3.3. Special Cases While the presented results are valid for any norm R(·), Estimating Structured VAR k = 1, . . . , d. Since L1 is decomposable, it can be shown hat (cone(⌦Ej ))  4 p s. Next, since ⌦R = {u 2 Rdp |R(u)  1}, then using Lemma 3 in (Banerjee et al., 2014) and Gaussian width results in (Chandrasekaran et al., 2012), we can establish that w(⌦R)  O( p log(dp)). Therefore, based on Theorem 4.3 and the discussion at the nd of Section 3.2, the bound on the regularization parame- er takes the form N O ⇣p log(dp)/N ⌘ . Hence, the es- ⇣p ⌘ quence of weights and | |(1) quence of absolute values of , In (Chen & Banerjee, 2015) it O( p log(dp)/¯c), where ¯c is and the norm compatibility co 2c2 1 p s/¯c. Therefore, based on O ⇣p log(dp)/(¯cN) ⌘ and the e by k k2  O ⇣ 2c1 ¯c p s log(dp) Estimating Structured VAR ce L1 is decomposable, it can be shown ))  4 p s. Next, since ⌦R = {u 2 then using Lemma 3 in (Banerjee et al., an width results in (Chandrasekaran et al., stablish that w(⌦R)  O( p log(dp)). on Theorem 4.3 and the discussion at the , the bound on the regularization parame- N O ⇣p log(dp)/N ⌘ . Hence, the es- ounded by k k2  O ⇣p s log(dp)/N ⌘ (log(dp)). quence of weights and | |(1) quence of absolute values of In (Chen & Banerjee, 2015) it O( p log(dp)/¯c), where ¯c is and the norm compatibility co 2c2 1 p s/¯c. Therefore, based on O ⇣p log(dp)/(¯cN) ⌘ and the by k k2  O ⇣ 2c1 ¯c p s log(dp) We note that the bound obtained is similar to the bound obtaine k = 1, . . . , d. Since L1 is decompo that (cone(⌦Ej ))  4 p s. Nex Rdp |R(u)  1}, then using Lemma 2014) and Gaussian width results in 2012), we can establish that w(⌦ Therefore, based on Theorem 4.3 an end of Section 3.2, the bound on the ter takes the form N O ⇣p log(d timation error is bounded by k k2  as long as N > O(log(dp)).  k = 1, . . . , d. Since L1 is decomposable, it ca that (cone(⌦Ej ))  4 p s. Next, since ⌦R Rdp |R(u)  1}, then using Lemma 3 in (Ban 2014) and Gaussian width results in (Chandrase 2012), we can establish that w(⌦R)  O( Therefore, based on Theorem 4.3 and the discu end of Section 3.2, the bound on the regularizat⇣p ⌘12
13. 13.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the Thm4.3 Thm4.4 Lem.4.1 Lem.4.2 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as13
14. 14. ( ) : R - • R - • • → ) can be any vector norm, separable along the matrices Ak. Speciﬁcally, if we denote = T ]T and Ak(i, :) as the row of matrix Ak for , d, then our assumption is equivalent to pX =1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) clutter and without loss of generality, we assume R(·) to be the same for each row i. Since the ecouples across rows, it is straightforward to ex- analysis to the case when a different norm is used ow of Ak, e.g., L1 for row one, L2 for row two, t norm (Argyriou et al., 2012) for row three, etc. hat within a row, the norm need not be decompos- results wil bounds fo Deﬁne an we assum is distribu matrix CX CX = 2 6 6 4 where ( since CX bounded a where R( ) can be any vector norm, separable along the rows of matrices Ak. Speciﬁcally, if we denote = [ T 1 . . . T p ]T and Ak(i, :) as the row of matrix Ak for k = 1, . . . , d, then our assumption is equivalent to R( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) To reduce clutter and without loss of generality, we assume the norm R(·) to be the same for each row i. Since the analysis decouples across rows, it is straightforward to ex- tend our analysis to the case when a different norm is used for each row of Ak, e.g., L1 for row one, L2 for row two, K-support norm (Argyriou et al., 2012) for row three, etc. Observe that within a row, the norm need not be decompos- A2A1 A3 A4 ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) ) can be any vector norm, separable along the matrices Ak. Speciﬁcally, if we denote = T ]T and Ak(i, :) as the row of matrix Ak for , d, then our assumption is equivalent to pX =1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) clutter and without loss of generality, we assume R(·) to be the same for each row i. Since the ecouples across rows, it is straightforward to ex- analysis to the case when a different norm is used In what fo trix X in results wil bounds fo Deﬁne an we assum is distribu matrix CX CX = 2 6 6 4 where ( ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) where R( ) can be any vector norm, separable along the rows of matrices Ak. Speciﬁcally, if we denote = [ T 1 . . . T p ]T and Ak(i, :) as the row of matrix Ak for k = 1, . . . , d, then our assumption is equivalent to R( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) To reduce clutter and without loss of generality, we assume the norm R(·) to be the same for each row i. Since the analysis decouples across rows, it is straightforward to ex- tend our analysis to the case when a different norm is used A2A1 A3 A4 = argmin 2Rdp2 N ||y Z ||2 + N R( ), (4) ere R( ) can be any vector norm, separable along the ws of matrices Ak. Speciﬁcally, if we denote = . . . T p ]T and Ak(i, :) as the row of matrix Ak for = 1, . . . , d, then our assumption is equivalent to ( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) reduce clutter and without loss of generality, we assume norm R(·) to be the same for each row i. Since the lysis decouples across rows, it is straightforward to ex- d our analysis to the case when a different norm is used each row of Ak, e.g., L1 for row one, L2 for row two, support norm (Argyriou et al., 2012) for row three, etc. trix resu bou Deﬁ we is d mat C whe sinc bou on ap- ially in regular- (1) q , such T M ]T , is lly dis- rization shirani, Machine volume ranging from describing the behavior of economic and ﬁ- nancial time series (Tsay, 2005) to modeling the dynamical systems (Ljung, 1998) and estimating brain function con- nectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is deﬁned as xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2) where xt 2 Rp denotes a multivariate time series, Ak 2 Rp⇥p , k = 1, . . . , d are the parameters of the model, and d 1 is the order of the model. In this work, we as- sume that the noise ✏t 2 Rp follows a Gaussian distribu- tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T t ) = ⌃ and E(✏t✏T t+⌧ ) = 0, for ⌧ 6= 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix ⌃ is assumed to be positive deﬁnite with bounded largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1. In the current context, the parameters {Ak} are assumed Aj Theorem 4.3 Let ⌦R = {u 2 R |R(u)  1} of set ⌦R for g ⇠ N(0, I). For any ✏1 > 0 an log(p)) we can establish that R⇤  1 N ZT ✏  ✓ c2( where c, c1 and c2 are positive constants. To establish restricted eigenvalue condition, ⌫ > 0 and then set p N = ⌫. Theorem 4.4 Let ⇥ = cone(⌦Ej ) Sdp 1, w ⌦Ej = n j 2 Rdp R( ⇤ j + j)  R( ⇤ j ) + for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤ p in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assum 14
15. 15. • 3.4 • • ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ mption on the row-wise separability of Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to of set ⇥ for g ⇠ N(0, I) and u 2 bability at least 1 c1 exp( c2⌘2 + > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, 2 p M cw(⇥) ⌘ and c, c1, c2 are nd L, M are deﬁned in (9) and (13). we can choose ⌘ = 1 2 p NL and set p M cw(⇥) ⌘ and since p N > 0 e can establish a lower bound on the N: p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). d and using (9) and (13), we can con- er of samples needed to satisfy the condition is smaller if ⇤min(A) and w(⌦R) p N = w (⌦R) N2 , which happens at N w(⌦R). Therefore, we can rewrite ou lows: once the restricted eigenvalue con N O ⇣ w(⌦R) p N ⌘ , the error norm is u k k2  O ⇣ w(⌦R) p N ⌘ (cone(⌦Ej )). 3.3. Special Cases While the presented results are valid fo separable along the rows of Ak, it is instr ize our analysis to a few popular regula such as L1 and Group Lasso, Sparse G OWL norms. 3.3.1. LASSO To establish results for L1 norm, we ass rameter ⇤ is s-sparse, which in our case resent the largest number of non-zero ele i = 1, . . . , p, i.e., the combined i-th r ⌦Ej is a part of the decomposition in ⌦E = ⌦Ep due to the assumption on the row-wise norm R(·) in (5). Also deﬁne w(⇥) = E be a Gaussian width of set ⇥ for g ⇠ N Rdp . Then with probability at least 1 c log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p || where ⌫ = p NL 2 p M cw(⇥) ⌘ a positive constants, and L, M are deﬁned in 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2p N = p NL 2 p M cw(⇥) ⌘ and s must be satisﬁed, we can establish a lowe number of samples N: p N > 2 p M+cw(⇥ p L/2 Examining this bound and using (9) and (1 clude that the number of samples needed restricted eigenvalue condition is smaller i 1, j = 1, . . . , p, and = [ 1 , . . . , p ] , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assumption on the row-wise separability of norm R(·) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp . Then with probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L, M are deﬁned in (9) and (13). 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since p N > 0 must be satisﬁed, we can establish a lower bound on the number of samples N: p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). Examining this bound and using (9) and (13), we can con- clude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if ⇤min(A) and w( p w( low N k 3.3 W se ize su OW 3. To ra re i phere. The error set ⌦Ej is deﬁned as ⌦Ej = dp R( ⇤ j + j)  R( ⇤ j ) + 1 r R( j) o , for r > . . , p, and = [ T 1 , . . . , T p ]T , for j is of size nd ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ o the assumption on the row-wise separability of ) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to ssian width of set ⇥ for g ⇠ N (0, I) and u 2 n with probability at least 1 c1 exp( c2⌘2 + or any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are onstants, and L, M are deﬁned in (9) and (13). ssion orem 3.4, we can choose ⌘ = 1 2 p NL and set p NL 2 p M cw(⇥) ⌘ and since p N > 0 atisﬁed, we can establish a lower bound on the p 2 p M+cw(⇥) the number of samples N will dominate the second o by computing N for whic w(⌦R) p N = w2 (⌦R) N2 , which w(⌦R). Therefore, we c lows: once the restricted e N O ⇣ w(⌦R) p N ⌘ , the er k k2  O ⇣ w(⌦R) p N ⌘ (con 3.3. Special Cases While the presented result separable along the rows of ize our analysis to a few such as L1 and Group La OWL norms. 3.3.1. LASSO To establish results for L is a unit sphere. The error set ⌦Ej is deﬁned as ⌦Ej =n j 2 Rdp R( ⇤ j + j)  R( ⇤ j ) + 1 r R( j) o , for r > 1, j = 1, . . . , p, and = [ T 1 , . . . , T p ]T , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assumption on the row-wise separability of norm R(·) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp . Then with probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L, M are deﬁned in (9) and (13). 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since p N > 0 the num will dom by comp w(⌦R) p N = w(⌦R). lows: on N O k k2  3.3. Spec While th separable ize our a such as OWL no 3.3.1. L deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0 inf ||(Ip⇥p ⌦ X) ||2 ⌫, : Gaussian width n this Section we present the main results of our work, followed by the discussion on their prop lustrating some special cases based on popular Lasso and Group Lasso regularization norms. I 4 we will present the main ideas of our proof technique, with all the details delegated to the App nd D. o establish lower bound on the regularization parameter N , we derive an upper bound on R⇤[ 1 N Z or some ↵ > 0, which will establish the required relationship N ↵ R⇤[ 1 N ZT ✏]. heorem 4.3 Let ⌦R = {u 2 Rdp|R(u)  1}, and deﬁne w(⌦R) = E[ sup u2⌦R hg, ui] to be a Gaus set ⌦R for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 > 0 with probability at least 1 c exp( min g(p)) we can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R) p N + c1(1+✏1) w2(⌦R) N2 ◆ here c, c1 and c2 are positive constants. o establish restricted eigenvalue condition, we will show that inf 2cone(⌦E) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, > 0 and then set p N = ⌫. heorem 4.4 Let ⇥ = cone(⌦Ej ) Sdp 1, where Sdp 1 is a unit sphere. The error set ⌦Ej is Ej = n j 2 Rdp R( ⇤ j + j)  R( ⇤ j ) + 1 r R( j) o , for r > 1, j = 1, . . . , p, and = [ T 1 , . ] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp. Then with p( c2⌘2 + log(p)), for any ⌘ > 0 inf ||(Ip⇥p ⌦ X) ||2 ⌫, ing Theorems 3.3 and 3.4 we can interpret the es- ed results as follows. As the size and dimension- , p and d of the problem increase, we emphasize le of the results and use the order notations to de- e constants. Select a number of samples at least O(w2 (⇥)) and let the regularization parameter sat- N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high probability e restricted eigenvalue condition ||Z ||2 || ||2 p N 2 cone(⌦E) holds, so that  = O(1) is a pos- onstant. Moreover, the norm of the estimation er- optimization problem (4) is bounded by k k2  ⌦R) N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the norm ibility constant (cone(⌦Ej )) is assumed to be the or all j = 1, . . . , p, which follows from our assump- (5). er now Theorem 3.3 and the bound on the regu-⇣ w(⌦R) p w2 (⌦R) ⌘ 15
16. 16. • p   •   • • • ion on the row-wise separability of o deﬁne w(⇥) = E[sup u2⇥ hg, ui] to set ⇥ for g ⇠ N(0, I) and u 2 bility at least 1 c1 exp( c2⌘2 + 0, inf 2cone(⌦E) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, p M cw(⇥) ⌘ and c, c1, c2 are L, M are deﬁned in (9) and (13). can choose ⌘ = 1 2 p NL and set cw(⇥) ⌘ and since p N > 0 an establish a lower bound on the p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). nd using (9) and (13), we can con- of samples needed to satisfy the ndition is smaller if ⇤min(A) and lows: once the restricted eigenvalue condi N O ⇣ w(⌦R) p N ⌘ , the error norm is upp k k2  O ⇣ w(⌦R) p N ⌘ (cone(⌦Ej )). 3.3. Special Cases While the presented results are valid for a separable along the rows of Ak, it is instruc ize our analysis to a few popular regulariz such as L1 and Group Lasso, Sparse Gro OWL norms. 3.3.1. LASSO To establish results for L1 norm, we assum rameter ⇤ is s-sparse, which in our case is resent the largest number of non-zero elem i = 1, . . . , p, i.e., the combined i-th row norm R(·) in (5). Also deﬁne w(⇥) = E[s u be a Gaussian width of set ⇥ for g ⇠ N(0 Rdp . Then with probability at least 1 c1 e log(p)), for any ⌘ > 0, inf 2cone(⌦E) ||(Ip⇥p⌦ || | where ⌫ = p NL 2 p M cw(⇥) ⌘ and positive constants, and L, M are deﬁned in (9 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p p N = p NL 2 p M cw(⇥) ⌘ and sin must be satisﬁed, we can establish a lower b number of samples N: p N > 2 p M+cw(⇥) p L/2 Examining this bound and using (9) and (13), clude that the number of samples needed t restricted eigenvalue condition is smaller if ⇤ cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are deﬁned in (9) choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). (18) sing (9) and (13), we can conclude that the number of samples needed to satisfy dition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃) eans that matrices A and A in (10) and (12) must be well conditioned and the eigenvalues well inside the unit circle (see Section 3.2). Alternatively, we can wing that large values of M and small values of L indicate stronger dependency ore samples for the RE conditions to hold with high probability. d 4.4 we can interpret the established results as follows. As the size and dimen- oblem increase, we emphasize the scale of the results and use the order notations ect a number of samples at least N O(w2(⇥)) and let the regularization pa- 2 ⌘ Ej ⌦Ep due to the assumption on t norm R(·) in (5). Also deﬁne be a Gaussian width of set ⇥ Rdp . Then with probability at log(p)), for any ⌘ > 0, in 2co where ⌫ = p NL 2 p M cw positive constants, and L, M ar 3.2. Discussion From Theorem 3.4, we can ch p N = p NL 2 p M cw(⇥ must be satisﬁed, we can estab number of samples N: p N > Examining this bound and usin clude that the number of sam restricted eigenvalue condition probability at least 1 c1 exp( c2⌘ + log(p)), for any ⌘ > 0 inf 2cone(⌦E) ||(Ip⇥p ⌦ X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are deﬁne and (13). 4.2 Discussion From Theorem 4.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ an p N > 0 must be satisﬁed, we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). Examining this bound and using (9) and (13), we can conclude that the number of samples needed to the restricted eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤ are smaller. In turn, this means that matrices A and A in (10) and (12) must be well conditioned VAR process is stable, with eigenvalues well inside the unit circle (see Section 3.2). Alternatively, also understand (18) as showing that large values of M and small values of L indicate stronger depe in the data, thus requiring more samples for the RE conditions to hold with high probability. Analyzing Theorems 4.3 and 4.4 we can interpret the established results as follows. As the size and sionality N, p and d of the problem increase, we emphasize the scale of the results and use the order no to denote the constants. Select a number of samples at least N O(w2(⇥)) and let the regularizat rameter satisfy N O ⇣ w(⌦R) p + w2(⌦R) 2 ⌘ . With high probability then the restricted eigenvalue co and 3.4 we can interpret the es- ws. As the size and dimension- problem increase, we emphasize nd use the order notations to de- ct a number of samples at least the regularization parameter sat- w2 (⌦R) N2 ⌘ . With high probability value condition ||Z ||2 || ||2 p N ds, so that  = O(1) is a pos- , the norm of the estimation er- em (4) is bounded by k k2  cone(⌦Ej )). Note that the norm cone(⌦Ej )) is assumed to be the which follows from our assump- 3.3 and the bound on the regu- O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . As e problem p and d grows and r 1, j = 1, . . . , p, and = [ T 1 , . . . , T p ]T , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assumption on the row-wise separability of norm R(·) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp . Then with probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L, M are deﬁned in (9) and (13). 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since p N > 0 must be satisﬁed, we can establish a lower bound on the number of samples N: p N > 2 p M+cw(⇥) p = O(w(⇥)). by w(⌦ p N w(⌦ low N k k 3.3. Wh sepa ize such OW 3.3 To ), for any ⌘ > 0 ||(Ip⇥p ⌦ X) ||2 || ||2 ⌫, c1, c2 are positive constants, and L and M are deﬁned in (9) L and set p N = p NL 2 p M cw(⇥) ⌘ and since lower bound on the number of samples N M + cw(⇥) p L/2 = O(w(⇥)). (18) we can conclude that the number of samples needed to satisfy ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃) A and A in (10) and (12) must be well conditioned and the 0 < = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 R least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0 inf 2cone(⌦E) ||(Ip⇥p ⌦ X) ||2 || ||2 ⌫, NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M ar sion m 4.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ust be satisﬁed, we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). is bound and using (9) and (13), we can conclude that the number of samples n eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A n turn, this means that matrices A and A in (10) and (12) must be well cond16
17. 17. •       sion m 4.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ust be satisﬁed, we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). is bound and using (9) and (13), we can conclude that the number of samples n eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A n turn, this means that matrices A and A in (10) and (12) must be well cond is stable, with eigenvalues well inside the unit circle (see Section 3.2). Altern nd (18) as showing that large values of M and small values of L indicate strong us requiring more samples for the RE conditions to hold with high probability. eorems 4.3 and 4.4 we can interpret the established results as follows. As the s and d of the problem increase, we emphasize the scale of the results and use the constants. Select a number of samples at least N O(w2(⇥)) and let the reg y N O ⇣ w(⌦R) p N + w2(⌦R) N2 ⌘ . With high probability then the restricted eigen N for 2 cone(⌦E) holds, so that  = O(1) is a positive constant. Moreover, VAR ⇢X(!) = I Ae i! 1 ⌃E h I Ae i! 1 i⇤ , where ⌃E is the covariance matrix of a noise vector and A are as deﬁned in expression (6). Thus, an upper bound on CU can be obtained as ⇤max[CU ]  ⇤max(⌃) ⇤min(A) , where we deﬁned ⇤min(A) = min !2[0,2⇡] ⇤min(A(!)) for A(!) = I AT ei! I Ae i! . (12) Referring back to covariance matrix Qa in (11), we get ⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M. (13) We note that for a general VAR model, there might not exist closed-form expressions for ⇤max(A) and ⇤min(A). How- ever, for some special cases there are results establishing the bounds on these quantities (e.g., see Proposition 2.2 in (Basu & Michailidis, 2015)). inf 1jp !2[0,2⇡] ⇤j[⇢(!)]  ⇤k[CX] 1kdp  sup 1jp !2[0,2⇡] ⇤j[⇢(!)], (8) k[·] denotes the k-th eigenvalue of a matrix and for i = p 1, ⇢(!) = P1 h= 1 (h)e hi!, ! 2 s the spectral density, i.e., a Fourier transform of the autocovariance matrix (h). The advantage of spectral density is that it has a closed form expression (see Section 9.4 of [24]) ⇢(!)= I dX k=1 Ake ki! ! 1 ⌃ 2 4 I dX k=1 Ake ki! ! 1 3 5 ⇤ , denotes a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound ⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9) e deﬁned ⇤max(A) = max !2[0,2⇡] ⇤max(A(!)) for A(!)= I dX k=1 AT k eki! ! I dX k=1 Ake ki! ! , (10) endix B.1 for additional details. ishing high probability bounds we will also need information about a vector q = Xa 2 RN for Rdp, kak2 = 1. Since each element XT i,:a ⇠ N(0, aT CXa), it follows that q ⇠ N(0, Qa) with a ce matrix Qa 2 RN⇥N . It can be shown (see Appendix B.3 for more details) that Qa can be written 4 .. .. .. .. (d 1)T (d 2)T . . . (0) 5 E(xtxT t+h) 2 Rp⇥p. It turns out that since CX is a block-Toeplitz matrix, its eigenvalues can (see [13]) inf 1jp !2[0,2⇡] ⇤j[⇢(!)]  ⇤k[CX] 1kdp  sup 1jp !2[0,2⇡] ⇤j[⇢(!)], (8) notes the k-th eigenvalue of a matrix and for i = p 1, ⇢(!) = P1 h= 1 (h)e hi!, ! 2 pectral density, i.e., a Fourier transform of the autocovariance matrix (h). The advantage of al density is that it has a closed form expression (see Section 9.4 of [24]) ⇢(!)= I dX k=1 Ake ki! ! 1 ⌃ 2 4 I dX k=1 Ake ki! ! 1 3 5 ⇤ , es a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound ⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9) ned ⇤max(A) = max !2[0,2⇡] ⇤max(A(!)) for A(!)= I dX k=1 AT k eki! ! I dX k=1 Ake ki! ! , (10) B.1 for additional details. high probability bounds we will also need information about a vector q = Xa 2 RN for kak2 = 1. Since each element XT i,:a ⇠ N(0, aT CXa), it follows that q ⇠ N(0, Qa) with a rix Qa 2 RN⇥N . It can be shown (see Appendix B.3 for more details) that Qa can be written Qa = (I ⌦ aT )CU (I ⌦ a), (11) inf 1ldp !2[0,2⇡] ⇤l[⇢X(!)]  ⇤k[CU ] 1kNdp  sup 1ldp !2[0,2⇡] ⇤l[⇢X(! The closed form expression of spectral density is ⇢X(!) = I Ae i! 1 ⌃E h I Ae i! 1 where ⌃E is the covariance matrix of a noise vector and A are as deﬁned in bound on CU can be obtained as ⇤max[CU ]  ⇤max(⌃) ⇤min(A) , where we deﬁned ⇤ for A(!) = I AT ei! I Ae i! . Referring back to covariance matrix Qa in (11), we get ⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M. We note that for a general VAR model, there might not exist closed-form ⇤min(A). However, for some special cases there are results establishing (e.g., see Proposition 2.2 in [5]). Estimating Structured VAR is that it has a closed form expression (see Section Priestley, 1981)) I dX k=1 Ake ki! ! 1 ⌃ 2 4 I dX k=1 Ake ki! ! 1 3 5 ⇤ , denotes a Hermitian of a matrix. Therefore, from an establish the following lower bound ⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9) we deﬁned ⇤max(A) = max !2[0,2⇡] ⇤max(A(!)) for = I dX k=1 AT k eki! ! I dX k=1 Ake ki! ! . (10) lishing high probability bounds we will also need tion about a vector q = Xa 2 RN for any a 2 Rdp , 1. Since each element XT i,:a ⇠ N(0, aT CXa), ws that q ⇠ N(0, Qa) with a covariance matrix N⇥N 3. Regularized Estimati Denote by = ˆ ⇤ the e optimization problem (4) and rameter. The focus of our wor under which the optimization tees on the accuracy of the obt term is bounded: || ||2  f lish such conditions, we utilize et al., 2014). Speciﬁcally, estim on the following known results ﬁrst one characterizes the restr error belongs. Lemma 3.1 Assume that N rR⇤  to R( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . To reduce clutter and without loss of generality, we assume the norm R(·) to be the sam Since the analysis decouples across rows, it is straightforward to extend our analysis t different norm is used for each row of Ak, e.g., L1 for row one, L2 for row two, K-suppo three, etc. Observe that within a row, the norm need not be decomposable across column The main difference between the estimation problem in (1) and the formulation in (4 pendence between the samples (x0, x1, . . . , xT ), violating the i.i.d. assumption on the 1, . . . , Np}. In particular, this leads to the correlations between the rows and columns consequently of Z). To deal with such dependencies, following [5], we utilize the spectra the autocovariance of VAR models to control the dependencies in matrix X. 3.2 Stability of VAR Model Since VAR models are (linear) dynamical systems, for the analysis we need to establish which the VAR model (2) is stable, i.e., the time-series process does not diverge over time. stability, it is convenient to rewrite VAR model of order d in (2) as an equivalent VAR mo 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = 2 6 6 6 6 6 4 A1 A2 . . . Ad 1 Ad I 0 . . . 0 0 0 I . . . 0 0 ... ... ... ... ... 0 0 . . . I 0 3 7 7 7 7 7 5 | {z } A 2 6 6 6 4 xt 1 xt 2 ... xt d 3 7 7 7 5 + 2 6 6 6 4 ✏t 0 ... 0 3 7 7 7 5 where A 2 Rdp⇥dp. Therefore, VAR process is stable if all the eigenvalues of A satisfy de 0 for 2 C, | | < 1. Equivalently, if expressed in terms of original parameters Ak, sta det(I Pd k=1 Ak 1 k ) = 0 (see Appendix A for more details).   L = 1, j = 1, . . . , p, and = [ 1 , . . . , p ] , for j is of size dp ⇥ 1, and ⇤ = [ ⇤T 1 . . . ⇤T p ]T , for ⇤ j 2 Rdp . The set ⌦Ej is a part of the decomposition in ⌦E = ⌦E1 ⇥ · · · ⇥ ⌦Ep due to the assumption on the row-wise separability of norm R(·) in (5). Also deﬁne w(⇥) = E[sup u2⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp . Then with probability at least 1 c1 exp( c2⌘2 + log(p)), for any ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, where ⌫ = p NL 2 p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L, M are deﬁned in (9) and (13). 3.2. Discussion From Theorem 3.4, we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ and since p N > 0 must be satisﬁed, we can establish a lower bound on the number of samples N: p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). Examining this bound and using (9) and (13), we can con- L =   17
18. 18.   2 ⌦E     Lasso regularization norms. I the main ideas of our proof te delegated to the supplement. To establish lower bound on N , we derive an upper bou some ↵ > 0, which will estab N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 w(⌦R) = E[ sup u2⌦R hg, ui] to ⌦R for g ⇠ N(0, I). For a probability at least 1 c exp can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w Thm4.3 Thm4.4 Lem.4.1 Lem.4.2 o hold with high prob- e can interpret the es- e size and dimension- crease, we emphasize order notations to de- er of samples at least ization parameter sat- With high probability tion ||Z ||2 || ||2 p N  = O(1) is a pos- of the estimation er- bounded by k k2  )). Note that the norm ction 3.4 we will present ique, with all the details regularization parameter n R⇤ [ 1 N ZT ✏]  ↵, for the required relationship p |R(u)  1}, and deﬁne a Gaussian width of set > 0 and ✏2 > 0 with min(✏2 2, ✏1) + log(p)) we ) + c1(1+✏1) w2 (⌦R) N2 ◆ nstants. condition, we will show N as showing that large values of M and small va L indicate stronger dependency in the data, thus req more samples for the RE conditions to hold with high ability. Analyzing Theorems 3.3 and 3.4 we can interpret tablished results as follows. As the size and dime ality N, p and d of the problem increase, we emp the scale of the results and use the order notations note the constants. Select a number of samples a N O(w2 (⇥)) and let the regularization paramet isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high prob then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) is itive constant. Moreover, the norm of the estimat ror in optimization problem (4) is bounded by k O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that the compatibility constant (cone(⌦Ej )) is assumed to same for all j = 1, . . . , p, which follows from our as o understand the bound on s of M and small values of y in the data, thus requiring ions to hold with high prob- 3.4 we can interpret the es- As the size and dimension- m increase, we emphasize e the order notations to de- number of samples at least gularization parameter sat- R) ⌘ . With high probability condition ||Z ||2 || ||2 p N that  = O(1) is a pos- norm of the estimation er- 4) is bounded by k k2  ⌦Ej )). Note that the norm ⌦Ej )) is assumed to be the 18
19. 19. • • 4.3 • • d by the discussion on their properties and up Lasso regularization norms. In Section ll the details delegated to the Appendices C derive an upper bound on R⇤[ 1 N ZT ✏]  ↵, N ↵ R⇤[ 1 N ZT ✏]. R) = E[ sup u2⌦R hg, ui] to be a Gaussian width bability at least 1 c exp( min(✏2 2, ✏1) + 1(1+✏1) w2(⌦R) N2 ◆ t inf 2cone(⌦E) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, for some illustrating some special cases based 4.4 we will present the main ideas of o and D. To establish lower bound on the regula for some ↵ > 0, which will establish Theorem 4.3 Let ⌦R = {u 2 Rdp|R( of set ⌦R for g ⇠ N(0, I). For any ✏1 log(p)) we can establish that R⇤  1 N ZT ✏ where c, c1 and c2 are positive consta To establish restricted eigenvalue co ⌫ > 0 and then set p N = ⌫. ction we present the main results of our work, followed by the discussion on their prope g some special cases based on popular Lasso and Group Lasso regularization norms. In ll present the main ideas of our proof technique, with all the details delegated to the Appe sh lower bound on the regularization parameter N , we derive an upper bound on R⇤[ 1 N Z ↵ > 0, which will establish the required relationship N ↵ R⇤[ 1 N ZT ✏]. 4.3 Let ⌦R = {u 2 Rdp|R(u)  1}, and deﬁne w(⌦R) = E[ sup u2⌦R hg, ui] to be a Gauss for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 > 0 with probability at least 1 c exp( min( e can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R) p N + c1(1+✏1) w2(⌦R) N2 ◆ c1 and c2 are positive constants. ish restricted eigenvalue condition, we will show that inf 2cone(⌦E) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, d then set p N = ⌫. 4.4 Let ⇥ = cone(⌦Ej ) Sdp 1, where Sdp 1 is a unit sphere. The error set ⌦Ej is d To establish lower bound on the regularization pa N , we derive an upper bound on R⇤ [ 1 N ZT ✏]  some ↵ > 0, which will establish the required rela N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 Rdp |R(u)  1}, an w(⌦R) = E[ sup u2⌦R hg, ui] to be a Gaussian widt ⌦R for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 > probability at least 1 c exp( min(✏2 2, ✏1) + log can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R) p N + c1(1+✏1) w2 where c, c1 and c2 are positive constants. To establish restricted eigenvalue condition, we w ||(Ip⇥p⌦X) ||2 To establish lower bound on the regularizat N , we derive an upper bound on R⇤ [ 1 N ZT some ↵ > 0, which will establish the require N ↵ R⇤ [ 1 N ZT ✏]. Theorem 3.3 Let ⌦R = {u 2 Rdp |R(u)  1 w(⌦R) = E[ sup u2⌦R hg, ui] to be a Gaussian ⌦R for g ⇠ N(0, I). For any ✏1 > 0 and probability at least 1 c exp( min(✏2 2, ✏1) can establish that R⇤  1 N ZT ✏  ✓ c2(1+✏2) w(⌦R) p N + c1(1+ where c, c1 and c2 are positive constants. To establish restricted eigenvalue condition, ||(Ip⇥p⌦X) ||2 rL ned in terms of the quantities, involving Z and ✏, which are blish high probability bounds on the regularization parameter f our work, followed by the discussion on their properties and ular Lasso and Group Lasso regularization norms. In Section of technique, with all the details delegated to the Appendices C n parameter N , we derive an upper bound on R⇤[ 1 N ZT ✏]  ↵, uired relationship N ↵ R⇤[ 1 N ZT ✏]. }, and deﬁne w(⌦R) = E[ sup u2⌦R hg, ui] to be a Gaussian width nd ✏2 > 0 with probability at least 1 c exp( min(✏2 2, ✏1) + (1+✏2) w(⌦R) p + c1(1+✏1) w2(⌦R) 2 ◆ re (cone(⌦E)) is a norm compatibility constant, deﬁned as (cone(⌦E)) = sup U2cone(⌦E) e that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error sati nd in (17). However, the results are deﬁned in terms of the quantities, involving Z and dom. Therefore, in the following we establish high probability bounds on the regularizat 14) and RE condition in (16). High Probability Bounds his Section we present the main results of our work, followed by the discussion on their strating some special cases based on popular Lasso and Group Lasso regularization norm we will present the main ideas of our proof technique, with all the details delegated to the D. stablish lower bound on the regularization parameter N , we derive an upper bound on R⇤ some ↵ > 0, which will establish the required relationship N ↵ R⇤[ 1 N ZT ✏]. orem 4.3 Let ⌦R = {u 2 Rdp|R(u)  1}, and deﬁne w(⌦R) = E[ sup u2⌦R hg, ui] to be a G et ⌦R for g ⇠ N(0, I). For any ✏1 > 0 and ✏2 > 0 with probability at least 1 c exp( p)) we can establish that :   Gaussian width ) = O(w(⇥)). (18) clude that the number of samples needed to satisfy nd ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃) n (10) and (12) must be well conditioned and the nit circle (see Section 3.2). Alternatively, we can nd small values of L indicate stronger dependency tions to hold with high probability. blished results as follows. As the size and dimen- e the scale of the results and use the order notations ast N O(w2(⇥)) and let the regularization pa- robability then the restricted eigenvalue condition 1) is a positive constant. Moreover, the norm of the by k k2  O ⇣ w(⌦R) p N + w2(⌦R) N2 ⌘ (cone(⌦Ej )). ) is assumed to be the same for all j = 1, . . . , p, arization parameter N O ⇣ w(⌦R) p N + w2(⌦R) N2 ⌘ . the number of samples N increases, the ﬁrst term 19
20. 20. • N p (d=1 ) • N   0 1000 2000 3000 4000 5000 0 5 10 15 20 0 1000 2000 3000 4000 5000 0 0.2 0.4 0.6 0.8 1 0 100 200 300 400 500 600 700 0.4 0.5 0.6 0.7 0.8 0.9 1 kk2 N Np kk2 /max /max N/[s log(pd)] 0 100 200 300 400 500 600 0 5 10 15 20 kk2 N N kk2 /max /max 0 1000 2000 3000 4000 5000 0 5 10 15 0 500 1000 1500 2000 2500 0 5 10 15 0 1000 2000 3000 4000 5000 0.4 0.5 0.6 0.7 0.8 0.9 1 N/[s(m + log K)] K 10 20 30 40 50 60 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 (a) (b) (c) (d) (e) (f) (g) (h) Figure 1: Results for estimating parameters of a stable ﬁrst order sparse VAR (top row) and group sparse VAR (bottom row). Problem dimensions: p 2 [10, 600], N 2 [10, 5000], N max 2 [0, 1], K 2 [2, 60] and d = 1. Figures (a) and (e) show dependency of errors on sample size for different p; in Figure (b) the N is scaled by (s log p) and plotted against k k2 to show that errors scale as (s log p)/N; in (f) the graph is similar to (b) but for group sparse VAR; in (c) and (g) we show dependency of N on p (or number of groups K in (g)) for ﬁxed sample size N; ﬁnally, Figures (d) and (h) display the dependency of N on N for ﬁxed p. p ⇥ hg, ui] to be a Gaussian width of set ⇥ for g ⇠ N(0, I) and u 2 Rdp. Th c1 exp( c2⌘2 + log(p)), for any ⌘ > 0 inf 2cone(⌦E) ||(Ip⇥p ⌦ X) ||2 || ||2 ⌫, p M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are deﬁne we can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ an tisﬁed, we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). 20
21. 21. • VAR(d=1) 1step • • or k N N k / / 0 1000 2000 3000 4000 5000 0 5 0 500 1000 1500 2000 2500 0 5 0 1000 2000 3000 4000 5000 0.4 0.5 0.6 N/[s(m + log K)] K 10 20 30 40 50 60 0.3 0.4 0.5 0.6 (e) (f) (g) (h) Figure 1. Results for estimating parameters of a stable ﬁrst order sparse VAR (top row) and group sparse VAR (bottom row). Problem dimensions: p 2 [10, 600], N 2 [10, 5000], N max 2 [0, 1], K 2 [2, 60] and d = 1. Figures (a) and (e) show dependency of errors on sample size for different p; in Figure (b) the N is scaled by (s log p) and plotted against k k2 to show that errors scale as (s log p)/N in (f) the graph is similar to (b) but for group sparse VAR; in (c) and (g) we show dependency of N on p (or number of groups K in (g)) for ﬁxed sample size N; ﬁnally, Figures (d) and (h) display the dependency of N on N for ﬁxed p. Lasso OWL Group Lasso Sparse Group Lasso Ridge 32.3(6.5) 32.2(6.6) 32.7(6.5) 32.2(6.4) 33.5(6.1) 32.7(7.9) 44.5(15.6) 75.3(8.4) 38.4(9.6) 99.9(0.2) Table 1. Mean squared error (row 2) of the ﬁve methods used in ﬁtting VAR model, evaluated on aviation dataset (MSE is computed using one-step-ahead prediction errors). Row 3 shows the average number of non-zeros (as a percentage of total number of elements) in the VAR matrix. The last row shows a typical sparsity pattern in A1 for each method (darker dots - stronger dependencies, lighter dots weaker dependencies). The values in parenthesis denote one standard deviation after averaging the results over 300 ﬂights. the results after averaging across 300 ﬂights. From the table we can see that the considered problem ex- hibits a sparse structure since all the methods detected sim- ilar patterns in matrix A1. In particular, the analysis of such patterns revealed a meaningful relationship among the ﬂight parameters (darker dots), e.g., normal acceleration had high dependency on vertical speed and angle-of-attack, the altitude had mainly dependency with fuel quantity, ver- tical speed with aircraft nose pitch angle, etc. The results also showed that the sparse regularization helps in recov- 5. Conclusions In this work we present a set of results for characterizing non-asymptotic estimation error in estimating structured vector autoregressive models. The analysis holds for any norms, separable along the rows of parameter matrices Our analysis is general as it is expressed in terms of Gaus sian widths, a geometric measure of size of suitable sets and includes as special cases many of the existing result focused on structured sparsity in VAR models. A1 nMSE(%) (%) R(·) TO 2 f its atomic formulation. tiation of the CG algorithm to handle the Ivanov lation of OWL regularization; more speciﬁcally: e show how the atomic formulation of the OWL orm allows solving efﬁciently the linear program- ming problem in each iteration of the CG algorithm; ased on results from [25], we show convergence of he resulting algorithm and provide explicit values or the constants. w derivation of the proximity operator of the OWL arguably simpler than those in [10] and [44], ghting its connection to isotonic regression and the adjacent violators (PAV) algorithm [3], [8]. ﬁcient method to project onto an OWL norm ball, on a root-ﬁnding scheme. ng the Ivanov formulation under OWL regulariza- sing projected gradient algorithms, based on the sed OWL projection. er is organized as follows. Section II, after reviewing norm and some of its basic properties, presents formulation, and derives its dual norm. The two utational tools for using OWL regularization, the operator and the projection on a ball, are addressed III. Section IV instantiates the CG algorithm rated projected gradient algorithms to tackle the optimization formulation of OWL regularization egression. Finally, Section V reports experimental strating the performance and comparison of the pproaches, and Section VI concludes the paper. ase bold letters, e.g., x, y, denote (column) vectors, oses are xT , yT , and the i-th and j-th components n as xi and yj. Matrices are written in upper Fig. 1. OWL balls in R2 with different weights: (a) w1 > w2 > 0; (b) w1 = w2 > 0; (c) w1 > w2 = 0. where w 2 Km+ is a vector of non-increasing weights, i.e., belonging to the so-called monotone non-negative cone [13], Km+ = {x 2 Rn : x1 x2 · · · xn 0} ⇢ Rn +. (2) OWL balls in R2 and R3 , for different choices of the weight vector w, are illustrated in Figs. 1 and 2. Fig. 2. OWL balls in R3 with different weights: (a) w1 > w2 > w3 > 0; (b) w1 > w2 = w3 > 0; (c) w1 = w2 > w3 > 0; (d) w1 = w2 > w3 = 0; (e) w1 > w2 = w3 = 0; (f) w1 = w2 = w3 > 0. OWL balls in R2 with different weights: (a) w1 > w2 > 0; (b) > 0; (c) w1 > w2 = 0. w 2 Km+ is a vector of non-increasing weights, i.e., ng to the so-called monotone non-negative cone [13], + = {x 2 Rn : x1 x2 · · · xn 0} ⇢ Rn +. (2) alls in R2 and R3 , for different choices of the weight w, are illustrated in Figs. 1 and 2. OWL balls in R3 with different weights: (a) w1 > w2 > w3 > 0; w2 = w3 > 0; (c) w1 = w2 > w3 > 0; (d) w1 = w2 > w3 = 0; w2 = w3 = 0; (f) w1 = w2 = w3 > 0. act that, if w 2 Km+ {0}, then ⌦w is indeed a norm nvex and homogenous of degree 1), was shown in [10], s clear that ⌦w is lower bounded by the (appropriately `1 norm: ⌦w(x) w1|x|[1] = w1 kxk1, (3) [Zheng15] 21
22. 22. • -   VAR • i.i.d. • : inf 2cone(⌦E) p⇥p 2 || ||2 ⌫, M cw(⇥) ⌘ and c, c1, c2 are positive constants, and L and M are de can choose ⌘ = 1 2 p NL and set p N = p NL 2 p M cw(⇥) ⌘ ﬁed, we can establish a lower bound on the number of samples N p N > 2 p M + cw(⇥) p L/2 = O(w(⇥)). nd using (9) and (13), we can conclude that the number of samples neede condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A) an s means that matrices A and A in (10) and (12) must be well condition with eigenvalues well inside the unit circle (see Section 3.2). Alternativ showing that large values of M and small values of L indicate stronger d N > p L/2 = O(w(⇥)) mining this bound and using (9) and (13), we can conclude that the stricted eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) a maller. In turn, this means that matrices A and A in (10) and (12 process is stable, with eigenvalues well inside the unit circle (see understand (18) as showing that large values of M and small value data, thus requiring more samples for the RE conditions to hold w yzing Theorems 4.3 and 4.4 we can interpret the established result lity N, p and d of the problem increase, we emphasize the scale of t note the constants. Select a number of samples at least N O(w er satisfy N O ⇣ w(⌦R) p N + w2(⌦R) N2 ⌘ . With high probability the 2 p N for 2 cone(⌦E) holds, so that  = O(1) is a positive ation error in optimization problem (4) is bounded by k k2  O /2 = O(w(⇥)). (18) n conclude that the number of samples needed to satisfy (A) and ⇤min(⌃) are larger and ⇤max(A) and ⇤max(⌃) d A in (10) and (12) must be well conditioned and the e the unit circle (see Section 3.2). Alternatively, we can f M and small values of L indicate stronger dependency conditions to hold with high probability. he established results as follows. As the size and dimen- phasize the scale of the results and use the order notations s at least N O(w2(⇥)) and let the regularization pa- high probability then the restricted eigenvalue condition = O(1) is a positive constant. Moreover, the norm of the nded by k k2  O ⇣ w(⌦R) p N + w2(⌦R) N2 ⌘ (cone(⌦Ej )). (⌦Ej )) is assumed to be the same for all j = 1, . . . , p, 22
23. 23. • • • • • -   • [Negahban09] • ARX   23