1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jeff Roach

1D Convolutional Neural Networks for
Time Series Modeling
PyData LA 2018
Nathan Janos and Jeff Roach

Who are we?
● Nathan Janos
○ Chief Data Officer @ System1 (4.5 years)
○ 15 years in ad-tech optimization
● Jeff Roach
○ Data Scientist @ System1 (2+ years)
○ Background in epidemiology

Part 1: Motivation and Prototype
Nathan Janos

From 2D to 1D
???
Graphics attributed to Mathworks and Wikipedia Creative Commons

Motivation
● Deep learning and the new wave of neural networks are increasingly popular
● Focus is in the visual space for classification
● We are interested in time series forecasting
● Couldn’t find as much modern work in this area
○ Sequence classification in language, text, audio
○ LSTM (long short-term memory), GRU (gated recurrent unit), RNN (recurrent NN)
Graphic attributed to Wikipedia Creative Commons

Discrete Time Signal Processing
● What about combining DSP with NNs?
○ Used in domains such as speech processing, sonar, radar, biomedical engineering, seismology
● Why not treat our hourly data as samples like one of these signals?
● 2D convolution works well for image classification
● Can 1D convolution work for time series forecasting?
● Had the idea to apply classic discrete time convolution techniques to 1D data...

Convolution
● Inspired by the convolution used in visual NNs (cross correlation)
● But instead use the definition of convolution used in signal processing
● It’s the integral of the product of two functions after one is reversed and shifted
Graphics attributed to Wikipedia Creative Commons
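
For sampled (hourly) data the integral becomes a sum. The sketch below contrasts signal-processing convolution with the cross correlation used in most vision networks; the function names and toy values are illustrative, not taken from the talk.

```python
def convolve(signal, kernel):
    """Full discrete convolution: out[n] = sum over k of signal[k] * kernel[n - k]."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def cross_correlate(signal, kernel):
    """Cross correlation: slide the kernel over the signal without reversing it.

    Equivalent to convolving with the reversed kernel.
    """
    return convolve(signal, kernel[::-1])


signal = [1.0, 2.0, 3.0]
kernel = [0.0, 1.0, 0.5]  # asymmetric, so the two operations differ

conv = convolve(signal, kernel)
corr = cross_correlate(signal, kernel)
print(conv)
print(corr)
```

The two results coincide only when the kernel is symmetric, which is why the distinction matters for hand-designed filters more than for learned ones.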

Basic Architecture
[Diagram: input = y(t) → convolve(input, weights) → two hidden layers, each with relu and pool → regression → output = y(t+1)]

Parameterized
[Diagram: y(t) → filter layer 1 → filter layer 2 → regression layer → y(t+1); each filter applies relu and pool in each filter layer]
● T is length of time series (T = 1512)
● t is size of intermediate time series (t = 24 → 12 → 6)
● W is size of window (W = 24)
● F is number of filters (F = 6)
● D is depth of filters (D = 2)
● Regression layer: 6 x 6 grid of weights
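
A minimal forward-pass sketch of this layout, showing how a 24-hour window shrinks to 12 and then 6 values per filter before the regression. The kernel length of 3, left zero padding, max pooling, and applying relu before pool are all assumptions; the slides do not specify them.

```python
import random


def conv_same(x, w):
    """Causal convolution that keeps the input length (zero-padded on the left)."""
    return [
        sum(w[k] * x[t - k] for k in range(len(w)) if t - k >= 0)
        for t in range(len(x))
    ]


def relu(x):
    return [max(0.0, v) for v in x]


def max_pool2(x):
    """Halve the length by taking the max of each non-overlapping pair."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]


def forward(window, filters, regression_weights):
    """filters[f][d] is the kernel of filter f at depth d (hypothetical layout)."""
    features = []
    for layers in filters:
        h = window
        for kernel in layers:
            h = max_pool2(relu(conv_same(h, kernel)))
        features.extend(h)
    prediction = sum(w * v for w, v in zip(regression_weights, features))
    return prediction, features


random.seed(0)
W, F, D = 24, 6, 2  # window, number of filters, depth, as on the slide
window = [random.random() for _ in range(W)]
filters = [
    [[random.uniform(-1, 1) for _ in range(3)] for _ in range(D)]
    for _ in range(F)
]
n_features = F * (W // 2 ** D)  # 6 filters * 6 values each
regression_weights = [random.uniform(-0.2, 0.2) for _ in range(n_features)]

y_next, features = forward(window, filters, regression_weights)
print(len(features), y_next)
```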

Parameter Space Example
● One layer of filters has n(n + 1)/2 = 6(6 + 1)/2 = 21 weights
● Two layers = 21 * 2 = 42 filter weight parameters
● Two layers deep and a window of 24 hours = each bottom filter output has 24/2/2 = 6 values
● 6 filters * 6 output values = 36 regression weight parameters
● 78 total parameters
● A network with 24 filters, 3 deep, running on a week of hourly data = 1404 parameters
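
The 1404 figure in the last bullet can be reproduced with a short calculation. This sketch assumes the n(n + 1)/2 weights-per-layer formula stated above and one halving of the window per layer; the function name is illustrative.

```python
def count_parameters(n_filters, depth, window):
    # Filter weights per layer, using the formula n * (n + 1) / 2.
    filter_weights_per_layer = n_filters * (n_filters + 1) // 2
    # Each layer pools by 2, so the window is halved once per layer.
    outputs_per_filter = window // (2 ** depth)
    # One regression weight per remaining value per filter.
    regression_weights = n_filters * outputs_per_filter
    return filter_weights_per_layer * depth + regression_weights


week_of_hours = 7 * 24
total = count_parameters(n_filters=24, depth=3, window=week_of_hours)
print(total)
```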

Learning About Learning Rate
[Plot: learning rate (y-axis) vs. iterations over time (x-axis)]
This is a type of stochastic gradient descent with restarts (SGDR)
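
A sketch of a decaying learning rate with restarts, using the values given in the Prototype Training Detail slide (start at 1.0, multiply by 0.95 each iteration, restart at the last initial rate * 0.95). The threshold of 0.01 is an assumption; the slides do not state it.

```python
def lr_schedule(n_iterations, initial_lr=1.0, decay=0.95, threshold=0.01):
    """Yield a decaying learning rate that restarts from a slightly lower peak."""
    peak = initial_lr
    lr = peak
    for _ in range(n_iterations):
        yield lr
        lr *= decay
        if lr < threshold:
            peak *= decay  # each restart begins a bit lower than the previous one
            lr = peak


rates = list(lr_schedule(300))
restarts = [i for i in range(1, len(rates)) if rates[i] > rates[i - 1]]
print(restarts)
```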

Data and Testing
● Revenue per click data on mobile devices in automotive category
● Hourly data from 4/1/2018 to 6/2/2018 (63 days, 9 weeks of data)
● Train on first 8 weeks of hourly data
● Test on last week of data
● Evaluated with MASE (mean absolute scaled error): the model's error scaled by the error of a “simple” 1-hour lagged data model
○ MASE < 1.0 means we are beating the simple model
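
A minimal sketch of this metric, assuming it is the ratio of the model's mean absolute error to that of the 1-hour lag baseline over the same test window. The function names and toy series are illustrative.

```python
def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mase_vs_lag1(series, predictions):
    """predictions[i] forecasts series[i + 1]; the baseline predicts series[i]."""
    actual = series[1:]
    naive = series[:-1]
    return mean_abs_error(actual, predictions) / mean_abs_error(actual, naive)


series = [10.0, 12.0, 11.0, 13.0, 12.0]
lag1_forecast = series[:-1]  # the "simple" model itself
halfway = [(a + b) / 2 for a, b in zip(series[:-1], series[1:])]

print(mase_vs_lag1(series, lag1_forecast))
print(mase_vs_lag1(series, halfway))
```

The baseline scores exactly 1.0 against itself, so any forecast that scores below 1.0 has a lower mean absolute error than simply repeating the previous hour.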

Best Network Results
● Using networks with about 6 filters, 3 deep, window of 24 hours of data
● Training took ~20 minutes
● Training on 8 weeks of data
● Best MASE compared to simple 1-hour lag model was ~0.86

Prototype Conclusion
● Probably should use GPU framework to make it faster
● Lots of time spent on hyperparameter tuning
● Should consider other network architectures
● Build out in an established NN framework to leverage backpropagation

Part 2: Architecture Extension and
Production PyTorch Implementation
Jeff Roach

Goals
● Port to Python
○ PyTorch
○ Fastai
● Find architecture improvements
● Beat current best production model (TBATS)
○ Linear time series model that captures complex seasonal trends
○ Exponential Smoothing State Space Model With Box-Cox Transformation, ARMA Errors, Trend And Seasonal Components
○ TBATS R package to fit model as described in De Livera, Hyndman & Snyder (2011)

Architecture
[Diagram: WaveNet, single filter (F = 1). W is size of window; W = 24, T = 1512. y(t) → filter layer 1 (relu, pool; t = 24 in, t = 12 out) → filter layer 2 (relu, pool; t = 12 in, t = 6 out) → regression layer (t = 6 in) → y(t+1)]

Architecture
[Diagram: WaveNet Expansion, F = 6 filters. W is size of window; W = 24, T = 1512. y(t) → filter layer 1 (t = 24 in, t = 12 out) → filter layer 2 (t = 12 in, t = 6 out) → regression layer (t = 6 in, 6 x 6 grid of weights) → y(t+1)]

Architecture
[Diagram: WaveNet Expansion, F = 1. W is size of window; W = 24, T = 1512. y(t) → filter layer 1 (relu, pool; t = 24 in, t = 12 out) → filter layer 2 (relu, pool; t = 12 in, t = 6 out) → regression layer (t = 6 in) → y(t+1), with a connection labeled 1x]

Architecture
[Diagram: WaveNet Expansion, F = 6. W is size of window; W = 24, T = 1512. y(t) → filter layer 1 (relu, pool; t = 24 in, t = 12 out) → filter layer 2 (relu, pool; t = 12 in, t = 6 out) → regression layer (t = 6 in) → y(t+1), with connections labeled 24x, 12x and 6x]

Architecture
[Diagram: Ensemble. W is size of window; W = 24, T = 1512.
WaveNet branch: y(t) → filter layer 1 (relu, pool; t = 24 in, t = 12 out) → filter layer 2 (relu, pool; t = 12 in, t = 6 out), contributing 108 values (108x).
fastai’s MixedInputModel branch: Embedding (hour as category), BatchNorm (continuous variables), dropout, fully connected 2x layers of 1000 and 500 neurons with relu and dropout, contributing 500 values (500x). Dropout values shown: 0.001, 0.01 and 0.04.
Both branches feed a Linear regression layer → y(t+1)]
Hour as category

FilterNet in PyTorch
class FilterNet(nn.Module):
    def __init__(self, emb_szs, n_cont, emb_drop, out_sz, ...):
        # Wavenet model layers
        self.c1a = conv_layer(window=window // 2, ks=1, dilation=1)
        self.c1b = conv_layer(window=window // 4, ks=1, dilation=2)
        self.c2a = conv_layer(window=window // 2, ks=2, dilation=1)
        self.c2b = conv_layer(window=window // 4, ks=2, dilation=2)
        ...
        # Mixed Input model
        ...
        # Final layer
        self.f = Flatten()
        self.lin = nn.Linear(szs[-1] + 108, out_sz, bias=False)

    def forward(self, x_win, x_cat, x_cont):
        # Wavenet model
        self.f1a = self.c1a(x_win)
        self.f1b = self.c1b(self.f1a)
        self.f2a = self.c2a(x_win)
        self.f2b = self.c2b(self.f2a)
        ...
        x_wave = torch.cat([self.f1a, self.f1b, self.f2a, ... ], 2)
        # Mixed Input Model
        ...
        x_mix = dropout(x)
        # Combine results from both nets
        ...
        x_comb = torch.cat([x_wave, x_mix], 2)
        lin_out = self.lin(self.f(x_comb))
        return lin_out
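
The `conv_layer` helper is not shown on the slide. As a rough illustration of what the `dilation` argument does, here is a framework-free dilated 1D convolution; it is a sketch of the general technique, not the authors' layer, and the function name and toy values are assumptions.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """Valid (no padding) 1D cross correlation with gaps of `dilation` between taps."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * x[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(x) - span)
    ]


x = [float(v) for v in range(8)]  # toy window: 0, 1, ..., 7
layer_a = dilated_conv1d(x, [1.0, 1.0], dilation=1)        # sums adjacent values
layer_b = dilated_conv1d(layer_a, [1.0, 1.0], dilation=2)  # sums values two steps apart
print(layer_a)
print(layer_b)
```

Stacking dilation 1 and then dilation 2 lets each second-layer output cover four consecutive inputs with only two weights per layer, which is the appeal of the WaveNet-style design.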

Model Comparison t-1
Model (Language, Processing Unit)    MASE    Time
TBATS (R, CPU)                       0.90    30s
WaveNet Expansion (Matlab, CPU)      0.86    1200s
WaveNet Expansion (PyTorch, GPU)     0.86    16s
FilterNet (PyTorch, GPU)             0.82    27s

Model Comparison t-1 Different Category
● Previously trained Automotive category
● Forecasted on Finance category
Trained on Automotive data    Automotive MASE    Finance MASE
TBATS (R, CPU)                0.90               0.96
FilterNet (PyTorch, GPU)      0.82               0.90

Model Comparison t-1 Missing Data
Replaced every nth step with n-1 past data point    TBATS MASE    FilterNet MASE    % diff
2 (every other step)                                1.23          1.29              +5%
6                                                   1.03          0.98              -5%
12                                                  0.98          0.91              -7%
24                                                  0.96          0.87              -9%
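
One reading of this experiment is that every nth sample is overwritten with the sample just before it. The sketch below implements that interpretation; it is an assumption about the procedure, and the function name is illustrative.

```python
def replace_every_nth(series, n):
    """Overwrite every nth value with the value immediately before it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    out = list(series)
    for i in range(n - 1, len(out), n):
        out[i] = out[i - 1]
    return out


hourly = [1, 2, 3, 4, 5, 6]
print(replace_every_nth(hourly, 2))
print(replace_every_nth(hourly, 3))
```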

Jagged Dataset
● Jagged
○ Categories use different features
○ Long/short time periods
○ Few/many missing data points
● ~1300 advertising categories
● Hourly data
● Training = 37 days or 888 hours
● Test = 7 days or 168 hours

Model Comparison t-1 Jagged
Model                                                                    MASE    Time
TBATS, Single Category                                                   0.84    17s
FilterNet, Single Category                                               0.83    4s
FilterNet, Full training set (~1300 Categories)                          0.78    60s
FilterNet, Full training set, test Category removed from training set    0.78    60s

Model Comparison t-1 Jagged
# of training set days    TBATS MASE    FilterNet MASE    % diff
14                        1.30          0.82              -37%
21                        1.05          0.86              -18%
28                        0.82          0.83              0%
37                        0.84          0.83              0%

Conclusion
● FilterNet perks
○ Performance (7%)
○ Training speed (10%-300%)
○ Context
○ Less sensitive to data quantity
■ But, more sensitive to data quality
● Convolution and context models are complementary

Questions?
Thanks for attending!

Prototype Training Detail
● Initialize
○ Seed filter weights with random values from [-1, -0.5, 0, 0.5, 1]
○ Seed regression weights with random values in range -0.2 to 0.2
○ Set learning rate to 1.0
● Iterate 100s to 100,000s of times
○ Forward propagate current network and store MSE
○ Randomly select a subset of weights (usually 10%) to move a small amount one at a time
■ Filter weights are moved in random increments of 0.1 or 0.01
■ Regression weights are moved by another different small amount
■ Store resulting MSE from moving each weight independently
○ Update parameters
■ If MSE for a filter weight delta is lower update it by that random increment
■ If MSE for a regression weight delta is lower update by gradient
○ Update learning rate: multiply by 0.95
○ If learning rate < threshold, reset it to the last initial learning rate * 0.95
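
A simplified sketch of the accept-if-better weight search described above, run on a toy linear model rather than the filter network. It keeps the seed values and step sizes from the slide, but moves are accepted one at a time instead of being scored independently, and the gradient update and learning-rate decay are omitted. All names and the toy data are assumptions.

```python
import random


def mse(weights, xs, ys):
    preds = [sum(w * v for w, v in zip(weights, x)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def perturbation_search(xs, ys, n_weights, n_iterations=2000, fraction=0.1, seed=0):
    rng = random.Random(seed)
    # Seed weights with random values from the slide's set.
    weights = [rng.choice([-1, -0.5, 0, 0.5, 1]) for _ in range(n_weights)]
    best = mse(weights, xs, ys)
    history = [best]
    n_pick = max(1, int(fraction * n_weights))
    for _ in range(n_iterations):
        # Move a random subset of weights, one at a time.
        for i in rng.sample(range(n_weights), n_pick):
            old = weights[i]
            weights[i] = old + rng.choice([0.1, 0.01]) * rng.choice([-1, 1])
            trial = mse(weights, xs, ys)
            if trial < best:
                best = trial      # keep the move
            else:
                weights[i] = old  # undo the move
        history.append(best)
    return weights, history


data_rng = random.Random(1)
true_weights = [0.5, -0.3, 0.2]  # hypothetical target
xs = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
ys = [sum(w * v for w, v in zip(true_weights, x)) for x in xs]

weights, history = perturbation_search(xs, ys, n_weights=3)
print(history[0], history[-1])
```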

References
● Time Series Wavenet paper
○ https://arxiv.org/pdf/1703.04691.pdf

Model Comparison t-1 Facebook
Model                                                                    MASE    Time
TBATS, Single Category                                                   0.84    17s
FilterNet, Single Category, w/ imputed, Batch Size = 1                   1.41    150s
FilterNet, Single Category, Batch Size = 1                               0.84    150s
FilterNet, Single Category, Batch Size = 512                             0.83    4s
FilterNet, Full training set (~1300 Categories)                          0.78    60s
FilterNet, Full training set, w/o CNN                                    0.79    60s
FilterNet, Full training set, w/o Mixed Input                            0.80    60s
FilterNet, Full training set, test Category removed from training set    0.78    60s

Production
● Inspired by how fastai loads pretrained models
● Save trained model to dictionary
○ State, structure, tuning parameters, etc.
● Model framework in common location
● Rebuild model using framework and dictionary
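
A framework-free sketch of this save-and-rebuild pattern. `TinyModel`, `MODEL_REGISTRY` and the checkpoint keys are hypothetical stand-ins; only the method names `state_dict` and `load_state_dict` mirror PyTorch's `nn.Module`.

```python
import pickle


class TinyModel:
    """Stand-in for a network class kept in a shared module (hypothetical)."""

    def __init__(self, window, n_filters):
        self.window = window
        self.n_filters = n_filters
        self.weights = [0.0] * n_filters

    def state_dict(self):
        return {"weights": list(self.weights)}

    def load_state_dict(self, state):
        self.weights = list(state["weights"])


# The "model framework in a common location": a lookup from name to class.
MODEL_REGISTRY = {"TinyModel": TinyModel}


def to_checkpoint(model, tuning):
    """Bundle state, structure and tuning parameters into one dictionary."""
    return {
        "arch": type(model).__name__,
        "structure": {"window": model.window, "n_filters": model.n_filters},
        "tuning": dict(tuning),
        "state": model.state_dict(),
    }


def from_checkpoint(checkpoint):
    """Rebuild the model from the registry, then load its saved state."""
    model = MODEL_REGISTRY[checkpoint["arch"]](**checkpoint["structure"])
    model.load_state_dict(checkpoint["state"])
    return model


trained = TinyModel(window=24, n_filters=6)
trained.weights = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4]

blob = pickle.dumps(to_checkpoint(trained, {"learning_rate": 1.0}))
restored = from_checkpoint(pickle.loads(blob))
print(restored.weights)
```

Keeping the structure arguments next to the weights means the serving code needs only the shared class definitions and the dictionary to reconstruct an identical model.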
