This presentation summarizes most of the hands-on materials from a course I've been giving over the past few weeks on basic machine learning techniques, with implementations in KNIME.
I hope these materials will be useful to other learners taking their first steps with this great platform.
The course covers:
* basic I/O
* classification
* regression
* prediction
* evaluation
* feature selection
* hyper-parameter optimization
* basic feature extraction
* deep learning basics in the KNIME Analytics Platform
** Note that the live course also included additional theoretical lectures and materials that built a basic understanding of the principles underlying the hands-on content described here
3. outline
Introduction to the KNIME platform
Data IO (reading/writing data from/to files)
Basic data manipulation (data preprocessing)
Basic data exploration and plotting
Initial data modeling – logistic regression
Exploration of model results
Using complex features to improve model results
Aug 4th 2019 Machine learning and data science with KNIME – Nathaniel Shimoni 3
4. The KNIME platform
The KNIME platform is an interactive tool that supports data processing, modeling, visualization, and much more…
The main areas of the KNIME window: KNIME explorer, node repository, work area, execution console, and workflow outline.
5. The KNIME platform – Input and output
Reading and writing data from/to files
1. Go to the node repository and search for “read”
2. Select the CSV reader and drag it to the work area
3. Right-click the node and select Configure
4. Browse to and select the file you would like to open (throughout this tutorial we will use the file “Churn_Modelling.csv” from the course materials folder)
5. We can customize the file separators
6. The KNIME platform – Input and output
Reading and writing data from/to files
6. We can select the number of lines to skip and the number of lines to read
7. We can also customize the encoding (this can be useful for some file types/languages)
7. The KNIME platform – Input and output
Finally, after we are done configuring the node, we can run it and explore the resulting data:
1. Right-click the node
2. Click Execute or press F7 to run the node
3. Once reading is complete, click File Table to view the result – expect to see the following output:
8. The KNIME platform – Input and output
1. Notice the column types (D for double, I for integer, S for string)
2. We can also view the data dimensions (10k rows and 13 columns)
9. The KNIME platform – Input and output
1. We can also get some basic information about the data we loaded, such as the minimal and maximal values
10. The KNIME platform – data manipulation
Filtering
Sometimes we want to view/process/model only part of the data, so we need to filter out the rest
1. Add a row filter node to the workflow
2. Filter out all the rows in which age is lower than 30
3. Note that you can choose whether to include or exclude the rows that satisfy the condition
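The same filtering step can be sketched outside KNIME in plain Python; the rows and the "Age" field below are illustrative assumptions, not the course data:

```python
# Minimal sketch of a row filter: keep only customers aged 30 or older.
rows = [
    {"Surname": "Azar", "Age": 25, "Balance": 1200.0},
    {"Surname": "Boyle", "Age": 42, "Balance": 560.5},
    {"Surname": "Chu", "Age": 31, "Balance": 0.0},
]

# Include rows that satisfy the condition (KNIME lets you include or exclude).
kept = [r for r in rows if r["Age"] >= 30]
print([r["Surname"] for r in kept])  # ['Boyle', 'Chu']
```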
11. The KNIME platform – data manipulation
Filtering (cont.)
We can also use various logical rules for filtering:
1. Add a rule-based row filter node
2. Configure the new node by adding one or more logical rules
3. Add the “=>“ operator followed by either TRUE or FALSE as the output
12. The KNIME platform – data manipulation
Replacing values
Sometimes we would like to replace specific values with an alternative value
This can even be necessary when:
1. We get text error messages within a numerical feature column
2. We have outliers that we want to change to another value
3. We have missing or fixed values (read errors)
13. The KNIME platform – data manipulation
Replacing values
1. Select the string replacer node
2. Replace any name that starts with ‘A’ with ‘Ace’
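The same replacement can be sketched in Python with a regular expression; in KNIME the node uses a wildcard pattern such as "A*", while here an anchored regex plays that role (the names are illustrative assumptions):

```python
import re

# Sketch of the string-replacer step: any name starting with 'A' becomes 'Ace'.
names = ["Adams", "Brown", "Avery", "Clark"]
replaced = [re.sub(r"^A.*$", "Ace", n) for n in names]
print(replaced)  # ['Ace', 'Brown', 'Ace', 'Clark']
```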
14. The KNIME platform – data manipulation
Missing values:
We have several ways to deal with missing values; depending on the type of the feature, we can:
1. Fill in with a specific value
2. Replace with the following/previous good value
3. Fill in based on some statistical value (average, mode, etc.)
4. Fill in with an interpolation or with a moving average
* Use the missing value node to experiment with all of these options
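A few of the strategies above can be sketched in plain Python on a toy numeric column (`None` stands in for a missing cell; the values are made up for illustration):

```python
# Sketch of common missing-value strategies on a numeric column.
values = [10.0, None, 12.0, None, None, 16.0]

# 1. Fill with a specific value.
fixed = [v if v is not None else 0.0 for v in values]

# 2. Replace with the previous good value (forward fill).
ffill, last = [], None
for v in values:
    last = v if v is not None else last
    ffill.append(last)

# 3. Fill with a statistical value (the mean of the observed entries).
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
mean_filled = [v if v is not None else mean for v in values]

print(fixed, ffill, mean_filled)
```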
15. The KNIME platform – data manipulation
Pivoting / unpivoting:
In some cases we would like to aggregate data per feature value; we can use the pivoting node to accomplish this task
In this example we will check the average age and the 90th percentile of age for males/females in each country
1. Add a pivoting node to the workflow
2. Add ‘geography’ in the groups tab
3. Add ‘gender’ in the pivots tab
4. Add ‘age’ to the aggregation tab (twice)
5. Define mean and percentile as the desired aggregation functions
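The group/pivot/aggregate idea can be sketched in Python: group rows by the (Geography, Gender) pair and aggregate the age per cell. The rows and the mean aggregation are illustrative assumptions (the exercise also adds a percentile):

```python
from collections import defaultdict
from statistics import mean

# Sketch of pivoting: mean age per (Geography, Gender) cell.
rows = [
    {"Geography": "France", "Gender": "Male", "Age": 40},
    {"Geography": "France", "Gender": "Female", "Age": 30},
    {"Geography": "Spain", "Gender": "Male", "Age": 50},
    {"Geography": "France", "Gender": "Male", "Age": 20},
]

cells = defaultdict(list)
for r in rows:
    cells[(r["Geography"], r["Gender"])].append(r["Age"])

pivot = {k: mean(v) for k, v in cells.items()}
print(pivot[("France", "Male")])  # 30
```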
16. The KNIME platform – data manipulation
Pivoting / unpivoting:
We get 3 views from this process
1. A pivot table view
2. A group-totals view (summarizes rows)
3. A pivot totals view (summarizes columns)
17. The KNIME platform – data manipulation
Renaming features
Sometimes our column names are tricky to remember or even
distinguish – in such cases it can be a good idea to rename these
columns to contain a more human-friendly name
1. Add a rename node to the workflow
2. Change the names of columns to human-understandable names
18. The KNIME platform – data manipulation
Creating new features
In most cases the data we get when we start our research will not contain the optimal features to begin with, and we will want to create additional features that yield better predictive power
1. Add a formula node
2. Create a new feature that holds the following formula: balance / salary
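The derived feature is a simple ratio; a Python sketch (the zero-salary guard is an added assumption to keep the division safe, not part of the KNIME formula):

```python
# Sketch of the derived feature balance / salary, guarding against zero salary.
def balance_to_salary(balance, salary):
    return balance / salary if salary else None

print(balance_to_salary(5000.0, 50000.0))  # 0.1
```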
19. The KNIME platform – data manipulation
Dealing with textual and categorical features
For most algorithms the data should not contain any textual features. This can be solved using “one-hot encoding”
1. Add a ‘one to many’ node
2. Add the desired string features to the selection
3. Run the node and check the output table
4. You should now see columns that contain binary values for each of the values in these string features
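One-hot encoding itself is easy to sketch: one binary column per distinct category value. The geography column below is an illustrative assumption:

```python
# Sketch of one-hot ("one to many") encoding for a string feature.
geography = ["France", "Spain", "France", "Germany"]
categories = sorted(set(geography))

# One binary indicator per category, per row.
encoded = [[1 if g == c else 0 for c in categories] for g in geography]
print(categories)  # ['France', 'Germany', 'Spain']
print(encoded[0])  # [1, 0, 0]
```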
20. The KNIME platform – data exploration
Before modeling our data we should first get to know it better
This can be done in many ways – using plots, exploring statistical aggregations and extremes of our data, and also via basic modeling and viewing the results and errors
We will start with scatter plots
There are many scatter plot nodes; I recommend using the “2D-3D scatter plot” node for exploration, as its interactive mode is very convenient and fits this stage well.
21. The KNIME platform – data exploration
Scatter plots
1. Add a “2D-3D scatter plot” node
2. Select columns for your plot
(this could be changed later in
the interactive mode)
3. Make sure to adjust number of
rows to display according to
your need
22. The KNIME platform – data exploration
Scatter plots
4. Run the node
5. Select the relevant feature for each of the axes
6. Use the target column for the color values
7. Note that you can filter the presented results using the sliders in the bottom-right corner
8. You can rotate the plot by dragging one of the axes in the desired direction
23. The KNIME platform – basic modeling and concepts
It's finally time to create our model – a logistic regression model
Use the partitioning (a.k.a. train/test split) node to create training and validation sets
This node has 2 output ports:
One for our training set – the data our model will be trained on
And the other for our validation (or test) set – the data the model will not be exposed to, which we will use to test the model's performance
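The partitioning step can be sketched in Python as a seeded random split; the 70/30 ratio and seed 42 are illustrative assumptions, and integers stand in for data rows:

```python
import random

# Sketch of the partitioning node: a seeded random 70/30 train/validation split.
rows = list(range(10))          # stand-in for data rows
rng = random.Random(42)         # fixed seed -> reproducible split
shuffled = rows[:]
rng.shuffle(shuffled)

cut = int(len(shuffled) * 0.7)
train, valid = shuffled[:cut], shuffled[cut:]
print(len(train), len(valid))  # 7 3
```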
24. The KNIME platform – basic modeling and concepts
1. Add a logistic regression learner node to the workflow
2. Set our desired target column
3. Select the desired explanatory variables
25. The KNIME platform – basic modeling and concepts
1. Add a logistic regression predictor to the workflow
2. Make sure to connect the predictor to both the model and the validation partition we created
3. We can define whether predictions will be probabilities or the predicted category
4. The workflow should look like the one below:
26. The KNIME platform – result analysis & error analysis
Checking our results:
1. Add a scorer node
2. Define the target column and the predicted column
3. Run the node
27. The KNIME platform – result analysis & error analysis
Checking our results:
1. We can now view either the confusion matrix or the accuracy results (shown at the bottom of the list)
2. A summarized view of both is available using the view in the middle of the list
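What the scorer computes can be sketched directly; the label vectors below are made up for illustration:

```python
# Sketch of what the scorer node computes: confusion matrix and accuracy.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```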
28. The KNIME platform – complex feature creation
Now we can use everything we have done so far to create some advanced features and improve our results
Your task:
Use everything we have learned so far to improve the model results
You may explore additional exploration and transformation nodes
At this point do not use other learner-predictor nodes
Let the competition begin!
30. outline
Recap – logistic regression model
Feature selection in KNIME
Classification using random forest & XGBoost
Hyper-parameter tuning in KNIME
Regression using random forest & XGBoost
What to do when data is not IID (independent and identically distributed)
Time series unique characteristics
Mini-Hackathon!
31. The KNIME platform – basic modeling recap
1. Using a logistic regression model we got an initial result of ~80% accuracy, and with some effort we managed to improve it to 83% accuracy while still using a logistic regression model
32. The KNIME platform – basic modeling - feature selection
It turns out that we can get to 83% accuracy with only 7 features
(to improve our score we have to ignore the ‘gender’, ‘hasCrCredit’, and ‘balance’ features)
But, how can we decide which features to use and which to ignore?
33. The KNIME platform – basic modeling - feature selection loop
introducing feature selection loops!
First we run some experiments with different feature combinations
Then we filter in only the relevant columns
Finally we run the model with the best feature combination
34. The KNIME platform – feature selection loop
Now let's do this one step at a time…
First let's reconstruct our experiment:
1. Read the data from the previous exercise into a new workflow (the “churn_modeling.csv” file)
2. Encode the target column as nominal using the ‘number to string’ node
3. Remove outliers from the data using the ‘numeric outliers’ node
4. Fill in missing values using the ‘missing’ node
5. Split the data into training and testing sets using the partitioning node (use the same method and seed from lesson 1 for a valid comparison)
6. Add ‘logistic regression learner’ and ‘logistic regression predictor’ nodes
7. Add a ‘scorer’ node so that we can view the metrics of our model's predictions
35. The KNIME platform – feature selection loop
And… add the feature selection loop:
8. Add a ‘feature selection loop start’ node before the partitioning node
9. Add a ‘feature selection loop end’ node after the scorer node
10. Right-click the scorer node and select “show flow variable ports”; you will notice two red dots above the node
11. Connect the scorer variable output to the feature selection loop end input port (red dots)
12. Add a ‘feature selection filter’ node
36. The KNIME platform – feature selection loop
13. Copy and paste the partitioning, learner, predictor and scorer nodes
14. Connect the new partitioning node to the feature selection filter node
You should now have a workflow similar to the one below:
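The idea behind the feature selection loop can be sketched as a toy backward-elimination search: repeatedly drop the feature whose removal hurts the score least. The `score` function below is a made-up stand-in for the train-then-score body of the KNIME loop:

```python
# Toy sketch of a backward feature-selection loop. The score function is an
# illustrative assumption: three useful features, the rest add small noise.
def score(features):
    useful = {"age": 0.4, "salary": 0.3, "tenure": 0.2}
    return sum(useful.get(f, -0.05) for f in features)

features = ["age", "salary", "tenure", "gender", "balance"]
best = set(features)
improved = True
while improved:
    improved = False
    for f in sorted(best):                 # sorted for deterministic order
        candidate = best - {f}
        if candidate and score(candidate) >= score(best):
            best, improved = candidate, True
            break

print(sorted(best))  # ['age', 'salary', 'tenure']
```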
37. outline
Recap – logistic regression model
Feature selection in KNIME
Classification using random forest & XGBoost
Hyper-parameter tuning in KNIME
Regression using random forest & XGBoost
What to do when data is not IID (independent and identically distributed)
Time series unique characteristics
Mini-Hackathon!
38. The KNIME platform – random forest model
Now that we are familiar with some tree-based
models we can give them a try…
1. Read the data from the previous exercise to a new
workflow (“churn_modeling.csv” file)
2. Encode the target column to be nominal using
number to string node
3. Add features as you wish
4. Add random forest learner and predictor nodes
5. Tune the number of models, maximal depth, minimum node size, and split criterion as you wish
6. Add a scorer node and ROC-curve node
7. Run the workflow and explore results
39. The KNIME platform – XGBoost model
Now that we are familiar with some tree-based
models we can give them a try…
1. Read the data from the previous exercise to a new
workflow (“churn_modeling.csv” file)
2. Encode the target column to be nominal using
number to string node
3. Add features as you wish
4. Add XGBoost learner and predictor nodes
5. Tune the number of models, maximal depth, minimum node size, and split criterion as you wish
6. Add a scorer node and ROC-curve node
7. Run the workflow and explore results
40. The KNIME platform – comparing XGBoost vs. RF models
• Once we have added the two new models, our workflow should look as follows:
41. outline
Recap – logistic regression model
Feature selection in KNIME
Classification using random forest & XGBoost
Hyper-parameter tuning in KNIME
Regression using random forest & XGBoost
What to do when data is not IID (independent and identically distributed)
Time series unique characteristics
Mini-Hackathon!
42. The KNIME platform – (hyper)parameter optimization
• Well, the results look great in comparison with our previous logistic regression models, but are they optimal?
• To answer this question we may wish to use parameter optimization loop nodes
43. The KNIME platform – (hyper)parameter optimization
1. To our previous random forest and XGBoost models we will add two parameter optimization loops – the steps below are for the random forest model; please complete the XGBoost one with similar steps
2. Add a parameter optimization loop start node
and connect it to the variable in-port
3. Add a parameter optimization loop end
and connect the variable out-port of the
scorer to its in-port
44. The KNIME platform – (hyper)parameter optimization
4. Configure the parameter optimization loop start node:
• Add the subsample, min_child_weight & max_depth parameters with relevant ranges
• Select the number of iterations for random search or the step size for brute-force search
5. Configure the learner to use the parameters
45. The KNIME platform – (hyper)parameter optimization
6. Configure the parameter optimization loop end:
set the objective function to Accuracy and optimization to maximize
7. Run the loop from the loop end node
8. View the best configuration or all configuration results
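The random-search loop can be sketched as: sample parameter combinations, evaluate an objective, keep the best. The `objective` function below is a made-up stand-in for "train the learner, then read accuracy from the scorer"; the parameter ranges are illustrative assumptions:

```python
import random

# Toy sketch of a parameter-optimization loop using random search.
def objective(max_depth, min_child_weight):
    # Assumed stand-in: peaks at max_depth=6, min_child_weight=2.
    return 0.8 - 0.01 * abs(max_depth - 6) - 0.02 * abs(min_child_weight - 2)

rng = random.Random(0)
best_params, best_score = None, float("-inf")
for _ in range(50):                       # number of random-search iterations
    params = {
        "max_depth": rng.randint(2, 12),
        "min_child_weight": rng.randint(1, 8),
    }
    s = objective(**params)
    if s > best_score:
        best_params, best_score = params, s

print(best_params, round(best_score, 3))
```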
46. outline
Recap – logistic regression model
Feature selection in KNIME
Classification using random forest & XGBoost
Parameter tuning in KNIME
Regression using random forest & XGBoost
What to do when data is not IID (independent and identically distributed)
Time series unique characteristics
Mini-Hackathon!
47. The KNIME platform – regression
• The main difference between what we have done so far for classification problems and what we will do with regression tasks is the characteristics of the target column
• In classification tasks our target is categorical (represented as a string in KNIME), while in regression tasks our target is numerical
• Pay attention to use the appropriate nodes – both learner and predictor should contain ‘(regression)’ in the node description
• To the right you can find an example of a random-forest-based regression workflow
• Build & run the workflow
• Report the results you got
We will use a numeric scorer, as the metrics for regression differ from classification metrics
The line plot will show the difference between the predicted value and the ground truth
48. The KNIME platform – regression
1. As in our previous tasks we will start by reading the data – insert a csv reader node and read the file ‘steam data - lesson 2.csv’
2. As before, after reading the data let's first understand our goal – we would like to validate the data of the P64TI4332 tag
3. We will do this by using all the other parameters to predict the value of the P64TI4332 tag
49. The KNIME platform – regression
We can now use the things we have learned so far to compare the regression error metrics (MAE, RMSE, R²) among various regression algorithms
We will compare:
• linear regression
• polynomial regression
• random forest
• XGBoost
(no need to panic at the number of nodes – it's going to be quite simple)
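The three metrics being compared are easy to state precisely; a Python sketch on made-up values:

```python
import math

# Sketch of the regression metrics compared in this exercise: MAE, RMSE, R^2.
actual    = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 3.0, 8.0]

n = len(actual)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

mean_a = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(round(mae, 3), round(rmse, 3), round(r2, 3))
```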
50. outline
Recap – logistic regression model
Feature selection in Knime
Classification using random forest & XGBoost
Parameter tuning in Knime
Regression using random forest & XGBoost
What to do when data is not IID (Identically independently distributed)
Time series unique characteristics
Mini-Hackathon!
51. Not IID – still OK! just pay attention
• When data is not IID we need to make adjustments to:
• Validation method
• Our feature creation process
• Our target(s)
• Data originating groups
52. Time series data
• Split by time, not randomly
• Use lagged columns for features
• Use time difference based features
• Be aware of these important terms:
• Horizon of forecast
• Lag of available data
• Seasonality (there may be more than one)
• Trend
• Frequency
• Sampling rate and timing
• Prediction vs. backcast vs. current time regression
A random split of the data resembles an imputation task
Time-based separation is appropriate for prediction tasks
53. The KNIME platform – prediction
1. We will now reuse the data from the previous regression task and redefine our problem as a prediction task
2. Given the current (and historical) data, we should predict the temperature 4 hours ahead
3. Use the lag column node to create the appropriate target column
4. Try to come up with effective explanatory features
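The lag-column step amounts to shifting the series so each row is paired with a future value; a Python sketch (hourly samples and the values are illustrative assumptions):

```python
# Sketch of the lag-column idea: shift a series to build a lagged target.
# With hourly samples, a lag of 4 aligns each row with the value 4 hours ahead.
series = [20.0, 21.0, 22.5, 23.0, 24.0, 25.5, 26.0]
horizon = 4

# target[i] is the value "horizon" steps after row i (None where unavailable).
target = series[horizon:] + [None] * horizon
print(target)  # [24.0, 25.5, 26.0, None, None, None, None]
```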
54. outline
Recap – logistic regression model
Feature selection in KNIME
Classification using random forest & XGBoost
Parameter tuning in KNIME
Regression using random forest & XGBoost
What to do when data is not IID (independent and identically distributed)
Time series unique characteristics
Mini-Hackathon!
56. Class mini-hackathon!
Use everything we have learned so far to create a model that
1. Reads the case study XXX.csv file
2. Predicts the current target using the other variables
3. *Predicts tomorrow's target using today's features (24h horizon)
4. **Classifies whether tomorrow's target will be higher or lower than today's current target
You may explore any additional nodes that you think are relevant
You will probably find nodepit.com useful
Let the competition begin! (guided classroom competition)
58. outline
Recap - What to do when data is not IID (independent and identically distributed)
Time series unique characteristics
Lag-column node – the long (and obvious) way to create lagged features
Some more useful loop types:
• Column list loops
• Table row to variable loops
59. Not IID – still OK! just pay attention
• When data is not IID we need to make adjustments to:
• Validation method
• Our feature creation process
• Our target(s)
• Data originating groups
60. Time series data
• Split by time, not randomly
• Use lagged columns for features
• Use time-difference-based features
• Be aware of these important terms:
• Horizon of forecast
• Lag of available data
• Seasonality (there may be more than one)
• Trend
• Frequency
• Sampling rate and timing
• Prediction vs. backcast vs. current time regression
A random split of the data resembles an imputation task
Time-based separation is appropriate for prediction tasks
61. The KNIME platform – prediction
1. We will now reuse the data from the previous regression task and redefine our problem as a prediction task
2. Given the current (and historical) data, we should predict the target a few hours ahead
3. Use the lag column node to create the appropriate feature columns
4. Try to come up with effective explanatory features
63. The KNIME platform – basic modeling recap
1. Based on our last workflow (from HW) we would now like to create a
prediction task that uses the same variables but with lagged features instead
of those from the same timestamp
64. The KNIME platform – basic modeling recap
• While the former example is valid and will work, it will be hard to experiment with (consider changing the lag interval from 5 to 4, or creating a few lag intervals instead of just one)
• Introducing: table row to variable loops
65. The KNIME platform – basic modeling recap
• But what if we want to create a few lag intervals?
• Introducing: column list loops
66. The KNIME platform – basic modeling recap
• Now let’s combine it all together…
68. outline
From “we’re new to KNIME” to “we’ve pushed KNIME to the edge”
Deep Learning within KNIME:
• Constructing the network
• Preparing the data
• Training a learner
• Saving the network and weights
• Predicting new data
Other network types
LSTMs using KNIME
CNNs using KNIME
KNIME forums
69. outline
From “we’re new to KNIME” to “we’ve pushed KNIME to the edge”
Deep Learning within KNIME:
• Constructing the network
• Preparing the data
• Training a learner
• Saving the network and weights
• Predicting new data
Other network types
LSTMs using KNIME
CNNs using KNIME
KNIME forums (you should really write a blogpost on this part of the lesson)
70. The KNIME platform – time series modeling with Deep Learning
We will start with the workflow we created during the last lesson
71. The KNIME platform – Deep Learning in Knime
• The first network we will construct is a simple feed-forward network
• A feed-forward network works similarly to the learners that we have trained before – we just need to define its structure first
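The "structure" being defined is just input size, hidden units with an activation, and an output unit. A minimal pure-Python forward pass makes this concrete; the weights are arbitrary examples (in KNIME the Keras learner node fits them during training):

```python
# Minimal sketch of the structure an FFN defines: 2 inputs -> 2 hidden
# ReLU units -> 1 linear output. Weights are illustrative assumptions.
def relu(x):
    return max(0.0, x)

def forward(features, w_hidden, w_out):
    # Each hidden unit is a weighted sum of the inputs passed through ReLU.
    hidden = [relu(sum(w * x for w, x in zip(ws, features))) for ws in w_hidden]
    # The output unit is a weighted sum of the hidden activations.
    return sum(w * h for w, h in zip(w_out, hidden))

w_hidden = [[0.5, -0.2], [0.1, 0.4]]   # 2 inputs -> 2 hidden units
w_out = [1.0, -1.0]                    # 2 hidden units -> 1 output
print(forward([1.0, 2.0], w_hidden, w_out))
```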
72. The KNIME platform – Deep Learning in Knime
We will use the same preprocessing workflow from the previous lesson
We will define the network architecture using Keras nodes
Externally, the modeling stage is very similar to what we have done before; internally, the configuration differs significantly
The additional preprocessing for an FFN is minimal – we just need to remove all missing values
73. The KNIME platform – Deep Learning in Knime
For input layers we will define the shape of the input; in an FFN this is the number of features we feed the model with
For hidden layers we will select the number of units (neurons) the layer will contain and the activation function that we would like to use
For output layers we will likewise set the number of units (neurons) and the activation function
Note that these are determined by our target rather than selected freely, and the type of this activation is largely responsible for the model's final result
74. The KNIME platform – Deep Learning in Knime
Now let's configure the network learner hyperparameters
Take the target column out of the input columns
Verify that conversion is set to “from number (double)”
Define the target column on the relevant tab
Verify that its conversion is set to “from number (double)” as well
Remember to use a loss function appropriate to the problem you are trying to model!
75. The KNIME platform – Deep Learning in Knime
Now let's configure the network learner hyperparameters
In the options tab define “epochs” – the number of times the model sees all of the training samples
Set the learning rate – note its inverse relation to the number of epochs
76. The KNIME platform – Deep Learning in Knime
After executing the network learner, right-click the learner node and select “View: Learning Monitor”
Let's take a look at the various information this view supplies us with…
It shows accuracy (for classification tasks) or loss (for any task), the training/validation error, and the current epoch & batch
It also lets you stop the learning process (and keep the results!)
77. The KNIME platform – Deep Learning in Knime
We need to configure our predictor (the Keras executor) to yield the results we want, and configure our scorer accordingly
Add an output to the predictor
Select the last layer's output as the predictor's output
Make sure that the output type is numeric so we can use our scorer
Check this box to have the option to run a scorer
78. The KNIME platform – Deep Learning in Knime
Finally, we have trained the network and can save the model weights for later use
Select a file name, click save, and run the Keras network writer node!
79. The KNIME platform – Deep Learning in Knime
Now let's try to read the network we saved and use it for prediction
Connect the predictor node:
• to the network reader port
• to the validation data
Define the predictor output (as before)
And… predict!
80. outline
From “we’re new to KNIME” to “we’ve pushed KNIME to the edge”
Deep Learning within KNIME:
• Constructing the network
• Preparing the data
• Training a learner
• Saving the network and weights
• Predicting new data
Other network types
LSTMs using KNIME
CNNs using KNIME
KNIME forums (you should really write a blogpost on this part of the lesson)
81. The KNIME platform – Deep Learning in Knime
By now we can probably wrap this part up as a meta-node (we can also add more lags)
We will define the new network architecture using LSTM/CuDNN LSTM nodes
The additional preprocessing for using LSTMs is more complex and contains some extra steps
82. The KNIME platform – Deep Learning in Knime
We will define the new network architecture using LSTM/CuDNN LSTM nodes
For input layers we will define the shape of the input; for an LSTM this will be (number of time stamps {3 in our example}) x (number of features {9 in our example})
Check these boxes if the next node is also an LSTM node
Select the number of neurons
For output layers we will set the number of units (neurons) the layer will contain and the activation function that we would like to use
Note that these are determined by our target rather than selected freely
83. The KNIME platform – Deep Learning in Knime
We will usually want to
normalize our data prior to
using it in a NN to help
the model converge.
You may consider normalizing the
target as well, but remember to
de-normalize it after prediction and
before assessing the model results.
We would like the columns to be
ordered by feature and not by time
lag, so let's sort the columns…
Column order before sorting
Column order after sorting
Use the action buttons to create
the sorting order you need
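The normalize/de-normalize round trip can be sketched outside KNIME in a few lines of numpy (the target values below are arbitrary demo data; min-max scaling is one common choice):

```python
import numpy as np

y = np.array([120.0, 80.0, 95.0, 150.0])   # arbitrary target values

y_min, y_max = y.min(), y.max()
y_norm = (y - y_min) / (y_max - y_min)     # normalize to [0, 1] before training

# ... train, then predict in normalized space ...
pred_norm = y_norm.copy()                  # stand-in for model predictions

# De-normalize before assessing model results
pred = pred_norm * (y_max - y_min) + y_min
print(np.allclose(pred, y))                # True
```

Forgetting the last step means evaluating errors on the wrong scale, which is why the slide calls it out.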
84. The KNIME platform – Deep Learning in Knime
Now comes the hacky part…
Use the “Create Collection Column” node to
convert multiple columns into a single
collection column
85. The KNIME platform – Deep Learning in Knime
Now comes hacky part II:
Use the “Data Row to Image” node
to convert the collection column from
the previous stage into a 3D array,
a.k.a. a “tensor”.
(While this stage is not mandatory, it lets
us look into the tensor dimensions
and verify we got the desired result.)
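Conceptually, these two “hacky” steps are just a reshape of the lagged columns into a (time steps) x (features) block per row. A minimal numpy sketch of the same idea, assuming 3 lags and 9 features as in the slides (the values are arbitrary demo data):

```python
import numpy as np

n_samples, n_steps, n_features = 5, 3, 9

# Flat table: columns already sorted by feature (f0_t0, f0_t1, f0_t2, f1_t0, ...)
flat = np.arange(n_samples * n_steps * n_features, dtype=float)
flat = flat.reshape(n_samples, n_steps * n_features)

# The tensor the network actually consumes: one (steps x features) block per sample
tensor = flat.reshape(n_samples, n_features, n_steps).transpose(0, 2, 1)

# Verify we got the desired dimensions, as the slides suggest
print(tensor.shape)  # (5, 3, 9)
```

This also shows why the column sorting on the previous slide matters: with a different column order the same reshape would interleave features and lags incorrectly.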
86. The KNIME platform – Deep Learning in Knime
Finally we need to add the target column
back before training the learner
All other steps are the same
as in the FFN workflow!
87. outline
From “we’re new to Knime” to “we’ve pushed Knime to the edge”
Deep Learning within Knime:
• Constructing the network
• Preparing the data
• Training a learner
• Saving the network and weights
• Predicting new data
Other network types
LSTMs using knime
CNNs using knime
Knime forums (you should really write a blogpost on this part of the lesson)
88. The KNIME platform – Deep Learning in Knime
We will define the new network architecture
using 1D convolution layer nodes, add layers,
and dropout layers
Same additional preprocessing
as for using LSTMs
89. The KNIME platform – Deep Learning in Knime
Select the number of filters to train
Select the filter (kernel) size
Select the stride (step between successive filter positions)
Is padding needed? same => yes; valid => no padding
Creates a skip connection within the graph
* Verify that all input tensors
have the same dimensions
Randomly drops a fraction ({drop rate}) of the neurons during training
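How kernel size, stride, and padding interact can be made concrete with the standard 1D-convolution output-length formulas (these match the Keras conventions for "same" and "valid" padding; the helper function name is ours):

```python
import math

def conv1d_output_length(n, kernel, stride, padding):
    """Output length of a 1D convolution over an input of length n."""
    if padding == "same":     # pad so that output length = ceil(n / stride)
        return math.ceil(n / stride)
    if padding == "valid":    # no padding: filter must fit entirely inside the input
        return math.ceil((n - kernel + 1) / stride)
    raise ValueError(padding)

print(conv1d_output_length(9, 3, 1, "same"))   # 9
print(conv1d_output_length(9, 3, 1, "valid"))  # 7
print(conv1d_output_length(9, 3, 2, "valid"))  # 4
```

Matching output lengths is exactly the “verify that all input tensors have the same dimensions” check needed before an add (skip) layer.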
90. Deep Learning
You have just completed your first
“Deep Learning for multivariate time
series modeling with KNIME”!
Of course, this is merely the
beginning of the journey…
But now comes the great part:
apply everything you have learned to your daily work: you'd be amazed at the things
you can accomplish with your newly acquired skills
91. This content is shared in the
hope of boosting the learning process for
those making their first steps with
the KNIME platform.
Feel free to share your experience and
comments on using it (good or bad)
Mail: nathaniel@post.bgu.ac.il
Or via my LinkedIn page
Editor's Notes
Let's read some data from a file
Now let's save this data to another file
From the file we have read, filter only those rows that hold a specific value, hold a range of values, do not meet a specific condition, or meet a set of conditions
Let's explore the data types we get for each column (convert between different types)
For some of the columns we have undesired values; let's replace them (note that the data type is not changed automatically)
For some of the rows we have missing values – let's first exclude them; sometimes we wouldn't want to omit rows with missing values but rather fill them with values, which is known as data imputation
In some cases it can be a good idea to look at summarized data per category – like in Excel, we call that pivoting – and we can also perform unpivoting
Now rename the column to… XXX
Let's create a new column that holds x-y, x/y,
Let's create a new column by splitting column x
Let's create a new column by combining two or more columns
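Outside KNIME, the same filter/drop/impute steps can be sketched with pandas (the columns and values below are invented demo data loosely echoing the churn dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Geography": ["France", "Spain", "France", None],
    "Age": [42, None, 39, 51],
})

# Row filtering: keep only rows meeting a condition
french = df[df["Geography"] == "France"]

# Missing values: either drop the affected rows...
dropped = df.dropna()

# ...or impute them with a statistic / fixed value (data imputation)
imputed = df.fillna({"Age": df["Age"].mean(), "Geography": "unknown"})

print(len(french), len(dropped))     # 2 2
print(imputed["Age"].isna().sum())   # 0
```

Note that, as in KNIME, replacing values does not change a column's type by itself; conversions are an explicit step (`astype` in pandas).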
Let's start with basic exploration of our data
Plot the histogram of column x
Plot a bar plot of the number of occurrences of value x per category
For each of the features in our data, create a scatter plot with the target on the y axis and the feature on the x axis
Also answer: what is the number of unique values for this feature?
What are the min, max, mean (average), median, and percentiles (10, 20, …, 90)?
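The summary statistics asked for here are a one-liner each in numpy; a small sketch on an invented demo column:

```python
import numpy as np

age = np.array([42, 41, 39, 51, 44, 38, 58, 29, 35, 45])  # arbitrary demo column

print("unique values:", len(np.unique(age)))              # 10
print("min/max:", age.min(), age.max())                   # 29 58
print("mean:", age.mean(), "median:", np.median(age))     # 42.2 41.5
print("percentiles:", np.percentile(age, range(10, 100, 10)))
```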
Now that we know how our data looks, let's create our very first ML model – a logistic regression model
Before we do, we need to decide how we will know that our model is good for unseen data (use partitioning)
Split the data into 70% train and 30% validation:
Use random splitting
Use selection of the first 70% for training
Use selection by another criterion (stratification)
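What KNIME's Partitioning node does with stratified sampling can be sketched in plain Python: split each class separately so the 70/30 ratio holds per class, not just overall. A minimal illustration (the function and data are ours, for demonstration only):

```python
import random
from collections import defaultdict

def stratified_split(rows, labels, train_frac=0.7, seed=42):
    """70/30 split that preserves the class ratio of `labels` (stratification)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, valid_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                     # random splitting within each class
        cut = round(len(idx) * train_frac)
        train_idx += idx[:cut]
        valid_idx += idx[cut:]
    return [rows[i] for i in train_idx], [rows[i] for i in valid_idx]

rows = list(range(100))
labels = [0] * 80 + [1] * 20                 # imbalanced target, e.g. churn
train, valid = stratified_split(rows, labels)
print(len(train), len(valid))                # 70 30
```

For an imbalanced target like churn this matters: a purely random split could leave the validation set with too few positives to evaluate on.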