Practicing Data Science - Asking for Directions in an AI Project

© 2019 KNIME AG. All rights reserved.
Practicing Data Science
KNIME: Rosaria.Silipo@knime.com
@KNIME
Asking for Directions in an AI Project


Introduction
This webinar collects the answers to the questions I get every time I start a new data science project.

The Standard DS Process
https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

The Training Process as a Workflow

How Standard is the Standard Process?
• Data Preparation
– Data types (structured vs. unstructured)
– Weird distributions (rare and infrequent classes)
– Model-dependent transformations
• Machine Learning model
– Model yes/no
– Which problem?
– Which model for which problem?
• Deployment
– Reports, dashboards, REST, or just save to DB?
– Scalability
The standard data mining process is not very standard

Do I need to train a ML model?
Sometimes a picture is better than 1000 words
[Figures: Customer Description, Money vs. Loyalty, User Behaviour, Energy Consumption]
Sometimes we only need KPI measures.
[Figures: Clickstream Analysis, Multiple Aggregations]
Sometimes we only need a Data Warehouse
[Figure: data sources and DBs, ETL steps, a DWH, and Business Units]
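The "KPI measures" and "Multiple Aggregations" case can be sketched without any ML model. The snippet below is only an illustration: the clickstream records, field layout, and the `kpis_per_user` helper are hypothetical and not taken from the webinar.

```python
from collections import defaultdict

# Hypothetical clickstream records: (user, page, seconds_on_page).
clicks = [
    ("anna", "home", 12),
    ("anna", "pricing", 40),
    ("ben", "home", 5),
    ("ben", "docs", 65),
    ("ben", "pricing", 20),
]


def kpis_per_user(records):
    """Aggregate page views and total time on site per user."""
    totals = defaultdict(lambda: {"page_views": 0, "seconds": 0})
    for user, _page, seconds in records:
        totals[user]["page_views"] += 1
        totals[user]["seconds"] += seconds
    return dict(totals)


kpis = kpis_per_user(clicks)
print(kpis)
```

If aggregations like these already answer the business question, no model training is needed.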

Classification or Number Prediction?
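Classification: classes, e.g. Red, Blonde, Brown, Black
Number Prediction: a numeric target, e.g. energy usage (kWh)
[Figure: energy usage over time, from now to Wed 12:00]
Binning / Discretization: turns a numeric target into classes
deep learning network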
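Binning can be sketched as follows. This is a minimal illustration, assuming two made-up bin edges in kWh and three made-up class labels; none of these values come from the slides.

```python
from bisect import bisect_right

# Hypothetical bin edges (kWh) and class labels, for illustration only.
EDGES = [10.0, 30.0]
LABELS = ["low", "medium", "high"]


def to_class(kwh):
    """Map a numeric energy reading to a class label by binning."""
    # bisect_right counts how many edges are <= kwh, which is the bin index.
    return LABELS[bisect_right(EDGES, kwh)]


readings = [4.2, 10.0, 29.9, 55.0]
classes = [to_class(r) for r in readings]
print(classes)
```

After this step the numeric target has become a class label, so a classification algorithm can be used instead of a number prediction algorithm.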

Number Prediction or Time Series Analysis?
Linear Regression: y from x1, x2, ..., xn
Time Series Prediction: x(t) from past x(t-1) ... x(t-n)
[Figure: original vs. predicted values over time]
Make sure that the future does not mix with the past in data partitioning
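One way to keep the future out of the past is to split by time order instead of at random. The sketch below assumes rows of the form (timestamp, value); the helper names `time_ordered_split` and `lagged` and the toy series are hypothetical.

```python
def time_ordered_split(rows, train_fraction=0.8):
    """Split (timestamp, value) rows so that no test row precedes a training row."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def lagged(values, n_lags):
    """Build (past window, next value) pairs: x(t) from x(t-n) ... x(t-1)."""
    return [(values[i - n_lags:i], values[i]) for i in range(n_lags, len(values))]


# Toy series: timestamps 0..9 with made-up values.
series = [(t, float(t * t)) for t in range(10)]
train, test = time_ordered_split(series)

# Lagged training pairs are built from the training part only.
pairs = lagged([v for _, v in train], n_lags=3)
print(len(train), len(test), pairs[0])
```

A random shuffle before partitioning would place later observations in the training set and earlier ones in the test set, which makes the evaluation look better than it is.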

Supervised vs. Unsupervised ML Algorithms
Supervised: Labelled Training Set with inputs x1, x2, x3, ..., xn and a target (class or y)
Unsupervised: Unlabelled Training Set with inputs x1, x2, x3, ..., xn only
DBSCAN
Fuzzy c-Means
Hierarchical clustering
Active Learning

Unevenly Distributed, Infrequent, and Rare Classes
Unevenly distributed, Infrequent, Rare (anomaly)
[Figure labels: auto-encoder, numerical prediction, clustering, distance]
Training only on “normal” data
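The idea of training only on “normal” data and then measuring a distance can be sketched as below. Note the swap: instead of an auto-encoder, this illustration uses a much simpler profile (per-feature mean and standard deviation) and a standardized Euclidean distance. The data and the threshold are hypothetical.

```python
import math


def fit_normal(rows):
    """Estimate per-feature mean and standard deviation from normal rows only."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = [
        # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
        math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n) or 1.0
        for d in range(dims)
    ]
    return means, stds


def distance(row, means, stds):
    """Standardized Euclidean distance of a row from the normal profile."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, stds)))


# Made-up "normal" observations; no anomalies are needed for training.
normal = [(1.0, 10.0), (1.2, 11.0), (0.8, 9.0), (1.0, 10.0)]
means, stds = fit_normal(normal)

THRESHOLD = 3.0  # hypothetical cut-off, to be tuned on real data


def is_anomaly(row):
    """Flag a row whose distance from the normal profile exceeds the threshold."""
    return distance(row, means, stds) > THRESHOLD


print(is_anomaly((1.1, 10.5)), is_anomaly((5.0, 30.0)))
```

The same pattern applies when the profile is a trained model: fit on normal data, compute the distance between expected and observed values, and raise an alarm above a threshold.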

Structured Data vs. Unstructured Data
Structured Data vs. Unstructured Data
Unstructured: Text, Images, Networks
Text / Image / Network / Chemistry Extension
To numbers
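As one possible reading of "To numbers" for text, the sketch below turns documents into word-count vectors (a bag of words). It is plain Python, not the KNIME Text Processing extension, and the documents and helper names are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the sorted set of lowercase words across all documents."""
    return sorted({w for doc in documents for w in doc.lower().split()})


def to_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]


docs = ["good service good price", "bad service"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Once every document is a numeric vector, the same ML algorithms used for structured data can be applied.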

The Deployment Process as a Workflow

Deployment: REST API, Shiny Dashboards, plain Background Execution
Your workflow as ...
... a REST API
... Guided Application

Scalability: Spark, Parallel Execution, in-DB Processing
Spark
Parallel Execution on Server
In-database processing

Summary
• Is the standard DS process so standard?
• Do I need a ML model?
• Training
– Classification or Number Prediction?
– Number Prediction or Time Series Analysis?
– Supervised or Unsupervised Learning?
– Unevenly Distributed, Infrequent, and Rare Classes
– Structured vs. Unstructured Data
• Deployment
– REST API, Dashboards, Background Execution
– Scalability Options

KNIME Spring Summit 2019
March 18 – 22 at bcc Berlin Congress Center, Berlin
• Monday & Tuesday: One-day courses
• Wednesday & Thursday: Summit sessions
• Friday: Workshops
Use the code WEBINAR-20 for 20% off tickets!
Register at knime.com/spring-summit2019

Free Copy of “Practicing Data Science” e-Book from KNIME Press
https://www.knime.com/knimepress
with this code: PDS-WEBINAR-0319

The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME.com AG under license from KNIME GmbH,
and are registered in the United States. KNIME® is also registered in Germany.
Thank You!

Let’s unroll it!
It always starts with some data …
• Data Preparation
– Data Manipulation
– Data Blending
– Missing Values Handling
– Feature Generation
– Dimensionality Reduction
– Feature Selection
– Outlier Removal
– Normalization
– Partitioning
– …
• Model Training
– Model Training
– Bag of Models
– Model Selection
– Ensemble Models
– Own Ensemble Model
– External Models
– Import Existing Models
– Model Factory
– …
• Model Optimization
– Parameter Tuning
– Parameter Optimization
– Regularization
– Model Size
– No. Iterations
– …
• Model Evaluation
– Performance Measures
– Accuracy
– ROC Curve
– Cross-Validation
– …
• Deployment
– Files & DBs
– Dashboards
– REST API
– SQL Code Export
– Reporting
– …

The many Lives of a Dataset
Stages: Data Preparation, Model Training, Model Optimization, Model Evaluation, Deployment
Partitioning:
• Training Set
• Validation Set
• Test Set
[Figure: the Original Data Set with Past Observations is split into Training Set, Validation Set, and Test Set; New Data from Real World Applications arrives later]
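The three-way partitioning can be sketched as below. The 60/20/20 proportions, the fixed seed, and the `partition` helper are assumptions made for illustration; the slides do not prescribe them.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # the remainder becomes the test set
    )


data = list(range(100))  # stand-in for the original data set
train_set, validation_set, test_set = partition(data)
print(len(train_set), len(validation_set), len(test_set))
```

Random shuffling suits data without a time dimension. For time series, the split should follow time order so that the future does not mix with the past.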

Data Exploration
• Data Understanding is a Data Exploration phase
• The Data Exploration phase is useful to get to know the data
• KNIME offers a few visualization nodes to build dashboards to explore data
