Machine Learning in Science and Industry
Day 1, lectures 1 & 2
Alex Rogozhnikov, Tatiana Likhomanenko
Heidelberg, GradDays 2017
About authors
Alex Rogozhnikov Tatiana Likhomanenko
Yandex researchers (an internet company in Russia, mostly known for its search engine)
applying machine learning in CERN projects for 3 (Alex) and 4 (Tatiana) years
graduated from Yandex School of Data Analysis
1 / 108
Intro notes
a 4-day course
but we'll cover the most significant algorithms of machine learning
2–2.5 hours for the lecture, then an optional practical session
we may run a bit behind schedule, but we'll try to fit
all materials are published in repository:
https://github.com/yandexdataschool/MLatGradDays
we have a challenge!
know material? Spend more time on challenge!
you can participate in teams of two
use chat for questions on challenge
2 / 108
What is Machine Learning about?
a method of teaching computers to make and improve predictions or behaviors based
on some data?
a field of computer science, probability theory, and optimization theory which allows
complex tasks to be solved for which a logical/procedural approach would not be
possible or feasible?
a type of AI that provides computers with the ability to learn without being explicitly
programmed?
something in between statistics, AI, optimization theory, signal processing and pattern
matching?
3 / 108
What is Machine Learning about
Inference of statistical dependencies which give us ability to predict
4 / 108
What is Machine Learning about
Inference of statistical dependencies which give us ability to predict
Data is cheap, knowledge is precious
5 / 108
Machine Learning is used in
search engines
spam detection
security: virus detection, DDOS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
handwriting recognition
opinion mining
6 / 108
Machine Learning is used in
search engines
spam detection
security: virus detection, DDOS defense
computer vision and speech recognition
market basket analysis, customer relationship management (CRM), churn prediction
credit scoring / insurance scoring, fraud detection
health monitoring
traffic jam prediction, self-driving cars
advertisement systems / recommendation systems / news clustering
handwriting recognition
opinion mining
and hundreds more
7 / 108
Machine Learning in science
In particle physics
High-level triggers
Particle identification
Tracking
Tagging
High-level analysis
star / galaxy / quasar classification in astronomy
analysis of gene expression
brain-computer interfaces
weather forecasting
8 / 108
Machine Learning in science
In particle physics
High-level triggers
Particle identification
Tracking
Tagging
High-level analysis
star / galaxy / quasar classification in astronomy
analysis of gene expression
brain-computer interfaces
weather forecasting
In the applications mentioned, different data is used and different information is inferred, but the
underlying ideas are quite similar.
9 / 108
General notion
In supervised learning the training data is represented as a set of pairs
x_i, y_i
i is an index of an observation
x_i is a vector of features available for an observation
y_i is a target — the value we need to predict
features = observables = variables
observations = samples = events
10 / 108
Classification problem
y_i ∈ Y, where Y is a finite set of labels.
Examples
particle identification based on information about a track:
x_i = (p, η, E, charge, χ², ...)
Y = {electron, muon, pion, ... }
binary classification:
Y = {0, 1} — 1 is signal, 0 is background
11 / 108
Regression problem
y ∈ R
Examples:
predicting the price of a house by its position
predicting money income / number of customers of a shop
reconstructing real momentum of a particle
12 / 108
Regression problem
y ∈ R
Examples:
predicting the price of a house by its position
predicting money income / number of customers of a shop
reconstructing real momentum of a particle
Why do we need automatic classification/regression?
in applications up to thousands of features
higher quality
much faster adaptation to new problems
13 / 108
Classification based on nearest neighbours
Given a training set of objects and their labels {x_i, y_i}, we predict the label for a new
observation x:
ŷ = y_j,   j = arg min_i ρ(x, x_i)
Here and after, ρ(x, x̃) is the distance in the space of features.
14 / 108
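Below is a minimal sketch (not part of the slides) of the nearest-neighbour rule above, using scikit-learn's KNeighborsClassifier; the tiny dataset is made up purely for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy training set {x_i, y_i}; with n_neighbors=1 the prediction is y_j for j = argmin_i rho(x, x_i)
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array([0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
print(clf.predict([[0.1, 0.2]]))   # closest training point has label 0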
Visualization of decision rule
Consider a classification problem with 2 features:
x_i = (x_i1, x_i2),   y_i ∈ Y = {0, 1}
15 / 108
k Nearest Neighbours (kNN)
We can use k neighbours to compute probabilities to belong to each class:
p_ỹ(x) = (# of kNN of x in class ỹ) / k
Question: why may this be a good idea?
16 / 108
17 / 108
k = 1, 2, 5, 30
18 / 108
Overfitting
Question: how accurate is classification on training dataset when k = 1?
19 / 108
Overfitting
Question: how accurate is classification on training dataset when k = 1?
answer: it is ideal (closest neighbor is observation itself)
quality is lower when k > 1
20 / 108
Overfitting
Question: how accurate is classification on training dataset when k = 1?
answer: it is ideal (closest neighbor is observation itself)
quality is lower when k > 1
this doesn't mean k = 1 is the best,
it means we cannot use training observations to
estimate quality
when a classifier's decision rule is too complex and
captures details from the training data that are not relevant
to the distribution, we call this overfitting
21 / 108
Model selection
Given two models, which one should we select?
ML is about inference of statistical dependencies, which give us ability to predict
22 / 108
Model selection
Given two models, which one should we select?
ML is about inference of statistical dependencies, which give us ability to predict
The best model is the model which gives better predictions for new observations.
The simplest way to control this is to check quality on a holdout — a sample not used during
training (cross-validation). This gives an unbiased estimate of quality on new data.
estimates have variance
multiple testing introduces bias (solution: train + validation + test, like Kaggle)
very important factor if the amount of data is small
there are more approaches to cross-validation
23 / 108
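As a minimal sketch of the holdout idea (assuming scikit-learn and a synthetic dataset, both my choice for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# the holdout (test) part is never used during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('train accuracy:  ', accuracy_score(y_train, clf.predict(X_train)))  # optimistic
print('holdout accuracy:', accuracy_score(y_test, clf.predict(X_test)))    # estimate for new data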
Regression using kNN
Regression with nearest neighbours is done by averaging the output of neighbours
Model prediction:
ŷ(x) = (1/k) Σ_{j∈knn(x)} y_j
24 / 108
kNN with weights
Average neighbours' output with weights:
ŷ = Σ_{j∈knn(x)} w_j y_j / Σ_{j∈knn(x)} w_j
the closer the neighbour, the higher the weight
of its contribution, e.g.:
w_j = 1/ρ(x, x_j)
25 / 108
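A minimal sketch of weighted kNN regression; scikit-learn's weights='distance' corresponds to w_j = 1/ρ(x, x_j), and the sine data is synthetic, for illustration only.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

# neighbours' outputs are averaged with weights inversely proportional to distance
reg = KNeighborsRegressor(n_neighbors=5, weights='distance').fit(X, y)
print(reg.predict([[2.5]]))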
Computational complexity
Given that dimensionality of space is d and there are n training samples:
training time: ~ O(save a link to the data)
prediction time: n × d for each observation
Prediction is extremely slow.
To resolve this, one can build an index-like structure to select neighbours faster.
26 / 108
Spatial index: ball tree
training time
~ O(d × n log(n))
prediction time
~ log(n) × d
for each observation
27 / 108
Spatial index: ball tree
training time
~ O(d × n log(n))
prediction time
~ log(n) × d
for each observation
Other options exist:
KD-tree
FLANN
FALCONN
IMI, NO-IMI
Those are very useful for searching over large databases with high-dimensional items.
28 / 108
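A minimal sketch of building and querying a ball tree with scikit-learn (one of the spatial indices above); the random data and k=5 are arbitrary illustration choices.

import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.RandomState(0)
X = rng.rand(10000, 5)                # n training samples, d features

tree = BallTree(X)                    # built once, roughly O(d * n log n)
dist, ind = tree.query(X[:3], k=5)    # 5 nearest neighbours of the first 3 points
print(ind)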
Overview of kNN
Awesomely simple classifier and regressor
Provides too optimistic quality on training data
Quite slow, though optimizations exist
Too sensitive to scale of features
Doesn't provide good results with data of high dimensions
29 / 108
Sensitivity to scale of features
Euclidean distance:
ρ(x, x̃)² = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)²
30 / 108
Sensitivity to scale of features
Euclidean distance:
ρ(x, x̃)² = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)²
Change the scale of the first feature:
ρ(x, x̃)² = (10x_1 − 10x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)² ∼ 100(x_1 − x̃_1)²
(assuming that x_1, x_2, ..., x_d are of the same order)
31 / 108
Sensitivity to scale of features
Euclidean distance:
ρ(x, x̃)² = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)²
Change the scale of the first feature:
ρ(x, x̃)² = (10x_1 − 10x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)² ∼ 100(x_1 − x̃_1)²
(assuming that x_1, x_2, ..., x_d are of the same order)
Scaling of features frequently increases quality.
Otherwise the contribution from features with a large scale dominates.
32 / 108
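A minimal sketch of the effect of feature scaling on kNN, assuming scikit-learn; the synthetic data and the artificial x100 blow-up of one feature are for illustration only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 100.0                      # one feature now dominates the Euclidean distance
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)
print('without scaling:', raw.score(X_test, y_test))
print('with scaling:   ', scaled.score(X_test, y_test))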
Distance function matters
Minkowski distance
ρ(x, x̃)^p = Σ_l (x_l − x̃_l)^p
Canberra
ρ(x, x̃) = Σ_l ∣x_l − x̃_l∣ / (∣x_l∣ + ∣x̃_l∣)
Cosine metric
ρ(x, x̃) = ⟨x, x̃⟩ / (∣x∣ ∣x̃∣)
33 / 108
Problems with high dimensions
With higher dimensions d ≫ 1 the neighbouring
points are further away.
Example: consider n training data points
distributed uniformly in the unit cube:
the expected number of points in a ball of radius r is
proportional to r^d
to collect the same number of neighbours, we need
to take r = const^(1/d) → 1
kNN suffers from the curse of dimensionality.
34 / 108
Nearest neighbours applications
Classification and regression based on kNN are used quite rarely, but nearest neighbours
search is very widespread, used when you need to find something similar in a database.
finding documents and images
finding similar proteins by using 3d-structure
search in chemical databases
person identification by photo or by speech
user identification by behavior on the web
Similarity (distance) is very important and isn't defined trivially in most cases.
35 / 108
Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue and
background is red)
Which classifier provides better discrimination?
36 / 108
Measuring quality of binary classification
The classifier's output in binary classification is a real-valued variable (say, signal is blue and
background is red)
Which classifier provides better discrimination?
Discrimination is identical in all three cases
37 / 108
ROC curve demonstration
38 / 108
ROC curve
39 / 108
ROC curve
These distributions have the same ROC curve:
(the ROC curve is the dependence of passed signal
vs passed background)
40 / 108
ROC curve
Defined only for binary classification
Contains important information:
all possible combinations of signal and background efficiencies you may achieve by setting
a threshold
Particular values of thresholds (and initial pdfs) don't matter, ROC curve doesn't contain
this information
ROC curve = information about order of observations' predictions:
b b s b s b ... s s b s s
Comparison of algorithms should be based on the information from ROC curve.
41 / 108
ROC AUC (area under the ROC curve)
ROC AUC = P(r_b < r_s)
where r_b, r_s are predictions of random
background and signal observations.
42 / 108
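A minimal sketch of computing a ROC curve and ROC AUC with scikit-learn; the labels and scores below are made up for illustration.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])                      # 1 = signal, 0 = background
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5, 0.6, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, scores)   # passed background vs passed signal, per threshold
print('ROC AUC =', roc_auc_score(y_true, scores))  # equals P(r_b < r_s), up to ties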
Classifiers have the same ROC AUC, but
which is better ...
if the problem is spam detection?
Spam is class 0, non-spam is class 1
(putting a normal letter in spam is very
undesirable)
for a background rejection system at the
LHC?
Background is class 0.
(we need to pass very little background)
43 / 108
Classifiers have the same ROC AUC, but
which is better ...
if the problem is spam detection?
Spam is class 0, non-spam is class 1
(putting a normal letter in spam is very
undesirable)
for a background rejection system at the
LHC?
Background is class 0.
(we need to pass very little background)
Applications frequently demand a different
metric.
44 / 108
Statistical Machine Learning
The machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probability distribution:
p(x, y)
Does there really exist a distribution of people / pages / texts?
Same question for stars / proteins / viruses?
45 / 108
Statistical Machine Learning
The machine learning we use in practice is based on statistics
Main assumption: the data is generated from a probability distribution:
p(x, y)
Does there really exist a distribution of people / pages / texts?
Same question for stars / proteins / viruses?
In HEP these distributions do exist
The statistical framework turned out to be quite helpful for building models.
46 / 108
Optimal regression
Assuming that we know the real distribution p(x, y) and the goal is to minimize the squared error:
L = E(y − ŷ(x))² → min
The optimal prediction is the marginalized average
ŷ(x_0) = E(y ∣ x = x_0)
47 / 108
Optimal classification. Bayes optimal classifier
Assuming that we know the real distribution p(x, y), we reconstruct p(y∣x) using Bayes' rule:
p(y∣x) = p(x, y) / p(x) = p(y) p(x∣y) / p(x)
p(y = 0 ∣ x) / p(y = 1 ∣ x) = [p(y = 0) p(x ∣ y = 0)] / [p(y = 1) p(x ∣ y = 1)]
48 / 108
Optimal classification. Bayes optimal classifier
Assuming that we know the real distribution p(x, y), we reconstruct p(y∣x) using Bayes' rule:
p(y∣x) = p(x, y) / p(x) = p(y) p(x∣y) / p(x)
p(y = 0 ∣ x) / p(y = 1 ∣ x) = [p(y = 0) p(x ∣ y = 0)] / [p(y = 1) p(x ∣ y = 1)]
Lemma (Neyman–Pearson):
The best classification quality is provided by the ratio p(y = 0 ∣ x) / p(y = 1 ∣ x) (Bayes optimal classifier)
49 / 108
Optimal Binary Classification
Bayes optimal classifier has the highest possible ROC curve.
Since the classification quality depends only on order, p(y = 1 ∣ x) gives optimal
classification quality too!
p(y = 0 ∣ x) / p(y = 1 ∣ x) = [p(y = 0) / p(y = 1)] × [p(x ∣ y = 0) / p(x ∣ y = 1)]
50 / 108
Optimal Binary Classification
Bayes optimal classifier has the highest possible ROC curve.
Since the classification quality depends only on order, p(y = 1 ∣ x) gives optimal
classification quality too!
p(y = 0 ∣ x) / p(y = 1 ∣ x) = [p(y = 0) / p(y = 1)] × [p(x ∣ y = 0) / p(x ∣ y = 1)]
How can we estimate the terms of this expression from data?
p(y = 1), p(y = 0) ?
51 / 108
Optimal Binary Classification
Bayes optimal classifier has the highest possible ROC curve.
Since the classification quality depends only on order, p(y = 1 ∣ x) gives optimal
classification quality too!
p(y = 0 ∣ x) / p(y = 1 ∣ x) = [p(y = 0) / p(y = 1)] × [p(x ∣ y = 0) / p(x ∣ y = 1)]
How can we estimate the terms of this expression from data?
p(y = 1), p(y = 0) ?
p(x ∣ y = 1), p(x ∣ y = 0) ?
52 / 108
Histograms density estimation
Counting the number of samples in each bin and normalizing.
fast
choice of binning is crucial
number of bins grows exponentially → curse of dimensionality
53 / 108
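A minimal sketch of a histogram density estimate with NumPy; the sample and the 30-bin choice are arbitrary, for illustration only.

import numpy as np

x = np.random.normal(size=1000)
density, edges = np.histogram(x, bins=30, density=True)  # bin counts normalized to a pdf
print(density[:5])
print(edges[:5])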
Kernel density estimation
f(x) = (1 / (nh)) Σ_i K((x − x_i) / h)
K(x) is the kernel, h is the bandwidth
Typically, a Gaussian kernel is used,
but there are many others.
The approach is very close to weighted
kNN.
54 / 108
Kernel density estimation
bandwidth selection
Silverman's rule of thumb:
h = σ̂ (4 / (3n))^(1/5)
σ̂ is the empirical standard deviation
55 / 108
Kernel density estimation
bandwidth selection
Silverman's rule of thumb:
h = σ̂ (4 / (3n))^(1/5)
σ̂ is the empirical standard deviation
may be irrelevant if the data is far from being
Gaussian
56 / 108
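A minimal sketch of Gaussian kernel density estimation, with the bandwidth set by Silverman's rule of thumb from the slide; scikit-learn's KernelDensity and the synthetic sample are illustration choices.

import numpy as np
from sklearn.neighbors import KernelDensity

x = np.random.normal(size=1000).reshape(-1, 1)
sigma = x.std()
h = sigma * (4.0 / (3.0 * len(x))) ** (1.0 / 5.0)       # Silverman's rule of thumb

kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(x)
grid = np.linspace(-4, 4, 9).reshape(-1, 1)
print(np.exp(kde.score_samples(grid)))                  # score_samples returns log-density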
n-minutes break
57 / 108
Recapitulation
1. Statistical ML: applications and problems
2. k nearest neighbours classifier and regressor
distance function
overfitting and model selection
speeding up neighbours search
3. Binary classification: ROC curve, ROC AUC
4. Optimal regression and optimal classification
5. Density estimation
histograms
kernel density estimation
58 / 108
Parametric density estimation
Family of density functions: f(x; θ).
Problem: estimate parameters of a Gaussian distribution.
f(x; μ, Σ) = 1 / ((2π)^(d/2) ∣Σ∣^(1/2)) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
59 / 108
QDA (Quadratic discriminant analysis)
Reconstructing the probabilities p(x ∣ y = 1), p(x ∣ y = 0) from data, assuming those are
multidimensional normal distributions:
p(x ∣ y = 0) ∼ N(μ_0, Σ_0)
p(x ∣ y = 1) ∼ N(μ_1, Σ_1)
p(y = 0 ∣ x) / p(y = 1 ∣ x) = [p(y = 0) / p(y = 1)] × [p(x ∣ y = 0) / p(x ∣ y = 1)] =
= const × exp( −½ (x − μ_0)ᵀ Σ_0⁻¹ (x − μ_0) ) / exp( −½ (x − μ_1)ᵀ Σ_1⁻¹ (x − μ_1) ) =
= exp( −½ (x − μ_0)ᵀ Σ_0⁻¹ (x − μ_0) + ½ (x − μ_1)ᵀ Σ_1⁻¹ (x − μ_1) + const )
(here const absorbs the prior ratio n_0 / n_1 and the normalization factors)
60 / 108
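A minimal sketch of QDA with scikit-learn: class-wise Gaussians are fitted and the ratio above is used for prediction; the dataset is synthetic, for illustration only.

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.predict_proba(X[:3]))   # p(y | x) for the first three observations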
61 / 108
QDA computational complexity
n samples, d dimensions
training consists of fitting p(x ∣ y = 0) and p(x ∣ y = 1) and takes O(nd² + d³):
computing the covariance matrix is O(nd²)
inverting the covariance matrix is O(d³)
prediction takes O(d²) for each sample, spent on computing the dot products
62 / 108
QDA overview
simple analytical formula for prediction
fast prediction
many parameters to reconstruct in high dimensions
thus, the estimates may become unreliable
data almost never has a Gaussian distribution
63 / 108
Gaussian mixtures for density estimation
Mixture of distributions:
f(x) = Σ_{c ∈ components} π_c f_c(x; θ_c),    Σ_{c ∈ components} π_c = 1
Mixture of Gaussian distributions:
f(x) = Σ_{c ∈ components} π_c f(x; μ_c, Σ_c)
Parameters to be found: π_1, ..., π_C, μ_1, ..., μ_C, Σ_1, ..., Σ_C
64 / 108
65 / 108
Gaussian mixtures: finding parameters
Criterion is maximizing the likelihood (using MLE to find optimal parameters)
Σ_i log f(x_i; θ) → max_θ
no analytic solution
we can use general-purpose optimization methods
66 / 108
Gaussian mixtures: finding parameters
Criterion is maximizing the likelihood (using MLE to find optimal parameters)
Σ_i log f(x_i; θ) → max_θ
no analytic solution
we can use general-purpose optimization methods
In mixtures the parameters split into two groups:
θ_1, ..., θ_C — parameters of components
π_1, ..., π_C — contributions of components
67 / 108
Expectation-Maximization algorithm [Dempster et al., 1977]
Idea: introduce a set of hidden variables π_c(x)
Expectation: π_c(x) = p(x ∈ c) = π_c f_c(x; θ_c) / Σ_c̃ π_c̃ f_c̃(x; θ_c̃)
Maximization:
π_c = Σ_i π_c(x_i)
θ_c = arg max_θ Σ_i π_c(x_i) log f_c(x_i, θ_c)
The maximization step is trivial for Gaussian distributions.
The EM algorithm is more stable and has good convergence properties.
68 / 108
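A minimal sketch of fitting a Gaussian mixture (scikit-learn's GaussianMixture runs EM internally); the two-component 1D sample is synthetic, for illustration only.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(-2.0, 1.0, size=(300, 1)),
                    rng.normal(+3.0, 0.5, size=(200, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print('pi_c:   ', gmm.weights_)        # component contributions
print('mu_c:   ', gmm.means_.ravel())
print('Sigma_c:', gmm.covariances_.ravel())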
EM algorithm
69 / 108
EM algorithm
70 / 108
Density estimation applications
parametric fits in HEP
specifically, mixtures of signal + background
outlier detection
if a new observation has low probability w.r.t. the current
distribution, it is considered an outlier
photometric selection of quasars
Gaussian mixtures are used as a smooth version of
clustering (discussed later)
71 / 108
Clustering (unsupervised learning)
Gaussian mixtures can be considered a smooth form of clustering.
However, in some cases clusters have to be defined strictly (that is, each observation can be
assigned to only one cluster).
72 / 108
Demo of k-means and DBSCAN clusterings
(shown interactively)
73 / 108
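Since the demo is interactive, here is a minimal offline sketch of the two clusterings on synthetic blobs; k, eps and min_samples are arbitrary illustration choices.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # label -1 marks noise points
print(sorted(set(kmeans_labels)), sorted(set(dbscan_labels)))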
OPERA (operated in 2008-2012)
Oscillation Project with Emulsion-tRacking Apparatus
neutrinos sent from CERN to Gran Sasso (INFN, Italy)
looking for neutrino oscillations (Nobel prize 2015)
hence, neutrinos aren't massless
the OPERA detector is placed underground and
consists of 150,000 bricks of 8.3 kg each
a brick consists of interleaved photo emulsion and lead layers
74 / 108
OPERA: brick and scanning
bricks are taken out and scanned only if there are signs of something interesting
one brick = billions of base-tracks
75 / 108
OPERA: Tracking with DBSCAN clustering
76 / 108
OPERA clustering demo
(shown interactively)
different metrics can be used in clustering
the appropriate choice of metric is crucial
77 / 108
Clustering
news clustering in news aggregators
each topic is covered by dozens of media outlets; those are joined into
one group to avoid showing lots of similar results
clustering of surface based on the spectra
(in the picture, 5 spectral channels for each point were taken
from satellite photos)
clustering in calorimeters in HEP
clustering of gamma ray bursts
in astronomy
78 / 108
A classification model based on mixture density estimation is called MDA (mixture
discriminant analysis)
Generative approach
Generative approach: trying to reconstruct p(x, y), then use Bayes classification formula to
predict.
QDA, MDA are generative classifiers.
79 / 108
A classification model based on mixture density estimation is called MDA (mixture
discriminant analysis)
Generative approach
Generative approach: trying to reconstruct p(x, y), then use the Bayes classification formula to
predict.
QDA, MDA are generative classifiers.
Problems of the generative approach
Real-life distributions can hardly be reconstructed
Especially in high-dimensional spaces
So, we switch to the discriminative approach: guessing p(y∣x) directly
80 / 108
Classification: truck vs car
81 / 108
Density estimation requires reconstructing the dependencies between all known parameters,
which requires a lot of data, but most of these dependencies aren't needed in the application.
If we can avoid multidimensional density estimation, we'd better do it.
82 / 108
Naive Bayes classification rule
Assuming that features are independent within each class, one can avoid multidimensional
density fitting:
p(y = 0 ∣ x) / p(y = 1 ∣ x) = [p(y = 0) / p(y = 1)] × [p(x ∣ y = 0) / p(x ∣ y = 1)] = [p(y = 0) / p(y = 1)] × Π_i [p(x_i ∣ y = 0) / p(x_i ∣ y = 1)]
The contribution of each feature is independent, and one-dimensional fitting is much easier.
83 / 108
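A minimal sketch of naive Bayes with scikit-learn: each feature gets its own one-dimensional Gaussian per class, and the per-feature ratios are multiplied as in the formula above; the dataset is synthetic.

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
nb = GaussianNB().fit(X, y)     # one-dimensional Gaussian fits per feature and class
print(nb.predict_proba(X[:3]))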
Application: B-tagging (picture from LHCb)
flavour tagging is important for
estimating CKM-matrix elements
to predict which meson was produced
(B⁰ or B̄⁰), the non-signal part of the
collision is analyzed (tagging)
the signal part (green) provides
information about the decayed state
there are dozens of tracks that aren't
part of this scheme
84 / 108
Inclusive tagging idea
use all (non-signal) tracks from the collision
let the machine guess which tracks are tagging, how their charge is connected to the meson
flavour, and estimate the probability
We can make the naive assumption that information from the tracks is independent:
p(B̄⁰ ∣ tracks) / p(B⁰ ∣ tracks) = [p(B̄⁰) / p(B⁰)] × Π_track [p(track ∣ B̄⁰) / p(track ∣ B⁰)] =
= (p(B⁰) / p(B̄⁰))^(n_tracks − 1) × Π_track [p(B̄⁰ ∣ track) / p(B⁰ ∣ track)]
The number of terms in the product varies (compared to typical Naive Bayes).
The ratio for each track on the right side is estimated with ML techniques discussed later.
This simple technique provides very good results.
85 / 108
Linear decision rule
The decision function is linear:
d(x) = ⟨w, x⟩ + w_0
d(x) > 0 → ŷ = +1
d(x) < 0 → ŷ = −1
for brevity:
ŷ = sgn(d(x))
This is a parametric model (finding parameters w, w_0).
QDA & MDA are parametric as well.
86 / 108
Finding Optimal Parameters
A good initial guess: find such w, w_0 that the classification error is minimal:
L = Σ_{i∈samples} 1_{y_i ≠ ŷ_i},    ŷ_i = sgn(d(x_i))
Notation: 1_true = 1, 1_false = 0.
Discontinuous optimization (arrrrgh!)
87 / 108
Finding Optimal Parameters
A good initial guess: find such w, w_0 that the classification error is minimal:
L = Σ_{i∈samples} 1_{y_i ≠ ŷ_i},    ŷ_i = sgn(d(x_i))
Notation: 1_true = 1, 1_false = 0.
Discontinuous optimization (arrrrgh!)
solution: let's make the decision rule smooth
p(y = +1∣x) = p_+1(x) = f(d(x))
p(y = −1∣x) = p_−1(x) = 1 − p_+1(x)
f(0) = 0.5
f(x) > 0.5 for x > 0
f(x) < 0.5 for x < 0
88 / 108
Logistic function
σ(x) = e^x / (1 + e^x) = 1 / (1 + e^(−x))
Properties
1. monotonic, σ(x) ∈ (0, 1)
2. σ(x) + σ(−x) = 1
3. σ′(x) = σ(x)(1 − σ(x))
4. 2σ(x) = 1 + tanh(x/2)
89 / 108
Logistic regression
Define the probabilities obtained with the logistic function:
p_+1(x) = σ(d(x))
p_−1(x) = σ(−d(x))
with d(x) = ⟨w, x⟩ + w_0
and optimize the log-likelihood:
L = − Σ_{i∈observations} ln(p_{y_i}(x_i)) = Σ_i L(x_i, y_i) → min
90 / 108
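A minimal sketch of logistic regression with scikit-learn: d(x) = ⟨w, x⟩ + w_0 is fitted by minimizing the log-loss above (plus scikit-learn's default regularization); the dataset is synthetic, for illustration only.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
print('w =', clf.coef_.ravel(), ' w0 =', clf.intercept_)
print('log-loss =', log_loss(y, clf.predict_proba(X)))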
Logistic loss (LogLoss)
The term loss refers to what we are minimizing. Losses typically estimate our risks and are
denoted as L.
L = − Σ_{i∈observations} ln(p_{y_i}(x_i)) = Σ_i L(x_i, y_i) → min
LogLoss penalty for a single observation:
L(x_i, y_i) = − ln(p_{y_i}(x_i)) = { ln(1 + e^(−d(x_i))),  y_i = +1;   ln(1 + e^(d(x_i))),  y_i = −1 } = ln(1 + e^(−y_i d(x_i)))
The margin y_i d(x_i) is expected to be high for all observations.
91 / 108
Logistic loss
L(x_i, y_i) is a convex function.
Simple analysis shows that L is a sum
of convex functions w.r.t. w, so
the optimization problem has
at most one optimum.
92 / 108
Visualization of logistic regression
93 / 108
Linear model for regression
How to use a linear function for
regression?
d(x) = ⟨w, x⟩ + w_0
Simplification of notation:
x_0 = 1,   x = (1, x_1, ..., x_d).
d(x) = ⟨w, x⟩
94 / 108
Linear regression (ordinary least squares)
We can use a linear function for regression: d(x) = ⟨w, x⟩
d(x_i) = y_i
This is a linear system with d + 1 variables and n equations.
Minimize OLS aka MSE (mean squared error):
L = Σ_i (d(x_i) − y_i)² → min
Explicit solution: (Σ_i x_i x_iᵀ) w = Σ_i y_i x_i
95 / 108
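A minimal sketch of the explicit solution above, (Σ_i x_i x_iᵀ) w = Σ_i y_i x_i, written in matrix form as (XᵀX) w = Xᵀy with NumPy; the data and true weights are made up for illustration.

import numpy as np

rng = np.random.RandomState(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.rand(n, d)])   # prepend x_0 = 1 for the bias w_0
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.randn(n)

w = np.linalg.solve(X.T @ X, X.T @ y)              # explicit OLS solution
print(w)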
Linear regression
can use some other loss L different from MSE
but no explicit solution in other cases
demonstrates properties of linear models
reliable estimates when n >> d
able to completely fit to the data if n = d
undefined when n < d
96 / 108
Regularization: motivation
When the number of parameters is high (compared to the number of observations)
hard to estimate all parameters reliably
linear regression with MSE:
in d-dimensional space you can find a hyperplane through any d points
non-unique solution if n < d
the matrix Σ_i x_i x_iᵀ degenerates
Solution 1: manually decrease the dimensionality of the problem (select more appropriate
features)
Solution 2: use regularization
97 / 108
Regularization
When the number of parameters in the model is high, overfitting is very probable
Solution: add a regularization term to the loss function:
L = (1/N) Σ_i L(x_i, y_i) + L_reg → min
L_2 regularization: L_reg = α Σ_j ∣w_j∣²
L_1 regularization: L_reg = β Σ_j ∣w_j∣
L_1 + L_2 regularization: L_reg = α Σ_j ∣w_j∣² + β Σ_j ∣w_j∣
98 / 108
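A minimal sketch of L_2 (Ridge) and L_1 (Lasso) regularized linear regression with scikit-learn; the n < d setup, alpha values and sparse true weights are illustration choices.

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 100)                    # n < d: plain least squares is ill-posed
w_true = np.zeros(100)
w_true[:5] = 3.0
y = X @ w_true + 0.1 * rng.randn(50)

ridge = Ridge(alpha=1.0).fit(X, y)        # L2: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)        # L1: many coefficients become exactly zero
print('non-zero coefficients, ridge:', np.sum(ridge.coef_ != 0))
print('non-zero coefficients, lasso:', np.sum(lasso.coef_ != 0))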
L_2, L_1 regularizations
Dependence of parameters (components of w) on the regularization (stronger
regularization to the left)
L_2 regularization; L_1 (solid), L_1 + L_2 (dashed)
99 / 108
Regularizations
L_1 regularization encourages sparsity (many coefficients in w turn to zero)
100 / 108
L_p regularizations
We can consider a more general case:
L_p = Σ_i ∣w_i∣^p
Expression for L_0:   L_0 = Σ_i 1_{w_i ≠ 0}
exactly penalizes the number of non-zero coefficients
but nobody uses L_0. Why?
101 / 108
L_p regularizations
We can consider a more general case:
L_p = Σ_i ∣w_i∣^p
Expression for L_0:   L_0 = Σ_i 1_{w_i ≠ 0}
exactly penalizes the number of non-zero coefficients
but nobody uses L_0. Why?
even L_p with 0 < p < 1 is not used. Why?
102 / 108
Regularization summary
important tool to fight overfitting (= poor generalization on new data)
different modifications exist for other models
makes it possible to handle really many features
machine learning should detect important features itself
from the mathematical point of view:
turns a convex problem into a strongly convex one (NB: only for linear models)
from the practical point of view:
softly limits the space of parameters
breaks the scale-invariance of linear models
103 / 108
n-minutes break
104 / 108
Data Scientist Tools
Experiments in an appropriate high-level language or environment
After the experiments are over — implement the final algorithm in a low-level language (C++,
CUDA, FPGA)
The second point is typically unnecessary
105 / 108
Scientific Python
NumPy
vectorized computations in python
Matplotlib
for drawing
Scikit-learn
most popular library for machine learning
106 / 108
Scientific Python
Scipy
libraries for science and engineering
Root_numpy
a convenient way to work with ROOT files
Astropy
a community Python library for astronomy
107 / 108
108 / 108
More Related Content

What's hot

Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
Machine Learning Algorithms Review(Part 2)
Machine Learning Algorithms Review(Part 2)Machine Learning Algorithms Review(Part 2)
Machine Learning Algorithms Review(Part 2)Zihui Li
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Mostafa G. M. Mostafa
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic netVivian S. Zhang
 
24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada BoostAndres Mendez-Vazquez
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data萍華 楊
 
Kaggle Projects Presentation Sawinder Pal Kaur
Kaggle Projects Presentation Sawinder Pal KaurKaggle Projects Presentation Sawinder Pal Kaur
Kaggle Projects Presentation Sawinder Pal KaurSawinder Pal Kaur
 
Probabilistic PCA, EM, and more
Probabilistic PCA, EM, and moreProbabilistic PCA, EM, and more
Probabilistic PCA, EM, and morehsharmasshare
 
23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature GenerationAndres Mendez-Vazquez
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster ValidityAndres Mendez-Vazquez
 
Instance Based Learning in Machine Learning
Instance Based Learning in Machine LearningInstance Based Learning in Machine Learning
Instance Based Learning in Machine LearningPavithra Thippanaik
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionJordan McBain
 
SVM Tutorial
SVM TutorialSVM Tutorial
SVM Tutorialbutest
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...butest
 
27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure PropertiesAndres Mendez-Vazquez
 
Artificial Intelligence
Artificial Intelligence Artificial Intelligence
Artificial Intelligence butest
 

What's hot (20)

Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Machine Learning Algorithms Review(Part 2)
Machine Learning Algorithms Review(Part 2)Machine Learning Algorithms Review(Part 2)
Machine Learning Algorithms Review(Part 2)
 
Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)Neural Networks: Principal Component Analysis (PCA)
Neural Networks: Principal Component Analysis (PCA)
 
Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic net
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost24 Machine Learning Combining Models - Ada Boost
24 Machine Learning Combining Models - Ada Boost
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data
 
Kaggle Projects Presentation Sawinder Pal Kaur
Kaggle Projects Presentation Sawinder Pal KaurKaggle Projects Presentation Sawinder Pal Kaur
Kaggle Projects Presentation Sawinder Pal Kaur
 
Probabilistic PCA, EM, and more
Probabilistic PCA, EM, and moreProbabilistic PCA, EM, and more
Probabilistic PCA, EM, and more
 
23 Machine Learning Feature Generation
23 Machine Learning Feature Generation23 Machine Learning Feature Generation
23 Machine Learning Feature Generation
 
Estimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample SetsEstimating Space-Time Covariance from Finite Sample Sets
Estimating Space-Time Covariance from Finite Sample Sets
 
Polynomial Matrix Decompositions
Polynomial Matrix DecompositionsPolynomial Matrix Decompositions
Polynomial Matrix Decompositions
 
PCA and SVD in brief
PCA and SVD in briefPCA and SVD in brief
PCA and SVD in brief
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
 
Instance Based Learning in Machine Learning
Instance Based Learning in Machine LearningInstance Based Learning in Machine Learning
Instance Based Learning in Machine Learning
 
Principal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty DetectionPrincipal Component Analysis For Novelty Detection
Principal Component Analysis For Novelty Detection
 
SVM Tutorial
SVM TutorialSVM Tutorial
SVM Tutorial
 
. An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic .... An introduction to machine learning and probabilistic ...
. An introduction to machine learning and probabilistic ...
 
27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties27 Machine Learning Unsupervised Measure Properties
27 Machine Learning Unsupervised Measure Properties
 
Artificial Intelligence
Artificial Intelligence Artificial Intelligence
Artificial Intelligence
 

Similar to Machine learning in science and industry — day 1

Analytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningAnalytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningcsandit
 
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...cscpconf
 
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGcsandit
 
CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU...
CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU...CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU...
CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU...IJCNCJournal
 
isabelle_webinar_jan..
isabelle_webinar_jan..isabelle_webinar_jan..
isabelle_webinar_jan..butest
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzerbutest
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revisedKrish_ver2
 
CC282 Decision trees Lecture 2 slides for CC282 Machine ...
CC282 Decision trees Lecture 2 slides for CC282 Machine ...CC282 Decision trees Lecture 2 slides for CC282 Machine ...
CC282 Decision trees Lecture 2 slides for CC282 Machine ...butest
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Daniel Chan
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171Yaxin Liu
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
 
Scaling Multinomial Logistic Regression via Hybrid Parallelism
Scaling Multinomial Logistic Regression via Hybrid ParallelismScaling Multinomial Logistic Regression via Hybrid Parallelism
Scaling Multinomial Logistic Regression via Hybrid ParallelismParameswaran Raman
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisDr.Pooja Jain
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETScsandit
 
DagdelenSiriwardaneY..
DagdelenSiriwardaneY..DagdelenSiriwardaneY..
DagdelenSiriwardaneY..butest
 

Similar to Machine learning in science and industry — day 1 (20)

Lect4
Lect4Lect4
Lect4
 
Analytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningAnalytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion mining
 
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
 
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
 
CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU...
CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU...CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU...
CONSTRUCTING A FUZZY NETWORK INTRUSION CLASSIFIER BASED ON DIFFERENTIAL EVOLU...
 
isabelle_webinar_jan..
isabelle_webinar_jan..isabelle_webinar_jan..
isabelle_webinar_jan..
 
Jörg Stelzer
Jörg StelzerJörg Stelzer
Jörg Stelzer
 
Respose surface methods
Respose surface methodsRespose surface methods
Respose surface methods
 
2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised2.6 support vector machines and associative classifiers revised
2.6 support vector machines and associative classifiers revised
 
CC282 Decision trees Lecture 2 slides for CC282 Machine ...
CC282 Decision trees Lecture 2 slides for CC282 Machine ...CC282 Decision trees Lecture 2 slides for CC282 Machine ...
CC282 Decision trees Lecture 2 slides for CC282 Machine ...
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171
 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
P1121133727
P1121133727P1121133727
P1121133727
 
Scaling Multinomial Logistic Regression via Hybrid Parallelism
Scaling Multinomial Logistic Regression via Hybrid ParallelismScaling Multinomial Logistic Regression via Hybrid Parallelism
Scaling Multinomial Logistic Regression via Hybrid Parallelism
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosis
 
Second subjective assignment
Second  subjective assignmentSecond  subjective assignment
Second subjective assignment
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 
DagdelenSiriwardaneY..
DagdelenSiriwardaneY..DagdelenSiriwardaneY..
DagdelenSiriwardaneY..
 

Recently uploaded

DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxGiDMOh
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxPayal Shrivastava
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxfarhanvvdk
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfGABYFIORELAMALPARTID1
 
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxQ4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxtuking87
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterHanHyoKim
 
Explainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosExplainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosZachary Labe
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...HafsaHussainp
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learningvschiavoni
 
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxpriyankatabhane
 
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxRitchAndruAgustin
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxkumarsanjai28051
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxMedical College
 
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdfDECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdfDivyaK787011
 
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书zdzoqco
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptxpallavirawat456
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPirithiRaju
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and AnnovaMansi Rastogi
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 

Recently uploaded (20)

DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptx
 
FBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptxFBI Profiling - Forensic Psychology.pptx
FBI Profiling - Forensic Psychology.pptx
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptx
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
 
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxQ4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarter
 
Explainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenariosExplainable AI for distinguishing future climate change scenarios
Explainable AI for distinguishing future climate change scenarios
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
DOG BITE management in pediatrics # for Pediatric pgs# topic presentation # f...
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
 
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
 
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptx
 
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdfDECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
DECOMPOSITION PATHWAYS of TM-alkyl complexes.pdf
 
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
办理麦克马斯特大学毕业证成绩单|购买加拿大文凭证书
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPR
 
linear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annovalinear Regression, multiple Regression and Annova
linear Regression, multiple Regression and Annova
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 

Machine learning in science and industry — day 1

  • 1. Machine Learning in Science and Industry Day 1, lectures 1 & 2 Alex Rogozhnikov, Tatiana Likhomanenko Heidelberg, GradDays 2017
  • 2. About authors Alex Rogozhnikov Tatiana Likhomanenko Yandex researchers (internet-company in Russia, mostly known for search engine) applying machine learning in CERN projects for 3 (Alex) and 4 (Tatiana) years graduated from Yandex School of Data Analysis 1 / 108
  • 3. Intro notes 4 days course but we'll cover most significant algorithms of machine learning 2–2.5 hours for lecture, then optional practical session we can be a bit out of schedule, but we'll try to fit all materials are published in repository: https://github.com/yandexdataschool/MLatGradDays we have a challenge! know material? Spend more time on challenge! you can participate in teams of two use chat for questions on challenge 2 / 108
  • 4. What is Machine Learning about? a method of teaching computers to make and improve predictions or behaviors based on some data? a field of computer science, probability theory, and optimization theory which allows complex tasks to be solved for which a logical/procedural approach would not be possible or feasible? a type of AI that provides computers with the ability to learn without being explicitly programmed? somewhat in between of statistics, AI, optimization theory, signal processing and pattern matching? 3 / 108
  • 5. What is Machine Learning about Inference of statistical dependencies which give us ability to predict 4 / 108
  • 6. What is Machine Learning about Inference of statistical dependencies which give us ability to predict Data is cheap, knowledge is precious 5 / 108
  • 7. Machine Learning is used in search engines spam detection security: virus detection, DDOS defense computer vision and speech recognition market basket analysis, customer relationship management (CRM), churn prediction credit scoring / insurance scoring, fraud detection health monitoring traffic jam prediction, self-driving cars advertisement systems / recommendation systems / news clustering handwriting recognition opinion mining 6 / 108
  • 8. Machine Learning is used in search engines spam detection security: virus detection, DDOS defense computer vision and speech recognition market basket analysis, customer relationship management (CRM), churn prediction credit scoring / insurance scoring, fraud detection health monitoring traffic jam prediction, self-driving cars advertisement systems / recommendation systems / news clustering handwriting recognition opinion mining and hundreds more 7 / 108
  • 9. Machine Learning in science In particle physics High-level triggers Particle identification Tracking Tagging High-level analysis star / galaxy / quasar classification in astronomy analysis of gene expression brain-computer interfaces weather forecasting 8 / 108
  • 10. Machine Learning in science In particle physics High-level triggers Particle identification Tracking Tagging High-level analysis star / galaxy / quasar classification in astronomy analysis of gene expression brain-computer interfaces weather forecasting In mentioned applications different data is used and different information is inferred, but the ideas beyond are quite similar. 9 / 108
  • 11. General notion In supervised learning the training data is represented as a set of pairs x , y i is an index of an observation x is a vector of features available for an observation y is a target — the value we need to predict features = observables = variables observations = samples = events i i i i 10 / 108
  • 12. Classification problem y ∈ Y , where Y is finite set of labels. Examples particle identification based on information about track: x = (p, η, E, charge, χ , ...) Y = {electron, muon, pion, ... } binary classification: Y = {0, 1} — 1 is signal, 0 is background i i 2 11 / 108
  • 13. Regression problem y ∈ R Examples: predicting price of a house by it's positions predicting money income / number of customers of a shop reconstructing real momentum of a particle 12 / 108
  • 14. Regression problem y ∈ R Examples: predicting price of a house by it's positions predicting money income / number of customers of a shop reconstructing real momentum of a particle Why do we need automatic classification/regression? in applications up to thousands of features higher quality much faster adaptation to new problems 13 / 108
  • 15. Classification based on nearest neighbours Given training set of objects and their labels {x , y } we predict the label for the new observation x: = y , j = arg ρ(x, x ) Here and after ρ(x, ) is the distance in the space of features. i i y^ j i min i x~ 14 / 108
  • 16. Visualization of decision rule Consider a classification problem with 2 features: x = (x , x ), y ∈ Y = {0, 1}i i 1 i 2 i 15 / 108
  • 17. k Nearest Neighbours (kNN) We can use k neighbours to compute probabilities to belong to each class: p (x) = Question: why this may be a good idea? y~ k # of knn of x in class y~ 16 / 108
  • 19. k = 1, 2, 5, 30 18 / 108
  • 20. Overfitting Question: how accurate is classification on training dataset when k = 1? 19 / 108
  • 21. Overfitting Question: how accurate is classification on training dataset when k = 1? answer: it is ideal (closest neighbor is observation itself) quality is lower when k > 1 20 / 108
  • 22. Overfitting Question: how accurate is classification on training dataset when k = 1? answer: it is ideal (closest neighbor is observation itself) quality is lower when k > 1 this doesn't mean k = 1 is the best, it means we cannot use training observations to estimate quality when classifier's decision rule is too complex and captures details from training data that are not relevant to distribution, we call this an overfitting 21 / 108
  • 23. Model selection Given two models, which one should we select? ML is about inference of statistical dependencies, which give us ability to predict 22 / 108
  • 24. Model selection Given two models, which one should we select? ML is about inference of statistical dependencies, which give us ability to predict The best model is the model which gives better predictions for new observations. Simplest way to control this is to check quality on a holdout — a sample not used during training (cross-validation).This gives unbiased estimate of quality for new data. estimates have variance multiple testing introduces bias (solution: train + validation + test, like kaggle) very important factor if amount of data is small there are more approaches to cross-validation 23 / 108
  • 25. Regression using kNN Regression with nearest neighbours is done by averaging of output Model prediction: (x) = yy^ k 1 j∈knn(x) ∑ j 24 / 108
  • 26. kNN with weights Average neighbours' output with weights: = the closer the neighbour, the higher weights of its contribution, i.e.: w = 1/ρ(x, x ) y^ w∑j∈knn(x) j w y∑j∈knn(x) j j j j 25 / 108
  • 27. Computational complexity Given that dimensionality of space is d and there are n training samples: training time: ~ O(save a link to the data) prediction time: n × d for each observation Prediction is extremely slow. To resolve this one can build index-like structure to select neighbours faster. 26 / 108
  • 28. Spatial index: ball tree training time ~ O(d × n log(n)); prediction time ~ O(d × log(n)) for each observation 27 / 108
  • 29. Spatial index: ball tree training time ~ O(d × n log(n)); prediction time ~ O(d × log(n)) for each observation. Other options exist: KD-tree, FLANN, FALCONN, IMI, NO-IMI. Those are very useful for searching over large databases with high-dimensional items 28 / 108
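A sketch of building such a spatial index with scikit-learn (here via NearestNeighbors with algorithm='ball_tree'; the random data is ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.normal(size=(10000, 8))

# build the index once (roughly O(d * n log n)), then query it quickly
index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X)
distances, indices = index.kneighbors(X[:3])   # neighbours of the first 3 points
print(indices)
```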
  • 30. Overview of kNN Awesomely simple classifier and regressor; gives a too-optimistic quality estimate on training data; quite slow, though optimizations exist; too sensitive to the scale of features; doesn't provide good results on high-dimensional data 29 / 108
  • 31. Sensitivity to scale of features Euclidean distance: ρ(x, x̃)² = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)² 30 / 108
  • 32. Sensitivity to scale of features Euclidean distance: ρ(x, x̃)² = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)². Change the scale of the first feature: ρ(x, x̃)² = (10x_1 − 10x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)² ∼ 100(x_1 − x̃_1)² (assuming that x_1, x_2, ..., x_d are of the same order) 31 / 108
  • 33. Sensitivity to scale of features Euclidean distance: ρ(x, x̃)² = (x_1 − x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)². Change the scale of the first feature: ρ(x, x̃)² = (10x_1 − 10x̃_1)² + (x_2 − x̃_2)² + ... + (x_d − x̃_d)² ∼ 100(x_1 − x̃_1)² (assuming that x_1, x_2, ..., x_d are of the same order). Scaling of features frequently increases quality; otherwise the contribution of features with a large scale dominates. 32 / 108
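A common remedy is to standardize features before computing distances; a sketch with scikit-learn's StandardScaler in a pipeline (the toy data, where only the small-scale feature is informative, is ours):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2)) * [1000.0, 1.0]   # first feature has a much larger scale
y = (X[:, 1] > 0).astype(int)                   # but only the second one is informative

# scaling puts both features on the same footing before the distance is computed
clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
clf.fit(X, y)
print(clf.score(X, y))
```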
  • 34. Distance function matters Minkowski distance: ρ(x, x̃)^p = Σ_l |x_l − x̃_l|^p. Canberra: ρ(x, x̃) = Σ_l |x_l − x̃_l| / (|x_l| + |x̃_l|). Cosine metric: ρ(x, x̃) = <x, x̃> / (|x| |x̃|) 33 / 108
  • 35. Problems with high dimensions In higher dimensions d ≫ 1 the neighbouring points are further away. Example: consider n training points distributed uniformly in the unit cube: the expected number of points in a ball of radius r is proportional to r^d, so to collect the same number of neighbours we need to take r = const^{1/d} → 1. kNN suffers from the curse of dimensionality. 34 / 108
  • 36. Nearest neighbours applications Classification and regression based on kNN are used quite rarely, but nearest neighbours search is very widespread and used whenever you need to find something similar in a database: finding documents and images; finding similar proteins using 3d structure; search in chemical databases; person identification by photo or by speech; user identification by behaviour on the web. Similarity (distance) is very important and isn't defined trivially in most cases. 35 / 108
  • 37. Measuring quality of binary classification The classifier's output in binary classification is a real variable (say, signal is blue and background is red). Which classifier provides better discrimination? 36 / 108
  • 38. Measuring quality of binary classification The classifier's output in binary classification is a real variable (say, signal is blue and background is red). Which classifier provides better discrimination? Discrimination is identical in all three cases 37 / 108
  • 41. ROC curve These distributions have the same ROC curve (the ROC curve is the dependence of passed signal vs passed background) 40 / 108
  • 42. ROC curve Defined only for binary classification. Contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold. Particular values of thresholds (and the initial pdfs) don't matter; the ROC curve doesn't contain this information. ROC curve = information about the order of the observations' predictions: b b s b s b ... s s b s s. Comparison of algorithms should be based on the information from the ROC curve. 41 / 108
  • 43. ROC AUC (area under the ROC curve) ROC AUC = P(r_b < r_s), where r_b, r_s are the predictions of random background and signal observations. 42 / 108
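A sketch of computing the ROC curve and ROC AUC with scikit-learn (the Gaussian scores below merely stand in for real classifier outputs):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.RandomState(0)
y_true = np.concatenate([np.zeros(500), np.ones(500)])                    # 0 = background, 1 = signal
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)])  # classifier output

fpr, tpr, thresholds = roc_curve(y_true, scores)   # background eff. vs signal eff. for all thresholds
print("ROC AUC =", roc_auc_score(y_true, scores))  # = P(score_bck < score_sig)
```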
  • 44. Classifiers have the same ROC AUC, but which is better ... if the problem is spam detection? Spam is class 0, non-spam is class 1 (putting a normal letter in spam is very undesirable) ... for a background rejection system at the LHC? Background is class 0 (we need to pass very little background) 43 / 108
  • 45. Classifiers have the same ROC AUC, but which is better ... if the problem is spam detection? Spam is class 0, non-spam is class 1 (putting a normal letter in spam is very undesirable) ... for a background rejection system at the LHC? Background is class 0 (we need to pass very little background). Applications frequently demand different metrics. 44 / 108
  • 46. Statistical Machine Learning The machine learning we use in practice is based on statistics. Main assumption: the data is generated from a probability distribution p(x, y). Does the distribution of people / pages / texts really exist? Same question for stars / proteins / viruses? 45 / 108
  • 47. Statistical Machine Learning The machine learning we use in practice is based on statistics. Main assumption: the data is generated from a probability distribution p(x, y). Does the distribution of people / pages / texts really exist? Same question for stars / proteins / viruses? In HEP these distributions do exist. The statistical framework turned out to be quite helpful for building models. 46 / 108
  • 48. Optimal regression Assuming that we know the real distribution p(x, y) and the goal is to minimize the squared error: L = E(y − ŷ(x))² → min. The optimal prediction is the marginalized average ŷ(x₀) = E(y ∣ x = x₀) 47 / 108
  • 49. Optimal classification. Bayes optimal classifier Assuming that we know the real distributions p(x, y), we reconstruct p(y∣x) using Bayes' rule: p(y∣x) = p(x, y) / p(x) = p(y) p(x∣y) / p(x); hence p(y = 1 ∣ x) / p(y = 0 ∣ x) = [p(y = 1) p(x ∣ y = 1)] / [p(y = 0) p(x ∣ y = 0)] 48 / 108
  • 50. Optimal classification. Bayes optimal classifier Assuming that we know the real distributions p(x, y), we reconstruct p(y∣x) using Bayes' rule: p(y∣x) = p(x, y) / p(x) = p(y) p(x∣y) / p(x). Lemma (Neyman–Pearson): the best classification quality is provided by the ratio p(y = 1 ∣ x) / p(y = 0 ∣ x) (the Bayes optimal classifier) 49 / 108
  • 51. Optimal Binary Classification The Bayes optimal classifier has the highest possible ROC curve. Since the classification quality depends only on the order, p(y = 1 ∣ x) gives optimal classification quality too! p(y = 1 ∣ x) / p(y = 0 ∣ x) = [p(y = 1) / p(y = 0)] × [p(x ∣ y = 1) / p(x ∣ y = 0)] 50 / 108
  • 52. Optimal Binary Classification The Bayes optimal classifier has the highest possible ROC curve. Since the classification quality depends only on the order, p(y = 1 ∣ x) gives optimal classification quality too! p(y = 1 ∣ x) / p(y = 0 ∣ x) = [p(y = 1) / p(y = 0)] × [p(x ∣ y = 1) / p(x ∣ y = 0)]. How can we estimate the terms of this expression from data? p(y = 1), p(y = 0)? 51 / 108
  • 53. Optimal Binary Classification The Bayes optimal classifier has the highest possible ROC curve. Since the classification quality depends only on the order, p(y = 1 ∣ x) gives optimal classification quality too! p(y = 1 ∣ x) / p(y = 0 ∣ x) = [p(y = 1) / p(y = 0)] × [p(x ∣ y = 1) / p(x ∣ y = 0)]. How can we estimate the terms of this expression from data? p(y = 1), p(y = 0)? p(x ∣ y = 1), p(x ∣ y = 0)? 52 / 108
  • 54. Histogram density estimation Count the number of samples in each bin and normalize. fast; the choice of binning is crucial; the number of bins grows exponentially with the dimension → curse of dimensionality 53 / 108
  • 55. Kernel density estimation f(x) = (1/(n h)) Σ_i K((x − x_i)/h), where K(x) is the kernel and h is the bandwidth. Typically a Gaussian kernel is used, but there are many others. The approach is very close to weighted kNN. 54 / 108
  • 56. Kernel density estimation: bandwidth selection Silverman's rule of thumb: h = (4 / (3n))^{1/5} σ̂, where σ̂ is the empirical standard deviation 55 / 108
  • 57. Kernel density estimation: bandwidth selection Silverman's rule of thumb: h = (4 / (3n))^{1/5} σ̂, where σ̂ is the empirical standard deviation; it may be irrelevant if the data is far from being Gaussian 56 / 108
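A sketch of kernel density estimation with SciPy's gaussian_kde, which can pick the bandwidth by Silverman's rule (the sample is synthetic; SciPy applies a multivariate generalization of the rule):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.RandomState(0)
sample = rng.normal(loc=0.0, scale=2.0, size=1000)

kde = gaussian_kde(sample, bw_method="silverman")  # bandwidth via Silverman's rule
grid = np.linspace(-8, 8, 5)
print(kde(grid))   # estimated density values on the grid
```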
  • 59. Recapitulation 1. Statistical ML: applications and problems 2. k nearest neighbours classifier and regressor distance function overfitting and model selection speeding up neighbours search 3. Binary classification: ROC curve, ROC AUC 4. Optimal regression and optimal classification 5. Density estimation histograms kernel density estimation 58 / 108
  • 60. Parametric density estimation Family of density functions: f(x; θ). Problem: estimate the parameters of a Gaussian distribution: f(x; μ, Σ) = 1 / ((2π)^{d/2} ∣Σ∣^{1/2}) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)) 59 / 108
  • 61. QDA (Quadratic discriminant analysis) Reconstructing the probabilities p(x ∣ y = 1), p(x ∣ y = 0) from data, assuming those are multidimensional normal distributions: p(x ∣ y = 0) ∼ N(μ₀, Σ₀), p(x ∣ y = 1) ∼ N(μ₁, Σ₁). Then p(y = 1 ∣ x) / p(y = 0 ∣ x) = [p(y = 1) / p(y = 0)] × [p(x ∣ y = 1) / p(x ∣ y = 0)] = const × exp(−½ (x − μ₁)ᵀ Σ₁⁻¹ (x − μ₁)) / exp(−½ (x − μ₀)ᵀ Σ₀⁻¹ (x − μ₀)) = exp(−½ (x − μ₁)ᵀ Σ₁⁻¹ (x − μ₁) + ½ (x − μ₀)ᵀ Σ₀⁻¹ (x − μ₀) + const) 60 / 108
  • 63. QDA computational complexity n samples, d dimensions. Training consists of fitting p(x ∣ y = 0) and p(x ∣ y = 1) and takes O(nd² + d³): computing the covariance matrix is O(nd²), inverting the covariance matrix is O(d³). Prediction takes O(d²) for each sample, spent on computing the dot product 62 / 108
  • 64. QDA overview simple analytical formula for prediction; fast prediction; many parameters to reconstruct in high dimensions, so the estimate may become unreliable; data almost never has a Gaussian distribution 63 / 108
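A sketch of QDA with scikit-learn's QuadraticDiscriminantAnalysis on synthetic Gaussian classes of our own making:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.RandomState(0)
# two Gaussian classes with different covariance matrices
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=500)
X1 = rng.multivariate_normal([2, 2], [[1.5, -0.4], [-0.4, 0.5]], size=500)
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.predict_proba([[1.0, 1.0]]))   # p(y=0|x), p(y=1|x)
```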
  • 65. Gaussian mixtures for density estimation Mixture of distributions: f(x) = Σ_{c ∈ components} π_c f_c(x; θ_c), with Σ_{c ∈ components} π_c = 1. Mixture of Gaussian distributions: f(x) = Σ_{c ∈ components} π_c f(x; μ_c, Σ_c). Parameters to be found: π_1, ..., π_C, μ_1, ..., μ_C, Σ_1, ..., Σ_C 64 / 108
  • 67. Gaussian mixtures: finding parameters The criterion is maximizing the likelihood (using MLE to find the optimal parameters): Σ_i log f(x_i; θ) → max_θ; no analytic solution; we can use general-purpose optimization methods 66 / 108
  • 68. Gaussian mixtures: finding parameters The criterion is maximizing the likelihood (using MLE to find the optimal parameters): Σ_i log f(x_i; θ) → max_θ; no analytic solution; we can use general-purpose optimization methods. In mixtures the parameters are split in two groups: θ_1, ..., θ_C (parameters of the components) and π_1, ..., π_C (contributions of the components) 67 / 108
  • 69. Expectation-Maximization algorithm [Dempster et al., 1977] Idea: introduce a set of hidden variables π_c(x). Expectation step: π_c(x) = p(x ∈ c) = π_c f_c(x; θ_c) / Σ_c̃ π_c̃ f_c̃(x; θ_c̃). Maximization step: π_c ∝ Σ_i π_c(x_i), θ_c = arg max_θ Σ_i π_c(x_i) log f_c(x_i; θ_c). The maximization step is trivial for Gaussian distributions. The EM algorithm is more stable and has good convergence properties. 68 / 108
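In practice the EM fit is available off the shelf; a sketch with scikit-learn's GaussianMixture (which runs EM internally) on synthetic data of our own:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# synthetic data drawn from two well-separated Gaussian components
X = np.vstack([rng.normal(0, 1.0, size=(300, 2)),
               rng.normal(4, 0.5, size=(300, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)              # estimated mixing coefficients pi_c
print(gmm.score_samples(X[:3]))  # log-density log f(x) for the first points
```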
  • 72. Density estimation applications parametric fits in HEP, specifically mixtures of signal + background; outlier detection: if a new observation has a low probability w.r.t. the current distribution, it is considered an outlier; photometric selection of quasars; Gaussian mixtures are used as a smooth version of clustering (discussed later) 71 / 108
  • 73. Clustering (unsupervised learning) Gaussian mixtures can be considered a smooth form of clustering. However, in some cases clusters have to be defined strictly (that is, each observation can be assigned to only one cluster). 72 / 108
  • 74. Demo of k-means and DBSCAN clusterings (shown interactively) 73 / 108
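For readers without the interactive demo, a minimal sketch of both algorithms with scikit-learn on synthetic blobs (the data and parameter values are ours):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.RandomState(0)
# three blobs of points in 2d
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ((0, 0), (3, 0), (0, 3))])

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # label -1 marks noise points
print(np.unique(kmeans_labels), np.unique(dbscan_labels))
```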
  • 75. OPERA (operated in 2008-2012) Oscillation Project with Emulsion-tRacking Apparatus neutrinos sent from CERN to Gran Sasso (INFN, Italy); looking for neutrino oscillations (Nobel prize 2015); hence, neutrinos aren't massless; the OPERA detector is placed underground and consists of 150'000 bricks of 8.3 kg each; each brick consists of interleaved photo-emulsion and lead layers 74 / 108
  • 76. OPERA: brick and scanning bricks are taken out and scanned only if there are signs of something interesting one brick = billions of base-tracks 75 / 108
  • 77. OPERA: Tracking with DBSCAN clustering 76 / 108
  • 78. OPERA clustering demo (shown interactively) different metrics can be used in clustering the appropriate choice of metric is crucial 77 / 108
  • 79. Clustering news clustering in news aggregators: each topic is covered by dozens of media outlets, and those are joined into one group to avoid showing lots of similar results; clustering of the surface based on spectra (in the picture, 5 spectral channels for each point were taken from satellite photos); clustering in calorimeters in HEP; clustering of gamma-ray bursts in astronomy 78 / 108
  • 80. A classification model based on mixture density estimation is called MDA (mixture discriminant analysis). Generative approach: try to reconstruct p(x, y), then use the Bayes classification formula to predict. QDA and MDA are generative classifiers. 79 / 108
  • 81. A classification model based on mixture density estimation is called MDA (mixture discriminant analysis). Generative approach: try to reconstruct p(x, y), then use the Bayes classification formula to predict. QDA and MDA are generative classifiers. Problems of the generative approach: real-life distributions can hardly be reconstructed, especially in high-dimensional spaces. So we switch to the discriminative approach: estimating p(y∣x) directly 80 / 108
  • 83. Density estimation requires reconstructing the dependencies between all known parameters, which requires a lot of data, but most of these dependencies aren't needed in applications. If we can avoid multidimensional density estimation, we'd better do it. 82 / 108
  • 84. Naive Bayes classification rule Assuming that the features are independent within each class, one can avoid multidimensional density fitting: p(y = 1 ∣ x) / p(y = 0 ∣ x) = [p(y = 1) / p(y = 0)] × Π_i [p(x_i ∣ y = 1) / p(x_i ∣ y = 0)]. The contribution of each feature is independent; one-dimensional fitting is much easier. 83 / 108
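A sketch of the naive Bayes rule with scikit-learn's GaussianNB, which fits a one-dimensional Gaussian per feature and class (the toy data is ours):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(500, 4)), rng.normal(1, 1, size=(500, 4))])
y = np.array([0] * 500 + [1] * 500)

# each one-dimensional density p(x_i | y) is modelled with a Gaussian
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:2]))
```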
  • 85. Application: B-tagging (picture from LHCb) flavour tagging is important for estimating CKM-matrix elements; to predict which meson was produced (B⁰ or B̄⁰), the non-signal part of the collision is analyzed (tagging); the signal part (green) provides information about the decayed state; there are dozens of tracks that aren't part of this scheme 84 / 108
  • 86. Inclusive tagging idea use all (non-signal) tracks from the collision; let the machine guess which tracks are tagging, how their charge is connected to the meson flavour, and estimate the probability. We can make a naive assumption that the information from the tracks is independent: p(B⁰ ∣ tracks) / p(B̄⁰ ∣ tracks) = [p(B⁰) / p(B̄⁰)] × Π_track [p(track ∣ B⁰) / p(track ∣ B̄⁰)] = (p(B̄⁰) / p(B⁰))^{n_tracks − 1} × Π_track [p(B⁰ ∣ track) / p(B̄⁰ ∣ track)]. The number of terms in the product varies (compared to typical Naive Bayes). The ratio for each track on the right side is estimated with ML techniques discussed later. This simple technique provides very good results. 85 / 108
  • 87. Linear decision rule The decision function is linear: d(x) = <w, x> + w₀; ŷ = sgn(d(x)): d(x) > 0 → ŷ = +1, d(x) < 0 → ŷ = −1. This is a parametric model (finding parameters w, w₀). QDA & MDA are parametric as well. 86 / 108
  • 88. Finding Optimal Parameters A good initial guess: find such w, w₀ that the classification error is minimal: L = Σ_{i ∈ samples} 1_{y_i ≠ ŷ_i}, ŷ_i = sgn(d(x_i)). Notation: 1_true = 1, 1_false = 0. Discontinuous optimization (arrrrgh!) 87 / 108
  • 89. Finding Optimal Parameters A good initial guess: find such w, w₀ that the classification error is minimal: L = Σ_{i ∈ samples} 1_{y_i ≠ ŷ_i}, ŷ_i = sgn(d(x_i)). Notation: 1_true = 1, 1_false = 0. Discontinuous optimization (arrrrgh!) Solution: let's make the decision rule smooth: p(y = +1∣x) = p₊₁(x) = f(d(x)), p(y = −1∣x) = p₋₁(x) = 1 − p₊₁(x), where f(0) = 0.5, f(x) > 0.5 for x > 0, f(x) < 0.5 for x < 0 88 / 108
  • 90. Logistic function σ(x) = e^x / (1 + e^x) = 1 / (1 + e^{−x}) Properties 1. monotonic, σ(x) ∈ (0, 1) 2. σ(x) + σ(−x) = 1 3. σ′(x) = σ(x)(1 − σ(x)) 4. 2σ(x) = 1 + tanh(x/2) 89 / 108
  • 91. Logistic regression Define the probabilities with the logistic function of d(x) = <w, x> + w₀: p₊₁(x) = σ(d(x)), p₋₁(x) = σ(−d(x)), and optimize the log-likelihood: L = − Σ_{i ∈ observations} ln(p_{y_i}(x_i)) = Σ_i L(x_i, y_i) → min 90 / 108
  • 92. Logistic loss (LogLoss) The term loss refers to what we are minimizing. Losses typically estimate our risks and are denoted as L: L = − Σ_{i ∈ observations} ln(p_{y_i}(x_i)) = Σ_i L(x_i, y_i) → min. LogLoss penalty for a single observation: L(x_i, y_i) = − ln(p_{y_i}(x_i)) = ln(1 + e^{−d(x_i)}) if y_i = +1, ln(1 + e^{d(x_i)}) if y_i = −1, i.e. L(x_i, y_i) = ln(1 + e^{−y_i d(x_i)}). The margin y_i d(x_i) is expected to be high for all observations. 91 / 108
  • 93. Logistic loss L(x_i, y_i) is a convex function. Simple analysis shows that L is a sum of convex functions w.r.t. w, so the optimization problem has at most one optimum. 92 / 108
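A sketch of logistic regression with scikit-learn, checking that the predicted probabilities equal σ(<w, x> + w₀) (the synthetic data is ours; note sklearn's default fit includes L2 regularization):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] + 0.3 * rng.normal(size=1000) > 0, 1, -1)

clf = LogisticRegression().fit(X, y)
w, w0 = clf.coef_[0], clf.intercept_[0]

# predicted p(y=+1|x) equals sigma(d(x)) with d(x) = <w, x> + w0
d = X[:5] @ w + w0
print(1.0 / (1.0 + np.exp(-d)))
print(clf.predict_proba(X[:5])[:, 1])   # should match the line above
```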
  • 94. Visualization of logistic regression 93 / 108
  • 95. Linear model for regression How to use a linear function for regression? d(x) = <w, x> + w₀. Simplification of notation: set x₀ = 1, x = (1, x_1, ..., x_d); then d(x) = <w, x> 94 / 108
  • 96. Linear regression (ordinary least squares) We can use a linear function for regression: d(x_i) = y_i with d(x) = <w, x>. This is a linear system with d + 1 variables and n equations. Minimize OLS aka MSE (mean squared error): L = Σ_i (d(x_i) − y_i)² → min. Explicit solution: (Σ_i x_i x_iᵀ) w = Σ_i y_i x_i 95 / 108
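A sketch of the explicit OLS solution via the normal equations in NumPy (the synthetic data and true_w are ours):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # prepend x_0 = 1
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

# normal equations: (sum_i x_i x_i^T) w = sum_i y_i x_i
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # close to true_w
```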
  • 97. Linear regression one can use some other loss L different from MSE, but there is no explicit solution in other cases; demonstrates the properties of linear models: reliable estimates when n ≫ d; able to fit the data exactly if n = d; not uniquely defined when n < d 96 / 108
  • 98. Regularization: motivation When the number of parameters is high (compared to the number of observations) it is hard to estimate all parameters reliably; linear regression with MSE: in d-dimensional space you can find a hyperplane through any d points; non-unique solution if n < d: the matrix Σ_i x_i x_iᵀ degenerates. Solution 1: manually decrease the dimensionality of the problem (select more appropriate features). Solution 2: use regularization 97 / 108
  • 99. Regularization When the number of parameters in a model is high, overfitting is very probable. Solution: add a regularization term to the loss function: L = (1/N) Σ_i L(x_i, y_i) + L_reg → min. L₂ regularization: L_reg = α Σ_j ∣w_j∣². L₁ regularization: L_reg = β Σ_j ∣w_j∣. L₁ + L₂ regularization: L_reg = α Σ_j ∣w_j∣² + β Σ_j ∣w_j∣ 98 / 108
  • 100. L₂, L₁ regularizations Dependence of the parameters (components of w) on the regularization (stronger regularization to the left): L₂ regularization; L₁ (solid), L₁ + L₂ (dashed) 99 / 108
  • 101. Regularizations L₁ regularization encourages sparsity (many coefficients in w turn to zero) 100 / 108
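A sketch contrasting L₂ (Ridge) and L₁ (Lasso) regularization in scikit-learn; with the L₁ penalty many coefficients become exactly zero (the synthetic data and the α values are ours):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + 0.1 * rng.normal(size=100)  # only 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients, none exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: many coefficients become exactly zero
print(np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))
```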
  • 102. L_p regularizations We can consider a more general case: L_p = Σ_i ∣w_i∣^p. Expression for L₀: L₀ = Σ_i 1_{w_i ≠ 0} exactly penalizes the number of non-zero coefficients, but nobody uses L₀. Why? 101 / 108
  • 103. L_p regularizations We can consider a more general case: L_p = Σ_i ∣w_i∣^p. Expression for L₀: L₀ = Σ_i 1_{w_i ≠ 0} exactly penalizes the number of non-zero coefficients, but nobody uses L₀. Why? Even L_p with 0 < p < 1 is not used. Why? 102 / 108
  • 104. Regularization summary an important tool to fight overfitting (= poor generalization on new data); different modifications exist for other models; makes it possible to handle really many features: machine learning should detect the important features itself; from the mathematical point of view: turns a convex problem into a strongly convex one (NB: only for linear models); from the practical point of view: softly limits the space of parameters; breaks the scale-invariance of linear models 103 / 108
  • 106. Data Scientist Tools Experiment in an appropriate high-level language or environment. After the experiments are over, implement the final algorithm in a low-level language (C++, CUDA, FPGA). The second point is typically unnecessary 105 / 108
  • 107. Scientific Python NumPy: vectorized computations in Python. Matplotlib: for drawing. Scikit-learn: the most popular library for machine learning 106 / 108
  • 108. Scientific Python SciPy: libraries for science and engineering. Root_numpy: a convenient way to work with ROOT files. Astropy: a community Python library for astronomy 107 / 108