The Higgs Boson Machine Learning Challenge was one of the largest data analysis competitions in the world. To succeed in it, Cheng applied his knowledge of Computer Science, Mathematics, Statistics, and Physics, along with problem-solving habits developed during his training in Civil Engineering.
In this presentation, Cheng draws on his experience in the competition to illustrate some important elements of big data analytics and why they matter. The content spans several disciplines, including physics, statistics, and mathematics, but no background in these areas is required to follow the essence of the presentation.
In brief, the presentation covers the following content:
An effective framework for general data mining projects,
An introduction to the competition and its physics background,
Various techniques in data exploration and some traps to avoid,
Various ways of feature enhancement,
Model building and selection, and
Optimization of model performance
2. PRESENTER
Ohio State University, Tongji University
Ph.D. Civil Engineering
M.S. Applied Statistics
Minor: Computer Science
Advanced training:
City and Regional Planning
Industrial and Systems Engineering
Mathematics
Passion: (this) machine learning
3. HIGGS BOSON MACHINE LEARNING CHALLENGE
• Goal: improve the procedure that produces the selection region of the Higgs Boson
• 4-month duration
• 1,785 teams
• Many machine learning experts, statisticians, and physicists
• The top 5 finishers came from 5 different countries: Hungary, Netherlands, France, Russia, U.S.A./China
http://www.kaggle.com/c/higgs-boson/leaderboard
7. HIGGS BOSON
• a.k.a. the God Particle (explains some mass)
• A fundamental particle theorized in 1964 in the Standard Model of Particle Physics
• "Considered" discovered in 2011–2013 at the LHC by CERN
• A number of prestigious awards in 2013, including a Nobel Prize

A "definitive" answer might require "another few years" after the collider's 2015 restart.
— deputy chair of physics at Brookhaven National Laboratory

http://en.wikipedia.org/wiki/Higgs_boson
http://upload.wikimedia.org/wikipedia/commons/0/00/Standard_Model_of_Elementary_Particles.svg
8. CERN: THE EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH
• Established in 1954
• Birthplace of the World Wide Web (1989)
maps.google.com
9. LARGE HADRON COLLIDER (LHC)
• 27 km (17 mi) in circumference
• 175 meters (574 ft) beneath ground
• Built from 1998 to 2008
• Over 10,000 scientists and engineers
• Over 100 countries
• Seven particle detectors
https://www.llnl.gov/news/llnl-set-host-international-lattice-physics-conference
http://en.wikipedia.org/wiki/Large_Hadron_Collider
10. ATLAS
• 46 meters long
• 25 meters in diameter
• Weighs about 7,000 tonnes
• Contains some 3,000 km of cable
• Involves roughly 3,000 physicists from over 175 institutions in 38 countries
http://en.wikipedia.org/wiki/Large_Hadron_Collider
http://higgsml.lal.in2p3.fr/documentation/
13. CHALLENGES IN DETECTION OF HIGGS BOSON
• The Higgs Boson cannot be measured directly (it decays immediately into lighter particles)
• Other particles can decay into the same set of lighter particles
• PRODUCTION and DECAY of the Higgs Boson depend on its mass, which was not predicted by theory (we now know it is close to 125 GeV)

Seeing a circular shadow does not mean the real object is a sphere.
https://www2.physics.ox.ac.uk/sites/default/files/2012-03-27/sinead_farrington_pdf_17376.pdf
14. CURRENT DETECTION MECHANISM
• Raw data collected from the LHC
• Hundreds of millions of proton-proton collisions (events) per second
• 400 events of interest are selected per second
  – Signal events (i.e., Higgs Boson)
  – Background events (i.e., other particles)
• Events in an ad hoc selection region (in certain channels) exceeding background noise
The selection criteria need improvement in significance and robustness.
15. SIMPLIFICATIONS FOR COMPETITION
• Simulated data
• Fixed mass (125 GeV)
• Simplified decay channel (next slide)
• Simplified background events (three representative types only)
  – Decay of the Z boson (91.2 GeV) into tau-tau
  – Decay of a pair of top quarks into a lepton and a hadronic tau
  – "Decay" of the W boson into a lepton and a hadronic tau, due to imperfections in the particle identification procedure
• Simplified objective function (significance score)
16. SIMPLIFIED DECAY CHANNEL
• Decay via the tau-tau channel only
• One tau decays into a lepton and two neutrinos
• The other tau decays into a hadronic tau and a neutrino
• (Note: neutrinos cannot be detected)
hadronic tau: a bunch of hadrons
18. SIMPLIFIED DECAY CHANNEL
• Decay via the tau-tau channel only
• One tau decays into a lepton and two neutrinos
• The other tau decays into a hadronic tau and a neutrino
• (Note: neutrinos cannot be detected)
Jets / MET: vectorized momenta are given
hadronic tau: a bunch of hadrons
23. HOW TO HANDLE MISSING VALUES
• Assign a value
  – Generate a random value
  – Fit a value (mean, median, nearest neighbor, etc.)
  – Fix a value (domain knowledge)
• Remove the record
• Leave as is
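As a concrete illustration of the "fit a value" option, here is a minimal pandas sketch. It assumes the challenge's convention of coding undefined entries as -999.0; the column name PRI_jet_leading_pt is from the dataset, but the values here are made up.

```python
# Median imputation ("fit a value") for a column where missing entries
# are coded as -999.0, as in the challenge dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({"PRI_jet_leading_pt": [80.2, -999.0, 45.1, -999.0, 120.5]})

col = df["PRI_jet_leading_pt"].replace(-999.0, np.nan)  # "leave as is" -> NaN
median_filled = col.fillna(col.median())                # "fit a value" (median)
```

Which option is best depends on the model: tree-based methods can often handle a sentinel value like -999.0 directly, while distance-based methods usually need imputation.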
25. HISTOGRAM — PRI_jet_leading_pt
[Histograms of raw counts, the log transformation, and the inverse transformation of PRI_jet_leading_pt]
• Density is more meaningful in the range of x
• No fuzzy jump at the edge
29. INTERACTIVE VISUALIZATION — R SHINY
Use a reasonable number of bins to display the underlying distribution.
http://chencheng.shinyapps.io/demo_higgs
30. INTERACTIVE VISUALIZATION — R SHINY
Use a reasonable transformation to display the underlying distribution.
http://chencheng.shinyapps.io/demo_higgs
42. FEATURE ENHANCEMENT — CURVE FITTING
Enhance a variable based on its correlation with another variable.
[Histograms of BKG vs. SGN counts for DER_pt_h & DER_deltar_tau_lep]
46. DATA DRILL DOWN
• Select variable(s): one variable for a histogram, two variables for a scatter plot
http://chencheng.shinyapps.io/demo_higgs
47. DATA DRILL DOWN
• Dynamically select a subset of data — PRI_jet_num = 2
http://chencheng.shinyapps.io/demo_higgs
48. DATA DRILL DOWN
• Patterns in the subset data — PRI_jet_leading_eta & PRI_jet_subleading_eta
http://chencheng.shinyapps.io/demo_higgs
49. DATA DRILL DOWN
• Dynamically select a subset of data — PRI_jet_num = 3
http://chencheng.shinyapps.io/demo_higgs
50. DATA DRILL DOWN
• Patterns in the subset data — PRI_jet_leading_eta & PRI_jet_subleading_eta
http://chencheng.shinyapps.io/demo_higgs
51. DATA DRILL DOWN
• Patterns in the subset data — PRI_jet_leading_eta & PRI_jet_subleading_eta
  PRI_jet_num = 2 vs. PRI_jet_num = 3
Interactive data visualization techniques are helpful.
http://chencheng.shinyapps.io/demo_higgs
55. INSPIRATION FROM ANIMATION
• Distance ratio between MET-Lep and Tau-Lep: d(MET, Lep) / d(Tau, Lep)
Inspiration from meaningful visualization can be helpful.
[Histogram of dist_ratio_met_lep_tau counts, BKG vs. SGN]
56. INSPIRATION FROM ANIMATION
• Distance ratio between MET-Lep and Tau-Lep: d(MET, Lep) / d(Tau, Lep)
Adjust visualization for better efficiency.
[Histograms of dist_ratio_met_lep_tau counts, BKG vs. SGN]
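The engineered feature above can be sketched as follows. The deck only defines dist_ratio_met_lep_tau = d(MET, Lep) / d(Tau, Lep); taking d to be the angular separation in (eta, phi) is an assumption here (and since MET carries no eta, only the azimuthal difference is used for the MET-lepton distance). The column names are from the dataset; the values are made up.

```python
# Distance-ratio feature: d(MET, Lep) / d(Tau, Lep), with d taken as
# (eta, phi) separation — an assumed metric, not stated in the deck.
import numpy as np
import pandas as pd

def delta_phi(p1, p2):
    """Wrap the azimuthal angle difference into [-pi, pi]."""
    d = p1 - p2
    return (d + np.pi) % (2 * np.pi) - np.pi

df = pd.DataFrame({
    "PRI_met_phi": [0.3, -1.2],
    "PRI_lep_phi": [1.1,  2.0],
    "PRI_lep_eta": [0.5, -0.4],
    "PRI_tau_phi": [-0.7, 0.9],
    "PRI_tau_eta": [1.4,  0.2],
})

d_met_lep = np.abs(delta_phi(df["PRI_met_phi"], df["PRI_lep_phi"]))
d_tau_lep = np.hypot(df["PRI_tau_eta"] - df["PRI_lep_eta"],
                     delta_phi(df["PRI_tau_phi"], df["PRI_lep_phi"]))
df["dist_ratio_met_lep_tau"] = d_met_lep / d_tau_lep
```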
59. MODELS
• Gradient boosting tree
• Neural network
• Bayesian network
• Support vector machine
• Generalized additive model
61. GRADIENT BOOSTING TREE
• Decision tree – build many shallow trees
• Boosting – build trees based on residuals
• Bagging – each tree uses a subset of the data
• Ensembling – combine the trees
63. DECISION TREE
• Regression tree
[Scatter plot of y vs. x, x in 0–10, y in −1 to 1]
y
64. • Regression
tree
DECISION
TREE
64
1.0
0.5
0.0
−0.5
−1.0
0.0 2.5 5.0 7.5 10.0
x
y
Depth
=
1
|
x< 6.614
x>=6.614
0.19
n=100
−0.08
n=64
0.66
n=36
Regression Tree with Node Depth = 1
65. DECISION TREE
• Regression tree, depth = 2
[Tree: root 0.19 (n=100) splits at x < 6.614; the left node −0.08 (n=64) splits at x = 3.049 into −0.53 (n=40) and 0.67 (n=24); the right node 0.66 (n=36) splits at x = 8.953 into 0.086 (n=7) and 0.8 (n=29). Scatter plot shows the fitted step function]
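The depth-1 vs. depth-2 trees above can be reproduced with scikit-learn; the noisy sine data here is simulated, standing in for the deck's scatter plots.

```python
# Regression trees of increasing depth on noisy sine data. A tree of
# depth d predicts at most 2**d distinct values (one per leaf), which is
# why the fit is a step function.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 100)

pred1 = DecisionTreeRegressor(max_depth=1).fit(X, y).predict(X)  # 2 leaves
pred2 = DecisionTreeRegressor(max_depth=2).fit(X, y).predict(X)  # up to 4 leaves
```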
68. DECISION TREE
X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);   # base model
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - wts * latest_model(X);
    tree(ii) = train_tree(X, v_resid, wts);
    latest_model += LEARNING_RATE * tree(ii)
69. GRADIENT BOOSTING TREE (V. 1)
X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);            # get the residuals
    tree_add = train_tree(X, v_resid);        # fit a tree for the residuals
    latest_model += LEARNING_RATE * tree_add  # additive model
70. (STOCHASTIC) GRADIENT BOOSTING TREE
X0 = X; Y0 = Y;                                           # store input
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC) # get sampled index
    X = X0[Index_train]; Y = Y0[Index_train];             # sampled records as input
    v_resid = Y - latest_model(X);
    tree_add = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree_add
71. (STOCHASTIC) GRADIENT BOOSTING TREE WITH WEIGHT
X0 = X; Y0 = Y;
latest_model = train_tree(X, Y, wts);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - wts * latest_model(X);
    tree_add = train_tree(X, v_resid, wts);
    latest_model += LEARNING_RATE * tree_add
72. (GENERAL) GRADIENT BOOSTING
X0 = X; Y0 = Y;
latest_model = train_base_model(X, Y, wts);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_pseudo_resid = get_pseudo_residual(X, Y, wts, latest_model, LOSS_FUNCTION_TYPE);
    model_add_base = train_base_model(X, v_pseudo_resid, wts);
    alpha = linear_search(cost_function, model_add_base, X, Y, wts);
    latest_model += LEARNING_RATE * (alpha * model_add_base)
[Stochastic Gradient Boosting], Jerome H. Friedman, 1999
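The stochastic boosting loop above can be made runnable in a few lines. This sketch uses squared-error loss (so the pseudo-residual is simply Y minus the current prediction) and scikit-learn trees as the base learner; the data and constants are illustrative, not the deck's.

```python
# Minimal stochastic gradient boosting with regression trees.
# Constant names (NUM_ITER, FRAC_TRAIN, LEARNING_RATE) mirror the pseudocode.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
NUM_REC, NUM_ITER, FRAC_TRAIN, LEARNING_RATE = 200, 50, 0.7, 0.1

X0 = rng.uniform(0, 10, size=(NUM_REC, 1))
Y0 = np.sin(X0[:, 0]) + rng.normal(0, 0.1, NUM_REC)

base = np.mean(Y0)   # base model: a constant fit
trees = []

def predict(X):
    pred = np.full(len(X), base)
    for t in trees:
        pred += LEARNING_RATE * t.predict(X)
    return pred

for ii in range(NUM_ITER):
    idx = rng.choice(NUM_REC, int(FRAC_TRAIN * NUM_REC), replace=False)  # sampled index
    X, Y = X0[idx], Y0[idx]                      # sampled records as input
    v_resid = Y - predict(X)                     # pseudo-residual for L2 loss
    trees.append(DecisionTreeRegressor(max_depth=3).fit(X, v_resid))

mse = np.mean((predict(X0) - Y0) ** 2)
```

Each tree fits what the ensemble so far gets wrong, and the small learning rate keeps any single tree from dominating the additive model.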
76. APPLY MODEL ON TEST DATA

EventId   Score  RankOrder  Class
1         0.98   501        s
2         0.42   259,579    b
3         0.46   264,125    b
...
449,998   0.86   31,154     s
449,999   0.12   489,251    b
550,000   0.79   110,154    b
78. GRADIENT BOOSTING PARAMETERS
• Number of iterations
• Minimum observations for each node
• Fraction of bagging (0.5 ~ 0.8)
• Learning rate (< 0.1)
• Depth of tree (4 ~ 8)
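One way the knobs above map onto scikit-learn's GradientBoostingClassifier, as a sketch (the parameter names are scikit-learn's; the specific values are only illustrative picks within the slide's suggested ranges):

```python
# Gradient boosting parameters from the slide, expressed in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=200,      # number of iterations
    min_samples_leaf=20,   # minimum observations for each node
    subsample=0.7,         # fraction of bagging (0.5 ~ 0.8)
    learning_rate=0.05,    # learning rate (< 0.1)
    max_depth=5,           # depth of tree (4 ~ 8)
    random_state=0,
).fit(X, y)

acc = clf.score(X, y)
```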
80. CROSS VALIDATION
• Split training data
  – 70% for training
  – 30% for cross validation
• Train model (70%)
• Measure performance (30%)
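The 70/30 split above, sketched with scikit-learn's train_test_split on toy data:

```python
# Hold out 30% of the training data to measure model performance.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = (X[:, 0] > 50).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)
```

The held-out 30% is never used for fitting, so the score measured on it is an honest estimate of generalization performance.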
81. PERFORMANCE BASED ON AMS
Trade-off between:
• the ratio of signal/background events, and
• the number of records in the selection region

EventId   Score  RankOrder  Class  Truth
1         0.98   501        S      S
2         0.42   259,579    B
3         0.46   264,125    B
...
449,998   0.86   31,154     S      B
449,999   0.12   489,251    B
550,000   0.79   110,154    B

Selection region: s = sum(S), b = sum(B)
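For reference, the competition scored submissions with the approximate median significance (AMS), where s and b are the weighted sums of signal and background events inside the selection region and b_reg = 10 is the regularization constant from the challenge documentation:

```python
# AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s))
import math

def ams(s, b, b_reg=10.0):
    """Approximate median significance of a selection region."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))
```

The formula captures the trade-off on the slide: adding records grows s but also b, and AMS only improves when the added signal outweighs the added background.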
90. HEAT MAP OF AMS ON B-S PLANE
[Heat map of the AMS score over the (b, s) plane, with marked points A, B, C]
Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function.
91. AMS CURVES ON B-S PLANE
[Level curves of AMS on the (b, s) plane with points A, B, C; arrows indicate the partial derivatives of AMS against s and against b]
Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function.
The ratio of the derivatives gives the relative weight.
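The partial derivatives referred to above have a closed form. Differentiating AMS(s, b) = sqrt(2((s + b + b_reg) ln(1 + s/(b + b_reg)) − s)) gives dAMS/ds = ln(1 + r)/AMS and dAMS/db = (ln(1 + r) − r)/AMS with r = s/(b + b_reg); this derivation is mine, not spelled out in the deck, so it is checked against a finite difference below.

```python
# Closed-form gradient of the AMS score, for weighting events by their
# marginal effect on the objective (the Lagrangian-style idea above).
import math

def ams(s, b, br=10.0):
    return math.sqrt(2.0 * ((s + b + br) * math.log(1.0 + s / (b + br)) - s))

def ams_grad(s, b, br=10.0):
    """Return (dAMS/ds, dAMS/db)."""
    r = s / (b + br)
    a = ams(s, b, br)
    return math.log(1.0 + r) / a, (math.log(1.0 + r) - r) / a
```

The s-derivative is positive and the b-derivative negative, so their ratio gives the relative value of one extra signal event versus one avoided background event.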
100. OTHER TOPICS
• Version control (Git, SourceTree)
  – Effectively implement many different ideas
• File organization
  – Efficiently pull out the file needed
• Effective code (R, Python)
  – It matters greatly when dealing with big data
101. Thank you for your participation! Any questions?
goDCI.com