2. COMP24111 Machine Learning
Outline
• Background
• Probability Basics
• Probabilistic Classification
• Naïve Bayes
– Principle and Algorithms
– Example: Play Tennis
• Relevant Issues
• Summary
Background
• There are three approaches to establishing a classifier
a) Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class memberships given input data
Example: perceptron with the cross-entropy cost
c) Make a probabilistic model of data within each class
Examples: naive Bayes, model based classifiers
• a) and b) are examples of discriminative classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification
Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)
• Bayesian Rule
P(C|X) = P(X|C)P(C) / P(X), i.e. Posterior = Likelihood × Prior / Evidence
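As a quick numeric check of the Bayesian rule, the sketch below plugs in made-up values for the prior and the likelihoods (all numbers are illustrative, not from the slides):

```python
# Made-up values: prior P(C), likelihood P(X|C), and P(X|not C)
p_c = 0.3
p_x_given_c = 0.8
p_x_given_not_c = 0.2

# Evidence via the law of total probability, then Bayes' rule
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)
posterior = p_x_given_c * p_c / p_x   # Posterior = Likelihood x Prior / Evidence
print(round(posterior, 4))
```

Note that the evidence P(X) only rescales the numerator, which is why classification rules later in the deck can drop it.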
Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, consider the
following events: (A) die 1 lands on side “3”, (B) die 2 lands on
side “1”, and (C) the two dice sum to eight. Answer the following questions:
1) P(A) = ?
2) P(B) = ?
3) P(C) = ?
4) P(A|B) = ?
5) P(C|A) = ?
6) P(A, B) = ?
7) P(A, C) = ?
8) Is P(A, C) equal to P(A) × P(C)?
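The quiz can be checked by brute-force enumeration of the 36 equally likely outcomes; a minimal sketch:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event as an exact fraction over the sample space."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 3            # die 1 lands on "3"
B = lambda o: o[1] == 1            # die 2 lands on "1"
C = lambda o: o[0] + o[1] == 8     # the two dice sum to eight

p_A, p_B, p_C = prob(A), prob(B), prob(C)
p_AB = prob(lambda o: A(o) and B(o))   # P(A, B) = 1/36
p_AC = prob(lambda o: A(o) and C(o))   # P(A, C) = 1/36
p_A_given_B = p_AB / p_B               # P(A | B) = 1/6, so A and B are independent
p_C_given_A = p_AC / p_A               # P(C | A) = 1/6
print(p_AC == p_A * p_C)               # False: A and C are not independent
```

The last line answers question 8: P(A, C) = 1/36 while P(A)P(C) = 5/216, so the two events are dependent.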
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model: P(C|X), where C = c1, ···, cL and X = (X1, ···, Xn)
[Figure: a discriminative probabilistic classifier takes an input
x = (x1, x2, ···, xn) and outputs the posteriors P(c1|x), P(c2|x), ···, P(cL|x)]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
– Generative model: P(X|C), where C = c1, ···, cL and X = (X1, ···, Xn)
[Figure: one generative probabilistic model per class; for an input
x = (x1, x2, ···, xn), the model for class i outputs the likelihood P(x|ci),
for i = 1, ···, L]
Probabilistic Classification
• MAP classification rule
– MAP: Maximum A Posteriori
– Assign x to c* if P(C = c*|X = x) > P(C = c|X = x) for all c ≠ c*, c = c1, ···, cL
• Generative classification with the MAP rule
– Apply the Bayesian rule to convert likelihoods into posterior probabilities:
P(C = ci|X = x) = P(X = x|C = ci)P(C = ci) / P(X = x)
∝ P(X = x|C = ci)P(C = ci), for i = 1, 2, ···, L
– Then apply the MAP rule
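The MAP rule can be sketched with hypothetical numbers (the priors and likelihoods below are made up purely for illustration):

```python
# Hypothetical class priors P(C=ci) and likelihoods P(X=x|C=ci) for one fixed x
priors = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
likelihoods = {"c1": 0.1, "c2": 0.4, "c3": 0.3}

# MAP rule: argmax over P(X=x|C=ci) * P(C=ci); the evidence P(X=x) cancels
scores = {c: likelihoods[c] * priors[c] for c in priors}
c_star = max(scores, key=scores.get)
print(c_star)  # "c2", since 0.3 * 0.4 = 0.12 beats 0.05 and 0.06
```

Because the evidence P(X = x) is the same for every class, comparing the unnormalised products is equivalent to comparing posteriors.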
Naïve Bayes
• Bayes classification
P(C|X) ∝ P(X|C)P(C) = P(X1, ···, Xn|C)P(C)
Difficulty: learning the joint probability P(X1, ···, Xn|C)
• Naïve Bayes classification
– Assumption: all input features are conditionally independent!
P(X1, X2, ···, Xn|C) = P(X1|X2, ···, Xn, C)P(X2, ···, Xn|C)
= P(X1|C)P(X2, ···, Xn|C)
= P(X1|C)P(X2|C) ··· P(Xn|C)
– MAP classification rule: for x = (x1, x2, ···, xn), assign x to c* if
[P(x1|c*) ··· P(xn|c*)]P(c*) > [P(x1|c) ··· P(xn|c)]P(c), c ≠ c*, c = c1, ···, cL
Naïve Bayes
• Algorithm: Discrete-Valued Features
– Learning Phase: Given a training set S,
For each target value ci (i = 1, ···, L):
P̂(C = ci) ← estimate P(C = ci) with examples in S;
For every feature value xjk of each feature Xj (j = 1, ···, n; k = 1, ···, Nj):
P̂(Xj = xjk|C = ci) ← estimate P(Xj = xjk|C = ci) with examples in S;
Output: conditional probability tables; for Xj, Nj × L elements
– Test Phase: Given an unknown instance X′ = (a′1, ···, a′n),
look up tables to assign the label c* to X′ if
[P̂(a′1|c*) ··· P̂(a′n|c*)]P̂(c*) > [P̂(a′1|c) ··· P̂(a′n|c)]P̂(c), c ≠ c*, c = c1, ···, cL
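The learning and test phases above can be sketched as counting-based table estimation plus a MAP look-up. The toy data at the bottom is a hypothetical miniature of the Play Tennis set, not the full table from the slides:

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """Learning phase: estimate P(C=c) and P(Xj=v|C=c) by counting in S.
    examples: list of (feature_tuple, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    cond_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for x, label in examples:
        for j, v in enumerate(x):
            cond_counts[(j, label)][v] += 1
    n = len(examples)
    priors = {c: k / n for c, k in class_counts.items()}

    def cond(j, v, c):
        # P-hat(Xj = v | C = c), read off the conditional probability table
        return cond_counts[(j, c)][v] / class_counts[c]

    return priors, cond

def predict_nb(priors, cond, x):
    """Test phase: MAP rule over [prod_j P(xj|c)] * P(c)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            score *= cond(j, v, c)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical mini data: (Outlook, Wind) -> Play
examples = [(("Sunny", "Strong"), "No"), (("Sunny", "Weak"), "No"),
            (("Overcast", "Weak"), "Yes"), (("Rain", "Weak"), "Yes")]
priors, cond = train_nb(examples)
print(predict_nb(priors, cond, ("Overcast", "Weak")))  # "Yes"
```

Note that this plain counting estimate can return zero probabilities for unseen feature values; the "Relevant Issues" section later in the deck addresses that with the m-estimate.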
Example
• Learning Phase
Outlook Play=Yes Play=No
Sunny 2/9 3/5
Overcast 4/9 0/5
Rain 3/9 2/5
Temperature Play=Yes Play=No
Hot 2/9 2/5
Mild 4/9 2/5
Cool 3/9 1/5
Humidity Play=Yes Play=No
High 3/9 4/5
Normal 6/9 1/5
Wind Play=Yes Play=No
Strong 3/9 3/5
Weak 6/9 2/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14
Example
• Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=Yes) = 2/9, P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9, P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9, P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9, P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14, P(Play=No) = 5/14
– Decision making with the MAP rule
P(Yes|x’) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’) ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206
Since P(Yes|x’) < P(No|x’), we label x’ as “No”.
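The two products above can be reproduced directly from the table entries:

```python
# x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
# Multiply the looked-up conditionals by the class prior, per the MAP rule
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~ 0.0206
label = "No" if p_no > p_yes else "Yes"
print(round(p_yes, 4), round(p_no, 4), label)
```

These are unnormalised scores, not probabilities; dividing each by their sum would give the actual posteriors, but the argmax (and hence the label) is unchanged.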
Naïve Bayes
• Algorithm: Continuous-Valued Features
– A feature can take uncountably many values, so probability tables are infeasible
– The conditional probability is often modelled with the normal distribution:
P̂(Xj|C = ci) = (1 / (√(2π) σji)) exp( −(Xj − µji)² / (2σji²) )
µji: mean (average) of feature Xj values of examples for which C = ci
σji: standard deviation of feature Xj values of examples for which C = ci
– Learning Phase: for X = (X1, ···, Xn), C = c1, ···, cL,
Output: n × L normal distributions and P(C = ci), i = 1, ···, L
– Test Phase: Given an unknown instance X′ = (a′1, ···, a′n),
• Instead of looking up tables, calculate conditional probabilities with the
normal distributions obtained in the learning phase
• Apply the MAP rule to make a decision
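The class-conditional density above is just the normal pdf evaluated with per-class parameters; a minimal sketch (the example call uses the Yes-class parameters from the temperature example that follows):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density, used here as the class-conditional P-hat(Xj | C=ci)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

# Density of temperature 20.0 under the "Yes" model (mu=21.64, sigma=2.35)
print(gaussian_pdf(20.0, 21.64, 2.35))
```

Unlike a table entry, this is a density rather than a probability, but the MAP comparison works the same way since only relative magnitudes matter.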
Naïve Bayes
• Example: Continuous-Valued Features
– Temperature is naturally a continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and (sample) variance for each class:
µ = (1/N) Σn xn, σ² = (1/(N−1)) Σn (xn − µ)²
µ_Yes = 21.64, σ_Yes = 2.35
µ_No = 23.88, σ_No = 7.09
– Learning Phase: output two Gaussian models for P(temp|C):
P̂(x|Yes) = (1 / (2.35 √(2π))) exp( −(x − 21.64)² / (2 × 2.35²) )
P̂(x|No) = (1 / (7.09 √(2π))) exp( −(x − 23.88)² / (2 × 7.09²) )
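The slide's parameter estimates can be reproduced from the raw temperatures; the values match when the sample (N−1) variance is used, which appears to be what the slide computed:

```python
import math

yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no = [27.3, 30.1, 17.4, 29.5, 15.1]

def mean_std(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
    return mu, math.sqrt(var)

mu_yes, s_yes = mean_std(yes)   # ~ 21.64, 2.35
mu_no, s_no = mean_std(no)      # ~ 23.88, 7.09
print(round(mu_yes, 2), round(s_yes, 2), round(mu_no, 2), round(s_no, 2))
```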
Relevant Issues
• Violation of the Independence Assumption
– For many real-world tasks, P(X1, ···, Xn|C) ≠ P(X1|C) ··· P(Xn|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
– If no training example contains the feature value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
– In this circumstance, during testing the whole product collapses:
P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0
– As a remedy, conditional probabilities are estimated with
P̂(Xj = ajk|C = ci) = (nc + m·p) / (n + m)
n: number of training examples for which C = ci
nc: number of training examples for which Xj = ajk and C = ci
p: prior estimate (usually p = 1/t for t possible values of Xj)
m: weight given to the prior (number of “virtual” examples, m ≥ 1)
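The m-estimate remedy is a one-line formula; the example call below reuses the earlier Play Tennis table, where Outlook=Overcast never occurs with Play=No (0 of 5 examples):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate P-hat(Xj=ajk | C=ci) = (n_c + m*p) / (n + m).
    n_c: training examples with Xj = ajk and C = ci
    n:   training examples with C = ci
    p:   prior estimate, usually 1/t for t possible values of Xj
    m:   weight given to the prior (number of "virtual" examples)"""
    return (n_c + m * p) / (n + m)

# With t = 3 outlook values (p = 1/3) and m = 3 virtual examples,
# the zero count no longer annihilates the product:
print(m_estimate(0, 5, 1/3, 3))  # 0.125 instead of 0
```

The choice m = 3 here is an illustrative assumption; larger m pulls estimates harder toward the prior p.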
Summary
• Naïve Bayes: based on the conditional independence assumption
– Training is very easy and fast; it only requires considering each
feature in each class separately
– Testing is straightforward; just look up tables or calculate
conditional probabilities with the estimated distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers,
even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate for a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…
Editor's Notes
For each class, the generative model above can be decomposed into n generative models, one for each input feature.