2. SHIWANI GUPTA 2
Sources of Uncertainty
• Information is partial.
• Information is not fully reliable.
• Representation language is inherently imprecise.
• Information comes from multiple sources and is
conflicting.
• Information is approximate.
• Non-absolute cause-effect relationships exist.
Uncertainty
Let action At = leave for airport t minutes before flight
Will At get me there on time?
Problems:
1. partial observability (road state, other drivers' plans, etc.)
2. noisy sensors (traffic reports)
3. uncertainty in action outcomes (flat tire, etc.)
4. immense complexity of modeling and predicting traffic
Hence a purely logical approach either
1. risks falsehood:
“A25 will get me there on time”, or
2. leads to conclusions that are too weak for decision making:
“A25 will get me there on time if there's no accident on the bridge
and it doesn't rain and my tires remain intact etc. etc.”
(A1440 might reasonably be said to get me there on time but I'd have to
stay overnight in the airport …)
Methods for handling uncertainty
• Default or nonmonotonic logic: (by default, something is true)
– Assume my car does not have a flat tire
– Assume A25 works unless contradicted by evidence
• Issues: What assumptions are reasonable? How to handle
contradiction?
• Rules with fudge factors:
– A25 |→0.3 get there on time
– Sprinkler |→ 0.99 WetGrass
– WetGrass |→ 0.7 Rain
• Issues: Problems with combination, e.g., Sprinkler causes Rain??
• Probability
– Model agent's degree of belief
– Given the available evidence, A25 will get me there on time with
probability 0.04
Basic Probability Theory
• An experiment has a set of potential outcomes, e.g.,
throwing a die
• The sample space of an experiment is the set of all
possible outcomes, e.g., {1, 2, 3, 4, 5, 6}
• An event is a subset of the sample space.
– {2}
– {3, 6}
– even = {2, 4, 6}
– odd = {1, 3, 5}
Probability
Probabilistic assertions summarize effects of
– laziness: failure to enumerate exceptions, qualifications, etc.
– ignorance: lack of relevant facts, initial conditions, etc.
Subjective probability:
• Probabilities relate propositions to agent's own state of knowledge
e.g., P(A25 | no reported accidents) = 0.06
These are not assertions about the world
Probabilities of propositions change with new evidence:
e.g., P(A25 | no reported accidents, 5 a.m.) = 0.15
Probability as Relative Frequency
• An event has a probability.
• Consider a long sequence of experiments. If we look at the
number of times a particular event occurs in that sequence,
and compare it to the total number of experiments, we can
compute a ratio.
• This ratio is one way of estimating the probability of the
event.
• P(E) = (# of times E occurred)/(total # of trials)
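The relative-frequency estimate above can be illustrated with a short simulation (a sketch; the die, the event "even", and the trial count are illustrative choices, not from the slides):

```python
import random

random.seed(0)

# Estimate P(even) for a fair die by relative frequency:
# P(E) = (# of times E occurred) / (total # of trials)
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) % 2 == 0)
print(hits / trials)  # close to the true value 3/6 = 0.5
```

With more trials the ratio converges toward the true probability, which is the point of the frequency interpretation.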
Making decisions under uncertainty
Suppose I believe the following:
P(A25 gets me there on time | …) = 0.04
P(A90 gets me there on time | …) = 0.70
P(A120 gets me there on time | …) = 0.95
P(A1440 gets me there on time | …) = 0.9999
• Which action to choose?
Depends on my preferences for missing flight vs. time spent
waiting, etc.
– Utility theory is used to represent and infer preferences
– Decision theory = probability theory + utility theory
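The choice among actions can be sketched as an expected-utility computation. The utility numbers below (1000 for catching the flight, 1 utility point lost per minute of waiting) are hypothetical, chosen only to illustrate how probability and utility combine:

```python
# Probabilities from the slide; utilities are assumed for illustration.
p_on_time = {"A25": 0.04, "A90": 0.70, "A120": 0.95, "A1440": 0.9999}
wait_minutes = {"A25": 25, "A90": 90, "A120": 120, "A1440": 1440}

def expected_utility(action):
    # EU = P(on time) * value of catching flight - cost of waiting
    p = p_on_time[action]
    return p * 1000 - wait_minutes[action]

best = max(p_on_time, key=expected_utility)
print(best)  # A120 under these assumed utilities
```

Note that the answer depends on the assumed preferences: someone who values waiting time differently may rationally choose A90 or A1440 instead.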
Syntax
• Basic element: random variable
• Similar to propositional logic: possible worlds defined by assignment
of values to random variables.
• Boolean random variables
e.g., Cavity (do I have a cavity?)
• Discrete random variables
e.g., Weather is one of <sunny,rainy,cloudy,snow>
• Domain values must be exhaustive and mutually exclusive
• Elementary proposition constructed by assignment of a value to a
random variable: e.g., Weather = sunny, Cavity = false (abbreviated as
cavity)
• Complex propositions formed from elementary propositions and
standard logical connectives, e.g., Weather = sunny ∨ Cavity = false
Syntax
• Atomic event: A complete specification of the state of the
world about which the agent is uncertain
e.g., if the world consists of only two Boolean variables
Cavity and Toothache, then there are 4 distinct atomic
events:
Cavity = false ∧ Toothache = false
Cavity = false ∧ Toothache = true
Cavity = true ∧ Toothache = false
Cavity = true ∧ Toothache = true
• Atomic events are mutually exclusive and exhaustive
Axioms of probability
• For any propositions A, B
– 0 ≤ P(A) ≤ 1
– P(true) = 1 and P(false) = 0
– P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Axioms of Probability Theory
• Suppose P(.) is a probability function, then
1. for any event E, 0≤ P(E) ≤1.
2. P(S) = 1, where S is the sample space.
3. for any two mutually exclusive events E1 and E2,
P(E1 ∪ E2) = P(E1) + P(E2)
• Any function that satisfies the above three axioms is
a probability function.
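The three axioms can be checked mechanically for the die example (a sketch using exact fractions; the uniform probability function is the natural choice for a fair die):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}        # sample space of a fair die
def P(event):
    # Uniform probability function: |event| / |S|
    return Fraction(len(event & S), len(S))

even, odd = {2, 4, 6}, {1, 3, 5}
assert all(0 <= P({o}) <= 1 for o in S)    # axiom 1: 0 <= P(E) <= 1
assert P(S) == 1                           # axiom 2: P(S) = 1
assert P(even | odd) == P(even) + P(odd)   # axiom 3: disjoint events add
print(P(even))  # 1/2
```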
Prior probability
• Prior or unconditional probabilities of propositions
e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72 correspond
to belief prior to arrival of any (new) evidence
• Probability distribution gives values for all possible assignments:
P(Weather) = <0.72,0.1,0.08,0.1> (normalized, i.e., sums to 1)
• Joint probability distribution for a set of random variables gives the
probability of every atomic event on those random variables
P(Weather,Cavity) = a 4 × 2 matrix of values:
Weather = sunny rainy cloudy snow
Cavity = true 0.144 0.02 0.016 0.02
Cavity = false 0.576 0.08 0.064 0.08
Every question about a domain can be answered by the joint distribution
Joint Probability
• Let A, B be two events, the joint probability of both A and B being true
is denoted by P(A, B).
• Example:
P(spade) is the probability of the top card being a spade.
P(king) is the probability of the top card being a king.
P(spade, king) is the probability of the top card being both a spade and
a king, i.e., the king of spades.
P(king, spade) = P(spade, king) ???
Properties of Probability
1. P(¬E) = 1 − P(E)
2. If E1 and E2 are logically equivalent, then
P(E1)=P(E2).
– E1: Not all philosophers are more than six feet tall.
– E2: Some philosopher is not more than six feet tall.
Then P(E1)=P(E2).
3. P(E1, E2)≤P(E1).
Conditional probability
• Conditional or posterior probabilities
e.g., P(cavity | toothache) = 0.8
i.e., given that toothache is all I know
• If we know more, e.g., cavity is also given, then we have
P(cavity | toothache,cavity) = 1
• New evidence may be irrelevant, allowing simplification,
e.g.,
P(cavity | toothache, sunny) = P(cavity | toothache) = 0.8
Conditional probability
• Definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b) if P(b) > 0
• Product rule gives an alternative formulation:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
• A general version holds for whole distributions, e.g.,
P(Weather, Cavity) = P(Weather | Cavity) P(Cavity)
• (View as a set of 4 × 2 equations, not matrix multiplication)
• Chain rule is derived by successive application of product rule:
P(X1, …, Xn) = P(X1, …, Xn−1) P(Xn | X1, …, Xn−1)
= P(X1, …, Xn−2) P(Xn−1 | X1, …, Xn−2) P(Xn | X1, …, Xn−1)
= …
= ∏_{i=1}^{n} P(Xi | X1, …, Xi−1)
Example
A: the top card of a deck of poker cards is the king of spades
P(A) = 1/52
However, if we know
B: the top card is a king
then, the probability of A given B is true is
P(A|B) = 1/4.
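The card example can be reproduced by counting outcomes (a sketch; the deck encoding as (rank, suit) pairs is an illustrative choice):

```python
from fractions import Fraction

# A 52-card deck as (rank, suit) pairs; 13 = king.
ranks = list(range(1, 14))
suits = ["spade", "heart", "diamond", "club"]
deck = [(r, s) for r in ranks for s in suits]

A = [c for c in deck if c == (13, "spade")]   # king of spades on top
B = [c for c in deck if c[0] == 13]           # a king on top

P_A = Fraction(len(A), len(deck))                      # 1/52
P_B = Fraction(len(B), len(deck))                      # 4/52
P_A_and_B = Fraction(len(set(A) & set(B)), len(deck))  # 1/52
P_A_given_B = P_A_and_B / P_B
print(P_A, P_A_given_B)  # 1/52 and 1/4
```

This is exactly the counting formula of the next slide: P(A|B) = N(A and B) / N(B).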
How to Compute P(A|B)?
P(A|B) = N(A ∧ B) / N(B)
= [N(A ∧ B) / N] / [N(B) / N]
= P(A, B) / P(B)
e.g., P(brown|cow) = N(brown cows) / N(cows) = P(brown cow) / P(cow)
Inference by enumeration
• Start with the joint probability distribution, e.g., the dentist domain
(the entries not printed on the slide are the standard textbook values):

             toothache           ¬toothache
             catch    ¬catch     catch    ¬catch
cavity       0.108    0.012      0.072    0.008
¬cavity      0.016    0.064      0.144    0.576

• For any proposition φ, sum the atomic events where it is true:
P(φ) = Σ_{ω : ω ⊨ φ} P(ω)
Inference by enumeration
• P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
Inference by enumeration
• Start with the joint probability distribution:
• Can also compute conditional probabilities:
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
= 0.08 / 0.2 = 0.4
Normalization
• The denominator can be viewed as a normalization constant α:
P(Cavity | toothache) = α P(Cavity, toothache)
= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
= α [<0.108, 0.016> + <0.012, 0.064>]
= α <0.12, 0.08> = <0.6, 0.4>
General idea: compute the distribution on the query variable by fixing the
evidence variables and summing over the hidden variables
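The enumeration and normalization steps can be sketched in code. The joint table below is the standard dentist example from Russell & Norvig; four of the eight entries appear on these slides, and the rest are the textbook values:

```python
# Full joint for (toothache, catch, cavity).
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(pred):
    """P(phi): sum the atomic events where phi is true."""
    return sum(p for w, p in joint.items() if pred(w))

p_toothache = prob(lambda w: w[0])             # 0.2
p_cavity_and_t = prob(lambda w: w[2] and w[0])  # 0.12
print(round(p_cavity_and_t / p_toothache, 3))   # P(cavity|toothache) = 0.6
```

Dividing by P(toothache) is exactly the normalization step: the two numerators <0.12, 0.08> become the posterior distribution <0.6, 0.4>.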
Independence: Definition
Events A and B are independent iff:
P(A, B) = P(A) x P(B)
which is equivalent to
P(A|B) = P(A) and
P(B|A) = P(B)
when P(A, B) >0.
T1: the first toss is a head.
T2: the second toss is a tail.
P(T2|T1) = P(T2)
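The coin-toss claim can be verified by enumerating the sample space of two tosses (a sketch):

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair coin tosses: HH, HT, TH, TT.
space = list(product("HT", repeat=2))

def P(event):
    return Fraction(sum(1 for w in space if event(w)), len(space))

T1 = lambda w: w[0] == "H"   # first toss is a head
T2 = lambda w: w[1] == "T"   # second toss is a tail

# P(T1, T2) = P(T1) P(T2), hence P(T2 | T1) = P(T2).
assert P(lambda w: T1(w) and T2(w)) == P(T1) * P(T2)
print(P(lambda w: T1(w) and T2(w)) / P(T1))  # P(T2|T1) = 1/2 = P(T2)
```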
Independence
• A and B are independent iff
P(A|B) = P(A) or P(B|A) = P(B) or P(A, B) = P(A) P(B)
P(Toothache, Catch, Cavity, Weather)
= P(Toothache, Catch, Cavity) P(Weather)
• Absolute independence is powerful but rare
• Dentistry is a large field with hundreds of variables, none of which are
independent
Conditional Independence
• Dependent events can become independent given
certain other events.
• Example: Size of shoe, Age, Size of vocabulary
• Two events A, B are conditionally independent given
a third event C iff
P(A|B, C) = P(A|C), i.e., once C is known, learning B as
well does not change the probability of A.
• Equivalent formulations:
P(A, B|C) = P(A|C) P(B|C)
P(B|C, A) = P(B|C)
Conditional independence
• P(Toothache, Cavity, Catch) has 2³ − 1 = 7 independent entries
• If I have a cavity, the probability that the probe catches in it doesn't
depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
• The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
• Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache,Cavity) = P(Catch | Cavity)
• Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch |
Cavity)
Conditional independence contd.
• Write out full joint distribution using chain rule:
P(Toothache, Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch, Cavity)
= P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
i.e., 2 + 2 + 1 = 5 independent numbers
• In most cases, the use of conditional independence reduces the size of
the representation of the joint distribution from exponential in n to
linear in n.
• Conditional independence is our most basic and robust form of
knowledge about uncertain environments.
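The 2 + 2 + 1 = 5 numbers really do determine the full joint. The values below are derived from the textbook dentist joint (P(cavity) = 0.2, P(toothache|Cavity), P(catch|Cavity)); they are not printed on the slide, so treat them as illustrative:

```python
# The 2 + 2 + 1 = 5 independent numbers:
p_cavity = 0.2
p_tooth = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
p_catch = {True: 0.9, False: 0.2}   # P(catch | Cavity)

# Reconstruct all 8 entries of the joint, keyed (toothache, catch, cavity),
# using P(t, c, cav) = P(t|cav) P(c|cav) P(cav).
joint = {}
for cav in (True, False):
    for t in (True, False):
        for c in (True, False):
            pt = p_tooth[cav] if t else 1 - p_tooth[cav]
            pc = p_catch[cav] if c else 1 - p_catch[cav]
            pcav = p_cavity if cav else 1 - p_cavity
            joint[(t, c, cav)] = pt * pc * pcav

print(round(joint[(True, True, True)], 3))  # 0.108
print(round(sum(joint.values()), 6))        # 1.0
```

Five stored numbers instead of seven here; for n conditionally independent effects the saving grows from exponential to linear in n.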
Example: Play Tennis?
Outlook Temperature Humidity Windy Class
sunny hot high false −
sunny hot high true −
overcast hot high false +
rain mild high false +
rain cool normal false +
rain cool normal true −
overcast cool normal true +
sunny mild high false −
sunny cool normal false +
rain mild normal false +
sunny mild normal true +
overcast mild high true +
overcast hot normal false +
rain mild high true −
Predict playing tennis for the new instance <sunny, cool, high, true>
What probability should be used to make the prediction?
How to compute the probability?
Probabilities of Individual Attributes
Given the training set, we can compute the probabilities
Outlook      +    −       Humidity   +    −
sunny        2/9  3/5     high       3/9  4/5
overcast     4/9  0       normal     6/9  1/5
rain         3/9  2/5
Temperature  +    −       Windy      +    −
hot          2/9  2/5     true       3/9  3/5
mild         4/9  2/5     false      6/9  2/5
cool         3/9  1/5
P(+) = 9/14
P(−) = 5/14
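Multiplying the per-class probabilities from the tables above gives the naïve Bayes prediction for the new instance <sunny, cool, high, true> (a sketch using exact fractions):

```python
from fractions import Fraction as F

p_pos, p_neg = F(9, 14), F(5, 14)
# P(attribute | class) for sunny, cool, high, windy=true:
cond_pos = [F(2, 9), F(3, 9), F(3, 9), F(3, 9)]  # given +
cond_neg = [F(3, 5), F(1, 5), F(4, 5), F(3, 5)]  # given −

score_pos = p_pos
for p in cond_pos:
    score_pos *= p          # 1/189 ≈ 0.0053
score_neg = p_neg
for p in cond_neg:
    score_neg *= p          # 18/875 ≈ 0.0206

# Normalize to get the posterior probability of playing.
p_yes = score_pos / (score_pos + score_neg)
print(float(p_yes))  # ≈ 0.205, so the prediction is − (don't play)
```

The multiplication of independent attribute likelihoods is exactly the naïve Bayes model introduced on the later slide on Bayes' rule and conditional independence.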
Bayes' Rule
• Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
• or in distribution form
P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
• Useful for assessing diagnostic probability from causal probability:
– P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
– E.g., let M be meningitis, S be stiff neck:
P(m|s) = P(s|m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008
– Note: posterior probability of meningitis still very small!
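The meningitis calculation can be checked directly:

```python
# P(m | s) = P(s | m) P(m) / P(s), numbers from the slide.
p_s_given_m = 0.8   # stiff neck given meningitis
p_m = 0.0001        # prior probability of meningitis
p_s = 0.1           # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(round(p_m_given_s, 6))  # 0.0008
```

Even with a strong causal link P(s|m) = 0.8, the tiny prior keeps the posterior small, which is the point of the note above.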
Bayes' Rule and conditional independence
P(Cavity | toothache ∧ catch)
= α P(toothache ∧ catch | Cavity) P(Cavity)
= α P(toothache | Cavity) P(catch | Cavity) P(Cavity)
• This is an example of a naïve Bayes model:
P(Cause, Effect1, …, Effectn) = P(Cause) ∏_i P(Effecti | Cause)
• Total number of parameters is linear in n
Problems with Bayes' rule
Applying Bayes' rule directly is intractable in practice:
– Too many probabilities must be provided to acquire the knowledge
– Space required to store them all is too large
– Time required for processing is too much
Solutions
– Attach certainty factors to rules
– Bayesian networks
– Dempster-Shafer theory
DEMPSTER SHAFER THEORY
(generalization of Bayesian Theory)
• The Dempster-Shafer theory is a mathematical theory of evidence
based on belief functions and plausible reasoning, which is used to
combine separate pieces of information (evidence) to calculate the
probability of an event.
• The theory was developed by Arthur P. Dempster (1968) and Glenn
Shafer (1976)
• In this formalism a degree of belief (also referred to as a mass) is
represented as a belief function rather than a Bayesian probability
distribution. Probability values are assigned to sets of possibilities
rather than single events
• It may seem that the well-known probability theory can effectively
model uncertainty. However, there is some information probability
cannot describe.
• For example, ignorance.
• Consider the following example: if we have absolutely no information
about a coin, in probability theory we would assume that it is
50% heads and 50% tails.
• In another scenario, we know the coin is fair, so we know
for a fact that it is 50% heads and 50% tails.
• The two very different scenarios lead to the same numbers, so
representing total ignorance in probability theory is a problem.
• Dempster-Shafer theory can effectively solve this problem:
– In the ignorance scenario, the belief in Heads and the belief in Tails would both be 0.
– In the fair-coin scenario, the belief in Heads would be 0.5, and the belief in Tails
would also be 0.5.
Consider two possible gambles
The first gamble is that we bet on a head turning up when we
toss a coin that is known to be fair. Now consider the second
gamble, in which we bet on the outcome of a fight between
the world's greatest boxer and the world's greatest wrestler.
Assume we are fairly ignorant about martial arts and would
have great difficulty making a choice of who to bet on.
Many people would feel more unsure about taking the second
gamble, in which the probabilities are unknown, rather than
the first gamble, in which the probabilities are easily seen to
be one half for each outcome.
Dempster-Shafer theory allows one to consider the confidence
one has in the probabilities assigned to the various outcomes
What is DST
• Means of manipulating degrees of belief that does not require B(A) +
B(~A) to be equal to 1
• This means that it is possible to believe that something could be both
true and false to some degree
• Consider a situation in which you have three competing hypotheses x,
y, z
• There are 8 possible combinations of true hypotheses:
{x}, {y}, {z}, {x, y}, {x, z}, {y, z}, {x, y, z}, {}
• Initially, without any evidence, you might decide that
all three hypotheses are possible and assign a weight of
1.0 to the set {x, y, z}; all other sets would be assigned
weights of 0.0
• With each new piece of evidence you would begin to
decrease the weight assigned to the set {x, y, z} and
increase some of the other weights, making sure that
the sum of all weights is still 1.0
Belief and plausibility
• In Dempster-Shafer theory, Bel(A) + Bel(¬A) ≤ 1
• Evidence about a proposition s is summarized by an interval
[Belief, Plausibility]:
– Belief: strength of evidence in favor of a set of propositions
– Plausibility: the extent to which evidence in favor of ¬s
leaves room for belief in s
• If we have certain evidence in favor of ¬s, then Bel(¬s) = 1,
Pl(s) = 0, and Bel(s) = 0
• If there is no information, the true likelihood lies in [0, 1]
• As evidence accumulates, the interval shrinks and confidence increases
Belief(A) + Belief(¬A) ≤ 1
Plausibility(A) + Plausibility(¬A) ≥ 1
If A ⊆ B, then
Belief(A) ≤ Belief(B)
Plausibility(A) ≤ Plausibility(B)
Example
Suppose we have a belief of 0.5 and a plausibility of 0.8 for a
proposition, say "the cat in the box is dead."
This means that we have evidence that allows us to state strongly that the
proposition is true with a confidence of 0.5.
However, the evidence contrary to that hypothesis (i.e., "the cat is alive")
only has a confidence of 0.2.
The remaining mass of 0.3 (the gap between the 0.5 supporting evidence
on the one hand and the 0.2 contrary evidence on the other) is
"indeterminate," meaning that the cat could be either dead or alive.
This interval represents the level of uncertainty based on the evidence in
the system.
The null hypothesis is set to zero by definition (it corresponds to "no
solution").
The orthogonal hypotheses "Alive" and "Dead" have probabilities of
0.2 and 0.5, respectively. This could correspond to "Live/Dead Cat
Detector" signals, which have respective reliabilities of 0.2 and 0.5.
Finally, the all-encompassing "Either" hypothesis (which simply
acknowledges there is a cat in the box) picks up the slack so that the
sum of the masses is 1. The support for the "Alive" and "Dead"
hypotheses matches their corresponding masses because they have no
subsets; support for "Either" consists of the sum of all three masses
(Either, Alive, and Dead) because "Alive" and "Dead" are each
subsets of "Either".
The "Alive" plausibility is 1-m(Death) and the "Dead" plausibility is
1-m(Alive). Finally, the "Either" plausibility sums
m(Alive)+m(Dead)+m(Either).
The universal hypothesis ("Either") will always have 100% support
and plausibility; it acts as a checksum of sorts.
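Belief and plausibility can be computed mechanically from a mass assignment (a sketch using the cat example's masses m(Alive) = 0.2, m(Dead) = 0.5, m(Either) = 0.3):

```python
# Masses over subsets of the frame of discernment {"alive", "dead"}.
m = {
    frozenset(): 0.0,                     # null hypothesis: zero by definition
    frozenset({"alive"}): 0.2,
    frozenset({"dead"}): 0.5,
    frozenset({"alive", "dead"}): 0.3,    # "Either" picks up the slack
}

def belief(A):
    """Sum of the masses of all non-empty subsets of A."""
    return sum(p for s, p in m.items() if s and s <= A)

def plausibility(A):
    """Sum of the masses of all sets that intersect A."""
    return sum(p for s, p in m.items() if s & A)

dead = frozenset({"dead"})
print(belief(dead), plausibility(dead))  # 0.5 and 0.8, as in the example
```

Note that Bel(Dead) + Bel(Alive) = 0.7 < 1, which is exactly the property that distinguishes Dempster-Shafer belief from a probability function.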
Example 2
Here is a somewhat more elaborate example where the behavior of
support and plausibility begins to emerge. We're looking at a faraway
object, which can only be colored in one of three colors (red, white,
and blue) through a variety of detector modes:
Although these are rather bad examples, since events of this kind would
not be modeled as disjoint sets in a probability space; rather, the event
"red or blue" would be considered the union of the events "red" and
"blue", so that p(red or white) ≥ p(white) = 0.25 and p(any) = 1. Only
the three disjoint events "Red", "White", and "Blue" would need to add
up to 1. In fact, one could model a probability measure on the space
linearly proportional to "plausibility" (normalized so that
p(red) + p(white) + p(blue) = 1, with the exception that all
probabilities remain ≤ 1).
Bayesian networks
A simple, graphical notation to exhibit interdependencies that exist
between related pieces of knowledge
Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– a conditional distribution for each node given its parents:
P (Xi | Parents (Xi))
• In the simplest case, conditional distribution represented as a
conditional probability table (CPT) giving the distribution over Xi for
each combination of parent values
Example
Topology of network encodes conditional independence
assertions:
• Weather is independent of the other variables
• Toothache and Catch are conditionally independent given
Cavity
Example
I'm at work, neighbor John calls to say my alarm is ringing,
but neighbor Mary doesn't call. Sometimes it's set off by minor
earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls,
MaryCalls
Network topology reflects "causal" knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Semantics
The full joint distribution is defined as the product of the local
conditional distributions:
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | Parents(Xi))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
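The factored joint can be evaluated in a few lines. The CPT numbers below are the standard textbook values for this burglary network (Russell & Norvig); they are assumptions here, since the slide does not list them:

```python
# CPTs for the burglary network (textbook values, not given on the slide).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j | A)
P_m = {True: 0.70, False: 0.01}                      # P(m | A)

# P(j, m, a, not b, not e) = P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e)
p = P_j[True] * P_m[True] * P_a[(False, False)] * (1 - P_b) * (1 - P_e)
print(round(p, 6))  # ≈ 0.000628
```

Five local distributions, multiplied together, replace a full joint table over 2^5 = 32 atomic events.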
Constructing Bayesian networks
1. Choose an ordering of variables X1, … ,Xn
2. For i = 1 to n
– add Xi to the network
– select parents from X1, … ,Xi-1 such that
P (Xi | Parents(Xi)) = P (Xi | X1, ... Xi-1)
This choice of parents guarantees:
P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | X1, …, Xi−1) (chain rule)
= ∏_{i=1}^{n} P(Xi | Parents(Xi)) (by construction)
Example
• Suppose we choose the ordering M, J, A, B, E
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes
P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No
P(E | B, A, J, M) = P(E | A, B)? Yes
Example contd.
• Deciding conditional independence is hard in noncausal
directions
• (Causal models and conditional independence seem hardwired
for humans!)
• Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers
needed