- The document discusses information theory and how it relates to data and machine learning. It explains that data is not the same as information, and that information is measured by entropy, which quantifies uncertainty.
- It works through an example of calculating the entropy of a set of letters with different probabilities. The average number of yes/no questions needed to identify a random letter turns out to equal the entropy.
- Decision trees are discussed as a way to reduce entropy and gain information by splitting data on features. Information gain is defined as the reduction in entropy when a feature is used to split the data.
2. Who am I?
Co-founder of Conductrics software. We help companies solve their last-mile problem for analytics.
Why me?
Conductrics blends ideas from A/B Testing, RL, Statistics, and Information Theory to generate human-interpretable machine learning models that target customers with the experiences they care about. www.conductrics.com
14. Claude Shannon
Studied at MIT. In 1948, while at Bell Labs, published A Mathematical Theory of Communication.
15. But What is Information Really?
What is Information, if not Data?
16. But What is Information Really?
Information in data is equal to the smallest* possible lossless encoding.
*smallest on average
17. The More Predictable, the More Compression, the Less Information
18. Shannon Entropy is a Measure of Unpredictability in Data
19. Entropy is the Minimum Number* of Bits to fully encode data
*average number
20. …or equivalently
21. Entropy is the Minimum Number* of QUESTIONS to identify all possible values in the data
*average number
22. But What is Information Really?
23. I see you are confused. To help with intuition, let's play a game of 20 Questions. What letter am I thinking of: A, B, C, or D?
24. First Approach: Questions to Bits to Information
[Tree diagram: 1st question "Is it 'C' or 'D'?"; on No, 2nd question "More than 1?" separates 1 from 2; on Yes, 2nd question "More than 3?" separates 3 from 4.]
25. [Tree diagram: 1st question "Is it 'C' or 'D'?"; on No, "Is it 'B'?" separates A from B; on Yes, "More than 3?" separates 3 from 4.]
26. [Tree diagram: 1st question "Is it 'C' or 'D'?"; on No, "Is it 'B'?" separates A from B; on Yes, "Is it 'D'?" separates C from D.]
27. We can ALWAYS Pick the Letter after 2 Questions
28. What if we swap 0|1 for No|Yes?
29. Each Y|N Question = 1 Bit
Questions → Bits:
A = 00
B = 01
C = 10
D = 11
30. We can identify each Letter with 2 bits
Letter | Bit1 | Bit2
A | 0 | 0
B | 0 | 1
C | 1 | 0
D | 1 | 1
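A quick sketch of this in Python (not from the slides; the four-letter alphabet is the one from the game): each yes/no question contributes one bit, so four equally likely letters need log2(4) = 2 bits apiece.

```python
import math

# Fixed-length code: one bit per Y|N question.
code = {"A": "00", "B": "01", "C": "10", "D": "11"}

print(math.log2(len(code)))  # -> 2.0 bits per letter

# Encoding a message is just concatenating per-letter codewords.
message = "ADBC"
print("".join(code[ch] for ch in message))  # -> 00110110
```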
31. So the number of bits in this case is equal to the number of Y|N Questions
33. It gets better. What if I think of the letters based on this prior distribution?*
A = 50%
B = 25%
C = 12.5%
D = 12.5%
*From http://www.inference.org.uk/mackay/itprnn/
34. Can we now do better, on average, than 2 Questions, or 2 Bits?
35. I don't understand. Maybe you can show me how this works.
36. [Tree diagram, 1st question: "Is it A?"; Yes → A.]
37. [Tree diagram: "Is it A?"; Yes → A; No → 2nd question "Is it B?"; Yes → B.]
38. [Tree diagram: "Is it A?" → A; else "Is it B?" → B; No → 3rd question "Is it C?"; Yes → C.]
39. [Tree diagram, complete: "Is it A?" → A; else "Is it B?" → B; else "Is it C?" → C; else D.]
40. [Same tree, with each No|Yes answer written as a bit.]
Data Encoding:
A = 1
B = 01
C = 001
D = 000
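A small Python sketch of this prefix code (the probabilities and codewords come straight from the slides; the decoder is my own illustration). Because no codeword is a prefix of another, the bit stream decodes without separators.

```python
code = {"A": "1", "B": "01", "C": "001", "D": "000"}
prob = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Expected code length: sum of P(letter) * codeword length.
print(sum(prob[ch] * len(code[ch]) for ch in code))  # -> 1.75 bits

def decode(bits):
    """Emit a letter whenever the buffer matches a complete codeword."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(decode("101001000"))  # -> ABCD
```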
41. In general, the average number of questions needed is:
$\sum_{i=1}^{k} P(\mathrm{Letter}_i) \cdot \mathrm{NumQuestions}_i$
42. Average Number of Questions Needed?
$\sum_{i=1}^{4} P(\mathrm{Letter}_i) \cdot \mathrm{NumQuestions}_i$
= 0.5·1 (A) + 0.25·2 (B) + 0.125·3 (C) + 0.125·3 (D) = 1.75
43. Can we on average do better than 2 Questions or bits?
44. Can we on average do better than 2 Questions or bits? YES!!!
45. How much better depends on the prior probabilities
46. The more predictable the data, the fewer questions or bits needed.
Maximum = 2 | Our Case = 1.75 | Minimum = 0
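A quick numeric check of those three cases, as a sketch: the quantity below is the average number of questions under the best possible questioning scheme, which the next slides identify as entropy (for the power-of-1/2 probabilities used here the formula is exact).

```python
import math

def min_avg_questions(probs):
    """-sum(p * log2(p)); terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(min_avg_questions([0.25, 0.25, 0.25, 0.25]))   # -> 2.0  (maximum: uniform)
print(min_avg_questions([0.5, 0.25, 0.125, 0.125]))  # -> 1.75 (our case)
print(min_avg_questions([1.0, 0, 0, 0]))             # -> 0.0  (no uncertainty)
```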
49. From Data to Information
Information Content of an individual event, or value, like a letter
50. From Data to Information
Information Content of an individual event, or value, like a letter:
$-\log_2 P(x_i)$
51. From Data to Information
Information Content of an Event $x_i$: $-\log_2 P(x_i)$, where $P(x_i)$ is the probability of the event
52. From Data to Information
Information Content of an Event $x_i$: $-\log_2 P(x_i)$
Information is measured in Bits!
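As a one-line sketch of the formula on the letters example (prior from slide 33):

```python
import math

prob = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
for letter, p in prob.items():
    print(letter, -math.log2(p))
# A 1.0, B 2.0, C 3.0, D 3.0: the rarer the event, the more bits it carries.
```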
53. Entropy: $H(x) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)$
1) Calculate the Information in each event
2) Take the Weighted Average of the Information
56. Entropy: $H(x) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)$
Hmm… looks suspiciously similar to the average-number-of-questions calculation
57. Entropy: H(x)
Calculate the Entropy of our Letters:
A = 50%
B = 25%
C = 12.5%
D = 12.5%
58. Entropy: H(x)
First calculate the $\log_2 P(x_i)$:
A: log2(0.5) = −1
B: log2(0.25) = −2
C: log2(0.125) = −3
D: log2(0.125) = −3
59. Entropy: $H(x) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)$
= −1·(0.5·−1 (A) + 0.25·−2 (B) + 0.125·−3 (C) + 0.125·−3 (D)) = 1.75
60. Entropy: $H(x) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)$
= −1·(0.5·−1 + 0.25·−2 + 0.125·−3 + 0.125·−3) = 1.75
But that's exactly the same result!
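The same calculation as a small reusable function, as a sketch:

```python
import math

def entropy(probs):
    """H(x) = -sum(P(x_i) * log2(P(x_i))): the weighted average
    of each event's information content."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # -> 1.75, matching the
                                           # average-questions result
```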
61. Entropy = Number of Bits = Number of Questions
62. What can I do with it?
63. Information Gain
Turns prediction/targeting into a problem of reducing entropy
64. Information Gain
The Key Idea Behind Decision Trees
Example: Conductrics Predictive Segments
65. Information Gain
$I(x; y) = H(x) - H(x|y)$
where $H(x)$ is the entropy of the target variable and $H(x|y)$ is the conditional entropy of the target given feature $y$
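A minimal sketch of this definition in Python (helper names are my own; both entropies are estimated from observed label counts):

```python
from collections import Counter
import math

def entropy(labels):
    """H(x) from a list of observed labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(target, feature):
    """I(x;y) = H(x) - H(x|y): target entropy minus the weighted
    entropy of the target within each feature value."""
    n = len(target)
    h_cond = 0.0
    for value in set(feature):
        subset = [t for t, f in zip(target, feature) if f == value]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(target) - h_cond

# Toy check: a feature identical to the target removes all uncertainty;
# an unrelated feature gains nothing.
y = ["N", "N", "Y", "Y"]
print(information_gain(y, y))                     # -> 1.0
print(information_gain(y, ["a", "b", "a", "b"]))  # -> 0.0
```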
66. Example Data
Convert | Windows | Mobile
N | N | N
N | N | N
N | N | N
N | N | N
N | N | Y
N | Y | Y
Y | N | Y
Y | Y | N
Y | Y | N
Y | Y | N
Y | Y | Y
Y | Y | Y
68. Information Gain
Gain for Windows*: 1 − (0.5·0.65 + 0.5·0.65) = 0.35
Gain for Mobile: 1 − (0.42·0.44 + 0.58·0.99) = 0.24
*Note: In this case the Windows Yes and No branches are symmetrical, so the results for each are the same, but this need not be true. You can see this in Mobile: the two branches are not the same.
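Reusing the information_gain sketch from slide 65 on the columns as transcribed above reproduces the Windows figure (the Mobile column as reprinted here may not match the chart the slide's 0.24 was read from, so only the Windows number is checked):

```python
from collections import Counter
import math

def H(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

convert = list("NNNNNNYYYYYY")  # target column from the Example Data slide
windows = list("NNNNNYNYYYYY")  # Windows column

# H(Convert|Windows): weighted entropy within each Windows value.
h_cond = sum(
    windows.count(v) / len(windows)
    * H([c for c, w in zip(convert, windows) if w == v])
    for v in set(windows)
)
print(round(H(convert) - h_cond, 2))  # -> 0.35, the slide's Windows gain
```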
70. Information Gain
Split By Windows
[Chart: overall conversion rate vs. conversion rate for Windows = No and Windows = Yes.]
71. Decision Trees
Repeat for each leaf node. Stop when:
• Node size is below threshold
• Information gain is below threshold
• Tree depth reaches a defined limit
These are all parameters YOU set to control overfitting (see the sketch below).
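As a sketch of these knobs in practice: scikit-learn's decision tree (one common implementation, not the one Conductrics uses) exposes each stopping rule as a constructor parameter; the thresholds below are arbitrary placeholder values.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",         # split by information gain
    min_samples_leaf=20,         # node-size threshold
    min_impurity_decrease=0.01,  # information-gain threshold
    max_depth=4,                 # tree-depth limit
)
# tree.fit(X, y)  # X, y: your feature matrix and target labels
```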
76. Psst… much more than Decision Trees
Information Gain is also mutual information, which is the Kullback-Leibler divergence between the joint distribution P(x, y) and the product of the marginals P(x)P(y)
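A numeric sketch of that identity, on a toy joint distribution of my own (not from the slides): the KL divergence KL(P(x,y) || P(x)P(y)) and H(x) − H(x|y) come out identical.

```python
import math

joint = {("N", "N"): 0.4, ("N", "Y"): 0.1,
         ("Y", "N"): 0.1, ("Y", "Y"): 0.4}

px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in "NY"}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in "NY"}

# KL( P(x,y) || P(x)P(y) )
kl = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

# H(x) - H(x|y), computed directly
hx = -sum(p * math.log2(p) for p in px.values())
hx_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items())

print(round(kl, 4), round(hx - hx_given_y, 4))  # both -> 0.2781
```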
77. Resources
Information Theory
David MacKay's course, Information Theory, Pattern Recognition and Neural Networks: http://www.inference.org.uk/mackay/itprnn/
Lecture videos: https://www.youtube.com/channel/UCfoScwn69ekXXWNTN0CLGXA
MacKay's free online textbook: http://www.inference.org.uk/mackay/itila/
Decision Tree ID3 algorithm: https://en.wikipedia.org/wiki/ID3_algorithm