- The document discusses information theory and how it relates to data and machine learning. It explains that data is not the same as information, and that information is measured by entropy, which quantifies uncertainty.
- It works through an example of calculating the entropy of a set of letters with different probabilities. The average number of yes/no questions needed to identify a random letter turns out to equal the entropy.
- Decision trees are discussed as a way to reduce entropy and gain information by splitting data on features. Information gain is defined as the reduction in entropy when a feature is used to split the data.
2. Who am I?
Co-founder of Conductrics software. We help companies solve their last-mile problem for analytics.
Why me?
Conductrics blends ideas from A/B Testing, RL, Statistics, and Information Theory to generate human-interpretable machine learning models that target customers with the experiences they care about. www.conductrics.com
14. Claude Shannon
Studied at MIT. In 1948, while at Bell Labs, published A Mathematical Theory of Communication.
15. But What is Information Really?
What is Information, if not Data?
16. But What is Information Really?
Information in data is equal to the smallest* possible lossless encoding.
*smallest on average
17. The More Predictable, the More Compression, the Less Information
18. Shannon Entropy is a Measure of Unpredictability in Data
19. Entropy is the Minimum Number* of Bits to fully encode data
*average number
20. …or equivalently
21. Entropy is the Minimum Number* of QUESTIONS to identify all possible values in the data
*average number
22. But What is Information Really?
23. I see you are confused. To help with intuition, let's play a game of 20 Questions. What letter am I thinking of: A, B, C, or D?
24. First Approach: Questions to Bits to Information
[Tree diagram: 1st question "Is it 'C' or 'D'?"; on No, 2nd question "More than 1?" separates 1 from 2; on Yes, 2nd question "More than 3?" separates 3 from 4.]
25. [Tree diagram: 1st question "Is it 'C' or 'D'?"; on No, "Is it 'B'?" separates A from B; on Yes, "More than 3?" separates 3 from 4.]
26. [Tree diagram: 1st question "Is it 'C' or 'D'?"; on No, "Is it 'B'?" separates A from B; on Yes, "Is it 'D'?" separates C from D.]
27. We can ALWAYS Pick the Letter after 2 Questions
28. What if we swap 0|1 for No|Yes?
29. Each Y|N Question = 1 Bit
Questions → Bits:
A = 00
B = 01
C = 10
D = 11
30. We can identify each Letter with 2 bits
Letter | Bit1 | Bit2
A | 0 | 0
B | 0 | 1
C | 1 | 0
D | 1 | 1
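A quick sketch of this in Python (not from the slides; the four-letter alphabet is the one from the game): each yes/no question contributes one bit, so four equally likely letters need log2(4) = 2 bits apiece.

```python
import math

# Fixed-length code: one bit per Y|N question.
code = {"A": "00", "B": "01", "C": "10", "D": "11"}

print(math.log2(len(code)))  # -> 2.0 bits per letter

# Encoding a message is just concatenating per-letter codewords.
message = "ADBC"
print("".join(code[ch] for ch in message))  # -> 00110110
```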
31. So the number of bits in this case is equal to the number of Y|N Questions
33. It gets better. What if I think of the letters based on this prior distribution?*
A = 50%
B = 25%
C = 12.5%
D = 12.5%
*From http://www.inference.org.uk/mackay/itprnn/
34. Can we now do better, on average, than 2 Questions, or 2 Bits?
35. I don't understand. Maybe you can show me how this works.
36. [Tree diagram, 1st question: "Is it A?"; Yes → A.]
37. [Tree diagram: "Is it A?"; Yes → A; No → 2nd question "Is it B?"; Yes → B.]
38. [Tree diagram: "Is it A?" → A; else "Is it B?" → B; No → 3rd question "Is it C?"; Yes → C.]
39. [Tree diagram, complete: "Is it A?" → A; else "Is it B?" → B; else "Is it C?" → C; else D.]
40. [Same tree, with each No|Yes answer written as a bit.]
Data Encoding:
A = 1
B = 01
C = 001
D = 000
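A small Python sketch of this prefix code (the probabilities and codewords come straight from the slides; the decoder is my own illustration). Because no codeword is a prefix of another, the bit stream decodes without separators.

```python
code = {"A": "1", "B": "01", "C": "001", "D": "000"}
prob = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Expected code length: sum of P(letter) * codeword length.
print(sum(prob[ch] * len(code[ch]) for ch in code))  # -> 1.75 bits

def decode(bits):
    """Emit a letter whenever the buffer matches a complete codeword."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(decode("101001000"))  # -> ABCD
```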
41. In general, the average number of questions needed is:
$\sum_{i=1}^{k} P(\mathrm{Letter}_i) \cdot \mathrm{NumQuestions}_i$
42. Average Number of Questions Needed?
$\sum_{i=1}^{4} P(\mathrm{Letter}_i) \cdot \mathrm{NumQuestions}_i$
= 0.5·1 (A) + 0.25·2 (B) + 0.125·3 (C) + 0.125·3 (D) = 1.75
43. Can we on average do better than 2 Questions or bits?
44. Can we on average do better than 2 Questions or bits? YES!!!
45. How much better depends on the prior probabilities
46. The more predictable the data, the fewer questions or bits needed.
Maximum = 2 | Our Case = 1.75 | Minimum = 0
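A quick numeric check of those three cases, as a sketch: the quantity below is the average number of questions under the best possible questioning scheme, which the next slides identify as entropy (for the power-of-1/2 probabilities used here the formula is exact).

```python
import math

def min_avg_questions(probs):
    """-sum(p * log2(p)); terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(min_avg_questions([0.25, 0.25, 0.25, 0.25]))   # -> 2.0  (maximum: uniform)
print(min_avg_questions([0.5, 0.25, 0.125, 0.125]))  # -> 1.75 (our case)
print(min_avg_questions([1.0, 0, 0, 0]))             # -> 0.0  (no uncertainty)
```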
49. From Data to Information
Information Content of an individual event, or value, like a letter
50. From Data to Information
Information Content of an individual event, or value, like a letter:
$-\log_2 P(x_i)$
51. From Data to Information
Information Content of an Event $x_i$: $-\log_2 P(x_i)$, where $P(x_i)$ is the probability of the event
52. From Data to Information
Information Content of an Event $x_i$: $-\log_2 P(x_i)$
Information is measured in Bits!
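As a one-line sketch of the formula on the letters example (prior from slide 33):

```python
import math

prob = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
for letter, p in prob.items():
    print(letter, -math.log2(p))
# A 1.0, B 2.0, C 3.0, D 3.0: the rarer the event, the more bits it carries.
```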
53. Entropy: $H(x) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)$
1) Calculate the Information in each event
2) Take the Weighted Average of the Information
56. Entropy: $H(x) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)$
Hmm… looks suspiciously similar to the average-number-of-questions calculation
57. Entropy: H(x)
Calculate the Entropy of our Letters:
A = 50%
B = 25%
C = 12.5%
D = 12.5%
58. Entropy: H(x)
First calculate the $\log_2 P(x_i)$:
A: log2(0.5) = −1
B: log2(0.25) = −2
C: log2(0.125) = −3
D: log2(0.125) = −3
59. Entropy: $H(x) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)$
= −1·(0.5·−1 (A) + 0.25·−2 (B) + 0.125·−3 (C) + 0.125·−3 (D)) = 1.75
60. Entropy: $H(x) = -\sum_{i=1}^{k} P(x_i) \log_2 P(x_i)$
= −1·(0.5·−1 + 0.25·−2 + 0.125·−3 + 0.125·−3) = 1.75
But that's exactly the same result!
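The same calculation as a small reusable function, as a sketch:

```python
import math

def entropy(probs):
    """H(x) = -sum(P(x_i) * log2(P(x_i))): the weighted average
    of each event's information content."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # -> 1.75, matching the
                                           # average-questions result
```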
61. Entropy = Number of Bits = Number of Questions
62. What can I do with it?
63. Information Gain
Turns prediction/targeting into a problem of reducing entropy
64. Information Gain
The Key Idea Behind Decision Trees
Example: Conductrics Predictive Segments
65. Information Gain
$I(x; y) = H(x) - H(x|y)$
where $H(x)$ is the entropy of the target variable and $H(x|y)$ is the conditional entropy of the target given feature $y$
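A minimal sketch of this definition in Python (helper names are my own; both entropies are estimated from observed label counts):

```python
from collections import Counter
import math

def entropy(labels):
    """H(x) from a list of observed labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(target, feature):
    """I(x;y) = H(x) - H(x|y): target entropy minus the weighted
    entropy of the target within each feature value."""
    n = len(target)
    h_cond = 0.0
    for value in set(feature):
        subset = [t for t, f in zip(target, feature) if f == value]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(target) - h_cond

# Toy check: a feature identical to the target removes all uncertainty;
# an unrelated feature gains nothing.
y = ["N", "N", "Y", "Y"]
print(information_gain(y, y))                     # -> 1.0
print(information_gain(y, ["a", "b", "a", "b"]))  # -> 0.0
```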
66. Example Data
Convert | Windows | Mobile
N | N | N
N | N | N
N | N | N
N | N | N
N | N | Y
N | Y | Y
Y | N | Y
Y | Y | N
Y | Y | N
Y | Y | N
Y | Y | Y
Y | Y | Y
68. Information Gain
Gain for Windows*: 1 − (0.5·0.65 + 0.5·0.65) = 0.35
Gain for Mobile: 1 − (0.42·0.44 + 0.58·0.99) = 0.24
*Note: In this case the Windows Yes and No branches are symmetrical, so the results for each are the same, but this need not be true. You can see this in Mobile: the two branches are not the same.
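Reusing the information_gain sketch from slide 65 on the columns as transcribed above reproduces the Windows figure (the Mobile column as reprinted here may not match the chart the slide's 0.24 was read from, so only the Windows number is checked):

```python
from collections import Counter
import math

def H(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

convert = list("NNNNNNYYYYYY")  # target column from the Example Data slide
windows = list("NNNNNYNYYYYY")  # Windows column

# H(Convert|Windows): weighted entropy within each Windows value.
h_cond = sum(
    windows.count(v) / len(windows)
    * H([c for c, w in zip(convert, windows) if w == v])
    for v in set(windows)
)
print(round(H(convert) - h_cond, 2))  # -> 0.35, the slide's Windows gain
```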
70. Information Gain
Split By Windows
[Chart: overall conversion rate vs. conversion rate for Windows = No and Windows = Yes.]
71. Decision Trees
Repeat for each leaf node. Stop when:
• Node size is below threshold
• Information gain is below threshold
• Tree depth reaches a defined limit
These are all parameters YOU set to control overfitting (see the sketch below).
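As a sketch of these knobs in practice: scikit-learn's decision tree (one common implementation, not the one Conductrics uses) exposes each stopping rule as a constructor parameter; the thresholds below are arbitrary placeholder values.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",         # split by information gain
    min_samples_leaf=20,         # node-size threshold
    min_impurity_decrease=0.01,  # information-gain threshold
    max_depth=4,                 # tree-depth limit
)
# tree.fit(X, y)  # X, y: your feature matrix and target labels
```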
76. Psst… much more than Decision Trees
Information Gain is also mutual information, which is the Kullback-Leibler divergence between the joint distribution P(x, y) and the product of the marginals P(x)P(y)
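A numeric sketch of that identity, on a toy joint distribution of my own (not from the slides): the KL divergence KL(P(x,y) || P(x)P(y)) and H(x) − H(x|y) come out identical.

```python
import math

joint = {("N", "N"): 0.4, ("N", "Y"): 0.1,
         ("Y", "N"): 0.1, ("Y", "Y"): 0.4}

px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in "NY"}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in "NY"}

# KL( P(x,y) || P(x)P(y) )
kl = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

# H(x) - H(x|y), computed directly
hx = -sum(p * math.log2(p) for p in px.values())
hx_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items())

print(round(kl, 4), round(hx - hx_given_y, 4))  # both -> 0.2781
```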
77. Resources
Information Theory
David MacKay's course, Information Theory, Pattern Recognition and Neural Networks: http://www.inference.org.uk/mackay/itprnn/
Lecture videos: https://www.youtube.com/channel/UCfoScwn69ekXXWNTN0CLGXA
MacKay's free online textbook: http://www.inference.org.uk/mackay/itila/
Decision Tree ID3 algorithm: https://en.wikipedia.org/wiki/ID3_algorithm