This presentation was shared by Gramener's Kanishk Kumar Abhishek during his guest lecture session at School of Business Management, NMIMS Mumbai.
Check out Gramener's data storytelling workshop for analysts and data scientists at https://gramener.com/data-storytelling-workshop
2. How a nurse changed the course of a war using data storytelling
3. Nightingale helped curtail the death rate from a whopping 40% to a mere 2%
Created by Florence Nightingale for Queen
Victoria during the Crimean War.
Visualizes deaths due to:
Red: War wounds
Black: Other war-related causes
Blue: Avoidable hospital diseases
4. Let’s look at 15 Years of US Birth Data
The US birth dataset (1975 to 1990) has
been around for several years
and has been studied extensively. Yet,
a visualization can reveal patterns that
are neither obvious nor well known.
• For example,
• Are birthdays uniformly distributed?
• Do doctors or parents exercise the
C-section option to move dates?
• Is there any day of the month that
has unusually high or low births?
• Are there any months with relatively
high or low births?
[Chart legend: more births to fewer births, on average, for each day of the year (from 1975 to 1990)]
Very high births in September.
But this is fairly well known. Most
conceptions happen during the
winter holiday season
Relatively few births during the
Christmas and Thanksgiving
holidays, as well as New Year
and Independence Day.
Most people prefer not
to have children on the
13th of any month, since
it is considered an unlucky day
Some special days like April
Fool’s day are avoided, but
Valentine’s Day is quite
popular
5. The pattern in India is quite different
This is a birth date dataset that’s
obtained from school admission data
for over 10 million children. When we
compare this with births in the US, we
see none of the same patterns.
For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?
Very few children are born in the
month of August, and thereafter.
Most births are concentrated in
the first half of the year
We see a large number of
children born on the 5th, 10th,
15th, 20th and 25th of each month
– that is, round numbered dates
Such round-numbered patterns are a
typical indication of fraud. Here,
birthdates are brought forward to
aid early school admission
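The round-date check described above can be sketched as follows. This is a minimal illustration, not the analysis behind the slide: the function name `round_day_excess` and the sample data are made up, and only days 1 to 28 are compared so that every month contributes equally.

```python
from collections import Counter
from datetime import date

# Days the slide calls "round numbered".
ROUND_DAYS = {5, 10, 15, 20, 25}

def round_day_excess(birth_dates):
    """Ratio of average births on round days to average births on other days."""
    per_day = Counter(d.day for d in birth_dates)
    round_avg = sum(per_day[d] for d in ROUND_DAYS) / len(ROUND_DAYS)
    # Days 1-28 exist in every month, so they are comparable across months.
    other_days = [d for d in range(1, 29) if d not in ROUND_DAYS]
    other_avg = sum(per_day[d] for d in other_days) / len(other_days)
    return round_avg / other_avg

# Synthetic sample (not real data): one birth on each day 1-28,
# plus two extra "births" on every round day.
sample = [date(2010, 1, day) for day in range(1, 29)]
sample += [date(2010, 1, day) for day in ROUND_DAYS for _ in range(2)]

ratio = round_day_excess(sample)
print(ratio)  # 3.0 for this synthetic sample
```

A ratio close to 1 would suggest no preference for round dates; a ratio well above 1 is the kind of pattern the slide flags.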
[Chart legend: more births to fewer births, on average, for each day of the year (from 2007 to 2013)]
6. This adversely impacts children’s marks
It’s a well-established fact that older children tend to do
better at school in most activities. Since many children
have had their birth dates brought forward, these younger
children suffer.
Children “born” on the 1st, 5th, 10th, 15th, etc. of the month tend
to score lower marks on average.
[Chart legend: higher marks to lower marks, on average, for children born on a given day of the year (from 2007 to 2013)]
Children “born” on round numbered days score lower marks on average,
due to a higher proportion of younger children
7. An energy utility detected billing fraud
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011.
An unusually large number of readings are aligned with the slab boundaries.
Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the
number of customers with a specific usage level (in units, or kWh).
Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than
someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.
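To see why the slab boundary matters, here is a small sketch of whole-bill slab pricing. The slab limits and per-unit rates are hypothetical, chosen only for illustration; they are not the utility's actual tariff.

```python
# Hypothetical slabs: (upper limit in units, price per unit).
# These numbers are illustrative, not the utility's real tariff.
SLABS = [(100, 2.0), (200, 3.0), (float("inf"), 5.0)]

def bill(units):
    """Bill ALL units at the rate of the slab that the total usage falls in."""
    for upper, rate in SLABS:
        if units <= upper:
            return units * rate

print(bill(100))  # 200.0
print(bill(101))  # 303.0
```

Under these assumed rates, one extra unit raises the bill by about half, which is the incentive to stay at or below the boundary.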
An energy utility (with over 50 million subscribers)
had 10 years’ worth of customer billing data
available.
Most fraud detection software failed to load the
data, and sampled data revealed little or no insight.
This can happen in one of two ways.
First, people may be monitoring their usage very
carefully, and turn off their lights and fans the instant
their usage hits the slab boundary.
Or, more realistically, there’s probably some level of corruption
involved, where customers pay a small sum to the meter reading staff
to ensure that it stays exactly at the slab boundary, giving them the
advantage of a lower price.
8. This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011.
An unusually large number of readings are
aligned with the tariff slab boundaries.
This clearly shows collusion of some
form with the customers.
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
217 219 200 200 200 200 200 200 200 350 200 200
250 200 200 200 201 200 200 200 250 200 200 150
250 150 150 200 200 200 200 200 200 200 200 150
150 200 200 200 200 200 200 200 200 200 200 50
200 200 200 150 180 150 50 100 50 70 100 100
100 100 100 100 100 100 100 100 100 100 110 100
100 150 123 123 50 100 50 100 100 100 100 100
0 111 100 100 100 100 100 100 100 100 50 50
0 100 27 100 50 100 100 100 100 100 70 100
1 1 1 100 99 50 100 100 100 100 100 100
This happens with specific
customers, not randomly. Here are
such customers’ meter readings.
Section Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%
Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%
Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%
Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%
Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%
Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%
Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%
Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%
Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%
If we define the “extent of fraud” as
the percentage excess of the 100
unit
meter reading, the value varies
considerably across sections,
and time
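One way to read "percentage excess of the 100 unit meter reading" is sketched below. The slide does not say what the count at 100 units is compared against, so the baseline here (the average count of readings within 5 units of the boundary) is an assumption, as are the function name and the sample data.

```python
from collections import Counter

def extent_of_fraud(readings, boundary=100, window=5):
    """Percentage by which the count of readings exactly at the boundary
    exceeds the average count of nearby readings (assumed baseline)."""
    counts = Counter(readings)
    neighbours = [u for u in range(boundary - window, boundary + window + 1)
                  if u != boundary]
    expected = sum(counts[u] for u in neighbours) / len(neighbours)
    if expected == 0:
        raise ValueError("no readings near the boundary to compare against")
    return 100.0 * (counts[boundary] - expected) / expected

# Synthetic readings (not real data): 10 customers at each level from 95 to 105,
# plus 7 extra customers at exactly 100 units.
readings = [u for u in range(95, 106) for _ in range(10)] + [100] * 7

excess = extent_of_fraud(readings)
print(excess)  # 70.0, i.e. 70% more readings at 100 than at nearby levels
```

With this definition a negative value is possible when fewer readings than expected sit at the boundary, which is consistent with the occasional negative percentages in the table.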
New section
manager arrives
… and is
transferred out
… with some explainable
anomalies.
Why would
these happen?
9. Class Xth English Marks Distribution
[Chart: distribution of marks. X-axis: marks, 0 to 100. Y-axis: number of students, 0 to 20,000]
10. Stories have four types of narratives to explain visualizations
Remember “SEAR”: Summarize, Explain, Annotate, Recommend
[Chart: X-axis: Marks, 0 to 100. Y-axis: # students, 0 to 20,000]
Teachers add marks to stop some students from failing
This chart shows Class 10 students’ English
marks in Tamil Nadu, India, in 2011. The X-axis
has the mark a student has scored. The Y-axis
has the # of students who scored that mark.
Large number of
students score
exactly 35 marks
Few (but not 0) students
fail at 31-34 marks
What’s unusual
Large number of students
score 35 marks.
Few (but not 0) students score
between 30-35
Only some students get this benefit.
Identify a fair policy that will be applied consistently.
Summarize the visual in its title
Don’t describe the chart.
Don’t write the user’s question.
Write the answer itself. Like a headline.
Explain & interpret the visual
How should the user read it?
What do you say when you talk through it?
Explain what the visual is. Then the axes.
Then its contents. Then the inference.
Recommend an action
How should I act on this?
You need to change the audience.
(Otherwise, you made no difference.)
Annotate essential elements
What should the user focus their eyes on?
Point it out, or highlight it with colors
Interpret what they’re seeing – in words.
This is a bell curve. But the spike at 35 (the mark
at which students pass) is unusual. Teachers
must be adding marks to some of the students
who are likely to fail by a small margin.
No one scores 0-4
marks
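The spike at the pass mark and the dip just below it can be checked numerically. The sketch below is illustrative only: the function name, the choice of the four marks just above the pass mark as a baseline, and the histogram values are all assumptions, not the Tamil Nadu data.

```python
def pass_mark_spike(counts, pass_mark=35, width=4):
    """Compare the count at the pass mark, and the average count just below it,
    with the average count just above it (assumed baseline).

    counts: dict mapping mark -> number of students.
    """
    just_below = [counts.get(m, 0) for m in range(pass_mark - width, pass_mark)]
    just_above = [counts.get(m, 0) for m in range(pass_mark + 1, pass_mark + 1 + width)]
    baseline = sum(just_above) / len(just_above)
    return {
        "spike_ratio": counts.get(pass_mark, 0) / baseline,
        "below_ratio": (sum(just_below) / len(just_below)) / baseline,
    }

# Synthetic histogram (not real data): few students at 31-34, many at exactly 35.
histogram = {31: 100, 32: 100, 33: 100, 34: 100,
             35: 4000,
             36: 1000, 37: 1000, 38: 1000, 39: 1000}

result = pass_mark_spike(histogram)
print(result)
```

A spike ratio well above 1 together with a below ratio well under 1 matches the pattern the annotations point out.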
11. Our focus at Gramener is on narrating insights from data as stories
Stories are
memorable, viral
NUMBERS
ARE NOT
ENOUGH
STORIES EXPLAIN THEM
Delays are due to fragile cargo.
Trained staff and forklifts
reduce risk of breakage, and
hence reduce delay.
Insights are useful,
non-obvious, Big
FACTS ARE NOT USEFUL
E.g. Delay in cargo delivery
grew 8% last quarter.
INSIGHTS ENABLE ACTION
Lack of forklifts and fewer
trained staff led to the delay.
Improving these can reduce
cargo delay by 15%.
Today, I’ll share how we apply these at organizations like HDFC
GRAMENER COMBINES DATA, INSIGHT, AND STORY
12. Data storytelling is a critical skill for data scientists, analysts & managers
Stories are memorable. They spread virally
People remember stories. They’ll act on them.
People share stories. That enables collective action.
For people to act on analysis, data stories are critical.
But analysts present analysis, not stories
We present what we did. Not what you need.
You need to know what happened, why, & what to do.
Narrated in an engaging way. As a story.
We’ll learn how to do that in this session.
Storytelling has a 30X Return on Investment
Rob Walker and Joshua Glenn auctioned common
items like mugs, golf balls, toys, etc. The item
descriptions were stories purpose-written by 200+
contributing writers.
Items that were bought for $250 sold for over $8,000 –
a return of over 3,000% for storytelling!
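The arithmetic behind these figures can be checked directly. This is a small sketch using the numbers quoted on the slide; the function name is illustrative, and since the items sold for "over" $8,000, the first result is a lower bound.

```python
def return_on_investment(cost, revenue):
    """Return on investment as a percentage of the original cost."""
    return (revenue - cost) / cost * 100

all_items = return_on_investment(250, 8000)     # all auctioned items
one_statue = return_on_investment(2.00, 50.00)  # the single example item

print(all_items)   # 3100.0
print(one_statue)  # 2400.0
print(8000 / 250)  # 32.0, the "30X" multiple
```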
Original price: $2.00.
Final price: $50.00.
This little statue stood on the window-sill
in my favorite aunt’s front hall. Perched
between plants of varying shapes and
sizes, surrounded by shards of broken
pottery and miniature ceramic elephants
from the Red Rose Tea box, dappled
with sunlight shining through the leaded
glass figures of St. Francis in his garden
and the mossy Celtic Cross, …
13. You have data.
You have analysis.
Now what?
Understanding the audience & intent
Finding insights
Storylining
Designing data stories
15. DO IT: Who is the audience for your analysis?
Role: _____________
Be specific. “Head of sales”, not “executive”
Example name: ______________
Name a real person. “Jim Fry”, not “any sales
head”.
Different people want
different things from the
same data.
Given sales data:
• The Board: “Predict next quarter’s sales”
• Product head: “Which product grew the most?”
• Sales head: “Did we meet our target?”
They are not interested in each other’s questions.
Who is your audience? They determine the story
16. DO IT: Write it in this structure
“[Person, Role] is in [situation], and faces this
[problem]. By taking [action], she can drive
[impact].”
Example
John, the Marketing head, [person, role]
must create a region-wise budget, [situation]
and doesn’t know the region-wise ROI. [problem]
By prioritizing regions, [action]
he can maximize ROI. [impact]
For each person, answer the following questions:
1. What’s their situation?
2. What problems do they face?
3. What action can they take?
4. What is the impact of this action?
What is their problem? That defines your analysis, so you can align the message accordingly
Clear needs and a clear future scenario lead to effective communication.
17. Here are three examples in real life
Purchasing Commodities
Person, Role: Adam, the purchasing head of a leading European brewery
Situation: had plants that purchased commodities from several vendors. Discounts were low. The number of weekly orders was high.
Problem: But he didn’t know which plants and commodities were a problem. Every plant denied it.
Action: By consolidating vendors and reducing order frequency,
Impact: they could increase their discounts and reduce logistics cost.
Cargo Delay
Person, Role: Cris, the operations head of a leading US airline
Situation: had an SLA to deliver cargo from the flight to the warehouse in under 1.5 hours, 15% lower than their current best performance.
Problem: But she didn’t know what the biggest drivers of this delay were: people, assets, or type of cargo.
Action: By adding resources only to the largest levers of delay,
Impact: she could reduce turnaround time with the lowest spend.
Customer Churn
Person, Role: Ravi, the marketing manager of an Asian telecom company
Situation: found that the cost of replacing customers was thrice the cost of retention.
Problem: But he didn’t know which customers to make offers to in order to retain them.
Action: By predicting which customer was likely to churn,
Impact: they could tailor a retention offer and reduce re-acquisition cost.
18. Finding Insights
Big, Impactful & Surprising
Step 2
Understanding the audience & intent
Finding insights
Storylining
Designing data stories
19. Insights must be Big, Useful, and Surprising
Filter the analyses using these as a checklist
IS THE INSIGHT
BIG
IS THE INSIGHT
USEFUL
IS THE INSIGHT
SURPRISING
The analysis must, of course, be statistically significant.
But it should also be numerically significant.
We want a result that substantially changes the outcome.
What should the audience do after hearing the insight?
Can they take an action that improves their objective?
Even if it’s informational, what should they do next?
Is this something they didn’t know? Is it non-obvious?
Does it overturn a domain-driven belief or a gut feel?
Or does it bring consensus to a group with divided opinion?
20. Marking each analysis as Big, Useful or Surprising (High, Medium, Low)
Only those that are high or medium on all aspects are insights
Insights Big Useful Surprising
Twice as many Detractors talk about our Product’s ease of use. Low Medium High
Typing with capitalization in a credit application indicates creditworthiness Low Low High
Almost 20% of all voice search queries are triggered by just 25 words Low High Medium
More engaged employees have fewer accidents Low High Low
About 50% of American small businesses do not have a website High Medium Low
The recommendation system influences about 80% of content streamed on Netflix High Medium Low
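The filtering rule ("high or medium on all aspects") can be sketched as a simple check over the rows above. The names `ANALYSES` and `is_insight` are illustrative, and the ratings are copied from the table; under the stated rule, none of these six analyses passes, since each has at least one Low rating.

```python
# (analysis, Big, Useful, Surprising), with ratings copied from the table above.
ANALYSES = [
    ("Twice as many Detractors talk about our Product's ease of use", "Low", "Medium", "High"),
    ("Typing with capitalization in a credit application indicates creditworthiness", "Low", "Low", "High"),
    ("Almost 20% of all voice search queries are triggered by just 25 words", "Low", "High", "Medium"),
    ("More engaged employees have fewer accidents", "Low", "High", "Low"),
    ("About 50% of American small businesses do not have a website", "High", "Medium", "Low"),
    ("The recommendation system influences about 80% of content streamed on Netflix", "High", "Medium", "Low"),
]

def is_insight(big, useful, surprising):
    """An analysis counts as an insight only if no aspect is rated Low."""
    return all(rating in {"High", "Medium"} for rating in (big, useful, surprising))

insights = [text for text, *ratings in ANALYSES if is_insight(*ratings)]
print(insights)  # [] because every row has at least one Low rating
```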
21. Here are the analyses & filters for the problems we saw earlier
B = Big, U = Useful, S = Surprising
Purchasing Commodities
• The most common commodity was ordered 10 times a week across 2.4 vendors (none)
• The number of orders is correlated with the number of vendors. Reducing one will reduce the other (U)
• Plant P126 was the plant with the most violations, especially on largest commodity (B, U)
Cargo Delay
• Fragile cargo is a big factor in the delay, with a 20% impact (B, S)
• Fridays are when cargo is delayed the most (none)
• Trained staff and forklifts impact delay the most (B, U, S)
Customer Churn
• Number of inbound calls does not impact churn. (S)
• Customers who haven’t made any calls in the last 15 days are the most likely to churn (B)
• Customers making infrequent calls, recharging small amounts infrequently, are most at risk (B, U, S)
23. A business storyline
• Our NPS improved by 6 percentage points
• It was 34% in 4Q18. Now it’s at 40% in 2Q19
• Despite lower satisfaction with our Support,
our NPS grew
• This increase in NPS was mainly due to better
Product Quality & Research
Gladiator’s storyline
• The Emperor asks General Maximus to take
control of Rome and give it back to people
• The ambitious Prince murders the emperor.
• Maximus is sold as a gladiator slave. His family
is murdered
• Maximus grows famous, fights the Prince in the
arena, and wins
• He joins his family in death. Rome is in the
hands of the people
Outlines are the backbone on which you flesh out your story.
This section explains how to create storylines
Storylines are plot outlines. They summarize the entire story
Notice “characters” in red. All stories
have characters, human or otherwise.
24. Convert analysis into messages by adding context
DO IT: Add context to your analysis
1. Take each relevant analysis
2. Convert it to a message for the audience by
adding context
CHECK IT: Verify these yourself
Will your audience understand the messages
without explanation?
Will your audience understand why this
message is relevant?
Analysis doesn’t mean anything to people. When
it does, it’s a message. We do this by adding
context. Three ways to add context are:
1. Compare with similar numbers.
Our $15 mn sales is $3 mn more than last
year, $1 mn below budget, and twice our
nearest competitors.
2. Explain with analogies.
If we stopped producing, it would take 3 months to
dispose of our excess inventory of $2 mn.
3. Add business interpretation.
Usage is correlated with discounts. For every
$1 discount, customer LTV increases by $24.
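The first technique, comparing with similar numbers, can be sketched in code. The reference values below ($12 mn last year, $16 mn budget, $7.5 mn for the nearest competitor) are back-calculated from the example sentence, and the helper name `compare` is illustrative.

```python
def compare(value, reference):
    """Describe value relative to reference, both in $ mn."""
    diff = value - reference
    if diff > 0:
        return f"${diff:g} mn more than"
    if diff < 0:
        return f"${-diff:g} mn below"
    return "the same as"

# Figures implied by the example sentence (in $ mn).
sales, last_year, budget, competitor = 15, 12, 16, 7.5

message = (
    f"Our ${sales:g} mn sales is {compare(sales, last_year)} last year, "
    f"{compare(sales, budget)} budget, "
    f"and {sales / competitor:g}x our nearest competitor."
)
print(message)
```

The raw number ($15 mn) is the analysis; the three comparisons are what turn it into a message.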
Frame each analysis as a message that the audience will understand and find relevant
25. Structure the messages into a pyramid or a tree
The conventional approach is to explain how we did
the analysis and found the insight.
The insight is lost in the set of slides, and it takes too long to
reach the first insight.
Instead, start with the insight first, and then take the
audience through arguments to support it.
Start with the main message, and then answer
why and how the insight makes sense.
[Diagram: conventional structure]
Title
• Analysis section 1: Methodology, Insight
• Analysis section 2: Methodology, Insight
[Diagram: pyramid structure]
Insight that answers a business question
• Supporting argument 1: Methodology, Supporting reference
• Supporting argument 2: Methodology, Supporting reference
27. Pick a format based on how your audience will consume the story
28. How the data should be interpreted decides the type of chart to be used
https://gramener.github.io/visual-vocabulary-vega/
Deviation
Change-over-Time
Spatial
Ranking
Correlation
Part-to-Whole
Flow
Magnitude
Distribution
29. DO IT: Write your takeaway as one sentence
What’s the one thing you want the audience to
remember from your story?
What’s the one message that the audience
should take away?
CHECK IT: Verify these yourself
Is it a single, complete sentence?
Does it deliver what you want the audience to
remember?
Will your audience care a lot about this?
Close your eyes. Think of a childhood tale.
Summarize the moral of the story in one line
We easily remember these stories and their
summary as a moral several years later.
Close your eyes. Think of a business
presentation from last week. Can you easily
summarize the message in one line?
Stories are designed around a moral. A single
takeaway. An “elevator pitch”
It’s a one-sentence summary of the most important message for the audience.
Start with the takeaway. Summarize your entire story
30. Structure supporting analyses as a tree
Example of a business tree
Launch sales were 30% less than target due to high
competition
• Launch sales were projected at $20 mn in the
first month, but achieved only $14 mn
o Sales in most regions were 20-50% lower.
o Only Philippines & Korea were on target
• Competitors discounted price by 35% - which is
unsustainable for them
o Discounts at 80 stores increased from 15% to 35%
o The maximum sustainable discount is 20%
• Stores that offered higher discounts saw less than
20% of our target sales
Construct a pyramid or tree-like outline
• Start with the takeaway at the root of the tree
• Add a message that supports the takeaway
• Add further details or supporting messages
• Messages must prove the first message, and
only the first message
• Strike off any message that isn’t required to
prove or support the takeaway
• Add next message that supports takeaway
• Add details to prove the second message
• Remaining messages for the takeaway
• Add details as required
Arrange messages hierarchically to prove & support the parent message
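The tree described above can be represented as a small data structure. This is a minimal sketch: the `Message` class and `outline` function are hypothetical names, and the messages are shortened from the launch-sales example.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    """One message in the storyline; children prove or support this message."""
    text: str
    children: list["Message"] = field(default_factory=list)

def outline(node, depth=0):
    """Render the tree as indented lines, takeaway first."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines

# Shortened version of the launch-sales example.
root = Message("Launch sales were 30% less than target due to high competition", [
    Message("Projected $20 mn in the first month, achieved only $14 mn", [
        Message("Sales in most regions were 20-50% lower"),
        Message("Only Philippines & Korea were on target"),
    ]),
    Message("Competitors discounted price by 35%", [
        Message("Discounts at 80 stores increased from 15% to 35%"),
        Message("The maximum sustainable discount is 20%"),
    ]),
    Message("Stores with higher discounts saw less than 20% of target sales"),
])

lines = outline(root)
print("\n".join(lines))
```

Striking off a message that does not support its parent amounts to removing that node from its parent's `children` list.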
31. 4 types of annotations help the audience understand your intent
[Chart: X-axis: Marks, 0 to 100. Y-axis: # students, 0 to 20,000]
Teachers add marks to stop some students from failing
This chart shows Class 10 students’ English marks in Tamil Nadu, India, in
2011. The X-axis has the mark a student has scored. The Y-axis has the # of
students who scored that mark.
This is a bell curve. But the spike at 35 (the mark at which students pass) is
unusual. Teachers must be adding marks to some of the students who are
likely to fail by a small margin.
Large number of students score
exactly 35 marks
Few (but not 0) students score
between 30-35
What’s unusual
Large number of students
score 35 marks.
Few (but not 0) students score
between 30-35
Only some students get this benefit.
Identify a fair policy that will be applied consistently.
Summarize the chart in its title
Don’t describe the chart.
Don’t write the question to answer.
Write the answer itself. Like a headline.
Explain the chart
How should the user read it?
What do you say when you talk through it?
Explain what the visual is. Then the axes. Then
its contents. Then the inference.
Recommend an action
How should I act on this?
You need to change the audience.
(Otherwise, you made no difference.)
Highlight essential elements
What should the user focus their eyes on?
Point it out.
Interpret what they’re seeing – in words.
32. Here is the storyline for the analyses we saw earlier
Purchasing Commodities
Takeaway: Focus on reducing the number of vendors for products ICG (in P126), FRS (in P121) and SWB (in P074) for a potential 40% reduction in logistics & vendor cost.
Supporting points:
• ICG spend is among the highest, at €6.9m. P126 typically orders 40 times a week, often from 15-20 vendors.
• FRS spend is €3.2m. P121 orders from 3 vendors 8-14 times a week.
Cargo Delay
Takeaway: To reduce the TAT to 1.5 hours at Airport XYZ, increase the number of forklifts from 1 to 2, and the number of trained staff from 4 to 6
Supporting points:
• The number of forklifts is the biggest driver of TAT. Each forklift typically reduces TAT by 15-30%.
• Total staff count does not impact TAT. Increasing trained staff has a more tangible impact of ~5-10% per person.
Customer Churn
Takeaway: If a customer has not called in the last 5-14 days, and they have made only 1 recharge under $20 last quarter, make them an offer to retain them.
Supporting points:
• The biggest driver of retention is when the customer made the outgoing call. The 5-14 days bucket has the highest variation.
• Customers who make at most 1 recharge under $20 are 280% more likely to churn than others.
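The customer churn takeaway reads as a rule that could be applied to each customer. The sketch below is one interpretation: the function name and inputs are hypothetical, and "only 1 recharge under $20" is taken to mean exactly one recharge last quarter, worth less than $20.

```python
def should_make_offer(days_since_last_call, recharges_last_quarter):
    """Apply the retention rule from the churn takeaway.

    days_since_last_call: whole days since the customer's last call.
    recharges_last_quarter: list of recharge amounts in $ (assumed input format).
    """
    quiet = 5 <= days_since_last_call <= 14
    single_small_recharge = (
        len(recharges_last_quarter) == 1 and recharges_last_quarter[0] < 20
    )
    return quiet and single_small_recharge

print(should_make_offer(7, [10]))      # True: quiet for a week, one small recharge
print(should_make_offer(3, [10]))      # False: called recently
print(should_make_offer(7, [10, 15]))  # False: more than one recharge
```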
42. Insights and Storytelling approach
Stage 1: Identify Business Problem
Define the problem statement by understanding:
• What is the basic need and desired outcome?
• Who will benefit?
• What is the impact?
• What are the success criteria?
Stage 2: Translate to Data Problem
• Break down the problem statement into multiple use cases
• Connect each use case with a data set
• Understand any limitations on data sources, internal and external
Stage 3: Data Answer
Target each use case with data through:
• EDA and transformation
• Modelling
• Generating insights
Stage 4: Translate to Business Answer
• Stitch insights from individual use cases to create a story
• Connect the data story to help in better decision making
• Measure success
• Sales Rep
• Data Consultant
• Account Manager
• Solution Lead
• Analyst Lead
• Data Consultant
• Account Manager
• Solution Architect
• Solution Lead
• Analyst Lead
• Data Consultant
• Data Scientist
• Solution Architect
• Solution Lead
• Data Consultant
• Account Manager
• Solution Lead
45. In summary, here are the 9 steps to go from data to a data story
Who is your audience? They determine the story
What is their problem? That defines your analysis
Find the right analysis to solve the problem
Filter for big, useful, surprising insights
Start with the takeaway. Summarize your entire story
Add supporting analyses as a tree
Pick a format based on how your audience will consume the story
Pick a visual design based on the takeaway
Annotate to explain & engage. Use four types of narratives
46. To recap, we narrate insights as data stories
But this is not scalable without technology