This presentation was created by Anand Madhav, Gramener's Sr. Manager - Data Sciences, for a guest lecture session at IIFT Delhi. It talks about data storytelling, how to create a data story, and what ingredients make a data story exciting.
Anand Madhav also teaches data storytelling to analysts and data scientists. Check out more about Gramener's data storytelling workshop at https://gramener.com/data-storytelling-workshop
6. We’ve been telling stories with data for a long time
How a nurse changed the course
of a war using data storytelling.
Created by Florence Nightingale for
Queen Victoria during the Crimean War.
Visualizes deaths due to:
Red: War wounds
Black: Other war-related causes
Blue: Avoidable hospital diseases
7. Stories have a huge impact on humans
Storytelling has a 30X Return on Investment
Rob Walker and Joshua Glenn auctioned common
items like mugs, golf balls, toys, etc. The item
descriptions were stories purpose-written by 200+
contributing writers.
Items that were bought for $250 sold for over
$8,000 – a return of over 3,000% for storytelling!
Stories are memorable and viral
People remember stories. They’ll act on them.
People share stories. That enables collective
action.
We analyze data to improve people’s decision
making. For this to be effective, data stories are
needed more than ever before.
http://significantobjects.com/
8. Data storytelling is a critical skill for data scientists, analysts & managers
Share your data & analysis as data stories
Whenever you share inferences from data – whether
it’s as a presentation, or an email or document with
your analysis, or as a dashboard – craft it as a story.
Today we will look at a few engaging data stories and
find out how to convert an analysis into a memorable
story – even if you’ve never told a story before.
But analysts present their work, not their message
Data scientists present their analysis – what they did,
and what they found. That’s not what the audience
needs.
Audiences need a message that tells them what to do,
and why. Told in an engaging way. As a story.
10. New digital tools allow us to do this at scale
[Chart labels, one per batsman:] Sachin R Tendulkar, Sourav C Ganguly, Rahul Dravid, Mohammad Azharuddin, Yuvraj Singh, Virender Sehwag, Mahendra S Dhoni, Ajaysinhji D Jadeja, Navjot S Sidhu, Gautam Gambhir, Krishnamachari Srikkanth, Kapil Dev, Dilip B Vengsarkar, Suresh K Raina, Ravishankar J Shastri, Sunil M Gavaskar, Mohammad Kaif, Virat Kohli, Vinod G Kambli, Vangipurappu V S Laxman, Rabindra R Singh, Sanjay V Manjrekar, Mohinder Amarnath, Manoj M Prabhakar, Rohit G Sharma, Irfan K Pathan, Nayan R Mongia, Ajit B Agarkar, Dinesh Mongia, Harbhajan Singh, Krishna K D Karthik, Sandeep M Patil, Anil Kumble, Yashpal Sharma, Javagal Srinath, Hemang K Badani, Yusuf K Pathan, Robin V Uthappa, Raman Lamba, Zaheer Khan, Ravindra A Jadeja, Parthiv A Patel, Sadagopan Ramesh, Roger M H Binny, Woorkeri V Raman, Sunil B Joshi, Kiran S More, Praveen K Amre, Ashok Malhotra, Chetan Sharma
Kapil Dev’s 175 against Zimbabwe in 1983
Gavaskar’s 107 against New Zealand in 1987
Srikkanth’s 95 against Sri Lanka in 1982
Sidhu’s 134 against England in Gwalior, 1993
Sachin’s 200 against South Africa in 2010
Dhoni’s 198 against Sri Lanka in 2005
Sachin’s 134 against Australia in 1998
Ganguly’s 183 against Sri Lanka in 1999
Sehwag’s 146 against Sri Lanka in 2009
Yusuf Pathan’s 123 against New Zealand in 2010
Kohli’s 107 against England in 2011
11. Segmenting India Geo-demographically
Previously, the client was treating contiguous regions as a
homogenous entity, from a channel content perspective.
To deliver targeted content, we divided India into 6 clusters
based on their demographic behavior. Specifically, three
composite indices were created based on the economic
development lifecycle:
• Education (literacy, higher education) that leads to...
• Skilled jobs (in mfg. or services) that leads to...
• Purchasing power (higher income, asset ownership)
Districts were divided (at the average cut-off) by purchasing power, then skilled jobs, then education:
• Poorer, unskilled, uneducated: Poor
• Poorer, unskilled, educated: Breakout
• Poorer, skilled: Aspirant
• Richer, unskilled: Owner
• Richer, skilled, uneducated: Business
• Richer, skilled, educated: Rich
The 6 clusters are:
Poor: Rural, uneducated agri workers. Young population with low income and asset ownership. Mostly in Bihar, Jharkhand, UP, MP.
Breakout: Rural, educated agri workers poised for skilled labor. Higher asset ownership. Parts of UP, Bihar, MP.
Aspirant: Regions with skilled labor pools but low purchasing power. Cusp of economic development. Mostly WB, Odisha, parts of UP.
Owner: Regions with unskilled labor but high economic prosperity (landlords, etc.). Mostly AP, TN, parts of Karnataka, Gujarat.
Business: Lower education but working in skilled jobs, and prosperous. Typical of business communities. Parts of Gujarat, TN, Urban UP, Punjab, etc.
Rich: Urban educated population working in skilled jobs. All metros, large cities, parts of Kerala, TN.
Offering targeted content to these clusters will reach a more homogeneous demographic population.
12. Let’s look at 15 years of US Birth Data
This is a dataset (1975 – 1990) that has been around for several years and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known.
For example,
• Are birthdays uniformly distributed?
• Do doctors or parents exercise the C-section option to move dates?
• Is there any day of the month that has unusually high or low births?
• Are there any months with relatively high or low births?
[Chart legend: more births to fewer births, on average, for each day of the year (from 1975 to 1990)]
Very high births in September. But this is fairly well known. Most conceptions happen during the winter holiday season.
Relatively few births during the Christmas and Thanksgiving holidays, as well as New Year and Independence Day.
Most people prefer not to have children on the 13th of any month, given that it’s considered an unlucky day.
Some special days like April Fool’s Day are avoided, but Valentine’s Day is quite popular.
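The "on average, for each day of the year" view on this slide can be sketched in a few lines. This is a minimal illustration, not the code behind the original chart: the function name `average_births_by_day` and the sample counts are hypothetical.

```python
from collections import defaultdict
from datetime import date


def average_births_by_day(daily_counts):
    """Average births for each (month, day) across all years in the data."""
    totals = defaultdict(int)
    years_seen = defaultdict(int)
    for day, births in daily_counts.items():
        key = (day.month, day.day)
        totals[key] += births
        years_seen[key] += 1
    return {key: totals[key] / years_seen[key] for key in totals}


# Hypothetical counts, for illustration only (not the real dataset).
sample = {
    date(1975, 9, 12): 10_000,
    date(1976, 9, 12): 10_400,
    date(1975, 9, 13): 9_000,
    date(1976, 9, 13): 9_200,
}
averages = average_births_by_day(sample)
print(averages[(9, 12)], averages[(9, 13)])  # 10200.0 9100.0
```

Comparing the averages for neighbouring days (here the 12th and the 13th) is one simple way to check questions such as "is there a day of the month with unusually low births?".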
13. The pattern in India is quite different
This is a birth date dataset that’s
obtained from school admission data
for over 10 million children. When we
compare this with births in the US, we
see none of the same patterns.
For example,
• Is there an aversion to the 13th or is there a local cultural nuance?
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?
Very few children are born in the
month of August, and thereafter.
Most births are concentrated in
the first half of the year
We see a large number of
children born on the 5th, 10th,
15th, 20th and 25th of each month
– that is, round numbered dates
Such round-numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission.
[Chart legend: more births to fewer births, on average, for each day of the year (from 2007 to 2013)]
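One hedged way to quantify the round-number pattern is to compare the share of births recorded on the 5th, 10th, 15th, 20th and 25th with the share expected if births were spread evenly over the year. The sketch below assumes a 365-day year; `round_day_excess` and the sample dates are hypothetical, not the analysis used for the slide.

```python
from datetime import date

ROUND_DAYS = {5, 10, 15, 20, 25}  # the dates called out on the slide


def round_day_excess(birth_dates):
    """Ratio of the observed to the expected share of births on round days.

    Assumption: with no manipulation, births fall evenly on every day of a
    365-day year, so round days account for 5 * 12 / 365 of all births.
    """
    if not birth_dates:
        raise ValueError("need at least one birth date")
    on_round_days = sum(1 for d in birth_dates if d.day in ROUND_DAYS)
    observed_share = on_round_days / len(birth_dates)
    expected_share = len(ROUND_DAYS) * 12 / 365
    return observed_share / expected_share


# Hypothetical sample: 4 of the 10 recorded dates fall on round days.
sample = [date(2008, 1, day) for day in (5, 10, 15, 20, 3, 7, 12, 18, 22, 28)]
excess = round_day_excess(sample)
print(round(excess, 2))  # 2.43
```

A ratio well above 1 suggests that more birthdates sit on round days than chance alone would explain.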
14. This adversely impacts children’s marks
It’s a well-established fact that older
children tend to do better at school in
most activities. Since many children
have had their birth dates brought
forward, these younger children suffer.
Children “born” on the 1st, 5th, 10th, 15th, etc. of the month tend to score lower marks on average, due to a higher proportion of younger children.
[Chart legend: higher marks to lower marks, on average, for children born on a given day of the year (from 2007 to 2013)]
15. An energy utility detected billing fraud
An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available.
Most fraud detection software failed to load the data, and sampled data revealed little or no insight.
Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh).
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries.
Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.
This can happen in one of two ways.
First, people may be monitoring their usage very carefully, and turn off their lights and fans the instant their usage hits the slab boundary.
Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that the reading stays exactly at the slab boundary, giving them the advantage of a lower price.
16. This plot shows the frequency of all meter readings from Apr-
2010 to Mar-2011. An unusually large number of readings are
aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
This happens with specific customers, not randomly. Here are such customers’ meter readings (one customer per row).
Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
217     219     200     200     200     200     200     200     200     350     200     200
250     200     200     200     201     200     200     200     250     200     200     150
250     150     150     200     200     200     200     200     200     200     200     150
150     200     200     200     200     200     200     200     200     200     200     50
200     200     200     150     180     150     50      100     50      70      100     100
100     100     100     100     100     100     100     100     100     100     110     100
100     150     123     123     50      100     50      100     100     100     100     100
0       111     100     100     100     100     100     100     100     100     50      50
0       100     27      100     50      100     100     100     100     100     70      100
1       1       1       100     99      50      100     100     100     100     100     100
Section    Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
Section 1  70%     97%     136%    65%     110%    116%    121%    107%    114%    88%     74%     109%
Section 2  66%     92%     66%     87%     70%     64%     63%     50%     58%     38%     41%     54%
Section 3  90%     46%     47%     43%     28%     31%     50%     32%     19%     38%     8%      34%
Section 4  44%     24%     36%     39%     21%     18%     24%     49%     56%     44%     31%     14%
Section 5  4%      63%     -27%    20%     41%     82%     26%     34%     43%     2%      37%     15%
Section 6  18%     23%     30%     21%     28%     33%     39%     41%     39%     18%     0%      33%
Section 7  36%     51%     33%     33%     27%     35%     10%     39%     12%     5%      15%     14%
Section 8  22%     21%     28%     12%     24%     27%     10%     31%     13%     11%     22%     17%
Section 9  19%     35%     14%     9%      16%     32%     37%     12%     9%      5%      -3%     11%
If we define the “extent of fraud” as the percentage excess of the 100-unit meter reading, the value varies considerably across sections and over time, with some explainable anomalies.
[Table annotations: “New section manager arrives … and is transferred out”; “Why would these happen?”]
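The "extent of fraud" defined above is a percentage excess, which can be written as a one-line formula. The sketch below assumes the expected count is supplied by the analyst (for example, estimated from neighbouring readings); the slide does not say how it was derived. The function name and the counts are hypothetical.

```python
def extent_of_fraud(observed, expected):
    """Percentage excess of the observed count over the expected count.

    `observed` is the number of customers billed at exactly 100 units;
    `expected` is the number we would expect without any manipulation
    (an assumption supplied by the analyst).
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return 100.0 * (observed - expected) / expected


# Hypothetical counts, chosen only to show positive and negative values.
print(extent_of_fraud(observed=170, expected=100))  # 70.0
print(extent_of_fraud(observed=73, expected=100))   # -27.0
```

A negative value, like the one for Section 5 in Jun-10, means fewer readings sat on the boundary than expected.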
17. With the growth of self-service BI, 85% of companies have lost track of how many dashboards they generated
BUT 3 THINGS ARE UNCLEAR ON MOST DASHBOARDS
What QUESTION does the dashboard answer?
Is the ANSWER evident from the dashboard?
What ACTION should the user take now?
18. You have data.
You have analysis.
Now what?
Understanding the audience & intent
Finding insights
Storylining
Designing data stories
20. DO IT: Start with your own hypothesis
• Define different user personas
• What problems do you think your persona is facing?
• How do you feel the persona will use the analysis?
• Frame it as a user scenario.
CHECK IT: Verify user scenario with a partner
Is it framed as “As a [persona], I’m in [situation]
where I face [problem], leading to
[consequence]. Solving it by [action] leads to
[impact]”
Would the persona relate to this user scenario if
they heard it?
List scenario(s) for each persona
For each persona, answer the following questions:
1. What situation are they currently in?
2. What problems do they face?
3. What is the consequence?
4. What action can they take using your analysis?
5. What is the impact of this action?
Combine these as a user scenario:
“As a [persona], I’m in [situation] where I face [problem], leading
to [consequence]. Solving it by [action] leads to [impact]”
• John: As a Marketing manager, I have to create a region-wise budget for the next quarter. I don’t know which regions give the highest RoI, so my spend isn’t optimized. Solving it by prioritizing regions by RoI will lead to maximum RoI.
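The scenario template above is essentially a fill-in-the-blanks sentence, so it can be sketched as a small data structure. The class name `UserScenario` and the field values are hypothetical; the values loosely paraphrase the John example to fit the template's wording.

```python
from dataclasses import dataclass


@dataclass
class UserScenario:
    """One user scenario, with a field for each blank in the template."""
    persona: str
    situation: str
    problem: str
    consequence: str
    action: str
    impact: str

    def render(self):
        return (f"As a {self.persona}, I'm in {self.situation} where I face "
                f"{self.problem}, leading to {self.consequence}. "
                f"Solving it by {self.action} leads to {self.impact}.")


# Paraphrase of the John example, for illustration only.
john = UserScenario(
    persona="Marketing manager",
    situation="the region-wise budgeting cycle for next quarter",
    problem="uncertainty about which regions give the highest RoI",
    consequence="spend that isn't optimized",
    action="prioritizing regions by RoI",
    impact="maximum RoI",
)
print(john.render())
```

Keeping each blank as a separate field makes it easy to spot a scenario that is missing its consequence or impact.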
Clear needs and a clear future scenario lead to effective communication.
Know your audience’s needs: they determine the story. Align the message accordingly.
Reference: SPIN Selling by Neil Rackham
22. Insights must be Big, Useful, and Surprising
Filter the analyses using these as a checklist
IS THE INSIGHT BIG?
The analysis must, of course, be statistically significant. But it should also be numerically significant. We want a result that substantially changes the outcome.
IS THE INSIGHT USEFUL?
What should the audience do after hearing the insight? Can they take an action that improves their objective? Even if it’s informational, what should they do next?
IS THE INSIGHT SURPRISING?
Is this something they didn’t know? Is it non-obvious? Does it overturn a domain-driven belief or a gut feel? Or does it bring consensus to a group with divided opinion?
23. Marking each analysis as Big, Useful or Surprising (High, Medium, Low)
Only those that are high or medium on all aspects are insights
Insights (Big / Useful / Surprising)
Twice as many Detractors talk about our Product’s ease of use. (Low / Medium / High)
Typing with capitalization in a credit application indicates creditworthiness (Low / Low / High)
Almost 20% of all voice search queries are triggered by just 25 words (Low / High / Medium)
More engaged employees have fewer accidents (Low / High / Low)
About 50% of American small businesses do not have a website (High / Medium / Low)
The recommendation system influences about 80% of content streamed on Netflix (High / Medium / Low)
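The rule "high or medium on all aspects" can be sketched as a simple filter. The names `RANK` and `is_insight` are hypothetical, and the last example uses made-up ratings to show a passing case.

```python
RANK = {"Low": 0, "Medium": 1, "High": 2}


def is_insight(big, useful, surprising, minimum="Medium"):
    """True only if every rating is at least `minimum`."""
    return all(RANK[rating] >= RANK[minimum]
               for rating in (big, useful, surprising))


# Ratings taken from two rows of the table above.
print(is_insight("High", "Medium", "Low"))   # False (Netflix row)
print(is_insight("Low", "High", "Medium"))   # False (voice search row)

# Hypothetical ratings, to show a case that passes the checklist.
print(is_insight("Medium", "High", "High"))  # True
```

A single Low on any aspect is enough to drop an analysis from the list of insights.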
25. A business storyline
• Our NPS improved by 6 points
• It was 34% in 4Q18. Now it’s at 40% in 2Q19
• Despite lower satisfaction with our Support,
our NPS grew
• This increase in NPS was mainly due to better
Product Quality & Research
Gladiator’s storyline
• The Emperor asks General Maximus to take
control of Rome and give it back to people
• The ambitious Prince murders the emperor.
• Maximus is sold as a gladiator slave. His family
is murdered
• Maximus grows famous, fights the Prince in the
arena, and wins
• He joins his family in death. Rome is in the
hands of the people
Outlines are the backbone on which you flesh out your story.
This section explains how to create storylines
Storylines are plot outlines. They summarize the entire story
Notice “characters” in red. All stories
have characters, human or otherwise.
26. 3. Convert analysis into messages by adding context
DO IT: Add context to your analysis
1. Take each relevant analysis
2. Convert it to a message for the audience by
adding context
CHECK IT: Verify these yourself
Will your audience understand the messages
without explanation?
Will your audience understand why this
message is relevant?
Analysis on its own doesn’t mean anything to people. When it does, it’s a message. We get there by adding context. Three ways to add context are:
1. Compare with similar numbers.
Our $15 mn sales is $3 mn more than last year, $1 mn below budget, and twice our nearest competitor’s sales.
2. Explain with analogies.
If we stopped producing, it would take 3 months to dispose of our excess inventory of $2 mn.
3. Add business interpretation.
Usage is correlated with discounts. For every $1 discount, customer LTV increases by $24.
Frame each analysis as a message that the audience will understand and find relevant
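The first technique, comparing with similar numbers, can be sketched as a small helper that turns a bare figure into a message. The function `compare` is hypothetical. The reference values ($12 mn last year, $16 mn budget) are inferred from the example's "$3 mn more" and "$1 mn below".

```python
def compare(value, reference, label):
    """Describe `value` relative to `reference`, both in $ mn."""
    diff = value - reference
    if diff > 0:
        return f"${diff:g} mn more than {label}"
    if diff < 0:
        return f"${-diff:g} mn below {label}"
    return f"level with {label}"


sales, last_year, budget = 15, 12, 16  # $ mn, inferred from the example
message = (f"Our ${sales} mn sales is {compare(sales, last_year, 'last year')} "
           f"and {compare(sales, budget, 'budget')}.")
print(message)
# Our $15 mn sales is $3 mn more than last year and $1 mn below budget.
```

The number alone ($15 mn) is analysis. The comparisons are what make it a message.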
27. 4. Structure the messages into a pyramid or a tree
The conventional approach is to explain how we did the analysis and found the insight.
The insight is lost in the set of slides, and it takes too long to reach the first insight.
Instead, start with the insight, and then take the audience through the arguments that support it.
This starts with the main message, and then answers why and how the insight makes sense.
Conventional structure:
Title
• Analysis section 1: Methodology, Insight
• Analysis section 2: Methodology, Insight
Pyramid structure:
Insight that answers a business question
• Supporting argument 1: Methodology, Supporting reference
• Supporting argument 2: Methodology, Supporting reference
29. How the data should be interpreted decides the type of chart to be used
https://gramener.github.io/visual-vocabulary-vega/
Deviation, Change-over-Time, Spatial, Ranking, Correlation, Part-to-Whole, Flow, Magnitude, Distribution
30. Class X English Marks Distribution
[Histogram: X-axis shows marks from 0 to 100; Y-axis shows the number of students, from 0 to 20,000]
31. 4 types of annotations help the audience understand your intent
[Histogram: X-axis “Marks” (0 to 100), Y-axis “# students” (0 to 20,000)]
Summarize the chart in its title
Don’t describe the chart. Don’t write the question to answer. Write the answer itself. Like a headline.
Example: Teachers add marks to stop some students from failing
Explain the chart
How should the user read it? What do you say when you talk through it? Explain what the visual is. Then the axes. Then its contents. Then the inference.
Example: This chart shows Class 10 students’ English marks in Tamil Nadu, India, in 2011. The X-axis has the mark a student has scored. The Y-axis has the # of students who scored that mark. This is a bell curve. But the spike at 35 (the mark at which students pass) is unusual. Teachers must be adding marks to some of the students who are likely to fail by a small margin.
Highlight essential elements
What should the user focus their eyes on? Point it out. Interpret what they’re seeing – in words.
Example: What’s unusual: a large number of students score exactly 35 marks. Few (but not 0) students score between 30-35.
Recommend an action
How should I act on this? You need to change what the audience does. (Otherwise, you made no difference.)
Example: Only some students get this benefit. Identify a fair policy that will be applied consistently.
32. Insights and storytelling approach
Stage 1 - Identify Business Problem
Define the problem statement by understanding:
• What is the basic need and desired outcome?
• Who will benefit?
• What is the impact?
• What are the success criteria?
Stage 2 - Translate to Data Problem
• Break down the problem statement into multiple use cases
• Connect each use case with a data set
• Understand any limitations on data sources (internal and external)
Stage 3 - Data Answer
Target each use case with data through:
• EDA and transformation
• Modelling
• Generating insights
Stage 4 - Translate to Business Answer
• Stitch insights from individual use cases to create a story
• Connect the data story to better decision making
• Measure success
• Sales Rep
• Data Consultant
• Account Manager
• Solution Lead
• Analyst Lead
• Data Consultant
• Account Manager
• Solution Architect
• Solution Lead
• Analyst Lead
• Data Consultant
• Data Scientist
• Solution Architect
• Solution Lead
• Data Consultant
• Account Manager
• Solution Lead
33. 8 super spreaders are responsible for 2/3rd of all suspected cases in the district
This visual represents the Covid-19 suspects in
a district
Each dot is a suspect. Red tested positive,
Green negative, Grey awaiting results and Blue
not tested yet
Contact tracing for 40 positive cases resulted in
~1,400 suspected cases, an average of 35
contacts per person
Top 8 super spreaders are responsible for two-
thirds of all suspected cases
Of these 1,400 suspected cases, test results
are awaited for roughly 5% of the cases.
Among the declared results, 11% are positive
One positive patient also came in contact with
11 people from another district
We did the simplest possible thing – plot the number of customers who had meter readings of 0, 1, 2, 3, and so on – all the way up to 300 and beyond. (Effectively, we drew a histogram.)
As expected, it was log-normal. Relatively few users with low meter readings, and few with high meter readings. But what was striking were the spikes – at 50 units, 100 units, 200 units and 300 units – precisely at the slab boundaries.
Given the metering system, there is a strong economic incentive to stay at or within a slab boundary. Exceeding it increases the unit rate. However, there are two ways this could happen. Either the consumer watches their meter carefully, and the instant it hits 100, stops using their lights and fans – or a certain amount of money changes hands.
It was easy to see from this that there was fraud happening, but what stumped us were the spikes at 10, 20, 30, 40, and so on. Here, there’s no economic incentive. There’s no significant difference between a meter reading of 10 vs 11, so there was no incentive to commit fraud. However, we later learnt that we were looking at this the wrong way. This was not a case of fraud, but of laziness. These were the meter readings taken by staff who never visited the premises, and were cooking up numbers.
When people cook up numbers, they cook up round numbers. (An official said that he had to let go of one person who had not taken readings in a colony of houses for as long as six months. “Sir, there’s a pack of dogs in the colony” was his official statement.)
The other question is, what is the nature of this fraudulent contract? Is it monthly, where the meter reading guy appears and charges a small sum to adjust the reading? Or is it an annual contract that’s paid upfront? We looked at the meter readings of some of the people who were consistently at the slab boundaries. For example, the table in the middle has the readings of 10 customers, one per row. In the first row, the readings are consistently at 200 for 9 of the 12 months. However, there’s a spike in Jan-11 to 350 units. This indicated a monthly contract with a failure to pay in just one month. However, we later learnt that many of the people on this list were famous personalities. In fact, the lady in the first row had an event at her place in Jan-11, and the actual reading was expected to be well over a thousand units. But since the electricity board has a policy of not often auditing those in the highest slab (above 300), a more likely explanation was collusion between the lineman and the customer to place her in the highest slab just this month, to avoid scrutiny.
Lastly, we were examining the level at which fraud can be controlled. The last table above shows the extent of fraud of each section in one city, month on month. (The extent of fraud can be measured by the relative height of the spikes compared to the expected value.) Sections vary in the level of fraud, with Section 1 having significantly more fraud than Section 9. We also observe that fraud generally decreases in the winter season (Dec – Feb) when the need for cooling is less. But what’s most striking is the negative fraud in Section 5 in Jun-10. It stays low for a couple of months, and then, as if to compensate, shoots up to 82% in Sep-10.
We learnt that this coincided with the appointment and transfer of a new section manager – under whose “regime”, fraud seems to have been dramatically controlled. It appears that a good organization level to control fraud is at the 5,000 people strong section manager level, rather than the 100,000 people strong staff level.
Source: How to Vanquish Management Report Mania, Sep 2014: https://www.cfo.com/management-accounting/2014/09/vanquish-management-report-mania/
Instructors: Give the audience 2 minutes to expand on the context for any one analysis. Ask 1-2 people to read it out. Apply the checklist. Debate with them how they could make the point clearer and more meaningful to the audience. Explore alternatives, using comparisons, analogies or interpretation.