A nurse named Florence Nightingale used data visualization to reduce mortality rates during the Crimean War. She created a polar area chart (often described as a pie chart) showing the causes of soldier deaths, with the largest area representing deaths from avoidable hospital diseases. This helped Nightingale convince officials to improve sanitation in hospitals, which reduced the death rate from 40% to 2%. The deck then discusses how data storytelling can help individuals advance their careers and provides tips on summarizing data insights concisely for different audiences.
1. Data & Storytelling
Shravan Kumar, The Podium, Aug 8th, 2020
How to make yourself Indispensable in your Career with Data
2. How a nurse changed the course of a war using data storytelling
3. Nightingale helped curtail the death rate from a whopping 40% to a mere 2%
Created by Florence Nightingale for Queen Victoria during the Crimean War, which England fought alongside France.
Visualizes deaths due to:
Red: War wounds
Black: Other war-related causes
Blue: Avoidable hospital diseases
4. INTRODUCTION
Shravan Kumar A
Director, Client Success
“Simplify Data Science for all”
100+ Clients
Insights as Stories
Help start, apply and adopt Data Science
@sh_ra_van
/shravankumara
10. Companies are working to minimize COVID-19 impact and build resilience
[1] Source: BCG Covid-19 report, Apr 2, 2020
[2] Source: McKinsey - How CDOs can navigate COVID-19 response, Apr 2020
COVID-19 has disrupted every industry. All sectors display an element of fragility and are susceptible to shock [2].
Industries at the forefront of the crisis are relying on data to inform their response and rebound strategies.
McKinsey [2] suggests three waves of data-driven actions that organizations can take:
1. Ensure data teams, and the whole organization, remain operational.
2. Lead solutions to prepare for the crisis-triggered challenges.
3. Prepare for the next normal and get ready to execute the plans.
The effects of the outbreak aren’t going away quickly. This realization has settled in.
12. FEELING LUCKY? HERE’S A DATA SCIENCE TITLE GENERATOR!
Vanity keywords: Chief, Principal, Senior, Junior, Associate, Deputy, Assistant
Areas: Data, Statistical, ML, AI
Activities: Scientist, Engineer, Analyst, Designer, Developer, Storyteller, Ninja, Chef, Wrangler, Evangelist, Rock Star, Wizard, Alchemist
Examples: Senior Data Scientist, Principal AI Storyteller, Chief Data Wizard
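The title generator on this slide can be sketched in a few lines of Python. This is only an illustration: the word lists are copied from the slide's three columns, while the function name `generate_title` and the "keyword + area + activity" ordering are assumptions inferred from the three example titles.

```python
import random

# Word lists copied from the slide's three columns.
VANITY_KEYWORDS = ["Chief", "Principal", "Senior", "Junior",
                   "Associate", "Deputy", "Assistant"]
AREAS = ["Data", "Statistical", "ML", "AI"]
ACTIVITIES = ["Scientist", "Engineer", "Analyst", "Designer", "Developer",
              "Storyteller", "Ninja", "Chef", "Wrangler", "Evangelist",
              "Rock Star", "Wizard", "Alchemist"]


def generate_title(rng=random):
    """Return one title of the form '<vanity keyword> <area> <activity>'.

    `rng` is any object with a `choice` method (hypothetical parameter,
    added so that results can be made repeatable with random.Random(seed)).
    """
    return " ".join([
        rng.choice(VANITY_KEYWORDS),
        rng.choice(AREAS),
        rng.choice(ACTIVITIES),
    ])


if __name__ == "__main__":
    for _ in range(3):
        print(generate_title())
```

With 7 keywords, 4 areas and 13 activities, the generator can produce 364 distinct titles, which is rather the point of the joke.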
14. THE JOURNEY FROM DATA TO DECISIONS
Maturity phases:
Data Engineering: Data Collection, Data Storage, Data Transformation
Data Science: Reporting, Insights, Consumption
Data as ‘Culture’: Decisions
Source: Article – When and how to build out your data science team
16. REPORTING: DESCRIPTIVE SUMMARIES
2019       Boston        Chicago       Detroit       New York
Month      Price  Sales  Price  Sales  Price  Sales  Price  Sales
Jan         10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
Feb          8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
Mar         13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
Apr          9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
May         11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
Jun         14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
Jul          6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
Aug          4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
Sep         12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
Oct          7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
Nov          5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89
Average      9.0   7.50    9.0   7.50    9.0   7.50    9.0   7.50
Variance    10.0   3.75   10.0   3.75   10.0   3.75   10.0   3.75
Revenue numbers from four Cities
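The point of this table is that the Average and Variance rows are identical for all four cities even though the monthly values differ. The values match the well-known Anscombe's quartet. The sketch below recomputes the summary rows from the table. It assumes the Variance row is a population variance (dividing by n), which is what reproduces the 10.0 and 3.75 shown; the names `DATA` and `summarize` are illustrative.

```python
from statistics import mean, pvariance

# Prices shared by Boston, Chicago and Detroit (Jan to Nov), from the table.
SHARED_PRICES = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# city -> (prices, sales), transcribed from the table above.
DATA = {
    "Boston": (SHARED_PRICES,
               [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Chicago": (SHARED_PRICES,
                [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Detroit": (SHARED_PRICES,
                [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "New York": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                 [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(prices, sales):
    """Return the descriptive summaries reported in the table's last two rows."""
    return {
        "avg_price": mean(prices),
        "var_price": pvariance(prices),  # population variance (assumption)
        "avg_sales": mean(sales),
        "var_sales": pvariance(sales),
    }


summary = {city: summarize(prices, sales) for city, (prices, sales) in DATA.items()}

for city, s in summary.items():
    print(f"{city:9s} price avg {s['avg_price']:.2f} var {s['var_price']:.2f} | "
          f"sales avg {s['avg_sales']:.2f} var {s['var_sales']:.2f}")
```

All four cities print the same rounded summaries, so the descriptive report alone cannot tell them apart. Plotting Sales against Price for each city is what reveals the differences.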
17. INSIGHT: PREDICTING TELCO CUSTOMER CHURN
[Figure: decision tree that splits on Tenure (months: 0-12, 12-36, 36+), Data Usage > 1.5 GB (Y/N) and Bill > $65 (Y/N); leaves are labelled 0 = Low Risk and 1 = High Risk]
• Simple decision-tree model offered ~30% reduction in churn
• Advanced black-box models offered ~50%, but with low explainability
Source: Gramener
19. CONSUMPTION: WHEN ARE PEOPLE BORN IN THE US?
Source: https://gramener.com/posters/Birthdays.pdf
[Chart: births by day of the year, shaded from fewer births to more births]
Very high births..
..so, conceptions might happen here
Love the Valentine’s?
Too busy holidaying?
Avoid April Fool’s Day?
Unlucky 13th?
20. CONSUMPTION: WHAT’S THE BIRTH PATTERN IN INDIA?
Source: https://gramener.com/posters/Birthdays.pdf
[Chart: births by day of the year, shaded from fewer births to more births]
Most births in the first half
Very low births Aug onwards
A striking birth pattern seen on the 5th, 10th, 15th, 20th and 25th of each month… this is a typical indication of fraud!
Why? Birthdates are ‘changed’ to aid early school admissions
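One way to check for the round-day pattern described here is to compare the observed share of birthdays on the 5th, 10th, 15th, 20th and 25th with the share expected if births were spread evenly over the year. The sketch below is illustrative only: the function name, the even-spread baseline and the sample dates are all assumptions (the sample is made up, not taken from the poster's data).

```python
from datetime import date

ROUND_DAYS = {5, 10, 15, 20, 25}

# If births were spread evenly, each of these 5 days occurs 12 times
# in an average year of 365.25 days (about 16.4% of all days).
EXPECTED_SHARE = len(ROUND_DAYS) * 12 / 365.25


def round_day_ratio(birth_dates):
    """Return observed / expected share of births on round-numbered days.

    A ratio near 1 is unremarkable; a ratio well above 1 suggests that
    recorded dates cluster on round days.
    """
    on_round_days = sum(1 for d in birth_dates if d.day in ROUND_DAYS)
    observed_share = on_round_days / len(birth_dates)
    return observed_share / EXPECTED_SHARE


# Hypothetical sample: four of the six recorded dates fall on round days.
sample = [
    date(2001, 6, 5), date(2001, 6, 10), date(2002, 1, 15),
    date(2002, 3, 7), date(2003, 5, 25), date(2003, 2, 11),
]

print(f"round-day ratio: {round_day_ratio(sample):.1f}x the expected share")
```

A real analysis would need many more records and a proper significance test; the ratio is just a quick screening number.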
21. This adversely impacts children’s marks
It’s a well-established fact that older children tend to do
better at school in most activities. Since many children
have had their birth dates brought forward, these younger
children suffer.
Children “born” on the 1st, 5th, 10th, 15th etc. of the month tend to score lower marks on average.
• Are holidays avoided for births?
• Which months have a higher propensity for births, and why?
• Are there any patterns not found in the US data?
[Chart legend: higher marks to lower marks, on average, for children born on a given day of the year (from 2007 to 2013)]
Children “born” on round numbered days score lower marks on average,
due to a higher proportion of younger children
22. An energy utility detected billing fraud
An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available.
Most fraud detection software failed to load the data, and sampled data revealed little or no insight.
Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh).
Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.
This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries.
This can happen in one of two ways.
First, people may be monitoring their usage very carefully, and turn off their lights and fans the instant their usage hits the slab boundary.
Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that the reading stays exactly at the slab boundary, giving them the advantage of a lower price.
23. This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.
This clearly shows collusion of some form with the customers.
Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
217 219 200 200 200 200 200 200 200 350 200 200
250 200 200 200 201 200 200 200 250 200 200 150
250 150 150 200 200 200 200 200 200 200 200 150
150 200 200 200 200 200 200 200 200 200 200 50
200 200 200 150 180 150 50 100 50 70 100 100
100 100 100 100 100 100 100 100 100 100 110 100
100 150 123 123 50 100 50 100 100 100 100 100
0 111 100 100 100 100 100 100 100 100 50 50
0 100 27 100 50 100 100 100 100 100 70 100
1 1 1 100 99 50 100 100 100 100 100 100
This happens with specific
customers, not randomly. Here are
such customers’ meter readings.
Section Apr-10 May-10 Jun-10 Jul-10 Aug-10 Sep-10 Oct-10 Nov-10 Dec-10 Jan-11 Feb-11 Mar-11
Section 1 70% 97% 136% 65% 110% 116% 121% 107% 114% 88% 74% 109%
Section 2 66% 92% 66% 87% 70% 64% 63% 50% 58% 38% 41% 54%
Section 3 90% 46% 47% 43% 28% 31% 50% 32% 19% 38% 8% 34%
Section 4 44% 24% 36% 39% 21% 18% 24% 49% 56% 44% 31% 14%
Section 5 4% 63% -27% 20% 41% 82% 26% 34% 43% 2% 37% 15%
Section 6 18% 23% 30% 21% 28% 33% 39% 41% 39% 18% 0% 33%
Section 7 36% 51% 33% 33% 27% 35% 10% 39% 12% 5% 15% 14%
Section 8 22% 21% 28% 12% 24% 27% 10% 31% 13% 11% 22% 17%
Section 9 19% 35% 14% 9% 16% 32% 37% 12% 9% 5% -3% 11%
If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections and time, with some explainable anomalies.
Annotations: New section manager arrives … and is transferred out. Why would these happen?
25. Class Xth English Marks Distribution
[Chart: number of students (0 to 20,000) against marks scored (0 to 100)]
26. Stories have four types of narratives to explain visualizations
Remember “SEAR”: Summarize, Explain, Annotate, Recommend
[Chart: # students (0 to 20,000) against Marks (0 to 100)]

Summarize the visual in its title
Don’t describe the chart.
Don’t write the user’s question.
Write the answer itself. Like a headline.
Example: Teachers add marks to stop some students from failing

Explain & interpret the visual
How should the user read it?
What do you say when you talk through it?
Explain what the visual is. Then the axes.
Then its contents. Then the inference.
Example: This chart shows Class 10 students’ English marks in Tamil Nadu, India, in 2011. The X-axis has the mark a student has scored. The Y-axis has the # of students who scored that mark. This is a bell curve. But the spike at 35 (the mark at which students pass) is unusual. Teachers must be adding marks to some of the students who are likely to fail by a small margin.

Annotate essential elements
What should the user focus their eyes on?
Point it out, or highlight it with colors
Interpret what they’re seeing – in words.
Example (what’s unusual): Large number of students score exactly 35 marks. Few (but not 0) students fail at 31-34 marks. No one scores 0-4 marks.

Recommend an action
How should I act on this?
You need to change the audience.
(Otherwise, you made no difference.)
Example: Only some students get this benefit. Identify a fair policy that will be applied consistently.
27. INSIGHT + CONSUMPTION: DATA STORIES FROM THE WORLD BANK
Source: World bank storytelling, by Gramener
28. DATA CULTURE: WHEN DATA DRIVES THE ENTIRE ORGANIZATION
Source: Netflix.com; Slides from InfoQ – ML Infra at Netflix
32. Insights and storytelling approach
Stage 1 - Identify Business Problem
Define the problem statement by understanding:
• What is the basic need and desired outcome?
• Who will benefit?
• What is the impact?
• What are the success criteria?
Stage 2 - Translate to Data Problem
• Break down the problem statement into multiple use cases
• Connect each use case with a data set
• Understand any limitations on data sources, internal and external
Stage 3 - Data Answer
Target each use case with data through:
• EDA and transformation
• Modelling
• Generating insights
Stage 4 - Translate to Business Answer
• Stitch insights from individual use cases to create a story
• Connect the data story to help in better decision making
• Measure success
• Sales Rep
• Data Consultant
• Account Manager
• Solution Lead
• Analyst Lead
• Data Consultant
• Account Manager
• Solution Architect
• Solution Lead
• Analyst Lead
• Data Consultant
• Data Scientist
• Solution Architect
• Solution Lead
• Data Consultant
• Account Manager
• Solution Lead
35. In summary, here are the 9 steps to go from data to a data story
Who is your audience? They determine the story
What is their problem? That defines your analysis
Find the right analysis to solve the problem
Filter for big, useful, surprising insights
Start with the takeaway. Summarize your entire story
Add supporting analyses as a tree
Pick a format based on how your audience will consume the story
Pick a visual design based on the takeaway
Annotate to explain & engage. Use four types of narratives
36. MACHINE LEARNING 101
“Programs that solve the problem” vs “Programs that learn to solve the problem”
[Diagram: Known Input and Known Outcome feed a machine learning how to do the job; New Input then produces the Desired Outcome]
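The contrast on this slide can be made concrete with a toy example. In the sketch below, `solve_by_rule` is a program that solves the problem because a person wrote the rule, while `learn_linear_rule` is a program that learns the rule from known inputs and known outcomes and then applies it to new input. The temperature-conversion task, the function names and the use of a least-squares straight-line fit are all assumptions chosen for illustration; they are not from the deck.

```python
def solve_by_rule(celsius):
    """A program that solves the problem: the rule is written by hand."""
    return celsius * 9 / 5 + 32


def learn_linear_rule(known_inputs, known_outcomes):
    """A program that learns to solve the problem.

    Fits a straight line to (known input, known outcome) pairs using
    ordinary least squares and returns a function for new inputs.
    """
    n = len(known_inputs)
    mean_x = sum(known_inputs) / n
    mean_y = sum(known_outcomes) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(known_inputs, known_outcomes))
    sxx = sum((x - mean_x) ** 2 for x in known_inputs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda new_input: slope * new_input + intercept


# Known input and known outcome (Celsius and Fahrenheit readings).
known_inputs = [0, 10, 20, 30, 40]
known_outcomes = [32, 50, 68, 86, 104]

learned = learn_linear_rule(known_inputs, known_outcomes)

# New input -> desired outcome, from both kinds of program.
print("hand-written rule:", solve_by_rule(100))
print("learned rule:     ", round(learned(100), 2))
```

Both programs agree on the new input, but only the first one needed someone to know the formula in advance.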
37. WHY DEEP LEARNING?
Traditional Machine Learning: Input → Identify features to teach model → Output (Person Name)
Deep Learning: Input → Model automatically identifies features to learn → Output (Person Name)
https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf
39. 1. Most Data Science projects solve the wrong problem..
Tip #1: Master the application of knowledge
40. AI IS COMING FOR THE DATA SCIENCE JOBS
AI and automation will do away with most of the grunt work in the data science workflow today.
Applied knowledge will keep you relevant for much longer.
42. What is Wolbachia?
• Naturally occurring bacteria
• Transmitted from parent to offspring through the insect’s eggs
• Safe for humans, animals and the environment
• Reduces ability of mosquitoes to transmit disease (dengue, Zika, chikungunya)
46. Model design
Identify where people live → Detect buildings → Estimate human population density
100m2 grids, e.g. 20,000 ppl / km2, 15,000 ppl / km2
47. Data sources for model development
High resolution satellite imagery:
SpaceNet Challenge
• very high resolution
• select cities available
Purchased satellite imagery
• high resolution
• mix of archive and tasked imagery
Building outline data:
SpaceNet Challenge
• manually labelled
• high quality
Open Street Map
• crowd-sourced
• lower quality, but suitable for training
Population data:
Gridded Population of the World
• Modelled pop. data
• 1km2 resolution
Census data
• Up-to-date
• High resolution
48. End to end solution
Data sources: High resolution satellite imagery, Building outline data, Population data
Modelling: Human Settlement Identification (where people live), Building footprint extraction, Human population density estimation
Model outputs (API), DATA: Population density, Exclusion areas, Release areas, Building footprints
Site Scoping application
DECISIONS: Mosquito release density, Prioritisation of release areas, Phasing of releases
50. Site scoping
• Set boundary of potential release area
• Identify the areas where people live
• Map mosquito release points over area with a grid
• Organise release area into stages
51. 2. Data Analytics needs a lot more than Data & Analytics..
Tip #2: Learn non-core skills
53. ..AND BREAK IT DOWN INTO THE BUILDING BLOCKS
Domain
• Business workflow
• Influencing factors
Design
• User journey
• Visuals & aesthetics
Analytics
• Impact analytics
• Clustering techniques
Development
• Frontend/backend coding
• Data transformation
Project Management
• Piecing it all together
• Change management
54. HERE ARE THE 5 ROLES & SKILLS CRITICAL FOR DATA SCIENCE
Data Translator (Domain)
• Domain expertise
• Business analysis
• Solutioning
Information Designer (Design)
• Information design
• User centered design
• Interface/visual design (parts)
Data Scientist (Analytics)
• Stats & ML
• Interpret insights
• Scripting skills
ML Engineer (Development)
• Software engineering
• Front/back-end coding
• Data pipelining
Data Science Manager (Project Management)
• Project management
• Business analysis/solutioning
• Team handling
Comic characters from Gramener Comicgen library
55. 3. Data cleaning takes up a majority of time on projects..
Tip #3: Sharpen ability to handle data
56. “In data science, 80% of the time is spent preparing data, and the other 20% on complaining about preparing the data!”
- Kirk Borne
57. LEARN DATA HANDLING AND BUDGET TIME FOR IT IN YOUR WORK
Data Cleaning & Preparation:
• Data deduplication
• Data standardization
• Data normalization
• Quality check
• Exploratory analysis
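A minimal sketch of the first few data handling steps, using only the Python standard library. Everything here is illustrative: the records are made up, the field names are hypothetical, and "normalization" is interpreted as min-max scaling of a numeric field (it can also mean restructuring database tables, which this sketch does not cover).

```python
def standardize(record):
    """Standardization: unify whitespace, case and types so equal values compare equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "city": record["city"].strip().upper(),
        "spend": float(record["spend"]),
    }


def deduplicate(records):
    """Deduplication: keep the first record for each (name, city) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["name"], record["city"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def normalize(records, field):
    """Normalization (min-max): add '<field>_scaled' in the range 0 to 1."""
    values = [record[field] for record in records]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all values match
    return [{**record, field + "_scaled": (record[field] - low) / span}
            for record in records]


# Hypothetical raw records; the first two are the same customer typed differently.
raw = [
    {"name": "  asha rao ", "city": "hyderabad", "spend": "120"},
    {"name": "Asha  Rao", "city": "Hyderabad ", "spend": "120"},
    {"name": "Ben Lee", "city": "boston", "spend": "80"},
    {"name": "Carla Diaz", "city": "Chicago", "spend": "200"},
]

clean = normalize(deduplicate([standardize(r) for r in raw]), "spend")

# Quality check: every scaled value must lie between 0 and 1.
assert all(0.0 <= r["spend_scaled"] <= 1.0 for r in clean)

for record in clean:
    print(record)
```

Standardizing before deduplicating matters here: the two "Asha Rao" rows only become identical once whitespace and case are cleaned up.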
58. 4. Technology goes obsolete faster in Data Science..
Tip #4: Learn new tools quickly
59. WHAT DOES THE DATA TOOLS LANDSCAPE LOOK LIKE?
The tool does not matter. A person’s skill with the tool does.
Pick up the ability to learn new tools rapidly
Source: https://mattturck.com/data2019/
60. EXAMPLE: WHAT ARE YOUR TOOL OPTIONS TO VISUALIZE DATA?
[Chart: tools placed by Flexibility against Complexity, ranging from Plug-n-play to Code-based]
Tools shown: Google Data Studio, Excel, Google Sheets, Tableau, Raw, Vismio, Datawrapper, Timeline JS, Polestar, Vega, Vega-lite, d3, matplotlib, C3, Highcharts, Nvd3, Gramex, ggplot, bokeh, Plotly
Choose tools based on flexibility, your background and tool availability
61. Tip #1: Master the application of knowledge
Tip #2: Learn non-core skills
Tip #3: Sharpen ability to handle data
Tip #4: Learn new tools quickly
63. WHAT DOES THE RECESSION MEAN FOR JOBS IN DATA SCIENCE?
Source: McKinsey report – Lives and Livelihoods
Data jobs and specialized professions are relatively less impacted
Industries with the lowest wages and lowest educational attainment are hit the hardest
64. HERE’S WHY DATA IS KEY FOR COVID-19 AND THE RECESSION
A. Public Health
• Identifying vulnerability and contact-tracing
• Tracking the COVID-19 patient lifecycle
• Predicting infection rates and spread
B. Enterprises
• Remote workforce & collaboration
• Market demand & Cash flows
• Supply chain & Logistics
C. Community
• Understand behavioral shifts
• Mapping the effectiveness of shutdown
• Address people concerns during Covid-19
Sources: Gramener – NYC 311 analysis; Kinsa Health weather map; Gramener – Supply Chain flow
65. HOW DO YOU STAY RELEVANT AND GROW IN YOUR CAREER PATH?
• Do your own data projects
• Read/Write on data science
• Maintain a public portfolio
• Compete, learn & re-apply
Source: Article – How to demonstrate your passion for Data