
The value of storytelling through data

Gramener's Shravan Kumar conducted a virtual session for the students of IIM Shillong. It was a fantastic interaction, and the participants were keen to understand how data and storytelling are making a difference to new-age businesses. Check out the slides from the session.


  1. Data & Storytelling: How to make yourself indispensable in your career with data. Shravan Kumar, The Podium, Aug 8th, 2020
  2. How a nurse changed the course of a war using data storytelling
  3. Florence Nightingale helped curtail the death rate from a whopping 40% to a mere 2%. Her diagram, created for Queen Victoria during the Crimean War, visualizes deaths due to: Red: War wounds; Black: Other war-related causes; Blue: Avoidable hospital diseases
  4. INTRODUCTION: Shravan Kumar A, Director, Client Success. “Simplify Data Science for all”. 100+ Clients. Insights as Stories. Help start, apply and adopt Data Science. @sh_ra_van /shravankumara
  5. Introduction to Data Portraits
  6. How to Create a Data Portrait
  7. DATA SCIENCE: WHAT’S THE VALUE? REALITY CHECK: HOW TO THRIVE? IT’S A RECESSION. WHY DATA NOW?
  8. Source: McKinsey – COVID-19 Briefing materials
  9. Source: McKinsey – COVID-19 Briefing materials
  10. Companies are working to minimize COVID-19 impact and build resilience. COVID-19 has disrupted every industry. All sectors display an element of fragility and are susceptible to shock [2]. The effects of the outbreak aren’t going away quickly. This realization has settled in. Industries at the forefront of the crisis are relying on data to inform their response and rebound strategies. McKinsey [2] suggests three waves of data-driven actions that organizations can take: 1. Ensure data teams, and the whole organization, remain operational. 2. Lead solutions to prepare for the crisis-triggered challenges. 3. Prepare for the next normal and get ready to execute the plans. Sources: [1] BCG Covid-19 report, Apr 2, 2020; [2] McKinsey - How CDOs can navigate COVID-19 response, Apr 2020
  11. DATA SCIENCE: WHAT’S THE VALUE? IT’S A RECESSION. WHY DATA NOW? REALITY CHECK: HOW TO THRIVE?
  12. FEELING LUCKY? HERE’S A DATA SCIENCE TITLE GENERATOR! Vanity keywords: Chief, Principal, Senior, Junior, Associate, Deputy, Assistant. Areas: Data, Statistical, ML, AI. Activities: Scientist, Engineer, Analyst, Designer, Developer, Storyteller, Ninja, Chef, Wrangler, Evangelist, Rock Star, Wizard, Alchemist. Examples: Senior Data Scientist, Principal AI Storyteller, Chief Data Wizard
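The title generator on this slide can be sketched in a few lines of Python. This is only an illustration: the word lists are copied from the slide, while the function names `all_titles` and `random_title` are hypothetical, and the "keyword, then area, then activity" ordering is inferred from the three example titles.

```python
import itertools
import random

# Word lists copied from the slide ("Designer" appears twice there; listed once here).
VANITY = ["Chief", "Principal", "Senior", "Junior", "Associate", "Deputy", "Assistant"]
AREAS = ["Data", "Statistical", "ML", "AI"]
ACTIVITIES = [
    "Scientist", "Engineer", "Analyst", "Designer", "Developer", "Storyteller",
    "Ninja", "Chef", "Wrangler", "Evangelist", "Rock Star", "Wizard", "Alchemist",
]


def all_titles():
    """Every vanity keyword + area + activity combination."""
    return [" ".join(parts) for parts in itertools.product(VANITY, AREAS, ACTIVITIES)]


def random_title(rng=random):
    """Pick one word from each list; pass a seeded random.Random for repeatability."""
    return " ".join(rng.choice(words) for words in (VANITY, AREAS, ACTIVITIES))


if __name__ == "__main__":
    print(random_title())
    print(len(all_titles()), "possible titles")
```

With 7 keywords, 4 areas and 13 distinct activities there are 364 possible titles, which is the point of the slide: the title alone says little about the work.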
  13. BUZZWORDS AND BUSTED BUDGETS
  14. THE JOURNEY FROM DATA TO DECISIONS. Maturity phases: Data Engineering, Data Science, Data as ‘Culture’. Stages: Data Collection, Data Storage, Data Transformation, Reporting, Insights, Consumption, Decisions. Source: Article – When and how to build out your data science team
  15. THE JOURNEY FROM DATA TO DECISIONS. Maturity phases: Data Engineering, Data Science, Data as ‘Culture’. Stages: Data Collection, Data Storage, Data Transformation, Reporting, Insights, Consumption, Decisions. Source: Article – When and how to build out your data science team
  16. REPORTING: DESCRIPTIVE SUMMARIES. Revenue numbers from four cities, 2019

      Month      Boston          Chicago         Detroit         New York
                 Price   Sales   Price   Sales   Price   Sales   Price   Sales
      Jan        10.0    8.04    10.0    9.14    10.0    7.46     8.0    6.58
      Feb         8.0    6.95     8.0    8.14     8.0    6.77     8.0    5.76
      Mar        13.0    7.58    13.0    8.74    13.0   12.74     8.0    7.71
      Apr         9.0    8.81     9.0    8.77     9.0    7.11     8.0    8.84
      May        11.0    8.33    11.0    9.26    11.0    7.81     8.0    8.47
      Jun        14.0    9.96    14.0    8.10    14.0    8.84     8.0    7.04
      Jul         6.0    7.24     6.0    6.13     6.0    6.08     8.0    5.25
      Aug         4.0    4.26     4.0    3.10     4.0    5.39    19.0   12.50
      Sep        12.0   10.84    12.0    9.13    12.0    8.15     8.0    5.56
      Oct         7.0    4.82     7.0    7.26     7.0    6.42     8.0    7.91
      Nov         5.0    5.68     5.0    4.74     5.0    5.73     8.0    6.89
      Average     9.0    7.50     9.0    7.50     9.0    7.50     9.0    7.50
      Variance   10.0    3.75    10.0    3.75    10.0    3.75    10.0    3.75
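The table's point is that the four cities share the same average and variance even though their monthly numbers differ. A minimal sketch that recomputes the summary rows from the monthly values is below. The data is copied from the slide; the names `CITIES`, `summarize` and `SUMMARY` are hypothetical, and using the population variance (`statistics.pvariance`) is an assumption made because it reproduces the slide's 10.0 and 3.75.

```python
from statistics import mean, pvariance

# Monthly values (Jan to Nov) copied from the slide's table.
SHARED_PRICES = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
NEW_YORK_PRICES = [8.0] * 7 + [19.0] + [8.0] * 3

CITIES = {
    "Boston": (SHARED_PRICES,
               [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "Chicago": (SHARED_PRICES,
                [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "Detroit": (SHARED_PRICES,
                [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "New York": (NEW_YORK_PRICES,
                 [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(prices, sales):
    """Descriptive summary for one city, rounded to two decimals."""
    return {
        "avg_price": round(mean(prices), 2),
        "var_price": round(pvariance(prices), 2),  # population variance (assumption)
        "avg_sales": round(mean(sales), 2),
        "var_sales": round(pvariance(sales), 2),
        "max_sales": max(sales),  # one number the averages hide
    }


SUMMARY = {city: summarize(prices, sales) for city, (prices, sales) in CITIES.items()}

if __name__ == "__main__":
    for city, stats in SUMMARY.items():
        print(city, stats)
```

The averages and variances come out the same for every city, while `max_sales` already hints that Detroit and New York each contain an outlier month. A summary table alone would not reveal that.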
  17. INSIGHT: PREDICTING TELCO CUSTOMER CHURN. Decision tree splitting on Tenure (months: 0-12, 12-36, 36+), Data Usage > 1.5 GB (Y/N) and Bill > $65 (Y/N); leaves are 0 (Low Risk) or 1 (High Risk). • Simple decision-tree model offered ~30% reduction in churn • Advanced black-box models offered ~50%, but with low explainability. Source: Gramener
  18. INSIGHT: IDENTIFYING QUALITY OF LIFE FROM SATELLITE IMAGES. Source: https://qol.gramener.com/
  19. CONSUMPTION: WHEN ARE PEOPLE BORN IN THE US? Annotations: Very high births.. so, conceptions might happen here. Love the Valentine’s? Too busy holidaying? Avoid April Fool’s Day? Unlucky 13th? Legend: More births / Fewer births. Source: https://gramener.com/posters/Birthdays.pdf
  20. CONSUMPTION: WHAT’S THE BIRTH PATTERN IN INDIA? Annotations: Most births in the first half. Very low births Aug onwards. A striking birth pattern seen on the 5th, 10th, 15th, 20th and 25th of each month… Why? Birthdates are ‘changed’ to aid early school admissions .. this is a typical indication of fraud! Legend: More births / Fewer births. Source: https://gramener.com/posters/Birthdays.pdf
  21. This adversely impacts children’s marks. It’s a well-established fact that older children tend to do better at school in most activities. Since many children have had their birth dates brought forward, these younger children suffer. Children “born” on the 1st, 5th, 10th, 15th etc. of the month tend to score lower marks on average, due to a higher proportion of younger children. Questions to explore: • Are holidays avoided for births? • Which months have a higher propensity for births, and why? • Are there any patterns not found in the US data? Chart: higher vs. lower marks, on average, for children born on a given day of the year (from 2007 to 2013).
  22. An energy utility detected billing fraud. An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available. Most fraud detection software failed to load the data, and sampled data revealed little or no insight. Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh). This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries. Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary. This can happen in one of two ways. First, people may be monitoring their usage very carefully, and turn off their lights and fans the instant their usage hits the slab boundary. Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that it stays exactly at the slab boundary, giving them the advantage of a lower price.
  23. This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries. This clearly shows collusion of some form with the customers. This happens with specific customers, not randomly. Here are such customers’ meter readings.

      Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
      217     219     200     200     200     200     200     200     200     350     200     200
      250     200     200     200     201     200     200     200     250     200     200     150
      250     150     150     200     200     200     200     200     200     200     200     150
      150     200     200     200     200     200     200     200     200     200     200     50
      200     200     200     150     180     150     50      100     50      70      100     100
      100     100     100     100     100     100     100     100     100     100     110     100
      100     150     123     123     50      100     50      100     100     100     100     100
      0       111     100     100     100     100     100     100     100     100     50      50
      0       100     27      100     50      100     100     100     100     100     70      100
      1       1       1       100     99      50      100     100     100     100     100     100

      If we define the “extent of fraud” as the percentage excess of the 100 unit meter reading, the value varies considerably across sections, and time.

      Section    Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
      Section 1  70%     97%     136%    65%     110%    116%    121%    107%    114%    88%     74%     109%
      Section 2  66%     92%     66%     87%     70%     64%     63%     50%     58%     38%     41%     54%
      Section 3  90%     46%     47%     43%     28%     31%     50%     32%     19%     38%     8%      34%
      Section 4  44%     24%     36%     39%     21%     18%     24%     49%     56%     44%     31%     14%
      Section 5  4%      63%     -27%    20%     41%     82%     26%     34%     43%     2%      37%     15%
      Section 6  18%     23%     30%     21%     28%     33%     39%     41%     39%     18%     0%      33%
      Section 7  36%     51%     33%     33%     27%     35%     10%     39%     12%     5%      15%     14%
      Section 8  22%     21%     28%     12%     24%     27%     10%     31%     13%     11%     22%     17%
      Section 9  19%     35%     14%     9%      16%     32%     37%     12%     9%      5%      -3%     11%

      Annotations: New section manager arrives … and is transferred out … with some explainable anomalies. Why would these happen?
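The "extent of fraud" measure can be illustrated with a short sketch. The slide does not say how the expected number of 100-unit readings is estimated, so this version assumes the baseline is the average count of the neighbouring readings within a few units of the boundary. The function name `boundary_excess` and its `window` parameter are hypothetical.

```python
from collections import Counter


def boundary_excess(readings, boundary=100, window=5):
    """Percentage excess of readings exactly at `boundary` over a baseline.

    Baseline (an assumption, not stated on the slide): the average count of
    the other reading values within `window` units of the boundary.
    Returns None when there is no baseline to compare against.
    """
    counts = Counter(readings)
    neighbours = [
        value
        for value in range(boundary - window, boundary + window + 1)
        if value != boundary
    ]
    expected = sum(counts[value] for value in neighbours) / len(neighbours)
    if expected == 0:
        return None
    return 100.0 * (counts[boundary] - expected) / expected


if __name__ == "__main__":
    # Made-up readings: ten customers at each value from 95 to 105,
    # plus ten extra customers sitting exactly on the 100-unit boundary.
    readings = [value for value in range(95, 106) for _ in range(10)] + [100] * 10
    print(boundary_excess(readings))
```

Under this definition a value near 0% means the boundary looks like its neighbours, a large positive value flags clustering at the slab boundary, and a negative value (as in a couple of cells of the table) means fewer readings at the boundary than nearby.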
  24. CONSUMPTION: DECODING MAHABHARATHA’S RELATIONSHIP. Source: https://gramener.com/mahabharatha/
  25. Class Xth English Marks Distribution. [Histogram: marks from 0 to 100 on the X-axis, number of students from 0 to 20,000 on the Y-axis]
  26. Stories have four types of narratives to explain visualizations. Remember “SEAR”: Summarize, Explain, Annotate, Recommend. [Histogram: marks from 0 to 100 on the X-axis, # students from 0 to 20,000 on the Y-axis]
      Summarize the visual in its title: Don’t describe the chart. Don’t write the user’s question. Write the answer itself. Like a headline. Example: Teachers add marks to stop some students from failing.
      Explain & interpret the visual: How should the user read it? What do you say when you talk through it? Explain what the visual is. Then the axes. Then its contents. Then the inference. Example: This chart shows Class 10 students’ English marks in Tamil Nadu, India, in 2011. The X-axis has the mark a student has scored. The Y-axis has the # of students who scored that mark. This is a bell curve. But the spike at 35 (the mark at which students pass) is unusual. Teachers must be adding marks to some of the students who are likely to fail by a small margin.
      Annotate essential elements: What should the user focus their eyes on? Point it out, or highlight it with colors. Interpret what they’re seeing – in words. Example: Large number of students score exactly 35 marks. Few (but not 0) students fail at 31-34 marks. No one scores 0-4 marks.
      Recommend an action: How should I act on this? You need to change the audience. (Otherwise, you made no difference.) Example: Only some students get this benefit. Identify a fair policy that will be applied consistently.
  27. INSIGHT + CONSUMPTION: DATA STORIES FROM THE WORLD BANK. Source: World bank storytelling, by Gramener
  28. DATA CULTURE: WHEN DATA DRIVES THE ENTIRE ORGANIZATION. Source: Netflix.com; Slides from InfoQ – ML Infra at Netflix
  29. Data stories through Comicgen. An example: COVID-19 data explained by data comics
  30. Comic character in a data callout:
  31. Comic character on hex & grid tiles in ppts:
  32. Insights and storytelling approach.
      Stage 1 - Identify Business Problem. Define the problem statement by understanding: • What is the basic need and desired outcome? • Who will benefit? • What is the impact? • What are the success criteria?
      Stage 2 - Translate to Data Problem. • Break down the problem statement into multiple use cases • Connect each use case with a data set • Understand any limitations on data sources, internal and external
      Stage 3 - Data Answer. Target each use case with data through: • EDA and transformation • Modelling • Generating insights
      Stage 4 - Translate to Business Answer. • Stitch insights from individual use cases to create a story • Connect data story to help in better decision making • Measure success
      Roles listed on the slide, in order: • Sales Rep • Data Consultant • Account Manager • Solution Lead • Analyst Lead • Data Consultant • Account Manager • Solution Architect • Solution Lead • Analyst Lead • Data Consultant • Data Scientist • Solution Architect • Solution Lead • Data Consultant • Account Manager • Solution Lead
  33. Samuel L. Jackson, Morgan Freeman, Tom Hanks, Harrison Ford, Gary Oldman
  34. Samuel L. Jackson, Harrison Ford, Morgan Freeman, Tom Hanks, Tom Cruise
  35. In summary, here are the 9 steps to go from data to a data story: 1. Who is your audience? They determine the story. 2. What is their problem? That defines your analysis. 3. Find the right analysis to solve the problem. 4. Filter for big, useful, surprising insights. 5. Start with the takeaway. Summarize your entire story. 6. Add supporting analyses as a tree. 7. Pick a format based on how your audience will consume the story. 8. Pick a visual design based on the takeaway. 9. Annotate to explain & engage. Use four types of narratives.
  36. MACHINE LEARNING 101: “Programs that solve the problem” vs “Programs that learn to solve the problem”. Machine learning how to do the job: Known Input + Known Outcome; New Input → Desired Outcome
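The contrast between the two kinds of program can be made concrete with a toy example. This is a sketch under stated assumptions, not anything from the slides: the data is made up, the "model" is a single threshold, and the names `handwritten_rule`, `learn_threshold` and `predict` are hypothetical.

```python
def handwritten_rule(value):
    """A 'program that solves the problem': the programmer hard-codes the cutoff."""
    return 1 if value > 65 else 0


def learn_threshold(known_inputs, known_outcomes):
    """A 'program that learns to solve the problem'.

    Tries every midpoint between neighbouring known inputs and keeps the
    cutoff that misclassifies the fewest known outcomes.
    """
    distinct = sorted(set(known_inputs))
    candidates = [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]
    if not candidates:
        raise ValueError("need at least two distinct known inputs")

    def errors(threshold):
        return sum(
            (x > threshold) != bool(y) for x, y in zip(known_inputs, known_outcomes)
        )

    return min(candidates, key=errors)


def predict(threshold, new_input):
    """Apply the learned cutoff to a new input to get the desired outcome."""
    return 1 if new_input > threshold else 0


if __name__ == "__main__":
    # Made-up training data: known inputs with their known outcomes.
    known_inputs = [20, 35, 50, 80, 95, 120]
    known_outcomes = [0, 0, 0, 1, 1, 1]
    threshold = learn_threshold(known_inputs, known_outcomes)
    print(threshold, predict(threshold, 70))
```

In the first function a person decides the rule; in the second, the rule is derived from examples, so the same code adapts when the known inputs and outcomes change.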
  37. WHY DEEP LEARNING? Traditional Machine Learning: Input → Identify features to teach model → Output (Person Name). Deep Learning: Input → Model automatically identifies features to learn → Output (Person Name). Source: https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf
  38. DATA SCIENCE: WHAT’S THE VALUE? IT’S A RECESSION. WHY DATA NOW? REALITY CHECK: HOW TO THRIVE?
  39. #1: Most Data Science projects solve the wrong problem. Tip #1: Master the application of knowledge
  40. AI IS COMING FOR THE DATA SCIENCE JOBS. AI and automation will do away with most of the grunt work in the data science workflow today. Applied knowledge will keep you relevant for much longer.
  41. Wolbachia blocks dengue, Zika and chikungunya virus transmission
  42. What is Wolbachia? • Naturally occurring bacteria • Transmitted from parent to offspring through the insect’s eggs • Safe for humans, animals and the environment • Reduces ability of mosquitoes to transmit disease (dengue, Zika, chikungunya)
  43. Our innovative approach
  44. How it works
  45. Wolbachia mosquito releases: Adults, Eggs, Community
  46. Model design: Identify where people live → Detect buildings → Estimate human population density on 100m² grids, e.g. 20,000 ppl / km², 15,000 ppl / km²
  47. Data sources for model development. High resolution satellite imagery: SpaceNet Challenge (very high resolution; select cities available); Purchased satellite imagery (high resolution; mix of archive and tasked imagery). Building outline data: SpaceNet Challenge (manually labelled; high quality); Open Street Map (crowd-sourced; lower quality, but suitable for training). Population data: Gridded Population of the World (modelled pop. data; 1km² resolution); Census data (up-to-date; high resolution)
  48. End to end solution. Data sources: High resolution satellite imagery, Building outline data, Population data. Modelling: Human Settlement Identification (where people live), Building footprint extraction, Human population density estimation. Model outputs (API), DATA: Population density, Exclusion areas, Release areas, Building footprints. Site Scoping application, DECISIONS: Mosquito release density, Prioritisation of release areas, Phasing of releases
  49. Site scoping: • Set boundary of potential release area • Identify the areas where people live • Map mosquito release points over area with a grid • Organise release area into stages
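The four site-scoping steps can be sketched as a small grid routine. Everything here is illustrative: the rectangular boundary, the `density` callback standing in for the population model's output, the `min_density` cutoff, and the choice to stage the densest cells first are all assumptions, and `scope_site` is a hypothetical name rather than the project's actual application.

```python
def scope_site(boundary, cell_size, density, min_density=1.0, n_stages=2):
    """Return release points grouped into stages.

    boundary: (x_min, y_min, x_max, y_max) in metres (assumed rectangular).
    density:  function (x, y) -> people per km2 at that location (assumed input).
    """
    x_min, y_min, x_max, y_max = boundary
    n_cols = int((x_max - x_min) // cell_size)
    n_rows = int((y_max - y_min) // cell_size)

    # Steps 1-3: lay a grid over the boundary and keep cells where people live.
    points = []
    for row in range(n_rows):
        for col in range(n_cols):
            x = x_min + (col + 0.5) * cell_size  # cell centre
            y = y_min + (row + 0.5) * cell_size
            people = density(x, y)
            if people >= min_density:
                points.append((x, y, people))
    if not points:
        return []

    # Step 4: organise into stages, densest cells first (an assumed priority).
    points.sort(key=lambda point: point[2], reverse=True)
    per_stage = -(-len(points) // n_stages)  # ceiling division
    return [points[i:i + per_stage] for i in range(0, len(points), per_stage)]


if __name__ == "__main__":
    # Made-up density surface: nobody lives in the western 100 m strip.
    stages = scope_site((0, 0, 300, 200), 100, lambda x, y: 0 if x < 100 else x + y)
    for number, stage in enumerate(stages, start=1):
        print("Stage", number, stage)
```

In practice the boundary would be a polygon and the staging would depend on operational constraints; the sketch only shows how the grid, the population filter and the phasing fit together.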
  50. #2: Data Analytics needs a lot more than Data & Analytics. Tip #2: Learn non-core skills
  51. DATA SCIENCE SOLUTION: LET’S TAKE THIS EXAMPLE.. Source: World bank storytelling, by Gramener
  52. ..AND BREAK IT DOWN INTO THE BUILDING BLOCKS. Domain: Business workflow, Influencing factors. Design: User journey, Visuals & aesthetics. Analytics: Impact analytics, Clustering techniques. Development: Frontend/backend coding, Data transformation. Project Management: Piecing it all together, Change management
  53. HERE ARE THE 5 ROLES & SKILLS CRITICAL FOR DATA SCIENCE. Data Translator (Domain): Domain expertise, Business analysis, Solutioning. ML Engineer (Development): Software engineering, Front/back-end coding, Data pipelining. Information Designer (Design): Information design, User centered design, Interface/visual design (parts). Data Scientist (Analytics): Stats & ML, Interpret insights, Scripting skills. Data Science Manager (Project Management): Project management, Business analysis/solutioning, Team handling. Comic characters from Gramener Comicgen library
  54. #3: Data cleaning takes up a majority of time on projects. Tip #3: Sharpen ability to handle data
  55. “In data science, 80% of the time is spent preparing data, and the other 20% on complaining about preparing the data!” - Kirk Borne
  56. LEARN DATA HANDLING AND BUDGET TIME FOR IT IN YOUR WORK. Data Cleaning & Preparation: Data deduplication, Data standardization, Data normalization, Quality check, Exploratory analysis
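A minimal sketch of the cleaning and preparation steps named on this slide, using only the Python standard library. The records, field names and function names are made up for illustration. "Normalization" is interpreted here as min-max scaling, which is an assumption, and the exploratory analysis step is left out.

```python
def standardize(record):
    """Standardization: consistent spacing, casing and numeric types."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "city": record["city"].strip().upper(),
        "usage": float(record["usage"]),
    }


def deduplicate(records):
    """Deduplication: keep the first record for each (name, city) pair."""
    seen, unique = set(), []
    for record in records:
        key = (record["name"], record["city"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def quality_check(records):
    """Quality check: drop records with impossible (negative) usage."""
    return [record for record in records if record["usage"] >= 0]


def normalize(records):
    """Normalization (assumed to mean min-max scaling of usage to 0..1)."""
    values = [record["usage"] for record in records]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all values match
    return [{**record, "usage_scaled": (record["usage"] - low) / span} for record in records]


def prepare(raw_records):
    """Run the steps in order; standardizing first lets duplicates match."""
    records = quality_check(deduplicate([standardize(r) for r in raw_records]))
    return normalize(records) if records else []


if __name__ == "__main__":
    raw = [
        {"name": " asha  rao ", "city": "pune", "usage": "120"},
        {"name": "Asha Rao", "city": " Pune", "usage": "120"},
        {"name": "Ravi K", "city": "Delhi", "usage": "-5"},
        {"name": "Meera S", "city": "Chennai", "usage": "320"},
    ]
    for row in prepare(raw):
        print(row)
```

The order differs slightly from the slide: records are standardized before deduplication so that differently formatted copies of the same customer are recognised as duplicates.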
  57. #4: Technology goes obsolete faster in Data Science. Tip #4: Learn new tools quickly
  58. WHAT DOES THE DATA TOOLS LANDSCAPE LOOK LIKE? The tool does not matter. A person’s skill with the tool does. Pick up the ability to learn new tools rapidly. Source: https://mattturck.com/data2019/
  59. EXAMPLE: WHAT ARE YOUR TOOL OPTIONS TO VISUALIZE DATA? Tools range from plug-n-play to code-based, plotted by flexibility and complexity: Google Data Studio, Excel, Google Sheets, Tableau, Raw, Vismio, Datawrapper, Timeline JS, Polestar, Vega, Vega-lite, d3, matplotlib, C3, Highcharts, NVD3, Gramex, ggplot, bokeh, Plotly. Choose tools based on flexibility, your background and tool availability
  60. Tip #1: Master the application of knowledge. Tip #2: Learn non-core skills. Tip #3: Sharpen ability to handle data. Tip #4: Learn new tools quickly
  61. DATA SCIENCE: WHAT’S THE VALUE? IT’S A RECESSION. WHY DATA NOW? REALITY CHECK: HOW TO THRIVE?
  62. WHAT DOES THE RECESSION MEAN FOR JOBS IN DATA SCIENCE? Data jobs and specialized professions are relatively less impacted. Industries with the lowest wages and lowest educational attainment are hit the hardest. Source: McKinsey report – Lives and Livelihoods
  63. HERE’S WHY DATA IS KEY FOR COVID-19 AND THE RECESSION. A. Public Health: Tracking the COVID-19 patient lifecycle; Identifying vulnerability and contact-tracing; Predicting infection rates and spread. B. Enterprises: Market demand & cash flows; Remote workforce & collaboration; Supply chain & logistics. C. Community: Mapping the effectiveness of shutdown; Understand behavioral shifts; Address people’s concerns during COVID-19. Sources: Kinsa Health weather map; Gramener – Supply Chain flow; Gramener – NYC 311 analysis
  64. HOW DO YOU STAY RELEVANT AND GROW IN YOUR CAREER PATH? Do your own data projects. Read/Write on data science. Maintain a public portfolio. Compete, learn & re-apply. Source: Article – How to demonstrate your passion for Data
  65. @sh_ra_van /shravankumara. Please help me improve the session by answering the feedback survey that will be sent to your email. THANK YOU! GRACIAS! MERCI!
