Managing missing values in routinely reported data: One approach from the Democratic Republic of the Congo

This Data for Impact webinar was held in December 2020. Access the recording and learn more at https://www.data4impactproject.org/resources/webinars/managing-missing-values-in-routinely-reported-data-one-approach-from-the-democratic-republic-of-the-congo/

Published in: Health & Medicine

  1. 1. Managing missing values in routinely reported data: One approach from the DRC Matt Worges Data for Impact Webinar Series December 2, 2020
  2. 2. • Framing the Webinar through the D4I lens • DHIS2 data: advantages and issues • Exploring a DHIS2 data set • What to do with blanks? • Interpolation • Recreate the “Truth” • Interpolation diagnostics Overview
  3. 3. • The D4I team was tasked with conducting an impact evaluation of the USAID Integrated Health Project (IHP) implemented in 9 provinces of the DRC • IHP goal: Reduce maternal, newborn, and child deaths through delivery of integrated health services • IHP objectives: Increase access to and use of quality health services in the targeted health zones IHP Impact Evaluation
  4. 4. • D4I research question: What was the impact of IHP on the utilization of health services (e.g., treatment for childhood illnesses) over the course of the study period? • Measuring impact: D4I is assessing impact through a difference-in-differences (DID) with propensity score matching (PSM) model • Data source: We are using DHIS2 data for this impact evaluation IHP Impact Evaluation – Approach
  5. 5. • PSM is widely used to mitigate confounding in observational studies • Complications arise when the covariates used to estimate the propensity scores are only partially observed • Interpolation/imputation approaches provide a potential solution for handling missing data in the estimation of the propensity scores • Recommended to derive the propensity score after applying interpolation or imputation IHP Impact Evaluation – Propensity Score Matching
  6. 6. • Addition/removal of health facilities at different time points • Long runs of missing values • Zero counts are typically not entered – they are left blank • Cannot distinguish between truly missing and zero • Data entry errors manifesting as outliers/anomalous points • Reporting has improved over time making older time points less complete Some DHIS2 Issues
  7. 7. • Missing data can result in: • Reduced statistical power • Biased estimators • Reduced representativeness of the sample • Generally incorrect inference and conclusions Why do we care about missingness? Overview of Approaches for Missing Data – Susan Buchman
  8. 8. • Time Series Characteristics • Restricted to Haut-Katanga Province, DRC • Uncomplicated + severe malaria cases (all ages) • 24-month period from October 2018 to September 2020 • Health facility count = 1,362 • The monthly-aggregated time series appears to include both a seasonal and positive trend component Data Set
  9. 9. Unprocessed Data – Missingness Visualized. Columns: HF, Oct-18, Nov-18, Dec-18, Jan-19, Feb-19, Mar-19, Apr-19, May-19, Jun-19, Jul-19, Aug-19, Sep-19, Oct-19, Nov-19, Dec-19, Jan-20. Reported values per health facility (blank months not shown):
     • hk Panda Hôpital Général de Référence: 514 637 637 910 563 1375 678 483 839 773 929 792 694 1355 1219
     • hk Serge Amie Centre de Santé: 300 306 274 300 320 440 522 582
     • hk AENAF Centre de Santé de Référence: 91 60 212 154 65 279 114 59 213 55 131 38 399 227 222
     • hk Asvie Centre Médical: 439 556 475 379 370 335 279 280 256 381 627 639
     • hk Mupanda Centre de Santé: 610 479 363 610 641 408 573 248 237 279 455 319 203
     • hk Boma Publique Centre de Santé: 294 293 304 293 308 318 178 225 326 325 240
     • hk Kawama Centre de Santé: 174 176 2 283 280 304 286 288 4 275 379 319 264 313
     • hk Kabambakuku Centre de Santé: 317 396 372 434 368 298 255 314 303 251 287 283
     • hk Kaboka Centre de Santé: 419 314 201 240 350 199 151 197 274 257
     • hk Kasomeno Centre de Santé de Référence: 282 307 306 265
     • hk Kikula Centre de Santé de Référence: 221 241 246 275 167 318 393
     • hk Belle Vue Centre de Santé: 135 157 555 350 124 102 92
  10. 10. Unprocessed Data – Missingness Visualized Missing (28.6%) Present (71.4%) ‘visdat’ package Malaria Cases – Haut-Katanga Province
  11. 11. Unprocessed Data – Histogram of Missingness No missing values (complete case analysis) Completely blank records (remove from data set) One missing value Two missing values ‘ggplot2’ package 284 193 137 27
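The missingness share and the per-facility histogram on the two slides above can be approximated with a few lines of code. The slides used the R packages ‘visdat’ and ‘ggplot2’; the following is only a minimal Python sketch of the underlying counts. The function name `missingness_summary`, the use of `None` for a blank month, and the example data are assumptions made for illustration.

```python
from collections import Counter


def missingness_summary(series_by_facility):
    """Return (percent of cells missing, histogram of missing counts per facility).

    series_by_facility maps a facility name to its monthly values,
    with None marking a blank (missing) month.
    """
    total_cells = sum(len(values) for values in series_by_facility.values())
    missing_per_hf = {
        hf: sum(v is None for v in values)
        for hf, values in series_by_facility.items()
    }
    pct_missing = 100.0 * sum(missing_per_hf.values()) / total_cells
    # Histogram: number of missing months -> number of facilities
    histogram = Counter(missing_per_hf.values())
    return pct_missing, histogram


# Hypothetical example: three facilities, four months each
example = {
    "HF A": [10, 12, 11, 9],              # complete record
    "HF B": [5, None, 7, None],           # two missing values
    "HF C": [None, None, None, None],     # completely blank record
}
pct, hist = missingness_summary(example)
print(pct, dict(hist))
```

Facilities in the zero bin are the complete cases; facilities whose missing count equals the series length are the completely blank records.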
  12. 12. Unprocessed Data – Outliers? What are these doing here? Are they malaria outbreaks? Are they data entry errors?
  13. 13. Unprocessed Data – Outliers. ‘anomalize’ package Something looks off here This point didn’t show up as anomalous
  14. 14. • One method to remove outliers is to delete values that lie more than X standard deviations from the median • The median is insensitive to extreme values in your time series • Experiment with different thresholds (e.g., ± 4 SDs from the median or ± 6 SDs from the median) to examine what happens to your data Removing Egregious Outliers – One Approach
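The median-based rule described above might be sketched as follows. This is a Python illustration, not the presenter's code. It assumes the standard deviation is the sample SD of each facility's own observed values, which the slides do not specify. The function name `remove_egregious_outliers` and the example series are hypothetical.

```python
import statistics


def remove_egregious_outliers(values, n_sd=4.5):
    """Set to None any value farther than n_sd standard deviations from the median.

    None marks an already-missing month and is left untouched.
    Assumption: the SD is the sample SD of this series' observed values.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return list(values)  # not enough data to estimate a spread
    median = statistics.median(observed)
    sd = statistics.stdev(observed)
    low, high = median - n_sd * sd, median + n_sd * sd
    return [v if v is None or low <= v <= high else None for v in values]


# Hypothetical series: one blank month, 23 ordinary months, one extreme value
series = [None] + [100] * 23 + [10000]
cleaned = remove_egregious_outliers(series, n_sd=4.5)
print(cleaned[-1])
```

Because a single extreme value also inflates the SD, short series can hide outliers at high thresholds, which is one reason to experiment with several values of `n_sd`.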
  15. 15. Malaria cases Median Standard deviation + 4.5 SDs from the median This value would be removed from the data set
  16. 16. Anomalous Data Points ‘anomalize’ package This is what I’m targeting for removal Less concerned with these
  17. 17. Removing Egregious Outliers - Effects Average Malaria Cases – Haut-Katanga Province +4.5 SDs from the median Removed 8 values or 0.025% Unprocessed data set
  18. 18. Are missing values actually zeros in the DRC DHIS2?
  19. 19. Link between Missingness & Median Case Counts 1-15 16-30 31-45 46-60 61-75 76-90 91-105 106-120 121-135 136-150 >150 Median Health Facility Malaria Cases (binned) Generalization: the lower the median case counts the higher the number of average missing values
  20. 20. • Assume no item nonresponse? • Examine this notion with two extreme examples • One HF time series with large monthly values and 1 missing • One HF time series with low monthly values and 1 missing • Replace missing with zero and run anomaly detection Assumption: Missing Values are Zeros
  21. 21. Initial missing value was replaced with 0 Initial missing value was replaced with 0 ‘anomalize’ package
  22. 22. Interpolation on Univariate Time Series
  23. 23. • A univariate time series is a sequence of single observations at regular and successive points in time • Possible to decompose the time series into its trend, seasonal, and irregular components • We can use these time series characteristics in the interpolation process Univariate Time Series
  24. 24. Loess Seasonal Decomposition of Average Malaria Cases (panels: data, seasonal, trend, remainder; x-axis ticks: 2017, 2018, 2019, 2020) ‘stats’ package
  25. 25. Autocorrelation Function Plot (ACF plot) (y-axis: Autocorrelation Function; x-axis: Lag)
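The quantity behind an ACF plot is the sample autocorrelation at each lag. A minimal Python sketch is given below, assuming a complete series with no missing values. The function name `acf` and the example series are illustrative only.

```python
def acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag of a complete series."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((x - mean) ** 2 for x in values)
    result = []
    for lag in range(max_lag + 1):
        # Covariance between the series and itself shifted by `lag`
        num = sum((values[t] - mean) * (values[t + lag] - mean)
                  for t in range(n - lag))
        result.append(num / denom)
    return result


# Hypothetical monthly series with a 12-month seasonal pattern, two years long
pattern = [10, 12, 15, 20, 26, 30, 28, 22, 16, 12, 10, 9]
r = acf(pattern * 2, max_lag=12)
print(round(r[6], 3), round(r[12], 3))
```

A positive spike at lag 12 alongside a negative value near lag 6 is the kind of signature that suggests a yearly seasonal component in monthly data.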
  26. 26. • Values in a series do not have violent, unexplained fluctuations • Change (increase/decrease) between points occurs at a uniform rate Assumptions of Interpolation
  27. 27. • Easy to code (one line in R for long form data frame) • df$int_cases <- na_interpolation(df$cases, option = "linear", maxgap = 2) • Intuitive understanding of linearly interpolating across very short gaps of missing values • Probably a good approach for high case load facilities • May not grossly deviate from the ‘truth’ when applied to low case load facilities A Role for Linear Interpolation? ‘imputeTS’ package
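To make the effect of a gap limit concrete, here is a small Python sketch of linear interpolation restricted to short gaps. It is not the ‘imputeTS’ implementation. In this sketch, leading and trailing blanks are left missing, which may differ from how the R function treats series edges. The function name `interpolate_linear` and the example series are assumptions.

```python
def interpolate_linear(values, maxgap=2):
    """Linearly interpolate runs of None no longer than maxgap.

    Longer runs, and runs touching either end of the series, stay None.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i  # index of the first observed value after the gap (or n)
        gap = end - start
        if start == 0 or end == n or gap > maxgap:
            continue  # edge gap or gap too long: leave as missing
        left, right = out[start - 1], out[end]
        step = (right - left) / (gap + 1)
        for j in range(gap):
            out[start + j] = left + step * (j + 1)
    return out


# Hypothetical series with gaps of length 1, 3, and 2, plus a trailing blank
series = [10, None, 30, None, None, None, 50, None, None, 80, None]
filled = interpolate_linear(series, maxgap=2)
print(filled)
```

The three-month gap is deliberately left untouched, which mirrors the idea of only joining known values across very short runs of missing data.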
  28. 28. Linear Interpolation ---- ---- ---- Joining known values with linear segments
  29. 29. Initial missing value was replaced with 0 Initial missing value was replaced with 0 ‘anomalize’ package
  30. 30. Linearly interpolated ‘anomalize’ package
  31. 31. Seasonality in Interpolation Un-imputed data Linearly interpolated data w/o seasonality Linearly interpolated data w/ seasonality
  32. 32. • Take seasonality into account • na.interp from the ‘forecast’ package in R • By default, uses linear interpolation for non-seasonal series. For seasonal series, a robust STL decomposition is first computed. Then a linear interpolation is applied to the seasonally adjusted data, and the seasonal component is added back. • na.StructTS from the ‘zoo’ package in R • Interpolate with seasonal Kalman filter • These two functions use similar mechanisms to interpolate missing data in that they both can ‘handle’ seasonality in the time series Univariate Time Series Interpolation
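The idea of interpolating on seasonally adjusted data and then adding the seasonal component back can be illustrated with a deliberately simplified Python sketch. It does not reproduce `na.interp` or `na.StructTS`. In place of a robust STL decomposition or a Kalman filter, it estimates the seasonal component as the mean of each calendar position minus the overall mean, which is an assumption made only to keep the example short. The function name and the example data are hypothetical.

```python
def seasonal_linear_interpolation(values, period=12):
    """Fill interior gaps by linear interpolation on a seasonally adjusted series.

    Simplification: the seasonal component is the per-position mean
    minus the overall mean of the observed values (not STL).
    """
    observed = [v for v in values if v is not None]
    overall = sum(observed) / len(observed)
    seasonal = []
    for pos in range(period):
        same = [v for i, v in enumerate(values)
                if v is not None and i % period == pos]
        seasonal.append(sum(same) / len(same) - overall if same else 0.0)

    # Remove the seasonal component from the observed values
    adjusted = [None if v is None else v - seasonal[i % period]
                for i, v in enumerate(values)]

    # Linear interpolation between consecutive observed points
    known = [i for i, v in enumerate(adjusted) if v is not None]
    filled = list(adjusted)
    for a, b in zip(known, known[1:]):
        for j in range(a + 1, b):
            filled[j] = adjusted[a] + (adjusted[b] - adjusted[a]) * (j - a) / (b - a)

    # Add the seasonal component back
    return [None if v is None else v + seasonal[i % period]
            for i, v in enumerate(filled)]


# Hypothetical series with a period of 4; the second seasonal peak is missing
series = [10, 20, 30, 20, 10, 20, None, 20, 10, 20, 30, 20]
result = seasonal_linear_interpolation(series, period=4)
print(round(result[6], 2))
```

Plain linear interpolation would fill the missing peak with 20 (the midpoint of its neighbours), whereas the seasonally adjusted version recovers a value near 30 in this contrived example.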
  33. 33. Seasonality Adjusted Time Series
  34. 34. Let’s reset and apply some of these steps
  35. 35. Missingness Visualized – Unprocessed Data Missing (28.6%) Present (71.4%) ‘visdat’ package 284 HFs with no missing data
  36. 36. Missingness Visualized – Removed New/Defunct HFs Missing (13.8%) Present (86.2%) ‘visdat’ package
  37. 37. Missingness Visualized – Linear Interpolation (gaps ≤ 2) Missing (6.7%) Present (93.3%) ‘visdat’ package 807 HFs with no missing data
  38. 38. Time Series Trends New/defunct HFs and outliers have been removed from all time series
  39. 39. Recreate the “Truth”
  40. 40. • Use a data set containing only complete time series records • 2.5% of data are zero values (primarily limited to smaller facilities) • Introduce random missingness • Randomly delete 15% of data points • Delete 90% of remaining zero values • Include runs of more than 2 missing values • Apply various imputation methods and compare against the “truth” • Replace all blanks with zeros • Linear interpolation on gaps ≤ 2 • Use the two identified interpolation strategies that consider seasonality A Quick Example
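The masking step in the example above could be sketched in Python as follows. This covers only the two random deletions (15% of all points, then 90% of the remaining zeros). It does not force runs of more than two missing values, which would need an extra step. The function name `introduce_missingness`, the seed, and the example data are assumptions.

```python
import random


def introduce_missingness(values, p_delete=0.15, p_delete_zero=0.90, seed=0):
    """Return a copy of a complete series with values randomly set to None.

    Step 1: delete a random p_delete share of all points.
    Step 2: delete a random p_delete_zero share of the zeros that remain.
    """
    rng = random.Random(seed)  # fixed seed so the masking is reproducible
    out = list(values)
    n_delete = round(p_delete * len(out))
    for i in rng.sample(range(len(out)), n_delete):
        out[i] = None
    zeros = [i for i, v in enumerate(out) if v == 0]
    for i in rng.sample(zeros, round(p_delete_zero * len(zeros))):
        out[i] = None
    return out


# Hypothetical complete "truth" series: 10 zeros followed by the values 1..90
truth = [0] * 10 + list(range(1, 91))
masked = introduce_missingness(truth)
print(masked.count(None), sum(v == 0 for v in masked))
```

Keeping `truth` alongside `masked` is what allows the imputed values to be scored against known answers afterwards.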
  41. 41. Time Series Trends Anomalous data points have been removed
  42. 42. na.StructTS na.interp
  43. 43. na.StructTS Average raw bias = -1.18 na.interp Average raw bias = -0.03
  44. 44. na.StructTS MAPE = 119.03 na.interp MAPE = 117.41
  45. 45. The RMSE difference is positive for 1,847 HFs indicating that the ‘na.StructTS’ approach had a lower RMSE for 68% of HFs ‘na.StructTS’ approach has lower RMSE ‘na.interp’ approach has lower RMSE
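The three diagnostics shown on the preceding slides (raw bias, MAPE, RMSE) can be computed from the deleted positions alone. The Python sketch below uses common textbook definitions, which may not match the presenter's exact formulas. In particular, the sign convention (imputed minus truth) and the exclusion of zero-valued truths from MAPE are assumptions, as are the function name and the example numbers.

```python
import math


def imputation_diagnostics(truth, imputed, masked_idx):
    """Compare imputed values with the known truth at the masked positions.

    Assumptions: error = imputed - truth; MAPE skips points where truth is 0.
    """
    errors = [imputed[i] - truth[i] for i in masked_idx]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    pct = [abs(imputed[i] - truth[i]) / abs(truth[i])
           for i in masked_idx if truth[i] != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    return {"bias": bias, "mape": mape, "rmse": rmse}


# Hypothetical example: positions 1 and 3 were deleted and then imputed
truth = [10, 20, 30, 40]
imputed = [10, 22, 30, 36]
diag = imputation_diagnostics(truth, imputed, masked_idx=[1, 3])
print(diag)
```

Computing these per health facility for two methods and differencing the RMSE values gives a per-facility comparison of the kind summarized on the RMSE slide.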
  46. 46. • Assess missingness • Address egregious outliers • Manage new/defunct facility records • Decompose the time series • Try a few different interpolation techniques and plot results • Isolate a subset of records with no missing data • Introduce missing data and then recreate the “truth” Recap
  47. 47. This presentation was produced with the support of the United States Agency for International Development (USAID) under the terms of the Data for Impact (D4I) associate award 7200AA18LA00008, which is implemented by the Carolina Population Center at the University of North Carolina at Chapel Hill, in partnership with Palladium International, LLC; ICF Macro, Inc.; John Snow, Inc.; and Tulane University. The views expressed in this publication do not necessarily reflect the views of USAID or the United States government. www.data4impactproject.org
  48. 48. • DHIS2 time series do not always lend themselves well to multiple imputation • Multiple imputation is a preferable choice when there are variables predictive of missingness that could be included in the imputation model • With DHIS2 data, it can be difficult to locate other time dependent variables to aid in the imputation process • DHIS2 time series may exhibit MNAR missingness structure • Earlier time points have more missing data • Zero values are more likely to be missing Imputation
  49. 49. • Advantages of using DHIS2 data • Access to a wide breadth of data elements/services • Analyze at various levels of the health system • National, regional, district, health facility • Data are generally collected via standardized reporting tools • Data tend to be reported at regular intervals allowing for frequent updates to analyses • However, not all data elements are well-reported, and it is typically necessary to process/clean DHIS2 data Why Use DHIS2 Data?
