Managing missing values in routinely reported data:
One approach from the DRC
Matt Worges
Data for Impact Webinar Series
December 2, 2020
• Framing the Webinar through the D4I lens
• DHIS2 data: advantages and issues
• Exploring a DHIS2 data set
• What to do with blanks?
• Interpolation
• Recreate the “Truth”
• Interpolation diagnostics
Overview
• The D4I team was tasked with conducting an impact evaluation of
the USAID Integrated Health Project (IHP) implemented in 9
provinces of the DRC
• IHP goal: Reduce maternal, newborn, and child deaths through delivery of
integrated health services
• IHP objectives: Increase access to and use of quality health services in
the targeted health zones
IHP Impact Evaluation
• D4I research question: What was the impact of IHP on the
utilization of health services (e.g., treatment for childhood illnesses)
over the course of the study period?
• Measuring impact: D4I is assessing impact through a
difference-in-differences (DID) model with propensity score matching (PSM)
• Data source: We are using DHIS2 data for this impact evaluation
IHP Impact Evaluation – Approach
• PSM is widely used to mitigate confounding in observational
studies
• Complications arise when the covariates used to estimate the propensity
scores are only partially observed
• Interpolation/imputation approaches provide a potential solution for
handling missing data in the estimation of the propensity scores
• Recommended to derive the propensity score after applying interpolation or
imputation
IHP Impact Evaluation – Propensity Score Matching
• Addition/removal of health facilities at different time points
• Long runs of missing values
• Zero counts are typically not entered – they are left blank
• Cannot distinguish between truly missing and zero
• Data entry errors manifesting as outliers/anomalous points
• Reporting has improved over time, making older time points less
complete
Some DHIS2 Issues
• Missing data can result in:
• Reduced statistical power
• Biased estimators
• Reduced representativeness of the sample
• Generally incorrect inference and conclusions
Why do we care about missingness?
Overview of Approaches for Missing Data – Susan Buchman
• Time Series Characteristics
• Restricted to Haut-Katanga Province, DRC
• Uncomplicated + severe malaria cases (all ages)
• 24-month period from October 2018 to September 2020
• Health facility count = 1,362
• The monthly-aggregated time series appears to include both a seasonal
component and a positive trend
Data Set
Unprocessed Data – Missingness Visualized
HF Oct-18 Nov-18 Dec-18 Jan-19 Feb-19 Mar-19 Apr-19 May-19 Jun-19 Jul-19 Aug-19 Sep-19 Oct-19 Nov-19 Dec-19 Jan-20
hk Panda Hôpital Général de Référence 514 637 637 910 563 1375 678 483 839 773 929 792 694 1355 1219
hk Serge Amie Centre de Santé 300 306 274 300 320 440 522 582
hk AENAF Centre de Santé de Référence 91 60 212 154 65 279 114 59 213 55 131 38 399 227 222
hk Asvie Centre Médical 439 556 475 379 370 335 279 280 256 381 627 639
hk Mupanda Centre de Santé 610 479 363 610 641 408 573 248 237 279 455 319 203
hk Boma Publique Centre de Santé 294 293 304 293 308 318 178 225 326 325 240
hk Kawama Centre de Santé 174 176 2 283 280 304 286 288 4 275 379 319 264 313
hk Kabambakuku Centre de Santé 317 396 372 434 368 298 255 314 303 251 287 283
hk Kaboka Centre de Santé 419 314 201 240 350 199 151 197 274 257
hk Kasomeno Centre de Santé de Référence 282 307 306 265
hk Kikula Centre de Santé de Référence 221 241 246 275 167 318 393
hk Belle Vue Centre de Santé 135 157 555 350 124 102 92
Unprocessed Data – Missingness Visualized
Missing (28.6%), Present (71.4%); ‘visdat’ package
Malaria Cases – Haut-Katanga Province
Unprocessed Data – Histogram of Missingness
[Histogram of missing values per health facility; ‘ggplot2’ package]
Bar annotations: No missing values (complete case analysis); One missing value; Two missing values; Completely blank records (remove from data set)
Bar labels: 284, 193, 137, 27
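The missingness share and the per-facility counts behind a histogram like this one can be computed in a few lines of base R. The following is only a sketch: the matrix `cases` and its values are made up for illustration (rows are facilities, columns are months), and the slides used the ‘visdat’ and ‘ggplot2’ packages for the actual figures.

```r
# Hypothetical wide-format data: one row per health facility, one column per month.
cases <- matrix(
  c(514, 637,  NA, 910,
    300,  NA,  NA, 306,
     NA,  NA,  NA,  NA,
     91,  60, 212, 154),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("HF", 1:4), c("Oct-18", "Nov-18", "Dec-18", "Jan-19"))
)

# Overall share of missing cells (the "Missing (x%)" figure in a missingness plot).
pct_missing <- 100 * mean(is.na(cases))

# Number of missing months per facility, then a frequency table of those counts.
n_missing <- rowSums(is.na(cases))
missing_hist <- table(factor(n_missing, levels = 0:ncol(cases)))

# Facilities that never reported (candidates for removal) and complete cases.
all_blank <- names(n_missing)[n_missing == ncol(cases)]
complete  <- names(n_missing)[n_missing == 0]

print(round(pct_missing, 1))
print(missing_hist)
```

Separating completely blank records from complete cases first makes it easier to see how many facilities are left for interpolation.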
Unprocessed Data – Outliers?
What are these doing here?
Are they malaria outbreaks?
Are they data entry errors?
Unprocessed Data – Outliers.
‘anomalize’ package
Something looks off here
This point didn’t show up as anomalous
• One method to remove outliers is to delete those values that are
more than X standard deviations above or below the median
• The median is insensitive to extreme values in your time series
• Experiment with different thresholds (e.g., ± 4 SDs from the median
or ± 6 SDs from the median) to examine what happens to your data
Removing Egregious Outliers – One Approach
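A minimal base-R sketch of this rule follows. The function name `flag_outliers`, the threshold argument `k`, and the case counts are all hypothetical; the slides do not show the code that was actually used.

```r
# Flag values more than k standard deviations away from the series median.
# 'k' is the threshold to experiment with (4.5 is one value shown in the slides).
flag_outliers <- function(x, k = 4.5) {
  centre <- median(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  abs(x - centre) > k * spread          # missing values stay NA
}

# Hypothetical monthly case counts for one facility, with one suspicious spike.
x <- c(174, 176, 283, 280, 304, 286, 288, 275, 379, 319, 264, 313,
       290, 301, 5200, 310, 295, 270, 285, 300, 315, 280, 290, 305)

is_out <- flag_outliers(x, k = 4.5)

# Set flagged values to NA so they can be handled like other missing points.
x_clean <- x
x_clean[which(is_out)] <- NA
```

The standard deviation is itself inflated by extreme values, so the choice of threshold matters: in this made-up series the spike is flagged at k = 4.5 but not at k = 6.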
Figure labels: Malaria cases; Median; Standard deviation; + 4.5 SDs from the median
This value would be removed from the data set
Anomalous Data Points
‘anomalize’ package
This is what I’m targeting for removal
Less concerned with these
Removing Egregious Outliers - Effects
Average Malaria Cases – Haut-Katanga Province
+4.5 SDs from the median
Removed 8 values or 0.025%
Unprocessed data set
Are missing values actually
zeros in the DRC DHIS2?
Link between Missingness & Median Case Counts
[Bar chart; x-axis: Median Health Facility Malaria Cases (binned): 1-15, 16-30, 31-45, 46-60, 61-75, 76-90, 91-105, 106-120, 121-135, 136-150, >150]
Generalization: the lower a facility's median case count, the
higher its average number of missing values
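One way to check this relationship on your own data is to bin facilities by median case count and average the number of missing months within each bin. The sketch below uses base R and a made-up three-facility matrix; only the bin edges are taken from the slide's x-axis.

```r
# Hypothetical facility-by-month matrix (rows = facilities, columns = months).
cases <- rbind(
  small  = c(  4,  NA,   6,  NA,   5,  NA),
  medium = c( 40,  38,  NA,  45,  41,  39),
  large  = c(510, 480, 530, 495, 505, 520)
)

# Median reported cases per facility (ignoring blanks) and missing months per facility.
med_cases <- apply(cases, 1, median, na.rm = TRUE)
n_missing <- rowSums(is.na(cases))

# Bin facilities by median case count using the bins from the slide's x-axis.
# A facility whose median is exactly 0 would fall outside these bins.
bins <- cut(med_cases,
            breaks = c(0, seq(15, 150, by = 15), Inf),
            labels = c("1-15", "16-30", "31-45", "46-60", "61-75", "76-90",
                       "91-105", "106-120", "121-135", "136-150", ">150"))

# Average number of missing months within each bin (empty bins are dropped).
avg_missing <- tapply(n_missing, droplevels(bins), mean)
print(avg_missing)
```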
• Assume no item nonresponse?
• Examine this notion with two extreme examples
• One HF time series with large monthly values and 1 missing
• One HF time series with low monthly values and 1 missing
• Replace missing with zero and run anomaly detection
Assumption: Missing Values are Zeros
Initial missing value was replaced with 0
Initial missing value was replaced with 0
‘anomalize’ package
Interpolation on
Univariate Time Series
• A univariate time series is a sequence of single observations at
regular and successive points in time
• Possible to decompose the time series into its trend, seasonal, and
irregular components
• We can use these time series characteristics in the interpolation process
Univariate Time Series
Panels: data, seasonal, trend, remainder (x-axis: 2017 to 2020)
Loess Seasonal Decomposition of Average Malaria Cases
‘stats’ package
Axes: Autocorrelation Function (y-axis), Lag (x-axis)
Autocorrelation Function Plot (ACF plot)
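Both plots can be produced with the ‘stats’ package that ships with R. The sketch below runs on a simulated monthly series (trend plus a 12-month cycle plus noise); the numbers are invented and only stand in for the average malaria case series.

```r
# Simulated monthly series; values are made up for illustration.
set.seed(1)
n <- 48
month_idx <- seq_len(n)
y <- 200 + 2 * month_idx + 40 * sin(2 * pi * month_idx / 12) + rnorm(n, sd = 10)
cases_ts <- ts(y, start = c(2017, 1), frequency = 12)

# Loess-based seasonal decomposition. By default stl() stops on missing values,
# so it needs a complete series (for example an aggregate over facilities).
fit <- stl(cases_ts, s.window = "periodic")

# Columns are "seasonal", "trend", and "remainder"; they sum back to the data.
components <- fit$time.series
plot(fit)

# Autocorrelation function: for a monthly series, a spike at 12 months
# is what a yearly seasonal pattern looks like.
acf_fit <- acf(cases_ts, lag.max = 24, plot = FALSE)
plot(acf_fit)
```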
• Values in a series do not have violent, unexplained fluctuations
• The rate of change (increases/decreases) between points occurs at
a uniform rate
Assumptions of Interpolation
• Easy to code (one line in R for a long-form data frame)
• df$int_cases <- na_interpolation(df$cases, option = "linear", maxgap = 2)
• Intuitive understanding of linearly interpolating across very short
gaps of missing values
• Probably a good approach for high case load facilities
• May not grossly deviate from the ‘truth’ when applied to low case load
facilities
A Role for Linear Interpolation?
‘imputeTS’ package
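To show what gap-limited linear interpolation does, here is a base-R sketch that does not depend on ‘imputeTS’. The helper `interp_short_gaps` and the data frame are hypothetical, and the handling of blanks at the start or end of a series is a choice made here, not taken from the package. The data frame is assumed to be sorted by month within each facility, and the function is applied per facility so that values are never interpolated across facilities.

```r
# Linear interpolation that only fills runs of at most 'maxgap' consecutive NAs.
# Leading and trailing NAs are left as NA because they lack a neighbour on one side.
interp_short_gaps <- function(x, maxgap = 2) {
  idx <- seq_along(x)
  ok <- !is.na(x)
  if (sum(ok) < 2) return(x)                        # nothing to interpolate between
  filled <- approx(idx[ok], x[ok], xout = idx)$y    # NA outside the observed range
  runs <- rle(is.na(x))
  long_gap <- rep(runs$values & runs$lengths > maxgap, runs$lengths)
  filled[long_gap] <- NA                            # keep long gaps missing
  filled
}

# Hypothetical long-form data: two facilities, eight months each.
df <- data.frame(
  hf    = rep(c("A", "B"), each = 8),
  month = rep(1:8, times = 2),
  cases = c(100, NA, 120, NA, NA, NA, 160, 170,
             NA, 20,  NA, NA,  26, 30,  NA,  34)
)

# ave() applies the function within each facility and keeps the row order.
df$int_cases <- ave(df$cases, df$hf, FUN = interp_short_gaps)
print(df)
```

In facility A the single blank is filled and the run of three blanks is left missing; in facility B the leading blank stays missing.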
Linear Interpolation
Joining known values with linear segments
Initial missing value was replaced with 0
Initial missing value was replaced with 0
‘anomalize’ package
Linearly interpolated
‘anomalize’ package
Seasonality in Interpolation
Panels: Un-imputed data; Linearly interpolated data w/o seasonality; Linearly interpolated data w/ seasonality
• Take seasonality into account
• na.interp from the ‘forecast’ package in R
• By default, uses linear interpolation for non-seasonal series. For seasonal series, a
robust STL decomposition is first computed. Then a linear interpolation is applied to
the seasonally adjusted data, and the seasonal component is added back.
• na.StructTS from the ‘zoo’ package in R
• Interpolate with seasonal Kalman filter
• These two functions use similar mechanisms to interpolate missing
data in that they both can ‘handle’ seasonality in the time series
Univariate Time Series Interpolation
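The idea of interpolating on seasonally adjusted data can be illustrated with a simplified base-R stand-in. This is not what `na.interp` or `na.StructTS` do internally: the hypothetical helper `seasonal_interp` estimates the seasonal component as centred per-month means rather than with a robust STL decomposition or a Kalman filter, and it assumes every month of the cycle has at least one observed value.

```r
# Simplified seasonal interpolation: remove a crude seasonal component,
# interpolate linearly, then add the seasonal component back.
seasonal_interp <- function(x, period = 12) {
  idx <- seq_along(x)
  pos <- (idx - 1) %% period + 1                        # position in the cycle, 1..period
  seas <- as.numeric(tapply(x, pos, mean, na.rm = TRUE))
  seas <- seas - mean(seas)                             # centre the seasonal effects
  adj <- x - seas[pos]                                  # seasonally adjusted, NAs kept
  ok <- !is.na(adj)
  filled <- approx(idx[ok], adj[ok], xout = idx, rule = 2)$y
  filled + seas[pos]
}

# Made-up monthly series with a pure 12-month cycle and one missing peak month.
month_idx <- 1:36
truth <- 100 + 10 * sin(2 * pi * month_idx / 12)
x <- truth
x[15] <- NA

# Plain linear interpolation versus the seasonally adjusted version at the gap.
obs <- !is.na(x)
plain    <- approx(month_idx[obs], x[obs], xout = month_idx)$y[15]
seasonal <- seasonal_interp(x)[15]
print(c(truth = truth[15], plain = plain, seasonal = seasonal))
```

Plain linear interpolation cuts across the seasonal peak, while the seasonally adjusted version restores it.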
Seasonality Adjusted Time Series
Let’s reset and apply some
of these steps
Missingness Visualized – Unprocessed Data
Missing (28.6%), Present (71.4%); ‘visdat’ package
284 HFs with no missing data
Missingness Visualized – Removed New/Defunct HFs
Missing (13.8%), Present (86.2%); ‘visdat’ package
Missingness Visualized – Linear Interpolation (gaps ≤ 2)
Missing (6.7%), Present (93.3%); ‘visdat’ package
807 HFs with no missing data
Time Series Trends
New/defunct HFs and outliers have been removed from all time series
Recreate the “Truth”
• Use a data set containing only complete time series records
• 2.5% of data are zero values (primarily limited to smaller facilities)
• Introduce random missingness
• Randomly delete 15% of data points
• Delete 90% of remaining zero values
• Include runs of more than 2 missing values
• Apply various imputation methods and compare against the “truth”
• Replace all blanks with zeros
• Linear interpolation on gaps ≤ 2
• Use the two identified interpolation strategies that consider seasonality
A Quick Example
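The masking steps listed above can be sketched in base R as follows. The complete series `truth` is invented, the seed is arbitrary, and the exact procedure used for the webinar is not shown in the slides, so treat every detail here as an assumption. For brevity the linear fill below does not apply the gaps ≤ 2 restriction.

```r
set.seed(42)

# Hypothetical complete series standing in for one facility with no missing months.
truth <- c(12,  0,  9, 15, 20, 26,  0, 18, 14, 11,  8, 13,
           16, 10,  0, 17, 22, 30, 25, 19, 15, 12,  9, 14)
n <- length(truth)
masked <- truth

# Step 1: delete 15% of data points at random.
masked[sample.int(n, size = round(0.15 * n))] <- NA

# Step 2: delete 90% of the zero values that remain.
zeros_left <- which(!is.na(masked) & masked == 0)
if (length(zeros_left) > 0) {
  n_zero_drop <- round(0.90 * length(zeros_left))
  masked[zeros_left[sample.int(length(zeros_left), n_zero_drop)]] <- NA
}

# Step 3: add one run of three consecutive missing values.
run_start <- sample.int(n - 2, 1)
masked[run_start:(run_start + 2)] <- NA

# Positions to score later: everything that is now missing.
deleted <- which(is.na(masked))

# Two candidate fills to compare against the truth at the deleted positions.
as_zero <- ifelse(is.na(masked), 0, masked)
ok <- !is.na(masked)
linear <- approx(which(ok), masked[ok], xout = seq_len(n), rule = 2)$y
```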
Time Series Trends
Anomalous data points have been removed
na.StructTS
na.interp
Average raw bias: na.StructTS = -1.18; na.interp = -0.03
MAPE: na.StructTS = 119.03; na.interp = 117.41
The RMSE difference is positive for 1,847
HFs indicating that the ‘na.StructTS’
approach had a lower RMSE for 68% of HFs
‘na.StructTS’ approach has lower RMSE
‘na.interp’ approach has lower RMSE
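The three diagnostics can be written as small base-R functions. The sign convention for bias (estimate minus truth) and the exclusion of zero-valued truths from MAPE are assumptions made for this sketch, and the numbers are invented; the slides do not state how these were defined.

```r
# Error metrics comparing imputed values with the known truth at the deleted positions.
raw_bias <- function(truth, est) mean(est - truth)
rmse     <- function(truth, est) sqrt(mean((est - truth)^2))
mape     <- function(truth, est) {
  keep <- truth != 0                     # percentage error is undefined when truth is 0
  100 * mean(abs((est[keep] - truth[keep]) / truth[keep]))
}

# Made-up truth and two hypothetical sets of imputed values.
truth    <- c(20, 10, 40, 50)
method_a <- c(22,  9, 44, 45)
method_b <- c(25, 10, 30, 50)

scores <- rbind(
  method_a = c(bias = raw_bias(truth, method_a),
               rmse = rmse(truth, method_a),
               mape = mape(truth, method_a)),
  method_b = c(bias = raw_bias(truth, method_b),
               rmse = rmse(truth, method_b),
               mape = mape(truth, method_b))
)
print(round(scores, 2))

# Per-facility comparison: a positive difference means method_a has the lower RMSE.
rmse_diff <- rmse(truth, method_b) - rmse(truth, method_a)
```

Computing `rmse_diff` for each facility and counting how often it is positive gives the kind of summary shown on the RMSE slide.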
• Assess missingness
• Address egregious outliers
• Manage new/defunct facility records
• Decompose the time series
• Try a few different interpolation techniques and plot results
• Isolate a subset of records with no missing data
• Introduce missing data and then recreate the “truth”
Recap
This presentation was produced with the support of the United States Agency for International
Development (USAID) under the terms of the Data for Impact (D4I) associate award
7200AA18LA00008, which is implemented by the Carolina Population Center at the University of
North Carolina at Chapel Hill, in partnership with Palladium International, LLC; ICF Macro, Inc.;
John Snow, Inc.; and Tulane University. The views expressed in this publication do not
necessarily reflect the views of USAID or the United States government.
www.data4impactproject.org
• DHIS2 time series do not always lend themselves well to multiple
imputation
• Multiple imputation is a preferable choice when there are variables
predictive of missingness that could be included in the imputation model
• With DHIS2 data, it can be difficult to locate other time-dependent variables to aid in
the imputation process
• DHIS2 time series may exhibit a missing-not-at-random (MNAR) structure
• Earlier time points have more missing data
• Zero values are more likely to be missing
Imputation
• Advantages of using DHIS2 data
• Access to a wide breadth of data elements/services
• Analyze at various levels of the health system
• National, regional, district, health facility
• Data are generally collected via standardized reporting tools
• Data tend to be reported at regular intervals allowing for frequent updates
to analyses
• However, not all data elements are well-reported, and it is typically
necessary to process/clean DHIS2 data
Why Use DHIS2 Data?

  • 6. • Addition/removal of health facilities at different time points • Long runs of missing values • Zero counts are typically not entered – they are left blank • Cannot distinguish between truly missing and zero • Data entry errors manifesting as outliers/anomalous points • Reporting has improved over time making older time points less complete Some DHIS2 Issues
  • 7. • Missing data can result in: • Reduced statistical power • Biased estimators • Reduced representativeness of the sample • Generally incorrect inference and conclusions Why do we care about missingness? Overview of Approaches for Missing Data – Susan Buchman
  • 8. • Time Series Characteristics • Restricted to Haut-Katanga Province, DRC • Uncomplicated + severe malaria cases (all ages) • 24-month period from October 2018 to September 2020 • Health facility count = 1,362 • The monthly-aggregated time series appears to include both a seasonal and positive trend component Data Set
  • 9. Unprocessed Data – Missingness Visualized
      Columns: HF, Oct-18, Nov-18, Dec-18, Jan-19, Feb-19, Mar-19, Apr-19, May-19, Jun-19, Jul-19, Aug-19, Sep-19, Oct-19, Nov-19, Dec-19, Jan-20
      Values per facility in reported order (blank months are not shown, so positions do not map to specific months):
      • hk Panda Hôpital Général de Référence: 514, 637, 637, 910, 563, 1375, 678, 483, 839, 773, 929, 792, 694, 1355, 1219
      • hk Serge Amie Centre de Santé: 300, 306, 274, 300, 320, 440, 522, 582
      • hk AENAF Centre de Santé de Référence: 91, 60, 212, 154, 65, 279, 114, 59, 213, 55, 131, 38, 399, 227, 222
      • hk Asvie Centre Médical: 439, 556, 475, 379, 370, 335, 279, 280, 256, 381, 627, 639
      • hk Mupanda Centre de Santé: 610, 479, 363, 610, 641, 408, 573, 248, 237, 279, 455, 319, 203
      • hk Boma Publique Centre de Santé: 294, 293, 304, 293, 308, 318, 178, 225, 326, 325, 240
      • hk Kawama Centre de Santé: 174, 176, 2, 283, 280, 304, 286, 288, 4, 275, 379, 319, 264, 313
      • hk Kabambakuku Centre de Santé: 317, 396, 372, 434, 368, 298, 255, 314, 303, 251, 287, 283
      • hk Kaboka Centre de Santé: 419, 314, 201, 240, 350, 199, 151, 197, 274, 257
      • hk Kasomeno Centre de Santé de Référence: 282, 307, 306, 265
      • hk Kikula Centre de Santé de Référence: 221, 241, 246, 275, 167, 318, 393
      • hk Belle Vue Centre de Santé: 135, 157, 555, 350, 124, 102, 92
  • 10. Unprocessed Data – Missingness Visualized (‘visdat’ package). Malaria Cases – Haut-Katanga Province: Missing (28.6%), Present (71.4%)
  • 11. Unprocessed Data – Histogram of Missingness (‘ggplot2’ package). Annotations: No missing values (complete case analysis); Completely blank records (remove from data set); One missing value; Two missing values. Bar labels: 284, 193, 137, 27
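The tally behind a histogram of missingness can be sketched in a few lines. This is a minimal Python illustration, not the code used for the slide; the function name `missing_count_histogram` and the toy facility records are hypothetical, and a blank cell is assumed to be represented as `None`.

```python
from collections import Counter


def missing_count_histogram(records):
    """Map 'number of missing months' to 'number of facilities'."""
    counts = Counter()
    for series in records.values():
        # Count the blank cells in this facility's time series.
        counts[sum(value is None for value in series)] += 1
    return dict(sorted(counts.items()))


# Hypothetical toy data: three months per facility, None marks a blank cell.
records = {
    "HF A": [12, 15, 11],        # complete case
    "HF B": [None, 7, 9],        # one missing value
    "HF C": [None, None, None],  # completely blank record
    "HF D": [4, None, 6],        # one missing value
}

histogram = missing_count_histogram(records)
print(histogram)  # {0: 1, 1: 2, 3: 1}
```

The key `0` corresponds to the complete-case facilities, and the key equal to the series length corresponds to completely blank records.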
  • 12. Unprocessed Data – Outliers? What are these doing here? Are they malaria outbreaks? Are they data entry errors?
  • 13. Unprocessed Data – Outliers (‘anomalize’ package). Annotations: “Something looks off here”; “This point didn’t show up as anomalous”
  • 14. • One method to remove outliers is to delete those values that are ± X standard deviations from the median • The median is insensitive to extreme values in your time series • Experiment with different thresholds (i.e., ± 4 SDs from the median or ± 6 SDs from the median) to examine what happens to your data Removing Egregious Outliers – One Approach
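The median ± X SDs rule can be sketched as follows. This is a minimal Python illustration under stated assumptions: the threshold is applied separately to each facility's series, the sample standard deviation is computed from that same series, and a removed value is set to `None`. The deck does not specify these details, and `remove_egregious_outliers` and the toy series are hypothetical.

```python
import statistics


def remove_egregious_outliers(series, n_sds=4.5):
    """Blank out values more than n_sds standard deviations from the median."""
    observed = [v for v in series if v is not None]
    if len(observed) < 2:
        return list(series)  # not enough data to estimate a spread
    center = statistics.median(observed)
    spread = statistics.stdev(observed)  # sample SD, assumed per facility
    lower, upper = center - n_sds * spread, center + n_sds * spread
    return [v if v is None or lower <= v <= upper else None for v in series]


# Hypothetical 24-month series with one implausible spike at position 10.
series = [100, 110, 90, 105, 95, 100] * 4
series[10] = 5000

cleaned = remove_egregious_outliers(series, n_sds=4.5)
print(cleaned[10])  # None
```

Because the spike itself inflates the standard deviation, a short series needs a very extreme value to cross a high threshold, which is one reason to experiment with several thresholds.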
  • 15. Figure labels: Malaria cases; Median; Standard deviation; +4.5 SDs from the median. Annotation: This value would be removed from the data set
  • 16. Anomalous Data Points (‘anomalize’ package). Annotations: “This is what I’m targeting for removal”; “Less concerned with these”
  • 17. Removing Egregious Outliers – Effects. Average Malaria Cases – Haut-Katanga Province. Series shown: unprocessed data set; +4.5 SDs from the median (removed 8 values, or 0.025%)
  • 18. Are missing values actually zeros in the DRC DHIS2?
  • 19. Link between Missingness & Median Case Counts. X-axis: Median Health Facility Malaria Cases (binned): 1-15, 16-30, 31-45, 46-60, 61-75, 76-90, 91-105, 106-120, 121-135, 136-150, >150. Generalization: the lower the median case count, the higher the average number of missing values
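One way to produce this kind of summary is to bin facilities by their median case count and average the number of missing months within each bin. The Python sketch below is illustrative only. The bin edges follow the axis labels on the slide, but the handling of medians below 1 (placed in the first bin) and of completely blank facilities (skipped) are assumptions, and the function names and toy data are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def median_bin(median_cases, width=15, top=150):
    """Return the slide-style bin label (1-15, 16-30, ..., >150)."""
    if median_cases > top:
        return f">{top}"
    # Medians below 1 are placed in the first bin (an assumption).
    index = max(math.ceil(median_cases) - 1, 0) // width
    lower = index * width + 1
    return f"{lower}-{lower + width - 1}"


def mean_missing_by_bin(records):
    """Average number of missing months per median-case-count bin."""
    grouped = defaultdict(list)
    for series in records.values():
        observed = [v for v in series if v is not None]
        if not observed:
            continue  # completely blank records have no median
        label = median_bin(statistics.median(observed))
        grouped[label].append(len(series) - len(observed))
    return {label: statistics.mean(counts) for label, counts in grouped.items()}


# Hypothetical toy data: None marks a blank month.
records = {
    "HF A": [5, None, 8, None],
    "HF B": [10, 12, None, 9],
    "HF C": [200, 210, 190, 205],
    "HF D": [None, None, None, None],
}

summary = mean_missing_by_bin(records)
print(summary)  # {'1-15': 1.5, '>150': 0}
```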
  • 20. • Assume no item nonresponse? • Examine this notion with two extreme examples • One HF time series with large monthly values and 1 missing • One HF time series with low monthly values and 1 missing • Replace missing with zero and run anomaly detection Assumption: Missing Values are Zeros
  • 21. Initial missing value was replaced with 0 Initial missing value was replaced with 0 ‘anomalize’ package
  • 23. • A univariate time series is a sequence of single observations at regular and successive points in time • Possible to decompose the time series into its trend, seasonal, and irregular components • We can use these time series characteristics in the interpolation process Univariate Time Series
  • 24. Loess Seasonal Decomposition of Average Malaria Cases (‘stats’ package). Panels: data, seasonal, trend, remainder. X-axis: 2017 to 2020
  • 27. • Values in a series do not have violent, unexplained fluctuations • Change (increase or decrease) between points occurs at a uniform rate Assumptions of Interpolation
  • 28. A Role for Linear Interpolation? (‘imputeTS’ package)
      • Easy to code (one line in R for a long-form data frame):
        df$int_cases <- na_interpolation(df$cases, option = "linear", maxgap = 2)
      • Run it separately for each facility's series so that gaps are not interpolated across facility boundaries
      • Linearly interpolating across very short gaps of missing values is intuitive
      • Probably a good approach for high case load facilities
      • May not grossly deviate from the ‘truth’ when applied to low case load facilities
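To make the gap rule concrete, here is a minimal Python sketch of linear interpolation restricted to short gaps. It is not the imputeTS implementation. In this sketch a run of blanks is filled only if it is at most `maxgap` long and has an observed value on both sides; leading and trailing blanks are left untouched, which may differ from the R function's behavior. The name `interpolate_short_gaps` and the toy series are hypothetical.

```python
def interpolate_short_gaps(series, maxgap=2):
    """Linearly fill interior runs of None that are at most maxgap long."""
    result = list(series)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is not None:
            i += 1
            continue
        start = i
        while i < n and result[i] is None:
            i += 1
        end = i  # first index after the run of blanks
        gap = end - start
        has_both_neighbors = start > 0 and end < n
        if has_both_neighbors and gap <= maxgap:
            left, right = result[start - 1], result[end]
            step = (right - left) / (gap + 1)
            for offset in range(gap):
                result[start + offset] = left + step * (offset + 1)
    return result


# Hypothetical series: gaps of length 1, 2, and 3, plus a trailing blank.
cases = [10, None, 30, None, None, 60, None, None, None, 100, None]
print(interpolate_short_gaps(cases, maxgap=2))
# [10, 20.0, 30, 40.0, 50.0, 60, None, None, None, 100, None]
```

The run of three blanks and the trailing blank stay missing, so longer gaps remain visible for a separate decision.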
  • 29. Linear Interpolation: Joining known values with linear segments
  • 30. Initial missing value was replaced with 0 Initial missing value was replaced with 0 ‘anomalize’ package
  • 32. Seasonality in Interpolation. Series shown: Un-imputed data; Linearly interpolated data w/o seasonality; Linearly interpolated data w/ seasonality
  • 33. Univariate Time Series Interpolation
      • Take seasonality into account
      • na.interp from the ‘forecast’ package in R
          • By default, uses linear interpolation for non-seasonal series. For seasonal series, a robust STL decomposition is first computed; a linear interpolation is then applied to the seasonally adjusted data, and the seasonal component is added back.
      • na.StructTS from the ‘zoo’ package in R
          • Interpolates with a seasonal Kalman filter
      • The two functions are similar in that both can ‘handle’ seasonality in the time series
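The general idea described for na.interp (remove the seasonal component, interpolate linearly, add the seasonal component back) can be illustrated with a small Python sketch. It is not the R function: in place of a robust STL decomposition it uses a crude seasonal estimate (the mean at each position in the cycle minus the overall mean), which ignores trend. It also does not represent the Kalman filter used by na.StructTS. The names `linear_fill` and `seasonal_linear_fill` and the toy data are hypothetical.

```python
import statistics


def linear_fill(series):
    """Linearly interpolate interior gaps; leading/trailing gaps stay None."""
    result = list(series)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


def seasonal_linear_fill(series, period=12):
    """Interpolate on seasonally adjusted values, then restore seasonality."""
    observed = [v for v in series if v is not None]
    if not observed:
        return list(series)
    overall = statistics.mean(observed)
    seasonal = []
    for position in range(period):
        same_position = [v for v in series[position::period] if v is not None]
        # Crude seasonal effect: positional mean minus overall mean.
        seasonal.append(statistics.mean(same_position) - overall if same_position else 0.0)
    adjusted = [None if v is None else v - seasonal[i % period]
                for i, v in enumerate(series)]
    filled = linear_fill(adjusted)
    return [series[i] if series[i] is not None
            else (None if v is None else v + seasonal[i % period])
            for i, v in enumerate(filled)]


# Hypothetical series: constant level of 100 plus a repeating 4-step pattern.
pattern = [10, -10, 20, -20]
truth = [100 + pattern[i % 4] for i in range(12)]
gappy = list(truth)
gappy[5] = gappy[6] = None  # true values are 90 and 120

plain = linear_fill(gappy)
seasonal_aware = seasonal_linear_fill(gappy, period=4)
print(plain[5:7], seasonal_aware[5:7])
```

In this toy case the plain linear fill draws a straight line through the gap, while the seasonally adjusted version recovers the repeating pattern. Real series with trend and noise will not be recovered this cleanly.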
  • 35. Let’s reset and apply some of these steps
  • 36. Missingness Visualized – Unprocessed Data (‘visdat’ package): Missing (28.6%), Present (71.4%); 284 HFs with no missing data
  • 37. Missingness Visualized – Removed New/Defunct HFs (‘visdat’ package): Missing (13.8%), Present (86.2%)
  • 38. Missingness Visualized – Linear Interpolation (gaps ≤ 2) (‘visdat’ package): Missing (6.7%), Present (93.3%); 807 HFs with no missing data
  • 39. Time Series Trends: New/defunct HFs and outliers have been removed from all time series
  • 41. A Quick Example
      • Use a data set containing only complete time series records
          • 2.5% of data are zero values (primarily limited to smaller facilities)
      • Introduce random missingness
          • Randomly delete 15% of data points
          • Delete 90% of remaining zero values
          • Include runs of more than 2 missing values
      • Apply various imputation methods and compare against the “truth”
          • Replace all blanks with zeros
          • Linear interpolation on gaps ≤ 2
          • Use the two identified interpolation strategies that consider seasonality
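The "recreate the truth" exercise can be sketched as a small simulation: start from a complete series, blank out values, impute, and score the imputed values against the known originals. The Python below is a hedged illustration rather than the deck's code. It deletes each remaining zero with probability 0.90 instead of deleting exactly 90% of them, does not force long runs of blanks, and compares only zero filling with plain linear interpolation. All function names and the toy series are hypothetical.

```python
import math
import random


def introduce_missingness(series, rng, drop_share=0.15, zero_drop_share=0.90):
    """Blank a random share of points, then most of the remaining zeros."""
    masked = list(series)
    n_drop = round(drop_share * len(masked))
    for i in rng.sample(range(len(masked)), n_drop):
        masked[i] = None
    for i, value in enumerate(masked):
        # Each surviving zero is blanked with probability zero_drop_share.
        if value == 0 and rng.random() < zero_drop_share:
            masked[i] = None
    return masked


def zero_fill(masked):
    return [0 if v is None else v for v in masked]


def linear_fill(masked):
    """Linearly interpolate interior gaps; leading/trailing gaps stay None."""
    result = list(masked)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


def score(truth, masked, imputed):
    """Raw bias and RMSE over the points that were blanked and then imputed."""
    errors = [est - true for true, seen, est in zip(truth, masked, imputed)
              if seen is None and est is not None]
    if not errors:
        return None
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return bias, rmse


# Hypothetical complete 24-month series for one facility.
truth = [30, 34, 41, 47, 52, 49, 43, 36, 31, 35, 42, 48,
         53, 50, 44, 37, 32, 36, 43, 49, 54, 51, 45, 38]
rng = random.Random(2020)  # fixed seed so the run is repeatable
masked = introduce_missingness(truth, rng)

print("zero fill:  ", score(truth, masked, zero_fill(masked)))
print("linear fill:", score(truth, masked, linear_fill(masked)))
```

Scoring only the blanked positions keeps the comparison focused on imputation error, since every method reproduces the observed values exactly.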
  • 43. Time Series Trends: Anomalous data points have been removed
  • 45. na.StructTS: average raw bias = -1.18; na.interp: average raw bias = -0.03
  • 47. The RMSE difference is positive for 1,847 HFs, indicating that the ‘na.StructTS’ approach had a lower RMSE for 68% of HFs. Legend: ‘na.StructTS’ approach has lower RMSE; ‘na.interp’ approach has lower RMSE
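The deck does not spell out how raw bias and RMSE were computed. Under the conventional definitions, and assuming the error is taken as imputed minus true over the set $M_f$ of artificially deleted points for health facility $f$, they would read as follows. The symbols $M_f$, $\hat{y}_{f,t}$, $y_{f,t}$, and $\Delta_f$ are notation introduced here for illustration; the sign convention for bias is an assumption.

```latex
\[
\text{raw bias}_f = \frac{1}{|M_f|} \sum_{t \in M_f} \left( \hat{y}_{f,t} - y_{f,t} \right),
\qquad
\text{RMSE}_f = \sqrt{ \frac{1}{|M_f|} \sum_{t \in M_f} \left( \hat{y}_{f,t} - y_{f,t} \right)^2 }
\]
\[
\Delta_f = \text{RMSE}_f^{\text{na.interp}} - \text{RMSE}_f^{\text{na.StructTS}}
\]
```

With this ordering of the difference, $\Delta_f > 0$ means na.StructTS had the lower RMSE for facility $f$, which matches the reading given on the slide.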
  • 48. • Assess missingness • Address egregious outliers • Manage new/defunct facility records • Decompose the time series • Try a few different interpolation techniques and plot results • Isolate a subset of records with no missing data • Introduce missing data and then recreate the “truth” Recap
  • 50. This presentation was produced with the support of the United States Agency for International Development (USAID) under the terms of the Data for Impact (D4I) associate award 7200AA18LA00008, which is implemented by the Carolina Population Center at the University of North Carolina at Chapel Hill, in partnership with Palladium International, LLC; ICF Macro, Inc.; John Snow, Inc.; and Tulane University. The views expressed in this publication do not necessarily reflect the views of USAID or the United States government. www.data4impactproject.org
  • 51. • DHIS2 time series do not always lend themselves well to multiple imputation • Multiple imputation is preferable when variables predictive of missingness can be included in the imputation model • With DHIS2 data, it can be difficult to locate other time-dependent variables to aid the imputation process • DHIS2 time series may exhibit a missing-not-at-random (MNAR) structure • Earlier time points have more missing data • Zero values are more likely to be missing Imputation
  • 52. • Advantages of using DHIS2 data • Access to a wide breadth of data elements/services • Analyze at various levels of the health system • National, regional, district, health facility • Data are generally collected via standardized reporting tools • Data tend to be reported at regular intervals allowing for frequent updates to analyses • However, not all data elements are well-reported, and it is typically necessary to process/clean DHIS2 data Why Use DHIS2 Data?