1) The document discusses three ways of using big data in statistics: (1) combined with survey data, (2) from a single complete source, and (3) from a single incomplete source.
2) Examples of type 2 include road sensor traffic data and web-scraped price data. These sources completely cover their target populations.
3) Examples of type 3 include social media data and mobile phone data. Only part of the target population is included, so ways must be found to deal with the missing part, such as determining the characteristics of the included population.
Big data selection bias and ways to correct
1. Big data, selection bias, and ways
to correct for it
Piet Daas, Bart Buelens
Thanks to: Jan van den Brakel, Marco Puts, Martijn Tennekes,
Chang Sun, Jade Cock and Agata Troost
2. Use of Big Data
– Statistics Netherlands has been studying the potential
application and use of Big Data for a number of years
– How have we used Big Data so far?
– Three types of Big Data use
‐ 1) Combined with survey (or admin) data
‐ 2) Single source, but complete (census like)
‐ 3) Single source, but incomplete (part of population)
– Important considerations
– Quality of the data (and metadata)
– Coverage and ‘selectivity’ of the population
3. 1. Type of Big Data use
– 1) Survey based, Big Data as additional source
‐ Consumer confidence + sentiment in social media
‐ CPI traditional + scanner data + web collected prices
‐ Survey methodology is the basis
‐ Methodological considerations:
‐ For some Big Data sources information needs to be
extracted first, e.g.
• Determining sentiment of social media messages
• Using pictures to identify product on the web
4. 1. Consumer confidence + social media
(~10%) (~80%)
- Combined sentiment of public Dutch Facebook and Twitter messages per month
correlates ~0.9 with (monthly) Consumer Confidence survey data
- Raw monthly aggregates of both series cointegrate
- Social media sentiment improves precision of survey based Consumer Confidence
estimate (Van den Brakel et al. (2017) Survey Methodology, forthcoming)
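The ~0.9 figure above is a correlation between two monthly series. As a minimal sketch of how such a coefficient is computed, the following uses plain Python and the standard Pearson formula. The function name and the six monthly values are hypothetical and invented for illustration; they are not the Statistics Netherlands data, and the cointegration analysis mentioned on the slide is not covered here.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly values, invented for illustration only.
survey_confidence = [-30.0, -28.0, -25.0, -27.0, -20.0, -15.0]
social_sentiment = [0.10, 0.12, 0.16, 0.13, 0.20, 0.27]

r = pearson_correlation(survey_confidence, social_sentiment)
print(f"correlation: {r:.2f}")
```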
5. 2. Type of Big Data use
– 2) Big Data as the main/single source, Census approach
‐ Road sensor based traffic intensity statistics
‐ CPI fully based on web collected prices
‐ Land use statistics based on satellite images
‐ AIS data of ships for maritime statistics
‐ These Big Data sources have in common that:
• Target population is completely included (i.e. census)
(e.g. roads, products, country, vessels)
• Variable in source is identical/very similar/can be converted to
the one needed!
8. 2. Road sensor based intensity estimates
[Figure: number of vehicles over time (years)]
- Five quality indicators are used to select which sensors' (daily) data are used
- Missing data is the biggest problem (~40% of expected data is absent)
- Vehicle estimates are calculated per road segment with sensor weights
- Low sensor coverage of highways in first half of 2010 results in poor estimates
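A rough sketch of a weighted per-segment estimate that tolerates missing sensor data is given below. The slides do not say how the sensor weights are derived or how gaps are handled, so the renormalisation of the remaining weights, the function name `segment_estimate`, and all numbers are assumptions made for illustration.

```python
def segment_estimate(counts, weights):
    """Weighted mean of sensor counts on one road segment.

    Sensors whose count is None (missing data) are skipped and the
    remaining weights are renormalised. Returns None if nothing is usable.
    This is an assumed scheme, not the method used in the slides.
    """
    usable = [(c, w) for c, w in zip(counts, weights) if c is not None]
    total_weight = sum(w for _, w in usable)
    if total_weight == 0:
        return None
    return sum(c * w for c, w in usable) / total_weight


# Hypothetical daily counts for three sensors on one segment; the second
# sensor delivered no data that day.
daily_counts = [100, None, 200]
sensor_weights = [1.0, 2.0, 3.0]

estimate = segment_estimate(daily_counts, sensor_weights)
print(estimate)  # 175.0
```

When every sensor on a segment is missing, the function returns `None` rather than a number, which mirrors the point that low sensor coverage leaves little to estimate from.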
9. 3. Type of Big Data use
– 3) Big Data as the main source, but population not complete
‐ Social tension indicator using social media
‐ ‘Day time population’ using mobile phone data
‐ Tourism statistics using mobile phone data
‐ Energy statistics using smart meters
‐ …
‐ Part of the target population is included
‐ Need to find ways to deal with/correct for missing part
10. 3. Type of Big Data use
– 3) Try to ‘deal’ with missing part of ‘population’
‐ Social tension monitor using social media
• Detect relevant messages with keywords
• The relative number of messages per day is used
‐ ‘Day time population’ using mobile phone data (1 provider)
• Assume 1/3 of the population uses this provider
• Use age distribution of provider population for correction
• Future: Verify findings with data of another provider
‐ Tourism statistics using mobile phone data (1 provider)
• Not done yet: change in foreign phones accessing the provider’s
network
‐ It’s essential to find ways to obtain characteristics of the
population included in the Big Data source!
• This is challenging because directly available
background characteristics are sometimes absent
• Look for features (=measurable properties)
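The mobile phone correction described above (an assumed 1/3 market share, adjusted with the provider's age distribution) can be sketched as a simple post-stratification by age group. The age groups, counts, and function names below are hypothetical; the slides do not specify the actual procedure, so this is one plausible reading rather than the authors' method.

```python
def naive_estimate(observed_by_age, market_share=1 / 3):
    """Scale all observed devices by one overall market share."""
    return sum(observed_by_age.values()) / market_share


def age_corrected_estimate(observed_by_age, subscribers_by_age, population_by_age):
    """Scale each age group by its own population-to-subscriber ratio."""
    total = 0.0
    for group, observed in observed_by_age.items():
        total += observed * population_by_age[group] / subscribers_by_age[group]
    return total


# Hypothetical figures, invented for illustration only.
population_by_age = {"15-34": 4000, "35-64": 6000, "65+": 2000}
subscribers_by_age = {"15-34": 2000, "35-64": 1800, "65+": 200}  # 1/3 overall
observed_by_age = {"15-34": 50, "35-64": 90, "65+": 30}  # devices seen in an area

naive = naive_estimate(observed_by_age)
corrected = age_corrected_estimate(
    observed_by_age, subscribers_by_age, population_by_age
)
print(round(naive), round(corrected))  # 510 700
```

In this made-up example the provider is over-represented among the young and under-represented among the old, so a single overall share and the per-group correction give clearly different totals.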
11. 3. Selectivity of mobile phone data
[Figure: number of people in the city of Assen over time, with peaks at the motor race (TT), 90,000 visitors, and the Truckstar festival, 55,000 visitors]
Overestimating the number of visitors based on mobile phone data
of a single provider
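The slides do not state why the single-provider figures overestimate visitors. One possible mechanism, shown here purely as an arithmetic sketch, is that the provider's share among event visitors is higher than the share assumed when scaling up. The function names and all numbers are hypothetical.

```python
def scaled_estimate(observed_devices, assumed_share):
    """Scale observed devices to a population estimate using an assumed share."""
    if not 0 < assumed_share <= 1:
        raise ValueError("assumed_share must be in (0, 1]")
    return observed_devices / assumed_share


def overestimation_factor(assumed_share, actual_share):
    """Ratio of the scaled estimate to the true count when shares differ."""
    return actual_share / assumed_share


# Hypothetical: 1000 devices observed, share assumed to be 0.25, while the
# provider's actual share among the visitors is 0.5.
estimate = scaled_estimate(1000, 0.25)        # 4000.0
true_count = scaled_estimate(1000, 0.5)       # 2000.0
factor = overestimation_factor(0.25, 0.5)     # 2.0
print(estimate, true_count, factor)
```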
12. Big Data based statistics
– It’s possible, but depends on type of use
– 1) Survey based -> Need to ‘link’ Big Data source
– 2) Big Data census like -> Coverage (units) and
comparability (variable)
– 3) Big Data incomplete -> Selectivity, coverage and stability
of population in source
Type 3 in particular requires more methodological research
- Find ways to determine coverage and correct for selectivity
by extracting and studying ‘features’
- Find other data sources to increase coverage of target
population