Some sampling techniques for big data analysis
Jae Kwang Kim 1
Iowa State University & KAIST
May 31, 2017
1 Joint work with Zhonglei Wang
Example
Let’s look at an artificial finite population.
ID Size of farms yield(y)
1 4 1
2 6 3
3 6 5
4 20 15
Parameter of interest: Mean yield of the farms in the population
Assume that farm sizes are known.
Instead of observing N = 4 farms, we want to select a sample of size
n = 2.
Example - Continued
6 possible samples
case sample ID sample mean sampling error
1 1, 2 2 -4
2 1, 3 3 -3
3 1, 4 8 2
4 2, 3 4 -2
5 2, 4 9 3
6 3, 4 10 4
Each sample has a sampling error.
There are two ways to select one of the six possible samples:
Nonprobability sampling: select a sample subjectively (e.g., using the size of the farms).
Probability sampling: select a sample by a probability rule.
Example - Continued : Probability sampling 1
Simple Random Sampling : Assign the same selection probability to all
possible samples
case sample ID sample mean(¯y) sampling error selection probability
1 1, 2 2 -4 1/6
2 1, 3 3 -3 1/6
3 1, 4 8 2 1/6
4 2, 3 4 -2 1/6
5 2, 4 9 3 1/6
6 3, 4 10 4 1/6
In this case, the sample mean (¯y) has a discrete probability distribution:
¯y = 2, 3, 4, 8, 9, or 10, each with probability 1/6.
Expected value of sampling error
E(¯y − ¯Y) = (−4 − 3 + 2 − 2 + 3 + 4)/6 = 0.
Thus, the estimator is unbiased.
The variance of the sampling error is (16 + 9 + 4 + 4 + 9 + 16)/6 = 9.67.
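These values are easy to verify by enumeration. Below is a minimal Python sketch (mine, not from the talk) that lists all 6 samples of size n = 2 and recomputes the expected sampling error and its variance.

```python
from itertools import combinations

y = {1: 1, 2: 3, 3: 5, 4: 15}              # yields from the table above
Y_bar = sum(y.values()) / len(y)           # population mean = 6

samples = list(combinations(y, 2))         # the 6 possible samples of size 2
errors = [sum(y[i] for i in s) / 2 - Y_bar for s in samples]

print(sum(errors) / len(errors))                  # expected sampling error: 0.0
print(sum(e ** 2 for e in errors) / len(errors))  # variance: 9.666...
```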
Probability Sampling
Definition : For each element in the population, the probability that
the element is included in the sample is known and greater than 0.
Advantages
1 Excludes subjectivity in selecting samples.
2 Removes sampling bias (or selection bias).
What is sampling bias? (θ: true value, ˆθ: estimate of θ)
(sampling) error of ˆθ = ˆθ − θ = (ˆθ − E(ˆθ)) + (E(ˆθ) − θ) = variation + bias
In nonprobability sampling, the variation is 0 but there is a bias. In probability sampling, there is variation but the bias is 0.
Probability Sampling
Main theory
1 Law of Large Numbers : ˆθ converges to E ˆθ for sufficiently large
sample size.
2 Central Limit Theorem : ˆθ follows a normal distribution for sufficiently
large sample size.
Additional advantages of probability sampling with large sample :
1 Improve the precision of an estimator
2 Can compute confidence intervals or test statistical hypotheses.
With the same sample size, we may have different precision.
Survey Sampling
A classical area of statistics concerned with ...
1 Drawing a probability sample from a target population
2 Using the sample to make inference about the population
Why survey sampling?
1 Cost consideration: Data collection often involves cost; a smaller sample means less money.
2 Representativeness: Probability sampling is the only scientifically justified approach to data collection for population studies.
3 Computational efficiency: A sample takes less memory and storage, and less computing time.
Survey Sampling in the era of Big Data
Challenges
1 Decreasing response rates: strict probability sampling is not possible
and representativeness may be weakened.
2 Cost model changes: data collection cost is not proportional to size.
3 Many competitors: survey sampling is no longer the only way of
collecting data. Abundant data sources.
Paradigm Shift in Survey Sampling
Missing data framework:
From sample to population ⇒ From observed to unobserved
Combine survey data with other existing data
Survey Data + Auxiliary Data ⇒ Synthetic Data
Sampling techniques are still useful in handling big data (e.g., reservoir sampling, balanced sampling, inverse sampling, calibration weighting, statistical inference with informative sampling, etc.).
In this talk, we will introduce ...
1 Reservoir sampling and its variants
2 Inverse sampling
3 Synthetic data imputation (or survey integration)
Topic 1: Reservoir Sampling
McLeod and Bellhouse (1983)’s idea:
1 Let S = {1, · · · , n} be the initial reservoir sample.
2 For each new element k = n + 1, n + 2, . . ., select it with probability n/k.
3 If element k is selected, replace one randomly chosen element of S with it; otherwise keep S. Go to Step 2 for the next element.
Step 1 is the initialization step, Step 2 is the inclusion step, and Step 3 is the removal step.
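A minimal one-pass sketch of this algorithm (my illustration; the function name and arguments are not from the talk):

```python
import random

def reservoir_sample(stream, n, seed=None):
    """One-pass SRS of size n from a stream of unknown length
    (McLeod and Bellhouse, 1983)."""
    rng = random.Random(seed)
    S = []
    for k, item in enumerate(stream, start=1):
        if k <= n:
            S.append(item)                 # Step 1: fill the initial reservoir
        elif rng.random() < n / k:         # Step 2: include element k w.p. n/k
            S[rng.randrange(n)] = item     # Step 3: replace a random element of S
    return S

sample = reservoir_sample(range(10_000), n=100, seed=1)   # usage example
```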
Justification
Let Sk be the set of sample elements selected from Uk = {1, · · · , k}.
If m ≤ n,
P(m ∈ Sk) = 1 × n/(n + 1) × (n + 1)/(n + 2) × · · · × (k − 1)/k = n/k.
If m > n,
P(m ∈ Sk) = (n/m) × m/(m + 1) × · · · × (k − 1)/k = n/k.
Remark
Selects a SRS (simple random sample) of size n from a stream of arbitrary length in one pass.
No need to know the population size in advance.
Very useful for big data sampling when the population is updated continuously.
Improving reservoir sampling
Goal: We wish to reduce the variance of ˆθ, which is computed from the reservoir sample.
Method 1: Balanced reservoir sampling
We wish to impose a balance condition on the reservoir sample:
(1/n) Σ_{i∈S} zi = (1/N) Σ_{i∈U} zi.
The variable z is called a control variable.
To achieve the balance condition, we can modify Step 3 (removal step) as follows:
1 Let ¯zk be the mean of z over the population up to element k, Uk = {1, · · · , k}.
2 For each element in the current sample S, compute
D(i) = |¯zk − ¯z_S^(−i)|, i ∈ S,
where ¯z_S^(−i) is the sample mean excluding zi.
3 Instead of removing one element at random in Step 3, remove the element i∗ with the smallest value of D(i), as in the sketch below.
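A sketch of this modified removal step (my illustration; it assumes the running population mean ¯zk is tracked alongside the stream, and the helper name is hypothetical):

```python
import numpy as np

def balanced_removal_index(z_sample, zbar_k):
    """Index i* whose leave-one-out sample mean is closest to the running
    population mean zbar_k, i.e., the i* minimizing D(i)."""
    z = np.asarray(z_sample, dtype=float)
    loo_means = (z.sum() - z) / (len(z) - 1)   # sample means excluding each z_i
    return int(np.argmin(np.abs(zbar_k - loo_means)))
```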
A numerical experiment
Figure: Trace of sample means of size n = 100 from a population of size
N = 10, 000. The left side is the trace of the sample mean from the classical
reservoir sampling and the right side is the trace of the sample mean from the
proposed reservoir sampling.
Remark
The balanced reservoir sampling may provide an efficient sample for
estimating population mean, but not the population distribution.
We may wish to provide an efficient representative sample that
provides consistent estimates for various population parameters.
Method 2: Stratified reservoir sampling
Assume that the finite population consists of H subpopulations, called
strata. The within-stratum variance is small.
For scalar y with bounded support, we can predetermine the stratum
boundaries by creating H intervals with equal length.
We want to obtain a reservoir sample that achieves stratified random
sampling with proportional allocation.
The basic idea is to use Chao (1982)’s sampling to achieve equal-probability sampling within each stratum, with selection probability proportional to wi = Nh/nh, where Nh and nh are updated continuously.
Stratified reservoir sampling
Let Uk = {1, · · · , k} be the finite population up to element k.
Let π(k; i) be the first-order inclusion probability that unit i is
selected from Uk.
We want to have
Σ_{i=1}^{k} π(k; i) = n and π(k; i) ∝ wk,i,
where wk,i = Nk,h/nk,h for the stratum h to which unit i belongs, and Nk,h and nk,h are the population size and the sample size, respectively, of stratum h.
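One standard way to obtain such π(k; i) is iterative capping: scale probabilities proportional to wk,i so they sum to n, fix any that exceed 1 at 1, and rescale the rest. A sketch, assuming this construction (the helper name is mine):

```python
import numpy as np

def inclusion_probs(w, n):
    """pi_i proportional to w_i, capped at 1, with sum(pi) = n
    (assumes n <= len(w))."""
    w = np.asarray(w, dtype=float)
    pi = np.zeros_like(w)
    capped = np.zeros(len(w), dtype=bool)
    while True:
        slack = n - capped.sum()               # sample size left for uncapped units
        pi[~capped] = slack * w[~capped] / w[~capped].sum()
        newly = (~capped) & (pi >= 1.0)
        if not newly.any():
            return np.where(capped, 1.0, pi)
        capped |= newly                        # cap these units at 1 and rescale
```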
Stratified reservoir sampling using Chao’s method
At time k + 1, we update the reservoir sample as follows:
1 Select unit k + 1 with probability π(k + 1; k + 1).
2 If unit k + 1 is not selected, retain the sample at time k.
3 If unit k + 1 is selected, then remove one unit from the current
sample with the removal probability
Rkj = 0 if j ∈ Ak; Tkj if j ∈ Bk; (1 − Tk)/(n − Lk) if j ∈ Ck,
where Ak = {i ∈ Sk : π(k; i) = π(k + 1; i) = 1}, Bk = {i ∈ Sk : π(k; i) = 1, π(k + 1; i) < 1}, Ck = {i ∈ Sk : π(k; i) < 1, π(k + 1; i) < 1}, Tk = Σ_{j∈Bk} Tkj, Tkj = {1 − π(k + 1; j)}/π(k + 1; k + 1), and Lk = |Ak ∪ Bk|.
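A sketch of one such update, using exactly the removal probabilities above (my illustration; π(k; i) and π(k + 1; i) for the held units are assumed given, e.g., from a routine like inclusion_probs above, and Ck is assumed nonempty so that the removal probabilities sum to 1, since n − Lk = |Ck|):

```python
import numpy as np

def chao_update(rng, sample, pi_old, pi_new, new_unit, p_new):
    """One Chao (1982) reservoir update at time k + 1.
    sample: list of the n held units; pi_old[j] = pi(k; j), pi_new[j] = pi(k+1; j);
    p_new = pi(k+1; k+1); rng is a numpy Generator, e.g. np.random.default_rng()."""
    n = len(sample)
    if rng.random() >= p_new:          # unit k+1 not selected: retain the sample
        return sample
    B = [j for j in range(n) if pi_old[j] == 1 and pi_new[j] < 1]
    C = [j for j in range(n) if pi_old[j] < 1 and pi_new[j] < 1]
    T = {j: (1 - pi_new[j]) / p_new for j in B}
    R = np.zeros(n)                    # units in A_k keep removal probability 0
    for j in B:
        R[j] = T[j]
    for j in C:                        # n - L_k = |C_k|
        R[j] = (1 - sum(T.values())) / len(C)
    out = rng.choice(n, p=R)           # R sums to 1 by construction
    return sample[:out] + [new_unit] + sample[out + 1:]
```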
Simulation study
Finite population {yi : i ∈ U} is generated by
yi ∼ N(10, 10)
independently for i ∈ U, where U = {1, . . . , N} with N = 5, 000.
Sample size n = 1, 000.
Three reservoir methods are considered:
Classical reservoir sampling (McLeod and Bellhouse, 1983).
Balanced reservoir sampling with yi as the control variable.
Stratified reservoir sampling: 51 strata partitioned by equally spaced knots {k1, . . . , k50}, where k1 and k50 are the 1% and 99% sample quantiles of {yi : i = 1, . . . , n}.
Simulation study (Cont’d)
We are interested in estimating the following finite population parameters:
Mean: mk = (1/k) Σ_{i=1}^{k} yi, where k = n + 1, . . . , N.
Probability: Pk,j = (1/k) Σ_{i=1}^{k} I(yi < qj), where j = 1, . . . , 5 and q1, . . . , q5 are the 5%, 27.5%, 50%, 72.5%, and 95% quantiles of N(10, 10).
1,000 simulations are conducted.
Bias of estimating mk
[Figure: bias of estimating mk versus k − n under the Classical, Balanced, and Stratified methods. The black line is the point-wise mean of the simulations; the blue lines are the point-wise 5% and 95% quantiles.]
Bias of estimating Pk,1
[Figure: bias of estimating Pk,1 versus k − n under the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,2
[Figure: bias of estimating Pk,2 versus k − n under the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,3
[Figure: bias of estimating Pk,3 versus k − n under the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,4
[Figure: bias of estimating Pk,4 versus k − n under the Classical, Balanced, and Stratified methods.]
Bias of estimating Pk,5
[Figure: bias of estimating Pk,5 versus k − n under the Classical, Balanced, and Stratified methods.]
Summary statistics for k = 5, 000
Table: Summary statistics for k = 5,000. The unit for the values is 10⁻³.

            Mean            q1             q2
            Bias    S.E.    Bias    S.E.   Bias     S.E.
Classical   0.13    88.36   0.10    5.84   0.26     12.84
Balanced    0.83    1.02    50.19   6.84   165.66   11.77
Stratified  -22.47  17.89   -0.91   1.61   3.48     3.06

            q3              q4             q5
            Bias    S.E.    Bias    S.E.   Bias     S.E.
Classical   -0.18   14.05   -0.47   12.80  0.19     6.04
Balanced    -2.61   7.07    -164.25 11.81  -50.81   6.67
Stratified  3.56    3.00    1.74    2.59   1.89     1.48

S.E.: standard error of the 1,000 Monte Carlo estimates
Discussion
Balanced reservoir sampling provides the most efficient estimation for
the population mean, but it provides biased estimates for other
parameters.
Stratified reservoir sampling provides more efficient estimates than
the classical reservoir sampling for various parameters.
It can be shown that, under some conditions, the stratified reservoir
sampling satisfies
V(ˆθ) = O{1/(nH)},
where H is the number of strata.
Topic 2: Inverse sampling
Goal: Obtain a simple random sample from the current sample
obtained from a complex sampling design.
Originally proposed by Hinkins et al (1997) and Rao et al (2003) in
the context of stratified cluster sampling.
Note that the inverse sampling can be treated as a special case of
two-phase sampling:
1 Phase One: Original sample obtained from complex sampling design.
2 Phase Two: Select a subsample from the Phase One sample so that the final sample has equal first-order inclusion probabilities.
Inverse sampling for big data analysis
Two cases:
1 Big data = population
2 Big data ≠ population
In case one, the sampling problem is easy.
In case two, we need to adjust for the bias associated with the big data itself.
If the data is not obtained from a probability sample, it may be
subject to selection bias.
We will discuss how to remove the selection bias of the big data using the inverse sampling idea.
Inverse sampling for big data analysis
Assume that the big data (D) is a subset of the finite population.
Let δ be the indicator function
δi = 1 if i ∈ D, and δi = 0 otherwise.
The observations in D can be viewed as a random sample from f (y | δ = 1).
However, we are interested in obtaining a sample from f (y).
How can we compute E(Y) = ∫ y f (y) dy using the observations from f (y | δ = 1)?
Importance sampling idea
Goal: We are interested in estimating E(Y) = ∫ y f (y) dy using y1, · · · , yn from f (y | δ = 1).
Note that
E(Y) = ∫ y {f (y)/f (y | δ = 1)} f (y | δ = 1) dy ≅ (1/n) Σ_{i=1}^{n} wi yi,
where
wi ∝ f (yi)/f (yi | δi = 1).
The weight wi is called the importance weight. It is the relative amount of representativeness of yi with respect to f (y) when yi is generated from f (y | δ = 1).
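Because wi is known only up to a proportionality constant, a normalized (Hájek-type) version of the estimator is the natural implementation. A minimal sketch (mine):

```python
import numpy as np

def importance_estimate(y, w):
    """Estimate E(Y) from draws y_i ~ f(y | delta = 1) with weights
    w_i proportional to f(y_i) / f(y_i | delta_i = 1)."""
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    return np.sum(w * y) / np.sum(w)   # normalizing cancels the unknown constant
```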
Inverse sampling for big data analysis
Assume for now that f (y) and f (y | δ = 1) are of known form.
For each i ∈ D, compute
wi ∝ f (yi)/f (yi | δi = 1).
Using the Chao method or another unequal-probability design, select a sample of size n with selection probability proportional to wi, as sketched below.
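A rough sketch of this selection step (my illustration; numpy's sequential weighted draw without replacement only approximates an exact πps design such as Chao's method):

```python
import numpy as np

def inverse_sample(ids, w, n, seed=None):
    """Select n units with selection probability roughly proportional to w_i."""
    rng = np.random.default_rng(seed)
    p = np.asarray(w, dtype=float)
    return rng.choice(ids, size=n, replace=False, p=p / p.sum())
```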
What if f (y) and f (y | δ = 1) are unknown?
Inverse sampling for big data analysis (Cont’d)
Assume bivariate observations (xi, yi) from f (x, y | δ = 1).
Suppose that µx = E(X) is available from an external source.
Let
P0 = {f (x, y) : ∫ x f (x, y) d(x, y) = µx }.
Using µx, we can find f0 (x, y) ∈ P0 that minimizes the Kullback-Leibler distance
∫ f0 (x, y) ln {f0 (x, y)/f (x, y | δ = 1)} d(x, y). (1)
Inverse sampling for big data analysis (Cont’d)
The solution to (1) is
f0 (x, y) = f (x, y | δ = 1) exp{λ(x − µx )} / E[exp{λ(X − µx )}], (2)
where λ satisfies ∫ x f0 (x, y) d(x, y) = µx.
The transformation from f (x, y | δ = 1) to f0 (x, y) is called exponential tilting. It tilts the density f (x, y | δ = 1) so that f0 (x, y) satisfies the calibration constraint.
The importance weight in this situation is
w∗i ∝ f0 (xi, yi)/f (xi, yi | δi = 1) ∝ exp{ˆλ(xi − µx )}, (3)
where ˆλ satisfies Σ_{i=1}^{n} w∗i xi = µx.
Closely related to the exponential tilting (ET) weight (Kim, 2010).
Inverse reservoir sampling using exponential tilting
Reservoir inverse sampling:
1 Phase one: Compute w∗i using the ET method.
2 Phase two: Use Chao's reservoir sampling to select an unequal probability sample of size n with πi ∝ w∗i.
Computation of ˆλ in w∗i in (3):
1 Newton method: Solve Σ_{i=1}^{n} w∗i (λ) xi = µx iteratively.
2 One-step method (Kim, 2010): Approximation using
ˆλ = (µx − ¯xk)/s²k,
where ¯xk and s²k are the sample mean and variance of {xi : i = 1, . . . , k}.
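A sketch of both computations (my implementation; the Newton derivative used below is the variance of x under the normalized tilted weights):

```python
import numpy as np

def lambda_one_step(x, mu_x):
    """One-step approximation (Kim, 2010): (mu_x - xbar) / s^2."""
    x = np.asarray(x, dtype=float)
    return (mu_x - x.mean()) / x.var(ddof=1)

def lambda_newton(x, mu_x, tol=1e-10, max_iter=50):
    """Solve sum_i w*_i(lambda) x_i = mu_x with normalized tilted weights
    w*_i proportional to exp{lambda (x_i - mu_x)}."""
    x = np.asarray(x, dtype=float)
    lam = 0.0
    for _ in range(max_iter):
        w = np.exp(lam * (x - mu_x))
        w /= w.sum()
        g = np.sum(w * x) - mu_x                     # calibration residual
        if abs(g) < tol:
            break
        var_w = np.sum(w * x**2) - np.sum(w * x)**2  # dg/dlambda > 0
        lam -= g / var_w                             # Newton step
    return lam
```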
Simulation study
Finite population {(xi , yi ) : i ∈ U} is generated by
xi ∼ N(0, 1)
yi | xi ∼ N(xi , 1)
for i ∈ U, where U = {1, . . . , N} and N = 100, 000.
Indicator {δi : i ∈ U} is obtained by
logit{Pr(δi = 1 | xi )} = 0.25xi ,
where logit(x) = log{x/(1 − x)}.
The data stream is the first 10,000 elements with δi = 1.
Sample size n = 500.
We are interested in estimating E(Y ) by the proposed reservoir
inverse sampling methods.
Bias of estimating E(Y )
[Figure: bias of estimating E(Y) versus k − 4n for reservoir inverse sampling (R_whole), the one-step version (R_one_step), and traditional reservoir sampling (Trd.).]
Summary statistics for k = 10, 000
Table: Summary statistics for k = 10,000. The unit for the values is 10⁻².

                             Bias     S.E.
Reservoir inverse sampling   0.07     6.52
Reservoir one step           0.20     6.41
Traditional reservoir        12.33∗   5.94

S.E.: standard error of the 1,000 Monte Carlo estimates
∗: significant bias
Topic 3: Survey Integration
Combine information from two independent surveys, A and B, from
the same target population
1 Non-nested two-phase sample: Observe x from survey A and observe
(x, y) from survey B (Hidiroglou, 2001; Merkouris, 2004; Kim and
Rao, 2012).
2 Two surveys with measurement errors: Observe (x, y1) from survey A
and observe (x, y2) from survey B (Kim, Berg, and Park, 2016).
Survey integration
Combining big data with survey data
1 A: survey data (representative sample, expensive), observe (x, z)
2 B: big data (self-selected sample, inexpensive), observe (x, y).
Rivers (2007)’s idea: Use x to implement nearest neighbor imputation for sample A; that is, create synthetic data for y in sample A.
Table: Data Structure

Data  Representativeness  X  Z  Y
A     Yes                 o  o
B     No                  o     o

Table: Rivers idea

Data  Representativeness  X  Z  Y
A     Yes                 o  o  o
B     No                  o     o
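A minimal sketch of the nearest neighbor step (my illustration, assuming a scalar matching variable x; a multivariate distance or k-d tree would be used in practice):

```python
import numpy as np

def rivers_nn_impute(x_A, x_B, y_B):
    """For each unit in survey sample A, borrow y from the big-data unit
    in B whose x value is closest (Rivers, 2007)."""
    x_A = np.asarray(x_A, dtype=float)
    x_B = np.asarray(x_B, dtype=float)
    donor = np.abs(x_A[:, None] - x_B[None, :]).argmin(axis=1)  # nearest donors
    return np.asarray(y_B)[donor]
```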
Remark
Data B is subject to selection bias.
The Rivers method is justified if
f (yi | xi, i ∈ B) = f (yi | xi, i ∈ A).
In some literature, the above condition is called transportability.
The transportability can be achieved if the selection mechanism for
big data is non-informative (Pfeffermann 2011) or ignorable.
In this case, the synthetic data imputation can provide unbiased
estimation of the population parameters.
New approach for combining big data with survey data
1 In survey sample A, observe
δi = 1 if i ∈ B, and δi = 0 otherwise,
by asking a question about membership in the big data.
2 Since we observe (xi, zi, δi) for all i ∈ A, we can fit a model for πi = P(δi = 1 | xi, zi) to obtain ˆπi.
3 Use wi = 1/ˆπi as the weight for analyzing big data B (a sketch follows).
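A sketch of Steps 2–3 (my illustration using scikit-learn's logistic regression; in practice a design-weighted fit, e.g., via sample_weight with A's survey weights, would be used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def big_data_weights(XZ_A, delta_A, XZ_B):
    """Fit pi_i = P(delta_i = 1 | x_i, z_i) on probability sample A, where
    delta_A flags membership in the big data B, then weight B by 1 / pi-hat."""
    model = LogisticRegression().fit(XZ_A, delta_A)
    pi_hat_B = model.predict_proba(XZ_B)[:, 1]   # estimated P(delta = 1)
    return 1.0 / pi_hat_B
```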
Remark
Note that the self-selection bias is essentially the same as the
nonresponse bias in the missing data literature.
Probability sample A is used to estimate the response probability πi .
The Rivers method is based on the assumption that
P(δ = 1 | x, y, z) = P(δ = 1 | x).
The proposed method is based on a weaker assumption on the response mechanism:
P(δ = 1 | x, y, z) = P(δ = 1 | x, z).
Remark
Once the propensity score weights are computed, we can use them for reservoir inverse sampling.
Variance estimation needs to be developed.
Small area estimation
Small area estimation can be viewed as a special case of survey
integration.
Big data can be used in small area estimation (Rao and Molina,
2015).
We can use big data as covariates in area-level model for small area
estimation (Marchetti et al, 2015).
For unit level model, statistical issues such as self-selection bias and
measurement errors may occur.
Conclusion
Three statistical tools for big data analysis
1 Reservoir sampling
2 Inverse sampling
3 Survey integration
Problems with big data: measurement error, self-selection bias
Survey data can be used to reduce the problems in big data.
Promising area of research
References
Chao, M. T. (1982). A General Purpose Unequal Probability Sampling Plan.
Biometrika 69, 653–656.
Hidiroglou, M. A. (2001). Double Sampling. Surv. Methodol. 27, 143–154.
Hinkins, S., Oh, H. L. & Scheuren, F. (1997). Inverse Sampling Design
Algorithm. Surv. Methodol. 23, 11–21.
Kim, J. K. (2010). Calibration estimation using exponential tilting in sample
surveys. Surv. Methodol. 36, 145–155.
Kim, J. K., Berg, E. & Park, T. (2016). Statistical matching using fractional
imputation. Surv. Methodol. 42, 19–40.
Kim, J. K. & Rao, J. N. K. (2012). Combining data from two independent
surveys: a model-assisted approach. Biometrika 99, 85–100.
Marchetti, S., Giusti, C., Pratesi, M., Salvati, N., Giannotti, F.,
Pedreschi, D., Rinzivillo, S., Pappalardo, L., & Gabrielli, L. (2015).
Small area model-based estimators using big data sources. J. Off. Stat. 31,
263–281.
References (Cont’d)
Merkouris, T. (2004). Combining independent regression estimators from
multiple surveys, J. Am. Statist. Assoc. 99, 1131–1139.
McLeod, A. I. & Bellhouse, D. R. (1983). A Convenient Algorithm for
Drawing a Simple Random Sample. Appl. Statist. 32, 182–184.
Pfeffermann, D. (2011). Modeling of complex survey data: Why model? Why is it a problem? How can we approach it? Surv. Methodol. 37, 115–136.
Rao, J. N. K. & Molina, I. (2015). Small area estimation. John Wiley & Sons,
New Jersey.
Rao, J. N. K., Scott, A. J. & Benhin, E. (2003). Undoing Complex Survey
Data Structures: Some Theory and Applications of Inverse Sampling. Surv.
Methodol. 29, 107–128.
Rivers, D. (2007). Sampling for web surveys. Joint Statistical Meetings, Proceedings of the Section on Survey Research Methods.

More Related Content

What's hot

RESIDUALS AND INFLUENCE IN NONLINEAR REGRESSION FOR REPEATED MEASUREMENT DATA
RESIDUALS AND INFLUENCE IN NONLINEAR REGRESSION FOR REPEATED MEASUREMENT DATARESIDUALS AND INFLUENCE IN NONLINEAR REGRESSION FOR REPEATED MEASUREMENT DATA
RESIDUALS AND INFLUENCE IN NONLINEAR REGRESSION FOR REPEATED MEASUREMENT DATAorajjournal
 
Approximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forestsApproximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forestsChristian Robert
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixturesChristian Robert
 
When Classifier Selection meets Information Theory: A Unifying View
When Classifier Selection meets Information Theory: A Unifying ViewWhen Classifier Selection meets Information Theory: A Unifying View
When Classifier Selection meets Information Theory: A Unifying ViewMohamed Farouk
 
Hypothesis testings on individualized treatment rules
Hypothesis testings on individualized treatment rulesHypothesis testings on individualized treatment rules
Hypothesis testings on individualized treatment rulesYoung-Geun Choi
 
ABC short course: introduction chapters
ABC short course: introduction chaptersABC short course: introduction chapters
ABC short course: introduction chaptersChristian Robert
 
Slides csm
Slides csmSlides csm
Slides csmychaubey
 
ABC short course: model choice chapter
ABC short course: model choice chapterABC short course: model choice chapter
ABC short course: model choice chapterChristian Robert
 
Generative models : VAE and GAN
Generative models : VAE and GANGenerative models : VAE and GAN
Generative models : VAE and GANSEMINARGROOT
 
ABC short course: final chapters
ABC short course: final chaptersABC short course: final chapters
ABC short course: final chaptersChristian Robert
 
Comparison of the optimal design
Comparison of the optimal designComparison of the optimal design
Comparison of the optimal designAlexander Decker
 
4thchannel conference poster_freedom_gumedze
4thchannel conference poster_freedom_gumedze4thchannel conference poster_freedom_gumedze
4thchannel conference poster_freedom_gumedzeFreedom Gumedze
 
ABC short course: survey chapter
ABC short course: survey chapterABC short course: survey chapter
ABC short course: survey chapterChristian Robert
 
Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityChristian Robert
 
An overview of Bayesian testing
An overview of Bayesian testingAn overview of Bayesian testing
An overview of Bayesian testingChristian Robert
 
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized ModelPredicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Modelweekendsunny
 

What's hot (20)

RESIDUALS AND INFLUENCE IN NONLINEAR REGRESSION FOR REPEATED MEASUREMENT DATA
RESIDUALS AND INFLUENCE IN NONLINEAR REGRESSION FOR REPEATED MEASUREMENT DATARESIDUALS AND INFLUENCE IN NONLINEAR REGRESSION FOR REPEATED MEASUREMENT DATA
RESIDUALS AND INFLUENCE IN NONLINEAR REGRESSION FOR REPEATED MEASUREMENT DATA
 
ABC workshop: 17w5025
ABC workshop: 17w5025ABC workshop: 17w5025
ABC workshop: 17w5025
 
Approximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forestsApproximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forests
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
 
When Classifier Selection meets Information Theory: A Unifying View
When Classifier Selection meets Information Theory: A Unifying ViewWhen Classifier Selection meets Information Theory: A Unifying View
When Classifier Selection meets Information Theory: A Unifying View
 
Hypothesis testings on individualized treatment rules
Hypothesis testings on individualized treatment rulesHypothesis testings on individualized treatment rules
Hypothesis testings on individualized treatment rules
 
ABC short course: introduction chapters
ABC short course: introduction chaptersABC short course: introduction chapters
ABC short course: introduction chapters
 
Slides csm
Slides csmSlides csm
Slides csm
 
ABC short course: model choice chapter
ABC short course: model choice chapterABC short course: model choice chapter
ABC short course: model choice chapter
 
Generative models : VAE and GAN
Generative models : VAE and GANGenerative models : VAE and GAN
Generative models : VAE and GAN
 
ABC short course: final chapters
ABC short course: final chaptersABC short course: final chapters
ABC short course: final chapters
 
Comparison of the optimal design
Comparison of the optimal designComparison of the optimal design
Comparison of the optimal design
 
4thchannel conference poster_freedom_gumedze
4thchannel conference poster_freedom_gumedze4thchannel conference poster_freedom_gumedze
4thchannel conference poster_freedom_gumedze
 
Deep Learning Opening Workshop - Horseshoe Regularization for Machine Learnin...
Deep Learning Opening Workshop - Horseshoe Regularization for Machine Learnin...Deep Learning Opening Workshop - Horseshoe Regularization for Machine Learnin...
Deep Learning Opening Workshop - Horseshoe Regularization for Machine Learnin...
 
ABC short course: survey chapter
ABC short course: survey chapterABC short course: survey chapter
ABC short course: survey chapter
 
Statistics symposium talk, Harvard University
Statistics symposium talk, Harvard UniversityStatistics symposium talk, Harvard University
Statistics symposium talk, Harvard University
 
ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
An overview of Bayesian testing
An overview of Bayesian testingAn overview of Bayesian testing
An overview of Bayesian testing
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized ModelPredicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
 

Similar to Some sampling techniques for big data analysis

Computational Pool-Testing with Retesting Strategy
Computational Pool-Testing with Retesting StrategyComputational Pool-Testing with Retesting Strategy
Computational Pool-Testing with Retesting StrategyWaqas Tariq
 
Bayesian selection of best subsets in high-dimensional regression
Bayesian selection of best subsets in high-dimensional regressionBayesian selection of best subsets in high-dimensional regression
Bayesian selection of best subsets in high-dimensional regressionCaleb (Shiqiang) Jin
 
13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdfEmanAsem4
 
Theory of sampling
Theory of samplingTheory of sampling
Theory of samplingJags Jagdish
 
L4 theory of sampling
L4 theory of samplingL4 theory of sampling
L4 theory of samplingJags Jagdish
 
Chapter 3 sampling and sampling distribution
Chapter 3   sampling and sampling distributionChapter 3   sampling and sampling distribution
Chapter 3 sampling and sampling distributionAntonio F. Balatar Jr.
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applicationsFrank Nielsen
 
BRM Unit 2 Sampling.ppt
BRM Unit 2 Sampling.pptBRM Unit 2 Sampling.ppt
BRM Unit 2 Sampling.pptVikasRai405977
 
FUNDAMENTALS OF RESEARCH IN MEDICINE.pptx
FUNDAMENTALS OF RESEARCH IN MEDICINE.pptxFUNDAMENTALS OF RESEARCH IN MEDICINE.pptx
FUNDAMENTALS OF RESEARCH IN MEDICINE.pptxUzorTonyOzuem
 
Leveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsLeveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsAlbert Bifet
 
Population and sampling
Population and samplingPopulation and sampling
Population and samplingEdu Anud, Jr
 
Lect w2 measures_of_location_and_spread
Lect w2 measures_of_location_and_spreadLect w2 measures_of_location_and_spread
Lect w2 measures_of_location_and_spreadRione Drevale
 

Similar to Some sampling techniques for big data analysis (20)

Computational Pool-Testing with Retesting Strategy
Computational Pool-Testing with Retesting StrategyComputational Pool-Testing with Retesting Strategy
Computational Pool-Testing with Retesting Strategy
 
SAMPLING-PROCEDURE.pdf
SAMPLING-PROCEDURE.pdfSAMPLING-PROCEDURE.pdf
SAMPLING-PROCEDURE.pdf
 
Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
Bayesian selection of best subsets in high-dimensional regression
Bayesian selection of best subsets in high-dimensional regressionBayesian selection of best subsets in high-dimensional regression
Bayesian selection of best subsets in high-dimensional regression
 
13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf13_Unsupervised Learning.pdf
13_Unsupervised Learning.pdf
 
Sampling distribution
Sampling distributionSampling distribution
Sampling distribution
 
Theory of sampling
Theory of samplingTheory of sampling
Theory of sampling
 
L4 theory of sampling
L4 theory of samplingL4 theory of sampling
L4 theory of sampling
 
Chapter 3 sampling and sampling distribution
Chapter 3   sampling and sampling distributionChapter 3   sampling and sampling distribution
Chapter 3 sampling and sampling distribution
 
Information-theoretic clustering with applications
Information-theoretic clustering  with applicationsInformation-theoretic clustering  with applications
Information-theoretic clustering with applications
 
BRM Unit 2 Sampling.ppt
BRM Unit 2 Sampling.pptBRM Unit 2 Sampling.ppt
BRM Unit 2 Sampling.ppt
 
GDRR Opening Workshop - Gradient Boosting Trees for Spatial Data Prediction ...
GDRR Opening Workshop -  Gradient Boosting Trees for Spatial Data Prediction ...GDRR Opening Workshop -  Gradient Boosting Trees for Spatial Data Prediction ...
GDRR Opening Workshop - Gradient Boosting Trees for Spatial Data Prediction ...
 
9618821.ppt
9618821.ppt9618821.ppt
9618821.ppt
 
9618821.pdf
9618821.pdf9618821.pdf
9618821.pdf
 
FUNDAMENTALS OF RESEARCH IN MEDICINE.pptx
FUNDAMENTALS OF RESEARCH IN MEDICINE.pptxFUNDAMENTALS OF RESEARCH IN MEDICINE.pptx
FUNDAMENTALS OF RESEARCH IN MEDICINE.pptx
 
Statistics for management assignment
Statistics for management assignmentStatistics for management assignment
Statistics for management assignment
 
Leveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data StreamsLeveraging Bagging for Evolving Data Streams
Leveraging Bagging for Evolving Data Streams
 
Chap 6
Chap 6Chap 6
Chap 6
 
Population and sampling
Population and samplingPopulation and sampling
Population and sampling
 
Lect w2 measures_of_location_and_spread
Lect w2 measures_of_location_and_spreadLect w2 measures_of_location_and_spread
Lect w2 measures_of_location_and_spread
 

Recently uploaded

Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑Damini Dixit
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxabhishekdhamu51
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...Lokesh Kothari
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxRizalinePalanog2
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 

Recently uploaded (20)

Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 

Some sampling techniques for big data analysis

  • 1. Some sampling techniques for big data analysis Jae Kwang Kim 1 Iowa State University & KAIST May 31, 2017 1 Joint work with Zhonglei Wang
  • 2. Example Let’s look at an artificial finite population. ID Size of farms yield(y) 1 4 1 2 6 3 3 6 5 4 20 15 Parameter of interest: Mean yield of the farms in the population Assume that farm sizes are known. Instead of observing N = 4 farms, we want to select a sample of size n = 2. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 2 / 53
  • 3. Example - Continued 6 possible samples case sample ID sample mean sampling error 1 1, 2 2 -4 2 1, 3 3 -3 3 1, 4 8 2 4 2, 3 4 -2 5 2, 4 9 3 6 3, 4 10 4 Each sample has sampling error. Two ways to select one of the six possible samples. Nonprobability sampling : (using size of farms or etc.) select a sample subjectively. Probability sampling : select a sample by a probability rule. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 3 / 53
  • 4. Example - Continued : Probability sampling 1 Simple Random Sampling : Assign the same selection probability to all possible samples case sample ID sample mean(¯y) sampling error selection probability 1 1, 2 2 -4 1/6 2 1, 3 3 -3 1/6 3 1, 4 8 2 1/6 4 2, 3 4 -2 1/6 5 2, 4 9 3 1/6 6 3, 4 10 4 1/6 In this case, the sample mean(¯y) has a discrete probability distribution. ¯y =    2 w. p. 1/6 3 w. p. 1/6 4 w. p. 1/6 8 w. p. 1/6 9 w. p. 1/6 10 w. p. 1/6 Kim (ISU) Sampling techniques for big data analysis May 31, 2017 4 / 53
  • 5. Expected value of sampling error E ¯y − ¯Y = (−4 − 3 + 2 − 2 + 3 + 4) /6 = 0 Thus, the estimator is unbiased. Variance of sampling error is 1/6 ∗ (16 + 9 + 4 + 4 + 9 + 16) = 9.67. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 5 / 53
  • 6. Probability Sampling Definition : For each element in the population, the probability that the element is included in the sample is known and greater than 0. Advantages 1 Exclude subjectivity of selecting samples. 2 Remove sampling bias (or selection bias) What is sampling bias ? ( θ : true value, ˆθ: estimated value of θ) (sampling) error of ˆθ = ˆθ − θ = ˆθ − E ˆθ + E ˆθ − θ = variation + bias In nonprobability sampling, variation is 0 but there is a bias. In probability sampling, there exist variation but bias is 0. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 6 / 53
  • 7. Probability Sampling Main theory 1 Law of Large Numbers : ˆθ converges to E ˆθ for sufficiently large sample size. 2 Central Limit Theorem : ˆθ follows a normal distribution for sufficiently large sample size. Additional advantages of probability sampling with large sample : 1 Improve the precision of an estimator 2 Can compute confidence intervals or test statistical hypotheses. With the same sample size, we may have different precision. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 7 / 53
  • 8. Survey Sampling A classical area of statistics concerning .... 1 Drawing a probability sample from a target population 2 Using the sample to make inference about the population Why survey sampling? 1 Cost consideration: Data collection often involves cost. Smaller sample means less money. 2 Representativeness: Probability sampling is the only scientifically justified approach of data collection for population studies. 3 Computational efficiency: Sample takes less memory and storage, and less computing time. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 8 / 53
  • 9. Survey Sampling in the era of Big Data Challenges 1 Decreasing response rates: strict probability sampling is not possible and representativeness may be weakened. 2 Cost model changes: data collection cost is not proportional to size. 3 Many competitors: survey sampling is no longer the only way of collecting data. Abundant data sources. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 9 / 53
  • 10. Paradigm Shift in Survey Sampling Missing data framework: From sample to population ⇒ From observed to unobserved Combine survey data with other existing data Survey Data Auxiliary Data ⇒ Synthetic Data Sampling techniques are still useful in handling big data. (e.g. Reservior sampling, balanced sampling, inverse sampling, calibration weighting, statistical inference with informative sampling, etc.) Kim (ISU) Sampling techniques for big data analysis May 31, 2017 10 / 53
  • 11. In this talk, we will introduce ... 1 Reservior sampling and its variants 2 Inverse sampling 3 Synthetic data imputation (or survey integration) Kim (ISU) Sampling techniques for big data analysis May 31, 2017 11 / 53
  • 12. Topic 1: Reservoir Sampling McLeod and Bellhouse (1983)’s idea: 1 Let S = {1, · · · , n} be the initial reservoir sample. 2 If k = n + 1, · · · , select it with probability n/k. 3 If k is selected, replace it with one random sample from S. If k is not selected, then go to Step 2. Step 1 is the initialization step, Step 2 is the inclusion step, and Step 3 is the removal step. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 12 / 53
  • 13. Justification Let Sk be the set of sample elements selected from Uk = {1, · · · , k}. If m ≤ n, P(m ∈ Sk) = 1 × n n + 1 × n + 1 n + 2 × · · · × k − 1 k = n k If m > n, P(m ∈ Sk) = n m × m m + 1 × · · · × k − 1 k = n k Kim (ISU) Sampling techniques for big data analysis May 31, 2017 13 / 53
  • 14. Remark Selecting a SRS (simple random sample) of size n from arbitrary size in one pass No need to know the population size in advance Very useful for big data sampling when the population is updated continuously. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 14 / 53
  • 15. Improving reservoir sampling Goal: We wish to reduce the variance of ˆθ, which is computed from reservoir sample. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 15 / 53
  • 16. Method 1: Balanced reservoir sampling We wish to impose balanced condition in the reservoir sample 1 n i∈S zi = 1 N i∈U zi . Variable z can be called control variable. To achieve the balanced condition, we can modify Step 3 (removal step) as follows: 1 Let ¯zk be the population mean of z up to the k-th populaiton Uk = {1, · · · , k}. 2 For each element in the current sample S, compute D(i) = ¯zk − ¯z (−i) S , i ∈ S where ¯z (−i) S is the sample mean excluding zi . 3 Instead of removing one element at random in Step 3, remove i∗ with the smallest value of D(i). Kim (ISU) Sampling techniques for big data analysis May 31, 2017 16 / 53
  • 17. A numerical experiment Figure: Trace of sample means of size n = 100 from a population of size N = 10, 000. The left side is the trace of the sample mean from the classical reservoir sampling and the right side is the trace of the sample mean from the proposed reservoir sampling. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 17 / 53
  • 18. Remark The balanced reservoir sampling may provide an efficient sample for estimating population mean, but not the population distribution. We may wish to provide an efficient representative sample that provides consistent estimates for various population parameters. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 18 / 53
  • 19. Method 2: Stratified reservoir sampling Assume that the finite population consists of H subpopulations, called strata. The within-stratum variance is small. For scalar y with bounded support, we can predetermine the stratum boundaries by creating H intervals with equal length. We want to obtain a reservoir sample that achieves stratified random sampling with proportional allocation. The basic idea is to use Chao (1982)’s sampling to select an equal probability sampling for each with probability proportional to wi = Nh/nh, where Nh and nh are updated continuously. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 19 / 53
  • 20. Stratified reservoir sampling Let Uk = {1, · · · , k} be the finite population up to element k. Let π(k; i) be the first-order inclusion probability that unit i is selected from Uk. We want to have k i=1 π(k; i) = n and π(k; i) ∝ wk,i where wk,i = Nk,h/nk,h for stratum h that unit i belongs to, where Nk,h and nk,h are the population size and the sample size, respectively, corresponding to stratum h. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 20 / 53
  • 21. Stratified reservoir sampling using Chao’s method At time k + 1, we update the reservoir sample as follows: 1 Select unit k + 1 with probability π(k + 1; k + 1). 2 If unit k + 1 is not selected, retain the sample at time k. 3 If unit k + 1 is selected, then remove one unit from the current sample with the removal probability Rkj =    0 if j ∈ Ak Tkj if j ∈ Bk (1 − Tk)/(n − Lk) if j ∈ Ck, where Ak = {i ∈ Sk; π(k; i) = π(k + 1; i) = 1}, Bk = {i ∈ Sk; π(k, i) = 1, π(k + 1; i) < 1}, Ck = {i ∈ Sk; π(k, i) < 1, π(k + 1; i) < 1}, Tk = i∈Bk Tkj , Tkj = {1 − π(k + 1; j)}/π(k + 1; k + 1), and Lk = |Ak ∪ Bk|. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 21 / 53
  • 22. Simulation study Finite population {yi : i ∈ U} is generated by yi ∼ N(10, 10) independently for i ∈ U, where U = {1, . . . , N} with N = 5, 000. Sample size n = 1, 000. Three reservoir methods are considered. Classical reservoir sampling approach (Mcleod and Bellhouse, 1983). Balanced reservoir sampling approach with yi being the control variable. Stratified reservoir sampling approach: 51 strata partitioned by equally spaced knots {k1, . . . , k50}, where k1 and k50 are the 1%-th and 99%-th sample quantile of {yi : i = 1, . . . , n}. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 22 / 53
  • 23. Simulation study (Cont’d) We are interested in estimating the following finite population parameters. Mean mk = k−1 k i=1 yi , where k = n + 1, . . . , N. Probability Pk,j = k−1 k i=1 I(yi < qj ), where j = 1, . . . , 5 and q1, . . . , q5 are the 5%-th, 27.5%-th, 50%-th, 72.5%-th and 95%-th quantiles of N(10, 10). 1,000 simulations are conducted. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 23 / 53
  • 24. Bias of estimating mk 0 1000 3000 −0.15−0.10−0.050.000.050.100.15 Classical k−n Value 0 1000 3000 −0.15−0.10−0.050.000.050.100.15 Balanced k−n Value 0 1000 3000 −0.15−0.10−0.050.000.050.100.15 Stratified k−n Value The black line is the point-wise mean of the simulation, and blue ones are the point-wise 5%-th and 95%-th quantiles. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 24 / 53
  • 25. Bias of estimating Pk,1 0 1000 3000 0.000.020.040.06 Classical k−n Value 0 1000 3000 0.000.020.040.06 Balanced k−n Value 0 1000 3000 0.000.020.040.06 Stratified k−n Value Kim (ISU) Sampling techniques for big data analysis May 31, 2017 25 / 53
  • 26. Bias of estimating Pk,2 0 1000 3000 0.000.050.100.15 Classical k−n Value 0 1000 3000 0.000.050.100.15 Balanced k−n Value 0 1000 3000 0.000.050.100.15 Stratified k−n Value Kim (ISU) Sampling techniques for big data analysis May 31, 2017 26 / 53
  • 27. Bias of estimating Pk,3 0 1000 3000 −0.02−0.010.000.010.02 Classical k−n Value 0 1000 3000 −0.02−0.010.000.010.02 Balanced k−n Value 0 1000 3000 −0.02−0.010.000.010.02 Stratified k−n Value Kim (ISU) Sampling techniques for big data analysis May 31, 2017 27 / 53
  • 28. Bias of estimating Pk,4 0 1000 3000 −0.15−0.10−0.050.00 Classical k−n Value 0 1000 3000 −0.15−0.10−0.050.00 Balanced k−n Value 0 1000 3000 −0.15−0.10−0.050.00 Stratified k−n Value Kim (ISU) Sampling techniques for big data analysis May 31, 2017 28 / 53
  • 29. Bias of estimating Pk,5 0 1000 3000 −0.06−0.04−0.020.00 Classical k−n Value 0 1000 3000 −0.06−0.04−0.020.00 Balanced k−n Value 0 1000 3000 −0.06−0.04−0.020.00 Stratified k−n Value Kim (ISU) Sampling techniques for big data analysis May 31, 2017 29 / 53
  • 30. Summary statistics for k = 5, 000 Table: Summary statistics for k = 5, 000. The unit for the values is 10−3 Mean q1 q2 Bias S.E. Bias S.E. Bias S.E. Classical 0.13 88.36 0.10 5.84 0.26 12.84 Balanced 0.83 1.02 50.19 6.84 165.66 11.77 Stratified -22.47 17.89 -0.91 1.61 3.48 3.06 q3 q4 q5 Bias S.E. Bias S.E. Bias S.E. Classical -0.18 14.05 -0.47 12.80 0.19 6.04 Balanced -2.61 7.07 -164.25 11.81 -50.81 6.67 Stratified 3.56 3.00 1.74 2.59 1.89 1.48 S.E.: standard error of the 1,000 Monte Carlo estimates Kim (ISU) Sampling techniques for big data analysis May 31, 2017 30 / 53
  • 31. Discussion Balanced reservoir sampling provides the most efficient estimation for the population mean, but it provides biased estimates for other parameters. Stratified reservoir sampling provides more efficient estimates than the classical reservoir sampling for various parameters. It can be shown that, under some conditions, the stratified reservoir sampling satisfies V (ˆθ) = O 1 nH , where H is the number of strata. Kim (ISU) Sampling techniques for big data analysis May 31, 2017 31 / 53
• 32. Topic 2: Inverse sampling
Goal: obtain a simple random sample from a current sample drawn under a complex sampling design. Originally proposed by Hinkins et al. (1997) and Rao et al. (2003) in the context of stratified cluster sampling. Inverse sampling can be treated as a special case of two-phase sampling:
1 Phase one: the original sample is obtained under the complex sampling design.
2 Phase two: select a subsample from the phase-one sample so that the final sample has equal first-order inclusion probabilities.
• 33. Inverse sampling for big data analysis
Two cases:
1 Big data = population.
2 Big data ≠ population.
In case one, the sampling problem is easy. In case two, we need to adjust for the bias associated with the big data itself: if the data are not obtained from a probability sample, they may be subject to selection bias. We discuss how to remove the selection bias of the big data using the inverse sampling idea.
• 34. Inverse sampling for big data analysis
Assume that the big data $D$ is a subset of the finite population. Let $\delta_i$ be the indicator with $\delta_i = 1$ if $i \in D$ and $\delta_i = 0$ otherwise. The observations in $D$ can be viewed as a random sample from $f(y \mid \delta = 1)$; however, we are interested in obtaining a sample from $f(y)$. How can we compute $E(Y) = \int y f(y)\, dy$ using the observations from $f(y \mid \delta = 1)$?
• 35. Importance sampling idea
Goal: estimate $E(Y) = \int y f(y)\, dy$ using $y_1, \ldots, y_n$ from $f(y \mid \delta = 1)$. Note that
$$E(Y) = \int y \frac{f(y)}{f(y \mid \delta = 1)} f(y \mid \delta = 1)\, dy \cong \frac{1}{n} \sum_{i=1}^{n} w_i y_i,$$
where $w_i \propto f(y_i)/f(y_i \mid \delta_i = 1)$. The weight $w_i$ is called the importance weight: it is the relative representativeness of $y_i$ with respect to $f(y)$ when $y_i$ is generated from $f(y \mid \delta = 1)$.
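As a small illustration of the identity above, the self-normalized version of the importance-sampling estimator can be computed as follows (a minimal sketch; names are illustrative):

```python
import numpy as np

def importance_mean(y, w):
    """Self-normalized importance-sampling estimate of E(Y),
    where w_i is proportional to f(y_i) / f(y_i | delta_i = 1)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * y) / np.sum(w)   # normalizing makes the scale of w irrelevant
```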
• 36. Inverse sampling for big data analysis
Assume for now that $f(y)$ and $f(y \mid \delta = 1)$ have known form. For each $i \in D$, compute $w_i \propto f(y_i)/f(y_i \mid \delta_i = 1)$. Using the Chao method, or another unequal-probability scheme (see the sketch below), select a sample of size $n$ with selection probability proportional to $w_i$. What if $f(y)$ and $f(y \mid \delta = 1)$ are unknown?
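Chao's (1982) method extends reservoir sampling to unequal selection probabilities. Below is a minimal sketch of the simple case, assuming no single weight is large enough to push an inclusion probability above 1 (the full algorithm handles such items separately); the interface is illustrative.

```python
import random

def chao_sample(weighted_stream, n, seed=None):
    """Simplified Chao (1982) unequal-probability reservoir sampling.
    weighted_stream yields (item, weight) pairs; assumes n * w_k never
    exceeds the running weight total W_k."""
    rng = random.Random(seed)
    reservoir, total_w = [], 0.0
    for k, (item, w) in enumerate(weighted_stream, start=1):
        total_w += w
        if k <= n:
            reservoir.append(item)
        elif rng.random() < n * w / total_w:    # first-order inclusion probability
            reservoir[rng.randrange(n)] = item  # displaced unit chosen uniformly
    return reservoir
```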
• 37. Inverse sampling for big data analysis (Cont'd)
Assume bivariate observations $(x_i, y_i)$ from $f(x, y \mid \delta = 1)$. Suppose that $\mu_x = E(X)$ is available from an external source. Let $P_0 = \{f(x, y) : \int x f(x, y)\, d(x, y) = \mu_x\}$. Using $\mu_x$, we can find the $f_0(x, y) \in P_0$ that minimizes the Kullback-Leibler distance
$$\int f_0(x, y) \ln \frac{f_0(x, y)}{f(x, y \mid \delta = 1)}\, d(x, y). \quad (1)$$
• 38. Inverse sampling for big data analysis (Cont'd)
The solution to (1) is
$$f_0(x, y) = f(x, y \mid \delta = 1)\, \frac{\exp\{\lambda (x - \mu_x)\}}{E[\exp\{\lambda (X - \mu_x)\}]}, \quad (2)$$
where $\lambda$ satisfies $\int x f_0(x, y)\, d(x, y) = \mu_x$. The transformation from $f(x, y \mid \delta = 1)$ to $f_0(x, y)$ is called exponential tilting: it tilts the density $f(x, y \mid \delta = 1)$ so that $f_0(x, y)$ satisfies the calibration constraint. The importance weight in this situation is
$$w_i^* \propto \frac{f_0(x_i, y_i)}{f(x_i, y_i \mid \delta_i = 1)} \propto \exp\{\hat{\lambda} (x_i - \mu_x)\}, \quad (3)$$
where $\hat{\lambda}$ satisfies $\sum_{i=1}^{n} w_i^* x_i = \mu_x$. This is closely related to the exponential tilting (ET) weight of Kim (2010).
• 39. Inverse reservoir sampling using exponential tilting
Reservoir inverse sampling:
1 Phase one: compute $w_i^*$ using the ET method.
2 Phase two: use Chao's reservoir sampling to select an unequal-probability sample of size $n$ with $\pi_i \propto w_i^*$.
Computation of $\hat{\lambda}$ in $w_i^*$ in (3):
1 Newton method: solve $\sum_{i=1}^{n} w_i^*(\lambda) x_i = \mu_x$ iteratively.
2 One-step method (Kim, 2010): the approximation $\hat{\lambda} = (\mu_x - \bar{x}_k)/s_k^2$, where $\bar{x}_k$ and $s_k^2$ are the sample mean and variance of $\{x_i : i = 1, \ldots, k\}$.
A sketch combining the two phases follows.
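A minimal sketch of the two phases, using the one-step approximation and the simplified chao_sample sketch above (all names are illustrative, not from the slides):

```python
import numpy as np

def one_step_et_weights(x, mu_x):
    """One-step exponential tilting weights (the Kim, 2010 approximation):
    lambda_hat = (mu_x - xbar_k) / s_k^2, then w*_i proportional to
    exp{lambda_hat * (x_i - mu_x)}."""
    x = np.asarray(x, dtype=float)
    lam = (mu_x - x.mean()) / x.var(ddof=1)   # one-step lambda_hat
    w = np.exp(lam * (x - mu_x))
    return w / w.sum()

# Phase two: feed (unit, weight) pairs into the unequal-probability sampler,
# e.g. chao_sample(zip(units, one_step_et_weights(x, mu_x)), n).
```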
• 40. Simulation study
The finite population $\{(x_i, y_i) : i \in U\}$ is generated by $x_i \sim N(0, 1)$ and $y_i \mid x_i \sim N(x_i, 1)$ for $i \in U$, where $U = \{1, \ldots, N\}$ and $N = 100{,}000$. The indicators $\{\delta_i : i \in U\}$ are obtained from $\text{logit}\{\Pr(\delta_i = 1 \mid x_i)\} = 0.25\, x_i$, where $\text{logit}(x) = \log\{x/(1 - x)\}$. The data stream is the first 10,000 elements with $\delta_i = 1$. Sample size $n = 500$. We are interested in estimating $E(Y)$ by the proposed reservoir inverse sampling methods.
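A sketch of this data-generating process in Python (the seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2017)
N = 100_000
x = rng.normal(0.0, 1.0, N)               # x_i ~ N(0, 1)
y = rng.normal(x, 1.0)                    # y_i | x_i ~ N(x_i, 1)
p = 1.0 / (1.0 + np.exp(-0.25 * x))       # logit Pr(delta_i = 1 | x_i) = 0.25 x_i
delta = rng.random(N) < p
stream = np.flatnonzero(delta)[:10_000]   # first 10,000 indices with delta_i = 1
```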
• 41. Bias of estimating $E(Y)$
[Figure: panels R_whole, R_one_step, and Trd., plotting Value against $k - 4n$.]
• 42. Summary statistics for $k = 10{,}000$
Table: Summary statistics for $k = 10{,}000$ (values in units of $10^{-2}$).

                             Bias    S.E.
Reservoir inverse sampling   0.07    6.52
Reservoir one-step           0.20    6.41
Traditional reservoir       12.33*   5.94

S.E.: standard error of the 1,000 Monte Carlo estimates; *: significant bias.
• 43. Topic 3: Survey Integration
Combine information from two independent surveys, A and B, of the same target population:
1 Non-nested two-phase sampling: observe $x$ from survey A and $(x, y)$ from survey B (Hidiroglou, 2001; Merkouris, 2004; Kim and Rao, 2012).
2 Two surveys with measurement errors: observe $(x, y_1)$ from survey A and $(x, y_2)$ from survey B (Kim, Berg, and Park, 2016).
• 44. Survey integration
Combining big data with survey data:
1 A: survey data (representative sample, expensive); observe $(x, z)$.
2 B: big data (self-selected sample, inexpensive); observe $(x, y)$.
Rivers (2007) idea: use $x$ to implement nearest-neighbor imputation of $y$ for sample A; that is, create synthetic (imputed) values of $y$ for sample A.
• 45. Data structure
Table: Data structure

Data   Representativeness   X   Z   Y
A      Yes                  o   o
B      No                   o       o

Table: Rivers idea (imputed $y$ added to sample A)

Data   Representativeness   X   Z   Y
A      Yes                  o   o   o
B      No                   o       o
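A minimal sketch of Rivers' nearest-neighbor imputation, assuming a scalar matching variable $x$ (with vector $x$ one would use, e.g., Euclidean distance or a KD-tree); the names are illustrative:

```python
import numpy as np

def rivers_nn_impute(x_A, x_B, y_B):
    """For each unit in survey A, borrow y from the big-data unit in B
    whose x value is closest (nearest-neighbor donor)."""
    x_A = np.asarray(x_A, dtype=float)
    x_B = np.asarray(x_B, dtype=float)
    donors = np.abs(x_A[:, None] - x_B[None, :]).argmin(axis=1)  # closest B unit per A unit
    return np.asarray(y_B)[donors]
```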
• 46. Remark
Data B is subject to selection bias. The Rivers method is justified if $f(y_i \mid x_i, i \in B) = f(y_i \mid x_i, i \in A)$. In some literature, this condition is called transportability. Transportability holds if the selection mechanism for the big data is non-informative (Pfeffermann, 2011) or ignorable. In that case, synthetic data imputation provides unbiased estimation of the population parameters.
• 47. New approach for combining big data with survey data
1 In survey sample A, observe $\delta_i = 1$ if $i \in B$ and $\delta_i = 0$ otherwise, by asking a question about membership in the big data.
2 Since we observe $(x_i, z_i, \delta_i)$ for all $i \in A$, we can fit a model for $\pi_i = P(\delta_i = 1 \mid x_i, z_i)$ to obtain $\hat{\pi}_i$.
3 Use $w_i = 1/\hat{\pi}_i$ as the weight for analyzing big data B.
A minimal sketch of steps 2-3 follows.
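A minimal sketch, assuming scikit-learn for the logistic fit; in practice the survey design weights of sample A would enter the estimating equations, which this sketch ignores. All names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def big_data_weights(xz_A, delta_A, xz_B):
    """Fit pi(x, z) = P(delta = 1 | x, z) on survey sample A, which
    records big-data membership delta, then return w = 1 / pi_hat
    for the big-data units in B."""
    model = LogisticRegression().fit(xz_A, delta_A)  # xz_A: (n_A, p) covariates
    pi_hat = model.predict_proba(xz_B)[:, 1]         # estimated membership probability
    return 1.0 / pi_hat                              # inverse-propensity weights
```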
• 48. Remark
The self-selection bias is essentially the same as the nonresponse bias in the missing-data literature. Probability sample A is used to estimate the response probability $\pi_i$. The Rivers method is based on the assumption that $P(\delta = 1 \mid x, y, z) = P(\delta = 1 \mid x)$. The proposed method is based on the weaker assumption $P(\delta = 1 \mid x, y, z) = P(\delta = 1 \mid x, z)$.
• 49. Remark
Once the propensity-score weights are computed, they can be used for reservoir inverse sampling. Variance estimation remains to be developed.
• 50. Small area estimation
Small area estimation can be viewed as a special case of survey integration. Big data can be used in small area estimation (Rao and Molina, 2015); for example, big data sources can serve as covariates in area-level models (Marchetti et al., 2015). For unit-level models, statistical issues such as self-selection bias and measurement error may arise.
• 51. Conclusion
Three statistical tools for big data analysis:
1 Reservoir sampling
2 Inverse sampling
3 Survey integration
Problems with big data: measurement error and self-selection bias. Survey data can be used to reduce these problems. This is a promising area of research.
• 52. References
Chao, M. T. (1982). A general purpose unequal probability sampling plan. Biometrika 69, 653–656.
Hidiroglou, M. A. (2001). Double sampling. Surv. Methodol. 27, 143–154.
Hinkins, S., Oh, H. L. & Scheuren, F. (1997). Inverse sampling design algorithm. Surv. Methodol. 23, 11–21.
Kim, J. K. (2010). Calibration estimation using exponential tilting in sample surveys. Surv. Methodol. 36, 145–155.
Kim, J. K., Berg, E. & Park, T. (2016). Statistical matching using fractional imputation. Surv. Methodol. 42, 19–40.
Kim, J. K. & Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika 99, 85–100.
Marchetti, S., Giusti, C., Pratesi, M., Salvati, N., Giannotti, F., Pedreschi, D., Rinzivillo, S., Pappalardo, L. & Gabrielli, L. (2015). Small area model-based estimators using big data sources. J. Off. Stat. 31, 263–281.
• 53. References (Cont'd)
McLeod, A. I. & Bellhouse, D. R. (1983). A convenient algorithm for drawing a simple random sample. Appl. Statist. 32, 182–184.
Merkouris, T. (2004). Combining independent regression estimators from multiple surveys. J. Am. Statist. Assoc. 99, 1131–1139.
Pfeffermann, D. (2011). Modelling of complex survey data: why model? Why is it a problem? How can we approach it? Surv. Methodol. 37, 115–136.
Rao, J. N. K. & Molina, I. (2015). Small Area Estimation. John Wiley & Sons, New Jersey.
Rao, J. N. K., Scott, A. J. & Benhin, E. (2003). Undoing complex survey data structures: some theory and applications of inverse sampling. Surv. Methodol. 29, 107–128.
Rivers, D. (2007). Sampling for web surveys. In Joint Statistical Meetings, Proceedings of the Section on Survey Research Methods.