1. Some sampling techniques for big data analysis
Jae Kwang Kim
Iowa State University & KAIST
May 31, 2017
(Joint work with Zhonglei Wang)
2. Example
Let’s look at an artificial finite population.
ID   Size of farm   Yield (y)
 1        4             1
 2        6             3
 3        6             5
 4       20            15
Parameter of interest: Mean yield of the farms in the population
Assume that farm sizes are known.
Instead of observing N = 4 farms, we want to select a sample of size
n = 2.
Kim (ISU) Sampling techniques for big data analysis May 31, 2017 2 / 53
3. Example - Continued
There are 6 possible samples:

Case   Sample IDs   Sample mean   Sampling error
1      1, 2          2            −4
2      1, 3          3            −3
3      1, 4          8             2
4      2, 3          4            −2
5      2, 4          9             3
6      3, 4         10             4
Each sample has a sampling error.
There are two ways to select one of the six possible samples:
Nonprobability sampling: select a sample subjectively (e.g., using the size of the farms).
Probability sampling: select a sample by a probability rule.
4. Example - Continued : Probability sampling 1
Simple Random Sampling : Assign the same selection probability to all
possible samples
Case   Sample IDs   Sample mean (¯y)   Sampling error   Selection probability
1      1, 2          2                 −4               1/6
2      1, 3          3                 −3               1/6
3      1, 4          8                  2               1/6
4      2, 3          4                 −2               1/6
5      2, 4          9                  3               1/6
6      3, 4         10                  4               1/6
In this case, the sample mean ¯y has a discrete probability distribution:
¯y takes the values 2, 3, 4, 8, 9, 10, each with probability 1/6.
5. Expected value of sampling error
E(¯y − ¯Y) = (−4 − 3 + 2 − 2 + 3 + 4)/6 = 0
Thus, the estimator is unbiased.
The variance of the sampling error is (16 + 9 + 4 + 4 + 9 + 16)/6 = 9.67.
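The unbiasedness and the variance above are easy to verify by brute force. A minimal sketch in plain Python, using the y-values from the example:

```python
from itertools import combinations

# Artificial population from the example: farm ID -> yield (y)
y = {1: 1, 2: 3, 3: 5, 4: 15}
pop_mean = sum(y.values()) / len(y)  # true mean yield = 6

# All 6 equal-probability samples of size n = 2 and their sampling errors
errors = [sum(y[i] for i in ids) / 2 - pop_mean
          for ids in combinations(y, 2)]

bias = sum(errors) / len(errors)
variance = sum(e ** 2 for e in errors) / len(errors)
print(bias, round(variance, 2))  # 0.0 9.67
```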
6. Probability Sampling
Definition : For each element in the population, the probability that
the element is included in the sample is known and greater than 0.
Advantages
1 Excludes subjectivity in sample selection.
2 Removes sampling bias (or selection bias).
What is sampling bias? (θ: true value, ˆθ: estimator of θ)
(sampling) error of ˆθ = ˆθ − θ = (ˆθ − E ˆθ) + (E ˆθ − θ) = variation + bias
In nonprobability sampling, the variation is 0 but there is a bias. In
probability sampling, there is variation but the bias is 0.
7. Probability Sampling
Main theory
1 Law of Large Numbers : ˆθ converges to E ˆθ for sufficiently large
sample size.
2 Central Limit Theorem : ˆθ follows a normal distribution for sufficiently
large sample size.
Additional advantages of probability sampling with large sample :
1 Improve the precision of an estimator
2 Can compute confidence intervals or test statistical hypotheses.
With the same sample size, we may have different precision.
8. Survey Sampling
A classical area of statistics concerning ....
1 Drawing a probability sample from a target population
2 Using the sample to make inference about the population
Why survey sampling?
1 Cost consideration: Data collection often involves cost. Smaller sample
means less money.
2 Representativeness: Probability sampling is the only scientifically
justified approach to data collection for population studies.
3 Computational efficiency: Sample takes less memory and storage, and
less computing time.
9. Survey Sampling in the era of Big Data
Challenges
1 Decreasing response rates: strict probability sampling is not possible
and representativeness may be weakened.
2 Cost model changes: data collection cost is not proportional to size.
3 Many competitors: survey sampling is no longer the only way of
collecting data. Abundant data sources.
10. Paradigm Shift in Survey Sampling
Missing data framework:
From sample to population ⇒ from observed to unobserved
Combine survey data with other existing data:
Survey Data + Auxiliary Data ⇒ Synthetic Data
Sampling techniques are still useful in handling big data (e.g.,
reservoir sampling, balanced sampling, inverse sampling, calibration
weighting, statistical inference under informative sampling, etc.)
11. In this talk, we will introduce ...
1 Reservoir sampling and its variants
2 Inverse sampling
3 Synthetic data imputation (or survey integration)
12. Topic 1: Reservoir Sampling
McLeod and Bellhouse (1983)'s idea:
1 Let S = {1, · · · , n} be the initial reservoir sample.
2 For each k = n + 1, n + 2, . . ., select element k with probability n/k.
3 If k is selected, replace one element chosen at random from S with k;
otherwise, keep S and return to Step 2 for the next k.
Step 1 is the initialization step, Step 2 is the inclusion step, and Step 3 is
the removal step.
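The three steps can be sketched in a few lines (a hedged sketch; the function name `reservoir_sample` and the `rng` parameter are of our choosing):

```python
import random

def reservoir_sample(stream, n, rng=random):
    """One-pass SRS of size n from a stream of unknown length."""
    S = []
    for k, item in enumerate(stream, start=1):
        if k <= n:
            S.append(item)                  # Step 1: fill the initial reservoir
        elif rng.random() < n / k:          # Step 2: include element k w.p. n/k
            S[rng.randrange(n)] = item      # Step 3: evict one element at random
    return S

print(len(reservoir_sample(range(1_000_000), 100)))  # 100
```

Each pass keeps the reservoir an SRS of the elements seen so far, which is exactly the property justified on the next slide.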
13. Justification
Let Sk be the set of sample elements selected from Uk = {1, · · · , k}.
If m ≤ n,
P(m ∈ Sk) = 1 × n/(n + 1) × (n + 1)/(n + 2) × · · · × (k − 1)/k = n/k.
If m > n,
P(m ∈ Sk) = n/m × m/(m + 1) × · · · × (k − 1)/k = n/k.
14. Remark
Selects an SRS (simple random sample) of size n from a stream of
arbitrary size in one pass
No need to know the population size in advance
Very useful for big data sampling when the population is updated
continuously.
15. Improving reservoir sampling
Goal: We wish to reduce the variance of ˆθ, which is computed from the
reservoir sample.
16. Method 1: Balanced reservoir sampling
We wish to impose a balance condition on the reservoir sample:
(1/n) Σ_{i∈S} zi = (1/N) Σ_{i∈U} zi.
Variable z can be called control variable.
The variable z is called a control variable.
To achieve the balance condition, we can modify Step 3 (the removal
step) as follows:
1 Let ¯zk be the mean of z over the population so far, Uk = {1, · · · , k}.
2 For each element in the current sample S, compute
D(i) = ¯zk − ¯zS^(−i), i ∈ S,
where ¯zS^(−i) is the sample mean of z excluding zi.
3 Instead of removing one element at random in Step 3, remove the
element i∗ with the smallest value of D(i).
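A sketch of the modified removal step. We read "smallest value of D(i)" as the D(i) closest to zero, i.e. the removal that best preserves balance; that interpretation, and the function name, are ours:

```python
def balanced_removal_index(z_sample, zbar_k):
    """Pick i* in the reservoir whose removal keeps the sample best balanced.

    D(i) = zbar_k - (sample mean of z excluding z_i); we remove the element
    with D(i) closest to zero (our reading of the rule on the slide).
    """
    n = len(z_sample)
    total = sum(z_sample)
    best_i, best_abs_d = 0, float("inf")
    for i, z_i in enumerate(z_sample):
        d = zbar_k - (total - z_i) / (n - 1)  # leave-one-out discrepancy
        if abs(d) < best_abs_d:
            best_i, best_abs_d = i, abs(d)
    return best_i

# Removing the outlier 9.0 leaves mean 1.5, closest to zbar_k = 1.6
print(balanced_removal_index([1.0, 2.0, 9.0], zbar_k=1.6))  # 2
```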
17. A numerical experiment
Figure: Trace of sample means of size n = 100 from a population of size
N = 10, 000. The left side is the trace of the sample mean from the classical
reservoir sampling and the right side is the trace of the sample mean from the
proposed reservoir sampling.
18. Remark
The balanced reservoir sampling may provide an efficient sample for
estimating the population mean, but not for the population distribution.
We may wish to provide an efficient representative sample that
provides consistent estimates for various population parameters.
19. Method 2: Stratified reservoir sampling
Assume that the finite population consists of H subpopulations, called
strata. The within-stratum variance is small.
For scalar y with bounded support, we can predetermine the stratum
boundaries by creating H intervals with equal length.
We want to obtain a reservoir sample that achieves stratified random
sampling with proportional allocation.
The basic idea is to use Chao (1982)'s sampling to select units with
inclusion probability proportional to wi = Nh/nh (so that sampling is
equal-probability within each stratum), where Nh and nh are updated
continuously.
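For bounded scalar y, the predetermined equal-length stratum boundaries and the running counts Nh can be maintained as follows (a sketch; the function names and the toy stream are ours):

```python
import bisect

def make_boundaries(lo, hi, H):
    """H - 1 interior cut points giving H equal-length strata on [lo, hi]."""
    step = (hi - lo) / H
    return [lo + step * h for h in range(1, H)]

def stratum_of(y, boundaries):
    """Index (0 .. H-1) of the stratum that y falls into."""
    return bisect.bisect_right(boundaries, y)

# Running population counts N_h; together with the current sample counts n_h
# they give the weight w_i = N_h / n_h used in Chao's unequal-probability step.
boundaries = make_boundaries(0.0, 20.0, 4)   # strata [0,5), [5,10), [10,15), [15,20]
N_h = [0] * 4
for y in [1, 3, 5, 15]:                      # toy data stream
    N_h[stratum_of(y, boundaries)] += 1
print(N_h)  # [2, 1, 0, 1]
```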
20. Stratified reservoir sampling
Let Uk = {1, · · · , k} be the finite population up to element k.
Let π(k; i) be the first-order inclusion probability that unit i is
selected from Uk.
We want to have
Σ_{i=1}^k π(k; i) = n
and
π(k; i) ∝ w_{k,i},
where w_{k,i} = N_{k,h}/n_{k,h} for the stratum h that unit i belongs to,
and N_{k,h} and n_{k,h} are the population size and the sample size,
respectively, of stratum h at time k.
21. Stratified reservoir sampling using Chao’s method
At time k + 1, we update the reservoir sample as follows:
1 Select unit k + 1 with probability π(k + 1; k + 1).
2 If unit k + 1 is not selected, retain the sample at time k.
3 If unit k + 1 is selected, then remove one unit from the current
sample with the removal probability
Rkj =
  0                    if j ∈ Ak
  Tkj                  if j ∈ Bk
  (1 − Tk)/(n − Lk)    if j ∈ Ck,
where Ak = {i ∈ Sk : π(k; i) = π(k + 1; i) = 1},
Bk = {i ∈ Sk : π(k; i) = 1, π(k + 1; i) < 1},
Ck = {i ∈ Sk : π(k; i) < 1, π(k + 1; i) < 1},
Tk = Σ_{j∈Bk} Tkj, Tkj = {1 − π(k + 1; j)}/π(k + 1; k + 1), and Lk = |Ak ∪ Bk|.
22. Simulation study
Finite population {yi : i ∈ U} is generated by
yi ∼ N(10, 10)
independently for i ∈ U, where U = {1, . . . , N} with N = 5, 000.
Sample size n = 1, 000.
Three reservoir methods are considered.
Classical reservoir sampling approach (McLeod and Bellhouse, 1983).
Balanced reservoir sampling approach with yi being the control variable.
Stratified reservoir sampling approach: 51 strata partitioned by equally
spaced knots {k1, . . . , k50}, where k1 and k50 are the 1% and 99%
sample quantiles of {yi : i = 1, . . . , n}.
23. Simulation study (Cont’d)
We are interested in estimating the following finite population
parameters.
Mean: mk = k^{−1} Σ_{i=1}^k yi, for k = n + 1, . . . , N.
Probability: Pk,j = k^{−1} Σ_{i=1}^k I(yi < qj), for j = 1, . . . , 5, where
q1, . . . , q5 are the 5%, 27.5%, 50%, 72.5% and 95% quantiles of N(10, 10).
1,000 simulations are conducted.
24. Bias of estimating mk
[Figure: three panels (Classical, Balanced, Stratified) plotting the bias against k − n from 0 to 4,000. The black line is the point-wise mean of the simulations, and the blue lines are the point-wise 5% and 95% quantiles.]
25. Bias of estimating Pk,1
[Figure: bias of estimating Pk,1; panels Classical, Balanced, Stratified, each plotting value against k − n.]
26. Bias of estimating Pk,2
[Figure: bias of estimating Pk,2; panels Classical, Balanced, Stratified, each plotting value against k − n.]
27. Bias of estimating Pk,3
[Figure: bias of estimating Pk,3; panels Classical, Balanced, Stratified, each plotting value against k − n.]
28. Bias of estimating Pk,4
[Figure: bias of estimating Pk,4; panels Classical, Balanced, Stratified, each plotting value against k − n.]
29. Bias of estimating Pk,5
[Figure: bias of estimating Pk,5; panels Classical, Balanced, Stratified, each plotting value against k − n.]
30. Summary statistics for k = 5, 000
Table: Summary statistics for k = 5,000. The unit for the values is 10^−3.

               Mean             q1               q2
            Bias    S.E.     Bias   S.E.     Bias    S.E.
Classical    0.13   88.36    0.10   5.84     0.26   12.84
Balanced     0.83    1.02   50.19   6.84   165.66   11.77
Stratified  −22.47  17.89   −0.91   1.61     3.48    3.06

               q3               q4               q5
            Bias    S.E.     Bias    S.E.    Bias    S.E.
Classical   −0.18   14.05   −0.47   12.80    0.19    6.04
Balanced    −2.61    7.07  −164.25  11.81  −50.81    6.67
Stratified   3.56    3.00     1.74   2.59    1.89    1.48

S.E.: standard error of the 1,000 Monte Carlo estimates
31. Discussion
Balanced reservoir sampling provides the most efficient estimation for
the population mean, but it provides biased estimates for other
parameters.
Stratified reservoir sampling provides more efficient estimates than
the classical reservoir sampling for various parameters.
It can be shown that, under some conditions, the stratified reservoir
sampling satisfies
V(ˆθ) = O(1/(nH)),
where H is the number of strata.
32. Topic 2: Inverse sampling
Goal: Obtain a simple random sample from the current sample
obtained from a complex sampling design.
Originally proposed by Hinkins et al. (1997) and Rao et al. (2003) in
the context of stratified cluster sampling.
Note that the inverse sampling can be treated as a special case of
two-phase sampling:
1 Phase One: Original sample obtained from complex sampling design.
2 Phase Two: select a subsample of the Phase One sample so that the final
sample has equal first-order inclusion probabilities.
33. Inverse sampling for big data analysis
Two cases
1 Big data = population
2 Big data ≠ population
In case one, the sampling problem is easy.
In case two, we need to adjust for the bias associated with the big data
itself.
If the data is not obtained from a probability sample, it may be
subject to selection bias.
We will discuss how to remove the selection bias of the big data using
the inverse sampling idea.
34. Inverse sampling for big data analysis
Assume that the big data (D) is a subset of the finite population.
Let δ be the indicator function
δi = 1 if i ∈ D, and δi = 0 otherwise.
The observations in D can be viewed as a random sample from
f (y | δ = 1).
However, we are interested in obtaining a sample from f (y).
How can we compute E(Y) = ∫ y f (y) dy using the observations from
f (y | δ = 1)?
35. Importance sampling idea
Goal: We are interested in estimating E(Y) = ∫ y f (y) dy using
y1, · · · , yn from f (y | δ = 1).
Note that
E(Y) = ∫ y [f (y)/f (y | δ = 1)] f (y | δ = 1) dy ≅ (1/n) Σ_{i=1}^n wi yi,
where
wi ∝ f (yi)/f (yi | δi = 1).
The weight wi is called the importance weight. It measures the relative
representativeness of yi with respect to f (y) when yi is generated
from f (y | δ = 1).
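The identity can be checked numerically when both densities are known. A sketch with normals chosen purely for illustration (target f = N(0, 1), biased-inclusion density f(y | δ = 1) = N(1, 1)):

```python
import math
import random

def normal_pdf(y, mu, sigma=1.0):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

rng = random.Random(2017)
ys = [rng.gauss(1.0, 1.0) for _ in range(50_000)]   # draws from f(y | delta = 1)

# Importance weights w_i proportional to f(y_i) / f(y_i | delta = 1);
# self-normalize since the proportionality constant is unknown in general.
w = [normal_pdf(y, 0.0) / normal_pdf(y, 1.0) for y in ys]
est = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

print(round(est, 2))   # close to E(Y) = 0 under the target density f
```

The unweighted mean of the same draws sits near 1, not 0, which is exactly the selection bias the weights correct.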
36. Inverse sampling for big data analysis
Assume for now that f (y) and f (y | δ = 1) have known forms.
For each i ∈ D, compute
wi ∝ f (yi)/f (yi | δi = 1).
Using Chao's method (or another unequal-probability scheme), select a
sample of size n with selection probability proportional to wi.
What if f (y) and f (y | δ = 1) are unknown?
37. Inverse sampling for big data analysis (Cont’d)
Assume bivariate observations (xi , yi) from f (x, y | δ = 1).
Suppose that µx = E(X) is available from an external source.
Let
P0 = {f (x, y) : ∫ x f (x, y) d(x, y) = µx}.
Using µx, we can find the f0(x, y) ∈ P0 that minimizes the
Kullback-Leibler distance
∫ f0(x, y) ln [f0(x, y)/f (x, y | δ = 1)] d(x, y).   (1)
38. Inverse sampling for big data analysis (Cont’d)
The solution to (1) is
f0(x, y) = f (x, y | δ = 1) exp{λ(x − µx)} / E[exp{λ(X − µx)}],   (2)
where λ satisfies ∫ x f0(x, y) d(x, y) = µx.
The transformation from f (x, y | δ = 1) to f0(x, y) is called
exponential tilting. It tilts the density f (x, y | δ = 1) so that f0(x, y)
satisfies the calibration constraint.
The importance weight in this situation is
w∗i ∝ f0(xi , yi)/f (xi , yi | δi = 1) ∝ exp{ˆλ(xi − µx)},   (3)
where ˆλ satisfies Σ_{i=1}^n w∗i xi = µx.
Closely related to the exponential tilting (ET) weight (Kim, 2010).
39. Inverse reservoir sampling using exponential tilting
Reservoir inverse sampling:
1 Phase one: compute w∗i using the ET method.
2 Phase two: use Chao's reservoir sampling to select an unequal
probability sample of size n with πi ∝ w∗i.
Computation of ˆλ in w∗i in (3):
1 Newton method: solve Σ_{i=1}^n w∗i(λ) xi = µx iteratively.
2 One-step method (Kim, 2010): approximate
ˆλ = (µx − ¯xk)/s²k,
where ¯xk and s²k are the sample mean and variance of
{xi : i = 1, . . . , k}.
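A sketch of the one-step ET weights (the function name and toy numbers are ours; note that the one-step ˆλ only approximately satisfies the calibration constraint, which Newton iterations would enforce exactly):

```python
import math

def et_weights_one_step(x, mu_x):
    """Normalized exponential tilting weights with the one-step lambda."""
    k = len(x)
    xbar = sum(x) / k
    s2 = sum((xi - xbar) ** 2 for xi in x) / (k - 1)   # sample variance
    lam = (mu_x - xbar) / s2                            # one-step approximation
    w = [math.exp(lam * (xi - mu_x)) for xi in x]
    total = sum(w)
    return [wi / total for wi in w]

x = [0.5, 1.0, 1.5, 2.0, 3.0]            # unweighted mean 1.6
w = et_weights_one_step(x, mu_x=1.0)      # external population mean of x
wmean = sum(wi * xi for wi, xi in zip(w, x))
print(round(wmean, 2))  # pulled toward mu_x = 1.0
```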
40. Simulation study
Finite population {(xi , yi ) : i ∈ U} is generated by
xi ∼ N(0, 1)
yi | xi ∼ N(xi , 1)
for i ∈ U, where U = {1, . . . , N} and N = 100, 000.
Indicator {δi : i ∈ U} is obtained by
logit{Pr(δi = 1 | xi )} = 0.25xi ,
where logit(x) = log{x/(1 − x)}.
The data stream is the first 10,000 elements with δi = 1.
Sample size n = 500.
We are interested in estimating E(Y ) by the proposed reservoir
inverse sampling methods.
41. Bias of estimating E(Y )
[Figure: bias of estimating E(Y); panels R_whole, R_one_step, and Trd., each plotting value against k − 4n.]
42. Summary statistics for k = 10, 000
Table: Summary statistics for k = 10,000. The unit for the values is 10^−2.

Method                        Bias     S.E.
Reservoir inverse sampling    0.07     6.52
Reservoir one step            0.20     6.41
Traditional reservoir        12.33*    5.94

S.E.: standard error of the 1,000 Monte Carlo estimates
*: significant bias
43. Topic 3: Survey Integration
Combine information from two independent surveys, A and B, from
the same target population
1 Non-nested two-phase sample: Observe x from survey A and observe
(x, y) from survey B (Hidiroglou, 2001; Merkouris, 2004; Kim and
Rao, 2012).
2 Two surveys with measurement errors: Observe (x, y1) from survey A
and observe (x, y2) from survey B (Kim, Berg, and Park, 2016).
44. Survey integration
Combining big data with survey data
1 A: survey data (representative sample, expensive), observe (x, z)
2 B: big data (self-selected sample, inexpensive), observe (x, y).
Rivers (2007) idea : Use x to implement the nearest neighbor
imputation for sample A. That is, create synthetic data imputation
for sample A.
45. Table: Data Structure
Data   Representativeness   X   Z   Y
A      Yes                  o   o
B      No                   o       o

Table: Rivers idea
Data   Representativeness   X   Z   Y
A      Yes                  o   o   o
B      No                   o       o
46. Remark
Data B is subject to selection bias.
Rivers method is justified if
f (yi | xi , i ∈ B) = f (yi | xi , i ∈ A)
In some literature, the above condition is called transportability.
The transportability can be achieved if the selection mechanism for
big data is non-informative (Pfeffermann 2011) or ignorable.
In this case, the synthetic data imputation can provide unbiased
estimation of the population parameters.
47. New approach for combining big data with survey data
1 In survey sample A, observe
δi = 1 if i ∈ B, and δi = 0 otherwise,
by asking a question about membership in the big data.
2 Since we observe (xi , zi , δi) for all i ∈ A, we can fit a model for
πi = P(δi = 1 | xi , zi) to obtain ˆπi.
3 Use wi = 1/ˆπi as the weight for analyzing big data B.
48. Remark
Note that the self-selection bias is essentially the same as the
nonresponse bias in the missing data literature.
Probability sample A is used to estimate the response probability πi .
Rivers method is based on the assumption that
P(δ = 1 | x, y, z) = P(δ = 1 | x).
The proposed method is based on a weaker assumption for the response
mechanism:
P(δ = 1 | x, y, z) = P(δ = 1 | x, z).
49. Remark
Once the propensity score weights are computed, we can use them for
reservoir inverse sampling.
Variance estimation needs to be developed.
50. Small area estimation
Small area estimation can be viewed as a special case of survey
integration.
Big data can be used in small area estimation (Rao and Molina,
2015).
We can use big data as covariates in area-level model for small area
estimation (Marchetti et al, 2015).
For unit level model, statistical issues such as self-selection bias and
measurement errors may occur.
51. Conclusion
Three statistical tools for big data analysis
1 Reservoir sampling
2 Inverse sampling
3 Survey integration
Problems with big data: measurement error, self-selection bias
Survey data can be used to reduce the problems in big data.
Promising area of research
52. References
Chao, M. T. (1982). A General Purpose Unequal Probability Sampling Plan.
Biometrika 69, 653–656.
Hidiroglou, M. A. (2001). Double Sampling. Surv. Methodol. 27, 143–154.
Hinkins, S., Oh, H. L. & Scheuren, F. (1997). Inverse Sampling Design
Algorithm. Surv. Methodol. 23, 11–21.
Kim, J. K. (2010). Calibration estimation using exponential tilting in sample
surveys. Surv. Methodol. 36, 145–155.
Kim, J. K., Berg, E. & Park, T. (2016). Statistical matching using fractional
imputation. Surv. Methodol. 42, 19–40.
Kim, J. K. & Rao, J. N. K. (2012). Combining data from two independent
surveys: a model-assisted approach. Biometrika 99, 85–100.
Marchetti, S., Giusti, C., Pratesi, M., Salvati, N., Giannotti, F.,
Pedreschi, D., Rinzivillo, S., Pappalardo, L., & Gabrielli, L. (2015).
Small area model-based estimators using big data sources. J. Off. Stat. 31,
263–281.
53. References (Cont’d)
Merkouris, T. (2004). Combining independent regression estimators from
multiple surveys, J. Am. Statist. Assoc. 99, 1131–1139.
McLeod, A. I. & Bellhouse, D. R. (1983). A Convenient Algorithm for
Drawing a Simple Random Sample. Appl. Statist. 32, 182–184.
Pfeffermann, D. (2011). Modelling of complex survey data: Why model? Why
is it a problem? How can we approach it? Surv. Methodol. 37, 115–136.
Rao, J. N. K. & Molina, I. (2015). Small area estimation. John Wiley & Sons,
New Jersey.
Rao, J. N. K., Scott, A. J. & Benhin, E. (2003). Undoing Complex Survey
Data Structures: Some Theory and Applications of Inverse Sampling. Surv.
Methodol. 29, 107–128.
Rivers, D. (2007). Sampling for web surveys. Joint Statistical Meetings,
Proceedings of the Section on Survey Research Methods.