3. Motivation
Case study: rlme package
Rank-based regression and estimation of two- and three-level
nested effects models.
Goals: faster, less memory, more data
Before: 5,000 rows of data
After: 50,000 rows of data
17. Rewrite
Microbenchmarking
Know what R is good at
Avoid loops in favor of vectorization
Preallocation
Arguments are by value, not by reference
Embrace C++
Be careful!
18. Vectorizing
## Bad
vec = 1:100
for (i in 1:length(vec)) {
  vec[i] = vec[i]^2
}
## Better
sapply(vec, function(x) x^2)
## Best
vec^2
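The three variants above can be compared directly with the microbenchmark package (this sketch assumes microbenchmark is installed; exact timings will vary by machine, but the vectorized form is typically fastest by a wide margin):

```r
library(microbenchmark)

vec = 1:100
microbenchmark(
  loop = { out = vec; for (i in seq_along(out)) out[i] = out[i]^2 },
  sapply = sapply(vec, function(x) x^2),
  vectorized = vec^2
)
```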
19. Preallocation
## Bad
vec = c()
for (i in 1:100) {
  vec = c(vec, i)
}
## Better
vec = numeric(100)
for (i in 1:100) {
  vec[i] = i
}
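The cost difference shows up quickly at larger sizes; a minimal timing sketch using base R's system.time (timings vary by machine, but growing forces a copy on every iteration while preallocation writes in place):

```r
n = 1e5

grow = function(n) {
  v = c()
  for (i in 1:n) v = c(v, i)  # each c() call copies the whole vector
  v
}

prealloc = function(n) {
  v = numeric(n)
  for (i in 1:n) v[i] = i  # writes into already-allocated slots
  v
}

system.time(grow(n))
system.time(prealloc(n))
```

Both produce the same vector; only the allocation pattern differs.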
34. Wilcoxon Tau Estimator
Original:
dresd <- sort(abs(temp[, 1] - temp[, 2]))
dresd = dresd[(p + 1):choose(n, 2)]
What’s wrong? Bad algorithm (the sort is at least O(n log n)),
and the variable gets copied multiple times
Updated with C++
dresd = remove.k.smallest(dresd)
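remove.k.smallest is the package's C++ helper; its effect can be sketched in plain R with a partial sort, which only guarantees the k smallest values occupy the first k positions instead of fully sorting (the function name and shape here are illustrative, not the package's actual code):

```r
# Illustrative R equivalent of dropping the k smallest values: sort with
# partial = k places the k smallest elements first without sorting the rest,
# avoiding a full O(n log n) sort.
remove_k_smallest = function(x, k) {
  sort(x, partial = k)[(k + 1):length(x)]
}

remove_k_smallest(c(5, 1, 4, 2, 3), 2)  # keeps {3, 4, 5}, order unspecified
```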
38. Wilcoxon Tau
But what about really huge inputs?
Bootstrapping: with over 5,000 rows, repeat the estimate on
1,000 sampled points 100 times
Not about speed, but about memory
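The subsampling scheme above can be sketched as follows; the function name, threshold defaults, and the use of median as a stand-in estimator are illustrative, not the package's actual tau code:

```r
# Hypothetical sketch: above a size threshold, estimate on repeated random
# subsamples and average the results, bounding peak memory use.
subsample_estimate = function(x, estimator, threshold = 5000,
                              size = 1000, reps = 100) {
  if (length(x) <= threshold) return(estimator(x))
  mean(replicate(reps, estimator(sample(x, size))))
}

set.seed(1)
subsample_estimate(rnorm(10000), median)  # approximates the full-data median
```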
39. Pairup
Pairup function: generates every possible pair from the
input vector
Some rank-based estimators require pairwise operations
O(n²) complexity
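A pairup of an input vector can be sketched in base R with combn; the output has choose(n, 2) rows, so both time and memory grow as O(n²) (illustrative, not the package's implementation):

```r
# All unordered pairs of elements of v, one pair per row.
pairup = function(v) {
  t(combn(v, 2))
}

pairup(1:4)  # choose(4, 2) = 6 rows of 2 columns
```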