Computational Techniques for the Statistical Analysis of Big Data in R

A talk presented at UP-Stat 2014 on techniques for optimizing R code for large data sets

  1. 1. Computational Techniques for the Statistical Analysis of Big Data in R: A Case Study of the rlme Package. Herb Susmann, Yusuf Bilgic. April 12, 2014
  2. 2. Workflow (Identify, Rewrite, Benchmark, Test); Case Study: rlme (Identify, Wilcoxon Tau Estimator, Pairup, Covariance Estimator); Summary; Keeping Ahead
  3. 3. Motivation Case study: rlme package. Rank-based regression and estimation of two- and three-level nested effects models. Goals: faster, less memory, more data. Before: 5,000 rows of data. After: 50,000 rows of data.
  4. 4. Section 1 Workflow
  5. 5. Workflow Identify Rewrite Benchmark Test
  6. 6. Identify Know your big O!
  7. 7. Identify Know your big O! (O(n^2) memory usage? Probably not so good for big data)
  8. 8. Identify Know your big O! (O(n^2) memory usage? Probably not so good for big data) Look for error messages
  9. 9. Identify Know your big O! (O(n^2) memory usage? Probably not so good for big data) Look for error messages Profiling with Rprof
  10. 10. Rewrite High level design Algorithm design
  11. 11. Rewrite High level design Algorithm design Statistical techniques: bootstrapping
  12. 12. Rewrite Microbenchmarking Know what R is good at
  13. 13. Rewrite Microbenchmarking Know what R is good at Avoid loops in favor of vectorization
  14. 14. Rewrite Microbenchmarking Know what R is good at Avoid loops in favor of vectorization Preallocation
  15. 15. Rewrite Microbenchmarking Know what R is good at Avoid loops in favor of vectorization Preallocation Arguments are by value, not by reference
  16. 16. Rewrite Microbenchmarking Know what R is good at Avoid loops in favor of vectorization Preallocation Arguments are by value, not by reference Embrace C++
  17. 17. Rewrite Microbenchmarking Know what R is good at Avoid loops in favor of vectorization Preallocation Arguments are by value, not by reference Embrace C++ Be careful!
  18. 18. Vectorizing
          ## Bad
          vec = 1:100
          for (i in 1:length(vec)) {
            vec[i] = vec[i]^2
          }
          ## Better
          sapply(vec, function(x) x^2)
          ## Best
          vec^2
  19. 19. Preallocation
          ## Bad: the vector is regrown and copied on every iteration
          vec = c()
          for (i in 1:100) {
            vec = c(vec, i)
          }
          ## Better: allocate once, then fill in place
          vec = numeric(100)
          for (i in 1:100) {
            vec[i] = i
          }
  20. 20. Pass by value
          square <- function(x) {
            x <- x^2  # modifies the function's local copy only
            return(x)
          }
          x <- 1:100
          square(x)   # returns the squares; the caller's x is unchanged
  21. 21. Benchmark Write several versions of a slow function
  22. 22. Benchmark Write several versions of a slow function Test them against each other
  23. 23. Benchmark Write several versions of a slow function Test them against each other Package: microbenchmark
  24. 24. Test Regressions
  25. 25. Test Regressions Unit Testing
  26. 26. Test Regressions Unit Testing Package: testthat
  28. 28. Section 2 Case Study: rlme
  29. 29. Identify Over to R!
          Rprof("profile")
          fit.rlme = rlme(...)
          Rprof(NULL)
          summaryRprof("profile")
  30. 30. Wilcoxon Tau Estimator Rank-based scale estimator of residuals. Uses pairup (so already O(n^2))
  31. 31. Wilcoxon Tau Estimator Original:
          dresd <- sort(abs(temp[, 1] - temp[, 2]))
          dresd = dresd[(p + 1):choose(n, 2)]
  32. 32. Wilcoxon Tau Estimator Original:
          dresd <- sort(abs(temp[, 1] - temp[, 2]))
          dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong?
  33. 33. Wilcoxon Tau Estimator Original:
          dresd <- sort(abs(temp[, 1] - temp[, 2]))
          dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong? Bad algorithm (the sort is at least O(n log n)), and the variable gets copied multiple times
  34. 34. Wilcoxon Tau Estimator Original:
          dresd <- sort(abs(temp[, 1] - temp[, 2]))
          dresd = dresd[(p + 1):choose(n, 2)]
      What’s wrong? Bad algorithm (the sort is at least O(n log n)), and the variable gets copied multiple times
      Updated with C++:
          dresd = remove.k.smallest(dresd)
  35. 35. Wilcoxon Tau Estimator Test with 2,000 residuals: better!
  36. 36. Wilcoxon Tau But what about really huge inputs?
  37. 37. Wilcoxon Tau But what about really huge inputs? Bootstrapping: when there are over 5,000 rows, repeat the estimate 100 times on 1,000 sampled points
  38. 38. Wilcoxon Tau But what about really huge inputs? Bootstrapping: when there are over 5,000 rows, repeat the estimate 100 times on 1,000 sampled points. Not about speed, but about memory
  39. 39. Pairup Pairup function: generates every possible pair from the input vector. Some rank-based estimators require pairwise operations. O(n^2) complexity
  40. 40. Pairup Original version: vectorized (14 LOC)
  41. 41. Pairup Original version: vectorized (14 LOC) Loop version (12 LOC)
  42. 42. Pairup Original version: vectorized (14 LOC) Loop version (12 LOC) "combn" version (core R function, 1 LOC)
  43. 43. Pairup Original version: vectorized (14 LOC) Loop version (12 LOC) "combn" version (core R function, 1 LOC) C++ version (12 LOC)
  44. 44. Over to R!
  45. 45. Covariance Estimator n × n covariance matrix: change to preallocation
  46. 46. Covariance Estimator
  47. 47. Summary Identify Rewrite Benchmark Test
  48. 48. Keeping Ahead Parallelism
  49. 49. Keeping Ahead Parallelism Cluster: Rmpi, snow
  50. 50. Keeping Ahead Parallelism Cluster: Rmpi, snow GPU: rpud
  51. 51. Keeping Ahead Parallelism Cluster: Rmpi, snow GPU: rpud Probably not Hadoop, maybe Apache Spark?
  52. 52. Keeping Ahead Parallelism Cluster: Rmpi, snow GPU: rpud Probably not Hadoop, maybe Apache Spark? Julia Language
  53. 53. Keeping Ahead Parallelism Cluster: Rmpi, snow GPU: rpud Probably not Hadoop, maybe Apache Spark? Julia Language Hadley Wickham (plyr, ggplot2, testthat, ...)
  54. 54. Keeping Ahead Parallelism Cluster: Rmpi, snow GPU: rpud Probably not Hadoop, maybe Apache Spark? Julia Language Hadley Wickham (plyr, ggplot2, testthat, ...) “Advanced R Programming”
  56. 56. Questions?
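
The Pairup slides compare several implementations of the same function, but the transcript does not include their code. The following is only a sketch of what a preallocated loop version and a combn version might look like, with a check that the two agree. The names pairup_loop and pairup_combn are hypothetical and are not taken from the rlme package; the timings use base R's system.time rather than the microbenchmark package named in the slides, so the sketch runs without extra packages.

```r
# Hypothetical helper names; the rlme package's own pairup code is not
# shown in the slides, so this is an illustration, not its implementation.

# Loop version: preallocate the full result, then fill it row by row.
pairup_loop <- function(x) {
  n <- length(x)
  stopifnot(n >= 2)
  out <- matrix(NA_real_, nrow = choose(n, 2), ncol = 2)
  k <- 1L
  for (i in seq_len(n - 1L)) {
    for (j in (i + 1L):n) {
      out[k, ] <- c(x[i], x[j])
      k <- k + 1L
    }
  }
  out
}

# combn version: utils::combn() returns a 2 x choose(n, 2) matrix,
# so transpose it to get one pair per row.
pairup_combn <- function(x) {
  stopifnot(length(x) >= 2)
  t(combn(x, 2))
}

# Both versions should produce the same pairs in the same order.
x <- c(3, 1, 4, 1, 5)
stopifnot(isTRUE(all.equal(pairup_loop(x), pairup_combn(x))))

# Rough timing comparison; the numbers depend on the machine.
big <- rnorm(300)
system.time(pairup_loop(big))
system.time(pairup_combn(big))
```

Either version still needs O(n^2) memory for the result, which is why the talk turns to subsampling for very large inputs.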

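The Wilcoxon Tau slides describe switching to repeated estimates on sampled points once the input exceeds 5,000 rows. A minimal sketch of that idea is below, under several assumptions that the slides do not confirm: stats::mad stands in for the Wilcoxon tau estimator, sampling is done with replacement, and the repeated estimates are combined by taking their mean. The function name subsampled_estimate and its parameter names are hypothetical.

```r
# Sketch only: `estimator` defaults to stats::mad as a stand-in for the
# Wilcoxon tau estimator, which is not shown in the slides.
subsampled_estimate <- function(x, estimator = stats::mad,
                                threshold = 5000, size = 1000, reps = 100) {
  # Small inputs: compute the estimate directly on all residuals.
  if (length(x) <= threshold) {
    return(estimator(x))
  }
  # Large inputs: preallocate, then estimate on `reps` random samples.
  estimates <- numeric(reps)
  for (r in seq_len(reps)) {
    # Assumption: sample with replacement, as in a bootstrap.
    estimates[r] <- estimator(sample(x, size, replace = TRUE))
  }
  # Assumption: combine the repeated estimates by averaging.
  mean(estimates)
}

set.seed(1)
resid <- rnorm(20000)
subsampled_estimate(resid)
```

Each repetition only touches 1,000 points, so a pairwise estimator would need about choose(1000, 2) pairs at a time instead of choose(n, 2) for the full input. That matches the slide's point that the change is about memory rather than speed.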