Computational Techniques for the Statistical Analysis of Big Data in R

A talk presented at UP-Stat 2014 on techniques for optimizing R code for large data sets

Computational Techniques for the Statistical
Analysis of Big Data in R
A Case Study of the rlme Package
Herb Susmann, Yusuf Bilgic
April 12, 2014

Workflow
  Identify
  Rewrite
  Benchmark
  Test
Case Study: rlme
  Identify
  Wilcoxon Tau Estimator
  Pairup
  Covariance Estimator
Summary
Keeping Ahead

Motivation
Case study: rlme package
Rank-based regression and estimation of two- and three-level nested effects models.
Goals: faster, less memory, more data
Before: 5,000 rows of data
After: 50,000 rows of data

Identify
Know your big O! (O(n^2) memory usage? Probably not so good for big data)
Look for error messages
Profiling with Rprof
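
To make the O(n^2) warning concrete, here is a back-of-the-envelope sketch of the memory needed to hold every pair of n values as a two-column matrix of doubles (8 bytes each). The helper name pair_memory_gb is made up for this illustration, and the figures ignore R's per-object overhead and any temporary copies.

```r
# Rough memory needed to store all choose(n, 2) pairs as a
# choose(n, 2) x 2 matrix of doubles (8 bytes per value).
# pair_memory_gb is a hypothetical helper, not part of rlme.
pair_memory_gb <- function(n) {
  pairs <- choose(n, 2)   # number of unordered pairs
  bytes <- pairs * 2 * 8  # two doubles per pair
  bytes / 1e9             # decimal gigabytes
}

sizes <- c(5000, 50000)
data.frame(rows = sizes, pairs = choose(sizes, 2), gb = pair_memory_gb(sizes))
```

Ten times the rows means roughly a hundred times the memory: about 0.2 GB at 5,000 rows against about 20 GB at 50,000.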

Rewrite
High level design
Algorithm design
Statistical techniques: bootstrapping

Rewrite
Microbenchmarking
Know what R is good at
Avoid loops in favor of vectorization
Preallocation
Arguments are passed by value, not by reference
Embrace C++
Be careful!

Vectorizing
## Bad
vec = 1:100
for (i in 1:length(vec)) {
  vec[i] = vec[i]^2
}
## Better
sapply(vec, function(x) x^2)
## Best
vec^2

Preallocation
## Bad
vec = c()
for (i in 1:100) {
  vec = c(vec, i)
}
## Better
vec = numeric(100)
for (i in 1:100) {
  vec[i] = i
}

Pass by value
square <- function(x) {
  x <- x^2
  return(x)
}
x <- 1:100
square(x)  # returns the squares; the caller's x is unchanged
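
A small sketch of what pass-by-value means in practice: a function that modifies its argument only changes its own local copy. R makes that copy lazily, at the first modification, so changing a large argument inside a function can cost a full allocation. The function name square_first is invented for this example.

```r
# square_first() modifies its argument, but only its own local copy.
square_first <- function(v) {
  v[1] <- v[1]^2  # first modification: R copies v, the caller's vector is untouched
  v
}

x <- c(3, 4, 5)
y <- square_first(x)

x  # still 3 4 5
y  # 9 4 5
```

To keep the result, assign it back (x <- square_first(x)); there is no way to update the caller's vector in place from inside an ordinary R function.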

Benchmark
Write several versions of a slow function
Test them against each other
Package: microbenchmark
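
A minimal sketch of the benchmarking step, reusing the preallocation example from earlier in the talk. The three function names are invented here. The microbenchmark package is used if it is installed; otherwise the sketch falls back to coarse base-R timings with system.time().

```r
# Three candidate implementations of the same task: build the vector 1..n.
grow <- function(n) {
  v <- c()
  for (i in seq_len(n)) v <- c(v, i)  # reallocates on every iteration
  v
}
prealloc <- function(n) {
  v <- numeric(n)                     # allocate once
  for (i in seq_len(n)) v[i] <- i
  v
}
vectorized <- function(n) as.numeric(seq_len(n))

n <- 1e4

# Check that the versions agree before timing them.
stopifnot(isTRUE(all.equal(grow(n), prealloc(n))),
          isTRUE(all.equal(prealloc(n), vectorized(n))))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    grow = grow(n),
    prealloc = prealloc(n),
    vectorized = vectorized(n),
    times = 20
  ))
} else {
  # Base-R fallback: time 20 repetitions of each version.
  print(system.time(for (k in 1:20) grow(n)))
  print(system.time(for (k in 1:20) prealloc(n)))
  print(system.time(for (k in 1:20) vectorized(n)))
}
```

Checking that the candidates return the same result before timing them keeps the comparison honest: a fast version that computes something different is not an optimization.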

Test
Regressions
Unit Testing
Package: testthat
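
One way to guard against regressions is to keep the slow, obviously correct version of a function as a reference and assert that the optimized version still matches it. The sketch below assumes a toy pair of functions (sum_sq_loop and sum_sq_fast, both invented for illustration) and uses testthat when it is installed, with a plain stopifnot() fallback.

```r
# Reference (slow, obviously correct) and optimized versions of one function.
sum_sq_loop <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi^2
  total
}
sum_sq_fast <- function(x) sum(x^2)

set.seed(1)
x <- rnorm(1000)

if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("optimized version matches the reference", {
    testthat::expect_equal(sum_sq_fast(x), sum_sq_loop(x))
    testthat::expect_equal(sum_sq_fast(numeric(0)), 0)  # edge case: empty input
  })
} else {
  stopifnot(isTRUE(all.equal(sum_sq_fast(x), sum_sq_loop(x))))
}
```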

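Case Study: rlme

Identify
Over to R!
Rprof("profile")
fit.rlme = rlme(...)
Rprof(NULL)
summaryRprof("profile")

Wilcoxon Tau Estimator
Rank-based scale estimator of residuals
Uses pairup (so already O(n^2))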

Wilcoxon Tau Estimator
Original:
dresd <- sort(abs(temp[, 1] - temp[, 2]))
dresd = dresd[(p + 1):choose(n, 2)]
What’s wrong? Bad algorithm (the sort is at least O(n log n)), and the variable gets copied multiple times
Updated with C++:
dresd = remove.k.smallest(dresd)
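
The complaint about the original code is that a full sort does more work than needed when the goal is only to discard the p smallest values. As an illustration of that idea in base R (not the package's own routine), the sketch below uses partial sorting. drop_k_smallest is a hypothetical helper, and it assumes the downstream code does not need the remaining values in sorted order.

```r
# Drop the k smallest values of x without fully sorting it.
# drop_k_smallest is a hypothetical base-R stand-in for illustration only.
drop_k_smallest <- function(x, k) {
  n <- length(x)
  stopifnot(k >= 0, k < n)
  if (k == 0) return(x)
  # Partial sort: position k holds the k-th smallest value, everything
  # before it is <= that value and everything after it is >= that value.
  partially_sorted <- sort(x, partial = k)
  partially_sorted[(k + 1):n]  # remaining values, in no particular order
}

set.seed(42)
d <- abs(rnorm(10))
kept <- drop_k_smallest(d, 3)

# Same set of values as the sort-then-subset approach.
all.equal(sort(kept), sort(d)[4:10])
```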

Wilcoxon Tau Estimator
Test with 2,000 residuals: better!

Wilcoxon Tau
But what about really huge inputs?
Bootstrapping: when there are more than 5,000 rows, repeat the estimate 100 times, each time on 1,000 sampled points
Not about speed, but about memory
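
A rough sketch of the resampling idea, under stated assumptions: the slides do not say how the 100 estimates are combined or whether sampling is with replacement, so this version averages them and samples without replacement. subsample_estimate is a hypothetical wrapper, and mad() stands in for whatever scale estimator is actually being computed.

```r
# Apply `estimator` directly to small inputs; for large inputs, average the
# estimator over repeated random subsamples so that no O(n^2) object is ever
# built on the full data.
# Assumptions: results are combined by their mean; sampling is without replacement.
subsample_estimate <- function(x, estimator, threshold = 5000,
                               size = 1000, reps = 100) {
  if (length(x) <= threshold) {
    return(estimator(x))
  }
  estimates <- vapply(
    seq_len(reps),
    function(i) estimator(sample(x, size)),
    numeric(1)
  )
  mean(estimates)
}

set.seed(2014)
resid <- rnorm(20000)
subsample_estimate(resid, mad)  # estimate from 100 subsamples of 1,000 points
mad(resid)                      # estimate on the full vector, for comparison
```

Each subsample of 1,000 points generates only choose(1000, 2) = 499,500 pairs at a time, which is why this caps memory rather than running time.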

Pairup
Pairup function: generates every possible pair from the input vector
Some rank-based estimators require pairwise operations
O(n^2) complexity

Pairup
Original version: vectorized (14 LOC)
Loop version (12 LOC)
“combn” version (core R function, 1 LOC)
C++ version (12 LOC)
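
The package's own pairup implementations are not reproduced in the slides, so the following is only a sketch of two of the approaches listed: a one-line version built on combn() and a loop version that fills a preallocated matrix. The function names pairup_combn and pairup_loop are invented, and both assume a numeric vector with at least two elements.

```r
# One-liner: combn() returns a 2 x choose(n, 2) matrix, so transpose it.
pairup_combn <- function(x) t(combn(x, 2))

# Loop version: preallocate the result, then fill one block of rows per i.
pairup_loop <- function(x) {
  n <- length(x)
  stopifnot(n >= 2)
  out <- matrix(NA_real_, nrow = choose(n, 2), ncol = 2)
  k <- 1                      # next free row
  for (i in seq_len(n - 1)) {
    m <- n - i                # number of pairs that start with x[i]
    rows <- k:(k + m - 1)
    out[rows, 1] <- x[i]
    out[rows, 2] <- x[(i + 1):n]
    k <- k + m
  }
  out
}

x <- c(10, 20, 30, 40)
pairup_loop(x)
all.equal(pairup_loop(x), pairup_combn(x))
```

With several interchangeable versions like these, the benchmark step decides which one to keep.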

Covariance Estimator
n × n covariance matrix
change to preallocation
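
The slides do not show the covariance code itself, so this is only a generic sketch of what switching an n × n matrix construction to preallocation can look like. The entry() function is a placeholder formula chosen for illustration, not the estimator used in rlme.

```r
# Placeholder for the (i, j) covariance entry; the real formula is not in the slides.
entry <- function(i, j) 0.5^abs(i - j)

# Growing: each rbind() allocates a new, larger matrix and copies the old rows.
build_grow <- function(n) {
  m <- NULL
  for (i in seq_len(n)) m <- rbind(m, entry(i, seq_len(n)))
  m
}

# Preallocated: allocate the n x n matrix up front, then fill rows in place.
build_prealloc <- function(n) {
  m <- matrix(0, nrow = n, ncol = n)
  for (i in seq_len(n)) m[i, ] <- entry(i, seq_len(n))
  m
}

# Both constructions produce the same matrix.
all.equal(build_grow(50), build_prealloc(50), check.attributes = FALSE)
```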

Summary
Identify
Rewrite
Benchmark
Test

Keeping Ahead
Parallelism
Cluster: Rmpi, snow
GPU: rpud
Probably not Hadoop, maybe Apache Spark?
Julia Language
Hadley Wickham (plyr, ggplot, testthat, ...)
“Advanced R Programming”

Questions?
