Get up to Speed
QUICK GUIDE TO DATA.TABLE IN R AND PENTAHO PDI
02 09 2015
SERBAN TANASA
What You Could Gain Tonight
• 2-20x speed increase in your data loading and manipulation using data.table
• (If time allows) A free path of entry into Business Intelligence ETL (commercial scale computing technologies for Extract/Transform/Load) using Pentaho Data Integration.
• Free food?
Planned Outline
data.table
• Why use it? Benchmarks.
• How to use it? Primer on basic functions.
• Overcome R scaling limitations: Multithread, Cloud, Databases.
Pentaho Data Integration (PDI)
• (Optional time-constrained section) Very basic run-through of PDI ETL
Unstructured Time for Q&A and (potentially) hilarious live-coding
R Online Support and Business Use
Source: Stack Overflow, Talk Stats, and Cross Validated
[Chart: Posts per Software, in thousands, for R, SAS, SPSS, and Stata; series: SO, TalkStats, Cross Validated]
[Chart: LinkedIn Groups Members, in thousands, for R, SAS, SPSS, and Stata]
Benchmarks
READ DATA
ORDER DATA
TRANSFORM DATA
Benchmarks: Hardware Setup
• Test Machine: AWS EC2 r3.8xlarge
• # R version 3.2.2 (2015-08-14) -- "Fire Safety"
• # Platform: x86_64-pc-linux-gnu (64-bit)
• An Amazon Web Services Elastic Compute Cloud (EC2) instance with these settings costs $2.8/hr on demand, ~$1/hr reserved, or as low as ~$0.3/hr on spot instances.
Benchmarks: Reading Data
[Chart: Seconds to Read File, for 50 MB, 500 MB, and 5 GB files; series: read.csv, read.csv(2), read.table, ff, sqldf, fread]
Benchmarks: Reading Data
[Chart: Read Performance Relative to fread(), in percent, for 50 MB, 500 MB, and 5 GB files; series: read.csv, read.csv(2), read.table, ff, sqldf, fread]
Benchmarks: Order Data
[Chart: Sort Table Operations by Table Size, log scale, for tables of 1.00E+03 to 1.00E+09 rows; series: Base, dplyr, data.table]
Benchmarks: Order Data
[Chart: Sort 1 Billion Rows, in milliseconds; series: Base, dplyr, data.table]
Benchmarks: Transform Data (Setup)
• The input data is randomly ordered. No pre-sort. No indexes. No key.
• 5 simple queries are run: large groups and small groups on different columns of different types. Similar to what a data analyst might do in practice; i.e., various ad hoc aggregations as the data is explored and investigated.
• Each package is tested separately in its own fresh session.
• Each query is repeated once more, immediately. This is to isolate cache effects and confirm the first timing.
• The results are compared and checked allowing for numeric tolerance and column name differences.
• It is a tough test that happens to be realistic and very common.
Benchmarks: Transform Data (Setup)
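N = 1e9; K = 100   # N rows; K distinct values in the large-group columns
set.seed(1)
DF <- data.frame(stringsAsFactors = FALSE,
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),         # large groups (character)
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),         # large groups (character)
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),    # small groups (character)
  id4 = sample(K, N, TRUE),                              # large groups (integer)
  id5 = sample(K, N, TRUE),                              # large groups (integer)
  id6 = sample(N/K, N, TRUE),                            # small groups (integer)
  v1  = sample(5, N, TRUE),                              # integers from 1 to 5
  v2  = sample(5, N, TRUE),                              # integers from 1 to 5
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE) # numeric, 4 decimal places
)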
id1    id2    id3           id4  id5  id6  v1  v2  v3
id027  id007  id0000000022   42   60   58   4   4  50.7016
id038  id068  id0000000012   15   56   71   4   4   5.5459
id058  id074  id0000000015   46   60   34   5   1  11.5124
id091  id012  id0000000031   81   40   12   1   1  18.8075
id021  id005  id0000000016   33   27   88   2   3  34.0231
id090  id014  id0000000053   87   74    6   2   3  27.2783
id095  id089  id0000000012   25    3   35   2   5  11.5124
id067  id084  id0000000048   83   85   47   5   1  63.7503
id063  id087  id0000000031   22   86   78   2   4  23.251
id007  id004  id0000000031   58   14   82   2   5   7.1864
id021  id011  id0000000030   37   39   69   5   1  49.0202
id018  id055  id0000000066   95   86    1   1   2   4.0548
id069  id011  id0000000039   11    8   71   5   2  45.0637
id039  id073  id0000000075   54   23   50   5   4  89.157
id077  id073  id0000000069    9   77   73   4   2  22.9517
id050  id079  id0000000027   29   34   17   3   4  23.251
id072  id062  id0000000041   67   98   53   4   1  73.6784
id100  id051  id0000000051   13   15   55   1   3  54.3411
id039  id046  id0000000090  100   77   79   1   2   7.1864
id078  id004  id0000000009   68   97   10   2   2  40.3839
Benchmarks: Test Commands
Test  data.table                                         dplyr
1.1   DT[, sum(v1), keyby=id1]                           DF %>% group_by(id1) %>% summarise(sum(v1))
1.2   DT[, sum(v1), keyby=id1]                           DF %>% group_by(id1) %>% summarise(sum(v1))
2.1   DT[, sum(v1), keyby="id1,id2"]                     DF %>% group_by(id1,id2) %>% summarise(sum(v1))
2.2   DT[, sum(v1), keyby="id1,id2"]                     DF %>% group_by(id1,id2) %>% summarise(sum(v1))
3.1   DT[, list(sum(v1),mean(v3)), keyby=id3]            DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))
3.2   DT[, list(sum(v1),mean(v3)), keyby=id3]            DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))
4.1   DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]    DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)
4.2   DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]    DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)
5.1   DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]     DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)
5.2   DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]     DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)
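The data.table side of these queries can be tried at a much smaller scale. What follows is a minimal sketch, assuming the data.table package is installed. N is cut to 1e5 here (the benchmark went up to 1e9), and the cross-check against base R tapply() is only an illustration of "compare results with numeric tolerance", not the benchmark's own checking code.

```r
library(data.table)

# Scaled-down benchmark data (assumption: N = 1e5 instead of 1e9)
N <- 1e5; K <- 100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id2 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id3 = sample(sprintf("id%010d", 1:(N/K)), N, TRUE),
  id4 = sample(K, N, TRUE),
  id5 = sample(K, N, TRUE),
  id6 = sample(N/K, N, TRUE),
  v1  = sample(5, N, TRUE),
  v2  = sample(5, N, TRUE),
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)
)

# Query 1, run twice in a row as in the benchmark protocol
t1 <- system.time(q1 <- DT[, sum(v1), keyby = id1])
t2 <- system.time(q1_again <- DT[, sum(v1), keyby = id1])
print(c(first = t1[["elapsed"]], second = t2[["elapsed"]]))

# Query 4: mean of columns 7 to 9 (v1, v2, v3) by id4
q4 <- DT[, lapply(.SD, mean), keyby = id4, .SDcols = 7:9]

# Cross-check query 1 against base R, allowing for numeric tolerance
base_q1 <- tapply(DT$v1, DT$id1, sum)
check_q1 <- isTRUE(all.equal(q1$V1, as.vector(base_q1[q1$id1])))
print(check_q1)
```

Timings at this size are too small to be meaningful; the point is only the shape of the queries and of the repeat-and-verify protocol.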
Benchmarks: Results
[Chart: Group by and Summarize (Average of 5 Operations), axis in millions; series: dplyr, data.table]
[Chart: Group by and Summarize (Average of 5 Operations), log scale, in microseconds; series: dplyr, data.table]
dplyr run time as a percentage of data.table run time, by query and run
Query (run)  <0.01     <0.01     0.03      0.075     0.516     4.939     49.15     (data size, GB)
             1.00E+03  1.00E+04  1.00E+05  1.00E+06  1.00E+07  1.00E+08  1.00E+09  (rows)
1 (1st)      127%      133%      160%      238%      217%      186%      185%
1 (2nd)      125%      146%      215%      265%      217%      188%      188%
2 (1st)      150%      331%      508%      578%      399%      309%      294%
2 (2nd)      148%      328%      497%      581%      405%      304%      281%
3 (1st)      94%       116%      264%      316%      254%      276%      298%
3 (2nd)      95%       120%      264%      307%      256%      264%      299%
4 (1st)      226%      214%      193%      176%      188%      227%      227%
4 (2nd)      171%      172%      175%      188%      187%      224%      232%
5 (1st)      165%      166%      204%      239%      314%      586%      497%
5 (2nd)      161%      164%      203%      240%      314%      623%      498%
data.table
Primer
READ
CREATE
MANIPULATE
SPECIAL COMMANDS
Read
fread()
• Similar to read.table but faster and more convenient. All controls such as sep, colClasses and nrows are automatically detected. bit64::integer64 types are also detected and read directly without needing to read as character before converting.
• sep -- The separator between columns. Defaults to the first character in the set [,\t |;:] that exists on line autostart outside quoted regions, and separates the rows above autostart into a consistent number of fields, too.
• Other arguments: skip, drop, select, showProgress.
• Input can be a file name, a URL pointing to a file, or (advanced use) a shell command, e.g. fread("grep @WhiteHouse.gov filename")
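A minimal sketch of the fread() arguments listed above, assuming the data.table package is installed. The temporary file and its three columns are made up for illustration.

```r
library(data.table)

# Write a small semicolon-separated file to a temporary location
# (hypothetical data, only for this illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;score",
             "1;alpha;0.5",
             "2;beta;1.5",
             "3;gamma;2.5"), tmp)

# sep, header, and column types are detected automatically
all_cols <- fread(tmp)
print(sapply(all_cols, class))

# select keeps only the named columns; showProgress = FALSE silences the progress meter
some_cols <- fread(tmp, select = c("id", "score"), showProgress = FALSE)
print(some_cols)

# drop is the complement of select; nrows limits how many data rows are read
first_two <- fread(tmp, drop = "name", nrows = 2)
print(first_two)

unlink(tmp)
```

Reading only the needed columns with select or drop is often the cheapest speed-up on wide files, since the skipped columns are never materialized in memory.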
Create
• data.table() – much like data.frame()
• setDT() – makes an existing data.frame a data.table without copying (this is important for large data)
• setkey() and setkeyv() – supercharged rownames, indices
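The three ways of creating and keying a table can be sketched as follows, assuming the data.table package is installed. The tables and column names are made up for illustration.

```r
library(data.table)

# data.table() builds a table column by column, like data.frame()
DT <- data.table(id = c("b", "a", "c"), x = 1:3)

# setDT() converts an existing data.frame in place (by reference, no copy)
DF <- data.frame(id = c("b", "a", "c"), y = c(10, 20, 30))
setDT(DF)
print(class(DF))

# setkey() sorts the table by the key column and marks it as keyed
setkey(DT, id)
print(key(DT))

# setkeyv() takes the key columns as a character vector
setkeyv(DF, "id")

# With a key set, rows can be looked up by key value in i
print(DT["a"])
```

Note that setkey() physically reorders the rows, which is why DT comes back sorted by id afterwards.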
Manipulate
• := : Assignment operator (without copy)
• .N : Counts
• data.table::melt(), data.table::dcast()
• merge() on data.tables and DT_1[DT_2] joins
• DT[ i, j, by ]
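The operations listed above can be sketched on a toy table, assuming a reasonably recent data.table (the on= argument used for the join is not available in very old releases). The sales and managers tables are made up for illustration.

```r
library(data.table)

sales <- data.table(region = c("N", "S", "N", "S", "E"),
                    amount = c(10, 20, 30, 40, 50))

# := adds or updates a column by reference (the table is not copied)
sales[, amount_k := amount / 1000]

# .N is the number of rows, here counted within each group
counts <- sales[, .N, by = region]
print(counts)

# DT_1[DT_2] join: look up each row of `managers` in `sales` by region
managers <- data.table(region = c("N", "S"), manager = c("Ann", "Bob"))
joined <- sales[managers, on = "region"]
print(joined)

# merge() on data.tables performs an inner join by default
merged <- merge(sales, managers, by = "region")

# melt() reshapes wide to long; dcast() reshapes long back to wide
long <- melt(sales, id.vars = "region",
             measure.vars = c("amount", "amount_k"))
wide <- dcast(long, region ~ variable, value.var = "value", fun.aggregate = sum)
print(wide)
```

Region "E" has no manager, so it drops out of both the join and the merge but still appears in the reshaped summary.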
DT[i, j, by] format
Source: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial
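One common way to read DT[i, j, by] is: take DT, subset rows using i, then compute j, grouped by by. A minimal sketch, assuming the data.table package is installed; the flights table below is made up for illustration.

```r
library(data.table)

flights <- data.table(carrier = c("AA", "AA", "UA", "UA", "UA"),
                      origin  = c("JFK", "LGA", "JFK", "JFK", "LGA"),
                      delay   = c(5, 20, -3, 12, 40))

# i only: keep rows where delay exceeds 10
late <- flights[delay > 10]

# j only: compute on columns; here the result is a single number
avg_delay <- flights[, mean(delay)]

# i, j, and by together: among JFK departures, mean delay per carrier
jfk_by_carrier <- flights[origin == "JFK", .(mean_delay = mean(delay)), by = carrier]
print(jfk_by_carrier)
```

In SQL terms, i plays the role of WHERE, j of SELECT, and by of GROUP BY.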
Special Commands
• .( )
• .EACHI
• .SD and .SDcols
• c('x2', 'y2') := list(..., ...)
• `:=`(x2=..., y2=...) equivalent functional form for assigning several columns
• DT[, plot(x)] will actually produce a plot
• copy() – for when you do not want to update by reference
• DT[, "colname", with=FALSE]
• DT[…][…] -- Chaining
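Most of the special commands above fit into one short session. This is a sketch under the assumption of a reasonably recent data.table; the table DT and the lookup table are made up for illustration.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b"), x = 1:4, y = c(2, 4, 6, 8))

# .() is an alias for list() inside DT[...]
sums <- DT[, .(sx = sum(x), sy = sum(y)), by = g]

# .SD holds the current group's data; .SDcols picks which columns it contains
means <- DT[, lapply(.SD, mean), by = g, .SDcols = c("x", "y")]

# Two equivalent ways to add several columns by reference
DT[, c("x2", "y2") := list(x^2, y^2)]
DT[, `:=`(x3 = x^3, y3 = y^3)]

# copy() makes an independent table, so := on the copy leaves DT untouched
DT2 <- copy(DT)
DT2[, x := 0L]

# with = FALSE treats j as a vector of column names
xcol <- DT[, "x", with = FALSE]

# by = .EACHI evaluates j once for each row of the table given in i
lookup <- data.table(g = c("a", "b"))
per_group <- DT[lookup, .(total = sum(x)), on = "g", by = .EACHI]

# Chaining: the result of one [...] feeds the next
top <- DT[, .(sy = sum(y)), by = g][order(-sy)][1]
print(top)
```

Without copy(), DT2 <- DT would only create a second name for the same table, and the := on DT2 would also change DT.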
Overcome R scaling limitations
MULTITHREAD R
CLOUD
DATABASES
SPECIALIZE R USE
Multithread R: RRO 3.2.1
https://mran.revolutionanalytics.com/download/
Enhancements include multi-core processing…
http://serbantanasa.com/2015/06/12/r-vs-revolution-r-open-3-2-0/
Cloud, Database, BI Tools
• AWS
  • On-off deployment of memory-optimized instances for one-off heavy processing
  • AWS with RStudio Server + Shiny Server (Linux only)
• R is increasingly integrated in BI tools and even databases
  • Pentaho EE has R integration (as do MicroStrategy, Microsoft SSRS, IBM, and even data discovery tools like Tableau, QlikView & Alteryx)
  • IBM dashDB has a built-in RStudio, MS SQL Server 2016 will have in-database R, Postgres has PL/R, etc.
Specialize Your Use of R
• R can do anything you can program (it is a Turing-complete programming language)
• R should NOT do everything.
• Push ETL to specialized software (like PDI)
• Push computation to DB (DBI and rstats-db packages) & Hadoop (RHadoop – basically large scale lapply)
https://github.com/rstats-db
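Pushing computation to a database can be sketched with DBI. This example assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in for a real database server; the orders table is made up for illustration.

```r
library(DBI)

# Assumption: RSQLite is installed; in-memory SQLite stands in for a real server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Load a made-up table into the database
orders <- data.frame(customer = c("a", "b", "a", "c"),
                     amount   = c(10, 20, 30, 40))
dbWriteTable(con, "orders", orders)

# The aggregation runs inside the database; only the summary returns to R
totals <- dbGetQuery(con, "
  SELECT customer, SUM(amount) AS total
  FROM orders
  GROUP BY customer
  ORDER BY customer
")
print(totals)

dbDisconnect(con)
```

With a real backend only the dbConnect() call would change; the query and the small result set coming back to R stay the same.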
Pentaho PDI OVERVIEW OF CAPABILITIES
What PDI can do for you
• Data integration without writing a single line of code
• Heavily parallel streams (compare to base R's single core); can even push work to a whole slave computing cluster.
• Java, JavaScript, SQL, R Scripting (EE?)
• Slowly changing dimensions made easy
Data I/O Capabilities
Visual Data Munging
Complexity can escalate quickly…
Additional Resources: PDI
• Community Version: http://community.pentaho.com/projects/data-integration/
• Enterprise Edition: http://www.pentaho.com/product/data-integration
Additional Resources: data.table
• data.table wiki: https://github.com/Rdatatable/data.table/wiki
• data.table tutorial: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial/
Thank you for your time!
stanasa@sunstonescience.com
Appendix
Stack Overflow
Benchmark Test
Run times by query and run (apparently in microseconds, the unit shown on the Results chart)
Run      Command                                                         <0.01        <0.01         0.03        0.075        0.516        4.939        49.15  (data size, GB)
                                                                      1.00E+03     1.00E+04     1.00E+05     1.00E+06     1.00E+07     1.00E+08     1.00E+09  (rows)
1 (1st)  DF %>% group_by(id1) %>% summarise(sum(v1))                     6,729        7,074       10,737       49,100      468,973    5,076,499   51,998,307
1 (1st)  DT[, sum(v1), keyby=id1]                                        5,300        5,305        6,708       20,656      216,540    2,730,619   28,076,861
1 (2nd)  DF %>% group_by(id1) %>% summarise(sum(v1))                     1,188        1,642        5,344       43,524      455,865    5,128,423   51,528,819
1 (2nd)  DT[, sum(v1), keyby=id1]                                          953        1,123        2,486       16,416      210,032    2,721,640   27,406,794
2 (1st)  DF %>% group_by(id1,id2) %>% summarise(sum(v1))                 1,894        8,480       28,033      152,965    1,444,988   14,927,984  152,535,515
2 (1st)  DT[, sum(v1), keyby="id1,id2"]                                  1,263        2,559        5,516       26,446      361,812    4,827,830   51,897,539
2 (2nd)  DF %>% group_by(id1,id2) %>% summarise(sum(v1))                 1,865        8,396       27,343      153,185    1,440,788   14,652,509  152,118,605
2 (2nd)  DT[, sum(v1), keyby="id1,id2"]                                  1,257        2,561        5,505       26,386      355,492    4,827,702   54,047,731
3 (1st)  DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))            1,130        1,697       10,323      129,391    1,805,955   45,748,652  693,700,832
3 (1st)  DT[, list(sum(v1),mean(v3)), keyby=id3]                         1,197        1,461        3,910       40,991      710,813   16,582,213  233,001,218
3 (2nd)  DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))            1,100        1,701       10,299      125,585    1,824,419   44,038,141  627,247,199
3 (2nd)  DT[, list(sum(v1),mean(v3)), keyby=id3]                         1,160        1,415        3,894       40,895      713,289   16,660,202  209,734,867
4 (1st)  DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)        2,463        2,900        6,965       51,318      591,011    8,421,787   86,594,397
4 (1st)  DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]                 1,091        1,354        3,603       29,196      314,705    3,711,641   38,151,119
4 (2nd)  DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)        1,811        2,276        6,328       49,939      579,184    8,315,717   86,585,169
4 (2nd)  DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9]                 1,056        1,322        3,614       26,572      310,265    3,706,489   37,275,816
5 (1st)  DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)         1,729        2,234        8,268       94,044    1,661,863   41,141,396  600,233,695
5 (1st)  DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]                  1,049        1,343        4,053       39,299      529,329    7,016,574  120,664,934
5 (2nd)  DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)         1,696        2,207        8,189       93,323    1,660,917   41,126,264  625,999,220
5 (2nd)  DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9]                  1,055        1,343        4,033       38,910      529,763    6,602,454  125,725,448

Get up to Speed (Quick Guide to data.table in R and Pentaho PDI)

  • 1. Get up to Speed QUICK GUIDE TO DATA.TABLE IN R AND PENTAHO PDI 02 09 2015 SERBAN TANASA
  • 2. What You Could Gain Tonight
      • A 2-20x speed increase in your data loading and manipulation using data.table
      • (If time allows) A free path of entry into Business Intelligence ETL (commercial-scale computing technologies for Extract/Transform/Load) using Pentaho Data Integration
      • Free food?
  • 3. Planned Outline
      • data.table
          • Why use it? Benchmarks.
          • How to use it? Primer on basic functions.
          • Overcome R scaling limitations: Multithread, Cloud, Databases.
      • Pentaho Data Integration (PDI)
          • (Optional time-constrained section) Very basic run-through of PDI ETL
      • Unstructured time for Q&A and (potentially) hilarious live-coding
  • 4. R Online Support and Business Use. [Charts: posts per software (thousands) on Stack Overflow, TalkStats, and Cross Validated; LinkedIn group members (thousands); each for R, SAS, SPSS, and Stata.] Source: Stack Overflow, Talk Stats, and Cross Validated
  • 6. Benchmarks: Hardware Setup
      • Test machine: AWS EC2 r3.8xlarge
      • # R version 3.2.2 (2015-08-14) -- "Fire Safety"
      • # Platform: x86_64-pc-linux-gnu (64-bit)
      • An Amazon Web Services Elastic Compute Cloud (EC2) instance with these settings costs $2.8/hr on demand, ~$1/hr reserved, or as low as ~$0.3/hr on spot instances.
  • 7. Benchmarks: Reading Data. [Chart: seconds to read a file of 50Mb, 500Mb, and 5Gb with read.csv, read.csv(2), read.table, ff, sqldf, and fread.]
  • 8. Benchmarks: Reading Data. [Chart: read performance relative to fread() for 50Mb, 500Mb, and 5Gb files with read.csv, read.csv(2), read.table, ff, sqldf, and fread.]
  • 9. Benchmarks: Order Data. [Chart: sort table operations by table size (1.00E+03 to 1.00E+09 rows), log scale, for base R, dplyr, and data.table.]
  • 11. Benchmarks: Transform Data (Setup)
      • The input data is randomly ordered. No pre-sort. No indexes. No key.
      • 5 simple queries are run: large groups and small groups on different columns of different types. Similar to what a data analyst might do in practice; i.e., various ad hoc aggregations as the data is explored and investigated.
      • Each package is tested separately in its own fresh session.
      • Each query is repeated once more, immediately. This is to isolate cache effects and confirm the first timing.
      • The results are compared and checked allowing for numeric tolerance and column name differences.
      • It is a tough test that happens to be realistic and very common.
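  • 12. Benchmarks: Transform Data (Setup)
        N=1e9; K=100          # N rows, K distinct values for the large-group columns
        set.seed(1)           # reproducible sampling
        DF <- data.frame(stringsAsFactors=FALSE,
          id1 = sample(sprintf("id%03d",1:K), N, TRUE),         # character, K large groups
          id2 = sample(sprintf("id%03d",1:K), N, TRUE),         # character, K large groups
          id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),    # character, N/K small groups
          id4 = sample(K, N, TRUE),                             # integer, K large groups
          id5 = sample(K, N, TRUE),                             # integer, K large groups
          id6 = sample(N/K, N, TRUE),                           # integer, N/K small groups
          v1 = sample(5, N, TRUE),                              # integer values 1..5
          v2 = sample(5, N, TRUE),                              # integer values 1..5
          v3 = sample(round(runif(100,max=100),4), N, TRUE)     # numeric values
        )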
  • 13. Sample rows of the test data
        id1    id2    id3           id4  id5  id6  v1  v2  v3
        id027  id007  id0000000022  42   60   58   4   4   50.7016
        id038  id068  id0000000012  15   56   71   4   4   5.5459
        id058  id074  id0000000015  46   60   34   5   1   11.5124
        id091  id012  id0000000031  81   40   12   1   1   18.8075
        id021  id005  id0000000016  33   27   88   2   3   34.0231
        id090  id014  id0000000053  87   74   6    2   3   27.2783
        id095  id089  id0000000012  25   3    35   2   5   11.5124
        id067  id084  id0000000048  83   85   47   5   1   63.7503
        id063  id087  id0000000031  22   86   78   2   4   23.251
        id007  id004  id0000000031  58   14   82   2   5   7.1864
        id021  id011  id0000000030  37   39   69   5   1   49.0202
        id018  id055  id0000000066  95   86   1    1   2   4.0548
        id069  id011  id0000000039  11   8    71   5   2   45.0637
        id039  id073  id0000000075  54   23   50   5   4   89.157
        id077  id073  id0000000069  9    77   73   4   2   22.9517
        id050  id079  id0000000027  29   34   17   3   4   23.251
        id072  id062  id0000000041  67   98   53   4   1   73.6784
        id100  id051  id0000000051  13   15   55   1   3   54.3411
        id039  id046  id0000000090  100  77   79   1   2   7.1864
        id078  id004  id0000000009  68   97   10   2   2   40.3839
  • 14. Benchmarks: Test Commands
        Test | data.table | dplyr
        1.1 | DT[, sum(v1), keyby=id1] | DF %>% group_by(id1) %>% summarise(sum(v1))
        1.2 | DT[, sum(v1), keyby=id1] | DF %>% group_by(id1) %>% summarise(sum(v1))
        2.1 | DT[, sum(v1), keyby="id1,id2"] | DF %>% group_by(id1,id2) %>% summarise(sum(v1))
        2.2 | DT[, sum(v1), keyby="id1,id2"] | DF %>% group_by(id1,id2) %>% summarise(sum(v1))
        3.1 | DT[, list(sum(v1),mean(v3)), keyby=id3] | DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))
        3.2 | DT[, list(sum(v1),mean(v3)), keyby=id3] | DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3))
        4.1 | DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] | DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)
        4.2 | DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] | DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9)
        5.1 | DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] | DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)
        5.2 | DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] | DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9)
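The run-twice protocol can be tried on a laptop with a much smaller table. This is a minimal sketch, not the benchmark itself: N is cut to 1e5 (an assumption made so it finishes quickly), only the data.table side is shown, and the object names `res1`, `res2`, `res4` are illustrative.

```r
library(data.table)

# Scaled-down version of the deck's setup (N = 1e5 is an assumption; the deck uses 1e9)
N <- 1e5; K <- 100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),
  id4 = sample(K, N, TRUE),
  v1  = sample(5, N, TRUE),
  v2  = sample(5, N, TRUE),
  v3  = sample(round(runif(100, max = 100), 4), N, TRUE)
)

# Query 1, run twice back to back so the second timing can confirm the first
t_first  <- system.time(res1 <- DT[, sum(v1), keyby = id1])[["elapsed"]]
t_second <- system.time(res2 <- DT[, sum(v1), keyby = id1])[["elapsed"]]

# Query 4, selecting the value columns by name rather than by position
res4 <- DT[, lapply(.SD, mean), keyby = id4, .SDcols = c("v1", "v2", "v3")]
```

At this size both timings will be a few milliseconds at most; the differences the deck reports only show up at much larger N.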
  • 15. Benchmarks: Results. [Charts: Group by and Summarize (average of 5 operations), dplyr vs data.table, in microseconds; one panel on a linear scale (millions), one on a log scale.]
  • 16. Benchmarks: dplyr time as a percentage of data.table time, by test and table size
        Size (GB) | <0.01 | <0.01 | 0.03 | 0.075 | 0.516 | 4.939 | 49.15
        Rows | 1.00E+03 | 1.00E+04 | 1.00E+05 | 1.00E+06 | 1.00E+07 | 1.00E+08 | 1.00E+09
        1 (1st) | 127% | 133% | 160% | 238% | 217% | 186% | 185%
        1 (2nd) | 125% | 146% | 215% | 265% | 217% | 188% | 188%
        2 (1st) | 150% | 331% | 508% | 578% | 399% | 309% | 294%
        2 (2nd) | 148% | 328% | 497% | 581% | 405% | 304% | 281%
        3 (1st) | 94% | 116% | 264% | 316% | 254% | 276% | 298%
        3 (2nd) | 95% | 120% | 264% | 307% | 256% | 264% | 299%
        4 (1st) | 226% | 214% | 193% | 176% | 188% | 227% | 227%
        4 (2nd) | 171% | 172% | 175% | 188% | 187% | 224% | 232%
        5 (1st) | 165% | 166% | 204% | 239% | 314% | 586% | 497%
        5 (2nd) | 161% | 164% | 203% | 240% | 314% | 623% | 498%
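The percentages on this slide appear to be the dplyr timing divided by the data.table timing for the same test. A quick check using the test 1 (first run) timings from the slide 39 table reproduces the first row; the variable names are illustrative.

```r
# Test 1, first run: timings copied from the slide 39 table, 1e3 to 1e9 rows
dplyr_time <- c(6729, 7074, 10737, 49100, 468973, 5076499, 51998307)
dt_time    <- c(5300, 5305, 6708, 20656, 216540, 2730619, 28076861)

# dplyr time as a percentage of data.table time, rounded to whole percent
rel <- round(100 * dplyr_time / dt_time)
rel
# → 127 133 160 238 217 186 185
```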
  • 18. Read: fread()
      • Similar to read.table but faster and more convenient. All controls such as sep, colClasses and nrows are automatically detected. bit64::integer64 types are also detected and read directly, without needing to read as character before converting.
      • sep: the separator between columns. Defaults to the first character in the set [,\t |;:] that exists on line autostart outside quoted regions and separates the rows above autostart into a consistent number of fields.
      • Other arguments: skip, drop, select, showProgress
      • Input can be a file name, a URL pointing to a file, or (advanced use) a shell command: fread("grep @WhiteHouse.gov filename")
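A minimal sketch of the automatic detection and of `select` and `nrows`, assuming a small semicolon-separated file written to a temporary path. The file contents and column names are made up for illustration.

```r
library(data.table)

# Write a tiny semicolon-separated file (hypothetical data) to a temp path
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;name;score", "1;ann;3.5", "2;bob;4.0", "3;cat;2.25"), tmp)

# The separator and column types are detected without being specified
all_cols <- fread(tmp)

# select keeps only the named columns; nrows caps the number of rows read
some <- fread(tmp, select = c("id", "score"), nrows = 2)

unlink(tmp)  # clean up the temporary file
```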
  • 19. Create
      • data.table(): much like data.frame()
      • setDT(): makes an existing data.frame a data.table without copying (this is important for large data)
      • setkey() and setkeyv(): supercharged rownames, indices
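A short sketch of the three ways in, using toy data. The table and column names are illustrative only.

```r
library(data.table)

# Built directly, much like data.frame()
DT <- data.table(id = c("b", "a", "c"), x = 1:3)

# An existing data.frame converted in place, without a copy
DF <- data.frame(id = c("b", "a", "c"), y = c(10, 20, 30), stringsAsFactors = FALSE)
setDT(DF)

# setkey() takes bare column names, setkeyv() takes a character vector;
# both sort the table by the key column(s) by reference
setkey(DT, id)
setkeyv(DF, "id")

# A keyed table supports fast lookup by key value in i
row_a <- DT["a"]
```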
  • 20. Manipulate
      • := : assignment operator (without copy)
      • .N : counts
      • data.table::melt(), data.table::dcast()
      • merge() on data.tables, and DT_1[DT_2] joins
      • DT[i, j, by]
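Each of these can be seen on a three-row table. This is a sketch with made-up data; `lookup`, `joined`, and the other names are illustrative.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))

# := adds a column by reference, with no copy of DT
DT[, z := x + y]

# .N is the number of rows in each group
counts <- DT[, .N, by = g]

# melt() goes wide to long; dcast() goes back, here summing within each group
long <- melt(DT, id.vars = "g", measure.vars = c("x", "y"))
wide <- dcast(long, g ~ variable, value.var = "value", fun.aggregate = sum)

# A keyed join: for every row of DT, find the matching row of lookup
lookup <- data.table(g = c("a", "b"), label = c("alpha", "beta"), key = "g")
setkey(DT, g)
joined <- lookup[DT]

# The same result through merge()
merged <- merge(DT, lookup, by = "g")
```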
  • 21. DT[i, j, by] format. Source: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial
  • 22. DT[i, j, by] format. Source: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial
  • 23. Special Commands
      • .( )
      • .EACHI
      • .SD and .SDcols
      • c('x2', 'y2') := list(..., ...)
      • `:=`(x2=..., y2=...): equivalent group assignment
      • DT[, plot(x)] will actually produce a plot
      • copy(): for when you do not want to update by reference
      • DT[, "colname", with=FALSE]
      • DT[…][…]: chaining
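Most of these are easier to read in use than in a list. A sketch on a four-row toy table follows; all names and values are illustrative, and the plotting example is left out.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b"), x = 1:4, y = c(2, 4, 6, 8))

# .() is shorthand for list() inside DT[...]
agg <- DT[, .(total = sum(x)), by = g]

# .SD is the per-group subset of the table; .SDcols restricts its columns
means <- DT[, lapply(.SD, mean), by = g, .SDcols = c("x", "y")]

# Two equivalent ways to assign several columns at once, by reference
DT[, c("x2", "y2") := list(x^2, y^2)]
DT[, `:=`(x3 = x * 3, y3 = y * 3)]

# copy() gives an independent table, so changes to DT2 leave DT alone
DT2 <- copy(DT)
DT2[, x := 0L]

# with = FALSE treats j as column names rather than an expression
sub <- DT[, "y", with = FALSE]

# Chaining: aggregate, then order, then take the first row
top <- DT[, .(total = sum(y)), by = g][order(-total)][1]

# by = .EACHI evaluates j once for each row of i in a join
setkey(DT, g)
each <- DT[.(c("a", "b")), .(n = .N), by = .EACHI]
```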
  • 25. Multithread R: RRO 3.2.1 https://mran.revolutionanalytics.com/download/ Enhancements include multi-core processing…
  • 27. Cloud, Database, BI Tools
      • AWS
          • On-Off deployment of memory-optimized instances for one-off heavy processing
          • AWS with RStudio Server + Shiny Server (Linux only)
      • R is increasingly integrated in BI tools and even databases
          • Pentaho EE has R integration (as does MicroStrategy, Microsoft SSRS, IBM, and even data discovery tools like Tableau, QlikView & Alteryx)
          • IBM dashDB has a built-in RStudio, MS SQL Server 2016 will have in-database R, Postgres has PL/R, etc.
  • 28. Specialize Your Use of R
      • R can do anything you can program (it is a Turing-complete programming language)
      • R should NOT do everything.
          • Push ETL to specialized software (like PDI)
          • Push computation to DB (DBI and rstats-db packages) & Hadoop (RHadoop, basically large scale lapply) https://github.com/rstats-db
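As a rough illustration of pushing computation to the database, the sketch below lets SQL do the aggregation and brings only the summary back into R. It assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in for a real server; the `sales` table and its columns are invented for the example.

```r
library(DBI)

# In-memory SQLite database standing in for a real database server (assumption)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table; in practice the data would already live in the database
dbWriteTable(con, "sales",
             data.frame(region = c("n", "n", "s"), amount = c(10, 20, 5)))

# The GROUP BY runs inside the database; only two summary rows reach R
totals <- dbGetQuery(con,
  "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region")

dbDisconnect(con)
```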
  • 29. Pentaho PDI: OVERVIEW OF CAPABILITIES
  • 30. What PDI can do for you
      • Data integration without writing a single line of code
      • Heavily parallel streams (compare to base R's 1 core); can even push to a whole slave computing cluster
      • Java, JavaScript, SQL, R scripting (EE?)
      • Slowly changing dimensions made easy
  • 34. Additional Resources: PDI
      • Community Version: http://community.pentaho.com/projects/data-integration/
      • Enterprise Edition: http://www.pentaho.com/product/data-integration
  • 35. Additional Resources: data.table
      • data.table wiki: https://github.com/Rdatatable/data.table/wiki
      • data.table tutorial: https://campus.datacamp.com/courses/data-table-data-manipulation-r-tutorial/
  • 36. Thank you for your time! stanasa@sunstonescience.com
  • 39. Benchmark Test Data (each command appears twice: first run, then the immediate repeat; timings presumably in microseconds, the unit shown on slide 15)
        Size (GB) | <0.01 | <0.01 | 0.03 | 0.075 | 0.516 | 4.939 | 49.15
        Rows | 1.00E+03 | 1.00E+04 | 1.00E+05 | 1.00E+06 | 1.00E+07 | 1.00E+08 | 1.00E+09
        DF %>% group_by(id1) %>% summarise(sum(v1)) | 6,729 | 7,074 | 10,737 | 49,100 | 468,973 | 5,076,499 | 51,998,307
        DT[, sum(v1), keyby=id1] | 5,300 | 5,305 | 6,708 | 20,656 | 216,540 | 2,730,619 | 28,076,861
        DF %>% group_by(id1) %>% summarise(sum(v1)) | 1,188 | 1,642 | 5,344 | 43,524 | 455,865 | 5,128,423 | 51,528,819
        DT[, sum(v1), keyby=id1] | 953 | 1,123 | 2,486 | 16,416 | 210,032 | 2,721,640 | 27,406,794
        DF %>% group_by(id1,id2) %>% summarise(sum(v1)) | 1,894 | 8,480 | 28,033 | 152,965 | 1,444,988 | 14,927,984 | 152,535,515
        DT[, sum(v1), keyby="id1,id2"] | 1,263 | 2,559 | 5,516 | 26,446 | 361,812 | 4,827,830 | 51,897,539
        DF %>% group_by(id1,id2) %>% summarise(sum(v1)) | 1,865 | 8,396 | 27,343 | 153,185 | 1,440,788 | 14,652,509 | 152,118,605
        DT[, sum(v1), keyby="id1,id2"] | 1,257 | 2,561 | 5,505 | 26,386 | 355,492 | 4,827,702 | 54,047,731
        DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3)) | 1,130 | 1,697 | 10,323 | 129,391 | 1,805,955 | 45,748,652 | 693,700,832
        DT[, list(sum(v1),mean(v3)), keyby=id3] | 1,197 | 1,461 | 3,910 | 40,991 | 710,813 | 16,582,213 | 233,001,218
        DF %>% group_by(id3) %>% summarise(sum(v1),mean(v3)) | 1,100 | 1,701 | 10,299 | 125,585 | 1,824,419 | 44,038,141 | 627,247,199
        DT[, list(sum(v1),mean(v3)), keyby=id3] | 1,160 | 1,415 | 3,894 | 40,895 | 713,289 | 16,660,202 | 209,734,867
        DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9) | 2,463 | 2,900 | 6,965 | 51,318 | 591,011 | 8,421,787 | 86,594,397
        DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] | 1,091 | 1,354 | 3,603 | 29,196 | 314,705 | 3,711,641 | 38,151,119
        DF %>% group_by(id4) %>% summarise_each(funs(mean), 7:9) | 1,811 | 2,276 | 6,328 | 49,939 | 579,184 | 8,315,717 | 86,585,169
        DT[, lapply(.SD, mean), keyby=id4, .SDcols=7:9] | 1,056 | 1,322 | 3,614 | 26,572 | 310,265 | 3,706,489 | 37,275,816
        DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9) | 1,729 | 2,234 | 8,268 | 94,044 | 1,661,863 | 41,141,396 | 600,233,695
        DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] | 1,049 | 1,343 | 4,053 | 39,299 | 529,329 | 7,016,574 | 120,664,934
        DF %>% group_by(id6) %>% summarise_each(funs(sum), 7:9) | 1,696 | 2,207 | 8,189 | 93,323 | 1,660,917 | 41,126,264 | 625,999,220
        DT[, lapply(.SD, sum), keyby=id6, .SDcols=7:9] | 1,055 | 1,343 | 4,033 | 38,910 | 529,763 | 6,602,454 | 125,725,448