SlideShare a Scribd company logo
1 of 15
Download to read offline
Copyright©2016 NTT corp. All Rights Reserved.	
Hivemall  Meets  XGBoost  in
DataFrame/Spark
2016/9/8
Takeshi  Yamamuro  (maropu)  @  NTT
2Copyright©2016 NTT corp. All Rights Reserved.	
Who  am  I?
3Copyright©2016 NTT corp. All Rights Reserved.	
• Short  for  eXtreme  Gradient  Boosting
•  https://github.com/dmlc/xgboost
• It  is...
•  variant  of  the  gradient  boosting  machine
•  tree-‐‑‒based  model
•  open-‐‑‒sourced  tool  (Apache2  license)  
•  written  in  C++
•  R/python/Julia/Java/Scala  interfaces  provided
•  widely  used  in  Kaggle  competitions
                is...
4Copyright©2016 NTT corp. All Rights Reserved.	
• Most  of  Hivemall  functions  supported  in  
Spark-‐‑‒v1.6  and  v2.0
•  the  v2.0  support  not  released  yet
• XGBoost  integration  under  development
•  distributed/parallel  predictions
•  native  libraries  bundled  for  major  platforms
•  Mac /Linux  on  x86_̲64
•  how-‐‑‒to-‐‑‒use:  https://gist.github.com/maropu/
33794b293ee937e99b8fb0788843fa3f
Hivemall  in  DataFrame/Spark
5Copyright©2016 NTT corp. All Rights Reserved.	
Spark  Quick  Examples
• Fetch  a  binary  Spark  v2.0.0
•  http://spark.apache.org/downloads.html
$ <SPARK_HOME>/bin/spark-shell
scala> :paste
val textFile = sc.textFile(”hoge.txt")
val counts = textFile.flatMap(_.split(" ")) .map(word => (word, 1))
.reduceByKey(_ + _)
6Copyright©2016 NTT corp. All Rights Reserved.	
Fetch  training  and  test  data
• E2006  tfidf  regression  dataset
•  http://www.csie.ntu.edu.tw/~∼cjlin/libsvmtools/
datasets/regression.html#E2006-‐‑‒tfidf
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/
E2006.train.bz2
$ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/
E2006.test.bz2
7Copyright©2016 NTT corp. All Rights Reserved.	
XGBoost  in  spark-‐‑‒shell
• Scala  interface  bundled  in  the  Hivemall  jar
$ bunzip2 E2006.train.bz2
$ <SPARK_HOME>/bin/spark-shell
-conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
scala> import ml.dmlc.xgboost4j.scala._
scala> :paste
// Read trainining data
val trainData = new DMatrix(”E2006.train")
// Define parameters
val paramMap = List(
"eta" -> 0.1, "max_depth" -> 2, "objective" -> ”reg:logistic”
).toMap
// Train the model
val model = XGBoost.train(trainData, paramMap, 2)
// Save model to the file
model.saveModel(”xgboost_models_dir/xgb_0001.model”)
8Copyright©2016 NTT corp. All Rights Reserved.	
Load  test  data  in  parallel
$ <SPARK_HOME>/bin/spark-shell
-conf spark.jars=hivemall-spark-XXX-with-dependencies.jar
// Create DataFrame for the test data
scala> val testDf = sqlContext.sparkSession.read.format("libsvm”)
.load("E2006.test.bz2")
scala> testDf.printSchema
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
9Copyright©2016 NTT corp. All Rights Reserved.	
Load  test  data  in  parallel
0.000357499151147113 6066:0.0007932706219604
8 6069:0.000311377727123504 6070:0.0003067549
34580457 6071:0.000276992485786437 6072:0.000
39663531098024 6074:0.00039663531098024 6075
:0.00032548335…
testDf
Partition1 Partition2 Partition3 PartitionN
…
…
…
Load in parallel because
bzip2 is splittable
• #partitions  depends  on  three  parameters
•  spark.default.parallelism:  #cores  by  default
•  spark.sql.files.maxPartitionBytes:  128MB  by  default
•  spark.sql.files.openCostInBytes:  4MB  by  default
10Copyright©2016 NTT corp. All Rights Reserved.	
• XGBoost  in  DataFrame
•  Load  built  models  and  do  cross-‐‑‒joins  for  predictions
Do  predictions  in  parallel
scala> import org.apache.spark.hive.HivemallOps._
scala> :paste
// Load built models from persistent storage
val modelsDf = sqlContext.sparkSession.read.format(xgboost)
.load(”xgboost_models_dir")
// Do prediction in parallel via cross-joins
val predict = modelsDf.join(testDf)
.xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
.groupBy("rowid")
.avg()
11Copyright©2016 NTT corp. All Rights Reserved.	
• XGBoost  in  DataFrame
•  Load  built  models  and  do  cross-‐‑‒joins  for  predictions
• Broadcast  cross-‐‑‒joins  expected
•  Size  of  `̀modelsDf`̀  must  be  less  than  and  equal  to  
spark.sql.autoBroadcastJoinThreshold  (10MB  by  default)
Do  predictions  in  parallel
testDf
rowid label features
1 0.392 1:0.3  5:0.1…
2 0.929 3:0.2…
3 0.132 2:0.9…
4 0.3923 5:0.4…
…
modelsDf
model_̲id pred_̲model
xgb_̲0001.model <binary  data>
xgb_̲0002.model <binary  data>
cross-joins in parallel
12Copyright©2016 NTT corp. All Rights Reserved.	
• Structured  Streaming  in  Spark-‐‑‒2.0
•  Scalable  and  fault-‐‑‒tolerant  stream  processing  engine  
built  on  the  Spark  SQL  engine
•  alpha  component  in  v2.0
Do  predictions  for  streaming  data
scala> :paste
// Initialize streaming DataFrame
val testStreamingDf = spark.readStream
.format(”libsvm”) // Not supported in v2.0
…
// Do prediction for streaming data
val predict = modelsDf.join(testStreamingDf)
.xgboost_predict($"rowid", $"features", $"model_id", $"pred_model")
.groupBy("rowid")
.avg()
13Copyright©2016 NTT corp. All Rights Reserved.	
• One  model  for  a  partition
•  WIP:  Build  models  with  different  parameters
Build  models  in  parallel
scala> :paste
// Set options for XGBoost
val xgbOptions = XGBoostOptions()
.set("num_round", "10000")
.set(“max_depth”, “32,48,64”) // Randomly selected by workers
// Set # of models to output
val numModels = 4
// Build models and save them in persistent storage
trainDf.repartition(numModels)
.train_xgboost_regr($“features”, $ “label”, s"${xgbOptions}")
.write
.format(xgboost)
.save(”xgboost_models_dir”)
14Copyright©2016 NTT corp. All Rights Reserved.	
• If  you  get  stuck  in  UnsatisfiedLinkError,  you  
need  to  compile  a  binary  by  yourself
Compile  a  binary  on  your  platform  
$ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 –DskipTests
$ ls target
hivemall-core-0.4.2-rc.2-with-dependencies.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar
hivemall-core-0.4.2-rc.2.jar
hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar
hivemall-mixserv-0.4.2-rc.2-fat.jar
hivemall-xgboost-0.4.2-rc.2.jar
hivemall-nlp-0.4.2-rc.2-with-dependencies.jar
hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar
hivemall-nlp-0.4.2-rc.2.jar
hivemall-xgboost_0.60-0.4.2-rc.2.jar
15Copyright©2016 NTT corp. All Rights Reserved.	
• Rabbit  integration  for  parallel  learning
•  http://dmlc.cs.washington.edu/rabit.html
• Python  supports
• spark.ml  interface  supports
• Bundle  more  binaries  for  portability
•  Windows  and  x86  platforms
• Others?
Future  Work

More Related Content

What's hot

Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkJan Wiegelmann
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsDataWorks Summit
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!DataWorks Summit
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsTakuya UESHIN
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Yahoo Developer Network
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsTakuya UESHIN
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Summit
 
Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)Takuya UESHIN
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSDatabricks
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySparkSpark Summit
 
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019Kim Hammar
 

What's hot (20)

Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, SparkDistributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Big Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source ToolkitsBig Data Analytics-Open Source Toolkits
Big Data Analytics-Open Source Toolkits
 
Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!Surge: Rise of Scalable Machine Learning at Yahoo!
Surge: Rise of Scalable Machine Learning at Yahoo!
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)Introducing Koalas 1.0 (and 1.1)
Introducing Koalas 1.0 (and 1.1)
 
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDSAccelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
Accelerated Machine Learning with RAPIDS and MLflow, Nvidia/RAPIDS
 
Hadoop + GPU
Hadoop + GPUHadoop + GPU
Hadoop + GPU
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
 

Viewers also liked

hivemallを使って4日間で性別推定した話
hivemallを使って4日間で性別推定した話hivemallを使って4日間で性別推定した話
hivemallを使って4日間で性別推定した話eventdotsjp
 
Hivemallmtup 20160908
Hivemallmtup 20160908Hivemallmtup 20160908
Hivemallmtup 20160908Kazuki Ohmori
 
Hivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupHivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupMakoto Yui
 
2nd Hivemall meetup 20151020
2nd Hivemall meetup 201510202nd Hivemall meetup 20151020
2nd Hivemall meetup 20151020Makoto Yui
 
Hivemall meetup vol2 oisix
Hivemall meetup vol2 oisixHivemall meetup vol2 oisix
Hivemall meetup vol2 oisixTaisuke Fukawa
 
Hivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスHivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスKentaro Yoshida
 
Sano tokyowebmining 201625_v04
Sano tokyowebmining 201625_v04Sano tokyowebmining 201625_v04
Sano tokyowebmining 201625_v04Masakazu Sano
 

Viewers also liked (8)

Sano hmm 20150512
Sano hmm 20150512Sano hmm 20150512
Sano hmm 20150512
 
hivemallを使って4日間で性別推定した話
hivemallを使って4日間で性別推定した話hivemallを使って4日間で性別推定した話
hivemallを使って4日間で性別推定した話
 
Hivemallmtup 20160908
Hivemallmtup 20160908Hivemallmtup 20160908
Hivemallmtup 20160908
 
Hivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetupHivemall v0.3の機能紹介@1st Hivemall meetup
Hivemall v0.3の機能紹介@1st Hivemall meetup
 
2nd Hivemall meetup 20151020
2nd Hivemall meetup 201510202nd Hivemall meetup 20151020
2nd Hivemall meetup 20151020
 
Hivemall meetup vol2 oisix
Hivemall meetup vol2 oisixHivemall meetup vol2 oisix
Hivemall meetup vol2 oisix
 
Hivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービスHivemallで始める不動産価格推定サービス
Hivemallで始める不動産価格推定サービス
 
Sano tokyowebmining 201625_v04
Sano tokyowebmining 201625_v04Sano tokyowebmining 201625_v04
Sano tokyowebmining 201625_v04
 

Similar to 20160908 hivemall meetup

A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.J On The Beach
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Databricks
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Jason Dai
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in productionParis Data Engineers !
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterTim Ellison
 
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Canada
 
Graal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them AllGraal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them AllThomas Wuerthinger
 
20180417 hivemall meetup#4
20180417 hivemall meetup#420180417 hivemall meetup#4
20180417 hivemall meetup#4Takeshi Yamamuro
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche SparkAlex Thompson
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
 
Model-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsModel-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsCisco Canada
 
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur KhanRunning Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur KhanDatabricks
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceTim Ellison
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaJoe Stein
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Sparkfelixcss
 

Similar to 20160908 hivemall meetup (20)

A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production10 things i wish i'd known before using spark in production
10 things i wish i'd known before using spark in production
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven Telemetry
 
Graal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them AllGraal and Truffle: One VM to Rule Them All
Graal and Truffle: One VM to Rule Them All
 
20180417 hivemall meetup#4
20180417 hivemall meetup#420180417 hivemall meetup#4
20180417 hivemall meetup#4
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
Model-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsModel-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data Analytics
 
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur KhanRunning Spark In Production in the Cloud is Not Easy with Nayur Khan
Running Spark In Production in the Cloud is Not Easy with Nayur Khan
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
AMD It's Time to ROC
AMD It's Time to ROCAMD It's Time to ROC
AMD It's Time to ROC
 
Developing Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache KafkaDeveloping Realtime Data Pipelines With Apache Kafka
Developing Realtime Data Pipelines With Apache Kafka
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 

More from Takeshi Yamamuro

LT: Spark 3.1 Feature Expectation
LT: Spark 3.1 Feature ExpectationLT: Spark 3.1 Feature Expectation
LT: Spark 3.1 Feature ExpectationTakeshi Yamamuro
 
Quick Overview of Upcoming Spark 3.0 + α
Quick Overview of Upcoming Spark 3.0 + αQuick Overview of Upcoming Spark 3.0 + α
Quick Overview of Upcoming Spark 3.0 + αTakeshi Yamamuro
 
MLflowによる機械学習モデルのライフサイクルの管理
MLflowによる機械学習モデルのライフサイクルの管理MLflowによる機械学習モデルのライフサイクルの管理
MLflowによる機械学習モデルのライフサイクルの管理Takeshi Yamamuro
 
Taming Distributed/Parallel Query Execution Engine of Apache Spark
Taming Distributed/Parallel Query Execution Engine of Apache SparkTaming Distributed/Parallel Query Execution Engine of Apache Spark
Taming Distributed/Parallel Query Execution Engine of Apache SparkTakeshi Yamamuro
 
LLJVM: LLVM bitcode to JVM bytecode
LLJVM: LLVM bitcode to JVM bytecodeLLJVM: LLVM bitcode to JVM bytecode
LLJVM: LLVM bitcode to JVM bytecodeTakeshi Yamamuro
 
An Experimental Study of Bitmap Compression vs. Inverted List Compression
An Experimental Study of Bitmap Compression vs. Inverted List CompressionAn Experimental Study of Bitmap Compression vs. Inverted List Compression
An Experimental Study of Bitmap Compression vs. Inverted List CompressionTakeshi Yamamuro
 
Sparkのクエリ処理系と周辺の話題
Sparkのクエリ処理系と周辺の話題Sparkのクエリ処理系と周辺の話題
Sparkのクエリ処理系と周辺の話題Takeshi Yamamuro
 
VLDB2013 R1 Emerging Hardware
VLDB2013 R1 Emerging HardwareVLDB2013 R1 Emerging Hardware
VLDB2013 R1 Emerging HardwareTakeshi Yamamuro
 
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4浮動小数点(IEEE754)を圧縮したい@dsirnlp#4
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4Takeshi Yamamuro
 
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)Takeshi Yamamuro
 
Introduction to Modern Analytical DB
Introduction to Modern Analytical DBIntroduction to Modern Analytical DB
Introduction to Modern Analytical DBTakeshi Yamamuro
 
SIGMOD’12勉強会 -Session 7-
SIGMOD’12勉強会 -Session 7-SIGMOD’12勉強会 -Session 7-
SIGMOD’12勉強会 -Session 7-Takeshi Yamamuro
 
A x86-optimized rank&select dictionary for bit sequences
A x86-optimized rank&select dictionary for bit sequencesA x86-optimized rank&select dictionary for bit sequences
A x86-optimized rank&select dictionary for bit sequencesTakeshi Yamamuro
 
VLDB’11勉強会 -Session 9-
VLDB’11勉強会 -Session 9-VLDB’11勉強会 -Session 9-
VLDB’11勉強会 -Session 9-Takeshi Yamamuro
 
研究動向から考えるx86/x64最適化手法
研究動向から考えるx86/x64最適化手法研究動向から考えるx86/x64最適化手法
研究動向から考えるx86/x64最適化手法Takeshi Yamamuro
 
VLDB'10勉強会 -Session 20-
VLDB'10勉強会 -Session 20-VLDB'10勉強会 -Session 20-
VLDB'10勉強会 -Session 20-Takeshi Yamamuro
 

More from Takeshi Yamamuro (20)

LT: Spark 3.1 Feature Expectation
LT: Spark 3.1 Feature ExpectationLT: Spark 3.1 Feature Expectation
LT: Spark 3.1 Feature Expectation
 
Apache Spark + Arrow
Apache Spark + ArrowApache Spark + Arrow
Apache Spark + Arrow
 
Quick Overview of Upcoming Spark 3.0 + α
Quick Overview of Upcoming Spark 3.0 + αQuick Overview of Upcoming Spark 3.0 + α
Quick Overview of Upcoming Spark 3.0 + α
 
MLflowによる機械学習モデルのライフサイクルの管理
MLflowによる機械学習モデルのライフサイクルの管理MLflowによる機械学習モデルのライフサイクルの管理
MLflowによる機械学習モデルのライフサイクルの管理
 
Taming Distributed/Parallel Query Execution Engine of Apache Spark
Taming Distributed/Parallel Query Execution Engine of Apache SparkTaming Distributed/Parallel Query Execution Engine of Apache Spark
Taming Distributed/Parallel Query Execution Engine of Apache Spark
 
LLJVM: LLVM bitcode to JVM bytecode
LLJVM: LLVM bitcode to JVM bytecodeLLJVM: LLVM bitcode to JVM bytecode
LLJVM: LLVM bitcode to JVM bytecode
 
An Experimental Study of Bitmap Compression vs. Inverted List Compression
An Experimental Study of Bitmap Compression vs. Inverted List CompressionAn Experimental Study of Bitmap Compression vs. Inverted List Compression
An Experimental Study of Bitmap Compression vs. Inverted List Compression
 
Sparkのクエリ処理系と周辺の話題
Sparkのクエリ処理系と周辺の話題Sparkのクエリ処理系と周辺の話題
Sparkのクエリ処理系と周辺の話題
 
20150513 legobease
20150513 legobease20150513 legobease
20150513 legobease
 
20150516 icde2015 r19-4
20150516 icde2015 r19-420150516 icde2015 r19-4
20150516 icde2015 r19-4
 
VLDB2013 R1 Emerging Hardware
VLDB2013 R1 Emerging HardwareVLDB2013 R1 Emerging Hardware
VLDB2013 R1 Emerging Hardware
 
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4浮動小数点(IEEE754)を圧縮したい@dsirnlp#4
浮動小数点(IEEE754)を圧縮したい@dsirnlp#4
 
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)
LLVMで遊ぶ(整数圧縮とか、x86向けの自動ベクトル化とか)
 
Introduction to Modern Analytical DB
Introduction to Modern Analytical DBIntroduction to Modern Analytical DB
Introduction to Modern Analytical DB
 
SIGMOD’12勉強会 -Session 7-
SIGMOD’12勉強会 -Session 7-SIGMOD’12勉強会 -Session 7-
SIGMOD’12勉強会 -Session 7-
 
A x86-optimized rank&select dictionary for bit sequences
A x86-optimized rank&select dictionary for bit sequencesA x86-optimized rank&select dictionary for bit sequences
A x86-optimized rank&select dictionary for bit sequences
 
VAST-Tree, EDBT'12
VAST-Tree, EDBT'12VAST-Tree, EDBT'12
VAST-Tree, EDBT'12
 
VLDB’11勉強会 -Session 9-
VLDB’11勉強会 -Session 9-VLDB’11勉強会 -Session 9-
VLDB’11勉強会 -Session 9-
 
研究動向から考えるx86/x64最適化手法
研究動向から考えるx86/x64最適化手法研究動向から考えるx86/x64最適化手法
研究動向から考えるx86/x64最適化手法
 
VLDB'10勉強会 -Session 20-
VLDB'10勉強会 -Session 20-VLDB'10勉強会 -Session 20-
VLDB'10勉強会 -Session 20-
 

Recently uploaded

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSrknatarajan
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 

Recently uploaded (20)

ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 

20160908 hivemall meetup

  • 1. Copyright©2016 NTT corp. All Rights Reserved. Hivemall  Meets  XGBoost  in DataFrame/Spark 2016/9/8 Takeshi  Yamamuro  (maropu)  @  NTT
  • 2. 2Copyright©2016 NTT corp. All Rights Reserved. Who  am  I?
  • 3. 3Copyright©2016 NTT corp. All Rights Reserved. • Short  for  eXtreme  Gradient  Boosting •  https://github.com/dmlc/xgboost • It  is... •  variant  of  the  gradient  boosting  machine •  tree-‐‑‒based  model •  open-‐‑‒sourced  tool  (Apache2  license)   •  written  in  C++ •  R/python/Julia/Java/Scala  interfaces  provided •  widely  used  in  Kaggle  competitions                 is...
  • 4. 4Copyright©2016 NTT corp. All Rights Reserved. • Most  of  Hivemall  functions  supported  in   Spark-‐‑‒v1.6  and  v2.0 •  the  v2.0  support  not  released  yet • XGBoost  integration  under  development •  distributed/parallel  predictions •  native  libraries  bundled  for  major  platforms •  Mac /Linux  on  x86_̲64 •  how-‐‑‒to-‐‑‒use:  https://gist.github.com/maropu/ 33794b293ee937e99b8fb0788843fa3f Hivemall  in  DataFrame/Spark
  • 5. 5Copyright©2016 NTT corp. All Rights Reserved. Spark  Quick  Examples • Fetch  a  binary  Spark  v2.0.0 •  http://spark.apache.org/downloads.html $ <SPARK_HOME>/bin/spark-shell scala> :paste val textFile = sc.textFile(”hoge.txt") val counts = textFile.flatMap(_.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _)
  • 6. 6Copyright©2016 NTT corp. All Rights Reserved. Fetch  training  and  test  data • E2006  tfidf  regression  dataset •  http://www.csie.ntu.edu.tw/~∼cjlin/libsvmtools/ datasets/regression.html#E2006-‐‑‒tfidf $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/ E2006.train.bz2 $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/ E2006.test.bz2
  • 7. 7Copyright©2016 NTT corp. All Rights Reserved. XGBoost  in  spark-‐‑‒shell • Scala  interface  bundled  in  the  Hivemall  jar $ bunzip2 E2006.train.bz2 $ <SPARK_HOME>/bin/spark-shell -conf spark.jars=hivemall-spark-XXX-with-dependencies.jar scala> import ml.dmlc.xgboost4j.scala._ scala> :paste // Read trainining data val trainData = new DMatrix(”E2006.train") // Define parameters val paramMap = List( "eta" -> 0.1, "max_depth" -> 2, "objective" -> ”reg:logistic” ).toMap // Train the model val model = XGBoost.train(trainData, paramMap, 2) // Save model to the file model.saveModel(”xgboost_models_dir/xgb_0001.model”)
  • 8. 8Copyright©2016 NTT corp. All Rights Reserved. Load  test  data  in  parallel $ <SPARK_HOME>/bin/spark-shell -conf spark.jars=hivemall-spark-XXX-with-dependencies.jar // Create DataFrame for the test data scala> val testDf = sqlContext.sparkSession.read.format("libsvm”) .load("E2006.test.bz2") scala> testDf.printSchema root |-- label: double (nullable = true) |-- features: vector (nullable = true)
  • 9. 9Copyright©2016 NTT corp. All Rights Reserved. Load  test  data  in  parallel 0.000357499151147113 6066:0.0007932706219604 8 6069:0.000311377727123504 6070:0.0003067549 34580457 6071:0.000276992485786437 6072:0.000 39663531098024 6074:0.00039663531098024 6075 :0.00032548335… testDf Partition1 Partition2 Partition3 PartitionN … … … Load in parallel because bzip2 is splittable • #partitions  depends  on  three  parameters •  spark.default.parallelism:  #cores  by  default •  spark.sql.files.maxPartitionBytes:  128MB  by  default •  spark.sql.files.openCostInBytes:  4MB  by  default
  • 10. 10Copyright©2016 NTT corp. All Rights Reserved. • XGBoost  in  DataFrame •  Load  built  models  and  do  cross-‐‑‒joins  for  predictions Do  predictions  in  parallel scala> import org.apache.spark.hive.HivemallOps._ scala> :paste // Load built models from persistent storage val modelsDf = sqlContext.sparkSession.read.format(xgboost) .load(”xgboost_models_dir") // Do prediction in parallel via cross-joins val predict = modelsDf.join(testDf) .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model") .groupBy("rowid") .avg()
  • 11. 11Copyright©2016 NTT corp. All Rights Reserved. • XGBoost  in  DataFrame •  Load  built  models  and  do  cross-‐‑‒joins  for  predictions • Broadcast  cross-‐‑‒joins  expected •  Size  of  `̀modelsDf`̀  must  be  less  than  and  equal  to   spark.sql.autoBroadcastJoinThreshold  (10MB  by  default) Do  predictions  in  parallel testDf rowid label features 1 0.392 1:0.3  5:0.1… 2 0.929 3:0.2… 3 0.132 2:0.9… 4 0.3923 5:0.4… … modelsDf model_̲id pred_̲model xgb_̲0001.model <binary  data> xgb_̲0002.model <binary  data> cross-joins in parallel
  • 12. 12Copyright©2016 NTT corp. All Rights Reserved. • Structured  Streaming  in  Spark-‐‑‒2.0 •  Scalable  and  fault-‐‑‒tolerant  stream  processing  engine   built  on  the  Spark  SQL  engine •  alpha  component  in  v2.0 Do  predictions  for  streaming  data scala> :paste // Initialize streaming DataFrame val testStreamingDf = spark.readStream .format(”libsvm”) // Not supported in v2.0 … // Do prediction for streaming data val predict = modelsDf.join(testStreamingDf) .xgboost_predict($"rowid", $"features", $"model_id", $"pred_model") .groupBy("rowid") .avg()
  • 13. 13Copyright©2016 NTT corp. All Rights Reserved. • One  model  for  a  partition •  WIP:  Build  models  with  different  parameters Build  models  in  parallel scala> :paste // Set options for XGBoost val xgbOptions = XGBoostOptions() .set("num_round", "10000") .set(“max_depth”, “32,48,64”) // Randomly selected by workers // Set # of models to output val numModels = 4 // Build models and save them in persistent storage trainDf.repartition(numModels) .train_xgboost_regr($“features”, $ “label”, s"${xgbOptions}") .write .format(xgboost) .save(”xgboost_models_dir”)
  • 14. 14Copyright©2016 NTT corp. All Rights Reserved. • If  you  get  stuck  in  UnsatisfiedLinkError,  you   need  to  compile  a  binary  by  yourself Compile  a  binary  on  your  platform   $ mvn validate && mvn package -Pcompile-xgboost -Pspark-2.0 –DskipTests $ ls target hivemall-core-0.4.2-rc.2-with-dependencies.jar hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2-with-dependencies.jar hivemall-core-0.4.2-rc.2.jar hivemall-spark-1.6.2_2.11.8-0.4.2-rc.2.jar hivemall-mixserv-0.4.2-rc.2-fat.jar hivemall-xgboost-0.4.2-rc.2.jar hivemall-nlp-0.4.2-rc.2-with-dependencies.jar hivemall-xgboost_0.60-0.4.2-rc.2-with-dependencies.jar hivemall-nlp-0.4.2-rc.2.jar hivemall-xgboost_0.60-0.4.2-rc.2.jar
  • 15. 15Copyright©2016 NTT corp. All Rights Reserved. • Rabbit  integration  for  parallel  learning •  http://dmlc.cs.washington.edu/rabit.html • Python  supports • spark.ml  interface  supports • Bundle  more  binaries  for  portability •  Windows  and  x86  platforms • Others? Future  Work