SlideShare a Scribd company logo
1 of 22
Download to read offline
Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
CuMF: Large-Scale Matrix Factorization on
Just One Machine with GPUs
Agenda
 Why accelerate recommendation (matrix factorization) using GPUs?
 What are the challenges?
 How cuMF tackles the challenges?
 So what is the result?
2
Why we need a fast and scalable recommender system?
3
 Recommendation is pervasive
 Drive 80% of Netflix’s watch hours1
 Digital ads market in US: 37.3 billion2
 Need: fast, scalable, economic
ALS (alternating least square) for collaborative filtering
4
 Input: users ratings on some items
 Output: all missing ratings
 How: factorize the rating matrix R into
and minimize the lost on observed ratings:
 ALS: iteratively solve
R
X
xu
ΘT θv
Matrix factorization/ALS is versatile
5
item
X
xu
ΘT θv
Recommendation
user
X
ΘT
word
wordWord
embedding
X
ΘT
word
documentTopic
model
a very important algorithm
to accelerate
Challenges of fast and scalable ALS
6
• ALS needs to solve:
spmm: cublas
• Challenge 1: access and
aggregate many θvs:
irregular (R is sparse) and
memory intensive
LU or Cholesky
decomposition: cublas
• Challenge 2:
Single GPU can NOT handle big m, n and nnz
Challenge 1: memory access
7
 Nvidia K40: Memory BW: 288 GB/sec, compute: 4.3 Tflops/sec
 Higher flops  higher op intensity (more flops per byte)  caching!
Operational intensity (Flops/Byte)
Gflops/s
4.3T
1
288G ×
15
×
under-utilized
fully-utilized
Address challenge 1: memory-optimized ALS
8
• To obtain
1. Reuse θv s for many users
2. Load a portion into smem
3. Tile and aggregate in register
Address challenge 1: memory-optimized ALS
9
f
f
T
T
T
T
Address challenge 2: scale-up ALS on multiple GPUs
 Model parallel: solve a
portion of the model
10
model
parallel
Address challenge 2: scale-up ALS on multiple GPUs
 Data parallel: solve
with a portion of the
training data
11
Data parallel
model
parallel
Address challenge 2: parallel reduction
 Data parallel needs cross-GPU reduction
12
One-phase parallel reduction. Two-phase parallel reduction.
Inter-socketIntra-socket
Recap: cuMF tackled two challenges
13
• ALS needs to solve:
spmm: cublas
• Challenge 1: access and
aggregate many θvs:
irregular (R is sparse) and
memory intensive
LU or Cholesky
decomposition: cublas
• Challenge 2:
Single GPU can NOT handle big m, n and nnz
Connect cuMF to Spark MLlib
14
 Spark applications relying on mllib/ALS need no change
 Modified mllib/ALS detects GPU and offload computation
 Leverage the best of Spark (scale-out) and GPU (scale-up)
ALS apps
mllib/ALS
cuMFJNI
Connect cuMF to Spark MLlib
1 Power 8 node + 2 K40
CUDA
kernel
GPU1
RDD
CUDA
kernel
…RDD RDD RDD
CUDA
kernel
GPU2
RDD
CUDA
kernel
…RDD RDD RDD
shuffle
1 Power 8 node + 2 K40
CUDA
kernel
GPU1
RDD
CUDA
kernel
…RDD RDD RDD
shuffle
…
 RDD on CPU: to distribute rating data and shuffle parameters
 Solver on GPU: to form and solve
 Able to run on multiple nodes, and multiple GPUs per node
15
Implementation
16
 In C (circa. 10k LOC)
 CUDA 7.0/7.5, GCC OpenMP v3.0
 Baselines:
– Libmf: SGD on 1 node [RecSys14]
– NOMAD: SGD on >1 nodes [VLDB 14]
– SparkALS: ALS on Spark
– FactorBird: SGD + parameter server for MF
– Facebook: enhanced Giraph
CuMF performance
17
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• 1 GPU vs. 30 cores, CuMF slightly
faster than libmf and NOMAD
• CuMF scales well on 1, 2, 4 GPUs
Effectiveness of memory optimization
18
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• Aggressively using registers  2x
• Using texture  25%-35% faster
CuMF performance and cost
19
• CuMF @4 GPUs ≈ NOMAD @64 HPC nodes
≈ 10x NOMAD @32 AWS nodes
• CuMF @4 GPUs ≈ 10x SparkALS @50 nodes
≈ 1% of its cost
CuMF accelerated Spark on Power 8
20
• CuMF @2 K40s achieves 6+x speedup in training (193 sec  1.3k sec)
*GUI designed by Amir Sanjar
Conclusion
 Why accelerate recommendation (matrix factorization) using GPU?
– Need to be fast, scalable and economic
 What are the challenges?
– Memory access, scale to multiple GPUs
 How cuMF tackles the challenges?
– Optimize memory access, parallelism and communication
 So what is the result?
– Up to 10x as fast, 100x as cost-efficient
– Use cuMF standalone or with Spark
– GPU can tackle ML problems beyond deep learning!
21
Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
http://ibm.biz/wei_tan
Thank you, questions?
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs.
Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016,
http://arxiv.org/abs/1603.03820
Source code available soon.

More Related Content

What's hot

Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachjemin lee
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitCarlo C. del Mundo
 
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...AMD Developer Central
 
Practical Guidelines for Solving Difficult Mixed Integer Programs
Practical Guidelines for Solving Difficult  Mixed Integer ProgramsPractical Guidelines for Solving Difficult  Mixed Integer Programs
Practical Guidelines for Solving Difficult Mixed Integer ProgramsIBM Decision Optimization
 
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...jemin lee
 
UK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
UK ATC 2015: Automated Post Processing of Multimodel Optimisation DataUK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
UK ATC 2015: Automated Post Processing of Multimodel Optimisation DataAltair
 
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization StudioSolving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization Studiooptimizatiodirectdirect
 
Innovations in CPLEX performance and solver capabilities
Innovations in CPLEX performance and solver capabilitiesInnovations in CPLEX performance and solver capabilities
Innovations in CPLEX performance and solver capabilitiesIBM Decision Optimization
 
improve deep learning training and inference performance
improve deep learning training and inference performanceimprove deep learning training and inference performance
improve deep learning training and inference performances.rohit
 

What's hot (11)

Efficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approachEfficient execution of quantized deep learning models a compiler approach
Efficient execution of quantized deep learning models a compiler approach
 
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing UnitSlides for In-Datacenter Performance Analysis of a Tensor Processing Unit
Slides for In-Datacenter Performance Analysis of a Tensor Processing Unit
 
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonza...
 
Practical Guidelines for Solving Difficult Mixed Integer Programs
Practical Guidelines for Solving Difficult  Mixed Integer ProgramsPractical Guidelines for Solving Difficult  Mixed Integer Programs
Practical Guidelines for Solving Difficult Mixed Integer Programs
 
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
PACT19, MOSAIC : Heterogeneity-, Communication-, and Constraint-Aware Model ...
 
UK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
UK ATC 2015: Automated Post Processing of Multimodel Optimisation DataUK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
UK ATC 2015: Automated Post Processing of Multimodel Optimisation Data
 
Ch1
Ch1Ch1
Ch1
 
P.I.Z.Z.A.: Rationale Behind FPGA
P.I.Z.Z.A.: Rationale Behind FPGAP.I.Z.Z.A.: Rationale Behind FPGA
P.I.Z.Z.A.: Rationale Behind FPGA
 
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization StudioSolving Large Scale Optimization Problems using CPLEX Optimization Studio
Solving Large Scale Optimization Problems using CPLEX Optimization Studio
 
Innovations in CPLEX performance and solver capabilities
Innovations in CPLEX performance and solver capabilitiesInnovations in CPLEX performance and solver capabilities
Innovations in CPLEX performance and solver capabilities
 
improve deep learning training and inference performance
improve deep learning training and inference performanceimprove deep learning training and inference performance
improve deep learning training and inference performance
 

Similar to S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalJunli Gu
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917Bill Liu
 
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Databricks
 
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...John Gunnels
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookFaisal Siddiqi
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
PoC: Using a Group Communication System to improve MySQL Replication HA
PoC: Using a Group Communication System to improve MySQL Replication HAPoC: Using a Group Communication System to improve MySQL Replication HA
PoC: Using a Group Communication System to improve MySQL Replication HAUlf Wendel
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
Jvm problem diagnostics
Jvm problem diagnosticsJvm problem diagnostics
Jvm problem diagnosticsDanijel Mitar
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2Junli Gu
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascaleinside-BigData.com
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
 Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D... Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...Databricks
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Databricks
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemShuai Yuan
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomFacultad de Informática UCM
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Intel® Software
 
Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plansinside-BigData.com
 

Similar to S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs (20)

OpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation finalOpenCL caffe IWOCL 2016 presentation final
OpenCL caffe IWOCL 2016 presentation final
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
 
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
 
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
Making_Good_Enough...Better-Addressing_the_Multiple_Objectives_of_High-Perfor...
 
Large-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at FacebookLarge-Scale Training with GPUs at Facebook
Large-Scale Training with GPUs at Facebook
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
PoC: Using a Group Communication System to improve MySQL Replication HA
PoC: Using a Group Communication System to improve MySQL Replication HAPoC: Using a Group Communication System to improve MySQL Replication HA
PoC: Using a Group Communication System to improve MySQL Replication HA
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
Jvm problem diagnostics
Jvm problem diagnosticsJvm problem diagnostics
Jvm problem diagnostics
 
APSys Presentation Final copy2
APSys Presentation Final copy2APSys Presentation Final copy2
APSys Presentation Final copy2
 
Open power ddl and lms
Open power ddl and lmsOpen power ddl and lms
Open power ddl and lms
 
Achitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and ExascaleAchitecture Aware Algorithms and Software for Peta and Exascale
Achitecture Aware Algorithms and Software for Peta and Exascale
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
 Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D... Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for D...
 
computer architecture.
computer architecture.computer architecture.
computer architecture.
 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
Performance Optimization of Deep Learning Frameworks Caffe* and Tensorflow* f...
 
Update on Trinity System Procurement and Plans
Update on Trinity System Procurement and PlansUpdate on Trinity System Procurement and Plans
Update on Trinity System Procurement and Plans
 

Recently uploaded

APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样ayvbos
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftAanSulistiyo
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfJOHNBEBONYAP1
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsMonica Sydney
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查ydyuyu
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsMonica Sydney
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdfMatthew Sinclair
 
PowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxPowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxgalaxypingy
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsMonica Sydney
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Roommeghakumariji156
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查ydyuyu
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...kajalverma014
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查ydyuyu
 
75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptxAsmae Rabhi
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样ayvbos
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制pxcywzqs
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdfMatthew Sinclair
 

Recently uploaded (20)

APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
 
Microsoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck MicrosoftMicrosoft Azure Arc Customer Deck Microsoft
Microsoft Azure Arc Customer Deck Microsoft
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
PowerDirector Explination Process...pptx
PowerDirector Explination Process...pptxPowerDirector Explination Process...pptx
PowerDirector Explination Process...pptx
 
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girlsRussian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
Russian Call girls in Abu Dhabi 0508644382 Abu Dhabi Call girls
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
 
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
哪里办理美国迈阿密大学毕业证(本硕)umiami在读证明存档可查
 
75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx75539-Cyber Security Challenges PPT.pptx
75539-Cyber Security Challenges PPT.pptx
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
 
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
APNIC Policy Roundup, presented by Sunny Chendi at the 5th ICANN APAC-TWNIC E...
 
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
一比一原版(Offer)康考迪亚大学毕业证学位证靠谱定制
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 

S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

  • 1. Wei Tan, IBM T. J. Watson Research Center wtan@us.ibm.com CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs
  • 2. Agenda  Why accelerate recommendation (matrix factorization) using GPUs?  What are the challenges?  How cuMF tackles the challenges?  So what is the result? 2
  • 3. Why we need a fast and scalable recommender system? 3  Recommendation is pervasive  Drive 80% of Netflix’s watch hours1  Digital ads market in US: 37.3 billion2  Need: fast, scalable, economic
  • 4. ALS (alternating least square) for collaborative filtering 4  Input: users ratings on some items  Output: all missing ratings  How: factorize the rating matrix R into and minimize the lost on observed ratings:  ALS: iteratively solve R X xu ΘT θv
  • 5. Matrix factorization/ALS is versatile 5 item X xu ΘT θv Recommendation user X ΘT word wordWord embedding X ΘT word documentTopic model a very important algorithm to accelerate
  • 6. Challenges of fast and scalable ALS 6 • ALS needs to solve: spmm: cublas • Challenge 1: access and aggregate many θvs: irregular (R is sparse) and memory intensive LU or Cholesky decomposition: cublas • Challenge 2: Single GPU can NOT handle big m, n and nnz
  • 7. Challenge 1: memory access 7  Nvidia K40: Memory BW: 288 GB/sec, compute: 4.3 Tflops/sec  Higher flops  higher op intensity (more flops per byte)  caching! Operational intensity (Flops/Byte) Gflops/s 4.3T 1 288G × 15 × under-utilized fully-utilized
  • 8. Address challenge 1: memory-optimized ALS 8 • To obtain 1. Reuse θv s for many users 2. Load a portion into smem 3. Tile and aggregate in register
  • 9. Address challenge 1: memory-optimized ALS 9 f f T T T T
  • 10. Address challenge 2: scale-up ALS on multiple GPUs  Model parallel: solve a portion of the model 10 model parallel
  • 11. Address challenge 2: scale-up ALS on multiple GPUs  Data parallel: solve with a portion of the training data 11 Data parallel model parallel
  • 12. Address challenge 2: parallel reduction  Data parallel needs cross-GPU reduction 12 One-phase parallel reduction. Two-phase parallel reduction. Inter-socketIntra-socket
  • 13. Recap: cuMF tackled two challenges 13 • ALS needs to solve: spmm: cublas • Challenge 1: access and aggregate many θvs: irregular (R is sparse) and memory intensive LU or Cholesky decomposition: cublas • Challenge 2: Single GPU can NOT handle big m, n and nnz
  • 14. Connect cuMF to Spark MLlib 14  Spark applications relying on mllib/ALS need no change  Modified mllib/ALS detects GPU and offload computation  Leverage the best of Spark (scale-out) and GPU (scale-up) ALS apps mllib/ALS cuMFJNI
  • 15. Connect cuMF to Spark MLlib 1 Power 8 node + 2 K40 CUDA kernel GPU1 RDD CUDA kernel …RDD RDD RDD CUDA kernel GPU2 RDD CUDA kernel …RDD RDD RDD shuffle 1 Power 8 node + 2 K40 CUDA kernel GPU1 RDD CUDA kernel …RDD RDD RDD shuffle …  RDD on CPU: to distribute rating data and shuffle parameters  Solver on GPU: to form and solve  Able to run on multiple nodes, and multiple GPUs per node 15
  • 16. Implementation 16  In C (circa. 10k LOC)  CUDA 7.0/7.5, GCC OpenMP v3.0  Baselines: – Libmf: SGD on 1 node [RecSys14] – NOMAD: SGD on >1 nodes [VLDB 14] – SparkALS: ALS on Spark – FactorBird: SGD + parameter server for MF – Facebook: enhanced Giraph
  • 17. CuMF performance 17 X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set • 1 GPU vs. 30 cores, CuMF slightly faster than libmf and NOMAD • CuMF scales well on 1, 2, 4 GPUs
  • 18. Effectiveness of memory optimization 18 X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set • Aggressively using registers  2x • Using texture  25%-35% faster
  • 19. CuMF performance and cost 19 • CuMF @4 GPUs ≈ NOMAD @64 HPC nodes ≈ 10x NOMAD @32 AWS nodes • CuMF @4 GPUs ≈ 10x SparkALS @50 nodes ≈ 1% of its cost
  • 20. CuMF accelerated Spark on Power 8 20 • CuMF @2 K40s achieves 6+x speedup in training (193 sec  1.3k sec) *GUI designed by Amir Sanjar
  • 21. Conclusion  Why accelerate recommendation (matrix factorization) using GPU? – Need to be fast, scalable and economic  What are the challenges? – Memory access, scale to multiple GPUs  How cuMF tackles the challenges? – Optimize memory access, parallelism and communication  So what is the result? – Up to 10x as fast, 100x as cost-efficient – Use cuMF standalone or with Spark – GPU can tackle ML problems beyond deep learning! 21
  • 22. Wei Tan, IBM T. J. Watson Research Center wtan@us.ibm.com http://ibm.biz/wei_tan Thank you, questions? Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016, http://arxiv.org/abs/1603.03820 Source code available soon.