Serverless GPU Cloud with Job Scheduler and Container
Andrew Yongjoon Kong
CloudComputingCell, kakao
andrew.kong@kakaocorp.com
Who am I
Andrew Yongjoon Kong
• Cloud Technical Advisory for Government Broadcast Agency
• Adjunct Prof., Ajou Univ.
• Korea Database Agency, Acting Professor for Big Data
• Member of the National Information Agency Big Data Advisory Committee
• Kakao → Daum Kakao → Kakaocorp, Cloud Computing Cell lead
• Talks
• Embrace clouds (2017, OpenStack Days, Korea)
• Full route based network with linux (2016, netdev, Tokyo)
• SDN without SDN (2015, OpenStack, Vancouver)
Books: one supervised Korean edition, one Korean edition.
2nd editions are coming…
Serverless computing is rising
Serverless computing, GPU
Serverless computing, GPU, Docker
Serverless framework
There are lots of serverless frameworks:
• Apache OpenWhisk
• Iron.io
• OpenStack’s Picasso
• Gestalt (based on DC/OS)
• Fission (based on Kubernetes)
• Runway (kakao’s private FaaS)
What is the purpose of these frameworks?
• Connecting, mostly
• Flow and automation
Serverless framework
Connection is a great virtue in the public cloud
• There’s no resource depletion in the public cloud; connection/automation is directly related to cost savings
• In a private cloud, engineers scream for resources (especially GPUs)
• The thing is that the “winner takes it all”
• → so scheduling needs care
Job scheduler
Schedules users’ jobs based on an algorithm
• FIFO
• Fair share
• Backfill
• Preemption
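The policies listed above differ mainly in how they order the waiting queue. As a minimal sketch, assuming a toy queue and made-up usage numbers (the job IDs, user names, and the `fifo_order` / `fair_share_order` helpers are all hypothetical), FIFO can be compared with a deliberately simplified fair share. Production schedulers typically weigh decayed usage against configured shares, which this sketch does not attempt.

```bash
#!/bin/bash
# Hypothetical queue, one job per line: "job_id user submit_order".
queue='j1 alice 1
j2 alice 2
j3 bob 3
j4 alice 4
j5 carol 5'

# Hypothetical past usage per user (for example GPU-hours already consumed).
usage='alice 40
bob 5
carol 0'

# FIFO: run jobs strictly in submission order.
fifo_order() {
  printf '%s\n' "$queue" | sort -k3,3n \
    | awk '{printf "%s%s", sep, $1; sep=" "} END {print ""}'
}

# Simplified fair share: users with less past usage go first;
# ties fall back to submission order.
fair_share_order() {
  # Feed usage, a separator line, then the queue, so awk can join the two.
  { printf '%s\n' "$usage"; echo '--'; printf '%s\n' "$queue"; } \
    | awk '$1 == "--" {in_queue = 1; next}
           !in_queue  {used[$1] = $2; next}
                      {print used[$2], $3, $1}' \
    | sort -k1,1n -k2,2n \
    | awk '{printf "%s%s", sep, $3; sep=" "} END {print ""}'
}

echo "FIFO:       $(fifo_order)"
echo "Fair share: $(fair_share_order)"
```

With FIFO, alice's early submissions keep bob and carol waiting, which is the “winner takes it all” situation; the fair-share ordering moves the light users to the front.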
Job
A job comprises two parts
• The resources
• CPU, compute nodes, memory, disk, and even walltime
• The job scheduling system manages quotas per queue, user, and user group
• The runnable execution
• Traditionally, the executable command
• e.g. saved_model_cli run --dir /tmp/saved_model_dir --tag_set serve
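The quota side of the resource part can be pictured as an admission check that runs before a job is dispatched. The following is only a sketch under stated assumptions: the per-user GPU limits, the usage numbers, and the `lookup` / `admit` helpers are hypothetical and do not describe any particular scheduler's implementation.

```bash
#!/bin/bash
# Hypothetical per-user GPU quota and GPUs currently in use.
quota='alice 8
bob 4'
in_use='alice 6
bob 4'

# lookup TABLE KEY: print the value stored for KEY, or 0 if KEY is absent.
lookup() {
  printf '%s\n' "$1" \
    | awk -v key="$2" '$1 == key {print $2; found = 1} END {if (!found) print 0}'
}

# admit USER REQUESTED_GPUS: admit the job only if the user's running GPUs
# plus the new request still fit within the user's quota.
admit() {
  user=$1
  requested=$2
  limit=$(lookup "$quota" "$user")
  used=$(lookup "$in_use" "$user")
  if [ $((used + requested)) -le "$limit" ]; then
    echo admit
  else
    echo queue
  fi
}

admit alice 2
admit alice 4
admit bob 1
```

The same check could be keyed by queue or user group instead of user; only the lookup table changes.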
Job sample
Sample job script
The traditional issue is how to distribute the command and the data (you can’t choose the node in a batch system).
#!/bin/bash
# resource
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
# execution
cd /home/rcf-proj3/pv/test/
mkdir /test/test/dir
source /usr/usc/sas/default/setup.sh
sas my.sas
Job scheduler system layout
A shared filesystem can handle the file-locating issue.
→ But a shared filesystem is too expensive.
→ In a modern environment, this is much easier with containers.
http://beagle.ci.uchicago.edu/using-beagle/
This could be replaced by a container and a registry.
Job scheduler system, GPU and Container
Add the GPU resource to the job script.
Use NVIDIA Docker for the command.
…
Then the scheduler will do the job.
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
#PBS -l gpus=8
NV_GPU=$NV_GPU nvidia-docker run --net host \
  -e PASSWORD=root -e USERNAME=root \
  -e PORT=$PORT \
  idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev
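The job script above passes `NV_GPU` through without showing where it is set. One possibility, assuming a TORQUE-style scheduler that lists the GPUs assigned to the job in the file named by `$PBS_GPUFILE`, one `<host>-gpu<index>` entry per line, is to derive the comma-separated index list that nvidia-docker 1.x reads from `NV_GPU`. This is an assumption for illustration; the slides do not say how kakao populates the variable, and `gpu_ids_from_file` is a hypothetical helper.

```bash
#!/bin/bash
# gpu_ids_from_file FILE
# Hypothetical helper: turn lines such as "node01-gpu0" and "node01-gpu3"
# into the comma-separated list "0,3".
gpu_ids_from_file() {
  sed 's/.*-gpu//' "$1" \
    | awk '{printf "%s%s", sep, $1; sep=","} END {print ""}'
}

# Inside a job the scheduler is assumed to provide PBS_GPUFILE;
# outside a job this branch is simply skipped.
if [ -n "${PBS_GPUFILE:-}" ] && [ -r "$PBS_GPUFILE" ]; then
  NV_GPU=$(gpu_ids_from_file "$PBS_GPUFILE")
  export NV_GPU
  echo "GPUs assigned to this job: $NV_GPU"
fi
```

Restricting the container to the scheduler-assigned GPUs keeps two jobs that share a node from using the same devices.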
[Diagram: docker registry; master (scheduler) with a job; three compute nodes, each with a GPU]
AI Development Cycle over compute resource
[Diagram: four compute nodes, each with a GPU]
Develop the model in a personal environment
Train the model at large scale with massive data
Run inference through the model
Abstract these to “job(resource, executor)”.
The output is abstracted into a container.
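The “job(resource, executor)” abstraction can be sketched as a function that takes the resource request and a container image and emits a job script. This is a sketch only: `render_job` and its argument order are hypothetical, the directives mirror the GPU job script shown earlier in the deck, and nothing is submitted to a scheduler.

```bash
#!/bin/bash
# render_job NODES PPN WALLTIME GPUS IMAGE
# Hypothetical helper: the #PBS lines carry the resource part,
# the final line is the executor (a container run).
render_job() {
  cat <<EOF
#!/bin/bash
#PBS -l nodes=$1:ppn=$2
#PBS -l walltime=$3
#PBS -l gpus=$4
NV_GPU=\$NV_GPU nvidia-docker run --net host $5
EOF
}

# Print a job script for the image used in the deck.
render_job 1 2 00:00:59 8 idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev
```

Because the executor is just an image reference, development, training, and inference jobs would differ only in the resource arguments and the image they name.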
[Diagram labels: docker registry, master (scheduler), abstraction, JOB]
AI Service
The output is abstracted into a container.
By the way, you also need GPUs and other IT resources to show your work to the public.
And what about monitoring and alerting?
The good news: if you build your work as a container
→ kakao cloud can help you
kakao cloud
[Diagram labels]
• Service Repo., Service catalog, notification, scheduling
• IaaS: KRANE
• Centralized Measuring System: KEMI
• Centralized Deploying System: DKOS
• Management Plane
• DataCenter Control/Data plane
• Event / Alert, Initial Setup, Change
• IT operations, IT services
Some numbers about kakao cloud
                               2016.8      2017.9
Projects                       1563        2,xxx
Pull requests since 2014.9     632         913
VMs created/deleted per day    about 88    about 100
VMs                            8703        17,xxx
9x,xxx active cores
Some information about kakao cloud
                       2016.8                 2017.10
OpenStack releases     Grizzly to Kilo        Grizzly to Mitaka
Upgrades               5 times                7 times
Regions                4 in total             4 in total
Additional services    Heat, Trove, Sahara    Heat, Trove, Octavia, Barbican
Event monitoring/alert platform at kakao: KEMI
[Diagram labels]
• Monitoring: physical servers, virtual instances, containers, others (switches, logs)
• KEMI, IMS (kakao CMDB API), SB, rule engine, notification, ETL
• Data Center Information abstraction layer, API
• Predicting, scheduling
• Control: OpenStack Heat, other service APIs
• Data Center (or Service) Management Activity
• KEMI stats, KEMI log
Deployment abstraction at kakao: DKOS
[Diagram labels]
• Data Center
• User: defines resource
• VM, PM, container
• Service Catalogue
• Centralized Deploying System (DKOS)
• Resource pool, queue, scheduler, manager
DKOS Architecture
Services over DKOS
DKOS Situation
• Active clusters: 3 digits
• Total compute nodes: 4 digits (VM + PM)
• Container count: 5 digits
• Managed by?
DKOS Situation
• Why use DKOS (container)?
• Containers are easy
• Containers are cool
• DC/OS is great
• Nope!
• It is the very summit of the integrated/automated infra service API
• Authentication, authorization, compute resources, network, volumes
• Metering, logging
• Monitoring, notifications
kakao cloud now supports GPU as well
[Diagram labels]
• Service Repo., Service catalog, notification, scheduling
• IaaS: KRANE
• Centralized Measuring System: KEMI
• Centralized Deploying System: DKOS
• Management Plane
• DataCenter Control/Data plane
• Event / Alert, Initial Setup, Change
• IT operations, IT services
Thanks
Where are you from a CMMI-Cloud perspective?
For CMM4, it is time to embrace clouds, not a cloud
Level    Description                   Output
CMM0     legacy                        cloudTF
CMM1     self-service dev resources    krane (OpenStack cloud)
CMM2     limited prod resources        kemi (MaaS)
CMM3     automated cloud usage         DKOS (CaaS)
CMM4     manual cloud usage            --
CMM5     federated cloud usage         --

 

GPU cloud with Job scheduler and Container

  • 1. Serverless GPU Cloud with Job scheduler and Container Andrew yongjoon kong CloudComputingCell, kakao andrew.kong@kakaocorp.com
  • 2. Who am I: Andrew Yongjoon Kong • Cloud Technical Advisory for Government Broad Cast Agency • Adjunct Prof. Ajou Univ • Korea Data Base Agency Acting Professor for Bigdata • Member of National Information Agency Bigdata Advisory committee • Kakao → Daum Kakao → Kakaocorp, Cloud Computing Cell lead • Talks • Embrace clouds (2017, OpenStack Days, Korea) • Full route based network with linux (2016, netdev, Tokyo) • SDN without SDN (2015, OpenStack, Vancouver) • Supervised, Korean edition • Korean Edition • 2nd Editions are coming…
  • 6. Serverless framework There are many serverless frameworks: • Apache OpenWhisk • Iron.io • OpenStack’s Picasso • Gestalt (based on DC/OS) • Fission (based on Kubernetes) • Runway (kakao’s private FaaS) What is the purpose of these frameworks? • connecting, mostly • flow and automation
  • 7. Serverless framework Connection is a real virtue in the public cloud • there is no resource depletion in the public cloud, so connection/automation is directly related to cost savings • in a private cloud, engineers scream for resources (especially GPUs) • the problem is that “winner takes it all” • → so scheduling needs care
  • 8. Job scheduler Schedules users’ jobs according to an algorithm • FIFO • Fair Share • Backfill • Preemption
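
To make the difference between two of the policies on slide 8 concrete, here is a minimal sketch comparing FIFO with a very simplified fair share. The `Job` fields, the `gpu_hours` cost measure and the "least usage so far goes first" rule are assumptions made for illustration; real schedulers use more elaborate fair-share formulas.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    user: str
    gpu_hours: int  # hypothetical cost measure: requested GPUs x walltime


def fifo_order(queue):
    """FIFO: run jobs strictly in submission order."""
    return [job.name for job in queue]


def fair_share_order(queue):
    """Simplified fair share: pick the job whose user has consumed the least so far."""
    usage = defaultdict(int)
    pending = list(queue)
    order = []
    while pending:
        # min() returns the first minimal element, so ties go to the oldest job.
        nxt = min(pending, key=lambda job: usage[job.user])
        pending.remove(nxt)
        usage[nxt.user] += nxt.gpu_hours
        order.append(nxt.name)
    return order


# alice submits three jobs before bob submits one.
queue = [
    Job("a1", "alice", 8),
    Job("a2", "alice", 8),
    Job("a3", "alice", 8),
    Job("b1", "bob", 4),
]

print(fifo_order(queue))
print(fair_share_order(queue))
```

Under FIFO, bob waits behind all of alice's jobs, which is the "winner takes it all" situation from slide 7. Under the simplified fair share, bob's job moves up as soon as alice has consumed some GPU hours.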
  • 9. Job A job comprises two parts • The resources • CPU, compute nodes, memory, disk and even walltime • The job scheduling system manages the quota per queue, user and user group • The runnable execution • Traditionally, the executable command • e.g. saved_model_cli run --dir /tmp/saved_model_dir --tag_set serve
  • 10. Job sample Sample job script. The traditional issue is how to distribute the command and the data (you can’t specify the node in a batch system). In the script, the #PBS lines are the resource part and the commands below them are the execution part:
      #!/bin/bash
      #PBS -l nodes=1:ppn=2
      #PBS -l walltime=00:00:59
      cd /home/rcf-proj3/pv/test/
      mkdir /test/test/dir
      source /usr/usc/sas/default/setup.sh
      sas my.sas
  • 11. Job scheduler system layout A shared file system can handle the file-location issue. → But a shared file system is too expensive. → In a modern environment this is much easier with containers. (Layout figure: http://beagle.ci.uchicago.edu/using-beagle/) This part could be replaced by a container and a registry.
  • 12. Job scheduler system, GPU and Container Add the GPU resource to the job script and use NVIDIA Docker for the command … then the scheduler will do the job.
      #!/bin/bash
      #PBS -l nodes=1:ppn=2
      #PBS -l walltime=00:00:59
      #PBS -l gpus=8
      NV_GPU=$NV_GPU nvidia-docker run --net host -e PASSWORD=root -e USERNAME=root -e PORT=$PORT idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev
    (Diagram labels: job, master (scheduler), docker registry, three GPU compute nodes.)
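
The two-part job from slide 9 (resources plus a runnable command) maps directly onto the script on slide 12. The sketch below renders such a script from a small description. The `Resources` class and `render_job_script` function are hypothetical helpers, and the `#PBS -l` directives are copied from the slides; whether a given scheduler installation accepts `gpus=` in that form is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    nodes: int
    ppn: int          # processors per node
    walltime: str     # HH:MM:SS
    gpus: int = 0     # 0 means no GPU directive is emitted


def render_job_script(res, command):
    """Build the job script text: resource directives first, then the command."""
    lines = [
        "#!/bin/bash",
        f"#PBS -l nodes={res.nodes}:ppn={res.ppn}",
        f"#PBS -l walltime={res.walltime}",
    ]
    if res.gpus:
        lines.append(f"#PBS -l gpus={res.gpus}")
    lines.append(command)
    return "\n".join(lines) + "\n"


# The container command from slide 12, kept verbatim as the executor part.
command = (
    "NV_GPU=$NV_GPU nvidia-docker run --net host "
    "-e PASSWORD=root -e USERNAME=root -e PORT=$PORT "
    "idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev"
)

script = render_job_script(Resources(nodes=1, ppn=2, walltime="00:00:59", gpus=8), command)
print(script)
```

Because the executor is a container image rather than a path on a shared file system, the same description can be submitted to any node that can reach the registry.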
  • 13. AI Development Cycle over compute resource • Training the model at large scale with massive data • Inference through the model • Developing the model in a personal environment • Abstract these to “job(resource, executor)” • The output is abstracted into a container (Diagram labels: JOB, abstraction, master (scheduler), docker registry, four GPU compute nodes.)
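
The master (scheduler) in the diagrams on slides 12 and 13 has to decide which GPU compute node runs each job. The slides do not say how that decision is made, so the sketch below assumes a plain first-fit rule over free GPU counts; the node names and GPU counts are invented for illustration.

```python
def place(job_gpus, free_gpus):
    """First fit: return the first node with enough free GPUs, or None to keep the job queued."""
    for node, free in free_gpus.items():
        if free >= job_gpus:
            free_gpus[node] = free - job_gpus  # reserve the GPUs on that node
            return node
    return None


# Hypothetical pool: free GPUs per compute node.
free = {"node1": 2, "node2": 8, "node3": 8}

# Four jobs requesting 8, 4, 8 and 2 GPUs, in submission order.
placements = [place(gpus, free) for gpus in (8, 4, 8, 2)]
print(placements)
print(free)
```

The third job stays queued because no single node has 8 free GPUs left, while the smaller fourth job still fits on `node1`. Letting a small job run ahead like this is the basic idea behind backfill, although real backfill also checks walltime so that the waiting job is not delayed.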
  • 14. AI Service The output is abstracted into a container. By the way, you also need GPUs and other IT resources to show your work to the public. And what about monitoring and alerting? The good thing is that if you package your work as a container → kakao cloud can help you
  • 15. kakao cloud Service Repo. Service catalog notification scheduling IaaS: KRANE Centralized Measuring System: KEMI Centralized Deploying System: DKOS Management Plane DataCenter Control/Data plane Event / Alert Initial Setup Change IT operations. IT Services.
  • 16. Some Numbers about kakao cloud • 2016.8: 1563 projects; 632 pull requests since 2014.9; about 88 VMs created/deleted per day; 8703 VMs • 2017.9: 2,xxx projects; 913 pull requests since 2014.9; about 100 VMs created/deleted per day; 17,xxx VMs • 9x,xxx active cores
  • 17. Some information about kakao cloud (Kakaocorp) • 2016.8: upgraded 5 times, from Grizzly to Kilo; 4 regions in total; additional services: Heat / Trove / Sahara • 2017.10: upgraded 7 times, from Grizzly to Mitaka; 4 regions in total; additional services: Heat / Trove / Octavia / Barbican
  • 18. event monitoring/alert platform kakao, KEMI Physical Servers Virtual Instances Containers Others (switches, logs) monitoring KEMI IMS (kakao CMDB API) SB Rule Engine Notification ETL Data Center Information abstraction layer API predicting scheduling Openstack Heat Other Service API Data Center (or Service) Management Activity control KEMI stats KEMI log
  • 19. Deployment abstraction in Kakao, DKOS Data Center User: Defines resource VM PM container Service Catalogue Centralized Deploying System (DKOS) Resource Pool Queue scheduler manager
  • 22. DKOS Situation • Active clusters: 3 digits • Total compute nodes: 4 digits (VM + PM) • Container count: 5 digits • Managed by?
  • 23. DKOS Situation • Why use DKOS (containers)? • Containers are easy • Containers are cool • DC/OS is great • Nope! • It is the summit of the integrated/automated infra service APIs • authentication, authorization, compute resources, network, volumes • Metering, logging • Monitoring, notifications
  • 24. kakao cloud now supports GPU as well Service Repo. Service catalog notification scheduling IaaS: KRANE Centralized Measuring System: KEMI Centralized Deploying System: DKOS Management Plane DataCenter Control/Data plane Event / Alert Initial Setup Change IT operations. IT Services.
  • 26. Where are you from a CMMI-Cloud perspective? For CMM4: time to embrace Clouds, not a Cloud • CMM0 legacy, output: cloudTF • CMM1 self-service Dev resources, output: krane (OpenStack cloud) • CMM2 limited Prod resources, output: kemi (MaaS) • CMM3 Automated Cloud Usage, output: DKOS (CaaS) • CMM4 Manual Cloud Usage, output: -- • CMM5 Federated Cloud Usage, output: --