GPU cloud with Job scheduler and Container
1. Serverless GPU Cloud with Job scheduler and Container
Andrew Yongjoon Kong
Cloud Computing Cell, kakao
andrew.kong@kakaocorp.com
2. Who am I: Andrew Yongjoon Kong
• Cloud Technical Advisory for Government Broadcast Agency
• Adjunct Prof., Ajou Univ.
• Korea Database Agency, Acting Professor for Bigdata
• Member of National Information Agency Bigdata Advisory Committee
• Kakao → Daum Kakao → Kakaocorp, Cloud Computing Cell lead
• Talks
• Embrace clouds (2017, OpenStack Days, Korea)
• Full route based network with Linux (2016, netdev, Tokyo)
• SDN without SDN (2015, OpenStack Summit, Vancouver)
• Supervised Korean editions (2nd editions are coming…)
6. Serverless framework
There are lots of serverless frameworks:
• Apache OpenWhisk
• Iron.io
• OpenStack's Picasso
• Gestalt (based on DC/OS)
• Fission (based on Kubernetes)
• Runway (kakao's private FaaS)
What is these frameworks' purpose?
• connecting, mostly
• flow and automation
7. Serverless framework
Connection is a very good virtue in the public cloud:
• there is no resource depletion in the public cloud, so connection/automation is directly related to cost savings.
• in a private cloud, the engineers are screaming for resources (especially GPUs).
• The thing is, "winner takes it all"
• → so care about scheduling
9. Job
A job comprises two parts:
• The resources
• CPU, compute nodes, memory, disk, and even walltime
• The job scheduling system manages the quota per queue, user, and user group
• The runnable execution
• Traditionally, the executable command
• e.g. saved_model_cli run --dir /tmp/saved_model_dir --tag_set serve
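The per-queue/per-user quota the slide mentions is ordinarily scheduler configuration. A hedged sketch of what that could look like in a Torque-style qmgr session; the queue name "gpu" and the limit values are illustrative assumptions, not taken from the deck:

```shell
# Illustrative Torque-style queue setup: cap walltime per queue and
# concurrent jobs per user. Adapt names/limits to your site.
qmgr -c "create queue gpu queue_type=execution"
qmgr -c "set queue gpu resources_max.walltime=24:00:00"
qmgr -c "set queue gpu max_user_run=2"   # per-user concurrent job limit
qmgr -c "set queue gpu enabled=true"
qmgr -c "set queue gpu started=true"
```

Jobs submitted to this queue are then rejected or held by the scheduler once the quota is reached, which is the "care for scheduling" point from the earlier slide.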
10. Job sample
Sample job script. The traditional issue is how we distribute the command and the data (you can't specify a node in a batch system).

#!/bin/bash
#PBS -l nodes=1:ppn=2          # resource
#PBS -l walltime=00:00:59      # resource
cd /home/rcf-proj3/pv/test/source
mkdir /test/test/dir
/usr/usc/sas/default/setup.sh sas my.sas   # execution
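The resource-vs-execution split annotated above can be checked mechanically: the scheduler reads only the #PBS directive lines, and the shell runs everything else. A small sketch, writing the sample script to a hypothetical temp path:

```shell
#!/bin/bash
# Write the sample job script (same content as the slide) to a temp file.
cat > /tmp/my_job.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
cd /home/rcf-proj3/pv/test/source
mkdir -p /test/test/dir
/usr/usc/sas/default/setup.sh sas my.sas
EOF

# Resource part: the directives the scheduler parses (two lines here).
grep '^#PBS' /tmp/my_job.pbs

# Execution part: everything that is not a comment/directive
# (the cd, mkdir and setup.sh lines).
grep -v '^#' /tmp/my_job.pbs
```

Submission is then a single `qsub /tmp/my_job.pbs` against a PBS/Torque-style scheduler.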
11. Job scheduler system layout
A shared filesystem can handle the file-locating issue.
→ but a shared filesystem is too expensive.
→ in a modern environment this is much easier with a container: the problem can be solved by a container and a registry.
(diagram: http://beagle.ci.uchicago.edu/using-beagle/)
12. Job scheduler system, GPU and Container
Add a GPU resource to the job script, and use NVIDIA Docker for the command…
then the scheduler will do the job.

#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:00:59
#PBS -l gpus=8
NV_GPU=$NV_GPU nvidia-docker run --net host \
  -e PASSWORD=root -e USERNAME=root -e PORT=$PORT \
  idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev

(diagram: the master/scheduler dispatches the job to GPU compute nodes, which pull the image from the docker registry)
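For the NV_GPU variable used above: Torque-style schedulers record the GPUs granted to a job as one "<host>-gpu<N>" line per device in $PBS_GPUFILE. A minimal sketch of turning that file into the comma-separated index list nvidia-docker expects; the file name and line format vary by scheduler, so treat this as an assumption to adapt:

```shell
#!/bin/bash
# Derive NV_GPU from the scheduler's GPU assignment file.
gpufile="${PBS_GPUFILE:-/tmp/example_gpufile}"
# Demo data when not running under a scheduler:
[ -f "$gpufile" ] || printf 'node1-gpu0\nnode1-gpu1\n' > "$gpufile"

# Keep only the numeric index of each line, join with commas:
# "node1-gpu0" + "node1-gpu1" -> "0,1"
NV_GPU=$(sed 's/.*gpu//' "$gpufile" | paste -s -d, -)
export NV_GPU
echo "NV_GPU=$NV_GPU"

# nvidia-docker then pins the container to exactly those devices:
# NV_GPU=$NV_GPU nvidia-docker run --net host idock.daumkakao.io/dkos/nvidia-cuda-sshd:dev
```

This is what prevents the "winner takes it all" problem: each job's container only sees the GPUs the scheduler granted it.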
13. AI Development Cycle over compute resource
• Develop the model in a personal environment
• Train the model at large scale with massive data
• Inference through the model
Abstract these to "job(resource, executor)".
The output is abstracted to a container.
(diagram: master/scheduler and docker registry in front of GPU compute nodes; the abstraction is the JOB)
14. AI Service
The output is abstracted to a container.
BTW, you need GPUs and other IT resources to show your effort to the public as well.
And what about monitoring & alerts?
The good thing is that if you build your effort with containers,
→ kakao cloud can help you.
15. kakao cloud
(architecture diagram)
• Management plane: Service Repo., Service catalog, notification, scheduling
• Data center control/data plane:
• IaaS: KRANE
• Centralized Measuring System: KEMI
• Centralized Deploying System: DKOS
• IT operations and IT services drive Initial Setup and Change; the platform emits Event / Alert
16. Some numbers about kakao cloud
2016.8: 1,563 projects / 632 pull requests since 2014.9 / about 88 VMs created or deleted per day / 8,703 VMs
2017.9: 2,xxx projects / 913 pull requests since 2014.9 / about 100 VMs created or deleted per day / 17,xxx VMs / 9x,xxx active cores
17. Some information about kakao cloud (Kakaocorp)
2016.8: from Grizzly to Kilo, upgraded 5 times, total 4 regions, additional services: Heat / Trove / Sahara
2017.10: from Grizzly to Mitaka, upgraded 7 times, total 4 regions, additional services: Heat / Trove / Octavia / Barbican
18. Event monitoring/alert platform of kakao, KEMI
(diagram)
• Monitored sources: physical servers, virtual instances, containers, others (switches, logs)
• KEMI stats / KEMI log pipeline: ETL, SB rule engine, notification, API
• IMS (kakao CMDB API) serves as the data center information abstraction layer
• The outputs drive monitoring, predicting, scheduling and control via OpenStack Heat and other service APIs: data center (or service) management activity
19. Deployment abstraction in kakao, DKOS
(diagram)
• The user defines resources (VM, PM, container) through the Service Catalogue
• The Centralized Deploying System (DKOS) runs a queue, a scheduler and a manager
• DKOS schedules onto the data center resource pool
23. DKOS Situation
• Why use DKOS (containers)?
• Container is easy
• Container is cool
• DC/OS is great
• Nope!
• It is the very summit of an integrated/automated infra service API:
• authentication, authorization, compute resources, network, volumes
• metering, logging
• monitoring, notifications
24. kakao cloud now supports GPU as well
(same architecture diagram as slide 15: KRANE / KEMI / DKOS)
26. Where are you from a CMMI-Cloud perspective?
For CMM4, it is time to embrace Clouds, not a Cloud.
• CMM0: legacy (output: cloud TF)
• CMM1: self-service dev resources (output: KRANE, the OpenStack cloud)
• CMM2: limited prod resources (output: KEMI, MaaS)
• CMM3: automated cloud usage (output: DKOS, CaaS)
• CMM4: manual cloud usage (output: --)
• CMM5: federated cloud usage (output: --)