Baptiste Gerondeau, Takeharu Kato, Renato Golin
Arm Workshop, 26th July 2018
OpenHPC Automation with Ansible
Outline
● Linaro’s HPC-SIG Lab
● OpenHPC Ansible Automation
● Results
The HPC-SIG Lab
Goals:
● Cluster Automation & Validation
● Benchmarking & Performance Investigation
Requirements:
● Stability & Repeatability
● Close-to-production environment
● Upstream technology (reproducibility)
● Vendor isolation (hardware, results)
The HPC-SIG Lab
Network Layout
● Uplink subnet
○ Access/Provision
● BMC subnet
● MPI subnet
○ 100Gb InfiniBand
○ Slave provision
● File System subnet
○ 100Gb Ethernet
○ Lustre/Ceph (future)
The HPC-SIG Lab
● Stability & Repeatability
○ Critical external components cached locally
○ Strict migration plans (staging)
● Close-to-production environment
○ Hardware and firmware updated frequently
● Upstream technology (reproducibility)
○ All components open source / upstream
● Vendor isolation (hardware, results)
○ VPN, Provisioner, Jenkins, SSH control
Open Source and Upstream
Lab admin & tools:
● Jenkins: https://jenkins.io/
● Mr-Provisioner: https://github.com/Linaro/mr-provisioner
Lab Automation:
● https://github.com/Linaro/ans_setup_jenkins
● https://github.com/Linaro/mr-provisioner-role
● https://github.com/Linaro/mr-provisioner-kea-dhcp4-role
● https://github.com/Linaro/ansible-role-mr-provisioner
● https://github.com/Linaro/mr-provisioner-client
HPC-specific automation:
● https://github.com/Linaro/hpc_lab_setup
● https://github.com/Linaro/hpc_deploy_benchmarks
● https://github.com/Linaro/benchmark_harness
● https://github.com/Linaro/ansible-playbook-for-ohpc
Existing OpenHPC Automation
● A recipe with all rules described in the official documents
● LaTeX snippets containing shell code
● Snippets are converted and merged into an (OS-dependent) Bash script
● Scripts ship in OHPC packages, installed after the main OpenHPC package is installed
● Plus an input.local file, with some cluster-specific configuration / environment
Existing OpenHPC Automation
Shortcomings:
● The input.local file exports shell variables
● The recipe.sh is not idempotent
● Extensibility is impossible without editing the files
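The idempotency point can be illustrated with a minimal Python sketch. It is not taken from recipe.sh: the function names and the sample configuration lines are hypothetical, and it only contrasts an append-style step (similar in spirit to `echo ... >> file`) with an ensure-style step that is safe to rerun.

```python
def append_line(lines, line):
    """Non-idempotent: every run adds another copy of the line."""
    return lines + [line]


def ensure_line(lines, line):
    """Idempotent: adds the line only if it is missing, so reruns change nothing."""
    return lines if line in lines else lines + [line]


config = ["server 10.0.0.1"]            # hypothetical file contents
setting = "restrict default nomodify"   # hypothetical setting to add

# Simulate running the same step twice.
twice_appended = append_line(append_line(config, setting), setting)
twice_ensured = ensure_line(ensure_line(config, setting), setting)

print(len(twice_appended))  # 3: the setting was duplicated
print(len(twice_ensured))   # 2: the second run was a no-op
```

Running the append-style step twice leaves a duplicated line, while the ensure-style step converges to the same state however often it runs.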
Ansible Playbooks
Ansible is a widely used automation tool which can describe the structure and
configuration of IT infrastructure with YAML “playbooks” and “roles”.
OpenHPC with Ansible:
● Ansible playbooks can more easily be made idempotent
● Ansible can manage nodes/tasks according to the structure of the cluster
● Configuration is passed as a YAML file (no environment handling)
● Composition, using playbooks and roles, building on third-party content
Ansible OpenHPC Benefits
● Flexible cluster configuration
○ Fine grained / composable
○ Cluster wide / node group wide / node specific
● Works on both x86_64 and AArch64
○ Ansible gathers information about architecture
○ Same playbook runs on both
● OS is directly inferred by Ansible (gather_facts)
○ Ansible gathers information about OS
○ Yum, apt, zypper… can be switched in the roles’ logic
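As a rough illustration of switching the package manager on gathered facts, here is a small Python sketch. It is a stand-in for the roles' logic, not the roles' actual code: `pick_package_manager` and the `facts` dictionary are hypothetical, and the keys are assumed to mirror the values Ansible reports for `ansible_os_family` ("RedHat", "Debian", "Suse").

```python
# Assumed mapping from OS family (as reported in Ansible facts) to package manager.
PACKAGE_MANAGERS = {
    "RedHat": "yum",
    "Debian": "apt",
    "Suse": "zypper",
}


def pick_package_manager(facts):
    """Return the package manager for the OS family found in the gathered facts."""
    family = facts["os_family"]
    if family not in PACKAGE_MANAGERS:
        raise ValueError(f"unsupported OS family: {family}")
    return PACKAGE_MANAGERS[family]


# Hypothetical facts for one AArch64 node; the architecture does not change the choice.
facts = {"os_family": "RedHat", "architecture": "aarch64"}
print(pick_package_manager(facts))
```

The same lookup serves x86_64 and AArch64 nodes, since only the OS family drives the choice.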
Ansible OpenHPC Recipe
The basic structure of the Ansible playbook
playbook
+---- group_vars/
|     +-- all.yml               … cluster-wide configuration
|     +-- group1, group2, …     … node-group-specific configuration (e.g. compute nodes, login, DFS)
+---- host_vars/
|     +--- host1, host2, …      … host-specific configuration
+---- roles/                    … package-specific tasks, templates, config files, and config variables
      +--- package-name/
           +--- tasks/main.yml    … YAML file describing how package-name is installed
           +--- vars/main.yml     … package-specific configuration variables
           +--- templates/*.j2    … template files used to generate configuration files
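The layering of cluster-wide, group-wide and host-specific settings can be sketched in Python as follows. This only models the usual Ansible behaviour where host_vars override group_vars, which in turn override group_vars/all.yml; the function and the variable names (`ntp_server`, `slurm_partition`, `enable_ib`) are hypothetical and do not come from the playbook.

```python
def resolve_vars(all_vars, group_vars, host_vars):
    """Merge variables so that more specific scopes override more general ones."""
    merged = dict(all_vars)      # cluster-wide defaults (group_vars/all.yml)
    merged.update(group_vars)    # node-group overrides (group_vars/<group>)
    merged.update(host_vars)     # host overrides (host_vars/<host>)
    return merged


# Hypothetical settings at each level.
cluster = {"ntp_server": "10.0.0.1", "slurm_partition": "normal", "enable_ib": False}
compute_group = {"enable_ib": True}
host1 = {"slurm_partition": "debug"}

print(resolve_vars(cluster, compute_group, host1))
```

A key set at a more specific level replaces the value from the broader level; keys set only in all.yml pass through unchanged.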
Ansible OpenHPC Tasks
Upstreaming our work
After multiple discussions, our proposal is:
● Generate from LaTeX sources at the same time as recipe.sh
○ Each block/section as a separate role
○ Control OS/Arch via gather_facts/config
○ xCAT vs Warewulf, Slurm vs PBS: only add what hasn’t been covered yet
● Push to a separate repository (openhpc/ansible-role ?)
○ Control release branches/tags so as not to overwrite previous recipes
○ Repo can have additional playbooks (CI, new users)
○ Once release is validated, push to Ansible Galaxy
● Direct pull requests to playbooks / support roles
○ Only recipe roles are auto-generated
○ Support roles (specific to Ansible, playbook) are maintained on the repo
Results
Test Suite
Most tests green, however:
● Intel-specific tests (CILK, TBB, IMB) disabled
● Others need package install (PDF, CDF, HDF), but pass when installed
● TAU fails because LMod defaults to openmpi (needs openmpi3)
● Lustre fails as package depends on kernel 4.2 (which won’t work on our machines)
● MiniDFT and PRK had make failures, but we haven’t investigated yet
● --enable-long doesn’t really enable the long tests; we need to look into why not
The plan from now on is:
1. Automate package install conditional on enabled tests, fix remaining errors
2. Work with members to prioritise long term ones (like Lustre)
3. Use it for additional packages, so we can test them before sending upstream
4. Add a benchmark mode, making sure to use entire cluster
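Step 1 of the plan could look roughly like the Python sketch below. It is an assumption about one possible approach, not the implemented one: the test names, the `example-*` package names and the `packages_for` helper are all hypothetical placeholders.

```python
# Hypothetical mapping from a test-suite component to the packages it needs.
TEST_REQUIREMENTS = {
    "hdf5": ["example-hdf5-pkg"],
    "netcdf": ["example-netcdf-pkg", "example-hdf5-pkg"],
    "tau": ["example-tau-pkg"],
}


def packages_for(enabled_tests):
    """Collect the packages needed by the enabled tests, without duplicates."""
    needed = []
    for test in enabled_tests:
        for pkg in TEST_REQUIREMENTS.get(test, []):
            if pkg not in needed:
                needed.append(pkg)
    return needed


print(packages_for(["netcdf", "hdf5"]))
```

Only packages required by tests that are actually enabled end up in the install list, and tests with no known requirements contribute nothing.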
Thank You!

 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018

  • 1. Baptiste Gerondeau, Takeharu Kato, Renato Golin
    Arm Workshop, 26th July 2018
    OpenHPC Automation with Ansible
  • 2. Outline
    ● Linaro’s HPC-SIG Lab
    ● OpenHPC Ansible Automation
    ● Results
  • 3. The HPC-SIG Lab
    Goals:
    ● Cluster Automation & Validation
    ● Benchmarking & Performance Investigation
    Requirements:
    ● Stability & Repeatability
    ● Close-to-production environment
    ● Upstream technology (reproducibility)
    ● Vendor isolation (hardware, results)
  • 4. The HPC-SIG Lab
    Network Layout
    ● Uplink subnet
      ○ Access/Provision
    ● BMC subnet
    ● MPI subnet
      ○ 100GB InfiniBand
      ○ Slave provision
    ● File System subnet
      ○ 100GB Ethernet
      ○ Lustre/Ceph (future)
  • 5. The HPC-SIG Lab
    ● Stability & Repeatability
      ○ Critical external components cached locally
      ○ Strict migration plans (staging)
    ● Close-to-production environment
      ○ Hardware and firmware updated frequently
    ● Upstream technology (reproducibility)
      ○ All components open source / upstream
    ● Vendor isolation (hardware, results)
      ○ VPN, Provisioner, Jenkins, SSH control
  • 6. Open Source and Upstream
    Lab admin & tools:
    ● Jenkins: https://jenkins.io/
    ● Mr-Provisioner: https://github.com/Linaro/mr-provisioner
    Lab Automation:
    ● https://github.com/Linaro/ans_setup_jenkins
    ● https://github.com/Linaro/mr-provisioner-role
    ● https://github.com/Linaro/mr-provisioner-kea-dhcp4-role
    ● https://github.com/Linaro/ansible-role-mr-provisioner
    ● https://github.com/Linaro/mr-provisioner-client
    HPC-specific automation:
    ● https://github.com/Linaro/hpc_lab_setup
    ● https://github.com/Linaro/hpc_deploy_benchmarks
    ● https://github.com/Linaro/benchmark_harness
    ● https://github.com/Linaro/ansible-playbook-for-ohpc
  • 7. Existing OpenHPC Automation
    ● A recipe with all rules described in the official documents
    ● LaTeX snippets containing shell code
    ● These are converted and merged into an (OS-dependent) bash script
    ● Scripts ship in OHPC packages, installed after the OpenHPC main package is installed
    ● Plus an input.local file, with some cluster-specific configuration / environment
  • 8. Existing OpenHPC Automation
    Shortcomings:
    ● The input.local file exports shell variables
    ● The recipe.sh is not idempotent
    ● Extensibility is impossible without editing the files
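To make the idempotency point concrete, here is a minimal Python sketch (not taken from recipe.sh or from the playbooks) contrasting a step that blindly appends to a file with one that checks state first. The function names and the exports-style entry are hypothetical and chosen only for illustration.

```python
import tempfile
from pathlib import Path


def append_line(path: Path, line: str) -> None:
    """Non-idempotent step: every run adds another copy of the line."""
    with path.open("a") as f:
        f.write(line + "\n")


def ensure_line(path: Path, line: str) -> bool:
    """Idempotent step: add the line only if it is missing.

    Returns True when the file was changed, False when it was already correct.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    with path.open("a") as f:
        f.write(line + "\n")
    return True


def demo() -> tuple:
    """Run both steps twice and report how many lines each file ends up with."""
    entry = "/home *(rw,no_subtree_check)"  # hypothetical config entry
    with tempfile.TemporaryDirectory() as tmp:
        naive = Path(tmp) / "naive.conf"
        safe = Path(tmp) / "safe.conf"
        for _ in range(2):  # simulate running the whole recipe twice
            append_line(naive, entry)
            ensure_line(safe, entry)
        return (len(naive.read_text().splitlines()),
                len(safe.read_text().splitlines()))


if __name__ == "__main__":
    print(demo())
```

Running the naive step twice leaves a duplicated entry, while the check-then-change step leaves the file unchanged on the second run and can report that nothing was modified.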
  • 9. Ansible Playbooks
    Ansible is a widely used automation tool which can describe the structure and configuration of IT infrastructure with YAML “playbooks” and “roles”.
    OpenHPC with Ansible:
    ● Ansible playbooks can more easily be made idempotent
    ● Ansible can manage nodes/tasks according to the structure of the cluster
    ● Configuration is passed as a YAML file (no environment handling)
    ● Composition, using playbooks and roles, building on third-party content
  • 10. Ansible OpenHPC Benefits
    ● Flexible cluster configuration
      ○ Fine grained / composable
      ○ Cluster wide / node group wide / node specific
    ● Works on both x86_64 and AArch64
      ○ Ansible gathers information about architecture
      ○ Same playbook runs on both
    ● OS is directly inferred by Ansible (gather_facts)
      ○ Ansible gathers information about OS
      ○ Yum, apt, zypper… can be switched in the roles’ logic
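The idea of switching the package manager on gathered facts can be sketched in plain Python. This is only an illustration of the branching logic, not how Ansible or the playbooks implement it: the `facts` dictionary uses simplified keys (Ansible itself exposes facts such as `ansible_os_family` and `ansible_architecture`), and `INSTALL_COMMANDS` and `install_command` are hypothetical names.

```python
# Hypothetical mapping from an OS-family fact to a package install command.
INSTALL_COMMANDS = {
    "RedHat": ["yum", "install", "-y"],
    "Debian": ["apt-get", "install", "-y"],
    "Suse": ["zypper", "--non-interactive", "install"],
}


def install_command(facts: dict, package: str) -> list:
    """Build the install command for `package` from gathered facts.

    Only the OS family is consulted; the architecture fact is deliberately
    ignored, so the same logic serves x86_64 and AArch64 nodes.
    """
    family = facts["os_family"]
    if family not in INSTALL_COMMANDS:
        raise ValueError(f"unsupported OS family: {family}")
    return INSTALL_COMMANDS[family] + [package]


if __name__ == "__main__":
    # Example facts for two nodes of different OS and architecture.
    nodes = [
        {"os_family": "RedHat", "architecture": "aarch64"},
        {"os_family": "Suse", "architecture": "x86_64"},
    ]
    for facts in nodes:
        print(facts["architecture"], install_command(facts, "ohpc-base"))
```

Because the decision is driven by data gathered from each node, nothing in the cluster description itself has to name the OS or the architecture.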
  • 11. Ansible OpenHPC Recipe
    The basic structure of the Ansible playbook:
    playbook
    +---- group_vars/
          +-- all.yml              cluster wide configurations
          +-- group1,group2 …      node group (e.g., computing nodes, login, DFS) specific configurations
    +---- host_vars/
          +--- host1,host2 …       host specific configurations
    +---- roles/                   package specific tasks, templates, config files, and config variables
          +--- package-name/
               +--- tasks/main.yml …   YAML file to describe installation method of package-name
               +--- vars/main.yml …    package specific configuration variables
               +--- templates/*.j2 …   template files to generate configuration files
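The layering of cluster-wide, group-specific and host-specific settings in this layout can be sketched as a dictionary merge in which more specific layers override more general ones. This is a simplified model under stated assumptions: Ansible's real precedence rules have more levels (role variables, for instance, are handled separately), and the variable names, group names and host names below are hypothetical.

```python
def effective_vars(all_vars, group_vars, host_vars, groups, host):
    """Merge variables from least to most specific.

    all_vars:   dict standing in for group_vars/all.yml
    group_vars: dict of group name -> dict, one per group file
    host_vars:  dict of host name -> dict, one per host file
    groups:     groups the host belongs to, in the order they are applied
    """
    merged = dict(all_vars)                      # cluster wide defaults
    for group in groups:
        merged.update(group_vars.get(group, {}))  # node group overrides
    merged.update(host_vars.get(host, {}))        # host specific overrides
    return merged


if __name__ == "__main__":
    # All names and values here are made up for illustration.
    all_vars = {"scheduler": "slurm", "enable_ib": False, "ntp_server": "ntp.example.org"}
    group_vars = {"compute": {"enable_ib": True}}
    host_vars = {"node01": {"ntp_server": "10.0.0.1"}}
    print(effective_vars(all_vars, group_vars, host_vars, ["compute"], "node01"))
```

A host with no entry in any group or host file simply receives the cluster-wide values, which is what keeps per-node files small.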
  • 13. Upstreaming our work
    After multiple discussions, our proposal is:
    ● Generate from LaTeX sources at the same time as recipe.sh
      ○ Each block/section as a separate role
      ○ Control OS/Arch via gather_facts/config
      ○ xCAT vs Warewulf, Slurm vs PBS: only add what hasn’t been added yet
    ● Push to a separate repository (openhpc/ansible-role ?)
      ○ Control release branches/tags so as not to overwrite previous recipes
      ○ Repo can have additional playbooks (CI, new users)
      ○ Once a release is validated, push to Ansible Galaxy
    ● Direct pull requests to playbooks / support roles
      ○ Only recipe roles are auto-generated
      ○ Support roles (specific to Ansible, playbook) are maintained on the repo
  • 15. Test Suite
    Most tests green, however:
    ● Intel-specific tests (CILK, TBB, IMB) disabled
    ● Others need package install (PDF, CDF, HDF), but pass when installed
    ● TAU fails because LMod defaults to openmpi (needs openmpi3)
    ● Lustre fails as the package depends on kernel 4.2 (which won’t work on our machines)
    ● MiniDFT and PRK had make failures, but we haven’t investigated yet
    ● --enable-long doesn’t really work; we need to look into why not
    The plan from now on is:
    1. Automate package install conditional on enabled tests, fix remaining errors
    2. Work with members to prioritise long term ones (like Lustre)
    3. Use it for additional packages, so we can test them before sending upstream
    4. Add a benchmark mode, making sure to use entire cluster
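Step 1 of the plan, installing packages only for the tests that are enabled, could look roughly like the following Python sketch. It is an assumption about one possible approach, not the authors' implementation: the test names and the `example-*` package names in `TEST_PACKAGES` are placeholders, not real OpenHPC package names.

```python
# Hypothetical table: which packages each test needs before it can run.
TEST_PACKAGES = {
    "hdf5": ["example-hdf5"],
    "netcdf": ["example-netcdf", "example-hdf5"],
    "tau": ["example-tau"],
}


def packages_for(enabled_tests):
    """Return the sorted, de-duplicated packages needed by the enabled tests.

    Tests without an entry in TEST_PACKAGES are assumed to need nothing extra.
    """
    needed = set()
    for test in enabled_tests:
        needed.update(TEST_PACKAGES.get(test, []))
    return sorted(needed)


if __name__ == "__main__":
    print(packages_for(["netcdf", "hdf5"]))
```

Deriving the install list from the enabled tests keeps the two in sync, so a test is never enabled without its prerequisites and unused packages are never installed.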