1. OpenStack HA - Theory to Reality
Gerd Prüßmann, Shamail Tahir, Sriram Subramanian, Kalin Nikolov
2. Speakers
Gerd Prüßmann - Cloud Architect, Deutsche Telekom AG (@2digitsleft)
Shamail Tahir - Cloud Architect, EMC Office of the CTO (@ShamailXD)
Sriram Subramanian - Founder & Cloud Specialist, CloudDon (@sriramhere)
Kalin Nikolov - Cloud Engineer, PayPal
3. Agenda
OpenStack HA - Introduction
Active/Active
Active/Passive
DT Implementation
eBay/PayPal Implementation
Summary
4. OpenStack HA - Introduction
What does it mean?
Why is it not enabled by default?
Stateless vs. Stateful
Challenges
More than one way:
Active/Passive
Active/Active
8. Active/Active
● The OpenStack High Availability (HA) design depends on the components used,
e.g. network virtualization, storage backend, database system, etc.
● Various technologies are available to realize HA:
Vendors use combinations, e.g. Pacemaker, Corosync, Galera, Keepalived,
HAProxy, VRRP, DRBD … or their own tools
The following description is derived from the generic proposal in the
OpenStack HA guide:
http://docs.openstack.org/high-availability-guide/content/index.html
9. Active/Active
● Target: Make all services of the platform highly available
Redundancy and resiliency against single service / node failure
● Stateless services are load balanced (HAProxy + Keepalived)
o e.g. API endpoints / nova-scheduler
● Stateful services use individual HA technologies
o e.g. RabbitMQ, MySQL DB, etc.
o might be load balanced as well
● Some services/agents have no built-in HA feature and need an external solution
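The load-balanced stateless case can be sketched as an HAProxy listener in front of redundant API nodes; the VIP is then held by Keepalived via VRRP. This is a minimal illustration, not the deck's actual config: hostnames, addresses, and ports are assumptions.

```shell
# Hypothetical sketch: HAProxy balancing a stateless API (Keystone on
# three controllers) behind a Keepalived-managed VIP. All addresses
# are illustrative assumptions.
cat > /tmp/haproxy-keystone.cfg <<'EOF'
listen keystone_api
    bind 192.0.2.10:5000          # VIP held by Keepalived (VRRP)
    balance roundrobin
    option httpchk                 # eject backends that fail health checks
    server ctrl1 192.0.2.11:5000 check inter 2000 rise 2 fall 5
    server ctrl2 192.0.2.12:5000 check inter 2000 rise 2 fall 5
    server ctrl3 192.0.2.13:5000 check inter 2000 rise 2 fall 5
EOF
```

Any single node can fail without clients noticing, because clients only ever talk to the VIP.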
10. Active/Active - API service endpoints
API endpoints
● deploy on multiple nodes
● configure load balancing with virtual IPs (VIPs) in HAProxy
● use HAProxy's VIPs to configure the respective identity endpoints
● all service configuration files refer to these VIPs only
Schedulers
● nova-scheduler, nova-conductor, cinder-scheduler, neutron-server,
ceilometer-collector, heat-engine
● schedulers are configured against the clustered RabbitMQ nodes
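Registering the identity endpoints at the VIP might look like the following; this is a sketch assuming a deployed cloud and the openstack CLI, and the region, address, and version path are illustrative assumptions.

```shell
# Hypothetical sketch: point the Keystone service catalog at the HAProxy
# VIP (192.0.2.10 is an assumed address) instead of any single controller,
# so API clients survive a node failure. Requires a deployed cloud.
openstack endpoint create --region RegionOne \
  identity public http://192.0.2.10:5000/v3
openstack endpoint create --region RegionOne \
  identity internal http://192.0.2.10:5000/v3
```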
11. Active/Active - Databases
● MySQL or MariaDB with the Galera cluster
(wsrep) library extension
o transaction-commit-level replication
● synchronous multi-master setup
o min. 3 nodes to reach quorum in
case of a network partition
● write and read to any node
● other database options possible:
Percona XtraDB, PostgreSQL, etc.
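A minimal Galera configuration fragment for one of the three nodes could look like this; node names and addresses are illustrative assumptions, while the wsrep keys are the standard Galera option names.

```shell
# Hypothetical sketch: minimal Galera settings for a 3-node MySQL/MariaDB
# cluster. Addresses and names are illustrative assumptions.
cat > /tmp/galera.cnf <<'EOF'
[mysqld]
binlog_format=ROW                      # required by Galera replication
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name=openstack_db
# All three members, so quorum (2 of 3) survives a single-node failure:
wsrep_cluster_address=gcomm://192.0.2.21,192.0.2.22,192.0.2.23
wsrep_node_name=db1
wsrep_node_address=192.0.2.21
wsrep_sst_method=rsync                 # state transfer for new/rejoining nodes
EOF
```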
12. Active/Active - RabbitMQ
● RabbitMQ nodes are clustered
● mirrored queues configured via policy (e.g. ha-mode: all)
● all services use the clustered RabbitMQ nodes
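Setting such a mirrored-queue policy is a one-liner against a running cluster; the policy name "ha-all" here is an arbitrary choice, and the commands assume a live RabbitMQ node.

```shell
# Hypothetical sketch: mirror every queue ('^' matches all) across all
# cluster members, as described above. Requires a running RabbitMQ cluster.
rabbitmqctl set_policy ha-all '^' '{"ha-mode":"all"}'
# Verify the policy is in place:
rabbitmqctl list_policies
```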
13. Active/Active - Networking
Network
● deploy multiple network nodes
● Neutron DHCP agent – configure multiple DHCP agents
(dhcp_agents_per_network)
● Neutron L3 agent
o automatic L3 agent HA (allow_automatic_l3agent_failover)
o VRRP (l3_ha, max_l3_agents_per_router, min_l3_agents_per_router)
● Neutron L2 agent – no HA available
● Neutron metadata agent – no HA available
● Neutron LBaaS agent – no HA available
● where no HA feature is available: use an active/passive Pacemaker/Corosync solution
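The Neutron HA options named above can be collected in a neutron.conf fragment; the values here are illustrative assumptions, while the option names are the standard Neutron configuration keys.

```shell
# Hypothetical sketch: Neutron HA settings from the bullets above.
cat > /tmp/neutron-ha.conf <<'EOF'
[DEFAULT]
# Run a DHCP agent for each network on two network nodes:
dhcp_agents_per_network = 2
# Reschedule routers away from a dead L3 agent:
allow_automatic_l3agent_failover = true
# Or: VRRP-based HA routers (active/standby keepalived instances):
l3_ha = true
max_l3_agents_per_router = 3
min_l3_agents_per_router = 2
EOF
```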
16. Active/Passive: General
● Components should leverage a virtual IP (VIP)
● The primary tools used for active/passive
OpenStack configurations are general
(non-OpenStack-specific): Pacemaker +
Corosync, and DRBD
17. Corosync
● Messaging layer used by the cluster
● Responsibilities include cluster membership and
messaging
● Leverages RRP (Redundant Ring Protocol)
o rings can be set up as active/active or active/passive
o UDP only
o mcastport specifies the receive port; mcastport minus 1 is the
send port
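A corosync.conf fragment with two redundant rings illustrates the RRP and mcastport points above; the network addresses are illustrative assumptions.

```shell
# Hypothetical sketch: corosync totem section with two redundant rings.
cat > /tmp/corosync.conf <<'EOF'
totem {
    version: 2
    rrp_mode: active            # both rings used concurrently (vs. "passive")
    interface {
        ringnumber: 0
        bindnetaddr: 192.0.2.0
        mcastaddr: 239.255.1.1
        mcastport: 5405         # receive port; 5404 is used for sending
    }
    interface {
        ringnumber: 1
        bindnetaddr: 198.51.100.0
        mcastaddr: 239.255.2.1
        mcastport: 5405
    }
}
EOF
```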
18. Pacemaker
● Cluster Resource Manager
● Cluster Information Base (CIB)
o represents current state of resources
and cluster configuration (XML)
● Cluster Resource Management Daemon
(CRMd)
o acts as decision maker (one master)
● Policy Engine (PEngine)
o sends instructions to LRMd and CRMd
● STONITHd
o fencing mechanism
[Diagram: CRMd at the center, interacting with STONITHd, CIB, PEngine, and LRMd]
19. DRBD
● Distributed Replicated Block Device
● Creates logical block devices (e.g. /dev/drbdX) that
have backing volumes
● Reads are serviced locally
● Writes on the primary node are replicated to the secondary node
20. Active/Passive: Database
[Diagram: Host1 and Host2 each run MySQL on top of a DRBD / Pacemaker / Corosync stack]
● Use DRBD to back MySQL
● Leverage a VIP that can float
between hosts
● Manage all resources (including
the MySQL daemon) with Pacemaker
● MySQL/Galera is an alternative,
but the current version of the HA Guide
does not recommend it
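The DRBD-backed MySQL setup above can be sketched in the crm shell, following the general pattern of the OpenStack HA guide: a DRBD master/slave resource plus a group holding the filesystem, the floating VIP, and mysqld, colocated on the DRBD primary. Resource names, device paths, and the address are illustrative assumptions; this requires a running Pacemaker/Corosync cluster.

```shell
# Hypothetical sketch (crm shell): DRBD-backed MySQL under Pacemaker.
crm configure primitive p_drbd_mysql ocf:linbit:drbd \
    params drbd_resource="mysql" op monitor interval="20s"
crm configure ms ms_drbd_mysql p_drbd_mysql \
    meta master-max=1 clone-max=2 notify=true
crm configure primitive p_fs_mysql ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ext4"
crm configure primitive p_ip_mysql ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.30" cidr_netmask="24"
crm configure primitive p_mysql lsb:mysql op monitor interval="30s"
crm configure group g_mysql p_fs_mysql p_ip_mysql p_mysql
# Keep the group on whichever node is the DRBD primary, in the right order:
crm configure colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
crm configure order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start
```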
21. Active/Passive: RabbitMQ
[Diagram: Host1 and Host2 each run RabbitMQ on top of a DRBD / Pacemaker / Corosync stack]
● Use DRBD to back RabbitMQ
● Leverage a VIP that can float
between hosts
● Ensure the erlang.cookie is identical
on all nodes
o enables the nodes to
communicate with each other
● RabbitMQ clustering does not
tolerate network partitions well
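The erlang.cookie file must be byte-identical on all nodes; comparing checksums is a quick way to verify. In this sketch two local copies stand in for cookie files fetched from two nodes (the cookie value is an arbitrary example).

```shell
# Hypothetical sketch: verify two nodes share the same Erlang cookie.
# In practice the files live at /var/lib/rabbitmq/.erlang.cookie.
echo 'SWQOKODSQALRPCLNMEQG' > /tmp/cookie_node1
echo 'SWQOKODSQALRPCLNMEQG' > /tmp/cookie_node2
if [ "$(md5sum < /tmp/cookie_node1)" = "$(md5sum < /tmp/cookie_node2)" ]; then
    echo "cookies match"
else
    echo "cookie mismatch: RabbitMQ nodes will refuse to cluster"
fi
```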
22. Active/Passive: Overview (From Guide)
● Leverage the DB and RabbitMQ VIPs in configuration files
● Configure Pacemaker resources for OpenStack services
o Image API
o Identity
o Block Storage API
o Telemetry Central Agent
o Networking
o L3 Agent
o DHCP
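As one concrete example of the list above, the Identity service can be defined as a Pacemaker resource using the keystone agent from the community openstack-resource-agents project; the parameter values are illustrative assumptions, and a running Pacemaker cluster is required.

```shell
# Hypothetical sketch (crm shell): Keystone as a Pacemaker-managed resource.
crm configure primitive p_keystone ocf:openstack:keystone \
    params config="/etc/keystone/keystone.conf" \
    op monitor interval="30s" timeout="30s"
```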
23. DT Implementation - Overview
● Business Market Place (BMP)
● SaaS offering
● https://portal.telekomcloud.com/
● SaaS Applications from Software Partners
(ISVs) and DT offered to SME customers
● Platform based on Open Source technologies only
(OpenStack, CEPH, Linux)
● Project started in 2012 with OpenStack Essex and CEPH
● In production since 3/13
24. DT Implementation
DTAG scale out project (ongoing)
Target: Migrate production to a new DC and scale out
Requirements:
● scale out compute by 30%, storage by 40%
● eliminate all SPOFs
● Set up in two fire-protection areas / physically separated DC rooms
25. DT Implementation
● single-region HA OpenStack instance
● all services distributed over two DC rooms
o Compute and Storage distributed equally
o All OpenStack services HA (as far as possible)
OSS (DNS, NTP, puppet master, Mirror etc., redundant perimeter
firewall)
● Instance distribution: 4 Availability Zones, multiple host aggregates and
scheduler filters
26. DT Implementation
● Load Balancing
o HAproxy for MySQL, services, RabbitMQ, APIs (nginx under test)
● MySQL
o Galera Multi Master Node replication (3 nodes)
● RabbitMQ
o 2 nodes cluster / mirrored queues
● Neutron
o DHCP multiple agents started; Pacemaker/Corosync
● API Endpoints
o load balancing with round-robin distribution
● Storage
o 2 shared, distributed CEPH clusters (RBD/S3)
27. DT Implementation
Tests/Experiences so far
● Load balancing works well
● Database: OpenStack has issues with multi-node writes
o writing to 1 node with 2 backup nodes diminishes Galera's HA efficiency (needs monitoring)
● Specific issues with deployment in 2 DC rooms / uneven distribution of services (Galera)
o if the “wrong” room fails:
Galera: quorum requires a majority!
room with 2 nodes goes down → 3rd node will deactivate itself → DB outage
o storage-specific:
CEPH may lose 2/3 of the replicas → heavy replication load on the CEPH cluster
danger of losing data (OSD/disk failure) → raise replica level / adapt crush map
o network: recovering from a Neutron / L3 failure takes <15 minutes
o pet applications are vulnerable – they may suffer from hiccups during disasters anyway
● DHCP agent failures
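The single-writer workaround mentioned above (1 write node, 2 backups) is commonly expressed as an HAProxy backend in which only one Galera node takes traffic and the others are marked `backup`; the addresses and the check user are illustrative assumptions.

```shell
# Hypothetical sketch: HAProxy sending all MySQL traffic to one Galera
# node, promoting a backup only on failure. Addresses are illustrative.
cat > /tmp/haproxy-galera.cfg <<'EOF'
listen galera
    bind 192.0.2.40:3306
    balance source
    option mysql-check user haproxy_check
    server db1 192.0.2.21:3306 check
    server db2 192.0.2.22:3306 check backup   # promoted only if db1 fails
    server db3 192.0.2.23:3306 check backup
EOF
```

This trades Galera's multi-master capability for predictable write ordering, which is why the slide calls it a diminished use of Galera's HA.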
28. DT Implementation
Plans for the future
● use DVR / VRRP in the future
o make network more resilient and elastic
● a third DC room would be desirable :-)
o CEPH replicas / MONs, MySQL Galera
29. eBay/PayPal Implementation
The scope of the eBay/PayPal OpenStack clouds
● 100% of the PayPal web/mid tier
● Most of Dev/QA
● Number of hypervisors: 8,500
● Number of virtual machines: 70,000
● Number of users: several thousand
● Availability zones: 10
30. eBay/PayPal Implementation
● Database
MySQL MMM replication, VIP with failover persistence / Galera
● RabbitMQ
VIP with single-node failover persistence, or 3 nodes with mirrored queues
● Neutron DHCP / LBaaS
Corosync/Pacemaker
● API Endpoints
LB VIPs for every service, with either round-robin or least-connection balancing
● Storage
Shared storage with NFS/iSCSI
31. eBay/PayPal Implementation
Successful HA implementations
● Load-balanced HA - VIPs for every service
● LB single-node failover persistence profile
● Galera/Percona for the Identity service
● Global Identity service using GLB
32. eBay/PayPal Implementation
HA failures
● Corosync/Pacemaker
Neutron DHCP and LBaaS - missing advanced health checks
● RabbitMQ
single-node failover persistence
● MySQL replication
single-node failover persistence sometimes doesn't work well;
implemented external monitoring and disabling of the failed member
● VIPs without ECV health checks
33. eBay/PayPal Implementation
Future direction
● HA on global or regional services
one leg in each availability zone
(Keystone, LBaaS, Swift)
● RabbitMQ with 3 nodes / mirrored queues
LB VIP with least connections
● No shared NFS for Glance
35. eBay/PayPal Implementation
Lessons Learned
● Try not to overcomplicate
● Simulate Failures
Before placing in production make sure HA works
● Place your services in different availability zones,
or at least in different fault zones
● Always make backups
No matter how robust your HA solution is
36. Call to Action
● OpenStack HA Guide update efforts
● WTE Work Group (now known as ‘Enterprise’)
● Share best practices
Explain the notion of high availability in the context of OpenStack: ensuring high availability of OpenStack services, API services, and supporting infrastructure, including databases and message queues. HA means different things in different contexts - is it guest availability? Is it the DB? Is it storage? Or is it application availability? If there is a failure, should the application fail over, or should the underlying infrastructure?
Broadly: protect against system downtime and prevent accidental data loss.
There can be multiple SPOFs - services, API endpoints, network components, storage components, and infrastructure components such as power, cooling, etc. Provide redundancy at the appropriate levels.
OpenStack is a collection of services sharing some common infrastructure. It is not a monolithic application that can be made highly available by slapping in a load balancer. The services are independent and self-contained, with some shared infrastructure among them; they have different configurations, settings, and more.
Some of the components are stateless, such as nova-api, keystone-api, glance-api, etc. Some of the components are databases / message queues.
OpenStack's architecture is very complex.
Active/Passive - one system is active; on failure, the redundant service/system is brought into action. For stateless services, very minimal configuration is needed. For stateful services, additional applications such as Pacemaker and Corosync are needed.
Active/Active - both the active and redundant systems are maintained in the same state concurrently. For stateless services, active and redundant instances are load balanced using an LB such as HAProxy. Stateful services must be maintained in the same state across instances, again behind an LB.