Audience: Intermediate
About: Tales from an OpenStack operations team that had to learn to walk before they could fly. We are a small agile team that follows Scrum to reduce single points of failure and relies heavily on orchestration. This presentation will outline how we use metrics to investigate, troubleshoot and influence purchasing decisions, why up/down monitoring is no longer enough, and how to support the inevitable persistent VM in the cloud.
Speaker Bio: Rarm Nagalingam – Senior Consultant, Red Hat
Rarm is a Senior Consultant at Red Hat working with customers to deploy and manage their cloud infrastructure. As a passionate cloud advocate, he has assisted in the migration of workloads running on legacy virtualisation to the cloud. Rarm has over 13 years of experience in the ICT industry, specializing in rapid development of bespoke systems.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
/bin/tails from OpenStack Operations: Rarm Nagalingam, Red Hat
1. /bin/tails from OpenStack Operations
OpenStack Australia Day
Rarm Nagalingam
DevOps J.O.A.T Engineer
May 2016
2. OpenStack Australia Day 2016
INTRODUCTION
Rarm Nagalingam
DevOps J.O.A.T Engineer
rarm@redhat.com
linkedin.com/in/rarm-nagalingam-736aa54
3. OpenStack Australia Day 2016
AGENDA
● Current Architecture, Size, Workloads
● Patch Methodology
● User Issue: Is the Cloud Slow!! today?
● egrep fail -R ./ == fail
● Let's play the blame game
● Fool me once, shame on you, fool me twice, monitor it!
● Role Play
● Questions & possibly Answers
5. OpenStack Australia Day 2016
Current - RHELOSP 5.0 (ICEHOUSE)
• 3 x Physical Controllers
• 3 x Physical DB Nodes
• 2 x Virtual Load Balancers
• 26 x Compute Nodes (56 vCPUs and 256 GB RAM each)
• 1456 vCPUs / 6.6TB of RAM – 90% allocated
• Storage NFS via Filer
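The totals on this slide follow directly from the per-node figures. A minimal sketch of that arithmetic, assuming all 26 compute nodes are identical and that "90% allocated" applies to vCPUs (the function and constant names are illustrative, not from the talk):

```python
# Figures taken from the slide: 26 compute nodes, 56 vCPUs and 256 GB RAM each.
COMPUTE_NODES = 26
VCPUS_PER_NODE = 56
RAM_GB_PER_NODE = 256
ALLOCATED_FRACTION = 0.90  # "90% allocated" on the slide


def cluster_capacity(nodes, vcpus_per_node, ram_gb_per_node):
    """Return (total vCPUs, total RAM in GB) for a homogeneous cluster."""
    return nodes * vcpus_per_node, nodes * ram_gb_per_node


total_vcpus, total_ram_gb = cluster_capacity(
    COMPUTE_NODES, VCPUS_PER_NODE, RAM_GB_PER_NODE
)
# Assumption: the allocation percentage applies to vCPUs.
free_vcpus = round(total_vcpus * (1 - ALLOCATED_FRACTION))

print(f"{total_vcpus} vCPUs, {total_ram_gb} GB RAM, about {free_vcpus} vCPUs unallocated")
```

At 90% allocation the remaining headroom is only a few nodes' worth of vCPUs, which is the kind of number that feeds into a purchasing decision.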
7. OpenStack Australia Day 2016
Future - RHELOSPd 8.0 (LIBERTY)
● 3 x Physical Controllers
● 3 x Physical DB Nodes
● 3 x Physical CEPH Monitor Nodes
● 9 x Physical CEPH Storage Nodes (~ 36TB per node with NVMe Journals)
● 2 x Virtual Load Balancers
● (xxx) x Compute Nodes (56 vCPUs and 512 GB RAM each)
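For a rough feel of what the Ceph tier provides, the raw capacity can be divided by the replication factor. This is a sketch under stated assumptions: the slide gives 9 nodes at roughly 36TB each, but the replica count of 3 is an assumption (it is Ceph's default pool size, not something the talk states), and the estimate ignores filesystem overhead and full-ratio headroom.

```python
def ceph_capacity_tb(storage_nodes, raw_tb_per_node, replicas):
    """Return (raw TB, approximate usable TB) for replicated pools."""
    raw_tb = storage_nodes * raw_tb_per_node
    return raw_tb, raw_tb / replicas


# 9 nodes x ~36TB from the slide; replicas=3 is an assumption.
raw_tb, usable_tb = ceph_capacity_tb(storage_nodes=9, raw_tb_per_node=36, replicas=3)
print(f"raw: {raw_tb} TB, usable at 3 replicas: about {usable_tb:.0f} TB")
```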
16. OpenStack Australia Day 2016
ERROR nova.openstack.common.rpc.common [req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] ['Traceback (most recent call last):\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 461, in _process_data\n **args)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/common.py", line 439, in inner\n return catch_client_exception(exceptions, func, *args, **kwargs)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/common.py", line 420, in catch_client_exception\n return func(*args, **kwargs)\n', ' File
"/opt/stack/nova/nova/network/manager.py", line 573, in get_instance_nw_info\n instance_uuid)\n', ' File "/opt/stack/nova/nova/db/api.py", line 561, in
virtual_interface_get_by_instance\n return IMPL.virtual_interface_get_by_instance(context, instance_id)\n', ' File
"/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line
138, in wrapper\n instance_get_by_uuid(context, instance_uuid)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return
f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1678, in instance_get_by_uuid\n columns_to_join=columns_to_join)\n', ' File
"/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1684, in _instance_get_by_uuid\n filter_by(uuid=uuid).\n',
' Filepython2.7/dist-packages/sqlalchemy/engine/base.py", line 1449, in execute\n params)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1584,
in _execute_clauseelement\n compiled_sql, distilled_params\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1698, in
_execute_context\n context)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context\n context)\n', ' File
"/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 331, in do_execute\n cursor.execute(statement, parameters)\n', ' File
"/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute\n self.errorhandler(self, exc, value)\n', ' File
"/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler\n raise errorclass, errorvalue\n', 'OperationalError: (OperationalError) (1054,
"Unknown column 'instances.locked_by' in 'field list'") 'SELECT anon_1.instances_created_at AS anon_1_instances_created_at,
anon_1.instances_updated_at AS anon_1_instances_updated_at, anon_1.instances_deleted_at AS anon_1_instances_deleted_at,
anon_1.instances_deleted AS anon_1_i instances_hostname, instances.launch_index AS instances_launch_index, instances.key_name AS
instances_key_name, instances.key_data AS instances_key_data, instances.power_state AS instances_power_state, instances.vm_state AS
instances_vm_state, instances.task_state AS instances_task_state…...ces_access_ip_v6, instances.auto_disk_config AS instances_auto_disk_config,
instances.progress AS instances_progress, instances.shutdown_terminate AS instances_shutdown_terminate, instances.disable_terminate AS
instances_disable_terminate, instances.cell_name AS instances_cell_name, instances.internal_id AS instances_internal_id, instances.cleaned AS
instances_cleaned \nFROM instances \nWHERE instances.deleted = %s AND instances.uuid = %s \n LIMIT %s) AS anon_1 LEFT OUTER JOIN
instance_info_caches AS instance_info_caches_1
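The useful part of a wall of text like this is a single clause: the database error code and message. Note that the word "fail" never appears in it, so grepping for "fail" finds nothing. A minimal sketch that pulls out the root cause, assuming the error keeps the `ExceptionName: (ExceptionName) (code, "message")` shape seen above (the regular expression and function name are illustrative):

```python
import re

# Matches e.g.: OperationalError: (OperationalError) (1054, "Unknown column ...")
MYSQL_ERROR = re.compile(
    r'(?P<exc>\w+Error): \(\w+\) \((?P<code>\d+),\s*"(?P<message>[^"]+)"\)'
)


def root_cause(log_text):
    """Return (exception name, MySQL error code, message), or None if absent."""
    match = MYSQL_ERROR.search(log_text)
    if match is None:
        return None
    return match.group("exc"), int(match.group("code")), match.group("message")


# Excerpt of the log entry above, including the line wrap after "(1054,".
excerpt = (
    "raise errorclass, errorvalue\\n', 'OperationalError: (OperationalError) (1054,\n"
    "\"Unknown column 'instances.locked_by' in 'field list'\") 'SELECT ..."
)
print(root_cause(excerpt))
```

MySQL error 1054 means a query referenced a column that does not exist, which usually points at code and database schema being at different versions rather than at a runtime fault.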
18. OpenStack Australia Day 2016
Got Logs
● Troubleshooting from the 90’s
● Log Aggregation FTW
● Support infrastructure just as important as the Cloud
● Testing in Prod == a resume generating event
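To illustrate what log aggregation buys over tailing files host by host, here is a small sketch that counts ERROR entries per logger and groups them by request ID. It assumes lines contain the `LEVEL logger [req-<id> ...` layout seen in the traceback slide; the sample lines other than the first are hypothetical, and a real deployment would use a dedicated aggregation stack rather than a script like this.

```python
import re
from collections import Counter

# Matches the "LEVEL logger [req-<id>" portion of an OpenStack-style log line.
LOG_PATTERN = re.compile(
    r"\b(?P<level>ERROR|WARNING|INFO)\s+(?P<logger>[\w.]+)\s+\[(?P<request>req-[0-9a-f-]+)"
)


def summarise_errors(lines):
    """Count ERROR entries per logger and collect the request IDs involved."""
    counts = Counter()
    requests = {}
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None or match.group("level") != "ERROR":
            continue
        logger = match.group("logger")
        counts[logger] += 1
        requests.setdefault(logger, set()).add(match.group("request"))
    return counts, requests


sample = [
    "ERROR nova.openstack.common.rpc.common [req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] ['Traceback ...",
    # Hypothetical lines, added only to exercise the function:
    "INFO nova.compute.manager [req-00000000-0000-0000-0000-000000000000 None None] instance spawned",
    "ERROR nova.openstack.common.rpc.common [req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] retry failed",
]
counts, requests = summarise_errors(sample)
for logger, count in counts.most_common():
    print(logger, count, sorted(requests[logger]))
```

Grouping by request ID is the step that lets one user complaint be followed across services instead of across terminals.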
27. OpenStack Australia Day 2016
Example Scenario
Exercise 1:
You arrive at work and discover that one of your compute nodes has been hard powered off. The
node was running three high-priority instances: a small 60GB Windows instance and two
medium RHEL instances.
Goal:
Without rebuilding the compute node, restart the instances on another node.
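One possible approach to this exercise is `nova evacuate`, which rebuilds an instance on another host once nova regards the failed compute service as down. The sketch below only builds and prints the commands (a dry run) rather than executing them. The instance and host names are hypothetical, and passing `--on-shared-storage` assumes the instance disks live on storage that the target node can also reach, which the talk does not confirm.

```python
import shlex


def evacuate_commands(instances, target_host, shared_storage=True):
    """Build one `nova evacuate` command per instance (not executed here)."""
    commands = []
    for server in instances:
        cmd = ["nova", "evacuate"]
        if shared_storage:
            # Assumption: instance disks are on storage shared between nodes.
            cmd.append("--on-shared-storage")
        cmd += [server, target_host]
        commands.append(cmd)
    return commands


# Hypothetical names for the three high-priority instances and the target node.
instances = ["win-small", "rhel-medium-1", "rhel-medium-2"]
commands = evacuate_commands(instances, "compute-07")
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first gives the team something to review before anything touches production, in keeping with the earlier warning about testing in prod.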
29. OpenStack Australia Day 2016
BackUp Scenario
Exercise 3:
One of the admins accidentally dropped a database table. However, rather than just clearing
out the redundant data, they dropped all the tables from the OpenStack nova database.
Thankfully you saw the user do this and can respond quickly.
Goal:
Redirect users to a temporary site stating that an outage has occurred. Restore the database
and ensure that all services are able to successfully interact with the database before
removing the redirect.
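Part of "ensure that all services are able to interact with the database" is confirming that every table in the backup actually came back. A minimal sketch of that check, assuming the backup is a mysqldump-style SQL file containing ``CREATE TABLE `name` `` statements and that the list of live tables has been obtained separately (for example from `SHOW TABLES`); the function names and sample dump are illustrative.

```python
import re

# mysqldump writes one "CREATE TABLE `name` (" statement per table.
CREATE_TABLE = re.compile(
    r"^CREATE TABLE (?:IF NOT EXISTS )?`(?P<name>[^`]+)`", re.MULTILINE
)


def tables_in_dump(dump_sql):
    """Return the set of table names defined in a SQL dump."""
    return set(CREATE_TABLE.findall(dump_sql))


def missing_tables(dump_sql, live_tables):
    """Return tables present in the dump but absent from the live database."""
    return sorted(tables_in_dump(dump_sql) - set(live_tables))


# Tiny illustrative dump; a real nova dump contains many more tables.
dump = """
DROP TABLE IF EXISTS `instances`;
CREATE TABLE `instances` (
  `id` int(11) NOT NULL
);
CREATE TABLE `instance_info_caches` (
  `id` int(11) NOT NULL
);
"""
print(missing_tables(dump, ["instances"]))
```

An empty result is a necessary check rather than a sufficient one: the services still need to be exercised against the restored database before the redirect is removed.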
30. OpenStack Australia Day 2016
Now you are an Elite Cloud Admin
(cc) https://www.flickr.com/photos/-chuckc-/