/bin/tails from OpenStack Operations: Rarm Nagalingam, Red Hat

/bin/tails from OpenStack Operations
OpenStack Australia Day
Rarm Nagalingam
DevOps J.O.A.T Engineer
May 2016

OpenStack Australia Day 2016
INTRODUCTION
Rarm Nagalingam
DevOps J.O.A.T Engineer
rarm@redhat.com
linkedin.com/in/rarm-nagalingam-736aa54

AGENDA
● Current Architecture, Size, Workloads
● Patch Methodology
● User Issue: Is the Cloud Slow!! today?
● egrep fail -R ./ == fail
● Let's play the blame game
● Fool me once, shame on you, fool me twice, monitor it!
● Role Play
● Questions & possibly Answers

Current - RHELOSP 5.0 (ICEHOUSE)
• 3 x Physical Controllers
• 3 x Physical DB Nodes
• 2 x Virtual Load Balancers
• 26 x Compute Nodes (56 vCPUs and 256 GB RAM each)
• 1456 vCPUs / 6.6TB of RAM – 90% allocated
• Storage NFS via Filer
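
The totals on this slide follow directly from the node count. A minimal sketch of that arithmetic is below; it assumes the 90% allocation applies equally to vCPUs and RAM, which the slide does not state, and the headroom figure is only an illustration.

```python
# Figures taken from the slide; the headroom calculation is an illustration.
COMPUTE_NODES = 26
VCPUS_PER_NODE = 56
RAM_GB_PER_NODE = 256
ALLOCATED_PERCENT = 90  # assumed to apply to both vCPUs and RAM

total_vcpus = COMPUTE_NODES * VCPUS_PER_NODE
total_ram_gb = COMPUTE_NODES * RAM_GB_PER_NODE

# Capacity that is still unallocated across the whole cluster.
free_vcpus = total_vcpus * (100 - ALLOCATED_PERCENT) / 100
free_ram_gb = total_ram_gb * (100 - ALLOCATED_PERCENT) / 100

# Express the free vCPUs as "how many whole compute nodes could we lose?"
headroom_nodes = free_vcpus / VCPUS_PER_NODE

print(f"total: {total_vcpus} vCPUs, {total_ram_gb} GB RAM")
print(f"free:  {free_vcpus:.1f} vCPUs, {free_ram_gb:.1f} GB RAM")
print(f"headroom: about {headroom_nodes:.1f} compute nodes")
```

The 6656 GB total is in line with the roughly 6.6TB quoted. Under the stated assumption, the unallocated capacity is worth fewer than three compute nodes.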

Future - RHELOSP 8.0 (LIBERTY)
● 3 x Physical Controllers
● 3 x Physical DB Nodes
● 3 x Physical CEPH Monitor Nodes
● 9 x Physical CEPH Storage Nodes (~ 36TB per node with NVMe Journals)
● 2 x Virtual Load Balancers
● (xxx) x Compute Nodes (56 vCPUs and 512 GB RAM each)
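
For a rough sense of what the Ceph tier provides, the sketch below multiplies out the raw capacity. The replica count of 3 is an assumption (the slide gives no pool settings), and the result ignores journals, filesystem overhead and fill limits.

```python
STORAGE_NODES = 9
RAW_TB_PER_NODE = 36   # "~36TB per node" on the slide
REPLICA_COUNT = 3      # assumption: 3-way replicated pools


def usable_tb(nodes, raw_tb_per_node, replicas):
    """Raw capacity divided by the number of copies kept of each object."""
    return nodes * raw_tb_per_node / replicas


raw_tb = STORAGE_NODES * RAW_TB_PER_NODE
print(f"raw:    {raw_tb} TB")
print(f"usable: {usable_tb(STORAGE_NODES, RAW_TB_PER_NODE, REPLICA_COUNT):.0f} TB")

# Capacity left if one storage node is out of service.
print(f"one node down: {usable_tb(STORAGE_NODES - 1, RAW_TB_PER_NODE, REPLICA_COUNT):.0f} TB")
```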

Current Workloads
● Cloud Based
● Web Apps
● Cloudy-VMs ++
https://www.flickr.com/photos/truedimensions/

Patch Methodology
https://www.flickr.com/photos/emma-lego/

Option 1: Scatter Gun
Take Aim Fire Ah...
www.safaribooksonline.com

Option 2: Become an Elite Cloud Admin
(cc) https://www.flickr.com/photos/-chuckc-/

ERROR nova.openstack.common.rpc.common [req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] ['Traceback (most recent call last):\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 461, in _process_data\n **args)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/common.py", line 439, in inner\n return catch_client_exception(exceptions, func, *args, **kwargs)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/common.py", line 420, in catch_client_exception\n return func(*args, **kwargs)\n', ' File
"/opt/stack/nova/nova/network/manager.py", line 573, in get_instance_nw_info\n instance_uuid)\n', ' File "/opt/stack/nova/nova/db/api.py", line 561, in
virtual_interface_get_by_instance\n return IMPL.virtual_interface_get_by_instance(context, instance_id)\n', ' File
"/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line
138, in wrapper\n instance_get_by_uuid(context, instance_uuid)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return
f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1678, in instance_get_by_uuid\n columns_to_join=columns_to_join)\n', ' File
"/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1684, in _instance_get_by_uuid\n filter_by(uuid=uuid).\n', ' Filepython2.7/dist-packages/sqlalchemy/engine/base.py",
line 1449, in execute\n params)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1584,
in _execute_clauseelement\n compiled_sql, distilled_params\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1698, in
_execute_context\n context)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context\n context)\n', ' File
"/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 331, in do_execute\n cursor.execute(statement, parameters)\n', ' File
"/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute\n self.errorhandler(self, exc, value)\n', ' File
"/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler\n raise errorclass, errorvalue\n', 'OperationalError: (OperationalError) (1054,
"Unknown column 'instances.locked_by' in 'field list'") 'SELECT anon_1.instances_created_at AS anon_1_instances_created_at,
anon_1.instances_updated_at AS anon_1_instances_updated_at, anon_1.instances_deleted_at AS anon_1_instances_deleted_at,
anon_1.instances_deleted AS anon_1_i instances_hostname, instances.launch_index AS instances_launch_index, instances.key_name AS
instances_key_name, instances.key_data AS instances_key_data, instances.power_state AS instances_power_state, instances.vm_state AS
instances_vm_state, instances.task_state AS instances_task_state…...ces_access_ip_v6, instances.auto_disk_config AS instances_auto_disk_config,
instances.progress AS instances_progress, instances.shutdown_terminate AS instances_shutdown_terminate, instances.disable_terminate AS
instances_disable_terminate, instances.cell_name AS instances_cell_name, instances.internal_id AS instances_internal_id, instances.cleaned AS
instances_cleaned \nFROM instances \nWHERE instances.deleted = %s AND instances.uuid = %s \n LIMIT %s) AS anon_1 LEFT OUTER JOIN
instance_info_caches AS instance_info_caches_1
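
A wall of text like this usually hides one useful sentence. The sketch below pulls the request ID, the deepest nova frame and the database error out of a flattened traceback line. It assumes the line arrives with its \n escapes intact and in the shape shown on the slide; SAMPLE is a shortened copy of that line, and the regular expressions and the summarize helper are illustrative, not part of nova.

```python
import re

# Shortened copy of the log line from the slide (raw strings keep the \n escapes).
SAMPLE = (
    r"""ERROR nova.openstack.common.rpc.common [req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] """
    r"""['Traceback (most recent call last):\n', """
    r"""' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1684, in _instance_get_by_uuid\n filter_by(uuid=uuid).\n', """
    r"""' File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler\n raise errorclass, errorvalue\n', """
    r"""'OperationalError: (OperationalError) (1054, "Unknown column 'instances.locked_by' in 'field list'") 'SELECT ..."""
)

HEADER = re.compile(r"^(?P<level>\w+) (?P<logger>\S+) \[(?P<request_id>req-[0-9a-f-]+)")
FRAME = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\w+)')
DB_ERROR = re.compile(
    r"""'(?P<exc>\w+): \((?P=exc)\) \((?P<code>\d+), "(?P<message>[^"]*)"\)"""
)


def summarize(log_line):
    """Reduce one flattened traceback line to the fields worth reading first."""
    header = HEADER.match(log_line)
    # Keep only frames inside the nova tree; the last one is the deepest.
    nova_frames = [m for m in FRAME.finditer(log_line) if "/nova/" in m["path"]]
    error = DB_ERROR.search(log_line)
    last = nova_frames[-1] if nova_frames else None
    return {
        "logger": header["logger"] if header else None,
        "request_id": header["request_id"] if header else None,
        "deepest_nova_frame": (
            f'{last["path"]}:{last["line"]} in {last["func"]}' if last else None
        ),
        "exception": error["exc"] if error else None,
        "code": int(error["code"]) if error else None,
        "message": error["message"] if error else None,
    }


summary = summarize(SAMPLE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Read this way, the line says that a query referenced the column instances.locked_by and the database did not have it.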

http://logstash.openstack.org/#/dashboard/file/logstash.json

Got Logs
● Troubleshooting from the 90’s
● Log Aggregation FTW
● Support infrastructure just as important as the Cloud
● Testing in Prod == a resume generating event
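
One thing aggregation buys you is the ability to follow a single request across services. The sketch below merges lines from several sources and orders them by timestamp for one request ID. The log lines, host names and the trace_request helper are hypothetical, and the line layout (timestamp, PID, level, logger, bracketed request context) is an assumption about how the services are configured to log.

```python
import re

# Assumed layout: "<date> <time>.<ms> <pid> <LEVEL> <logger> [req-... ...] <message>"
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \d+ "
    r"(?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[(?P<request_id>req-[0-9a-f-]+)[^\]]*\] (?P<message>.*)$"
)

# Hypothetical lines from three log files on three hosts.
SOURCES = {
    "controller1:nova-api.log": [
        "2016-05-05 09:00:01.100 2211 INFO nova.api [req-aaaa1111 user tenant] POST /servers",
        "2016-05-05 09:00:01.200 2211 INFO nova.api [req-bbbb2222 user tenant] GET /servers",
    ],
    "controller2:nova-scheduler.log": [
        "2016-05-05 09:00:01.450 3001 INFO nova.scheduler [req-aaaa1111 user tenant] Selected host compute07",
    ],
    "compute07:nova-compute.log": [
        "2016-05-05 09:00:03.900 4120 ERROR nova.compute.manager [req-aaaa1111 user tenant] Instance failed to spawn",
    ],
}


def trace_request(sources, request_id):
    """Return (timestamp, source, level, message) tuples for one request, oldest first."""
    hits = []
    for source, lines in sources.items():
        for line in lines:
            match = LINE.match(line)
            if match and match["request_id"] == request_id:
                hits.append((match["ts"], source, match["level"], match["message"]))
    return sorted(hits)


timeline = trace_request(SOURCES, "req-aaaa1111")
for ts, source, level, message in timeline:
    print(ts, source, level, message)
```

Without aggregation, the same answer means logging in to three machines and running grep on each.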

Difference between Metrics and Monitoring

Use metrics to prove your theories
https://www.elastic.co/blog/kibana-4-5-0-released
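
"Is the cloud slow today?" becomes answerable once there is a baseline to compare against. The sketch below compares today's latency percentiles with a baseline using the nearest-rank method. The sample values, the choice of metric (seconds per API call) and the 1.5x threshold are all hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def is_slower(baseline, today, pct=95, threshold=1.5):
    """True when today's percentile exceeds the baseline's by the given factor."""
    return percentile(today, pct) > threshold * percentile(baseline, pct)


# Hypothetical API response times in seconds.
baseline = [0.8, 0.9, 1.0, 1.1, 1.2, 0.9, 1.0, 1.1, 1.0, 1.3]
today = [0.9, 1.0, 1.1, 3.8, 4.2, 1.0, 4.0, 1.2, 3.9, 4.1]

for pct in (50, 95):
    print(f"p{pct}: baseline {percentile(baseline, pct)}s, today {percentile(today, pct)}s")
print("slower than baseline:", is_slower(baseline, today))
```

In this made-up data the median barely moves while the 95th percentile triples, which is the kind of evidence that turns "it feels slow" into a theory you can test.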

Let's Play the Blame Game
∙ Enforce OLAs
∙ Influence and support purchasing

Fool me once, shame on you.
Fool me twice, monitor it!

Fool me twice, monitor it!
(cc) rarm

Role Play
(cc) https://www.flickr.com/photos/d0ppler/

Example Scenario
Exercise 1:
You arrive at work and discover that one of your compute nodes has been hard powered off. The
node was running three high-priority instances: a small 60GB Windows instance and two
medium RHEL instances.
Goal:
Without rebuilding the compute node, restart the instances on another node.
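
One possible approach to Exercise 1 is nova's evacuate operation, which rebuilds an instance on another host. The sketch below only prints the commands rather than running them. It assumes the nova CLI from python-novaclient, that the failed host's compute service is already reported as down, and that instance disks live on shared storage (the architecture slide mentions NFS, but not what is stored there). Host and instance names are hypothetical.

```python
import shlex


def evacuation_plan(failed_host, target_host, instance_ids, shared_storage=True):
    """Build (not run) the nova CLI commands to move instances off a dead host."""
    # Stop the scheduler from placing new instances on the failed host.
    plan = [["nova", "service-disable", failed_host, "nova-compute"]]
    for instance_id in instance_ids:
        cmd = ["nova", "evacuate"]
        if shared_storage:
            # Reuse the existing disk instead of rebuilding from the image.
            cmd.append("--on-shared-storage")
        cmd += [instance_id, target_host]
        plan.append(cmd)
    # Confirm where the instances ended up.
    plan.append(["nova", "list", "--host", target_host, "--all-tenants"])
    return plan


plan = evacuation_plan(
    failed_host="compute-07",
    target_host="compute-12",
    instance_ids=["win-small-01", "rhel-medium-01", "rhel-medium-02"],
)
for cmd in plan:
    print(shlex.join(cmd))
```

Printing the plan first makes it easy to review before anything touches production, in the spirit of the earlier warning about testing in prod.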

BackUp Scenario
Exercise 3:
One of the admins accidentally dropped a database table. However, rather than just clearing
out the redundant data, they dropped all the tables from the OpenStack nova database.
Thankfully you saw the user do this and can respond quickly.
Goal:
Redirect users to a temporary site stating that an outage has occurred. Restore the database
and ensure that all services are able to successfully interact with the database before
removing the redirect.

Now you are an Elite Cloud Admin
(cc) https://www.flickr.com/photos/-chuckc-/

THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
