OpenStack's complex architecture and design amplify the effort of debugging a problem in an OpenStack environment. If the native and supporting components are not configured properly and tuned for optimum performance, this can give administrators and support associates sleepless nights, especially in large deployments that involve high availability and load balancing.
5. Manifestation of a Problem
“Our compute service on the compute node is stuck in a state of activating.”
“Most OpenStack Overcloud neutron services inactive and disabled”
No valid host was found. Exceeded max scheduling attempts 3 for instance
PortLimitExceeded: Maximum number of ports exceeded
“User unable to launch new instances”
Instance failed to spawn
8. Over-Working RabbitMQ
Problem Description: Our compute service on the compute node is stuck in a state of activating
The initial evidence is a set of nondescript timeouts:
# journalctl --all --this-boot --no-pager | grep nova
May 27 16:20:50 host.example.com systemd[1]: openstack-nova-compute.service operation timed out. Terminating.
May 27 16:20:50 host.example.com systemd[1]: Unit openstack-nova-compute.service entered failed state.
May 27 16:20:50 host.example.com systemd[1]: openstack-nova-compute.service holdoff time over, scheduling restart.
Rebooting the compute node doesn’t help.
9. Over-Working RabbitMQ
Problem Description: Our compute service on the compute node is stuck in a state of activating
An strace of the nova-compute service reveals our trouble communicating with rabbit:
# grep :5672 compute.strace
12938 03:29:28.320069 write(3, "2015-05-28 03:29:28.319 12938 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.100.47:5672 is unreachable: Socket closed. Trying again in 1 seconds.\n", 169) = 169 <0.000019>
12938 03:29:29.321779 write(3, "2015-05-28 03:29:29.321 12938 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 192.168.100.48:5672\n", 126) = 126 <0.000061>
12938 03:29:30.333894 write(3, "2015-05-28 03:29:30.333 12938 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.168.100.48:5672\n", 123) = 123 <0.000013>
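For reference, a trace like the one above could be captured with something along these lines; the PID lookup via pgrep is just one option, any way of finding the nova-compute PID works:

# Attach to nova-compute, follow forks (-f), stamp syscalls with
# microsecond timestamps (-tt), and keep strings readable (-s 256).
strace -f -tt -s 256 -o compute.strace -p "$(pgrep -of nova-compute)"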
10. Over-Working RabbitMQ
Problem Description: Our compute service on the compute node is stuck in a state of activating
The strace leads to more logs...
The logs lead to an existing bug report...
The bug report leads to an upstream discussion...
Yadda Yadda Yadda
The rabbitmq-server process is out of file descriptors!
12. Now You Know!
Too few RabbitMQ file descriptors is a recipe for sleepless nights.
Set the rabbitmq-server NOFILE limit to 65436*
*Be careful if you’re using pacemaker -- limits are set by the resource agent.
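On a systemd-managed node (and when pacemaker is not in the picture), a minimal sketch of raising the limit is a drop-in for the rabbitmq-server unit; the drop-in file name is arbitrary:

# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=65436

Follow up with systemctl daemon-reload and a restart of rabbitmq-server, then confirm with cat /proc/$(pgrep -of beam)/limits.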
13. Knowledge-Centered Support
● Continuous improvement of the knowledgebase simplifies troubleshooting of future issues
● Knowledge automatically captured as a by-product of the problem-solving process
● Search and reuse as core disciplines of the support team
● Fast track to publication means easier self-resolution
https://access.redhat.com/solutions/1465753
14. WE HAVE OUR RIGHT TO SLEEP
Issue #2: Random failure while spawning a large number of instances
15. $ nova list
ERROR (ConnectionRefused): Unable to establish connection to http://192.168.1.1:35357/v2.0/tokens
● Connections to the various OpenStack service APIs (nova-api, cinder-api, neutron-api, etc.) time out randomly.
● Not reproducible in most environments. When it happens, the failure is random, without any pattern -- sometimes 1 in 100 or 1 in 500.
● Obviously keystone is up and running perfectly fine.
[Diagram: nova-api, cinder-api, and neutron-api all hit “connection refused!!” when talking to Keystone]
Issue #2: The symptom is the same as in issue #1.
Result: Random failures in spawning instances, creating volumes, networks, etc.
16. The first suspect is Keystone, but he is innocent.
Where can one go wrong?
Looking at the error message, it’s natural to point fingers at keystone.
● Looked at the keystone API logs. No clue!!
● Could see an abnormal number of keystone connections in CLOSE_WAIT status. Focused on that and wasted a lot of time investigating in that direction (see the snippet after this list).
● It’s time to understand how a connection travels from the end user to the API and on to keystone, focusing on how the dots are connected.
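For reference, one quick way to spot that CLOSE_WAIT pile-up might be the following; port 35357 is the keystone admin endpoint from the error above, so adjust for your own endpoints:

ss -tn state close-wait '( sport = :35357 or dport = :35357 )'

Counting the output lines shows how many keystone connections are stuck half-closed.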
17. How does it work under the hood?
[Diagram: a single VIP fronting haproxy on controller-1, controller-2, and controller-3; each controller runs nova-api, keystone, and mariadb-galera. The “connection refused!!” shows up somewhere along the chain end user -> VIP -> nova-api -> keystone -> database.]
18. Possibilities?
Keystone is already ruled out.
● Intermittent network packet drop? No, ruled out by network troubleshooting.
● Haproxy (load balancer) drops the connection? Likely -- but on which hop?
○ end user -> nova: Highly unlikely, as the error occurs when nova connects to keystone.
○ nova -> keystone: Slightly likely.
○ keystone -> database: Highly likely. Enabled haproxy logging and found heavy client-termination messages. (In haproxy’s termination state, “cD” means the client-side timeout expired while the session was in the data phase.)
haproxy[22346]: 10.243.232.62:48999 [10/Jul/2015:01:41:34.706] galera galera/pcmk-hovsh0800sdc-06 1/0/8734961 37181 cD 1369/1337/1337/1337/0 0/0
haproxy[22346]: 10.243.232.14:53092 [10/Jul/2015:02:37:43.666] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 2875 cD 1375/1337/1337/1337/0 0/0
haproxy[22346]: 10.243.232.62:41742 [10/Jul/2015:01:47:44.819] galera galera/pcmk-hovsh0800sdc-06 1/0/8400246 38448 cD 1376/1336/1336/1336/0 0/0
haproxy[22346]: 10.243.232.14:53318 [10/Jul/2015:02:37:47.499] galera galera/pcmk-hovsh0800sdc-06 1/0/5400005 3414 cD 1384/1335/1335/1335/0 0/0
haproxy[22346]: 10.243.232.62:42507 [10/Jul/2015:02:37:47.529] galera galera/pcmk-hovsh0800sdc-06 1/0/5400006 2875 cD 1383/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.62:42609 [10/Jul/2015:02:37:49.103] galera galera/pcmk-hovsh0800sdc-06 1/0/5400315 35783 cD 1384/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.62:42684 [10/Jul/2015:02:37:50.598] galera galera/pcmk-hovsh0800sdc-06 1/0/5400259 28994 cD 1384/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.14:53493 [10/Jul/2015:02:37:50.885] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 2875 cD 1383/1333/1333/1333/0 0/0
haproxy[22346]: 10.243.232.14:53674 [10/Jul/2015:02:37:53.874] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 3498 cD 1404/1335/1335/1335/0 0/0
haproxy[22346]: 10.243.232.14:54625 [10/Jul/2015:02:38:11.399] galera galera/pcmk-hovsh0800sdc-06 1/0/5400008 12461 cD 1407/1335/1335/1335/0 0/0
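A rough count of those terminations confirms the pattern; the log path below is an assumption, since where rsyslog writes haproxy output varies by deployment:

grep -c ' cD ' /var/log/haproxy.log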
19. Haproxy has hit maxconn for galera!
Haproxy stats for the galera proxy: sessions max: 2000, limit: 2000.
Hold on, but where did I set it? Nowhere!!!
● Then where does this limit come from? This is the default limit applied to each proxy when one is not explicitly defined.
● Then why is there no proper error message? Haproxy queues the connection to wait for a free database connection, then terminates it when the timeout is hit.
global
    daemon
    group haproxy
    maxconn 40000
    pidfile /var/run/haproxy.pid
    user haproxy
defaults
    log 127.0.0.1 local2 warning
    mode tcp
    option tcplog
    option redispatch
    retries 3
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    maxconn 2000
listen galera
    bind 10.243.232.62:3306
    mode tcp
    option tcplog
    option httpchk
    option tcpka
    stick on dst
    stick-table type ip size 2
    timeout client 90m
    timeout server 90m
    server controller-1 10.243.232.14:3306 check inter 1s on-marked-down shutdown-sessions
    server controller-2 10.243.232.15:3306 check inter 1s on-marked-down shutdown-sessions
    server controller-3 10.243.232.16:3306 check inter 1s on-marked-down shutdown-sessions
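The fix, then, is an explicit maxconn in the galera listen section; the value below anticipates the sizing worked out on the next slides and is deployment-specific:

listen galera
    bind 10.243.232.62:3306
    maxconn 4960
    ...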
20. I solved your problem, can I go and sleep? Hold on..
What should be the maxconn for galera?
● It took more time to determine the right value for the maximum number of database connections, because it depends on:
○ How many workers are spawned by each API?
■ Depends on the api_workers/workers configuration for each service, which defaults to the number of CPU cores on each controller.
■ This can differ from deployment to deployment.
○ Each worker process opens five long-lived database connections.
○ There are also some short-lived connections from each worker.
Now I can sleep like him.
# Number of workers for OpenStack API service. The default will be the number of CPUs available. (integer value)
21. What should be the maxconn for galera?
Based on a default deployment by RHEL OpenStack Platform Director: three identical controllers, 24 cores each.
Workers per controller (cores x multiplier):
● nova-api: 24x3 = 72
● keystone: 24x2 = 48
● neutron-server: 24x2 = 48
● glance-api: 24x1 = 24
● cinder-api: 24x1 = 24
● glance-registry: 24x1 = 24
● nova-conductor: 24x1 = 24
Total workers per controller = 264, so long-lived connections per controller = 264x5 = 1320.
Across three controllers, the haproxy VIP total is 3x1320 = 3960.
Add roughly 1,000 more for:
1 - Short-lived connections
2 - Other services
3 - New services
Total ≈ 4960
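The arithmetic above is easy to automate at deployment time. A minimal shell sketch under this deck's assumptions (24 cores, three controllers, five long-lived connections per worker, about 1,000 connections of headroom):

#!/bin/bash
CORES=24            # CPU cores per controller
CONTROLLERS=3       # identical controllers behind the VIP
CONN_PER_WORKER=5   # long-lived DB connections per worker process
HEADROOM=1000       # short-lived connections, other and new services

# Worker multipliers per service: nova-api x3, keystone x2,
# neutron-server x2, and four more services at x1 each.
MULTIPLIERS=$((3 + 2 + 2 + 1 + 1 + 1 + 1))

WORKERS=$((CORES * MULTIPLIERS))                        # 264 per controller
LONGLIVED=$((WORKERS * CONN_PER_WORKER * CONTROLLERS))  # 3960 at the VIP
echo "suggested galera maxconn: $((LONGLIVED + HEADROOM))"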
22. To sleep like a …..?
Setting the right maxconn value upfront for the database proxy can save you from sleepless nights.
● Decide how many worker processes each API needs for optimum performance. A 96-core system does not need 3x nova-api workers.
● Automate this calculation and set the result at deployment time itself, for both haproxy (maxconn) and the database server (max_connections; see the sketch below).
● If you use a different load balancer, make sure to address the same problem there, if applicable.
Decide and set the right value upfront before going to bed.
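On the database side, the matching knob is MariaDB's max_connections. A minimal sketch; the file path and the value are assumptions, sized with the same calculation plus headroom above the proxy's maxconn:

# /etc/my.cnf.d/galera.cnf
[mysqld]
# Raise MariaDB's own cap so it is not the next bottleneck behind haproxy.
max_connections = 5000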