OpenStack's complex architecture and design amplify the effort of debugging a problem in an OpenStack environment. If the native and supporting components are not configured properly and tuned for optimum performance, this can give administrators and support associates sleepless nights, especially in large deployments that involve high availability and load balancing.
5. Manifestation of a Problem
“Our compute service on the compute node is stuck in a state of activating.”
“Most OpenStack Overcloud neutron services inactive and disabled”
No valid host was found. Exceeded max scheduling attempts 3 for instance
PortLimitExceeded: Maximum number of ports exceeded
“User unable to launch new instances”
Instance failed to spawn
8. Over-Working RabbitMQ
Problem Description: Our compute service on the compute node is stuck in a state of activating
The initial evidence is a set of nondescript timeouts:
# journalctl --all --this-boot --no-pager | grep nova
May 27 16:20:50 host.example.com systemd[1]: openstack-nova-compute.service operation timed out. Terminating.
May 27 16:20:50 host.example.com systemd[1]: Unit openstack-nova-compute.service entered failed state.
May 27 16:20:50 host.example.com systemd[1]: openstack-nova-compute.service holdoff time over, scheduling restart.
Rebooting the compute node doesn’t help.
9. Over-Working RabbitMQ
Problem Description: Our compute service on the compute node is stuck in a state of activating
An strace of the nova-compute service reveals our trouble communicating with rabbit:
# grep :5672 compute.strace
12938 03:29:28.320069 write(3, "2015-05-28 03:29:28.319 12938 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.100.47:5672 is unreachable: Socket closed. Trying again in 1 seconds.\n", 169) = 169 <0.000019>
12938 03:29:29.321779 write(3, "2015-05-28 03:29:29.321 12938 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 192.168.100.48:5672\n", 126) = 126 <0.000061>
12938 03:29:30.333894 write(3, "2015-05-28 03:29:30.333 12938 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.168.100.48:5672\n", 123) = 123 <0.000013>
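For reference, a trace like the one above could be captured with something along these lines; the PID lookup via pgrep is just one option, any way of finding the nova-compute PID works:

# Attach to nova-compute, follow forks (-f), stamp syscalls with
# microsecond timestamps (-tt), and keep strings readable (-s 256).
strace -f -tt -s 256 -o compute.strace -p "$(pgrep -of nova-compute)"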
10. Over-Working RabbitMQ
Problem Description: Our compute service on the compute node is stuck in a state of activating
The strace leads to more logs...
The logs lead to an existing bug report...
The bug report leads to an upstream discussion...
Yadda Yadda Yadda
The rabbitmq-server process is out of file descriptors!
12. Now You Know!
Too few RabbitMQ file descriptors is a recipe for sleepless nights.
Set the rabbitmq-server NOFILE limit to 65436*
*Be careful if you’re using pacemaker -- limits are set by the resource agent.
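On a systemd-managed node (and when pacemaker is not in the picture), a minimal sketch of raising the limit is a drop-in for the rabbitmq-server unit; the drop-in file name is arbitrary:

# /etc/systemd/system/rabbitmq-server.service.d/limits.conf
[Service]
LimitNOFILE=65436

Follow up with systemctl daemon-reload and a restart of rabbitmq-server, then confirm with cat /proc/$(pgrep -of beam)/limits.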
13. Knowledge-Centered Support
● Continuous improvement of the knowledgebase simplifies troubleshooting of future issues
● Knowledge automatically captured as a by-product of the problem-solving process
● Search and reuse as core disciplines of the support team
● Fast track to publication means easier self-resolution
https://access.redhat.com/solutions/1465753
14. WE HAVE OUR RIGHT TO SLEEP
Issue #2: Random failure while spawning a large number of instances
15. $ nova list
ERROR (ConnectionRefused): Unable to establish connection to http://192.168.1.1:35357/v2.0/tokens
● Connections to the various OpenStack service APIs (nova-api, cinder-api, neutron-api, etc.) time out randomly.
● Not reproducible in most environments. When it happens, the failure is random, without any pattern -- sometimes 1 in 100 or 1 in 500.
● Obviously keystone is up and running perfectly fine.
[Diagram: nova-api, cinder-api, and neutron-api all hit “connection refused!!” when talking to Keystone]
Issue #2: The symptom is the same as in issue #1.
Result: Random failures in spawning instances, creating volumes, networks, etc.
16. The first suspect is Keystone, but he is innocent.
Where can one go wrong?
Looking at the error message, it’s natural to point fingers at keystone.
● Looked at the keystone API logs. No clue!!
● Could see an abnormal number of keystone connections in CLOSE_WAIT status. Focused on that and wasted a lot of time investigating in that direction (see the snippet after this list).
● It’s time to understand how a connection travels from the end user to the API and on to keystone, focusing on how the dots are connected.
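For reference, one quick way to spot that CLOSE_WAIT pile-up might be the following; port 35357 is the keystone admin endpoint from the error above, so adjust for your own endpoints:

ss -tn state close-wait '( sport = :35357 or dport = :35357 )'

Counting the output lines shows how many keystone connections are stuck half-closed.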
17. How does it work under the hood?
[Diagram: a single VIP fronting haproxy on controller-1, controller-2, and controller-3; each controller runs nova-api, keystone, and mariadb-galera. The “connection refused!!” shows up somewhere along the chain end user -> VIP -> nova-api -> keystone -> database.]
18. Possibilities?
Keystone is already ruled out.
● Intermittent network packet drop? No, ruled out by network troubleshooting.
● Haproxy (load balancer) drops the connection? Likely -- but on which hop?
○ end user -> nova: Highly unlikely, as the error occurs when nova connects to keystone.
○ nova -> keystone: Slightly likely.
○ keystone -> database: Highly likely. Enabled haproxy logging and found heavy client-termination messages. (In haproxy’s termination state, “cD” means the client-side timeout expired while the session was in the data phase.)
haproxy[22346]: 10.243.232.62:48999 [10/Jul/2015:01:41:34.706] galera galera/pcmk-hovsh0800sdc-06 1/0/8734961 37181 cD 1369/1337/1337/1337/0 0/0
haproxy[22346]: 10.243.232.14:53092 [10/Jul/2015:02:37:43.666] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 2875 cD 1375/1337/1337/1337/0 0/0
haproxy[22346]: 10.243.232.62:41742 [10/Jul/2015:01:47:44.819] galera galera/pcmk-hovsh0800sdc-06 1/0/8400246 38448 cD 1376/1336/1336/1336/0 0/0
haproxy[22346]: 10.243.232.14:53318 [10/Jul/2015:02:37:47.499] galera galera/pcmk-hovsh0800sdc-06 1/0/5400005 3414 cD 1384/1335/1335/1335/0 0/0
haproxy[22346]: 10.243.232.62:42507 [10/Jul/2015:02:37:47.529] galera galera/pcmk-hovsh0800sdc-06 1/0/5400006 2875 cD 1383/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.62:42609 [10/Jul/2015:02:37:49.103] galera galera/pcmk-hovsh0800sdc-06 1/0/5400315 35783 cD 1384/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.62:42684 [10/Jul/2015:02:37:50.598] galera galera/pcmk-hovsh0800sdc-06 1/0/5400259 28994 cD 1384/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.14:53493 [10/Jul/2015:02:37:50.885] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 2875 cD 1383/1333/1333/1333/0 0/0
haproxy[22346]: 10.243.232.14:53674 [10/Jul/2015:02:37:53.874] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 3498 cD 1404/1335/1335/1335/0 0/0
haproxy[22346]: 10.243.232.14:54625 [10/Jul/2015:02:38:11.399] galera galera/pcmk-hovsh0800sdc-06 1/0/5400008 12461 cD 1407/1335/1335/1335/0 0/0
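A rough count of those terminations confirms the pattern; the log path below is an assumption, since where rsyslog writes haproxy output varies by deployment:

grep -c ' cD ' /var/log/haproxy.log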
19. Haproxy has hit maxconn for galera!
Haproxy stats for the galera proxy: sessions max: 2000, limit: 2000.
Hold on, but where did I set it? Nowhere!!!
● Then where does this limit come from? This is the default limit applied to each proxy when one is not explicitly defined.
● Then why is there no proper error message? Haproxy queues the connection to wait for a free database connection, then terminates it when the timeout is hit.
global
    daemon
    group haproxy
    maxconn 40000
    pidfile /var/run/haproxy.pid
    user haproxy
defaults
    log 127.0.0.1 local2 warning
    mode tcp
    option tcplog
    option redispatch
    retries 3
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    maxconn 2000
listen galera
    bind 10.243.232.62:3306
    mode tcp
    option tcplog
    option httpchk
    option tcpka
    stick on dst
    stick-table type ip size 2
    timeout client 90m
    timeout server 90m
    server controller-1 10.243.232.14:3306 check inter 1s on-marked-down shutdown-sessions
    server controller-2 10.243.232.15:3306 check inter 1s on-marked-down shutdown-sessions
    server controller-3 10.243.232.16:3306 check inter 1s on-marked-down shutdown-sessions
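The fix, then, is an explicit maxconn in the galera listen section; the value below anticipates the sizing worked out on the next slides and is deployment-specific:

listen galera
    bind 10.243.232.62:3306
    maxconn 4960
    ...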
20. I solved your problem, can I go and sleep? Hold on..
What should be the maxconn for galera?
● It took more time to determine the right value for the maximum number of database connections, because it depends on:
○ How many workers are spawned by each API?
■ Depends on the api_workers/workers configuration for each service, which defaults to the number of CPU cores on each controller.
■ This can differ from deployment to deployment.
○ Each worker process opens five long-lived database connections.
○ There are also some short-lived connections from each worker.
Now I can sleep like him.
# Number of workers for OpenStack API service. The default will be the number of CPUs available. (integer value)
21. What should be the maxconn for galera?
Based on a default deployment by RHEL OpenStack Platform Director: three identical controllers, 24 cores each.
Workers per controller (cores x multiplier):
● nova-api: 24x3 = 72
● keystone: 24x2 = 48
● neutron-server: 24x2 = 48
● glance-api: 24x1 = 24
● cinder-api: 24x1 = 24
● glance-registry: 24x1 = 24
● nova-conductor: 24x1 = 24
Total workers per controller = 264, so long-lived connections per controller = 264x5 = 1320.
Across three controllers, the haproxy VIP total is 3x1320 = 3960.
Add roughly 1,000 more for:
1 - Short-lived connections
2 - Other services
3 - New services
Total ≈ 4960
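The arithmetic above is easy to automate at deployment time. A minimal shell sketch under this deck's assumptions (24 cores, three controllers, five long-lived connections per worker, about 1,000 connections of headroom):

#!/bin/bash
CORES=24            # CPU cores per controller
CONTROLLERS=3       # identical controllers behind the VIP
CONN_PER_WORKER=5   # long-lived DB connections per worker process
HEADROOM=1000       # short-lived connections, other and new services

# Worker multipliers per service: nova-api x3, keystone x2,
# neutron-server x2, and four more services at x1 each.
MULTIPLIERS=$((3 + 2 + 2 + 1 + 1 + 1 + 1))

WORKERS=$((CORES * MULTIPLIERS))                        # 264 per controller
LONGLIVED=$((WORKERS * CONN_PER_WORKER * CONTROLLERS))  # 3960 at the VIP
echo "suggested galera maxconn: $((LONGLIVED + HEADROOM))"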
22. To sleep like a …..?
Setting the right maxconn value upfront for the database proxy can save you from sleepless nights.
● Decide how many worker processes each API needs for optimum performance. A 96-core system does not need 3x nova-api workers.
● Automate this calculation and set the result at deployment time itself, for both haproxy (maxconn) and the database server (max_connections; see the sketch below).
● If you use a different load balancer, make sure to address the same problem there, if applicable.
Decide and set the right value upfront before going to bed.
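On the database side, the matching knob is MariaDB's max_connections. A minimal sketch; the file path and the value are assumptions, sized with the same calculation plus headroom above the proxy's maxconn:

# /etc/my.cnf.d/galera.cnf
[mysqld]
# Raise MariaDB's own cap so it is not the next bottleneck behind haproxy.
max_connections = 5000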