NTT DOCOMO case study: OpenStack Summit 2015 Tokyo talk "After One Year of OpenStack Cloud Operation (NTT DOCOMO)"

Copyright©2015 NTT DOCOMO, INC. All rights reserved.
After One Year of OpenStack Cloud
Operation (NTT DOCOMO)
NTT DOCOMO Inc.
Ken Igarashi
NTT Software
Asako Ishigaki
NEC
Akihiro Motoki

About Us
Ken Igarashi
○ Leads the OpenStack project at NTT DOCOMO
○ One of the first to propose OpenStack Bare Metal Provisioning (now called "Ironic") - bit.ly/1stuN2E
Asako Ishigaki
○ Engineer, NTT Software
○ Develops OpenStack log collection and analytics tools
Akihiro Motoki
○ Senior Research Engineer, NEC
○ Core developer of Neutron and Horizon

Our Project

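[Figure: project timeline with ticks at 2014-6, 2014-8, 2014-11, 2015-2, 2015-5, 2015-8, 2015-11. Phases, each with the figure given in parentheses on the slide: System Design (8), Scalable Test using 100 nodes (10), Recovery Tests (12), Racking and Cabling (14), 24/7 support (14), User Support (+x).]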

o Team Rules (Culture)
 Focus on using OpenStack rather than developing OpenStack
 Think about how to use it.
 Don't assume OpenStack can't do XXXX.
 Reduce Opex / Promote Automation
 Operation tools
• "Anything that a human needs to do more than twice must be automated."
 Reduce the number of operators through HA and self-healing.

o Tools
 Ansible, Python, Shell Script
 CI/CD
• pep8
• ansible-lint
• Install
 Deployment Procedure
[Figure labels: Spec Writing, Test, Review, Production, +5]
 200+ deployments (2015)
 2000+ patches (2015)

Operation

o OpenStack Configuration(http://bit.ly/1DbJPUO)
 Double redundancies for hardware
 Triple redundancies for software
[Diagram: components shown are VMs, Nova, OpenStack APIs, LB (x2), MySQL (Galera) with DB1, DB2, DB3, DB4 and an Arbitrator, RabbitMQ, Neutron Agents, Zabbix, MaaS, PXE, DNS, DHCP.]

o Deployment
 CMDB Registration

o Choose playbooks for Ansible Dynamic Inventory
Ansible
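The slides do not show how the CMDB feeds Ansible. As a minimal sketch, assuming the CMDB can be read as a list of host records, a dynamic inventory script could group hosts by role and print the JSON that Ansible expects when it calls the script with `--list`. The record fields (`name`, `role`, `ip`) and the host names are hypothetical.

```python
import json

# Hypothetical CMDB records; the real CMDB schema is not given in the slides.
CMDB_HOSTS = [
    {"name": "ctl01", "role": "control", "ip": "192.0.2.11"},
    {"name": "net01", "role": "network", "ip": "192.0.2.21"},
    {"name": "cmp01", "role": "compute", "ip": "192.0.2.31"},
    {"name": "cmp02", "role": "compute", "ip": "192.0.2.32"},
]


def build_inventory(hosts):
    """Group hosts by role in the JSON layout Ansible reads from `--list`."""
    inventory = {"_meta": {"hostvars": {}}}
    for host in hosts:
        group = inventory.setdefault(host["role"], {"hosts": []})
        group["hosts"].append(host["name"])
        inventory["_meta"]["hostvars"][host["name"]] = {"ansible_host": host["ip"]}
    return inventory


if __name__ == "__main__":
    # Ansible runs an inventory script with --list and parses its stdout.
    print(json.dumps(build_inventory(CMDB_HOSTS), indent=2))
```

With a layout like this, each playbook can target a group such as `compute` without a hand-maintained hosts file.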

o Deployments
 Common: network, account, logging, Zabbix agent, drivers/firmware x 37
 OpenStack: Nova, Swift, Neutron, ……. x 62
 HA Configuration
[Diagram labels: initial setup, update, compile; kernel, driver, firmware, filesystem; development environment; Install HDD Driver.]

o Operation x 31
 Common: process restart, log collection
 OpenStack Operation: usage, VM migration/backup, user add/delete/quota change
 OpenStack Monitoring: health check tools
 Per-host instance check
• Launch instances on given node(s)
• Check that boot succeeds; check the instance log
• Check metadata retrieval, login prompt, SSH access
• Optionally, test volume attach and its read/write access
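The per-host instance check above is an ordered list of checks where later steps only make sense if earlier ones pass. A minimal sketch of that control flow follows; the check functions are stubs with hypothetical names, standing in for the real Nova and SSH calls that the slides do not show.

```python
def run_checks(target, checks):
    """Run (name, check) pairs in order and stop at the first failure.

    Each check is a callable taking the target and returning True or False.
    Returns the list of (name, passed) results collected so far.
    """
    results = []
    for name, check in checks:
        passed = bool(check(target))
        results.append((name, passed))
        if not passed:
            break  # later checks depend on earlier ones (no boot, no SSH)
    return results


# Stub checks (hypothetical); real ones would query Nova and the instance.
def boot_succeeded(instance):
    return instance.get("status") == "ACTIVE"


def login_prompt_seen(instance):
    return "login:" in instance.get("console_log", "")


CHECKS = [("boot", boot_succeeded), ("login prompt", login_prompt_seen)]
```

Stopping at the first failure keeps the report short: the first failed step is usually the one an operator needs to look at.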

o Presentation: "What are operators doing behind the Cloud?"
 2015/10/27 4:40pm - 5:20pm
 Heian (New Takanawa)

Monitoring System

o Monitoring System
[Diagram: monitored components are VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, and databases. Monitoring stack: Fluentd, Elasticsearch, Kibana, Zabbix. Coverage labels: "Weekday daytime" and "24h / 365d".]

[Diagram: the same monitored components (VMs, Swift, Cinder, Nova, RabbitMQ, Neutron Agents, databases), plus general items: Memory, CPU, Network, HDD.]

            Monitoring Items   Self Healing
General     1,970              25
OpenStack   3,957              59

o RabbitMQ
 Configuration
 3 node cluster
 cluster_partition_handling: autoheal
 Monitoring
 Split Brain check:
• rabbitmqctl eval '[N||{partitions,N}<-rabbit_mnesia:status()].'
 Port Check (5672, 25672)
 Process Check
• beam.smp
• rabbitmq-server
 At least one node running (1/3)
• {OpenStack-RabbitMQ:grpsum["HostG-RabbitMQ","net.tcp.service[tcp,,25672]",last,0].count(#3,0,"eq")}=3
• {OpenStack-RabbitMQ:grpsum["HostG-RabbitMQ","proc.num[beam.smp]",last,0].count(#3,0,"eq")}=3
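The two triggers above are read here as: sum the per-node check over the host group, and fire when that sum was 0 in each of the last three samples, that is, when no node at all has been up three times in a row. A minimal Python sketch of the same logic follows, assuming a plain TCP connect as the port check; the class name is hypothetical.

```python
import socket
from collections import deque


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class AllDownTrigger:
    """Fire when the group sum was 0 for the last `samples` checks."""

    def __init__(self, samples=3):
        self.history = deque(maxlen=samples)

    def update(self, node_states):
        # node_states holds one boolean per node: True means the check passed.
        self.history.append(sum(1 for up in node_states if up))
        full = len(self.history) == self.history.maxlen
        return full and all(value == 0 for value in self.history)
```

In use, each polling cycle would pass something like `[port_open(h, 25672) for h in hosts]` to `update`, where `hosts` is the list of the three RabbitMQ nodes. Requiring three consecutive zero samples avoids alerting on a single missed poll.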

o MySQL
 Configuration
 4 Nodes + 1 Arbitrator
 Monitoring
 Cluster Check
• wsrep_local_recv_queue
• wsrep_local_send_queue
• wsrep_flow_control_paused
• wsrep_local_commits
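A cluster check on these four status variables could look like the sketch below. It assumes the values were already read from one node with `SHOW GLOBAL STATUS LIKE 'wsrep_%'` and passed in as a dict. The thresholds are illustrative assumptions, not values from the slides.

```python
# Thresholds are illustrative assumptions, not values from the slides.
THRESHOLDS = {
    "wsrep_local_recv_queue": 10,
    "wsrep_local_send_queue": 10,
    "wsrep_flow_control_paused": 0.1,
}


def check_wsrep(status, previous_commits=None):
    """Return a list of warning strings for one node's wsrep status values."""
    warnings = []
    for name, limit in THRESHOLDS.items():
        value = float(status[name])
        if value > limit:
            warnings.append(f"{name}={value:g} exceeds {limit:g}")
    commits = int(status["wsrep_local_commits"])
    # A counter that stops advancing on a busy cluster suggests a stall.
    if previous_commits is not None and commits == previous_commits:
        warnings.append("wsrep_local_commits did not advance")
    return warnings
```

The commit-counter comparison only makes sense when the cluster is expected to receive writes between polls; on an idle cluster it would produce false warnings.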
[Diagram labels: Arbitrator, LB, R/W.]

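o MySQL Cluster
[Diagram: a MySQL client sends a commit to the Master. Galera replicates it to the Slave through send_queue and recv_queue, and each node writes to its own disk. The Master waits until it receives OK from replication before returning OK to the client.]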
o MySQL Cluster Freeze
[Diagram: the same replication flow as on the MySQL Cluster slide, with a fault (👿) marked. The Master still waits until it receives OK from replication.]
• Disk Failure: 😀 (removed from cluster)
• Disk Speed Throttling: 😢

Prohibited Actions During a MySQL Cluster Freeze
○ Prohibit some self-healing actions
 Do not restart some OpenStack processes
– neutron-plugin-openvswitch-agent
 Do not reboot network nodes
– They lose network reachability (can't recreate network namespaces)
 All the VMs lose connections
 Solved in Liberty?

o Throttling happens during DB backup
 Limit Backup Node
[Diagram labels: LB, R/W, Limit Backup Node.]
 Backup Method
• LOCK TABLES FOR BACKUP (online)
1. Take from cluster (Donor/Desynced)
2. DB lock and do backup (FLUSH TABLES WITH READ LOCK)
3. Return to cluster (wsrep_desync=OFF)
– wsrep_local_recv_queue
– wsrep_local_commits
Log Analytics
Kibana

Purpose of Log Analytics
(1) Detect critical system failures: we have to recover immediately
(2) Detect malicious access: we need to notify users
(3) Detect non-critical errors: better to fix as soon as possible
(4) Find errors/warnings that have no service impact: we want to filter them out next time

Treasure Hunt in the Ocean of Logs
○ e.g. Logs of a day
Total: 100 GB, 80M lines
Sum of critical, error and warning logs: 200K lines
The meaningful logs are far fewer:
(1) 0 critical failures (2) 0 malicious accesses
(3) 6 non-critical failures (4) 6 ignorable failures
[Chart: Breakdown of Logs. Legend: Critical, Error, Warning, Info, Debug, Other. Values shown: 0%, 0%, 1%, 30%, 39%, 30%.]
[Second chart. Legend: HW, OS, OpenStack backend, OpenStack, Operation tools. Values shown: 0%, 24%, 24%, 49%, 3%.]

Log Analytics Based on White/Black List
○ We analyze logs to enhance our black list and white list.
○ Logs found in our black list are sent to Zabbix.
[Diagram: logs are split three ways: to Zabbix, to Kibana, and to trash. Analysis adds entries to the black list and the white list; as the lists expand, the logs left to analyze reduce.]
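A minimal sketch of list-based routing, assuming each list is a set of regular expressions matched against the log message. The two patterns are adapted from examples elsewhere in this talk; the route names and the sample messages are hypothetical.

```python
import re

# Example patterns adapted from the talk; real lists would be much longer.
BLACK_LIST = [
    re.compile(r"^pengine: warning: unpack_rsc_op: Processing failed op monitor for .*$"),
]
WHITE_LIST = [
    re.compile(r"^nova\.scheduler\.host_manager \[[^\]]+\] "
               r"Host has more disk space than database expected .*$"),
]


def route(message):
    """Return 'zabbix' for black-list hits, 'trash' for white-list hits, else 'kibana'."""
    if any(pattern.search(message) for pattern in BLACK_LIST):
        return "zabbix"
    if any(pattern.search(message) for pattern in WHITE_LIST):
        return "trash"
    return "kibana"  # unknown message: left for a human to analyze
```

Checking the black list first means a message that happens to match both lists still raises an alert. Messages that reach the `kibana` route are the ones whose analysis grows one list or the other.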

How to Adopt Black/White List Using Fluentd
[Diagram: Network Nodes, Control Nodes, Compute Nodes, the LB, and the UTM send logs through fluentd and rsyslog to the Log Server, which runs Fluentd, Elasticsearch, and zabbix_sender. Kibana refers to Elasticsearch to draw graphs; Zabbix raises alerts.]
Fluentd on the Log Server:
• Adds an "ignorable" flag according to the white list
• Adds metadata to create graphs from the logs
• Notifies Zabbix according to the black list

How to Adopt Black/White List Using Fluentd (example)
Input logs:
1. syslog: 10:01 crit: hardware failure
2. rsyslog: 10:03 warn: IDS: from x.x.x.x
3. api.log: 10:04 ERROR: invalid request format
Records after Fluentd on the Log Server:
path      timestamp  severity  item    source_ip  message
syslog    10:01      crit      -       -          hardware failure
rsyslog   10:03      warn      ids     x.x.x.x    IDS: from x.x.x.x
api.log   10:04      ERROR     ignore  -          invalid request format
Outputs: zabbix_sender notifies Zabbix of "hardware failure"; Kibana refers to Elasticsearch for the IDS graph and the crit graph.
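The enrichment step in this example can be sketched in Python as below. The real pipeline uses Fluentd configuration, which the slides do not show, so this only illustrates the record shape. The line format (`HH:MM severity: message`) is taken from the three sample lines; the function name and the white-list argument are hypothetical.

```python
import re

LINE = re.compile(r"^(?P<timestamp>\d{2}:\d{2}) (?P<severity>\w+): (?P<message>.*)$")
IDS_SOURCE = re.compile(r"^IDS: from (?P<source_ip>\S+)")


def to_record(path, line, white_list=()):
    """Parse one log line into a record with 'item' and 'source_ip' metadata."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    record = {"path": path, **match.groupdict(), "item": "-", "source_ip": "-"}
    ids = IDS_SOURCE.match(record["message"])
    if ids:
        # IDS messages carry the offending address; keep it for graphing.
        record["item"] = "ids"
        record["source_ip"] = ids.group("source_ip")
    elif any(re.search(pattern, record["message"]) for pattern in white_list):
        record["item"] = "ignore"
    return record
```

For instance, `to_record("rsyslog", "10:03 warn: IDS: from x.x.x.x")` yields a record whose `item` is `ids` and whose `source_ip` is `x.x.x.x`.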

Example of Our White List # with Juno
1. ^keystonemiddleware.auth_token \[-\] Unable to find authentication token in headers$
• Count response codes and understand the trend. That's enough.
2. ^nova.api.openstack \[[^\]]*\] Caught error: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed Gigabytes quota..*$
• This ERROR means the user's operation was denied due to quota.
• It has no impact on our system. Should it be an INFO log?
3. ^nova.scheduler.host_manager \[[^\]]+\] Host has more disk space than database expected .*$
• This WARNING is caused by the presence of SHUTOFF instances.
• It is a commonplace condition and needs to be ignored.

Effect of Our White List
○ We succeeded in reducing the logs to be analyzed.
 In other words, many meaningless logs have high log levels.
Without White List: 160K
With White List: 37 (reduced by 99.98%)
1 year ago: We couldn't analyze all logs in a day.
Today: We can analyze all logs in 2-3 hours a day!
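The reduction figure follows directly from the two counts on this slide, taking 160K as 160,000 lines:

```latex
\[
1 - \frac{37}{160{,}000} = 1 - 0.00023125 = 0.99976875 \approx 99.98\%
\]
```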

Example of Our Black List
1. ^kernel: \[[^\]]*\] XXXXX.*hardware failure.$
• This message indicates a disk problem on a Compute node.
• Warning alert
2. ^pengine: warning: unpack_rsc_op: Processing failed op monitor for .*$
• Corosync needs to clean up its resources.
• Information alert
3. ^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$
• Full backup of MySQL failed once.
• Information alert

Demonstration with Kibana
○ 3 dashboards
 OpenStack
 All Logs
 Error Logs
 Critical Logs
 Warning Logs
 IDS

Trademarks
○ Kibana is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
○ Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
○ logstash is a trademark of Elasticsearch BV.

o Presentation - Operation
 2015/10/27 4:40pm - 5:20pm Heian (New Takanawa)
"What are operators doing behind the Cloud?"
o Exhibition
 NEC Booth (H4)
 28 (Wed.) 10:45-13:00, 16:30-18:30; 29 (Thu.) 9:00-14:00
 NTT Group Booth (S14)
 28 (Wed.) 13:15-16:15
"Touch and Feel! NTT DOCOMO's Cloud Operation"
contact-cloudpf-ml@nttdocomo.com


Thank you for your attention.
