1. Copyright © 2015 NTT DATA Corporation
2015/10/27
NTT DATA Corporation
Know-how of Challenging Deploy/Operation NTT DOCOMO's Mail
Cloud System Powered by OpenStack Swift
2.
Abstract
Docomo mail is a 24/7 cloud mail system accessed by over 20 million
people. The system stores users' mail archives in a petabyte-scale
OpenStack Swift cluster deployed by NTT DATA.
We have been operating this service successfully since Sep 2014 without any
downtime. In this session, we present the actual issues and challenges we
faced and overcame.
3.
Today’s contents and presenters
○Project Overview
Changes in the Japanese mobile market and an overview of this project
– Project Manager : Sosuke Kakehi
○Migration process
Process of migrating Swift into the existing docomo mail system
– OpenStack Swift Engineer : Masaaki Nakagawa
○Technical challenges
Swift technical challenges in this project
– OpenStack Engineer : Ryosei Kasai
○Operating session
Large-scale Swift operation
– OpenStack Swift Engineer : Masaaki Nakagawa
5.
Project Overview
1 NTT Docomo's Cloud Mail System
2 Project Background
3 Customer Requirements
6.
NTT Docomo's Cloud Mail System - System Summary
• Docomo Mail - NTT Docomo’s Cloud Mail Service
• Over 20 million users
• Powered by OpenStack Swift
Figure: smartphones and tablet PCs access the cloud mail system; later mail is kept on high-performance storage, and archived mail is stored in object storage (OpenStack Swift).
7.
NTT Docomo's Cloud Mail System - System Scale
• Geographically distributed Swift cluster
• Over 6.4 petabytes of logical capacity
• Hundreds of servers
Figure: four sites; Site1 hosts the proxy nodes, and Site2, Site3, and Site4 host the storage nodes of Region1, Region2, and Region3.
8.
Project Background
Shift from “feature phone” to “smartphone”
Figure: smartphones and tablet PCs use many services that handle documents, text, photos, music, movies, and applications.
E-mail data size increased
9.
Project Background
Figure: high-end storage units added one after another, each one adding cost.
Extending the high-end storage again and again
= cost on top of cost
10.
Customer Requirements
• High availability
• Low cost
• High scalability
• Disaster recovery
• etc.
OSS (software storage) + IA server
Adopt OpenStack Swift
12.
Overview of migration session
NTT DOCOMO launched the docomo mail service in Oct 2013, and Swift was
installed into the docomo mail system in Jan 2015. The migration to Swift
was carried out without stopping the user service.
In this section, I would like to give an overview of the docomo mail system
and the migration process.
Timeline (older to later)
• Oct, 2013 : docomo mail service in
• May, 2014 : test users start to use Swift
• Jan, 2015 : Swift service in
• Oct, 2015 : general users start to test use Swift
13.
Swift migrate session
System construction overview
Figure: the Internet connects to the docomo mail frontend server (proxy of block storage and Swift). Behind it are the high-speed block storage (later mail holder), which holds user mail, and Swift (archived mail holder), a proxy with three storage nodes that each hold archived user mail.
14.
Swift migrate session
Mail access flow
Figure: an access device reaches the docomo mail frontend server (proxy of block storage and Swift) over the Internet. User mail is held on block storage; archived user mail is held on the Swift storage nodes behind the Swift proxy.
User mail will be archived/stored to Swift
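The access flow above can be sketched in a few lines. This is only an illustration, not the actual docomo mail frontend: the class name, the method names, and the in-memory dictionaries standing in for block storage and Swift are all hypothetical.

```python
class MailFrontend:
    """Hypothetical frontend that proxies block storage and Swift."""

    def __init__(self):
        self.block_storage = {}  # stand-in for high-speed block storage (later mail)
        self.swift = {}          # stand-in for Swift (archived mail)

    def store(self, mail_id, body):
        # New mail lands on block storage first.
        self.block_storage[mail_id] = body

    def archive(self, mail_id):
        # Write to Swift before removing the block-storage copy,
        # so the mail is always readable from at least one place.
        self.swift[mail_id] = self.block_storage[mail_id]
        del self.block_storage[mail_id]

    def fetch(self, mail_id):
        # Later mail is served from block storage, archived mail from Swift.
        if mail_id in self.block_storage:
            return self.block_storage[mail_id]
        return self.swift[mail_id]


frontend = MailFrontend()
frontend.store("m1", "hello")
frontend.archive("m1")
print(frontend.fetch("m1"))
```

The client sees a single `fetch` call either way; which backend holds the mail stays an internal detail of the frontend.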
15.
Swift migrate session
System construction (before swift installed)
Figure: the Internet connects to the docomo mail frontend server, which is backed only by block storage holding both user mail and archived user mail.
16.
Swift migrate session
Migration 1st step – deploy swift and test
• Deploy Swift
• Failure testing
• Tuning
Figure: Swift (a proxy and three storage nodes) is deployed next to the docomo mail frontend server and block storage; all user mail and archived user mail is still on block storage.
17.
Swift migrate session
Migration 2nd step – copy test user’s archived mail
• Copy test users' archived mail to Swift
• General users' mail is not copied
Figure: block storage still holds user mail and archived user mail; the Swift storage nodes now also hold the test users' archived mail.
18.
Swift migrate session
Migration 3rd step – copy general user’s archived mail
• Copy general users' archived mail to Swift
• Keep the whole mail archive on block storage as a safeguard against Swift trouble
Figure: block storage still holds user mail and archived user mail; the Swift storage nodes now hold the archived mail of all users.
19.
Swift migrate session
Migration 4th step – launch service
Figure: the docomo mail frontend server serves user mail from block storage and archived user mail from the Swift storage nodes behind the Swift proxy.
20.
Conclusion of migrate session
• At first, docomo mail had only block storage
• We needed to deploy Swift and migrate to it with no downtime
• To achieve this, we divided the migration into 4 steps
– Deploy
– Copy test users' mail to Swift
– Copy general users' mail to Swift while keeping it on block storage
– System durability check
• We achieved a migration with no service downtime
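The two copy steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling used in the project: the function names are hypothetical, both stores are plain dictionaries keyed by `(user, mail_id)`, and the checksum comparison stands in for whatever verification was actually done.

```python
import hashlib


def copy_to_swift(keys, block_storage, swift):
    """Copy mail to Swift and verify it; block storage is left untouched."""
    failed = []
    for key in keys:
        body = block_storage[key]
        swift[key] = body
        # Compare checksums of the source and the copy.
        if hashlib.sha256(swift[key]).digest() != hashlib.sha256(body).digest():
            failed.append(key)
    return failed


def migrate(block_storage, swift, test_users):
    """Copy test users' mail first, then general users' mail."""
    test_keys = [k for k in block_storage if k[0] in test_users]
    general_keys = [k for k in block_storage if k[0] not in test_users]
    failed = copy_to_swift(test_keys, block_storage, swift)
    if failed:
        return failed  # stop before touching general users' mail
    return copy_to_swift(general_keys, block_storage, swift)


block = {("tester", 1): b"a", ("alice", 1): b"b", ("alice", 2): b"c"}
swift = {}
print(migrate(block, swift, test_users={"tester"}), len(swift), len(block))
```

Because the source copy is never deleted, a problem on the Swift side during either step leaves the block-storage archive intact.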
As I said, we tackled several technical challenges during the migration. In the next
session, Mr. Kasai will introduce them.
22.
Our Technical Challenges
1 Durability assurance
2 Geographically distributed cluster
3 Quality
23.
Challenge 1: Durability assurance
• Quality requirements in Japan
• This system needs very high quality
• Everything should be under control
• System design for normal situations
• System design for failure situations, even on a distributed system
• Analyze every behavior before building the system
24.
Recovery test in a variety of failure patterns
• Variety of failure patterns
(1) The point of failure
• Disk, NIC, Process, Node, …
(2) The number of failures
• 1, 2, 3, 4, …
(3) The range of failures
• 1 node, multiple nodes/zones/regions, …
100s of test cases!!
Figure: example test-case diagrams (Case #001, #101, #201, #301, #501), each showing a proxy and storage nodes; the larger cases spread the storage nodes over Zone1 and Zone2 within Region 1.
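Combining the three axes above is what produces so many cases. A minimal sketch of that enumeration follows; the concrete values on each axis, the case numbering, and the single exclusion rule are assumptions for illustration, not the project's actual test plan.

```python
from itertools import product

# Hypothetical values for the three axes named on the slide.
FAILURE_POINTS = ["disk", "nic", "process", "node"]
FAILURE_COUNTS = [1, 2, 3, 4]
FAILURE_RANGES = ["one node", "multiple nodes", "multiple zones", "multiple regions"]


def build_test_cases():
    """Enumerate failure patterns as point x count x range."""
    cases = []
    for point, count, scope in product(FAILURE_POINTS, FAILURE_COUNTS, FAILURE_RANGES):
        # A single failure cannot span several nodes, zones, or regions.
        if count == 1 and scope != "one node":
            continue
        cases.append({
            "case": f"#{len(cases) + 1:03d}",
            "point": point,
            "count": count,
            "range": scope,
        })
    return cases


cases = build_test_cases()
print(len(cases), cases[0])
```

Even these small axes give dozens of cases; adding more failure points or combining different failure types quickly reaches the hundreds mentioned above.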
25.
Result of recovery test
• Extreme durability and recoverability of Swift
• Swift rarely loses data. Only a precisely targeted failure (an "accurate snipe") or a
great disaster can cause data loss.
26.
Challenge 2: Geographically distributed cluster
• Geographically distributed Swift cluster to realize disaster recovery
• Important points to evaluate for global distribution
1. Client request
2. Durability
Figure: Site 1 (proxy) and Sites 2, 3, and 4 (storage) are connected by a private network; the sites are 300 km or more apart.
27.
Pseudo-global cluster
• Pseudo-global cluster with simulated network latency
• Proxy and 3 storage regions placed in different locations
• 10 to 200 msec latency between locations, simulated by tc
• TL msec latency one way, 2*TL msec latency for a round trip
Figure: the proxy and storage regions 1, 2, and 3, with 10 to 200 msec of simulated latency on every link between them. A client request that goes through the proxy to storage region 1 incurs TL msec in each direction.
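The slides say only that the latency was simulated by tc. One common way to do this is the netem queueing discipline, which is an assumption here, as are the interface name `eth0` and the helper names. The sketch below only builds the command strings and does not run them.

```python
def netem_delay_command(interface, one_way_ms):
    """Return a tc command that adds a fixed egress delay on one interface."""
    return f"tc qdisc add dev {interface} root netem delay {one_way_ms}ms"


def round_trip_ms(one_way_ms):
    # TL msec one way means 2 * TL msec for the round trip.
    return 2 * one_way_ms


for tl in (10, 50, 100, 200):
    print(netem_delay_command("eth0", tl), "| round trip:", round_trip_ms(tl), "msec")
```

Applying the same egress delay on both ends of a link delays each direction by TL, which gives the 2*TL round trip described above.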
28.
Two points of pseudo-global cluster testing
1. Client request
• Object PUT/GET/DELETE from client
• Error rate
• Turnaround time for 1 request
• Throughput
• Latency between proxy and storage
2. Durability
• Auto recovery by object-replicator
• Error rate
• Turnaround time of 1 sync process
• Throughput
• Latency between storages
Figure: (1) clients send PUT and GET requests through the proxy to storage regions 1, 2, and 3; (2) storage regions 1, 2, and 3 synchronize with each other.
29.
Test1: Client request
Object PUT/GET/DELETE from client
• No error caused by latency
• Degradation of turnaround time
• No throughput degradation for concurrent requests
Figure: two plots. Turnaround time versus latency (series: PUT/GET, DELETE); throughput versus concurrency (annotations: latency, limitation of network bandwidth).
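One way to see why turnaround time degrades while concurrent throughput does not is a simple model based on Little's law. This is a sketch under stated assumptions, not the measured results: the service time, the bandwidth cap, and the single round trip per request are hypothetical numbers chosen for illustration.

```python
def turnaround_ms(service_ms, one_way_latency_ms, round_trips=1):
    """Turnaround of one request: service time plus network round trips."""
    return service_ms + round_trips * 2 * one_way_latency_ms


def throughput_rps(concurrency, turnaround, bandwidth_cap_rps):
    """Requests per second for a given number of in-flight requests."""
    # Little's law: in-flight requests = throughput * turnaround time.
    return min(concurrency / (turnaround / 1000.0), bandwidth_cap_rps)


for latency in (10, 100):
    t = turnaround_ms(service_ms=20, one_way_latency_ms=latency)
    single = throughput_rps(1, t, bandwidth_cap_rps=200)
    many = throughput_rps(64, t, bandwidth_cap_rps=200)
    print(f"latency={latency}ms turnaround={t}ms single={single:.1f} concurrent={many:.1f}")
```

In this model a single request stream slows down in proportion to latency, while a sufficiently concurrent workload reaches the same bandwidth-limited throughput at either latency.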
30.
Test2: Durability
Auto recovery by object-replicator
• No error caused by latency
• Performance degradation of a single process
• No throughput degradation for concurrent processes
Figure: recovery performance after a failure; throughput plotted against latency and against concurrency (annotations: latency, limitation of network bandwidth).
31.
Challenge 3: Quality
1. Software Quality
• Do all processes work well?
• Account / Container / Object
• server / replicator / updater / reaper
2. System Quality
• Is our system working well?
• All nodes
• All APIs
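A check that every expected daemon is present can be sketched as below. The names follow Swift's `swift-<service>-<role>` executables; the table covers only the roles named above (auditors and other daemons are left out), and how the list of running processes is collected is outside this sketch. The helper names are hypothetical.

```python
# Roles per service, limited to the ones listed on the slide.
EXPECTED_ROLES = {
    "account": ["server", "replicator", "reaper"],
    "container": ["server", "replicator", "updater"],
    "object": ["server", "replicator", "updater"],
}


def expected_daemons():
    """Return the set of daemon names that should be running."""
    return {
        f"swift-{service}-{role}"
        for service, roles in EXPECTED_ROLES.items()
        for role in roles
    }


def missing_daemons(running):
    """Return the expected daemons that are absent from `running`."""
    return sorted(expected_daemons() - set(running))


print(missing_daemons(["swift-account-server", "swift-object-server"]))
```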
32.
Software quality
Source code analysis and customization
• Official patches (below)
• Original patches
Strict testing of all processes
Our official patches
1 Add process name checking into swift-init
2 Prevent redundant commenting by drive-audit
3 Remove invalid connection checking in db_replicator
4 Add timestamp checking in AccountBroker.is_status_deleted
5 Fix error log of proxy-server when cache middleware is disabled
and more …
33.
System quality
• Automated testing tools for
1. APIs : all Swift APIs, including error cases
2. Nodes : all Swift nodes
• Extended Tempest and a checking tool
Figure: Tempest tests all APIs; the checking tool tests all nodes (proxy servers and storage servers).
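A per-node check could look like the sketch below. It is not the project's checking tool: it assumes each node exposes Swift's healthcheck middleware at `/healthcheck`, and the node addresses and function names are hypothetical. The check function is injectable so that the loop over nodes can be exercised without a network.

```python
from http.client import HTTPException
from urllib.request import urlopen


def http_healthcheck(node, timeout=2.0):
    """Return True if GET /healthcheck on the node answers with 200."""
    try:
        with urlopen(f"http://{node}/healthcheck", timeout=timeout) as response:
            return response.status == 200
    except (OSError, HTTPException):
        # Connection errors, timeouts, and HTTP errors all count as unhealthy.
        return False


def check_all_nodes(nodes, check=http_healthcheck):
    """Run the check against every node and return a node -> bool map."""
    return {node: check(node) for node in nodes}


# Usage with a stub check; real addresses would be "host:port" strings.
print(check_all_nodes(["proxy-1:8080", "storage-1:6000"], check=lambda node: True))
```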
34.
Our solutions
Challenges and their solutions
1 Durability assurance
– Recovery test in a variety of failure patterns
2 Geographically distributed cluster
– Performance test of frontend/backend with a pseudo-global Swift cluster
3 Quality
– Source code analysis and customization
– Automated testing
36.
Overview of operating session
The operation scheme of docomo mail is highly confidential.
We would therefore like to introduce the operation of NTT DATA's Swift solution.
The docomo mail system uses a customized version of NTT DATA's Swift solution.
37.
Operating session
Large scale system makes operation costly
Figure: operating a large-scale Swift cluster involves scale-out, management, repair, and tuning.
38.
Operating session
Reduce the amount of operating work
• Parallel access (pssh / pscp)
• Automatic deploy (kickstart)
• Tuning (svn / puppet)
• Master repository
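The idea behind parallel access can be sketched as follows. This imitates what pssh does rather than using pssh itself; the function names and host names are hypothetical, and the runner is injectable so that the fan-out logic can be tried without SSH access.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_over_ssh(host, command):
    """Run one command on one host with the ssh client; return (code, stdout)."""
    result = subprocess.run(["ssh", host, command], capture_output=True, text=True)
    return result.returncode, result.stdout


def run_everywhere(hosts, command, runner=run_over_ssh, max_workers=32):
    """Run the same command on every host in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as hosts.
        results = pool.map(lambda host: runner(host, command), hosts)
        return dict(zip(hosts, results))


# Usage with a stub runner instead of real SSH.
print(run_everywhere(["node1", "node2"], "uptime", runner=lambda h, c: (0, f"{h}: ok")))
```

With hundreds of servers, running one command per node sequentially is what makes routine work expensive; fanning out keeps the elapsed time close to that of the slowest node.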
39.
Operating session
Reduce operation frequency
Figure: failure types (disk failure, node down, server process down, and backend process down, e.g. the auditor process) and whether each affects the service.
40.
Operating session
Stop monitoring low-priority items
Figure: low-priority items are moved from monitoring alerts to periodic performance checks.
41.
Conclusion of operating session
• Swift consists of many nodes
• System operating costs of Swift tend to be high
• NTT DATA has the know-how to reduce Swift operation costs
– Using parallelized operation tools
– Customizing monitoring priorities
– Changing monitoring items to periodic checks
42.
Conclusion of this presentation
We introduced the usage, challenges, and operation of OpenStack Swift in the docomo
mail service system
• System migration with no service downtime
• Three technical achievements
• Reduced operating cost
Docomo mail has been in service with no downtime.
If you have any questions, please come to the NTT booth.
○Attention
All company names, product names, and service names
mentioned are trademarks or registered trademarks of the
respective companies