Automation and Culture Changes for 40M Subscriber Platform Operation
- 1. Automation and Culture Changes for 40M Subscriber Platform Operation
Yuichiro Sano
Yahoo! JAPAN
ysano@yahoo-corp.jp
- 2. Unless otherwise indicated, these slides are © 2013-2018 Pivotal Software, Inc. and licensed under a Creative Commons
Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
About me
Yuichiro Sano
• Yahoo! JAPAN Platform Head of Cloud Platform Department
• Responsible for operational support, promotion, and management of the on-premises platforms (PCF, k8s) at Yahoo! JAPAN
- 4. About Yahoo! JAPAN
- 5. Why Cloud Foundry?
• Needed to modernize systems and reduce operational cost
• Productivity: needed an environment where engineers could focus on building services and not worry about the rest
• BOSH
• Buildpack model
• Needed auto-scaling
• Wanted to leverage the existing OpenStack environment
- 6. The Journey to Cloud Foundry
• 2015 Apr: PoC start
• 2015 Oct: Pilot dev start
• 2016 Oct: Pilot release
• 2017 Apr: PCF GA (OpenStack)
• 2017 Oct: OpenStack +1 cluster
• 2018 Apr: vSphere +2 clusters
• Today: OpenStack x2, vSphere x4
• App instances (AI) along the timeline: 1,000 AI, 3,000 AI, 11,000 AI
- 7. Current Cluster Spec
                         Sandbox   Development   Production
Clusters                 2         2             6
HV                       40        120           360
Diego cells              -         300           900
App instances            -         8,000         11,000
Total requests/sec       -         -             600,000
Log traffic (logs/sec)   -         -             90,000
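The production column implies some rough per-unit averages. The sketch below works through that arithmetic in Python using only the figures from the table. The averages themselves are derived here and are not stated in the slides, and an even spread across cells and clusters is an assumption.

```python
# Production figures copied from the cluster spec table.
production = {
    "clusters": 6,
    "diego_cells": 900,
    "app_instances": 11_000,
    "requests_per_sec": 600_000,
    "logs_per_sec": 90_000,
}


def per_unit(total: float, units: float) -> float:
    """Average of `total` when spread evenly across `units` (an assumption)."""
    return total / units


ai_per_cell = per_unit(production["app_instances"], production["diego_cells"])
rps_per_ai = per_unit(production["requests_per_sec"], production["app_instances"])
logs_per_ai = per_unit(production["logs_per_sec"], production["app_instances"])
cells_per_cluster = per_unit(production["diego_cells"], production["clusters"])

print(f"App instances per Diego cell: {ai_per_cell:.1f}")
print(f"Requests/sec per app instance: {rps_per_ai:.1f}")
print(f"Logs/sec per app instance: {logs_per_ai:.1f}")
print(f"Diego cells per cluster: {cells_per_cluster:.0f}")
```

Real clusters are unlikely to be this uniform, so these numbers are only a starting point for capacity discussions.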
- 8. Ecosystem around PCF
• RDB
• MQ
• Splunk
• RBAC
• FaaS
• Repository
• KVS
• Object Storage
• Redis
• GitHub Enterprise
- 9. Organizational Structure
System dep.
Platform dep.
Cloud Platform dep.
Infrastructure dep.
other Platform dep. (x3)
Network
IaaS
LB
- 10. Team Structure
Cloud Platform dep.
PaaS CaaS
- 11. Introducing PCF to the Organization
Most people were new to PCF, so we did the following:
• Held company seminars and hands-on workshops for more than 1,000 developers
• Maintained Japanese-language reference material and tutorials
• Provided best-practice guidance for various development use cases
• Used Pivotal consulting services to provide support for platform and SRE
• Created a Service Broker to handle special YJ cookie offload
- 12. Growing Pains
• As clusters were added, the operational burden and alert handling increased
• As the number of apps grew, PCF could not accept new apps because there were too few clusters
• With increasing log volume, our log management system became overloaded
• We found that app config mistakes (e.g. timeouts) could affect the whole cluster (Gorouter)
• Time spent on user support issues made it difficult to introduce stable operations policies
- 13. Clearly Defining the Role of Each Team
PCF users: 2,500 engineers
PaaS team:
• CRE: Propose efficient usage methods and proactively resolve issues to ease the transition for engineers
• SRE: Platform as a Product. Focus on increasing system reliability by preventing failure and promoting automation
- 14. CRE team
Developer Counterpart
• First-line support for users
• Contact users when an application misbehaves
Developer Education
• Deliver workshops
• Provide default CI/CD templates
• Best practices
• Application architecture support
Service Integrations
• Create service broker services
• Support service teams
- 15. SRE team
CRE Counterpart
• Implements CRE needs
• Works closely with CRE on capacity planning
Platform Updates
• PCF updates
• Add new features around the platform
• Logging and metrics for users
• Automate all the things
Platform Stability
• Define SLOs and SLAs
• Platform monitoring, alerts, etc.
• Define alerts: what, when, how?
• Capacity planning
- 16. SRE
- 17. Automate all the Things
• Install PCF
• Backup with BBR
• Cluster integrity
• Update PCF
• Prometheus deployment
• Quota and usage check
• Buildpack update
• Logs forwarder deployment
• IaaS layer check (blobstore, ...)
• Smoke test
• User/Space/Org management
• etc.
- 18. PCF Install & Update Pipeline
Create IaaS → Deploy Opsman → Deploy BOSH → Upload Tile → Install Tile
1 Day
- 19. Buildpack Update Pipeline
Update stages: Sandbox, Dev, Staging, Production
Every month
Buildpack x 8
- 20. PCF Backup
every-2am
every-3am
every-4am
every-5am
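The four labels suggest daily triggers staggered one hour apart, presumably so that backups do not all run at once. The slides do not say which cluster or job each trigger belongs to. As a minimal sketch of how such a staggered daily schedule can be evaluated, the Python below computes the next firing time for each slot. The function name and the example date are illustrative only.

```python
from datetime import datetime, timedelta

# Start hours taken from the slide labels (every-2am ... every-5am).
BACKUP_HOURS = [2, 3, 4, 5]


def next_run(now: datetime, hour: int) -> datetime:
    """Return the next occurrence of `hour`:00 strictly after `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the next one is tomorrow.
        candidate += timedelta(days=1)
    return candidate


example_now = datetime(2018, 9, 1, 3, 30)  # arbitrary example time
for hour in BACKUP_HOURS:
    print(f"every-{hour}am -> {next_run(example_now, hour):%Y-%m-%d %H:%M}")
```

Spreading the start times this way keeps each backup window separate, at the cost of a longer overall backup period each night.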
- 21. PCF Smoke-Test
All environments
Every 10 minutes
- 22. Monitoring all the Things
• App instances
• Log traffic
• Cell capacity
• Avg response time
• cf push duration
• Router traffic
• CPU usage
• Cluster healthcheck probe
• Mem usage
• Log missing rate
• etc.
- 23. Cluster Dashboard
Router RPS
Router Goroutines
Router Latency
- 24. Cell Capacity
Used / Available (two panels)
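A used-versus-available view like this comes down to a small amount of arithmetic per cell. The Python sketch below is one way to summarize it. The `Cell` class, the memory figures, and the function names are all hypothetical, and the slides do not say which resource (memory, disk, containers) the panels track.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical per-cell capacity record, in MB."""
    name: str
    total_mb: int
    used_mb: int

    @property
    def available_mb(self) -> int:
        return self.total_mb - self.used_mb


def capacity_summary(cells: list[Cell]) -> dict:
    """Aggregate used/available capacity across all cells."""
    total = sum(c.total_mb for c in cells)
    used = sum(c.used_mb for c in cells)
    return {
        "used_mb": used,
        "available_mb": total - used,
        "used_pct": 100 * used / total,
    }


def placeable_instances(cells: list[Cell], instance_mb: int) -> int:
    """Count how many more instances fit, cell by cell.

    An instance cannot span cells, so this can be lower than
    the aggregate available capacity divided by the instance size.
    """
    return sum(c.available_mb // instance_mb for c in cells)


# Made-up example data.
cells = [Cell("cell-0", 32768, 24576), Cell("cell-1", 32768, 30720)]
summary = capacity_summary(cells)
print(summary)
print(placeable_instances(cells, 3072))
```

In this example the aggregate free memory would suggest room for three 3 GB instances, but only two actually fit, which is why per-cell accounting matters when deciding whether to add cells.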
- 25. App Latency
Latency percentiles: 99%, 90%, 50%
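For readers unfamiliar with latency percentiles, the sketch below shows one common definition, the nearest-rank method, in Python. The sample latencies are made up, and the dashboard in the talk may well use a different estimator (monitoring systems often approximate percentiles from histograms).

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up response times in milliseconds, including two slow outliers.
latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 18, 900]

for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

The median here stays low while the 90th and 99th percentiles expose the slow requests, which is the reason for plotting all three lines rather than an average.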
- 26. Smoke Test
Cf Push Duration
Cf Scale Duration
Cf Start Duration
Cf Delete Duration
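The four durations above can be collected by timing the corresponding cf CLI commands. The following Python sketch illustrates the idea and is not the team's actual smoke test: the app name, the app path, the instance count, and the step order (push with `--no-start`, then scale, start, delete) are all assumptions. It also assumes the cf CLI is installed and already logged in to a target org and space.

```python
import subprocess
import time

# Hypothetical app name and source path; the real smoke-test app
# is not described in the slides.
APP = "smoke-test-app"
APP_PATH = "./smoke-test-app"

# One cf CLI command per measured duration.
STEPS = [
    ("push", ["cf", "push", APP, "-p", APP_PATH, "--no-start"]),
    ("scale", ["cf", "scale", APP, "-i", "2"]),
    ("start", ["cf", "start", APP]),
    ("delete", ["cf", "delete", APP, "-f"]),
]


def run_cf(cmd):
    """Run one cf command; a non-zero exit raises CalledProcessError."""
    subprocess.run(cmd, check=True, capture_output=True)


def timed_steps(steps, run=run_cf, clock=time.monotonic):
    """Run each step in order and return {step_name: seconds}.

    `run` and `clock` are injectable so the timing logic can be
    exercised without a real Cloud Foundry target.
    """
    durations = {}
    for name, cmd in steps:
        started = clock()
        run(cmd)
        durations[name] = clock() - started
    return durations
```

Calling `timed_steps(STEPS)` against a targeted environment returns the four durations, which a scheduler could collect every 10 minutes and export to the monitoring system. Because `run_cf` raises on any failed command, a broken step fails the whole run instead of reporting a misleading duration.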
- 27. Use BOSH for Logging Components
• Alert & App logs are transferred to the platform
• Using BOSH makes it easier to scale the nozzle and relay components
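Diagram (log flow):
• App A, App B → loggregator
• loggregator → Monitor nozzle → Monitor relay → Internal Notifications
• loggregator → Splunk nozzle → Splunk relay → Splunk
• The nozzle and relay components are the parts that become easier to scale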
- 28. Log summary
Noisy Neighbor
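A log summary of this kind can surface noisy neighbors by ranking apps by their share of total log volume. The Python sketch below shows one way to do that. The event format, app names, and counts are invented for illustration, and the actual summary in the talk is presumably built in the team's own log tooling.

```python
from collections import Counter


def top_log_producers(events, n=3):
    """Rank apps by log volume.

    `events` is an iterable of (app_name, line_count) pairs.
    Returns up to `n` tuples of (app_name, total_lines, share_of_total).
    """
    totals = Counter()
    for app, count in events:
        totals[app] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [
        (app, total, total / grand_total)
        for app, total in totals.most_common(n)
    ]


# Made-up per-interval log counts.
events = [("app-a", 120), ("app-b", 9000), ("app-a", 80), ("app-c", 800)]

for app, total, share in top_log_producers(events, n=2):
    print(f"{app}: {total} lines ({share:.0%} of total)")
```

An app producing a disproportionate share, like `app-b` in this example, is a candidate for follow-up with its owners before it overloads the shared logging pipeline.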
- 29. Missing Log Data Dashboard
Two panels, sampled every 1 hour and every 1 minute; both show 100%
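One way to produce a figure like this is to compare how many log lines were emitted in a window with how many reached the destination. The Python sketch below assumes a probe that emits a known number of lines per window. That mechanism, the function names, and the sample counts are assumptions, since the slides do not describe how the rate is measured.

```python
def missing_rate(sent: int, received: int) -> float:
    """Percentage of log lines lost in one window."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    # Duplicated deliveries can make received exceed sent; treat that as no loss.
    lost = max(sent - received, 0)
    return 100 * lost / sent


def delivery_rate(sent: int, received: int) -> float:
    """Percentage of log lines delivered in one window."""
    return 100 - missing_rate(sent, received)


# Made-up windows: (label, lines sent, lines received).
windows = [("1 minute", 60, 60), ("1 hour", 3600, 3564)]

for label, sent, received in windows:
    print(f"{label}: {delivery_rate(sent, received):.1f}% delivered, "
          f"{missing_rate(sent, received):.1f}% missing")
```

Tracking both a short and a long window helps separate brief drops from sustained loss in the log pipeline.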
- 31. Benefits of Platform Automation
• Automation has reduced the SRE team's platform install and update work by 85%
• Precision has increased and human error has been removed, saving a lot of effort and time
• Anyone can now easily work with the platform, so we no longer depend on individual “rockstars”
- 32. Benefits of Focus on Observability
• Time to identify problems has been radically reduced
• Moved from a reactive to a proactive approach to problem resolution
• Contributed to a more sustainable, stress-free work environment
- 33. Outcomes
- 34. Outcomes
Speed
• Time to Deploy: 75%
• Workspace Provisioning: Weeks → 4 hrs
Scaling
• Time to Scale: Weeks → 5 secs
• TPS: 600k
Reliability
• Downtime: ZERO
• VM Recovery: 2 hr → 2 mins
- 35. Outcomes
Security
• Patch Frequency: 5x
• Time to Patch: 1 d → 4 hr
Productivity
• AIs: 11,000
• Apps in Production: 3,845
- 36. Future Plans
- 37. Improving Value for Users
• Adding further clusters: x6 ⇒ x12
• Evolving the log-related architecture: relay ⇒ queuing
• Service Broker support for all platforms: 3PF ⇒ 1XXPF
• Proactive operations: SREs able to safely take a nap on the job :)