SpringOne 2020
Seth Jones: Solution Owner, Slalom LLC;
Ishaan Khurana: Data Scientist/ Analyst, Ford Motor Company;
Tom Woods: Platform Services Analytics and Billing Supervisor, Ford Motor Company;
Kyle Hinton: Solution Architect, Slalom Detroit
2. About Us
Tom Woods
Platform Services Analytics and Billing
Supervisor, Ford Motor Company
Ishaan Khurana
Data Scientist / Analyst, Platform Services
Analytics and Billing team,
Ford Motor Company
The Platform Services Analytics Billing Automation
team at Ford measures, drives and benchmarks
the value of Ford Strategic Cloud Platform
investments.
6. Fitness Actions
Metrics Store For PCF Tile
• Fitness Reports Identify:
o Lazy AIs
o Bloated AIs
o Overstaffed AIs
o Misplaced AIs
o Orphaned AIs
o Unhealthy AIs
• Forecasting
o Predict AI tipping points
Tagging
• Supports Automation
• Improves Audits
• More Granularity
7. Who are we?
We are SRE Engineers from Slalom, based out of Detroit.
We are passionate about Open Source and SRE.
Our goal is to use our expertise and knowledge to
improve our clients’ products and platforms.
Seth Jones
Solution Owner – Slalom LLC
Seth.Jones@slalom.com
Kyle Hinton
Solution Architect – Slalom LLC
Kyle.Hinton@slalom.com
8. How we got involved
Ford is an organization in transition, migrating its infrastructure from traditional ops to more cloud-native
solutions. Here we present some methods that we have found successful in assisting product
teams in their evolution.
9. Why SRE Matters to Ford
“Ford’s Future: Evolving to Become Most Trusted Mobility Company, Designing Smart
Vehicles for a Smart World” – Ford Oct. 3rd, 2017
10. Observability
15,000+ PCF Applications
20,000+ Application Instances
95% Applications Java/Kotlin
500+ Product Teams
Global Product Teams
AWS / Azure / On-Prem PCF Foundries
Global Data Centers
11. Ease of Adoption
"Every team should be able to develop in whatever method they want."
- Jonathan Schneider
How do we easily monitor 20k TAS applications?
Observability Goals
1. Provide various levels of insight into each team's platform to allow for troubleshooting, optimization, and improved
management
2. Provide a solution that requires a one-time setup and removes ongoing platform management toil from product
development teams
13. Right Sizing
Scaling of an application or platform to properly utilize
resources to achieve intended capacity
14. Assumptions / Considerations
• Provide insights without requiring Product team effort
• Most Product teams have little experience with Capacity Planning
• Product teams control their own infrastructure and resources
• Nudge teams towards change with metrics
• Leverage open source technology to limit third-party dependencies and maximize
customization
• Ability to measure the impact of our “Right Sizing” efforts
16. Cloud Agnostic Ecosystem
On Premise
Multiple data centers run and maintained by Ford. In general these are egress-only environments.
Microsoft Azure
Several foundries running in the Azure cloud to support applications that need to be exposed directly to the public internet.
Amazon Web Services
New foundries are being stood up in AWS in support of some of Ford’s most important initiatives.
Observability Platform
The Ford Observability Platform has been designed to collect and aggregate data from all of these sources.
17. Getting Metrics From PCF
1) The Prometheus BOSH Release is installed in all foundries.
2) Currently we focus on the metrics exposed by the
cf and firehose exporters.
3) Other metrics are available about nodes, the BOSH
system, etc.
18. Aggregation at Global Scale
1) Our Observability Platform has multiple
Prometheus instances which federate
metrics from the Prometheus instances in
the foundries
2) We utilize the CNCF Sandbox Project
Thanos to give us a global overview of all
collected metrics.
3) Foundries in the Azure and AWS
environments are tied into Thanos via the
Thanos Sidecar service, while the egress-only
on-prem instances use Thanos
Receive for remote write.
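Because Thanos Query exposes a Prometheus-compatible HTTP API, a global view can be read with an ordinary instant query. The sketch below only builds the request URL; the host name and the metric name `app_memory_used_bytes` are hypothetical placeholders, not Ford's actual endpoint or exporter metrics.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
THANOS_QUERY_URL = "https://thanos-query.example.internal"


def build_instant_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build a URL for the Prometheus-compatible instant query endpoint.

    `dedup` asks Thanos Query to merge series reported by replica
    Prometheus instances into a single series.
    """
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Hypothetical metric and label names: total memory in use per foundry.
promql = "sum by (foundry) (app_memory_used_bytes)"
url = build_instant_query_url(THANOS_QUERY_URL, promql)
print(url)
```

The same query works whether a foundry's data arrived through a Sidecar or through Receive, which is the point of putting a single query layer in front of both.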
19. Crunching The Numbers
• Goal: get teams thinking about capacity
management by providing memory quota
recommendations in real time
• Too much raw data to process it all at time of request
• Use recording rules in Prometheus to process the raw
data as it comes in.
• Occasional missing data points cause extra
headaches.
• We can now provide suggestions for thousands of
applications in seconds.
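The idea behind the bullets above is to pre-aggregate raw samples as they arrive, so that a request only reads a handful of summary values, while tolerating gaps in the data. The following is a minimal Python sketch of that idea, not Prometheus recording rule syntax; the function name, window size, and `min_valid` threshold are assumptions for illustration.

```python
from statistics import mean
from typing import List, Optional, Sequence


def window_averages(samples: Sequence[Optional[float]], window: int,
                    min_valid: int = 1) -> List[Optional[float]]:
    """Average each consecutive window of samples, skipping missing (None) points.

    A window with fewer than `min_valid` real samples yields None, so that
    "no data" is never confused with "zero usage".
    """
    out: List[Optional[float]] = []
    for start in range(0, len(samples), window):
        valid = [s for s in samples[start:start + window] if s is not None]
        out.append(mean(valid) if len(valid) >= min_valid else None)
    return out


# Memory samples in MB with scrape gaps (illustrative values).
raw = [512.0, 520.0, None, 530.0, None, None, 600.0, 610.0, 590.0]
print(window_averages(raw, window=3, min_valid=2))  # [516.0, None, 600.0]
```

Returning None for sparse windows pushes the missing-data decision to the consumer, which can then skip the window or fall back to a longer range.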
20. Utilizing The Data
• We currently use a very basic model for a recommended application memory quota,
aiming for 65% average memory usage while accounting for spikes.
• Displaying data in Grafana, providing both high level overviews with numerous
applications as well as targeted dashboards showing other application metrics
• Data transparency is an underlying tenet of our system.
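A basic model of this kind can be sketched in a few lines. The 65% average target comes from the slide; the 90% peak ceiling is the figure quoted on the fitness report slide, used here as one assumed way of "accounting for spikes". Rounding up to a 64 MB step and the function name are assumptions for illustration.

```python
import math


def recommend_memory_quota_mb(avg_used_mb: float, peak_used_mb: float,
                              avg_target: float = 0.65,
                              peak_target: float = 0.90,
                              step_mb: int = 64) -> int:
    """Suggest a memory quota that satisfies both utilization targets.

    The quota must be large enough that average usage stays at or below
    `avg_target` of it and peak usage stays at or below `peak_target`.
    """
    by_average = avg_used_mb / avg_target
    by_peak = peak_used_mb / peak_target
    needed = max(by_average, by_peak)
    # Round up to the next step so the suggestion is a tidy quota value.
    return math.ceil(needed / step_mb) * step_mb


# Spiky app: the peak constraint dominates (700 / 0.90 is about 778 MB).
print(recommend_memory_quota_mb(avg_used_mb=400, peak_used_mb=700))  # 832
# Steady app: the average constraint dominates (650 / 0.65 is about 1000 MB).
print(recommend_memory_quota_mb(avg_used_mb=650, peak_used_mb=700))  # 1024
```

Taking the larger of the two constraints means a steady application is sized by its average and a bursty one by its peak.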
23. Future Enhancements
1) Utilize other data sources (logs, app specific metrics, and traces) to further
refine suggestions.
2) Better understand resource utilization profiles
3) Provide recommendations for where to host applications
4) Analyze profiles to recommend an auto-scaling strategy.
5) Provide guides around failure domains and application design best practices.
24. Fitness Reports - App Instance memory reduction
• Sent initial fitness communications to application teams for 1 of 4 On Prem foundries (EDC 1 Pre
Prod)
• EDC 1 Pre Prod contains 19% of all Ford TAS app instances
• Targeted 516 Ford Applications (485 TAS Orgs) that contained app instances with potential for
memory downsizing
• 31% of EDC 1 Pre Prod app instances were considered potentially overallocated (more memory
allocated than required)
• Suggested app instance memory reductions based on historical utilization
• Aimed for 65% max average memory utilization and 90% absolute maximum utilization
• Reduces app instance memory while minimizing any performance risks
• Aggregated data and memory recommendations by Ford Application, with dashboard links for
detailed app instance utilization metrics
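The aggregation step described above can be sketched as follows: given per-instance rows that already carry a recommended quota, count the overallocated instances and total the potential savings per application. The row fields, application names, and numbers are hypothetical and do not reflect Ford data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class InstanceRow:
    application: str     # application the instance belongs to
    allocated_mb: int    # current memory quota
    recommended_mb: int  # quota suggested from historical utilization


def summarize_by_application(rows: Iterable[InstanceRow]) -> Dict[str, Dict[str, int]]:
    """Count overallocated instances and potential savings per application."""
    summary = defaultdict(lambda: {"instances": 0, "overallocated": 0, "savings_mb": 0})
    for row in rows:
        entry = summary[row.application]
        entry["instances"] += 1
        # Overallocated: more memory allocated than the recommendation requires.
        if row.allocated_mb > row.recommended_mb:
            entry["overallocated"] += 1
            entry["savings_mb"] += row.allocated_mb - row.recommended_mb
    return dict(summary)


rows = [
    InstanceRow("billing-api", 2048, 1024),
    InstanceRow("billing-api", 2048, 1024),
    InstanceRow("billing-api", 1024, 1024),
    InstanceRow("dealer-portal", 512, 768),
]
report = summarize_by_application(rows)
overallocated_share = sum(a["overallocated"] for a in report.values()) / len(rows)
print(report)
print(f"{overallocated_share:.0%} of instances potentially overallocated")
```

Instances that are under-allocated, like the `dealer-portal` row, are counted but contribute no savings, so a report built this way only ever suggests reductions.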
25. Fitness Reports Effects
• 7 days after the first fitness reports, total memory allocation in EDC 1 Pre Prod decreased by 1831 GB
due to app instance downsizing
• More app instances were downsized than expected (3421 instances targeted,
4313 instances downsized)
• Teams that received fitness reports also reduced memory for app instances that were not
specifically targeted
• Sustaining these reductions for a full year would save about 16 M GB-hours (15% of scalable
platform load) annually
• Fitness FAQ to help guide teams on how to adjust, monitor, and optimize app instance resources
in TAS
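The annual figure follows directly from the 1831 GB reduction, assuming that reduction holds constant for every hour of a 365-day year. The short calculation below reproduces it, along with the gap between targeted and downsized instances; all inputs are taken from the bullets above.

```python
HOURS_PER_YEAR = 24 * 365

memory_reduction_gb = 1831   # drop in allocated memory after 7 days
targeted_instances = 3421
downsized_instances = 4313

# Assumption: the reduction is sustained for every hour of the year.
annual_gb_hours = memory_reduction_gb * HOURS_PER_YEAR
extra_instances = downsized_instances - targeted_instances
extra_share = extra_instances / targeted_instances

print(f"{annual_gb_hours:,} GB-hours per year")  # 16,039,560 GB-hours per year
print(f"{extra_instances} more instances than targeted ({extra_share:.0%})")  # 892 ... (26%)
```

So roughly 16.0 M GB-hours per year, with about 26% more instances downsized than were targeted.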
26. Fitness Reports Next Steps
• Send TAS app instance memory guidelines to developers who provision new Orgs and Spaces
• Create and send fitness reports targeting all 11 foundries
• Recurring communications with application/product teams to maintain TAS fitness over time
• Include additional resource utilization metrics in future fitness reports
• Identify Orgs and Spaces that can reduce the number of active app instances
• Train supervised models to forecast future resource utilization
• Identify app instances trending towards becoming over/under allocated
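As a rough illustration of the forecasting idea, the sketch below fits a straight line to an app instance's daily memory utilization and estimates how many days remain before the trend reaches a threshold. Ordinary least squares is used here only as the simplest stand-in for a supervised model; it is an assumption, not the model the team plans to train, and the function names and sample data are hypothetical.

```python
from typing import List, Optional, Tuple


def fit_line(ys: List[float]) -> Tuple[float, float]:
    """Least squares fit of y = slope * x + intercept for x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def days_until_threshold(daily_utilization: List[float],
                         threshold: float = 0.90) -> Optional[float]:
    """Days after the last observation until the fitted trend hits `threshold`.

    Returns None when utilization is flat or falling, and 0.0 when the
    trend has already reached the threshold.
    """
    slope, intercept = fit_line(daily_utilization)
    if slope <= 0:
        return None
    last_day = len(daily_utilization) - 1
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - last_day)


# Utilization as a fraction of quota, rising two points per day (illustrative).
history = [0.50, 0.52, 0.54, 0.56, 0.58]
print(days_until_threshold(history))
```

For the sample history the fitted trend reaches 90% about 16 days after the last observation. A real model would also need to handle seasonality and deploy-driven step changes.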