Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |
Lessons Learned From Building Out Hyper-Scale Cloud Services Using Docker
Boris Scholl
VP Microservices, Oracle
Harvey Raja
Coherence Architect, Oracle
March 6th, 2017
Agenda
• Objectives
• Service use case
• Service design goals and principles
• Platform architecture
• DevOps flow
• Demo
• Lessons learned
Objectives
• Provide insights into building production-grade cloud services
• Provide insights into a production-grade CI/CD pipeline
• Share some lessons learned
Takeaways
• Get insight into an actual real-world architecture
• Awareness of potential pitfalls when entering this space
Service Use Case
• Backbone of other internal distributed services
• Needed a service for
– Leader election
– Service Registry and Discovery
– Configuration management
• Potentially making it available to customers later
Service Design Goals
• Hyper-scale
• Highly available
• Resilient
• Multitenant
• Optimal hardware utilization to optimize costs
• Agile delivery of individual services, continuous deployments
Service Design principles
• Design to optimize for Time to Market
– Microservice architectural approach
– Each service is delivered by independent development teams
– Automate everything
– E.g. the application consists of nine separate services delivered by five geographically separate development teams
• Governance
– Unit testing, coding standards, and code reviews on all commits
– Common log format
• Only services which are “deployable and testable” can be promoted.
• Build for operations
– Custom Dashboard UI provides status for all versioned manifests and services, identifying issues, bottlenecks, etc.
– Diagnostics and monitoring UI, alerting, and tools in place
Technology Stack
Tech Stack mainly focused on proven OSS technologies
• Reliable infrastructure
– Oracle Bare Metal Cloud Services
– Mesos/Marathon
• Currently managed by our team. Will be moving to managed CaaS.
– NGINX
• Technologies designed for operations
– Docker
– ELK (Elasticsearch, Logstash, Kibana) + Grafana
– Prometheus
• Java (JAX-RS, Jersey, Grizzly, Netty, Coherence)
• Jenkins CI
Architectural overview
[Figure: architecture diagram. Recoverable labels: Operator; Load Balancer (five instances); Management APIs (two instances); Mesos/Marathon; Tenant 1 to Tenant 4; AD 1 and AD 2; etcd-1 (five instances).]
Platform components
• Load balancer
– NGINX based
– Control plane LB and Tenant LB
– Tenant LB sits between the service VCN and the tenant VCN
• Acts as a ‘wormhole’ between the private networks
• Management APIs
– Provides endpoints for Console and CLI to create new etcd services
• Etcd service
– Virtual concept based on Coherence cluster
• Etcd gateway == Frontend nodes
• Storage enabled nodes == Backend nodes
• Data persisted to NVMe
Platform components
• Orchestrator
– Implemented as a layer between the management APIs and Mesos/Marathon (M/M)
– Responsible for provisioning the etcd service components in a particular order
– Manages the life cycle of the etcd service
• Check for safe states etc.
– Supports target environment profiles
• Depending on the compute infrastructure, the orchestrator adjusts cluster size and JVM resource consumption
• Platform manifest
– Declarative way of bundling platform components into a release
– Contains name and version of components (Docker images) being released
• Platform Installer
– Deploys platform software as defined in the manifest
– Can deploy to Mesos/Marathon, BM Container Service or Virtual Machines
Service Runtime Architecture
[Figure: service runtime architecture diagram. Recoverable labels: Service VCN; Availability Domains 1, 2 and 3, each with Gateway nodes and a Backend hosting tenant instances (T1 Inst 1, T1 Inst 2, T2 Inst 1, T2 Inst 2); Load Balancer Service; Tenant 1, Tenant 2, ... Tenant n VCNs with clients (T1 Client 1, T1 Client 2, T2 Client 1).]
Testing and CI/CD Pipeline
Testing Strategy
Service-Level Tests
• Owned by each service team
• Includes
– Unit Test
– Component Test
– Integration Test
• Run as a part of individual builds, prior to CI/CD stage
Platform-Level Tests
• Owned by central test team
• Includes end-to-end tests
– Functional Acceptance Test
– Minimal Acceptance Test (MAT)
– Longevity Test
– Upgrade Test
– Non-functional (Performance/Stress)
– Jepsen testing
• Run as a part of the CI/CD pipeline
CI/CD Pipeline and Testing Levels
Level 1: "Verify"
• Use localhost installation (isolated sandbox env) with the last published platform manifest
• Install new version of X on top
• Run Functional Acceptance Tests, in 10-minute parallel chunks
• Run Upgrade Acceptance Tests, upgrading from current production manifest to new version of X
Level 2: "Pre-Stage"
• Use Prod-like environment with last published platform manifest
• Deploy a platform instance from scratch using new version of X on top
• Run MATs (10 minutes)
• On success, a new platform manifest is produced using the new version of X
• Successfully passing this level represents CI
Level 3: "Stage"
• Use Prod-like environment with current production manifest already deployed
• Upgrade to the new version of X
• Run MATs (10 minutes)
Level 4: "Prod Candidate"
• Frequency: Once per night
• Run long-running tests: Longevity, PSR, and Functional Acceptance Tests
• Based on the results, a manual decision is made to deploy a specific manifest to production at a frequency determined by management
Note: X indicates platform components that have changed since the last release
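To make "X" concrete, the set of changed components can be derived by comparing two manifests. The following is a minimal sketch, assuming a manifest can be reduced to a map of component name to version; the class name ManifestDiff and the map representation are illustrative assumptions, not the format used by the team.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Hypothetical helper: a manifest is modeled as component name -> version. */
final class ManifestDiff {

    /** Returns the components in candidate whose version is new or differs from published. */
    static Map<String, String> changed(Map<String, String> published,
                                       Map<String, String> candidate) {
        Map<String, String> x = new TreeMap<>(); // sorted for stable output
        for (Map.Entry<String, String> e : candidate.entrySet()) {
            if (!Objects.equals(published.get(e.getKey()), e.getValue())) {
                x.put(e.getKey(), e.getValue());
            }
        }
        return x;
    }
}
```

Only the components returned by changed(...) would need to be installed on top of the last published manifest in Level 1 and Level 2.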
Detailed Service Release Pipeline
[Figure: release pipeline diagram. Recoverable stage labels: Etcd data plane and Etcd control plane, each with Build & Unit Tests, Integration Test, Publish Image; Platform Testing with Update Manifest, Install on pre-stage (MATS), Upgrade stage (UATS), Parallel Test Runs (Performance perf, Longevity under Stress), Prod Ready Candidate, Publish Manifest; Production with Release start, Review Prod Ready List, Select Manifest, Install Prod on stage, Confirm Prod Ready, Canary on prod, Finish prod Upgrade, Rollback stage.]
Lessons learned
Lessons Learned (Best Practices) – Services will fail
• Retries
– Anticipate transient errors for services you are trying to reach
– Implement a retry policy with an appropriate retry count and interval (e.g. exponential back-off, incremental intervals etc.)
– Ensure idempotency with retries
• Circuit Breaker
– Prevents an application from retrying an operation that is likely to fail
– Can be combined with the retry pattern
• Bulkhead
– Prevents faults in one part of the system from taking down the entire system
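The retry and circuit breaker patterns above can be sketched as follows. This is a minimal illustration, not the team's implementation: the class name Resilience, its methods, and the simplified half-open behavior are all assumptions made for the example. Bulkheads are not shown.

```java
import java.util.concurrent.Callable;

/** Illustrative sketch; class and method names are hypothetical, not from the talk. */
final class Resilience {

    /** Delay before retry number attempt (1-based): baseMs * 2^(attempt-1), capped at maxMs. */
    static long backoffMillis(int attempt, long baseMs, long maxMs) {
        int shift = Math.max(0, Math.min(attempt - 1, 20)); // bound the exponent
        return Math.min(maxMs, baseMs << shift);
    }

    /** Runs op up to maxAttempts times, sleeping with exponential back-off after a failure. */
    static <T> T retry(Callable<T> op, int maxAttempts, long baseMs, long maxMs)
            throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call(); // op must be idempotent: it may run more than once
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMs, maxMs));
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }

    /** Opens after threshold consecutive failures and rejects calls for openMs. */
    static final class CircuitBreaker {
        private final int threshold;
        private final long openMs;
        private int consecutiveFailures = 0;
        private long openedAt = 0L;
        private boolean open = false;

        CircuitBreaker(int threshold, long openMs) {
            this.threshold = threshold;
            this.openMs = openMs;
        }

        /** Callers check this before attempting the operation. */
        synchronized boolean allowRequest(long nowMs) {
            if (open && nowMs - openedAt >= openMs) {
                open = false;                        // simplified half-open state
                consecutiveFailures = threshold - 1; // one more failure re-opens
            }
            return !open;
        }

        synchronized void recordSuccess() {
            consecutiveFailures = 0;
            open = false;
        }

        synchronized void recordFailure(long nowMs) {
            if (++consecutiveFailures >= threshold) {
                open = true;
                openedAt = nowMs;
            }
        }
    }
}
```

To combine the two, a caller would check allowRequest(...) before invoking retry(...) and report the outcome with recordSuccess() or recordFailure(...). Passing the clock value in explicitly keeps the breaker easy to test.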
Lessons Learned (Best Practices) – Communication
• More services mean more communication and data exchange
– HTTP (HTTP/2) for external and internal
– TCP/UDP for internal for better performance
– Serialization format: JSON, Protocol buffers, Coherence POF
• Serialization and deserialization can be a bottleneck for large-scale services
– Consider if you need to re-serialize if a downstream service works with the same object
• Augment the deserialized object and pass it on to another service in a form
– Choose a JSON serializer wisely
– Jersey (JAX-RS implementation), Jetty (as the HTTP transport) and Jackson get you pretty far.
Lessons Learned – Docker and Java Apps
• Memory
– The JVM does not honor the container's memory constraints
• The JVM tries to use all the memory it sees
• The Docker daemon kills the container when it crosses its memory constraints
• To avoid the issue, specify a max heap size for the process that is lower than the container memory constraint
– Ensure the JVM memory settings are correctly synced with the Marathon container memory settings.
• Failure to do this can cause Marathon to continually kill the container when the enclosed JVM hits a memory event that we Java programmers would just consider a "normal" event
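One way to keep the two settings in step is to derive the heap flag from the container limit instead of maintaining both by hand. The sketch below assumes a fixed headroom fraction is reserved for non-heap memory (thread stacks, metaspace, direct buffers); the class name HeapSizing and the headroom values are illustrative assumptions, not figures from the talk.

```java
/** Hypothetical helper for deriving -Xmx from a container memory limit. */
final class HeapSizing {

    /** Max heap in MB, leaving headroomFraction of the container limit for non-heap use. */
    static long maxHeapMb(long containerLimitMb, double headroomFraction) {
        if (containerLimitMb <= 0 || headroomFraction < 0 || headroomFraction >= 1) {
            throw new IllegalArgumentException("invalid limit or headroom");
        }
        return (long) Math.floor(containerLimitMb * (1.0 - headroomFraction));
    }

    /** Formats the value as a JVM flag, e.g. "-Xmx1536m". */
    static String xmxFlag(long containerLimitMb, double headroomFraction) {
        return "-Xmx" + maxHeapMb(containerLimitMb, headroomFraction) + "m";
    }

    /** True if the heap ceiling of the running JVM fits under the given container limit. */
    static boolean heapFitsContainer(long containerLimitMb) {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        return maxHeapBytes <= containerLimitMb * 1024L * 1024L;
    }
}
```

A deployment script could call xmxFlag(...) with the same number it writes into the container memory setting, so the heap ceiling always stays below the limit enforced on the container.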
Lessons Learned – Docker and Java Apps
• CPU
– A JVM running in a container sees all the cores of the host machine.
• Manually configure the CPU count if you rely on that information (e.g. to create threads)
• Workaround: pass the value in with a -D Java property
– Ensure the JVM CPU settings are correctly synced with the Marathon container CPU settings.
• Failure to do this can prevent the container from ever starting.
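The -D workaround might look like the sketch below. The property name app.cpus is a made-up example, not one the JVM or Marathon defines; the value is parsed as a decimal and rounded up because Marathon allows fractional CPU allocations.

```java
/** Hypothetical helper: prefer an explicit CPU count over what the JVM detects. */
final class CpuConfig {

    /** Example property name (an assumption); pass e.g. -Dapp.cpus=2 at start-up. */
    static final String CPU_PROPERTY = "app.cpus";

    /** Returns the configured CPU count, or the JVM-detected count if unset or invalid. */
    static int effectiveCpus() {
        String configured = System.getProperty(CPU_PROPERTY);
        if (configured != null) {
            try {
                double v = Double.parseDouble(configured.trim());
                if (v > 0 && !Double.isInfinite(v)) {
                    return (int) Math.ceil(v); // round fractional allocations up to 1+
                }
            } catch (NumberFormatException ignored) {
                // fall through to the detected value
            }
        }
        return Runtime.getRuntime().availableProcessors();
    }
}
```

Thread pools would then be sized from CpuConfig.effectiveCpus() rather than from Runtime.getRuntime().availableProcessors() directly.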
Lessons Learned – Marathon (or other orchestrator)
• Mesos/Marathon status may not match service status
– M/M reports on container status
– It may take longer for the service inside the container to come up.
– Workaround: Configure a health check URL for the service so that one can definitively conclude that the container and the process inside it are up and running
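A health check URL only helps if it reflects the state of the service rather than the state of the container. The sketch below uses the JDK's built-in com.sun.net.httpserver.HttpServer and answers 503 until the application marks itself ready; the /health path and the class name HealthEndpoint are assumptions made for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical health endpoint: 503 while starting, 200 once the service is ready. */
final class HealthEndpoint {
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final HttpServer server;

    HealthEndpoint(int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health", exchange -> {
            boolean up = ready.get();
            byte[] body = (up ? "UP" : "STARTING").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(up ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
    }

    void start() { server.start(); }

    /** Call once caches are loaded, cluster membership is established, etc. */
    void markReady() { ready.set(true); }

    int port() { return server.getAddress().getPort(); }

    void stop() { server.stop(0); }
}
```

The orchestrator's health check would be pointed at this path, so a container whose process is still warming up is not reported as healthy.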
Lessons Learned – NGINX
• Dynamic reconfiguration of NGINX
– Configuration updates based on events sent by orchestrator
• E.g. spins up new etcd clusters
• LB needs to be aware of new routes
– Not easy to find out when the change was applied
– Workaround: Ping service via NGINX to make sure the service is back up
– Disable NGINX logging
– Increase worker processes to match load (auto == cpu count)
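The "ping service via NGINX" workaround amounts to polling a route through the load balancer until it answers. A minimal sketch, assuming a plain HTTP GET is an acceptable probe; the class name RouteProbe, the timeouts, and the 2xx success criterion are illustrative choices.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: wait until a route through the load balancer responds. */
final class RouteProbe {

    /** Polls probe until it returns true or maxAttempts probes have failed. */
    static boolean waitUntilUp(BooleanSupplier probe, int maxAttempts, long intervalMs)
            throws InterruptedException {
        for (int i = 1; i <= maxAttempts; i++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            if (i < maxAttempts) {
                Thread.sleep(intervalMs);
            }
        }
        return false;
    }

    /** Probe that issues an HTTP GET and reports true on any 2xx status. */
    static BooleanSupplier httpOk(String url) {
        return () -> {
            try {
                HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
                c.setConnectTimeout(1000);
                c.setReadTimeout(1000);
                int code = c.getResponseCode();
                c.disconnect();
                return code >= 200 && code < 300;
            } catch (IOException e) {
                return false; // route not yet known to the LB, or backend still down
            }
        };
    }
}
```

After sending a reconfiguration event, the caller would run waitUntilUp(RouteProbe.httpOk(routeUrl), attempts, intervalMs) with the tenant's route URL before reporting the new cluster as reachable.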
Lessons Learned (Best Practice) – Containers
• Use proper container image versioning
– E.g. etcd-nginx:1.0.0-b21
– Avoid the latest tag
• Use small base images
– E.g. Oracle Linux 7.1 - slim
– Large base images can delay service readiness
• Use one base image for multiple purposes
– Add functionality to base image
– Enable feature by configuration
– E.g. Tenant LB needs to support HTTP/2 (Required by etcd V3 which uses gRPC)
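The versioning rule can be enforced mechanically, for example as a check before an image reference is accepted into a manifest. The sketch below treats a reference as pinned only if it carries an explicit tag other than latest; the class name ImageTag is hypothetical and the parsing is deliberately simple.

```java
/** Hypothetical check that an image reference carries an explicit, non-"latest" tag. */
final class ImageTag {

    static boolean isPinned(String image) {
        // Drop any registry host:port and repository path before looking for the tag.
        String name = image.substring(image.lastIndexOf('/') + 1);
        int colon = name.indexOf(':');
        if (colon < 0) {
            return false; // no tag: Docker would resolve this to "latest"
        }
        String tag = name.substring(colon + 1);
        return !tag.isEmpty() && !tag.equals("latest");
    }
}
```

With this check, etcd-nginx:1.0.0-b21 passes, while etcd-nginx and etcd-nginx:latest are rejected.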
CI/CD Pipeline – Lessons Learned
• Automating everything may not be possible
• Let teams choose what they want to use
– Supporting both Maven and Gradle projects allowed Dev Teams to choose the tools they preferred
– Helped get devs more involved in the CI/CD process
• Restrict the testing pipeline to “deployable” components (i.e. Docker images).
– Teams producing Java libraries were required to coordinate with Docker image producers.
– Dev projects should be responsible for managing dependencies with other projects.
CI/CD Testing – Lessons Learned
• Isolated sandbox environment for development, debugging and testing
– Team members should be able to easily stand up their own isolated environment
• Agile “testing pyramid” of unit / integration / end-to-end tests
– More unit tests for better quality. Fewer end-to-end tests to reduce CI cycle time
• Parallelize testing where possible
– To execute more tests in a short amount of time
• Test upgrades early in the CI/CD pipeline
– Saves cycle time on other expensive tests if upgrades fail/introduce regressions
• Address intermittent failures in end-to-end tests right away
– Prioritize these failures, identify the root cause/faulty component and fix it first
CI/CD General – Lessons Learned
• A developer-focused dashboard is essential to identify failures & pipeline blockages
• Providing the information needed for diagnosis is essential to understanding failures in a timely manner
• More specific to our situation:
– Defining an external Platform Manifest was a good choice.
• Allowed reproducible test results
• Allowed mixing and matching of component versions
– Wish we had abstracted Container Management interfaces
• Would allow moving between orchestrators
Summary
• Microservices approach helps with time to market
– Be aware that you are dealing with a distributed system
• Tools and technology of choice require governance
• CI/CD pipeline and automation are key
– You may not be able to automate all the way up to continuous deployment
• More on that topic:
– 5 part blog series: Getting started with microservices
• https://blogs.oracle.com/developers/getting-started-with-microservices-part-one
Confidential – Oracle Internal/Restricted/Highly Restricted 30
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Confidential – Oracle Internal/Restricted/Highly Restricted 31
Lessons Learned From Building Out Hyper-Scale Cloud Services Using Docker

More Related Content

More from Oracle Developers

Container Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey BoxellContainer Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey BoxellOracle Developers
 
General Capabilities of GraalVM by Oleg Selajev @shelajev
General Capabilities of GraalVM by Oleg Selajev @shelajevGeneral Capabilities of GraalVM by Oleg Selajev @shelajev
General Capabilities of GraalVM by Oleg Selajev @shelajevOracle Developers
 
GraalVM Native Images by Oleg Selajev @shelajev
GraalVM Native Images by Oleg Selajev @shelajevGraalVM Native Images by Oleg Selajev @shelajev
GraalVM Native Images by Oleg Selajev @shelajevOracle Developers
 
Serverless Patterns by Jesse Butler
Serverless Patterns by Jesse ButlerServerless Patterns by Jesse Butler
Serverless Patterns by Jesse ButlerOracle Developers
 
Java Library for High Speed Streaming Data
Java Library for High Speed Streaming Data Java Library for High Speed Streaming Data
Java Library for High Speed Streaming Data Oracle Developers
 
Reactive Java Programming: A new Asynchronous Database Access API by Kuassi M...
Reactive Java Programming: A new Asynchronous Database Access API by Kuassi M...Reactive Java Programming: A new Asynchronous Database Access API by Kuassi M...
Reactive Java Programming: A new Asynchronous Database Access API by Kuassi M...Oracle Developers
 
Managing containers on Oracle Cloud by Jamal Arif
Managing containers on Oracle Cloud by Jamal ArifManaging containers on Oracle Cloud by Jamal Arif
Managing containers on Oracle Cloud by Jamal ArifOracle Developers
 
North America November Meetups
North America November MeetupsNorth America November Meetups
North America November MeetupsOracle Developers
 
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
GraphPipe - Blazingly Fast Machine Learning Inference by Vish AbramsGraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
GraphPipe - Blazingly Fast Machine Learning Inference by Vish AbramsOracle Developers
 
North America Meetups in September
North America Meetups in September North America Meetups in September
North America Meetups in September Oracle Developers
 
Introduction to the Oracle Container Engine
Introduction to the Oracle Container EngineIntroduction to the Oracle Container Engine
Introduction to the Oracle Container EngineOracle Developers
 
Oracle Data Science Platform
Oracle Data Science PlatformOracle Data Science Platform
Oracle Data Science PlatformOracle Developers
 
Persistent storage with containers By Kaslin Fields
Persistent storage with containers By Kaslin FieldsPersistent storage with containers By Kaslin Fields
Persistent storage with containers By Kaslin FieldsOracle Developers
 
The Fn Project by Jesse Butler
 The Fn Project by Jesse Butler The Fn Project by Jesse Butler
The Fn Project by Jesse ButlerOracle Developers
 
Silicon Valley JUG meetup July 18, 2018
Silicon Valley JUG meetup July 18, 2018Silicon Valley JUG meetup July 18, 2018
Silicon Valley JUG meetup July 18, 2018Oracle Developers
 
Hyperledger Austin meetup July 10, 2018
Hyperledger Austin meetup July 10, 2018Hyperledger Austin meetup July 10, 2018
Hyperledger Austin meetup July 10, 2018Oracle Developers
 
Oracle Global Meetups Team Update - Upcoming Meetups (July and August)
Oracle Global Meetups Team Update - Upcoming Meetups (July and August)Oracle Global Meetups Team Update - Upcoming Meetups (July and August)
Oracle Global Meetups Team Update - Upcoming Meetups (July and August)Oracle Developers
 
Managing Containers on Oracle's Cloud Infrastructure
Managing Containers on Oracle's Cloud InfrastructureManaging Containers on Oracle's Cloud Infrastructure
Managing Containers on Oracle's Cloud InfrastructureOracle Developers
 
Oracle - Continuous Delivery NYC meetup, June 07, 2018
Oracle - Continuous Delivery NYC meetup, June 07, 2018Oracle - Continuous Delivery NYC meetup, June 07, 2018
Oracle - Continuous Delivery NYC meetup, June 07, 2018Oracle Developers
 

More from Oracle Developers (20)

Container Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey BoxellContainer Native Development Tools - Talk by Mickey Boxell
Container Native Development Tools - Talk by Mickey Boxell
 
General Capabilities of GraalVM by Oleg Selajev @shelajev
General Capabilities of GraalVM by Oleg Selajev @shelajevGeneral Capabilities of GraalVM by Oleg Selajev @shelajev
General Capabilities of GraalVM by Oleg Selajev @shelajev
 
GraalVM Native Images by Oleg Selajev @shelajev
GraalVM Native Images by Oleg Selajev @shelajevGraalVM Native Images by Oleg Selajev @shelajev
GraalVM Native Images by Oleg Selajev @shelajev
 
Serverless Patterns by Jesse Butler
Serverless Patterns by Jesse ButlerServerless Patterns by Jesse Butler
Serverless Patterns by Jesse Butler
 
Java Library for High Speed Streaming Data
Java Library for High Speed Streaming Data Java Library for High Speed Streaming Data
Java Library for High Speed Streaming Data
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Reactive Java Programming: A new Asynchronous Database Access API by Kuassi M...
Reactive Java Programming: A new Asynchronous Database Access API by Kuassi M...Reactive Java Programming: A new Asynchronous Database Access API by Kuassi M...
Reactive Java Programming: A new Asynchronous Database Access API by Kuassi M...
 
Managing containers on Oracle Cloud by Jamal Arif
Managing containers on Oracle Cloud by Jamal ArifManaging containers on Oracle Cloud by Jamal Arif
Managing containers on Oracle Cloud by Jamal Arif
 
North America November Meetups
North America November MeetupsNorth America November Meetups
North America November Meetups
 
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
GraphPipe - Blazingly Fast Machine Learning Inference by Vish AbramsGraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
GraphPipe - Blazingly Fast Machine Learning Inference by Vish Abrams
 
North America Meetups in September
North America Meetups in September North America Meetups in September
North America Meetups in September
 
Introduction to the Oracle Container Engine
Introduction to the Oracle Container EngineIntroduction to the Oracle Container Engine
Introduction to the Oracle Container Engine
 
Oracle Data Science Platform
Oracle Data Science PlatformOracle Data Science Platform
Oracle Data Science Platform
 
Persistent storage with containers By Kaslin Fields
Persistent storage with containers By Kaslin FieldsPersistent storage with containers By Kaslin Fields
Persistent storage with containers By Kaslin Fields
 
The Fn Project by Jesse Butler
 The Fn Project by Jesse Butler The Fn Project by Jesse Butler
The Fn Project by Jesse Butler
 
Silicon Valley JUG meetup July 18, 2018
Silicon Valley JUG meetup July 18, 2018Silicon Valley JUG meetup July 18, 2018
Silicon Valley JUG meetup July 18, 2018
 
Hyperledger Austin meetup July 10, 2018
Hyperledger Austin meetup July 10, 2018Hyperledger Austin meetup July 10, 2018
Hyperledger Austin meetup July 10, 2018
 
Oracle Global Meetups Team Update - Upcoming Meetups (July and August)
Oracle Global Meetups Team Update - Upcoming Meetups (July and August)Oracle Global Meetups Team Update - Upcoming Meetups (July and August)
Oracle Global Meetups Team Update - Upcoming Meetups (July and August)
 
Managing Containers on Oracle's Cloud Infrastructure
Managing Containers on Oracle's Cloud InfrastructureManaging Containers on Oracle's Cloud Infrastructure
Managing Containers on Oracle's Cloud Infrastructure
 
Oracle - Continuous Delivery NYC meetup, June 07, 2018
Oracle - Continuous Delivery NYC meetup, June 07, 2018Oracle - Continuous Delivery NYC meetup, June 07, 2018
Oracle - Continuous Delivery NYC meetup, June 07, 2018
 

Recently uploaded

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 

Lessons Learned From Building Out Hyper-Scale Cloud Services Using Docker

  • 1. Lessons Learned From Building Out Hyper-Scale Cloud Services Using Docker
      Boris Scholl, VP Microservices, Oracle
      Harvey Raja, Coherence Architect, Oracle
      March 6th, 2017
      Copyright © 2017, Oracle and/or its affiliates. All rights reserved.
  • 2. Agenda
      • Objectives
      • Service use case
      • Service design goals and principles
      • Platform architecture
      • DevOps flow
      • Demo
      • Lessons learned
  • 3. Objectives (Takeaways)
      • Provide insights into building production-grade cloud services
      • Provide insights into a production-grade CI/CD pipeline
      • Share some lessons learned
      • Get insight into an actual real-world architecture
      • Raise awareness of potential pitfalls when entering this space
  • 4. Service Use Case
      • Backbone of other internal distributed services
      • Needed a service for
        – Leader election
        – Service registry and discovery
        – Configuration management
      • Potentially making it available to customers later
  • 5. Service Design Goals
  • 6. Service Design Goals
      • Hyper-scale
      • Highly available
      • Resilient
      • Multitenant
      • Optimal hardware utilization to optimize costs
      • Agile delivery of individual services, continuous deployments
  • 7. Service Design Principles
      • Design to optimize for time to market
        – Microservice architectural approach
        – Each service is delivered by an independent development team
        – Automate everything
        – E.g. the application consists of nine separate services delivered by five geographically separate development teams
      • Governance
        – Unit testing, coding standards, and code reviews on all commits
        – Common log format
      • Only services that are “deployable and testable” can be promoted
      • Build for operations
        – Custom dashboard UI provides status for all versioned manifests and services, identifying issues, bottlenecks, etc.
        – Diagnostics and monitoring UI, alerting, and tools in place
  • 8. Technology Stack
  • 9. Tech stack mainly focused on proven OSS technologies
      • Reliable infrastructure
        – Oracle Bare Metal Cloud Services
        – Mesos/Marathon
          • Currently managed by our team; will be moving to a managed CaaS
        – NGINX
      • Technologies designed for operations
        – Docker
        – ELK (Elasticsearch, Logstash, Kibana) + Grafana
        – Prometheus
      • Java (JAX-RS, Jersey, Grizzly, Netty, Coherence)
      • Jenkins CI
  • 10. Architectural overview
      [Diagram. Labels: Operator; Load Balancer (five instances); Management APIs (two instances); Mesos/Marathon; Tenant 1, Tenant 2, Tenant 3, Tenant 4; etcd-1 (five instances); AD 1, AD 2]
  • 11. Platform components
      • Load balancer
        – NGINX based
        – Control plane LB and tenant LB
        – Tenant LB sits between the service VCN and the tenant VCN
          • Acts as a ‘wormhole’ between the private networks
      • Management APIs
        – Provide endpoints for the Console and CLI to create new etcd services
      • Etcd service
        – Virtual concept based on a Coherence cluster
          • Etcd gateway == frontend nodes
          • Storage-enabled nodes == backend nodes
          • Data persisted to NVMe
  • 12. Platform components
      • Orchestrator
        – Implemented as a layer between the management APIs and Mesos/Marathon (M/M)
        – Responsible for provisioning the etcd service components in a particular order
        – Manages the life cycle of the etcd service
          • Checks for safe states etc.
        – Supports target environment profiles
          • Depending on the compute infrastructure, the orchestrator adjusts cluster size and JVM resource consumption
      • Platform manifest
        – Declarative way of bundling platform components into a release
        – Contains the name and version of the components (Docker images) being released
      • Platform installer
        – Deploys platform software as defined in the manifest
        – Can deploy to Mesos/Marathon, BM Container Service, or virtual machines
  • 13. Service Runtime Architecture
      [Diagram. Labels: Service VCN; Availability Domain 1, Availability Domain 2, Availability Domain 3; Load Balancer Service; Gateway and Backend nodes in each group; tenant instances T1 Inst 1, T1 Inst 2, T2 Inst 1, T2 Inst 2; Tenant 1 VCN, Tenant 2 VCN, Tenant n VCN; clients T1 Client 1, T1 Client 2, T2 Client 1]
  • 14. Testing and CI/CD Pipeline
  • 15. Testing Strategy
      • Service-level tests
        – Owned by each service team
        – Include
          • Unit tests
          • Component tests
          • Integration tests
        – Run as part of individual builds, prior to the CI/CD stage
      • Platform-level tests
        – Owned by the central test team
        – Include end-to-end tests
          • Functional Acceptance Test
          • Minimal Acceptance Test (MAT)
          • Longevity Test
          • Upgrade Test
          • Non-functional (Performance/Stress)
          • Jepsen testing
        – Run as part of the CI/CD pipeline
  • 16. CI/CD Pipeline and Testing Levels
      Note: X indicates platform components that have changed since the last release.
      • Level 1: “Verify”
        – Use a localhost installation (isolated sandbox environment) with the last published platform manifest
        – Install the new version of X on top
        – Run Functional Acceptance Tests in 10-minute parallel chunks
        – Run Upgrade Acceptance Tests, upgrading from the current production manifest to the new version of X
      • Level 2: “Pre-Stage”
        – Use a prod-like environment with the last published platform manifest
        – Deploy a platform instance from scratch using the new version of X on top
        – Run MATs (10 minutes)
        – On success, a new platform manifest is produced using the new version of X
        – Successfully passing this level represents CI
      • Level 3: “Stage”
        – Use a prod-like environment with the current production manifest already deployed
        – Upgrade to the new version of X
        – Run MATs (10 minutes)
      • Level 4: “Prod Candidate”
        – Frequency: once per night
        – Run long-running tests: Longevity, PSR, and Functional Acceptance Tests
        – Based on the results, a manual decision is made to deploy a specific manifest to production, at a frequency determined by management
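The note about "X" (the components that changed since the last release) can be illustrated with a small sketch. The talk does not show the manifest format, so this assumes a manifest is simply a map from component name to image version. The class `ManifestDiff`, the method `changedComponents`, and the component name `etcd-gateway` with its versions are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a platform manifest modeled as component name -> image version.
final class ManifestDiff {

    // Returns the candidate's components whose version differs from, or is
    // missing in, the last published manifest. This is the "X" a pipeline
    // run would install on top of the last published platform.
    static Map<String, String> changedComponents(Map<String, String> lastPublished,
                                                 Map<String, String> candidate) {
        Map<String, String> changed = new TreeMap<>();
        for (Map.Entry<String, String> entry : candidate.entrySet()) {
            String previousVersion = lastPublished.get(entry.getKey());
            if (!entry.getValue().equals(previousVersion)) {
                changed.put(entry.getKey(), entry.getValue());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, String> last = new TreeMap<>();
        last.put("etcd-nginx", "1.0.0-b21");
        last.put("etcd-gateway", "1.0.0-b40");   // made-up component and version

        Map<String, String> next = new TreeMap<>(last);
        next.put("etcd-gateway", "1.0.0-b41");   // only this component changed

        System.out.println(changedComponents(last, next));
    }
}
```

Keeping the manifest external and declarative is what makes a comparison like this possible: two pipeline runs against the same manifest test the same set of images.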
  • 17. Detailed Service Release Pipeline
      [Pipeline diagram. Stage labels as extracted: Etcd data plane and Etcd control plane, each with Build & Unit Tests, Integration Test, Publish Image; Platform Testing: Update Manifest, Install on pre-stage, MATS, Upgrade stage, UATS, Performance, Longevity under Stress, Parallel Test Runs, Publish Manifest, Prod Ready Candidate; Production Release: Review Prod Ready List, Select Manifest, Install Prod on stage, Rollback stage, Confirm Prod Ready, Canary on prod, Upgrade prod, Finish]
  • 18. Lessons learned
  • 19. Lessons Learned (Best Practices) – Services will fail
      • Retries
        – Anticipate transient errors from the services you are trying to reach
        – Implement a retry policy with an appropriate retry count and interval (e.g. exponential backoff, incremental intervals)
        – Ensure idempotency with retries
      • Circuit breaker
        – Prevents an application from retrying an operation that is likely to fail
        – Can be combined with the retry pattern
      • Bulkhead
        – Prevents faults in one part of the system from taking down the entire system
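A minimal sketch of how retries with exponential backoff can be combined with a circuit breaker. This is not the speakers' implementation: the class `ResilientCall`, its thresholds, and the delay values are illustrative assumptions, and the sketch is not thread-safe.

```java
import java.util.function.Supplier;

// Illustrative sketch: retry with exponential backoff behind a simple circuit breaker.
// Single-threaded use only; a real service would need synchronization.
final class ResilientCall {

    private final int failureThreshold; // consecutive failed calls before the circuit opens
    private final long openMillis;      // how long the circuit stays open
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    ResilientCall(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Delay before the retry that follows attempt number `attempt` (0-based):
    // baseMillis * 2^attempt, capped at capMillis.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        long delay = baseMillis << Math.min(attempt, 20);
        return Math.min(delay, capMillis);
    }

    boolean isOpen(long nowMillis) {
        return consecutiveFailures >= failureThreshold && nowMillis - openedAt < openMillis;
    }

    // Runs `operation`, which must be idempotent, up to maxAttempts times.
    <T> T call(Supplier<T> operation, int maxAttempts, long baseMillis, long capMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        if (isOpen(System.currentTimeMillis())) {
            throw new IllegalStateException("circuit open: failing fast");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                T result = operation.get();
                consecutiveFailures = 0; // a success closes the circuit
                return result;
            } catch (RuntimeException e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    sleep(backoffMillis(attempt, baseMillis, capMillis));
                }
            }
        }
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis();
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        ResilientCall breaker = new ResilientCall(3, 5_000);
        int[] calls = {0};
        String value = breaker.call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient error");
            }
            return "ok";
        }, 5, 10, 1_000);
        System.out.println(value + " after " + calls[0] + " attempts");
    }
}
```

The breaker counts whole calls that exhausted their retries, not individual attempts, so a few transient errors do not open the circuit. Once the open interval has elapsed, the next call is let through as a trial.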
  • 20. Lessons Learned (Best Practices) – Communication
      • More services mean more communication and data exchange
        – HTTP (HTTP/2) for external and internal communication
        – TCP/UDP for internal communication, for better performance
        – Serialization format: JSON, Protocol Buffers, Coherence POF
      • Serialization and deserialization can be a bottleneck in large-scale services
        – Consider whether you need to re-serialize if a downstream service works with the same object
          • Augment the deserialized object and pass it on to another service in a form
        – Choose a JSON serializer wisely
        – Jersey (JAX-RS implementation), Jetty (as the HTTP transport), and Jackson get you pretty far
  • 21. Lessons Learned – Docker and Java Apps
      • Memory
        – The JVM does not honor Docker runtime metrics
          • The JVM tries to use all the memory it sees
          • The Docker daemon kills the container when it exceeds its memory constraint
          • To avoid this: specify a max heap size for the process that is lower than the container memory constraint
        – Ensure the JVM memory settings are kept in sync with the Marathon container memory settings
          • Failure to do this can cause Marathon to repeatedly kill the container when the enclosed JVM hits a memory event that we Java programmers would consider a "normal" event
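The heap-versus-container rule above can be sketched as a small sizing helper. The 0.75 heap fraction, the class `HeapSizing`, and the system property `container.memory.mib` are assumptions made for illustration. The slides give no concrete ratio, only that the max heap must stay below the container limit.

```java
// Illustrative sketch: derive a max heap size from the container memory limit
// and check a JVM's max heap against that limit.
final class HeapSizing {

    // Heap in MiB to pass as -Xmx for a container limited to containerMiB.
    // The remainder is left for metaspace, thread stacks, and direct buffers.
    static long maxHeapMiB(long containerMiB, double heapFraction) {
        if (containerMiB <= 0 || heapFraction <= 0.0 || heapFraction >= 1.0) {
            throw new IllegalArgumentException(
                    "need containerMiB > 0 and 0 < heapFraction < 1");
        }
        return (long) Math.floor(containerMiB * heapFraction);
    }

    // True if the heap the JVM may grow to is strictly below the container limit.
    static boolean heapFitsContainer(long jvmMaxHeapBytes, long containerMiB) {
        return jvmMaxHeapBytes < containerMiB * 1024L * 1024L;
    }

    public static void main(String[] args) {
        // Hypothetical property carrying the limit from the Marathon app definition.
        long containerMiB = Long.getLong("container.memory.mib", 1024L);
        System.out.println("-Xmx" + maxHeapMiB(containerMiB, 0.75) + "m");
        System.out.println("heap fits: "
                + heapFitsContainer(Runtime.getRuntime().maxMemory(), containerMiB));
    }
}
```

Deriving both the `-Xmx` flag and the container limit from one value in the deployment tooling is one way to keep the two settings from drifting apart.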
  • 22. Lessons Learned – Docker and Java Apps
      • CPU
        – A JVM running in a container sees all the cores of the host machine
          • Configure the core count manually if you rely on that information (e.g. to create threads)
          • Workaround: use a -D Java system property
        – Ensure the JVM CPU settings are kept in sync with the Marathon container CPU settings
          • Failure to do this can cause the container to never start
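One way the `-D` workaround might look is sketched below. The property name `service.cpus` and the class `CpuBudget` are hypothetical, since the slides do not say which property was used.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch: size thread pools from an explicit system property
// instead of trusting the core count the JVM reports.
final class CpuBudget {

    static final String CPU_PROPERTY = "service.cpus"; // hypothetical property name

    // Returns the configured CPU count, falling back to what the JVM reports
    // when the property is unset, not a number, or less than 1.
    static int effectiveCpus() {
        Integer configured = Integer.getInteger(CPU_PROPERTY);
        if (configured == null || configured < 1) {
            return Runtime.getRuntime().availableProcessors();
        }
        return configured;
    }

    static ExecutorService newWorkerPool() {
        return Executors.newFixedThreadPool(effectiveCpus());
    }

    public static void main(String[] args) {
        // Run with, for example: java -Dservice.cpus=2 CpuBudget.java
        ExecutorService pool = newWorkerPool();
        System.out.println("worker threads: " + effectiveCpus());
        pool.shutdown();
    }
}
```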
  • 23. Lessons Learned – Marathon (or other orchestrator)
      • Mesos/Marathon status may not match service status
        – M/M reports on container status
        – It may take longer for the service inside the container to come up
        – Workaround: configure a health check URL for the service so that one can definitively conclude that the container and the process inside it are up and running
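A sketch of the service side of such a health check, using the JDK's built-in `com.sun.net.httpserver.HttpServer`. The `/health` path, the response bodies, and the class `HealthEndpoint` are assumptions. The talk's services use Jersey, whose endpoint code would look different.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch: answer 503 until the service has finished initializing,
// then 200, so a running container is not mistaken for a ready service.
final class HealthEndpoint {

    private final AtomicBoolean ready = new AtomicBoolean(false);
    private final HttpServer server;

    HealthEndpoint(int port) {
        try {
            server = HttpServer.create(new InetSocketAddress(port), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server.createContext("/health", exchange -> {
            boolean up = ready.get();
            byte[] body = (up ? "UP" : "STARTING").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(up ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }

    int port() { return server.getAddress().getPort(); }

    void markReady() { ready.set(true); } // call once initialization is complete

    void stop() { server.stop(0); }

    // Returns the HTTP status of GET /health, as a health checker would see it.
    static int probe(int port) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                    URI.create("http://127.0.0.1:" + port + "/health").toURL().openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            try {
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HealthEndpoint health = new HealthEndpoint(0); // 0 = pick a free port
        System.out.println("before init: " + probe(health.port()));
        health.markReady();
        System.out.println("after init:  " + probe(health.port()));
        health.stop();
    }
}
```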
  • 24. Lessons Learned – NGINX
      • Dynamic reconfiguration of NGINX
        – Configuration updates are based on events sent by the orchestrator
          • E.g. the orchestrator spins up new etcd clusters
          • The LB needs to be aware of the new routes
        – Not easy to find out when the change was applied
        – Workaround: ping the service via NGINX to make sure the service is back up
        – Disable NGINX logging
        – Increase worker processes to match load (auto == CPU count)
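The "ping the service via NGINX" workaround amounts to polling until the new route answers. A minimal sketch follows, with the actual HTTP request abstracted behind an `IntSupplier` so the loop can be shown on its own. The class `RouteReadiness`, the attempt count, the interval, and the simulated status codes are assumptions.

```java
import java.util.function.IntSupplier;

// Illustrative sketch: poll a route through the load balancer until it serves traffic.
final class RouteReadiness {

    // Calls statusProbe (e.g. an HTTP GET sent through the LB) until it returns
    // a 2xx status. Returns false if that does not happen within maxAttempts.
    static boolean awaitRoute(IntSupplier statusProbe, int maxAttempts, long intervalMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int status;
            try {
                status = statusProbe.getAsInt();
            } catch (RuntimeException e) {
                status = -1; // connection refused, timeout, etc.: treat as "not yet"
            }
            if (status >= 200 && status < 300) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] simulated = {502, 502, 200}; // stand-in for real probes through the LB
        int[] next = {0};
        boolean live = awaitRoute(
                () -> simulated[Math.min(next[0]++, simulated.length - 1)], 10, 100);
        System.out.println("route live: " + live);
    }
}
```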
  • 25. Lessons Learned (Best Practice) – Containers
      • Use proper container image versioning
        – E.g. etcd-nginx:1.0.0-b21
        – Avoid the latest tag
      • Use small base images
        – E.g. Oracle Linux 7.1 - slim
        – Large base images can delay service readiness
      • Use one base image for multiple purposes
        – Add functionality to the base image
        – Enable features by configuration
        – E.g. the tenant LB needs to support HTTP/2 (required by etcd v3, which uses gRPC)
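The versioning rule lends itself to an automated check, for instance when a manifest is assembled. The sketch below is an assumption about how such a check could look. The class `ImageTagPolicy` and the registry host in the usage example are made up, and digest references (`@sha256:...`) are deliberately not handled.

```java
// Illustrative sketch: reject image references that are untagged or use "latest".
final class ImageTagPolicy {

    // Returns the tag of a reference such as "etcd-nginx:1.0.0-b21", or null
    // when no tag is present. Only a colon after the last '/' starts a tag,
    // so a registry port ("host:5000/name") is not mistaken for one.
    static String tagOf(String imageRef) {
        int lastSlash = imageRef.lastIndexOf('/');
        int colon = imageRef.indexOf(':', lastSlash + 1);
        return colon < 0 ? null : imageRef.substring(colon + 1);
    }

    // True if the reference carries an explicit tag other than "latest".
    static boolean isPinned(String imageRef) {
        String tag = tagOf(imageRef);
        return tag != null && !tag.isEmpty() && !tag.equals("latest");
    }

    public static void main(String[] args) {
        String[] refs = {
            "etcd-nginx:1.0.0-b21",
            "etcd-nginx:latest",
            "etcd-nginx",
            "registry.example.com:5000/etcd-nginx" // hypothetical registry host
        };
        for (String ref : refs) {
            System.out.println(ref + " pinned=" + isPinned(ref));
        }
    }
}
```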
  • 26. CI/CD Pipeline – Lessons Learned
      • Automating everything may not be possible
      • Let teams choose what they want to use
        – Supporting both Maven and Gradle projects allowed dev teams to choose the tools they preferred
        – Helped get devs more involved in the CI/CD process
      • Restrict the testing pipeline to “deployable” components (i.e. Docker images)
        – Teams producing Java libraries are required to coordinate with Docker image producers
        – Dev projects should be responsible for managing dependencies with other projects
  • 27. CI/CD Testing – Lessons Learned
      • Isolated sandbox environment for development, debugging, and testing
        – Team members should be able to easily stand up their own isolated environment
      • Agile “testing pyramid” of unit / integration / end-to-end tests
        – More unit tests for better quality; fewer end-to-end tests to reduce CI cycle time
      • Parallelize testing where possible
        – To execute more tests in a short amount of time
      • Test upgrades early in the CI/CD pipeline
        – Saves cycle time on other expensive tests if upgrades fail or introduce regressions
      • Address intermittent failures in end-to-end tests right away
        – Prioritize these failures, identify the root cause or faulty component, and fix it first
  • 28. CI/CD General – Lessons Learned
      • A developer-focused dashboard is essential for identifying failures and pipeline blockages
      • Providing the information needed for diagnosis is essential to understanding failures in a timely manner
      • More specific to our situation:
        – Defining an external platform manifest was a good choice
          • Allowed reproducible test results
          • Allowed mixing and matching of component versions
        – Wish we had abstracted the container management interfaces
          • Would allow moving between orchestrators
  • 29. Summary
      • A microservices approach helps with time to market
        – Be aware that you are dealing with a distributed system
      • Tools and technologies of choice require governance
      • CI/CD pipeline and automation are key
        – You may not be able to automate all the way up to continuous deployment
      • More on that topic:
        – Five-part blog series: Getting started with microservices
          • https://blogs.oracle.com/developers/getting-started-with-microservices-part-one

Editor's Notes

  1. Secure patched version of images: https://hub.docker.com/_/oraclelinux/