SlideShare a Scribd company logo
1 of 36
Download to read offline
An Introduction to
Apache Mesos
Timothy St. Clair
@timothysc
Special Thanks To:
Paco Nathan
http://liber118.com/pxn
@pacoid
Let's have some fun!
Please ask questions.
Overview
● Project Synopsis
● History
● Ecosystem Questions
● More Detailed Description
● Interesting Use Cases
● Current “Shiny” Developments
What is Apache Mesos?
Apache Mesos is a cluster manager that
provides efficient resource isolation and sharing
across distributed applications, or frameworks.
It can run Hadoop, MPI, Hypertable, Spark,
Elastic Search, Storm, Aurora, Marathon ... and
other applications on a dynamically shared pool of
nodes.
Project Highlights
● Top-level Apache project ~ 1 year (mesos.apache.org)
● Scales to 10,000s of nodes
● Obviates the need for virtual machines
● Isolation for CPU, RAM, I/O, FS, etc.
● Fault-tolerant leader election based on Zookeeper
● API's in C++, Java/Scala, Python, Go, Erlang, Haskell.
● Web UI for inspecting state
● Available for Linux, OpenSolaris, Mac OSX
Who is using Mesos? - Early Adopters
Google Refs
● The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines
http://research.google.com/pubs/pub35290.html
● 2011 GAFS Omega John Wilkes:
http://youtu.be/0ZFMlO98Jkc
● Omega: flexible, scalable schedulers for large compute
clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper
/Schwarzkopf.pdf
● Taming Latency Variability and Scaling Deep Learning
https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh
QhcDRv
History
Understanding of Datacenter Computing
Google has been doing data center computing for
years to address the complexities of large-scale
data workflows:
● Leveraging the modern kernel isolation. (cgroups)
● Containerization !Virtualization (lmctfy - Docker)
● Most (>80) jobs are batch jobs, but the majority of
resources(55-80%) are allocated to service jobs.
● Mixed workloads, multi-tenancy
● Relatively high utilization rates
● JVM? Not so much...
● Reality: scheduling batch is simple;
– scheduling services is hard/expensive.
Refs.
● The Datacenter as a Computer: An Introduction
to the Design of Warehouse-Scale Machines
– http://research.google.com/pubs/pub35290.html
● GAFS Omega John Wilkes
– https://www.youtube.com/watch?v=0ZFMlO98Jkc
● Taming Latency Variability and Scaling Deep
Learning
– http://youtu.be/nK6daeTZGA8
Google Refs
● The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines
http://research.google.com/pubs/pub35290.html
● 2011 GAFS Omega John Wilkes:
http://youtu.be/0ZFMlO98Jkc
● Omega: flexible, scalable schedulers for large compute
clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper
/Schwarzkopf.pdf
● Taming Latency Variability and Scaling Deep Learning
https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh
QhcDRv
Ecosystem
Questions?
Aren't There Several Other Existing
Solutions? - “Sounds Like” a grid
● IBM Platform Symphony
● Microsoft Autopilot
● Univa Grid Engine (SGE)
● Condor (Full Disclosure)
Follow Up:
What is the Gap?
● Many existing grid-solutions had architectural
deficiencies around a constraining model.
Everything is a “job”.
– Good at Batch, but tough for Services
– What happens when you want to write your own
distributed application? (no primitives)
– What happens when you want to write your own
scheduler (elastic service). Square wheel
reinvention.
What about Clouds?
● Can't you just “Cloud All the things”
– It's not very efficient
– Can be quite costly
● It doesn't actually solve the root cause of many of the
problems in applications, and in some cases a Cloud can
cause more issues, not less.
– Network Latency
– Data Gravity
– Multi-tenant Service Elasticity
...
Google Refs
● The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines
http://research.google.com/pubs/pub35290.html
● 2011 GAFS Omega John Wilkes:
http://youtu.be/0ZFMlO98Jkc
● Omega: flexible, scalable schedulers for large compute
clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper
/Schwarzkopf.pdf
● Taming Latency Variability and Scaling Deep Learning
https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh
QhcDRv
Reality Check
The New Reality
● New applications need to be:
– Fault Tolerant (Withstand failure)
– Scalable (Doesn't crumble under it's own weight)
– Elastic (Can grow and shrink based on demand)
– Multi-tenent (It can't have it's own dedicated cluster)
● So what does that really mean?
Distributed Applications
● “There's Just No Getting Around It: You're Building a Distributed System” Mark Cavage
– queue.acm.org/detail.cfm?id=2482856
● Key takeaways on architecture:
– Decompose the business applications into discrete services on the boundaries of fault
domains, scaling, and data workload.
– Make as many things as possible stateless
– When dealing with state, deeply understand CAP, latency, throughput, and durability
requirements.
“Without practical experience working on successful—and failed—systems, most engineers take
a "hopefully it works" approach and attempt to string together off-the-shelf software, whether
open source or commercial, and often are unsuccessful at building a resilient, performant system.
In reality, building a distributed system requires a methodical approach to requirements along the
boundaries of failure domains, latency, throughput, durability, consistency, and desired SLAs for
the business application at all aspects of the application.”
Google Refs
● The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines
http://research.google.com/pubs/pub35290.html
● 2011 GAFS Omega John Wilkes:
http://youtu.be/0ZFMlO98Jkc
● Omega: flexible, scalable schedulers for large compute
clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper
/Schwarzkopf.pdf
● Taming Latency Variability and Scaling Deep Learning
https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh
QhcDRv
Emerging
at Berkeley
Beyond Hadoop
Hadoop – an open source solution for fault-tolerant parallel
processing of batch jobs at scale, based on commodity hardware...
However, other priorities have emerged for analytics lifecycle:
●
Applications require integration beyond Hadoop
●
Multiple topologies, mixed workloads, mult-tenancy
●
Higher utilization
● Lower latency
●
High availability
●
More then “Just JVM” - e.g. Python ...
Next Generation Data Analytics Stack
BDAS Stack
Prior Practice: Dedicated Servers
• low utilization rates
• longer time to ramp up new services
DATACENTER
Prior Practice: Virtualization
DATACENTER PROVISIONED VMS
• even more machines to manage
• substantial performance decrease
due to virtualization
• VM licensing costs
Prior Practice: Static Partitioning
STATIC PARTITIONING
• even more machines to manage
• substantial performance decrease
due to virtualization
• VM licensing costs
• failures make static partitioning
more complex to manage
DATACENTER
What are the costs of Single Tenancy?
0%
25%
50%
75%
100%
RAILS CPU
LOAD
MEMCACHED
CPU LOAD
0%
25%
50%
75%
100%
HADOOP CPU
LOAD
0%
25%
50%
75%
100%
tt
0%
25%
50%
75%
100%
Rails
Memcached
Hadoop
COMBINED CPU LOAD (RAILS,
MEMCACHED, HADOOP)
MESOS
Mesos: One Large Pool of Resources
“We wanted people to be able to program
for the datacenter just like they program
for their laptop."
Ben Hindman
DATACENTER
MESOS
Mesos: One Large Pool of Resources
“We wanted people to be able to program
for the datacenter just like they program
for their laptop."
Ben Hindman
DATACENTER
Re-eval – What is Mesos?
● Mesos is a meta-scheduler
– Mesos is a distributed system to build and run distributed
systems.
● Microkernel for the datacenter.
– Common substrate, or programming abstractions, for
creating, or adapting distributed applications.
Kernel
Apps
servicesbatch
Frameworks
Python
JVM
C
++
Workloads
distributed file system
Chronos
DFS
distributed resources: CPU, RAM, I/O, FS, rack locality, etc. Cluster
Storm
Kafka JBoss Django RailsSharkImpalaScalding
Marathon
SparkHadoopMPI
MySQL
Mesos – architecture
Google Refs
● The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines
http://research.google.com/pubs/pub35290.html
● 2011 GAFS Omega John Wilkes:
http://youtu.be/0ZFMlO98Jkc
● Omega: flexible, scalable schedulers for large compute
clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper
/Schwarzkopf.pdf
● Taming Latency Variability and Scaling Deep Learning
https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh
QhcDRv
Use Cases
Case Study: Twitter (bare metal / on premise)
“Mesos is the cornerstone of our elastic compute infrastructure –
it’s how we build all our new services and is critical for Twitter’s
continued success at scale. It's one of the primary keys to our
data center efficiency."
Chris Fry, SVP Engineering
blog.twitter.com/2013/mesos-graduates-from-apache-incubation
wired.com/gadgetlab/2013/11/qa-with-chris-fry/
• key services run in production: analytics, typeahead, ads
• Twitter engineers rely on Mesos to build all new services
• instead of thinking about static machines, engineers think
about resources like CPU, memory and disk
• allows services to scale and leverage a shared pool of
servers across datacenters efficiently
• reduces the time between prototyping and launching
Case Study: Airbnb (fungible cloud infrastructure)
“We think we might be pushing data science in the field of travel
more so than anyone has ever done before… a smaller number
of engineers can have higher impact through automation on
Mesos."
Mike Curtis, VP Engineering
gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data...
• improves resource management and efficiency
• helps advance engineering strategy of building small teams
that can move fast
• key to letting engineers make the most of AWS-based
infrastructure beyond just Hadoop
• allowed company to migrate off Elastic MapReduce
• enables use of Hadoop along with Chronos, Spark, Storm, etc.
Case Study: HubSpot (cluster management)
Tom Petr
youtu.be/ROn14csiikw
• 500 deployable objects; 100 deploys/day to
production; 90 engineers; 3 devops on Mesos
cluster
• “Our QA cluster is now a fixed $10K/month —
that used to fluctuate”
Google Refs
● The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines
http://research.google.com/pubs/pub35290.html
● 2011 GAFS Omega John Wilkes:
http://youtu.be/0ZFMlO98Jkc
● Omega: flexible, scalable schedulers for large compute
clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper
/Schwarzkopf.pdf
● Taming Latency Variability and Scaling Deep Learning
https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh
QhcDRv
The Shiny!
Dock-ah, dock-ah, dock-ah
● Enumerate hyperbolic awesome-sauce!
● Mesos had plugable add-ons for docker, now
they are being 1st classed into the core.
● Google Kubernetes Framework:
http://gigaom.com/2014/06/11/why-google-is-
sowing-the-seeds-of-container-based-
computing/
API's & More Plugability
● Always supported Protobuf API's compiled
against core libraries. Now native bindings are
occurring in the wild. (Go, Python, Java).
● Additional Plug-ins for Containerizers, Isolators,
...
Want to know More?
Come Join Us @ #MesosCon
Google Refs
● The Datacenter as a Computer: An Introduction to the Design
of Warehouse-Scale Machines
http://research.google.com/pubs/pub35290.html
● 2011 GAFS Omega John Wilkes:
http://youtu.be/0ZFMlO98Jkc
● Omega: flexible, scalable schedulers for large compute
clusters
http://eurosys2013.tudos.org/wp-content/uploads/2013/paper
/Schwarzkopf.pdf
● Taming Latency Variability and Scaling Deep Learning
https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh
QhcDRv
Thank You!
mesos.apache.org
@timothysc

More Related Content

What's hot

What's hot (20)

Mesos vs kubernetes comparison
Mesos vs kubernetes comparisonMesos vs kubernetes comparison
Mesos vs kubernetes comparison
 
Deploying Containers in Production and at Scale
Deploying Containers in Production and at ScaleDeploying Containers in Production and at Scale
Deploying Containers in Production and at Scale
 
DEPLOYING A DOCKERIZED DISTRIBUTED APPLICATION IN MESOS
DEPLOYING A DOCKERIZED DISTRIBUTED APPLICATION IN MESOSDEPLOYING A DOCKERIZED DISTRIBUTED APPLICATION IN MESOS
DEPLOYING A DOCKERIZED DISTRIBUTED APPLICATION IN MESOS
 
Building and deploying a distributed application with Docker, Mesos and Marathon
Building and deploying a distributed application with Docker, Mesos and MarathonBuilding and deploying a distributed application with Docker, Mesos and Marathon
Building and deploying a distributed application with Docker, Mesos and Marathon
 
DC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern appsDC/OS: The definitive platform for modern apps
DC/OS: The definitive platform for modern apps
 
Kubernetes on Top of Mesos on Top of DCOS
Kubernetes on Top of Mesos on Top of DCOSKubernetes on Top of Mesos on Top of DCOS
Kubernetes on Top of Mesos on Top of DCOS
 
Musings on Mesos: Docker, Kubernetes, and Beyond.
Musings on Mesos: Docker, Kubernetes, and Beyond.Musings on Mesos: Docker, Kubernetes, and Beyond.
Musings on Mesos: Docker, Kubernetes, and Beyond.
 
Federated mesos clusters for global data center designs
Federated mesos clusters for global data center designsFederated mesos clusters for global data center designs
Federated mesos clusters for global data center designs
 
Container Orchestration @Docker Meetup Hamburg
Container Orchestration @Docker Meetup HamburgContainer Orchestration @Docker Meetup Hamburg
Container Orchestration @Docker Meetup Hamburg
 
Introduction to DC/OS
Introduction to DC/OSIntroduction to DC/OS
Introduction to DC/OS
 
Mesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overviewMesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overview
 
Hyperscale Computing, Enterprise Agility with Mesosphere
Hyperscale Computing, Enterprise Agility with MesosphereHyperscale Computing, Enterprise Agility with Mesosphere
Hyperscale Computing, Enterprise Agility with Mesosphere
 
Docker, Mesos, Spark
Docker, Mesos, Spark Docker, Mesos, Spark
Docker, Mesos, Spark
 
HA Kubernetes on Mesos / Marathon
HA Kubernetes on Mesos / MarathonHA Kubernetes on Mesos / Marathon
HA Kubernetes on Mesos / Marathon
 
Orchestrating Redis & K8s Operators
Orchestrating Redis & K8s OperatorsOrchestrating Redis & K8s Operators
Orchestrating Redis & K8s Operators
 
Docker at Shopify: From This-Looks-Fun to Production by Simon Eskildsen (Shop...
Docker at Shopify: From This-Looks-Fun to Production by Simon Eskildsen (Shop...Docker at Shopify: From This-Looks-Fun to Production by Simon Eskildsen (Shop...
Docker at Shopify: From This-Looks-Fun to Production by Simon Eskildsen (Shop...
 
Discover the all new Mesosphere DC/OS 1.10
Discover the all new Mesosphere DC/OS 1.10Discover the all new Mesosphere DC/OS 1.10
Discover the all new Mesosphere DC/OS 1.10
 
Integrating Docker with Mesos and Marathon
Integrating Docker with Mesos and MarathonIntegrating Docker with Mesos and Marathon
Integrating Docker with Mesos and Marathon
 
Scaling Development Environments with Docker
Scaling Development Environments with DockerScaling Development Environments with Docker
Scaling Development Environments with Docker
 
Container Orchestration Wars (Micro Edition)
Container Orchestration Wars (Micro Edition)Container Orchestration Wars (Micro Edition)
Container Orchestration Wars (Micro Edition)
 

Similar to Introduction To Apache Mesos

Network Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspectiveNetwork Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspective
Walid Shaari
 
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Andrew Miller
 

Similar to Introduction To Apache Mesos (20)

Apache Mesos Overview and Integration
Apache Mesos Overview and IntegrationApache Mesos Overview and Integration
Apache Mesos Overview and Integration
 
Transforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsTransforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOps
 
SQL Server & la virtualisation : « 45 minutes inside » !
SQL Server & la virtualisation :  « 45 minutes inside » !SQL Server & la virtualisation :  « 45 minutes inside » !
SQL Server & la virtualisation : « 45 minutes inside » !
 
SQL Server & la virtualisation : « 45 minutes inside » !
SQL Server & la virtualisation :  « 45 minutes inside » !SQL Server & la virtualisation :  « 45 minutes inside » !
SQL Server & la virtualisation : « 45 minutes inside » !
 
Modernizing Applications with Microservices and DC/OS (Lightbend/Mesosphere c...
Modernizing Applications with Microservices and DC/OS (Lightbend/Mesosphere c...Modernizing Applications with Microservices and DC/OS (Lightbend/Mesosphere c...
Modernizing Applications with Microservices and DC/OS (Lightbend/Mesosphere c...
 
Network Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspectiveNetwork Automation Journey, A systems engineer NetOps perspective
Network Automation Journey, A systems engineer NetOps perspective
 
OCCIware: Extensible and Standard-based XaaS Platform To Manage Everything in...
OCCIware: Extensible and Standard-based XaaS Platform To Manage Everything in...OCCIware: Extensible and Standard-based XaaS Platform To Manage Everything in...
OCCIware: Extensible and Standard-based XaaS Platform To Manage Everything in...
 
OCCIware, an extensible, standard-based XaaS consumer platform to manage ever...
OCCIware, an extensible, standard-based XaaS consumer platform to manage ever...OCCIware, an extensible, standard-based XaaS consumer platform to manage ever...
OCCIware, an extensible, standard-based XaaS consumer platform to manage ever...
 
OCCIware@POSS 2016 - an extensible, standard XaaS cloud consumer platform
OCCIware@POSS 2016 - an extensible, standard XaaS cloud consumer platformOCCIware@POSS 2016 - an extensible, standard XaaS cloud consumer platform
OCCIware@POSS 2016 - an extensible, standard XaaS cloud consumer platform
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
How jKool Analyzes Streaming Data in Real Time with DataStax
How jKool Analyzes Streaming Data in Real Time with DataStaxHow jKool Analyzes Streaming Data in Real Time with DataStax
How jKool Analyzes Streaming Data in Real Time with DataStax
 
How jKool Analyzes Streaming Data in Real Time with DataStax
How jKool Analyzes Streaming Data in Real Time with DataStaxHow jKool Analyzes Streaming Data in Real Time with DataStax
How jKool Analyzes Streaming Data in Real Time with DataStax
 
Journey to Containerized Application / Google Container Engine
Journey to Containerized Application / Google Container EngineJourney to Containerized Application / Google Container Engine
Journey to Containerized Application / Google Container Engine
 
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
Varrow Q4 Lunch & Learn Presentation - Virtualizing Business Critical Applica...
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
Openstack Summit Tokyo 2015 - Building a private cloud to efficiently handle ...
 
Measure and Increase Developer Productivity with Help of Serverless at Server...
Measure and Increase Developer Productivity with Help of Serverless at Server...Measure and Increase Developer Productivity with Help of Serverless at Server...
Measure and Increase Developer Productivity with Help of Serverless at Server...
 
Stay productive while slicing up the monolith
Stay productive while slicing up the monolithStay productive while slicing up the monolith
Stay productive while slicing up the monolith
 
Stay productive while slicing up the monolith
Stay productive while slicing up the monolithStay productive while slicing up the monolith
Stay productive while slicing up the monolith
 
Measure and Increase Developer Productivity with Help of Serverless at JCON 2...
Measure and Increase Developer Productivity with Help of Serverless at JCON 2...Measure and Increase Developer Productivity with Help of Serverless at JCON 2...
Measure and Increase Developer Productivity with Help of Serverless at JCON 2...
 

Recently uploaded

Recently uploaded (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Introduction To Apache Mesos

  • 1. An Introduction to Apache Mesos Timothy St. Clair @timothysc Special Thanks To: Paco Nathan http://liber118.com/pxn @pacoid
  • 2. Let's have some fun! Please ask questions.
  • 3. Overview ● Project Synopsis ● History ● Ecosystem Questions ● More Detailed Description ● Interesting Use Cases ● Current “Shiny” Developments
  • 4. What is Apache Mesos? Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, Elastic Search, Storm, Aurora, Marathon ... and other applications on a dynamically shared pool of nodes.
  • 5. Project Highlights ● Top-level Apache project ~ 1 year (mesos.apache.org) ● Scales to 10,000s of nodes ● Obviates the need for virtual machines ● Isolation for CPU, RAM, I/O, FS, etc. ● Fault-tolerant leader election based on Zookeeper ● API's in C++, Java/Scala, Python, Go, Erlang, Haskell. ● Web UI for inspecting state ● Available for Linux, OpenSolaris, Mac OSX
  • 6. Who is using Mesos? - Early Adopters
  • 7. Google Refs ● The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines http://research.google.com/pubs/pub35290.html ● 2011 GAFS Omega John Wilkes: http://youtu.be/0ZFMlO98Jkc ● Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper /Schwarzkopf.pdf ● Taming Latency Variability and Scaling Deep Learning https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh QhcDRv History
  • 8. Understanding of Datacenter Computing Google has been doing data center computing for years to address the complexities of large-scale data workflows: ● Leveraging the modern kernel isolation. (cgroups) ● Containerization !Virtualization (lmctfy - Docker) ● Most (>80) jobs are batch jobs, but the majority of resources(55-80%) are allocated to service jobs. ● Mixed workloads, multi-tenancy ● Relatively high utilization rates ● JVM? Not so much... ● Reality: scheduling batch is simple; – scheduling services is hard/expensive.
  • 9. Refs. ● The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines – http://research.google.com/pubs/pub35290.html ● GAFS Omega John Wilkes – https://www.youtube.com/watch?v=0ZFMlO98Jkc ● Taming Latency Variability and Scaling Deep Learning – http://youtu.be/nK6daeTZGA8
  • 10. Google Refs ● The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines http://research.google.com/pubs/pub35290.html ● 2011 GAFS Omega John Wilkes: http://youtu.be/0ZFMlO98Jkc ● Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper /Schwarzkopf.pdf ● Taming Latency Variability and Scaling Deep Learning https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh QhcDRv Ecosystem Questions?
  • 11. Aren't There Several Other Existing Solutions? - “Sounds Like” a grid ● IBM Platform Symphony ● Microsoft Autopilot ● Univa Grid Engine (SGE) ● Condor (Full Disclosure)
  • 12. Follow Up: What is the Gap? ● Many existing grid-solutions had architectural deficiencies around a constraining model. Everything is a “job”. – Good at Batch, but tough for Services – What happens when you want to write your own distributed application? (no primitives) – What happens when you want to write your own scheduler (elastic service). Square wheel reinvention.
  • 13. What about Clouds? ● Can't you just “Cloud All the things” – It's not very efficient – Can be quite costly ● It doesn't actually solve the root cause of many of the problems in applications, and in some cases a Cloud can cause more issues, not less. – Network Latency – Data Gravity – Multi-tenant Service Elasticity ...
  • 14. Google Refs ● The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines http://research.google.com/pubs/pub35290.html ● 2011 GAFS Omega John Wilkes: http://youtu.be/0ZFMlO98Jkc ● Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper /Schwarzkopf.pdf ● Taming Latency Variability and Scaling Deep Learning https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh QhcDRv Reality Check
  • 15. The New Reality ● New applications need to be: – Fault Tolerant (Withstand failure) – Scalable (Doesn't crumble under it's own weight) – Elastic (Can grow and shrink based on demand) – Multi-tenent (It can't have it's own dedicated cluster) ● So what does that really mean?
  • 16. Distributed Applications ● “There's Just No Getting Around It: You're Building a Distributed System” Mark Cavage – queue.acm.org/detail.cfm?id=2482856 ● Key takeaways on architecture: – Decompose the business applications into discrete services on the boundaries of fault domains, scaling, and data workload. – Make as many things as possible stateless – When dealing with state, deeply understand CAP, latency, throughput, and durability requirements. “Without practical experience working on successful—and failed—systems, most engineers take a "hopefully it works" approach and attempt to string together off-the-shelf software, whether open source or commercial, and often are unsuccessful at building a resilient, performant system. In reality, building a distributed system requires a methodical approach to requirements along the boundaries of failure domains, latency, throughput, durability, consistency, and desired SLAs for the business application at all aspects of the application.”
  • 17. Google Refs ● The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines http://research.google.com/pubs/pub35290.html ● 2011 GAFS Omega John Wilkes: http://youtu.be/0ZFMlO98Jkc ● Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper /Schwarzkopf.pdf ● Taming Latency Variability and Scaling Deep Learning https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh QhcDRv Emerging at Berkeley
  • 18. Beyond Hadoop Hadoop – an open source solution for fault-tolerant parallel processing of batch jobs at scale, based on commodity hardware... However, other priorities have emerged for analytics lifecycle: ● Applications require integration beyond Hadoop ● Multiple topologies, mixed workloads, mult-tenancy ● Higher utilization ● Lower latency ● High availability ● More then “Just JVM” - e.g. Python ... Next Generation Data Analytics Stack
  • 20. Prior Practice: Dedicated Servers • low utilization rates • longer time to ramp up new services DATACENTER
  • 21. Prior Practice: Virtualization DATACENTER PROVISIONED VMS • even more machines to manage • substantial performance decrease due to virtualization • VM licensing costs
  • 22. Prior Practice: Static Partitioning STATIC PARTITIONING • even more machines to manage • substantial performance decrease due to virtualization • VM licensing costs • failures make static partitioning more complex to manage DATACENTER
  • 23. What are the costs of Single Tenancy? 0% 25% 50% 75% 100% RAILS CPU LOAD MEMCACHED CPU LOAD 0% 25% 50% 75% 100% HADOOP CPU LOAD 0% 25% 50% 75% 100% tt 0% 25% 50% 75% 100% Rails Memcached Hadoop COMBINED CPU LOAD (RAILS, MEMCACHED, HADOOP)
  • 24. MESOS Mesos: One Large Pool of Resources “We wanted people to be able to program for the datacenter just like they program for their laptop." Ben Hindman DATACENTER
  • 25. MESOS Mesos: One Large Pool of Resources “We wanted people to be able to program for the datacenter just like they program for their laptop." Ben Hindman DATACENTER
  • 26. Re-eval – What is Mesos? ● Mesos is a meta-scheduler – Mesos is a distributed system to build and run distributed systems. ● Microkernel for the datacenter. – Common substrate, or programming abstractions, for creating, or adapting distributed applications.
  • 27. Kernel Apps servicesbatch Frameworks Python JVM C ++ Workloads distributed file system Chronos DFS distributed resources: CPU, RAM, I/O, FS, rack locality, etc. Cluster Storm Kafka JBoss Django RailsSharkImpalaScalding Marathon SparkHadoopMPI MySQL Mesos – architecture
  • 28. Google Refs ● The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines http://research.google.com/pubs/pub35290.html ● 2011 GAFS Omega John Wilkes: http://youtu.be/0ZFMlO98Jkc ● Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper /Schwarzkopf.pdf ● Taming Latency Variability and Scaling Deep Learning https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh QhcDRv Use Cases
  • 29. Case Study: Twitter (bare metal / on premise) “Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It's one of the primary keys to our data center efficiency." Chris Fry, SVP Engineering blog.twitter.com/2013/mesos-graduates-from-apache-incubation wired.com/gadgetlab/2013/11/qa-with-chris-fry/ • key services run in production: analytics, typeahead, ads • Twitter engineers rely on Mesos to build all new services • instead of thinking about static machines, engineers think about resources like CPU, memory and disk • allows services to scale and leverage a shared pool of servers across datacenters efficiently • reduces the time between prototyping and launching
  • 30. Case Study: Airbnb (fungible cloud infrastructure) “We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos." Mike Curtis, VP Engineering gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data... • improves resource management and efficiency • helps advance engineering strategy of building small teams that can move fast • key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop • allowed company to migrate off Elastic MapReduce • enables use of Hadoop along with Chronos, Spark, Storm, etc.
  • 31. Case Study: HubSpot (cluster management) Tom Petr youtu.be/ROn14csiikw • 500 deployable objects; 100 deploys/day to production; 90 engineers; 3 devops on Mesos cluster • “Our QA cluster is now a fixed $10K/month — that used to fluctuate”
  • 32. Google Refs ● The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines http://research.google.com/pubs/pub35290.html ● 2011 GAFS Omega John Wilkes: http://youtu.be/0ZFMlO98Jkc ● Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper /Schwarzkopf.pdf ● Taming Latency Variability and Scaling Deep Learning https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh QhcDRv The Shiny!
  • 33. Dock-ah, dock-ah, dock-ah ● Enumerate hyperbolic awesome-sauce! ● Mesos had plugable add-ons for docker, now they are being 1st classed into the core. ● Google Kubernetes Framework: http://gigaom.com/2014/06/11/why-google-is- sowing-the-seeds-of-container-based- computing/
  • 34. API's & More Plugability ● Always supported Protobuf API's compiled against core libraries. Now native bindings are occurring in the wild. (Go, Python, Java). ● Additional Plug-ins for Containerizers, Isolators, ...
  • 35. Want to know More? Come Join Us @ #MesosCon
  • 36. Google Refs ● The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines http://research.google.com/pubs/pub35290.html ● 2011 GAFS Omega John Wilkes: http://youtu.be/0ZFMlO98Jkc ● Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper /Schwarzkopf.pdf ● Taming Latency Variability and Scaling Deep Learning https://plus.google.com/u/0/+ResearchatGoogle/posts/C1dPh QhcDRv Thank You! mesos.apache.org @timothysc