This presentation describes how Hortonworks is delivering Hadoop on Docker for a cloud-agnostic deployment approach, as presented at Cisco Live 2015.
9. Cloudbreak
• Developed by SequenceIQ
• Open source under the Apache 2.0 license [Apache project soon]
• Deploys selected services to public and private clouds via Ambari Blueprints (see the Blueprint sketch below)
• Elastic – can spin up any number of nodes and add/remove them on the fly
• Provides full cloud lifecycle management post-deployment
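To make the Blueprint-driven approach concrete, here is a minimal sketch of an Ambari Blueprint registered with Ambari's REST API from Python. The blueprint name, host-group layout, component mix, and credentials are illustrative assumptions, not the exact blueprints that ship with Cloudbreak.

```python
# Minimal illustrative Ambari Blueprint: one master host group and one worker
# host group for a small HDFS + YARN cluster. Names and sizes are assumptions.
import requests

blueprint = {
    "Blueprints": {"blueprint_name": "multinode-hdfs-yarn",
                   "stack_name": "HDP", "stack_version": "2.2"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"},
                        {"name": "HISTORYSERVER"}, {"name": "ZOOKEEPER_SERVER"}]},
        {"name": "worker", "cardinality": "3",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}

# Register the blueprint with an Ambari server (endpoint and credentials assumed).
requests.post("http://ambari.example.com:8080/api/v1/blueprints/multinode-hdfs-yarn",
              json=blueprint, auth=("admin", "admin"),
              headers={"X-Requested-By": "cloudbreak-demo"})
```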
10. Launch HDP on Any Cloud for Any Application
• BI / Analytics (Hive)
• IoT Apps (Storm, HBase, Hive)
• Dev / Test (all HDP services)
• Data Science (Spark)
Cloudbreak:
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!
Example Ambari Blueprints: IoT Apps, BI / Analytics, Data Science, Dev / Test
11. Hadoop in Cloud Provisioning with Cloudbreak
Create Templates → Provide Blueprint → Associate Credentials → Launch Cluster
(a sketch of these four steps follows below)
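A rough Python sketch of those four steps driven against a Cloudbreak-style REST API. The endpoint paths, payload fields, hostnames, and authentication shown here are assumptions for illustration only; the actual Cloudbreak API differs in detail.

```python
# Illustrative only: walks the template -> blueprint -> credential -> cluster flow.
# Endpoint paths and payload fields below are assumptions, not the real API.
import requests

CB = "https://cloudbreak.example.com"        # assumed Cloudbreak endpoint
auth = {"Authorization": "Bearer <token>"}   # assumed access token

# 1. Create a template: the infrastructure definition (instance type, volumes).
requests.post(f"{CB}/templates", headers=auth,
              json={"name": "worker-template", "instanceType": "n1-standard-8",
                    "volumeCount": 2, "volumeSize": 400})

# 2. Provide a blueprint: the Ambari Blueprint describing the HDP services.
requests.post(f"{CB}/blueprints", headers=auth,
              json={"name": "multinode-hdfs-yarn",
                    "ambariBlueprint": "<blueprint JSON>"})

# 3. Associate credentials for the target cloud (OpenStack, AWS, GCP, ...).
requests.post(f"{CB}/credentials", headers=auth,
              json={"name": "openstack-cred", "cloudPlatform": "OPENSTACK",
                    "parameters": {"userName": "...", "tenantName": "..."}})

# 4. Launch the cluster by tying the three together.
requests.post(f"{CB}/stacks", headers=auth,
              json={"name": "demo-cluster", "credential": "openstack-cred",
                    "blueprint": "multinode-hdfs-yarn",
                    "template": "worker-template", "nodeCount": 10})
```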
17. Optimize cloud usage via Elastic Clusters
Workloads: BI / Analytics (Hive), IoT Apps (Storm, HBase, Hive), Dev / Test (all HDP services), Data Science (Spark)
Autoscaling Policy
• Policies based on any Ambari metric, or on time
• Coordinates with YARN
• Scaling can be service- or component-type specific
(a minimal policy-evaluation sketch follows below)
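A minimal sketch of how a metric-based upscale/downscale policy like the one above could be evaluated. The metric path, thresholds, polling interval, and the scale_cluster helper are hypothetical; Ambari does expose service metrics over its REST API, but the exact field names vary.

```python
# Hypothetical policy loop: read a YARN metric from Ambari and decide whether
# to add or remove worker nodes. Thresholds and helpers are assumptions.
import time
import requests

AMBARI = "http://ambari.example.com:8080/api/v1/clusters/demo-cluster"

def pending_yarn_containers():
    # Ambari exposes service metrics over REST; this metric path is assumed.
    r = requests.get(f"{AMBARI}/services/YARN/components/RESOURCEMANAGER",
                     params={"fields": "metrics/yarn/Queue/root/PendingContainers"},
                     auth=("admin", "admin"))
    return r.json()["metrics"]["yarn"]["Queue"]["root"]["PendingContainers"]

def scale_cluster(delta):
    # Placeholder for a Cloudbreak scaling call (e.g. resize the worker host group).
    print(f"scaling worker host group by {delta} nodes")

while True:
    pending = pending_yarn_containers()
    if pending > 100:        # upscale threshold (assumed)
        scale_cluster(+2)
    elif pending == 0:       # downscale when the cluster is idle (assumed)
        scale_cluster(-1)
    time.sleep(300)          # evaluate the policy every 5 minutes
```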
19. Provisioning – How it works
1. Start VMs with a running Docker daemon
2. Cloudbreak bootstrap: start a Consul cluster, then a Swarm cluster (using Consul for discovery)
3. Start Ambari servers/agents through the Swarm API
4. Ambari services registered in Consul (via Registrator)
5. Post the Blueprint
(an illustrative command sequence follows below)
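For orientation, a Python sketch of the kind of commands this bootstrap boils down to on a freshly started VM. The image names (progrium/consul, swarm, gliderlabs/registrator, sequenceiq/ambari), flags, and addresses reflect the Swarm/Consul tooling of that era and are assumptions about the setup, not the exact commands Cloudbreak runs.

```python
# Illustrative bootstrap sequence on one VM that already runs a Docker daemon.
# Image names, flags, and addresses are assumptions.
import subprocess

def docker(*args):
    subprocess.run(["docker", "run", "-d", *args], check=True)

CONSUL_IP = "10.0.0.10"   # assumed address of the first node

# 1. Start a Consul server for service discovery.
docker("--name", "consul", "--net", "host", "progrium/consul",
       "-server", "-bootstrap", "-advertise", CONSUL_IP)

# 2. Start a Swarm (v1) manager that discovers cluster nodes through Consul.
docker("--name", "swarm-manager", "-p", "3376:3376", "swarm",
       "manage", "-H", "tcp://0.0.0.0:3376", f"consul://{CONSUL_IP}:8500")

# 3. Registrator watches the Docker daemon and registers containers in Consul.
docker("--name", "registrator", "-v", "/var/run/docker.sock:/tmp/docker.sock",
       "gliderlabs/registrator", f"consul://{CONSUL_IP}:8500")

# 4. Ambari server and agents are then started as containers through the Swarm
#    API (image name assumed), and the Blueprint is posted to Ambari.
docker("--name", "ambari-server", "sequenceiq/ambari")
```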
21. Docker is a "Shipping Container" System for Code
[Matrix diagram: a multiplicity of stacks (static website, web frontend, user DB, queue, analytics DB) deployed across a multiplicity of hardware environments (development VM, QA server, public cloud, contributor's laptop, production cluster, customer data center).]
An engine that enables any payload to be encapsulated as a lightweight, portable, self-sufficient container.
22. Docker
• Lightweight, portable
• Build once, run anywhere
• A VM – without the overhead of a VM
• Isolated containers
• Automated and scripted
23. Why Is Docker So Exciting?
For Developers: Build once…run anywhere
• A clean, safe, and portable runtime environment for your app
• No missing dependencies, packages, etc.
• Run each app in its own isolated container
• Automate testing, integration, packaging
• Reduce/eliminate concerns about compatibility on different platforms
• Cheap, zero-penalty containers to deploy services
For DevOps: Configure once…run anything
• Make the entire lifecycle more efficient, consistent, and repeatable
• Eliminate inconsistencies between SDLC stages
• Support segregation of duties
• Significantly improve the speed and reliability of CI/CD
• Significantly more lightweight than VMs
24. Docker: Containers vs. VMs
[Diagram: a VM stack (Server → Host OS → Type 2 Hypervisor → per-app Guest OS + Bins/Libs) compared with Docker (Server → Host OS kernel → containers, each holding only an app's bins/libs).]
Containers are isolated and share only the kernel. The result is significantly faster deployment, much less overhead, easier migration, and faster restart.
26. HDP as Docker Containers via Cloudbreak – Run Hadoop as Docker Containers
• Running the Ambari cluster in containers
• Use a Blueprint to define services
• All HDP services share a single container
Flow: Cloudbreak provisions VMs from cloud providers (or bare metal), installs Ambari in Docker on each VM, then instructs Ambari to build the HDP cluster (see the sketch below).
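Once Ambari is running in its container, "instructing Ambari to build the HDP cluster" amounts to two REST calls: register the Blueprint, then create a cluster from it with a host-group-to-host mapping. A minimal sketch, with the cluster name, host names, and credentials assumed:

```python
# Minimal sketch of building an HDP cluster through Ambari's Blueprint API.
# Host names, cluster name, and credentials are assumptions for illustration.
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"
session = requests.Session()
session.auth = ("admin", "admin")
session.headers["X-Requested-By"] = "cloudbreak-demo"

# The blueprint itself was registered earlier (see the Blueprint sketch above);
# here we map concrete hosts onto its host groups and create the cluster.
cluster_template = {
    "blueprint": "multinode-hdfs-yarn",
    "host_groups": [
        {"name": "master", "hosts": [{"fqdn": "node1.example.com"}]},
        {"name": "worker", "hosts": [{"fqdn": f"node{i}.example.com"}
                                     for i in range(2, 5)]},
    ],
}
session.post(f"{AMBARI}/clusters/demo-cluster", json=cluster_template)
```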
31. Benefits of running Hadoop on Docker
• Quick installation with pre-pulled RPMs
• Same process/images for dev/qa/prod
• Same process for single-node and multi-node
43. Cisco and Hortonworks' Partnership
100% open source Hadoop distribution, support, and training
Integrated infrastructures for Big Data
Cisco and Hortonworks are partnering to help you build your big data solution and reach massive scalability, superior efficiency, and dramatically lower total cost of ownership, thanks to a validated joint architecture.
44. Results of the collaboration
• Efficient Hadoop as a service
• Adoption of Docker for enterprise Hadoop deployment

Task                                  Cisco InterCloud   Public Cloud Provider
HDP installation                      15:04 mins         11:55 mins
Teragen (avg of 3 executions)         7:08 mins          22:15 mins
Terasort (avg of 3 executions)        32:09 mins         60:12 mins
Teravalidate (avg of 3 executions)    2:31 mins          10:40 mins
(the benchmark commands are sketched below)
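For reference, the TeraGen/TeraSort/TeraValidate runs above correspond to the standard MapReduce examples. A sketch of invoking them follows; the row count, HDFS paths, and examples-jar location are assumptions (the jar path varies by HDP version), not the exact parameters used in the benchmark.

```python
# Sketch of the standard TeraSort benchmark sequence via the hadoop CLI.
# The row count, HDFS paths, and jar location are assumptions.
import subprocess

EXAMPLES_JAR = "/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar"
ROWS = 10_000_000_000   # 100-byte rows, roughly 1 TB (assumed data size)

def run(*args):
    subprocess.run(["hadoop", "jar", EXAMPLES_JAR, *args], check=True)

run("teragen", str(ROWS), "/benchmarks/terasort-input")                        # generate data
run("terasort", "/benchmarks/terasort-input", "/benchmarks/terasort-output")   # sort it
run("teravalidate", "/benchmarks/terasort-output", "/benchmarks/terasort-validate")  # verify
```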
45. Observations / Future Collaboration
Observations:
• Docker is maturing inside enterprises
• Interest in running Docker on top of bare metal
• Big data app developers are leaning towards containerization of apps
• YARN is becoming an application deployment platform beyond big data apps
• Demand for natively containerized, fully managed apps on YARN
Future collaboration:
• Run Docker natively on OpenStack
• Run Docker on YARN
• OpenStack bare metal
47. Learn More
Download the Hortonworks Sandbox
Learn Hadoop
Build Your Analytic App
Try Hadoop 2
More about Cisco & Hortonworks
http://hortonworks.com/partner/cisco/
More about Hortonworks’ Acquisition of SequenceIQ
http://bit.ly/1R1ktxO
Editor's Notes
Deploying Hadoop on OpenStack has never been easier, and the Hortonworks and Cisco collaboration over the last few months has made it completely automated and seamless.
This is a cautionary statement, as this presentation may contain product and collaboration directions that are subject to change.
We were founded in 2011 by 24 developers from Yahoo where Hadoop was conceived to address data challenges at internet scale. What we now know of as Hadoop really started in 2005, when a team at Yahoo was directed to build out a large-scale data storage and processing technology that would allow them to improve their most critical application, Search.
Their challenge was essentially two-fold. First they needed to capture and archive the contents of the internet, and then process the data so that users could search through it effectively and efficiently. Clearly traditional approaches were both technically (due to the size of the data) and commercially (due to the cost) impractical. The result was the Apache Hadoop project that delivered large-scale storage (HDFS) and processing (MapReduce).
Today we are over 600 employees and have partnered with over 1000 companies who are the leaders in the data center.
We have also been very fortunate to achieve very significant customer adoption with over 330 customers as of the end of 2014, spanning nearly every vertical.
Hortonworks was founded with the sole intent of making Hadoop an enterprise data platform. With YARN as its foundation, HDP delivers a centralized architecture with true multi-tenancy for data processing and shared services for Security, Governance and Operations to satisfy enterprise requirements, all deeply integrated and certified with leading datacenter technologies.
We are uniquely focused on this transformation of Hadoop and doing our work completely in open source. This is all predicated on our leadership in the community, which enables us not only to best support users but also to uniquely represent customer requirements within this open, thriving community.
Hortonworks approach is quite clear… we are focused on delivery of enterprise grade Hadoop as a reliable data platform that will enable your transition to a modern data architecture. To this end, we work solely within the broad open source community with a focus on innovation at the core of Apache Hadoop with YARN as a foundation and then within all the related projects that deliver on the key requirements for the enterprise such as governance, security and operation.
Since our inception just three years ago, we have grown to more than 450 employees and have partnered closely with the leaders in the datacenter, all of whom share this vision: to enable a modern data architecture with Hadoop in order to allow their customers to address the architectural challenge that they all are facing due to exploding data volumes.
Hortonworks Open platform approach enables us to partner and co-exist with other data center technologies. Our deep engineering relationship with data center leaders like Cisco makes it possible for customers to augment their data center with Hadoop technologies for their next generation modern data architecture.
Hortonworks' Hadoop platform had already enabled deploying Hadoop in any environment – from Linux to Windows, bare metal to cloud – so that the Hadoop deployment environment can be a business decision rather than a technical one. Continuing that Hadoop Everywhere vision, Hortonworks' recent acquisition of SequenceIQ added a provisioning and auto-scaling toolset that makes it even easier to deploy Hadoop in private and public clouds, accelerating the time-to-value for Hadoop deployments.
Cloudbreak was developed by SequenceIQ, a company from the beautiful city of Budapest; Hortonworks acquired them in April.
Cloudbreak is open source under the Apache 2.0 license and uses many other open source technologies as building blocks, including Docker.
It is a Hadoop cluster deployment and management tool that can deploy any app- or use-case-specific Hadoop cluster to public and private cloud environments in a matter of minutes.
It also provides ongoing cluster infrastructure management, including policy-based auto-scaling of clusters to optimize infrastructure usage.
Cloudbreak enables launching a Hadoop cluster in 4 easy steps.
Creating a template captures your Hadoop cluster infrastructure definition – node size, network setup.
Cloudbreak supports heterogeneous instances for building the Hadoop cluster, since not all services or service components have the same resource requirements.
Cloudbreak not only simplifies Hadoop cluster provisioning in the Cisco OpenStack cloud but also automatically scales Hadoop clusters based on SLA- or time-based policies. The SLA is monitored through Hadoop service metrics captured by Ambari. This way Cloudbreak gives you an elastic Hadoop cluster very quickly in the Cisco OpenStack cloud.
Cloudbreak actively monitors Ambari metrics to assess the health of every Hadoop service. It allows defining policies based on these metrics for every cluster deployed and enabled for auto-scaling. Based on these metrics and user-defined policies, Cloudbreak can scale clusters or services by adding nodes or allocating more YARN containers, depending on the type of Hadoop service.
View from 10,000 feet.
The only thing it needs is a Docker daemon. All cloud providers are moving towards Docker, including Cisco Intercloud.
Quick question – how many of you have used Docker before?
Docker is a container-based virtualization framework. It is an open platform for developers and admins to build, ship, and run distributed applications. Consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. Docker is a lightweight, portable VM, but without the overhead of a VM.
Unlike traditional virtualization Docker is fast, lightweight and easy to use. Docker allows you to create containers holding all the dependencies for an application. Each container is kept isolated from any other, and nothing gets shared.
Steps:
Can spin up Docker containers remotely on hosts, considering:
1. Resource management – aware of the cluster resources (e.g. can schedule with bin packing, anywhere 1 GB of memory is available) or randomly
2. Constraints using labels (label a node and start the container based on labels)
3. Affinity – containers can be co-scheduled (link, volumes-from, net=container on the same host)
(see the sketch below)
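To illustrate items 2 and 3: Docker Swarm (v1) expressed constraints and affinities as environment variables on docker run against the Swarm manager endpoint. A sketch, with the node label, manager address, and container names assumed (bin packing from item 1 is a scheduler strategy chosen when the Swarm manager is started):

```python
# Sketch of Swarm (v1) scheduling hints, expressed as docker run environment
# variables sent to a Swarm manager. Labels, names, and addresses are assumptions.
import subprocess

SWARM = ["docker", "-H", "tcp://swarm-manager:3376", "run", "-d"]

# Constraint: only schedule on nodes carrying an assumed label storage=ssd.
subprocess.run(SWARM + ["-e", "constraint:storage==ssd", "sequenceiq/ambari"],
               check=True)

# Affinity: co-schedule this container on the same host as "ambari-server".
subprocess.run(SWARM + ["-e", "affinity:container==ambari-server",
                        "busybox", "sleep", "60"],
               check=True)
```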
The best of Hadoop, Docker, and OpenStack in a single cloud platform for our joint customers.
Benchmark environment:
Description        Texas 3              GCP
VM types           GP2-2Xlarge          n1-standard-8
Cores              8                    8
Memory             32 GB                30 GB
Volume size        2 x 400 GB           2 x 400 GB
Volume type        HDD (magnetic)       generic (magnetic)
Data node count    10                   10
HDFS size          8 TB                 8 TB
YARN memory        240 GB               240 GB
HDP blueprint      multinode-hdfs-yarn
We are expanding our Cloud strategy to meet Enterprise customer demand.
Look at the top first. We’ve done a great job of taking our platform for Private Cloud and provisioning Enterprise workloads. We’ve done a great job with UCS, with VBlock, with FlexPod. As a matter of fact, we are the leader in converged infrastructure today, and that market is expanding as customers look to Cisco and our Partners to deliver the Enterprise workloads and the benefits of Private Cloud. They’re also asking for Dev/Ops models. They want to create truly native applications for the Public Cloud. They want to harness the value of Hadoop and Big Data Analytics and Hana. And they want to leverage the collaborative platform present today. We are the leader in Private Cloud infrastructure.
Along the left-hand side, our Partners have done some amazing things. 3 million seats of HCS, the IaaS platforms that they’ve invested in, small, medium, large, local community-based infrastructure platforms. Some Partners have enabled the PaaS platform. Some Partners are hosting Microsoft applications, like Dimension Data does today…globally around the world. Some Partners have managed to build a Citrix or VMware virtual desktop offer.
So what Cisco Cloud Services offer is an engine to generate more services to augment capabilities we’ve invested in, and to do so in a way that only we could do together. You’ll see us leverage the extensions through innovations in the WebEx platform. You’ll see that Meraki is a very powerful model to continue to expand. You’ll hear more about the portfolio of Unified Threat Defense, and comprehensive threat defense that we think only we can bring to the cloud.
You’ll see more about analytics, and the Platforms that we have in store. You’ll soon see more about Hana-as-a-Service. And all the capabilities we can bring will be an acceleration of those offers that we can bring to you. Why not accelerate all of our capabilities together, using our capabilities in a way that no one else has. And by the way, we can’t ignore the big Public Clouds. Let’s use the Intercloud Fabric manager when appropriate to just move a workload out to that Public Cloud. I don’t care if it’s Azure, or Amazon or Google. Only Cisco can do this through some of the innovations that we have.
How are we going to do this?