As Hadoop becomes the de facto big data platform, enterprises deploy HDP across a wide range of physical and virtual environments spanning private and public clouds. This session covers key considerations for cloud deployment and showcases Cloudbreak for simple, consistent deployment across the cloud providers of your choice.
11. Deployment challenges
● Infrastructure is different everywhere
○ e.g. Each cloud provider has their own API
○ e.g. Each provider has different networking methods
● OS/images are different everywhere
● How to do service discovery?
● How to dynamically scale/manage?
See prior operations workshops
13. Options for Automation
- Many combinations of tools
- e.g. Foreman, Ansible, Chef, Puppet, docker-ambari, shell scripts, CloudFormation, …
- Provider specific
- Cisco UCS, Teradata, HP, Google’s bdutil, …
- Docker with Cloudbreak
Using Ambari with all of the above!
32. Requirement: a Docker host
● OSX or Windows: http://boot2docker.io/
○ boot2docker init
○ boot2docker up
○ eval "$(boot2docker shellinit)"
○ boot2docker ssh
● Linux: Install the docker daemon
● Anywhere: docker-machine “lets you create Docker hosts on your computer, on cloud providers, and inside your own data center”
○ Example on Rackspace:
■ docker-machine create --driver rackspace \
    --rackspace-api-key $OS_PASSWORD \
    --rackspace-username $OS_USERNAME \
    --rackspace-region DFW docker-rax
■ docker-machine ssh docker-rax
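Besides `ssh`, docker-machine can point your local docker client at the new host. A minimal sketch of that workflow; the export lines below are illustrative sample output, not captured from a real machine:

```shell
# After `docker-machine create`, you would normally run:
#   eval "$(docker-machine env docker-rax)"
# `docker-machine env` emits export lines like these (values illustrative):
env_output='export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://203.0.113.10:2376"
export DOCKER_MACHINE_NAME="docker-rax"'
eval "$env_output"
# Subsequent `docker ps`, `docker run`, etc. now target the remote daemon.
echo "$DOCKER_HOST"
```

This is the same pattern as `eval "$(boot2docker shellinit)"` above: the tool prints environment exports, and `eval` applies them to the current shell.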
38. 3. Use your Cluster
Ambari available as expected
To reach your Hadoop hosts:
● SSH to Docker Host
○ Hosts are listed in “Cloud stack description”
○ ssh cloudbreak@IPofHost
● Shell into the “ambari-agent” container
○ sudo docker ps | grep ambari-agent
■ note the CONTAINER ID
○ sudo docker exec -it CONTAINERID bash
● Use the hosts as usual. e.g.:
○ hadoop fs -ls /
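The container lookup above can be done in one pipeline instead of copying the CONTAINER ID by hand. A sketch, using a stand-in sample of `docker ps` output (the image name and ID here are made up for illustration):

```shell
# Extract the ambari-agent CONTAINER ID from `docker ps` output.
# Sample output; image name and ID are illustrative placeholders.
ps_output='CONTAINER ID   IMAGE                     COMMAND       NAMES
f3a1b2c4d5e6   sequenceiq/ambari-agent   "/start.sh"   ambari-agent'
cid=$(printf '%s\n' "$ps_output" | grep ambari-agent | awk '{print $1}')
echo "$cid"
# On the Docker host you would then open a shell in the container:
#   sudo docker exec -it "$cid" bash
```

On a real host, replace the sample string with `sudo docker ps` feeding the same `grep | awk` pipeline.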
65. Rackspace
Cloud Big Data Platform
● Rapidly spin up on-demand HDP clusters
● Integrated with Cloud Files (OpenStack Swift)
● Opt-in for Managed Services by Rackspace
Managed Big Data Platform
● Fully Managed HDP on Dedicated and/or Cloud
● Leverage Fanatical Support and industry-leading SLAs
● Supported by Rackspace with escalation to Hortonworks
68. Microsoft Azure
● Deployment
○ Deploy using Cloudbreak
○ Deploy using HWX Azure Gallery Image
● Integrated with Azure Blob Storage
● Supported directly by Hortonworks
● Other offerings
○ Microsoft HDInsight
○ HDP Sandbox
69. Azure Deployment Guideline
● All in same Region
● Instance Types
○ Typical: A7
○ Performance: D14
○ 8 x 1TB Standard LRS (x3 replication) Virtual Hard Disks per server
● Multiple Storage Accounts are recommended
○ Recommend no more than 40 Virtual Hard Disks per Storage Account
70. Azure Blob Store
Azure Blob Store (Object Storage)
● wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
Can be used as a replacement for HDFS
● Thoroughly tested in HDP release test suites
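To make `wasb://` paths resolvable, Hadoop needs the storage account key. A minimal `core-site.xml` fragment following the hadoop-azure property-name convention; the account name `mystorageacct` is a hypothetical placeholder:

```xml
<!-- core-site.xml: hadoop-azure credentials; account name is hypothetical -->
<property>
  <name>fs.azure.account.key.mystorageacct.blob.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>
```

With that in place, `hadoop fs -ls wasb://<containername>@mystorageacct.blob.core.windows.net/` behaves like any other filesystem URI.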
71. Amazon Web Services
● Deploy using Cloudbreak
● Integrated with AWS S3 (object storage)
● Supported directly by Hortonworks
72. Amazon Deployment Guideline
● All in same Region/AZ
● Instances with Enhanced Networking
Master Nodes:
● Choose EBS Optimized
● Boot: 100GB on EBS
● Data: 4+ 1TB on EBS
Worker Nodes:
● Boot: 100GB on EBS
● Data: Instance Storage
○ EBS can be used, but local is preferred
Instance Types:
● Typical: d2 family
● Performance: i2 family
https://aws.amazon.com/ec2/instance-types/
73. AWS RDS
● Some services rely on MySQL, Oracle or PostgreSQL:
○ Apache Ambari
○ Apache Hive
○ Apache Oozie
○ Apache Ranger
● Use RDS for these instead of managing yourself.
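As a sketch, pointing the Hive metastore at an RDS MySQL instance is just JDBC settings in `hive-site.xml`; the endpoint, database name, and credentials below are hypothetical placeholders:

```xml
<!-- hive-site.xml: metastore backed by RDS MySQL (endpoint hypothetical) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hive-db.abc123.us-east-1.rds.amazonaws.com:3306/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>CHANGEME</value>
</property>
```

Ambari, Oozie, and Ranger follow the same pattern: swap the locally managed database host for the RDS endpoint in their respective configs.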
74. AWS S3 (Object Storage)
● s3n:// with HDP 2.2 (Hadoop 2.6)
● s3a:// with HDP 2.3 (Hadoop 2.7)
Not currently a direct replacement for HDFS
Recommended to configure access with IAM Role/Policy
● https://docs.aws.amazon.com/IAM/latest/UserGuide/policies_examples.html#iam-policy-example-s3
● Example: http://git.io/vLoGY
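The linked AWS docs show the full pattern; a minimal IAM policy along those lines, scoped to a hypothetical bucket `my-hdp-data`, might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-hdp-data"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-hdp-data/*"
    }
  ]
}
```

Attaching this via an instance profile (IAM Role) means no access keys need to live in Hadoop configuration files.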
75. Google Cloud
● Deploy using
○ Cloudbreak
○ Google bdutil with Apache Ambari plug-in
● Integrated with Google Cloud Storage
● Supported directly by Hortonworks
76. Google Deployment Guideline
● Instance Types
○ Typical: n1-standard-4 with a single 1.5TB persistent disk
○ Performance: n1-standard-8 with 1TB SSD
● Google GCS (Object Storage)
● gs://<CONFIGBUCKET>/dir/file
● Not currently a replacement for HDFS
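Wiring up `gs://` paths takes the GCS connector jar plus a few `core-site.xml` properties; the project ID below is a hypothetical placeholder:

```xml
<!-- core-site.xml: Google Cloud Storage connector (project ID hypothetical) -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
```

The bdutil deployment path mentioned above sets these up for you; they only need manual attention on hand-built clusters.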
77. S3 & GCS as Secondary storage system
The connectors are currently eventually consistent, so they do not replace HDFS
Backup
● Falcon, DistCp, hadoop fs, HBase ExportSnapshot
● Kafka + Storm bolt sends messages to S3/GCS, providing a backup and point-in-time recovery source
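As a sketch of the DistCp backup path above, the command can be assembled like this (the bucket name is hypothetical, and the final command must run on a cluster node with s3a credentials configured):

```shell
# Build a DistCp backup command from HDFS to S3 (bucket name hypothetical).
src='hdfs:///apps/hive/warehouse'
dst='s3a://my-backup-bucket/hive/warehouse'
cmd="hadoop distcp $src $dst"
echo "$cmd"
# On a cluster node you would then execute:
#   $cmd
```

The same shape works for GCS by swapping the destination scheme to `gs://`.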
Input/Output
● Convenient & broadly used upload/download method
○ As middleware to ease integration with Hadoop and to limit access
● Publishing static content (optionally with CloudFront)
○ Removes need to manage any web services
● Storage for temporary/ephemeral clusters