As Hadoop becomes the de facto big data platform, enterprises deploy HDP across a wide range of physical and virtual environments spanning private and public clouds. This session covers key considerations for cloud deployment and showcases Cloudbreak for simple, consistent deployment across the cloud providers of your choice.
11. Deployment challenges
● Infrastructure is different everywhere
○ e.g. Each cloud provider has their own API
○ e.g. Each provider has different networking methods
● OS/images are different everywhere
● How to do service discovery?
● How to dynamically scale/manage?
See prior operations workshops
13. Options for Automation
- Many combinations of tools
- e.g. Foreman, Ansible, Chef, Puppet, docker-ambari, shell scripts, CloudFormation, …
- Provider specific
- Cisco UCS, Teradata, HP, Google’s bdutil, …
- Docker with Cloudbreak
Using Ambari with all of the above!
32. Requirement: a Docker host
● OSX or Windows: http://boot2docker.io/
○ boot2docker init
○ boot2docker up
○ eval "$(boot2docker shellinit)"
○ boot2docker ssh
● Linux: Install the docker daemon
● Anywhere: docker-machine “lets you create Docker hosts on your computer, on cloud providers, and inside your own data center”
○ Example on Rackspace:
■ docker-machine create --driver rackspace \
    --rackspace-api-key $OS_PASSWORD \
    --rackspace-username $OS_USERNAME \
    --rackspace-region DFW docker-rax
■ docker-machine ssh docker-rax
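Besides `ssh`, docker-machine can point your local docker client at the new host. A minimal sketch of that workflow; the export lines below are illustrative sample output, not captured from a real machine:

```shell
# After `docker-machine create`, you would normally run:
#   eval "$(docker-machine env docker-rax)"
# `docker-machine env` emits export lines like these (values illustrative):
env_output='export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://203.0.113.10:2376"
export DOCKER_MACHINE_NAME="docker-rax"'
eval "$env_output"
# Subsequent `docker ps`, `docker run`, etc. now target the remote daemon.
echo "$DOCKER_HOST"
```

This is the same pattern as `eval "$(boot2docker shellinit)"` above: the tool prints environment exports, and `eval` applies them to the current shell.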
38. 3. Use your Cluster
Ambari available as expected
To reach your Hadoop hosts:
● SSH to Docker Host
○ Hosts are listed in “Cloud stack description”
○ ssh cloudbreak@IPofHost
● Shell into the “ambari-agent” container
○ sudo docker ps | grep ambari-agent
■ note the CONTAINER ID
○ sudo docker exec -it CONTAINERID bash
● Use the hosts as usual. e.g.:
○ hadoop fs -ls /
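The container lookup above can be done in one pipeline instead of copying the CONTAINER ID by hand. A sketch, using a stand-in sample of `docker ps` output (the image name and ID here are made up for illustration):

```shell
# Extract the ambari-agent CONTAINER ID from `docker ps` output.
# Sample output; image name and ID are illustrative placeholders.
ps_output='CONTAINER ID   IMAGE                     COMMAND       NAMES
f3a1b2c4d5e6   sequenceiq/ambari-agent   "/start.sh"   ambari-agent'
cid=$(printf '%s\n' "$ps_output" | grep ambari-agent | awk '{print $1}')
echo "$cid"
# On the Docker host you would then open a shell in the container:
#   sudo docker exec -it "$cid" bash
```

On a real host, replace the sample string with `sudo docker ps` feeding the same `grep | awk` pipeline.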
65. Rackspace
Cloud Big Data Platform
● Rapidly spin up on-demand HDP clusters
● Integrated with Cloud Files (OpenStack Swift)
● Opt-in for Managed Services by Rackspace
Managed Big Data Platform
● Fully Managed HDP on Dedicated and/or Cloud
● Leverage Fanatical Support and industry-leading SLAs
● Supported by Rackspace with escalation to Hortonworks
68. Microsoft Azure
● Deployment
○ Deploy using Cloudbreak
○ Deploy using HWX Azure Gallery Image
● Integrated with Azure Blob Storage
● Supported directly by Hortonworks
● Other offerings
○ Microsoft HDInsight
○ HDP Sandbox
69. Azure Deployment Guideline
● All in same Region
● Instance Types
○ Typical: A7
○ Performance: D14
○ 8 x 1TB Standard LRS (x3 replication) Virtual Hard Disks per server
● Multiple Storage Accounts are recommended
○ Recommend no more than 40 Virtual Hard Disks per Storage Account
70. Azure Blob Store
Azure Blob Store (Object Storage)
● wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
Can be used as a replacement for HDFS
● Thoroughly tested in HDP release test suites
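To make `wasb://` paths resolvable, Hadoop needs the storage account key. A minimal `core-site.xml` fragment following the hadoop-azure property-name convention; the account name `mystorageacct` is a hypothetical placeholder:

```xml
<!-- core-site.xml: hadoop-azure credentials; account name is hypothetical -->
<property>
  <name>fs.azure.account.key.mystorageacct.blob.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>
```

With that in place, `hadoop fs -ls wasb://<containername>@mystorageacct.blob.core.windows.net/` behaves like any other filesystem URI.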
71. Amazon Web Services
● Deploy using Cloudbreak
● Integrated with AWS S3 (object storage)
● Supported directly by Hortonworks
72. Amazon Deployment Guideline
● All in same Region/AZ
● Instances with Enhanced Networking
Master Nodes:
● Choose EBS Optimized
● Boot: 100GB on EBS
● Data: 4+ 1TB on EBS
Worker Nodes:
● Boot: 100GB on EBS
● Data: Instance Storage
○ EBS can be used, but local is preferred
Instance Types:
● Typical: d2 family
● Performance: i2 family
https://aws.amazon.com/ec2/instance-types/
73. AWS RDS
● Some services rely on MySQL, Oracle or PostgreSQL:
○ Apache Ambari
○ Apache Hive
○ Apache Oozie
○ Apache Ranger
● Use RDS for these instead of managing yourself.
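As a sketch, pointing the Hive metastore at an RDS MySQL instance is just JDBC settings in `hive-site.xml`; the endpoint, database name, and credentials below are hypothetical placeholders:

```xml
<!-- hive-site.xml: metastore backed by RDS MySQL (endpoint hypothetical) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hive-db.abc123.us-east-1.rds.amazonaws.com:3306/hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>CHANGEME</value>
</property>
```

Ambari, Oozie, and Ranger follow the same pattern: swap the locally managed database host for the RDS endpoint in their respective configs.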
74. AWS S3 (Object Storage)
● s3n:// with HDP 2.2 (Hadoop 2.6)
● s3a:// with HDP 2.3 (Hadoop 2.7)
Not currently a direct replacement for HDFS
Recommended to configure access with IAM Role/Policy
● https://docs.aws.amazon.com/IAM/latest/UserGuide/policies_examples.html#iam-policy-example-s3
● Example: http://git.io/vLoGY
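The linked AWS docs show the full pattern; a minimal IAM policy along those lines, scoped to a hypothetical bucket `my-hdp-data`, might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-hdp-data"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-hdp-data/*"
    }
  ]
}
```

Attaching this via an instance profile (IAM Role) means no access keys need to live in Hadoop configuration files.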
75. Google Cloud
● Deploy using
○ Cloudbreak
○ Google bdutil with Apache Ambari plug-in
● Integrated with Google Cloud Storage
● Supported directly by Hortonworks
76. Google Deployment Guideline
● Instance Types
○ Typical: n1-standard-4 with a single 1.5TB persistent disk
○ Performance: n1-standard-8 with 1TB SSD
● Google GCS (Object Storage)
● gs://<CONFIGBUCKET>/dir/file
● Not currently a replacement for HDFS
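Wiring up `gs://` paths takes the GCS connector jar plus a few `core-site.xml` properties; the project ID below is a hypothetical placeholder:

```xml
<!-- core-site.xml: Google Cloud Storage connector (project ID hypothetical) -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
```

The bdutil deployment path mentioned above sets these up for you; they only need manual attention on hand-built clusters.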
77. S3 & GCS as Secondary storage system
The connectors are currently eventually consistent, so they do not replace HDFS
Backup
● Falcon, DistCp, hadoop fs, HBase ExportSnapshot
● Kafka + Storm bolt sends messages to S3/GCS, providing a backup and point-in-time recovery source
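As a sketch of the DistCp backup path above, the command can be assembled like this (the bucket name is hypothetical, and the final command must run on a cluster node with s3a credentials configured):

```shell
# Build a DistCp backup command from HDFS to S3 (bucket name hypothetical).
src='hdfs:///apps/hive/warehouse'
dst='s3a://my-backup-bucket/hive/warehouse'
cmd="hadoop distcp $src $dst"
echo "$cmd"
# On a cluster node you would then execute:
#   $cmd
```

The same shape works for GCS by swapping the destination scheme to `gs://`.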
Input/Output
● Convenient & broadly used upload/download method
○ As middleware to ease integration with Hadoop and to limit access
● Publishing static content (optionally with CloudFront)
○ Removes need to manage any web services
● Storage for temporary/ephemeral clusters