Hadoop in the Clouds, Virtualization and Virtual Machines
1. © Hortonworks Inc. 2013
Hadoop in Clouds, Virtualization and Virtual Machines
Page 1
Sanjay Radia
Founder & Architect
Hortonworks Inc
Suresh Srinivas
Founder, Director of Engineering
Hortonworks Inc.
2. © Hortonworks Inc. 2013
Hello
Sanjay Radia
• Founder, Hortonworks
• Part of the Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
– Long-time Apache Hadoop PMC member and Committer
– Designed and developed several key Hadoop features
• Prior
– Data center automation, virtualization, Java, HA, OSs, file systems (startup, Sun Microsystems, …)
– Ph.D., University of Waterloo
Page 2
3. © Hortonworks Inc. 2013
Overview
• Virtualization – a general CS concept
– What resources does Hadoop virtualize?
• Public Clouds
– A spectrum of offerings
• Private Clouds
– VMs and Hadoop
– Test, dev, PoC clusters
– Production clusters
– Guidelines for VMs
Page 3
4. © Hortonworks Inc. 2013
Virtualization – a General CS Concept
Defn: Slice/dice a resource across multiple users so that
– Each user has the impression of a dedicated resource
– CPU, virtual memory, …
– A user has the impression of a much larger resource
– Virtual memory, tiered storage
Examples
• Virtualize the CPU and memory
– Abstractions: a process (with threads) and its virtual memory
• Virtualize storage
– Abstractions: file systems, LUNs, etc.
• Virtualize the entire machine (CPU, memory, devices)
– Abstraction: VMs
• Virtualize Hadoop cluster resources across users and tenants
– Next slide …
Page 4
5. © Hortonworks Inc. 2013
So What About Virtualization in a Hadoop Cluster?
Hadoop lets the OS virtualize each node's hardware resources.
Hadoop instead virtualizes the network and each node's OS resources across the cluster.
Hadoop is a Big-Data "Operating System"
Page 5
7. © Hortonworks Inc. 2013
What About Virtualization in Hadoop (1)
• Hadoop virtualizes compute, storage and network resources across the cluster's tenants
– Abstractions: YARN application containers, HDFS files, …
– YARN virtualizes each node's OS resources across tenants
– The YARN scheduler focuses on CPU and memory today, but at the lowest level the resource request is fairly general – IO and network bandwidth can be added
– YARN supports a diverse set of applications beyond MR: SQL, ETL, …
– Allows sharing the cluster infrastructure
– Slider: run services on a Hadoop cluster
– A YARN container is a process (and soon Linux containers) to provide isolation
– It can be other things, including VMs
– The Capacity Scheduler guarantees capacity across multiple cluster tenants
– Tenants can go beyond guaranteed capacity if it is available, but are subject to preemption
– Limits, queues and sub-queues for flexible management
– Uses OS limits and Linux containers to limit resource usage
Hadoop was designed for multi-tenancy
Page 7
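The Capacity Scheduler behavior on this slide – guaranteed shares, elastic use of idle capacity, and preemption back to the guarantee – can be illustrated with a toy Python model. The `Queue` and `CapacitySim` classes below are hypothetical sketches, not YARN code:

```python
# Toy model of capacity-scheduler behavior (hypothetical, not YARN code):
# each tenant queue has a guaranteed share of the cluster, may borrow idle
# capacity beyond it, and borrowed capacity is reclaimed (preempted) when
# the owning queue needs its guarantee back.

class Queue:
    def __init__(self, name, guaranteed):
        self.name = name
        self.guaranteed = guaranteed  # guaranteed units of cluster capacity
        self.used = 0

class CapacitySim:
    def __init__(self, total, queues):
        self.total = total
        self.queues = {q.name: q for q in queues}

    def free(self):
        return self.total - sum(q.used for q in self.queues.values())

    def allocate(self, name, amount):
        # Grant whatever idle capacity allows; a queue may exceed its
        # guarantee (elastic use), but that excess is preemptible.
        granted = min(amount, self.free())
        self.queues[name].used += granted
        return granted

    def preempt_for(self, name):
        # Reclaim over-guarantee capacity from other queues until `name`
        # can reach its own guarantee.
        q = self.queues[name]
        need = max(0, q.guaranteed - q.used - self.free())
        for other in self.queues.values():
            if need <= 0 or other is q:
                continue
            take = min(max(0, other.used - other.guaranteed), need)
            other.used -= take
            need -= take
```

With a 100-unit cluster and queues guaranteed 60/40, queue B can borrow up to the full cluster while idle, and is preempted back to its 40-unit guarantee when queue A demands its 60.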
8. © Hortonworks Inc. 2013
What About Virtualization in Hadoop (2)
• Hadoop virtualizes compute, storage and network resources across the cluster's tenants
– Abstractions: YARN application containers, HDFS files, …
– YARN virtualizes compute/memory across nodes for tenants
– (covered in the previous slide)
– HDFS
– A shared file system across tenants, with quota support
– Soon:
– Memory for intermediate data, virtualized like virtual memory
– Tiering across storage types
Hadoop was designed for multi-tenancy
Page 8
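The per-tenant quota support mentioned above can be sketched with a toy model. In a real cluster, space quotas are set with `hdfs dfsadmin -setSpaceQuota`; the classes below are hypothetical and only illustrate the accounting, including the fact that HDFS charges the quota for every replica of a block:

```python
# Toy model of HDFS space quotas (hypothetical classes, not HDFS code):
# each tenant directory has a byte quota, and a write that would exceed
# it is rejected. Like HDFS, quota is charged per replica.

class QuotaExceededError(Exception):
    pass

class SharedNamespace:
    def __init__(self):
        self.quota = {}   # tenant dir -> maximum bytes allowed
        self.used = {}    # tenant dir -> bytes currently charged

    def set_space_quota(self, tenant_dir, max_bytes):
        self.quota[tenant_dir] = max_bytes
        self.used.setdefault(tenant_dir, 0)

    def write(self, tenant_dir, nbytes, replication=3):
        # Quota is consumed for every replica, as in HDFS.
        charge = nbytes * replication
        if self.used[tenant_dir] + charge > self.quota[tenant_dir]:
            raise QuotaExceededError(tenant_dir)
        self.used[tenant_dir] += charge
```

A 3 000-byte write under default 3-way replication charges 9 000 bytes against the tenant's quota, so a 10 000-byte quota rejects the next write.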
9. © Hortonworks Inc. 2013
Take-aways for the Rest of the Talk
• Hadoop for the public cloud
– Most offerings are based on VMs due to strong isolation requirements
– The main challenge is your data as clusters are spun up and down
– We describe best practices and what is coming
• Hadoop for the private cloud
– For test and dev clusters
– Both Hadoop on raw machines and on VMs are a good idea
– For production clusters
– Hadoop natively provides excellent elasticity, resource management and multi-tenancy
– If you must use VMs
– Understand your motivations well and configure accordingly
– For VMs, we describe some best practices
Page 9
11. © Hortonworks Inc. 2013. Confidential and Proprietary.
Public Clouds – A Spectrum
A spectrum of offerings, from plain virtual machines at one end to "Hadoop app as a service" at the other:
• Hosted (virtual machines): run and manage your own Hadoop cluster on VMs
– Examples: MS Azure, AWS EC2, Rackspace Cloud
– You spin the cluster up and down; data lives on an external store (S3, Azure storage) across cluster lifecycles
• Deployed Hadoop clusters: cluster deployed on VMs or private hosts, with easy cluster-management and job-management tools
– Examples: MS HDInsight (VMs), Rackspace (VMs or host in a cage), ATT, Altiscale
– Automatic spin up/down; data on an external store (S3, Azure storage) across cluster lifecycles
– In the hosted variants, the cluster and its data persist
• Managed "virtual" clusters that use a shared Hadoop – just submit your jobs/apps
– Example: AWS EMR
– Shared HDFS but isolated namespaces; shared YARN but isolation via containers or VMs
Page 11
12. © Hortonworks Inc. 2013
Cluster Lifecycle and Data Persistence
• For a few offerings: the cluster and HDFS data persist
– Rackspace (hosted and managed private cluster)
– Altiscale
• For most: clusters are spun up and down
– Main challenge: the need to persist data across the cluster lifecycle
– Local data disappears when the cluster is spun down
– External K-V store
– Access remote storage (slow)
– Copy, then process, process, process, spin down (e.g. Netflix)
– How can we do better?
– Interleave a storage cluster across racks (an example later)
– Shadow HDFS that caches the K-V store – great for reading
– Use the K-V store as a lazy 4th HDFS replica – works for writing
Page 12
13. © Hortonworks Inc. 2013
Use of External Storage When the Cluster Is Spun Up/Down
• Shadow-mount external storage (S3) – great for reading
– Cache one or two replicas at mount time
– When a block is missing, fetch it from the source
– Hadoop apps access cached S3 data on local HDFS
• Use S3 to store blocks and the checkpointed image – works for writes
– Lazy write of the 4th replica
– When the cluster is spun down, all blocks and the image must be flushed to S3
Page 13
[Diagram: a DataNode with local HDFS acting as a shadow mount (cache) over S3; the checkpoint image and blocks flow back to S3 via a lazy write of the 4th replica]
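The two patterns on this slide can be sketched as one toy class (hypothetical, not HDFS code): a read-through cache for the shadow mount, plus a set of dirty blocks that must be flushed to S3 before spin-down:

```python
# Toy sketch of the slide's two patterns (hypothetical, not HDFS code):
# a read-through "shadow mount" that caches external blocks in local HDFS,
# and a lazy write-back of a 4th replica so data survives cluster spin-down.

class ShadowHDFS:
    def __init__(self, s3):
        self.s3 = s3          # external K-V store: block key -> bytes
        self.cache = {}       # local HDFS replicas (the "shadow mount")
        self.dirty = set()    # locally written blocks not yet in S3

    def read(self, key):
        # Read-through: serve from local HDFS, fetch from S3 on a miss.
        if key not in self.cache:
            self.cache[key] = self.s3[key]
        return self.cache[key]

    def write(self, key, data):
        # Write locally first; the external (4th-replica) copy is lazy.
        self.cache[key] = data
        self.dirty.add(key)

    def flush(self):
        # Must complete before spin-down: push every unflushed block to S3.
        for key in self.dirty:
            self.s3[key] = self.cache[key]
        self.dirty.clear()
```

Reads populate the local cache from S3 on demand; writes land locally and only reach S3 when `flush()` runs, which is why the slide stresses flushing everything before the cluster is spun down.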
14. © Hortonworks Inc. 2013
An Interesting "Hadoop as a Service" Configuration
• An HDFS cluster for data, external to the compute cluster
– but striped across racks to allow rack-level access
• Persists when the dynamic customer clusters are spun down
• Much better performance than external S3 …
Page 14
[Diagram: a core switch connecting several top-of-rack switches; a storage HDFS cluster that persists when compute clusters are spun down is interleaved with transient Hadoop clusters (VMs spun up and down, with transient local storage)]
16. © Hortonworks Inc. 2013
Virtual Machines and Hadoop
• Traditional motivations for VMs
– Easier to install and manage
– Elasticity – expand and shrink service capacity
– Relies on expensive shared storage so that data can be accessed from any VM
– Infrastructure consolidation and improved utilization
• Some of the VM benefits require shared storage
– But external shared storage (SAN/NAS) does not make sense for Hadoop
• Potential motivations for Hadoop on VMs
– Public clouds mostly use VMs for isolation reasons
– Motivations for VMs in private clouds
– Test/dev clusters-on-the-fly, to allow independent version and cluster management
– Sharing infrastructure with other non-Hadoop VMs
– Production clusters …
– Elasticity?
– Multi-tenancy isolation?
Page 16
17. © Hortonworks Inc. 2013
Virtual Machines and Hadoop
• Elasticity in Hadoop via VMs
– Elastic DataNodes? Not a good idea – lots of data to be copied
– Elastic compute nodes – works, but with some challenges
– Need to allocate a local disk partition for shuffle data
– A VM per tenant will fragment your local storage
- This can be fixed by moving tmp storage to HDFS – will be enabled in the future
– The local-read optimization (short-circuit read) is not available to VMs
• Resource management & isolation are excellent in Hadoop
– YARN has a flexible resource model (CPU, memory, others coming)
– It manages resources across the nodes
– YARN's Capacity Scheduler gives capacity guarantees
– Isolation is supported via Linux containers
– Docker helps with image management
Page 17
18. © Hortonworks Inc. 2013
Guidelines for VMs for Hadoop
• VMs for test, dev, PoC and staging clusters work well
– Can share hosts with non-Hadoop VMs
– For such use cases, having independent clusters is fine
– They do not hold lots of data, so storage fragmentation is less of a concern
– Independent clusters allow each to have a different Hadoop version
– Some customers: a native shared Hadoop cluster for testing apps
• Production Hadoop
– VMs are generally less attractive
– But if you do use them, consider your motivation and configure accordingly (next slide)
– Hadoop has built-in
– elasticity for CPU and storage
– resource management
– capacity guarantees for multi-tenancy
– isolation for multi-tenancy
Page 18
19. © Hortonworks Inc. 2013
Configurations for Hadoop VMs
• Model 1 – the base cluster has data + compute (a DN and a CN) on each VM – is generally better
– ComputeNode/DataNode short-circuit optimization
– Less storage fragmentation (intermediate shuffle data)
– Shared cluster: Hadoop has excellent tenant/resource isolation
• Model 2 – a base HDFS cluster with a DataNode on each VM, plus per-tenant compute clusters (CN …) that expand compute as needed
– Use Model 2 if you need additional tenant isolation but shared data
• Share the cluster unless you need a per-tenant Hadoop version or more isolation
Page 20
20. © Hortonworks Inc. 2013
Summary
• Hadoop virtualizes the cluster's HW resources across tenants
• Hadoop for the public cloud
– Most offerings are based on VMs due to the requirement for strong, complete isolation
– Altiscale and Rackspace offer additional models
– The main challenge is data persistence as clusters are spun up/down
– We described best practices and what is coming
• Hadoop for the private cloud
– For test, dev, PoC clusters
– Both Hadoop on raw machines and on VMs are a good idea
– VMs allow "clusters-on-the-fly" plus the ability to control the version for each user
– For production clusters
– Hadoop natively provides excellent elasticity, resource management and multi-tenancy
– If you must use VMs
– Understand your motivations well and configure accordingly
– For VMs, follow the best practices we provided
Page 21
Speaker notes:
First let us take a look at the term virtualization. Virtualization is a general CS term that predates the term virtual machine by decades.
A thing to remember is that Hadoop moves the computation to the data.
Short-circuit for local access directly from the OS.
Now let's turn to HDFS.
VMs on one extreme, and a Hadoop app service such as EMR on the other end.
All except Amazon run HDP.
Interleaved HDFS cluster: we have one customer interleave an HDFS cluster across compute racks.
VMs have been important for reducing the TCO of traditional enterprise processing.
YARN's underlying resource model is very flexible and has an extensible abstraction for different resource types: first memory, now CPU and different kinds of memory, next IO and network bandwidths.
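The extensible resource model described in these notes can be sketched as a toy resource vector (a hypothetical class, not YARN's actual `Resource` API): a request fits on a node only if every dimension it names fits, and a new resource type is just another key rather than a scheduler rewrite:

```python
# Toy extensible resource vector (hypothetical, not YARN's Resource API).
# A request names arbitrary resource types; a node satisfies it only if
# every requested dimension fits. Adding a new type (IO, network
# bandwidth) is just another key.

class ResourceVector:
    def __init__(self, **resources):
        self.r = dict(resources)  # e.g. memory_mb=4096, vcores=2

    def fits_in(self, other):
        # Fits only if every dimension we ask for is available in `other`.
        return all(other.r.get(k, 0) >= v for k, v in self.r.items())

    def subtract(self, request):
        # Reserve the request out of this node's free capacity.
        for k, v in request.r.items():
            self.r[k] = self.r.get(k, 0) - v
```

A node that does not advertise a dimension (say, `network_mbps`) simply cannot satisfy a request that names it, which is how new resource types slot in incrementally.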