SlideShare a Scribd company logo
1 of 21
Š Hortonworks Inc. 2013
Hadoop in Clouds, Virtualization and Virtual
Machines
Page 1
Sanjay Radia
Founder & Architect
Hortonworks Inc
Suresh Srinivas
Founder, Director of Engineering
Hortonworks Inc.
Š Hortonworks Inc. 2013
Hello
Sanjay Radia
• Founder, Hortonworks
• Part of the Hadoop team at Yahoo! since 2007
–Chief Architect of Hadoop Core at Yahoo!
–Long time Apache Hadoop PMC and Committer
–Designed and developed several key Hadoop features
• Prior
–Data center automation, virtualization, Java, HA, OSs, File
Systems (Startup, Sun Microsystems, …)
–Ph.D., University of Waterloo
Page 2
Š Hortonworks Inc. 2013
Overview
• Virtualization – A General CS concept
–What resources does Hadoop virtualize?
• Public Clouds
–A spectrum of offerings
• Private Clouds
–VMs and Hadoop
–Test, dev, Poc clusters
–Production clusters
–Guideline for VMs
Page 3
Š Hortonworks Inc. 2013
Virtualization – a General CS Concept
Defn: Slice/dice a resource across multiple users so that
– Each user has an impression of a dedicated resource
– CPU, Virtual memory, …
– A user has the impression of a much larger resource
– Virtual memory, tiered storage
Examples
• Virtualize the CPU, virtual memory
– Abstractions: process (threads) and its virtual memory
• Virtualize Storage
– Abstractions: file system, Luns, etc
• Virtualize the entire machine (CPU, memory, devices)
– Abstraction: VMs
• Virtualize Hadoop cluster resources across users and tenants
– Next slide …
Page 4
Š Hortonworks Inc. 2013
So what about Virtualization
in a Hadoop cluster?
Hadoop lets the OS virtualize the hardware resources.
Hadoop instead virtualizes the network and each
node’s OS resources across the cluster
5
Hadoop is Big-Data “Operating System”
Š Hortonworks Inc. 2013
Hadoop’s Architecture
Page 6
NameNode RM
File System Namespace
Block Management
Compute Resource Management
Task placement
Tasks
DISC 1 DISC 2 DISC N
DataNode
TaskManager
Block delete
Replicate Heartbeat
BlockReport
Heartbeat
Client
Read/Write Data
File create,
delete,
modify
BlocksBlocksBlocks
Slave Node
Task
Management
Slave Node
Tens to Thousands
Slave Nodes
Š Hortonworks Inc. 2013
What about Virtualization in Hadoop (1)
• Hadoop virtualize compute, storage and network resources
across the cluster tenants
– Abstractions: Yarn application container, HDFS files, …
– Yarn virtualizes each node’s OS resources across tenants
– Yarn scheduler focuses on CPU and memory today but at lowest level the
resource request is fairly general – IO, network bandwidth can be added
– Yarn supports a diverse set of applications beyond, MR, SQL, ETL, …
– Allows sharing the cluster infrastructure
– Slider: run services on Hadoop cluster
– Yarn container is a process (and soon Linux containers) to provide isolation
– can be other things including VMs
– Capacity scheduler guarantees capacity across multiple cluster tenants
– Can go beyond guaranteed capacity if available, but subject to preemption
– Limits, Queues and sub-queues for flexible management
– Uses OS limits, and Linux containers to limit resource usage
Hadoop was designed for multi-tenancy
Page 7
Š Hortonworks Inc. 2013
What about Virtualization in Hadoop (2)
• Hadoop virtualize compute, storage and network resources
across the cluster tenants
– Abstraction: Yarn application container, HDFS files, …
– Yarn virtualizes compute/memory across nodes for tenants
– (covered in previous slide)
–HDFS
– A shared file system across tenants with quota support
– Soon:
– memory for intermediate data that is virtualized like virtual memory
– Tiering across storage types
Hadoop was designed for multi-tenancy
Page 8
Š Hortonworks Inc. 2013
Take Away for rest of the talk
• Hadoop for the public cloud
–Most based on VMs due to strong isolation requirement
–Main challenge is your data as clusters are spun up and down
– We describe best practices and what is coming
• Hadoop for the private cloud
–For test and dev clusters
– Both Hadoop on raw machines and VMs are a good idea
–For production clusters
– Hadoop natively provides excellent elasticity, resource management and multi-
tenancy
– If you must use VMs
– Understand your motivations well and configure accordingly
–For VMs, we describe some best practices
Page 9
Š Hortonworks Inc. 2013
Public Clouds
Page 10
Š Hortonworks Inc. 2012Š Hortonworks Inc. 2013. Confidential and Proprietary.
Public Clouds – A Spectrum
Page 11
Virtual
Machines
Hadoop App
as a Service
Run/Manage
Your
Hadoop Cluster
On VMs
Just submit
your job/app
Hosted Deployed Managed
Hadoop Clusters
Managed “virtual” cluster that
uses a shared Hadoop.
Just submit jobs/apps
MS Azure
AWS EC2
Rackspace Cloud
AWS EMR
Cluster deployed on
VMs or private hosts
with easy
Cluster Management &
Job Management tools
MS HDInsight (VMs)
Rackspace
(VMs or Host in cage)
ATT
Altiscale
You Spin up/down cluster
Data on external store
(S3, Azure storage) across
Cluster lifecycles
Automatic Spin up/down
Data on external store
(S3 Azure storage) across
Cluster lifecycles,
Shared HDFS but
Isolated Namespaces
Shared Yarn but isolation via
Containers or VMs
Hosted:
Cluster/Data
persists
Š Hortonworks Inc. 2013
Cluster Lifecycle and Data Persistence
• For a few: the Cluster and HDFS data persists
–Rackspace (Hosted and managed private cluster)
–Altiscale
• For most: clusters spun up and down
–Main challenge: need to persist data across cluster lifecycle
– Local data disappears when cluster is spun down
–External K-V store
– Access remote storage (slow)
– Copy, then process, process, process, spin down (e.g. Netflix)
–How can we do better?
– Interleaved storage cluster across rack (an example later)
– Shadow HDFS that caches K-V Store – great for reading
– Use the K-V store as 4th lazy HDFS replica – works for writing
Page 12
Š Hortonworks Inc. 2013
Use of External Storage
when cluster is Spinup/Spindown
• Shadow mount external storage (S3) – great for reading
– Cache one or two replicas at mount time
– when missing, fetch from source
– Hadoop apps access cached S3 data on local HDFS
• Use S3 to store blocks and check-pointed image – works for writes
– Write lazy of 4th replica
– When cluster is spun down, all block and image must be flushed to S3
Page 13
DataNode
Shadow Mount
(Cache)
S3S3
Checkpoint
Image
+Blocks
Lazy write
Of 4th replica
Local
HDFS
Š Hortonworks Inc. 2013
An Interesting “Hadoop as service”
configuration
• HDFS Cluster for data external to the compute cluster
– but stripped to allow rack level access
• Persists when the dynamic customer clusters are spun-down
• Much better performance than external S3…
Page 14
Core
Switch
`
Switch
……
`
Switch
……
`
Switch
……
Storage HDFS cluster
- Stores data that persists
when compute clusters are
spun down
Transient Hadoop cluster
- Clusters (VMs) are spun up
and down
- Have transient local
storage
Š Hortonworks Inc. 2013
VMs and Hadoop in
Private Clouds
Page 15
Š Hortonworks Inc. 2013
Virtual Machines and Hadoop
• Traditional Motivation for VMs
– Easier to install and manage
– Elasticity – expand and shrink service capacity
– Relies on expensive shared storage so that data can be accessed from any VM
– Infrastructure consolidation and improved utilization
• Some of the VM benefits require shared storage
– But external shared storage (SAN/NAS) does not make sense for Hadoop
• Potential Motivation for Hadoop on VMs
– Public clouds mostly use VMs for isolation reasons
– Motivation for VMs for private clouds
– Test, dev clusters-on-fly to allow independent version and cluster management
– Share infrastructure with other non-Hadoop VMs
– Production clusters …
– Elasticity ?
– Multi-tenancy isolation?
Page 16
Š Hortonworks Inc. 2013
Virtual Machines and Hadoop
• Elasticity in Hadoop via VMs
–Elastic DataNodes? - Not a good idea –lots of data to be copied
–Elastic Compute Nodes – works but some challenges
– Need to allocate local disk partition for shuffle data
– VM per-tenant will fragment your local storage
- This can be fixed by moving tmp storage to HDFS –will be enabled in the future
– Local read optimization (short circuit read) is not available to VM
• Resource management & Isolation excellent in Hadoop
–Yarn has flexible resource model (CPU, Memory, others coming)
– It manages across the nodes
–Yarn’s Capacity scheduler gives capacity guarantees
–Isolation supported via Linux Containers
–Docker helps with image management
Page 17
Š Hortonworks Inc. 2013
Guideline for VMs for Hadoop
• VMs for Test, Dev, Poc, Staging Clusters works well
–Can share Host with non-Hadoop VMs
–For such use cases having independent clusters is fine
– Do not have lots of data or storage fragmentation is less concern
– Independent clusters allow each to have a different Hadoop version
–Some customers: a native shared Hadoop cluster for testing apps
• Production Hadoop
– VMs generally less attractive
But if you do, consider motivation, and configure accordingly (next slide)
–Hadoop has built-in
– elasticity for CPU and Storage
– resource management
– capacity guarantees for multi-tenancy
– isolation for multi-tenancy
Page 18
Š Hortonworks Inc. 2013
CN…
Expandd
compute
Configurations for Hadoop VMs
• Model 1 is generally better
– ComputeNode/DataNode shortcut optimization
– Less storage fragmentation (intermediate shuffle data)
– Share cluster: Hadoop has excellent Tenant/Resource isolation
– Use Model 2 if you need need additional tenant isolation but shared data
Page 20
Base cluster:
Data+Compute
on each VM
CN
DN
Model 1
Base HDFS cluster:
a DataNode on a VM
DN
Per-tenant
compute cluster
CN …
Model 2
• Share the cluster unless you need
per-tenant Hadoop-version or more isolation
Š Hortonworks Inc. 2013
Summary
• Hadoop Virtualizes the Cluster’s HW Resources across Tenants
• Hadoop for the Public Cloud
– Most based on VMs due to strong complete isolation requirement
– Altiscale, Rackspace offer additional models
– Main challenge is data persistence as clusters are spun up/down
– We describe best practices and what is coming
• Hadoop for the Private Cloud
– For test, dev, Poc clusters
– Both Hadoop on raw machines and VMs are a good idea
– VMs allow “cluster-on-the-fly” plus ability to control version for each user
– For production clusters
– Hadoop natively provides excellent elasticity, resource management and
multi-tenancy
– If you must use VMs
– Understand your motivations well and configure accordingly
– For VMs, follow the best practices we provided
Page 21
Š Hortonworks Inc. 2013
Thank You
Page 22
Q & A

More Related Content

What's hot

Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudCloudera, Inc.
 
Dawn of YARN @ Rocket Fuel
Dawn of YARN @ Rocket FuelDawn of YARN @ Rocket Fuel
Dawn of YARN @ Rocket FuelDataWorks Summit
 
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld
 
Achieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureAchieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureUtkarsh Pandey
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...DataWorks Summit
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About YarnCloudera, Inc.
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastuctureDataWorks Summit
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Cloudera, Inc.
 
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...EMC
 
Hadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperHadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperBlueData, Inc.
 
Configuring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseConfiguring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseCloudera, Inc.
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...DataWorks Summit
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetupiwrigley
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentDataWorks Summit
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkCloudera, Inc.
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
 

What's hot (20)

Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
 
Dawn of YARN @ Rocket Fuel
Dawn of YARN @ Rocket FuelDawn of YARN @ Rocket Fuel
Dawn of YARN @ Rocket Fuel
 
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
 
Achieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureAchieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azure
 
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it's Cool, but are You Doing it...
 
Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Yarns About Yarn
Yarns About YarnYarns About Yarn
Yarns About Yarn
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastucture
 
Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Management
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
 
Hadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White PaperHadoop Virtualization - Intel White Paper
Hadoop Virtualization - Intel White Paper
 
Configuring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseConfiguring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the Enterprise
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake Environment
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 

Viewers also liked

Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopHortonworks
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APImcsrivas
 
Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale Gera Shegalov
 
Cost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationCost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationDataWorks Summit
 
The Keys to Digital Transformation
The Keys to Digital TransformationThe Keys to Digital Transformation
The Keys to Digital TransformationMapR Technologies
 
Farming hadoop in_the_cloud
Farming hadoop in_the_cloudFarming hadoop in_the_cloud
Farming hadoop in_the_cloudSteve Loughran
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoopmcsrivas
 
Hadoop on Virtual Machines
Hadoop on Virtual MachinesHadoop on Virtual Machines
Hadoop on Virtual MachinesRichard McDougall
 
TeraSort
TeraSortTeraSort
TeraSortTung D. Le
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking TutorialTilmann Rabl
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicDataWorks Summit
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachDataWorks Summit
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingBart Vandewoestyne
 

Viewers also liked (20)

Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
 
MapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase APIMapR M7: Providing an enterprise quality Apache HBase API
MapR M7: Providing an enterprise quality Apache HBase API
 
Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale
 
Cost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationCost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop Implementation
 
The Keys to Digital Transformation
The Keys to Digital TransformationThe Keys to Digital Transformation
The Keys to Digital Transformation
 
Farming hadoop in_the_cloud
Farming hadoop in_the_cloudFarming hadoop in_the_cloud
Farming hadoop in_the_cloud
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
Hadoop on Virtual Machines
Hadoop on Virtual MachinesHadoop on Virtual Machines
Hadoop on Virtual Machines
 
TeraSort
TeraSortTeraSort
TeraSort
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run ApproachEvolution of Big Data at Intel - Crawl, Walk and Run Approach
Evolution of Big Data at Intel - Crawl, Walk and Run Approach
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 

Similar to Hadoop in the Clouds, Virtualization and Virtual Machines

HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and FutureDataWorks Summit
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...Amazon Web Services
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupAndrei Savu
 
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceThe Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceBlueData, Inc.
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Community
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopDataWorks Summit
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Highly scalable caching service on cloud - Redis
Highly scalable caching service on cloud - RedisHighly scalable caching service on cloud - Redis
Highly scalable caching service on cloud - RedisKrishna-Kumar
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Cloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsCloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsDataWorks Summit/Hadoop Summit
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsDataWorks Summit
 
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 Chris Almond
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceChris Nauroth
 

Similar to Hadoop in the Clouds, Virtualization and Virtual Machines (20)

HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
(BDT305) Lessons Learned and Best Practices for Running Hadoop on AWS | AWS r...
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
 
The Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-ServiceThe Time Has Come for Big-Data-as-a-Service
The Time Has Come for Big-Data-as-a-Service
 
Cloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep DiveCloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep Dive
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
Ceph Day New York 2014: Best Practices for Ceph-Powered Implementations of St...
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
Hello OpenStack, Meet Hadoop
Hello OpenStack, Meet HadoopHello OpenStack, Meet Hadoop
Hello OpenStack, Meet Hadoop
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Highly scalable caching service on cloud - Redis
Highly scalable caching service on cloud - RedisHighly scalable caching service on cloud - Redis
Highly scalable caching service on cloud - Redis
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Cloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World ConsiderationsCloudy with a Chance of Hadoop - Real World Considerations
Cloudy with a Chance of Hadoop - Real World Considerations
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Cloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerationsCloudy with a chance of Hadoop - real world considerations
Cloudy with a chance of Hadoop - real world considerations
 
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
 
Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Deploying Big-Data-as-a-Service (BDaaS) in the EnterpriseDeploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
 
Storage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduceStorage and-compute-hdfs-map reduce
Storage and-compute-hdfs-map reduce
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash CourseDataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Hadoop in the Clouds, Virtualization and Virtual Machines

  • 1. Š Hortonworks Inc. 2013 Hadoop in Clouds, Virtualization and Virtual Machines Page 1 Sanjay Radia Founder & Architect Hortonworks Inc Suresh Srinivas Founder, Director of Engineering Hortonworks Inc.
  • 2. Š Hortonworks Inc. 2013 Hello Sanjay Radia • Founder, Hortonworks • Part of the Hadoop team at Yahoo! since 2007 –Chief Architect of Hadoop Core at Yahoo! –Long time Apache Hadoop PMC and Committer –Designed and developed several key Hadoop features • Prior –Data center automation, virtualization, Java, HA, OSs, File Systems (Startup, Sun Microsystems, …) –Ph.D., University of Waterloo Page 2
  • 3. Š Hortonworks Inc. 2013 Overview • Virtualization – A General CS concept –What resources does Hadoop virtualize? • Public Clouds –A spectrum of offerings • Private Clouds –VMs and Hadoop –Test, dev, Poc clusters –Production clusters –Guideline for VMs Page 3
  • 4. Š Hortonworks Inc. 2013 Virtualization – a General CS Concept Defn: Slice/dice a resource across multiple users so that – Each user has an impression of a dedicated resource – CPU, Virtual memory, … – A user has the impression of a much larger resource – Virtual memory, tiered storage Examples • Virtualize the CPU, virtual memory – Abstractions: process (threads) and its virtual memory • Virtualize Storage – Abstractions: file system, Luns, etc • Virtualize the entire machine (CPU, memory, devices) – Abstraction: VMs • Virtualize Hadoop cluster resources across users and tenants – Next slide … Page 4
  • 5. Š Hortonworks Inc. 2013 So what about Virtualization in a Hadoop cluster? Hadoop lets the OS virtualize the hardware resources. Hadoop instead virtualizes the network and each node’s OS resources across the cluster 5 Hadoop is Big-Data “Operating System”
  • 6. Š Hortonworks Inc. 2013 Hadoop’s Architecture Page 6 NameNode RM File System Namespace Block Management Compute Resource Management Task placement Tasks DISC 1 DISC 2 DISC N DataNode TaskManager Block delete Replicate Heartbeat BlockReport Heartbeat Client Read/Write Data File create, delete, modify BlocksBlocksBlocks Slave Node Task Management Slave Node Tens to Thousands Slave Nodes
  • 7. Š Hortonworks Inc. 2013 What about Virtualization in Hadoop (1) • Hadoop virtualize compute, storage and network resources across the cluster tenants – Abstractions: Yarn application container, HDFS files, … – Yarn virtualizes each node’s OS resources across tenants – Yarn scheduler focuses on CPU and memory today but at lowest level the resource request is fairly general – IO, network bandwidth can be added – Yarn supports a diverse set of applications beyond, MR, SQL, ETL, … – Allows sharing the cluster infrastructure – Slider: run services on Hadoop cluster – Yarn container is a process (and soon Linux containers) to provide isolation – can be other things including VMs – Capacity scheduler guarantees capacity across multiple cluster tenants – Can go beyond guaranteed capacity if available, but subject to preemption – Limits, Queues and sub-queues for flexible management – Uses OS limits, and Linux containers to limit resource usage Hadoop was designed for multi-tenancy Page 7
  • 8. Š Hortonworks Inc. 2013 What about Virtualization in Hadoop (2) • Hadoop virtualize compute, storage and network resources across the cluster tenants – Abstraction: Yarn application container, HDFS files, … – Yarn virtualizes compute/memory across nodes for tenants – (covered in previous slide) –HDFS – A shared file system across tenants with quota support – Soon: – memory for intermediate data that is virtualized like virtual memory – Tiering across storage types Hadoop was designed for multi-tenancy Page 8
  • 9. Š Hortonworks Inc. 2013 Take Away for rest of the talk • Hadoop for the public cloud –Most based on VMs due to strong isolation requirement –Main challenge is your data as clusters are spun up and down – We describe best practices and what is coming • Hadoop for the private cloud –For test and dev clusters – Both Hadoop on raw machines and VMs are a good idea –For production clusters – Hadoop natively provides excellent elasticity, resource management and multi- tenancy – If you must use VMs – Understand your motivations well and configure accordingly –For VMs, we describe some best practices Page 9
  • 10. Š Hortonworks Inc. 2013 Public Clouds Page 10
  • 11. Š Hortonworks Inc. 2012Š Hortonworks Inc. 2013. Confidential and Proprietary. Public Clouds – A Spectrum Page 11 Virtual Machines Hadoop App as a Service Run/Manage Your Hadoop Cluster On VMs Just submit your job/app Hosted Deployed Managed Hadoop Clusters Managed “virtual” cluster that uses a shared Hadoop. Just submit jobs/apps MS Azure AWS EC2 Rackspace Cloud AWS EMR Cluster deployed on VMs or private hosts with easy Cluster Management & Job Management tools MS HDInsight (VMs) Rackspace (VMs or Host in cage) ATT Altiscale You Spin up/down cluster Data on external store (S3, Azure storage) across Cluster lifecycles Automatic Spin up/down Data on external store (S3 Azure storage) across Cluster lifecycles, Shared HDFS but Isolated Namespaces Shared Yarn but isolation via Containers or VMs Hosted: Cluster/Data persists
  • 12. Š Hortonworks Inc. 2013 Cluster Lifecycle and Data Persistence • For a few: the Cluster and HDFS data persists –Rackspace (Hosted and managed private cluster) –Altiscale • For most: clusters spun up and down –Main challenge: need to persist data across cluster lifecycle – Local data disappears when cluster is spun down –External K-V store – Access remote storage (slow) – Copy, then process, process, process, spin down (e.g. Netflix) –How can we do better? – Interleaved storage cluster across rack (an example later) – Shadow HDFS that caches K-V Store – great for reading – Use the K-V store as 4th lazy HDFS replica – works for writing Page 12
  • 13. Š Hortonworks Inc. 2013 Use of External Storage when cluster is Spinup/Spindown • Shadow mount external storage (S3) – great for reading – Cache one or two replicas at mount time – when missing, fetch from source – Hadoop apps access cached S3 data on local HDFS • Use S3 to store blocks and check-pointed image – works for writes – Write lazy of 4th replica – When cluster is spun down, all block and image must be flushed to S3 Page 13 DataNode Shadow Mount (Cache) S3S3 Checkpoint Image +Blocks Lazy write Of 4th replica Local HDFS
  • 14. Š Hortonworks Inc. 2013 An Interesting “Hadoop as service” configuration • HDFS Cluster for data external to the compute cluster – but stripped to allow rack level access • Persists when the dynamic customer clusters are spun-down • Much better performance than external S3… Page 14 Core Switch ` Switch …… ` Switch …… ` Switch …… Storage HDFS cluster - Stores data that persists when compute clusters are spun down Transient Hadoop cluster - Clusters (VMs) are spun up and down - Have transient local storage
  • 15. Š Hortonworks Inc. 2013 VMs and Hadoop in Private Clouds Page 15
  • 16. Š Hortonworks Inc. 2013 Virtual Machines and Hadoop • Traditional Motivation for VMs – Easier to install and manage – Elasticity – expand and shrink service capacity – Relies on expensive shared storage so that data can be accessed from any VM – Infrastructure consolidation and improved utilization • Some of the VM benefits require shared storage – But external shared storage (SAN/NAS) does not make sense for Hadoop • Potential Motivation for Hadoop on VMs – Public clouds mostly use VMs for isolation reasons – Motivation for VMs for private clouds – Test, dev clusters-on-fly to allow independent version and cluster management – Share infrastructure with other non-Hadoop VMs – Production clusters … – Elasticity ? – Multi-tenancy isolation? Page 16
  • 17. Š Hortonworks Inc. 2013 Virtual Machines and Hadoop • Elasticity in Hadoop via VMs –Elastic DataNodes? - Not a good idea –lots of data to be copied –Elastic Compute Nodes – works but some challenges – Need to allocate local disk partition for shuffle data – VM per-tenant will fragment your local storage - This can be fixed by moving tmp storage to HDFS –will be enabled in the future – Local read optimization (short circuit read) is not available to VM • Resource management & Isolation excellent in Hadoop –Yarn has flexible resource model (CPU, Memory, others coming) – It manages across the nodes –Yarn’s Capacity scheduler gives capacity guarantees –Isolation supported via Linux Containers –Docker helps with image management Page 17
  • 18. Š Hortonworks Inc. 2013 Guideline for VMs for Hadoop • VMs for Test, Dev, Poc, Staging Clusters works well –Can share Host with non-Hadoop VMs –For such use cases having independent clusters is fine – Do not have lots of data or storage fragmentation is less concern – Independent clusters allow each to have a different Hadoop version –Some customers: a native shared Hadoop cluster for testing apps • Production Hadoop – VMs generally less attractive But if you do, consider motivation, and configure accordingly (next slide) –Hadoop has built-in – elasticity for CPU and Storage – resource management – capacity guarantees for multi-tenancy – isolation for multi-tenancy Page 18
  • 19. Š Hortonworks Inc. 2013 CN… Expandd compute Configurations for Hadoop VMs • Model 1 is generally better – ComputeNode/DataNode shortcut optimization – Less storage fragmentation (intermediate shuffle data) – Share cluster: Hadoop has excellent Tenant/Resource isolation – Use Model 2 if you need need additional tenant isolation but shared data Page 20 Base cluster: Data+Compute on each VM CN DN Model 1 Base HDFS cluster: a DataNode on a VM DN Per-tenant compute cluster CN … Model 2 • Share the cluster unless you need per-tenant Hadoop-version or more isolation
  • 20. Š Hortonworks Inc. 2013 Summary • Hadoop Virtualizes the Cluster’s HW Resources across Tenants • Hadoop for the Public Cloud – Most based on VMs due to strong complete isolation requirement – Altiscale, Rackspace offer additional models – Main challenge is data persistence as clusters are spun up/down – We describe best practices and what is coming • Hadoop for the Private Cloud – For test, dev, Poc clusters – Both Hadoop on raw machines and VMs are a good idea – VMs allow “cluster-on-the-fly” plus ability to control version for each user – For production clusters – Hadoop natively provides excellent elasticity, resource management and multi-tenancy – If you must use VMs – Understand your motivations well and configure accordingly – For VMs, follow the best practices we provided Page 21
  • 21. Š Hortonworks Inc. 2013 Thank You Page 22 Q & A

Editor's Notes

  1. First let us take a look at the term virtualization. Virtualization is a general CS term that predates the term virtual machine by decades.
  2. A thing to remember that Hadoop moves the computation to the data. Shortcut for local access directly from the OS,
  3. Now lets turn to HDFS
  4. VM on one extreme and a Hadoop App service such as EMR on the other end All except Amazon run HDP
  5. Interleaved HDFS cluster We have one customer interleave a HDFS cluster across compute racks
  6. VMs have been important for reducing the TCO for traditional enterprise processing
  7. Yarn’s underlying resource model is very flexible and has an extensible abstraction for different resource types First memory, Now Cpu and different kinds of memory Next IO and Network bandwidths