Infrastructure Considerations for
Analytical Workloads
By applying Hadoop clusters to big data workloads, organizations can achieve significant performance gains that vary depending on whether the underlying infrastructure is physical or virtual.
Executive Summary
On the list of technology industry buzzwords, "big data" is among the most intriguing. As data volume, velocity and variety proliferate, and the search for veracity escalates, organizations across industries are placing new bets on data sources such as machine sensor data, medical images, financial information, retail sales, radio frequency identification and Web tracking data. This creates huge challenges for decision-makers, who must extract meaning and untangle trends from more input than ever before.
From a technological perspective, the so-called
four V’s of big data (volume, velocity, variety and
veracity) make it ever more difficult to process big
data on a single system. Even if one disregards the storage constraints of a single system and utilizes a storage area network (SAN) to store the petabytes of incoming data, processing speed remains a huge bottleneck. Whether a single-core or multi-core processor is used, a single system would take substantially more time to process data than if the data were partitioned across an
array of systems used in parallel. That’s not to
say that the processing conundrum shouldn’t
be confronted and overcome. Big data plays a
vital role in improving organizational profitability, increasing productivity and solving scientific
challenges. It also enables decision-makers to
understand customer needs, wants and desires,
and to see where markets are heading.
One of the major technologies that helps organizations make sense of big data is the open source distributed processing framework known as Apache Hadoop. Based on our engagement experiences and on intensive benchmarking, this white paper analyzes the infrastructure considerations for running analytical workloads on Hadoop clusters. The primary emphasis is to compare and contrast physical and virtual infrastructure requirements for supporting typical business workloads from performance, cost, support and scalability perspectives. Our goal is to arm readers with the insights necessary for assessing whether physical or virtual infrastructure would best suit their organization's requirements.
Cognizant 20-20 Insights | April 2016
Hadoop: A Primer
To solve many of the aforementioned big data issues, the Apache Software Foundation developed Apache Hadoop, a Java-based framework that can be used to process large amounts of data across thousands of computing nodes. It consists of two main components: HDFS¹ and MapReduce.²
Hadoop Distributed File System (HDFS) is
designed to run on commodity hardware, while
MapReduce provides the processing framework
for distributed data across thousands of nodes.
HDFS shares many attributes with other distributed file systems. However, Hadoop has implemented numerous features that allow the file system to be significantly more fault-tolerant than typical hardware solutions such as redundant arrays of inexpensive disks (RAID) or data replication alone. What follows is a deep dive into the
reasons Hadoop is considered a viable solution
for the challenges created by big data. The HDFS
components explored are the NameNode and
DataNodes (see Figure 1).
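To make the NameNode/DataNode split concrete, the following minimal sketch uses Hadoop's standard FileSystem API to write and then read a file. The cluster URI and paths are hypothetical; this is an illustrative sketch, not code from our benchmark.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Opening the file system contacts the NameNode, which serves only
        // metadata; the block bytes themselves stream to/from DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/demo/sample.txt"))) {
            out.writeBytes("hello hdfs\n"); // each block is replicated across DataNodes
        }
        try (FSDataInputStream in = fs.open(new Path("/demo/sample.txt"))) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, "UTF-8"));
        }
        fs.close();
    }
}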
The MapReduce framework processes large data
sets across numerous computer nodes (known as
data nodes) where all nodes are on the same local
network and use similar hardware. Computational
processing can occur on data stored either in a file
system (either semi-structured or unstructured)
or in a database (structured). MapReduce can take
advantage of data locality. In MapReduce version 1, the components are the JobTracker and TaskTrackers, whereas in MapReduce version 2 (YARN), the components are the ResourceManager and NodeManagers (see Figure 2).
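To illustrate the framework, here is a minimal word-count-style MapReduce job using the newer org.apache.hadoop.mapreduce API. It is a sketch for illustration only, with class names and paths of our choosing rather than the benchmark's actual code; the map output is shuffled and sorted by key before reaching the reducer, matching the logical flow in Figure 3.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {
    // Map phase: runs on the node holding the input split (data locality).
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                word.set(tok);
                ctx.write(word, ONE); // emitted pairs are shuffled and sorted by key
            }
        }
    }
    // Reduce phase: receives all values for one key after shuffle and sort.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count-sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // optional local aggregation before shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitted with input and output path arguments, the job's splits are processed by map tasks scheduled close to their data, which is the locality property the framework exploits.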
Hadoop’s Role
Hadoop provides performance enhancements that enable high-throughput access to application data. It also handles streaming access to file system resources, which becomes increasingly challenging as data sets grow larger. Many of the design considerations can be subdivided into the following categories:
•	Data asset size.
•	Transformational challenges.
•	Decision-making.
•	Analytics.
Hadoop’s ability to integrate data from different sources (databases, social media, etc.), systems (network/machine/sensor logs, geo-spatial data, etc.) and file types (structured, unstructured and semi-structured) enables organizations to respond to business questions such as:
•	Do you test all of your decisions to compete in
the market?
•	Can new business models be created based on
the available data in the organization?
•	Can you drive new operational efficiencies by
modernizing extract, transform and load (ETL)
and optimizing batch processing?
•	How can you harness the hidden value in
your data that until now has been archived,
discarded or ignored?
[Figure 1. HDFS Architecture: a client issues metadata operations to the NameNode and streams block reads and writes directly to DataNodes, which replicate data blocks across racks.]

All applications utilizing HDFS tend to have large data sets that range from gigabytes to petabytes, and HDFS has been calibrated to handle such large data volumes. By providing substantial aggregate data bandwidth, HDFS should scale to thousands of nodes per cluster. Hadoop
is a highly scalable storage platform because it
can store and distribute very large data sets
across hundreds of commodity servers operating
in parallel. This enables businesses to run their
applications on thousands of nodes involving
thousands of terabytes of data.
In legacy environments, traditional ETL and
batch processes can take hours, days or even
weeks – in a world where businesses require
access to data in minutes or even seconds.
Hadoop excels at high-volume batch processing.
Because the processing is in parallel, Hadoop is
said to perform batch processing multiple times
faster than on a single server.
Likewise, when Hadoop is used as an enterprise data hub (EDH), it can ease the ETL bottleneck by establishing a single version of truth that business users can access and transform without a dedicated infrastructure setup. Hadoop thus becomes the one place to store all data, for as long as desired or required and in its original fidelity, integrated with existing infrastructure and tools. This provides the flexibility to run a variety of enterprise workloads, including batch processing, interactive SQL, enterprise search and advanced analytics. It also comes with the built-in security, governance, data protection and management that enterprises require.
With EDH, leading organizations are changing the
way they think about data, transforming it from a
cost to an asset.
For many enterprises, data streams from all
directions. The challenge is to synthesize and
quantify it and convert bits and bytes into insights
and foresights by applying analytical procedures
on the historical data collected. Hadoop enables
organizations not only to store the data collected
but also to analyze it. With Hadoop, business
value can be elevated by:
•	Mining social media data to determine
customer sentiments.
•	Evaluating Web clickstream data to improve
customer segmentation.
•	Proactively identifying and responding to
security breaches.
•	Predicting a customer’s next buy.
•	Fortifying security and compliance using server/machine logs and analyzing various data sets across multiple data sources.

[Figure 2. MR vs. YARN Architecture: in MapReduce v1, clients submit jobs to a JobTracker, which coordinates TaskTrackers colocated with DataNodes; in MapReduce v2 (YARN, Yet Another Resource Negotiator), clients submit to a ResourceManager, which coordinates NodeManagers hosting containers and per-application ApplicationMasters, again colocated with DataNodes.]

[Figure 3. MapReduce Logical Data Flow: Input -> Split -> Map -> [Combine] -> Shuffle & Sort -> Reduce -> Output.]
Understanding Hadoop Infrastructure
Hadoop can be deployed in either of two environments:
•	Physical-infrastructure-based.
•	Virtual-infrastructure-based.
Physical Infrastructure for Hadoop Cluster
Deployment
Hadoop and its associated ecosystem components
are deployed on physical machines with large
amounts of local storage and memory. Machines
are racked and stacked with high-speed network
switches.
The merits:
•	Delivers the full benefits of Hadoop’s performance, especially with locality-aware computation. In the case where a node is too busy to accept additional work, the JobTracker can still schedule work near the node and take advantage of the switch’s bandwidth.
•	The HDFS file system is persistent across cluster restarts (provided the data on the NameNode is protected and a secondary NameNode exists to keep up with the data, or high availability has been configured).
•	When writing files to HDFS, data blocks can be
streamed to multiple racks; importantly, if a
switch fails or a rack loses power, a copy of the
data is still retained.
The demerits:
•	Unless there is enough work to keep the
CPUs busy, hardware becomes a depreciating
investment, particularly if servers aren’t being
used to their full potential – thereby increasing
the effective cost of the entire cluster.
•	The cluster hostnames and IP addresses need to be copied into /etc/hosts on each server in the cluster to avoid DNS load.
Virtual Infrastructure for Hadoop Cluster
Deployment
Virtual machines (VMs) exist only for the duration of the Hadoop cluster. In this approach, a cluster configuration with the NameNode and JobTracker hostnames is created, usually on the same machine for a small cluster. Network rules can ensure that only authorized hosts have access to the master and slave nodes. Persistent data must be kept in an alternate file system to avoid data loss.
The merits:
•	Can be cost-effective as the organization is
billed based on the duration of cluster usage;
when the cluster is not needed, it can be shut
down – thus saving money.
•	Can scale the cluster up and down on demand.
•	Some cloud service providers offer a prepackaged, ready-to-use version of Hadoop.
The demerits:
•	Prepackaged Hadoop implementations may
be older versions or private branches without
the code being public. This makes it harder to
handle failure.
•	Startup can be complex, as the hostnames of
the master node(s) are not known until they
are allocated; configuration files need to be
created on demand and then placed in the VMs.
•	There is no persistent storage except through
non-HDFS file systems.
•	There is no locality awareness in such a virtualized Hadoop cluster; thus, there is no easy way to determine the location of slave nodes or their proximity to each other.
•	DataNodes may be colocated on the same physical servers, and so lack the actual redundancy they appear to offer in HDFS.
•	Extra tooling is often needed to restart the cluster when the machines are destroyed.

Factors Affecting Hadoop Cluster Performance

Soft factors (performance optimization parameters):
•	Number of maps
•	Number of reducers
•	Combiner
•	Custom serialization
•	Shuffle tweaks
•	Intermediate compression

Hard factors (external factors):
•	Environment
•	Number of cores
•	Memory size
•	The network

Figure 4
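Most of the soft factors in Figure 4 map directly onto standard Hadoop job settings. Below is a minimal sketch, assuming Hadoop 2-era property names; the values are illustrative, not tuned recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Soft factor: compress intermediate map output to shrink shuffle traffic
        // (the equivalent Hadoop 1 property was mapred.compress.map.output).
        conf.setBoolean("mapreduce.map.output.compress", true);
        Job job = Job.getInstance(conf, "tuned-job-sketch");
        // Soft factor: the number of reducers is set explicitly, while the
        // number of maps is driven by the input splits.
        job.setNumReduceTasks(8);
        // Soft factor: a combiner pre-aggregates map output before the shuffle,
        // e.g. job.setCombinerClass(SumReducer.class) for a job whose reduce
        // function is associative (class name hypothetical).
        return job;
    }
}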
Hadoop Performance Evaluation
When it comes to Hadoop clusters, performance is critical. These clusters may run on premises on physical infrastructure, in a virtualized environment, or both. A performance analysis of individual clusters in each environment aids in determining the best alternative for achieving the required performance (see Figure 4).
Setup Details and Experiment Results
We compared the performance of a Hadoop cluster running virtually on Amazon Web Services’ Elastic MapReduce (AWS EMR) and a similar hard-wired cluster running on internal physical infrastructure. See Figure 5 for the precise configurations.
Figure 6 lists the software versions used on the virtual cluster and on the physical machines for the Hive, Pig and Mahout KMeans clustering workloads.
Figure 7 details the nature of the benchmark data.
A Tale of the Tape: Physical vs. Virtual Machines

AWS VM Sizes*	vCPU x Memory	No. of Nodes
m1.medium	1 x 2 GB	4
m1.large	1 x 4 GB	4
m1.xlarge	4 x 16 GB	4

Physical Machine Sizes	CPU x Memory	No. of Nodes
NameNode	4 x 4 GB	1
DataNode	4 x 4 GB	3
Client	4 x 8 GB	1
Processor: Intel Core i3-3220 CPU @ 3.30 GHz, 4 cores

Figure 5

Benchmarking Physical and Virtual Machines*

	AWS EMR	Physical machine
Distribution	Apache Hadoop	Cloudera Distribution for Hadoop 4
Hadoop Version	1.0.3	2.0.0+1518
Pig	0.11.1.1-amzn (rexported)	0.11.0+36
Hive	0.11.0.1	0.10.0+214
Mahout	0.9	0.7+22

*Instance details may differ with releases.³

Figure 6

Data Details

Requirement	Generate 1B records and store on S3 bucket/HDFS
No. of Columns	37
No. of Files	50
No. of Records (each file)	20 million
File Size (each file)	2.7 GB
Total Data Size	135 GB
Cluster Size (4-node)	No. of DataNodes/TaskTrackers: 3

Figure 7
This benchmark transformed raw data into a standard format using big data tools such as Hive Query Language (HiveQL) and Pig Latin on 40 million records, scaling to 1 billion records. Along with this, Mahout (the machine learning tool for Hadoop) was run for KMeans clustering of the data, creating five clusters with a maximum of eight iterations, on m1.large (1 vCPU x 4 GB memory), m1.xlarge (4 vCPU x 15.3 GB memory) and physical machines (4 CPU x 4 GB memory). The input data was placed in HDFS for the physical machines and on AWS S3 for AWS EMR.
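As a rough sketch of that data placement, the same FileSystem API covers both targets; the cluster URI, bucket name and paths here are hypothetical, not the benchmark's actual locations.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageInputSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Physical cluster: copy the generated record files into HDFS.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        hdfs.copyFromLocalFile(new Path("/local/generated/records"),
                               new Path("/benchmark/input"));
        // On AWS EMR, jobs would instead take an S3 location as the input
        // path, e.g. "s3://my-benchmark-bucket/records/" (bucket hypothetical).
    }
}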
Resulting Graphs
Figure 8 shows how the cluster performed for the Hive transformation in both physical and virtual environments. Both workloads took almost the same time for smaller data sets (~40 to ~80 million records); as data sizes grow, the physical machines gradually outperform EMR’s m1.large cluster.
[Figure 8. Hive Transformation (PM vs. VM): time in seconds vs. number of records (40M to 1B) for AWS EMR (m1.large) and physical machines.]

[Figure 9. Pig Transformation (PM vs. VM): time in seconds vs. number of records (40M to 320M) for AWS EMR (m1.large) and physical machines.]

[Figure 10. PM vs. VM (for 320M records): time in seconds for Pig (Transformation), Hive (Transformation), Hive (Query-2) and Hive (Query-3).]

Figure 9, which compares PM versus VM using the Pig transformation, shows that the EMR cluster
executing a Pig Latin script on 40 million records takes longer than a workload running the same script on physical machines. With increasing data sizes, the gap between physical and virtual infrastructure widens to the point where physical machines execute significantly faster.
Figure 10 shows the time taken for all four operations on a dataset containing 320 million records. This includes running various Hive queries and Pig scripts to compare their performance. With the exception of the Hive transformation, the operations are faster on physical than on virtual infrastructure.
Figure 11 compares the gradual increase in
execution time with increasing data sizes. Here
the Pig scripts appear to have a faster execution
time on physical machines than on virtual
machines.
Figure 12 shows the time taken by Hive queries to
run on physical and virtual machines for various
data sizes. Again, physical machines appear to
perform much faster than virtual ones.
[Figure 11. Pig/Hive Transformation (PM vs. VM): time in seconds vs. number of records (40M to 320M) for Pig and Hive transformations on m1.large and on physical machines.]

[Figure 12. Hive Query-2 and Query-3 (PM vs. VM): time in seconds vs. number of records (40M to 1B) on m1.large and on physical machines.]

PM vs. VM Mahout K-means (time in seconds)

No. of Records	VM (1x4)	PM	VM (4x15)
1M	455.27	73.64	8.1
2M	532.37	91.06	8.62
4M	736.65	139.37	10.21
6M	750.82	180.86	11.93

Figure 13
Figure 13 displays the K-means clustering performance on physical infrastructure, m1.large virtual infrastructure (1 core x 4 GB memory) and m1.xlarge virtual infrastructure (4 cores x 15 GB memory). In this test, the best performance was clocked on the m1.xlarge cluster. Hence, the performance achieved depends significantly on the memory available for the run. In this case, the ease of scaling virtual machines drove the performance advantage over physical machines.
In our experiment, we observed that AWS EMR instances up to m1.large perform significantly slower than the equivalent cluster running in a physical environment, whereas with the m1.xlarge instance and its larger memory capacity, virtual performance was faster than physical.
In sum, Hadoop MapReduce jobs are IO bound and, generally speaking, virtualization will not help organizations boost performance. Hadoop takes advantage of sequential disk IO, for example by using larger block sizes, whereas virtualization works on the notion that multiple “machines” do not need full physical resources at all times. IO-intensive data processing applications that operate on dedicated storage are therefore best left non-virtualized.
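As one example of that sequential-IO bias, the HDFS block size can be raised so each task reads longer contiguous runs from disk. A minimal sketch, assuming the Hadoop 2 property name and an illustrative value:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeSketch {
    public static Configuration largeBlocks() {
        Configuration conf = new Configuration();
        // Larger blocks favor long sequential reads and writes, the access
        // pattern MapReduce is built around (Hadoop 1 called this dfs.block.size).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB per block
        return conf;
    }
}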
For a large job, adding more TaskTrackers to the cluster helps boost computational speed, but physical machines offer no flexibility for adding or removing nodes from the cluster on demand.
Moving Forward
Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. It is important to understand your workload and the principles involved in hardware selection (e.g., blades and SANs are preferred for grid and processing-intensive workloads). Based on the findings of our benchmark study, we recommend that organizations keep in mind the following infrastructure considerations:
•	If your application depends on performance, has a long lifecycle and exhibits regular data growth, physical machines are the better option: they perform better, the deployment cost is a one-time expense, and with regular data growth there may be no need for highly scalable infrastructure.
•	Where your application has a balanced workload, is cost-intensive, exhibits exponential data growth and requires support, virtual machines can prove the safer choice, as the CPU is well utilized and the memory is scalable. They are also more cost-efficient, since they come with a flexible pay-per-use policy, and the VM environment scales easily when adding or removing DataNodes/TaskTrackers/NodeManagers.
•	Where your application depends on performance, must be cost-efficient, and data growth is regular and requires support, virtual machines can be the better choice.
•	Where your application requires high performance and data growth is exponential with no required support, virtual machines with higher memory are the better choice.

During the course of our investigation, we found that the commodity systems, while both antiquated and less responsive, performed significantly better in our implementation than customary virtual machine implementations using standard hypervisors.

From these results, we observe that virtual Hadoop cluster performance is significantly lower than that of a cluster running on physical machines, owing to the overhead virtualization imposes on the CPU of the physical host. Any feature that offsets this virtualization overhead, such as the larger memory of bigger VM instances, can boost performance.
Characteristic Differences Between Physical and Virtual Infrastructure

Performance: Comparing physical and virtual machines with the same configuration, the physical machines have higher performance; with increased memory, however, a VM can perform better.

Scalability: Commissioning and decommissioning cluster nodes on physical machines can prove an expensive affair compared to provisioning VMs as needed; scalability can therefore be highly limited with physical machines.

Cost: Provisioning physical machines incurs higher cost than virtual machines, where creating a VM can be as simple as cloning an instance of a VM and assigning it a unique identity.

Resource Utilization: The processor utilization of physical machines is less than 20%, though the rest remains available for use. In the case of virtual machines, the CPU is utilized at its best, with high chances of CPU overhead leading to lower performance.

Figure 14
About Cognizant

Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting, and business process outsourcing services, dedicated to helping the world’s leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 221,700 employees as of December 31, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000, and the Fortune 500 and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.
World Headquarters
500 Frank W. Burr Blvd.
Teaneck, NJ 07666 USA
Phone: +1 201 801 0233
Fax: +1 201 801 0243
Toll Free: +1 888 937 3277
Email: inquiry@cognizant.com
European Headquarters
1 Kingdom Street
Paddington Central
London W2 6BD
Phone: +44 (0) 20 7297 7600
Fax: +44 (0) 20 7121 0102
Email: infouk@cognizant.com
India Operations Headquarters
#5/535, Old Mahabalipuram Road
Okkiyam Pettai, Thoraipakkam
Chennai, 600 096 India
Phone: +91 (0) 44 4209 6000
Fax: +91 (0) 44 4209 6060
Email: inquiryindia@cognizant.com
­­© Copyright 2016, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission from Cognizant. The information contained herein is
subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.
About the Authors
Apsara Radhakrishnan is an Associate of the Decision Science Team within Cognizant Analytics. She has
three years of experience in the areas of big data technology focused on ETL in the Hadoop environment,
its administration and AWS Analytics products. She holds a master’s degree in computer applications from
Visvesvaraya Technological University. Apsara can be reached at Apsara.Radhakrishnan@cognizant.com.
Harish Chauhan is Principal Consultant, Cloud Services, within Cognizant Infrastructure Services. He
has over 24 years of IT experience, has numerous technical publications to his credit, and he has also
coauthored two patents in the area of virtualization – one of which was issued in January 2015. Harish’s
white paper on “Harnessing Hadoop” was released in 2013. His areas of specialization include distributed
computing (Hadoop/big data/HPC), cloud computing (private cloud technologies), virtualization/containerization and system management/monitoring. Harish has worked in many areas including infrastructure management, product engineering, consulting/assessment, advisory services and pre-sales. He holds a bachelor’s degree in computer science and engineering. In his current role, Harish is responsible for capability building on emerging trends and technologies like big data/Hadoop, cloud computing/virtualization, private clouds and mobility. He can be reached at Harish.Chauhan@cognizant.com.
TL Codex 1732
Footnotes
1	 HDFS - http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
2	 MapReduce - http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
3	 AWS Details - http://aws.amazon.com/ec2/previous-generation/.

More Related Content

What's hot

Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesDavid Tjahjono,MD,MBA(UK)
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKRajesh Jayarman
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time ApplicationsDataWorks Summit
 
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsWP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsJane Roberts
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Modeinventionjournals
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014Stratebi
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 

What's hot (19)

Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Big Data Real Time Applications
Big Data Real Time ApplicationsBig Data Real Time Applications
Big Data Real Time Applications
 
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRobertsWP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
WP_Impetus_2016_Guide_to_Modernize_Your_Enterprise_Data_Warehouse_JRoberts
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1
 
Bigdata and Hadoop Introduction
Bigdata and Hadoop IntroductionBigdata and Hadoop Introduction
Bigdata and Hadoop Introduction
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Big Data Analytics 2014
Big Data Analytics 2014Big Data Analytics 2014
Big Data Analytics 2014
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyond
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousing
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 

Viewers also liked

Data Infrastructure on Hadoop - Hadoop Summit 2011 BLR
Data Infrastructure on Hadoop - Hadoop Summit 2011 BLRData Infrastructure on Hadoop - Hadoop Summit 2011 BLR
Data Infrastructure on Hadoop - Hadoop Summit 2011 BLRSeetharam Venkatesh
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Scalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and MesosScalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and MesosDataWorks Summit
 
Big Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosHeiko Loewe
 
Lessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersLessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersBlueData, Inc.
 

Viewers also liked (8)

Data Infrastructure on Hadoop - Hadoop Summit 2011 BLR
Data Infrastructure on Hadoop - Hadoop Summit 2011 BLRData Infrastructure on Hadoop - Hadoop Summit 2011 BLR
Data Infrastructure on Hadoop - Hadoop Summit 2011 BLR
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Final White Paper_
Final White Paper_Final White Paper_
Final White Paper_
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Scalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and MesosScalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and Mesos
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Big Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and Mesos
 
Lessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker ContainersLessons Learned Running Hadoop and Spark in Docker Containers
Lessons Learned Running Hadoop and Spark in Docker Containers
 

Similar to Infrastructure Considerations for Analyzing Big Data with Hadoop

Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeSysfore Technologies
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystemnallagangus
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paperSupratim Ray
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop TechnologyRahul Sharma
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdfavenkatram
 
Data Engineering on GCP
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCPBlibBlobb
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouseStephen Alex
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overviewNitesh Ghosh
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overviewvhrocca
 

Similar to Infrastructure Considerations for Analyzing Big Data with Hadoop (20)

Hadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | SysforeHadoop and Big Data Analytics | Sysfore
Hadoop and Big Data Analytics | Sysfore
 
G017143640
G017143640G017143640
G017143640
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Big Data Hadoop Technology
Big Data Hadoop TechnologyBig Data Hadoop Technology
Big Data Hadoop Technology
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
 
Data Engineering on GCP
Data Engineering on GCPData Engineering on GCP
Data Engineering on GCP
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
Hd insight overview
Hd insight overviewHd insight overview
Hd insight overview
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 

More from Cognizant

Using Adaptive Scrum to Tame Process Reverse Engineering in Data Analytics Pr...
Using Adaptive Scrum to Tame Process Reverse Engineering in Data Analytics Pr...Using Adaptive Scrum to Tame Process Reverse Engineering in Data Analytics Pr...
Using Adaptive Scrum to Tame Process Reverse Engineering in Data Analytics Pr...Cognizant
 
Data Modernization: Breaking the AI Vicious Cycle for Superior Decision-making
Data Modernization: Breaking the AI Vicious Cycle for Superior Decision-makingData Modernization: Breaking the AI Vicious Cycle for Superior Decision-making
Data Modernization: Breaking the AI Vicious Cycle for Superior Decision-makingCognizant
 
It Takes an Ecosystem: How Technology Companies Deliver Exceptional Experiences
It Takes an Ecosystem: How Technology Companies Deliver Exceptional ExperiencesIt Takes an Ecosystem: How Technology Companies Deliver Exceptional Experiences
It Takes an Ecosystem: How Technology Companies Deliver Exceptional ExperiencesCognizant
 
Intuition Engineered
Intuition EngineeredIntuition Engineered
Intuition EngineeredCognizant
 
The Work Ahead: Transportation and Logistics Delivering on the Digital-Physic...
The Work Ahead: Transportation and Logistics Delivering on the Digital-Physic...The Work Ahead: Transportation and Logistics Delivering on the Digital-Physic...
The Work Ahead: Transportation and Logistics Delivering on the Digital-Physic...Cognizant
 
Enhancing Desirability: Five Considerations for Winning Digital Initiatives
Enhancing Desirability: Five Considerations for Winning Digital InitiativesEnhancing Desirability: Five Considerations for Winning Digital Initiatives
Enhancing Desirability: Five Considerations for Winning Digital InitiativesCognizant
 
The Work Ahead in Manufacturing: Fulfilling the Agility Mandate
The Work Ahead in Manufacturing: Fulfilling the Agility MandateThe Work Ahead in Manufacturing: Fulfilling the Agility Mandate
The Work Ahead in Manufacturing: Fulfilling the Agility MandateCognizant
 
The Work Ahead in Higher Education: Repaving the Road for the Employees of To...
The Work Ahead in Higher Education: Repaving the Road for the Employees of To...The Work Ahead in Higher Education: Repaving the Road for the Employees of To...
The Work Ahead in Higher Education: Repaving the Road for the Employees of To...Cognizant
 
Engineering the Next-Gen Digital Claims Organisation for Australian General I...
Engineering the Next-Gen Digital Claims Organisation for Australian General I...Engineering the Next-Gen Digital Claims Organisation for Australian General I...
Engineering the Next-Gen Digital Claims Organisation for Australian General I...Cognizant
 
Profitability in the Direct-to-Consumer Marketplace: A Playbook for Media and...
Profitability in the Direct-to-Consumer Marketplace: A Playbook for Media and...Profitability in the Direct-to-Consumer Marketplace: A Playbook for Media and...
Profitability in the Direct-to-Consumer Marketplace: A Playbook for Media and...Cognizant
 
Green Rush: The Economic Imperative for Sustainability
Green Rush: The Economic Imperative for SustainabilityGreen Rush: The Economic Imperative for Sustainability
Green Rush: The Economic Imperative for SustainabilityCognizant
 
Policy Administration Modernization: Four Paths for Insurers
Policy Administration Modernization: Four Paths for InsurersPolicy Administration Modernization: Four Paths for Insurers
Policy Administration Modernization: Four Paths for InsurersCognizant
 
The Work Ahead in Utilities: Powering a Sustainable Future with Digital
The Work Ahead in Utilities: Powering a Sustainable Future with DigitalThe Work Ahead in Utilities: Powering a Sustainable Future with Digital
The Work Ahead in Utilities: Powering a Sustainable Future with DigitalCognizant
 
AI in Media & Entertainment: Starting the Journey to Value
AI in Media & Entertainment: Starting the Journey to ValueAI in Media & Entertainment: Starting the Journey to Value
AI in Media & Entertainment: Starting the Journey to ValueCognizant
 
Operations Workforce Management: A Data-Informed, Digital-First Approach
Operations Workforce Management: A Data-Informed, Digital-First ApproachOperations Workforce Management: A Data-Informed, Digital-First Approach
Operations Workforce Management: A Data-Informed, Digital-First ApproachCognizant
 
Five Priorities for Quality Engineering When Taking Banking to the Cloud
Five Priorities for Quality Engineering When Taking Banking to the CloudFive Priorities for Quality Engineering When Taking Banking to the Cloud
Five Priorities for Quality Engineering When Taking Banking to the CloudCognizant
 
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining Focused
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining FocusedGetting Ahead With AI: How APAC Companies Replicate Success by Remaining Focused
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining FocusedCognizant
 
Crafting the Utility of the Future
Crafting the Utility of the FutureCrafting the Utility of the Future
Crafting the Utility of the FutureCognizant
 
Utilities Can Ramp Up CX with a Customer Data Platform
Utilities Can Ramp Up CX with a Customer Data PlatformUtilities Can Ramp Up CX with a Customer Data Platform
Utilities Can Ramp Up CX with a Customer Data PlatformCognizant
 
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...Cognizant
 

More from Cognizant (20)

Using Adaptive Scrum to Tame Process Reverse Engineering in Data Analytics Pr...
Using Adaptive Scrum to Tame Process Reverse Engineering in Data Analytics Pr...Using Adaptive Scrum to Tame Process Reverse Engineering in Data Analytics Pr...
Using Adaptive Scrum to Tame Process Reverse Engineering in Data Analytics Pr...
 
Data Modernization: Breaking the AI Vicious Cycle for Superior Decision-making
Data Modernization: Breaking the AI Vicious Cycle for Superior Decision-makingData Modernization: Breaking the AI Vicious Cycle for Superior Decision-making
Data Modernization: Breaking the AI Vicious Cycle for Superior Decision-making
 
It Takes an Ecosystem: How Technology Companies Deliver Exceptional Experiences
It Takes an Ecosystem: How Technology Companies Deliver Exceptional ExperiencesIt Takes an Ecosystem: How Technology Companies Deliver Exceptional Experiences
It Takes an Ecosystem: How Technology Companies Deliver Exceptional Experiences
 
Intuition Engineered
Intuition EngineeredIntuition Engineered
Intuition Engineered
 
The Work Ahead: Transportation and Logistics Delivering on the Digital-Physic...
The Work Ahead: Transportation and Logistics Delivering on the Digital-Physic...The Work Ahead: Transportation and Logistics Delivering on the Digital-Physic...
The Work Ahead: Transportation and Logistics Delivering on the Digital-Physic...
 
Enhancing Desirability: Five Considerations for Winning Digital Initiatives
Enhancing Desirability: Five Considerations for Winning Digital InitiativesEnhancing Desirability: Five Considerations for Winning Digital Initiatives
Enhancing Desirability: Five Considerations for Winning Digital Initiatives
 
The Work Ahead in Manufacturing: Fulfilling the Agility Mandate
The Work Ahead in Manufacturing: Fulfilling the Agility MandateThe Work Ahead in Manufacturing: Fulfilling the Agility Mandate
The Work Ahead in Manufacturing: Fulfilling the Agility Mandate
 
The Work Ahead in Higher Education: Repaving the Road for the Employees of To...
The Work Ahead in Higher Education: Repaving the Road for the Employees of To...The Work Ahead in Higher Education: Repaving the Road for the Employees of To...
The Work Ahead in Higher Education: Repaving the Road for the Employees of To...
 
Engineering the Next-Gen Digital Claims Organisation for Australian General I...
Engineering the Next-Gen Digital Claims Organisation for Australian General I...Engineering the Next-Gen Digital Claims Organisation for Australian General I...
Engineering the Next-Gen Digital Claims Organisation for Australian General I...
 
Profitability in the Direct-to-Consumer Marketplace: A Playbook for Media and...
Profitability in the Direct-to-Consumer Marketplace: A Playbook for Media and...Profitability in the Direct-to-Consumer Marketplace: A Playbook for Media and...
Profitability in the Direct-to-Consumer Marketplace: A Playbook for Media and...
 
Green Rush: The Economic Imperative for Sustainability
Green Rush: The Economic Imperative for SustainabilityGreen Rush: The Economic Imperative for Sustainability
Green Rush: The Economic Imperative for Sustainability
 
Policy Administration Modernization: Four Paths for Insurers
Policy Administration Modernization: Four Paths for InsurersPolicy Administration Modernization: Four Paths for Insurers
Policy Administration Modernization: Four Paths for Insurers
 
The Work Ahead in Utilities: Powering a Sustainable Future with Digital
The Work Ahead in Utilities: Powering a Sustainable Future with DigitalThe Work Ahead in Utilities: Powering a Sustainable Future with Digital
The Work Ahead in Utilities: Powering a Sustainable Future with Digital
 
AI in Media & Entertainment: Starting the Journey to Value
AI in Media & Entertainment: Starting the Journey to ValueAI in Media & Entertainment: Starting the Journey to Value
AI in Media & Entertainment: Starting the Journey to Value
 
Operations Workforce Management: A Data-Informed, Digital-First Approach
Operations Workforce Management: A Data-Informed, Digital-First ApproachOperations Workforce Management: A Data-Informed, Digital-First Approach
Operations Workforce Management: A Data-Informed, Digital-First Approach
 
Five Priorities for Quality Engineering When Taking Banking to the Cloud
Five Priorities for Quality Engineering When Taking Banking to the CloudFive Priorities for Quality Engineering When Taking Banking to the Cloud
Five Priorities for Quality Engineering When Taking Banking to the Cloud
 
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining Focused
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining FocusedGetting Ahead With AI: How APAC Companies Replicate Success by Remaining Focused
Getting Ahead With AI: How APAC Companies Replicate Success by Remaining Focused
 
Crafting the Utility of the Future
Crafting the Utility of the FutureCrafting the Utility of the Future
Crafting the Utility of the Future
 
Utilities Can Ramp Up CX with a Customer Data Platform
Utilities Can Ramp Up CX with a Customer Data PlatformUtilities Can Ramp Up CX with a Customer Data Platform
Utilities Can Ramp Up CX with a Customer Data Platform
 
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...
The Work Ahead in Intelligent Automation: Coping with Complexity in a Post-Pa...
 

Infrastructure Considerations for Analyzing Big Data with Hadoop

  • 1. Infrastructure Considerations for Analytical Workloads By applying Hadoop clusters to big data workloads, organizations can achieve incredible performance gains that can vary based on physical versus virtual infrastructure. Executive Summary On the list of technology industry buzzwords, “big data” is among the most intriguing ones. As data volume, velocity and variety proliferate, and the search for veracity escalates, organiza- tions across industries are placing new bets on various new data sources such as machine sensor data, medical images, financial information, retail sales, radio frequency identification and Web tracking data. This is creating huge challenges for decision-makers to make meaning and untangle trends from input larger than ever before. From a technological perspective, the so-called four V’s of big data (volume, velocity, variety and veracity) make it ever more difficult to process big data on a single system. Even if one disregarded the storage constraints of a single system, and utilized a storage area network (SAN) to store the petabytes of incoming data, processing speed remains a huge bottleneck. Whether a single-core or multi-core processor is used, a single system would take substantially more time to process data than if the data was partitioned across an array of systems used in parallel. That’s not to say that the processing conundrum shouldn’t be confronted and overcome. Big data plays a vital role in improving organizational profitabil- ity, increasing productivity and solving scientific challenges. It also enables decision-makers to understand customer needs, wants and desires, and to see where markets are heading. One of the major technologies that helps orga- nizations make sense of big data is the open source distributed processing framework known as Apache Hadoop. Based on our engagement experiences and through intensive benchmark- ing, this white paper analyzes the infrastructure considerations for running analytical workloads on Hadoop clusters. The primary emphasis is to compare and contrast the physical or virtual infra- structure requirements to support typical business workloads from performance, cost, support and scalability perspectives. Our goal is to arm the reader with the insights necessary for assessing whether physical or virtual infrastructure would best suit your organization’s requirements. cognizant 20-20 insights | april 2016 • Cognizant 20-20 Insights
  • 2. cognizant 20-20 insights 2 Hadoop: A Primer To solve many of the aforementioned big data issues, the Apache Foundation developed Apache Hadoop, a Java-based framework that can be used to process large amounts of data across thousands of computing nodes. It consists of two main components – HDFS1 and MapReduce.2 Hadoop Distributed File System (HDFS) is designed to run on commodity hardware, while MapReduce provides the processing framework for distributed data across thousands of nodes. HDFS shares many attributes with other distribut- ed file systems. However, Hadoop has implement- ed numerous features that allow the file system to be significantly more fault-tolerant than typical hardware solutions such as redundant arrays of inexpensive disks (RAIDs) or data replica- tion alone. What follows is a deep dive into the reasons Hadoop is considered a viable solution for the challenges created by big data. The HDFS components explored are the NameNode and DataNodes (see Figure 1). The MapReduce framework processes large data sets across numerous computer nodes (known as data nodes) where all nodes are on the same local network and use similar hardware. Computational processing can occur on data stored either in a file system (either semi-structured or unstructured) or in a database (structured). MapReduce can take advantage of data locality. In MapReduce version 1, the components are JobTracker and TaskTrack- ers, whereas in MapReduce version 2 (YARN), the components are the ResourceManager and Node- Managers (see next page, Figure 2). Hadoop’s Role Hadoop provides performance enhancements that enable high throughput access to applica- tion data. It also handles streaming access to file system resources, which are increasingly chal- lenging when attempting to manipulate larger data sets. Many of the design considerations can be subdivided into the following categories: • Data asset size. • Transformational challenges. • Decision-making. • Analytics. Hadoop’s ability to integrate data from different sources (databases, social media, etc.), systems (network/machine/sensor logs, geo-spatial data, etc.) and file types (structured, unstructured and semi-structured) enable organizations to respond to business questions such as: • Do you test all of your decisions to compete in the market? • Can new business models be created based on the available data in the organization? • Can you drive new operational efficiencies by modernizing extract, transform and load (ETL) and optimizing batch processing? • How can you harness the hidden value in your data that until now has been archived, discarded or ignored? All applications utilizing HDFS tend to have large data sets that range from gigabytes to petabytes. HDFS has been calibrated to adjust to HDFS Architecture Figure 1 Rack 1 Rack 2 NameNodeClient Read Write DataNodes Block Ops Metadata Ops Replication
  • 3. cognizant 20-20 insights 3 such large data volumes. By providing substan- tial aggregated data bandwidth, HDFS should scale to thousands of nodes per cluster. Hadoop is a highly scalable storage platform because it can store and distribute very large data sets across hundreds of commodity servers operating in parallel. This enables businesses to run their applications on thousands of nodes involving thousands of terabytes of data. In legacy environments, traditional ETL and batch processes can take hours, days or even weeks – in a world where businesses require access to data in minutes or even seconds. Hadoop excels at high-volume batch processing. Because the processing is in parallel, Hadoop is said to perform batch processing multiple times faster than on a single server. Likewise, when Hadoop is used as an enterprise data hub (EDH), it can ease the ETL bottleneck by establishing a single version of truth that can be accessed and transformed by business users without the need for a dedicated infrastructure setup. This makes Hadoop one place to store all data, for as long as desired or required – and in its original fidelity – that is integrated with existing infrastructure and tools. Doing this provides the flexibility to run a variety of enterprise workloads, including batch processing, inter- active SQL, enterprise search and advanced analytics. It also comes with the built-in security, governance, data protection and management that enterprises require. With EDH, leading organizations are changing the way they think about data, transforming it from a cost to an asset. For many enterprises, data streams from all directions. The challenge is to synthesize and quantify it and convert bits and bytes into insights and foresights by applying analytical procedures on the historical data collected. Hadoop enables organizations not only to store the data collected but also to analyze it. With Hadoop, business value can be elevated by: • Mining social media data to determine customer sentiments. • Evaluating Web clickstream data to improve customer segmentation. • Proactively identifying and responding to security breaches. MapReduce v1 YARN* - MapReduce v2 * YARN – Yet Another Resource Negotiator Client Client JobTracker NameNode rr Client Client Resource Manager NameNode Node Manager DataNode Container Node Manager DataNode Container Node Manager DataNode Container Node Manager DataNode Container Node Manager DataNode Application Master Node Manager DataNode Application Master TaskTracker DataNode TaskTracker DataNode TaskTracker DataNode Input Split Map [Combine] Shuffle & Sort Reduce Output MapReduce Logical Data Flow Figure 3 MR vs. YARN Architecture Figure 2
  • 4. • Predicting a customer’s next buy. • Fortifying security and compliance using server/machine logs and analyzing various data sets across multiple data sources. Understanding Hadoop Infrastructure Hadoop can be deployed in either of two environ- ments: • Physical-infrastructure-based. • Virtual-infrastructure-based. Physical Infrastructure for Hadoop Cluster Deployment Hadoop and its associated ecosystem components are deployed on physical machines with large amounts of local storage and memory. Machines are racked and stacked with high-speed network switches. The merits: • Delivers the full benefits of Hadoop’s perfor- mance, especially with locality-aware computa- tion. In the case where a node is too busy to accept additional work, the JobTracker can still schedule work near the node and take advantage of the switch’s bandwidth. • The HDFS file system is persistent over cluster restarts (provided the data on the NameNode is protected and a secondary NameNode exists to keep up with the data, or the high availability has been configured). • When writing files to HDFS, data blocks can be streamed to multiple racks; importantly, if a switch fails or a rack loses power, a copy of the data is still retained. The demerits: • Unless there is enough work to keep the CPUs busy, hardware becomes a depreciating investment, particularly if servers aren’t being used to their full potential – thereby increasing the effective cost of the entire cluster. • The cluster hostnames and IP addresses needs to be copied into /etc/hosts of each server in the cluster, to avoid DNS load. Virtual Infrastructure for Hadoop Cluster Deployment Virtual machines (VMs) are created only up to the duration of the Hadoop cluster. In this approach, a cluster configuration with the NameNode and JobTracker hostnames is created, usually in the same machine for a small cluster. Network rules can ensure that only authorized hosts have access to the master and slave nodes. Persistent data must be kept in an alternate file system to avoid data loss. The merits: • Can be cost-effective as the organization is billed based on the duration of cluster usage; when the cluster is not needed, it can be shut down – thus saving money. • Can scale the cluster up and down on demand. • Some cloud service providers provide a version of Hadoop that is prepackaged, easy and ready- to-use. The demerits: • Prepackaged Hadoop implementations may be older versions or private branches without the code being public. This makes it harder to handle failure. • Startup can be complex, as the hostnames of the master node(s) are not known until they are allocated; configuration files need to be created on demand and then placed in the VMs. • There is no persistent storage except through non-HDFS file systems. • There is no locality in a Hadoop cluster; thus, there is no easy way to determine the location of slave nodes and their relativity to each other. cognizant 20-20 insights 4 Soft Factors Hard Factors Performance optimization parameters External factors Number of maps Environment Number of reducers Number of cores Combiner Memory size Custom serialization The Network Shuffle tweaks Intermediate compression Factors Affecting Hadoop Cluster Performance Figure 4
Hadoop Performance Evaluation
When it comes to Hadoop clusters, performance is critical. These clusters may run on premises in a physical environment, in a virtualized environment, or both. A performance analysis of individual clusters in each environment helps determine the best alternative for achieving the required performance (see Figure 4).

Setup Details and Experiment Results
We compared the performance of a Hadoop cluster running virtually on Amazon Web Services' Elastic MapReduce (AWS EMR) against a similar hard-wired cluster running on internal physical infrastructure. See Figure 5 for the precise configurations. Figure 6 lists the distributions and component versions used on the virtual cluster and on the physical machines for the Hive, Pig and Mahout KMeans Clustering workloads. Figure 7 describes the benchmark data.

Figure 5: A Tale of the Tape: Physical vs. Virtual Machines.
AWS VM sizes*: m1.medium (1 vCPU x 2 GB), 4 nodes; m1.large (1 vCPU x 4 GB), 4 nodes; m1.xlarge (4 vCPU x 16 GB), 4 nodes.
Physical machine sizes: NameNode (4 CPU x 4 GB), 1 node; DataNode (4 CPU x 4 GB), 3 nodes; Client (4 CPU x 8 GB), 1 node. Processor: Intel Core i3-3220 CPU @ 3.30 GHz, 4 cores.
* Instance details may differ with releases.3

Figure 6: Benchmarking Physical and Virtual Machines.
AWS EMR: Apache Hadoop distribution; Hadoop 1.0.3; Pig 0.11.1.1-amzn (rexported); Hive 0.11.0.1; Mahout 0.9.
Physical machines: Cloudera Distribution for Hadoop 4; Hadoop 2.0.0+1518; Pig 0.11.0+36; Hive 0.10.0+214; Mahout 0.7+22.

Figure 7: Data Details.
Requirement: generate 1B records and store them in an S3 bucket/HDFS. Number of columns: 37. Number of files: 50. Records per file: 20 million. File size (each): 2.7 GB. Total data size: 135 GB. Cluster size: 4 nodes (3 DataNodes/TaskTrackers).
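As an illustration of how input such as that in Figure 7 might be staged, the following sketch writes pipe-delimited synthetic records into HDFS through the standard FileSystem API. The path, column layout and single-threaded loop are placeholder simplifications, not the exact generator used in our benchmark; a real generator at this scale would run as a parallel MapReduce job.

    import java.io.PrintWriter;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DataStager {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);     // HDFS, or S3 if so configured

        long recordsPerFile = 20_000_000L;        // per Figure 7
        for (int f = 0; f < 50; f++) {            // 50 files in the benchmark
          Path out = new Path("/bench/input/part-" + f + ".txt");
          try (PrintWriter w = new PrintWriter(fs.create(out))) {
            for (long r = 0; r < recordsPerFile; r++) {
              // 37 pipe-delimited columns per record; values are synthetic.
              StringBuilder row = new StringBuilder();
              for (int c = 0; c < 37; c++) {
                row.append("col").append(c).append('_').append(r).append('|');
              }
              w.println(row);
            }
          }
        }
        fs.close();
      }
    }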
This benchmark transformed raw data into a standard format using big data tools such as Hive Query Language (HiveQL) and Pig Latin, starting at 40 million records and scaling to 1 billion records. Alongside this, Mahout (the machine learning library for Hadoop) was run for KMeans Clustering of the data, creating five clusters with a maximum of eight iterations, on m1.large (1 vCPU x 4 GB memory), m1.xlarge (4 vCPU x 15.3 GB memory) and physical machines (4 CPU x 4 GB memory). The input data was placed in HDFS for the physical machines and on AWS S3 for AWS EMR.
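For reference, the sketch below shows how such a K-Means run can be driven from Java against the Mahout 0.7/0.9-era API: seed k = 5 random centroids, then iterate at most eight times. Paths are placeholders, the input must already be vectorized into Mahout vector sequence files, and the KMeansDriver.run signature should be verified against the Mahout release in use, as it has changed between versions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

    public class KMeansSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path vectors = new Path("/bench/vectors");  // vectorized input (placeholder)
        Path seeds   = new Path("/bench/seeds");    // initial centroid directory
        Path output  = new Path("/bench/kmeans-out");

        // Pick k = 5 random records as the initial cluster centers.
        Path clustersIn = RandomSeedGenerator.buildRandom(
            conf, vectors, seeds, 5, new EuclideanDistanceMeasure());

        // Iterate at most 8 times, as in the benchmark. 0.001 is the
        // convergence delta; the final flags request classification of
        // points (true) and MapReduce rather than sequential execution (false).
        KMeansDriver.run(conf, vectors, clustersIn, output,
            0.001, 8, true, 0.0, false);
      }
    }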
Resulting Graphs
Figure 8 shows how the clusters performed on the Hive transformation in the physical and virtual environments. Both workloads took almost the same time for smaller data sets (roughly 40 million to 80 million records); as data sizes grew, the physical machines gradually outperformed EMR's m1.large cluster.

Figure 8: Hive Transformation (PM vs. VM). Execution time in seconds versus number of records (40M to 1B) for AWS EMR (m1.large) and physical machines.

Figure 9, which compares physical machines (PM) against virtual machines (VM) for the Pig transformation, shows that the EMR cluster executing the Pig Latin script on 40 million records takes longer than the same script running on physical machines. As data sizes increase, the gap between physical and virtual infrastructure widens to the point where the physical machines execute significantly faster.

Figure 9: Pig Transformation (PM vs. VM). Execution time in seconds versus number of records (40M to 320M) for AWS EMR (m1.large) and physical machines.

Figure 10 shows the time taken for all four operations on a data set of 320 million records, running various Hive queries and Pig scripts to compare their performance. With the exception of the Hive transformation, the operations are faster on physical than on virtual infrastructure.

Figure 10: PM vs. VM (for 320M records). Time in seconds for Pig (transformation), Hive (transformation), Hive (Query-2) and Hive (Query-3) on physical and virtual machines.

Figure 11 compares the gradual increase in execution time with increasing data sizes. Here, the Pig scripts execute faster on physical machines than on virtual machines.

Figure 11: Pig/Hive Transformation (PM vs. VM). Execution time in seconds versus number of records (40M to 320M) on m1.large and physical machines.

Figure 12 shows the time taken by Hive queries on physical and virtual machines for various data sizes. Again, the physical machines perform much faster than the virtual ones.

Figure 12: Hive Query-2 and Query-3 (PM vs. VM). Execution time in seconds versus number of records (40M to 1B) on m1.large and physical machines.

Figure 13: PM vs. VM, Mahout K-Means. Execution time in seconds for 1M, 2M, 4M and 6M records: VM (1 x 4): 455.27, 532.37, 736.65, 750.82; PM: 73.64, 91.06, 139.37, 180.86; VM (4 x 15): 8.1, 8.62, 10.21, 11.93.
Figure 13 displays the K-Means Clustering performance on physical infrastructure, on m1.large virtual infrastructure (1 core x 4 GB memory) and on m1.xlarge virtual infrastructure (4 cores x 15 GB memory). In this test, the best performance was clocked on the m1.xlarge cluster; the performance achieved thus depends significantly on the memory available for the run. In this case, the ease with which virtual machines scale up memory drove their performance advantage over physical machines.

In our experiment, we observed that AWS EMR instances up to m1.large perform significantly slower than a comparable physical environment, whereas the m1.xlarge instance, with its larger memory capacity, performed faster than the physical machines.

In sum, Hadoop MapReduce jobs are IO-bound and, generally speaking, virtualization will not help organizations boost performance. Hadoop takes advantage of sequential disk IO, for example by using larger block sizes, while virtualization works on the notion that multiple "machines" do not need full physical resources at all times. IO-intensive data processing applications that operate on dedicated storage are therefore best left non-virtualized. For a large job, adding more TaskTrackers to the cluster helps boost computational speed, but physical machines offer no flexibility for adding or removing nodes on demand.

Moving Forward
Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. It is important to understand your workload and the principles involved in hardware selection (e.g., blades and SANs are preferred for grid and processing-intensive workloads). Based on the findings of our benchmark study, we recommend that organizations keep in mind the following infrastructure considerations:

• If your application depends on performance, has a long lifecycle and exhibits regular data growth, a physical machine is the better option: it performs better, its deployment cost is a one-time expense, and regular data growth means there may be no need for highly scalable infrastructure.
• If your application has a balanced workload, is cost-sensitive, exhibits exponential data growth and requires support, virtual machines can prove safer, as the CPU is well utilized and the memory is scalable. They are also the more cost-efficient option, with a flexible pay-per-use policy, and the VM environment scales easily when adding or deleting DataNodes/TaskTrackers/NodeManagers (see the resizing sketch after this list).
• If your application depends on performance, has to be cost-efficient, and its data growth is regular but support is required, virtual machines can be the better choice.
• If your application requires high performance and its data growth is exponential, with no support required, virtual machines with higher memory are the better choice.
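As an example of the elasticity argument above, the sketch below resizes the core instance group of a running EMR cluster with the AWS SDK for Java. It uses a later SDK's client builder than the tooling current when these benchmarks were run, and the instance group ID is a placeholder.

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.InstanceGroupModifyConfig;
    import com.amazonaws.services.elasticmapreduce.model.ModifyInstanceGroupsRequest;

    public class ResizeCluster {
      public static void main(String[] args) {
        AmazonElasticMapReduce emr =
            AmazonElasticMapReduceClientBuilder.defaultClient();

        // Grow the core (DataNode/worker) group to eight instances.
        // "ig-XXXXXXXX" is a placeholder for the cluster's instance group ID.
        emr.modifyInstanceGroups(new ModifyInstanceGroupsRequest()
            .withInstanceGroups(new InstanceGroupModifyConfig()
                .withInstanceGroupId("ig-XXXXXXXX")
                .withInstanceCount(8)));
      }
    }

An equivalent change on a physical cluster would mean procuring, racking and commissioning new servers, which is the scalability gap summarized in Figure 14.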
Figure 14: Characteristic Differences Between Physical and Virtual Infrastructure.
• Performance: Comparing physical and virtual machines of the same configuration, the physical machines deliver higher performance; with increased memory, however, a VM can perform better.
• Scalability: Commissioning and decommissioning cluster nodes on physical machines can prove an expensive affair compared with provisioning VMs as needed; scalability can therefore be highly limited with physical machines.
• Cost: Provisioning physical machines incurs higher cost than virtual machines, where creating a VM can be as simple as cloning an existing VM instance and assigning it a unique identity.
• Resource utilization: The processor utilization of physical machines is less than 20%, though the remainder is available for use. In virtual machines, the CPU is utilized to the fullest, but with a high chance of CPU overhead leading to lower performance.

During the course of our investigation, we found that the commodity systems, while antiquated and less responsive, performed significantly better with our implementation than customary virtual machine implementations using standard hypervisors. From these results, we observe that virtual Hadoop cluster performance is significantly lower than that of a cluster running on physical machines, owing to the overhead that virtualization imposes on the CPU of the physical host. Anything that offsets this virtualization overhead, such as provisioning virtual machines with larger memory, will boost performance.
Footnotes
1 HDFS: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
2 MapReduce: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
3 AWS details: http://aws.amazon.com/ec2/previous-generation/.

About the Authors
Apsara Radhakrishnan is an Associate on the Decision Science Team within Cognizant Analytics. She has three years of experience in big data technology, focused on ETL in the Hadoop environment, its administration and AWS analytics products. She holds a master's degree in computer applications from Visvesvaraya Technological University. Apsara can be reached at Apsara.Radhakrishnan@cognizant.com.

Harish Chauhan is Principal Consultant, Cloud Services, within Cognizant Infrastructure Services. He has over 24 years of IT experience, numerous technical publications to his credit, and two coauthored patents in the area of virtualization, one of which was issued in January 2015. His white paper "Harnessing Hadoop" was released in 2013. His areas of specialization include distributed computing (Hadoop/big data/HPC), cloud computing (private cloud technologies), virtualization/containerization and system management/monitoring. Harish has worked in many areas, including infrastructure management, product engineering, consulting/assessment, advisory services and pre-sales. He holds a bachelor's degree in computer science and engineering. In his current role, Harish is responsible for capability building on emerging trends and technologies such as big data/Hadoop, cloud computing/virtualization, private clouds and mobility. He can be reached at Harish.Chauhan@cognizant.com.

About Cognizant
Cognizant (NASDAQ: CTSH) is a leading provider of information technology, consulting and business process outsourcing services, dedicated to helping the world's leading companies build stronger businesses. Headquartered in Teaneck, New Jersey (U.S.), Cognizant combines a passion for client satisfaction, technology innovation, deep industry and business process expertise, and a global, collaborative workforce that embodies the future of work. With over 100 development and delivery centers worldwide and approximately 221,700 employees as of December 31, 2015, Cognizant is a member of the NASDAQ-100, the S&P 500, the Forbes Global 2000 and the Fortune 500, and is ranked among the top performing and fastest growing companies in the world. Visit us online at www.cognizant.com or follow us on Twitter: Cognizant.

World Headquarters: 500 Frank W. Burr Blvd., Teaneck, NJ 07666 USA. Phone: +1 201 801 0233. Fax: +1 201 801 0243. Toll free: +1 888 937 3277. Email: inquiry@cognizant.com.
European Headquarters: 1 Kingdom Street, Paddington Central, London W2 6BD. Phone: +44 (0) 20 7297 7600. Fax: +44 (0) 20 7121 0102. Email: infouk@cognizant.com.
India Operations Headquarters: #5/535, Old Mahabalipuram Road, Okkiyam Pettai, Thoraipakkam, Chennai 600 096, India. Phone: +91 (0) 44 4209 6000. Fax: +91 (0) 44 4209 6060. Email: inquiryindia@cognizant.com.

© Copyright 2016, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the express written permission of Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.

TL Codex 1732