Big Data infrastructure and technology are being implemented across industries such as banking, retail, insurance, healthcare and media. Big Data management functions such as storage, sorting, processing and analysis of such colossal volumes cannot be handled by existing database systems or technologies. This is where frameworks come into the picture. Frameworks are toolsets that offer innovative, cost-effective solutions to the problems posed by Big Data processing; they help provide insights, incorporate metadata and aid decision making aligned to business needs.
2. Introduction
There are 3 V's that are vital for classifying data as Big
Data. These are Volume, Velocity and Variety.
Volume:
Data volumes are measured in terabytes, petabytes and beyond.
Velocity:
Velocity concerns the high speed of data movement, such as
real-time data streaming at a rapid rate, in microseconds.
Variety:
Variety involves the handling approach for both structured
and unstructured data.
3. Think About It
Implementation of Big Data infrastructure and technology
can be seen in various industries like banking,
retail, insurance, healthcare, media, etc.
Big Data management functions like storage, sorting,
processing and analysis for such colossal volumes cannot be
handled by the existing database systems or technologies.
4. Many frameworks currently exist in this space. Some of
the popular ones are Spark, Hadoop, Hive and Storm.
Some, like Presto, score high on the utility index, while frameworks like Flink
have great potential.
Others that deserve a mention include Samza, Impala,
Apache Pig, etc.
Some of these frameworks are briefly discussed below.
5. Apache Hadoop
Hadoop is a Java-based platform created by Mike Cafarella and Doug
Cutting.
This open-source framework provides batch data processing as well
as data storage services across a group of hardware machines
arranged in clusters.
Hadoop consists of multiple layers, like HDFS and YARN, that work
together to carry out data processing.
6. HDFS (Hadoop Distributed File System) is the storage layer that
coordinates data replication and storage activities
across the nodes of a cluster. In the event of a cluster node
failure, the data can still be made available for processing.
YARN (Yet Another Resource Negotiator) is the layer responsible
for resource management and job scheduling.
MapReduce is the software layer that functions as the batch
processing engine.
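
The batch model behind MapReduce can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce phases on a single machine, not Hadoop's actual API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are assumptions made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over a whole batch of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    batch = ["big data big insights", "data at rest"]
    print(word_count(batch))
```

In a real cluster the map and reduce steps would run on many nodes in parallel, with intermediate results written to disk between phases.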
7. Pros and Cons
Pros: cost-effective solution, high throughput, multi-language
support, compatibility with most emerging technologies in Big Data
services, high scalability, fault tolerance, better suited for R&D,
high availability through an excellent failure handling mechanism.
Cons: vulnerability to security breaches, does not perform
in-memory computation and hence suffers processing overheads,
not suited for stream processing and real-time processing,
issues in processing small files in large numbers.
8. Apache Spark
It is a batch processing framework with enhanced stream
processing capabilities.
With full in-memory computation and processing optimisation, it
promises a lightning-fast cluster computing system.
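
To make the in-memory idea concrete, here is a minimal single-machine sketch in plain Python. It is not Spark code: `LazyDataset` is a hypothetical class invented for this example. It records transformations, runs them only when a result is requested, and keeps that result in memory so repeated reads do no further work.

```python
from typing import Any, Callable, Iterable, List, Optional


class LazyDataset:
    """Hypothetical stand-in for a distributed dataset; not a Spark class."""

    def __init__(self, source: Iterable[Any],
                 steps: Optional[List[Callable[[List[Any]], List[Any]]]] = None) -> None:
        self._source = list(source)
        self._steps = steps or []
        self._cache: Optional[List[Any]] = None  # filled on first collect()

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Record the transformation; nothing is computed yet.
        return LazyDataset(self._source,
                           self._steps + [lambda rows: [fn(r) for r in rows]])

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Record the filter; nothing is computed yet.
        return LazyDataset(self._source,
                           self._steps + [lambda rows: [r for r in rows if pred(r)]])

    def collect(self) -> List[Any]:
        # Run the recorded steps once, entirely in memory, then reuse the result.
        if self._cache is None:
            rows = self._source
            for step in self._steps:
                rows = step(rows)
            self._cache = list(rows)
        return list(self._cache)


if __name__ == "__main__":
    squares_of_evens = (LazyDataset(range(10))
                        .map(lambda x: x * x)
                        .filter(lambda x: x % 2 == 0))
    print(squares_of_evens.collect())
```

The contrast with a disk-based batch engine is that no intermediate result leaves memory between the `map` and `filter` steps.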
9. The Spark framework is composed of five layers.
HDFS and HBase: They form the first layer, the data storage
systems.
YARN and Mesos: They form the resource management layer.
Core engine: This forms the third layer.
Library: This forms the fourth layer, containing Spark SQL for SQL
queries and stream processing, GraphX and SparkR utilities for
processing graph data, and MLlib for machine learning algorithms.
The fifth layer contains application programming interfaces for
languages such as Java or Scala.
10. Pros and Cons
Pros: scalability, lightning processing speeds through a
reduced number of I/O operations to disk, fault tolerance, support
for advanced analytics applications with superior AI implementation,
and seamless integration with Hadoop.
Cons: complexity of setup and implementation, limited language
support, not a genuine streaming engine.
11. Storm
It is a platform-independent application development framework that can be used
with any programming language and guarantees delivery of data with
the least latency.
The Storm architecture has two types of node:
the Master Node and the Worker/Supervisor Node. The master node
monitors machines for failures and is responsible for task
allocation. If a node in the cluster fails, its task is reassigned to
another one.
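
The master/worker relationship described above can be illustrated with a small plain-Python sketch. It does not use Storm's API; the `MasterNode` class, its methods and the least-loaded assignment policy are all assumptions chosen for illustration.

```python
from typing import Dict, List


class MasterNode:
    """Hypothetical master that tracks which worker holds which tasks."""

    def __init__(self, workers: List[str]) -> None:
        self.assignments: Dict[str, List[str]] = {w: [] for w in workers}

    def _least_loaded(self) -> str:
        # Assumed policy: pick the worker currently holding the fewest tasks.
        return min(self.assignments, key=lambda w: len(self.assignments[w]))

    def assign(self, task: str) -> str:
        """Allocate a task to a worker and return that worker's name."""
        worker = self._least_loaded()
        self.assignments[worker].append(task)
        return worker

    def handle_failure(self, failed: str) -> None:
        """Drop a failed worker and reassign its tasks to the survivors."""
        orphaned = self.assignments.pop(failed)
        if not self.assignments:
            raise RuntimeError("no workers left to take over the tasks")
        for task in orphaned:
            self.assign(task)


if __name__ == "__main__":
    master = MasterNode(["worker-1", "worker-2", "worker-3"])
    for name in ["t1", "t2", "t3", "t4"]:
        master.assign(name)
    master.handle_failure("worker-1")
    print(master.assignments)
```

After the simulated failure, every task that was on `worker-1` is held by one of the remaining workers, so no work is lost.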
12. Pros and Cons
Pros: ease of setup and operation, high scalability, good
speed, fault tolerance, support for a wide range of languages.
Cons: complex implementation, debugging issues and not very
learner-friendly.
13. Apache Flink
Apache Flink, an open-source framework, is equally good for both batch
and stream data processing.
It is suited for cluster environments. It is based on the concept of
streams and transformations.
It has also been described as the 4G of Big Data, and is claimed to be
up to 100 times faster than Hadoop MapReduce.
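
The streams-and-transformations idea can be sketched with Python generators. This is not Flink's API; the function names are hypothetical. It shows one way to read the model: each entry flows through the transformations one at a time, and a batch is simply a stream that happens to end.

```python
from itertools import count, islice
from typing import Callable, Iterable, Iterator


def transform(stream: Iterable[int], fn: Callable[[int], int]) -> Iterator[int]:
    """Apply fn to each entry as it arrives, without waiting for the rest."""
    for entry in stream:
        yield fn(entry)


def running_total(stream: Iterable[int]) -> Iterator[int]:
    """Emit an updated total after every single entry."""
    total = 0
    for entry in stream:
        total += entry
        yield total


def pipeline(stream: Iterable[int]) -> Iterator[int]:
    """The same chain of transformations serves batch and streaming input."""
    return running_total(transform(stream, lambda x: x * 10))


if __name__ == "__main__":
    # Batch: a bounded list of entries.
    print(list(pipeline([1, 2, 3])))
    # Stream: an unbounded source; only the first three results are taken.
    print(list(islice(pipeline(count(1)), 3)))
```

Both calls run the identical pipeline; only the source differs.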
15. Pros and Cons
Pros: low latency, high throughput, fault tolerance,
entry-by-entry processing, ease of batch and stream
data processing, compatibility with Hadoop.
Cons: a few scalability issues.
16. Hive
Apache Hive, designed by Facebook, is an ETL (Extract/Transform/
Load) and data warehousing system. It is built on top of the Hadoop
HDFS platform.
The key components of the Hive architecture include:
Deploy Layer
Runtime Layer
17. The key components of the Hive architecture include:
Hive Clients
Hive Services
Hive Storage and Computing
The Hive engine converts SQL queries or requests into MapReduce
task chains. The engine comprises:
Parser: It goes through the incoming SQL requests and sorts them.
Optimizer: It goes through the sorted requests and optimises them.
Executor: It sends tasks to the MapReduce framework.
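
The parser, optimizer and executor stages can be pictured with a toy pipeline in plain Python. This is a heavily simplified sketch, not Hive's implementation: it accepts only one query shape (`SELECT col, COUNT(*) FROM table GROUP BY col`), the parser here only validates and breaks down the query, and every function name and the in-memory `tables` dictionary are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Dict, List

Row = Dict[str, str]

# Assumed query shape: SELECT <col>, COUNT(*) FROM <table> GROUP BY <col>
QUERY_PATTERN = (r"SELECT\s+(\w+)\s*,\s*COUNT\(\*\)\s+FROM\s+(\w+)"
                 r"\s+GROUP\s+BY\s+(\w+)")


def parse(query: str) -> Dict[str, str]:
    """Parser stage: turn the SQL text into a small plan dictionary."""
    match = re.fullmatch(QUERY_PATTERN, query.strip(), flags=re.IGNORECASE)
    if match is None or match.group(1) != match.group(3):
        raise ValueError("unsupported query for this sketch")
    return {"table": match.group(2), "group_by": match.group(1)}


def optimize(plan: Dict[str, str]) -> Dict[str, object]:
    """Optimizer stage: note that only the grouping column needs reading."""
    return {**plan, "columns": [plan["group_by"]]}


def execute(plan: Dict[str, object], tables: Dict[str, List[Row]]) -> Dict[str, int]:
    """Executor stage: run the plan as a map task followed by a reduce task."""
    rows = tables[str(plan["table"])]
    mapped = [(row[str(plan["group_by"])], 1) for row in rows]  # map task
    counts: Counter = Counter()
    for key, one in mapped:                                     # reduce task
        counts[key] += one
    return dict(counts)


def run_query(query: str, tables: Dict[str, List[Row]]) -> Dict[str, int]:
    return execute(optimize(parse(query)), tables)


if __name__ == "__main__":
    tables = {"orders": [{"city": "Pune"}, {"city": "NYC"}, {"city": "Pune"}]}
    print(run_query("SELECT city, COUNT(*) FROM orders GROUP BY city", tables))
```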
18. Pros and Cons
Pros: low latency, high throughput, fault tolerance,
entry-by-entry processing, ease of batch and stream
data processing, compatibility with Hadoop.
Cons: a few scalability issues.
19. Presto
Presto is an open-source distributed SQL tool most suited for smaller
datasets of up to 3 TB. The Presto engine includes a coordinator and multiple
workers.
When a client submits queries, the coordinator parses and analyses them,
plans their execution and distributes them among the workers
for processing.
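
The coordinator/worker split can be sketched in plain Python with a thread pool standing in for the workers. This is not Presto's API; `plan_splits`, `worker_partial_sum` and `coordinator_sum` are hypothetical names, and the "query" is reduced to summing a column so the flow stays visible.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def worker_partial_sum(split: Sequence[int]) -> int:
    """Worker: process only the split of rows it was handed."""
    return sum(split)


def plan_splits(rows: Sequence[int], n_workers: int) -> List[Sequence[int]]:
    """Coordinator planning step: cut the rows into at most n_workers splits."""
    size = max(1, -(-len(rows) // n_workers))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def coordinator_sum(rows: Sequence[int], n_workers: int = 3) -> int:
    """Coordinator: plan, distribute the splits, then merge partial results."""
    splits = plan_splits(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    return sum(partials)


if __name__ == "__main__":
    column = list(range(1, 11))
    print(coordinator_sum(column, n_workers=3))
```

The coordinator never touches individual rows itself; it only plans the splits and combines what the workers return.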
20. Pros and Cons
Pros: least query degradation even in the event
of an increased concurrent query workload. It has a query
execution rate that is three times faster than Hive. Ease
in adding images and embedding links. Highly user-friendly.
Cons: reliability issues.
21. Impala
Impala is an open-source MPP (Massively Parallel Processing) query
engine that runs on multiple systems under a Hadoop cluster.
It is written in C++ and Java.
22. It is not coupled with its storage engine. It includes 3 main
components:
Impala Daemon (Impalad): It is executed on every
node where Impala is installed.
Impala StateStore
Impala MetaStore
Impala has its own SQL-like query language.
23. Pros and Cons
Pros: supports in-memory computation and hence accesses
data directly from Hadoop nodes without data movement,
smooth integration with BI tools like Tableau, ZoomData,
etc., support for a wide range of file formats.
Cons: no support for serialisation and deserialisation of data,
inability to read custom binary files, table refresh needed
for every record addition.
Content Source: Cuelogic Blog