Presented by Cuelogic Technologies
BIG DATA FRAMEWORKS
Introduction
Three V's are vital for classifying data as Big Data: Volume, Velocity and Variety.
Volume:
Data volumes are measured in terabytes, petabytes and beyond.
Velocity:
Velocity refers to the high speed of data movement, such as
real-time data streaming at a rapid rate, down to microseconds.
Variety:
Variety covers the handling approach for both structured
and unstructured data.
THINK ABOUT IT
Big Data infrastructure and technology are implemented across
industries such as banking, retail, insurance, healthcare and media.
Big Data management functions such as storage, sorting,
processing and analysis of such colossal volumes cannot be
handled by traditional database systems or technologies.
Many frameworks exist in this space. Some of the popular ones are
Spark, Hadoop, Hive and Storm.
Some, like Presto, score high on utility, while frameworks like Flink
have great potential.
Others that deserve a mention include Samza, Impala and Apache Pig.
Some of these frameworks are briefly discussed below.
Apache Hadoop
Hadoop is a Java-based platform created by Mike Cafarella and Doug
Cutting.
This open-source framework provides batch data processing as well
as data storage services across a group of hardware machines
arranged in clusters.
Hadoop consists of multiple layers, such as HDFS and YARN, that work
together to carry out data processing.
HDFS (Hadoop Distributed File System) is the storage layer that
coordinates data replication and storage activities
across the nodes of a cluster. In the event of a cluster node
failure, the data can still be made available for processing.
YARN (Yet Another Resource Negotiator) is the layer responsible
for resource management and job scheduling.
MapReduce is the software layer that functions as the batch
processing engine.
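The batch model behind MapReduce can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle and reduce steps running on one machine; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and are not part of the Hadoop API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

In a real cluster the map and reduce steps run on different nodes and the shuffle moves data across the network; here everything runs in one process to keep the idea visible.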
Pros:
Cost-effective solution, high throughput, multi-language
support, compatibility with most emerging technologies in Big Data
services, high scalability, fault tolerance, better suited for R&D,
high availability through an excellent failure handling mechanism.
Cons:
Vulnerability to security breaches; no in-memory computation,
hence processing overheads; not suited for stream processing
and real-time processing; issues in processing small files in
large numbers.
Apache Spark
Spark is a batch processing framework with enhanced stream
processing capabilities.
With full in-memory computation and processing optimisation, it
promises a lightning-fast cluster computing system.
The Spark framework is composed of five layers.
HDFS and HBase: They form the first layer, the data storage
systems.
YARN and Mesos: They form the second layer, resource management.
Core engine: This forms the third layer.
Library: This forms the fourth layer, containing Spark SQL for SQL
queries, Spark Streaming for stream processing, GraphX for
processing graph data, SparkR utilities and MLlib for machine
learning algorithms.
API: The fifth layer contains application programming interfaces
for languages such as Java or Scala.
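The benefit of in-memory computation, fewer reads from disk when an intermediate result is reused, can be illustrated with a small sketch. The `SketchDataset` class and the `read_from_disk` loader below are hypothetical stand-ins written for this example; they are not Spark's API, and the "disk" is simulated with a counter.

```python
from typing import Callable, Dict, List, Optional

# Counts how often the simulated "disk" is read.
reads: Dict[str, int] = {"count": 0}


def read_from_disk() -> List[int]:
    """Hypothetical loader standing in for an expensive read from storage."""
    reads["count"] += 1
    return [1, 2, 3, 4]


class SketchDataset:
    """Toy lazily evaluated dataset with optional in-memory persistence."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute
        self._keep = False
        self._cache: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "SketchDataset":
        # Lazy: nothing is computed until collect() is called.
        return SketchDataset(lambda: [fn(row) for row in self.collect()])

    def persist(self) -> "SketchDataset":
        # Ask for the rows to be kept in memory after the first computation.
        self._keep = True
        return self

    def collect(self) -> List[int]:
        if self._cache is not None:
            return list(self._cache)
        rows = self._compute()
        if self._keep:
            self._cache = rows
        return list(rows)


base = SketchDataset(read_from_disk).map(lambda x: x * 10).persist()
total = sum(base.collect())                       # first use reads the "disk"
plus_one = base.map(lambda x: x + 1).collect()    # second use hits memory
print(total, plus_one, reads["count"])
```

Without the `persist()` call, the second computation would trigger another read, which is the kind of repeated disk I/O that a disk-based batch engine incurs between steps.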
Pros:
Scalability, lightning processing speeds through a
reduced number of I/O operations to disk, fault tolerance,
support for advanced analytics applications with superior AI
implementation, and seamless integration with Hadoop.
Cons:
Complexity of setup and implementation, limited language
support, not a genuine streaming engine.
Storm
Storm is a platform-independent application development framework
that can be used with any programming language and guarantees
delivery of data with very low latency.
The Storm architecture has two types of nodes: the Master Node and
the Worker/Supervisor Node. The master node monitors machine
failures and is responsible for task allocation. If a node in the
cluster fails, its tasks are reassigned to another node.
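The master node's role, allocating tasks and reassigning them after a failure, can be sketched as follows. The `Master` class, its methods and the worker and task names are assumptions made up for this illustration; they do not correspond to Storm's own classes or daemons.

```python
from typing import Dict, List


class Master:
    """Toy master node that tracks which worker runs each task."""

    def __init__(self, workers: List[str]) -> None:
        self.alive: List[str] = list(workers)
        self.assignments: Dict[str, str] = {}

    def assign(self, task: str) -> str:
        """Give the task to the live worker that currently has the fewest tasks."""
        if not self.alive:
            raise RuntimeError("no live workers available")
        load = {worker: 0 for worker in self.alive}
        for worker in self.assignments.values():
            if worker in load:
                load[worker] += 1
        chosen = min(self.alive, key=lambda worker: load[worker])
        self.assignments[task] = chosen
        return chosen

    def report_failure(self, worker: str) -> List[str]:
        """Mark a worker as failed and move its tasks to the remaining workers."""
        self.alive.remove(worker)
        orphaned = [t for t, w in self.assignments.items() if w == worker]
        for task in orphaned:
            self.assign(task)
        return orphaned


master = Master(["w1", "w2", "w3"])
for name in ["spout-a", "bolt-b", "bolt-c", "bolt-d"]:
    master.assign(name)
moved = master.report_failure("w1")
print(moved, master.assignments)
```

A production scheduler would also detect failures itself (for example through heartbeats) rather than being told; that part is left out of the sketch.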
Pros:
Ease of setup and operation, high scalability, good
speed, fault tolerance, support for a wide range of languages.
Cons:
Complex implementation, debugging issues and not very
learner-friendly.
Apache Flink
Apache Flink, an open-source framework, is equally good for both batch
and stream data processing.
It is suited for cluster environments and is based on the concept of
transformations applied to streams.
It is sometimes described as the 4G of Big Data and is claimed to be
up to 100 times faster than Hadoop MapReduce.
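The idea of transformations over streams, processed entry by entry, can be sketched with Python generators. The function names below are hypothetical and this is not the Flink API; the point is only that the same pipeline handles a finite batch and an unbounded stream.

```python
from itertools import count, islice
from typing import Callable, Iterable, Iterator


def map_stream(fn: Callable[[int], int], stream: Iterable[int]) -> Iterator[int]:
    """Apply fn to each entry as it arrives."""
    for item in stream:
        yield fn(item)


def filter_stream(pred: Callable[[int], bool], stream: Iterable[int]) -> Iterator[int]:
    """Pass on only the entries for which pred is true."""
    for item in stream:
        if pred(item):
            yield item


def run_pipeline(readings: Iterable[int]) -> Iterator[int]:
    """Chain two transformations: double each reading, keep values above 50."""
    doubled = map_stream(lambda value: value * 2, readings)
    return filter_stream(lambda value: value > 50, doubled)


# A bounded input (a "batch") is just a stream that ends.
batch_result = list(run_pipeline([10, 30, 40]))

# An unbounded input is consumed lazily; take the first three results only.
stream_result = list(islice(run_pipeline(count(1)), 3))

print(batch_result, stream_result)
```

Because each entry flows through the whole chain before the next one is read, nothing has to wait for the input to finish, which is what makes low-latency stream processing possible.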
The Flink system contains multiple layers:
Deploy Layer
Runtime Layer
Library Layer
Pros:
Low latency, high throughput, fault tolerance,
entry-by-entry processing, ease of batch and stream
data processing, compatibility with Hadoop.
Cons:
A few scalability issues.
Hive
Apache Hive, designed by Facebook, is an ETL (Extract/Transform/
Load) and data warehousing system. It is built on top of the Hadoop
HDFS platform.
The key components of the Hive architecture include:
Hive Clients
Hive Services
Hive Storage and Computing
The Hive engine converts SQL queries or requests into MapReduce
task chains. The engine comprises:
Parser: It goes through the incoming SQL requests and sorts them.
Optimizer: It goes through the sorted requests and optimises them.
Executor: It sends tasks to the MapReduce framework.
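The parser, optimizer and executor sequence can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the sketch accepts only one narrow query shape, its "optimizer" applies a single fixed rule (filter in the map stage, before grouping), and none of the names come from Hive itself.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

QUERY_PATTERN = re.compile(
    r"SELECT\s+(\w+)\s*,\s*COUNT\(\*\)\s+FROM\s+(\w+)"
    r"(?:\s+WHERE\s+(\w+)\s*=\s*'([^']*)')?"
    r"\s+GROUP\s+BY\s+(\w+)\s*;?\s*$",
    re.IGNORECASE,
)


def parse(sql: str) -> dict:
    """Parser: turn one supported query shape into a plan dictionary."""
    match = QUERY_PATTERN.match(sql.strip())
    if match is None:
        raise ValueError("unsupported query for this sketch")
    column, table, where_col, where_val, group_col = match.groups()
    if column.lower() != group_col.lower():
        raise ValueError("selected column must match the GROUP BY column")
    condition: Optional[Tuple[str, str]] = (where_col, where_val) if where_col else None
    return {"table": table, "group_by": column, "filter": condition}


def optimize(plan: dict) -> List[dict]:
    """Optimizer: place the filter in the map stage so fewer rows are grouped."""
    return [
        {"stage": "map", "filter": plan["filter"], "emit_key": plan["group_by"]},
        {"stage": "reduce", "aggregate": "count"},
    ]


def execute(stages: List[dict], rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Executor: run the map stage, group by key, then count per key."""
    map_stage = stages[0]
    groups: Dict[str, List[int]] = defaultdict(list)
    for row in rows:
        condition = map_stage["filter"]
        if condition is not None and row.get(condition[0]) != condition[1]:
            continue
        groups[row[map_stage["emit_key"]]].append(1)
    return {key: len(values) for key, values in groups.items()}


visits = [
    {"city": "Pune", "device": "mobile"},
    {"city": "Pune", "device": "desktop"},
    {"city": "NYC", "device": "mobile"},
    {"city": "Pune", "device": "mobile"},
]
query = "SELECT city, COUNT(*) FROM visits WHERE device = 'mobile' GROUP BY city"
result = execute(optimize(parse(query)), visits)
print(result)
```

The real engine handles full query syntax and distributes the stages across a cluster; the sketch only shows how a declarative query ends up as a chain of map and reduce steps.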
Pros:
Low latency, high throughput, fault tolerance,
entry-by-entry processing, ease of batch and stream
data processing, compatibility with Hadoop.
Cons:
A few scalability issues.
Presto
Presto is an open-source distributed SQL tool best suited for smaller
datasets of up to 3 TB. The Presto engine includes a coordinator and
multiple workers.
When a client submits queries, the coordinator parses and analyses
them, plans their execution and distributes the work among the
workers for processing.
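The coordinator and worker split can be sketched with a thread pool standing in for the workers. The function names are hypothetical and the "query" is just a sum; this is not Presto's API, only an illustration of planning splits, distributing them and combining partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def plan_splits(rows: List[int], worker_count: int) -> List[List[int]]:
    """Coordinator step: divide the input into one split per worker."""
    return [rows[i::worker_count] for i in range(worker_count)]


def worker_partial_sum(split: List[int]) -> int:
    """Worker step: compute a partial result for one split."""
    return sum(split)


def coordinator_sum(rows: List[int], worker_count: int = 3) -> int:
    """Plan the splits, hand them to workers, then combine the partial results."""
    splits = plan_splits(rows, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    return sum(partials)


if __name__ == "__main__":
    print(coordinator_sum(list(range(1, 101))))
```

In a distributed engine the workers are separate machines and the splits are chunks of stored data, but the division of responsibility is the same: one node plans and combines, many nodes compute.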
Pros:
Least query degradation even in the event of increased
concurrent query workload. It has a query execution rate
that is three times faster than Hive. Ease in adding images
and embedding links. Highly user-friendly.
Cons:
Reliability issues.
Impala
Impala is an open-source MPP (Massively Parallel Processing) query
engine that runs on multiple systems in a Hadoop cluster.
It is written in C++ and Java.
It is not coupled with its storage engine. It includes three main
components:
Impala Daemon (impalad): It runs on every
node where Impala is installed.
Impala StateStore
Impala MetaStore
Impala has its own SQL-like query language.
Pros:
In-memory computation, so data is accessed directly on
Hadoop nodes without being moved; smooth integration with BI
tools like Tableau, ZoomData, etc.; support for a wide range of
file formats.
Cons:
No support for serialisation and deserialisation of data, inability
to read custom binary files, table refresh needed for every record
addition.
Contact Us
+1 347 374 8437
info@cuelogic.com
https://www.cuelogic.com/
Unit 610, 134 W 29th St,
New York, NY 10001
Content Source: Cuelogic Blog
Big Data Frameworks: Spark, Hadoop, Hive and More
