Core concepts and Key technologies - Big Data Analytics

The Big Data business has solved the problem of 'Big Data acquisition and persistence' using
daily ETL and batch analysis through the Hadoop ecosystem.

Now let's see how the Big Data market evolved beyond batch processing (the Hadoop and RDBMS
world) to extract intelligence from global data streams in real time!

A tremendous amount of research has been underway over the last few years to challenge
conventional wisdom and create the augmented reality of Big Data Analytics!

'Dynamic decision making' is no longer driven by 'traditional Business Intelligence' but
involves 'fast exploration of data patterns' and 'performing complex deterministic or
stochastic approximations or accurate queries'.

Let's glance through some of the core concepts and key technologies that are driving this
renaissance.

Need to preserve Data Locality

Traditional Hadoop MR does not preserve 'data locality' during the Map-to-Reduce transition or
between iterations. To send data to the next job in an MR workflow, an MR job must store its
data in HDFS, which incurs communication overhead and extra processing time. The Bulk
Synchronous Parallel (BSP) concept was implemented in Pregel, Giraph and Hama to solve this
very important MR problem.
Hama initiates peer-to-peer communication only when necessary, and peers focus on keeping
locally processed data on local nodes.

BSP manages synchronization and communication in a middle layer, as opposed to a
file-system-based parallel random access pattern. Iterative algorithms such as K-Means
clustering fit this model well. Apache Hama provides a stable reference implementation for
analyzing streaming events or big data with graph/network structure by implementing a
deadlock-free 'message passing interface' and 'barrier synchronization', which significantly
reduces network overhead.
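
The superstep cycle described above (local computation, message exchange, barrier) can be
sketched in a few lines of Python. This is only an illustration that uses threads as
stand-in peers; it is not Hama's API, and the name bsp_global_sum and the in-memory inboxes
are assumptions made for the example.

```python
import threading


def bsp_global_sum(partitions):
    """Every peer learns the global sum while raw data stays on its own peer."""
    n = len(partitions)
    inboxes = [[] for _ in range(n)]               # one message queue per peer
    locks = [threading.Lock() for _ in range(n)]
    barrier = threading.Barrier(n)
    results = [None] * n

    def peer(pid):
        # Superstep 1: local computation on data that never leaves this peer.
        partial = sum(partitions[pid])
        # Communication: only the small partial result is sent to the peers.
        for dest in range(n):
            with locks[dest]:
                inboxes[dest].append(partial)
        # Barrier synchronization: nobody continues until all messages are sent.
        barrier.wait()
        # Superstep 2: combine the messages received in the previous superstep.
        results[pid] = sum(inboxes[pid])

    threads = [threading.Thread(target=peer, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Only one number per peer crosses the "network" here, which is the point of keeping locally
processed data on local nodes.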

Need for Real-time processing and streaming ETL

Hadoop was purpose-built for 'distributed batch processing' using static input, output and
processor configuration. But fast ad-hoc machine learning queries require real-time
distributed processing and real-time updates based on dynamically changing configuration,
without requiring any code change. Saffron Memory Base (SMB) is a technology that offers
real-time analytics on hybrid data.

SQLstream Connector for Hadoop provides bi-directional, continuous integration with
Hadoop HBase. DataTorrent Apex is another front-runner in stream processing.

With SciDB, one can run a query the moment it occurs to the user. By contrast, Hadoop arguably
imposes a heavy burden of infrastructure setup, data preparation, map-reduce configuration
and architectural coding. Both SciDB and SMB position themselves as a complete replacement
for Hadoop MR when it comes to complex data analysis.

Need for complex analytic functions

Hadoop MR is not suitable for increasingly complex mathematical and graph functions like
PageRank, BFS and matrix multiplication, which require repeated MR jobs.

Many interesting research projects have emerged in recent years to address this: BSP, Twister,
HaLoop and RHadoop.
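
To see why such functions need repeated MR jobs, here is a minimal PageRank sketch in plain
Python. Each call to pagerank_iteration plays the role of one MR job (a map phase that emits
rank contributions and a reduce phase that sums them); in Hadoop MR every such job would write
its output to HDFS and re-read the link structure. The function names and the damping value of
0.85 are choices made for this example.

```python
from collections import defaultdict


def pagerank_iteration(links, ranks, damping=0.85):
    """One map + reduce pass over the link graph."""
    # Map phase: every page emits rank / out-degree to each page it links to.
    contributions = defaultdict(float)
    for page, targets in links.items():
        for target in targets:
            contributions[target] += ranks[page] / len(targets)
    # Reduce phase: sum the contributions per page and apply damping.
    n = len(links)
    return {page: (1 - damping) / n + damping * contributions[page]
            for page in links}


def pagerank(links, iterations=20):
    """Chain the same job many times; this loop is what MR handles poorly."""
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        ranks = pagerank_iteration(links, ranks)
    return ranks
```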

Need for rich Data Model and rich Query syntax

The existing MR query API has limited syntax for relational joins and group-bys. It does
not directly support iteration or recursion in declarative form and cannot handle complex,
semi-structured, nested scientific data. Here MRQL (Map-Reduce Query Language) comes to the
rescue of Hadoop MR by supporting nested collections, trees, arbitrary query nesting, and
user-defined types and functions.

Impala reads directly from HDFS and HBase data. It will add a columnar storage engine, a
cost-based optimizer and other distinctly database-like features.

Need to optimize data flow and query execution

All the 'Big Data analytics datastores', both proprietary and open source, are trying their
best to redefine 'traditional Hadoop MR'.

MapReduce, after all, does not make sense as an engine for querying!

So here comes Shark, a distributed in-memory query framework built on Spark, with great speed
of execution.

Need for speed and versatility of MR Query

Apache Hadoop is designed to achieve very high throughput, but not the sub-second latency
needed for interactive data analysis and exploration. Here come Google Dremel and Apache
Drill. Drill's columnar query execution engine offers low-latency interactive reporting
using DrQL, and is intended to accommodate a broad range of other query front-ends such as
Mongo Query, Cascading and Plume. It adheres to the Hadoop philosophy of connecting to
multiple storage systems, but broadens the scope and introduces enormous flexibility by
supporting multiple query languages, data formats and data sources. Shark extends Hive syntax
but employs very efficient column storage with an interactive multi-stage computing flow.

Ability to cache and reuse intermediate map outputs

HaLoop introduces recursive joins for effective iterative data analysis by caching and
reusing loop-invariant data across iterations, a significant improvement over
conventional general-purpose MapReduce.

Twister offers a modest mechanism for managing configurable and cacheable MR tasks,
implements effective pub/sub-based communication and, of course, provides special support
for 'iterative MR computations'.
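
The idea of caching loop-invariant data can be shown with a small reachability loop. The
sketch below is not HaLoop's or Twister's API; LoopInvariantCache and reachable are
hypothetical names. The edge index never changes between iterations, so it is built once and
reused, while only the frontier (the loop-variant data) is recomputed.

```python
from collections import defaultdict


class LoopInvariantCache:
    """Builds a value on first request and reuses it on every later iteration."""

    def __init__(self):
        self._store = {}
        self.builds = 0

    def get(self, key, build):
        if key not in self._store:
            self._store[key] = build()
            self.builds += 1
        return self._store[key]


def reachable(edges, start, cache):
    """Iteratively join the frontier with the cached edge index."""
    def build_index():
        index = defaultdict(set)
        for src, dst in edges:
            index[src].add(dst)
        return index

    visited = {start}
    frontier = {start}
    while frontier:
        # Loop-invariant input: fetched from the cache, not rebuilt per iteration.
        index = cache.get("edges_by_src", build_index)
        # Loop-variant input: the frontier shrinks or grows every iteration.
        frontier = {dst for src in frontier for dst in index.get(src, ())} - visited
        visited |= frontier
    return visited
```

Without the cache, every iteration would pay again for re-reading and re-indexing the edge
list, which is what a chain of independent MR jobs does.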

Leverage CPU cache and distributed memory access patterns

There are quite a few frameworks that store data in distributed memory instead of HDFS,
such as GridGain, Hazelcast, RDD and Piccolo.

Hadoop was not designed to facilitate interactive analytics, so it took a few game
changers to exploit 'CPU cache' and 'distributed memory'.

Single-bus Multi-core CPU

It is well known that a single-bus multi-core CPU offers simultaneous multi-threading,
which significantly reduces latency for certain types of algorithms and data structures. So
DBMSs are being re-architected from the ground up to leverage the 'CPU-bound
partitioning-phase hash join' as opposed to the 'memory-bound hash join'.
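
A partitioned hash join can be sketched as follows. Both inputs are first split by a hash of
the join key, then each pair of partitions is joined with its own small hash table. The
assumption behind the technique is that a small per-partition table stays cache-resident,
whereas one large table causes a cache miss on almost every probe. Python cannot show the
cache effect itself, so this only illustrates the control flow; the function names and the
fanout of 8 are choices made for the example.

```python
from collections import defaultdict


def partition(rows, key, fanout):
    """Partitioning phase: scatter rows by a hash of the join key."""
    parts = [[] for _ in range(fanout)]
    for row in rows:
        parts[hash(row[key]) % fanout].append(row)
    return parts


def partitioned_hash_join(left, right, key, fanout=8):
    """Join matching partitions pairwise with small build/probe hash tables."""
    joined = []
    left_parts = partition(left, key, fanout)
    right_parts = partition(right, key, fanout)
    for lpart, rpart in zip(left_parts, right_parts):
        table = defaultdict(list)
        for row in lpart:                        # build phase
            table[row[key]].append(row)
        for row in rpart:                        # probe phase
            for match in table.get(row[key], ()):
                joined.append({**match, **row})
    return joined
```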

It is the ever-increasing speed of CPU caches and TLBs that allows blazing fast
computation and retrieval of hashed results. It is also noteworthy that modern multi-core
CPUs and GPGPUs offer cheap compression schemes at virtually no CPU cost, since access to
memory keeps getting slower relative to the ever-galloping processor clock speed.

ElastiCube from SiSense leverages 'query plans optimized for fast response time and
parallel execution based on multi-cores' and continuous 'instruction recycling' for
reusing pre-computed results.

A few other business analytics databases and tools, such as VectorWise, Tableau Data Engine
and SalesEdge, also make great use of the CPU cache to offer blazing fast ad-hoc queries.

Implementing Parallel Vectorization of compressed data through SIMD
(Single Instruction, Multiple Data):

VectorWise makes efficient use of vectorization, CPU-friendly compression and
'using the CPU cache as execution memory'. This is also a core technology behind many leading
analytics column stores.

Driving Positional-Delta-Tree (PDT):

A PDT keeps both the position and the delta of each update in memory and merges them
efficiently with the stored data during optimized query execution. An ad-hoc query is, in most
cases, about identifying the 'difference'. VectorWise makes effective use of PDTs. More can
be found in its white paper.
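
The merge-on-scan idea can be illustrated with a deliberately simplified structure. A real
PDT is a tree; the sketch below replaces it with a plain dictionary keyed by position, which
is an assumption made to keep the example short. The name scan_with_deltas and the
("ins" / "mod" / "del") operation tuples are hypothetical.

```python
def scan_with_deltas(stable, deltas):
    """Yield the current table image by merging position-keyed deltas into a scan.

    deltas maps a position in the stable column to a list of operations:
    ("ins", value) inserts before that position, ("mod", value) replaces the
    stored value, ("del",) drops it. Position len(stable) allows appends.
    """
    for pos in range(len(stable) + 1):
        ops = deltas.get(pos, [])
        # Pending inserts are emitted before the stored value at this position.
        for op in ops:
            if op[0] == "ins":
                yield op[1]
        if pos == len(stable):
            break
        if any(op[0] == "del" for op in ops):
            continue                             # deleted: emit nothing
        mods = [op[1] for op in ops if op[0] == "mod"]
        yield mods[-1] if mods else stable[pos]  # latest modification wins
```

The stable data is never rewritten; updates live in memory and are applied only while
scanning, which keeps the stored column read-only.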

Need to directly query compressed data residing in heavily indexed
columnar files:

SSDs and flash storage will get cheaper (which means cheaper cloud services) with
innovative compression and de-duplication at the file system level. All the analytics
datastores are gearing up to make the most of this. Watch out for Pure Storage.

Need for Dynamic Memory Computation and Resource Allocation

The output of a batch job needs to be dumped to secondary storage. Instead, it would be a good
idea to constantly compute the data size, create peer processors accordingly, and make sure
the collective memory does not exceed the entire data size. So it is important to understand
that Hadoop MR is not the best fit for processing all types of data structures. The BSP model
should be adopted for massive graph processing, where the bulk of the static data can remain
in the file system while dynamic data is processed by peers and the result is kept in memory.

Usage of Fractal Tree Indexes:

Local processing is the key! Keep enough buffers and pivots in the tree node itself to
avoid frequent, costly round trips along the tree for individual items. That means keep
filling up your local buffer and then do a bulk flush. New-age drives love bulk updates
(more changes per write) because they avoid fragmentation.

TokuDB replaces the B-tree index implementation of MySQL and MongoDB with Fractal Tree
indexes and achieves a massive performance gain.
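
The buffer-then-bulk-flush behaviour can be sketched with a single buffered node. This is a
toy model, not TokuDB's implementation: a real fractal tree has buffers at every internal
level, while BufferedNode below (a hypothetical name) has one internal node whose children
are plain dictionaries.

```python
import bisect


class BufferedNode:
    """An internal node that batches writes before pushing them to its leaves."""

    def __init__(self, pivots, buffer_size=4):
        self.pivots = pivots                     # sorted keys separating the leaves
        self.leaves = [dict() for _ in range(len(pivots) + 1)]
        self.buffer = []                         # pending (key, value) messages
        self.buffer_size = buffer_size
        self.flushes = 0

    def put(self, key, value):
        # Writes land in the local buffer; no trip down the tree per item.
        self.buffer.append((key, value))
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # One bulk flush moves many changes to the leaves at once.
        for key, value in self.buffer:
            self.leaves[bisect.bisect_right(self.pivots, key)][key] = value
        self.buffer.clear()
        self.flushes += 1

    def get(self, key):
        # Reads must check the buffer first; the newest pending write wins.
        for k, v in reversed(self.buffer):
            if k == key:
                return v
        return self.leaves[bisect.bisect_right(self.pivots, key)].get(key)
```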

Distributed Shared Memory Abstraction

It provides a radical performance improvement over disk-based MR! The Partial DAG
(Directed Acyclic Graph) Execution model describes parallel processing for in-memory
computation (aggregate result sets that fit in memory, e.g. intermediate Map outputs).

Shark, built on Spark's 'Resilient Distributed Dataset' (RDD) architecture, converts a query
into an 'operator tree'. Shark keeps re-optimizing a running query after the first few stages
of the task DAG have run, thereby selecting a better join strategy and the right degree of
parallelism. Shark also offers 'co-partitioning of multiple tables on a common key' for
faster join queries.
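
The run-time decisions described above can be pictured as two small functions that are called
once the sizes of the first stages' outputs are known. This is a sketch of the decision logic
only, not Shark's code; the function names and both thresholds (32 MB and 64 MB) are
hypothetical values chosen for the example.

```python
def choose_join_strategy(left_bytes, right_bytes,
                         broadcast_limit=32 * 1024 * 1024):
    """Pick a join strategy from sizes observed after the early stages ran."""
    # If one side turned out small, ship it to every node and skip the shuffle.
    if min(left_bytes, right_bytes) <= broadcast_limit:
        return "broadcast"
    return "shuffle"


def choose_parallelism(total_bytes, target_partition_bytes=64 * 1024 * 1024):
    """Derive the number of reduce tasks from the observed data volume."""
    # Ceiling division, with at least one task.
    return max(1, -(-total_bytes // target_partition_bytes))
```

A static optimizer would have to guess these sizes before the query starts; deferring the
choice until real statistics exist is what makes the re-optimization worthwhile.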

It leverages SSDs, CPU cores and main memory to the fullest extent!

It is worth mentioning how DDF (Distributed DataFrame, a next-generation extension of the
Spark RDD), H2O DataFrame and Dato SFrame are expanding the horizons of machine learning at
massive scale.

Parallel Array Computation

It is a very simple yet powerful mathematical approach that embeds 'Big Math' functions
directly inside the database engine. SciDB has mastered this concept by embedding
statistical computation using distributed, multidimensional arrays.

Semi-computation and instant approximation

A 'fast response to ad-hoc queries' comes from continuous learning from experience and
instant approximation, as opposed to waiting for processing to end and the final result to be
computed. A number of analytics products are coming to market with built-in 'data science
capabilities', for example H2O from 0xdata.
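
One simple form of instant approximation is an estimate that is refined as rows arrive, so an
answer with an error margin is available at any moment. The sketch below uses Welford's
online algorithm for the running mean and variance; the class name RunningEstimate and the
normal-approximation margin (z = 1.96 for roughly 95% confidence) are choices made for this
example and do not describe any particular product.

```python
import math


class RunningEstimate:
    """Mean and error margin that can be read at any point during processing."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0        # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: numerically stable, one value at a time.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def margin(self, z=1.96):
        # Half-width of the confidence interval around the current mean.
        if self.n < 2:
            return float("inf")
        variance = self._m2 / (self.n - 1)
        return z * math.sqrt(variance / self.n)
```

A caller can stop reading input as soon as margin() drops below the precision it needs
instead of waiting for the full scan.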

Need for Push-based Map Reduce Resource Management

Although Hadoop's main strength is distributed processing over clusters, utilization drops
at peak load due to scheduling overhead. It is well known that a map-reduce cluster
is divided into a fixed number of processor slots based on static configuration.
So Facebook introduced Corona, a 'push-based scheduler' in which a cluster manager
tracks nodes and dynamically allocates slots, making it easier to utilize all the slots based
on cluster workload for both map-reduce and non-MR applications.
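
The push model can be sketched as a cluster manager that hands out a slot the moment either a
request or free capacity appears, instead of waiting for workers to poll. This is an
illustrative toy, not Corona's code; ClusterManager, node_heartbeat and request are
hypothetical names.

```python
from collections import deque


class ClusterManager:
    """Tracks free slots per node and pushes grants to waiting jobs."""

    def __init__(self):
        self.free = {}            # node -> free slot count
        self.pending = deque()    # [job, slots still needed], first come first served
        self.grants = []          # (job, node) pairs, in the order they were pushed

    def node_heartbeat(self, node, free_slots):
        self.free[node] = free_slots
        self._push()

    def request(self, job, slots):
        self.pending.append([job, slots])
        self._push()

    def _push(self):
        # Grant immediately while there is both demand and free capacity.
        while self.pending:
            job, needed = self.pending[0]
            node = next((n for n, f in self.free.items() if f > 0), None)
            if node is None:
                return
            self.free[node] -= 1
            self.grants.append((job, node))
            if needed == 1:
                self.pending.popleft()
            else:
                self.pending[0][1] = needed - 1
```

Because slots are counted per node rather than fixed in advance per job type, the same pool
can serve map-reduce and non-MR requests.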

Topological Data Analysis

It is a giant leap forward! It treats the data model as a topology of nodes and discovers
patterns and results by measuring similarity. Ayasdi has pioneered this idea to build the
first 'query-free exploratory analytics tool'. It is a true example of 'analytics based on
unsupervised learning without requiring an a priori algebraic model'.

Ayasdi Iris is a mind-boggling insight discovery tool!

Learn from KnowledgeBase

InfoBright offers fast analytics based on a knowledge base built by continuously
updating metadata about the data, data access patterns, queries and aggregate results.
This innovative 'dynamic introspective' approach helps columnar storage avoid the
need for indexing and costly sub-selects, and allows it to decompress only the
required data.
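
A small example shows how metadata alone can answer much of a query. The sketch keeps a
(min, max) pair for every block of a column and uses it to decide which blocks must actually
be opened. It is a generic illustration of the approach rather than InfoBright's
implementation; the function names and the pack size are assumptions.

```python
def build_pack_metadata(values, pack_size):
    """Split a column into packs and record (min, max) for each pack."""
    packs = [values[i:i + pack_size] for i in range(0, len(values), pack_size)]
    meta = [(min(p), max(p)) for p in packs]
    return packs, meta


def count_greater_than(packs, meta, threshold):
    """Count values above threshold, opening only packs the metadata cannot settle."""
    total = 0
    opened = 0
    for pack, (lo, hi) in zip(packs, meta):
        if hi <= threshold:
            continue                  # irrelevant pack: skipped entirely
        if lo > threshold:
            total += len(pack)        # fully relevant: answered from metadata
            continue
        opened += 1                   # undecided pack: "decompress" and scan it
        total += sum(1 for v in pack if v > threshold)
    return total, opened
```

For sorted or clustered data most packs fall into the first two cases, so very little data
has to be decompressed.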

Reduction in Cluster size

'Data analysis' processors do not need as many machines as Hadoop does. For example,
ParAccel can handle data analysis on one node as opposed to the (average) 8 nodes Hadoop
requires to perform the same type of analysis, due to advanced storage schema optimization.
Enhanced file systems like QFS offer a much lower replication factor (1.5) and higher
throughput.
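
A factor of 1.5 is what erasure coding gives instead of full copies. Assuming a Reed-Solomon
layout of 6 data stripes plus 3 parity stripes (the layout QFS is generally described as
using; treat the exact numbers as an assumption here), the arithmetic works out as follows.
The function name storage_overhead is made up for the example.

```python
def storage_overhead(data_stripes, parity_stripes):
    """Raw bytes stored per byte of user data."""
    return (data_stripes + parity_stripes) / data_stripes


# 6 data + 3 parity stripes: 9 stripes stored for every 6 stripes of data.
erasure_coded = storage_overhead(6, 3)

# Three full replicas, modelled as 1 data copy + 2 extra copies.
triple_replicated = storage_overhead(1, 2)
```

Under these assumptions the erasure-coded layout needs half the raw disk of three-way
replication for the same amount of user data.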

Avoid Data duplication

Both ParAccel and Hadapt share a similar vision of analyzing the data as close to the
data node as possible, without moving data to a different BI layer.

As opposed to the Hadoop connector strategies of MPP analytic datastores, Hadapt
processes data in an RDBMS layer sitting close to HDFS and load-balances queries in a
virtualized environment of adaptive query engines.

Ensure Fault-tolerance & Reliability

Apache Hama ensures reliability through process communication and barrier
synchronization. Each peer uses checkpoint recovery to occasionally flush the volatile
part of its state to the DFS, which allows rollback to the last checkpoint in the event of
failure.

Per the Shark documentation, "RDDs track the series of transformations used to build them
(their lineage) to recompute lost data".
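
Lineage-based recovery can be demonstrated with a toy dataset class that remembers how it was
derived. This is not Spark's RDD API; LineageDataset and its lose method are hypothetical, and
the sketch ignores partitioning. Each dataset stores its parent and the transformation that
produced it, so a lost result is rebuilt by replaying that chain.

```python
class LineageDataset:
    """A dataset that records its parent and transformation instead of copies."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source          # only set on the root dataset
        self._parent = parent
        self._transform = transform
        self._cache = None             # materialized rows, may be lost

    def map(self, fn):
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute from lineage whenever the materialized rows are missing.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache = None
```

No replica of the derived data is kept; only the recipe is, which is why the in-memory
representation can stay compact.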

Shark/Spark, with its compact Resilient Distributed Dataset-based distributed in-memory
computing, offers great hope to startups and open source enthusiasts for building a
lightning fast data warehouse system.

References:

** Single-Bus Multi-Core CPU : http://pages.cs.wisc.edu/~jignesh/publ/hashjoin.pdf
** Bulk Synchronous Parallel : http://www.staff.science.uu.nl/~bisse101/Book/PSC/psc1_2.pdf

Spark : http://spark-project.org/research/
Shark : https://github.com/amplab/shark
Pregel : http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
Giraph : http://giraph.apache.org/
Hama : http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
Saffron Memory Base : http://www.slideshare.net/paulhofmann/big-data-and-saffron
SQLstream : http://www.sqlstream.com/applications/hadoop/
RHadoop : https://github.com/RevolutionAnalytics/RHadoop/wiki
MRQL : http://lambda.uta.edu/mrql/
Apache Drill : http://incubator.apache.org/drill/
HaLoop : http://code.google.com/p/haloop/wiki/UserManual
Twister : http://www.iterativemapreduce.org/
BSP : http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Corona : https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona
ParAccel : http://img.en25.com/Web/ParAccel/%7Be72a7284-edb0-4e58-bb75-ff1145717d2b%7D_Hadoop-Limitations-for-Big-Data-ParAccel-Whitepaper.pdf
QFS : https://github.com/quantcast/qfs/wiki/Performance-Comparison-to-HDFS
Hadapt : http://hadapt.com/assets/Hadapt-Product-Overview1.pdf
ElastiCube : http://pages.sisense.com/elasticube-whitepaper.html?src=bottom
VectorWise : http://fastreporting.files.wordpress.com/2011/03/vectorwise-whitepaper.pdf
SalesEdge : http://edgespring.com/product.php
TokuDB : http://www.tokutek.com/resources/technology/
Ayasdi Iris : http://www.ayasdi.com/rethink-data/
InfoBright : http://support.infobright.com/Support/Resource-Library/Whitepapers/
SciDB : http://www.paradigm4.com/2013/01/terabyte-scale-parallel-processing-with-r-and-scidb/
Saffron Memory Base : http://www.slideshare.net/paulhofmann/saffron-for-cloud-con
Hama : http://www.slideshare.net/teofili/machine-learning-with-apache-hama
H2O : http://www.0xdata.com/faq.html

http://pages.cs.wisc.edu/~jignesh/publ/hashjoin.pdf
http://www.sisense.com/documentation/prism-elasticube-manager/introduction-to-elasticube-manager
http://kowshik.github.com/JPregel/pregel_paper.pdf
http://en.wikipedia.org/wiki/Topological_sorting
http://fastreporting.files.wordpress.com/2011/03/vectorwise-whitepaper.pdf
http://www.slideshare.net/paulhofmann/big-data-and-saffron
http://www.paradigm4.com/2013/01/terabyte-scale-parallel-processing-with-r-and-scidb/
http://en.wikipedia.org/wiki/Associative_Memory_Base
http://www.staff.science.uu.nl/~bisse101/Book/PSC/psc1_2.pdf
http://calab.kaist.ac.kr/~swseo/papers/IEEE_CLOUDCOM2010_HAMA.pdf
http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
http://gigaom.com/2012/07/05/want-to-ditch-your-data-scientists-heres-are-7-startups-that-can-help/
http://pages.sisense.com/elasticube-whitepaper.html?src=bottom
http://support.infobright.com/Support/Resource-Library/Whitepapers/
http://www.cse.buffalo.edu/faculty/tkosar/datacloud2012/papers/datacloud2012_paper_4.pdf

Graphical Model Transformation FrameworkKaniska Mandal
 

More from Kaniska Mandal (20)

Machine learning advanced applications
Machine learning advanced applicationsMachine learning advanced applications
Machine learning advanced applications
 
MS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning AlgorithmMS CS - Selecting Machine Learning Algorithm
MS CS - Selecting Machine Learning Algorithm
 
Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1Machine Learning Comparative Analysis - Part 1
Machine Learning Comparative Analysis - Part 1
 
Debugging over tcp and http
Debugging over tcp and httpDebugging over tcp and http
Debugging over tcp and http
 
Designing Better API
Designing Better APIDesigning Better API
Designing Better API
 
Concurrency Learning From Jdk Source
Concurrency Learning From Jdk SourceConcurrency Learning From Jdk Source
Concurrency Learning From Jdk Source
 
Wondeland Of Modelling
Wondeland Of ModellingWondeland Of Modelling
Wondeland Of Modelling
 
The Road To Openness.Odt
The Road To Openness.OdtThe Road To Openness.Odt
The Road To Openness.Odt
 
Perils Of Url Class Loader
Perils Of Url Class LoaderPerils Of Url Class Loader
Perils Of Url Class Loader
 
Making Applications Work Together In Eclipse
Making Applications Work Together In EclipseMaking Applications Work Together In Eclipse
Making Applications Work Together In Eclipse
 
Eclipse Tricks
Eclipse TricksEclipse Tricks
Eclipse Tricks
 
E4 Eclipse Super Force
E4 Eclipse Super ForceE4 Eclipse Super Force
E4 Eclipse Super Force
 
Create a Customized GMF DnD Framework
Create a Customized GMF DnD FrameworkCreate a Customized GMF DnD Framework
Create a Customized GMF DnD Framework
 
Creating A Language Editor Using Dltk
Creating A Language Editor Using DltkCreating A Language Editor Using Dltk
Creating A Language Editor Using Dltk
 
Advanced Hibernate Notes
Advanced Hibernate NotesAdvanced Hibernate Notes
Advanced Hibernate Notes
 
Best Of Jdk 7
Best Of Jdk 7Best Of Jdk 7
Best Of Jdk 7
 
Converting Db Schema Into Uml Classes
Converting Db Schema Into Uml ClassesConverting Db Schema Into Uml Classes
Converting Db Schema Into Uml Classes
 
EMF Tips n Tricks
EMF Tips n TricksEMF Tips n Tricks
EMF Tips n Tricks
 
Graphical Model Transformation Framework
Graphical Model Transformation FrameworkGraphical Model Transformation Framework
Graphical Model Transformation Framework
 
Mashup Magic
Mashup MagicMashup Magic
Mashup Magic
 

Recently uploaded

why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 

Recently uploaded (20)

why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 

Core concepts and Key technologies - Big Data Analytics

Big Data businesses have solved the problem of 'Big Data acquisition and persistence' using daily ETL and batch analysis through the Hadoop ecosystem.

Now let's see how the Big Data market evolved beyond batch processing (the Hadoop and RDBMS world) to extract intelligence from global data streams in real time.

Tremendous research work has been underway over the last few years to challenge conventional wisdom and create the augmented reality of Big Data Analytics.

'Dynamic Decision Making' is no longer driven by 'Traditional Business Intelligence' but involves 'fast exploration of data patterns' and 'performing complex deterministic or stochastic approximations or accurate queries'.

Let's glance through some of the core concepts and key technologies that are driving this renaissance.

Need to preserve Data Locality

Traditional Hadoop MR does not preserve 'data locality' during the Map-to-Reduce transition or between iterations. In order to send data to the next job in an MR workflow, an MR job needs to store its data in HDFS, so it incurs communication overhead and extra processing time. The Bulk Synchronous Parallel (BSP) concept was implemented in Pregel, Giraph and Hama to solve this very important MR problem.

Hama initiates peer-to-peer communication only when necessary, and peers focus on keeping locally processed data in local nodes.

BSP manages synchronization and communication in the middle layer, as opposed to a file-system-based parallel random access pattern. K-Means clustering is one of the iterative algorithms implemented on top of it. Apache Hama provides a stable reference implementation for analyzing streaming events or big data with graph/network structure by implementing a deadlock-free 'message passing interface' and 'barrier synchronization' (which significantly reduces network overhead).

Need for Real-time processing and streaming ETL

Hadoop was purpose-built for 'distributed batch processing' using static input, output and processor configuration. But fast ad-hoc machine learning queries require real-time distributed processing and real-time updates based on dynamically changing configuration, without requiring any code change. Saffron Memory Base (SMB) is a technology that offers real-time analytics on hybrid data.

SQLstream Connector for Hadoop provides bi-directional, continuous integration with Hadoop HBase. DataTorrent Apex is another front-runner in stream processing.

With SciDB, one can run a query the moment it occurs to the user. By contrast, Hadoop arguably imposes a huge burden of infrastructure setup, data preparation, map-reduce configuration and architectural coding. Both SciDB and SMB position themselves as complete replacements for Hadoop MR when it comes to complex data analysis.

Need for complex analytic functions

Hadoop MR is not suitable for increasingly complex mathematical and graph functions like PageRank, BFS and matrix multiplication, which require repetitive MR jobs.

So many interesting research works have spawned in recent times: BSP, Twister, HaLoop, RHadoop.

Need for rich Data Model and rich Query syntax

The existing MR query APIs have limited syntax for relational joins and group-bys. They do not directly support iteration or recursion in declarative form, and are not able to handle complex, semi-structured, nested scientific data. Here MRQL (Map-Reduce Query Language) comes to the rescue of Hadoop MR by supporting nested collections, trees, arbitrary query nesting, and user-defined types and functions.

Impala reads data directly from HDFS and HBase. It will add a columnar storage engine, a cost-based optimizer and other distinctly database-like features.

Need to optimize data flow and query execution

All the 'Big Data Analytics datastores', both proprietary and open source, are trying their best to redefine 'traditional Hadoop MR'.

Well, MapReduce does not make sense as an engine for querying!

So here comes Shark, a distributed in-memory MR framework with great speed of execution.

Need for speed and versatility of MR Query

Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Here come Google Dremel and Apache Drill. A columnar query execution engine offers low-latency interactive reporting using DrQL, which encompasses a broad range of low-latency frameworks like Mongo Query, Cascading and Plume.
Drill adheres to the Hadoop philosophy of connecting to multiple storage systems, but broadens the scope and introduces enormous flexibility by supporting multiple query languages, data formats and data sources. Shark, built on Spark, extends Hive syntax but employs very efficient columnar storage with an interactive multi-stage computing flow.

Ability to cache and reuse intermediate map outputs

HaLoop introduces recursive joins for effective iterative data analysis by caching and reusing loop-independent data across iterations. This is a significant improvement over conventional general-purpose MapReduce.
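
The caching idea can be illustrated with a minimal, single-process sketch. This is not HaLoop's API: `LoopInvariantCache`, `build_adjacency`, `pagerank` and the toy `RAW_EDGES` graph are hypothetical names chosen for illustration. It only shows the principle that the loop-independent input (the link structure) is prepared once, while the loop-dependent state (the ranks) is recomputed on every iteration.

```python
from collections import defaultdict

# Hypothetical toy input: directed edges of a small link graph.
RAW_EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]


class LoopInvariantCache:
    """Builds a loop-independent value once and reuses it on later iterations."""

    def __init__(self, loader):
        self._loader = loader
        self._value = None
        self.loads = 0  # how many times the expensive loader actually ran

    def get(self):
        if self._value is None:
            self._value = self._loader()
            self.loads += 1
        return self._value


def build_adjacency():
    """Expensive, loop-invariant step: group the raw edges by source node."""
    adjacency = defaultdict(list)
    for src, dst in RAW_EDGES:
        adjacency[src].append(dst)
    return dict(adjacency)


def pagerank(cache, iterations=20, damping=0.85):
    """Iterative PageRank where only the ranks change between iterations."""
    nodes = sorted({node for edge in RAW_EDGES for node in edge})
    ranks = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        adjacency = cache.get()  # served from the cache after the first pass
        updated = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src, targets in adjacency.items():
            share = ranks[src] / len(targets)
            for dst in targets:
                updated[dst] += damping * share
        ranks = updated
    return ranks


if __name__ == "__main__":
    cache = LoopInvariantCache(build_adjacency)
    print(pagerank(cache), "loader runs:", cache.loads)
```

In a plain chain of MapReduce jobs, the equivalent of `build_adjacency` would be re-read and re-shuffled on every iteration; here the loader runs once regardless of the iteration count.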
Twister offers a modest mechanism for managing configurable and cacheable MR tasks, implements effective pub/sub-based communication and, of course, provides special support for 'iterative MR computations'.

Leverage CPU cache and distributed memory access patterns

There are quite a few frameworks that store data in distributed memory instead of HDFS, like GridGain, Hazelcast, RDD and Piccolo.

Hadoop was not designed to facilitate interactive analytics, so it required a few game changers to exploit 'CPU cache' and 'distributed memory'.

Single-bus Multi-core CPU

It is well known that a single-bus multi-core CPU offers simultaneous multi-threading that significantly reduces latency for certain types of algorithms and data structures. So DBMSs are being re-architected from the ground up to leverage the 'CPU-bound partitioning-phase hash join' as opposed to the 'memory-bound hash join'.

It is the ever-increasing speed of CPU caches and TLBs that allows blazing fast computation and retrieval of hashed results. It is also noteworthy how modern multi-core CPUs and GPGPUs offer cheap compression schemes at virtually no CPU cost, since access to memory keeps getting slower relative to the ever-galloping processor clock speed.

ElastiCube from SiSense leverages 'query plans optimized for fast response time and parallel execution based on multi-cores' and continuous 'instruction recycling' for reusing pre-computed results.

A few other business analytics databases and tools like VectorWise, Tableau Data Engine and SalesEdge also make great use of CPU cache to offer blazing fast ad-hoc queries.

Implementing Parallel Vectorization of compressed data through SIMD (Single Instruction, Multiple Data) :

VectorWise efficiently utilizes the techniques of vectorization, CPU compression and 'using CPU cache as execution memory'. This is also a core technology behind many leading analytics column stores.

Driving Positional-Delta-Tree (PDT) :

A PDT stores both the position and the delta in memory, and these are effectively merged with the data during optimized query execution. An ad-hoc query is in most cases about identifying the 'difference'. VectorWise makes effective use of PDTs. More can be found in its white paper.
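
The merge-at-scan idea can be sketched as follows. This is a deliberately simplified stand-in, not VectorWise's actual data structure: it keeps deltas in a sorted list rather than a tree, supports only in-place modifications and deletions (no inserts), and the names `PositionalDeltas`, `modify`, `delete` and `merge_scan` are hypothetical. It shows only that the stored column stays untouched while position-keyed deltas are applied on the fly during a scan.

```python
import bisect


class PositionalDeltas:
    """Simplified positional delta store: pending changes keyed by row position."""

    def __init__(self):
        self._positions = []  # sorted positions that have a pending delta
        self._deltas = {}     # position -> ("mod", new_value) or ("del", None)

    def _put(self, position, delta):
        if position < 0:
            raise ValueError("position must be non-negative")
        if position not in self._deltas:
            bisect.insort(self._positions, position)
        self._deltas[position] = delta

    def modify(self, position, value):
        """Record that the row at `position` now holds `value`."""
        self._put(position, ("mod", value))

    def delete(self, position):
        """Record that the row at `position` has been removed."""
        self._put(position, ("del", None))

    def merge_scan(self, column):
        """Yield the up-to-date view of `column` without rewriting it."""
        pending = iter(self._positions)
        next_pos = next(pending, None)
        for pos, value in enumerate(column):
            if pos == next_pos:
                kind, new_value = self._deltas[pos]
                next_pos = next(pending, None)
                if kind == "del":
                    continue  # skip deleted rows
                value = new_value
            yield value


if __name__ == "__main__":
    base_column = [10, 20, 30, 40, 50]  # immutable, stored column data
    deltas = PositionalDeltas()
    deltas.modify(1, 21)
    deltas.delete(3)
    print(list(deltas.merge_scan(base_column)))
```

Because both the column and the deltas are ordered by position, the merge is a single linear pass, which is what makes it cheap to fold into a scan.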
Need to directly query compressed data residing in heavily indexed columnar files :

SSDs and flash storage will get cheaper (which means cheaper cloud services) with innovative compression and de-duplication on the file system. All the analytics datastores are gearing up to make the most of this feature. Watch out for Pure Storage.

Need for Dynamic Memory Computation and Resource Allocation

The output of a batch job needs to be dumped to secondary storage. Instead, it would be a good idea to constantly compute the data size and create peer processors, making sure the collective memory does not exceed the entire data size. So it is important to understand that Hadoop MR is not the best fit for processing all types of data structures. The BSP model should be adopted for massive graph processing, where the bulk of static data can remain in the file system while dynamic data is processed by peers and the results are kept in memory.

Usage of Fractal Tree Indexes :

Local processing is the key! Keep enough buffers and pivots in the tree node itself in order to avoid frequent, costly round trips along the tree for individual items. That means: keep filling up your local buffer and then do a bulk flush. New-age drives love bulk updates (more changes per write) to avoid fragmentation.

TokuDB replaces the B-tree implementation of MySQL and MongoDB with Fractal Tree indexes and achieves massive performance gains.

Distributed Shared Memory Abstraction

It provides a radical performance improvement over disk-based MR. The partial DAG (Directed Acyclic Graph) execution model describes parallel processing for in-memory computation (aggregate result sets that fit in memory, e.g. intermediate map outputs).

Spark uses the 'Resilient Distributed Dataset' architecture for converting a query into an 'operator tree'. Shark keeps re-optimizing a running query after running the first few stages of the task DAG, thereby selecting a better join strategy and the right degree of parallelism. Shark also offers 'co-partitioning multiple tables based on a common key' for faster join queries.

It leverages SSDs, CPU cores and main memory to the fullest extent.

It is worth mentioning how DDF (Distributed DataFrame, a next-generation extension of Spark RDD), H2O DataFrame and Dato SFrame are expanding the horizons of machine learning at massive scale.

Parallel Array Computation

It is a very simple yet powerful mathematical approach: embed Big Math functions directly inside the database engine. SciDB has mastered this concept by embedding statistical computation using distributed, multidimensional arrays.

Semi-computation and instant approximation

'Fast response to ad-hoc queries' comes through continuous learning from experience and instant approximation, as opposed to waiting for the end of processing and the computation of the final result. A bunch of analytics products are coming to market with built-in 'data science capabilities', for example H2O from 0xdata.

Need for Push-based Map Reduce Resource Management

Although Hadoop's main strength is distributed processing over clusters, at peak load the utilization drops due to scheduling overhead. It is well known that a map-reduce cluster is divided into a fixed number of processor slots based on static configuration.

So Facebook introduced Corona, a 'push-based scheduler' in which a cluster manager tracks nodes and dynamically allocates slots, making it easier to utilize all the slots based on cluster workload for both map-reduce and non-MR applications.

Topological Data Analysis

It is a giant leap forward. It treats the data model as a topology of nodes and discovers patterns and results by measuring similarity. Ayasdi has pioneered this idea to build the first 'query-free exploratory analytics tool'. It is a true example of 'analytics based on unsupervised learning without requiring an a priori algebraic model'.

Ayasdi Iris is a mind-boggling insight discovery tool!

Learn from KnowledgeBase

InfoBright offers fast analytics based on a knowledge base built by continuously updating metadata about data, data access patterns, queries and aggregate results.

This type of innovative 'dynamic introspective' approach helps columnar storage avoid the need for indexing and costly sub-selects, and allows it to decompress only the required data.

Reduction in Cluster size

'Data analysis' processors do not need the same number of machines as Hadoop nodes. For example, ParAccel can handle data analysis on one node, as opposed to the (average) 8 nodes required by Hadoop to perform the same type of analysis, due to advanced storage schema optimization. Enhanced file systems like QFS offer a much lower replication factor (1.5) and higher throughput.

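
To make the effect of the replication factor concrete, here is a small back-of-the-envelope sketch. The 1.5 factor for QFS comes from the text above; the factor of 3 for HDFS is an assumption (it is the commonly used default, not a figure stated in this article), and the 100 TB dataset size and the helper name `raw_storage_tb` are purely illustrative.

```python
def raw_storage_tb(logical_tb, replication_factor):
    """Raw disk capacity needed to hold `logical_tb` at a given replication factor."""
    if logical_tb < 0 or replication_factor < 1:
        raise ValueError("logical_tb must be >= 0 and replication_factor >= 1")
    return logical_tb * replication_factor


LOGICAL_TB = 100.0   # illustrative dataset size
HDFS_FACTOR = 3.0    # assumption: common HDFS default, not stated in the article
QFS_FACTOR = 1.5     # replication factor quoted above for QFS

hdfs_raw = raw_storage_tb(LOGICAL_TB, HDFS_FACTOR)
qfs_raw = raw_storage_tb(LOGICAL_TB, QFS_FACTOR)
saving = 1.0 - qfs_raw / hdfs_raw

if __name__ == "__main__":
    print(f"HDFS raw: {hdfs_raw} TB, QFS raw: {qfs_raw} TB, saving: {saving:.0%}")
```

Under these assumptions the same logical dataset needs half the raw disk, which in turn means fewer storage nodes for a given amount of data.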
Avoid Data duplication

Both ParAccel and Hadapt share a similar vision of analyzing the data as close to the data node as possible, without moving data to a different BI layer.

As opposed to the Hadoop connector strategies of MPP analytic datastores, Hadapt processes data in an RDBMS layer sitting close to HDFS and load-balances queries in a virtualized environment of adaptive query engines.

Ensure Fault-tolerance & Reliability

Apache Hama ensures reliability through process communication and barrier synchronization. Each peer uses checkpoint recovery to occasionally flush the volatile part of its state to the DFS, which allows rollback to the last checkpoint in the event of failure.

Per the Shark documentation, "RDDs track the series of transformations used to build them (their lineage) to recompute lost data".

Shark/Spark, with its compact Resilient Distributed Dataset-based distributed in-memory computing, offers great hope to startups and open source enthusiasts for building a lightning fast data warehouse system.

References :

** Single-Bus-Multi-Core CPU : http://pages.cs.wisc.edu/~jignesh/publ/hashjoin.pdf
** Bulk Synchronous Parallel : http://www.staff.science.uu.nl/~bisse101/Book/PSC/psc1_2.pdf

Spark : http://spark-project.org/research/
Shark : https://github.com/amplab/shark
Pregel : http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
Giraph : http://giraph.apache.org/
Hama : http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
Saffron Memory Base : http://www.slideshare.net/paulhofmann/big-data-and-saffron
SQL Stream : http://www.sqlstream.com/applications/hadoop/
RHadoop : https://github.com/RevolutionAnalytics/RHadoop/wiki
MRQL : http://lambda.uta.edu/mrql/
Apache Drill : http://incubator.apache.org/drill/
HaLoop : http://code.google.com/p/haloop/wiki/UserManual
Twister : http://www.iterativemapreduce.org/
BSP : http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Corona : https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona
ParAccel : http://img.en25.com/Web/ParAccel/%7Be72a7284-edb0-4e58-bb75-ff1145717d2b%7D_Hadoop-Limitations-for-Big-Data-ParAccel-Whitepaper.pdf
QFS : https://github.com/quantcast/qfs/wiki/Performance-Comparison-to-HDFS
Hadapt : http://hadapt.com/assets/Hadapt-Product-Overview1.pdf
ElastiCube : http://pages.sisense.com/elasticube-whitepaper.html?src=bottom
VectorWise : http://fastreporting.files.wordpress.com/2011/03/vectorwise-whitepaper.pdf
SalesEdge : http://edgespring.com/product.php
TokuDB : http://www.tokutek.com/resources/technology/
Ayasdi Iris : http://www.ayasdi.com/rethink-data/
InfoBright : http://support.infobright.com/Support/Resource-Library/Whitepapers/
SciDB : http://www.paradigm4.com/2013/01/terabyte-scale-parallel-processing-with-r-and-scidb/
Saffron Memory Base : http://www.slideshare.net/paulhofmann/saffron-for-cloud-con
Hama : http://www.slideshare.net/teofili/machine-learning-with-apache-hama
H2O : http://www.0xdata.com/faq.html

http://www.sisense.com/documentation/prism-elasticube-manager/introduction-to-elasticube-manager
http://kowshik.github.com/JPregel/pregel_paper.pdf
http://en.wikipedia.org/wiki/Topological_sorting
http://en.wikipedia.org/wiki/Associative_Memory_Base
http://calab.kaist.ac.kr/~swseo/papers/IEEE_CLOUDCOM2010_HAMA.pdf
http://gigaom.com/2012/07/05/want-to-ditch-your-data-scientists-heres-are-7-startups-that-can-help/
http://www.cse.buffalo.edu/faculty/tkosar/datacloud2012/papers/datacloud2012_paper_4.pdf