SlideShare a Scribd company logo
1 of 38
A Brief Introduction of Existing Big DataTools
Presented
By
Outline
The world map of big data tools
Layered architecture
Big data tools for HPC and supercomputing
MPI
Big data tools on clouds
MapReduce model
Iterative MapReduce model
DAG model
Graph model
Collective model
Machine learning on big data
Query on big data
Stream data processing
MapReduce ModelDAG Model Graph Model BSP/Collective Model
Storm
Twister
For
Iterations/
Learning
For
Streaming
For Query
S4
DrillDrill
Hadoop
MPI
Dryad/
DryadLINQ Pig/PigLatin
Spark
Shark
Spark Streaming
MRQL
Hive
Tez
Giraph
Hama
GraphLab
Harp
GraphX
HaLoop
Samza
The World
of Big Data
Tools
Stratosphere
Reef
Layered Architecture (Upper)
• NA – Non Apache
projects
• Green layers are
Apache/Commercial
Cloud (light) to HPC
(darker) integration
layers
Orche
stratio
n&
Workf
low
Oozie,
ODE,
Airav
ata
and
OOD
T
(Tools
)
NA:
Pegas
us,
Keple
r,
Swift,
Taver
na,
Triden
t,
Active
BPEL,
BioKe
pler,
Galax
y
Orche
stratio
n&
Workf
low
Oozie,
ODE,
Airav
ata
and
OOD
T
(Tools
)
NA:
Pegas
us,
Keple
r,
Swift,
Taver
na,
Triden
t,
Active
BPEL,
BioKe
pler,
Galax
y
Data Analytics Libraries:Machine Learning
Mahout , MLlib , MLbase
CompLearn (NA)
Linear Algebra
Scalapack, PetSc (NA)
Statistics, Bioinformatics
R, Bioconductor (NA)
Imagery
ImageJ (NA)
MRQL
(SQL on Hadoop,
Hama, Spark)
Hive
(SQL on
Hadoop)
Pig
(Procedural
Language)
Shark
(SQL on
Spark, NA)
Hcatalog
Interfaces
Impala (NA)
Cloudera
(SQL on Hbase)
Swazall
(Log Files
Google NA)
High Level (Integrated) Systems for Data Processing
Parallel Horizontally Scalable Data Processing
Giraph
~Pregel
Tez
(DAG)
Spark
(Iterative
MR)
Storm
S4
Yahoo
Samza
LinkedIn
Hama
(BSP)
Hadoop
(Map
Reduce)
Pegasus
on Hadoop
(NA)
NA:Twister
Stratosphere
Iterative MR
GraphBatch Stream
Pub/Sub Messaging Netty (NA)/ZeroMQ (NA)/ActiveMQ/Qpid/Kafka
ABDS Inter-process Communication
Hadoop, Spark Communications MPI (NA)
& Reductions Harp Collectives (NA)
HPC Inter-process Communication
Cross Cutting
Capabilities
DistributedCoordination:ZooKeeper,JGroups
DistributedCoordination:ZooKeeper,JGroups
MessageProtocols:Thrift,Protobuf(NA)
MessageProtocols:Thrift,Protobuf(NA)
Security&Privacy
Security&Privacy
Monitoring:Ambari,Ganglia,Nagios,Inca(NA)
Monitoring:Ambari,Ganglia,Nagios,Inca(NA)
Layered Architecture (Lower)
• NA – Non Apache
projects
• Green layers are
Apache/Commercial
Cloud (light) to HPC
(darker) integration
layers
In memory distributed databases/caches: GORA (general object from NoSQL), Memcached (NA),
Redis(NA) (key value), Hazelcast (NA), Ehcache (NA);
In memory distributed databases/caches: GORA (general object from NoSQL), Memcached (NA),
Redis(NA) (key value), Hazelcast (NA), Ehcache (NA);
Mesos, Yarn, Helix, Llama(Cloudera) Condor, Moab, Slurm, Torque(NA) ……..
ABDS Cluster Resource Management HPC Cluster Resource Management
ABDS File Systems User Level HPC File Systems (NA)
HDFS, Swift, Ceph FUSE(NA) Gluster, Lustre, GPFS, GFFS
Object Stores POSIX Interface Distributed, Parallel, Federated
iRODS(NA)
Interoperability Layer Whirr / JClouds OCCI CDMI (NA)
DevOps/Cloud Deployment Puppet/Chef/Boto/CloudMesh(NA)
Cross Cutting
Capabilities
DistributedCoordination:ZooKeeper,JGroups
DistributedCoordination:ZooKeeper,JGroups
MessageProtocols:Thrift,Protobuf(NA)
MessageProtocols:Thrift,Protobuf(NA)
Security&Privacy
Security&Privacy
Monitoring:Ambari,Ganglia,Nagios,Inca(NA)
Monitoring:Ambari,Ganglia,Nagios,Inca(NA)
SQL
MySQL
(NA)
SciDB
(NA)
Arrays,
R,Python
Phoenix
(SQL on
HBase)
UIMA
(Entities)
(Watson)
Tika
(Content)
Extraction Tools
Cassandra
(DHT)
NoSQL: Column
HBase
(Data on
HDFS)
Accumulo
(Data on
HDFS)
Solandra
(Solr+
Cassandra)
+Document
Azure
Table
NoSQL: Document
MongoDB
(NA)
CouchDB Lucene
Solr
Riak
~Dynamo
NoSQL: Key Value (all NA)
Dynamo
Amazon
Voldemort
~Dynamo
Berkeley
DB
Neo4J
Java Gnu
(NA)
NoSQL: General Graph
RYA RDF on
Accumulo
NoSQL: TripleStore RDF SparkQL
AllegroGraph
Commercial
Sesame
(NA)
Yarcdata
Commercial
(NA)
Jena
ORM Object Relational Mapping: Hibernate(NA), OpenJPA and JDBC StandardORM Object Relational Mapping: Hibernate(NA), OpenJPA and JDBC Standard
File
Management
IaaS System Manager Open Source Commercial Clouds
OpenStack, OpenNebula, Eucalyptus, CloudStack, vCloud, Amazon, Azure, Google
Bare
Metal
Data Transport BitTorrent, HTTP, FTP, SSH Globus Online (GridFTP)
Big Data Tools for HPC and Supercomputing
MPI(Message Passing Interface, 1992)
Provide standardized function interfaces for communication between parallel
processes.
Collective communication operations
Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce-scatter.
Popular implementations
MPICH (2001)
OpenMPI (2004)
http://www.open-mpi.org/
MapReduce Model
Google MapReduce (2004)
Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
Apache Hadoop (2005)
http://hadoop.apache.org/
http://developer.yahoo.com/hadoop/tutorial/
Apache Hadoop 2.0 (2012)
Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator,
SOCC 2013.
Separation between resource management and computation model.
Key Features of MapReduce Model
Designed for clouds
Large clusters of commodity machines
Designed for big data
Support from local disks based distributed file system (GFS / HDFS)
Disk based intermediate data transfer in Shuffling
MapReduce programming model
Computation pattern: Map tasks and Reduce tasks
Data abstraction: KeyValue pairs
Google MapReduce
Worker
Worker
Worker
Worker
Worker
(1) fork (1) fork (1) fork
Master(2)
assign
map
(2)
assign
reduce
(3) read (4) local write
(5) remote read
Output
File 0
Output
File 1
(6) write
Split 0
Split 1
Split 2
Input files
Mapper: split, read, emit
intermediate KeyValue pairs
Reducer: repartition, emits final
output
User
Program
Map phase
Intermediate files
(on local disks)
Reduce phase Output files
Iterative MapReduce Model
Twister Programming Model
configureMaps(…)
configureReduce(…)
runMapReduce(...)
while(condition){
} //end while
updateCondition()
close()
Combine()
operation
Reduce()
Map()
Worker Nodes
Communications/data transfers via the
pub-sub broker network & direct TCP
Iterations
May scatter/broadcast <Key,Value>
pairs directly
Local Disk
Cacheable map/reduce tasks
Main program’s process space
• Main program may contain many MapReduce
invocations or iterative MapReduce invocations
May merge data in shuffling
DAG (Directed Acyclic Graph) Model
Dryad and DryadLINQ (2007)
Michael Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential
Building Blocks, EuroSys, 2007.
http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
Model Composition
Apache Spark (2010)
Matei Zaharia et al. Spark: Cluster Computing with Working Sets,. HotCloud 2010.
Matei Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing. NSDI 2012.
http://spark.apache.org/
Resilient Distributed Dataset (RDD)
RDD operations
MapReduce-like parallel operations
DAG of execution stages and pipelined transformations
Simple collectives: broadcasting and aggregation
Graph Processing with BSP model
Pregel (2010)
Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010.
Apache Hama (2010)
https://hama.apache.org/
Apache Giraph (2012)
https://giraph.apache.org/
Scaling Apache Giraph to a trillion edges
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/101516170061
Pregel & Apache Giraph
 Computation Model
 Superstep as iteration
 Vertex state machine:
Active and Inactive, vote to halt
 Message passing between vertices
 Combiners
 Aggregators
 Topology mutation
 Master/worker model
 Graph partition: hashing
 Fault tolerance: checkpointing and confined
recovery
3 6 2 1
6 6 2 6
6 6 6 6
6 6 6 6
Superstep 0
Superstep 1
Superstep 2
Superstep 3
Vote to halt Active
Maximum Value Example
Giraph Page Rank Code Example
public class PageRankComputation
extends BasicComputation<IntWritable, FloatWritable, NullWritable, FloatWritable> {
/** Number of supersteps */
public static final String SUPERSTEP_COUNT = "giraph.pageRank.superstepCount";
@Override
public void compute(Vertex<IntWritable, FloatWritable, NullWritable> vertex, Iterable<FloatWritable>
messages)
throws IOException {
if (getSuperstep() >= 1) {
float sum = 0;
for (FloatWritable message : messages) {
sum += message.get();
}
vertex.getValue().set((0.15f / getTotalNumVertices()) + 0.85f * sum);
}
if (getSuperstep() < getConf().getInt(SUPERSTEP_COUNT, 0)) {
sendMessageToAllEdges(vertex,
new FloatWritable(vertex.getValue().get() / vertex.getNumEdges()));
} else {
vertex.voteToHalt();
}
}
}
GraphLab (2010)
Yucheng Low et al. GraphLab: A New Parallel Framework for Machine Learning. UAI
2010.
Yucheng Low, et al. Distributed GraphLab: A Framework for Machine Learning and Data
Mining in the Cloud. PVLDB 2012.
http://graphlab.org/projects/index.html
http://graphlab.org/resources/publications.html
Data graph
Update functions and the scope
Sync operation (similar to aggregation in Pregel)
Data Graph
Vertex-cut v.s. Edge-cut
PowerGraph (2012)
Joseph E. Gonzalez et al. PowerGraph: Distributed
Graph-Parallel Computation on Natural Graphs.
OSDI 2012.
Gather, apply, Scatter (GAS) model
GraphX (2013)
Reynold Xin et al. GraphX: A Resilient Distributed
Graph System on Spark. GRADES (SIGMOD
workshop) 2013.
https
://amplab.cs.berkeley.edu/publication/graphx-grades/
Edge-cut (Giraph model)
Vertex-cut (GAS model)
To reduce communication overhead….
Option 1
Algorithmic message reduction
Fixed point-to-point communication pattern
Option 2
Collective communication optimization
Not considered by previous BSP model but well developed in MPI
Initial attempts in Twister and Spark on clouds
Mosharaf Chowdhury et al. Managing Data Transfers in Computer Clusters with Orchestra.
SIGCOMM 2011.
Bingjing Zhang, Judy Qiu. High Performance Clustering of Social Images in a Map-Collective
Programming Model. SOCC Poster 2013.
Collective Model
Harp (2013)
https://github.com/jessezbj/harp-project
Hadoop Plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
Hierarchical data abstraction on arrays, key-values and graphs for easy programming
expressiveness.
Collective communication model to support various communication operations on the
data abstractions.
Caching with buffer management for memory allocation required from computation
and communication
BSP style parallelism
Fault tolerance with check-pointing
Harp Design
Parallelism Model Architecture
ShuffleShuffle
M M M M
Collective CommunicationCollective Communication
M M M M
R R
Map-Collective ModelMapReduce Model
YARN
MapReduce V2
Harp
MapReduce
Applications
Map-Collective
ApplicationsApplication
Framework
Resource
Manager
Vertex Table
KeyValue
Partition
Array
Commutable
Key-ValuesVertices, Edges, Messages
Double
Array
Int
Array
Long
Array
Array Partition
< Array Type >
Struct Object
Vertex
Partition
Edge Partition
Array Table
<Array Type>
Message
Partition
KeyValue Table
Byte
Array
Message
Table
Edge
Table
Broadcast, Send, Gather
Broadcast, Allgather, Allreduce, Regroup-(combine/reduce),
Message-to-Vertex, Edge-to-Vertex
Broadcast, Send
Table
Partition
Basic Types
Hierarchical Data Abstraction
and Collective Communication
Harp Bcast Code Example
protected void mapCollective(KeyValReader reader, Context context)
throws IOException, InterruptedException {
ArrTable<DoubleArray, DoubleArrPlus> table =
new ArrTable<DoubleArray, DoubleArrPlus>(0, DoubleArray.class, DoubleArrPlus.class);
if (this.isMaster()) {
String cFile = conf.get(KMeansConstants.CFILE);
Map<Integer, DoubleArray> cenDataMap = createCenDataMap(cParSize, rest, numCenPartitions,
vectorSize, this.getResourcePool());
loadCentroids(cenDataMap, vectorSize, cFile, conf);
addPartitionMapToTable(cenDataMap, table);
}
arrTableBcast(table);
}
Pipelined Broadcasting with
Topology-Awareness
Twister vs. MPI
(Broadcasting 0.5~2GB data)
Twister vs. MPJ
(Broadcasting 0.5~2GB data)
Twister vs. Spark (Broadcasting 0.5GB
data)
Twister Chain with/without topology-awareness
Tested on IU Polar Grid with 1 Gbps Ethernet connection
K-Means Clustering Performance on Madrid Cluster
(8 nodes)
K-means Clustering Parallel Efficiency
• Shantenu Jha et al.ATale
ofTwo Data-Intensive
Paradigms:Applications,
Abstractions, and
Architectures. 2014.
WDA-MDS Performance on
Big Red II
• WDA-MDS
• Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative
Multidimensional Scaling with Weighting. IEEE e-Dcience 2013.
• Big Red II
• http://kb.iu.edu/data/bcqt.html
• Allgather
• Bucket algorithm
• Allreduce
• Bidirectional exchange algorithm
Execution Time of 100k Problem
Parallel Efficiency
Based On 8 Nodes and 256 Cores
Scale Problem Size (100k, 200k, 300k)
Machine Learning on Big Data
Mahout on Hadoop
https://mahout.apache.org/
MLlib on Spark
http://spark.apache.org/mllib/
GraphLab Toolkits
http://graphlab.org/projects/toolkits.html
GraphLab Computer Vision Toolkit
Query on Big Data
Query with procedural language
Google Sawzall (2003)
Rob Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Special Issue on
Grids and Worldwide Computing Programming Models and Infrastructure 2003.
Apache Pig (2006)
Christopher Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing.
SIGMOD 2008.
https://pig.apache.org/
SQL-like Query
Apache Hive (2007)
Facebook Data Infrastructure Team. Hive - A Warehousing Solution Over a Map-Reduce
Framework. VLDB 2009.
https://hive.apache.org/
On top of Apache Hadoop
Shark (2012)
Reynold Xin et al. Shark: SQL and Rich Analytics at Scale. Technical Report. UCB/EECS
2012.
http://shark.cs.berkeley.edu/
On top of Apache Spark
Apache MRQL (2013)
http://mrql.incubator.apache.org/
On top of Apache Hadoop, Apache Hama, and Apache Spark
Other Tools for Query
Apache Tez (2013)
http://tez.incubator.apache.org/
To build complex DAG of tasks for Apache Pig and Apache Hive
On top of YARN
Dremel (2010) Apache Drill (2012)
Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. VLDB 2010.
http://incubator.apache.org/drill/index.html
System for interactive query
Stream Data Processing
Apache S4 (2011)
http://incubator.apache.org/s4/
Apache Storm (2011)
http://storm.incubator.apache.org/
Spark Streaming (2012)
https://spark.incubator.apache.org/streaming/
Apache Samza (2013)
http://samza.incubator.apache.org/
REEF
Retainable Evaluator Execution Framework
http://www.reef-project.org/
Provides system authors with a centralized (pluggable) control flow
Embeds a user-defined system controller called the Job Driver
Event driven control
Package a variety of data-processing libraries (e.g., high-bandwidth shuffle, relational
operators, low-latency group communication, etc.) in a reusable form.
To cover different models such as MapReduce, query, graph processing and stream
data processing
Questions?
Thank You!

More Related Content

What's hot

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitTyler Treat
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterSri Ambati
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Gabriele Modena
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingToni Cebrián
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connectorDuyhai Doan
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...randyguck
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Robert Metzger
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesRussell Spitzer
 

What's hot (20)

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profit
 
Introduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scaleIntroduction to Apache Drill - interactive query and analysis at scale
Introduction to Apache Drill - interactive query and analysis at scale
 
Scalding
ScaldingScalding
Scalding
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Building Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling WaterBuilding Machine Learning Applications with Sparkling Water
Building Machine Learning Applications with Sparkling Water
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabMapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Writing Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using ScaldingWriting Hadoop Jobs in Scala using Scalding
Writing Hadoop Jobs in Scala using Scalding
 
Cassandra spark connector
Cassandra spark connectorCassandra spark connector
Cassandra spark connector
 
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
Strata Presentation: One Billion Objects in 2GB: Big Data Analytics on Small ...
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
 
Spark Cassandra Connector Dataframes
Spark Cassandra Connector DataframesSpark Cassandra Connector Dataframes
Spark Cassandra Connector Dataframes
 

Viewers also liked

Processing edges on apache giraph
Processing edges on apache giraphProcessing edges on apache giraph
Processing edges on apache giraphDataWorks Summit
 
Introduction of apache giraph project
Introduction of apache giraph projectIntroduction of apache giraph project
Introduction of apache giraph projectChun Cheng Lin
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - HortonworksAvery Ching
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraphsscdotopen
 
Initiation à Neo4j
Initiation à Neo4jInitiation à Neo4j
Initiation à Neo4jNeo4j
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 

Viewers also liked (9)

LAUNCHPAD MAGAINE
LAUNCHPAD MAGAINELAUNCHPAD MAGAINE
LAUNCHPAD MAGAINE
 
Processing edges on apache giraph
Processing edges on apache giraphProcessing edges on apache giraph
Processing edges on apache giraph
 
Introduction of apache giraph project
Introduction of apache giraph projectIntroduction of apache giraph project
Introduction of apache giraph project
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks2011.10.14 Apache Giraph - Hortonworks
2011.10.14 Apache Giraph - Hortonworks
 
Large Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache GiraphLarge Scale Graph Processing with Apache Giraph
Large Scale Graph Processing with Apache Giraph
 
Initiation à Neo4j
Initiation à Neo4jInitiation à Neo4j
Initiation à Neo4j
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Hadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas VialHadoop Graph Analysis par Thomas Vial
Hadoop Graph Analysis par Thomas Vial
 

Similar to Hadoop trainingin bangalore

Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Robert Metzger
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Massimo Schenone
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The CloudsJacky Chu
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEPaco Nathan
 

Similar to Hadoop trainingin bangalore (20)

Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Hadoop
HadoopHadoop
Hadoop
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Taste Java In The Clouds
Taste Java In The CloudsTaste Java In The Clouds
Taste Java In The Clouds
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Brief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICMEBrief Intro to Apache Spark @ Stanford ICME
Brief Intro to Apache Spark @ Stanford ICME
 

Recently uploaded

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfSanaAli374401
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 

Recently uploaded (20)

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 

Hadoop trainingin bangalore

  • 1. A Brief Introduction of Existing Big DataTools Presented By
  • 2. Outline The world map of big data tools Layered architecture Big data tools for HPC and supercomputing MPI Big data tools on clouds MapReduce model Iterative MapReduce model DAG model Graph model Collective model Machine learning on big data Query on big data Stream data processing
  • 3. MapReduce ModelDAG Model Graph Model BSP/Collective Model Storm Twister For Iterations/ Learning For Streaming For Query S4 DrillDrill Hadoop MPI Dryad/ DryadLINQ Pig/PigLatin Spark Shark Spark Streaming MRQL Hive Tez Giraph Hama GraphLab Harp GraphX HaLoop Samza The World of Big Data Tools Stratosphere Reef
  • 4. Layered Architecture (Upper) • NA – Non Apache projects • Green layers are Apache/Commercial Cloud (light) to HPC (darker) integration layers Orche stratio n& Workf low Oozie, ODE, Airav ata and OOD T (Tools ) NA: Pegas us, Keple r, Swift, Taver na, Triden t, Active BPEL, BioKe pler, Galax y Orche stratio n& Workf low Oozie, ODE, Airav ata and OOD T (Tools ) NA: Pegas us, Keple r, Swift, Taver na, Triden t, Active BPEL, BioKe pler, Galax y Data Analytics Libraries:Machine Learning Mahout , MLlib , MLbase CompLearn (NA) Linear Algebra Scalapack, PetSc (NA) Statistics, Bioinformatics R, Bioconductor (NA) Imagery ImageJ (NA) MRQL (SQL on Hadoop, Hama, Spark) Hive (SQL on Hadoop) Pig (Procedural Language) Shark (SQL on Spark, NA) Hcatalog Interfaces Impala (NA) Cloudera (SQL on Hbase) Swazall (Log Files Google NA) High Level (Integrated) Systems for Data Processing Parallel Horizontally Scalable Data Processing Giraph ~Pregel Tez (DAG) Spark (Iterative MR) Storm S4 Yahoo Samza LinkedIn Hama (BSP) Hadoop (Map Reduce) Pegasus on Hadoop (NA) NA:Twister Stratosphere Iterative MR GraphBatch Stream Pub/Sub Messaging Netty (NA)/ZeroMQ (NA)/ActiveMQ/Qpid/Kafka ABDS Inter-process Communication Hadoop, Spark Communications MPI (NA) & Reductions Harp Collectives (NA) HPC Inter-process Communication Cross Cutting Capabilities DistributedCoordination:ZooKeeper,JGroups DistributedCoordination:ZooKeeper,JGroups MessageProtocols:Thrift,Protobuf(NA) MessageProtocols:Thrift,Protobuf(NA) Security&Privacy Security&Privacy Monitoring:Ambari,Ganglia,Nagios,Inca(NA) Monitoring:Ambari,Ganglia,Nagios,Inca(NA)
  • 5. Layered Architecture (Lower) • NA – Non Apache projects • Green layers are Apache/Commercial Cloud (light) to HPC (darker) integration layers In memory distributed databases/caches: GORA (general object from NoSQL), Memcached (NA), Redis(NA) (key value), Hazelcast (NA), Ehcache (NA); In memory distributed databases/caches: GORA (general object from NoSQL), Memcached (NA), Redis(NA) (key value), Hazelcast (NA), Ehcache (NA); Mesos, Yarn, Helix, Llama(Cloudera) Condor, Moab, Slurm, Torque(NA) …….. ABDS Cluster Resource Management HPC Cluster Resource Management ABDS File Systems User Level HPC File Systems (NA) HDFS, Swift, Ceph FUSE(NA) Gluster, Lustre, GPFS, GFFS Object Stores POSIX Interface Distributed, Parallel, Federated iRODS(NA) Interoperability Layer Whirr / JClouds OCCI CDMI (NA) DevOps/Cloud Deployment Puppet/Chef/Boto/CloudMesh(NA) Cross Cutting Capabilities DistributedCoordination:ZooKeeper,JGroups DistributedCoordination:ZooKeeper,JGroups MessageProtocols:Thrift,Protobuf(NA) MessageProtocols:Thrift,Protobuf(NA) Security&Privacy Security&Privacy Monitoring:Ambari,Ganglia,Nagios,Inca(NA) Monitoring:Ambari,Ganglia,Nagios,Inca(NA) SQL MySQL (NA) SciDB (NA) Arrays, R,Python Phoenix (SQL on HBase) UIMA (Entities) (Watson) Tika (Content) Extraction Tools Cassandra (DHT) NoSQL: Column HBase (Data on HDFS) Accumulo (Data on HDFS) Solandra (Solr+ Cassandra) +Document Azure Table NoSQL: Document MongoDB (NA) CouchDB Lucene Solr Riak ~Dynamo NoSQL: Key Value (all NA) Dynamo Amazon Voldemort ~Dynamo Berkeley DB Neo4J Java Gnu (NA) NoSQL: General Graph RYA RDF on Accumulo NoSQL: TripleStore RDF SparkQL AllegroGraph Commercial Sesame (NA) Yarcdata Commercial (NA) Jena ORM Object Relational Mapping: Hibernate(NA), OpenJPA and JDBC StandardORM Object Relational Mapping: Hibernate(NA), OpenJPA and JDBC Standard File Management IaaS System Manager Open Source Commercial Clouds OpenStack, OpenNebula, Eucalyptus, CloudStack, vCloud, Amazon, Azure, Google Bare Metal Data Transport BitTorrent, HTTP, FTP, SSH Globus Online (GridFTP)
  • 6. Big Data Tools for HPC and Supercomputing MPI(Message Passing Interface, 1992) Provide standardized function interfaces for communication between parallel processes. Collective communication operations Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce-scatter. Popular implementations MPICH (2001) OpenMPI (2004) http://www.open-mpi.org/
  • 7. MapReduce Model Google MapReduce (2004) Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004. Apache Hadoop (2005) http://hadoop.apache.org/ http://developer.yahoo.com/hadoop/tutorial/ Apache Hadoop 2.0 (2012) Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator, SOCC 2013. Separation between resource management and computation model.
  • 8. Key Features of MapReduce Model Designed for clouds Large clusters of commodity machines Designed for big data Support from local disks based distributed file system (GFS / HDFS) Disk based intermediate data transfer in Shuffling MapReduce programming model Computation pattern: Map tasks and Reduce tasks Data abstraction: KeyValue pairs
  • 9. Google MapReduce Worker Worker Worker Worker Worker (1) fork (1) fork (1) fork Master(2) assign map (2) assign reduce (3) read (4) local write (5) remote read Output File 0 Output File 1 (6) write Split 0 Split 1 Split 2 Input files Mapper: split, read, emit intermediate KeyValue pairs Reducer: repartition, emits final output User Program Map phase Intermediate files (on local disks) Reduce phase Output files
  • 11. Twister Programming Model configureMaps(…) configureReduce(…) runMapReduce(...) while(condition){ } //end while updateCondition() close() Combine() operation Reduce() Map() Worker Nodes Communications/data transfers via the pub-sub broker network & direct TCP Iterations May scatter/broadcast <Key,Value> pairs directly Local Disk Cacheable map/reduce tasks Main program’s process space • Main program may contain many MapReduce invocations or iterative MapReduce invocations May merge data in shuffling
  • 12. DAG (Directed Acyclic Graph) Model Dryad and DryadLINQ (2007) Michael Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks, EuroSys, 2007. http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx
  • 13. Model Composition Apache Spark (2010) Matei Zaharia et al. Spark: Cluster Computing with Working Sets,. HotCloud 2010. Matei Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. http://spark.apache.org/ Resilient Distributed Dataset (RDD) RDD operations MapReduce-like parallel operations DAG of execution stages and pipelined transformations Simple collectives: broadcasting and aggregation
  • 14. Graph Processing with BSP model Pregel (2010) Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010. Apache Hama (2010) https://hama.apache.org/ Apache Giraph (2012) https://giraph.apache.org/ Scaling Apache Giraph to a trillion edges https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/101516170061
  • 15. Pregel & Apache Giraph  Computation Model  Superstep as iteration  Vertex state machine: Active and Inactive, vote to halt  Message passing between vertices  Combiners  Aggregators  Topology mutation  Master/worker model  Graph partition: hashing  Fault tolerance: checkpointing and confined recovery 3 6 2 1 6 6 2 6 6 6 6 6 6 6 6 6 Superstep 0 Superstep 1 Superstep 2 Superstep 3 Vote to halt Active Maximum Value Example
  • 16. Giraph Page Rank Code Example public class PageRankComputation extends BasicComputation<IntWritable, FloatWritable, NullWritable, FloatWritable> { /** Number of supersteps */ public static final String SUPERSTEP_COUNT = "giraph.pageRank.superstepCount"; @Override public void compute(Vertex<IntWritable, FloatWritable, NullWritable> vertex, Iterable<FloatWritable> messages) throws IOException { if (getSuperstep() >= 1) { float sum = 0; for (FloatWritable message : messages) { sum += message.get(); } vertex.getValue().set((0.15f / getTotalNumVertices()) + 0.85f * sum); } if (getSuperstep() < getConf().getInt(SUPERSTEP_COUNT, 0)) { sendMessageToAllEdges(vertex, new FloatWritable(vertex.getValue().get() / vertex.getNumEdges())); } else { vertex.voteToHalt(); } } }
  • 17. GraphLab (2010) Yucheng Low et al. GraphLab: A New Parallel Framework for Machine Learning. UAI 2010. Yucheng Low, et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB 2012. http://graphlab.org/projects/index.html http://graphlab.org/resources/publications.html Data graph Update functions and the scope Sync operation (similar to aggregation in Pregel)
  • 19. Vertex-cut v.s. Edge-cut PowerGraph (2012) Joseph E. Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012. Gather, apply, Scatter (GAS) model GraphX (2013) Reynold Xin et al. GraphX: A Resilient Distributed Graph System on Spark. GRADES (SIGMOD workshop) 2013. https ://amplab.cs.berkeley.edu/publication/graphx-grades/ Edge-cut (Giraph model) Vertex-cut (GAS model)
  • 20. To reduce communication overhead…. Option 1 Algorithmic message reduction Fixed point-to-point communication pattern Option 2 Collective communication optimization Not considered by previous BSP model but well developed in MPI Initial attempts in Twister and Spark on clouds Mosharaf Chowdhury et al. Managing Data Transfers in Computer Clusters with Orchestra. SIGCOMM 2011. Bingjing Zhang, Judy Qiu. High Performance Clustering of Social Images in a Map-Collective Programming Model. SOCC Poster 2013.
  • 21. Collective Model Harp (2013) https://github.com/jessezbj/harp-project Hadoop Plugin (on Hadoop 1.2.1 and Hadoop 2.2.0) Hierarchical data abstraction on arrays, key-values and graphs for easy programming expressiveness. Collective communication model to support various communication operations on the data abstractions. Caching with buffer management for memory allocation required from computation and communication BSP style parallelism Fault tolerance with check-pointing
  • 22. Harp Design Parallelism Model Architecture ShuffleShuffle M M M M Collective CommunicationCollective Communication M M M M R R Map-Collective ModelMapReduce Model YARN MapReduce V2 Harp MapReduce Applications Map-Collective ApplicationsApplication Framework Resource Manager
  • 23. Vertex Table KeyValue Partition Array Commutable Key-ValuesVertices, Edges, Messages Double Array Int Array Long Array Array Partition < Array Type > Struct Object Vertex Partition Edge Partition Array Table <Array Type> Message Partition KeyValue Table Byte Array Message Table Edge Table Broadcast, Send, Gather Broadcast, Allgather, Allreduce, Regroup-(combine/reduce), Message-to-Vertex, Edge-to-Vertex Broadcast, Send Table Partition Basic Types Hierarchical Data Abstraction and Collective Communication
  • 24. Harp Bcast Code Example protected void mapCollective(KeyValReader reader, Context context) throws IOException, InterruptedException { ArrTable<DoubleArray, DoubleArrPlus> table = new ArrTable<DoubleArray, DoubleArrPlus>(0, DoubleArray.class, DoubleArrPlus.class); if (this.isMaster()) { String cFile = conf.get(KMeansConstants.CFILE); Map<Integer, DoubleArray> cenDataMap = createCenDataMap(cParSize, rest, numCenPartitions, vectorSize, this.getResourcePool()); loadCentroids(cenDataMap, vectorSize, cFile, conf); addPartitionMapToTable(cenDataMap, table); } arrTableBcast(table); }
  • 25. Pipelined Broadcasting with Topology-Awareness Twister vs. MPI (Broadcasting 0.5~2GB data) Twister vs. MPJ (Broadcasting 0.5~2GB data) Twister vs. Spark (Broadcasting 0.5GB data) Twister Chain with/without topology-awareness Tested on IU Polar Grid with 1 Gbps Ethernet connection
  • 26. K-Means Clustering Performance on Madrid Cluster (8 nodes)
  • 27. K-means Clustering Parallel Efficiency • Shantenu Jha et al.ATale ofTwo Data-Intensive Paradigms:Applications, Abstractions, and Architectures. 2014.
  • 28. WDA-MDS Performance on Big Red II • WDA-MDS • Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. IEEE e-Dcience 2013. • Big Red II • http://kb.iu.edu/data/bcqt.html • Allgather • Bucket algorithm • Allreduce • Bidirectional exchange algorithm
  • 29. Execution Time of 100k Problem
  • 30. Parallel Efficiency Based On 8 Nodes and 256 Cores
  • 31. Scale Problem Size (100k, 200k, 300k)
  • 32. Machine Learning on Big Data Mahout on Hadoop https://mahout.apache.org/ MLlib on Spark http://spark.apache.org/mllib/ GraphLab Toolkits http://graphlab.org/projects/toolkits.html GraphLab Computer Vision Toolkit
  • 33. Query on Big Data Query with procedural language Google Sawzall (2003) Rob Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 2003. Apache Pig (2006) Christopher Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008. https://pig.apache.org/
  • 34. SQL-like Query Apache Hive (2007) Facebook Data Infrastructure Team. Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB 2009. https://hive.apache.org/ On top of Apache Hadoop Shark (2012) Reynold Xin et al. Shark: SQL and Rich Analytics at Scale. Technical Report. UCB/EECS 2012. http://shark.cs.berkeley.edu/ On top of Apache Spark Apache MRQL (2013) http://mrql.incubator.apache.org/ On top of Apache Hadoop, Apache Hama, and Apache Spark
  • 35. Other Tools for Query Apache Tez (2013) http://tez.incubator.apache.org/ To build complex DAG of tasks for Apache Pig and Apache Hive On top of YARN Dremel (2010) Apache Drill (2012) Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. VLDB 2010. http://incubator.apache.org/drill/index.html System for interactive query
  • 36. Stream Data Processing Apache S4 (2011) http://incubator.apache.org/s4/ Apache Storm (2011) http://storm.incubator.apache.org/ Spark Streaming (2012) https://spark.incubator.apache.org/streaming/ Apache Samza (2013) http://samza.incubator.apache.org/
  • 37. REEF Retainable Evaluator Execution Framework http://www.reef-project.org/ Provides system authors with a centralized (pluggable) control flow Embeds a user-defined system controller called the Job Driver Event driven control Package a variety of data-processing libraries (e.g., high-bandwidth shuffle, relational operators, low-latency group communication, etc.) in a reusable form. To cover different models such as MapReduce, query, graph processing and stream data processing