Memory Speed Big Data Analytics: Alluxio vs Apache Ignite
Irfan Elahi - Deloitte
About Me
• Consultant at Deloitte (Analytics Service Line)
• 4+ years of experience in Big Data and Machine Learning across multiple verticals
• Recent Deloitte projects at Australia’s biggest telecom company:
  • Architecting one of the largest Hadoop deployments in the cloud, employing in-memory computation technologies
  • Developing an enterprise-grade stream processing system based on the Hadoop stack, employing NoSQL data stores
• Premium Udemy instructor with 12,000+ students from 131 countries
• Technical reviewer of an upcoming Hadoop book published by Apress
• Let’s connect: ielahi@deloitte.com.au | linkedin.com/in/irfanelahi | @elahi_irfan
Agenda
• Big Data – The Evolution and Beyond
• In-Memory Computation Trend – Overview
• Unaddressed Challenges in Big Data
• Introduction and Deep Dive Comparison of Alluxio and Apache Ignite
Big Data: The Evolution and Beyond
Scale-up
• Make individual, expensive machines bigger
• Challenges:
  • No collocated processing
  • Limited scalability
Scale-out
• Add more machines
• Challenges:
  • Programming complexity
  • Partial failures
Hadoop @ MapReduce
• Scalable, economical and fault-tolerant processing and storage
• Fit for offline computation
• Disk I/O bound
Hadoop @ Spark
• Distributed in-memory computation framework
• Fit for offline + online computation
• Many challenges remain unaddressed
Memory-Centric Platforms
• Enable advanced caching and memory management features to optimize performance and address many of these challenges
Name of the game:
Memory is the new disk!
In-Memory Trend: Overview
Driving Factor: Economics
• Memory is much faster than disk (approx. 3000x)
• Cost of memory is decreasing
• Memory per node is increasing
Challenges:
• Memory is still more expensive than disk (approx. 80x)
• Memory is still limited
• Not all data is memory-worthy
and that’s not all…
Driving Factor: Limitations of Traditional Paradigms
Intermittent disk I/O and serialization costs in traditional computing platforms (e.g. MapReduce) cause:
• High latency
• Inefficiency in executing iterative analytics algorithms (e.g. machine learning, graph and network analysis)
• Inefficiency in interactive data mining
• Infeasibility for innovative use cases like stream processing
Impact: Innovative Technologies and Processing Patterns
• New processing patterns: Batch -> Event Driven
• New processing technologies: MapReduce -> Spark, Hive -> Impala
• New storage technologies: HDFS -> Alluxio | IGFS
• New use cases: real-time stream processing, IoT
Unaddressed Challenges: Overview
On-Heap Memory Constraints:
• On-heap memory in memory-centric platforms (e.g. Spark) is limited, which causes resource pressure
• Data resilience is compromised when an application crashes, and recovery causes expensive disk I/O
• Inter-process data/state sharing still relies on HDFS I/O, which causes performance issues
Many-Compute-to-Storage Integration Paradox:
• Managing an increasing number of compute and storage platforms increases complexity
• Adding or removing systems requires application changes, which impacts the DevOps lifecycle
• Data locality gets compromised
Missing SQL and Transactional Support on Hadoop:
Many leading Big Data platforms still don’t support:
• ACID-compliant transactions
• ANSI SQL compliance
• Indexing
• In-place mutation
Potential Missing Pieces of Puzzle:
• Alluxio
• Apache Ignite
Alluxio
A distributed, scalable storage layer virtualized across multiple storage systems under a unified namespace to facilitate data access at memory speed
• Launched in 2012 by UC Berkeley AMPLab
• Formerly known as Tachyon
• Licensed under Apache License 2.0
• Approximately 500 contributors
• Deployed at Yahoo, Baidu, Intel and Samsung, to name a few
Apache Ignite
A distributed key-value store and scalable in-memory computing platform with powerful SQL, key-value and processing APIs
• First release in early 2015
• August 2015: second-fastest project to graduate after Spark
• Licensed under Apache License 2.0
• Approximately 120 contributors
• Deployed at IBM, Siemens, Citibank, Barclays and Nielsen, to name a few
Deep Dive Comparison
• Alluxio
• Apache Ignite
Architecture
Alluxio
Master Nodes:
• Manage file system metadata
• Can be primary or secondary
• HA supported via ZooKeeper
Worker Nodes:
• Store data in the form of blocks
• No rebalancing of blocks when new nodes are added
• Send heartbeats to master nodes
Requires an Under File System (UFS) (e.g. HDFS, S3) for operation
[Diagram: Alluxio master node(s) and worker node(s)]
Apache Ignite
Optional Node Roles:
• Servers (default | equal by design | multiple servers on one host)
• Clients (explicitly defined | connect to servers for computation)
Logical Grouping:
• Node roles are user-configurable via attributes that nodes register at start-up
• Registered attributes can be used for dynamic logical grouping based on predicates (e.g. CPU utilization > 50%) to localize processes and jobs
[Diagram: Ignite servers in logical groups A and B]
No NameNode Architecture:
• When used as IGFS, no centralized metadata management (like the HDFS NameNode or Alluxio’s master nodes) is needed
• Hashing is used to determine data locality
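To make the "hashing instead of a NameNode" idea concrete, here is a minimal sketch in plain Scala. It is not Ignite's implementation: the partition count, node names and scoring function are assumptions made for illustration. It uses rendezvous (highest-random-weight) hashing, the family of techniques Ignite's default affinity function belongs to, so that every client can compute the owner of a key locally.

```scala
import scala.util.hashing.MurmurHash3

object HashPlacement {
  // Hypothetical partition count, chosen for illustration only.
  val Partitions = 1024

  // Map a key to a partition with a non-negative modulo of its hash code.
  def partitionOf(key: Any): Int = java.lang.Math.floorMod(key.hashCode, Partitions)

  // Rendezvous hashing: score every node against the partition and pick the
  // highest score; ties are broken by node name so the result is deterministic.
  def ownerOf(partition: Int, nodes: Seq[String]): String = {
    require(nodes.nonEmpty, "at least one node is required")
    nodes.maxBy(node => (MurmurHash3.stringHash(s"$node#$partition"), node))
  }

  // Any client can locate the owner of a key without asking a central master.
  def locate(key: Any, nodes: Seq[String]): String = ownerOf(partitionOf(key), nodes)
}
```

Because the owner is simply the highest-scoring node, removing a node that does not own a partition leaves that partition where it was, which keeps data movement small when the topology changes.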
Architecture (Continued)
Alluxio
Configuration:
• Requires explicitly specifying master node(s) and worker node(s) in the configuration files
• Adding nodes requires restarting the cluster
Interfaces:
• Alluxio Shell
• Web interface (also allows browsing the Alluxio file system)
Apache Ignite
Configuration:
• Doesn’t require explicit specification of nodes in configuration files
• Nodes discover each other automatically when started
• Supported node discovery methods:
  • Multicast
  • Static IP based
• Cluster can be scaled without restarting
• Supported deployment modes: Shared or Embedded
Interfaces:
• Visor Command Line: for viewing topology, node metrics and cache statistics, and for administering the cluster
• Web interface: needs to be installed separately
Architecture (Continued)
[Screenshots: Alluxio Shell (Alluxio) and Visor Command Line (Apache Ignite)]
Architecture (Continued)
[Screenshot: Alluxio Web Interface]
Apache Ignite
• Provides two modes of persistence in addition to in-memory:
  • Native persistence (disk only)
  • 3rd-party persistence (pluggable)
Native persistence:
• Treats disk as the store for the superset of data
• Supports SSD, Flash and 3D XPoint storage
• Features like ACID compliance and SQL are supported only in this mode
3rd-party persistence:
• Data stores like HDFS, Cassandra and JDBC-based stores are pluggable
• Involves implementing the CacheStore interface for read-through/write-through
• Supports write-behind caching for improved performance
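The difference between write-through and write-behind can be sketched in a few lines of plain Scala. This is an illustration only: `BackingStore`, `Cache` and `flushSize` are hypothetical names, not Ignite's CacheStore API. Write-through pushes every update to the external store synchronously, while write-behind buffers updates, coalesces repeated writes to the same key and flushes them in batches.

```scala
import scala.collection.mutable

object WriteBehindSketch {
  // Stand-in for an external store such as a JDBC table (hypothetical).
  class BackingStore {
    val rows = mutable.Map.empty[Int, String]
    var batches = 0 // number of round trips made to the store
    def writeAll(entries: Map[Int, String]): Unit = { rows ++= entries; batches += 1 }
  }

  class Cache(store: BackingStore, writeBehind: Boolean, flushSize: Int = 3) {
    private val memory = mutable.Map.empty[Int, String]
    private val pending = mutable.LinkedHashMap.empty[Int, String]

    def put(key: Int, value: String): Unit = {
      memory(key) = value
      if (!writeBehind) store.writeAll(Map(key -> value)) // write-through: synchronous
      else {
        pending(key) = value // write-behind: buffer and coalesce by key
        if (pending.size >= flushSize) flush()
      }
    }

    // Read-through: on a cache miss, load the value from the backing store.
    def get(key: Int): Option[String] =
      memory.get(key).orElse {
        val loaded = store.rows.get(key)
        loaded.foreach(v => memory(key) = v)
        loaded
      }

    // Push all buffered updates to the store in one batch.
    def flush(): Unit = if (pending.nonEmpty) {
      store.writeAll(pending.toMap)
      pending.clear()
    }
  }
}
```

The trade-off is visible in `batches`: write-behind makes fewer round trips, at the cost of a window in which buffered updates exist only in memory.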
Integration with Data Stores
[Diagram: Alluxio is exposed as a file system to Spark, MapReduce, Flink, … on top of S3, HDFS, Blob, …; Ignite (in-memory) serves Spark/MapReduce/Flink, SQL, Streaming, Machine Learning, Compute, … with native persistence to disk and 3rd-party persistence to HDFS, Cassandra, …]
Alluxio
• Enables processing frameworks to interact with data from different data stores through a unified namespace and API
• Supports the following data stores as UFS: HDFS, Blob, S3, GCS, Minio, Ceph, Swift and MapR-FS, to name a few
• Different UFS are mounted at different mount points in the Alluxio namespace and then accessed seamlessly from applications
• Additional UFS storage can be added through configuration
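The mount-point idea can be illustrated with a small longest-prefix lookup in plain Scala. This is a sketch under stated assumptions, not Alluxio's code: the mount table contents, host names and bucket names below are made up, and paths are assumed to start with "/".

```scala
object MountTable {
  // Hypothetical mount points: Alluxio path prefix -> under-storage URI.
  val mounts: Map[String, String] = Map(
    "/"        -> "hdfs://namenode:8020/alluxio",
    "/data/s3" -> "s3a://example-bucket/datasets"
  )

  // Resolve a path in the unified namespace to its under-storage URI by
  // picking the longest mount point that is a prefix of the path.
  def resolve(path: String): String = {
    val (prefix, ufs) = mounts
      .filter { case (p, _) => p == "/" || path == p || path.startsWith(p + "/") }
      .maxBy { case (p, _) => p.length }
    val rest = path.stripPrefix(prefix).stripPrefix("/")
    if (rest.isEmpty) ufs else s"$ufs/$rest"
  }
}
```

An application only ever sees paths such as `/data/s3/events/2017.csv`; which storage system actually holds the bytes is decided by the mount table, so storage can be added or swapped without changing application code.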
Integration with Spark
Alluxio
• Alluxio provides Hadoop-compatible file system APIs, so data can be read and written via Spark RDDs’ file system APIs
• Enables reading and writing data in different data stores (configured as UFS) via Alluxio’s unified namespace and API
• Automatically manages the movement of data persisted in Alluxio or the UFS
Apache Ignite
Two ways to integrate with Spark:
• As stand-alone IGFS or as a caching layer on HDFS:
  • Ignite is exposed as an HDFS-compatible file system, so data can be read and written via Spark RDDs’ file system APIs
• As a distributed cache via IgniteContext:
  • Provides an implementation of Spark RDDs (supporting all transformations and actions)
  • Mutable RDDs (a view over the distributed cache’s content)
  • Configurable lifespan depending on Ignite’s deployment mode
[Diagram: Spark reads and writes Alluxio (exposed as a file system over S3, HDFS, Blob) and Ignite IGFS through the RDD file system API; with Ignite as a distributed cache, IgniteContext yields an IgniteRDD [Tuple2], transformations are applied and the savePairs action writes back, with disk underneath]
Integration with Spark (Continued)
Alluxio
// reading data:
val textRdd = sc.textFile("alluxio://masternode:19998/path")
// transformations:
val textRdd2 = textRdd.filter(_.contains("deloitte"))
// writing data:
textRdd2.saveAsTextFile("alluxio://masternode:19998/destination_path")
Apache Ignite
// imports needed by the snippet below:
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
// creating IgniteContext:
val igniteContext = new IgniteContext(sparkContext, () => new IgniteConfiguration())
// creating IgniteRDD:
val cacheRdd: IgniteRDD[Integer, String] = igniteContext.fromCache("deloitte_cache")
// transformations (the result is a plain Spark RDD, not an IgniteRDD):
val cacheRdd2 = cacheRdd.filter(_._2.contains("deloitte"))
// writing data back to the Ignite cache:
cacheRdd.savePairs(cacheRdd2)
Memory Architecture
Alluxio
Alluxio storage is divided into three ordered tiers, from highest to lowest performance:
• MEM (memory)
• SSD
• HDD
• Allows storing more data than the memory available in the cluster
• Automatically manages data between tiers
• Data is written to the top tier by default
[Diagram: Alluxio tiers MEM, SSD and HDD, ordered by performance]
Apache Ignite
With Native Persistence, data and index storage is divided into:
• Memory (subset of data)
• Disk (superset of data)
• Data can be stored both off-heap and on-heap
• When stored off-heap, there are fewer constraints on the volume of data and fewer GC pauses
• Memory is further divided into Memory Regions
• Memory Regions consist of Memory Segments, which comprise Data Pages, B+ Tree Pages, Index Pages and FreeList structures
[Diagram: memory and disk; Memory Regions contain Memory Segments that hold Data Pages, B+ Tree Pages and Index Pages]
Advanced Memory Management
Alluxio
• Pinning/Unpinning: to enforce data locality in a specific tier
• Allocators: for choosing locations to write new data blocks
• Evictors: for choosing which data to move to a lower tier to free space. Supported algorithms: Greedy, LRU, LRFU, Partial LRU
• Evictors and Allocators are applied globally
• A write may fail if space can’t be freed or if the data exceeds the size of the top tier
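The interplay of tiers and evictors can be sketched in plain Scala. This is a toy model, not Alluxio's allocator or evictor code: capacities are counted in blocks rather than bytes, block ids are assumed to be unique, and the oldest block in a tier is demoted first as a stand-in for an LRU-style policy.

```scala
import scala.collection.mutable

object TieredStoreSketch {
  // One storage tier holding at most `capacity` blocks (real tiers are sized in bytes).
  final class Tier(val name: String, val capacity: Int) {
    require(capacity > 0, "capacity must be positive")
    // Insertion order doubles as age order: the head is the oldest block.
    val blocks = mutable.LinkedHashMap.empty[String, String]
  }

  final class TieredStore(tiers: Vector[Tier]) {
    // Place a block at `level`; if that tier is full, first demote its oldest
    // block one tier down to free space.
    private def place(id: String, data: String, level: Int): Boolean =
      if (level >= tiers.length) false // below the lowest tier: space cannot be freed
      else {
        val tier = tiers(level)
        val hasRoom = tier.blocks.size < tier.capacity || {
          val (oldId, oldData) = tier.blocks.head
          place(oldId, oldData, level + 1) && tier.blocks.remove(oldId).isDefined
        }
        if (hasRoom) tier.blocks(id) = data
        hasRoom
      }

    // New data is written to the top tier by default.
    def write(id: String, data: String): Boolean = place(id, data, 0)

    // Name of the tier currently holding the block, if any.
    def tierOf(id: String): Option[String] = tiers.find(_.blocks.contains(id)).map(_.name)
  }
}
```

With three single-block tiers (MEM, SSD, HDD), writing three blocks cascades the older ones downward, and a fourth write returns `false` because no space can be freed, which mirrors the failure case in the last bullet.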
Apache Ignite
Supports memory policies (e.g. eviction) that apply at:
• Memory Region level (for off-heap caching)
• Entry level (for on-heap caching)
thus providing more granular control
Eviction:
• Supported algorithms for page-based eviction: Random-LRU, Random-2-LRU
• Supported algorithms for entry-level eviction: FIFO, LRU, Random
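As a rough illustration of page-based Random-LRU, the sketch below samples a few pages at random and evicts the one that was touched longest ago, which approximates LRU without keeping a fully ordered list. It is plain Scala written for this comparison, not Ignite's code; `pickVictim`, the timestamp map and the default sample size are assumptions.

```scala
import scala.util.Random

object RandomLruSketch {
  // lastAccess maps a page id to its last-access timestamp.
  // sampleSize is a tunable assumption made for this sketch.
  def pickVictim(lastAccess: Map[Int, Long], rnd: Random, sampleSize: Int = 5): Int = {
    require(lastAccess.nonEmpty, "no pages to evict")
    require(sampleSize > 0, "sample size must be positive")
    val pages = lastAccess.keys.toVector
    // Sample page ids uniformly at random (with replacement).
    val sample = Vector.fill(sampleSize)(pages(rnd.nextInt(pages.size)))
    // Among the sampled pages, evict the one with the oldest timestamp.
    sample.minBy(page => lastAccess(page))
  }
}
```

The victim is not guaranteed to be the globally oldest page; the approach trades exactness for constant-time bookkeeping per access.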
Additional Capabilities: SQL Support
Alluxio
• Not supported
Apache Ignite
• Supports distributed, horizontally scalable SQL database capabilities
• Supports indexing
• ANSI-99 SQL compliant
• Supports all SQL DDL and DML commands, including UPDATE, DELETE and MERGE queries
• Resembles Kudu’s capabilities
• Counters limitations of HDFS
• Supports running queries on data spanning memory and disk; unlike in Impala or Spark, not all of the data needs to be in memory for processing
[Diagram: Java, .NET, C++, PHP and BI tools connect via JDBC, ODBC and the Ignite SQL API to nodes that hold data and indexes in RAM and on disk]
Additional Capabilities (Continued)
Alluxio
• Key-value APIs (not transactional/ACID compliant)
Apache Ignite
• Key-value APIs (transactional, ACID compliant)
• Compute Grid: distributed and parallel computation
• ML Grid: machine learning library on top of Apache Ignite. Currently supports limited vector and matrix algebra operations; other algorithms are on the roadmap.
• Streaming ingestion: ingesting real-time streams of data into Ignite in a distributed, scalable and fault-tolerant manner
Key Takeaways:
• For inter-process state sharing (e.g. across Spark jobs), both provide adequate functional capabilities.
• Both platforms provide automatic hot data management, but Apache Ignite provides more granular control courtesy of its per-memory-region policies.
• For convenience in use cases that involve interacting with data in multiple storage systems at memory speed, Alluxio makes more sense.
• For building real-time data and analytics pipelines, Apache Ignite makes more sense as a sink.
• For analytical use cases that involve relational processing and in-place mutation at high speed, Apache Ignite makes more sense.
Questions

More Related Content

What's hot

What's hot (20)

Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 
BlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year InBlueStore, A New Storage Backend for Ceph, One Year In
BlueStore, A New Storage Backend for Ceph, One Year In
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdfDeep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
Deep Dive on ClickHouse Sharding and Replication-2202-09-22.pdf
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Ceph - A distributed storage system
Ceph - A distributed storage systemCeph - A distributed storage system
Ceph - A distributed storage system
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Ceph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion ObjectsCeph scale testing with 10 Billion Objects
Ceph scale testing with 10 Billion Objects
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Faster, better, stronger: The new InnoDB
Faster, better, stronger: The new InnoDBFaster, better, stronger: The new InnoDB
Faster, better, stronger: The new InnoDB
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 

Similar to Apache Ignite vs Alluxio: Memory Speed Big Data Analytics

Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Community
 

Similar to Apache Ignite vs Alluxio: Memory Speed Big Data Analytics (20)

Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
 
Alluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle MeetupAlluxio @ Uber Seattle Meetup
Alluxio @ Uber Seattle Meetup
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any Cloud
 
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
Case Study: Implementing Hadoop and Elastic Map Reduce on Scale-out Object S...
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio Flexible and Fast Storage for Deep Learning with Alluxio
Flexible and Fast Storage for Deep Learning with Alluxio
 
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & AlluxioAlluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
 
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloadsAlluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
Alluxio 2.0 Deep Dive – Simplifying data access for cloud workloads
 
Open Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and CloudOpen Source Data Orchestration for AI, Big Data, and Cloud
Open Source Data Orchestration for AI, Big Data, and Cloud
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017 Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
Enable Fast Big Data Analytics on Ceph with Alluxio at Ceph Days 2017
 
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 


Apache Ignite vs Alluxio: Memory Speed Big Data Analytics

  • 1. Memory Speed Big Data Analytics: Alluxio vs Apache Ignite. Irfan Elahi, Deloitte
  • 2. About Me
      • Working as a Consultant at Deloitte (Analytics Service Line)
      • 4+ years of experience in Big Data and Machine Learning in multiple verticals
      • Recent Deloitte projects at Australia’s biggest telecom company:
          • Architecting one of the largest Hadoop deployments in the cloud, employing in-memory computation technologies
          • Developing an enterprise-grade stream processing system based on the Hadoop stack, employing NoSQL data stores
      • Premium Udemy Instructor with 12,000+ students from 131 countries
      • Technical Reviewer of an upcoming Hadoop book published by Apress
      • Let’s connect: ielahi@deloitte.com.au | linkedin.com/in/irfanelahi | @elahi_irfan
  • 3. Agenda
      • Big Data – The Evolution and Beyond
      • In-Memory Computation Trend – Overview
      • Unaddressed Challenges in Big Data
      • Introduction and Deep Dive Comparison of Alluxio and Apache Ignite
  • 4. Big Data: The Evolution and Beyond
      • Scale-up: make individual, expensive machines bigger. Challenges: no collocated processing; limited scalability.
      • Scale-out: add more machines. Challenges: programming complexity; partial failures.
      • Hadoop @ MapReduce: scalable, economical and fault-tolerant processing and storage; fit for offline computation; disk I/O bound.
      • Hadoop @ Spark: distributed in-memory computation framework; fit for offline + online computation; many challenges remain unaddressed.
      • Memory Centric Platforms: enable advanced caching and memory management features to optimize performance and address many of those challenges.
  • 5. Name of the game: Memory is the new disk!
  • 6. In-Memory Trend: Overview
      • Driving Factor: Economics
          • Memory is much faster than disk (approx. 3000x)
          • Cost of memory is decreasing
          • Memory per node is increasing
      • Challenges:
          • Memory is still more expensive than disk (approx. 80x)
          • Memory is still limited
          • Not all data is memory-worthy
      • And that’s not all…
      • Driving Factor: Traditional Paradigms’ Limitations. Intermittent disk I/O and serialization cost in traditional computing platforms (e.g. MapReduce) cause:
          • High latency
          • Inefficiency in executing iterative algorithms in analytics (e.g. machine learning, graph and network analysis)
          • Inefficiency in interactive data mining
          • Infeasibility for innovative use cases like stream processing
      • Impact: Innovative technologies and processing patterns
          • New processing patterns: Batch -> Event Driven
          • New processing technologies: MapReduce -> Spark; Hive -> Impala
          • New storage technologies: HDFS -> Alluxio | IGFS
          • New use cases: real-time stream processing; IoT
  • 7. Unaddressed Challenges: Overview
      • On-Heap Memory Constraints:
          • On-heap memory in memory-centric platforms (e.g. Spark) is limited, which causes resource pressure
          • Data resilience is compromised when an application crashes, and recovery causes expensive disk I/O
          • Inter-process data/state sharing still relies on HDFS I/O, which causes performance issues
      • Many Compute to Storage Integration Paradox:
          • Managing an increasing number of compute and storage platforms increases complexity
          • Adding or removing the respective systems requires application changes, which impacts the DevOps lifecycle
          • Data locality gets compromised
      • Missing SQL and Transactional Support on Hadoop. Many leading Big Data platforms still don’t support:
          • ACID-compliant transactions
          • ANSI SQL compliance
          • Indexing
          • In-place mutation
  • 8. Potential Missing Pieces of the Puzzle:
      • Alluxio
      • Apache Ignite
  • 9. Alluxio: a distributed and scalable storage layer, virtualized across multiple storage systems under a unified namespace, that facilitates data access at memory speed
      • Launched in 2012 by UC Berkeley AMPLab
      • Formerly known as Tachyon
      • Licensed under Apache License 2.0
      • Approximately 500 contributors
      • Deployed at Yahoo, Baidu, Intel and Samsung, to name a few
  • 10. Ignite: a distributed key-value store and scalable in-memory computing platform with powerful SQL, key-value and processing APIs
      • First release in early 2015
      • August 2015: second-fastest project to graduate, after Spark
      • Licensed under Apache License 2.0
      • Approximately 120 contributors
      • Deployed at IBM, Siemens, Citibank, Barclays and Nielsen, to name a few
  • 11. Deep Dive Comparison
      • Alluxio
      • Apache Ignite
  • 12. Architecture
      • Alluxio
          • Master Nodes: manage file system metadata; can be primary or secondary; HA supported via ZooKeeper
          • Worker Nodes: store data in the form of blocks; no rebalancing of blocks when new nodes are added; send heartbeats to the Master Nodes
          • Requires an Under File System (UFS) (e.g. HDFS, S3) for operation
      • Apache Ignite
          • Optional Node Roles:
              • Servers (Default | Equal by design | Multiple servers on one host)
              • Clients (Explicitly defined | Connect to servers for computation)
          • Logical Grouping:
              • User-configurable node roles via attributes registered by nodes at start-up
              • Registered attributes can be leveraged for dynamic logical grouping based on predicates (e.g. CPU Utilization > 50%) for localizing processes and jobs
          • No Name-Node Architecture:
              • When used as IGFS, no centralized metadata management (such as the HDFS NameNode or Alluxio’s Master Nodes) is needed
              • Hashing is used to determine data locality
      • Diagram labels: master node(s), worker node(s) (Alluxio); servers, group: A, group: B (Apache Ignite)
  • 13. Architecture (Continued)
      • Alluxio
          • Configuration:
              • Requires explicitly specifying the Master Node(s) and Worker Node(s) in the configuration files
              • Adding nodes requires restarting the cluster
          • Interfaces:
              • Alluxio Shell
              • Web interface (also allows browsing the Alluxio FS)
      • Apache Ignite
          • Configuration:
              • Doesn’t require explicit specification of nodes in configuration files
              • Nodes discover each other automatically when started
              • Supported methods for node discovery: Multicast; Static IP based
              • Cluster can be scaled without restarting
              • Supported deployment modes: Shared or Embedded
          • Interfaces:
              • Visor CommandLine: for viewing topology, node metrics and cache statistics, and for administering the cluster
              • Web Interface: needs to be installed separately
      • Diagram labels: master node(s), worker node(s) (Alluxio); servers, group: A, group: B (Apache Ignite)
  • 14. Architecture (Continued). Alluxio: Alluxio Shell. Apache Ignite: Visor CommandLine.
  • 15. Architecture (Continued). Alluxio: Alluxio Web Interface.
  • 16. Integration with Data Stores
      • Alluxio
          • Enables processing frameworks to interact with data from different data stores through a unified namespace and API
          • Supports the following data stores as UFS: HDFS, Blob, S3, GCS, Minio, Ceph, Swift and MapR-FS, to name a few
          • The process involves mounting different UFS at different mount points in the Alluxio namespace and then accessing them seamlessly from applications
          • Adding more UFS storage is a matter of configuration
      • Apache Ignite
          • Provides two modes of persistence in addition to in-memory:
              • Native Persistence (disk only)
              • 3rd Party Persistence (pluggable)
          • Native Persistence:
              • Treats disk as the store for the superset of the data
              • Supports SSD, Flash and 3D XPoint storage
              • Features like ACID compliance and SQL are supported only in this mode
          • 3rd Party Persistence:
              • Data stores like HDFS, Cassandra and JDBC-based stores are pluggable
              • Involves implementing the CacheStore interface for read-through/write-through
              • Supports write-behind caching for improved performance
      • Diagram: Alluxio (exposed as a file system) sits between Spark, MapReduce, Flink, … and S3, HDFS, Blob, …; Ignite (in-memory) sits between Spark/MapReduce/Flink, SQL, Streaming, Machine Learning, Compute, … and Disk (native persistence) or HDFS, Cassandra, … (3rd party persistence)
  • 17. Integration with Spark
      • Alluxio
          • Provides Hadoop-compatible file system APIs, so data can be read and written via the Spark RDD file system APIs
          • Enables reading and writing data from different data stores (configured as UFS) via Alluxio’s unified namespace and API
          • Automatically manages movement of data persisted in Alluxio or UFS
      • Apache Ignite: two ways to integrate with Spark
          • As stand-alone IGFS or as a caching layer on HDFS:
              • Ignite is exposed as HDFS, so data can be read and written via the Spark RDD file system APIs
          • As a Distributed Cache via IgniteContext:
              • Provides an implementation of Spark RDDs (supporting all transformations and actions)
              • Mutable RDDs (a view over the distributed cache’s content)
              • Configurable lifespan, depending on Ignite’s deployment mode
      • Diagram: Alluxio (exposed as a file system over S3, HDFS, Blob, Disk, …) and Ignite IGFS are accessed through the RDD read-from/save-to file system APIs; Ignite (Distributed Cache) is accessed through IgniteContext and IgniteRDD[Tuple2], with transformations and the savePairs action
  • 18. Integration with Spark (Continued)
      • Alluxio:
          // reading data through the Hadoop-compatible alluxio:// scheme:
          val textRdd = sc.textFile("alluxio://masternode:19998/path")
          // transformations:
          val textRdd2 = textRdd.filter(_.contains("deloitte"))
          // writing data:
          textRdd2.saveAsTextFile("alluxio://masternode:19998/destination_path")
      • Apache Ignite:
          import org.apache.ignite.configuration.IgniteConfiguration
          import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
          // creating IgniteContext:
          val igniteContext = new IgniteContext(sparkContext, () => new IgniteConfiguration())
          // creating IgniteRDD (a view over the cache named "deloitte_cache"):
          val cacheRdd: IgniteRDD[Integer, String] = igniteContext.fromCache("deloitte_cache")
          // transformations (filter returns a plain RDD of key-value pairs):
          val cacheRdd2 = cacheRdd.filter(_._2.contains("deloitte"))
          // writing data: savePairs is defined on IgniteRDD and takes the RDD of pairs to store
          cacheRdd.savePairs(cacheRdd2)
      • Diagram: same as on slide 17 (file system API path for Alluxio and Ignite IGFS; IgniteContext, IgniteRDD[Tuple2] and savePairs for the Ignite Distributed Cache)
  • 19. Memory Architecture
      • Alluxio
          • Storage is divided into three tiers, ordered by performance:
              • MEM (memory)
              • SSD
              • HDD
          • Allows storing more data than the memory available in the cluster
          • Automatically manages data between tiers
          • Data is written to the top tier by default
      • Apache Ignite
          • In Native Persistence, data and index storage is divided into:
              • Memory (subset of data)
              • Disk (superset of data)
          • Data can be stored both off-heap and on-heap
          • When stored off-heap, there are fewer constraints on the volume of data that can be stored and fewer GC pauses
          • Memory is further divided into Memory Regions
          • Memory Regions consist of Memory Segments, which comprise Data Pages, B+ Tree Pages, Index Pages and FreeList structures
      • Diagram labels: Alluxio Tiers (MEM, SSD, HDD) against Performance; Memory, Disk, Memory Region, Memory Segment, Data Page, B+ Tree, Index Page (Apache Ignite)
  • 20. Advanced Memory Management
      • Alluxio
          • Pinning/Unpinning: to enforce data locality in a specific tier
          • Allocators: for choosing locations to write new data blocks
          • Evictors: for choosing which data to move to a lower tier to free space. Supported algorithms: Greedy, LRU, LRFU, Partial LRU
          • Evictors and Allocators are applied globally
          • A write may fail if space can’t be freed or if the data exceeds the size of the top tier
      • Apache Ignite
          • Memory policies (e.g. eviction) can be applied at the following levels, which provides more granular control:
              • Memory Region level (for off-heap caching)
              • Entry level (for on-heap caching)
          • Eviction:
              • Supported algorithms for page-based eviction: Random-LRU, Random-2-LRU
              • Supported algorithms for entry-level eviction: FIFO, LRU, Random
      • Diagram labels: Alluxio Tiers (MEM, SSD, HDD) against Performance; Memory, Disk, Memory Region, Memory Segment, Data Page, B+ Tree, Index Page (Apache Ignite)
  • 21. Additional Capabilities: SQL Support
      • Alluxio: not supported
      • Apache Ignite
          • Supports distributed and horizontally scalable SQL database capabilities
          • Supports indexing
          • ANSI SQL-99 compliant
          • Supports all SQL DDL and DML commands, including UPDATE, DELETE and MERGE queries
          • Resembles Kudu’s capabilities
          • Counters the limitations of HDFS
          • Supports running queries on data that spans memory and disk. Not all of the data needs to be in memory for processing, unlike in Impala or Spark
      • Diagram: Java, .NET, C++, PHP and BI tools connect via JDBC, ODBC and the Ignite SQL API to nodes that each hold data and indexes in RAM and on disk
  • 22. Additional Capabilities (Continued)
      • Alluxio
          • Key Value APIs (not transactional/ACID compliant)
      • Apache Ignite
          • Key Value APIs (transactional, ACID compliant)
          • Compute Grid: distributed and parallel computation
          • MLGrid: machine learning library on top of Apache Ignite. It currently supports limited vector and matrix algebra operations; other algorithms are on the roadmap
          • Streaming ingestion: ingesting real-time streams of data into Ignite in a distributed, scalable and fault-tolerant manner
  • 23. Key Takeaways:
      • For inter-process state sharing (e.g. across Spark jobs), both provide adequate functional capabilities.
      • Both platforms provide automatic hot data management, but Apache Ignite provides more granular control courtesy of its per-memory-region policies.
      • For convenience in use cases that involve interacting with data in multiple storage systems at memory speed, Alluxio makes more sense.
      • For building real-time data and analytics pipelines, Apache Ignite makes more sense as a sink.
      • For analytical use cases involving relational processing and in-place mutation at high speed, Apache Ignite makes more sense.
  • 24. Questions
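
The tiering and eviction behaviour described on slides 19 and 20 (writes land in the top tier, an LRU evictor moves blocks to a lower tier to free space, pinned blocks stay put, and a write can fail) can be illustrated with a small sketch. This is a simplified model in plain Scala, not Alluxio's implementation or API: the names `Tier`, `TieredStore`, `pin`, `write` and `read`, and the block sizes used below, are all hypothetical.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of tiered block storage; the names Tier and
// TieredStore are illustrative and are not part of Alluxio's API.
final class Tier(val name: String, val capacity: Long) {
  // LinkedHashMap iterates in insertion order. Re-inserting a block on access
  // moves it to the end, so the first unpinned entry is the LRU candidate.
  private val blocks = mutable.LinkedHashMap.empty[String, Long]
  private val pinned = mutable.Set.empty[String]
  private var used = 0L

  def free: Long = capacity - used
  def contains(id: String): Boolean = blocks.contains(id)
  def pin(id: String): Unit = if (contains(id)) pinned += id

  def put(id: String, size: Long): Unit = {
    remove(id)
    blocks.put(id, size)
    used += size
  }

  def remove(id: String): Option[Long] = {
    val size = blocks.remove(id)
    size.foreach(used -= _)
    pinned -= id
    size
  }

  // Mark a block as most recently used without changing the space accounting.
  def touch(id: String): Unit = blocks.remove(id).foreach(blocks.put(id, _))

  // Remove and return the least recently used block that is not pinned.
  def evictLru(): Option[(String, Long)] =
    blocks.keysIterator.find(id => !pinned(id)).map { id =>
      val size = blocks(id)
      remove(id)
      (id, size)
    }
}

final class TieredStore(tiers: Vector[Tier]) {
  // New data goes to the top tier; false means the write failed.
  def write(id: String, size: Long): Boolean = {
    tiers.foreach(_.remove(id)) // drop any stale copy first
    place(0, id, size)
  }

  def pin(id: String): Unit = tiers.foreach(_.pin(id))

  // Returns the name of the tier holding the block and refreshes its recency.
  def read(id: String): Option[String] =
    tiers.find(_.contains(id)).map { t => t.touch(id); t.name }

  private def place(level: Int, id: String, size: Long): Boolean =
    if (level >= tiers.length || size > tiers(level).capacity) false
    else {
      val tier = tiers(level)
      var canFit = true
      while (canFit && tier.free < size) {
        tier.evictLru() match {
          case Some((victim, victimSize)) =>
            // Demote the victim; if no lower tier can take it, it leaves the store.
            place(level + 1, victim, victimSize)
            ()
          case None => canFit = false // only pinned blocks remain
        }
      }
      if (canFit) tier.put(id, size)
      canFit
    }
}

object TieredStoreDemo {
  def main(args: Array[String]): Unit = {
    val store = new TieredStore(
      Vector(new Tier("MEM", 100), new Tier("SSD", 200), new Tier("HDD", 400)))
    store.write("a", 40)
    store.write("b", 40)
    store.read("a")              // "a" becomes more recent than "b"
    store.write("c", 40)         // MEM is full, so "b" is demoted to SSD
    println(store.read("b"))     // Some(SSD)
    println(store.write("big", 150)) // false: larger than the top tier
  }
}
```

In this model a write fails in exactly the two situations the slide names: the block is larger than the top tier, or nothing can be evicted because every remaining block is pinned. A real evictor also has to handle concurrency and partial failures, which the sketch ignores.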

Editor's Notes

  1. Traditional computation (e.g. MapReduce) requires disk I/O, which results in slow performance. Training machine learning models involves iterative steps, and in such iterative algorithms intermittent disk I/O severely impacts performance.
  2. Traditional computation (e.g. MapReduce) requires disk I/O, which results in slow performance. Training machine learning models involves iterative steps, and in such iterative algorithms intermittent disk I/O severely impacts performance.
  3. In-place mutation. Tiered st. Traditional computation (e.g. MapReduce) requires disk I/O, which results in slow performance. Training machine learning models involves iterative steps, and in such iterative algorithms intermittent disk I/O severely impacts performance. https://www.slideshare.net/imcsummit/imcs2015-1-devopensource-inmemory-platforms?qid=f751b742-9974-420c-b0da-e4ee57eeeadd&v=&b=&from_search=34 State is not shareable across processes. Causes resource pressure. Abstractions are limited to the lifecycle of a job. Spark: RDDs don’t exist beyond a job; RDD state can’t be shared across jobs ???