Traditionally, big data is read from disk and processed. Most big data systems, however, are latency bound: the CPU often sits idle waiting for data to arrive. The problem is most pronounced in use cases like graph search that randomly access different parts of a dataset. In-memory computing proposes an alternative model: load or store the data in memory and process it there, rather than processing it from disk. Although such designs cost more in terms of memory, the resulting systems can sometimes be orders of magnitude faster (e.g. 1000x), which can lead to savings in the long run, and with rapidly falling memory prices the cost difference is shrinking by the day. Furthermore, in-memory computing can enable use cases, such as ad hoc analysis over large datasets, that were not possible earlier. This talk provides an overview of in-memory technology, discusses how WSO2 technologies like complex event processing can be used to build in-memory solutions, and outlines upcoming improvements in the WSO2 platform.
2. Performance Numbers (based on Jeff Dean's numbers)

   Operation                             Cost (memory ref = 1)   If a memory access were 1 second
   L1 cache reference                    0.05                    1/20th sec
   Main memory reference                 1                       1 sec
   Send 2K bytes over 1 Gbps network     200                     3 min
   Read 1 MB sequentially from memory    2,500                   41 min
   Disk seek                             1*10^5                  27 hours
   Read 1 MB sequentially from disk      2*10^5                  2 days
   Send packet CA->Netherlands->CA       1.5*10^6                17 days

   Operation               Speed (MB/sec)
   Hadoop Select           3
   Terasort Benchmark      18
   Complex Query Hadoop    0.2
   CEP                     60
   CEP Complex             2.5
   SSD                     300-500
   Disk                    50-100
4. Latency Lags Bandwidth
• Observation from Prof. Patterson's 2004 keynote
• Bandwidth improves steadily, but latency does not
• The same holds now, and the gap is widening with new systems
5. Handling Speed Differences in the Memory Hierarchy
1. Caching
   – E.g. processor caches, file cache, disk cache, permission cache
2. Replication
   – E.g. RAID, Content Distribution Networks (CDN), web caches
3. Prediction – predict what data will be needed and prefetch it
   – Trades bandwidth for latency
   – E.g. disk read-ahead caches, Google Earth
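Caching in particular only pays off when an eviction policy keeps the hot working set resident. A minimal sketch of a least-recently-used (LRU) cache in Python (the capacity and keys are made up for illustration):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: keeps only the most recently used entries."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)        # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # capacity exceeded: evicts "b"
```

This is exactly why caching fails on large random-access workloads: once the working set exceeds the capacity, almost every `get` misses.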
6. The Above Three Do Not Always Work
• Limitations
– Caching works only if the working set is small
– Prefetching works only when access patterns are predictable
– Replication is expensive and limited by the receiving machines
• Let's say you are reading and filtering 10GB of data
(at 6 bytes per record, that is about 1.7 billion records)
– ~3 minutes to read the data from disk
– 35ms to filter 10M records on my laptop => ~6 seconds to process all the data
– So keeping the data in memory can give about a 30x speedup
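The back-of-envelope numbers above can be checked in a few lines of Python (the throughput figures are the slide's illustrative assumptions, not measurements):

```python
# Back-of-envelope check of the 10GB filtering example.
data_mb = 10 * 1024
disk_mb_per_sec = 60                 # assumed sequential disk throughput
disk_read_sec = data_mb / disk_mb_per_sec   # ~3 minutes

records = 1.7e9                      # ~10GB at 6 bytes per record
filter_sec_per_10m = 0.035           # 35ms to filter 10M records
cpu_sec = records / 10e6 * filter_sec_per_10m  # ~6 seconds

speedup = disk_read_sec / cpu_sec    # roughly 30x
print(round(disk_read_sec / 60, 1), "min disk read,",
      round(cpu_sec, 1), "s in-memory filter,",
      round(speedup), "x speedup")
```

The ~30x here is the same ratio that reappears on the "read once" access pattern slide below (Tp = 35ms vs. Td = 1.2s per 60MB chunk).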
7. Data Access Patterns in Big Data Applications
• Read from disk, process once (basic analytics)
– Data can be prefetched; batch load is only about 100 times faster
– OK if processing time > data read time
• Read from disk, process iteratively (machine learning algorithms, e.g. k-means)
– Need to load data from disk once and process it repeatedly (e.g. Spark supports this)
• Interactive (OLAP)
– Queries are random and data may be scattered; once a query starts, we can load the data to memory and process it
• Random access (e.g. graph processing)
– Very hard to optimize
• Realtime access
– Process data as it comes in
9. Four Myths
• Myths
– It is too expensive: a 1TB RAM cluster costs about $20-40k
– It is not durable
– Flash is fast enough
– It is only about in-memory DBs
• From Nikita Ivanov's post
– http://gridgaintech.wordpress.com/2013/09/18/four-myths-of-in-memory-computing/
10. Let us look at each big data access pattern and where in-memory computing can make a difference
11. Access Pattern 1: Read from Disk, Process Once
• If processing time Tp = 35ms vs. disk read time Td = 1.2 sec per 60MB chunk, keeping all data in memory gives about a 30x speedup
• However, the benefit is smaller if the computation is more complex (e.g. sort)
12. Access Pattern 2: Read from Disk, Process Iteratively
• Very common pattern for machine learning algorithms (e.g. k-means)
• In this case the advantages are greater
– If we cannot hold the data in memory fully, we need to offload it and read it from disk again on every iteration
– That makes the cost very high, and in-memory computing much faster
• Spark lets you load the data fully into memory and process it
13. Spark
• New programming model built on functional programming concepts
• Can be much faster for iterative use cases
• Has a complete stack of products

val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
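For readers unfamiliar with the functional style, the same word count can be sketched with plain Python collections (the `lines` list stands in for the HDFS file):

```python
from collections import Counter
from itertools import chain

lines = ["in memory computing", "memory lags disk in price"]

# flatMap: split every line into words and flatten into one stream
words = chain.from_iterable(line.split(" ") for line in lines)

# map + reduceByKey: count occurrences per word
counts = Counter(words)

print(counts["memory"])   # 2
```

Spark runs the same shape of pipeline, but partitioned across a cluster and, when asked, entirely from memory.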
14. Access Pattern 3: Interactive Queries
• Need to be responsive, < 10 sec
• Harder to predict what data is needed
• Queries tend to be simpler
• Can be made faster with a RAM cloud
– SAP HANA
– VoltDB
• With smaller queries, disk may still be OK; Apache Drill is an alternative
15. VoltDB Story
• The VoltDB team (Michael Stonebraker et al.) observed that 92% of the work in a traditional DB is related to disk
• By building a completely in-memory database cluster, they made it 20x faster!
16. Distributed Cache (e.g. Hazelcast)
• Stores the data partitioned and replicated across many machines
• Used as a cache that spans multiple machines
• Key-value access
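The partition-and-replicate idea can be sketched in a few lines of Python (a hypothetical simplification for illustration, not the actual Hazelcast API; node names and replica count are made up):

```python
# Toy partitioned, replicated key-value store across three "nodes".
NODES = ["node0", "node1", "node2"]
REPLICAS = 2                      # primary + one backup copy
store = {n: {} for n in NODES}

def owners(key):
    """Primary owner chosen by hash; the next node holds the backup."""
    primary = hash(key) % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICAS)]

def put(key, value):
    for node in owners(key):      # write to primary and backup
        store[node][key] = value

def get(key):
    for node in owners(key):      # fall back to the replica if needed
        if key in store[node]:
            return store[node][key]
    return None

put("user:42", {"name": "Ada"})
# the entry now lives on exactly REPLICAS of the three nodes
```

Partitioning spreads the dataset beyond one machine's RAM; replication keeps a key readable when its primary node is lost.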
17. Access Pattern 4: Random Access
• E.g. graph traversal
• This is the hardest use case
• In easy cases there is a small working set, which a cache can handle (e.g. checking users against a blacklist); that is not the case for most graph operations like traversal
• In hard cases, in-memory computing is the only real solution
• Can be 1000x faster or more
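A breadth-first traversal over an in-memory adjacency list shows why this pattern defeats caching and prefetching: each hop can land anywhere in the dataset, so there is no small, predictable working set. A minimal Python sketch over a toy graph:

```python
from collections import deque

# Toy adjacency list; in a real workload this would be far too large
# to cache, and each lookup would hit a random part of the dataset.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}

def bfs(start):
    visited, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:   # random access into the dataset
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

print(bfs("a"))   # ['a', 'b', 'c', 'd']
```

On disk, every one of those neighbor lookups can cost a seek; in memory it costs roughly one memory reference, which is where the 1000x figure comes from.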
18. Access Pattern 5: Realtime Processing
• This is already done in memory, using tools like complex event processing (e.g. WSO2 CEP) or stream processing (e.g. Apache Storm)
20. Faster Access to Data
• In-memory databases (e.g. VoltDB, MemSQL)
– Provide the same SQL interface
– Can be thought of as a fast database
– VoltDB has been shown to be about 20x faster than MySQL
• Distributed cache
– Can be integrated as a large cache
21. Load the Dataset to Memory and Analyze
• Used with interactive and random access use cases
• Can be up to 1000x faster for some use cases
• Tools
– Spark
– Hazelcast
– SAP HANA
22. Realtime Processing
• Realtime analytics tools
– CEP (e.g. WSO2 CEP)
– Stream processing (e.g. Storm)
• Can generate results within a few milliseconds to seconds
• Can process tens of thousands to millions of events per second
• Not all algorithms can be implemented
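A minimal sketch of the kind of sliding-window query such engines evaluate continuously over a stream (window size, threshold, and the event values are made-up parameters, not a real CEP API):

```python
from collections import deque

WINDOW, THRESHOLD = 3, 50

def alerts(stream):
    """Fire an alert at each index where the average of the last
    WINDOW readings exceeds THRESHOLD."""
    window = deque(maxlen=WINDOW)   # old events drop out automatically
    fired = []
    for i, value in enumerate(stream):
        window.append(value)
        if len(window) == WINDOW and sum(window) / WINDOW > THRESHOLD:
            fired.append(i)
    return fired

print(alerts([10, 20, 90, 80, 70, 0]))   # [3, 4]
```

The key property is that state is bounded by the window, not by the stream: events are processed as they arrive and never touch disk, which is what makes millisecond latencies possible.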