Traditionally, big data is read from disk and processed. Most big data systems, however, are latency bound: the CPU often sits idle waiting for data to arrive. The problem is most pronounced in use cases like graph search that randomly access different parts of a dataset. In-memory computing proposes an alternative model: load or store the data in memory and process it there, rather than processing it from disk. Although such designs cost more in terms of memory, the resulting systems can sometimes be orders of magnitude faster (e.g. 1000x), which can lead to savings in the long run, and with rapidly falling memory prices the cost difference is shrinking by the day. Furthermore, in-memory computing can enable use cases, such as ad hoc analysis over large datasets, that were not possible earlier. This talk provides an overview of in-memory technology, discusses how WSO2 technologies like complex event processing can be used to build in-memory solutions, and outlines upcoming improvements in the WSO2 platform.
2. Performance Numbers (based on Jeff Dean's numbers)

   Operation                             Cost (memory ref = 1)   If a memory access were 1 second
   L1 cache reference                    0.05                    1/20th sec
   Main memory reference                 1                       1 sec
   Send 2K bytes over 1 Gbps network     200                     3 min
   Read 1 MB sequentially from memory    2,500                   41 min
   Disk seek                             1*10^5                  27 hours
   Read 1 MB sequentially from disk      2*10^5                  2 days
   Send packet CA->Netherlands->CA       1.5*10^6                17 days

   Operation               Speed (MB/sec)
   Hadoop Select           3
   Terasort Benchmark      18
   Complex Query Hadoop    0.2
   CEP                     60
   CEP Complex             2.5
   SSD                     300-500
   Disk                    50-100
4. Latency Lags Bandwidth
• Observation from Prof. Patterson's 2004 keynote
• Bandwidth improves steadily, but latency does not
• The same holds now, and the gap is widening with new systems
5. Handling Speed Differences in the Memory Hierarchy
1. Caching
   – E.g. processor caches, file cache, disk cache, permission cache
2. Replication
   – E.g. RAID, Content Distribution Networks (CDN), web caches
3. Prediction – predict what data will be needed and prefetch it
   – Trades bandwidth for latency
   – E.g. disk read-ahead caches, Google Earth
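Caching in particular only pays off when an eviction policy keeps the hot working set resident. A minimal sketch of a least-recently-used (LRU) cache in Python (the capacity and keys are made up for illustration):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: keeps only the most recently used entries."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)        # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes most recently used
cache.put("c", 3)  # capacity exceeded: evicts "b"
```

This is exactly why caching fails on large random-access workloads: once the working set exceeds the capacity, almost every `get` misses.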
6. The Above Three Do Not Always Work
• Limitations
– Caching works only if the working set is small
– Prefetching works only when access patterns are predictable
– Replication is expensive and limited by the receiving machines
• Let's say you are reading and filtering 10GB of data
(at 6 bytes per record, that is about 1.7 billion records)
– ~3 minutes to read the data from disk
– 35ms to filter 10M records on my laptop => ~6 seconds to process all the data
– So keeping the data in memory can give about a 30x speedup
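The back-of-envelope numbers above can be checked in a few lines of Python (the throughput figures are the slide's illustrative assumptions, not measurements):

```python
# Back-of-envelope check of the 10GB filtering example.
data_mb = 10 * 1024
disk_mb_per_sec = 60                 # assumed sequential disk throughput
disk_read_sec = data_mb / disk_mb_per_sec   # ~3 minutes

records = 1.7e9                      # ~10GB at 6 bytes per record
filter_sec_per_10m = 0.035           # 35ms to filter 10M records
cpu_sec = records / 10e6 * filter_sec_per_10m  # ~6 seconds

speedup = disk_read_sec / cpu_sec    # roughly 30x
print(round(disk_read_sec / 60, 1), "min disk read,",
      round(cpu_sec, 1), "s in-memory filter,",
      round(speedup), "x speedup")
```

The ~30x here is the same ratio that reappears on the "read once" access pattern slide below (Tp = 35ms vs. Td = 1.2s per 60MB chunk).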
7. Data Access Patterns in Big Data Applications
• Read from disk, process once (basic analytics)
– Data can be prefetched; batch load is only about 100 times faster
– OK if processing time > data read time
• Read from disk, process iteratively (machine learning algorithms, e.g. k-means)
– Need to load data from disk once and process it repeatedly (e.g. Spark supports this)
• Interactive (OLAP)
– Queries are random and data may be scattered; once a query starts, we can load the data to memory and process it
• Random access (e.g. graph processing)
– Very hard to optimize
• Realtime access
– Process data as it comes in
9. Four Myths
• Myths
– It is too expensive: a 1TB RAM cluster costs about $20-40k
– It is not durable
– Flash is fast enough
– It is only about in-memory DBs
• From Nikita Ivanov's post
– http://gridgaintech.wordpress.com/2013/09/18/four-myths-of-in-memory-computing/
10. Let us look at each big data access pattern and where in-memory computing can make a difference
11. Access Pattern 1: Read from Disk, Process Once
• If processing time Tp = 35ms vs. disk read time Td = 1.2 sec per 60MB chunk, keeping all data in memory gives about a 30x speedup
• However, the benefit is smaller if the computation is more complex (e.g. sort)
12. Access Pattern 2: Read from Disk, Process Iteratively
• Very common pattern for machine learning algorithms (e.g. k-means)
• In this case the advantages are greater
– If we cannot hold the data in memory fully, we need to offload it and read it from disk again on every iteration
– That makes the cost very high, and in-memory computing much faster
• Spark lets you load the data fully into memory and process it
13. Spark
• New programming model built on functional programming concepts
• Can be much faster for iterative use cases
• Has a complete stack of products

val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
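For readers unfamiliar with the functional style, the same word count can be sketched with plain Python collections (the `lines` list stands in for the HDFS file):

```python
from collections import Counter
from itertools import chain

lines = ["in memory computing", "memory lags disk in price"]

# flatMap: split every line into words and flatten into one stream
words = chain.from_iterable(line.split(" ") for line in lines)

# map + reduceByKey: count occurrences per word
counts = Counter(words)

print(counts["memory"])   # 2
```

Spark runs the same shape of pipeline, but partitioned across a cluster and, when asked, entirely from memory.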
14. Access Pattern 3: Interactive Queries
• Need to be responsive, < 10 sec
• Harder to predict what data is needed
• Queries tend to be simpler
• Can be made faster with a RAM cloud
– SAP HANA
– VoltDB
• With smaller queries, disk may still be OK; Apache Drill is an alternative
15. VoltDB Story
• The VoltDB team (Michael Stonebraker et al.) observed that 92% of the work in a traditional DB is related to disk
• By building a completely in-memory database cluster, they made it 20x faster!
16. Distributed Cache (e.g. Hazelcast)
• Stores the data partitioned and replicated across many machines
• Used as a cache that spans multiple machines
• Key-value access
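The partition-and-replicate idea can be sketched in a few lines of Python (a hypothetical simplification for illustration, not the actual Hazelcast API; node names and replica count are made up):

```python
# Toy partitioned, replicated key-value store across three "nodes".
NODES = ["node0", "node1", "node2"]
REPLICAS = 2                      # primary + one backup copy
store = {n: {} for n in NODES}

def owners(key):
    """Primary owner chosen by hash; the next node holds the backup."""
    primary = hash(key) % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICAS)]

def put(key, value):
    for node in owners(key):      # write to primary and backup
        store[node][key] = value

def get(key):
    for node in owners(key):      # fall back to the replica if needed
        if key in store[node]:
            return store[node][key]
    return None

put("user:42", {"name": "Ada"})
# the entry now lives on exactly REPLICAS of the three nodes
```

Partitioning spreads the dataset beyond one machine's RAM; replication keeps a key readable when its primary node is lost.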
17. Access Pattern 4: Random Access
• E.g. graph traversal
• This is the hardest use case
• In easy cases there is a small working set, which a cache can handle (e.g. checking users against a blacklist); that is not the case for most graph operations like traversal
• In hard cases, in-memory computing is the only real solution
• Can be 1000x faster or more
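A breadth-first traversal over an in-memory adjacency list shows why this pattern defeats caching and prefetching: each hop can land anywhere in the dataset, so there is no small, predictable working set. A minimal Python sketch over a toy graph:

```python
from collections import deque

# Toy adjacency list; in a real workload this would be far too large
# to cache, and each lookup would hit a random part of the dataset.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}

def bfs(start):
    visited, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:   # random access into the dataset
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

print(bfs("a"))   # ['a', 'b', 'c', 'd']
```

On disk, every one of those neighbor lookups can cost a seek; in memory it costs roughly one memory reference, which is where the 1000x figure comes from.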
18. Access Pattern 5: Realtime Processing
• This is already done in memory, using tools like complex event processing (e.g. WSO2 CEP) or stream processing (e.g. Apache Storm)
20. Faster Access to Data
• In-memory databases (e.g. VoltDB, MemSQL)
– Provide the same SQL interface
– Can be thought of as a fast database
– VoltDB has been shown to be about 20x faster than MySQL
• Distributed cache
– Can be integrated as a large cache
21. Load the Dataset to Memory and Analyze
• Used with interactive and random access use cases
• Can be up to 1000x faster for some use cases
• Tools
– Spark
– Hazelcast
– SAP HANA
22. Realtime Processing
• Realtime analytics tools
– CEP (e.g. WSO2 CEP)
– Stream processing (e.g. Storm)
• Can generate results within a few milliseconds to seconds
• Can process tens of thousands to millions of events per second
• Not all algorithms can be implemented
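A minimal sketch of the kind of sliding-window query such engines evaluate continuously over a stream (window size, threshold, and the event values are made-up parameters, not a real CEP API):

```python
from collections import deque

WINDOW, THRESHOLD = 3, 50

def alerts(stream):
    """Fire an alert at each index where the average of the last
    WINDOW readings exceeds THRESHOLD."""
    window = deque(maxlen=WINDOW)   # old events drop out automatically
    fired = []
    for i, value in enumerate(stream):
        window.append(value)
        if len(window) == WINDOW and sum(window) / WINDOW > THRESHOLD:
            fired.append(i)
    return fired

print(alerts([10, 20, 90, 80, 70, 0]))   # [3, 4]
```

The key property is that state is bounded by the window, not by the stream: events are processed as they arrive and never touch disk, which is what makes millisecond latencies possible.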