[150824]symposium v4

Hadoop MapReduce
How to Survive Out-of-Memory Errors
Members: Yoonseung Choi, Soyeong Park
Faculty Mentor: Prof. Harry Xu
Student Mentor: Khanh Nguyen
The International Summer Undergraduate Research Fellowship
1

Outline
• Introduction
• What is MapReduce?
• How does MapReduce work?
• Limitations of MapReduce
• What are our goals?
• Operation test
• Conclusions
2

“There were 5 exabytes of information created
between the dawn of civilization through 2003,
but that much information is now created
every 2 days, and the pace is increasing...”
- Eric Schmidt, former Google CEO
3

Data scientists want to analyze these large data sets.
But single machines have limitations in processing these data sets.
Furthermore, data sets are now growing very rapidly.
How can we handle that? Distributed processing.
But we don’t want to have to understand parallelization, fault tolerance, data distribution, and load balancing!
Therefore, we propose ‘MapReduce’, which takes care of:
• parallelization
• fault tolerance
• data distribution
• load balancing
4

MapReduce is a programming model for processing large data sets.
• Many real world tasks are expressible in this model
• The model is easy to use, even for programmers without experience with parallel and distributed systems
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”.
[Figure: Hadoop architecture with a MapReduce Layer and an HDFS Layer]
* https://en.wikipedia.org/wiki/Apache_Hadoop
5

What is MapReduce?
• The Mapper takes an input and produces a set of intermediate key/value pairs
• The Reducer merges together the intermediate values associated with the same intermediate key
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. p.12
6

How does MapReduce work?
- Wordcount program
- A sentence is split into two map tasks
Input: The cat sees the dog, and the dog sees the cat.
Map Phase
Split 1: “The cat sees the dog” -> cat, 1; dog, 1; sees, 1; the, 2
Split 2: “and the dog sees the cat” -> cat, 1; dog, 1; sees, 1; the, 2; and, 1
Reduce Phase
cat, 2; dog, 2; sees, 2; the, 4; and, 1
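The word-count flow above can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: the names `map_task` and `reduce_phase` are hypothetical, and the map step here already sums the counts inside each split, which is why each split reports "the, 2".

```python
import re
from collections import Counter, defaultdict


def map_task(text):
    """Map: count the words in one input split and emit (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(Counter(words).items())


def reduce_phase(map_outputs):
    """Reduce: add up the counts that share the same intermediate key."""
    totals = defaultdict(int)
    for pairs in map_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(sorted(totals.items()))


# The sentence from the slide, split into two map tasks.
splits = ["The cat sees the dog", "and the dog sees the cat"]
map_outputs = [map_task(split) for split in splits]
word_counts = reduce_phase(map_outputs)
print(word_counts)
```

In a real cluster the two map tasks would run on different workers and the pairs would travel over the network before being reduced; the data flow is the same.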
7

Limitations of MapReduce
There are many reasons for poor performance
And even experts sometimes can’t figure them out
8

What are our goals?
• Research Out-of-Memory (OOM) error cases
• Document OOM cases
• Implement and simulate StackOverflow OOM cases
• Develop solutions for such OOM cases
… all done!!
9

Two Categories
1. Inappropriate Configuration
Configuration which causes poor performance
2. Large Intermediate Results
Temporary data structure grows too large
[3] Lijie Xu, “An Empirical Study on Real-World OOM Cases in MapReduce Jobs”, Chinese Academy of Sciences.
10

Operation test environments
1. Standalone & Pseudo-distributed mode
- ‘14 MacBook Pro, 2.8 GHz Intel Core i5
8GB 1600 MHz DDR3, 500GB HDD
- ‘12 MacBook Air, 1.4 GHz Intel Core i5
4GB 1600 MHz DDR3, 256GB HDD
2. Fully-distributed mode
- Raspberry Pi 2 Model B (3 nodes)
A quad-core ARM Cortex-A7 CPU (overclocked to 1 GHz)
1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet
11

Split size variation [Single node]
* ‘14 MacBook Pro, 2.8 GHz Intel Core i5, 8GB 1600 MHz DDR3, 500GB SSD
Input: StackOverflow’s users profiles (1GB)
[Figure: two bar charts of run time (sec) against split size (MB).
Panels: [ Distributed grep (no Reducer) ] and [ Standard deviation of users’ age ].
Series: Standalone; Pseudo-distributed (2 Mapper, 2 Reducer); Pseudo-distributed (4 Mapper, 4 Reducer).]
Bar labels (sec), one row per series in extraction order, columns by split size 16 / 32 / 64 / 128 / 256 MB:
Chart 1:
173.3 / 88.3 / 47.3 / 26.7 / 24.3
204 / 117.3 / 86.3 / 64.7 / 56.3
169.3 / 117.3 / 78.7 / 59 / 55
Chart 2:
169.7 / 85.7 / 43 / 23 / 23.3
172.7 / 103.7 / 64.7 / 48.7 / 37.7
129.7 / 77.7 / 55 / 39 / 32.7
12

Split size variation [Single node]
* ‘12 MacBook Air, 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD
Input: StackOverflow’s Comments (8.5GB)
[Figure: two bar charts of run time (sec) against split size (MB).
Panels: [ Standard deviation of comment’s text length ] and [ Count Min and Max value ].
Series: Standalone; Pseudo-distributed (2 Mapper, 2 Reducer); Pseudo-distributed (4 Mapper, 4 Reducer).]
Bar labels (sec), one row per series in extraction order, columns by split size 16 / 32 / 64 / 128 / 256 MB:
Chart 1:
1577.7 / 807.7 / 425 / 411 / 312.3
1586.3 / 831 / 634 / 454.3 / 299
1590 / 803.7 / 540.3 / 397.7 / 323
Chart 2:
1469 / 783 / 398 / 389.3 / 281.3
1614 / 610.7 / 612 / 418.7 / 294.3
1598 / 609 / 488 / 362.7 / 254.3
13

Split size variation [Fully-distributed]
* Raspberry Pi 2 Model B (3 nodes), a quad-core ARM Cortex-A7 CPU (overclocked to 1 GHz), 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet
Input: StackOverflow’s users profiles (1GB)
[Figure: two bar charts of run time (sec) against split size (MB).
Panels: [ Distributed grep (no Reducer) ] and [ average users’ age based on countries ].
Series: 6 Mapper; 12 Mapper.]
Bar labels (sec), one row per series in extraction order:
Chart 1, columns by split size 32 / 64 / 128 / 256 MB:
375 / 396 / 442 / 548
313 / 296 / 350 / 557
Chart 2, columns by split size 16 / 32 / 64 / 128 / 256 MB:
462.7 / 428.7 / 476.7 / 561.7 / 604
333.3 / 303 / 345 / 33… / 603
14

io.sort.mb variation [Single node]
* ‘12 MacBook Air, 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD
Input: StackOverflow’s Comments (8.5GB)
Test program: Standard deviation of comment’s text length
[Figure: bar chart of run time (sec) against io.sort.mb (MB).
Series: Standalone; Pseudo-distributed, 2M2R; Pseudo-distributed, 4M2R.]
Bar labels (sec), one row per series in extraction order, columns by io.sort.mb 20 / 40 / 80 / 160 / 320 MB:
872 / 827 / 814 / 798 / 803.7
661 / 638.7 / 632 / 629.7 / 629.7
633.7 / 641 / 635.7 / 629.3 / 629
15

2. Large Intermediate Results
“I am working well with small datasets like 200-500MB.
But for datasets above 1GB, I am getting an error like this:”
* http://stackoverflow.com/questions/23042829/getting-java-heap-space-error-while-running-a-mapreduce-code-for-large-dataset
16

Problem Investigation
The Mapper
[Figure: split input files (1.3 GB) -> Task 1, Task 2, Task 3, Task 4, Task 5 -> five sets of intermediate key/value pairs [K, V]]
Intermediate key/value pairs: 4.8 GB in total, almost 1 GB per [K, V] set
17

Problem Investigation
The Reducer
[Figure: five sets of intermediate key/value pairs [K, V] (4.8 GB in total, almost 1 GB per set) -> the Reducer]
Reducer: “I just have 1GB heap space!”
The Java heap can’t contain the intermediate data structure
18

Configuration was:
1.3GB Input, 256MB Split size, 1024MB Java Heap Space
Error: Java heap space
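A rough back-of-the-envelope check shows why this configuration is on the edge. The sketch below assumes the 4.8 GB of intermediate data is spread evenly over the five tasks shown in the investigation; the function name `per_task_load` is hypothetical, and the calculation ignores Java object overhead, which typically makes in-memory data larger than its serialized size.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def per_task_load(intermediate_bytes, tasks, heap_bytes):
    """Return the intermediate bytes per task and the fraction of the heap they fill.

    Assumes an even spread across tasks, which real jobs rarely achieve.
    """
    per_task = intermediate_bytes / tasks
    return per_task, per_task / heap_bytes


# Figures from the slides: 4.8 GB of intermediate data, 5 tasks, 1024 MB heap.
per_task, heap_fraction = per_task_load(4.8 * GB, tasks=5, heap_bytes=1024 * MB)
print(f"{per_task / MB:.0f} MB per task, {heap_fraction:.0%} of the heap")

# The same data against a 2048 MB heap leaves much more headroom.
_, larger_heap_fraction = per_task_load(4.8 * GB, tasks=5, heap_bytes=2048 * MB)
print(f"{larger_heap_fraction:.0%} of a 2048 MB heap")
```

Under these assumptions a task that holds its whole share in memory fills nearly the entire 1024 MB heap before any other allocation, which is consistent with the "Java heap space" error.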
19

Summary of Solutions
• Modify the configuration parameters
Split size    Java heap 1024MB    Java heap 2048MB
128 MB        Successful          Successful
256 MB        Failed              Successful
• Alter the program’s algorithm
: An alternative solution was suggested on the site
-> It succeeded with the configuration under which the original version failed
( 256MB split size & 1024MB Java heap size )
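One common way to "alter the program's algorithm" for a statistic such as a standard deviation is to stop collecting every value in a list and keep only running totals. The sketch below is an illustration of that general idea, not necessarily the fix suggested on the site or the one used in this project; the names `partial`, `merge` and `std_dev` are hypothetical.

```python
import math


def partial(values):
    """Summarize a stream of values as (count, sum, sum of squares).

    Memory use is constant no matter how many values pass through.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for v in values:
        n += 1
        total += v
        total_sq += v * v
    return n, total, total_sq


def merge(a, b):
    """Combine two summaries; this is the step a combiner or reducer would run."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def std_dev(summary):
    """Population standard deviation computed from a merged summary."""
    n, total, total_sq = summary
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(total_sq / n - mean * mean, 0.0))


# Two map tasks each summarize their own values; the summaries are then merged.
summary = merge(partial(iter([2, 4, 4, 4])), partial(iter([5, 5, 7, 9])))
result = std_dev(summary)
print(result)
```

Each summary is three numbers, so the heap no longer has to hold the full list of values. The sum-of-squares formula can lose precision when values are very large, so a production job may prefer a numerically safer update rule.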
20

Conclusions
• How to address poor performance
1. Adjust ‘split size’ & ‘sort space’ (io.sort.mb)
- Larger sizes generally shortened run time in the single-node tests
2. Adjust the number of Mappers
- Utilize all CPU cores
- A larger number of Mappers is not always better
• If the intermediate data structure is too large,
- Modify the configuration parameters, or
- Alter the program’s algorithm
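The tuning knobs above are usually passed to a job as `-D property=value` options. The helper below is a small sketch of how such options could be assembled. Apart from `io.sort.mb`, which the slides name, the property names (`mapred.max.split.size`, `mapred.child.java.opts`) are assumed Hadoop 1.x names and differ in later versions, so check them against your release; `hadoop_options` itself is a hypothetical helper.

```python
def hadoop_options(split_mb, sort_mb, heap_mb):
    """Build generic -D options for split size, sort buffer and task heap.

    Property names are assumptions based on Hadoop 1.x naming.
    """
    properties = {
        "mapred.max.split.size": split_mb * 1024 * 1024,  # bytes
        "io.sort.mb": sort_mb,                            # megabytes
        "mapred.child.java.opts": f"-Xmx{heap_mb}m",      # heap of each task JVM
    }
    options = []
    for name, value in properties.items():
        options += ["-D", f"{name}={value}"]
    return options


# Example values only: 128 MB splits, 160 MB sort buffer, 2048 MB heap.
options = hadoop_options(split_mb=128, sort_mb=160, heap_mb=2048)
print(" ".join(options))
```

Options of this form are only picked up when the job parses generic options (for example through `ToolRunner`), and the sort buffer has to fit inside the task heap, so the two values should be chosen together.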
21

References
[1] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, 2004. [Online].
Available: http://static.googleusercontent.com/media/research.google.com/ko//archive/mapreduce-osdi04.pdf
[2] Kiyong Han (한기용), Do it! Hands-On Hadoop Programming (Do it! 직접 해보는 하둡 프로그래밍). Seoul: EasysPublishing, 2013.
[3] Lijie Xu, “An Empirical Study on Real-World OOM Cases in MapReduce Jobs”, Chinese Academy of Sciences.
[4] Donald Miner and Adam Shook, MapReduce Design Patterns. O’Reilly Media, Inc., 2012.
22

Thank You
For more technical information,
please visit our GitHub repository.
Our project is Open Source.
https://github.com/I-SURF-Hadoop/MapReduce
23

Appendix
How does MapReduce really work?
24

[ Map Phase ]
The MapReduce library first splits the input into M pieces. A map worker processes these pieces using a user-defined Map function. Intermediate key/value pairs are produced by this function.
Input split: The cat sees the dog
Map output: the, 1; cat, 1; sees, 1; the, 1; dog, 1
Combining & Sorting
cat, 1; dog, 1; sees, 1; the, 2
25

[ Reduce Phase ]
When a reduce worker has read all intermediate data, it sorts the data by the intermediate keys. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and its values to the user’s Reduce function.
Map outputs:
cat, 1; dog, 1; sees, 1; the, 2
cat, 1; dog, 1; sees, 1; the, 2; and, 1
Shuffling
Two independent reducers
Reduce output: sees, 2; the, 4; cat, 2; dog, 2; and, 1
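The shuffle step with two independent reducers can be sketched as follows. This is a single-process illustration under stated assumptions: keys are assigned to reducers with a CRC32 hash (our choice for a stable hash, not necessarily what Hadoop uses), and the names `partition`, `shuffle` and `reduce_partition` are hypothetical. Which reducer receives which key depends on the hash, so the grouping may differ from the slide.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key using a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route every pair to its reducer and group the values by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[partition(key, num_reducers)][key].append(value)
    return partitions


def reduce_partition(grouped):
    """Visit keys in sorted order and sum the values of each key."""
    return {key: sum(grouped[key]) for key in sorted(grouped)}


map_outputs = [
    [("cat", 1), ("dog", 1), ("sees", 1), ("the", 2)],
    [("cat", 1), ("dog", 1), ("sees", 1), ("the", 2), ("and", 1)],
]
results = [reduce_partition(p) for p in shuffle(map_outputs, num_reducers=2)]
for reducer_id, counts in enumerate(results):
    print(reducer_id, counts)
```

Because every occurrence of a key hashes to the same reducer, the reducers never need to talk to each other, and their outputs can simply be concatenated.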
26
