Hadoop MapReduce
How to Survive Out-of-Memory Errors
Member: Yoonseung Choi
Soyeong Park
Faculty Mentor: Prof. Harry Xu
Student Mentor: Khanh Nguyen
The International Summer Undergraduate Research Fellowship
1
Outline
• Introduction
• What is MapReduce?
• How does MapReduce work?
• Limitations of MapReduce
• What are our goals?
• Operation test
• Conclusions
2
“There was 5 exabytes of information created
between the dawn of civilization through 2003,
But that much information is now created
every 2 days, and the pace is increasing...”
- Eric Schmidt, The Former Google CEO
3
Data scientists want
to analyze these
large data sets
But single
machines
have
limitations
in processing
these data sets
How can we handle that?
Furthermore, data sets
are now growing very rapidly
We don’t want
to understand
parallelization,
fault tolerance,
data distribution,
and load balancing!
Distributed processing
Therefore, we propose
‘MapReduce’
parallelization
fault tolerance
data distribution
load balancing
4
MapReduce is
a programming model for
processing large data sets
Many real world tasks are
expressible in this model
The model is easy to use, even
for programmers without
experience with parallel and
distributed systems
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”.
* https://en.wikipedia.org/wiki/Apache_Hadoop
MapReduce Layer
HDFS Layer
5
What is MapReduce?
Mapper takes an input
and produces a set
of intermediate
key/value pairs
Reducer merges together
these intermediate values
associated with the same
intermediate key
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. p.12
6
How does MapReduce work?
The cat sees the dog, and the dog sees the cat.
The cat sees the dog
and the dog sees the cat
cat, 1
dog, 1
sees, 1
the, 2
cat, 1
dog, 1
sees, 1
the, 2
and, 1
cat, 2
dog, 2
sees, 2
the, 4
and, 1
- Wordcount program
- A sentence is split
into two map tasks
Map Phase
Reduce
Phase
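The word-count flow on this slide can be sketched in a few lines of plain Python. This is a minimal single-process illustration, not Hadoop code: the function names `map_task`, `reduce_task`, and `word_count` are hypothetical, and the per-split counts are pre-summed inside the map step to match the numbers shown on the slide.

```python
import re
from collections import Counter, defaultdict

def map_task(split):
    """Map: emit sorted (word, count) pairs for one input split."""
    words = re.findall(r"[a-z]+", split.lower())
    return sorted(Counter(words).items())

def reduce_task(key, values):
    """Reduce: merge all counts that share the same intermediate key."""
    return key, sum(values)

def word_count(splits):
    grouped = defaultdict(list)          # key -> list of intermediate values
    for split in splits:                 # each split is an independent map task
        for word, count in map_task(split):
            grouped[word].append(count)
    return dict(reduce_task(word, counts) for word, counts in grouped.items())

# The sentence from the slide, split into two map tasks.
splits = ["The cat sees the dog", "and the dog sees the cat."]
result = word_count(splits)
print(result)
```

In a real cluster the two map tasks would run on different machines and the grouping step would be done by the framework's shuffle; here both are simulated in one loop.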
7
Limitations of MapReduce
There are many reasons for poor performance
And even experts sometimes can’t figure them out
8
What are our goals?
• Research Out-of-Memory Error (OOM) cases
• Document OOM cases
• Implement and simulate StackOverflow OOM cases
• Develop solutions for such OOM cases
… all done!!
9
Two Categories
1. Inappropriate Configuration
Configuration which causes poor performance
2. Large Intermediate Results
Temporary data structure grows too large
[3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce jobs”, Chinese Academy of Sciences.
10
Operation test environments
1. Standalone & Pseudo-distributed mode
- ‘14 MacBook Pro, 2.8 GHz Intel Core i5
8GB 1600 MHz DDR3, 500GB HDD
- ‘12 MacBook Air, 1.4 GHz Intel Core i5
4GB 1600 MHz DDR3, 256GB HDD
2. Fully-distributed mode
- Raspberry Pi 2 Model B (3 nodes)
A quad-core ARM Cortex-A7 CPU (overclocked to 1 GHz)
1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet
11
Split size variation [Single node]
* ‘14 MacBook Pro, 2.8 GHz Intel Core i5, 8GB 1600 MHz DDR3, 500GB SSD
Input: StackOverflow’s users profiles (1GB)
[Figure: two charts of run time (sec) versus split size (16, 32, 64, 128, 256 MB). Panels: Distributed grep (no Reducer); Standard deviation of users’ age. Series: Standalone; Pseudo-distributed (2 Mappers, 2 Reducers); Pseudo-distributed (4 Mappers, 4 Reducers). Across all series the data labels fall from about 130 to 204 sec at 16 MB to about 23 to 56 sec at 256 MB.]
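One plausible reading of the split-size results is that a smaller split produces more map tasks, and each task carries a fixed startup cost. The sketch below only estimates the task count. It assumes the 1 GB input is 1024 MB and that one map task is created per split; real Hadoop computes splits per file and block, so treat the numbers as approximate. The helper name `num_map_tasks` is hypothetical.

```python
import math

def num_map_tasks(input_mb, split_mb):
    """Approximate number of map tasks: one per split, rounding up."""
    return math.ceil(input_mb / split_mb)

INPUT_MB = 1024  # assumption: the 1 GB input taken as 1024 MB

for split_mb in (16, 32, 64, 128, 256):
    print(f"split {split_mb:>3} MB -> {num_map_tasks(INPUT_MB, split_mb):>2} map tasks")
```

Going from 16 MB to 256 MB splits cuts the estimated task count from 64 to 4, which is consistent with the shorter run times reported for larger splits on a single node.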
12
Split size variation [Single node]
* ‘12 MacBook Air, 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD
Input: StackOverflow’s Comments (8.5GB)
[Figure: two charts of run time (sec) versus split size (16, 32, 64, 128, 256 MB). Panels: Standard deviation of comment’s text length; Count Min and Max value. Series: Standalone; Pseudo-distributed (2 Mappers, 2 Reducers); Pseudo-distributed (4 Mappers, 4 Reducers). Across all series the data labels fall from about 1469 to 1614 sec at 16 MB to about 254 to 323 sec at 256 MB.]
13
Split size variation [Fully-distributed]
* Raspberry Pi 2 Model B (3 nodes), quad-core ARM Cortex-A7 CPU (overclocked to 1 GHz), 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet
Input: StackOverflow’s users profiles (1GB)
[Figure: two charts of run time (sec) versus split size (MB). Panels: Distributed grep (no Reducer), split sizes 32 to 256 MB; average users’ age based on countries, split sizes 16 to 256 MB. Series: 6 Mappers; 12 Mappers. Each series is fastest at 32 or 64 MB (about 296 to 429 sec) and slowest at 256 MB (about 548 to 604 sec).]
14
io.sort.mb variation [Single node]
* ‘12 MacBook Air, 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD
Input: StackOverflow’s Comments (8.5GB)
Test program: Standard deviation of comment’s text length
[Figure: run time (sec) versus io.sort.mb (20, 40, 80, 160, 320 MB). Series: Standalone; Pseudo-distributed (2 Mappers, 2 Reducers); Pseudo-distributed (4 Mappers, 2 Reducers). Data labels range from 629 to 872 sec, and each series changes by less than 20 sec beyond 80 MB.]
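A rough way to see why a larger io.sort.mb can help, and why the benefit flattens out, is to estimate how many times a map task spills its sort buffer to disk. The sketch below is a simplification under stated assumptions: it assumes the buffer spills when it is 80% full (the `io.sort.spill.percent` setting as we recall it from Hadoop 1.x), it ignores record metadata, and the 250 MB of map output per task is a hypothetical figure, not a measurement from these tests.

```python
import math

def estimated_spills(map_output_mb, io_sort_mb, spill_percent=0.80):
    """Estimate spill count: map output divided by the usable buffer size."""
    usable_mb = io_sort_mb * spill_percent  # buffer space filled before a spill starts
    return max(1, math.ceil(map_output_mb / usable_mb))

MAP_OUTPUT_MB = 250  # hypothetical map output per task

for io_sort_mb in (20, 40, 80, 160, 320):
    print(f"io.sort.mb={io_sort_mb:>3} -> ~{estimated_spills(MAP_OUTPUT_MB, io_sort_mb)} spills")
```

Once the buffer is large enough that the whole map output fits in one spill, increasing it further cannot reduce the spill count, which matches the flattening seen at the larger settings.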
15
I am working well with small datasets like 200-500MB.
But for datasets above 1GB, I am getting an error like this:
* http://stackoverflow.com/questions/23042829/getting-java-heap-space-error-while-running-a-mapreduce-code-for-large-dataset
2. Large Intermediate Results
16
Problem Investigation
Split input files (1.3 GB)
The Mapper: Task 1, Task 2, Task 3, Task 4, Task 5
Intermediate key/value pairs [K, V]: 4.8 GB in total, almost 1 GB per task
17
Problem Investigation
The Reducer
Intermediate key/value pairs [K, V]: 4.8 GB in total, almost 1 GB each
“I just have 1GB heap space!”
The Java heap can’t hold the intermediate data structure
18
Configuration was:
1.3GB Input, 256MB Split size, 1024MB Java Heap Space
Error: Java heap space
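The arithmetic behind this failure can be checked with a short sketch. It takes the 4.8 GB of intermediate data and the five tasks from the slides, and it assumes that each task must hold its share in memory at once. The `usable_fraction` of 0.7 is a hypothetical allowance for the JVM's own overhead, not a Hadoop setting.

```python
def per_task_share_mb(total_intermediate_gb, num_tasks):
    """Intermediate data each task must hold, in MB, if spread evenly."""
    return total_intermediate_gb * 1024 / num_tasks

def fits_in_heap(required_mb, heap_mb, usable_fraction=0.7):
    """True if the data fits in the part of the heap assumed to be usable."""
    return required_mb <= heap_mb * usable_fraction

share = per_task_share_mb(4.8, 5)  # 4.8 GB over 5 tasks, from the slides
print(f"per-task share: {share:.0f} MB")
for heap_mb in (1024, 2048):
    print(f"heap {heap_mb} MB -> fits: {fits_in_heap(share, heap_mb)}")
```

Under these assumptions the per-task share is just under 1 GB, which does not fit in a 1024 MB heap but does fit in a 2048 MB heap.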
19
Summary of Solutions
• Modify the configuration parameters
• Alter the program’s algorithm
: An alternative solution was suggested on the site
-> It succeeded with the configuration under which the original version failed
( 256MB Split size & 1024MB Java heap size )

Job outcome by split size and Java heap size:
Split size    Java heap 1024MB    Java heap 2048MB
128 MB        Successful          Successful
256 MB        Failed              Successful
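As an illustration of what "altering the program's algorithm" can mean, the sketch below contrasts a standard-deviation computation that buffers every value with a streaming version (Welford's method) that keeps only three numbers in memory. This is a generic example and not the fix suggested in the StackOverflow thread; both function names are hypothetical, and the input is assumed to be non-empty.

```python
import math

def stddev_buffered(values):
    """Holds every value in memory: memory use grows with the input."""
    data = list(values)
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

def stddev_streaming(values):
    """Welford's method: constant memory, one pass over the values."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # running sum of squared deviations
    return math.sqrt(m2 / count)   # population standard deviation

sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(stddev_buffered(sample), stddev_streaming(iter(sample)))
```

Both functions return the same result, but only the streaming one can consume an iterator of arbitrary length without the working set growing, which is the property that matters when the heap is the bottleneck.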
20
Conclusions
• How to address poor performance
1. Adjust ‘split size’ & ‘sort space’ (io.sort.mb)
- Larger values generally took less time in the single-node tests
2. Adjust the number of Mappers
- Utilize all CPU cores
- A larger number of Mappers is not always better
• If the intermediate data structure is too large,
- Modify the configuration parameters or
- Alter the program’s algorithm
21
References
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified
Data Processing on Large Clusters”. [Online].
Available: http://static.googleusercontent.com/media/research.google.com/ko//archive/mapreduce-osdi04.pdf
[2] 한기용, Do it! Hands-On Hadoop Programming (in Korean). Seoul: EasysPublishing,
2013.
[3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce
jobs”, Chinese Academy of Sciences.
[4] Donald Miner and Adam Shook, MapReduce Design Patterns. O’Reilly
Media, Inc., 2012.
22
Thank You
If you want more technical information,
please visit our GitHub repository.
Our project is Open Source.
https://github.com/I-SURF-Hadoop/MapReduce
23
Appendix
How does MapReduce really work?
24
How does MapReduce work?
[ Map Phase ]
cat, 1
dog, 1
sees, 1
the, 2
Combining & Sorting
The cat sees the dog, and the dog sees the cat.
the, 1
cat, 1
sees, 1
the, 1
dog, 1
MapReduce library first splits
the input into M pieces.
A map worker processes these
pieces using a user-defined Map
function. Intermediate key/value
pairs will be produced by this
function.
The cat sees the dog
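The "Combining & Sorting" step on this slide can be sketched as follows. The map function emits one (word, 1) pair per word, and a combiner then sums the pairs locally and sorts them by key before anything is sent to a reducer. The names `map_words` and `combine` are hypothetical, and this is a single-process illustration rather than Hadoop's Combiner API.

```python
from collections import Counter

def map_words(split):
    """User-defined Map function: one (word, 1) pair per word in the split."""
    return [(word.lower(), 1) for word in split.split()]

def combine(pairs):
    """Combiner: sum counts per key on the map side, then sort by key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

pairs = map_words("The cat sees the dog")
combined = combine(pairs)
print(pairs)
print(combined)
```

Combining shrinks five intermediate pairs to four here; on real data with many repeated keys the reduction in intermediate data can be much larger.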
25
How does MapReduce work?
The cat sees the dog, and the dog sees the cat.
sees, 2
the, 4
cat, 2
dog, 2
and, 1
[ Reduce Phase ]
When a reduce worker has read
all intermediate data, it sorts
them by the intermediate keys.
The reduce worker iterates the
sorted intermediate data and for
each unique intermediate key
encountered, it passes the key
and the values to the user’s
Reduce function.
cat, 1
dog, 1
sees, 1
the, 2
cat, 1
dog, 1
sees, 1
the, 2
and, 1
Shuffling
Two independent reducers
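The shuffle to two independent reducers can be sketched like this. Each key is assigned to exactly one reducer by a partition function, and each reducer then sorts its keys and sums the values. CRC32 is used here only as a deterministic stand-in for a hash partitioner, so the key-to-reducer assignment will not necessarily match the one drawn on the slide; all function names are hypothetical.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key (CRC32 is a stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route every intermediate pair to the bucket of its reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            buckets[partition(key, num_reducers)][key].append(value)
    return buckets

def run_reducers(buckets):
    """Each reducer iterates its keys in sorted order and sums the values."""
    return [{key: sum(bucket[key]) for key in sorted(bucket)} for bucket in buckets]

# Combined map outputs from the two map tasks on the slide.
map_outputs = [
    [("cat", 1), ("dog", 1), ("sees", 1), ("the", 2)],
    [("and", 1), ("cat", 1), ("dog", 1), ("sees", 1), ("the", 2)],
]
results = run_reducers(shuffle(map_outputs, 2))
print(results)
```

Because all values for a given key land in the same bucket, the two reducers never need to communicate with each other.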
26

More Related Content

Similar to [150824]symposium v4

Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.Kyong-Ha Lee
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learningArnaud Rachez
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine LearningSudarsun Santhiappan
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learningbutest
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedTuri, Inc.
 
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...areej qasrawi
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
Topic 6 IB DP CS
Topic 6 IB DP CSTopic 6 IB DP CS
Topic 6 IB DP CSzion66
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsLucidworks
 
Productionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground UpProductionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground Upodsc
 
Presentation
PresentationPresentation
Presentationbutest
 
What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...Simon Lia-Jonassen
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?CS, NcState
 

Similar to [150824]symposium v4 (20)

Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.KIISE:SIGDB Workshop presentation.
KIISE:SIGDB Workshop presentation.
 
Scalable machine learning
Scalable machine learningScalable machine learning
Scalable machine learning
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Challenges in Large Scale Machine Learning
Challenges in Large Scale  Machine LearningChallenges in Large Scale  Machine Learning
Challenges in Large Scale Machine Learning
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Spark
SparkSpark
Spark
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
MapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine LearningMapReduce: Distributed Computing for Machine Learning
MapReduce: Distributed Computing for Machine Learning
 
Making Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and DistributedMaking Machine Learning Scale: Single Machine and Distributed
Making Machine Learning Scale: Single Machine and Distributed
 
Cloud accounting software uk
Cloud accounting software ukCloud accounting software uk
Cloud accounting software uk
 
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Topic 6 IB DP CS
Topic 6 IB DP CSTopic 6 IB DP CS
Topic 6 IB DP CS
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Productionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground UpProductionizing Deep Learning From the Ground Up
Productionizing Deep Learning From the Ground Up
 
Presentation
PresentationPresentation
Presentation
 
What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...
 
In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?In the age of Big Data, what role for Software Engineers?
In the age of Big Data, what role for Software Engineers?
 

More from yyooooon

#15.7.16 Presentation in UCI
#15.7.16 Presentation in UCI#15.7.16 Presentation in UCI
#15.7.16 Presentation in UCIyyooooon
 
about message coalescing
about message coalescingabout message coalescing
about message coalescingyyooooon
 
ffmpeg optimization using CUDA
ffmpeg optimization using CUDAffmpeg optimization using CUDA
ffmpeg optimization using CUDAyyooooon
 
HM10 for presentation
HM10 for presentationHM10 for presentation
HM10 for presentationyyooooon
 
Hm10 Research sheets
Hm10 Research sheetsHm10 Research sheets
Hm10 Research sheetsyyooooon
 
Android_01 Install Eclipse ADK/formatter/checker
Android_01 Install Eclipse ADK/formatter/checker Android_01 Install Eclipse ADK/formatter/checker
Android_01 Install Eclipse ADK/formatter/checker yyooooon
 
MCP3008 & TMP36 을 이용한 온도측정 및 응용
MCP3008 & TMP36 을 이용한 온도측정 및 응용MCP3008 & TMP36 을 이용한 온도측정 및 응용
MCP3008 & TMP36 을 이용한 온도측정 및 응용yyooooon
 
01_라즈베리파이세팅
01_라즈베리파이세팅01_라즈베리파이세팅
01_라즈베리파이세팅yyooooon
 

More from yyooooon (8)

#15.7.16 Presentation in UCI
#15.7.16 Presentation in UCI#15.7.16 Presentation in UCI
#15.7.16 Presentation in UCI
 
about message coalescing
about message coalescingabout message coalescing
about message coalescing
 
ffmpeg optimization using CUDA
ffmpeg optimization using CUDAffmpeg optimization using CUDA
ffmpeg optimization using CUDA
 
HM10 for presentation
HM10 for presentationHM10 for presentation
HM10 for presentation
 
Hm10 Research sheets
Hm10 Research sheetsHm10 Research sheets
Hm10 Research sheets
 
Android_01 Install Eclipse ADK/formatter/checker
Android_01 Install Eclipse ADK/formatter/checker Android_01 Install Eclipse ADK/formatter/checker
Android_01 Install Eclipse ADK/formatter/checker
 
MCP3008 & TMP36 을 이용한 온도측정 및 응용
MCP3008 & TMP36 을 이용한 온도측정 및 응용MCP3008 & TMP36 을 이용한 온도측정 및 응용
MCP3008 & TMP36 을 이용한 온도측정 및 응용
 
01_라즈베리파이세팅
01_라즈베리파이세팅01_라즈베리파이세팅
01_라즈베리파이세팅
 

Recently uploaded

Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHC Sai Kiran
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxsomshekarkn64
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptJasonTagapanGulla
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 

Recently uploaded (20)

Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Solving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.pptSolving The Right Triangles PowerPoint 2.ppt
Solving The Right Triangles PowerPoint 2.ppt
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 

[150824]symposium v4

  • 1. Hadoop MapReduce How to Survive Out-of-Memory Errors Member: Yoonseung Choi Soyeong Park Faculty Mentor: Prof. Harry Xu Student Mentor: Khanh Nguyen The International Summer Undergraduate Research Fellowship 1
  • 2. Outline • Introduction • What is MapReduce? • How does MapReduce work? • Limitations of MapReduce • What are our goals? • Operation test • Conclusions 2
  • 3. “There was 5 exabytes of information created between the dawn of civilization through 2003, But that much information is now created every 2 days, and the pace is increasing...” - Eric Schmidt, The Former Google CEO 3
  • 4. Data scientists want to analyze these large data sets But single machines have limitations in processing these data sets How can we handle that? Furthermore, data sets are now growing very rapidly We don’t want to understand parallelization, fault tolerance, data distribution, and load balancing! Distributed processing Therefore, we purpose The ‘MapReduce’ parallelization fault tolerance data distribution load balancing 4
  • 5. MapReduce is a programming model for processing large data sets Many real world tasks are expressible in this model The model is easy to use, even for programmers without experience with parallel and distributed systems [1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. * https://en.wikipedia.org/wiki/Apache_Hadoop MapReduce Layer HDFS Layer 5
  • 6. What is MapReduce? Mapper takes an input and produces a set of intermediate key/value pairs Reducer merges together these intermediate values associated with the same intermediate key [1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. p.12 6
  • 7. How does MapReduce work? The cat sees the dog, and the dog sees the cat. The cat sees the dog Andthedogseesthecat cat, 1 dog, 1 sees, 1 the, 2 cat, 1 dog, 1 sees, 1 the, 2 and, 1 cat, 2 dog, 2 sees, 2 the, 4 and, 1 - Wordcount program - A sentence is split into two map tasks Map Phase Reduce Phase 7
  • 8. Limitations of MapReduce There are many reasons for poor performance And even experts sometimes can’t figure them out 8
  • 9. What are our goals? • Research Out-of-Memory Error(OOM) cases • Document OOM cases • Implement and simulate StackOverflow OOM cases • Develop solutions for such OOM cases … all done!! 9
  • 10. Two Categories 1. Inappropriate Configuration Configuration which causes poor performance 2. Large Intermediate Results Temporary data structure grows too large [3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce jobs, Chinese Academy of Sciences. 10
  • 11. Operation test environments 1. Standalone & Pseudo-distributed mode - ‘14 MacBook Pro, 2.8 GHz Intel Core i5 8GB 1600 MHz DDR3, 500GB HDD - ‘12 MacBook Air 1.4, GHz Intel Core i5 4GB 1600 MHz DDR3, 256GB HDD 2. Fully-distributed mode - Raspberry Pi 2 Model B (3 nodes) A quad-core ARM Cortex-A7 CPU (1Ghz Overclock) 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet 11
  • 12. Split size variation [Single node] * ‘14 MacBook Pro 2.8, GHz Intel Core i5, 8GB 1600 MHz DDR3, 500GB SSD Input: StackOverflow’s users profiles (1GB) 173.3 88.3 47.3 26.7 24.3 204 117.3 86.3 64.7 56.3 169.3 117.3 78.7 59 55 0 50 100 150 200 16 32 64 128 256 (sec) 169.7 85.7 43 23 23.3 172.7 103.7 64.7 48.7 37.7 129.7 77.7 55 39 32.7 0 50 100 150 200 16 32 64 128 256 [ Distributed grep (no Reducer) ][ Standard deviation of users’ age ] (MB) (sec) (MB) Standalone Pseudo-distributed (2Mapper 2Reducer) Pseudo-distributed (4Mapper 4Reducer) 12
  • 13. [ ] Split size variation [Single node] * ‘12 MacBook Air 1.4, GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD Input: StackOverflow’s Comments (8.5GB) 1577.7 807.7 425 411 312.3 1586.3 831 634 454.3 299 1590 803.7 540.3 397.7 323 250 550 850 1150 1450 16 32 64 128 256 Standard deviation of comment’s text length Count Min and Max value Standalone Pseudo-distributed (2Mapper 2Reducer) Pseudo-distributed (4Mapper 4Reducer) 1469 783 398 389.3 281.3 1614 610.7 612 418.7 294.3 1598 609 488 362.7254.3 250 550 850 1150 1450 1750 16 32 64 128 256 [ ] (MB)(MB) (sec) (sec) 13
  • 14. Split size variation [Fully-distributed] Input: StackOverflow’s users profiles (1GB) 375 396 442 548 313 296 350 557 0 200 400 600 800 32 64 128 256 * Raspberry Pi 2 Model B (3 nodes) A quad-core ARM Cortex-A7 CPU (1Ghz Overclock) 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet 462.7 428.7 476.7 561.7 604 333.3 303 345 33… 603 0 200 400 600 800 16 32 64 128 256 [ Distributed grep (no Reducer) ][ average users’ age based on countries ] 6 Mapper 12 Mapper (MB)(MB) (sec) (sec) 14
  • 15. io.sort.mb variation [Single node] * ‘12 MacBook Air 1.4, GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD Input: StackOverflow’s Comments (8.5GB) Test program: Standard deviation of comment’s text length 872 827 814 798 803.7 661 638.7 632 629.7 629.7 633.7 641 635.7 629.3 629600 700 800 900 20 40 80 160 320 Stand alone Pseudo distributed; 2M2R Pseudo distributed; 4M2R (MB) (sec) 15
  • 16. 2. Large Intermediate Results. “I am working well with small datasets like 200-500MB. But for datasets above 1GB, I am getting an error like this:” * http://stackoverflow.com/questions/23042829/getting-java-heap-space-error-while-running-a-mapreduce-code-for-large-dataset
  • 17. Problem Investigation: The Mapper. The split input files (1.3 GB) are processed by Task 1 to Task 5. Each task emits intermediate key/value pairs [K, V]: 4.8 GB in total, almost 1 GB each.
  • 18. Problem Investigation: The Reducer. Intermediate key/value pairs [K, V]: 4.8 GB in total, almost 1 GB each. “I just have 1GB heap space!” The Java heap cannot hold the intermediate data structure.
  • 19. Configuration was: 1.3 GB input, 256 MB split size, 1024 MB Java heap space. Error: Java heap space
  • 20. Summary of Solutions
      • Modify the configuration parameters
      • Alter the program’s algorithm: an alternative solution suggested on the site succeeded with the configuration under which the original version failed (256 MB split size, 1024 MB Java heap size)

      Split size | Java heap 1024 MB | Java heap 2048 MB
      128 MB     | Successful        | Successful
      256 MB     | Failed            | Successful
  • 21. Conclusions
      • How to address poor performance
        1. Adjust the split size and the sort space: larger sizes generally took less time
        2. Adjust the number of Mappers: utilize all CPU cores, but a larger number of Mappers is not always better
      • If the intermediate data structure is too large
        - Modify the configuration parameters, or
        - Alter the program’s algorithm
  • 22. References
      [1] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” 2004. [Online]. Available: http://static.googleusercontent.com/media/research.google.com/ko//archive/mapreduce-osdi04.pdf
      [2] Han Ki-yong (한기용), Do it! Hands-On Hadoop Programming (Do it! 직접 해보는 하둡 프로그래밍). Seoul: EasysPublishing, 2013.
      [3] Lijie Xu, “An Empirical study on real-world OOM cases in MapReduce jobs,” Chinese Academy of Sciences.
      [4] Donald Miner and Adam Shook, MapReduce Design Patterns. O’Reilly Media, Inc., 2012.
  • 23. Thank You. If you want more technical information, please visit our GitHub repository. Our project is open source. https://github.com/I-SURF-Hadoop/MapReduce
  • 24. Appendix: How does MapReduce really work?
  • 25. How does MapReduce work? [ Map Phase ] The MapReduce library first splits the input into M pieces. A map worker processes these pieces using a user-defined Map function, which produces intermediate key/value pairs. Example (word count): the input “The cat sees the dog, and the dog sees the cat.” yields the piece “The cat sees the dog”. Map output: the, 1; cat, 1; sees, 1; the, 1; dog, 1. After combining & sorting: cat, 1; dog, 1; sees, 1; the, 2.
  • 26. How does MapReduce work? [ Reduce Phase ] When a reduce worker has read all intermediate data, it sorts them by the intermediate keys. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and its values to the user’s Reduce function. Example (word count): the two map outputs (cat, 1; dog, 1; sees, 1; the, 2) and (and, 1; cat, 1; dog, 1; sees, 1; the, 2) are shuffled to two independent reducers, which output and, 1; cat, 2; dog, 2; sees, 2; the, 4.
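The word count walk-through on slides 25 and 26 can be sketched in plain Python. This is only an illustration of the map, combine, shuffle, and reduce steps, not Hadoop code: every function name is hypothetical, the input is split at commas as a stand-in for Hadoop's size-based splits, and a single process plays the role of all workers.

```python
import re
from collections import Counter, defaultdict


def split_input(text):
    """Split the input into pieces (assumption: one piece per comma-separated clause)."""
    return [piece for piece in text.split(",") if piece.strip()]


def map_words(piece):
    """Map: emit an intermediate (word, 1) pair for every word in the piece."""
    return [(word, 1) for word in re.findall(r"[a-z]+", piece.lower())]


def combine(pairs):
    """Combine and sort: pre-aggregate one map task's pairs by key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())


def shuffle(map_outputs):
    """Shuffle: group the values of all map outputs by key, sorted by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_counts(key, values):
    """Reduce: merge all values that share one intermediate key."""
    return key, sum(values)


def word_count(text):
    map_outputs = [combine(map_words(piece)) for piece in split_input(text)]
    return dict(reduce_counts(key, values) for key, values in shuffle(map_outputs))
```

Calling `word_count("The cat sees the dog, and the dog sees the cat.")` reproduces the counts shown on slide 26. A real job would additionally partition the keys across the two reducers, which this sketch leaves out.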
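Slide 20 lists "alter the program's algorithm" as a fix but does not say which alternative was used. As one possible illustration, a standard deviation reducer can avoid holding every value in the heap by using a running (Welford-style) update. The class and function names below are hypothetical, and the assumption that the failing version buffered all values in memory is ours, not the slides'.

```python
import math


class RunningStd:
    """Tracks count, mean, and squared deviations in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def std(self):
        """Sample standard deviation; 0.0 when fewer than two values were seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def reduce_std(values):
    """Consume the values one at a time, as a reducer iterates over its input."""
    acc = RunningStd()
    for value in values:
        acc.add(value)
    return acc.std()
```

Because `reduce_std` accepts any iterable, it can be fed a generator, for example `reduce_std(len(line) for line in open(path))` with a hypothetical `path`, so memory use stays flat regardless of how many values one key receives.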

Editor's Notes

  1. Anteater is so cute
  2. Schmidt said this in a speech at the Techonomy conference (’10) in Lake Tahoe: http://readwrite.com/2010/08/04/google_ceo_schmidt_people_arent_ready_for_the_tech
  3. [1, p.12] – map > emit * ADD AN ANIMATION
  4. - Check the number of configuration parameters stated in the paper. From here on, the content gets a bit more technical, so don’t fall asleep. Many programming models that use MapReduce are implemented in managed languages such as Java. These rely on a garbage collector, which sometimes causes problems.
  5. I want to tell you what we are doing now
  6. We studied several papers and found patterns that cause OOM errors. These patterns fall into three categories.
  7. Just show that the running time decreases.
  8. Just show that the running time decreases.
  9. Why does the graph go up? With a 256 MB split size there are only 4 map tasks, which means 2 of the 6 mappers do no work. So we need something bigger.
  10. Explain verbally that the factor value is 1/10 of io.sort.mb. Just show that the running time decreases.
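The arithmetic behind note 9 can be sketched as follows. This is a simplification under stated assumptions: the 1 GB input is treated as a single 1024 MB file, the number of map tasks is taken to be the input size divided by the split size, rounded up, and both function names are hypothetical. Hadoop's actual split computation works per file and may differ slightly.

```python
import math


def map_task_count(input_mb, split_mb):
    """Approximate number of map tasks for one input of the given size."""
    return math.ceil(input_mb / split_mb)


def idle_mappers(input_mb, split_mb, mapper_slots):
    """Mapper slots left without a task when there are fewer tasks than slots."""
    return max(0, mapper_slots - map_task_count(input_mb, split_mb))
```

With these assumptions, `map_task_count(1024, 256)` gives 4 and `idle_mappers(1024, 256, 6)` gives 2, matching the note. At a 32 MB split size there are 32 tasks, so no mapper sits idle.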