Hadoop Introduction II
               K-means && Python && Dumbo
Outline

• Dumbo
• K-means
• Python and Data Mining




12/20/12                   2
Hadoop in Python
• Jython: Happy
• CPython:
     • Pydoop
           • Components (RecordReader, RecordWriter and Partitioner)
           • Get configuration, set counters and report status
           • Runs in CPython, so any Python module can be used
           • HDFS API
     • Hadoopy: another Cython-based wrapper
• Streaming:
     • Dumbo
     • Other small MapReduce wrappers

12/20/12                                                                               3
Hadoop in Python




12/20/12           4
Hadoop in Python Extension




                                 Hadoop in Python




Integration with Pipes (C++) + integration with libhdfs (C)
 12/20/12                                                   5
Dumbo
•   Dumbo is a project that allows you to easily write and
    run Hadoop programs in Python. More generally, Dumbo can be
    considered a convenient Python API for writing MapReduce
    programs.
•   Advantages:
     • Easy: Dumbo strives to be as Pythonic as possible
     • Efficient: Dumbo programs communicate with Hadoop very
       efficiently by relying on typed bytes, a nifty serialization
       mechanism that was added to Hadoop specifically with Dumbo
       in mind.
     • Flexible: we can extend it
     • Mature

12/20/12                                                            6
Dumbo: Review WordCount




12/20/12                  7
Dumbo – Word Count




12/20/12             8
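The word-count slides above were images in the original deck. As a hedged reconstruction, the canonical Dumbo word count (essentially the example from the Dumbo documentation) looks like this:

    def mapper(key, value):
        # 'value' is one line of input text; the key (byte offset) is ignored.
        for word in value.split():
            yield word, 1

    def reducer(key, values):
        # 'values' iterates over every count emitted for this word.
        yield key, sum(values)

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer, combiner=reducer)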
Dumbo IP counts




12/20/12          9
Dumbo IP counts




12/20/12          10
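The IP-count slides were also images. A hedged sketch of such a job, assuming web-server access logs with the client IP as the first whitespace-separated field (adjust the parsing to the real log format):

    def mapper(key, value):
        fields = value.split()
        if fields:
            yield fields[0], 1        # emit (ip, 1) for each log line

    def reducer(key, values):
        yield key, sum(values)        # total number of requests per IP

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer, combiner=reducer)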
K-means in Map-Reduce
•   Standard K-means:
     •     Inputs: a set of n d-dimensional points && a number of desired clusters k.

     •     Step 1: Randomly choose k of the n input points as initial centers.
     •     Step 2: Compute the distance from every point to each of the k centers and assign the point to the closest one.
     •     Step 3: Using this assignment of points to cluster centers, recalculate each cluster center as the centroid of its member points.
     •     Step 4: Iterate this process until convergence is reached.
     •     Final: points are reassigned to centers and centroids are recalculated until the k cluster centers shift by less than some delta value.



•   k-means is a surprisingly parallelizable algorithm (a single-machine reference version is sketched below).


12/20/12                                                                                    11
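For reference, a minimal single-machine sketch of the steps above, assuming NumPy and an (n, d) array of points (empty clusters are not handled here):

    import numpy as np

    def kmeans(points, k, delta=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: pick k distinct input points as the initial centers.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        while True:
            # Step 2: assign each point to its closest center.
            dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each center as the centroid of its members.
            new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
            # Step 4: stop once no center moves by more than delta.
            if np.linalg.norm(new_centers - centers, axis=1).max() < delta:
                return new_centers, labels
            centers = new_centers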
K-means in Map-Reduce
•   Key points:
     • we want a scheme where we can operate on
       each point in the data set independently.
     • there is only a small amount of shared data (the cluster centers).
     • when we partition points among MapReduce nodes, we
       also distribute a copy of the cluster centers. This results
       in a small amount of data duplication, but it is minimal,
       and in this way each of the points can be operated on
       independently.




12/20/12                                                         12
Hadoop Phase
• Map:
  • In: points in the data set
  • Out: a (ClusterID, Point) pair for each point,
    where ClusterID is the integer ID of the
    cluster whose center is closest to the point
    (a mapper sketch follows).




12/20/12                                           13
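A hedged Dumbo-style mapper for this phase. The current centers are the small piece of shared data; here they are assumed to be shipped to every task as a local file centers.txt (one comma-separated vector per line), which is an assumption about how the job is configured:

    class Mapper:
        def __init__(self):
            # Load the current cluster centers once per map task.
            self.centers = [[float(x) for x in line.split(",")]
                            for line in open("centers.txt")]

        def __call__(self, key, value):
            point = [float(x) for x in value.split(",")]
            # ClusterID of the closest center (squared Euclidean distance).
            cluster_id = min(
                range(len(self.centers)),
                key=lambda i: sum((p - c) ** 2
                                  for p, c in zip(point, self.centers[i])))
            yield cluster_id, point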
Hadoop Phase
• Reduce Phase:
   • In : (ClusterID, Point) pairs
• Operation:
   • the outputs of the map phase are grouped by
     ClusterID.
   • for each ClusterID, the centroid of the points
     associated with that ClusterID is calculated.
     • Out: (ClusterID, Centroid) pairs, which represent the
       newly calculated cluster centers (a reducer sketch follows).

12/20/12                                                     14
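A hedged reducer matching the mapper above: all points with the same ClusterID arrive in one reducer call, and the new centroid is their component-wise mean:

    def reducer(cluster_id, points):
        total, count = None, 0
        for point in points:
            # Accumulate a component-wise sum and a point count.
            total = list(point) if total is None else [t + p for t, p in zip(total, point)]
            count += 1
        yield cluster_id, [t / count for t in total]   # the new centroid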
External Program
•   Each iteration of the algorithm is structured as a single
    MapReduce job.

•   After each iteration, our library reads the output, determines
    whether convergence has been reached by calculating how far
    the cluster centers have moved, and, if not, runs
    another MapReduce job (a driver sketch follows).




12/20/12                                                        15
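A hedged sketch of that external driver loop. run_job and read_centers are hypothetical callables (for example thin wrappers around "dumbo start" and "dumbo cat") and are passed in as parameters:

    import math

    def max_shift_below(old, new, delta):
        # True if no center moved by more than delta (Euclidean distance).
        return all(math.sqrt(sum((a - b) ** 2 for a, b in zip(c_old, c_new))) < delta
                   for c_old, c_new in zip(old, new))

    def drive(input_path, initial_centers, run_job, read_centers,
              delta=1e-3, max_iter=20):
        centers = initial_centers
        for i in range(max_iter):
            output_path = "kmeans/iter-%d" % i
            run_job(input_path, centers, output_path)   # one MapReduce iteration
            new_centers = read_centers(output_path)     # (ClusterID, Centroid) output
            if max_shift_below(centers, new_centers, delta):
                break
            centers = new_centers
        return centers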
Write in Dumbo




12/20/12         16
Write in Dumbo




12/20/12         17
Write in Dumbo




12/20/12         18
Results




12/20/12   19
Next
• Write n-times iteration wrapper
• Optimize K-means
• Result Visualization with Python




12/20/12                             20
Optimize
•   Partial centroids for the clusters can be computed on the map
    nodes themselves (mapper-local computation!), with a weighted
    average of those partial centroids taken later by the reducer.
    Without this, the mapping is one to one: for every input
    point, our mapper outputs a single record, which then has to
    be sorted and transferred to a reducer.


• We can use a Combiner! (a hedged sketch follows)

12/20/12                                                   21
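A hedged combiner sketch. In this variant the mapper is assumed to emit (ClusterID, (point, 1)) instead of (ClusterID, point), so the combiner and the reducer share the same (sum, count) value format; this matters because Hadoop may run the combiner zero or more times:

    def combiner(cluster_id, values):
        total, count = None, 0
        for vec, n in values:
            # Fold points (or earlier partial sums) into one local (sum, count) pair.
            total = list(vec) if total is None else [t + v for t, v in zip(total, vec)]
            count += n
        yield cluster_id, (total, count)

    def reducer(cluster_id, values):
        total, count = None, 0
        for vec, n in values:
            total = list(vec) if total is None else [t + v for t, v in zip(total, vec)]
            count += n
        yield cluster_id, [t / count for t in total]   # weighted average = new centroid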
Dumbo Usage
•   Very easy
•   You can write your own code for Dumbo
•   Easy to debug
•   Simple commands (an example follows)




12/20/12                                    22
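For example, launching a job and fetching its output typically looks like this (the script name, Hadoop location and HDFS paths are placeholders for your cluster):

    dumbo start kmeans.py -hadoop /usr/lib/hadoop \
        -input /data/points.txt -output /results/kmeans-iter-0
    dumbo cat /results/kmeans-iter-0 -hadoop /usr/lib/hadoop > centers.txt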
Python and Data Mining
• Books:
   • Scientific Computing with Python (用 Python 进行科学计算)
   • Programming Collective Intelligence (集体智慧编程)
   • Mining the Social Web (挖掘社交网络)
   • Natural Language Processing with Python (用 Python 进行自然语言处理)
   • Think Stats: Python and Data Analysis (Think Stats Python 与数据分析)




12/20/12                        23
Python and Data Mining
• Tools
   • NumPy
   • SciPy
   • Orange (association rule mining with Orange)




12/20/12                           24
thanks




12/20/12            25
