MapR, Implications for Integration CHUG – August 2011
Outline
MapR system overview
Map-reduce review
MapR architecture
Performance results
Map-reduce on MapR
Architectural implications
Search indexing / deployment
EM algorithm for machine learning
… and more …
Map-Reduce
[Diagram: Input, Shuffle, Output]
Bottlenecks and Issues
Read-only files
Many copies in I/O path
Shuffle based on HTTP
  Can’t use new technologies
  Eats file descriptors
Spills go to local file space
  Bad for skewed distribution of sizes
MapR Areas of Development
MapR Improvements
Faster file system
  Fewer copies
  Multiple NICs
  No file descriptor or page-buf competition
Faster map-reduce
  Uses distributed file system
  Direct RPC to receiver
  Very wide merges
MapR Innovations
Volumes
  Distributed management
  Data placement
Read/write random access file system
  Allows distributed meta-data
  Improved scaling
  Enables NFS access
Application-level NIC bonding
Transactionally correct snapshots and mirrors
MapR's Containers
Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
  Directories & files
  Data blocks
  Replicated on servers
  No need to manage directly
Containers are 16-32 GB segments of disk, placed on nodes
Container locations and replication
Container location database (CLDB) keeps track of nodes hosting each container
[Diagram: the CLDB lists the nodes holding each container (N1, N2; N3, N2; N1, N2; N1, N3; N3, N2) across nodes N1, N2, N3]
MapR Scaling
Containers represent 16 - 32GB of data
  100M containers = ~2 Exabytes (a very large cluster)
250 bytes DRAM to cache a container
  But not necessary, can page to disk
  Typical large 10PB cluster needs 2GB
Container-reports are 100x - 1000x < HDFS block-reports
  Increase container size to 64G to serve 4EB cluster
  Map/reduce not affected
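The first two figures on the scaling slide can be checked with simple arithmetic. The sketch below is only a back-of-envelope calculation using the numbers quoted on the slide (100M containers, 16-32 GB each, 250 bytes of DRAM per cached container); the function names are made up for illustration, and decimal units are assumed.

```python
# Back-of-envelope check of the container scaling figures quoted on the slide.
GB = 10**9   # decimal gigabyte (assumption; the slide does not say which unit)
EB = 10**18  # decimal exabyte

containers = 100_000_000         # 100M containers (from the slide)
container_sizes_gb = (16, 32)    # each container holds 16-32 GB (from the slide)
cache_bytes_per_container = 250  # DRAM per cached container (from the slide)


def total_capacity_eb(n_containers: int, size_gb: int) -> float:
    """Data addressed by n_containers of the given size, in exabytes."""
    return n_containers * size_gb * GB / EB


def cache_size_gb(n_containers: int, bytes_per_entry: int) -> float:
    """DRAM needed to cache one entry per container, in gigabytes."""
    return n_containers * bytes_per_entry / GB


low, high = (total_capacity_eb(containers, s) for s in container_sizes_gb)
print(f"capacity: {low:.1f} to {high:.1f} EB")
print(f"cache for all containers: "
      f"{cache_size_gb(containers, cache_bytes_per_container):.0f} GB")
```

The range of 1.6 to 3.2 EB brackets the "~2 Exabytes" on the slide, and caching every container of such a cluster would take about 25 GB of DRAM.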
Terasort on MapR
10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
[Chart: elapsed time (mins); lower is better]
HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM
[Chart: records per second; higher is better]
Small Files (Apache Hadoop, 10 nodes)
Op:
  create file
  write 100 bytes
  close
Notes:
  NN not replicated
  NN uses 20G DRAM
  DN uses 2G DRAM
[Chart: rate (files/sec) vs. # of files (m); series: Out of box, Tuned]
MUCH faster for some operations
Same 10 nodes …
[Chart: create rate vs. # of files (millions)]
What MapR is not
Volumes != federation
  MapR supports > 10,000 volumes all with independent placement and defaults
  Volumes support snapshots and mirroring
NFS != FUSE
  Checksum and compress at gateway
  IP fail-over
  Read/write/update semantics at full speed
MapR != maprfs
New Capabilities
NFS mounting models
Export to the world
  NFS gateway runs on selected gateway hosts
Local server
  NFS gateway runs on local host
  Enables local compression and checksumming
Export to self
  NFS gateway runs on all data nodes, mounted from localhost
Export to the world
[Diagram: four NFS servers and an NFS client]
Local server
[Diagram: client application, NFS server, cluster nodes]
Universal export to self
[Diagram: cluster nodes, each running a task and an NFS server]
Nodes are identical
Application architecture So now we have a hammer Let’s find us some nails!
Sharded text indexing
[Diagram: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search engine]
Annotations:
  Assign documents to shards
  Index text to local disk and then copy index to distributed file store
  Copy to local disk typically required before index can be loaded
Sharded text indexing
Mapper assigns document to shard
  Shard is usually hash of document id
Reducer indexes all documents for a shard
  Indexes created on local disk
  On success, copy index to DFS
  On failure, delete local files
Must avoid directory collisions
  can’t use shard id!
Must manage and reclaim local disk space
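The bookkeeping this slide describes can be sketched in a few lines. This is an illustration under stated assumptions, not the deck's implementation: `shard_for`, `make_work_dir`, `finish`, and `attempt_id` are hypothetical names, SHA-256 stands in for whatever hash a real job would use, and a plain directory stands in for the distributed file store.

```python
import hashlib
import os
import shutil
import tempfile


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard with a stable hash.

    Python's built-in hash() is salted per process, so it is avoided here.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def make_work_dir(base_dir: str, shard: int, attempt_id: str) -> str:
    """Create a fresh local index directory for one reducer attempt.

    The shard id alone is not unique across retries, so mkdtemp adds a
    random suffix to avoid directory collisions.
    """
    return tempfile.mkdtemp(prefix=f"shard-{shard}-{attempt_id}-", dir=base_dir)


def finish(work_dir: str, dfs_dir: str, succeeded: bool) -> None:
    """On success copy the index out; in every case reclaim local disk."""
    try:
        if succeeded:
            shutil.copytree(work_dir, dfs_dir)  # dfs_dir must not exist yet
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)
```

Every step here (unique naming, copy on success, cleanup on failure) is work the job author has to get right when indexes are built on local disk.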
Conventional data flow
[Diagram: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search engine]
Failure of a reducer causes garbage to accumulate in the local disk.
Failure of search engine requires another download of the index from clustered storage.
Simplified NFS data flows
[Diagram: Input documents, Map, Reducer, Clustered index storage, Search engine]
Index to task work directory via NFS.
Failure of a reducer is cleaned up by map-reduce framework.
Search engine reads mirrored index directly.
Simplified NFS data flows
[Diagram: Input documents, Map, Reducer, Mirrors, Search engine, Search engine]
Mirroring allows exact placement of index data.
Arbitrary levels of replication also possible.
How about another one?
K-means
Classic E-M based algorithm
Given cluster centroids,
  Assign each data point to nearest centroid
  Accumulate new centroids
Rinse, lather, repeat
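The loop on this slide can be written out as a minimal single-machine sketch. It assumes squared Euclidean distance and leaves an empty cluster at its old centroid; neither choice is stated in the deck, and the function names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def kmeans_step(points: Sequence[Point], centroids: Sequence[Point]) -> List[Point]:
    """Assign each point to its nearest centroid and accumulate new centroids."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        k = nearest(point, centroids)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    new_centroids = []
    for k, old in enumerate(centroids):
        if counts[k]:
            new_centroids.append(tuple(s / counts[k] for s in sums[k]))
        else:
            new_centroids.append(tuple(old))  # empty cluster: keep old centroid
    return new_centroids


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iters: int = 100) -> List[Point]:
    """Repeat the step until the centroids stop changing or max_iters is hit."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        updated = kmeans_step(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

For example, `kmeans([(0, 0), (0, 2), (10, 0), (10, 2)], [(0, 0), (10, 0)])` settles on one centroid per column of points.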
K-means, the movie
[Diagram: Input, Centroids, Assign to nearest centroid, Aggregate new centroids]
But …
Parallel Stochastic Gradient Descent
[Diagram: Input, Train sub model, Average models, Model]
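The slide names only two steps, "Train sub model" and "Average models". The sketch below fills them in with assumptions that are not in the deck: a linear model, squared loss, a fixed learning rate, and one sub-model per input partition. It is meant only to show the shape of the computation.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (features, target)


def train_sub_model(examples: Sequence[Example], dim: int,
                    learning_rate: float = 0.05, epochs: int = 50) -> List[float]:
    """Plain SGD for a linear model with squared loss on one partition."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for features, target in examples:
            error = sum(w * x for w, x in zip(weights, features)) - target
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return weights


def average_models(models: Sequence[Sequence[float]]) -> List[float]:
    """Combine sub-models by averaging each weight across models."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


# Two partitions of data generated from y = 2 * x.
partitions = [
    [([1.0], 2.0), ([2.0], 4.0)],
    [([3.0], 6.0), ([0.5], 1.0)],
]
sub_models = [train_sub_model(part, dim=1) for part in partitions]
model = average_models(sub_models)
```

In a map-reduce setting each mapper would play the role of `train_sub_model` on its split and a single reducer would play the role of `average_models`.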
Variational Dirichlet Assignment
[Diagram: Input, Gather sufficient statistics, Update model, Model]
Old tricks, new dogs
Mapper
  Assign point to cluster
  Emit cluster id, (1, point)
Combiner and reducer
  Sum counts, weighted sum of points
  Emit cluster id, (n, sum/n)
Output to HDFS
[Diagram annotations: Read from HDFS to local disk by distributed cache; Read from local disk from distributed cache; Written by map-reduce]
Old tricks, new dogs
Mapper
  Assign point to cluster
  Emit cluster id, (1, point)
Combiner and reducer
  Sum counts, weighted sum of points
  Emit cluster id, (n, sum/n)
Output to HDFS
[Diagram annotations: MapR FS; Read from NFS; Written by map-reduce]
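The mapper and combiner/reducer roles on this slide can be simulated in one process. In the sketch below the centroids are passed as an argument where a real job would read them as side data (from the distributed cache or, on MapR, through NFS). All function names are illustrative, and `run_iteration` is a stand-in for the framework, not Hadoop API code.

```python
from collections import defaultdict


def map_point(point, centroids):
    """Mapper: emit (cluster id, (1, point)) for the nearest centroid."""
    cluster = min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )
    return cluster, (1, tuple(point))


def combine(values):
    """Combiner and reducer: merge (n, mean) pairs into a single (n, mean).

    Because the output has the same shape as the input, the same function
    can run as a combiner on each split and again as the final reducer.
    """
    values = list(values)
    total = sum(n for n, _ in values)
    dim = len(values[0][1])
    weighted = [sum(n * mean[d] for n, mean in values) for d in range(dim)]
    return total, tuple(w / total for w in weighted)


def run_iteration(splits, centroids):
    """Simulate one job: map and combine per split, then reduce per cluster."""
    combined = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for point in split:
            cluster, value = map_point(point, centroids)
            local[cluster].append(value)
        for cluster, values in local.items():
            combined[cluster].append(combine(values))
    return {cluster: combine(values) for cluster, values in combined.items()}
```

The reducer output, one `(n, sum/n)` pair per cluster id, becomes the centroid side data for the next iteration.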
Poor man’s Pregel
Mapper
Lines in bold can use conventional I/O via NFS

while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
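The loop above can be approximated with ordinary file I/O on a shared directory. The following is a single-process simulation under loud assumptions: `shared_dir` is a local directory standing in for an NFS-mounted cluster path, the model is a single float with a placeholder update rule, and the barrier ("synchronize") is replaced by running the workers one after another within each superstep. None of the names come from the deck.

```python
import json
import os
import tempfile


def write_model(shared_dir, worker, step, model):
    """Write to a temporary name, then rename, so readers never see half a file."""
    final = os.path.join(shared_dir, f"model-{step}-{worker}.json")
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        json.dump(model, f)
    os.replace(tmp, final)


def read_models(shared_dir, step):
    """Read every model that was written during the given superstep."""
    models = []
    for name in sorted(os.listdir(shared_dir)):
        if name.startswith(f"model-{step}-") and name.endswith(".json"):
            with open(os.path.join(shared_dir, name)) as f:
                models.append(json.load(f))
    return models


def superstep(shared_dir, worker, step, inputs):
    """Read and accumulate peer models, fold in local inputs, write the result."""
    peers = read_models(shared_dir, step - 1)
    model = sum(peers) / len(peers) if peers else 0.0
    for x in inputs:
        model += 0.5 * (x - model)  # placeholder update rule (assumption)
    write_model(shared_dir, worker, step, model)
    return model


def run(shared_dir, inputs_per_worker, steps):
    """Run all workers once per superstep; the loop order acts as the barrier."""
    for step in range(1, steps + 1):
        for worker, inputs in enumerate(inputs_per_worker):
            superstep(shared_dir, worker, step, inputs)
    return read_models(shared_dir, steps)


with tempfile.TemporaryDirectory() as shared:
    final_models = run(shared, [[1.0], [3.0]], steps=2)
```

On a real cluster each worker would be a long-running mapper and the barrier would need an actual coordination mechanism, which this sketch does not provide.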
Click modeling architecture
[Diagram: Input, Feature extraction and down sampling, Data join, Sequential SGD learning; labels: Map-reduce, Side-data, Now via NFS]

Training state-of-the-art general text embedding
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

MapR, Implications for Integration

  • 1. MapR, Implications for Integration CHUG – August 2011
  • 2. Outline: MapR system overview; Map-reduce review; MapR architecture; Performance results; Map-reduce on MapR; Architectural implications; Search indexing / deployment; EM algorithm for machine learning; … and more …
  • 4. Bottlenecks and Issues: Read-only files; Many copies in I/O path; Shuffle based on HTTP; Can’t use new technologies; Eats file descriptors; Spills go to local file space; Bad for skewed distribution of sizes
  • 5. MapR Areas of Development
  • 6. MapR Improvements: Faster file system; Fewer copies; Multiple NICs; No file descriptor or page-buf competition; Faster map-reduce; Uses distributed file system; Direct RPC to receiver; Very wide merges
  • 7. MapR Innovations: Volumes; Distributed management; Data placement; Read/write random access file system; Allows distributed meta-data; Improved scaling; Enables NFS access; Application-level NIC bonding; Transactionally correct snapshots and mirrors
  • 12. No need to manage directly. Containers are 16-32 GB segments of disk, placed on nodes
  • 13. Container locations and replication: Container location database (CLDB) keeps track of nodes hosting each container. Diagram: containers replicated across nodes N1, N2 and N3, with the CLDB listing the hosting nodes (e.g. N1, N2; N3, N2; N1, N3)
  • 16. But not necessary, can page to disk
  • 18. Increase container size to 64G to serve 4EB cluster
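The container counts behind these scaling slides are simple arithmetic. A rough sketch, assuming decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes) and the 250 bytes of DRAM per cached container quoted on the MapR Scaling slide; the function names are illustrative only and are not part of any MapR tool:

```python
GB = 10**9
EB = 10**18
BYTES_PER_CACHED_CONTAINER = 250  # figure quoted on the MapR Scaling slide


def containers_needed(cluster_bytes: int, container_bytes: int) -> int:
    """Number of containers required to cover a cluster (ceiling division)."""
    return -(-cluster_bytes // container_bytes)


def cache_bytes(n_containers: int) -> int:
    """DRAM needed to cache an entry for every container."""
    return n_containers * BYTES_PER_CACHED_CONTAINER


if __name__ == "__main__":
    n = containers_needed(4 * EB, 64 * GB)
    print(n, "containers,", cache_bytes(n) / GB, "GB of cache")
```

Under these assumptions a 4 EB cluster at 64 GB per container stays well under the 100M-container figure mentioned on the same slide.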
  • 20. Terasort on MapR. 10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm. Chart: elapsed time (mins), lower is better
  • 21. HBase on MapR. YCSB Random Read with 1 billion 1K records. 10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM. Chart: records per second, higher is better
  • 22. Small Files (Apache Hadoop, 10 nodes). Op: create file, write 100 bytes, close. Notes: NN not replicated; NN uses 20G DRAM; DN uses 2G DRAM. Chart: rate (files/sec) vs. # of files (m), with series "Out of box" and "Tuned"
  • 23. MUCH faster for some operations. Same 10 nodes … Chart: create rate vs. # of files (millions)
  • 24. What MapR is not: Volumes != federation (MapR supports > 10,000 volumes all with independent placement and defaults; volumes support snapshots and mirroring). NFS != FUSE (checksum and compress at gateway; IP fail-over; read/write/update semantics at full speed). MapR != maprfs
  • 26. NFS mounting models: Export to the world (NFS gateway runs on selected gateway hosts). Local server (NFS gateway runs on local host; enables local compression and checksumming). Export to self (NFS gateway runs on all data nodes, mounted from localhost)
  • 27. Export to the world. Diagram: one NFS client, four NFS servers
  • 28. Local server. Diagram: client application, NFS server, cluster nodes
  • 29. Universal export to self. Diagram: cluster nodes; one cluster node running a task and an NFS server
  • 30. Nodes are identical. Diagram: three cluster nodes, each running a task and an NFS server
  • 31. Application architecture: So now we have a hammer. Let’s find us some nails!
  • 32. Sharded text indexing. Diagram: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine. Callouts: Assign documents to shards. Index text to local disk and then copy index to distributed file store. Copy to local disk typically required before index can be loaded
  • 33. Sharded text indexing: Mapper assigns document to shard (shard is usually hash of document id). Reducer indexes all documents for a shard: indexes created on local disk; on success, copy index to DFS; on failure, delete local files. Must avoid directory collisions (can’t use shard id!). Must manage and reclaim local disk space
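The shard assignment and the directory-collision warning on this slide can be illustrated with a small sketch. It assumes SHA-256 as the stable hash and a per-attempt identifier to keep local index directories apart, presumably needed because two attempts at the same shard can run on the same node; `shard_for`, `work_dir`, and the `/tmp/index` path are hypothetical names, not Hadoop or MapR APIs:

```python
import hashlib
import os
import uuid


def shard_for(doc_id: str, n_shards: int) -> int:
    """Map a document id to a shard using a stable hash.

    Python's built-in hash() is salted per process, so it is avoided here.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def work_dir(base: str, shard: int, attempt_id: str) -> str:
    """Local index directory unique per task attempt, not just per shard."""
    return os.path.join(base, f"shard-{shard:05d}-{attempt_id}")


if __name__ == "__main__":
    shard = shard_for("doc-42", 16)
    # uuid4 stands in for a framework-provided task attempt id (an assumption).
    print(work_dir("/tmp/index", shard, uuid.uuid4().hex))
```

Keying the directory on the attempt rather than the shard is one way to honor the "can't use shard id" constraint; reclaiming the directories after failure still has to be handled separately.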
  • 34. Conventional data flow. Diagram: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine. Callouts: Failure of a reducer causes garbage to accumulate in the local disk. Failure of search engine requires another download of the index from clustered storage
  • 35. Simplified NFS data flows. Diagram: Input documents, Map, Reducer, Clustered index storage, Search Engine. Callouts: Index to task work directory via NFS. Failure of a reducer is cleaned up by map-reduce framework. Search engine reads mirrored index directly
  • 36. Simplified NFS data flows. Diagram: Input documents, Map, Reducer, Mirrors, two Search Engines. Callouts: Mirroring allows exact placement of index data. Arbitrary levels of replication also possible
  • 38. K-means: Classic E-M based algorithm. Given cluster centroids, assign each data point to nearest centroid; accumulate new centroids. Rinse, lather, repeat
  • 39. K-means, the movie. Diagram: Input, Centroids, Assign to nearest centroid, Aggregate new centroids
  • 41. Parallel Stochastic Gradient Descent. Diagram: Input, Train sub model, Average models, Model
  • 42. Variational Dirichlet Assignment. Diagram: Input, Gather sufficient statistics, Update model, Model
  • 43. Old tricks, new dogs. Mapper: assign point to cluster; emit cluster id, (1, point). Combiner and reducer: sum counts, weighted sum of points; emit cluster id, (n, sum/n). Output to HDFS. Callouts: Read from local disk from distributed cache. Read from HDFS to local disk by distributed cache. Written by map-reduce
  • 44. Old tricks, new dogs. Mapper: assign point to cluster; emit cluster id, (1, point). Combiner and reducer: sum counts, weighted sum of points; emit cluster id, (n, sum/n). Output to HDFS. Callouts: Read from NFS. Written by map-reduce. MapR FS
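The mapper and combiner/reducer steps on the two "Old tricks, new dogs" slides can be sketched in plain Python. This is a single-process illustration under assumptions, not the Hadoop job from the talk: points are tuples of numbers, distance is squared Euclidean, and every value is a (count, mean) pair so the same function can serve as combiner and reducer. All function names are illustrative.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def mapper(point, centroids):
    """Emit (cluster id, (1, point))."""
    return nearest(point, centroids), (1, tuple(point))


def combine(values):
    """Sum counts and take the count-weighted sum of means; return (n, sum / n)."""
    total = 0
    weighted = None
    for n, mean in values:
        total += n
        if weighted is None:
            weighted = [n * x for x in mean]
        else:
            weighted = [w + n * x for w, x in zip(weighted, mean)]
    return total, tuple(w / total for w in weighted)


def kmeans_step(points, centroids):
    """One iteration: map every point, group by cluster id, reduce each group."""
    groups = defaultdict(list)
    for point in points:
        cid, value = mapper(point, centroids)
        groups[cid].append(value)
    return {cid: combine(values) for cid, values in groups.items()}


if __name__ == "__main__":
    print(kmeans_step([(0, 0), (0, 2), (10, 10)], [(0, 1), (9, 9)]))
```

Because the output of `combine` has the same shape as its input, partial results from several combiners can be merged again in the reducer without changing the final centroid.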
  • 45. Poor man’s Pregel. Mapper: while not done: read and accumulate input models; for each input: accumulate model; write model; synchronize; reset input format; emit summary. Lines in bold can use conventional I/O via NFS
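The "write model", "read and accumulate input models", and "synchronize" steps of this loop can be approximated with ordinary file I/O on a shared directory. The sketch below is an assumption-laden stand-in: a temporary local directory plays the role of the NFS mount, models are flat lists of floats stored as JSON, synchronization is reduced to checking that every worker's file exists, and the file naming scheme is invented for illustration. Rename behaviour on a particular NFS mount should be verified rather than assumed.

```python
import json
import os
import tempfile


def model_path(shared_dir, worker, step):
    """Hypothetical naming scheme: one file per worker per step."""
    return os.path.join(shared_dir, f"step{step:04d}-worker{worker:03d}.json")


def write_model(shared_dir, worker, step, model):
    """Write to a temporary name, then rename, so readers never see a partial file."""
    final = model_path(shared_dir, worker, step)
    tmp = final + ".tmp"
    with open(tmp, "w") as f:
        json.dump(model, f)
    os.replace(tmp, final)


def read_models(shared_dir, step, n_workers):
    """Return every worker's model for a step, or None if any is still missing."""
    models = []
    for worker in range(n_workers):
        path = model_path(shared_dir, worker, step)
        if not os.path.exists(path):
            return None
        with open(path) as f:
            models.append(json.load(f))
    return models


def average(models):
    """Accumulate input models by element-wise averaging."""
    return [sum(column) / len(models) for column in zip(*models)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        write_model(shared, 0, 0, [1.0, 2.0])
        write_model(shared, 1, 0, [3.0, 4.0])
        print(average(read_models(shared, 0, 2)))
```

A real job would poll or block until `read_models` returns a full set before starting the next step; that waiting loop is left out here.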
  • 46. Click modeling architecture. Diagram: Input, Feature extraction and down sampling, Data join, Side-data, Sequential SGD Learning. Callouts: Map-reduce. Now via NFS
  • 47. Click modeling architecture. Diagram: Input, Feature extraction and down sampling, Data join, Side-data, four Sequential SGD Learning stages. Callouts: Map-reduce (twice). Map-reduce cooperates with NFS
  • 49. Hybrid model flow. Diagram: Feature extraction and down sampling, SVD (PageRank) (spectral), Down stream modeling, Deployed Model. Callouts: Map-reduce (twice), ??
  • 51. Hybrid model flow. Diagram: Feature extraction and down sampling, SVD (PageRank) (spectral), Down stream modeling, Deployed Model. Callouts: Sequential, Map-reduce
  • 53. Trivial visualization interface: Map-reduce output is visible via NFS. Legacy visualization just works. $ R > x <- read.csv("/mapr/my.cluster/home/ted/data/foo.out") > plot(error ~ t, x) > q(save='n')
  • 54. Conclusions: We used to know all this. Tab completion used to work. 5 years of work-arounds have clouded our memories. We just have to remember the future