IPPON 2020
Accelerate Big Data Analytics
with Managed Cluster Services
Introduction.
Data engineer at Ippon
Technologies, a boutique
consulting firm specializing in
Data, Cloud, DevOps, and Full
Stack Application
Development.
Ippon has deep expertise
across all major cloud
platforms:
★ AWS
★ Azure
★ GCP
Delivers Data Initiative
Architecture consulting from scoping
Project delivery from POC
Robust client portfolio
Led by Peter Choe
We’re hiring!
About Ippon’s Data Practice
Engineer with hands-on
experience with AWS and
Azure across multiple data
projects.
Currently working with
Cassandra (NoSQL) and
AWS compute and analytics
services.
Fun fact: My team won first
place at the 2020 Virginia
Governor's Datathon
I also organize the
Richmond (VA) Data
Engineering Meetup
Sam Portillo
Data Engineer
Agenda.
1. Introduction
2. Architecture and Design
3. Contrast with Similar Services
4. Lessons Learned
5. Q&A
What is a cluster?
Background on Big Data Computing.
❖ Pre 1994 - mainframe era
❖ 1994 to 2010 - cluster era
➢ Identical machines connected to the same network to run
workloads
➢ Price varies from 5-6 figures for hardware alone
❖ 2000s to present - GPU era
➢ GPU processing accelerates specific types of
problems (ML training, gaming graphics, etc.)
➢ Hardware can also vary from 5-6 figures
❖ 2010s to present - Cloud era
➢ Introduces pay-as-you-go IaaS and PaaS; firms can run
Big Data applications without buying/managing hardware
Benefits of IaaS.
❖ No longer need to buy/manage
hardware
❖ “Wait minutes not months”
❖ Pay-as-you-go beats financing
hardware in most cases
❖ Scale without physical hardware
limits
❖ User needs to manage OS,
middleware, applications
Benefits of PaaS.
❖ Cloud provider manages OS,
database, development tools
❖ Can be cheaper than IaaS
❖ Faster development time because
software comes preconfigured
❖ User still needs to manage applications, data,
etc.
IaaS, PaaS, SaaS Overview.
Managed Cluster Services.
❖ PaaS offering that allows cloud customers to easily run Big Data
workloads
❖ Support for open source tools like Spark and Hadoop
❖ Examples are
➢ AWS Elastic MapReduce
➢ Azure HDInsight
➢ Google Cloud Dataproc
Core Benefits of Managed Cluster Services.
❖ Cost savings associated with PaaS
❖ Ease of use for developers
❖ Integration with other cloud services
❖ Elastic
➢ Scales up and down as needed
❖ Flexibility
➢ Users can customize environments to solve a variety of problems
Use Cases.
❖ ETL Pipelines
➢ Build reliable data pipelines with Spark
❖ Big data migration
➢ Utilize the power of a cluster to transfer data in a distributed way
❖ Interactive analytics
➢ Quickly ingest terabytes of data to do initial analysis
❖ Machine learning
➢ Train ML models with TensorFlow or Spark MLlib
Takeaway.
❖ Managed cluster services can be versatile and used in different workloads
❖ They offer the benefits that cloud services are typically known for
❖ A good understanding of architecture principles will help an engineer
support a variety of projects
Architecture and Design
Architecture in AWS EMR.
❖ Each node is an EC2 instance
❖ Master node
➢ Responsible for coordinating applications
among the cluster
➢ Runs driver component for Spark apps
➢ SSH access
❖ Core nodes
➢ Do heavy lifting for applications
➢ Store data for Hadoop Distributed File
System
How distributed computing speeds up workloads.
❖ Workloads get distributed among a cluster
❖ Hadoop MapReduce
❖ Heavily relies on disk; stores data back to HDFS after each operation
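A minimal sketch of the MapReduce pattern in plain Python, assuming a toy word count over two input splits. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are illustrative and not part of Hadoop; on a real cluster each split would be mapped on a different node and the intermediate results written to disk.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle_phase(mapped_chunks):
    """Group intermediate pairs by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapped_chunks):
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each split is mapped independently; this is the step a cluster parallelizes.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: two splits, as if stored on two different nodes.
splits = [
    ["spark runs on a cluster", "hadoop runs on a cluster"],
    ["a cluster stores data in hdfs"],
]
counts = word_count(splits)
```

Because the map step touches each split independently, adding nodes lets more splits be processed at the same time; the shuffle is where data has to move between nodes.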
How distributed computing speeds up workloads.
❖ Apache Spark uses a directed acyclic graph (DAG) to plan workloads
❖ Advantages over MapReduce:
➢ Utilizes RAM and caches datasets between map-reduce jobs
➢ Doesn’t need to write to disk between map-reduce jobs
➢ Rich API that allows for dataset transformations
➢ Potentially up to 100 times faster than Hadoop MapReduce for in-memory workloads
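The idea of planning work lazily and caching intermediate results in memory can be sketched with a toy class. `LazyDataset` below is a hypothetical stand-in written for illustration, not the Spark API: transformations only record a step in the plan, and nothing is computed until an action (`collect` or `count`) is called.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical; not the Spark API)."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source
        self._parent = parent
        self._transform = transform
        self._cache_enabled = False
        self._cached = None
        self.compute_count = 0  # how many times this node was actually computed

    def map(self, fn):
        # Record the step; do not run it yet.
        return LazyDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LazyDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Ask for the result to be kept in memory after the first action.
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._transform(self._parent._materialize())
        if self._cache_enabled:
            self._cached = rows
        return rows

    def collect(self):
        return self._materialize()

    def count(self):
        return len(self._materialize())


numbers = LazyDataset(source=range(10))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n).cache()

result = evens_squared.collect()  # first action: runs the whole plan
size = evens_squared.count()      # second action: served from the in-memory cache
```

Without the `cache()` call, the second action would recompute the filter and map steps from the source, which is closer to how chained MapReduce jobs re-read their input from disk.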
Support for open-source software.
Customizing compute environments.
❖ Support for
➢ Logging
➢ SSH connection
➢ Bootstrap scripts to install custom software/packages
➢ Custom AMIs to achieve anything a bootstrap script can’t
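As a rough sketch, the customization options above map onto the request that boto3's EMR client accepts in `run_job_flow`. The helper below only builds the request dictionary, so it runs without AWS credentials. The cluster name, bucket, script path, key pair, release label, instance types and IAM role names are all assumptions chosen for illustration.

```python
def build_emr_request(name, log_bucket, bootstrap_script, key_name,
                      core_count=2, custom_ami=None):
    """Build keyword arguments shaped like those of boto3's EMR run_job_flow."""
    request = {
        "Name": name,
        # Logging: cluster logs are written under this S3 prefix.
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "ReleaseLabel": "emr-6.2.0",  # assumed release; pick the one you need
        "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_count},
            ],
            # SSH connection: the EC2 key pair used to log in to the master node.
            "Ec2KeyName": key_name,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Bootstrap script: runs on every node before applications start.
        "BootstrapActions": [
            {"Name": "install-packages",
             "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []}},
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }
    if custom_ami:
        # Custom AMI: for anything a bootstrap script cannot do.
        request["CustomAmiId"] = custom_ami
    return request


# Hypothetical values for illustration only.
request = build_emr_request(
    name="demo-cluster",
    log_bucket="my-log-bucket",
    bootstrap_script="s3://my-config-bucket/bootstrap/install.sh",
    key_name="my-key-pair",
)
# With credentials configured, this could be submitted with:
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as plain data makes it easy to review or version-control the cluster definition before anything is launched.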
EMR in action.
Contrast with Similar Services
Services to address.
❖ Databricks
❖ ETL tools
❖ Batch services
Databricks.
❖ Databricks is a managed Spark service
➢ Databricks manages compute
➢ Workflow automation and data pipelines
➢ Integrated workspace
❖ A managed cluster service just runs Spark jobs
ETL Tools.
❖ Can have similar ETL functions but managed cluster
services offer more than ETL
❖ ETL on ETL tools is generally much easier to configure, but
less flexible
❖ AWS Glue works on top of a Spark environment
Batch Services.
❖ Batch services are meant to run any batch computing job
at any scale.
❖ They may be able to do some ETL use cases
❖ Not a cluster service; they don’t run Spark or Hadoop
❖ Not an environment for interactive analytics
Takeaway.
❖ Decision to use a managed cluster service heavily depends
on the workload
Lessons Learned
Things to keep in mind.
1. Tuning can be difficult
2. Adopt best practices early
3. Assess what parts of a workload warrant a cluster service
4. Know your data
5. Explore different options before committing to a solution
Connect with me :D
❖ Sam Portillo
❖ https://www.linkedin.com/in/portillosc
❖ Email: sportillo@ippon.fr
Q&A

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 9155563397 👗 Top Class Call Girl Service B...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Managed Cluster Services

  • 1. IPPON 2020 Accelerate Big Data Analytics with Managed Cluster Services
  • 2. Introduction. Data engineer at Ippon Technologies, a boutique consulting firm specializing in Data, Cloud, DevOps, and Full Stack Application Development. Ippon has deep expertise across all major cloud platforms: ★ AWS ★ Azure ★ GCP Delivers Data Initiative Architecture consulting from scoping Project delivery from POC Robust client portfolio Led by Peter Choe We’re hiring! About Ippon’s Data Practice Engineer with hands-on experience with AWS and Azure across multiple data projects. Currently working with Cassandra (NoSQL) and AWS compute and analytics services. Fun fact: My team won first place at the 2020 Virginia Governor's Datathon I also organize the Richmond (VA) Data Engineering Meetup Sam Portillo Data Engineer
  • 3. Agenda. 1. Introduction 2. Architecture and Design 3. Contrast with Similar Services 4. Lessons Learned 5. Q&A
  • 4. What is a cluster?
  • 5. Background on Big Data Computing. ❖ Pre-1994 - mainframe era ❖ 1994 to 2010 - cluster era ➢ Identical machines connected to the same network to run workloads ➢ Price varies from 5-6 figures for hardware alone ❖ 2000s to present - GPU era ➢ GPU processing accelerates specific types of problems (ML training, gaming graphics, etc.) ➢ Hardware can also vary from 5-6 figures ❖ 2010s to present - Cloud era ➢ Introduces pay-as-you-go IaaS and PaaS, so firms can run Big Data applications without buying/managing hardware
  • 6. Benefits of IaaS. ❖ No longer need to buy/manage hardware ❖ “Wait minutes not months” ❖ Pay-as-you-go beats financing hardware in most cases ❖ Scale without physical hardware limits ❖ User needs to manage OS, middleware, applications
  • 7. Benefits of PaaS. ❖ Cloud provider manages OS, database, development tools ❖ Can be cheaper than IaaS ❖ Faster development time because software comes preconfigured ❖ User needs to manage applications, data, etc.
  • 8. IaaS, PaaS, SaaS Overview.
  • 9. Managed Cluster Services. ❖ PaaS offering that allows cloud customers to easily run Big Data workloads ❖ Support for open source tools like Spark and Hadoop ❖ Examples are ➢ AWS Elastic MapReduce ➢ Azure HDInsight ➢ Google Cloud Dataproc
  • 10. Core Benefits of Managed Cluster Services. ❖ Cost savings associated with PaaS ❖ Ease of use for developers ❖ Integration with other cloud services ❖ Elastic ➢ Scales up and down as needed ❖ Flexibility ➢ Users can customize environments to solve a variety of problems
  • 11. Use Cases. ❖ ETL Pipelines ➢ Build reliable data pipelines with Spark ❖ Big data migration ➢ Utilize the power of a cluster to transfer data in a distributed way ❖ Interactive analytics ➢ Quickly ingest terabytes of data to do initial analysis ❖ Machine learning ➢ Train ML models with TensorFlow or Spark MLlib
  • 12. Takeaway. ❖ Managed cluster services are versatile and can be used for different workloads ❖ They offer the benefits that cloud services are typically known for ❖ A good understanding of architecture principles will help an engineer support a variety of projects
  • 14. Architecture in AWS EMR. ❖ Each node is an EC2 instance ❖ Master node ➢ Responsible for coordinating applications among the cluster ➢ Runs driver component for Spark apps ➢ SSH access ❖ Core nodes ➢ Do heavy lifting for applications ➢ Store data for Hadoop Distributed File System
  • 15. How distributed computing speeds up workloads. ❖ Workloads get distributed among a cluster ❖ Hadoop MapReduce ❖ Heavily relies on disk; stores data back to HDFS after each operation
  • 16. How distributed computing speeds up workloads. ❖ Apache Spark uses a directed acyclic graph (DAG) to plan workloads ❖ Advantages over MapReduce: ➢ Utilizes RAM and caches datasets between map-reduce jobs ➢ Doesn’t need to write to disk between map-reduce jobs ➢ Rich API that allows for dataset transformations ➢ Potentially 100 times faster than Hadoop MapReduce.
  • 18. Customizing compute environments. ❖ Support for ➢ Logging ➢ SSH connection ➢ Bootstrap scripts to install custom software/packages ➢ Custom AMIs to achieve anything a bootstrap script can’t
  • 21. Services to address. ❖ Databricks ❖ ETL tools ❖ Batch services
  • 22. Databricks. ❖ Databricks is a managed Spark service ➢ Databricks manages compute ➢ Workflow automation and data pipelines ➢ Integrated workspace ❖ A managed cluster service just runs Spark jobs
  • 23. ETL Tools. ❖ Can have similar ETL functions but managed cluster services offer more than ETL ❖ ETL on ETL tools is generally much easier to configure, but less flexible ❖ AWS Glue works on top of a Spark environment
  • 24. Batch Services. ❖ Batch services are meant to run any batch computing job at any scale. ❖ They may be able to do some ETL use cases ❖ Not a cluster service, don’t run Spark or Hadoop ❖ Not an environment for interactive analytics
  • 25. Takeaway. ❖ Decision to use a managed cluster service heavily depends on the workload
  • 27. Things to keep in mind. 1. Tuning can be difficult 2. Adopt best practices early 3. Assess what parts of a workload warrant a cluster service 4. Know your data 5. Explore different options before committing to a solution
  • 28. Connect with me :D ❖ Sam Portillo ❖ https://www.linkedin.com/in/portillosc ❖ Email: sportillo@ippon.fr
  • 29. Q&A
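
Slides 14 and 18 describe an EMR cluster as one master node plus core nodes, with support for logging, SSH access, and bootstrap scripts. A minimal sketch of how those pieces could map onto a cluster request is shown here. It only builds a plain dictionary using field names from the EMR `RunJobFlow` API; the release label, instance types, key name, role names, and S3 paths are illustrative assumptions, not values from the talk.

```python
def build_cluster_request(name, core_node_count, bootstrap_script_uri, log_uri, key_name):
    """Build a request dict shaped like the EMR RunJobFlow API input.

    All concrete values (release label, instance types, role names) are
    placeholders chosen for illustration.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.2.0",  # assumed release; pick the one you need
        "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
        "LogUri": log_uri,  # logging support
        "Instances": {
            "InstanceGroups": [
                # Master node: coordinates applications, runs the Spark driver
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Core nodes: do the heavy lifting and store HDFS data
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_node_count},
            ],
            "Ec2KeyName": key_name,  # key pair used for SSH access
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            # Bootstrap script that installs custom software/packages
            {"Name": "install-packages",
             "ScriptBootstrapAction": {"Path": bootstrap_script_uri, "Args": []}},
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical bucket and key names, for illustration only.
request = build_cluster_request(
    name="demo-cluster",
    core_node_count=2,
    bootstrap_script_uri="s3://example-bucket/bootstrap/install.sh",
    log_uri="s3://example-bucket/logs/",
    key_name="example-key",
)
```

With AWS credentials configured, a dictionary like this could be passed to boto3 as `boto3.client("emr").run_job_flow(**request)`; the sketch stops short of that call so it runs without an AWS account.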

Editor's Notes

  1. Introduce myself and stuff. Talk about rvade maybe, or the cats. Been with Ippon for about a year. Things I like. I’m on the qomplx project and use a lot of AWS services and Cassandra. I like Python.
  2. Skim over this but focus on the about me Offer: Ippon helps companies delivering Data initiative from massive workloads (FastData) to massive storage (BigData). Services: Architecture consulting from scoping DataEngineering industrialisation Project delivery from POC US Reference: Ippon is delivering the full Data capability of SwissRe in the US from realtime analysis to legal longterm storage in a secured Cloud.
  3. Read off slides Small demo with architecture and design
  4. Pre-1994: supercomputer/mainframe; a single big computer. 1994-2010: Linux got popular, and this started with the Beowulf project. Folks started buying commodity hardware with two network cards and distributing workloads. Present: there’s still benefit in on-prem hardware, which is why the GPU era is still relevant. I know CarMax has on-prem resources for their machine learning workloads. Makes sense when you do the quick math between running a workload in the cloud vs. buying the hardware (not the same as managing an entire data center). Acknowledge the overlap.
  5. Examples of these are Amazon EC2, Azure VMs, and GCP Compute Engine.
  6. AWS Elastic Beanstalk is a PaaS orchestration service offered by Amazon Web Services for deploying applications; it orchestrates various AWS services.
  7. Here’s an overview of IaaS, PaaS, and SaaS. I see new as-a-service acronyms each year, but most cloud services fall into these categories. Most cloud certification courses ask questions about these. A SaaS example would be the Microsoft Office 365 suite.
  8. Tell audience we’re focusing on EMR because I’ve used it on a couple of projects within the DP. Not a sales pitch for EMR; I’ve just used it too much.
  9. Ease of use: easy for devs to get started with; learning curve is pretty low
  10. Use “emerging problems”. Any questions so far? At its core, EMR is a platform as a service that offers on-demand distributed computing clusters. This service comes with benefits such as scalability and cost savings that AWS services are typically known for. AWS also manages the installation of a variety of popular distributed computing and data frameworks like Spark, Hadoop, TensorFlow, and many more. This makes EMR especially versatile, as engineers can work on almost any type of problem without too much trouble switching contexts. With EMR being a cornerstone for many workloads, it’s an incredibly helpful tool to keep in the toolbox. A strong understanding of its architecture principles, coupled with its variety of popular frameworks, makes it a good choice for emerging problems.
  11. Ignore the “mapr node” thing. I stole this image
  12. Most services are phasing out Hadoop verbiage on their product pages, and Apache retired some Hadoop-related projects recently. Explanation of graphic: say we want to count the occurrences of each animal in this input dataset. In the Map phase, each line gets parsed and all of the animal occurrences are extracted. The output of each step of the Map phase is just a list of each animal and a 1 associated with it. Now that we have all of the animal occurrences by line, we want a count of occurrences across the dataset for each animal. In the Reduce phase, we combine all of the same animals from different mappers and "reduce" them to a single animal and a count. Any questions?
  13. Introducing Spark
  14. Go straight to the console
  15. There are a lot of similar services out there. I’d like to clear up some differences. Bottom line will be that the service you choose depends on your workload.
  16. Spark tuning is really hard; figure out what needs to be done correctly, otherwise you may end up just pulling a bunch of different levers. Whenever working with a new service or framework, get to know the best practices; you do not want to be three weeks into something when you realize you’re following an anti-pattern. There are a lot of components in an ETL pipeline, but just because one part needs to be done in Spark doesn’t mean the whole thing does. Know your data: omg, self-explanatory. Exhaustively study it to make sure nothing will come in to break your pipeline/business logic. Found hundred-line SQL statements in breach data. Don’t fall into the hype of the latest and greatest service. Extensively research your use case. There are a lot of ways to get something done; make sure you pick the right tool for the job.
  17. If you like to talk about data/tinker with tools/technology/services OR like to throw it down on the dance floor, connect with me, let’s be friends. I also welcome philosophical debate, whether it’s about material I presented, thoughts on architecture paradigms/strategies, or the future of AWS/data.
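
The animal-count walkthrough in note 12 can be sketched in plain Python. This is a single-process illustration of the Map, shuffle, and Reduce steps, not Hadoop code; the function names and the sample input are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit an (animal, 1) pair for every animal on one input line."""
    return [(animal, 1) for animal in line.split()]


def shuffle(pairs):
    """Group the emitted 1s by animal, as the framework does between phases."""
    grouped = defaultdict(list)
    for animal, one in pairs:
        grouped[animal].append(one)
    return grouped


def reduce_phase(animal, ones):
    """Reduce: collapse all pairs for one animal into a single count."""
    return animal, sum(ones)


def count_animals(lines):
    # On a real cluster each line (or block of lines) would go to a different mapper.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(animal, ones) for animal, ones in shuffle(pairs).items())


# Hypothetical input dataset: one string per line of the file.
sample_lines = ["cat dog bird", "dog dog cat", "bird cat dog"]
animal_counts = count_animals(sample_lines)
print(animal_counts)
```

In Hadoop MapReduce the intermediate pairs are written to disk between phases, which is the overhead slide 15 points to; here everything stays in memory because the sketch runs in one process.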