SlideShare a Scribd company logo
1 of 11
#BDWmeetup @joe_Caserta 
Big Data 
Warehousing: 
September 17, 2014 
Big ETL 
• Traditional Tools 
• Map Reduce 
• Pig 
• Hive 
• Python 
What to use when
Agenda 
7:00 Networking 
Grab some food and drink... Make some friends. 
7:15 Joe Caserta 
President 
Caserta Concepts 
Welcome + Intro 
About the Meetup, about Caserta Concepts 
Overview of evolution and future of ETL 
7:35 Elliott Cordo 
Chief Architect 
Caserta Concepts 
Deeper dive in to Pig, Hive, Spark, etc. 
Demo of Spark! 
8:00 Kyle Hubert 
Principal Data Architect 
Simulmedia 
Hadoop Streaming with Python 
8:45 Q&A, More Networking 
Tell us what you’re up to… 
#BDWmeetup @joe_Caserta
About the BDW Meetup Twitter: #BDWmeetup 
• Big Data is a complex, rapidly changing 
landscape 
• We want to share our stories and hear 
about yours 
• Great networking opportunity for like 
minded data nerds 
• Opportunities to collaborate on exciting 
projects 
• Founded by Caserta Concepts 
• November 10, 2012 – HAPPY ANNIVERSARY!!! 
• Next BDW Meetup: Want to present? 
#BDWmeetup @joe_Caserta 
@CasertaConcepts 
@Simulmedia
About Caserta Concepts 
• Technology services company with expertise in data analysis: 
• Big Data Solutions 
• Data Warehousing 
• Business Intelligence 
• Data Science & Analytics 
• Data on the Cloud 
• Data Interaction & Visualization 
• Core focus in the following industries: 
• eCommerce / Retail / Marketing 
• Financial Services / Insurance 
• Healthcare / Ad Tech / Higher Ed 
• Established in 2001: 
• Increased growth year-over-year 
• Industry recognized work force 
• Strategy, Implementation 
• Writing, Education, Mentoring 
#BDWmeetup @joe_Caserta
Help Wanted 
Does this word cloud excite you? 
Cassandra 
Speak with us about our open positions: leslie@casertaconcepts.com 
#BDWmeetup @joe_Caserta 
Storm 
Big Data Architect Hbase
The Evolution of the Enterprise Data Hub POC 
Enrollments 
Claims 
Finance 
ETL 
NoSQL 
Databases 
Traditional 
EDW 
ETL 
Enterprise Data Hub 
Spark MapReduce Pig/Hive 
N1 N2 N3 N4 N5 
Hadoop Distributed File System (HDFS) 
Horizontally Scalable Environment - Optimized for Analytics 
Others… 
ETL 
#BDWmeetup @joe_Caserta
ETL for the (Big Data) Enterprise Data Hub 
• Convergence of 
• Data quality 
• Data Management and policies 
• All data in an organization 
• Set of processes 
• Ensure data assets are formally managed 
throughout the enterprise. 
• Ensure data can be trusted 
• EDH - Backbone of business 
• Production environment 
• Agile 
#BDWmeetup @joe_Caserta
Components of a Mature Enterprise Data Hub 
• Add Big Data to overall framework and assign responsibility 
• Add data scientists to the Stewardship program 
• Assign stewards to new data sets (twitter, call center logs, etc.) 
• This is the ‘people’ part. Establishing Enterprise Data Council, 
Data Stewards, etc. Organization 
• Larger scale 
• New datatypes 
• Integrate with Hive Metastore, HCatalog, home grown tables 
•Definitions, lineage (where does this data come from), 
business definitions, technical metadata Metadata 
Privacy/Security •Identify • Data detection and control and masking sensitive on unstructured data, regulatory data upon compliance 
ingest 
• Data Quality and Monitoring (probably home grown, drools?) 
• Quality checks not only SQL: machine learning, Pig and Map Reduce 
• Acting on large dataset quality checks may require distribution 
•Data must be complete and correct. Measure, improve, 
certify 
Data Quality and 
Monitoring 
Business Process Integration •Policies around data frequency, source availability, etc. 
• Near-zero latency, DevOps, Core component of business operations 
• Graph databases are more flexible than relational 
• Lower latency service required 
• Distributed data quality and matching algorithms 
•Ensure consistent business critical data i.e. Members, 
Providers, Agents, etc. Master Data Management 
• Secure and mask multiple data types (not just tabular) 
• Deletes are more uncommon (unless there is regulatory requirement) 
• Take advantage of compression and archiving (like AWS Glacier) 
•Data retention, purge schedule, storage/archiving 
Information Lifecycle 
Management (ILM) 
#BDWmeetup @joe_Caserta
Enterprise Data Pyramid 
 ETL cleans, conforms, consolidates, enriches each tier. 
 Only top (trusted) tier of the pyramid is fully accessible by the 
masses. 
Big 
Data 
Warehouse 
Fully Data Governed ( trusted) 
ETL 
Data Science Workspace 
Agile business insight through data-munging 
machine learning, blending with external 
data, development of to-be BDW facts 
Metadata  Catalog 
ILM  who has access, how long do we “manage it” 
ETL 
Data Lake – Integrated Sandbox 
Data Quality and Monitoring  Monitoring of 
completeness of data 
Landing Area – Source Data in “Full Fidelity” 
Data is ready to be 
turned into information: 
organized, well defined, 
complete. 
#BDWmeetup @joe_Caserta 
Metadata  Catalog 
ILM  who has access, 
how long do we “manage it” 
Raw machine 
data collection, 
collect 
everything 
Metadata  Catalog 
ILM  who has access, how long to “manage it” 
Data Quality and Monitoring  Monitoring 
of completeness of data 
User community arbitrary queries and 
reporting 
ETL 
ETL
Thank You 
Joe Caserta 
President, Caserta Concepts 
joe@casertaconcepts.com 
(914) 261-3648 
@joe_Caserta 
#BDWmeetup @joe_Caserta
Free Python books, Courtesy of: 
RAFFLE!!! 
#BDWmeetup @joe_Caserta

More Related Content

Viewers also liked

National Patient Safety Foundation 2012 Dashboard Demo
National Patient Safety Foundation 2012 Dashboard DemoNational Patient Safety Foundation 2012 Dashboard Demo
National Patient Safety Foundation 2012 Dashboard DemoEdgewater
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Designing High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseDesigning High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseMarcel Franke
 
Customer Centricity Score
Customer Centricity ScoreCustomer Centricity Score
Customer Centricity ScoreJan-Erik Baars
 
Veri Ambarları için Oracle'ın Analitik SQL Desteği
Veri Ambarları için Oracle'ın Analitik SQL DesteğiVeri Ambarları için Oracle'ın Analitik SQL Desteği
Veri Ambarları için Oracle'ın Analitik SQL DesteğiEmrah METE
 
Oracle PL/SQL Best Practices
Oracle PL/SQL Best PracticesOracle PL/SQL Best Practices
Oracle PL/SQL Best PracticesEmrah METE
 
DGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data QualityDGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data QualityCaserta
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on HadoopCaserta
 
Open Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETLOpen Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETLJonathan Levin
 
Agile data warehouse
Agile data warehouseAgile data warehouse
Agile data warehouseDao Vo
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Informatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonInformatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonRoberto Espinosa
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profilingShailja Khurana
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architectureanicewick
 

Viewers also liked (17)

National Patient Safety Foundation 2012 Dashboard Demo
National Patient Safety Foundation 2012 Dashboard DemoNational Patient Safety Foundation 2012 Dashboard Demo
National Patient Safety Foundation 2012 Dashboard Demo
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Designing High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseDesigning High Performance ETL for Data Warehouse
Designing High Performance ETL for Data Warehouse
 
Customer Centricity Score
Customer Centricity ScoreCustomer Centricity Score
Customer Centricity Score
 
Veri Ambarları için Oracle'ın Analitik SQL Desteği
Veri Ambarları için Oracle'ın Analitik SQL DesteğiVeri Ambarları için Oracle'ın Analitik SQL Desteği
Veri Ambarları için Oracle'ın Analitik SQL Desteği
 
Oracle PL/SQL Best Practices
Oracle PL/SQL Best PracticesOracle PL/SQL Best Practices
Oracle PL/SQL Best Practices
 
DGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data QualityDGIQ 2015 The Fundamentals of Data Quality
DGIQ 2015 The Fundamentals of Data Quality
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
Open Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETLOpen Source ETL vs Commercial ETL
Open Source ETL vs Commercial ETL
 
Agile data warehouse
Agile data warehouseAgile data warehouse
Agile data warehouse
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Informatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools ComparisonInformatica Pentaho Etl Tools Comparison
Informatica Pentaho Etl Tools Comparison
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Data quality and data profiling
Data quality and data profilingData quality and data profiling
Data quality and data profiling
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architecture
 

More from Caserta

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingCaserta
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Caserta
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Caserta
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017Caserta
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Caserta
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteCaserta
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for EveryoneCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data LakeCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsCaserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It? Caserta
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data LakeCaserta
 

More from Caserta (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Big Data Warehousing Meetup: BigETL: Trad Tool vs Pig vs Hive vs Python. What to use when. (Slide Set #1)

  • 1. #BDWmeetup @joe_Caserta Big Data Warehousing: September 17, 2014 Big ETL • Traditional Tools • Map Reduce • Pig • Hive • Python What to use when
  • 2. Agenda 7:00 Networking Grab some food and drink... Make some friends. 7:15 Joe Caserta President Caserta Concepts Welcome + Intro About the Meetup, about Caserta Concepts Overview of evolution and future of ETL 7:35 Elliott Cordo Chief Architect Caserta Concepts Deeper dive in to Pig, Hive, Spark, etc. Demo of Spark! 8:00 Kyle Hubert Principal Data Architect Simulmedia Hadoop Streaming with Python 8:45 Q&A, More Networking Tell us what you’re up to… #BDWmeetup @joe_Caserta
  • 3. About the BDW Meetup Twitter: #BDWmeetup • Big Data is a complex, rapidly changing landscape • We want to share our stories and hear about yours • Great networking opportunity for like minded data nerds • Opportunities to collaborate on exciting projects • Founded by Caserta Concepts • November 10, 2012 – HAPPY ANNIVERSARY!!! • Next BDW Meetup: Want to present? #BDWmeetup @joe_Caserta @CasertaConcepts @Simulmedia
  • 4. About Caserta Concepts • Technology services company with expertise in data analysis: • Big Data Solutions • Data Warehousing • Business Intelligence • Data Science & Analytics • Data on the Cloud • Data Interaction & Visualization • Core focus in the following industries: • eCommerce / Retail / Marketing • Financial Services / Insurance • Healthcare / Ad Tech / Higher Ed • Established in 2001: • Increased growth year-over-year • Industry recognized work force • Strategy, Implementation • Writing, Education, Mentoring #BDWmeetup @joe_Caserta
  • 5. Help Wanted Does this word cloud excite you? Cassandra Speak with us about our open positions: leslie@casertaconcepts.com #BDWmeetup @joe_Caserta Storm Big Data Architect Hbase
  • 6. The Evolution of the Enterprise Data Hub POC Enrollments Claims Finance ETL NoSQL Databases Traditional EDW ETL Enterprise Data Hub Spark MapReduce Pig/Hive N1 N2 N3 N4 N5 Hadoop Distributed File System (HDFS) Horizontally Scalable Environment - Optimized for Analytics Others… ETL #BDWmeetup @joe_Caserta
  • 7. ETL for the (Big Data) Enterprise Data Hub • Convergence of • Data quality • Data Management and policies • All data in an organization • Set of processes • Ensure data assets are formally managed throughout the enterprise. • Ensure data can be trusted • EDH - Backbone of business • Production environment • Agile #BDWmeetup @joe_Caserta
  • 8. Components of a Mature Enterprise Data Hub • Add Big Data to overall framework and assign responsibility • Add data scientists to the Stewardship program • Assign stewards to new data sets (twitter, call center logs, etc.) • This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. Organization • Larger scale • New datatypes • Integrate with Hive Metastore, HCatalog, home grown tables •Definitions, lineage (where does this data come from), business definitions, technical metadata Metadata Privacy/Security •Identify • Data detection and control and masking sensitive on unstructured data, regulatory data upon compliance ingest • Data Quality and Monitoring (probably home grown, drools?) • Quality checks not only SQL: machine learning, Pig and Map Reduce • Acting on large dataset quality checks may require distribution •Data must be complete and correct. Measure, improve, certify Data Quality and Monitoring Business Process Integration •Policies around data frequency, source availability, etc. • Near-zero latency, DevOps, Core component of business operations • Graph databases are more flexible than relational • Lower latency service required • Distributed data quality and matching algorithms •Ensure consistent business critical data i.e. Members, Providers, Agents, etc. Master Data Management • Secure and mask multiple data types (not just tabular) • Deletes are more uncommon (unless there is regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier) •Data retention, purge schedule, storage/archiving Information Lifecycle Management (ILM) #BDWmeetup @joe_Caserta
  • 9. Enterprise Data Pyramid  ETL cleans, conforms, consolidates, enriches each tier.  Only top (trusted) tier of the pyramid is fully accessible by the masses. Big Data Warehouse Fully Data Governed ( trusted) ETL Data Science Workspace Agile business insight through data-munging machine learning, blending with external data, development of to-be BDW facts Metadata  Catalog ILM  who has access, how long do we “manage it” ETL Data Lake – Integrated Sandbox Data Quality and Monitoring  Monitoring of completeness of data Landing Area – Source Data in “Full Fidelity” Data is ready to be turned into information: organized, well defined, complete. #BDWmeetup @joe_Caserta Metadata  Catalog ILM  who has access, how long do we “manage it” Raw machine data collection, collect everything Metadata  Catalog ILM  who has access, how long to “manage it” Data Quality and Monitoring  Monitoring of completeness of data User community arbitrary queries and reporting ETL ETL
  • 10. Thank You Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta #BDWmeetup @joe_Caserta
  • 11. Free Python books, Courtesy of: RAFFLE!!! #BDWmeetup @joe_Caserta

Editor's Notes

  1. Robotman was actually the first cyborg superhero. Robert Crane was fatally shot and had his brain placed in a super strong robot body. The cybernetic Robotman lived on, using a rubber mask and flesh-like body suit to disguise himself as Paul Dennis. The new hero used his cyborg might to smash crime during DC’s Golden Age. First Appearance: Star Spangled Comics #7 (1942)