Big Data Warehousing Meetup: BigETL: Trad Tool vs Pig vs Hive vs Python. What to use when. (Slide Set #1)
We discussed best practices for choosing the right tool for the Big ETL job, whether it's traditional ETL tools, Map Reduce, Pig, Hive or Python. We explained when, where and how to apply the different steps of the ETL process in the Big Data ecosystem and which tool to use for each, and gave a demo of a real use case.
Presenters included Joe Caserta, President, Caserta Concepts; Elliott Cordo, Chief Architect, Caserta Concepts; and Kyle Hubert, Principal Data Architect, Simulmedia.
The event was hosted by Caserta Concepts and Simulmedia and sponsored by O'Reilly.
For more information on Caserta Concepts, visit our website at http://www.casertaconcepts.com/.
For access to additional slide decks, visit our SlideShare site at http://www.slideshare.net/CasertaConcepts.
1. #BDWmeetup @joe_Caserta
Big Data
Warehousing:
September 17, 2014
Big ETL
• Traditional Tools
• Map Reduce
• Pig
• Hive
• Python
What to use when
2. Agenda
7:00 Networking
Grab some food and drink... Make some friends.
7:15 Joe Caserta
President
Caserta Concepts
Welcome + Intro
About the Meetup, about Caserta Concepts
Overview of evolution and future of ETL
7:35 Elliott Cordo
Chief Architect
Caserta Concepts
Deeper dive into Pig, Hive, Spark, etc.
Demo of Spark!
8:00 Kyle Hubert
Principal Data Architect
Simulmedia
Hadoop Streaming with Python
8:45 Q&A, More Networking
Tell us what you’re up to…
3. About the BDW Meetup Twitter: #BDWmeetup
• Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for
like-minded data nerds
• Opportunities to collaborate on exciting
projects
• Founded by Caserta Concepts
• November 10, 2012 – HAPPY ANNIVERSARY!!!
• Next BDW Meetup: Want to present?
@CasertaConcepts
@Simulmedia
4. About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry-recognized workforce
• Strategy, Implementation
• Writing, Education, Mentoring
5. Help Wanted
Does this word cloud excite you?
[Word cloud: Cassandra, Storm, Big Data Architect, Hbase]
Speak with us about our open positions: leslie@casertaconcepts.com
6. The Evolution of the Enterprise Data Hub POC
Diagram (labels only):
• Sources: Enrollments, Claims, Finance, Others…
• ETL
• Traditional EDW
• NoSQL Databases
• Enterprise Data Hub: Spark, MapReduce, Pig/Hive on Hadoop Distributed File System (HDFS), nodes N1 to N5
• Horizontally Scalable Environment - Optimized for Analytics
7. ETL for the (Big Data) Enterprise Data Hub
• Convergence of
• Data quality
• Data Management and policies
• All data in an organization
• Set of processes
• Ensure data assets are formally managed
throughout the enterprise.
• Ensure data can be trusted
• EDH - Backbone of business
• Production environment
• Agile
8. Components of a Mature Enterprise Data Hub
Organization
• This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (twitter, call center logs, etc.)
Metadata
• Definitions, lineage (where does this data come from), business definitions, technical metadata
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables
Privacy/Security
• Identify and control sensitive data, regulatory compliance
• Data detection and masking on unstructured data upon ingest
Data Quality and Monitoring
• Data must be complete and correct. Measure, improve, certify
• Data Quality and Monitoring (probably home grown, drools?)
• Quality checks not only SQL: machine learning, Pig and Map Reduce
• Acting on large dataset quality checks may require distribution
Business Process Integration
• Policies around data frequency, source availability, etc.
• Near-zero latency, DevOps, Core component of business operations
Master Data Management
• Ensure consistent business critical data i.e. Members, Providers, Agents, etc.
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
Information Lifecycle Management (ILM)
• Data retention, purge schedule, storage/archiving
• Secure and mask multiple data types (not just tabular)
• Deletes are more uncommon (unless there is regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
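Two of the components on this slide, masking sensitive data upon ingest and measuring whether data is complete, can be illustrated with a small Python sketch. This is not from the talk: the regular expressions, the placeholder tokens and the function names `mask_sensitive` and `completeness` are all assumptions made for illustration, and real detection rules would need to be much broader.

```python
import re

# Hypothetical detection patterns, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_sensitive(text):
    """Replace detected e-mail addresses and SSN-like tokens with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return SSN.sub("<SSN>", text)


def completeness(records, required):
    """Return the fraction of records where every required field is non-empty."""
    if not records:
        return 0.0
    complete = sum(
        1 for r in records
        if all(r.get(field) not in (None, "") for field in required)
    )
    return complete / len(records)


if __name__ == "__main__":
    print(mask_sensitive("Call log: jane@example.com, SSN 123-45-6789"))
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": ""}]
    print(completeness(rows, ["id", "name"]))
```

On a large data set the same per-record logic would typically run inside a distributed job (for example as a Hadoop Streaming mapper) rather than in a single process.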
9. Enterprise Data Pyramid
ETL cleans, conforms, consolidates, enriches each tier.
Only top (trusted) tier of the pyramid is fully accessible by the masses.
Pyramid tiers, top to bottom, with ETL between tiers:
• Big Data Warehouse: Fully Data Governed (trusted). Data is ready to be turned into information: organized, well defined, complete. User community arbitrary queries and reporting.
• Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts.
• Data Lake – Integrated Sandbox
• Landing Area – Source Data in “Full Fidelity”: Raw machine data collection, collect everything.
Annotations repeated alongside the tiers:
• Metadata Catalog
• ILM: who has access, how long do we “manage it”
• Data Quality and Monitoring: Monitoring of completeness of data
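The four ETL verbs on this slide (clean, conform, consolidate, enrich) can be sketched in plain Python as one pass that promotes records from a lower tier to the next. This is only an illustrative sketch, not the presenters' implementation: the field names, the `FIELD_MAP` mapping, the region lookup and the function names are all hypothetical.

```python
# Hypothetical mapping from source-specific field names to conformed names.
FIELD_MAP = {"MemberID": "member_id", "amt": "amount", "Amount": "amount"}


def clean(record):
    """Trim whitespace from string values and drop empty fields."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value not in (None, ""):
            out[key] = value
    return out


def conform(record):
    """Rename fields to the shared (conformed) names."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}


def consolidate(records, key):
    """Merge records that share the same key; later values overwrite earlier ones."""
    merged = {}
    for record in records:
        merged.setdefault(record[key], {}).update(record)
    return list(merged.values())


def enrich(record, region_lookup):
    """Attach a reference attribute from a (hypothetical) member-to-region lookup."""
    return {**record, "region": region_lookup.get(record["member_id"], "unknown")}


def promote(records, region_lookup):
    """Run clean, conform, consolidate, enrich in order for one tier transition."""
    staged = [conform(clean(r)) for r in records]
    return [enrich(r, region_lookup) for r in consolidate(staged, "member_id")]


if __name__ == "__main__":
    raw = [
        {"MemberID": " 42 ", "amt": "10.50"},
        {"member_id": "42", "Amount": "", "plan": "gold"},
    ]
    print(promote(raw, {"42": "NE"}))
```

The same steps map naturally onto Pig, Hive or Spark jobs; the single-process version here only shows the order of operations.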
10. Thank You
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta