Large Scale ETL with Hadoop
    Eric Sammer | Principal Solution Architect
    @esammer
    Strata + Hadoop World 2012




ETL is like “REST” or “Disaster Recovery”
       Everyone defines it differently (and loves to fight
       about it)
       It’s more of a problem/solution space than a thing
       Hard to generalize without being lossy in some
       way
       Worst of all, it’s trivial at face value but complicated in
       practice

So why is ETL hard?
       It’s not because ƒ(A) → B is hard (anymore)
       Data integration
       Organization and management
       Process orchestration and scheduling
       Accessibility
       How it all fits together


Hadoop is two components
      HDFS – Massive, redundant data storage
      MapReduce – Batch-oriented data processing at
      scale




The ecosystem brings additional functionality
      Higher level languages and abstractions on
      MapReduce
          Hive, Pig, Cascading, ...
      File, relational, and streaming data integration
          Flume, Sqoop, WebHDFS, ...
      Process orchestration and scheduling
          Oozie, Azkaban, ...
      Libraries for parsing and text extraction
          Tika, ?, ...
      ...and now low latency query with Impala


To truly scale ETL, separate infrastructure from
     processes, and make it a macro-level service
     (composed of other services).




The services of ETL
       Process Repository
       Metadata Repository
       Scheduling
       Process Orchestration
       Integration Adapters or Channels
       Service and Process Instrumentation and
       Collection

What do we have today?
       HDFS and MapReduce – The core
       Flume – Streaming event data integration
       Sqoop – Batch exchange of relational database
       tables
       Oozie – Process orchestration and basic
       scheduling
       Impala – Fast analysis of data quality

MapReduce is the assembly language of data
     processing
        “Simple things are hard, but hard things are
        possible”
        Comparatively low level
        Java knowledge required
        Use higher level tools where possible
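
To make the "assembly language" point concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps in Python. It is not Hadoop's Java API; the names `map_reduce`, `tokenize`, and `count` are hypothetical and exist only for illustration. Even a word count needs a mapper, a grouping step, and a reducer, which is the kind of scaffolding higher level tools hide.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run mapper over every record, group values by key, then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    return {key: reducer(key, values) for key, values in groups.items()}


def tokenize(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for each whitespace-separated word."""
    for word in line.split():
        yield word.lower(), 1


def count(_key: str, values: List[int]) -> int:
    """Reducer: sum the ones emitted for a word."""
    return sum(values)


counts = map_reduce(["ETL is hard", "etl is everywhere"], tokenize, count)
```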


Data organization in HDFS
        Standard file system tricks to make operations
        atomic
        Use a well-defined structure that supports tooling
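
One common reading of "standard file system tricks" is write-then-rename: write to a temporary name, then rename into the final location so readers never see a partial file. The sketch below shows the pattern in Python against a local filesystem as a stand-in. The function name `publish_atomically` and the `._tmp-` prefix are assumptions for illustration; on HDFS the same idea would go through the HDFS client's rename rather than `os.replace`.

```python
import os
import tempfile


def publish_atomically(final_path: str, payload: bytes) -> None:
    """Write payload to a temporary file, then rename it into place."""
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)
    # Keep the temp file in the target directory so the rename stays
    # within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="._tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        # A single rename: readers see either the old file or the new one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Jobs that scan a dataset directory can then skip names starting with the temporary prefix, which is one reason a well-defined layout helps tooling.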




Data organization in HDFS – Hierarchy
       /intent
          /category
             /application (optional)
                /dataset
                    /partitions
                       /files

       Examples:
       /data/fraud/txs/2012-01-01/20120101-00.avro
       /data/fraud/txs/2012-01-01/20120101-01.avro
       /group/research/model-17/training-txs/part-00000.avro
       /group/research/model-17/training-txs/part-00001.avro
       /user/esammer/scratch/foo/
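
A small helper can keep every job on the same layout. This is a sketch in Python, assuming the examples above map as intent/category/[application]/dataset/partition; `dataset_path` is a hypothetical name, not part of any Hadoop library.

```python
from typing import Optional


def dataset_path(
    intent: str,
    category: str,
    dataset: str,
    partition: Optional[str] = None,
    application: Optional[str] = None,
) -> str:
    """Build /intent/category/[application]/dataset/[partition]."""
    parts = [intent, category]
    if application:
        parts.append(application)  # optional level in the hierarchy
    parts.append(dataset)
    if partition:
        parts.append(partition)
    for part in parts:
        # Reject components that would silently add or remove a level.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "/" + "/".join(parts)
```

For example, `dataset_path("data", "fraud", "txs", "2012-01-01")` gives the directory that holds the first two example files.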



A view of data integration




[Diagram: streaming and relational data flowing into HDFS]

        Event
            headers: {
              app:  1234,
              type: 321,
              ts:   <epoch>
            },
            body: <bytes>

        Streaming Data (Flume Agent)
            Sources: Syslog Events, Application Events,
                     Clickstream Events, Point of Sale Events
            Flume (Channel 1) -> /data/ops/syslog/2012-01-01/
            Flume (Channel 2) -> /data/web/core/2012-01-01/
                                 /data/web/retail/2012-01-01/
            Flume (Channel 3) -> /data/pos/US/NY/17/2012-01-01/
                                 /data/pos/US/CA/42/2012-01-01/

        Relational Data (Sqoop)
            Web App Database -> Sqoop (Job 1) -> /data/wdb/<database>/<table>/
            EDW              -> Sqoop (Job 2) -> /data/edw/<database>/<table>/
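
The event shape and the date-partitioned directories in the diagram can be tied together with a short Python sketch. It assumes the `ts` header holds epoch seconds in UTC, which the slide does not state; `Event` and `partition_dir` are hypothetical names for illustration and are not Flume's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass
class Event:
    """An event as drawn on the slide: string headers plus an opaque body."""
    headers: Dict[str, str]
    body: bytes


def partition_dir(base: str, event: Event) -> str:
    """Pick the daily partition directory for an event from its ts header."""
    ts = int(event.headers["ts"])  # assumption: epoch seconds, UTC
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{base}/{day}/"


example = Event(headers={"app": "1234", "type": "321", "ts": "1325376000"}, body=b"...")
target = partition_dir("/data/ops/syslog", example)
```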




Structure data in tiers
        A clear hierarchy of source/derived relationships
        One step on the road to proper lineage
        Simple “fault and rebuild” processes
        Examples
           Tier 0 – Raw data from source systems
           Tier 1 – Derived from 0, cleansed, normalized
           Tier 2 – Derived from 1, aggregated
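
A "fault and rebuild" process can be sketched as a walk over recorded source/derived relationships. The Python below is a minimal illustration, assuming lineage is kept as a simple mapping from each derived dataset to its sources; the dataset names in `LINEAGE` and the function `rebuild_plan` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical lineage: each derived dataset maps to the datasets it is built from.
LINEAGE: Dict[str, List[str]] = {
    "tier1/events_clean": ["tier0/web_events"],
    "tier1/sessions": ["tier0/web_events"],
    "tier2/sessions_by_day": ["tier1/sessions"],
    "tier2/events_by_hour": ["tier1/events_clean"],
}


def rebuild_plan(faulted: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return the faulted dataset and everything derived from it, in build order."""
    # Invert the mapping: source -> datasets derived from it.
    children: Dict[str, List[str]] = {}
    for derived, sources in lineage.items():
        for source in sources:
            children.setdefault(source, []).append(derived)

    plan: List[str] = []
    seen: Set[str] = set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for child in sorted(children.get(name, [])):
            visit(child)
        plan.append(name)  # post-order: a dataset is added after its dependents

    visit(faulted)
    plan.reverse()  # reversed post-order puts sources before what depends on them
    return plan
```

With this lineage, a fault in `tier1/sessions` yields a plan that rebuilds it first and `tier2/sessions_by_day` second, leaving unrelated datasets alone.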


[Diagram: Tier 0 data feeding processes that produce Tier 1 and export data]

HDFS (Tier 0)
     /data/ops/syslog/2012-01-01/
     /data/web/core/2012-01-01/
     /data/web/retail/2012-01-01/
     /data/pos/US/NY/17/2012-01-01/
     /data/pos/US/CA/42/2012-01-01/
     /data/wdb/<database>/<table>/
     /data/edw/<database>/<table>/

Processes
     Sessionization
     Event Report Aggregation
     Inventory Reconciliation

HDFS (Tier 1)
     /data/reporting/sessions-day/YYYY-MM-DD/
     /data/reporting/events-day/YYYY-MM-DD/
     /data/reporting/events-hour/YYYY-MM-DD/

HDFS (For export)
     /export/edw/inventory/item-diff/<ts>/




There’s a lot to do
       Build libraries or services to reveal higher-level
       interfaces
       Data management and lifecycle events
       Instrument jobs and services for performance/
       quality
       Metadata, metadata, metadata (metadata)
       Process (job) deployment, service location,

To the contributors, potential and current
        We have work to do
        Still way too much scaffolding work




I’m out of time (for now)
        Join me for office hours – 1:40 - 2:20 in
        Rhinelander
        I’m signing copies of Hadoop Operations tonight




24
25

More Related Content

What's hot

Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecasesudhakara st
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big DataDataWorks Summit
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1Adam Muise
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldLester Martin
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache HiveMurtaza Doctor
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 

What's hot (20)

Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Functional Programming and Big Data
Functional Programming and Big DataFunctional Programming and Big Data
Functional Programming and Big Data
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 
2014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part12014 sept 26_thug_lambda_part1
2014 sept 26_thug_lambda_part1
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Advanced Analytics using Apache Hive
Advanced Analytics using Apache HiveAdvanced Analytics using Apache Hive
Advanced Analytics using Apache Hive
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 

Similar to Large scale ETL with Hadoop

Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitDataWorks Summit
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalMichael Rainey
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLAdam Muise
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsbarbie0909
 
Pig - Analyzing data sets
Pig - Analyzing data setsPig - Analyzing data sets
Pig - Analyzing data setsCreditas
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Jeff Magnusson
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseCloudera, Inc.
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BIPrasad Prabhu (PP)
 

Similar to Large scale ETL with Hadoop (20)

Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Big data overview by Edgars
Big data overview by EdgarsBig data overview by Edgars
Big data overview by Edgars
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Handling not so big data
Handling not so big dataHandling not so big data
Handling not so big data
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Pig - Analyzing data sets
Pig - Analyzing data setsPig - Analyzing data sets
Pig - Analyzing data sets
 
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Big data
Big dataBig data
Big data
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data - HDInsight and Power BI
Big Data - HDInsight and Power BIBig Data - HDInsight and Power BI
Big Data - HDInsight and Power BI
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 

More from OReillyStrata

Dealing with Uncertainty: What the reverend Bayes can teach us.
Dealing with Uncertainty: What the reverend Bayes can teach us.Dealing with Uncertainty: What the reverend Bayes can teach us.
Dealing with Uncertainty: What the reverend Bayes can teach us.OReillyStrata
 
SapientNitro Strata_presentation_upload
SapientNitro Strata_presentation_uploadSapientNitro Strata_presentation_upload
SapientNitro Strata_presentation_uploadOReillyStrata
 
Digital analytics & privacy: it's not the end of the world
Digital analytics & privacy: it's not the end of the worldDigital analytics & privacy: it's not the end of the world
Digital analytics & privacy: it's not the end of the worldOReillyStrata
 
Giving Organisations new capabilities to ask the right business questions 1.7
Giving Organisations new capabilities to ask the right business questions 1.7Giving Organisations new capabilities to ask the right business questions 1.7
Giving Organisations new capabilities to ask the right business questions 1.7OReillyStrata
 
Data as an Art Material. Case study: The Open Data Institute
Data as an Art Material. Case study: The Open Data InstituteData as an Art Material. Case study: The Open Data Institute
Data as an Art Material. Case study: The Open Data InstituteOReillyStrata
 
Giving Organisations new Capabilities to ask the Right Business Questions
Giving Organisations new Capabilities to ask the Right Business QuestionsGiving Organisations new Capabilities to ask the Right Business Questions
Giving Organisations new Capabilities to ask the Right Business QuestionsOReillyStrata
 
Big Data for Big Power: How smart is the grid if the infrastructure is stupid?
Big Data for Big Power:  How smart is the grid if the infrastructure is stupid?Big Data for Big Power:  How smart is the grid if the infrastructure is stupid?
Big Data for Big Power: How smart is the grid if the infrastructure is stupid?OReillyStrata
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow AbstractionOReillyStrata
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesOReillyStrata
 
The Future of Big Data is Relational (or why you can't escape SQL)
The Future of Big Data is Relational (or why you can't escape SQL)The Future of Big Data is Relational (or why you can't escape SQL)
The Future of Big Data is Relational (or why you can't escape SQL)OReillyStrata
 
Visualizing Networks: Beyond the Hairball
Visualizing Networks: Beyond the HairballVisualizing Networks: Beyond the Hairball
Visualizing Networks: Beyond the HairballOReillyStrata
 
Designing Big Data Interactions: The Language of Discovery
Designing Big Data Interactions: The Language of DiscoveryDesigning Big Data Interactions: The Language of Discovery
Designing Big Data Interactions: The Language of DiscoveryOReillyStrata
 
Digital Reasoning_Tim Estes_Strata NYC 2012
Digital Reasoning_Tim Estes_Strata NYC 2012Digital Reasoning_Tim Estes_Strata NYC 2012
Digital Reasoning_Tim Estes_Strata NYC 2012OReillyStrata
 
clearScienceStrataRx2012
clearScienceStrataRx2012clearScienceStrataRx2012
clearScienceStrataRx2012OReillyStrata
 

More from OReillyStrata (14)

Dealing with Uncertainty: What the reverend Bayes can teach us.
Dealing with Uncertainty: What the reverend Bayes can teach us.Dealing with Uncertainty: What the reverend Bayes can teach us.
Dealing with Uncertainty: What the reverend Bayes can teach us.
 
SapientNitro Strata_presentation_upload
SapientNitro Strata_presentation_uploadSapientNitro Strata_presentation_upload
SapientNitro Strata_presentation_upload
 
Digital analytics & privacy: it's not the end of the world
Digital analytics & privacy: it's not the end of the worldDigital analytics & privacy: it's not the end of the world
Digital analytics & privacy: it's not the end of the world
 
Giving Organisations new capabilities to ask the right business questions 1.7
Giving Organisations new capabilities to ask the right business questions 1.7Giving Organisations new capabilities to ask the right business questions 1.7
Giving Organisations new capabilities to ask the right business questions 1.7
 
Data as an Art Material. Case study: The Open Data Institute
Data as an Art Material. Case study: The Open Data InstituteData as an Art Material. Case study: The Open Data Institute
Data as an Art Material. Case study: The Open Data Institute
 
Giving Organisations new Capabilities to ask the Right Business Questions
Giving Organisations new Capabilities to ask the Right Business QuestionsGiving Organisations new Capabilities to ask the Right Business Questions
Giving Organisations new Capabilities to ask the Right Business Questions
 
Big Data for Big Power: How smart is the grid if the infrastructure is stupid?
Big Data for Big Power:  How smart is the grid if the infrastructure is stupid?Big Data for Big Power:  How smart is the grid if the infrastructure is stupid?
Big Data for Big Power: How smart is the grid if the infrastructure is stupid?
 
The Workflow Abstraction
The Workflow AbstractionThe Workflow Abstraction
The Workflow Abstraction
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
The Future of Big Data is Relational (or why you can't escape SQL)
The Future of Big Data is Relational (or why you can't escape SQL)The Future of Big Data is Relational (or why you can't escape SQL)
The Future of Big Data is Relational (or why you can't escape SQL)
 
Visualizing Networks: Beyond the Hairball
Visualizing Networks: Beyond the HairballVisualizing Networks: Beyond the Hairball
Visualizing Networks: Beyond the Hairball
 
Designing Big Data Interactions: The Language of Discovery
Designing Big Data Interactions: The Language of DiscoveryDesigning Big Data Interactions: The Language of Discovery
Designing Big Data Interactions: The Language of Discovery
 
Digital Reasoning_Tim Estes_Strata NYC 2012
Digital Reasoning_Tim Estes_Strata NYC 2012Digital Reasoning_Tim Estes_Strata NYC 2012
Digital Reasoning_Tim Estes_Strata NYC 2012
 
clearScienceStrataRx2012
clearScienceStrataRx2012clearScienceStrataRx2012
clearScienceStrataRx2012
 

Large scale ETL with Hadoop

  • 1. Large Scale ETL with Hadoop. Eric Sammer (@esammer), Principal Solution Architect. Strata + Hadoop World 2012
  • 6. ETL is like “REST” or “Disaster Recovery”
      • Everyone defines it differently (and loves to fight about it)
      • It’s more of a problem/solution space than a thing
      • Hard to generalize without being lossy in some way
      • Worst, it’s trivial at face value, complicated in practice
  • 13. So why is ETL hard?
      • It’s not because ƒ(A) → B is hard (anymore)
      • Data integration
      • Organization and management
      • Process orchestration and scheduling
      • Accessibility
      • How it all fits together
  • 16. Hadoop is two components
      • HDFS – Massive, redundant data storage
      • MapReduce – Batch-oriented data processing at scale
  • 26. The ecosystem brings additional functionality
      • Higher level languages and abstractions on MapReduce: Hive, Pig, Cascading, ...
      • File, relational, and streaming data integration: Flume, Sqoop, WebHDFS, ...
      • Process orchestration and scheduling: Oozie, Azkaban, ...
      • Libraries for parsing and text extraction: Tika, ?, ...
      • ...and now low latency query with Impala
  • 29. To truly scale ETL, separate infrastructure from processes, and make it a macro-level service (composed of other services).
  • 36. The services of ETL
      • Process Repository
      • Metadata Repository
      • Scheduling
      • Process Orchestration
      • Integration Adapters or Channels
      • Service and Process Instrumentation and Collection
  • 42. What do we have today?
      • HDFS and MapReduce – The core
      • Flume – Streaming event data integration
      • Sqoop – Batch exchange of relational database tables
      • Oozie – Process orchestration and basic scheduling
      • Impala – Fast analysis of data quality
  • 47. MapReduce is the assembly language of data processing
      • “Simple things are hard, but hard things are possible”
      • Comparatively low level
      • Java knowledge required
      • Use higher level tools where possible
  • 50. Data organization in HDFS
      • Standard file system tricks to make operations atomic
      • Use a well-defined structure that supports tooling
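Slide 50 does not spell out the "file system tricks". One common reading is write-then-rename: stage the output in a temporary directory, then move it into place with a single rename so readers never see a half-written dataset. The sketch below assumes that reading and uses the local file system through Python's standard library as a stand-in for HDFS; the function name `publish_atomically` and the staging-directory naming are hypothetical, not taken from the talk.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Mapping


def publish_atomically(final_dir: Path, files: Mapping[str, bytes]) -> Path:
    """Write files into a hidden staging directory, then rename it into place."""
    final_dir = Path(final_dir)
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the destination so the rename stays on one file system.
    staging = Path(
        tempfile.mkdtemp(prefix="." + final_dir.name + ".tmp-", dir=final_dir.parent)
    )
    try:
        for name, payload in files.items():
            (staging / name).write_bytes(payload)
        # One rename publishes every file at once.
        os.rename(staging, final_dir)
    except BaseException:
        # Leave no partial staging directory behind on failure.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final_dir
```

On a real cluster the same pattern would likely rely on the HDFS rename operation rather than `os.rename`; that substitution is an assumption and is not shown here.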
  • 51. Data organization in HDFS – Hierarchy
      • /intent
      • /category
      • /application (optional)
      • /dataset
      • /partitions
      • /files
      • Examples:
          • /data/fraud/txs/2012-01-01/20120101-00.avro
          • /data/fraud/txs/2012-01-01/20120101-01.avro
          • /group/research/model-17/training-txs/part-00000.avro
          • /group/research/model-17/training-txs/part-00001.avro
          • /user/esammer/scratch/foo/
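A well-defined hierarchy is easiest to keep consistent when paths are built by one helper rather than by hand. The following is a minimal sketch of such a helper for the intent/category/application/dataset/partitions/files layout on slide 51. The function `dataset_path` and its parameter names are illustrative assumptions; the talk does not prescribe an API.

```python
from typing import Optional, Sequence


def dataset_path(
    intent: str,
    category: str,
    dataset: str,
    partitions: Sequence[str] = (),
    filename: Optional[str] = None,
    application: Optional[str] = None,
) -> str:
    """Build /intent/category[/application]/dataset[/partitions...][/file]."""
    parts = [intent, category]
    if application is not None:
        parts.append(application)  # the optional level in the hierarchy
    parts.append(dataset)
    parts.extend(partitions)
    if filename is not None:
        parts.append(filename)
    for part in parts:
        # Reject empty components and embedded separators so the depth stays fixed.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "/" + "/".join(parts)


print(dataset_path("data", "fraud", "txs", ["2012-01-01"], "20120101-00.avro"))
print(dataset_path("group", "research", "training-txs",
                   filename="part-00000.avro", application="model-17"))
```

The two calls reproduce the first and third example paths from the slide.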
  • 52. A view of data integration
  • 53. [Diagram: a view of data integration]
      • Flume event structure: headers: { app: 1234, type: 321, ts: <epoch> }, body: <bytes>
      • Streaming data sources: Syslog Events, Application Events, Clickstream Events, Point of Sale Events, collected by a Flume Agent with Flume (Channel 1), Flume (Channel 2), and Flume (Channel 3)
      • Relational data sources: Web App Database via Sqoop (Job 1), EDW via Sqoop (Job 2)
      • HDFS destinations: /data/ops/syslog/2012-01-01/, /data/web/core/2012-01-01/, /data/web/retail/2012-01-01/, /data/pos/US/NY/17/2012-01-01/, /data/pos/US/CA/42/2012-01-01/, /data/wdb/<database>/<table>/, /data/edw/<database>/<table>/
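The diagram pairs an event made of string headers plus an opaque byte body with date-partitioned HDFS directories. As a rough sketch of how the two could connect, the code below derives a daily partition directory from an event's `ts` header. It assumes `ts` holds epoch seconds and that partitions are cut in UTC; neither detail is stated on the slide, and `Event` and `partition_dir` are hypothetical names, not Flume APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass
class Event:
    """An event as drawn on the slide: string headers plus an opaque body."""
    headers: Dict[str, str]
    body: bytes


def partition_dir(base: str, event: Event) -> str:
    """Return the daily partition directory for an event under `base`."""
    # Assumption: the `ts` header is epoch seconds, interpreted in UTC.
    ts = int(event.headers["ts"])
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{base.rstrip('/')}/{day}/"


event = Event(headers={"app": "1234", "type": "321", "ts": "1325376000"}, body=b"...")
print(partition_dir("/data/ops/syslog", event))  # /data/ops/syslog/2012-01-01/
```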
  • 61. Structure data in tiers
      • A clear hierarchy of source/derived relationships
      • One step on the road to proper lineage
      • Simple “fault and rebuild” processes
      • Examples
          • Tier 0 – Raw data from source systems
          • Tier 1 – Derived from 0, cleansed, normalized
          • Tier 2 – Derived from 1, aggregated
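The tiering idea can be illustrated with a toy pipeline in which each tier is a pure function of the tier below it, so a damaged tier can be discarded and rebuilt from its source ("fault and rebuild"). This is only a sketch: the `user,action` record format and the names `build_tier1` and `build_tier2` are assumptions made for the example, not part of the talk.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_tier1(tier0_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Tier 1: cleanse and normalize raw 'user,action' lines from tier 0."""
    records = []
    for line in tier0_lines:
        fields = [field.strip().lower() for field in line.split(",")]
        # Drop malformed rows instead of failing the whole build.
        if len(fields) == 2 and all(fields):
            records.append((fields[0], fields[1]))
    return records


def build_tier2(tier1_records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Tier 2: aggregate tier 1 into a count per action."""
    return dict(Counter(action for _, action in tier1_records))


tier0 = ["Alice, VIEW", "bob,buy", "garbage", " alice ,buy ", ",view"]
tier1 = build_tier1(tier0)
tier2 = build_tier2(tier1)
print(tier1)
print(tier2)
```

Because neither function keeps state, rebuilding tier 2 after a fault is just a matter of rerunning `build_tier2(build_tier1(tier0))`.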
  • 62. [Diagram: tiered data flow in HDFS]
      • HDFS (Tier 0): /data/ops/syslog/2012-01-01/, /data/web/core/2012-01-01/, /data/web/retail/2012-01-01/, /data/pos/US/NY/17/2012-01-01/, /data/pos/US/CA/42/2012-01-01/, /data/wdb/<database>/<table>/, /data/edw/<database>/<table>/
      • Processes: Sessionization, Event Report Aggregation, Inventory Reconciliation
      • HDFS (Tier 1): /data/reporting/sessions-day/YYYY-MM-DD/, /data/reporting/events-day/YYYY-MM-DD/, /data/reporting/events-hour/YYYY-MM-DD/
      • HDFS (For export): /export/edw/inventory/item-diff/<ts>/
  • 68. There’s a lot to do
      • Build libraries or services to reveal higher-level interfaces
      • Data management and lifecycle events
      • Instrument jobs and services for performance/quality
      • Metadata, metadata, metadata (metadata)
      • Process (job) deployment, service location,
  • 72. To the contributors, potential and current
      • We have work to do
      • Still way too much scaffolding work
  • 75. I’m out of time (for now)
      • Join me for office hours – 1:40 - 2:20 in Rhinelander
      • I’m signing copies of Hadoop Operations tonight
