SlideShare a Scribd company logo
1 of 14
Pig
      Dataflow Scripting for Hadoop


      Alan F. Gates
      @alanfgates




© Hortonworks, Inc 2011
                                      Page 1
Who Am I?

•   Pig committer and PMC Member
•   HCatalog committer and mentor
•   Member of ASF and Incubator PMC
•   Co-founder of Hortonworks
•   Author of Programming Pig from O’Reilly




           Photo credit: Steven Guarnaccia, The Three Little Pigs
Who Are You?




               3
Example


For all of your           Load Users                Load Logs
registered users, you
                                        Semi-join
want to count how
many came to your site
                         Count by zip                Count by
this month. You want                                age, gender
this count both by
geography (zip code)        Store                      Store
                           results                    results
and by demographic
group (age and
gender)
In Pig Latin
-- Load web server logs
logs      = load 'server_logs' using HCatLoader();
thismonth = filter logs by date >= '20110801'
            and date < '20110901';

-- Load users
users     = load 'users' using HCatLoader();

-- Remove   any users that did not visit this month
grpd        = cogroup thismonth by userid, users by userid;
fltrd       = filter grpd by not IsEmpty(logs);
visited     = foreach fltrd generate flatten(users);

-- Count by zip code
grpbyzip = group visited by zip;
cntzip    = foreach grpbyzip generate group, COUNT(visited);
store cntzip into 'by_zip' using HCatStorer('date=201108');

-- Count by demographics
grpbydemo = group visited by (age, gender);
cntdemo   = foreach grpbydemo
             generate flatten(group), COUNT(visited);
store cntdemo into 'by_demo' using HCatStorer('date=201108');
Pig’s Place in the Data World




 Data Collection   Data Factory           Data Warehouse
                   Pig                    Hive

                   Pipelines              BI Tools
                   Iterative Processing   Analysis
                   Research



                    6
Why not MapReduce?

• Pig Provides a number of standard data operators
   – Five different implementations of join (hash, fragment-
     replicate, merge, sparse merged, skewed)
   – Order by provides total ordering across reducers in a balanced
     way
• Provides optimizations that are hard to do by hand
   – Multi-query: Pig will combine certain types of operations
     together in a single pipeline to reduce the number of times data
     is scanned
• User Defined Functions provide a way to inject your code
  into the data transformation
   – can be written in Java or Python
   – can do column transformation (TOUPPER) and aggregation
     (SUM)
   – can be written to take advantage of the combiner
• Control flow can be done via Python or Java

                            7
Embedding Example: Compute Pagerank


PageRank:
A system of linear equations (as many as there
  are pages on the web, yeah, a lot):


It can be approximated iteratively: compute the
   new page rank based on the page ranks of
   the previous iteration. Start with some value.

Ref: http://en.wikipedia.org/wiki/PageRank


                                   Slide courtesy of Julien Le Dem
Or more visually




Each page sends a fraction of its
 PageRank to the pages linked to.
 Inversely proportional to the
 number of links.
              Slide courtesy of Julien Le Dem
Slide courtesy of Julien Le Dem
Let’s zoom in



           pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
                             PR(Tn)/C(Tn))



                                    Iterate 10 times

                                                                 Pass parameters
                                                                  as a dictionary


                                                             Just run P, that was
                                                               declared above
                                              The output
                                           becomes the new
                                                input
      Slide courtesy of Julien Le Dem
Recently Added Features

• New in 0.9 (released July 2011):
  – Embedding in Python
  – Macros and Imports
• New in 0.10 (should release in Dec 2011)
  – Boolean data type
  – Hash based aggregation for aggregates with
    low cardinality keys
  – UDFs to build and apply bloom filters
  – UDFs in JRuby (may slip to next release)

                   14
Learn More

• Read the online documentation:
  http://pig.apache.org/
• Programming Pig from O’Reilly
  Press
• Join the mailing lists:
  – user@pig.apache.org for user
    questions
  – dev@pig.apache.com for developer
    issues
• Follow me on
  Twitter, @alanfgates
Questions




            16

More Related Content

What's hot

apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Practical pig
Practical pigPractical pig
Practical pigtrihug
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...PyData
 
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018Piotr Wikiel
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsHadoop User Group
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍Tae Young Lee
 
Confitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Confitura 2018 — Apache Beam — Promyk Nadziei Data EngineeraConfitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Confitura 2018 — Apache Beam — Promyk Nadziei Data EngineeraPiotr Wikiel
 
Simple ETL in python 3.5+ with Bonobo - PyParis 2017
Simple ETL in python 3.5+ with Bonobo - PyParis 2017Simple ETL in python 3.5+ with Bonobo - PyParis 2017
Simple ETL in python 3.5+ with Bonobo - PyParis 2017Romain Dorgueil
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Alex Levenson
 
GPars in Saga Groovy Study
GPars in Saga Groovy StudyGPars in Saga Groovy Study
GPars in Saga Groovy StudyNaoki Rin
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Ian Huston
 
Nov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonNov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonYahoo Developer Network
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)
An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)
An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)Koji Sekiguchi
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4LKoji Sekiguchi
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of HadoopAsif Ali
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock AnalysisVaibhav Jain
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 

What's hot (20)

apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Practical pig
Practical pigPractical pig
Practical pig
 
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
Massively Parallel Processing with Procedural Python by Ronert Obst PyData Be...
 
最終発表
最終発表最終発表
最終発表
 
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
Apache beam — promyk nadziei data engineera na Toruń JUG 28.03.2018
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
 
20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍20141111 파이썬으로 Hadoop MR프로그래밍
20141111 파이썬으로 Hadoop MR프로그래밍
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Confitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Confitura 2018 — Apache Beam — Promyk Nadziei Data EngineeraConfitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
Confitura 2018 — Apache Beam — Promyk Nadziei Data Engineera
 
Simple ETL in python 3.5+ with Bonobo - PyParis 2017
Simple ETL in python 3.5+ with Bonobo - PyParis 2017Simple ETL in python 3.5+ with Bonobo - PyParis 2017
Simple ETL in python 3.5+ with Bonobo - PyParis 2017
 
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
Hadoop Summit 2015: Performance Optimization at Scale, Lessons Learned at Twi...
 
GPars in Saga Groovy Study
GPars in Saga Groovy StudyGPars in Saga Groovy Study
GPars in Saga Groovy Study
 
Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)Massively Parallel Processing with Procedural Python (PyData London 2014)
Massively Parallel Processing with Procedural Python (PyData London 2014)
 
Nov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonNov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In Python
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)
An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)
An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)
 
An Introduction to NLP4L
An Introduction to NLP4LAn Introduction to NLP4L
An Introduction to NLP4L
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of Hadoop
 
Hadoop - Stock Analysis
Hadoop - Stock AnalysisHadoop - Stock Analysis
Hadoop - Stock Analysis
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 

Similar to TriHUG November Pig Talk by Alan Gates

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysisPramod Toraskar
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Prophet - Beijing Perl Workshop
Prophet - Beijing Perl WorkshopProphet - Beijing Perl Workshop
Prophet - Beijing Perl WorkshopJesse Vincent
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonPyData
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
The devops approach to monitoring, Open Source and Infrastructure as Code Style
The devops approach to monitoring, Open Source and Infrastructure as Code StyleThe devops approach to monitoring, Open Source and Infrastructure as Code Style
The devops approach to monitoring, Open Source and Infrastructure as Code StyleJulien Pivotto
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Mashup University 4: Intro To Mashups
Mashup University 4: Intro To MashupsMashup University 4: Intro To Mashups
Mashup University 4: Intro To MashupsJohn Herren
 
H2O World - What's New in H2O with Cliff Click
H2O World - What's New in H2O with Cliff ClickH2O World - What's New in H2O with Cliff Click
H2O World - What's New in H2O with Cliff ClickSri Ambati
 
Learn How to Run Python on Redshift
Learn How to Run Python on RedshiftLearn How to Run Python on Redshift
Learn How to Run Python on RedshiftChartio
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Yu Liu
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...Dataiku
 

Similar to TriHUG November Pig Talk by Alan Gates (20)

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Prophet - Beijing Perl Workshop
Prophet - Beijing Perl WorkshopProphet - Beijing Perl Workshop
Prophet - Beijing Perl Workshop
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Python ml
Python mlPython ml
Python ml
 
Massively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian HustonMassively Parallel Process with Prodedural Python by Ian Huston
Massively Parallel Process with Prodedural Python by Ian Huston
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
The devops approach to monitoring, Open Source and Infrastructure as Code Style
The devops approach to monitoring, Open Source and Infrastructure as Code StyleThe devops approach to monitoring, Open Source and Infrastructure as Code Style
The devops approach to monitoring, Open Source and Infrastructure as Code Style
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Mashup University 4: Intro To Mashups
Mashup University 4: Intro To MashupsMashup University 4: Intro To Mashups
Mashup University 4: Intro To Mashups
 
H2O World - What's New in H2O with Cliff Click
H2O World - What's New in H2O with Cliff ClickH2O World - What's New in H2O with Cliff Click
H2O World - What's New in H2O with Cliff Click
 
Learn How to Run Python on Redshift
Learn How to Run Python on RedshiftLearn How to Run Python on Redshift
Learn How to Run Python on Redshift
 
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Introduction of the Design of A High-level Language over MapReduce -- The Pig...
Introduction of the Design of A High-level Language over MapReduce -- The Pig...
 
Pyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdfPyhton-1a-Basics.pdf
Pyhton-1a-Basics.pdf
 
Allegograph
AllegographAllegograph
Allegograph
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 

More from trihug

TriHUG October: Apache Ranger
TriHUG October: Apache RangerTriHUG October: Apache Ranger
TriHUG October: Apache Rangertrihug
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Productiontrihug
 
TriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache SentryTriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache Sentrytrihug
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Sharktrihug
 
Impala presentation
Impala presentationImpala presentation
Impala presentationtrihug
 
Financial services trihug
Financial services trihugFinancial services trihug
Financial services trihugtrihug
 
TriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris ShainTriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris Shaintrihug
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gatestrihug
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integrationtrihug
 

More from trihug (10)

TriHUG October: Apache Ranger
TriHUG October: Apache RangerTriHUG October: Apache Ranger
TriHUG October: Apache Ranger
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
TriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in ProductionTriHUG 3/14: HBase in Production
TriHUG 3/14: HBase in Production
 
TriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache SentryTriHUG 2/14: Apache Sentry
TriHUG 2/14: Apache Sentry
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
Impala presentation
Impala presentationImpala presentation
Impala presentation
 
Financial services trihug
Financial services trihugFinancial services trihug
Financial services trihug
 
TriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris ShainTriHUG January 2012 Talk by Chris Shain
TriHUG January 2012 Talk by Chris Shain
 
TriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan GatesTriHUG November HCatalog Talk by Alan Gates
TriHUG November HCatalog Talk by Alan Gates
 
MapR, Implications for Integration
MapR, Implications for IntegrationMapR, Implications for Integration
MapR, Implications for Integration
 

Recently uploaded

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 

Recently uploaded (20)

The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 

TriHUG November Pig Talk by Alan Gates

  • 1. Pig Dataflow Scripting for Hadoop Alan F. Gates @alanfgates © Hortonworks, Inc 2011 Page 1
  • 2. Who Am I? • Pig committer and PMC Member • HCatalog committer and mentor • Member of ASF and Incubator PMC • Co-founder of Hortonworks • Author of Programming Pig from O’Reilly Photo credit: Steven Guarnaccia, The Three Little Pigs
  • 4. Example For all of your Load Users Load Logs registered users, you Semi-join want to count how many came to your site Count by zip Count by this month. You want age, gender this count both by geography (zip code) Store Store results results and by demographic group (age and gender)
  • 5. In Pig Latin -- Load web server logs logs = load 'server_logs' using HCatLoader(); thismonth = filter logs by date >= '20110801' and date < '20110901'; -- Load users users = load 'users' using HCatLoader(); -- Remove any users that did not visit this month grpd = cogroup thismonth by userid, users by userid; fltrd = filter grpd by not IsEmpty(logs); visited = foreach fltrd generate flatten(users); -- Count by zip code grpbyzip = group visited by zip; cntzip = foreach grpbyzip generate group, COUNT(visited); store cntzip into 'by_zip' using HCatStorer('date=201108'); -- Count by demographics grpbydemo = group visited by (age, gender); cntdemo = foreach grpbydemo generate flatten(group), COUNT(visited); store cntdemo into 'by_demo' using HCatStorer('date=201108');
  • 6. Pig’s Place in the Data World Data Collection Data Factory Data Warehouse Pig Hive Pipelines BI Tools Iterative Processing Analysis Research 6
  • 7. Why not MapReduce? • Pig Provides a number of standard data operators – Five different implementations of join (hash, fragment- replicate, merge, sparse merged, skewed) – Order by provides total ordering across reducers in a balanced way • Provides optimizations that are hard to do by hand – Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned • User Defined Functions provide a way to inject your code into the data transformation – can be written in Java or Python – can do column transformation (TOUPPER) and aggregation (SUM) – can be written to take advantage of the combiner • Control flow can be done via Python or Java 7
  • 8. Embedding Example: Compute Pagerank PageRank: A system of linear equations (as many as there are pages on the web, yeah, a lot): It can be approximated iteratively: compute the new page rank based on the page ranks of the previous iteration. Start with some value. Ref: http://en.wikipedia.org/wiki/PageRank Slide courtesy of Julien Le Dem
  • 9. Or more visually Each page sends a fraction of its PageRank to the pages linked to. Inversely proportional to the number of links. Slide courtesy of Julien Le Dem
  • 10. Slide courtesy of Julien Le Dem
  • 11. Let’s zoom in pig script: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Iterate 10 times Pass parameters as a dictionary Just run P, that was declared above The output becomes the new input Slide courtesy of Julien Le Dem
  • 12. Recently Added Features • New in 0.9 (released July 2011): – Embedding in Python – Macros and Imports • New in 0.10 (should release in Dec 2011) – Boolean data type – Hash based aggregation for aggregates with low cardinality keys – UDFs to build and apply bloom filters – UDFs in JRuby (may slip to next release) 14
  • 13. Learn More • Read the online documentation: http://pig.apache.org/ • Programming Pig from O’Reilly Press • Join the mailing lists: – user@pig.apache.org for user questions – dev@pig.apache.com for developer issues • Follow me on Twitter, @alanfgates
  • 14. Questions 16

Editor's Notes

  1. Say a little about Hortonworks
  2. SQL is a query languageDeclarative, what not howOriented around answering a questionRequires uniform schemaRequires metadataKnown by everyoneA great choice for answering queries, building reports, use with automated toolsPig Latin is a data flow languageScript defines a data flowIntended for pipelines where there may be tens or hundreds of stepsBuilt for raw world of Hadoop where schemas are optional, data may not be clean, etc.Can operate with or without metadataA great choice for ETL pipelines, data models, iterative processing, and research on raw data