Whirlwind tour of Pig
           Chris Wilkes
    cwilkes@seattlehadoop.org
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Why Pig?                                      Tired of boilerplate


•   Started off writing Mappers/Reducers in Java

    •   Fun at first

    •   Gets a little tedious

•   Need to do more than one MR step

    •   Write own flow control

    •   Utility classes to pass parameters / input paths

•   Go back and change a Reducer’s input type

    •   Did you change it in the Job setup?

    •   Processing two different input types in first job
Why Pig?                 Java MapReduce boilerplate example




•   Typical use case: have two different input types

    •   log files (timestamps and userids)

    •   database table dump (userids and names)

•   Want to combine the two together

•   Relatively simple, but tedious
Why Pig?                   Java MapReduce boilerplate example

Need to handle two different record types with a single
class that can handle both, designated with a "tag":
 Mapper<LongWritable,Text,TaggedKeyWritable,TaggedValueWritable>
 Reducer<TaggedKeyWritable,TaggedValueWritable,Text,PurchaseInfoWritable>

Inside of Mapper check in setup() or run() for Path of
input to decide if this is a log file or database table
 if (context.getInputSplit().getPath().toString().contains("logfile")) {
   inputType = "LOGFILE"; } else { inputType = "DATABASE"; }

Reducer: check tag and then combine
 if (key.getTag().equals("LOGFILE")) { LogEntry logEntry = value.getValue(); }
 else if (key.getTag().equals("DATABASE")) { UserInfo userInfo = value.getValue(); }
 context.write(userInfo.getId(), logEntry.getTime() + " " + userInfo.getName());
Where’s your shears?

"I was working on my thesis
and realized I needed a
reference. I'd
seen a post on comp.arch
recently that cited a paper,
so I fired up
gnus. While I was searching
for the post, I came
across another
post whose MIME encoding
screwed up my ancient version
of gnus, so I
stopped and downloaded the
latest version of gnus.
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Data Types




•   From largest to smallest:

    •   Bag (relation / group)

    •   Tuple

    •   Field

•   A bag is a collection of tuples, tuples have fields
Data Types                                                                   Bag

$ cat logs                    $ cat groupbooks.pig
101 1002       10.09          logs = LOAD 'logs' AS
101 8912       5.96             (userid: int, bookid: long, price: double);
102 1002       10.09          bookbuys = GROUP logs BY bookid;
103 8912       5.96           DESCRIBE bookbuys;
103 7122       88.99          DUMP bookbuys;

$ pig -x local groupbooks.pig
bookbuys: {group: long,logs: {userid: int,bookid: long,price: double}}
(1002L,{(101,1002L,10.09),(102,1002L,10.09)})
(7122L,{(103,7122L,88.99)})
(8912L,{(101,8912L,5.96),(103,8912L,5.96)})
Each output line is a tuple: its first field is the group key, its second
is an inner bag of the matching input tuples, which in turn contain fields.
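The GROUP above can be sketched in plain Python (illustrative only -- Pig
actually compiles this to a MapReduce shuffle; the data is the `logs` file
from this slide):

```python
from collections import defaultdict

# The logs file from above: (userid, bookid, price) per the LOAD schema.
logs = [(101, 1002, 10.09), (101, 8912, 5.96), (102, 1002, 10.09),
        (103, 8912, 5.96), (103, 7122, 88.99)]

# GROUP logs BY bookid: each group key maps to an inner bag
# containing the complete matching tuples.
bookbuys = defaultdict(list)
for t in logs:
    bookbuys[t[1]].append(t)

print(bookbuys[1002])  # [(101, 1002, 10.09), (102, 1002, 10.09)]
```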
Data Types                                               Tuple and Fields

$ cat booksexpensive.pig
logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
expensive = FOREACH bookbuys {
      inside = FILTER logs BY price > 6.0;
      GENERATE inside;
}
DESCRIBE expensive;
DUMP expensive;
$ pig -x local booksexpensive.pig
expensive: {inside: {userid: int,bookid: long,price: double}}
({(101,1002L,10.09),(102,1002L,10.09)})
({(103,7122L,88.99)})
({})
In the FILTER, "logs" refers to the inner bag of each group; each output
line holds one inner bag. Note: can always refer to fields as $0, $1, etc.
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Operator                                                             Load

This will load all files under the logs/2010/05 directory
(or the logs/2010/05 file) and put into clicklogs:
      clicklogs = LOAD 'logs/2010/05';


Names the fields in the tuple "userid" and "url" -- instead
of having to refer to them as $0 and $1:

      clicklogs = LOAD 'logs/2010/05' as (userid: int, url: chararray)


  Note: no actual loading occurs until a DUMP or
  STORE command is executed.
Operator                                           Load

By default splits on the tab character (the same as the
key/value separator in MapReduce jobs). Can also specify
your own delimiter:
         LOAD 'logs' USING PigStorage('~')

PigStorage implements LoadFunc -- implement this
interface to create your own loader, e.g. the "RegExLoader"
from the Piggybank.


Operator                              Describe, Dump, and Store

"Describe" prints out that variable's schema:
    DESCRIBE combotimes;
    combotimes: {group: chararray,
      enter: {time: chararray,userid: chararray},
      exit: {time: chararray,userid: chararray,cost: double}}
To see output on the screen type "DUMP varname":
    DUMP namesandaddresses;
To output to a file / directory use STORE:
    STORE patienttrials INTO 'trials/2010';


Operator                                                           Group

 $ cat starnames           $ cat starpositions
 1     Mintaka             1     R.A. 05h 32m 0.4s, Dec. -00 17' 57"
 2     Alnitak             2     R.A. 05h 40m 45.5s, Dec. -01 56' 34"
 3     Epsilon Orionis     3     R.A. 05h 36m 12.8s, Dec. -01 12' 07"
    $ cat starsandpositions.pig
    names = LOAD 'starnames' as (id: int, name: chararray);
    positions = LOAD 'starpositions' as (id: int, position: chararray);
    nameandpos = GROUP names BY id, positions BY id;
    DESCRIBE nameandpos;
    DUMP nameandpos;
 nameandpos: {group: int,names: {id: int,name: chararray},
  positions: {id: int,position: chararray}}
 (1,{(1,Mintaka)},{(1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")})
 (2,{(2,Alnitak)},{(2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")})
 (3,{(3,Epsilon Orionis)},{(3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")})
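Grouping two relations by the same key can be sketched in Python (an
illustrative model only; the position strings are shortened here):

```python
from collections import defaultdict

names = [(1, "Mintaka"), (2, "Alnitak"), (3, "Epsilon Orionis")]
positions = [(1, "R.A. 05h 32m 0.4s"), (2, "R.A. 05h 40m 45.5s"),
             (3, "R.A. 05h 36m 12.8s")]

# GROUP names BY id, positions BY id: each key gets one
# inner bag per input relation.
grouped = defaultdict(lambda: ([], []))
for t in names:
    grouped[t[0]][0].append(t)
for t in positions:
    grouped[t[0]][1].append(t)

nameandpos = [(k, na, po) for k, (na, po) in sorted(grouped.items())]
print(nameandpos[0])  # (1, [(1, 'Mintaka')], [(1, 'R.A. 05h 32m 0.4s')])
```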
Operator                                                                 Join

Just like GROUP but flatter
   $ cat starsandpositions2.pig
   names = LOAD 'starnames' as (id: int, name: chararray);
   positions = LOAD 'starpositions' as (id: int, position: chararray);
   nameandpos = JOIN names BY id, positions BY id;
   DESCRIBE nameandpos;
   DUMP nameandpos;

 nameandpos: {names::id: int,names::name: chararray,
 positions::id: int,positions::position: chararray}

 (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
 (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
 (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
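JOIN's "flatter" output can be sketched the same way (illustrative;
shortened position strings):

```python
names = [(1, "Mintaka"), (2, "Alnitak"), (3, "Epsilon Orionis")]
positions = [(1, "R.A. 05h 32m 0.4s"), (2, "R.A. 05h 40m 45.5s"),
             (3, "R.A. 05h 36m 12.8s")]

# JOIN names BY id, positions BY id: one flat tuple per matching pair,
# concatenating both inputs' fields -- which is why the schema keeps
# both names::id and positions::id.
nameandpos = [n + p for n in names for p in positions if n[0] == p[0]]
print(nameandpos[0])  # (1, 'Mintaka', 1, 'R.A. 05h 32m 0.4s')
```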
Operator                                                          Flatten

Ugly looking output from before:
  expensive: {inside: {userid: int,bookid: long,price: double}}
  ({(101,1002L,10.09),(102,1002L,10.09)})
  ({(103,7122L,88.99)})
Use the FLATTEN operator
  expensive = FOREACH bookbuys {
        inside = FILTER logs BY price > 6.0;
        GENERATE group, FLATTEN (inside);
  }
  expensive: {group: long,inside::userid: int,inside::bookid:
  long,inside::price: double}
  (1002L,101,1002L,10.09)
  (1002L,102,1002L,10.09)
  (7122L,103,7122L,88.99)
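FLATTEN's effect can be sketched in Python; note how the empty inner bag
(the book whose purchases were all under 6.0) produces no output row at
all (illustrative sketch):

```python
# (group key, inner bag) pairs from the FILTER inside the FOREACH:
bookbuys = [(1002, [(101, 1002, 10.09), (102, 1002, 10.09)]),
            (7122, [(103, 7122, 88.99)]),
            (8912, [])]  # all purchases of 8912 were under 6.0

# GENERATE group, FLATTEN(inside): one flat row per tuple in the inner
# bag -- an empty bag contributes no rows, so ({}) disappears.
expensive = [(group,) + t for group, bag in bookbuys for t in bag]
print(expensive[0])  # (1002, 101, 1002, 10.09)
```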
Operator                                       Renaming in Foreach

 All columns have cumbersome names:
  expensive: {group: long,inside::userid: int,inside::bookid:
  long,inside::price: double}
 Pick and rename:
  expensive = FOREACH bookbuys {
     inside = FILTER logs BY price > 6.0;
     GENERATE group AS userid,
       FLATTEN (inside.(bookid, price)) AS (bookid, price);
  }
                                               Kept the type!
 Now easy to use:
  expensive: {userid: long,bookid: long,price: double}
  (1002L,1002L,10.09)
  (1002L,1002L,10.09)
  (7122L,7122L,88.99)
Operator                                                          Split

When input file mixes types or needs separation
  $ cat enterexittimes
  2010-05-10 12:55:12     user123 enter
  2010-05-10 13:14:23     user456 enter
  2010-05-10 13:16:53     user123 exit 23.79
  2010-05-10 13:17:49     user456 exit 0.50

  inandout = LOAD 'enterexittimes';
  SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';

  enter1: (2010-05-10 12:55:12,user123,enter)
          (2010-05-10 13:14:23,user456,enter)
  exit1:  (2010-05-10 13:16:53,user123,exit,23.79)
          (2010-05-10 13:17:49,user456,exit,0.50)
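SPLIT routes each row to every relation whose condition it satisfies; a
Python sketch over this slide's data:

```python
inandout = [("2010-05-10 12:55:12", "user123", "enter"),
            ("2010-05-10 13:14:23", "user456", "enter"),
            ("2010-05-10 13:16:53", "user123", "exit", 23.79),
            ("2010-05-10 13:17:49", "user456", "exit", 0.50)]

# SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
# rows matching neither condition would simply be dropped.
enter1 = [t for t in inandout if t[2] == "enter"]
exit1 = [t for t in inandout if t[2] == "exit"]
print(len(enter1), len(exit1))  # 2 2
```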
Operator                                                          Split

If every line had the same schema we could specify it on
load; here we need a FOREACH to cast and name the fields:
 enter = FOREACH enter1 GENERATE
   (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
 exit = FOREACH exit1 GENERATE
   (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray,
   (double)$3 AS cost:double;
 DESCRIBE enter;
 DESCRIBE exit;

 enter: {time: chararray,userid: chararray}
 exit: {time: chararray,userid: chararray,cost: double}
Operator                                                 Sample, Limit

For testing purposes sample both large inputs:
   names1 = LOAD 'starnames' as (id: int, name: chararray);
   names = SAMPLE names1 0.3;
   positions1 = LOAD 'starpositions' as (id: int, position: chararray);
   positions = SAMPLE positions1 0.3;
Each run returns a different random subset of rows:
   (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
Limit only returns the first N results. Use with OrderBy
to return the top results:
   nameandpos1 = JOIN names BY id, positions BY id;
   nameandpos2 = ORDER nameandpos1 BY names::id DESC;
   nameandpos = LIMIT nameandpos2 2;
   (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
   (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
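A Python sketch of both operators (SAMPLE is nondeterministic by design,
so only the ORDER + LIMIT part has a stable answer):

```python
import random

names1 = [(1, "Mintaka"), (2, "Alnitak"), (3, "Epsilon Orionis")]

# SAMPLE names1 0.3: keep each row independently with probability ~0.3,
# so the result changes on every run.
names = [t for t in names1 if random.random() < 0.3]

# ORDER ... BY id DESC then LIMIT 2: a deterministic top-N.
top2 = sorted(names1, key=lambda t: t[0], reverse=True)[:2]
print(top2)  # [(3, 'Epsilon Orionis'), (2, 'Alnitak')]
```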
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
UDF

UDF: User Defined Function
Operates on single values or a group
Simple example: IsEmpty (a FilterFunc)
   users = GROUP names BY id, addresses BY id;
   D = FOREACH users GENERATE group,
    FLATTEN((IsEmpty(names) ? {('none')} : names.firstName));
Working over an aggregate, e.g. COUNT:
   users = GROUP names BY id, books BY buyerId;
   D = FOREACH users GENERATE group, COUNT(books);
Working on two values:
   distance1 = CROSS stars, stars;
   distance =
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
LOAD and GROUP
logfiles = LOAD 'logs' AS (userid: int, bookid: long, price: double);
userinfo = LOAD 'users' AS (userid: int, name: chararray);
userpurchases = GROUP logfiles BY userid, userinfo BY userid;
DESCRIBE userpurchases;
DUMP userpurchases;
Inside {} are bags (unordered);
inside () are tuples (ordered lists of fields).

report = FOREACH userpurchases GENERATE
FLATTEN(userinfo.name) AS name, group AS userid,
FLATTEN(SUM(logfiles.price)) AS cost;
bybigspender = ORDER report BY cost DESC;
DUMP bybigspender;

(Bob,103,94.94999999999999)
(Joe,101,16.05)
(Cindy,102,10.09)
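The whole report can be sketched in Python (illustrative; using the same
logs and users data as the earlier slides):

```python
from collections import defaultdict

logfiles = [(101, 1002, 10.09), (101, 8912, 5.96), (102, 1002, 10.09),
            (103, 8912, 5.96), (103, 7122, 88.99)]
userinfo = [(101, "Joe"), (102, "Cindy"), (103, "Bob")]

# GROUP logfiles BY userid, then SUM(logfiles.price) per group,
# pairing each userid with its name from userinfo.
spend = defaultdict(float)
for userid, bookid, price in logfiles:
    spend[userid] += price
names = dict(userinfo)
report = [(names[u], u, cost) for u, cost in spend.items()]

# ORDER report BY cost DESC
bybigspender = sorted(report, key=lambda t: t[2], reverse=True)
print(bybigspender[0][:2])  # ('Bob', 103)
```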
Entering and exiting recorded in same file:

2010-05-10 12:55:12 user123 enter
2010-05-10 13:14:23 user456 enter
2010-05-10 13:16:53 user123 exit 23.79
2010-05-10 13:17:49 user456 exit 0.50
inandout = LOAD 'enterexittimes';
SPLIT inandout INTO enter1
  IF $2 == 'enter', exit1 IF $2 == 'exit';

enter = FOREACH enter1 GENERATE
 (chararray)$0 AS time:chararray,
 (chararray)$1 AS userid:chararray;

exit = FOREACH exit1 GENERATE
 (chararray)$0 AS time:chararray,
 (chararray)$1 AS userid:chararray,
 (double)$3 AS cost:double;
combotimes = GROUP enter BY $1, exit BY $1;

purchases = FOREACH combotimes GENERATE
 group AS userid,
 FLATTEN(enter.$0) AS entertime,
 FLATTEN(exit.$0) AS exittime,
 FLATTEN(exit.$2);

DUMP purchases;
Schema for inandout, enter1, exit1 unknown.

enter: {time: chararray,userid: chararray}
exit: {time: chararray,userid: chararray,cost: double}

combotimes: {group: chararray,
 enter: {time: chararray,userid: chararray},
 exit: {time: chararray,userid: chararray,cost: double}}

purchases: {userid: chararray,entertime: chararray,
 exittime: chararray,cost: double}
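Since each user has exactly one enter row and one exit row here, the
GROUP plus FLATTENs behave like a join; a Python sketch of the final
step (illustrative):

```python
enter = [("2010-05-10 12:55:12", "user123"),
         ("2010-05-10 13:14:23", "user456")]
exits = [("2010-05-10 13:16:53", "user123", 23.79),
         ("2010-05-10 13:17:49", "user456", 0.50)]

# GROUP enter BY userid, exit BY userid, then FLATTEN the enter time,
# exit time, and cost out of the inner bags.
purchases = [(user, etime, xtime, cost)
             for etime, user in enter
             for xtime, xuser, cost in exits
             if xuser == user]
print(purchases[0])
```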
UDFs
• User Defined Function
• For doing an operation on data
• Already use several builtins:
  • COUNT
  • SUM

More Related Content

What's hot

Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Workhorse Computing
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Workhorse Computing
 
(Parameterized) Roles
(Parameterized) Roles(Parameterized) Roles
(Parameterized) Rolessartak
 
Backbone.js: Run your Application Inside The Browser
Backbone.js: Run your Application Inside The BrowserBackbone.js: Run your Application Inside The Browser
Backbone.js: Run your Application Inside The BrowserHoward Lewis Ship
 
Decorators Explained: A Powerful Tool That Should Be in Your Python Toolbelt.
Decorators Explained: A Powerful Tool That Should Be in Your Python Toolbelt.Decorators Explained: A Powerful Tool That Should Be in Your Python Toolbelt.
Decorators Explained: A Powerful Tool That Should Be in Your Python Toolbelt.Samuel Fortier-Galarneau
 
MTDDC 2010.2.5 Tokyo - Brand new API
MTDDC 2010.2.5 Tokyo - Brand new APIMTDDC 2010.2.5 Tokyo - Brand new API
MTDDC 2010.2.5 Tokyo - Brand new APISix Apart KK
 
PERL for QA - Important Commands and applications
PERL for QA - Important Commands and applicationsPERL for QA - Important Commands and applications
PERL for QA - Important Commands and applicationsSunil Kumar Gunasekaran
 
Mongoskin - Guilin
Mongoskin - GuilinMongoskin - Guilin
Mongoskin - GuilinJackson Tian
 
Powerful JavaScript Tips and Best Practices
Powerful JavaScript Tips and Best PracticesPowerful JavaScript Tips and Best Practices
Powerful JavaScript Tips and Best PracticesDragos Ionita
 
Introdução ao Perl 6
Introdução ao Perl 6Introdução ao Perl 6
Introdução ao Perl 6garux
 
Lithium: The Framework for People Who Hate Frameworks
Lithium: The Framework for People Who Hate FrameworksLithium: The Framework for People Who Hate Frameworks
Lithium: The Framework for People Who Hate FrameworksNate Abele
 
Beyond javascript using the features of tomorrow
Beyond javascript   using the features of tomorrowBeyond javascript   using the features of tomorrow
Beyond javascript using the features of tomorrowAlexander Varwijk
 
PHP 7 – What changed internally? (Forum PHP 2015)
PHP 7 – What changed internally? (Forum PHP 2015)PHP 7 – What changed internally? (Forum PHP 2015)
PHP 7 – What changed internally? (Forum PHP 2015)Nikita Popov
 
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) BigDataEverywhere
 
The Zen of Lithium
The Zen of LithiumThe Zen of Lithium
The Zen of LithiumNate Abele
 
Python decorators
Python decoratorsPython decorators
Python decoratorsAlex Su
 
Decorators in Python
Decorators in PythonDecorators in Python
Decorators in PythonBen James
 
Grails: a quick tutorial (1)
Grails: a quick tutorial (1)Grails: a quick tutorial (1)
Grails: a quick tutorial (1)Davide Rossi
 
Javascript the New Parts v2
Javascript the New Parts v2Javascript the New Parts v2
Javascript the New Parts v2Federico Galassi
 

What's hot (20)

Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
Neatly Hashing a Tree: FP tree-fold in Perl5 & Perl6
 
Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.Perl6 Regexen: Reduce the line noise in your code.
Perl6 Regexen: Reduce the line noise in your code.
 
(Parameterized) Roles
(Parameterized) Roles(Parameterized) Roles
(Parameterized) Roles
 
Backbone.js: Run your Application Inside The Browser
Backbone.js: Run your Application Inside The BrowserBackbone.js: Run your Application Inside The Browser
Backbone.js: Run your Application Inside The Browser
 
Decorators Explained: A Powerful Tool That Should Be in Your Python Toolbelt.
Decorators Explained: A Powerful Tool That Should Be in Your Python Toolbelt.Decorators Explained: A Powerful Tool That Should Be in Your Python Toolbelt.
Decorators Explained: A Powerful Tool That Should Be in Your Python Toolbelt.
 
MTDDC 2010.2.5 Tokyo - Brand new API
MTDDC 2010.2.5 Tokyo - Brand new APIMTDDC 2010.2.5 Tokyo - Brand new API
MTDDC 2010.2.5 Tokyo - Brand new API
 
GORM
GORMGORM
GORM
 
PERL for QA - Important Commands and applications
PERL for QA - Important Commands and applicationsPERL for QA - Important Commands and applications
PERL for QA - Important Commands and applications
 
Mongoskin - Guilin
Mongoskin - GuilinMongoskin - Guilin
Mongoskin - Guilin
 
Powerful JavaScript Tips and Best Practices
Powerful JavaScript Tips and Best PracticesPowerful JavaScript Tips and Best Practices
Powerful JavaScript Tips and Best Practices
 
Introdução ao Perl 6
Introdução ao Perl 6Introdução ao Perl 6
Introdução ao Perl 6
 
Lithium: The Framework for People Who Hate Frameworks
Lithium: The Framework for People Who Hate FrameworksLithium: The Framework for People Who Hate Frameworks
Lithium: The Framework for People Who Hate Frameworks
 
Beyond javascript using the features of tomorrow
Beyond javascript   using the features of tomorrowBeyond javascript   using the features of tomorrow
Beyond javascript using the features of tomorrow
 
PHP 7 – What changed internally? (Forum PHP 2015)
PHP 7 – What changed internally? (Forum PHP 2015)PHP 7 – What changed internally? (Forum PHP 2015)
PHP 7 – What changed internally? (Forum PHP 2015)
 
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
 
The Zen of Lithium
The Zen of LithiumThe Zen of Lithium
The Zen of Lithium
 
Python decorators
Python decoratorsPython decorators
Python decorators
 
Decorators in Python
Decorators in PythonDecorators in Python
Decorators in Python
 
Grails: a quick tutorial (1)
Grails: a quick tutorial (1)Grails: a quick tutorial (1)
Grails: a quick tutorial (1)
 
Javascript the New Parts v2
Javascript the New Parts v2Javascript the New Parts v2
Javascript the New Parts v2
 

Similar to Pig Introduction to Pig

Practical pig
Practical pigPractical pig
Practical pigtrihug
 
Getting Started with PL/Proxy
Getting Started with PL/ProxyGetting Started with PL/Proxy
Getting Started with PL/ProxyPeter Eisentraut
 
Perl on Amazon Elastic MapReduce
Perl on Amazon Elastic MapReducePerl on Amazon Elastic MapReduce
Perl on Amazon Elastic MapReducePedro Figueiredo
 
Can't Miss Features of PHP 5.3 and 5.4
Can't Miss Features of PHP 5.3 and 5.4Can't Miss Features of PHP 5.3 and 5.4
Can't Miss Features of PHP 5.3 and 5.4Jeff Carouth
 
The Story About The Migration
 The Story About The Migration The Story About The Migration
The Story About The MigrationEDB
 
The Rust Programming Language: an Overview
The Rust Programming Language: an OverviewThe Rust Programming Language: an Overview
The Rust Programming Language: an OverviewRoberto Casadei
 
Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?osfameron
 
Good Evils In Perl (Yapc Asia)
Good Evils In Perl (Yapc Asia)Good Evils In Perl (Yapc Asia)
Good Evils In Perl (Yapc Asia)Kang-min Liu
 
Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWorkhorse Computing
 
Mixing functional and object oriented approaches to programming in C#
Mixing functional and object oriented approaches to programming in C#Mixing functional and object oriented approaches to programming in C#
Mixing functional and object oriented approaches to programming in C#Mark Needham
 
Marrow: A Meta-Framework for Python 2.6+ and 3.1+
Marrow: A Meta-Framework for Python 2.6+ and 3.1+Marrow: A Meta-Framework for Python 2.6+ and 3.1+
Marrow: A Meta-Framework for Python 2.6+ and 3.1+ConFoo
 
A Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to GoA Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to GoMatt Stine
 

Similar to Pig Introduction to Pig (20)

Apache pig
Apache pigApache pig
Apache pig
 
Pig workshop
Pig workshopPig workshop
Pig workshop
 
Practical pig
Practical pigPractical pig
Practical pig
 
Getting Started with PL/Proxy
Getting Started with PL/ProxyGetting Started with PL/Proxy
Getting Started with PL/Proxy
 
Perl on Amazon Elastic MapReduce
Perl on Amazon Elastic MapReducePerl on Amazon Elastic MapReduce
Perl on Amazon Elastic MapReduce
 
Can't Miss Features of PHP 5.3 and 5.4
Can't Miss Features of PHP 5.3 and 5.4Can't Miss Features of PHP 5.3 and 5.4
Can't Miss Features of PHP 5.3 and 5.4
 
Hadoop Pig
Hadoop PigHadoop Pig
Hadoop Pig
 
The Story About The Migration
 The Story About The Migration The Story About The Migration
The Story About The Migration
 
The Rust Programming Language: an Overview
The Rust Programming Language: an OverviewThe Rust Programming Language: an Overview
The Rust Programming Language: an Overview
 
Groovy intro for OUDL
Groovy intro for OUDLGroovy intro for OUDL
Groovy intro for OUDL
 
Groovy
GroovyGroovy
Groovy
 
Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?Is Haskell an acceptable Perl?
Is Haskell an acceptable Perl?
 
Good Evils In Perl (Yapc Asia)
Good Evils In Perl (Yapc Asia)Good Evils In Perl (Yapc Asia)
Good Evils In Perl (Yapc Asia)
 
Wheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility ModulesWheels we didn't re-invent: Perl's Utility Modules
Wheels we didn't re-invent: Perl's Utility Modules
 
Mixing functional and object oriented approaches to programming in C#
Mixing functional and object oriented approaches to programming in C#Mixing functional and object oriented approaches to programming in C#
Mixing functional and object oriented approaches to programming in C#
 
Subroutines
SubroutinesSubroutines
Subroutines
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Hw09 Hadoop + Clojure
Hw09   Hadoop + ClojureHw09   Hadoop + Clojure
Hw09 Hadoop + Clojure
 
Marrow: A Meta-Framework for Python 2.6+ and 3.1+
Marrow: A Meta-Framework for Python 2.6+ and 3.1+Marrow: A Meta-Framework for Python 2.6+ and 3.1+
Marrow: A Meta-Framework for Python 2.6+ and 3.1+
 
A Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to GoA Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to Go
 

Recently uploaded

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

Introduction to Pig

  • 1. Whirlwind tour of Pig Chris Wilkes cwilkes@seattlehadoop.org
  • 2. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
  • 3. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
  • 4. Why Pig? Tired of boilerplate • Started off writing Mappers/Reducers in Java • Fun at first • Gets a little tedious • Need to do more than one MR step • Write own flow control • Utility classes to pass parameters / input paths • Go back and change a Reducer’s input type • Did you change it in the Job setup? • Processing two different input types in first job
  • 5. Why Pig? Java MapReduce boilerplate example • Typical use case: have two different input types • log files (timestamps and userids) • database table dump (userids and names) • Want to combine the two together • Relatively simple, but tedious
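  In Pig Latin the same two-input combine collapses to a handful of lines. A minimal sketch, not from the slides, assuming hypothetical input paths 'logs' and 'users' with the schemas described above:

```
-- load both inputs with schemas (paths and field names are assumptions)
logs  = LOAD 'logs'  AS (time: chararray, userid: int);
users = LOAD 'users' AS (userid: int, name: chararray);
-- one JOIN replaces the tagged-writable and input-path plumbing
combined = JOIN logs BY userid, users BY userid;
report   = FOREACH combined GENERATE users::userid, logs::time, users::name;
DUMP report;
```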
  • 6. Why Pig? Java MapReduce boilerplate example
    Need to handle two different input types with a single class that can handle both, designated with a "tag":
      Mapper<LongWritable,Text,TaggedKeyWritable,TaggedValueWritable>
      Reducer<TaggedKeyWritable,TaggedValueWritable,Text,PurchaseInfoWritable>
    Inside the Mapper, check in setup() or run() which Path the input came from to decide whether this is a log file or a database table:
      if (context.getInputSplit().getPath().toString().contains("logfile")) {
        inputType = "LOGFILE";
      } else { ... inputType = "DATABASE"; }
    In the Reducer, check the tag and then combine:
      if (key.getTag().equals("LOGFILE")) { LogEntry logEntry = value.getValue(); }
      else if (key.getTag().equals("DATABASE")) { UserInfo userInfo = value.getValue(); }
      context.write(userInfo.getId(), logEntry.getTime() + " " + userInfo.getName());
  • 7. Where’s your shears? "I was working on my thesis and realized I needed a reference. I'd seen a post on comp.arch recently that cited a paper, so I fired up gnus. While I was searching for the post, I came across another post whose MIME encoding screwed up my ancient version of gnus, so I stopped and downloaded the latest version of gnus.
  • 8. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
  • 9. Data Types • From largest to smallest: • Bag (relation / group) • Tuple • Field • A bag is a collection of tuples, tuples have fields
  • 10. Data Types: Bag
    $ cat logs
    101 1002 10.09
    101 8912 5.96
    102 1002 10.09
    103 8912 5.96
    103 7122 88.99
    $ cat groupbooks.pig
    logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
    bookbuys = GROUP logs BY bookid;
    DESCRIBE bookbuys;
    DUMP bookbuys;
    $ pig -x local groupbooks.pig
    bookbuys: {group: long,logs: {userid: int,bookid: long,price: double}}
    (1002L,{(101,1002L,10.09),(102,1002L,10.09)})
    (7122L,{(103,7122L,88.99)})
    (8912L,{(101,8912L,5.96),(103,8912L,5.96)})
    Each output line is a tuple; the {...} inside it is an inner bag, and the values within are fields.
  • 11. Data Types: Tuple and Fields
    $ cat booksexpensive.pig
    logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
    bookbuys = GROUP logs BY bookid;
    expensive = FOREACH bookbuys {
      inside = FILTER logs BY price > 6.0;
      GENERATE inside;
    }
    DESCRIBE expensive;
    DUMP expensive;
    $ pig -x local booksexpensive.pig
    expensive: {inside: {userid: int,bookid: long,price: double}}
    ({(101,1002L,10.09),(102,1002L,10.09)})
    ({(103,7122L,88.99)})
    ({})
    The FILTER inside the FOREACH refers to the inner bag. Note: can always refer to fields as $0, $1, etc.
  • 12. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
  • 13. Operator: LOAD
    This will load all files under the logs/2010/05 directory (or the logs/2010/05 file) and put them into clicklogs:
      clicklogs = LOAD 'logs/2010/05';
    Naming the fields in the tuple "userid" and "url" -- instead of having to refer to them as $0 and $1:
      clicklogs = LOAD 'logs/2010/05' AS (userid: int, url: chararray);
    Note: no actual loading occurs until a DUMP or STORE command is executed.
  • 14. Operator: LOAD
    By default LOAD splits on the tab character (the same as the key/value separator in MapReduce jobs). You can also specify your own delimiter:
      LOAD 'logs' USING PigStorage('~');
    PigStorage implements LoadFunc -- implement this interface to create your own loader, e.g. "RegExLoader" from the Piggybank.
  • 15. Operators: DESCRIBE, DUMP, and STORE
    DESCRIBE prints out a variable's schema:
      DESCRIBE combotimes;
      combotimes: {group: chararray, enter: {time: chararray,userid: chararray}, exit: {time: chararray,userid: chararray,cost: double}}
    To see output on the screen, use DUMP:
      DUMP namesandaddresses;
    To output to a file / directory, use STORE:
      STORE patienttrials INTO 'trials/2010';
  • 16. Operator: GROUP
    $ cat starnames
    1 Mintaka
    2 Alnitak
    3 Epsilon Orionis
    $ cat starpositions
    1 R.A. 05h 32m 0.4s, Dec. -00 17' 57"
    2 R.A. 05h 40m 45.5s, Dec. -01 56' 34"
    3 R.A. 05h 36m 12.8s, Dec. -01 12' 07"
    $ cat starsandpositions.pig
    names = LOAD 'starnames' AS (id: int, name: chararray);
    positions = LOAD 'starpositions' AS (id: int, position: chararray);
    nameandpos = GROUP names BY id, positions BY id;
    DESCRIBE nameandpos;
    DUMP nameandpos;
    nameandpos: {group: int,names: {id: int,name: chararray},positions: {id: int,position: chararray}}
    (1,{(1,Mintaka)},{(1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")})
    (2,{(2,Alnitak)},{(2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")})
    (3,{(3,Epsilon Orionis)},{(3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")})
  • 17. Operator: JOIN
    Just like GROUP but flatter:
    $ cat starsandpositions2.pig
    names = LOAD 'starnames' AS (id: int, name: chararray);
    positions = LOAD 'starpositions' AS (id: int, position: chararray);
    nameandpos = JOIN names BY id, positions BY id;
    DESCRIBE nameandpos;
    DUMP nameandpos;
    nameandpos: {names::id: int,names::name: chararray,positions::id: int,positions::position: chararray}
    (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
    (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
    (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
  • 18. Operator: FLATTEN
    Ugly looking output from before:
      expensive: {inside: {userid: int,bookid: long,price: double}}
      ({(101,1002L,10.09),(102,1002L,10.09)})
      ({(103,7122L,88.99)})
    Use the FLATTEN operator:
      expensive = FOREACH bookbuys {
        inside = FILTER logs BY price > 6.0;
        GENERATE group, FLATTEN (inside);
      }
      expensive: {group: long,inside::userid: int,inside::bookid: long,inside::price: double}
      (1002L,101,1002L,10.09)
      (1002L,102,1002L,10.09)
      (7122L,103,7122L,88.99)
  • 19. Operator: Renaming in FOREACH
    All columns with cumbersome names:
      expensive: {group: long,inside::userid: int,inside::bookid: long,inside::price: double}
    Pick and rename:
      expensive = FOREACH bookbuys {
        inside = FILTER logs BY price > 6.0;
        GENERATE group AS userid, FLATTEN (inside.(bookid, price)) AS (bookid, price);
      }
    Kept the type! Now easy to use:
      expensive: {userid: long,bookid: long,price: double}
      (1002L,1002L,10.09)
      (1002L,1002L,10.09)
      (7122L,7122L,88.99)
  • 20. Operator: SPLIT
    When an input file mixes types or needs separation:
    $ cat enterexittimes
    2010-05-10 12:55:12 user123 enter
    2010-05-10 13:14:23 user456 enter
    2010-05-10 13:16:53 user123 exit 23.79
    2010-05-10 13:17:49 user456 exit 0.50
    inandout = LOAD 'enterexittimes';
    SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
    enter1:
    (2010-05-10 12:55:12,user123,enter)
    (2010-05-10 13:14:23,user456,enter)
    exit1:
    (2010-05-10 13:16:53,user123,exit,23.79)
    (2010-05-10 13:17:49,user456,exit,0.50)
  • 21. Operator: SPLIT
    If each line had the same schema it could be specified on load; in this case we need a FOREACH:
    enter = FOREACH enter1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
    exit = FOREACH exit1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray, (double)$3 AS cost:double;
    DESCRIBE enter;
    DESCRIBE exit;
    enter: {time: chararray,userid: chararray}
    exit: {time: chararray,userid: chararray,cost: double}
  • 22. Operators: SAMPLE, LIMIT
    For testing purposes, SAMPLE both large inputs:
      names1 = LOAD 'starnames' AS (id: int, name: chararray);
      names = SAMPLE names1 0.3;
      positions1 = LOAD 'starpositions' AS (id: int, position: chararray);
      positions = SAMPLE positions1 0.3;
    Running returns random rows every time:
      (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
    LIMIT only returns the first N results. Use with ORDER BY to return the top results:
      nameandpos1 = JOIN names BY id, positions BY id;
      nameandpos2 = ORDER nameandpos1 BY names::id DESC;
      nameandpos = LIMIT nameandpos2 2;
      (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
      (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
  • 23. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
  • 24. UDF
    UDF: User Defined Function. Operates on single values or a group.
    Simple example: IsEmpty (a FilterFunc):
      users = JOIN names BY id, addresses BY id;
      D = FOREACH users GENERATE group, FLATTEN ((IsEmpty(names::firstName) ? 'none' : names::firstName)
    Working over an aggregate, i.e. COUNT:
      users = JOIN names BY id, books BY buyerId;
      D = FOREACH users GENERATE group, COUNT(books)
    Working on two values:
      distance1 = CROSS stars and stars;
      distance =
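  Using a custom UDF from a script follows the same pattern as the builtins; a hedged sketch, where the jar name and class path are hypothetical placeholders:

```
-- register the jar holding the UDF and give it a short alias (names are assumptions)
REGISTER myudfs.jar;
DEFINE Distance com.example.pig.Distance();
-- CROSS needs two distinct aliases, so the input is loaded twice
stars1 = LOAD 'stars' AS (id: int, position: chararray);
stars2 = LOAD 'stars' AS (id: int, position: chararray);
pairs  = CROSS stars1, stars2;
dists  = FOREACH pairs GENERATE stars1::id, stars2::id,
         Distance(stars1::position, stars2::position);
```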
  • 25. Agenda 1 Why Pig? 2 Data types 3 Operators 4 UDFs 5 Using Pig
  • 26. LOAD and GROUP
    logfiles = LOAD 'logs' AS (userid: int, bookid: long, price: double);
    userinfo = LOAD 'users' AS (userid: int, name: chararray);
    userpurchases = GROUP logfiles BY userid, userinfo BY userid;
    DESCRIBE userpurchases;
    DUMP userpurchases;
  • 27. Inside {} are bags (unordered); inside () are tuples (ordered lists of fields).
    report = FOREACH userpurchases GENERATE
      FLATTEN(userinfo.name) AS name,
      group AS userid,
      FLATTEN(SUM(logfiles.price)) AS cost;
    bybigspender = ORDER report BY cost DESC;
    DUMP bybigspender;
    (Bob,103,94.94999999999999)
    (Joe,101,16.05)
    (Cindy,102,10.09)
  • 28. Entering and exiting recorded in same file: 2010-05-10 12:55:12 user123 enter 2010-05-10 13:14:23 user456 enter 2010-05-10 13:16:53 user123 exit 23.79 2010-05-10 13:17:49 user456 exit 0.50
  • 29. inandout = LOAD 'enterexittimes';
    SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
    enter = FOREACH enter1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
    exit = FOREACH exit1 GENERATE (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray, (double)$3 AS cost:double;
  • 30. combotimes = GROUP enter BY $1, exit BY $1;
    purchases = FOREACH combotimes GENERATE
      group AS userid,
      FLATTEN(enter.$0) AS entertime,
      FLATTEN(exit.$0) AS exittime,
      FLATTEN(exit.$2);
    DUMP purchases;
  • 31. Schema for inandout, enter1, exit1 is unknown.
    enter: {time: chararray,userid: chararray}
    exit: {time: chararray,userid: chararray,cost: double}
    combotimes: {group: chararray, enter: {time: chararray,userid: chararray}, exit: {time: chararray,userid: chararray,cost: double}}
    purchases: {userid: chararray,entertime: chararray, exittime: chararray,cost: double}
  • 32. UDFs
    • User Defined Function
    • For doing an operation on data
    • Already use several builtins:
      • COUNT
      • SUM
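  The builtin aggregates plug straight into a FOREACH over a grouped relation; a small sketch, not from the slides, reusing the 'logs' schema from the earlier examples:

```
logs   = LOAD 'logs' AS (userid: int, bookid: long, price: double);
byuser = GROUP logs BY userid;
-- COUNT and SUM are builtin aggregate functions
stats  = FOREACH byuser GENERATE group AS userid,
         COUNT(logs) AS purchases, SUM(logs.price) AS total;
DUMP stats;
```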
