2. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
4. Why Pig? Tired of boilerplate
• Started off writing Mappers/Reducers in Java
• Fun at first
• Gets a little tedious
• Need to do more than one MR step
• Write own flow control
• Utility classes to pass parameters / input paths
• Go back and change a Reducer’s input type
• Did you change it in the Job setup?
• Processing two different input types in first job
5. Why Pig? Java MapReduce boilerplate example
• Typical use case: have two different input types
• log files (timestamps and userids)
• database table dump (userids and names)
• Want to combine the two together
• Relatively simple, but tedious
6. Why Pig? Java MapReduce boilerplate example
Need to handle two different output types, so you need a single
class that can handle both; designate each with a "tag":
Mapper<LongWritable,Text,TaggedKeyWritable,TaggedValueWritable>
Reducer<TaggedKeyWritable,TaggedValueWritable,Text,PurchaseInfoWritable>
Inside the Mapper, check (in setup() or run()) the Path of the
input to decide whether this is a log file or a database table:
if (((FileSplit) context.getInputSplit()).getPath().toString().contains("logfile")) {
  inputType = "LOGFILE"; } else { inputType = "DATABASE"; }
In the Reducer, check the tag and then combine:
if (key.getTag().equals("LOGFILE")) { LogEntry logEntry = value.getValue(); }
else if (key.getTag().equals("DATABASE")) { UserInfo userInfo = value.getValue(); }
context.write(userInfo.getId(), logEntry.getTime() + " " + userInfo.getName());
7. Where’s your shears?
"I was working on my thesis and realized I needed a reference.
I'd seen a post on comp.arch recently that cited a paper, so I
fired up gnus. While I was searching for the post, I came across
another post whose MIME encoding screwed up my ancient version
of gnus, so I stopped and downloaded the latest version of gnus.
8. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
9. Data Types
• From largest to smallest:
• Bag (relation / group)
• Tuple
• Field
• A bag is a collection of tuples, tuples have fields
10. Data Types Bag
$ cat logs
101 1002 10.09
101 8912 5.96
102 1002 10.09
103 8912 5.96
103 7122 88.99
$ cat groupbooks.pig
logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
DESCRIBE bookbuys;
DUMP bookbuys;
$ pig -x local groupbooks.pig
bookbuys: {group: long,logs: {userid: int,bookid: long,price: double}}
(1002L,{(101,1002L,10.09),(102,1002L,10.09)})
(7122L,{(103,7122L,88.99)})
(8912L,{(101,8912L,5.96),(103,8912L,5.96)})
Each output line is a tuple; its first field is the group key and its
second field is an inner bag whose tuples hold the original fields.
11. Data Types Tuple and Fields
$ cat booksexpensive.pig
logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
expensive = FOREACH bookbuys {
inside = FILTER logs BY price > 6.0;
GENERATE inside;
}
DESCRIBE expensive;
DUMP expensive;
$ pig -x local booksexpensive.pig
expensive: {inside: {userid: int,bookid: long,price: double}}
({(101,1002L,10.09),(102,1002L,10.09)})
({(103,7122L,88.99)})
({})
Here "inside" refers to the inner bag; the last inner bag is empty
because no purchase of that book cost more than 6.0.
Note: can always refer to fields as $0, $1, etc.
12. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
13. Operator Load
This will load all files under the logs/2010/05 directory
(or the logs/2010/05 file) and put them into clicklogs:
clicklogs = LOAD 'logs/2010/05';
Naming the fields in the tuple "userid" and "url" avoids having
to refer to them as $0 and $1:
clicklogs = LOAD 'logs/2010/05' AS (userid: int, url: chararray);
Note: no actual loading happens yet; nothing is executed
until a DUMP or STORE.
14. Operator Load
By default LOAD splits on the tab character (the same as the
key/value separator in MapReduce jobs). You can also specify
your own delimiter:
LOAD 'logs' USING PigStorage('~')
PigStorage implements LoadFunc -- implement this
interface to create your own loader, e.g. "RegExLoader"
from the Piggybank.
15. Operator Describe, Dump, and Store
"Describe" prints out that variable's schema:
DESCRIBE combotimes;
combotimes: {group: chararray,
  enter: {time: chararray,userid: chararray},
  exit: {time: chararray,userid: chararray,cost: double}}
To see output on the screen, type "DUMP varname":
DUMP namesandaddresses;
To output to a file / directory, use STORE:
STORE patienttrials INTO 'trials/2010';
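Like LOAD, STORE accepts a storage function, so you can choose the output delimiter as well (the path below reuses the slide's example; the comma delimiter is illustrative):

```pig
-- Sketch: write out comma-separated instead of the default tab
STORE patienttrials INTO 'trials/2010' USING PigStorage(',');
```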
17. Operator Join
Just like GROUP but flatter
$ cat starsandpositions2.pig
names = LOAD 'starnames' as (id: int, name: chararray);
positions = LOAD 'starpositions' as (id: int, position: chararray);
nameandpos = JOIN names BY id, positions BY id;
DESCRIBE nameandpos;
DUMP nameandpos;
nameandpos: {names::id: int,names::name: chararray,
positions::id: int,positions::position: chararray}
(1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
(2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
(3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
18. Operator Flatten
Ugly looking output from before:
expensive: {inside: {userid: int,bookid: long,price: double}}
({(101,1002L,10.09),(102,1002L,10.09)})
({(103,7122L,88.99)})
Use the FLATTEN operator
expensive = FOREACH bookbuys {
inside = FILTER logs BY price > 6.0;
GENERATE group, FLATTEN (inside);
}
expensive: {group: long,inside::userid: int,inside::bookid:
long,inside::price: double}
(1002L,101,1002L,10.09)
(1002L,102,1002L,10.09)
(7122L,103,7122L,88.99)
19. Operator Renaming in Foreach
All the columns have cumbersome names:
expensive: {group: long,inside::userid: int,inside::bookid:
long,inside::price: double}
Pick and rename:
expensive = FOREACH bookbuys {
inside = FILTER logs BY price > 6.0;
GENERATE group AS userid,
FLATTEN (inside.(bookid, price)) AS (bookid, price);
}
Kept the type!
Now easy to use:
expensive: {userid: long,bookid: long,price: double}
(1002L,1002L,10.09)
(1002L,1002L,10.09)
(7122L,7122L,88.99)
20. Operator Split
When input file mixes types or needs separation
$ cat enterexittimes
2010-05-10 12:55:12 user123 enter
2010-05-10 13:14:23 user456 enter
2010-05-10 13:16:53 user123 exit 23.79
2010-05-10 13:17:49 user456 exit 0.50
inandout = LOAD 'enterexittimes';
SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
enter1:
(2010-05-10 12:55:12,user123,enter)
(2010-05-10 13:14:23,user456,enter)
exit1:
(2010-05-10 13:16:53,user123,exit,23.79)
(2010-05-10 13:17:49,user456,exit,0.50)
21. Operator Split
If every line had the same schema we could specify it at load
time; in this case each branch needs its own FOREACH:
enter = FOREACH enter1 GENERATE
(chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
exit = FOREACH exit1 GENERATE
(chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray,
(double)$3 AS cost:double;
DESCRIBE enter;
DESCRIBE exit;
enter: {time: chararray,userid: chararray}
exit: {time: chararray,userid: chararray,cost: double}
22. Operator Sample, Limit
For testing purposes, SAMPLE both large inputs:
names1 = LOAD 'starnames' as (id: int, name: chararray);
names = SAMPLE names1 0.3;
positions1 = LOAD 'starpositions' as (id: int, position: chararray);
positions = SAMPLE positions1 0.3;
Each run returns a different random subset of rows:
(1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
LIMIT returns only the first N results. Use it with ORDER BY
to return the top results:
nameandpos1 = JOIN names BY id, positions BY id;
nameandpos2 = ORDER nameandpos1 BY names::id DESC;
nameandpos = LIMIT nameandpos2 2;
(3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
(2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
23. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
24. UDF
UDF: User Defined Function
Operates on single values or a group
Simple example: IsEmpty (a FilterFunc)
users = JOIN names BY id, addresses BY id;
D = FOREACH users GENERATE group,
  FLATTEN((IsEmpty(names::firstName) ? 'none' : names::firstName));
Working over an aggregate, e.g. COUNT:
users = JOIN names BY id, books BY buyerId;
D = FOREACH users GENERATE group, COUNT(books)
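Aggregates like COUNT operate on the inner bag produced by a GROUP; a minimal working sketch (aliases and file name are illustrative):

```pig
-- Sketch: count books per buyer; COUNT aggregates over the inner bag
books   = LOAD 'books' AS (buyerId: int, bookid: long);
bybuyer = GROUP books BY buyerId;
counts  = FOREACH bybuyer GENERATE group AS buyerId, COUNT(books);
```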
Working on two values:
distance1 = CROSS stars, stars;
distance = ...
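The slide is cut off here; one way the pairwise pattern might look, assuming a hypothetical Distance EvalFunc registered from a jar (note CROSS needs two distinct aliases, so the data is loaded twice):

```pig
-- Sketch: pairwise distances via CROSS plus a hypothetical UDF
REGISTER myudfs.jar;
stars1 = LOAD 'starpositions' AS (id: int, x: double, y: double);
stars2 = LOAD 'starpositions' AS (id: int, x: double, y: double);
pairs = CROSS stars1, stars2;
distance = FOREACH pairs GENERATE stars1::id, stars2::id,
    myudfs.Distance(stars1::x, stars1::y, stars2::x, stars2::y);
```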
25. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
26. LOAD and GROUP
logfiles = LOAD 'logs' AS (userid: int, bookid: long, price: double);
userinfo = LOAD 'users' AS (userid: int, name: chararray);
userpurchases = GROUP logfiles BY userid, userinfo BY userid;
DESCRIBE userpurchases;
DUMP userpurchases;
27. Inside {} are bags (unordered)
inside () are tuples (ordered list of fields)
report = FOREACH userpurchases GENERATE
FLATTEN(userinfo.name) AS name, group AS userid,
FLATTEN(SUM(logfiles.price)) AS cost;
bybigspender = ORDER report BY cost DESC;
DUMP bybigspender;
(Bob,103,94.94999999999999)
(Joe,101,16.05)
(Cindy,102,10.09)
28. Entering and exiting recorded in same file:
2010-05-10 12:55:12 user123 enter
2010-05-10 13:14:23 user456 enter
2010-05-10 13:16:53 user123 exit 23.79
2010-05-10 13:17:49 user456 exit 0.50
29. inandout = LOAD 'enterexittimes';
SPLIT inandout INTO enter1 IF $2 == 'enter',
exit1 IF $2 == 'exit';
enter = FOREACH enter1 GENERATE
(chararray)$0 AS time:chararray,
(chararray)$1 AS userid:chararray;
exit = FOREACH exit1 GENERATE
(chararray)$0 AS time:chararray,
(chararray)$1 AS userid:chararray,
(double)$3 AS cost:double;
30. combotimes = GROUP enter BY $1, exit BY $1;
purchases = FOREACH combotimes GENERATE
group AS userid,
FLATTEN(enter.$0) AS entertime,
FLATTEN(exit.$0) AS exittime,
FLATTEN(exit.$2);
DUMP purchases;