2. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
4. Why Pig? Tired of boilerplate
• Started off writing Mappers/Reducers in Java
• Fun at first
• Gets a little tedious
• Need to do more than one MR step
• Write own flow control
• Utility classes to pass parameters / input paths
• Go back and change a Reducer’s input type
• Did you change it in the Job setup?
• Processing two different input types in first job
5. Why Pig? Java MapReduce boilerplate example
• Typical use case: have two different input types
• log files (timestamps and userids)
• database table dump (userids and names)
• Want to combine the two together
• Relatively simple, but tedious
6. Why Pig? Java MapReduce boilerplate example
Need to handle two different output types, so you need a single
class that can handle both; designate each with a "tag":
Mapper<LongWritable,Text,TaggedKeyWritable,TaggedValueWritable>
Reducer<TaggedKeyWritable,TaggedValueWritable,Text,PurchaseInfoWritable>
Inside the Mapper, check (in setup() or run()) the Path of the
input to decide whether this is a log file or a database table:
if (((FileSplit) context.getInputSplit()).getPath().toString().contains("logfile")) {
  inputType = "LOGFILE"; } else { inputType = "DATABASE"; }
In the Reducer, check the tag and then combine:
if (key.getTag().equals("LOGFILE")) { LogEntry logEntry = value.getValue(); }
else if (key.getTag().equals("DATABASE")) { UserInfo userInfo = value.getValue(); }
context.write(userInfo.getId(), logEntry.getTime() + " " + userInfo.getName());
7. Where’s your shears?
"I was working on my thesis and realized I needed a reference.
I'd seen a post on comp.arch recently that cited a paper, so I
fired up gnus. While I was searching for the post, I came across
another post whose MIME encoding screwed up my ancient version
of gnus, so I stopped and downloaded the latest version of gnus.
8. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
9. Data Types
• From largest to smallest:
• Bag (relation / group)
• Tuple
• Field
• A bag is a collection of tuples, tuples have fields
10. Data Types Bag
$ cat logs
101 1002 10.09
101 8912 5.96
102 1002 10.09
103 8912 5.96
103 7122 88.99
$ cat groupbooks.pig
logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
DESCRIBE bookbuys;
DUMP bookbuys;
$ pig -x local groupbooks.pig
bookbuys: {group: long,logs: {userid: int,bookid: long,price: double}}
(1002L,{(101,1002L,10.09),(102,1002L,10.09)})
(7122L,{(103,7122L,88.99)})
(8912L,{(101,8912L,5.96),(103,8912L,5.96)})
Each output line is a tuple; its first field is the group key and its
second field is an inner bag whose tuples hold the original fields.
11. Data Types Tuple and Fields
$ cat booksexpensive.pig
logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
expensive = FOREACH bookbuys {
inside = FILTER logs BY price > 6.0;
GENERATE inside;
}
DESCRIBE expensive;
DUMP expensive;
$ pig -x local booksexpensive.pig
expensive: {inside: {userid: int,bookid: long,price: double}}
({(101,1002L,10.09),(102,1002L,10.09)})
({(103,7122L,88.99)})
({})
Here "inside" refers to the inner bag; the last inner bag is empty
because no purchase of that book cost more than 6.0.
Note: can always refer to fields as $0, $1, etc.
12. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
13. Operator Load
This will load all files under the logs/2010/05 directory
(or the logs/2010/05 file) and put them into clicklogs:
clicklogs = LOAD 'logs/2010/05';
Naming the fields in the tuple "userid" and "url" avoids having
to refer to them as $0 and $1:
clicklogs = LOAD 'logs/2010/05' AS (userid: int, url: chararray);
Note: no actual loading happens yet; nothing is executed
until a DUMP or STORE.
14. Operator Load
By default LOAD splits on the tab character (the same as the
key/value separator in MapReduce jobs). You can also specify
your own delimiter:
LOAD 'logs' USING PigStorage('~')
PigStorage implements LoadFunc -- implement this
interface to create your own loader, e.g. "RegExLoader"
from the Piggybank.
15. Operator Describe, Dump, and Store
"Describe" prints out that variable's schema:
DESCRIBE combotimes;
combotimes: {group: chararray,
  enter: {time: chararray,userid: chararray},
  exit: {time: chararray,userid: chararray,cost: double}}
To see output on the screen, type "DUMP varname":
DUMP namesandaddresses;
To output to a file / directory, use STORE:
STORE patienttrials INTO 'trials/2010';
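Like LOAD, STORE accepts a storage function, so you can choose the output delimiter as well (the path below reuses the slide's example; the comma delimiter is illustrative):

```pig
-- Sketch: write out comma-separated instead of the default tab
STORE patienttrials INTO 'trials/2010' USING PigStorage(',');
```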
17. Operator Join
Just like GROUP but flatter
$ cat starsandpositions2.pig
names = LOAD 'starnames' as (id: int, name: chararray);
positions = LOAD 'starpositions' as (id: int, position: chararray);
nameandpos = JOIN names BY id, positions BY id;
DESCRIBE nameandpos;
DUMP nameandpos;
nameandpos: {names::id: int,names::name: chararray,
positions::id: int,positions::position: chararray}
(1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
(2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
(3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
18. Operator Flatten
Ugly looking output from before:
expensive: {inside: {userid: int,bookid: long,price: double}}
({(101,1002L,10.09),(102,1002L,10.09)})
({(103,7122L,88.99)})
Use the FLATTEN operator
expensive = FOREACH bookbuys {
inside = FILTER logs BY price > 6.0;
GENERATE group, FLATTEN (inside);
}
expensive: {group: long,inside::userid: int,inside::bookid:
long,inside::price: double}
(1002L,101,1002L,10.09)
(1002L,102,1002L,10.09)
(7122L,103,7122L,88.99)
19. Operator Renaming in Foreach
All the columns have cumbersome names:
expensive: {group: long,inside::userid: int,inside::bookid:
long,inside::price: double}
Pick and rename:
expensive = FOREACH bookbuys {
inside = FILTER logs BY price > 6.0;
GENERATE group AS userid,
FLATTEN (inside.(bookid, price)) AS (bookid, price);
}
Kept the type!
Now easy to use:
expensive: {userid: long,bookid: long,price: double}
(1002L,1002L,10.09)
(1002L,1002L,10.09)
(7122L,7122L,88.99)
20. Operator Split
When input file mixes types or needs separation
$ cat enterexittimes
2010-05-10 12:55:12 user123 enter
2010-05-10 13:14:23 user456 enter
2010-05-10 13:16:53 user123 exit 23.79
2010-05-10 13:17:49 user456 exit 0.50
inandout = LOAD 'enterexittimes';
SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';
enter1:
(2010-05-10 12:55:12,user123,enter)
(2010-05-10 13:14:23,user456,enter)
exit1:
(2010-05-10 13:16:53,user123,exit,23.79)
(2010-05-10 13:17:49,user456,exit,0.50)
21. Operator Split
If every line had the same schema we could specify it at load
time; in this case each branch needs its own FOREACH:
enter = FOREACH enter1 GENERATE
(chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
exit = FOREACH exit1 GENERATE
(chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray,
(double)$3 AS cost:double;
DESCRIBE enter;
DESCRIBE exit;
enter: {time: chararray,userid: chararray}
exit: {time: chararray,userid: chararray,cost: double}
22. Operator Sample, Limit
For testing purposes, SAMPLE both large inputs:
names1 = LOAD 'starnames' as (id: int, name: chararray);
names = SAMPLE names1 0.3;
positions1 = LOAD 'starpositions' as (id: int, position: chararray);
positions = SAMPLE positions1 0.3;
Each run returns a different random subset of rows:
(1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
LIMIT returns only the first N results. Use it with ORDER BY
to return the top results:
nameandpos1 = JOIN names BY id, positions BY id;
nameandpos2 = ORDER nameandpos1 BY names::id DESC;
nameandpos = LIMIT nameandpos2 2;
(3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
(2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
23. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
24. UDF
UDF: User Defined Function
Operates on single values or a group
Simple example: IsEmpty (a FilterFunc)
users = JOIN names BY id, addresses BY id;
D = FOREACH users GENERATE group,
  FLATTEN((IsEmpty(names::firstName) ? 'none' : names::firstName));
Working over an aggregate, e.g. COUNT:
users = JOIN names BY id, books BY buyerId;
D = FOREACH users GENERATE group, COUNT(books)
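Aggregates like COUNT operate on the inner bag produced by a GROUP; a minimal working sketch (aliases and file name are illustrative):

```pig
-- Sketch: count books per buyer; COUNT aggregates over the inner bag
books   = LOAD 'books' AS (buyerId: int, bookid: long);
bybuyer = GROUP books BY buyerId;
counts  = FOREACH bybuyer GENERATE group AS buyerId, COUNT(books);
```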
Working on two values:
distance1 = CROSS stars, stars;
distance = ...
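The slide is cut off here; one way the pairwise pattern might look, assuming a hypothetical Distance EvalFunc registered from a jar (note CROSS needs two distinct aliases, so the data is loaded twice):

```pig
-- Sketch: pairwise distances via CROSS plus a hypothetical UDF
REGISTER myudfs.jar;
stars1 = LOAD 'starpositions' AS (id: int, x: double, y: double);
stars2 = LOAD 'starpositions' AS (id: int, x: double, y: double);
pairs = CROSS stars1, stars2;
distance = FOREACH pairs GENERATE stars1::id, stars2::id,
    myudfs.Distance(stars1::x, stars1::y, stars2::x, stars2::y);
```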
25. Agenda
1 Why Pig?
2 Data types
3 Operators
4 UDFs
5 Using Pig
26. LOAD and GROUP
logfiles = LOAD 'logs' AS (userid: int, bookid: long, price: double);
userinfo = LOAD 'users' AS (userid: int, name: chararray);
userpurchases = GROUP logfiles BY userid, userinfo BY userid;
DESCRIBE userpurchases;
DUMP userpurchases;
27. Inside {} are bags (unordered)
inside () are tuples (ordered list of fields)
report = FOREACH userpurchases GENERATE
FLATTEN(userinfo.name) AS name, group AS userid,
FLATTEN(SUM(logfiles.price)) AS cost;
bybigspender = ORDER report BY cost DESC;
DUMP bybigspender;
(Bob,103,94.94999999999999)
(Joe,101,16.05)
(Cindy,102,10.09)
28. Entering and exiting recorded in same file:
2010-05-10 12:55:12 user123 enter
2010-05-10 13:14:23 user456 enter
2010-05-10 13:16:53 user123 exit 23.79
2010-05-10 13:17:49 user456 exit 0.50
29. inandout = LOAD 'enterexittimes';
SPLIT inandout INTO enter1 IF $2 == 'enter',
exit1 IF $2 == 'exit';
enter = FOREACH enter1 GENERATE
(chararray)$0 AS time:chararray,
(chararray)$1 AS userid:chararray;
exit = FOREACH exit1 GENERATE
(chararray)$0 AS time:chararray,
(chararray)$1 AS userid:chararray,
(double)$3 AS cost:double;
30. combotimes = GROUP enter BY $1, exit BY $1;
purchases = FOREACH combotimes GENERATE
group AS userid,
FLATTEN(enter.$0) AS entertime,
FLATTEN(exit.$0) AS exittime,
FLATTEN(exit.$2);
DUMP purchases;