Apache Pig
陳威宇
Agenda
• What is Apache Pig
• How to Setup
• Tutorial Examples
PIG Introduction
• Apache Pig is a platform for analyzing large data sets
that consists of a high-level language for expressing
data analysis programs
• Pig generates and compiles Map/Reduce programs on the fly
(Figure: Pig Latin scripts pass through parse, compile, optimize, and plan stages inside Pig before running as Map/Reduce jobs.)
Map-Reduce made simple with Pig!?
• Apache Pig is a high-level query language for processing large-scale data
• Well suited to working with large, semi-structured data sets
• Compared with writing large-scale data processing programs in Java or C++, it is claimed to be roughly 16x easier, with about 20x less code for the same result
• Pig components
– Pig Shell (Grunt)
– Pig Language (Pig Latin)
– Libraries (Piggy Bank)
– UDF: user-defined functions
figure Source : http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing
When the Elephant Meets the Pig ( setup )
• Download and unpack
• Edit ~/.bashrc
• Start the Pig shell
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PIG_HOME=/home/hadoop/pig
export PATH=$PATH:$PIG_HOME/bin
cd /home/hadoop
wget http://archive.cloudera.com/cdh5/cdh/5/pig-0.12.0-cdh5.3.2.tar.gz
tar -zxvf pig-0.12.0-cdh5.3.2.tar.gz
mv pig-0.12.0-cdh5.3.2 pig
$ pig
grunt>
grunt> ls /
hdfs://master:9000/hadoop <dir>
hdfs://master:9000/tmp <dir>
hdfs://master:9000/user <dir>
grunt>
Programming Even a Pig Can Do
Function | Commands
Load | LOAD
Store | STORE
Data processing | REGEX_EXTRACT, FILTER, FOREACH, GROUP, JOIN, UNION, SPLIT, …
Aggregation | AVG, COUNT, MAX, MIN, SIZE, …
Math | ABS, RANDOM, ROUND, …
String handling | INDEXOF, SUBSTRING, REGEX_EXTRACT, …
Debug | DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
HDFS | cat, ls, cp, mkdir, …
$ pig -x local
grunt> A = LOAD 'file1' AS (x, y, z);
grunt> B = FILTER A BY y > 10000;
grunt> STORE B INTO 'output';
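The three-line grunt session above is a complete load-filter-store pipeline. Its row-level semantics can be sketched in plain Python (the file name and sample rows below are hypothetical, just to make the filter visible):

```python
# Hypothetical rows standing in for 'file1' with schema (x, y, z).
file1 = [
    ("a", 25000, 1),
    ("b", 9000, 2),
    ("c", 12000, 3),
]

# B = FILTER A BY y > 10000;  keep only rows whose y field exceeds 10000
B = [row for row in file1 if row[1] > 10000]

# STORE B INTO 'output';  here we simply collect the surviving rows
assert B == [("a", 25000, 1), ("c", 12000, 3)]
```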
 Brute-force the job with shell scripts and give up on Hadoop
 Use PIG
 Study hard, day and night…
Exercise 1:
• Scenario:
– The boss asked me to compute the average working hours of all employees in the organization. I obtained the punch-clock log files for all of Taiwan, plus the employee-id mapping table from HR. The data is large and plentiful, so I thought of feeding it into Hadoop's HDFS, .. and then
• Problem:
– To write MapReduce, I would have to start learning Java, object orientation, the Hadoop API, … @@
• Solutions:
Before the makeover: the MapReduce code
Input tables (employees: nm, dp, id / punch log: id, dt, hr):
nm dp id | id dt hr
劉 北 A1 | A1 7/7 13
李 中 B1 | A1 7/8 12
王 中 B2 | A1 7/9 4
Hand-written Java Map-Reduce code turns these into:
A1 劉 北 7/8 13
A1 劉 北 7/9 12
A1 劉 北 Jul 12.5
After the makeover with Pig
Final result: 北 A1 劉 12.5
The Pig Latin below maps onto this logical plan (operator: output schema):
LOAD: (nm, dp, id)
LOAD: (id, dt, hr)
FILTER: (id, dt, hr)
JOIN: (nm, dp, id, id, dt, hr)
GROUP: (group, {(nm, dp, id, id, dt, hr)})
FOREACH: (group, …., AVG(hr))
STORE: (dp, group, nm, hr)
A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
B = LOAD 'file2.txt' USING PigStorage(',') AS (id, dt, hr);
C = FILTER B BY hr > 8;
D = JOIN C BY id, A BY id;
E = GROUP D BY A::id;
F = FOREACH E GENERATE $1.dp, group, $1.nm, AVG($1.hr);
STORE F INTO '/tmp/pig_output/';
Tips: validate with a small data sample in pig -x local mode first;
check each line with DUMP or ILLUSTRATE to confirm it is correct
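The pipeline above can be simulated step by step in Python to see why the single output row is (北, A1, 劉, 12.5). The in-memory rows below mirror the sample tables on the slide; this is an illustrative sketch, not how Pig actually executes:

```python
from collections import defaultdict

# In-memory stand-ins for file1.txt (nm, dp, id) and file2.txt (id, dt, hr).
employees = [("劉", "北", "A1"), ("李", "中", "B1"), ("王", "中", "B2")]
punches   = [("A1", "7/7", 13), ("A1", "7/8", 12), ("A1", "7/9", 4)]

# C = FILTER B BY hr > 8;
C = [p for p in punches if p[2] > 8]

# D = JOIN C BY id, A BY id;
emp_by_id = {e[2]: e for e in employees}
D = [(emp_by_id[p[0]], p) for p in C if p[0] in emp_by_id]

# E = GROUP D BY A::id;
E = defaultdict(list)
for emp, punch in D:
    E[emp[2]].append((emp, punch))

# F = FOREACH E GENERATE dp, group, nm, AVG(hr);
F = []
for emp_id, rows in E.items():
    emp = rows[0][0]
    avg_hr = sum(p[2] for _, p in rows) / len(rows)
    F.append((emp[1], emp_id, emp[0], avg_hr))

assert F == [("北", "A1", "劉", 12.5)]
```

Only the two A1 punches with hr > 8 (13 and 12) survive the filter, so the group average is 12.5.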
Exercise 1: Hands-on
 cd ~
 git clone https://github.com/waue0920/hadoop_example.git
 cd ~/hadoop_example/pig/ex1
 pig -x local -f exc1.pig
 cat /tmp/pig_output/part-r-00000
Exercise: run pig -x mapreduce and execute each line of exc1.pig one at a time, using dump and illustrate to inspect the results, e.g.:
grunt> A = LOAD 'file1.txt' USING PigStorage(',') AS (nm, dp, id);
grunt> DUMP A;
grunt> ILLUSTRATE A;
Q: Is there room to improve the result?
Q: How would you improve it?
Advanced
Simple Types | Description | Example
int | Signed 32-bit integer | 10
long | Signed 64-bit integer | Data: 10L or 10l; Display: 10L
float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray | Character array (string) in Unicode UTF-8 format | hello world
bytearray | Byte array (blob) |
boolean | boolean | true/false (case insensitive)
datetime | datetime | 1970-01-01T00:00:00.000+00:00
biginteger | Java BigInteger | 2E+11
bigdecimal | Java BigDecimal | 33.45678332
Complex Types | Description | Example
field | A piece of data | John
tuple | An ordered set of fields | (19,2)
bag | A collection of tuples | {(19,2), (18,1)}
map | A set of key-value pairs | [open#apache]
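The complex types map naturally onto Python values, which can help when prototyping a Pig data flow locally. This is an illustrative analogy only, not a Pig API:

```python
field = "John"                 # field: a single piece of data
tup   = (19, 2)                # tuple: an ordered set of fields
bag   = {(19, 2), (18, 1)}     # bag: a collection of tuples (unordered, like a set)
mp    = {"open": "apache"}     # map: a set of key-value pairs

assert tup[0] == 19
assert (18, 1) in bag
assert mp["open"] == "apache"
```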
Advanced
• cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
• A = LOAD 'data' AS
( t1:tuple(t1a:int,t1b:int,t1c:int),
t2:tuple(t2a:int,t2b:int,t2c:int)
);
Dumping A gives:
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
• X = FOREACH A GENERATE t1.t1a,t2.$0;
Dumping X gives:
(3,4)
(1,3)
(2,9)
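The projection X = FOREACH A GENERATE t1.t1a, t2.$0 pulls the first field out of each nested tuple. A quick Python check of the same transformation:

```python
# Rows of A: pairs of nested tuples, as loaded from 'data' above.
A = [((3, 8, 9), (4, 5, 6)),
     ((1, 4, 7), (3, 7, 5)),
     ((2, 5, 8), (9, 5, 8))]

# t1.t1a is the first field of the first tuple; t2.$0 the first field of the second.
X = [(t1[0], t2[0]) for (t1, t2) in A]

assert X == [(3, 4), (1, 3), (2, 9)]
```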
Data Types and More
• Complex Types: Bags, Tuples, Fields, Map
• Simple Types: int, long, float, double, chararray, bytearray, boolean, datetime, biginteger, bigdecimal
Relational Operators
• ASSERT, COGROUP, CROSS, CUBE, DEFINE, DISTINCT, FILTER, FOREACH, GROUP, IMPORT, JOIN (inner), JOIN (outer), LIMIT, LOAD, MAPREDUCE, ORDER BY, RANK, SAMPLE, SPLIT, STORE, STREAM, UNION
UDF Statements
• DEFINE, REGISTER
Exercise 2
• Description: using an array of numbers, observe Pig's syntax and how the results change
• Techniques used: FILTER .. BY, GROUP .. BY, FOREACH .. GENERATE, COGROUP
• Input/output files: myfile.txt, B.txt
See: https://wiki.apache.org/pig/PigLatin (last edited 2010)
Exercise 2
 cd ~/hadoop_example/pig/ex2
 hadoop fs -put myfile.txt B.txt ./
 pig -x mapred
> A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
> B = LOAD 'B.txt'; dump A; dump B;
> Y = FILTER A BY f1 == '8'; dump Y;
> Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1)); dump Y;
> X = GROUP A BY f1; dump X;
> X = FOREACH A GENERATE f1, f2; dump X;
> X = FOREACH A GENERATE f1+f2 AS sumf1f2; dump X;
> Y = FILTER X BY sumf1f2 > 5.0; dump Y;
> C = COGROUP A BY $0, B BY $0; dump C;
> C = COGROUP A BY $0 INNER, B BY $0 INNER; dump C;
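COGROUP differs from JOIN in that it keeps one bag per input relation under each key, and INNER drops the keys whose bag is empty on that side. A Python sketch with hypothetical rows (the real myfile.txt/B.txt contents are in the repository):

```python
# Hypothetical rows; the first field is the grouping key.
A = [("8", "3", "4"), ("4", "2", "1"), ("8", "9", "1")]
B = [("8", "x"), ("2", "y")]

keys = {r[0] for r in A} | {r[0] for r in B}

# C = COGROUP A BY $0, B BY $0;  one bag from A and one from B per key
cogrouped = {k: ([r for r in A if r[0] == k],
                 [r for r in B if r[0] == k]) for k in keys}

# COGROUP ... INNER, ... INNER keeps only keys with non-empty bags on both sides
inner = {k: bags for k, bags in cogrouped.items() if bags[0] and bags[1]}

assert set(cogrouped) == {"8", "4", "2"}
assert set(inner) == {"8"}
```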
Exercise 3
• Description: from a log of <userid, time, query_term> records, analyze the keywords users favor
• Techniques used: UDF, DISTINCT, FLATTEN, ORDER
• Source: pigtutorial.tar.gz
• Input / output
See: https://cwiki.apache.org/confluence/display/PIG/PigTutorial
Exercise 3
 cd ~/hadoop_example/pig/ex3
 pig -x local
> REGISTER ./tutorial.jar;
> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
> clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
> clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) AS query;
> houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) AS hour, query;
> ngramed1 = FOREACH houred GENERATE user, hour, FLATTEN(org.apache.pig.tutorial.NGramGenerator(query)) AS ngram;
> ngramed2 = DISTINCT ngramed1;
> hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
> hour_frequency2 = FOREACH hour_frequency1 GENERATE FLATTEN($0), COUNT($1) AS count;
> uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
> uniq_frequency2 = FOREACH uniq_frequency1 GENERATE FLATTEN($0), FLATTEN(org.apache.pig.tutorial.ScoreGenerator($1));
> uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 AS hour, $0 AS ngram, $2 AS score, $3 AS count, $4 AS mean;
> filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
> ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
> STORE ordered_uniq_frequency INTO 'result' USING PigStorage();
 pig -x local -f script1-local.pig
 cat result/part-r-00000
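The core of the script, DISTINCT followed by GROUP .. BY (ngram, hour) and COUNT, is easy to sanity-check in Python. The sample rows below are hypothetical, standing in for the output of the tutorial's NGramGenerator UDF:

```python
from collections import Counter

# (user, hour, ngram) rows; using a set plays the role of DISTINCT.
ngramed = {("u1", "07", "free"), ("u1", "07", "music"),
           ("u2", "07", "free"), ("u3", "18", "free")}

# hour_frequency2: COUNT of distinct rows per (ngram, hour) group
hour_frequency = Counter((ngram, hour) for _, hour, ngram in ngramed)

assert hour_frequency[("free", "07")] == 2
assert hour_frequency[("free", "18")] == 1
```

Because DISTINCT runs first, a user repeating the same query in the same hour is counted only once per (ngram, hour).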
UDF
package myudfs;
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
    // Called once per input tuple; returns the first field upper-cased.
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw WrappedIOException.wrap("Caught exception processing input row ", e);
        }
    }
}

grunt> register myudfs.jar
grunt> A = load 'student_data' using PigStorage(',') as (name:chararray, age:int, gpa:double);
grunt> B = FOREACH A GENERATE myudfs.UPPER(name);
grunt> dump B;
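The exec method runs once per input tuple. Its null/empty guard and upper-casing can be mirrored in a few lines of Python for quick reasoning (a sketch of the per-record semantics, not Pig's actual execution model):

```python
def upper_exec(input_tuple):
    # Mirrors UPPER.exec: null or empty input yields null, else upper-case field 0.
    if input_tuple is None or len(input_tuple) == 0:
        return None
    return str(input_tuple[0]).upper()

assert upper_exec(("john",)) == "JOHN"
assert upper_exec(()) is None
assert upper_exec(None) is None
```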
Reference
• Pig documentation
– http://pig.apache.org/docs/r0.12.0/basic.html
• Pig reference slides
– http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing
• Pig example reference
– https://cwiki.apache.org/confluence/display/PIG/PigTutorial

More Related Content

What's hot

Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceHortonworks
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...Sumeet Singh
 
データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)Takumi Asai
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitTyler Treat
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Nathan Bijnens
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataJetlore
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answerstechieguy85
 

What's hot (20)

Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Large Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduceLarge Scale Math with Hadoop MapReduce
Large Scale Math with Hadoop MapReduce
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
 
データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)データ解析技術入門(Hadoop編)
データ解析技術入門(Hadoop編)
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profit
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!Hadoop Pig: MapReduce the easy way!
Hadoop Pig: MapReduce the easy way!
 
Hadoop2.2
Hadoop2.2Hadoop2.2
Hadoop2.2
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive DataSpark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Interview Question and Answers
Hadoop  Interview Question and AnswersHadoop  Interview Question and Answers
Hadoop Interview Question and Answers
 

Similar to Hadoop pig

Pig: Data Analysis Tool in Cloud
Pig: Data Analysis Tool in Cloud Pig: Data Analysis Tool in Cloud
Pig: Data Analysis Tool in Cloud Jianfeng Zhang
 
power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkbhargavi804095
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
PigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptxPigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptxRahul Borate
 
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsRAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsHyunjung Park
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewDan Morrill
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopSages
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prodYunong Xiao
 
A CTF Hackers Toolbox
A CTF Hackers ToolboxA CTF Hackers Toolbox
A CTF Hackers ToolboxStefan
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of HadoopAsif Ali
 
Pig_Presentation
Pig_PresentationPig_Presentation
Pig_PresentationArjun Shah
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 

Similar to Hadoop pig (20)

Pig: Data Analysis Tool in Cloud
Pig: Data Analysis Tool in Cloud Pig: Data Analysis Tool in Cloud
Pig: Data Analysis Tool in Cloud
 
power point presentation on pig -hadoop framework
power point presentation on pig -hadoop frameworkpower point presentation on pig -hadoop framework
power point presentation on pig -hadoop framework
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
 
PigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptxPigHive presentation and hive impor.pptx
PigHive presentation and hive impor.pptx
 
PigHive.pptx
PigHive.pptxPigHive.pptx
PigHive.pptx
 
Hadoop Pig
Hadoop PigHadoop Pig
Hadoop Pig
 
PigHive.pptx
PigHive.pptxPigHive.pptx
PigHive.pptx
 
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsRAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
 
Pig latin
Pig latinPig latin
Pig latin
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overview
 
Wprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache HadoopWprowadzenie do technologi Big Data i Apache Hadoop
Wprowadzenie do technologi Big Data i Apache Hadoop
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prod
 
A CTF Hackers Toolbox
A CTF Hackers ToolboxA CTF Hackers Toolbox
A CTF Hackers Toolbox
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of Hadoop
 
Pig_Presentation
Pig_PresentationPig_Presentation
Pig_Presentation
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 

More from Wei-Yu Chen

Clipper@datacon.2019.tw
Clipper@datacon.2019.twClipper@datacon.2019.tw
Clipper@datacon.2019.twWei-Yu Chen
 
大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術Wei-Yu Chen
 
加速開發! 在Windows開發hadoop程式,直接運行 map/reduce
加速開發! 在Windows開發hadoop程式,直接運行 map/reduce加速開發! 在Windows開發hadoop程式,直接運行 map/reduce
加速開發! 在Windows開發hadoop程式,直接運行 map/reduceWei-Yu Chen
 
Hadoop 2.0 之古往今來
Hadoop 2.0 之古往今來Hadoop 2.0 之古往今來
Hadoop 2.0 之古往今來Wei-Yu Chen
 
Hadoop ecosystem - hadoop 生態系
Hadoop ecosystem - hadoop 生態系Hadoop ecosystem - hadoop 生態系
Hadoop ecosystem - hadoop 生態系Wei-Yu Chen
 
Hadoop 0.20 程式設計
Hadoop 0.20 程式設計Hadoop 0.20 程式設計
Hadoop 0.20 程式設計Wei-Yu Chen
 
Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計Wei-Yu Chen
 
Cloudslam09:Building a Cloud Computing Analysis System for Intrusion Detection
Cloudslam09:Building a Cloud Computing Analysis System for  Intrusion DetectionCloudslam09:Building a Cloud Computing Analysis System for  Intrusion Detection
Cloudslam09:Building a Cloud Computing Analysis System for Intrusion DetectionWei-Yu Chen
 

More from Wei-Yu Chen (10)

Clipper@datacon.2019.tw
Clipper@datacon.2019.twClipper@datacon.2019.tw
Clipper@datacon.2019.tw
 
大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術大資料趨勢介紹與相關使用技術
大資料趨勢介紹與相關使用技術
 
加速開發! 在Windows開發hadoop程式,直接運行 map/reduce
加速開發! 在Windows開發hadoop程式,直接運行 map/reduce加速開發! 在Windows開發hadoop程式,直接運行 map/reduce
加速開發! 在Windows開發hadoop程式,直接運行 map/reduce
 
Hadoop sqoop
Hadoop sqoop Hadoop sqoop
Hadoop sqoop
 
Hadoop hive
Hadoop hiveHadoop hive
Hadoop hive
 
Hadoop 2.0 之古往今來
Hadoop 2.0 之古往今來Hadoop 2.0 之古往今來
Hadoop 2.0 之古往今來
 
Hadoop ecosystem - hadoop 生態系
Hadoop ecosystem - hadoop 生態系Hadoop ecosystem - hadoop 生態系
Hadoop ecosystem - hadoop 生態系
 
Hadoop 0.20 程式設計
Hadoop 0.20 程式設計Hadoop 0.20 程式設計
Hadoop 0.20 程式設計
 
Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計Hadoop Map Reduce 程式設計
Hadoop Map Reduce 程式設計
 
Cloudslam09:Building a Cloud Computing Analysis System for Intrusion Detection
Cloudslam09:Building a Cloud Computing Analysis System for  Intrusion DetectionCloudslam09:Building a Cloud Computing Analysis System for  Intrusion Detection
Cloudslam09:Building a Cloud Computing Analysis System for Intrusion Detection
 

Recently uploaded

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....ShaimaaMohamedGalal
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 

Recently uploaded (20)

Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Clustering techniques data mining book ....
Clustering techniques data mining book ....Clustering techniques data mining book ....
Clustering techniques data mining book ....
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 

Hadoop pig

  • 10. 練習一 : 實作  cd ~ git clone https://github.com/waue0920/hadoop_example.git  cd ~/hadoop_example/pig/ex1  pig -x local -f exc1.pig  cat /tmp/pig_output/part-r-00000 練習 : 執行 pig -x mapreduce,將 exc1.pig 每一行單獨執行,並 搭配 dump , illustrate 來看結果,如 : Grunt> A = LOAD 'file1.txt' using PigStorage(',') AS (nm, dp, id); Grunt> Dump A Grunt> Illustrate A Q : result 是否有改進空間 ? Q : 如何改進 ?
  • 11. 進階 Simple Types Description Example int Signed 32-bit integer 10 long Signed 64-bit integer Data: 10L or 10l Display: 10L float 32-bit floating point Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F Display: 10.5F or 1050.0F double 64-bit floating point Data: 10.5 or 10.5e2 or 10.5E2 Display: 10.5 or 1050.0 chararray Character array (string) in Unicode UTF-8 format hello world bytearray Byte array (blob) boolean boolean true/false (case insensitive) datetime datetime 1970-01- 01T00:00:00.000+00:00 biginteger Java BigInteger 2E+11 bigdecimal Java BigDecimal 33.45678332 Complex Types Description Example Fields A piece of data John tuple An ordered set of fields. (19,2) bag An collection of tuples. {(19,2), (18,1)} map A set of key value pairs. [open#apache]
  • 11. Advanced: data types
Simple types:
  int: signed 32-bit integer, e.g. 10
  long: signed 64-bit integer; written 10L or 10l, displayed as 10L
  float: 32-bit floating point; written 10.5F, 10.5f, 10.5e2f or 10.5E2F, displayed as 10.5F or 1050.0F
  double: 64-bit floating point; written 10.5, 10.5e2 or 10.5E2, displayed as 10.5 or 1050.0
  chararray: character array (string) in Unicode UTF-8 format, e.g. hello world
  bytearray: byte array (blob)
  boolean: true/false (case insensitive)
  datetime: e.g. 1970-01-01T00:00:00.000+00:00
  biginteger: Java BigInteger, e.g. 2E+11
  bigdecimal: Java BigDecimal, e.g. 33.45678332
Complex types:
  field: a piece of data, e.g. John
  tuple: an ordered set of fields, e.g. (19,2)
  bag: a collection of tuples, e.g. {(19,2),(18,1)}
  map: a set of key/value pairs, e.g. [open#apache]
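As a rough mental model (this is Python, not Pig), the complex types map onto familiar structures; the values are the same examples as in the table above:

```python
# tuple: an ordered set of fields -> a Python tuple
t = (19, 2)

# bag: a collection of tuples (unordered, duplicates allowed) -> modeled as a list of tuples
bag = [(19, 2), (18, 1)]

# map: a set of key/value pairs, written [open#apache] in Pig -> a Python dict
m = {"open": "apache"}

# A relation is a bag of tuples; fields are referenced by position ($0, $1) or by name
relation = [("John", 19, 2), ("Mary", 18, 1)]
first_fields = [row[0] for row in relation]  # like: FOREACH relation GENERATE $0
print(first_fields)  # ['John', 'Mary']
```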
  • 12. Advanced: nested tuples
$ cat data
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)
grunt> A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int), t2:tuple(t2a:int,t2b:int,t2c:int));
grunt> DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))
grunt> X = FOREACH A GENERATE t1.t1a, t2.$0;
grunt> DUMP X;
(3,4)
(1,3)
(2,9)
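The projection t1.t1a, t2.$0 pulls the first field out of each nested tuple. The same step in Python terms (a loose analogy, not Pig itself):

```python
# Each row of A is a pair of 3-field tuples, as in the DUMP output of the slide
A = [((3, 8, 9), (4, 5, 6)),
     ((1, 4, 7), (3, 7, 5)),
     ((2, 5, 8), (9, 5, 8))]

# X = FOREACH A GENERATE t1.t1a, t2.$0 -> first field of each inner tuple
X = [(t1[0], t2[0]) for (t1, t2) in A]
print(X)  # [(3, 4), (1, 3), (2, 9)]
```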
  • 16. 練習二 • 說明 : 從數字陣列中,觀察 pig 的語法,以及結果的 變化 • 使用技術 : filter .. by, foreach .. by, group .. by, foreach .. generate, cogroup • Input/ output See : https://wiki.apache.org/pig/PigLatin (last edited 2010) myfile.txt B.txt
  • 16. Exercise 2
Description: starting from small arrays of numbers, observe Pig Latin syntax and how each statement changes the result.
Techniques used: FILTER .. BY, GROUP .. BY, FOREACH .. GENERATE, COGROUP
Input files: myfile.txt, B.txt
Input/output: see https://wiki.apache.org/pig/PigLatin (last edited 2010)
  • 18. 練習三 • 說明 : 從 <userid, time, query_term> 的記錄檔, 做出使用者喜愛的關鍵字分析 • 使用技術 : UDF, DISTINCT, FLATTEN, ORDER • Source pigtutorial.tar.gz • Input / output See : https://cwiki.apache.org/confluence/display/PIG/PigTutorial
  • 18. Exercise 3
Description: from a log of <userid, time, query_term> records, analyze which search keywords users favor.
Techniques used: UDF, DISTINCT, FLATTEN, ORDER
Source: pigtutorial.tar.gz
Input/output: see https://cwiki.apache.org/confluence/display/PIG/PigTutorial
  • 20. UDF package myudfs; import java.io.IOException; import org.apache.pig.EvalFunc; import org.apache.pig.data.Tuple; import org.apache.pig.impl.util.WrappedIOException; public class UPPER extends EvalFunc<String> { public String exec(Tuple input) throws IOException { if (input == null || input.size() == 0) return null; try{ String str = (String)input.get(0); return str.toUpperCase(); }catch(Exception e){ throw WrappedIOException.wrap("Caught exception p rocessing input row ", e); }}} grunt> register myudfs.jar grunt> A = load 'student_data' using PigStorage(',') as (name:chararray, age:int,gpa:double); grunt> B = FOREACH A GENERATE myudfs.UPPER(name); grunt> dump B;
  • 21. Reference • Pig 說明 – http://pig.apache.org/docs/r0.12.0/basic.html • Pig 參考投影片 – http://www.slideshare.net/ydn/hadoop-yahoo- internet-scale-data-processing • Pig 範例參考 – https://cwiki.apache.org/confluence/display/PIG/Pi gTutorial

Editor's Notes

  1. Source: http://www.slideshare.net/hbshashidhar/apache-pig-v-62?next_slideshow=1
  2. http://pig.apache.org/docs/r0.12.0/basic.html
  3. http://pig.apache.org/docs/r0.12.0/basic.html
  4. http://www.slideshare.net/subhaskghosh/04-pig-data-operations
  5. http://www.slideshare.net/subhaskghosh/04-pig-data-operations
  6. Statements and outputs for Exercise 2:
A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1,f2,f3);
B = LOAD 'B.txt';
Y = FILTER A BY f1 == '8';
Y = FILTER A BY (f1 == '8') OR (NOT (f2+f3 > f1));
X = GROUP A BY f1;
output:
(1,{(1,2,3)})
(4,{(4,3,3),(4,2,1)})
(7,{(7,2,5)})
(8,{(8,4,3),(8,3,4)})
Projection:
X = FOREACH A GENERATE f1, f2;
output: (1,2) (4,2) (8,3) (4,3) (7,2) (8,4)
X = FOREACH A GENERATE f1+f2 AS sumf1f2;
Y = FILTER X BY sumf1f2 > 5.0;
output of Y: (6.0) (11.0) (7.0) (9.0) (12.0)
C = COGROUP A BY $0 INNER, B BY $0 INNER;
output:
(1,{(1,2,3)},{(1,3)})
(4,{(4,3,3),(4,2,1)},{(4,9),(4,6)})
(8,{(8,4,3),(8,3,4)},{(8,9)})