SlideShare a Scribd company logo
1 of 22
Pig : Data Analysis Tool in the Cloud  Jeff Zhang zjffdu@gmail.com Committer  of Pig in ASF
Agenda Background What is Pig Brief introduction of Pig internals Demo Q/A
Data Explosion Web 2.0 ,[object Object],[object Object]
Then, Pig’s Coming
What is Pig  Apache Pig is a platform for analyzing large data sets that consists of a high-level language (PigLatin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.  Ease of programming Optimization opportunities Extensibility Built upon Hadoop
A simple example of Pig-Latin  1291950309812, http://snda.com/page_1  1291950309822, http://snda.com/page_2     1291950309832, http://snda.com/page_3 …. ,[object Object],raw_data = load   '/java_one/pv'   UsingPigStorage(‘,')                     as   (time_stamp : long,   url :  chararray);pages = foreachraw_datagenerateurl;pages = grouppagesbyurl;pages = foreachpagesgenerategroupasurl,  COUNT(pages.url)  aspv; ,[object Object],pages = orderpages  bypvdesc;top10 = limitpages  10;dumptop10;
Operators in Pig-Latin Load   - a = load ‘data’ usingPigStorage(‘’)  as (f1:int ,f2:double,f3:chararray) Store  - store a into ‘/test/output’ usingPigStorage(‘,’)  Dump - dump a Filter  - b = filter a by f1 > 0 and f2 == ‘java_one’ Foreach - b = foreach a generate  f1, f3 Group  - b= group a by f3; Join	- b = Join a by f1, b by f1; Describe	- describe b; ….
Data Structure in Pig Cell   field in database -  Primitive types: int, long, float, double, bytearray, chararrar,nul -  Complex types:  map, tuple, databag Tuple row (1,  1.2,  “java”) DataBag table or view  { (1, 1.2, “java”),  (2,2.3, “c++”) ,  (3,4.5,”c”) }
How to use Pig Grunt (Interactive Shell) Java API Other languages (in future)
Architecture of Pig Grunt (Interactive shell) PigServer  (Java API)   Parser   (PigLatinLogicalPlan) PigContext Optimizer   (LogicalPlan LogicalPlan) Compiler  (LogicalPlan PhysiclaPlan  MapReducePlan) ExecutionEngine Hadoop
Three basic operations of Pig Group by Join Order
How Pig do Group by Data Source           Split               Mapper         Partition          Reducer (A,1) (B,2) (C,3) (A,1) (B,2) (C,3) (B,4) (B,5) (C,6) (A,7) (E,8) (D,9) (A,{(A,1),(A,7)}) (C,{(C,3),(C,6)}) (E,{(E,8)}) (B,4) (B,5) (C,6) (B,{(B,2),(B,4),(B,5)}) (D,{(D,9)}) (A,7) (E,8) (D,9)
How Pig do Join Data Source           Split              Mapper         Partition          Reducer (1,A1) (4,A4) (3,A3) (5,A5) (2,A2) (1,A1) (4,A4) (5,B5) (1,B1) ((1,A1),(1,B1)) ((3,A3),(3,B3)) ((5,A5),(5,B5)) (3,A3) (5,A5) (3,B3) (2,B2) (5,B5) (1,B1) (3,B3) (2,B2) (4,B4) ((2,A2)(2,B2)) ((4,B4),(4,B4)) (2,A2) (4,B4)
How Pig do Sort Data Source          Split       Mapper         Range Partition        Reducer (100) (200) (900) (50) (100) (200) (300) (400) (100) (200) (900) (50) (600) (800) (300) (400) (50) (600) (800) (600) (800) (300) (400)
UDF (User-Defined-Function) register myudf.jar; raw_data=  load   ‘/java_one/udf’   as  (name:chararray); firstnames  =  foreachraw_datageneratemyudf.FirstName (name);  storefirstnamesinto   ‘/java_one/udf_output’; public class  FirstNameextendsEvalFunc<String>{     @Override     public String exec(Tuple input) throwsIOException {         String name=input.get(0).toString(); …. returnfirstname; } }
What Storage Pig Supports HDFS Plain Text Binary format Customized format (XML, JSON, Protobuf,  Thrift…) RDBMS(DBStorage) Cassandra (CassandraStorage) HBase(HBaseStorage)
What fields can Pig be applied  Data Analysis Text Processing ETL Machine Learning
Who’s using Pig More:	 http://wiki.apache.org/pig/PoweredBy
References http://pig.apache.org  (Pig official site) http://hadoop.apache.org  (Hadoop official site) https://github.com/zjffdu/RAF-PIG (Rich API for Pig)
Demo
Thank you ! 				Q&A
Pig: Data Analysis Tool in Cloud

More Related Content

What's hot

Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0Simon Elliston Ball
 
Memory efficient pytorch
Memory efficient pytorchMemory efficient pytorch
Memory efficient pytorchHyungjoo Cho
 
Wasserstein GAN Tfug2017 07-12
Wasserstein GAN Tfug2017 07-12Wasserstein GAN Tfug2017 07-12
Wasserstein GAN Tfug2017 07-12Yuta Kashino
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text filesEdwin de Jonge
 
Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Marcel Caraciolo
 
機械学習によるデータ分析 実践編
機械学習によるデータ分析 実践編機械学習によるデータ分析 実践編
機械学習によるデータ分析 実践編Ryota Kamoshida
 
SociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisSociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisDataWorks Summit
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014Edwin de Jonge
 
深層学習とベイズ統計
深層学習とベイズ統計深層学習とベイズ統計
深層学習とベイズ統計Yuta Kashino
 
TF.data & Eager Execution
TF.data & Eager ExecutionTF.data & Eager Execution
TF.data & Eager ExecutionModulabs
 
Phil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonPhil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonRoss McDonald
 
tf.data: TensorFlow Input Pipeline
tf.data: TensorFlow Input Pipelinetf.data: TensorFlow Input Pipeline
tf.data: TensorFlow Input PipelineAlluxio, Inc.
 
Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Phillip Trelford
 
Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike
 
R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the CloudDataMine Lab
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationGlobalLogic Ukraine
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysisPramod Toraskar
 

What's hot (20)

Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0Riding the Elephant - Hadoop 2.0
Riding the Elephant - Hadoop 2.0
 
Memory efficient pytorch
Memory efficient pytorchMemory efficient pytorch
Memory efficient pytorch
 
Inside database
Inside databaseInside database
Inside database
 
Wasserstein GAN Tfug2017 07-12
Wasserstein GAN Tfug2017 07-12Wasserstein GAN Tfug2017 07-12
Wasserstein GAN Tfug2017 07-12
 
Chunked, dplyr for large text files
Chunked, dplyr for large text filesChunked, dplyr for large text files
Chunked, dplyr for large text files
 
Math synonyms
Math synonymsMath synonyms
Math synonyms
 
Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks Benchy: Lightweight framework for Performance Benchmarks
Benchy: Lightweight framework for Performance Benchmarks
 
機械学習によるデータ分析 実践編
機械学習によるデータ分析 実践編機械学習によるデータ分析 実践編
機械学習によるデータ分析 実践編
 
SociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data AnalysisSociaLite: High-level Query Language for Big Data Analysis
SociaLite: High-level Query Language for Big Data Analysis
 
Docopt, beautiful command-line options for R, user2014
Docopt, beautiful command-line options for R,  user2014Docopt, beautiful command-line options for R,  user2014
Docopt, beautiful command-line options for R, user2014
 
深層学習とベイズ統計
深層学習とベイズ統計深層学習とベイズ統計
深層学習とベイズ統計
 
TF.data & Eager Execution
TF.data & Eager ExecutionTF.data & Eager Execution
TF.data & Eager Execution
 
Phil Bartie QGIS PLPython
Phil Bartie QGIS PLPythonPhil Bartie QGIS PLPython
Phil Bartie QGIS PLPython
 
tf.data: TensorFlow Input Pipeline
tf.data: TensorFlow Input Pipelinetf.data: TensorFlow Input Pipeline
tf.data: TensorFlow Input Pipeline
 
Ml15m2018 10-27
Ml15m2018 10-27Ml15m2018 10-27
Ml15m2018 10-27
 
Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015Beyond Lists - Functional Kats Conf Dublin 2015
Beyond Lists - Functional Kats Conf Dublin 2015
 
Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019
 
R Analytics in the Cloud
R Analytics in the CloudR Analytics in the Cloud
R Analytics in the Cloud
 
Boost.Python: C++ and Python Integration
Boost.Python: C++ and Python IntegrationBoost.Python: C++ and Python Integration
Boost.Python: C++ and Python Integration
 
Basic of python for data analysis
Basic of python for data analysisBasic of python for data analysis
Basic of python for data analysis
 

Similar to Pig: Data Analysis Tool in Cloud

Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsViswanath Gangavaram
 
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsRAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsHyunjung Park
 
python beginner talk slide
python beginner talk slidepython beginner talk slide
python beginner talk slidejonycse
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizationsscottcrespo
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Hands on data science with r.pptx
Hands  on data science with r.pptxHands  on data science with r.pptx
Hands on data science with r.pptxNimrita Koul
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prodYunong Xiao
 
Introduction of R on Hadoop
Introduction of R on HadoopIntroduction of R on Hadoop
Introduction of R on HadoopChung-Tsai Su
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineeringjtdudley
 
Golang basics for Java developers - Part 1
Golang basics for Java developers - Part 1Golang basics for Java developers - Part 1
Golang basics for Java developers - Part 1Robert Stern
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of HadoopAsif Ali
 

Similar to Pig: Data Analysis Tool in Cloud (20)

Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labsApache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
Apache pig power_tools_by_viswanath_gangavaram_r&d_dsg_i_labs
 
Apache pig
Apache pigApache pig
Apache pig
 
Pig latin
Pig latinPig latin
Pig latin
 
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce WorkflowsRAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
 
Hadoop
HadoopHadoop
Hadoop
 
Pig workshop
Pig workshopPig workshop
Pig workshop
 
python beginner talk slide
python beginner talk slidepython beginner talk slide
python beginner talk slide
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Hands on data science with r.pptx
Hands  on data science with r.pptxHands  on data science with r.pptx
Hands on data science with r.pptx
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prod
 
Introduction of R on Hadoop
Introduction of R on HadoopIntroduction of R on Hadoop
Introduction of R on Hadoop
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
 
Golang basics for Java developers - Part 1
Golang basics for Java developers - Part 1Golang basics for Java developers - Part 1
Golang basics for Java developers - Part 1
 
An Overview of Hadoop
An Overview of HadoopAn Overview of Hadoop
An Overview of Hadoop
 

Recently uploaded

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 

Recently uploaded (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 

Pig: Data Analysis Tool in Cloud

  • 1. Pig : Data Analysis Tool in the Cloud Jeff Zhang zjffdu@gmail.com Committer of Pig in ASF
  • 2. Agenda Background What is Pig Brief introduction of Pig internals Demo Q/A
  • 3.
  • 5. What is Pig Apache Pig is a platform for analyzing large data sets that consists of a high-level language (PigLatin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Ease of programming Optimization opportunities Extensibility Built upon Hadoop
  • 6.
  • 7. Operators in Pig-Latin Load - a = load ‘data’ usingPigStorage(‘’) as (f1:int ,f2:double,f3:chararray) Store - store a into ‘/test/output’ usingPigStorage(‘,’) Dump - dump a Filter - b = filter a by f1 > 0 and f2 == ‘java_one’ Foreach - b = foreach a generate f1, f3 Group - b= group a by f3; Join - b = Join a by f1, b by f1; Describe - describe b; ….
  • 8. Data Structure in Pig Cell  field in database - Primitive types: int, long, float, double, bytearray, chararrar,nul - Complex types: map, tuple, databag Tuple row (1, 1.2, “java”) DataBag table or view { (1, 1.2, “java”), (2,2.3, “c++”) , (3,4.5,”c”) }
  • 9. How to use Pig Grunt (Interactive Shell) Java API Other languages (in future)
  • 10. Architecture of Pig Grunt (Interactive shell) PigServer (Java API) Parser (PigLatinLogicalPlan) PigContext Optimizer (LogicalPlan LogicalPlan) Compiler (LogicalPlan PhysiclaPlan  MapReducePlan) ExecutionEngine Hadoop
  • 11. Three basic operations of Pig Group by Join Order
  • 12. How Pig do Group by Data Source  Split  Mapper  Partition  Reducer (A,1) (B,2) (C,3) (A,1) (B,2) (C,3) (B,4) (B,5) (C,6) (A,7) (E,8) (D,9) (A,{(A,1),(A,7)}) (C,{(C,3),(C,6)}) (E,{(E,8)}) (B,4) (B,5) (C,6) (B,{(B,2),(B,4),(B,5)}) (D,{(D,9)}) (A,7) (E,8) (D,9)
  • 13. How Pig do Join Data Source  Split  Mapper  Partition  Reducer (1,A1) (4,A4) (3,A3) (5,A5) (2,A2) (1,A1) (4,A4) (5,B5) (1,B1) ((1,A1),(1,B1)) ((3,A3),(3,B3)) ((5,A5),(5,B5)) (3,A3) (5,A5) (3,B3) (2,B2) (5,B5) (1,B1) (3,B3) (2,B2) (4,B4) ((2,A2)(2,B2)) ((4,B4),(4,B4)) (2,A2) (4,B4)
  • 14. How Pig do Sort Data Source  Split  Mapper  Range Partition  Reducer (100) (200) (900) (50) (100) (200) (300) (400) (100) (200) (900) (50) (600) (800) (300) (400) (50) (600) (800) (600) (800) (300) (400)
  • 15. UDF (User-Defined-Function) register myudf.jar; raw_data= load ‘/java_one/udf’ as (name:chararray); firstnames = foreachraw_datageneratemyudf.FirstName (name); storefirstnamesinto ‘/java_one/udf_output’; public class FirstNameextendsEvalFunc<String>{ @Override public String exec(Tuple input) throwsIOException { String name=input.get(0).toString(); …. returnfirstname; } }
  • 16. What Storage Pig Supports HDFS Plain Text Binary format Customized format (XML, JSON, Protobuf, Thrift…) RDBMS(DBStorage) Cassandra (CassandraStorage) HBase(HBaseStorage)
  • 17. What fields can Pig be applied Data Analysis Text Processing ETL Machine Learning
  • 18. Who’s using Pig More: http://wiki.apache.org/pig/PoweredBy
  • 19. References http://pig.apache.org (Pig official site) http://hadoop.apache.org (Hadoop official site) https://github.com/zjffdu/RAF-PIG (Rich API for Pig)
  • 20. Demo
  • 21. Thank you ! Q&A