PIG: A Big Data Processor
Tushar B. Kute,
http://tusharkute.com
What is Pig?
• Apache Pig is an abstraction over MapReduce. It is a
tool/platform used to analyze large data sets by
representing them as data flows.
• Pig is generally used with Hadoop; we can perform all
the data manipulation operations in Hadoop using
Apache Pig.
• To write data analysis programs, Pig provides a
high-level language known as Pig Latin.
• This language provides various operators with which
programmers can develop their own functions for
reading, writing, and processing data.
Apache Pig
• To analyze data using Apache Pig, programmers
need to write scripts using Pig Latin language.
• All these scripts are internally converted to Map
and Reduce tasks.
• Apache Pig has a component known as Pig
Engine that accepts the Pig Latin scripts as
input and converts those scripts into
MapReduce jobs.
Why do we need Apache Pig?
• Using Pig Latin, programmers can perform MapReduce tasks
easily without having to write complex code in Java.
• Apache Pig uses a multi-query approach, thereby reducing the
length of code. For example, an operation that would require
200 lines of code (LoC) in Java can often be done in as few as
10 LoC in Apache Pig. Ultimately, Apache Pig is reported to
reduce development time by almost 16 times.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig
when you are familiar with SQL.
• Apache Pig provides many built-in operators to support data
operations like joins, filters, ordering, etc. In addition, it also
provides nested data types like tuples, bags, and maps that are
missing from MapReduce.
Features of Pig
• Rich set of operators: It provides many operators to perform
operations like join, sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL and it is easy to write
a Pig script if you are good at SQL.
• Optimization opportunities: The tasks in Apache Pig optimize their
execution automatically, so the programmers need to focus only on
semantics of the language.
• Extensibility: Using the existing operators, users can develop their
own functions to read, process, and write data.
• UDFs: Pig provides the facility to create user-defined functions in
other programming languages such as Java and invoke or embed them
in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both
structured as well as unstructured. It stores the results in HDFS.
Pig vs. MapReduce
Pig vs. SQL
Pig vs. Hive
Applications of Apache Pig
• To process huge data sources such as web logs.
• To perform data processing for search
platforms.
• To process time sensitive data loads.
Apache Pig – History
• In 2006, Apache Pig was developed as a
research project at Yahoo, especially to create
and execute MapReduce jobs on every dataset.
• In 2007, Apache Pig was open sourced via
Apache incubator.
• In 2008, the first release of Apache Pig came
out. In 2010, Apache Pig graduated as an
Apache top-level project.
Pig Architecture
Apache Pig – Components
• Parser: Initially the Pig Scripts are handled by the Parser. It
checks the syntax of the script, does type checking, and other
miscellaneous checks. The output of the parser will be a DAG
(directed acyclic graph), which represents the Pig Latin
statements and logical operators.
• Optimizer: The logical plan (DAG) is passed to the logical
optimizer, which carries out the logical optimizations such as
projection and pushdown.
• Compiler: The compiler compiles the optimized logical plan
into a series of MapReduce jobs.
• Execution engine: Finally, the MapReduce jobs are submitted
to Hadoop in sorted order and executed there, producing the
desired results.
Apache Pig – Data Model
Apache Pig – Elements
• Atom
– Any single value in Pig Latin, irrespective of its
data type, is known as an atom.
– It is stored as a string and can be used as a string
or a number. int, long, float, double, chararray,
and bytearray are the atomic types of Pig.
– A piece of data or a simple atomic value is known
as a field.
– Example: ‘raja’ or ‘30’
Apache Pig – Elements
• Tuple
– A record that is formed by an ordered set of
fields is known as a tuple, the fields can be of any
type. A tuple is similar to a row in a table of
RDBMS.
– Example: (Raja, 30)
Apache Pig – Elements
• Bag
– A bag is an unordered set of tuples. In other words, a
collection of tuples (non-unique) is known as a bag. Each
tuple can have any number of fields (flexible schema). A
bag is represented by ‘{}’. It is similar to a table in RDBMS,
but unlike a table in RDBMS, it is not necessary that every
tuple contain the same number of fields or that the fields
in the same position (column) have the same type.
– Example: {(Raja, 30), (Mohammad, 45)}
– A bag can be a field in a relation; in that context, it is
known as inner bag.
– Example: (Raja, 30, {(9848022338, raja@gmail.com)})
Apache Pig – Elements
• Relation
– A relation is a bag of tuples. The relations in Pig
Latin are unordered (there is no guarantee that
tuples are processed in any particular order).
• Map
– A map (or data map) is a set of key-value pairs.
The key needs to be of type chararray and should
be unique. The value might be of any type. It is
represented by ‘[]’
– Example: [name#Raja, age#30]
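As a rough analogy only (this is plain Python, not Pig, and the variable names are chosen just for illustration), the elements described above can be pictured with built-in Python types: a tuple for a Pig tuple, a list of tuples for a bag, and a dict for a map.

```python
# Plain-Python analogy for Pig's data model (illustrative only, not Pig code).

# Atom: a single value such as a chararray or an int.
name = "Raja"
age = 30

# Tuple: an ordered set of fields, like a row in a table.
person = (name, age)

# Bag: a collection of tuples. Duplicates are allowed, and tuples
# may have different numbers of fields (flexible schema).
people = [("Raja", 30), ("Mohammad", 45), ("Raja", 30), ("Mohammad",)]

# Map: key-value pairs with unique string (chararray) keys.
details = {"name": "Raja", "age": 30}

# Inner bag: a bag used as a field inside a tuple.
contact = ("Raja", 30, [("9848022338", "raja@gmail.com")])

print(person)
print(len(people))
print(details["name"])
print(contact[2][0][1])
```

A Python list is ordered, whereas a Pig bag or relation is not, so the analogy should not be read as a guarantee about processing order.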
Installation of PIG
Download
• Download the tar.gz file of Apache Pig from here:
http://mirror.fibergrid.in/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz
Extract and copy
• Extract this file using the right-click -> 'Extract here'
option or the tar -xzvf command.
• Rename the created folder 'pig-0.15.0' to 'pig'.
• Now, move this folder to /usr/lib using the following
command:
$ sudo mv pig/ /usr/lib
Edit the bashrc file
• Open the bashrc file:
sudo gedit ~/.bashrc
• Go to the end of the file and add the following lines:
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
• Run the following command to apply the change:
source ~/.bashrc
Start Pig
• Start Pig in local mode:
pig -x local
• Start Pig in mapreduce mode (needs the Hadoop
datanode to be started):
pig -x mapreduce
Grunt shell
Data Processing with PIG
Example: movies_data.csv
1,Dhadakebaz,1986,3.2,7560
2,Dhumdhadaka,1985,3.8,6300
3,Ashi hi banva banvi,1988,4.1,7802
4,Zapatlela,1993,3.7,6022
5,Ayatya Gharat Gharoba,1991,3.4,5420
6,Navra Maza Navsacha,2004,3.9,4904
7,De danadan,1987,3.4,5623
8,Gammat Jammat,1987,3.4,7563
9,Eka peksha ek,1990,3.2,6244
10,Pachhadlela,2004,3.1,6956
Load data
• $ pig -x local
• grunt> movies = LOAD 'movies_data.csv' USING
PigStorage(',') AS (id,name,year,rating,duration);
• grunt> DUMP movies;
This displays the contents of the relation.
Filter data
• grunt> movies_greater_than_35 =
FILTER movies BY (float)rating > 3.5;
• grunt> dump movies_greater_than_35;
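To make the effect of this FILTER concrete, here is a minimal plain-Python sketch (not Pig) of the same selection over the sample rows. It assumes the untyped load shown earlier, so every field starts out as text and the rating is cast before comparing, mirroring `(float)rating`.

```python
# Rows of movies_data.csv as loaded without a schema: every field is text.
movies = [
    ("1", "Dhadakebaz", "1986", "3.2", "7560"),
    ("2", "Dhumdhadaka", "1985", "3.8", "6300"),
    ("3", "Ashi hi banva banvi", "1988", "4.1", "7802"),
    ("4", "Zapatlela", "1993", "3.7", "6022"),
    ("5", "Ayatya Gharat Gharoba", "1991", "3.4", "5420"),
    ("6", "Navra Maza Navsacha", "2004", "3.9", "4904"),
    ("7", "De danadan", "1987", "3.4", "5623"),
    ("8", "Gammat Jammat", "1987", "3.4", "7563"),
    ("9", "Eka peksha ek", "1990", "3.2", "6244"),
    ("10", "Pachhadlela", "2004", "3.1", "6956"),
]

# Equivalent of: FILTER movies BY (float)rating > 3.5;
# Field 3 is the rating; cast it to a number before comparing.
movies_greater_than_35 = [m for m in movies if float(m[3]) > 3.5]

for m in movies_greater_than_35:
    print(m)
```

On the sample file this keeps the four movies rated 3.7 or higher.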
Store the results data
• grunt> store movies_greater_than_35
into 'my_movies';
• It stores the result in a local file system directory
named 'my_movies'.
Display the result
• Now display the result from the local file system.
cat my_movies/part-m-00000
Load command
• The load command specified only the column
names. We can modify the statement as follows
to include the data type of the columns:
• grunt> movies = LOAD 
'movies_data.csv' USING 
PigStorage(',') as (id:int, 
name:chararray, year:int, 
rating:double, duration:int);
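The effect of declaring types in the schema can be sketched in plain Python (not Pig): each comma-separated field is cast according to its declared type. The `SCHEMA` list and the `load_movies` helper are hypothetical names used only for this illustration, and Python's `int`, `str`, and `float` stand in for Pig's int, chararray, and double.

```python
import csv
import io

# Contents of movies_data.csv as shown in the slides.
MOVIES_CSV = """\
1,Dhadakebaz,1986,3.2,7560
2,Dhumdhadaka,1985,3.8,6300
3,Ashi hi banva banvi,1988,4.1,7802
4,Zapatlela,1993,3.7,6022
5,Ayatya Gharat Gharoba,1991,3.4,5420
6,Navra Maza Navsacha,2004,3.9,4904
7,De danadan,1987,3.4,5623
8,Gammat Jammat,1987,3.4,7563
9,Eka peksha ek,1990,3.2,6244
10,Pachhadlela,2004,3.1,6956
"""

# Schema from the LOAD statement: (field name, Python stand-in for the Pig type).
SCHEMA = [("id", int), ("name", str), ("year", int),
          ("rating", float), ("duration", int)]


def load_movies(text):
    """Parse comma-separated lines and cast each field as SCHEMA declares."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append(tuple(cast(value)
                          for (_, cast), value in zip(SCHEMA, record)))
    return rows


movies = load_movies(MOVIES_CSV)
for movie in movies:
    print(movie)
```

With typed fields, later statements can compare `year` or `duration` directly without an explicit cast.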
Check the filters
• List the movies that were released between 1990 and
1995
grunt> movies_between_90_95 = FILTER
movies by year > 1990 and year < 1995;
• List the movies that start with the letter D
grunt> movies_starting_with_D = FILTER
movies by name matches 'D.*';
• List the movies that have a duration greater than 2 hours
grunt> movies_duration_2_hrs = FILTER
movies by duration > 7200;
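The three filter statements can be checked against the sample data with a short plain-Python sketch (not Pig). It assumes the typed schema from the previous slide and uses `re.fullmatch`, since Pig's `matches` operator tests the whole string against the pattern.

```python
import re

# Sample rows with the typed schema (id, name, year, rating, duration).
movies = [
    (1, "Dhadakebaz", 1986, 3.2, 7560),
    (2, "Dhumdhadaka", 1985, 3.8, 6300),
    (3, "Ashi hi banva banvi", 1988, 4.1, 7802),
    (4, "Zapatlela", 1993, 3.7, 6022),
    (5, "Ayatya Gharat Gharoba", 1991, 3.4, 5420),
    (6, "Navra Maza Navsacha", 2004, 3.9, 4904),
    (7, "De danadan", 1987, 3.4, 5623),
    (8, "Gammat Jammat", 1987, 3.4, 7563),
    (9, "Eka peksha ek", 1990, 3.2, 6244),
    (10, "Pachhadlela", 2004, 3.1, 6956),
]

# year > 1990 and year < 1995 (both bounds exclusive)
movies_between_90_95 = [m for m in movies if 1990 < m[2] < 1995]

# name matches 'D.*' (the whole name must match the pattern)
movies_starting_with_D = [m for m in movies if re.fullmatch(r"D.*", m[1])]

# duration > 7200 seconds, i.e. longer than 2 hours
movies_duration_2_hrs = [m for m in movies if m[4] > 7200]

print([m[1] for m in movies_between_90_95])
print([m[1] for m in movies_starting_with_D])
print([m[1] for m in movies_duration_2_hrs])
```

Note that the year filter excludes 1990 and 1995 themselves, because both comparisons are strict.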
Output
Screenshots: movies between 1990 and 1995; movies starting
with 'D'; movies longer than 2 hours
Describe
• The schema of a relation/alias can be
viewed using the DESCRIBE command:
grunt> DESCRIBE movies;
movies: {id: int, name: chararray, 
year: int, rating: double, duration: 
int} 
Foreach
• FOREACH gives a simple way to apply
transformations based on columns. Let’s understand
this with an example.
• List the movie names and their durations in minutes
grunt> movie_duration = FOREACH movies
GENERATE name, (double)duration / 60;
• The above statement generates a new alias that has
the list of movies and their durations in minutes.
• You can check the results using the DUMP command.
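A plain-Python sketch (not Pig) of the same per-row transformation is shown here. It assumes the `duration` column holds seconds, as the 2-hour filter (7200) suggests, and it uses true division so that fractional minutes are kept.

```python
# Sample rows with the typed schema (id, name, year, rating, duration).
movies = [
    (1, "Dhadakebaz", 1986, 3.2, 7560),
    (2, "Dhumdhadaka", 1985, 3.8, 6300),
    (3, "Ashi hi banva banvi", 1988, 4.1, 7802),
    (4, "Zapatlela", 1993, 3.7, 6022),
    (5, "Ayatya Gharat Gharoba", 1991, 3.4, 5420),
    (6, "Navra Maza Navsacha", 2004, 3.9, 4904),
    (7, "De danadan", 1987, 3.4, 5623),
    (8, "Gammat Jammat", 1987, 3.4, 7563),
    (9, "Eka peksha ek", 1990, 3.2, 6244),
    (10, "Pachhadlela", 2004, 3.1, 6956),
]

# FOREACH ... GENERATE: emit one output tuple per input tuple,
# keeping the name and converting the duration to minutes.
movie_duration = [(name, duration / 60)
                  for (_, name, _, _, duration) in movies]

for row in movie_duration:
    print(row)
```

The design point is that FOREACH works row by row: the output has exactly as many tuples as the input.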
Output
Group
• The GROUP keyword is used to group fields in a
relation.
• List the years and the number of movies released
each year.
grunt> grouped_by_year = group movies 
by year;
grunt> count_by_year = FOREACH 
grouped_by_year GENERATE group, 
COUNT(movies);
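The two-step pattern above (GROUP, then FOREACH with COUNT) can be sketched in plain Python (not Pig): first collect the tuples for each year into a bag, then count each bag. The sorting at the end is only for readable output; Pig itself makes no ordering promise here.

```python
from collections import defaultdict

# Sample rows with the typed schema (id, name, year, rating, duration).
movies = [
    (1, "Dhadakebaz", 1986, 3.2, 7560),
    (2, "Dhumdhadaka", 1985, 3.8, 6300),
    (3, "Ashi hi banva banvi", 1988, 4.1, 7802),
    (4, "Zapatlela", 1993, 3.7, 6022),
    (5, "Ayatya Gharat Gharoba", 1991, 3.4, 5420),
    (6, "Navra Maza Navsacha", 2004, 3.9, 4904),
    (7, "De danadan", 1987, 3.4, 5623),
    (8, "Gammat Jammat", 1987, 3.4, 7563),
    (9, "Eka peksha ek", 1990, 3.2, 6244),
    (10, "Pachhadlela", 2004, 3.1, 6956),
]

# GROUP movies BY year: one entry per year, holding a bag of matching tuples.
grouped_by_year = defaultdict(list)
for movie in movies:
    grouped_by_year[movie[2]].append(movie)

# FOREACH grouped_by_year GENERATE group, COUNT(movies)
count_by_year = sorted((year, len(bag))
                       for year, bag in grouped_by_year.items())

for year, count in count_by_year:
    print(year, count)
```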
Output
Order by
• Let us query the data to illustrate the ORDER BY
operation.
• List all the movies in ascending order of year.
grunt> asc_movies_by_year = ORDER
movies BY year ASC;
grunt> DUMP asc_movies_by_year;
• List all the movies in descending order of year.
grunt> desc_movies_by_year = ORDER movies
BY year DESC;
grunt> DUMP desc_movies_by_year;
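For comparison, a plain-Python sketch (not Pig) of sorting the sample rows by year in both directions follows. The names `movies_by_year_asc` and `movies_by_year_desc` are illustrative only. Python's sort is stable, so movies from the same year keep their input order; that tie behavior should not be assumed for Pig.

```python
# Sample rows with the typed schema (id, name, year, rating, duration).
movies = [
    (1, "Dhadakebaz", 1986, 3.2, 7560),
    (2, "Dhumdhadaka", 1985, 3.8, 6300),
    (3, "Ashi hi banva banvi", 1988, 4.1, 7802),
    (4, "Zapatlela", 1993, 3.7, 6022),
    (5, "Ayatya Gharat Gharoba", 1991, 3.4, 5420),
    (6, "Navra Maza Navsacha", 2004, 3.9, 4904),
    (7, "De danadan", 1987, 3.4, 5623),
    (8, "Gammat Jammat", 1987, 3.4, 7563),
    (9, "Eka peksha ek", 1990, 3.2, 6244),
    (10, "Pachhadlela", 2004, 3.1, 6956),
]

# ORDER movies BY year ASC
movies_by_year_asc = sorted(movies, key=lambda m: m[2])

# ORDER movies BY year DESC
movies_by_year_desc = sorted(movies, key=lambda m: m[2], reverse=True)

print([m[2] for m in movies_by_year_asc])
print([m[2] for m in movies_by_year_desc])
```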
Output: ascending by year, from 1985 to 2004
Limit
• Use the LIMIT keyword to get only a limited number
of results from a relation.
grunt> top_5_movies = LIMIT movies 5;
grunt> DUMP top_5_movies;
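LIMIT on an unordered relation gives no guarantee about which tuples come back, so a meaningful "top 5" usually follows an ORDER BY. The plain-Python sketch below (not Pig) shows that combination, ordering by rating first; the choice of rating as the sort key is an assumption made for illustration.

```python
# Sample rows with the typed schema (id, name, year, rating, duration).
movies = [
    (1, "Dhadakebaz", 1986, 3.2, 7560),
    (2, "Dhumdhadaka", 1985, 3.8, 6300),
    (3, "Ashi hi banva banvi", 1988, 4.1, 7802),
    (4, "Zapatlela", 1993, 3.7, 6022),
    (5, "Ayatya Gharat Gharoba", 1991, 3.4, 5420),
    (6, "Navra Maza Navsacha", 2004, 3.9, 4904),
    (7, "De danadan", 1987, 3.4, 5623),
    (8, "Gammat Jammat", 1987, 3.4, 7563),
    (9, "Eka peksha ek", 1990, 3.2, 6244),
    (10, "Pachhadlela", 2004, 3.1, 6956),
]

# Order by rating, highest first, then keep only the first five tuples.
movies_by_rating = sorted(movies, key=lambda m: m[3], reverse=True)
top_5_by_rating = movies_by_rating[:5]

for movie in top_5_by_rating:
    print(movie[1], movie[3])
```

Three movies share the rating 3.4, so which of them takes the fifth slot depends on how ties are broken.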
Pig: Modes of Execution
• Pig programs can be run in three ways, all of which
work in both local and MapReduce mode. They are
– Script Mode
– Grunt Mode
– Embedded Mode
Script mode
• Script Mode or Batch Mode: In script mode, Pig runs
the commands specified in a script file. The following
example shows how to run a Pig program from a
script file:
$ vim scriptfile.pig
   A = LOAD 'script_file';
   DUMP A;
$ pig -x local scriptfile.pig
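If a script needs to be launched from another program, the same command line can be assembled and run from Python. This is only a sketch: `build_pig_command` and `run_pig_script` are hypothetical helper names, and the run step assumes a `pig` executable is on the PATH (it does nothing otherwise).

```python
import shutil
import subprocess


def build_pig_command(script, mode="local"):
    """Return the argument list for running a Pig script in batch mode."""
    return ["pig", "-x", mode, script]


def run_pig_script(script, mode="local"):
    """Run the script if `pig` is installed; return its exit code or None."""
    if shutil.which("pig") is None:
        # Pig is not on PATH, so there is nothing to run.
        return None
    result = subprocess.run(build_pig_command(script, mode), check=False)
    return result.returncode


print(build_pig_command("scriptfile.pig"))
print(build_pig_command("scriptfile.pig", mode="mapreduce"))
```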
Grunt mode
• Grunt Mode or Interactive Mode: Grunt is Pig's interactive shell.
It is started when no file is specified for Pig to run.
$ pig -x local
grunt> A = LOAD 'grunt_file';
grunt> DUMP A;
• You can also run pig scripts from grunt using run and exec
commands.
grunt> run scriptfile.pig
grunt> exec scriptfile.pig
Embedded mode
• You can embed Pig programs in Java, Python, and
Ruby and run them from those languages.
Example: Wordcount program
• Q) How can we find the number of occurrences of each
word in a file using a Pig script?
• The well-known word count example is available as a
MapReduce program on the Apache website. Here we
will write a simple Pig script for the same
problem.
• The Pig script given in the next slide finds the number of
times each word is repeated in a file:
Example: text file- shivneri.txt
Example: Wordcount program
lines = LOAD 'shivneri.txt' AS 
(line:chararray);
words = FOREACH lines GENERATE 
FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
w_count = FOREACH grouped GENERATE group, 
COUNT(words);
DUMP w_count;
forts.pig
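The same data flow (tokenize each line, flatten into words, group by word, count) can be traced in plain Python (not Pig). The input text below is a hypothetical stand-in, since the contents of shivneri.txt are not reproduced here, and the `tokenize` helper splits on the delimiters documented as TOKENIZE defaults (space, double quote, comma, parentheses, star).

```python
import re
from collections import Counter

# Hypothetical stand-in for the input file; not the real shivneri.txt text.
text = "the fort is on a hill\nthe hill is near the town\n"


def tokenize(line):
    """Split a line into words on space, double quote, comma, parens, star."""
    return [word for word in re.split(r'[ ",()*]+', line) if word]


# FOREACH lines GENERATE FLATTEN(TOKENIZE(line)): one word per output row.
words = [word for line in text.splitlines() for word in tokenize(line)]

# GROUP words BY word, then COUNT each group.
w_count = Counter(words)

for word, count in sorted(w_count.items()):
    print(word, count)
```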
Output snapshot
$ pig -x local forts.pig
References
• “Programming Pig” by Alan Gates, O'Reilly
Publishers.
• “Pig Design Patterns” by Pradeep Pasupuleti,
PACKT Publishing
• Tutorials Point
• http://github.com/rohitdens
• http://pig.apache.org
tushar@tusharkute.com
Thank you
This presentation is created using LibreOffice Impress 4.2.8.2, can be used freely as per GNU General Public License
Blogs
http://digitallocha.blogspot.in
http://kyamputar.blogspot.in
Web Resources
http://tusharkute.com

Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Apache Pig: A big data processor

  • 1. PIG: A Big Data Processor Tushar B. Kute, http://tusharkute.com
  • 2. What is Pig? • Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. • Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. • To write data analysis programs, Pig provides a high-level language known as Pig Latin. • This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
  • 3. Apache Pig • To analyze data using Apache Pig, programmers need to write scripts in the Pig Latin language. • All these scripts are internally converted to Map and Reduce tasks. • Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
  • 4. Why do we need Apache Pig? • Using Pig Latin, programmers can perform MapReduce tasks easily without having to write complex code in Java. • Apache Pig uses a multi-query approach, thereby reducing the length of the code. For example, an operation that would require 200 lines of code (LoC) in Java can be done in as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces development time by a factor of almost 16. • Pig Latin is an SQL-like language, so Apache Pig is easy to learn if you are familiar with SQL. • Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it provides nested data types like tuples, bags, and maps that are missing from MapReduce.
  • 5. Features of Pig • Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc. • Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL. • Optimization opportunities: The tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language. • Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data. • UDFs: Pig provides the facility to create user-defined functions in other programming languages such as Java and to invoke or embed them in Pig scripts. • Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
  • 9. Applications of Apache Pig • To process huge data sources such as web logs. • To perform data processing for search platforms. • To process time sensitive data loads.
  • 10. Apache Pig – History • In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on every dataset. • In 2007, Apache Pig was open sourced via Apache incubator. • In 2008, the first release of Apache Pig came out. In 2010, Apache Pig graduated as an Apache top-level project.
  • 12. Apache Pig – Components • Parser: Pig scripts are first handled by the parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators. • Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown. • Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs. • Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed there, producing the desired results.
  • 13. Apache Pig – Data Model
  • 14. Apache Pig – Elements • Atom – Any single value in Pig Latin, irrespective of its data type, is known as an atom. – It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. – A piece of data or a simple atomic value is known as a field. – Example: ‘raja’ or ‘30’
  • 15. Apache Pig – Elements • Tuple – A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table. – Example: (Raja, 30)
  • 16. Apache Pig – Elements • Bag – A bag is an unordered set of tuples. In other words, a collection of tuples (not necessarily unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in an RDBMS, but unlike an RDBMS table, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type. – Example: {(Raja, 30), (Mohammad, 45)} – A bag can be a field in a relation; in that context, it is known as an inner bag. – Example: (Raja, 30, {(9848022338, raja@gmail.com)})
  • 17. Apache Pig – Elements • Relation – A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order). • Map – A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by ‘[]’. – Example: [name#Raja, age#30]
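The element types from slides 14 to 17 can be combined in a single load schema. The following is a minimal sketch, not taken from the slides: the file name 'people.txt', the relation name, and all field names are hypothetical, and the input is assumed to be tab-delimited (the default for LOAD) with bags and maps written in Pig's literal notation.

```pig
-- Hypothetical input file and field names, for illustration only.
-- name and age are atoms, contacts is an inner bag of tuples,
-- and details is a map with chararray keys.
people = LOAD 'people.txt' AS (
    name:chararray,
    age:int,
    contacts:bag{t:tuple(phone:chararray, email:chararray)},
    details:map[]
);

-- Print the schema of the relation.
DESCRIBE people;

-- Project an atom and look up one map value by key with the # operator.
names_and_city = FOREACH people GENERATE name, details#'city';
DUMP names_and_city;
```

Each row of `people` is a tuple, and the relation itself is the outer bag that holds those tuples.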
  • 19. Download • Download the tar.gz file of Apache Pig from here: http://mirror.fibergrid.in/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz
  • 20. Extract and copy • Extract this file using the right-click -> 'Extract here' option or the tar -xzvf command. • Rename the created folder 'pig-0.15.0' to 'pig'. • Now, move this folder to /usr/lib using the following command: $ sudo mv pig/ /usr/lib
  • 21. Edit the bashrc file • Open the bashrc file: sudo gedit ~/.bashrc • Go to the end of the file and add the following lines: export PIG_HOME=/usr/lib/pig export PATH=$PATH:$PIG_HOME/bin • Run the following command to apply the changes: source ~/.bashrc
  • 22. Start the Pig • Start the pig in local mode: pig -x local • Start the pig in mapreduce mode (needs hadoop datanode started): pig -x mapreduce
  • 26. Load data • $ pig -x local • grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration); • grunt> dump movies; • The dump command displays the contents of the relation.
  • 27. Filter data • grunt> movies_greater_than_35 = FILTER movies BY (float)rating > 3.5; • grunt> dump movies_greater_than_35;
  • 28. Store the results data • grunt> store movies_greater_than_35 into 'my_movies'; • It stores the result in local file system directory named 'my_movies'.
  • 29. Display the result • Now display the result from local file system. cat my_movies/part-m-00000
  • 30. Load command • The load command above specified only the column names. We can modify the statement as follows to include the data type of each column: • grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int, name:chararray, year:int, rating:double, duration:int);
  • 31. Check the filters • List the movies that were released between 1990 and 1995: grunt> movies_between_90_95 = FILTER movies by year > 1990 and year < 1995; • List the movies whose names start with the letter D: grunt> movies_starting_with_D = FILTER movies by name matches 'D.*'; • List the movies that have a duration greater than 2 hours (the duration column is in seconds): grunt> movies_duration_2_hrs = FILTER movies by duration > 7200;
  • 32. Output (screenshots): movies between 1990 and 1995; movies starting with 'D'; movies longer than 2 hours
  • 33. Describe • DESCRIBE: The schema of a relation/alias can be viewed using the DESCRIBE command: grunt> DESCRIBE movies; movies: {id: int, name: chararray, year: int, rating: double, duration: int}
  • 34. Foreach • FOREACH gives a simple way to apply transformations based on columns. Let’s understand this with an example. • List the movie names and their durations in minutes: grunt> movie_duration = FOREACH movies GENERATE name, duration / 60.0; • Dividing by 60.0 rather than 60 avoids integer division, which would drop the fractional minutes. • The above statement generates a new alias that holds each movie name and its duration in minutes. • You can check the results using the DUMP command.
  • 36. Group • The GROUP keyword is used to group fields in a relation. • List the years and the number of movies released each year. grunt> grouped_by_year = group movies by year; grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);
  • 38. Order by • Let us query the data to illustrate the ORDER BY operation. • List all the movies in ascending order of year. grunt> asc_movies_by_year = ORDER movies BY year ASC; grunt> DUMP asc_movies_by_year; • List all the movies in descending order of year. grunt> desc_movies_by_year = ORDER movies by year DESC; grunt> DUMP desc_movies_by_year;
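To make the shape of a grouped relation more concrete, here is a small sketch that builds on the `movies` relation loaded with the typed schema from slide 30. The output field names (`year`, `num_movies`, `avg_rating`) and the alias `stats_by_year` are illustrative choices, not part of the original slides.

```pig
-- Assumes 'movies' was loaded with the typed schema
-- (id:int, name:chararray, year:int, rating:double, duration:int).
grouped_by_year = GROUP movies BY year;

-- Each tuple of the grouped relation has two fields:
-- 'group' (the year) and a bag named 'movies' holding the matching rows.
DESCRIBE grouped_by_year;

-- Aggregate functions such as COUNT and AVG operate on that bag.
stats_by_year = FOREACH grouped_by_year GENERATE
    group AS year,
    COUNT(movies) AS num_movies,
    AVG(movies.rating) AS avg_rating;
DUMP stats_by_year;
```

Naming the generated fields with AS makes them easier to reference in later statements than the default positional names.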
  • 39. Output (screenshot): movies in ascending order of year, from 1985 to 2004
  • 40. Limit • Use the LIMIT keyword to get only a limited number of results from a relation. grunt> top_5_movies = LIMIT movies 5; grunt> DUMP top_5_movies;
  • 41. Pig: Modes of Execution • Pig programs can be run in three ways, all of which work in both local and MapReduce mode: – Script mode – Grunt mode – Embedded mode
  • 42. Script mode • Script mode or batch mode: In script mode, Pig runs the commands specified in a script file. The following example shows how to run a Pig program from a script file: $ vim scriptfile.pig A = LOAD 'script_file'; DUMP A; $ pig -x local scriptfile.pig
  • 43. Grunt mode • Grunt mode or interactive mode: Grunt is Pig's interactive shell. It is started when no file is specified for Pig to run. $ pig -x local grunt> A = LOAD 'grunt_file'; grunt> DUMP A; • You can also run Pig scripts from Grunt using the run and exec commands. grunt> run scriptfile.pig grunt> exec scriptfile.pig
  • 44. Embedded mode • You can embed Pig programs in Java, Python, and Ruby and run them from there.
  • 45. Example: Wordcount program • Q) How do you find the number of occurrences of each word in a file using a Pig script? • The well-known word count example, written as a MapReduce program, can be found on the Apache website. Here we will write a simple Pig script for the same problem. • The Pig script given in the next slide finds the number of times each word is repeated in a file:
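Because relations are unordered, LIMIT on its own returns an arbitrary set of tuples. A common pattern is to ORDER first and then LIMIT. The sketch below assumes the typed `movies` relation from slide 30; the alias names are illustrative.

```pig
-- Sort by rating, highest first, then keep the first five tuples.
movies_by_rating = ORDER movies BY rating DESC;
top_5_rated = LIMIT movies_by_rating 5;
DUMP top_5_rated;
```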
  • 46. Example: text file- shivneri.txt
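The script slide itself is not reproduced in this transcript. A common way to write word count in Pig Latin looks like the sketch below; it is not necessarily the author's exact script. The input file name comes from slide 46 and the script name forts.pig from the run command on slide 48, while the alias and field names are assumptions.

```pig
-- forts.pig: word count sketch
-- Load each line of the text file as a single chararray field.
-- Note: the default loader splits on tabs, so this assumes the text contains none.
lines = LOAD 'shivneri.txt' AS (line:chararray);

-- TOKENIZE splits a line into a bag of words; FLATTEN turns that bag into one row per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together and count the size of each group.
grouped = GROUP words BY word;
word_count = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;

DUMP word_count;
```

Replacing DUMP with a STORE statement would write the counts to a directory instead of printing them.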
  • 48. Output snapshot $ pig -x local forts.pig
  • 49. References • “Programming Pig” by Alan Gates, O'Reilly Publishers. • “Pig Design Patterns” by Pradeep Pasupuleti, PACKT Publishing • Tutorials Point • http://github.com/rohitdens • http://pig.apache.org
  • 50. tushar@tusharkute.com Thank you This presentation is created using LibreOffice Impress 4.2.8.2, can be used freely as per GNU General Public License Blogs http://digitallocha.blogspot.in http://kyamputar.blogspot.in Web Resources http://tusharkute.com