Pig: Data Analysis Tool in Cloud
Presented at the JavaOne conference in Beijing, 2010.

  1. Pig: Data Analysis Tool in Cloud
     Jeff Zhang
     zjffdu@gmail.com
     Committer of Apache Pig at the ASF
  2. Agenda
     Background
     What is Pig
     Brief introduction to Pig internals
     Demo
     Q/A
  3. Data Explosion
     Web 2.0
     - More digital terminals
     What we have for data analysis
     - RDBMS (limited scalability)
     - Parallel RDBMS (expensive)
     - Programming languages (too complex)
     - Hadoop MapReduce (still too complex for non-Hadoop users)
  4. Then Pig Arrives
  5. What is Pig
     Apache Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
     - Ease of programming
     - Optimization opportunities
     - Extensibility
     - Built upon Hadoop
  6. A simple example of Pig Latin
     Input (timestamp, URL):
       1291950309812, http://snda.com/page_1
       1291950309822, http://snda.com/page_2
       1291950309832, http://snda.com/page_3
       ….
     Page views:
       raw_data = load '/java_one/pv' using PigStorage(',') as (time_stamp:long, url:chararray);
       pages = foreach raw_data generate url;
       pages = group pages by url;
       pages = foreach pages generate group as url, COUNT(pages.url) as pv;
     The 10 most popular pages:
       result = order pages by pv desc;
       top10 = limit result 10;
       dump top10;
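For readers more at home in Java than in Pig Latin, the page-view script in slide 6 can be approximated in memory as below. This is only a sketch of what the script computes, not of how Pig runs it: the class name `PageViewSketch`, the method `topPages`, the whitespace trimming, and the tie-breaking by URL are assumptions made for illustration and are not part of Pig or of the talk.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: mirrors load + foreach + group + COUNT + order + limit
// on in-memory "timestamp,url" lines.
final class PageViewSketch {
    static List<Map.Entry<String, Long>> topPages(List<String> lines, int limit) {
        Map<String, Long> pageViews = lines.stream()
            // foreach raw_data generate url (trimming is an assumption of this sketch)
            .map(line -> line.split(",", 2)[1].trim())
            // group pages by url, then COUNT per group
            .collect(Collectors.groupingBy(url -> url, Collectors.counting()));
        return pageViews.entrySet().stream()
            // order by pv desc; ties are broken by url so the result is deterministic
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                    .thenComparing(Map.Entry.comparingByKey()))
            // limit result N
            .limit(limit)
            .collect(Collectors.toList());
    }
}
```

Calling `PageViewSketch.topPages(lines, 10)` corresponds to the `top10` relation in the script. Pig would run the same logic as MapReduce jobs over files in HDFS rather than over a list in memory.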
  7. Operators in Pig Latin
     Load     - a = load 'data' using PigStorage('\t') as (f1:int, f2:double, f3:chararray);
     Store    - store a into '/test/output' using PigStorage(',');
     Dump     - dump a;
     Filter   - b = filter a by f1 > 0 and f3 == 'java_one';
     Foreach  - b = foreach a generate f1, f3;
     Group    - b = group a by f3;
     Join     - c = join a by f1, b by f1;
     Describe - describe b;
     ….
  8. Data Structure in Pig
     Cell -> field in a database
     - Primitive types: int, long, float, double, bytearray, chararray, null
     - Complex types: map, tuple, bag (DataBag)
     Tuple -> row
       (1, 1.2, "java")
     DataBag -> table or view
       { (1, 1.2, "java"), (2, 2.3, "c++"), (3, 4.5, "c") }
  9. How to use Pig
     Grunt (interactive shell)
     Java API
     Other languages (in the future)
  10. Architecture of Pig
      Grunt (interactive shell)
      PigServer (Java API)
      Parser (Pig Latin -> LogicalPlan)
      PigContext
      Optimizer (LogicalPlan -> LogicalPlan)
      Compiler (LogicalPlan -> PhysicalPlan -> MapReducePlan)
      ExecutionEngine
      Hadoop
  11. Three basic operations of Pig
      Group by
      Join
      Order
  12. How Pig does Group by
      Data Source -> Split -> Mapper -> Partition -> Reducer
      Data source: (A,1) (B,2) (C,3) (B,4) (B,5) (C,6) (A,7) (E,8) (D,9)
      Split 1: (A,1) (B,2) (C,3)
      Split 2: (B,4) (B,5) (C,6)
      Split 3: (A,7) (E,8) (D,9)
      Reducer 1: (A,{(A,1),(A,7)}) (C,{(C,3),(C,6)}) (E,{(E,8)})
      Reducer 2: (B,{(B,2),(B,4),(B,5)}) (D,{(D,9)})
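The split, map, partition, and reduce stages of the group-by slide can be imitated in a few lines of plain Java. This is a minimal sketch under stated assumptions: `GroupBySketch` and its methods are hypothetical names, the hash-based `partition` function is an illustrative choice rather than Pig's actual partitioner, and everything runs in one process instead of on a Hadoop cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of a MapReduce group-by.
final class GroupBySketch {
    // Assumed partitioner: routes a key to one of numReducers by its hash code.
    static int partition(String key, int numReducers) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    // Returns, for each reducer, a map from key to the bag of records sharing that key.
    static List<Map<String, List<Map.Entry<String, Integer>>>> groupBy(
            List<List<Map.Entry<String, Integer>>> splits, int numReducers) {
        List<Map<String, List<Map.Entry<String, Integer>>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (List<Map.Entry<String, Integer>> split : splits) {   // one mapper per split
            for (Map.Entry<String, Integer> rec : split) {        // map: emit (key, record)
                reducers.get(partition(rec.getKey(), numReducers)) // partition: pick a reducer
                        .computeIfAbsent(rec.getKey(), k -> new ArrayList<>())
                        .add(rec);                                 // reduce: collect the bag
            }
        }
        return reducers;
    }
}
```

With the nine records from the slide and two reducers, this hash function happens to send keys A, C, and E to one reducer and B and D to the other, the same grouping the slide shows. All records with the same key always meet in the same reducer, which is what makes the grouping correct.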
  13. How Pig does Join
      Data Source -> Split -> Mapper -> Partition -> Reducer
      Data source A: (1,A1) (4,A4) (3,A3) (5,A5) (2,A2)
      Data source B: (5,B5) (1,B1) (3,B3) (2,B2) (4,B4)
      Splits of A: (1,A1) (4,A4) | (3,A3) (5,A5) | (2,A2)
      Splits of B: (5,B5) (1,B1) | (3,B3) (2,B2) | (4,B4)
      Reducer 1: ((1,A1),(1,B1)) ((3,A3),(3,B3)) ((5,A5),(5,B5))
      Reducer 2: ((2,A2),(2,B2)) ((4,A4),(4,B4))
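The join slide follows the same pattern as group-by: records from both inputs are partitioned by join key, so matching records meet in the same reducer. A minimal sketch, assuming a hypothetical `JoinSketch` class and a simple key-modulo partitioner chosen for illustration (not Pig's own implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of a reduce-side inner join.
final class JoinSketch {
    // Groups one input's values by key inside the reducer chosen for that key.
    private static List<Map<Integer, List<String>>> shuffle(
            List<Map.Entry<Integer, String>> input, int numReducers) {
        List<Map<Integer, List<String>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (Map.Entry<Integer, String> rec : input) {
            int key = rec.getKey();
            reducers.get(Math.floorMod(key, numReducers))   // assumed partitioner
                    .computeIfAbsent(key, k -> new ArrayList<>())
                    .add(rec.getValue());
        }
        return reducers;
    }

    // Returns the joined pairs produced by each reducer, formatted as on the slide.
    static List<List<String>> join(List<Map.Entry<Integer, String>> a,
                                   List<Map.Entry<Integer, String>> b,
                                   int numReducers) {
        List<Map<Integer, List<String>>> left = shuffle(a, numReducers);
        List<Map<Integer, List<String>>> right = shuffle(b, numReducers);
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            List<String> joined = new ArrayList<>();
            for (Map.Entry<Integer, List<String>> e : left.get(i).entrySet()) {
                List<String> matches = right.get(i).get(e.getKey());
                if (matches == null) {
                    continue;                                // inner join: drop unmatched keys
                }
                for (String av : e.getValue()) {             // cross product per key
                    for (String bv : matches) {
                        joined.add("((" + e.getKey() + "," + av + "),("
                                + e.getKey() + "," + bv + "))");
                    }
                }
            }
            out.add(joined);
        }
        return out;
    }
}
```

With two reducers, odd keys land in one reducer and even keys in the other, which matches the layout on the slide.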
  14. How Pig does Sort
      Data Source -> Split -> Mapper -> Range Partition -> Reducer
      Data source: (100) (200) (900) (50) (600) (800) (300) (400)
      Split 1: (100) (200) (900) (50)
      Split 2: (600) (800) (300) (400)
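The idea behind the range partition stage is that each reducer receives one contiguous range of values, so sorting inside each reducer and concatenating the reducer outputs in order yields a globally sorted result. A minimal sketch, assuming a hypothetical `RangeSortSketch` class and caller-supplied range boundaries (the boundary of 400 used in the usage note is made up for illustration, and `upperBounds` is assumed to be in ascending order):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory imitation of a range-partitioned sort.
final class RangeSortSketch {
    // Index of the first range whose upper bound is >= value; the last range is open-ended.
    static int partition(int value, int[] upperBounds) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length;
    }

    static List<Integer> sort(List<Integer> input, int[] upperBounds) {
        List<List<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i <= upperBounds.length; i++) {
            reducers.add(new ArrayList<>());
        }
        for (int value : input) {                      // map + range partition
            reducers.get(partition(value, upperBounds)).add(value);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> reducer : reducers) {       // each reducer sorts its own range
            Collections.sort(reducer);
            out.addAll(reducer);                       // ranges are already in order
        }
        return out;
    }
}
```

For example, `RangeSortSketch.sort(List.of(100, 200, 900, 50, 600, 800, 300, 400), new int[] {400})` sends values up to 400 to the first reducer and the rest to the second. In a real cluster the boundaries would need to be chosen so that reducers receive similar amounts of data.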
  15. UDF (User-Defined Function)
      Pig Latin script:
        register myudf.jar;
        raw_data = load '/java_one/udf' as (name:chararray);
        firstnames = foreach raw_data generate myudf.FirstName(name);
        store firstnames into '/java_one/udf_output';
      Java UDF:
        package myudf;
        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;
        public class FirstName extends EvalFunc<String> {
            @Override
            public String exec(Tuple input) throws IOException {
                String name = input.get(0).toString();
                ….
                return firstname;
            }
        }
  16. What Storage Pig Supports
      HDFS
      Plain text
      Binary format
      Customized formats (XML, JSON, Protobuf, Thrift…)
      RDBMS (DBStorage)
      Cassandra (CassandraStorage)
      HBase (HBaseStorage)
  17. Where Pig can be applied
      Data analysis
      Text processing
      ETL
      Machine learning
  18. Who's using Pig
      More: http://wiki.apache.org/pig/PoweredBy
  19. References
      http://pig.apache.org (Pig official site)
      http://hadoop.apache.org (Hadoop official site)
      https://github.com/zjffdu/RAF-PIG (Rich API for Pig)
  20. Demo
  21. Thank you!
      Q&A