introduction to data processing
using Hadoop and Pig
ricardo varela
ricardov@yahoo-inc.com
http://twitter.com/phobeo
yahoo ydn tuesdays
London, 6th oct 2009
ah! the data!
• NYSE generates 1 Terabyte of
data per day
• The LHC in Geneva will
produce 15 Petabytes of data
per year
• The estimated “digital info” by
2011: 1.8 zettabytes
(one zettabyte is 1,000,000,000,000,000,000,000 = 10^21 bytes)
• Think status updates, facebook
photos, slashdot comments
(individual digital footprints get to Tb/year)
unlicensed img from IBM archives
data from The Diverse and Exploding Digital Universe, by IDC
“everything counts in
large amounts...”
• where do you store a petabyte?
• how do you read it?
(remote 10 mb/sec, local 100mb/sec)
• how do you process it?
• and what if something goes
wrong?
data from The Diverse and Exploding Digital Universe, by IDC
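To put the "how do you read it?" bullet in numbers, here is a back-of-the-envelope sketch. It assumes that "mb/sec" on the slide means megabytes per second and that a petabyte is 10^15 bytes; the class and method names are made up for illustration.

```java
// Rough estimate: how long does ONE sequential reader need for a petabyte?
// Assumptions (not from the slides): 1 PB = 10^15 bytes, "mb" = megabytes.
class ReadTimeEstimate {
    static final double PETABYTE_BYTES = 1e15;

    /** Seconds needed to read totalBytes at the given throughput in bytes per second. */
    static double secondsToRead(double totalBytes, double bytesPerSecond) {
        return totalBytes / bytesPerSecond;
    }

    /** Converts seconds to days. */
    static double toDays(double seconds) {
        return seconds / 86_400.0;
    }

    public static void main(String[] args) {
        double local = secondsToRead(PETABYTE_BYTES, 100e6);  // local disk, 100 MB/sec
        double remote = secondsToRead(PETABYTE_BYTES, 10e6);  // remote read, 10 MB/sec
        System.out.printf("local:  %.0f days%n", toDays(local));
        System.out.printf("remote: %.0f days%n", toDays(remote));
    }
}
```

Under those assumptions a single local reader needs roughly 116 days and a remote one more than three years, which is why the reads have to be spread over many disks.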
so, here comes parallel computing!
In pioneer days they used oxen for heavy pulling, and when
one ox couldn't budge a log, they didn't try to grow a larger
ox. We shouldn't be trying for bigger computers, but for
more systems of computers
Grace Hopper
however...
There are 3 rules to follow when parallelizing large code
bases.
Unfortunately, no one knows what these rules are
Gary R. Montry
enter mapreduce
• introduced by Jeff Dean and
Sanjay Ghemawat (google),
based on functional
programming “map” and
“reduce” functions
• distributes load and reads/
writes to distributed filesystem
img courtesy of Janne, http://helmer.sfe.se/
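The "map" and "reduce" idea can be sketched in plain Java, in memory and on a single machine. This is not Hadoop code and not how Google or Hadoop implement it; it is a minimal illustration of the three steps (map, group by key, reduce) using word counting, with a hypothetical class name.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map / shuffle / reduce steps (hypothetical class).
class MiniMapReduce {
    /** "map": turn one input line into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    /** "reduce": fold all values collected for one key into a single value. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map on every line, groups the pairs by key, then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("the quick brown fox", "the lazy dog")));
    }
}
```

In a real cluster the map calls run on many machines next to the data, and the grouping step moves the pairs across the network; the sketch only shows the shape of the computation.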
apache hadoop
• top level apache project since
jan 2008
• open source, java-based
• winner of the terabyte sort
benchmark
• heavily invested in and used
inside Yahoo!
hdfs
• designed to store lots of data in
a reliable and scalable way
• sequential access and read-
focused, with replication
simple mapreduce
• note: beware of the
single reduce! :)
example: simple processing
#!/bin/bash
# find the maximum temperature per year in the NCDC records
for year in all/*
do
  # print the year (file name without .gz) and a tab, without a newline
  echo -ne "$(basename "$year" .gz)\t"
  gunzip -c "$year" |
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max)
             max = temp; }
         END { print max }'
done
example: simple
processing
• processing the data for the last 100 years
may take on the order of an hour
(and it does not scale)
• we can express the same in
terms of a single map and
reduce
example: mapper
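import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int MISSING = 9999; // value that marks a missing reading

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // emit (year, temperature) only for present readings with a good quality code
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}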
example: reducer
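import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    // keep the largest temperature seen for this year
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}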
example: driver
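import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));   // input path
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output dir
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf); // submit the job and wait for it to finish
  }
}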
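The parsing and max-per-year logic of the mapper and reducer can also be tried without a cluster. The sketch below is plain Java with no Hadoop dependency; it reuses the same column offsets as the mapper, but the record() helper builds synthetic fixed-width lines (an assumption for illustration, not real NCDC data), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local, Hadoop-free check of the max temperature logic (hypothetical class).
class MaxTemperatureLocal {
    static final int MISSING = 9999;

    /** Builds a synthetic 93-character record; only the fields the mapper reads are set. */
    static String record(String year, int temp, char quality) {
        char[] line = new char[93];
        Arrays.fill(line, '0');
        year.getChars(0, 4, line, 15);                 // year in columns 15..18
        line[87] = temp < 0 ? '-' : '+';               // sign in column 87
        String.format("%04d", Math.abs(temp)).getChars(0, 4, line, 88);
        line[92] = quality;                            // quality code in column 92
        return new String(line);
    }

    /** Same rules as the mapper; returns null when the record is filtered out. */
    static Integer parseTemperature(String line) {
        int t = line.charAt(87) == '+'
            ? Integer.parseInt(line.substring(88, 92))
            : Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        return (t != MISSING && quality.matches("[01459]")) ? t : null;
    }

    /** Plays the role of map + reduce together: largest valid reading per year. */
    static Map<String, Integer> maxByYear(List<String> lines) {
        Map<String, Integer> max = new HashMap<>();
        for (String line : lines) {
            Integer t = parseTemperature(line);
            if (t != null) {
                max.merge(line.substring(15, 19), t, Integer::max);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            record("1949", 111, '1'), record("1949", 78, '1'),
            record("1950", -11, '1'), record("1950", 22, '1'),
            record("1950", 9999, '1'),   // missing reading, ignored
            record("1950", 300, '2'));   // quality code not in [01459], ignored
        System.out.println(maxByYear(lines));
    }
}
```

Keeping the parsing in a small method like this makes it easy to test before paying the cost of a full job submission.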
et voilà!
• our process runs in the order of
minutes (for 10 nodes) and
scales almost linearly
(the limit being how splittable the input is)
but it may get
verbose...
• needs a bit of code to make it
work
• chain jobs together
(sequences can just use JobClient.runJob()
but more complex dependencies need
JobControl)
• also, for simple tasks, you can
resort to hadoop streaming
unlicensed image from The Matrix, copyright Warner Bros.
pig to the rescue
• makes it simpler to write
mapreduce programs
• Pig Latin abstracts away the
specific details and lets you focus on
data processing
simple example, now with pig
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'sample.txt'
  AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999
  AND (quality == 0 OR quality == 1 OR quality == 4
  OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records
  GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;
a more complex use
• user data collection in one file
• website visits data in log
• find the top 5 most visited
pages by users aged 18 to 25
in mapreduce...
and now with pig...
Users = LOAD 'users' AS (name, age);
Fltrd = FILTER Users BY
  age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Jnd = JOIN Fltrd BY name, Pages BY user;
Grpd = GROUP Jnd BY url;
Smmd = FOREACH Grpd GENERATE group,
  COUNT(Jnd) AS clicks;
Srtd = ORDER Smmd BY clicks DESC;
Top5 = LIMIT Srtd 5;
STORE Top5 INTO 'top5sites';
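For readers who do not know Pig Latin yet, the same pipeline (filter, join, group, count, order, limit) can be sketched on small in-memory lists in plain Java. This only illustrates what the script computes, not how Pig executes it. It assumes user names are unique, and the class name, method name and sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// In-memory equivalent of the Pig script above (hypothetical class, tiny data only).
class TopPagesSketch {
    /**
     * users rows are {name, age}; visits rows are {user, url}.
     * Returns at most `limit` (url, clicks) pairs, most clicked first.
     */
    static List<Map.Entry<String, Long>> topPages(
            List<String[]> users, List<String[]> visits, int limit) {
        // FILTER Users BY age >= 18 AND age <= 25
        Set<String> young = users.stream()
            .filter(u -> {
                int age = Integer.parseInt(u[1]);
                return age >= 18 && age <= 25;
            })
            .map(u -> u[0])
            .collect(Collectors.toSet());

        // JOIN on user name, then GROUP BY url and COUNT
        Map<String, Long> clicks = visits.stream()
            .filter(v -> young.contains(v[0]))
            .collect(Collectors.groupingBy(v -> v[1], Collectors.counting()));

        // ORDER BY clicks DESC (ties broken by url), then LIMIT
        return clicks.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
            new String[] {"ana", "21"}, new String[] {"bob", "40"}, new String[] {"cy", "18"});
        List<String[]> visits = Arrays.asList(
            new String[] {"ana", "/home"}, new String[] {"cy", "/home"},
            new String[] {"ana", "/news"}, new String[] {"bob", "/news"});
        System.out.println(topPages(users, visits, 5));
    }
}
```

The Java version needs all the data in one process, while the Pig script is compiled into mapreduce jobs that run across the cluster.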
lots of constructs for data manipulation
load/store     Read/write data from file system
dump           Write output to stdout
foreach        Apply expression to each record and output one or more records
filter         Apply predicate and remove records that do not return true
group/cogroup  Collect records with the same key from one or more inputs
join           Join two or more inputs based on a key
cross          Generate the cartesian product of two or more inputs
order          Sort records based on a key
distinct       Remove duplicate records
union          Merge two data sets
split          Split data into 2 or more sets, based on filter conditions
limit          Limit the number of records
stream         Send all records through a user-provided binary
so, what can we use
this for?
• log processing and analysis
• user preference tracking /
recommendations
• multimedia processing
• ...
example: New York Times
• Needed offline conversion of public domain articles from 1851-1922
• Used Hadoop to convert scanned images to PDF, on 100 Amazon EC2
instances for around 24 hours
• 4 TB of input, 1.5 TB of output
published in 1892. Copyright The New York Times,
coming next: speed
dating
• finally, computers are useful!
• Online Dating Advice: Exactly
What To Say In A First Message
http://bit.ly/MHIST
• The Speed Dating dataset
http://bit.ly/2sOkXm
img by DougSavage - savagechickens.com
after the talk...
• hadoop and pig docs
• our very own step-by-step
tutorial
http://developer.yahoo.com/
hadoop/tutorial
• there are now books too
• http://huguk.org/
and if you get stuck
• http://developer.yahoo.com
• http://hadoop.apache.org
• common-user@hadoop.apache.org
• pig-user@hadoop.apache.org
• IRC: #hadoop on irc.freenode.org
img from icanhascheezburger.com
thank you!
ricardo varela
ricardov@yahoo-inc.com
http://twitter.com/phobeo

More Related Content

What's hot

Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGAdam Kawa
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandasPurna Chander K
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And HdfsCloudera, Inc.
 

What's hot (20)

Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Hive and data analysis using pandas
 Hive  and  data analysis  using pandas Hive  and  data analysis  using pandas
Hive and data analysis using pandas
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 

Viewers also liked

Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopDavid Yahalom
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 

Viewers also liked (12)

Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 

Similar to introduction to data processing using Hadoop and Pig

Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.comRenzo Tomà
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Javamalduarte
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM Joy Rahman
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingCollin Bennett
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconPeter Lawrey
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - HadoopTalentica Software
 
Python VS GO
Python VS GOPython VS GO
Python VS GOOfir Nir
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 

Similar to introduction to data processing using Hadoop and Pig (20)

Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.com
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
GC free coding in @Java presented @Geecon
GC free coding in @Java presented @GeeconGC free coding in @Java presented @Geecon
GC free coding in @Java presented @Geecon
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 
Python VS GO
Python VS GOPython VS GO
Python VS GO
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 

More from Ricardo Varela

Mobile mas alla de la app store: APIs y Mobile Web - MobileConGalicia 2011
Mobile mas alla de la app store: APIs y Mobile Web - MobileConGalicia 2011Mobile mas alla de la app store: APIs y Mobile Web - MobileConGalicia 2011
Mobile mas alla de la app store: APIs y Mobile Web - MobileConGalicia 2011Ricardo Varela
 
WAC Network APIs @ OverTheAir 2011
WAC Network APIs @ OverTheAir 2011WAC Network APIs @ OverTheAir 2011
WAC Network APIs @ OverTheAir 2011Ricardo Varela
 
Over The Air 2010: Privacy for Mobile Developers
Over The Air 2010: Privacy for Mobile DevelopersOver The Air 2010: Privacy for Mobile Developers
Over The Air 2010: Privacy for Mobile DevelopersRicardo Varela
 
Blueprint talk at Open Hackday London 2009
Blueprint talk at Open Hackday London 2009Blueprint talk at Open Hackday London 2009
Blueprint talk at Open Hackday London 2009Ricardo Varela
 
Yahoo Mobile Widget Vision
Yahoo Mobile Widget VisionYahoo Mobile Widget Vision
Yahoo Mobile Widget VisionRicardo Varela
 
Creating Yahoo Mobile Widgets
Creating Yahoo Mobile WidgetsCreating Yahoo Mobile Widgets
Creating Yahoo Mobile WidgetsRicardo Varela
 

More from Ricardo Varela (7)

Mobile mas alla de la app store: APIs y Mobile Web - MobileConGalicia 2011
Mobile mas alla de la app store: APIs y Mobile Web - MobileConGalicia 2011Mobile mas alla de la app store: APIs y Mobile Web - MobileConGalicia 2011
Mobile mas alla de la app store: APIs y Mobile Web - MobileConGalicia 2011
 
WAC Network APIs @ OverTheAir 2011
WAC Network APIs @ OverTheAir 2011WAC Network APIs @ OverTheAir 2011
WAC Network APIs @ OverTheAir 2011
 
Over The Air 2010: Privacy for Mobile Developers
Over The Air 2010: Privacy for Mobile DevelopersOver The Air 2010: Privacy for Mobile Developers
Over The Air 2010: Privacy for Mobile Developers
 
Blueprint talk at Open Hackday London 2009
Blueprint talk at Open Hackday London 2009Blueprint talk at Open Hackday London 2009
Blueprint talk at Open Hackday London 2009
 
yahoo mobile widgets
yahoo mobile widgetsyahoo mobile widgets
yahoo mobile widgets
 
Yahoo Mobile Widget Vision
Yahoo Mobile Widget VisionYahoo Mobile Widget Vision
Yahoo Mobile Widget Vision
 
Creating Yahoo Mobile Widgets
Creating Yahoo Mobile WidgetsCreating Yahoo Mobile Widgets
Creating Yahoo Mobile Widgets
 

Recently uploaded

COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 

Recently uploaded (20)

COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 

introduction to data processing using Hadoop and Pig

  • 1. introduction to data processing using Hadoop and Pig ricardo varela ricardov@yahoo-inc.com http://twitter.com/phobeo yahoo ydn tuesdays London, 6th oct 2009
  • 2. ah! the data! • NYSE generates 1 terabyte of data per day • The LHC in Geneva will produce 15 petabytes of data per year • The estimated “digital info” by 2011: 1.8 zettabytes (a zettabyte is 1,000,000,000,000,000,000,000 = 10^21 bytes) • Think status updates, facebook photos, slashdot comments (individual digital footprints get to TB/year) unlicensed img from IBM archives; data from The Diverse and Exploding Digital Universe, by IDC
  • 3. “everything counts in large amounts...” • where do you store a petabyte? • how do you read it? (remote 10 MB/sec, local 100 MB/sec) • how do you process it? • and what if something goes wrong? data from The Diverse and Exploding Digital Universe, by IDC
  • 4. so, here comes parallel computing! “In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.” (Grace Hopper)
  • 5. however... “There are 3 rules to follow when parallelizing large code bases. Unfortunately, no one knows what these rules are.” (Gary R. Montry)
  • 6. enter mapreduce • introduced by Jeff Dean and Sanjay Ghemawat (google), based on functional programming “map” and “reduce” functions • distributes load and reads/writes to distributed filesystem img courtesy of Janne, http://helmer.sfe.se/
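The functional “map” and “reduce” that the model borrows from can be sketched in plain Java. This is only an illustration of the two primitives on a single machine, not of Hadoop itself; the class and method names (`MapReduceIdea`, `squareAll`, `sum`) are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceIdea {
    // "map": apply the same function to every element independently,
    // which is what makes the work easy to spread over many machines
    static List<Integer> squareAll(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // "reduce": fold all elements into a single value
    static int sum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);
        // 1 + 4 + 9 + 16
        System.out.println(sum(squareAll(input)));
    }
}
```

In MapReduce proper, the map step emits key/value pairs and the reduce step runs once per key, which is what the later examples in this deck show.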
  • 8. apache hadoop • top level apache project since jan 2008 • open source, java-based • winner of the terabyte sort benchmark • heavily invested in and used inside Yahoo!
  • 10. hdfs • designed to store lots of data in a reliable and scalable way • sequential access and read-focused, with replication
  • 12. simple mapreduce • note: beware of the single reduce! :)
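One way to read the “single reduce” warning, assuming it refers to jobs that run with only one reduce task: every key is routed to that one reducer, so the reduce phase gets no parallelism. The sketch below uses the same arithmetic as Hadoop's default hash partitioning (hash code with the sign bit cleared, modulo the number of reducers); the class name `PartitionSketch` and the sample keys are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {
    // Clear the sign bit of the hash, then take the remainder
    // by the number of reduce tasks to pick a reducer index.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<String> years = Arrays.asList("1949", "1950", "1951", "1952");
        for (int reducers : new int[] {1, 4}) {
            for (String year : years) {
                System.out.println(reducers + " reducer(s): " + year
                        + " -> reducer " + partition(year, reducers));
            }
        }
    }
}
```

With one reducer the index is always 0, whatever the key; with more reducers the keys are spread over the available indices.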
  • 14. example: simple processing
        #!/bin/bash
        # search maximum temperatures according to NCDC records
        for year in all/*
        do
          echo -ne `basename $year .gz`"\t"
          gunzip -c $year |
            awk '{ temp = substr($0, 88, 5) + 0;
                   q = substr($0, 93, 1);
                   if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp; }
                 END { print max }'
        done
  • 15. example: simple processing • processing the data for the last 100 years may take on the order of an hour (and does not scale) • we can express the same in terms of a single map and reduce
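Before the Hadoop version, the shape of the computation can be sketched on a single machine: a map step that emits (year, temperature) pairs, a grouping step that stands in for the shuffle, and a reduce step that keeps the maximum per year. This is a simplified sketch, not the Hadoop API; it assumes tab-separated `year`, `temperature`, `quality` fields rather than the fixed-width NCDC records, and the class name `MaxTemperatureSketch` and the sample lines are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxTemperatureSketch {
    static final int MISSING = 9999; // marker for a missing reading

    static Map<String, Integer> maxByYear(List<String> lines) {
        // "map" + "shuffle": emit (year, temperature) for valid records
        // and collect the values that share the same key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            String year = fields[0];
            int temperature = Integer.parseInt(fields[1]);
            String quality = fields[2];
            if (temperature != MISSING && quality.matches("[01459]")) {
                grouped.computeIfAbsent(year, k -> new ArrayList<>()).add(temperature);
            }
        }
        // "reduce": called once per key, folds the values into a maximum
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int max = Integer.MIN_VALUE;
            for (int t : entry.getValue()) {
                max = Math.max(max, t);
            }
            result.put(entry.getKey(), max);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1950\t0\t1", "1950\t22\t1", "1950\t-11\t1",
                "1949\t111\t1", "1949\t78\t1",
                "1949\t9999\t1",  // missing reading, dropped by the filter
                "1950\t300\t2");  // bad quality code, dropped by the filter
        System.out.println(maxByYear(lines));
    }
}
```

Because each line is mapped independently and each year is reduced independently, both steps can be split across machines, which is what the mapper and reducer classes do.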
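  • 16. example: mapper
        import java.io.IOException;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        public class MaxTemperatureMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

          // marker used in the records for a missing temperature reading
          private static final int MISSING = 9999;

          public void map(LongWritable key, Text value,
              OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') {
              // parseInt doesn't like leading plus signs
              airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
              airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            // emit (year, temperature) only for valid readings
            if (airTemperature != MISSING && quality.matches("[01459]")) {
              output.collect(new Text(year), new IntWritable(airTemperature));
            }
          }
        }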
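  • 17. example: reducer
        import java.io.IOException;
        import java.util.Iterator;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reducer;
        import org.apache.hadoop.mapred.Reporter;

        public class MaxTemperatureReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

          public void reduce(Text key, Iterator<IntWritable> values,
              OutputCollector<Text, IntWritable> output, Reporter reporter)
              throws IOException {
            // keep the largest temperature seen for this year
            int maxValue = Integer.MIN_VALUE;
            while (values.hasNext()) {
              maxValue = Math.max(maxValue, values.next().get());
            }
            output.collect(key, new IntWritable(maxValue));
          }
        }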
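  • 18. example: driver
        import java.io.IOException;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;

        public class MaxTemperature {
          public static void main(String[] args) throws IOException {
            JobConf conf = new JobConf(MaxTemperature.class);
            conf.setJobName("Max temperature");
            // args[0] is the input path, args[1] the output path
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            conf.setMapperClass(MaxTemperatureMapper.class);
            conf.setReducerClass(MaxTemperatureReducer.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            JobClient.runJob(conf);
          }
        }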
  • 19. et voilà! • our process runs on the order of minutes (for 10 nodes) and scales almost linearly (the limit being how splittable the input is)
  • 20. but it may get verbose... • needs a bit of code to make it work • chain jobs together (sequences can just use JobClient.runJob() but more complex dependencies need JobControl) • also, for simple tasks, you can resort to hadoop streaming unlicensed image from The Matrix, copyright Warner Bros.
  • 21. pig to the rescue • makes it simpler to write mapreduce programs • Pig Latin abstracts away the specific details so you can focus on data processing
  • 22. simple example, now with pig
        -- max_temp.pig: Finds the maximum temperature by year
        records = LOAD 'sample.txt'
          AS (year:chararray, temperature:int, quality:int);
        filtered_records = FILTER records BY temperature != 9999 AND
          (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
        grouped_records = GROUP filtered_records BY year;
        max_temp = FOREACH grouped_records GENERATE group,
          MAX(filtered_records.temperature);
        DUMP max_temp;
  • 23. a more complex use • user data collection in one file • website visits data in log • find the top 5 most visited pages by users aged 18 to 25
  • 25. and now with pig...
        Users = LOAD 'users' AS (name, age);
        Fltrd = FILTER Users BY age >= 18 AND age <= 25;
        Pages = LOAD 'pages' AS (user, url);
        Jnd = JOIN Fltrd BY name, Pages BY user;
        Grpd = GROUP Jnd BY url;
        Smmd = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
        Srtd = ORDER Smmd BY clicks DESC;
        Top5 = LIMIT Srtd 5;
        STORE Top5 INTO 'top5sites';
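To make the dataflow of the Pig script concrete, here is a single-machine sketch of the same steps (filter, join, group, count, order, limit) using Java streams. It is not how Pig or Hadoop execute the script; the class `TopPagesSketch`, its method, and the sample data are hypothetical, and it assumes user names are unique, so the join reduces to a membership check.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TopPagesSketch {
    // users: name -> age; visits: each entry is {user, url}
    static List<String> topPages(Map<String, Integer> users, List<String[]> visits, int n) {
        // FILTER Users BY age >= 18 AND age <= 25
        Set<String> selected = users.entrySet().stream()
                .filter(e -> e.getValue() >= 18 && e.getValue() <= 25)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        // JOIN on user name, then GROUP BY url and COUNT
        Map<String, Long> clicks = visits.stream()
                .filter(v -> selected.contains(v[0]))
                .collect(Collectors.groupingBy(v -> v[1], Collectors.counting()));
        // ORDER BY clicks DESC (ties broken by url), then LIMIT n
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> users = new HashMap<>();
        users.put("alice", 20);
        users.put("bob", 30);
        users.put("carol", 18);
        List<String[]> visits = Arrays.asList(
                new String[] {"alice", "/a"}, new String[] {"alice", "/b"},
                new String[] {"carol", "/a"}, new String[] {"bob", "/b"},
                new String[] {"bob", "/b"}, new String[] {"carol", "/c"});
        System.out.println(topPages(users, visits, 5));
    }
}
```

The tie-break on url is an addition for deterministic output in this sketch; the Pig script only orders by click count.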
  • 26. lots of constructs for data manipulation
        load/store     Read/write data from file system
        dump           Write output to stdout
        foreach        Apply expression to each record and output one or more records
        filter         Apply predicate and remove records that do not return true
        group/cogroup  Collect records with the same key from one or more inputs
        join           Join two or more inputs based on a key
        cross          Generates the cartesian product of two or more inputs
        order          Sort records based on a key
        distinct       Remove duplicate records
        union          Merge two data sets
        split          Split data into 2 or more sets, based on filter conditions
        limit          Limit the number of records
        stream         Send all records through a user provided binary
  • 27. so, what can we use this for? • log processing and analysis • user preference tracking / recommendations • multimedia processing • ...
  • 28. example: New York Times • Needed offline conversion of public domain articles from 1851-1922 • Used Hadoop to convert scanned images to PDF, on 100 Amazon EC2 instances for around 24 hours • 4 TB of input, 1.5 TB of output (img caption: published in 1892. Copyright The New York Times)
  • 29. coming next: speed dating • finally, computers are useful! • Online Dating Advice: Exactly What To Say In A First Message http://bit.ly/MHIST • The Speed Dating dataset http://bit.ly/2sOkXm img by DougSavage - savagechickens.com
  • 30. after the talk... • hadoop and pig docs • our very own step-by-step tutorial http://developer.yahoo.com/hadoop/tutorial • now there are also books • http://huguk.org/
  • 31. and if you get stuck • http://developer.yahoo.com • http://hadoop.apache.org • common-user@hadoop.apache.org • pig-user@hadoop.apache.org • IRC: #hadoop on irc.freenode.org img from icanhascheezburger.com