
introduction to data processing using Hadoop and Pig

In this talk we give an introduction to data processing with big data and review the basic concepts of MapReduce programming with Hadoop. We also comment on the use of Pig to simplify the development of data processing applications.

YDN Tuesdays are geek meetups organized by YDN in London on the first Tuesday of each month.



  1. introduction to data processing using Hadoop and Pig ricardo varela ricardov@yahoo-inc.com http://twitter.com/phobeo yahoo ydn tuesdays London, 6th oct 2009
  2. ah! the data! • NYSE generates 1 terabyte of data per day • The LHC in Geneva will produce 15 petabytes of data per year • The estimated “digital info” by 2011: 1.8 zettabytes (that is 1,000,000,000,000,000,000,000 = 10^21 bytes) • Think status updates, facebook photos, slashdot comments (individual digital footprints get to TB/year) unlicensed img from IBM archives; data from The Diverse and Exploding Digital Universe, by IDC
  3. “everything counts in large amounts...” • where do you store a petabyte? • how do you read it? (remote: 10 MB/sec, local: 100 MB/sec) • how do you process it? • and what if something goes wrong? data from The Diverse and Exploding Digital Universe, by IDC
  4. so, here comes parallel computing! In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers Grace Hopper
  5. however... There are 3 rules to follow when parallelizing large code bases. Unfortunately, no one knows what these rules are Gary R. Montry
  6. enter mapreduce • introduced by Jeff Dean and Sanjay Ghemawat (Google), based on the functional programming “map” and “reduce” functions • distributes load and reads/writes to a distributed filesystem img courtesy of Janne, http://helmer.sfe.se/
  7. enter mapreduce • introduced by Jeff Dean and Sanjay Ghemawat (Google), based on the functional programming “map” and “reduce” functions • distributes load and reads/writes to a distributed filesystem
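The map/shuffle/reduce contract described above can be sketched as an in-memory toy in plain Java (a word count; this only illustrates the semantics and is not the Hadoop API — all names here are made up for the sketch):

```java
import java.util.*;

// Toy in-memory MapReduce: word count. The "framework" part is just the
// shuffle (grouping) step; map and reduce mirror the user-supplied functions.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // map phase: emit a (word, 1) pair for every word in every line
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                if (!word.isEmpty())
                    emitted.add(Map.entry(word, 1));

        // shuffle phase: group all emitted values by key
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> kv : emitted)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());

        // reduce phase: collapse each key's value list (here: sum the 1s)
        Map<String, Integer> result = new HashMap<>();
        grouped.forEach((word, ones) ->
                result.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
    }
}
```

In real Hadoop the three phases run on different machines and the shuffle moves data over the network, but the key-grouping guarantee is exactly this one: every value emitted for a given key reaches the same reduce call.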
  8. apache hadoop • top level apache project since jan 2008 • open source, java-based • winner of the terabyte sort benchmark • heavily invested in and used inside Yahoo!
  9. apache hadoop • top level apache project since jan 2008 • open source, java-based • winner of the terabyte sort benchmark • heavily invested in and used inside Yahoo!
  10. hdfs • designed to store lots of data in a reliable and scalable way • sequential access and read-focused, with replication
  11. simple mapreduce
  12. simple mapreduce • note: beware of the single reduce! :)
  13. simple mapreduce
  14. example: simple processing #!/bin/bash # search maximum temperatures according to NCDC records for year in all/* do echo -ne `basename $year .gz`"\t" gunzip -c $year | awk '{ temp = substr($0,88,5) + 0; q = substr($0,93,1); if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp } END { print max }' done
  15. example: simple processing • data for the last 100 years may take on the order of an hour (and doesn't scale) • we can express the same computation in terms of a single map and reduce
  16. example: mapper public class MaxTemperatureMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private static final int MISSING = 9999; public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); String year = line.substring(15, 19); int airTemperature; if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs airTemperature = Integer.parseInt(line.substring(88, 92)); } else { airTemperature = Integer.parseInt(line.substring(87, 92)); } String quality = line.substring(92, 93); if (airTemperature != MISSING && quality.matches("[01459]")) { output.collect(new Text(year), new IntWritable(airTemperature)); } } }
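The fixed-width parsing inside the mapper can be exercised in isolation with plain Java. The record below is synthetic, built only to match the column offsets the slides use (year at columns 15-18, signed temperature at 87-91, quality flag at column 92); `ParseDemo` is a hypothetical helper, not part of the talk's code:

```java
// Sketch: the mapper's fixed-width NCDC parsing, tested without Hadoop.
public class ParseDemo {
    static final int MISSING = 9999;

    // Returns {year, airTemperature}, or null if the record should be dropped.
    static int[] parse(String line) {
        int year = Integer.parseInt(line.substring(15, 19));
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            return new int[] { year, airTemperature };
        }
        return null;
    }

    public static void main(String[] args) {
        // Build a 93-char synthetic record: year 1950, temperature +0011,
        // quality flag 1, zeros everywhere else.
        StringBuilder sb = new StringBuilder("0".repeat(93));
        sb.replace(15, 19, "1950");
        sb.replace(87, 92, "+0011");
        sb.replace(92, 93, "1");
        int[] r = parse(sb.toString());
        System.out.println(r[0] + " " + r[1]);
    }
}
```

Keeping the parsing in a small pure function like this makes the mapper itself trivial to unit-test, which is handy given how easy it is to get substring offsets off by one.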
  17. example: reducer public class MaxTemperatureReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int maxValue = Integer.MIN_VALUE; while (values.hasNext()) { maxValue = Math.max(maxValue, values.next().get()); } output.collect(key, new IntWritable(maxValue)); } }
  18. example: driver public static void main(String[] args) throws IOException { JobConf conf = new JobConf(MaxTemperature.class); conf.setJobName("Max temperature"); FileInputFormat.addInputPath(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); conf.setMapperClass(MaxTemperatureMapper.class); conf.setReducerClass(MaxTemperatureReducer.class); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); JobClient.runJob(conf); }
  19. et voilà! • our process runs on the order of minutes (for 10 nodes) and is almost linearly scalable (the limit being how splittable the input is)
  20. but it may get verbose... • needs a fair bit of code to make it work • chaining jobs together (simple sequences can just use JobClient.runJob(), but more complex dependencies need JobControl) • also, for simple tasks, you can resort to hadoop streaming unlicensed image from The Matrix, copyright Warner Bros.
  21. pig to the rescue • makes it simpler to write mapreduce programs • Pig Latin abstracts you from the specific details and lets you focus on data processing
  22. simple example, now with pig -- max_temp.pig: Finds the maximum temperature by year records = LOAD 'sample.txt' AS (year:chararray, temperature:int, quality:int); filtered_records = FILTER records BY temperature != 9999 AND (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9); grouped_records = GROUP filtered_records BY year; max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature); DUMP max_temp;
  23. a more complex use • user data collection in one file • website visits data in log • find the top 5 most visited pages by users aged 18 to 25
  24. in mapreduce...
  25. and now with pig... Users = LOAD 'users' AS (name, age); Fltrd = FILTER Users BY age >= 18 AND age <= 25; Pages = LOAD 'pages' AS (user, url); Jnd = JOIN Fltrd BY name, Pages BY user; Grpd = GROUP Jnd BY url; Smmd = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks; Srtd = ORDER Smmd BY clicks DESC; Top5 = LIMIT Srtd 5; STORE Top5 INTO 'top5sites';
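For intuition about what Pig compiles into MapReduce jobs for you, the same dataflow can be sketched over in-memory collections in plain Java (the sample users and visits are made up; a semi-join stands in for the JOIN here, which gives the same counts as long as user names are unique):

```java
import java.util.*;
import java.util.stream.*;

// In-memory sketch of the Pig script's dataflow: filter users by age,
// join with page visits on user name, group by URL, count, order
// descending, take the top 5.
public class Top5Sites {
    record User(String name, int age) {}
    record Visit(String user, String url) {}

    public static List<Map.Entry<String, Long>> top5(List<User> users, List<Visit> visits) {
        Set<String> eligible = users.stream()
                .filter(u -> u.age() >= 18 && u.age() <= 25)     // FILTER ... BY age
                .map(User::name)
                .collect(Collectors.toSet());
        return visits.stream()
                .filter(v -> eligible.contains(v.user()))         // JOIN (as a semi-join)
                .collect(Collectors.groupingBy(Visit::url,
                        Collectors.counting()))                   // GROUP + COUNT
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue()
                        .reversed())                              // ORDER ... BY clicks DESC
                .limit(5)                                         // LIMIT ... 5
                .toList();
    }

    public static void main(String[] args) {
        List<User> users = List.of(new User("amy", 22), new User("bob", 40));
        List<Visit> visits = List.of(new Visit("amy", "/a"), new Visit("amy", "/a"),
                new Visit("amy", "/b"), new Visit("bob", "/c"));
        System.out.println(top5(users, visits));
    }
}
```

The point of the comparison is that each line of the Pig script maps onto one of these relational steps; on a cluster, Pig turns the same pipeline into a chain of MapReduce jobs with a distributed join and shuffle instead of in-memory sets.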
  26. lots of constructs for data manipulation • load/store: read/write data from the file system • dump: write output to stdout • foreach: apply an expression to each record and output one or more records • filter: apply a predicate and remove records that do not return true • group/cogroup: collect records with the same key from one or more inputs • join: join two or more inputs based on a key • cross: generate the cartesian product of two or more inputs • order: sort records based on a key • distinct: remove duplicate records • union: merge two data sets • split: split data into 2 or more sets, based on filter conditions • limit: limit the number of records • stream: send all records through a user-provided binary
  27. so, what can we use this for? • log processing and analysis • user preference tracking / recommendations • multimedia processing • ...
  28. example: New York Times • Needed offline conversion of public domain articles from 1851-1922 • Used Hadoop to convert scanned images to PDF, on 100 Amazon EC2 instances for around 24 hours • 4 TB of input, 1.5 TB of output (img: article published in 1892, copyright The New York Times)
  29. coming next: speed dating • finally, computers are useful! • Online Dating Advice: Exactly What To Say In A First Message http://bit.ly/MHIST • The Speed Dating dataset http://bit.ly/2sOkXm img by DougSavage - savagechickens.com
  30. after the talk... • hadoop and pig docs • our very own step-by-step tutorial http://developer.yahoo.com/hadoop/tutorial • now there are also books • http://huguk.org/
  31. and if you get stuck • http://developer.yahoo.com • http://hadoop.apache.org • common-user@hadoop.apache.org • pig-user@hadoop.apache.org • IRC: #hadoop on irc.freenode.org img from icanhascheezburger.com
  32. thank you! ricardo varela ricardov@yahoo-inc.com http://twitter.com/phobeo
