Introduction to Hadoop
   The data bazooka


        Iván de Prado Alonso // @ivanprado // @datasalt
Datasalt

  Focus on Big Data
  –   Contribution to Open Source
  –   Consulting & Development
  –   Training
BIG “MAC” DATA
Anatomy of a Big Data project

              Acquisition

              Processing

              Serving
Types of Big Data systems

●   Offline
    –   Latency is not a problem
●   Online
    –   Immediacy of the data matters
●   Mixed
    –   The most common case

    Offline                       Online
    MapReduce                     NoSQL databases
    Hadoop                        Search engines
    Distributed RDBMS
“Swiss army knife of the 21st century”
                                    Media Guardian Innovation Awards

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
History

●   2004-2006
    –   Google publishes the GFS and MapReduce papers
    –   Doug Cutting implements an open-source version in Nutch
●   2006-2008
    –   Hadoop is split off from Nutch
    –   Web scale is reached in 2008
●   2008-present
    –   Hadoop becomes popular and commercial exploitation begins

                               Source: Hadoop: a brief history. Doug Cutting
Hadoop

     “The Apache Hadoop software library is a framework that allows for
      the distributed processing of large data sets across clusters of
      computers using a simple programming model”

                                              From the Hadoop homepage
Distributed File System

●   Distributed file system (HDFS)
    –   Large blocks: 64 MB
         ●   Stored in the OS file system
    –   Fault tolerant (replication)
    –   Common formats:
         ●   Plain-text files (CSV)
         ●   SequenceFiles (see the sketch below)
              –   Sequences of [key, value] pairs
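
A minimal sketch (not from the deck) of writing one of these [key, value]
SequenceFiles with the standard Hadoop API; the file name and records are
made up for illustration:

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.SequenceFile;
   import org.apache.hadoop.io.Text;

   public class SequenceFileWriteExample {
       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           Path path = new Path("counts.seq"); // hypothetical output path
           FileSystem fs = FileSystem.get(path.toUri(), conf);
           // Writer for [Text, IntWritable] pairs, mirroring the word count records
           SequenceFile.Writer writer =
               SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
           try {
               writer.append(new Text("esto"), new IntWritable(2));
               writer.append(new Text("es"), new IntWritable(1));
           } finally {
               writer.close();
           }
       }
   }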
MapReduce

●   Two functions (Map and Reduce)
    –   Map(k, v) : [z, w]*
    –   Reduce(k, v*) : [z, w]*
●   Example: counting words
    –   Map([document, null]) -> [word, 1]*
    –   Reduce(word, 1*) -> [word, total]
●   MapReduce and SQL
    –   SELECT word, count(*) GROUP BY word
●   Distributed execution on a cluster with horizontal scalability
The classic Word Count

  Input:
      Esto es una linea
      Esto también

  Map:
      map(“Esto es una linea”) =
          esto, 1
          es, 1
          una, 1
          linea, 1
      map(“Esto también”) =
          esto, 1
          también, 1

  Reduce:
      reduce(es, {1}) = es, 1
      reduce(esto, {1, 1}) = esto, 2
      reduce(linea, {1}) = linea, 1
      reduce(también, {1}) = también, 1
      reduce(una, {1}) = una, 1

  Result:
      es, 1
      esto, 2
      linea, 1
      también, 1
      una, 1
Word Count in Hadoop

   import java.io.IOException;
   import java.util.StringTokenizer;

   import org.apache.hadoop.conf.Configured;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.Reducer;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
   import org.apache.hadoop.util.Tool;
   import org.apache.hadoop.util.ToolRunner;

   public class WordCountHadoop extends Configured implements Tool {

       public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
               StringTokenizer itr = new StringTokenizer(value.toString());
               while(itr.hasMoreTokens()) {
                   word.set(itr.nextToken());
                   context.write(word, one);
               }
           }
       }

       public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

           private IntWritable result = new IntWritable();

           public void reduce(Text key, Iterable<IntWritable> values, Context context)
               throws IOException, InterruptedException {
               int sum = 0;
               for(IntWritable val : values) {
                   sum += val.get();
               }
               result.set(sum);
               context.write(key, result);
           }
       }

       @Override
       public int run(String[] args) throws Exception {

           if(args.length != 2) {
               System.err.println("Usage: wordcount-hadoop <in> <out>");
               System.exit(2);
           }

           Path output = new Path(args[1]);
           // HadoopUtils is a helper from the talk's codebase: it deletes the output path if it exists
           HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

           Job job = new Job(getConf(), "word count hadoop");
           job.setJarByClass(WordCountHadoop.class);
           job.setMapperClass(TokenizerMapper.class);
           job.setCombinerClass(IntSumReducer.class);
           job.setReducerClass(IntSumReducer.class);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           FileInputFormat.addInputPath(job, new Path(args[0]));

           FileOutputFormat.setOutputPath(job, new Path(args[1]));
           job.waitForCompletion(true);

           return 0;
       }

       public static void main(String[] args) throws Exception {
           ToolRunner.run(new WordCountHadoop(), args);
       }
   }

 Too much at once? Let's go through it piece by piece!
Word Count in Hadoop – Mapper


     public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

         private final static IntWritable one = new IntWritable(1);
         private Text word = new Text();

         public void map(Object key, Text value, Context context)
             throws IOException, InterruptedException {

             StringTokenizer itr = new StringTokenizer(value.toString());
             while(itr.hasMoreTokens()) {
                 word.set(itr.nextToken());
                 context.write(word, one);
             }
         }
     }
Word Count in Hadoop – Reducer


     public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

         private IntWritable result = new IntWritable();

         public void reduce(Text key, Iterable<IntWritable> values, Context context)
             throws IOException, InterruptedException {
             int sum = 0;
             for(IntWritable val : values) {
                 sum += val.get();
             }
             result.set(sum);
             context.write(key, result);
         }
     }
Word Count in Hadoop – Configuration and execution

        if(args.length != 2) {
            System.err.println("Usage: wordcount-hadoop <in> <out>");
            System.exit(2);
        }

        Path output = new Path(args[1]);
        HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

        Job job = new Job(getConf(), "word count hadoop");
        job.setJarByClass(WordCountHadoop.class);
        job.setMapperClass(TokenizerMapper.class);
        // Summing is associative and commutative, so the reducer doubles as the combiner
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
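
As a companion sketch (not in the original deck): once the job finishes, its
result can be read back through the same FileSystem API used above.
part-r-00000 is the conventional name of the first reducer's output file; the
output directory is whatever was passed as <out>.

   import java.io.BufferedReader;
   import java.io.InputStreamReader;

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class PrintWordCounts {
       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           // First reducer's output file inside the job's output directory
           Path result = new Path(args[0], "part-r-00000");
           FileSystem fs = FileSystem.get(result.toUri(), conf);
           BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(result)));
           try {
               String line;
               while((line = reader.readLine()) != null) {
                   System.out.println(line); // word <TAB> count
               }
           } finally {
               reader.close();
           }
       }
   }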
Execution of a MapReduce Job

  [Diagram: the blocks of the input file are processed by mappers running on
  nodes 1 and 2; the intermediate data is shuffled to reducers on nodes 1
  and 2, which write the final result.]
Serialization

 ●   Writables
     • Hadoop's native serialization
     • Very low level (see the sketch below)
     • Basic types: IntWritable, Text, etc.
 ●   Others
     • Thrift, Avro, Protostuff
     • Backwards compatibility.
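
To see how low-level Writables are, here is a hypothetical custom type (not
from the deck) that serializes a [url, timestamp] record by hand:

   import java.io.DataInput;
   import java.io.DataOutput;
   import java.io.IOException;

   import org.apache.hadoop.io.Writable;

   // Hypothetical example type: a [url, timestamp] record serialized by hand.
   public class UrlTimestampWritable implements Writable {

       private String url;
       private long timestamp;

       public UrlTimestampWritable() {} // Writables need a no-arg constructor

       @Override
       public void write(DataOutput out) throws IOException {
           out.writeUTF(url);        // the field order IS the wire format...
           out.writeLong(timestamp); // ...changing it breaks previously written data
       }

       @Override
       public void readFields(DataInput in) throws IOException {
           url = in.readUTF();
           timestamp = in.readLong();
       }
   }

The hand-rolled wire format is why the slide credits Thrift, Avro and
Protostuff with backwards compatibility: they version their schemas, while a
plain Writable does not.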
Hadoop's
learning curve
  is steep
Tuple MapReduce

●   A simpler MapReduce
    –   Tuples instead of key/value pairs
    –   At the job level you define:
         ●   The fields to group by
         ●   The fields to sort by
    –   Tuple MapReduce-join
Pangool
●   An implementation of
    Tuple MapReduce
    –   Developed by Datasalt
    –   Open source
    –   Efficiency comparable to
        Hadoop's
●   Goal: to replace the Hadoop
    API
●   If you want to learn
    Hadoop, start with Pangool
    (a word count sketch follows below)
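
A condensed word count in Pangool, assembled from the API calls that appear
in the URL-resolution example later in this deck (TupleMRBuilder, Schema,
TupleMapper, TupleReducer). Treat it as a sketch under those assumptions,
not as the official Pangool example:

   static Schema getWordCountSchema() {
       List<Field> fields = new ArrayList<Field>();
       fields.add(Field.create("word", Type.STRING));
       fields.add(Field.create("count", Type.LONG));
       return new Schema("wordCount", fields);
   }

   public static class Tokenizer extends TupleMapper<LongWritable, Text> {
       private Tuple tuple = new Tuple(getWordCountSchema());

       @Override
       public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
           throws IOException, InterruptedException {
           for(String word : value.toString().split("\\s+")) {
               tuple.set("word", word);
               tuple.set("count", 1L);
               collector.write(tuple);
           }
       }
   }

   public static class Sum extends TupleReducer<Text, NullWritable> {
       @Override
       public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
           throws IOException, InterruptedException, TupleMRException {
           long total = 0;
           for(ITuple tuple : tuples) {
               total += (Long) tuple.get("count");
           }
           collector.write(new Text(group.get("word") + "\t" + total), NullWritable.get());
       }
   }

   // Job wiring: group by the "word" field; no custom sort is needed here
   TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Word Count");
   mr.addIntermediateSchema(getWordCountSchema());
   mr.setGroupByFields("word");
   mr.setTupleReducer(new Sum());
   mr.setOutput(new Path(output), new HadoopOutputFormat(TextOutputFormat.class),
       Text.class, NullWritable.class);
   mr.addInput(new Path(input), new HadoopInputFormat(TextInputFormat.class), new Tokenizer());
   mr.createJob().waitForCompletion(true);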
Pangool's efficiency
●   Comparable to Hadoop's

    See http://pangool.net/benchmark.html
Pangool – URL resolution

●   A join example
    –   Very hard in Hadoop. Easy in Pangool.
●   The problem:
    –   There are many URL shorteners and redirections
    –   For data analysis it is often useful to replace each URL with its
        canonical URL
    –   Suppose we have both datasets:
        ●   A map with URL → canonical URL entries
        ●   A dataset with the URLs we want to resolve, plus other fields.
    –   The following Pangool job solves the problem in a scalable way.
URL Resolution – Defining the Schemas


    static Schema getURLRegisterSchema() {
        List<Field> urlRegisterFields = new ArrayList<Field>();
        urlRegisterFields.add(Field.create("url", Type.STRING));
        urlRegisterFields.add(Field.create("timestamp", Type.LONG));
        urlRegisterFields.add(Field.create("ip", Type.STRING));
        return new Schema("urlRegister", urlRegisterFields);
    }

    static Schema getURLMapSchema() {
        List<Field> urlMapFields = new ArrayList<Field>();
        urlMapFields.add(Field.create("url", Type.STRING));
        urlMapFields.add(Field.create("canonicalUrl", Type.STRING));
        return new Schema("urlMap", urlMapFields);
    }
URL Resolution – Loading the file to resolve


    public static class UrlProcessor extends TupleMapper<LongWritable, Text> {

        private Tuple tuple = new Tuple(getURLRegisterSchema());

        @Override
        public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {

            // Input lines are tab-separated: url <TAB> timestamp <TAB> ip
            String[] fields = value.toString().split("\t");
            tuple.set("url", fields[0]);
            tuple.set("timestamp", Long.parseLong(fields[1]));
            tuple.set("ip", fields[2]);
            collector.write(tuple);
        }
    }
URL Resolution – Loading the URL map


    public static class UrlMapProcessor extends TupleMapper<LongWritable, Text> {

        private Tuple tuple = new Tuple(getURLMapSchema());

        @Override
        public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {

            // Input lines are tab-separated: url <TAB> canonical url
            String[] fields = value.toString().split("\t");
            tuple.set("url", fields[0]);
            tuple.set("canonicalUrl", fields[1]);
            collector.write(tuple);
        }
    }
URL Resolution – Resolving in the reducer

    public static class Handler extends TupleReducer<Text, NullWritable> {

        private Text result;

        @Override
        public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException, TupleMRException {

            if(result == null) {
                result = new Text();
            }
            String canonicalUrl = null;
            for(ITuple tuple : tuples) {
                if("urlMap".equals(tuple.getSchema().getName())) {
                    canonicalUrl = tuple.get("canonicalUrl").toString();
                } else {
                    result.set(canonicalUrl + "\t" + tuple.get("timestamp") + "\t" + tuple.get("ip"));
                    collector.write(result, NullWritable.get());
                }
            }
        }
    }

This works because the job orders each group by schema (see addSchemaOrder in
the configuration below): the urlMap tuple arrives before the register tuples,
so canonicalUrl is already set when they are emitted.
URL Resolution – Configuring and launching the job

  String input1 = args[0];
  String input2 = args[1];
  String output = args[2];

  deleteOutput(output);

  TupleMRBuilder mr = new TupleMRBuilder(conf, "Pangool Url Resolution");
  mr.addIntermediateSchema(getURLMapSchema());
  mr.addIntermediateSchema(getURLRegisterSchema());
  mr.setGroupByFields("url");
  // Within each "url" group, order tuples by schema so that urlMap tuples come first
  mr.setOrderBy(
      new OrderBy().add("url", Order.ASC).addSchemaOrder(Order.ASC));
  mr.setTupleReducer(new Handler());
  mr.setOutput(new Path(output),
      new HadoopOutputFormat(TextOutputFormat.class),
      Text.class,
      NullWritable.class);
  mr.addInput(new Path(input1),
      new HadoopInputFormat(TextInputFormat.class),
      new UrlMapProcessor());
  mr.addInput(new Path(input2),
      new HadoopInputFormat(TextInputFormat.class),
      new UrlProcessor());
  mr.createJob().waitForCompletion(true);
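
To make the join concrete, here is a hypothetical run; file names and records
are invented for illustration, and fields are tab-separated:

  urlmap.txt (input1):
      http://bit.ly/abc    http://example.com/page

  registers.txt (input2):
      http://bit.ly/abc    1334000000    10.0.0.1
      http://bit.ly/abc    1334000060    10.0.0.2

  output:
      http://example.com/page    1334000000    10.0.0.1
      http://example.com/page    1334000060    10.0.0.2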