Big Data, MapReduce and
         beyond
             Iván de Prado Alonso // @ivanprado
           Pere Ferrera Bertran // @ferrerabertran
                                        @datasalt
Outline




                                         Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
1. Big Data. What and why
2. MapReduce & Hadoop
3. MapReduce Design Patterns
4. Real-life MapReduce
   1. Data analytics
   2. Crawling
   3. Full-text indexing
   4. Reputation systems
   5. Data Mining
5. Tuple MapReduce & Pangool




Big Data
What and why
In the past...




●   Data and computation fit on one monolithic machine
●   Monolithic databases: RDBMS
●   Scalability:
    –   Vertical: buy better hardware
●   Distributed systems
    –   Not very common
    –   Logic-centric: data moves to where the logic is
●   Distributed storage: SAN




Distributed systems are hard




●   Building distributed systems is hard
    –   If you can scale vertically at a reasonable cost, why deal with the
        complexity of distributed systems?
●   But circumstances are changing:
    –   Big Data
●   Big Data refers to amounts of data so massive that they are difficult
    to analyze and handle using common database management
    tools




BIG DATA “MAC”
Big Data




●   Data is the new bottleneck
    –   Web data
         ●   Web pages
         ●   Interaction Logs
    –   Social networks data
    –   Mobile devices
    –   Data generated by sensors
●   Old systems/techniques are not appropriate
●   A new approach is needed




Big Data project parts

[Diagram: the three parts of a Big Data project: Acquiring, Processing and Serving]
Acquiring




●   Gathering/receiving/storing data from sources
●   Many kinds of sources
    –   Internet
    –   Sensors
    –   User behavior
    –   Mobile devices
    –   Health care data
    –   Banking data
    –   Social Networks
    –   …..




Processing




●   Data is present in the system (acquired)
●   This step is responsible for extracting value from the data
    –   Eliminate duplicates
    –   Infer relations
    –   Calculate statistics
    –   Correlate information
    –   Ensure quality
    –   Generate recommendations
    –   ….




Serving




●   In most cases, some interface has to be provided for accessing
    the processed information
●   Possibilities
    –   Big Data / no Big Data
    –   Real-time access to results / non-real-time access
●   Some examples:
    –   Search engine → inverted index
    –   Banking data → relational database
    –   Social Network → NoSQL database




Big Data system types




●   Offline
    –   Latency is not a problem
●   Online
    –   Response immediacy is important
●   Mixed
    –   Online behavior, but internally a mixture of two systems
         ●   One online
         ●   One offline

                 Offline                                Online
                 MapReduce                              NoSQL
                 Hadoop                                 Search engines
                 Distributed RDBMS
Big Data system types (II)

[Diagram: example systems built from Acquiring (A), Processing (P) and Serving (S) components, arranged as offline, online and mixed architectures]
MapReduce & Hadoop
“Swiss army knife of the
                                           21st century”
                                                         Media Guardian Innovation Awards




http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
History




●   2004-2006
    –   GFS and MapReduce papers published by Google
    –   Doug Cutting implements an open source version for Nutch
●   2006-2008
    –   Hadoop project becomes independent from Nutch
    –   Web scale reached in 2008
●   2008-now
    –   Hadoop becomes popular and is commercially exploited




                                    Source: Hadoop: a brief history. Doug Cutting

Hadoop

     “The Apache Hadoop
      software library is a
  framework that allows for
        the distributed
  processing of large data
    sets across clusters of
       computers using a
    simple programming
            model”
            From Apache Hadoop page



Main ideas




●   Distributed
    –   Distributed storage
    –   Distributed computation platform
●   Built to be fault tolerant
●   Shared nothing architecture
●   Programmer isolation from distributed system difficulties
    –   By providing simple programming primitives




Hadoop Distributed File System (HDFS)




●   Distributed
    –   Aggregates the individual storage of each node
●   Files formed by blocks
    –   Typically 64 or 128 MB (configurable)
    –   Stored in the OS filesystem: XFS, ext3, etc.
●   Fault tolerant
    –   Blocks replicated more than once




How files are stored

[Diagram: the NameNode keeps the metadata for Data.txt (its blocks and their locations); the blocks themselves are stored, replicated, across the DataNodes: block 1 on DN1 and DN2, block 2 on DN1 and DN4, block 3 on DN2 and DN3, block 4 on DN4 and DN3]
MapReduce




●   MapReduce is the abstraction behind Hadoop
●   The unit of execution is the Job
●   A Job has
    –   An input
    –   An output
    –   A map function
    –   A reduce function
●   Input and output are sequences of key/value pairs
●   The map and reduce functions are provided by the developer
    –   The execution is distributed and parallelized by Hadoop




Job phases




●   Two different phases: mapping and reducing
●   Mapping phase
    –   The map function is applied to the input data
         ●   Intermediate data is generated
●   Reducing phase
    –   The reduce function is applied to the intermediate data
         ●   The final output is generated




MapReduce




●   Two functions (Map & Reduce)
    –   Map(u, v) : [w,x]*
    –   Reduce(w, x*) : [y, z]*
●   Example: word count
    –   Map([document, null]) -> [word, 1]*
    –   Reduce(word, 1*) -> [word, total]
●   MapReduce & SQL
    –   SELECT word, count(*) GROUP BY word
●   Distributed execution in a cluster
    –   Horizontal scalability




Word Count




  This is a line
  Also this

 Map:
  map(“This is a line”) =
      this, 1
      is, 1
      a, 1
      line, 1
  map(“Also this”) =
      also, 1
      this, 1

 Reduce:
  reduce(a, {1}) = a, 1
  reduce(also, {1}) = also, 1
  reduce(is, {1}) = is, 1
  reduce(line, {1}) = line, 1
  reduce(this, {1, 1}) = this, 2

 Result:
  a, 1
  also, 1
  is, 1
  line, 1
  this, 2



Map examples




●   Swap Mapper
    –   Swaps key and value
    map(key, value):
        emit (value, key)

●   Split Key Mapper
    –   Splits the key into words and emits a pair for each word
    map(key, value):
        words = key.split(" ")
        for each word in words:
            emit (word, value)




Map examples (II)




●   Filter Mapper
    –   Filters out some records
    map(key, value):
        if (key != "the"):
            emit (key, value)

●   Key/value concatenation mapper
    –   Concatenates the key and the value in the key
    map(key, value):
        emit (key + " " + value, null)




Reduce examples




●   Count reducer
    –   Counts the number of elements for each key
    reduce(key, values):
        count = 0
        for each value in values:
            count++
        emit(key, count)



●   Average reducer
    –   Computes the average value for each key
    reduce(key, values):
        count = 0
        total = 0
        for each value in values:
            count++
            total += value
        emit(key, total / count)


Reduce examples (II)




●   Keep first reducer
    –   Keeps the first key/value input pair
    reduce(key, values):
        emit(key, first(values))




●   Value concatenation reducer
    –   Concatenates the values in one string

    reduce(key, values):
        result = ""
        for each value in values:
            result += " " + value
        emit(key, result)




Identity map and reduce




●   The identity functions are those that keep the input
    unchanged
     –   Map identity
    map(key, value):
        emit (key, value)




     –   Reduce identity

    reduce(key, values):
        for each value in values:
            emit (key, value)




Putting all together




    map(k, v) → [w, x]*
    reduce(w, [x]+) → [y, z]*


●   Job flow:
    –   The mapper generates key/value pairs
    –   These pairs are grouped by key
    –   Hadoop calls the reduce function once for each group
    –   The output of the reduce function is the final Job output
●   Hadoop will distribute the work
    –   Different nodes in the cluster will process the data in parallel




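The job flow above can be simulated in a few lines of Python. This is a single-process sketch, not Hadoop code: `run_job`, `map_fn` and `reduce_fn` are names we made up, and the shuffle is just an in-memory dictionary.

```python
from collections import defaultdict

def run_job(inputs, map_fn, reduce_fn):
    # 1. Map: generate intermediate key/value pairs
    groups = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    # 2. Shuffle & sort: pairs are grouped by key, keys processed in order
    # 3. Reduce: one call per key group; its output is the Job output
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

def map_fn(line):
    for word in line.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    yield (word, sum(counts))

print(run_job(["This is a line", "Also this"], map_fn, reduce_fn))
# → [('a', 1), ('also', 1), ('is', 1), ('line', 1), ('this', 2)]
```

The result matches the Word Count trace from the earlier slide.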
Job Execution

[Diagram: the input splits (blocks) are read by map tasks running on each node; the intermediate data is shuffled to the reduce tasks, whose output is the Job output]
Job Execution (II)




●   Key/value pairs are sorted by key in the shuffle & sort phase
    –   That is needed in order to group records by key when calling the
        reducer
    –   It also means that calls to the reduce function are made in key order
         ●   The reduce call for key “A” always happens before the reduce
             call for key “B” within the same reduce task
●   Reducers start downloading data from the mappers as soon
    as possible
    –   In order to reduce the time spent in the shuffle & sort phase
    –   The number of reducers can be configured by the programmer




Partial Sort Job

  ●   A job composed of the identity map and the identity reduce
       –     It just sorts the data by key within each reducer

[Diagram: input file D B A B C D E A; Map 1 processes D B A B and Map 2 processes C D E A; after the shuffle, Reduce 1 receives and outputs A A D D while Reduce 2 receives and outputs B B C E]
Input Splits




●     Each map task processes one input split
       –   A map task starts processing at the first complete record, and finishes
           by processing the record crossed by the rightmost split boundary


[Diagram: a file is divided into Input Splits 1 to 4; split boundaries may fall in the middle of a record, and each split is processed by its own Map Task]
Combiner




●   Intermediate data goes from the map tasks to the reduce tasks
    through the network
    –   Network can be saturated
●   Combiners can be used to reduce the amount of data sent to
    the reducers
    –   When the operation is commutative and associative
●   A combiner is a function similar to the reducer
    –   But it is executed in the map task, just after all the mapping has been
        done
●   Combiners can't have side effects
    –   Because Hadoop can decide to execute them or not




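A sketch of the idea in Python (our own `combine` helper, not a Hadoop API; in real Hadoop the combiner is a reducer-like class that the framework may invoke zero or more times per map task):

```python
from collections import defaultdict

def combine(map_output):
    # Local, per-map-task aggregation: (word, 1), (word, 1), ... → (word, n).
    # Safe only because addition is commutative and associative.
    partial = defaultdict(int)
    for key, value in map_output:
        partial[key] += value
    return list(partial.items())

# One map task's intermediate output for "this is this is this"
map_output = [("this", 1), ("is", 1), ("this", 1), ("is", 1), ("this", 1)]
print(combine(map_output))  # 5 records shrink to 2 before hitting the network
```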
Design Patterns
Design patterns




1. Filtering
2. Secondary sorting
3. Distributed execution
4. Computing statistics
5. Count distinct
6. Sorting
7. Joins
8. Reconciliation




Filtering




●   Filtering: we process the input file in parallel
    with Hadoop and emit a smaller dataset
    in the end

[Diagram: Input data → if(condition) { emit(); } → Output data]



Secondary sorting




●   Receive reducer values in a specific order
●   Moving averages:
    –   Secondary sort by timestamp
    –   Fill an in-memory window and perform average.
●   Top N items in a group:
    –   Secondary sort by <X>
    –   Emit the first N elements in a group
●   Useful, yet quite difficult to implement in Hadoop.
[Diagram: the composite key is ordered by the Sort Comparator; the Group Comparator and the Partitioner act only on a prefix of the key]
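What the comparators buy you can be simulated in plain Python (a sketch with invented names: sorting by the composite key `(user, timestamp)` plays the sort comparator, grouping on `user` alone plays the group comparator):

```python
from itertools import groupby

def values_in_time_order(events):
    # events: (user, timestamp, value) triples.
    # Sort comparator: order by the full composite key (user, timestamp).
    events = sorted(events, key=lambda e: (e[0], e[1]))
    # Group comparator: group on user alone, so each "reducer call"
    # receives that user's values already in timestamp order.
    return {user: [v for _, _, v in group]
            for user, group in groupby(events, key=lambda e: e[0])}

events = [("u1", 3, 30), ("u2", 1, 5), ("u1", 1, 10), ("u1", 2, 20)]
print(values_in_time_order(events))  # {'u1': [10, 20, 30], 'u2': [5]}
```

With the values arriving in timestamp order, a moving average only needs a fixed-size window in memory.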
Distributed execution without Hadoop




●   Distributed queue
    –   A common queue is needed to coordinate and assign work
●   Distributed workers
    –   Consumers running on each node, getting work from the queue
●   Problems:
    –   Difficult to coordinate
         ●   Failover
         ●   Losing messages
         ●   Load balancing
    –   The queue must scale




Distributed execution with Hadoop




●   Map-only Jobs.
●   Use Hadoop just for the sake of “parallelizing something”.
●   Anything that doesn't involve a “group by” (no shuffle/reducer)
●   Examples:
    –   Text categorization
    –   Filtering
    –   Crawling
                              Map 1   Map 2   …         Map n
    –   Updating a DB
    –   Distributed grep
●   NLineInputFormat can be handy for that.




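For instance, a distributed grep is nothing more than a map function run independently over each input split (hypothetical `grep_mapper`, not a Hadoop API):

```python
import re

def grep_mapper(pattern, lines):
    # Map-only: each line is filtered independently; there is no key
    # grouping, so no shuffle and no reducer are needed.
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

split = ["error: disk full", "ok", "error: timeout"]
print(list(grep_mapper("error", split)))  # ['error: disk full', 'error: timeout']
```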
Disadvantages




●   Work is done in batches
    –   And the distribution is probably not even
         ●   Some resources are wasted
●   There are some tricks to alleviate the problem
    –   Task timeout + carrying the remaining work over to the next execution




Computing statistics (I)




●   Count, Sum, Average, Std. Dev...
●   Aggregated by something
●   Recall SQL: select user, count(clicks) … group by user

                 user, click        user, click   Map: emit (user, click)


                                     Reduce by user: count values


                           user, count(click)




Computing statistics (II)




●   When computing sum(), avg(), etc., Combiners are often needed

●   Imagine a user performed 3 million clicks
    –   Then, one reducer will receive 3 million records
    –   That reducer will be the bottleneck of the Job: everyone needs to wait
        for it to count 3 million things


●   Solution: Perform partial counts in a Combiner

●   Combiner is executed before shuffling, after Mapper.




Computing statistics (III)




●   Using a Combiner:


              user, click            user, click       Map



           user, count(click)     user, count(click)   Combine



                    user, sum(count(click))            Reduce




●   For each group, reducer aggregates N counts in the worst case! (N =
    #mappers)
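The flow above, sketched in Python (invented `map_task`/`reduce_task` helpers; each map task ships one partial count per user instead of one record per click):

```python
from collections import Counter

def map_task(records):
    # Map: emit (user, 1) per click, then combine locally per map task
    combined = Counter()
    for user, _click in records:
        combined[user] += 1
    return list(combined.items())  # (user, partial_count) pairs

def reduce_task(partials):
    # Reduce: sums at most one partial count per map task and user
    totals = Counter()
    for user, partial in partials:
        totals[user] += partial
    return dict(totals)

task1 = map_task([("alice", "c1"), ("alice", "c2"), ("bob", "c3")])
task2 = map_task([("alice", "c4")])
print(reduce_task(task1 + task2))  # {'alice': 3, 'bob': 1}
```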
Distinct




●   How to calculate distinct count(something) group by X ?
●   It is somewhat easy (2 M/Rs):


    M/R 1 (eliminates duplicates):
    –   emit ({X, something}, null)
    –   so rows are grouped by ({X, something})
    –   In the reducer, just emit the first (ignore duplicates)
    M/R 2 (groups by X and counts):
    –   For each input (X, something) → emit (X, 1)
    –   group by (X)
    –   The reducer counts the incoming values


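The two jobs can be sketched in Python (here our `job1` stands in for the duplicate-eliminating M/R and `job2` for the counting one; a Python set plays the role of grouping by (X, something)):

```python
from collections import defaultdict

def job1(pairs):
    # M/R 1: group by (X, something); the reducer emits one record
    # per group, which eliminates the duplicates
    return sorted(set(pairs))

def job2(distinct_pairs):
    # M/R 2: group by X; the reducer counts the incoming values
    counts = defaultdict(int)
    for x, _something in distinct_pairs:
        counts[x] += 1
    return dict(counts)

rows = [("X1", "s1"), ("X1", "s1"), ("X1", "s2"), ("X1", "s1"), ("X1", "s2"),
        ("X2", "s1"), ("X2", "s1"), ("X2", "s3"), ("X2", "s1")]
print(job2(job1(rows)))  # {'X1': 2, 'X2': 2}
```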
Distinct: example




            M/R 1                     M/R 2

      (X1, s1)
      (X1, s1)
      (X1, s2)      (X1, s1)   (X1, s1)
      (X1, s1)      (X1, s2)   (X1, s2)       X1 → 2
      (X1, s2)      (X2, s1)   (X2, s1)       X2 → 2
      (X2, s1)      (X2, s3)   (X2, s3)
      (X2, s1)
      (X2, s3)
      (X2, s1)




Distinct: Secondary sort




●   We can calculate distinct in only one Job
●   Using Secondary Sorting


    M/R:
    –   emit ({X, something}, null)
    –   group by (X), secondary sort by (something)
    –   The reducer: iterate, count & emit “something, count” when
        “something” changes. Reset the counter each “something” change.


●   Need to use a Combiner to eliminate duplicates (otherwise the
    reducer would receive too many records).
●   distinct count() is more parallelizable with 2 Jobs than with 1!

Sorting




●   We have seen how sorting is (partially) inherent in Hadoop.
●   But if we want “pure” sorting:
    –   Use one Reducer (not scalable)
    –   Use an advanced partitioning strategy


●   Yahoo! TeraSort (http://sortbenchmark.org/Yahoo2009.pdf)
●   Use sampling to calculate data distribution
●   Implement custom Partitioning according to distribution




Sorting (II)




●   Hash partitioning: consecutive keys are scattered across the reducers
          0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 ...

●   Distribution-aware partitioning: each reducer gets a contiguous,
    similarly-sized key range
          [    0    ][    1    ][    2    ]




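The TeraSort idea, sampling cut points for a distribution-aware partitioner, can be sketched like this (invented helper names, uniform random keys standing in for real data):

```python
import bisect
import random

def build_cutpoints(sample, num_partitions):
    # Pick num_partitions - 1 boundaries from a sorted sample so each
    # reducer gets a similar share of the key distribution
    sample = sorted(sample)
    step = len(sample) // num_partitions
    return [sample[i * step] for i in range(1, num_partitions)]

def partition(key, cutpoints):
    # Range partitioning: reducer i gets the i-th contiguous key range,
    # so concatenating the reducer outputs yields a totally sorted result
    return bisect.bisect_right(cutpoints, key)

random.seed(0)
keys = [random.randint(0, 999) for _ in range(10000)]
cuts = build_cutpoints(random.sample(keys, 100), 3)
sizes = [0, 0, 0]
for k in keys:
    sizes[partition(k, cuts)] += 1
print(cuts, sizes)  # partition sizes come out roughly even
```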
Joins




●   Joining two (or more) datasets is a quite common need.
●   The difficulty is that both datasets may be “too big”
    –   Otherwise, an in-memory join can be done quite easily just by reading
        one of the datasets into RAM.

●   “Big joins” are commonly done “reduce-side”:

        Map dataset 1:   (K1, d11)   Map dataset 2:   (K1, d21)
                         (K2, d12)                    (K2, d22)

                                             Reduce by common key (K1, K2, ...)

                                         K1 → d11, d21
                         Reduce: Join
                                         K2 → d12, d22

●   The so-called “map-side joins” are more complex and tricky.

                                                                                 57 / 112
Joins: 1-N relation




●   Use secondary sorting to get the “one-side” of the relation first
    –   Otherwise you need to use memory to perform the join
         ●   Does not scale
●   Employee (E) – Sales join (S)

             SSESSS               You need to use memory


             ESSSSS                Memory not needed




                                                               58 / 112
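The "one-side first" trick amounts to adding a type flag to the sort key. A local Python sketch of the shuffle-and-sort step (record shapes here are illustrative, not Hadoop's API):

```python
# Compound sort key: group by employee id, sort by (id, type flag),
# where flag 0 = Employee and flag 1 = Sale.
records = [
    ("e2", 1, {"amount": 10}),
    ("e1", 1, {"amount": 5}),
    ("e1", 0, {"name": "Alice"}),
    ("e2", 0, {"name": "Bob"}),
    ("e1", 1, {"amount": 7}),
]

# The framework performs this sort for us during the shuffle.
records.sort(key=lambda r: (r[0], r[1]))

# The reducer for "e1" now sees E S S ... instead of S S E ..., so it can
# stream over the sales without buffering the whole group in memory.
e1_flags = [flag for emp_id, flag, _data in records if emp_id == "e1"]
```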
Left – Right – Inner joins




●     Join between Employee and Sales

    reducer(key, values):
         employee = null
         first = first(values)
         rest = rest(values)
         If isEmployee(first)
                employee = first

         If employee = null
               // right join      SSSSS
         else if size(rest) = 0
                // left join       E
         else
                // inner join      ESSSSS




                                            59 / 112
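The pseudocode above can be rendered as a runnable local sketch. Real Hadoop streams values through an iterator; here a plain list stands in for it, and each value is tagged with its source dataset, ("E", data) or ("S", data):

```python
def join_reducer(key, values):
    """Classify a group as a left, right or inner join result.
    `values` must be sorted so that the Employee record, if present,
    comes first (the secondary-sorting pattern)."""
    if values and values[0][0] == "E":
        employee, sales = values[0], values[1:]
    else:
        employee, sales = None, list(values)

    if employee is None:
        return ("right", key, sales)        # S S S S S: sales only
    if not sales:
        return ("left", key, employee)      # E: employee without sales
    return ("inner", key, employee, sales)  # E S S S S S: both sides

inner = join_reducer("e1", [("E", "Alice"), ("S", 5), ("S", 7)])
left = join_reducer("e2", [("E", "Bob")])
right = join_reducer("e3", [("S", 9), ("S", 3)])
```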
Design patterns




1. Filtering
2. Secondary sorting
3. Distributed execution
4. Computing statistics
5. Count distinct
6. Sorting
7. Joins
8. Reconciliation




                           60 / 112
Reconciliation




●   Hadoop can be used to “simulate a database”.
●   For that, we need to:
    –   Merge data state s(t) with past state s(t-1) using a Join.
    –   Update rows with the same ID (performing whatever logic).
    –   Store the result as the next full data state.
    –   Rotate states:
         ●   s(t-1) = s(t)
          ●   s(t) = s(t + 1)

[Diagram: states s(t-1) and s(t) feed an M/R job that produces s(t+1)]

                                                                              61 / 112
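The merge step boils down to a join by row id where the newest version wins. A minimal local sketch, assuming rows of the form (id, timestamp, payload):

```python
from itertools import groupby
from operator import itemgetter

def reconcile(previous_state, new_data):
    """Merge state s(t-1) with incoming rows to produce s(t):
    for rows sharing an id, the one with the latest timestamp wins."""
    merged = sorted(previous_state + new_data, key=itemgetter(0, 1))
    return [list(rows)[-1]  # keep only the newest version of each row
            for _row_id, rows in groupby(merged, key=itemgetter(0))]

s_prev = [("a", 1, "old-a"), ("b", 1, "old-b")]
incoming = [("a", 2, "new-a"), ("c", 2, "new-c")]
s_next = reconcile(s_prev, incoming)
```

In a real job the "update" branch would run whatever per-row logic is needed instead of simply keeping the latest version.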
Real-life Hadoop projects




●   95% of real-life Hadoop projects are a mixture of the patterns we
    just saw.
●   Example: A vertical search-engine.
     –    Distributed execution: Feed crawl / parse
     –    Data reconciliation: Merge data by listing ID
     –    Join: Augment listings with geographical info
     –    …
●   Example: Payments data stats.
     –    Secondary sort: weekly, daily & monthly stats
     –    Distributed execution: Random-updates to a DB
●   ...


                                                              62 / 112
Real-life MapReduce




1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining




                        64 / 112
Data analytics




●   Obvious use case for MapReduce.
●   Examples:
    –   Unique visits per page.
    –   Top products per month.
    –   Unique clicks per banner.
    –   Etc.
●   Offline analytics (Batch-oriented).
    –   Online analytics not a good fit for MapReduce.




                                                         65 / 112
Data analytics: How it works




●   A batch process that uses all historical data.
    –   Recompute everything always.
    –   Easier to manage and maintain than incremental computation.
●   A chain of MapReduce steps produce the final output.
●   There are tools that ease building / maintaining the MapReduce
    chain:
    –   Hive, Pig, Cascading, Pangool for programming a MapReduce flow
        easily.
    –   Oozie, Azkaban for connecting existing MapReduce jobs easily.
         ●   Scheduling flows and such.




                                                                        66 / 112
Data analytics: Difficulties




●   Some things are harder to calculate than others.
●   Calculating unique visits per page.
    –   A simple solution in two MapReduce steps or a more sophisticated one
        in a single MapReduce step.
    –   Approximated methods can be used as well.
●   Calculating the median.
    –   Need to sort the whole dataset and iterate twice if we don’t know the
        number of elements.




                                                                              67 / 112
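The two-step unique-visits computation can be simulated locally as two chained group-bys, one per MapReduce step (a sketch, not an actual Hadoop job):

```python
def unique_visits(visits):
    """visits: iterable of (page, visitor_id) pairs.
    Step 1: group by (page, visitor) -- collapses duplicate visits.
    Step 2: group by page -- counts the distinct visitors that survived."""
    step1 = {(page, visitor) for page, visitor in visits}  # dedup step
    counts = {}
    for page, _visitor in step1:
        counts[page] = counts.get(page, 0) + 1
    return counts

uniques = unique_visits([
    ("/home", "u1"), ("/home", "u1"), ("/home", "u2"), ("/about", "u1"),
])
```

The single-step variant groups by page and sorts by visitor within the group; approximate methods (e.g. probabilistic counters) trade exactness for a single pass and less memory.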
Data analytics: Examples




●   Gather clicks on pages.
    –   Save (click, page, timestamp) in the HDFS.
●   A MapReduce job groups by page and counts the number of
    clicks:
         ●   map: emit(page, click).
         ●   reduce: (page, list<click>) emits (page, totalClicks).
●   We now have the total number of clicks per page.




                                                                      68 / 112
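This click-counting job is WordCount in disguise; a minimal local simulation of the two phases (event fields are illustrative):

```python
from collections import defaultdict

def map_phase(events):
    """map: for each stored (click_id, page, timestamp), emit (page, 1)."""
    for _click_id, page, _timestamp in events:
        yield page, 1

def reduce_phase(pairs):
    """reduce: group the emitted pairs by page and sum the counts."""
    totals = defaultdict(int)
    for page, count in pairs:
        totals[page] += count
    return dict(totals)

events = [(1, "/home", 100), (2, "/home", 130), (3, "/about", 150)]
clicks_per_page = reduce_phase(map_phase(events))
```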
Data analytics: Examples (II)




●   Another MapReduce job groups by day and page and counts
    the number of clicks:
         ●   map: emit((page, day), click).
         ●   reduce: Same as before.
●   We now have the total number of clicks per page and day.
●   These are simple examples, but data analytics can get as
    sophisticated as you want.
    –   Example: calculate a 10 bar histogram of the distribution of clicks over
        the hours of the day for each page.




                                                                           69 / 112
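Even the histogram example reduces to picking the right grouping key plus a small in-reducer aggregation. A local sketch, assuming timestamps in epoch seconds and hours 0-23 mapped proportionally onto 10 bars:

```python
from collections import defaultdict

def click_histograms(events, bins=10):
    """map: emit (page, hour-of-day) per click;
    reduce: fold each page's hours into a `bins`-bar histogram."""
    histograms = defaultdict(lambda: [0] * bins)
    for _click_id, page, epoch_seconds in events:
        hour = (epoch_seconds // 3600) % 24
        histograms[page][hour * bins // 24] += 1  # hour 0..23 -> bar 0..bins-1
    return dict(histograms)

events = [(1, "/home", 0), (2, "/home", 1800), (3, "/home", 23 * 3600)]
hist = click_histograms(events)
```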
Real-life MapReduce




1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining




                        70 / 112
Crawling: Web Crawling




●   Web Crawling:
    –   “A Web crawler is a computer program that browses the World Wide
        Web in a methodical, automated manner or in an orderly fashion.”
●   Applications:
    –   Search engines.
    –   NLP (Sentiment analysis).
●   Examples:




                                                                     71 / 112
Real-life MapReduce




1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining




                        72 / 112
Crawling: Web Crawling (at scale)




●   How to parallelize storage and bandwidth?
●   How to deduplicate stored URLs?
●   Other complexities: politeness, infinite loops, robots.txt,
    canonical URLs, pagination, parsing, …
●   Relevancy: PageRank.




                                                                  73 / 112
Crawling: Nutch




●   What is Nutch?
    –   Open source web-search software project.
    –   Apache project.
    –   Hadoop, Tika, Lucene, SOLR.


●   (Brief) history
    –   Started in 2002/2003
    –   2005: MapReduce
    –   2006: Hadoop
    –   2006/2007: Tika
    –   2010 TLP Apache project




                                                   74 / 112
Crawling: Nutch: How it works




●   “Select, Crawl, Parse, Dedup by URL” loop.
●   Lucene, SOLR for indexing.
    –   We will see them later.
●   CrawlDB: Pages are saved in HDFS.
●   MapReduce makes storage and computing scalable.
    –   Helps in deduplicating pages by URL.
    –   Helps in identifying new pages to crawl.




                                                      75 / 112
Crawling: Not-Only Web Crawling




●   Custom crawlers:
    –   Tweets.
    –   XML feeds.
●   Simpler, as we usually don’t need to traverse a tree.
    –   Sometimes only crawling a fixed seed of resources is enough.
●   Applications
    –   Vertical search engines.
    –   Reputation systems.




                                                                       76 / 112
Crawling: Example: Crawling tweets at scale




●   Use a scalable computing engine for fetching tweets.
    –   Storm is a good fit.
    –   Hadoop can be used as well.
         ●   Tricky usage of MapReduce: Create as many groups as crawlers
             and embed a Crawler in them.
●   Save raw feed data (JSON) in HDFS.
●   MapReduce: Parse JSON tweets.
●   MapReduce: Deduplicate tweets.
●   MapReduce: Analyze tweets and perform data analysis.




                                                                     77 / 112
Crawling: Example: Crawling tweets at scale

[Diagram: fetched tweets are saved in HDFS; M/R Parse, M/R Dedup and M/R Analysis jobs run in sequence over them to produce the results]

                                                              78 / 112
Real-life MapReduce




1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining




                        79 / 112
Full-text indexing: Definitions




●   Search engine:
    –   An information retrieval system designed to help find information
        stored on a computer system.
●   Inverted index:
    –   Index data structure storing a mapping from content, such as words or
        numbers, to its locations in a database file, or in a document or a set of
        documents.
    –   B-Trees are not inverted indexes!
●   Stemming.
●   Relevancy in results.




                                                                            80 / 112
Full-text indexing: Applications




●   Web search engines
    –   Finding relevant pages for a topic
●   Vertical search engines
    –   Finding jobs by description
●   Social networks
    –   Finding messages by text
●   e-Commerce
    –   Finding articles by description
●   In general, any service or application needing efficient text
    information retrieval




                                                                81 / 112
Full-text indexing (at scale)




●   Real-time indexing versus batch-indexing:
    –   The first is cool: it is real-time, but it is difficult. We will not address it
        now.
    –   The second is not real-time, but it is simpler.
●   How to batch-index a big corpus dataset?
    –   Need a scalable storage, (HDFS).
●   How to deduplicate documents?
    –   MapReduce to the rescue (like we saw before).
●   How to generate multiple indexes?
    –   MapReduce can help (we will see how).




                                                                                   82 / 112
Full-text indexing: MapReduce




●   MapReduce can be used to generate an inverted index.
    –   Vertical vs. horizontal partitioning.
●   Example:
    –   Map: emit(word, docId)
    –   Reduce: emit(word, list<docIds>)
●   Quite simple. But what about stop words, stemming, etc?
●   How to store the index?
●   Better not to reinvent the wheel.




                                                              83 / 112
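The naive inverted-index job from this slide, as a local Python simulation. It does no stop-word removal or stemming, which is precisely the machinery Lucene/SOLR already provide:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: iterable of (doc_id, text).
    map: emit (word, doc_id) per token;
    reduce: collect the postings list per word."""
    postings = defaultdict(set)
    for doc_id, text in docs:
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Sorted postings lists, as a real index would store them.
    return {word: sorted(ids) for word, ids in postings.items()}

index = build_inverted_index([
    (1, "big data with Hadoop"),
    (2, "big joins in Hadoop"),
])
```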
Full-text indexing: Lucene / SOLR




●   Lucene: Doug Cutting’s library.
    –   From Nutch.
    –   Mainstream open-source implementation of an inverted index.
    –   Efficient disk allocation, highly performant.
●   SOLR: Mainstream open-source search server.
    –   Provides stemming, analyzers, HTTP servlets, etc.
    –   Lacks some other desirable properties:
         ●   Elasticity, real-time indexing, horizontal partitioning (although work
             in progress).
●   Still the reference technology for creating and serving inverted
    indexes.



                                                                              84 / 112
Full-text indexing: MapReduce meets SOLR




●   We can use MapReduce for scaling the indexing process.
●   At the same time, we can use SOLR for creating the resulting
    index.
    –   SOLR is used as a library.
●   Generated indexes are later deployed to the search servers.




                                                             85 / 112
Full-text indexing: Example




●   A vertical job search engine.
●   Jobs are parsed from crawled feeds and saved in the HDFS.
●   MapReduce for deduplicating job offers.
    –   map: emit(jobId, job)
    –   reduce (jobId, list<job>) -> emit (jobId, job)
         ●   Retention policy: keep latest job.




                                                           86 / 112
Full-text indexing: Example (II)




●   MapReduce for augmenting job information (adding
    geographical information).
    –   map1: emit(job.city, job)
    –   map2: emit(city, geoInfo)
    –   reduce (city, geoInfo, list<job>) -> for each job emit(job, geoInfo)
●   MapReduce for distributing the index process:
    –   map: emit(job.country, job)
    –   reduce: (job.country, list<job>) -> Create index for country “job.country”
        using SOLR.
●   Deploy per-country indexes to search cluster.




                                                                               87 / 112
Full-text indexing: Example




[Diagram: XML feeds are parsed into HDFS; a chain of M/R jobs (Parse → Dedup → Geo info → Index) joins in the geographical info and builds the indexes, which are then deployed to the Search Cluster]

                                                              88 / 112
Real-life MapReduce




1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining




                        89 / 112
Reputation: Definitions




●   What is reputation?
●   Reputation in social communities.
    –   eBay, StackOverflow...
●   Reputation in social media.
    –   Twitter, Facebook...
●   Why is it important?




                                        90 / 112
Reputation: Relationships




●   Modelling relationships is needed for calculating reputation
●   Graph-like models arise
●   Usually stored as vertices
    –   A interacts with B
    –   or, A trusts B
    –   or, A → B
●   Internet-scale graphs can be stored in HDFS
    –   Each vertex in a row
    –   Add needed metadata to vertices: date, etc.




                                                              91 / 112
Reputation: MapReduce analysis on vertices




●   All people whom “A” interacted with.
    –   map: (a, b)
    –   reduce: (a, list<b>).
●   Essentially things like PageRank can be very easily
    implemented.
●   PageRank as a measure of page relevancy from page inlinks.
    –   But it can be extrapolated to any kind of authority or trust metric.
    –   E.g. People relevancy from social networks.




                                                                           92 / 112
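One PageRank iteration maps naturally onto MapReduce: the map spreads each vertex's rank along its out-edges, and the reduce sums the contributions per target vertex. A minimal local sketch with the usual 0.85 damping factor (every vertex here has out-links, so the total rank is conserved):

```python
def pagerank_step(ranks, out_links, damping=0.85):
    """One MapReduce iteration:
    map: emit (target, rank / outdegree) per out-edge;
    reduce: sum contributions per vertex and apply the damping factor."""
    contributions = {}
    for vertex, targets in out_links.items():
        share = ranks[vertex] / len(targets)
        for target in targets:
            contributions[target] = contributions.get(target, 0.0) + share
    n = len(ranks)
    return {v: (1 - damping) / n + damping * contributions.get(v, 0.0)
            for v in ranks}

out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 / 3 for v in out_links}
for _ in range(30):  # each iteration = one MapReduce job in the chain
    ranks = pagerank_step(ranks, out_links)
```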
Reputation: Going deeper on graphs




●   Friends of friends of friends.
    –   1 MapReduce step: My friends.
    –   2 MapReduce steps: Friends of my friends.
    –   3 MapReduce steps: Friends of friends of my friends.
●   Iterative MapReduce solves it.
●   But there are better foundational models such as Google’s
    Pregel.
    –   Exploiting data locality in graphs.
    –   Apache Giraph.
    –   Apache Hama.




                                                               93 / 112
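The "one hop per MapReduce step" idea can be simulated locally: each step groups the edges by source and expands the frontier by one hop (assuming friendships are stored as directed pairs in both directions):

```python
def one_hop(frontier, edges):
    """One MapReduce step: map emits (person, friend) for edges whose
    source is in the current frontier; reduce unions the results."""
    return {b for a, b in edges if a in frontier}

# Friendship edges, stored in both directions.
edges = [("me", "ana"), ("ana", "me"), ("ana", "bob"), ("bob", "ana"),
         ("bob", "carol"), ("carol", "bob")]

step1 = one_hop({"me"}, edges)   # 1 step:  my friends
step2 = one_hop(step1, edges)    # 2 steps: friends of my friends
step3 = one_hop(step2, edges)    # 3 steps: friends of friends of my friends
```

Graph-centric models like Pregel avoid re-shuffling the whole edge list on every hop by keeping vertex state local across supersteps.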
Reputation: Difficulties




●   Sometimes multiple MapReduce steps are needed for
    calculating a final metric.
    –   Because data doesn’t fit in memory.
●   Intermediate relations need to be calculated.
    –   And later filtered out.
●   “Polynomial effect”: Calculate all pairwise relations in a set:
    N*(N-1)/2
    –   Possible bottleneck.




                                                                 94 / 112
Reputation: Difficulties: Data imbalance




●   When grouping by something, some groups may be much
    bigger than others.
    –   Causing “data imbalance”.
●   Data imbalance in MapReduce is a big problem.
    –   Some machines will finish quickly while one will be busy for hours.
         ●   Inefficient usage of resources.
         ●   Data processing doesn’t scale linearly anymore.
    –   Next MapReduce step can’t start as long as current one hasn’t yet
        finished.




                                                                          95 / 112
Reputation: Example




●   Input: Tweets.
●   Distributed crawling for fetching the tweets.
    –   Save them in the HDFS.
●   Parse the tweets. Define the graph of trust.
    –   A trusts B if A follows B.
●   Execute PageRank over the graph.
    –   Spreads trust across all vertices.




                                                        96 / 112
Reputation: Example

[Diagram: an M/R Parse job stores the tweets in HDFS and builds the graph of trust (vertices A, B, C, D…); an M/R PageRank job over that graph produces the results]

                                                              97 / 112
Real-life MapReduce




1. Data analytics
2. Crawling
3. Full-text indexing
4. Reputation systems
5. Data Mining




                        98 / 112
Data mining: Text classification




●   Document classification
    –   Documents are texts in this case.
●   Assigns one or more categories to a text.
    –   Binary classifiers versus multi-category classifiers
    –   Multi-category classifiers can be built from multiple binary classifiers
●   Two steps: generating the model and classifying.




                                                                             99 / 112
Data mining: Text classification: Steps




●   Generating the model
    –   The resultant model may or may not fit in memory.
         ●   Let’s assume the final model fits in memory.
    –   Use a large dataset for generating the model.
         ●   MapReduce helps scaling the model generation process.
         ●   Example: Build multiple binary classifiers -> parallelize by classifier.
         ●   Example: Calculate conditional probabilities of a Bayesian model.
            Parallelize by word (as in the WordCount example).
●   Classifying
    –   MapReduce also helps in classifying a large dataset.
         ●   If model fits in memory, parallelize documents to classify and load the
             model in memory.
    –   Batch-classifying: parallelize documents to classify. Output is the set of
        documents with the assigned categories.


                                                                                        100 / 112
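The "parallelize by word" idea for a Bayesian model, sketched locally: the map emits ((word, category), 1) per token, the reduce sums the counts, and the counts then give P(word | category). The Laplace smoothing and the field names here are our own assumptions, not from the slides:

```python
from collections import defaultdict

def train_counts(labelled_docs):
    """map: per (category, text) document, emit ((word, category), 1);
    reduce: sum the counts per (word, category) pair."""
    word_counts = defaultdict(int)
    category_totals = defaultdict(int)
    for category, text in labelled_docs:
        for word in text.lower().split():
            word_counts[(word, category)] += 1
            category_totals[category] += 1
    return word_counts, category_totals

def p_word_given_category(word, category, counts, totals, vocab_size, alpha=1.0):
    # Laplace-smoothed conditional probability P(word | category).
    return (counts[(word, category)] + alpha) / (totals[category] + alpha * vocab_size)

docs = [("spam", "win money now"), ("ham", "meeting at noon"),
        ("spam", "win a prize")]
counts, totals = train_counts(docs)
vocab_size = len({word for word, _cat in counts})
p_spam = p_word_given_category("win", "spam", counts, totals, vocab_size)
p_ham = p_word_given_category("win", "ham", counts, totals, vocab_size)
```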
Data mining: Others




●   Mahout library for other data mining problems.
    –   Clustering, logistic regression, etc.
●   Recommendation algorithms.
    –   Many recommendation algorithms are based on calculating
        correlations.
    –   Calculating correlations in parallel with MapReduce is easy.
●   Remember: always in the “batch” or “offline” domain.
    –   Recommendations are reloaded after batch process finishes.




                                                                       101 / 112
Tuple MapReduce
 Pere Ferrera, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo
Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the
      IEEE International Conference on Data Mining (To appear) (10.7% Acceptance rate)
Common MapReduce problems




●   Lack of compound records
    –   By default, key & value are considered monolithic entities.
         ●   In real life, this case is rare.
    –   Alleviated by some serialization libraries (Thrift, Protocol Buffers)
●   Sorting within a group
    –   The MapReduce foundation does not support it
         ●   Although MapReduce implementations overcome this problem with
             “tricks”
●   Joins
    –   Requires compound records and sorting within a group to be implemented
    –   Not directly supported by MapReduce




                                                                                103 / 112
Tuple Map Reduce: rationale




●   Compound records, sorting within a group and joins are design
    patterns that arise in most MapReduce applications...
●   … but MapReduce does not make the implementation easy
●   An evolution of MapReduce paradigm is needed to cover these
    design patterns:



          Tuple MapReduce


                                                           104 / 112
Tuple MapReduce




●   We extended the classic (key, value) MapReduce model
●   Use n-sized Tuples instead of (key, value)
●   Define a Tuple-based M/R
    –   Covering most common use cases




                                                           105 / 112
Group by / Sort by




●   You can think of a M/R as a SELECT … GROUP BY …
●   With Tuple MapReduce, you simply “group by” a subset of
    Tuple fields
    –   Easier, more intuitive than having objects for each kind of Key you
        want to group by.
●   Alternatively, you may “sort by” a wider subset
    –   Hiding all complex logic behind secondary sort




                                                                         106 / 112
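The "group by a subset, sort by a wider subset" idea can be sketched in plain Python. This shows the concept only, not Pangool's actual Java API (tuple fields here are illustrative):

```python
from itertools import groupby

# Tuples: (url, date, visits). Group by url; sort by (url, date), so the
# "sort by" fields are a superset of the "group by" fields and each group
# reaches the reducer with its tuples already ordered by date.
tuples = [
    ("a.com", "2012-02", 10),
    ("b.com", "2012-01", 5),
    ("a.com", "2012-01", 7),
]

tuples.sort(key=lambda t: (t[0], t[1]))   # sort by: url, date
groups = {url: [t[1:] for t in group]     # group by: url
          for url, group in groupby(tuples, key=lambda t: t[0])}
```

This is exactly the secondary-sort pattern from earlier, but expressed declaratively by naming fields instead of writing custom key classes and comparators.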
Tuple-Join MapReduce




●   Extend the whole idea to allow for easier joins

          Tuple1: (a,b,c,d)                   Tuple2: (a,b,f,g,h)

                              Join by (a,b)


●   Formally speaking:




                                                                    107 / 112
Pangool




                 http://pangool.net


108 / 112




Pangool: What?




●   A better, simpler, more powerful replacement for Hadoop's API
●   What do we mean by API replacement?
    –   APIs on top of Hadoop: Pig, Hive, Cascading.
    –   Using them always comes with a tradeoff.
    –   Paradigms other than MapReduce, not always the best choice.
    –   Performance restrictions.


●   Pangool is still MapReduce, low-level and high performing
    –   Yet a lot simpler!




                                                                      109 / 112
Pangool: Why?




●   Hadoop has a steep learning curve
●   Default API is too low-level
●   Making things efficient is harsh (binary comparisons, ser/de...)
●   There are some common patterns (joins, secondary sorting...)

       Common pattern


                                      How can we make them simpler
       Common pattern              without losing flexibility and power?



       Common pattern



                                                                    110 / 112
Pangool API




●   Schema, Tuple, …
●   Reducers, Mappers, etc are instances instead of static classes
    –   Easier to configure them: new MyReducer(5, 2.0);
●   Still tied to Hadoop's particularities in some ways
    –   NullWritable, etc


●   Let's see an example




                                                             111 / 112
Thanks!!

Iván de Prado Alonso
Pere Ferrera Bertran

Big data, map reduce and beyond

  • 1. Big Data, MapReduce and beyond Iván de Prado Alonso // @ivanprado Pere Ferrera Bertran // @ferrerabertran @datasalt
  • 2. Outline Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Big Data. What and why 2. MapReduce & Hadoop 3. MapReduce Design Patterns 4. Real-life MapReduce 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 5. Tuple MapReduce & Pangool 2 / 112
  • 4. In the past... Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Data and computation fits on one monolithic machine ● Monolithic databases: RDBMS ● Scalability: – Vertical: buy better hardware ● Distributed systems – No very common – Logic centric: Data move where the logic is ● Distributed storage: SAN 4 / 112
  • 5. Distributed systems are hard Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Building distributed systems is hard – If you can scale vertically at a reasonable cost, why to deal with distributed systems complexity? ● But circumstances are changing: – Big Data ● Big data refers to the massive amounts of data that are difficult to analyze and handle using common database management tools 5 / 112
  • 6. BIG DATA “MAC” 6 / 112 Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
  • 7. Big Data Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Data is the new bottleneck – Web data ● Web pages ● Interaction Logs – Social networks data – Mobile devices – Data generated by Sensors ● Old systems/techniques are not appropriated ● A new approach is needed 7 / 112
  • 8. Big Data project parts Serving Acquiring Processing 8 / 112 Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
  • 9. Acquiring Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Gathering/receiving/storing data from sources ● Many kind of sources – Internet – Sensors – User behavior – Mobile devices – Health care data – Banking data – Social Networks – ….. 9 / 112
  • 10. Processing Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Data is present in the system (acquired) ● This step is responsible of extracting value from data – Eliminate duplicates – Infer relations – Calculate statistics – Correlate information – Ensure quality – Generate recommendations – …. 10 / 112
  • 11. Serving Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Most of the cases, some interface has to be provided to access the processed information ● Possibilities – Big Data / No Big Data – Real time access to results / non real time access ● Some examples: – Search engine → inverted index – Banking data → relational database – Social Network → NoSQL database 11 / 112
  • 12. Big Data system types Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Offline – Latency is not a problem ● Online – Response immediacy is important ● Mixed – Online behavior, but internally is a mixture of two systems ● One online ● One offline Offline Online MapReduce NoSQL Hadoop Search engines Distributed RDBMS 12 / 112
  • 13. A Mixed Online Offline A P AS P P P S A S S Big Data Systems types II 13 / 112 Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
  • 15. “Swiss army knife of the 21st century” Media Guardian Innovation Awards http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop 15 / 112
  • 16. History Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● 2004-2006 – GFS and MapReduce papers published by Google – Doug Cutting implements an open source version for Nutch ● 2006-2008 – Hadoop project becomes independent from Nutch – Web scale reached in 2008 ● 2008-now – Hadoop becomes popular and is commercially exploited Source: Hadoop: a brief history. Doug Cutting 16 / 112
  • 17. Hadoop “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model” From Apache Hadoop page 17 / 112
  • 18. Main ideas Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Distributed – Distributed storage – Distributed computation platform ● Built to be fault tolerant ● Shared nothing architecture ● Programmer isolation from distributed system difficulties – By providing an simply primitives for programming 18 / 112
  • 19. Hadoop Distributed File System (HDFS) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Distributed – Aggregates the individual storage of each node ● Files formed by blocks – Typically 64 or 128 Mb (configurable) – Stored in the OS filesystem: Xfs, Ext3, etc. ● Fault tolerant – Blocks replicated more than once 19 / 112
  • 20. How files are stored Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. DataNode 1 (DN1) NameNode 1 DataNode 2 (DN2) Data.txt: 2 Blocks: 1 DN1 1 DN2 2 3 DN1 DN4 3 DataNode 4 (DN4) DN2 DN3 2 4 DN4 4 DN3 DataNode 3 (DN3) 3 4 20 / 112
  • 21. MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Map Reduce is the abstraction behind Hadoop ● The unit of execution is the Job ● Job has – An input – An output – A map function – A reduce function ● Input and output are sequences of key/value pairs ● The map and reduce functions are provided by the developer – The execution is distributed and parallelized by Hadoop 21 / 112
  • 22. Job phases Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Two different phases: mapping and reducing ● Mapping phase – Map function is applied to Input data ● Intermediate data is generated ● Reducing phase – Reduce function is applied to intermediate data ● Final output is generated 22 / 112
  • 23. MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Two functions (Map & Reduce) – Map(u, v) : [w,x]* – Reduce(w, x*) : [y, z]* ● Example: word count – Map([document, null]) -> [word, 1]* – Reduce(word, 1*) -> [word, total] ● MapReduce & SQL – SELECT word, count(*) GROUP BY word ● Distributed execution in a cluster – Horizontal scalability 23 / 112
  • 24. Word Count Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. This is a line Also this Map Reduce reduce(a, {1}) = map(“This is a line”) = a, 1 this, 1 reduce(also, {1}) = is, 1 also, 1 a, 1 reduce(is, {1}) = line, 1 is, 1 map(“Also this”) = reduce(line, {1}) = also, 1 line, 1 this, 1 reduce(this, {1, 1}) = this, 2 a, 1 also, 1 Result: is, 1 line, 1 this, 2 24 / 112
  • 25. Map examples Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Swap Mapper – Swaps key and value map(key, value): emit (value, key) ● Split Key Mapper – Splits key in words and emit a pair per each word map(key, value): words = key.split(“ “) for each word in words: emit (word, value) 25 / 112
  • 26. Map examples (II) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Filter Mapper – Filter out some records map(key, value): if (key <> “the”): emit (key, value) ● Key/value concatenation mapper – Concatenates the key and the value in the key map(key, value): emit (key + “ “ + value, null) 26 / 112
  • 27. Reduce examples Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Count reducer – Counts the number of elements per each key reduce(key, values): count = 0 for each value in values: count++ emit(key, count) ● Average reducer – Computes the average value for each key reduce(key, values): count = 0 total = 0 for each value in values: count++ total += value emit(key, total / count) 27 / 112
  • 28. Reduce examples (II) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Keep first reducer – Keeps the first key/value input pair reduce(key, values): emit(key, first(values)) ● Value concatenation reducer – Concatenates the values in one string reduce(key, values): result = “” for each value in values: result += “ “ + value emit(key, result) 28 / 112
  • 29. Identity map and reduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● The identity functions are those that keeps the input unchanged – Map identity map(key, value): emit (key, value) – Reduce identity reduce(key, values): for each value in values: emit (key, value) 29 / 112
  • 30. Putting all together Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. map(k, v) → [w, x]* reduce(w, [x]+) → [y, z]* ● Job flow: – The mapper generates key/value pairs – This pairs are grouped by the key – Hadoop calls the reduce function for each group – The output of the reduce function is the final Job output ● Hadoop will distribute the work – Different nodes in the cluster will process the data in parallel 30 / 112
• 31. Job Execution – Diagram: input splits (blocks) feed the map tasks on each node; intermediate data flows from the map tasks to the reduce tasks, which write the output. 31 / 112
• 32. Job Execution (II) ● Key/value pairs are sorted by key in the shuffle & sort phase – That is needed in order to group records by key when calling the reducer – It also means that calls to the reduce function are made in key order ● The reduce function for key “A” is always called before the reduce function for key “B” within the same reduce task ● Reducers start downloading data from the mappers as soon as possible – In order to shorten the shuffle & sort phase – The number of reducers can be configured by the programmer 32 / 112
• 33. Partial Sort Job ● A job formed by the identity map and the identity reducer – It just sorts data by key within each reducer ● Example: the input file (D B A B C D E A) is read by two map tasks; the intermediate data is partitioned between two reducers, and each reducer writes its output file sorted by key (Reduce 1: A A D D, Reduce 2: B B C E) 33 / 112
• 34. Input Splits ● Each map task processes one input split – A map task starts processing at the first complete record in its split, and finishes with the record crossed by the split's rightmost boundary ● Example: a file divided into four input splits, each processed by its own map task 34 / 112
  • 35. Combiner Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Intermediate data goes from the map tasks to the reduce tasks through the network – Network can be saturated ● Combiners can be used to reduce the amount of data sent to the reducers – When the operation is commutative and associative ● A combiner is a function similar to the reducer – But it is executed in the map task, just after all the mapping has been done ● Combiners can't have side effects – Because Hadoop can decide to execute them or not 35 / 112
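A sketch of what a combiner does for the word-count case (the `local_combine` name is made up for illustration; in Hadoop the framework invokes the configured Combiner class zero or more times per map task, which is why it must be side-effect free):

```python
from collections import defaultdict

def local_combine(mapper_output):
    # Map-side partial aggregation: sum the counts emitted by one map
    # task locally, so fewer records cross the network to the reducers.
    partial = defaultdict(int)
    for word, count in mapper_output:
        partial[word] += count
    return sorted(partial.items())

# One map task emitted five pairs; after combining, only two records
# travel to the reducers.
emitted = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]
local_combine(emitted)  # [("a", 3), ("b", 2)]
```

This only works because addition is commutative and associative, so partial sums can be summed again at the reducer.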
  • 37. Design patterns Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 37 / 112
• 38. Filtering ● Filtering: we process the input data in parallel with Hadoop, apply if (condition) { emit(); } to each record, and emit a smaller output dataset in the end 38 / 112
  • 39. Design patterns Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 39 / 112
• 40. Secondary sorting ● Receive reducer values in a specific order ● Moving averages: – Secondary sort by timestamp – Fill an in-memory window and perform the average. ● Top N items in a group: – Secondary sort by <X> – Emit the first N elements in a group ● Useful, yet quite difficult to implement in Hadoop (requires a custom Sort Comparator, Group Comparator and Partitioner) 40 / 112
  • 41. Design patterns Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 41 / 112
• 42. Distributed execution without Hadoop ● Distributed queue – A common queue is needed to coordinate and assign work ● Distributed workers – Consumers running on each node, getting work from the queue ● Problems: – Difficult to coordinate ● Failover ● Losing messages ● Load balancing – The queue must scale 42 / 112
• 43. Distributed execution with Hadoop ● Map-only Jobs. ● Use Hadoop just for the sake of “parallelizing something”. ● Anything that doesn't involve a “group by” (no shuffle/reducer) ● Examples: – Text categorization – Filtering – Crawling – Updating a DB – Distributed grep (Map 1, Map 2, …, Map n run in parallel) ● NLineInputFormat can be handy for that. 43 / 112
• 44. Disadvantages ● Work is done in batches – And the distribution is probably not even ● Some resources are wasted ● There are some tricks to alleviate the problem – Task timeout + saving remaining work for the next execution 44 / 112
  • 45. Design patterns Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 45 / 112
  • 46. Computing statistics (I) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Count, Sum, Average, Std. Dev... ● Aggregated by something ● Recall SQL: select user, count(clicks) … group by user user, click user, click Map: emit (user, click) Reduce by user: count values user, count(click) 46 / 112
  • 47. Computing statistics (II) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● When sum(), avg(), etc, Combiners are often needed ● Imagine a user performed 3 million clicks – Then, a reducer will receive 3 million registers – This reducer will be the bottleneck of the Job. Everyone needs to wait for it to count 3 million things. ● Solution: Perform partial counts in a Combiner ● Combiner is executed before shuffling, after Mapper. 47 / 112
  • 48. Computing statistics (III) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Using a Combiner: user, click user, click Map user, count(click) user, count(click) Combine user, sum(count(click)) Reduce ● For each group, reducer aggregates N counts in the worst case! (N = #mappers) 48 / 112
  • 49. Design patterns Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 49 / 112
  • 50. Distinct Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● How to calculate distinct count(something) group by X ? ● It is somewhat easy (2 M/Rs): M/R 1 (eliminates duplicates): – emit ({X, something}, null) – so rows are grouped by ({X, something}) – In the reducer, just emit the first (ignore duplicates) M/R 2 (groups by X and count): – For each input X, something → emit (X, 1) – group by (X) – The reducer counts incoming values 50 / 112
  • 51. Distinct: example Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. M/R 1 M/R 2 (X1, s1) (X1, s1) (X1, s2) (X1, s1) (X1, s1) (X1, s1) (X1, s2) (X1, s2) X1 → 2 (X1, s2) (X2, s1) (X2, s1) X2 → 2 (X2, s1) (X2, s3) (X2, s3) (X2, s1) (X2, s3) (X2, s1) 51 / 112
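The two chained jobs from the example can be simulated like this (a Python sketch; the in-memory sorts stand in for each job's shuffle phase):

```python
from itertools import groupby
from operator import itemgetter

def distinct_count(pairs):
    """count(distinct something) group by X, as two chained M/R passes."""
    # M/R 1: group by (X, something); the reducer emits each group
    # once, which eliminates the duplicates.
    pairs = sorted(pairs)
    deduped = [key for key, _ in groupby(pairs)]
    # M/R 2: group by X and count the surviving (X, something) rows.
    return [(x, sum(1 for _ in g))
            for x, g in groupby(deduped, key=itemgetter(0))]

pairs = [("X1", "s1"), ("X1", "s2"), ("X1", "s1"),
         ("X2", "s1"), ("X2", "s3"), ("X2", "s1")]
distinct_count(pairs)  # [("X1", 2), ("X2", 2)]
```

The sample data reproduces the slide's example: X1 → 2 and X2 → 2.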
• 52. Distinct: Secondary sort ● We can calculate distinct in only one Job ● Using Secondary Sorting M/R: – emit ({X, something}, null) – group by (X), secondary sort by (something) – The reducer: iterate, count & emit “something, count” when “something” changes. Reset the counter on each “something” change. ● Need to use a Combiner to eliminate duplicates (otherwise the reducer would receive too many records). ● distinct count() is more parallelizable with 2 Jobs than with 1! 52 / 112
  • 53. Design patterns Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 53 / 112
  • 54. Sorting Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● We have seen how sorting is (partially) inherent in Hadoop. ● But if we want “pure” sorting: – Use one Reducer (not scalable) – Use an advanced partitioning strategy ● Yahoo! TeraSort (http://sortbenchmark.org/Yahoo2009.pdf) ● Use sampling to calculate data distribution ● Implement custom Partitioning according to distribution 54 / 112
• 55. Sorting (II) ● Hash partitioning: keys are scattered across the reducers (0 1 2 0 1 2 …), so each reducer's output is only locally sorted ● Distribution-aware partitioning: each reducer receives a contiguous key range (0 | 1 | 2), so the concatenated outputs are globally sorted 55 / 112
  • 56. Design patterns Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 56 / 112
• 57. Joins ● Joining two (or more) datasets is a quite common need. ● The difficulty is that both datasets may be “too big” – Otherwise, an in-memory join can be done quite easily just by reading one of the datasets into RAM. ● “Big joins” are commonly done “reduce-side”: Map dataset 1: (K1, d11) (K2, d12) Map dataset 2: (K1, d21) (K2, d22) Reduce by common key: K1 → d11, d21 K2 → d12, d22 ● The so-called “map-side joins” are more complex and tricky. 57 / 112
• 58. Joins: 1-N relation ● Use secondary sorting to get the “one side” of the relation first – Otherwise you need to buffer values in memory to perform the join ● Does not scale ● Employee (E) – Sales (S) join: SSESSS → you need to use memory; ESSSSS → memory not needed 58 / 112
• 59. Left – Right – Inner joins ● Join between Employee and Sales reducer(key, values): employee = null first = first(values) rest = rest(values) if isEmployee(first): employee = first if employee = null → right join (SSSSS) else if size(rest) = 0 → left join (E) else → inner join (ESSSSS) 59 / 112
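A sketch of that reducer with tagged values ("E"/"S" are made-up tags marking Employee and Sale records), assuming secondary sort delivers the employee first:

```python
def join_reducer(key, values):
    # Secondary sort guarantees that if an Employee record exists for
    # this key, it arrives before any Sale records — so no buffering.
    it = iter(values)
    tag, first = next(it)
    if tag != "E":                       # no employee: right-join rows
        yield key, (None, first)
        for _, sale in it:
            yield key, (None, sale)
        return
    employee, had_sales = first, False
    for _, sale in it:                   # inner-join rows
        had_sales = True
        yield key, (employee, sale)
    if not had_sales:                    # employee without sales: left join
        yield key, (employee, None)

list(join_reducer("e1", [("E", "Alice"), ("S", 100), ("S", 250)]))
# [("e1", ("Alice", 100)), ("e1", ("Alice", 250))]
```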
  • 60. Design patterns Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Filtering 2. Secondary sorting 3. Distributed execution 4. Computing statistics 5. Count distinct 6. Sorting 7. Joins 8. Reconciliation 60 / 112
  • 61. Reconciliation Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Hadoop can be used to “simulate a database”. ● For that, we need to: – Merge data state s(t) with past state s(t-1) using a Join. – Update rows with the same ID (performing whatever logic). – Store the result as the next full data state. – Rotate states: ● s(t-1) = s(t) ● s(t) = s(t + 1) s(t-1) s(t) s(t+1) M/R 61 / 112
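A minimal sketch of the merge step, assuming the states fit in a dict for illustration (a real job would express this as a reduce-side join over the HDFS files holding s(t-1) and the new data):

```python
def reconcile(previous_state, new_data, merge):
    # Join s(t-1) with the incoming data by row ID; rows present in
    # both are merged with whatever logic, unseen IDs carry over
    # unchanged, and new IDs are added. The result is the next state.
    result = dict(previous_state)
    for row_id, row in new_data.items():
        if row_id in result:
            result[row_id] = merge(result[row_id], row)
        else:
            result[row_id] = row
    return result

s_prev = {1: {"clicks": 10}, 2: {"clicks": 3}}
s_new = {2: {"clicks": 4}, 3: {"clicks": 1}}
reconcile(s_prev, s_new,
          lambda old, new: {"clicks": old["clicks"] + new["clicks"]})
# {1: {"clicks": 10}, 2: {"clicks": 7}, 3: {"clicks": 1}}
```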
  • 62. Real-life Hadoop projects Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● 95% Real-life Hadoop projects are a mixture of the patterns we just saw. ● Example: A vertical search-engine. – Distributed execution: Feed crawl / parse – Data reconciliation: Merge data by listing ID – Join: Augment listings with geographical info – … ● Example: Payments data stats. – Secondary sort: weekly, daily & monthly stats – Distributed execution: Random-updates to a DB ● ... 62 / 112
  • 64. Real-life MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 64 / 112
  • 65. Data analytics Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Obvious use case for MapReduce. ● Examples: calculate unique visits per page. – Top products per month. – Unique clicks per banner. – Etc. ● Offline analytics (Batch-oriented). – Online analytics not a good fit for MapReduce. 65 / 112
  • 66. Data analytics: How it works Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● A batch process that uses all historical data. – Recompute everything always. – Easier to manage and maintain than incremental computation. ● A chain of MapReduce steps produce the final output. ● There are tools that ease building / maintaining the MapReduce chain: – Hive, Pig, Cascading, Pangool for programming a MapReduce flow easily. – Oozie, Azkaban for connecting existing MapReduce jobs easily. ● Scheduling flows and such. 66 / 112
  • 67. Data analytics: Difficulties Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Some things are harder to calculate than others. ● Calculating unique visits per page. – A simple solution in two MapReduce steps or a more sophisticated one in a single MapReduce step. – Approximated methods can be used as well. ● Calculating the median. – Need to sort all the dataset and iterate twice if we don’t know the number of elements. 67 / 112
  • 68. Data analytics: Examples Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Gather clicks on pages. – Save (click, page, timestamp) in the HDFS. ● A MapReduce job groups by page and counts the number of clicks: ● map: emit(page, click). ● reduce: (page, list<click>) emits (page, totalClicks). ● We now have the total number of clicks per page. 68 / 112
  • 69. Data analytics: Examples (II) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Another MapReduce job groups by day and page and counts the number of clicks: ● map: emit((page, day), click). ● reduce: Same as before. ● We now have the total number of clicks per page and day. ● These are simple examples, but data analytics can get as sophisticated as you want. – Example: calculate a 10 bar histogram of the distribution of clicks over the hours of the day for each page. 69 / 112
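Both counting jobs follow the same shape, sketched here in Python (`count_by` is a hypothetical helper; the sort stands in for Hadoop's shuffle):

```python
from itertools import groupby
from operator import itemgetter

clicks = [
    ("/home", "2012-05-01"), ("/home", "2012-05-01"),
    ("/home", "2012-05-02"), ("/about", "2012-05-01"),
]

def count_by(key_fn, records):
    # Generic "emit(key, 1) then sum per group" job.
    keyed = sorted((key_fn(r), 1) for r in records)
    return [(k, sum(c for _, c in g))
            for k, g in groupby(keyed, key=itemgetter(0))]

count_by(lambda r: r[0], clicks)  # clicks per page
count_by(lambda r: r, clicks)     # clicks per (page, day)
```

Only the grouping key changes between the two jobs; the reduce logic is identical.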
  • 70. Real-life MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 70 / 112
  • 71. Crawling: Web Crawling Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Web Crawling: – “A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion.” ● Applications: – Search engines. – NLP (Sentiment analysis). ● Examples: 71 / 112
  • 72. Real-life MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 72 / 112
  • 73. Crawling: Web Crawling (at scale) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● How to parallelize storage and bandwidth? ● How to deduplicate stored URLs? ● Other complexities: politeness, infinite loops, robots.txt, canonical URLs, pagination, parsing, … ● Relevancy: Pagerank. 73 / 112
• 74. Crawling: Nutch ● What is Nutch? – Open source web-search software project. – Apache project. – Hadoop, Tika, Lucene, SOLR. ● (Brief) history – Started in 2002/2003 – 2005: MapReduce – 2006: Hadoop – 2006/2007: Tika – 2010: TLP Apache project 74 / 112
  • 75. Crawling: Nutch: How it works Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● “Select, Crawl, Parse, Dedup by URL” loop. ● Lucene, SOLR for indexing. – We will see them later. ● CrawlDB: Pages are saved in HDFS. ● MapReduce makes storage and computing scalable. – Helps in deduplicating pages by URL. – Helps in identifying new pages to crawl. 75 / 112
  • 76. Crawling: Not-Only Web Crawling Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Custom crawlers: – Tweets. – XML feeds. ● Simpler, as we usually don’t need to traverse a tree. – Sometimes only crawling a fixed seed of resources is enough. ● Applications – Vertical search engines. – Reputation systems. 76 / 112
  • 77. Crawling: Example: Crawling tweets at scale Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Use a scalable computing engine for fetching tweets. – Storm is a good fit. – Hadoop can be used as well. ● Tricky usage of MapReduce: Create as many groups as crawlers and embed a Crawler in them. ● Save raw feed data (JSON) in HDFS. ● MapReduce: Parse JSON tweets. ● MapReduce: Deduplicate tweets. ● MapReduce: Analyze tweets and perform data analysis. 77 / 112
• 78. Crawling: Example: Crawling tweets at scale – Pipeline: crawlers → HDFS → M/R Parse → M/R Dedup → M/R Analysis → Results 78 / 112
  • 79. Real-life MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 79 / 112
  • 80. Full-text indexing: Definitions Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Search engine: – An information retrieval system designed to help find information stored on a computer system. ● Inverted index: – Index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. – B-Trees are not inverted indexes! ● Stemming. ● Relevancy in results. 80 / 112
  • 81. Full-text indexing: Applications Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Web search engines – Finding relevant pages for a topic ● Vertical search engines – Finding jobs by description ● Social networks – Finding messages by text ● e-Commerce – Finding articles by description ● In general, any service or application needing efficient text information retrieval 81 / 112
  • 82. Full-text indexing (at scale) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Real-time indexing versus batch-indexing: – The first is cool: it is real-time, but it is difficult. We will not address it now. – The second is not real-time, but it is simpler. ● How to batch-index a big corpus dataset? – Need a scalable storage, (HDFS). ● How to deduplicate documents? – MapReduce to the rescue (like we saw before). ● How to generate multiple indexes? – MapReduce can help (we will see how). 82 / 112
  • 83. Full-text indexing: MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● MapReduce can be used to generate an inverted index. – Vertical partitioning v.s. Horizontal partitioning. ● Example: – Map: emit(word, docId) – Reduce: emit(word, list<docIds>) ● Quite simple. But what about stop words, stemming, etc? ● How to store the index? ● Better not to reinvent the wheel. 83 / 112
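A sketch of that index-building job, with a toy stop-word list (real systems would delegate analysis, stemming and storage to Lucene, as the next slides argue):

```python
from itertools import groupby
from operator import itemgetter

STOP_WORDS = {"the", "a", "of"}   # toy stop-word list for illustration

def build_index(docs):
    # Map: emit (word, docId) for every non-stop word.
    pairs = []
    for doc_id, text in docs:
        for word in text.lower().split():
            if word not in STOP_WORDS:
                pairs.append((word, doc_id))
    # Reduce: group by word; emit the sorted, de-duplicated posting list.
    pairs.sort()
    return {w: sorted({d for _, d in g})
            for w, g in groupby(pairs, key=itemgetter(0))}

build_index([(1, "the quick fox"), (2, "a quick dog")])
# {"dog": [2], "fox": [1], "quick": [1, 2]}
```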
• 84. Full-text indexing: Lucene / SOLR ● Lucene: Doug Cutting's project – Born from Nutch. – Mainstream open-source implementation of an inverted index. – Efficient disk allocation, highly performant. ● SOLR: Mainstream open-source search server. – Provides stemming, analyzers, HTTP servlets, etc. – Lacks some other desirable properties: ● Elasticity, real-time indexing, horizontal partitioning (although work is in progress). ● Still the reference technology for creating and serving inverted indexes. 84 / 112
  • 85. Full-text indexing: MapReduce meets SOLR Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● We can use MapReduce for scaling the indexing process. ● At the same time, we can use SOLR for creating the resulting index. – SOLR is used as-a-library. ● Generated indexes are later deployed to the search servers. 85 / 112
  • 86. Full-text indexing: Example Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● A vertical job search engine. ● Jobs are parsed from crawled feeds and saved in the HDFS. ● MapReduce for deduplicating job offers. – map: emit(jobId, job) – reduce (jobId, list<job>) -> emit (jobId, job) ● Retention policy: keep latest job. 86 / 112
  • 87. Full-text indexing: Example (II) Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● MapReduce for augmenting job information (adding geographical information). – map1: emit(job.city, job) – map2: emit(city, geoInfo) – reduce (job.city, list<job>)(city, geoInfo) -> for all jobs emit(job, geoInfo) ● MapReduce for distributing the index process: – map: emit(job.country, job) – reduce: (job.country, list<job>) -> Create index for country “job.country” using SOLR. ● Deploy per-country indexes to search cluster. 87 / 112
• 88. Full-text indexing: Example – Pipeline: XML feeds → M/R Parse → M/R Dedup → M/R Geo info (joined with geo data in HDFS) → M/R Index → indexes deployed to the Search Cluster 88 / 112
  • 89. Real-life MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 89 / 112
  • 90. Reputation: Definitions Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● What is reputation? ● Reputation in social communities. – eBay, StackOverflow... ● Reputation in social media. – Twitter, Facebook... ● Why is it important? 90 / 112
• 91. Reputation: Relationships ● Modelling relationships is needed for calculating reputation ● Graph-like models arise ● Usually stored as edges – A interacts with B – or, A trusts B – or, A → B ● Internet-scale graphs can be stored in HDFS – Each edge in a row – Add needed metadata to edges: date, etc. 91 / 112
  • 92. Reputation: MapReduce analysis on vertices Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● All people whom “A” interacted with. – map: (a, b) – reduce: (a, list<b>). ● Essentially things like PageRank can be very easily implemented. ● PageRank as a measure of page relevancy from page inlinks. – But it can be extrapolated to any kind of authority and trustiness metric. – E.g. People relevancy from social networks. 92 / 112
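A simplified single PageRank iteration in map/reduce style (a hypothetical helper; it ignores dangling nodes and assumes every vertex appears in `ranks` — a real job would iterate this step over the edge list in HDFS until convergence):

```python
from itertools import groupby
from operator import itemgetter

def pagerank_step(ranks, edges, damping=0.85):
    # Out-degree of each source vertex.
    out_degree = {}
    for src, _ in edges:
        out_degree[src] = out_degree.get(src, 0) + 1
    # Map: each vertex spreads its current rank over its out-edges.
    contribs = sorted((dst, ranks[src] / out_degree[src])
                      for src, dst in edges)
    # Reduce: sum contributions per vertex, apply the damping factor.
    new_ranks = {v: 1 - damping for v in ranks}
    for dst, group in groupby(contribs, key=itemgetter(0)):
        new_ranks[dst] += damping * sum(c for _, c in group)
    return new_ranks

pagerank_step({"A": 1.0, "B": 1.0}, [("A", "B")])
# {"A": 0.15, "B": 1.0}
```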
  • 93. Reputation: Going deeper on graphs Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Friends of friends of friends. – 1 MapReduce step: My friends. – 2 MapReduce steps: Friends of my friends. – 3 MapReduce steps: Friends of friends of my friends. ● Iterative MapReduce solves it. ● But there are better foundational models such as Google’s Pregel. – Exploiting data locality in graphs. – Apache Giraph. – Apache Hama. 93 / 112
  • 94. Reputation: Difficulties Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Sometimes multiple MapReduce steps are needed for calculating a final metric. – Because data doesn’t fit in memory. ● Intermediate relations need to be calculated. – And later filtered out. ● “Polynomial effect”: Calculate all pairwise relations in a set: N*(N-1)/2 – Possible bottleneck. 94 / 112
  • 95. Reputation: Difficulties: Data imbalance Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● When grouping by something, some groups may be much bigger than others. – Causing “data imbalance”. ● Data imbalance in MapReduce is a big problem. – Some machines will finish quickly while one will be busy for hours. ● Inefficient usage of resources. ● Data processing doesn’t scale linearly anymore. – Next MapReduce step can’t start as long as current one hasn’t yet finished. 95 / 112
  • 96. Reputation: Example Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Input: Tweets. ● Distributed crawling for fetching the tweets. – Save them in the HDFS. ● Parse the tweets. Define the graph of trustiness. – A trusts B if A follows B. ● Execute PageRank over the graph. – Spreads trustiness to all vertices. 96 / 112
• 97. Reputation: Example – Pipeline: tweets → M/R Parse → graph of trustiness in HDFS (vertices A, B, C, D) → M/R PageRank → Results 97 / 112
  • 98. Real-life MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. 1. Data analytics 2. Crawling 3. Full-text indexing 4. Reputation systems 5. Data Mining 98 / 112
  • 99. Data mining: Text classification Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Document classification – Documents are texts in this case. ● Assigns a text one or more categories. – Binary classifiers versus multi-category classifiers – Multi-category classifiers can be built from multiple binary classifiers ● Two steps: generating the model and classifying. 99 / 112
• 100. Data mining: Text classification: Steps ● Generating the model – The resulting model may or may not fit in memory. ● Let’s assume the final model fits in memory. – Use a large dataset for generating the model. ● MapReduce helps scale the model generation process. ● Example: Build multiple binary classifiers -> parallelize by classifier. ● Example: Calculate the conditional probabilities of a Bayesian model. Parallelize by word (as in the WordCount example). ● Classifying – MapReduce also helps in classifying a large dataset. ● If the model fits in memory, parallelize over the documents to classify and load the model in memory. – Batch-classifying: parallelize over the documents to classify. The output is the set of documents with their assigned categories. 100 / 112
  • 101. Data mining: Others Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Mahout library for other data mining problems. – Clustering, logistic regression, etc. ● Recommendation algorithms. – Many recommendation algorithms are based on calculating correlations. – Calculating correlations in parallel with MapReduce is easy. ● Remember: always in the “batch” or “offline” domain. – Recommendations are reloaded after batch process finishes. 101 / 112
• 102. Tuple MapReduce Pere Ferrera, Ivan de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining (to appear) (10.7% acceptance rate)
• 103. Common MapReduce problems ● Lack of compound records – By default, key & value are considered monolithic entities. ● In real life, this case is rare. – Alleviated by some serialization libraries (Thrift, Protocol Buffers) ● Sorting within a group – The MapReduce foundation does not support it ● Although MapReduce implementations overcome this problem with “tricks” ● Joins – Require compound records and sorting within a group to be implemented – Not directly supported by MapReduce 103 / 112
• 104. Tuple MapReduce: rationale ● Compound records, sorting within a group and joins are design patterns that arise in most MapReduce applications... ● … but MapReduce does not make their implementation easy ● An evolution of the MapReduce paradigm is needed to cover these design patterns: Tuple MapReduce 104 / 112
  • 105. Tuple MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● We extended the classic (key, value) MapReduce model ● Use n-sized Tuples instead of (key, value) ● Define a Tuple-based M/R – Covering most common use cases 105 / 112
  • 106. Group by / Sort by Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● You can think of a M/R as a SELECT … GROUP BY … ● With Tuple MapReduce, you simply “group by” a subset of Tuple fields – Easier, more intuitive than having objects for each kind of Key you want to group by. ● Alternatively, you may “sort by” a wider subset – Hiding all complex logic behind secondary sort 106 / 112
  • 107. Tuple-Join MapReduce Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Extend the whole idea to allow for easier joins Tuple1: (a,b,c,d) Tuple2: (a,b,f,g,h) Join by (a,b) ● Formally speaking: 107 / 112
  • 108. Pangool http://pangool.net 108 / 112 Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent.
  • 109. Pangool: What? Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Better, simpler, powerful API replacement for Hadoop's API ● What do we mean by API replacement? – APIs on top of Hadoop: Pig, Hive, Cascading. – Using them always comes with a tradeoff. – Paradigms other than MapReduce, not always the best choice. – Performance restrictions. ● Pangool is still MapReduce, low-level and high performing – Yet a lot simpler! 109 / 112
• 110. Pangool: Why? ● Hadoop has a steep learning curve ● The default API is too low-level ● Making things efficient is hard (binary comparisons, ser/de...) ● There are some common patterns (joins, secondary sorting...) – How can we make them simpler without losing flexibility and power? 110 / 112
  • 111. Pangool API Copyright © 2012 Datasalt Systems S.L. All rights reserved. Not to be reproduced without prior written consent. ● Schema, Tuple, … ● Reducers, Mappers, etc are instances instead of static classes – Easier to configure them: new MyReducer(5, 2.0); ● Still tied to Hadoop's particularities in some ways – NullWritable, etc ● Let's see an example 111 / 112
  • 112. Thanks!! Iván de Prado Alonso Pere Ferrera Bertran

Editor's Notes

1. - The Guardian Innovation Awards. You have to admit that Swiss Army knives are useful … Who hasn't needed a magnifying glass in an emergency! Hadoop is like a Swiss Army knife: very useful, but you sweat buckets until you manage to pull out the accessory you want.
2. Distributed: it harnesses the power of several machines in a cluster. Large datasets: Hadoop is not appropriate for small datasets. Simple programming model: Hadoop is not just a framework, it is a new distributed programming paradigm. Hadoop rests mainly on two modules: a distributed file system, for storing large volumes of data, and a new programming paradigm: MapReduce. Let's look at them one by one.