XXL Graph Algorithms
                                              Sergei Vassilvitskii
                                                Yahoo! Research

With help from Jake Hofman, Siddharth Suri, Cong Yu and many others
Introduction
  XXL Graphs are everywhere:
   – Web graph
   – Friend graphs
   – Advertising graphs...




  But we have Hadoop!
   – Few algorithms have been ported (no Hadoop Algorithms book)
   – Few general algorithmic approaches
   – Active area of research




Outline
  Today:
   – Act 1: Crawl before you walk
      • Counting connected components
   – Act 2: The curse of the last reducer
      • Finding tight knit friend groups




Act 1: Connected Components
     Given a graph, how many components does it have?


     [Figure: example graph on vertices a through h]

     Edge records:
       (b,c) 1   (f,h) 1   (b,d) 1   (a,c) 1
       (a,b) 1   (c,d) 1   (c,e) 1   (f,g) 1
       (d,e) 1   (d,e) 1   (b,e) 1   (g,h) 1

     Data too big to fit on one reducer!

CC Overview
  Outline for Connected Components
  – Partition the input into several chunks (map 1)
  – Summarize the connectivity on each chunk (reduce 1)
  – Combine all of the (small) summaries (map 2)
  – Find the number of connected components
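
The outline above can be sketched on a single machine as follows. This is a minimal illustration, not the speaker's implementation: the function names (`spanning_forest`, `count_components`), the union-find summary, and the seeded random partition are all assumptions made for the example. The edge list is the one from the example graph.

```python
import random
from collections import defaultdict

def spanning_forest(edges):
    """Summarize an edge list: keep only edges that join two different components."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:              # edge connects two components: keep it
            parent[ru] = rv
            kept.append((u, v))
    return kept                   # always fewer than n edges

def count_components(vertices, edges, num_chunks=2, seed=0):
    rng = random.Random(seed)
    # Round 1, map: send each edge to a random chunk.
    chunks = defaultdict(list)
    for e in edges:
        chunks[rng.randrange(num_chunks)].append(e)
    # Round 1, reduce: summarize each chunk independently.
    summaries = [spanning_forest(chunk) for chunk in chunks.values()]
    # Round 2, map: send every (small) summary to one reducer.
    combined = [e for s in summaries for e in s]
    # Round 2, reduce: a spanning forest on n vertices with c components has n - c edges.
    forest = spanning_forest(combined)
    return len(vertices) - len(forest)

EDGES = [("b", "c"), ("f", "h"), ("b", "d"), ("a", "c"), ("a", "b"), ("c", "d"),
         ("c", "e"), ("f", "g"), ("d", "e"), ("d", "e"), ("b", "e"), ("g", "h")]
VERTICES = "abcdefgh"

print(count_components(VERTICES, EDGES))  # 2
```

The answer does not depend on how the edges are partitioned, because dropping an edge that closes a cycle inside a chunk never disconnects anything.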




Connected Components
  1. Partition (randomly):
     [Figure: the example graph split between Reduce 1 (vertices b, c, d, e) and Reduce 2 (vertices a, b, c, f, g, h)]
  2. Summarize (retain < n edges):
     [Figure: Reduce 1 and Reduce 2 subgraphs after summarization]
  3. Recombine:
     [Figure: Round 2, the summaries merged into one graph on vertices a through h]

     Edge records before summarization:
       (b,c) 1   (f,h) 1   (b,d) 1   (a,c) 1
       (a,b) 1   (c,d) 1   (c,e) 1   (f,g) 1
       (d,e) 1   (d,e) 1   (b,e) 1   (g,h) 1

     Edge records after summarization:
       (a,c) 1   (a,b) 1   (c,d) 1
       (f,g) 1   (d,e) 1   (g,h) 1

     Small enough to fit!

Connected Components
  The summarization does not affect connectivity
  – Drops redundant edges
  – Dramatically reduces data size
  – Takes two MapReduce rounds




  Similar approach works in other situations:
  – Consider vertices connected only if there are k edges between them
  – Consider vertices connected if their similarity score is above a threshold
     • E.g. approximate Jaccard similarity computed for recommendation systems
  – Find minimum spanning trees
     • Summarize by computing an MST on the subset graph
  – Clustering
     • Cluster each partition, then aggregate the clusters
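
As an illustration of the minimum spanning tree variant above, here is a minimal single-machine sketch. The edge weights, the round-robin partition, and the names `kruskal` and `two_round_mst` are assumptions made for the example; the slides do not specify them. The summary of each chunk is its own minimum spanning forest, and the final answer is computed on the union of the summaries.

```python
def kruskal(edges):
    """Return a minimum spanning forest of weighted edges given as (weight, u, v)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):          # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest

def two_round_mst(edges, num_chunks=2):
    # Round 1: partition the edges (round-robin here) and summarize each chunk.
    chunks = [edges[i::num_chunks] for i in range(num_chunks)]
    summaries = [kruskal(chunk) for chunk in chunks]
    # Round 2: combine the summaries and compute the MST of what is left.
    combined = [e for s in summaries for e in s]
    return kruskal(combined)

# Hypothetical weights on part of the example graph.
WEIGHTED = [(4, "a", "b"), (1, "a", "c"), (3, "b", "c"), (2, "b", "d"),
            (7, "c", "d"), (5, "c", "e"), (6, "d", "e")]

print(sum(w for w, _, _ in two_round_mst(WEIGHTED)))  # 11
```

An edge discarded in round 1 is the heaviest edge on some cycle within its chunk, so it cannot belong to the minimum spanning tree of the full graph when weights are distinct.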



Act 2: Clustering Coefficient
  Finding tight knit groups of friends




  [Figure: two friend groups compared, with clustering coefficients 2/15 ≈ 0.13 vs. 8/15 ≈ 0.53]

  CC(v) = Fraction of v’s friends who know each other
   – Count: number of triangles incident on v
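
The definition above can be written down directly. The sketch below is an illustration under stated assumptions: the graph is an adjacency dictionary of sets, the node names are made up, and returning 0.0 for nodes with fewer than two friends is a convention chosen for the example. The sample graph gives v six friends with two friendships among them.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's friends who are themselves friends."""
    friends = adj[v]
    d = len(friends)
    if d < 2:
        return 0.0                      # assumed convention: no pairs to check
    possible = d * (d - 1) // 2         # number of pairs of friends
    present = sum(1 for a, b in combinations(friends, 2) if b in adj[a])
    return present / possible           # present == triangles incident on v

# Hypothetical example: v has 6 friends, 2 of the 15 possible pairs know each other.
FRIENDS = ["f1", "f2", "f3", "f4", "f5", "f6"]
ADJ = {"v": set(FRIENDS)}
for f in FRIENDS:
    ADJ[f] = {"v"}
for a, b in [("f1", "f2"), ("f3", "f4")]:
    ADJ[a].add(b)
    ADJ[b].add(a)

print(round(clustering_coefficient(ADJ, "v"), 2))  # 0.13
```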


Finding CC For Each Node
  Attempt 1:
  – Look at each node
  – Enumerate all possible triangles (Pivot)
  – Check which of those edges exist:
    [Figure: intersection of the candidate edges with the graph; 15 edges possible, 2 edges present]

  Amount of intermediate data
  – Quadratic in the degree of the nodes
  – 6 friends: 15 possible triangles
  – n friends, n(n-1)/2 possible triangles
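
To make the cost of Attempt 1 concrete, the sketch below enumerates every candidate pair at every pivot and reports how many intermediate records that produces. It is a single-machine illustration; the name `pivot_triangles`, the adjacency-dictionary representation, and the sample graph are assumptions made for the example.

```python
from itertools import combinations
from collections import Counter

def pivot_triangles(adj):
    """Count triangles per node by pivoting on every node (Attempt 1)."""
    edge_set = {frozenset((u, v)) for u in adj for v in adj[u]}
    # "Map": each pivot emits one candidate edge per pair of its neighbors.
    candidates = []
    for v in adj:
        for a, b in combinations(sorted(adj[v]), 2):
            candidates.append((v, frozenset((a, b))))
    # "Reduce": keep the candidates that are real edges.
    triangles = Counter()
    for v, pair in candidates:
        if pair in edge_set:
            triangles[v] += 1
    return triangles, len(candidates)

# Hypothetical example: v has 6 friends, with friendships f1-f2 and f3-f4.
FRIENDS = ["f1", "f2", "f3", "f4", "f5", "f6"]
ADJ = {"v": set(FRIENDS)}
for f in FRIENDS:
    ADJ[f] = {"v"}
for a, b in [("f1", "f2"), ("f3", "f4")]:
    ADJ[a].add(b)
    ADJ[b].add(a)

triangles, intermediate = pivot_triangles(ADJ)
print(triangles["v"], intermediate)  # 2 19
```

Of the 19 candidate records, 15 come from the single pivot v, which is the quadratic blow-up described above.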




  There’s always “that guy”:
  – tens of thousands of friends
  – tens of thousands of movie ratings (really!)
  – millions of followers

Finding CC For Each Node
  Attempt 1: Doesn't Scale

  Attempt 2:
  – There is a limited number of High degree nodes
  – Count LLL, LLH, LHH, and HHH triangles differently
     – If a triangle has at least one Low node
        – Pivot on Low node to count the triangles
     – If a triangle has all High nodes
        – Pivot but only on other neighboring High nodes (not all nodes)
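
A minimal sketch of the High-Low idea above, under stated assumptions: a node counts as High when its degree exceeds a caller-supplied `threshold`, triangles are deduplicated in an in-memory set rather than in a reducer, and the name `high_low_triangles` and the sample graph are made up for the example.

```python
from itertools import combinations
from collections import Counter

def high_low_triangles(adj, threshold):
    """Count triangles per node, limiting the pairs generated at High pivots."""
    high = {v for v in adj if len(adj[v]) > threshold}
    found = set()
    candidates = 0
    for v in adj:
        # Low pivots check all neighbors; High pivots check only High neighbors.
        nbrs = adj[v] & high if v in high else adj[v]
        for a, b in combinations(sorted(nbrs), 2):
            candidates += 1
            if b in adj[a]:
                found.add(frozenset((v, a, b)))   # set removes repeat discoveries
    per_node = Counter(v for t in found for v in t)
    return per_node, candidates

# Hypothetical example: hub h has 6 friends, with friendships f1-f2 and f3-f4.
FRIENDS = ["f1", "f2", "f3", "f4", "f5", "f6"]
ADJ = {"h": set(FRIENDS)}
for f in FRIENDS:
    ADJ[f] = {"h"}
for a, b in [("f1", "f2"), ("f3", "f4")]:
    ADJ[a].add(b)
    ADJ[b].add(a)

per_node, candidates = high_low_triangles(ADJ, threshold=2)
print(per_node["h"], candidates)  # 2 4
```

Every triangle with a Low node is found from that Low node, and every all-High triangle is found from any of its corners, so nothing is missed. On this graph the hub generates no candidate pairs at all, and only 4 candidates are produced in total.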


Algorithm in Pictures
  When looking at Low degree nodes
   – Check for all triangles





  When looking at High degree nodes
   – Check for triangles with other High degree nodes




Clustering Coefficient Discussion
  Attempt 2:
   – Main idea: treat High and Low degree nodes differently
      • Limit the amount of data generated (No more than O(n) per node)
   – All triangles accounted for
   – Can set High-Low threshold to balance the two cases
      • Rule of thumb: threshold around square root of number of vertices
   – A bit more complex, but still easy to code
      • Doesn’t suffer from the one high degree node problem




XXL Graphs: Conclusions
  Algorithm Design
   – Prove performance guarantees independent of input data
      • Input skew (e.g. high degree nodes) should not severely affect
        algorithm performance
      • Number of rounds fixed (and hopefully small)




  Rethink graph algorithms:
   – Connected Components: Two round approach
   – Clustering Coefficient: High-Low node decomposition
   – (Breaking News) Matchings: Two round sampling technique




Thank You
sergei@yahoo-inc.com

Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Accelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with PlatformlessAccelerating Enterprise Software Engineering with Platformless
Accelerating Enterprise Software Engineering with Platformless
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 

XXL Graph Algorithms (Hadoop Summit 2010)

  • 1. XXL Graph Algorithms. Sergei Vassilvitskii, Yahoo! Research. With help from Jake Hofman, Siddharth Suri, Cong Yu, and many others.
  • 2. Introduction XXL Graphs are everywhere: – Web graph – Friend graphs – Advertising graphs...
  • 3. Introduction XXL Graphs are everywhere: – Web graph – Friend graphs – Advertising graphs... But we have Hadoop! – Few algorithms have been ported (no Hadoop Algorithms book) – Few general algorithmic approaches – Active area of research
  • 4. Outline Today: – Act 1: Crawl before you walk • Counting connected components – Act 2: The curse of the last reducer • Finding tight-knit friend groups
  • 5. Act 1: Connected Components Given a graph, how many components does it have? [Figure: example graph on vertices a through h]
  • 6. Act 1: Connected Components Given a graph, how many components does it have? [Figure: the example graph next to its edge records, each with value 1: (b,c), (f,h), (b,d), (a,c), (a,b), (c,d), (c,e), (f,g), (d,e), (d,e), (b,e), (g,h)] Data too big to fit on one reducer!
  • 7. CC Overview Outline for Connected Components: – Partition the input into several chunks (map 1) – Summarize the connectivity on each chunk (reduce 1) – Combine all of the (small) summaries (map 2) – Find the number of connected components
  • 8. Connected Components 1. Partition (randomly): [Figure: example graph on vertices a through h]
  • 9. Connected Components 1. Partition (randomly): [Figure: the graph's edges split between Reduce 1 and Reduce 2]
  • 10. Connected Components 1. Partition: 2. Summarize (retain < n edges, where n is the number of vertices): [Figure: the edge subsets at Reduce 1 and Reduce 2]
  • 11. Connected Components 1. Partition: 2. Summarize (retain < n edges, where n is the number of vertices): [Figure: the edge subsets at Reduce 1 and Reduce 2]
  • 12. Connected Components 1. Partition: 2. Summarize: 3. Recombine: [Figure: the summaries from Reduce 1 and Reduce 2]
  • 13. Connected Components 1. Partition: 2. Summarize: 3. Recombine: [Figure: the recombined graph in Round 2]
  • 14. Connected Components 1. Partition: 2. Summarize: 3. Recombine: [Figure: the Round 2 graph next to the original edge records: (b,c), (f,h), (b,d), (a,c), (a,b), (c,d), (c,e), (f,g), (d,e), (d,e), (b,e), (g,h)]
  • 15. Connected Components 1. Partition: 2. Summarize: 3. Recombine: [Figure: the Round 2 graph next to the retained edge records: (a,c), (a,b), (c,d), (f,g), (d,e), (g,h)] Small enough to fit!
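The partition, summarize, and recombine steps can be sketched on a single machine. This is a minimal illustration, not the speaker's implementation: the chunks stand in for reducers, the summary is assumed to be a spanning forest built with union-find, and the names `spanning_forest` and `count_components` are hypothetical. It also assumes every vertex appears in at least one edge.

```python
import random

# Edge records from the example graph on vertices a through h.
EDGES = [("b", "c"), ("f", "h"), ("b", "d"), ("a", "c"), ("a", "b"), ("c", "d"),
         ("c", "e"), ("f", "g"), ("d", "e"), ("d", "e"), ("b", "e"), ("g", "h")]


def find(parent, x):
    """Union-find lookup with path halving; unseen vertices start as roots."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def spanning_forest(edges):
    """Summarize: keep only edges that merge two separate components."""
    parent, kept = {}, []
    for u, v in edges:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            kept.append((u, v))
    return kept  # fewer edges than vertices


def count_components(edges, num_chunks=2, seed=0):
    rng = random.Random(seed)
    # Round 1, map: send each edge to a random chunk (one chunk per reducer).
    chunks = [[] for _ in range(num_chunks)]
    for edge in edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    # Round 1, reduce: summarize connectivity inside each chunk.
    summaries = [spanning_forest(chunk) for chunk in chunks]
    # Round 2: combine the small summaries and summarize once more.
    forest = spanning_forest([e for s in summaries for e in s])
    vertices = {v for edge in edges for v in edge}
    # A forest with k components on n vertices has n - k edges.
    return len(vertices) - len(forest)


print(count_components(EDGES))  # 2
```

Dropping an edge inside a chunk is safe because its endpoints are already connected by edges that the chunk keeps, so the combined summaries have the same components as the full graph.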
  • 16. Connected Components The summarization does not affect connectivity – Drops redundant edges – Dramatically reduces data size – Takes two MapReduce rounds
  • 17. Connected Components The summarization does not affect connectivity – Drops redundant edges – Dramatically reduces data size – Takes two MapReduce rounds Similar approach works in other situations: – Consider vertices connected only if there are k edges between them – Consider vertices connected if their similarity score is above a threshold • E.g. approximate Jaccard similarity when computing for recommendation systems – Find minimum spanning trees • Summarize by computing an MST on the subset graph – Clustering • Cluster each partition, then aggregate the clusters
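The minimum spanning tree variant can be sketched the same way. The edge weights below are invented for illustration, and `kruskal` and `two_round_msf` are hypothetical names; the sketch assumes Kruskal's algorithm as the per-chunk summary and simulates the reducers with in-memory lists.

```python
import random

# Hypothetical weights on the example graph's edges, as (weight, u, v).
WEIGHTED_EDGES = [(4, "b", "c"), (7, "f", "h"), (2, "b", "d"), (1, "a", "c"),
                  (3, "a", "b"), (5, "c", "d"), (9, "c", "e"), (6, "f", "g"),
                  (8, "d", "e"), (10, "b", "e"), (11, "g", "h")]


def kruskal(weighted_edges):
    """Return a minimum spanning forest of the given (weight, u, v) edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(weighted_edges):  # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def two_round_msf(weighted_edges, num_chunks=3, seed=0):
    rng = random.Random(seed)
    # Round 1: partition the edges at random, then summarize each chunk.
    chunks = [[] for _ in range(num_chunks)]
    for edge in weighted_edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    summaries = [kruskal(chunk) for chunk in chunks]
    # Round 2: the summaries are small, so one machine finishes the job.
    return kruskal([e for s in summaries for e in s])


forest = two_round_msf(WEIGHTED_EDGES)
print(sum(w for w, _, _ in forest), len(forest))
```

An edge discarded inside a chunk is the heaviest edge on some cycle of that chunk, so it cannot belong to the minimum spanning forest of the whole graph either; this is why the per-chunk summaries lose nothing.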
  • 18. Outline Today: – Act 1: Crawl before you walk • Counting connected components – Act 2: The curse of the last reducer • Finding tight-knit friend groups
  • 19. Act 2: Clustering Coefficient Finding tight-knit groups of friends
  • 20. Act 2: Clustering Coefficient Finding tight-knit groups of friends [Figure: two example friend networks, side by side]
  • 21. Act 2: Clustering Coefficient Finding tight-knit groups of friends [Figure: two example nodes, one with CC = 2/15 ≈ 0.13 vs. one with CC = 8/15 ≈ 0.53] CC(v) = fraction of pairs of v’s friends who know each other – Count: number of triangles incident on v
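As a small worked example of the definition, the sketch below computes CC(v) for one node from adjacency sets. The graph is made up to reproduce the 2/15 figure (a node with 6 friends, 2 of whose 15 possible friend pairs are connected), and `build_adj` and `clustering_coefficient` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import combinations


def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's friends that are themselves friends."""
    friends = adj[v]
    d = len(friends)
    if d < 2:
        return Fraction(0)  # no pairs to check
    # Each connected pair of friends closes one triangle incident on v.
    triangles = sum(1 for a, b in combinations(friends, 2) if b in adj[a])
    return Fraction(triangles, d * (d - 1) // 2)


# Node 0 has friends 1..6; only the pairs (1,2) and (3,4) know each other.
adj = build_adj([(0, i) for i in range(1, 7)] + [(1, 2), (3, 4)])
print(clustering_coefficient(adj, 0))  # 2/15
```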
  • 22. Finding CC For Each Node Attempt 1: – Look at each node – Enumerate all possible triangles (Pivot)
  • 23. Finding CC For Each Node Attempt 1: – Look at each node – Enumerate all possible triangles (Pivot)
  • 24. Finding CC For Each Node Attempt 1: – Look at each node – Enumerate all possible triangles (Pivot) – Check which of those edges exist: [Figure: the candidate edges intersected with the graph's edges; 15 edges possible, 2 edges present]
  • 25. Finding CC For Each Node Attempt 1: – Look at each node – Enumerate all possible triangles (Pivot) – Check which of those edges exist
  • 26. Finding CC For Each Node Attempt 1: – Look at each node – Enumerate all possible triangles – Check which of those edges exist Amount of intermediate data – Quadratic in the degree of the nodes – 6 friends: 15 possible triangles – n friends: n(n-1)/2 possible triangles
  • 27. Finding CC For Each Node Attempt 1: – Look at each node – Enumerate all possible triangles – Check which of those edges exist Amount of intermediate data – Quadratic in the degree of the nodes – 6 friends: 15 possible triangles – n friends: n(n-1)/2 possible triangles There’s always “that guy”: – tens of thousands of friends – tens of thousands of movie ratings (really!) – millions of followers
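Attempt 1 and its intermediate data volume can be sketched as follows. This is a single-machine imitation of the pivot step under assumed names (`pivot_map`, `triangles_per_node`, `candidate_pairs`); the 10,000-friend node is a hypothetical stand-in for "that guy".

```python
from collections import Counter
from itertools import combinations


def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pivot_map(adj):
    """Pivot: every node emits each pair of its friends as a candidate edge."""
    for v, friends in adj.items():
        for a, b in combinations(sorted(friends), 2):
            yield (a, b), v


def triangles_per_node(adj):
    """Check step: a candidate pair closes a triangle only if the edge exists."""
    counts = Counter({v: 0 for v in adj})
    for (a, b), v in pivot_map(adj):
        if b in adj[a]:
            counts[v] += 1
    return counts


def candidate_pairs(degree):
    """Records emitted by one pivot: degree * (degree - 1) / 2."""
    return degree * (degree - 1) // 2


adj = build_adj([(0, i) for i in range(1, 7)] + [(1, 2), (3, 4)])
print(triangles_per_node(adj)[0])   # 2 triangles at node 0
print(candidate_pairs(6))           # 15
print(candidate_pairs(10_000))      # 49995000
```

Under these assumptions, a single node with 10,000 friends emits as many candidate records as 3,333,000 nodes with 6 friends each, and all of that work lands on whichever reducer handles the pivot.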
  • 28. Finding CC For Each Node Attempt 1: – Look at each node – Enumerate all possible triangles – Check which of those edges exist [Overlay: Does not Scale]
  • 29. Finding CC For Each Node Attempt 1: – Look at each node – Enumerate all possible triangles – Check which of those edges exist [Overlay: Does not Scale] Attempt 2: – There is a limited number of High degree nodes – Count LLL, LLH, LHH, and HHH triangles differently – If a triangle has at least one Low node: pivot on a Low node to count the triangles – If a triangle has all High nodes: pivot, but only on other neighboring High nodes (not all nodes)
  • 30. Algorithm in Pictures When looking at Low degree nodes – Check for all triangles
  • 31. Algorithm in Pictures When looking at Low degree nodes – Check for all triangles When looking at High degree nodes – Check for triangles with other High degree nodes
  • 32. Clustering Coefficient Discussion Attempt 2: – Main idea: treat High and Low degree nodes differently • Limit the amount of data generated (no more than O(n) per node) – All triangles accounted for – Can set High-Low threshold to balance the two cases • Rule of thumb: threshold around square root of number of vertices – A bit more complex, but still easy to code • Doesn’t suffer from the one high degree node problem
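The High-Low idea can be sketched on one machine as below. This is an illustration under assumptions, not the speaker's code: `high_low_triangles` and `build_adj` are hypothetical names, a node counts as High when its degree exceeds the threshold, the default threshold follows the square-root rule of thumb, and duplicates are removed with an in-memory set rather than a MapReduce-friendly tie-break.

```python
from itertools import combinations
from math import isqrt


def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def high_low_triangles(adj, threshold=None):
    """Return (set of triangles, candidate pairs emitted per pivot)."""
    if threshold is None:
        threshold = isqrt(len(adj))  # rule of thumb: about sqrt(#vertices)
    high = {v for v, nbrs in adj.items() if len(nbrs) > threshold}
    triangles, emitted = set(), {}
    for v, nbrs in adj.items():
        # Low pivots check all pairs of neighbors; High pivots check only
        # pairs of High neighbors, which is enough to find HHH triangles.
        candidates = nbrs & high if v in high else nbrs
        pairs = list(combinations(sorted(candidates), 2))
        emitted[v] = len(pairs)
        for a, b in pairs:
            if b in adj[a]:
                triangles.add(tuple(sorted((v, a, b))))
    return triangles, emitted


# Hub 0 has 9 friends; the pairs (1,2) and (3,4) also know each other.
adj = build_adj([(0, i) for i in range(1, 10)] + [(1, 2), (3, 4)])
triangles, emitted = high_low_triangles(adj)
print(sorted(triangles))  # [(0, 1, 2), (0, 3, 4)]
print(emitted[0])         # 0, versus 36 pairs if the hub checked every pair
```

Every triangle with at least one Low node is found when that Low node pivots, and an all-High triangle is found when any of its corners pivots over its High neighbors, so no triangle is missed while the hub emits far fewer candidate pairs.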
  • 33. XXL Graphs: Conclusions Algorithm Design – Prove performance guarantees independent of input data • Input skew (e.g. high degree nodes) should not severely affect algorithm performance • Number of rounds fixed (and hopefully small)
  • 34. XXL Graphs: Conclusions Algorithm Design – Prove performance guarantees independent of input data • Input skew (e.g. high degree nodes) should not severely affect algorithm performance • Number of rounds fixed (and hopefully small) Rethink graph algorithms: – Connected Components: Two round approach – Clustering Coefficient: High-Low node decomposition – (Breaking News) Matchings: Two round sampling technique