Spark graphx

Introduction to Apache Spark GraphX

  1. Spark GraphX. © 2016 MapR Technologies, © 2017 MapR Technologies.
  2. Learning Goals
     • Describe GraphX
     • Define Regular, Directed, and Property Graphs
     • Create a Property Graph
     • Perform Operations on Graphs
  4. What is a Graph? Graph: vertices connected by edges. (figure labels: vertex, edge)
  5. What is a Graph? A set of vertices, connected by edges. (figure: vertices DFW and ATL joined by an edge; relationship: distance)
  6. Graphs are Essential to Data Mining and Machine Learning
     • Identify influential entities (people, information…)
     • Find communities
     • Understand people’s shared interests
     • Model complex data dependencies
  7. Real World Graphs: Web Pages (Reference: Spark GraphX in Action)
  10. Real World Graphs (Reference: Spark GraphX in Action)
  15. Real World Graphs: Recommendations (figure labels: Users, Ratings, Items)
  16. Real World Graphs: Credit Card Application Fraud (Reference: Spark Summit)
  17. Real World Graphs: Credit Card Fraud
  18. Finding Communities
      Count triangles passing through each vertex:
      • Measures “cohesiveness” of local community
      • More triangles: stronger community
      • Fewer triangles: weaker community
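
The triangle-counting idea on this slide can be sketched with the `triangleCount` operator that GraphX provides on graphs. The toy edge list below is hypothetical (it is not from the deck), and the edges are written with source ID smaller than destination ID because older Spark releases are understood to expect that canonical orientation; treat this as a minimal sketch under those assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Reuse the shell's context if one exists, otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("triangle-count-sketch"))

// Hypothetical toy graph: vertices 1, 2, 3 form a triangle; vertex 4 hangs off vertex 3.
val triangleEdges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1), Edge(3L, 4L, 1)))
val community = Graph.fromEdges(triangleEdges, defaultValue = "n/a")

// triangleCount returns a graph whose vertex attribute is the number of
// triangles passing through that vertex.
val trianglesPerVertex = community.triangleCount().vertices.collect().toMap

trianglesPerVertex.toSeq.sortBy(_._1).foreach(println)
```

Vertices 1, 2, and 3 each sit on one triangle, while vertex 4 sits on none, which matches the slide's reading that more triangles indicate a more cohesive local community.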
  19. Real World Graphs: Healthcare
  20. Predicting User Behavior (figure: posts labeled Liberal or Conservative, with unlabeled posts marked "?"). Slide labels: Conditional Random Field, Belief Propagation.
  21. Enable Joining Tables and Graphs (figure labels: User Data, Product Ratings, Friend Graph, ETL, Product Rec. Graph, Join, Inf., Prod. Rec., Tables, Graphs)
  22. Table and Graph Analytics
  23. What is GraphX?
      Spark SQL
      • Structured Data
      • Querying with SQL/HQL
      • DataFrames
      Spark Streaming
      • Processing of live streams
      • Micro-batching
      MLlib
      • Machine Learning
      • Multiple types of ML algorithms
      GraphX
      • Graph processing
      • Graph parallel computations
      Spark Core: RDD Transformations and Actions
      • Task scheduling
      • Memory management
      • Fault recovery
      • Interacting with storage systems
  24. Apache Spark GraphX
      • Spark component for graphs and graph-parallel computations
      • Combines data parallel and graph parallel processing in a single API
      • View data as graphs and as collections (RDD)
        – no duplication or movement of data
      • Operations for graph computation
        – includes optimized version of Pregel
      • Provides graph algorithms and builders
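
A minimal sketch of the "graphs and collections" point: the same data can be filtered as an ordinary RDD and then joined back onto the graph. It uses the three-airport data that appears later in the deck; the `gates` table and its values are hypothetical, invented only to have something to join.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graph-and-table-sketch"))

val airports: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "SFO"), (2L, "ORD"), (3L, "DFW")))
val routes: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1800), Edge(2L, 3L, 800), Edge(3L, 1L, 1400)))
val flightGraph = Graph(airports, routes, "nowhere")

// Collection view: graph.edges is an RDD, so ordinary RDD operations apply.
val longRoutes = flightGraph.edges.filter(_.attr > 1000).count()

// Table-to-graph join: attach a hypothetical table (made-up gate counts) to the vertices.
val gates: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 10), (2L, 20)))
val withGates = flightGraph.outerJoinVertices(gates) {
  (id, name, gateCount) => (name, gateCount.getOrElse(0))
}
val gateMap = withGates.vertices.collect().toMap

println(s"routes over 1000 miles: $longRoutes")
gateMap.toSeq.sortBy(_._1).foreach(println)
```

`outerJoinVertices` keeps every vertex and passes an `Option` for the table side, so DFW, which has no row in the hypothetical table, falls back to 0.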
  25. Learning Goals
      • Describe GraphX
      • Define Regular, Directed, and Property Graphs
      • Create a Property Graph
      • Perform Operations on Graphs
  26. Regular Graphs vs Directed Graphs
      • Regular graph: each vertex has the same number of edges
      • Example: Facebook friends (relationship: Friends)
        – Bob is a friend of Carol
        – Carol is a friend of Bob
  27. Regular Graphs vs Directed Graphs
      • Directed graph: edges have a direction
      • Example: Twitter followers (relationship: follows)
        – Carol follows Oprah
        – Oprah does not follow Carol
  28. Property Graph (figure labels: LAX, SJC, Flight 123, Flight 1002, Properties)
  29. Flight Example with GraphX (figure: vertices SFO, ORD, DFW joined by edges labeled with distances)
      Originating Airport   Destination Airport   Distance
      SFO                   ORD                   1800 miles
      ORD                   DFW                   800 miles
      DFW                   SFO                   1400 miles
  30. Flight Example with GraphX
      Vertex Table
      Id   Property
      1    SFO
      2    ORD
      3    DFW
      Edge Table
      SrcId   DestId   Property
      1       2        1800
      2       3        800
      3       1        1400
  31. Spark Property Graph class
      class Graph[VD, ED] {
        val vertices: VertexRDD[VD]
        val edges: EdgeRDD[ED]
      }
  32. Learning Goals
      • Define GraphX
      • Define Regular, Directed, and Property Graphs
      • Create a Property Graph
      • Perform Operations on Graphs
  33. Create a Property Graph
      1. Import required classes
      2. Create vertex RDD
      3. Create edge RDD
      4. Create graph
  34. Create a Property Graph, step 1: Import required classes
      import org.apache.spark._
      import org.apache.spark.graphx._
      import org.apache.spark.rdd.RDD
  35. Create a Property Graph: Data Set
      Vertices: Airports
      Vertex ID    Property (V)
      Id (Long)    Name (String)
      Edges: Routes
      Source ID    Dest ID    Property (E)
      Id           Id         Distance (Integer)
  36. Create a Property Graph, step 2: Create vertex RDD
      // create vertices RDD with ID and Name
      val vertices = Array((1L, ("SFO")), (2L, ("ORD")), (3L, ("DFW")))
      val vRDD = sc.parallelize(vertices)
      vRDD.take(1)
      // Array((1,SFO))
  37. Create a Property Graph, step 3: Create edge RDD
      // create routes RDD with srcid, destid, distance
      val edges = Array(Edge(1L, 2L, 1800), Edge(2L, 3L, 800), Edge(3L, 1L, 1400))
      val eRDD = sc.parallelize(edges)
      eRDD.take(2)
      // Array(Edge(1,2,1800), Edge(2,3,800))
  38. Create a Property Graph, step 4: Create graph
      // define default vertex nowhere
      val nowhere = "nowhere"
      // build initial graph from the vertex and edge RDDs (not the local arrays)
      val graph = Graph(vRDD, eRDD, nowhere)
      graph.vertices.take(3).foreach(print)
      // (2,ORD)(1,SFO)(3,DFW)
      graph.edges.take(3).foreach(print)
      // Edge(1,2,1800) Edge(2,3,800) Edge(3,1,1400)
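
The four steps above can be put together as one self-contained sketch. It assumes a local Spark context (in `spark-shell`, `getOrCreate` should simply reuse the existing one), and the extra edge to vertex 4 is hypothetical, added only to show what the default vertex attribute does.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("property-graph-sketch"))

// Step 2: vertex RDD of (id, airport name).
val vRDD: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "SFO"), (2L, "ORD"), (3L, "DFW")))

// Step 3: edge RDD of Edge(srcId, dstId, distance).
val eRDD: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1800), Edge(2L, 3L, 800), Edge(3L, 1L, 1400)))

// Step 4: build the graph; "nowhere" is the attribute given to any vertex
// that is referenced by an edge but missing from vRDD.
val nowhere = "nowhere"
val graph: Graph[String, Int] = Graph(vRDD, eRDD, nowhere)

// Hypothetical extra route to an unknown vertex 4, to show the default in action.
val extraRoute = sc.parallelize(Seq(Edge(3L, 4L, 500)))
val extended = Graph(vRDD, eRDD.union(extraRoute), nowhere)
val names = extended.vertices.collect().toMap

println(s"airports: ${graph.numVertices}, routes: ${graph.numEdges}")
println(s"vertex 4 is named: ${names(4L)}")
```

Passing the RDDs (`vRDD`, `eRDD`) rather than the local arrays matters because `Graph(...)` is defined over RDDs of vertices and edges.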
  39. Learning Goals
      • Define GraphX
      • Define Regular, Directed, and Property Graphs
      • Create a Property Graph
      • Perform Operations on Graphs
  40. Graph Operators
      To answer questions such as:
      • How many airports are there?
      • How many flight routes are there?
      • What are the longest distance routes?
      • Which airport has the most incoming flights?
      • What are the top 10 flights?
  41. Graph Class
  42. Graph Operators: to find information about the graph
      Operator       Description
      numEdges       number of edges (Long)
      numVertices    number of vertices (Long)
      inDegrees      The in-degree of each vertex (VertexRDD[Int])
      outDegrees     The out-degree of each vertex (VertexRDD[Int])
      degrees        The degree of each vertex (VertexRDD[Int])
  43. Graph Operators
      // How many airports?
      val numairports = graph.numVertices
      // Long = 3
      // How many routes?
      val numroutes = graph.numEdges
      // Long = 3
      // routes > 1000 miles distance?
      graph.edges.filter { case Edge(org_id, dest_id, distance) => distance > 1000 }.take(3)
      // Array(Edge(1,2,1800), Edge(3,1,1400))
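
The degree operators from the table can be sketched on a slightly larger toy graph. The fourth airport (ATL) and its 700-mile route are hypothetical additions, chosen so that one vertex has no incoming edges.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("degree-operators-sketch"))

val airportRDD: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "SFO"), (2L, "ORD"), (3L, "DFW"), (4L, "ATL")))
// The last edge (ATL to DFW, 700 miles) is a made-up route for illustration.
val routeRDD: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1800), Edge(2L, 3L, 800), Edge(3L, 1L, 1400), Edge(4L, 3L, 700)))
val degreeGraph = Graph(airportRDD, routeRDD, "nowhere")

// Each operator returns a VertexRDD[Int]; collect to a local map for inspection.
val inDeg = degreeGraph.inDegrees.collect().toMap
val outDeg = degreeGraph.outDegrees.collect().toMap
val deg = degreeGraph.degrees.collect().toMap

println(s"in-degrees:  $inDeg")
println(s"out-degrees: $outDeg")
println(s"degrees:     $deg")
```

Note that a vertex with no incoming edges (ATL here) does not appear in `inDegrees` at all, so a lookup with a default such as `inDeg.getOrElse(4L, 0)` is the safer way to read it.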
  44. Triplets
      // Triplets add source and destination properties to Edges
      graph.triplets.take(3).foreach(println)
      ((1,SFO),(2,ORD),1800)
      ((2,ORD),(3,DFW),800)
      ((3,DFW),(1,SFO),1400)
  45. Triplets: What are the longest routes?
      ((1,SFO),(2,ORD),1800)
      ((2,ORD),(3,DFW),800)
      ((3,DFW),(1,SFO),1400)
      // print out longest routes
      graph.triplets.sortBy(_.attr, ascending = false)
        .map(triplet => "Distance " + triplet.attr.toString + " from " + triplet.srcAttr + " to " + triplet.dstAttr)
        .collect.foreach(println)
      Distance 1800 from SFO to ORD
      Distance 1400 from DFW to SFO
      Distance 800 from ORD to DFW
  46. Graph Operators: Which airport has the most incoming flights? (real dataset)
      // Define a function to compute the highest degree vertex
      def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
        if (a._2 > b._2) a else b
      }
      // Which airport has the most incoming flights?
      val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
      // (10397,152) ATL
  47. Graph Operators: Which 3 airports have the most incoming flights? (real dataset)
      // get top 3
      val maxIncoming = graph.inDegrees.collect
        .sortWith(_._2 > _._2)
        .map(x => (airportMap(x._1), x._2)).take(3)
      maxIncoming.foreach(println)
      (ATL,152)
      (ORD,145)
      (DFW,143)
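
As an alternative to collecting every in-degree and sorting on the driver, the RDD `top` method can return only the largest entries. This is a sketch on a hypothetical toy graph, not the deck's real dataset, and `airportMap` here is an assumed local map from vertex ID to airport code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("top-in-degree-sketch"))

// Assumed lookup table from vertex ID to airport code.
val airportMap: Map[VertexId, String] =
  Map(1L -> "SFO", 2L -> "ORD", 3L -> "DFW", 4L -> "ATL")

// Hypothetical routes chosen so in-degrees differ: DFW gets 3, ORD 2, SFO 1.
val toyRoutes = sc.parallelize(Seq(
  Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(4L, 3L, 1),
  Edge(1L, 2L, 1), Edge(4L, 2L, 1), Edge(3L, 1L, 1)))
val toyGraph = Graph.fromEdges(toyRoutes, defaultValue = "n/a")

// top(n) keeps only the n largest entries under the given ordering,
// here ordering (vertexId, inDegree) pairs by the in-degree.
val top2 = toyGraph.inDegrees.top(2)(Ordering.by[(VertexId, Int), Int](_._2))
val top2Named = top2.map { case (id, n) => (airportMap(id), n) }

top2Named.foreach(println)
```

On a large graph this avoids shipping every (vertex, degree) pair to the driver before truncating to the few that are needed.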
  48. Graph Operators: Caching Graphs
      Operator                       Description
      cache()                        Caches the vertices and edges; default level is MEMORY_ONLY
      persist(newLevel)              Caches the vertices and edges at specified storage level; returns a reference to this graph
      unpersist(blocking)            Uncaches both vertices and edges of this graph
      unpersistVertices(blocking)    Uncaches only the vertices, leaving edges alone
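
A minimal sketch of caching a graph that is reused across several actions. It chooses the storage level through the `edgeStorageLevel` and `vertexStorageLevel` parameters of `Graph(...)` at construction time, which is one way to do it and an assumption of this example rather than something the deck prescribes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graph-caching-sketch"))

val cachedVertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "SFO"), (2L, "ORD"), (3L, "DFW")))
val cachedEdges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1800), Edge(2L, 3L, 800), Edge(3L, 1L, 1400)))

// Pick the storage level when the graph is built.
val cachedGraph = Graph(cachedVertices, cachedEdges, "nowhere",
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

// cache() marks vertices and edges for reuse; the actions below then share them.
cachedGraph.cache()
val airportCount = cachedGraph.numVertices
val routeCount = cachedGraph.numEdges

// Release the cached vertices and edges once the graph is no longer needed.
cachedGraph.unpersist(blocking = false)
```

Caching pays off mainly for iterative work such as Pregel or PageRank, where the same graph is traversed many times.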
  49. Graph Class
  50. Class Discussion
      1. How many airports are there?
         • In our graph, what represents airports?
         • Which operator could you use to find the number of airports?
      2. How many routes are there?
         • In our graph, what represents routes?
         • Which operator could you use to find the number of routes?
  51. How Many Airports are There?
      • In our graph, what represents airports? Vertices
      • Which operator could you use to find the number of airports? graph.numVertices
  52. Pregel API
      • GraphX exposes a variant of the Pregel API
      • Iterative graph processing
        – Iterations of message passing between vertices
  53. The Graph-Parallel Abstraction
      A user-defined Vertex-Program runs on each graph vertex
      • Using messages (e.g. Pregel)
      • Parallelism: run multiple vertex programs simultaneously
  54. Pregel Operator
      Super step 1: initial message received at each vertex; message computed at each vertex
      Super step 2: sum of messages received at each vertex; message computed at each vertex
      Super step n: sum of messages received at each vertex; message computed at each vertex
      Loop until no messages are left OR max iterations is reached
  55. Pregel Operator: Example. Use Pregel to find the cheapest airfare:
      // starting vertex
      val sourceId: VertexId = 13024
      // a graph with edges containing airfare cost calculation
      val gg = graph.mapEdges(e => 50.toDouble + e.attr.toDouble / 20)
      // initialize graph, all vertices except source have distance infinity
      val initialGraph = gg.mapVertices((id, _) =>
        if (id == sourceId) 0.0 else Double.PositiveInfinity)
  56. Graph Class: Pregel
  57. Pregel Operator: Example. Use Pregel to find the cheapest airfare:
      // call pregel on graph
      val sssp = initialGraph.pregel(Double.PositiveInfinity)(
        // Vertex Program
        (id, distCost, newDistCost) => math.min(distCost, newDistCost),
        // Send Message
        triplet => {
          if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
            Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
          } else {
            Iterator.empty
          }
        },
        // Merge Message
        (a, b) => math.min(a, b)
      )
  58. Pregel Operator: Example. Use Pregel to find the cheapest airfare:
      // routes, lowest flight cost
      println(sssp.edges.take(4).mkString("\n"))
      Edge(10135,10397,84.6)
      Edge(10135,13930,82.7)
      Edge(10140,10397,113.45)
      Edge(10140,10821,133.5)
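
The Pregel example can be run end to end on the three-airport graph from earlier in the deck. This sketch reuses the deck's fare formula (50 + distance / 20) and assumes SFO (vertex 1) as the source instead of the real dataset's vertex 13024. In the result, the edge attributes are per-route fares, while the cheapest total fare from the source to each airport is held in the vertex attributes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("pregel-sssp-sketch"))

val ssspVertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "SFO"), (2L, "ORD"), (3L, "DFW")))
val ssspEdges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1800), Edge(2L, 3L, 800), Edge(3L, 1L, 1400)))
val routeGraph = Graph(ssspVertices, ssspEdges, "nowhere")

// Assumed source for this toy run: SFO.
val sourceId: VertexId = 1L

// Edge attribute becomes the fare for that route: 50 + distance / 20.
val fareGraph = routeGraph.mapEdges(e => 50.0 + e.attr.toDouble / 20)

// Every vertex starts at infinity except the source, which starts at 0.
val initialGraph = fareGraph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the cheaper of the current and incoming cost.
  (id, distCost, newDistCost) => math.min(distCost, newDistCost),
  // Send message: offer the destination a cheaper total if one exists.
  triplet => {
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  // Merge messages: of several offers, keep the cheapest.
  (a, b) => math.min(a, b))

// Cheapest total fare from the source to each airport.
val cheapest = sssp.vertices.collect().toMap
cheapest.toSeq.sortBy(_._1).foreach(println)
```

With these inputs the fares are 140.0 for SFO to ORD and 90.0 for ORD to DFW, so the cheapest totals from SFO are 0.0, 140.0, and 230.0 for SFO, ORD, and DFW respectively.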
  59. PageRank
      • Measures the importance of vertices in a graph
      • In-links are votes
      • In-links from important vertices are more important
      • Returns a graph with vertex attributes
      graph.pageRank(tolerance).vertices
  60. Page Rank: Example. Use Page Rank:
      // use pageRank
      val ranks = graph.pageRank(0.1).vertices
      // join the ranks with the map of airport id to name
      val temp = ranks.join(airports)
      temp.take(1)
      // Array((15370,(0.5365013694244737,TUL)))
      // sort by ranking
      val temp2 = temp.sortBy(_._2._1, false)
      temp2.take(2)
      // Array((10397,(5.431032677813346,ATL)), (13930,(5.4148119418905765,ORD)))
      // get just the airport names
      val impAirports = temp2.map(_._2._2)
      impAirports.take(4)
      // res6: Array[String] = Array(ATL, ORD, DFW, DEN)
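
A small runnable version of the PageRank steps, on a hypothetical hub-shaped graph rather than the real flight dataset: three airports all fly into DFW, and DFW flies to SFO. The vertex names and routes are invented for illustration, and the tolerance of 0.001 is an arbitrary choice.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("pagerank-sketch"))

val hubAirports: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "SFO"), (2L, "ORD"), (3L, "DFW"), (4L, "ATL")))
// Hypothetical routes: everything feeds DFW, and DFW feeds SFO.
val hubRoutes: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 3L, 1), Edge(2L, 3L, 1), Edge(4L, 3L, 1), Edge(3L, 1L, 1)))
val hubGraph = Graph(hubAirports, hubRoutes, "nowhere")

// Run PageRank until the ranks change by less than the tolerance.
val hubRanks = hubGraph.pageRank(0.001).vertices

// Join ranks with airport names, sort by rank descending, keep the names.
val rankedNames = hubRanks.join(hubAirports)
  .sortBy(_._2._1, ascending = false)
  .map(_._2._2)
  .collect()

println(rankedNames.mkString(", "))
```

DFW comes out on top because it receives links from every other airport, which is the "in-links are votes" idea from the PageRank slide.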
  61. Use Case
      • Monitor air traffic at airports
      • Monitor delays
      • Analyze airport and routes overall
      • Analyze airport and routes by airline
  62. Learn More
      • https://www.mapr.com/blog/how-get-started-using-apache-spark-graphx-scala
      • GraphX Programming Guide http://spark.apache.org/docs/latest/graphx-programming-guide.html
      • MapR announces Free Complete Apache Spark Training and Developer Certification https://www.mapr.com/company/press-releases/mapr-unveils-free-complete-apache-spark-training-and-developer-certification
      • Free Spark On Demand Training http://learn.mapr.com/?q=spark#-l
      • Get Certified on Spark with MapR Spark Certification http://learn.mapr.com/?q=spark#certification-1,-l
  63. MapR Converged Data Platform (architecture diagram; labels: Open Source Engines & Tools, Commercial Engines & Applications, Enterprise-Grade Platform Services, Data Processing, Web-Scale Storage, MapR-FS, MapR-DB, MapR Streams, Database, Event Streaming, Search and Others, Custom Apps, Cloud and Managed Services, Real Time, Unified Security, Multi-tenancy, Disaster Recovery, Global Namespace, High Availability, Unified Management and Monitoring, HDFS API, POSIX, NFS, Kafka API, HBase API, OJAI API)
  • ssuserce170b

    Jun. 15, 2019
