
Spark Hearts GraphLab Create

Seattle Spark Meetup 2015 - Technical talk on integrating Apache Spark with GraphLab Create


  1. PK COPYRIGHT © 2015 PROKARMA INC. Copyrights, trademarks, and registered trademarks for all technology described in this document are owned by the respective companies.
  2. OVERVIEW
     • Services: Strategy, Technology Verticals, Industry Verticals, Platform Solutions
     • Presence: 21 offices, including 11 global delivery centers in the US, India, Argentina & Peru
     • Customers: 150+ customers, including 40 Fortune 1000 companies
     • People: doing business since 2000, with a global staff of 2000+ employees
  3. What should we talk about?
     • problem formulation -> tool chain construction
     • data science with notebooks
     • the data set
     • Demo: Spark + GraphLab Create (RDD -> SFrame)
     • ML on PySpark
     • introduction to boosted trees in GraphLab Create
     • Demo: PySpark ML with GraphLab Create
  4. problem formulation -> tool chain construction
     Moving down the data pipeline from raw to results. How best to move quickly through the pipeline to:
     1. Show the value of the work
     2. Communicate results
     3. Move models into the production pipeline
     Pipeline stages (based on a diagram by Paco Nathan): use lots of different data sources, ETL, clean up your data, run sophisticated analytics, discover insights, integrate results into pipelines, communicate results and value... with NOTEBOOKS.
  5. problem formulation -> tool chain construction
     "Models are not right or wrong; they're always wrong. They're always approximations. The question you have to ask is whether a model tells you more information than you would have had otherwise. If it does, it's skillful." (Gavin Schmidt's excellent TED Talk)
     Data science with notebooks allows data science teams to move quickly from exploration > transformation > modeling > visualization > export to pipeline. Many startups are being created to fill the need for a single product solution with collaboration and containerization. In the meantime: PySpark + IPython + GraphLab Create.
  6. example data set: transportation safety
     Fatality Analysis Reporting System (FARS), National Highway Traffic Safety Administration
     • Publicly available
     • Historical time series, currently covering 1975-2013
     • Raw, rich, relevant: time series, geolocation
     • Human-recorded events transcribed into annual databases (DBF, SQL, SAS)
     • Measurable outcomes for modeling (updated on a yearly basis)
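Since FARS is distributed as annual event tables, a natural first step is aggregating records into a yearly time series. A minimal sketch with hypothetical toy records (the tuples and the `fatalities_by_year` helper are illustrative, not the real FARS schema):

```python
from collections import defaultdict

# Hypothetical FARS-style records: (year, state, fatality_count).
# The real data ships as annual DBF/SQL/SAS tables.
records = [
    (2011, "WA", 1), (2011, "OR", 2),
    (2012, "WA", 3), (2012, "WA", 1),
    (2013, "OR", 2),
]

def fatalities_by_year(rows):
    """Aggregate fatality counts into a yearly time series."""
    totals = defaultdict(int)
    for year, _state, deaths in rows:
        totals[year] += deaths
    return dict(totals)

print(fatalities_by_year(records))  # {2011: 3, 2012: 4, 2013: 2}
```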
  7. data science showcase: DATA SCIENCE – DATA PROTOTYPING
     • Proof-of-concept demonstration for a customer concerned with transportation safety
     • Application of the scientific method to diverse data sets
     • Evolving real-world data sets for advanced analytics workshops and training sessions
     • Visual and conceptual presentation of a scientific approach to computational analysis
     [Figure: map of "Fatalities in Fatal Accidents from 1975-2012" by latitude/longitude, colored by deaths]
     • Scientific method • Predictive modeling • Hidden insights
  8. notes about my setup
     Hardware:
     • MacBook Pro (late 2012)
     • ~36 GB free disk space
     • 8 GB RAM
     • 2 cores
     • Not exactly a blazingly fast, top-of-the-line machine...
     Software:
     • Spark 1.1.0 for Hadoop 2.4
     • GraphLab Create 1.2.1
     • Hadoop 2.4
     • Scala 2.10.4
     • Python 2.7.9 on Anaconda 1.9.1
  9. working with GLC in PySpark
     One more step... find the integration jar with gl.get_spark_integration_jar_path(), then:
     $SPARK_HOME/bin/spark-submit --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client
     Or:
     $SPARK_HOME/bin/pyspark --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client
     Works with --master local or --master yarn-client. Then... it works :)
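Written out fully, the launch steps on this slide look roughly like the sketch below (the jar path is the slide's placeholder, and `my_job.py` is a hypothetical driver script added here for illustration):

```shell
# Locate the integration jar that GraphLab Create ships with:
python -c "import graphlab as gl; print(gl.get_spark_integration_jar_path())"

# Batch job via spark-submit, with the jar on the driver classpath:
$SPARK_HOME/bin/spark-submit \
  --driver-class-path /path/to/graphlab-create-spark-integration.jar \
  --master yarn-client \
  my_job.py

# Or an interactive PySpark shell with the same jar
# (--master local works here as well):
$SPARK_HOME/bin/pyspark \
  --driver-class-path /path/to/graphlab-create-spark-integration.jar \
  --master yarn-client
```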
  10. Demo time! (Graphic courtesy Dato)
  11. ML in PySpark... even better now
      GraphLab Create:
      • Recommender systems: factorization-based methods, neighborhood-based item similarity, popularity-based methods
      • Classification: deep learning, boosted trees, support vector machine, logistic regression
      • Regression: boosted trees, deep learning, linear regression
      • Text analysis: topic modeling (LDA); featurization utilities: bm25, tf_idf, stop-word removal, etc.
      • Image analysis: deep learning; image load, resize, mean computation
      • Clustering: k-means
      • Graph analytics, nearest neighbors, Vowpal Wabbit wrapper
      MLlib + GraphX:
      • Classification and regression: linear models (SVMs, logistic regression, linear regression), naive Bayes, decision trees, ensembles of trees (random forests and gradient-boosted trees)
      • Collaborative filtering: alternating least squares (ALS)
      • Clustering: k-means
      • Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
      • Optimization (developer): stochastic gradient descent, limited-memory BFGS (L-BFGS)
      • Graph analytics
  12. Gradient Boosted Trees with GraphLab Create
      [Figure: an ensemble of decision trees t1 ... tT, each with split nodes and leaf nodes; a sample v is routed through every tree. Source: Dato, ICCV 2009 Tutorial]
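The figure's idea, an additive ensemble of small trees where each new tree fits what the previous ones got wrong, can be shown with a toy pure-Python sketch (depth-1 "stump" trees and squared-error boosting; this illustrates the general technique, not GraphLab Create's implementation):

```python
# Gradient boosting for squared error with regression stumps:
# each round fits a single-split tree to the current residuals,
# and the ensemble prediction is the shrunken sum of all stumps.

def fit_stump(xs, residuals):
    """Best threshold split of one feature, minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # degenerate split, skip
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left) +
               sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, rounds=20, lr=0.5):
    """Train an additive ensemble of stumps; returns a predict function."""
    trees = []
    preds = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * tree(x) for tree in trees)

model = boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
```

On this tiny data set the residuals shrink geometrically, so `model(1)` converges toward 1.0 and `model(4)` toward 3.0 after a few rounds.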
  13. Demo time! (Graphic courtesy Dato)
  14. Questions?
  15. DATA SCIENCE – DATA PROTOTYPING. Amanda Casari (+ humor from xkcd)