COPYRIGHT © 2015 PROKARMA INC. Copyrights, trademarks, and registered trademarks for all technology described in this document are owned by the respective companies.
OVERVIEW
SERVICES
•  Strategy
•  Technology Verticals
•  Industry Verticals
•  Platform Solutions
PRESENCE
•  21 Offices including 11 Global Delivery Centers in US, India, Argentina & Peru
CUSTOMERS
•  150+ Customers
•  40 Fortune 1000 Companies
PEOPLE
•  Doing business since 2000
•  Global Staff of 2000+ employees
What should we talk about?
•  problem formulation -> tool chain construction
•  data science with notebooks
•  the data set
•  Demo: Spark + GraphLab Create
•  RDD -> SFrame
•  ML on PySpark
•  introduction to boosted trees in GraphLab Create
•  Demo: Spark + GraphLab Create
•  PySpark ML with GraphLab Create
problem formulation -> tool chain construction
Moving down the data pipeline from raw to results
How best to quickly move through pipeline to:
1.  Show value of work
2.  Communicate results
3.  Move models into production pipeline
[Diagram, based on a diagram by Paco Nathan. Labels: Use lots of different data sources; ETL; Clean up your data; Run Sophisticated Analytics; Discover Insights!; Integrate Results into Pipelines; Communicate Results and Value; …NOTEBOOKS]
problem formulation -> tool chain construction
“Models are not right or wrong; they're always wrong. They're always approximations. The question you have to ask is whether a model tells you more information than you would have had otherwise. If it does, it's skillful.” - Gavin Schmidt’s excellent TED Talk
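One way to make the quote's test concrete is to compare a model's error with the error of a naive baseline. The sketch below is an illustration only and is not from the talk: it assumes mean squared error as the metric, and `skill_score` is a hypothetical helper name. A score above 0 means the model tells you more than the baseline would have.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between paired values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def skill_score(actual, model_pred, baseline_pred):
    """Return 1 - MSE(model) / MSE(baseline).

    1.0 is a perfect model, 0.0 is no better than the baseline,
    and a negative value is worse than the baseline.
    """
    baseline_mse = mean_squared_error(actual, baseline_pred)
    if baseline_mse == 0:
        raise ValueError("baseline is already perfect; skill is undefined")
    return 1.0 - mean_squared_error(actual, model_pred) / baseline_mse


# Toy numbers, made up for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
baseline = [2.5, 2.5, 2.5, 2.5]   # always predict the mean
model = [1.5, 2.5, 3.5, 4.5]      # off by 0.5 everywhere
score = skill_score(actual, model, baseline)
```

The choice of baseline is the judgment call here: "what you would have had otherwise" might be the historical mean, last year's value, or an existing rule of thumb.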
Data science with notebooks allows data science teams to move quickly from exploration > transformation > modeling > visualization > export to pipeline
Many startups have been created to fill the need for a single product solution with collaboration and containerization
In the meantime:
PySpark +
IPython +
GraphLab Create
example data set: transportation safety
Fatality Analysis Reporting System (FARS)
National Highway Traffic Safety Administration (NHTSA)
●  Publicly available
●  Historical time-series
Currently available: 1975 – 2013
●  Raw, Rich, Relevant
●  Time-series, geo location
●  Human-recorded events transcribed into annual databases (dbf, SQL, SAS)
●  Measurable outcomes for modelling (updated on a yearly basis)
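Because the records carry both a year and a location, two natural first summaries are deaths per year and deaths per map cell. A minimal sketch follows; the record layout `(year, lat, lon, deaths)`, the function names, and the three sample rows are all made up for illustration and are not taken from FARS.

```python
import math
from collections import defaultdict

# Hypothetical records: (year, latitude, longitude, deaths).
# These values are invented for the example, not real FARS rows.
SAMPLE = [
    (1975, 47.6, -122.3, 1),
    (1975, 47.9, -122.1, 2),
    (2012, 33.4, -112.0, 1),
]


def deaths_by_year(records):
    """Sum deaths for each year."""
    totals = defaultdict(int)
    for year, _lat, _lon, deaths in records:
        totals[year] += deaths
    return dict(totals)


def deaths_by_cell(records, cell_degrees=5.0):
    """Sum deaths per lat/lon grid cell, keyed by the cell's south-west corner."""
    totals = defaultdict(int)
    for _year, lat, lon, deaths in records:
        key = (
            math.floor(lat / cell_degrees) * cell_degrees,
            math.floor(lon / cell_degrees) * cell_degrees,
        )
        totals[key] += deaths
    return dict(totals)


per_year = deaths_by_year(SAMPLE)
per_cell = deaths_by_cell(SAMPLE)
```

On a real extract the same two group-by operations would run on the cluster or in an SFrame; plain Python is used here only to keep the sketch self-contained.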
data science showcase
•  Proof of concept demonstration for customer concerned with Transportation Safety
•  Application of the scientific method to diverse data sets
•  Evolving real-world data sets for advanced analytics workshops and training sessions
•  Visual and conceptual presentation of scientific approach to computational analysis
DATA SCIENCE – DATA PROTOTYPING
[Figure: Fatalities in Fatal Accidents from 1975−2012. Axes: lat (25 to 50) and an unlabeled axis (−120 to −80); scale: deaths (0 to 160000)]
•  Scientific Method
•  Predictive Modeling
•  Hidden Insights
notes about my setup
Hardware:
•  MacBook Pro (late 2012)
•  ~36 GB free disk space
•  8 GB RAM
•  2 cores
•  Not exactly a blazingly fast, top-of-the-line machine…
Software:
•  Spark 1.1.0 for Hadoop 2.4
•  GraphLab Create 1.2.1
•  Hadoop 2.4
•  Scala 2.10.4
•  Python 2.7.9 on Anaconda 1.9.1
working with glc in pyspark
One more step…
gl.get_spark_integration_jar_path()
$SPARK_HOME/bin/spark-submit --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client my_awesome_code.py
Or…
$SPARK_HOME/bin/pyspark --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client
Works with --master local OR --master yarn-client
Then…it works :)
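The two launch commands on this slide differ only in the tool name and the trailing script, so they can be assembled from one helper. This is a rough sketch: `build_spark_command` and `as_shell_line` are hypothetical names, and only the flags and the two master values come from the slide. The jar path would be whatever `gl.get_spark_integration_jar_path()` returns in your environment.

```python
import shlex


def build_spark_command(spark_home, jar_path, master, script=None):
    """Return the argv list for pyspark (no script) or spark-submit (with script).

    Hypothetical helper; the flags mirror the commands shown on the slide.
    """
    if master not in ("local", "yarn-client"):
        raise ValueError("the slide only mentions 'local' and 'yarn-client'")
    tool = "spark-submit" if script else "pyspark"
    argv = [
        spark_home.rstrip("/") + "/bin/" + tool,
        "--driver-class-path", jar_path,
        "--master", master,
    ]
    if script:
        argv.append(script)
    return argv


def as_shell_line(argv):
    """Quote each argument so the line can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


# Placeholder paths for illustration only.
interactive = build_spark_command("/opt/spark", "/path/to/integration.jar", "local")
batch = build_spark_command(
    "/opt/spark", "/path/to/integration.jar", "yarn-client", "my_awesome_code.py"
)
```

Building an argv list first and quoting only at the end keeps paths with spaces safe, and the same list can be handed to `subprocess` directly.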
Demo time!
Graphic courtesy Dato
ML in PySpark…even better now
GraphLab Create
•  Recommender system
    ◦  Factorization-based methods
    ◦  Neighborhood-based item similarity
    ◦  Popularity-based methods
•  Classification
    ◦  Deep learning
    ◦  Boosted trees
    ◦  Support vector machine
    ◦  Logistic regression
•  Regression
    ◦  Boosted trees
    ◦  Deep learning
    ◦  Linear regression
•  Text analysis
    ◦  Topic modeling (LDA)
    ◦  Featurization utilities: bm25, tf_idf, remove stop words, etc.
•  Image analysis
    ◦  Deep learning
    ◦  Image load, resize, mean computation
•  Clustering – K-means
•  Graph analytics
•  Nearest neighbors
•  Vowpal Wabbit wrapper
MLlib + GraphX
•  Classification and regression
    ◦  linear models (SVMs, logistic regression, linear regression)
    ◦  naive Bayes
    ◦  decision trees
    ◦  ensembles of trees (Random Forests and Gradient-Boosted Trees)
•  Collaborative filtering
    ◦  alternating least squares (ALS)
•  Clustering
    ◦  k-means
•  Dimensionality reduction
    ◦  singular value decomposition (SVD)
    ◦  principal component analysis (PCA)
•  Optimization (developer)
    ◦  stochastic gradient descent
    ◦  limited-memory BFGS (L-BFGS)
•  Graph analytics
Gradient Boosted Trees with GraphLab Create
[Figure: an ensemble of trees, tree t1 … tree tT, each with split nodes and leaf nodes and each taking input v. Source: Dato, ICCV 2009 Tutorial]
Demo time!
Graphic courtesy Dato
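The demo itself is not captured in the slides. As a rough stand-in for the idea behind boosted trees, the sketch below fits a sequence of one-split trees (stumps), each trained on the residuals left by the trees before it. It is plain Python for a single numeric feature with squared-error loss, not GraphLab Create's implementation; all function names and the toy data are assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature that minimises squared error."""
    candidates = sorted(set(xs))[:-1]  # splitting at the max leaves one side empty
    if not candidates:
        raise ValueError("need at least two distinct feature values")
    best = None
    for threshold in candidates:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then add stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy step function, made up for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_boosted_stumps(xs, ys)
```

The learning rate shrinks each tree's contribution, so more trees are needed but no single tree dominates; production libraries add deeper trees, subsampling, and other loss functions on top of this loop.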
http://goo.gl/forms/y8LYl53hje
?
DATA SCIENCE – DATA PROTOTYPING
Amanda Casari
acasari@prokarma.com
humor from xkcd

Spark Hearts GraphLab Create

  • 1. PK COPYRIGHT © 2015 PROKARMA INC. Copyrights, trademarks, and registered trademarks for all technology described in this document are owned by the respective companies. +
  • 2. OVERVIEW. SERVICES: Strategy; Technology Verticals; Industry Verticals; Platform Solutions. PRESENCE: 21 offices, including 11 Global Delivery Centers in US, India, Argentina & Peru. CUSTOMERS: 150+ customers; 40 Fortune 1000 companies. PEOPLE: Doing business since 2000; global staff of 2000+ employees.
  • 3. What should we talk about? problem formulation -> tool chain construction; data science with notebooks; the data set; Demo: Spark + GraphLab Create (RDD -> SFrame); ML on PySpark; introduction to boosted trees in GraphLab Create; Demo: Spark + GraphLab Create (PySpark ML with GraphLab Create).
  • 4. problem formulation -> tool chain construction. Moving down the data pipeline from raw to results. How best to quickly move through the pipeline to: 1. Show value of work; 2. Communicate results; 3. Move models into production pipeline. [Diagram labels: Use lots of different data sources; ETL; Clean up your data; Run Sophisticated Analytics; Discover Insights!; Integrate Results into Pipelines; Communicate Results and Value; …NOTEBOOKS. Based on Diagram Source: Paco Nathan]
  • 5. problem formulation -> tool chain construction. “Models are not right or wrong; they're always wrong. They're always approximations. The question you have to ask is whether a model tells you more information than you would have had otherwise. If it does, it's skillful.” (Gavin Schmidt’s excellent TED Talk) Data science with notebooks allows data science teams to move quickly from exploration > transformation > modeling > visualization > export to pipeline. Many startups have been created to fill the need for a one-product solution with collaboration and containerization. In the meantime: PySpark + IPython + GraphLab Create.
  • 6. example data set: transportation safety. Fatality Analysis Reporting System, National Highway Traffic Safety Administration
      • Publicly available
      • Historical time series, currently available for 1975 – 2013
      • Raw, Rich, Relevant
      • Time-series, geo location
      • Human-recorded events transcribed into annual databases (dbf, SQL, SAS)
      • Measurable outcomes for modelling (updated on a yearly basis)
  • 7. data science showcase (DATA SCIENCE – DATA PROTOTYPING)
      • Proof of concept demonstration for a customer concerned with Transportation Safety
      • Application of the scientific method towards diverse data sets
      • Evolving real-world data sets for advanced analytics workshops and training sessions
      • Visual and conceptual presentation of a scientific approach to computational analysis
      • Scientific Method; Predictive Modeling; Hidden Insights
      [Figure: "Fatalities in Fatal Accidents from 1975−2012"; axis: lat; color scale: deaths]
  • 8. notes about my setup
      • Hardware: MacBook Pro (late 2012); ~36 GB free disk space; 8 GB RAM; 2 cores. Not exactly a blazingly fast, top of the line machine…
      • Software: Spark 1.1.0 for Hadoop 2.4; GraphLab Create 1.2.1; Hadoop 2.4; Scala 2.10.4; Python 2.7.9 on Anaconda 1.9.1
  • 9. working with glc in pyspark. One more step: gl.get_spark_integration_jar_path() gives the path to the integration jar. Then either submit a script:
      $SPARK_HOME/bin/spark-submit --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client my_awesome_code.py
    Or start an interactive shell:
      $SPARK_HOME/bin/pyspark --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client
    Works with --master local OR --master yarn-client. Then…it works :)
  • 11. ML in PySpark…even better now
      • GraphLab Create
          • Recommender system: factorization-based methods; neighborhood-based item similarity; popularity-based methods
          • Classification: deep learning; boosted trees; support vector machine; logistic regression
          • Regression: boosted trees; deep learning; linear regression
          • Text analysis: topic modeling (LDA); featurization utilities (bm25, tf_idf, remove stop words, etc.)
          • Image analysis: deep learning; image load, resize, mean computation
          • Clustering: K-means
          • Graph analytics
          • Nearest neighbors
          • Vowpal Wabbit wrapper
      • MLlib + GraphX
          • Classification and regression: linear models (SVMs, logistic regression, linear regression); naive Bayes; decision trees; ensembles of trees (Random Forests and Gradient-Boosted Trees)
          • Collaborative filtering: alternating least squares (ALS)
          • Clustering: k-means
          • Dimensionality reduction: singular value decomposition (SVD); principal component analysis (PCA)
          • Optimization (developer): stochastic gradient descent; limited-memory BFGS (L-BFGS)
          • Graph analytics
  • 12. Gradient Boosted Trees with GraphLab Create. [Figure: an ensemble of trees t1 … tT, with split nodes and leaf nodes labeled. Source: Dato, ICCV 2009 Tutorial]
  • 15. DATA SCIENCE – DATA PROTOTYPING. Amanda Casari, acasari@prokarma.com. Humor from xkcd.
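
The two launch commands on slide 9 differ only in the Spark tool used and the trailing script name, so they can be assembled by one small helper. The following is a minimal sketch using only the Python standard library; the function name `build_spark_command`, its parameters, and the `/opt/spark` install location are hypothetical and not part of the talk. In practice the jar path would come from `gl.get_spark_integration_jar_path()`, which is left as a plain string here so the example runs without GraphLab Create installed.

```python
import os
import shlex


def build_spark_command(jar_path, script=None, master="yarn-client", spark_home=None):
    """Return the argument list for launching PySpark with the integration jar.

    With a script, the command uses spark-submit; without one, it starts the
    interactive pyspark shell. (Hypothetical helper, for illustration only.)
    """
    # Fall back to the SPARK_HOME environment variable when no path is given.
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    tool = "spark-submit" if script else "pyspark"
    argv = [
        "{}/bin/{}".format(spark_home, tool),
        "--driver-class-path", jar_path,  # jar path as shown on slide 9
        "--master", master,               # slide 9 reports local and yarn-client working
    ]
    if script:
        argv.append(script)
    return argv


# Assumed example values; the jar path is the placeholder used on the slide.
cmd = build_spark_command(
    "/path/to/graphlab-create-spark-integration.jar",
    script="my_awesome_code.py",
    spark_home="/opt/spark",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Returning a list rather than a single string keeps the result directly usable with `subprocess`, and quoting is applied only when the command is printed for a person to read.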
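
Slide 12 pictures boosted trees as a sequence of trees t1 … tT whose outputs are combined. As a rough illustration of that general idea (and not of GraphLab Create's implementation or API), here is a pure-Python sketch of gradient boosting for regression under strong simplifying assumptions: a single numeric feature, depth-1 trees (stumps), and squared-error loss, so each new tree is fit to the current residuals. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 tree: pick the threshold that minimizes squared error."""
    best = None
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    if best is None:
        # All feature values are identical: predict the mean on both sides.
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each tree's contribution by the learning rate.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data (made up for illustration): a step from 1.0 to 5.0.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
model = fit_boosted(xs, ys)
print(predict(model, 2), predict(model, 7))
```

Production libraries add multiple features, deeper trees, other loss functions, and regularization; the loop structure of fitting each tree to what the previous trees got wrong is the part this sketch is meant to show.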