SlideShare a Scribd company logo
1 of 20
Download to read offline
Google Cloud Platform
for Data Science teams
Barton Rhodes [@bmorphism]
Senior Data Scientist at Pandata [pandata.co]
Why Cloud?
Be it AWS, GCP, or Azure, having a well-integrated
‘productionizing’ workflow brings these benefits to data
scientists:
● escape limitations of a single machine
● reproducible infrastructure (e.g. Terraform)
● scale up on demand
● use highly specialized products and workflows (no need
to set up full systems)
● lessen DevOps / data engineering burden
● leaner than a packaged product (e.g. Domino Data Lab,
databricks)
Why Google Cloud?
Our reasons:
● relatively good UX and clean APIs
● embraces community tools (e.g. Apache Beam, Jupyter)
● integrates with Google tools (TensorFlow, Kubernetes)
● price (especially storage) and per-minute billing
● HIPAA-compliant infrastructure and ISO security
certifications (important for a consultancy)
Google Cloud Platform for Data Science
Datalab
BigQuery Machine
Learning
Dataflow
Dataproc
...
Container Engine
Dataprep
Natural
Language API
Vision API
Video
Intelligence API
Focus of this talk - Data Engineering
Datalab
BigQuery Machine
Learning
Dataflow
Dataproc
...
Container Engine
Dataprep
Natural
Language API
Vision API
Video
Intelligence API
GCP Documentation
Documentation available at:
https://cloud.google.com/docs/
Active certification program: Coursera specialization:
GCP Documentation
GCP Console
Google Cloud SDK
A set of command line tools to manage GCP:
gcloud - general API interaction, auth, configuration
gsutil - manage Google Storage
bq - BigQuery queries and management
kubectl - manage containers in Kubernetes
Client libraries for Python, Java, NodeJS, and others!
Datalab (aka Jupyter)
● interactive data
science in modified
Jupyter
● runs inside of a
Docker container
● integrates with the
rest of GCP
● alas, single user
● alas, only Python by
default
BigQuery
Massively parallel query engine with SQL-like query language compatible
with SQL 2011 standard.
● acts as a data warehouse at Google scale
● columnar data storage
● lower cost than other forms of storage
● BigQuery Slots to guarantee resources
● availability of public datasets
● no UPDATE / DELETE on existing data
Tableau can tap directly into BigQuery, making it easy to perform BI-type
tasks over large datasets quickly.
BigQuery example
In this example, one can use the publicly available StackOverflow data
(~150GiB) to calculate % answered questions over the years:
BigQuery example (continued)
Output:
Dataproc
Run Apache Hadoop and Apache Spark clusters in the Cloud.
● fully managed solution (removes configuration headaches)
● easy to scale quickly through a Web UI / CLI
● can substitute a much faster Google Cloud Storage for HDFS
● can start with a job and provision a cluster appropriately
● cannot customize initial components trivially like one would with
Amazon EMR (can use init actions through scripts)
● Apache Zeppelin does not come pre-installed
For sample init actions, see the following repository:
https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
Dataproc example
Dataflow (aka Apache Beam)
A runner for Apache Beam (also donated by Google).
Beam model allows for unified semantics for batch & streaming systems, enabling the
coveted write once run everywhere* approach to building data pipelines.
Dataflow example
Machine Learning
● integrates TensorFlow for training and prediction
● automatically provisions training and prediction instances
(purpose-built compute instances)
● built-in into Datalab
● operates in batch mode only, streaming support experimental
● still on Python 2.7 (boo!)
Some sample projects:
https://github.com/GoogleCloudPlatform/cloudml-samples
Cloud TPUs (alpha)
Google-designed integrated circuits highly optimized for reduced precision
operations, integration with TensorFlow.
Accelerates AI inference and now also training.
Exciting times ahead!
Bringing it all together(demo)
Adapted from:
https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml
In this demo, we will predict birth weight using Linear Regression in Apache
Spark (Dataproc) using the publicly available BigQuery natality dataset.

More Related Content

What's hot

Introduction to Google Compute Engine
Introduction to Google Compute EngineIntroduction to Google Compute Engine
Introduction to Google Compute Engine
Colin Su
 

What's hot (20)

Google Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scaleGoogle Cloud Platform: Prototype ->Production-> Planet scale
Google Cloud Platform: Prototype ->Production-> Planet scale
 
Google cloud platform introduction
Google cloud platform introductionGoogle cloud platform introduction
Google cloud platform introduction
 
Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies
 
Introduction to Google Cloud Platform
Introduction to Google Cloud PlatformIntroduction to Google Cloud Platform
Introduction to Google Cloud Platform
 
Google Cloud Technologies Overview
Google Cloud Technologies OverviewGoogle Cloud Technologies Overview
Google Cloud Technologies Overview
 
Google Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App EngineGoogle Cloud Platform, Compute Engine, and App Engine
Google Cloud Platform, Compute Engine, and App Engine
 
Google Cloud Platform Introduction - 2016Q3
Google Cloud Platform Introduction - 2016Q3Google Cloud Platform Introduction - 2016Q3
Google Cloud Platform Introduction - 2016Q3
 
Google Cloud Platform Empowers TensorFlow and Machine Learning
Google Cloud Platform Empowers TensorFlow and Machine LearningGoogle Cloud Platform Empowers TensorFlow and Machine Learning
Google Cloud Platform Empowers TensorFlow and Machine Learning
 
#DataUnlimited - Google Big Data Unlimited
#DataUnlimited - Google Big Data Unlimited#DataUnlimited - Google Big Data Unlimited
#DataUnlimited - Google Big Data Unlimited
 
Google I/O 2016 Recap - Google Cloud Platform News Update
Google I/O 2016 Recap - Google Cloud Platform News UpdateGoogle I/O 2016 Recap - Google Cloud Platform News Update
Google I/O 2016 Recap - Google Cloud Platform News Update
 
Introduction to Google Compute Engine
Introduction to Google Compute EngineIntroduction to Google Compute Engine
Introduction to Google Compute Engine
 
Google Cloud Platform - Eric Johnson, Joe Selman - ManageIQ Design Summit 2016
Google Cloud Platform - Eric Johnson, Joe Selman - ManageIQ Design Summit 2016Google Cloud Platform - Eric Johnson, Joe Selman - ManageIQ Design Summit 2016
Google Cloud Platform - Eric Johnson, Joe Selman - ManageIQ Design Summit 2016
 
TIAD : Automate everything with Google Cloud
TIAD : Automate everything with Google CloudTIAD : Automate everything with Google Cloud
TIAD : Automate everything with Google Cloud
 
Cloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - PresentationCloud computing by Google Cloud Platform - Presentation
Cloud computing by Google Cloud Platform - Presentation
 
MongoDB Days UK: Run MongoDB on Google Cloud Platform
MongoDB Days UK: Run MongoDB on Google Cloud PlatformMongoDB Days UK: Run MongoDB on Google Cloud Platform
MongoDB Days UK: Run MongoDB on Google Cloud Platform
 
Google Cloud Platform Update
Google Cloud Platform UpdateGoogle Cloud Platform Update
Google Cloud Platform Update
 
Google Cloud Platform - Service Glossary
Google Cloud Platform - Service GlossaryGoogle Cloud Platform - Service Glossary
Google Cloud Platform - Service Glossary
 
Exploring BigData with Google BigQuery
Exploring BigData with Google BigQueryExploring BigData with Google BigQuery
Exploring BigData with Google BigQuery
 
 Introduction google cloud platform
 Introduction google cloud platform Introduction google cloud platform
 Introduction google cloud platform
 
Google Compute Engine
Google Compute EngineGoogle Compute Engine
Google Compute Engine
 

Similar to Google Cloud Platform for Data Science teams

Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 

Similar to Google Cloud Platform for Data Science teams (20)

Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
Cloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs GoogleCloud comparison - AWS vs Azure vs Google
Cloud comparison - AWS vs Azure vs Google
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Introduction to GCP Data Flow Presentation
Introduction to GCP Data Flow PresentationIntroduction to GCP Data Flow Presentation
Introduction to GCP Data Flow Presentation
 
Introduction to GCP DataFlow Presentation
Introduction to GCP DataFlow PresentationIntroduction to GCP DataFlow Presentation
Introduction to GCP DataFlow Presentation
 
Accelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & AlluxioAccelerating workloads and bursting data with Google Dataproc & Alluxio
Accelerating workloads and bursting data with Google Dataproc & Alluxio
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
A fresh look at Google’s Cloud by Mandy Waite
A fresh look at Google’s Cloud by Mandy Waite A fresh look at Google’s Cloud by Mandy Waite
A fresh look at Google’s Cloud by Mandy Waite
 
Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute final
 
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
Scale with a smile with Google Cloud Platform At DevConTLV (June 2014)
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdf
 
Powerful Google Cloud tools for your hack
Powerful Google Cloud tools for your hackPowerful Google Cloud tools for your hack
Powerful Google Cloud tools for your hack
 
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
Automated ML Workflow for Distributed Big Data Using Analytics Zoo (CVPR2020 ...
 
Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.Google Dremel. Concept and Implementations.
Google Dremel. Concept and Implementations.
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 
Gdsc muk - innocent
Gdsc   muk - innocentGdsc   muk - innocent
Gdsc muk - innocent
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Google Cloud Platform for Data Science teams

  • 1. Google Cloud Platform for Data Science teams Barton Rhodes [@bmorphism] Senior Data Scientist at Pandata [pandata.co]
  • 2. Why Cloud? Be it AWS, GCP, or Azure, having a well-integrated ‘productionizing’ workflow brings these benefits to data scientists: ● escape limitations of a single machine ● reproducible infrastructure (e.g. Terraform) ● scale up on demand ● use highly specialized products and workflows (no need to set up full systems) ● lessen DevOps / data engineering burden ● leaner than a packaged product (e.g. Domino Data Lab, databricks)
  • 3. Why Google Cloud? Our reasons: ● relatively good UX and clean APIs ● embraces community tools (e.g. Apache Beam, Jupyter) ● integrates with Google tools (TensorFlow, Kubernetes) ● price (especially storage) and per-minute billing ● HIPAA-compliant infrastructure and ISO security certifications (important for a consultancy)
  • 4. Google Cloud Platform for Data Science Datalab BigQuery Machine Learning Dataflow Dataproc ... Container Engine Dataprep Natural Language API Vision API Video Intelligence API
  • 5. Focus of this talk - Data Engineering Datalab BigQuery Machine Learning Dataflow Dataproc ... Container Engine Dataprep Natural Language API Vision API Video Intelligence API
  • 6. GCP Documentation Documentation available at: https://cloud.google.com/docs/ Active certification program: Coursera specialization:
  • 9. Google Cloud SDK A set of command line tools to manage GCP: gcloud - general API interaction, auth, configuration gsutil - manage Google Storage bq - BigQuery queries and management kubectl - manage containers in Kubernetes Client libraries for Python, Java, NodeJS, and others!
  • 10. Datalab (aka Jupyter) ● interactive data science in modified Jupyter ● runs inside of a Docker container ● integrates with the rest of GCP ● alas, single user ● alas, only Python by default
  • 11. BigQuery Massively parallel query engine with SQL-like query language compatible with SQL 2011 standard. ● acts as a data warehouse at Google scale ● columnar data storage ● lower cost than other forms of storage ● BigQuery Slots to guarantee resources ● availability of public datasets ● no UPDATE / DELETE on existing data Tableau can tap directly into BigQuery, making it easy to perform BI-type tasks over large datasets quickly.
  • 12. BigQuery example In this example, one can use the publicly available StackOverflow data (~150GiB) to calculate % answered questions over the years:
  • 14. Dataproc Run Apache Hadoop and Apache Spark clusters in the Cloud. ● fully managed solution (removes configuration headaches) ● easy to scale quickly through a Web UI / CLI ● can substitute a much faster Google Cloud Storage for HDFS ● can start with a job and provision a cluster appropriately ● cannot customize initial components trivially like one would with Amazon EMR (can use init actions through scripts) ● Apache Zeppelin does not come pre-installed For sample init actions, see the following repository: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
  • 16. Dataflow (aka Apache Beam) A runner for Apache Beam (also donated by Google). Beam model allows for unified semantics for batch & streaming systems, enabling the coveted write once run everywhere* approach to building data pipelines.
  • 18. Machine Learning ● integrates TensorFlow for training and prediction ● automatically provisions training and prediction instances (purpose-built compute instances) ● built-in into Datalab ● operates in batch mode only, streaming support experimental ● still on Python 2.7 (boo!) Some sample projects: https://github.com/GoogleCloudPlatform/cloudml-samples
  • 19. Cloud TPUs (alpha) Google-designed integrated circuits highly optimized for reduced precision operations, integration with TensorFlow. Accelerates AI inference and now also training. Exciting times ahead!
  • 20. Bringing it all together(demo) Adapted from: https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml In this demo, we will predict birth weight using Linear Regression in Apache Spark (Dataproc) using the publicly available BigQuery natality dataset.