SlideShare a Scribd company logo
1 of 21
Download to read offline
Google Cloud Data Fusion
Drag & drop data pipelines
Balkan Misirli
Data Engineer @ Data Runs Deep
● Web Analytics (GA360) agency
● Google Cloud consulting partner
● Lots of BigQuery/Dataflow/Cloud Functions
Agenda (about 20 mins)
• What is Data Fusion?
• How does it compare?
• Demo
• Pricing + other details
• My first impressions
• Questions
A bit of background
● Data startup Cask developed open source
software CDAP (Cask Data App Platform)
● Google bought Cask last year
● GCP beta released Data Fusion as a
managed CDAP service last month
What is Data Fusion / CDAP ?
● A set of tools to
wrangle/explore data and
create pipelines
● Completely drag & drop
interface (no coding)
● Enables sharing of created
pipelines within organisation
How does it run pipelines?
● Converts GUI input into a DAG
to run as a Dataproc job
● Ephemeral Hadoop MR/Spark cluster
● Can also run on existing cluster (Terraform)
● Soon to be available for Dataflow execution
● All of this runs on GKE in the back end
● No AUS in-country option yet
Batch or streaming pipelines?
● Only batch for Basic edition
● Both batch and streaming for Enterprise edition
● Batch jobs run either Hadoop MR or Spark
● Streaming jobs run Spark Streaming
Demo !
Hub - Library of existing pipelines / plugins
Dashboard - shows all jobs that ran recently
Upload your own plugins / drivers / libraries
Wrangler - explore your data
Visual Pipeline Builder
My first impressions
● Instance creation takes up to 30 mins - slow!
● Hadoop execution is slow
● Web UI is pretty decent and intuitive
● Good (but maybe excessive) logging capability
● Quirky beta style errors
● Will definitely save labour hours
The good parts
● Pretty intuitive and easy
● Somewhat configurable (Env/CPUs/placeholder vars, etc)
● Stackdriver logging and monitoring available
● Open source, can import/export CDAP jobs - no vendor lock in
● Maybe cheaper than other enterprise alternatives
● Don’t have to operate your own Spark cluster!
The parts that have an exciting
journey of improvement ahead!
● PERMISSIONS!
● Wrangler only shows first 1000 rows - can be misleading when
filters/aggregations applied
● Doesn’t do input validation until runtime - annoying
● Java error stacktraces for a GUI based tool
Random. But at least it looks nice
Thorough Java stacktraces - perfect for GUI users!
Basic vs. Enterprise
Enterprise Only
● Streaming
● Can run in production
● Data lineage tool
● Choice of execution env
● Schedules & Triggers
● Unlimited simultaneous
pipeline execution
Both Editions
● Batch
● Can run in Dev/Sandbox
● Unlimited users
● Wrangler tool
● Visual pipeline builder
● (Basic) limit of 2
simultaneous pipelines
Pricing
● Priced in two parts: pipeline development + execution
● Development is USD $1.80 per hour (Basic) or
USD $4.20 per hour (Enterprise), billed by the minute
● First 120 hours of development on Basic edition is free
● Roughly $1100 per month for Basic, $3000 for Enterprise
● Execution is priced according to Dataproc VM pricing
Thanks !
I’ll share the slides on Linkedin SlideShare
Linkedin: linkedin.com/in/balkanmisirli
Email: balkan@datarunsdeep.com.au

More Related Content

What's hot

Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lakeDaeMyung Kang
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpJoseph Arriola
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Amazon Web Services
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaObjectRocket
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Yongho Ha
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)Faysal Shaarani (MBA)
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
JSON Data Parsing in Snowflake (By Faysal Shaarani)
JSON Data Parsing in Snowflake (By Faysal Shaarani)JSON Data Parsing in Snowflake (By Faysal Shaarani)
JSON Data Parsing in Snowflake (By Faysal Shaarani)Faysal Shaarani (MBA)
 
Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon Redshift로 데이터웨어하우스(DW) 구축하기Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon Redshift로 데이터웨어하우스(DW) 구축하기Amazon Web Services Korea
 
농심 그룹 메가마트 : 온프레미스 Exadata의 AWS 클라우드 환경 전환 사례 공유-김동현, NDS Cloud Innovation Ce...
농심 그룹 메가마트 : 온프레미스 Exadata의 AWS 클라우드 환경 전환 사례 공유-김동현, NDS Cloud Innovation Ce...농심 그룹 메가마트 : 온프레미스 Exadata의 AWS 클라우드 환경 전환 사례 공유-김동현, NDS Cloud Innovation Ce...
농심 그룹 메가마트 : 온프레미스 Exadata의 AWS 클라우드 환경 전환 사례 공유-김동현, NDS Cloud Innovation Ce...Amazon Web Services Korea
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineeringThang Bui (Bob)
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 

What's hot (20)

Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lake
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
How to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
An Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and KibanaAn Intro to Elasticsearch and Kibana
An Intro to Elasticsearch and Kibana
 
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
Spark 의 핵심은 무엇인가? RDD! (RDD paper review)
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
JSON Data Parsing in Snowflake (By Faysal Shaarani)
JSON Data Parsing in Snowflake (By Faysal Shaarani)JSON Data Parsing in Snowflake (By Faysal Shaarani)
JSON Data Parsing in Snowflake (By Faysal Shaarani)
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon Redshift로 데이터웨어하우스(DW) 구축하기Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon Redshift로 데이터웨어하우스(DW) 구축하기
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
 
농심 그룹 메가마트 : 온프레미스 Exadata의 AWS 클라우드 환경 전환 사례 공유-김동현, NDS Cloud Innovation Ce...
농심 그룹 메가마트 : 온프레미스 Exadata의 AWS 클라우드 환경 전환 사례 공유-김동현, NDS Cloud Innovation Ce...농심 그룹 메가마트 : 온프레미스 Exadata의 AWS 클라우드 환경 전환 사례 공유-김동현, NDS Cloud Innovation Ce...
농심 그룹 메가마트 : 온프레미스 Exadata의 AWS 클라우드 환경 전환 사례 공유-김동현, NDS Cloud Innovation Ce...
 
Demystifying data engineering
Demystifying data engineeringDemystifying data engineering
Demystifying data engineering
 
Presto
PrestoPresto
Presto
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 

Similar to Balkan - data eng meetup - data fusion

What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupKaxil Naik
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Kaxil Naik
 
Kotlin REST & GraphQL API
Kotlin REST & GraphQL APIKotlin REST & GraphQL API
Kotlin REST & GraphQL APISean O'Brien
 
GraphQL Bangkok Meetup 6.0
GraphQL Bangkok Meetup 6.0GraphQL Bangkok Meetup 6.0
GraphQL Bangkok Meetup 6.0Tobias Meixner
 
Collaborative environment with data science notebook
Collaborative environment with data science notebook Collaborative environment with data science notebook
Collaborative environment with data science notebook Moon Soo Lee
 
Scaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataScaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataWSO2
 
SCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingSCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingStanislav Osipov
 
Openbar Kontich // Google Cloud: past, present and the (oh so sweet) future b...
Openbar Kontich // Google Cloud: past, present and the (oh so sweet) future b...Openbar Kontich // Google Cloud: past, present and the (oh so sweet) future b...
Openbar Kontich // Google Cloud: past, present and the (oh so sweet) future b...Openbar
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Waysmalltown
 
Using FME to Transform and Integrate Optical Connection Data Between Systems
Using FME to Transform and Integrate Optical Connection Data Between SystemsUsing FME to Transform and Integrate Optical Connection Data Between Systems
Using FME to Transform and Integrate Optical Connection Data Between SystemsSafe Software
 
Logging in The World of DevOps
Logging in The World of DevOps Logging in The World of DevOps
Logging in The World of DevOps DevOps Indonesia
 
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Mariano Gonzalez
 
Introduction to serverless computing on Google Cloud
Introduction to serverless computing on Google CloudIntroduction to serverless computing on Google Cloud
Introduction to serverless computing on Google Cloudwesley chun
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 

Similar to Balkan - data eng meetup - data fusion (20)

What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow MeetupWhat's coming in Airflow 2.0? - NYC Apache Airflow Meetup
What's coming in Airflow 2.0? - NYC Apache Airflow Meetup
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
 
Next.js with drupal, the good parts
Next.js with drupal, the good partsNext.js with drupal, the good parts
Next.js with drupal, the good parts
 
Kotlin REST & GraphQL API
Kotlin REST & GraphQL APIKotlin REST & GraphQL API
Kotlin REST & GraphQL API
 
GraphQL Bangkok Meetup 6.0
GraphQL Bangkok Meetup 6.0GraphQL Bangkok Meetup 6.0
GraphQL Bangkok Meetup 6.0
 
Collaborative environment with data science notebook
Collaborative environment with data science notebook Collaborative environment with data science notebook
Collaborative environment with data science notebook
 
Why Go Lang?
Why Go Lang?Why Go Lang?
Why Go Lang?
 
Scaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of dataScaling up wso2 bam for billions of requests and terabytes of data
Scaling up wso2 bam for billions of requests and terabytes of data
 
SCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scalingSCM Puppet: from an intro to the scaling
SCM Puppet: from an intro to the scaling
 
Grafana 7.0
Grafana 7.0Grafana 7.0
Grafana 7.0
 
Openbar Kontich // Google Cloud: past, present and the (oh so sweet) future b...
Openbar Kontich // Google Cloud: past, present and the (oh so sweet) future b...Openbar Kontich // Google Cloud: past, present and the (oh so sweet) future b...
Openbar Kontich // Google Cloud: past, present and the (oh so sweet) future b...
 
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps WayDevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way
 
Using FME to Transform and Integrate Optical Connection Data Between Systems
Using FME to Transform and Integrate Optical Connection Data Between SystemsUsing FME to Transform and Integrate Optical Connection Data Between Systems
Using FME to Transform and Integrate Optical Connection Data Between Systems
 
Dataflow.pptx
Dataflow.pptxDataflow.pptx
Dataflow.pptx
 
Logging in The World of DevOps
Logging in The World of DevOps Logging in The World of DevOps
Logging in The World of DevOps
 
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
 
Presto@Uber
Presto@UberPresto@Uber
Presto@Uber
 
Introduction to serverless computing on Google Cloud
Introduction to serverless computing on Google CloudIntroduction to serverless computing on Google Cloud
Introduction to serverless computing on Google Cloud
 
Are we there yet?
Are we there yet?Are we there yet?
Are we there yet?
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 

Balkan - data eng meetup - data fusion

  • 1. Google Cloud Data Fusion Drag & drop data pipelines
  • 2. Balkan Misirli Data Engineer @ Data Runs Deep ● Web Analytics (GA360) agency ● Google Cloud consulting partner ● Lots of BigQuery/Dataflow/Cloud Functions
  • 3. Agenda (about 20 mins) • What is Data Fusion? • How does it compare? • Demo • Pricing + other details • My first impressions • Questions
  • 4. A bit of background ● Data startup Cask developed open source software CDAP (Cask Data App Platform) ● Google bought Cask last year ● GCP beta released Data Fusion as a managed CDAP service last month
  • 5. What is Data Fusion / CDAP ? ● A set of tools to wrangle/explore data and create pipelines ● Completely drag & drop interface (no coding) ● Enables sharing of created pipelines within organisation
  • 6. How does it run pipelines? ● Converts GUI input into a DAG to run as a Dataproc job ● Ephemeral Hadoop MR/Spark cluster ● Can also run on existing cluster (Terraform) ● Soon to be available for Dataflow execution ● All of this runs on GKE in the back end ● No AUS in-country option yet
  • 7. Batch or streaming pipelines? ● Only batch for Basic edition ● Both batch and streaming for Enterprise edition ● Batch jobs run either Hadoop MR or Spark ● Streaming jobs run Spark Streaming
  • 9. Hub - Library of existing pipelines / plugins
  • 10. Dashboard - shows all jobs that ran recently
  • 11. Upload your own plugins / drivers / libraries
  • 12. Wrangler - explore your data
  • 14. My first impressions ● Instance creation takes up to 30 mins - slow! ● Hadoop execution is slow ● Web UI is pretty decent and intuitive ● Good (but maybe excessive) logging capability ● Quirky beta style errors ● Will definitely save labour hours
  • 15. The good parts ● Pretty intuitive and easy ● Somewhat configurable (Env/CPUs/placeholder vars, etc) ● Stackdriver logging and monitoring available ● Open source, can import/export CDAP jobs - no vendor lock in ● Maybe cheaper than other enterprise alternatives ● Don’t have to operate your own Spark cluster!
  • 16. The parts that have an exciting journey of improvement ahead! ● PERMISSIONS! ● Wrangler only shows first 1000 rows - can be misleading when filters/aggregations applied ● Doesn’t do input validation until runtime - annoying ● Java error stacktraces for a GUI based tool
  • 17. Random. But at least it looks nice
  • 18. Thorough Java stacktraces - perfect for GUI users!
  • 19. Basic vs. Enterprise Enterprise Only ● Streaming ● Can run in production ● Data lineage tool ● Choice of execution env ● Schedules & Triggers ● Unlimited simultaneous pipeline execution Both Editions ● Batch ● Can run in Dev/Sandbox ● Unlimited users ● Wrangler tool ● Visual pipeline builder ● (Basic) limit of 2 simultaneous pipelines
  • 20. Pricing ● Priced in two parts: pipeline development + execution ● Development is USD $1.80 per hour (Basic) or USD $4.20 per hour (Enterprise), billed by the minute ● First 120 hours of development on Basic edition is free ● Roughly $1100 per month for Basic, $3000 for Enterprise ● Execution is priced according to Dataproc VM pricing
  • 21. Thanks ! I’ll share the slides on Linkedin SlideShare Linkedin: linkedin.com/in/balkanmisirli Email: balkan@datarunsdeep.com.au