2. Balkan Misirli
Data Engineer @ Data Runs Deep
● Web Analytics (GA360) agency
● Google Cloud consulting partner
● Lots of BigQuery/Dataflow/Cloud Functions
3. Agenda (about 20 mins)
• What is Data Fusion?
• How does it compare?
• Demo
• Pricing + other details
• My first impressions
• Questions
4. A bit of background
● Data startup Cask developed open source
software CDAP (Cask Data App Platform)
● Google bought Cask last year
● Google Cloud released Data Fusion in beta as a
managed CDAP service last month
5. What is Data Fusion / CDAP ?
● A set of tools to
wrangle/explore data and
create pipelines
● Completely drag & drop
interface (no coding)
● Enables sharing of created
pipelines within an organisation
6. How does it run pipelines?
● Converts GUI input into a DAG
to run as a Dataproc job
● Ephemeral Hadoop MapReduce/Spark cluster
● Can also run on an existing cluster (via Terraform)
● Soon to be available for Dataflow execution
● All of this runs on GKE in the back end
● No AUS in-country option yet
7. Batch or streaming pipelines?
● Only batch for Basic edition
● Both batch and streaming for Enterprise edition
● Batch jobs run on Hadoop MapReduce or Spark
● Streaming jobs run on Spark Streaming
14. My first impressions
● Instance creation takes up to 30 mins - slow!
● Hadoop execution is slow
● Web UI is pretty decent and intuitive
● Good (but maybe excessive) logging capability
● Quirky beta-style errors
● Will definitely save labour hours
15. The good parts
● Pretty intuitive and easy
● Somewhat configurable (env, CPUs, placeholder vars, etc.)
● Stackdriver logging and monitoring available
● Open source, can import/export CDAP jobs - no vendor lock-in
● Maybe cheaper than other enterprise alternatives
● Don’t have to operate your own Spark cluster!
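To illustrate the no-lock-in point: a deployed pipeline is just a CDAP app, so its JSON config can be fetched over the CDAP REST API and re-imported elsewhere. A minimal sketch; the `/api/v3/...` path layout, host name, and pipeline name here are assumptions based on the open-source CDAP docs, not verified against a live Data Fusion instance.

```python
# Hedged sketch: fetch a deployed pipeline's app detail (including its
# JSON config) from the CDAP REST API behind a Data Fusion instance.
# The endpoint layout is an assumption from open-source CDAP documentation.
import json
import urllib.request

def app_detail_url(host: str, pipeline: str, namespace: str = "default") -> str:
    # Data Fusion fronts the CDAP v3 REST API under /api on the instance host
    return f"https://{host}/api/v3/namespaces/{namespace}/apps/{pipeline}"

def export_pipeline(host: str, pipeline: str, token: str) -> dict:
    # An OAuth bearer token (e.g. from gcloud auth) is assumed here
    req = urllib.request.Request(
        app_detail_url(host, pipeline),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The returned JSON can be saved and imported into any CDAP deployment, managed or self-hosted.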
16. The parts that have an exciting
journey of improvement ahead!
● PERMISSIONS!
● Wrangler only shows the first 1000 rows - can be misleading
once filters/aggregations are applied
● Doesn’t do input validation until runtime - annoying
● Java error stack traces for a GUI-based tool
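The Wrangler caveat above is easy to demonstrate: an aggregate computed over the first 1000 rows need not resemble the aggregate over the whole dataset. A toy sketch with made-up numbers:

```python
# Toy illustration of why a first-1000-rows preview can mislead:
# on a sorted column, the sample mean is far from the true mean.
values = list(range(10_000))              # pretend column: 0..9999, sorted

sample_mean = sum(values[:1000]) / 1000   # what a 1000-row preview shows
full_mean = sum(values) / len(values)     # what the full pipeline computes

print(sample_mean, full_mean)             # 499.5 vs 4999.5
```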
19. Basic vs. Enterprise
Enterprise Only
● Streaming
● Can run in production
● Data lineage tool
● Choice of execution env
● Schedules & Triggers
● Unlimited simultaneous pipeline execution
Both Editions
● Batch
● Can run in Dev/Sandbox
● Unlimited users
● Wrangler tool
● Visual pipeline builder
● (Basic) limit of 2 simultaneous pipelines
20. Pricing
● Priced in two parts: pipeline development + execution
● Development is USD $1.80 per hour (Basic) or
USD $4.20 per hour (Enterprise), billed by the minute
● First 120 hours of development on the Basic edition are free
● Roughly $1100 per month for Basic, $3000 for Enterprise
● Execution is priced according to Dataproc VM pricing
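The monthly figures reconcile with the hourly rates if you assume GCP's usual ~730-hour billing month and subtract Basic's free tier. A back-of-envelope check (the 730-hour month and the free hours applying within the month are assumptions):

```python
# Back-of-envelope check of the monthly figures from the hourly rates.
BASIC_RATE = 1.80        # USD per instance-hour, Basic
ENTERPRISE_RATE = 4.20   # USD per instance-hour, Enterprise
HOURS_PER_MONTH = 730    # GCP's usual billing month (24 * 365 / 12)
FREE_BASIC_HOURS = 120   # first 120 development hours free on Basic

basic_monthly = (HOURS_PER_MONTH - FREE_BASIC_HOURS) * BASIC_RATE
enterprise_monthly = HOURS_PER_MONTH * ENTERPRISE_RATE

print(round(basic_monthly), round(enterprise_monthly))  # ~1098 and ~3066
```

That lands close to the quoted "roughly $1100 / $3000 per month", with Dataproc VM costs for execution on top.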
21. Thanks !
I’ll share the slides on LinkedIn SlideShare
LinkedIn: linkedin.com/in/balkanmisirli
Email: balkan@datarunsdeep.com.au