2. Balkan Misirli
Data Engineer @ Data Runs Deep
● Web Analytics (GA360) agency
● Google Cloud consulting partner
● Lots of BigQuery/Dataflow/Cloud Functions
3. Agenda (about 20 mins)
• What is Data Fusion?
• How does it compare?
• Demo
• Pricing + other details
• My first impressions
• Questions
4. A bit of background
● Data startup Cask developed open source
software CDAP (Cask Data App Platform)
● Google bought Cask last year
● Google Cloud released Data Fusion in beta as a
managed CDAP service last month
5. What is Data Fusion / CDAP ?
● A set of tools to
wrangle/explore data and
create pipelines
● Completely drag & drop
interface (no coding)
● Enables sharing of created
pipelines within an organisation
6. How does it run pipelines?
● Converts GUI input into a DAG
to run as a Dataproc job
● Ephemeral Hadoop MapReduce/Spark cluster
● Can also run on an existing cluster (via Terraform)
● Soon to be available for Dataflow execution
● All of this runs on GKE in the back end
● No AUS in-country option yet
7. Batch or streaming pipelines?
● Only batch for Basic edition
● Both batch and streaming for Enterprise edition
● Batch jobs run on Hadoop MapReduce or Spark
● Streaming jobs run on Spark Streaming
14. My first impressions
● Instance creation takes up to 30 mins - slow!
● Hadoop execution is slow
● Web UI is pretty decent and intuitive
● Good (but maybe excessive) logging capability
● Quirky beta-style errors
● Will definitely save labour hours
15. The good parts
● Pretty intuitive and easy
● Somewhat configurable (env, CPUs, placeholder vars, etc.)
● Stackdriver logging and monitoring available
● Open source, can import/export CDAP jobs - no vendor lock-in
● Maybe cheaper than other enterprise alternatives
● Don’t have to operate your own Spark cluster!
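To illustrate the no-lock-in point: a deployed pipeline is just a CDAP app, so its JSON config can be fetched over the CDAP REST API and re-imported elsewhere. A minimal sketch; the `/api/v3/...` path layout, host name, and pipeline name here are assumptions based on the open-source CDAP docs, not verified against a live Data Fusion instance.

```python
# Hedged sketch: fetch a deployed pipeline's app detail (including its
# JSON config) from the CDAP REST API behind a Data Fusion instance.
# The endpoint layout is an assumption from open-source CDAP documentation.
import json
import urllib.request

def app_detail_url(host: str, pipeline: str, namespace: str = "default") -> str:
    # Data Fusion fronts the CDAP v3 REST API under /api on the instance host
    return f"https://{host}/api/v3/namespaces/{namespace}/apps/{pipeline}"

def export_pipeline(host: str, pipeline: str, token: str) -> dict:
    # An OAuth bearer token (e.g. from gcloud auth) is assumed here
    req = urllib.request.Request(
        app_detail_url(host, pipeline),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The returned JSON can be saved and imported into any CDAP deployment, managed or self-hosted.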
16. The parts that have an exciting
journey of improvement ahead!
● PERMISSIONS!
● Wrangler only shows the first 1000 rows - can be misleading
once filters/aggregations are applied
● Doesn’t do input validation until runtime - annoying
● Java error stack traces for a GUI-based tool
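The Wrangler caveat above is easy to demonstrate: an aggregate computed over the first 1000 rows need not resemble the aggregate over the whole dataset. A toy sketch with made-up numbers:

```python
# Toy illustration of why a first-1000-rows preview can mislead:
# on a sorted column, the sample mean is far from the true mean.
values = list(range(10_000))              # pretend column: 0..9999, sorted

sample_mean = sum(values[:1000]) / 1000   # what a 1000-row preview shows
full_mean = sum(values) / len(values)     # what the full pipeline computes

print(sample_mean, full_mean)             # 499.5 vs 4999.5
```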
19. Basic vs. Enterprise
Enterprise Only
● Streaming
● Can run in production
● Data lineage tool
● Choice of execution env
● Schedules & Triggers
● Unlimited simultaneous pipeline execution
Both Editions
● Batch
● Can run in Dev/Sandbox
● Unlimited users
● Wrangler tool
● Visual pipeline builder
● (Basic) limit of 2 simultaneous pipelines
20. Pricing
● Priced in two parts: pipeline development + execution
● Development is USD $1.80 per hour (Basic) or
USD $4.20 per hour (Enterprise), billed by the minute
● First 120 hours of development on the Basic edition are free
● Roughly $1100 per month for Basic, $3000 for Enterprise
● Execution is priced according to Dataproc VM pricing
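The monthly figures reconcile with the hourly rates if you assume GCP's usual ~730-hour billing month and subtract Basic's free tier. A back-of-envelope check (the 730-hour month and the free hours applying within the month are assumptions):

```python
# Back-of-envelope check of the monthly figures from the hourly rates.
BASIC_RATE = 1.80        # USD per instance-hour, Basic
ENTERPRISE_RATE = 4.20   # USD per instance-hour, Enterprise
HOURS_PER_MONTH = 730    # GCP's usual billing month (24 * 365 / 12)
FREE_BASIC_HOURS = 120   # first 120 development hours free on Basic

basic_monthly = (HOURS_PER_MONTH - FREE_BASIC_HOURS) * BASIC_RATE
enterprise_monthly = HOURS_PER_MONTH * ENTERPRISE_RATE

print(round(basic_monthly), round(enterprise_monthly))  # ~1098 and ~3066
```

That lands close to the quoted "roughly $1100 / $3000 per month", with Dataproc VM costs for execution on top.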
21. Thanks !
I’ll share the slides on LinkedIn SlideShare
LinkedIn: linkedin.com/in/balkanmisirli
Email: balkan@datarunsdeep.com.au