3.
Goals for a Data Lake
• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data
4.
What’s Needed For A Hadoop Data Lake?
• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
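The SLA-tracking requirement above can be sketched as a simple check over per-feed run records. This is a minimal illustration, not any real tool's API: the `runs` records, the two-hour window, and the alert labels are all assumptions; in a real data lake these records would come from the pipeline's metadata store.

```python
from datetime import datetime, timedelta

# Hypothetical feed-run records (illustrative only).
runs = [
    {"feed": "orders", "started": datetime(2023, 1, 1, 2, 0),
     "finished": datetime(2023, 1, 1, 2, 45), "status": "SUCCESS"},
    {"feed": "clicks", "started": datetime(2023, 1, 1, 2, 0),
     "finished": datetime(2023, 1, 1, 4, 30), "status": "SUCCESS"},
    {"feed": "users", "started": datetime(2023, 1, 1, 2, 0),
     "finished": datetime(2023, 1, 1, 2, 10), "status": "FAILED"},
]

SLA = timedelta(hours=2)  # assumed per-feed completion window

def sla_violations(runs, sla):
    """Return (feed, reason) pairs for runs that failed or ran too long."""
    alerts = []
    for run in runs:
        duration = run["finished"] - run["started"]
        if run["status"] != "SUCCESS":
            alerts.append((run["feed"], "failure"))
        elif duration > sla:
            alerts.append((run["feed"], "sla-exceeded"))
    return alerts

print(sla_violations(runs, SLA))
# -> [('clicks', 'sla-exceeded'), ('users', 'failure')]
```

An alerting layer would then route these pairs to email, paging, or a dashboard.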
5.
Example Ingestion Project
• 4000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data
• Ingest into Hadoop (minimally HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
Focus shifts over time from data ingestion to transformation, then to analytics.
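The incremental-versus-snapshot distinction above is typically driven by per-source metadata. A minimal sketch, assuming invented metadata fields (`type`, `watermark_column`, `last_watermark`) rather than any real tool's schema:

```python
# Per-source metadata deciding how each table is extracted (illustrative).
sources = {
    "orders":   {"type": "incremental", "watermark_column": "updated_at",
                 "last_watermark": "2023-01-01T00:00:00"},
    "products": {"type": "snapshot"},
}

def build_extract_query(table, meta):
    """Return a SQL extract statement based on the source's load type."""
    if meta["type"] == "incremental":
        # Pull only rows changed since the last recorded watermark.
        return (f"SELECT * FROM {table} "
                f"WHERE {meta['watermark_column']} > '{meta['last_watermark']}'")
    # Snapshot sources are fully re-extracted on every run.
    return f"SELECT * FROM {table}"

print(build_extract_query("orders", sources["orders"]))
print(build_extract_query("products", sources["products"]))
```

After a successful incremental run, the pipeline would advance `last_watermark` in the metadata store.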
11.
Ingest and Prepare
• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
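The data-validation step above can be illustrated as rule-based record checking. This is a plain-Python sketch (in a Spark job the same rules would run over DataFrame rows); the field names and rules are invented:

```python
# Per-field validation rules (illustrative).
rules = {
    "id":    lambda v: v is not None,
    "email": lambda v: isinstance(v, str) and "@" in v,
}

def validate(record, rules):
    """Return the names of fields that fail their rule."""
    return [field for field, ok in rules.items() if not ok(record.get(field))]

good = {"id": 1, "email": "a@example.com"}
bad  = {"id": None, "email": "not-an-email"}
print(validate(good, rules))  # -> []
print(validate(bad, rules))   # -> ['id', 'email']
```

Records with a non-empty failure list would be routed to a quarantine table rather than the target Hive table.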
12.
Data Ingest Model
Sources:
• Extract Table (JDBC)
• Get File(s) (Filesystem)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)
Pipeline steps:
• Unpack and/or merge small files
• Put file (HDFS)
• Cleanse/Standardize (Spark)
• Validate (Spark)
• Data Profile (Spark)
• Merge/Dedupe (Hive)
• Index Text (Elasticsearch)
• Compress & Archive Originals (HDFS, S3)
Notes:
• Metadata and data policies determine the behavior of individual components
• Adds many Hadoop-specific higher-level NiFi processors
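The "metadata determines behavior" idea in the ingest model above can be sketched as a step registry driven by per-feed configuration. The step names, functions, and metadata shape here are all invented for illustration; the real pipeline orchestrates NiFi processors and Spark jobs, not Python functions:

```python
# Pipeline steps as plain functions (illustrative stand-ins).
def cleanse(data):  return [row.strip() for row in data]
def dedupe(data):   return list(dict.fromkeys(data))  # keeps first occurrence
def profile(data):  return data  # a real step would record column statistics

STEPS = {"cleanse": cleanse, "dedupe": dedupe, "profile": profile}

# Per-feed metadata selects which components run, and in what order.
feed_metadata = {"steps": ["cleanse", "dedupe"]}

def run_feed(data, metadata):
    for name in metadata["steps"]:
        data = STEPS[name](data)
    return data

print(run_feed([" a", "a", "b "], feed_metadata))  # -> ['a', 'b']
```

Changing a feed's behavior then means editing its metadata record, not its pipeline code.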
13.
Data self-service and “wrangle”
• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark
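A graphical wrangling UI of the kind described above typically generates a chain of transforms under the hood. A minimal sketch of that chained style; the `Wrangler` class and its methods are invented for illustration (a real implementation would compile the chain to Spark operations):

```python
class Wrangler:
    """Toy chained-transform API over a list of row dicts (illustrative)."""
    def __init__(self, rows):
        self.rows = rows

    def filter(self, pred):
        return Wrangler([r for r in self.rows if pred(r)])

    def with_column(self, name, fn):
        return Wrangler([{**r, name: fn(r)} for r in self.rows])

    def collect(self):
        return self.rows

rows = [{"price": 10, "qty": 2}, {"price": 5, "qty": 0}]
result = (Wrangler(rows)
          .filter(lambda r: r["qty"] > 0)              # drop empty orders
          .with_column("total", lambda r: r["price"] * r["qty"])
          .collect())
print(result)  # -> [{'price': 10, 'qty': 2, 'total': 20}]
```

Each UI action appends one link to the chain, which is why the transform script can be published and scheduled as a unit.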
16.
ElasticSearch – Full Text Indexing
• Powerful search capabilities for users against data (think Google-like searching)
• NiFi processor extracts source data from Hadoop table for indexing in ElasticSearch
• Incremental updates during ingest
[Diagram: Data Lake → `select id,user,tweet from twitter_feed` → extract JSON → ElasticSearch]
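The flow sketched in the diagram above (rows selected from the data lake, shaped as JSON, indexed) can be illustrated by building Elasticsearch `_bulk` API lines from query results. The sample rows and index name are assumptions; the action/document line format is the real bulk-API shape:

```python
import json

# Rows as they might come back from
# `select id,user,tweet from twitter_feed` (sample data, invented).
rows = [
    {"id": 1, "user": "alice", "tweet": "hello data lake"},
    {"id": 2, "user": "bob",   "tweet": "spark and nifi"},
]

def to_bulk_lines(rows, index="twitter_feed"):
    """Yield action/document line pairs for the Elasticsearch _bulk API."""
    for row in rows:
        yield json.dumps({"index": {"_index": index, "_id": row["id"]}})
        yield json.dumps(row)

lines = list(to_bulk_lines(rows))
print(lines[0])  # action line for the first document
print(lines[1])  # the document itself
```

Using the row `id` as the document `_id` makes re-indexing idempotent, which is what allows the incremental updates during ingest mentioned above.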