Real-Time Data Flows
with Apache NiFi
June 2016
Manish Gupta
1. Data Flow Challenges in an Enterprise
2. Introduction to Apache NiFi
3. Core Features
4. Architecture
5. Demo – Simple Lambda Architecture
6. Use Cases
7. Q & A
Data Flow Challenges in an Enterprise
Connected Enterprises in a Distributed World
Evolution of Data Projects in an Enterprise
1. Simple VB/C++ applications working directly on spreadsheet / Access databases.
2. OLTP applications (mostly web) using an RDBMS at the backend, with some operational reporting capabilities.
3. Multiple OLTP applications exchanging data among themselves and attempting BI reporting, but failing miserably to provide a single version of truth.
4. Emergence of EDW, ETL, BI, OLAP, MDM, and data governance. Issues in scalability, availability, maintainability, performance, etc. started showing up at a totally different level.
5. Emergence of MPP data warehouses, grid computing, and self-service BI tools and technology.
6. Big Data: cloud/hybrid architecture, real-time analytics / stream processing, document indexing/search, log capture & analysis, data virtualization, NoSQL (key-value, document-oriented, column-family, graph), in-memory databases.
Application Architecture Pattern: Silos
[Diagram: siloed integration patterns, one per workload —
• OLTP workload: apps & services on an RDBMS, alongside spreadsheets / Access.
• Caches & derived stores: apps polling an RDBMS for changes into a cache.
• Log analytics: apps writing logs to NFS, aggregated (e.g. via Flume) into Hadoop and transformed; or apps shipping logs to Splunk.
• ODS/EDW: RDBMS replicated via Data Guard / GoldenGate / log shipping etc. into an ODS, then ETL/ELT into the EDW; CSV dumps / Sqoop into Hadoop.
• NoSQL: apps feeding a NoSQL / key-value store.
• Streaming: an app publishing to ActiveMQ/Kafka, a stream processor writing to a NoSQL / key-value store; cloud-hosted apps on the side.]
And we end up in a “Giant Hairball” Architecture…
[Diagram: the “giant hairball” — all of the above patterns tangled together: apps & services on an RDBMS (OLTP workload); caches polling for changes; ODS replication via Data Guard / GoldenGate / log shipping etc.; ETL/ELT into the EDW and Hadoop; CSV dumps / Sqoop; NFS logs aggregated via Flume and transformed; Splunk; spreadsheets / Access; NoSQL / key-value stores with ETL load/refresh; ActiveMQ; cloud apps; stream processors; MemSQL; Flume and Logstash feeding Elastic/Solr; REST endpoints; a graph DB.]
Data Ingestion Frameworks (excluding Traditional ETL Tools)
Amazon Kinesis, Apache Chukwa, Apache Flume, Apache Kafka, Apache Sqoop, Cloudera Morphlines, Facebook Scribe, Fluentd, Google Photon, Mozilla Heka (“Hadoop in, Hadoop out”), Twitter Kestrel, LinkedIn Databus, Elastic Logstash, Netflix Suro, LinkedIn Gobblin, InfoSphere Streams, Apache Beam
- Purpose-built (not designed with universal applicability)
- Induces a lot of complexity in project architecture
- Hard to extend
Common Data Integration Problems
Size and Velocity
Messages in Streaming manner
Tiny to small files in micro-batches.
Small files in mini-batches.
Medium to large files in batches.
Formats
CSV, TSV, PSV, TEXT, JSON, XML, XLS, XLSX, PDF, BMP, PRN
Avro, Protocol Buffer, Parquet, RC, ORC, Sequence File
Zip, GZIP, TAR, LZIP, 7z, RAR
Mediums
File Share, FTP, REST, HTTP, TCP, UDP
Schedule
Once, Every day, every hour, every minute, every second,
continuous.
Mode
 Push / Pull / Poll
Asynchronous Operation Challenges
 Fast edge consumers + slow processors = everything breaks
 Process Message A first, all others can take a backseat.
Security
 Data should be secure – not just at rest, but in motion too.
Miscellaneous
 Can you route a copy of this to our NoSQL store as well, after converting it to JSON?
 Ability to recover from failure (checkpoint / rerun / replay)
 Merge small files to large files for Hadoop
 Break large files into smaller manageable chunks for NoSQL
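The last two bullets above can be sketched in a few lines. This is an illustrative Python sketch of the "merge small files" pattern (in NiFi the MergeContent processor handles this); the `merge` function and `min_size` threshold are hypothetical names, not NiFi API:

```python
import io

# Bundle many small payloads into fewer large objects before landing
# them in Hadoop, which handles a few big files far better than many tiny ones.
def merge(payloads, min_size=1024):
    buf, bundles = io.BytesIO(), []
    for p in payloads:
        buf.write(p)
        if buf.tell() >= min_size:       # bundle is big enough: flush it
            bundles.append(buf.getvalue())
            buf = io.BytesIO()
    if buf.tell():
        bundles.append(buf.getvalue())   # flush the final partial bundle
    return bundles

small = [b"x" * 300 for _ in range(10)]  # ten 300-byte "files"
out = merge(small, min_size=1024)
print([len(b) for b in out])             # [1200, 1200, 600]
```

The inverse (breaking large files into manageable chunks for a NoSQL store) is the same loop run the other way: slice one large payload at a fixed chunk size.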
Introduction
What is Apache NiFi, its history, and some terminology.
What is Apache NiFi?
NiFi (short for “Niagara Files”) is a powerful, enterprise-grade dataflow tool that
can collect, route, enrich, transform, and process data in a scalable manner.
NiFi is based on the concepts of flow-based programming (FBP). FBP is a
programming paradigm that defines applications as networks of "black box"
processes, which exchange data across predefined connections by message
passing, where the connections are specified externally to the processes.
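The FBP idea can be shown in miniature. This is a minimal, illustrative Python sketch (not NiFi code): a process is a black box reading from and writing to queues, and the connections are wired up externally, exactly as the paradigm prescribes:

```python
from queue import Queue

# A "black box" process: it only sees its input and output queues.
def uppercase(inbox: Queue, outbox: Queue) -> None:
    while not inbox.empty():
        outbox.put(inbox.get().upper())

# Connections are specified externally to the process.
source = Queue()   # connection: source -> uppercase
sink = Queue()     # connection: uppercase -> sink

for msg in ["hello", "world"]:
    source.put(msg)

uppercase(source, sink)
results = [sink.get() for _ in range(sink.qsize())]
print(results)     # ['HELLO', 'WORLD']
```

Because the process never knows what sits on the other end of its queues, the same black box can be rewired into any flow — which is what NiFi's drag-and-drop canvas does at scale.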
Single combined platform for
 Data acquisition
 Simple event processing
 Transport and delivery
 Designed to accommodate highly diverse and
complicated dataflows
• It has a visual command-and-control interface that allows you to define and
manipulate data flows in real time and with great agility.
Short History
 Developed by the National Security Agency (NSA) over more than 8 years
 Open-sourced in Nov 2014 (Apache). Major contributors were ex-NSA engineers
who formed a company named Onyara. Lead: Joe Witt.
 Became an Apache Top-Level Project in July 2015
 In August 2015, Hortonworks acquired Onyara
 In September 2015, Hortonworks released HDF 1.0, powered by NiFi.
Current version is HDF 1.2
 HDF has the solid backing of Hortonworks.
Terminology
FlowFile (Information Packet)
Unit of data (each object) moving through the system
Content + Attributes (key/value pairs)
Processor (Black Box)
Performs the work, can access FlowFiles
Currently there are 135 different processors
Connection (Bounded Buffer)
Links between processors
Queues that can be dynamically prioritized
Process Group (Subnet)
Set of processors and their connections
Receive data via input ports, send data via output ports
Flow Controller (Scheduler)
Maintains the knowledge of how processes are connected and
manages the threads and their allocation.
FlowFile anatomy: a header (UUID, name, size, in-time, attribute map) plus content.
FlowFiles can carry any type (events, files, objects, messages, etc.), format (JSON, XML,
Avro, text, proprietary, etc.), and size (bytes to GBs).
Processor capabilities: routing (context/content), transformation (enrich, filter,
convert, split, aggregate, custom), mediation (push / pull), scheduling.
Connection capabilities: queuing, back-pressure, expiration, prioritization.
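The FlowFile model above (content plus a key/value attribute map, with a few core attributes always present) can be sketched as a small data structure. This is an illustrative Python model, not NiFi's Java implementation; attribute names like `uuid` and `fileSize` mirror the header fields listed above:

```python
import time
import uuid
from dataclasses import dataclass, field

# Minimal model of a FlowFile: content bytes + key/value attributes.
@dataclass
class FlowFile:
    content: bytes
    attributes: dict = field(default_factory=dict)

    def __post_init__(self):
        # Core header fields every FlowFile carries
        self.attributes.setdefault("uuid", str(uuid.uuid4()))
        self.attributes.setdefault("entryDate", str(int(time.time() * 1000)))
        self.attributes.setdefault("fileSize", str(len(self.content)))

ff = FlowFile(b'{"user": "alice"}', {"filename": "event.json"})
print(ff.attributes["fileSize"])   # 17
```

Processors route and transform by reading these attributes (cheap, always in memory) and only touch the content itself when they must — which is what makes attribute-based routing fast.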
Apache NiFi is not
• NiFi is not a distributed computation engine.
• It is not an engine for CEP (complex event processing).
• It is not a computational framework for distributed joins or rolling-window aggregations the way
Spark/Storm/Flink are.
• Hence it is not based on MapReduce, Spark, or any other such framework.
• NiFi has no dependency on big data tools like Hadoop or ZooKeeper; all it needs is Java.
• It is not a full-fledged ETL tool like Informatica / Pentaho / Talend / SSIS as of now. But it will be, eventually.
• It is not a long-term data storage tool. It holds data only temporarily, for re-run / data provenance purposes.
• It is not a document indexer. Its indexing capabilities exist only to help with troubleshooting / debugging.
Core Features
What are the core features and benefits of Apache NiFi?
Guaranteed Data Delivery
• Even at very high scale, delivery is guaranteed
• A persistent write-ahead log (FlowFile Repository) and data partitioning
(Content Repository) ensure this. Together they are designed to allow:
• Very high transaction rates
• Effective load spreading
• A copy-on-write scheme (for every change in data)
• Pass-by-reference
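The write-ahead-log idea behind the FlowFile Repository can be sketched generically. This is an illustrative Python sketch of WAL semantics, not NiFi's actual repository code; `update` and `replay` are hypothetical names:

```python
import json
import os
import tempfile

# WAL rule: record the intended state change durably *before* applying it,
# so a crash mid-update can be recovered by replaying the log on restart.
log_path = os.path.join(tempfile.mkdtemp(), "flowfile-repo.wal")
state = {}

def update(key, value):
    with open(log_path, "a") as wal:
        wal.write(json.dumps({"key": key, "value": value}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())   # durable on disk before we proceed
    state[key] = value           # apply only after the log entry is safe

def replay():
    recovered = {}
    with open(log_path) as wal:
        for line in wal:
            rec = json.loads(line)
            recovered[rec["key"]] = rec["value"]
    return recovered

update("ff-1", "queued")
update("ff-1", "processing")
print(replay())                  # {'ff-1': 'processing'}
```

Replay always converges on the last durably logged state, which is exactly the property that lets NiFi promise delivery even across restarts.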
Data Buffering w/ Back Pressure and Pressure Release
• Supports buffering of all queued data.
• Ability to apply back-pressure (even without load balancing, nodes can say “back off”
and other nodes in the pipeline pick up the slack).
• When back-pressure is applied to a connection, the processor that is the
source of the connection stops being scheduled to run until the queue clears out.
However, data will still queue up in that processor's incoming connections.
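The mechanic described above — a full connection causing the upstream processor to be held back — can be modeled with a bounded queue. This is an illustrative Python sketch, not NiFi code; the threshold of 3 and the function names are made up:

```python
from queue import Full, Queue

# A connection with a back-pressure object threshold of 3.
connection = Queue(maxsize=3)

def try_schedule_source(item) -> bool:
    """Run the upstream processor once, if the connection has room."""
    try:
        connection.put_nowait(item)
        return True          # source ran and queued an item
    except Full:
        return False         # back-pressure: source is not scheduled

sent = [try_schedule_source(i) for i in range(5)]
print(sent)                  # [True, True, True, False, False]

connection.get()             # downstream consumer drains the queue...
connection.get()
print(try_schedule_source(99))   # True — source is scheduled again
```

The held-back items don't vanish; in a real flow they simply accumulate one connection further upstream, which is how pressure propagates back toward the edge.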
Prioritized Queuing
• NiFi allows the setting of one or more prioritization schemes for how data is
retrieved from a queue.
• Oldest First, Newest first, Largest first, Smallest First, or custom scheme
• The default is oldest first
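A prioritizer is just an ordering rule applied when a processor pulls from the queue. This is an illustrative Python sketch using a heap (not NiFi's implementation); the tie-breaking counter keeps equal-priority items in arrival order, which is what makes "oldest first" the natural default:

```python
import heapq
import itertools

counter = itertools.count()   # arrival order; sole key for oldest-first

def enqueue(q, item, key):
    # (priority, arrival#, item): arrival# breaks ties FIFO-style
    heapq.heappush(q, (key(item), next(counter), item))

def dequeue(q):
    return heapq.heappop(q)[2]

files = [{"name": "a", "size": 500},
         {"name": "b", "size": 50},
         {"name": "c", "size": 5000}]

# "Smallest first" prioritizer: smallest content size is pulled first.
q = []
for f in files:
    enqueue(q, f, key=lambda f: f["size"])

order = [dequeue(q)["name"] for _ in range(3)]
print(order)   # ['b', 'a', 'c']
```

Swapping the `key` function (size, age, or a custom attribute) is all it takes to express the other schemes listed above.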
Designed for Extension
• NiFi by design is Highly Extensible.
• One can write custom:
 Processor
 Controller Service
 Reporting Tasks
 Prioritizer
 User Interface
• These extensions are bundled in what are called NAR files (NiFi
Archives).
Visual Interface for Command and Control
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections
Data Provenance (Not Just Lineage)
• View attributes and content at given points in time
(before and after each processor) !!!
• Records, indexes, and makes events available for
display
Benefits of Apache NiFi
Single data-source agnostic collection platform
Intuitive, real-time visual user interface with drag-and-drop capabilities
Powerful Data security capabilities from source to storage
Highly granular data sharing policies
Ability to react in real time by leveraging bi-directional data flows and prioritized data feeds
Extremely scalable, extensible platform
Architecture
High level architecture (single machine), Primary components
A single NiFi node runs within a JVM on a host and comprises:
• Web Server: hosts NiFi's HTTP-based command and control API.
• Flow Controller: the real brain. Provides and manages threads; scheduling.
• Extensions: Processors / Controller Services / Reporting Tasks / UI /
Prioritizers, all running within the JVM.
• FlowFile Repository: state of each FlowFile presently active in the flow. WAL.
• Content Repository: the actual content bytes of a given FlowFile, stored as
blocks of data in the file system; can span more than one file system
(partitions).
• Provenance Repository: all provenance event data, saved on the file system.
Indexed / searchable.
Demo
Simple λ Architecture
Demo (Simple λ Architecture using NiFi)
1. Start NiFi (or HDF). ElasticSearch, Kibana and MongoDB on Docker. Create HDFS destination table.
2. Explore NiFi UI
3. Pull data from twitter
4. Route & Deliver to ElasticSearch, Mongo and Hadoop in real time
5. Explore NiFi capabilities
6. Design Dashboard on real-time data
[Diagram: Twitter feeds NiFi, which routes to the speed layer (ElasticSearch, MongoDB),
the batch layer (Hadoop), and the serving layer, queried via Spark SQL with Kibana
dashboards; ElasticSearch, Kibana, and MongoDB run on Docker. WYSIWYG…!!!]
Use Cases
Some Scenarios
Some Use Cases
Building Ingestion and Delivery layers in IoT Solutions
Ingestion tier in Lambda Architecture (for feeding both speed and batch layers)
Ingestion tier in Data Lake Architectures
Cross Geography Data Replication in a secure manner
Integrating on premise system to on cloud system (Hybrid Cloud Architecture)
Simplifying existing Big Data architectures which are currently using Flume, Kafka, Logstash, Scribe etc. or
custom connectors.
Developing Edge nodes for Trade repositories.
Enterprise Data Integration platform
And many more…
Q & A
Reference
https://nifi.apache.org/
http://hortonworks.com/products/data-center/hdf/
https://github.com/apache/nifi
https://twitter.com/apachenifi
Thank You
@manishpedia
https://in.linkedin.com/in/manishgforce

 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Recently uploaded (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Real-Time Data Flows with Apache NiFi

  • 1. Real-Time Data Flows with Apache NiFi June 2016 Manish Gupta
  • 2. 1. Data Flow Challenges in an Enterprise 2. Introduction to Apache NiFi 3. Core Features 4. Architecture 5. Demo – Simple Lambda Architecture 6. Use Cases 7. Q & A
  • 3. Data Flow Challenges in an Enterprise Connected Enterprises in a Distributed World
  • 4. Evolution of Data Projects in an Enterprise Simple VB/C++ applications working directly on Spreadsheet / Access databases. OLTP applications (mostly web) using an RDBMS backend, with some operational reporting capabilities. Multiple OLTP applications exchanging data between themselves and trying to do BI reporting, but failing miserably to provide a single version of truth. Emergence of EDW, ETL, BI, OLAP, MDM, Data Governance. Issues in scalability, availability, maintainability, performance etc. started showing up at a totally different level. Emergence of MPP data warehouses, grid computing, self-service BI tools and technology. Big Data, Cloud/Hybrid Architecture, Real-Time Analytics / Stream Processing, Document Indexing/Search, Log Capture & Analysis, Data Virtualization, NoSQL (Key-Value, Document Oriented, Column Family, Graph), In-memory Databases.
  • 5. Application Architecture Pattern Silos — diagram of siloed integration patterns: apps & services over an OLTP RDBMS (plus Spreadsheets / Access); caches & derived stores polled for changes; logs on NFS aggregated via Flume into Hadoop; RDBMS replicated to an ODS via Data Guard / Golden Gate / log shipping, then ETL/ELT into an EDW; CSV dumps / Sqoop into Hadoop, EDW and Splunk; apps feeding NoSQL / key-value stores via ActiveMQ, Kafka and stream processors, including cloud apps.
  • 6. And we end up in a “Giant Hairball” Architecture… — diagram of the same components (OLTP RDBMS, caches & derived stores, ODS, Hadoop, EDW, NoSQL / key-value stores, Splunk, MemSQL, Elastic/Solr via Logstash, Graph DB, cloud apps, ActiveMQ, stream processors) now cross-wired through ETL, ELT, CSV dumps / Sqoop, Flume log aggregation, log shipping and REST into a tangle of point-to-point flows.
  • 7. Data Ingestion Frameworks (excluding Traditional ETL Tools) Amazon Kinesis, Apache Chukwa, Apache Flume, Apache Kafka, Apache Sqoop, Cloudera Morphlines, Facebook Scribe, Fluentd, Google Photon, Mozilla Heka, Hadoop In Hadoop Out, Twitter Kestrel, LinkedIn Databus, Elastic Logstash, Netflix Suro, LinkedIn Gobblin, InfoSphere Streams, Apache Beam. Common drawbacks: purpose-built (not designed for universal applicability), induce a lot of complexity in project architecture, hard to extend.
  • 8. Common Data Integration Problems Size and Velocity  Messages in a streaming manner  Tiny to small files in micro batches  Small files in mini batches  Medium to large files in batches Formats  CSV, TSV, PSV, TEXT, JSON, XML, XLS, XLSX, PDF, BMP, PRN  Avro, Protocol Buffers, Parquet, RC, ORC, Sequence File  Zip, GZIP, TAR, LZIP, 7z, RAR Mediums  File Share, FTP, REST, HTTP, TCP, UDP Schedule  Once, every day, every hour, every minute, every second, continuous Mode  Push / Pull / Poll Asynchronous Operation Challenges  Fast edge consumers + slow processors = everything breaks  Process Message A first, all others can take a backseat Security  Data should be secure – not just at rest, but in motion too Miscellaneous  Can you route a copy of this to our NoSQL store as well, after converting it to JSON?  Ability to recover from failure (checkpoint / rerun / replay)  Merge small files into large files for Hadoop  Break large files into smaller, manageable chunks for NoSQL
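The last two “miscellaneous” asks are common enough that NiFi ships stock processors along these lines (e.g. MergeContent, SplitText). A minimal, NiFi-independent sketch of both operations, with illustrative function names:

```python
def merge_small(payloads, min_size):
    """Merge small payloads into batches of at least min_size bytes (Hadoop-friendly)."""
    batches, current = [], b""
    for p in payloads:
        current += p
        if len(current) >= min_size:
            batches.append(current)
            current = b""
    if current:  # flush any trailing partial batch
        batches.append(current)
    return batches

def split_large(payload, chunk_size):
    """Break one large payload into chunks of at most chunk_size bytes (NoSQL-friendly)."""
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]

print(merge_small([b"aa", b"bb", b"cc"], min_size=4))  # [b'aabb', b'cc']
print(split_large(b"abcdefgh", chunk_size=3))          # [b'abc', b'def', b'gh']
```

In NiFi itself the equivalent behavior is configured on the processor (minimum bundle size, split size) rather than coded.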
  • 9. Introduction What is Apache NiFi, its history, and some terminology.
  • 10. What is Apache NiFi NiFi (short for “Niagara Files”) is a powerful, enterprise-grade dataflow tool that can collect, route, enrich, transform, and process data in a scalable manner. NiFi is based on the concepts of flow-based programming (FBP). FBP is a programming paradigm that defines applications as networks of “black box” processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. Single combined platform for  Data acquisition  Simple event processing  Transport and delivery  Designed to accommodate highly diverse and complicated dataflows • It has a visual command and control interface which allows you to define and manipulate data flows in real time and with great agility.
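The FBP idea above can be sketched in a few lines: black-box processes read from and write to queues, and the wiring between them lives outside the processes. Everything here is illustrative, not NiFi's API.

```python
from queue import Queue

def uppercase(inbox: Queue, outbox: Queue):
    # A black-box process: reads from its input connection, writes to its output.
    while not inbox.empty():
        outbox.put(inbox.get().upper())

def tag(inbox: Queue, outbox: Queue, label: str):
    # Another black box; it knows nothing about what sits upstream or downstream.
    while not inbox.empty():
        outbox.put(f"[{label}] {inbox.get()}")

# Connections are specified externally to the processes, so the same
# black boxes can be recombined into different flows without code changes.
source, middle, sink = Queue(), Queue(), Queue()
for msg in ["hello", "world"]:
    source.put(msg)
uppercase(source, middle)
tag(middle, sink, "demo")
while not sink.empty():
    print(sink.get())  # [demo] HELLO / [demo] WORLD
```

NiFi's processors and connections play exactly these two roles, with the wiring done on the canvas rather than in code.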
  • 11. Short History  Developed by the National Security Agency (NSA) over 8+ years  Open sourced in Nov 2014 (Apache). Major contributors were ex-NSA engineers who formed a company named Onyara. Lead – Joe Witt.  Became an Apache Top-Level Project in July 2015  In August 2015, Hortonworks acquired Onyara  In September 2015, Hortonworks released HDF 1.0, powered by NiFi. Current version is HDF 1.2.  HDF has the solid backing of Hortonworks.
  • 12. Terminology FlowFile (Information Packet) Unit of data (each object) moving through the system Content + Attributes (key/value pairs) Processor (Black Box) Performs the work, can access FlowFiles Currently there are 135 different processors Connection (Bounded Buffer) Links between processors Queues that can be dynamically prioritized Process Group (Subnet) Set of processors and their connections Receive data via input ports, send data via output ports Flow Controller (Scheduler) Maintains the knowledge of how processes are connected and manages the threads and their allocation. Header << UUID, Name, Size, In time, Attribute Map >> Content Flow File Types (Events, Files, Objects, Messages etc.), Formats (JSON, XML, AVRO, Text, Proprietary etc.), Size (B to GBs) Processor Routing (Context/Content), Transformation (enrich, filter, convert, split, aggregate, custom), Mediation (push / pull), Scheduling Connection Queuing, back-pressure, expiration, prioritization
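The FlowFile terminology above (content plus a key/value attribute map, carrying a UUID) can be modeled in a few lines. This is an illustrative sketch, not NiFi's actual classes; `route_on_attribute` is a hypothetical stand-in for a routing processor.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class FlowFile:
    content: bytes                                   # actual payload bytes
    attributes: dict = field(default_factory=dict)   # key/value metadata

    def __post_init__(self):
        # Every FlowFile carries a unique identifier, as in NiFi's header.
        self.attributes.setdefault("uuid", str(uuid.uuid4()))

def route_on_attribute(ff: FlowFile, key: str, value: str) -> str:
    # Context-based routing inspects attributes without touching content.
    return "matched" if ff.attributes.get(key) == value else "unmatched"

ff = FlowFile(b'{"user": "alice"}', {"mime.type": "application/json"})
print(route_on_attribute(ff, "mime.type", "application/json"))  # matched
```

Keeping routing decisions on cheap attributes rather than payload bytes is what lets NiFi move FlowFiles by reference instead of copying content.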
  • 13. Apache NiFi is not • A distributed computation engine • An engine to do CEP (complex event processing) • A computational framework to do distributed joins or rolling-window aggregations the way Spark/Storm/Flink do. Hence it's not based on MapReduce, Spark, or any other such framework. • NiFi doesn't have any dependency on big data tools like Hadoop or ZooKeeper; all it needs is Java. • It's not a full-fledged ETL tool like Informatica / Pentaho / Talend / SSIS as of now. But it will be – eventually. • It's not a long-term data storage tool. It only holds data temporarily for re-run / data provenance purposes. • It's not a document indexer. Its indexing capabilities are only there to help in troubleshooting / debugging.
  • 14. Core Features What are the core features and benefits of Apache NiFi?
  • 15. Guaranteed Data Delivery • Even at very high scale, delivery is guaranteed • A persistent write-ahead log (FlowFile Repository) and data partitioning (Content Repository) ensure this. Together they are designed to allow: • Very high transaction rates • Effective load spreading • A copy-on-write scheme (for every change in data) • Pass-by-reference Data Buffering w/ Back Pressure and Pressure Release • Supports buffering of all queued data. • Ability to apply back pressure (even if there is no load balancing, nodes can say “back off” and other nodes in the pipeline pick up the slack). • When back pressure is applied to a connection, the processor that is the source of the connection stops being scheduled to run until the queue clears out. However, data will still queue up in that processor's incoming connections.
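The back-pressure behavior described above can be sketched with a bounded queue: once the connection reaches its configured threshold, the scheduler simply stops running the upstream source. Names here are illustrative, not NiFi's implementation.

```python
from collections import deque

class Connection:
    """A connection with an object-count back-pressure threshold."""
    def __init__(self, backpressure_threshold: int):
        self.queue = deque()
        self.threshold = backpressure_threshold

    def back_pressured(self) -> bool:
        return len(self.queue) >= self.threshold

conn = Connection(backpressure_threshold=3)
produced, skipped = 0, 0
for item in range(5):
    if conn.back_pressured():
        # The scheduler skips the source processor until the queue clears;
        # data keeps accumulating in *that* processor's incoming connections.
        skipped += 1
    else:
        conn.queue.append(item)
        produced += 1
print(produced, skipped)  # 3 2
```

The key point the sketch makes: nothing is dropped; the slowdown propagates upstream hop by hop instead.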
  • 16. Prioritized Queuing • NiFi allows setting one or more prioritization schemes for how data is retrieved from a queue. • Oldest first, newest first, largest first, smallest first, or a custom scheme • The default is oldest first Designed for Extension • NiFi is highly extensible by design. • One can write custom:  Processors  Controller Services  Reporting Tasks  Prioritizers  User Interfaces • These extensions are bundled in NAR files (NiFi Archives).
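The prioritization schemes above amount to choosing a comparator over the queued FlowFiles. An illustrative sketch (plain dicts standing in for FlowFiles):

```python
queue = [
    {"name": "a", "entry_time": 3, "size": 10},
    {"name": "b", "entry_time": 1, "size": 50},
    {"name": "c", "entry_time": 2, "size": 5},
]

# Oldest first (the default): lowest entry time is dequeued first.
oldest_first = sorted(queue, key=lambda f: f["entry_time"])
# Largest first: biggest payload is dequeued first.
largest_first = sorted(queue, key=lambda f: f["size"], reverse=True)

print([f["name"] for f in oldest_first])   # ['b', 'c', 'a']
print([f["name"] for f in largest_first])  # ['b', 'a', 'c']
```

A custom NiFi prioritizer is just such a comparator, packaged as an extension in a NAR file.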
  • 17. Visual Interface for Command and Control • Drag and drop processors to build a flow • Start, stop, and configure components in real time • View errors and corresponding error messages • View statistics and health of the data flow • Create templates of common processors & connections
  • 18. Data Provenance (Not Just Lineage) • View attributes and content at given points in time (before and after each processor) • Records, indexes, and makes events available for display
  • 19. Benefits of Apache NiFi Single, data-source-agnostic collection platform Intuitive, real-time visual user interface with drag-and-drop capabilities Powerful data security capabilities from source to storage Highly granular data-sharing policies Ability to react in real time by leveraging bi-directional data flows and prioritized data feeds Extremely scalable, extensible platform
  • 20. Architecture High level architecture (single machine), Primary components
  • 21. Single Node — Web Server: NiFi's HTTP-based command and control API. Flow Controller: the real brain; provides and manages threads and scheduling. Runs within a JVM. Extensions: Processor / Controller Service / Reporting Task / UI / Prioritizer. FlowFile Repository: state of each FlowFile presently active in the flow (write-ahead log). Content Repository: actual content bytes of a given FlowFile, stored as blocks of data in the file system; can span more than one file system (partitions). Provenance Repository: where all provenance event data is stored; saved on the file system, indexed, and searchable.
  • 23. Demo (Simple λ Architecture using NiFi) 1. Start NiFi (or HDF); ElasticSearch, Kibana and MongoDB on Docker. Create the HDFS destination table. 2. Explore the NiFi UI 3. Pull data from Twitter 4. Route & deliver to ElasticSearch, Mongo and Hadoop in real time 5. Explore NiFi capabilities 6. Design a dashboard on real-time data Serving Layer Speed Layer Batch Layer NiFi Twitter Hadoop ElasticSearch MongoDB Spark SQL Kibana Docker Query
  • 26. Some Use Cases Building ingestion and delivery layers in IoT solutions Ingestion tier in a Lambda Architecture (feeding both the speed and batch layers) Ingestion tier in Data Lake architectures Cross-geography data replication in a secure manner Integrating on-premise systems with cloud systems (hybrid cloud architecture) Simplifying existing big data architectures that currently use Flume, Kafka, Logstash, Scribe etc. or custom connectors Developing edge nodes for trade repositories Enterprise data integration platform And many more…
  • 27. Q & A

Editor's Notes

  1. Big Data Ingestion and Streaming   What is data ingestion? Data ingestion is the process of obtaining, importing, and processing data for later use or storage in a database. This process often involves altering individual files by editing their content and/or formatting them to fit into a larger document. An effective data ingestion methodology begins by validating the individual files, then prioritizes the sources for optimum processing, and finally validates the results. When numerous data sources exist in diverse formats (the sources may number in the hundreds and the formats in the dozens), maintaining reasonable speed and efficiency can become a major challenge. To that end, several vendors offer programs tailored to the task of data ingestion in specific applications or environments. Data Ingestion Frameworks Amazon Kinesis – real-time processing of streaming data at massive scale. Apache Chukwa – data collection system. Apache Flume – service to manage large amounts of log data. Apache Kafka – distributed publish-subscribe messaging system. Apache Sqoop – tool to transfer data between Hadoop and a structured datastore. Cloudera Morphlines – framework that helps ETL to Solr, HBase and HDFS. Facebook Scribe – streamed log data aggregator. Fluentd – tool to collect events and logs. Google Photon – geographically distributed system for joining multiple continuously flowing streams of data in real time with high scalability and low latency. Heka – open source stream processing software system. HIHO – framework for connecting disparate data sources with Hadoop. Kestrel – distributed message queue system. LinkedIn Databus – stream of change capture events for a database. LinkedIn Kamikaze – utility package for compressing sorted integer arrays. LinkedIn White Elephant – log aggregator and dashboard. Logstash – a tool for managing events and logs. Netflix Suro – log aggregator, like Storm and Samza, based on Chukwa. 
Pinterest Secor – a service implementing Kafka log persistence. LinkedIn Gobblin – LinkedIn's universal data ingestion framework.     Flume A Flume agent is a Java virtual machine (JVM) process that hosts the components through which events flow. Each agent contains at the minimum a source, a channel and a sink. An agent can also run multiple sets of channels and sinks through a flow multiplexer that either replicates or selectively routes an event. Agents can cascade to form a multihop tiered collection topology until the final datastore is reached. A Flume event is a unit of data flow that contains a payload and an optional set of string attributes. An event is transmitted from its point of origination, normally called a client, to the source of an agent. When the source receives the event, it sends it to a channel that is a transient store for events within the agent. The associated sink can then remove the event from the channel and deliver it to the next agent or the event's final destination. Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ. Flume and Chukwa share similar goals and features. However, there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send. Resources:  Flume Tutorial How to Refine and Visualize Server Log Data How To Refine and Visualize Sentiment Data   Chukwa Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. 
Chukwa is built on top of the Hadoop distributed filesystem (HDFS) and MapReduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results, in order to make the best use of this collected data. Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Some of the durability features include agent-side replaying of data to recover from errors. Resources: Chukwa quick start guide    Sqoop Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates. Sqoop provides a pluggable mechanism for optimal connectivity to external systems. The Sqoop extension API provides a convenient framework for building new connectors which can be dropped into Sqoop installations to provide connectivity to various systems. Sqoop itself comes bundled with various connectors that can be used for popular database and data warehousing systems.   Resources: Import from Microsoft SQL Server into the Hortonworks Sandbox using Sqoop Kafka Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high-throughput persistent messaging that's scalable and allows for parallel data loads into Hadoop. 
Its features include the use of compression to optimize IO performance and mirroring to improve availability, scalability and performance in multiple-cluster scenarios. Fast A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Scalable Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers. Durable Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. Distributed by Design Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. Resources: Kafka Tutorial Storm and Kafka Simulating and Transporting Real-Time Events Storm Hadoop is ideal for batch-mode processing over massive data sets, but it doesn't support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose, event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts and each processing node is called a bolt. 
Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.  How Storm Works A Storm cluster has three sets of nodes: Nimbus node (master node, similar to the Hadoop JobTracker): Uploads computations for execution Distributes code across the cluster Launches workers across the cluster Monitors computation and reallocates workers as needed ZooKeeper nodes – coordinate the Storm cluster Supervisor nodes – communicate with Nimbus through ZooKeeper, start and stop workers according to signals from Nimbus Five key abstractions help to understand how Storm processes data: Tuples – an ordered list of elements. For example, a “4-tuple” might be (7, 1, 3, 7) Streams – an unbounded sequence of tuples. Spouts – sources of streams in a computation (e.g. a Twitter API) Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases. Topologies – the overall calculation, represented visually as a network of spouts and bolts (as in the following diagram) Storm users define topologies for how to process the data when it comes streaming in from the spout. When the data comes in, it is processed and the results are passed into Hadoop. Resources: Ingesting and processing Real-time events with Apache Storm Process real-time big data with Twitter Storm Real time Data Ingestion in HBase & Hive using Storm Bolt   Elasticsearch Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements. 
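Storm's five abstractions can be illustrated with a tiny in-process sketch (not the Storm API; Storm runs spouts and bolts as distributed tasks, not generators): a spout emits tuples into a stream, a bolt transforms that stream, and the wiring between them is the topology.

```python
def word_spout():
    # Spout: a source of a stream of tuples (bounded here for illustration;
    # real Storm streams are unbounded).
    yield from [("storm",), ("kafka",), ("storm",)]

def count_bolt(stream):
    # Bolt: consumes an input stream and produces an output stream,
    # here a running count per word.
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

# Topology: the overall calculation, a network of spouts and bolts.
for tup in count_bolt(word_spout()):
    print(tup)  # ('storm', 1) ('kafka', 1) ('storm', 2)
```

In real Storm the same shape is declared via a TopologyBuilder and executed across the cluster, with Nimbus and the supervisors handling placement and restarts.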
Here are a few sample use-cases that Elasticsearch could be used for: You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them. You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you. You run a price alerting platform which allows price-savvy customers to specify a rule like “I am interested in buying a specific electronic gadget and I want to be notified if the price of the gadget falls below $X from any vendor within the next month”. In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found. You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data. 
Resources: Exploring Elasticsearch Logstash + Elasticsearch + Kibana Spring XD              Spring XD is a unified, distributed, and extensible service for data ingestion, real-time analytics, batch processing, and data export. The Spring XD project is an open source, Apache 2 licensed project whose goal is to tackle big data complexity. Much of the complexity in building real-world big data applications is related to integrating many disparate systems into one cohesive solution across a range of use-cases. Common use-cases encountered in creating a comprehensive big data solution are: High-throughput distributed data ingestion from a variety of input sources into a big data store such as HDFS or Splunk Real-time analytics at ingestion time, e.g. gathering metrics and counting values. Workflow management via batch jobs. The jobs combine interactions with standard enterprise systems (e.g. RDBMS) as well as Hadoop operations (e.g. MapReduce, HDFS, Pig, Hive or HBase). High-throughput data export, e.g. from HDFS to an RDBMS or NoSQL database. Resources: SpringXD Guide SpringXD Quick Start HDP + SpringXD   Solr Apache Solr is a powerful search server which supports a REST-like API. Solr is powered by Lucene, which enables powerful matching capabilities like phrases, wildcards, joins, grouping and many more across various data types. It is highly optimized for high traffic using Apache ZooKeeper. Apache Solr comes with a wide set of features; listed below is a subset of the high-impact ones. Advanced full-text search capabilities. Standards-based open interfaces – XML, JSON and HTTP. Highly scalable and fault tolerant. Supports both schema and schemaless configuration. Faceted search and filtering. Supports major languages like English, German, Chinese, Japanese, French and many more. Rich document parsing. Resources: Solr for beginners Solr Tutorial HDP + Solr Additional Resources: Distcp Guide Real time Data Ingestion in HBase & Hive Elasticsearch Vs Solr