Data ingestion is a critical piece of infrastructure for any Big Data project. Learn about the key challenges in building ingestion infrastructure and how enterprises are solving them using low-level frameworks such as Apache Flume and Apache Kafka, and higher-level systems such as StreamSets.
2. “Thru 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges”
@nheudecker tweet (Gartner, 26 Feb 2015)
3. What is Data Ingestion?
Acquiring data as it is produced from data source(s)
Transforming it into a consumable form
Delivering the transformed data to the consuming system(s)
The challenge:
Doing this continuously, and at scale across a wide variety of sources
and consuming systems!
4. Why is Data Ingestion Difficult? (Hint: Drift)
Modern data sources and consuming applications evolve rapidly
(Infrastructure Drift)
Data produced changes without notice independent of consuming applications
(Structural Drift)
Data semantics change over time as the same data powers new use cases
(Semantic Drift)
6. Plan Your Ingestion Infrastructure Carefully!
Plan ahead: Allocate time and resources specifically for building out your data
ingestion infrastructure
Plan for future: Design ingestion infrastructure with sufficient extensibility to
accommodate unknown future requirements
Plan for correction: Incorporate low-level instrumentation to help understand
the effectiveness of your ingestion infrastructure, correcting it as your systems
evolve
7. The Benefits of Well-Designed Data Ingestion
Minimal effort needed to accommodate changes: Handle upgrades, onboard
new data sources/consuming applications/analytics technologies, etc.
Increased confidence in data quality: Rest assured that your consuming
applications are working with correct, consistent and trustworthy data
Reduced latency for consumption: Allow rapid consumption of data and
remove the need for manual intervention to enable consuming applications
8. Recipe #1
Create decoupled ingest infrastructure
An Independent Infrastructure between Data Sources and Consumers
9. Decoupled Ingest Infrastructure
• Decouple data format and packaging from source to destination
- Example: read CSV files and write Sequence files
• Break down input data into its smallest meaningful representation
- Example: an individual log record, an individual tweet record, etc.
• Implement asynchronous data movement from source to destination (see the sketch below)
- The data movement process is independent of the source or consumer process
An Independent Infrastructure between Data Sources and Consumers
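To make these properties concrete, here is a minimal sketch of an independent mover process, assuming a local Kafka broker; the file name and the raw-logs topic are illustrative:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogMover {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // assumed local broker
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    // The mover runs as its own process: neither the log writer nor any
    // downstream consumer blocks on it (asynchronous data movement).
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Smallest meaningful representation: one log line = one record,
      // rather than shipping whole files.
      Files.lines(Paths.get("access.log")).forEach(line ->
          // send() is asynchronous; the callback surfaces per-record failures.
          producer.send(new ProducerRecord<>("raw-logs", line), (meta, e) -> {
            if (e != null) System.err.println("failed record: " + e.getMessage());
          }));
    }
  }
}

The destination format stays decoupled as well: whatever drains the topic into the warehouse can serialize records however it likes (for example, as Sequence files) without the producer knowing.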
10. Decoupled Ingestion Using Apache Flume
Fan-in Log Aggregation
• Load data into client-tier Flume agents
• Route data through intermediate tiers
• Deposit data via collector tiers
[Diagram: Log Files → Tier-1 agents → Tier-2 agents → Tier-3 collector agents → Hive Warehouse]
11. Decoupled Ingestion Using Apache Flume + StreamSets
Use StreamSets Data Collector (SDC) to onboard data into Flume from a variety of sources
[Diagram: Sources → SDC tier → Tier-2 agents → Tier-3 agents → Hive Warehouse]
12. Decoupled Ingestion Using Apache Kafka
Pub-Sub Log Flow
• Producers load data into Kafka topics from various log files
• Consumers read data from the topics and deposit it to the destination (a consumer sketch follows the diagram)
[Diagram: Log Files → Producers → Apache Kafka → Consumers → Hive Warehouse]
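A minimal consumer-side sketch of this flow, assuming the same local broker and raw-logs topic as above; the group id is illustrative, and a real deployment would batch writes into the Hive warehouse rather than print:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogDepositor {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "hive-loader");  // consumers in one group split the partitions
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("raw-logs"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          // Stand-in for the real deposit step (e.g. a batched warehouse write).
          System.out.println(record.value());
        }
      }
    }
  }
}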
14. Decoupled Ingest Infrastructure
• Decouple Formats
- Flume: Use built-in clients, sources, serializers. Extend if necessary.
- Kafka: Use third-party producers and consumers. Write your own if necessary.
- StreamSets: Built-in any-to-any format conversion support.
• Smallest Representation
- Flume: Opaque payload (Event body)
- Kafka: Opaque payload
- StreamSets: Interpreted record format
• Asynchronous Data Movement
- Flume: Yes
- Kafka: Yes
- StreamSets: Yes*
An Independent Infrastructure between Data Sources and Stores/Consumers
15. Poll: Tools for Ingest
What tools do you use for Ingest?
o Kafka
o Flume
o Other
17. Data scientists spend 50 to 80 percent of their time in the mundane labor of collecting and preparing unruly digital data, before it can be explored.
“For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”
The New York Times, Steve Lohr, Aug 17, 2014
18. Recipe #2
Implement in-stream data sanitization
Prepare data using in-stream transformations to make it consumption-ready
19. In-Stream Data Sanitization
• Convert and enforce data types where necessary
- Example: turn strings into integers to enable type matching for consuming systems
• Filter and route data depending upon downstream requirements
- Example: Build routing logic to filter out invalid or low value data to reduce processing costs
• Enrich and transform data where needed
- Example: Extract and enrich credit card type information before masking the card number
Prepare data using in-stream transformations to make it consumption-ready
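A minimal, dependency-free sketch of all three steps over one hypothetical delimited record layout; the field positions, the leading-4 card-type rule, and the masking format are all illustrative (Java 16+ for the record syntax):

import java.util.Optional;

public class Sanitizer {
  // Hypothetical shape of a record after sanitization.
  record Transaction(String maskedCardNumber, int amountCents, String cardType) {}

  // Returns a sanitized record, or empty if the input should be filtered out.
  static Optional<Transaction> sanitize(String csvLine) {
    String[] fields = csvLine.split(",");
    if (fields.length < 2) return Optional.empty();          // filter: malformed input

    int amountCents;
    try {
      amountCents = Integer.parseInt(fields[1].trim());      // convert: string -> int
    } catch (NumberFormatException e) {
      return Optional.empty();                               // filter: invalid type
    }

    String card = fields[0].trim();
    String type = card.startsWith("4") ? "VISA" : "OTHER";   // enrich: derive card type first,
    String masked = "************"
        + card.substring(Math.max(0, card.length() - 4));    // then mask the number
    return Optional.of(new Transaction(masked, amountCents, type));
  }
}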
20. Data Sanitization using Apache Flume
• Use Interceptors to convert data
types where necessary
• Use Channel Selectors for
contextual routing in conjunction
with Interceptors
• Use Interceptors to enrich or
transform messages
[Diagram: Flume Agent: Source → Interceptors → Channels 1..n → Sinks 1..n]
Flume provides a few Interceptors for static decoration and rudimentary data transformations
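As an illustration of the first two bullets, a custom Interceptor can decorate each event with a header that a multiplexing Channel Selector then routes on. This sketch compiles against Flume's Interceptor API; the class name, header key, and tagging rule are our own:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Decorates each event with a header that a Channel Selector can route on.
public class CardTypeInterceptor implements Interceptor {
  @Override public void initialize() {}

  @Override
  public Event intercept(Event event) {
    String body = new String(event.getBody(), StandardCharsets.UTF_8);
    // Illustrative rule: tag the event so a multiplexing selector can route it.
    event.getHeaders().put("card.type", body.startsWith("4") ? "visa" : "other");
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<>(events.size());
    for (Event e : events) out.add(intercept(e));
    return out;
  }

  @Override public void close() {}

  // Flume instantiates interceptors through a Builder named in the agent config.
  public static class Builder implements Interceptor.Builder {
    @Override public Interceptor build() { return new CardTypeInterceptor(); }
    @Override public void configure(Context context) {}
  }
}

The agent configuration names the Builder class on the source's interceptor chain, and a multiplexing channel selector maps the card.type header value to different channels.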
21. Data Sanitization using Apache Kafka
• No built-in support for data types beyond a broad value type
• Use custom producers that choose topics, thus implementing routing and filtering
• Use custom producers and consumers to implement validation and transformations where needed
Producer-consumer chains can be applied between topics to achieve data sanitization
[Diagram: Producers 1..n → Topic A → Sanitizing Consumer/Producer → Topic B → Consumers x, y, z]
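A minimal sketch of the sanitizing link in that chain, consuming from one topic and republishing cleaned records to the next; the topic names, group id, and the validation and transform rules are illustrative:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SanitizingBridge {
  public static void main(String[] args) {
    Properties cProps = new Properties();
    cProps.put("bootstrap.servers", "localhost:9092");
    cProps.put("group.id", "sanitizer");
    cProps.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    cProps.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");

    Properties pProps = new Properties();
    pProps.put("bootstrap.servers", "localhost:9092");
    pProps.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    pProps.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
         KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
      consumer.subscribe(Collections.singletonList("topic-a"));
      while (true) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
          if (!rec.value().contains("=")) continue;      // validate: drop malformed records
          producer.send(new ProducerRecord<>("topic-b",  // transform, then republish
              rec.key(), rec.value().toLowerCase()));
        }
      }
    }
  }
}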
22. Data Sanitization using StreamSets
• Built-in routing, filtering, validation
using minimal schema specification
• Support for JavaScript, Python,
Java EL as well as Java extensions
• Built-in transformations including PII
masking, anonymization, etc.
23. In-Stream Data Sanitization
Prepare data using in-stream transformations to make it consumption-ready
• Introspection/Validation
- Flume: Minimal support. Use custom Interceptors or morphlines.
- Kafka: None. Must be done within custom producers and consumers.
- StreamSets: Extensive built-in support for introspection and validation.
• Filter/Routing
- Flume: Rudimentary support using built-in Interceptors; complex cases require custom interceptors.
- Kafka: None. Must be done within custom producers and consumers.
- StreamSets: Sophisticated built-in contextual routing based on data introspection.
• Enrich/Transform
- Flume: Minimal support. Will require custom interceptors.
- Kafka: None. Must be done within custom producers and consumers.
- StreamSets: Extensive support via built-in functions as well as scripting processors for enrichment and transformations.
24. Poll: Preparing Data for Consumption
For the most part, at what point do you prepare your data for consumption?
o Before Ingest
o During Ingest
o After Ingest
27. Drift Detection and Handling
• Specify and validate intent rather than schema to catch structural drift
- Example: data must contain a certain attribute, not an attribute at a particular ordinal
• Specify and validate semantic constraints to catch semantic drift
- Example: service operating within a particular city must have bounded geo-coordinates
• Isolate ingest logic from downstream to enable infrastructure drift handling
- Example: Use isolated class-loaders to enable writing to binary incompatible systems
Enable runtime checks for detection and handling of drift
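A minimal sketch of runtime checks for the first two kinds of drift; the field names and the San Francisco-area coordinate bounds are hypothetical stand-ins for real constraints:

import java.util.Map;

public class DriftChecks {
  // Semantic constraint: a service operating in one city must report
  // coordinates inside that city's bounding box (illustrative SF bounds).
  static boolean withinServiceArea(double lat, double lon) {
    return lat > 37.6 && lat < 37.9 && lon > -122.6 && lon < -122.3;
  }

  static void ingest(Map<String, String> record) {
    // Structural intent: require attributes by NAME, not at a fixed ordinal,
    // so reordered or newly added columns do not break ingestion.
    for (String required : new String[] {"card_number", "lat", "lon"}) {
      if (!record.containsKey(required)) {
        routeToError(record, "structural drift: missing " + required);
        return;
      }
    }
    try {
      double lat = Double.parseDouble(record.get("lat"));
      double lon = Double.parseDouble(record.get("lon"));
      if (!withinServiceArea(lat, lon)) {
        routeToError(record, "semantic drift: coordinates outside service area");
        return;
      }
    } catch (NumberFormatException e) {
      routeToError(record, "structural drift: non-numeric coordinates");
      return;
    }
    // ... deliver the validated record to the consuming system ...
  }

  // Drifting records are siphoned off rather than crashing the pipeline.
  static void routeToError(Map<String, String> record, String reason) {
    System.err.println(reason + ": " + record);
  }
}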
28. Drift Handling in Apache Flume and Kafka
No support for structural drift: Flume and Kafka are low-level frameworks and
work with opaque data. You need to handle this in your code.
No support for semantic drift: This is because Flume and Kafka work with
opaque data. Again, you need to handle this in your code.
No support for infrastructure drift: These systems do not provide class-loader
isolation or other semantics to easily handle infrastructure drift. Their
extension mechanism makes it impossible to handle this in your code in most
cases.
29. Structural Drift Handling in StreamSets
• This stream selector uses a header name to identify an attribute
• If the data is delimited (e.g. CSV) the column position could vary from record to record
• If the attribute is absent, the record flows through the default stream
30. Semantic Drift Handling in StreamSets
• This stage specifies a required field as well as preconditions for field value validation
• Records that meet this constraint flow through the pipeline
• Records that do not meet this constraint are siphoned off to an error pipeline
31. Infrastructure Drift Handling in StreamSets
• This pipeline duplicates the data into two Kafka instances that may be incompatible
• No recompilation or dependency reconciliation required to work in such environments
• Ideal for handling upgrades, onboarding of new clusters, applications, etc.
32. Drift Detection and Handling
• Structural Drift
- Flume: No support
- Kafka: No support
- StreamSets: Support for loosely defined intent specification for structural validation
• Semantic Drift
- Flume: No support
- Kafka: No support
- StreamSets: Support for complex preconditions and routing for semantic data validation
• Infrastructure Drift
- Flume: No support
- Kafka: No support
- StreamSets: Support for class-loader isolation to ensure operation in binary-incompatible environments
Enable runtime checks for detection and handling of drift
33. Poll: Prevalence of Data Drift
How often do you encounter data drift as we’ve described it?
o Never
o Occasionally
o Frequently
36. Use the Right Tool for the Job!
Apache Flume: Best suited for long-range bulk data transfer with basic routing and filtering support
Apache Kafka: Best suited for data democratization, making the same streams available to many independent producers and consumers
StreamSets: Best suited for any-to-any data movement, drift detection and handling, complex routing and filtering, etc.
Use these systems together to build a best-in-class ingestion infrastructure!