Building Continuously Curated
Ingestion Pipelines
Recipes for Success
Arvind Prabhakar
“Thru 2018, 70% of Hadoop deployments will not meet cost savings and
revenue generation objectives due to skills and integration challenges”
@nheudecker tweet (Gartner, 26 Feb 2015)
What is Data Ingestion?
• Acquiring data as it is produced from data source(s)
• Transforming it into a consumable form
• Delivering the transformed data to the consuming system(s)
The challenge:
Doing this continuously, and at scale across a wide variety of sources
and consuming systems!
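The acquire, transform, deliver steps can be sketched as a minimal, framework-agnostic pipeline. This is a Python illustration with made-up record shapes and function names; it is not how any particular ingest tool structures its code:

```python
import csv
import io
import json

def acquire(raw_csv):
    """Acquire: read records as the source produces them (CSV here)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(records):
    """Transform: coerce values into a consumable, typed form."""
    return [{"user": r["user"], "amount": int(r["amount"])} for r in records]

def deliver(records):
    """Deliver: hand records to the consuming system (JSON lines here)."""
    return "\n".join(json.dumps(r) for r in records)

raw = "user,amount\nalice,42\nbob,7"
print(deliver(transform(acquire(raw))))
```

The hard part named above is not any single stage but running all three continuously as sources and consumers multiply.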
Why is Data Ingestion Difficult? (Hint: Drift)
• Modern data sources and consuming applications evolve rapidly (Infrastructure Drift)
• Data produced changes without notice, independent of consuming applications (Structural Drift)
• Data semantics change over time as the same data powers new use cases (Semantic Drift)
Continuous data ingestion is critical to the success
of modern data environments
Plan Your Ingestion Infrastructure Carefully!
• Plan ahead: Allocate time and resources specifically for building out your data ingestion infrastructure
• Plan for the future: Design ingestion infrastructure with sufficient extensibility to accommodate unknown future requirements
• Plan for correction: Incorporate low-level instrumentation to help understand the effectiveness of your ingestion infrastructure, correcting it as your systems evolve
The Benefits of Well-Designed Data Ingestion
• Minimal effort needed to accommodate changes: Handle upgrades, onboard new data sources, consuming applications, analytics technologies, etc.
• Increased confidence in data quality: Rest assured that your consuming applications are working with correct, consistent, and trustworthy data
• Reduced latency for consumption: Allow rapid consumption of data and remove any need for manual intervention to enable consuming applications
Recipe #1
Create decoupled ingest infrastructure
An Independent Infrastructure between Data Sources and Consumers
Decoupled Ingest Infrastructure
• Decouple data format and packaging from source to destination
- Example: read CSV files and write Sequence files
• Break down input data into its smallest meaningful representation
- Example: Individual log record, individual tweet record, etc.
• Implement asynchronous data movement from source to destination
- Data movement process is independent of the source or consumer process
An Independent Infrastructure between Data Sources and Consumers
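The three points of the recipe can be sketched with a queue decoupling a CSV-reading producer from an independently running consumer. The in-process queue is only a stand-in for a Flume channel or Kafka topic; names are illustrative:

```python
import queue
import threading

buffer = queue.Queue()   # stand-in for a Flume channel / Kafka topic
SENTINEL = object()

def produce(csv_lines):
    """Break input into its smallest meaningful unit: one record per line."""
    for line in csv_lines:
        buffer.put(line.strip().split(","))
    buffer.put(SENTINEL)

def consume(sink):
    """Runs independently of the producer and re-packages for the destination."""
    while True:
        record = buffer.get()
        if record is SENTINEL:
            break
        sink.append({"fields": record})  # decoupled output format

sink = []
t = threading.Thread(target=consume, args=(sink,))
t.start()
produce(["a,1", "b,2"])
t.join()
print(sink)
```

Because neither side calls the other directly, either end can be upgraded or replaced without touching its counterpart.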
Decoupled Ingestion Using Apache Flume
Fan-in Log Aggregation
• Load data into client-tier Flume agents
• Route data through intermediate tiers
• Deposit data via collector tiers
[Diagram: log files enter Tier-1 client agents, fan in through Tier-2 intermediate agents, and Tier-3 collector agents deposit into the Hive warehouse]
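A single agent in such a topology is driven entirely by configuration. The snippet below is a hedged sketch of a one-tier agent (paths and names are illustrative; a real fan-in deployment would additionally chain Avro sinks and sources between tiers):

```properties
# Illustrative single-tier Flume agent: spool log files into HDFS
agent1.sources = logsrc
agent1.channels = memch
agent1.sinks = hdfssink

agent1.sources.logsrc.type = spooldir
agent1.sources.logsrc.spoolDir = /var/log/incoming
agent1.sources.logsrc.channels = memch

agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = /data/logs/%Y-%m-%d
agent1.sinks.hdfssink.channel = memch
```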
Decoupled Ingestion Using Apache Flume + StreamSets
Use StreamSets Data Collector to onboard data into Flume from a variety of sources
[Diagram: an SDC tier replaces the client tier, feeding Tier-2 and Tier-3 Flume agents that deposit into the Hive warehouse]
Decoupled Ingestion Using Apache Kafka
Pub-Sub Log Flow
• Producers load data into Kafka
topics from various log files
• Consumers read data from the
topics and deposit to destination
[Diagram: producers read log files into Apache Kafka topics; consumers read from the topics and deposit into the Hive warehouse]
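The pub-sub flow can be sketched framework-agnostically; the in-memory topic below is only a stand-in for Kafka, and a real deployment would use the Kafka client libraries:

```python
from collections import defaultdict

topics = defaultdict(list)   # stand-in for Kafka topics

def publish(topic, record):
    """Producers append to a topic; they know nothing about consumers."""
    topics[topic].append(record)

def consume(topic, offset):
    """Consumers read from their own offset, fully decoupled from producers."""
    records = topics[topic][offset:]
    return records, offset + len(records)

publish("web-logs", "GET /index 200")
publish("web-logs", "GET /missing 404")
records, offset = consume("web-logs", 0)
print(records, offset)
```

The offset is what makes the movement asynchronous: producers and consumers proceed at independent rates, and new consumers can replay from any earlier position.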
Decoupled Ingestion Using Apache Kafka + StreamSets
[Diagram: a standalone SDC pipeline publishes into Apache Kafka; a cluster-mode SDC pipeline consumes and deposits into the Hive warehouse]
Decoupled Ingest Infrastructure
• Decouple formats
- Flume: use built-in clients, sources, and serializers; extend if necessary.
- Kafka: use third-party producers and consumers; write your own if necessary.
- StreamSets: built-in any-to-any format conversion support.
• Smallest representation
- Flume: opaque payload (the event body).
- Kafka: opaque payload.
- StreamSets: interpreted record format.
• Asynchronous data movement
- Flume: yes. Kafka: yes. StreamSets: yes*
An Independent Infrastructure between Data Sources and Stores/Consumers
Poll: Tools for Ingest
What tools do you use for Ingest?
o Kafka
o Flume
o Other
Results: Tools for Ingest
Data scientists spend 50 to 80 percent of their time in mundane labor of
collecting and preparing unruly digital data, before it can be explored.
“For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”
The New York Times -- Steve Lohr, Aug 17, 2014
Recipe #2
Implement in-stream data sanitization
Prepare data using in-stream transformations to make it consumption-ready
In-Stream Data Sanitization
• Convert and enforce data types where necessary
- Example: turn strings into integers to enable type matching for consuming systems
• Filter and route data depending upon downstream requirements
- Example: Build routing logic to filter out invalid or low value data to reduce processing costs
• Enrich and transform data where needed
- Example: Extract and enrich credit card type information before masking the card number
Prepare data using in-stream transformations to make it consumption ready
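The three kinds of sanitization can be sketched as one in-stream transformation. The field names, card rules, and routing policy below are illustrative, not taken from any specific pipeline:

```python
def sanitize(record):
    """Type conversion, filtering/routing, and enrichment in one pass."""
    # 1. Convert and enforce data types
    try:
        record["amount"] = int(record["amount"])
    except (KeyError, ValueError):
        return None                     # 2. Route invalid data out of the stream
    # 3. Enrich (derive card type), then mask the sensitive value
    record["card_type"] = "visa" if record["card"].startswith("4") else "other"
    record["card"] = "****" + record["card"][-4:]
    return record

stream = [
    {"amount": "25", "card": "4111222233334444"},
    {"amount": "oops", "card": "5500000000000004"},
]
clean = [r for r in (sanitize(rec) for rec in stream) if r is not None]
print(clean)
```

Note the ordering in step 3: the card type is extracted before masking, exactly because the enrichment needs information the mask destroys.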
Data Sanitization using Apache Flume
• Use Interceptors to convert data
types where necessary
• Use Channel Selectors for
contextual routing in conjunction
with Interceptors
• Use Interceptors to enrich or
transform messages
[Diagram: a Flume agent in which the Source passes events through Interceptors into Channels 1..n, each drained by Sinks 1..n]
Flume provides a few Interceptors for static decoration and rudimentary data transformations
Data Sanitization using Apache Kafka
• No built-in support for data types
beyond a broad value type
• Use custom producers that choose topics, thus implementing routing and filtering
• Use custom producers and
consumers to implement validation
and transformations where needed
Producer-consumer chains can be applied between topics to achieve data sanitization
[Diagram: producers feed Topic A; a sanitizing consumer-producer pair moves data from Topic A to Topic B, from which downstream consumers read]
Data Sanitization using StreamSets
• Built-in routing, filtering, and validation using minimal schema specification
• Support for JavaScript, Python, and Java EL, as well as Java extensions
• Built-in transformations including PII masking, anonymization, etc.
In-Stream Data Sanitization
Prepare data using in-stream transformations to make it consumption ready
• Introspection/validation
- Flume: minimal support; use custom Interceptors or morphlines.
- Kafka: none; must be done within custom producers and consumers.
- StreamSets: extensive built-in support for introspection and validation.
• Filter/routing
- Flume: rudimentary support using built-in Interceptors; custom interceptors required beyond that.
- Kafka: none; must be done within custom producers and consumers.
- StreamSets: sophisticated built-in contextual routing based on data introspection.
• Enrich/transform
- Flume: minimal support; custom interceptors required.
- Kafka: none; must be done within custom producers and consumers.
- StreamSets: extensive support via built-in functions as well as scripting processors for enrichment and transformations.
Poll: Preparing Data for Consumption
For the most part, at what point do you prepare your data for consumption?
o Before Ingest
o During Ingest
o After Ingest
Results: Preparing Data for Consumption
Recipe #3
Implement drift-detection and handling
Enable runtime checks for detection and handling of drift
Drift Detection and Handling
• Specify and validate intent rather than schema to catch structure drift
- Example: data must contain a certain attribute, and not an attribute at a particular ordinal
• Specify and validate semantic constraints to catch semantic drift
- Example: service operating within a particular city must have bounded geo-coordinates
• Isolate ingest logic from downstream to enable infrastructure drift handling
- Example: Use isolated class-loaders to enable writing to binary incompatible systems
Enable runtime checks for detection and handling of drift
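Both kinds of runtime checks can be sketched directly from the examples above: an intent check that looks up attributes by name rather than ordinal, and a semantic check that bounds geo-coordinates. The field names and coordinate ranges are illustrative:

```python
def check_structural(record, required_fields):
    """Intent-based check: fields by name, not position, so reordering
    or added columns do not break the pipeline."""
    return [f for f in required_fields if f not in record]

def check_semantic(record, lat_range, lon_range):
    """Semantic check: coordinates must stay within the service area."""
    return lat_range[0] <= record["lat"] <= lat_range[1] and \
           lon_range[0] <= record["lon"] <= lon_range[1]

rec = {"driver_id": "d-17", "lat": 37.77, "lon": -122.42}
missing = check_structural(rec, ["driver_id", "lat", "lon"])
in_bounds = check_semantic(rec, (37.6, 37.9), (-122.6, -122.3))
print(missing, in_bounds)
```

Records failing either check would be routed to an error stream rather than dropped, so drift is detected and investigated instead of silently corrupting downstream data.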
Drift Handling in Apache Flume and Kafka
• No support for structural drift: Flume and Kafka are low-level frameworks that work with opaque data; you must handle this in your code.
• No support for semantic drift: again, because Flume and Kafka treat data as opaque, you must handle this in your code.
• No support for infrastructure drift: these systems do not provide class-loader isolation or similar mechanisms, and in most cases their extension mechanisms make it impossible to handle this in your code.
Structural Drift Handling in StreamSets
• This stream selector uses a header name to identify an attribute
• If the data is delimited (e.g. CSV) the column position could vary from record to record
• If the attribute is absent, the record flows through the default stream
Semantic Drift Handling in StreamSets
• This stage specifies a required field as well as preconditions for field value validation
• Records that meet this constraint flow through the pipeline
• Records that do not meet this constraint are siphoned off to an error pipeline
Infrastructure Drift Handling in StreamSets
• This pipeline duplicates the data into two Kafka instances that may be incompatible
• No recompilation or dependency reconciliation required to work in such environments
• Ideal for handling upgrades, onboarding of new clusters, applications, etc.
Drift Detection and Handling
• Structural drift
- Flume and Kafka: no support.
- StreamSets: support for loosely defined intent specification for structural validation.
• Semantic drift
- Flume and Kafka: no support.
- StreamSets: support for complex preconditions and routing for semantic data validation.
• Infrastructure drift
- Flume and Kafka: no support.
- StreamSets: support for class-loader isolation to ensure operation in binary-incompatible environments.
Enable runtime checks for detection and handling of drift
Poll: Prevalence of Data Drift
How often do you encounter data drift as we’ve described it?
o Never
o Occasionally
o Frequently
Results: Prevalence of Data Drift
Recipe #4
Use the right tool for the right job
Use the Right Tool for the Job!
• Apache Flume: best suited for long-range bulk data transfer with basic routing and filtering support
• Apache Kafka: best suited for data democratization
• StreamSets: best suited for any-to-any data movement, drift detection and handling, complex routing and filtering, etc.
Use these systems together to build your best-in-class ingestion infrastructure!
Sustained data ingestion is critical to the success of modern data environments
Thank You!
Learn More at www.streamsets.com
• Get StreamSets: http://streamsets.com/opensource/
• Documentation: http://streamsets.com/docs
• Source Code: https://github.com/streamsets/
• Issue Tracker: https://issues.streamsets.com/
• Mailing List: https://goo.gl/wmFPLt
• Other Resources: http://streamsets.com/resources/
Enterprise Use Case Webinar - PaaS Metering and Monitoring
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 

Recently uploaded

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfrahulyadav957181
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 

Recently uploaded (20)

The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Rithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdfRithik Kumar Singh codealpha pythohn.pdf
Rithik Kumar Singh codealpha pythohn.pdf
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 

Building Continuously Curated Ingestion Pipelines

  • 1. Building Continuously Curated Ingestion Pipelines Recipes for Success Arvind Prabhakar
  • 2. “Thru 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges” @nheudecker tweet (Gartner, 26 Feb 2015)
  • 3. What is Data Ingestion?  Acquiring data as it is produced from data source(s)  Transforming it into a consumable form  Delivering the transformed data to the consuming system(s) The challenge: Doing this continuously, and at scale across a wide variety of sources and consuming systems!
  • 4. Why is Data Ingestion Difficult? (Hint: Drift)  Modern data sources and consuming applications evolve rapidly (Infrastructure Drift)  Data produced changes without notice independent of consuming applications (Structural Drift)  Data semantics change over time as same data powers new use cases (Semantic Drift)
  • 5. Continuous data ingestion is critical to the success of modern data environments
  • 6. Plan Your Ingestion Infrastructure Carefully!  Plan ahead: Allocate time and resources specifically for building out your data ingestion infrastructure  Plan for future: Design ingestion infrastructure with sufficient extensibility to accommodate unknown future requirements  Plan for correction: Incorporate low-level instrumentation to help understand the effectiveness of your ingestion infrastructure, correcting it as your systems evolve
  • 7. The Benefits of Well-Designed Data Ingestion  Minimal effort needed to accommodate changes: Handle upgrades, onboard new data sources/consuming applications/analytics technologies, etc.  Increased confidence in data quality: Rest assured that your consuming applications are working with correct, consistent and trustworthy data  Reduced latency for consumption: Allow rapid consumption of data and remove any need for manual intervention for enabling consuming applications
  • 8. Recipe #1 Create decoupled ingest infrastructure An Independent Infrastructure between Data Sources and Consumers
  • 9. Decoupled Ingest Infrastructure • Decouple data format and packaging from source to destination - Example: read CSV files and write Sequence files • Break down input data into its smallest meaningful representation - Example: individual log record, individual tweet record, etc. • Implement asynchronous data movement from source to destination - The data movement process is independent of the source or consumer process An Independent Infrastructure between Data Sources and Consumers
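A minimal sketch of these three ideas in Python (the function names and the JSON-lines output format are my own illustration, not from the deck): input is broken into individual records, moved asynchronously through a queue, and repackaged into a different format at the destination.

```python
import csv
import io
import json
import queue
import threading

def ingest(csv_text, out):
    """Break CSV input into individual records and move them asynchronously."""
    q = queue.Queue()

    def deliver():
        # Consumer: runs independently of the producer, repackaging each
        # record into a destination format (JSON lines here).
        while True:
            record = q.get()
            if record is None:          # sentinel: producer is done
                break
            out.append(json.dumps(record, sort_keys=True))

    t = threading.Thread(target=deliver)
    t.start()
    for row in csv.DictReader(io.StringIO(csv_text)):
        q.put(dict(row))                # smallest meaningful unit: one record
    q.put(None)
    t.join()

out = []
ingest("id,msg\n1,hello\n2,world\n", out)
```

The source format (CSV), the transport (an in-process queue), and the destination format (JSON lines) are each swappable without touching the other two, which is the point of the decoupling.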
  • 10. Decoupled Ingestion Using Apache Flume: Fan-in Log Aggregation • Load data into client-tier Flume agents • Route data through intermediate tiers • Deposit data via collector tiers (Diagram: log files flow through Tier-1, Tier-2, and Tier-3 Flume agents into the Hive warehouse)
  • 11. Decoupled Ingestion Using Apache Flume + StreamSets • Use StreamSets Data Collector to onboard data into Flume from a variety of sources (Diagram: an SDC tier feeds Tier-2 and Tier-3 Flume agents, which deposit into the Hive warehouse)
  • 12. Decoupled Ingestion Using Apache Kafka: Pub-Sub Log Flow • Producers load data into Kafka topics from various log files • Consumers read data from the topics and deposit it to the destination (Diagram: producers read log files into Apache Kafka; consumers deliver to the Hive warehouse)
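The pub-sub flow above can be sketched with an in-process stand-in for the broker (a real deployment would use Kafka producer and consumer clients; `MiniBus` and the topic name are illustrative only):

```python
from collections import defaultdict

class MiniBus:
    """In-process stand-in for a Kafka broker: named topics as append-only logs."""
    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, record):
        self.topics[topic].append(record)

    def consume(self, topic, offset=0):
        # Consumers read at their own offset, independently of producers;
        # this independence is what decouples the two sides.
        return self.topics[topic][offset:]

bus = MiniBus()
for line in ["GET /a", "GET /b"]:
    bus.produce("weblogs", line)          # producer side: one record per log line

warehouse = list(bus.consume("weblogs"))  # consumer side: deposit to destination
```

Because the topic is an append-only log, any number of consumers can read the same data at their own pace without coordinating with the producers.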
  • 13. Decoupled Ingestion Using Apache Kafka + StreamSets (Diagram: a standalone SDC produces into Apache Kafka; a cluster SDC consumes and deposits into the Hive warehouse)
  • 14. Decoupled Ingest Infrastructure: An Independent Infrastructure between Data Sources and Stores/Consumers. Decouple formats: Flume uses built-in clients, sources, and serializers (extend if necessary); Kafka relies on third-party producers and consumers (write your own if necessary); StreamSets has built-in any-to-any format conversion support. Smallest representation: Flume and Kafka treat the payload as opaque (the Event body in Flume); StreamSets uses an interpreted record format. Asynchronous data movement: Flume yes; Kafka yes; StreamSets yes*.
  • 15. Poll: Tools for Ingest What tools do you use for Ingest? o Kafka o Flume o Other
  • 17. Data scientists spend 50 to 80 percent of their time in mundane labor of collecting and preparing unruly digital data, before it can be explored. “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights” The New York Times -- Steve Lohr, Aug 17, 2014
  • 18. Recipe #2 Implement in-stream data sanitization Prepare data using in-stream transformations to make it consumption-ready
  • 19. In-Stream Data Sanitization • Convert and enforce data types where necessary - Example: turn strings into integers to enable type matching for consuming systems • Filter and route data depending upon downstream requirements - Example: build routing logic to filter out invalid or low-value data to reduce processing costs • Enrich and transform data where needed - Example: extract and enrich credit card type information before masking the card number Prepare data using in-stream transformations to make it consumption-ready
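All three sanitization steps can be seen in one small Python function (field names and the card-type rule are hypothetical, chosen only to mirror the slide's examples):

```python
def sanitize(record):
    """Return a consumption-ready copy of one payment record, or None to
    filter it out. Field names here are illustrative, not from the deck."""
    try:
        amount = int(record["amount"])       # enforce type: string -> integer
    except (KeyError, ValueError):
        return None                          # filter out invalid records
    card = record.get("card", "")
    card_type = "visa" if card.startswith("4") else "other"  # enrich first...
    return {
        "amount": amount,
        "card_type": card_type,
        "card": "*" * 12 + card[-4:],        # ...then mask the card number
    }

clean = [r for r in map(sanitize, [
    {"amount": "120", "card": "4111111111111111"},
    {"amount": "oops", "card": "5500000000000004"},   # dropped: bad amount
]) if r is not None]
```

Note the ordering: the card type is extracted before masking, exactly as the slide's enrichment example requires, since the masked value no longer carries that information.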
  • 20. Data Sanitization using Apache Flume • Use Interceptors to convert data types where necessary • Use Channel Selectors for contextual routing in conjunction with Interceptors • Use Interceptors to enrich or transform messages (Diagram: a Flume agent's source feeds interceptors, which fan out to channels and sinks) Flume provides a few Interceptors for static decoration and rudimentary data transformations
  • 21. Data Sanitization using Apache Kafka • No built-in support for data types beyond a broad value type • Use custom producers that choose topics, thus implementing routing and filtering • Use custom producers and consumers to implement validation and transformations where needed Producer-consumer chains can be applied between topics to achieve data sanitization (Diagram: producers write to Topic A; a sanitizing consumer-producer pair moves data from Topic A to Topic B, where downstream consumers read it)
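The Topic A to Topic B chain reduces to a consumer that validates each record and re-produces only the clean ones (sketched here with a dict of lists standing in for the broker; topic and field names are illustrative):

```python
def sanitizing_consumer(topics):
    """Read raw records from 'topic-a', validate and transform them,
    and write the survivors to 'topic-b'."""
    for raw in topics["topic-a"]:
        try:
            rec = {"user": raw["user"], "amount": float(raw["amount"])}
        except (KeyError, ValueError, TypeError):
            continue                     # drop records that fail validation
        topics["topic-b"].append(rec)

topics = {"topic-a": [{"user": "ann", "amount": "9.5"},
                      {"user": "bob"}],            # invalid: missing amount
          "topic-b": []}
sanitizing_consumer(topics)
```

Downstream consumers then subscribe to Topic B only, so they never see records that failed validation.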
  • 22. Data Sanitization using StreamSets • Built-in routing, filtering, and validation using minimal schema specification • Support for JavaScript, Python, and Java EL as well as Java extensions • Built-in transformations including PII masking, anonymization, etc.
  • 23. In-Stream Data Sanitization: Prepare data using in-stream transformations to make it consumption-ready. Introspection/validation: Flume offers minimal support (use custom Interceptors or morphlines); Kafka offers none (must be done within custom producers and consumers); StreamSets has extensive built-in support for introspection and validation. Filtering/routing: Flume offers rudimentary support using built-in Interceptors and requires custom interceptors beyond that; Kafka offers none; StreamSets provides sophisticated built-in contextual routing based on data introspection. Enrichment/transformation: Flume offers minimal support and requires custom interceptors; Kafka offers none; StreamSets has extensive support via built-in functions as well as scripting processors.
  • 24. Poll: Preparing Data for Consumption For the most part, at what point do you prepare your data for consumption? o Before Ingest o During Ingest o After Ingest
  • 25. Results: Preparing Data for Consumption
  • 26. Recipe #3 Implement drift-detection and handling Enable runtime checks for detection and handling of drift
  • 27. Drift Detection and Handling • Specify and validate intent rather than schema to catch structural drift - Example: data must contain a certain attribute, not an attribute at a particular ordinal • Specify and validate semantic constraints to catch semantic drift - Example: a service operating within a particular city must have bounded geo-coordinates • Isolate ingest logic from downstream systems to enable infrastructure drift handling - Example: use isolated class-loaders to enable writing to binary-incompatible systems Enable runtime checks for detection and handling of drift
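The first two checks can be expressed as small runtime predicates (attribute names and the bounding-box values below are illustrative, not from the deck):

```python
def check_structure(record, required={"lat", "lon", "city"}):
    """Intent check: the record must contain these attributes by NAME,
    regardless of field order or extra columns (catches structural drift)."""
    return required <= set(record)

def check_semantics(record, bounds=(37.2, 37.9, -122.5, -121.7)):
    """Semantic check: a service operating within one city should report
    geo-coordinates inside a plausible bounding box (catches semantic drift)."""
    lat_lo, lat_hi, lon_lo, lon_hi = bounds
    return lat_lo <= record["lat"] <= lat_hi and lon_lo <= record["lon"] <= lon_hi

rec = {"city": "SF", "lat": 37.77, "lon": -122.42, "extra": 1}
ok = check_structure(rec) and check_semantics(rec)
```

Because the structural check is by name rather than position, records gain or reorder fields without breaking it; the semantic check flags data whose shape is fine but whose meaning has drifted.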
  • 28. Drift Handling in Apache Flume and Kafka  No support for structural drift: Flume and Kafka are low-level frameworks that work with opaque data; you need to handle this in your code.  No support for semantic drift: again, because Flume and Kafka work with opaque data, you need to handle this in your code.  No support for infrastructure drift: these systems do not provide class-loader isolation or similar mechanisms, and their extension model makes this impractical to handle in your code in most cases.
  • 29. Structural Drift Handling in StreamSets • This stream-selector is using a header-name to identify an attribute • If the data is delimited (e.g. CSV) the column position could vary from record to record • If the attribute is absent, the record flows through the default stream
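The header-name point generalizes beyond StreamSets: reading delimited data by name rather than ordinal makes the pipeline immune to column reordering, as this small stdlib sketch shows (the `amount` field name is illustrative):

```python
import csv
import io

def amounts(csv_text):
    """Read a delimited stream by header NAME, so column position can vary
    from file to file without breaking the pipeline."""
    return [row["amount"] for row in csv.DictReader(io.StringIO(csv_text))]

a = amounts("user,amount\nann,5\n")
b = amounts("amount,user\n5,ann\n")   # columns reordered: same result
```

A reader keyed on column position would return different values for these two inputs; a reader keyed on the header name does not.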
  • 30. Semantic Drift Handling in StreamSets • This stage specifies a required field as well as preconditions for field value validation • Records that meet this constraint flow through the pipeline • Records that do not meet this constraint are siphoned off to an error pipeline
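A required-field plus precondition stage amounts to partitioning the stream into a main pipeline and an error pipeline, roughly like this (field names and the precondition are hypothetical, chosen to mirror the slide):

```python
def route(records, required="card", precondition=lambda r: r.get("amount", 0) > 0):
    """Split records into a main stream and an error stream, mimicking a
    required-field + precondition validation stage."""
    good, errors = [], []
    for r in records:
        if required in r and precondition(r):
            good.append(r)
        else:
            errors.append(r)            # siphoned off to an error pipeline
    return good, errors

good, errors = route([
    {"card": "4111", "amount": 10},
    {"amount": 10},                     # missing the required field
    {"card": "4111", "amount": 0},      # fails the precondition
])
```

Keeping the rejects in a separate stream, rather than dropping them, is what lets an operator inspect and replay records once the drift is understood.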
  • 31. Infrastructure Drift Handling in StreamSets • This pipeline duplicates the data into two Kafka instances that may be incompatible • No recompilation or dependency reconciliation required to work in such environments • Ideal for handling upgrades, onboarding of new clusters, applications, etc.
  • 32. Drift Detection and Handling: Enable runtime checks for detection and handling of drift. Structural drift: no support in Flume or Kafka; StreamSets supports loosely defined intent specification for structural validation. Semantic drift: no support in Flume or Kafka; StreamSets supports complex preconditions and routing for semantic data validation. Infrastructure drift: no support in Flume or Kafka; StreamSets supports class-loader isolation to ensure operation in binary-incompatible environments.
  • 33. Poll: Prevalence of Data Drift How often do you encounter data drift as we’ve described it? o Never o Occasionally o Frequently
  • 35. Recipe #4 Use the right tool for the right job
  • 36. Use the Right Tool for the Job!  Apache Flume: Best suited for long range bulk data transfer with basic routing and filtering support  Apache Kafka: Best suited for data democratization  StreamSets: Best suited for any-to-any data movement, drift detection and handling, complex routing and filtering, etc. Use these systems together to build your best in class ingestion infrastructure!
  • 37. Sustained data ingestion is critical to the success of modern data environments
  • 38. Thank You! Learn More at www.streamsets.com  Get StreamSets : http://streamsets.com/opensource/  Documentation: http://streamsets.com/docs  Source Code: https://github.com/streamsets/  Issue Tracker: https://issues.streamsets.com/  Mailing List: https://goo.gl/wmFPLt  Other Resources: http://streamsets.com/resources/