Building scalable data pipelines for big data involves dealing with legacy systems, implementing data lineage and provenance, managing the data lifecycle, and engineering pipelines that can handle large volumes of data. Effective data pipeline engineering requires understanding how to extract, transform, and load data while addressing privacy and security concerns and integrating diverse data sources. Frameworks like Cascading can help build pipelines, but proper testing and scaling are also required to develop robust solutions.
2. Agenda
§ Introduction
§ Dealing with Legacy
§ Data Lineage and Provenance
§ Data Lifecycle Management
§ Data Pipeline Engineering for fun and profit
3. Big Data Introduction
Current state of Big Data Landscape
§ Hadoop
§ Solves for the three V’s: Volume, Velocity, and Variety
§ Primarily batch processing for large data sets
§ Hadoop 2 YARN: distributed computing platform
§ Not only Hadoop!
§ Real-time systems: Storm, Spark, Samza, etc.
§ Wide variety of NoSQL systems: Cassandra, Riak, etc.
§ Don’t forget Legacy!
4. Big Data Promise
Why is Big Data so hot?
§ This is what the Big Data vendors sell:
§ Throw some data in
§ Analyze it using map/reduce
§ Visualize your analytics/Generate insights
§ Do some Predictive Analytics or Recommendations
§ Profit!!
§ Rinse and Repeat!!!
5. Big Data Problems
Why is Big Data so hard?
§ Real-life environments are not that simple!
§ For instance, privacy and compliance issues
§ Extract, Transform and Load is non-trivial
§ Building reliable ingest across complex
environments
§ Data Lifecycle Management is not mature yet
6. Legacy Data
Why are Legacy environments important for Big Data?
§ Outside of Silicon Valley:
§ Companies have been around for a while
§ Have lots of valuable legacy data
§ Some of it in Mainframes
§ Some of it in flat files
§ Some of it in relational DBs
7. Mainframe Data
How would you handle Mainframe data?
§ The open source Hadoop ecosystem does not provide a way to import data from mainframes
§ Only a commercial solution is available as of today
§ Think about that for a second
8. More on Mainframes
Why worry about Mainframe data?
§ Mainframes still run important systems
§ Separates schema from data (kinda like
Hadoop)
§ COBOL Copybooks
§ Hadoop can offload legacy data processing
§ But, you must first get the data in!!!
9. Other Legacy issues
Random collection of issues in dealing with legacy data
§ Unknown or incorrigible schema
§ Invalid data
§ Inconsistent data
§ Missing data
§ Fuzzy data
§ Sparse data
10. Big Data ETL
What is the problem with it?
§ First of all, the name
§ Extract, Transform, Load was coined in the old days, when data sets were smaller
§ Inherent assumption that the transformation will happen out of band
§ Assumption does not hold for Big Data!
11. ELT
Will ELT solve the problem?
§ Flip the transform and load steps
§ Get the data in and then transform it
§ This way the transform is not out of band
§ Leverage the power of the underlying Big Data
platform to do the transform
§ Makes perfect sense … except when
12. Privacy and Security
Issues with ELT approach for privacy and security
§ Loading raw data before transforming it poses
privacy and security challenges
§ What if the raw data contains SSNs or credit card numbers?
§ What if it is only meant to be seen by a few?
§ Once you load it, the data is available
13. The solution
Deal with it during extraction (as best you can)
§ Do a secure extract
§ Perform a security/privacy audit of the raw data and build in rules to mask/anonymize/scrub data during the extraction (sketched below)
§ Somewhat solves the security problem but
complicates the Extract step
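As a rough illustration of building masking rules into the extract step, here is a minimal Java sketch; the patterns and mask formats are assumptions, not a prescription:

```java
import java.util.regex.Pattern;

// Sketch of masking rules applied during extraction, before anything
// lands on the cluster. Patterns and mask formats are illustrative.
public class SecureExtract {
    private static final Pattern SSN =
        Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern CARD =
        Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");

    /** Mask sensitive values in a raw record before it leaves the extractor. */
    public static String scrub(String record) {
        String masked = SSN.matcher(record).replaceAll("XXX-XX-XXXX");
        return CARD.matcher(masked).replaceAll("CARD-MASKED");
    }
}
```

Every raw record would pass through scrub() on its way to the landing zone, so sensitive values never reach the platform unmasked.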
14. Some exceptions
What if you don’t know which parts of the data set need to be protected?
§ Secure extract assumes that the data schema
is known and the privacy levels are known
§ Not a valid assumption at all times
§ E.g., what if the legacy data set has Facebook profile data from before the new privacy rules went into effect?
15. Data Lineage and Provenance
What is data lineage and data provenance?
Data Lineage
Data Lineage records the origin of the data set.
This includes the time, place, original format
and privacy/security information.
Data Provenance
Records all the change history to the data set.
This includes timestamp, change agent,
purpose, process and edit log.
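To make the two definitions concrete, here is a minimal sketch of the records they imply, limited to the fields named above (types and names are assumptions):

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Illustrative data holders mirroring the lineage and provenance
// fields named in the definitions above.
public class LineageRecord {
    Instant originTime;     // when the data set was captured
    String originPlace;     // source system or location
    String originalFormat;  // e.g., COBOL copybook, CSV, Avro
    String privacyLevel;    // privacy/security classification

    // Provenance: the full change history of the data set.
    List<ProvenanceEntry> history = new ArrayList<>();

    static class ProvenanceEntry {
        Instant timestamp;  // when the change happened
        String changeAgent; // who or what made the change
        String purpose;     // why it was made
        String process;     // the job or workflow that did it
        String editLog;     // details of the edit
    }
}
```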
16. Data Lineage
Why is this a big deal?
§ Let’s go back to the Facebook problem
§ The solution is to record lineage information
§ This protects the consumer of the data set: it assures that the data was available for use as of the point and time of origin
§ Protect yourself from lawsuits and fines!
17. Metadata
Data about the data
§ Astute observation: Metadata extraction is an
integral part of managing data and
implementing data lineage and data
provenance
§ It can be rule-based, but increasingly automated systems are desirable
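A rule-based extractor can be as simple as a pattern table applied to sampled field values; the rules below are purely illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Minimal rule-based metadata tagger: the first pattern that matches
// a sampled value decides the tag. A real catalog would be far richer.
public class MetadataTagger {
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    public MetadataTagger() {
        rules.put("SSN", Pattern.compile("\\d{3}-\\d{2}-\\d{4}"));
        rules.put("EMAIL", Pattern.compile("[^@\\s]+@[^@\\s]+"));
        rules.put("DATE", Pattern.compile("\\d{4}-\\d{2}-\\d{2}"));
    }

    /** Returns a metadata tag for a sampled field value, or UNKNOWN. */
    public String tag(String sampleValue) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            if (rule.getValue().matcher(sampleValue).matches()) {
                return rule.getKey();
            }
        }
        return "UNKNOWN";
    }
}
```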
18. Data Provenance
Why is it important?
§ Data Lineage solves one piece of the puzzle –
namely, origin and metadata
§ What if data is changed during or after the
extract step?
§ For purposes of audit and traceability, this
must be recorded!
19. Data Provenance Approach
How to implement data provenance?
§ Can be workflow-based or dataflow-based
§ Workflow-based is much easier
§ Records the changes as part of the
workflow
§ Dataflow-based is much harder
§ Needs to record each and every access to the data
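A minimal sketch of the workflow-based approach: wrap each step so that running it also records a provenance entry. The step names, agent, and log destination are hypothetical:

```java
import java.time.Instant;
import java.util.function.UnaryOperator;

// Workflow-based provenance sketch: every step logs what it did as it
// runs. A real system would write to a provenance store, not stdout.
public class ProvenanceStep {
    public static String run(String stepName, String agent,
                             String data, UnaryOperator<String> transform) {
        String result = transform.apply(data);
        System.out.printf("%s | step=%s | agent=%s | change recorded%n",
                          Instant.now(), stepName, agent);
        return result;
    }
}
```

A step would then be invoked as, e.g., ProvenanceStep.run("trim", "ingest-job", raw, String::trim).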
20. Current Toolset
What exists currently in the open source big data ecosystem?
§ Nothing really to help with any of this
§ There are commercial products
§ But, no open source tools yet (or at least
none that are in production use that I am
aware of)
§ Would be a great idea to build one!
21. Data Lifecycle Management
Dealing with data throughout its lifecycle
§ Management of data from ingest to sunset
§ It involves dealing with all of the associated
metadata, lineage and provenance artifacts
§ It also involves moving data around (large
datasets in the Big Data world)
§ That is a data pipeline problem!
22. Data Lifecycle Management Tools
What exists in the Big Data eco system to handle this?
§ Current toolset is pretty limited
§ Apache Falcon (Hadoop sub-project) is a step
in this direction but still not widely available
for production use
§ It is possible to roll your own
§ But, it is a significant engineering effort
23. Modern Data Lifecycle Management
Modern data architecture needs modern data lifecycle management
§ Modern data architecture involves more than
just Hadoop
§ Queuing systems – e.g., Kafka
§ Stream processing – e.g., Storm
§ Real-time systems – e.g., Spark
§ NoSQL systems – e.g., HBase
§ Integration with MPP systems
24. Data Lifecycle Management done right
Data Lifecycle Management across the Big Data Environment
§ Dealing with the various systems in the Big
Data Landscape
§ Ability to set up schedules and periodic runs
§ Also, provide on-demand data processing
§ Treat data as an asset – apply asset
management practices
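As a sketch of the schedules-plus-on-demand requirement, here is what the scaffolding might look like with plain JDK scheduling; in practice a tool like Falcon or Oozie would own this, and the Runnable stands in for the actual pipeline job:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal scheduler supporting both periodic and on-demand runs of a
// pipeline job. Purely illustrative of the capability described above.
public class PipelineScheduler {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** Periodic run, e.g., every 6 hours. */
    public void schedulePeriodic(Runnable pipeline) {
        scheduler.scheduleAtFixedRate(pipeline, 0, 6, TimeUnit.HOURS);
    }

    /** On-demand run: trigger immediately. */
    public void runNow(Runnable pipeline) {
        scheduler.execute(pipeline);
    }
}
```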
25. Data Pipelining for fun and profit
Dealing with data pipelines as a distinct role in the Big Data Engineering world
§ Data Pipeline Engineering is a legitimate role
in the Big Data environment
§ The complexity and all of the attendant issues make it a specialty in its own right!
§ It is much more than just ETL
§ Security, Lineage, Provenance and Lifecycle Management are all essential
26. So you want to be a data pipeline engineer?
What are the tools of the trade and ninja skills to master?
§ Languages: Python, Java, Scala, Pig
§ Systems: Sqoop, Storm, Flume, Hive/HBase
27. Can you make this easier?
I just want to write some code and be done with it
§ Pick your language: Java, Scala, or Clojure
§ Cascading: a full-featured data pipeline framework
§ But, wait!
§ What about all the other stuff?
28. Integration and Extensions
Integrate with your favorite tools and extend when needed
§ Start with a solid pipeline framework like
Cascading (or its offshoots like Scalding or
Cascalog)
§ Integrate with either commercial or open
source tools for specific functionality needed
§ Look at Cascading extensions: Lingual,
Driven and Load
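For flavor, a minimal Cascading 2.x flow in Java: read a tab-delimited file from HDFS, pass it through a pipe, and write it back out. Paths and field names are placeholders:

```java
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

// Smallest possible Cascading flow: source tap -> pipe -> sink tap.
public class CopyFlow {
    public static void main(String[] args) {
        Tap source = new Hfs(new TextDelimited(new Fields("id", "value"), "\t"),
                             "/data/landing/in");
        Tap sink = new Hfs(new TextDelimited(new Fields("id", "value"), "\t"),
                           "/data/processed/out");

        Pipe copy = new Pipe("copy");

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(copy, source)
            .addTailSink(copy, sink);

        new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
    }
}
```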
29. Build your own extensions
Extend Cascading with your own requirements
§ A programming framework such as Cascading makes it much easier to build custom data lineage, provenance, and lifecycle management solutions as extensions
§ You can also integrate with Security and
Privacy solutions
§ This is a flexible approach
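One way this extension point looks in practice, sketched under the assumption that stamping each tuple with provenance information is sufficient: a custom Cascading Function that appends a provenance field. The stamp contents are illustrative:

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

// Custom Cascading Function that appends a provenance stamp to every
// tuple flowing through a pipe. The stamp format is illustrative.
public class ProvenanceStamp extends BaseOperation implements Function {
    public ProvenanceStamp() {
        super(new Fields("provenance")); // declare one new field per tuple
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall functionCall) {
        String stamp = "ingest-job@" + System.currentTimeMillis();
        functionCall.getOutputCollector().add(new Tuple(stamp));
    }
}
```

It would be wired into a pipe with something like new Each(pipe, new ProvenanceStamp(), Fields.ALL).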
30. Build for scale
Understanding scale for data pipelines
§ Scaling data pipelines is quite complex due
to the multiple moving pieces
§ Pipeline is only as fast as the slowest piece
§ Hadoop scales - proven
§ Flume scales - proven
§ Kafka scales - proven
31. Scaling Sqoop
Scaling relational DB loads
§ What about Sqoop?
§ Not as easy or straightforward to scale
§ Start slow and incrementally increase load (sketch below)
§ Watch network statistics and optimize
§ Load aggregates if that is all you need
§ Parallelize as much as possible without killing the DB
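A hedged sketch of that incremental approach, driving Sqoop 1.x programmatically; connection details are placeholders, and the same flags apply on the command line:

```java
import org.apache.sqoop.Sqoop;

// Programmatic Sqoop import: begin with a low mapper count and raise
// --num-mappers run over run while watching the source DB and network.
public class IncrementalSqoopLoad {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales", // placeholder
            "--table", "orders",                              // placeholder
            "--target-dir", "/data/landing/orders",
            "--num-mappers", "4"  // start low; increase incrementally
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}
```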
32. Scaling Storm
§ A lot of these systems depend on ZooKeeper (ZK)
§ Storm also relies on ZeroMQ (this is changing)
§ Provision for average load (not peak load)
§ Benchmark with typical event size
(compress for larger events)
§ Storm on YARN will solve many issues
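A sketch of sizing a topology for average load: parallelism hints per component plus a worker count. TestWordSpout is Storm's built-in test spout; NoopBolt is a stand-in defined inline, and the numbers are illustrative:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

// Sized Storm topology: provision executors and workers for the
// average event rate, not the peak.
public class SizedTopology {
    public static class NoopBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            // drop the tuple; stand-in for real processing
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("events", new TestWordSpout(), 4);  // 4 executors
        builder.setBolt("process", new NoopBolt(), 8)        // 8 executors
               .shuffleGrouping("events");

        Config conf = new Config();
        conf.setNumWorkers(2);  // sized for average load

        StormSubmitter.submitTopology("pipeline", conf, builder.createTopology());
    }
}
```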
33. Scaling Tips
§ Measure end-to-end throughput
§ Benchmark and fine-tune the best-performing parts of the pipeline first
§ Scale the slower parts next – increase
incrementally
§ Batch the slower parts – aggregate if you
can and parallelize as much as possible
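To keep the first tip honest, measure at the ends of the pipeline rather than in the middle; a trivial harness, with the actual end-to-end run standing in as a Runnable:

```java
// Back-of-the-envelope end-to-end throughput check.
public class ThroughputCheck {
    /** Times a full pipeline run and returns records per second. */
    public static double measure(long recordCount, Runnable pipelineRun) {
        long start = System.nanoTime();
        pipelineRun.run();
        double seconds = (System.nanoTime() - start) / 1e9;
        return recordCount / seconds;
    }

    public static void main(String[] args) {
        // Replace simulateRun with the real end-to-end pipeline job.
        double rate = measure(1_000_000L, ThroughputCheck::simulateRun);
        System.out.printf("end-to-end: %.0f records/sec%n", rate);
    }

    private static void simulateRun() {
        try {
            Thread.sleep(500); // stand-in for actual pipeline work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```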
34. Summary
§ Pipeline Engineering will be one of the most
challenging areas in Big Data with several big
issues remaining to be solved
§ Expect plenty of innovation and action in
this space
§ It is a great place to start a Big Data career