Data Governance in Apache Falcon - Hadoop Summit Brussels 2015

Data governance is an important element of enterprise data management. As Hadoop makes its way into the enterprise, there is a pressing need for a comprehensive data governance solution in this space. Apache Falcon approaches big data management holistically, capturing metadata for governance policies and for changes to every data asset and data application, thereby enabling comprehensive lineage, change management and access control. This talk covers how Apache Falcon (incubating) addresses some of the key challenges in this area and discusses case studies of how Apache Falcon is used to implement data governance on enterprise big data platforms.

  1. 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Falcon Hadoop Data Governance Hortonworks. We do Hadoop.
  2. 2. Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Venkatesh Seetharam Architect, Data Management Hortonworks Inc. PMC, Apache Falcon PMC, Apache Knox Proposed Apache Atlas
  3. 3. Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Agenda Overview Components Features Governance
  4. 4. Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Motivation for Apache Falcon
  5. 5. Page5 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Simple Data Pipeline… (diagram) An Oozie workflow on HDFS / YARN runs a Pig job and a Hive job, moving data from a landing zone (source_db.raw_input_table, partitions 2014-01-01-10, 2014-01-01-12, … Partition N) into materialized views (source_db.input_table, partitions 2014-01-01-10, 2014-01-01-12, … Partition N)
  6. 6. Page6 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Add Data Management Capability to the Pipeline (same diagram, annotated with requirements): Frequent Feeds, Late Data Arrival, Replication, Retention, Archival, Exception Handling, Lineage, Audit, Monitoring
  7. 7. Page7 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Pipeline Becomes Considerably More Complex: these data management requirements (Frequent Feeds, Late Data Arrival, Replication, Retention, Archival, Exception Handling, Lineage, Audit, Monitoring) result in many complex Oozie workflows around the Pig and Hive jobs
  8. 8. Page8 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Introduction to Apache Falcon
  9. 9. Page9 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Falcon Overview Centrally Manage Data Lifecycle – Centralized definition & management of pipelines for data ingest, process & export Business Continuity & Disaster Recovery – Out of the box policies for data replication & retention – End to end monitoring of data pipelines Address audit & compliance requirements – Visualize data pipeline lineage – Track data pipeline audit logs – Tag data with business metadata The data traffic cop
  10. 10. Page10 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Complicated Pipeline Simplified with Apache Falcon (diagram): users submit & schedule Falcon entities (Cluster, Feed, Process); the Falcon engine generates and instruments the Oozie workflows, handling Frequent Feeds, Late Data Arrival, Replication, Retention, Archival, Exception Handling, Lineage, Audit and Monitoring
  11. 11. Page11 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Falcon Architecture (diagram): a centralized Falcon orchestration framework in which data stewards and Hadoop admins submit entity specs to the Falcon Server through its API & UI (alongside Ambari); the server schedules jobs on Oozie, reports process status over JMS, and drives Hadoop ecosystem tools (MapReduce / Pig / Hive / Sqoop / Flume / DistCP) against HDFS / Hive
  12. 12. Page12 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Falcon Basic Concepts • Cluster: Represents the “interfaces” to a Hadoop cluster • Feed: Defines a “dataset” (file, Hive table or stream) • Process: Consumes feeds, invokes processing logic & produces feeds. Put together, these entities (a feed is input to a process, which creates feeds, all on a cluster) represent ‘Data Pipelines’ in Hadoop (an illustrative cluster definition is sketched after the slide list)
  13. 13. Page13 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Pipeline: Definition • Flexible pipeline specification – JAXB / JSON / Java / XML – Modular: clusters, feeds & processes defined separately and then linked together (see the process entity sketch after the slide list) – Easy to re-use across multiple pipelines • Out-of-the-box policies – Predefined policies for replication, late data handling & eviction – Easy customization of policies • Extensible – Plug in external solutions at any step of the pipeline – e.g. invoke third-party data obfuscation components
  14. 14. Page14 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Flexibility in Processing Common types of processing engines can be tied to Falcon processes Oozie workflows Pig scripts HQL scripts
  15. 15. Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Pipeline: Monitoring (diagram): centralized monitoring of data pipelines with Falcon + Ambari across a primary site (Hadoop Cluster-1) and a DR site (Hadoop Cluster-2), each running raw, clean and prep stages; provides pipeline scheduling, pipeline run history and pipeline run alerts
  16. 16. Page16 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Replication with Falcon (diagram): the primary Hadoop cluster holds Staged, Cleansed, Conformed and Presented data; Staged and Presented data are replicated to a failover Hadoop cluster that serves BI / Analytics (BusinessObjects BI) • Falcon manages workflow and replication • Enables business continuity without requiring full data reprocessing • Failover clusters can be smaller than primary clusters (a replication feed fragment is sketched after the slide list)
  17. 17. Page17 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Retention with Falcon (diagram): each dataset (Staged, Cleansed, Conformed, Presented) carries its own retention policy, such as retain 5 years, retain 3 years, or retain last copy only • Sophisticated retention policies expressed in one place • Simplify data retention for audit, compliance, or for data re-processing (a retention fragment is sketched after the slide list)
  18. 18. Page18 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Late Data Handling with Falcon (diagram): Online Transaction Data (via Sqoop) and Web Log Data (via FTP) are staged and combined, waiting up to 4 hours for the FTP data to arrive • Processing waits until all required input data is available • Checks for late data arrivals and retriggers processing as necessary • Eliminates writing complex data-handling rules within applications (a late-data fragment is sketched after the slide list)
  19. 19. Page19 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Metadata Services with HCatalog: Apache Falcon provides metadata services via HCatalog, moving from raw Hadoop data (inconsistent, unknown, tool-specific access) to table access, aligned metadata and a REST API • Consistency of metadata and data models across tools (MapReduce, Pig, HBase, and Hive) • Accessibility: share data as tables in and out of HDFS • Availability: enables flexible, thin-client access via REST API. Shared table and schema management opens the platform (a catalog-backed feed fragment is sketched after the slide list)
  20. 20. Page20 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Governance in Apache Falcon
  21. 21. Page21 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Data Pipeline: Tracing (example feeds: Purchase, Customer, Product, Store) • Data pipeline dependencies: view dependencies between clusters, datasets and processes • Data pipeline tagging: add arbitrary tags to feeds & processes • Data pipeline audits (coming soon): know who modified a dataset, when, and into what • Data pipeline lineage: analyze how a dataset reached a particular state
  22. 22. Page22 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Custom Metadata in Falcon • Metadata on Ingest (Content) – What format do I expect my data to be in? – What source systems did the data come from, and who owns it? – Answer: ingest descriptors + HCat schema versioning • Metadata for Security (Access Controls) – How is each column blinded or encrypted? – Can I trust that I can join data across tables? What if email is encrypted differently? – Answer: security descriptors • Metadata for Lineage (Source, History) – How do I chase down the sources of data leading to reports and datasets? – Answer: lineage carried forward per workflow • Metadata for Marts (Usage Constraints, Enrichment) – How do I materialize views and drop views as needed?
  23. 23. Page23 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Entity Dependency in Falcon • Dependencies between Falcon entity definitions: cluster, feed & process – Lineage attributes: workflows, input/output feed windows, user, input and output paths, workflow engine, input/output size
  24. 24. Page24 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Lineage in Falcon
  25. 25. Page25 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Audit, Tagging and Access Control • Tagging – Allows custom tags in entities – Can decorate process entities with pipeline names • Access Control – Support for ACLs in entities – Authorization driven by the ACLs in entities • Audit – Each execution is controlled by Falcon and runs are audited – Correlate the execution with lineage (design) • Search – Search based on tags, pipelines, etc. – Full-text search (a tags/ACL fragment is sketched after the slide list)
  26. 26. Page26 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Technology • Metadata Repository – Titan graph database – Pluggable backing store: BerkeleyDB JE, HBase • Entity Metadata – Tags and entities are stored in the repository • Execution Metadata – Execution metadata is stored in the repository as well; this is unique to Falcon – Optional inputs • Search – Pluggable backend – Solr or Elasticsearch
  27. 27. Page27 © Hortonworks Inc. 2011 – 2014. All Rights Reserved New in Apache Falcon 0.6.0 What is coming soon?
  28. 28. Page28 © Hortonworks Inc. 2011 – 2014. All Rights Reserved DR Mirroring of HDFS with Recipes • Mirroring for disaster recovery and business continuity use cases • Customizable for multiple targets and frequency of synchronization • Recipes: a template model for re-use of complex workflows (diagram: recipe properties files bound to a shared workflow template)
  29. 29. Page29 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Replication to Cloud • Seamlessly replicate to cloud targets • Replicate from the cloud as a source • Support for Amazon S3 and Microsoft Azure (diagram: on-prem cluster replicating to/from Azure and Amazon S3)
  30. – 35. Pages 30–35 © Hortonworks Inc. 2011 – 2014. All Rights Reserved (no extractable text; image-only slides)
  36. 36. Page36 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Q & A
  37. 37. Page37 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Thank you! Learn more at: hortonworks.com/hadoop/falcon/
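
The sketches below are not part of the deck; they are minimal, hypothetical Falcon entity snippets added to make the concepts on slides 12–25 concrete. All names, hosts, paths and schedules are illustrative placeholders, not values shown in the talk. First, the “interfaces” of a Hadoop cluster (slide 12) are captured in a cluster entity:

```xml
<!-- Hypothetical cluster entity; endpoints, ports and versions are placeholders -->
<cluster name="primary-cluster" colo="primary-dc" description="Primary Hadoop cluster"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly"  endpoint="hftp://nn.example.com:50070" version="2.2.0"/>
    <interface type="write"     endpoint="hdfs://nn.example.com:8020"  version="2.2.0"/>
    <interface type="execute"   endpoint="rm.example.com:8050"         version="2.2.0"/>
    <interface type="workflow"  endpoint="http://oozie.example.com:11000/oozie/" version="4.0.0"/>
    <interface type="messaging" endpoint="tcp://mq.example.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/primary-cluster/staging"/>
    <location name="temp"    path="/tmp"/>
    <location name="working" path="/apps/falcon/primary-cluster/working"/>
  </locations>
</cluster>
```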
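Slide 13's modular pipeline definition links separately defined feeds and a cluster inside a process entity. A minimal sketch, assuming an hourly Pig job and hypothetical feed names:

```xml
<!-- Hypothetical process entity; feed names, script path and frequency are placeholders -->
<process name="clean-raw-input" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary-cluster">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="input" feed="raw-input-feed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="clean-feed" instance="now(0,0)"/>
  </outputs>
  <!-- the processing logic is an ordinary Pig script; Oozie workflows and HQL scripts (slide 14) plug in the same way -->
  <workflow engine="pig" path="/apps/pipelines/clean.pig"/>
  <retry policy="periodic" delay="minutes(10)" attempts="3"/>
</process>
```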
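Replication (slide 16) is expressed on a feed by listing more than one cluster, one as source and one as target; Falcon generates the copy workflows from this declaration. A fragment of a hypothetical feed definition:

```xml
<!-- Fragment: inside a feed entity's <clusters> element; cluster names and dates are placeholders -->
<clusters>
  <cluster name="primary-cluster" type="source">
    <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    <retention limit="days(90)" action="delete"/>
  </cluster>
  <cluster name="failover-cluster" type="target">
    <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    <retention limit="days(90)" action="delete"/>
  </cluster>
</clusters>
```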
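The per-dataset retention policies of slide 17 live in the same place: each `<cluster>` inside a feed carries its own `<retention>` element, so staged, cleansed, conformed and presented feeds can each declare a different limit. The 5-year and 3-year limits from the slide would look roughly like this (each line belongs to a different feed's definition):

```xml
<!-- Fragment: retention inside one feed's <cluster> element, e.g. staged data retained 5 years -->
<retention limit="months(60)" action="delete"/>

<!-- Fragment: retention inside another feed's <cluster> element, e.g. conformed data retained 3 years -->
<retention limit="months(36)" action="delete"/>
```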
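Slide 18's four-hour wait for late FTP data maps to two elements: a `<late-arrival>` cut-off on the feed and a `<late-process>` policy on the consuming process. A sketch with hypothetical input and path names:

```xml
<!-- Fragment: on the feed delivering web log data over FTP -->
<late-arrival cut-off="hours(4)"/>

<!-- Fragment: on the process that combines the inputs; retriggers late reprocessing with exponential backoff -->
<late-process policy="exp-backoff" delay="hours(1)">
  <late-input input="weblogs" workflow-path="/apps/pipelines/late-reprocess"/>
</late-process>
```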
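With HCatalog (slide 19), a feed can reference a Hive/HCatalog table instead of raw HDFS paths, so the same table metadata is visible to MapReduce, Pig and Hive. A hypothetical catalog-backed storage declaration inside a feed:

```xml
<!-- Fragment: feed storage pointing at an HCatalog table instead of a <locations> path;
     database, table and partition key are placeholders -->
<table uri="catalog:source_db:input_table#ds=${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
```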
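Tagging and access control (slide 25) are declared directly on entities: a free-form `<tags>` element of key=value pairs (here a pipeline name is shown simply as a tag) and an `<ACL>` element that drives authorization. Owner, group and tag values below are illustrative:

```xml
<!-- Fragment: near the top of a feed or process entity -->
<tags>pipeline=clickstream, owner=data-platform, classification=internal</tags>

<!-- Fragment: near the end of the same entity definition -->
<ACL owner="etl-user" group="etl" permission="0755"/>
```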