Thanks, Justin.
Here are Falcon’s primary features.
1 The first is to manage the data lifecycle in one common place.
2 The second is to facilitate quick deployment of replication for business continuity and disaster recovery use cases. This includes monitoring and a base set of policies for replication and retention.
3 Lastly, Falcon provides foundational audit and compliance features: visualization and tracking of entity lineage, and collection of audit logs.
This is the high-level Falcon architecture.
Falcon runs as a standalone server as part of your Hadoop cluster
A user creates entity specifications and submits them to Falcon using the API
Falcon validates and saves entity specifications to HDFS
Falcon uses Oozie as its default scheduler
Dashboard for entity viewing in Falcon UI
Ambari integration for management
Feeds have a location, a replication schedule, and retention policies
Meta info including frequency, where the data is coming from (source), where to replicate it (target), and how long to retain it
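To make this concrete, here is a minimal sketch of what a feed entity specification could look like. It is an illustration only: the feed name, cluster names, paths, dates, and retention limits are placeholders, not values from the demo.

<!-- Illustrative feed definition; all names, paths, dates, and limits are placeholders -->
<feed name="exampleCleansedFeed" description="hourly cleansed data" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>
  <clusters>
    <!-- source cluster: where the data is produced -->
    <cluster name="primaryCluster" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- target cluster: Falcon replicates the feed here for BC/DR -->
    <cluster name="backupCluster" type="target">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(36)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/cleansed/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl-user" group="etl" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

Note how the frequency, source, target, replication, and retention all live in this one spec, which is what lets Falcon manage the data lifecycle in a single place.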
Let's take a look at the data pipeline, or workflow.
** read high level **
Hive – HQL scripts
Pig scripts
Oozie workflows (a process entity sketch tying these together follows below)
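For illustration, a process entity is roughly how these pieces get tied together. The entity names, feed names, and workflow path below are assumptions made up for this sketch:

<!-- Illustrative process definition; names, feeds, and the workflow path are placeholders -->
<process name="exampleCleanseProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="rawInput" feed="exampleRawFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="cleansedOutput" feed="exampleCleansedFeed" instance="now(0,0)"/>
  </outputs>
  <!-- the workflow element points at the Oozie workflow (it could also reference a Pig script or Hive query) -->
  <workflow engine="oozie" path="/apps/etl/cleanse-workflow"/>
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>

Falcon hands this to Oozie for scheduling, so the pipeline itself stays a plain Oozie/Pig/Hive artifact.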
Once a pipeline is created, you'll want to run it.
This means you'll probably want monitoring as well.
Falcon, in conjunction with Ambari, provides centralized monitoring.
** bullets **
OK, let's chat about replication with Falcon, which is very efficient.
In this example, we have a primary cluster with a typical workflow.
There is a business requirement to replicate this to a failover cluster.
** bullets **
Falcon has flexible data retention policies; it's able to model business compliance requirements.
Sophisticated retention policies expressed in one place
Simplify data retention for audit, compliance, or for data re-processing
In this example, different datasets in a workflow can have different retention policies (sketched below).
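As a rough illustration (the limits are made up), the raw and conformed feeds could each carry their own retention element inside their respective feed definitions:

<!-- in the raw landing feed: keep only a short window -->
<retention limit="days(7)" action="delete"/>
<!-- in the cleansed/conformed feed: keep long enough for audit and compliance -->
<retention limit="months(36)" action="delete"/>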
We realize that many types of workflows have inputs from different systems, which may be in different regions. Falcon has logic built in to handle this potentially tricky situation.
HCatalog – metadata shared across the whole platform
File locations become abstract (not hard-coded)
Data types become shared (not redefined per tool)
Partitioning and HDFS-optimized
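One way this shows up in a feed spec is that the HDFS locations block can be replaced with a reference to an HCatalog table. The database, table, and partition key below are placeholders for the sketch:

<!-- Illustrative table-backed feed; database, table, and partition names are placeholders -->
<feed name="exampleSummaryFeed" description="daily summary via HCatalog" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(12)" action="delete"/>
    </cluster>
  </clusters>
  <!-- the table URI replaces a hard-coded HDFS path -->
  <table uri="catalog:reporting_db:daily_summary#ds=${YEAR}-${MONTH}-${DAY}"/>
  <ACL owner="etl-user" group="etl" permission="0755"/>
  <schema location="hcat" provider="hcat"/>
</feed>

Because the table and its partitions are defined once in HCatalog, the same data can be consumed from Hive, Pig, or MapReduce without redefining paths or types.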
Transition to Andrew
Last but not least, you'll want to trace or track the data pipeline.
We trace:
The first is DR mirroring with Recipes.
Actually, recipes can be used in a number of different use cases, but we'll just focus on mirroring.
Placeholder pic
Dashboard view
Summary counts
In-place filters – by user-defined tags
The entity creation interface is contextual and has field-level semantic checks to help the user along.
As you can see on the right, we have the actual XML being generated as the UI fields are being filled out.
This can be helpful if you want to copy portions to avoid recreating an entity from scratch.
Lastly, the new UI allows you to drill down to the detail level for each entity type.