1. Selective Data Replication with Geographically Distributed Hadoop
Brett Rudenstein
April 16, 2015
Brussels, Belgium
2. Dimensions of Scalability
For distributed storage systems
Ability to support ever-increasing requirements for
- Space: more data
- Objects: more files
- Load: more clients
RAM is limiting HDFS scale
Other dimensions of scalability
- Geographic scalability: scaling across multiple data centers
Scalability as a universal measure of a distributed system
3. Geographic Scalability
Scaling file system across multiple data centers
Running Hadoop in multiple Data Centers
- Distributed across the world
- As a single cluster
4. Main Steps
Four main stages to the goal
STAGE I: The role of the Coordination Engine
STAGE II: Replicated Virtual Namespace
- Active-active
STAGE III: Geographically-distributed Hadoop
- File system running in multiple data centers
- Disaster recovery, load balancing, self-healing, simultaneous data ingest
STAGE IV: Selective data replication
- Heterogeneous storage Zones
6. Requirements
File System with hardware components distributed over the WAN
Operated and perceived by users as a single system
- Unified file system view independent of where the data is physically stored
Strict Consistency
- Everybody sees the same data
- Seamless file level replication
Continuous Availability
- All components are Active
- Disaster recovery
Geographic Scalability: Support for multiple Data Centers
7. Architecture Principles
Strict consistency of metadata with fast data ingest
1. Synchronous replication of metadata between data centers
- Using Coordination Engine
- Provides strict consistency of the namespace
2. Asynchronous replication of data over the WAN
- Data replicated in the background
- Allows fast LAN-speed data creation
9. Coordination Engine
Determines the order of operations in the system
Coordination Engine ensures the order of events submitted to the engine by multiple proposers
- Anybody can Propose
- Engine chooses a single Agreement every time and guarantees:
• Learners observe the agreements in the same order they were chosen
• An agreement triggers a corresponding application action
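The guarantee above can be sketched in a few lines. This is a minimal illustration with hypothetical names, not WANdisco's implementation: any proposer can submit an event, the engine chooses a single agreement per slot, and every learner observes the agreements in the same order.

```python
# Minimal sketch (illustrative names): a coordination engine that turns
# concurrent proposals into one totally ordered sequence of agreements.
class CoordinationEngine:
    def __init__(self):
        self.agreements = []   # chosen agreements, in global order
        self.learners = []     # callbacks triggered by each agreement

    def register_learner(self, callback):
        self.learners.append(callback)

    def propose(self, proposer_id, event):
        # Anybody can propose; the engine chooses a single agreement
        # per slot, and all learners observe it in the same order.
        agreement = (len(self.agreements), proposer_id, event)
        self.agreements.append(agreement)
        for learn in self.learners:
            learn(agreement)   # agreement triggers an application action
        return agreement

log_a, log_b = [], []
engine = CoordinationEngine()
engine.register_learner(log_a.append)
engine.register_learner(log_b.append)
engine.propose("dc1", "open /foo")
engine.propose("dc2", "close /foo")
assert log_a == log_b   # every learner saw the same order
```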
10. Central Coordination
Simple coordination without fault tolerance
Easy to Coordinate
- Single NameNode as an example of a Central Coordination Engine (no HA)
- Performance and availability bottleneck
- Single point of failure
11. Distributed Coordination Engine
Fault-tolerant coordination using multiple acceptors
Distributed Coordination Engine operates on participating nodes
- Roles: Proposer, Learner, and Acceptor
- Each node can combine multiple roles
Distributed coordination
- Proposing nodes submit events as proposals to a quorum of acceptors
- Acceptors agree on the order of each event in the global sequence of events
- Learners learn agreements in the same deterministic order
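The fault-tolerance property comes from majority quorums: a proposal is chosen once more than half of the acceptors accept it, so a minority of failed acceptors cannot block progress. A toy sketch (illustrative only, not a full Paxos round):

```python
# Sketch: a proposal is chosen once a majority (quorum) of acceptors
# accept it, so the system tolerates failure of a minority of acceptors.
def reach_quorum(acceptors, proposal):
    votes = sum(1 for accept in acceptors if accept(proposal))
    return votes > len(acceptors) // 2   # strict majority required

alive = lambda p: True    # healthy acceptor accepts the proposal
down  = lambda p: False   # failed acceptor never responds

# 3 of 5 acceptors alive -> quorum still reached
assert reach_quorum([alive, alive, alive, down, down], "event-42")
# only 2 of 5 alive -> no quorum, proposal is not chosen
assert not reach_quorum([alive, alive, down, down, down], "event-42")
```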
12. Consensus Algorithms
Consensus is the process of agreeing on one result among a group of participants
Coordination Engine guarantees the same state of the learners at a given GSN
- Each agreement is assigned a unique Global Sequence Number (GSN)
- GSNs form a monotonically increasing number series – the order of agreements
- Learners have the same initial state, apply the same deterministic agreements in the same deterministic order
- GSN represents “logical” time in coordinated systems
PAXOS is a consensus algorithm proven to tolerate a variety of failures
- Quorum-based consensus
- Deterministic state machine
- Leslie Lamport: The Part-Time Parliament (1990)
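The deterministic-state-machine idea can be made concrete. In this sketch (hypothetical names), two learners start from the same initial state and apply the same agreements in strict GSN order; they necessarily end in the same state at the same GSN:

```python
# Sketch: learners with the same initial state applying the same
# agreements in GSN order reach the same state ("logical time").
class Learner:
    def __init__(self):
        self.gsn = 0
        self.namespace = set()

    def apply(self, gsn, op, path):
        assert gsn == self.gsn + 1   # agreements applied in order, no gaps
        self.gsn = gsn
        if op == "create":
            self.namespace.add(path)
        elif op == "delete":
            self.namespace.discard(path)

agreements = [(1, "create", "/a"), (2, "create", "/b"), (3, "delete", "/a")]
l1, l2 = Learner(), Learner()
for agreement in agreements:
    l1.apply(*agreement)
for agreement in agreements:
    l2.apply(*agreement)
assert l1.namespace == l2.namespace == {"/b"}
assert l1.gsn == l2.gsn == 3   # same state at the same GSN
```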
14. Replicated Virtual Namespace
Coordination Engine provides equivalence of multiple namespace replicas
Coordinated Virtual Namespace controlled by Fusion Nodes
- A Fusion Node is a client that acts as a proxy for other clients' interactions
- Reads are not coordinated
- Writes (Open, Close, Append, etc…) are coordinated
The namespace events are consistent with each other
- Each fusion server maintains a log of changes that would occur in the namespace
- Any Fusion Node can initiate an update, which is propagated to all other Fusion Nodes
Coordination Engine establishes the global order of namespace updates
- Fusion servers apply deterministic updates in the same deterministic order to the underlying file system
- Systems that start from the same state and apply the same updates are equivalent
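The read/write split above can be sketched as a tiny proxy. All names here are illustrative: reads hit local storage directly, while writes are proposed to the coordination engine and applied on every replica.

```python
# Sketch (hypothetical names): a Fusion node proxies client calls.
# Reads are served locally and uncoordinated; writes are proposed to
# the coordination engine and applied on every replica in order.
class Engine:
    def __init__(self):
        self.replicas = []           # apply-callbacks, one per Fusion node

    def propose(self, op):
        for apply_op in self.replicas:
            apply_op(op)             # same op delivered to all replicas

class FusionNode:
    def __init__(self, engine):
        self.engine = engine
        self.storage = {}            # local underlying file system
        engine.replicas.append(self.apply)

    def read(self, path):            # reads are not coordinated
        return self.storage.get(path)

    def create(self, path, data):    # writes are coordinated
        self.engine.propose(("create", path, data))

    def apply(self, op):
        _, path, data = op
        self.storage[path] = data

engine = Engine()
dc1, dc2 = FusionNode(engine), FusionNode(engine)
dc1.create("/logs/a", b"entry")
assert dc2.read("/logs/a") == b"entry"   # write visible on both replicas
```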
15. Strict Consistency Model
One-Copy Equivalence as known in replicated databases
Coordination Engine sequences file open and close proposals into the global sequence of agreements
- Applied to each replicated folder namespace in the order of their Global Sequence Number
Fusion Replicated Folders have identical states when they reach the same GSN
One-copy equivalence
- Folders may have different states at a given moment of “clock” time as the rate of consuming agreements may vary
- Provides the same state in logical time
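One-copy equivalence can be illustrated directly: a replica's state is a pure function of the agreement log up to its current GSN, so two replicas may disagree at a wall-clock instant yet always agree at the same GSN. A small sketch (illustrative names):

```python
# Sketch: replica state as a function of (agreement log, GSN).
def state_at(agreements, gsn):
    ns = set()
    for g, op, path in agreements[:gsn]:
        ns.add(path) if op == "create" else ns.discard(path)
    return ns

log = [(1, "create", "/x"), (2, "create", "/y"), (3, "delete", "/x")]
fast = state_at(log, 3)   # replica that has consumed all agreements
slow = state_at(log, 2)   # replica lagging one agreement behind
assert fast != slow       # different states at the same "clock" time

slow_caught_up = state_at(log, 3)   # slow replica consumes agreement 3
assert slow_caught_up == fast       # identical state at the same GSN
```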
17. Scaling Hadoop Across Data Centers
Continuous Availability and Disaster Recovery over the WAN
The system should appear, act, and be operated as a single cluster
- Instant and automatic replication of data and metadata
Parts of the cluster in different data centers should have equal roles
- Data could be ingested or accessed through any of the centers
Data creation and access should typically be at LAN speed
- Running time of a job executed on one data center as if there are no other centers
Failure scenarios: the system should provide service and remain consistent
- Any Fusion node can fail and still provide replication
- Fusion nodes can fail simultaneously on two or more data centers and still provide replication
- WAN Partitioning does not cause a data center outage
- RPO (Recovery Point Objective) is as low as possible due to continuous rather than periodic replication
18. Foreign File Replication
File is created on the client’s data center and replicated to the other asynchronously
Fusion workflow
1. Client makes a request to create a file
2. Fusion coordinates File Open with the other clusters involved (membership)
3. File is added to underlying storage
4. IHC server pulls data from the native cluster and pushes it to the remote clusters
5. Fusion coordinates File Close with the other clusters involved (membership)
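The five steps above can be simulated in miniature. This sketch (all names illustrative) shows the key architectural point: metadata (open/close) is coordinated synchronously, while block data crosses the WAN asynchronously, so the client writes at LAN speed.

```python
# Sketch of the workflow: synchronous metadata, asynchronous data.
from collections import deque

transfer_queue = deque()   # pending async WAN transfers (the "IHC" work)

def create_file(path, data, local, remotes):
    # Steps 1-2: client request; File Open coordinated to all members,
    # so the metadata entry exists everywhere immediately.
    for cluster in [local] + remotes:
        cluster.setdefault(path, None)
    local[path] = data                            # step 3: LAN-speed write
    transfer_queue.append((path, data, remotes))  # step 4: queued for IHC
    # Step 5: File Close coordinated (not modeled further here).

def ihc_drain():
    # Background IHC transfer: replica data lands on remote clusters.
    while transfer_queue:
        path, data, remotes = transfer_queue.popleft()
        for cluster in remotes:
            cluster[path] = data

dc1, dc2 = {}, {}
create_file("/ingest/f1", b"blocks", dc1, [dc2])
assert dc1["/ingest/f1"] == b"blocks"   # data present natively
assert dc2["/ingest/f1"] is None        # remote has metadata, not data yet
ihc_drain()
assert dc2["/ingest/f1"] == b"blocks"   # data arrives asynchronously
```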
19. Inter Hadoop Communication Service
Uses HCFS API and communicates directly with underlying storage systems
- Isilon
- MapR
- HDFS
- S3
NameNode and DataNode operations are unchanged
22. Selective Data Replication
Three main use cases for restricting data replication
“Saudi Arabia” case – Data must never leave a specific data center
- This is needed to protect data from being replicated outside of a specific geo-location, country, or facility; e.g., customer data from the Saudi Arabian branch of a global bank must never leave the country due to local regulations.
- Virtual namespace: only metadata whose supporting data is replicated is itself replicated
“/tmp” case – Data created in a directory by a native client should remain native
- Transient data of a job running in a DC does not need to be replicated elsewhere, as it is deleted upon job completion and nobody else needs it.
“Ingest Only” case – Data directly ingested into cluster at data origin
- Data replicates to all other data centers
- Temporary network partitioned cluster can still ingest data
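The three cases amount to a per-path replication policy. A sketch of how such a policy could be evaluated (the rule table, data-center names, and function are all hypothetical, mirroring the cases above):

```python
# Sketch: selective data replication as path-prefix rules.
POLICY = {
    "/saudi":  {"riyadh"},                    # data must never leave that DC
    "/tmp":    None,                          # native only: stays at origin
    "/ingest": {"riyadh", "london", "nyc"},   # replicate to all data centers
}

def replica_targets(path, origin_dc):
    for prefix, targets in POLICY.items():
        if path.startswith(prefix):
            return {origin_dc} if targets is None else targets
    return {origin_dc}   # illustrative default: keep data native

assert replica_targets("/saudi/accounts", "riyadh") == {"riyadh"}
assert replica_targets("/tmp/job42", "nyc") == {"nyc"}
assert "london" in replica_targets("/ingest/day1", "nyc")
```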
24. Heterogeneous Storage Zones
Virtual Data Centers representing different types of block storage
Storage Types: Hard Drive, SSD, RAM
Virtual data center is a zone of similarly configured Data Nodes
Example:
- Z1 archival zone: DataNodes with dense hard drive storage
- Z2 active data zone: DataNodes with high-performance SSDs
- Z3 real-time access zone: lots of RAM and cores, short-lived hot data
SDR policy defines three directories:
- /archive – replicated only on Z1
- /active-data – replicated on Z2 and Z1
- /real-time – replicated everywhere
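The SDR zone policy above is a mapping from directory to storage zones; a minimal sketch of the example (zone names from the slide, function name hypothetical):

```python
# Sketch: the example SDR policy as a directory -> zones mapping.
ZONE_POLICY = {
    "/archive":     {"Z1"},               # dense hard-drive archival zone
    "/active-data": {"Z1", "Z2"},         # also on the SSD zone
    "/real-time":   {"Z1", "Z2", "Z3"},   # replicated everywhere, incl. RAM zone
}

def zones_for(path):
    for prefix, zones in ZONE_POLICY.items():
        if path.startswith(prefix):
            return zones
    return {"Z1"}   # illustrative default: archive only

assert zones_for("/archive/2014") == {"Z1"}
assert zones_for("/real-time/feed") == {"Z1", "Z2", "Z3"}
```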
25. Simplified WAN configurations
Reduced operational complexity
Fast network protocols can keep up with demanding network replication
Hadoop clusters do not require direct communication with each other
- No n × m communication among DataNodes across data centers
- Reduced firewall/SOCKS complexity
Reduced attack surface
26. Thank You.
Questions?
Come visit WANdisco at Booth 11
Editor's Notes
It is no secret that RAM is the limiting factor for NameNode scalability, and as a result for the entire HDFS
We have achieved the goal, or are making good progress toward it
Consensus algorithms are the core of a distributed Coordination Engine
Double determinism is important for equivalent evolution of the systems
Unlike a multi-cluster architecture, where clusters run independently in each data center, mirroring data between them
1. DC1 DataNodes report replicas to native GeoNodes
2. DC1 GeoNode submits a Foreign Replica Report (FRR) proposal
3. The FRR agreement is executed by all foreign GeoNodes, which learn about the foreign replica locations
4. DC1 GeoNode schedules a replica transfer from a native DataNode to a foreign DC2 DataNode
5. DC2 DataNode reports the new replica to DC2 GeoNodes
6. DC2 GeoNode schedules replication of the new replica within DC2