Apache Apex and Apache Geode are two of the most promising incubating open source projects. Combined, they promise to fill gaps in existing big data analytics platforms. Apache Apex is an enterprise-grade, YARN-native, big data-in-motion platform that unifies stream and batch processing. Apex is highly scalable, performant, fault tolerant, and strong in operability. Apache Geode provides a database-like consistency model, reliable transaction processing, and a shared-nothing architecture to maintain very low latency with high-concurrency processing. We will also look at some use cases where these two projects can be used together to form a distributed, fault-tolerant, reliable in-memory data processing layer.
4. Application Programming Model
§ Stream is a sequence of data tuples
§ Operator takes one or more input streams, performs computations & emits one or more output streams
– Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
– Operator has many instances that run in parallel and each instance is single-threaded
§ Directed Acyclic Graph (DAG) is made up of operators and streams
– Iterative processing supported
[Figure: Directed Acyclic Graph (DAG) of Operators connected by streams of Tuples, ending in an Output Stream]
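The programming model above (tuples flowing through operators wired into a DAG) can be illustrated with a small plain-Java sketch. This is a toy model only: `MiniDag`, `Operator`, `LineSplitter` and `WordLength` are hypothetical names made up for illustration and are not the Apache Apex API. It assumes a single-threaded, in-process pipeline and ignores partitioning and parallel instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model only: these types are hypothetical and are NOT the Apache Apex API.
final class MiniDag {

    /** An operator consumes tuples from an input stream and emits tuples to an output stream. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> output);
    }

    /** Splits each line into lower-cased words: one input tuple, many output tuples. */
    static final class LineSplitter implements Operator<String, String> {
        @Override
        public void process(String line, Consumer<String> output) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    output.accept(word.toLowerCase());
                }
            }
        }
    }

    /** Emits the length of every word it receives. */
    static final class WordLength implements Operator<String, Integer> {
        @Override
        public void process(String word, Consumer<Integer> output) {
            output.accept(word.length());
        }
    }

    /** Wires the DAG: input lines -> LineSplitter -> WordLength -> sink list. */
    static List<Integer> run(List<String> lines) {
        List<Integer> sink = new ArrayList<>();
        Operator<String, String> splitter = new LineSplitter();
        Operator<String, Integer> length = new WordLength();
        for (String line : lines) {
            // Each lambda plays the role of a stream connecting two operators.
            splitter.process(line, word -> length.process(word, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        // Prints [4, 7, 6, 3, 5]
        System.out.println(run(List.of("Apex unifies stream", "and batch")));
    }
}
```

In a real engine each operator would run as one or more single-threaded instances in separate containers, and the streams would carry tuples between processes rather than through direct method calls.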
7. Apex Fault Tolerance
• Operator state is checkpointed to a persistent store
– Automatically performed by the engine, no additional work needed by the operator
– In case of failure, operators are restarted from the checkpointed state
– Frequency is configurable per operator
– Asynchronous and distributed by default
– Default store is HDFS
• Automatic detection and recovery of failed operators
– Heartbeat mechanism
• Buffering mechanism ensures replay of data from the recovered point so that there is no loss of data
• Application master state is checkpointed
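The interplay of checkpointing, buffering and replay can be sketched in plain Java. This is a simplified illustration under stated assumptions, not the Apex implementation: `CheckpointDemo`, `Engine` and `SummingOperator` are hypothetical names, the "persistent store" is an in-memory field standing in for HDFS, and checkpoints are taken synchronously every N tuples, whereas the slide describes them as asynchronous.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, NOT the Apache Apex API.
final class CheckpointDemo {

    /** A stateful operator that keeps a running sum of the tuples it has processed. */
    static final class SummingOperator {
        long sum;

        void process(int tuple) {
            sum += tuple;
        }
    }

    /** Stands in for the engine: checkpoints operator state and buffers tuples for replay. */
    static final class Engine {
        private final int checkpointInterval;                    // tuples between checkpoints
        private SummingOperator operator = new SummingOperator();
        private long checkpointedSum = 0;                        // stand-in for state saved to a persistent store
        private final List<Integer> buffer = new ArrayList<>();  // tuples received since the last checkpoint

        Engine(int checkpointInterval) {
            this.checkpointInterval = checkpointInterval;
        }

        void deliver(int tuple) {
            operator.process(tuple);
            buffer.add(tuple);
            if (buffer.size() == checkpointInterval) {
                checkpointedSum = operator.sum;  // the operator itself does no extra work
                buffer.clear();                  // tuples covered by the checkpoint need no replay
            }
        }

        /** Simulates a crash: in-memory state is lost, then rebuilt from checkpoint plus replay. */
        void failAndRecover() {
            operator = new SummingOperator();    // fresh instance, state lost
            operator.sum = checkpointedSum;      // restore from the last checkpoint
            for (int tuple : buffer) {
                operator.process(tuple);         // replay buffered tuples so no data is lost
            }
        }

        long currentSum() {
            return operator.sum;
        }
    }

    /** Delivers 1..5, crashes, recovers, then delivers 6. */
    static long runWithFailure() {
        Engine engine = new Engine(3);
        for (int t = 1; t <= 5; t++) {
            engine.deliver(t);       // checkpoint after tuple 3 (sum 6); 4 and 5 stay buffered
        }
        engine.failAndRecover();     // restores 6, replays 4 and 5
        engine.deliver(6);
        return engine.currentSum();
    }

    public static void main(String[] args) {
        System.out.println(runWithFailure()); // prints 21, the same as a run without failure
    }
}
```

The result after recovery equals the sum of all delivered tuples, which is the "no loss of data" property: the checkpoint supplies the state up to the recovered point and the buffer supplies everything after it.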