#GeodeSummit - Apex & Geode: In-memory streaming, storage & analytics

In-Memory Streaming, Storage & Analytics
Apache Apex + Apache Geode

Thomas Weise, Ashish Tadose

•  In-memory Stream Processing
•  Partitioning and Scaling out
•  Windowing Support (temporal)
•  Stateful Fault-tolerance, Operability
•  Processing Guarantees
•  Compute Locality
•  Dynamic updates

Apex Features …

Application Programming Model
§  Stream is a sequence of data tuples
§  Operator takes one or more input streams, performs computations & emits one or more output streams
–  Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
–  Operator has many instances that run in parallel and each instance is single-threaded
§  Directed Acyclic Graph (DAG) is made up of operators and streams
–  Iterative processing supported
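The model above can be sketched in a few lines of plain Java. This is a toy illustration only: `ToyDag`, `ToyOperator`, `map`, `filter`, `connect` and `run` are hypothetical names invented here and are not the Apex API. It shows operators as per-tuple logic and streams as the connections between them, wired into a simple linear chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model only: these types are hypothetical and are NOT the Apex API.
final class ToyDag {
    /** An operator consumes one tuple at a time and may emit tuples downstream. */
    interface ToyOperator<I, O> {
        void process(I tuple, Consumer<O> output);
    }

    /** Wraps a per-tuple function as an operator that emits one tuple per input. */
    static <I, O> ToyOperator<I, O> map(Function<I, O> fn) {
        return (tuple, output) -> output.accept(fn.apply(tuple));
    }

    /** Emits the input tuple only when it passes the predicate. */
    static <T> ToyOperator<T, T> filter(Predicate<T> keep) {
        return (tuple, output) -> {
            if (keep.test(tuple)) {
                output.accept(tuple);
            }
        };
    }

    /** Connects two operators with a stream: the output of 'up' feeds 'down'. */
    static <A, B, C> ToyOperator<A, C> connect(ToyOperator<A, B> up, ToyOperator<B, C> down) {
        return (tuple, output) -> up.process(tuple, mid -> down.process(mid, output));
    }

    /** Pushes every input tuple through the chain and collects what reaches the end. */
    static <I, O> List<O> run(List<I> input, ToyOperator<I, O> pipeline) {
        List<O> out = new ArrayList<>();
        for (I tuple : input) {
            pipeline.process(tuple, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        ToyOperator<String, Integer> parse = map(Integer::parseInt);
        ToyOperator<Integer, Integer> evens = filter(n -> n % 2 == 0);
        ToyOperator<Integer, String> format = map(n -> "even:" + n);
        // prints [even:2, even:4]
        System.out.println(run(List.of("1", "2", "3", "4"), connect(connect(parse, evens), format)));
    }
}
```

The sketch runs everything in one thread; in the model described on the slide, each operator can have many parallel instances, each of them single-threaded.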
[Figure: Directed Acyclic Graph (DAG). Operators connected by streams of tuples, ending in an output stream.]

Apex Native Hadoop Integration
YARN is the resource manager

HDFS used for storing any persistent state

•  Operator state is checkpointed to a persistent store
–  Automatically performed by engine, no additional work needed by operator
–  In case of failure operators are restarted from checkpoint state
–  Frequency configurable per operator
–  Asynchronous and distributed by default
–  Default store is HDFS
•  Automatic detection and recovery of failed operators
–  Heartbeat mechanism
•  Buffering mechanism to ensure replay of data from recovered point so that there is no loss of data
•  Application master state checkpointed

Apex Fault Tolerance
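The checkpoint-and-replay idea on this slide can be illustrated with a small, self-contained Java sketch. It is a toy model under stated assumptions, not the Apex engine: the class `CheckpointReplayDemo` and all of its methods are hypothetical, the "persistent store" is just a pair of fields, and the upstream buffer is a plain list of windows.

```java
import java.util.List;

// Toy model only: hypothetical names, not the Apex engine or its checkpointing API.
final class CheckpointReplayDemo {
    private long sum = 0;          // in-memory operator state
    private int lastWindow = -1;   // id of the last fully processed window
    private long savedSum = 0;     // stand-in for the copy held in the persistent store
    private int savedWindow = -1;

    /** Processes all tuples of one window and remembers the window id. */
    void processWindow(int windowId, List<Integer> tuples) {
        for (int t : tuples) {
            sum += t;
        }
        lastWindow = windowId;
    }

    /** Copies the current state to the stand-in persistent store. */
    void checkpoint() {
        savedSum = sum;
        savedWindow = lastWindow;
    }

    /** Simulates a failure: all in-memory state is lost. */
    void crash() {
        sum = 0;
        lastWindow = -1;
    }

    /** Restores the checkpoint, then replays every buffered window after it. */
    void recover(List<List<Integer>> bufferedWindows) {
        sum = savedSum;
        lastWindow = savedWindow;
        for (int w = savedWindow + 1; w < bufferedWindows.size(); w++) {
            processWindow(w, bufferedWindows.get(w));
        }
    }

    long sum() {
        return sum;
    }

    public static void main(String[] args) {
        List<List<Integer>> windows = List.of(List.of(1, 2), List.of(3), List.of(4, 5), List.of(6));
        CheckpointReplayDemo op = new CheckpointReplayDemo();
        op.processWindow(0, windows.get(0));
        op.processWindow(1, windows.get(1));
        op.checkpoint();                       // state after window 1: sum = 6
        op.processWindow(2, windows.get(2));   // work that will be lost
        op.crash();
        op.recover(windows);                   // restore sum = 6, replay windows 2 and 3
        System.out.println(op.sum());          // prints 21
    }
}
```

Because the buffered windows after the checkpoint are replayed, the recovered sum matches what an uninterrupted run would have produced.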

At-least-once
• On recovery data will be replayed from a previous checkpoint
–  No messages lost
–  Default, suitable for most applications
• Can be used to ensure data is written once to store
–  Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations
At-most-once
• On recovery the latest data is made available to operator
–  Useful where data loss is acceptable and latest data is sufficient
Exactly-once
–  At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior

Apex Processing Semantics
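Why at-least-once delivery needs idempotent (or transactional) writes to look like exactly-once can be seen in a short sketch. The class and method names below are hypothetical and no Apex or Geode API is used; the input list simulates a replay in which tuples "b" and "c" are delivered a second time after recovery.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names, not an Apex or Geode API.
final class IdempotentSinkDemo {
    /** Non-idempotent sink: adds every delivered amount, so replayed tuples count twice. */
    static int appendTotal(List<Map.Entry<String, Integer>> delivered) {
        int total = 0;
        for (Map.Entry<String, Integer> e : delivered) {
            total += e.getValue();
        }
        return total;
    }

    /** Idempotent sink: writes are keyed by event id, so a replayed put overwrites itself. */
    static int keyedTotal(List<Map.Entry<String, Integer>> delivered) {
        Map<String, Integer> store = new HashMap<>();
        for (Map.Entry<String, Integer> e : delivered) {
            store.put(e.getKey(), e.getValue());
        }
        int total = 0;
        for (int amount : store.values()) {
            total += amount;
        }
        return total;
    }

    public static void main(String[] args) {
        // "b" and "c" appear twice: once before the failure, once during replay.
        List<Map.Entry<String, Integer>> delivered = List.of(
            Map.entry("a", 10), Map.entry("b", 20), Map.entry("c", 30),
            Map.entry("b", 20), Map.entry("c", 30), Map.entry("d", 40));
        System.out.println(appendTotal(delivered)); // prints 150 (duplicates counted)
        System.out.println(keyedTotal(delivered));  // prints 100 (duplicates absorbed)
    }
}
```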

•  Data flow in-memory, no disk
•  Incremental recovery – buffer server
•  In-memory data for querying data

IMC Benefits for Apex

Streaming meets In Memory Data Grid

Apex + Geode Integration
Completed

•  Operator checkpointing in Geode.
•  Output operator to store tuples in Geode region.

Proposed

•  Geode output operator with Transactional support.
•  Ingest data from Geode to Apex DAG.
•  Distributed Cache Operator.
•  Scan operator for parallel query execution & result retrieval.

Operator Checkpointing in Geode
Apex Operator checkpointing in an IMDG (Geode store)

• Checkpointing is an essential mechanism to ensure Fault Tolerance
• Apex checkpoints operator state to HDFS
• Slower HDFS checkpointing hurts application performance
• Checkpointing in Geode ensures that application performance is not impacted
• Geode has better latency for write operations than HDFS.
Implementation: GeodeStorageAgent
https://issues.apache.org/jira/browse/APEXCORE-283

Data Streams to Geode Store
Apex Output Operator to write to Geode store

•  Apex Output operator – Egress data from Apex DAG to external store
•  Store incoming tuples in binary / POJO format in Geode region
•  Geode Efficient Query integration – OQL
•  Geode region supports data replication, overflow to disk, persistence & many eviction strategies
Implementation: GeodeStore, GeodePOJOPutOperator, AbstractGeodePutOperator
https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942

Geode Transactional Writes
Apex Output Operator to write to Geode store with Transactions

• Apex DAG uses TransactionableStore to guarantee that records are written exactly once. E.g. JdbcTransactionalStore
• Geode provides Transaction support for efficient and safe coordinated operations
• A Geode store using transactions guarantees that records are written exactly once
• Put operator backed by a Geode transactional store can help to achieve Exactly-once semantics
Implementation: GeodeWindowStore as TransactionableStore
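One common way to get exactly-once writes from a transactional store is to commit each window's tuples together with the window id, and to skip any replayed window whose id is already committed. The sketch below illustrates that idea only. It is an assumption about the general technique, not the GeodeWindowStore implementation, and `WindowTxStoreDemo` and its methods are hypothetical; a `synchronized` method stands in for a real store transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names; 'synchronized' stands in for a store transaction.
final class WindowTxStoreDemo {
    private final List<String> rows = new ArrayList<>();
    private long committedWindow = -1;   // meta information stored alongside the data

    /**
     * Writes one window's tuples and records its id as a single unit.
     * Returns false when the window was already committed, i.e. it is a replay.
     */
    synchronized boolean commitWindow(long windowId, List<String> tuples) {
        if (windowId <= committedWindow) {
            return false;                // already written before the failure
        }
        rows.addAll(tuples);
        committedWindow = windowId;
        return true;
    }

    /** Returns an immutable snapshot of everything written so far. */
    synchronized List<String> rows() {
        return List.copyOf(rows);
    }

    public static void main(String[] args) {
        WindowTxStoreDemo store = new WindowTxStoreDemo();
        store.commitWindow(0, List.of("a"));
        store.commitWindow(1, List.of("b"));
        // After recovery the engine replays window 1, then continues with window 2.
        store.commitWindow(1, List.of("b"));   // skipped
        store.commitWindow(2, List.of("c"));
        System.out.println(store.rows());      // prints [a, b, c]
    }
}
```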

Streaming Geode data in Apex
Apex Input Operator to read from Geode store
• Apex Input operators – Ingest data from external sources into Apex DAG
• Geode provides versatile and reliable event distribution to provide Real Time updates to data
•  Use case – Apex operator to stream async events from Geode in DAG
•  Callback events reduce polling cycles over network
Implementation: GeodeRegionStreamOperator
receives newly added tuples and emits them into the DAG
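The callback-instead-of-polling pattern described here can be sketched as a queue that a listener fills and the operator drains. This is a minimal illustration under assumptions: `ToyRegionStream`, `onEntryCreated` and `emitTuples` are hypothetical names chosen for this sketch, and no Geode listener or Apex operator interface is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy model only: hypothetical names, no Geode or Apex interfaces are implemented.
final class ToyRegionStream {
    // Thread-safe hand-off between the callback thread and the operator thread.
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();

    /** Callback a data grid client would invoke when a new entry is created. */
    void onEntryCreated(String value) {
        pending.add(value);
    }

    /** Called by the operator loop; drains buffered events without a network round trip. */
    List<String> emitTuples() {
        List<String> emitted = new ArrayList<>();
        String next;
        while ((next = pending.poll()) != null) {
            emitted.add(next);
        }
        return emitted;
    }

    public static void main(String[] args) {
        ToyRegionStream op = new ToyRegionStream();
        op.onEntryCreated("order-1");
        op.onEntryCreated("order-2");
        System.out.println(op.emitTuples()); // prints [order-1, order-2]
        System.out.println(op.emitTuples()); // prints []
    }
}
```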

Geode Cache Operator
Apex Geode Cache Operator
• Geode provides efficient Events & Notifications
•  Register interest – update local copies
•  Continuous Query
•  Receive notification when Query condition met on server
•  E.g. SELECT * FROM /tradeOrder t WHERE t.price > 100.00
• Use Geode events notification framework to maintain & invalidate cache.
Implementation: GeodeCacheOperator
maintains consistent cache based on subscribed keyset/query
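How a subscriber-side cache can be kept consistent by server-side notifications is sketched below with a toy region. The names `ToyContinuousQuery`, `ToyRegion`, `subscribe`, `put` and `destroy` are hypothetical and do not correspond to the Geode client API; the predicate plays the role of the `t.price > 100.00` query condition from the slide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Toy model only: hypothetical names, not the Geode continuous query API.
final class ToyContinuousQuery {
    /** A stand-in "server" region that pushes matching changes to one subscriber cache. */
    static final class ToyRegion {
        private final Map<String, Double> data = new HashMap<>();
        private Predicate<Double> condition = price -> false;
        private Map<String, Double> subscriberCache;

        /** Registers a query condition and the local cache to keep in sync with it. */
        void subscribe(Predicate<Double> condition, Map<String, Double> cache) {
            this.condition = condition;
            this.subscriberCache = cache;
        }

        /** Stores the value and notifies the subscriber: update on match, invalidate otherwise. */
        void put(String key, double price) {
            data.put(key, price);
            if (subscriberCache == null) {
                return;
            }
            if (condition.test(price)) {
                subscriberCache.put(key, price);
            } else {
                subscriberCache.remove(key);
            }
        }

        /** Removes the entry and invalidates any cached copy. */
        void destroy(String key) {
            data.remove(key);
            if (subscriberCache != null) {
                subscriberCache.remove(key);
            }
        }
    }

    public static void main(String[] args) {
        ToyRegion tradeOrder = new ToyRegion();
        Map<String, Double> cache = new HashMap<>();
        tradeOrder.subscribe(price -> price > 100.00, cache);
        tradeOrder.put("o1", 150.0);   // matches: cached
        tradeOrder.put("o2", 50.0);    // no match: not cached
        tradeOrder.put("o1", 90.0);    // no longer matches: invalidated
        System.out.println(cache.isEmpty()); // prints true
    }
}
```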

Geode Scan Operator
Apex Geode Scan Operator
• Function Execution provides Parallel Query Execution
• MapReduce-like execution: functions run concurrently on members, and the results are collected from the members and sent to the caller.
• Use case: Streaming application depending on large scan result from external store
Implementation: GeodeQueryOperator
executes data-dependent queries on a distributed region and emits the results into the DAG
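The MapReduce-like execution described above, where each member scans its own data and the caller merges the partial results, can be sketched with a thread pool standing in for the cluster. This is an illustration only: `ToyParallelScan` and `scan` are hypothetical names, each inner list plays the role of one member's local data, and no Geode function execution API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Toy model only: threads stand in for cluster members; not the Geode function API.
final class ToyParallelScan {
    /** Runs the filter on every member concurrently and merges results in member order. */
    static List<Double> scan(List<List<Double>> members, Predicate<Double> filter) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, members.size()));
        try {
            // "Map" phase: one task per member, each scanning only its local data.
            List<Future<List<Double>>> partials = new ArrayList<>();
            for (List<Double> member : members) {
                partials.add(pool.submit(() -> {
                    List<Double> hits = new ArrayList<>();
                    for (Double price : member) {
                        if (filter.test(price)) {
                            hits.add(price);
                        }
                    }
                    return hits;
                }));
            }
            // "Reduce" phase: the caller collects the partial results.
            List<Double> merged = new ArrayList<>();
            for (Future<List<Double>> partial : partials) {
                merged.addAll(partial.get());
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Double>> members = List.of(
            List.of(50.0, 150.0), List.of(200.0, 80.0), List.of(101.0));
        System.out.println(scan(members, price -> price > 100.00)); // prints [150.0, 200.0, 101.0]
    }
}
```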

Join the Apache Geode Community!
•  Check out: http://geode.incubator.apache.org
•  Subscribe: user-subscribe@geode.incubator.apache.org
•  Download: http://geode.incubator.apache.org/releases/
