Scalability truths and serverless architectures

Slides from my talk at HasGeek Fifth Elephant 2018


  1. 1. Scalability truths and serverless architectures: why it is harder with stateful, data-driven systems - regunathb
  2. 2. I’ll talk about ● Scalability truths ● Event driven systems/architecture (EDA) ● Complexity of maintaining State in EDA ● Stateful EDA = Workflow ● Serverless architecture - why it is useful, Primitives ● Design for a Serverless Stateful EDA platform ● The Flux project
  3. 3. Part 1 - Scalability truths
  4. 4. Things you hear ● Our platform is on the Cloud and we can scale seamlessly, often implying: ○ No impact on Performance - user perceived latencies ○ No impact on Data - all transactions are committed, data is durable and consistent ● We follow SOA and our services are Distributed across XX numbers of servers, often implying: ○ Our services are Stateless ○ Workloads are mostly identical and therefore can be serviced by any server ○ Our systems can keep up with the growth in available memory & compute ○ Network bandwidth is not a constraint & data transfer within the DC is a non-issue ○ There is no need for Consensus (or) there are no faulty processes
  5. 5. Truths
     State Management: “There is no such thing as a "stateless" architecture; it's just someone else's problem” - Jonas Bonér (CTO, Lightbend)
     “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport
     ACID 1.0 (Atomicity, Consistency, Isolation, Durability): ● Full Transaction support ● No Data Staleness ● Strict Ordering ● Zero Data Loss
     With Scale: ACID 2.0 (Associative, Commutative, Idempotent, Distributed): ● Limited Transaction support ● Eventually Consistent ● Relaxed Ordering ● High Availability
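The ACID 2.0 properties on slide 5 can be illustrated with a merge operation over replicated state. This is a minimal sketch, assuming a grow-only set as the state; the class name `GrowOnlySet` is hypothetical and does not come from the talk. Because set union is associative, commutative and idempotent, replicas reach the same value regardless of the order in which updates arrive or how often they are redelivered.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration: a grow-only set whose merge is plain set union.
final class GrowOnlySet {
    private final Set<String> items;

    GrowOnlySet(Set<String> items) {
        // Defensive copy so the instance stays immutable.
        this.items = Collections.unmodifiableSet(new HashSet<>(items));
    }

    static GrowOnlySet of(String... values) {
        Set<String> s = new HashSet<>();
        Collections.addAll(s, values);
        return new GrowOnlySet(s);
    }

    // Union is associative, commutative and idempotent, so merges can be
    // applied in any order, any grouping, and any number of times.
    GrowOnlySet merge(GrowOnlySet other) {
        Set<String> union = new HashSet<>(items);
        union.addAll(other.items);
        return new GrowOnlySet(union);
    }

    Set<String> items() {
        return items;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof GrowOnlySet && ((GrowOnlySet) o).items.equals(items);
    }

    @Override
    public int hashCode() {
        return items.hashCode();
    }
}
```

The trade-off is the one listed on the slide: the set never loses data under reordering or retries, but it offers no transactions and a reader may see a stale value until all merges have arrived.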
  6. 6. State in e-Commerce systems ● Stateless (not logged in) enables caching of content ○ But, items go out of stock ○ Problems around data consistency ● Challenges in being Stateful ○ Availability takes a hit; Share-nothing is not feasible ○ Problem of Consensus when data gets replicated ○ Need for Data durability and guaranteed execution ■ Move from Fail-fast to Succeed-at-any-cost
  7. 7. Part 2 - Event Driven Systems
  8. 8. Events ● Events represent Facts ● Events are Immutable ● Events can be triggers for Actions Event driven services ● Receive and React to Events/Facts ● Publish new Facts to the rest of the world ● Invert the Control Flow - minimizing coupling and increasing autonomy
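The receive, react and publish cycle on slide 8 can be sketched with an in-memory publish/subscribe bus. This is an illustration only: `Fact` and `EventBus` are hypothetical names, there is no durability or distribution, and it is not how any particular messaging middleware works. The point is the inverted control flow: the publisher does not know who reacts to a fact.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical immutable fact: a name plus a payload, fixed at construction.
final class Fact {
    final String name;
    final String payload;

    Fact(String name, String payload) {
        this.name = name;
        this.payload = payload;
    }
}

// Hypothetical in-memory bus: records every published fact and hands it to all subscribers.
final class EventBus {
    private final List<Consumer<Fact>> subscribers = new ArrayList<>();
    private final List<Fact> log = new ArrayList<>();

    void subscribe(Consumer<Fact> subscriber) {
        subscribers.add(subscriber);
    }

    // Publishing appends to the log first, then notifies subscribers synchronously.
    void publish(Fact fact) {
        log.add(fact);
        for (Consumer<Fact> subscriber : subscribers) {
            subscriber.accept(fact);
        }
    }

    List<Fact> log() {
        return Collections.unmodifiableList(log);
    }
}
```

A service reacts by subscribing and publishing a new fact in response, for example `bus.subscribe(f -> { if (f.name.equals("ItemPacked")) bus.publish(new Fact("LabelRequested", f.payload)); });`.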
  9. 9. Event driven Finite State Machine (FSM): an abstract machine that can be in exactly one of a finite number of states at any given time (Source: Wikipedia)
     Current State | Input | Next State | Output
     Locked | coin | Unlocked | Unlock turnstile so customer can push through
     Locked | push | Locked | None
     Unlocked | coin | Unlocked | None
     Unlocked | push | Locked | When customer has pushed through, lock turnstile
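The turnstile transition table on slide 9 maps directly onto code. A minimal sketch follows; the type names `TurnstileState`, `TurnstileInput` and `Turnstile` are chosen here for illustration, and the output strings are shortened versions of the table's Output column.

```java
enum TurnstileState { LOCKED, UNLOCKED }

enum TurnstileInput { COIN, PUSH }

final class Turnstile {
    // The machine is in exactly one state at any time; it starts locked.
    private TurnstileState state = TurnstileState.LOCKED;

    TurnstileState state() {
        return state;
    }

    // Applies one input event, moves to the next state, and returns the output action.
    String handle(TurnstileInput input) {
        if (state == TurnstileState.LOCKED && input == TurnstileInput.COIN) {
            state = TurnstileState.UNLOCKED;
            return "Unlock turnstile";
        }
        if (state == TurnstileState.UNLOCKED && input == TurnstileInput.PUSH) {
            state = TurnstileState.LOCKED;
            return "Lock turnstile";
        }
        // Locked + push and Unlocked + coin leave the state unchanged.
        return "None";
    }
}
```

The state here lives in a single object in memory. Keeping the same state durable and consistent across nodes is the harder problem that the rest of the talk addresses.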
  10. 10. Stateful EDA = Workflow Events may be: ● User actions ○ E.g. Item packed ● Messages from Messaging middleware ○ E.g. Offer go-live ● Database state change of entities ○ E.g. Order created “Set of Tasks that execute Sequentially or Concurrently; and are triggered by Events”
  11. 11. Flux ● An Asynchronous, Distributed and Reliable state machine based orchestrator ● Used to build Stateful event-driven apps and workflows. ● Simple Primitives and Deployment dependencies ● Available as a Hosted Service at Flipkart & as a stand-alone library ○ Users : Accounting, Seller Platform, Compliance & Document generation, FinTech etc. ● Stateful Serverless(mostly) platform ● Open Source :
  12. 12. Flux use case - Doctor tele-consultation Deployed on AWS
  13. 13. Flux use case - Scheduling Health check-ups
  14. 14. Feature Set ● State Management ● Async and Distributed processing ● Fork & Join Parallel processing ● Correlation Id support ● Retriable Errors ● Configurable Retries & Timeouts ● Versioning of workflows ● At least once guarantees ● Idempotency at workflow level ● External and Scheduled events ● Cancellation of ongoing Workflow ● Metrics & Audit ● Dynamic conditional branching
  15. 15. Observability - Individual FSM
  16. 16. Observability - Task state transitions
  17. 17. Part 3 - Stateful & Serverless
  18. 18. Serverless platform ● Simple primitives (Functions)
     ○ outputType handler-name(inputType input, Context context) {...}
     ○ @FunctionName("hello")
       public HttpResponseMessage<String> hello(
           @HttpTrigger(name = "req", methods = {"get", "post"}, authLevel = AuthorizationLevel.ANONYMOUS) HttpRequestMessage<Optional<String>> request,
           final ExecutionContext context) {...}
     ● No Server management ● Flexible scaling ● Automated High availability ● Observability
  19. 19. Stateful Serverless platform AWS Step Functions (JSON DSL) Retries Parallel (Fork)
  20. 20. Stateful Serverless platform AWS Step Functions (Java API) Parallel (Fork)
  21. 21. Flux Programming Primitives WorkFlow Task
  22. 22. Deployment Unit Each Deployment Unit is loaded into Flux node using a separate class loader
  23. 23. Fault tolerance ● Guarantees at-least-once execution of Tasks ○ Using durable states, transitions, events ● Local retries for timed-out executions and on Retriable exceptions ○ For @Task instances with ‘retries’ and ‘timeout’ attribute values; retries are executed on the same Flux node ● Global retries for stalled state transitions using Redriver ○ For @Task instances with ‘retries’ and ‘timeout’ attribute values; retries are executed on any of the Flux cluster nodes
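The local-retry behaviour on slide 23 (retry on timeout or on a retriable error, bounded by a retry count) can be sketched with the JDK's executor API. This is not Flux's implementation, which the deck says relies on Akka and Hystrix. `LocalRetry` and `RetriableException` are hypothetical names, and the sketch assumes that `retries` means extra attempts after the first one.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical marker for errors that are worth retrying; not a Flux class.
class RetriableException extends RuntimeException {
    RetriableException(String message) {
        super(message);
    }
}

final class LocalRetry {
    // Runs the task up to (1 + retries) times; each attempt is bounded by timeoutMillis.
    static <T> T run(Callable<T> task, int retries, long timeoutMillis) {
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            for (int attempt = 0; attempt <= retries; attempt++) {
                Future<T> future = executor.submit(task);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // Timed out: interrupt this attempt and move on to the next one.
                    future.cancel(true);
                } catch (ExecutionException e) {
                    if (!(e.getCause() instanceof RetriableException)) {
                        throw new IllegalStateException("non-retriable failure", e.getCause());
                    }
                    // Retriable error: fall through to the next attempt.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
            throw new IllegalStateException("retries exhausted");
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A timed-out attempt may still have completed its side effects before being cancelled, so the task can effectively run more than once. That is the at-least-once guarantee in practice, and it is why tasks retried this way need to be idempotent.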
  24. 24. Observability - Live cluster state
  25. 25. Observability - Aggregated Metrics per Workflow JMX metrics from Flux code and client code
  26. 26. Flux Design
  27. 27. Tech choices
     Requirement | Options
     Execution Isolation | Separate JVMs, Thread Bulk-heading, Deployment Unit
     Execution runtime | Akka Cluster, JVMs with coordination (ZK)
     State data store | MySQL, HBase
     Retries | Local (Actor Supervisor hierarchy), Global (Cluster-wide scheduler)
     Node placement, Coordination | Akka Gossip protocol, ZK
     Deployment unit, Wiring/DI | Guice, Separate Classloader, Java Module System
     Metrics & Monitoring | JMX, Hystrix, Metrics timeseries
     Timeouts, Failure detection | Hystrix
  28. 28. Flux Software components (Java) ● Google Guice: Dependency Injection ● Polyguice: Lifecycle Management ● Jetty: API & UI servers ● Akka: Queueing & Retries ● Hystrix: Circuit breaker, Timeouts, Metrics ● MySQL: Data persistence
  29. 29. Deployment - Managed Environment
  30. 30. Deployment - Isolated Execution Environment
  31. 31. Additional Reading ● Microsoft Service Fabric - Distributed platform for building Stateful micro services ● The Tail at Scale - by Jeff Dean - Building Latency tail-tolerant systems
  32. 32. Thank You Questions?
  33. 33. Additional slides
  34. 34. Invoking a workflow from Client
  35. 35. Events
  36. 36. Few Rules ● All params and return values should implement Event interface ● Stateless Classes ● Immutable event objects in workflow method ● Workflow method must return void
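The rule about immutable event objects on slide 36 can be sketched as follows. The `Event` interface below is a local stand-in declared only so the example compiles; the real Flux interface is not reproduced here, and `OrderCreated` is a hypothetical event type.

```java
// Local stand-in for the Event interface named on the slide (assumption, not the Flux type).
interface Event {
}

// Hypothetical immutable event: final class, final fields, no setters.
final class OrderCreated implements Event {
    private final String orderId;
    private final int quantity;

    OrderCreated(String orderId, int quantity) {
        this.orderId = orderId;
        this.quantity = quantity;
    }

    String orderId() {
        return orderId;
    }

    int quantity() {
        return quantity;
    }

    // A "change" produces a new event object and leaves the original untouched.
    OrderCreated withQuantity(int newQuantity) {
        return new OrderCreated(orderId, newQuantity);
    }
}
```

Immutability matters because an event may be persisted and redelivered during retries; a task that mutated its input would make a retried execution see different data from the first attempt.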
  37. 37. Use Cases - Seller Fulfilment Service When a seller initiates packing, we need to call and update various systems. The downstream systems involved are: 1. Vendor Assignor - generates a tracking id, which has to be shown on the label printed on the shipment. 2. Accounting - generates invoices and various other accounting entries. 3. Doc Service - creates a PDF consisting of the label and the invoice, which has to be printed and attached to the shipment. 4. Logistics Compliance Engine - generates the various government forms required when a shipment has to cross state borders. Flux orchestration is used in the above scenarios.
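The dependencies in the seller fulfilment use case form a small fork and join graph: the tracking id and the invoice can be produced in parallel, and the document needs both. The sketch below uses the JDK's `CompletableFuture` to show only that shape. It is not the Flux API, every method name is a hypothetical stand-in for a downstream call, and treating compliance forms as independent of the other steps is an assumption.

```java
import java.util.concurrent.CompletableFuture;

final class PackWorkflowSketch {
    // Hypothetical stand-ins for the four downstream systems.
    static String assignVendor(String shipmentId) {
        return "TRK-" + shipmentId;
    }

    static String generateInvoice(String shipmentId) {
        return "INV-" + shipmentId;
    }

    static String generateForms(String shipmentId) {
        return "FORMS-" + shipmentId;
    }

    static String createDocument(String trackingId, String invoice) {
        return "PDF[" + trackingId + "," + invoice + "]";
    }

    static String onPackInitiated(String shipmentId) {
        // Fork: three independent calls start concurrently.
        CompletableFuture<String> tracking =
                CompletableFuture.supplyAsync(() -> assignVendor(shipmentId));
        CompletableFuture<String> invoice =
                CompletableFuture.supplyAsync(() -> generateInvoice(shipmentId));
        CompletableFuture<String> forms =
                CompletableFuture.supplyAsync(() -> generateForms(shipmentId));

        // Join: the document needs both the tracking id and the invoice.
        CompletableFuture<String> pdf =
                tracking.thenCombine(invoice, PackWorkflowSketch::createDocument);

        // Final join: wait for the document and the compliance forms.
        return pdf.thenCombine(forms, (doc, f) -> doc + "+" + f).join();
    }
}
```

This version keeps all state in memory, so a process crash loses the in-flight workflow. A durable orchestrator persists each state transition so that execution can resume on another node.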
  38. 38. Testability ● Unit and Integration Testing ○ As workflows are written in Java, unit and integration tests are written the same way as for any other Java code ● Debugging through IDE ○ You can attach a debugger to the Flux process to debug the workflow during development. Find more details about it on ● Toggling between Async and Sync ○ Removing “FluxClientInterceptorModule” from your Guice Injector lets your code run in Sync mode. With this you can toggle between Async and Sync whenever you want.
  39. 39. State Machine Definition