Learn how to achieve high-throughput processing for real-time applications using the SMACK stack (Spark, Mesos, Akka, Cassandra, Kafka), with a focus on Apache Kafka and Spark Streaming. Customers today expect immediate response times in their daily interactions, whether they are searching for a product, using a service, placing an order, or asking a question. Making your customers wait results in lost sales and increased churn, so developing a real-time application that can provide the instantaneous response times your customers expect is critical for competitive advantage.
Webinar recording: https://youtu.be/cgKeoR9pJIQ
Please check out http://www.datastax.com/resources/web... for our huge selection of free on-demand webinars!
In this webinar, you will:
✓ Learn how to estimate the size of your data pipeline
✓ Learn its impact on the overall architecture of your application
✓ Understand the five most important considerations when developing a data pipeline
Data pipelines consist of many parts:
Kafka to organize
Akka and Spark to process
Cassandra to store
Mesos to manage everything
Current Issues with Data Pipelines / Problems customers are facing
Requires scaling the data pipeline for high throughput of data
The Italian Job: an elaborate gold heist using Mini Coopers as the getaway cars. The crew has to figure out how to distribute the load of gold bricks between the Minis.
DStream: continuous sequence of micro batches
More complex processing models are possible with less effort
Streaming computations as a series of deterministic batch computations on small time intervals
Ingests and processes data using complex algorithms, which are expressed with high-level functions such as map, reduce, join, and window.
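As a rough illustration of that micro-batch model (plain Python, not the Spark API; the function names here are invented for this sketch), a stream can be chopped into fixed-length windows and the same deterministic batch computation, here a word count, run on each window:

```python
from collections import Counter

def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into consecutive fixed-length
    windows, mimicking how a DStream is a continuous sequence of small
    deterministic batches."""
    if not events:
        return []
    start = events[0][0]
    batches, current = [], []
    for ts, payload in events:
        # Close out windows until this event falls inside the current one.
        while ts >= start + batch_interval:
            batches.append(current)
            current = []
            start += batch_interval
        current.append(payload)
    batches.append(current)
    return batches

def word_count(batch):
    # A pure function per batch: replaying the stream yields identical results.
    return Counter(word for line in batch for word in line.split())

events = [(0.2, "kafka spark"), (1.1, "spark"), (5.3, "cassandra")]
results = [word_count(b) for b in micro_batches(events, batch_interval=5.0)]
# results[0] covers the first 5-second window, results[1] the next one.
```

Because each batch is processed by the same deterministic function, recovery and replay are straightforward: re-running a window always produces the same output.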
A good approach to figure out the right batch size for your application is to test it with a conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system is able to keep up with the data rate, you can check the value of the end-to-end delay experienced by each processed batch (either look for “Total delay” in Spark driver log4j logs, or use the StreamingListener interface). If the delay is maintained to be comparable to the batch size, then the system is stable.
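The stability rule above can be seen with simple arithmetic: scheduling delay accumulates whenever a batch takes longer to process than the interval at which new batches arrive. A back-of-envelope Python sketch (hypothetical, assuming a fixed per-batch processing time):

```python
def total_delays(batch_interval, processing_time, num_batches):
    """End-to-end delay per batch when batches arrive every
    batch_interval seconds and each takes processing_time to finish."""
    delays = []
    free_at = 0.0                        # when the system is next free
    for i in range(num_batches):
        arrival = i * batch_interval     # batch i is ready at this time
        start = max(arrival, free_at)    # it may have to wait its turn
        free_at = start + processing_time
        delays.append(free_at - arrival)  # the "Total delay" for batch i
    return delays

# Stable: 5s batches that take 4s to process; delay stays flat at 4s.
stable = total_delays(batch_interval=5.0, processing_time=4.0, num_batches=10)
# Unstable: batches take 6s; delay grows by 1s per batch without bound.
unstable = total_delays(batch_interval=5.0, processing_time=6.0, num_batches=10)
```

A flat delay comparable to the batch interval means the system keeps up; a steadily growing delay means batches are queuing faster than they are processed.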
A topic is a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log.
The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time.
For example, if the log retention is set to two days, then for the two days after a message is published it is available for consumption, after which it will be discarded to free up space.
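To make the offset and retention mechanics concrete, here is a toy Python model (an illustration only, not the Kafka implementation): each partition is an append-only sequence, the offset is simply a message's sequential position, and retention discards old messages without renumbering the survivors:

```python
class PartitionLog:
    """Toy model of one Kafka partition: an ordered, immutable,
    append-only sequence where each message gets a sequential offset."""

    def __init__(self):
        self.base_offset = 0   # offset of the oldest retained message
        self.messages = []

    def append(self, msg):
        offset = self.base_offset + len(self.messages)
        self.messages.append(msg)
        return offset

    def read(self, offset):
        if offset < self.base_offset:
            raise KeyError("message discarded by retention")
        return self.messages[offset - self.base_offset]

    def expire(self, keep_last_n):
        """Crude stand-in for time-based retention: drop everything but
        the newest keep_last_n messages. Offsets never change."""
        drop = max(0, len(self.messages) - keep_last_n)
        self.base_offset += drop
        self.messages = self.messages[drop:]

log = PartitionLog()
offsets = [log.append(m) for m in ("a", "b", "c", "d")]  # offsets 0..3
log.expire(keep_last_n=2)    # "a" and "b" are discarded
# log.read(3) still returns "d"; log.read(0) now raises KeyError
```

Because offsets are stable, a consumer's position survives retention: it only needs to remember a single integer per partition.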
=========================
A nice feature of the Direct API is the straightforward mapping it gives you:
Kafka topics map to Cassandra tables.
Kafka partitions map one-to-one to Spark partitions.
1 Kafka/Spark partition, maxRatePerPartition = 100k, batchInterval = 5s: each batch can contain up to 500,000 messages per partition (100,000 msgs/sec × 5 sec).
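Plugging those numbers in shows how maxRatePerPartition bounds the size of each micro-batch (a back-of-envelope sketch, assuming the rate limit is fully used):

```python
partitions = 1
max_rate_per_partition = 100_000   # messages/sec, per partition
batch_interval = 5                 # seconds

# Upper bound on messages pulled into one micro-batch across all partitions:
messages_per_batch = partitions * max_rate_per_partition * batch_interval
print(messages_per_batch)  # 500000
```

Doubling the number of Kafka partitions (with the same per-partition rate cap) doubles this bound, which is why partition count is the main parallelism lever.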
To increase the level of parallelism for receiving data, increase the number of Kafka partitions (more Kafka partitions means more Spark partitions).
Setting the right batch interval: stay away from sub-second batch intervals unless your processing time can meet that constraint.
How this fits into your existing architecture
There are lots of applications that Spark makes much easier, but it really breaks down into small questions and big questions.