Fast Data architectures answer the growing need of enterprises to process and analyze continuous streams of data, so that they can accelerate decision making and react to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address Fast Data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
8. What is Spark and Why we Should Care
Streaming APIs in Spark
- Structured Streaming Overview
- Interactive Session 1
- Spark Streaming Overview
- Interactive Session 2
Spark Streaming [AND|OR|XOR] Structured Streaming
17. Operations
...
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]
…
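The snippet above relies on a `schema` value and a `SensorData` type that the slide does not show. The following is a minimal sketch of how they might be defined, assuming the field names `id`, `ts` and `temp` used on the later slides; the field types, the `ParseSketch` object and the sample JSON line are illustrative assumptions, not the talk's actual code. It runs against a static Dataset in local mode and needs `spark-sql` on the classpath.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

// Assumed record layout: names follow the slides, types are guesses.
final case class SensorData(id: String, ts: Long, temp: Double)

object ParseSketch {
  // Derive the StructType expected by from_json from the case class itself.
  val schema: StructType = Encoders.product[SensorData].schema

  // Same three steps as on the slide: JSON text -> struct column -> typed Dataset.
  def parse(rawValues: Dataset[String]): Dataset[SensorData] =
    rawValues
      .select(from_json(col("value"), schema) as "record")
      .select("record.*")
      .as[SensorData](Encoders.product[SensorData])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("parse-sketch")
      .getOrCreate()
    import spark.implicits._

    // A static Dataset stands in for the Kafka payload already cast to String.
    val rawValues = Seq("""{"id":"s1","ts":1000,"temp":21.5}""").toDS()
    parse(rawValues).show()
    spark.stop()
  }
}
```

Deriving the schema from the case class keeps the two definitions in sync; a hand-written `StructType` would work equally well.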
18. Event Time
...
val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
...
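To make the window and watermark parameters above more concrete, here is a simplified, Spark-free sketch of the arithmetic behind a 30-second window sliding every 10 seconds with a 30-second watermark. The names `SlidingWindowSketch`, `windowsFor`, `watermark` and `isOpen` are hypothetical, times are plain epoch seconds, and the model ignores that Spark only advances the watermark between micro-batches.

```scala
object SlidingWindowSketch {
  // A window covers the half-open interval [start, end), in epoch seconds.
  final case class Window(start: Long, end: Long)

  /** All windows of length `size`, sliding by `slide`, that contain `ts`.
    * Window starts are aligned to multiples of `slide`. */
  def windowsFor(ts: Long, size: Long = 30L, slide: Long = 10L): List[Window] = {
    // Latest aligned window start that is not after ts.
    val lastStart = ts - Math.floorMod(ts, slide)
    // Walk backwards for as long as the window still contains ts.
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start + size > ts)
      .map(start => Window(start, start + size))
      .toList
      .reverse
  }

  /** Watermark: the highest event time seen so far minus the allowed delay. */
  def watermark(maxEventTime: Long, delay: Long = 30L): Long =
    maxEventTime - delay

  /** In this model a window accepts data while its end is after the watermark. */
  def isOpen(window: Window, currentWatermark: Long): Boolean =
    window.end > currentWatermark
}
```

With these defaults, an event at second 25 falls into three overlapping windows: [0, 30), [10, 40) and [20, 50). Once the highest event time seen reaches 100, the watermark is 70, so an event arriving late with timestamp 55 can only still contribute to the window [50, 80).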
19. Sinks...
val visualizationQuery = sensorData.writeStream
  .queryName("visualization") // this query name will be the SQL table name
  .outputMode("update")
  .format("memory")
  .start()
...
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter") // identifies the query; unlike the memory sink, no table is created
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()
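The slide does not show how `kafkaFormat` is built. The Kafka sink expects a `value` column (string or binary) and optionally a `key` column, so one plausible shape is sketched below. The `KafkaFormatSketch` object, the `toKafkaFormat` helper, the choice of `id` as the message key and the `SensorData` field types are all assumptions made for illustration; `spark-sql` is needed on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, struct, to_json}

// Assumed record layout: names follow the slides, types are guesses.
final case class SensorData(id: String, ts: Long, temp: Double)

object KafkaFormatSketch {
  // Shape typed records into the key/value columns read by the Kafka sink.
  def toKafkaFormat(sensorData: Dataset[SensorData]): DataFrame =
    sensorData.select(
      col("id").as("key"), // assumption: the sensor id serves as the message key
      to_json(struct(col("id"), col("ts"), col("temp"))).as("value")
    )
}
```

Keying by sensor id would keep all readings of one sensor in the same topic partition, which preserves their relative order for downstream consumers.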
20. Use Cases
● Streaming ETL
● Stream aggregations, windows
● Event-time oriented analytics
● Join Streams with Fixed Datasets
● Apply Machine Learning Models
42. Features
1. One-click component installations
2. Automatic dependency checks
3. One-click access to install logs
4. Real-time cluster visualization
5. Access to consolidated production logs
Benefits:
1. Easy to get started
2. Ready access to all components
3. Increased developer velocity
Fast Data Platform Manager, for Managing Running Clusters