Realtime Reporting using Spark Streaming

Learn how Concur is transforming its reporting solution from nightly batch processing to real time using Apache Spark Streaming.

Breaking ETL barrier
with Real-time reporting
using Kafka, Spark Streaming

About us
Concur (now part of SAP) provides travel and
expense management services to
businesses.

Data Insights
A team that is building solutions to provide
customer access to data, visualization and
reporting.
Expense
Travel
Invoice

About me
Santosh Sahoo
Principal Architect III, Data Insights

Stack so far..
App -> OLTP -> ETL -> OLAP -> Report

Numbers
7K OLTP database sources
14K OLAP Reporting dbs
28K ETL Jobs
2B row changes
300M rows (Compacted)
Only ~20 failures a night

Traditional ETL challenges
Scheduled (High latency)
Hard to scale.
Failover and recovery.
Monolithic-ness
Spaghetti (Logic +SQL)

Moving forward
Streaming, real time
Scalable
Highly available
Reduce maintenance overhead
Eventual Consistency

Streaming Data Pipeline
Source
Flow Management
Processor
Storage
Querying

Data Source
Event bus for business events
Log scraping
Transaction log scraping
(Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres BottledWater, SQL Server fn_dblog)
Change Data Capture
Application messaging/JMS
Micro batching
(High watermarked, change tracking)
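
The high-watermark style of micro batching listed above can be sketched roughly as follows. This is a minimal illustration in plain Scala, not the pipeline from the talk: the `Row` shape, the `version` column (assumed unique and strictly increasing, like a sequence or row version) and the `nextBatch` helper are all hypothetical names.

```scala
// Hypothetical row shape; `version` is assumed unique and strictly increasing per change.
final case class Row(id: Long, payload: String, version: Long)

object HighWatermark {
  /** Returns the rows changed after `watermark` (oldest first, at most `batchSize`)
    * together with the new watermark to persist before the next poll. */
  def nextBatch(table: Seq[Row], watermark: Long, batchSize: Int): (Seq[Row], Long) = {
    val changed = table
      .filter(_.version > watermark) // only rows not yet extracted
      .sortBy(_.version)             // oldest change first
      .take(batchSize)               // bound the size of each micro batch
    val newWatermark = if (changed.isEmpty) watermark else changed.last.version
    (changed, newWatermark)
  }
}
```

Each poll picks up where the previous one stopped, so a failed run can simply be retried from the last persisted watermark. If the tracking column were a timestamp with possible ties, a batch boundary could split rows that share a value, which is why the sketch assumes a unique version.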

Kafka - Flow Management
No nonsense logging
~100K messages/s throughput vs. ~20K messages/s for RabbitMQ
Log compaction
Durable persistence
Partition tolerance
Replication
Best in class integration with Spark
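
Log compaction, listed above, keeps the latest record for each key so that a changelog shrinks toward a snapshot of current state. The sketch below models that idea in plain Scala; it does not call Kafka, the `Record` type and `compact` function are hypothetical, and it drops tombstones (records with no value) immediately, whereas a real broker retains them for a configurable period.

```scala
// Hypothetical changelog record; value = None stands for a tombstone (delete marker).
final case class Record(offset: Long, key: String, value: Option[String])

object LogCompaction {
  /** Keeps only the highest-offset record per key, drops keys whose latest
    * record is a tombstone, and returns the survivors in offset order. */
  def compact(log: Seq[Record]): Seq[Record] =
    log
      .groupBy(_.key)
      .values
      .map(_.maxBy(_.offset))    // latest change wins for each key
      .filter(_.value.isDefined) // simplification: remove tombstones right away
      .toSeq
      .sortBy(_.offset)
}
```

This is the same effect as the "2B row changes" versus "300M rows (Compacted)" figures on the Numbers slide: many changes to the same row collapse into one current row.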

Columnar Storage
Optimized for analytic query
performance.
Vertical partitioning
Column Projection
Compression
Loosely coupled schema.
HBase
AWS Redshift
Parquet
ORC
Postgres (Citus)
SAP HANA
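
To make vertical partitioning, column projection and compression concrete, here is a small plain-Scala sketch. It is only an in-memory illustration under assumed names (`Expense`, `ExpenseColumns`, `Columnar`); none of the stores listed above is involved.

```scala
// Hypothetical row-oriented record.
final case class Expense(id: Long, employee: String, category: String, amount: Double)

// The same data partitioned vertically: one vector per column.
final case class ExpenseColumns(
  ids: Vector[Long],
  employees: Vector[String],
  categories: Vector[String],
  amounts: Vector[Double]
)

object Columnar {
  def fromRows(rows: Seq[Expense]): ExpenseColumns =
    ExpenseColumns(
      rows.map(_.id).toVector,
      rows.map(_.employee).toVector,
      rows.map(_.category).toVector,
      rows.map(_.amount).toVector
    )

  // Column projection: the aggregate reads only the `amounts` column.
  def totalAmount(cols: ExpenseColumns): Double = cols.amounts.sum

  // Run-length encoding: repeated adjacent values in a column collapse to (value, count).
  def runLengthEncode[A](column: Seq[A]): Vector[(A, Int)] =
    column.foldLeft(Vector.empty[(A, Int)]) { (acc, x) =>
      acc.lastOption match {
        case Some((v, n)) if v == x => acc.init :+ ((v, n + 1))
        case _                      => acc :+ ((x, 1))
      }
    }
}
```

An analytic query such as a sum over amounts never touches the employee or category data, and low-cardinality columns such as category compress well because equal values sit next to each other.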

Hadoop/HDFS
Pro - Scale
Con - Latency

Spark Streaming
What? A data processing framework to build
scalable fault-tolerant streaming
applications.
Why? It lets you reuse the same code for
batch processing, join streams against
historical data, or run ad-hoc queries on
stream state.
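
The "reuse the same code for batch processing" point can be illustrated with a plain-Scala model of micro-batching. This sketch deliberately avoids the Spark API so it runs standalone; `Event`, `totals`, `discretize` and `streamingTotals` are hypothetical names. In Spark Streaming itself, operations such as `DStream.transform` and `foreachRDD` play the corresponding role by handing each batch's RDD to ordinary RDD code.

```scala
// Hypothetical event shape.
final case class Event(timestampMs: Long, customer: String, amount: Double)

object MicroBatch {
  // Batch logic, written once: total amount per customer.
  def totals(events: Seq[Event]): Map[String, Double] =
    events.groupBy(_.customer).map { case (c, es) => c -> es.map(_.amount).sum }

  // Discretize a stream into fixed intervals, keyed by interval start time.
  def discretize(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(e => e.timestampMs / intervalMs * intervalMs).toSeq.sortBy(_._1)

  // Streaming reuse: apply the unchanged batch function to every micro-batch.
  def streamingTotals(events: Seq[Event], intervalMs: Long): Seq[(Long, Map[String, Double])] =
    discretize(events, intervalMs).map { case (start, batch) => start -> totals(batch) }
}
```

The nightly job would call `totals` over a full day of events, while the streaming job calls the same function once per interval, so the business logic is not duplicated.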

Spark Streaming Architecture
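[Diagram: a Driver (master) and three Workers, each running an Executor. A Receiver reads from the Source; received data blocks (D1-D4) are replicated across executors and written to a write-ahead log (WAL) in a data store. Tasks run on the executors.]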
DStream - Discretized Stream, a sequence of RDDs
RDD - Resilient Distributed Dataset

Optimized Direct Kafka API
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

How
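import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Comma-separated broker list; the string must not contain spaces or line breaks
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")
// streamingContext is an already constructed StreamingContext
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)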
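
As described in the linked Databricks post, the direct approach has no receiver: for each batch the driver decides which offset range to read from each Kafka partition. The following plain-Scala model sketches that bookkeeping only. It does not use Spark or Kafka classes, and `TopicPart`, `BatchRange`, `planBatch` and `commit` are hypothetical names chosen for illustration.

```scala
// Hypothetical identifiers for a Kafka partition and a planned read.
final case class TopicPart(topic: String, partition: Int)
final case class BatchRange(tp: TopicPart, fromOffset: Long, untilOffset: Long) // until is exclusive

object DirectOffsets {
  /** Plans one batch: for every partition, read from the last consumed offset
    * up to the latest available offset, capped at `maxPerPartition` records. */
  def planBatch(
      consumed: Map[TopicPart, Long],
      latest: Map[TopicPart, Long],
      maxPerPartition: Long
  ): Seq[BatchRange] =
    latest.toSeq
      .sortBy { case (tp, _) => (tp.topic, tp.partition) }
      .map { case (tp, end) =>
        val from = consumed.getOrElse(tp, 0L)
        BatchRange(tp, from, math.min(end, from + maxPerPartition))
      }
      .filter(r => r.untilOffset > r.fromOffset) // skip partitions with nothing new

  /** After a batch succeeds, advance the consumed offsets to the end of each range. */
  def commit(consumed: Map[TopicPart, Long], ranges: Seq[BatchRange]): Map[TopicPart, Long] =
    consumed ++ ranges.map(r => r.tp -> r.untilOffset)
}
```

Because a batch is fully described by its offset ranges, a failed batch can be re-read from Kafka with exactly the same ranges, which is what makes replay deterministic in this style of integration.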

Architecture

High level view
App -> OLTP -> Kafka -> Spark Streaming -> OLAP -> Reporting App

Complete Architecture
[Diagram. Components shown:
Sources: Expense, Travel, TTX, API; OLTP; Import (FTP, HTTP, SMTP)
Service bus
Broker: Kafka (Protobuf, Json), with labels P and C
Stream Processor: Spark (alternatives: Samza, Storm, Flink)
Tachyon
Archive: Flume, Camus, into HDFS; Sqoop snapshot
Pig/Hive/MR - Normalization
Hive/Spark SQL
HANA: replication, standby, load balance, failover
Reporting: Cognos, Tableau ?
Data {Quality, Correction, Analytics}
Extract, Compensate, Migrate method, API/SQL]

Can Spark Streaming
survive Chaos Monkey?
http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Lambda Architecture
Lambda architecture is a data-processing
pattern designed to handle massive
quantities of data by taking advantage of
both batch- and stream-processing methods.
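
One way to picture the pattern: a query combines a precomputed batch view with a speed-layer view that covers only the events since the last batch run. The sketch below is a minimal plain-Scala illustration; `LambdaQuery.merge` and the example keys are hypothetical, and it assumes the metric is additive (totals), so the two views can simply be summed.

```scala
object LambdaQuery {
  /** Merges the batch view (totals up to the last batch run) with the
    * speed view (totals for events that arrived after that run). */
  def merge(batchView: Map[String, Double], speedView: Map[String, Double]): Map[String, Double] =
    (batchView.keySet ++ speedView.keySet).map { k =>
      k -> (batchView.getOrElse(k, 0.0) + speedView.getOrElse(k, 0.0))
    }.toMap
}
```

When the next batch run completes, its output replaces the batch view and the speed view is reset, so any error accumulated in the streaming path is bounded by one batch cycle.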

Demo
….

QnA

concur.com/en-us/careers
We are hiring

Thank you!


An Architect's guide to real time big data systems

An Architect's guide to real time big data systems

Don't Cross The Streams - Data Streaming And Apache Flink

Don't Cross The Streams - Data Streaming And Apache Flink

Don't Cross The Streams - Data Streaming And Apache Flink

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure

Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure

Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics

SnappyData @ Seattle Spark Meetup

SnappyData @ Seattle Spark Meetup

SnappyData @ Seattle Spark Meetup

Unifying Analytics

Unifying Analytics

Unifying Analytics

5 Years of Progress in Active Data Warehousing

5 Years of Progress in Active Data Warehousing

5 Years of Progress in Active Data Warehousing

Stsg17 speaker yousunjeong

Stsg17 speaker yousunjeong

Stsg17 speaker yousunjeong

Data Con LA 2022 Keynote

Data Con LA 2022 Keynote

Data Con LA 2022 Keynote

Introduction to apache kafka, confluent and why they matter

Introduction to apache kafka, confluent and why they matter

Introduction to apache kafka, confluent and why they matter

Fossasia 2018-chetan-khatri

Fossasia 2018-chetan-khatri

Fossasia 2018-chetan-khatri

Recently uploaded

Understanding the FAA Part 107 License ..

Understanding the FAA Part 107 License ..

Understanding the FAA Part 107 License ..

Christopher Logan Kennedy

Join our latest Connector Corner webinar to discover how UiPath Integration Service revolutionizes API-centric automation in a 'Quote to Cash' process—and how that automation empowers businesses to accelerate revenue generation. A comprehensive demo will explore connecting systems, GenAI, and people, through powerful pre-built connectors designed to speed process cycle times. Speakers: James Dickson, Senior Software Engineer Charlie Greenberg, Host, Product Marketing Manager

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Effective data discovery is crucial for maintaining compliance and mitigating risks in today's rapidly evolving privacy landscape. However, traditional manual approaches often struggle to keep pace with the growing volume and complexity of data. Join us for an insightful webinar where industry leaders from TrustArc and Privya will share their expertise on leveraging AI-powered solutions to revolutionize data discovery. You'll learn how to: - Effortlessly maintain a comprehensive, up-to-date data inventory - Harness code scanning insights to gain complete visibility into data flows leveraging the advantages of code scanning over DB scanning - Simplify compliance by leveraging Privya's integration with TrustArc - Implement proven strategies to mitigate third-party risks Our panel of experts will discuss real-world case studies and share practical strategies for overcoming common data discovery challenges. They'll also explore the latest trends and innovations in AI-driven data management, and how these technologies can help organizations stay ahead of the curve in an ever-changing privacy landscape.

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

Product Anonymous

Discover the innovative features and strategic vision that keep WSO2 an industry leader. Explore the exciting 2024 roadmap of WSO2 API management, showcasing innovations, unified APIM/APK control plane, natural language API interaction, and cloud native agility. Discover how open source solutions, microservices architecture, and cloud native technologies unlock seamless API management in today's dynamic landscapes. Leave with a clear blueprint to revolutionize your API journey and achieve industry success!

WSO2's API Vision: Unifying Control, Empowering Developers

WSO2's API Vision: Unifying Control, Empowering Developers

WSO2's API Vision: Unifying Control, Empowering Developers

Strategies for Landing an Oracle DBA Job as a Fresher

Strategies for Landing an Oracle DBA Job as a Fresher

Strategies for Landing an Oracle DBA Job as a Fresher

Remote DBA Services

Webinar Recording: https://www.panagenda.com/webinars/why-teams-call-analytics-is-critical-to-your-entire-business Nothing is as frustrating and noticeable as being in an important call and being unable to see or hear the other person. Not surprising then, that issues with Teams calls are among the most common problems users call their helpdesk for. Having in depth insight into everything relevant going on at the user’s device, local network, ISP and Microsoft itself during the call is crucial for good Microsoft Teams Call quality support. To ensure a quick and adequate solution and to ensure your users get the most out of their Microsoft 365. But did you know that ‘bad calls’ are also an excellent indicator of other problems arising? Precisely because it is so noticeable!? Like the canary in the mine, bad calls can be early indicators of problems. Problems that might otherwise not have been noticed for a while but can have a big impact on productivity and satisfaction. Join this session by Christoph Adler to learn how true Microsoft Teams call quality analytics helped other organizations troubleshoot bad calls and identify and fix problems that impacted Teams calls or the use of Microsoft365 in general. See what it can do to keep your users happy and productive! In this session we will cover - Why CQD data alone is not enough to troubleshoot call problems - The importance of attributing call problems to the right call participant - What call quality analytics can do to help you quickly find, fix-, and prevent problems - Why having retrospective detailed insights matters - Real life examples of how others have used Microsoft Teams call quality monitoring to problem shoot problems with their ISP, network, device health and more.

Why Teams call analytics are critical to your entire business

Why Teams call analytics are critical to your entire business

Why Teams call analytics are critical to your entire business

Corporate and higher education. Two industries that, in the past, have had a clear divide with very little crossover. The difference in goals, learning styles and objectives paved the way for differing learning technologies platforms to evolve. Now, those stark lines are blurring as both sides are discovering they have content that’s relevant to the other. Join Tammy Rutherford as she walks through the pros and cons of corporate and higher ed collaborating. And the challenges of these different technology platforms working together for a brighter future.

Corporate and higher education May webinar.pptx

Corporate and higher education May webinar.pptx

Corporate and higher education May webinar.pptx

Rustici Software

Scaling API-first – The story of a global engineering organization Ian Reasor, Senior Computer Scientist - Adobe Radu Cotescu, Senior Computer Scientist - Adobe Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Exploring Multimodal Embeddings with Milvus

Exploring Multimodal Embeddings with Milvus

Exploring Multimodal Embeddings with Milvus

Following the popularity of “Cloud Revolution: Exploring the New Wave of Serverless Spatial Data,” we’re thrilled to announce this much-anticipated encore webinar. In this sequel, we’ll dive deeper into the Cloud-Native realm by uncovering practical applications and FME support for these new formats, including COGs, COPC, FlatGeoBuf, GeoParquet, STAC, and ZARR. Building on the foundation laid by industry leaders Michelle Roby of Radiant Earth and Chris Holmes of Planet in the first webinar, this second part offers an in-depth look at the real-world application and behind-the-scenes dynamics of these cutting-edge formats. We will spotlight specific use-cases and workflows, showcasing their efficiency and relevance in practical scenarios. Discover the vast possibilities each format holds, highlighted through detailed discussions and demonstrations. Our expert speakers will dissect the key aspects and provide critical takeaways for effective use, ensuring attendees leave with a thorough understanding of how to apply these formats in their own projects. Elevate your understanding of how FME supports these cutting-edge technologies, enhancing your ability to manage, share, and analyze spatial data. Whether you’re building on knowledge from our initial session or are new to the serverless spatial data landscape, this webinar is your gateway to mastering cloud-native formats in your workflows.

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

CNIC Information System with Pakdata Cf In Pakistan

CNIC Information System with Pakdata Cf In Pakistan

CNIC Information System with Pakdata Cf In Pakistan

MINDCTI Revenue Release Quarter One 2024

MINDCTI Revenue Release Quarter One 2024

MINDCTI Revenue Release Quarter One 2024

The Good, the Bad and the Governed - Why is governance a dirty word? David O'Neill, Chief Operating Officer - APIContext Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

sudhanshuwaghmare1

Six Myths about Ontologies: The Basics of Formal Ontology

Six Myths about Ontologies: The Basics of Formal Ontology

Six Myths about Ontologies: The Basics of Formal Ontology

johnbeverley2021

In this keynote, Asanka Abeysinghe, CTO,WSO2 will explore the shift towards platformless technology ecosystems and their importance in driving digital adaptability and innovation. We will discuss strategies for leveraging decentralized architectures and integrating diverse technologies, with a focus on building resilient, flexible, and future-ready IT infrastructures. We will also highlight WSO2's roadmap, emphasizing our commitment to supporting this transformative journey with our evolving product suite.

Platformless Horizons for Digital Adaptability

Platformless Horizons for Digital Adaptability

Platformless Horizons for Digital Adaptability

Following the popularity of "Cloud Revolution: Exploring the New Wave of Serverless Spatial Data," we're thrilled to announce this much-anticipated encore webinar. In this sequel, we'll dive deeper into the Cloud-Native realm by uncovering practical applications and FME support for these new formats, including COGs, COPC, FlatGeoBuf, GeoParquet, STAC, and ZARR. Building on the foundation laid by industry leaders Michelle Roby of Radiant Earth and Chris Holmes of Planet in the first webinar, this second part offers an in-depth look at the real-world application and behind-the-scenes dynamics of these cutting-edge formats. We will spotlight specific use-cases and workflows, showcasing their efficiency and relevance in practical scenarios. Discover the vast possibilities each format holds, highlighted through detailed discussions and demonstrations. Our expert speakers will dissect the key aspects and provide critical takeaways for effective use, ensuring attendees leave with a thorough understanding of how to apply these formats in their own projects. Elevate your understanding of how FME supports these cutting-edge technologies, enhancing your ability to manage, share, and analyze spatial data. Whether you're building on knowledge from our initial session or are new to the serverless spatial data landscape, this webinar is your gateway to mastering cloud-native formats in your workflows.

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Artificial Intelligence Chap.5 : Uncertainty

Artificial Intelligence Chap.5 : Uncertainty

Artificial Intelligence Chap.5 : Uncertainty

Khushali Kathiriya

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

Recently uploaded (20)

Understanding the FAA Part 107 License ..

Understanding the FAA Part 107 License ..

Understanding the FAA Part 107 License ..

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

WSO2's API Vision: Unifying Control, Empowering Developers

WSO2's API Vision: Unifying Control, Empowering Developers

WSO2's API Vision: Unifying Control, Empowering Developers

Strategies for Landing an Oracle DBA Job as a Fresher

Strategies for Landing an Oracle DBA Job as a Fresher

Strategies for Landing an Oracle DBA Job as a Fresher

Why Teams call analytics are critical to your entire business

Why Teams call analytics are critical to your entire business

Why Teams call analytics are critical to your entire business

Corporate and higher education May webinar.pptx

Corporate and higher education May webinar.pptx

Corporate and higher education May webinar.pptx

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Exploring Multimodal Embeddings with Milvus

Exploring Multimodal Embeddings with Milvus

Exploring Multimodal Embeddings with Milvus

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

CNIC Information System with Pakdata Cf In Pakistan

CNIC Information System with Pakdata Cf In Pakistan

CNIC Information System with Pakdata Cf In Pakistan

MINDCTI Revenue Release Quarter One 2024

MINDCTI Revenue Release Quarter One 2024

MINDCTI Revenue Release Quarter One 2024

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

Boost Fertility New Invention Ups Success Rates.pdf

Six Myths about Ontologies: The Basics of Formal Ontology

Six Myths about Ontologies: The Basics of Formal Ontology

Six Myths about Ontologies: The Basics of Formal Ontology

Platformless Horizons for Digital Adaptability

Platformless Horizons for Digital Adaptability

Platformless Horizons for Digital Adaptability

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Artificial Intelligence Chap.5 : Uncertainty

Artificial Intelligence Chap.5 : Uncertainty

Artificial Intelligence Chap.5 : Uncertainty

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

Realtime Reporting using Spark Streaming

1. Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming

2. About us: Concur (now part of SAP) provides travel and expense management services to businesses.

3. Data Insights: a team building solutions that give customers access to data, visualization, and reporting (Expense, Travel, Invoice).

4. About me: Santosh Sahoo, Principal Architect III, Data Insights

5. Stack so far: App -> OLTP -> ETL -> OLAP -> Report

6. Numbers: 7K OLTP database sources; 14K OLAP reporting databases; 28K ETL jobs; 2B row changes; 300M rows (compacted); only ~20 failures a night

7. Traditional ETL challenges: scheduled (high latency); hard to scale; failover and recovery; monolithic; spaghetti (logic + SQL)

8. Moving forward: streaming, real time; scalable; highly available; reduced maintenance overhead; eventual consistency

9. Streaming Data Pipeline: Source -> Flow Management -> Processor -> Storage -> Querying

10. Data Source: event bus for business events; transaction log scraping (Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres BottledWater, SQL Server fn_dblog); change data capture; application messaging/JMS; micro-batching (high watermark, change tracking)
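The high-watermark micro-batching option on this slide can be sketched roughly as follows. This is a minimal illustration, not the pipeline Concur built: the `Row` shape, the `version` column, and the `nextBatch` helper are all hypothetical, and the sketch assumes every change stamps the row with a unique, monotonically increasing version.

```scala
// Hypothetical row shape: each change stamps the row with a unique, increasing version.
final case class Row(id: Int, payload: String, version: Long)

object HighWatermark {
  // Returns the rows changed since `watermark` (oldest first, at most `batchSize`)
  // together with the new watermark to persist before the next poll.
  def nextBatch(table: Seq[Row], watermark: Long, batchSize: Int): (Seq[Row], Long) = {
    val batch = table.filter(_.version > watermark).sortBy(_.version).take(batchSize)
    // If nothing changed, the watermark stays where it was.
    val newWatermark = if (batch.isEmpty) watermark else batch.last.version
    (batch, newWatermark)
  }
}
```

Each poll picks up where the previous one stopped, so only changed rows leave the source database. In a real source the filter would typically be a query on a change-tracking or timestamp column rather than an in-memory scan.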

11. Kafka - Flow Management: no-nonsense logging; 100K/s throughput vs. 20K for RabbitMQ; log compaction; durable persistence; partition tolerance; replication; best-in-class integration with Spark
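Log compaction is what turns a stream of row changes into the latest state per row (compare "2B row changes" with "300M rows (Compacted)" on slide 6). The sketch below models only the semantics in plain Scala; it is not Kafka's implementation, and the `Record` type and `compact` function are hypothetical names. It also drops delete markers immediately, whereas Kafka retains them for a configurable period.

```scala
// Hypothetical record shape: value = None models a delete marker (tombstone).
final case class Record(offset: Long, key: String, value: Option[String])

object LogCompaction {
  // Keeps only the highest-offset record for each key, in offset order.
  // Keys whose latest record is a tombstone are dropped (a simplification).
  def compact(log: Seq[Record]): Seq[Record] =
    log.groupBy(_.key)
      .values
      .map(_.maxBy(_.offset))
      .filter(_.value.isDefined)
      .toSeq
      .sortBy(_.offset)
}
```

Using the row's primary key as the message key means a compacted topic holds at most one record per row, which keeps a full replay bounded by table size rather than by change volume.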

12. Columnar Storage: optimized for analytic query performance; vertical partitioning; column projection; compression; loosely coupled schema. Options: HBase, AWS Redshift, Parquet, ORC, Postgres (Citus), SAP HANA

13. Hadoop/HDFS: Pro: scale. Con: latency.

14. Spark Streaming: What? A data processing framework for building scalable, fault-tolerant streaming applications. Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.

15. Spark Streaming Architecture (diagram; labels only): Driver, Master, Workers, Executors, Receiver, Source, data blocks D1-D4, WAL, Replication, Data Store, Task. DStream: Discretized Stream of RDDs. RDD: Resilient Distributed Dataset.

16. Optimized Direct Kafka API https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

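17. How

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Comma-separated broker list, with no spaces between entries.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
    val topics = Set("sometopic", "anothertopic")
    // streamingContext is an existing StreamingContext.
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)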

18. Architecture

19. High level view: App -> OLTP -> Kafka -> Spark Streaming -> OLAP -> Reporting App

20. Complete Architecture (diagram; labels as extracted): OLTP, Reporting, Cognos, Tableau, ?, Archive, Flume, Camus, Stream Processor, Spark, Samza, Storm, Flink, HDFS, Import, FTP, HTTP, SMTP, C, Tachyon, P, Standby, Protobuf, JSON, Broker, Kafka, Hive/Spark SQL, HANA, Load balance, Failover, Replication, Service bus, Sqoop, Snapshot, Pig/Hive/MR - Normalization, Extract, Compensate, Data {Quality, Correction, Analytics}, Migrate method, API/SQL, Expense, Travel, TTX, API

21. Can Spark Streaming survive Chaos Monkey? http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

22. Lambda Architecture Lambda architecture is a data-processing pattern designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
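At query time, a Lambda-style system typically combines a precomputed batch view with a small real-time view covering whatever arrived after the last batch run. The sketch below shows that merge for simple per-key totals. The `LambdaQuery` object, the view shapes, and the additive merge are illustrative assumptions, not details from the talk.

```scala
object LambdaQuery {
  // batchView: totals precomputed by the batch layer up to its last run.
  // speedView: increments from the streaming layer for events after that run.
  // The merged view is the per-key sum; a key may appear in either view or both.
  def merge(batchView: Map[String, Long], speedView: Map[String, Long]): Map[String, Long] =
    (batchView.keySet ++ speedView.keySet).map { key =>
      key -> (batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L))
    }.toMap
}
```

The merge is only this simple because sums are additive. Aggregates such as distinct counts need a mergeable representation in both layers.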

25. concur.com/en-us/careers We are hiring