Real-time analytics with
Flink and Druid
Erhwen Kuo (郭二文)
Senior Manager, Enterprise Application Systems @ Wistron
DataCon.TW 2017
Wistron
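About the speaker
• 5 years in production management / materials management
• 3 years developing and operating MES/SFC systems
• 3 years as implementation and development lead for ERP (SAP MM module) / EAI
• 2 years as End-To-End Business Integration Analyst
• 1 year as OEM Sales Team Leader
• 4 years in enterprise operations analytics / group consolidated financial reporting systems
• 4 years researching and promoting new big data technologies
• Ironman 226 (full-distance triathlon) finisher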
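Big data technology research and promotion
• Inside the company
• Direction
• Start from the company's needs and look to communities at home and abroad for similar or instructive cases to share
Big data technology research and promotion
• Community
• What is taken from the community goes back to the community
• Founded http://eighty20.cc
• Direction
• Strip out anything related to company business
• Focus on technology integration, case studies, and experience sharing
• 2016 - Spark Hands-on: Quick Start Camp (6 sessions, 18 hours)
• Taiwan R User Group / MLDM Monday community
• 2016 - Realtime WebApp Hands-on (3 sessions, 9 hours)
• Taiwan Spark User Group community
• 2017 - ELK + Grafana Hands-on (3 sessions, 9 hours)
• Taiwan Spark User Group community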
Overview
• Problem
• Business Intelligence Analytics For Time Series Data
• Possible solutions
• Choosing The Right Tools For The Job
• Architecture
• End-to-end Data Pipeline
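Problem / Challenges
Source: http://www.dreamwerkztechnologies.com/dwz-inventory-management/
Problem / Challenges
Source: https://logicyyc.com/2017/04/27/industry-4-0/
Problem / Challenges
• Operational Information level
• Has the most users; software license constraints often force many reports to be produced by hand
• Needs highly dynamic query / filter / aggregation capabilities
• Needs more timely data
• Needs the underlying transaction-level data
Must be cheap
Problem / Challenges
Must be fast and good
Features
Data
Speed
Number of users supported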
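Problem / Challenges
• Build a near-real-time business intelligence (BI) time series analytics system
• A very large number of operational-level users
• Many users
• A high volume of concurrent query/aggregation requests
• High concurrency
• Very fast query/aggregation results
• Low latency, sub-second response
• Very large volumes of real-time and historical data
• Large volume of history & current data
Problem / Challenges
• Build a near-real-time business intelligence (BI) time series analytics system
• When the volume of data to process is huge, what architecture should be used?
• There are many solutions, techniques, and vendors out there; which should we pick?
• How do we choose the right tool?
Must be fast
Must be cheap
Must be good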
Possible solutions
The RDBMS-based solution
Diagram: Event Data → ETL → RDBMS → Business Intelligence Application
The RDBMS-based solution
• Traditional data warehouse
• Row store
• Star schema
• Aggregate tables
• Query cache
• Scanning raw data is slow and expensive
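The trade-off between aggregate tables and raw scans can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the table names `events` and `daily_part_qty`, the columns, and the sample rows are hypothetical and do not come from the talk.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with raw events and one aggregate table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical raw event table: one row per transaction.
    conn.execute("CREATE TABLE events (day TEXT, part TEXT, qty INTEGER)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("2017-09-29", "PART_A", 4),
            ("2017-09-29", "PART_B", 1),
            ("2017-09-30", "PART_A", 2),
            ("2017-09-30", "PART_A", 3),
        ],
    )
    # Aggregate table: one row per (day, part), as a batch ETL step might build it.
    conn.execute(
        "CREATE TABLE daily_part_qty AS "
        "SELECT day, part, SUM(qty) AS total_qty, COUNT(*) AS event_count "
        "FROM events GROUP BY day, part"
    )
    return conn


def total_from_raw(conn, part):
    """Answer the question by scanning every raw event row."""
    row = conn.execute(
        "SELECT SUM(qty) FROM events WHERE part = ?", (part,)
    ).fetchone()
    return row[0]


def total_from_aggregate(conn, part):
    """Answer the same question from the much smaller aggregate table."""
    row = conn.execute(
        "SELECT SUM(total_qty) FROM daily_part_qty WHERE part = ?", (part,)
    ).fetchone()
    return row[0]
```

Both functions return the same total, but the aggregate table only helps for questions along the dimensions it was built with (`day` and `part` here). A filter on any other attribute falls back to the raw scan, which is the limitation this slide points at.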
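The Apache Kylin-based solution
Diagram: Event Data → ETL → Kylin / HBase → Business Intelligence Application
The Apache Kylin-based solution
• Distributed multidimensional analytics (OLAP) engine
• Few published cases of streaming data ingestion
** Not yet researched or tested in depth **
(A project worth watching)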
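The general-purpose big data solution
Diagram: Event Data → ETL → Spark SQL / Impala → Business Intelligence Application
The general-purpose big data solution
• Load the data into Hadoop
• Query / filter / aggregate with tools such as Spark SQL or Impala
• Use Tableau or Power BI for data visualization and interactive analysis
The general-purpose big data solution
• Build a near-real-time business intelligence (BI) time series analytics system
• A very large number of operational-level users
• Many users (licensing is a concern with Tableau and Power BI)
• A high volume of concurrent query/aggregation requests
• High concurrency (Spark / Impala do not easily support hundreds of users working at the same time)
• Very fast query/aggregation results
• Low latency, sub-second response (hard to tune down to second-level response times)
• Very large volumes of real-time and historical data
• Large volume of history & current data (new data has to accumulate to 128 MB or wait 10 minutes before it can be written to HDFS in Parquet format)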
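The last point, that new data must reach 128 MB or wait 10 minutes before it lands in HDFS, is a size-or-age flush policy. The sketch below shows that policy in plain Python. The class name `FlushBuffer` and its parameters are hypothetical, and no actual Parquet or HDFS write is performed.

```python
class FlushBuffer:
    """Buffers records and releases them as a batch once a size or age limit is hit."""

    def __init__(self, max_bytes=128 * 1024 * 1024, max_age_seconds=600):
        self.max_bytes = max_bytes              # 128 MB by default
        self.max_age_seconds = max_age_seconds  # 10 minutes by default
        self.records = []
        self.size_bytes = 0
        self.opened_at = None                   # arrival time of the oldest buffered record

    def add(self, record: bytes, now: float):
        """Buffer one record; return the flushed batch if a limit is reached, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size_bytes += len(record)
        too_big = self.size_bytes >= self.max_bytes
        too_old = now - self.opened_at >= self.max_age_seconds
        if too_big or too_old:
            return self.flush()
        return None

    def flush(self):
        """Hand back the current batch (where a file write would happen) and reset."""
        batch = self.records
        self.records, self.size_bytes, self.opened_at = [], 0, None
        return batch
```

In this simplified version the age limit is only checked when a record arrives. Either way, a record sitting in the buffer is invisible to queries until the batch is flushed, which is why this design struggles to serve data that is only seconds old.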
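The Elastic-based solution
Diagram: Event Data → ETL → Elasticsearch → Business Intelligence Application
The Elastic-based solution
• Pros:
• The cluster is simple to set up
• Handles a high volume of concurrent query/aggregation requests
• Very fast query/aggregation results
• Can query both real-time and historical data
• Cons:
• Few out-of-the-box components for the presentation layer
• Hard to combine with traditional BI tools
• Access control / authentication requires a commercial license
• Requires in-house customization skills and developers
• The real-time data ingestion rate needs tuning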
The Druid-based solution
Diagram: Event Data → ETL → Druid → Business Intelligence Application
The Druid-based solution - Architecture
Diagram labels: Event enrichment; Batch Data; Preprocess; Meta data cache
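Presentation layer - Superset
• Superset is a visual, interactive data exploration platform
• Open-sourced by Airbnb, now an Apache incubating project
• Rich visualization components and simple, intuitive data exploration
• Flexible authentication/authorization, with LDAP integration
• Deeply integrated with Druid, using Druid's very fast slice & dice capability to improve the user experience
Data storage / query layer - Druid
• Druid is a highly fault-tolerant, high-performance open source distributed system for real-time querying and analysis of big data
• Designed for analytics: Druid is built for exploratory analysis in OLAP workflows and supports a variety of filters, aggregations, and queries
• Fast interactive queries: Druid's low-latency ingestion architecture lets events be queried within milliseconds of being created
• High availability: Druid's data stays available during system updates, and scaling up or down does not lose data
• Scalable: Druid deployments already handle billions of events and terabytes of data per day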
Fast Response Time
• Critical for interactive user experience
• Avg query times ~500ms
• 90th percentile under 1 sec
• 99th percentile under 10 sec
• Handles thousands of concurrent queries
Arbitrary slicing n’ dicing
• Supports arbitrary filtering, splitting, and aggregation of data
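As a rough illustration of slicing and dicing, the helper below assembles a Druid native `groupBy` query as a Python dict, ready to be serialized to JSON. The helper name `build_groupby_query` and the datasource, dimension, and metric names in the usage note are hypothetical. Only the JSON field names (`queryType`, `dataSource`, `granularity`, `dimensions`, `filter`, `aggregations`, `intervals`) follow Druid's native query format. Nothing is sent to a cluster.

```python
import json


def build_groupby_query(datasource, dimensions, filters, metric, interval, granularity="all"):
    """Build a Druid native groupBy query that sums `metric` per dimension combination.

    `filters` maps a dimension name to the single value it must equal.
    """
    selector_filters = [
        {"type": "selector", "dimension": dim, "value": value}
        for dim, value in sorted(filters.items())
    ]
    query = {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": granularity,
        "dimensions": list(dimensions),
        "aggregations": [
            {"type": "longSum", "name": f"sum_{metric}", "fieldName": metric}
        ],
        "intervals": [interval],
    }
    if len(selector_filters) == 1:
        query["filter"] = selector_filters[0]
    elif selector_filters:
        # Several conditions are combined with a logical AND filter.
        query["filter"] = {"type": "and", "fields": selector_filters}
    return query
```

For example, `json.dumps(build_groupby_query("orders", ["category"], {"country": "USA"}, "qty", "2017-09-30/2017-10-01"))` produces a request body that groups by `category` for one day of USA orders. Changing the dimension list or the filter map gives a different slice without any pre-built aggregate table.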
Scalability
• Ability to handle
• petabytes of data
• billions of events/day
• Largest Druid cluster
• 50 Trillion+ events
• 50PB+ of raw data
• Over 500TB of compressed query-able data
• Ingestion Rate over 500,000 events/sec
• 10-100K events/sec/core
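A back-of-the-envelope check of these ingestion figures, assuming (this is not stated on the slide) that the target rate can simply be divided by the per-core rate. The calculation ignores replication, query load, and headroom, so treat the result as a lower bound and not as sizing advice.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def cores_needed(target_events_per_sec, per_core_events_per_sec):
    """Minimum number of ingestion cores if throughput scaled perfectly linearly."""
    return math.ceil(target_events_per_sec / per_core_events_per_sec)


def events_per_day(events_per_sec):
    """Daily event volume at a sustained per-second rate."""
    return events_per_sec * SECONDS_PER_DAY
```

With the slide's numbers, `cores_needed(500_000, 10_000)` gives 50 and `cores_needed(500_000, 100_000)` gives 5, so 500,000 events/sec corresponds to roughly 5 to 50 cores of pure ingestion work. `events_per_day(500_000)` is 43,200,000,000, which is consistent with the "billions of events/day" claim.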
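Who uses Druid?
Stream processing layer - Flink
• Apache Flink is an open source stream processing framework for distributed, high-performance, highly available, and accurate data stream computation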
• High Performance & Low Latency
• Support for Event Time and Out-of-Order Events
• Exactly-once Semantics for Stateful Computations
• Highly flexible Streaming Windows
• Continuous Streaming Model with Backpressure
• Fault-tolerance via Lightweight Distributed Snapshots
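To make "event time and out-of-order events" concrete, here is a plain-Python simulation of event-time tumbling windows driven by a watermark. It does not use Flink's API. The function name, the integer-second timestamps, and the simplified watermark rule (largest event time seen minus an allowed lateness) are assumptions chosen for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per (window, key) using event time instead of arrival time.

    `events` is an iterable of (event_time, key) in ARRIVAL order.
    Returns (results, dropped): results holds (window_start, key, count) tuples
    in the order the windows were emitted; dropped holds events that arrived
    after their window had already been emitted.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # window_start -> key -> count
    results, dropped = [], []
    max_seen = None
    watermark = float("-inf")

    def emit_closed():
        # A window is complete once the watermark has reached its end.
        for start in sorted(w for w in open_windows if w + window_size <= watermark):
            for key, count in sorted(open_windows.pop(start).items()):
                results.append((start, key, count))

    for event_time, key in events:
        window_start = event_time - event_time % window_size
        if window_start + window_size <= watermark:
            dropped.append((event_time, key))  # too late: its window was already emitted
            continue
        open_windows[window_start][key] += 1
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        watermark = max_seen - max_out_of_orderness
        emit_closed()

    # End of input: emit everything that is still open.
    watermark = float("inf")
    emit_closed()
    return results, dropped
```

With a 10-second window and 5 seconds of allowed lateness, an event stamped 8 that arrives after an event stamped 12 is still counted in window 0, while an event stamped 3 that arrives after an event stamped 25 is dropped because window 0 has already been emitted.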
Stream processing
Stream processing - Event Enrichment
Event: PART_A 4 2017-09-30 09:31:54 SO#04
Event after enrichment: CATEGORY 4 2017-09-30 09:31:54 SO#04 Customer USA PART_A TYPE..
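The enrichment of the sample event above can be sketched as a lookup-and-merge step. The field names (`part`, `qty`, `timestamp`, `sales_order`, `category`, `part_type`, `customer`, `country`) and the two master-data dictionaries are hypothetical. The slide only shows the values, so this mapping of values to fields is a guess made for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Event:
    part: str
    qty: int
    timestamp: str
    sales_order: str


# Hypothetical master data, keyed by the identifiers carried in the raw event.
PART_MASTER = {"PART_A": {"category": "CATEGORY", "part_type": "TYPE"}}
ORDER_MASTER = {"SO#04": {"customer": "Customer", "country": "USA"}}


def enrich(event, part_master, order_master):
    """Return the event as a dict with master-data attributes merged in.

    Unknown keys are left unenriched instead of raising an error.
    """
    enriched = asdict(event)
    enriched.update(part_master.get(event.part, {}))
    enriched.update(order_master.get(event.sales_order, {}))
    return enriched
```

Calling `enrich(Event("PART_A", 4, "2017-09-30 09:31:54", "SO#04"), PART_MASTER, ORDER_MASTER)` returns the original four fields plus category, part type, customer, and country, so the downstream store can filter and group on those attributes without a join at query time.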
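Stream processing - Event Enrichment
Diagram components: two Local Caches (LRU) and a Remote Cache or Key/Value Store; M marks master data records
Reference: https://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html
• Data in the cache is replaced by the updated data
• Check whether the internal LRU cache already holds the same key; if it does, replace the entry with the latest data
How is the master data used for lookups processed and stored in real time?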
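One way to read the master-data path on this slide is shown below. It assumes, and the slide does not spell this out, that every master-data change is written to the remote store and that a local LRU cache is refreshed only when it already holds that key. A plain `dict` stands in for the remote cache or key/value store, and an `OrderedDict` stands in for one worker's local cache.

```python
from collections import OrderedDict


def apply_master_update(remote, local, key, value):
    """Apply one master-data change record.

    remote: dict standing in for the remote cache / key-value store.
    local:  OrderedDict standing in for one worker's local LRU cache.
    """
    # The remote store always receives the latest value.
    remote[key] = value
    # The local cache is refreshed only if it already tracks this key, so an
    # update never pushes unrelated entries out of the cache.
    if key in local:
        local[key] = value  # assigning to an existing key keeps its position
```

Keys that are not cached locally are picked up from the remote store the next time an event needs them.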
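Stream processing - Event Enrichment
Diagram components: two Local Caches (LRU) and a Remote Cache or Key/Value Store; E marks events, M marks master data records
Reference: https://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html
1. Check whether the local cache holds the required key; if it does, use the cached value
2. If it does not, fetch the required reference data from the remote cache and keep it in the local cache
3. If the local cache grows beyond its configured size, the LRU mechanism evicts the least recently used entries
How are events enriched?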
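The three lookup steps above map directly onto a read-through LRU cache. The sketch below uses `collections.OrderedDict`. The class name `ReadThroughLRU` and the `fetch_remote` callback are hypothetical stand-ins for whatever client the remote cache or key/value store provides.

```python
from collections import OrderedDict


class ReadThroughLRU:
    """Local cache that falls back to a remote lookup and evicts by recency."""

    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity
        self.fetch_remote = fetch_remote  # callable: key -> value (remote lookup)
        self.entries = OrderedDict()      # least recently used entry comes first

    def get(self, key):
        # Step 1: local hit, so mark the entry as most recently used and return it.
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        # Step 2: local miss, so fetch from the remote store and keep a local copy.
        value = self.fetch_remote(key)
        self.entries[key] = value
        # Step 3: over capacity, so evict the least recently used entry.
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return value
```

A stream operator would call `cache.get(event.part)` for each event. Hot keys are served from memory, and only cold keys pay for the remote round trip. If the remote lookup fails, `get` simply propagates that error, so a real job would need its own policy for missing reference data.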
Summary - Functional requirements
• Build a near-real-time business intelligence (BI) time series analytics system
• A very large number of operational-level users
• Many users
• A high volume of concurrent query/aggregation requests
• High concurrency
• Very fast query/aggregation results
• Low latency, sub-second response
• Very large volumes of real-time and historical data
• Large volume of history & current data
Summary - Non-functional requirements
Maintainability
Portability
Reliability
Scalability
Flexibility
Auditability
Documentation
Performance
Security
Usability
Source: https://www.outsystems.com/blog/2013/03/the-truth-about-non-functional-requirements-nfrs.html
Summary - Functional + non-functional requirements
Source: https://www.outsystems.com/blog/2013/03/the-truth-about-non-functional-requirements-nfrs.html
Must be fast
Must be cheap
Must be good
Would the story of Snow White be as exciting without the dwarfs?
Source: https://devilsadvocatepaper.com/2016/04/06/the-end/
