Realtime analytics with Flink and Druid

Real-time analytics with Flink and Druid
Erhwen Kuo (郭二文)
Senior Manager, Enterprise Application Systems @ Wistron (緯創資通)
DataCon.TW 2017

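About the speaker
• 5 years in production management / materials management
• 3 years developing and operating MES/SFC systems
• 3 years leading ERP (SAP MM module) / EAI implementation and development
• 2 years as an End-To-End Business Integration Analyst
• 1 year as an OEM Sales Team Leader
• 4 years in enterprise operations analytics / group consolidated financial reporting systems
• 4 years researching and promoting new big data technologies
• Full-distance Ironman triathlon (226 km) finisher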

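Big data technology research / promotion
• Inside the company
• Direction
• Start from the company's needs and share similar or instructive cases found in communities at home and abroad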

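Big data technology research / promotion
• Community
• What is taken from society is given back to society
• Founded http://eighty20.cc
• Direction
• Remove information related to company business
• Focus on sharing technology integration, cases, and experience
• 2016 - Spark Hands-on: Quick Start Camp (6 sessions, 18 hours)
• Taiwan R User Group / MLDM Monday community
• 2016 - Realtime WebApp Hands-on (3 sessions, 9 hours)
• Taiwan Spark User Group community
• 2017 - ELK + Grafana Hands-on (3 sessions, 9 hours)
• Taiwan Spark User Group community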

Outline
• Problem
• Business Intelligence Analytics For Time Series Data
• Possible solutions
• Choosing The Right Tools For The Job
• Architecture
• End-to-end Data Pipeline

Problem / Challenge
Source: http://www.dreamwerkztechnologies.com/dwz-inventory-management/

Problem / Challenge
Source: https://logicyyc.com/2017/04/27/industry-4-0/

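Problem / Challenge
• Operational Information level
• Has the most users; software license constraints often lead to many manually produced reports
• Needs highly dynamic query / filter / aggregation capabilities
• Needs more timely data
• Needs the underlying transaction data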

Problem / Challenge
Fast, good, and cheap
[Figure labels: features, data, speed, number of supported users]

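Problem / Challenge
• Build a near-real-time business intelligence (BI) time series analytics system
• Many users: a very large number of operational-level users
• High concurrency: a very high volume of concurrent query / aggregation requests
• Low latency: query / aggregation results are produced quickly, with sub-second response
• Large volume of history & current data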

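Problem / Challenge
• What architecture should be used when the data volume to process is huge?
• There are many solutions, approaches, and vendors out there. Which should be chosen?
• How do we choose the right tool?
Fast
Cheap
Good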

Solution using an RDBMS
Diagram: Event Data → ETL → RDBMS → Business Intelligence Application

Solution using an RDBMS
• Traditional data warehouse
• Row store
• Star schema
• Aggregate tables
• Query cache
• Scanning raw data is slow and expensive

Solution using Apache Kylin
Diagram: Event Data → ETL → Kylin / HBase

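Solution using Apache Kylin
• Distributed multidimensional analytics (OLAP) engine
• Relatively few cases of streaming data ingestion
** Not yet researched or tested in depth **
(A project worth watching)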

Solution using general-purpose big data tools
Diagram: Event Data → ETL → Spark SQL / Impala

• Load the data into Hadoop
• Query / filter / aggregate with tools such as Spark SQL or Impala
• Use Tableau or PowerBI for data visualization and interactive analysis

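• Build a near-real-time business intelligence (BI) time series analytics system
• Many users (license considerations for Tableau, PowerBI)
• High concurrency (Spark / Impala do not easily support hundreds of users operating at the same time)
• Low latency, sub-second response (hard to tune down to second-level response times)
• Large volume of history & current data (new data must accumulate to 128 MB or 10 minutes before it can be written to HDFS in Parquet format)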
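The batching constraint in the last bullet can be sketched as a simple flush rule. This is only an illustration of why fresh data stays invisible to queries for a while; `MicroBatchBuffer` and its methods are hypothetical names, not part of Spark, Impala, or any ingestion tool, and only the two thresholds come from the slide.

```python
from dataclasses import dataclass, field

MAX_BYTES = 128 * 1024 * 1024  # 128 MB threshold quoted on the slide
MAX_AGE_SECONDS = 10 * 60      # 10 minute threshold quoted on the slide


@dataclass
class MicroBatchBuffer:
    """Hypothetical buffer that holds events until one threshold is reached."""
    opened_at: float               # time (in seconds) when the buffer was opened
    size_bytes: int = 0
    events: list = field(default_factory=list)

    def add(self, event: bytes) -> None:
        # Buffered events are not yet queryable; they only count toward the size.
        self.events.append(event)
        self.size_bytes += len(event)

    def should_flush(self, now: float) -> bool:
        # Flush when either the size limit or the age limit is reached.
        too_big = self.size_bytes >= MAX_BYTES
        too_old = (now - self.opened_at) >= MAX_AGE_SECONDS
        return too_big or too_old
```

With a low event rate the size limit is never reached, so under these assumptions an event can wait up to the full 10 minutes before it becomes queryable.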

Solution using Elastic
Diagram: Event Data → ETL → Elasticsearch

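Solution using Elastic
• Pros:
• Simple cluster setup
• Can query both real-time and historical data
• Cons:
• Few out-of-the-box components for the presentation layer
• Hard to combine with traditional BI tools
• Access control / authentication requires a commercial license
• Requires custom development skills and staff
• The real-time data ingestion rate needs tuning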

Solution using Druid
Diagram: Event Data → ETL → Druid

Solution using Druid - Architecture
[Diagram labels: event enrichment, batch data preprocess, metadata cache]

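Presentation layer - Superset
• Superset is a visual, interactive data exploration platform
• Open-sourced by Airbnb, now an Apache incubating project
• Rich visualization components and simple, intuitive data exploration
• Flexible authentication / authorization, with LDAP integration
• Deep integration with Druid, using Druid's very fast slice & dice capability to improve the user experience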

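Data storage / query layer - Druid
• Druid is a fault-tolerant, high-performance, open-source distributed system for real-time querying and analysis of big data
• Designed for analytics: Druid is built for exploratory analysis in OLAP workflows and supports a variety of filters, aggregations, and queries
• Fast interactive queries: Druid's low-latency ingestion architecture lets events be queried within milliseconds of being created
• High availability: Druid's data remains available during system updates, and scaling up or down does not cause data loss
• Scalable: Druid deployments already handle billions of events and terabytes of data per day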

Fast Response Time
• Critical for interactive user experience
• Avg query times ~500ms
• 90th percentile under 1 sec
• 99th percentile under 10 sec
• Handles 1000s of concurrent queries

Arbitrary slicing n’ dicing
• Ability to filter, split, and aggregate data arbitrarily
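As an illustration of slicing and dicing, the sketch below builds a Druid native `groupBy` query as a JSON document, which is the form Druid's native queries take. The helper `build_groupby_query`, the datasource `sales_orders`, and the column names `customer`, `country`, and `qty` are hypothetical and chosen only for this example.

```python
import json


def build_groupby_query(datasource, dimensions, interval, filters=None, granularity="day"):
    """Build a Druid native groupBy query as a plain dict.

    `filters` maps dimension name -> exact value; several entries are AND-ed.
    """
    query = {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": granularity,
        "dimensions": list(dimensions),          # the "dice": columns to split by
        "aggregations": [
            {"type": "count", "name": "rows"},
            # "qty" is an assumed metric column for this sketch
            {"type": "longSum", "name": "total_qty", "fieldName": "qty"},
        ],
        "intervals": [interval],                  # ISO 8601 interval: start/end
    }
    if filters:
        # the "slice": keep only rows whose dimensions match the given values
        fields = [
            {"type": "selector", "dimension": dim, "value": value}
            for dim, value in filters.items()
        ]
        query["filter"] = fields[0] if len(fields) == 1 else {"type": "and", "fields": fields}
    return query


query = build_groupby_query(
    datasource="sales_orders",
    dimensions=["customer"],
    interval="2017-09-01/2017-10-01",
    filters={"country": "USA"},
)
payload = json.dumps(query)  # body that would be POSTed to a Druid Broker
```

Changing the split or the filter only means changing `dimensions` or `filters`; no pre-built aggregate table is involved, which is the point of the slide.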

Scalability
• Ability to handle
• petabytes of data
• billions of events/day
• Largest Druid cluster
• 50 Trillion+ events
• 50PB+ of raw data
• Over 500TB of compressed query-able data
• Ingestion Rate over 500,000 events/sec
• 10-100K events/sec/core

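Stream processing layer - Flink
• Apache Flink is an open-source stream processing framework for distributed, high-performance, highly available, and accurate data stream computation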
• High Performance & Low Latency
• Support for Event Time and Out-of-Order Events
• Exactly-once Semantics for Stateful Computations
• Highly flexible Streaming Windows
• Continuous Streaming Model with Backpressure
• Fault-tolerance via Lightweight Distributed Snapshots
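To make "event time and out-of-order events" concrete, here is a minimal plain-Python sketch of tumbling event-time windows driven by a watermark. It does not use the Flink API; the function name, the integer timestamps, and the simplified rule "a window closes once the watermark reaches its end" are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per (window, key) using event time, not arrival order.

    events: iterable of (event_time, key) tuples in arrival order.
    Returns (emitted, late): emitted is a list of (window_start, key, count),
    late holds events that arrived after their window had already closed.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # window_start -> key -> count
    emitted, late = [], []
    watermark = float("-inf")

    for event_time, key in events:
        window_start = event_time - (event_time % window_size)
        if window_start + window_size <= watermark:
            late.append((event_time, key))        # its window was already emitted
        else:
            open_windows[window_start][key] += 1

        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - max_out_of_orderness)

        # Emit every window whose end is at or before the watermark.
        closable = sorted(w for w in open_windows if w + window_size <= watermark)
        for start in closable:
            for k, count in sorted(open_windows.pop(start).items()):
                emitted.append((start, k, count))

    # End of input: emit whatever is still open.
    for start in sorted(open_windows):
        for k, count in sorted(open_windows[start].items()):
            emitted.append((start, k, count))
    return emitted, late


arrivals = [(1, "a"), (4, "a"), (3, "a"), (12, "a"), (2, "a"), (25, "a"), (5, "a")]
emitted, late = tumbling_window_counts(arrivals, window_size=10, max_out_of_orderness=5)
```

In this run the event with time 2 arrives after the event with time 12 but is still counted in the first window, because the watermark has not yet passed that window's end; the event with time 5 arrives after the window was emitted and is reported as late.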

Stream processing - Event Enrichment
Event: PART_A 4 2017-09-30 09:31:54 SO#04
Event after enrichment: CATEGORY 4 2017-09-30 09:31:54 SO#04 Customer USA PART_A TYPE..

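How is the master data used for lookups processed and stored in real time?
[Diagram labels: Local Cache (LRU) x2, Remote Cache or Key/Value Store, master data records (M)]
• Data in the cache is replaced by the updated data
• Check whether the internal LRU cache already holds the same key; if it does, replace the entry with the latest data
Reference: https://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html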
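A minimal sketch of this update path, under some assumptions: the remote cache is modeled as a plain dict, the local LRU cache as an `OrderedDict`, and the function name `apply_master_update` is hypothetical. Writing every update to the remote store is an assumption about the diagram; the slide text only states the local-cache rule.

```python
from collections import OrderedDict


def apply_master_update(key, record, remote_store, local_cache):
    """Apply one master data change event.

    Returns True if the local cache held the key and was refreshed.
    """
    # Assumption: the remote cache / key-value store always keeps the latest record.
    remote_store[key] = record

    # Rule from the slide: only replace the entry if the key is already cached locally.
    if key in local_cache:
        local_cache[key] = record  # position in the LRU order is left unchanged
        return True
    return False


remote = {}
local = OrderedDict([("PART_A", {"category": "OLD"})])
refreshed = apply_master_update("PART_A", {"category": "NEW"}, remote, local)
skipped = apply_master_update("PART_B", {"category": "OTHER"}, remote, local)
```

Keys that are not cached locally are left alone, so a worker only spends memory on master data it has actually looked up.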

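How are events enriched?
[Diagram labels: Local Cache (LRU) x2, Remote Cache or Key/Value Store, events (E), master data records (M)]
1. Check whether the local cache holds the required key; if it does, use the cached value
2. If not, fetch the required reference data from the remote cache and keep it in the local cache
3. If the local cache exceeds its configured size, the LRU mechanism evicts the least recently used data
Reference: https://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html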
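The three lookup steps can be sketched as a read-through LRU cache. This is an illustration rather than the speaker's implementation: the class `ReadThroughLruCache`, the function `enrich`, and the event field names (`part`, `qty`, `order`) are hypothetical, and the remote cache is modeled as a plain dict.

```python
from collections import OrderedDict


class ReadThroughLruCache:
    """Local LRU cache that falls back to a remote store on a miss."""

    def __init__(self, capacity, remote_store):
        self.capacity = capacity
        self.remote_store = remote_store
        self._entries = OrderedDict()  # least recently used entry comes first
        self.remote_reads = 0          # counter kept only to show cache behavior

    def get(self, key):
        # Step 1: use the local entry if present and mark it as recently used.
        if key in self._entries:
            self._entries.move_to_end(key)
            return self._entries[key]
        # Step 2: otherwise fetch from the remote store and keep a local copy.
        self.remote_reads += 1
        value = self.remote_store.get(key)
        if value is not None:
            self._entries[key] = value
            # Step 3: evict the least recently used entry when over capacity.
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)
        return value


def enrich(event, cache):
    """Return a copy of the event with master data fields merged in."""
    master = cache.get(event["part"]) or {}
    return {**event, **master}


remote = {
    "PART_A": {"category": "CATEGORY", "customer_country": "USA"},
    "PART_B": {"category": "CATEGORY", "customer_country": "USA"},
    "PART_C": {"category": "CATEGORY", "customer_country": "USA"},
}
cache = ReadThroughLruCache(capacity=2, remote_store=remote)
event = {"part": "PART_A", "qty": 4, "time": "2017-09-30 09:31:54", "order": "SO#04"}
enriched = enrich(event, cache)
```

Repeated events for the same part are served from the local cache, so the remote store is only contacted on a miss or after an eviction.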

Summary - Functional Requirements
• Many users
• High concurrency
• Low latency , sub-second response
• Large volume of history & current data

Summary - Non-functional Requirements
• Maintainability
• Portability
• Reliability
• Scalability
• Flexibility
• Auditability
• Documentation
• Performance
• Security
• Usability
Source: https://www.outsystems.com/blog/2013/03/the-truth-about-non-functional-requirements-nfrs.html

Summary - Functional + Non-functional Requirements
Source: https://www.outsystems.com/blog/2013/03/the-truth-about-non-functional-requirements-nfrs.html
Fast
Cheap
Good
Would the story of Snow White be as exciting without the dwarfs?

Source: https://devilsadvocatepaper.com/2016/04/06/the-end/
