10. ETL pipeline
Dedicated ETL tools (e.g. SSIS)
Defined schema
Queries
Results
Relational
LOB
Applications
Traditional business analytics process
1. Start with end-user requirements to identify desired reports
and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to the target schema
(‘schema-on-write’)
5. Create reports. Analyze data
All data not immediately required is discarded or archived
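The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw export, the `sales` table, and the `run_etl` and `report` helpers are hypothetical, and SQLite stands in for the relational store. It shows the schema being fixed before loading, rows that do not fit being dropped, and the unused `comment` column being discarded.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a line-of-business (LOB) application.
RAW_ORDERS = """order_id,customer,amount,comment
1,Alice,10.50,first order
2,Bob,not-a-number,bad row
3,Alice,4.25,repeat customer
"""


def run_etl(raw_text, conn):
    """Extract rows, transform them to the target schema, and load them."""
    # The schema is defined before any data is loaded (schema-on-write).
    conn.execute(
        "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    loaded = 0
    for row in csv.DictReader(io.StringIO(raw_text)):  # Extract
        try:
            # Transform: cast to the target types; the comment column is dropped.
            record = (int(row["order_id"]), row["customer"], float(row["amount"]))
        except ValueError:
            continue  # rows that do not fit the schema are discarded
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", record)  # Load
        loaded += 1
    conn.commit()
    return loaded


def report(conn):
    """A report query written against the predefined schema."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
    ).fetchall()
```

Calling `run_etl(RAW_ORDERS, sqlite3.connect(":memory:"))` loads the two well-formed rows; anything the schema did not anticipate is no longer available to later analysis.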
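15. Big data storage, machine learning and analytics
Action: People; Automated Systems; Apps (Web, Mobile, Bots)
Intelligence: Cortana, Bot Framework, Cognitive Services
Dashboards & data visualization: Power BI
Information management: Event Hubs, Data Catalog, Data Factory
Machine learning and analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
Big data storage: SQL Data Warehouse, Data Lake Store
Data sources: Apps; Sensors and devices
Data: IoT Hub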
20. Types of data commonly processed with Hadoop
1. Sentiment analysis
Understand how your customers feel about your brand
2. Clickstream
Capture and analyze website visitors’ data trails and optimize your website
3. Sensor/machine
Discover patterns in data streaming automatically from remote sensors and machines
4. Geographic information
Analyze location-based data to manage operations where they occur
5. Server logs
Research logs to diagnose process failures and prevent security breaches
6. Unstructured data (text, video, pictures, etc.)
Understand patterns in files across millions of web pages, emails, and documents
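21. Introduction to Azure HDInsight
Hadoop meets the cloud: a Hadoop service managed by Microsoft
Uses 100% open source Apache Hadoop
Compatible with .NET and Java tools
Hadoop versions can be upgraded automatically
Set up and running within minutes, with no hardware to purchase
Runs on Windows or Linux
Provision and configure the service, use it, then cancel it; the data can be retained
Technical support provided by Microsoft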
26. Data Node Data Node Data Node Data Node
Task Tracker Task Tracker Task Tracker Task Tracker
Name Node
Job Tracker
HMaster
Coordination
Region Server Region Server Region Server Region Server
34. Other Hadoop components and tools
Ambari: Cluster provisioning, management, and monitoring.
Avro (Microsoft .NET Library for Avro): Data serialization for
the Microsoft .NET environment
MapReduce and YARN: Distributed processing and resource
management
Oozie: Workflow management
Phoenix: Relational database layer over HBase
Pig: Simpler scripting for MapReduce transformations
Sqoop: Data import and export
Tez: Allows data-intensive processes to run efficiently at
scale
ZooKeeper: Coordination of processes in distributed systems
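To make the MapReduce entry in the list above concrete, here is a single-process sketch of the map, shuffle, and reduce phases for a word count. It is an illustration only: the function names are hypothetical, and a real Hadoop job would run the phases across many nodes under YARN rather than in one Python process.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For example, `word_count(["big data", "Big cluster"])` counts `big` twice and the other two words once each.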
42. be removed January 1, 2017
https://portal.azure.com
https://azure.microsoft.com/en-us/documentation/templates/?term=hdinsight
Cluster deployment
43. First cloud Hadoop solution to onboard LLAP (Live Long and Process) from the Stinger.Next initiative,
promising sub-second querying on big data, 25x faster than existing Hive.
44. Apache Spark – A Unified Framework
A unified, open source, parallel data processing framework for Big Data Analytics
Spark Core Engine
Spark SQL
Interactive
Queries
Spark
Streaming
Stream processing
Spark MLlib
Machine
Learning
GraphX
Graph
Computation
Yarn Mesos
Standalone
Scheduler
45. What is Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
• Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)
Improves efficiency through:
• In-memory computing primitives
• General computation graphs
Improves usability through:
• Rich APIs in Java, Scala, Python
• Interactive shell
Up to 100× faster
Often 2-10× less code
Spark was started by Matei Zaharia at UC Berkeley AMPLab in 2009,
open sourced in 2010, and donated to Apache in 2013
54. Developing Spark apps with notebooks
Jupyter and Zeppelin are two notebooks that work
with Apache Spark
55. Jupyter
Language agnostic
Supports a rich Read-Evaluate-Print Loop (REPL) protocol
Includes:
Jupyter interactive web-based notebook
Jupyter Qt console
Jupyter Terminal console
Notebook viewer (nbviewer)
Supported languages (kernels)
56. Zeppelin architecture
Browser client
Zeppelin server
Class loader Class loader
Interpreter group Interpreter group
Interpreter Dep Spark Spark SQL
HTTP Rest Websocket
…
Spark
…
Maven
Apache Spark is supported in Zeppelin with the Spark interpreter group, which consists of four interpreters.
Name | Class | Description
%spark | SparkInterpreter | Creates SparkContext and provides Scala environment
%pyspark | PySparkInterpreter | Provides Python environment
%sql | SparkSQLInterpreter | Provides SQL environment
%dep | DepInterpreter | Dependency loader
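The interpreter table can be read as a dispatch rule: the prefix at the start of a notebook paragraph selects which interpreter receives the rest of the text. The sketch below models only that routing idea and is not Zeppelin's implementation. The `route_paragraph` helper is hypothetical, and treating `%spark` as the default for paragraphs without a prefix is an assumption.

```python
# Prefix-to-class mapping taken from the interpreter table.
INTERPRETERS = {
    "%spark": "SparkInterpreter",
    "%pyspark": "PySparkInterpreter",
    "%sql": "SparkSQLInterpreter",
    "%dep": "DepInterpreter",
}


def route_paragraph(text, default="%spark"):
    """Return (interpreter class name, paragraph body) for one paragraph.

    Assumption: a paragraph without a known prefix goes to the default.
    """
    stripped = text.lstrip()
    tokens = stripped.split(None, 1)  # split off the first whitespace-delimited token
    if tokens and tokens[0] in INTERPRETERS:
        body = tokens[1] if len(tokens) > 1 else ""
        return INTERPRETERS[tokens[0]], body
    return INTERPRETERS[default], stripped
```

With this sketch, `route_paragraph("%sql SELECT 1")` selects the SQL interpreter and passes on `SELECT 1`, while a Scala paragraph with no prefix falls through to the default.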
57. Spark SQL overview
Run Spark SQL statements using notebooks: you run interactive Spark SQL statements using notebooks.
Create an Azure storage account: HDInsight uses an Azure Blob storage account for storing data.
HDInsight makes Apache Spark available as a service in the cloud.
66. What investment is your company making in big data?
Big data processing technology remains a challenge for many organizations
Fully deployed: 5%
Have a pilot in place: 11%
Currently investigating: 29%
Interested, but haven’t investigated yet: 41%
Have investigated and decided not to pursue: 5%
Not being considered: 9%
Interest in big data: 70%
Invested in big data: 16%
91% Hadoop usage concerns
71% Hadoop/BI tool inexperience
69. Azure SQL Data Warehouse service architecture
[Diagram: application or user connection; control node; four compute nodes with SQL DB; Blob storage [WASB(S)]; HDInsight; Massively Parallel Processing (MPP) Engine; Azure Infrastructure and Storage]
Compute
Scale compute up or down when required (SLA <= 60 seconds).
Pause, Resume, Stop, Start.
100 DWU < > 2000 DWU
Storage
Add/Load data to WASB(S) without incurring compute costs
Storage and compute are separated, giving a flexible service architecture and billing model
(storage and compute resources are priced separately)
Data Loading (SSIS, REST, OLE, ADO, ODBC, WebHDFS, AZCopy, PS)
DMS (Data Movement Service) runs on every database node
70. Azure SQL Data Warehouse: control node
[Architecture diagram repeated from slide 69]
Control Node (SQL DB)
• Endpoint for connections
• Regular SQL endpoint (TCP 1433)
• Persists no user data (metadata only)
• Coordinates compute activity using MPP
71. Azure SQL Data Warehouse: compute nodes
[Architecture diagram repeated from slide 69]
Compute Node(s)
Azure SQL Database (SQL DB)
An increase of DWU will increase the number of compute nodes
72. Azure SQL Data Warehouse: Blob storage
[Architecture diagram repeated from slide 69]
• RA-GRS storage
• Petabytes of storage
• Ingest data without incurring compute costs
74. CREATE TABLE [build].[FactOnlineSales]
(
[OnlineSalesKey] int NOT NULL
, [DateKey] datetime NOT NULL
, [StoreKey] int NOT NULL
, [ProductKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [CurrencyKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [SalesOrderLineNumber] int NULL
, [SalesQuantity] int NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
)
;
-- Alternative definition of the same table: hash-distributed on ProductKey instead of round-robin
CREATE TABLE [build].[FactOnlineSales]
(
[OnlineSalesKey] int NOT NULL
, [DateKey] datetime NOT NULL
, [StoreKey] int NOT NULL
, [ProductKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [CurrencyKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [SalesOrderLineNumber] int NULL
, [SalesQuantity] int NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
)
;
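The two statements differ only in their DISTRIBUTION option. The sketch below illustrates the difference in row placement. It makes several assumptions that are not in the slides: the count of 60 distributions, the use of CRC32 as a stand-in for the engine's internal hash function, and the helper names.

```python
import zlib
from collections import Counter

# Assumption: each table is spread over 60 distributions.
NUM_DISTRIBUTIONS = 60


def round_robin(rows):
    """ROUND_ROBIN: assign rows to distributions in turn, ignoring their content."""
    return [index % NUM_DISTRIBUTIONS for index, _ in enumerate(rows)]


def hash_distribute(rows, key):
    """HASH(key): assign rows by a deterministic hash of the distribution column.

    CRC32 is only a stand-in for the engine's real hash function.
    """
    return [zlib.crc32(str(row[key]).encode()) % NUM_DISTRIBUTIONS for row in rows]


def rows_per_distribution(assignment):
    """Count how many rows landed in each distribution."""
    return Counter(assignment)
```

Round-robin gives an even spread regardless of the data. Hashing on `ProductKey` puts every row for a given product in the same distribution, which helps joins and aggregations on that column but can skew the load when a few keys dominate.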
77. Querying unstructured data with PolyBase
T-SQL query
SQL Server | Hadoop
Taxi transactions: [masked text] $658.39
Name | Date of birth | State
Jim Gray | 11/13/58 | WA
Ann Smith | 04/29/76 | ME
89. App Service
Intelligent App
Hadoop
Azure Machine
Learning
Power BI
Azure SQL
Database
SQL
Azure SQL Data
Warehouse
End-to-end platform built for the cloud
Power of integration
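94. What is Azure Data Lake Store?
A highly scalable, distributed cloud file system that supports parallel processing
Supports multiple data analytics frameworks
Data sources: LOB Applications, Social, Devices, Clickstream, Sensors, Video, Web, Relational
ADL Store
Analytics frameworks: HDInsight, ADL Analytics, Machine Learning, Spark, R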
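95. ADL Store: an architecture of unlimited scale
Files in ADL Store are split into blocks
Blocks are distributed across different data nodes in the back-end storage system
With enough data nodes, files of any size can be stored
The back-end storage system in the Azure cloud conceptually has unlimited resources
The metadata of each file is stored in the same system
[Diagram: an Azure Data Lake Store file split into Block 1, Block 2, …, with the blocks placed on data nodes in the back-end storage system]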
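The block layout described on this slide can be sketched as follows. It is a toy model under stated assumptions: the block size, the round-robin placement policy, and the function names are illustrative and do not describe the service's actual internals.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, data_nodes):
    """Map block indexes to data nodes in round-robin order (illustrative policy)."""
    placement = {node: [] for node in data_nodes}
    for index, _ in enumerate(blocks):
        placement[data_nodes[index % len(data_nodes)]].append(index)
    return placement


def reassemble(blocks):
    """Reading the file back means concatenating its blocks in order."""
    return b"".join(blocks)
```

Because no single node has to hold the whole file, adding data nodes raises the maximum file size the system can store, and blocks on different nodes can be read at the same time.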
96. ADL Store delivers high throughput
ADL Store delivers high throughput through parallel reads
Each read operation runs in parallel across the data nodes
[Diagram: a read operation on an Azure Data Lake Store file (Block 1, Block 2, …) reading blocks from several data nodes in the back-end storage system]
98. ADL Store is an HDFS-compatible file system
Through its WebHDFS endpoint, Azure Data Lake Store is a Hadoop-compatible file system that integrates
seamlessly with Azure HDInsight
[Diagram: Azure HDInsight workloads (MapReduce, HBase transactions, Hive query, any HDFS application) use a Hadoop WebHDFS client that calls the WebHDFS REST API on the WebHDFS endpoint of Azure Data Lake Store, which holds the ADL Store files]
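A WebHDFS client ultimately issues REST calls of the form `/webhdfs/v1/<path>?op=<OPERATION>`. The helper below only builds such a URL and sends nothing over the network. The host name pattern `<account>.azuredatalakestore.net`, the account name `myadls`, and the `webhdfs_url` function are assumptions for illustration; a real request would also need an OAuth bearer token.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account, path, op, **params):
    """Build a WebHDFS-style REST URL for a Data Lake Store account.

    Assumption: the account is reachable at <account>.azuredatalakestore.net.
    """
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1"
    query = urlencode({"op": op, **params})
    # quote() leaves "/" unescaped, so the directory structure is preserved.
    return f"{base}/{quote(path.lstrip('/'))}?{query}"
```

For example, `webhdfs_url("myadls", "/input/orders.txt", "OPEN")` yields the URL a client would use to read that file, and `LISTSTATUS` in place of `OPEN` lists a directory.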
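99. ADL Store: high availability and reliability
• Within each region, Azure keeps three copies of every data object, placed in different fault and
upgrade domains
• Every operation is replicated to the other two copies and is committed only after replication
completes
• Reads can be served from any replica
Data is never lost or unavailable even under failures
[Diagram: a write goes to Replica 1, Replica 2, and Replica 3 in separate fault/upgrade domains, then commits]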
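The write-then-commit rule on this slide can be modelled with a toy in-memory store. This is a simplification under stated assumptions: the `ReplicatedStore` class is hypothetical, and rejecting a write while any replica is down is only the simplest way to show "commit after all copies are written". A real service would repair or replace a failed replica instead.

```python
class ReplicatedStore:
    """Toy model: a write commits only after every replica has applied it."""

    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]
        self.available = [True] * replica_count

    def write(self, key, value):
        """Apply the write to all replicas, then report a commit."""
        if not all(self.available):
            return False  # not committed: some replica could not take the write
        for replica in self.replicas:
            replica[key] = value
        return True  # committed on every replica

    def read(self, key):
        """Serve the read from any replica that is up and holds the key."""
        for replica, is_up in zip(self.replicas, self.available):
            if is_up and key in replica:
                return replica[key]
        raise KeyError(key)
```

Because a committed write exists on all three replicas, a read still succeeds after one replica becomes unavailable.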
100. ADL Store: Ingress
Data can be ingested into Azure Data Lake Store from a variety of sources
Server logs
Azure Event Hub
Apache
Flume
Azure Storage Blobs
Custom programs
.NET SDK
JavaScript CLI
Azure Portal
Azure PowerShell
Azure Data Factory
Apache Sqoop
Azure SQL DB
Azure SQL DW
Azure tables
Table Storage
On-premises databases
SQL
ADL Store
ADLS Built-in
copy service
101. ADL Store: Egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks
Azure SQL DB
SQL
Azure SQL DW
Azure Tables
Table Storage
On-premises databases
Azure Data Factory
Apache Sqoop
Azure Storage Blobs
Custom programs
.NET SDK
JavaScript CLI
Azure Portal
Azure PowerShell
Built-in
ADLS copy service
ADL Store
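102. Data Lake Store: technical specifications
Security: data access must support authorization management
Native format: can store data in its original format, to track data lineage and provenance
Low latency: can support high-frequency data operations
Multiple analytics frameworks: can support many analytics frameworks (batch, real-time, streaming, ML, etc.); no single framework can handle every kind of data and analysis
Details: can record the detailed content of the data
Throughput: can sustain the data access demands of parallel processing architectures such as Hadoop and Spark
Reliability: high availability and reliability
Scalability: can accommodate rapidly growing data
Multiple data sources: can ingest data from many data sources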
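108. ADLA queries data directly at the source
• No data movement: query tasks are dispatched to the data source and executed there
• Avoids moving large volumes of data stored in different places across the network before querying
• Provides a single view of the data, wherever it is physically stored
• Reduces the data proliferation caused by multiple copies of the data
• All data can be queried with a single query language
• Each data source keeps its own existing management
• SQL query expressions are executed directly on remote SQL data sources
• Filters
• Joins
[Diagram: a U-SQL query in Azure Data Lake Analytics sends queries to Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage]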
109. Work across all cloud data
Azure Data Lake
Analytics
Azure SQL DW Azure SQL DB
Azure
Storage Blobs
Azure
Data Lake Store
SQL Server in an
Azure VM
110. U-SQL syntax
Declarative SQL queries
• Uses SQL syntax: SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL analytics functions
• Easy to optimize
Handles structured and unstructured data
• Schema is determined when the file is read
• Supports relational metadata objects (e.g. database, table)
Highly extensible
• Based on the C# type system
• C# expression language
• User-defined functions (U-SQL and C#)
• User-defined aggregators (C#)
• User-defined operators (UDO) (C#)
Easily extensible parallel processing and scale-out architecture
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Queries can be sent to different data sources for execution
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
// The schema is applied while the files are read (schema-on-read).
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
// odate and amount are columns of the orders rowset (@o), not of @c.
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
, AGG<MyAgg.MySum>(o.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
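For readers who do not have a U-SQL environment, the schema-on-read idea in the script can be approximated in Python. This is a loose analogue under stated assumptions and not a translation: the `extract` and `summarize` helpers are hypothetical, dates are assumed to be ISO formatted, the join is simplified to an inner join, a plain sum replaces the custom `MyAgg.MySum` aggregator, and the `MyNamespace.MyFunction` filter is omitted.

```python
import csv
import io
from datetime import datetime

# Column name -> converter, applied only when the text is read.
ORDERS_SCHEMA = {"oid": int, "cid": int, "odate": datetime.fromisoformat, "amount": float}
CUSTOMERS_SCHEMA = {"cid": int, "name": str, "city": str}


def extract(text, schema):
    """Apply a schema while reading CSV text, in the spirit of EXTRACT ... USING."""
    names = list(schema)
    for fields in csv.reader(io.StringIO(text)):
        yield {name: schema[name](value) for name, value in zip(names, fields)}


def summarize(orders_text, customers_text):
    """Per-customer first/last order date, order count, and total amount."""
    customers = {
        c["cid"]: c
        for c in extract(customers_text, CUSTOMERS_SCHEMA)
        if c["city"].startswith("New")  # mirrors c.city.StartsWith("New")
    }
    summary = {}
    for o in extract(orders_text, ORDERS_SCHEMA):
        if o["cid"] not in customers:
            continue  # simplified inner join on cid
        s = summary.setdefault(
            o["cid"], {"first": o["odate"], "last": o["odate"], "count": 0, "total": 0.0}
        )
        s["first"] = min(s["first"], o["odate"])
        s["last"] = max(s["last"], o["odate"])
        s["count"] += 1
        s["total"] += o["amount"]
    return summary
```

The files themselves carry no types; the same text could be read again later with a different schema, which is the main contrast with the schema-on-write pipeline from the start of the deck.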
123.
Criterion | Azure SQL DW | HDInsight Hive | HDInsight Spark | Azure Data Lake | SQL Server (IaaS)
Volume | Petabytes | Petabytes | Petabytes | Petabytes | Terabytes
Security | Encryption, TD, Audit | ADLS / Apache Ranger | ADLS | AAD Security Groups (data) | Encryption, TD, Audit
Languages | T-SQL | HiveQL | SparkSQL, HiveQL, Scala, Java, Python, R | U-SQL | T-SQL
Extensibility | No | Yes, .NET/SerDe | Yes, Packages | Yes, .NET | Yes, .NET CLR
External File Types | ORC, TXT, Parquet, RCFile | ORC, CSV, Parquet + others | Parquet, JSON, Hive + others | Many | ORC, TXT, Parquet, RCFile
Admin | Low-Medium | Medium-High | Medium-High | Low | High
Cost Model | DWU | Nodes & VM | Nodes & VM | Units/Jobs | VM
Schema Definition | Schema on Write / Polybase | Schema on Read | Schema on Read | Schema on Read | Schema on Write / Polybase
124. The “Clusters” Big Data Approach
Hardware
Purchase
Maintaining
Hardware
Cluster Time
Nodes
Time
Wasted compute time vs. Productive
compute time
125. The “Clusterless” Big Data Approach
Intelligently managing the
cluster lifetime and scale
Wasted compute time vs. Productive
compute time
Wasted compute time vs. Productive
compute time with clusters
Wasted compute time vs. Productive
compute time with Azure Data Lake
Analytics
A clusterless approach
doesn’t have unused
compute time
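The wasted-versus-productive comparison is simple arithmetic, sketched below. The `cluster_usage` helper and any figures passed to it are hypothetical; the point is only that a fixed cluster is paid for every hour it exists, while per-job allocation tracks the work actually done.

```python
def cluster_usage(node_count, hours, demand_per_hour):
    """Compare a fixed-size cluster with the work actually required.

    demand_per_hour: node-hours of useful work needed in each hour
    (illustrative figures, not measurements).
    """
    provisioned = node_count * hours
    # A cluster cannot do more useful work per hour than it has nodes.
    productive = sum(min(demand, node_count) for demand in demand_per_hour)
    return {
        "provisioned": provisioned,
        "productive": productive,
        "wasted": provisioned - productive,
    }
```

For a 10-node cluster kept for 4 hours with demand of 2, 10, 0, and 4 node-hours, 40 node-hours are provisioned but only 16 are productive; a clusterless model would aim to bill for roughly the 16.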
126. Enabling Further Cost Optimizations
Productive compute time with Azure Data Lake
Analytics
Productive compute time vs Optimized compute time with
Azure Data Lake Analytics
130. Analytics APIs
Ready to consume APIs for
Vision, Speech, Language,
Knowledge
R-based analytics
Enterprise grade, write
once deploy anywhere
Cloud analytics
Easy drag/drop UX with
single click
operationalization
Azure Machine Learning, Microsoft R, Cognitive Services
Solutions
Big Data Platform
Run large massively
parallel compute
and data jobs
HDInsight/Spark
Citizen Data Scientist
Advanced Data
Scientist Developer
Data Engineer
/Data Scientist
Preconfigured Solutions/Apps/Solution Templates
BDM/TDM
Finished Apps & Solutions
Ready to consume Apps and
solutions for solving specific
business scenarios
131. MapReduce &
Tez
U-SQL
Data Lake Store
WebHDFS
YARN
Spark
Batch
Interactive
Streaming
ML
Batch
Interactive
Streaming
ML
FEDERATION to enable very large
(100K+) YARN clusters, Cross-DC,
BCDR
REEF – “libc for BigData”
AMOEBA – work-preserving preemption
RAYON – Capacity Reservation
MERCURY & YAQ – Optimistic
allocation + YARN conservatism to
improve performance
OAuth Support
Microsoft works with the Open Source community
133. Big Data Pipeline and Data Flow in Azure
HDInsight
(Hadoop and
Spark)
Stream Analytics
Data Lake
Analytics
Machine
Learning
136. ON PREMISES CLOUD
Massive
Archive
On Prem HDFS
Active
Incoming Data
“Landing
Zone”
Data Lake
Store
Move to
cloud via
AzCopy
Data Lake
Store
Data Lake
Analytics
Azure DW
CONSUMPTION
Machine Learning at scale
(Customer Segmentation &
Fraud Detection)
Web Portals
Mobile
Apps
Power BI
Experimentation at scale.
Drive changes based on
customer behavior
Real World Scenario with Azure Data Lake
Jupyter
Data Science
Notebooks
159. Cortana Analytics Suite
Turning data into intelligent decisions and actions through advanced analytics
Decisions and actions: People; Automated Systems; Apps (Web, Mobile, Bots)
Intelligence: Cortana, Bot Framework, Cognitive Services
Dashboards & data visualization: Power BI
Information management: Event Hubs, Data Catalog, Data Factory
Machine learning and analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
Big data storage: SQL Data Warehouse, Data Lake Store
Data sources: Apps; Sensors and devices
Data generation: IoT Hub, DocumentDB