5. Modern Data Lifecycle
Ingest: Event Hubs, IoT Hub, Service Bus, Kafka
Process: HDInsight, ADLA, Storm, Spark, Stream Analytics
Store: ADLS, Azure Storage, Azure SQL DB, Azure SQL DW, HBase, Cassandra
Consume: Power BI
Curation: Azure Data Factory, Azure ML
10. Azure Data Lake service
Azure Data Lake Store
- Store and manage data without limits
- Store raw data
- High-throughput, low-latency analytics jobs
- Security and access control
HDInsight & Azure Data Lake Analytics
12. ADL Analytics Account
- Links to ADL Stores, including the default ADL Store Account
- Links to Azure Blob Stores
- Job Queue, with key settings:
  - Max Concurrent Jobs
  - Max ADLAUs per Job
  - Max Queue Length
- U-SQL Catalog (metadata and U-SQL catalog data)
ADLAU = Azure Data Lake Analytics Unit
14. ON PREMISES / CLOUD
- Massive archive: import from on-prem HDFS into a Data Lake Store.
- Actively incoming data: continuously updated, landed in a "Landing Zone" Data Lake Store and copied onward with AzCopy.
- Data Lake Store + Data Lake Analytics: move data to its permanent store location and save the structured datasets created by job runs.
- DW (many instances): create structured data.
CONSUMPTION
- Machine Learning: run and validate machine learning.
- Web Portals / Mobile Apps / Power BI: experimentation and validation, e.g. A/B tests and tracking changes in customer behavior.
- Jupyter Data Science Notebooks.
40. MapReduce, HBase transactions, HDFS applications, and Hive queries on Azure HDInsight, as well as any Hadoop WebHDFS client, access ADL Store files through the WebHDFS endpoint and the WebHDFS REST API of Azure Data Lake Store.
41. Local to ADL Store:
- Azure Portal
- Azure PowerShell
- Azure CLI
- Data Lake Tools for Visual Studio
- Azure Data Factory
- AdlCopy tool
Streamed data:
- Azure Stream Analytics
- Azure HDInsight Storm
- Azure Data Factory
From other stores (e.g. Azure Blob storage, HDFS):
- Apache Sqoop
- Apache DistCp
- Azure Data Factory
- AdlCopy tool
43. An Azure Data Lake Store file is split into blocks (Block 1, Block 2, …). In the backend storage, the blocks are spread and replicated across data nodes.
44. Azure Data Lake Store vs. Azure Blob storage

Purpose
  ADLS: storage optimized for big data analytics workloads
  Blob: general-purpose object store for a wide variety of storage scenarios
Use cases
  ADLS: batch, interactive, and streaming analytics, and machine learning data (log files, IoT data, click streams, large datasets, etc.)
  Blob: any type of text or binary data (application back ends, backup data, media storage for streaming, general-purpose data, etc.)
Structure
  ADLS: hierarchical file system
  Blob: object store with a flat namespace
Server-side API
  ADLS: WebHDFS-compatible REST API
  Blob: Azure Blob storage REST API
Data operations (authorization)
  ADLS: POSIX access control lists (ACLs); ACLs based on Azure Active Directory identities can be set at the file and folder level
  Blob: account-level authorization via account access keys; account-, container-, and blob-level authorization via Shared Access Signature keys
Data operations (authentication)
  ADLS: Azure Active Directory identities
  Blob: shared secrets (account access keys and Shared Access Signature keys)
Size limits
  ADLS: no limits on account size, file size, or number of files
  Blob: limits apply (https://docs.microsoft.com/ja-jp/azure/azure-subscription-service-limits#storage-limits)
49. Azure Data Lake Analytics
- No limits on scale
- U-SQL, a new language that adds the power of C# to the benefits of SQL
- Optimized for Data Lake Store
- Federated queries to Azure data services
- Enterprise-grade security, access control, encryption, etc.
- Per-job billing and per-job scale settings
- Can process data of any size
- Analytics service based on Apache YARN
50. HDInsight
- Java, Eclipse, Hive, etc.
- Fully managed Hadoop cluster
Data Lake Analytics
- C#, SQL & PowerShell
- Fully managed distributed processing cluster
- Based on Dryad
57. Data Lake Analytics account
- Management
- Jobs: U-SQL job
- Catalog: catalog (metadata)
Data Lake Store account
- Management
- File System: upload, download, list, delete, rename, append (WebHDFS)
Both the Analytics and Store accounts authenticate through Azure Active Directory.
65. U-SQL: a new language for Big Data
- For the many SQL & .NET developers
- Fuses declarative SQL with the power of imperative C#
- Blends structured, semi-structured, and unstructured data
- Runs distributed queries over all the data
66.
@rows =
    EXTRACT name string,
            id int
    FROM "/data.csv"
    USING Extractors.Csv();

OUTPUT @rows
    TO "/output.csv"
    USING Outputters.Csv();

- Rowsets
- EXTRACT for files
- OUTPUT
- Schema, types
- Inputs & outputs
- Keywords are UPPERCASE
67.
REFERENCE ASSEMBLY WebLogExtASM;

@rs =
    EXTRACT UserID string,
            Start DateTime,
            End DateTime,
            Region string,
            SitesVisited string,
            PagesVisited string
    FROM "swebhdfs://Logs/WebLogRecords.csv"
    USING WebLogExtractor();   // custom function that parses a file-specific format in Data Lake Store

@result = SELECT UserID,
                 (End.Subtract(Start)).TotalSeconds AS Duration   // C# function
          FROM @rs ORDER BY Duration DESC FETCH 10 ROWS;

OUTPUT @result TO "swebhdfs://Logs/Results/top10.txt"
USING Outputters.Tsv();   // built-in function that writes TSV format

- Type definitions are the same as in C#.
- The schema is defined when the data is extracted/read from the file.
- A rowset (@rs, @result) is close to the concept of an intermediate table.
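WebLogExtractor above is user code. As a rough sketch only (the record layout and column handling here are hypothetical, not from the deck), a custom extractor is C# code-behind implementing IExtractor:

using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;

[SqlUserDefinedExtractor]
public class WebLogExtractor : IExtractor
{
    // Called per input split; emits one IRow per log record.
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new System.IO.StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var parts = line.Split(',');              // hypothetical record layout
                output.Set<string>("UserID", parts[0]);
                // ... set the remaining columns declared in the EXTRACT schema ...
                yield return output.AsReadOnly();
            }
        }
    }
}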
68. U-SQL Basics
DECLARE @endDate DateTime = DateTime.Now;
DECLARE @startDate DateTime = @endDate.AddDays(-7);

@orders =
    EXTRACT OrderId int,
            Customer string,
            Date DateTime,
            Amount float
    FROM "/input/orders.txt"
    USING Extractors.Tsv();

@orders = SELECT * FROM @orders
          WHERE Date >= @startDate AND Date <= @endDate;

@orders = SELECT * FROM @orders
          WHERE Customer.Contains("Contoso");

OUTPUT @orders
    TO "/output/output.txt"
    USING Outputters.Tsv();

(1) DECLARE declares variables with C# expressions.
(2) EXTRACT determines the schema as the file is read and puts the result into a rowset.
(3) Rowset expressions redefine the data.
(4) OUTPUT writes the data to a file.
69. U-SQL with .NET code
CREATE ASSEMBLY OrdersDB.Helpers
FROM @"/Assemblies/Helpers.dll";

REFERENCE ASSEMBLY OrdersDB.Helpers;

@rows =
    SELECT OrdersDB.Helpers.Normalize(Customer) AS Customer,
           Amount AS Amount
    FROM @orders;

@rows =
    PROCESS @rows
    PRODUCE OrderId string, FraudDetectionScore double
    USING new OrdersDB.Detection.FraudAnalyzer();

OUTPUT @rows
    TO "/output/output.dat"
    USING OrdersDB.CustomOutputter();

(1) CREATE ASSEMBLY registers an assembly in the U-SQL catalog.
(2) REFERENCE ASSEMBLY declares a reference to the assembly.
(3) C# methods can be called inside U-SQL expressions.
(4) PROCESS runs per-row processing with a user-defined operator.
(5) OUTPUT writes the data in a custom format.
71.
// WASB (requires setting up a WASB data source for the ADLA account)
@rows =
    EXTRACT name string, id int
    FROM "wasb://…/data.csv"
    USING Extractors.Csv();

// ADLS (absolute path)
@rows =
    EXTRACT name string, id int
    FROM "adl://…/data.csv"
    USING Extractors.Csv();

// ADLS (relative to the default ADLS of an ADLA account)
@rows =
    EXTRACT name string, id int
    FROM "/…/data.csv"
    USING Extractors.Csv();

Default extractors: Extractors.Csv(), Extractors.Tsv()
73.
@rows =
    EXTRACT name string, id int
    FROM "adl://…/data.csv"
    USING Extractors.Csv();

OUTPUT @rows
    TO "/data.tsv"
    USING Outputters.Tsv();

Default outputters: Outputters.Csv(), Outputters.Tsv()
74.
@rows =
    EXTRACT Name string,
            Id int
    FROM "/file.tsv"
    USING Extractors.Tsv(skipFirstNRows:1);   // specify the number of rows to skip

OUTPUT @rows
    TO "/output/docsamples/output_header.csv"
    USING Outputters.Csv(outputHeader:true);
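The built-in extractors accept further parameters as well; a small sketch using the documented delimiter and silent parameters of Extractors.Text (the input path is illustrative):

@rows =
    EXTRACT Name string,
            Id int
    FROM "/file.psv"
    USING Extractors.Text(delimiter:'|', silent:true);   // pipe-delimited input; silently skip rows that do not fit the schema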
92.
@output =
    SELECT Region,
           COUNT(*) AS NumSessions,
           SUM(Duration) AS TotalDuration,
           AVG(Duration) AS AvgDwellTime,
           MAX(Duration) AS MaxDuration,
           MIN(Duration) AS MinDuration
    FROM @searchlog
    GROUP BY Region;
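Aggregates also accept a DISTINCT qualifier; a small sketch over the same @searchlog:

@distinct =
    SELECT Region,
           COUNT(DISTINCT Duration) AS DistinctDurations
    FROM @searchlog
    GROUP BY Region;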
93.
// NO GROUP BY
@output =
    SELECT SUM(Duration) AS TotalDuration
    FROM @searchlog;

// WITH GROUP BY
@output =
    SELECT Region,
           SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region;
94.
// find all the Regions where the total dwell time is > 200
@output =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region
HAVING TotalDuration > 200;
95.
// Option 1
@output =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region;
@output2 =
SELECT *
FROM @output
WHERE TotalDuration > 200;
// Option 2
@output =
SELECT
Region,
SUM(Duration) AS TotalDuration
FROM @searchlog
GROUP BY Region
HAVING SUM(Duration) > 200;
97.
// List the sessions in increasing order of Duration
@output =
SELECT *
FROM @searchlog
ORDER BY Duration ASC
FETCH FIRST 3 ROWS;
// This does not work (ORDER BY requires FETCH)
@output =
SELECT *
FROM @searchlog
ORDER BY Duration ASC;
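A minimal sketch of sorted output, assuming ORDER BY is allowed without FETCH on the OUTPUT statement itself (unlike in SELECT):

OUTPUT @searchlog
    TO "/output/sorted.tsv"
    ORDER BY Duration ASC
    USING Outputters.Tsv();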
100.
LEFT OUTER JOIN
INNER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
CROSS JOIN
LEFT SEMI JOIN
RIGHT SEMI JOIN
EXCEPT ALL
EXCEPT DISTINCT
INTERSECT ALL
INTERSECT DISTINCT
UNION ALL
UNION DISTINCT
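For illustration, a minimal INNER JOIN sketch (the @regions rowset is hypothetical); note that U-SQL join conditions use C# equality (==):

@joined =
    SELECT s.Region, s.Duration, r.RegionName
    FROM @searchlog AS s
         INNER JOIN @regions AS r
         ON s.Region == r.Region;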
107. Find the two users with the longest duration in each region.

@irs:
UserId   | Region | Duration
A$A892   | en-us  | 10500
HG54#A   | en-us  | 22270
YSD78@   | en-us  | 38790
JADI899  | en-gb  | 18780
YCPB(%U  | en-gb  | 17000
BHPY687  | en-gb  | 16700
BGFSWQ   | en-bs  | 57750
BSD805   | en-fr  | 15675
BSDYTH7  | en-fr  | 10250

@result:
UserId   | Region | Rank
YSD78@   | en-us  | 1
HG54#A   | en-us  | 2
JADI899  | en-gb  | 1
YCPB(%U  | en-gb  | 2
BGFSWQ   | en-bs  | 1
BSD805   | en-fr  | 1
BSDYTH7  | en-fr  | 2

@ranked =
    SELECT UserId, Region,
           ROW_NUMBER() OVER (PARTITION BY Region ORDER BY Duration DESC) AS Rank
    FROM @irs;

@result =
    SELECT UserId, Region, Rank
    FROM @ranked
    WHERE Rank <= 2;
109.
@a = SELECT Region, Urls FROM @searchlog;

@b = SELECT Region,
            SqlArray.Create(Urls.Split(';')) AS UrlTokens
     FROM @a;

@c = SELECT Region,
            Token AS Url
     FROM @b
          CROSS APPLY EXPLODE (UrlTokens) AS r(Token);

CROSS APPLY EXPLODE expands an array-typed column (@b's SQL.ARRAY) into one row per element (@c).
110.
@d = SELECT Region,
            ARRAY_AGG<string>(Url).ToArray() AS UrlArray
     FROM @c
     GROUP BY Region;

@e = SELECT Region,
            string.Join(";", UrlArray) AS Urls
     FROM @d;

ARRAY_AGG (@c to @d) is the inverse of EXPLODE; @e joins the array back into a delimited string.
121.
CREATE FUNCTION MyDB.dbo.MyFunction()   // the function name is missing on the slide; MyFunction is a placeholder
RETURNS @rows TABLE
(
    Name string,
    Id int
)
AS
BEGIN
    @rows =
        EXTRACT Name string,
                Id int
        FROM "/file.tsv"
        USING Extractors.Tsv();
RETURN;
END;

@rows is the schema-defined rowset that returns the result.
A single concept that replaces Scope views & functions:
- Discoverable
- Schematized
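Calling the table-valued function then looks like this (a minimal sketch using the placeholder name from above):

@result = MyDB.dbo.MyFunction();

OUTPUT @result
    TO "/output/fromfunction.tsv"
    USING Outputters.Tsv();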
133.
# Option 1: if you have the username & password as strings
$username = "username"
$passwd = ConvertTo-SecureString "password" -AsPlainText -Force
$creds = New-Object System.Management.Automation.PSCredential($username, $passwd)

# Option 2: prompt the user for credentials
$creds = Get-Credential
143. Adding the extensions:
REFERENCE ASSEMBLY [ExtPython];
REFERENCE ASSEMBLY [ExtR];

Input data (K, A, B, C, D) is reduced on K: each partition (K0, K1, K2) is handed to a Python/R reducer, producing the output for that key.

Special reducers execute Python or R code in a distributed fashion:
- Extension.Python.Reducer
- Extension.R.Reducer

Standard DataFrames are used as the reducers' input and output.
NOTE: the reducers do not include aggregates.
144. Python Extensions
- Uses U-SQL for parallel distributed processing
- Runs the Python code on many nodes
- Standard Python libraries such as NumPy and pandas are available

REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def get_mentions(tweet):
    return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

def usqlml_main(df):
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df
";

@t =
    SELECT * FROM
        (VALUES
            ("D1","T1","A1","@foo Hello World @bar"),
            ("D2","T2","A2","@baz Hello World @beer")
        ) AS D( date, time, author, tweet );

@m =
    REDUCE @t ON date
    PRODUCE date string, mentions string
    USING new Extension.Python.Reducer(pyScript:@myScript);
146. REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
REFERENCE ASSEMBLY ImageOcr;
@imgs =
EXTRACT FileName string, ImgData byte[]
FROM @"/images/{FileName:*}.jpg"
USING new Cognition.Vision.ImageExtractor();
// Extract the number of objects on each image and tag them
@objects =
PROCESS @imgs
PRODUCE FileName,
NumObjects int,
Tags string
READONLY FileName
USING new Cognition.Vision.ImageTagger();
OUTPUT @objects
TO "/objects.tsv"
USING Outputters.Tsv();
Imaging
147. REFERENCE ASSEMBLY [TextCommon];
REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];
@WarAndPeace =
EXTRACT No int,
Year string,
Book string, Chapter string,
Text string
FROM @"/usqlext/samples/cognition/war_and_peace.csv"
USING Extractors.Csv();
@sentiment =
PROCESS @WarAndPeace
PRODUCE No,
Year,
Book, Chapter,
Text,
Sentiment string,
Conf double
USING new Cognition.Text.SentimentAnalyzer(true);
OUTPUT @sentiment
TO "/sentiment.tsv"
USING Outputters.Tsv();
Text Analysis
148. • Object recognition (tags)
• Face detection and emotion recognition
• JOIN processing: who are the happy people?
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
@objects =
PROCESS MegaFaceView
PRODUCE FileName, NumObjects int, Tags string
READONLY FileName
USING new Cognition.Vision.ImageTagger();
@tags =
SELECT FileName, T.Tag
FROM @objects
CROSS APPLY
EXPLODE(SqlArray.Create(Tags.Split(';')))
AS T(Tag)
WHERE T.Tag.ToString().Contains("dog") OR
T.Tag.ToString().Contains("cat");
@emotion_raw =
PROCESS MegaFaceView
PRODUCE FileName string, NumFaces int, Emotion string
READONLY FileName
USING new Cognition.Vision.EmotionAnalyzer();
@emotion =
SELECT FileName, T.Emotion
FROM @emotion_raw
CROSS APPLY
EXPLODE(SqlArray.Create(Emotion.Split(';')))
AS T(Emotion);
@correlation =
SELECT T.FileName, Emotion, Tag
FROM @emotion AS E
INNER JOIN
@tags AS T
ON E.FileName == T.FileName;
(Diagram: Images produce Objects and Emotions, which are filtered, joined, and aggregated.)
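The diagram's final aggregate step is not shown in code on the slide; a possible last step over @correlation (the @summary name is illustrative):

@summary =
    SELECT Tag, Emotion, COUNT(*) AS NumImages
    FROM @correlation
    GROUP BY Tag, Emotion;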
170. Job submission: the Job Front End hands the job to the Job Scheduler and the Compiler Service, and it enters the Job Queue.
Job execution: the Job Manager (with the U-SQL Catalog) runs the job on YARN, and the U-SQL Runtime executes the vertices.
171. Compiler & Optimizer: the input U-SQL script plus C# user code are compiled (consulting the Meta Data Service) into the compilation output in the job folder: C++ system code, the algebra, other files (system files, deployed resources), and managed and unmanaged DLLs. These files are deployed to the vertices.
197. USER CODE (99.9%)
• There’s something about what the
developer did that is causing the
problem
• The issue may be related to the …
• Input data
• Existence
• Size
• Format
• Key Distribution (“Data Skew”)
• C# code
• C# expressions in a U-SQL Script
• C# UDFs
• Code in Libraries the script uses
210. Case #1 Local Vertex Debug works
GREAT and SEAMLESSLY
Case #2 and Case #3 – Use the same
techniques in these cases as for
debugging a C# program
We are going to make Case #2 easier
in the future.
211. You can launch the Exception Settings window via
• Keyboard Shortcut: Ctrl + Alt + E
• Debug > Windows > Exception Settings
• Search for “Exception Settings” in the VS Search
box. Select the item under “Menus”
212. If reading an input failed:
• Do the inputs exist?
• Do you have permissions to read the inputs?
• Verify that you can access the inputs from the portal.
Did the U-SQL code ever work?
If YES:
• Provide a previous job link.
• Examine the job diff
If possible and depending on the issue,
construct the smallest possible repro
(script and dataset) that lets others
examine the problem.
Did a Vertex Fail because of User Code?
Debug the vertex locally.
213. GO THROUGH THE CHECKLIST FIRST
Provide the …
• Job Link (job url)
• Job Link to a previous run where the script
worked
• ADLA Account Name
• Azure Region of the ADL Account
• Top Level Error + inner errors
• Error Codes
• Error text
IF POSSIBLE
Depending on the issue, construct the
smallest possible repro (script and
dataset) that lets others examine the
problem.
IF WILLING
• Attach the job graph as an image
• Attach the script.
218. Allocating 10 AUs
for a 10 minute job.
Cost
= 10 min * 10 AUs
= 100 AU minutes
Time
Blue line: Allocated
219. Consider using fewer AUs
You are paying for the area under the
blue line
You are only using the area under the
red line
Time
Blue line: Allocated
Red line: Running
Over Allocated != Bad
220. Time
Consider using more AUs
“Under allocated” jobs are Efficient!
Blue line: Allocated
Red line: Running
Under Allocated != Bad
221. Allocation Efficiency =
(Compute hours) / (AUs * duration)
Example Job
= 192.3 hours / (200 AUs * 1.1 hours)
= 192.3 hours / 220 AUHours
= 0.874
= 87.4% Efficiency very very good!
Efficient != What you always want
222. LOGICAL
TABLE T (
    …
    INDEX i
    CLUSTERED (id)
    DISTRIBUTED BY HASH (key) INTO 4
    PARTITIONED BY (date)
)

PHYSICAL
/catalog/…/tables/Guid(T)/
Guid(T.p1).ss   Guid(T.p2).ss   Guid(T.p3).ss
(One file per date partition (@date1, @date2, @date3), each hash-distributed into buckets H1 through H4 by key, with rows clustered by id within each distribution.)

Clustering -> data locality
Partitioning -> data lifecycle management, partition elimination
Distribution -> data locality + distribution elimination
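Since partitioning drives data lifecycle management, partitions can be added and retired individually; a minimal sketch, assuming U-SQL's ALTER TABLE ADD/DROP PARTITION with string-typed partition values (the dates here are hypothetical):

// Add a partition for a new day, then retire an old one.
ALTER TABLE T ADD PARTITION ("2017-06-01");
ALTER TABLE T DROP PARTITION ("2017-01-01");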
224. U-SQL table partitioned on Domain: rows with the same keys are on the same partition (Partition 1: FB rows, Partition 2: WH rows, Partition 3: CNN rows).

@output = SELECT *
          FROM @clickstream
          WHERE domain == "CNN";

Read -> Filter -> Write runs against a single partition: thanks to "partition elimination" and the U-SQL table, the job only reads from the extent that is known to have the relevant key.
225.
@output =
    SELECT domain, SUM(clicks) AS total
    FROM @clickstream
    GROUP BY domain;

Without co-located keys, each of the three partitions runs Read -> Partition -> Partial Agg, and the data is then shuffled into Full Agg -> Write stages.
With the table partitioned by Domain, each partition runs just Read -> Full Agg -> Write, with no repartitioning.
228. File: keys (Domain) are scattered across the extents; Extent 1, Extent 2, and Extent 3 each contain FB, WH, and CNN rows.

U-SQL table partitioned on Domain: the keys are now physically in the same place on disk; Extent 1 holds FB, Extent 2 holds WH, Extent 3 holds CNN.
229. CREATE TABLE SampleDBTutorials.dbo.ClickData
(
SessionId int,
Domain string,
Clicks int,
INDEX idx1 //Name of index
CLUSTERED (Domain)
PARTITIONED BY HASH (Domain)
);
INSERT INTO SampleDBTutorials.dbo.ClickData
SELECT *
FROM @clickdata;
230. // Using a file
@ClickData =
    EXTRACT Session int,
            Domain string,
            Clicks int
    FROM "/clickdata.tsv"
    USING Extractors.Tsv();

@rows = SELECT *
        FROM @ClickData
        WHERE Domain == "cnn.com";

OUTPUT @rows
    TO "/output.tsv"
    USING Outputters.Tsv();

// Using a U-SQL table partitioned by Domain
@ClickData =
    SELECT *
    FROM MyDB.dbo.ClickData;

@rows = SELECT *
        FROM @ClickData
        WHERE Domain == "cnn.com";

OUTPUT @rows
    TO "/output.tsv"
    USING Outputters.Tsv();
231. File: because "CNN" could be anywhere, all extents must be read; each of the three extents (all containing CNN, FB, WH) runs Read -> Filter, followed by Write.

U-SQL table partitioned by Domain: thanks to "partition elimination" and the U-SQL table, the job runs Read -> Filter -> Write only against the partition that is known to have the relevant key (Partition 1: FB, Partition 2: WH, Partition 3: CNN).
233. File: each extent (all containing CNN, FB, WH) runs Read -> Partition -> Partial Agg, and the data is then shuffled into Full Agg -> Write stages. Expensive!

U-SQL table partitioned by Domain: each partition (Partition 1: FB, Partition 2: WH, Partition 3: CNN) runs just Read -> Full Agg -> Write.
234. LOGICAL
TABLE T (
    …
    INDEX i
    CLUSTERED (id)
    PARTITIONED BY (date)
    DISTRIBUTED BY HASH (key) INTO 4
)

PHYSICAL
/catalog/…/tables/Guid(T)/
Guid(T.p1).ss   Guid(T.p2).ss   Guid(T.p3).ss
(One file per date partition (@date1, @date2, @date3), each hash-distributed into buckets H1 through H4 by key, with rows clustered by id within each distribution.)
235.
(1) CREATE TABLE LogRecordsTable
    (UserId int,
     Start DateTime,
     End DateTime,   // End is referenced by the INSERT below; presumably cut off on the slide
     Region string,
     INDEX idx CLUSTERED (Region ASC)
     PARTITIONED BY HASH (Region));

(2) On insert, rows are hash-distributed across the three ranges based on the Region column:
    INSERT INTO LogRecordsTable
    SELECT UserId, Start, End, Region FROM @rs

(3) The partitions are separate (Extent 1: Region = "en-us", Extent 2: Region = "en-gb", Extent 3: Region = "en-fr"), so this query touches only one of them:
    @rs = SELECT * FROM LogRecordsTable
          WHERE Region == "en-gb"
236.
@rs1 =
    SELECT Region,
           COUNT(*) AS Total
    FROM @rs
    GROUP BY Region;

@rs2 =
    SELECT Region, Total
    FROM @rs1
    ORDER BY Total
    FETCH 100 ROWS;

Table clustered by Region: each extent runs Read and then aggregates mostly as a Full Agg directly, followed by Sort and Top 100 with only a shallow final merge.
Unstructured data: each extent runs Read -> Partial Agg, and the data must then be partitioned (shuffled) into Full Agg -> Sort -> Top 100 stages before the final Top 100. The partition/shuffle step is the expensive operation.
253. // THIS CODE IS SCOPE -> NEEDS TO BE UPDATED TO U-SQL
public class SampleNonRecursiveReducer: Reducer
{
public override bool IsRecursive { get { return true; } }
public override Schema Produces(string[] columns, string[] args, Schema input)
{
…
}
public override IEnumerable<Row> Reduce(RowSet input, Row output, string[] args)
{
…
}
}
#ENDCS
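The slide's comment notes this code is SCOPE and needs updating to U-SQL. As a rough sketch only (class, column names, and the summed value are hypothetical), the U-SQL equivalent is C# code-behind implementing IReducer, with recursion declared via an attribute instead of an IsRecursive override:

using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;

[SqlUserDefinedReducer(IsRecursive = true)]   // the reducer may be applied to its own partial outputs
public class SampleReducer : IReducer
{
    // Called once per group; emits one output row per group.
    public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
    {
        long sum = 0;
        foreach (IRow row in input.Rows)
        {
            sum += row.Get<long>("Value");   // hypothetical column
        }
        output.Set<long>("Value", sum);      // same schema in and out, so recursion is safe
        yield return output.AsReadOnly();
    }
}

It would then be invoked from a U-SQL REDUCE expression, e.g. REDUCE @input ON Gender PRODUCE Gender string, Value long READONLY Gender USING new SampleReducer();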
254. Distributed data reduction
SELECT Gender, SUM(…) AS Result
FROM HugeInput
GROUP BY Gender;
This job will succeed even if the input is huge.
• But not all reducers can be recursive:
− Median
− Percentile