Tuning and Optimizing U-SQL Queries for Maximum Performance
Michael Rys, Principal Program Manager, Microsoft
Session Objectives And Takeaways
Session Objective(s):
• Understand the U-SQL Query execution
• Be able to understand and improve U-SQL job performance/cost
• Be able to understand and improve the U-SQL query plan
• Know how to write more efficient U-SQL scripts
Key Takeaways:
• U-SQL is designed for scale-out
• U-SQL provides scalable execution of user code
• U-SQL has a tool set that can help you analyze and improve your scalability, cost and performance
Agenda
• Job Execution Experience and Investigations
  • Query Execution
  • Stage Graph
  • Dryad crash course
  • Job Metrics
  • Resource Planning
• Job Performance Analysis
  • Analyze the critical path
  • Heat Map
  • Critical Path
  • Data Skew
• Tuning / Optimizations
  • Cost Optimizations
  • Data Partitioning
  • Partition Elimination
  • Predicate Pushing
  • Column Pruning
  • Some Data Hints
  • UDOs can be evil
  • INSERT optimizations
U-SQL Query Execution and Performance Tuning
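Overall U-SQL Batch Job Execution Lifetime
[Diagram components: Author, Front-End Service, USQL Compiler Service & USQL Catalog (Compiler, Optimizer), Plan, Optimized Plan, Job Scheduler & Queue, Job Manager, Vertex Scheduling on containers, Vertex Execution (vertexes running in YARN containers, U-SQL Runtime), Local Storage, Data Lake Store, Consume]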
Expression-flow Programming Style
Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model.
Execution plan that is optimized out-of-the-box and without user intervention.
Per-job and user-driven level of parallelization.
Detailed visibility into execution steps, for debugging.
Heatmap-like functionality to identify performance bottlenecks.
U-SQL Compilation Process
[Diagram components: Compiler & Optimizer, U-SQL Metadata Service; compilation output (in job folder): C#, C++, Algebra, managed dll, unmanaged dll, other files (system files, deployed resources); output is deployed to vertices]
Analyzing a job
Parallelism: 1000 ADLAUs
Work composed of 12K vertices
1 ADLAU currently maps to a VM with 2 cores and 6 GB of memory
U-SQL Query Execution
Physical plans vs. Dryad stage graph…
Stage Details
252 pieces of work
AVG vertex execution time
4.3 billion rows
Data read & written
Super Vertex = Stage
U-SQL Query Execution
Redefinition of big-data…
U-SQL Performance Analysis
Analyze the critical path, heat maps, playback, and runtime metrics on every vertex…
Tuning for Cost Efficiency
Dips down to 1 active vertex at these times
Smallest estimated time when given 2425 ADLAUs: 1410 seconds = 23.5 minutes
Model with 100 ADLAUs: 8709 seconds = 145.15 minutes
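The trade-off between the two modelled runs above can be made concrete with a little arithmetic. The sketch below is a Python model, not U-SQL, and it assumes that job cost is proportional to allocated ADLAUs multiplied by wall-clock time; the deck does not state the billing rule, and the helper name au_seconds is hypothetical.

```python
# Assumption: cost is proportional to allocated ADLAUs x wall-clock seconds.
def au_seconds(adlaus: int, runtime_seconds: int) -> int:
    """Resource consumption of one job run in ADLAU-seconds."""
    return adlaus * runtime_seconds

# (ADLAUs, estimated runtime in seconds) for the two modelled runs on the slides.
fast = (2425, 1410)
cheap = (100, 8709)

fast_cost = au_seconds(*fast)
cheap_cost = au_seconds(*cheap)
speedup = cheap[1] / fast[1]        # how much sooner the large allocation finishes
cost_ratio = fast_cost / cheap_cost  # how much more it consumes

print(f"2425 ADLAUs: {fast[1] / 60:.2f} min, {fast_cost:,} AU-seconds")
print(f" 100 ADLAUs: {cheap[1] / 60:.2f} min, {cheap_cost:,} AU-seconds")
print(f"speed-up {speedup:.1f}x for {cost_ratio:.1f}x the AU-seconds")
```

Under that assumption the large allocation finishes about six times sooner but consumes roughly four times the resources, because many ADLAUs sit idle whenever the job dips to a few active vertices.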
Data Storage
• Files
• Tables
Unstructured data in files
• Files are split into 250MB extents
• 4 extents per vertex -> 1GB per vertex
• Different file content formats:
  • Splittable formats are parallelizable:
    • row-oriented (CSV etc.)
    • where data does not span extents
  • Non-splittable formats cannot be parallelized:
    • XML/JSON
    • have to be processed in a single vertex by an extractor with atomicFileProcessing=true
• Use File Sets to provide semantic partition pruning
Tables
• Clustered Index (row-oriented) storage
• Vertical and horizontal partitioning
• Statistics for the optimizer (CREATE STATISTICS)
• Native scalar value serialization
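The extent figures above give a quick way to estimate how much parallelism an EXTRACT can reach on one file. This is a Python sketch of that arithmetic only; the function name extract_vertices is hypothetical, and the real scheduler may group extents differently.

```python
import math

EXTENT_MB = 250          # extent size stated on the slide
EXTENTS_PER_VERTEX = 4   # extents read by one vertex, as stated on the slide

def extract_vertices(file_size_mb: float, splittable: bool = True) -> int:
    """Estimate how many vertices read one file."""
    if not splittable:
        return 1  # non-splittable formats are read by a single vertex
    extents = math.ceil(file_size_mb / EXTENT_MB)
    return max(1, math.ceil(extents / EXTENTS_PER_VERTEX))

print(extract_vertices(10_000))                    # a 10 GB row-oriented file
print(extract_vertices(10_000, splittable=False))  # a 10 GB non-splittable file
```

The second call illustrates why large XML or JSON documents are costly: the whole file lands on one vertex regardless of how many ADLAUs the job has.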
// Unstructured Files (24 hours daily log impressions)
@Impressions =
    EXTRACT ClientId int, Market string, OS string, ...
    // Before: unnamed wildcard, extracts all files in the folder
    // FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif"
    // After: named virtual column Market enables partition elimination
    FROM @"wasb://ads@wcentralus/2015/10/30/{Market}_{*}.nif"
    ;
// …
// Filter by Market
@US =
    SELECT * FROM @Impressions
    WHERE Market == "en"
    ;
U-SQL Optimizations
Partition Elimination – Unstructured Files
Partition Elimination
• Even with unstructured files!
• Leverage Virtual Columns (Named)
• Avoid unnamed {*}
• WHERE predicates on named virtual columns
• That binds the PE range during compilation time
• Named virtual columns without predicate = warning
• Design directories/files with PE in mind
• Design for elimination early in the tree, not in the leaves
Extracts all files in the folder
Post filter = pay I/O cost to drop most data
PE pushes this predicate to the EXTRACT
EXTRACT now only reads “en” files!
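The effect of a named virtual column can be modelled outside U-SQL. The Python sketch below is an illustration, not the compiler's algorithm: it turns a file-set pattern into a regular expression and keeps only the files whose virtual column satisfies an equality predicate. The file names are the ones shown in the slide's example folder; pattern_to_regex and files_to_read are hypothetical helpers.

```python
import re

# File names from the slide's example folder ../2015/10/30/
FILES = ["en_10.0.nif", "en_8.1.nif", "de_10.0.nif", "jp_7.0.nif", "de_8.1.nif"]

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a file-set pattern such as '{Market}_{*}.nif' into a regex.
    {Name} becomes a named group (a virtual column); {*} matches anything."""
    regex = ""
    for part in re.split(r"(\{[^}]+\})", pattern):
        if part == "{*}":
            regex += ".*"
        elif part.startswith("{") and part.endswith("}"):
            regex += f"(?P<{part[1:-1]}>[^_]+)"
        else:
            regex += re.escape(part)
    return re.compile(f"^{regex}$")

def files_to_read(pattern, files, **equals):
    """Return the files matching the pattern and the virtual-column predicates."""
    rx = pattern_to_regex(pattern)
    selected = []
    for name in files:
        m = rx.match(name)
        if m and all(m.group(col) == val for col, val in equals.items()):
            selected.append(name)
    return selected

# Unnamed wildcard: nothing to filter on, every file is read.
print(files_to_read("{*}.nif", FILES))
# Named virtual column plus predicate: only the "en" files are read.
print(files_to_read("{Market}_{*}.nif", FILES, Market="en"))
```

With the unnamed pattern the filter can only run after all five files are read; with the named column the predicate decides which files are opened at all.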
How many clicks per domain?
@rows = SELECT
    Domain,
    SUM(Clicks) AS TotalClicks
FROM @ClickData
GROUP BY Domain;
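File
[Diagram: each of three extents holds rows for CNN, FB and WH. Every extent gets Read, Partition and Partial Agg steps, and the partial results are repartitioned (labelled "Expensive!") before Full Agg and Write.]
U-SQL Table Distributed by Domain
[Diagram: each extent holds a single domain (FB in extent 1, WH in extent 2, CNN in extent 3). Every extent gets Read, Full Agg and Write steps with no repartitioning.]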
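The difference between the two plans can be sketched in Python. This is a toy model with made-up click counts, not the engine's implementation: it counts how many partial rows have to move between vertices when the input is a file whose extents each contain every domain, compared with a table already distributed by Domain.

```python
from collections import Counter

# Three extents of an unstructured file: every domain appears in every extent.
# The click counts are illustrative values.
file_extents = [
    [("CNN", 1), ("FB", 2), ("WH", 3)],
    [("CNN", 4), ("FB", 5), ("WH", 6)],
    [("CNN", 7), ("FB", 8), ("WH", 9)],
]

def aggregate_file(extents):
    """Partial aggregate per extent, shuffle the partials, then full aggregate."""
    partials = []
    for rows in extents:
        part = Counter()
        for domain, clicks in rows:
            part[domain] += clicks
        partials.append(part)
    shuffled = sum(len(p) for p in partials)  # partial rows that must be moved
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total), shuffled

def aggregate_table(distributions):
    """Each distribution holds whole domains, so every vertex aggregates locally."""
    total = {}
    for rows in distributions:
        local = Counter()
        for domain, clicks in rows:
            local[domain] += clicks
        total.update(local)
    return total, 0  # nothing is moved between vertices

# Re-bucket the same rows by domain to model the distributed table.
by_domain = {}
for rows in file_extents:
    for domain, clicks in rows:
        by_domain.setdefault(domain, []).append((domain, clicks))
table_distributions = list(by_domain.values())

print(aggregate_file(file_extents))
print(aggregate_table(table_distributions))
```

Both paths return the same totals; only the file-based plan pays for moving partial results.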
Scaling out with
Table Partitioning and Distribution
• Fine grained (horizontal) partitioning/distribution
• Distributes within a partition (together with clustering) to keep same
data values close
• Choose for:
• Join alignment, partition size, filter selectivity, partition elimination
• Coarse grained (vertical) partitioning
• Based on Partition keys
• Partition is addressable in language
• Query predicates will allow partition pruning
• Choose for data life cycle management, partition elimination
Distribution Scheme   When to use?
HASH(keys)            Automatic hash for fast item lookup
DIRECT HASH(id)       Exact control of hash bucket value
RANGE(keys)           Keeps ranges together
ROUND ROBIN           To get equal distribution (if others give skew)
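To see why ROUND ROBIN is listed as the fallback when other schemes give skew, the Python sketch below compares bucket sizes for a hash distribution and a round-robin distribution over a skewed key column. It is a toy model: zlib.crc32 stands in for the engine's hash function, and the key values and counts are invented for illustration.

```python
import zlib
from collections import Counter

def hash_bucket(key: str, buckets: int) -> int:
    # Assumption: crc32 stands in for whatever hash the engine really uses.
    return zlib.crc32(key.encode("utf-8")) % buckets

def distribute_hash(keys, buckets):
    """Rows with the same key always land in the same bucket."""
    sizes = Counter(hash_bucket(k, buckets) for k in keys)
    return [sizes.get(b, 0) for b in range(buckets)]

def distribute_round_robin(keys, buckets):
    """Rows are dealt out in turn, ignoring their key."""
    sizes = [0] * buckets
    for i, _ in enumerate(keys):
        sizes[i % buckets] += 1
    return sizes

# Skewed input: one key dominates.
keys = ["CA"] * 900 + ["UT"] * 40 + ["HI"] * 30 + ["RI"] * 30

print("hash:       ", distribute_hash(keys, 4))
print("round robin:", distribute_round_robin(keys, 4))
```

Hashing keeps equal keys together, which helps lookups and joins but puts all 900 "CA" rows in one bucket; round robin balances the buckets at the price of losing key locality.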
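Partitions, Distributions and Clusters
TABLE T ( id …
        , C …
        , date DateTime, …
        , INDEX i CLUSTERED (id, C)
          PARTITIONED BY (date)
          DISTRIBUTED BY HASH(id) INTO 4)
[Diagram, LOGICAL view: PARTITION (@date1), PARTITION (@date2), PARTITION (@date3), each split into hash distributions that hold clustered values C1 … C10]
[Diagram, PHYSICAL view: /catalog/…/tables/Guid(T)/ with one file per partition: Guid(T.p1).ss, Guid(T.p2).ss, Guid(T.p3).ss]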
Benefits of Clustered Index in Distribution
• Design for most frequent/costly queries
• Manage data skew in distribution bucket
• Provide locality of same data values
• Provide seeks and range scans for query predicates (index lookup)
A clustered index in tables is mandatory; choose it according to the desired benefits.
Pro Tip:
Distribution keys should be a prefix of the Clustered Index keys, especially for RANGE distribution.
The optimizer will then make use of global ordering:
If you make the RANGE distribution key a prefix of the index key, U-SQL will repartition on demand to align any UNION ALLed or JOINed tables or partitions!
Split points of table distribution partitions are chosen independently, so any partitioned table can do UNION ALL in this manner if the data is to be processed subsequently on the distribution key.
Benefits of Distribution in Tables
• Design for most frequent/costly queries
• Manage data skew in partition/table
• Manage parallelism in querying (by number of distributions)
• Manage minimizing data movement in joins
• Provide distribution seeks and range scans for query predicates (distribution bucket elimination)
Distribution in tables is mandatory; choose it according to the desired benefits.
Benefits of Partitioned Tables
• Partitions are addressable
• Enables finer-grained data lifecycle management at
partition level
• Manage parallelism in querying by number of partitions
• Query predicates provide partition elimination
• Predicate has to be constant-foldable
Use partitioned tables for
• Managing large amounts of incrementally growing
structured data
• Queries with strong locality predicates
• point in time, for specific market etc
• Managing windows of data
• provide data for last x months for processing
Partitioned tables
Use partitioned tables for querying parts of large amounts of incrementally growing structured data.
Get partition elimination optimizations with the right query predicates.
Creating partition table
CREATE TABLE vehiclesP(vehicle_id int, event_date DateTime, lat float, long float
, INDEX idx CLUSTERED (vehicle_id ASC)
  PARTITIONED BY(event_date) DISTRIBUTED BY HASH (vehicle_id) INTO 4);
Creating partitions
DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00,00,00,00,DateTimeKind.Utc);
DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00,DateTimeKind.Utc);
ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2);
Loading data into partitions dynamically
DECLARE @date1 DateTime = DateTime.Parse("2014-09-14");
DECLARE @date2 DateTime = DateTime.Parse("2014-09-16");
INSERT INTO vehiclesP ON INTEGRITY VIOLATION IGNORE
SELECT vehicle_id, event_date, lat, long FROM @data
WHERE event_date >= @date1 AND event_date <= @date2;
• Filters and inserts clean data only, ignoring "dirty" data
Loading data into partitions statically
ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@baddate);
INSERT INTO vehiclesP ON INTEGRITY VIOLATION MOVE TO @baddate
SELECT vehicle_id, lat, long FROM @data
WHERE event_date >= @date1 AND event_date <= @date2;
@Impressions =
    SELECT * FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    OPTION(SKEWFACTOR(Query)=0.5)
    ;
// Q1(A,B)
@Sessions =
    SELECT
        ClientId,
        Query,
        SUM(PageClicks) AS Clicks
    FROM @Impressions
    GROUP BY Query, ClientId
    ;
// Q2(B)
@Display =
    SELECT * FROM @Sessions
    INNER JOIN @Campaigns
    ON @Sessions.Query == @Campaigns.Query
    ;
U-SQL Optimizations
Distributions – Minimize (re)partitions
Join (Q2): input must be distributed on: (Query)
GROUP BY (Q1): input must be distributed on: (Query) or (ClientId) or (Query, ClientId)
Optimizer wants to distribute only once
But Query could be skewed
Data Distribution
• Re-distributing is very expensive
• Many U-SQL operators can handle multiple distribution choices
• Optimizer bases its decision upon estimations
Wrong statistics may result in worse query performance
// Unstructured (24 hours daily log impressions)
@Huge = EXTRACT ClientId int, ...
        FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif"
        ;
// Small subset (ie: ForgetMe opt out)
@Small = SELECT * FROM @Huge
         WHERE Bing.ForgetMe(x,y,z)
         OPTION(ROWCOUNT=500)
         ;
// Result (not enough info to determine simple Broadcast join)
@Remove = SELECT * FROM Bing.Sessions
          INNER JOIN @Small ON Sessions.Client == @Small.Client
          ;
U-SQL Optimizations
Distribution - Cardinality
Broadcast JOIN, right?
Without a hint, the optimizer has no stats telling it that @Small is small...
With OPTION(ROWCOUNT=500), broadcast is now a candidate.
Wrong statistics may result in worse query performance => CREATE STATISTICS
What is Data Skew?
• Some data points are
much more common
than others
• data may be
distributed such that
all rows that match a
certain key go to a
single vertex
• imbalanced
execution, vertex
time out.
Population by State
Low Distinctiveness Keys
• Keys with small
selectivity can lead
to large vertices even
without skew
@rows =
SELECT Gender, AGG<MyAgg>(…) AS Result
FROM @HugeInput
GROUP BY Gender;
Gender==Male Gender==Female
Vertex 0 Vertex 1
Why is this a problem?
Vertexes have a 5 hour runtime limit!
Your UDO or join may excessively allocate memory.
• Your memory usage may not be obvious due to garbage collection
Addressing Data Skew/Low distinctiveness
• Improve data partition sizes:
  • Find more fine-grained keys, e.g., states and congressional districts or ZIP codes
  • If no fine-grained keys can be found or they are too fine-grained: use ROUND ROBIN distribution
• Write queries that can handle data skew:
  • Use filters that prune skew out early
  • Use Data Hints to identify skew and "low distinctness" in keys:
    • SKEWFACTOR(columns) = x
      provides a hint that the given columns have a skew factor x between 0 (no skew) and 1 (very heavy skew)
    • DISTINCTVALUE(columns) = n
      lets you specify how many distinct values the given columns have (n>1)
  • Implement aggregation/reducer recursively if possible
Non-Recursive vs Recursive SUM
Non-recursive: 1 2 3 4 5 6 7 8 -> 36
Recursive: (1 2 3) (4 5 6) (7 8) -> 6 15 15 -> 36
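The two ways of summing the numbers above can be written out as a small Python sketch. It is a model of the idea, not of the U-SQL runtime: the fan_in parameter (how many rows one simulated vertex sums per level) is an assumption chosen so that the intermediate level matches the slide's 6, 15, 15.

```python
def non_recursive_sum(values):
    """One vertex sees every row."""
    return sum(values)

def recursive_sum(values, fan_in=3):
    """Sum groups of up to fan_in rows per simulated vertex,
    then repeat on the partial sums until one value is left."""
    level = list(values)
    if not level:
        return 0, [[]]
    levels = [level]
    while len(level) > 1:
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        levels.append(level)
    return level[0], levels

total, levels = recursive_sum(range(1, 9))
for depth, row in enumerate(levels):
    print(depth, row)
print(total == non_recursive_sum(range(1, 9)))
```

No single simulated vertex ever handles more than fan_in rows, which is what keeps a skewed group from overloading one vertex.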
U-SQL Partitioning during Processing
Data Skew
U-SQL Partitioning
Data Skew – Recursive Reducer
// Metrics per domain
@Metric =
    REDUCE @Impressions ON UrlDomain
    USING new Bing.TopNReducer(count:10)
    ;
// …
Inherent Data Skew
[SqlUserDefinedReducer(IsRecursive = true)]
public class TopNReducer : IReducer
{
    public override IEnumerable<IRow>
        Reduce(IRowset input, IUpdatableRow output)
    {
        // Compute TOP(N) per group
        // …
    }
}
Recursive
• Allow multi-stage aggregation trees
• Requires same schema (input => output)
• Requires associativity:
  • R(x, y) = R( R(x), R(y) )
• Default = non-recursive
• User code has to honor recursive semantics
brought to a single vertex
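The associativity requirement R(x, y) = R( R(x), R(y) ) can be checked on a Top-N reducer with a short Python sketch. This is a stand-in for the C# reducer above, not a port of it: top_n plays the role of the Reduce body, and the partitions are invented sample data.

```python
import heapq

def top_n(rows, n):
    """Reducer body: keep the n largest rows. Output has the same shape as input."""
    return heapq.nlargest(n, rows)

def reduce_recursively(partitions, n):
    """Apply the reducer per partition, then again on the combined partial results."""
    partials = [top_n(p, n) for p in partitions]
    merged = [row for partial in partials for row in partial]
    return top_n(merged, n)

# One skewed group spread over three simulated vertices.
partitions = [[5, 1, 9], [7, 3], [8, 2, 6, 4]]
all_rows = [row for p in partitions for row in p]

print(top_n(all_rows, 2))                # single-vertex result
print(reduce_recursively(partitions, 2))  # multi-stage result
```

Because Top-N of the partial Top-N lists equals Top-N of all rows, the reducer can safely be marked recursive; a reducer without that property (for example one that emits a plain average) would return different results in the multi-stage plan.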
// Bing impressions
@Impressions = SELECT * FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    ;
// Compute sessions
@Sessions =
    REDUCE @Impressions ON Client, Market
    READONLY Market
    USING new Bing.SessionReducer(range : 30)
    ;
// Users metrics
@Metrics =
    SELECT * FROM @Sessions
    WHERE
    Market == "en-us"
    ;
// …
U-SQL Optimizations
Predicate pushing – UDO pass-through columns
Show me U-SQL UDOs!
// Bing impressions
@Impressions = SELECT * FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    ;
// Compute page views
@Impressions =
    PROCESS @Impressions
    PRODUCE Client, Market, Header string
    READONLY Market
    USING new Bing.HtmlProcessor()
    ;
@Sessions =
    REDUCE @Impressions ON Client, Market
    READONLY Market
    USING new Bing.SessionReducer(range : 30)
    ;
// Users metrics
@Metrics =
    SELECT * FROM @Sessions
    WHERE
    Market == "en-us"
    ;
U-SQL Optimizations
Predicate pushing – UDO row level processors
public abstract class IProcessor : IUserDefinedOperator
{
    /// <summary/>
    public abstract IRow Process(IRow input, IUpdatableRow output);
}
public abstract class IReducer : IUserDefinedOperator
{
    /// <summary/>
    public abstract IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output);
}
// Bing impressions
@Impressions = SELECT Client, Market, Html FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    ;
// Compute page views
@Impressions =
    PROCESS @Impressions
    PRODUCE Client, Market, Header string
    USING new Bing.HtmlProcessor()
    ;
// Users metrics
@Metrics =
    SELECT * FROM @Sessions
    WHERE
    Market == "en-us"
    // C# semantics: && evaluates left to right and short-circuits
    && Header.Contains("")
    // Relational semantics: AND lets the optimizer reorder and push predicates
    AND Header.Contains("")
    ;
U-SQL Optimizations
Predicate pushing – relational vs. C# semantics
// Bing impressions
@Impressions = SELECT * FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    ;
// Compute page views
@Impressions =
    PROCESS @Impressions
    PRODUCE *
    REQUIRED ClientId, HtmlContent(Header, Footer)
    USING new Bing.HtmlProcessor()
    ;
// Users metrics
@Metrics =
    SELECT ClientId, Market, Header FROM @Sessions
    WHERE
    Market == "en-us"
    ;
U-SQL Optimizations
Column Pruning and dependencies
Column Pruning
• Minimize I/O (data shuffling)
• Minimize CPU (complex processing, html)
• Requires dependency knowledge:
  • R(D*) = Input ( Output )
• Default: no pruning
• User code has to honor reduced columns
[Diagram: input columns A B C D E F G H I J K … M … 1000, of which only C, H and M are carried through the plan]
UDO Tips and Warnings
• Tips when using UDOs:
  • READONLY clause to allow pushing predicates through UDOs
  • REQUIRED clause to allow column pruning through UDOs
  • PRESORT on REDUCE if you need global order
  • Hint cardinality if the optimizer chooses the wrong plan
• Warnings and better alternatives:
  • Use SELECT with UDFs instead of PROCESS
  • Use User-defined Aggregators instead of REDUCE
  • Learn to use Windowing Functions (OVER expression)
• Good use-cases for PROCESS/REDUCE/COMBINE:
  • The logic needs to dynamically access the input and/or output schema.
    E.g., create a JSON doc for the data in the row where the columns are not known a priori.
  • Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO
  • You need an ordered Aggregator or produce more than 1 row per group
INSERT
Multiple INSERTs into same table
• Generates a separate file per insert in physical storage:
  • Can lead to performance degradation
• Recommendations:
  • Try to avoid small inserts
  • Rebuild table after frequent insertions with:
    ALTER TABLE T REBUILD;
Future Items
GA and beyond
• Tooling
  • Resource planning based on $-cost
• Storage support
  • Storage compression (available since this week!)
  • Columnar Storage/Index
  • Secondary Index
Blogs and community page:
• (U-SQL Github)
Documentation and articles:
ADL forums and feedback
Slide decks
Thank You
Learn more from
Michael Rys or follow @MikeDoesBigData

Tuning and Optimizing U-SQL Queries (SQLPASS 2016)

  • 1. Tuning and Optimizing U-SQL queries for maximum performance Michael Rys, Principal Program Manager, Microsoft @MikeDoesBigData, AD-315-MAD-400-M
  • 3. Session Objectives And Takeaways Session Objective(s): • Understand the U-SQL Query execution • Be able to understand and improve U-SQL job performance/cost • Be able to understand and improve the U-SQL query plan • Know how to write more efficient U-SQL scripts Key Takeaways: • U-SQL is designed for scale-out • U-SQL provides scalable execution of user code • U-SQL has a tool set that can help you analyze and improve your scalability, cost and performance
  • 4. Agenda • Job Execution Experience and Investigations Query Execution Stage Graph Dryad crash course Job Metrics Resource Planning • Job Performance Analysis Analyze the critical path Heat Map Critical Path Data Skew • Tuning / Optimizations Cost Optimizations Data Partitioning Partition Elimination Predicate Pushing Column Pruning Some Data Hints UDOs can be evil INSERT optimizations U-SQL Query Execution and Performance Tuning
  • 6. Job Scheduler & Queue Front-EndService 6 Vertex Execution Consume Overall U-SQL Batch Job Execution Lifetime Local Storage Data Lake Store Author Plan Compiler Optimizer Vertexes running in YARN Containers U-SQL Runtime Optimized Plan Vertex Scheduling On containers Job Manager USQL Compiler Service & USQL Catalog
  • 7. Expression-flow Programming Style Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model. Execution plan that is optimized out-of-the- box and w/o user intervention. Per job and user driven level of parallelization. Detail visibility into execution steps, for debugging. Heatmap like functionality to identify performance bottlenecks.
  • 8. U-SQL Compilation Process C# C++ Algebra Other files (system files, deployed resources) managed dll Unmanaged dll Compilation output (in job folder) Compiler & Optimizer U-SQL Metadata Service Deployed to Vertices
  • 11. Parallelism 1000 (ADLAUs) Work composed of 12K Vertices 1 ADLAU currently maps to a VM with 2 cores and 6 GB of memory
  • 12. U-SQL Query Execution Physical plans vs. Dryad stage graph…
  • 13. Stage Details 252 Pieces of work AVG Vertex execution time 4.3 Billion rows Data Read & Written Super Vertex = Stage
  • 16. 18 U-SQL Performance Analysis Analyze the critical path, heat maps, playback, and runtime metrics on every vertex…
  • 19. Dips down to 1 active vertex at these times
  • 20. Smallest estimated time when given 2425 ADLAUs 1410 seconds = 23.5 minutes
  • 21. Model with 100 ADLAUs 8709 seconds = 145.15 minutes
  • 23. Data Storage • Files • Tables • Unstructured Data in files • Files are split into 250MB extents • 4 extents per vertex -> 1GB per vertex • Different file content formats: • Splittable formats are parallelizable: • row-oriented (CSV etc) • Where data does not span extents • Non-splittable formats cannot be parallelized: • XML/JSON • Have to be processed in single vertex extractor with atomicFileProcessing=true. • Use File Sets to provide semantic partition pruning • Tables • Clustered Index (row-oriented) storage • Vertical and horizontal partitioning • Statistics for the optimizer (CREATE STATISTICS) • Native scalar value serialization
  • 25. // Unstructured Files (24 hours daily log impressions) @Impressions = EXTRACT ClientId int, Market string, OS string, ... FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif" FROM @"wasb://ads@wcentralus/2015/10/30/{Market}_{*}.nif" ; // … // Filter to by Market @US = SELECT * FROM @Impressions WHERE Market == "en" ; U-SQL Optimizations Partition Elimination – Unstructured Files Partition Elimination • Even with unstructured files! • Leverage Virtual Columns (Named) • Avoid unnamed {*} • WHERE predicates on named virtual columns • That binds the PE range during compilation time • Named virtual columns without predicate = warning • Design directories/files with PE in mind • Design for elimination early in the tree, not in the leaves Extracts all files in the folder Post filter = pay I/O cost to drop most data PE pushes this predicate to the EXTRACT EXTRACT now only reads “en” files! en_10.0.nif en_8.1.nif de_10.0.nif jp_7.0.nif de_8.1.nif ../2015/10/30/ …
  • 27. How many clicks per domain? @rows = SELECT Domain, SUM(Clicks) AS TotalClicks FROM @ClickData GROUP BY Domain;
  • 28. File Read Read Partition Partition Full Agg Write Full Agg Write Full Agg Write Read Partition Partial Agg Partial Agg Partial Agg CNN, FB, WH EXTENT 1 EXTENT 2 EXTENT 3 CNN, FB, WH CNN, FB, WH U-SQL Table Distributed by Domain Read Read Full Agg Full Agg Write Write Read Full Agg Write FB EXTENT 1 WH EXTENT 2 CNN EXTENT 3 Expensive!
  • 30. Data Partitioning Tables Table Partitioning and Distribution • Fine grained (horizontal) partitioning/distribution • Distributes within a partition (together with clustering) to keep same data values close • Choose for: • Join alignment, partition size, filter selectivity, partition elimination • Coarse grained (vertical) partitioning • Based on Partition keys • Partition is addressable in language • Query predicates will allow partition pruning • Choose for data life cycle management, partition elimination Distribution Scheme When to use? HASH(keys) Automatic Hash for fast item lookup DIRECT HASH(id) Exact control of hash bucket value RANGE(keys) Keeps ranges together ROUND ROBIN To get equal distribution (if others give skew)
  • 31. Partitions, Distributions and Clusters TABLE T ( id … , C … , date DateTime, … , INDEX i CLUSTERED (id, C) PARTITIONED BY (date) DISTRIBUTED BY HASH(id) INTO 4) PARTITION (@date1) PARTITION (@date2) PARTITION (@date3) HASH DISTRIBUTION 1 HASH DISTRIBUTION 2 HASH DISTRIBUTION 3 HASH DISTRIBUTION 1 HASH DISTRIBUTION 1 HASH DISTRIBUTION 2 HASH DISTRIBUTION 3 HASH DISTRIBUTION 4 HASH DISTRIBUTION 3 C1 C2 C3 C1 C2 C4 C5 C4 C6 C6 C7 C8 C7 C5 C6 C9 C10 C1 C3 /catalog/…/tables/Guid(T)/ Guid(T.p1).ss Guid(T.p2).ss Guid(T.p3).ss LOGICAL PHYSICAL
  • 32. Benefits of Clustered Index in Distribution Benefits • Design for most frequent/costly queries • Manage data skew in distribution bucket • Provide locality of same data values • Provide seeks and range scans for query predicates (index lookup) Clustered index in tables is mandatory, choose according to desired benefits Pro Tip: Distribution keys should be prefix of Clustered Index keys: Especially for RANGE distribution Optimizer will make use of global ordering then: If you make the RANGE distribution key a prefix of the index key, U-SQL will repartition on demand to align any UNION ALLed or JOINed tables or partitions! Split points of table distribution partitions are chosen independently, so any partitioned table can do UNION ALL in this manner if the data is to be processed subsequently on the distribution key.
  • 33. Benefits of Distribution in Tables Benefits • Design for most frequent/costly queries • Manage data skew in partition/table • Manage parallelism in querying (by number of distributions) • Manage minimizing data movement in joins • Provide distribution seeks and range scans for query predicates (distribution bucket elimination) Distribution in tables is mandatory, choose according to desired benefits
  • 34. Benefits of Partitioned Tables Benefits • Partitions are addressable • Enables finer-grained data lifecycle management at partition level • Manage parallelism in querying by number of partitions • Query predicates provide partition elimination • Predicate has to be constant-foldable Use partitioned tables for • Managing large amounts of incrementally growing structured data • Queries with strong locality predicates • point in time, for specific market etc • Managing windows of data • provide data for last x months for processing
  • 35. Partitioned tables Use partitioned tables for querying parts of large amounts of incrementally growing structured data Get partition elimination optimizations with the right query predicates Creating partition table CREATE TABLE PartTable(id int, event_date DateTime, lat float, long float , INDEX idx CLUSTERED (vehicle_id ASC) PARTITIONED BY(event_date) DISTRIBUTED BY HASH (vehicle_id) INTO 4); Creating partitions DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00,00,00,00,DateTimeKind.Utc); DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00,DateTimeKind.Utc); ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2); Loading data into partitions dynamically DECLARE @date1 DateTime = DateTime.Parse("2014-09-14"); DECLARE @date2 DateTime = DateTime.Parse("2014-09-16"); INSERT INTO vehiclesP ON INTEGRITY VIOLATION IGNORE SELECT vehicle_id, event_date, lat, long FROM @data WHERE event_date >= @date1 AND event_date <= @date2; • Filters and inserts clean data only, ignore “dirty” data Loading data into partitions statically ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@baddate); INSERT INTO vehiclesP ON INTEGRITY VIOLATION MOVE TO @baddate SELECT vehicle_id, lat, long FROM @data WHERE event_date >= @date1 AND event_date <= @date2;
  • 36. @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView OPTION(SKEWFACTOR(Query)=0.5) ; // Q1(A,B) @Sessions = SELECT ClientId, Query, SUM(PageClicks) AS Clicks FROM @Impressions GROUP BY Query, ClientId ; // Q2(B) @Display = SELECT * FROM @Sessions INNER JOIN @Campaigns ON @Sessions.Query == @Campaigns.Query ; U-SQL Optimizations Distributions – Minimize (re)partitions Input must be distributed on: (Query) Input must be distributed on: (Query) or (ClientId) or (Query, ClientId) Optimizer wants to distribute only once But Query could be skewed Data Distribution • Re-Distributing is very expensive • Many U-SQL operators can handle multiple distribution choices • Optimizer bases decision upon estimations Wrong statistics may result in worse query performance
  • 37. // Unstructured (24 hours daily log impressions) @Huge = EXTRACT ClientId int, ... FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif" ; // Small subset (ie: ForgetMe opt out) @Small = SELECT * FROM @Huge WHERE Bing.ForgetMe(x,y,z) OPTION(ROWCOUNT=500) ; // Result (not enough info to determine simple Broadcast join) @Remove = SELECT * FROM Bing.Sessions INNER JOIN @Small ON Sessions.Client == @Small.Client ; U-SQL Optimizations Distribution - Cardinality Broadcast JOIN right? Broadcast is now a candidate. Wrong statistics may result in worse query performance => CREATE STATISTICS Optimizer has no stats this is small...
  • 39. What is Data Skew? • Some data points are much more common than others • data may be distributed such that all rows that match a certain key go to a single vertex • imbalanced execution, vertex time out. 0 5,000,000 10,000,000 15,000,000 20,000,000 25,000,000 30,000,000 35,000,000 40,000,000 California Florida Ohio NorthCarolina Washington Indiana Maryland Colorado Louisiana Oklahoma Mississippi Utah Nebraska Hawaii RhodeIsland SouthDakota DistrictofColumbia Population by State
  • 40. Low Distinctiveness Keys • Keys with small selectivity can lead to large vertices even without skew @rows = SELECT Gender, AGG<MyAgg>(…) AS Result FROM @HugeInput GROUP BY Gender; Gender==Male Gender==Female @HugeInput Vertex 0 Vertex 1
  • 41. Why is this a problem? Vertexes have a 5 hour runtime limit! Your UDO or join may excessively allocate memory. • Your memory usage may not be obvious due to garbage collection
  • 42. Addressing Data Skew/Low distinctiveness • Improve data partition sizes: • Find more fine grained keys, e.g., states and congressional districts or ZIP codes • If no fine grained keys can be found or are too fine-grained: use ROUND ROBIN distribution • Write queries that can handle data skew: • Use filters that prune skew out early • Use Data Hints to identify skew and “low distinctness” in keys: • SKEWFACTOR(columns) = x provides hint that given columns have a skew factor x between 0 (no skew) and 1 (very heavy skew) • DISTINCTVALUE(columns) = n lets you specify how many distinct values the given columns have (n>1) • Implement aggregation/reducer recursively if possible
  • 43. Non-Recursive vs Recursive SUM 1 2 3 4 5 6 7 8 36 1 2 3 4 5 6 7 8 6 15 15 36
  • 44. U-SQL Partitioning during Processing Data Skew
  • 45. U-SQL Partitioning Data Skew – Recursive Reducer // Metrics per domain @Metric = REDUCE @Impressions ON UrlDomain USING new Bing.TopNReducer(count:10) ; // … Inherent Data Skew [SqlUserDefinedReducer(IsRecursive = true)] public class TopNReducer : IReducer { public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output) { // Compute TOP(N) per group // … } } Recursive • Allow multi-stage aggregation trees • Requires same schema (input => output) • Requires associativity: • R(x, y) = R( R(x), R(y) ) • Default = non-recursive • User code has to honor recursive semantics brought to a single vertex
  • 47. // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute sessions @Sessions = REDUCE @Impressions ON Client, Market READONLY Market USING new Bing.SessionReducer(range : 30) ; // Users metrics @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" ; // … Microsoft Confidential U-SQL Optimizations Predicate pushing – UDO pass-through columns
  • 48. Show me U- SQL UDOs! 16/06/27/how-do-i-combine-overlapping-ranges- using-u-sql-introducing-u-sql-reducer-udos/
  • 49. // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions READONLY Market PRODUCE Client, Market, Header string USING new Bing.HtmlProcessor() ; @Sessions = REDUCE @Impressions ON Client, Market READONLY Market USING new Bing.SessionReducer(range : 30) ; // Users metrics @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" ; Microsoft Confidential U-SQL Optimizations Predicate pushing – UDO row level processors public abstract class IProcessor : IUserDefinedOperator { /// <summary/> public abstract IRow Process(IRow input, IUpdatableRow output); } public abstract class IReducer : IUserDefinedOperator { /// <summary/> public abstract IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output); }
  • 50. // Bing impressions @Impressions = SELECT Client, Market, Html FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions PRODUCE Client, Market, Header string USING new Bing.HtmlProcessor() ; // Users metrics @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" && Header.Contains("") AND Header.Contains("") ; U-SQL Optimizations Predicate pushing – relational vs. C# semantics
  • 51. // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions PRODUCE * REQUIRED ClientId, HtmlContent(Header, Footer) USING new Bing.HtmlProcessor() ; // Users metrics @Metrics = SELECT ClientId, Market, Header FROM @Sessions WHERE Market == "en-us" ; U-SQL Optimizations Column Pruning and dependencies C H M C H M C H M Column Pruning • Minimize I/O (data shuffling) • Minimize CPU (complex processing, html) • Requires dependency knowledge: • R(D*) = Input ( Output ) • Default no pruning • User code has to honor reduced columns A B C D E F G J KH I … M … 1000
  • 52. UDO Tips and Warnings • Tips when Using UDOs: • READONLY clause to allow pushing predicates through UDOs • REQUIRED clause to allow column pruning through UDOs • PRESORT on REDUCE if you need global order • Hint Cardinality if it does choose the wrong plan • Warnings and better alternatives: • Use SELECT with UDFs instead of PROCESS • Use User-defined Aggregators instead of REDUCE • Learn to use Windowing Functions (OVER expression) • Good use-cases for PROCESS/REDUCE/COMBINE: • The logic needs to dynamically access the input and/or output schema. E.g., create a JSON doc for the data in the row where the columns are not known apriori. • Your UDF based solution creates too much memory pressure and you can write your code more memory efficient in a UDO • You need an ordered Aggregator or produce more than 1 row per group
  • 54. INSERT Multiple INSERTs into same table • Generates separate file per insert in physical storage: • Can lead to performance degradation • Recommendations: • Try to avoid small inserts • Rebuild table after frequent insertions with: ALTER TABLE T REBUILD;
  • 55. Future Items GA and beyond • Tooling • Resource planning based on $-cost • Storage support • Storage compression (available since this week!) • Columnar Storage/Index • Secondary Index
  • 56. Additional Resources Blogs and community page: • (U-SQL Github) • • • Documentation and articles: • • analytics/ • ADL forums and feedback • • US/home?forum=AzureDataLake • Slide decks •
  • 59. Thank You Learn more from Michael Rys or follow @MikeDoesBigData

