The document discusses best practices and performance tuning for U-SQL in Azure Data Lake. It provides an overview of U-SQL query execution, including the job scheduler, query compilation process, and vertex execution model. The document also covers techniques for analyzing and optimizing U-SQL job performance, including analyzing the critical path, using heat maps, optimizing AU usage, addressing data skew, and query tuning techniques like data loading tips, partitioning, predicate pushing and column pruning.
1. Best Practices and Performance Tuning of U-SQL in Azure Data Lake
Michael Rys
Principal Program Manager, Microsoft
@MikeDoesBigData, usql@microsoft.com
2. Session Objectives And Takeaways
• Session Objective(s):
• Understand some common best practices to make your U-SQL scale
• Understand the U-SQL Query execution
• Be able to understand and improve U-SQL job performance/cost
• Be able to understand and improve the U-SQL query plan
• Know how to write more efficient U-SQL scripts
• Key Takeaways:
• U-SQL is designed for scale-out
• U-SQL provides scalable execution of user code
• U-SQL has a tool set that can help you analyze and improve your scalability, cost and performance
3. Agenda
• Job Execution Experience and Investigations
  • Query Execution
  • Stage Graph
  • Dryad crash course
  • Job Metrics
  • Resource Planning
  • Job Parallelism and AUs
• Job Performance Analysis
  • Analyze the critical path
  • Heat Map
  • Critical Path
  • Data Skew
• Cost Optimizations
  • AU usage
  • AU modeller
• Tuning / Optimizations
  • Data Loading Tips
  • Data Partitioning (Files and Tables)
  • INSERT optimizations
  • Partition Elimination
  • Predicate Pushing
  • Column Pruning
  • Addressing Data Skew
  • Some Data Hints
  • UDOs can be evil
5. Expression-flow Programming Style
• Automatic "in-lining" of U-SQL expressions – the whole script leads to a single execution model.
• Execution plan that is optimized out-of-the-box and without user intervention.
• Per-job and user-driven level of parallelization.
• Detailed visibility into execution steps, for debugging.
• Heatmap-like functionality to identify performance bottlenecks.
6. Overall U-SQL Batch Job Execution Lifetime
[Architecture diagram: the job is authored and submitted through the Front-End Service; the Compiler and Optimizer, using the U-SQL Catalog, produce an optimized plan with per-stage code-gen (C++/C# compilation); the plan goes to the Job Scheduler & Queue; the Job Manager schedules vertices on YARN containers, where the U-SQL Runtime executes them against local storage and the Data Lake Store, from which the results are consumed.]
7. Overall U-SQL Batch Job Execution Lifetime – Phases
[Same diagram, annotated with the four phases: Queueing, Preparation (compilation and vertex code-gen), Execution (vertex scheduling on containers and vertex execution), and Finalization.]
8. U-SQL Preparation into Vertex (.NET example)
[Diagram: Compilation and Optimization turn the U-SQL script into compilation output in the job folder – the algebra, a managed DLL (from the C# code), a native DLL (from the C++ code), plus additional non-DLL files and deployed resources. Together with REFERENCE ASSEMBLY objects resolved through the U-SQL Metadata Service, DEPLOY RESOURCE files from ADLS (or WASB), and the system files (built-in runtimes, core DLLs, OS), these are deployed to the vertices.]
13. Stage Details
• Super Vertex = Stage
• 6307 pieces of work: the same code applied to different data partitions
• AVG vertex execution time
• 4.4 billion rows of data read & written
16. AU allocation vs Job Parallelism/Scale Out
• Job’s inherent scale out is determined by
• Size of input data plus data partitioning
• Operations
• Cross join vs parallelizable equi-join
• Order By: Global order vs local order
• REDUCE ALL vs REDUCE ON part_key
• Hints:
• Data size hints
• Partition plan hint (works only if the plan space contains an option for that plan)
• AU allocation does not specify the job's scale-out; it only specifies an upper limit
• Increasing the AUs does not necessarily make your job faster
• If you over-specify the AU allocation, you will pay for the overallocation!
24. Data Storage
• Files
• Tables
Unstructured data in files:
• Files are split into 250MB extents
• 4 extents per vertex -> 1GB per vertex
• Different file content formats:
  • Splittable formats are parallelizable:
    • row-oriented (CSV etc.), where rows do not span extents
  • Non-splittable formats cannot be parallelized:
    • XML/JSON have to be processed in a single-vertex extractor with atomicFileProcessing=true
• Use file sets to provide semantic partition pruning
Tables:
• Clustered index (row-oriented) storage
• Vertical and horizontal partitioning
• Statistics for the optimizer (CREATE STATISTICS)
• Native scalar value serialization
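The file-set tip above can be sketched as follows (the path layout and schema are hypothetical): date parts in the path become a named virtual column, and a predicate on it prunes whole directories at compile time.

```usql
// Minimal sketch – hypothetical path layout and schema.
@Impressions =
    EXTRACT ClientId int,
            Market string,
            date DateTime   // virtual column bound from the path
    FROM "/ads/{date:yyyy}/{date:MM}/{date:dd}/{*}.csv"
    USING Extractors.Csv();

// Only the 2015/10/30 directory is read; all other days are eliminated.
@Day =
    SELECT ClientId, Market
    FROM @Impressions
    WHERE date == DateTime.Parse("2015-10-30");
```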
26. Real-Time Data Ingestion
Promises of the Data Lake:
1. Store everything and anything in its original form, figure it out later
2. BYO compute (Hadoop, ADLA, other) to a shared collection of data
3. Massive scale, no limits, never run out of storage
Promise #1 is where we see most customers get into trouble!
27. Real-Time Data Ingestion
Many customers take the “original form” a little too literally
“The original form of my data is the event… I need to keep it”
But
• Discrete database transactions have been occurring forever, one row at a time
• We store them in singular (table) structures for efficiency
• We can still find a discrete record in its absolute original form
• We should apply similar optimizations to Data Lakes!!!
29. Real-Time Data Ingestion
Moral of the story…
• Small files are suboptimal in every scenario
• Consider alternatives that concatenate small files into larger ones:
• Offline outside of Azure
• Event Hubs Capture
• Stream Analytics
• … or ADLA fast file sets to compact most recent deltas
30. U-SQL Optimizations
Partition Elimination – Unstructured Files
// Unstructured files (24 hours of daily log impressions)
// Variant 1: unnamed wildcard – extracts all files in the folder
@Impressions =
    EXTRACT ClientId int, Market string, OS string, ...
    FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif"
    ;
// Variant 2: named virtual column – PE pushes the predicate into the EXTRACT
@Impressions =
    EXTRACT ClientId int, Market string, OS string, ...
    FROM @"wasb://ads@wcentralus/2015/10/30/{Market}_{*}.nif"
    ;
// …
// Filter by Market
@US =
    SELECT * FROM @Impressions
    WHERE Market == "en"
    ;
Partition Elimination
• Works even with unstructured files!
• Leverage named virtual columns; avoid unnamed {*}
  • With {*} alone, the WHERE clause is a post-filter: you pay the I/O cost to read everything and then drop most of the data
• WHERE predicates on named virtual columns bind the PE range at compilation time
  • With {Market}_{*}, the EXTRACT now only reads the “en” files (en_10.0.nif, en_8.1.nif) and skips de_10.0.nif, de_8.1.nif and jp_7.0.nif in ../2015/10/30/
• Named virtual columns without a predicate = warning
• Design directories/files with PE in mind
• Design for elimination early in the tree, not in the leaves
31. FileSet optimizations
• Use
  SET @@FeaturePreviews = "InputFileGrouping:on";
  to group small files into a single vertex (instead of one file per vertex)
• Also use the following to speed up preparation time and produce faster code:
  SET @@FeaturePreviews = "FileSetV2Dot5:on,AsyncCompilerStoreAccess:on";
  (will be the default with the next refresh and not needed afterwards)
32. Cost of small files with custom extractor
• Custom code deployment during vertex creation dominates vertex execution time over the actual work!
[Chart: vertex creation time (deploying the vertex content) dwarfs the actual vertex execution work.]
37. Table Partitioning and Distribution
Fine-grained (horizontal) partitioning/distribution
• Distributes within a partition (together with clustering) to keep the same data values close
• Choose for: join alignment, partition size, filter selectivity, partition elimination
Coarse-grained (vertical) partitioning
• Based on partition keys
• Partition is addressable in the language
• Query predicates will allow partition pruning
• Choose for: data life-cycle management, partition elimination
Distribution scheme – when to use?
• HASH(keys): automatic hash for fast item lookup
• DIRECT HASH(id): exact control of the hash bucket value
• RANGE(keys): keeps ranges together
• ROUND ROBIN: to get equal distribution (if the others give skew)
38. Partitions, Distributions and Clusters
TABLE T ( id …, C …, date DateTime, …,
          INDEX i CLUSTERED (id, C)
          PARTITIONED BY (date)
          DISTRIBUTED BY HASH(id) INTO 4 )
[Diagram – LOGICAL view: three partitions, PARTITION(@date1), PARTITION(@date2), PARTITION(@date3); each partition holds up to 4 hash distributions on id, and within each distribution the rows (e.g., id1/C1, id5/C2, id9/C3, …) are clustered on (id, C).
PHYSICAL view: each partition is stored as its own structured-stream file under /catalog/…/tables/Guid(T)/ – Guid(T.p1).ss, Guid(T.p2).ss, Guid(T.p3).ss.]
39. Benefits of Tables
Benefits of table clustering and distribution
• Faster lookup of data, provided by distribution and clustering, when the right distribution/cluster is chosen
• Data distribution provides better localized scale-out
• Used for filters, joins and grouping
Benefits of table partitioning
• Provides data life-cycle management (“expire” old partitions): partition on a date/time dimension
• Partial re-computation of data at the partition level
• Query predicates can provide partition elimination
Do not use when…
• No filters, joins or grouping
• No reuse of the data for future queries
If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.
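The sampling tip can be sketched like this (the rowset name is hypothetical); SAMPLE ANY returns an arbitrary subset of roughly the requested size, which is cheap because it does not require a full scan.

```usql
// Test a query shape on ~1000 arbitrary rows before
// committing to a table/distribution design.
@Sample =
    SELECT * FROM @Impressions
    SAMPLE ANY (1000);
```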
40. Benefits of Distribution in Tables
Benefits
• Design for the most frequent/costly queries
• Manage data skew in a partition/table
• Manage parallelism in querying (by the number of distributions)
• Minimize data movement in joins
• Provide distribution seeks and range scans for query predicates (distribution bucket elimination)
Distribution in tables is mandatory; choose according to the desired benefits.
41. Benefits of Clustered Index in Distribution
Benefits
• Design for the most frequent/costly queries
• Manage data skew in a distribution bucket
• Provide locality of the same data values
• Provide seeks and range scans for query predicates (index lookup)
A clustered index in tables is mandatory; choose according to the desired benefits.
Pro Tip:
Distribution keys should be a prefix of the clustered index keys, especially for RANGE distribution. The optimizer will then make use of the global ordering: if you make the RANGE distribution key a prefix of the index key, U-SQL will repartition on demand to align any UNION ALLed or JOINed tables or partitions. Split points of table distribution partitions are chosen independently, so any partitioned table can do UNION ALL in this manner if the data is subsequently processed on the distribution key.
42. Benefits of Partitioned Tables
Benefits
• Partitions are addressable
• Enables finer-grained data life-cycle management at the partition level
• Manage parallelism in querying by the number of partitions
• Query predicates provide partition elimination
  • The predicate has to be constant-foldable
Use partitioned tables for
• Managing large amounts of incrementally growing structured data
• Queries with strong locality predicates
  • point in time, for a specific market, etc.
• Managing windows of data
  • e.g., provide the data for the last x months for processing
43. Partitioned tables
Use partitioned tables for querying parts of large amounts of incrementally growing structured data.
Get partition-elimination optimizations with the right query predicates.
Creating a partitioned table
CREATE TABLE vehiclesP(vehicle_id int, event_date DateTime, lat float, long float
    , INDEX idx CLUSTERED (vehicle_id ASC)
    PARTITIONED BY(event_date) DISTRIBUTED BY HASH (vehicle_id) INTO 4);
Creating partitions
DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00,00,00,00, DateTimeKind.Utc);
DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00, DateTimeKind.Utc);
ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2);
Loading data into partitions dynamically
DECLARE @date1 DateTime = DateTime.Parse("2014-09-14");
DECLARE @date2 DateTime = DateTime.Parse("2014-09-16");
INSERT INTO vehiclesP ON INTEGRITY VIOLATION IGNORE
SELECT vehicle_id, event_date, lat, long FROM @data
WHERE event_date >= @date1 AND event_date <= @date2;
Filters and inserts clean data only; “dirty” data is ignored.
Loading data into partitions statically
ALTER TABLE vehiclesP ADD PARTITION (@baddate);
INSERT INTO vehiclesP ON INTEGRITY VIOLATION MOVE TO PARTITION (@baddate)
SELECT vehicle_id, event_date, lat, long FROM @data
WHERE event_date >= @date1 AND event_date <= @date2;
Filters and inserts clean data only; “dirty” data goes into a special partition.
44. Avoid Table Fragmentation and Over-partitioning!
What is table fragmentation?
• ADLS is an append-only store!
• Every INSERT statement creates a new file (an INSERT fragment)
Why is it bad?
• Every INSERT fragment contains data in its own distribution buckets, so query processing loses the ability to get “localized” fast access
• Query generation now has to read from many files -> slow preparation phase that may time out
• Reading from too many files is disallowed: current LIMIT: 3000 table partitions and INSERT fragments per job!
What if I have to add data incrementally?
• Batch inserts into the table
• Use ALTER TABLE REBUILD / ALTER TABLE REBUILD PARTITION regularly to reduce fragmentation and keep performance
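A minimal sketch of the de-fragmentation step above, assuming a partitioned table named vehiclesP as in the earlier partitioning example:

```usql
// Rebuild the whole table to merge INSERT fragments:
ALTER TABLE vehiclesP REBUILD;

// Or rebuild just the partition that keeps receiving inserts:
DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00, 00, 00, 00, DateTimeKind.Utc);
ALTER TABLE vehiclesP REBUILD PARTITION (@pdate1);
```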
45. U-SQL Optimizations
Distributions – Minimize (re)partitions
@Impressions =
    SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView
    OPTION(SKEWFACTOR(Query)=0.5)
    ;
// Q1(A,B)
@Sessions =
    SELECT ClientId, Query, SUM(PageClicks) AS Clicks
    FROM @Impressions
    GROUP BY Query, ClientId
    ;
// Q2(B)
@Display =
    SELECT * FROM @Sessions
    INNER JOIN @Campaigns
    ON @Sessions.Query == @Campaigns.Query
    ;
Data Distribution
• Q2's join input must be distributed on (Query); Q1's input must be distributed on (Query), (ClientId) or (Query, ClientId)
• The optimizer wants to distribute only once – but Query could be skewed, hence the SKEWFACTOR hint
• Re-distributing is very expensive
• Many U-SQL operators can handle multiple distribution choices
• The optimizer bases its decision upon estimations; wrong statistics may result in worse query performance
46. U-SQL Optimizations
Distribution – Cardinality
// Unstructured (24 hours of daily log impressions)
@Huge = EXTRACT ClientId int, ...
    FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif"
    ;
// Small subset (i.e., ForgetMe opt-out)
@Small = SELECT * FROM @Huge
    WHERE Bing.ForgetMe(x,y,z)
    OPTION(ROWCOUNT=500)
    ;
// Result (without the hint, not enough info to choose a simple broadcast join)
@Remove = SELECT * FROM Bing.Sessions
    INNER JOIN @Small ON Sessions.Client == @Small.Client
    ;
• A broadcast join looks right – but the optimizer has no statistics telling it that @Small is small
• With the ROWCOUNT hint, broadcast is now a candidate
• Wrong statistics may result in worse query performance => CREATE STATISTICS on tables
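For table inputs, the statistics called out above can be created explicitly (the table and column names below are hypothetical):

```usql
// Give the optimizer cardinality information on the join column:
CREATE STATISTICS ClientStats
ON MyDB.dbo.Sessions (Client) WITH FULLSCAN;
```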
47. Write the Right Query!
• Do I really need a CROSS JOIN?
Example: get a daily “expansion” of my fact validity interval.
Approach 1 (the standard text-book solution):
Join the fact table with a daily date-dimension table, using the daily dimension between the begin and end of the fact.
=> Does not scale: the CROSS JOIN runs on a single node.
Approach 2:
Cross apply a daily-expansion expression to each fact row.
=> Scales very well, since it only depends on a single row.
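Approach 2 can be sketched like this (the rowset and column names are hypothetical). Each fact row is expanded into one row per day of its validity interval, so the work depends only on the single row and parallelizes well:

```usql
// Minimal sketch – assumes System.Linq is in scope (it is among the
// default U-SQL namespaces) and @Facts has FactId, BeginDate, EndDate.
@Expanded =
    SELECT f.FactId,
           day AS ValidDate
    FROM @Facts AS f
    CROSS APPLY
        EXPLODE(
            new SQL.ARRAY<DateTime>(
                Enumerable.Range(0, (int)(f.EndDate - f.BeginDate).TotalDays + 1)
                          .Select(i => f.BeginDate.AddDays(i))))
        AS D(day);
```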
49. What is Data Skew?
• Some data points are much more common than others
• Data may be distributed such that all rows that match a certain key go to a single vertex
• Result: imbalanced execution, vertex time-out
[Chart: “Population by State” – a bar chart (0 to 40,000,000) showing that a few states hold far more rows than the rest.]
50. Low Distinctiveness Keys
• Keys with small selectivity can lead to large vertices even without skew
@rows =
    SELECT Gender, AGG<MyAgg>(…) AS Result
    FROM @HugeInput
    GROUP BY Gender;
[Diagram: @HugeInput splits into only two vertices – Vertex 0 (Gender==Male) and Vertex 1 (Gender==Female).]
51. Why is this a problem?
• Vertices have a 5-hour runtime limit!
• Your UDO or join may excessively allocate memory
• Your memory usage may not be obvious, due to garbage collection
52. Addressing Data Skew/Low Distinctiveness
• Improve data partition sizes:
  • Find more fine-grained keys, e.g., states plus congressional districts or ZIP codes
  • If no fine-grained keys can be found, or they are too fine-grained: use ROUND ROBIN distribution
• Write queries that can handle data skew:
  • Use filters that prune the skew out early
• Use data hints to identify skew and “low distinctness” in keys:
  • SKEWFACTOR(columns) = x provides a hint that the given columns have a skew factor x between 0 (no skew) and 1 (very heavy skew)
  • DISTINCTVALUE(columns) = n lets you specify how many distinct values the given columns have (n > 1)
• Implement the aggregation/reducer recursively if possible
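The two hints above can be sketched like this (the rowset and column names are hypothetical):

```usql
// Tell the optimizer that Query is heavily skewed:
@ImpressionsHinted =
    SELECT * FROM @Impressions
    OPTION(SKEWFACTOR(Query) = 0.5);

// Tell the optimizer that Gender has only 2 distinct values:
@rows =
    SELECT Gender, COUNT(*) AS Cnt
    FROM @HugeInput
    GROUP BY Gender
    OPTION(DISTINCTVALUE(Gender) = 2);
```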
55. U-SQL Partitioning
Data Skew – Recursive Reducer
// Metrics per domain
@Metric =
    REDUCE @Impressions ON UrlDomain
    USING new Bing.TopNReducer(count:10)
    ;
// …
• Inherent data skew: a heavy domain such as www.bing.com is brought to a single vertex
[SqlUserDefinedReducer(IsRecursive = true)]
public class TopNReducer : IReducer
{
    public override IEnumerable<IRow>
        Reduce(IRowset input, IUpdatableRow output)
    {
        // Compute TOP(N) per group
        // …
    }
}
Recursive reducers
• Allow multi-stage aggregation trees
• Require the same schema (input => output)
• Require associativity: R(x, y) = R( R(x), R(y) )
• Default = non-recursive
• User code has to honor the recursive semantics
57. U-SQL Optimizations
Predicate pushing – UDO pass-through columns
// Bing impressions
@Impressions = SELECT * FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    ;
// Compute sessions
@Sessions =
    REDUCE @Impressions ON Client, Market
    READONLY Market
    USING new Bing.SessionReducer(range : 30)
    ;
// Users metrics
@Metrics =
    SELECT * FROM @Sessions
    WHERE Market == "en-us"
    ;
// …
• READONLY Market declares that the reducer passes the Market column through unchanged, so the Market predicate can be pushed below the UDO
Microsoft Confidential
58. U-SQL Optimizations
Predicate pushing – UDO row-level processors
// Bing impressions
@Impressions = SELECT * FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    ;
// Compute page views
@Impressions =
    PROCESS @Impressions
    READONLY Market
    PRODUCE Client, Market, Header string
    USING new Bing.HtmlProcessor()
    ;
@Sessions =
    REDUCE @Impressions ON Client, Market
    READONLY Market
    USING new Bing.SessionReducer(range : 30)
    ;
// Users metrics
@Metrics =
    SELECT * FROM @Sessions
    WHERE Market == "en-us"
    ;
public abstract class IProcessor : IUserDefinedOperator
{
    /// <summary/>
    public abstract IRow Process(IRow input, IUpdatableRow output);
}
public abstract class IReducer : IUserDefinedOperator
{
    /// <summary/>
    public abstract IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output);
}
59. U-SQL Optimizations
Predicate pushing – relational vs. C# semantics
// Bing impressions
@Impressions = SELECT Client, Market, Html FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    ;
// Compute page views
@Impressions =
    PROCESS @Impressions
    PRODUCE Client, Market, Header string
    USING new Bing.HtmlProcessor()
    ;
// Users metrics
@Metrics =
    SELECT * FROM @Impressions
    WHERE Market == "en-us"
    AND Header.Contains("microsoft.com")
    ;
• Written with the relational AND, the conjuncts can be split and the Market predicate pushed through the processor
• Written with the C# &&, the short-circuiting evaluation order has to be preserved, so the predicate cannot be split and pushed
60. U-SQL Optimizations
Column Pruning and dependencies
// Bing impressions
@Impressions = SELECT * FROM
    searchDM.SML.PageView(@start, @end) AS PageView
    ;
// Compute page views
@Impressions =
    PROCESS @Impressions
    PRODUCE *
    REQUIRED ClientId, HtmlContent(Header, Footer)
    USING new Bing.HtmlProcessor()
    ;
// Users metrics
@Metrics =
    SELECT ClientId, Market, Header FROM @Impressions
    WHERE Market == "en-us"
    ;
Column Pruning
• Minimizes I/O (data shuffling)
• Minimizes CPU (complex processing, e.g., HTML)
• Requires dependency knowledge: R(D*) = Input ( Output )
• Default: no pruning
• User code has to honor the reduced columns
[Diagram: of a wide input schema (columns A, B, C, D, E, F, G, H, I, J, K, …, M, … up to 1000 columns), only the columns actually needed (e.g., C, H, M) flow through each stage.]
61. UDO Tips and Warnings
• Tips when using UDOs:
  • READONLY clause to allow pushing predicates through UDOs
  • REQUIRED clause to allow column pruning through UDOs
  • PRESORT on REDUCE if you need a global order
  • Hint the cardinality if the optimizer chooses the wrong plan
• Warnings and better alternatives:
  • Use SELECT with UDFs instead of PROCESS
  • Use user-defined aggregators instead of REDUCE
  • Learn to use windowing functions (the OVER expression)
• Good use-cases for PROCESS/REDUCE/COMBINE:
  • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON document for the data in the row where the columns are not known a priori
  • Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO
  • You need an ordered aggregator, or to produce more than one row per group
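The windowing-function alternative above can be sketched like this (the rowset and column names are hypothetical): a ranking OVER expression replaces a custom top-N reducer.

```usql
// Top 10 URLs per domain without a custom reducer:
@Ranked =
    SELECT UrlDomain, Url, Clicks,
           ROW_NUMBER() OVER (PARTITION BY UrlDomain
                              ORDER BY Clicks DESC) AS rn
    FROM @Impressions;

@Top10 =
    SELECT UrlDomain, Url, Clicks
    FROM @Ranked
    WHERE rn <= 10;
```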
62. Resources • Blogs and community page:
• http://usql.io (U-SQL Github)
• http://blogs.msdn.microsoft.com/azuredatalake/
• http://blogs.msdn.microsoft.com/mrys/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation, presentations and articles:
• http://aka.ms/usql_reference
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• https://msdn.microsoft.com/magazine/mt790200
• http://www.slideshare.net/MichaelRys
• Getting Started with R in U-SQL
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
• ADL forums and feedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
• http://aka.ms/adlfeedback
Continue your education at Microsoft Virtual Academy online.