SlideShare a Scribd company logo
1 of 63
Best Practices and Performance
Tuning of U-SQL in Azure Data
Lake
Michael Rys
Principal Program Manager, Microsoft
@MikeDoesBigData, usql@microsoft.com
Session Objectives And Takeaways
• Session Objective(s):
• Understand some common best practices to make your U-SQL scale
• Understand the U-SQL Query execution
• Be able to understand and improve U-SQL job performance/cost
• Be able to understand and improve the U-SQL query plan
• Know how to write more efficient U-SQL scripts
• Key Takeaways:
• U-SQL is designed for scale-out
• U-SQL provides scalable execution of user code
• U-SQL has a tool set that can help you analyze and improve your scalability, cost and performance
Agenda
• Job Execution Experience and Investigations
Query Execution
Stage Graph
Dryad crash course
Job Metrics
Resource Planning
Job Parallelism and AUs
• Job Performance Analysis
Analyze the critical path
Heat Map
Critical Path
Data Skew
• Cost Optimizations
AU usage
AU modeller
• Tuning / Optimizations
Data Loading Tips
Data Partitioning (Files and Tables)
INSERT optimizations Partition Elimination
Predicate Pushing
Column Pruning
Addressing Data Skew
Some Data Hints
UDOs can be evil
U-SQL Query Execution
Expression-flow Programming Style
• Automatic "in-lining" of U-SQL expressions – whole
script leads to a single execution model.
• Execution plan that is optimized out-of-the-box and w/o
user intervention.
• Per job and user driven level of parallelization.
• Detail visibility into execution steps, for debugging.
• Heatmap like functionality to identify performance
bottlenecks.
Job Scheduler
& Queue
Front-EndService
Vertex Execution
Consume
Local
Storage
Data Lake
Store
Author
Plan
Compiler Optimizer
Vertexes
running in
YARN
Containers
U-SQL
Runtime
Optimized
Plan
Vertex Scheduling
On containers
Job Manager
USQL
Catalog
Overall U-SQL Batch Job Execution Lifetime
Stage Codegen
(C++/C#
Compilation)
Finalization Phase Execution Phase
Queueing
Preparation Phase
Job Scheduler
& Queue
Front-EndService
Vertex Execution
Consume
Local
Storage
Data Lake
Store
Author
Plan
Compiler Optimizer
Vertexes
running in
YARN
Containers
U-SQL
Runtime
Optimized
Plan
Vertex Scheduling
On containers
Job Manager
USQL
Catalog
Overall U-SQL Batch Job Execution Lifetime
Vertex
Codegen
(C++/C#
Compilation)
U-SQL Preparation into Vertex (.NET example)
C#
C++
Algebra
Additional non-dll files &
Deployed resources
managed dll
native dll
Compilation output (in job folder)
Compilation and Optimization
U-SQL
Metadata
Service
Deployed to
Vertices
REFERENCE ASSEMBLY
ADLS
(or WASB)
DEPLOY RESOURCE
System files
(built-in Runtimes, Core DLLs, OS)
Job Scheduling and Execution
Demo: Analyzing a job
Maximal degree of Parallelism
1000 (ADL AUs)
Work composed of
12K Vertices
1 ADL AU currently maps to a VM with 2 cores and 6 GB of memory
U-SQL Query Execution
Physical plans vs. Dryad stage graph…
Stage Details
6307 Pieces of work: Same
code applied to different
data partitions
AVG Vertex execution
time
4.4 Billion rows
Data Read &
Written
Super Vertex = Stage
U-SQL Query Execution
Redefinition of big-data…
16
U-SQL Query Execution
Redefinition of big-data: massive data is so 2000, now it’s ginormous data 
17
AU allocation vs Job Parallelism/Scale Out
• Job’s inherent scale out is determined by
• Size of input data plus data partitioning
• Operations
• Cross join vs parallelizable equi-join
• Order By: Global order vs local order
• REDUCE ALL vs REDUCE ON part_key
• Hints:
• Data size hints
• Partition plan hint (works only if the plan space contains an option for that plan)
• AU allocation does not specify Job Scale Out but specifies upper limit
• Increasing the AU does not make your job necessarily faster
• If you over-specify AU allocation, you will pay for overallocation!!!
U-SQL Performance Analysis
Analyze the critical path, heat maps, playback, and runtime metrics on every vertex…
19
Tuning U-SQL Jobs
Demo: Tuning for Cost Efficiency
Dips down to 1 active vertex at
these times
Smallest estimated time when
given 2425 ADLAUs
1410 seconds
= 23.5 minutes
Model with 100 ADLAUs
8709 seconds
= 145.5 minutes
Tuning for Performance –
Data Partitioning during loading,
storage and processing
Data Storage
• Files
• Tables
• Unstructured Data in files
• Files are split into 250MB extents
• 4 extents per vertex -> 1GB per vertex
• Different file content formats:
• Splittable formats are parallelizable:
• row-oriented (CSV etc)
• Where data does not span extents
• Non-splittable formats cannot be parallelized:
• XML/JSON
• Have to be processed in single vertex extractor with
atomicFileProcessing=true.
• Use File Sets to provide semantic partition pruning
• Tables
• Clustered Index (row-oriented) storage
• Vertical and horizontal partitioning
• Statistics for the optimizer (CREATE STATISTICS)
• Native scalar value serialization
Loading and
Querying unstructured data
Real-Time Data Ingestion
Promises of the Data Lake:
1. Store everything and anything in its original form, figure it out later
2. BYO compute (Hadoop, ADLA, other) to a shared collection of data
3. Massive scale, no limits, never run out of storage
No 1 is where we see most customers get into trouble!
Real-Time Data Ingestion
Many customers take the “original form” a little too literally 
“The original form of my data is the event… I need to keep it”
But
• Discrete database transactions have been occurring forever, one row at a time
• We store them in singular (table) structures for efficiency
• We can still find a discrete record in its absolute original form
• We should apply similar optimizations to Data Lakes!!!
Demo: Querying small files vs large files
Real-Time Data Ingestion
Moral of the story…
• Small files are suboptimal in every scenario
• Consider alternatives to concatenate into larger files:
• Offline outside of Azure
• Event Hubs Capture
• Stream Analytics
• … or ADLA fast file sets to compact most recent deltas
// Unstructured Files (24 hours daily log impressions)
@Impressions =
EXTRACT ClientId int, Market string, OS string, ...
FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif"
FROM @"wasb://ads@wcentralus/2015/10/30/{Market}_{*}.nif"
;
// …
// Filter to by Market
@US =
SELECT * FROM @Impressions
WHERE Market == "en"
;
Partition Elimination
• Even with unstructured files!
• Leverage Virtual Columns (Named)
• Avoid unnamed {*}
• WHERE predicates on named virtual columns
• That binds the PE range during compilation time
• Named virtual columns without predicate = warning
• Design directories/files with PE in mind
• Design for elimination early in the tree, not in the leaves
Extracts all files in the folder
Post filter = pay I/O cost to drop most data
PE pushes this predicate to the EXTRACT
EXTRACT now only reads “en” files!
en_10.0.nif
en_8.1.nif
de_10.0.nif
jp_7.0.nif
de_8.1.nif
../2015/10/30/
…
U-SQL Optimizations
Partition Elimination – Unstructured Files
FileSet optimizations
• Use
SET @@FeaturePreviews="InputFileGrouping:on";
to group small files into single vertex (instead of one file per vertex)
• Also use to speed up preparation time and produce faster code:
SET @@FeaturePreviews =
"FileSetV2Dot5:on,AsyncCompilerStoreAccess:on";
(will be default with next refresh and not needed afterwards)
Cost of small files with custom extractor
• Custom code deployment during vertex creation dominates vertex
execution time over actual work!
Vertex creation time:
Deploy Vertex content
Actual vertex execution
work!
Tuning for Performance
Structured Data
How many clicks per domain?
@rows = SELECT
Domain,
SUM(Clicks) AS TotalClicks
FROM @ClickData
GROUP BY Domain;
Read Read
Partition Partition
Full Agg
Write
Full Agg
Write
Full Agg
Write
Read
Partition
Partial Agg Partial Agg Partial Agg
CNN,
FB,
WH
EXTENT 1 EXTENT 2 EXTENT 3
CNN,
FB,
WH
CNN,
FB,
WH
U-SQL Table
Distributed by Domain
Read Read
Full Agg Full Agg
Write Write
Read
Full Agg
Write
FB
EXTENT 1
WH
EXTENT 2
CNN
EXTENT 3
Expensive!
Files
Demo: Scaling out with Distributions
Table Partitioning
and Distribution Fine grained (horizontal) partitioning/distribution
• Distributes within a partition (together with clustering) to
keep same data values close
• Choose for:
• Join alignment, partition size, filter selectivity, partition elimination
Coarse grained (vertical) partitioning
• Based on Partition keys
• Partition is addressable in language
• Query predicates will allow partition pruning
• Choose for data life cycle management, partition elimination
Distribution Scheme When to use?
HASH(keys) Automatic Hash for fast item lookup
DIRECT HASH(id) Exact control of hash bucket value
RANGE(keys) Keeps ranges together
ROUND ROBIN To get equal distribution (if others give skew)
Partitions, Distributions and Clusters
TABLE
T ( id …
, C …
, date DateTime, …
, INDEX i
CLUSTERED (id, C)
PARTITIONED BY (date)
DISTRIBUTED BY
HASH(id) INTO 4
)
PARTITION (@date1) PARTITION (@date2) PARTITION (@date3)
HASH DISTRIBUTION 1
HASH DISTRIBUTION 2
HASH DISTRIBUTION 3
HASH DISTRIBUTION 1
HASH DISTRIBUTION 1
HASH DISTRIBUTION 2
HASH DISTRIBUTION 3
HASH DISTRIBUTION 4 HASH DISTRIBUTION 3
id1C1
id5C2
id9C3
id1C1
id1C2
id2C4
id6C5
id2C4
id3C6
id7C6
id7C7
id11C8
id7C7
id2C5
id3C6
id4C9
id4C10
id5C1
id9C3
/catalog/…/tables/Guid(T)/
Guid(T.p1).ss Guid(T.p2).ss Guid(T.p3).ss
LOGICAL
PHYSICAL
Benefits of
Tables
Benefits of Table clustering and distribution
• Faster lookup of data provided by distribution and clustering when right
distribution/cluster is chosen
• Data distribution provides better localized scale out
• Used for filters, joins and grouping
Benefits of Table partitioning
• Provides data life cycle management (“expire” old partitions):
Partition on date/time dimension
• Partial re-computation of data at partition level
• Query predicates can provide partition elimination
Do not use when…
• No filters, joins and grouping
• No reuse of the data for future queries
If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.
Benefits of
Distribution in
Tables
Benefits
• Design for most frequent/costly queries
• Manage data skew in partition/table
• Manage parallelism in querying (by number of
distributions)
• Manage minimizing data movement in joins
• Provide distribution seeks and range scans for query
predicates (distribution bucket elimination)
Distribution in tables is mandatory, chose according to
desired benefits
Benefits of
Clustered Index
in Distribution
Benefits
• Design for most frequent/costly queries
• Manage data skew in distribution bucket
• Provide locality of same data values
• Provide seeks and range scans for query predicates (index
lookup)
Clustered index in tables is mandatory, chose according to
desired benefits
Pro Tip:
Distribution keys should be prefix of Clustered Index keys:
Especially for RANGE distribution
Optimizer will make use of global ordering then:
If you make the RANGE distribution key a prefix of the index key, U-SQL will
repartition on demand to align any UNIONALLed or JOINed tables or
partitions!
Split points of table distribution partitions are chosen independently, so any
partitioned table can do UNION ALL in this manner if the data is to be
processed subsequently on the distribution key.
Benefits of
Partitioned
Tables
Benefits
• Partitions are addressable
• Enables finer-grained data lifecycle management at
partition level
• Manage parallelism in querying by number of partitions
• Query predicates provide partition elimination
• Predicate has to be constant-foldable
Use partitioned tables for
• Managing large amounts of incrementally growing
structured data
• Queries with strong locality predicates
• point in time, for specific market etc
• Managing windows of data
• provide data for last x months for processing
Partitioned tables
Use partitioned tables for
querying parts of large
amounts of incrementally
growing structured data
Get partition elimination
optimizations with the right
query predicates
Creating partition table
CREATE TABLE PartTable(id int, event_date DateTime, lat float, long float
, INDEX idx CLUSTERED (vehicle_id ASC)
PARTITIONED BY(event_date) DISTRIBUTED BY HASH (vehicle_id) INTO 4);
Creating partitions
DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14,
00,00,00,00,DateTimeKind.Utc);
DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15,
00,00,00,00,DateTimeKind.Utc);
ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2);
Loading data into partitions dynamically
DECLARE @date1 DateTime = DateTime.Parse("2014-09-14");
DECLARE @date2 DateTime = DateTime.Parse("2014-09-16");
INSERT INTO vehiclesP ON INTEGRITY VIOLATION IGNORE
SELECT vehicle_id, event_date, lat, long FROM @data
WHERE event_date >= @date1 AND event_date <= @date2;
Filters and inserts clean data only, ignore “dirty” data
Loading data into partitions statically
ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@baddate);
INSERT INTO vehiclesP ON INTEGRITY VIOLATION MOVE TO @baddate
SELECT vehicle_id, lat, long FROM @data
WHERE event_date >= @date1 AND event_date <= @date2;
Filters and inserts clean data only, put “dirty” data into special partition
Avoid Table
Fragmentation
and
Over-
partitioning!
What is Table Fragmentation
• ADLS is an append-only store!
• Every INSERT statement is creating a new file (INSERT fragment)
Why is it bad?
• Every INSERT fragment contains data in its own distribution buckets,
thus query processing loses ability to get “localized” fast access
• Query generation has to read from many files now -> slow preparation
phase that may time out.
• Reading from too many files is disallowed:
Current LIMIT: 3000 table partitions and INSERT fragments per job!
What if I have to add data incrementally?
• Batch inserts into table
• Use ALTER TABLE REBUILD/ALTER TABLE REBUILD PARTITION regularly to reduce
fragmentation and keep performance.
@Impressions =
SELECT * FROM
searchDM.SML.PageView(@start, @end) AS PageView
OPTION(SKEWFACTOR(Query)=0.5)
;
// Q1(A,B)
@Sessions =
SELECT
ClientId,
Query,
SUM(PageClicks) AS Clicks
FROM
@Impressions
GROUP BY
Query, ClientId
;
// Q2(B)
@Display =
SELECT * FROM @Sessions
INNER JOIN @Campaigns
ON @Sessions.Query == @Campaigns.Query
; Input must be distributed on: (Query)
Input must be distributed on:
(Query) or (ClientId) or (Query, ClientId)
Optimizer wants to distribute only once
But Query could be skewed
Data Distribution
• Re-Distributing is very expensive
• Many U-SQL operators can handle multiple distribution choices
• Optimizer bases decision upon estimations
Wrong statistics may result in worse query performance
U-SQL Optimizations
Distributions – Minimize (re)partitions
// Unstructured (24 hours daily log impressions)
@Huge = EXTRACT ClientId int, ...
FROM
@"wasb://ads@wcentralus/2015/10/30/{*}.nif"
;
// Small subset (ie: ForgetMe opt out)
@Small = SELECT * FROM @Huge
WHERE Bing.ForgetMe(x,y,z)
OPTION(ROWCOUNT=500)
;
// Result (not enough info to determine simple Broadcast
join)
@Remove = SELECT * FROM Bing.Sessions
INNER JOIN @Small ON Sessions.Client ==
@Small.Client
;
Broadcast JOIN right?
Broadcast is now a candidate.
Wrong statistics may result in worse query performance
=> CREATE STATISTICS
Optimizer has no stats this is small...
U-SQL Optimizations
Distribution - Cardinality
Write the Right Query!
• Do I really need a CROSS JOIN?
Example: Get a daily “expansion” of my fact validity interval
Approach 1 (standard text-book solution):
Join fact table with daily date dimension table using daily dimension between
begin and end of fact
=> Does not scale: CROSS JOIN on single node
Approach 2:
Cross apply daily expansion expression to each fact row
=> Scales very well since only depends on a row
Tuning for Performance –
Handling Data Skew
What is Data Skew?
• Some data points are
much more common
than others
• data may be
distributed such that
all rows that match a
certain key go to a
single vertex
• imbalanced execution,
vertex time out.
0
5,000,000
10,000,000
15,000,000
20,000,000
25,000,000
30,000,000
35,000,000
40,000,000
Population by State
Low Distinctiveness Keys
• Keys with small
selectivity can lead to
large vertices even
without skew
@rows =
SELECT Gender, AGG<MyAgg>(…) AS Result
FROM @HugeInput
GROUP BY Gender;
Gender==Male Gender==Female
@HugeInput
Vertex 0 Vertex 1
Why is this a problem?
• Vertexes have a 5 hour runtime limit!
• Your UDO or join may excessively allocate memory.
• Your memory usage may not be obvious due to garbage collection
Addressing Data Skew/Low distinctiveness
• Improve data partition sizes:
• Find more fine grained keys, eg, states and congressional districts or ZIP codes
• If no fine grained keys can be found or are too fine-grained: use ROUND ROBIN
distribution
• Write queries that can handle data skew:
• Use filters that prune skew out early
• Use Data Hints to identify skew and “low distinctness” in keys:
• SKEWFACTOR(columns) = x
provides hint that given columns have a skew factor x between
0 (no skew) and 1 (very heavy skew))
• DISTINCTVALUE(columns) = n
let’s you specify how many distinct values the given columns have (n>1)
• Implement aggregation/reducer recursively if possible
Non-Recursive vs Recursive SUM
1 2 3 4 5 6 7 8 36
1 2 3 4 5 6 7 8
6 15 15
36
U-SQL Partitioning during Processing
Data Skew
// Metrics per domain
@Metric =
REDUCE @Impressions ON UrlDomain
USING new Bing.TopNReducer(count:10)
;
// …
Inherent Data Skew
[SqlUserDefinedReducer(IsRecursive = true)]
public class TopNReducer : IReducer
{
public override IEnumerable<IRow>
Reduce(IRowset input, IUpdatableRow output)
{
// Compute TOP(N) per group
// …
}
}
Recursive
• Allow multi-stage aggregation trees
• Requires same schema (input => output)
• Requires associativity:
• R(x, y) = R( R(x), R(y) )
• Default = non-recursive
• User code has to honor recursive semantics
www.bing.com
brought to a single vertex
U-SQL Partitioning
Data Skew – Recursive Reducer
Tuning for Performance – User Defined Operators
// Bing impressions
@Impressions = SELECT * FROM
searchDM.SML.PageView(@start, @end) AS PageView
;
// Compute sessions
@Sessions =
REDUCE @Impressions ON Client, Market
READONLY Market
USING new Bing.SessionReducer(range : 30)
;
// Users metrics
@Metrics =
SELECT * FROM @Sessions
WHERE
Market == "en-us"
;
// …
Microsoft Confidential
U-SQL Optimizations
Predicate pushing – UDO pass-through columns
// Bing impressions
@Impressions = SELECT * FROM
searchDM.SML.PageView(@start, @end) AS PageView
;
// Compute page views
@Impressions =
PROCESS @Impressions
READONLY Market
PRODUCE Client, Market, Header string
USING new Bing.HtmlProcessor()
;
@Sessions =
REDUCE @Impressions ON Client, Market
READONLY Market
USING new Bing.SessionReducer(range : 30)
;
// Users metrics
@Metrics =
SELECT * FROM @Sessions
WHERE
Market == "en-us"
;
Microsoft Confidential
U-SQL Optimizations
Predicate pushing – UDO row level processors
public abstract class IProcessor : IUserDefinedOperator
{
/// <summary/>
public abstract IRow Process(IRow input, IUpdatableRow output);
}
public abstract class IReducer : IUserDefinedOperator
{
/// <summary/>
public abstract IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output);
}
// Bing impressions
@Impressions = SELECT Client, Market, Html FROM
searchDM.SML.PageView(@start, @end) AS PageView
;
// Compute page views
@Impressions =
PROCESS @Impressions
PRODUCE Client, Market, Header string
USING new Bing.HtmlProcessor()
;
// Users metrics
@Metrics =
SELECT * FROM @Sessions
WHERE
Market == "en-us"
&& Header.Contains("microsoft.com")
AND Header.Contains("microsoft.com")
;
U-SQL Optimizations
Predicate pushing – relational vs. C# semantics
// Bing impressions
@Impressions = SELECT * FROM
searchDM.SML.PageView(@start, @end) AS PageView
;
// Compute page views
@Impressions =
PROCESS @Impressions
PRODUCE *
REQUIRED ClientId, HtmlContent(Header, Footer)
USING new Bing.HtmlProcessor()
;
// Users metrics
@Metrics =
SELECT ClientId, Market, Header FROM @Sessions
WHERE
Market == "en-us"
;
U-SQL Optimizations
Column Pruning and dependencies
C H M
C H M
C H M
Column Pruning
• Minimize I/O (data shuffling)
• Minimize CPU (complex processing, html)
• Requires dependency knowledge:
• R(D*) = Input ( Output )
• Default no pruning
• User code has to honor reduced columns
A B C D E F G J KH I … M … 1000
UDO Tips and Warnings
• Tips when Using UDOs:
• READONLY clause to allow pushing predicates through UDOs
• REQUIRED clause to allow column pruning through UDOs
• PRESORT on REDUCE if you need global order
• Hint Cardinality if it does choose the wrong plan
• Warnings and better alternatives:
• Use SELECT with UDFs instead of PROCESS
• Use User-defined Aggregators instead of REDUCE
• Learn to use Windowing Functions (OVER expression)
• Good use-cases for PROCESS/REDUCE/COMBINE:
• The logic needs to dynamically access the input and/or output schema.
E.g., create a JSON doc for the data in the row where the columns are not known apriori.
• Your UDF based solution creates too much memory pressure and you can write your code more memory efficient in a UDO
• You need an ordered Aggregator or produce more than 1 row per group
Resources • Blogs and community page:
• http://usql.io (U-SQL Github)
• http://blogs.msdn.microsoft.com/azuredatalake/
• http://blogs.msdn.microsoft.com/mrys/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation, presentations and articles:
• http://aka.ms/usql_reference
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• https://msdn.microsoft.com/magazine/mt790200
• http://www.slideshare.net/MichaelRys
• Getting Started with R in U-SQL
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
• ADL forums and feedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
• http://aka.ms/adlfeedback
Continue your education at
Microsoft Virtual Academy
online.
Thank you very much for
your attention!

More Related Content

What's hot

GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahAMD Developer Central
 
Battlelog - Building scalable web sites with tight game integration
Battlelog - Building scalable web sites with tight game integrationBattlelog - Building scalable web sites with tight game integration
Battlelog - Building scalable web sites with tight game integrationElectronic Arts / DICE
 
Default GitLab CI Pipeline - Auto DevOps
Default GitLab CI Pipeline - Auto DevOpsDefault GitLab CI Pipeline - Auto DevOps
Default GitLab CI Pipeline - Auto DevOpsRajith Bhanuka Mahanama
 
Pythonistaで音ゲーを作る
Pythonistaで音ゲーを作るPythonistaで音ゲーを作る
Pythonistaで音ゲーを作るmonochrojazz
 
Unity3D 5 의 ugui #1
Unity3D 5 의 ugui #1Unity3D 5 의 ugui #1
Unity3D 5 의 ugui #1Hong-Gi Joe
 
Efficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® GraphicsEfficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® GraphicsGael Hofemeier
 
Debugging Microservices - QCON 2017
Debugging Microservices - QCON 2017Debugging Microservices - QCON 2017
Debugging Microservices - QCON 2017Idit Levine
 
Introducing GitLab (June 2018)
Introducing GitLab (June 2018)Introducing GitLab (June 2018)
Introducing GitLab (June 2018)Noa Harel
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-renderingmistercteam
 
Khronos Munich 2018 - Halcyon and Vulkan
Khronos Munich 2018 - Halcyon and VulkanKhronos Munich 2018 - Halcyon and Vulkan
Khronos Munich 2018 - Halcyon and VulkanElectronic Arts / DICE
 
OIT to Volumetric Shadow Mapping, 101 Uses for Raster-Ordered Views using Dir...
OIT to Volumetric Shadow Mapping, 101 Uses for Raster-Ordered Views using Dir...OIT to Volumetric Shadow Mapping, 101 Uses for Raster-Ordered Views using Dir...
OIT to Volumetric Shadow Mapping, 101 Uses for Raster-Ordered Views using Dir...Gael Hofemeier
 
Taking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next GenerationTaking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next GenerationGuerrilla
 

What's hot (15)

GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
 
Battlelog - Building scalable web sites with tight game integration
Battlelog - Building scalable web sites with tight game integrationBattlelog - Building scalable web sites with tight game integration
Battlelog - Building scalable web sites with tight game integration
 
Git undo
Git undoGit undo
Git undo
 
Default GitLab CI Pipeline - Auto DevOps
Default GitLab CI Pipeline - Auto DevOpsDefault GitLab CI Pipeline - Auto DevOps
Default GitLab CI Pipeline - Auto DevOps
 
GIT INTRODUCTION
GIT INTRODUCTIONGIT INTRODUCTION
GIT INTRODUCTION
 
Pythonistaで音ゲーを作る
Pythonistaで音ゲーを作るPythonistaで音ゲーを作る
Pythonistaで音ゲーを作る
 
Unity3D 5 의 ugui #1
Unity3D 5 의 ugui #1Unity3D 5 의 ugui #1
Unity3D 5 의 ugui #1
 
Efficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® GraphicsEfficient Rendering with DirectX* 12 on Intel® Graphics
Efficient Rendering with DirectX* 12 on Intel® Graphics
 
Debugging Microservices - QCON 2017
Debugging Microservices - QCON 2017Debugging Microservices - QCON 2017
Debugging Microservices - QCON 2017
 
Introducing GitLab (June 2018)
Introducing GitLab (June 2018)Introducing GitLab (June 2018)
Introducing GitLab (June 2018)
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-rendering
 
Khronos Munich 2018 - Halcyon and Vulkan
Khronos Munich 2018 - Halcyon and VulkanKhronos Munich 2018 - Halcyon and Vulkan
Khronos Munich 2018 - Halcyon and Vulkan
 
OpenVINO introduction
OpenVINO introductionOpenVINO introduction
OpenVINO introduction
 
OIT to Volumetric Shadow Mapping, 101 Uses for Raster-Ordered Views using Dir...
OIT to Volumetric Shadow Mapping, 101 Uses for Raster-Ordered Views using Dir...OIT to Volumetric Shadow Mapping, 101 Uses for Raster-Ordered Views using Dir...
OIT to Volumetric Shadow Mapping, 101 Uses for Raster-Ordered Views using Dir...
 
Taking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next GenerationTaking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next Generation
 

Similar to Best Practices for Tuning U-SQL Performance

Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Michael Rys
 
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)Michael Rys
 
Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Michael Rys
 
Taming the Data Science Monster with A New ‘Sword’ – U-SQL
Taming the Data Science Monster with A New ‘Sword’ – U-SQLTaming the Data Science Monster with A New ‘Sword’ – U-SQL
Taming the Data Science Monster with A New ‘Sword’ – U-SQLMichael Rys
 
U-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for DevelopersU-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for DevelopersMichael Rys
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineChester Chen
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Data Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsData Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsIke Ellis
 
Using existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsMicrosoft Tech Community
 
Implementing the Databese Server session 02
Implementing the Databese Server session 02Implementing the Databese Server session 02
Implementing the Databese Server session 02Guillermo Julca
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analyticsIke Ellis
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptxIke Ellis
 
Azure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveAzure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveIlyas F ☁☁☁
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesDoiT International
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveWill Du
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 

Similar to Best Practices for Tuning U-SQL Performance (20)

Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
 
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
Tuning and Optimizing U-SQL Queries (SQLPASS 2016)
 
Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)
 
Taming the Data Science Monster with A New ‘Sword’ – U-SQL
Taming the Data Science Monster with A New ‘Sword’ – U-SQLTaming the Data Science Monster with A New ‘Sword’ – U-SQL
Taming the Data Science Monster with A New ‘Sword’ – U-SQL
 
U-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for DevelopersU-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for Developers
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Data Modeling on Azure for Analytics
Data Modeling on Azure for AnalyticsData Modeling on Azure for Analytics
Data Modeling on Azure for Analytics
 
Using existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analyticsUsing existing language skillsets to create large-scale, cloud-based analytics
Using existing language skillsets to create large-scale, cloud-based analytics
 
SQLServer Database Structures
SQLServer Database Structures SQLServer Database Structures
SQLServer Database Structures
 
Implementing the Databese Server session 02
Implementing the Databese Server session 02Implementing the Databese Server session 02
Implementing the Databese Server session 02
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
 
Data modeling trends for analytics
Data modeling trends for analyticsData modeling trends for analytics
Data modeling trends for analytics
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
 
Azure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveAzure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep Dive
 
AWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL QueriesAWS Athena vs. Google BigQuery for interactive SQL Queries
AWS Athena vs. Google BigQuery for interactive SQL Queries
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
Ten tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache HiveTen tools for ten big data areas 04_Apache Hive
Ten tools for ten big data areas 04_Apache Hive
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 

More from Michael Rys

Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Michael Rys
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Michael Rys
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Michael Rys
 
Big Data Processing with Spark and .NET - Microsoft Ignite 2019
Big Data Processing with Spark and .NET - Microsoft Ignite 2019Big Data Processing with Spark and .NET - Microsoft Ignite 2019
Big Data Processing with Spark and .NET - Microsoft Ignite 2019Michael Rys
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Michael Rys
 
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Michael Rys
 
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...Michael Rys
 
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...Michael Rys
 
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)Michael Rys
 
Killer Scenarios with Data Lake in Azure with U-SQL
Killer Scenarios with Data Lake in Azure with U-SQLKiller Scenarios with Data Lake in Azure with U-SQL
Killer Scenarios with Data Lake in Azure with U-SQLMichael Rys
 
ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)Michael Rys
 
U-SQL Learning Resources (SQLBits 2016)
U-SQL Learning Resources (SQLBits 2016)U-SQL Learning Resources (SQLBits 2016)
U-SQL Learning Resources (SQLBits 2016)Michael Rys
 
U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)Michael Rys
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)Michael Rys
 
U-SQL Query Execution and Performance Basics (SQLBits 2016)
U-SQL Query Execution and Performance Basics (SQLBits 2016)U-SQL Query Execution and Performance Basics (SQLBits 2016)
U-SQL Query Execution and Performance Basics (SQLBits 2016)Michael Rys
 
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)Michael Rys
 
U-SQL Does SQL (SQLBits 2016)
U-SQL Does SQL (SQLBits 2016)U-SQL Does SQL (SQLBits 2016)
U-SQL Does SQL (SQLBits 2016)Michael Rys
 

More from Michael Rys (20)

Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...
 
Big Data Processing with Spark and .NET - Microsoft Ignite 2019
Big Data Processing with Spark and .NET - Microsoft Ignite 2019Big Data Processing with Spark and .NET - Microsoft Ignite 2019
Big Data Processing with Spark and .NET - Microsoft Ignite 2019
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platfor...
 
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R co...
 
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
U-SQL Killer Scenarios: Custom Processing, Big Cognition, Image and JSON Proc...
 
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
U-SQL Killer Scenarios: Taming the Data Science Monster with U-SQL and Big Co...
 
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
The Road to U-SQL: Experiences in Language Design (SQL Konferenz 2017 Keynote)
 
Killer Scenarios with Data Lake in Azure with U-SQL
Killer Scenarios with Data Lake in Azure with U-SQLKiller Scenarios with Data Lake in Azure with U-SQL
Killer Scenarios with Data Lake in Azure with U-SQL
 
ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)
 
U-SQL Learning Resources (SQLBits 2016)
U-SQL Learning Resources (SQLBits 2016)U-SQL Learning Resources (SQLBits 2016)
U-SQL Learning Resources (SQLBits 2016)
 
U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)U-SQL Federated Distributed Queries (SQLBits 2016)
U-SQL Federated Distributed Queries (SQLBits 2016)
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)
 
U-SQL Query Execution and Performance Basics (SQLBits 2016)
U-SQL Query Execution and Performance Basics (SQLBits 2016)U-SQL Query Execution and Performance Basics (SQLBits 2016)
U-SQL Query Execution and Performance Basics (SQLBits 2016)
 
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
U-SQL User-Defined Operators (UDOs) (SQLBits 2016)
 
U-SQL Does SQL (SQLBits 2016)
U-SQL Does SQL (SQLBits 2016)U-SQL Does SQL (SQLBits 2016)
U-SQL Does SQL (SQLBits 2016)
 

Recently uploaded

CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 

Recently uploaded (20)

CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 

Best Practices for Tuning U-SQL Performance

  • 1. Best Practices and Performance Tuning of U-SQL in Azure Data Lake Michael Rys Principal Program Manager, Microsoft @MikeDoesBigData, usql@microsoft.com
  • 2. Session Objectives And Takeaways • Session Objective(s): • Understand some common best practices to make your U-SQL scale • Understand the U-SQL Query execution • Be able to understand and improve U-SQL job performance/cost • Be able to understand and improve the U-SQL query plan • Know how to write more efficient U-SQL scripts • Key Takeaways: • U-SQL is designed for scale-out • U-SQL provides scalable execution of user code • U-SQL has a tool set that can help you analyze and improve your scalability, cost and performance
  • 3. Agenda • Job Execution Experience and Investigations Query Execution Stage Graph Dryad crash course Job Metrics Resource Planning Job Parallelism and AUs • Job Performance Analysis Analyze the critical path Heat Map Critical Path Data Skew • Cost Optimizations AU usage AU modeller • Tuning / Optimizations Data Loading Tips Data Partitioning (Files and Tables) INSERT optimizations Partition Elimination Predicate Pushing Column Pruning Addressing Data Skew Some Data Hints UDOs can be evil
  • 5. Expression-flow Programming Style • Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model. • Execution plan that is optimized out-of-the-box and w/o user intervention. • Per job and user driven level of parallelization. • Detail visibility into execution steps, for debugging. • Heatmap like functionality to identify performance bottlenecks.
  • 6. Job Scheduler & Queue Front-EndService Vertex Execution Consume Local Storage Data Lake Store Author Plan Compiler Optimizer Vertexes running in YARN Containers U-SQL Runtime Optimized Plan Vertex Scheduling On containers Job Manager USQL Catalog Overall U-SQL Batch Job Execution Lifetime Stage Codegen (C++/C# Compilation)
  • 7. Finalization Phase Execution Phase Queueing Preparation Phase Job Scheduler & Queue Front-EndService Vertex Execution Consume Local Storage Data Lake Store Author Plan Compiler Optimizer Vertexes running in YARN Containers U-SQL Runtime Optimized Plan Vertex Scheduling On containers Job Manager USQL Catalog Overall U-SQL Batch Job Execution Lifetime Vertex Codegen (C++/C# Compilation)
  • 8. U-SQL Preparation into Vertex (.NET example) C# C++ Algebra Additional non-dll files & Deployed resources managed dll native dll Compilation output (in job folder) Compilation and Optimization U-SQL Metadata Service Deployed to Vertices REFERENCE ASSEMBLY ADLS (or WASB) DEPLOY RESOURCE System files (built-in Runtimes, Core DLLs, OS)
  • 9. Job Scheduling and Execution
  • 11. Maximal degree of Parallelism 1000 (ADL AUs) Work composed of 12K Vertices 1 ADL AU currently maps to a VM with 2 cores and 6 GB of memory
  • 12. U-SQL Query Execution Physical plans vs. Dryad stage graph…
  • 13. Stage Details 6307 Pieces of work: Same code applied to different data partitions AVG Vertex execution time 4.4 Billion rows Data Read & Written Super Vertex = Stage
  • 15. U-SQL Query Execution Redefinition of big-data: massive data is so 2000, now it’s ginormous data  17
  • 16. AU allocation vs Job Parallelism/Scale Out • Job’s inherent scale out is determined by • Size of input data plus data partitioning • Operations • Cross join vs parallelizable equi-join • Order By: Global order vs local order • REDUCE ALL vs REDUCE ON part_key • Hints: • Data size hints • Partition plan hint (works only if the plan space contains an option for that plan) • AU allocation does not specify Job Scale Out but specifies upper limit • Increasing the AU does not make your job necessarily faster • If you over-specify AU allocation, you will pay for overallocation!!!
  • 17. U-SQL Performance Analysis Analyze the critical path, heat maps, playback, and runtime metrics on every vertex… 19
  • 19. Demo: Tuning for Cost Efficiency
  • 20. Dips down to 1 active vertex at these times
  • 21. Smallest estimated time when given 2425 ADLAUs 1410 seconds = 23.5 minutes
  • 22. Model with 100 ADLAUs 8709 seconds = 145.5 minutes
  • 23. Tuning for Performance – Data Partitioning during loading, storage and processing
  • 24. Data Storage • Files • Tables • Unstructured Data in files • Files are split into 250MB extents • 4 extents per vertex -> 1GB per vertex • Different file content formats: • Splittable formats are parallelizable: • row-oriented (CSV etc) • Where data does not span extents • Non-splittable formats cannot be parallelized: • XML/JSON • Have to be processed in single vertex extractor with atomicFileProcessing=true. • Use File Sets to provide semantic partition pruning • Tables • Clustered Index (row-oriented) storage • Vertical and horizontal partitioning • Statistics for the optimizer (CREATE STATISTICS) • Native scalar value serialization
  • 26. Real-Time Data Ingestion Promises of the Data Lake: 1. Store everything and anything in its original form, figure it out later 2. BYO compute (Hadoop, ADLA, other) to a shared collection of data 3. Massive scale, no limits, never run out of storage No 1 is where we see most customers get into trouble!
  • 27. Real-Time Data Ingestion Many customers take the “original form” a little too literally  “The original form of my data is the event… I need to keep it” But • Discrete database transactions have been occurring forever, one row at a time • We store them in singular (table) structures for efficiency • We can still find a discrete record in its absolute original form • We should apply similar optimizations to Data Lakes!!!
  • 28. Demo: Querying small files vs large files
  • 29. Real-Time Data Ingestion Moral of the story… • Small files are suboptimal in every scenario • Consider alternatives to concatenate into larger files: • Offline outside of Azure • Event Hubs Capture • Stream Analytics • … or ADLA fast file sets to compact most recent deltas
  • 30. // Unstructured Files (24 hours daily log impressions) @Impressions = EXTRACT ClientId int, Market string, OS string, ... FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif" FROM @"wasb://ads@wcentralus/2015/10/30/{Market}_{*}.nif" ; // … // Filter to by Market @US = SELECT * FROM @Impressions WHERE Market == "en" ; Partition Elimination • Even with unstructured files! • Leverage Virtual Columns (Named) • Avoid unnamed {*} • WHERE predicates on named virtual columns • That binds the PE range during compilation time • Named virtual columns without predicate = warning • Design directories/files with PE in mind • Design for elimination early in the tree, not in the leaves Extracts all files in the folder Post filter = pay I/O cost to drop most data PE pushes this predicate to the EXTRACT EXTRACT now only reads “en” files! en_10.0.nif en_8.1.nif de_10.0.nif jp_7.0.nif de_8.1.nif ../2015/10/30/ … U-SQL Optimizations Partition Elimination – Unstructured Files
  • 31. FileSet optimizations • Use SET @@FeaturePreviews="InputFileGrouping:on"; to group small files into single vertex (instead of one file per vertex) • Also use to speed up preparation time and produce faster code: SET @@FeaturePreviews = "FileSetV2Dot5:on,AsyncCompilerStoreAccess:on"; (will be default with next refresh and not needed afterwards)
  • 32. Cost of small files with custom extractor • Custom code deployment during vertex creation dominates vertex execution time over actual work! Vertex creation time: Deploy Vertex content Actual vertex execution work!
  • 34. How many clicks per domain? @rows = SELECT Domain, SUM(Clicks) AS TotalClicks FROM @ClickData GROUP BY Domain;
  • 35. Read Read Partition Partition Full Agg Write Full Agg Write Full Agg Write Read Partition Partial Agg Partial Agg Partial Agg CNN, FB, WH EXTENT 1 EXTENT 2 EXTENT 3 CNN, FB, WH CNN, FB, WH U-SQL Table Distributed by Domain Read Read Full Agg Full Agg Write Write Read Full Agg Write FB EXTENT 1 WH EXTENT 2 CNN EXTENT 3 Expensive! Files
  • 36. Demo: Scaling out with Distributions
  • 37. Table Partitioning and Distribution Fine grained (horizontal) partitioning/distribution • Distributes within a partition (together with clustering) to keep same data values close • Choose for: • Join alignment, partition size, filter selectivity, partition elimination Coarse grained (vertical) partitioning • Based on Partition keys • Partition is addressable in language • Query predicates will allow partition pruning • Choose for data life cycle management, partition elimination Distribution Scheme When to use? HASH(keys) Automatic Hash for fast item lookup DIRECT HASH(id) Exact control of hash bucket value RANGE(keys) Keeps ranges together ROUND ROBIN To get equal distribution (if others give skew)
  • 38. Partitions, Distributions and Clusters TABLE T ( id … , C … , date DateTime, … , INDEX i CLUSTERED (id, C) PARTITIONED BY (date) DISTRIBUTED BY HASH(id) INTO 4 ) PARTITION (@date1) PARTITION (@date2) PARTITION (@date3) HASH DISTRIBUTION 1 HASH DISTRIBUTION 2 HASH DISTRIBUTION 3 HASH DISTRIBUTION 1 HASH DISTRIBUTION 1 HASH DISTRIBUTION 2 HASH DISTRIBUTION 3 HASH DISTRIBUTION 4 HASH DISTRIBUTION 3 id1C1 id5C2 id9C3 id1C1 id1C2 id2C4 id6C5 id2C4 id3C6 id7C6 id7C7 id11C8 id7C7 id2C5 id3C6 id4C9 id4C10 id5C1 id9C3 /catalog/…/tables/Guid(T)/ Guid(T.p1).ss Guid(T.p2).ss Guid(T.p3).ss LOGICAL PHYSICAL
  • 39. Benefits of Tables Benefits of Table clustering and distribution • Faster lookup of data provided by distribution and clustering when right distribution/cluster is chosen • Data distribution provides better localized scale out • Used for filters, joins and grouping Benefits of Table partitioning • Provides data life cycle management (“expire” old partitions): Partition on date/time dimension • Partial re-computation of data at partition level • Query predicates can provide partition elimination Do not use when… • No filters, joins and grouping • No reuse of the data for future queries If in doubt: use sampling (e.g., SAMPLE ANY(x)) and test.
  • 40. Benefits of Distribution in Tables Benefits • Design for most frequent/costly queries • Manage data skew in partition/table • Manage parallelism in querying (by number of distributions) • Manage minimizing data movement in joins • Provide distribution seeks and range scans for query predicates (distribution bucket elimination) Distribution in tables is mandatory, chose according to desired benefits
  • 41. Benefits of Clustered Index in Distribution Benefits • Design for most frequent/costly queries • Manage data skew in distribution bucket • Provide locality of same data values • Provide seeks and range scans for query predicates (index lookup) Clustered index in tables is mandatory, chose according to desired benefits Pro Tip: Distribution keys should be prefix of Clustered Index keys: Especially for RANGE distribution Optimizer will make use of global ordering then: If you make the RANGE distribution key a prefix of the index key, U-SQL will repartition on demand to align any UNIONALLed or JOINed tables or partitions! Split points of table distribution partitions are chosen independently, so any partitioned table can do UNION ALL in this manner if the data is to be processed subsequently on the distribution key.
  • 42. Benefits of Partitioned Tables Benefits • Partitions are addressable • Enables finer-grained data lifecycle management at partition level • Manage parallelism in querying by number of partitions • Query predicates provide partition elimination • Predicate has to be constant-foldable Use partitioned tables for • Managing large amounts of incrementally growing structured data • Queries with strong locality predicates • point in time, for specific market etc • Managing windows of data • provide data for last x months for processing
  • 43. Partitioned tables Use partitioned tables for querying parts of large amounts of incrementally growing structured data Get partition elimination optimizations with the right query predicates Creating partition table CREATE TABLE PartTable(id int, event_date DateTime, lat float, long float , INDEX idx CLUSTERED (vehicle_id ASC) PARTITIONED BY(event_date) DISTRIBUTED BY HASH (vehicle_id) INTO 4); Creating partitions DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00,00,00,00,DateTimeKind.Utc); DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00,DateTimeKind.Utc); ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2); Loading data into partitions dynamically DECLARE @date1 DateTime = DateTime.Parse("2014-09-14"); DECLARE @date2 DateTime = DateTime.Parse("2014-09-16"); INSERT INTO vehiclesP ON INTEGRITY VIOLATION IGNORE SELECT vehicle_id, event_date, lat, long FROM @data WHERE event_date >= @date1 AND event_date <= @date2; Filters and inserts clean data only, ignore “dirty” data Loading data into partitions statically ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@baddate); INSERT INTO vehiclesP ON INTEGRITY VIOLATION MOVE TO @baddate SELECT vehicle_id, lat, long FROM @data WHERE event_date >= @date1 AND event_date <= @date2; Filters and inserts clean data only, put “dirty” data into special partition
  • 44. Avoid Table Fragmentation and Over- partitioning! What is Table Fragmentation • ADLS is an append-only store! • Every INSERT statement is creating a new file (INSERT fragment) Why is it bad? • Every INSERT fragment contains data in its own distribution buckets, thus query processing loses ability to get “localized” fast access • Query generation has to read from many files now -> slow preparation phase that may time out. • Reading from too many files is disallowed: Current LIMIT: 3000 table partitions and INSERT fragments per job! What if I have to add data incrementally? • Batch inserts into table • Use ALTER TABLE REBUILD/ALTER TABLE REBUILD PARTITION regularly to reduce fragmentation and keep performance.
  • 45. @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView OPTION(SKEWFACTOR(Query)=0.5) ; // Q1(A,B) @Sessions = SELECT ClientId, Query, SUM(PageClicks) AS Clicks FROM @Impressions GROUP BY Query, ClientId ; // Q2(B) @Display = SELECT * FROM @Sessions INNER JOIN @Campaigns ON @Sessions.Query == @Campaigns.Query ; Input must be distributed on: (Query) Input must be distributed on: (Query) or (ClientId) or (Query, ClientId) Optimizer wants to distribute only once But Query could be skewed Data Distribution • Re-Distributing is very expensive • Many U-SQL operators can handle multiple distribution choices • Optimizer bases decision upon estimations Wrong statistics may result in worse query performance U-SQL Optimizations Distributions – Minimize (re)partitions
  • 46. // Unstructured (24 hours daily log impressions) @Huge = EXTRACT ClientId int, ... FROM @"wasb://ads@wcentralus/2015/10/30/{*}.nif" ; // Small subset (ie: ForgetMe opt out) @Small = SELECT * FROM @Huge WHERE Bing.ForgetMe(x,y,z) OPTION(ROWCOUNT=500) ; // Result (not enough info to determine simple Broadcast join) @Remove = SELECT * FROM Bing.Sessions INNER JOIN @Small ON Sessions.Client == @Small.Client ; Broadcast JOIN right? Broadcast is now a candidate. Wrong statistics may result in worse query performance => CREATE STATISTICS Optimizer has no stats this is small... U-SQL Optimizations Distribution - Cardinality
  • 47. Write the Right Query! • Do I really need a CROSS JOIN? Example: Get a daily “expansion” of my fact validity interval Approach 1 (standard text-book solution): Join fact table with daily date dimension table using daily dimension between begin and end of fact => Does not scale: CROSS JOIN on single node Approach 2: Cross apply daily expansion expression to each fact row => Scales very well since only depends on a row
  • 48. Tuning for Performance – Handling Data Skew
  • 49. What is Data Skew? • Some data points are much more common than others • data may be distributed such that all rows that match a certain key go to a single vertex • imbalanced execution, vertex time out. 0 5,000,000 10,000,000 15,000,000 20,000,000 25,000,000 30,000,000 35,000,000 40,000,000 Population by State
  • 50. Low Distinctiveness Keys • Keys with small selectivity can lead to large vertices even without skew @rows = SELECT Gender, AGG<MyAgg>(…) AS Result FROM @HugeInput GROUP BY Gender; Gender==Male Gender==Female @HugeInput Vertex 0 Vertex 1
  • 51. Why is this a problem? • Vertexes have a 5 hour runtime limit! • Your UDO or join may excessively allocate memory. • Your memory usage may not be obvious due to garbage collection
  • 52. Addressing Data Skew/Low distinctiveness • Improve data partition sizes: • Find more fine grained keys, eg, states and congressional districts or ZIP codes • If no fine grained keys can be found or are too fine-grained: use ROUND ROBIN distribution • Write queries that can handle data skew: • Use filters that prune skew out early • Use Data Hints to identify skew and “low distinctness” in keys: • SKEWFACTOR(columns) = x provides hint that given columns have a skew factor x between 0 (no skew) and 1 (very heavy skew)) • DISTINCTVALUE(columns) = n let’s you specify how many distinct values the given columns have (n>1) • Implement aggregation/reducer recursively if possible
  • 53. Non-Recursive vs Recursive SUM 1 2 3 4 5 6 7 8 36 1 2 3 4 5 6 7 8 6 15 15 36
  • 54. U-SQL Partitioning during Processing Data Skew
  • 55. // Metrics per domain @Metric = REDUCE @Impressions ON UrlDomain USING new Bing.TopNReducer(count:10) ; // … Inherent Data Skew [SqlUserDefinedReducer(IsRecursive = true)] public class TopNReducer : IReducer { public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output) { // Compute TOP(N) per group // … } } Recursive • Allow multi-stage aggregation trees • Requires same schema (input => output) • Requires associativity: • R(x, y) = R( R(x), R(y) ) • Default = non-recursive • User code has to honor recursive semantics www.bing.com brought to a single vertex U-SQL Partitioning Data Skew – Recursive Reducer
  • 56. Tuning for Performance – User Defined Operators
  • 57. // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute sessions @Sessions = REDUCE @Impressions ON Client, Market READONLY Market USING new Bing.SessionReducer(range : 30) ; // Users metrics @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" ; // … Microsoft Confidential U-SQL Optimizations Predicate pushing – UDO pass-through columns
  • 58. // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions READONLY Market PRODUCE Client, Market, Header string USING new Bing.HtmlProcessor() ; @Sessions = REDUCE @Impressions ON Client, Market READONLY Market USING new Bing.SessionReducer(range : 30) ; // Users metrics @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" ; Microsoft Confidential U-SQL Optimizations Predicate pushing – UDO row level processors public abstract class IProcessor : IUserDefinedOperator { /// <summary/> public abstract IRow Process(IRow input, IUpdatableRow output); } public abstract class IReducer : IUserDefinedOperator { /// <summary/> public abstract IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output); }
  • 59. // Bing impressions @Impressions = SELECT Client, Market, Html FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions PRODUCE Client, Market, Header string USING new Bing.HtmlProcessor() ; // Users metrics @Metrics = SELECT * FROM @Sessions WHERE Market == "en-us" && Header.Contains("microsoft.com") AND Header.Contains("microsoft.com") ; U-SQL Optimizations Predicate pushing – relational vs. C# semantics
  • 60. // Bing impressions @Impressions = SELECT * FROM searchDM.SML.PageView(@start, @end) AS PageView ; // Compute page views @Impressions = PROCESS @Impressions PRODUCE * REQUIRED ClientId, HtmlContent(Header, Footer) USING new Bing.HtmlProcessor() ; // Users metrics @Metrics = SELECT ClientId, Market, Header FROM @Sessions WHERE Market == "en-us" ; U-SQL Optimizations Column Pruning and dependencies C H M C H M C H M Column Pruning • Minimize I/O (data shuffling) • Minimize CPU (complex processing, html) • Requires dependency knowledge: • R(D*) = Input ( Output ) • Default no pruning • User code has to honor reduced columns A B C D E F G J KH I … M … 1000
  • 61. UDO Tips and Warnings • Tips when Using UDOs: • READONLY clause to allow pushing predicates through UDOs • REQUIRED clause to allow column pruning through UDOs • PRESORT on REDUCE if you need global order • Hint Cardinality if it does choose the wrong plan • Warnings and better alternatives: • Use SELECT with UDFs instead of PROCESS • Use User-defined Aggregators instead of REDUCE • Learn to use Windowing Functions (OVER expression) • Good use-cases for PROCESS/REDUCE/COMBINE: • The logic needs to dynamically access the input and/or output schema. E.g., create a JSON doc for the data in the row where the columns are not known apriori. • Your UDF based solution creates too much memory pressure and you can write your code more memory efficient in a UDO • You need an ordered Aggregator or produce more than 1 row per group
  • 62. Resources • Blogs and community page: • http://usql.io (U-SQL Github) • http://blogs.msdn.microsoft.com/azuredatalake/ • http://blogs.msdn.microsoft.com/mrys/ • https://channel9.msdn.com/Search?term=U-SQL#ch9Search • Documentation, presentations and articles: • http://aka.ms/usql_reference • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide • https://docs.microsoft.com/en-us/azure/data-lake-analytics/ • https://msdn.microsoft.com/en-us/magazine/mt614251 • https://msdn.microsoft.com/magazine/mt790200 • http://www.slideshare.net/MichaelRys • Getting Started with R in U-SQL • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions • ADL forums and feedback • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake • http://stackoverflow.com/questions/tagged/u-sql • http://aka.ms/adlfeedback Continue your education at Microsoft Virtual Academy online.
  • 63. Thank you very much for your attention!

Editor's Notes

  1. This slide is required. Do NOT delete. This should be the first slide after your Title Slide. This is an important year and we need to arm our attendees with the information they can use to Grow Share! Please ensure that your objectives are SMART (defined below) and that they will enable them to go in and win against the competition to grow share. If you have questions, please contact your Track PM for guidance. We have also posted guidance on writing good objectives, out on the Speaker Portal (https://www.mytechready.com).   This slide should introduce the session by identifying how this information helps the attendee, partners and customers be more successful. Why is this content important? This slide should call out what’s important about the session (sort of the why should we care, why is this important and how will it help our customers/partners be successful) as well as the key takeaways/objectives associated with the session. Call out what attendees will be able to execute on using the information gained in this session. What will they be able to walk away from this session and execute on with their customers. Good Objectives should be SMART (specific, measurable, achievable, realistic, time-bound). Focus on the key takeaways and why this information is important to the attendee, our partners and our customers. Each session has objectives defined and published on www.mytechready.com, please work with your Track PM to call these out here in the slide deck. If you have questions, please contact your Track PM. See slide 5 in this template for a complete list of Tracks and TPMs.
  2. https://github.com/Azure/usql/tree/master/Examples/AmbulanceDemos/AmbulanceDemos/5-Ambulance-StreamSets-PartitionedTables
  3. https://github.com/Azure/usql/tree/master/Examples/AmbulanceDemos/AmbulanceDemos/5-Ambulance-StreamSets-PartitionedTables
  4. https://github.com/Azure/usql/tree/master/Examples/AmbulanceDemos/AmbulanceDemos/5-Ambulance-StreamSets-PartitionedTables
  5. https://github.com/Azure/usql/tree/master/Examples/AmbulanceDemos/AmbulanceDemos/5-Ambulance-StreamSets-PartitionedTables