U-SQL - Azure Data Lake Analytics for Developers

Michael Rys, Microsoft
@MikeDoesBigData, {mrys, usql}@microsoft.com

Characteristics of Big Data analytics
• Sample Use Cases
• Digital Crime Forensics – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks, by analyzing log records with complex custom algorithms
• Image Processing – Large-scale image feature extraction and classification using custom code
• Shopping Recommendations – Complex pattern analysis and prediction over shopping records using proprietary algorithms
Requires processing of any type of data
Allows use of custom algorithms
Scales efficiently to any size

Status quo: SQL for Big Data
• Declarativity does scaling and parallelization for you
• Extensibility is bolted on and not “native”
• Difficult to work with anything other than structured data
• Difficult to extend with custom code

Status quo: Programming languages for Big Data
• Extensibility through custom code is “native”
• Declarativity is bolted on and not “native”
• User often has to care about scale and performance
• SQL is second class, embedded within strings
• Often no code reuse/sharing across queries

Why U-SQL?
Get benefits of both!
• Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative code (C#)
• Local and remote queries
• Increase productivity and agility from Day 1 and at Day 100 for YOU!
• Declarativity and extensibility are equally native to the language

The origins of U-SQL
SCOPE – Microsoft’s internal Big Data language
• SQL and C# integration model
• Optimization and scaling model
• Runs 100,000s of jobs daily
Hive
• Complex data types (Maps, Arrays)
• Data format alignment for text files
T-SQL/ANSI SQL
• Many of the SQL capabilities (windowing functions, metadata model etc.)

U-SQL language philosophy
Declarative query and transformation language
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL analytics functions
• Optimizable, scalable
Expression-flow programming style
• Easy-to-use functional lambda composition
• Composable, globally optimizable
Operates on unstructured and structured data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined aggregators (C#)
• User-defined operators (UDO) (C#)
U-SQL provides the Parallelization and Scale-out Framework for User code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Federated query across distributed data sources
REFERENCE ASSEMBLY MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt"
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt"
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
, AGG<MyAgg.MySum>(o.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;

Expression-flow programming style
• Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model.
• Execution plan that is optimized out-of-the-box and without user intervention.
• Per-job and user-driven level of parallelization.
• Detailed visibility into execution steps, for debugging.
• Heatmap-like functionality to identify performance bottlenecks.

Query data where it lives
• Easily query data in multiple Azure data stores without moving it to a single store
• Avoid moving large amounts of data across the network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
• Filters
• Joins
[Diagram: a U-SQL query in Azure Data Lake Analytics sends queries to Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage]

Unstructured files
• Schema on read
• Write to file
• Built-in and custom extractors and outputters
• ADL Storage and Azure Blob Storage
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in extractors: Csv, Tsv, Text with lots of options
• Custom extractors: e.g., JSON, XML, and so on
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in outputters: Csv, Tsv, Text
• Custom outputters: JSON, XML, and so on
Filepath URIs
• Relative URI to default ADL Storage account: "/filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"

U-SQL extensibility
Extend U-SQL with C#/.NET
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)

Managing
Assemblies
Create assemblies
Reference assemblies
Enumerate assemblies
Drop assemblies
CREATE ASSEMBLY db.assembly FROM @path;
CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
REFERENCE ASSEMBLY db.assembly;
• Referencing .NET Framework Assemblies
• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)
• All provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
• Add all other .NET Framework Assemblies with:
REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies
• PowerShell command
• U-SQL Studio Server Explorer
DROP ASSEMBLY db.assembly;

Assembly Dependencies
• Assembly must be registered to be referenced
• All Assemblies needed for compilation must be referenced in script
• All Assemblies needed at runtime either
• Need to be referenced in script, or
• Need to be registered with the assembly as additional files
• Metadata Service does NOT enforce dependencies
• Visual Studio Extension provides support for dependency management
File sets
• Simple patterns
• Virtual columns
• Only on EXTRACT for now
Simple pattern language on filename and path
DECLARE @pattern string =
"/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix
• Wildcards the filename
• Limits on number of files (between 800 and 3000)
Virtual columns
EXTRACT
name string
, suffix string // virtual column
, date DateTime // virtual column
FROM @pattern
• Refer to virtual columns in query to get partition elimination
• Virtual columns must be referenced in the query for DateTime columns and when no wildcard has been given

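The file-set binding above can be modeled outside U-SQL. The following is a minimal Python sketch, not U-SQL and not the engine's actual implementation: it translates the slide's pattern into a regular expression and derives the virtual columns date and suffix from each path. The helper names (fileset_to_regex, virtual_columns), the sample paths, and the assumption that all three date parts appear in the pattern are hypothetical.

```python
import re
from datetime import datetime

def fileset_to_regex(pattern):
    """Translate a file-set pattern into a regex (illustrative model only)."""
    parts = []
    # Split on {...} tokens, keeping the tokens themselves in the result.
    for token in re.split(r"(\{[^}]*\})", pattern):
        if token == "{*}":
            parts.append(r"[^/]*")                      # wildcard, binds no column
        elif token.startswith("{date:"):
            fmt = token[6:-1]                           # yyyy, MM or dd
            width = {"yyyy": 4, "MM": 2, "dd": 2}[fmt]
            parts.append(rf"(?P<date_{fmt}>\d{{{width}}})")
        elif token.startswith("{"):
            parts.append(rf"(?P<{token[1:-1]}>[^/.]+)")  # named virtual column
        else:
            parts.append(re.escape(token))              # literal path text
    return re.compile("^" + "".join(parts) + "$")

def virtual_columns(pattern, path):
    """Return the virtual columns bound for one path, or None if it does not match."""
    match = fileset_to_regex(pattern).match(path)
    if match is None:
        return None
    groups = match.groupdict()
    columns = {k: v for k, v in groups.items() if not k.startswith("date_")}
    # Assumes the pattern contains all three date parts, as on the slide.
    columns["date"] = datetime(int(groups["date_yyyy"]),
                               int(groups["date_MM"]),
                               int(groups["date_dd"]))
    return columns

PATTERN = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"
PATHS = ["/input/2016/03/01/orders.csv",
         "/input/2016/03/02/orders.tsv",
         "/other/readme.txt"]

# A predicate on a virtual column lets whole files be skipped before reading them.
selected = []
for path in PATHS:
    columns = virtual_columns(PATTERN, path)
    if columns is not None and columns["date"] >= datetime(2016, 3, 2):
        selected.append(path)
```

In this model, filtering on date decides which files are opened at all, which is the idea behind partition elimination on virtual columns.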
Let’s do some SQL with U-SQL

@m CROSS APPLY EXPLODE(refs) AS Refs(r);
[Diagram: each row of @m(refs), for example "@me, @you" and "@him, @her", is expanded into one row per element in Refs(r): @me, @you, @him, @her]
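To make the row expansion concrete, here is a small Python sketch of what CROSS APPLY EXPLODE does to a rowset. It is a model of the semantics only, not U-SQL; the function name cross_apply_explode and the dictionary-based row representation are assumptions made for illustration.

```python
def cross_apply_explode(rows, column, alias):
    """Emit one output row per element of the array stored in `column`."""
    output = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)      # keep the original columns
            new_row[alias] = item    # add the exploded element
            output.append(new_row)
    return output

# Two input rows, each holding an array of references.
m = [{"refs": ["@me", "@you"]},
     {"refs": ["@him", "@her"]}]

exploded = cross_apply_explode(m, "refs", "r")
r_values = [row["r"] for row in exploded]
```

In this sketch a row whose array is empty contributes no output rows.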

U-SQL Joins
Join operators
• INNER JOIN
• LEFT or RIGHT or FULL OUTER JOIN
• CROSS JOIN
• SEMIJOIN
• Equivalent to IN subquery
• ANTISEMIJOIN
• Equivalent to NOT IN subquery
Notes
• ON clause comparisons need to be of the simple form rowset.column == rowset.column, or AND conjunctions of such simple equality comparisons
• If a comparand is not a column, wrap it into a column in a previous SELECT
• If the comparison operation is not ==, put it into the WHERE clause
• Turn the join into a CROSS JOIN if there is no equality comparison
Reason: Syntax calls out which joins are efficient
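The SEMIJOIN and ANTISEMIJOIN operators are the least familiar entries in the list, so here is a rough Python model of their semantics. It is not U-SQL; the function names, the sample rowsets, and the choice to keep rows from the left input are assumptions for illustration, and null handling is deliberately left out.

```python
def semijoin(left, right, key):
    """Keep left rows that have at least one match in right (like IN)."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]

def antisemijoin(left, right, key):
    """Keep left rows that have no match in right (like NOT IN)."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]

customers = [{"cid": 1, "name": "Ann"},
             {"cid": 2, "name": "Bob"},
             {"cid": 3, "name": "Cy"}]
orders = [{"oid": 10, "cid": 1},
          {"oid": 11, "cid": 1},
          {"oid": 12, "cid": 3}]

with_orders = [c["name"] for c in semijoin(customers, orders, "cid")]
without_orders = [c["name"] for c in antisemijoin(customers, orders, "cid")]
```

Unlike an inner join, the semijoin returns each matching left row once, however many matches it has on the other side, and no columns from the other side appear in the result.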

U-SQL Analytics
Windowing Expression
Window_Function_Call 'OVER' '('
[ Over_Partition_By_Clause ]
[ Order_By_Clause ]
[ Row_Clause ]
')'.
Window_Function_Call :=
Aggregate_Function_Call
| Analytic_Function_Call
| Ranking_Function_Call.
Windowing Aggregate Functions
ANY_VALUE, AVG, COUNT, MAX, MIN, SUM, STDEV, STDEVP, VAR, VARP
Analytics Functions
CUME_DIST, FIRST_VALUE, LAST_VALUE, PERCENTILE_CONT, PERCENTILE_DISC,
PERCENT_RANK, LEAD, LAG
Ranking Functions
DENSE_RANK, NTILE, RANK, ROW_NUMBER
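The three most common ranking functions differ only in how they treat ties. The following Python sketch models ROW_NUMBER, RANK, and DENSE_RANK over a partition with ascending order; it is not U-SQL, and the function name rank_rows and the sample data are hypothetical.

```python
from itertools import groupby

def rank_rows(rows, partition_by, order_by):
    """Add row_number, rank and dense_rank per partition, ordered ascending."""
    output = []
    ordered = sorted(rows, key=lambda r: (r[partition_by], r[order_by]))
    for _, partition in groupby(ordered, key=lambda r: r[partition_by]):
        row_number = rank = dense_rank = 0
        previous = object()  # sentinel that equals no real value
        for row in partition:
            row_number += 1
            if row[order_by] != previous:
                rank = row_number   # rank jumps past tied rows
                dense_rank += 1     # dense rank never leaves gaps
                previous = row[order_by]
            output.append({**row, "row_number": row_number,
                           "rank": rank, "dense_rank": dense_rank})
    return output

sales = [{"dept": "A", "amount": 10},
         {"dept": "A", "amount": 10},
         {"dept": "A", "amount": 20},
         {"dept": "B", "amount": 5}]

ranked = rank_rows(sales, "dept", "amount")
triples = [(r["row_number"], r["rank"], r["dense_rank"]) for r in ranked]
```

The two tied rows in department A share rank 1; the next row gets rank 3 but dense rank 2.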

“Top 5”s Surprises for SQL Users
AS is not as
• C# keywords and SQL keywords overlap
• Costly to make case-insensitive -> Better to build capabilities than tinker with syntax
= != ==
• Remember: C# expression language
null IS NOT NULL
• C# null comparisons are two-valued (there is no UNKNOWN result as in SQL)
PROCEDURES but no WHILE
No UPDATE nor MERGE
• Transform/Recook instead

Meta Data Object Model
[Diagram: ADLA Catalog, Database, and Schema form the containment hierarchy, with edges labeled [1,n], [1,n], and [0,n]. Objects shown: tables (with clustered index, partitions, and statistics), external tables, views, TVFs, procedures, table types, data sources, credentials, C# assemblies, and the C# functions, UDAggs, UDTs, extractors, reducers, processors, combiners, outputters, and appliers. Legend: abstract objects and user objects; edge types "Refers to", "Contains", and "Implemented and named by"; MD name and C# name.]

U-SQL Catalog
• Naming
• Default database and schema context: master.dbo
• Quote identifiers with []: [my table]
• Stores data in ADL Storage /catalog folder
• Discovery
• Visual Studio Server Explorer
• Azure Data Lake Analytics Portal
• SDKs and Azure PowerShell commands
• Sharing
• Within an Azure Data Lake Analytics account
• Securing
• Secured with AAD principals at catalog level (inherited from ADL Storage)

Create shareable data and code

Views and TVFs
• Views for simple cases
• TVFs for parameterization and most cases
Views
CREATE VIEW V AS EXTRACT…
CREATE VIEW V AS SELECT …
• Cannot contain user-defined objects (such as UDFs or UDOs)
• Will be inlined
Table-Valued Functions (TVFs)
CREATE FUNCTION F (@arg string = "default")
RETURNS @res [TABLE ( … )]
AS BEGIN … @res = … END;
• Provides parameterization
• One or more results
• Can contain multiple statements
• Can contain user-code (needs assembly reference)
• Will always be inlined
• Infers schema or checks against specified return schema

Procedures
Allows encapsulation of non-DDL scripts
CREATE PROCEDURE P (@arg string = "default")
AS
BEGIN
…;
OUTPUT @res TO …;
INSERT INTO T …;
END;
• Provides parameterization
• No result but writes into file or table
• Can contain multiple statements
• Can contain user code (needs assembly reference)
• Will always be inlined
• Cannot contain DDL (no CREATE, DROP)

Tables
• CREATE TABLE
• CREATE TABLE AS SELECT
CREATE TABLE T (col1 int
, col2 string
, col3 SQL.MAP<string,string>
, INDEX idx CLUSTERED (col1 ASC)
PARTITIONED BY HASH (col1)
);
• Structured Data
• Built-in Data types only (no UDTs)
• Clustered index (must be specified): row-oriented
• Fine-grained partitioning (must be specified):
• HASH, DIRECT HASH, RANGE, ROUND ROBIN
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;
CREATE TABLE T (INDEX idx CLUSTERED …) AS myTVF(DEFAULT);
• Infer the schema from the query
• Still requires index and partitioning

INSERT
• INSERT constant values
• INSERT from queries
• Multiple INSERTs
INSERT constant values
INSERT INTO T VALUES (1, "text",
new SQL.MAP<string,string>{{"key","value"}});
INSERT from queries
INSERT INTO T SELECT col1, col2, col3 FROM @rowset;
Multiple INSERTs into same table
• Is supported
• Generates separate file per insert in physical storage:
• Can lead to performance degradation
• Recommendations:
• Try to avoid small inserts
• Rebuild table after frequent insertions with:
ALTER TABLE T REBUILD;

Additional capabilities and resources
• Tools: http://aka.ms/adltoolsVS
• Blogs and community page:
• http://usql.io
• http://blogs.msdn.com/b/visualstudio/
• http://azure.microsoft.com/en-us/blog/topics/big-data/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation and articles and slides:
• http://aka.ms/usql_reference
• https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• http://www.slideshare.net/MichaelRys
• ADL forums and feedback
• http://aka.ms/adlfeedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql

Unifies SQL declarativity and C# extensibility
Unifies querying structured and unstructured data
Unifies local and remote queries
Increase productivity and agility from Day 1 forward for YOU!
Sign up for an Azure Data Lake account and join the Public Preview
http://www.azure.com/datalake, download the VS tools, and give us
feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
This is why U-SQL!

Friendly Competition
Win ADL and U-SQL SWAG
1. Contribute a cool U-SQL project/sample to the Azure/usql GitHub repo (via http://usql.io) by Apr 30, 2016
2. Tweet your submission to @MikeDoesBigData with #USQLComp
3. We will review the submissions and send some cool swag (U-SQL T-shirts, ADL polo shirts, etc.) to the top 5 submissions
