3. Characteristics of Big Data analytics
• Sample Use Cases
• Digital Crime Forensics – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks, by analyzing log records with complex custom algorithms
• Image Processing – Large-scale image feature extraction and classification using custom code
• Shopping Recommendations – Complex pattern analysis and prediction over shopping records using proprietary algorithms
Requires processing of any type of data
Allows use of custom algorithms
Scales efficiently to any size
4. Status quo: SQL for Big Data
Declarativity does scaling and parallelization for you
Extensibility is bolted on and not “native”
Difficult to work with anything other than structured data
Difficult to extend with custom code
5. Status quo: Programming languages for Big Data
Extensibility through custom code is “native”
Declarativity is bolted on and not “native”
User often has to care about scale and performance
SQL is second class, embedded within strings
Often no code reuse/sharing across queries
6. Why U-SQL?
Get benefits of both!
• Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative code (C#)
• Local and remote queries
• Increase productivity and agility from Day 1 and at Day 100 for YOU!
Declarativity and extensibility are equally native to the language
7. The origins of U-SQL
SCOPE – Microsoft’s internal Big Data language
• SQL and C# integration model
• Optimization and scaling model
• Runs 100,000s of jobs daily
Hive
• Complex data types (Maps, Arrays)
• Data format alignment for text files
T-SQL/ANSI SQL
• Many of the SQL capabilities (windowing functions, metadata model, etc.)
9. U-SQL language philosophy
Declarative query and transformation language
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL analytics functions
• Optimizable, scalable
Expression-flow programming style
• Easy-to-use functional lambda composition
• Composable, globally optimizable
Operates on unstructured and structured data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined aggregators (C#)
• User-defined operators (UDO) (C#)
U-SQL provides the parallelization and scale-out framework for user code
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Federated query across distributed data sources
// Make the user code in the registered assembly available to the script
REFERENCE ASSEMBLY MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int
              , order_amount float );
// Schema on read: impose a schema on the raw files
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();
// Declarative SQL mixed with C# expressions and a user-defined aggregator
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , AGG<MyAgg.MySum>(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
           && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;
// Write the result with a custom outputter and also store it in the table
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
10. Expression-flow programming style
• Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model.
• Execution plan that is optimized out-of-the-box and without user intervention.
• Per-job and user-driven level of parallelization.
• Detailed visibility into execution steps, for debugging.
• Heatmap-like functionality to identify performance bottlenecks.
11. Query data where it lives
• Easily query data in multiple Azure data stores without moving it to a single store
• Avoid moving large amounts of data across the network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
• Filters
• Joins
[Figure: a U-SQL query in Azure Data Lake Analytics querying Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage]
12. Unstructured files
• Schema on read
• Write to file
• Built-in and custom extractors and outputters
• ADL Storage and Azure Blob Storage
EXTRACT Expression
@s = EXTRACT a string, b int
FROM "filepath/file.csv"
USING Extractors.Csv(encoding: Encoding.Unicode);
• Built-in extractors: Csv, Tsv, Text with lots of options
• Custom extractors: e.g., JSON, XML, and so on
OUTPUT Expression
OUTPUT @s
TO "filepath/file.csv"
USING Outputters.Csv();
• Built-in outputters: Csv, Tsv, Text
• Custom outputters: JSON, XML, and so on
Filepath URIs
• Relative URI to default ADL Storage account: "/filepath/file.csv"
• Absolute URIs:
• ADLS: "adl://account.azuredatalakestore.net/filepath/file.csv"
• WASB: "wasb://container@account/filepath/file.csv"
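The three path forms above differ only in scheme and authority. As a rough illustration (a Python model, not U-SQL and not how the service resolves paths), the sketch below classifies a path string with the standard library's `urlsplit`. The function name `describe_path` and the returned dictionary keys are hypothetical.

```python
from urllib.parse import urlsplit

def describe_path(path):
    """Classify a file path by its URI scheme (illustrative model only)."""
    parts = urlsplit(path)
    if parts.scheme == "":
        # No scheme: relative to the default ADL Storage account.
        return {"store": "default ADLS account", "file": parts.path}
    if parts.scheme == "adl":
        # adl://<account host>/<path>
        return {"store": "ADLS", "account": parts.hostname, "file": parts.path}
    if parts.scheme == "wasb":
        # wasb://<container>@<account>/<path>
        return {"store": "WASB", "container": parts.username,
                "account": parts.hostname, "file": parts.path}
    raise ValueError("unsupported scheme: " + parts.scheme)

relative = describe_path("/filepath/file.csv")
adls = describe_path("adl://account.azuredatalakestore.net/filepath/file.csv")
wasb = describe_path("wasb://container@account/filepath/file.csv")
```

In this model the WASB form is the only one that carries a container name in front of the account.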
15. Managing Assemblies
Create assemblies
Reference assemblies
Enumerate assemblies
Drop assemblies
CREATE ASSEMBLY db.assembly FROM @path;
CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
REFERENCE ASSEMBLY db.assembly;
• Referencing .NET Framework Assemblies
• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)
• All provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
• Add all other .NET Framework Assemblies with:
REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies
• PowerShell command
• U-SQL Studio Server Explorer
DROP ASSEMBLY db.assembly;
16. Assembly Dependencies
• Assembly must be registered to be referenced
• All Assemblies needed for compilation must be referenced in script
• All Assemblies needed at runtime either
• Need to be referenced in script, or
• Need to be registered with the assembly as additional files
• Metadata Service does NOT enforce dependencies
• Visual Studio Extension provides support for dependency management
18. File sets
• Simple patterns
• Virtual columns
• Only on EXTRACT for now
Simple pattern language on filename and path
DECLARE @pattern string =
"/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}";
• Binds two columns date and suffix
• Wildcards the filename
• Limits on number of files (between 800 and 3000)
Virtual columns
EXTRACT
name string
, suffix string // virtual column
, date DateTime // virtual column
FROM @pattern
USING Extractors.Csv();
• Refer to virtual columns in query to get partition elimination
• Virtual columns need to be referenced for DateTime columns and if no wildcard has been given
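To make the binding of virtual columns concrete, here is a minimal Python model (not U-SQL, and not the engine's implementation) that turns the file-set pattern above into a regular expression and binds `date` and `suffix` from a path. The helper names `compile_file_set` and `bind_virtual_columns`, the `DATE_PARTS` table, and the sample paths are assumptions for illustration. Filtering on the bound `date` before reading any file is a simplified picture of partition elimination.

```python
import re
from datetime import date

# Hypothetical mapping from date format parts to regex fragments.
DATE_PARTS = {"yyyy": r"\d{4}", "MM": r"\d{2}", "dd": r"\d{2}"}

PATTERN = "/input/{date:yyyy}/{date:MM}/{date:dd}/{*}.{suffix}"

def compile_file_set(pattern):
    """Translate a file-set pattern into a compiled regular expression."""
    regex = ""
    # With one capture group, re.split alternates literals and placeholders.
    for i, piece in enumerate(re.split(r"\{([^}]*)\}", pattern)):
        if i % 2 == 0:
            regex += re.escape(piece)              # literal path text
        elif piece == "*":
            regex += r"[^/]*"                      # wildcard, binds no column
        elif ":" in piece:
            column, part = piece.split(":", 1)     # e.g. date:yyyy
            regex += "(?P<%s_%s>%s)" % (column, part, DATE_PARTS[part])
        else:
            regex += "(?P<%s>[^/]+)" % piece       # plain virtual column
    return re.compile(regex)

def bind_virtual_columns(pattern, path):
    """Return the virtual column values for a path, or None if it does not match."""
    match = compile_file_set(pattern).fullmatch(path)
    if match is None:
        return None
    groups = match.groupdict()
    # This sketch hard-codes a single DateTime column named "date".
    columns = {k: v for k, v in groups.items() if not k.startswith("date_")}
    columns["date"] = date(int(groups["date_yyyy"]),
                           int(groups["date_MM"]),
                           int(groups["date_dd"]))
    return columns

paths = ["/input/2016/03/15/orders.csv",
         "/input/2016/04/01/orders.tsv",
         "/other/readme.txt"]
bound = [(p, bind_virtual_columns(PATTERN, p)) for p in paths]
# Only files whose bound date passes the predicate would need to be read.
selected = [p for p, cols in bound if cols and cols["date"] >= date(2016, 4, 1)]
```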
21. U-SQL Joins
Join operators
• INNER JOIN
• LEFT or RIGHT or FULL OUTER JOIN
• CROSS JOIN
• SEMIJOIN
• Equivalent to IN subquery
• ANTISEMIJOIN
• Equivalent to NOT IN subquery
Notes
• ON clause comparisons need to be of the simple form rowset.column == rowset.column, or AND conjunctions of such simple equality comparisons
• If a comparand is not a column, wrap it into a column in a previous SELECT
• If the comparison operation is not ==, put it into the WHERE clause
• Turn the join into a CROSS JOIN if there is no equality comparison
Reason: Syntax calls out which joins are efficient
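The semijoin and antisemijoin semantics can be sketched in a few lines of Python (a model of the behavior, not U-SQL). Rowsets are lists of dictionaries, and the function names and sample data are hypothetical. The sketch ignores null handling, where SQL's NOT IN behaves differently.

```python
def semijoin(left, right, key):
    """Keep each left row that has at least one match in right (like IN)."""
    keys = {row[key] for row in right}
    return [row for row in left if row[key] in keys]

def antisemijoin(left, right, key):
    """Keep each left row that has no match in right (like NOT IN)."""
    keys = {row[key] for row in right}
    return [row for row in left if row[key] not in keys]

customers = [{"cid": 1, "name": "Ann"},
             {"cid": 2, "name": "Bob"},
             {"cid": 3, "name": "Cy"}]
orders = [{"oid": 10, "cid": 1}, {"oid": 11, "cid": 1}, {"oid": 12, "cid": 3}]

with_orders = semijoin(customers, orders, "cid")
without_orders = antisemijoin(customers, orders, "cid")
```

Unlike an inner join, the semijoin returns Ann only once even though she has two orders, and it returns no columns from the right side.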
23. “Top 5” Surprises for SQL Users
AS is not as
• C# keywords and SQL keywords overlap
• Costly to make case-insensitive -> Better to build capabilities than tinker with syntax
= != ==
• Remember: C# expression language
null IS NOT NULL
• C# nulls are two-valued
PROCEDURES but no WHILE
No UPDATE nor MERGE
• Transform/Recook instead
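The "two-valued nulls" point is the one most likely to change query results, so here is a small Python model of the difference (not U-SQL or C# code). `None` stands in for null, and the helper names and sample rows are hypothetical. In SQL, comparing with NULL yields UNKNOWN and the row is filtered out. With C#-style semantics the comparison always yields true or false.

```python
def sql_equals(a, b):
    """SQL-style '=': any NULL operand yields UNKNOWN (modeled as None)."""
    if a is None or b is None:
        return None
    return a == b

def csharp_equals(a, b):
    """C#-style '==': null == null is true, and the result is always a bool."""
    return a == b

def where(rows, predicate):
    """Keep rows whose predicate is exactly True; UNKNOWN rows are dropped."""
    return [r for r in rows if predicate(r) is True]

rows = [{"cid": 1, "city": "New York"}, {"cid": 2, "city": None}]

# Three-valued logic: comparing with NULL never selects a row.
sql_null_rows = where(rows, lambda r: sql_equals(r["city"], None))
# Two-valued logic: comparing with null selects the rows whose city is null.
csharp_null_rows = where(rows, lambda r: csharp_equals(r["city"], None))
```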
24. Meta Data Object Model
[Figure: ADLA Catalog object model. Hierarchy shown: ADLA Catalog, Database, Schema, with cardinalities [1,n], [1,n], [0,n]. Objects shown: tables, views, TVFs, procedures, external tables, table types, clustered index, partitions, statistics, data sources, credentials, C# assemblies, C# functions, C# UDAggs, C# UDTs, C# extractors, C# reducers, C# processors, C# combiners, C# outputters, C# appliers. Legend: abstract objects and user objects; relationships "refers to", "contains", and "implemented and named by"; MD name and C# name.]
25. U-SQL Catalog
• Naming
• Default database and schema context: master.dbo
• Quote identifiers with []: [my table]
• Stores data in ADL Storage /catalog folder
• Discovery
• Visual Studio Server Explorer
• Azure Data Lake Analytics Portal
• SDKs and Azure PowerShell commands
• Sharing
• Within an Azure Data Lake Analytics account
• Securing
• Secured with AAD principals at catalog level (inherited from ADL Storage)
27. Views and TVFs
• Views for simple cases
• TVFs for parameterization and most cases
Views
CREATE VIEW V AS EXTRACT…
CREATE VIEW V AS SELECT …
• Cannot contain user-defined objects (such as UDFs or UDOs)
• Will be inlined
Table-Valued Functions (TVFs)
CREATE FUNCTION F (@arg string = "default")
RETURNS @res [TABLE ( … )]
AS BEGIN … @res = … END;
• Provides parameterization
• One or more results
• Can contain multiple statements
• Can contain user-code (needs assembly reference)
• Will always be inlined
• Infers schema or checks against specified return schema
28. Procedures
Allows encapsulation of non-DDL scripts
CREATE PROCEDURE P (@arg string = "default")
AS
BEGIN
…;
OUTPUT @res TO …;
INSERT INTO T …;
END;
• Provides parameterization
• No result but writes into file or table
• Can contain multiple statements
• Can contain user code (needs assembly reference)
• Will always be inlined
• Cannot contain DDL (no CREATE, DROP)
29. Tables
• CREATE TABLE
• CREATE TABLE AS SELECT
CREATE TABLE T (col1 int
, col2 string
, col3 SQL.MAP<string,string>
, INDEX idx CLUSTERED (col1 ASC)
PARTITIONED BY HASH (col1)
);
• Structured Data
• Built-in Data types only (no UDTs)
• Clustered index (must be specified): row-oriented
• Fine-grained partitioning (must be specified):
• HASH, DIRECT HASH, RANGE, ROUND ROBIN
CREATE TABLE T (INDEX idx CLUSTERED …) AS SELECT …;
CREATE TABLE T (INDEX idx CLUSTERED …) AS EXTRACT…;
CREATE TABLE T (INDEX idx CLUSTERED …) AS
myTVF(DEFAULT);
• Infer the schema from the query
• Still requires index and partitioning
31. INSERT
• INSERT constant values
• INSERT from queries
• Multiple INSERTs
INSERT constant values
INSERT INTO T VALUES (1, "text",
new SQL.MAP<string,string>{{"key","value"}});
INSERT from queries
INSERT INTO T SELECT col1, col2, col3 FROM @rowset;
Multiple INSERTs into same table
• Is supported
• Generates separate file per insert in physical storage:
• Can lead to performance degradation
• Recommendations:
• Try to avoid small inserts
• Rebuild table after frequent insertions with:
ALTER TABLE T REBUILD;
32. Additional capabilities and resources
• Tools: http://aka.ms/adltoolsVS
• Blogs and community page:
• http://usql.io
• http://blogs.msdn.com/b/visualstudio/
• http://azure.microsoft.com/en-us/blog/topics/big-data/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation and articles and slides:
• http://aka.ms/usql_reference
• https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• http://www.slideshare.net/MichaelRys
• ADL forums and feedback
• http://aka.ms/adlfeedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
33. This is why U-SQL!
Unifies SQL declarativity and C# extensibility
Unifies querying structured and unstructured data
Unifies local and remote queries
Increase productivity and agility from Day 1 forward for YOU!
Sign up for an Azure Data Lake account and join the Public Preview at http://www.azure.com/datalake, download the VS tools, and give us feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
34. Friendly Competition
Win ADL and U-SQL SWAG
1. Contribute a cool U-SQL project/sample to the Azure/usql
Github repo (via http://usql.io) by Apr 30, 2016
2. Tweet your submission to @MikeDoesBigData with
#USQLComp
3. We will review the submissions and send some cool swag (U-SQL T-shirts, ADL polo shirts, etc.) to the top 5 submissions
Editor's Notes
Offers auto-scaling and performance
Operates on unstructured data without requiring tables
Easy to extend declaratively with custom code: consistent model for UDO, UDF and UDAgg.
Easy to query remote sources even without external tables
U-SQL UDAgg
Code and compile .cs file:
Implement IAggregate’s three methods: Init(), Accumulate(), Terminate()
C# takes care of type checking, generics, etc.
Deploy:
Tooling: one click registration in user db of assembly
By Hand:
Copy file to ADL
CREATE ASSEMBLY to register assembly
Use via AGG<MyNamespace.MyAggregate<T>>(a)
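The Init/Accumulate/Terminate protocol can be illustrated with a small Python model (not the C# IAggregate interface itself). The class `MySum`, the driver `aggregate_by`, and the sample rows are hypothetical. The driver plays the role of GROUP BY: it creates one aggregator per group, feeds it each row's value, and collects the final result.

```python
class MySum:
    """Toy aggregator following the Init/Accumulate/Terminate protocol."""

    def init(self):
        # Reset state at the start of each group.
        self.total = 0.0

    def accumulate(self, value):
        # Called once per row in the group.
        self.total += value

    def terminate(self):
        # Called once per group to produce the aggregate value.
        return self.total

def aggregate_by(rows, key, column, agg_factory):
    """Group rows by key and run one aggregator instance per group."""
    groups = {}
    for row in rows:
        if row[key] not in groups:
            agg = agg_factory()
            agg.init()
            groups[row[key]] = agg
        groups[row[key]].accumulate(row[column])
    return {k: agg.terminate() for k, agg in groups.items()}

orders = [{"cid": 1, "amount": 10.0},
          {"cid": 1, "amount": 5.5},
          {"cid": 2, "amount": 2.0}]
totals = aggregate_by(orders, "cid", "amount", MySum)
```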
U-SQL UDF
Code in C#, register assembly once, call by C# name.
U-SQL is the next-generation large-scale data processing language that combines
• the benefits of the declarative, optimizable and parallelizable SQL language with
• the extensibility, expressiveness and familiarity of the programmer’s favorite programming and expression language
to analyze large and complex amounts of data while being
• Easy to program
• Highly scalable and performant
• Affordable
• Secure
Users focus on the WHAT and not the HOW