3. Implement Data Warehouse
[Diagram: the traditional top-down data warehousing approach. Strategy and requirements come first (Understand Corporate Strategy, then Gather Requirements: Business Requirements and Technical Requirements), followed by the build lifecycle (Setup Infrastructure, Install and Tune, Dimension Modelling, Physical Design, ETL Design, ETL Development, Reporting & Analytics Design, Reporting & Analytics Development), yielding the pipeline Data sources, ETL, Data warehouse, BI and analytics.]
4. The Data Lake approach
Ingest all data regardless of requirements.
Store all data in native format without schema definition.
Do analysis using analytic engines like Hadoop.
[Diagram: devices and other data sources feed the data lake, which serves interactive queries, batch queries, machine learning, real-time analytics, and the data warehouse.]
5. MICROSOFT DOUBLES SEARCH SHARE
[Chart: Microsoft's US search share by year. 2009: 9%; 2010: 11%; 2011: 15%; 2012: 16%; 2013: 18%; 2014: 19%; 2015: 20%. Source: ComScore 2009-2015 Search Report US.]
How Microsoft has used Big Data
We needed to better leverage data and analytics to win in search. We changed our approach:
• More experiments by more people!
So we…
• Built an Exabyte-scale data lake for everyone to put their data in.
• Built tools approachable by any developer.
• Built machine learning tools for collaborating across large experiment models.
9. Azure Data Lake Store (IN PREVIEW)
A hyper-scale repository for big data analytics workloads.
• No limits to SCALE
• Store ANY DATA in its native format
• HADOOP FILE SYSTEM (HDFS) for the cloud
• ENTERPRISE GRADE access control and encryption at rest
• Optimized for analytic workload PERFORMANCE
12. ADLA complements HDInsight
Both target the same scenarios, tools, and customers.
HDInsight:
• For developers familiar with the Open Source stack: Java, Eclipse, Hive, etc.
• Clusters offer customization, control, and flexibility in a managed Hadoop cluster.
ADLA:
• Enables customers to leverage existing experience with C#, SQL & PowerShell.
• Offers convenience, efficiency, automatic scale, and management in a "job service" form factor.
13. Azure Data Lake Analytics (IN PREVIEW)
A distributed analytics service built on Apache YARN that dynamically scales to your needs.
• No limits to SCALE
• Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
• Optimized to work with ADL STORE
• FEDERATED QUERY across Azure data sources
• ENTERPRISE GRADE role-based access control and auditing
• Pay PER QUERY and scale PER QUERY
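To give a feel for the "job service" form factor, here is a minimal U-SQL sketch (file paths and column names are illustrative, not from the deck): you submit the script as a job, it scales per query, and the results land back in the store.

@searchlog =
    EXTRACT UserId int, Query string, Latency int
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// The WHERE clause is an ordinary C# boolean expression
@slow =
    SELECT UserId, Query, Latency
    FROM @searchlog
    WHERE Latency > 1000 && Query.Contains("azure");

OUTPUT @slow
    TO "/output/slowqueries.csv"
    USING Outputters.Csv();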
15. Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store.
Benefits:
• Avoid moving large amounts of data across the network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources (filters, joins)
[Diagram: a U-SQL query running in Azure Data Lake Analytics reaches out to Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage.]
18. Some sample use cases
• Digital Crime Unit: analyze complex attack patterns to understand botnets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms
• Image Processing: large-scale image feature extraction and classification using custom code
• Shopping Recommendation: complex pattern analysis and prediction over shopping records using proprietary algorithms
Characteristics of Big Data Analytics:
• Requires processing of any type of data
• Allows use of custom algorithms
• Scales to any size while staying efficient
19. Status Quo: SQL for Big Data
• Declarativity does scaling and parallelization for you.
• Extensibility is bolted on and not "native":
• hard to work with anything other than structured data
• difficult to extend with custom code
20. Status Quo: Programming Languages for Big Data
• Extensibility through custom code is "native".
• Declarativity is bolted on and not "native":
• the user often has to care about scale and performance
• SQL is second-class, embedded within strings
• often no code reuse/sharing across queries
21. Why U-SQL? Declarativity and extensibility are equally native to the language, so you get the benefits of both!
Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative code
• Local and remote queries
• Increased productivity and agility from Day 1 and at Day 100 for YOU!
22. The origins of U-SQL
SCOPE, Microsoft's internal Big Data language:
• SQL and C# integration model
• Optimization and scaling model
• Runs 100,000s of jobs daily
Hive:
• Complex data types (maps, arrays)
• Data format alignment for text files
T-SQL/ANSI SQL:
• Many of the SQL capabilities (windowing functions, metadata model, etc.)
25. U-SQL Language Philosophy
Declarative query and transformation language:
• Uses SQL's SELECT FROM WHERE with GROUP BY/aggregation, joins, and SQL analytics functions
• Optimizable, scalable
Expression-flow programming style:
• Easy-to-use functional lambda composition
• Composable, globally optimizable
Operates on unstructured & structured data:
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from the ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined aggregators (C#)
• User-defined operators (UDOs) (C#)
U-SQL provides the parallelization and scale-out framework for user code:
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER
Federated query across distributed data sources, as in the sample script below:
// Reference the user assembly holding the custom aggregator, function, and outputter
REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int
              , order_amount float );

// Schema-on-read extraction over CSV files in the store
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

// C# expressions (StartsWith, ==) mix with SQL; AGG<> invokes a user-defined aggregator
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , AGG<MyAgg.MySum>(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
        && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

// Write with a custom outputter and also insert into the managed table
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;
26. Expression-flow Programming Style
• Automatic "in-lining" of U-SQL expressions: the whole script leads to a single execution model.
• Execution plan that is optimized out of the box and without user intervention.
• Per-job and user-driven level of parallelization.
• Detailed visibility into execution steps, for debugging.
• Heatmap-like functionality to identify performance bottlenecks.
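A small sketch of what this looks like in practice (names illustrative): each @rowset assignment is a composable expression rather than an executed statement, so the optimizer sees the whole flow before producing a single plan.

@raw =
    EXTRACT id int, val double
    FROM "/input/data.csv"
    USING Extractors.Csv();

// Nothing runs here: @big and @binned merely compose expressions
@big    = SELECT id, val FROM @raw WHERE val > 100.0;
@binned = SELECT id, Math.Floor(val / 10.0) AS bin FROM @big;

// Only the OUTPUT causes the composed flow to be compiled into one job
OUTPUT @binned TO "/output/bins.csv" USING Outputters.Csv();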
27. Unifies natively SQL's declarativity and C#'s extensibility
Unifies querying structured and unstructured data
Unifies local and remote queries
Increases productivity and agility from Day 1 forward for YOU!
Sign up for an Azure Data Lake account and join the Public Preview at http://www.azure.com/datalake, and give us your feedback via http://aka.ms/adlfeedback or http://aka.ms/u-sql-survey!
28. Additional resources
• Tools:
• http://aka.ms/adltoolsVS
• Blogs, videos and community page:
• http://usql.io (Link to Github with code samples)
• http://blogs.msdn.com/b/visualstudio/
• http://azure.microsoft.com/en-us/blog/topics/big-data/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation, articles, and slides:
• http://aka.ms/usql_reference
• https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• http://www.slideshare.net/MichaelRys
• ADL forums and feedback
• http://aka.ms/adlfeedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
The data warehouse leverages the top-down approach: a well-architected information store backing an enterprise-wide BI solution. Building a data warehouse follows the top-down approach, where the company's corporate strategy is defined first. This is followed by gathering the business and technical requirements for the warehouse. The data warehouse is then implemented through dimension modelling and ETL design, followed by the actual development of the warehouse. This is all done prior to any data being collected. It uses a rigorous and formalized methodology, because a true enterprise data warehouse supports many users and applications across an organization in making better decisions.
A data lake is an enterprise-wide repository of every type of data, collected in a single place. Data of all types can be stored in the data lake prior to any formal definition of requirements or schema, for the purposes of operational and exploratory analytics. Advanced analytics can be done using Hadoop or machine learning tools, or the lake can act as a lower-cost data preparation location prior to moving curated data into a data warehouse. In these cases, customers load data into the data lake before defining any transformation logic.
This is bottom-up because the data is collected first; the data itself gives you the insight and helps you derive conclusions or predictive models.
Other points to make here, but not called out above
Built on Apache YARN
Scales dynamically with the turn of a dial
Supports Azure AD for access control, roles, and integration with on-prem identity systems
U-SQL’s scalable runtime processes data across multiple Azure data sources
DATA SOURCE: Represents a remote data source such as an Azure SQL Database. You have to specify all the details (connection string, credentials, etc.) required to connect to it and issue queries.
EXTERNAL TABLE: A local table, with columns defined in C# types, that redirects queries issued against it to the remote table it is based on. U-SQL automatically does the type conversion. External tables let you impose a specific schema on the remote data, shielding you from remote schema changes. You can issue queries that join external and local tables.
PASS THROUGH queries: These queries are issued directly against the remote data source, in the syntax of the remote data source (say, T-SQL for an Azure SQL Database).
REMOTABLE_TYPES: For every external data source you have to specify the list of 'remotable' types. This list constrains the types of queries that will be remoted. Ex: REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal);
LAZY METADATA LOADING: Here the remote data is schematized only when the query is actually issued to the remote data source. Your program must be able to deal with remote schema changes.
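Putting those concepts together, here is a minimal sketch of the federation DDL (the data source, credential, and table names are illustrative, and exact option spellings may differ in the preview):

// DATA SOURCE: a remote Azure SQL Database; REMOTABLE_TYPES constrains
// which expressions may be pushed to the remote side
CREATE DATA SOURCE MySqlDbSource
FROM AZURESQLDB
WITH ( PROVIDER_STRING = "Database=CustomerDb;",
       CREDENTIAL = MyDb.MySecret,
       REMOTABLE_TYPES = (bool, byte, short, ushort, int, decimal) );

// EXTERNAL TABLE: a local schema in C# types over the remote table,
// shielding queries from remote schema changes
CREATE EXTERNAL TABLE ExternalCustomers ( cid int, name string, city string )
FROM MySqlDbSource LOCATION "dbo.Customers";

// PASS THROUGH query: sent verbatim to the remote source in its own
// dialect (T-SQL for an Azure SQL Database)
@remote =
    SELECT * FROM EXTERNAL MySqlDbSource
    EXECUTE @"SELECT cid, name FROM dbo.Customers WHERE city LIKE 'New%'";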
Add velocity?
Hard to operate on unstructured data: even Hive requires metadata to be created before operating on unstructured data. Adding custom Java functions, aggregators, and SerDes involves many steps, often requires access to the server's head node, and differs based on the type of operation. It requires many tools and steps.
Some examples:
Hive UDAgg:
• Code and compile .java into a .jar:
• Extend the AbstractGenericUDAFResolver class: does type checking, argument checking, and overloading
• Extend the GenericUDAFEvaluator class: implements the logic in 8 methods
• Deploy:
• Deploy the jar into the class path on the server
• Edit FunctionRegistry.java to register it as a built-in
• Update the content of "show functions" with ant
Hive UDF (as of v0.13):
• Code
• Load the JAR onto the head node or at a URI
• CREATE FUNCTION USING JAR to register and load the jar into the classpath for every function (instead of registering the jar once and just using the functions)
Spark supports custom "inputters and outputters" for defining custom RDDs.
No UDAggs.
Simple integration of UDFs, but only for the duration of the program; no reuse/sharing.
Cloud Dataflow? Requires the user to care about scale and perf.
Spark UDAgg:
• Not yet supported (SPARK-3947)
Spark UDF:
• Write an inline function:
  def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state)
• For SQL usage, register the table:
  customerTable.registerTempTable("customerTable")
• Register each UDF:
  sqlContext.udf.register("westernState", westernState _)
• Call it:
  val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
In contrast, ADLA/U-SQL:
• Offers auto-scaling and performance
• Operates on unstructured data without needing tables
• Easy to extend declaratively with custom code: a consistent model for UDOs, UDFs, and UDAggs
• Easy to query remote sources even without external tables
U-SQL UDAgg:
• Code and compile a .cs file:
• Implement IAggregate's 3 methods: Init(), Accumulate(), Terminate()
• C# takes care of type checking, generics, etc.
• Deploy:
• Tooling: one-click registration of the assembly in the user database
• By hand: copy the file to ADL, then CREATE ASSEMBLY to register the assembly
• Use via AGG<MyNamespace.MyAggregate<T>>(a)
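A minimal sketch of such an aggregator (the namespace and the summing logic are illustrative); the three methods map directly to the per-group lifecycle:

using Microsoft.Analytics.Interfaces;

namespace MyNamespace
{
    public class MySum : IAggregate<double, double>
    {
        private double total;

        // Called once at the start of each group
        public override void Init() { total = 0.0; }

        // Called once per row within the group
        public override void Accumulate(double value) { total += value; }

        // Returns the aggregated value for the group
        public override double Terminate() { return total; }
    }
}

Once the assembly is registered and referenced, the call site is AGG<MyNamespace.MySum>(amount) in a GROUP BY query.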
U-SQL UDF:
• Code it in C#, register the assembly once, call it by its C# name.
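For comparison, a hedged UDF sketch (names illustrative): a public static C# method in a registered assembly, callable from any U-SQL expression by its C# name.

using System;

namespace MyNamespace
{
    public static class MyFunctions
    {
        // Usable directly in SELECT or WHERE clauses, e.g.
        // WHERE MyNamespace.MyFunctions.MonthsSince(o.odate) > 10
        public static int MonthsSince(DateTime d)
        {
            var now = DateTime.UtcNow;
            return (now.Year - d.Year) * 12 + (now.Month - d.Month);
        }
    }
}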
Remove SCOPE for external customers?
C# is the extension story for U-SQL:
• Expressions in SELECT statements
• User-defined operators (UDOs)
• User-defined functions (UDFs)
• User-defined aggregates (UDAGGs)
• User-defined types (UDTs)
UDOs are central to the U-SQL user experience.
UDFs, UDAGGs, UDOs, and UDTs require assemblies to be registered (a one-time cost, with a fixed assembly version); they are then automatically available after referencing the assembly in a script.
One version of an assembly per database; an assembly with the same short name is not allowed.
Tooling provides a code-behind and auto-deploy experience.
Use for language experts.