U-SQL is the query language for big data analytics on the Azure Data Lake platform. This session will explore the unification of SQL and C# in this new query language, examples of combining data from external sources such as Azure SQL Database and Blob storage with Azure Data Lake Store, creating and referencing assemblies, job submission, and tools. The ADL platform will also be compared and contrasted to the HDInsight/Hadoop platform.
Hands-On with U-SQL and Azure Data Lake Analytics (ADLA): A first look at U-SQL on Azure Data Lake Public Preview
1. Hands-On with U-SQL and Azure Data Lake Analytics (ADLA)
A first look at U-SQL on Azure Data Lake Public Preview
Jason Brugger (@JasonLBrugger)
MCSE: Data Platform, MCSE: Business Intelligence
July 16, 2016
2. This presentation has been modified from its original format. Animations have been removed and it has been reformatted for publication on Slideshare.net.
3. Assumptions
• You are familiar with the differences between a traditional RDBMS and a Big Data solution.
• You are familiar with both T-SQL and C#.
4. What is a data lake? What are Azure Data Lake Store and Azure Data Lake Analytics?
• A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. – Margaret Rouse (SearchAWS)
• Pentaho CTO James Dixon has generally been credited with coining the term “data lake”. He describes a data mart (a subset of a data warehouse) as akin to a bottle of water… “cleansed, packaged and structured for easy consumption” while a data lake is more like a body of water in its natural state. – Chris Campbell, Blue Granite
• Data Lake Analytics is an Azure Big Data computation service that lets you use data to drive your business using the insights gained from your data in the cloud, regardless of where it is and regardless of its size. – Ed Macauley, Microsoft
Data Lake Store
Data Lake Analytics
5. ADLA vs. HDInsight (e.g. Hadoop)
• HDInsight (Cluster as a Service)
  • Provision cluster of n nodes
  • Run your queries
  • Delete cluster
  • (Repeat)
• ADLA (Query as a service)
  • Don’t provision anything
  • Specify node count (parallelism) at job submission time
  • Pay per query
6. Getting Started – What’s Needed?
• Azure subscription
• Sign up for ADL preview
• Visual Studio 2015 + Azure Data Lake Tools for Visual Studio
• Microsoft Azure PowerShell (1.0+ via Web Platform Installer, WPI)
  • Not to be confused with the version of Windows PowerShell itself, e.g. 5.0.
• Microsoft Azure SDK for .NET (Optional)
11. Getting data into ADL
• Portal
• PowerShell
  • Login-AzureRmAccount
  • Import-AzureRmDataLakeStoreItem
• Connecting to External Data (Demo #2)
• SSIS
• ADF
12. The Data (NOAA Weather observations)
Station Datekey Element Value Mflag Qflag Sflag TimeKey
US1FLSL0019 20150101 PRCP 173 N
US1TXTV0133 20150101 PRCP 119 N
USC00178998 20150101 TMAX -33 700
USC00178998 20150101 TMIN -167 700
USC00178998 20150101 TOBS -67 700
USC00178998 20150101 PRCP 0 700
USC00178998 20150101 SNOW 0
USC00178998 20150101 SNWD 0
USR0000CSNR 20150101 TMAX 194 H D U
Notice sparse data with many null values
13. The Data (NOAA Weather observations)
Multiple observation types, per site, per day
14. The Data (NOAA Weather observations)
Value column: temperatures (TMAX, TMIN, TOBS) are recorded in tenths of a degree C
15. The Data (NOAA Weather observations)
Correlates to external data uploaded to Azure SQL Database
16. Basic U-SQL query
Load .csv file from Data Lake Store using built-in Extractor
Schematize using C# data types, note nullability
Output schematized rows to a table variable
SELECT using familiar SQL-like query
Output query result to Data Lake Store using built-in Outputter
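The script shown on this slide did not survive in the transcript. As a rough analogy in Python (not U-SQL, and not the presenter's code), the five annotated steps might look like the sketch below. Only the first four columns of the sample rows are used, and every function and variable name here is hypothetical.

```python
import csv
import io
from collections import defaultdict

# First four columns of some sample rows from the data slide; flag columns omitted.
SAMPLE_CSV = """US1FLSL0019,20150101,PRCP,173
US1TXTV0133,20150101,PRCP,119
USC00178998,20150101,TMAX,-33
USC00178998,20150101,TMIN,-167
USC00178998,20150101,SNOW,0
"""

def to_nullable_int(text):
    """Mimic a nullable int column (int? in C#): an empty field becomes None."""
    return int(text) if text != "" else None

def extract(source):
    """Steps 1-3: read header-less CSV and apply a schema to every row."""
    rows = []
    for station, datekey, element, value in csv.reader(source):
        rows.append({
            "Station": station,
            "Datekey": int(datekey),
            "Element": element,
            "Value": to_nullable_int(value),
        })
    return rows

def count_by_element(rows):
    """Step 4: roughly SELECT Element, COUNT(*) ... GROUP BY Element."""
    counts = defaultdict(int)
    for row in rows:
        counts[row["Element"]] += 1
    return dict(counts)

def output_csv(counts):
    """Step 5: write the aggregated result back out as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for element in sorted(counts):
        writer.writerow([element, counts[element]])
    return out.getvalue()

rows = extract(io.StringIO(SAMPLE_CSV))
result = count_by_element(rows)
print(output_csv(result))
```

The nullable conversion is the part most worth noticing: a column declared without nullability would fail on the sparse flag columns in the real data.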
17. Key Takeaways & ‘gotchas’
• SQL keywords MUST be uppercase
• Header rows not currently supported by Extractor
  • e.g. skipFirstNRows:1 not currently supported
• Be mindful of nullability in C# types
• Built-in extractors and outputters include support for .Csv(), .Tsv(), & .Text()
  • Various options such as delimiter
• Build custom extractors by inheriting IExtractor
18. DEMO #1
• Demo local execution
• Simple aggregation of 10,000 rows down to 43, by element type
19. Persisted schema with metadata object model
Familiar CREATE DATABASE statement
Familiar CREATE VIEW statement; View maintains extractor and schema definitions so from now on, we can just select from the view.
Note data kept in its native compressed (.gz) format. Extractor handles decompression in this case
Wildcard {*} yields file-set of all matching files
20. Combining with external data
• Create catalog secret using PowerShell (specifies remote Host & credentials & ADLA catalog)
  • New-AzureRmDataLakeAnalyticsCatalogSecret
• Create credential (in turn, references catalog secret)
• Create data source (in turn, references credential & specifies data source type (e.g. Azure SQL Db) & specifies remote catalog)
CREATE EXTERNAL TABLE denotes underlying table resides remotely
Schema using C# types
Remote table name
21. External data with federated query
SELECT FROM EXTERNAL data source EXECUTE
Embedded query executes remotely at data source. This is T-SQL, not U-SQL
22. External data with federated query
Embedded query executes remotely at data source. This is T-SQL, not U-SQL
Table variable contains only rows returned
id name
US1FLHB0090 TAMPA 10.2 NNW
US1FLHB0048 GREATER NORTHDALE 0.4 ENE
USW00012810 MACDILL AFB
USC00088890 TEMPLE TERRACES
US1FLHB0007 TAMPA 8.4 NW
US1FLHB0025 CARROLLWOOD 1.7 SE
US1FLHB0040 UNIVERSITY WEST 2.0 WNW
USC00088786 TAMPA
US1FLHB0028 WEST PARK 0.4 S
US1FLHB0055 TAMPA 5.0 NNE
US1FLHB0012 CARROLLWOOD 0.5 WNW
US1FLHB0096 TAMPA 5.4 SSW
US1FLHB0005 CITRUS PARK 1.3 ENE
USC00080520 BAY LAKE
US1FLHB0093 TEMPLE TERRACE 1.5 SE
US1FLHB0039 CARROLLWOOD 2.0 SSE
US1FLHB0087 TAMPA 7.9 N
US1FLHB0071 TAMPA 6.1 N
USW00012842 TAMPA INTL AP
US1FLHB0010 TAMPA 5.1 S
US1FLHB0036 TAMPA 4.4 SSW
US1FLHB0029 TAMPA 6.5 NNE
US1FLHB0064 TAMPA 4.7 NW
US1FLHB0051 LUTZ 2.2 SSE
23. Data relationship exhibit
[Diagram labels]
Azure SQL Db: dbo.station, dbo.calendar
Federated Query
Azure Data Lake Analytics: @station_tpa, dbo.calendar, dbo.observation, Result
Azure Data Lake Store
24. Complex types in U-SQL & EXPLODE
• SQL.ARRAY
  • Like a List or Array in C#
  • Can be used in conjunction with String.Split()
• SQL.MAP
  • Key-Value pairs
  • Like a Dictionary (Hash table) in C#
• EXPLODE
  • Expands to rows
Before EXPLODE:
ID (int) Data (SQL.MAP)
1 ((“A”,25),(“B”,35),(“C”,45))
2 ((“A”,27),(“B”,38),(“C”,42))
After EXPLODE:
ID Key Value
1 A 25
1 B 35
1 C 45
2 A 27
2 B 38
2 C 42
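As a rough Python analogy (not U-SQL), EXPLODE behaves like flattening a dictionary into one row per key-value pair. The data below is the example from this slide; the function name is hypothetical.

```python
# Rows mirroring the slide: an int ID plus a map of key -> value.
rows = [
    (1, {"A": 25, "B": 35, "C": 45}),
    (2, {"A": 27, "B": 38, "C": 42}),
]

def explode(rows):
    """Emit one (id, key, value) row per map entry."""
    return [(row_id, key, value)
            for row_id, data in rows
            for key, value in data.items()]

exploded = explode(rows)
for row in exploded:
    print(*row)
```

Two input rows with three map entries each expand to six output rows, matching the table on the slide.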
25. Tying it all together with U-SQL
Familiar JOIN syntax; note double equals “==”, the only supported JOIN comparison operator
Familiar WHERE and GROUP BY syntax
26. Tying it all together with U-SQL
CROSS APPLY the value (recall this was in tenths of degrees C)
27. Tying it all together with U-SQL
CROSS APPLY the value (recall this was in tenths of degrees C)
Declare a new SQL.MAP with conversion factors by C, F, and K
28. Tying it all together with U-SQL
CROSS APPLY the value (recall this was in tenths of degrees C)
Declare a new SQL.MAP with conversion factors by C, F, and K
EXPLODE the SQL.MAP into rows and new columns scale, temp
29. Tying it all together with U-SQL
Familiar aggregation AVG using exploded column temp
30. Tying it all together with U-SQL
Using String.Concat .NET method to build description of derived column, e.g.:
AVG_TMAX_F
AVG_TMAX_C
AVG_TMAX_K
AVG_TMIN_F
AVG_TMIN_C
AVG_TMIN_K
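The script these slides annotate is not reproduced in the transcript. A minimal Python sketch of the same idea (an analogy, not the presenter's U-SQL): turn each value from tenths of a degree C into a map keyed by scale, explode the map into rows, average per element and scale, and build names such as AVG_TMAX_F. The three observations come from the sample data shown earlier; the function name, and the choice to store converted temperatures rather than conversion factors in the map, are assumptions.

```python
from collections import defaultdict

# (Element, Value) pairs from the sample data; Value is in tenths of a degree C.
observations = [("TMAX", -33), ("TMIN", -167), ("TMAX", 194)]

def to_scales(tenths_c):
    """Build a scale -> temperature map for one observation."""
    c = tenths_c / 10.0
    return {"C": c, "F": c * 9.0 / 5.0 + 32.0, "K": c + 273.15}

# Explode: one (element, scale, temp) row per observation and scale.
exploded = [(element, scale, temp)
            for element, value in observations
            for scale, temp in to_scales(value).items()]

# Aggregate: collect temps per (element, scale) group.
groups = defaultdict(list)
for element, scale, temp in exploded:
    groups[(element, scale)].append(temp)

# Name each averaged measure by joining its parts, e.g. "AVG_TMAX_F".
averages = {
    "_".join(["AVG", element, scale]): sum(temps) / len(temps)
    for (element, scale), temps in groups.items()
}

for name in sorted(averages):
    print(name, round(averages[name], 2))
```

With two elements and three scales, the result has six named measures, which is where the six column descriptions listed on the slide come from.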
31. Tying it all together with U-SQL
CREATE TABLE AS SELECT (CTAS) – Conceptually similar to SELECT INTO
32. Tying it all together with U-SQL
CREATE TABLE AS SELECT (CTAS) – Conceptually similar to SELECT INTO
No HEAPs in ADLA; Clustered index required
33. Tying it all together with U-SQL
CREATE TABLE AS SELECT (CTAS) – Conceptually similar to SELECT INTO
No HEAPs in ADLA; Clustered index required
Partitioned by Round Robin distributes data evenly.
34. Tying it all together with U-SQL
CREATE TABLE AS SELECT (CTAS) – Conceptually similar to SELECT INTO
No HEAPs in ADLA; Clustered index required
Partitioned by Round Robin distributes data evenly.
No update or merge support
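Round-robin distribution can be pictured as dealing rows out to buckets in turn. The Python sketch below only illustrates that idea, with hypothetical names and an assumed bucket count of 4; it does not model how ADLA actually places rows.

```python
def round_robin(rows, bucket_count):
    """Deal rows out to buckets in turn, so bucket sizes differ by at most one."""
    buckets = [[] for _ in range(bucket_count)]
    for index, row in enumerate(rows):
        buckets[index % bucket_count].append(row)
    return buckets

buckets = round_robin(list(range(360)), 4)
print([len(bucket) for bucket in buckets])
```

Because placement ignores the row contents, the split stays even no matter how skewed the data values are.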
35. DEMO #2
• Data Lake data set consists of daily readings from 98,035 stations over 5 years
  • ~32,720,048 rows per file
  • About 164M rows total
• 24 Tampa stations
• Filtering and aggregating it down to 5 years x 12 months x 2 elements x 3 temperature scales, or 360 rows
• Monitor job execution status
  • Streams, Vertices, Display avg execution time (heat map), Diagnostics, History, Script*
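The row counts quoted on this slide can be checked with a few lines of arithmetic. The sketch assumes one file per year (five files), which the slide implies but does not state.

```python
rows_per_file = 32_720_048   # approximate per-file row count quoted on the slide
files = 5                    # assumption: one file per year over 5 years
total_rows = rows_per_file * files

years, months, elements, scales = 5, 12, 2, 3
result_rows = years * months * elements * scales

print(total_rows)    # 163600240, i.e. about 164M
print(result_rows)   # 360
```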
36. Working with Assemblies & Libraries
• Code-behind file
  • Convenient, simple
  • Assembly created and referenced automatically
  • No support for NuGet, but manually add references… OR:
• Class library
  • Right-click, register assembly to ADLA
  • Option to automatically copy to DLS
  • NuGet supported normally
37. Example: Simple Linear Regression to predict temp
.dlls for statistics library copied to Data Lake Store
CREATE ASSEMBLY from file; (Can also create from binary)
Custom C# class method signature: Noaa.Predict.Regress(int, SqlMap<int, decimal?>) : decimal?
38. Example: Simple Linear Regression to predict temp
.dlls for statistics library copied to Data Lake Store
CREATE ASSEMBLY from file; (Can also create from binary)
Custom C# class method signature: Noaa.Predict.Regress(int, SqlMap<int, decimal?>) : decimal?
MAP_AGG() function returns SQL.MAP – like a reverse EXPLODE, which we pass as function parameter
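The "reverse EXPLODE" behaviour of MAP_AGG() can be pictured in Python (an analogy only, not U-SQL): rows that share a grouping key are folded into a single map. The rows below are hypothetical values invented for illustration, and the function name is an assumption.

```python
from collections import defaultdict

# Hypothetical (month, year, avg_temp) rows standing in for the monthly averages.
rows = [
    (1, 2011, 70.1), (1, 2012, 71.3), (1, 2013, 69.8),
    (2, 2011, 72.4), (2, 2012, 73.0), (2, 2013, 72.9),
]

def map_agg(rows):
    """Group by month and fold each group's (year, temp) pairs into one map."""
    grouped = defaultdict(dict)
    for month, year, temp in rows:
        grouped[month][year] = temp
    return dict(grouped)

series_by_month = map_agg(rows)
for month, series in series_by_month.items():
    print(month, series)
```

Each resulting map is the kind of year-to-temperature series that the custom function receives as its second argument.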
41. .CS code-behind file
Return predicted Temp
Year to predict, e.g. 2016
SqlMap contains series against which regression is performed
Referenced Library namespace
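The code-behind file itself is not reproduced here, and the presenter's version delegates the fit to a referenced statistics library. As a stand-in, the Python sketch below fits a simple least-squares line by hand and evaluates it at the year to predict. The function mirrors the shape of the slide's signature (a year plus a year-to-temperature map), but the implementation and the sample series are assumptions.

```python
def regress(year_to_predict, series):
    """Fit y = slope * x + intercept by least squares and evaluate at one year.

    `series` maps year -> temperature. None values are skipped, echoing the
    nullable values in the method signature on the slide.
    """
    points = [(x, y) for x, y in series.items() if y is not None]
    if len(points) < 2:
        return None  # a line needs at least two points
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope * year_to_predict + intercept

# Hypothetical, perfectly linear series: one degree warmer each year.
series = {2011: 80.0, 2012: 81.0, 2013: 82.0, 2014: 83.0, 2015: 84.0}
prediction = regress(2016, series)
print(prediction)
```

Five yearly points per month is a very short series, so a prediction of this kind is best read as a demonstration of calling custom code, not as a forecast.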
42. DEMO #3
• Pivoting our existing averages on Month & aggregating Year & Temp into Key-Value pairs (SQL.MAP) which we pass as parameter to custom function
• Passing predictive year (2016) as a parameter
• Limit our selection to just AVG_TMAX_F
• Result adds another 12 rows of predicted temps to our existing 360 row result table
43. Additional homework subjects, not covered
• Extractor (UDO) by inheriting IExtractor
• IOutputter
• IProcessor – transform single row, read one, output one
• IReducer – read n rows, output 1 row
• ICombiner – like a user-defined Join
• IApplier – input one row, output n rows
• User-defined Aggregators (IAggregate) – AGG keyword
• ARRAY_AGG()
• Blob as External storage
• No Primary Keys
• No columnstore (yet)
• Table-Valued Functions - YES, but not with CROSS APPLY
• No support for R, but leverage .NET libraries as demo’d
• User-defined Types
• Partitioning by Hash, Direct Hash, Range
• http://www.slideshare.net/MichaelRys/usql-partitioned-data-and-tables-sqlbits-2016
45. Attribution
The Accord.NET Framework
Copyright (c) 2009-2014, César Roberto de Souza
<cesarsouza@gmail.com>
This library is free software; you can redistribute it and/or modify it under
the terms of the GNU Lesser General Public License as published by the
Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.
The copyright holders provide no reassurances that the source code
provided does not infringe any patent, copyright, or any other intellectual
property rights of third parties. The copyright holders disclaim any liability
to any recipient for claims brought against recipient by any third party
for infringement of that parties intellectual property rights.
This library is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.
National Oceanic and Atmospheric Administration (NOAA)
README FILE FOR DAILY GLOBAL HISTORICAL CLIMATOLOGY NETWORK
(GHCN-DAILY) Version 3.22
How to cite:
Note that the GHCN-Daily dataset itself now has a DOI (Digital Object Identifier) so it may be relevant to cite both the methods/overview journal article as well as the specific version of the dataset used.
The journal article describing GHCN-Daily is: Menne, M.J., I. Durre, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: An overview of the Global Historical Climatology Network-Daily Database. Journal of Atmospheric and Oceanic Technology, 29, 897-910, doi:10.1175/JTECH-D-11-00103.1.
To acknowledge the specific version of the dataset used, please cite: Menne, M.J., I. Durre, B. Korzeniewski, S. McNeal, K. Thomas, X. Yin, S. Anthony, R. Ray, R.S. Vose, B.E. Gleason, and T.G. Houston, 2012: Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. [indicate subset used following decimal, e.g. Version 3.12]. NOAA National Climatic Data Center. http://doi.org/10.7289/V5D21VHZ [access date].
46. Bibliography
• Campbell, C. “Top Five Differences between Data Lakes and Data Warehouses.” Business Insights. Blue Granite, 26 Jan 2015. Web. https://www.blue-granite.com/blog/bid/402596/Top-Five-Differences-between-Data-Lakes-and-Data-Warehouses
• Gopalan, R. (21 Jun 2016). U-SQL Part 4: Use custom code to extend U-SQL [Webinar]. PASS Big Data Virtual Chapter.
• Macauley, E. “Overview of Microsoft Azure Data Lake Analytics.” Microsoft Azure. Microsoft, 16 May 2016. Web. https://azure.microsoft.com/en-us/documentation/articles/data-lake-analytics-overview/
• Reddy, S. (31 May 2016). Introduction to Azure Data Lake [Webinar]. PASS Big Data Virtual Chapter.
• Rossello, Justin. “Querying Azure SQL Database from an Azure Data Lake Analytics U-SQL Script.” eat{Code}live. 21 Nov 2015. Web. http://eatcodelive.com/2015/11/21/querying-azure-sql-database-from-an-azure-data-lake-analytics-u-sql-script/
• Rouse, M. “Definition Data Lake.” SearchAws. TechTarget, May 2015. Web. http://searchaws.techtarget.com/definition/data-lake
• Rys, M. (8 Mar 2016). Introducing U-SQL; Part 2 of 2: Scaling U-SQL and doing SQL in U-SQL [Webinar]. PASS Big Data Virtual Chapter. Retrieved from http://www.youtube.com/channel/UCkOKmMW_LEsACOqE8C1RWdw
• Rys, M. (16 Feb 2016). Introducing U-SQL; Part 1 of 2: Introduction and C# extensibility [Webinar]. PASS Big Data Virtual Chapter. Retrieved from http://www.youtube.com/channel/UCkOKmMW_LEsACOqE8C1RWdw
• Rys, M., et al. Azure/usql, (2016), GitHub repository, https://github.com/Azure/usql
• “U-SQL Language Reference.” Microsoft Azure. Microsoft, 28 Oct 2015. Web. https://msdn.microsoft.com/en-US/library/azure/mt591959(Azure.100).aspx
Editor's Notes
Azure Portal – 1) Browse to create new resource, 2) Point out default Sample data, 3) Browse my own sample.csv, 4) Point out Upload option
Visual Studio – 1) Demo Server Explorer login & browse objects, 2) Point out Upload option