Killer Scenarios with Data Lake in Azure with U-SQL

U-SQL extensibility
Extend U-SQL with C#/.NET
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)

What are UDOs?
Custom operator extensions, scaled out by U-SQL
• User-Defined Extractors
• User-Defined Outputters
• User-Defined Processors
  • Take one row and produce one row
  • Pass-through versus transforming
• User-Defined Appliers
  • Take one row and produce 0 to n rows
  • Used with OUTER/CROSS APPLY
• User-Defined Combiners
  • Combine rowsets (like a user-defined join)
• User-Defined Reducers
  • Take n rows and produce m rows (normally m<n)
• Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
  • EXTRACT
  • OUTPUT
  • PROCESS
  • COMBINE
  • REDUCE
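The row-shape contracts listed for processors, appliers, and reducers can be sketched outside U-SQL. The following is a minimal Python illustration, not the U-SQL API: the function names `process`, `apply_rows`, and `reduce_rows` and the sample rows are hypothetical, and rows are modeled as plain dictionaries.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]  # assumption: a row is modeled as a plain dict


def process(rows: Iterable[Row], to_row: Callable[[Row], Row]) -> Iterator[Row]:
    """Processor shape: one row in, exactly one row out."""
    for row in rows:
        yield to_row(row)


def apply_rows(rows: Iterable[Row],
               to_rows: Callable[[Row], Iterable[Row]]) -> Iterator[Row]:
    """Applier shape: one row in, zero to n rows out."""
    for row in rows:
        yield from to_rows(row)


def reduce_rows(rows: Iterable[Row], key: str,
                to_rows: Callable[[List[Row]], Iterable[Row]]) -> Iterator[Row]:
    """Reducer shape: the n rows of each key group in, m rows out."""
    groups: Dict[Any, List[Row]] = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    for group in groups.values():
        yield from to_rows(group)


# Hypothetical sample rows, for illustration only.
rows = [
    {"user": "ABC", "tags": "a;b"},
    {"user": "XYZ", "tags": ""},
    {"user": "ABC", "tags": "c"},
]

processed = list(process(rows, lambda r: {**r, "user": r["user"].lower()}))
applied = list(apply_rows(
    rows,
    lambda r: ({"user": r["user"], "tag": t} for t in r["tags"].split(";") if t),
))
reduced = list(reduce_rows(
    rows, "user", lambda g: [{"user": g[0]["user"], "count": len(g)}]
))
```

Here the processor keeps the row count unchanged, the applier drops the row with no tags and expands the row with two, and the reducer emits one row per user.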

UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
  • By position
  • By name
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor since I want to pass in my own parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",",
                           Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Converting text to target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;
        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);
        }
        ...
    } // OutputValueAtCol_I

    public override IEnumerable<IRow> Extract(IUnstructuredReader input,
                                              IUpdatableRow outputrow)
    {
        // Each row arrives as a stream delimited by _row_delim
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
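The processing pattern of the extractor above (split the input on the row delimiter, split each row on the column delimiter, convert each cell to the type of its target column) can be sketched in plain Python. This is an illustration only and not the U-SQL interface: the `extract` function, the list-of-pairs schema, and the sample data are assumptions, and unlike the C# snippet this sketch skips empty rows.

```python
from typing import Any, Dict, Iterator, List, Tuple

Schema = List[Tuple[str, type]]  # assumption: (column name, column type) pairs


def extract(data: bytes, schema: Schema, row_delim: str = "\r\n",
            col_delim: str = ",", encoding: str = "utf-8") -> Iterator[Dict[str, Any]]:
    """Split bytes into rows and cells, converting each cell by schema position."""
    for raw_row in data.split(row_delim.encode(encoding)):
        if not raw_row:
            continue  # assumption: ignore empty rows such as a trailing delimiter
        cells = raw_row.decode(encoding).split(col_delim)
        row: Dict[str, Any] = {}
        for (name, col_type), cell in zip(schema, cells):
            row[name] = col_type(cell)  # e.g. int("1") for an int column
        yield row


# Hypothetical sample input, for illustration only.
sample = b"1,Ann\r\n2,Bob\r\n"
extracted = list(extract(sample, [("id", int), ("name", str)]))
```

Driving the conversion from the schema rather than from hard-coded column types is what lets one extractor serve different EXTRACT schemas.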

How to specify UDOs?
• Code behind
• C# Class Project for U-SQL
• Any .NET language usable
  • however, not first-class in tooling
  • Use U-SQL-specific .NET DLLs
  • Compile DLL, upload to ADLS, register with script

Managing Assemblies
• Create assemblies
  • CREATE ASSEMBLY db.assembly FROM @path;
  • CREATE ASSEMBLY db.assembly FROM byte[];
  • Can also include additional resource files
• Reference assemblies
  • REFERENCE ASSEMBLY db.assembly;
  • Referencing .NET Framework assemblies
    • Always accessible system namespaces:
      • U-SQL specific (e.g., for SQL.MAP)
      • All provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
    • Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerate assemblies
  • PowerShell command
  • U-SQL Studio Server Explorer
• Drop assemblies
  • DROP ASSEMBLY db.assembly;
• Visual Studio makes registration easy!

'USING' csharp_namespace
| Alias '=' csharp_namespace_or_class.
Examples:
DECLARE @input string = "somejsonfile.json";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;
@data0 =
    EXTRACT IPAddresses string
    FROM @input
    USING new JsonExtractor("Devices[*]");

USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];
@data1 =
    EXTRACT IPAddresses string
    FROM @input
    USING new json("Devices[*]");

Overlapping Range Aggregation
https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos
Input:
Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
8:00 AM - 9:00 AM - ABC
8:00 AM - 10:00 AM - ABC
10:00 AM - 2:00 PM - ABC
7:00 AM - 11:00 AM - ABC
9:00 AM - 11:00 AM - ABC
11:00 AM - 11:30 AM - ABC
11:40 PM - 11:59 PM - FOO
11:50 PM - 0:40 AM - FOO
Result after combining overlapping ranges per user:
Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
7:00 AM - 2:00 PM - ABC
11:40 PM - 0:40 AM - FOO

U-SQL:
@r = REDUCE @in
     PRESORT begin
     ON user
     PRODUCE begin DateTime
           , end DateTime
           , user string
     READONLY user
     USING new ReduceSample.RangeReducer();

Code behind:
namespace ReduceSample
{
    [SqlUserDefinedReducer(IsRecursive = true)]
    public class RangeReducer : IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            // Init aggregation values
            int i = 0;
            var begin = DateTime.MaxValue;
            var end = DateTime.MinValue;
            foreach (var row in input.Rows)
            {
                ...
                begin = row.Get<DateTime>("begin");
                end = row.Get<DateTime>("end");
                ...
                output.Set<DateTime>("begin", begin);
                output.Set<DateTime>("end", end);
                yield return output.AsReadOnly();
                ...
            } // foreach
        } // Reduce
    } // class RangeReducer
} // namespace ReduceSample
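The reducer body is elided on the slide, so the following is only one plausible way to combine overlapping ranges, sketched in plain Python rather than as a U-SQL reducer. It mirrors the REDUCE statement by grouping on user and visiting each group ordered by begin. The `merge_ranges` function and the `t` helper are hypothetical, and the calendar dates are an assumption, since the slide lists times only (the FOO range that ends at 0:40 AM is placed on the following day).

```python
from datetime import datetime
from itertools import groupby
from typing import Iterable, Iterator, Tuple

Range = Tuple[datetime, datetime, str]  # (begin, end, user)


def merge_ranges(rows: Iterable[Range]) -> Iterator[Range]:
    """Merge overlapping or touching ranges per user."""
    # Sorting by (user, begin) plays the role of ON user plus PRESORT begin.
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))
    for user, group in groupby(ordered, key=lambda r: r[2]):
        cur_begin = cur_end = None
        for begin, end, _ in group:
            if cur_begin is None:
                cur_begin, cur_end = begin, end          # first range of the group
            elif begin <= cur_end:
                cur_end = max(cur_end, end)              # overlaps: extend current range
            else:
                yield (cur_begin, cur_end, user)         # gap: emit and start a new range
                cur_begin, cur_end = begin, end
        if cur_begin is not None:
            yield (cur_begin, cur_end, user)


def t(hour: int, minute: int = 0, day: int = 27) -> datetime:
    """Hypothetical helper: a time on an assumed date in June 2016."""
    return datetime(2016, 6, day, hour, minute)


# The input ranges from the slide, with assumed dates.
ranges = [
    (t(5), t(6), "ABC"), (t(5), t(6), "XYZ"),
    (t(8), t(9), "ABC"), (t(8), t(10), "ABC"),
    (t(10), t(14), "ABC"), (t(7), t(11), "ABC"),
    (t(9), t(11), "ABC"), (t(11), t(11, 30), "ABC"),
    (t(23, 40), t(23, 59), "FOO"), (t(23, 50), t(0, 40, day=28), "FOO"),
]
merged = list(merge_ranges(ranges))
```

Because each group is visited in begin order, a single pass with one open range is enough; without that ordering the merge would need to compare every range against every other.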

JSON Processing
How do I extract data from JSON documents?
https://github.com/Azure/usql/tree/master/Examples/DataFormats

• Architecture of sample format assembly (diagram: Microsoft.Analytics.Samples.Formats, Newtonsoft.Json, System.Xml)
• Single JSON document per file: use JsonExtractor
• Multiple JSON documents per file:
  • Do not allow CR/LF (row delimiter) in JSON
  • Use built-in text extractor to extract
  • Use JsonTuple to schematize (with CROSS APPLY)
• Currently loads full JSON document into memory
  • Better to use JsonReader processing if docs are large

@json =
    EXTRACT personid int,
            name string,
            addresses string
    FROM @input
    USING new Json.JsonExtractor("[*].person");

@person =
    SELECT personid,
           name,
           Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
    FROM @json;

@addresses =
    SELECT personid,
           name,
           Json.JsonFunctions.JsonTuple(address) AS address
    FROM @person
         CROSS APPLY
         EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

@result =
    SELECT personid,
           name,
           address["addressid"] AS addressid,
           address["street"] AS street,
           address["postcode"] AS postcode,
           address["city"] AS city
    FROM @addresses;
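The shape of this query (one row per person, then one row per address of that person) can be illustrated in plain Python with the standard json module. This is a sketch of the flattening logic only, not of the sample format assembly. The input document is hypothetical: its layout is inferred from the path "[*].person" and the column names in the script, and all values are made up.

```python
import json
from typing import Any, Dict, Iterator

# Hypothetical input document; structure assumed from the U-SQL script above.
doc_text = """
[
  {"person": {"personid": 1, "name": "Ann",
              "addresses": {"address": [
                {"addressid": 10, "street": "1 Main St", "postcode": "11111", "city": "Springfield"},
                {"addressid": 11, "street": "2 Oak Ave", "postcode": "22222", "city": "Shelbyville"}
              ]}}},
  {"person": {"personid": 2, "name": "Bob",
              "addresses": {"address": []}}}
]
"""


def flatten(text: str) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per (person, address) pair."""
    for item in json.loads(text):
        person = item["person"]                               # like the "[*].person" path
        for address in person["addresses"]["address"]:        # like CROSS APPLY EXPLODE
            yield {
                "personid": person["personid"],
                "name": person["name"],
                "addressid": address["addressid"],
                "street": address["street"],
                "postcode": address["postcode"],
                "city": address["city"],
            }


result = list(flatten(doc_text))
```

As with CROSS APPLY, a person without addresses produces no output rows. Like the sample extractor, this sketch loads the whole document into memory.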

Image Processing
Copyright - Camera Make - Camera Model - Thumbnail
Michael - Canon - 70D - (thumbnail image)
Michael - Samsung - S7 - (thumbnail image)
https://github.com/Azure/usql/tree/master/Examples/ImageApp

• Image processing assembly
  • Uses System.Drawing
  • Exposes
    • Extractors
    • Outputter
    • Processor
    • User-defined functions
• Trade-offs
  • Column memory limits: Image Extractor vs Feature Extractor
  • Main memory pressures in vertex: UDFs vs Processor vs Extractor

U-SQL Processing with R
Architecture
• KMeansRReducer
• R to .NET interop (RDotNet.dll & RDotNet.NativeLib.dll)
• R Runtime (R-bin.zip)
• R Engine Manager Utility (RUtilities.dll)
Similar approaches can be used to deploy other runtimes: Python, JavaScript, JVM
No external access from UDOs
Future work:
• More generic samples
• More automatic experiences (no user wrappers/deploys)

What are UDOs?
Custom Operator Extensions written in .Net (C#)
Scaled out by U-SQL

UDO Tips and Warnings
• Tips when using UDOs:
  • READONLY clause to allow pushing predicates through UDOs
  • REQUIRED clause to allow column pruning through UDOs
  • PRESORT on REDUCE if you need global order
  • Hint cardinality if the optimizer chooses the wrong plan
• Warnings and better alternatives:
  • Use SELECT with UDFs instead of PROCESS
  • Use user-defined aggregators instead of REDUCE
  • Learn to use windowing functions (OVER expression)
• Good use cases for PROCESS/REDUCE/COMBINE:
  • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON doc for the data in the row where the columns are not known a priori.
  • Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO.
  • You need an ordered aggregator or need to produce more than one row per group.

http://usql.io
http://blogs.msdn.microsoft.com/azuredatalake/
http://blogs.msdn.microsoft.com/mrys/
https://channel9.msdn.com/Search?term=U-SQL#ch9Search
http://aka.ms/usql_reference
https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
https://msdn.microsoft.com/en-us/magazine/mt614251
http://aka.ms/adlfeedback
https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql
