Presentation from Microsoft Data Science Summit 2016
Presents 4 examples of custom U-SQL data processing: Overlapping Range Aggregation, JSON Processing, Image Processing and R with U-SQL
5. What are UDOs?
Custom operator extensions, scaled out by U-SQL
• User-Defined Extractors
• User-Defined Outputters
• User-Defined Processors
  Take one row and produce one row
  Pass-through versus transforming
• User-Defined Appliers
  Take one row and produce 0 to n rows
  Used with OUTER/CROSS APPLY
• User-Defined Combiners
  Combine rowsets (like a user-defined join)
• User-Defined Reducers
  Take n rows and produce m rows (normally m<n)
Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
EXTRACT, OUTPUT, PROCESS, COMBINE, REDUCE
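The row counts listed for processors, appliers and reducers can be pictured with plain LINQ operators. This is only an analogy over in-memory lists, not the U-SQL UDO interfaces; the class name UdoShapes, its methods and the sample strings are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UdoShapes
{
    // Processor-like shape: one row in, exactly one row out.
    public static IEnumerable<string> ProcessLike(IEnumerable<string> rows)
    {
        return rows.Select(r => r.ToUpperInvariant());
    }

    // Applier-like shape: one row in, 0 to n rows out (an empty row yields nothing).
    public static IEnumerable<string> ApplyLike(IEnumerable<string> rows)
    {
        return rows.SelectMany(
            r => r.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries));
    }

    // Reducer-like shape: n rows per key in, m rows out (here one count per key).
    public static IEnumerable<KeyValuePair<string, int>> ReduceLike(IEnumerable<string> rows)
    {
        return rows.GroupBy(r => r)
                   .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()));
    }

    public static void Main()
    {
        var rows = new[] { "a;b", "", "c" };
        Console.WriteLine(string.Join(",", ProcessLike(rows)));
        Console.WriteLine(string.Join(",", ApplyLike(rows)));
        foreach (var kv in ReduceLike(new[] { "x", "y", "x" }))
        {
            Console.WriteLine(kv.Key + "=" + kv.Value);
        }
    }
}
```

In U-SQL the scale-out and partitioning are handled by the engine; the sketch only shows how many rows each kind of operator emits per input.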
6. UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
  By position
  By name
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor to pass in custom parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Convert the text of column i to the type declared in the target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;
        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp); // set the result by position
        }
        ...
    } // OutputValueAtCol_I

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
    {
        // Split the input into rows, then split each row into columns
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
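The extractor above converts each text cell according to the column type in the output schema, but shows only the int case. The following is a standalone sketch of that idea outside of U-SQL. CellConverter, ConvertCell and ParseRow are hypothetical names, the list of handled types is an assumption, and none of this is the elided code from the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CellConverter
{
    // Hypothetical helper: convert one text cell to the CLR type of its column.
    public static object ConvertCell(string text, Type target)
    {
        if (target == typeof(string)) return text;
        if (target == typeof(int)) return int.Parse(text, CultureInfo.InvariantCulture);
        if (target == typeof(double)) return double.Parse(text, CultureInfo.InvariantCulture);
        if (target == typeof(DateTime)) return DateTime.Parse(text, CultureInfo.InvariantCulture);
        throw new NotSupportedException("Unsupported column type: " + target.Name);
    }

    // Split one line on the column delimiter and convert each cell by position.
    public static object[] ParseRow(string line, string colDelim, IList<Type> schema)
    {
        var cells = line.Split(new[] { colDelim }, StringSplitOptions.None);
        if (cells.Length != schema.Count)
        {
            throw new FormatException("Expected " + schema.Count + " columns, got " + cells.Length);
        }
        var values = new object[cells.Length];
        for (int i = 0; i < cells.Length; i++)
        {
            values[i] = ConvertCell(cells[i], schema[i]);
        }
        return values;
    }

    public static void Main()
    {
        var schema = new List<Type> { typeof(int), typeof(string), typeof(double) };
        var row = ParseRow("1,Bob,2.5", ",", schema);
        Console.WriteLine(row[0].GetType().Name + " " + row[1] + " " + row[2].GetType().Name);
    }
}
```

Inside a real extractor the converted value would be written with outputrow.Set instead of being collected in an array.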
8. How to specify UDOs?
C# Class Project for U-SQL
9. How to specify UDOs?
• Any .Net language is usable, but not first-class in tooling
• Use U-SQL specific .Net DLLs
• Compile the DLL, upload it to ADLS, register it with a script
10. Managing Assemblies
Create assemblies
• CREATE ASSEMBLY db.assembly FROM @path;
• CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
Reference assemblies
• REFERENCE ASSEMBLY db.assembly;
• Referencing .Net Framework assemblies
  • Always accessible system namespaces:
    • U-SQL specific (e.g., for SQL.MAP)
    • All provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
  • Add all other .Net Framework assemblies with:
    REFERENCE SYSTEM ASSEMBLY [System.XML];
Enumerate assemblies
• PowerShell command
• U-SQL Studio Server Explorer
Drop assemblies
• DROP ASSEMBLY db.assembly;
Visual Studio makes registration easy!
11. 'USING' csharp_namespace
| Alias '=' csharp_namespace_or_class.
Examples:
DECLARE @input string = "somejsonfile.json";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@data0 =
EXTRACT IPAddresses string
FROM @input
USING new JsonExtractor("Devices[*]");
USING json =
[Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];
@data1 =
EXTRACT IPAddresses string
FROM @input
USING new json("Devices[*]");
12. Overlapping Range Aggregation
Input:
Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
8:00 AM - 9:00 AM - ABC
8:00 AM - 10:00 AM - ABC
10:00 AM - 2:00 PM - ABC
7:00 AM - 11:00 AM - ABC
9:00 AM - 11:00 AM - ABC
11:00 AM - 11:30 AM - ABC
11:40 PM - 11:59 PM - FOO
11:50 PM - 0:40 AM - FOO
Result after combining overlapping ranges:
Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
7:00 AM - 2:00 PM - ABC
11:40 PM - 0:40 AM - FOO
https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos
13. Overlapping Range Aggregation
U-SQL:
@r = REDUCE @in
     PRESORT begin
     ON user
     PRODUCE begin DateTime
           , end DateTime
           , user string
     READONLY user
     USING new ReduceSample.RangeReducer();
14. Overlapping Range Aggregation
Code Behind:
namespace ReduceSample
{
    [SqlUserDefinedReducer(IsRecursive = true)]
    public class RangeReducer : IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            // Init aggregation values
            int i = 0;
            var begin = DateTime.MaxValue;
            var end = DateTime.MinValue;
            // Rows arrive per user, sorted by begin (PRESORT begin ON user)
            foreach (var row in input.Rows)
            {
                ...
                begin = row.Get<DateTime>("begin");
                end = row.Get<DateTime>("end");
                ...
                output.Set<DateTime>("begin", begin);
                output.Set<DateTime>("end", end);
                yield return output.AsReadOnly();
                ...
            } // foreach
        } // Reduce
    } // class RangeReducer
} // namespace ReduceSample
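The reducer body is elided on the slide. As a rough, standalone illustration of combining overlapping ranges over input that is already sorted by begin, the sketch below merges plain DateTime pairs. RangeMerge and Merge are hypothetical names, the code assumes every range has end >= begin (so the midnight wrap-around in the sample data is not handled), and it is not the code from the linked blog post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeMerge
{
    // Merge overlapping [begin, end] ranges.
    // Assumption: the input is sorted by begin, as PRESORT begin would provide.
    public static IEnumerable<Tuple<DateTime, DateTime>> Merge(
        IEnumerable<Tuple<DateTime, DateTime>> sorted)
    {
        bool haveRange = false;
        DateTime begin = DateTime.MaxValue;
        DateTime end = DateTime.MinValue;

        foreach (var r in sorted)
        {
            if (!haveRange)
            {
                // First row starts the current range.
                begin = r.Item1;
                end = r.Item2;
                haveRange = true;
            }
            else if (r.Item1 <= end)
            {
                // Overlaps or touches the current range: extend it if needed.
                if (r.Item2 > end) end = r.Item2;
            }
            else
            {
                // Gap found: emit the current range and start a new one.
                yield return Tuple.Create(begin, end);
                begin = r.Item1;
                end = r.Item2;
            }
        }
        if (haveRange) yield return Tuple.Create(begin, end);
    }

    public static void Main()
    {
        var day = new DateTime(2016, 1, 1);
        var input = new[]
        {
            Tuple.Create(day.AddHours(5), day.AddHours(6)),
            Tuple.Create(day.AddHours(7), day.AddHours(11)),
            Tuple.Create(day.AddHours(8), day.AddHours(10)),
            Tuple.Create(day.AddHours(10), day.AddHours(14)),
        };
        foreach (var r in Merge(input))
        {
            Console.WriteLine(r.Item1.ToString("HH:mm") + " - " + r.Item2.ToString("HH:mm"));
        }
    }
}
```

A single pass works only because the input is ordered by begin, which is the reason the U-SQL statement uses PRESORT begin.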
15. JSON Processing
How do I extract data from JSON documents?
https://github.com/Azure/usql/tree/master/Examples/DataFormats
16. JSON Processing
Architecture of Sample Format Assembly
Diagram: Microsoft.Analytics.Samples.Formats, Newtonsoft.Json, System.Xml
Single JSON document per file: use JsonExtractor
Multiple JSON documents per file:
• Do not allow CR/LF (row delimiter) in JSON
• Use built-in Text Extractor to extract
• Use JsonTuple to schematize (with CROSS APPLY)
Currently loads the full JSON document into memory;
better to use JSONReader processing if docs are large
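The multiple-documents-per-file pattern (one JSON document per line, then a flattening step into name/value pairs) can be sketched outside of U-SQL as follows. This swaps in System.Text.Json, which needs .NET Core 3.0 or later, in place of the sample assembly's JsonTuple and Newtonsoft.Json. JsonLines, ToTuple and ParseAll are hypothetical names, and the sketch assumes each line holds a JSON object.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class JsonLines
{
    // Flatten the top-level properties of one JSON object into name -> text.
    // Throws if the document root is not a JSON object.
    public static Dictionary<string, string> ToTuple(string json)
    {
        var map = new Dictionary<string, string>();
        using (var doc = JsonDocument.Parse(json))
        {
            foreach (var prop in doc.RootElement.EnumerateObject())
            {
                map[prop.Name] = prop.Value.ToString();
            }
        }
        return map;
    }

    // One JSON document per line: CR/LF acts as the row delimiter,
    // so the documents themselves must not contain line breaks.
    public static List<Dictionary<string, string>> ParseAll(string text)
    {
        var result = new List<Dictionary<string, string>>();
        var lines = text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var line in lines)
        {
            result.Add(ToTuple(line));
        }
        return result;
    }

    public static void Main()
    {
        var text = "{\"id\":1,\"name\":\"a\"}\r\n{\"id\":2,\"name\":\"b\"}";
        foreach (var doc in ParseAll(text))
        {
            Console.WriteLine(doc["id"] + ": " + doc["name"]);
        }
    }
}
```

Like the sample assembly, this loads each document fully into memory, so it does not address the large-document case mentioned on the slide.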
17. JSON Processing
@json =
    EXTRACT personid int,
            name string,
            addresses string
    FROM @input
    USING new Json.JsonExtractor("[*].person");
@person =
    SELECT personid,
           name,
           Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
    FROM @json;
@addresses =
    SELECT personid,
           name,
           Json.JsonFunctions.JsonTuple(address) AS address
    FROM @person
         CROSS APPLY
         EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);
@result =
    SELECT personid,
           name,
           address["addressid"] AS addressid,
           address["street"] AS street,
           address["postcode"] AS postcode,
           address["city"] AS city
    FROM @addresses;
21. U-SQL Processing with R
Architecture:
• KMeansRReducer
• R to .Net interop (RDotNet.dll & RDotNet.NativeLib.dll)
• R Runtime (R-bin.zip)
• R Engine Manager Utility (RUtilities.dll)
Similar approaches can be used to deploy other runtimes: Python, JavaScript, JVM
No external access from UDOs
Future work:
• More generic samples
• More automatic experiences (no user wrappers/deploys)
24. UDO Tips and Warnings
• Tips when using UDOs:
  READONLY clause to allow pushing predicates through UDOs
  REQUIRED clause to allow column pruning through UDOs
  PRESORT on REDUCE if you need global order
  Hint cardinality if the optimizer chooses the wrong plan
• Warnings and better alternatives:
  Use SELECT with UDFs instead of PROCESS
  Use User-Defined Aggregators instead of REDUCE
  Learn to use windowing functions (OVER expression)
• Good use cases for PROCESS/REDUCE/COMBINE:
  The logic needs to dynamically access the input and/or output schema.
  E.g., create a JSON doc for the data in the row where the columns are not known a priori.
  Your UDF-based solution creates too much memory pressure and you can write more memory-efficient code in a UDO.
  You need an ordered aggregator or need to produce more than 1 row per group.