3. What are UDOs?
User-Defined Extractors
User-Defined Outputters
User-Defined Processors
• Take one row and produce one row
• Pass-through versus transforming
User-Defined Appliers
• Take one row and produce 0 to n rows
• Used with OUTER/CROSS APPLY
User-Defined Combiners
• Combines rowsets (like a user-defined join)
User-Defined Reducers
• Take n rows and produce 1 row
Called with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
• EXTRACT
• OUTPUT
• PROCESS
• COMBINE
• REDUCE
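As a rough illustration of how these statements take a UDO instance, the script below sketches EXTRACT, PROCESS, and OUTPUT, each with a `USING new ...` clause. The assembly `MyAssembly`, the namespace `MyNamespace`, the classes `MyProcessor` and `MyOutputter`, the file paths, and the column names are hypothetical; `DriverExtractor` is assumed to be a user-defined extractor whose constructor accepts a `col_delim` parameter.

```usql
// Hypothetical assembly that contains the UDO classes used below.
REFERENCE ASSEMBLY MyAssembly;

// EXTRACT: the USING clause creates the extractor instance.
@drivers =
    EXTRACT driver_id int,
            name string
    FROM "/input/drivers.csv"
    USING new MyNamespace.DriverExtractor(col_delim: ",");

// PROCESS: the processor sees one input row and produces one output row.
@processed =
    PROCESS @drivers
    PRODUCE driver_id int,
            name string
    USING new MyNamespace.MyProcessor();

// OUTPUT: the outputter instance serializes the rowset to a file.
OUTPUT @processed
TO "/output/drivers.txt"
USING new MyNamespace.MyOutputter();
```

COMBINE and REDUCE follow the same pattern: the statement names the input rowset(s) and the produced schema, and the USING clause supplies the UDO instance.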
4. UDO/UDT Tips and Warnings
• Use:
• READONLY clause to allow pushing predicates through UDOs
• REQUIRED clause to allow column pruning through UDOs
• PRESORT (coming)
• Use SELECT with UDFs instead of PROCESS
• Use User-defined Aggregators instead of REDUCE
• Hint cardinality if you use CROSS APPLY and the optimizer chooses the wrong plan
• Learn to use Windowing Functions (OVER
expression)
• Use SQL.MAP and SQL.ARRAY instead of C#
Dictionary and array
• Some use cases for PROCESS/REDUCE/COMBINE:
• The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON document for the data in a row whose columns are not known a priori.
• Your UDF-based solution creates too much memory pressure, and you can write the code more memory-efficiently in a UDO.
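The READONLY and REQUIRED tips might look like the sketch below. It assumes a hypothetical processor `MyNamespace.NameCleaner` in a hypothetical assembly `MyAssembly` that reads and rewrites only the `name` column; the rowset, its columns, and the output path are illustrative as well.

```usql
REFERENCE ASSEMBLY MyAssembly;   // hypothetical assembly containing NameCleaner

// Small inline rowset so the sketch does not depend on an input file.
@drivers =
    SELECT *
    FROM (VALUES (1, "west", " ann "),
                 (200, "east", "Bob")) AS T(driver_id, region, name);

@cleaned =
    PROCESS @drivers
    PRODUCE driver_id int,
            region string,
            name string
    READONLY driver_id, region   // declared as passed through unchanged by the UDO
    REQUIRED name                // declared as the only column the UDO depends on
    USING new MyNamespace.NameCleaner();

// With driver_id declared READONLY, the optimizer is free to evaluate
// this predicate before the UDO runs instead of after it.
@filtered =
    SELECT driver_id, region, name
    FROM @cleaned
    WHERE driver_id > 100;

OUTPUT @filtered
TO "/output/cleaned.csv"
USING Outputters.Csv();
```

Without these clauses the optimizer has to treat the UDO as a black box that may read and change every column, which blocks both predicate pushdown and column pruning.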
5. What are UDFs and UDAGGs?
• UDFs are user-defined C# scalar
functions that can be called like any
scalar C# function
• UDAGGs are user-defined aggregators
• Called by special syntax AGG<…>
• Enables templatized user-defined aggregators
• UDFs, UDAGGs and UDOs must be
provided by a referenced assembly
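A minimal sketch of the calling side for a UDF, assuming a hypothetical assembly `MyAssembly` that exposes a static C# method `MyNamespace.MyFunctions.Normalize(string)`. The rowset and the output path are illustrative.

```usql
REFERENCE ASSEMBLY MyAssembly;   // hypothetical assembly containing the UDF

@drivers =
    SELECT *
    FROM (VALUES (1, "Ann"),
                 (2, "")) AS T(driver_id, name);

// The UDF is called by its fully qualified C# name, like any other
// C# expression in the SELECT list; the WHERE clause uses a built-in
// .NET method in the same way.
@normalized =
    SELECT driver_id,
           MyNamespace.MyFunctions.Normalize(name) AS name
    FROM @drivers
    WHERE !String.IsNullOrEmpty(name);

OUTPUT @normalized
TO "/output/normalized.csv"
USING Outputters.Csv();
```

This is also the shape suggested by the tip to prefer SELECT with UDFs over PROCESS when the logic is a per-column scalar computation.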
7. UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing
pattern
• Rowsets and their schemas
in UDOs
• Setting results
• By position
• By name
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor since I want to pass in my own parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Converting text to target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;
        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);   // set the result by position
        }
        ...
    } // OutputValueAtCol_I

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
    {
        // Split the input into rows, then split each row into columns
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
8. UDAGG model
• UDAGG extends the abstract
IAggregate class
• Requires implementation
of Init(), Accumulate(),
and Terminate() methods
• Can have multiple
arguments
• Can be generic
• Called with special syntax
to provide support for
generic UDAGGs
public class MyCountAggregate : IAggregate<int, long>
{
    // long rather than int, so the running total matches the result type
    private long count;
    public override void Init() { count = 0; }
    // Adds each input value to the running total
    public override void Accumulate(int i) { count += i; }
    public override long Terminate() { return count; }
}

public class MyTwoArgAggregate : IAggregate<string, long, int>
{
    public override void Init() {…}
    public override void Accumulate(string s, long l) {…}
    public override int Terminate() {…}
}

public class GenericListAggregate<T1, TResult> : IAggregate<T1, TResult>
    where TResult : IList<T1>, new()
{
    private TResult result;
    public override void Init() { this.result = new TResult(); }
    public override void Accumulate(T1 t1) { this.result.Add(t1); }
    public override TResult Terminate() { return this.result; }
}
SELECT AGG<MyNamespace.MyCountAggregate>(a) AS ms FROM @X;
C# is the extension story for U-SQL
Expressions in SELECT statement
User-defined operators (UDOs)
User-defined functions (UDFs)
User-defined aggregates (UDAGGs)
User-defined types (UDTs)
UDOs are central to U-SQL user experience
UDFs, UDAGGs, UDOs and UDTs require assemblies to be registered (one-time cost, fixed assembly version)
UDFs, UDAGGs, UDOs and UDTs are automatically available after the assembly is referenced in a script
One version of assembly per database
Assemblies with the same short name are not allowed
Tooling provides a code-behind and auto-deploy experience
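The registration and referencing steps could be sketched as follows when done by hand rather than through the tooling. The database name `MyDatabase`, the assembly name `MyAssembly`, and the DLL path are hypothetical.

```usql
// One-time registration script (names and path are hypothetical).
USE DATABASE MyDatabase;

// Dropping first is one way to replace the single registered version.
DROP ASSEMBLY IF EXISTS MyAssembly;
CREATE ASSEMBLY MyAssembly FROM "/Assemblies/MyAssembly.dll";

// In every later script that uses the code: reference the registered
// assembly, optionally qualified with its database name.
REFERENCE ASSEMBLY MyDatabase.MyAssembly;
```

After the REFERENCE ASSEMBLY statement, the UDFs, UDAGGs, UDOs and UDTs in the assembly can be used by their fully qualified C# names.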