U-SQL User-Defined Operators (UDOs) (SQLBits 2016)


U-SQL User-Defined Operators (UDOs) (SQLBits 2016 ADL/U-SQL Pre-Conference)
UDO model, User-defined Aggregators (UDAGGs)

Published in: Data & Analytics

  1. Michael Rys, Principal Program Manager, Big Data @ Microsoft
     @MikeDoesBigData, {mrys, usql}
     U-SQL User-Defined Operators (UDOs)
  2. Extend U-SQL with C#/.NET
     • Built-in operators, functions, aggregates
     • C# expressions (in SELECT expressions)
     • User-defined aggregates (UDAGGs)
     • User-defined functions (UDFs)
     • User-defined operators (UDOs)
  3. What are UDOs?
     • User-Defined Extractors
     • User-Defined Outputters
     • User-Defined Processors
       • Take one row and produce one row
       • Pass-through versus transforming
     • User-Defined Appliers
       • Take one row and produce 0 to n rows
       • Used with OUTER/CROSS APPLY
     • User-Defined Combiners
       • Combine rowsets (like a user-defined join)
     • User-Defined Reducers
       • Take n rows and produce 1 row
     UDOs are called with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
     • EXTRACT
     • OUTPUT
     • PROCESS
     • COMBINE
     • REDUCE
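As an illustration of the applier shape from slide 3 ("take one row and produce 0 to n rows"), here is a minimal sketch. The class name `TagSplitApplier`, the constructor parameter, and the output column name `tag` are hypothetical and not from the deck; the sketch assumes the `IApplier` base class and `[SqlUserDefinedApplier]` attribute from `Microsoft.Analytics.Interfaces`.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

// Hypothetical applier: splits a comma-separated string column into one row per value.
[SqlUserDefinedApplier]
public class TagSplitApplier : IApplier
{
    private readonly string _column;

    // The name of the input column to split is passed in as a parameter.
    public TagSplitApplier(string column)
    {
        _column = column;
    }

    public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)
    {
        string tags = input.Get<string>(_column);

        // A null or empty value produces zero rows.
        if (string.IsNullOrEmpty(tags))
        {
            yield break;
        }

        foreach (var tag in tags.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Assumption: the applier's result schema has a string column named "tag".
            output.Set<string>("tag", tag.Trim());
            yield return output.AsReadOnly();
        }
    }
}
```

An instance of such a class would be supplied to a CROSS APPLY or OUTER APPLY expression, with the result schema declared on the U-SQL side.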
  4. UDO/UDT Tips and Warnings
     • Use:
       • READONLY clause to allow pushing predicates through UDOs
       • REQUIRED clause to allow column pruning through UDOs
       • PRESORT (coming)
     • Use SELECT with UDFs instead of PROCESS
     • Use User-defined Aggregators instead of REDUCE
     • Hint cardinality if you use CROSS APPLY and the optimizer chooses the wrong plan
     • Learn to use windowing functions (OVER expression)
     • Use SQL.MAP and SQL.ARRAY instead of C# Dictionary and array
     • Some use cases for PROCESS/REDUCE/COMBINE:
       • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON document for the data in the row where the columns are not known a priori.
       • Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO.
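The dynamic-schema use case from slide 4 (building a JSON document from a row whose columns are not known a priori) could look roughly like the following sketch. The class name `RowToJsonProcessor`, the output column name `json`, and the `Escape` helper are hypothetical; the escaping is deliberately naive and every value is rendered as a string, so this is an illustration of schema access rather than a complete JSON serializer.

```csharp
using System.Text;
using Microsoft.Analytics.Interfaces;

// Hypothetical processor: serializes all input columns into one JSON-like string.
[SqlUserDefinedProcessor]
public class RowToJsonProcessor : IProcessor
{
    public override IRow Process(IRow input, IUpdatableRow output)
    {
        var json = new StringBuilder("{");
        ISchema schema = input.Schema; // the input schema is inspected at run time

        for (int i = 0; i < schema.Count; i++)
        {
            if (i > 0)
            {
                json.Append(',');
            }

            object value = input.Get<object>(i);
            json.Append('"').Append(Escape(schema[i].Name)).Append("\":");

            if (value == null)
            {
                json.Append("null");
            }
            else
            {
                // Simplification: every non-null value is written as a JSON string.
                json.Append('"').Append(Escape(value.ToString())).Append('"');
            }
        }

        json.Append('}');

        // Assumption: the PRODUCE clause declares a string column named "json".
        output.Set<string>("json", json.ToString());
        return output.AsReadOnly();
    }

    // Naive escaping that handles only backslashes and double quotes.
    private static string Escape(string s)
    {
        return s.Replace("\\", "\\\\").Replace("\"", "\\\"");
    }
}
```

Because the loop reads column names and values through `input.Schema`, the same class works for any input rowset, which is the kind of logic a SELECT with UDFs cannot express.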
  5. What are UDFs and UDAGGs?
     • UDFs are user-defined C# scalar functions that can be called like any scalar C# function
     • UDAGGs are user-defined aggregators
       • Called by special syntax AGG<…>
       • Enables templatized user-defined aggregators
     • UDFs, UDAGGs and UDOs must be provided by a referenced assembly
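For comparison with the UDO examples, a UDF in the sense of slide 5 is just a static C# method in a referenced assembly. The sketch below uses hypothetical names (`MyNamespace`, `StringFunctions`, `NormalizeName`) that do not appear in the deck.

```csharp
namespace MyNamespace
{
    // Hypothetical container class for scalar UDFs.
    public static class StringFunctions
    {
        // Trims surrounding whitespace and upper-cases the value; null passes through unchanged.
        public static string NormalizeName(string raw)
        {
            return raw == null ? null : raw.Trim().ToUpperInvariant();
        }
    }
}
```

Once the assembly is referenced, a U-SQL SELECT expression would call the method by its fully qualified name, for example `MyNamespace.StringFunctions.NormalizeName(name)`, where `name` is an assumed column.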
  6. UDO model
     • Marking UDOs
     • Parameterizing UDOs
     • UDO signature
     • UDO-specific processing pattern
     • Rowsets and their schemas in UDOs
     • Setting results
       • By position
       • By name

         using System;
         using System.Collections.Generic;
         using System.IO;
         using System.Text;
         using Microsoft.Analytics.Interfaces;

         [SqlUserDefinedExtractor]
         public class DriverExtractor : IExtractor
         {
             private byte[] _row_delim;
             private string _col_delim;
             private Encoding _encoding;

             // Define a non-default constructor since I want to pass in my own parameters
             public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
             {
                 _encoding = encoding == null ? Encoding.UTF8 : encoding;
                 _row_delim = _encoding.GetBytes(row_delim);
                 _col_delim = col_delim;
             } // DriverExtractor

             // Converting text to target schema
             private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
             {
                 var schema = outputrow.Schema;
                 if (schema[i].Type == typeof(int))
                 {
                     var tmp = Convert.ToInt32(c);
                     outputrow.Set(i, tmp);
                 }
                 ...
             } // OutputValueAtCol_I

             public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
             {
                 // Split the input into rows, then each row into columns
                 foreach (var row in input.Split(_row_delim))
                 {
                     using (var s = new StreamReader(row, _encoding))
                     {
                         int i = 0;
                         foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                         {
                             OutputValueAtCol_I(c, i++, outputrow);
                         } // foreach
                     } // using
                     yield return outputrow.AsReadOnly();
                 } // foreach
             } // Extract
         } // class DriverExtractor
  7. UDAGG model
     • UDAGG extends the IAggregate interface
     • Requires implementation of Init(), Accumulate(), and Terminate() methods
     • Can have multiple arguments
     • Can be generic
     • Called with special syntax to provide support for generic UDAGGs

         public class MyCountAggregate : IAggregate<int, long>
         {
             private int count;
             public override void Init() { count = 0; }
             public override void Accumulate(int i) { count += i; }
             public override long Terminate() { return count; }
         }

         public class MyTwoArgAggregate : IAggregate<string, long, int>
         {
             public override void Init() {…}
             public override void Accumulate(string s, long l) {…}
             public override int Terminate() {…}
         }

         public class GenericListAggregate<T1, TResult> : IAggregate<T1, TResult>
             where TResult : IList<T1>, new()
         {
             private TResult result;
             public override void Init() { this.result = new TResult(); }
             public override void Accumulate(T1 t1) { this.result.Add(t1); }
             public override TResult Terminate() { return this.result; }
         }

     U-SQL call:

         SELECT AGG<MyNamespace.MyCountAggregate>(a) AS ms FROM @X;
  8. Additional Resources
     • Documentation
       • U-SQL UDO Expressions: us/library/azure/mt621319.aspx
       • U-SQL OUTPUT Statement: us/library/azure/mt621334.aspx
       • U-SQL UDO Programmer’s Guide: Under development
       • U-SQL Performance Presentation: and-performance-tuning
     • Sample Projects
       • ceDemos/AmbulanceDemos/2-Ambulance-Structured%20Data alysis