Killer Scenarios with Data Lake in Azure with U-SQL

Presentation from Microsoft Data Science Summit 2016
Presents 4 examples of custom U-SQL data processing: Overlapping Range Aggregation, JSON Processing, Image Processing and R with U-SQL

Published in: Data & Analytics

  1. U-SQL extensibility: Extend U-SQL with C#/.NET
     • Built-in operators, functions, aggregates
     • C# expressions (in SELECT expressions)
     • User-defined aggregates (UDAGGs)
     • User-defined functions (UDFs)
     • User-defined operators (UDOs)
  2. What are UDOs? Custom operator extensions, scaled out by U-SQL
     • User-Defined Extractors
     • User-Defined Outputters
     • User-Defined Processors
       • Take one row and produce one row
       • Pass-through versus transforming
     • User-Defined Appliers
       • Take one row and produce 0 to n rows
       • Used with OUTER/CROSS APPLY
     • User-Defined Combiners
       • Combine rowsets (like a user-defined join)
     • User-Defined Reducers
       • Take n rows and produce m rows (normally m<n)
     • Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution): EXTRACT, OUTPUT, PROCESS, COMBINE, REDUCE
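
The row-count contracts listed above (one row in and one row out, one row in and 0 to n rows out, n rows in and m rows out) can be illustrated outside U-SQL. The following is a minimal sketch in plain Python, not U-SQL and not the U-SQL C# interfaces; the function names and the sample rows are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def process(rows):
    """Processor-style: exactly one output row per input row."""
    for row in rows:
        yield {**row, "name": row["name"].upper()}

def apply_tags(row):
    """Applier-style: 0 to n output rows for a single input row."""
    for tag in row["tags"]:
        yield {"name": row["name"], "tag": tag}

def reduce_by_name(rows):
    """Reducer-style: n rows per group in, m rows out (here one row per group)."""
    ordered = sorted(rows, key=itemgetter("name"))
    for name, group in groupby(ordered, key=itemgetter("name")):
        yield {"name": name, "tag_count": sum(1 for _ in group)}

# Hypothetical sample rows
rows = [
    {"name": "abc", "tags": ["x", "y"]},
    {"name": "xyz", "tags": []},
]
processed = list(process(rows))
applied = [out for row in processed for out in apply_tags(row)]
reduced = list(reduce_by_name(applied))
```

The row with an empty tag list yields no applied rows, which is the "0 rows" case of an applier; U-SQL itself decides how such operators are partitioned and scaled out.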
  3. UDO model
     • Marking UDOs
     • Parameterizing UDOs
     • UDO signature
     • UDO-specific processing pattern
     • Rowsets and their schemas in UDOs
     • Setting results
       • By position
       • By name

     using System;
     using System.Collections.Generic;
     using System.IO;
     using System.Text;
     using Microsoft.Analytics.Interfaces;

     [SqlUserDefinedExtractor]
     public class DriverExtractor : IExtractor
     {
         private byte[] _row_delim;
         private string _col_delim;
         private Encoding _encoding;

         // Define a non-default constructor since I want to pass in my own parameters
         public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
         {
             _encoding = encoding == null ? Encoding.UTF8 : encoding;
             _row_delim = _encoding.GetBytes(row_delim);
             _col_delim = col_delim;
         } // DriverExtractor

         // Converting text to target schema
         private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
         {
             var schema = outputrow.Schema;
             if (schema[i].Type == typeof(int))
             {
                 var tmp = Convert.ToInt32(c);
                 outputrow.Set(i, tmp); // set the result by position
             }
             ...
         } // OutputValueAtCol_I

         public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
         {
             // Split the input stream into rows, then each row into columns
             foreach (var row in input.Split(_row_delim))
             {
                 using (var s = new StreamReader(row, _encoding))
                 {
                     int i = 0;
                     foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                     {
                         OutputValueAtCol_I(c, i++, outputrow);
                     } // foreach
                 } // using
                 yield return outputrow.AsReadOnly();
             } // foreach
         } // Extract
     } // class DriverExtractor
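
The extractor pattern on this slide (split the byte stream on a row delimiter, split each row on a column delimiter, convert each value to the type the target schema expects) can be tried out in isolation. The following is a rough Python sketch of that pattern only, written in Python rather than C# so it runs without the U-SQL SDK. The schema, the column names and the sample data are hypothetical, and skipping the empty piece after a trailing delimiter is an assumption of this sketch, not something the slide's code does.

```python
import io

def extract(stream, schema, row_delim="\r\n", col_delim=",", encoding="utf-8"):
    """Yield one dict per row, converting each column by the schema's type."""
    text = stream.read().decode(encoding)
    for line in text.split(row_delim):
        if not line:
            continue  # assumption: ignore the empty piece after a trailing delimiter
        values = line.split(col_delim)
        # schema is a list of (column name, converter) pairs, applied by position
        yield {name: convert(value) for (name, convert), value in zip(schema, values)}

# Hypothetical schema and input
schema = [("driver_id", int), ("name", str), ("street", str)]
data = io.BytesIO(b"1,Alice,Main St\r\n2,Bob,Oak Ave\r\n")
rows = list(extract(data, schema))
```

Pairing each column with a converter plays the role of the `schema[i].Type` check in the C# code: the conversion is driven by the output schema rather than hard-coded per column.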
  4. How to specify UDOs? Code-behind
  5. How to specify UDOs? C# Class Project for U-SQL
  6. How to specify UDOs? Any .NET language
     • Any .NET language is usable
       • However, not first-class in tooling
       • Use U-SQL specific .NET DLLs
       • Compile DLL, upload to ADLS, register with script
  7. Managing Assemblies
     • Create assemblies
       • CREATE ASSEMBLY db.assembly FROM @path;
       • CREATE ASSEMBLY db.assembly FROM byte[];
       • Can also include additional resource files
     • Reference assemblies
       • REFERENCE ASSEMBLY db.assembly;
       • Referencing .NET Framework assemblies
         • Always accessible system namespaces:
           • U-SQL specific (e.g., for SQL.MAP)
           • All provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
         • Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
     • Enumerate assemblies
       • PowerShell command
       • U-SQL Studio Server Explorer
     • Drop assemblies
       • DROP ASSEMBLY db.assembly;
     • Visual Studio makes registration easy!
  8. 'USING' csharp_namespace | Alias '=' csharp_namespace_or_class. Examples:

     DECLARE @input string = "somejsonfile.json";

     REFERENCE ASSEMBLY [Newtonsoft.Json];
     REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

     USING Microsoft.Analytics.Samples.Formats.Json;

     @data0 =
         EXTRACT IPAddresses string
         FROM @input
         USING new JsonExtractor("Devices[*]");

     USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];

     @data1 =
         EXTRACT IPAddresses string
         FROM @input
         USING new json("Devices[*]");
  9. Overlapping Range Aggregation
     https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos

     Input:
       Start Time - End Time - User Name
       5:00 AM - 6:00 AM - ABC
       5:00 AM - 6:00 AM - XYZ
       8:00 AM - 9:00 AM - ABC
       8:00 AM - 10:00 AM - ABC
       10:00 AM - 2:00 PM - ABC
       7:00 AM - 11:00 AM - ABC
       9:00 AM - 11:00 AM - ABC
       11:00 AM - 11:30 AM - ABC
       11:40 PM - 11:59 PM - FOO
       11:50 PM - 0:40 AM - FOO

     Output:
       Start Time - End Time - User Name
       5:00 AM - 6:00 AM - ABC
       5:00 AM - 6:00 AM - XYZ
       7:00 AM - 2:00 PM - ABC
       11:40 PM - 0:40 AM - FOO
  10. Overlapping Range Aggregation: U-SQL

      @r =
          REDUCE @in
          PRESORT begin
          ON user
          PRODUCE begin DateTime,
                  end DateTime,
                  user string
          READONLY user
          USING new ReduceSample.RangeReducer();
  11. Overlapping Range Aggregation: Code-behind

      using System;
      using System.Collections.Generic;
      using Microsoft.Analytics.Interfaces;

      namespace ReduceSample
      {
          [SqlUserDefinedReducer(IsRecursive = true)]
          public class RangeReducer : IReducer
          {
              public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
              {
                  // Init aggregation values
                  int i = 0;
                  var begin = DateTime.MaxValue;
                  var end = DateTime.MinValue;

                  foreach (var row in input.Rows)
                  {
                      ...
                      begin = row.Get<DateTime>("begin");
                      end = row.Get<DateTime>("end");
                      ...
                      // Set results by name and emit the merged range
                      output.Set<DateTime>("begin", begin);
                      output.Set<DateTime>("end", end);
                      yield return output.AsReadOnly();
                      ...
                  } // foreach
              } // Reduce
          } // class RangeReducer
      } // namespace ReduceSample
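
The slide elides the merge logic itself. One common way to combine overlapping ranges, once the rows for a user are sorted by begin time (which is what PRESORT provides to the reducer), is a single pass that extends the current range while the next row starts before it ends. The following is a minimal Python sketch of that idea, not the code from the talk; `merge_ranges` and the helper `t` are hypothetical names, and the calendar date is arbitrary because the slide shows only times of day.

```python
from datetime import datetime
from itertools import groupby

def merge_ranges(rows):
    """rows: iterable of (begin, end, user). Yields merged (begin, end, user)."""
    # Group by user, and within each user visit ranges in begin order
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))
    for user, group in groupby(ordered, key=lambda r: r[2]):
        cur_begin = cur_end = None
        for begin, end, _ in group:
            if cur_begin is None:
                cur_begin, cur_end = begin, end
            elif begin <= cur_end:
                # Overlapping or touching range: extend the current one
                cur_end = max(cur_end, end)
            else:
                # Gap: emit the finished range and start a new one
                yield (cur_begin, cur_end, user)
                cur_begin, cur_end = begin, end
        yield (cur_begin, cur_end, user)

def t(hour, minute=0, day=1):
    # Arbitrary date; day=2 models a range that ends after midnight
    return datetime(2016, 1, day, hour, minute)

rows = [
    (t(5), t(6), "ABC"), (t(5), t(6), "XYZ"),
    (t(8), t(9), "ABC"), (t(8), t(10), "ABC"),
    (t(10), t(14), "ABC"), (t(7), t(11), "ABC"),
    (t(9), t(11), "ABC"), (t(11), t(11, 30), "ABC"),
    (t(23, 40), t(23, 59), "FOO"), (t(23, 50), t(0, 40, day=2), "FOO"),
]
merged = list(merge_ranges(rows))
```

With the input rows from slide 9 this produces the four ranges of the output table, ordered by user name. Treating touching ranges (an end equal to the next begin) as overlapping is a choice of this sketch.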
  12. JSON Processing: How do I extract data from JSON documents?
      https://github.com/Azure/usql/tree/master/Examples/DataFormats
  13. JSON Processing
      • Architecture of the sample format assembly: Microsoft.Analytics.Samples.Formats, Newtonsoft.Json, System.Xml
      • Single JSON document per file: use JsonExtractor
      • Multiple JSON documents per file:
        • Do not allow CR/LF (row delimiter) in JSON
        • Use built-in Text Extractor to extract
        • Use JsonTuple to schematize (with CROSS APPLY)
      • Currently loads full JSON document into memory
        • Better to use JSONReader processing if docs are large
  14. JSON Processing

      @json =
          EXTRACT personid int,
                  name string,
                  addresses string
          FROM @input
          USING new Json.JsonExtractor("[*].person");

      @person =
          SELECT personid,
                 name,
                 Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
          FROM @json;

      @addresses =
          SELECT personid,
                 name,
                 Json.JsonFunctions.JsonTuple(address) AS address
          FROM @person
               CROSS APPLY
                   EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

      @result =
          SELECT personid,
                 name,
                 address["addressid"] AS addressid,
                 address["street"] AS street,
                 address["postcode"] AS postcode,
                 address["city"] AS city
          FROM @addresses;
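
The script above flattens a nested document into one row per person and address. To make the intended result shape concrete, here is a small Python sketch of the same flattening. The sample document is hypothetical: its structure is inferred from the paths used in the script (`[*].person`, `addresses`, `address`, and the four address fields), not taken from the talk.

```python
import json

# Hypothetical input, shaped after the paths used in the U-SQL script
doc = json.loads("""
[
  {"person": {"personid": 1, "name": "Ann",
              "addresses": {"address": [
                {"addressid": 10, "street": "1 Main St",
                 "postcode": "98052", "city": "Redmond"},
                {"addressid": 11, "street": "2 Oak Ave",
                 "postcode": "98101", "city": "Seattle"}]}}},
  {"person": {"personid": 2, "name": "Bo",
              "addresses": {"address": []}}}
]
""")

def flatten(people):
    """Yield one flat row per (person, address) pair."""
    for entry in people:
        person = entry["person"]
        # Like CROSS APPLY, a person without addresses yields no rows
        for address in person["addresses"]["address"]:
            yield {
                "personid": person["personid"],
                "name": person["name"],
                "addressid": address["addressid"],
                "street": address["street"],
                "postcode": address["postcode"],
                "city": address["city"],
            }

result = list(flatten(doc))
```

The inner loop corresponds to the CROSS APPLY EXPLODE step, and the dictionary literal corresponds to the final SELECT that projects the address fields into columns.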
  15. Image Processing
      https://github.com/Azure/usql/tree/master/Examples/ImageApp

      Copyright - Camera Make - Camera Model - Thumbnail
      Michael - Canon - 70D - (image)
      Michael - Samsung - S7 - (image)
  16. Image Processing
      • Image processing assembly
        • Uses System.Drawing
        • Exposes
          • Extractors
          • Outputter
          • Processor
          • User-defined Functions
      • Trade-offs
        • Column memory limits: Image Extractor vs Feature Extractor
        • Main memory pressures in vertex: UDFs vs Processor vs Extractor
  17. R Processing: KMeans Centroids
  18. U-SQL Processing with R: Architecture
      • KMeansRReducer
      • R to .NET interop (RDotNet.dll & RDotNet.NativeLib.dll)
      • R Runtime (R-bin.zip)
      • R Engine Manager Utility (RUtilities.dll)
      • Similar approaches can be used to deploy other runtimes: Python, JavaScript, JVM
      • No external access from UDOs
      • Future work:
        • More generic samples
        • More automatic experiences (no user wrappers/deploys)
  19. What are UDOs? Custom operator extensions written in .NET (C#), scaled out by U-SQL
  20. UDO Tips and Warnings
      • Tips when using UDOs:
        • READONLY clause to allow pushing predicates through UDOs
        • REQUIRED clause to allow column pruning through UDOs
        • PRESORT on REDUCE if you need global order
        • Hint cardinality if the optimizer chooses the wrong plan
      • Warnings and better alternatives:
        • Use SELECT with UDFs instead of PROCESS
        • Use user-defined aggregators instead of REDUCE
        • Learn to use windowing functions (OVER expression)
      • Good use cases for PROCESS/REDUCE/COMBINE:
        • The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON document for the data in the row where the columns are not known a priori.
        • Your UDF-based solution creates too much memory pressure, and you can write the code more memory-efficiently in a UDO.
        • You need an ordered aggregator, or need to produce more than one row per group.
  21. Resources:
      • http://usql.io
      • http://blogs.msdn.microsoft.com/azuredatalake/
      • http://blogs.msdn.microsoft.com/mrys/
      • https://channel9.msdn.com/Search?term=U-SQL#ch9Search
      • http://aka.ms/usql_reference
      • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
      • https://msdn.microsoft.com/en-us/magazine/mt614251
      • http://aka.ms/adlfeedback
      • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
      • http://stackoverflow.com/questions/tagged/u-sql
