Microsoft Data Science Summit
Sept 26 – 27 | Atlanta, GA

Killer Scenarios with Data Lake in Azure with U-SQL
Michael Rys
Principal Program Manager, Big Data
http://aka.ms/azuredatalake
Agenda
• Today (BR013): Killer extensibility in Azure Data Lake with U-SQL
   Custom rowset aggregation
   How to do JSON processing
   Image processing
   How to call R from U-SQL
• Yesterday (BR014): Introduction to Azure Data Lake and U-SQL
   What is Azure Data Lake?
   Why U-SQL?
   Core concepts
      Schema on read on file and file sets
      C# extensibility
      SQL with U-SQL
      Script level execution and optimization
   Tool usage
U-SQL extensibility
Extend U-SQL with C#/.NET
• Built-in operators, functions, aggregates
• C# expressions (in SELECT expressions)
• User-defined aggregates (UDAGGs)
• User-defined functions (UDFs)
• User-defined operators (UDOs)
What are UDOs?
Custom Operator Extensions
Scaled out by U-SQL
• User-Defined Extractors
• User-Defined Outputters
• User-Defined Processors
   Take one row and produce one row
   Pass-through versus transforming
• User-Defined Appliers
   Take one row and produce 0 to n rows
   Used with OUTER/CROSS APPLY
• User-Defined Combiners
   Combines rowsets (like a user-defined join)
• User-Defined Reducers
   Take n rows and produce m rows (normally m<n)
• Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
   EXTRACT, OUTPUT, PROCESS, COMBINE, REDUCE
UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
   By position
   By name
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor since I want to pass in my own parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",", Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Converting text to target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;

        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);
        }
        ...
    } // OutputValueAtCol_I

    // Split the input into rows, split each row into columns, and emit one output row per input row
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
    {
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
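The extractor above only shows the int branch of the text-to-schema conversion; the remaining branches are elided on the slide. As a rough sketch of what such a type dispatch could look like outside the U-SQL runtime, the following plain .NET helper converts one text field to a target column type. The class name TextToColumn, the method ConvertField, and the set of supported types are assumptions made for illustration and are not taken from the deck.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not from the deck): maps one text field to the CLR
// type of a target column, similar in spirit to the int branch shown above.
public static class TextToColumn
{
    public static object ConvertField(string text, Type columnType)
    {
        // Parse with the invariant culture so results do not depend on machine settings.
        if (columnType == typeof(string)) return text;
        if (columnType == typeof(int)) return int.Parse(text, CultureInfo.InvariantCulture);
        if (columnType == typeof(long)) return long.Parse(text, CultureInfo.InvariantCulture);
        if (columnType == typeof(double)) return double.Parse(text, CultureInfo.InvariantCulture);
        if (columnType == typeof(bool)) return bool.Parse(text);
        if (columnType == typeof(DateTime)) return DateTime.Parse(text, CultureInfo.InvariantCulture);

        // Fail loudly for column types this sketch does not cover.
        throw new NotSupportedException("No conversion defined for " + columnType.Name + ".");
    }
}
```

Inside a real extractor, the converted value would then be written with outputrow.Set at the matching column position, as the int branch above does.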
How to specify UDOs?
• Code behind
• C# Class Project for U-SQL
• Any .NET language is usable, although not first-class in tooling
   Use U-SQL-specific .NET DLLs
   Compile the DLL, upload it to ADLS, register it with a script
Managing Assemblies
• Create assemblies
   CREATE ASSEMBLY db.assembly FROM @path;
   CREATE ASSEMBLY db.assembly FROM byte[];
   Can also include additional resource files
• Reference assemblies
   REFERENCE ASSEMBLY db.assembly;
   Referencing .NET Framework assemblies
      Always accessible system namespaces:
         U-SQL specific (e.g., for SQL.MAP)
         All provided by system.dll, system.core.dll, system.data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
      Add all other .NET Framework assemblies with:
         REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerate assemblies
   PowerShell command
   U-SQL Studio Server Explorer
• Drop assemblies
   DROP ASSEMBLY db.assembly;
Visual Studio makes registration easy!
USING clause
'USING' csharp_namespace | Alias '=' csharp_namespace_or_class.
Allows shortening and disambiguating C# namespace and class names

Examples:
DECLARE @input string = "somejsonfile.json";

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

@data0 =
    EXTRACT IPAddresses string
    FROM @input
    USING new JsonExtractor("Devices[*]");

USING json = [Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];

@data1 =
    EXTRACT IPAddresses string
    FROM @input
    USING new json("Devices[*]");
Overlapping Range Aggregation
https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos

Input:
Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
8:00 AM - 9:00 AM - ABC
8:00 AM - 10:00 AM - ABC
10:00 AM - 2:00 PM - ABC
7:00 AM - 11:00 AM - ABC
9:00 AM - 11:00 AM - ABC
11:00 AM - 11:30 AM - ABC
11:40 PM - 11:59 PM - FOO
11:50 PM - 0:40 AM - FOO

Output:
Start Time - End Time - User Name
5:00 AM - 6:00 AM - ABC
5:00 AM - 6:00 AM - XYZ
7:00 AM - 2:00 PM - ABC
11:40 PM - 0:40 AM - FOO
U-SQL:
@r =
    REDUCE @in
    // Presort input rowset
    PRESORT begin
    // Partition and scale out
    ON user
    PRODUCE begin DateTime,
            end DateTime,
            user string
    // Declare passthrough
    READONLY user
    // User-defined Reducer
    USING new ReduceSample.RangeReducer();
Code behind:
namespace ReduceSample
{
    // IsRecursive = true provides better scale but requires an associative operation
    [SqlUserDefinedReducer(IsRecursive = true)]
    public class RangeReducer : IReducer // Implement IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            // Init aggregation values
            int i = 0;
            var begin = DateTime.MaxValue;
            var end = DateTime.MinValue;

            // Input rowset partition: accumulate rows
            foreach (var row in input.Rows)
            {
                ...
                // Get input column
                begin = row.Get<DateTime>("begin");
                end = row.Get<DateTime>("end");
                ...
                // Set output column
                output.Set<DateTime>("begin", begin);
                output.Set<DateTime>("end", end);
                yield return output.AsReadOnly();
                ...
            } // foreach
        } // Reduce
    } // class RangeReducer
} // namespace ReduceSample
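The reducer's accumulation logic is elided on the slide. The following is a minimal sketch of one way to merge overlapping ranges once the rows for a single user arrive sorted by begin time, which is what PRESORT begin and ON user arrange. It is written against plain .NET types rather than the U-SQL interfaces so that it runs on its own. The class name RangeMerging, the method Merge, and the choice to treat touching ranges (one ends exactly when the next begins) as overlapping are assumptions, not the author's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-alone sketch (not the deck's RangeReducer).
public static class RangeMerging
{
    // Merges overlapping or touching ranges into maximal ranges.
    // Precondition: the input is sorted by Begin in ascending order.
    public static IEnumerable<(DateTime Begin, DateTime End)> Merge(
        IEnumerable<(DateTime Begin, DateTime End)> sortedByBegin)
    {
        bool hasCurrent = false;
        DateTime begin = DateTime.MaxValue;
        DateTime end = DateTime.MinValue;

        foreach (var (b, e) in sortedByBegin)
        {
            if (!hasCurrent)
            {
                // First row starts the first accumulated range.
                begin = b;
                end = e;
                hasCurrent = true;
            }
            else if (b <= end)
            {
                // Row overlaps the accumulated range: extend the end if needed.
                if (e > end) end = e;
            }
            else
            {
                // Gap found: emit the accumulated range and start a new one.
                yield return (begin, end);
                begin = b;
                end = e;
            }
        }

        // Emit the last accumulated range, if any row was seen.
        if (hasCurrent) yield return (begin, end);
    }
}
```

Applied to the ABC rows from the input table, sorted by start time, this sketch yields 5:00 AM to 6:00 AM and 7:00 AM to 2:00 PM, which matches the output table. Because each range is emitted only after a gap is seen, a single pass over the sorted partition is enough.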
JSON Processing
How do I extract data from JSON documents?
https://github.com/Azure/usql/tree/master/Examples/DataFormats

Architecture of Sample Format Assembly
• Microsoft.Analytics.Samples.Formats
   NewtonSoft.Json
   System.Xml
• Single JSON document per file: Use JsonExtractor
• Multiple JSON documents per file:
   Do not allow CR/LF (row delimiter) in JSON
   Use built-in Text Extractor to extract
   Use JsonTuple to schematize (with CROSS APPLY)
• Currently loads the full JSON document into memory
   Better to use JSONReader processing if docs are large
@json =
    // Key to field relative to objects in JsonExtractor
    EXTRACT personid int,
            name string,
            addresses string
    FROM @input
    // JPath expression mapping objects to rows
    USING new Json.JsonExtractor("[*].person");

@person =
    SELECT personid,
           name,
           // JsonTuple generates 1-level key-value pairs as SqlMap; the indexer gets the value from the map as a string
           Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
    FROM @json;

@addresses =
    SELECT personid,
           name,
           // Get object map for array item
           Json.JsonFunctions.JsonTuple(address) AS address
    FROM @person
         // Convert string array into map and pivot all values into rows
         CROSS APPLY
             EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

@result =
    // Get desired keys from object map
    SELECT personid,
           name,
           address["addressid"] AS addressid,
           address["street"] AS street,
           address["postcode"] AS postcode,
           address["city"] AS city
    FROM @addresses;
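To make the "1-level key-value pairs" idea concrete, here is a small stand-alone sketch that flattens only the top level of a JSON value into a string map, leaving nested objects and arrays as JSON text. It uses System.Text.Json instead of the sample library's Newtonsoft.Json-based code so that it has no external dependencies. The class name OneLevelJson, the method Tuple, and the use of array indexes as keys are assumptions for illustration; the sample's JsonTuple may differ in its details.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical stand-in (not the sample library's JsonTuple).
public static class OneLevelJson
{
    // Returns the top-level members of a JSON object (or the items of a JSON
    // array, keyed by index) as strings. Nested values stay as raw JSON text.
    public static Dictionary<string, string> Tuple(string json)
    {
        var map = new Dictionary<string, string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            if (root.ValueKind == JsonValueKind.Object)
            {
                foreach (JsonProperty property in root.EnumerateObject())
                {
                    map[property.Name] = AsText(property.Value);
                }
            }
            else if (root.ValueKind == JsonValueKind.Array)
            {
                int index = 0;
                foreach (JsonElement item in root.EnumerateArray())
                {
                    map[(index++).ToString()] = AsText(item);
                }
            }
        }
        return map;
    }

    // Strings are unquoted; every other value keeps its JSON text.
    private static string AsText(JsonElement value)
    {
        return value.ValueKind == JsonValueKind.String
            ? value.GetString()
            : value.GetRawText();
    }
}
```

Calling Tuple a second time on one of the returned values descends one more level. The script above follows the same pattern when it applies JsonTuple first to the address array and then to each exploded array item.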
Image Processing
https://github.com/Azure/usql/tree/master/Examples/ImageApp

Example of extracted image metadata:
Copyright - Camera Make - Camera Model - Thumbnail
Michael - Canon - 70D - (image)
Michael - Samsung - S7 - (image)

• Image processing assembly
   Uses System.Drawing
   Exposes:
      Extractors
      Outputter
      Processor
      User-defined Functions
• Trade-offs
   Column memory limits: Image Extractor vs Feature Extractor
   Main memory pressures in vertex: UDFs vs Processor vs Extractor
R Processing
KMeans Centroids

Architecture: U-SQL Processing with R
• R Programmer Assembly
   KMeansRReducer
• R Engine (Runtime)
   R to .NET interop (RDotNet.dll & RDotNet.NativeLib.dll)
   R Runtime (R-bin.zip)
   R Engine Manager Utility (RUtilities.dll)
• Similar approaches can be used for deploying other runtimes: Python, JavaScript, JVM
• No external access from UDOs
• Future work:
   More generic samples
   More automatic experiences (no user wrappers/deploys)
Summary of U-SQL UDOs
What are UDOs?
Custom Operator Extensions written in .NET (C#)
Scaled out by U-SQL
UDO Tips and Warnings
• Tips when using UDOs:
   Use the READONLY clause to allow pushing predicates through UDOs
   Use the REQUIRED clause to allow column pruning through UDOs
   Use PRESORT on REDUCE if you need global order
   Hint cardinality if the optimizer chooses the wrong plan
• Warnings and better alternatives:
   Use SELECT with UDFs instead of PROCESS
   Use user-defined aggregators instead of REDUCE
   Learn to use windowing functions (OVER expression)
• Good use cases for PROCESS/REDUCE/COMBINE:
   The logic needs to dynamically access the input and/or output schema, e.g., to create a JSON doc for the data in the row where the columns are not known a priori
   Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO
   You need an ordered aggregator or need to produce more than 1 row per group
Additional Resources
• Blogs and community page:
   http://usql.io (U-SQL Github)
   http://blogs.msdn.microsoft.com/azuredatalake/
   http://blogs.msdn.microsoft.com/mrys/
   https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation and articles:
   http://aka.ms/usql_reference
   https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
   https://msdn.microsoft.com/en-us/magazine/mt614251
• ADL forums and feedback:
   http://aka.ms/adlfeedback
   https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
   http://stackoverflow.com/questions/tagged/u-sql
© 2016 Microsoft Corporation. All rights reserved.