SQLBits 2016 (http://www.slideshare.net/MichaelRys) Azure Data Lake & U-SQL Michael Rys, @MikeDoesBigData http://www.azure.com/datalake {mrys, usql}@microsoft.com

ADL/U-SQL Introduction (SQLBits 2016)

Page 1: ADL/U-SQL Introduction (SQLBits 2016)

SQLBits 2016 (http://www.slideshare.net/MichaelRys)

Azure Data Lake & U-SQL
Michael Rys, @MikeDoesBigData

http://www.azure.com/datalake
{mrys, usql}@microsoft.com

Page 2: ADL/U-SQL Introduction (SQLBits 2016)

The Data Lake Approach

Page 3: ADL/U-SQL Introduction (SQLBits 2016)

Traditional data warehousing approach

[Diagram labels: Data sources, ETL, Data warehouse, BI and analytics (Dashboards, Reporting)]

• Understand Corporate Strategy
• Gather Requirements
  • Business Requirements
  • Technical Requirements
• Implement Data Warehouse
  • Reporting & Analytics Development
  • Reporting & Analytics Design
  • Physical Design
  • Dimension Modelling
  • ETL Development
  • ETL Design
  • Install and Tune
  • Setup Infrastructure

Page 4: ADL/U-SQL Introduction (SQLBits 2016)

The Data Lake approach

• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis using analytic engines like Hadoop

[Diagram labels: Devices, Interactive queries, Batch queries, Machine Learning, Data warehouse, Real-time analytics]
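Storing data "in native format without schema definition" is commonly called schema on read: the file stays raw text, and column names and types are supplied only when a query reads it. The following is a minimal Python sketch of that idea and is not part of the original deck; the sample data, the column names, and the `read_with_schema` helper are hypothetical.

```python
import csv
import io
from datetime import datetime

# Raw data is stored as-is; nothing about column names or types is recorded.
RAW_ORDERS = "1,42,2016-01-15,19.99\n2,42,2016-02-01,5.00\n3,7,2016-02-03,12.50\n"

# The schema is supplied by the reader at query time (hypothetical columns).
ORDER_SCHEMA = [
    ("oid", int),
    ("cid", int),
    ("odate", lambda s: datetime.strptime(s, "%Y-%m-%d")),
    ("amount", float),
]


def read_with_schema(raw_text, schema):
    """Parse raw CSV text into dicts, converting each field while reading."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append(
            {name: convert(value) for (name, convert), value in zip(schema, fields)}
        )
    return rows


orders = read_with_schema(RAW_ORDERS, ORDER_SCHEMA)
total_for_customer_42 = sum(o["amount"] for o in orders if o["cid"] == 42)
```

A different reader could apply a different schema to the same bytes, which is the flexibility the slide contrasts with the up-front modelling of a traditional warehouse.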

Page 5: ADL/U-SQL Introduction (SQLBits 2016)

MICROSOFT DOUBLES SEARCH SHARE

[Chart: search share by year, y-axis 0% to 25%]
2009: 9%, 2010: 11%, 2011: 15%, 2012: 16%, 2013: 18%, 2014: 19%, 2015: 20%

Source: ComScore 2009-2015 Search Report US

How Microsoft has used Big Data

We needed to better leverage data and analytics to win in search.
We changed our approach:
• More experiments by more people!

So we…
• Built an Exabyte-scale data lake for everyone to put their data.
• Built tools approachable by any developer.
• Built machine learning tools for collaborating across large experiment models.

Page 6: ADL/U-SQL Introduction (SQLBits 2016)

Introducing Azure Data Lake
Big Data Made Easy

Page 7: ADL/U-SQL Introduction (SQLBits 2016)

Azure Data Lake

• Analytics: HDInsight (“managed clusters”), Azure Data Lake Analytics
• Storage: Azure Data Lake Storage

Page 8: ADL/U-SQL Introduction (SQLBits 2016)

Azure Data Lake Storage Service

Page 9: ADL/U-SQL Introduction (SQLBits 2016)

Azure Data Lake Store
A hyper scale repository for big data analytics workloads

IN PREVIEW

• No limits to SCALE
• Store ANY DATA in its native format
• HADOOP FILE SYSTEM (HDFS) for the cloud
• ENTERPRISE GRADE access control, encryption at rest
• Optimized for analytic workload PERFORMANCE

Page 10: ADL/U-SQL Introduction (SQLBits 2016)

Azure Data Lake Analytics Service

Page 11: ADL/U-SQL Introduction (SQLBits 2016)

Azure Data Lake (Store, HDInsight, Analytics)

[Architecture diagram; labels: WebHDFS, YARN, U-SQL, Hive, ADL Analytics, ADL HDInsight, Store, Analytics, Storage]

Page 12: ADL/U-SQL Introduction (SQLBits 2016)

ADLA complements HDInsight
Target the same scenarios, tools, and customers

HDInsight
• For developers familiar with the Open Source: Java, Eclipse, Hive, etc.
• Clusters offer customization, control, and flexibility in a managed Hadoop cluster

ADLA
• Enables customers to leverage existing experience with C#, SQL & PowerShell
• Offers convenience, efficiency, automatic scale, and management in a “job service” form factor

Page 13: ADL/U-SQL Introduction (SQLBits 2016)

Azure Data Lake Analytics
A distributed analytics service built on Apache YARN that dynamically scales to your needs

IN PREVIEW

• No limits to SCALE
• Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
• Optimized to work with ADL STORE
• FEDERATED QUERY across Azure data sources
• ENTERPRISE GRADE role-based access control and auditing
• Pay PER QUERY and scale PER QUERY

Page 14: ADL/U-SQL Introduction (SQLBits 2016)

ADL and SQLDW

[Diagram labels]
• Unknown Value Data: XML, JSON, TEXT
• Preparation: Pre-process, Transpose, Re-format
• High Value Data
• Model: Load, Transform, Aggregate, Consume
• Batch, Ad-hoc, Batch

Page 15: ADL/U-SQL Introduction (SQLBits 2016)

Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store

Benefits
• Avoid moving large amounts of data across the network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources
  • Filters
  • Joins

[Diagram: a U-SQL query running in Azure Data Lake Analytics issues Query operations against Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage; Write operations are shown for Azure Storage Blobs and Azure Data Lake Storage]
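Pushing SQL expressions such as filters to a remote SQL source means the remote database evaluates them, so only matching rows travel across the network. The rough Python sketch below contrasts the two approaches. It is not from the deck and does not describe how ADLA is implemented: an in-memory SQLite database stands in for the remote source, and the table, data, and function names are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for a remote SQL source (an assumption of this sketch).
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE customers (cid INTEGER, name TEXT, city TEXT)")
remote.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", "New York"), (2, "Bob", "London"), (3, "Cy", "Newcastle")],
)


def fetch_then_filter(conn, prefix):
    """No pushdown: every row is transferred, then filtered locally."""
    rows = conn.execute(
        "SELECT cid, name, city FROM customers ORDER BY cid"
    ).fetchall()
    return len(rows), [r for r in rows if r[2].startswith(prefix)]


def filter_pushed_down(conn, prefix):
    """Pushdown: the filter runs at the source; only matches are transferred."""
    # SQLite's LIKE ignores ASCII case, unlike str.startswith; the sample
    # data is chosen so that both filters select the same rows.
    rows = conn.execute(
        "SELECT cid, name, city FROM customers WHERE city LIKE ? ORDER BY cid",
        (prefix + "%",),
    ).fetchall()
    return len(rows), rows


transferred_all, local_matches = fetch_then_filter(remote, "New")
transferred_few, remote_matches = filter_pushed_down(remote, "New")
```

Both functions return the same matching rows, but the first transfers the whole table while the second transfers only the rows that pass the filter.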

Page 16: ADL/U-SQL Introduction (SQLBits 2016)

Azure Data Lake U-SQL

Page 17: ADL/U-SQL Introduction (SQLBits 2016)

Why U-SQL?

Page 18: ADL/U-SQL Introduction (SQLBits 2016)

Some sample use cases
• Digital Crime Unit – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms
• Image Processing – Large-scale image feature extraction and classification using custom code
• Shopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms

Characteristics of Big Data Analytics
• Requires processing of any type of data
• Allow use of custom algorithms
• Scale to any size and be efficient

Page 19: ADL/U-SQL Introduction (SQLBits 2016)

Status Quo: SQL for Big Data

• Declarativity does scaling and parallelization for you
• Extensibility is bolted on and not “native”
  • hard to work with anything other than structured data
  • difficult to extend with custom code

Page 20: ADL/U-SQL Introduction (SQLBits 2016)

Status Quo: Programming Languages for Big Data

• Extensibility through custom code is “native”
• Declarativity is bolted on and not “native”
  • User often has to care about scale and performance
  • SQL is 2nd class within string
  • Often no code reuse/sharing across queries

Page 21: ADL/U-SQL Introduction (SQLBits 2016)

Why U-SQL?
Declarativity and Extensibility are equally native to the language!

Get benefits of both!
Makes it easy for you by unifying:
• Unstructured and structured data processing
• Declarative SQL and custom imperative Code
• Local and remote Queries
• Increase productivity and agility from Day 1 and at Day 100 for YOU!

Page 22: ADL/U-SQL Introduction (SQLBits 2016)

The origins of U-SQL

[Diagram labels: U-SQL, SCOPE, Hive, T-SQL/ANSI SQL]

SCOPE – Microsoft’s internal Big Data language
• SQL and C# integration model
• Optimization and Scaling model
• Runs 100’000s of jobs daily

Hive
• Complex data types (Maps, Arrays)
• Data format alignment for text files

T-SQL/ANSI SQL
• Many of the SQL capabilities (windowing functions, meta data model etc.)

Page 23: ADL/U-SQL Introduction (SQLBits 2016)

Extend U-SQL with C#/.NET

Built-in operators, functions, aggregates

C# expressions (in SELECT expressions)

User-defined aggregates (UDAGGs)

User-defined functions (UDFs)

User-defined operators (UDOs)
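A user-defined aggregate is typically written as a small piece of state with three steps: initialize, accumulate one value per row, and produce a result at the end of each group. The Python sketch below illustrates that shape only. In U-SQL such aggregators are written in C#, and the class name `MySum` (borrowed from the `AGG<MyAgg.MySum>` call in the deck's sample script), the `group_aggregate` driver, and the sample rows are illustrative assumptions rather than the U-SQL API.

```python
class MySum:
    """Hypothetical user-defined aggregate: init, accumulate per row, terminate."""

    def init(self):
        self.total = 0.0

    def accumulate(self, value):
        self.total += value

    def terminate(self):
        return self.total


def group_aggregate(rows, key, value, aggregate_cls):
    """Run one aggregate instance per group key and return {key: result}."""
    states = {}
    for row in rows:
        k = row[key]
        if k not in states:
            states[k] = aggregate_cls()
            states[k].init()
        states[k].accumulate(row[value])
    return {k: agg.terminate() for k, agg in states.items()}


orders = [
    {"cid": 1, "amount": 10.0},
    {"cid": 2, "amount": 2.5},
    {"cid": 1, "amount": 5.0},
]
totals = group_aggregate(orders, "cid", "amount", MySum)
```

Because the engine, not the aggregate, decides how rows are grouped and distributed, the user code only has to describe what happens to one group.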

Page 24: ADL/U-SQL Introduction (SQLBits 2016)

Demo
Show me U-SQL!

Page 25: ADL/U-SQL Introduction (SQLBits 2016)

U-SQL Language Philosophy

Declarative Query and Transformation Language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/Aggregation, Joins, SQL Analytics functions
• Optimizable, Scalable

Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable

Operates on Unstructured & Structured Data
• Schema on read over files
• Relational metadata objects (e.g. database, table)

Extensible from ground up:
• Type system is based on C#
• Expression language IS C#
• User-defined functions (U-SQL and C#)
• User-defined Aggregators (C#)
• User-defined Operators (UDO) (C#)

U-SQL provides the Parallelization and Scale-out Framework for Usercode
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER, APPLIER

Federated query across distributed data sources

// Reference the assembly that holds the user-defined C# code used below.
REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T(
    cid int,
    first_order DateTime,
    last_order DateTime,
    order_count int,
    order_amount float
);

// Schema on read: column names and types are applied while extracting the files.
@o =
    EXTRACT oid int,
            cid int,
            odate DateTime,
            amount float
    FROM "/input/orders.txt"
    USING Extractors.Csv();

@c =
    EXTRACT cid int,
            name string,
            city string
    FROM "/input/customers.txt"
    USING Extractors.Csv();

// Join, filter with C# expressions, and aggregate per customer.
@j =
    SELECT c.cid,
           MIN(o.odate) AS firstorder,
           MAX(o.odate) AS lastorder,
           COUNT(o.oid) AS ordercnt,
           AGG<MyAgg.MySum>(o.amount) AS totalamount
    FROM @c AS c
         LEFT OUTER JOIN @o AS o
         ON c.cid == o.cid
    WHERE c.city.StartsWith("New")
          && MyNamespace.MyFunction(o.odate) > 10
    GROUP BY c.cid;

OUTPUT @j
TO "/output/result.txt"
USING new MyData.Write();

INSERT INTO T
SELECT * FROM @j;

Page 26: ADL/U-SQL Introduction (SQLBits 2016)

Expression-flow Programming Style

• Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model.
• Execution plan that is optimized out-of-the-box and w/o user intervention.
• Per job and user driven level of parallelization.
• Detail visibility into execution steps, for debugging.
• Heatmap like functionality to identify performance bottlenecks.

Page 27: ADL/U-SQL Introduction (SQLBits 2016)

This is why U-SQL!

• Unifies natively SQL’s declarativity and C#’s extensibility
• Unifies querying structured and unstructured
• Unifies local and remote queries
• Increase productivity and agility from Day 1 forward for YOU!

Sign up for an Azure Data Lake account and join the Public Preview at http://www.azure.com/datalake and give us your feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!

Page 28: ADL/U-SQL Introduction (SQLBits 2016)

Additional resources

• Tools:
  • http://aka.ms/adltoolsVS

• Blogs, videos and community page:
  • http://usql.io (Link to Github with code samples)
  • http://blogs.msdn.com/b/visualstudio/
  • http://azure.microsoft.com/en-us/blog/topics/big-data/
  • https://channel9.msdn.com/Search?term=U-SQL#ch9Search

• Documentation and articles and slides:
  • http://aka.ms/usql_reference
  • https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
  • https://msdn.microsoft.com/en-us/magazine/mt614251
  • http://www.slideshare.net/MichaelRys

• ADL forums and feedback
  • http://aka.ms/adlfeedback
  • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
  • http://stackoverflow.com/questions/tagged/u-sql

Page 29: ADL/U-SQL Introduction (SQLBits 2016)

http://aka.ms/AzureDataLake