Upload
michael-rys
View
571
Download
4
Embed Size (px)
Citation preview
SQLBits 2016(http://www.slideshare.net/MichaelRys)
Azure Data Lake &U-SQLMichael Rys, @MikeDoesBigData
http://www.azure.com/datalake{mrys, usql}@microsoft.com
The Data Lake Approach
Implement Data WarehouseReporting & Analytics Development
Reporting & Analytics Design
Physical Design
Dimension Modelling
ETL DevelopmentETL Design
Install and TuneSetup Infrastructure
Traditional data warehousing approach
Data sources
ETL
BI and analytics
Dashboards
Reporting
Data warehouse
Understand Corporate Strategy
Gather Requirements
Business Requirement
s
Technical Requirements
The Data Lake approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysisUsing analytic engines like Hadoop
Interactive queriesBatch queries
Machine LearningData warehouse
Real-time analytics
Devices
Source: ComScore 2009-2015 Search Report US
2009 2010 2011 2012 2013 2014 20150%
5%
10%
15%
20%
25%
9%11%
15%16%
18%19% 20%
MICROSOFT DOUBLES SEARCH SHARE
How Microsoft has used Big DataWe needed to better leverage data and analytics to win in searchWe changed our approach• More experiments by more people!
So we…Built an Exabyte-scale data lake for everyone to put their data.Built tools approachable by any developer.Built machine learning tools for collaborating across large experiment models.
Introducing Azure Data LakeBig Data Made Easy
Analytics
Storage
HDInsight(“managed clusters”)
Azure Data Lake Analytics
Azure Data Lake Storage
Azure Data Lake
Azure Data Lake Storage Service
No limits to SCALE
Store ANY DATA in its native format
HADOOP FILE SYSTEM (HDFS) for the cloud
ENTERPRISE GRADE access control, encryption at rest
Optimized for analytic workload PERFORMANCE
Azure Data Lake StoreA hyper scale repository for big data analytics workloads
IN PREVIEW
Azure Data Lake Analytics Service
WebHDFS
YARN
U-SQL
ADL Analytics
ADL HDInsight
1
1
1
1
1
1 1
1
1
1
1
1
Store
HiveAnalytics
Storage
Azure Data Lake (Store, HDInsight, Analytics)
ADLA complements HDInsightTarget the same scenarios, tools, and customers
HDInsightFor developers familiar with the Open Source: Java, Eclipse, Hive, etc.
Clusters offer customization, control, and flexibility in a managed Hadoop cluster
ADLAEnables customers to leverage existing experience with C#, SQL & PowerShell
Offers convenience, efficiency, automatic scale, and management in a “job service” form factor
No limits to SCALE
Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
Optimized to work with ADL STORE
FEDERATED QUERY across Azure data sources
ENTERPRISE GRADE role-based access control and auditing
Pay PER QUERY and scale PER QUERY
Azure Data Lake AnalyticsA distributed analytics servicebuilt on Apache YARN that dynamically scales to your needs
IN PREVIEW
ADL and SQLDW
XML
JSON
Preparation• Pre-process • Transpose• Re-format
TEXT Model• Load• Transform• Aggregate• Consume
High Value Data
Unknown Value
Data
BatchAd-hocBatch
Query data where it livesEasily query data in multiple Azure data stores without moving it to a single store
Benefits• Avoid moving large amounts of data across the
network between stores• Single view of data irrespective of physical
location • Minimize data proliferation issues caused by
maintaining multiple copies• Single query language for all data• Each data store maintains its own sovereignty• Design choices based on the need• Push SQL expressions to remote SQL sources
• Filters• Joins
U-SQL Query
Query
Query
Query
WriteAzure Storage Blobs
Azure SQL in VMs
Azure SQL DB
Azure Data Lake Analytics
Query
Azure SQL Data Warehouse
Quer
y
Writ
e
Azure Data Lake Storage
Azure Data Lake U-SQL
Why U-SQL?
Some sample use casesDigital Crime Unit – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithmsImage Processing – Large-scale image feature extraction and classification using custom codeShopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms
Characteristics of Big Data Analytics
•Requires processing of any type of data •Allow use of custom algorithms•Scale to any size and be efficient
Status Quo:SQL for Big Data
Declarativity does scaling and parallelization for you
Extensibility is bolted on and not “native” hard to work with anything other
than structured data difficult to extend with custom
code
Status Quo:Programming Languages for Big Data
Extensibility through custom code is “native”
Declarativity is bolted on and not “native” User often has to
care about scale and performance
SQL is 2nd class within string Often no code reuse/
sharing across queries
Why U-SQL? Declarativity and Extensibility are equally native to the language!
Get benefits of both!Makes it easy for you by unifying:• Unstructured and structured data
processing• Declarative SQL and custom imperative
Code• Local and remote Queries• Increase productivity and agility from Day
1 and at Day 100 for YOU!
The origins of U-SQL
U-SQLSCOPE
Hive
T-SQL/ ANSI SQL
SCOPE – Microsoft’s internal Big Data language• SQL and C# integration model• Optimization and Scaling model• Runs 100’000s of jobs daily Hive• Complex data types (Maps, Arrays)• Data format alignment for text filesT-SQL/ANSI SQL• Many of the SQL capabilities (windowing functions,
meta data model etc.)
Extend U-SQL with C#/.NET
Built-in operators, function, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
DemoShow me U-SQL!
U-SQL Language Philosophy
Declarative Query and Transformation Language:• Uses SQL’s SELECT FROM WHERE with GROUP
BY/Aggregation, Joins, SQL Analytics functions• Optimizable, ScalableExpression-flow programming style:• Easy to use functional lambda composition • Composable, globally optimizableOperates on Unstructured & Structured Data• Schema on read over files• Relational metadata objects (e.g. database, table)Extensible from ground up:• Type system is based on C#• Expression language IS C#• User-defined functions (U-SQL and C#)• User-defined Aggregators (C#)• User-defined Operators (UDO) (C#)
U-SQL provides the Parallelization and Scale-out Framework for Usercode• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER,
COMBINER, APPLIERFederated query across distributed data sources
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt" USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string FROM "/input/customers.txt" USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt , AGG<MyAgg.MySum>(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
Expression-flow Programming Style
Automatic "in-lining" of U-SQL expressions – whole script leads to a single execution model.Execution plan that is optimized out-of-the-box and w/o user intervention.Per job and user driven level of parallelization.Detail visibility into execution steps, for debugging.Heatmap like functionality to identify performance bottlenecks.
Unifies natively SQL’s declarativity and C#’s extensibilityUnifies querying structured and unstructuredUnifies local and remote queriesIncrease productivity and agility from Day 1 forward for YOU!Sign up for an Azure Data Lake account and join the Public Preview http://www.azure.com/datalake and give us your feedback via http://aka.ms/adlfeedback or at http://aka.ms/u-sql-survey!
This is why U-SQL!
Additional resources
• Tools:• http://aka.ms/adltoolsVS
• Blogs, videos and community page:• http://usql.io (Link to Github with code samples)• http://blogs.msdn.com/b/visualstudio/• http://azure.microsoft.com/en-us/blog/topics/big-data/• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation and articles and slides:• http://aka.ms/usql_reference• https://azure.microsoft.com/en-us/documentation/services/data-lake-analyt
ics/• https://msdn.microsoft.com/en-us/magazine/mt614251 • http://www.slideshare.net/MichaelRys
• ADL forums and feedback• http://aka.ms/adlfeedback• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=Azure
DataLake
• http://stackoverflow.com/questions/tagged/u-sql
http://aka.ms/AzureDataLake