
Azure Data Lake Analytics Deep Dive


Page 1: Azure Data Lake Analytics Deep Dive

Ilyas F
Azure Solution Architect @ 8KMiles
Twitter: @ilyas_tweets
LinkedIn: https://in.linkedin.com/in/ilyasf

Azure Data Lake Analytics Deep Dive

2016/05/17

Page 2: Azure Data Lake Analytics Deep Dive

Agenda

Origins
• Cosmos
• Futures

Layers & Components
• Storage
• Parallelization
• Job Scheduling
• Query Execution
• Performance
• Demo

Page 3: Azure Data Lake Analytics Deep Dive

Quick Recap

Page 4: Azure Data Lake Analytics Deep Dive

The 3 Azure Data Lake Services
• HDInsight: clusters as a service
• Analytics: big data queries as a service
• Store: hyper-scale storage optimized for analytics

Currently in PREVIEW. General Availability later in 2016.

Page 5: Azure Data Lake Analytics Deep Dive

U-SQL: A New Language for Big Data
• Familiar syntax to millions of SQL and .NET developers
• Unifies the declarative nature of SQL with the imperative power of C#
• Unifies structured, semi-structured, and unstructured data
• Distributed query support over all data
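A minimal sketch of what that unification looks like in practice (the file paths and schema here are assumed for illustration): the EXTRACT/SELECT/OUTPUT skeleton is declarative SQL, while the row-level expressions are plain C#.

@searchlog =
    EXTRACT UserId int,
            Region string,
            Query string
    FROM "/input/searchlog.tsv"           // assumed sample input
    USING Extractors.Tsv();

@result =
    SELECT Region,
           Query.ToUpper() AS QueryUpper  // C# string method inside a SQL-style projection
    FROM @searchlog
    WHERE Region == "en-us";              // C# equality operator

OUTPUT @result
TO "/output/result.tsv"
USING Outputters.Tsv();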

Page 6: Azure Data Lake Analytics Deep Dive

History

Page 7: Azure Data Lake Analytics Deep Dive

Bing needed to…
• Understand user behavior

And do it…
• At massive scale
• With agility and speed
• At low cost

So they built… Cosmos.

Cosmos supports:
• Batch jobs
• Interactive queries
• Machine learning
• Streaming

Thousands of developers use it.

Page 8: Azure Data Lake Analytics Deep Dive

Pricing

Page 9: Azure Data Lake Analytics Deep Dive

Key ADL Analytics Components

Page 10: Azure Data Lake Analytics Deep Dive

ADL Account Configuration

An ADL Analytics account:
• Links to ADL Store accounts (one is the default; an ADL Store IS REQUIRED for ADL Analytics to function)
• Links to Azure Blob stores
• Has a Job Queue
• Has a U-SQL Catalog (metadata and data)

Key settings (defaults):
• Max Concurrent Jobs = 3
• Max ADLAUs per Job = 20
• Max Queue Length = 200

If you want to change the defaults, open a support ticket.

Page 11: Azure Data Lake Analytics Deep Dive

Simplified Workflow

Job submission: Job Front End -> Job Queue -> Job Scheduler -> Compiler Service
Job execution: Job Manager -> YARN -> U-SQL Runtime (vertex execution)
U-SQL Catalog: metadata

Page 12: Azure Data Lake Analytics Deep Dive

Goal: Understanding a U-SQL (Batch) Job

Page 13: Azure Data Lake Analytics Deep Dive

Azure Data Lake Analytics (ADLA) Demo

Page 14: Azure Data Lake Analytics Deep Dive
Page 15: Azure Data Lake Analytics Deep Dive

Job Properties

Job Graph

Page 16: Azure Data Lake Analytics Deep Dive

Job Scheduling: States, Queue, Priority

Page 17: Azure Data Lake Analytics Deep Dive

Job Status in Visual Studio

Page 18: Azure Data Lake Analytics Deep Dive

Job states as shown in the UX, mapped to the internal states:

• Preparing (internal: New, Compiling): the script is being compiled by the Compiler Service.
• Queued (internal: Queued, Scheduling): all jobs enter the queue. Are there enough ADLAUs to start the job? If yes, those ADLAUs are allocated for the job.
• Running / Finalizing (internal: Starting, Running): the U-SQL runtime is executing the code on 1 or more ADLAUs, or finalizing the outputs.
• Ended with Succeeded, Failed, or Cancelled (internal: Ended): the job has concluded.

Page 19: Azure Data Lake Analytics Deep Dive

Why Does a Job Get Queued?

Local cause
• The queue is already at Max Concurrency.

Global cause (very rare)*
• System-wide shortage of ADLAUs
• System-wide shortage of bandwidth

* If these conditions are met, a job will be queued even if the queue is not at its Max Concurrency.

Page 20: Azure Data Lake Analytics Deep Dive

State History

Page 21: Azure Data Lake Analytics Deep Dive

The Job Queue

The queue is ordered by job priority. Lower numbers mean higher priority; 1 is the highest. When a job reaches the top of the queue, it starts running.

Defaults:
• Max Running Jobs = 3
• Max Tokens per Job = 20
• Max Queue Size = 200

Page 22: Azure Data Lake Analytics Deep Dive

Priority Doesn't Preempt Running Jobs

Jobs A, B, and C are all running with very low priority (pri = 1000). A new job X arrives with pri = 1. X will NOT preempt the running jobs; X has to wait.

Page 23: Azure Data Lake Analytics Deep Dive

U-SQL Job Compilation

Page 24: Azure Data Lake Analytics Deep Dive

U-SQL Compilation Process

The Compiler & Optimizer (consulting the U-SQL Metadata Service) turn the script into the compilation output in the job folder:
• C# (compiled to a managed DLL)
• C++ (compiled to an unmanaged DLL)
• Algebra
• Other files (system files, deployed resources)

These outputs are deployed to the vertices.

Page 25: Azure Data Lake Analytics Deep Dive

The Job Folder

Inside the default ADL Store:
/system/jobservice/jobs/Usql/YYYY/MM/DD/hh/mm/JOBID

For example:
/system/jobservice/jobs/Usql/2016/01/20/00/00/17972fc2-4737-48f7-81fb-49af9a784f64

Page 26: Azure Data Lake Analytics Deep Dive

Job Folder Contents
• C# code generated by the U-SQL compiler
• C++ code generated by the U-SQL compiler
• The cluster plan, a.k.a. the "Job Graph", generated by the U-SQL compiler
• User-provided .NET assemblies
• The user-provided U-SQL script

Page 27: Azure Data Lake Analytics Deep Dive

Resources

Page 28: Azure Data Lake Analytics Deep Dive

Blue items: the output of the compiler

Grey items: U-SQL runtime bits

Download all the resources

Download a specific resource

Page 29: Azure Data Lake Analytics Deep Dive

Query Execution: Plans, Vertices, Stages, Parallelism, ADLAUs

Page 30: Azure Data Lake Analytics Deep Dive

Query Life

A query flows from Visual Studio or the Portal / API through the Front-End Service to the Compiler and Optimizer, then to the Job Scheduler & Queue, and finally to Vertex Scheduling and the Runtime.

Page 31: Azure Data Lake Analytics Deep Dive

How does the Parallelism number relate to vertices?

What does "Vertices" mean?

What is this?

Page 32: Azure Data Lake Analytics Deep Dive

Logical -> Physical Plan

• Each square ("a vertex") represents a fraction of the total work.
• Vertices in each SuperVertex (a.k.a. "Stage") do the same operation on different parts of the same data.
• Vertices in a later stage may depend on vertices in an earlier stage.

This is what the job graph visualizes.

Page 33: Azure Data Lake Analytics Deep Dive

Stage Details
• 252 pieces of work (vertices)
• Average vertex execution time
• 4.3 billion rows
• Data read & written
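To tie this back to the Parallelism setting: if this stage's 252 vertices run with, say, Parallelism = 20 (the default Max ADLAUs per job), at most 20 vertices execute at once, so the stage needs at least ceil(252 / 20) = 13 waves of vertex executions.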

Page 34: Azure Data Lake Analytics Deep Dive

Automatic Vertex Retry

A vertex failed… but was retried automatically, and the overall stage completed successfully.

A vertex might fail because:
• A router was congested
• Hardware failed (e.g., a hard drive died)
• The VM had to be rebooted

The U-SQL job will automatically schedule the vertex on another VM.

Page 35: Azure Data Lake Analytics Deep Dive

ADLAUs: Azure Data Lake Analytics Units

• Parallelism N = N ADLAUs
• 1 ADLAU ~= a VM with 2 cores and 6 GB of memory
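As a rough sizing sketch, a job submitted with Parallelism = 20 therefore gets roughly 20 × 2 = 40 cores and 20 × 6 GB = 120 GB of memory for its vertices.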

Page 36: Azure Data Lake Analytics Deep Dive

Efficiency: Cost vs. Latency

Page 37: Azure Data Lake Analytics Deep Dive

Profile isn’t loaded

Page 38: Azure Data Lake Analytics Deep Dive

Profile is loaded now

Click Resource usage

Page 39: Azure Data Lake Analytics Deep Dive

Blue: Allocation

Red: Actual running

Page 40: Azure Data Lake Analytics Deep Dive

Smallest estimated time when given 2425 ADLAUs

1410 seconds= 23.5 minutes

Page 41: Azure Data Lake Analytics Deep Dive

Model with 100 ADLAUs

8709 seconds ≈ 145.2 minutes

Page 42: Azure Data Lake Analytics Deep Dive

JobCost = 5¢ + (minutes × ADLAUs × ADLAU cost per minute)
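A worked example with an assumed rate (check the Azure pricing page for the real per-minute ADLAU price): a 10-minute job on 10 ADLAUs at a hypothetical $0.01 per ADLAU-minute costs $0.05 + (10 × 10 × $0.01) = $1.05.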

Page 43: Azure Data Lake Analytics Deep Dive

Allocation

Allocating 10 ADLAUs for a 10-minute job:
Cost = 10 min × 10 ADLAUs = 100 ADLAU-minutes

Blue line: allocated ADLAUs over time.

Page 44: Azure Data Lake Analytics Deep Dive

Over Allocation: Consider Using Fewer ADLAUs

You are paying for the area under the blue line (allocation), but you are only using the area under the red line (actual running).
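For instance (numbers assumed for illustration): if the allocation integrates to 100 ADLAU-minutes but actual usage integrates to only 40 ADLAU-minutes, 60% of what you pay for sits idle; allocating fewer ADLAUs narrows that gap, usually at the cost of some extra latency.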

Page 45: Azure Data Lake Analytics Deep Dive

Vertex Execution

Page 46: Azure Data Lake Analytics Deep Dive

Store Basics

A very big file is split apart into extents:
• Extents can be up to 250 MB in size.
• For availability and reliability, each extent is replicated (3 copies).
• Splitting enables parallelized reads.
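A quick worked example: a 1 GB file stored in 250 MB extents splits into ceil(1024 / 250) = 5 extents, so up to 5 vertices can read it in parallel.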

Page 47: Azure Data Lake Analytics Deep Dive

Parallel Writing

Front-end machines for a web service upload their log files simultaneously into Azure Data Lake.

Page 48: Azure Data Lake Analytics Deep Dive

Extent -> Vertex

As file size increases, there are more opportunities for parallelism: each extent can be read by its own vertex.

Page 49: Azure Data Lake Analytics Deep Dive

The importance of partitioning input data

Page 50: Azure Data Lake Analytics Deep Dive

Search engine clicks data set

A log of how many clicks a certain domain got within a session:

SessionID  Domain          Clicks
3          cnn.com         9
1          whitehouse.gov  14
2          facebook.com    8
3          reddit.com      78
2          microsoft.com   1
1          facebook.com    5
3          microsoft.com   11

Page 51: Azure Data Lake Analytics Deep Dive

Data Partitioning Compared

File: keys (Domain) are scattered across the extents.
• Extent 1: FB, WH, CNN
• Extent 2: FB, WH, CNN
• Extent 3: FB, WH, CNN

U-SQL table partitioned on Domain: the keys are now "close together", and the index tells U-SQL exactly which extents contain each key.
• Extent 1: FB, FB, FB
• Extent 2: WH, WH, WH
• Extent 3: CNN, CNN, CNN

Page 52: Azure Data Lake Analytics Deep Dive

How did we create and fill that table?

CREATE TABLE SampleDBTutorials.dbo.ClickData
(
    SessionId int,
    Domain string,
    Clicks int,
    INDEX idx1                        // Name of the index
    CLUSTERED (Domain ASC)            // Column to cluster by
    // PARTITIONED BY HASH (Region)   // Column to partition by
);

INSERT INTO SampleDBTutorials.dbo.ClickData
SELECT *
FROM @clickdata;
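The @clickdata rowset that feeds the INSERT isn't defined on the slide; a minimal sketch of how it could be produced (path taken from the file-based query on the next slide, schema from the data set above):

@clickdata =
    EXTRACT SessionId int,
            Domain string,
            Clicks int
    FROM "/clickdata.tsv"
    USING Extractors.Tsv();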

Page 53: Azure Data Lake Analytics Deep Dive

Find all the rows for cnn.com

// Using a file
@ClickData =
    EXTRACT SessionId int,
            Domain string,
            Clicks int
    FROM "/clickdata.tsv"
    USING Extractors.Tsv();

@rows =
    SELECT *
    FROM @ClickData
    WHERE Domain == "cnn.com";

OUTPUT @rows TO "/output.tsv" USING Outputters.Tsv();

// Using a U-SQL table partitioned by Domain
@ClickData =
    SELECT * FROM MyDB.dbo.ClickData;

@rows =
    SELECT *
    FROM @ClickData
    WHERE Domain == "cnn.com";

OUTPUT @rows TO "/output.tsv" USING Outputters.Tsv();

Page 54: Azure Data Lake Analytics Deep Dive

File (CNN, FB, and WH rows are spread across Extent 1, Extent 2, and Extent 3):
Read -> Filter -> Write runs for every extent. Because "CNN" could be anywhere, all extents must be read.

U-SQL table partitioned by Domain (Extent 1 = FB, Extent 2 = WH, Extent 3 = CNN):
A single Read -> Filter -> Write runs. Thanks to "partition elimination", the job only reads from the extent that is known to contain the relevant key.

Page 55: Azure Data Lake Analytics Deep Dive

How many clicks per domain?

@rows =
    SELECT Domain,
           SUM(Clicks) AS TotalClicks
    FROM @ClickData
    GROUP BY Domain;

Page 56: Azure Data Lake Analytics Deep Dive

File (CNN, FB, and WH rows are spread across all three extents):
Each extent goes through Read -> Partition -> Partial Agg, and the rows are then shuffled so that each Full Agg -> Write sees all the rows for its keys. The repartitioning step is expensive!

U-SQL table partitioned by Domain (Extent 1 = FB, Extent 2 = WH, Extent 3 = CNN):
Each extent goes straight through Read -> Full Agg -> Write. No repartitioning is needed, because every extent already contains all the rows for its keys.

Page 57: Azure Data Lake Analytics Deep Dive

High-Level Performance Advice

Page 58: Azure Data Lake Analytics Deep Dive

Learn U-SQL
• Leverage native U-SQL constructs first.

UDOs are evil
• The optimizer can't optimize UDOs the way it can pure U-SQL code (see the sketch below).

Understand your data
• Volume, distribution, partitioning, growth.
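A hedged illustration of "native constructs first": the built-in aggregation below is transparent to the optimizer (it can use partial aggregation and partition elimination), while the commented-out UDO route is a black box that must run as-is. MyNamespace.MyReducer is hypothetical.

// Native U-SQL: the optimizer understands SUM and GROUP BY.
@totals =
    SELECT Domain,
           SUM(Clicks) AS TotalClicks
    FROM @ClickData
    GROUP BY Domain;

// UDO route (hypothetical custom reducer): opaque to the optimizer.
// @totals =
//     REDUCE @ClickData ON Domain
//     PRODUCE Domain string, TotalClicks long
//     USING new MyNamespace.MyReducer();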

Page 59: Azure Data Lake Analytics Deep Dive

Questions?