
DATA

A Microsoft IT pro's guide to HDInsight and the swarm of open source solutions it comes with

Michael Jonsson

Cloud Solutions Architect – Avanade Denmark

P-CSA – Microsoft (Finalist in the Microsoft Partner of the Year awards 2018 in the category “Partner Seller Excellence in Technology, Sales and/or Licensing Award”)

Twitter: Michael_Jonsson

Blog: Azurefabric.com

DATA

What to expect from this session

• War stories implementing HDInsight

• Subtle complaints about solutions that are not ready

• Fast pace and bad jokes

• Guidance toward a partially data-driven organization

• Some demos, but maybe just screenshots

• Pitfall info (and bad PowerPointing)

DATA

Feedback please ☺

https://feedback.expertslive.nl

DATA

The IT state is evolving

DATA

Customer project

Goals

• Information management

• Advanced analytics

• Future platform (data driven, self service)

• Machine Learning (AI)

DATA

If the Infrastructure is already in Place!!

DATA

The second 80/20 dilemma

DATA

What is wrong with the current OLD methods

1: Cleaning data to fit a relational model = deleting some of it

2: ETL with defined schemas = hard to change

3: DevOps (everything as code) does not fit the method

DATA

The Cloud era!

The modern way!

• For modern big data analytics: consider all data valuable, and therefore keep it unstructured (schemaless), stored in its native format in a low-cost location

• Ingest, analyze, and then structure (schema on read, or ELT)

• Schema-free data makes it possible to add new data to the mix

• Infra as Code = no core infra knowledge needed for the data scientist (it's in the code!)

• PaaS services

• Infra as Code

• CI/CD

• DevOps
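
To make "schema on read" concrete, here is a minimal Python sketch. It is not taken from the talk: the sample records, the field names, and the helper `read_with_schema` are all hypothetical. The idea it illustrates is that the raw data stays untouched in its native format, and each consumer applies its own schema only when reading.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
# The third record carries a field that the first two do not have.
RAW_EVENTS = [
    '{"user": 1, "region": "en-us", "duration": 73}',
    '{"user": 2, "region": "en-gb", "duration": 614}',
    '{"user": 3, "region": "da-dk", "duration": 12, "device": "mobile"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: pick and convert only the requested fields.

    `schema` maps field name -> converter. Fields missing from a record
    come back as None instead of making the load fail.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({
            name: (convert(record[name]) if name in record else None)
            for name, convert in schema.items()
        })
    return rows


# Two different views over the same untouched raw data.
durations = read_with_schema(RAW_EVENTS, {"region": str, "duration": int})
devices = read_with_schema(RAW_EVENTS, {"user": int, "device": str})
```

The new `device` field appears in the second view without any change to the stored records or to the first view, which is the flexibility the bullets above argue for.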

DATA

Customer project

Goals

• Information management

• Advanced analytics

• Future platform (data driven, self service)

• Machine Learning (AI)

DATA

WHY?

Data-driven business! (Goal)

• Decisions based on data (facts)! (Netflix)

• Democratize data: available for all (Microsoft services goal)

• Enable AI to get insights that do not come naturally to humans: analysis, analytics, etc.

DATA

The Azure BIG Data Landscape

(Diagram; services shown:)

• Azure ML, ML Server, Azure Databricks

• Azure Stream Analytics, Azure HDInsight, Azure Data Lake Analytics

• Azure SDK, Azure CLI, Azure Data Factory, Azure Import/Export Service

• Azure IoT Hub, Azure Event Hubs, Kafka on Azure HDInsight

• Azure ExpressRoute, Azure Network Security Groups, Azure Active Directory, Azure Key Management Service

• Azure Functions, Visual Studio, Operations Management Suite

• Azure Search, Cognitive Services, Bot Service, Azure Data Catalog

• Azure Storage Blobs, Azure Data Lake Store

• Azure SQL Data Warehouse, Azure SQL DB, Azure Cosmos DB

• Azure Analysis Services, Power BI

DATA

DATA

Demo: HDInsight (to keep it interesting)

DATA

Pointers on HDInsight

• When integrating HDInsight with Data Lake through an AAD service principal, how do you control access to data on a per-user level from the cluster? (This is where the Apache Ranger discussion comes in.)

• Domain-joined premium HDInsight clusters are difficult (at least in preview when documentation was severely lacking)

• Limitations in the process for domain joining HDInsight nodes may be unacceptable for large enterprises (e.g., must put all nodes in the same OU and cannot control their hostnames)

• Parameters in ARM templates are poorly documented and change frequently (don’t assume you can reuse stuff from GitHub)

• Forced tunneling is not supported according to the HDInsight documentation, so how do you get it to work with e.g. ExpressRoute via a customer firewall?

• If multiple HDInsight clusters are deployed in the same VNet, the first 6 characters of the cluster names must be different (this may be a problem depending on the corporate naming convention)
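
The naming pointer above is easy to check before deployment. The following Python sketch is an illustration, not an official validation: the function name `find_prefix_clashes`, the sample cluster names, and the case-insensitive comparison are all assumptions made here.

```python
from collections import defaultdict

PREFIX_LENGTH = 6  # per the pointer above: the first 6 characters must differ


def find_prefix_clashes(cluster_names, prefix_length=PREFIX_LENGTH):
    """Group cluster names by their first `prefix_length` characters.

    Returns a dict of prefix -> list of names, only for prefixes shared by
    more than one name. Comparison is case-insensitive, which is an
    assumption made for this sketch and not something stated in the talk.
    """
    by_prefix = defaultdict(list)
    for name in cluster_names:
        by_prefix[name[:prefix_length].lower()].append(name)
    return {p: names for p, names in by_prefix.items() if len(names) > 1}


# Hypothetical names following a "company-environment-workload" convention.
planned = ["contoso-prod-spark", "contoso-prod-kafka", "hdi01-contoso-hive"]
clashes = find_prefix_clashes(planned)
```

Here the two names starting with the company name clash, which shows why a corporate convention that puts a fixed prefix first can collide with this limitation.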

DATA

Start with the Data

• For modern big data analytics: consider all data valuable, and therefore keep it unstructured (schemaless), stored in its native format in a low-cost location



On-demand analytics job service powering intelligent action across a hyper-scale repository for Big Data analytics workloads

Azure Data Lake

• Start in seconds,

• Develop

• Leverage open source

• Debug & optimize

• Petabyte size files

• Virtualize your analytics

• Provide I/O capacity

• Always encrypted,

DATA

Datalake

DATA

Use the built-in capabilities

• Azure Data Lake Analytics: U-SQL

• Ingest, analyze, and then structure (schema on read, or ELT)

• Schema-free data makes it possible to add new data to the mix

U-SQL: a language that unifies

• Programming models: Declarative + Imperative

• Data: Structured + Semi-structured + Unstructured

• Workloads: Batch + Interactive + Streaming + Machine Learning

Develop massively parallel programs with simplicity

A simple U-SQL script can scale from Gigabytes to Petabytes without learning complex big data programming techniques.

U-SQL automatically generates a scaled out and optimized execution plan to handle any amount of data.

Execution nodes are rapidly allocated to run the program.

Error handling, network issues, and runtime optimization are handled automatically.

@searchlog =
    EXTRACT UserId int, Start DateTime, Region string, Query string,
            Duration int, Urls string, ClickedUrls string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();
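
For readers who do not know U-SQL, the script above can be read as "extract typed columns from a TSV file, then write them back out". The Python sketch below is a rough single-machine analogue, not U-SQL and not how Data Lake Analytics runs it. The sample rows, the ISO 8601 timestamp format, and the helper names `extract_tsv` and `output_tsv` are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

# Column names and converters mirroring the EXTRACT clause above.
SCHEMA = [
    ("UserId", int),
    ("Start", datetime.fromisoformat),  # assumes ISO 8601 timestamps
    ("Region", str),
    ("Query", str),
    ("Duration", int),
    ("Urls", str),
    ("ClickedUrls", str),
]


def extract_tsv(stream, schema=SCHEMA):
    """Read tab-separated rows and type each column according to `schema`."""
    reader = csv.reader(stream, delimiter="\t")
    return [
        {name: convert(value) for (name, convert), value in zip(schema, row)}
        for row in reader
    ]


def output_tsv(rows, stream, schema=SCHEMA):
    """Write the rows back out as tab-separated text, in schema order."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name, _ in schema])


# Hypothetical two-row sample standing in for SearchLog.tsv.
sample = io.StringIO(
    "399266\t2012-02-15T11:53:16\ten-us\thow to make nachos\t73\turl1;url2\turl1\n"
    "382045\t2012-02-15T11:53:18\ten-gb\tbest ski resorts\t614\turl3;url4\turl4\n"
)
searchlog = extract_tsv(sample)
out = io.StringIO()
output_tsv(searchlog, out)
```

The difference that matters is scale: this sketch reads everything into memory on one machine, whereas the U-SQL version leaves parallelization and node allocation to the service.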

DATA

DEMO

Azure Data Factory ->

Azure Data Lake ->

Azure Data Lake Analytics: U-SQL ->

Azure Data Lake ->

PowerBI

DATA

What to use when?

DATA

DATA

The Azure BIG Data Landscape

Axis: CONTROL to EASE OF USE (reduced administration)

Columns along the axis: IaaS Clusters, Managed Clusters, Big Data as-a-service

BIG DATA ANALYTICS

• Azure Marketplace (HDP | CDH | MapR): any Hadoop technology, any distribution

• Azure HDInsight: workload optimized, managed clusters

• Azure Databricks: frictionless & optimized Spark clusters

• Azure Data Lake Analytics: data engineering in a job-as-a-service model

BIG DATA STORAGE

• Azure Data Lake Store

• Azure Storage

DATA

Demo: Databricks (to keep it interesting)

DATA

What to use when?

• Less is more

• Start with the data store (Data Lake) and see how far you get with the native capabilities

• U-SQL, Data Factory

• Use VSTS (DevOps, early)

DATA

THANK YOU!

https://feedback.expertslive.nl

Twitter: Michael_Jonsson

Blog: Azurefabric.com

DATA

Do you want to gain more knowledge about Microsoft technology?

The Future Ready Skills program offers online courseware, online labs, live Q&As, and expert sessions, so you can acquire your official Microsoft Certificate in the most efficient way.

For more information: aka.ms/frsblog

FUTURE READY SKILLS
