DATA
A Microsoft IT pro's guide to HDInsight and the swarm of open source solutions it comes with
Michael Jonsson
Cloud Solutions Architect – Avanade Denmark
P-CSA – Microsoft (Finalist in the Microsoft Partner of the Year awards 2018 in the category “Partner Seller Excellence in Technology, Sales and/or Licensing Award”)
Twitter: Michael_Jonsson
Blog: Azurefabric.com
DATA
What to expect from this session
• War stories implementing HDInsight
• Subtle complaints about solutions that are not ready
• Fast pace and bad jokes
• Guidance toward a partially data-driven organization
• Some demos, but maybe just screenshots
• Pitfall information (and bad PowerPointing)
DATA
Feedback please ☺
https://feedback.expertslive.nl
DATA
The IT state is evolving
DATA
Customer project
Goals
• Information management
• Advanced analytics
• Future platform (data driven, self service)
• Machine Learning (AI)
DATA
If the Infrastructure is already in Place!!
DATA
The second 80/20 dilemma
DATA
What is wrong with the current OLD methods
1: Cleaning data into relational form = deleting some of it
2: ETL with defined schemas = hard to change
3: DevOps (everything as code) does not fit the method
DATA
The Cloud era!
The modern way!
• For modern big data analytics: consider all data valuable, so keep it unstructured (schemaless) and store it in its native format in a low-cost location
• Ingest, analyze and then structure (schema on read, or ELT)
• Schema-free data makes it possible to add new data to the mix
• Infra as Code = no core infra knowledge needed for the data scientist (it's in the code!)
• PaaS services
• Infra as Code
• CI/CD
• DevOps
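To make "ingest, analyze and then structure" concrete, here is a minimal Python sketch of schema on read. It is an illustration only: the sample records, the field names and the helper `read_with_schema` are hypothetical and not part of the session material. The raw lines are kept exactly as they arrived, and each analysis brings its own schema when it reads them.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
# The third record carries a field the first two never had.
RAW_LINES = [
    '{"user": "anna", "region": "en-gb", "duration": 73}',
    '{"user": "bo", "region": "da-dk", "duration": 614}',
    '{"user": "cy", "region": "nl-nl", "duration": 24, "device": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick and cast only the requested fields.

    `schema` maps field name -> type; fields missing from a record
    come back as None instead of failing the whole read.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for name, cast in schema.items():
            value = record.get(name)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# First analysis: only duration per region matters.
by_region = read_with_schema(RAW_LINES, {"region": str, "duration": int})

# Later analysis: a new question needs "device"; the stored data is unchanged.
with_device = read_with_schema(RAW_LINES, {"user": str, "device": str})

print(by_region)
print(with_device)
```

Nothing was discarded at ingest, so the second analysis can ask for a field nobody planned for. With a fixed ETL schema, that field would probably have been dropped before it was ever stored.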
DATA
WHY?
Data-driven business! (Goal)
• Decisions based on data (facts)! Example: Netflix
• Democratize data: available for all (Microsoft services goal)
• Enable AI to get insights not natural to humans: analysis, analytics etc.
DATA
The Azure BIG Data Landscape
Services shown on the slide (some appear in more than one place):
• Azure ML, ML Server, Azure Databricks, Azure Stream Analytics, Azure HDInsight, Azure Data Lake Analytics
• Azure SDK, Azure Data Factory, Azure Import/Export Service, Azure CLI, Azure IoT Hub, Azure Event Hubs
• Azure ExpressRoute, Azure Network Security Groups, Azure Functions, Visual Studio, Operations Management Suite, Azure Search, Azure Active Directory, Cognitive Services, Bot Service, Azure Data Catalog, Azure Key Management Service
• Azure Storage Blobs, Azure Data Lake Store, Kafka on Azure HDInsight, Azure SQL Data Warehouse, Azure SQL DB, Azure Cosmos DB, Azure Analysis Services, Power BI
DATA
Demo: HDInsight (to keep it interesting)
DATA
Pointers on HDInsight
• When integrating HDInsight with Data Lake through an AAD service principal, how do you control access to data on a per-user level from the cluster? (this is where the Apache Ranger discussion comes in)
• Domain-joined premium HDInsight clusters are difficult (at least in preview when documentation was severely lacking)
• Limitations in the process for domain joining HDInsight nodes may be unacceptable for large enterprises (e.g., must put all nodes in the same OU and cannot control their hostnames)
• Parameters in ARM templates are poorly documented and change frequently (don’t assume you can reuse stuff from GitHub)
• Forced tunneling is not supported according to the HDInsight documentation, so how do you get it to work with e.g. ExpressRoute via a customer firewall?
• If multiple HDInsight clusters are deployed in the same VNet, the first 6 characters of the cluster names must be different (this may be a problem depending on the corporate naming convention)
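The naming pointer above is easy to check before deployment. The sketch below is a hypothetical pre-flight helper, not an Azure API: `find_prefix_clashes` and the sample names are made up for illustration, and comparing case-insensitively is an assumption. It flags planned cluster names for one VNet that share their first 6 characters.

```python
from collections import defaultdict

PREFIX_LENGTH = 6  # per the pointer above: the first 6 characters must differ


def find_prefix_clashes(cluster_names, prefix_length=PREFIX_LENGTH):
    """Group cluster names that share the same leading characters.

    Returns {prefix: [names...]} for prefixes used by more than one
    cluster. Comparison is case-insensitive (an assumption).
    """
    groups = defaultdict(list)
    for name in cluster_names:
        groups[name[:prefix_length].lower()].append(name)
    return {prefix: names for prefix, names in groups.items() if len(names) > 1}


# Hypothetical names following a "<company>-<env>-<workload>" convention.
planned = ["contoso-dev-spark", "contoso-prd-spark", "hdi01-contoso-kafka"]
clashes = find_prefix_clashes(planned)
print(clashes)
```

A convention that starts every name with the company name, as in the first two entries, clashes immediately. Putting the distinguishing part of the name first avoids it.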
DATA
Start with the Data
• For modern big data analytics: consider all data valuable, so keep it unstructured (schemaless) and store it in its native format in a low-cost location
• Ingest, analyze and then structure (schema on read, or ELT)
• Schema-free data makes it possible to add new data to the mix
Azure Data Lake
On-demand analytics job service powering intelligent action across a hyper-scale repository for Big Data analytics workloads
• Start in seconds,
• Develop
• Leverage open source
• Debug & optimize
• Petabyte size files
• Virtualize your analytics
• Provide I/O capacity
• Always encrypted,
DATA
Datalake
DATA
Use the built-in capabilities
• Azure Data Lake Analytics: U-SQL
• Ingest, analyze and then structure (schema on read, or ELT)
• Schema-free data makes it possible to add new data to the mix
U-SQL: a language that unifies
• Programming models: Declarative + Imperative
• Data: Structured + Semi-structured + Unstructured
• Workloads: Batch + Interactive + Streaming + Machine Learning
Develop massively parallel programs with simplicity
A simple U-SQL script can scale from Gigabytes to Petabytes without learning complex big data programming techniques.
U-SQL automatically generates a scaled out and optimized execution plan to handle any amount of data.
Execution nodes are rapidly allocated to run the program.
Error handling, network issues, and runtime optimization are handled automatically.
@searchlog =
    EXTRACT UserId int, Start DateTime, Region string, Query string,
            Duration int, Urls string, ClickedUrls string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
OUTPUT @searchlog
    TO @"/Samples/Output/SearchLog_output.tsv"
    USING Outputters.Tsv();
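For readers who do not know U-SQL, the script above does two things: EXTRACT reads a tab-separated file and gives the columns names and types, and OUTPUT writes the rowset back out as TSV. A rough local Python analogue is sketched below. It is an illustration only, not how Data Lake Analytics executes the job: the two sample rows and the helper names `extract_tsv` and `output_tsv` are hypothetical, and the ISO timestamp format is an assumption.

```python
import csv
import io
from datetime import datetime

# Column names and types mirror the EXTRACT clause in the U-SQL script.
SCHEMA = [
    ("UserId", int),
    ("Start", datetime.fromisoformat),
    ("Region", str),
    ("Query", str),
    ("Duration", int),
    ("Urls", str),
    ("ClickedUrls", str),
]


def extract_tsv(stream):
    """Read tab-separated rows and cast each field per SCHEMA."""
    rows = []
    for fields in csv.reader(stream, delimiter="\t"):
        rows.append({name: cast(value) for (name, cast), value in zip(SCHEMA, fields)})
    return rows


def output_tsv(rows, stream):
    """Write rows back out as tab-separated text, in SCHEMA column order."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name, _ in SCHEMA])


# Hypothetical two-row sample standing in for /Samples/Data/SearchLog.tsv.
SAMPLE = (
    "399266\t2012-02-15T11:53:16\ten-us\thow to make nachos\t73\twww.nachos.com\tNULL\n"
    "382045\t2012-02-15T11:53:18\ten-gb\tbest ski resorts\t614\tskiresorts.com\tskiresorts.com\n"
)

searchlog = extract_tsv(io.StringIO(SAMPLE))
out = io.StringIO()
output_tsv(searchlog, out)
print(out.getvalue())
```

The difference the slide points at is scale. This loop runs on one machine, while the U-SQL version is compiled into a distributed execution plan without the script itself changing.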
DATA
DEMO
Azure Data Factory ->
Azure Data Lake ->
Azure Data Lake Analytics: U-SQL ->
Azure Data Lake ->
Power BI
DATA
What to use when?
DATA
From CONTROL to EASE OF USE (reduced administration)
Big data storage: Azure Data Lake Store, Azure Storage
Big data analytics:
• IaaS clusters: Azure Marketplace (HDP | CDH | MapR). Any Hadoop technology, any distribution
• Managed clusters: Azure HDInsight. Workload optimized, managed clusters
• Azure Databricks: frictionless & optimized Spark clusters
• Big data as-a-service: Azure Data Lake Analytics. Data engineering in a job-as-a-service model
DATA
Demo: Databricks (to keep it interesting)
DATA
What to use when?
• Less is more
• Start with the data store (Data Lake) and see how far you get with the native capabilities
• U-SQL, Data Factory
• Use VSTS (DevOps, early)
DATA
THANK YOU!
https://feedback.expertslive.nl
Twitter: Michael_Jonsson
Blog: Azurefabric.com