View
201
Download
4
Category
Preview:
Citation preview
A Big Data Solution using Azure Data Platform
Azure Batch, Azure Data Lake, Azure HDInsight, Power BI
Toronto Azure Group Meeting
January 11, 2017
Roy Kim@roykimtoronto
Agenda
Overview of Big Data + Azure + Data Insights
Walkthrough of Job Postings demo solution architecture & implementation
Q&A
Author: Roy KimBy: Roy Kim
Data to Insight
Author: Roy Kim
Big DataData Platform Technologies
Solution Data Insights
By: Roy Kim
Job Postings Demo Solution
Author: Roy Kim References: https://softwarestrategiesblog.com/2015/09/05/10-ways-big-data-is-revolutionizing-supply-chain-management
Job Postings Azure Data PlatformData Lake, HDInsight,
SQL, Power BIJob trends, analysis
By: Roy Kim
Big Data Spectrum
References: https://softwarestrategiesblog.com/2015/09/05/10-ways-big-data-is-revolutionizing-supply-chain-management
Veracity
By: Roy Kim
Azure Cloud Platform
References: https://blogs.technet.microsoft.com/cansql/2015/06/03/microsoft-data-platform-overview/
By: Roy Kim
Many services growing and maturing
Azure Data Platform
References: https://blogs.technet.microsoft.com/cansql/2015/06/03/microsoft-data-platform-overview/
Two Illustrations:
By: Roy Kim
Data Analytics
By: Roy Kim
Job Postings Data Set
Volume
• Many national job sites
• New job postings daily
• Metadata and full text.
Velocity
• New job postings created every minute
Variety
• Semi-structured
• Job Title
• Location
• Company
• Unstructured
• Job Description
Veracity
• Incomplete/Imprecise
• Salary, Per hour
• FT, PT, Temp, Contract, Seasonal
• Main profession
By: Roy Kim
Power BI – Job Postings Demo Reports
By: Roy Kim
Power BI – Job Postings Demo Reports
By: Roy Kim
Job Postings Big Data Solution Architecture
Azure Data Lake Analytics
Internet Data Sets
USQL
Storage Account
Blob Store
Azure Batch
.NET Console App
Azure Data Lake Store
Blob Store(HDFS)
Azure HDInsight
Hive
Storage Account
Blob Store(HDFS)
Azure
Active Directory
HDInsightAzure SQL
database
SQL Data
Warehouse
Storage blob
Storage (Azure)
Visual Studio
Online
Data LakeAzure SQL / Data
Warehouse
SQL DB
Analysis Services(preview)
Tabular
Sto
rage
Tie
rSe
rvic
es T
ier
REST/HTML/..
Vis
ual
izat
ion
/
Rep
ort
ing
Too
ls
Pre
sen
tati
on
Tier Mobile
Pig
Scoop
By: Roy Kim
Desktop
Batch
BusinessUsers
ReportBuilders
Azure Data Factory
Pipeline
Data Factory
Browser
Service
Application Insights
Microsoft Azure
Data
Job Postings from Internet Job Boards
Web sites that offer APIs such as indeed.com, dice.com, etc. Use any server-side programming language to retrieve data such as NET, Java, Node.js, etc. If no APIs, consider HTML web page scraping
REST API
http end points typically return JSON or XML data formats
Html Web Page Scraping
HTML Agility Pack to assist in parsing the Document Object Model for data points. https://www.nuget.org/packages/HtmlAgilityPack HTML parsing supporting XPath to traverse the Document Object Model (DOM) E.g. doc.DocumentElement.SelectSingleNode(“//div[@id=‘Total Sales’]”)
By: Roy Kim
Job Postings Data Collector .NET Console Application
.NET console application to read data from the internet and store into Azure Storage accounts
Concurrent requests to job postings public API and HTML pages Multi-threaded to increase speed and throughput
Parse HTML pages and JSON Store JSON files directly into Azure Data Lake Store with ADLS .NET SDK Leverages Azure Application Insights for logging trace and exception error messages. To store files into Azure Data Lake Store, the .NET application needs to access with an Azure AD service
principal with the appropriate access control.
By: Roy Kim
Job Postings Data Collector App Architecture
By: Roy Kim
Azure Application Insights
By: Roy Kim
Application Insights Core API. This package provides core functionality for transmission of all Application Insights Telemetry Types and is a dependent package for all other Application Insights packages.
Azure Batch
A managed Azure service executing command line applications. For batch processing or batch computing--running a large volume of similar tasks to get some desired
result. Commonly used by organizations that regularly process, transform, and analyze large volumes of data. Simply, a set of Azure Virtual Machines running a console application to process data that can be on a
recurring schedule and in parallel
References: https://github.com/Microsoft/azure-docs/blob/master/articles/batch/batch-technical-overview.md
Author: Roy Kim
Azure Batch – Demo Implementation
Azure Batch runs the Console Application on a daily schedule against one node (Virtual Machine - 2 cores)
To run console application in parallel through compute nodes, used the sample Parallel Tasks .NET solution which uses the Azure Batch Client SDK.https://github.com/Azure/azure-batch-samples/tree/master/CSharp/ArticleProjects/ParallelTasks
Azure batch is an architecture option to support data collection in terms of velocity and volume.
By: Roy Kim
Azure Data Lake
Intended for data storage in its raw format for future analysis, processing or data modelling. For developers, data scientists, and analysts to store data of any size, shape, and speed. To do all types of processing and analytics across different platforms and languages. Extract and load, minimal transformations To manage data in characteristic of variety, velocity and volume
Two Components Azure Data Lake Store Azure Data Lake Analytics
References: https://azure.microsoft.com/en-us/solutions/data-lake/
By: Roy Kim
Azure Data Lake Store
Azure Data Lake Store is a hyper-scale repository for big data analytic workloads. Azure Data Lake enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics.
The Azure Data Lake store is an Apache Hadoop file system compatible with Hadoop Distributed File System (HDFS)
Can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs
References: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
By: Roy Kim
Azure Data Lake Store
Use Cases Store social media posts, log files,
sensor data Store corporate data such as
relational databases (as flat files)
References: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
By: Roy Kim
Azure Data Lake Analytics
Azure Data Lake Analytics is built to make big data analytics easy. Focus on writing, running, and managing jobs, rather than operating distributed infrastructure. Instead of
deploying, configuring, and tuning hardware. Write queries to transform your data and extract valuable insights. The analytics service can handle jobs of
any scale instantly by setting the dial for how much power you need. U-SQL – a Big Data query language. Likeness of SQL + C# ”schema on reads”
Pay for your job when it is running; making it cost-effective.
Data Collector app stores .json files in respective folders USQL scripts logic:
reads 1000s of JSON files in a given folder Outputs to one TSV (tab delimited) file Create a Tables to schematize the TSV files Query against tables to analyze or transform to a new output file.
References: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview
By: Roy Kim
Azure Data Lake Analytics – Demo Implementation
By: Roy Kim
USQL script: process json files into a tab delimited file
Azure Data Lake Analytics – Demo Implementation
By: Roy Kim
Roy Kim
# of JSON Files
Single output file
50 compute nodes
3.4 mins duration
Azure HDInsight
Hadoop refers to an ecosystem of open-source software that is a framework for distributed processing, storing, and analysis of big data sets on clusters of commodity computer hardware.
Azure HDInsight makes the Hadoop components from the Hortonworks Data Platform (HDP) distribution available in Azure, deploys managed clusters with high reliability and availability, and provides enterprise-grade security and governance with Active Directory.
HDInsight offers the cluster types - Hadoop, HBase, Spark, Kafka, Interactive Hive, Storm, customized, etc.
Supports integration with BI tools such as Power BI, Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
By: Roy Kim
Azure HDInsight – Demo Implementation
Hadoop Cluster Type Data Source
Windows Azure Storage Account Data Lake Store access
Data Lake Store Account
Hive Tables JobPostings (internal) Table
Schema definition Data loaded from Azure Data Lake Store .TSV file into Hadoop cluster’s WASB
JobPostings External Table Data referenced in Azure Data Lake Store .TSV file. This is external to HDInsight storage
account.
By: Roy Kim
Azure HD Insight – Demo Implementation
Considerations
To manage the compute costs, script the provisioning and de-provisioning of the cluster. While a cluster is running, execute scripts and query the data into self service BI tools and into
other data warehouses. In comparison to Azure Data Lake, ADL Analytics may be more cost effective since it is pay per
use at a more granular level - # of nodes and execution time. E.g. Running against 100 nodes may cost a few dollars per minute in ADL Analytics; whereas, in HDInsight, 13 nodes for small VM size may cost a few dollars an hour.
By: Roy Kim
Azure SQL Database
A relational database-as-a-service in the cloud built on the Microsoft SQL Server engine No need to manage the infrastructure. Scale up or down based on Database Transaction Units (DTUs). 1TB storage maximum Can be used as a simpler data warehouse.
By: Roy Kim
Azure SQL Database – Demo Implementation
Developed a simple data warehouse modelling Job Postings data loaded from ADLS Star schema Added a date dimension table Table of # of jobs for each province by a date
hierarchy
By: Roy Kim
Azure Data Factory
Cloud-based data integration service that orchestrates and automates the movement and transformation of data.
Create data pipelines that move and transform data, and then run the pipelines on a specified schedule (hourly, daily, weekly, etc.)
Azure Data Factory
Category Data store Supported as a source Supported as a sink
Azure Azure Blob storageAzure Data Lake StoreAzure SQL DatabaseAzure SQL Data WarehouseAzure Table storageAzure DocumentDBAzure Search Index
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
Databases SQL Server*Oracle*MySQL*DB2*Teradata*PostgreSQL*Sybase*Cassandra*MongoDB*Amazon Redshift
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
✓
File File System*HDFS*Amazon S3FTP
✓
✓
✓
✓
✓
Others SalesforceGeneric ODBC*Generic ODataWeb Table (table from HTML)GE Historian*
✓
✓
✓
✓
✓
By: Roy Kim
Azure Data Factory – Demo Implementation
Moved JobPostings.tsv data from Azure Data Lake to a JobPostings database table in Azure SQL
Linked Service Source: Azure Data Lake Store Target: SQL Database
Pipeline Straight mapping from flat file columns to SQL table columns Convert string text to respective varchar, datetime, long data types.
By: Roy Kim
Architecture
Governance
Security
Capacity
Business Continuity
Application Lifecycle
ManagementPerformance
Solution Design
Operations
By: Roy Kim
Security Architecture
Azure Data Lake Analytics
Internet Data Sources
USQL
Storage Account
Blob Store
Azure Batch
.NET Console App
Azure Data Lake Store
Blob Store(HDFS)
Azure HDInsight
Hive
Storage Account
Blob Store(HDFS)
Azure
Active Directory
HDInsightAzure SQL
database
SQL Data
Warehouse
Storage blob
Storage (Azure)
Visual Studio
Online
Data LakeAzure SQL / Data
Warehouse
SQL DB
Analysis Services(preview)
Tabular
Sto
rage
Tie
rSe
rvic
es T
ier
REST/HTML/..
Vis
ual
izat
ion
/
Rep
ort
ing
Too
ls
Pig
Scoop
Roy Kim
Desktop
Batch
BIDevelopers
IT Ops
Azure Data Factory
Pipeline
Data Factory
Microsoft AzureAPI
Key
API Key
AAD AppService Principal
AAD AppService Principal
AAD UserSQL account
AAD UserSQL account
Account Key
AAD User
End Users
By: Roy Kim
Application Insights
Data Processing & Formats
Azure Data Lake Analytics
Internet Data Sources
USQL
Azure Batch
.NET Console App
Azure HDInsight
Hive
HDInsightAzure SQL
database
SQL Data
Warehouse
Data LakeAzure SQL
SQL DB
Analysis Services(preview)
Tabular
Dat
a Fo
rmat
Serv
ices
Tie
r
REST/HTML/.. Pig
Scoop
Roy Kim
BatchAzure Data
Factory
Pipeline
Data Factory
JSON
HTML
JSON
TSV Hive Ext. Table
Relational DB
Hive Int. Table
Table
By: Roy Kim
Power BI - Demo
Connect to Azure Data Lake Store flat files
But no connection available to Azure Data Lake Analytics (yet) Azure HDInsight Hadoop cluster – Hive tables Azure SQL Database
Transform and Model the data from the above data sources Visualize and Dashboard the modelled data. Publish and share reports View reports through native mobile app.
By: Roy Kim
Appendix - Azure Data Lake Analytics
By: Roy Kim
Appendix - Azure Data Lake Analytics
By: Roy Kim
Recommended