Data Platform Airlift

October 21 \\ Microsoft Lisbon Experience

What’s new in the Azure Data Platform

Ricardo Peres

Luis Calado

Azure DocumentDB

Azure Search

Azure Machine Learning Marketplace

Azure SQL Database

Azure Data Lake

Azure Data Factory

Agenda

Headline

Core Concepts

Resources

Indexes

Querying

Paging

Updating

Transactions

Partition Resolvers

User Defined Functions

Stored Procedures

Triggers

Security

Limits

Search

Best Practices

Headlines

NoSQL database as a service for JSON documents

Schemaless

RESTful

Part of Azure – only available online

Highly scalable

Several bindings (.NET, JavaScript, Python, ...)

Core Concepts

Resources (1 of 3)

Documents that live in DocumentDB

All have a unique addressable URL (_rid or id):

https://{account}.documents.azure.com/dbs/{_rid-db}/colls/{_rid-col}/docs/{_rid-doc}

All live inside a collection

A collection lives inside a database

A database belongs to an account

A collection can take different kinds of documents
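The account, database, collection and document hierarchy above maps directly onto the URL. A minimal sketch, assuming a hypothetical helper named `documentUrl` and placeholder identifiers (real `_rid` values are generated by the service):

```javascript
// Hypothetical helper: builds the addressable URL of a document from the
// account name and the _rid values of its database, collection and document.
function documentUrl(account, dbRid, collRid, docRid) {
  const path = ["dbs", dbRid, "colls", collRid, "docs", docRid].join("/");
  return "https://" + account + ".documents.azure.com/" + path;
}

// Placeholder values, for illustration only.
console.log(documentUrl("myaccount", "db1", "coll1", "doc1"));
```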

Resources (2 of 3)

Either POCOs or inherit from Resource

Some built-in properties:

Property   User Settable   Purpose
_rid       No              System generated, unique and hierarchical identifier
_etag      No              HTTP ETag required for optimistic concurrency control
_ts        No              Last updated timestamp
_self      No              Unique addressable URL
id         Yes             User defined unique name

If an id property is not specified, one will be provided (Guid)

Case matters!

Resources (3 of 3)

Can have attachments:

https://{account}.documents.azure.com/dbs/{_rid-db}/colls/{_rid-col}/docs/{_rid-doc}/attachments/{_id-attch}

Additional properties:

Property      User Settable   Purpose
contentType   Yes             The content type of the attachment
media         Yes             The URL link or file path where the attachment resides

Indexes (1 of 2)

Consistency can be configured per collection:

Consistent: indexes are updated synchronously

Lazy: indexes are updated asynchronously

None: the collection has no index

Indexes (2 of 2)

By default, all paths are indexed; this can be overridden

Three kinds of property indexes:

Hashed: for exact matches

Range: for range comparisons, ordering

Spatial: for geospatial queries

Three kinds of property value indexes (from JSON):

String (precision: 1-100 or -1)

Number (precision: 1-8 or -1)

Point

A collection can have several indexes at once

If a collection does not have an index, it cannot be queried except by id or self link!
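To make the options above concrete, here is a sketch of an indexing policy written as a JavaScript object, plus a small check of the precision ranges listed on the slide. The property names (`indexingMode`, `includedPaths`, `indexes`, `kind`, `dataType`, `precision`, `excludedPaths`) are assumed from the DocumentDB indexing policy format and should be verified against the service documentation; the `/address/*` path and the `isValidPrecision` helper are hypothetical.

```javascript
// Assumed policy shape: everything indexed, numbers with range indexes,
// strings with hash indexes, one subtree excluded.
const indexingPolicy = {
  indexingMode: "consistent", // or "lazy" / "none"
  automatic: true,
  includedPaths: [
    {
      path: "/*",
      indexes: [
        { kind: "Range", dataType: "Number", precision: -1 },
        { kind: "Hash", dataType: "String", precision: 3 }
      ]
    }
  ],
  excludedPaths: [{ path: "/address/*" }] // hypothetical path
};

// Hypothetical helper: checks a precision against the ranges from the slide
// (String: 1-100 or -1, Number: 1-8 or -1).
function isValidPrecision(dataType, precision) {
  if (precision === -1) return true;
  const max = { String: 100, Number: 8 }[dataType];
  return max !== undefined && Number.isInteger(precision) &&
         precision >= 1 && precision <= max;
}

const allValid = indexingPolicy.includedPaths[0].indexes.every(function (index) {
  return isValidPrecision(index.dataType, index.precision);
});
console.log(allValid);
```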

Querying – SQL (1 of 3)

Returns JSON

Joins only within a single document (over its nested collections)

Comparing values of different data types yields undefined

Math: +, -, *, /, %

Bitwise: |, &, ^, <<, >>, >>>

Logical: AND, OR, NOT

Comparison: =, !=, <, >, <=, >=, <>

String: ||

Ternary and coalesce: ?, ??

IN, BETWEEN, ORDER BY

Parameterized – no SQL injection
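Parameterized queries keep user input out of the SQL text. A minimal sketch, assuming the query is passed as an object with a `query` string and a `parameters` array of name/value pairs; the `parameterizedQuery` helper and the sample values are hypothetical:

```javascript
// Hypothetical helper: pairs a SQL text with named parameters instead of
// concatenating user input into the query string.
function parameterizedQuery(text, values) {
  return {
    query: text,
    parameters: Object.keys(values).map(function (name) {
      return { name: "@" + name, value: values[name] };
    })
  };
}

// The hostile-looking value travels as data, never as part of the SQL text.
const spec = parameterizedQuery(
  "SELECT * FROM Families f WHERE f.lastName = @lastName AND f.address.state IN (@s1, @s2)",
  { lastName: "Andersen\" OR 1=1", s1: "WA", s2: "NY" }
);
console.log(JSON.stringify(spec, null, 2));
```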

Querying – SQL (2 of 3)

SQL functions:

Math: ABS, CEILING, EXP, FLOOR, LOG, LOG10, POWER, ROUND, SIGN, SQRT, SQUARE, TRUNC, ACOS, ASIN, ATAN, ATN2, COS, COT, DEGREES, PI, RADIANS, SIN, TAN

Type checking: IS_ARRAY, IS_BOOL, IS_NULL, IS_NUMBER, IS_OBJECT, IS_STRING, IS_DEFINED, IS_PRIMITIVE

String: CONCAT, CONTAINS, ENDSWITH, INDEX_OF, LEFT, LENGTH, LOWER, LTRIM, REPLACE, REPLICATE, REVERSE, RIGHT, RTRIM, STARTSWITH, SUBSTRING, UPPER

Array: ARRAY_CONCAT, ARRAY_CONTAINS, ARRAY_LENGTH, ARRAY_SLICE

Spatial: ST_DISTANCE, ST_WITHIN, ST_ISVALID, ST_ISVALIDDETAILED

Querying – SQL (3 of 3)

SQL Ternary and coalesce: ?, ??

SELECT (c.grade < 5) ? "elementary" : "other" AS gradeLevel
FROM Families.children[0] c

SELECT f.lastName ?? f.surname AS familyName
FROM Families f

Projecting into new JSON objects:

SELECT { "state": f.address.state, "city": f.address.city, "name": f.id }
FROM Families f
WHERE f.id = "AndersenFamily"

Result: [{ "$1": { "state": "WA", "city": "seattle", "name": "AndersenFamily" } }]

Creating arrays:

SELECT [f.address.city, f.address.state] AS CityState
FROM Families f

Result: [ { "CityState": [ "seattle", "WA" ] }, { "CityState": [ "NY", "NY" ] } ]

Returning single values:

SELECT VALUE "Hello World"

Result: [ "Hello World" ]

Querying - LINQ

LINQ functions:

Math: Abs, Acos, Asin, Atan, Ceiling, Cos, Exp, Floor, Log, Log10, Pow, Round, Sign, Sin, Sqrt, Tan, Truncate

String: Concat, Contains, EndsWith, IndexOf, Count, ToLower, TrimStart, Replace, Reverse, TrimEnd, StartsWith, Substring, ToUpper

Array: Concat, Contains, and Count

Spatial: Distance, Within, IsValid, and IsValidDetailed

Paging

Can specify maximum number of items to retrieve

Has more results / get next results

Ordering
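The "has more results / get next results" pattern can be sketched as a loop that follows a continuation token. Everything here is a stand-in: `makeFakeSource` simulates a query endpoint in memory, and the token format is invented for the example.

```javascript
// In-memory stand-in for a query endpoint (hypothetical): returns at most
// maxItemCount items plus a continuation token while more results remain.
function makeFakeSource(allItems) {
  return function fetchPage(continuation, maxItemCount) {
    const start = continuation ? Number(continuation) : 0;
    const end = Math.min(start + maxItemCount, allItems.length);
    return {
      items: allItems.slice(start, end),
      continuation: end < allItems.length ? String(end) : null
    };
  };
}

// Reads every page, following the continuation token until none is returned.
function readAllPages(fetchPage, maxItemCount) {
  const pages = [];
  let continuation = null;
  do {
    const page = fetchPage(continuation, maxItemCount);
    pages.push(page.items);
    continuation = page.continuation;
  } while (continuation !== null);
  return pages;
}

const pages = readAllPages(makeFakeSource(["a", "b", "c", "d", "e"]), 2);
console.log(JSON.stringify(pages)); // [["a","b"],["c","d"],["e"]]
```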

Updating

Inserts

From POCO

From Stream

Batching:

Document Explorer

Data Migration Tool

Stored Procedures

Replaces

Concurrency control from ETags

Deletes

By self link or id
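ETag-based concurrency control for replaces amounts to "only write if the document has not changed since I read it". The sketch below simulates that check with an in-memory store; `makeStore` and its methods are hypothetical, and 412 is the HTTP "Precondition Failed" status used when such a check fails.

```javascript
// In-memory stand-in for a collection (hypothetical). A replace succeeds
// only if the caller's _etag still matches the stored one.
function makeStore() {
  const docs = {};
  let version = 0;

  function save(doc) {
    version += 1;
    docs[doc.id] = Object.assign({}, doc, { _etag: String(version) });
    return Object.assign({}, docs[doc.id]);
  }

  function read(id) {
    return Object.assign({}, docs[id]);
  }

  function replaceIfMatch(doc) {
    if (docs[doc.id]._etag !== doc._etag) {
      const error = new Error("Precondition failed: document changed since it was read");
      error.code = 412;
      throw error;
    }
    return save(doc);
  }

  return { save: save, read: read, replaceIfMatch: replaceIfMatch };
}

const store = makeStore();
store.save({ id: "session-1", title: "DocumentDB" });
const first = store.read("session-1");
const second = store.read("session-1");

first.title = "Azure DocumentDB";
store.replaceIfMatch(first); // succeeds: the ETag still matches

second.title = "Stale update";
try {
  store.replaceIfMatch(second); // fails: the ETag is out of date
} catch (e) {
  console.log(e.code); // 412
}
```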

Transactions

No explicit transactions

Implicit inside triggers and stored procedures – only at collection level

Partition Resolvers

Specified per database

Possibly several

Can decide on which collection a document is to be saved or retrieved from

Included:

HashPartitionResolver: distributes data evenly across collections

RangePartitionResolver<T>: when there is a “natural” ordering, such as with date and time
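The two resolver styles can be illustrated in plain JavaScript. This is a sketch of the idea only, not the SDK classes: the hash function, the collection links and the date ranges are all made up for the example.

```javascript
// Simple deterministic string hash (illustrative; not the hash used by the SDK).
function hashString(text) {
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Hash-style resolver: spreads partition keys across the given collections.
function hashResolver(collectionLinks) {
  return function (partitionKey) {
    const index = hashString(String(partitionKey)) % collectionLinks.length;
    return collectionLinks[index];
  };
}

// Range-style resolver: each range has inclusive low/high bounds and a target.
function rangeResolver(ranges) {
  return function (partitionKey) {
    const match = ranges.find(function (r) {
      return partitionKey >= r.low && partitionKey <= r.high;
    });
    if (!match) throw new Error("No collection covers key " + partitionKey);
    return match.link;
  };
}

const byUser = hashResolver(["colls/users0", "colls/users1", "colls/users2"]);
const byDate = rangeResolver([
  { low: "2014-01-01", high: "2014-12-31", link: "colls/sessions2014" },
  { low: "2015-01-01", high: "2015-12-31", link: "colls/sessions2015" }
]);

console.log(byUser("ricardo") === byUser("ricardo")); // true: same key, same collection
console.log(byDate("2015-10-21"));                    // colls/sessions2015
```

ISO-formatted dates compare correctly as strings, which is why the range example can use plain `>=` and `<=`.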

User Defined Functions

JavaScript-based

Exist in collections

No side effects

var regexMatchUdf = new UserDefinedFunction
{
    Id = "REGEX_MATCH",
    // Verbatim string (@"...") so the JavaScript body can span several lines
    Body = @"function (input, pattern) {
        return input.match(pattern) !== null;
    }"
};

SELECT udf.REGEX_MATCH(s.Id, "ardo") FROM Session s
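Because a UDF body is ordinary JavaScript without side effects, its logic can be tried locally before it is registered. A small sketch using the same body as a named function (the sample strings are made up); note that the value to test comes first and the pattern second:

```javascript
// Same logic as the UDF body, as a plain function so it can be run locally.
function regexMatch(input, pattern) {
  return input.match(pattern) !== null;
}

console.log(regexMatch("Ricardo", "ardo")); // true
console.log(regexMatch("Luis", "ardo"));    // false
```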

Stored Procedures

JavaScript-based

Exist in collections

Can do batching

Implicit transactions

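function (gender) {
    var context = getContext();
    var response = context.getResponse();
    var collection = context.getCollection();
    // Parameterized query: the gender value is never concatenated into the SQL text
    var query = {
        query: 'SELECT * FROM c WHERE c.Gender = @gender',
        parameters: [{ name: '@gender', value: gender }]
    };
    var accepted = collection.queryDocuments(collection.getSelfLink(), query, {},
        function (err, documents, options) {
            if (err) throw err;
            response.setBody(JSON.stringify(documents));
        });
    // queryDocuments returns false when the request could not be queued
    if (!accepted) throw new Error('The query was not accepted by the server.');
}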
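Batching inside a stored procedure usually follows a "create one, then the next from the callback" pattern, so that all inserts share the procedure's implicit transaction. A sketch under these assumptions: the name `bulkImport` is illustrative, and `getContext()` exists only on the server, so the function can be defined anywhere but only runs inside DocumentDB or against a stub.

```javascript
// Inserts an array of documents one after another and reports how many
// were created. Runs server-side, where getContext() is provided.
function bulkImport(docs) {
  var context = getContext();
  var collection = context.getCollection();
  var collectionLink = collection.getSelfLink();
  var response = context.getResponse();
  var count = 0;

  if (!docs || docs.length === 0) {
    response.setBody(0);
    return;
  }

  tryCreate(docs[count]);

  function tryCreate(doc) {
    var accepted = collection.createDocument(collectionLink, doc, onCreated);
    // If the request is not accepted (for example, time is running out),
    // report progress so the caller can resume from this position.
    if (!accepted) response.setBody(count);
  }

  function onCreated(err) {
    if (err) throw err; // throwing rolls back the implicit transaction
    count++;
    if (count >= docs.length) {
      response.setBody(count);
    } else {
      tryCreate(docs[count]);
    }
  }
}
```

Returning the running count lets a client call the procedure again with the remaining documents when a batch is cut short.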

Triggers

JavaScript-based

Exist in collections

Two types:

Pre trigger

Post trigger

function updateTrigger() {
    var request = getContext().getRequest();
    var doc = request.getBody();
    doc['message'] = 'Added by trigger';
    request.setBody(doc);
}

Security

Access keys:

Master (single)

Read only (multiple)

Database users – specify use at DocumentClient level

Permissions for users over resources (resource tokens: default expiration is 1h, up to 5h):

All

Read

Resources:

Collections

Documents

Attachments

Stored procedures

Triggers

User defined functions

Limits

Feature                                                          Limit
Maximum Request Units / second / collection                      2500
Maximum execution time for stored procedure and trigger          5 s
Provisioned document storage / collection                        50 GB
Maximum collections per database account*                        100
Maximum document storage per database (100 collections)*         1 TB
Maximum length of the id property                                255 chars
Maximum request size of document and attachment                  512 KB
Maximum number of JOINs per query*                               5
Number of stored procedures, triggers and UDFs per collection*   25
Number of users per database account                             500,000

Search

Based on Elasticsearch and Lucene

.NET + REST APIs

Can retrieve data from DocumentDB

Best Practices

Cache the DocumentClient instance

Choose right collection index update policy

Index only properties that will be searchable and with appropriate values – watch out for ranges

Store small documents

Measure and tune request costs

Retrieve only what you need – paging, projections

Cache self links – they never change

Use partition resolvers for distributing burden

Beware throttling!
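When a collection exceeds its request units, the service answers with HTTP 429 and suggests how long to wait. A minimal retry sketch, assuming a hypothetical `operation` function that resolves to `{ status, retryAfterMs, body }`; how the status and delay are obtained depends on the client library in use.

```javascript
// Retries an operation while it reports throttling (HTTP 429), waiting for
// the suggested delay between attempts. `operation` is a hypothetical
// function returning a promise of { status, retryAfterMs, body }.
async function withThrottleRetry(operation, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await operation();
    if (result.status !== 429) return result;
    if (attempt === maxAttempts) break;
    await new Promise(function (resolve) {
      setTimeout(resolve, result.retryAfterMs || 0);
    });
  }
  throw new Error("Still throttled after " + maxAttempts + " attempts");
}

// Fake operation for illustration: throttled twice, then succeeds.
let demoCalls = 0;
function demoOperation() {
  demoCalls += 1;
  return Promise.resolve(demoCalls < 3
    ? { status: 429, retryAfterMs: 1 }
    : { status: 200, body: "ok" });
}

withThrottleRetry(demoOperation, 5).then(function (result) {
  console.log(result.status, "after", demoCalls, "calls");
});
```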

Meet the Competition

MongoDB

Open source + support model

No joins

Aggregations

Time to live

Offline deployment

Replication

Eventual consistency

ACID transactions

Map/Reduce

Several programming languages supported

RavenDB

Open source + support model

Joins across documents

Aggregations

Expiry

Offline deployment

Replication

Eventual consistency

ACID transactions

Map/Reduce

.NET, REST

References

Query Playground: https://www.documentdb.com/sql/demo

.NET Azure DocumentDB Samples: https://github.com/Azure/azure-documentdb-net

DocumentDB Studio: https://studiodocumentdb.codeplex.com/

Azure DocumentDB Data Migration Tool: http://www.microsoft.com/en-us/download/details.aspx?id=46436

Pricing: https://azure.microsoft.com/en-us/pricing/details/documentdb/

Connecting DocumentDB with Azure Search using indexers: https://azure.microsoft.com/en-us/documentation/articles/documentdb-search-indexer/

A search-as-a-service solution allowing developers to incorporate great search experiences into applications without managing infrastructure or needing to become search experts.

Type Ahead

Facets

Hit Highlighting

Spelling Mistakes

Geo-Spatial Search

Paging

Sorting & Scoring
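Several of the features above (paging, sorting, facets, hit highlighting) are expressed as query parameters on a search request. The sketch below assembles such a URL; the `searchUrl` helper, the service and index names, and the field names are hypothetical, and the API version shown is an assumption to be checked against the current REST documentation.

```javascript
// Hypothetical helper: assembles a query URL for the Azure Search REST API.
// Options left undefined are omitted from the query string.
function searchUrl(service, index, options) {
  const params = [
    ["api-version", options.apiVersion],
    ["search", options.search],
    ["$top", options.top],        // page size
    ["$skip", options.skip],      // paging offset
    ["$orderby", options.orderBy],
    ["facet", options.facet],
    ["highlight", options.highlight]
  ].filter(function (pair) {
    return pair[1] !== undefined;
  }).map(function (pair) {
    return pair[0] + "=" + encodeURIComponent(pair[1]);
  });
  return "https://" + service + ".search.windows.net/indexes/" + index +
         "/docs?" + params.join("&");
}

// Third page of ten results, faceted by track, with highlighted titles.
const url = searchUrl("myservice", "sessions", {
  apiVersion: "2015-02-28", // assumed version
  search: "azure data",
  top: 10,
  skip: 20,
  facet: "track",
  highlight: "title"
});
console.log(url);
```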

New indexers (SQL Database and DocumentDB)

New language support (35 languages including pt-PT)

Index creation in the new Management Portal

New Regions

New APIs for index creation

• Distance

• Intersection

Full Text Search

Secure data with authentication, authorization and encryption

Extended Events

[Diagram: Azure Machine Learning components and roles. ML Studio (Data Scientist); Azure Portal & ML API service (Azure Ops Team); ML API service (Developer). Other labels: HDInsight, Azure Storage, training set from on-prem, Power BI/dashboards, mobile apps, web apps.]

ML Studio and the Data Scientist

• Access and prepare data

• Create, test and train models

• Collaborate

• One click to stage for production via the API service

Azure Portal & ML API service and the Azure Ops Team

• Create ML Studio workspace

• Assign storage account(s)

• Monitor ML consumption

• See alerts when model is ready

• Deploy models to web service

ML API service and the Developer

• Tested models available as a URL that can be called from any endpoint

Business users easily access results: from anywhere, on any device
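Calling a published model comes down to posting a JSON body to its URL. The sketch below only builds that body. The payload shape (`Inputs`, `input1`, `ColumnNames`, `Values`, `GlobalParameters`) is an assumption based on the request-response format of ML Studio web services and must be checked against the API help page of the actual service; the helper name, columns and values are hypothetical.

```javascript
// Hypothetical helper: builds the JSON body for a single scoring request.
// The payload shape is assumed, not taken from the presentation.
function buildScoringRequest(columnNames, rows) {
  rows.forEach(function (row) {
    if (row.length !== columnNames.length) {
      throw new Error("Row width does not match the column list");
    }
  });
  return {
    Inputs: {
      input1: { ColumnNames: columnNames, Values: rows }
    },
    GlobalParameters: {}
  };
}

// Made-up columns and values, for illustration only.
const scoringRequest = buildScoringRequest(["Country", "Age"], [["Canada", "32"]]);
console.log(JSON.stringify(scoringRequest));
```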

[Diagram: Azure Machine Learning in the cloud. Labels: Event Hubs, Blob Storage, Microsoft Azure Portal, ML Studio, ML API Service, ML Apps Marketplace, ML Operationalization, ML Algorithms.]

[Diagram labels: Observation, Pattern, Theory, Hypothesis]

What happened? Descriptive Analytics

Why did it happen? Diagnostic Analytics

What will happen? Predictive Analytics

How can we make it happen? Prescriptive Analytics

[Diagram labels, Top-Down: Theory, Hypothesis, Observation, Confirmation]

[Diagram: data warehouse approach. Steps: Understand Corporate Strategy; Gather Requirements (Business Requirements, Technical Requirements); Dimension Modelling; ETL Design; Reporting & Analytics Design; Setup Infrastructure; Install and Tune; Implement Data Warehouse (Physical Design, ETL Development, Reporting & Analytics Development). Data flow labels: Data sources, ETL, Data warehouse, BI and analytics.]

Ingest regardless of requirements

Store in native format without schema definition

Analyze using analytic engines like Hadoop

[Diagram labels: Devices, Interactive queries, Batch queries, Machine Learning, Data warehouse, Real-time analytics]

Store and analyse data of any kind and size

Develop faster, debug and optimise smarter

Interactively explore patterns in your data

No learning curve—use U-SQL, Spark, Hive, HBase and Storm

Managed and supported with an enterprise-grade SLA

Dynamically scales to match your business priorities

Enterprise-grade security with Azure Active Directory

Built on YARN, designed for the cloud

[Diagram: Azure Data Lake ecosystem, grouped into Dev Tools, Data Integration Tools, Platforms and Applications. Dev Tools: Visual Studio, PowerShell. Other labels as extracted: MS: Azure Data Factory, Azure Stream Analytics*; MS: HDInsight, Kona, Azure SQL DW*, AzureML*; 3rd Party: Informatica*; 3rd Party: Cloudera*, Hortonworks*, MapR*; Open Source: Sqoop, Flume; MS: RevolutionR*, PowerBI*; 3rd Party: TBA.]

Customer Table

Last Name   First Name   Country   Age   …
Flasko      Mike         Canada    32
Anand       Subbaraj     USA       30
Gaurav      Malhotra     USA       72
…           …            …         …

Customer Churn Table

Last Name   First Name   At risk of churning   …
Flasko      Mike         Yes
Anand       Subbaraj     No
Gaurav      Malhotra     Yes
…           …

[Diagram: Data Sources, Ingest, Transform & Analyze, Publish. Labels: Call Log Files, Customer Table, Customer Churn Table.]

Customer Call Details

Customers Likely to Churn

Recommended