
Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017


Page 1: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017
Page 2: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Specialist Solutions Architect, Data and Analytics, EMEA

November 17th, 2017

Full Stack Analytics on AWS

Ian Robinson

Page 3: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Forces and Trends Prompting the Move to Cloud

Cost Optimization
• Licenses
• Hardware
• Data center and operations

Dark Data
• Prematurely discarding data

Agility
• Experimentation (data & tools)
• Democratised Access to Data
• Time-to-first-results
• Terminate failed experiments early

From BI to Data Science
• In-house data science
• From back office to product

Page 4: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Storage is the Gravity for Cloud Applications

Store all your data, for ever, at every stage of its lifecycle

Apply it using the right tool for the job

Page 5: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Storage is Job #1

Page 6: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Object Storage is Foundational

Page 7: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Events and Lifecycle Management

• Create
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
• Delete
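The tiering on this slide can be expressed as an S3 lifecycle rule. The following is a minimal sketch, not a configuration from the talk: the `raw/` prefix, the 30 and 90 day thresholds, and the helper name `build_lifecycle_rule` are all illustrative assumptions. It only builds the rule document; nothing is sent to AWS.

```python
import json


def build_lifecycle_rule(prefix, ia_after_days=30, glacier_after_days=90,
                         expire_after_days=None):
    """Build one S3 lifecycle rule that tiers objects down as they age."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Standard-IA transition")
    rule = {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Standard -> Standard - Infrequent Access
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            # Standard - Infrequent Access -> Amazon Glacier
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    # Deletion is optional: leave it out to keep data indefinitely.
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


# Hypothetical prefix; thresholds are assumptions, not values from the talk.
lifecycle_configuration = {"Rules": [build_lifecycle_rule("raw/")]}
print(json.dumps(lifecycle_configuration, indent=2))
```

A dictionary of this shape is what boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument.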

Page 8: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

S3 as the Data Lake Fabric

• Unlimited number of objects and volume
• 99.99% availability
• 99.999999999% durability
• Versioning
• Tiered storage via lifecycle policies
• SSL, client/server-side encryption at rest
• Low cost (just over $2700/month for 100TB)
• Natively supported by big data frameworks (Spark, Hive, Presto, etc)
• Decouples storage and compute
• Run transient compute clusters (with Amazon EC2 Spot Instances)
• Multiple, heterogeneous clusters can use same data

Page 9: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Automated Data Ingestion

Database Migration Service

Page 10: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Stream Events to S3 Using Kinesis Firehose
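As a rough illustration of the producer side, the sketch below groups events into batches sized for Firehose's batch-put call. The limits used (500 records and 4 MiB per call) are the PutRecordBatch quotas as I understand them and should be checked against current documentation; the function name and the newline-delimited JSON encoding are assumptions, not something shown in the talk.

```python
import json

MAX_RECORDS_PER_BATCH = 500            # assumed PutRecordBatch record-count limit
MAX_BYTES_PER_BATCH = 4 * 1024 * 1024  # assumed PutRecordBatch request-size limit


def batch_events(events):
    """Group JSON-serialisable events into Firehose-sized batches."""
    batch, batch_bytes = [], 0
    for event in events:
        # A trailing newline keeps records separable once Firehose
        # concatenates them into a single S3 object.
        data = (json.dumps(event) + "\n").encode("utf-8")
        if batch and (len(batch) == MAX_RECORDS_PER_BATCH
                      or batch_bytes + len(data) > MAX_BYTES_PER_BATCH):
            yield batch
            batch, batch_bytes = [], 0
        batch.append({"Data": data})
        batch_bytes += len(data)
    if batch:
        yield batch


batches = list(batch_events({"id": i} for i in range(1200)))
print([len(b) for b in batches])
```

Each batch has the shape expected by the `Records` argument of boto3's Firehose `put_record_batch`, alongside a `DeliveryStreamName`.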

Page 11: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Write Database Changes to S3 with DMS

Full Load
<schema_name>/<table_name>/LOAD001.csv
<schema_name>/<table_name>/LOAD002.csv

Change Data Capture
<schema_name>/<table_name>/<time-stamp>.csv
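A downstream job usually needs to tell full-load files from change files. The sketch below does this purely from the key layout shown on the slide; the exact number of digits after `LOAD` and the format of the time-stamp are not specified there, so the patterns are deliberately loose, and the schema and table names in the example are hypothetical.

```python
import re

# <schema_name>/<table_name>/LOAD<digits>.csv
FULL_LOAD = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/LOAD\d+\.csv$")
# <schema_name>/<table_name>/<time-stamp>.csv (any other CSV name is treated as a change file)
CDC = re.compile(r"^(?P<schema>[^/]+)/(?P<table>[^/]+)/(?P<stamp>[^/]+)\.csv$")


def classify_key(key):
    """Return (kind, schema, table) for a DMS output key, or None if it does not match."""
    match = FULL_LOAD.match(key)
    if match:
        return ("full_load", match["schema"], match["table"])
    match = CDC.match(key)
    if match:
        return ("cdc", match["schema"], match["table"])
    return None


print(classify_key("sales/orders/LOAD001.csv"))
print(classify_key("sales/orders/20171117-101500.csv"))
```

Checking the full-load pattern first matters, because a `LOAD…` file would also satisfy the looser change-file pattern.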

Page 12: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Storage + Catalog

Scalable (secure, versioned, durable) storage
+ Immutable data at every stage of its lifecycle
+ Versioned schema and metadata
= Data discovery, lineage

Page 13: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

AWS Glue

• Data Catalog: Discover and store metadata

• Job Authoring: Auto-generated ETL code

• Job Execution: Serverless scheduling and execution

Page 14: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Glue Data Catalog

Hive metastore-compatible, highly-available metadata repository:
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve
• Table definitions – usable by Redshift, Athena, Glue, EMR

Populate using Hive DDL, bulk import, or automatically through crawlers.

Page 15: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Crawlers: Automatic Schema Inference

[Diagram: enumerate S3 objects (file 1, file 2, … file N) → identify file type and parse files → semi-structured per-file schema → semi-structured unified schema]

• Custom classifiers: app log parser, metrics parser
• System classifiers: JSON parser, CSV parser, Apache log parser
• Schema elements shown: int, char, bool, array, struct
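The step from per-file schemas to a unified schema can be illustrated with a small sketch. This is not Glue's actual inference algorithm, only one plausible way to merge schemas; the type names mirror those in the diagram, and the `choice` marker for conflicting types is an assumption made for this example.

```python
def infer_schema(value):
    """Infer a simple schema from one parsed JSON value."""
    if isinstance(value, bool):       # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "char"
    if isinstance(value, list):
        element = None
        for item in value:
            element = unify(element, infer_schema(item))
        return {"array": element}
    if isinstance(value, dict):
        return {"struct": {k: infer_schema(v) for k, v in value.items()}}
    return "null"


def unify(left, right):
    """Merge two schemas into one that describes both."""
    if left is None or left == "null":
        return right
    if right is None or right == "null":
        return left
    if left == right:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        if "struct" in left and "struct" in right:
            fields = dict(left["struct"])
            for name, schema in right["struct"].items():
                fields[name] = unify(fields.get(name), schema)
            return {"struct": fields}
        if "array" in left and "array" in right:
            return {"array": unify(left["array"], right["array"])}
    return {"choice": [left, right]}  # conflicting types are kept side by side


# Two hypothetical records standing in for "file 1" and "file 2".
file1 = {"id": 1, "tags": ["a"], "user": {"name": "x"}}
file2 = {"id": 2, "active": True, "user": {"name": "y", "age": 3}}
unified = unify(infer_schema(file1), infer_schema(file2))
print(unified)
```

Fields present in only one file are kept, so the unified schema is the union of the per-file schemas.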

Page 16: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Indexing and Searching Using Metadata

[Diagram: Amazon S3 (ObjectCreated, ObjectDeleted) → AWS Lambda → PutItem → Metadata Index (Amazon DynamoDB) → Update Stream → AWS Lambda (Extract Search Fields) → Update Index → Search Index (Amazon Elasticsearch)]
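The first Lambda function in this flow has to turn S3 event notifications into writes against the metadata index. A minimal sketch of that translation follows; the item attribute names and the function name are assumptions, and the actual DynamoDB calls are left out so the logic can run anywhere. Note that S3 reports deletions with event names beginning `ObjectRemoved`, which is what the code checks for.

```python
from urllib.parse import unquote_plus


def to_index_operations(event):
    """Translate an S3 event notification into metadata-index operations."""
    operations = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        obj = record["s3"]["object"]
        key = unquote_plus(obj["key"])  # S3 URL-encodes keys in event notifications
        item_key = {"bucket": bucket, "key": key}
        if record["eventName"].startswith("ObjectCreated"):
            item = dict(item_key, size=obj.get("size", 0), etag=obj.get("eTag"),
                        event_time=record.get("eventTime"))
            operations.append(("put", item))
        elif record["eventName"].startswith("ObjectRemoved"):
            operations.append(("delete", item_key))
    return operations


# Hypothetical event with one created and one removed object.
event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-11-17T10:00:00.000Z",
            "s3": {"bucket": {"name": "lake"},
                   "object": {"key": "raw/my+file.csv", "size": 42, "eTag": "abc"}},
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {"bucket": {"name": "lake"}, "object": {"key": "raw/old.csv"}},
        },
    ]
}
operations = to_index_operations(event)
print(operations)
```

In a deployed function each `put` would become a DynamoDB PutItem and each `delete` a DeleteItem; the table's update stream then drives the second function that maintains the search index.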

Page 17: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Security is Job #0

Page 18: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Data Access & Authorisation
Give your users easy and secure access

Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog

Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified

Page 19: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Identity and Access Management

• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with Amazon Security Token Service (Amazon STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
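To give a flavour of the policy language, here is a sketch of an IAM policy document that grants read-only access to one prefix of a data lake bucket. The bucket name, prefix, statement IDs, and helper name are hypothetical; the document structure (`Version`, `Statement`, `Effect`, `Action`, `Resource`, `Condition`) follows the standard IAM policy grammar.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build an IAM policy document granting read-only access to one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, narrowed to the prefix by a condition.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, scoped by the resource ARN.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical bucket and prefix.
policy = read_only_prefix_policy("example-data-lake", "curated/sales")
print(json.dumps(policy, indent=2))
```

The same statements, with a `Principal` added, could serve as the basis of an S3 bucket policy instead of an identity policy.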

Page 20: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Security at the Data Level

IAM

Service API Access
• Amazon S3
• Amazon ElastiCache
• Amazon DynamoDB
• Amazon EMR
• Amazon Kinesis
• Amazon Athena

Page 21: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Storage Level Support for Access Logging and Audit

• Access Logging: Amazon S3
• API Logging: AWS CloudTrail
• Access Log Analytics: Amazon Athena, Amazon EMR
• IAM
• Third Party Ecosystem Security Tools
• Links: http://amzn.to/2tSimHj, http://amzn.to/2si6RqS

Page 22: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Encryption Options

AWS Server-Side encryption
• AWS managed key infrastructure

AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services

AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
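For the first two options, the choice shows up as a couple of extra arguments when writing an object to S3. The sketch below maps a mode name to those arguments; the mode labels `sse-s3` and `sse-kms`, the helper name, and the key alias are assumptions for illustration, while `ServerSideEncryption` and `SSEKMSKeyId` are the parameter names used by boto3's `put_object`. CloudHSM is not covered here.

```python
def encryption_args(mode, kms_key_id=None):
    """Return the extra put_object arguments for a server-side encryption mode."""
    if mode == "sse-s3":
        # Keys are managed entirely by S3.
        return {"ServerSideEncryption": "AES256"}
    if mode == "sse-kms":
        # Keys are managed in AWS Key Management Service.
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:
            args["SSEKMSKeyId"] = kms_key_id
        return args
    raise ValueError(f"unsupported encryption mode: {mode}")


print(encryption_args("sse-s3"))
print(encryption_args("sse-kms", "alias/example-data-lake"))  # hypothetical key alias
```

The returned dictionary can be unpacked into a `put_object` call alongside `Bucket`, `Key`, and `Body`.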

Page 23: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Serverless Processing and Analytics

Page 24: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Managed ETL with AWS Glue

• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue

Page 25: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Job Execution with AWS Glue

• Schedule-based
• Event-based
• On demand

Page 26: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Amazon Kinesis Analytics

• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms

Page 27: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Amazon Kinesis Analytics – Simple SQL Interface

SELECT STREAM author, COUNT(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#BigDataSpain%'
WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);
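To make the semantics of the windowed query concrete, the sketch below reproduces the same computation in plain Python: for each tweet mentioning the hashtag, it emits the author together with the number of that author's matching tweets in the preceding minute, the current one included. It assumes tweets arrive ordered by timestamp (in seconds); the tuple layout and sample data are hypothetical.

```python
from collections import defaultdict, deque


def sliding_counts(tweets, hashtag="#BigDataSpain", window_seconds=60):
    """Yield (author, count) per matching tweet, counting that author's
    matching tweets in the preceding window (current tweet included)."""
    recent = defaultdict(deque)  # author -> timestamps still inside the window
    for timestamp, author, text in tweets:  # tweets must be ordered by timestamp
        if hashtag not in text:             # WHERE text LIKE '%#BigDataSpain%'
            continue
        times = recent[author]              # PARTITION BY author
        times.append(timestamp)
        while times[0] < timestamp - window_seconds:  # RANGE ... PRECEDING
            times.popleft()
        yield author, len(times)


# Hypothetical stream of (timestamp_seconds, author, text) tuples.
tweets = [
    (0, "ana", "Hi #BigDataSpain"),
    (10, "bob", "lunch"),
    (30, "ana", "More #BigDataSpain"),
    (61, "ana", "#BigDataSpain again"),
    (100, "bob", "#BigDataSpain"),
]
results = list(sliding_counts(tweets))
print(results)
```

Unlike a tumbling window, this emits one result per input row, which is why the third tweet from the same author still reports a count of two after the first has aged out.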

Page 28: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Amazon Athena – Analyze Data in S3

• Interactive queries
• ANSI SQL
• No infrastructure or administration
• Zero spin up time
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability

Page 29: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Simple query editor with syntax highlighting and autocomplete

Data Catalog

Query History, Saved Queries, and Catalog Management

Page 30: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena

Amazon RDS

Amazon S3

Amazon Redshift

Amazon Athena

Using Amazon Athena with Amazon QuickSight

Page 31: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017
Page 32: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Building Smarter Applications

Page 33: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Add Machine Learning Capabilities

Amazon Machine Learning Service
• Batch and online predictions
• Train using data in S3, RDS and Redshift

Amazon EMR
• Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda)
• Provision analytics clusters in minutes, autoscale with data volume or query demand

Page 34: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Amazon AI Services

Amazon Polly – Lifelike Text-to-Speech
• 47 voices, 24 languages
• Low-latency, real time

Amazon Rekognition – Image Analysis
• Object and scene detection
• Facial analysis

Amazon Lex – Conversational Engine
• Speech and text recognition
• Enterprise connectors

Page 35: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Facial Analysis with Rekognition

• Demographic Data
• Facial Landmarks
• Sentiment Expressed
• Image Quality (Brightness: 25.84, Sharpness: 160)
• General Attributes
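The attribute groups on this slide correspond to fields in the face details that Rekognition returns. The sketch below flattens one such face record into those groups. The sample record is hypothetical apart from the brightness and sharpness values taken from the slide, and the mapping of fields to groups is my own reading; the field names (`AgeRange`, `Gender`, `Emotions`, `Quality`, `Landmarks`, `Smile`) follow the DetectFaces response format.

```python
def summarise_face(face):
    """Flatten one FaceDetail-style dict into the attribute groups on the slide."""
    emotions = face.get("Emotions", [])
    top_emotion = max(emotions, key=lambda e: e["Confidence"])["Type"] if emotions else None
    return {
        "demographic": {
            "age_range": (face["AgeRange"]["Low"], face["AgeRange"]["High"]),
            "gender": face["Gender"]["Value"],
        },
        "sentiment": top_emotion,
        "image_quality": dict(face.get("Quality", {})),
        "landmarks": {p["Type"]: (p["X"], p["Y"]) for p in face.get("Landmarks", [])},
        "general": {"smile": face.get("Smile", {}).get("Value")},
    }


# Hypothetical face record; only the Quality values come from the slide.
sample_face = {
    "AgeRange": {"Low": 26, "High": 38},
    "Gender": {"Value": "Female", "Confidence": 99.9},
    "Smile": {"Value": True, "Confidence": 95.0},
    "Emotions": [
        {"Type": "HAPPY", "Confidence": 92.1},
        {"Type": "CALM", "Confidence": 5.3},
    ],
    "Quality": {"Brightness": 25.84, "Sharpness": 160},
    "Landmarks": [{"Type": "eyeLeft", "X": 0.41, "Y": 0.36}],
}
summary = summarise_face(sample_face)
print(summary)
```

In practice the input would be one element of the `FaceDetails` list returned when all facial attributes are requested.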

Page 36: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

AWS Deep Learning AMI

• Up to ~40k CUDA cores
• Pre-configured CUDA drivers
• Jupyter notebook with Python2, Python3, Anaconda
• CloudFormation Template
• AWS Marketplace – one-click deploy

Page 37: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Data Ingestion
Get your data into S3 quickly and securely

Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog

Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding

Data Access & Authorisation
Give your users easy and secure access

Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified

Services shown: Kinesis Firehose; Athena (Query Service); Glue; Machine Learning (Predictive analytics); Amazon AI

Page 38: Full Stack Analytics on Amazon Web Services by Ian Robinson at Big Data Spain 2017

Thank You

Full Stack Analytics on AWS