big-data-spain
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Full Stack Analytics on AWS
Ian Robinson
Specialist Solutions Architect, Data and Analytics, EMEA
November 17th, 2017
Forces and Trends Prompting the Move to Cloud
Cost Optimization
• Licenses
• Hardware
• Data center and operations
Dark Data
• Prematurely discarding data
Agility
• Experimentation (data & tools)
• Democratised Access to Data
• Time-to-first-results
• Terminate failed experiments early
From BI to Data Science
• In-house data science
• From back office to product
Storage is the Gravity for Cloud Applications
Store all your data, for ever, at every stage of its lifecycle
Apply it using the right tool for the job
Storage is Job #1
Object Storage is Foundational
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
Events and Lifecycle Management: Create, Delete
S3 as the Data Lake Fabric
• Unlimited number of objects and volume
• 99.99% availability
• 99.999999999% durability
• Versioning
• Tiered storage via lifecycle policies
• SSL, client/server-side encryption at rest
• Low cost (just over $2700/month for 100TB)
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouples storage and compute
• Run transient compute clusters (with Amazon EC2 Spot Instances)
• Multiple, heterogeneous clusters can use same data
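The "tiered storage via lifecycle policies" bullet can be illustrated with a minimal sketch. It only builds the rule document; the prefix, the 30 and 90 day thresholds, the bucket name in the comment, and the helper name `tiering_lifecycle` are assumptions for illustration, not values from the talk.

```python
import json


def tiering_lifecycle(prefix, ia_after_days=30, glacier_after_days=90):
    """Build an S3 lifecycle configuration that tiers objects under `prefix`.

    The day thresholds are illustrative assumptions, not recommendations.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Infrequent Access transition")
    rule_id = "tier-" + (prefix.strip("/").replace("/", "-") or "all")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Standard -> Standard - Infrequent Access
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    # Standard - Infrequent Access -> Amazon Glacier
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = tiering_lifecycle("raw/events/")
print(json.dumps(config, indent=2))
# With boto3 installed and credentials configured, this dictionary could be passed as
# s3.put_bucket_lifecycle_configuration(Bucket="example-bucket", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review and version alongside the rest of the data lake configuration.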
Database Migration Service
Automated Data Ingestion
Stream Events to S3 Using Kinesis Firehose
Write Database Changes to S3 with DMS
Full Load:
<schema_name>/<table_name>/LOAD001.csv
<schema_name>/<table_name>/LOAD002.csv
Change Data Capture:
<schema_name>/<table_name>/<time-stamp>.csv
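A downstream job often needs to tell full-load files from change data capture files. The sketch below does that from the key layout shown on this slide; the schema and table names and the exact timestamp format in the usage lines are hypothetical.

```python
import re

# Full-load files on the slide are named LOAD001.csv, LOAD002.csv, ...
FULL_LOAD = re.compile(r"^LOAD\d+\.csv$")


def classify_dms_key(key):
    """Return (schema, table, kind) for a key shaped like <schema>/<table>/<file>.csv.

    Anything that is not a LOADnnn.csv file is treated as a change data capture
    file, which is an assumption based on the slide's <time-stamp>.csv pattern.
    """
    parts = key.split("/")
    if len(parts) < 3:
        raise ValueError(f"unexpected key layout: {key!r}")
    schema, table, filename = parts[-3], parts[-2], parts[-1]
    kind = "full_load" if FULL_LOAD.match(filename) else "cdc"
    return schema, table, kind


print(classify_dms_key("sales/orders/LOAD001.csv"))
print(classify_dms_key("sales/orders/20171117-101500123.csv"))  # hypothetical timestamp format
```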
Storage + Catalog
Scalable (secure, versioned, durable) storage
+ Immutable data at every stage of its lifecycle
+ Versioned schema and metadata
= Data discovery, lineage
AWS Glue
• Data Catalog: discover and store metadata
• Job Authoring: auto-generated ETL code
• Job Execution: serverless scheduling and execution
Glue Data Catalog
Hive metastore-compatible, highly-available metadata repository:
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve
• Table definitions – usable by Redshift, Athena, Glue, EMR
Populate using Hive DDL, bulk import, or automatically through crawlers.
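As one illustration of populating the catalog with Hive DDL, the sketch below renders a `CREATE EXTERNAL TABLE` statement for delimited text files in S3. The database, table, columns, and bucket are hypothetical, and the helper `hive_external_table_ddl` is not part of any AWS SDK.

```python
def hive_external_table_ddl(database, table, columns, location, delimiter=","):
    """Render Hive DDL for an external table over delimited text files.

    `columns` is a list of (name, hive_type) pairs, for example ("order_id", "bigint").
    """
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"{cols}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


ddl = hive_external_table_ddl(
    "sales",                                     # hypothetical database
    "orders",                                    # hypothetical table
    [("order_id", "bigint"), ("customer", "string"), ("total", "double")],
    "s3://example-bucket/sales/orders/",         # hypothetical bucket and prefix
)
print(ddl)
```

A statement like this could be submitted through any Hive-compatible engine that shares the catalog; crawlers are the alternative when the schema is not known up front.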
[Diagram: a crawler enumerates S3 objects (file 1, file 2, … file N), identifies each file's type and parses it with custom classifiers (app log parser, metrics parser, …) or system classifiers (JSON parser, CSV parser, Apache log parser, …). Each file yields a semi-structured per-file schema built from int, char, bool, array and struct types; these are then combined into a semi-structured unified schema.]
Crawlers: Automatic Schema Inference
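The idea of per-file schemas being merged into a unified schema can be sketched in a few lines. This is a toy illustration for JSON files only, not the algorithm AWS Glue crawlers use; the type names mirror the labels on the slide, and the "choice" marker for conflicting types is an assumption of this sketch.

```python
import json
from functools import reduce


def infer_type(value):
    """Infer a simple type description for one parsed JSON value."""
    if isinstance(value, bool):          # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "char"
    if isinstance(value, list):
        # Assumption: describe an array by its first element only
        return {"array": infer_type(value[0]) if value else "unknown"}
    if isinstance(value, dict):
        return {"struct": {k: infer_type(v) for k, v in value.items()}}
    return "unknown"                     # for example JSON null


def _options(t):
    return list(t["choice"]) if isinstance(t, dict) and "choice" in t else [t]


def unify(a, b):
    """Merge two inferred types; conflicting types are kept side by side as a 'choice'."""
    if a == b:
        return a
    if a == "unknown":
        return b
    if b == "unknown":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "struct" in a and "struct" in b:
            fields = dict(a["struct"])
            for name, t in b["struct"].items():
                fields[name] = unify(fields[name], t) if name in fields else t
            return {"struct": fields}
        if "array" in a and "array" in b:
            return {"array": unify(a["array"], b["array"])}
    merged = _options(a)
    for option in _options(b):
        if option not in merged:
            merged.append(option)
    return {"choice": merged}


def unified_schema(files):
    """Infer a per-file schema for each JSON document, then merge them (needs at least one file)."""
    per_file = [infer_type(json.loads(text)) for text in files]
    return reduce(unify, per_file)


files = ['{"id": 1, "tags": ["a"]}', '{"id": 2, "name": "x", "ok": true}']
print(unified_schema(files))
```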
Indexing and Searching Using Metadata
[Diagram: Amazon S3 ObjectCreated / ObjectDeleted events invoke AWS Lambda, which calls PutItem on the Metadata Index (Amazon DynamoDB). The table's update stream invokes a second AWS Lambda function, which extracts search fields and updates the Search Index (Amazon Elasticsearch).]
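The first step of this indexing flow, turning S3 event notifications into metadata index operations, might look like the sketch below. It only parses the event and returns the intended actions; writing them to DynamoDB is left out, and the item attribute names and the sample bucket and keys are assumptions. Note that S3 notification records name delete events `ObjectRemoved:*`, which corresponds to the ObjectDeleted label in the diagram.

```python
from urllib.parse import unquote_plus


def metadata_actions(event):
    """Translate an S3 event notification into ("put" | "delete", item) pairs."""
    actions = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])   # keys arrive URL-encoded
        name = record["eventName"]
        if name.startswith("ObjectCreated"):
            item = {
                "bucket": bucket,
                "key": key,
                "size": s3["object"].get("size", 0),
                "event_time": record.get("eventTime"),
            }
            actions.append(("put", item))
        elif name.startswith("ObjectRemoved"):
            actions.append(("delete", {"bucket": bucket, "key": key}))
    return actions


# Hand-written sample event with hypothetical bucket and keys
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-11-17T10:15:00.000Z",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "logs/app+log%3A1.json", "size": 2048},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "logs/old.json"},
            },
        },
    ]
}

for action in metadata_actions(sample_event):
    print(action)
```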
Security is Job #0
Data Access & Authorisation: Give your users easy and secure access
Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
Identity and Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with Amazon Security Token Service (Amazon STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
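To give a flavour of the policy language, here is a minimal sketch of an identity policy granting read-only access to one prefix of a bucket. The bucket name, prefix, statement IDs, and the helper `read_only_prefix_policy` are hypothetical; a real policy would be tailored to the data lake's layout.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build an IAM policy document allowing list and read under one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the prefix
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_prefix_policy("example-data-lake", "curated/sales/")
print(json.dumps(policy, indent=2))
```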
Service API Access
[Diagram: IAM in front of service APIs for Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, and Amazon Athena]
Security at the Data Level
Third Party Ecosystem Security Tools
Storage Level Support for Access Logging and Audit
[Diagram labels: Amazon S3 (Access Logging), AWS CloudTrail (API Logging), Access Log Analytics, Amazon Athena, Amazon EMR, IAM. Links: http://amzn.to/2tSimHj, http://amzn.to/2si6RqS]
Encryption Options
AWS Server-Side encryption
• AWS managed key infrastructure
AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services
AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
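For the first two options, the choice shows up as request parameters when an object is written to S3. The sketch below only assembles those parameters; the bucket, key, and KMS key alias are hypothetical, and the helper `sse_put_params` is an illustration rather than an SDK function.

```python
def sse_put_params(bucket, key, body, kms_key_id=None):
    """Assemble S3 PutObject parameters that request server-side encryption.

    Without a KMS key the object is encrypted with S3-managed keys (AES256);
    with one, AWS Key Management Service is used instead.
    """
    params = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id:
        params["ServerSideEncryption"] = "aws:kms"
        params["SSEKMSKeyId"] = kms_key_id
    else:
        params["ServerSideEncryption"] = "AES256"
    return params


s3_managed = sse_put_params("example-data-lake", "raw/a.csv", b"id,total\n1,9.5\n")
kms_managed = sse_put_params(
    "example-data-lake", "raw/a.csv", b"id,total\n1,9.5\n",
    kms_key_id="alias/example-key",   # hypothetical key alias
)
print(s3_managed["ServerSideEncryption"], kms_managed["ServerSideEncryption"])
# With boto3 installed and credentials configured: s3.put_object(**kms_managed)
```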
Serverless Processing and Analytics
Managed ETL with AWS Glue
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue
Job Execution with AWS Glue
• Schedule-based
• Event-based
• On demand
Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
SELECT STREAM author, count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#BigDataSpain%'
WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);
Amazon Kinesis Analytics – Simple SQL Interface
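The windowed query on this slide emits, for each tweet containing the hashtag, how many matching tweets the same author posted in the preceding minute. A plain-Python sketch of that sliding-window logic, assuming tweets arrive as `(timestamp_seconds, author, text)` tuples in time order (the tuple shape and sample data are assumptions):

```python
from collections import defaultdict, deque


def sliding_counts(tweets, hashtag="#BigDataSpain", window_seconds=60):
    """Yield (timestamp, author, count) for each matching tweet.

    `count` is the number of matching tweets by that author within the
    preceding `window_seconds`, including the current one.
    """
    recent = defaultdict(deque)          # author -> timestamps still inside the window
    for ts, author, text in tweets:      # assumed sorted by timestamp
        if hashtag not in text:
            continue
        times = recent[author]
        times.append(ts)
        while times[0] < ts - window_seconds:
            times.popleft()              # drop tweets older than the window
        yield ts, author, len(times)


tweets = [
    (0, "ana", "hello #BigDataSpain"),
    (30, "ana", "#BigDataSpain again"),
    (40, "bob", "no hashtag here"),
    (45, "bob", "made it to #BigDataSpain"),
    (95, "ana", "closing talk #BigDataSpain"),
]
for row in sliding_counts(tweets):
    print(row)
```

The streaming service handles partitioning, ordering, and scaling; the sketch only shows what the window computes.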
Amazon Athena – Analyze Data in S3
• Interactive queries
• ANSI SQL
• No infrastructure or administration
• Zero spin up time
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
Simple query editor with syntax highlighting and autocomplete
Data Catalog
Query History, Saved Queries, and Catalog Management
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena
Amazon RDS
Amazon S3
Amazon Redshift
Amazon Athena
Using Amazon Athena with Amazon QuickSight
Building Smarter Applications
Add Machine Learning Capabilities
Amazon Machine Learning Service
• Batch and online predictions
• Train using data in S3, RDS and Redshift
Amazon EMR
• Comprehensive machine learning libraries (e.g. Spark MLlib, Anaconda)
• Provision analytics clusters in minutes, autoscale with data volume or query demand
Amazon AI Services
Amazon Polly – Lifelike Text-to-Speech
• 47 voices, 24 languages
• Low-latency, real time
Amazon Rekognition – Image Analysis
• Object and scene detection
• Facial analysis
Amazon Lex – Conversational Engine
• Speech and text recognition
• Enterprise connectors
Facial Analysis with Rekognition
• Demographic Data
• Facial Landmarks
• Sentiment Expressed
• Image Quality (Brightness: 25.84, Sharpness: 160)
• General Attributes
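The categories on this slide map onto fields of a face detail record returned by Rekognition face detection. The sketch below summarises one such record; the sample is hand-written, its brightness and sharpness are the values from the slide, and every other value is an illustrative assumption rather than real output.

```python
def summarize_face(face):
    """Condense one Rekognition FaceDetail-style dictionary into a flat summary."""
    emotions = face.get("Emotions", [])
    top = max(emotions, key=lambda e: e["Confidence"]) if emotions else None
    age = face.get("AgeRange")
    quality = face.get("Quality", {})
    return {
        "age_range": (age["Low"], age["High"]) if age else None,   # demographic data
        "gender": face.get("Gender", {}).get("Value"),             # demographic data
        "top_emotion": top["Type"] if top else None,               # sentiment expressed
        "brightness": quality.get("Brightness"),                   # image quality
        "sharpness": quality.get("Sharpness"),                     # image quality
        "smile": face.get("Smile", {}).get("Value"),               # general attribute
        "landmarks": len(face.get("Landmarks", [])),               # facial landmarks
    }


# Hand-written sample; only Brightness and Sharpness come from the slide
sample_face = {
    "AgeRange": {"Low": 26, "High": 38},
    "Gender": {"Value": "Female", "Confidence": 99.0},
    "Emotions": [
        {"Type": "HAPPY", "Confidence": 92.5},
        {"Type": "CALM", "Confidence": 5.1},
    ],
    "Quality": {"Brightness": 25.84, "Sharpness": 160},
    "Smile": {"Value": True, "Confidence": 97.3},
    "Landmarks": [
        {"Type": "eyeLeft", "X": 0.41, "Y": 0.36},
        {"Type": "eyeRight", "X": 0.58, "Y": 0.35},
    ],
}
print(summarize_face(sample_face))
```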
AWS Deep Learning AMI
• Up to ~40k CUDA cores
• Pre-configured CUDA drivers
• Jupyter notebook with Python2, Python3, Anaconda
• CloudFormation Template
• AWS Marketplace – one-click deploy
Data Ingestion: Get your data into S3 quickly and securely
Storage & Catalog: Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding
Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
Data Access & Authorisation: Give your users easy and secure access
Services shown: Kinesis Firehose, Athena Query Service, Glue, Machine Learning (predictive analytics), Amazon AI
Thank You
Full Stack Analytics on AWS