“Teaching Old Data New Tricks™”
Brian Barker • CEO • NorthBay Solutions
John Puopolo • SVP • Engineering • Eliza Corporation
Ali Khan • Director, Business Intelligence and Analytics • Scholastic
Sai Reddy Thangirala • Solutions Architect • Amazon Web Services
Agenda
• Big Data on AWS
• NorthBay
• Eliza Corporation Case Study
  • Challenges Eliza Faced
  • Strategic Goals
  • Why a Data Lake Approach was Chosen
  • Outcomes & Benefits Eliza Achieved
• Scholastic Case Study
  • Challenges
  • Goals
  • The AWS/NorthBay Decision
  • How the Initiative Unfolded
  • Key Learnings
Data is Growing
1.7MB of new data will be created every second for every human being on the planet by 2020
http://www.whizpr.be/upload/medialab/21/company/Media_Presentation_2012_DigiUniverseFINAL1.pdf

58% compound annual growth rate forecasted for the Hadoop market, surpassing $1 billion by 2020
http://www.ap-institute.com/big-data-articles/big-data-what-is-hadoop-%E2%80%93-an-explanation-for-absolutely-anyone.aspx
http://www.marketanalysis.com/?p=279

<0.5% of all data is ever analyzed and used at the moment
http://www.technologyreview.com/news/514346/the-data-made-me-do-it/
Big Data Is for Everyone
The market for Big Data technologies is growing more than six times faster than the information technology market as a whole…
…and those companies who use their data well win.
Why AWS for Big Data?
Immediately Available
Broad and Deep Capabilities
Trusted and Secure
Scalable
AWS Provides the Most Complete Platform for Big Data
It’s easy to get data to AWS, store it securely, and analyze it with the engine of your choice, without any long-term commitment or vendor lock-in
Collect: Import/Export, Snowball, Direct Connect, VM Import/Export
Store: Amazon S3, EMR, Amazon Glacier, Amazon Redshift, DynamoDB, Aurora
Analyze: Amazon Kinesis, Lambda, EMR, EC2
What Can You Do With Big Data on AWS?
Big Data Repositories Clickstream Analysis ETL Offload
Machine Learning Online Ad Serving BI Applications
“Teaching Old Data New Tricks™” with NorthBay
“Teaching Old Data New Tricks™”
Untapped wealth - Companies gain tremendous leverage when “Teaching Old Data New Tricks™”
• So what does that mean?
• You’ll hear 2 exciting Customer Examples/Use Cases presented today:
  • Building a HIPAA-compliant Data Lake
  • Re-tooling old on-premises technology on the fly
Customer Examples/Use Cases
Scholastic: Preview of Coming Attractions
• How did a $1.5B, 100-year-old company re-invent its old-school IBM and Microsoft based big data & analytics system on the fly?
• What was their starting point?
• What factors did they consider when making their decision?
• What did they decide on for technology and partners, and why?
• How did they implement?
• What were the results?
• Lessons learned?
AWS & NorthBay Background
Global Provider of Big Data Solutions
250+ Full-time professionals
145+ Clients
200+ Solutions launched
Conceptual Data Lake Architecture
Eliza Preview of Coming Attractions
• How does a high-flying healthcare services company re-platform its Enterprise Data Platform while processing millions of 'interactions' every day?
• Why the need to change?
• What strategic goals had to be achieved?
• What is so tough about "name-value pairs"?
• Why a Data Lake, and why NorthBay?
• Which AWS services were chosen?
• What did they decide on for technology and partners, and why?
• How did it turn out?
• What did they learn?
Eliza CorporationJohn Puopolo, SVP, Engineering, Eliza Corporation
About Eliza Corporation
• Founded 2000
• Leader in Health Engagement Management (HEM) outreach services
• Hundreds of millions of outreaches for intensive operation and analytics processing
• High-volume semi-structured data, complex business flow of data
• Variety of analytics/consumption needs, ranging from a portal for customers to ML workloads
Challenges Eliza Faced
Eliza Corporation analyzes more than 300 million interactions per year
Outreach questions and responses form a decision tree, and each question and response are captured as a pair, e.g.: <question, response> = <“Did you visit your physician in the last 30 days?”, “Yes”>
Diverse downstream consumption requirements
Challenging to process and analyze data
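A minimal, runnable sketch (not Eliza's actual code; names and questions are illustrative) of the capture described above: each outreach walks a decision tree, and every step lands as one <question, response> name-value pair. Because each session yields a variable number of pairs, the data resists a fixed relational schema.

```python
def flatten_session(session_id, responses):
    """Turn one decision-tree walk into a list of <question, response> pairs."""
    return [
        {"session_id": session_id, "question": q, "response": r}
        for q, r in responses
    ]

# One outreach session: two questions answered along one path of the tree.
pairs = flatten_session(
    "s-001",
    [
        ("Did you visit your physician in the last 30 days?", "Yes"),
        ("Was the visit related to a chronic condition?", "No"),
    ],
)
```

At 300 million interactions per year, every downstream consumer that needs a different shape of this data multiplies the processing burden, which is the challenge the next slides address.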
Strategic Goals
Create next generation data architecture
Decouple Storage and Compute
Ability to process old & new data streams
Achieve HIPAA compliance
Ingest & store original datasets
Allow both real-time & batch processing
Enable access through entitlements and governance
Increase self-service for end-users
Conceptual Data Lake Architecture
[Architecture diagram: monitoring, auditing, management, and alerting span all layers]
• Data Sources & Ingestion: streaming data sources, batch data sources
• Processing & Storage: Data Lake Storage, Data Lake Archive, EDW ETL, Catalog & Search & Data Discovery, Entitlements & Authorizations, Data Quality & Governance, Data System Analytics (Lineage, Profiling), API & UI
• Consumption & Analytics: Real-Time Analytics, BI tools, Hadoop (shared services), Business Unit BI UI, Hadoop/SAS (business-unit dedicated)
Benefits of the New Enterprise Data Platform Architecture
• Hub & spoke model for one original copy of all enterprise analytics data
• Quality layer for consistent transformations and cleansing of data
• Governance layer for entitlements and security management
• Enable multiple consumption patterns, called projections
• A purpose-designed schema for an Enterprise Data Warehouse (Redshift) for efficient reporting of known queries
• Streamlined and automated ingestion of source batch and streaming data, reducing human/manual touch points
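A small sketch of the "projections" idea above, with made-up field names: the hub keeps one original copy, and each consumption pattern is derived as its own read-optimized view without mutating that copy.

```python
# One original copy of the data (the hub); records carry everything captured.
original = [
    {"member": "m1", "question": "Visited physician?", "response": "Yes", "pii": "555-0100"},
    {"member": "m2", "question": "Visited physician?", "response": "No", "pii": "555-0101"},
]

def project(records, fields):
    """Derive a consumer-specific projection; the original copy is untouched."""
    return [{f: r[f] for f in fields} for r in records]

# A reporting projection (one spoke) omits PII; other spokes can keep more.
reporting_view = project(original, ["question", "response"])
```

The governance layer then decides which projections each consumer is entitled to see.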
Technical Architecture
Major AWS Services Used
Aurora
Amazon Kinesis Streams
Amazon Redshift
DynamoDB
Hive, Presto, Spark on EMR
CloudSearch, EC2
Benefits of a New Enterprise Data Platform
• Streamlined data load process by enabling schema on read
• Improved business agility by allowing schema on read
• Improved ability to manage costs by separating storage and compute costs
• Provided ability to scale resources on-demand
• Reduced end-to-end client analytics time
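The schema-on-read benefit above can be illustrated with a minimal local sketch (field names invented): raw records land in storage exactly as produced, and a schema is applied only when a consumer reads, so loads never block on upstream format changes.

```python
import json

# Raw events as they landed (here, strings standing in for S3 objects).
# Note the second record carries an extra "channel" field the first lacks.
raw_events = [
    '{"question": "Did you visit your physician?", "response": "Yes"}',
    '{"question": "Do you take your medication daily?", "response": "No", "channel": "IVR"}',
]

def read_with_schema(raw, fields):
    """Apply a schema only at read time; extra or missing fields are tolerated."""
    return [{f: json.loads(line).get(f) for f in fields} for line in raw]

answers = read_with_schema(raw_events, ["question", "response", "channel"])
```

A schema-on-write pipeline would have rejected or required migration for the record with the new field; here the load stays streamlined and each reader chooses its own view.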
Key Learnings
• The nature of our data is name-value. We were doing too many transformations due to our original storage formats.
• Using mini-PoCs to form hypotheses and prove/disprove them led to an emergent architecture, which pointed us towards a data lake
• A data lake architecture fits our core business and growth plans extremely well
ScholasticAli Khan, Director, Business Intelligence and Analytics, Scholastic
About Scholastic
• $1.6B in annual revenue; the world’s largest publisher and distributor of children’s books
• #1 website for U.S. elementary school teachers
• 8,400+ employees globally
• 165+ countries, 45+ languages
• A leader in comprehensive educational solutions
Existing Platform & Challenges
• We taught old data new tricks
• IBM AS/400 was the primary data warehouse platform, supplemented by Microsoft SQL Server to enable business intelligence
• 5,500+ AS/400 workloads, 350+ SQL Server workloads
• Inflexible architecture – slow time to market
• Unable to meet internal SLAs due to performance of daily ETL processes
• Scalability limitations with SQL Server Analysis Services (SSAS) for dashboards/reports
• Limited ability to perform self-service business intelligence
Project Goals
Improve performance, scalability, availability, logging, security
Enable self-service business intelligence
Integrate with existing technology stack
Align with the tech strategy (DevOps model, Cloud First)
Leverage the skill set of current team (SQL/relational)
Team up with an experienced partner
The Decision
• AWS was chosen because of agility, scalability, elasticity, security, and alignment with corporate strategy
• Redshift was chosen to replace the AS/400 and SQL Server as a relational-style, high-performance data store
• NorthBay was chosen for their expertise in Big Data and Amazon Redshift migrations
Pilot Plans
Migrate a functional area in a key business unit during a 3-month pilot
Demonstrate immediate business value
Stand up the AWS environment to allow IT to gain competence with AWS
Pilot Outcomes
• Create core framework for migration
• Implement ELT architecture and perform validation
• Establish visualization/self-service capability through Tableau
Technical Architecture
[Architecture diagram] Data flow:
• AS/400 / DB2 (source DB) → EMR cluster running a Sqoop script → S3 output bucket
• EC2 instance running the COPY command → Redshift (staging) → Redshift (data warehouse) → Tableau (reporting tool)
• Data Pipeline orchestrates the flow; pipeline configurations are stored in DynamoDB
• On pipeline status/failure: SNS topic → SNS email notification; Lambda saves pipeline stats to an RDS MySQL instance
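The "EC2 instance running the COPY command" step loads the Sqoop output from S3 into Redshift staging. A sketch of what composing that statement might look like; the table, bucket, prefix, and IAM role names here are invented, not Scholastic's actual values.

```python
def build_copy_sql(table, bucket, prefix, iam_role):
    """Compose a Redshift COPY statement for delimited, gzipped files in S3."""
    return (
        f"COPY {table} "
        f"FROM 's3://{bucket}/{prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER '|' GZIP;"
    )

sql = build_copy_sql(
    "staging.orders",
    "example-output-bucket",
    "sqoop/orders/",
    "arn:aws:iam::123456789012:role/redshift-load",
)
```

COPY pulls the files from S3 in parallel across the Redshift cluster, which is why staging the Sqoop extracts in a bucket (rather than streaming row-by-row inserts) keeps the load fast.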
Core Framework
• Jobs and job groups are defined as metadata in DynamoDB
• Control-M Scheduler, a custom application, and Data Pipeline for orchestration
• ELT process: EMR/Sqoop for extraction, then load into Redshift and transform the data through SQL scripts
• Core Framework allows for:
  • Restart capability from point of failure
  • Capturing of operational statistics (e.g., # of rows updated)
  • Audit capability (which feed caused the fact to change)
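The restart-from-point-of-failure capability can be sketched in a few lines. The job metadata, which the framework keeps in DynamoDB, is modeled here as a plain list of dicts (job and status names are made up) so the control flow is runnable locally.

```python
# Job-group metadata as the framework might record it after a failed run.
job_group = [
    {"job": "extract_orders", "status": "SUCCEEDED"},
    {"job": "load_orders_staging", "status": "FAILED"},
    {"job": "transform_orders_fact", "status": "PENDING"},
]

def jobs_to_run(jobs):
    """On restart, skip completed jobs and resume from the point of failure."""
    return [j["job"] for j in jobs if j["status"] != "SUCCEEDED"]
```

Because statuses are persisted per job rather than per group, a restart reruns only the failed load and everything downstream of it, not the already-completed extract.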
Data Visualization Through Tableau
• Business users have access to facts/dimensions for standard reports through Tableau
• Power users have access to Staging tables for Ad-Hoc queries through Tableau
• Data Scientists have access to Files in S3 (from all extracts serving as Data Archive) using Hive and/or Presto
Accelerating the Program Timeline
• CTO moved budget forward to:
  • Reduce project timeline by 50%
  • Eliminate the overhead of running 2 platforms
• Parallel work streams (swim lanes) utilized the same core framework for migrating data for other business units
• NorthBay partnered with each of those work streams to accelerate migration
• Users wanted to be on the new platform sooner
Lessons Learned - Technology
Isolate core framework with project specific code repositories
Make appropriate schema changes when migrating to new platform
Customize framework for gathering operational stats (e.g., # of rows loaded)
Start with test automation tools and Acceptance Test Driven Development (ATDD) earlier in the project
Lessons Learned – Program Execution
Creating new data platforms and migrating data into them is easy, especially with AWS. Decommissioning existing data platforms is hard!
“Data Champion” / “Data Guide” partnership absolutely critical for successful adoption of new platforms and working models
Importance of strong Agile coaches while scaling out Agile teams
Questions & Answers
Brian Barker • CEO • NorthBay Solutions • [email protected]
John Puopolo • SVP, Engineering • Eliza Corporation
Ali Khan • Director, Business Intelligence and Analytics • Scholastic
Sai Reddy Thangirala • Solutions Architect • Amazon Web Services
www.northbaysolutions.com [email protected]