Optimize Your Reporting In Less Than 10 Minutes. David Nhim, News Distribution Network, Inc. June 24th, 2015


Page 1: Optimize Your Reporting In Less Than 10 Minutes

Optimize Your Reporting In Less Than 10 Minutes

David Nhim, News Distribution Network, Inc.

June 24th, 2015

Page 2: Optimize Your Reporting In Less Than 10 Minutes

Housekeeping

• The recording will be sent to all webinar participants after the event.
• Questions? Type them in the chat box and we will answer.
• Posting to social? Use #AWSandChartio

Page 3: Optimize Your Reporting In Less Than 10 Minutes

Today’s Speakers

Matt Train

@Chartio

David Nhim

@Newsinc

Brandon Chavis

@AWScloud

Page 4: Optimize Your Reporting In Less Than 10 Minutes

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift

Page 5: Optimize Your Reporting In Less Than 10 Minutes

Common Customer Use Cases

Traditional Enterprise DW

• Reduce costs by extending DW rather than adding HW

• Migrate completely from existing DW systems

• Respond faster to business

Companies with Big Data

• Improve performance by an order of magnitude

• Make more data available for analysis

• Access business data via standard reporting tools

SaaS Companies

• Add analytic functionality to applications

• Scale DW capacity as demand grows

• Reduce HW & SW costs by an order of magnitude

Page 6: Optimize Your Reporting In Less Than 10 Minutes

Amazon Redshift is easy to use

• Provision in minutes

• Monitor query performance

• Point and click resize

• Built in security

• Automatic backups

Page 7: Optimize Your Reporting In Less Than 10 Minutes

Amazon Redshift is priced to let you analyze all your data

Price is nodes times hourly cost

No charge for leader node

3x data compression on avg

Price includes 3 copies of data

DS2 (HDD)             Price per hour for       Effective annual price
                      DW1.XL single node       per TB compressed
On-Demand             $0.850                   $3,725
1 Year Reservation    $0.500                   $2,190
3 Year Reservation    $0.228                   $999

DC1 (SSD)             Price per hour for       Effective annual price
                      DW2.L single node        per TB compressed
On-Demand             $0.250                   $13,690
1 Year Reservation    $0.161                   $8,795
3 Year Reservation    $0.100                   $5,500
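The "effective annual price per TB" column can be approximated from the hourly price. The sketch below assumes a node runs all 8,760 hours of the year and uses the per-node compressed capacities from the node-type slide (2 TB for the HDD node, 160 GB for the SSD node). The function name is illustrative, and the slide's figures appear to be rounded, so results match only approximately.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; assumes the node runs all year


def annual_price_per_tb(hourly_price: float, tb_per_node: float) -> float:
    """Effective yearly cost of one compressed terabyte on a single node."""
    return hourly_price * HOURS_PER_YEAR / tb_per_node


# Hourly rates copied from the pricing slide; capacity assumed to be 2 TB per HDD node.
hdd_rates = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}

for term, rate in hdd_rates.items():
    print(f"{term}: ${annual_price_per_tb(rate, 2.0):,.0f} per TB per year")
```

Passing `0.16` as the capacity gives the corresponding SSD figures.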

Page 8: Optimize Your Reporting In Less Than 10 Minutes

Amazon Redshift Node Types

• Optimized for I/O intensive workloads

• High disk density

• On demand at $0.85/hour

• As low as $1,000/TB/Year

• Scale from 2TB to 2PB

DS2.XL: 31 GB RAM, 2 cores, 2 TB compressed storage, 0.5 GB/sec scan

DS2.8XL: 244 GB RAM, 16 cores, 16 TB compressed storage, 4 GB/sec scan

• High performance at smaller storage size

• High compute and memory density

• On demand at $0.25/hour

• As low as $5,500/TB/Year

• Scale from 160GB to 326TB

DC1.L: 16 GB RAM, 2 cores, 160 GB compressed SSD storage

DC1.8XL: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage

Page 9: Optimize Your Reporting In Less Than 10 Minutes

Amazon Redshift Architecture

• Leader Node
  – SQL endpoint
  – Stores metadata
  – Coordinates query execution
• Compute Nodes
  – Local, columnar storage
  – Execute queries in parallel
  – Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms
  – Optimized for data processing
  – DW1: HDD; scale from 2TB to 2PB
  – DW2: SSD; scale from 160GB to 330TB

[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; compute nodes (each labeled 128GB RAM, 16TB disk, 16 cores) are linked by 10 GigE (HPC); ingestion, backup, and restore go through Amazon S3 / DynamoDB / SSH]

Page 10: Optimize Your Reporting In Less Than 10 Minutes

Amazon Redshift enables end-to-end security

• SSL to secure data in transit; load encrypted from Amazon S3; ECDHE perfect forward secrecy

• Encryption to secure data at rest
  – AES-256; hardware accelerated
  – All blocks on disks & in Amazon S3 encrypted
  – On-premises HSM & AWS CloudHSM support

• UNLOAD to Amazon S3 supports SSE and client-side encryption

• Audit logging & AWS CloudTrail integration

• Amazon VPC and IAM support

• SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA

[Diagram: SQL clients/BI tools in the customer VPC connect over JDBC/ODBC to the leader node; compute nodes (each labeled 128GB RAM, 16TB disk, 16 cores) sit in an internal VPC linked by 10 GigE (HPC); ingestion, backup, and restore go through S3 / EMR / DynamoDB / SSH]

Page 11: Optimize Your Reporting In Less Than 10 Minutes

Amazon Redshift integrates with multiple data sources

Amazon S3

Amazon EMR

Amazon Redshift

DynamoDB

Amazon RDS

Corporate Datacenter

Page 12: Optimize Your Reporting In Less Than 10 Minutes

NDN Introduction 2015

Page 13: Optimize Your Reporting In Less Than 10 Minutes

• Transition Items & Interim Plan
• Marketing Approach & Priorities
• Brand Development Process
• Resourcing
• Next Steps

The Broadest Offering of Video Available Anywhere

400+ Premium Sources

4,000 New Videos Daily

Page 14: Optimize Your Reporting In Less Than 10 Minutes

The Digital Media Exchange

400 Premium Content Providers

4,000 High-Traffic Publishers

Page 15: Optimize Your Reporting In Less Than 10 Minutes

The Web’s Best Publishers Lead with Video from NDN

Page 16: Optimize Your Reporting In Less Than 10 Minutes

Competitive Insight

NDN is a leader in the News/Information category, ranked #2 behind Huffington Post Media Group.

Page 17: Optimize Your Reporting In Less Than 10 Minutes

NDN Powers the Full Video Experience for Publishers

Page 18: Optimize Your Reporting In Less Than 10 Minutes

NDN Single Video Player & Fixed Placement

Page 19: Optimize Your Reporting In Less Than 10 Minutes

Perfect Pixel has Redefined the Video Workflow

Page 20: Optimize Your Reporting In Less Than 10 Minutes

NDN Wire Match

NDN Wire Match: automates placement of AP video recommended by AP editors

Page 21: Optimize Your Reporting In Less Than 10 Minutes

Powering Video On 44 of the Top 50 Newspaper Sites

Top U.S. Newspapers Online

Page 22: Optimize Your Reporting In Less Than 10 Minutes

NDN is the Leader in Local News

• Breaking News Video Available from over 250 Stations in 155 US News Markets

• Coverage for 90% of the US Audience

Page 23: Optimize Your Reporting In Less Than 10 Minutes

The Largest Consortium of Digital Local News Video Ever Created

Participating broadcasters:

257 Stations in 155 Markets

Page 24: Optimize Your Reporting In Less Than 10 Minutes

BI Initiative

• Needed self-service BI
• Must be user-friendly
• Easy to manage
• Reviewed over a dozen BI vendors
  – Build or buy
  – Self-hosted vs cloud
  – Training/support
  – POC process

Page 25: Optimize Your Reporting In Less Than 10 Minutes

Tech @ NDN

• Tools
  – Kinesis for real-time data collection
  – Python / EMR / Pentaho for ETL
  – Redshift for data warehousing
  – Chartio for visualization

Page 26: Optimize Your Reporting In Less Than 10 Minutes

Data Warehouse Architecture

[Diagram labels: RDBMS, Logs, ETL, Dimensions]

Page 27: Optimize Your Reporting In Less Than 10 Minutes

Architecture

• Real-time data collector encodes messages in protocol buffers and sends the payload to Kinesis
• Micro-batching
  – ETL process continuously reads from Kinesis, batches the data, and loads it into Redshift
  – ~15 minutes behind real-time
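The micro-batching step can be sketched as a small buffer that flushes by record count or by age. This is a minimal illustration, not NDN's pipeline: the class name, thresholds, and the `flush` callback are hypothetical, and the Kinesis read and the Redshift load are deliberately left out.

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Collects records and hands them off as one batch, by size or by age."""

    def __init__(
        self,
        flush: Callable[[List[bytes]], None],
        max_records: int = 10_000,
        max_age_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush            # hypothetical sink, e.g. write a file and load it
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[bytes] = []
        self._opened_at: Optional[float] = None

    def add(self, record: bytes) -> None:
        # Start the age timer when the first record of a batch arrives.
        if not self._buffer:
            self._opened_at = self._clock()
        self._buffer.append(record)
        # The age check only runs when a record arrives; a real consumer
        # would also flush on a timer during quiet periods.
        too_big = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._opened_at >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)
```

A consumer loop would call `add()` for each record read from the stream and `flush()` on shutdown so the final partial batch is not lost.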

Page 28: Optimize Your Reporting In Less Than 10 Minutes

Redshift Basics

• Redshift is a distributed column store
  – Don’t treat it like a traditional row store
  – Don’t run “SELECT * FROM” queries
• No referential integrity
  – Primary / foreign keys are ignored except for query planning
  – Enforce uniqueness via ETL
• No UDFs or stored procedures
  – Must rely on built-in functions
  – Do as much pre-processing as possible outside of the cluster

Page 29: Optimize Your Reporting In Less Than 10 Minutes

Redshift

• Use the COPY command to bulk load data
  – Raw inserts (“INSERT INTO table … VALUES …”) are slow
• Use deep copies to rebuild tables rather than running a full vacuum
  – Create the new table, then “INSERT INTO … SELECT * FROM” the old one
  – Vacuum took as long as three days for some tables
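The deep-copy pattern (create a fresh table, copy the rows across, swap the names) can be sketched with Python's built-in `sqlite3` module as a stand-in for Redshift. SQLite is used only so the example runs anywhere. The table and column names are hypothetical, and on Redshift the new table's DDL would also carry the distribution and sort keys.

```python
import sqlite3


def deep_copy(conn: sqlite3.Connection, table: str, columns_ddl: str, order_by: str) -> None:
    """Rebuild `table` by copying its rows, in sort order, into a fresh table."""
    tmp = f"{table}_new"
    conn.execute(f"CREATE TABLE {tmp} ({columns_ddl})")
    conn.execute(f"INSERT INTO {tmp} SELECT * FROM {table} ORDER BY {order_by}")
    conn.execute(f"DROP TABLE {table}")
    conn.execute(f"ALTER TABLE {tmp} RENAME TO {table}")
    conn.commit()


conn = sqlite3.connect(":memory:")
ddl = "event_minute TEXT, event_type TEXT"
conn.execute(f"CREATE TABLE events ({ddl})")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2015-06-24 10:02", "play"),
        ("2015-06-24 10:00", "load"),
        ("2015-06-24 10:01", "play"),
    ],
)

# After the rebuild, rows are physically stored in event_minute order.
deep_copy(conn, "events", ddl, "event_minute")
```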

Page 30: Optimize Your Reporting In Less Than 10 Minutes

Distribution

• Distribution styles
  – Use “All” distribution for dimension tables
  – Use “Even” distribution for summary tables
  – Use “Key” distribution for fact tables
• Select the most often joined column as the dist key
• Strive for join data locality
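Why a shared dist key gives join locality can be shown with a toy model: if both tables are placed by hashing the same join column, matching rows always land on the same slice. The CRC32 hash, the slice count, and the sample rows below are assumptions for illustration. Redshift's real hash function is internal.

```python
import zlib
from collections import defaultdict


def slice_for(key: str, n_slices: int) -> int:
    """Toy stand-in for key distribution: a stable hash modulo the slice count."""
    return zlib.crc32(key.encode("utf-8")) % n_slices


def distribute(rows, key_index: int, n_slices: int):
    """Group rows by the slice their distribution key hashes to."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[key_index], n_slices)].append(row)
    return slices


# Hypothetical fact and dimension rows, both distributed on the partner column.
facts = [("partner_a", "play"), ("partner_b", "load"), ("partner_a", "load")]
partners = [("partner_a", "Alpha News"), ("partner_b", "Beta Daily")]

fact_slices = distribute(facts, key_index=0, n_slices=4)
partner_slices = distribute(partners, key_index=0, n_slices=4)

# Every fact row finds its matching partner row on its own slice,
# so the join needs no data movement between slices.
for slice_id, rows in fact_slices.items():
    local_partners = {p[0] for p in partner_slices.get(slice_id, [])}
    print(slice_id, all(row[0] in local_partners for row in rows))
```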

Page 31: Optimize Your Reporting In Less Than 10 Minutes

Sort Keys

• Select a timestamp-based column with the lowest grain that makes sense (e.g. a minute-truncated timestamp)

• Insert data in sort key order to minimize the need for vacuum
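Both points can be handled in the ETL step before loading: truncate each event timestamp to the minute, then order the batch by that value. A minimal sketch, with hypothetical function names and sample events:

```python
from datetime import datetime


def truncate_to_minute(ts: datetime) -> datetime:
    """Drop seconds and microseconds so the sort key has minute grain."""
    return ts.replace(second=0, microsecond=0)


def prepare_batch(events):
    """Take (timestamp, payload) pairs; return rows ordered by truncated minute."""
    rows = [(truncate_to_minute(ts), payload) for ts, payload in events]
    rows.sort(key=lambda row: row[0])  # stable sort keeps arrival order within a minute
    return rows


batch = prepare_batch(
    [
        (datetime(2015, 6, 24, 10, 1, 59), "play"),
        (datetime(2015, 6, 24, 10, 0, 30), "load"),
    ]
)
```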

Page 32: Optimize Your Reporting In Less Than 10 Minutes

Compression Encoding

• Use compression to reduce I/O
  – Use ANALYZE COMPRESSION to get recommended encodings for your table, or let the COPY bulk loading tool do it for you
  – Use run-length encoding on rollup columns like hour, day, month, year, and booleans (assuming a timestamp for your sort key)
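The reason run-length encoding suits rollup columns on a timestamp-sorted table is that sorting makes equal values adjacent, so each distinct value collapses into a single run. A small illustration with made-up hour values:

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# An "hour" rollup column as stored when rows are sorted by timestamp...
hours_sorted = [10] * 4 + [11] * 3 + [12] * 5
# ...and the same values when rows arrive in arbitrary order.
hours_unsorted = [10, 11, 12, 10, 11, 12, 10, 11, 12, 10, 12, 12]

print(len(run_length_encode(hours_sorted)), "runs when sorted")
print(len(run_length_encode(hours_unsorted)), "runs when unsorted")
```

The sorted column needs one run per distinct hour, while the unsorted one needs nearly one run per row.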

Page 33: Optimize Your Reporting In Less Than 10 Minutes

Summary Tables

• Aggregate tables / materialized views
  – Pre-build your summaries and complex queries
  – Your biggest boost in query performance will come from using summary tables
  – Adds ETL complexity, but reduces reporting complexity
  – Chartio’s Data Store is also an option if your data set is < 1 M rows
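A summary table is simply the result of an aggregate query stored as its own table, so dashboards read a few pre-counted rows instead of scanning the fact table. The sketch below uses Python's built-in `sqlite3` as a stand-in for Redshift. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_events (event_hour TEXT, partner TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO fact_events VALUES (?, ?, ?)",
    [
        ("2015-06-24 10:00", "partner_a", "play"),
        ("2015-06-24 10:00", "partner_a", "play"),
        ("2015-06-24 10:00", "partner_b", "load"),
        ("2015-06-24 11:00", "partner_a", "play"),
    ],
)

# Pre-build the hourly rollup once in ETL; reports then query this table.
conn.execute(
    """
    CREATE TABLE summary_hourly AS
    SELECT event_hour, partner, event_type, COUNT(*) AS event_count
    FROM fact_events
    GROUP BY event_hour, partner, event_type
    """
)

rows = conn.execute(
    "SELECT * FROM summary_hourly ORDER BY event_hour, partner"
).fetchall()
```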

Page 34: Optimize Your Reporting In Less Than 10 Minutes

Avoid Updates on fact tables

• Avoid doing updates on your fact tables
  – An update is equivalent to a delete followed by an insert and will ruin your sort order
  – Vacuum will be required after large updates
• Deleted rows remain in your table
  – They are marked and hidden, but don’t disappear until a vacuum delete or full vacuum is performed

Page 35: Optimize Your Reporting In Less Than 10 Minutes

Caching

• Configure Chartio with the appropriate cache timeout values
  – 15 min, 1 hour, 8 hours
• Use Chartio’s data store feature
  – Ideal for storing complex query results or aggregates

Page 36: Optimize Your Reporting In Less Than 10 Minutes

Views

• Use views instead of tables
  – Easier to update Chartio schemas if using a view
  – Can add mandatory filters
  – Can change view w/o affecting Chartio
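A view with a mandatory filter can be sketched as follows, again using `sqlite3` as a stand-in for Redshift. The `is_test` column and all names are hypothetical. The reporting tool is pointed at the view, so the filter always applies and the underlying table can change without touching the dashboards.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fact_events (event_date TEXT, partner TEXT, is_test INTEGER)")
db.executemany(
    "INSERT INTO fact_events VALUES (?, ?, ?)",
    [
        ("2015-06-24", "partner_a", 0),
        ("2015-06-24", "partner_b", 1),  # test traffic, should never be reported
        ("2015-06-25", "partner_a", 0),
    ],
)

# The view hides the is_test column and enforces the filter for every query.
db.execute(
    """
    CREATE VIEW reporting_events AS
    SELECT event_date, partner
    FROM fact_events
    WHERE is_test = 0
    """
)

visible = db.execute(
    "SELECT event_date, partner FROM reporting_events ORDER BY event_date"
).fetchall()
```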

Page 37: Optimize Your Reporting In Less Than 10 Minutes

Chartio Filters and Drilldowns

• Encourage use of dashboard filters and variables
  – Allows for dynamic filtering and focused reporting
• Configure drilldowns on dashboards
  – Makes exploration more natural

Page 38: Optimize Your Reporting In Less Than 10 Minutes

Redshift Workload Manager

• Use the Workload Manager (WLM)
  – Prevent long queries from blocking other users
  – Create multiple query queues for ETL, BI, Machine Learning, etc.
  – Set separate memory settings and query timeout values for each queue
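A queue layout along these lines might be drafted as JSON for the cluster parameter group's `wlm_json_configuration` parameter. The sketch below is an assumption-laden illustration: the queue names, concurrency, memory split, and timeouts are invented, and the key names reflect my understanding of that parameter's format, so check them against the current Redshift documentation before use.

```python
import json

# Hypothetical queues: one for ETL, one for BI dashboards, plus the default queue.
wlm_queues = [
    {
        "query_group": ["etl"],
        "query_concurrency": 2,
        "memory_percent_to_use": 40,
        "max_execution_time": 3_600_000,  # milliseconds (assumed unit): 1 hour
    },
    {
        "query_group": ["bi"],
        "query_concurrency": 10,
        "memory_percent_to_use": 40,
        "max_execution_time": 300_000,  # 5 minutes, so a slow dashboard query cannot block others
    },
    {
        # Default queue for everything that is not routed to a named group.
        "query_concurrency": 5,
        "memory_percent_to_use": 20,
    },
]


def check_memory_split(queues) -> int:
    """Return the total memory percentage, raising if it exceeds 100."""
    total = sum(queue["memory_percent_to_use"] for queue in queues)
    if total > 100:
        raise ValueError(f"memory split adds up to {total}%, expected at most 100%")
    return total


check_memory_split(wlm_queues)
wlm_json = json.dumps(wlm_queues)
```

A session would then be routed to a queue by setting its query group, for example `SET query_group TO 'etl';` at the start of an ETL job.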

Page 39: Optimize Your Reporting In Less Than 10 Minutes

Quick Stats

• 14 event types
• 300 M ~ 1 B events / day
• ½ terabyte uncompressed data / day
• 30 – 50 data points per event type
• 50+ users (about half the company)
• 80+ dashboards, majority user generated
• Reportable dimensions include:
  – Partners, Geo-location, Device, EventType, Playlists, Widgets, Date/Time …

Page 40: Optimize Your Reporting In Less Than 10 Minutes

Data At A Glance

Page 41: Optimize Your Reporting In Less Than 10 Minutes

Data At A Glance

Page 42: Optimize Your Reporting In Less Than 10 Minutes

Chartio Summary

• Easy to deploy
• Easy to manage
• Dead simple to use
• Great performance
• Responsive support
• Continually improving and adding new features

Page 43: Optimize Your Reporting In Less Than 10 Minutes

Redshift Summary

• Easy to deploy
• Easy to resize
• Automated backups
• Familiar Postgres-like interface
• High performance
• Can use OLAP/relational tools

Page 44: Optimize Your Reporting In Less Than 10 Minutes

Data Sources

Schema/Business Rules

Interactive Mode

SQL Mode

Data Stores

TV Screens

Scheduled Emails

Data Exploration

Dashboards

Embedded

Data Pipeline/Data Blending

Data Caching

Security

Page 45: Optimize Your Reporting In Less Than 10 Minutes
Page 46: Optimize Your Reporting In Less Than 10 Minutes
Page 47: Optimize Your Reporting In Less Than 10 Minutes

Next steps

Download Chartio Guide: Optimizing Amazon Redshift Query Performance

https://chartio.com/redshift

Page 48: Optimize Your Reporting In Less Than 10 Minutes

Questions?

Chartio

Matt Train

[email protected]

chartio.com

News Distribution Network, Inc.

David Nhim

[email protected]

newsinc.com

AWS

Brandon Chavis

[email protected]

aws.amazon.com