
Building your data warehouse with Redshift



Page 1: Building your data warehouse with Redshift

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Amazon Redshift Deep Dive

Ran Tessler, AWS Solutions Architect

Page 2: Building your data warehouse with Redshift

Amazon Redshift Architecture

• Leader Node

– SQL endpoint

– Stores metadata

– Coordinates query execution

• Compute Nodes

– Local, columnar storage

– Execute queries in parallel

– Load, backup, and restore via Amazon S3; load from Amazon DynamoDB or over SSH (see the COPY sketch below)

• Two hardware platforms

– Optimized for data processing

– Dense Storage: HDD; scale from 2TB to 2PB

– Dense Compute: SSD; scale from 160GB to 326TB

[Architecture diagram: JDBC/ODBC clients connect to the leader node; compute nodes are linked by 10 GigE (HPC) networking; ingestion, backup, and restore flow through Amazon S3]
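A minimal load sketch: COPY pulls files in parallel from Amazon S3 into a table. The table name, bucket prefix, IAM role, and file format below are placeholders, not taken from the deck.

copy orders
from 's3://my-bucket/orders/'                               -- hypothetical bucket/prefix
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'    -- hypothetical role; key-based credentials also work
delimiter '|'
gzip;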

Page 3: Building your data warehouse with Redshift

• Massively Parallel Processing (MPP)

– Nodes are split into independent slices

– Each slice has a single virtual core, dedicated RAM, and storage

Amazon Redshift Architecture

[Diagram: a compute node divided into Slice 1 and Slice 2, each with its own virtual core, 7.5 GiB RAM, and local disk; JDBC/ODBC, ingestion, backup, and restore paths as in the previous slide]

Page 4: Building your data warehouse with Redshift

Amazon Redshift dramatically reduces I/O

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375


Page 6: Building your data warehouse with Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

analyze compression listing;

Table | Column | Encoding

---------+----------------+----------

listing | listid | delta

listing | sellerid | delta32k

listing | eventid | delta32k

listing | dateid | bytedict

listing | numtickets | bytedict

listing | priceperticket | delta32k

listing | totalprice | mostly32

listing | listtime | raw
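One way to apply these recommendations is to declare the encodings in the table DDL. This is only a sketch: the data types below are assumed from the TICKIT sample schema rather than taken from the slides, and COPY can also apply compression automatically on first load.

create table listing (
  listid          integer       encode delta,
  sellerid        integer       encode delta32k,
  eventid         integer       encode delta32k,
  dateid          smallint      encode bytedict,
  numtickets      smallint      encode bytedict,
  priceperticket  decimal(8,2)  encode delta32k,
  totalprice      decimal(8,2)  encode mostly32,
  listtime        timestamp     encode raw
);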

Page 7: Building your data warehouse with Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Direct-attached storage

• Large data block sizes

Zone maps:

• Track the minimum and maximum value for each block

• Skip over blocks that don’t contain the data needed for a given query

• Minimize unnecessary I/O

Page 8: Building your data warehouse with Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Use direct-attached storage to maximize throughput

• Hardware optimized for high-performance data processing

• Large block sizes to make the most of each read

• Amazon Redshift manages durability for you

Page 9: Building your data warehouse with Redshift

Data Modeling

Page 10: Building your data warehouse with Redshift

Data Distribution

• Data is allocated to slices based on distribution style (see the DDL sketch below)

– DISTSTYLE EVEN – round robin

– DISTSTYLE KEY – based on the hash value of the distribution key

– DISTSTYLE ALL – replicated to slice 0 on all nodes

• Query performance considerations

– Uneven distribution harms query performance

– Data redistribution is expensive

[Diagram: two compute nodes with two slices each; the slices hold 5M, 2M, 1M, and 4M records – an uneven, skewed distribution]
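A sketch of the three distribution styles as table DDL; the table and column names are invented for illustration:

-- DISTSTYLE EVEN: rows are assigned to slices round robin
create table clicks (click_id bigint, url varchar(256)) diststyle even;

-- DISTSTYLE KEY: rows with the same order_id hash to the same slice
create table order_facts (order_id bigint distkey, total decimal(10,2));

-- DISTSTYLE ALL: a full copy on every node (useful for small dimension tables)
create table calendar (dateid smallint, caldate date) diststyle all;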

Page 11: Building your data warehouse with Redshift

Suboptimal Distribution

Default (no distribution key, round-robin order): ORDERS and ITEMS rows are scattered across the slices, so an order and its items rarely share a slice.

Order 1: Dave Smith, Total $195

Item 1.1: Order 1, Kindle Fire HD 7”, $159

Item 1.2: Order 1, Kindle Fire Case, $36

[Diagram: three compute nodes, six slices; Order 1 and Items 1.1/1.2 sit on different slices, so the ORDERS–ITEMS join must redistribute data]

Page 12: Building your data warehouse with Redshift

Optimal Distribution

Customised (ORDERS.ORDER_ID DISTKEY, ITEMS.ORDER_ID DISTKEY): each order and its items hash to the same slice.

Order 1: Dave Smith, Total $195

Item 1.1: Order 1, Kindle Fire HD 7”, $159

Item 1.2: Order 1, Kindle Fire Case, $36

[Diagram: three compute nodes, six slices; Order 1, Item 1.1, and Item 1.2 are co-located on one slice, so the ORDERS–ITEMS join is local to each slice]
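A sketch of this customised layout as DDL; only ORDER_ID as the distribution key on both tables comes from the slide, the remaining columns are assumed:

create table orders (
  order_id  bigint distkey,
  customer  varchar(100),
  total     decimal(10,2)
);

create table items (
  item_id   bigint,
  order_id  bigint distkey,   -- same distribution key as ORDERS
  product   varchar(100),
  price     decimal(10,2)
);

With both tables distributed on order_id, a join on that column runs slice-local (DS_DIST_NONE in the EXPLAIN plan) instead of redistributing rows.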

Page 13: Building your data warehouse with Redshift

Sorting Table Data

• Sort Keys ≠ Index

– Data is initially stored in INSERT/COPY order

– VACUUM sorts the rows and reclaims stale storage (see the sketch below)
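A sketch of the maintenance commands implied here; the table name is a placeholder:

-- re-sort rows and reclaim space left by deleted or updated rows
vacuum full sales;

-- refresh optimizer statistics after the vacuum
analyze sales;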

Page 14: Building your data warehouse with Redshift

Compound Sort Keys Illustrated

Records in Redshift are stored in blocks.

For this illustration, let’s assume that four records fill a block

Records with a given cust_id are all in one block

However, records with a given prod_id are spread across four blocks

The 4×4 grid of (cust_id, prod_id) pairs:

cust_id \ prod_id     1      2      3      4
        1           [1,1]  [1,2]  [1,3]  [1,4]
        2           [2,1]  [2,2]  [2,3]  [2,4]
        3           [3,1]  [3,2]  [3,3]  [3,4]
        4           [4,1]  [4,2]  [4,3]  [4,4]

[Illustration: the 16 rows sorted by the compound key (cust_id, prod_id); each block holds the four rows for a single cust_id, while the rows for any given prod_id span all four blocks]
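A compound sort key matching this illustration might be declared as below; the table name and column types are assumed for the sketch:

create table sales_by_cust (
  cust_id  integer,
  prod_id  integer,
  amount   decimal(10,2)
)
compound sortkey (cust_id, prod_id);

Queries that filter on cust_id, the leading column, can skip most blocks; filters on prod_id alone benefit far less.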

Page 15: Building your data warehouse with Redshift

Interleaved Sort Keys Illustrated

The same 4×4 grid of (cust_id, prod_id) pairs:

cust_id \ prod_id     1      2      3      4
        1           [1,1]  [1,2]  [1,3]  [1,4]
        2           [2,1]  [2,2]  [2,3]  [2,4]
        3           [3,1]  [3,2]  [3,3]  [3,4]
        4           [4,1]  [4,2]  [4,3]  [4,4]

Records with a given cust_id are spread across two blocks

Records with a given prod_id are also spread across two blocks

Data is sorted in equal measure for both keys

[Illustration: the 16 rows ordered by the interleaved sort, so any single cust_id or prod_id value touches only two of the four blocks]
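The interleaved equivalent, again as a sketch with assumed names; interleaved tables rely on VACUUM REINDEX to maintain the sort:

create table sales_interleaved (
  cust_id  integer,
  prod_id  integer,
  amount   decimal(10,2)
)
interleaved sortkey (cust_id, prod_id);

-- re-analyze and re-sort the interleaved key distribution after heavy loads
vacuum reindex sales_interleaved;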

Page 16: Building your data warehouse with Redshift

Query Optimization

Page 17: Building your data warehouse with Redshift

Query Performance

• A good choice of distribution and sort keys speeds up query performance more than any other factor

• Redshift uses a cost-based query optimizer

– Good statistics are vital to ensure good performance

– Table constraints, while not enforced, are used to optimize queries

• Run the ANALYZE command to update statistics: ANALYZE lineitem;
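A sketch of declaring informational constraints and refreshing statistics, reusing the hypothetical ORDERS/ITEMS tables from the distribution slides:

-- not enforced by Redshift, but visible to the cost-based optimizer
alter table orders add primary key (order_id);
alter table items  add foreign key (order_id) references orders (order_id);

-- statistics can be refreshed per table or per column list
analyze orders;
analyze items (order_id, price);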

Page 18: Building your data warehouse with Redshift

Query Analysis

• Run the EXPLAIN command followed by the query:

EXPLAIN
select avg(datediff(day, listtime, saletime)) as avgwait
from sales, listing
where sales.listid = listing.listid;

QUERY PLAN

XN Aggregate (cost=6350.30..6350.31 rows=1 width=16)

-> XN Hash Join DS_DIST_NONE (cost=47.08..6340.89 rows=3766 width=16)

Hash Cond: ("outer".listid = "inner".listid)

-> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=12)

-> XN Hash (cost=37.66..37.66 rows=3766 width=12)

-> XN Seq Scan on sales (cost=0.00..37.66 rows=3766 width=12)

• From the EXPLAIN plan you can tell:

– The query execution steps

– Which operation is performed in each step

– Which table is used in each step

– How much data is processed in each step

Page 19: Building your data warehouse with Redshift

Query Analysis

• Access the STL_EXPLAIN table for executed queries:

select query, nodeid, parentid,
       substring(plannode from 1 for 30),
       substring(info from 1 for 20)
from stl_explain
where query = 10
order by 1, 2;

query | nodeid | parentid | substring | substring

------+--------+----------+---------------------+---------------------

10 | 1 | 0 | XN Aggregate (cost=6350.30... |

10 | 2 | 1 | -> XN Merge Join DS_DIST_NO | Merge Cond: ("outer"

10 | 3 | 2 | -> XN Seq Scan on lis |

10 | 4 | 2 | -> XN Seq Scan on sal |

• SVL_QUERY_SUMMARY and SVL_QUERY_REPORT for finer details
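A sketch of drilling into a finished query; the column list assumes the standard SVL_QUERY_SUMMARY definition:

select query, seg, step, label, rows, bytes, is_diskbased
from svl_query_summary
where query = 10
order by seg, step;

Steps with is_diskbased = 't' spilled to disk and are usually the first place to look when a query is slow.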

Page 20: Building your data warehouse with Redshift

Query Analysis

• Explain plans and performance metrics are also available via

the console:

Page 21: Building your data warehouse with Redshift

Query Analysis

• Explain Plan Visualization is now also available

Page 22: Building your data warehouse with Redshift

Amazon Redshift

Spend time with your data, not your database….

Page 23: Building your data warehouse with Redshift

Thank You