Amazon Redshift Database Developer Guide

    API Version 2012-12-01

Amazon Redshift: Database Developer Guide

Copyright 2015 Amazon Web Services, Inc. and/or its affiliates. All rights reserved. The following are trademarks of Amazon Web Services, Inc.: Amazon, Amazon Web Services Design, AWS, Amazon CloudFront, AWS CloudTrail, AWS CodeDeploy, Amazon Cognito, Amazon DevPay, DynamoDB, ElastiCache, Amazon EC2, Amazon Elastic Compute Cloud, Amazon Glacier, Amazon Kinesis, Kindle, Kindle Fire, AWS Marketplace Design, Mechanical Turk, Amazon Redshift, Amazon Route 53, Amazon S3, Amazon VPC, and Amazon WorkDocs. In addition, Amazon.com graphics, logos, page headers, button icons, scripts, and service names are trademarks, or trade dress of Amazon in the U.S. and/or other countries. Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon.

All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.


Table of Contents

Welcome
    Are You a First-Time Amazon Redshift User?
    Are You a Database Developer?
    Prerequisites
Amazon Redshift System Overview
    Data Warehouse System Architecture
    Performance
    Columnar Storage
    Internal Architecture and System Operation
    Workload Management
    Using Amazon Redshift with Other Services
        Moving Data Between Amazon Redshift and Amazon S3
        Using Amazon Redshift with Amazon DynamoDB
        Importing Data from Remote Hosts over SSH
        Automating Data Loads Using AWS Data Pipeline
Getting Started Using Databases
    Step 1: Create a Database
    Step 2: Create a Database User
        Delete a Database User
    Step 3: Create a Database Table
        Insert Data Rows into a Table
        Select Data from a Table
    Step 4: Load Sample Data
    Step 5: Query the System Tables
        View a List of Table Names
        View Database Users
        View Recent Queries
        Determine the Process ID of a Running Query
    Step 6: Cancel a Query
        Cancel a Query from Another Session
        Cancel a Query Using the Superuser Queue
    Step 7: Clean Up Your Resources
Amazon Redshift Best Practices
    Best Practices for Designing Tables
        Take the Tuning Table Design Tutorial
        Choose the Best Sort Key
        Choose the Best Distribution Style
        Use Automatic Compression
        Define Constraints
        Use the Smallest Possible Column Size
        Using Date/Time Data Types for Date Columns
    Best Practices for Loading Data
        Take the Loading Data Tutorial
        Take the Tuning Table Design Tutorial
        Use a COPY Command to Load Data
        Use a Single COPY Command
        Split Your Load Data into Multiple Files
        Compress Your Data Files with gzip or lzop
        Use a Manifest File
        Verify Data Files Before and After a Load
        Use a Multi-Row Insert
        Use a Bulk Insert
        Load Data in Sort Key Order
        Load Data in Sequential Blocks
        Use Time-Series Tables
        Use a Staging Table to Perform a Merge
        Schedule Around Maintenance Windows
    Best Practices for Designing Queries
Tutorial: Tuning Table Design
    Prerequisites
    Steps
    Step 1: Create a Test Data Set
        To Create a Test Data Set
        Next Step
    Step 2: Establish a Baseline
        To Test System Performance to Establish a Baseline
        Next Step
    Step 3: Select Sort Keys
        To Select Sort Keys
        Next Step
    Step 4: Select Distribution Styles
        Distribution Styles
        To Select Distribution Styles
        Next Step
    Step 5: Review Compression Encodings
        To Review Compression Encodings
        Next Step
    Step 6: Recreate the Test Data Set
        To Recreate the Test Data Set
        Next Step
    Step 7: Retest System Performance After Tuning
        To Retest System Performance After Tuning
        Next Step
    Step 8: Evaluate the Results
        Next Step
    Step 9: Clean Up Your Resources
        Next Step
    Summary
        Next Step

Tutorial: Loading Data from Amazon S3
    Prerequisites
    Overview
    Steps
    Step 1: Launch a Cluster
        Next Step
    Step 2: Download the Data Files
        Next Step
    Step 3: Upload the Files to an Amazon S3 Bucket
        Next Step
    Step 4: Create the Sample Tables
        Next Step
    Step 5: Run the COPY Commands
        COPY Command Syntax
        Loading the SSB Tables
    Step 6: Vacuum and Analyze the Database
        Next Step
    Step 7: Clean Up Your Resources
        Next
    Summary
        Next Step
Tutorial: Configuring WLM Queues to Improve Query Processing
    Overview
        Prerequisites
        Sections
    Section 1: Understanding the Default Queue Processing Behavior
        Step 1: Create the WLM_QUEUE_STATE_VW View
        Step 2: Create the WLM_QUERY_STATE_VW View
        Step 3: Run Test Queries
    Section 2: Modifying the WLM Query Queue Configuration
        Step 1: Create a Parameter Group
        Step 2: Configure WLM
        Step 3: Associate the Parameter Group with Your Cluster
    Section 3: Routing Queries to Queues Based on User Groups and Query Groups
        Step 1: View Query Queue Configuration in the Database
        Step 2: Run a Query Using the Query Group Queue
        Step 3: Create a Database User and Group
        Step 4: Run a Query Using the User Group Queue
    Section 4: Using wlm_query_slot_count to Temporarily Override Concurrency Level in a Queue
        Step 1: Override the Concurrency Level Using wlm_query_slot_count
        Step 2: Run Queries from Different Sessions
    Section 5: Cleaning Up Your Resources
Managing Database Security
    Amazon Redshift Security Overview
    Default Database User Privileges
    Superusers
    Users
        Creating, Altering, and Deleting Users
    Groups
        Creating, Altering, and Deleting Groups
    Schemas
        Creating, Altering, and Deleting Schemas
        Search Path
        Schema-Based Privileges
    Example for Controlling User and Group Access
Designing Tables
    Choosing a Column Compression Type
        Compression Encodings
        Testing Compression Encodings
        Example: Choosing Compression Encodings for the CUSTOMER Table
    Choosing a Data Distribution Style
        Data Distribution Concepts
        Distribution Styles
        Viewing Distribution Styles
        Evaluating Query Patterns
        Designating Distribution Styles
        Evaluating the Query Plan
        Query Plan Example
        Distribution Examples
    Choosing Sort Keys
    Defining Constraints
    Analyzing Table Design
Loading Data
    Using COPY to Load Data
        Preparing Your Input Data
        Loading Data from Amazon S3
        Loading Data from Amazon EMR
        Loading Data from Remote Hosts
        Loading from Amazon DynamoDB
        Verifying That the Data Was Loaded Correctly
        Validating Input Data
        Automatic Compression
        Optimizing for Narrow Tables
        Default Values
        Troubleshooting

    Updating with DML
    Updating and Inserting
        Merge Method 1: Replacing Existing Rows
        Merge Method 2: Specifying a Column List
        Creating a Temporary Staging Table
        Performing a Merge Operation by Replacing Existing Rows
        Performing a Merge Operation by Specifying a Column List
        Merge Examples
    Performing a Deep Copy
    Analyzing Tables
        ANALYZE Command History
        Automatic Analysis
    Vacuuming Tables
        VACUUM Frequency
        Sort Stage and Merge Stage
        Vacuum Types
        Managing Vacuum Times
        Vacuum Column Limit Exceeded Error
    Managing Concurrent Write Operations
        Serializable Isolation
        Write and Read-Write Operations
        Concurrent Write Examples
Unloading Data
    Unloading Data to Amazon S3
    Unloading Encrypted Data Files
    Unloading Data in Delimited or Fixed-Width Format
    Reloading Unloaded Data
Tuning Query Performance
    Query Processing
        Query Planning And Execution Workflow
        Reviewing Query Plan Steps
        Query Plan
        Factors Affecting Query Performance
    Analyzing and Improving Queries
        Query Analysis Workflow
        Reviewing Query Alerts
        Analyzing the Query Plan
        Analyzing the Query Summary
        Improving Query Performance
        Diagnostic Queries for Query Tuning
    Implementing Workload Management
        Defining Query Queues
        WLM Queue Assignment Rules
        Modifying the WLM Configuration
        Assigning Queries to Queues
        Monitoring Workload Management
    Troubleshooting Queries
        Connection Fails
        Query Hangs
        Query Takes Too Long
        Load Fails
        Load Takes Too Long
        Load Data Is Incorrect
        Setting the JDBC Fetch Size Parameter
SQL Reference
    Amazon Redshift SQL
        SQL Functions Supported on the Leader Node
        Amazon Redshift and PostgreSQL
    Using SQL
        SQL Reference Conventions
        Basic Elements
        Expressions
        Conditions
    SQL Commands
        ABORT, ALTER DATABASE, ALTER GROUP, ALTER SCHEMA, ALTER TABLE, ALTER USER, ANALYZE,
        ANALYZE COMPRESSION, BEGIN, CANCEL, CLOSE, COMMENT, COMMIT, COPY, CREATE DATABASE,
        CREATE GROUP, CREATE SCHEMA, CREATE TABLE, CREATE TABLE AS, CREATE USER, CREATE VIEW,
        DEALLOCATE, DECLARE, DELETE, DROP DATABASE, DROP GROUP, DROP SCHEMA, DROP TABLE,
        DROP USER, DROP VIEW, END, EXECUTE, EXPLAIN, FETCH, GRANT, INSERT, LOCK, PREPARE,
        RESET, REVOKE, ROLLBACK, SELECT, SELECT INTO, SET, SET SESSION AUTHORIZATION,
        SET SESSION CHARACTERISTICS, SHOW, START TRANSACTION, TRUNCATE, UNLOAD, UPDATE, VACUUM

    SQL Functions Reference
        Leader Node-Only Functions
        Aggregate Functions
        Bit-Wise Aggregate Functions
        Window Functions
        Conditional Expressions
        Date Functions
        Math Functions
        String Functions
        JSON Functions
        Data Type Formatting Functions
        System Administration Functions
        System Information Functions
    Reserved Words
System Tables Reference
    System Tables and Views
    Types of System Tables and Views
    Visibility of Data in System Tables and Views
        Filtering System-Generated Queries
    STL Tables for Logging
        STL_AGGR, STL_ALERT_EVENT_LOG, STL_BCAST, STL_COMMIT_STATS, STL_CONNECTION_LOG,
        STL_DDLTEXT, STL_DIST, STL_DELETE, STL_ERROR, STL_EXPLAIN, STL_FILE_SCAN, STL_HASH,
        STL_HASHJOIN, STL_INSERT, STL_LIMIT, STL_LOAD_COMMITS, STL_LOAD_ERRORS,
        STL_LOADERROR_DETAIL, STL_MERGE, STL_MERGEJOIN, STL_NESTLOOP, STL_PARSE,
        STL_PLAN_INFO, STL_PROJECT, STL_QUERY, STL_QUERYTEXT, STL_REPLACEMENTS, STL_RETURN,
        STL_SAVE, STL_S3CLIENT, STL_S3CLIENT_ERROR, STL_SCAN, STL_SESSIONS, STL_SORT,
        STL_SSHCLIENT_ERROR, STL_STREAM_SEGS, STL_TR_CONFLICT, STL_UNDONE, STL_UNIQUE,
        STL_UNLOAD_LOG, STL_USERLOG, STL_UTILITYTEXT, STL_VACUUM, STL_WARNING, STL_WINDOW,
        STL_WLM_ERROR, STL_WLM_QUERY
    STV Tables for Snapshot Data
        STV_ACTIVE_CURSORS, STV_BLOCKLIST, STV_CURSOR_CONFIGURATION, STV_EXEC_STATE,
        STV_INFLIGHT, STV_LOAD_STATE, STV_LOCKS, STV_PARTITIONS, STV_RECENTS, STV_SLICES,
        STV_SESSIONS, STV_TBL_PERM, STV_TBL_TRANS, STV_WLM_CLASSIFICATION_CONFIG,
        STV_WLM_QUERY_QUEUE_STATE, STV_WLM_QUERY_STATE, STV_WLM_QUERY_TASK_STATE,
        STV_WLM_SERVICE_CLASS_CONFIG, STV_WLM_SERVICE_CLASS_STATE
    System Views
        SVL_COMPILE, SVV_DISKUSAGE, SVL_QERROR, SVL_QLOG, SVV_QUERY_INFLIGHT,
        SVL_QUERY_QUEUE_INFO, SVL_QUERY_REPORT, SVV_QUERY_STATE, SVL_QUERY_SUMMARY,
        SVL_STATEMENTTEXT, SVV_TABLE_INFO, SVV_VACUUM_PROGRESS, SVV_VACUUM_SUMMARY,
        SVL_VACUUM_PERCENTAGE
    System Catalog Tables
        PG_TABLE_DEF
        Querying the Catalog Tables
Configuration Reference
    Modifying the Server Configuration
    datestyle
        Values (Default in Bold)
        Description
        Example
    extra_float_digits
        Values (Default in Bold)
        Description
    max_cursor_result_set_size
        Values (Default in Bold)
        Description
    query_group
        Values (Default in Bold)
        Description
    search_path
        Values (Default in Bold)
        Description
        Example
    statement_timeout
        Values (Default in Bold)
        Description
        Example
    wlm_query_slot_count
        Values (Default in Bold)
        Description
        Examples
Sample Database
    CATEGORY Table
    DATE Table
    EVENT Table
    VENUE Table
    USERS Table
    LISTING Table
    SALES Table
Time Zone Names and Abbreviations
    Time Zone Names
    Time Zone Abbreviations
Document History


Welcome

Topics
    Are You a First-Time Amazon Redshift User? (p. 1)
    Are You a Database Developer? (p. 2)
    Prerequisites (p. 3)

    This is the Amazon Redshift Database Developer Guide.

Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service.

This guide focuses on using Amazon Redshift to create and manage a data warehouse. If you work with databases as a designer, software developer, or administrator, it gives you the information you need to design, build, query, and maintain your data warehouse.

Are You a First-Time Amazon Redshift User?

If you are a first-time user of Amazon Redshift, we recommend that you begin by reading the following sections.

Service Highlights and Pricing: The product detail page provides the Amazon Redshift value proposition, service highlights, and pricing.

Getting Started: The Getting Started Guide includes an example that walks you through the process of creating an Amazon Redshift data warehouse cluster, creating database tables, uploading data, and testing queries.

    After you complete the Getting Started guide, we recommend that you explore one of the following guides:

• Amazon Redshift Cluster Management Guide: The Cluster Management guide shows you how to create and manage Amazon Redshift clusters.

If you are an application developer, you can use the Amazon Redshift Query API to manage clusters programmatically. Additionally, the AWS SDK libraries that wrap the underlying Amazon Redshift API can help simplify your programming tasks. If you prefer a more interactive way of managing clusters, you can use the Amazon Redshift console and the AWS command line interface (AWS CLI). For information about the API and CLI, go to the following manuals:
    • API Reference
    • CLI Reference

• Amazon Redshift Database Developer Guide (this document): If you are a database developer, the Database Developer Guide explains how to design, build, query, and maintain the databases that make up your data warehouse.

If you are transitioning to Amazon Redshift from another relational database system or data warehouse application, you should be aware of important differences in how Amazon Redshift is implemented. For a summary of the most important considerations for designing tables and loading data, see Best Practices for Designing Tables (p. 24) and Best Practices for Loading Data (p. 27). Amazon Redshift is based on PostgreSQL 8.0.2. For a detailed list of the differences between Amazon Redshift and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 232).

Are You a Database Developer?

If you are a database user, database designer, database developer, or database administrator, the following table will help you find what you're looking for.

If you want to: Quickly start using Amazon Redshift
We recommend: Begin by following the steps in the Getting Started guide to quickly deploy a cluster, connect to a database, and try out some queries. When you are ready to build your database, load data into tables, and write queries to manipulate data in the data warehouse, return here to the Database Developer Guide.

If you want to: Learn about the internal architecture of the Amazon Redshift data warehouse
We recommend: The Amazon Redshift System Overview (p. 4) gives a high-level overview of Amazon Redshift's internal architecture. If you want a broader overview of the Amazon Redshift web service, go to the Amazon Redshift product detail page.

If you want to: Create databases, tables, users, and other database objects
We recommend: Getting Started Using Databases (p. 12) is a quick introduction to the basics of SQL development. Amazon Redshift SQL (p. 231) has the syntax and examples for Amazon Redshift SQL commands, functions, and other SQL elements. Best Practices for Designing Tables (p. 24) provides a summary of our recommendations for choosing sort keys, distribution keys, and compression encodings.

If you want to: Learn how to design tables for optimum performance
We recommend: Designing Tables (p. 104) details considerations for applying compression to the data in table columns and choosing distribution and sort keys.

If you want to: Load data
We recommend: Loading Data (p. 131) explains the procedures for loading large datasets from Amazon DynamoDB tables or from flat files stored in Amazon S3 buckets. Best Practices for Loading Data (p. 27) provides tips for loading your data quickly and effectively.

If you want to: Manage users, groups, and database security
We recommend: Managing Database Security (p. 97) covers database security topics.

If you want to: Monitor and optimize system performance
We recommend: The System Tables Reference (p. 626) details system tables and views that you can query for the status of the database and monitor queries and processes. You should also consult the Amazon Redshift Management Guide to learn how to use the AWS Management Console to check the system health, monitor metrics, and back up and restore clusters.

If you want to: Analyze and report information from very large datasets
We recommend: Many popular software vendors are certifying Amazon Redshift with their offerings to enable you to continue to use the tools you use today. For more information, see the Amazon Redshift partner page. The SQL Reference (p. 231) has all the details for the SQL expressions, commands, and functions Amazon Redshift supports.

Prerequisites

Before you use this guide, you should complete these tasks:

• Install a SQL client.
• Launch an Amazon Redshift cluster.
• Connect your SQL client to the cluster master database.

    For step-by-step instructions, see the Amazon Redshift Getting Started Guide.

You should also know how to use your SQL client and should have a fundamental understanding of the SQL language.


Amazon Redshift System Overview

Topics
• Data Warehouse System Architecture (p. 4)
• Performance (p. 6)
• Columnar Storage (p. 8)
• Internal Architecture and System Operation (p. 9)
• Workload Management (p. 10)
• Using Amazon Redshift with Other Services (p. 11)

An Amazon Redshift data warehouse is an enterprise-class relational database query and management system.

Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.

When you execute analytic queries, you are retrieving, comparing, and evaluating large amounts of data in multiple-stage operations to produce a final result.

Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes. This section presents an introduction to the Amazon Redshift system architecture.

Data Warehouse System Architecture

This section introduces the elements of the Amazon Redshift data warehouse architecture, as shown in the following figure.


Client applications

Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes. For information about important differences between Amazon Redshift SQL and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 232).

Connections

Amazon Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers. For more information, see Amazon Redshift and PostgreSQL JDBC and ODBC (p. 233).

Clusters

    The core infrastructure component of an Amazon Redshift data warehouse is a cluster.

A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.

    Leader node

The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes. For more information, see SQL Functions Supported on the Leader Node (p. 231).
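For example, the following statements illustrate this behavior. The SALES table here is from the TICKIT sample database used throughout this guide, and the exact error text you see depends on your cluster version.

-- Runs entirely on the leader node because no user table is referenced.
select current_schema();

-- GENERATE_SERIES is a leader node-only function, so a query that combines it
-- with a table stored on the compute nodes returns an error:
-- select saleid, n from sales, generate_series(1, 3) n;  -- fails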


Compute nodes

The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.

Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Amazon Redshift provides two node types: dense storage nodes and dense compute nodes. Each node type provides two storage choices. You can start with a single 160 GB node and scale up to multiple 16 TB nodes to support a petabyte of data or more.

For a more detailed explanation of data warehouse clusters and nodes, see Internal Architecture and System Operation (p. 9).

Node slices

A compute node is partitioned into slices; one slice for each core of the node's multi-core processor. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.

When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and execute queries efficiently. For information about choosing a distribution key, see Choose the Best Distribution Style (p. 25).
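For example, the following table definition for a hypothetical sales_demo table (not part of the TICKIT sample database) distributes rows across the slices by listid and stores them on disk in saletime order:

create table sales_demo(
    salesid integer not null,
    listid integer not null distkey,   -- rows with the same listid are placed on the same slice
    saletime timestamp sortkey,        -- data blocks are stored in saletime order
    pricepaid decimal(8,2));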

Internal network

Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.

    Databases

A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes.

Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.

Amazon Redshift is based on PostgreSQL 8.0.2. Amazon Redshift and PostgreSQL have a number of very important differences that you need to take into account as you design and develop your data warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see Amazon Redshift and PostgreSQL (p. 232).

Performance

Amazon Redshift achieves extremely fast query execution by employing these performance features:

• Massively parallel processing


• Columnar data storage
• Data compression
• Query optimization
• Compiled code

    Massively parallel processing

Massively parallel processing (MPP) enables fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.

Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node. For more information, see Choose the Best Distribution Style (p. 25).

Loading data from flat files takes advantage of parallel processing by spreading the workload across multiple nodes while simultaneously reading from multiple files. For more information about how to load data into tables, see Best Practices for Loading Data (p. 27).
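For example, a single COPY command such as the following loads every file that shares the given Amazon S3 prefix, spreading the reads across the slices in the cluster. The bucket name, key prefix, and credentials are placeholders, and sales_demo is a hypothetical target table:

copy sales_demo
from 's3://my-example-bucket/tickit/sales/'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|';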

Columnar data storage

Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance. Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. Loading less data into memory enables Amazon Redshift to perform more in-memory processing when executing queries. See Columnar Storage (p. 8) for a more detailed explanation.

When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks. For more information, see Choose the Best Sort Key (p. 25).

Data compression

Data compression reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you execute a query, the compressed data is read into memory, then uncompressed during query execution. Loading less data into memory enables Amazon Redshift to allocate more memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal compression encodings when you load the table with data. To learn more about using automatic data compression, see Loading Tables with Automatic Compression (p. 158).
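As a sketch, you can let the first COPY into an empty table choose encodings automatically (the default), ask Amazon Redshift to report recommended encodings for an existing table, or declare an encoding explicitly when you create a table. The events_demo table below is hypothetical; LISTING is from the TICKIT sample database.

-- Report the encodings that would best compress the data in an existing table.
analyze compression listing;

-- Declare column encodings explicitly at table creation time.
create table events_demo(
    eventid integer encode delta,
    eventname varchar(200) encode lzo);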

Query optimizer

The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation. To learn more about optimizing queries, see Tuning Query Performance (p. 194).

Compiled code

The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries. The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.


The execution engine compiles different code for the JDBC connection protocol and for the ODBC and psql (libpq) connection protocols, so two clients using different protocols will each incur the first-time cost of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the cached code.

Columnar Storage

Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.

The following series of illustrations describe how columnar data storage implements efficiencies and how that translates into efficiencies when retrieving data into memory.

    This first illustration shows how records from database tables are typically stored into disk blocks by row.

In a typical relational database table, each row contains field values for a single record. In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. If block size is smaller than the size of a record, storage for an entire record may take more than one block. If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space. In online transaction processing (OLTP) applications, most transactions involve frequently reading and writing all of the values for entire records, typically one record or a small number of records at a time. As a result, row-wise storage is optimal for OLTP databases.

The next illustration shows how, with columnar storage, the values for each column are stored sequentially into disk blocks.

Using columnar storage, each data block stores values of a single column for multiple rows. As records enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the columns.

In this simplified example, using columnar storage, each data block holds column field values for as many as three times as many records as row-based storage. This means that reading the same number of column field values for the same number of records requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large numbers of columns and very large row counts, storage efficiency is even greater.

An added advantage is that, since each block holds the same type of data, block data can use a compression scheme selected specifically for the column data type, further reducing disk space and I/O. For more information about compression encodings based on data types, see Compression Encodings (p. 105).

The savings in space for storing data on disk also carries over to retrieving and then storing that data in memory. Since many database operations only need to access or operate on one or a small number of columns at a time, you can save memory space by only retrieving blocks for columns you actually need for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small number of records, data warehouse queries commonly read only a few columns for a very large number of rows. This means that reading the same number of column field values for the same number of rows requires a fraction of the I/O operations and uses a fraction of the memory that would be required for processing row-wise blocks. In practice, using tables with very large numbers of columns and very large row counts, the efficiency gains are proportionally greater. For example, suppose a table contains 100 columns. A query that uses five columns will only need to read about five percent of the data contained in the table. This savings is repeated for possibly billions or even trillions of records for large databases. In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.
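As a simple sketch of this principle using the USERS table from the TICKIT sample database, the first query below reads only the blocks for the three columns it references, while the second reads the blocks for all 18 columns:

-- Touches only the firstname, lastname, and likejazz column blocks.
select firstname, lastname
from users
where likejazz = true;

-- Touches the blocks for every column in the table.
select *
from users
where likejazz = true;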

Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution.

Internal Architecture and System Operation

The following diagram shows a high-level view of internal components and functionality of the Amazon Redshift data warehouse.


Workload Management

Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.

Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues. From a user perspective, a user-accessible service class and a queue are functionally equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service class as well as a runtime queue.

When you run a query, WLM assigns the query to a queue according to the user's user group or by matching a query group that is listed in the queue configuration with a query group label that the user sets at runtime.
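For example, a session can label its queries so that WLM routes them to the queue whose configuration lists a matching query group. The group name 'reports' below is a placeholder for whatever label you configure:

-- Route the queries that follow to the queue configured for the 'reports' query group.
set query_group to 'reports';

select count(*) from sales;

-- Return to the default queue assignment.
reset query_group;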


By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to five queries to run concurrently, plus one predefined Superuser queue with a concurrency level of one. You can define up to eight queues. Each queue can be configured with a maximum concurrency level of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser queue) is 50.
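To see how the queues on your cluster are currently configured, you can query the STV_WLM_SERVICE_CLASS_CONFIG system table; the WHERE clause below assumes the convention that lower-numbered service classes are reserved for internal system queues:

-- List each user-accessible queue and its concurrency level.
select service_class, num_query_tasks as concurrency
from stv_wlm_service_class_config
where service_class > 4
order by service_class;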

The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console. You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API. For more information about implementing and using workload management, see Implementing Workload Management (p. 218).

Using Amazon Redshift with Other Services

Amazon Redshift integrates with other AWS services to enable you to move, transform, and load your data quickly and reliably, using data security features.

Moving Data Between Amazon Redshift and Amazon S3

Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3 buckets. For more information, see Loading Data from Amazon S3 (p. 133).

You can also use parallel processing to export data from your Amazon Redshift data warehouse to multiple data files on Amazon S3. For more information, see Unloading Data (p. 187).
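For example, an UNLOAD command like the following writes the result of a query to a set of files under an Amazon S3 prefix, one or more files per slice. The bucket name and credentials are placeholders:

unload ('select * from sales')
to 's3://my-example-bucket/unload/sales_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|';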

Using Amazon Redshift with Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service. You can use the COPY command to load an Amazon Redshift table with data from a single Amazon DynamoDB table. For more information, see Loading Data from an Amazon DynamoDB Table (p. 155).
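A minimal sketch, assuming a DynamoDB table named ProductCatalog and an existing Amazon Redshift table named favoritemovies whose column names match the DynamoDB attribute names:

copy favoritemovies
from 'dynamodb://ProductCatalog'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
readratio 50;  -- use at most 50 percent of the DynamoDB table's provisioned read throughput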

Importing Data from Remote Hosts over SSH

You can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such as Amazon EMR clusters, Amazon EC2 instances, or other computers. COPY connects to the remote hosts using SSH and executes commands on the remote hosts to generate data. Amazon Redshift supports multiple simultaneous connections. The COPY command reads and loads the output from multiple host sources in parallel. For more information, see Loading Data from Remote Hosts (p. 149).
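In this case the COPY command points at an SSH manifest file stored in Amazon S3 that lists the hosts to connect to and the command to run on each host. The object name and target table below are placeholders:

copy sales_demo
from 's3://my-example-bucket/ssh_manifest'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|'
ssh;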

Automating Data Loads Using AWS Data Pipeline

You can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon Redshift. By using the built-in scheduling capabilities of AWS Data Pipeline, you can schedule and execute recurring jobs without having to write your own complex data transfer or transformation logic. For example, you can set up a recurring job to automatically copy data from Amazon DynamoDB into Amazon Redshift. For a tutorial that walks you through the process of creating a pipeline that periodically moves data from Amazon S3 to Amazon Redshift, see Copy Data to Amazon Redshift Using AWS Data Pipeline in the AWS Data Pipeline Developer Guide.


Getting Started Using Databases

Topics
• Step 1: Create a Database (p. 12)
• Step 2: Create a Database User (p. 13)
• Step 3: Create a Database Table (p. 13)
• Step 4: Load Sample Data (p. 15)
• Step 5: Query the System Tables (p. 18)
• Step 6: Cancel a Query (p. 21)
• Step 7: Clean Up Your Resources (p. 22)

    This section describes the basic steps to begin using the Amazon Redshift database.

The examples in this section assume you have signed up for the Amazon Redshift data warehouse service, created a cluster, and established a connection to the cluster from your SQL query tool. For information about these tasks, see the Amazon Redshift Getting Started Guide.

Important
The cluster that you deployed for this exercise will be running in a live environment. As long as it is running, it will accrue charges to your AWS account. For more pricing information, go to the Amazon Redshift pricing page.
To avoid unnecessary charges, you should delete your cluster when you are done with it. The final step of the exercise explains how to do so.

Step 1: Create a Database

After you have verified that your cluster is up and running, you can create your first database. This database is where you will actually create tables, load data, and run queries. A single cluster can host multiple databases. For example, you can have a TICKIT database and an ORDERS database on the same cluster.

After you connect to the initial cluster database (the database you created when you launched the cluster), you use that initial database as the base for creating a new database.

    For example, to create a database named tickit, issue the following command:


create database tickit;

For this exercise, we'll accept the defaults. For information about more command options, see CREATE DATABASE (p. 336) in the SQL Command Reference.

After you have created the TICKIT database, you can connect to the new database from your SQL client. Use the same connection parameters as you used for your current connection, but change the database name to tickit.

You do not need to change the database to complete the remainder of this tutorial. If you prefer not to connect to the TICKIT database, you can try the rest of the examples in this section using the default database.

Step 2: Create a Database User

By default, only the master user that you created when you launched the cluster has access to the initial database in the cluster. To grant other users access, you must create one or more user accounts. Database user accounts are global across all the databases in a cluster; they do not belong to individual databases.

Use the CREATE USER command to create a new database user. When you create a new user, you specify the name of the new user and a password. A password is required. It must have between 8 and 64 characters, and it must include at least one uppercase letter, one lowercase letter, and one numeral.

    For example, to create a user named GUEST with password ABCd4321, issue the following command:

    create user guest password 'ABCd4321';

For information about other command options, see CREATE USER (p. 354) in the SQL Command Reference.

Delete a Database User

You won't need the GUEST user account for this tutorial, so you can delete it. If you delete a database user account, the user will no longer be able to access any of the cluster databases.

    Issue the following command to drop the GUEST user:

    drop user guest;

The master user you created when you launched your cluster continues to have access to the database.

Important
Amazon Redshift strongly recommends that you do not delete the master user.

    For information about command options, see DROP USER (p. 366) in the SQL Reference.

Step 3: Create a Database Table

After you create your new database, you create tables to hold your database data. You specify any column information for the table when you create the table.


For example, to create a table named testtable with a single column named testcol for an integer data type, issue the following command:

    create table testtable (testcol int);

The PG_TABLE_DEF system table contains information about all the tables in the cluster. To verify the result, issue the following SELECT command to query the PG_TABLE_DEF system table.

    select * from pg_table_def where tablename = 'testtable';

    The query result should look something like this:

schemaname | tablename | column  | type    | encoding | distkey | sortkey | notnull
-----------+-----------+---------+---------+----------+---------+---------+---------
public     | testtable | testcol | integer | none     | f       | 0       | f
(1 row)

By default, new database objects, such as tables, are created in a schema named "public". For more information about schemas, see Schemas (p. 100) in the Managing Database Security section.

The encoding, distkey, and sortkey columns are used by Amazon Redshift for parallel processing. For more information about designing tables that incorporate these elements, see Best Practices for Designing Tables (p. 24).
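If you prefer not to work in the public schema, a minimal sketch of creating your own schema and making it the default for unqualified table names looks like this; the schema name demo is arbitrary:

create schema demo;
create table demo.testtable2 (testcol int);

-- Resolve unqualified table names against demo first, then public.
set search_path to demo, public;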

Insert Data Rows into a Table

After you create a table, you can insert rows of data into that table.

Note
The INSERT (p. 379) command inserts individual rows into a database table. For standard bulk loads, use the COPY (p. 302) command. For more information, see Use a COPY Command to Load Data (p. 27).

For example, to insert a value of 100 into the testtable table (which contains a single column), issue the following command:

    insert into testtable values (100);

Select Data from a Table

After you create a table and populate it with data, use a SELECT statement to display the data contained in the table. The SELECT * statement returns all the column names and row values for all of the data in a table and is a good way to verify that recently added data was correctly inserted into the table.

    To view the data that you entered in the testtable table, issue the following command:

    select * from testtable;

    The result will look like this:

testcol
---------
100
(1 row)

For more information about using the SELECT statement to query tables, see SELECT (p. 390) in the SQL Command Reference.

Step 4: Load Sample Data

Most of the examples in this guide use the TICKIT sample database. If you want to follow the examples using your SQL query tool, you will need to load the sample data for the TICKIT database.

The sample data for this tutorial is provided in Amazon S3 buckets that give read access to all authenticated AWS users, so any valid AWS credentials that permit access to Amazon S3 will work.

Note
If you followed the steps to load data in the Amazon Redshift Getting Started Guide, these tables already exist.

To load the sample data for the TICKIT database, you will first create the tables, then use the COPY command to load the tables with sample data that is stored in an Amazon S3 bucket. For more information, see Loading Data from Amazon S3 (p. 133).

You create tables using the CREATE TABLE command with a list of columns paired with data types. Many of the create table statements in this example specify options for the column in addition to the data type, such as not null, distkey, and sortkey. These are column attributes related to optimizing your tables for query performance. You can visit Designing Tables (p. 104) to learn how to choose these options when you design your table structures.

    1. Create the tables for the database.

The following SQL creates these tables: USERS, VENUE, CATEGORY, DATE, EVENT, LISTING, and SALES.

create table users(
    userid integer not null distkey sortkey,
    username char(8),
    firstname varchar(30),
    lastname varchar(30),
    city varchar(30),
    state char(2),
    email varchar(100),
    phone char(14),
    likesports boolean,
    liketheatre boolean,
    likeconcerts boolean,
    likejazz boolean,
    likeclassical boolean,
    likeopera boolean,
    likerock boolean,
    likevegas boolean,
    likebroadway boolean,
    likemusicals boolean);

create table venue(
    venueid smallint not null distkey sortkey,
    venuename varchar(100),
    venuecity varchar(30),
    venuestate char(2),
    venueseats integer);

create table category(
    catid smallint not null distkey sortkey,
    catgroup varchar(10),
    catname varchar(10),
    catdesc varchar(50));

    create table date( dateid smal