Design for Redshift with ER/Studio Data Architect
Ron Huizenga, Senior Product Manager, Enterprise Architecture & Modeling
April 9, 2018
Technology Note


© 2018 IDERA, Inc. All rights reserved.

Design for Redshift with ER/Studio Data Architect

Table of Contents

Overview
General Modeling Considerations
ER/Studio General Platform Support and Customization
Redshift Limitations
  DDL Constructs
    CREATE TABLE
    ALTER TABLE
  SQL Statements
  Unsupported PostgreSQL features
  Unsupported PostgreSQL data types
Redshift Specific Constructs that are not in PostgreSQL
  Distribution Style and Distribution Keys
  Sort Keys
  DDL Considerations
Physical Model
  Simple Method: PostSQL clause
  DDL Script Generation
  Advanced Method: Attachments and Macros
    Defining the Attachments
Starting with an Existing Implementation
  Reverse Engineer Redshift as Generic DBMS using ODBC
  MetaWizard Import from Redshift into PostgreSQL 8.0 Model
Final Steps
Conclusion
Appendix A: DDL Generated from My TICKIT Model
Appendix B: TICKIT Database DDL in Standard Redshift Tutorial


Overview I have been data modeling for over 35 years. In that time, I have seen many different data platforms come and go. I

initially created data models manually, and subsequently benefitted from a wide variety of modeling tools as

they emerged throughout my career. Amid all of that change, one thing has remained constant: the

ability to take advantage of my modeling tool of choice, regardless of which tool it was at the time, to deliver a

high quality design and database implementation.

Even if a modeling tool did not specifically list support for a new DBMS platform by name, there was generally

a way to make the modeling tool support most of the requirements. In the early days, this was relatively

straightforward, since the basis for most DBMS platforms was the ANSI SQL standard. However, as time has

progressed, there has been a proliferation of many data platforms with different features and widely varied

rates of adoption in the marketplace. There is also a constant stream of new versions of each of those

platforms, with new capabilities. Many modeling tools simply do not have the ability to keep up. However,

ER/Studio is the most capable enterprise data-modeling suite in the market with capabilities that allow it to

excel in conquering those challenges.

Maintaining pace with the frequency of change in data platforms is an ongoing challenge. For a modeling

vendor such as ourselves, prioritization is key, since it is virtually impossible to incorporate specific support for

every single feature for every single platform. There are platforms that we choose not to support as distinct

named platforms, simply due to low overall market adoption. With others, there can be a delay to ensure that

the platform (or feature) is viable in the marketplace. However, that does not preclude the use of ER/Studio

Data Architect in those situations. In fact, ER/Studio can usually be adapted easily to work with new platforms

including design, forward engineering (DDL generation) and reverse engineering functionality.

The high adaptability of ER/Studio Data Architect is due to the flexible and extensible architecture designed

into the product from the ground up. This allows modelers to define and create additional metadata for any

model construct, as well as the ability to extend generated DDL with additional syntax (pre and post SQL). In

terms of data platform connectivity, we have native connectors for many platforms, metadata import bridges,

and generic capabilities including ODBC connectivity. Capabilities can be extended further with custom

datatypes and datatype mappings for the platforms you wish to work with. Full macro programming capability

using WinWrap Basic, a language with which many users are already familiar, combined with an extensive

automation engine allows users to customize the capabilities, limited only by their imagination.

To illustrate these points, I will now discuss how to apply the capabilities of ER/Studio to the design and

implementation of a data warehouse deployed to Amazon Redshift. Redshift is gaining popularity in the

marketplace. As such, we will be implementing Redshift specific platform support as part of our product

roadmap. However, you can take advantage of the platform today by utilizing ER/Studio. I will also highlight

some of my thought process as a modeler, when working with a new platform.

I will not discuss how to set up an AWS cluster, Redshift, or the necessary security policies that are needed to

connect to Redshift from your computer. This document assumes that the necessary Redshift and AWS

configurations are already in place. It also assumes that the necessary ODBC and JDBC drivers have been

downloaded from the Amazon Redshift website and installed.


General Modeling Considerations I would be remiss not to point out that far too many teams rush toward a particular physical deployment as

a development activity without adequately analyzing and modeling the solution. Those teams incorrectly

assume that modeling, and data modeling in particular, is an unnecessary documentation step that simply

slows them down. Nothing could be further from the truth and those that bypass these necessary activities are

doing a disservice to themselves and their organizations: It is paramount that we first understand the data and

rules from a business standpoint, which we accomplish through logical modeling. Logical data modeling is an

important analysis step and intentionally technology agnostic in order to understand and define the data

elements and their relationships to one another, from a business perspective. We then derive the physical

models from the logical models, adding platform specific constructs. For larger solutions, we often iterate back

and forth between these activities, as different subject areas in the models evolve at a differing pace.

With the above in mind, we will begin with an example for a small data warehouse intended for analytics of

ticket sales. This particular example is based on the Amazon Redshift tutorial called “TICKIT” so that readers

who may already be familiar with it can focus purely on the modeling aspects in this discussion.

The TICKIT database is a tool to analyze sales activity for the fictional TICKIT web site, where users buy and sell

tickets online for different types of events. In particular, analysts can identify ticket movement over time,

success rates for sellers, and the best-selling events, venues, and seasons. It is small, consisting of two fact tables and five dimension tables.


The relationships (connectors) are extremely important, since they depict the business relationships between

the different concepts.

ER/Studio General Platform Support and Customization As mentioned previously, ER/Studio has many different customization capabilities. In particular, I make copies

of the shipped macros and datatype mappings in my own work folders. That allows me to modify them at low

risk, since I can easily overlay them with original copies if the result is not what I expect. Once I have created

the work folders, I can set the paths to use in the Data Architect application by selecting Tools > Options > Directories.


Here are my shipped defaults:

And the modified paths:

I am ready to commence with the physical model, in which I will incorporate platform specific design

constructs. For this example, I wish to design and deploy for Redshift. At this time, Redshift is not one of the

specifically named physical platforms in ER/Studio Data Architect, so I choose one that I feel has the highest

affinity with Redshift. Generally, to arrive at this decision, I consider similarities in the physical implementation

such as DDL syntax, data types, and other characteristics that I consider important.


After some consideration, I decide that the most compatible platform is PostgreSQL, since Redshift is a

derivative of PostgreSQL. However, both have evolved independently, with differences on both sides, so I

choose PostgreSQL 8.0 as my starting point. There are some differences in datatypes, as well as constructs in

Redshift that do not apply to PostgreSQL (covered in the next section). My second choice would likely be Generic

DBMS, which I will discuss in the reverse engineering section.

When I generate my physical model from the logical model, I would like the datatype mapping to proceed as

smoothly as possible, so I can modify the existing datatype mappings if I wish. The logical-to-physical datatype

mappings are file driven for all of the supported platforms in ER/Studio Data

Architect. This allows me to add new datatypes, alter existing mapping templates, or make a user-defined copy

of a datatype mapping, which is applied to specific models. An example datatype mapping is below:


Redshift Limitations Those looking at Redshift for the first time will find that it behaves quite differently than the databases they

have used in the past. The following is a subset of the capabilities in Postgres (and several other platforms)

that are unsupported in Amazon Redshift:

DDL Constructs

CREATE TABLE

Amazon Redshift does not support tablespaces, table partitioning, inheritance, and certain constraints. The

Amazon Redshift implementation of CREATE TABLE enables you to define the sort and distribution algorithms

for tables to optimize parallel processing.
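As a sketch of what this looks like (the table and column names here are illustrative, not from the TICKIT model), a Redshift CREATE TABLE can declare the distribution and sort behavior inline:

```sql
-- Hypothetical fact table; DISTSTYLE, DISTKEY and SORTKEY are Redshift
-- extensions to CREATE TABLE that control parallel distribution and ordering.
CREATE TABLE fact_sale (
    sale_id     INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    sale_date   DATE    NOT NULL,
    amount      DECIMAL(12,2),
    PRIMARY KEY (sale_id)      -- informational only; aids the query planner
)
DISTSTYLE KEY
DISTKEY (customer_id)          -- co-locate rows that join on customer_id
SORTKEY (sale_date);           -- order rows by date to reduce scanning
```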

ALTER TABLE

ALTER COLUMN actions are not supported.

ADD COLUMN supports adding only one column in each ALTER TABLE statement.

COPY (the Amazon Redshift COPY command is highly specialized to enable the loading of data from Amazon

S3 buckets and Amazon DynamoDB tables).

SQL Statements

INSERT, UPDATE, and DELETE: WITH clause is not supported.

Unsupported PostgreSQL features

Table partitioning (range and list partitioning)

Tablespaces

Constraints (Unique, Foreign key, Primary key, Check constraints, Exclusion constraints)

NOTE: Unique, primary key, and foreign key constraints are permitted, but are informational only. They are not

enforced by the system, but there is still value in defining them since they are used by the query planner.

Inheritance

Indexes

Collations

Stored procedures

Triggers

Sequences

Unsupported PostgreSQL data types

Arrays, BIT, BIT VARYING, BYTEA, Composite Types, Date/Time Types, INTERVAL, TIME, Enumerated Types,

Geometric Types, JSON, SERIAL, BIGSERIAL, SMALLSERIAL, MONEY, Object Identifier Types
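When an unsupported PostgreSQL type appears in a model, it must be mapped to a supported Redshift type. A few illustrative substitutions (these are my own choices for the mapping template, not an official conversion list):

```sql
-- Illustrative type substitutions for Redshift:
--   JSON   -> VARCHAR  (store the serialized document as text)
--   SERIAL -> INTEGER with an IDENTITY clause
--   MONEY  -> DECIMAL with explicit precision and scale
CREATE TABLE type_mapping_demo (
    id      INTEGER IDENTITY(1,1),  -- replaces SERIAL
    payload VARCHAR(4096),          -- replaces JSON
    price   DECIMAL(19,4)           -- replaces MONEY
);
```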


Redshift Specific Constructs that are not in PostgreSQL Redshift is a data warehouse platform, optimized for very fast execution of complex analytic queries against

very large data sets. Due to the massive amount of data in a data warehouse, specific design parameters

facilitate performance optimization.

Distribution Style and Distribution Keys

Distribution Styles (DISTSTYLE) defines the data distribution style for the entire table. Redshift distributes the

rows of a table to the compute nodes that make up the cluster. If the data is heavily skewed, meaning a large

amount is placed on a single node, query performance suffers. Even distribution prevents these bottlenecks by

ensuring that nodes share the processing load equally, according to the distribution style specified for the table, as

follows:

KEY – means that data is distributed by the values in the distribution key (DISTKEY) column(s). When

join columns of joining tables are set as distribution keys, the joining rows from both tables are collocated on

the compute nodes. This allows the optimizer to perform joins more efficiently. When DISTSTYLE of KEY is

specified, one or more DISTKEY columns must be specified for the table.

EVEN – means the data in the table is spread evenly across the nodes in a cluster in a round-robin distribution,

determined by row IDs. The result is distribution of approximately the same number of rows to each node.

EVEN is the default distribution style and assumed unless a different DISTSTYLE is specified.

ALL – means that a copy of the entire table is distributed to every node. This distribution style ensures that all

the rows required for any join to this table are available on every node. The downside is that it multiplies

storage requirements, increases load time and increases maintenance times for the table. ALL distribution can

improve execution time when used with certain dimension tables where KEY distribution is not appropriate. It

is generally suited to small tables used frequently in joins.

Sort Keys

Sort Keys (SORTKEY) determine the order in which rows in a table are stored. If properly applied, sort keys

allow the bypass of large chunks of data during query processing. Reduced data scanning improves query

performance significantly.

DDL Considerations

Distribution keys and sort keys can be specified as keywords for a specific column, when only one column is

used for the respective key, or as the last clause in the create table DDL. Specifying as the last clause in the

CREATE TABLE statement is the most flexible, since it supports DISTKEY and SORTKEY that have single or

multiple columns. One of the critical disciplines in modeling is consistency, so I choose to define them as the

last portion of the CREATE TABLE statement. This is also consistent with DISTSTYLE, defined only at the table

level. NOTE: If specifying a DISTKEY, DISTSTYLE is not required. The value of KEY will be assumed by default.
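The two placements can be sketched as follows (table and column names are illustrative). The column-level keyword form only works for single-column keys, while the trailing-clause form also handles compound keys:

```sql
-- Column-level keywords: limited to one DISTKEY column and one SORTKEY column.
CREATE TABLE demo_a (
    id  INTEGER DISTKEY,
    ts  TIMESTAMP SORTKEY
);

-- Table-level trailing clauses: supports compound keys, so I use this form
-- consistently (it also matches DISTSTYLE, which is table-level only).
CREATE TABLE demo_b (
    id  INTEGER,
    ts  TIMESTAMP,
    cat VARCHAR(20)
)
DISTKEY (id)
SORTKEY (ts, cat);
```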

Distribution style, distribution keys, and sort keys MUST be specified as part of the CREATE TABLE

statement; Redshift does not allow them to be changed with an ALTER TABLE statement. If there is a need to

change them, a new table must be created with the correct parameters, the data copied from the old table to

the new one, the old table dropped, and the new table renamed to the old name. Depending on the amount of

data involved, combined with the change in distribution style, this can be time consuming.
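That procedure can be sketched as follows (using an abbreviated column list for illustration; the real TICKIT sales table has more columns):

```sql
-- 1. Create a replacement table with the desired distribution/sort parameters.
CREATE TABLE sales_new (
    salesid INTEGER  NOT NULL,
    listid  INTEGER  NOT NULL,
    dateid  SMALLINT NOT NULL
)
DISTKEY (listid)
SORTKEY (dateid);

-- 2. Copy the data across.
INSERT INTO sales_new
SELECT salesid, listid, dateid FROM sales;

-- 3. Drop the old table and take over its name.
DROP TABLE sales;
ALTER TABLE sales_new RENAME TO sales;
```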


Physical Model Once I have generated the physical model from the logical model as PostgreSQL, using my

customized datatype mapping template, the next step is to update the physical model with specifications for

distribution style and sort keys, since these are critical to Redshift performance. There are two ways to do this.

The first is very straightforward. The second is more advanced, but provides additional model documentation

and communication benefits.

Simple Method: PostSQL clause

As stated, this is very straightforward. The table editor in the physical model has tabs to specify PreSQL &

PostSQL to be used in DDL generation for the table. The following screen shows how I have specified it for the

sales (fact) table:

NOTE: The DDL may look incorrect if you click on the full DDL preview tab. Note that there is a semicolon after

the initial CREATE TABLE statement, immediately before the PostSQL. We will eliminate this when we generate

the DDL script from the physical model.
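For reference, the PostSQL I enter for the sales table is a trailing clause of roughly this shape (these are my key choices for the TICKIT model; yours may differ):

```sql
-- Appended after the generated column list, so the consolidated
-- statement reads:
--   CREATE TABLE sales ( ... ) DISTKEY (listid) SORTKEY (listid, sellerid);
DISTKEY (listid)
SORTKEY (listid, sellerid)
```

Note that DISTSTYLE is omitted here: as discussed in the DDL Considerations section, supplying a DISTKEY implies DISTSTYLE KEY.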


DDL Script Generation

I will now review the DDL generation options that I use for Redshift. The DDL generation preferences can be

saved to a file, so that they can be defined once and re-used as required.


On the second screen, I have generation of constraint names turned off. I am also generating primary keys,

since they can be used by the query optimizer.

In the general options, I am generating foreign key constraints. Just like primary keys, they are informational

only, but can be utilized by the query optimizer. I have also cleared the SQL statement delimiter from the field

at the bottom of the screen. That will eliminate the separator between our CREATE TABLE statement and the

PostSQL clause, combining them into a consolidated CREATE TABLE statement.


The last screen allows me to double-check my specified options and save the DDL preferences to a file. I can

then preview the script. Clicking the Finish button allows me to save the DDL script. Lastly, I will connect to

Redshift with my editor of choice and execute the script. The full generated script is shown in Appendix A at

the end of the document.

Execution of scripts is generally the most common use case, as opposed to directly connecting to a database

and generating (or updating it). In most initiatives, database creation and updates must coincide with other

development deliverables. In a data warehouse environment, this will typically include staging area updates,

including extract, transform & load (ETL) from source systems to the staging area, as well as from the staging

area to the data warehouse itself. All source data stores, staging area and data warehouse can (and should) be

modelled using ER/Studio Data Architect. Visual Data Lineage modeling in ER/Studio Data Architect can model

ETL, including data transformations. When using ER/Studio Enterprise Team Edition, visual data lineage bridges

can reverse engineer from many popular ETL tools and platforms.


Advanced Method: Attachments and Macros

The advanced method builds upon the simple approach just discussed. In

fact, it uses the PostSQL in exactly the same way to generate the full DDL

script. The difference is that the advanced approach provides the ability

to specify the constructs separately by using attachments in ER/Studio. I

then use a macro to assemble the attachments and create the PostSQL

clause.

Defining the Attachments

First, I create attachment types and attachments in the data dictionary.

Attachment types are represented by folders, with specific attachments

belonging to that type in the folder. When specifying the attachment

type, I also indicate the kinds of model objects that the attachment type

applies to. The example I’m discussing today involves only tables, but we

may have other attachments that apply to more than one type of object.

This is very powerful, since I only need to set up a specific attachment

once, and it can be bound to multiple object types. As you can see in this

example, I have created a type for Redshift Physical Properties, with

specific attachments for DISTKEY, DISTSTYLE and SORTKEY.

I will now quickly show the setup of each:

DISTSTYLE has been set up with a description to describe how it is used. The specified content is a list of

values, with choices of EVEN or ALL. I have set EVEN as the default value, since it is also the default behaviour

in Redshift. I purposely excluded a value for KEY, since it is assumed if we provide a DISTKEY. It also minimizes

the amount of information I need to provide for each table.


DISTKEY has been set up with a description to describe how it is used. The specified content is text, to enable

entry of the DDL clause containing one or more columns that are part of the DISTKEY.


SORTKEY is similar to DISTKEY, as it has a detailed description and the value itself is text, so the full SORTKEY

SQL clause can be specified.


To specify the information for a specific table, I simply click on the Attachment Bindings tab within the table

editor. I have shown the specified information for the sales (fact) table below:


Once I have defined the attachment values for all my tables, I can execute a macro that I created, which will

read the bound attachments for each table and create the PostSQL clause. I have chosen to build the macro so

that it uses the tables I have selected on the screen. That allows me the flexibility to update individual tables,

groups of tables, or all of them very easily. I can execute the macro from the macro tab, or even add it to the

macro shortcuts menu of ER/Studio Data Architect.

The macro editing language in ER/Studio is WinWrap Basic, which is very flexible and powerful. A portion of the

code is shown below:

Because I have used attachments, I now have the ability to show the important DISTKEY, DISTSTYLE and

SORTKEY information directly on the model diagram itself. The model also classifies the tables as facts,

dimensions and subcategories within each. Other types can also be specified, such as snowflake, bridge,

hierarchy navigation, and undefined. ER/Studio can interrogate the model and automatically identify the table

types as well, if desired.


This enables improved communication and understanding, as well as providing high quality documentation.

The major benefit is that the specifications are contained in the model AND generated from the model. Model-driven design is extremely powerful and productive; it yields high-quality, consistent results.


Starting with an Existing Implementation We don’t always have the luxury of starting with a clean-sheet design. As a consultant, many of my

engagements first required assessment of an organization’s data environment. To do so, I relied heavily on

ER/Studio’s reverse engineering capabilities, allowing me to construct a blueprint of the data landscape as a

basis for analysis, enhancements, and redesign. The exercise of doing so typically uncovers many existing

deficiencies and inconsistencies that require remediation, especially if the databases were implemented

directly, without data modeling. In my experience, those implemented quickly as part of a development project

often have data structures skewed toward the easiest programming solution, rather than reflecting the

business needs and rules of the organization. There may also be missing foreign keys and inconsistent use of

datatypes. Therefore, reverse engineering and analysis presents an opportunity to introduce significant quality

and performance improvements.

For a platform that does not have specific named platform support in ER/Studio, we may still have multiple

approaches for reverse engineering:

1) Use reverse engineering specified for an earlier version of the same platform. This usually works fine,

unless the DBMS vendor has made significant connectivity changes, or dropped significant features in

the later release. Specific enhancements aligning with new platform features might not be in Data

Architect yet, but they can usually be overcome with approaches I outlined earlier in this document.

2) Generic DBMS Support using ODBC. Generally, when an ODBC driver is available for the platform, we

can usually connect for reverse engineering purposes. Again, we can augment using approaches

already described.

3) MetaWizard Import Bridges. ER/Studio Enterprise Team Edition includes a wide variety of metadata

bridges that can import metadata into ER/Studio models. For other editions, MetaWizard bridges can

be purchased as add-on licenses.

For this example using Redshift, I can use option 2 or 3 above. I usually try both, then proceed with the

approach that yields the results that I feel are most practical. The choice can vary based on the quality of the

implemented database that I am reverse engineering.

In this instance, using ODBC to reverse engineer to a generic DBMS could offer some distinct advantages. If the

implemented database has specified primary keys (even though they are for documentation), the ODBC driver

will recognize them and create the model accordingly. It will also recognize foreign key constraints if specified

(again, only for documentation). Even if they were not, ER/Studio has the capability to infer foreign key

relationships through column name matching.

In this instance, the MetaWizard for Redshift imports the metadata, creating a PostgreSQL 8.0 physical model

and a logical model. That corresponds to the platform choice I made earlier, when designing from scratch.

However, the MetaWizard only pulls the physical specifications, so we will not get primary keys or foreign key

relationships. Thus, I will need to expend a bit of effort to analyze and update the model accordingly.

I will now show each approach. The database that I reverse engineer was created from the DDL in Appendix B,

which is the same DDL used by the Amazon Redshift TICKIT tutorial.


Reverse Engineer Redshift as Generic DBMS using ODBC

In ER/Studio Data Architect, I begin by creating a new model, selecting the option to reverse engineer from an

existing database.

This will step me through the reverse engineer wizard. On the first screen, I specify ODBC and will select (or

create) my data source.


Clicking the ODBC setup button will show the data sources:

The configuration of the data source used in this example is as follows:


When setting up an ODBC driver for Redshift, it is important to review and alter the default data type

configuration, which is:

I have obtained the best results by un-checking the data type options. Skipping this step will result in

incorrect data type mapping from Redshift; in particular, string lengths are affected. Ensure that the Unicode

option is off (unchecked), or string data types will come into the model with declared lengths that are twice as

long as they are in the database itself.
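One way to verify the mapping after reverse engineering is to compare the declared lengths in the model against the database catalog. A sketch using the svv_columns view (the same view used for validation later in this document):

```sql
-- List declared string lengths for the public schema. A varchar(10) that
-- arrives in the model as varchar(20) suggests the ODBC driver's Unicode
-- option was left on during reverse engineering.
SELECT table_name, column_name, data_type, character_maximum_length
FROM svv_columns
WHERE table_schema = 'public'
  AND data_type LIKE 'character%'
ORDER BY table_name, ordinal_position;
```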

After selecting or specifying the data source, proceeding will result in the following pop-up message:

This is normal behaviour, since Redshift (which presents itself as PostgreSQL 8.x) is not a specific named DBMS platform in ER/Studio. Clicking the

“Yes” option will proceed normally.


On screen 2 of the wizard, I include user tables only.

On screen 3 of the wizard, I select the tables from my TICKIT example. I have excluded the auto health check

table that Redshift creates as part of the implementation.


I am hopeful that the database was created with primary keys defined. Therefore, I select the option to infer

foreign keys from names, which will match column names across tables to infer relationships.

I am also able to save the reverse engineering parameters into a quick launch file.


Clicking the “Finish” button will execute the process, providing a progress log.


Pro tip: for fastest results when working with very large databases, choose the circular layout option on screen 4

of the wizard. The following model view shows my reverse engineered Redshift database, after applying

some very basic layout changes.

I would now proceed with my analysis and modeling, adding constructs such as DISTSTYLE, DISTKEY and

SORTKEY as discussed previously.


MetaWizard Import from Redshift into PostgreSQL 8.0 Model

I can import from Redshift using the MetaWizard by selecting:

File -> Import File -> From External Metadata

NOTE: To use the MetaWizard for Redshift, the Redshift 4.1 JDBC driver is required. The current MetaWizard

version is not compatible with the Redshift 4.2.1 JDBC driver, which Amazon released subsequent to the

current MetaWizard build.

The help text from the MetaWizard driver, shown below, states that the 4.1 driver is required.

Clicking the dropdown field on the first screen allows me to select Amazon Redshift from an extensive list of

platforms and other data sources.


For my TICKIT example, I specified the parameters as follows:

On the second screen, there are additional parameters to specify, including the name of the model file to

create. I have the option to reverse engineer to a relational or dimensional model, just as I did using ODBC.


A detailed import progress screen is then presented.


Unlike the generic DBMS import using ODBC, the metadata import does not create primary keys, foreign keys

or relationships, even if they exist in the database as documentation.

Therefore, additional effort is required to specify the keys and relationships, as well as DISTSTYLE, DISTKEY,

SORTKEY and other parameters. If the Redshift database does not contain primary and foreign keys, I would

personally use the MetaWizard, since it creates a PostgreSQL 8.0 physical model, which I would prefer to use

going forward.

NOTE: ER/Studio is very flexible. I can change a Generic DBMS physical model to another physical platform

(including PostgreSQL), or vice versa. It will convert the datatypes to those supported by the target platform. I

can also use the logical model to generate multiple physical models for different platforms. This is a benefit of

the advanced ER/Studio architecture, which supports loosely coupled models.


Final Steps

Whenever I reverse engineer a data model from a database, I always perform additional validation checks to

ensure that I have specified parameters and options correctly. In cases like this example, where I know certain

constructs are not included in reverse engineering, I analyze the internal database catalog tables, building

queries to extract additional metadata. I can then use that metadata as a guide to make additional model

changes manually.

Here are a couple of helpful Redshift catalog queries:

To extract DISTSTYLE specifications:

select relname, reldiststyle from pg_class where relnamespace = 2200;

relname        reldiststyle
category_pkey  0
date_pkey      0
venue_pkey     0
event_pkey     0
category       8
venue          8
users_pkey     0
listing_pkey   0
sales_pkey     0
date           1
users          0
event          1
sales          1
listing        1

The values above can be decoded as follows:

RELDISTSTYLE   Distribution style
0              EVEN
1              KEY
8              ALL
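The decoding can also be folded into the query itself with a CASE expression; a sketch (relnamespace = 2200 again targets the public schema):

```sql
-- Decode reldiststyle inline instead of consulting the lookup table above.
SELECT relname,
       CASE reldiststyle
           WHEN 0 THEN 'EVEN'
           WHEN 1 THEN 'KEY'
           WHEN 8 THEN 'ALL'
           ELSE 'OTHER'
       END AS diststyle
FROM pg_class
WHERE relnamespace = 2200;
```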


To extract existing DISTKEY and SORTKEY specifications:

select * from pg_table_def where schemaname = 'public';

schemaname tablename column type encoding distkey sortkey notnull

public category catid smallint none FALSE 1 TRUE

public category catgroup character varying(10) lzo FALSE 0 FALSE

public category catname character varying(10) lzo FALSE 0 FALSE

public category catdesc character varying(50) lzo FALSE 0 FALSE

public category_pkey catid smallint none FALSE 1 FALSE

public date dateid smallint none TRUE 1 TRUE

public date caldate date lzo FALSE 0 TRUE

public date day character(3) lzo FALSE 0 TRUE

public date week smallint lzo FALSE 0 TRUE

public date month character(5) lzo FALSE 0 TRUE

public date qtr character(5) lzo FALSE 0 TRUE

public date year smallint lzo FALSE 0 TRUE

public date holiday boolean none FALSE 0 FALSE

public date_pkey dateid smallint none TRUE 1 FALSE

public event eventid integer none TRUE 1 TRUE

public event venueid smallint lzo FALSE 0 FALSE

public event catid smallint lzo FALSE 0 FALSE

public event dateid smallint lzo FALSE 0 FALSE

public event eventname character varying(200) lzo FALSE 0 FALSE

public event starttime timestamp without time zone lzo FALSE 0 FALSE

public event_pkey eventid integer none TRUE 1 FALSE

public listing listid integer none TRUE 1 TRUE

public listing sellerid integer lzo FALSE 0 FALSE

public listing eventid integer lzo FALSE 0 FALSE

public listing dateid smallint lzo FALSE 0 FALSE

public listing numtickets smallint lzo FALSE 0 TRUE

public listing priceperticket numeric(8,2) lzo FALSE 0 FALSE

public listing totalprice numeric(8,2) lzo FALSE 0 FALSE

public listing listtime timestamp without time zone lzo FALSE 0 FALSE

public listing_pkey listid integer none TRUE 1 FALSE

public sales salesid integer lzo TRUE 0 TRUE

public sales listid integer none FALSE 1 FALSE

public sales sellerid integer none FALSE 2 FALSE

public sales buyerid integer lzo FALSE 0 FALSE

public sales eventid integer lzo FALSE 0 FALSE

public sales dateid smallint lzo FALSE 0 FALSE

public sales qtysold smallint lzo FALSE 0 TRUE

public sales pricepaid numeric(8,2) lzo FALSE 0 FALSE

public sales commission numeric(8,2) lzo FALSE 0 FALSE

public sales saletime timestamp without time zone lzo FALSE 0 FALSE

public sales_pkey salesid integer lzo TRUE 0 FALSE

public users userid integer none FALSE 1 TRUE

public users username character(8) lzo FALSE 0 FALSE

public users firstname character varying(30) lzo FALSE 0 FALSE

public users lastname character varying(30) lzo FALSE 0 FALSE

public users city character varying(30) lzo FALSE 0 FALSE

public users state character(2) lzo FALSE 0 FALSE

public users email character varying(100) lzo FALSE 0 FALSE


public users phone character(14) lzo FALSE 0 FALSE

public users likesports boolean none FALSE 0 FALSE

public users liketheatre boolean none FALSE 0 FALSE

public users likeconcerts boolean none FALSE 0 FALSE

public users likejazz boolean none FALSE 0 FALSE

public users likeclassical boolean none FALSE 0 FALSE

public users likeopera boolean none FALSE 0 FALSE

public users likerock boolean none FALSE 0 FALSE

public users likevegas boolean none FALSE 0 FALSE

public users likebroadway boolean none FALSE 0 FALSE

public users likemusicals boolean none FALSE 0 FALSE

public users_pkey userid integer none FALSE 1 FALSE

public venue venueid smallint none FALSE 1 TRUE

public venue venuename character varying(100) lzo FALSE 0 FALSE

public venue venuecity character varying(30) lzo FALSE 0 FALSE

public venue venuestate character(2) lzo FALSE 0 FALSE

public venue venueseats integer lzo FALSE 0 FALSE

public venue_pkey venueid smallint none FALSE 1 FALSE
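When only the key specifications are of interest, the output can be narrowed; a sketch (note that column is a reserved word in pg_table_def and must be quoted):

```sql
-- Show only columns that participate in a distribution or sort key.
SELECT tablename, "column", distkey, sortkey
FROM pg_table_def
WHERE schemaname = 'public'
  AND (distkey = true OR sortkey <> 0)
ORDER BY tablename, sortkey;
```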


To validate column datatypes and sequencing in tables, it can also be helpful to run the following query (partial

results shown):

select * from svv_columns where table_schema = 'public' order by table_name, ordinal_position;

table_catalog table_schema table_name column_name ordinal_position column_default is_nullable data_type character_maximum_length numeric_precision numeric_precision_radix numeric_scale

dev public category catid 1 NO smallint 16 2 0
dev public category catgroup 2 YES character varying 10
dev public category catname 3 YES character varying 10
dev public category catdesc 4 YES character varying 50
dev public date dateid 1 NO smallint 16 2 0
dev public date caldate 2 NO date
dev public date day 3 NO character 3
dev public date week 4 NO smallint 16 2 0
dev public date month 5 NO character 5
dev public date qtr 6 NO character 5
dev public date year 7 NO smallint 16 2 0
dev public date holiday 8 FALSE YES boolean
dev public event eventid 1 NO integer 32 2 0
dev public event venueid 2 YES smallint 16 2 0
dev public event catid 3 YES smallint 16 2 0
dev public event dateid 4 YES smallint 16 2 0
dev public event eventname 5 YES character varying 200
dev public event starttime 6 YES timestamp without time zone
dev public listing listid 1 NO integer 32 2 0
dev public listing sellerid 2 YES integer 32 2 0
dev public listing eventid 3 YES integer 32 2 0
dev public listing dateid 4 YES smallint 16 2 0
dev public listing numtickets 5 NO smallint 16 2 0
dev public listing priceperticket 6 YES numeric 8 10 2
dev public listing totalprice 7 YES numeric 8 10 2
dev public listing listtime 8 YES timestamp without time zone
dev public sales salesid 1 NO integer 32 2 0
dev public sales listid 2 YES integer 32 2 0
dev public sales sellerid 3 YES integer 32 2 0
dev public sales buyerid 4 YES integer 32 2 0
dev public sales eventid 5 YES integer 32 2 0
dev public sales dateid 6 YES smallint 16 2 0
dev public sales qtysold 7 NO smallint 16 2 0
dev public sales pricepaid 8 YES numeric 8 10 2
dev public sales commission 9 YES numeric 8 10 2
dev public sales saletime 10 YES timestamp without time zone
dev public users userid 1 NO integer 32 2 0
dev public users username 2 YES character 8
dev public users firstname 3 YES character varying 30
dev public users lastname 4 YES character varying 30
dev public users city 5 YES character varying 30
dev public users state 6 YES character 2
dev public users email 7 YES character varying 100
dev public users phone 8 YES character 14
dev public users likesports 9 YES boolean
dev public users liketheatre 10 YES boolean
dev public users likeconcerts 11 YES boolean
dev public users likejazz 12 YES boolean
dev public users likeclassical 13 YES boolean
dev public users likeopera 14 YES boolean
dev public users likerock 15 YES boolean
dev public users likevegas 16 YES boolean
dev public users likebroadway 17 YES boolean
dev public users likemusicals 18 YES boolean
dev public venue venueid 1 NO smallint 16 2 0
dev public venue venuename 2 YES character varying 100
dev public venue venuecity 3 YES character varying 30
dev public venue venuestate 4 YES character 2
dev public venue venueseats 5 YES integer 32 2 0
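For model validation, the wide svv_columns output can be trimmed to the attributes that actually drive datatype checks; a sketch:

```sql
-- Project just the datatype-relevant attributes for the public schema.
SELECT table_name, column_name, ordinal_position, data_type,
       character_maximum_length, numeric_precision, numeric_scale, is_nullable
FROM svv_columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
```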


Conclusion

Throughout the document I have highlighted a portion of the ER/Studio advanced architecture and modeling

capabilities. While this is the end of this particular topic, it marks the beginning of many capabilities that you

can discover and apply in your own environment.

For example, rather than manually comparing the results of the Redshift catalog queries to my model, I could

use an even more advanced approach by first downloading the query results into a file, such as an Excel

workbook. Then, I could build reusable macros to read from the file and create the additional metadata in the

model for me. The approach is very similar to how I built the macro to populate the PostSQL clause from a

table’s bound attachments.

I hope you have found this document to be a helpful guide as you embark upon modeling and generating

Redshift data warehouses with ER/Studio. These principles apply to other platforms as well. ER/Studio’s

advanced architecture allows you to define and create additional metadata for any model construct, as well

as to extend generated DDL with additional syntax (pre and post SQL). Capabilities are extended with

custom datatypes and datatype mappings for the platforms you wish to work with. Full macro programming

capability using WinWrap Basic, combined with an extensive automation engine, allows you to customize the

product, limited only by your imagination.

These new capabilities will provide you with huge productivity benefits. Have fun impressing your boss

with your new modeling superpowers!


Appendix A: DDL Generated from My TICKIT Model

CREATE TABLE category(
    Catid     int2 NOT NULL,
    Catgroup  varchar(10),
    Catname   varchar(10),
    Catdesc   varchar(50),
    PRIMARY KEY (catid)
)
DISTSTYLE ALL
SORTKEY(catid)
;
--
-- TABLE: date
--
CREATE TABLE date(
    Dateid   int2 NOT NULL,
    Caldate  date NOT NULL,
    Day      char(3) NOT NULL,
    Week     int2 NOT NULL,
    Month    char(5) NOT NULL,
    Qtr      char(5) NOT NULL,
    Year     int2 NOT NULL,
    Holiday  boolean DEFAULT false,
    PRIMARY KEY (dateid)
)
DISTKEY(dateid)
SORTKEY(dateid)
;
--
-- TABLE: venue
--
CREATE TABLE venue(
    Venueid     int2 NOT NULL,
    Venuename   varchar(100),
    Venuecity   varchar(30),
    Venuestate  char(2),
    Venueseats  int4,
    PRIMARY KEY (venueid)
)
DISTSTYLE ALL
SORTKEY(venueid)
;
--
-- TABLE: event
--
CREATE TABLE event(
    Eventid    int4 NOT NULL,
    Venueid    int2,
    Catid      int2,
    Dateid     int2,
    Eventname  varchar(200),
    Starttime  timestamp,
    PRIMARY KEY (eventid),


    FOREIGN KEY (catid) REFERENCES category(catid),
    FOREIGN KEY (dateid) REFERENCES date(dateid),
    FOREIGN KEY (venueid) REFERENCES venue(venueid)
)
DISTKEY(eventid)
SORTKEY(eventid)
;
--
-- TABLE: users
--
CREATE TABLE users(
    Userid         int4 NOT NULL,
    Username       char(8),
    Firstname      varchar(30),
    Lastname       varchar(30),
    City           varchar(30),
    State          char(2),
    Email          varchar(100),
    Phone          char(14),
    Likesports     boolean,
    Liketheatre    boolean,
    Likeconcerts   boolean,
    Likejazz       boolean,
    Likeclassical  boolean,
    Likeopera      boolean,
    Likerock       boolean,
    Likevegas      boolean,
    Likebroadway   boolean,
    Likemusicals   boolean,
    PRIMARY KEY (userid)
)
DISTSTYLE EVEN
SORTKEY(userid)
;
--
-- TABLE: listing
--
CREATE TABLE listing(
    Listid          int4 NOT NULL,
    Sellerid        int4,
    Eventid         int4,
    Dateid          int2,
    Numtickets      int2 NOT NULL,
    Priceperticket  numeric(8, 2),
    Totalprice      numeric(8, 2),
    Listtime        timestamp,
    PRIMARY KEY (listid),
    FOREIGN KEY (eventid) REFERENCES event(eventid),
    FOREIGN KEY (sellerid) REFERENCES users(userid),
    FOREIGN KEY (dateid)


        REFERENCES date(dateid)
)
DISTKEY(listid)
SORTKEY(listid)
;
--
-- TABLE: sales
--
CREATE TABLE sales(
    Salesid     int4 NOT NULL,
    Listid      int4,
    Sellerid    int4,
    Buyerid     int4,
    Eventid     int4,
    Dateid      int2,
    Qtysold     int2 NOT NULL,
    Pricepaid   numeric(8, 2),
    Commission  numeric(8, 2),
    Saletime    timestamp,
    PRIMARY KEY (salesid),
    FOREIGN KEY (dateid) REFERENCES date(dateid),
    FOREIGN KEY (sellerid) REFERENCES users(userid),
    FOREIGN KEY (buyerid) REFERENCES users(userid),
    FOREIGN KEY (eventid) REFERENCES event(eventid),
    FOREIGN KEY (listid) REFERENCES listing(listid)
)
DISTKEY(salesid)
SORTKEY(listid, sellerid)
;


Appendix B: TICKIT Database DDL in Standard Redshift Tutorial

create table users(
    userid integer not null distkey sortkey,
    username char(8),
    firstname varchar(30),
    lastname varchar(30),
    city varchar(30),
    state char(2),
    email varchar(100),
    phone char(14),
    likesports boolean,
    liketheatre boolean,
    likeconcerts boolean,
    likejazz boolean,
    likeclassical boolean,
    likeopera boolean,
    likerock boolean,
    likevegas boolean,
    likebroadway boolean,
    likemusicals boolean);

create table venue(
    venueid smallint not null distkey sortkey,
    venuename varchar(100),
    venuecity varchar(30),
    venuestate char(2),
    venueseats integer);

create table category(
    catid smallint not null distkey sortkey,
    catgroup varchar(10),
    catname varchar(10),
    catdesc varchar(50));

create table date(
    dateid smallint not null distkey sortkey,
    caldate date not null,
    day character(3) not null,
    week smallint not null,
    month character(5) not null,
    qtr character(5) not null,
    year smallint not null,
    holiday boolean default('N'));

create table event(
    eventid integer not null distkey,
    venueid smallint not null,
    catid smallint not null,
    dateid smallint not null sortkey,
    eventname varchar(200),
    starttime timestamp);

create table listing(
    listid integer not null distkey,
    sellerid integer not null,
    eventid integer not null,
    dateid smallint not null sortkey,
    numtickets smallint not null,
    priceperticket decimal(8,2),


    totalprice decimal(8,2),
    listtime timestamp);

create table sales(
    salesid integer not null,
    listid integer not null distkey,
    sellerid integer not null,
    buyerid integer not null,
    eventid integer not null,
    dateid smallint not null sortkey,
    qtysold smallint not null,
    pricepaid decimal(8,2),
    commission decimal(8,2),
    saletime timestamp);