
Teradata SQL
Unleash the Power

Michael J. Larkins
and
Thomas L. Coffing, Jr.

Third Edition 2003
(Includes V2R5 functionality)

Written by Michael J. Larkins and Thomas L. Coffing
Web Page: www.CoffingDW.com

E-Mail addresses:
Mike: [email protected]
Tom: [email protected]

Teradata, NCR, and BYNET are registered trademarks of NCR Corporation, Dayton, Ohio, U.S.A. IBM and DB2 are registered trademarks of IBM Corporation. ANSI is a registered trademark of the American National Standards Institute. The Jeopardy game is a registered trademark of Parker Brothers and Merv Griffin. In addition to these product names, all brands and product names in this document are registered names or trademarks of their respective holders.

Coffing Data Warehousing shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book or from the use of programs or program segments that are included. The manual is not a publication of NCR Corporation, nor was it produced in conjunction with NCR Corporation.


Copyright 2001 by Coffing Publishing

All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or transmitted by any means, electronic, mechanical, photocopying, recording, or otherwise, without written permission from the publisher. No patent liability is assumed with respect to the use of information contained herein. Although every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, neither is any liability assumed for damages resulting from the use of information contained herein. For information, address:

Coffing Publishing
7810 Kiester Rd.
Middletown, OH 45042

International Standard Book Number: ISBN 0-9704980-3-9

Printed in the United States of America

All terms mentioned in this book that are known to be trademarks or service marks have been stated. Coffing Publishing cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.


Acknowledgements and Special Thanks

Todd Walter, NCR, for providing access to his people regarding V2R4 system information.

Paul Sinclair, NCR, for providing V2R4 information on Stored Procedures and reviewing the stored procedures chapter, information on the new V2R4 OLAP functionality, and for information regarding the new UPSERT command.

Fred Pluebell, JCPenney Corp., for providing V2R4 system availability while we were teaching in Dallas.

A special thanks to the staff at Nationwide Insurance for letting us teach an early V2R4 update class and helping finalize some additional syntax when creating stored procedures.

Larry Carter and Paul DeRouin, NCR, for information on changes to triggers in V2R4.

Bill Putnam for assistance in obtaining V2R4 information.

Chris Coffing, Coffing Data Warehousing, for dedication in getting our system up on V2R4 so that we didn’t have to “borrow” so much system time.

We have a very special thank you for Loraine Larkins. She is Mike’s Mom and an excellent proof-reader and barometer for the ease of understanding the material. This is especially true for someone who was not SQL literate when this whole thing started.

Last, but far from least, we want to thank God for providing us with the inspiration, dedication and fortitude to finish this book.


Teradata Introduction

The world's largest data warehouses commonly use the superior technology of NCR's Teradata relational database management system (RDBMS). A data warehouse is normally loaded directly from operational data. The majority, if not all of this data will be collected on-line as a result of normal business operations. The data warehouse therefore acts as a central repository of the data that reflects the effectiveness of the methodologies used in running a business.

As a result, the data loaded into the warehouse is mostly historic in nature. To get a true representation of the business, normally this data is not changed once it is loaded. Instead, it is interrogated repeatedly to transform data into useful information, to discover trends and the effectiveness of operational procedures. This interrogation is based on business rules to determine such aspects as profitability, return on investment and evaluation of risk.

For example, an airline might load all of its maintenance activity on every aircraft into the database. Subsequent investigation of the data could indicate the frequency at which certain parts tend to fail. Further analysis might show that the parts are failing more often on certain models of aircraft. The first benefit of this newfound knowledge is the ability to plan for the next failure, and maybe even for the type of airplane on which the part will fail. Therefore, the part can be on hand when, and maybe where, it is needed, or the part might be proactively changed prior to its failure.

If the information reveals that the part is failing more frequently on a particular model of aircraft, this could be an indication that the aircraft manufacturer has a problem with the design or production of that aircraft. Another possible cause is that the maintenance crew is doing something incorrectly and contributing to the situation. Either way, you cannot fix a problem if you do not know that a problem exists. There is incredible power and savings in this type of knowledge.

Another business area where the Teradata database excels is in retail. It provides an environment that can store billions of sales. This is a critical capability when you are recording and analyzing the sales of every item in every store around the world. Whether it is used for inventory control, marketing research or credit analysis, the data provides an insight into the business. This type of knowledge is not easily attainable without detailed data that records every aspect of the business. Tracking inventory turns, stock replenishment, or predicting the number of goods needed in a particular store yields a priceless perspective into the operation of a retail outlet. This information is what enables one retailer to thrive while others go out of business.

Teradata is flourishing with the realization that detail data is critical to the survival of a business in a competitive, lower margin environment. Continually, businesses are forced to do more with less. Therefore, it is vital to maximize the efforts that work well to improve profit and minimize or correct those that do not work.

One computer vendor used these same techniques to determine that it cost more to sell into the desktop environment than was realized in profit. Prior to this realization, the sales effort had attempted to make up the loss by selling more computers. Unfortunately, increased sales meant increased losses. Today, that company is doing much better and has made a huge step into profitability by discontinuing the small computer line.

Page 6: T D SQL Guide

   

Teradata Architecture

The Teradata database normally runs on NCR Corporation's WorldMark Systems in the UNIX MP-RAS environment. Some of these systems consist of a single processing node (computer) while others are several hundred nodes working together in a single system. The NCR nodes are based entirely on industry standard CPU processor chips, standard internal and external bus architectures like PCI and SCSI, and standard memory modules with 4-way interleaving for speed.

At the same time, Teradata can run on any hardware server in the single node environment when the system runs Microsoft NT and Windows 2000. This single node may be any computer from a large server to a laptop.

Whether the system consists of a single node or is a massively parallel system with hundreds of nodes, the Teradata RDBMS uses the exact same components executing on all the nodes in parallel. The only difference between small and large systems is the number of processing components.

When these components exist on different nodes, it is essential that the components communicate with each other at high speed. To facilitate the communications, the multi-node systems use the BYNET interconnect. It is a high speed, multi-path, dual redundant communications channel. Another amazing capability of the BYNET is that the bandwidth increases with each consecutive node added into the system. There is more detail on the BYNET later in this chapter.

Teradata Components

As previously mentioned, Teradata is the superior product today because of its parallel operations based on its architectural design. It is the parallel processing by the major components that provides the power to move mountains of data. Teradata works more like the early Egyptians who built the pyramids without heavy equipment, using parallel, coordinated human efforts. It uses smaller nodes running several processing components all working together on the same user request. Therefore, a monumental task is completed in record time.

Teradata operates with three major components to achieve the parallel operations. These components are called: Parsing Engine Processors, Access Module Processors and the Message Passing Layer. The role of each component is discussed in the next sections to provide a better understanding of Teradata. Once we understand how Teradata works, we will pursue the SQL that allows storage and access of the data.

Parsing Engine Processor (PEP or PE)

The Parsing Engine Processor (PEP) or Parsing Engine (PE), for short, is one of the two primary types of processing tasks used by Teradata. It provides the entry point into the database for users on mainframe and networked computer systems. It is the primary director task within Teradata.

As users “logon” to the database they establish a Teradata session. Each PE can manage 120 concurrent user sessions. Within each of these sessions users submit SQL as a request for the database server to take an action on their behalf. The PE will then parse the SQL statement to establish which database objects are involved. For now, let’s assume that the database object is a table. A table is a two-dimensional array that consists of rows and columns. A row represents an entity stored in a table and it is defined using columns. An example of a row might be the sale of an item and its columns include the UPC, a description and the quantity sold.

Any action a user requests must also go through a security check to validate their privileges as defined by the database administrator. Once their authorization at the object level is verified, the PE will verify that the columns requested actually exist within the objects referenced.

Next, the PE optimizes the SQL to create an execution plan that is as efficient as possible based on the amount of data in each table, the indices defined, the type of indices, the selectivity level of the indices, and the number of processing steps needed to retrieve the data. The PE is responsible for passing the optimized execution plan to other components as the best way to gather the data.

An execution plan might use the primary index column assigned to the table, a secondary index or a full table scan. The use of an index is preferable and will be discussed later in this chapter. For now, it is sufficient to say that a full table scan means that all rows in the table must be read and compared to locate the requested data.

Although a full table scan sounds really bad, within the architecture of Teradata, it is not necessarily a bad thing because the data is divided up and distributed to multiple, parallel components throughout the database. We will look next at the AMPs that perform the parallel disk access using their file system logic. The AMPs manage all data storage on disks. The PE has no disks.

Activities of a PE:
- Convert incoming requests from EBCDIC to ASCII (if from an IBM mainframe)
- Parse the SQL to determine type and validity
- Validate user privileges
- Optimize the access path(s) to retrieve the rows
- Build an execution plan with necessary steps for row access
- Send the plan steps to the Access Module Processors (AMP) involved

Access Module Processor (AMP)

The next major component of Teradata’s parallel architecture is called an Access Module Processor (AMP). It stores and retrieves the distributed data in parallel. Ideally, the data rows of each table are distributed evenly across all the AMPs. The AMPs read and write data and are the workhorses of the database. Their job is to receive the optimized plan steps, built by the PE after it completes the optimization, and execute them. The AMPs are designed to work in parallel to complete the request in the shortest possible time.

Optimally, every AMP should contain a subset of all the rows loaded into every table. By dividing up the data, it automatically divides up the work of retrieving the data. Remember, all work comes as a result of a user's SQL request. If the SQL asks for a specific row, that row exists in its entirety (all columns) on a single AMP and other rows exist on the other AMPs.

If the user request asks for all of the rows in a table, every AMP should participate along with all the other AMPs to complete the retrieval of all rows. This type of processing is called an all AMP operation and an all rows scan. However, each AMP is only responsible for its rows, not the rows that belong to a different AMP. As far as each AMP is concerned, it owns all of the rows. Within Teradata, the AMP environment is a "shared nothing" configuration. The AMPs cannot access each other's data rows, and there is no need for them to do so.

Once the rows have been selected, the last step is to return them to the client program that initiated the SQL request. Since the rows are scattered across multiple AMPs, they must be consolidated before reaching the client. This consolidation process is accomplished as a part of the transmission to the client so that a final comprehensive sort of all the rows is never performed. Instead, all AMPs sort only their rows (at the same time – in parallel) and the Message Passing Layer is used to merge the rows as they are transmitted from all the AMPs.

Therefore, when a client wishes to sequence the rows of an answer set, this technique causes the sort of all the rows to be done in parallel. Each AMP sorts only its subset of the rows at the same time all the other AMPs sort their rows. Once all of the individual sorts are complete, the BYNET merges the sorted rows. Pretty brilliant!

Activities of the AMP:
- Store and retrieve data rows using the file system
- Aggregate data
- Join processing between multiple tables
- Convert ASCII returned data to EBCDIC (IBM mainframes only)
- Sort and format output data

Message Passing Layer (BYNET)

The Message Passing Layer varies depending on the specific hardware on which the Teradata database is executing. In the latter part of the 20th century, most Teradata database systems executed under the UNIX operating system. However, in 1998, Teradata was released on Microsoft’s NT operating system. Today it also executes under Windows 2000. The initial release of Teradata, on the Microsoft systems, is for a single node.

When using the UNIX operating system, Teradata supports up to 512 nodes. This massively parallel architecture establishes the basis for Teradata storing and retrieving data from the largest commercial databases in the world. Today, the largest system in the world consists of 176 nodes. There is much room for growth as the databases begin to exceed 40 or 50 terabytes.

For the NCR UNIX systems, the Message Passing Layer is called the BYNET. The amazing thing about the BYNET is its capacity. Instead of a fixed bandwidth that is shared among multiple nodes, the bandwidth of the BYNET increases as the number of nodes increase. This feat is accomplished as a result of using virtual circuits instead of using a single fixed cable or a twisted pair configuration.

To understand the workings of the BYNET, think of a telephone switch used by local and long distance carriers. As more and more people place phone calls, no one needs to speak slower. As one switch becomes saturated, another switch is automatically used. When your phone call is routed through a different switch, you do not need to speak slower. If a natural or other type of disaster occurs and a switch is destroyed, all subsequent calls are routed through other switches. The BYNET is designed to work like a telephone switching network.

An additional aspect of the BYNET is that it is really two connection paths, like having two phone lines for a business. The redundancy allows for two different aspects of its performance. The first aspect is speed. Each path of the BYNET provides bandwidth of 10 Megabytes (MB) per second with Version 1 and 60 MB per second with Version 2. Therefore the aggregate speed of the two connections is 20MB/second or 120MB/second. However, as mentioned earlier, the bandwidth grows linearly as more nodes are added. Using Version 1 any two nodes communicate at 40MB/second (10MB/second * 2 BYNETs * 2 nodes). Therefore, 10 nodes can utilize 200MB/second and 100 nodes have 2000MB/second available between them. When using the version 2 BYNET, the same 100 nodes communicate at 12,000MB/second (60MB/second * 2 BYNETs * 100 nodes).

The second and equally important aspect of the BYNET uses the two connections for availability. Regardless of the speed associated with each BYNET connection, if one of the connections should fail, the second is completely independent and can continue to function at its individual speed without the other connection. Therefore, communications continue to pass between all nodes.

Although the BYNET is performing at half the capacity during an outage, it is still operational and SQL is able to complete without failing. In reality, when the BYNET is performing at only 10MB/second per node, it is still a lot faster than many normal networks that typically transfer messages at 10MB per second.

All messages going across the BYNET offer guaranteed delivery. So, any messages not successfully delivered because of a failure on one connection automatically route across the other connection. Since half of the BYNET is not working, the bandwidth reduces by half. However, when the failed connection is returned to service, its topology is automatically configured back into service and it begins transferring messages along with the other connection. Once this occurs, the capacity returns to normal.


A Teradata Database

Within Teradata, a database is a storage location for database objects (tables, views, macros, and triggers). An administrator can use Data Definition Language (DDL) to establish a database by using a CREATE DATABASE command.

A database may have PERMANENT (PERM) space allocated to it. This PERM space establishes the maximum amount of disk space for storing user data rows in any table located in the database. However, if no tables are stored within a database, it is not required to have PERM space. Although a database without PERM space cannot store tables, it can store views and macros because they are physically stored in the Data Dictionary (DD) PERM space and require no user storage space. The DD is in a “database” called DBC.

Teradata allocates PERM space to tables, up to the maximum, as rows are inserted. The space is not pre-allocated. Instead, it is allocated, as rows are stored in blocks on disk. The maximum block size is defined either at a system level in the DBS Control Record, at the database level or individually for each table. Like PERM, the block size is a maximum size. Yet, it is only a maximum for blocks that contain multiple rows. By nature, the blocks are variable in length. So, disk space is not pre-allocated; instead, it is allocated on an as needed basis, one sector (512 bytes) at a time. Therefore, the largest possible wasted disk space in a block is 511 bytes.

A database can also have SPOOL space associated with it. All users who run queries need workspace at some point in time. This SPOOL space is workspace used for the temporary storage of rows during the execution of user SQL statements. Like PERM space, SPOOL is defined as a maximum amount that can be used within a database or by a user. Since PERM is not pre-allocated, unused PERM space is automatically available for use as SPOOL. This maximizes the disk space throughout the system.

It is a common practice in Teradata to have some databases with PERM space that contain only tables. Then, other databases contain only views. These view databases require no PERM space and are the only databases that users have privileges to access. The views in these databases control all access to the real tables in other databases. They insulate the actual tables from user access. There will be more on views later in this book.


The newest type of space allocation within Teradata is TEMPORARY (TEMP) space. A database may or may not have TEMP space, however, it is required if Global Temporary Tables are used. The use of temporary tables is also covered in more detail later in the SQL portion of this book.

A database is defined using a series of parameter values at creation time. The majority of the parameters can easily be changed after a database has been created using the MODIFY DATABASE command. However, when attempting to increase PERM or TEMP space maximums, there must be sufficient disk space available even though it is not immediately allocated. There cannot be more PERM space defined than actual disk space on the system.

A number of additional database parameters are listed below along with the user parameters in the next section. These parameters are tools for the database administrator and other experienced users when establishing databases for tables and views.

CREATE / MODIFY DATABASE Parameters
- PERMANENT
- TEMPORARY
- SPOOL
- ACCOUNT
- FALLBACK
- JOURNAL
- DEFAULT JOURNAL
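For illustration only, here is a minimal sketch of a CREATE DATABASE statement; the database name and byte values are invented for this example:

CREATE DATABASE Sales_Tables FROM DBC
AS PERMANENT = 10000000000
 , SPOOL = 5000000000
 , FALLBACK ;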


Teradata Users

In Teradata, a user is the same as a database with one exception. A user is able to logon to the system and a database cannot. Therefore, to authenticate the user, a password must be established. The password is normally established at the same time that the CREATE USER statement is executed. The password can also be changed using a MODIFY USER command.

Like a database, a user area can contain database objects (tables, views, macros and triggers). A user can have PERM and TEMP space and can also have spool space. On the other hand, a user might not have any of these types of space, exactly the same as a database.

The biggest difference between a database and a user is that a user must have a password. Otherwise, the similarity between the two makes administering the system easier and allows for default values that all databases and users can inherit.

The next two lists regard the creation and modification of databases and users.

{ CREATE | MODIFY } DATABASE or USER (in common)
- PERMANENT
- TEMPORARY
- SPOOL
- ACCOUNT
- FALLBACK
- JOURNAL
- DEFAULT JOURNAL

{ CREATE | MODIFY } USER (only)
- PASSWORD
- STARTUP
- DEFAULT DATABASE
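As a minimal sketch (the user name, password and space values are invented for this example), a user might be created like this:

CREATE USER mjl FROM DBC
AS PASSWORD = mypass123
 , PERMANENT = 0
 , SPOOL = 1000000000
 , DEFAULT DATABASE = Sales_Tables ;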

By no means are these all of the parameters. It is not the intent of this chapter, nor the intent of this book to teach database administration. There are reference manuals and courses available to use. Teradata administration warrants a book by itself.


Symbols Used in this Book

Since there are no standard symbols for teaching SQL, it is necessary to understand some of the symbols used in our syntax diagrams throughout this book.

This chart should be used as a reference for SQL syntax used in the book:

<database-name>          Substitute an actual database name in this location
<table-name>             Substitute an actual table name in this location
<comparison>             Substitute a comparison in this location, i.e. a=1
<column-name>            Substitute an actual column name in this location
<data-value>             Substitute a literal data value in this location
[ optional entry ]       Everything between the [ ] is optional; it is not required for valid syntax, so use it when needed
{ use this | or this }   Use one of the keywords or symbols on either side of the "|", but not both, i.e. { LEFT | RIGHT } means use either "LEFT" or "RIGHT" but not both

Figure 1-1


DATABASE Command

When users negotiate a successful logon to Teradata, they are automatically positioned in a default database as defined by the database administrator. When an SQL request is executed, by default, it looks in the current database for all referenced objects.

There may be times when the object is not in the current database. When this happens, the user has one of two choices to resolve this situation. One solution is to qualify the name of the object along with the name of the database in which it resides. To do this, the user simply associates the database name to the object name by connecting them with a period (.) or dot as shown below:

<database-name>.<table-name>

The second solution is to use the DATABASE command. It repositions the user to the specified database. After the DATABASE command is executed, there is no longer a need to qualify the objects in that database. Of course, if the SQL statement references additional objects in another database, they will have to be qualified in order for the system to locate them. Normally, you will issue a DATABASE command to move to the database that contains most of the objects that you need. This reduces the number of object names requiring qualification.

The following is the syntax for the DATABASE command.

DATABASE <database-name>;

If you are not sure what database you are in, either the HELP SESSION or SELECT DATABASE command may be used to make that determination. These commands and other HELP functions are covered in the SQL portion of this book.
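For example, assuming the hypothetical Sales_Tables database sketched earlier in this chapter, the two commands might be used like this:

DATABASE Sales_Tables ;

SELECT DATABASE ;

The SELECT DATABASE statement returns the name of the session's current default database.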


Use of an Index

Although a relational data model uses Primary Keys and Foreign Keys to establish the relationships between tables, that design is a Logical Model. Each vendor uses specialized techniques to implement a Physical Model. Teradata does not use keys in its physical model. Instead, Teradata is implemented using indices, both primary and secondary.

The Primary Index (PI) is the most important index in all of Teradata. The performance of Teradata can be linked directly to the selection of this index. The data value in the PI column(s) is submitted to the hashing function. The resulting row hash value is used to map the row to a specific AMP for data distribution and storage.

To illustrate this concept, I have on several occasions used two decks of cards. Imagine if you will, fourteen people in a room. To the largest, most powerful looking man in the room, you give one of the decks of cards. His large hands allow him to hold all fifty-two cards at one time, with some degree of success. The cards are arranged with the ace of spades continuing through the king of spades in ascending order. After the spades, are the hearts, then the clubs and last, the diamonds. Each suit is arranged starting with the ace and ascending up to the king. The cards are partitioned by suit.

The other deck of cards is divided among the other thirteen people. Using this procedure, all cards with the same value (i.e. aces) all go to the same person. Likewise, all the deuces, treys and subsequent cards each go to one of the thirteen people. Each of the four cards will be in the same order as the suits contained in the single deck that went to the lone man: spades, hearts, clubs and diamonds. Once all the cards have been distributed, each of the thirteen people will be holding four cards of the same value (4*13=52). Now, the game can begin.

The requests in this game come in the form of “give-me,” one or more cards.

To make it easy for the lone player, we first request: give-me the ace of spades. The person with four aces finds their ace, as does the lone player with all 52 cards, both on the top of their cards. That was easy!

As the give-me requests increase in difficulty, the workload dramatically increases for the lone man. For instance, when the give-me request is for all of the twos, one of the thirteen people holds up all four of their cards and they are done. The lone man must locate the 2 of spades between the ace and trey. Then, go and locate the 2 of hearts, thirteen cards later between the ace and trey. Then, find the 2 of clubs, thirteen cards after that, as well as the 2 of diamonds, thirteen cards after that to finally complete the request.

Another request might be give-me all of the diamonds. For the thirteen people, each person locates and holds up one card of their cards and the request is finished. For the lone person with the single deck, the request means finding and holding up the last thirteen cards in their deck of fifty-two. In each of these give-me requests, the lone man had to negotiate all fifty two cards while the thirteen other people only needed to determine which of the four cards applied to the request, if any. This is the same procedure used by Teradata. It divides up the data like we divided up the cards.

As illustrated, the thirteen people are faster than the lone man. However, the game is not limited to thirteen players. If there were 26 people who wished to play on the same team, the cards simply need to be divided or distributed differently.

When using the value (ace through king) there are only 13 unique values. In order for 26 people to play, we need a way to come up with 26 unique values for 26 people. To make the cards more unique, we might combine the value of the card (i.e. ace) with the color. Therefore, we have two red aces and two black aces as well as two sets for every other card. Now when we distribute the cards, each of the twenty-six people receives only two cards instead of the original four. The distribution is still based on fifty-two cards (2 times 26).

At the same time, the optimum number of people for the game is not 26. Based on what has been discussed so far, what is the optimum number of people?

If your answer is 52, then you are absolutely correct.

With this many people, each person has one and only one card. Any time a give-me is requested of the participants, their one card either qualifies or it does not. It doesn’t get any simpler or faster than this situation.

As easy as this sounds, to accomplish this distribution the value of the card alone is not sufficient to manifest 52 unique values. Neither is using the value and the color. That combination only gives us a distribution of 26 unique values when 52 unique values are desired.


To achieve this distribution we need to establish still more uniqueness. Fortunately, we can use the suit along with the value. Therefore, the ace of spades is different than the ace of hearts, which is different from the ace of clubs and the ace of diamonds. In other words, there are now 52 unique identities to use for distribution.

To relate this distribution to Teradata, one or more columns of a table are chosen to be the Primary Index.

Primary Index

The Primary Index can consist of up to sixteen different columns. These columns, when considered together, provide a comprehensive technique to derive a Unique Primary Index (UPI, pronounced as “you-pea”) value as we discussed previously regarding the card analogy. That is the good news.

To store the data, the value(s) in the PI are hashed via a calculation to determine which AMP will own the data. The same data value always hashes to the same row hash value and is therefore always associated with the same AMP.

The advantage to using up to sixteen columns is that row distribution is very smooth or even, based on unique values. This simply means that each AMP contains the same number of rows. At the same time, there is a downside to using several columns for a PI. The PE needs every data value for each column as input to the hashing calculation to directly access a particular row. If a single column value is missing, a full table scan will result because the row hash cannot be recreated. Any row retrieval using the PI column(s) is always an efficient, one AMP operation.

Although uniqueness is good in most cases, Teradata does not require that a UPI be used. It also allows for a Non-Unique Primary Index (NUPI, pronounced as new-pea). The potential downside of a NUPI is that if several duplicate values (NUPI dups) are stored, they all go to the same AMP. This can cause an uneven distribution that places more rows on some of the AMPs than on others. This means that any time an AMP with a larger number of rows is involved, it has to work harder than the other AMPs. The other AMPs will finish before the slower AMP. The time to process a single user request is always based on the slowest AMP. Therefore, serious consideration should be used when making the decision to use a NUPI.

Every table must have a PI and it is established when the table is created. If the CREATE TABLE statement contains: UNIQUE PRIMARY INDEX ( <column-list> ), the value in the column(s) will be distributed to an AMP as a UPI. However, if the statement reads: PRIMARY INDEX ( <column-list> ), the value in the column(s) will be distributed as a NUPI and allow duplicate values. Again, all the same values will go to the same AMP.
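As a sketch (the data types are assumed for illustration), the Student table used in this book's examples might be created with a UPI like this:

CREATE TABLE Student_Table
 ( Student_ID  INTEGER
 , Last_Name   CHAR(20)
 , First_Name  VARCHAR(12)
 , Class_Code  CHAR(2)
 , Grade_Pt    DECIMAL(5,2) )
UNIQUE PRIMARY INDEX ( Student_ID ) ;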

If the DDL statement does not specify a PI, but it specifies a PRIMARY KEY (PK), the named column(s) are used as the UPI. Although Teradata does not use primary keys, the DDL may be ported from another vendor's database system.

A UPI is used because a primary key must be unique and cannot be null. By default, both UPIs and NUPIs allow a null value to be stored unless the column definition indicates that null values are not allowed using a NOT NULL constraint.

Now, with that being said, when considering JOIN accesses on the tables, sometimes it is advantageous to use a NUPI. This is because the rows being joined between tables must be on the same AMP. If they are not on the same AMP, one of the rows must be moved to the same AMP as the matching row. Teradata will use one of two different strategies to temporarily move rows. It can copy all needed rows to all AMPs or it can redistribute them using the hashing mechanism on the column defined as the join domain that is a PI. However, if neither join column is a PI, it might be necessary to redistribute all participating rows from both tables by hash code to get them together on a single AMP.

Planning data distribution, using access characteristics, can reduce the amount of data movement and therefore improve join performance. This works fine as long as there is a consistent number of duplicate values or only a small number of duplicate values. The logical data model needs to be extended with usage information in order to know the best way to distribute the data rows. This is done during the physical implementation phase before creating tables.

Secondary Index

A Secondary Index (SI) is used in Teradata as a way to directly access rows in the data, sometimes called the base table, without requiring the use of PI values. Unlike the PI, an SI does not affect the distribution of the data rows. Instead, it is an alternate read path and allows for a method to locate the PI value using the SI. Once the PI is obtained, the row can be directly accessed using the PI. Like the PI, an SI can consist of up to 16 columns.

In order for an SI to retrieve the data row by way of the PI, it must store and retrieve an index row. To accomplish this, Teradata creates, maintains and uses a subtable. The PI of the subtable is the value in the column(s) that are defined as the SI. The "data" stored in the subtable row is the previously hashed value of the real PI for the data row or rows in the base table. The SI is a pointer to the real data row desired by the request. An SI can also be unique (USI, pronounced as you-sea) or non-unique (NUSI, pronounced as new-sea).

The rows of the subtable contain the row hashed value of the SI, the actual data value(s) of the SI, and the row hashed value of the PI as the row ID. Once the row ID of the PI is obtained from the subtable row, using the hashed value of the SI, the last step is to get the actual data row from the AMP where it is stored. The action and hashing for an SI is exactly the same as when starting with a PI. When using a USI, the access of the subtable is a one AMP operation and then accessing the data row from the base table is another one AMP operation. Therefore, USI accesses are always a two AMP operation based on two separate row hash operations.

When using a NUSI, the subtable access is always an all AMP operation. Since the data is distributed by the PI, NUSI duplicate values may exist and probably do exist on multiple AMPs. So, the best plan is to go to all AMPs and check for the requested NUSI value. To make this more efficient, each AMP scans its subtable. These subtable rows contain the row hash of the NUSI, the value of the data that created the NUSI and one or more row IDs for all the PI rows on that AMP. This is still a fast operation because these rows are quite small and several are stored in a single block. If the AMP determines that it contains no rows for the value of the NUSI requested, it is finished with its portion of the request. However, if an AMP has one or more rows with the NUSI value requested, it then goes and retrieves the data rows into spool space using the index.

With this said, the SQL optimizer may decide that there are too many base table data rows to make index access efficient. When this happens, the AMPs will do a full base table scan to locate the data rows and ignore the NUSI. This situation is called a weakly selective NUSI. Even using old-fashioned indexed sequential files, it has always been more efficient to read the entire file and not use an index if more than 15% of the records were needed. This is compounded with Teradata because the “file” is read in parallel instead of all data from a single file. So, the efficiency percentage is probably closer to being less than 3% of all the rows in order to use the NUSI.

If the SQL does not use a NUSI, you should consider dropping it, due to the fact that the subtable takes up PERM space with no benefit to the users. The Teradata EXPLAIN is covered in this book and it is the easiest way to determine if your SQL is using a NUSI. Furthermore, the optimizer will never use a NUSI without STATISTICS.
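For illustration, assuming the Student table used in this book's examples, a NUSI might be created and statistics collected on it like this:

CREATE INDEX ( Class_Code ) ON Student_Table ;

COLLECT STATISTICS ON Student_Table INDEX ( Class_Code ) ;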

There has been another evolution in the use of NUSI processing, called NUSI Bitmapping. If a table has two different NUSI indices that are individually weakly selective, but together eliminate most of the non-conforming rows, the optimizer will bitmap the two NUSI columns together because, combined, they become highly selective. Therefore, many times it is better to use smaller individual NUSI indices instead of a large composite (more than one column) NUSI.

There is another feature related to NUSI processing that can improve access time when a value range comparison is requested. When using hash values, it is impossible to determine any value within the range. This is because large data values can generate small hash values and small data values can produce large hash values. So, to overcome the issue associated with a hashed value, there is a range feature called Value Ordered NUSIs. At this time, it may only be used with a four byte or smaller numeric data column. Based on its functionality, a Value Ordered NUSI is perfect for date processing. See the DDL chapter in this book for more details on USI and NUSI usage.


Determining the Release of Your Teradata System

SELECT * FROM DBC.DBCINFO;

InfoKey    InfoData
RELEASE    V2R.04.00.02.26
VERSION    04.00.02.27


Fundamental Structured Query Language (SQL)

The access language for all modern relational database systems (RDBMS) is Structured Query Language (SQL). It has evolved over time to be the standard. The ANSI SQL group defines which commands and functionality all vendors should provide within their RDBMS.

There are three levels of compliance within the standard: Entry, Intermediate and Full. The three level definitions are based on specific commands, data types and functionalities. So, it is not that a vendor has incorporated some percentage of the commands; it is more that each command is categorized as belonging to one of the three levels. For instance, most data types are Entry level compliant. Yet, there are some that fall into the Intermediate and Full definitions.

Since the standard continues to grow with more options being added, it is difficult to stay fully ANSI compliant. Additionally, all RDBMS vendors provide extra functionality and options that are not part of the standard. These extra functions are called extensions because they extend or offer a benefit beyond those in the standard definition.

At the writing of this book, Teradata was fully ANSI Entry level compliant based on the 1992 Standards document. NCR also provides much of the Intermediate and some of the Full capabilities. This book indicates feature by feature which SQL capabilities are ANSI and which are Teradata specific, or extensions. It is to NCR’s benefit to be as compliant as possible in order to make it easier for customers of other RDBMS vendors to port their data warehouse to Teradata.

As indicated earlier, SQL is used to access, store, remove and modify data stored within a relational database, like Teradata. The SQL is actually comprised of three types of statements. They are: Data Definition Language (DDL), Data Control Language (DCL) and Data Manipulation Language (DML). The primary focus of this book is on DML and DDL. Both DDL and DCL are, for the most part, used for administering an RDBMS. Since the SELECT statement is used the vast majority of the time, we are concentrating on its functionality, variations and capabilities.

Everything in the first part of this chapter describes ANSI standard capabilities of the SELECT command. As the statements become more involved, each capability will be designated as either ANSI or a Teradata Extension.


Basic SELECT Command

Using the SELECT has been described like playing the game, Jeopardy. The answer is there; all you have to do is come up with the correct question.

The basic structure of the SELECT statement indicates which column values are desired and the tables that contain them. To aid in the learning of SQL, this book will capitalize the SQL keywords. However, when SQL is written for Teradata, the case of the statement is not important. The SQL statements can be written using all uppercase, lowercase or a combination; it does not matter to the Teradata PE.

The SELECT is used to return the data value(s) stored in the columns named within the SELECT command. The requested columns must be valid names defined in the table(s) listed in the FROM portion of the SELECT.

The following shows the format of a basic SELECT statement. In this book, the syntax uses expressions like: <column-name> (see Figure 1-1) to represent the location of one or more names required to construct a valid SQL statement:
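SELECT <column-name>
FROM <table-name> ;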

The structure of the above command places all keywords on the left in uppercase and the variable information such as column and table names to the right. Like using capital letters, this positioning is to aid in learning SQL. Lastly, although the use of SEL is acceptable in Teradata, with [ECT] in square brackets being optional, it is not ANSI standard.

Additionally, when multiple column names are requested in the SELECT, a comma must separate them. Without the separator, the optimizer cannot determine where one ends and the next begins.

The following syntax format is also acceptable:

SEL[ECT] <column-name> FROM <table-name> ;


Both of these SELECT statements produce the same output report, but the first style is easier to read and debug for complex queries. The output display might appear as:

3 Rows Returned

<column-name>

aaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccccc

In the output, the column name becomes the default heading for the report. Then, the data contained in the selected column is displayed once for each row returned.

The next variation of the SELECT statement returns all of the columns defined in the table indicated in the FROM portion of the SELECT.
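SELECT *
FROM <table-name> ;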

The output of the above request uses each column name as the heading and the columns are displayed in the same sequence as they are defined in the table. Depending on the tool used to submit the request, care should be taken: if the returned display is wider than the output media (i.e. terminal = 80 characters and paper = 133 characters), it may be truncated.

At times, it is desirable to select the same column twice. This is permitted and to accomplish it, the column name is simply listed in the SELECT column list more than once. This technique might often be used when doing aggregations or calculating a value, both are covered in later chapters.

The table below is used to demonstrate the results of various requests. It is a small table with a total of ten rows for easy comparison.

Student Table - contains 10 students

Student_ID   Last_Name   First_Name   Class_Code   Grade_Pt
PK                                    FK
UPI          NUSI                     NUSI

123250       Phillips    Martin       SR           3.00
125634       Hanson      Henry        FR           2.88
234121       Thomas      Wendy        FR           4.00
231222       Wilson      Susie        SO           3.80
260000       Johnson     Stanley
280023       McRoberts   Richard      JR           1.90
322133       Bond        Jimmy        JR           3.95
324652       Delaney     Danny        SR           3.35
333450       Smith       Andy         SO           2.00
423400       Larkins     Michael      FR           0.00

Figure 2-1

For example, the next SELECT might be used with Figure 2-1 to display the student number, the last name, first name, the class code and grade point for all of the students in the Student table:

SELECT *
FROM Student_Table ;

10 Rows returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

423400   Larkins     Michael   FR   0.00
125634   Hanson      Henry     FR   2.88
280023   McRoberts   Richard   JR   1.90
260000   Johnson     Stanley   ?    ?
231222   Wilson      Susie     SO   3.80
234121   Thomas      Wendy     FR   4.00
324652   Delaney     Danny     SR   3.35
123250   Phillips    Martin    SR   3.00
322133   Bond        Jimmy     JR   3.95
333450   Smith       Andy      SO   2.00

Notice that Johnson has question marks in the grade point and class code columns. Most client software uses the question mark to represent missing data or an unknown value (NULL). More discussion on this condition will appear throughout this book. The other thing to note is that character data is aligned from left to right, the same as we read it, and numeric data is aligned from right to left, from the decimal point.

This SELECT returns all of the columns except the Student ID from the Student table:
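SELECT First_Name, Last_Name, Class_Code, Grade_Pt
FROM Student_Table ;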


10 Rows returned

First_Name Last_Name Class_Code Grade_Pt

Michael   Larkins     FR   0.00
Henry     Hanson      FR   2.88
Richard   McRoberts   JR   1.90
Stanley   Johnson     ?    ?
Susie     Wilson      SO   3.80
Wendy     Thomas      FR   4.00
Danny     Delaney     SR   3.35
Martin    Phillips    SR   3.00
Jimmy     Bond        JR   3.95
Andy      Smith       SO   2.00

There is no short cut for selecting all columns except one or two. Also, notice that the columns are displayed in the output in the same sequence they are requested in the SELECT statement.


WHERE Clause

The previous "unconstrained" SELECT statement returned every row from the table. Since the Teradata database is most often used as a data warehouse, a table might contain millions of rows. So, it is wise to request only certain types of rows for return.

By adding a WHERE clause to the SELECT, a constraint is established to potentially limit which rows are returned based on a TRUE comparison to specific criteria or set of conditions.

The conditional check in the WHERE can use the ANSI comparison operators (symbols are ANSI / alphabetic is Teradata Extension):

                        ANSI   Teradata
Equal                   =      EQ
Not Equal               <>     NE
Less Than               <      LT
Greater Than            >      GT
Less Than or Equal      <=     LE
Greater Than or Equal   >=     GE

Figure 2-2

The following SELECT can be used to return the students with a B (3.0) average or better from the Student table:
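SELECT Student_ID, Last_Name, Grade_Pt
FROM Student_Table
WHERE Grade_Pt >= 3.0 ;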

5 Rows returned

Student_ID Last_Name Grade_Pt

231222   Wilson     3.80
234121   Thomas     4.00
324652   Delaney    3.35
123250   Phillips   3.00
322133   Bond       3.95

Without the WHERE clause, the AMPs return all of the rows in the table to the user. More and more Teradata user systems are getting to the point where they are storing billions of rows in a single table. There must be a very good reason for needing to see all of them. More simply put, you will always use a WHERE clause whenever you want to see only a portion of the rows in a table.


Compound Comparisons ( AND / OR )

Many times a single comparison is not sufficient to specify the desired rows. To add more functionality to the WHERE, it is common to use more than one comparison. Unlike column names, the multiple condition checks are not separated by commas. Instead, they must be connected using a logical operator.

The following is the syntax for using the AND logical operator:
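SELECT <column-name>
FROM <table-name>
WHERE <column-name> <comparison> <data-value>
AND <column-name> <comparison> <data-value> ;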

Notice that the column name is listed for each comparison separated by a logical operator; this will be true even when it is the same column being compared twice. The AND signifies that each individual comparison on both sides of the AND must be true. The final result of the comparison must be TRUE for a row to be returned.

This Truth Table illustrates this point using AND.

First Test Result   AND   Second Test Result   Final Result

True                      True                 True
True                      False                False
False                     True                 False
False                     False                False

Figure 2-3

When using AND, different columns must be used because a single column can never contain more than a single data value.

Therefore, it does not make good sense to issue the next SELECT using an AND on the same column because no rows will ever be returned.
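SELECT Student_ID, Last_Name, First_Name, Grade_Pt
FROM Student_Table
WHERE Grade_Pt = 3.0 AND Grade_Pt = 4.0 ;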


No rows found

The above SELECT will never return any rows. It is impossible for a column to contain more than one value. No student has a 3.0 grade average AND a 4.0 average; they might have one or the other, but never both at the same time. The AND operator indicates both must be TRUE and should never be used between two comparisons on the same column.

By substituting an OR logical operator for the previous AND, rows will now be returned.

The following is the syntax for using OR:
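SELECT Student_ID, Last_Name, First_Name, Grade_Pt
FROM Student_Table
WHERE Grade_Pt = 3.0 OR Grade_Pt = 4.0 ;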

2 Rows returned

Student_ID Last_Name First_Name Grade_Pt

234121   Thomas     Wendy    4.00
123250   Phillips   Martin   3.00

The OR signifies that only one of the comparisons on each side of the OR needs to be true for the entire test to result in a true and the row to be selected.

This Truth Table illustrates the results for the OR:

First Test Result   OR   Second Test Result   Final Result

True                     True                 True
True                     False                True
False                    True                 True
False                    False                False

Figure 2-4

When using the OR, the same column or different column names may be used. In this case, it makes sense to use the same column because a row is returned when a column contains either of the specified values as opposed to both values as seen with AND.

It is perfectly legal and common practice to combine the AND with the OR in a single SELECT statement.

The next SELECT contains both an AND as well as an OR:
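SELECT Student_ID, Last_Name, First_Name, Class_Code, Grade_Pt
FROM Student_Table
WHERE Grade_Pt = 3.0 OR Grade_Pt = 4.0 AND Class_Code = 'FR' ;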

2 Rows returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

234121   Thomas     Wendy    FR   4.00
123250   Phillips   Martin   SR   3.00

At first glance, it appears that the comparison worked correctly. However, upon closer evaluation it is incorrect because Phillips is a senior and not a freshman.

When mixing AND with OR in the same WHERE clause, it is important to know that the AND is evaluated first. The previous SELECT actually returns all rows with a grade point of 3.0. Hence, Phillips was returned. The second comparison returned Thomas with a grade point of 4.0 and a class code of ‘FR’.

When it is necessary for the OR to be evaluated before the AND, the use of parentheses changes the priority of evaluation. A different result is seen when doing the OR first. Here is how the statement should be written:
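SELECT Last_Name, Class_Code, Grade_Pt
FROM Student_Table
WHERE (Grade_Pt = 3.0 OR Grade_Pt = 4.0) AND Class_Code = 'FR' ;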


1 Row returned

Last_Name Class_Code Grade_Pt

Thomas FR 4.00

Now, only Thomas is returned and the output is correct.


Impact of NULL on Compound Comparisons

NULL is an SQL reserved word. It represents missing or unknown data in a column. Since NULL is an unknown value, a normal comparison cannot be used to determine whether it is true or false. All comparisons of any value to a NULL result in an unknown; it is neither true nor false. The only valid test for a null uses the keyword NULL without the normal comparison symbols and is explained in this chapter.

When a table is created in Teradata, the default for a column is for it to allow a NULL value to be stored. So, unless the default is over-ridden and NULL values are not allowed, it is a good idea to understand how they work.

A SHOW TABLE command (chapter 3) can be used to determine whether a NULL is allowed. If the column contains a NOT NULL constraint, you need not be concerned about the presence of a NULL because it is disallowed.

This AND Truth Table must now be used for compound tests when NULL values are allowed:

First Test Result   AND   Second Test Result   Final Result

True                      Unknown              Unknown
Unknown                   True                 Unknown
False                     Unknown              False
Unknown                   False                False
Unknown                   Unknown              Unknown

Figure 2-5

This OR Truth Table must now be used for compound tests when NULL values are allowed:

First Test Result   OR   Second Test Result   Final Result

True                     Unknown              True
Unknown                  True                 True
False                    Unknown              Unknown
Unknown                  False                Unknown
Unknown                  Unknown              Unknown


Figure 2-6

For most comparisons, an unknown (null) is functionally equivalent to a false because it is not a true. Therefore, when using any comparison symbol a row is not returned when it contains a NULL.

At the same time, the next SELECT does not return Johnson because all comparisons against a NULL are unknown:
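SELECT *
FROM Student_Table
WHERE Grade_Pt = NULL ;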

No rows found

V2R5: *** Failure 3731 The user must use IS NULL or IS NOT NULL to test for NULL values.

As seen in the above Truth tables, a comparison test cannot be used to find a NULL.

To find a NULL, it becomes necessary to make a slight change in the syntax of the conditional comparison. The coding necessary to find a NULL is seen in the next section.


Using NOT in SQL Comparisons

It can be fairly straightforward to request exactly which rows are needed. However, sometimes rows are needed that contain any value other than a specific value. When this is the case, it might be easier to write the SELECT to find what is not needed instead of what is needed. Then convert it to return everything else. This might be the situation when there are 100 potential values stored in the database table and 99 of them are needed. So, it is easier to eliminate the one value than it is to specifically list the desired 99 different values individually.

Either of the next two SELECT formats can be used to accomplish the elimination of the one value:
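SELECT <column-name>
FROM <table-name>
WHERE <column-name> NOT= <data-value> ;

SELECT <column-name>
FROM <table-name>
WHERE NOT ( <column-name> = <data-value> ) ;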

This second version of the SELECT is normally used when compound conditions are required. This is because it is usually easier to code the SELECT to get what is not wanted and then to enclose the entire set of comparisons in parentheses and put one NOT in front of it. Otherwise, with a single comparison, it is easier to put NOT in front of the comparison operator without requiring the use of parentheses.

The next SELECT uses the NOT with an AND comparison to display seniors and lower classmen with grade points less than 3.0:
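A hedged reconstruction consistent with the six rows shown below (the exact original statement was a figure):

SELECT last_name, first_name, class_code, grade_pt
FROM student_table
WHERE NOT (class_code <> 'SR' AND grade_pt >= 3.0);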


6 Rows returned

Last_Name First_Name Class_Code Grade_Pt

McRoberts Richard JR 1.90
Hanson Henry FR 2.88
Delaney Danny SR 3.35
Larkins Michael FR 0.00
Phillips Martin SR 3.00
Smith Andy SO 2.00

Without the above technique of a single NOT, it is necessary to change every individual comparison. The following SELECT shows this approach; notice the other change necessary below: NOT AND becomes an OR:
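A hedged reconstruction of the converted statement:

SELECT last_name, first_name, class_code, grade_pt
FROM student_table
WHERE class_code = 'SR' OR grade_pt < 3.0;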

When the NOT is distributed, each comparison must be reversed: >= becomes < (not less than and not equal) and <> becomes = (not not-equal). It returns the same 6 rows, but also notice that the AND is now an OR:

6 Rows returned

Last_Name First_Name Class_Code Grade_Pt

McRoberts Richard JR 1.90
Hanson Henry FR 2.88
Delaney Danny SR 3.35
Phillips Martin SR 3.00
Larkins Michael FR 0.00
Smith Andy SO 2.00

Chart of individual conditions and NOT:

Condition   Opposite condition   NOT condition

>=    <     NOT >=
<>    =     NOT <>
AND   OR    OR
OR    AND   AND

Figure 2-7

To maintain the integrity of the statement, all portions of the WHERE must be changed, including AND, as well as OR. The following two SELECT statements illustrate the same concept when using an OR:
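A hypothetical pair illustrating the concept (the originals were figures); both forms return only Hanson:

SELECT last_name
FROM student_table
WHERE NOT (grade_pt <= 2.5 OR grade_pt >= 3.0);

SELECT last_name
FROM student_table
WHERE grade_pt > 2.5 AND grade_pt < 3.0;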

1 Row returned

Last_Name

Hanson

In the earlier Truth table, the NULL value returned an unknown when checked with a comparison operator. When looking for specific conditions, an unknown is functionally equivalent to a false, but it really is an unknown.

These two Truth tables can be used together as a tool when mixing AND and OR together in the WHERE clause along with NOT.

This Truth Table helps to gauge returned rows when using NOT with AND:

First Test Result        AND   Second Test Result        Result

NOT(True) = False        AND   NOT(Unknown) = Unknown    False
NOT(Unknown) = Unknown   AND   NOT(True) = False         False
NOT(False) = True        AND   NOT(Unknown) = Unknown    Unknown
NOT(Unknown) = Unknown   AND   NOT(False) = True         Unknown
NOT(Unknown) = Unknown   AND   NOT(Unknown) = Unknown    Unknown

Figure 2-8


This Truth Table can be used to gauge returned rows when using NOT with OR:

First Test Result        OR    Second Test Result        Result

NOT(True) = False        OR    NOT(Unknown) = Unknown    Unknown
NOT(Unknown) = Unknown   OR    NOT(True) = False         Unknown
NOT(False) = True        OR    NOT(Unknown) = Unknown    True
NOT(Unknown) = Unknown   OR    NOT(False) = True         True
NOT(Unknown) = Unknown   OR    NOT(Unknown) = Unknown    Unknown

Figure 2-9

There is an issue associated with using NOT. The NOT of a true condition is a false, and the NOT of a false is a true. However, the NOT of an unknown is still an unknown. Whenever a NULL appears in the data for any of the columns being compared, the row is never returned and the answer set may not be what is expected.

Another area where care must be taken is when NULL values are allowed in one or both of the columns. As mentioned earlier, previous versions of Teradata had no concept of "unknown": if a comparison was not true, it was false. With the emphasis on ANSI compatibility, the unknown was introduced.

If NULL values are allowed and there is potential for a NULL to impact the final outcome of compound tests, additional tests are required to eliminate them. One way to eliminate this concern is to never allow a NULL value in any column. However, this may not be appropriate, and it can require more storage space because a NULL can be compressed. Therefore, when a NULL is allowed, the SQL simply needs to check for it.

Therefore, using the expression IS NOT NULL is a good technique when NULL is allowed in a column and the NOT is used with a single or a compound comparison. This does require another comparison and could be written as:
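A hedged reconstruction consistent with the seven rows below; adding the IS NOT NULL test inside the NOT brings back the row containing NULL values:

SELECT last_name, first_name, class_code, grade_pt
FROM student_table
WHERE NOT (class_code <> 'SR' AND grade_pt >= 3.0
           AND grade_pt IS NOT NULL);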


7 Rows returned

Last_Name First_Name Class_Code Grade_Pt

Larkins Michael FR 0.00
Hanson Henry FR 2.88
McRoberts Richard JR 1.90
Johnson Stanley ? ?
Delaney Danny SR 3.35
Phillips Martin SR 3.00
Smith Andy SO 2.00

Notice that Johnson came back this time and did not appear previously because of the NULL values.

Later in this book, the COALESCE will be explored as another way to eliminate NULL values directly in the SQL instead of in the database.

Multiple Value Search (IN)

Previously, it was shown that adding a WHERE clause to the SELECT limits the returned rows to those that meet the criteria. The IN comparison is an alternative to using one or more OR comparisons on the same column in the WHERE clause of a SELECT statement, and it also makes the request a bit easier to code:

The value list normally consists of multiple values separated by commas. When the value in the column being compared matches one of the values in the list, the row is returned.

The following is an example for the alternative method when any one of the conditions is enough to satisfy the request using IN:
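A hedged reconstruction, assuming the grade-point values used earlier:

SELECT last_name, class_code, grade_pt
FROM student_table
WHERE grade_pt IN (2.0, 3.0, 4.0);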

3 Rows returned

Last_Name Class_Code Grade_Pt

Phillips SR 3.00
Thomas FR 4.00
Smith SO 2.00

The use of multiple conditional checks as well as the IN can be used in the same SELECT request. Considerations include the use of AND for declaring that multiple conditions must all be true. Earlier, we saw the solution using a compound OR.


Using NOT IN

As seen earlier, sometimes the unwanted values are not known or it is easier to eliminate a few values than to specify all the values needed. When this is the case, it is a common practice to use the NOT IN as coded below.

The next statement eliminates the rows that match and return those that do not match:
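A hedged reconstruction of the NOT IN statement:

SELECT last_name, grade_pt
FROM student_table
WHERE grade_pt NOT IN (2.0, 3.0, 4.0);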

6 Rows returned

Last_Name Grade_Pt

McRoberts 1.90
Hanson 2.88
Wilson 3.80
Delaney 3.35
Larkins 0.00
Bond 3.95

The following SELECT is a better way to make sure that all rows are returned when using a NOT IN:
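A sketch of the safer form, adding an IS NULL test so the row containing a NULL is also returned (hypothetical reconstruction):

SELECT last_name, class_code, grade_pt
FROM student_table
WHERE grade_pt NOT IN (2.0, 3.0, 4.0)
   OR grade_pt IS NULL;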

7 Rows returned

Last_Name Class_Code Grade_Pt

Larkins FR 0.00
Hanson FR 2.88
McRoberts JR 1.90
Johnson ? ?
Wilson SO 3.80
Delaney SR 3.35
Bond JR 3.95

Notice that Johnson came back in this list but not in the previous request using the NOT IN. You may be thinking that placing the NULL reserved word in the IN list will cover the situation. Unfortunately, that comparison always returns an unknown. Therefore, the next request will NEVER return any rows:
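A sketch of the mistake (hypothetical reconstruction):

SELECT last_name, grade_pt
FROM student_table
WHERE grade_pt NOT IN (2.0, 3.0, 4.0, NULL);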

No Rows found

Making this mistake will cause no rows to ever be returned. This is because every time the column is compared against the value list the NULL is an unknown and the Truth table shows that the NOT of an unknown is always an unknown for all rows.

If you are not sure about this, do an EXPLAIN (chapter 3) of the NOT IN with a subquery to see that the AMP step is actually skipped when a NULL exists in the list. There are also extra AMP steps to compensate for this condition, which makes the SQL VERY inefficient.

Using Quantifiers Versus IN

There is another alternative to using the IN. Quantifiers can be used to allow normal comparison operators without requiring compound conditional checks.

The following is equivalent to an IN:

This next request uses ANY instead of IN:
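A hedged reconstruction using the = ANY quantifier:

SELECT last_name, class_code, grade_pt
FROM student_table
WHERE grade_pt = ANY (2.0, 3.0, 4.0);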

3 Rows returned

Last_Name Class_Code Grade_Pt

Phillips SR 3.00
Thomas FR 4.00
Smith SO 2.00

Using a quantifier, the equivalent to a NOT IN is:

Notice that like adding a NOT to the compound condition, all elements need to be changed here as well. To reverse the = ANY, it becomes NOT = ALL. This is important, because the NOT = ANY selects all the rows except those containing a NULL. The reason is that as soon as a value is not equal to any one of the values in the list, it is returned.


The following SELECT is converted from an earlier NOT IN:
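A hedged reconstruction of the converted statement:

SELECT last_name, grade_pt
FROM student_table
WHERE grade_pt NOT = ALL (2.0, 3.0, 4.0);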

6 Rows returned

Last_Name Grade_Pt

McRoberts 1.90
Larkins 0.00
Hanson 2.88
Wilson 3.80
Delaney 3.35
Bond 3.95

Multiple Value Range Search (BETWEEN)

The BETWEEN comparison can be used as another technique to request multiple values for a column that all fall within a specific range. It is easier than writing a compound OR comparison or a long value list of sequential numbers with the IN.

This is a good time to point out that this chapter is incrementally adding new ways to compare for values within a WHERE clause. However, all of these techniques can be used together in a single WHERE clause. One method does not eliminate the ability to use one or more of the others using logical operators between each comparison.

The next SELECT shows the syntax format for using the BETWEEN:
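A sketch of the general format (the original figure is not reproduced):

SELECT <column-list>
FROM <table-name>
WHERE <column-name> BETWEEN <low-value> AND <high-value>;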

The first and second values specified are inclusive for the purposes of the search. In other words, when these values are found in the data, the rows are included in the output.

As an example, the following returns all students whose grade points are 2.0, 4.0, or any value between them:
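A hedged reconstruction:

SELECT grade_pt
FROM student_table
WHERE grade_pt BETWEEN 2.0 AND 4.0;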

7 Rows returned

Grade_Pt

3.00
2.88
4.00
3.80
3.95
3.35
2.00

Notice that due to the inclusive nature of the BETWEEN, both 2.0 and 4.0 were included in the answer set. The first value of the BETWEEN must be the lower value, otherwise, no rows will be returned. This is because it looks for all values that are greater or equal to the first value and less than or equal to the second value.

A BETWEEN can also be used to search for character values. When doing this, care must be taken to ensure that the rows received are the rows needed. The system can only compare character values of the same length. So, if one column or value is shorter than the other, the shorter is automatically padded with spaces to the length of the longer value.

Comparing 'CA' to 'CALIFORNIA' never constitutes a match. In reality, the database is comparing 'CA        ' with 'CALIFORNIA' and they are not equal. Sometimes it is easier to use the LIKE comparison operator, which is covered in the next section. However, easier to code does not always mean faster to execute; there is always a trade-off to consider.

The next SELECT finds all of the students whose last name starts with an L:
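A sketch of one possibility (hypothetical; as the text below notes, other ranges would also work):

SELECT last_name
FROM student_table
WHERE last_name BETWEEN 'L' AND 'LZ';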

1 Row returned

Last_Name

Larkins

In reality, the WHERE could have used BETWEEN 'L' AND 'M' as long as no student's last name was exactly 'M'. The data needs to be understood when using BETWEEN for character comparisons.

Character String Search (LIKE)

The LIKE is used exclusively to search for character data strings. The major difference between the LIKE and the BETWEEN is that the BETWEEN looks for specific values within a range, while the LIKE is normally used to look for a string of characters within a column. Also, the LIKE has the capability to use "wildcard" characters.

The wildcard characters are:

Wildcard symbol What it does

_ (underscore) matches any single character, but a character must be present

% (percent sign) matches any number of characters: a single character, a series of characters, or the absence of characters

Figure 2-10

The next SELECT finds all rows that have a character string that begins with ‘Sm’:
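A hedged reconstruction:

SELECT student_id, last_name, first_name, class_code, grade_pt
FROM student_table
WHERE last_name LIKE 'Sm%';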

1 Row returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

333450 Smith Andy SO 2.00

The fact that the 'S' is in the first position of the pattern dictates its location in the data. Therefore, the 'm' must be in the second position. Then, the '%' indicates that any number of characters (including none) may be in the third and subsequent positions. So, if the WHERE clause contained LIKE '%sm', it only looks for strings that end in "sm". On the other hand, if it were written as LIKE '%sm%', all character strings containing "sm" anywhere are returned. Also, remember that in Teradata mode, the database is not case sensitive. However, in ANSI mode, the case of the letters must match exactly, and the previous request must be written as 'Sm%' to obtain the same result. Care should be taken regarding case when working in ANSI mode; otherwise, case does not matter.


The ‘_’ wildcard can be used to force a search to a specific location in the character string. Anything in that position is considered a match. However, a character must be in that position.

The following SELECT uses a LIKE to find all last names with an "a" in the second position of the last name:
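A hedged reconstruction:

SELECT student_id, last_name, first_name, class_code, grade_pt
FROM student_table
WHERE last_name LIKE '_a%';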

2 Rows returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

423400 Larkins Michael FR 0.00

125634 Hanson Henry FR 2.88

In the above example, the “_” allows any character in the first position, but requires a character to be there.

The keywords ALL, ANY, or SOME can be used to further define the values being searched. They are the same quantifiers used with the IN. Here, the quantifiers are used to extend the flexibility of the LIKE clause.

Normally, the LIKE will look for a single set of characters within the data. Sometimes, that is not sufficient for the task at hand. There will be times when the characters to search are not consecutive, nor are they in the same sequence.

The next SELECT returns rows with both an ‘s’ and an ‘m’ because of the ALL.
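A sketch using the ALL quantifier with LIKE (hypothetical reconstruction):

SELECT student_id, last_name, first_name, class_code, grade_pt
FROM student_table
WHERE last_name LIKE ALL ('%s%', '%m%');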

3 Rows returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

280023 McRoberts Richard JR 1.90


234121 Thomas Wendy FR 4.00

333450 Smith Andy SO 2.00

It does not matter if the ‘s’ appears first or the ‘m’ appears first, as long as both are contained in the string.

Below, ANSI is case sensitive and only 1 row returns due to the fact that the ‘S’ is uppercase, so Thomas and McRoberts are not returned:

1 Row returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

333450 Smith Andy SO 2.00

If, in the above statement, the ALL quantifier is changed to ANY (ANSI standard) or SOME (Teradata extension), then a character string containing either of the characters, ‘s’ or ‘m’, in either order is returned. It uses the OR comparison.

This next SELECT returns any row where the last name contains either an ‘s’ or an ‘m’:
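A sketch using ANY (hypothetical reconstruction):

SELECT student_id, last_name, first_name, class_code, grade_pt
FROM student_table
WHERE last_name LIKE ANY ('%s%', '%m%');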

8 Rows returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

423400 Larkins Michael FR 0.00

125634 Hanson Henry FR 2.88

280023 McRoberts Richard JR 1.90

260000 Johnson Stanley ? ?


231222 Wilson Susie SO 3.80

234121 Thomas Wendy FR 4.00

333450 Smith Andy SO 2.00

123250 Phillips Martin SR 3.00

Always be aware of the issue regarding case sensitivity when using ANSI Mode. It will normally affect the number of rows returned and usually reduces the number of rows.

There is a specialty operation that can be performed in conjunction with the LIKE. Since the search uses the “_” and the “%” as wildcard characters, how can you search for actual data that contains a “_” or “%” in the data?

Now that we know how to use the wildcard characters, there is a way to take away their special meaning and literally match an actual '_' or '%'. That is the purpose of ESCAPE. It tells the PE not to treat the character as a wildcard, but instead to match the actual '_' or '%' in the data.

The next SELECT uses the ESCAPE to find all table names that have a “_” in the 8th position of the name from the Data Dictionary.
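A hedged sketch; the choice of the backslash as the escape character is an assumption (any single character can be designated):

SELECT TableName
FROM DBC.Tables
WHERE TableName LIKE '_______\_%' ESCAPE '\';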

2 Rows returned

Tablename

Student_Table

Student_Course_Table

In the above output, the only thing that matters is the '_' in position eight, because the first seven '_' characters are still wildcards.

Derived Columns

The majority of the time, columns in the SELECT statement exist within a database table. However, sometimes it is more advantageous to calculate a value than to store it.

An example might be the salary. In the employee table, we store the annual salary. However, a request comes in asking to display the monthly salary. Does the table need to be changed to create a column for storing the monthly salary? Must we go through and update all of the rows (one per employee) and store the monthly salary into the new column just so we can select it for display?

The answer is no, we do not need to do any of this. Instead of storing the monthly salary, we can calculate it from the annual salary using division. If the annual salary is divided by 12 (months per year), we “derive” the monthly salary using mathematics.

Chart of ANSI operators for math operations:

Operator   Operation performed

( )   parentheses (all math operations in parentheses are done first)
**    exponentiation (10**12 derives 1,000,000,000,000 or 1 trillion)
*     multiplication (10*12 derives 120)
/     division (10/12 derives 0; both are integers, so truncation of the decimal occurs)
+     addition (10+12 derives 22)
-     subtraction (10-12 derives -2, since 12 is greater than 10 and negative values are allowed)

Figure 2-11

These math operators have a priority associated with their order of execution when mixed in the same formula. The sequence is basically the same as their order in the chart. All exponentiation is performed first. Then, all multiplication and division is performed, and lastly, all addition and subtraction is done. Whenever two different operators are at the same priority, like addition and subtraction, they are performed based on their appearance in the equation from left to right.

Although the above is the default priority, it can be over-ridden within the SQL. Normally an equation like 2+4*5 yields 22 as the answer. This is because the 4*5 = 20 is done first and then the 2 is added to it.


However, if it is written as (2+4)*5, the answer becomes 30 (2+4=6, then 6*5=30).

The following SELECT shows these and the results of an assortment of mathematics:
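A hedged reconstruction of the statement:

SELECT 2+4*5, (2+4)*5, 2+4/5, (2+4)/5, 2+4.0/5, (2+4.0)/5, 10**9;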

1 Row Returned

2+4*5  (2+4)*5  2+4/5  (2+4)/5  2+4.0/5  (2+4.0)/5  10**9

22     30       2      1        2.8      1.2        1000000000

Note: starting with integer values, as in the above, the answer is an integer. If decimals are used, the result is a decimal answer. Otherwise, a conversion can be used to change the characteristics of the data before it is used in any calculation. Adding the decimal makes a difference in the precision of the final answer. So, if the SQL is not providing the answer expected from the data, convert the data first (see the CAST function later in this book).

The next SELECT shows how the SQL can be written to implement the earlier example with annual and monthly salaries:
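A minimal sketch, assuming an employee table with an annual salary column (the table and column names are assumptions):

SELECT salary, salary/12
FROM employee_table;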

2 Rows returned

salary salary/12

48,024.00 4,002.00
10,800.00 900.00

Since the column name is the default column heading, the derived column is called salary/12, which is probably not what we wish to see there. The next section covers the use of an alias to temporarily rename a column for the life of the SQL.

Derived data can be used in the WHERE clause as well as the SELECT. The following SQL will only return the columns when the monthly salary is greater than $1,000.00:
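A hedged sketch:

SELECT salary, salary/12
FROM employee_table
WHERE salary/12 > 1000;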


1 Row returned

salary salary/12

48,024.00 4,002.00

Teradata contains several functions that allow a user to derive data for business and engineering. This is a chart of those Teradata arithmetic, trigonometric and hyperbolic math functions:

Operator Operation performed

MOD x   Modulo returns the remainder from a division (1 MOD 2 derives 1: 2 goes into 1 zero times with a remainder of 1. Likewise, 2 MOD 10 derives 2: 10 goes into 2 zero times with a remainder of 2). MOD x always returns 0 through x-1. As such, MOD 2 returns 0 for even numbers and 1 for odd; MOD 7 can be used to determine the day of the week; and MOD 10, MOD 100, MOD 1000, etc. isolate the last one, two, or three digits of a number.

ABS(x) Absolute value, the absolute value of a negative number is the same number as a positive x. (ABS(10-12) = 2)

EXP(x) Exponentiation, e raised to a power, ( EXP(10) derives 2.20264657948067E004 )

LOG(x) Logarithm calculus function, ( LOG(10) derives the value 1.0000000000000E000 )

LN(x) Natural logarithm, ( LN(10) derives the value 2.30258509299405E000 )

SQRT(x) Square root, ( SQRT(10) derives the value 3.16227766016838E000)

COS(x) Takes an angle in radians (x) and returns the ratio of two sides of a right triangle. The ratio is the length of the side adjacent to the angle divided by the length of the hypotenuse. The result lies in the range -1 to 1, inclusive where x is any valid number expression that expresses an angle in radians.

SIN(x)   Takes an angle in radians (x) and returns the ratio of two sides of a right triangle. The ratio is the length of the side opposite the angle divided by the length of the hypotenuse. The result lies in the range -1 to 1, inclusive, where x is any valid number expression that expresses an angle in radians.

TAN(x)   Takes an angle in radians (x) and returns the ratio of two sides of a right triangle. The ratio is the length of the side opposite the angle divided by the length of the side adjacent to the angle, where x is any valid number expression that expresses an angle in radians.

Chart of Teradata arithmetic, trigonometric and hyperbolic math functions (continued)

Operator Operation performed

ACOS(x)   Returns the arccosine of x. The arccosine is the angle whose cosine is x, where x is the cosine of the returned angle. The values of x must be between -1 and 1, inclusive. The returned angle is in the range 0 to π radians, inclusive.

ASIN(x)   Returns the arcsine of (x). The arcsine is the angle whose sine is x, where x is the sine of the returned angle. The values of x must be between -1 and 1, inclusive. The returned angle is in the range -π/2 to π/2 radians, inclusive.

ATAN(x)   Returns the arctangent of (x). The arctangent is the angle whose tangent is x. The returned angle is in the range -π/2 to π/2 radians, inclusive.

ATAN2(x,y)   Returns the arctangent of the specified (x,y) coordinates. The arctangent is the angle from the x-axis to a line containing the origin (0,0) and the point with coordinates (x,y).

The returned angle is between -π and π radians, excluding -π. A positive result represents a counterclockwise angle from the x-axis, while a negative result represents a clockwise angle. ATAN2(x,y) equals ATAN(y/x), except that x can be 0 in ATAN2(x,y) whereas x cannot be 0 in ATAN(y/x), since that results in a divide-by-zero error. If both x and y are 0, an error is returned.

COSH(x) Returns the hyperbolic cosine of (x) where x is any real number.

SINH(x) Returns the hyperbolic sine of (x) where x is any real number.


TANH(x)   Returns the hyperbolic tangent of (x), where x is any real number.

ACOSH(x)   Returns the inverse hyperbolic cosine of (x). The inverse hyperbolic cosine is the value whose hyperbolic cosine is x, where x is any real number greater than or equal to 1.

ASINH(x)   Returns the inverse hyperbolic sine of (x). The inverse hyperbolic sine is the value whose hyperbolic sine is x, where x is any real number.

ATANH(x)   Returns the inverse hyperbolic tangent of (x). The inverse hyperbolic tangent is the value whose hyperbolic tangent is x, where x is any real number between -1 and 1, excluding -1 and 1.

Figure 2-12

Some of these functions are demonstrated below and throughout this book. Here they are also using alias names for the columns. Their application will be specific to the type of application being written. It is not the intent of this book to teach the meaning and use in engineering and trigonometry, but more to educate regarding their existence.

Creating a Column Alias Name

Since the name of the selected column or derived-data formula appears as the heading for the column, it can make for strange-looking results. To make the output look better, it is a good idea to use an alias to dress up the heading name used in the output. Besides making the output look better, an alias also makes the SQL easier to write because the new column name can be used anywhere in the SQL statement.

AS

Compliance: ANSI

The previous SELECT used salary/12, which is probably not what we wish to see in the heading. Therefore, it is preferable to alias the column within the execution of the SQL. This means that a temporary name is assigned to the selected column for use only in this statement.

To alias a column, use an AS and any legal Teradata name after the real column name requested or math formula using the following technique:
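A hedged sketch:

SELECT salary AS Annual_Salary,
       salary/12 AS Monthly_Salary
FROM employee_table;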

2 Rows returned

Annual_salary Monthly_salary

48024.00 4002.00

10800.00 900.00

Once the alias name has been assigned, it is literally the name of the column for the life of the SQL statement.

The next request is a valid example of using of the alias in the WHERE clause:
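A hedged sketch using the alias in the WHERE clause:

SELECT salary AS Annual_Salary,
       salary/12 AS Monthly_Salary
FROM employee_table
WHERE Monthly_Salary > 1000;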


1 Row returned

annual_salary monthly_salary

$48,024.00 $4,002.00

The math functions are very helpful for calculating and evaluating characteristics of the data. The following examples incorporate most of the functions to demonstrate their operational functionality.

The next SELECT uses literals and aliases to show the data being input and results for each of the most common business applicable operations:
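A hypothetical reconstruction consistent with the output that follows (the literal values and alias names are assumptions):

SELECT 204/100     AS Div200,
       204 MOD 100 AS Last2,
       204 MOD 2   AS Even,
       205 MOD 2   AS Odd,
       ABS(1)      AS WasPositive,
       ABS(-1)     AS PositiveNow,
       SQRT(4)     AS SqRoot;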

1 Row returned

Div200  Last2  Even  Odd  WasPositive  PositiveNow  SqRoot

2       4      0     1    1            1            2.00

The output of the SELECT shows some interesting results. The division is easy; we learned that in elementary school. The first MOD 100 results in 4, because the result of the division is 2 with a remainder of 4 (204 - 200 = 4). A MOD 100 can result in any value between 0 and 99; in effect, it isolates the last two digits of a number. On the other hand, MOD 2 is always 0 for even numbers and 1 for odd numbers. The ABS always returns the positive value of any number, and lastly, 2 is the square root of 4.

Many of these will be incorporated into SQL throughout this book to demonstrate additional business applications.

NAMED

Compliance: Teradata Extension

Prior to the AS becoming the ANSI standard, Teradata used NAMED as the keyword to establish an alias. Although both currently work, it is strongly suggested that AS be used for compatibility. Also, as hard as it is to believe, I have heard that NAMED may not work in future releases.

The following is the same SELECT as seen earlier, but here it uses the NAMED instead of the AS:
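A hedged sketch of the NAMED syntax, which places the keyword in parentheses after the column:

SELECT salary (NAMED Annual_Salary),
       salary/12 (NAMED Monthly_Salary)
FROM employee_table;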

2 Rows returned

Annual_salary Monthly_salary

48024.00 4002.00
10800.00 900.00

Naming conventions

When creating an alias only valid Teradata naming characters are allowed. The alias becomes the name of the column for the life of the SQL statement. The only difference is that it is not stored in the Data Dictionary.

The charts below list the valid characters and the rules to follow when ANSI compliance is desired (on the left), alongside the more flexible Teradata allowable characters, extended character sets, and rules (on the right).

Chart of Valid Characters for ANSI and Teradata:

ANSI Characters Allowed                Teradata Characters Allowed
(up to 18 in a single name)            (up to 30 in a single name)

A through Z                            A through Z and a through z
0 through 9                            0 through 9
_ (underscore / underline)             _ (underscore / underline)
                                       # (octothorpe / pound sign / number sign)
                                       $ (dollar sign / currency sign)

Figure 2-13

Chart of ANSI and Teradata Naming Conventions

ANSI Rules for column names              Teradata Rules for column names

Must be entirely in upper case           Can be all upper, all lower, or a mixture of case using any of these characters
Must start with A through Z              Can start with any valid character
Must not end with an underscore ( _ )    Can end with any valid character

Figure 2-14

Teradata uses all of the ANSI characters as well as the additional ones listed in the above charts.

Breaking Conventions

It is not recommended to break these conventions. However, sometimes it is necessary or desirable to use non-standard characters in a name. Also, sometimes words have been used as table or column names and then in a later release, the name becomes a reserved word. There needs to be a technique to assist you when either of these requirements becomes necessary.

The technique uses double quotes (") around the name. This tells the PE that the word is not a reserved word and makes it a valid name. This is the only place that Teradata uses double quotes instead of single quotes (').

As an example, the previous SELECT has been modified to use double quotes (“) instead of NAMED:
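A hedged sketch using double-quoted names; the embedded space in "Annual Salary" matches the note after the output:

SELECT salary AS "Annual Salary",
       salary/12 AS "Monthlysalary"
FROM employee_table
ORDER BY "Annual Salary";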

2 Rows returned

Annual Salary Monthlysalary

10800.00 900.00
48024.00 4002.00

Although it is not obvious due to the underlining, the column heading for the first column is Annual Salary, including the space. A space is not a valid naming character, but this column name is valid because of the double quotes. This can be seen in the ORDER BY, where the quoted column name is used. The next section provides more details on the use of ORDER BY.

ORDER BY

The Teradata AMPs generally bring data back randomly unless the user specifies a sort. The addition of the ORDER BY requests a sort operation. The sort arranges the rows returned in ascending sequence unless descending is specifically requested. One or more columns may be used for the sort operation. The first column listed is the major sort key; any subsequent columns are minor sort keys, in order of their appearance in the list. The syntax for using an ORDER BY:

In Teradata, if the sequence of the rows being displayed is important, then an ORDER BY should be used in the SELECT. Many other databases store their data sequentially by the value of the primary key. As a result, the data will appear in sequence when it is returned. To be faster, Teradata stores it differently.

Teradata organizes data rows in ascending sequence on disk based on a row ID value, not the data value. This is the same value that is calculated to determine which AMP should be responsible for storing and retrieving each data row.

When the ORDER BY is not used, the data will appear vaguely in row hash sequence and is not predictable. Therefore, it is recommended to use the ORDER BY in a SELECT or the data will come back randomly. Remember, everything in Teradata is done in parallel, this includes the sorting process.

The next SELECT retrieves all columns and sorts by the Grade point average:
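A hedged reconstruction consistent with the four rows shown below:

SELECT *
FROM student_table
WHERE grade_pt > 3.0
ORDER BY grade_pt;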


4 Rows returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

324652 Delaney Danny SR 3.35
231222 Wilson Susie SO 3.80
322133 Bond Jimmy JR 3.95
234121 Thomas Wendy FR 4.00

Notice that the default sequence for the ORDER BY is ascending (ASC), lowest value to highest. This can be overridden using DESC to indicate a descending sequence, as shown in the following SELECT:
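A hedged reconstruction:

SELECT *
FROM student_table
WHERE grade_pt > 3.0
ORDER BY grade_pt DESC;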

4 Rows returned

Student_ID Last_Name First_Name Class_Code Grade_Pt

234121 Thomas Wendy FR 4.00
322133 Bond Jimmy JR 3.95
231222 Wilson Susie SO 3.80
324652 Delaney Danny SR 3.35

As an alternative to using the column name in an ORDER BY, a number can be used. The number reflects the column’s position in the SELECT list. The above SELECT could also be written this way to obtain the same result:
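A hedged reconstruction using the column position instead of the name:

SELECT *
FROM student_table
WHERE grade_pt > 3.0
ORDER BY 5 DESC;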

In this case, the grade point column is the fifth column in the table definition, and the SELECT uses * for all columns. This adds flexibility to the writing of the SELECT. However, always watch out for "ability" words like flexibility, because they bring another ability word: responsibility. When using the column number, if the column used for the sort is moved to another location in the select list, a different column is silently used for the sort. Therefore, it is important to take responsibility for changing both the list and the number in the ORDER BY.

Many times it is necessary that the value in one column needs to be sorted within the sequence of a second column. This technique is said to have a major sort column or key and one or more minor sort keys.

The first column listed in the ORDER BY is the major sort key. Likewise, the last column listed is the most minor sort key within the sequence. The minor keys are referred to as being sorted within the major sort key. Additionally, some columns can ascend while others descend.

This SELECT sorts two different columns: the last name (minor sort) ascending (ASC), within the class code (major sort) descending (DESC):
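A hedged reconstruction; note the relative column numbers 2 and 1, which are discussed after the output:

SELECT last_name, class_code, grade_pt
FROM student_table
ORDER BY 2 DESC, 1 ASC;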

10 Rows returned

Last_Name Class_Code Grade_Pt

Delaney SR 3.35
Phillips SR 3.00
Smith SO 2.00
Wilson SO 3.80
Bond JR 3.95
McRoberts JR 1.90
Hanson FR 2.88
Larkins FR 0.00
Thomas FR 4.00
Johnson ? ?

Notice, in the above statement, the use of relative column numbers instead of column names in the ORDER BY for the sort. The numbers 2 and 1 were used instead of Class_Code and Last_Name. When you select columns and then use numbers in the sort, the numbers relate to the order of the columns after the keyword SELECT. When you SELECT * (all columns in the table) then the sort numbers reflect the order of columns within the table.


An additional capability of Teradata is that a column can be used in the ORDER BY that is not selected. This is possible because the database uses a tag sort for speed and flexibility. In other words, it builds a tag area that consists of all the columns specified in the ORDER BY as well as the columns that are being selected.

This diagram shows the layout of a row in SPOOL used with an ORDER BY:

Tag column1 ... Tag columnN   AMP#   Select column1   Select column2 ... Select columnN

Figure 2-15

Although it can sort on a column that is not selected, the sequence of the output may appear to be completely random. This is because the sorted value is not seen in the display.

Additionally, within a Teradata session the user can request a Collation Sequence and a Code Set for the system to use. By requesting a Collation Sequence of EBCDIC, the sort puts the data into the proper sequence for an IBM mainframe system; EBCDIC is the automatic default collation when connecting from the mainframe.

Likewise, if a user were extracting to a UNIX computer, the normal code set is ASCII. However, if the file is transferred from UNIX to a mainframe and converted there, it is in the wrong sequence. When it is known ahead of time that the file will be used on a mainframe but extracted to a different computer, the Collation Sequence can be set to EBCDIC. Then, when the file's code set is converted, the file is in the correct sequence for the mainframe without another sort.

Like the Collation Sequence, the Code Set can also be set. So, a file can be in EBCDIC sequence and the data in ASCII or sorted in ASCII sequence with the data in EBCDIC. The final use of the file needs to be considered when making this choice.

DISTINCT Function

All of the previous operations of the SELECT returned a row from a table based on its existence in the table. As a result, if multiple rows contain the same value, they are all displayed.

Sometimes it is only necessary to see one of the values, not all. Instead of contemplating a WHERE clause to accomplish this task, the DISTINCT can be added in the SELECT to return unique values by eliminating duplicate values.

The DISTINCT appears immediately after the SELECT keyword and before the first column name, as sketched in the next request.

The next SELECT uses DISTINCT to return only one row for display when a value exists:
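A hedged reconstruction:

SELECT DISTINCT class_code
FROM student_table
ORDER BY class_code;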

5 Rows Returned

Class_code

?
FR
JR
SO
SR

There are a couple of noteworthy situations in the above output. First, although there are three freshmen, two sophomores, two juniors, two seniors, and one row without a class code, only one output row is returned for each of these values. Lastly, the NULL is considered a unique value whether one row or multiple rows contain it, so it is displayed one time.

The main considerations for using DISTINCT are that it must:


1. Appear only once
2. Apply to all columns listed in the SELECT to determine uniqueness
3. Appear before the first column name

The following SELECT uses more than one column with a DISTINCT:
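A hedged reconstruction:

SELECT DISTINCT class_code, grade_pt
FROM student_table
ORDER BY 1, 2;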

10 Rows Returned

class_code grade_pt

? ?
FR 0.00
FR 2.88
FR 4.00
JR 1.90
JR 3.95
SO 2.00
SO 3.80
SR 3.00
SR 3.35

The DISTINCT in this SELECT returned all ten rows of the table. This is due to the fact that when the class code and the grade point are combined for comparison, they are all unique. The only potential for a duplicate exists when two students in the same class have the same grade point average. Therefore, as more and more columns are listed in a SELECT with a DISTINCT, there is a greater opportunity for more rows to be returned due to a higher likelihood for unique values.

If spool space is exceeded when using DISTINCT, see chapter 5 and the use of GROUP BY versus DISTINCT for eliminating duplicate rows. It may solve the problem, and that chapter explains why.

HELP commands

The Teradata Database offers several types of help using an interactive client. For convenience, this reduces or eliminates the need to look information up in a hardcopy manual or on a CD-ROM. Therefore, using the help and show operations in this chapter can save you a large amount of time and make you more productive. Since Teradata allows you to organize database objects into a variety of locations, sometimes you need to determine where certain objects are stored, along with other detail information about them.

This chart is a list of available HELP commands on Objects:

HELP DATABASE <database-name> ; Displays the names of all the tables (T), views (V), macros (M), and triggers (G) stored in a database and user written table comments

HELP USER <user-name> ; Displays the names of all the tables (T), views (V), macros (M), and triggers (G) stored in a user area and user written table comments

HELP TABLE <table-name> ; Displays the column names, type identifier, and any user written comments on the columns within a table.

HELP VOLATILE TABLE ; Displays the names of all Volatile temporary tables active for the user session.

HELP VIEW <view-name> ; Displays the column names, type identifier, and any user written comments on the columns within a view.

HELP MACRO <macro-name> ; Displays the characteristics of parameters passed to it at execution time.

HELP PROCEDURE <procedure-name> ; Displays the characteristics of parameters passed to it at execution time.

HELP TRIGGER <trigger-name> ; Displays details created for a trigger, like action time and sequence.

HELP COLUMN <table-name>.* ;
HELP COLUMN <view-name>.* ;
HELP COLUMN <table-name>.<column-name>, ... ;

Displays detail data describing the column-level characteristics.

Figure 3-1

To see the database objects stored in a Database or User area, either of the following HELP commands may be used:

HELP DATABASE My_DB ;


Or

HELP USER My_User ;

4 Rows Returned

Table/View/Macro name   Kind   Comment

employee        T   T = Table with 1 row per employee
employee_v      V   V = View for accessing Employee Table
Employee_m1     M   M = Macro to report on Employee Table
Employee_Trig   G   G = Trigger to update Employee Table

Since Teradata considers a database and a user to be equivalent, both can store the same types of objects and therefore, the two commands produce similar output.

Now that you have seen the names of the objects in a database or user area, further investigation displays the names and the types of columns contained within the object. For tables and views, use the following commands:

HELP TABLE My_Table ;

7 Rows Returned

ColumnName Type Comment Nullable Format Title

Column1   I    This column is an integer           Y   -(10)9        ?
Column2   I2   This column is a smallint           Y   -(5)9         ?
Column3   I1   This column is a byteint            Y   -(3)9         ?
Column4   CF   This column is a fixed length       Y   X(20)         ?
Column5   CV   This column is a variable length    Y   X(20)         ?
Column6   DA   This column is a date               Y   YYYY-MM-DD    ?
Column7   D    This column is a decimal            Y   --------.99   ?

Max Length   Decimal Total Digits   Decimal Fractional Digits   Range Low   Range High

 4   ?   ?   ?   ?   N
 2   ?   ?   ?   ?   N
 1   ?   ?   ?   ?   N
20   ?   ?   ?   ?   N
20   ?   ?   ?   ?   N
 4   ?   ?   ?   ?   N
 4   9   2   ?   ?   N

UpperCase   Table/View?   Default value   CharType   IdColType

N   T   ?   ?   ?
N   T   ?   ?   ?
N   T   ?   ?   ?
N   T   ?   1   ?
N   T   ?   1   ?
N   T   ?   ?   ?
N   T   ?   ?   ?

The above output has been wrapped to multiple lines to show all the detail information available on the columns of a table.

HELP VIEW My_View ;

(notice that the vast majority of the column data is not available for a view, it comes from the table, not the SELECT that creates a view)

7 Rows Returned

ColumnName Type Comment Nullable Format Title

Column1   ?   This column is an integer          ?   ?   ?
Column2   ?   This column is a smallint          ?   ?   ?
Column3   ?   This column is a byteint           ?   ?   ?
Column4   ?   This column is a fixed length      ?   ?   ?
Column5   ?   This column is a variable length   ?   ?   ?
Column6   ?   This column is a date              ?   ?   ?
Column7   ?   This column is a decimal           ?   ?   ?

Max Length   Decimal Total Digits   Decimal Fractional Digits   Range Low   Range High

?   ?   ?   ?   ?   ?
?   ?   ?   ?   ?   ?
?   ?   ?   ?   ?   ?
?   ?   ?   ?   ?   ?
?   ?   ?   ?   ?   ?
?   ?   ?   ?   ?   ?
?   ?   ?   ?   ?   ?

UpperCase   Table/View?   Default value   CharType   IdColType

?   ?   ?   ?   ?
?   ?   ?   ?   ?
?   ?   ?   ?   ?
?   ?   ?   1   ?
?   ?   ?   1   ?
?   ?   ?   ?   ?
?   ?   ?   ?   ?

The above output is wrapped to multiple lines and displays the column name and the kind, which equates to the data type, plus any comment added to a column. Notice that a view does not know the data types of the columns from the real table. Teradata provides a COMMENT command to add these comments to tables and columns.

The following COMMENT commands add a comment to a table and a view:
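Hedged sketches of the commands (the object names and comment text are assumptions):

COMMENT ON TABLE My_Table AS 'Table with 1 row per employee' ;
COMMENT ON VIEW My_View AS 'View for accessing Employee Table' ;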

This COMMENT command adds a comment to a column:
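A hedged sketch:

COMMENT ON COLUMN My_Table.Column1 AS 'This column is an integer' ;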

The above column information is helpful for most of the column types, such as INTEGER (I), SMALLINT (I2) and DATE (DA) because the size and the value range is a constant. However, the lengths of the DECIMAL (D) and the character columns (CF, CV) are not shown here. These are the most common of the data types. See chapter 18 (DDL) for more details on data types.

The next HELP COLUMN command provides more details for all of the columns:


The output is not shown again, since it is exactly the same as the newer version of the HELP TABLE command.

The next chart shows HELP commands for information on database tables and sessions, as well as SQL and SPL commands:

Help Commands:

HELP INDEX <table-name> ; Displays the indexes and their characteristics like unique or non-unique and the column or columns involved in the index. This data is used by the Optimizer to create a plan for SQL.

HELP STATISTICS <table-name> ; Displays values associated with the data demographics collected on the table. This data is used by the Optimizer to create a plan for SQL.

HELP CONSTRAINT <table-name>.<constraint-name> ;   Displays the checks to be made on the data when it is inserted or updated and the columns involved.

HELP SESSION; Displays the user name, account name, logon date and time, current database name, collation code set and character set being used, transaction semantics, time zone and character set data.

HELP ‘SQL’; Displays a list of available SQL commands and functions.

HELP 'SQL <command>';   Displays the basic syntax and options for the actual SQL command inserted in place of the <command>.

HELP 'SPL';   Displays a list of available SPL commands.

HELP 'SPL <command>';   Displays the basic syntax and options for the actual SPL command inserted in place of the <command>.

Figure 3-2

The above chart does a pretty good job of explaining the HELP functions. These functions only provide additional information if the table object has one of these characteristics defined on it. The INDEX, STATISTICS and CONSTRAINT functions will be further discussed in the Data Definition Language Chapter (DDL) because of their relationship to the objects.


At this point in learning SQL, and in the interest of getting to other SQL functions, one of the most useful of these HELP functions is the HELP SESSION.

The following HELP returns index information on the department_table:

HELP INDEX Department_table ;

2 rows returned

Unique?   Primary or Secondary?   Column Names      Index Id   Approximate Count   Index Name   Ordered or Partitioned?

Y   P   Dept_No           1   8.00   ?   H
N   S   Department_name   4   8.00   ?   H
N   S   Mgr_No            8   6.00   ?   H

The following HELP returns information on the session from the PE:

HELP SESSION ;

1 Row Returned (columns wrapped for viewing)

User Name   Account Name   Logon Date   Logon Time   Current Database   Collation   Character Set
DBC         DBC            99/12/12     11:45:13     Personnel          ASCII       ASCII

Transaction Semantics   Current DateForm   Time Zone   Default Character Type   Export Latin
Teradata                Integerdate        00:00       LATIN                    1

Export Unicode   Export Unicode Adjust   Export KanjiSJIS   Export Graphic
1                0                       1                  0

Default Date Format   Radix Separator   Group Separator   Grouping Rule
YY/MM/DD              .                 ,                 3

Currency Radix Separator   Currency Group Separator   Currency Grouping Rule   Currency Name
.                          ,                          3                        US Dollars

Currency   ISOCurrency   Dual Currency Name   Dual Currency   Dual ISOCurrency
$          USD           US Dollars           $               USD

The above output has been wrapped for easier viewing. Normally, all headings and values are on a single line.

The current date form, time zone, and everything that follows them in the output are new with the V2R3 release of Teradata. These columns have been added to make their reference here easier than digging through the Data Dictionary using SQL.

When using a tool like BTEQ, the line is truncated. So, for easier viewing, the .SIDETITLES and .FOLDLINE commands show the output in a vertical display.

The next sequence of commands can be used within BTEQ:

.sidetitles on
.foldline on
HELP SESSION;

1 Row Returned

User Name               MIKEL
Account Name            DBC
Logon Date              00/06/25
Logon Time              01:02:52
Current DataBase        MIKEL
Collation               ASCII
Character Set           ASCII
Transaction Semantics   Teradata
Current DateForm        IntegerDate
Session Time Zone       00:00
Default Character Type  LATIN
Export Latin            1
Export Unicode          1
Export Unicode Adjust   0
Export KanjiSJIS        1
Export Graphic          0

To reset the display to the normal line, use either of the following commands:

.DEFAULTS

or

.SIDETITLES OFF
.FOLDLINE OFF

In BTEQ, any command that starts with a dot (.) does not have to end with a semi-colon (;).

The next HELP command returns a list of the available SQL commands and functions:

HELP 'SQL';

41 Rows Returned

On-Line Help

DBS SQL COMMANDS:    

ABORT               ALTER TABLE       BEGIN LOGGING
BEGIN TRANSACTION   CHECKPOINT        COLLECT STATISTICS
COMMIT              COMMENT           CREATE DATABASE
CREATE INDEX        CREATE MACRO      CREATE TABLE
CREATE USER         CREATE VIEW       DATABASE
DELETE              DELETE DATABASE   DELETE USER
DROP DATABASE       DROP INDEX        DROP MACRO
DROP TABLE          DROP VIEW         DROP STATISTICS
ECHO                END LOGGING       END TRANSACTION
.


DBS SQL FUNCTIONS:    

ABS          ADD_MONTHS       AVERAGE
CHARACTERS   CAST             CHAR2HEXINT
COUNT        CORR             COVAR_POP
CSUM         EXP              EXTRACT
FORMAT       INDEX            HASHAMP
HASHBKAMP    HASHBUCKET       HASHROW
KURTOSIS     LN               LOG
MAVG         MAXIMUM          MCHARACTERS
MDIFF        MINDEX           MINIMUM
MLINREG      MSUBSTR          MSUM
NAMED        NULLIFZERO       OCTET_LENGTH
QUANTILE     REGR_INTERCEPT   REGR_SLOPE
RANDOM       RANK             SKEW
SQRT         STDDEV_POP       STDDEV_SAMP
SUBSTR       SUM              TITLE
TRIM         TYPE             UPPER
VARGRAPHIC   VAR_POP          VAR_SAMP
ZEROIFNULL

The above output is not a complete list of the commands. The three dots in the center represent the location where commands were omitted so the list fit onto a single page. All commands are seen when the HELP is performed on a terminal.

Once this output has been used to find the command, then the following HELP command provides additional information on it:

HELP 'SQL END TRANSACTION' ;

5 Rows Returned


Since the terminal is used most of the time to access the database, take advantage of it and use the terminal for your HELP commands.

Tools like Queryman also have a variety of HELP commands and individual menus. Always look for ways to make the task easier.

SET SESSION command

The Teradata Database provides user access only by allocating a session with a Parsing Engine (PE). The PE uses default attributes based on the user and the host computer from which the user is connecting. When a different session option is needed, the SET SESSION command is used. It overrides the default for this session only. The next time the user logs into Teradata, the original default is used again.

Syntax for SET SESSION:
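Hedged sketches of common forms, covering the options described below; the exact option spellings vary by release, so treat these as assumptions rather than definitive syntax:

SET SESSION COLLATION EBCDIC ;
SET SESSION DATEFORM = ANSIDATE ;
SET SESSION DATABASE My_DB ;
SET SESSION ACCOUNT = 'acct-id' FOR SESSION ;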

The SET SESSION can be abbreviated as: SS.

Collation sequence: ASCII, EBCDIC, MULTINATIONAL (European (diacritical) character or Kanji character), CHARSET_COLL (binary ordering based on the current client character set), JIS_COLL (logical ordering of characters based on the Japanese Industrial Standards collation), HOST (EBCDIC for IBM channel-attached clients and ASCII for all other clients - default collation).

Account-id: allows for the temporary changing of accounting data for charge back and priority. The account-id specified must be a valid one assigned to the user and the priority can only be down graded.

INTEGERDATE: uses the YY/MM/DD format, and ANSIDATE uses the YYYY-MM-DD format for a date.

Database-name: becomes the database to use as the current database for SQL operations during this session.

SHOW commands

There are times when you need to recreate a table, view, or macro that you already have, or you need to create another object of the same type that is either identical or very similar to an object that already exists. When this is the case, the SHOW command is a way to accomplish what you need.

We will be discussing all of these object types and their associated Data Definition Language (DDL) commands later in this course.

The intent of the SHOW command is to output the CREATE statement that could be used to recreate the object of the type specified.

This chart shows the commands and their formats:

SHOW TABLE <table-name> ; Displays the CREATE TABLE statement needed to create this table.

SHOW VIEW <view-name> ; Displays the CREATE VIEW statement needed to create this view.

SHOW MACRO <macro-name> ; Displays the CREATE MACRO statement needed to create this macro.

SHOW TRIGGER <trigger-name> ; Displays the CREATE TRIGGER statement needed to create this trigger.

SHOW PROCEDURE <procedure-name> ; Displays the CREATE PROCEDURE statement needed to create this stored procedure.

SHOW <SQL-statement> ; Displays the CREATE TABLE statements for all tables/views referenced by the SQL statement.

Figure 3-3

To see the CREATE TABLE command for the Employee table, we use the command:

SHOW TABLE Employee ;

13 Rows Returned


To see the CREATE VIEW command, we use a command like:

SHOW VIEW TODAY ;

3 Rows Returned

To see the CREATE MACRO command for the macro called MYREPORT, we use a command like:

SHOW MACRO MYREPORT ;

9 Rows Returned

To see the CREATE TRIGGER command for AVG_SAL_T, we use:

SHOW TRIGGER AVG_SAL_T ;

20 Rows Returned


Since the SHOW command returns the DDL, it can be a real time saver. It is a very helpful tool when a database object needs to be recreated, a copy of an existing object is needed, or another object is needed that has similar characteristics to an existing object. Plus, what a great way to get a reminder on the syntax needed for creating a table, view, macro, or trigger.

It is a good idea to save the output of the SHOW command in case it is needed at a later date. However, if the object's structure changes, the SHOW command should be re-executed and the new output saved. It returns the DDL that can be used to create a new table exactly like the current one. Normally, at a minimum, the table name is changed before executing the command.

EXPLAIN

The EXPLAIN command is a powerful tool provided with the Teradata database. It is designed to provide an English explanation of the steps the AMPs must complete to satisfy the SQL request. The EXPLAIN is based on the PE's execution plan.

The Parsing Engine (PE) does the optimization of the submitted SQL, the creation of the AMP steps and the dispatch to any AMP involved in accessing the data. The EXPLAIN is an SQL modifier; it modifies the way the SQL operates.

When an SQL statement is submitted using the EXPLAIN, the PE still does the same optimization step as normal. However, instead of building the AMP steps, it builds the English explanation and sends it back to the client software, not to the AMP. This gives users the ability to see resource utilization, use of indices, and row and time estimates.

Therefore, it can predict a Cartesian product join in seconds, instead of hours later when the user gets suspicious that the request should have been finished. The EXPLAIN should be run every time changes to an object’s structure occur, when a request is first put into production and other key times during the life of an application. Some companies require that the EXPLAIN always be run before execution of any new queries.

The syntax for using the EXPLAIN is simple: just type the EXPLAIN keyword preceding your valid SQL statement. For example:
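For instance (a hedged sketch):

EXPLAIN
SELECT *
FROM student_table;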

The EXPLAIN can be used to translate the actions for all valid SQL. It cannot provide a translation when syntax errors are present. The SQL must be able to execute in order to be explained.

Chart for some of the keywords that may be seen in the output of an EXPLAIN:

Locking Pseudo Table   Serial lock on a symbolic table. Every table has one. Used to prevent deadlock situations between users.

Locking table for <type>   Indicates that an ACCESS, READ, WRITE, or EXCLUSIVE lock has been placed on the table

Locking rows for <type>   Indicates that an ACCESS, READ, or WRITE lock is placed on rows as they are read or written

Do an ABORT test   Guarantees a transaction is not in progress for this user

All AMPs retrieve   All AMPs are receiving the AMP steps and are involved in providing the answer set

By way of an all rows scan   Rows are read sequentially on all AMPs

By way of primary index   Rows are read using the Primary Index column(s)

By way of index number   Rows are read using the Secondary Index; the number comes from HELP INDEX

Chart of EXPLAIN keywords (continued) 

BMSMS   BitMap Set Manipulation Step, an alternative direct access technique when multiple NUSI columns are referenced in the WHERE clause

Residual conditions WHERE clause conditions, other than those of a join

Eliminating duplicate rows Providing unique values, normally result of DISTINCT, GROUP BY or subquery

Where unknown comparison will be ignored   Indicates that NULL values will not compare to a TRUE or FALSE. Might be seen in a subquery using NOT IN or NOT = ALL, because no rows are returned if the comparison is ignored.

Merge join Rows of one table are matched to the other table on common domain columns after being sorted into the same sequence, normally Row Hash

Product join Rows of one table are matched to all the rows of the other table without concern for a domain match

Duplicated on all AMPs Participating rows for the table (normally smaller table) of a join are duplicated on all AMPS

Hash redistributed on all AMPs   Participating rows of a join are hashed on the join column and sent to the same AMP that stores the matching row of the table being joined

SMS   Set Manipulation Step, result of an INTERSECT, UNION, EXCEPT or MINUS operation

Last use   SPOOL file is no longer needed after the step and its space is released

Built locally on the AMPs   As rows are read, they are put into SPOOL on the same AMP

Aggregate Intermediate Results are computed locally   The aggregation values are all on the same AMP, so there is no need to redistribute them to work with rows on other AMPs

Aggregate Intermediate Results are computed globally   The aggregation values are not all on the same AMP and must be redistributed to one AMP, to accompany the same value from the other AMPs

Figure 3-4

Once you attain more experience with Teradata and SQL, these terms lead you to a more detailed understanding of the work involved in any SQL request.

The first is the estimated number of rows that will be returned. This number is an educated guess that the PE makes based on information available at the time of the EXPLAIN, and it may or may not be accurate. If there are current STATISTICS on the table, the numbers are more accurate. Otherwise, the PE calculates a guess by asking a random AMP for the number of rows it contains and multiplying the answer by the number of AMPs to estimate a "total row count." At the same time, it indicates how accurate the number might be using the terms in the next chart.

This chart is for phrases that accompany the estimated number of rows:

No confidence   The PE has no degree of certainty with the values used. This is normally a result of not collecting STATISTICS and of working with multiple steps in SPOOL.

Low confidence   The PE is not sure of the values being used. This is normally a result of processing involving several steps in SPOOL instead of the actual rows in a table.

High confidence   Normally indicates that STATISTICS have been collected on the columns or indices of a table. Allows the optimizer to be more aggressive in the access plan.

Index Join confidence   Indicates that the join being done uses a join condition via a unique index.

Figure 3-5


The second area to check in the output of the EXPLAIN is the estimated cost, expressed in time, to complete the SQL request. Although it is expressed in time, do not confuse it with either wall-clock or CPU time. It is strictly a cost factor calculated by the optimizer for comparison purposes only. It does not take the number of users, the current workload or other system related factors into account. After looking at the potential execution plans, the plan with the lowest cost value is selected for execution. Once these two values are checked, the question that should be asked is: Are these values reasonable?

For instance, if the table contains one million rows and the estimate is one million rows in 45 seconds, that is probably reasonable if there is not a WHERE clause. However, if the table contains a million rows and is being joined to a table with two thousand rows and the estimate is that two hundred trillion rows will be returned and it will take fifty days, this is not reasonable.

The following EXPLAIN is for a full table scan of the Student Table:
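The request that produced it would have been of this general form (the table name Student_Table is an assumed name for the Student table used in these examples):

EXPLAIN
SELECT * FROM Student_Table ;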

12 Rows Returned

The EXPLAIN estimates 8 rows and .15 seconds. Since there are 10 rows in the table, the EXPLAIN is slightly off in its estimate. However, this is reasonable based on the contents of the table and the SELECT statement submitted.

The next EXPLAIN is for a join that has an error in it, can you find it?:


The EXPLAIN estimates nearly 512 rows will be returned and it will take .39 seconds. Although the time estimate sounds acceptable, these are very small tables: the number of rows returned is 512, yet the largest of these tables contains only 14 rows. This is not reasonable based on the contents of the tables.

Upon further examination, the product join in step 6 is using (1=1) as the join condition where it should be a merge join. Therefore, this is a Cartesian product join. A careful analysis of the SELECT shows a single join condition in the WHERE clause. However, this is a three-table join and should have two join conditions. The WHERE clause needs to be fixed and by using the EXPLAIN we have saved valuable time.
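As a sketch of the mistake (all table and column names here are assumed), a three-table join like the following supplies only one of the two join conditions it needs, leaving the optimizer with an unconstrained product join for the second table pairing:

SELECT S.Last_Name, C.Course_Name
FROM Student_Table AS S
    ,Course_Table AS C
    ,Student_Course_Table AS SC
WHERE S.Student_ID = SC.Student_ID ;
/* missing: AND C.Course_ID = SC.Course_ID */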


If you can get to the point of using the EXPLAIN in this manner, you are way ahead of the game. No one will ever have to slap your hand for writing SQL that runs for days, uses up large amounts of system resources and accomplishes absolutely nothing. You say, “Doctor, it hurts when I do this.” The Doctor says, “Don’t do that.” We are saying, “Don’t put extensive SELECT requests into production without doing an EXPLAIN on them.”

Remember, always examine the EXPLAIN for reasonable results. Then, save the EXPLAIN output as a benchmark against any future EXPLAIN output. Then, if the SQL starts executing slower or using more resources, you have a basis for comparison. You might also use the benchmark if you decide to add a secondary index. This prototyping allows you to see exactly what your SQL is doing.

Some users have quit using the EXPLAIN because they have gotten inaccurate results. From our experience, when the numbers are consistently different than the actual rows being returned and the cost estimate is completely wrong, it is normally an indicator that STATISTICS should be collected or updated on the involved tables.


   

Adding Comments

Sometimes it is necessary or desirable to document the logic used in an SQL statement within the query. A comment is not executed and is ignored by the PE at syntax checking and resolution time.

ANSI Comment

To comment a line using the ANSI standard form of a comment:

-- the double dash at the start of a single line denotes a comment is on that line

Each line that is a comment must be started with the same two dashes for each comment line. This is the only technique available for commenting with ANSI compliance.

At the writing of this book, Queryman sometimes gets confused and regards all lines after the -- as part of the comment. So, be careful regarding various client tools.

-- This is an ANSI form of comment that consists of a single line of user explanation or

-- add notes to an SQL command. This is a second line and needs additional dashes
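For instance, both comment lines below are ignored when the request executes (the table and column names are assumed):

-- Return the name and grade point for every senior
-- using the Student table
SELECT Last_Name, Grade_Pt
FROM Student_Table
WHERE Class_Code = 'SR' ;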

Teradata Comment

To comment a line using the Teradata form of a comment:

/* the slash asterisk at the start of a line denotes the beginning of a comment

*/ the asterisk slash (reversed from the start of a comment) is used to end a comment.

Both the start and the end of a comment can be a single line or multiple lines. This is the most common form of comment seen in Teradata SQL, primarily since it was the original technique available.

/* This is the Teradata form of comment that consists of a single line of user explanation or add notes to an SQL command. Several lines of comment can be added within a single notation. This is the end of the comment. */


   

User Information Functions

The Teradata RDBMS (Relational DataBase Management System) has incorporated into it functions that provide data regarding a user who has performed a logon connection to the system. The following functions make that data available to a user for display or storage.

ACCOUNT Function

Compatibility: Teradata Extension

A user within the Teradata database has an account number. This number is used to identify the user, provide a basis for chargeback if desired, and establish a basic priority.

Previously, this number was used exclusively by the database administrator to control and monitor access to the system. Now, it is available for viewing by the user via SQL.

Syntax for using the ACCOUNT function:
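SELECT ACCOUNT;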

As an example, the following returns the account information for my user:

SELECT ACCOUNT;

1 Row returned

ACCOUNT

$M13678

If your account starts with a $M, you are running at a medium priority, where $L is low and $H is high. At the same time, the account does not have to begin with one of these and can be any site-specific value.

DATABASE Function

Compatibility: Teradata Extension

Chapter 1 of this book discussed the concept of a database and user area within the Teradata RDBMS. Knowing the current database within Teradata is sometimes an important piece of information needed by a user. As mentioned above, the HELP SESSION is one way to determine it. However, a lot of other information is also presented. Sometimes it is advantageous to have only that single tidbit of data, not only for display but also for storage. When this is the case, the DATABASE function is available.

Syntax for using the DATABASE function:
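SELECT DATABASE;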

As an example, the following returns the current database for my user:

SELECT DATABASE;

1 Row returned

DATABASE

Mikel

Session Function

Compatibility: Teradata Extension

Chapter 1 of this book discussed the PE and the concept of a session and its role involving the user’s SQL requests. The HELP SESSION provides a wealth of information regarding the individual session established for a user. One of those pieces of data is the session number. It uniquely identifies every user session in existence at any point in time. Teradata now makes the session number available using SQL.

Syntax for using the SESSION function:
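SELECT SESSION;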

As an example, the following returns the session number for my user:

SELECT SESSION;

1 Row returned

SESSION


1059


   

Data Conversions

In order for data to be managed and used, it must have characteristics associated with it. These characteristics are called attributes and include a data type and a length. The values that a column can store are directly related to these two attributes.

There are times when the data type or length defined is not convenient for the use or output display needed. For instance, when character data is too long for display, an option might be to reduce its length. At other times, the defined numeric data type is not sufficient to store the result of a mathematical operation. Therefore, conversion to a larger numeric type may be the only way to successfully complete the request.

When one of these situations interrupts the execution of the SQL, it is necessary to use one or more of the conversion techniques. They are covered here in detail to enhance the understanding and use of these capabilities.

In normal practices, there should be little need to convert from a number to a character on a regular basis. This requirement is one indicator that the table or column design is questionable. However, if a conversion must be performed, it is much safer to use the ANSI Standard CAST (Convert And Store) function when going from numeric to character instead of the older Teradata implied conversion. Both of these techniques are discussed here.

Conversions should be used only when absolutely necessary because they are intensive on system resources. As an example, I saw an SQL statement that converted four columns six different times. There were around a million rows in the table. The SQL did a lot of processing and it took about one hour to run. By eliminating these 6 million conversions, the SQL ran in under five minutes. Conversions can have an impact, but sometimes you need them. Use them only when absolutely necessary!


   

Data Types

Teradata supports many formats for storing data on disk and most of the data types conform to the ANSI standard. At the same time, there are data types specific to Teradata. Most of these unique data types are provided to save storage space on disk or support an international code set.

Since Teradata was originally designed to store terabytes worth of data in millions or billions of rows, saving a single byte one million times becomes a space savings of nearly a megabyte. The savings increases dynamically as more rows are added and more bytes per row are saved. This space savings can be very significant.

Likewise, the speed advantage associated with smaller rows cannot be ignored. Since data is read from a disk in a block, smaller rows mean that more rows are stored in a single block. Therefore, fewer blocks need to be read and it is faster.

The following charts indicate the data types currently supported by Teradata. The first chart shows the ANSI standard types and the second is for the additional data types that are extensions to the standard.

This chart indicates the data types Teradata currently supports as ANSI standards:

Data Type: Description. Data Value Range

INTEGER: Signed whole number. -2,147,483,648 to 2,147,483,647

SMALLINT: Signed smaller whole number. -32,768 to 32,767

DECIMAL(X,Y), where X=1 thru 18 is the total number of digits in the number and Y=0 thru 18 is the number of digits to the right of the decimal: Signed decimal number, up to 18 digits on either side of the decimal point. Largest value DEC(18,0); smallest value DEC(18,18)

NUMERIC(X,Y), same as DECIMAL: Synonym for DECIMAL. Same as DECIMAL

FLOAT (IEEE): Floating point format. <value>x10^307 to <value>x10^-308

REAL: Stored internally as FLOAT

PRECISION: Stored internally as FLOAT

DOUBLE PRECISION: Stored internally as FLOAT

CHARACTER(X) or CHAR(X), where X=1 thru 64000: Fixed length character string, 1 byte of storage per character; pads to length with spaces. 1 to 64,000 characters long

VARCHAR(X), CHARACTER VARYING(X) or CHAR VARYING(X), where X=1 thru 64000: Variable length character string, 1 byte of storage per character, plus 2 bytes to record the length of the actual data. 1 to 64,000 characters as a maximum; the system only stores the characters presented to it

DATE: Signed internal representation of YYYMMDD (YYY represents the number of years from 1900, i.e. 100 for Year 2000). Currently to the year 3500 as a positive number and back into AD years as a negative number

TIME: Identifies a field as a TIME value with Hour, Minutes and Seconds

TIMESTAMP: Identifies a field as a TIMESTAMP value with Year, Month, Day, Hour, Minute, and Seconds

Figure 4-1

This chart indicates the data types Teradata currently supports as extensions:

Data Type: Description. Data Value Range

BYTEINT: Signed whole number. -128 to 127

BYTE(X), where X=1 thru 64000: Binary. 1 to 64,000 bytes

VARBYTE(X), where X=1 thru 64000: Variable length binary. 1 to 64,000 bytes

LONG VARCHAR: Variable length string. 64,000 characters (maximum data length); the system only stores the characters provided, not trailing spaces

GRAPHIC(X), where X=1 thru 32000: Fixed length string of 16-bit bytes (2 bytes per character). 1 to 32,000 KANJI characters

VARGRAPHIC(X), where X=1 thru 32000: Variable length string of 16-bit bytes. 1 to 32,000 characters as a maximum; the system only stores the characters provided

Figure 4-2

These data types are all available for use within Teradata. Notice that there are fixed and variable length data formats. The fixed data types always require the entire defined length on disk for the column. The variable types can be used to maximize data storage within a block by storing only the data provided within a row by the client software.

You should use the appropriate type for the specific data. It is a good idea to use a VAR data type when most of the data is less than the maximum size. This is due to the addition of an extra 2-byte length indicator that is stored along with the actual data.
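As a sketch of the trade-off (an illustrative table; all names are assumed), a value that is usually shorter than its maximum belongs in a VAR type, while a value that is always the same length belongs in a fixed type:

CREATE TABLE Customer_Table
( Customer_Id  INTEGER
 ,Last_Name    VARCHAR(30)  /* usually much shorter than 30 characters */
 ,State_Code   CHAR(2) ) ;  /* always exactly 2 characters */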


   

CAST

Compatibility: ANSI

Under most conditions, the data types defined and stored in a table should be appropriate. However, sometimes it is neither convenient nor desirable to use the defined type. Data can be converted from one type to another by using the CAST function. As long as the data involved does not break any data rules (i.e. placing alphabetic or special characters into a numeric data type) the conversion works. The name of the CAST function comes from the Convert And STore operation that it performs.

Care must also be taken when converting data to manage any potential length issues. In Teradata mode, truncation occurs if a length is requested that is shorter than the original data. However, in ANSI mode, an SQL error is the result because ANSI says, “Thou shall not truncate data.”

The basic syntax of the CAST statement follows:
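CAST(<column-name> AS <data-type>)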

Examples using CAST:
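A few representative forms, using assumed columns from the Student table:

SELECT CAST(Grade_Pt AS INTEGER)  FROM Student_Table ;
SELECT CAST(Grade_Pt AS CHAR(4))  FROM Student_Table ;
SELECT CAST(Last_Name AS CHAR(5)) FROM Student_Table ;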

These are only some of the potential conversions and are primarily here for illustration of how to code a CAST. The CAST could also be used within the WHERE clause to control the length characteristics or the type of the data to compare.

Again, when using the CAST in ANSI mode, any attempt to truncate data causes the SQL to fail because ANSI does not allow truncation.

The next SELECT uses literal values to show the results of conversion:
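A statement of this general form is consistent with the output and the discussion that follows (the literals are taken from that discussion):

SELECT CAST('ABCDE'  AS CHAR(1))      AS Trunc
      ,CAST(128      AS CHAR(3))      AS OK
      ,CAST(127      AS INTEGER)      AS Bigger
      ,CAST(121.53   AS SMALLINT)     AS Whole
      ,CAST(121.53   AS DECIMAL(3,0)) AS Rounder ;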


1 Row Returned

Trunc OK Bigger Whole Rounder

A 128 127 121 122

In the above example, the first CAST truncates the five characters (left to right) to form the single character ‘A’. In the second CAST, the integer 128 is converted to three characters and left justified in the output. The 127 was initially stored in a SMALLINT (5 digits - up to 32767) and then converted to an INTEGER. Hence, it uses 11 character positions for its display: ten numeric digits and a sign (positive assumed), right justified as numeric.

The value of 121.53 is an interesting case for two reasons. First, it was initially stored as a DECIMAL with 5 total digits and 2 of them to the right of the decimal point. Then it is converted to a SMALLINT using CAST to remove the decimal positions. Therefore, it truncates data by stripping off the decimal portion. It does not round data using this data type. On the other hand, the CAST in the fifth column called Rounder is converted to a DECIMAL with 3 digits and no digits (3,0) to the right of the decimal, so it will round data values instead of truncating. Since .53 is greater than .5, it is rounded up to 122.


   

Implied CAST

Compatibility: Teradata Extension

Although the CAST function is the ANSI standard, it has not always been that way. Prior to the CAST function, Teradata had the ability to convert data from one type to another.

This conversion is requested by placing the “implied” data type conversion in parentheses after the column name. Therefore, it becomes a part of the select list and the column request. The new data type is written as an attribute for the column name.

The following is the format for requesting a conversion:
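<column-name> (<data-type>)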

At first glance, this appears to be the best and shortest technique for doing conversions. However, there is a hidden danger here when converting from numeric to character, as demonstrated in this SELECT that uses the same data as above to do implied CAST conversions:
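A sketch consistent with the output and discussion below (the literals mirror the CAST example above):

SELECT 'ABCDE' (CHAR(1))  AS Shortened
      ,128     (CHAR(3))  AS OOPS1
      ,-128    (CHAR(3))  AS OOPS2
      ,127     (INTEGER)  AS Bigger
      ,121.53  (SMALLINT) AS Whole ;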

1 Row Returned

Shortened  OOPS1  OOPS2  Bigger  Whole

A                 -      128     121

What happened in the columns named OOPS1 and OOPS2?

The answer to this question is: the value 128 is 1 greater than 127 and therefore too large a value to store in a BYTEINT. So it is automatically stored as a SMALLINT (5 digits plus a sign) before the conversion. The implicit conversion changes it to a character type with the first 3 characters being returned. As a result, only the first 3 spaces are seen in the report (_ _ _ 128). Likewise, OOPS2 is stored as (_ _ - 128) with the first three characters (2 spaces and -) shown in the output. Always think about the impact of the sign as a valid part of the data when converting from numeric to character. As mentioned earlier, if you find that conversions of this type are regularly necessary, the table design needs to be re-examined.

As demonstrated in the above output, it is always safer to use CAST when going from numeric to character data types.


   

Formatted Data

Compatibility: Teradata Extension

Remember that truncation works in Teradata mode, but not in ANSI mode. So, another way to make data appear to be truncated is to use the Teradata FORMAT in the SELECT list with one or more columns when using a tool like BTEQ. Since FORMAT does not truncate data, it works in ANSI mode.

The syntax for using FORMAT is:
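<column-name> (FORMAT '<format-mask>')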

The next SELECT demonstrates the use of FORMAT:
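A sketch consistent with the output (the literals and format masks are assumed):

SELECT 'ABCDEFG'                  (FORMAT 'XXX')           AS Shorter
      ,121.53                     (FORMAT '99999')         AS Fmt_121
      ,121.53
      ,CAST('1999-10-01' AS DATE) (FORMAT 'MM/DD/YYYY')    AS Fmt_NumDate
      ,CAST('1999-10-01' AS DATE) (FORMAT 'MMMBDD,BYYYY')  AS Fmt_Date ;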

1 Row Returned

Shorter  Fmt_121  121.53  Fmt_NumDate  Fmt_Date

ABC      00121    121.53  10/01/1999   OCT 01, 1999

There are a couple of things to notice in this output. First, it works in ANSI mode because truncation does not occur. The distinction is that all of the data from the column is in spool. It is only the output that is shortened, not truncated. The character data types use the ‘X’ for the formatting character. Second, formatting does not round a data value as with the 121.53, the display is shortened. The numeric data types use a ‘9’ as the basic formatting character. Others are shown in this chapter.

Next, DATE type data uses the ‘M’ for month, the ‘D’ for day of the month and ‘Y’ for the year portion of a valid date. Lastly, the case of the formatting characters does not matter. The formatting characters can be written in all uppercase, lowercase, or a mixture of both cases.


The two following charts show the valid formatting characters for Teradata and provide an explanation of the impact each one has on the output display when using BTEQ:

Basic Numeric and Character Data Formatting Symbols

Symbol Mask character and how used

X or x Character data. Each X represents one character. Can repeat value– i.e. XXXXX or X(5).

9 Decimal digit. Holds place for numeric digit for a display 0 through 9. All leading zeroes are shown if the format mask is longer than the data value. Can repeat value– i.e. 99999 or 9(5).

V or v Implied decimal point. Aligns data on a decimal value. Primarily used on imported data without actual decimal point.

E or e Exponential. Aligns the end of the mantissa and the beginning of the exponent.

G or g Graphic data. Each G represents one logical (double byte- KANJI or Katakana) character. Can repeat value– i.e. GGGGG or G(5).

Figure 4-3

Advanced Numeric and Character Formatting Symbols

Symbol Mask character and how used

$ Fixed or floating dollar sign. Inserts a $ or leaves spaces and moves (floats) over to the first character of a currency value. With the proper keyboard, additional currency signs are available: Cent, Pound and Yen.

, Comma. Inserted where appears in format mask. Used primarily to make large numbers easier to read.

. Period. Primary use to align decimal point position. Also used for: dates and comma in some currencies.

- Dash character. Inserted where appears in format mask. Used primarily for dates and negative numeric values. Also used for: phone numbers, zip codes, and social security (USA).

/ Slash character. Inserted where appears in format mask. Used primarily for dates.

% Percent character. Inserted where appears in format mask. Used primarily for display of percentage – i.e. 99% vs. .99

Z or z Zero-suppressed decimal digit. Holds place for numeric digit; displays 1 through 9, and 0 when significant. All leading (insignificant) zeroes are shown as spaces since their presence does not change the value of the number being displayed.

B or b Blank data. Inserts a space where it appears in the format mask.

Figure 4-4

The next chart shows the formatting characters used in conjunction with DATE data:

Date Formatting Symbols

Symbol Mask character and how used (not case specific)

M or m Month. Allows the month to be displayed anywhere in the date display. When ‘MM’ is specified, the numeric (01-12) value is available. When ‘MMM’ is specified, the three character (JAN-DEC) value is available.

D or d Day. Allows the day to be displayed anywhere in the date display. When ‘DD’ is specified, the numeric (01-31) value is available. When ‘DDD’ is specified, the three-digit day of the year (001-366) value is available.

Y or y Year. Allows the year to be displayed anywhere in the date display. The normal ‘YY’ has been used for many years for the 20th century with the 19YY assumed. However, since we have moved into the 21st century, it is recommended that the ‘YYYY’ be used.

Figure 4-5

There is additional information on date formatting in a later chapter dedicated exclusively to date processing.

The next SELECT demonstrates some of the additional formatting symbols:
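A sketch consistent with the output (all literals and masks are assumed):

SELECT 'ABCDEFG'                  (FORMAT 'XXX')            AS Fmt_Shorter
      ,2014859999                 (FORMAT '999-999-9999')   AS Fmt_Phone
      ,1021.53                    (FORMAT 'ZZZZ.ZZ')        AS Z_Press
      ,CAST('1999-10-01' AS DATE) (FORMAT 'YYDDD')          AS Fmt_Julian
      ,991001                     (FORMAT '$$$$,$$$.99')    AS Fmt_Pay ;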

1 Row Returned


Fmt_Shorter Fmt_Phone Z_Press Fmt_Julian Fmt_Pay

ABC 201-485-9999 1021.53 99274 $991,001.00

There are only two things that need to be watched when using the FORMAT function. First, the data type must match the formatting character used or a syntax error is returned. So, if the data is numeric, use a numeric formatting character, and likewise for character data. The other concern is configuring the format mask big enough for the largest data column. If the mask is too short, the SQL command executes; however, the output contains a series of ************* to indicate a format overflow, as demonstrated by the following SELECT:
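For instance (an assumed literal), a ten-digit number cannot fit a mask sized for seven digits:

SELECT 2014859999 (FORMAT '999-9999') AS Fmt_Phone ;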

1 Row Returned

Fmt_Phone

*********

All of these FORMAT requests work wonderfully if the client software is BTEQ. After all, it is a report writer and these are report writer options. The issue is that the ODBC and Queryman look at the data as data, not as a report. Since many of the formatting symbols are “characters” they cannot be numeric. Therefore, the ODBC strips off the symbols and presents the numeric data to the client software for display.

Tricking the ODBC to Allow Formatted Data

If a tool uses the ODBC, the FORMAT in the SELECT is ignored and the data comes back as data, not as a formatted field. This is especially noticeable with numeric data and dates.

To force tools like Queryman to format the data, the software must be tricked into thinking the data is character type, which it leaves alone. This can be done using the CAST function.

The next SELECT uses the CAST operation to trick the software into thinking the formatted data is character:
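A sketch consistent with the output (the literals and lengths are assumed):

SELECT CAST((4859999 (FORMAT '999-9999')) AS CHAR(8))                        AS Fmt_CAST_Phone
      ,CAST((CAST('1999-10-01' AS DATE) (FORMAT 'YYYY.MM.DD')) AS CHAR(10)) AS Fmt_CAST_Date
      ,CAST((991001 (FORMAT '$$$$,$$$.99')) AS CHAR(11))                    AS Fmt_CAST_Pay ;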


1 Row Returned

Fmt_CAST_Phone Fmt_CAST_Date Fmt_CAST_Pay

485-9999 1999.10.01 $991,001.00

Do not let the presence of AS in the above SELECT confuse you. The first AS, inside the parentheses, goes with the new data type for the CAST. Notice that the parentheses enclose both the data and the FORMAT so that they are treated as a single entity. The second AS is outside the parentheses and is used to name the alias.


   

TITLE Attribute for Data Columns

Compatibility: Teradata Extension

As seen earlier, an alias may be used to change the column name. This can be done for ease of reference or to alter the heading for the column in the output. The TITLE is an alternative to using an alias name when a column heading needs to be changed. There is a big difference between TITLE and an alias. Although an alias does change the title on a report, it is normally used to rename a column (throughout the SQL) as a new name. The TITLE only changes the column heading.

The syntax for using TITLE follows:
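<column-name> (TITLE '<title-text>')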

Like FORMAT, TITLE changes the attribute of the displayed data. Therefore, it is written in parentheses also. Also like FORMAT, tools using the ODBC may not work as well as they do in BTEQ, the report writer. This is especially true when using the // stacking symbols. In tools like Queryman, the title literally contains //, which is probably not the intent. Also, if you attempt to use TITLE in Queryman and it does not work, there is a configuration option in the ODBC. When “Use Column Names” is checked, it will not use the title designation.

The following SELECT uses the TITLE to show the result:
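A sketch consistent with the output (the literals are assumed; the // in the second TITLE stacks that heading in BTEQ):

SELECT 'Character Data' (TITLE 'Character Data')
      ,'Character Data' (TITLE 'Character//Data')
      ,123              (TITLE 'Numeric Data') ;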

1 Row Returned

Character Data Character Data Numeric Data

Character Data Character Data 123

Notice that the word ‘Character’ is stacked over the ‘Data’ portion of the heading for the second column using BTEQ. So, as an alternative, a TITLE can be used instead of an alias and allows the user to include spaces in the output title.

Another neat trick for TITLE is to use two single quotes together (TITLE ‘’). This technique creates a zero length TITLE, or no title at all, as seen in the next SELECT:
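A sketch consistent with the output below:

SELECT 'Character Data' (TITLE 'Character Data')
      ,'Character Data' (TITLE '')
      ,123              (TITLE '') ;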

1 Row Returned

Character Data    

Character Data Character Data 123

Remember, this TITLE is two separate single quotes, not a single double quote. A double quote by itself does not work because it is unbalanced without a second double quote.


   

Transaction Modes

Transaction mode is an area where the perspectives of the Teradata RDBMS and ANSI experience a departure. Teradata, by default, is completely non-case specific. ANSI requires just the opposite condition: everything is case specific and, as we saw earlier, dictates that table and column names be in capital letters.

This is probably a little restrictive and I tend to agree completely with the Teradata implementation. At the same time, Teradata allows the user to work in either mode within a session when connected to the RDBMS. The choice is up to the user when BTEQ is the client interface software.

For instance, within BTEQ either of the following commands can be used before logging onto the database:

.SET SESSION TRANSACTION ANSI

Or:

.SET SESSION TRANSACTION BTET

The BTET transaction is simply an acronym made from a consolidation of the BEGIN TRANSACTION (BT) and END TRANSACTION (ET) commands to represent Teradata mode.

The system administrator defines the system default mode for Teradata. A setting in the DBS Control record determines the default session mode. The above commands allow the default to be overridden for each logon session. The SET command must be executed before the logon to establish the transaction mode for the next session(s).

However, not all client software supports the ability to change modes between Teradata and ANSI. When it is desirable for functionality or processing characteristics of the other mode, other options are available and are presented below. There is more information on transactional processing later in this book.


   

Case Sensitivity of Data

It has been discussed earlier that there is no need for concern regarding the use of lower or upper case characters when coding the SQL. As a matter of fact, the different case letters can be mixed in a single statement. Normally, the Teradata database does not care about the case when comparing the stored data either.

However, the ANSI mode implementation of the Teradata RDBMS is case sensitive, regarding the data. This means that it knows the difference between a lower case letter like ‘a’ and an upper case letter ‘A’. At the same time, when using Teradata mode within the Teradata database, it does not distinguish between upper and lower case letters. It is the mode of the session that dictates the case sensitivity of the data.

The SQL can always execute ANSI standard commands in Teradata mode and likewise, can always execute Teradata extensions in ANSI mode. The SQL is always the same regardless of the mode being used. The difference comes when comparing the results of the data rows being returned based on the mode.

For example, earlier in this chapter, it was stated that ANSI mode does not allow truncation. Therefore, the FORMAT could be used in either mode because it did not truncate data.

To demonstrate this issue, the following uses the different modes in BTEQ:
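A sketch of the test, run after .SET SESSION TRANSACTION ANSI and a fresh logon (the literals are inferred from the discussion):

SELECT 'They match' (TITLE 'Do They Match?')
WHERE 'A' = 'a' ;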

No Rows Returned

The above SQL execution is case specific due to ANSI mode, and ‘A’ is different than ‘a’. The same SQL is executed again here; however, the transaction mode for the session is set to Teradata mode (BTET) prior to the logon:

1 Row Returned


Do They Match?

They match

Now that the defaults have been demonstrated, the following functions can be used to mimic the operation of each mode while executing in the other (ANSI vs Teradata) where case sensitivity is concerned.


   

CASESPECIFIC

Compatibility: Teradata Extension

The CASESPECIFIC attribute may be used to request that Teradata compare data values with a distinction made between upper and lower case. The logic behind this designation is that even in Teradata mode, case sensitivity can be requested to make the SQL work the same as ANSI mode, which is case specific. Therefore, when CASESPECIFIC is used, it normally appears in the WHERE clause.

The following two statements execute exactly the same:
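As a sketch, using the Student table:

SELECT * FROM Student_Table
WHERE Last_Name = 'Larkins' (CASESPECIFIC) ;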

Or, it may be abbreviated as CS:
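SELECT * FROM Student_Table
WHERE Last_Name = 'Larkins' (CS) ;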

Conversely, if ANSI is the current mode and there is a need for it to be non-case specific, the NOT can be used to adjust the default operation of the SQL within a mode.

The following SQL forces ANSI to be non-case specific:
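SELECT * FROM Student_Table
WHERE Last_Name = 'Larkins' (NOT CASESPECIFIC) ;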

Or, it may be abbreviated as:
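SELECT * FROM Student_Table
WHERE Last_Name = 'Larkins' (NOT CS) ;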


The next SELECT demonstrates the functionality of CASESPECIFIC and CS for comparing an equality condition like it executed above in ANSI mode:
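A sketch of the comparison:

SELECT 'They match' (TITLE 'Do They Match?')
WHERE 'A' = 'a' (CS) ;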

No Rows Returned

No rows are returned, because ‘A’ is different than ‘a’ when case sensitivity is used. At first glance, this seems to be unnecessary since the mode can be set to use either ANSI or Teradata. However, the dot (.) commands are BTEQ commands. They do not work in Queryman. If case sensitivity is needed when using other tools, this is one of the options available to mimic ANSI comparisons while in Teradata mode.

The SQL extensions in Teradata may be used to eliminate the absolute need to log off to reset the mode and then log back onto Teradata in order to use a characteristic like case sensitivity. Instead, Teradata mode can be forced to use a case specific comparison, like ANSI mode by incorporating the CASESPECIFIC(CS) into the SQL. The case specific option is not a statement level feature; it must be specified for each column needing this type of comparison in both BTEQ and Queryman.


   

LOWER Function

Compatibility: ANSI

The LOWER case function is used to convert all characters stored in a column to lower case letters for display or comparison. It is a function and therefore requires that the data be passed to it.

The syntax for using LOWER:
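LOWER(<column-name>)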

The following SELECT uses an upper case literal value as input and outputs the same value, but in lower case:

SELECT LOWER('ABCDE') AS Result ;

1 Row Returned

Result

abcde

When LOWER is used in a WHERE clause, the result is a predictable string of all lowercase characters. When compared to a lowercase value, the result is a case blind comparison. This is true regardless of how the data was originally stored.

SELECT 'They match' (TITLE 'Do they match?')
WHERE LOWER('aBcDe') = 'abcde' ;

1 Row Returned

Do They match?

They match


   

UPPER Function

Compatibility: ANSI

The UPPER case function is used to convert all characters stored in a column to the same characters in upper case. It is a function and therefore requires that data be passed to it.

The syntax for using UPPER:
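UPPER(<column-name>)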

The next example uses a literal value within UPPER to show the output all in upper case:

SELECT UPPER('aBcDe') AS Result ;

1 Row Returned

Result

ABCDE

It is also possible to use both the LOWER and UPPER case functions within the WHERE clause. This technique can be used to make ANSI non-case specific, like Teradata, by converting all the data to a known state, regardless of the starting case. Thus, it does not check the original data, but instead it checks the data after the conversion.

The following SELECT uses the UPPER function in the WHERE:
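A sketch consistent with the output:

SELECT 'They match' (TITLE 'Do They match?')
WHERE UPPER('aBcDe') = 'ABCDE' ;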

1 Row Returned

Do They match?

They match

When the data does not meet the requirements of the output format, it is time to convert the data. The UPPER and LOWER functions can be used to change the appearance or characteristics of the data to a known state.

When case sensitivity is needed, ANSI mode is one way to accomplish it. If that is not an option, the CASESPECIFIC attribute can be incorporated into the SQL.


   

Aggregate Processing

The aggregate functions are used to summarize column data values stored in rows. Aggregates eliminate the detail information from the rows and only return the answer. Therefore, the result is one or more aggregated values as a single line, or one line per unique value, as a group. The other characteristic of these functions is that they all ignore null values stored in column data passed to them.

Math Aggregates

The math aggregates are the original functions used to provide simple types of arithmetic operations for the data values. Their names are descriptive of the operation performed. The functions are listed below with examples following their descriptions. The newer, V2R4 statistical aggregates are covered later in this chapter.

The SUM Function

Accumulates the values for the named column and prints one total from the addition.

The AVG Function

Accumulates the values for the named column and counts the number of values added for the final division to obtain the average.

The MIN Function

Compares all the values in the named column and returns the smallest value.

The MAX Function

Compares all the values in the named column and returns the largest value.

The COUNT Function

Adds one to the counter each time a value other than null is encountered.


The aggregates can all be used together in a single request on the same column, or individually on different columns, depending on your needs.

The following syntax shows all five aggregate functions in a single SELECT to produce a single line answer set:
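SELECT SUM(<column-name>)
      ,AVG(<column-name>)
      ,MIN(<column-name>)
      ,MAX(<column-name>)
      ,COUNT(<column-name>)
FROM <table-name> ;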

The following table is used to demonstrate the aggregate functions:

Student Table - contains 10 students

Student_ID   Last_Name   First_Name   Class_Code   Grade_Pt
PK, UPI      NUSI                     FK, NUSI

123250       Phillips    Martin       SR           3.00
125634       Hanson      Henry        FR           2.88
234121       Thomas      Wendy        FR           4.00
231222       Wilson      Susie        SO           3.80
260000       Johnson     Stanley      ?            ?
280023       McRoberts   Richard      JR           1.90
322133       Bond        Jimmy        JR           3.95
324652       Delaney     Danny        SR           3.35
333450       Smith       Andy         SO           2.00
423400       Larkins     Michael      FR           0.00

Figure 5-1

The next SELECT uses the Student table to show all aggregates in one statement working on the same column:
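A statement of this form produces the output shown (the table name Student_Table is assumed here and in the rest of this chapter):

SELECT SUM(Grade_Pt)
      ,AVG(Grade_Pt)
      ,MIN(Grade_Pt)
      ,MAX(Grade_Pt)
      ,COUNT(Grade_Pt)
FROM Student_Table ;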


1 Row Returned

SUM(Grade_pt) AVG(Grade_pt) MIN(Grade_pt) MAX(Grade_pt) COUNT(Grade_pt)

24.88 2.76 0.00 4.00 9

Notice that Stanley’s row is not included in the functions due to the null in his grade point average. Also notice that no individual grade point data is displayed because the aggregates eliminate this level of column and row detail and only returns the summarized result for all included rows. The way to eliminate rows from being included in the aggregation is through the use of a WHERE clause. Since the name of the selected column appears as the heading for the column, aggregate names make for funny looking headings. To make the output look better, it is a good idea to use an alias to dress up the name used in the output. Additionally, the alias can be used elsewhere in the SQL as the column name.

The next SELECT demonstrates the use of alias names for the aggregates:
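A sketch consistent with the output and the note about double quotes below:

SELECT SUM(Grade_Pt)   AS "Total"
      ,AVG(Grade_Pt)   AS "Average"
      ,MIN(Grade_Pt)   AS Smallest
      ,MAX(Grade_Pt)   AS Largest
      ,COUNT(Grade_Pt) AS "Count"
FROM Student_Table ;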

1 Row Returned

Total Average Smallest Largest Count

24.88 2.76 0.00 4.00 9

Notice that when using aliases in the above SELECT they appear as the heading for each column. Also the words Total, Average and Count are in double quotes. As mentioned earlier in this book, the double quoting technique is used to tell the PE that this is a column name, opposed to being the reserved word. Whereas, the single quotes are used to identify a literal data value.


Aggregates and Derived Data

The various aggregates can work on any column. However, most of the aggregates only work with numeric data. The COUNT function might be the primary one used on either character or numeric data. The aggregates can also be used with derived data.

The following table is used to demonstrate derived data and aggregation:

Employee Table - contains 9 employees

Employee_No   Last_Name    First_Name   Salary       Dept_No
PK, UPI       NUSI                                   FK, NUSI

1232578       Chambers     Mandee       $48,850.00   100
1256349       Harrison     Herbert      $54,500.00   400
2341218       Reilly       William      $36,000.00   400
2312225       Larkins      Loraine      $40,200.00   300
2000000       Jones        Squiggy      $32,800.50   ?
1000234       Smythe       Richard      $64,300.00   10
1121334       Strickling   Cletus       $54,500.00   400
1324657       Coffing      Billy        $41,888.88   200
1333454       Smith        John         $48,000.00   200

Figure 5-2

This SELECT totals the salaries for all employees and shows what the total salaries will be if everyone is given a 5% or a 10% raise:
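A sketch consistent with the output (Employee_Table is an assumed name and the format masks are assumed):

SELECT SUM(Salary)                 (TITLE 'Salary Total', FORMAT '$$$$,$$$.99')
      ,SUM(Salary) * 1.05          (TITLE '+5% Raise', FORMAT '$$$$,$$$.99')
      ,SUM(Salary) * 1.10          (TITLE '+10% Raise', FORMAT '$$$$,$$$.99')
      ,AVG(Salary)                 (TITLE 'Average Salary', FORMAT '$$$,$$$.99')
      ,SUM(Salary) / COUNT(Salary) (TITLE 'Computed Average Salary', FORMAT '$$$,$$$.99')
FROM Employee_Table ;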

1 Row Returned

Salary Total   +5% Raise     +10% Raise    Average Salary   Computed Average Salary

$421,039.38    $442,091.35   $463,143.32   $46,782.15       $46,782.15


Notice that since both TITLE and FORMAT require parentheses, they can share the same set. Also, the AVG function and dividing the SUM by the COUNT provide the same answer.


   

GROUP BY

It has been shown that aggregates produce one row of output with one value per aggregate. However, the above SELECT is inconvenient if individual aggregates are needed based on different values in another column, like the class code. For example, you might want to see each aggregate for freshmen, sophomores, juniors, and seniors.

The following SQL might be run once for each unique value specified in the WHERE clause for class code; here the aggregates only work on the senior class (‘SR’):
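A sketch of the request:

SELECT SUM(Grade_Pt)   AS "Total"
      ,AVG(Grade_Pt)   AS "Average"
      ,MIN(Grade_Pt)   AS Smallest
      ,MAX(Grade_Pt)   AS Largest
      ,COUNT(Grade_Pt) AS "Count"
FROM Student_Table
WHERE Class_Code = 'SR' ;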

1 Row Returned

Total Average Smallest Largest Count

6.35 3.175 3.00 3.35 2

Although this technique works for finding each class, it is not very convenient. The first issue is that each unique class value needs to be known ahead of time for each execution. Second, each WHERE clause must be manually modified for the different values needed. Lastly, each time the SELECT is executed, it produces a separate output. In reality, it might be better to have all the results in a single report format.

Since the results of aggregates are incorporated into a single output line, it is necessary to create a way to provide one line returned per unique data value. To provide a unique value, it is necessary to select a column with a value that groups various rows together. This column is simply selected and not used in an aggregate. Therefore, it is not an aggregated column.

However, when aggregates and “non-aggregates” (normal columns) are selected at the same time, a 3504 error message is returned to indicate the mixture and that the non-aggregate is not part of an associated group. Therefore, the GROUP BY is required in the SQL statement to identify every column selected that is not an aggregate.

The resulting output consists of one line for all aggregate values for each unique data value stored in the column(s) named in the GROUP BY. For example, if the department number is used from the Employee table, the output consists of one line per department with at least one employee working in it.

The next SELECT uses the GROUP BY to create one line of output per unique value in the class code column:
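A sketch of the request:

SELECT Class_Code
      ,SUM(Grade_Pt)   AS "Total"
      ,AVG(Grade_Pt)   AS "Average"
      ,MIN(Grade_Pt)   AS Smallest
      ,MAX(Grade_Pt)   AS Largest
      ,COUNT(Grade_Pt) AS "Count"
FROM Student_Table
GROUP BY Class_Code ;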

5 Rows Returned

Class_code Total Average Smallest Largest Count

FR   6.88   2.29    0.00   4.00   3
?    ?      ?       ?      ?      0
JR   5.85   2.925   1.90   3.95   2
SR   6.35   3.175   3.00   3.35   2
SO   5.80   2.9     2.00   3.80   2

Notice that the null value in the class code column is returned. At first, this may seem contrary to the aggregates ignoring nulls. However, class code is not being aggregated and is selected as a “unique value.” All the aggregate values on the grade point for this row are null, except for COUNT. The COUNT is zero, and this does indicate that the null value is ignored. The COUNT value initially starts at zero, so: 0 + 0 = 0. The GROUP BY is only required when a non-aggregate column is selected along with one or more aggregates. Without both a non-aggregate and a GROUP BY clause, the aggregates return only one row. Whereas, with a non-aggregate and a GROUP BY clause designating the column(s), the aggregates return one row per unique value in the column, as seen above.


Additionally, more than one non-aggregate column can be specified in the SELECT and in the GROUP BY clause. The normal result of this is that more rows are returned. This is because one row appears whenever any single column value changes; the combination of the column values constitutes a new value. Remember, all non-aggregates selected with an aggregate must be included in the GROUP BY, or a 3504 error is returned.

As an example, the last name might be added as a second non-aggregate. Then, each combination of last name and class code are compared to other students in the same class. This combination creates more lines of output. As a result, each aggregate value is primarily the aggregation of a single row. The only time multiple rows are processed together is when multiple students have the same last name and are in the same class. Then they group together based on the values in both columns being equal.

This SELECT demonstrates the correct syntax when using multiple non-aggregates with aggregates and the output is one line of output for each student:
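A sketch of the request (the GROUP BY 1,2 shorthand is discussed below):

SELECT Last_Name
      ,Class_Code
      ,SUM(Grade_Pt)   AS "Total"
      ,AVG(Grade_Pt)   AS "Average"
      ,MIN(Grade_Pt)   AS Smallest
      ,MAX(Grade_Pt)   AS Largest
      ,COUNT(Grade_Pt) AS "Count"
FROM Student_Table
GROUP BY 1,2 ;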

10 Rows Returned

Last_name Class_code Total Average Smallest Largest Count

Johnson     ?    ?      ?      ?      ?      0
Thomas      FR   4.00   4.00   4.00   4.00   1
Smith       SO   2.00   2.00   2.00   2.00   1
McRoberts   JR   1.90   1.90   1.90   1.90   1
Larkins     FR   0.00   0.00   0.00   0.00   1
Phillips    SR   3.00   3.00   3.00   3.00   1
Delaney     SR   3.35   3.35   3.35   3.35   1
Wilson      SO   3.80   3.80   3.80   3.80   1
Bond        JR   3.95   3.95   3.95   3.95   1
Hanson      FR   2.88   2.88   2.88   2.88   1

Beyond showing the correct syntax for multiple non-aggregates, the above output reveals that it is possible to request too many non-aggregates. As seen above, every output line is a single row. Therefore, every aggregated value consists of a single row, and the aggregate is meaningless because it is the same as the original data value. Also notice that without an ORDER BY, the GROUP BY does not sort the output rows.

Like the ORDER BY, the number associated with the column’s relative position within the SELECT can also be used in the GROUP BY. In the above example, the two columns are the first ones in the SELECT and therefore, it is written using the shorter format: GROUP BY 1,2.

Caution: Using the shorter technique can cause problems if the location of a non-aggregate is changed in the SELECT list and the GROUP BY is not changed. The most common problem is a 3504 error message indicating that a non-aggregate is not included in the GROUP BY, so the SELECT does not execute.

As previously shown, the default for a column heading is the column name. It is not very pretty to see the name of the aggregate and column used as a heading. Therefore, an alias is suggested in all tools or optionally, a TITLE in BTEQ to define a heading.

Also seen earlier, a COUNT on the grade point for the null class code is zero. Actually, this is misleading in that 1 row contains a null, not zero rows. But, because of the null value, the row is not counted. A better technique might be the use of COUNT(*), for a row count. Although this implies counting all columns, in reality it counts the row. The objective of this request is to find any column that contains a non-null data value.

Another method to provide the same result is to count any column that is defined as NOT NULL. However, since it takes time to determine such a column and its name is longer than typing an asterisk (*), it is easier to use the COUNT(*).

Again, the GROUP BY clause creates one line of output per unique value, but does not perform a sort. It only creates the distinct grouping for all of the columns specified. Therefore, it is suggested that you always include an ORDER BY to sort the output.

The following might be a better way to code the previous request, using the COUNT(*) and an ORDER BY:
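A sketch of the improved request:

SELECT Class_Code
      ,SUM(Grade_Pt) AS "Total"
      ,AVG(Grade_Pt) AS "Average"
      ,MIN(Grade_Pt) AS Smallest
      ,MAX(Grade_Pt) AS Largest
      ,COUNT(*)      AS "Count"
FROM Student_Table
GROUP BY 1
ORDER BY 1 ;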


5 Rows Returned

Class_code Total Average Smallest Largest Count

?    ?      ?       ?      ?      1
FR   6.88   2.29    0.00   4.00   3
JR   5.85   2.925   1.90   3.95   2
SO   5.80   2.9     2.00   3.80   2
SR   6.35   3.175   3.00   3.35   2

Now the output is sorted by the class code with the null appearing first, as the lowest “value.” Also notice the count is one for the row containing mostly NULL data. The COUNT(*) counts the row.


   

Limiting Output Values Using HAVING

As in any SELECT statement, a WHERE clause can always be used to limit the number or types of rows used in the aggregate processing. However, something besides a WHERE is needed to evaluate aggregate values because the aggregate is not finished until all eligible rows have been read. Again, a WHERE clause eliminates rows during the process of reading the base table rows. To allow for the elimination of specific aggregate results, the HAVING clause is used to make the final comparison before the aggregate results are returned.

The previous SELECT is modified below to compare the aggregates and only return the students from spool with a grade point average of B (3.0) or better:
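A sketch consistent with the output:

SELECT Class_Code
      ,SUM(Grade_Pt) AS "Total"
      ,AVG(Grade_Pt) AS "Average"
      ,COUNT(*)      AS "Count"
FROM Student_Table
GROUP BY 1
HAVING AVG(Grade_Pt) >= 3.0 ;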

1 Row Returned

Class_code Total Average Count

SR 6.35 3.18 2

Notice that all of the previously seen output with an average value less than 3.00 has been eliminated as a result of using the HAVING clause. The WHERE clause eliminates rows; the HAVING provides the last comparison after the calculation of the aggregate and before results are returned to the user client.


   

Statistical Aggregates

In Teradata Version 2 Release 4 (V2R4) there are several new aggregates that perform statistical operations. Many of them are used in other internal functions and now they are available for use within SQL.

Not only are these statistical functions the newest, there are also two types of them: unary (single input value) functions and binary (dual input value) functions.

The unary functions look at individual column values for each row included and compare all of the values for trends, similarities and groupings. All the original aggregate functions are unary in that they accept a single value to perform their processing.

The statistical unary functions are:
      Kurtosis
      Skew
      Standard Deviation of a sample
      Standard Deviation of a population
      Variance of a sample
      Variance of a population

The binary functions examine the relationship between the two different values. Normally these two values represent positions on an X-axis and a Y-axis.

The binary functions are:
      Correlation
      Covariance
      Regression Line Intercept
      Regression Line Slope

The results from the statistical functions are not as obvious to demonstrate and figure out as the original functions, like SUM or AVG. The Stats table in Figure 5-3 is used to demonstrate the statistical functions. Its column values have certain patterns in them. For instance, COL1 increases sequentially from 1 to 30 while COL4 decreases sequentially from 30 to 1. The remaining columns tend to have the same value repeated, and some values repeat more than others. These values are used in both the unary and binary functions to illustrate the types of answers generated using these statistical functions.

The following table demonstrates the operation and output from the new statistical aggregate functions in V2R4.

Stats Table - contains 30 rows

Col1   Col2   Col3   Col4   Col5   Col6
(PK)

  1      1      1     30      1      0
  2      1      1     29      2      5
  3      3     10     28      3     10
  4      3     10     27      4     15
  5      3     10     26      5     20
  6      4     10     25      6     30
  7      5     10     24      7     30
  8      5     10     23      8     30
  9      5     10     22      9     35
 10      5     20     21     10     35
 11      7     20     20     22     40
 12      7     20     19     12     40
 13      9     20     18     13     45
 14      9     20     17     14     45
 15      9     20     16     15     50
 16      9     20     15     14     55
 17     10     20     14     13     55
 18     10     20     13     12     60
 19     10     20     12     11     60
 20     10     20     11      9     65
 21     10     20     10      8     65
 22     10     20      9      7     65
 23     13     20      8      6     70
 24     13     30      7      5     70
 25     13     30      6      4     80
 26     14     40      5      3     85
 27     15     40      4      2     90
 28     15     50      3      1     90
 29     16     50      2      1     95
 30     16     60      1      1    100

Figure 5-3

The KURTOSIS Function

The KURTOSIS function is used to return a number that represents the sharpness of a peak on a plotted curve of a probability function for a distribution compared with the normal distribution.

A high value result is referred to as leptokurtic. While a medium result is referred to as mesokurtic and a low result is referred to as platykurtic.

A positive value indicates a sharp or peaked distribution and a negative number represents a flat distribution. A peaked distribution means that one value exists more often than the other values. A flat distribution means that the same quantity of values exists for each number.

If you compare this to the row distribution associated within Teradata, most of the time a flat distribution is best, with the same number of rows stored on each AMP. Having skewed data represents more of a lumpy distribution.

Syntax for using KURTOSIS:

KURTOSIS(<column-name>)

The next SELECT uses KURTOSIS to compare the distribution of the Stats table:
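A sketch of the request (the table name Stats_Table is an assumed name for the Stats table in Figure 5-3):

SELECT KURTOSIS(Col1) AS KofCol1
      ,KURTOSIS(Col2) AS KofCol2
      ,KURTOSIS(Col3) AS KofCol3
      ,KURTOSIS(Col4) AS KofCol4
      ,KURTOSIS(Col5) AS KofCol5
      ,KURTOSIS(Col6) AS KofCol6
FROM Stats_Table ;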

1 Row Returned

KofCol1 KofCol2 KofCol3 KofCol4 KofCol5 KofCol6

-1 -1 1 -1 -1 -1


The SKEW Function

The SKEW function indicates that a distribution does not have equal probabilities above and below the mean (average). In a skewed distribution, the median and the mean are not coincident, or equal.

Where:
      a median value < mean value = a positive skew
      a median value > mean value = a negative skew
      a median value = mean value = no skew

Syntax for using SKEW:

SKEW(<column-name>)

The following SELECT uses SKEW to compare the distribution of the Stats table:
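A sketch of the request (Stats_Table assumed, as before):

SELECT SKEW(Col1) AS SKofCol1
      ,SKEW(Col2) AS SKofCol2
      ,SKEW(Col3) AS SKofCol3
      ,SKEW(Col4) AS SKofCol4
      ,SKEW(Col5) AS SKofCol5
      ,SKEW(Col6) AS SKofCol6
FROM Stats_Table ;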

1 Row Returned

SKofCol1 SKofCol2 SKofCol3 SKofCol4 SKofCol5 SKofCol6

0 -0 1 0 0 -0

The STDDEV_POP Function

The standard deviation function is a statistical measure of the spread or dispersion of values. It is the square root of the averaged squared differences from the mean (average). This measure is used to compare the amount by which a set of values differs from the arithmetic mean.

The STDDEV_POP function is one of two that calculate the standard deviation. The population consists of all the rows included, based on the comparison in the WHERE clause.

Syntax for using STDDEV_POP:

STDDEV_POP(<column-name>)


The next SELECT uses STDDEV_POP to determine the standard deviation on all columns of all rows within the Stats table:
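A sketch of the request (Stats_Table assumed):

SELECT STDDEV_POP(Col1) AS SDPofCol1
      ,STDDEV_POP(Col2) AS SDPofCol2
      ,STDDEV_POP(Col3) AS SDPofCol3
      ,STDDEV_POP(Col4) AS SDPofCol4
      ,STDDEV_POP(Col5) AS SDPofCol5
      ,STDDEV_POP(Col6) AS SDPofCol6
FROM Stats_Table ;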

1 Row Returned

SDPofCol1 SDPofCol2 SDPofCol3 SDPofCol4 SDPofCol5 SDPofCol6

9 4 14 9 4 27

The STDDEV_SAMP Function

The standard deviation function is a statistical measure of the spread or dispersion of values. It is the square root of the averaged squared differences from the mean (average). This measure is used to compare the amount by which a set of values differs from the arithmetic mean.

The STDDEV_SAMP function is one of two that calculate the standard deviation. The sample is a random selection of the rows returned, based on the comparisons in the WHERE clause; the population is all of the rows based on the WHERE clause.

Syntax for using STDDEV_SAMP:

STDDEV_SAMP(<column-name>)

The following SELECT uses STDDEV_SAMP to determine the standard deviation on all columns of a sample of the rows within the Stats table:
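A sketch of the request (Stats_Table assumed):

SELECT STDDEV_SAMP(Col1) AS SDSofCol1
      ,STDDEV_SAMP(Col2) AS SDSofCol2
      ,STDDEV_SAMP(Col3) AS SDSofCol3
      ,STDDEV_SAMP(Col4) AS SDSofCol4
      ,STDDEV_SAMP(Col5) AS SDSofCol5
      ,STDDEV_SAMP(Col6) AS SDSofCol6
FROM Stats_Table ;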

1 Row Returned

SDSofCol1 SDSofCol2 SDSofCol3 SDSofCol4 SDSofCol5 SDSofCol6


9 4 14 9 5 27

The VAR_POP Function

The Variance function is a measure of dispersion (spread of the distribution) as the square of the standard deviation. There are two forms of Variance in Teradata; VAR_POP is for the entire population of data rows allowed by the WHERE clause.

Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to elaborate. Most often variance is used in theoretical work where a variance of the sample is needed.

There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and Friedman two-way Analysis of Variance by rank.

Syntax for using VAR_POP:

VAR_POP(<column-name>)

The following SELECT uses VAR_POP to compare the variance of the distribution on all rows from the Stats table:
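A sketch of the request (Stats_Table assumed):

SELECT VAR_POP(Col1) AS VPofCol1
      ,VAR_POP(Col2) AS VPofCol2
      ,VAR_POP(Col3) AS VPofCol3
      ,VAR_POP(Col4) AS VPofCol4
      ,VAR_POP(Col5) AS VPofCol5
      ,VAR_POP(Col6) AS VPofCol6
FROM Stats_Table ;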

1 Row Returned

VPofCol1 VPofCol2 VPofCol3 VPofCol4 VPofCol5 VPofCol6

75 19 191 75 20 723

The VAR_SAMP Function

The Variance function is a measure of dispersion (spread of the distribution) as the square of the standard deviation. There are two forms of Variance in Teradata; VAR_SAMP is used for a random sampling of the data rows allowed through by the WHERE clause.


Although standard deviation and variance are regularly used in statistical calculations, the meaning of variance is not easy to elaborate. Most often variance is used in theoretical work where a variance of the sample is needed to look for consistency.

There are two methods for using variance. These are the Kruskal-Wallis one-way Analysis of Variance and Friedman two-way Analysis of Variance by rank.

Syntax for using VAR_SAMP:

VAR_SAMP(<column-name>)

The next SELECT uses VAR_SAMP to compare the variance of the distribution on a row sample from the Stats table:
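A sketch of the request (Stats_Table assumed):

SELECT VAR_SAMP(Col1) AS VSofCol1
      ,VAR_SAMP(Col2) AS VSofCol2
      ,VAR_SAMP(Col3) AS VSofCol3
      ,VAR_SAMP(Col4) AS VSofCol4
      ,VAR_SAMP(Col5) AS VSofCol5
      ,VAR_SAMP(Col6) AS VSofCol6
FROM Stats_Table ;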

1 Row Returned

VSofCol1 VSofCol2 VSofCol3 VSofCol4 VSofCol5 VSofCol6

78 20 198 78 20 748

The CORR Function

The CORR function is a binary function, meaning that two variables are used as input to it. It measures the association between 2 random variables. If the variables are such that when one changes the other does so in a related manner, they are correlated. Independent variables are not correlated because the change in one does not necessarily cause the other to change.

The correlation coefficient is a number between -1 and 1. It is calculated from a number of pairs of observations or linear points (X,Y).

Where:
      1 = perfect positive correlation
      0 = no correlation
      -1 = perfect negative correlation


Syntax for using CORR:

CORR(<column-name>, <column-name>)

The following SELECT uses CORR to compare the association of values stored in various columns from the Stats table:
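A sketch of the request (Stats_Table assumed):

SELECT CORR(Col1, Col2) AS CofCol1#2
      ,CORR(Col1, Col3) AS CofCol1#3
      ,CORR(Col1, Col4) AS CofCol1#4
      ,CORR(Col1, Col5) AS CofCol1#5
      ,CORR(Col1, Col6) AS CofCol1#6
FROM Stats_Table ;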

1 Row Returned

CofCol1#2 CofCol1#3 CofCol1#4 CofCol1#5 CofCol1#6

0.986480 0.885155 -1.000000 -0.151877 0.991612

Since there are two column values passed to this function and the first example has data values that sequentially ascend, the next example uses col4 as the first value because it sequentially descends. It demonstrates the impact of this sequence change on the result:

1 Row Returned

CofCol4#2 CofCol4#3 CofCol4#1 CofCol4#5 CofCol4#6

-0.986480 -0.885155 -1.000000 0.151877 -0.991612

The COVAR Function

The covariance is a statistical measure of the tendency of two variables to change in conjunction with each other. It is equal to the product of their standard deviations and correlation coefficients.

The covariance is a statistic used for bivariate samples or bivariate distribution. It is used for working out the equations for regression lines and the product-moment correlation coefficient.


Syntax:

COVAR(<column-name>, <column-name>)

The next SELECT uses COVAR to compare the covariance association of values stored in various columns from the Stats table:
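A sketch of the request (Stats_Table assumed):

SELECT COVAR(Col1, Col2) AS CVofCol1#2
      ,COVAR(Col1, Col3) AS CVofCol1#3
      ,COVAR(Col1, Col4) AS CVofCol1#4
      ,COVAR(Col1, Col5) AS CVofCol1#5
      ,COVAR(Col1, Col6) AS CVofCol1#6
FROM Stats_Table ;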

1 Row Returned

CVofCol1#2 CVofCol1#3 CVofCol1#4 CVofCol1#5 CVofCol1#6

38 106 -75 -6 231

Since there are two column values passed to this function and the first example has data values that sequentially ascend, the next example uses col4 as the first value because it sequentially descends. It demonstrates the impact of this sequence change on the result:

1 Row Returned

CvofCol4#2 CvofCol4#3 CvofCol4#1 CvofCol4#5 CvofCol4#6

-37 -106 -75 6 -231

The REGR_INTERCEPT Function

A regression line is a line of best fit, drawn through a set of points on a graph for X and Y coordinates. It uses the Y coordinate as the Dependent Variable and the X value as the Independent Variable.

Two regression lines always meet, or intercept, at the mean of the data points (x,y), where x=AVG(x) and y=AVG(y); this point is not usually one of the original data points.


Syntax for using REGR_INTERCEPT:
REGR_INTERCEPT(dependent-expression, independent-expression)

The following SELECT uses REGR_INTERCEPT to find the intercept point between the values stored in various columns from the Stats table:
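A sketch, with col1 as the dependent expression and each of the other assumed columns as the independent expression:

SEL REGR_INTERCEPT(col1, col2) AS RIofCol1#2, REGR_INTERCEPT(col1, col3) AS RIofCol1#3
   ,REGR_INTERCEPT(col1, col4) AS RIofCol1#4, REGR_INTERCEPT(col1, col5) AS RIofCol1#5
   ,REGR_INTERCEPT(col1, col6) AS RIofCol1#6
FROM Stats_table;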

1 Row Returned

RIofCol1#2 RIofCol1#3 RIofCol1#4 RIofCol1#5 RIofCol1#6

-1 3 31 18 -1

Since there are two column values passed to this function and the first example has data values that sequentially ascend, the next example uses col4 as the first value because it sequentially descends. It demonstrates the impact of this sequence change on the result:

1 Row Returned

RIofCol4#2 RIofCol4#3 RIofCol4#1 RIofCol4#5 RIofCol4#6

32 28 0 13 32

The REGR_SLOPE Function

A regression line is a line of best fit, drawn through a set of points on a graph of X and Y coordinates. It uses the Y coordinate as the Dependent Variable and the X value as the Independent Variable.

The slope of the line is the angle at which it moves on the X and Y coordinates. The vertical slope is Y on X and the horizontal slope is X on Y.


Syntax for using REGR_SLOPE:
REGR_SLOPE(dependent-expression, independent-expression)

The next SELECT uses REGR_SLOPE to find the slope for the values stored in various columns from the Stats table:
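A sketch, following the same pattern as the REGR_INTERCEPT example:

SEL REGR_SLOPE(col1, col2) AS RSofCol1#2, REGR_SLOPE(col1, col3) AS RSofCol1#3
   ,REGR_SLOPE(col1, col4) AS RSofCol1#4, REGR_SLOPE(col1, col5) AS RSofCol1#5
   ,REGR_SLOPE(col1, col6) AS RSofCol1#6
FROM Stats_table;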

1 Row Returned

RSofCol1#2 RSofCol1#3 RSofCol1#4 RSofCol1#5 RSofCol1#6

2 1 -1 -0 0

Since there are two column values passed to this function and the first example has data values that sequentially ascend, the next example uses col4 as the first value because it sequentially descends. It demonstrates the impact of this sequence change on the result:

1 Row Returned

RSofCol4#2 RSofCol4#3 RSofCol4#1 RSofCol4#5 RSofCol4#6

-2 -1 1 0 -0

Using GROUP BY

Like the original aggregates, the new statistical aggregates may also be combined with non-aggregate columns. The GROUP BY is used to identify and form groups for each unique value in the selected non-aggregate column.

Likewise, the new statistical aggregates are compatible with the original aggregates as seen in the following SELECT:
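A sketch of such a query; the use of STDDEV_POP behind the SD headings is an assumption:

SEL col3, COUNT(*) AS Cnt
   ,AVG(col1) AS Avg1, STDDEV_POP(col1) AS SD1, VAR_POP(col1) AS VP1
   ,AVG(col4) AS Avg4, STDDEV_POP(col4) AS SD4, VAR_POP(col4) AS VP4
   ,AVG(col6) AS Avg6, STDDEV_POP(col6) AS SD6, VAR_POP(col6) AS VP6
FROM Stats_table
GROUP BY col3
ORDER BY col3;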


7 Rows Returned

col3  Cnt  Avg1  SD1  VP1  Avg4  SD4  VP4  Avg6  SD6  VP6
   1    2     2    0    0    30    0    0     2    2    6
  10    7     6    2    4    25    2    4    24    9   74
  20   14    16    4   16    14    4   16    54   11  116
  30    2    24    0    0     6    0    0    75    5   25
  40    2    26    0    0     4    0    0    88    2    6
  50    2    28    0    0     2    0    0    92    2    6
  60    1    30    0    0     1    0    0   100    0    0

Use of HAVING

Also like the original aggregates, the HAVING may be used to eliminate specific output lines based on one or more of the final aggregate values.

The next SELECT uses the HAVING to perform a compound comparison on both the count and the covariance:
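A sketch of such a query; the specific thresholds in the HAVING are illustrative assumptions:

SEL col3, COUNT(*) AS Cnt
   ,AVG(col1) AS Avg1, STDDEV_POP(col1) AS SD1, VAR_POP(col1) AS VP1
FROM Stats_table
GROUP BY col3
HAVING COUNT(*) > 2 AND COVAR(col1, col2) > 0
ORDER BY col3;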


2 Rows Returned

col3  Cnt  Avg1  SD1  VP1
  10    7     6    2    4
  20   14    16    4   16


Using the DISTINCT Function with Aggregates

At times throughout this book, examples are shown using a function within a function and the power it provides. The COUNT aggregate provides another opportunity to demonstrate a capability that might prove itself useful. It combines the DISTINCT and aggregate functions.

The following may be used to determine how many courses are being taken instead of the total number of students (10) with a valid class code:
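A sketch, assuming the Student table (named Student_table here) with its class_code and grade_pt columns:

SEL COUNT(DISTINCT class_code) AS Unique_Courses
   ,COUNT(DISTINCT grade_pt)   AS Unique_GPA
FROM Student_table;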

1 Row Returned

Unique_Courses Unique_GPA

4 9

Note: Prior to V2R5, you can only use a single column for all DISTINCT operations inside of aggregates.

Versus using all of the values:

1 Row Returned

Courses GPAs

9 9

It is allowable to use the DISTINCT in multiple aggregates within a SELECT. However, prior to V2R5 there was a restriction that required all of the aggregates to use the same column for each DISTINCT function. Now, different column names can be used.


Aggregates and Very Large Data Bases (VLDB)

As great as huge databases might be, there are considerations to take into account when processing large numbers of rows. This section enumerates a few of the situations that might be encountered. Read them and think about the requirement or benefit of incorporating them into your SQL.

Potential of Execution Error

Aggregates use the data type of the column they are aggregating. On most databases, this works fine. However, when working on a VLDB, this may cause the SELECT to fail on a numeric overflow condition. An overflow occurs when the value being calculated exceeds the maximum size or value for the data type being used.

For example, one billion (1,000,000,000) is a valid value for an integer column because it is less than 2,147,483,647. However, if three rows each have one billion as their value and a SUM operation is performed, it fails on the third row.

Try the following series of commands to demonstrate an overflow and its fix.

Create a table called Overflow_tbl with 2 columns:

CT Overflow_tbl (Ovr_byte BYTEINT, Ovr_int INT);

Insert 3 rows, each with the very large value of 1 billion, where the maximum INTEGER value is 2,147,483,647:

INS overflow_tbl values (1, 10**9);
INS overflow_tbl values (2, 10**9);
INS overflow_tbl values (3, 10**9);

A SUM aggregate on these values will attempt to result in 3 billion:

SEL SUM(ovr_int) AS sum_col FROM overflow_tbl;

***** 2616 numeric overflow

Attempting this SUM, as written, results in a 2616 numeric overflow error. That is because 3 billion is too large to be stored in an integer, the default data type inherited from the column being used within the aggregate. To fix it, use either of the following techniques to convert the data column to a different type before performing the aggregation.
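Both techniques are shown below; the first uses the ANSI CAST and the second uses the Teradata shorthand cast (DECIMAL(18,0) is one workable target type, and FLOAT would also avoid the overflow):

SEL SUM(CAST(ovr_int AS DECIMAL(18,0))) AS sum_col FROM overflow_tbl;

SEL SUM(ovr_int (DECIMAL(18,0))) AS sum_col FROM overflow_tbl;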


1 Row Returned

sum_col

3,000,000,000

Whenever you find yourself in a situation where the SQL is failing due to a numeric overflow, it is most likely due to the inherited data type of the column. When this happens, be sure to convert the type before doing the math.

GROUP BY versus DISTINCT

As seen in chapter 2, DISTINCT is used to eliminate duplicate values. In this chapter, the GROUP BY is used to consolidate multiple rows with the same value into the same group. It does the consolidation by eliminating duplicates. On the surface, they provide the same functionality.

The next SELECT uses GROUP BY without aggregation to eliminate duplicates:
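A sketch, again assuming the Student_table naming:

SEL class_code
FROM Student_table
GROUP BY class_code
ORDER BY class_code;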

5 Rows Returned

class_code
?
FR
JR
SO
SR

The GROUP BY without aggregation returns the same rows as the DISTINCT. So the obvious question becomes, which is more efficient?


The answer is not a simple one. Instead, something must be known about the characteristics of the data. Generally, with more duplicate data values, GROUP BY is more efficient. However, if only a few duplicates exist, DISTINCT is more efficient. To understand the reason, it is important to know how each of them eliminates the duplicate values.

Technique used to eliminate duplicates (can be seen in EXPLAIN):

DISTINCT
  - Reads a row on each AMP
  - Hashes the column(s) value identified in the DISTINCT
  - Redistributes the row value to the appropriate AMP
  - Once all participating rows have been redistributed:
      - Sorts the data to combine duplicates on each AMP
      - Eliminates duplicates on each AMP

GROUP BY
  - Reads all the participating rows
  - Eliminates duplicates on each AMP using “buckets”
  - Hashes the unique values on each AMP
  - Redistributes the unique values to the appropriate AMP
  - Once all unique values have been redistributed from every AMP:
      - Sorts the unique values to combine duplicates on each AMP
      - Eliminates duplicates on each AMP

Back to the original question: which is more efficient?

Since DISTINCT redistributes the rows immediately, more data may move between the AMPs than with GROUP BY, which only sends unique values between the AMPs. So, GROUP BY sounds more efficient. However, if the data is nearly unique, GROUP BY wastes time attempting to eliminate duplicates that do not exist, and then it must redistribute nearly the same amount of data anyway.

Therefore, for efficiency, when there are:
  - Many duplicates – use GROUP BY
  - Few to no duplicates – use DISTINCT
  - SPOOL space is exceeded – try GROUP BY


Performance Opportunities

The Teradata optimizer has always had options available to it when performing SQL. It always attempts to use the most efficient path to provide the answer set. This is true for aggregation as well.

When performing aggregation, the main shortcut available might include the use of a secondary index. The index row is maintained in a subtable. This row contains the row ID (row hash + uniqueness value) and the actual data value stored in the data row. Therefore, an index row is normally much shorter than a data row. Hence, more index rows exist in an index block than in a data block.

As a result, the read of an index block makes more values available than the actual data block. Since I/O is the slowest operation on all computer systems, less I/O equates to faster processing. If the optimizer can obtain all the values it needs for processing by using the secondary index, it will. This is referred to as a “covered query.”

The creation of a secondary index is covered in this book as part of the Data Definition Language (DDL) chapter.


Subquery

The subquery is a commonly used technique and a powerful way to select rows from one table based on values in another table. It is predicated on the use of a SELECT statement within a SELECT and takes advantage of the relationships built into a relational database. The basic concept behind a subquery is that it retrieves a list of values that are used for comparison against one or more columns in the main query. To accomplish the comparison, the subquery is written after the WHERE clause and normally as part of an IN list.

In an earlier chapter, the IN was used to build a value list for comparison against the rows of a table to determine which rows to select. The next example illustrates how this technique can be used to SELECT all the columns for rows containing any of the three different values 10, 20 and 30:
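A sketch of that earlier technique, assuming a table named Table1 with the two columns shown in the answer set:

SEL *
FROM Table1
WHERE Column1 IN (10, 20, 30);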

4 Rows Returned

Column1  Column2
10       A row with 10 in column1
30       A row with 30 in column1
10       A row with 10 in column1
20       A row with 20 in column1

As powerful as this is, what if the three values turned into a thousand values? That is too much work and too many opportunities to forget one of the values. Instead of writing the values manually, a subquery can be used to generate the values automatically.

The coding technique of a subquery replaces the values previously written in the IN list with a valid SELECT. Then the subquery SELECT dynamically generates the value list. Once the values have been retrieved, it eliminates the duplicates by automatically performing a DISTINCT.

The following is the syntax for a subquery:
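In general form:

SELECT <column-name1> [ ,<column-name2> ... ]
FROM <table-name1>
WHERE <column-name> IN
    (SELECT <column-name> FROM <table-name2>) ;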


Conceptually, the subquery is processed first so that all the values are expanded into the list for comparison with the column specified in the WHERE clause. These values in the subquery SELECT can only be used for comparison against the column or columns referenced in the WHERE.

Columns inside the subquery SELECT cannot be returned to the user via the main SELECT. The only columns available to the client are those in the tables named in the main (first) FROM clause. The query in parentheses is called the subquery and it is responsible for building the IN list.

At the writing of this document, Teradata allows up to 64 tables in a single query. Therefore, if each SELECT accessed only one table, a query might contain 63 subqueries in a single statement.

The next two tables are used to demonstrate the functionality of subqueries:

Customer Table - contains 5 customers

Customer_number  Customer_name        Phone_number
PK
UPI              NUSI                 NUSI
11111111         Billy’s Best Choice  555-1234
31313131         Acme Products        555-1111
31323134         ACE Consulting       555-1212
57896883         XYZ Plumbing         347-8954
87323456         Databases N-U        322-1012

Figure 6-1

Order Table - contains 5 orders

Order_number  Customer_number  Order_date  Order_total
PK            FK
UPI           NUSI             NUSI
123456        11111111         980504      12347.53
123512        11111111         990101       8005.91
123552        31323134         991001       5111.47
123585        87323456         991010      15231.62
123777        57896883         990909      23454.84

Figure 6-2

The next SELECT uses a subquery to find all customers that have an order of more than $10,000.00:
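A possible form of this query, assuming the tables are named Customer_table and Order_table:

SEL Customer_name, Phone_number
FROM Customer_table
WHERE Customer_number IN
    (SELECT Customer_number FROM Order_table
     WHERE Order_total > 10000) ;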

3 Rows Returned

Customer_name        Phone_number
Billy’s Best Choice  555-1234
XYZ Plumbing         347-8954
Databases N-U        322-1012

This is an appropriate place to mention that the columns being compared between the main and subqueries must be from the same domain. Otherwise, if no equal condition exists, no rows are returned. The above SELECT uses the customer number (FK) in the Order table to match the customer number (PK) in the Customer table. They are both customer numbers and therefore have the opportunity to compare equal from both tables.

The next subquery swaps the queries to find all the orders by a specific customer:
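A sketch, under the same table-name assumptions; the customer name literal is an assumed way to identify the specific customer:

SEL Order_number, Order_total
FROM Order_table
WHERE Customer_number IN
    (SELECT Customer_number FROM Customer_table
     WHERE Customer_name = 'Billy''s Best Choice') ;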


2 Rows Returned

Order_number  Order_total
123456        12347.53
123512         8005.91

Notice that the Customer table is used in the main query to answer a customer question and the Order table is used in the main query to answer an order question. However, they both compare on the customer number as the common domain between the two tables.

Both of the previous subqueries work fine for comparing a single column in the main table to a value list in the subquery. Thus, it is possible to answer questions like, “Which customer has placed the largest order?” However, it cannot answer this question, “What is the maximum order for each customer?”

To make subqueries more sophisticated and powerful, they can compare more than one column at a time. The multiple columns are referenced in the WHERE clause of the main query and are also enclosed in parentheses.

The key is this: if multiple columns are named before the IN portion of the WHERE clause, the exact same number of columns must be referenced in the SELECT of the subquery to obtain all the required values for comparison.

Furthermore, the corresponding columns (outside and inside the subquery) must respectively be of the same domain. Each of the columns must be equal to a corresponding value in order for the row to be returned. It works like an AND comparison.

The following SELECT uses a subquery to match two columns with two values in the subquery to find the highest dollar orders for each customer:
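One way to write this query, under the same assumptions:

SEL Customer_number AS Customer, Order_number, Order_total
FROM Order_table
WHERE (Customer_number, Order_total) IN
    (SELECT Customer_number, MAX(Order_total)
     FROM Order_table
     GROUP BY Customer_number) ;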

4 Rows Returned


Customer  Order_number  Order_total
11111111  123456        12347.53
57896883  123777        23454.84
31323134  123552         5111.47
87323456  123585        15231.62

Although this works well for MIN and MAX type of values (equalities), it does not work well for finding values greater than or less than an average. For this type of processing, a Correlated subquery is the best solution and will be demonstrated later in this chapter.

Since 64 tables can be in a single Teradata SQL statement, as mentioned previously, this means that a maximum of 63 subqueries can be written into a single statement. The following shows a 3-table access using two separate subqueries. Additional subqueries simply follow the same pattern.

From the above tables, it is also possible to find the customer who has ordered the single highest dollar amount order. To accomplish this, the Order table must be used to determine the maximum order. Then, the Order table is used again to compare the maximum with each order and finally, compared to the Customer Table to determine which customer placed the order.

The next subquery can be used to find them:
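A sketch of the two-subquery approach described above:

SEL Customer_name, Phone_number
FROM Customer_table
WHERE Customer_number IN
    (SELECT Customer_number FROM Order_table
     WHERE Order_total IN
        (SELECT MAX(Order_total) FROM Order_table)) ;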

1 Row Returned

Customer_name Phone_number

XYZ Plumbing 347-8954

It is now known that XYZ Plumbing has the highest dollar order. What is not known is the amount of the order. Since the order total is in the Order table, which is not referenced in the main query, it cannot be part of the SELECT list.


In order to see the order total, a join must be used. Joins will be covered in the next chapter.

Using NOT IN

As seen in a previous chapter, when using the IN and a value list, the NOT IN can be used to find all of the rows that do not match.

Using this technique, the subquery above can be modified to find the customers without an order. The only changes made are to 1) add the NOT before the IN and 2) eliminate the WHERE clause in the subquery:
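The modified query might look like this:

SEL Customer_name, Phone_number
FROM Customer_table
WHERE Customer_number NOT IN
    (SELECT Customer_number FROM Order_table) ;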

1 Row Returned

Customer_name  Phone_number
Acme Products  555-1111

Caution needs to be used regarding the NOT IN when there is a potential for including a NULL in the value list. Since the comparison of a NULL and any other value is unknown, and the NOT of an unknown is still unknown, no rows are returned. Therefore, when there is potential for a NULL in the subquery, it is best to also code a compound comparison, as seen in the following SELECT:
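A sketch of the NULL-safe form:

SEL Customer_name, Phone_number
FROM Customer_table
WHERE Customer_number NOT IN
    (SELECT Customer_number FROM Order_table
     WHERE Customer_number IS NOT NULL) ;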

Using Quantifiers

In other RDBMS systems and early Teradata versions, using an equality symbol (=) in a comparison normally proved to be more efficient than using an IN list. The reason was that it allowed for indices, if they existed, to be used instead of a sequential read of all rows. Today, Teradata automatically uses indices whenever they are more efficient. So, the use of quantifiers is optional and an IN works exactly the same.


Another powerful use for quantifiers is when using inequalities. It is sometimes necessary to find all rows that are greater than or less than one or more other values.

To use quantifiers, replace the IN with an =, <, >, ANY, SOME or ALL as demonstrated in the following syntax:
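In general form:

SELECT <column-name1> [ ,<column-name2> ... ]
FROM <table-name1>
WHERE <column-name> { = | < | > } { ANY | SOME | ALL }
    (SELECT <column-name> FROM <table-name2>) ;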

Earlier in this chapter, a two level subquery was used to find the customer who spent the most money on a single order. It used an IN list to find equal values. The next SELECT uses = ANY to find the same customers:
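A plausible reconstruction, under the earlier table-name assumptions; the inner predicate (orders above the average order amount, per the explanation below) is an assumption:

SEL Customer_name, Phone_number
FROM Customer_table
WHERE Customer_number = ANY
    (SELECT Customer_number FROM Order_table
     WHERE Order_total > (SELECT AVG(Order_total) FROM Order_table)) ;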

2 Rows Returned

Customer_name        Phone_number
Billy’s Best Choice  555-1234
XYZ Plumbing         347-8954

In order to accomplish this, the Order table is first used to determine the average order amount. Then, the Order table is used again to compare the average with each order and finally, compared to the Customer table to determine which of the customers qualify.

The quantifiers of SOME and ANY are interchangeable. However, the use of ANY conforms to ANSI standard and SOME is the Teradata extension. The = ANY is functionally equivalent to using an IN list.

The ALL and the = are more limited in their scope. In order for them to work, there can only be a single value from the subquery for each of the values in the WHERE clause. However, earlier the NOT IN was explored. When using quantifiers and the NOT, consider the following:

Equivalency Chart

IN is equivalent to = ANY
NOT IN is equivalent to NOT = ALL

Figure 6-3

Of these, the NOT = ALL takes the most thought. It forces the system to examine every value in the list to make sure that the value being compared is checked against all of the values (ALL). Otherwise, as soon as any one of the values is different, the row would be returned without looking at the rest.

Although the above describes the conceptual approach of a subquery, the Teradata optimizer will normally use a join to optimize and locate the rows that are needed from within the database. This usage may be seen using the EXPLAIN. Joins are discussed in the next chapter.


Qualifying Table Names and Creating a Table Alias

This section provides techniques to specifically reference tables and columns throughout all databases, and to temporarily rename tables with an alias name. Both of these techniques are necessary to provide specific and unique names to the optimizer at SQL execution time.

Qualifying Column Names

Since column names within a table must be unique, the system knows which data to access simply by using the column name. However, when more than one table is referenced by the FROM in a single SELECT, this may not be the case. The potential exists for columns of the same domain to have the same name in more than one table. When this happens, the system does not guess which column to reference. The SQL must explicitly declare which table to use for accessing the column.

This declaration is called qualifying the column name. If the SQL does not qualify the column name appearing in more than one table, the system displays an error message that indicates too much ambiguity exists in the query. Correlated subqueries, addressed next, and join processing, in the next chapter, both make use of more than one table at the same time. Therefore, many times it is important to make sure the system knows which table’s columns to use for all portions of the SQL statement.

To qualify a column name, the table name and column name are connected using a period, sometimes referred to as a dot (.). The dot connects the names without a space so that the two names work as a single reference name. However, if the column has different names in the multiple tables, there is no confusion within the system and therefore no need to qualify the name.

To illustrate this concept, let's consider people instead of tables. For instance, Mike is a common name. If two Mikes are in different rooms and someone uses the name in either room, there is no confusion. However, if both Mikes are in the same room and someone uses the name, both Mikes respond and therefore confusion exists. To eliminate the conflict, the use of the first and last names makes the identification unique.

The syntax for using qualification levels follows:


3-level reference: <database-name>.<table-name>.<column-name>
2-level reference: <database-name>.<table-name>
2-level reference: <table-name>.<column-name>

Whenever all 3 levels are used, the first name is always the database, the second is the table and the last is the column. However, when two names appear in a 2-level qualification, the location of the names within the SQL must be examined to know for sure their meaning. Since the FROM names the tables, the first name of the qualified names is a database name and the second is a table. Since columns are referenced in the SELECT list and WHERE clause, the first name is a table name and the second is an * (all columns) or a single column.

In Teradata, the following is a valid statement, including the abbreviation for SELECT and a missing FROM:
SEL DBC.TABLES.* ;

This technique is not ANSI standard; however, the PE has everything needed to get all columns and rows out of the TABLES table in the DBC database.

Creating an Alias for a Table

Since table names can be up to 30 characters long, to save typing when the name is used more than once, a commonly used technique is to provide a temporary name for the table within the SELECT. The new temporary name for a table is called an alias name.

Once the alias is created for the table, it is important to use the alias name throughout the request. Otherwise, the system looks at the use of the full table name as another table and it causes undesirable results. To establish an alias for a table, in the FROM, simply follow the name of the table with an AS:
FROM <table-name> AS <table-alias-name>


Correlated Subquery Processing

The correlated subquery is a very powerful tool. It is an excellent technique to use when there is a need to determine which rows to SELECT based on one or more values from another table. This is especially true when the value for comparison is based on an aggregate. It combines subquery processing and join processing into a single request.

For example, one Teradata user has the need to bill their customers and incorporate the latest payment date. Therefore, the latest date needs to be obtained from the table. So, the payment date is found using the MAX aggregate in the subquery. However, it must be the latest payment date for that customer, which might be different for each customer. The processing involves the subquery locating the maximum date only for one customer account.

The correlated subquery is perfect for this processing. It is more efficient and faster than using a normal subquery with multiple values. One reason for its speed is that it can perform some processing steps in parallel, as seen in an EXPLAIN. The other reason is that it only finds the maximum date when a particular account is read for processing, not for all accounts like a normal subquery.

The operation for a correlated subquery differs from that of a normal subquery. Instead of comparing the selected subquery values against all the rows in the main query, the correlated subquery works backward. It first reads a row in the main query, and then goes into the subquery to find all the rows that match the specified column value. Then, it gets the next row in the main query and retrieves all the subquery rows that match the next value in this row. This processing continues until all the qualifying rows from the main SELECT are satisfied. Although this sounds terribly inefficient and is inefficient on other databases, it is extremely efficient in Teradata. This is due to the way the AMPs handle this type of request. The AMPs are smart enough to remember and share each value that is located.

Thus, when a second row comes into the comparison that contains the same value as an earlier row, there is no need to re-read the matching rows again. That operation has already been done once and the AMPs remember the answer from the first comparison.

The following is the syntax for writing a correlated subquery:
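In general form, with the subquery referencing a column of the table in the main query:

SELECT <column-name1> [ ,<column-name2> ... ]
FROM <table-name> [AS] <alias-1>
WHERE <column-name> = (SELECT <aggregate>(<column-name>)
                       FROM <table-name> [AS] <alias-2>
                       WHERE <alias-1>.<column-name> = <alias-2>.<column-name>) ;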


The subquery does not have a semi-colon of its own. The SELECT in the subquery is all part of the same primary query and shares the one semi-colon.

The aggregate value is normally obtained using MIN, MAX or AVG. Then this aggregate value is in turn used to locate the row or rows within a table that compares equals, less than or greater than this value.

This table is used to demonstrate correlated subqueries:

Employee Table - contains 9 employees

Employee_No  Last_Name   First_name  Salary      Dept_No
PK                                               FK
UPI          NUSI                                NUSI
1232578      Chambers    Mandee      $48,850.00  100
1256349      Harrison    Herbert     $54,500.00  400
2341218      Reilly      William     $36,000.00  400
2312225      Larkins     Loraine     $40,200.00  300
2000000      Jones       Squiggy     $32,800.50  ?
1000234      Smythe      Richard     $64,300.00  10
1121334      Strickling  Cletus      $54,500.00  400
1324657      Coffing     Billy       $41,888.88  200
1333454      Smith       John        $48,000.00  200

Figure 6-4

Using the above table, this Correlated subquery finds the highest paid employee in each department:
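A sketch, assuming the table is named Employee_table; the aliases emp and emt match those described below:

SEL Last_name, First_name, Dept_no, Salary
FROM Employee_table AS emp
WHERE Salary = (SELECT MAX(Salary) FROM Employee_table AS emt
                WHERE emp.Dept_no = emt.Dept_no)
ORDER BY Dept_no ;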


6 Rows Returned

Last_name   First_name  Dept_no  Salary
Smythe      Richard      10      $64,300.00
Chambers    Mandee      100      $48,850.00
Smith       John        200      $48,000.00
Larkins     Loraine     300      $40,200.00
Harrison    Herbert     400      $54,500.00
Strickling  Cletus      400      $54,500.00

Notice that both of the tables have been assigned alias names (emp for the main query and emt for the correlated subquery). Because the same Employee table is used in the main query and the subquery, one of them must be assigned an alias. The aliases are used in the subquery to qualify and match the common domain values between the two tables. This coding technique “correlates” the main query table to the one in the subquery.

The following Correlated subquery uses the AVG function to find all employees who earn less than the average pay in their department:
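A sketch, under the same assumptions:

SEL Last_name, First_name, Dept_no, Salary
FROM Employee_table AS emp
WHERE Salary < (SELECT AVG(Salary) FROM Employee_table AS emt
                WHERE emp.Dept_no = emt.Dept_no)
ORDER BY Dept_no ;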

5 Rows Returned

Last_name  First_name  Dept_no  Salary
Smythe     Richard      10      $64,300.00
Chambers   Mandee      100      $48,850.00
Coffing    Billy       200      $41,888.88
Larkins    Loraine     300      $40,200.00
Reilly     William     400      $36,000.00

Earlier in this chapter, it was indicated that a column from the subquery cannot be referenced in the main query. This is still true. However, nothing is wrong with using one or more column references from the main query within the subquery to create a Correlated subquery.


EXISTS

Another powerful resource that can be used with a correlated subquery is the EXISTS. It provides a true-false test within the WHERE clause.

In the syntax that follows, it is used to test whether or not a single row is returned from the subquery SELECT:
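In general form:

SELECT <column-name1>
FROM <table-name1>
WHERE EXISTS (SELECT * FROM <table-name2>
              WHERE <table-name1>.<column-name> = <table-name2>.<column-name>) ;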

If a row is found, the EXISTS test is true, and conversely, if a row is not found, the result is false. When a true condition is determined, the value in the SELECT is returned from the main query. When the condition is determined to be false, no rows are selected.

Since EXISTS returns one or no rows, it is a fast way to determine whether or not a condition is present within one or more database tables. The correlated subquery can also be part of a join to add another level of test. It has potential to be very sophisticated.

As an example, to find all customers that have not placed an order, the NOT IN subquery can be used. Remember, when you use the NOT IN clause, the NULL needs to be considered and eliminated using the IS NOT NULL check in the subquery. When using the NOT EXISTS with a correlated subquery, the same answer is obtained; it is faster than a normal subquery, and there is no concern about getting a NULL into the subquery. These next examples show the EXISTS and the NOT EXISTS tests.

Notice that the next SELECT is the same correlated subquery as seen earlier, except here it is utilizing the subquery to find all customers with orders:
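A sketch, under the usual table-name assumptions; the aliases shorten the correlation in the subquery:

SEL Customer_name
FROM Customer_table AS cust
WHERE EXISTS (SELECT * FROM Order_table AS ord
              WHERE cust.Customer_number = ord.Customer_number) ;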

4 Rows Returned


Customer_name
Ace Consulting
Databases N-U
Billy's Best Choice
XYZ Plumbing

By changing the EXISTS to a NOT EXISTS, the next SELECT finds all customers without orders:

1 Row Returned

Customer_name

Acme Products

Since the Customer and Order tables are used in the above Correlated subquery, the table names did not require an alias. However, it was done to shorten the names to ease the equality coding in the subquery.

An added benefit of this technique (NOT EXISTS) is that the presence of a NULL does not affect the performance. Notice that in both subqueries, the asterisk (*) is used for the columns. Since it is a true or false test, the columns are not used and it is the shortest way to code the SELECT. If the column in the subquery table is a Primary Index or a Unique Secondary Index, the correlated subquery can be very fast.

The examples in this chapter only use a single column for the correlation. However, it is common to use more than one column from the main query in the correlated subquery. Although the techniques presented in this last chapter seem relatively simple, they can be very powerful. Understanding subqueries and Correlated subqueries can help you unleash the power.


Join Processing

A join is the combination of two or more tables in the same FROM of a single SELECT statement. When writing a join, the key is to locate a column in both tables that is from a common domain. Like the correlated subquery, joins are normally based on an equal comparison between the join columns.

An example of a common domain column might be a customer number. Whether it represents a particular customer, as the primary key, in the Customer table, or the customer that placed a specific order, as a foreign key, in the Order table, it represents the same entity in both tables. Without a common value, a match cannot be made and therefore, no rows can be selected using a join. An equality join returns matching rows.

Any answer set that a subquery can return, a join can also provide. Unlike the subquery, a join lists all of its tables in the same FROM clause of the SELECT. Therefore, columns from multiple tables are available for return to the user. The desired columns are the main factor in deciding whether to use a join or a subquery. If only columns from a single table are desired, either a subquery or a join works fine. However, if columns from more than one table are needed, a join must be used. In Version 2 Release 3, the number of tables allowed in a single join increased from sixteen (16) to sixty-four (64) tables.


Original Join Syntax

The SQL join is a traditional and powerful tool in a relational database. The first difference between a join and a single table SELECT is that multiple tables are listed using the FROM clause. The first technique, shown below, uses a comma between the table names. This is the same technique used when listing multiple columns in the SELECT, ORDER BY or most other areas that allow for the identification of more than one object.

The following is the original join syntax for a two-table join:
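In general form:

SELECT [<table-name1>.]<column-name>
      ,[<table-name2>.]<column-name>
FROM <table-name1>, <table-name2>
WHERE <table-name1>.<column-name> = <table-name2>.<column-name> ;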

The following tables will be used to demonstrate the join syntax:

Customer Table - contains 5 customers

Customer_number  Customer_name        Phone_number
PK
UPI              NUSI                 NUSI
11111111         Billy’s Best Choice  555-1234
31313131         Acme Products        555-1111
31323134         ACE Consulting       555-1212
57896883         XYZ Plumbing         347-8954
87323456         Databases N-U        322-1012

Figure 7-1

Order Table - contains 5 orders

Order_number  Customer_number  Order_date  Order_total
PK            FK
UPI           NUSI             NUSI
123456        11111111         980504      12347.53
123512        11111111         990101       8005.91
123552        31323134         991001       5111.47
123585        87323456         991010       5111.47
123777        57896883         990909      23454.84

Figure 7-2

The common domain between these two tables is the customer number. It is used in the WHERE clause with the equal condition to find all the rows from both tables with matching values. Since the column has exactly the same name in both tables, it becomes mandatory to qualify this column’s name so that the PE knows which table to reference for the data. Every appearance of the customer number in the SELECT must be qualified.

The next SELECT finds all of the orders for each customer and shows the Customer’s name, Order number and Order total using a join:
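A possible form of this join, assuming the tables are named Customer_table and Order_table; cust and ord are aliases, as covered earlier:

SEL cust.Customer_number, Customer_name, Order_number, Order_total
FROM Customer_table AS cust, Order_table AS ord
WHERE cust.Customer_number = ord.Customer_number ;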

5 Rows Returned

Customer_number  Customer_name        Order_number  Order_total
31323134         ACE Consulting       123552         $5,111.47
11111111         Billy’s Best Choice  123456        $12,347.53
11111111         Billy’s Best Choice  123512         $8,005.91
87323456         Databases N-U        123585        $15,231.62
57896883         XYZ Plumbing         123777        $23,454.84

In the above output, all of the customers, except one, have a single order on file.

However, Billy’s Best Choice has placed two orders and is displayed twice, once for each order. Notice that the Customer number in the SELECT list is qualified and returned from the Customer table. Does it matter, in this join, which table is used to obtain the value for the Customer number?

Your answer should be no. This is because the value in the two tables is checked for equal in the WHERE clause of the join. Therefore, the value is the same regardless of which table is used. However, as mentioned earlier, you must use the table name to qualify any column name that exists in more than one table with the same name. Teradata will not assume which column to use.

The following shows the syntax for a three-table join:
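In general form:

SELECT <column-list>
FROM <table-name1>, <table-name2>, <table-name3>
WHERE <table-name1>.<column-name> = <table-name2>.<column-name>
AND   <table-name2>.<column-name> = <table-name3>.<column-name> ;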

The next three tables are used to demonstrate a three-table join:

Course Table - contains 7 courses

Course_ID  Course_Name               Credits  Seats
PK
UPI        NUSI
100        Teradata Concepts            3       50
200        Introduction to SQL          3       20
210        Advanced SQL                 3       22
220        V2R3 SQL Features            2       25
300        Physical Database Design     4       20
400        Database Administration      4       16
500        Logical Database Design      2       24

Figure 7-3

Student Table - contains 10 students

Student_ID  Last_Name  First_name  Class_code  Grade_Pt
PK                                 FK
UPI         NUSI                   NUSI
123250      Phillips   Martin      SR          3.00
125634      Hanson     Henry       FR          2.88
234121      Thomas     Wendy       FR          4.00
231222      Wilson     Susie       SO          3.80
260000      Johnson    Stanley     ?           ?
280023      McRoberts  Richard     JR          1.90
322133      Bond       Jimmy       JR          3.95
324652      Delaney    Danny       SR          3.35
333450      Smith      Andy        SO          2.00
423400      Larkins    Michael     FR          0.00

Figure 7-4

Student_Course Table (associative table)

Student_ID  Course_ID
PK
NUPI        NUSI
123250      100
125634      100
125634      200
125634      220
234121      100
231222      210
231222      220
260000      400
280023      210
322133      220
322133      300
324652      200
333450      400

Figure 7-5

The first two tables represent the students and courses they can attend. Since a student can take more than one class, the third table, Student_Course, is used to associate the two main tables. It allows for one student to take many classes and one class to be taken by many students (a many-to-many relationship).

The following SELECT joins these three tables on the common domain columns to find all courses being taken by the students:
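One way to write this join, assuming the tables are named Student_table, Course_table and Student_Course_table:

SEL Last_name, First_name, S.Student_ID, Course_name AS Course
FROM Student_table AS S, Student_Course_table AS SC, Course_table AS C
WHERE S.Student_ID = SC.Student_ID
AND   C.Course_ID  = SC.Course_ID
ORDER BY Course, Last_name ;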

13 Rows Returned

Last Name  First Name  Student_ID  Course
McRoberts  Richard     280023      Advanced SQL
Wilson     Susie       231222      Advanced SQL
Johnson    Stanley     260000      Database Administration
Smith      Andy        333450      Database Administration
Delaney    Danny       324652      Introduction to SQL
Hanson     Henry       125634      Introduction to SQL
Bond       Jimmy       322133      Physical Database Design
Hanson     Henry       125634      Teradata Concepts
Phillips   Martin      123250      Teradata Concepts
Thomas     Wendy       234121      Teradata Concepts
Bond       Jimmy       322133      V2R3 SQL Features
Hanson     Henry       125634      V2R3 SQL Features
Wilson     Susie       231222      V2R3 SQL Features

It is required to have one less equality test in the WHERE than the number of tables being joined. Here, there are three tables and two equality comparisons on common domain columns in the tables. If the maximum of 64 tables is used, there must be 63 comparisons connected with AND logical operators. If one comparison is forgotten, the result is not a syntax error; it is a Cartesian product join.


Many times the request adds some residual conditions to further refine the output. For instance, the need might be to see all the students that have taken the V2R3 SQL class. This is very common since most tables will have thousands or millions of rows. A way is needed to limit the rows returned. The residual conditions also appear in the WHERE clause.

In the next join, the WHERE of the previous SELECT has been modified to add an additional comparison for the course:
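The same join with the residual condition added; the course name literal is taken from the answer set:

SEL Last_name, First_name, S.Student_ID, Course_name AS Course
FROM Student_table AS S, Student_Course_table AS SC, Course_table AS C
WHERE S.Student_ID = SC.Student_ID
AND   C.Course_ID  = SC.Course_ID
AND   Course_name  = 'V2R3 SQL Features'
ORDER BY Last_name ;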

3 Rows Returned

Last Name  First Name  Student_ID  Course
Bond       Jimmy       322133      V2R3 SQL Features
Hanson     Henry       125634      V2R3 SQL Features
Wilson     Susie       231222      V2R3 SQL Features

The added residual condition does not replace the join conditions. Instead it adds a third condition for the course. If one of the join conditions is omitted, the result is a Cartesian product join (explained next).


Product Join

It is very important to use an equal condition in the WHERE clause. Otherwise, you get a product join. This means that one row of a table is joined to multiple rows of another table. A mathematical product means that multiplication is used.

The next join example uses a WHERE clause, but it only limits which rows participate in the join and does not provide a join condition:
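A sketch of such a join, under the chapter's table-name assumptions; note that the WHERE references only the Customer table, so there is no join condition:

SEL Customer_name, Order_number, Order_total
FROM Customer_table, Order_table
WHERE Customer_name LIKE 'Billy%' ;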

5 Rows Returned

Customer_name        Order_number  Order_total
Billy’s Best Choice  123456        12347.53
Billy’s Best Choice  123512         8005.91
Billy’s Best Choice  123552         5111.47
Billy’s Best Choice  123585         5111.47
Billy’s Best Choice  123777        23454.84

The above output resulted from 1 row in the customer table being joined to all the rows of the order table. The WHERE limited the customer rows that participated in the join, but did not specify an equal comparison between the join columns. As a result, it looks like Billy placed five orders, which is not correct. So, be careful when using product joins because SQL answers the question as asked, not necessarily as intended.

When all rows of one table are joined to all rows of another table, it is called a Cartesian product join or an unconstrained product join. Think about this: if one table has one million rows and the other table contains one thousand rows, the output is one billion rows (1,000,000 rows * 1,000 rows = 1,000,000,000 rows).

As seen above, the vast majority of the time a product join has no meaningful output and is usually a mistake. The mistake is either that the WHERE clause is omitted, a column comparison is omitted for one of the tables using an AND, or the table is given an alias and the alias is not used (the system treats the unused full name as an additional table without a comparison).

The next SELECT is the same as the one above, except this time the entire WHERE clause has been commented out using /* and */:
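Under the earlier table-name assumptions, the commented-out version might look like this:

SEL Last_name, First_name, S.Student_ID, Course_name AS Course
FROM Student_table AS S, Student_Course_table AS SC, Course_table AS C
/* WHERE S.Student_ID = SC.Student_ID
   AND   C.Course_ID  = SC.Course_ID */ ;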

Since the join condition is converted into a comment, the output from the SELECT is a Cartesian product that will return 910 rows (10 * 7 * 13 = 910) using these very small tables. The output is completely meaningless and implies that every student is taking every class. This output does not reflect the correct situation.

Forgetting to include the WHERE clause does not make the join syntax incorrect. Instead, it results in a Cartesian product join. Always use the EXPLAIN to verify that the result of the join is reasonable before executing the actual join. The following shows the output from an EXPLAIN of the above classic Cartesian product join. Notice that steps 6 and 7 indicate a product join on the condition that (1=1). Since 1 is always equal to 1 every time a row is read, all rows are joined with all rows.


The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.56 seconds.

If you remember from Chapter 3, the EXPLAIN shows immediately that this situation will occur if the SELECT is executed. This is better than waiting, potentially hours, to determine that the SELECT is running too long, stealing valuable computer cycles, doing data transfer, and interfering with valid SQL from other users. Be a good corporate citizen and database user: EXPLAIN your join syntax before executing! Make sure the estimates are reasonable for the size of the database tables involved.


Newer ANSI Join Syntax

The ANSI committee has created a new form of the join syntax. Like most ANSI-compliant code, it is a bit longer to write. However, I personally believe that it is worth the time and the effort due to better functionality and safeguards. Plus, it is more difficult to get an accidental product join using this form of syntax. This chapter describes and demonstrates the use of the INNER JOIN, the OUTER JOIN, the CROSS JOIN and the Self-join.

INNER JOIN

Although the original syntax still works, there is a newer version of the join using the INNER JOIN syntax. It works exactly the same as the original join, but is written slightly differently.

The following syntax is for a two-table INNER JOIN:
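In general form:

SELECT <column-list>
FROM <table-name1> [INNER] JOIN <table-name2>
  ON <table-name1>.<column-name> = <table-name2>.<column-name> ;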

There are two primary differences between the new INNER JOIN and the original join syntax. The first difference is that a comma (,) no longer separates the table names. Instead of a comma, the words INNER JOIN are used. As shown in the above syntax format, the word INNER is optional. If only the JOIN appears, it defaults to an INNER JOIN.

The other difference is that the WHERE clause for the join condition is changed to an ON to declare an equal comparison on the common domain columns. If the ON is omitted, a syntax error is reported and the SELECT does not execute. So, the result is not a Cartesian product join as seen in the original syntax. Therefore, it is safer to use.

Although the INNER JOIN is a slightly longer SQL statement to code, it does have advantages. The first advantage, mentioned above, is fewer accidental Cartesian product joins because the ON is required. In the original syntax, when the WHERE is omitted the syntax is still correct. However, without a comparison, all rows of both tables are joined with each other and results in a Cartesian product.


The last and most compelling advantage of the newer syntax is that the INNER JOIN and OUTER JOIN statements can easily be combined into a single SQL statement. The OUTER JOIN syntax, explanation and significance are covered in this chapter.

The following is the same join that was performed earlier using the original join syntax. Here, it has been converted to use an INNER JOIN:
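The converted join, under the same table-name assumptions as before:

SEL cust.Customer_number, Customer_name, Order_number, Order_total
FROM Customer_table AS cust INNER JOIN Order_table AS ord
  ON cust.Customer_number = ord.Customer_number ;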

5 Rows Returned

Customer_number  Customer_name        Order_number  Order_total
31323134         ACE Consulting       123552         $5,111.47
11111111         Billy’s Best Choice  123456        $12,347.53
11111111         Billy’s Best Choice  123512         $8,005.91
87323456         Databases N-U        123585        $15,231.62
57896883         XYZ Plumbing         123777        $23,454.84

Like the original syntax, more than two tables can be joined in a single INNER JOIN. Each consecutive table name follows an INNER JOIN and associated ON clause to tell which columns to match. Therefore, a ten-table join has nine JOIN and nine ON clauses to identify each table and the columns being compared. There is always one less JOIN / ON combination than the number of tables referenced in the FROM.

The following syntax is for an INNER JOIN with more than two tables:
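In general form:

SELECT <column-list>
FROM <table-name1>
INNER JOIN <table-name2>
  ON <table-name1>.<column-name> = <table-name2>.<column-name>
INNER JOIN <table-nameN>
  ON <table-name2>.<column-name> = <table-nameN>.<column-name> ;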


The <table-nameN> reference above is intended to represent a variable number of tables. It might be a 3-table, a 10-table or up to a 64-table join. The same approach is used regardless of the number of tables being joined together in a single SELECT.

The other difference between these two join formats is that regardless of the number of tables in the original syntax, there was only a single WHERE clause. Here, each additional INNER JOIN has its own ON condition. If one ON is omitted from the INNER JOIN, an error code of 3706 will be returned. This error keeps the join from executing, unlike the original syntax, where a forgotten join condition in the WHERE is allowed, but creates an accidental Cartesian product join.

The next INNER JOIN is converted from the 3-table join seen earlier:
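One way to write the converted join, with the course comparison carried in a compound ON:

SEL Last_name, First_name, S.Student_ID, Course_name AS Course
FROM Student_table AS S
INNER JOIN Student_Course_table AS SC
  ON S.Student_ID = SC.Student_ID
INNER JOIN Course_table AS C
  ON C.Course_ID = SC.Course_ID
 AND Course_name = 'V2R3 SQL Features' ;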

3 Rows Returned

Last Name  First Name  Student_ID  Course
Bond       Jimmy       322133      V2R3 SQL Features
Hanson     Henry       125634      V2R3 SQL Features
Wilson     Susie       231222      V2R3 SQL Features

The INNER JOIN syntax can use a WHERE clause instead of a compound ON comparison. It can be used to add one or more residual conditions. A residual condition is a comparison that is in addition to the join condition. When it is used, the intent is to potentially eliminate rows from one or more of the tables.

In other words, as rows are read the WHERE clause compares each row with a condition to decide whether or not it should be included or eliminated from the join processing. The WHERE clause is applied as rows are read, before the ON clause. Eliminated rows do not participate in the join against rows from another table. For more details, read the section on WHERE clauses at the end of this chapter.


The following is the same SELECT using a WHERE to compare the Course name as a residual condition instead of a compound (AND) comparison in the ON:
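The WHERE version, under the same assumptions:

SEL Last_name, First_name, S.Student_ID, Course_name AS Course
FROM Student_table AS S
INNER JOIN Student_Course_table AS SC
  ON S.Student_ID = SC.Student_ID
INNER JOIN Course_table AS C
  ON C.Course_ID = SC.Course_ID
WHERE Course_name = 'V2R3 SQL Features' ;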

As far as the INNER JOIN processing is concerned, the PE will normally optimize both of these last two joins exactly the same. The EXPLAIN is the best way to determine how the optimizer uses specific Teradata tables in a join operation.

OUTER JOIN

As seen previously, the join processing matches rows from multiple tables on a column containing values from a common domain. Most of the time, each row in a table has a matching row in the other table. However, we do not live in a perfect world and sometimes our data is not perfect. Imperfect data is never returned when a normal join is used and the imperfection may go unnoticed.

The sole purpose of an OUTER JOIN is to find and return rows that do not match at least one row from another table. It is for “exception” reporting, but at the same time, it does the INNER JOIN processing too. Therefore, the intersecting (matching) common domain rows are returned along with all rows without a matching value from another table. This non-matching condition might be due to the existence of a NULL or invalid data value in the join column(s).

For instance, if the employee and department tables are joined using an INNER JOIN, it displays all the employees who work in a valid department. Mechanically, this means it returns all of the employee rows that contain a value in the department number column, as a foreign key, that matches a department number value in the department table, as a primary key.

What it does not display are employees without a department number (NULL) and employees with invalid department numbers (which break referential integrity rules). These additional rows can be returned with the intersecting rows using one of the three formats for an OUTER JOIN listed below.

The three formats of an OUTER JOIN are:
Left_table LEFT OUTER JOIN Right_table – the left table is the outer table
Left_table RIGHT OUTER JOIN Right_table – the right table is the outer table
Left_table FULL OUTER JOIN Right_table – both are outer tables

The OUTER JOIN has an outer table. The outer table is used to direct which exception rows are output. Simply put, it is the controlling table of the OUTER JOIN. As a result of this feature, all the rows from the outer table will be returned, those containing matching domain values and those with non-matching values. The INNER JOIN has only inner tables. To code an OUTER JOIN it is wise to start with an INNER JOIN. Once the join is working, the next step is to convert the word INNER to OUTER.

The SELECT list for matching rows can display data from any of the tables in the FROM. This is because a row with a matching row exists in the tables. However, all non-matching rows with NULL or invalid data in the outer table do not have a matching row in the inner table. Therefore, the entire inner table row is missing and no column is available for the SELECT list. This is the equivalent of a NULL. Since the exception row is missing, there is no data available for display. All referenced columns from the missing inner table rows will be represented as a NULL in the display.

The basic syntax for a two-table OUTER JOIN follows:
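In general form:

SELECT <column-list>
FROM <table-name1> { LEFT | RIGHT | FULL } [OUTER] JOIN <table-name2>
  ON <table-name1>.<column-name> = <table-name2>.<column-name> ;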

Unlike the INNER JOIN, there is no original join syntax operation for an OUTER JOIN. The OUTER JOIN is a unique answer set. The closest functionality to an OUTER JOIN comes from the UNION set operator, which is covered later in this book. The other fantastic quality of the newer INNER and OUTER join syntax is that they both can be used in the same SELECT with three or more tables.


The next several sections explain and demonstrate all three formats of the OUTER JOIN. The primary issue when using an OUTER JOIN is that only one format can be used in a SELECT between any two tables. The FROM list determines the outer table for processing. It is important to understand the functionality in order to choose the correct outer join.

LEFT OUTER JOIN

The outer table is determined by its location in the FROM clause of the SELECT, as shown here:
<Outer-table> LEFT OUTER JOIN <Inner-table>

Or

<Outer-table> LEFT JOIN <Inner-table>

In this format, the Customer table is the one on the left of the word JOIN. Since this is a LEFT OUTER JOIN, the Customer is the outer table. This syntax can return all customer rows that match a valid order number (INNER JOIN) and Customers with NULL or invalid order numbers (OUTER JOIN).

The next SELECT shows customers with matching orders and those that need to be called because they have not placed an order:
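A sketch of this query; the ORDER BY on the customer name matches the sorted answer set:

SEL Customer_name, Order_number, Order_total
FROM Customer_table AS cust LEFT OUTER JOIN Order_table AS ord
  ON cust.Customer_number = ord.Customer_number
ORDER BY Customer_name ;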

6 Rows Returned

Customer_name        Order_number  Order_total
Ace Consulting       123552         $5,111.47
Acme Products        ?              ?
Billy's Best Choice  123456        $12,347.53
Billy's Best Choice  123512         $8,005.91
Databases N-U        123585        $15,231.62
XYZ Plumbing         123777        $23,454.84


The above output consists of all the rows from the Customer table because it is the outer table and there are no residual conditions. Unlike the earlier INNER JOIN, Acme Products is now easily seen as the only customer without an order. Since Acme Products has no order at this time, the order number and the order total are both extended with the “?” to represent a NULL, or missing value from a non-matching row of the inner table. This is a very important concept.

The result of the SELECT provides the matching rows like the INNER JOIN and the non-matching rows, or exceptions that are missed by the INNER JOIN. It is possible to add the order number to an ORDER BY to put all exceptions either at the front (ASC) or back (DESC) of the output report.

When using an OUTER JOIN, the results of this join are stored in the spool area and contain all of the rows from the outer table. This includes the rows that match and all the rows that do not match from the join step. The only difference is that the non-matching rows are carrying the NULL values for all columns for missing rows from the inner table.

The concept of a LEFT OUTER JOIN is pretty straightforward with two tables. However, additional thought is required when using more than two tables to preserve rows from the first outer table.

Remember that the result of the first join is saved in spool. This same spool is then used to perform all subsequent joins against any additional tables or other spool areas. So, if you join three tables using an outer join, the first two tables are joined together, the spooled results represent the new outer table, and that spool is then joined with the third table, which becomes the RIGHT table.

Using the Student, Course and Student_Course tables, the following SELECT preserves the exception rows from the Student table, as the outer table, throughout the entire join. Since both joins are written using the LEFT OUTER JOIN and the Student table is the table name furthest to the left, it remains the outer table:
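One way to write the three-table join just described:

SEL Last_name, First_name, S.Student_ID, Course_name AS Course
FROM Student_table AS S
LEFT OUTER JOIN Student_Course_table AS SC
  ON S.Student_ID = SC.Student_ID
LEFT OUTER JOIN Course_table AS C
  ON C.Course_ID = SC.Course_ID
ORDER BY Course, Last_name ;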


14 Rows Returned

Last Name  First Name  Student_ID  Course
Larkins    Michael     423400      ?
McRoberts  Richard     280023      Advanced SQL
Wilson     Susie       231222      Advanced SQL
Johnson    Stanley     260000      Database Administration
Smith      Andy        333450      Database Administration
Delaney    Danny       324652      Introduction to SQL
Hanson     Henry       125634      Introduction to SQL
Bond       Jimmy       322133      Physical Database Design
Hanson     Henry       125634      Teradata Concepts
Phillips   Martin      123250      Teradata Concepts
Thomas     Wendy       234121      Teradata Concepts
Bond       Jimmy       322133      V2R3 SQL Features
Hanson     Henry       125634      V2R3 SQL Features
Wilson     Susie       231222      V2R3 SQL Features

The above output contains all the rows from the Student table as the outer table in the three-table LEFT OUTER JOIN. The OUTER JOIN returns a row for a student named Michael Larkins even though he is not taking a course. Since his course row is missing, no course name is available for display. As a result, the output is extended with a NULL in the course name, but the row still becomes part of the answer set.

Now, it is known that a student isn’t taking a course. It might be important to know if there are any courses without students. The previous join can be converted to determine this fact by rearranging the table names in the FROM to make the Course table the outer table, or by using the RIGHT OUTER JOIN.

RIGHT OUTER JOIN

As indicated earlier, the outer table is determined by its position in the FROM clause of the SELECT. Consider the following:
<Inner-table> RIGHT OUTER JOIN <Outer-table>

Or

<Inner-table> RIGHT JOIN <Outer-table>


In the next example, the Customer table is still written before the Order table. Since it is now a RIGHT OUTER JOIN and the Order table is on the right of the word JOIN, it is now the outer table. Remember, all rows can be returned from the outer table!

To include the orders without customers, the previously seen LEFT OUTER JOIN has been converted to a RIGHT OUTER JOIN. It can be used to return all of the rows in the Order table, those that match customer rows and those that do not match customers.

The following is converted to a RIGHT OUTER JOIN to find all orders:
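The converted join, under the same assumptions:

SEL Customer_name, Order_number, Order_total
FROM Customer_table AS cust RIGHT OUTER JOIN Order_table AS ord
  ON cust.Customer_number = ord.Customer_number
ORDER BY Customer_name ;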

6 Rows Returned

Customer_name        Order_number  Order_total
?                    999999            $1.00-
Ace Consulting       123552         $5,111.47
Billy's Best Choice  123456        $12,347.53
Billy's Best Choice  123512         $8,005.91
Databases N-U        123585        $15,231.62
XYZ Plumbing         123777        $23,454.84

The above output from the SELECT consists of all the rows from the Order table, which is the outer table. In a 2-table OUTER JOIN without a WHERE clause, the number of rows returned is usually equal to the number of rows in the outer table. In this case, the outer table is the Order table. It contains 6 rows and all 6 rows are returned.

This join returns all orders with a valid customer ID (like the INNER JOIN) and orders with a missing or an invalid customer ID (OUTER JOIN). Either of these last two conditions constitutes a critical business problem that needs immediate attention. It is important to determine that orders were placed, but that the buyer of them is not known. Since the output was sorted by the customer name, the exception row is returned first. This technique makes the exception easy to find, especially in a large report. Not only is the customer missing for this order, it obviously has additional problems. The total is negative and the order number is all nines. We can now correct a situation we knew nothing about, or correct the procedure or policy that allowed the error to occur.

Using the same Student and Course tables from the previous 3-table join, it can be converted from the two LEFT OUTER JOIN operations to two RIGHT OUTER JOIN operations in order to find the students taking courses and also find any courses without students enrolled:
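Using the same assumed table names, the converted join might look like:

  SELECT Last_name, First_name, S.Student_ID, Course_name
  FROM Student_Table AS S
    RIGHT OUTER JOIN Student_Course_Table AS SC
      ON S.Student_ID = SC.Student_ID
    RIGHT OUTER JOIN Course_Table AS C
      ON SC.Course_ID = C.Course_ID
  ORDER BY 4, 1 ;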

8 Rows Returned

Last Name FirstName Student_ID Course

McRoberts Richard 280023 Advanced SQL
Wilson Susie 231222 Advanced SQL
Delaney Danny 324652 Introduction to SQL
Hanson Henry 125634 Introduction to SQL
? ? ? Logical Database Design
Bond Jimmy 322133 V2R3 SQL Features
Hanson Henry 125634 V2R3 SQL Features
Wilson Susie 231222 V2R3 SQL Features

Now, using the output from the OUTER JOIN on the Course table, it is apparent that no one is enrolled in the Logical Database Design course. The enrollment needs to be increased or the room needs to be freed up for another course. Where inner joins are great at finding matches, outer joins are great at finding both matches and problems.

FULL OUTER JOIN

The last form of the OUTER JOIN is a FULL OUTER JOIN. If both Customer and Order exceptions are to be included in the output report, then the syntax should appear as:

<Outer-table> FULL OUTER JOIN <Outer-table>


Or

<Outer-table> FULL JOIN <Outer-table>

A FULL OUTER JOIN uses both of the tables as outer tables. The exceptions are returned from both tables and the missing column values from either table are extended with NULL. This puts the LEFT and RIGHT OUTER JOIN output into a single report.

To return the customers with orders, and include the orders without customers and customers without orders, the following FULL OUTER JOIN can be used:
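A sketch, again using the assumed Customer_Table and Order_Table names:

  SELECT Customer_name, Order_number, Order_total
  FROM Customer_Table AS CUST
    FULL OUTER JOIN Order_Table AS ORD
      ON CUST.Customer_number = ORD.Customer_number
  ORDER BY 1, 3 ;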

7 Rows Returned

Customer_name Order_number Order_total

? 999999 $1.00-
Ace Consulting 123552 $5,111.47
Acme Products ? ?
Billy's Best Choice 123512 $8,005.91
Billy's Best Choice 123456 $12,347.53
Databases N-U 123585 $15,231.62
XYZ Plumbing 123777 $23,454.84

The output from the SELECT consists of all the rows from the Order and Customer tables because they are now both outer tables in a FULL OUTER JOIN.

The total number of rows returned is more difficult to predict with a FULL OUTER JOIN. The answer set contains: one row for each of the matching rows from the tables, plus one row for each of the missing rows in the left table, plus one for each of the missing rows in the right table.

Since both tables are outer tables, not as much thought is required for choosing the outer table. However, as mentioned earlier, the INNER and OUTER join processing can be combined in a single SELECT. The INNER JOIN still eliminates all non-matching rows. This is when the most consideration needs to be given to the appropriate outer tables.

Like all joins, more than two tables can be joined using a FULL OUTER JOIN, up to 64 tables. The next FULL OUTER JOIN syntax uses Student and Course tables for the outer tables through the entire join process:
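A sketch with the assumed student and course table names:

  SELECT Last_name, First_name, S.Student_ID, Course_name
  FROM Student_Table AS S
    FULL OUTER JOIN Student_Course_Table AS SC
      ON S.Student_ID = SC.Student_ID
    FULL OUTER JOIN Course_Table AS C
      ON SC.Course_ID = C.Course_ID
  ORDER BY 4, 1 ;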

15 Rows Returned

Last Name FirstName Student_ID Course

Larkins Michael 423400 ?
McRoberts Richard 280023 Advanced SQL
Wilson Susie 231222 Advanced SQL
Johnson Stanley 260000 Database Administration
Smith Andy 333450 Database Administration
Delaney Danny 324652 Introduction to SQL
Hanson Henry 125634 Introduction to SQL
? ? ? Logical Database Design
Bond Jimmy 322133 Physical Database Design
Hanson Henry 125634 Teradata Concepts
Phillips Martin 123250 Teradata Concepts
Thomas Wendy 234121 Teradata Concepts
Bond Jimmy 322133 V2R3 SQL Features
Hanson Henry 125634 V2R3 SQL Features
Wilson Susie 231222 V2R3 SQL Features

The above SELECT uses the Student, Course and “Student Course” (associative) tables in a FULL OUTER JOIN. All three tables are outer tables. The above includes one non-matching row from the Student table with a null in the course name, and one non-matching row from the Course table with nulls in all three columns from the Student table. Since the Student Course table is also an outer table, if there were any non-matching rows in it, they could also be returned, containing nulls in the columns from the other tables. However, since it is an associative table used only for a many-to-many relationship between the Student and Course tables, missing rows in it would indicate a serious business problem.

As a reminder, the result of the first join step is stored in spool, which is temporary work space that the system uses to complete each step of the SELECT. Then, the spool area is used for each consecutive JOIN step. This continues until all of the tables have been joined together, two at a time. However, the spool areas are not held until the end of the SELECT. Instead, when a spool is no longer needed, it is released immediately. This makes more spool available for another step, or for another user. The release can be seen in the EXPLAIN output as (Last Use) for a spool area.

Also, when using Teradata, do not spend a lot of time worrying about which tables to join first. The optimizer makes this choice at execution time. The optimizer always looks for the fastest method to obtain the requested rows. It uses data distribution and index demographics to make its final decision on a methodology. So, the tables joined first in the syntax, might be the last tables joined in the execution plan.

All databases join tables two at a time, but most databases just pick which tables to join based on their position in the FROM. Sometimes when the SQL runs slow, the user just changes the order of the tables in the join. Otherwise, join schemas must be built to tell the RDBMS how to join specific tables.

Teradata is smart enough, using explicit or implicit STATISTICS, to evaluate which tables to join together first. Whenever possible, four tables might be joined at the same time, but it is still done as two, two-table joins in parallel. Joins involving millions of rows are considered difficult for most databases, but Teradata joins them with ease.

It is a good idea to use the Teradata EXPLAIN to see what steps the optimizer plans to use to accomplish the request. Primarily, in the beginning, you are looking for an estimate of the number of rows that will be returned and the time cost to accomplish it. I recommend using the EXPLAIN before each join as you are learning, to make sure that the result is reasonable.

If these numbers appear to be too high for the tables involved, it is probably a Cartesian product, which is not good. The EXPLAIN discovers the product join within seconds instead of hours. If it were actually running, it would be wasting resources by doing all the extra work to accomplish nothing. Use the EXPLAIN to learn this fact the easy way and fix it.

CROSS JOIN

A CROSS JOIN is the ANSI way to write a product join. This means that it joins one or more rows participating from one table with all the participating rows from the other table. As mentioned earlier in this chapter, there is not a large application for a product join and even fewer for a Cartesian join.

Although there are not many applications for a CROSS JOIN, consider this: an airline might use one to determine the location and number of routes needed to fly from one hub to all of the other cities they serve. A potential route “joins” every city to the hub. Therefore, the result needs a product join. Probably what should still be avoided is to fly from every city to every other city (Cartesian join). A CROSS JOIN is controlled using a WHERE clause. Unlike the other join syntax, a CROSS JOIN results in a syntax error if an ON clause is used.

The following is the syntax for the CROSS JOIN:
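In general form:

  SELECT <column-list>
  FROM <table-name1>
    CROSS JOIN <table-name2>
  [ WHERE <condition> ] ;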

The next SELECT performs a CROSS JOIN (product join) using the Student and Course tables:
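A sketch of such a SELECT; the constraint to a single course is an assumption based on the output that follows:

  SELECT Last_name, Course_name
  FROM Student_Table
    CROSS JOIN Course_Table
  WHERE Course_name = 'Teradata Concepts' ;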

10 Rows Returned

Last_name Course_name

Phillips Teradata Concepts
Hanson Teradata Concepts
Thomas Teradata Concepts
Wilson Teradata Concepts
Johnson Teradata Concepts
McRoberts Teradata Concepts
Bond Teradata Concepts
Delaney Teradata Concepts
Smith Teradata Concepts
Larkins Teradata Concepts

Since every student is not taking every course, this output has very little meaning from a student and course perspective. However, this same data can be valuable in determining the potential for a situation, or the resources needed for a worst case, such as maximum room capacities. For example, it helps if the Dean wants to know the maximum number of seats needed in a classroom if every student were to enroll in every SQL class. However, the rows are probably counted (COUNT(*)) and not displayed.

This SELECT uses a CROSS JOIN to populate a derived table (discussed later), which is then used to obtain the final count:
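A sketch of that technique, assuming the same table names; the CROSS JOIN inside the parentheses builds the derived table rows and the outer query counts them:

  SELECT COUNT(*) (TITLE 'Total SQL Seats Needed')
  FROM (SELECT Course_name
        FROM Student_Table CROSS JOIN Course_Table
        WHERE Course_name LIKE '%SQL%') AS DT (Course_name) ;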

1 Row Returned

Total SQL Seats Needed

30

The previous SELECT can also be written with the WHERE clause on the main SELECT, comparing the rows of the derived table called DT instead of only building those rows. Compare the previous SELECT with the next one and determine which is more efficient.
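A sketch of that alternative, with the residual condition moved to the main query:

  SELECT COUNT(*) (TITLE 'Total SQL Seats Needed')
  FROM (SELECT Course_name
        FROM Student_Table CROSS JOIN Course_Table) AS DT (Course_name)
  WHERE Course_name LIKE '%SQL%' ;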

Which do you find to be more efficient?

At first glance, it would appear that the first is more efficient because the CROSS JOIN inside the parentheses for the derived table is not a Cartesian product. Instead, the CROSS JOIN that populates the derived table is constrained in the WHERE to only SQL courses rather than all courses. However, the PE optimizes them the same. I told you that Teradata was smart!

Self Join

A Self Join is simply a join that uses the same table more than once in a single join operation. The first requirement for this type of join is that the table must contain two different columns of the same domain. This may involve de-normalized tables.

For instance, if the Employee table contained a column for the manager’s employee number and the manager is an employee, these two columns have the same domain. By joining on these two columns in the Employee table, the managers can be joined to the employees.

The next SELECT joins the Employee table to itself as an employee table and also as a manager table to find managers. Then, the managers are joined to the Department table to return the first ten characters of the manager’s name and their entire department name:
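A sketch of the concept; the manager column (here called Mgr_Employee_Number) is hypothetical, since, as noted below, the book's Employee table does not actually contain it:

  SELECT Mgr.Last_name (CHAR(10)) (TITLE 'Manager')
        ,Department_name
  FROM Employee_Table AS Emp
    INNER JOIN Employee_Table AS Mgr
      ON Emp.Mgr_Employee_Number = Mgr.Employee_Number
    INNER JOIN Department_Table AS D
      ON Mgr.Department_Number = D.Department_Number ;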

The self join can be the original syntax (table , table), an INNER, OUTER, or CROSS join. Another requirement is that at least one of the table references must be assigned an alias. Since the alias name becomes the table name, the table is now treated as two completely different tables.

Normally, a self join requires some degree of de-normalization to allow for two columns in the same table to be part of the same domain. Since our Employee table does not contain the manager’s employee number, the output cannot be shown. However, the concept is shown here.

Alternative JOIN / ON Coding

There is another format that may be used for coding both the INNER and OUTER JOIN processing. Previously, all of the examples and syntax for joins of more than two tables used an ON immediately following the JOIN table list.

The following demonstrates the other coding syntax technique:
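The general pattern is:

  SELECT <column-list>
  FROM <table-name1>
    INNER JOIN <table-name2>
    INNER JOIN <table-nameN>
    ON <table-name2.column> = <table-nameN.column>
    ON <table-name1.column> = <table-name2.column> ;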

When using this technique, care should be taken to sequence the JOIN and ON portions correctly. There are two primary differences with this style compared to the earlier syntax. First, the JOIN clauses and table names are all together. In one sense, this is more like the syntax of: tablename1, tablename2 as seen in the original join.

Second, the ON statement sequence is reversed. In the above syntax diagram, the ON reference for <table-name2> and <table-nameN> comes before the ON reference for <table-name1> and <table-name2>. However, the JOIN for <table-name1> and <table-name2> is still before the JOIN of <table-name2> and <table-nameN>. In other words, the first ON goes with the last JOIN when they are nested using this technique.

The following three-table INNER JOIN seen earlier is converted here to use this reversed form of the ON comparisons:
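A sketch using the assumed student and course table names:

  SELECT Last_name, First_name, S.Student_ID, Course_name
  FROM Student_Table AS S
    INNER JOIN Student_Course_Table AS SC
    INNER JOIN Course_Table AS C
    ON SC.Course_ID = C.Course_ID
    ON S.Student_ID = SC.Student_ID ;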

Personally, we prefer the first technique in which every JOIN is followed immediately by its ON condition. Here are our reasons:

       It is harder to accidentally forget to code an ON for a JOIN, they are together.


       Less debugging time needed, and when it is needed, it is easier.

       Because the join allows 64 tables in a single SELECT, the SQL involving several tables may be longer than a single page can display. Therefore, many of the JOIN clauses will be on a different page than its corresponding ON condition. It might require paging back and forth multiple times to locate all of the ON conditions for every JOIN clause. This involves too much effort. Using the JOIN / ON, they are physically next to each other.

       Adding another table into the join requires careful thought and placement for both the JOIN and the ON. When using the JOIN / ON, they can be placed almost anywhere in the FROM clause.


   

Adding Residual Conditions to a Join

Most of the examples in this book have included all rows from the tables being joined. However, in the world of Teradata, with millions of rows stored in a single table, additional comparisons are probably needed to reduce the number of rows returned. There are two ways to code residual conditions: as a compound condition using the ON, or in a WHERE clause within the new JOIN syntax. These residual conditions are in addition to the join equality in the ON clause.

Consideration should be given to the type of join when including the WHERE clause. The following paragraphs discuss the operational aspects of mixing an ON with a WHERE for INNER and OUTER JOIN operations.

INNER JOIN

The WHERE clause works exactly the same when used with the INNER JOIN as it does on all other forms of the SELECT. It eliminates rows at read time based on the condition being checked and any index columns involved in the comparison.

Normally, as fewer rows are read, the faster the SQL will run. It is more efficient because fewer resources such as disk, I/O, cache space, spool space, and CPU are needed. Therefore, whenever possible, it is best to eliminate unneeded rows using a WHERE condition with an INNER JOIN. I like the use of WHERE because all residual conditions are located in one place.

The following samples are the same join that was performed earlier in this chapter. Here, one uses a WHERE clause and the other a compound comparison via the ON:
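First, a sketch of the WHERE clause version (the customer literal is taken from the output below):

  SELECT Customer_name, Order_number, Order_total
  FROM Customer_Table AS CUST
    INNER JOIN Order_Table AS ORD
      ON CUST.Customer_number = ORD.Customer_number
  WHERE Customer_name = 'Billy''s Best Choice' ;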

Or
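The compound ON version:

  SELECT Customer_name, Order_number, Order_total
  FROM Customer_Table AS CUST
    INNER JOIN Order_Table AS ORD
      ON CUST.Customer_number = ORD.Customer_number
     AND Customer_name = 'Billy''s Best Choice' ;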


2 Rows Returned

Customer_name Order_number Order_total

Billy’s Best Choice 123456 $12,347.53
Billy’s Best Choice 123512 $8,005.91

The output is exactly the same with both coding methods. This can be verified using the EXPLAIN. We recommend using the WHERE clause with an inner join because it consolidates all residual conditions in a single location that is easy to find when changes are needed. Although there are multiple ON comparisons, there is only one WHERE clause.

OUTER JOIN

Like the INNER JOIN, the WHERE clause can also be used with the OUTER JOIN. However, its processing is the opposite of the technique used with an INNER JOIN and other SQL constructs. If you remember, with the INNER JOIN the intent of the WHERE clause was to eliminate rows from one or all tables referenced by the SELECT.

When the WHERE clause is coded with an OUTER JOIN, it is executed last, instead of first. Remember, the OUTER JOIN returns exceptions. The exceptions must be determined using the join (matching and non-matching rows) and therefore rows cannot be eliminated at read time. Instead, they go into the join and into spool. Then, just before the rows are returned to the client, the WHERE checks to see if rows can be eliminated from the spooled join rows.

The following demonstrates the difference when using the same two techniques in the OUTER JOIN. Notice that the results are different:
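First, a sketch of the WHERE clause version, applied to the three-table outer join with the Course table as the outer table:

  SELECT Last_name, First_name, S.Student_ID, Course_name
  FROM Student_Table AS S
    RIGHT OUTER JOIN Student_Course_Table AS SC
      ON S.Student_ID = SC.Student_ID
    RIGHT OUTER JOIN Course_Table AS C
      ON SC.Course_ID = C.Course_ID
  WHERE Course_name LIKE '%SQL%'
  ORDER BY 4, 1 ;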


7 Rows Returned

Last Name FirstName Student_ID Course

McRoberts Richard 280023 Advanced SQL
Wilson Susie 231222 Advanced SQL
Delaney Danny 324652 Introduction to SQL
Hanson Henry 125634 Introduction to SQL
Bond Jimmy 322133 V2R3 SQL Features
Hanson Henry 125634 V2R3 SQL Features
Wilson Susie 231222 V2R3 SQL Features

Notice that only courses with SQL as part of the name are returned.

Whereas the next SELECT using the same condition as a compound comparison has a different result:
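A sketch of the compound comparison version, with the residual condition moved into the ON:

  SELECT Last_name, First_name, S.Student_ID, Course_name
  FROM Student_Table AS S
    RIGHT OUTER JOIN Student_Course_Table AS SC
      ON S.Student_ID = SC.Student_ID
    RIGHT OUTER JOIN Course_Table AS C
      ON SC.Course_ID = C.Course_ID
     AND Course_name LIKE '%SQL%'
  ORDER BY 4, 1 ;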

11 Rows Returned

Last Name FirstName Student_ID Course


McRoberts Richard 280023 Advanced SQL
Wilson Susie 231222 Advanced SQL
? ? ? Database Administration
Delaney Danny 324652 Introduction to SQL
Hanson Henry 125634 Introduction to SQL
? ? ? Logical Database Design
? ? ? Physical Database Design
? ? ? Teradata Concepts
Bond Jimmy 322133 V2R3 SQL Features
Hanson Henry 125634 V2R3 SQL Features
Wilson Susie 231222 V2R3 SQL Features

The reason for the difference makes sense after you think about the functionality of the OUTER JOIN. Remember that an OUTER JOIN retains all rows from the outer table, those that match and those that do not match the ON comparison. Therefore, the row shows up, but as a non-matching row instead of as a matching row.

There is one last consideration when using a WHERE clause with an OUTER JOIN. Always use columns from the outer table in the WHERE. The reason: if columns of the inner table are referenced in a WHERE, the optimizer will perform an INNER JOIN and not an OUTER JOIN, as coded. It does this since no rows will be returned except those of the inner table. Therefore, an INNER JOIN is more efficient. The phrase “merge join” can be found in the EXPLAIN output instead of “outer join” to verify this event.

The next SELECT was executed earlier as an inner join and returned 2 rows. Here it has been converted to an outer join. However, the output from the EXPLAIN shows in step 5 that an inner (merge) join will be used because customer name is a column from the inner table (Customer table):
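A sketch of the request with EXPLAIN in front, using the assumed customer and order table names:

  EXPLAIN
  SELECT Customer_name, Order_number, Order_total
  FROM Customer_Table AS CUST
    RIGHT OUTER JOIN Order_Table AS ORD
      ON CUST.Customer_number = ORD.Customer_number
  WHERE Customer_name = 'Billy''s Best Choice' ;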

(The EXPLAIN output is not reproduced here; step 5 of the plan shows a merge join rather than an outer join.)

   

OUTER JOIN Hints

The easiest way to begin writing an OUTER JOIN is to:

1. Start with an INNER JOIN and convert to an OUTER JOIN.

Once the INNER JOIN is working, change the appropriate INNER descriptors to LEFT OUTER, RIGHT OUTER or FULL OUTER join based on the desire to include the exception rows. Since INNER and OUTER joins can be used together, one join at a time can be changed to validate the output. Use the join diagram below to convert the INNER JOIN to an OUTER JOIN.

2. For joins with greater than two tables, think of it as: JOIN two tables at a time.

It makes the entire process easier by concentrating on only two tables instead of all tables. The optimizer will always join two tables, whether serially or in parallel and it is smart enough to do it in the most efficient manner possible.

3. Don’t worry about which tables you join first.

The optimizer will determine which tables should be joined first for the optimal plan.

4. Know how the WHERE clause works if it is used in an OUTER JOIN to eliminate rows.

A. It is applied after the join is complete, not when rows are read like the INNER JOIN.

B. It should reference columns from the outer table. If columns from the inner table are referenced in a WHERE clause, the optimizer will most likely perform a merge join (INNER) for efficiency. This is actually an INNER JOIN operation and can be seen in the EXPLAIN output.

Join Diagram:


Where:

Table I rows:
A = that match Table II rows and match Table III rows (INNER join, perfect data)
B = that match Table II rows, but not Table III rows
C = that do not match Table II rows or Table III rows
D = that do not match Table II rows, but do match Table III rows

Table II rows:
E = that do not match Table I rows, nor Table III rows
F = that do not match Table I rows, but do match Table III rows

Table III rows:
G = that do not match Table I or Table II rows


   

Parallel Join Processing

There are four basic types of joins that Teradata can perform, depending on the characteristics of the table definitions. When the join column is the primary index (PI) column of one table and a unique secondary index (USI) of the other, the join is referred to as a nested join and involves, at most, three AMPs. The second type of join is a merge join; there are three different forms of merge join, based on the request. The newest type of join in Teradata is the Row Hash join, which uses the pre-sorted row hash value instead of a sorted data value match. This is beneficial since the data row is stored based on the row hash value and not the data value. The last type is the product join.

In Teradata, each AMP performs all join processing in parallel locally. This means that matching values in the join columns must be on the same AMP to be matched. When the rows are not distributed and stored on the same AMP, they must be temporarily moved to the same AMP, in spool. Remember, rows are distributed on the value in the PI column(s). If joins are performed on the PI of both tables, no row movement is necessary. This is because the rows with the same PI value are on the same AMP – easy, but not always practical. Most joins use a primary key, which might be the UPI and a foreign key, which is probably not the PI.

Regardless of the join type, in a parallel environment, the movement of at least one row is normally required. This movement puts all matching rows together on the same AMP. The movement is usually required due to the user’s choice of a PI. Remember, it is the PI data value that is used for hashing and row distribution to an AMP. Therefore, since the joined columns are mostly columns other than the PI, rows need to be redistributed to another AMP. The redistributed rows will be temporarily stored in spool space and used from there for the join processing.

The optimizer will attempt to determine the most efficient path for data row movement. Its choice will be based on the amount of data involved. The three join strategies available are: 1- duplicate all rows of one table onto every AMP, 2- redistribute the rows of one table by hashing the non-PI join column and sending them to the AMP containing the matching PI row, and 3- redistribute both tables by hashed join column value.

The duplication of all rows is a popular approach when the non-PI column is on a small table. Therefore, copying all rows is faster than hashing and distributing all rows. This technique is also used when doing a product join and, worse, a Cartesian product join.

When both tables are large, the redistribution of the non-PI column row to the AMP with the PI column will be used to save space on each AMP. All participating rows are redistributed so that they are on the same AMP with the same data value used by the PI for the other table. The last choice is the redistribution of all participating row from both tables by hashing on the join column. This is required when the join is on a column that is not the PI in either table. Using this last type of join strategy will require the most spool space. Still, this technique allows Teradata to quickly join tables together in a parallel environment. By combining the speed of the BYNET, the experience of the PE optimizer, and the hashing capabilities of Teradata the data can be temporarily moved to meet the demands of the SQL query. Do not underestimate the importance or brilliance of this capability. As queries change and place new demands on the data, Teradata is flexible and powerful enough to move the data temporarily and quickly to the proper location.

Redistribution requires overhead processing. It has nothing to do with the join processing, but everything to do with preparing for the join. This is the primary reason that many tables will use a column that is not the primary key column as a NUPI. This way, the join columns used in the WHERE or the ON are used for distribution and the rows are stored on the same AMP. Therefore, the join is performed without the need to redistribute data. However, normally some redistribution is needed. So, make sure to COLLECT STATISTICS (see DDL chapter) on the join columns. The strategy that the optimizer chooses can be seen in the output from an EXPLAIN.
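For example, statistics might be collected on an assumed join column like this:

  COLLECT STATISTICS ON Order_Table COLUMN Customer_number ;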


   

Join Index Processing

Sometimes, regardless of the join plan or indices defined, certain joins cannot be performed in a short enough time frame to satisfy the users. When this is the case, another alternative must be explored. Later chapters in this book discuss temporary tables and summary tables as available techniques. If none of these provide a viable solution, yet another option is needed.

The other way to improve join processing is the use of a JOIN INDEX. It is a pre-join that stores the joined rows. Then, when the join index “covers” the user’s SELECT columns, the optimizer automatically uses the stored join index rows to retrieve the pre-joined rows from multiple tables instead of doing the join again. The term used here is covers. It means that if all columns requested by the user are present in the join index it is used. If even one column is requested that is not in the join index, it cannot be used. Therefore, the actual join must be processed to get that extra column.

The speed of the join index is its main advantage. To enhance its on-going use, whenever a value in a column in a row for a table used within a join index is changed, the corresponding value in the join index row(s) is also changed. This keeps the join index consistent with the rows in the actual tables.
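As a sketch, a join index that pre-joins the assumed customer and order tables might be defined as:

  CREATE JOIN INDEX Cust_Ord_JI AS
  SELECT CUST.Customer_number, Customer_name, Order_number, Order_total
  FROM Customer_Table AS CUST
    INNER JOIN Order_Table AS ORD
      ON CUST.Customer_number = ORD.Customer_number ;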

For more information on join index usage, see Chapter 18 in this book.


   

DATE, TIME, and TIMESTAMP

Teradata has a date function and a time function built into the database and the ability to request this data from the system. In the early releases, DATE was a valid data type for storing the combination of year, month and day, but TIME was not. Now, TIME and TIMESTAMP are both valid data types that can be defined and stored within a table.

The Teradata RDBMS stores the date in YYYMMDD format on disk. The YYY is an offset value from the base year of 1900. The MM is the month value from 1 to 12 and the DD is the day of the month. Using this format, the database can currently work with dates beyond the year 3000. So, it appears that Teradata is Y3K compliant. Teradata always stores a date as a numeric INTEGER value.

The following calculation demonstrates how Teradata converts a date to the YYYMMDD date format, for storage of January 1, 1999:

Formula for INTEGERDATE = ((Year - 1900) * 10000) + (Month * 100) + Day

The stored data for the date January 1, 1999 is converted to:
Year  = (1999 - 1900) * 10000 = 0990000  (year portion)
Month = 01 * 100              =   +0100  (month portion)
Day   = 01                    =     +01  (day portion)
                                0990101  stored on disk

Although years prior to 2000 look fairly “normal” with an implied year for the 20th Century, after 2000 the stored year values do not look like the normal concept of a year (such as 100). Fortunately, Teradata automatically does all the conversion and makes it transparent to the user. The remainder of this book will provide SQL examples using both a numeric date as well as the character formats of ‘YY/MM/DD’ and ‘YYYY-MM-DD’.

The next conversion shows the data stored for January 1, 2000 (notice that YYY=100, or 100 years from 1900):
Year  = (2000 - 1900) * 10000 = 1000000  (year portion)
Month = 01 * 100              =   +0100  (month portion)
Day   = 01                    =     +01  (day portion)
                                1000101  stored on disk

Additionally, since the date is stored as an integer and an integer is a signed value, dates prior to the base year of 1900 can also be stored. The same formula applies for the date conversion regardless of the century. However, since dates prior to 1900, like 1800, are smaller values, the result of the subtraction is a negative number.


   

ANSI Standard DATE Reference

CURRENT_DATE is the ANSI Standard name for the date function. All references to the original DATE function continue to work and return the same date information. Furthermore, they both display the date in the same format.


   

INTEGERDATE

INTEGERDATE is the default display format for most Teradata database client utilities. It is in the form of YY/MM/DD. It has nothing to do with the way the data is stored on disk, only the format of the output display. The current exception to this is Queryman. Since it uses the ODBC, it displays only the ANSI date, as seen below.

Later in this book, the Teradata FORMAT function is also addressed to demonstrate alternative arrangements regarding year, month and day for output presentation.

/* Display today’s date, this example assumes Oct. 1, 2001 */

Traditional Teradata ANSI

SELECT DATE;                  SELECT CURRENT_DATE;

DATE                          CURRENT_DATE
01/10/01                      01/10/01

Figure 8-1

To change the output default display, see the DATEFORM options in the next section of this chapter.


   

ANSIDATE

Teradata was updated in release V2R3 to include the ANSI date display and reserved name. The ANSI format is: YYYY-MM-DD.

/* Display today’s date, this example assumes Oct. 1, 2001 */

Traditional Teradata ANSI

SELECT DATE;                  SELECT CURRENT_DATE;

DATE                          CURRENT_DATE
2001-10-01                    2001-10-01

Figure 8-2

Since we are now beyond the year 1999, it is advisable to use this ANSI format to guarantee that everyone knows the difference between all the years of each century, such as: 2000, 1900 and 1800. If you regularly use tools via the ODBC (software for Open Data Base Connectivity), this is the default display format for the date.


   

DATEFORM

Teradata has traditionally been Y2K compliant. In reality, it is compliant to years beyond 3000. However, the default display format using YY/MM/DD is not ANSI compliant.

In Teradata, release V2R3 allows a choice of whether to display the date in the original display format (YY/MM/DD) or the newer ANSI format (YYYY-MM-DD). When installed, Teradata defaults at the system level to the original format, called INTEGERDATE. However, this system default DATEFORM may be over-ridden by updating the DBS Control record.

The DATEFORM:

Controls default display of selected dates

Controls expected format for import and export of dates as character strings (‘YY/MM/DD’ or ‘YYYY-MM-DD’) in the load utilities

Can be over-ridden by USER or within a Session at any time.

System Level Definition

MODIFY GENERAL 14 = 0    /* INTEGERDATE (YY/MM/DD) */
MODIFY GENERAL 14 = 1    /* ANSIDATE (YYYY-MM-DD) */

User Level Definition

CREATE USER username ……                     

DATEFORM = {INTEGERDATE | ANSIDATE} ;

Session Level Declaration

In addition to setting the system default in the control record, a user can request the format for their individual session. The syntax is:

SET SESSION DATEFORM = {ANSIDATE | INTEGERDATE} ;

In the above settings, the “ | “ is used to represent an OR condition. The setting can be ANSIDATE or INTEGERDATE. Regardless of the DATEFORM being used, ANSIDATE or INTEGERDATE, these define load and display characteristics only. Remember, the date is always stored on disk in the YYYMMDD format, but the DATEFORM allows you to select the format for display.


   

DATE Processing

Much of the time spent processing dates is dedicated to storage and reference. Yet, there are times that one date yields or derives a second date. For instance, once a bill has been sent to a customer, the expectation is that payment comes 60 days later. The challenge becomes the correct calculation of the exact due date.

Since Teradata stores the date as an INTEGER, it allows simple and complex mathematics to calculate new dates from dates. The next SELECT operation uses the Teradata date arithmetic and DATEFORM=INTEGERDATE to show the month and day of the payment due date in 60 days:
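A sketch of that SELECT; the restriction to 1999 orders is an assumption based on the four rows returned:

  SELECT Order_date + 60 (TITLE 'Due Date')
        ,Order_date
        ,Order_total
  FROM Order_Table
  WHERE Order_date >= 990101 ;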

4 Rows Returned

Due Date Order_date Order_total

99/12/09 99/10/10 $15,231.62
99/03/02 99/01/01 $8,005.91
99/11/08 99/09/09 $23,454.84
99/11/30 99/10/01 $5,111.47

Besides a due date, the SQL can also calculate a discount period date 10 days prior to the payment due date using the alias name:
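A sketch; the Teradata extension of reusing the alias Due_Date in the same list, and the assumed 2 percent discount, are inferred from the output below:

  SELECT Order_date
        ,Order_date + 60 (TITLE 'Due Date') AS Due_Date
        ,Order_total
        ,Due_Date - 10 (TITLE 'Discount Date')
        ,Order_total * 0.98 (TITLE 'Discounted')
  FROM Order_Table
  WHERE Order_date >= 990101 ;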

4 Rows Returned

Order_date Due Date Order_total Discount Date Discounted


99/10/10 99/12/09 $15,231.62 99/11/29 $14,926.99
99/01/01 99/03/02 $8,005.91 99/02/20 $7,845.79
99/09/09 99/11/08 $23,454.84 99/10/29 $22,985.74
99/10/01 99/11/30 $5,111.47 99/11/20 $5,009.24

In the above example, it was demonstrated that a DATE + or - an INTEGER results in a new date (date { + | - } integer = date). However, it probably does not make a lot of sense to multiply or divide a date by a number.

As seen earlier in this chapter, the stored format of the date is YYYMMDD. Since DD is the lowest component, the 60 being added to the order date in the above SELECT is assumed to be days. The system is smart enough to know that it is dealing with a date. Therefore, it is smart enough to know that a normal year contains 365 days.

The properties of algebra tell us that equations can be rearranged and still be valid. Therefore, a DATE - a DATE results in an INTEGER (date - date = integer). This INTEGER represents the number of days between the dates.

This chart summarizes the math operations on dates

Operation Result

DATE - DATE            Interval (days between dates)
DATE + or - integer    DATE
Figure 8-3

This SELECT uses this principle to display the number of days I was alive on my last birthday:
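A sketch, assuming the birthday literals described in the text:

  SELECT (DATE '2000-10-01' - DATE '1952-10-01') (TITLE 'Mike''s Age in Days') ;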

1 Row Returned

Mike’s Age in Days

17532

The above example subtracted my actual birth date in 1952 from one of my later birthdays (October 1, 2000). Notice how awful an age looks in days! More importantly, notice how the TITLE slips in the fact that you can use two single quotes to store or display a literal single quote in a character string.


As mentioned above, an age in days looks awful and that is probably why we do not use that format. I am not ready to tell someone I am just a little over 17000. Instead, we think about ages in years. To convert the days to years, again math can be used as seen in the following SELECT:
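A sketch of the conversion:

  SELECT (DATE '2000-10-01' - DATE '1952-10-01') / 365 (TITLE 'Mike''s Age in Years') ;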

1 Row Returned

Mike's Age in Years

48

Wow! I feel so much younger now. This is where division begins to make sense, but remember, the INTEGER is not a DATE. At the same time, it assumes that all years have 365 days. It only does the math operations specified in the SQL statement.

Now, what day was he born?

The next SELECT uses the concatenation, date arithmetic and a blank TITLE to produce the desired output:
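A sketch; the concatenated literal and blank TITLE are assumptions based on the description:

  SELECT ('Mike was born on day ' || ((DATE '1952-10-01' - DATE '1900-01-01') MOD 7)) (TITLE ' ') ;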

1 Row Returned

Mike was born on day

2

The above subtraction results in the number of days between the two dates. Then, the MOD 7 divides by 7 to get rid of the number of weeks and results in the remainder. A MOD 7 can only result in values 0 thru 6 (always 1 less than the MOD operator). Since January 1, 1900 ( 101(date) ) is a Monday, Mike was born on a Wednesday.

This chart can be used for the day of the week based on the above formula and 101(date)

Result Day of the Week

0 Monday
1 Tuesday
2 Wednesday
3 Thursday
4 Friday
5 Saturday
6 Sunday

Figure 8-4

The following SELECT uses a year’s worth of days to derive a new date that is 365 days away:
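A sketch:

  SELECT Order_date
        ,Order_date + 365 (TITLE 'Year Later Date')
        ,Order_total
  FROM Order_Table
  ORDER BY 1 ;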

5 Rows Returned

Order_date Year Later Date Order_total

98/05/04 99/05/04 $12,347.53
99/01/01 00/01/01 $8,005.91
99/09/09 00/09/08 $23,454.84
99/10/01 00/09/30 $5,111.47
99/10/10 00/10/09 $15,231.62

In the above, the year 1999 was not a leap year. Therefore, the value of 365 is used. Likewise, had the beginning year been 2000, then 366 needs to be used because it is a Leap Year. Remember, the system is simply doing the math that is indicated in the SQL statement. If a year were always needed, regardless of the number of days, see the ADD_MONTHS function.

ADD_MONTHS

Compatibility: Teradata Extension

The Teradata ADD_MONTHS function can be used to calculate a new date. This date may be in the future (addition) or in the past (subtraction). The calendar intelligence is built-in for the number of days in a month as well as leap year processing. Since the ANSI CURRENT_DATE and CURRENT_TIME are compatible with the original DATE and TIME functions, the ADD_MONTHS works with them as well.

Below is the syntax for the ADD_MONTHS function:
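ADD_MONTHS( <date-or-timestamp-expression>, <number-of-months> )

A positive number of months moves forward in time; a negative number moves backward.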


The next SELECT uses literals instead of table rows to demonstrate the calendar logic used by the ADD_MONTHS function when beginning with the last day of a month and arriving at the last day of February:
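One plausible reconstruction, assuming every expression starts from the literal October 30, 2000 (the literal and the month and day counts are assumptions chosen to agree with the output below):

  SELECT ADD_MONTHS(DATE '2000-10-30', 4)   AS FEB_Non_Leap
        ,DATE '2000-10-30' + 120            AS Oct_P120
        ,ADD_MONTHS(DATE '2000-10-30', -8)  AS FEB_Leap_Yr
        ,DATE '2000-10-30' - 240            AS Oct_M240
        ,ADD_MONTHS(DATE '2000-10-30', 48)  AS FEB_Leap_Yr2
        ,DATE '2000-10-30' + 1461           AS Oct_4Yrs ;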

1 Row Returned

FEB_Non_Leap Oct_P120 FEB_Leap_Yr Oct_M240 FEB_Leap_Yr2 Oct_4Yrs

2001-02-28 01/02/27 2000-02-29 00/03/04 2004-10-30 04/10/30

Notice, when using the ADD_MONTHS function, that all the output displays in ANSI date form. This is true when using BTEQ or Queryman. Conversely, the date arithmetic uses the default date format. Likewise, the second ADD_MONTHS uses –8, which equates to subtraction or going back in time versus ahead. Additionally, because months have a varying number of days, the output from math is likely to be different than the ADD_MONTHS.

The next SELECT uses the ADD_MONTHS function as an alternative to the previous SELECT operations for showing the month and day of the payment due date in 2 months:
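A sketch:

  SELECT ADD_MONTHS(Order_date, 2) (TITLE 'Due Date')
        ,Order_date
        ,Order_total
  FROM Order_Table
  ORDER BY 1 ;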

5 Rows Returned

Due Date Order_date Order_total

1998-07-04 1998-05-04 $12,347.53
1999-03-01 1999-01-01 $8,005.91
1999-11-09 1999-09-09 $23,454.84
1999-12-01 1999-10-01 $5,111.47
1999-12-10 1999-10-10 $15,231.62

The ADD_MONTHS function also takes into account the last day of each month. The following goes from the last day of one month to the last day of another month:
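A sketch; the starting literals are assumptions consistent with the output below:

  SELECT ADD_MONTHS(DATE '1998-02-28', 24)  AS Leap_Ahead_2yrs
        ,ADD_MONTHS(DATE '2002-02-28', -24) AS Leap_Back_2yrs
        ,ADD_MONTHS(DATE '2001-06-30', 1)   AS With30_31_ ;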

1 Row Returned

Leap_Ahead_2yrs Leap_Back_2yrs With30_31_

2000-02-29 2000-02-29 2001-07-31

Whether going forward or backward in time, a leap year is still recognized using ADD_MONTHS.


   

ANSI TIME

Teradata has also been updated in V2R3 to include the ANSI time display, reserved name and the new TIME data type. Additionally, the clock is now intelligent and can carry seconds over into minutes.

CURRENT_TIME is the ANSI name of the time function. All current SQL references to the original Teradata TIME function continue to work.

/* Display the time, this example assumes 12:15PM */

Traditional Teradata ANSI

SELECT TIME;                  SELECT CURRENT_TIME;

TIME                          CURRENT_TIME
12:15:00                      12:15:00

Figure 8-5

Although the time could be displayed prior to release V2R3, when stored, it was converted to a character column type. Now, TIME is also a valid data type; it may be defined in a table and retains the HH:MM:SS properties.

As well as creating a TIME data type, intelligence has been added to the clock software. It can increment or decrement TIME with the result increasing to the next minute or decreasing from the previous minute based on the addition or subtraction of seconds.

When storing TIME on disk, this chart indicates the amount of storage required:

TIME(n) as: HH:MM:SS.nnnnnn n = 0-6 (maximum is 6 digits to the right of the decimal, default = 6)

HH stored as byteint (1 byte)
MM stored as byteint (1 byte)
SS stored as decimal(8,6) (4 bytes)
Figure 8-6

TIME representation character display length:
TIME(0) - 10:14:38          CHAR(8)
TIME(6) - 10:14:38.201163   CHAR(15)


   

EXTRACT

Compatibility: ANSI

Both DATE and TIME data are special in terms of relational design, since each is comprised of 3 parts and each is decomposable. Decomposable data is data that is not at its most granular level. For example, you may only want to see the hour.

The EXTRACT function is designed to do the decomposition on these data types. It works with both the DATE and TIME functions. This includes the original and newer ANSI expressions. The operation is to pull a specific portion out of the date or time data.

The syntax for EXTRACT:
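EXTRACT( { YEAR | MONTH | DAY | HOUR | MINUTE | SECOND } FROM <date-or-time-expression> )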

The next SELECT uses the EXTRACT with date and time literals to demonstrate the coding technique and the resulting output:
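A sketch, assuming the literals October 1, 2000 and 10:01:30:

  SELECT EXTRACT(YEAR FROM DATE '2000-10-01')  AS Yr_Part
        ,EXTRACT(MONTH FROM DATE '2000-10-01') AS Mth_Part
        ,EXTRACT(DAY FROM DATE '2000-10-01')   AS Day_Part
        ,EXTRACT(HOUR FROM TIME '10:01:30')    AS Hr_Part
        ,EXTRACT(MINUTE FROM TIME '10:01:30')  AS Min_Part
        ,EXTRACT(SECOND FROM TIME '10:01:30')  AS Sec_Part ;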

1 Row Returned

Yr_Part Mth_Part Day_Part Hr_Part Min_Part Sec_Part

2000 10 01 10 1 30

The EXTRACT can be very helpful when there is a need to have a single component for controlling access to data or the presentation of data. For instance, when calculating aggregates, it might be necessary to group the output on a change in the month. Since the data represents daily activity, the month portion needs to be evaluated separately.

The Order table below is used to demonstrate the EXTRACT function in a SELECT:

Order Table - contains 5 orders

Order_number  Customer_number  Order_date  Order_total

PK            FK
UPI           NUSI             NUSI

123456        11111111         980504      12347.53
123512        11111111         990101      08005.91
123552        31323134         991001      05111.47
123585        87323456         991010      15231.62
123777        57896883         990909      23454.84

Figure 8-7

The following SELECT uses the EXTRACT to only display the month and also to control the number of aggregates displayed in the GROUP BY:
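A sketch:

  SELECT EXTRACT(MONTH FROM Order_date)
        ,COUNT(*) AS Nbr_of_rows
        ,AVG(Order_total)
  FROM Order_Table
  GROUP BY 1
  ORDER BY 1 ;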

4 Rows Returned

EXTRACT(MONTH FROM Order_date)  Nbr_of_rows  Average(Order_total)

 1   1    8005.91
 5   1   12347.53
 9   1   23454.84
10   2   10171.54

The next SELECT operation uses entirely ANSI compliant code with DATEFORM=ANSIDATE to show the month and day of the payment due date in 2 months and 4 days. Notice that it uses double quotes to allow reserved words as alias names, and ANSIDATE in the comparison and display:
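A sketch using ANSI syntax; the exact interval amounts and the Order_date restriction are assumptions, so the derived values may differ slightly from the output shown:

  SELECT 'Due Date:' (TITLE ' ')
        ,EXTRACT(MONTH FROM Order_date + INTERVAL '2' MONTH + INTERVAL '4' DAY) AS "Month"
        ,EXTRACT(DAY FROM Order_date + INTERVAL '2' MONTH + INTERVAL '4' DAY)   AS "Day"
        ,EXTRACT(YEAR FROM Order_date + INTERVAL '2' MONTH + INTERVAL '4' DAY)  AS "Year"
        ,Order_date
        ,Order_total
  FROM Order_Table
  WHERE Order_date >= DATE '1999-01-01'
  ORDER BY 2, 3 ;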


4 Rows Returned

  Month Day Year Order_date Order_total

Due Date:  3  6 1999 Jan 01, 1999  8005.91
Due Date: 11 12 1999 Aug 09, 1999 23454.84
Due Date: 12  4 1999 Oct 10, 1999  5111.47
Due Date: 12 13 1999 Oct 10, 1999 15231.62


   

Implied Extract of Day, Month and Year

Compatibility: Teradata Extension

Although the EXTRACT works great and is ANSI compliant, it is a function. Therefore, it must be executed, with parameters passed to it to identify the desired portion, and it must pass back the answer. As a result, there is additional overhead processing required to use it.

It was mentioned earlier that Teradata stores a date as an integer and therefore allows math operations to be performed on a date.

The syntax for implied extract:
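<date> MOD 100           = day (DD)
<date> / 100 MOD 100     = month (MM)
<date> / 10000 + 1900    = year (YYYY)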

The following SELECT uses math to extract the three portions of Mike’s literal birthday:
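A sketch, using the stored integer form of October 1, 2001:

  SELECT DATE '2001-10-01' MOD 100         AS Day_portion
        ,DATE '2001-10-01' / 100 MOD 100   AS Month_portion
        ,DATE '2001-10-01' / 10000 + 1900  AS Year_portion ;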

1 Row Returned

Day_portion Month_portion Year_portion

1 10 2001

Remember that the date is stored as yyymmdd. The literal values are used here to provide a date of Oct. 1, 2001. The day portion is obtained by making the dd portion (the last 2 digits) the remainder of the MOD 100. The month portion is obtained by dividing by 100 to eliminate the dd, leaving the mm as the new last 2 digits, which become the remainder of another MOD 100. The year portion is the trickiest. Since it is stored as yyy (yyyy - 1900), we must add 1900 to the stored value to convert it back to the yyyy format. What do you suppose the EXTRACT function does? Same thing.


   

ANSI TIMESTAMP

Another new data type, added to Teradata in V2R3 to comply with the ANSI standard, is the TIMESTAMP. TIMESTAMP is now a display format, a reserved name and a new data type. It is a combination of the DATE and TIME data types combined into a single column data type.

Since this is entirely new, there is no previous compatibility to contrast.

Teradata ANSI

Did not previously exist      SELECT CURRENT_TIMESTAMP;

                              CURRENT_TIMESTAMP
                              2000-10-01 12:15:00

Figure 8-8

Timestamp representation character display length:
TIMESTAMP(0)  1998-12-07 11:37:58          CHAR(19)
TIMESTAMP(6)  1998-12-07 11:37:58.213000   CHAR(26)

Notice that there is a space between the DATE and TIME portions of a timestamp. This is a required element to delimit or separate the day from the hour.


   

TIME ZONES

In V2R3, Teradata has the ability to access and store both the hours and the minutes reflecting the difference between the user’s time zone and the system time zone. From a World perspective, this difference is normally the number of hours between a specific location on Earth and the United Kingdom location that was historically called Greenwich Mean Time (GMT). Since the Greenwich observatory has been “decommissioned,” the new reference to this same time zone is called Universal Time Coordinate (UTC).

A time zone relative to London (UTC) might be:

LA Miami Frankfurt Hong Kong

+8:00 +05:00 00:00 -08:00

A time zone relative to New York (EST) might be:

LA Miami Frankfurt Hong Kong

+3:00 00:00 -05:00 -13:00

Here, the time zones used are represented from the perspective of the system at EST. In the above, it appears to be backward. This is because the time zone is set using the number of hours that the system is from the user.

To show an example of TIME values, we randomly chose a time just after 10:00AM. Below, the various TIME with time zone values are designated as:

The default, for both TIME and TIMESTAMP, is to display six digits of decimal precision in the second’s portion. Time zones are set either at the system level (DBS Control), the user level (when the user is created or modified), or at the session level as an override.

SETTING TIME ZONES

A Time Zone should be established for the system and every user in each different time zone.

Setting the system default time zone:

MODIFY GENERAL 16 = x    /* Hours,   n = -12 to 13 */
MODIFY GENERAL 17 = x    /* Minutes, n = -59 to 59 */

Setting a User’s time zone requires choosing either LOCAL, NULL, or a variety of explicit values:
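A sketch of the forms (the user name and offset are illustrative):

  CREATE USER username ... TIME ZONE = LOCAL ;
  MODIFY USER username AS TIME ZONE = NULL ;
  MODIFY USER username AS TIME ZONE = '-05:00' ;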

A Teradata session can modify the time zone during normal operations without requiring a logoff and logon.
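A sketch of the session-level forms:

  SET TIME ZONE LOCAL ;
  SET TIME ZONE USER ;
  SET TIME ZONE INTERVAL -'05:00' HOUR TO MINUTE ;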

Using TIME ZONES

A user’s time zone is now part of the information maintained by Teradata. The settings can be seen in the extended information available in the HELP SESSION request.
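The request itself is simply:

  HELP SESSION ;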

1 Row Returned

User Name MJL

Account Name             MJL
Logon Date               00/10/15
Logon Time               08:43:45
Current DataBase         Accounting
Collation                ASCII
Character Set            ASCII
Transaction Semantics    Teradata
Current DateForm         IntegerDate
Session Time Zone        00:00
Default Character Type   LATIN
Export Latin             1
Export Unicode           1
Export Unicode Adjust    0
Export KanjiSJIS         1
Export Graphic           0

By creating a table and requesting the WITH TIME ZONE option for a TIME or TIMESTAMP data type, this additional offset is also stored.

The following SHOW command displays a table containing one timestamp column with TIME ZONE and one column as a timestamp column without TIME ZONE:

SHOW TABLE Tstamp_test;

Text of DDL Statement Returned
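A plausible reconstruction of the returned DDL (the CHAR column and the index choice are assumptions):

  CREATE TABLE Tstamp_test
    (TS_zone         CHAR(3)
    ,TS_with_zone    TIMESTAMP(6) WITH TIME ZONE
    ,TS_without_zone TIMESTAMP(6))
  PRIMARY INDEX (TS_zone) ;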


As rows were inserted into the table, the time zone of the user’s session was automatically captured along with the data for TS_with_zone. Storing the time zone requires an additional 2 bytes of storage beyond the date+time requirements.

The next SELECT shows the data rows currently in the table:

SELECT * FROM Tstamp_test ;

4 Rows Returned

TS_zone TS_with_zone TS_without_zone

UTC 2000-10-01 08:12:00.000000+05:00 2000-10-01 08:12:00.000000
EST 2000-10-01 08:12:00.000000+00:00 2000-10-01 08:12:00.000000
PST 2000-10-01 08:12:00.000000-03:00 2000-10-01 08:12:00.000000
HKT 2000-10-01 08:12:00.000000-11:00 2000-10-01 08:12:00.000000

Normalizing TIME ZONES

Teradata has the ability to incorporate the use of time zones into SQL for a relative view of the data based on one locality versus another.

This SELECT adjusts the data rows based on their TIME ZONE data in the table:
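A sketch; casting the zoned timestamp to a plain TIMESTAMP converts each value into the session's time zone:

  SELECT TS_zone
        ,TS_with_zone
        ,CAST(TS_with_zone AS TIMESTAMP(6)) AS T_Normal
  FROM Tstamp_test ;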

4 Rows Returned

TS_zone TS_with_zone T_Normal


UTC 2000-10-01 08:12:00.000000+05:00 2000-10-01 03:12:00.000000
EST 2000-10-01 08:12:00.000000+00:00 2000-10-01 08:12:00.000000
PST 2000-10-01 08:12:00.000000-03:00 2000-10-01 11:12:00.000000
HKT 2000-10-01 08:12:00.000000-11:00 2000-10-01 19:12:00.000000

Notice that the Time Zone value was added to or subtracted from the time portion of the timestamp to adjust the values to the perspective of a single time zone. As a result, it has normalized the different Time Zones with respect to the system time.

As an illustration, when the transaction occurred at 8:12 AM locally in the PST Time Zone, it was already 11:12 AM in EST, the location of the system. The times in the columns have been normalized with respect to the time zone of the system.


   

DATE and TIME Intervals

To make Teradata SQL more ANSI compliant and compatible with other RDBMS SQL, NCR has added INTERVAL processing. Intervals are used to perform DATE, TIME and TIMESTAMP arithmetic and conversion.

Although Teradata allowed arithmetic on DATE and TIME, it was not performed in accordance with ANSI standards and was therefore an extension instead of a standard. With INTERVAL being a standard instead of an extension, more SQL can be ported directly from an ANSI compliant database to Teradata without conversion.

Additionally, when a data value was used to perform date or time math, it was always “assumed” to be at the lowest level for the definition (days for DATE and seconds for TIME). Now, any portion of either can be expressed and used.

INTERVAL Chart

The simple intervals are:    The more involved intervals are:

YEAR                         DAY TO HOUR
MONTH                        DAY TO MINUTE
DAY                          DAY TO SECOND
HOUR                         HOUR TO MINUTE
MINUTE                       HOUR TO SECOND
SECOND                       MINUTE TO SECOND

Figure 8-9

Using Intervals

To use the ANSI syntax for intervals, the SQL statement must be very specific as to what the data values mean and the format in which they are coded. ANSI standards tend to be lengthier to write and more restrictive as to what is and what is not allowed regarding the values and their use.

Simple INTERVAL examples using literals:

INTERVAL '500' DAY(3)
INTERVAL '3' MONTH
INTERVAL -'28' HOUR


Complex INTERVAL examples using literals:

INTERVAL '45 18:30:10' DAY TO SECOND
INTERVAL '12:12' HOUR TO MINUTE
INTERVAL '12:12' MINUTE TO SECOND

For several of the INTERVAL literals, their use seems obvious based on the non-numeric characters used. However, notice that the HOUR TO MINUTE and the MINUTE TO SECOND above are not so obvious. Therefore, the declaration of the meaning is important.

For instance, notice that they are coded as character literals. This allows for use of a slash (/), colon (:) and space as part of the literal. Also, notice that the use of a negative time frame requires a “-” sign to be outside of the quotes. The presence of the quotes also denotes that the numeric values are treated as character for conversion to a point in time.

The format of a timestamp requires the space between the day and hour when using intervals. For example, notice the blank space between the day and hour in the compound DAY TO HOUR interval. Without the space, it is an error.

INTERVAL Arithmetic with DATE and TIME

To use DATE and TIME arithmetic, it is important to keep in mind the results of various operations.

The chart below shows the Teradata implied arithmetic results.

DATE and TIME arithmetic Results prior to intervals:

DATE - DATE = Integer (days)
DATE MOD 100 = Integer (day of month)
DATE / 100 = Integer (year and month)
DATE / 10000 = Integer (year)
TIME - TIME = Integer (hours)
DATE + or - Integer = DATE
Figure 8-10

The chart below shows the ANSI explicit arithmetic results.

DATE and TIME arithmetic results using intervals:

DATE - DATE = Interval
TIME - TIME = Interval
TIMESTAMP - TIMESTAMP = Interval
DATE + or - Interval = DATE
TIME + or - Interval = TIME
TIMESTAMP + or - Interval = TIMESTAMP
INTERVAL + or - Interval = Interval
Figure 8-11

Note: It makes little sense to add two dates together.

Traditionally, the output of the subtraction is an integer, up to 2.147 billion. However, Teradata knows that when an integer is used in a formula with a date, it must represent a number of days. The following uses the ANSI representation for a DATE:

SELECT (DATE '1999-10-01' - DATE '1988-10-01') AS Assumed_Days ;

1 Row Returned

Assumed_Days

4017

The next SELECT uses the ANSI explicit DAY interval:

SELECT (DATE '1999-10-01' - DATE '1988-10-01') DAY AS Actual_Days ;

**** Failure 7453 Internal Field Overflow

The above request fails on an overflow of the INTERVAL. Using this ANSI interval, the output of the subtraction is an interval with 4 digits. The default for all intervals is 2 digits and therefore the overflow occurs until the SELECT is modified with DAY(4), below:

SELECT (DATE '1999-10-01' - DATE '1988-10-01') DAY(4) AS Actual_Days ;

1 Row Returned

Actual_Days

4017

Normally, a date minus a date yields the number of days between them. To see months instead, the following SELECT uses literals to demonstrate the conversions performed on various DATE and INTERVAL data:

SELECT (DATE '2000-10-01' - DATE '1999-10-01') MONTH (TITLE 'Months') ;

1 Row Returned

Months

12

The next SELECT shows INTERVAL operations used with TIME:
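A sketch; the two time literals, 2 hours and 35 minutes apart, are assumptions consistent with the output below:

  SELECT (TIME '12:35:00' - TIME '10:00:00') HOUR        AS Actual_hours
        ,(TIME '12:35:00' - TIME '10:00:00') MINUTE(3)   AS Actual_minutes
        ,(TIME '12:35:00' - TIME '10:00:00') SECOND(4)   AS Actual_seconds
        ,(TIME '12:35:00' - TIME '10:00:00') SECOND(4,4) AS Actual_seconds4 ;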

1 Row Returned

Actual_hours Actual_minutes Actual_seconds Actual_seconds4

2 155 9300.000000 9300.0000

Although Intervals tend to be more accurate, they are more restrictive and therefore, more care is required when coding them into the SQL constructs. However, one miscalculation, like in the overflow example, and the SQL fails. Additionally, 9999 is the largest value for any interval. Therefore, it might be required to use a combination of intervals, such as: MONTHS to DAYS in order to receive an answer without an overflow occurring.

CAST Using Intervals

Compliance: ANSI

The CAST function was seen in an earlier chapter as the ANSI method for converting data from one type to another. It can also be used to convert one INTERVAL to another INTERVAL representation. Although the CAST is normally used in the SELECT list, it works in the WHERE clause for comparison reasons.

Below is the syntax for using the CAST with an interval:
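CAST( <interval-expression> AS INTERVAL <interval-qualifier> )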


The following converts an INTERVAL of 6 years and 2 months to an INTERVAL number of months:

SELECT CAST( (INTERVAL '6-02' YEAR TO MONTH) AS INTERVAL MONTH );

1 Row Returned

6-02

74

Logic seems to dictate that if months can be shown, the years and months should also be available. This request attempts to convert 1300 months to show the number of years and months:
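A sketch of the failing request:

  SELECT CAST(INTERVAL '1300' MONTH(4) AS INTERVAL YEAR TO MONTH) ;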

*** Failure 7453 Interval Field Overflow.

The above failed because the number of months takes more than two digits to hold a number of years greater than 99. The fix is to change the YEAR to YEAR(3) and rerun:
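The corrected sketch:

  SELECT CAST(INTERVAL '1300' MONTH(4) AS INTERVAL YEAR(3) TO MONTH)
         (TITLE 'Years & Months') ;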

1 Row Returned

Years & Months

108-04

The biggest advantage in using the INTERVAL processing is that SQL written on another system is now compatible with Teradata.

At the same time, care must be taken to use a representation that is large enough to contain the answer. The default is 2 digits and anything larger, 4 digits maximum, must be literally requested. The incorrect size results in an SQL runtime error. The next section on the System Calendar demonstrates another way to convert from one interval of time to another.


   

OVERLAPS

Compatibility: Teradata Extension

When working with dates and times, sometimes it is necessary to determine whether two different ranges have common points in time. Teradata provides a Boolean function to make this test for you. It is called OVERLAPS; it evaluates to true if multiple points are in common, otherwise it returns false.

The syntax of the OVERLAPS is:
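( <start1>, <end1> ) OVERLAPS ( <start2>, <end2> )

Each pair contains two DATE, TIME, or TIMESTAMP values that define a range.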

The following SELECT tests two literal dates and uses the OVERLAPS to determine whether or not to display the character literal:
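A sketch; the literal ranges are assumptions chosen to match the explanation below:

  SELECT 'The dates overlap' (TITLE ' ')
  WHERE (DATE '2001-10-01', DATE '2001-11-30')
        OVERLAPS (DATE '2001-10-15', DATE '2001-12-31') ;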

1 Row Returned

The dates overlap

The literal is returned because both date ranges have from October 15 through November 30 in common.

The next SELECT tests two literal dates and uses the OVERLAPS to determine whether or not to display the character literal:
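A sketch; the ranges are assumed to touch only at November 30:

  SELECT 'The dates overlap' (TITLE ' ')
  WHERE (DATE '2001-10-01', DATE '2001-11-30')
        OVERLAPS (DATE '2001-11-30', DATE '2001-12-31') ;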

No Rows Found

The literal was not selected because the ranges do not overlap. The common single date of November 30 does not constitute an overlap. When dates are used, 2 days must be in common, and when time is used, 2 seconds must be contained in both ranges.


The following SELECT tests two literal times and uses the OVERLAPS to determine whether or not to display the character literal:
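A sketch; the literals are assumptions consistent with the explanation below, with the first range written so that it appears to end at 02:00:59:

  SELECT 'The times overlap' (TITLE ' ')
  WHERE (TIME '10:00:00', TIME '02:00:59')
        OVERLAPS (TIME '02:01:00', TIME '08:00:00') ;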

The times overlap

This is a tricky example and it is shown to prove a point. At first glance, it appears as if this answer is incorrect because 02:01:00 looks like it starts 1 second after the first range ends. However, the system works on a 24-hour clock when a date and time (timestamp) are not used together. Therefore, the system considers the earlier time of 2 AM as the start and the later time of 8 AM as the end of the range. Therefore, not only do they overlap, the second range is entirely contained in the first range.

The following SELECT tests two literal dates and uses the OVERLAPS to determine whether or not to display the character literal:
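A sketch; the use of NULL parameters here is an assumption based on item 2 of the notes that follow, making each range a single point in time:

  SELECT 'The dates overlap' (TITLE ' ')
  WHERE (DATE '2001-10-01', NULL)
        OVERLAPS (DATE '2001-11-30', NULL) ;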

No Rows Found

When using the OVERLAPS function, there are a couple of situations to keep in mind:

1. A single point in time, i.e. the same date, does not constitute an overlap. There must be at least one second of time in common for TIME or one day when using DATE.

2. When a NULL is used as one of the parameters, the other DATE or TIME constitutes a single point in time rather than a range.


   

System Calendar

Compatibility: Teradata Extension

Beginning with V2R3, Teradata provides a system calendar that is very helpful when date comparisons more complex than month, day, and year are needed. For example, most businesses require comparisons such as 1st quarter versus 2nd quarter. The system calendar is best used to avoid maintaining your own calendar table or performing sophisticated SQL calculations to derive the needed date perspective.

Teradata's calendar is implemented using a base date table named caldates with a single column named CDATES. The base table is never referenced directly. Instead, it is accessed through the view named CALENDAR. The base table contains one row for each date from January 1, 1900 through December 31, 2100. The system calendar table and views are stored in the Sys_Calendar database. This is a January-through-December calendar and has nothing to do with fiscal calendars.

The purpose of the system calendar is to provide an easy way to compare dates. For example, comparing activities from the first quarter of this year with the same quarter of last year can be quite valuable. The System Calendar makes these comparisons easy compared to working out the complexity of the various dates yourself.

Below is a list of column names, their respective data types, and a brief explanation of the potential values calculated for each when using the CALENDAR view:

Column Name          Data Type  Description

calendar_date        DATE       Standard Teradata date.
                                Equivalency: DATE
day_of_week          BYTEINT    1-7, where 1 is Sunday.
                                Equivalency: (DATE - DATE) MOD 7
day_of_month         BYTEINT    1-31, some months have fewer.
                                Equivalency: DATE MOD 100 or EXTRACT Day
day_of_year          SMALLINT   1-366, Julian day of the year.
                                Equivalency: None known
day_of_calendar      INTEGER    Number of days since 01/01/1900.
                                Equivalency: DATE - 101(date)
weekday_of_month     BYTEINT    The sequence of a day within a month; first Sunday=1, second Sunday=2, etc.
                                Equivalency: None known
week_of_month        BYTEINT    0-5, sequential week number within a month; a partial first week is week 0.
                                Equivalency: None known
week_of_year         BYTEINT    0-53, sequential week number within a year; a partial first week is week 0.
                                Equivalency: None known
week_of_calendar     INTEGER    Number of weeks since 01/01/1900.
                                Equivalency: (DATE - 101(date))/7
month_of_quarter     BYTEINT    1-3, each quarter has 3 months.
                                Equivalency: CASE on EXTRACT Month
month_of_year        BYTEINT    1-12, up to 12 months per year.
                                Equivalency: DATE/100 MOD 100 or EXTRACT Month
month_of_calendar    INTEGER    Number of months since 01/01/1900.
                                Equivalency: None needed
quarter_of_year      BYTEINT    1-4, up to 4 quarters per year.
                                Equivalency: CASE on EXTRACT Month
quarter_of_calendar  INTEGER    Number of quarters since 01/01/1900.
                                Equivalency: None needed
year_of_calendar     SMALLINT   Starts at 1900.
                                Equivalency: EXTRACT Year

The least useful of these columns appear to be those whose names end with "_of_calendar." As seen in the above descriptions, these values are all calculated from the calendar reference date of January 1, 1900. Unless a business transaction occurred on that date, they are largely meaningless.

The biggest benefit of the System Calendar is for determining the following: Day of the Week, Week of the Month, Week of the Year, Month of the Quarter and Quarter of the Year.

Most of the values are very straightforward. However, the column called Week_of_Month deserves some discussion. The description indicates that a partial week is week number 0. A partial week is any first week of a month that does not start on a Sunday. Therefore, not all months have a week 0 because some do start on Sunday.


Having these column references available, there is less need to make compound comparisons in SQL. For instance, determining a quarter otherwise requires 3 comparisons, one for each month in that quarter. Worse yet, each quarter of the year involves 3 different months, so the SQL might require modification each time a different quarter is desired.

The next SELECT uses the System Calendar to obtain the various date-related column values for October 1, 2001:
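A sketch of the request (the vertical output below is typical of BTEQ with side titles):

SELECT *
FROM Sys_Calendar.CALENDAR
WHERE calendar_date = DATE '2001-10-01';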

1 Row Returned

calendar_date 01/10/01

day_of_week          2
day_of_month         1
day_of_year          274
day_of_calendar      37164
weekday_of_month     1
week_of_month        0
week_of_year         39
week_of_calendar     5309
month_of_quarter     1
month_of_year        10
month_of_calendar    1222
quarter_of_year      4
quarter_of_calendar  408
year_of_calendar     2001

Since the calendar is a view, it is used like any other table, and its columns are selected or compared like any other columns. However, not all columns of all rows are needed for every application. Unlike a user-created calendar table, it will be faster, primarily because of reduced input (I/O) requirements.


Each date is stored as a 4-byte DATE; the other column values are materialized from the stored date. It makes sense that less I/O equates to a faster response. So, 4 bytes per date are read instead of the 32 or more bytes per date that a stored calendar row would need. There may be hundreds of different dates in a table with millions of rows. Therefore, utilizing the Teradata system calendar makes good sense.

Since the system calendar is a view, or virtual table, its primary access is via a join to a stored date (e.g., a billing or payment date). Whether the date is the current date or a stored date, it can be joined to the calendar. When a join is performed, a row is materialized in cache to represent the various aspects of that date.

The following example demonstrates the use of the WHERE clause for these comparisons, using the Quarter_of_Year column instead of three separate month comparisons (WHERE Quarter_of_Year = 1 versus WHERE Month_of_Year = 1 OR Month_of_Year = 2 OR Month_of_Year = 3) and the Day_of_Week column instead of DATE MOD 7, to simplify coding:
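A sketch of such a request, assuming an Order_Table like the one used in earlier chapters, joined to the calendar view on the stored order date:

SELECT Order_date, Order_total, Quarter_of_Year, Week_of_Month
FROM Order_Table INNER JOIN Sys_Calendar.CALENDAR
     ON Order_date = calendar_date
WHERE Quarter_of_Year = 3
  AND Week_of_Month = 1;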

1 Row Returned

Order_date Order_total Quarter_of_Year Week_of_Month

99/09/09 $23,454.84 3 1

As nice as it is to have a number that represents the day of the week, the output still isn't as clear as it might be. A little creativity fixes that.

This CREATE TABLE builds a table called Week_Days and populates it with the English names of the days of the week:
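A sketch of what the table might look like, with the column names chosen to match the output below:

CREATE TABLE Week_Days
( Day_of_Week  BYTEINT
, Wkday_Day    CHAR(9) )
UNIQUE PRIMARY INDEX ( Day_of_Week );

INSERT INTO Week_Days VALUES (1, 'Sunday');
INSERT INTO Week_Days VALUES (2, 'Monday');
INSERT INTO Week_Days VALUES (3, 'Tuesday');
INSERT INTO Week_Days VALUES (4, 'Wednesday');
INSERT INTO Week_Days VALUES (5, 'Thursday');
INSERT INTO Week_Days VALUES (6, 'Friday');
INSERT INTO Week_Days VALUES (7, 'Saturday');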


Once the table is available, it can be incorporated into SQL to make the output easier to read and understand, like the following:
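Sketched under the same assumptions, joining the stored order date to the calendar and the calendar's day number to Week_Days:

SELECT o.Order_date, o.Order_total, c.Day_of_Week, w.Wkday_Day
FROM Order_Table AS o
INNER JOIN Sys_Calendar.CALENDAR AS c
     ON o.Order_date = c.calendar_date
INNER JOIN Week_Days AS w
     ON c.Day_of_Week = w.Day_of_Week;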

2 Rows Returned

Order_date Order_total Day_of_Week Wkday_Day

99/09/09   $23,454.84   5   Thursday
99/10/01    $5,111.47   6   Friday

As demonstrated in this chapter, there are many ways to incorporate dates and date logic into SQL. The format of the date can be adjusted using the DATEFORM. The SQL may use ANSI functions or Teradata capabilities and functions. Now you are ready to go back and forth with a date (pun intended).


   

Transforming Character Data

Most of the time, it is acceptable to display data directly as it is stored in the database. However, there are times when it is not acceptable and the character data must be temporarily transformed. It might need shortening, or something as simple as eliminating undesired spaces from a value. The tools to make these changes are discussed here.

Earlier, we saw the CAST function as a technique to convert data. CAST can be used to truncate data, unless running in ANSI mode, which does not allow truncation. The functions in this chapter provide an alternative to CAST because they do not truncate data. Instead, they return a portion of the data. This is a slight distinction, but enough to provide some interesting capabilities.

We will examine the CHARACTERS, TRIM, SUBSTRING, SUBSTR, POSITION and INDEX functions. Alone, each function provides a capability that can be useful within SQL. However, when combined, they provide some powerful functionality.

This is an excellent time to remember one of the primary differences between ANSI mode and Teradata mode. ANSI mode is case sensitive and Teradata mode is not. Therefore, the output from most of these functions is shown here in both modes.


   

CHARACTERS Function

Compatibility: Teradata Extension

The CHARACTERS function is used to count the number of characters stored in a data column. It is easiest to use and most helpful when the characters being counted are stored in a variable-length VARCHAR column. A VARCHAR stores only the characters input and no trailing spaces after the last non-space character.

When referencing a fixed length CHAR column, the CHARACTERS function always returns a number that represents the maximum number of characters defined. This is because the database must store the data and pad to the full length using literal spaces. A space is a valid character and therefore, the CHARACTERS function counts every space.

The syntax of the CHARACTERS function:

CHARACTERS ( <column-name> )

Or

CHAR ( <column-name> )

To use the CHARACTERS (can be abbreviated as CHAR) function, simply pass it a column name. When referenced in the SELECT list, it displays the number of characters. When written into the WHERE clause, it can be used as a comparison value to decide whether or not the row should be returned.

The Employee table is used to demonstrate the functions in this chapter. The contents of this table are listed below:

Employee Table - contains 9 employees

Employee_No  Last_Name   First_name  Salary     Dept_No
PK                                              FK
UPI          NUSI                               NUSI

1232578      Chambers    Mandee      48,850.00    100
1256349      Harrison    Herbert     54,500.00    400
2341218      Reilly      William     36,000.00    400
2312225      Larkins     Loraine     40,200.00    300
2000000      Jones       Squiggy     32,800.50
1000234      Smythe      Richard     64,300.00     10
1121334      Strickling  Cletus      54,500.00    400
1324657      Coffing     Billy       41,888.88    200
1333454      Smith       John        48,000.00    200

Figure 9-1

The next SELECT demonstrates how to code the CHAR function in both the SELECT list and the WHERE clause, plus the answer set:
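A sketch, assuming the table is named Employee_Table, that First_name is a VARCHAR column, and that an ORDER BY was used to match the display order:

SELECT First_name,
       CHARACTERS(First_name) AS C_length
FROM Employee_Table
WHERE CHARACTERS(First_name) < 7
ORDER BY C_length DESC;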

4 Rows Returned

First_name C_length

Mandee  6
Cletus  6
Billy   5
John    4

If there are leading or embedded spaces stored within the column, the CHAR function counts them as valid, significant data characters.

The answer is exactly the same when using CHAR in the SELECT list and referencing its alias in the WHERE clause instead of repeating the CHAR function:
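The same sketch, rewritten to use the alias (Teradata permits an alias reference in the WHERE clause):

SELECT First_name,
       CHARACTERS(First_name) AS C_length
FROM Employee_Table
WHERE C_length < 7
ORDER BY C_length DESC;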

4 Rows Returned

First_name C_length

Mandee  6
Cletus  6
Billy   5
John    4

As mentioned earlier, the CHAR function works best on VARCHAR data. The following demonstrates its result on CHAR data by retrieving the last name and the length of the last name where the first name contains fewer than 7 characters:
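Sketched under the same assumptions, with Last_name as a fixed-length CHAR(20) column:

SELECT Last_name,
       CHARACTERS(Last_name) AS C_length
FROM Employee_Table
WHERE CHARACTERS(First_name) < 7
ORDER BY Last_name;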

4 Rows Returned

Last_name C_length

Chambers    20
Coffing     20
Smith       20
Strickling  20

Again, the space characters are present in the data and therefore counted. Hence, all the last names are 20 characters long. The comparison is on the first name but the display is based entirely on the last name.

The CHAR function is helpful for determining demographic information regarding the VARCHAR data stored within the Teradata database. However, sometimes this same information is needed on fixed-length CHAR data. When this is the case, the TRIM function is helpful.


   

CHARACTER_LENGTH Function

Compatibility: ANSI

The CHARACTER_LENGTH function is used to count the number of characters stored in a data column. It is the ANSI equivalent of the Teradata CHARACTERS function, available in V2R4. Like CHARACTERS, it is easiest to use and most helpful when the characters being counted are stored in a variable-length VARCHAR column. A VARCHAR stores only the characters input and no trailing spaces.

When referencing a fixed length CHAR column, the CHARACTER_LENGTH function always returns a number that represents the maximum number of characters defined. This is because the database must store the data and pad to the full length using literal spaces. A space is a valid character and therefore, the CHARACTER_LENGTH function counts every space.

The syntax of the CHARACTER_LENGTH function:

CHARACTER_LENGTH ( <column-name> )

To use the CHARACTER_LENGTH function, simply pass it a column name. When referenced in the SELECT list, it displays the number of characters. When written into the WHERE clause, it can be used as a comparison value to decide whether or not the row should be returned.

The contents of the same Employee table above are also used to demonstrate the CHARACTER_LENGTH function.

The next SELECT demonstrates how to code the CHARACTER_LENGTH function in both the SELECT list and the WHERE clause, plus the answer set:
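A sketch under the same assumptions as the CHARACTERS examples:

SELECT First_name,
       CHARACTER_LENGTH(First_name) AS C_length
FROM Employee_Table
WHERE CHARACTER_LENGTH(First_name) < 7
ORDER BY C_length DESC;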

4 Rows Returned

First_name C_length


Mandee  6
Cletus  6
Billy   5
John    4

If there are leading or embedded spaces stored within the column, the CHARACTER_LENGTH function counts them as valid, significant data characters.

As mentioned earlier, the CHARACTER_LENGTH function works best on VARCHAR data. The following demonstrates its result on CHAR data by retrieving the last name and the length of the last name where the first name contains fewer than 7 characters:
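Again as a sketch:

SELECT Last_name,
       CHARACTER_LENGTH(Last_name) AS C_length
FROM Employee_Table
WHERE CHARACTER_LENGTH(First_name) < 7
ORDER BY Last_name;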

4 Rows Returned

Last_name C_length

Chambers    20
Coffing     20
Smith       20
Strickling  20

Again, the space characters are present in the data and therefore counted. Hence, all the last names are 20 characters long. The comparison is on the first name but the display is based entirely on the last name.

The CHARACTER_LENGTH function is helpful for determining demographic information regarding the VARCHAR data stored within the Teradata database. However, sometimes this same information is needed on fixed-length CHAR data. When this is the case, the TRIM function is helpful.


   

OCTET_LENGTH Function

Compatibility: ANSI

The OCTET_LENGTH function counts the number of octets (bytes) used to store a data column. For single-byte character sets, this equals the number of characters, making it another ANSI counterpart of the Teradata CHARACTERS function, available in V2R4. Like CHARACTERS, it is easiest to use and most helpful when the characters being counted are stored in a variable-length VARCHAR column. A VARCHAR stores only the characters input and no trailing spaces.

When referencing a fixed length CHAR column, the OCTET_LENGTH function always returns a number that represents the maximum number of characters defined. This is because the database must store the data and pad to the full length using literal spaces. A space is a valid character and therefore, the OCTET_LENGTH function counts every space.

The syntax of the OCTET_LENGTH function:

OCTET_LENGTH ( <column-name> )

To use the OCTET_LENGTH function, simply pass it a column name. When referenced in the SELECT list, it displays the number of characters. When written into the WHERE clause, it can be used as a comparison value to decide whether or not the row should be returned.

The contents of the same Employee table above are also used to demonstrate the OCTET_LENGTH function.

The next SELECT demonstrates how to code the OCTET_LENGTH function in both the SELECT list and the WHERE clause, plus the answer set:
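A sketch under the same assumptions (with a single-byte character set, the byte counts equal the character counts):

SELECT First_name,
       OCTET_LENGTH(First_name) AS C_length
FROM Employee_Table
WHERE OCTET_LENGTH(First_name) < 7
ORDER BY C_length DESC;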

4 Rows Returned

First_name C_length

Mandee  6
Cletus  6
Billy   5
John    4

If there are leading or embedded spaces stored within the column, the OCTET_LENGTH function counts them as valid, significant data characters.

As mentioned earlier, the OCTET_LENGTH function works best on VARCHAR data. The following demonstrates its result on CHAR data by retrieving the last name and the length of the last name where the first name contains fewer than 7 characters:
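Again as a sketch:

SELECT Last_name,
       OCTET_LENGTH(Last_name) AS C_length
FROM Employee_Table
WHERE OCTET_LENGTH(First_name) < 7
ORDER BY Last_name;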

4 Rows Returned

Last_name C_length

Chambers    20
Coffing     20
Smith       20
Strickling  20

Again, the space characters are present in the data and therefore counted. Hence, all the last names are 20 characters long. The comparison is on the first name but the display is based entirely on the last name.

The OCTET_LENGTH function is helpful for determining demographic information regarding the VARCHAR data stored within the Teradata database. However, sometimes this same information is needed on fixed-length CHAR data. When this is the case, the TRIM function is helpful.
