
B414 Restart


Module 14: System Restarts

After completing this module, you will be able to:

List three different ways to restart the Teradata database.

Use the RESTART command.

Describe the impact of …

– Disk(s) failure
– Disk array controller(s) failure
– BYNET(s) failure
– Node failure
– AWS failure
– VPROC failure

Explain the difference between a PDE dump and a UNIX panic dump.


Types of Restarts

Scheduled Restarts

• Changing system parameters (e.g., DBS Control parameter is updated)

• Software upgrades

• Configuration changes (addition of new AMPs and/or PEs)

Unscheduled Restarts

• Power failure (e.g., 8/14/2003 – the North East U.S. and parts of Canada)

• Hardware failure

• Software failure

• Accidents

Restart Processes

1. Spool cylinders are returned to the free cylinder list (unused cylinder pool).

2. Before logons are enabled, uncommitted work is rolled back.

   1st: Tables are re-locked for background recovery.

   2nd: Logons are enabled in cold start.


Scheduled Restarts

Restart Teradata with       Use this command          Options

Command-line                tpareset <comment>        -f, -x, -y, -d, -l, -Q, -P
DB Console - Supervisor     restart tpa <comment>     cold, coldwait
vprocmanager                restart                   cold, coldwait
MultiTool (Windows 2000)    reset (via GUI choices)   GUI menu choices

Example:

# tpareset -f Change of system parameters

To see when restarts occurred during the last week, with a brief explanation of how and why each one happened:

LOGON tdpid/systemfe,service;
EXEC ALLRESTARTS (DATE - 7,);
LOGOFF;

The “tpatrace” command may also be used to see information about restarts.

# tpatrace 3 (shows last 3 restarts)


Restarting Teradata from DB Window

RESTART TPA  [, NODUMP    ]  [, COLD    ]  COMMENT
             [, DUMP = YES]  [, COLDWAIT]
             [, DUMP = NO ]


Restart using the “tpareset” Command

Example of using the tpareset command:

# tpareset -f Change of DBSControl parameters

You are about to restart the database on the system 'u4455'
Do you wish to continue (yes/no) [no]: yes
tpareset: TPA reset submitted.

Example of using the tpatrace command:

# tpatrace

TPA Initialization Trace for Node 001-01

02/16/2004 08:25:33 -------------------- PDE starting
02/16/2004 08:25:35.06 (346) ---- PDE starting.
02/16/2004 08:25:35.07 (346) State is NOTPA/START.
:
02/16/2004 08:25:36.38 (346) State is NOTPA/NETREADY.
:
02/16/2004 08:25:47.15 (346) State is TPA/START.
:
02/16/2004 08:25:48.05 (346) State is TPA/VPROCS.
:
02/16/2004 08:25:49.57 (346) State is TPA/READY.
:
02/16/2004 08:25:49.65 (346) State is TPA/DONE.
02/16/2004 08:25:49.66 (346) Crash ceiling/count = 3/0
02/16/2004 08:25:49.66 (346) PDE started in 15 seconds.
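The state lines in the tpatrace output follow a regular layout, so the time spent between states can be read off with a short script. The following is a minimal Python sketch, not part of the Teradata tooling: the function names are hypothetical, the sample lines are copied from the trace shown on this slide, and the line layout is assumed to match that sample exactly.

```python
import re
from datetime import datetime

# Sample "State is" lines copied from the tpatrace output on this slide.
TRACE = """\
02/16/2004 08:25:35.06 (346) ---- PDE starting.
02/16/2004 08:25:35.07 (346) State is NOTPA/START.
02/16/2004 08:25:36.38 (346) State is NOTPA/NETREADY.
02/16/2004 08:25:47.15 (346) State is TPA/START.
02/16/2004 08:25:48.05 (346) State is TPA/VPROCS.
02/16/2004 08:25:49.57 (346) State is TPA/READY.
02/16/2004 08:25:49.65 (346) State is TPA/DONE.
"""

# Assumed layout: "MM/DD/YYYY HH:MM:SS.hh (pid) State is MAJOR/SUB."
STATE_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{2}) \(\d+\) State is ([A-Z/]+)\.$"
)


def parse_states(text):
    """Return (timestamp, state) pairs for every 'State is ...' line."""
    states = []
    for line in text.splitlines():
        match = STATE_LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S.%f")
            states.append((stamp, match.group(2)))
    return states


def startup_seconds(states):
    """Seconds elapsed between the first and the last recorded state."""
    return (states[-1][0] - states[0][0]).total_seconds()


if __name__ == "__main__":
    found = parse_states(TRACE)
    for stamp, state in found:
        print(stamp.time(), state)
    print("first to last state:", startup_seconds(found), "seconds")
```

Lines that do not report a state, such as "PDE starting", are skipped rather than treated as errors, because the real trace contains additional lines that the slide elides with ":" markers.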


Restart Messages and Information

Recovery status information is logged to numerous locations:

• Software_Event_Log
• SMP Console Display
• /var/adm/streams (UNIX)

SMP Console output following a tpareset:

:
Event number 33-10198-00 (severity 40, category 10)
Force a TPA restart.
:
NOTICE: fsgsync.c: PDE: A primary fsg flush started.
xcmn_err: Message Date 02/16 - Time 08:25(mm/dd hh:mm)
:
Event number 34-02900-00 (severity 10, category 10)
04/02/16 08:25:49 Running DBS Version: 05.01.00.00
Event number 34-02900-00 (severity 10, category 10)
04/02/16 08:25:49 Running PDE Version: 05.01.00.00
:
04/02/16 08:25:50 Initializing DBS Vprocs
:
04/02/16 08:25:56 Configuration is operational
Event number 34-02900-00 (severity 10, category 10)
04/02/16 08:25:56 Starting AMP partitions
:
04/02/16 08:25:59 Voting for transaction recovery
Event number 34-02900-00 (severity 10, category 10)
04/02/16 08:26:00 Recovery session 1 contains 43 rows on AMP 00000
Event number 34-02900-00 (severity 10, category 10)
04/02/16 08:26:11 Starting PE partitions
:
04/02/16 08:26:15 Logons are enabled
Feb 16 08:26:15 Teradata DBS Gateway: [455]: error logging started


PDE States

The pdestate command can be used to check the current state of the PDE and Teradata software for a specific node.

# /usr/ntos/bin/pdestate
PDE: Parallel Database Extension state is TPA.

PDE has three major operational states: NULL, NOTPA, and TPA.

– NULL/START
– NULL/STOPPED
– NULL/RESET
– NULL

– NOTPA/START
– NOTPA/NETCONFIG
– NOTPA/NETREADY
– NOTPA/RECONCILE
– NOTPA

– TPA/START
– TPA/VPROCS
– TPA/READY
– TPA/DONE
– TPA
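A monitoring script might split the state reported by pdestate into its major state and optional sub-state. The sketch below is illustrative only. It assumes the output always ends with "state is X." or "state is X/Y." as in the sample on this slide; whether pdestate actually prints the sub-state forms is an assumption, and the function names are hypothetical.

```python
import re

# The three major PDE states named on this slide.
MAJOR_STATES = ("NULL", "NOTPA", "TPA")

# Assumed output shape: "... state is MAJOR." or "... state is MAJOR/SUB."
STATE_PATTERN = re.compile(r"state is ([A-Z]+)(?:/([A-Z]+))?\.")


def parse_pdestate(output):
    """Return (major, sub) from pdestate output; sub is None if absent."""
    match = STATE_PATTERN.search(output)
    if match is None:
        raise ValueError("no PDE state found in: %r" % output)
    major, sub = match.group(1), match.group(2)
    if major not in MAJOR_STATES:
        raise ValueError("unknown major state: %s" % major)
    return major, sub


def is_fully_started(output):
    """True only for plain 'TPA', i.e. no transitional sub-state (assumption)."""
    return parse_pdestate(output) == ("TPA", None)


if __name__ == "__main__":
    sample = "PDE: Parallel Database Extension state is TPA."
    print(parse_pdestate(sample), is_fully_started(sample))
```

Treating plain TPA as the only "fully started" state is a design choice for this sketch; a site may prefer to accept TPA/DONE as well.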


Unscheduled Restarts

Disk Drive Failures

Scenario 1
Failure: One disk in a drive group
Result: No TPA reset
Resolution: Replace disk; array controllers automatically rebuild the disk

Scenario 2
Failure: Two disks in a drive group
Result:
– TPA reset (1-5 minutes)
– AMP taken offline and marked as Fatal
– Fallback tables OK
– Non-fallback tables partially available
Resolution:
– Replace the two disks
– Reformat LUNs or Volumes in the drive group
– Perform a table rebuild
– Restore non-fallback tables

Scenario 3
Failure: Two disks in 2 different drive groups associated with AMPs in the same cluster (2 AMPs fail in a cluster)
Result: Machine halts
Resolution: Restore User DBC and tables


Unscheduled Restarts (cont.)

BYNET Failures

Scenario 1

Failure: One BYNET fails

Result:
– No TPA reset
– All traffic auto-switched to remaining BYNET
– Impact on system performance

Resolution: Repair BYNET

Scenario 2

Failure: Both BYNETs fail

Result: Teradata halts and is not available

Resolution: Repair BYNETs


Unscheduled Restarts (cont.)

Node Failure

Scenario

Failure: Node fails (e.g., O.S. hangs, 2 power supplies fail, memory fails, etc.)

Result:
– TPA restart (1 - 5 minutes) and vprocs migrate to other nodes in clique
– Possible O.S. reboot (3 - 15 minutes)

Resolution:
– Repair node and reboot operating system
– Restart Teradata to allow node to rejoin Teradata configuration

Vproc Software Failure

Scenario

Failure: AMP or PE Vproc fails

Result: TPA restart (1 - 5 minutes) and vprocs may be marked offline

Resolution: If necessary, run Scandisk, Checktable, and Rebuild utilities

AWS Failure

Scenario

Failure: AWS fails

Result: No restart of Teradata; AWS is not available to monitor/manage system

Resolution: Reboot or recover AWS
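The failure scenarios on the Unscheduled Restarts slides can be condensed into a small lookup table, which is handy as a study aid or as the basis for an alert classifier. This is only a sketch: the dictionary keys and the names Outcome, FAILURE_OUTCOMES, and causes_restart are hypothetical labels, and the outcomes are taken directly from the Result lines of each scenario.

```python
from enum import Enum


class Outcome(Enum):
    NO_RESTART = "no TPA restart"
    TPA_RESTART = "TPA restart"
    SYSTEM_HALT = "system halts"


# Outcome per failure scenario, as stated on the slides.
FAILURE_OUTCOMES = {
    "one disk in a drive group": Outcome.NO_RESTART,
    "two disks in one drive group": Outcome.TPA_RESTART,
    "two disks in different drive groups, same cluster": Outcome.SYSTEM_HALT,
    "one BYNET": Outcome.NO_RESTART,
    "both BYNETs": Outcome.SYSTEM_HALT,
    "node": Outcome.TPA_RESTART,
    "AMP or PE vproc": Outcome.TPA_RESTART,
    "AWS": Outcome.NO_RESTART,
}


def causes_restart(failure):
    """True if the named failure scenario results in a TPA restart."""
    return FAILURE_OUTCOMES[failure] is Outcome.TPA_RESTART


if __name__ == "__main__":
    for failure, outcome in FAILURE_OUTCOMES.items():
        print("%-50s %s" % (failure, outcome.value))
```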


TPA Reset – Crashdumps

[Diagram: UNIX, Dump Device (/dev/pdedump), Collector Task, AMPs, and Crashdump Table; the numbered arrows 1 and 2 are described below.]

1. Selective memory and swapped pages are written to “pdedump” space.

2. As part of Teradata restart, a background collector task reads “pdedump” and writes dump information to a Crashdump table in Crashdumps database.

• If the Crashdumps database is out of perm space, the collector task outputs a warning message and retries every 60 minutes to create a crashdump table.

UNIX MP-RAS Commands to determine if dumps are present in “pdedump”:

# pdedumpcheck -v (lists /dev/pdedump dumps that are present)

# fdlcsp -mode clear (clears all dumps from /dev/pdedump)


Allocating Crashdumps Space

[Diagram: database hierarchy with DBC at the top and Sys_Calendar, SysAdmin, SystemFE, Crashdumps, and SYSDBA beneath it.]

Allocate approximately 150 – 200 MB of permanent space per node per crashdump.

Example: On a four-node system, to allocate space for three crashdumps at 150 MB per node:

((150 x 4) x 3) = 1800 MB without fallback
((150 x 4) x 3) x 2 = 3600 MB with fallback

MODIFY USER Crashdumps AS PERM = 1800E6;
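The sizing rule above is simple arithmetic and can be wrapped in a helper for other node counts. The sketch below assumes the lower bound of 150 MB per node per crashdump used in the example; the function names are hypothetical, and the generated statement simply mirrors the MODIFY USER form shown on this slide.

```python
def crashdump_perm_mb(nodes, dumps, mb_per_node=150, fallback=False):
    """Perm space in MB for `dumps` crashdumps on a `nodes`-node system.

    Fallback doubles the requirement, as in the slide's example.
    """
    total = mb_per_node * nodes * dumps
    return total * 2 if fallback else total


def modify_user_statement(perm_mb):
    """Build the MODIFY USER statement in the form used on the slide."""
    return "MODIFY USER Crashdumps AS PERM = %dE6;" % perm_mb


if __name__ == "__main__":
    without_fallback = crashdump_perm_mb(nodes=4, dumps=3)
    with_fallback = crashdump_perm_mb(nodes=4, dumps=3, fallback=True)
    print(without_fallback, with_fallback)
    print(modify_user_statement(without_fallback))
```

Passing mb_per_node=200 gives the upper end of the recommended range.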

Example of Crashdump name: Crash_20040213_012519_02
(20040213 is the date, 012519 is the time, and 02 is the segment #)

HELP USER Crashdumps;

Table/View/Macro name      Kind  Comment

Crash_20040213_012519_02   T     PDE:05.01.00.00,TDBMS:05.01.00.00,TGTW:05.01.00.00;
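Because the crashdump table name encodes the date, time, and segment number, a script can recover when a dump was taken from the name alone. The following sketch assumes the fields are laid out as YYYYMMDD, HHMMSS, and a two-digit segment number, which is an inference from the single example on this slide; the function name is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: Crash_<YYYYMMDD>_<HHMMSS>_<two-digit segment number>
NAME_PATTERN = re.compile(r"^Crash_(\d{8})_(\d{6})_(\d{2})$")


def parse_crashdump_name(name):
    """Return (timestamp, segment) decoded from a crashdump table name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError("not a crashdump table name: %r" % name)
    taken = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")
    return taken, int(match.group(3))


if __name__ == "__main__":
    taken, segment = parse_crashdump_name("Crash_20040213_012519_02")
    print(taken.isoformat(), segment)
```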


TPA Dump Maintenance

Is the Crashdump needed? (Contact support center if in doubt.)

No:

– DELETE from Crashdumps

– Optionally, delete from pdedump device

Yes. Options:

– Allow access to system via network

– Archive to file and ftp to support center

– Use DUL and archive to tape

UNIX MP-RAS Operating System Dumps

Complete dump of system memory, including:

• PDE

• Kernel

Crash utility may be used to interpret dump.


Review Questions

1. What is the operating system command to restart Teradata? __________________

2. What is the DB Window supervisor command to restart Teradata? __________________

3. Which of the following choices will cause a Teradata restart? __________________

A. AWS hard drive failure

B. Single drive failure in RAID 1 drive group

C. Two drive failures in same RAID 1 drive group

D. Single SMP power supply failure

E. SMP CPU failure

F. One of the BYNETs fails

G. LAN connection to SMP is lost


Module 14: Review Question Answers

1. What is the operating system command to restart Teradata? tpareset

2. What is the DB Window supervisor command to restart Teradata? restart tpa

3. Which of the following choices will cause a Teradata restart? C, E

A. AWS hard drive failure

B. Single drive failure in RAID 1 drive group

C. Two drive failures in same RAID 1 drive group

D. Single SMP power supply failure

E. SMP CPU failure

F. One of the BYNETs fails

G. LAN connection to SMP is lost