Rep Server Monitoring Best Practices

Sybase, Inc. 2009

Best Practice Recommendation Subject: Monitoring Replication System

Author(s): Deepak Upadhyay, Sr. DBA, Sybase IT

Reviewer(s): David Burgess, Staff DBA, Sybase IT

Udaya Challapalli, Sr. DBA Manager, Sybase IT

Hema Seshadri, Sr. DBA Manager, Sybase IT

Contributor(s): David Burgess, Staff DBA, Sybase IT

Abstract: The purpose of this document is to provide best practices to monitor a typical Sybase replication environment.


Table of Contents

1 Introduction
1.1 Definitions
1.1.1 Relational Database
1.1.2 Primary database
1.1.3 Replicate database
1.1.4 Standby database
1.1.5 Replication Server
1.1.6 Primary replication server
1.1.7 Replicate replication server
Best Practice Solution
1.2 General Monitoring
1.2.1 Status monitoring
1.2.2 Errorlog
1.2.3 Disk space
1.2.4 Replication topology
1.3 Performance monitoring
1.3.1 Latency
1.3.2 Throughput
1.3.3 Statistics monitoring (i.e. Monitor counters)
1.4 Alerting/Notification
1.4.1 RSM Event monitoring
1.4.2 Scripts
1.5 General Troubleshooting
1.5.1 Skipping transaction
1.5.2 Dumping stable queue
1.5.3 Disabling secondary truncation point
1.5.4 Enabling secondary truncation point
References


1 Introduction

This document lists the best practices used to monitor a Sybase Replication System.

1.1 Definitions

1.1.1 Relational Database

A type of database that groups data into related tables. Each table has two major elements: rows and columns.

1.1.2 Primary database

A database where transactions are originally performed (by an end user or process) and where those transactions are captured for the Replication Server.

1.1.3 Replicate database

A database that receives replicated transactions from the Replication Server and applies them to its own copy of the primary database. A replicate database may or may not be an exact copy of its primary database.

1.1.4 Standby database

A database that receives replicated transactions from the Replication Server and applies them to its own copy of the primary database. A standby database is generally an exact copy of its primary database.

1.1.5 Replication Server

A Sybase Open Client/Server product that performs continuous, asynchronous transfer of the transaction log from a primary database to one or more replicate databases.

1.1.6 Primary replication server

An instance of Replication Server that performs continuous, asynchronous transfer of the transaction log from a primary database to other Replication Server(s). It can also replicate transactions directly to replicate databases.

1.1.7 Replicate replication server

An instance of Replication Server that generally receives transactions from the primary Replication Server and replicates them to the replicate database(s).


Best Practice Solution

In this section monitoring is categorized as follows:
• General Monitoring
• Performance monitoring
• Alerting/Notification
• General troubleshooting

Each of these categories is discussed in detail in the sections below.

1.2 General Monitoring

Monitoring replication is critical. Effective monitoring is the key to maintaining a replication system, because the time taken to fix an issue is an important factor. If connections are suspended for a long time, this can cause:
• The stable device to fill
• The Replication Agent to suspend at the primary site
• The transaction log to fill up at the primary site
• Possibly all transactions at the primary site to suspend or abort (i.e. all activity stops)

The objective of this category of monitoring is to make sure all components of the replication system are up and running, and to avoid any surprise failure of the system. Monitoring in this category is generally divided as follows:
• Status monitoring
• Log (i.e. error log) monitoring
• Space monitoring (i.e. disk space)
• Overall topology (i.e. the system is functioning as it is supposed to)
  o Table schema monitoring - To make sure the schema matches the replication definitions and the replicate database(s), especially after any application changes
  o Marking for replication - All required tables are marked for replication correctly
  o Data consistency - Data is in sync between the primary and replicate sites

The following diagram lists the various components to monitor.


1.2.1 Status monitoring

This section describes the components of a replication system and explains why it is important to monitor them. The following major components are considered:
• Servers
• Connections/Routes
• Replication queues
• Replication Agents/LTMs
• Replication threads/modules

1.2.1.1 Servers

Generally there are two types of servers:
• Database servers
• Replication servers

1.2.1.1.1 Database servers – A replication system basically uses three types of database servers:
• Primary
• Replicate
• RSSD

Primary – The most important database server of any application, and from the replication system's point of view one of its critical components to monitor. Whether the server is up and running can be monitored using many different methods:
• Scripts
• RMS/RSM event monitoring
• Other third-party tools

Replicate – When used as a standby (i.e. DR), this is as critical as the primary database server. A shutdown of the replicate site(s) can have a big impact on the primary site (i.e. the final application), depending on the configuration of the replication system. Several key factors determine the impact of a down replicate site:
• Time to fix and bring up the replicate site(s)
• Space allocated to the stable device(s) of the Replication Server(s)
• Complexity of the replication system (i.e. how easy it is to rebuild the replication system, or to remove the replicate site from it and add it back)

Again, whether the server is up and running can be monitored using many different methods:
• Scripts
• RMS/RSM event monitoring
• Other third-party tools

RSSD – As important as the other database servers, since a failure of the RSSD database server may contribute to a failure of the replication system. Again, whether the server is up and running can be monitored using many different methods:
• Scripts
• RMS/RSM event monitoring
• Other third-party tools


1.2.1.1.2 Replication servers – In most cases a single Replication Server performs all replication in the replication system. A more complex environment will have several types of Replication Server, as described below. The status of each server can be monitored using scripts or RMS/RSM event monitoring.

ID – Must be up and running during at least the following operations:
• Adding a new Replication Server to the replication system
• Adding databases to the replication system
• Adding routes to the replication system

Primary – Responsible for collecting published data. It can impact the business if it is down for a long time without being noticed.

Intermediate – Carries the flow of data between two Replication Servers. Monitoring the intermediate Replication Server is as important as monitoring the other Replication Servers.

Replicate – Applies the replicated data. Monitoring the replicate Replication Server is as important as monitoring the other Replication Servers.

1.2.1.2 Connections/Routes – These are critical components of the Replication Server. Their status generally shows whether data is replicating smoothly or whether there are issues.

Logical connection – In a warm-standby environment, it is important to verify the current active and standby connections.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin logical_status
2> go

Logical Conn           Active Connection name   Active Conn State  Standby Connection name  Standby conn state  Controller RS        Operations in progress  State of Operations in progress  Spid
[278] DBS.DBSfast_LC   [281] PDSDBS1.DBSfast    Active/            [1526] PDSDBS5.DBSfast   Active/             [16777358] PRLDBS1A  None                    None
[283] DBS.DBSudef_LC   [284] PDSDBS1.DBSudef    Active/            [1528] PDSDBS5.DBSudef                                            None                    None
[375] DBS.DBSuomm_LC   [539] PDSDBS1.DBSuomm    Active/            [1529] PDSDBS5.DBSuomm                                            None                    None
[376] DBS.DBSvend_LC   [540] PDSDBS1.DBSvend    Active/            [1527] PDSDBS5.DBSvend                                            None                    None
[543] DBS.u_DBScta_LC  [544] PDSDBS1.u_DBScta   Active/            [1530] PDSDBS5.u_DBScta                                           None                    None

Physical database connection – Make sure the database connection is up and running, especially for replicate database connections.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin health
2> go

Mode    Quiesce  Status
----    -------  ------
NORMAL  FALSE    HEALTHY

In the output of most of the “admin” commands below, the “State” column is the important one to observe. The following table describes the possible values of this column.

State Description

Active Actively processing a command.

Awaiting Command The thread is waiting for a client to send a command.

Awaiting I/O The thread is waiting for an I/O operation to finish.

Awaiting Message The thread is waiting for a message from an Open Server message queue.

Awaiting Wakeup The thread has posted a sleep and is waiting to be awakened.

Connecting The thread is connecting.

Down The thread has not started or has terminated.

Getting Lock The thread is waiting on a mutual exclusion lock.

Inactive The status of an RSI User thread at the destination of a route when the source replication Server is not connected to the destination replication Server.

Initializing The thread is being initialized.

Suspended The thread has been suspended by the user.

1> admin who_is_down
2> go

Spid  Name      State      Info
      DSI EXEC  Suspended  414(1) PRSPS3.cons
      DSI       Suspended  414 PRSPS3.cons

1> admin who
2> go

Spid  Name       State             Info
58    DIST       Awaiting Wakeup   376 DBS.DBSvend_LC
67    SQT        Awaiting Wakeup   376:1 DIST DBS.DBSvend_LC
29    SQM        Awaiting Message  376:1 DBS.DBSvend_LC
28    SQM        Awaiting Message  376:0 DBS.DBSvend_LC
82    DSI EXEC   Awaiting Command  540(1) PDSDBS1.DBSvend
42    DSI        Awaiting Message  540 PDSDBS1.DBSvend
8427  REP AGENT  Awaiting Command  PDSDBS1.DBSvend
8203  DSI EXEC   Awaiting Command  1527(1) PDSDBS5.DBSvend
8202  DSI        Awaiting Message  1527 PDSDBS5.DBSvend
4646  DSI EXEC   Awaiting Command  614(1) PRSDBS1.DBSvend
4645  DSI        Awaiting Message  614 PRSDBS1.DBSvend
32    SQM        Awaiting Message  614:0 PRSDBS1.DBSvend
38    RSI        Awaiting Wakeup   PRLRMDBS1
37    SQM        Awaiting Message  16777372:0 PRLRMDBS1
86    RSI USER   Awaiting Command  PRLRMDBS1
54    dSUB       Sleeping
15    dCM        Awaiting Message
18    dAIO       Awaiting Message
62    dREC       Sleeping          dREC
63    dSTATS     Sleeping
1152  USER       Active            sa
14    dALARM     Awaiting Wakeup
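The state check described above can be scripted once the “admin who” output has been captured to text (for example from an isql session). The following is a minimal sketch in Python, not a definitive implementation. The function names are hypothetical, and the state list is taken from the table above plus “Sleeping”, which appears in the sample output for the daemon threads.

```python
import re

# States taken from the table above, plus "Sleeping", which the sample
# output shows for daemon threads. Longer names are listed first.
KNOWN_STATES = sorted(
    ["Active", "Awaiting Command", "Awaiting I/O", "Awaiting Message",
     "Awaiting Wakeup", "Connecting", "Down", "Getting Lock", "Inactive",
     "Initializing", "Suspended", "Sleeping"],
    key=len, reverse=True)

# A state is preceded by whitespace and followed by whitespace or end of line.
STATE_RE = re.compile(
    r"\s(" + "|".join(re.escape(s) for s in KNOWN_STATES) + r")(?=\s|$)")

# States that mean a thread is not processing data (assumption for this sketch).
PROBLEM_STATES = ("Down", "Suspended")


def parse_admin_who(text):
    """Split captured 'admin who' output into spid/name/state/info rows."""
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Spid"):
            continue  # skip blank lines and the header
        match = STATE_RE.search(line)
        if match is None:
            continue  # not a thread row
        left = line[:match.start()].split()
        # who_is_down rows may have no spid, so the spid is optional.
        spid = left[0] if left and left[0].isdigit() else ""
        name = " ".join(left[1:] if spid else left)
        rows.append({"spid": spid, "name": name,
                     "state": match.group(1),
                     "info": line[match.end():].strip()})
    return rows


def find_problem_threads(text):
    """Return the rows whose state is Down or Suspended."""
    return [r for r in parse_admin_who(text) if r["state"] in PROBLEM_STATES]
```

A monitoring script could call find_problem_threads() on each capture and raise an alert whenever the returned list is not empty.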

Direct/indirect routes – Depending on the size of the outbound queue on the source Replication Server, it is critical to monitor the status of the route.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,rsi
2> go

In addition to the “State” column, two other columns, “Locater Sent” and “Locater Deleted”, indicate (when their values differ) that there is still data to be processed by the Replication Server for the RSI.

Spid  State            Info       Packets Sent  Bytes Sent  Blocking Reads  Locater Sent                                                                Locater Deleted
38    Awaiting Wakeup  PRLRMDBS1  2655009       426460847   383262          0x0000000000000000000000000000000000000000000000000000000000014c5900320002  0x0000000000000000000000000000000000000000000000000000000000014c5900320002

1> admin who_is_down
2> go

Spid  Name      State  Info
      DSI EXEC  Down   418(1) DTRSPS3.cons

1.2.1.3 Queues –

Inbound – During peak hours, it is important to monitor whether data is moving through the queue.

Outbound – Except in a warm-standby-only environment, it is equally important to monitor the size and movement of the data in the queue.

Materialization – Only important during materialization.

The Replication Server command “admin who,sqm” produces about seventeen columns of output, and all of them are worth monitoring carefully. The four columns shown below quickly give a brief state of all replication queues. If a connection is active but data is not replicating, check whether the “Duplicates” column is rising, since unique transactions may be incorrectly resolved as duplicates. The two columns “First Segment.Block” and “Last Seg.Block” quickly tell approximately how much data in the queue is still to be processed. For example, in the output below the queue (378:1 DBS.DBSglep_LC) has about 46 MB (i.e. 600520 - 600474 segments) of data to be processed.

Notice: Segment = 1 MB, consisting of 64 blocks (i.e. block size = 16K)

isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,sqm
2> go

Info                         Duplicates  First Segment.Block  Last Seg.Block
16777372:0 PRLRMDBS1         167         294680.39            294680.39
615:0 PRSDBS1.DBSvmst        150         2735.24              2735.24
606:0 PRSDBS1.DBSglep        3           159995.25            159995.25
379:0 DBS.DBSvmst_LC         0           0.1                  0
378:1 DBS.DBSglep_LC         12          600474.19            600520.47
378:0 DBS.DBSglep_LC         0           0.1                  0
201:0 PDSREP1.PRLDBS1B_RSSD  9           8.9                  8.9
201:1 PDSREP1.PRLDBS1B_RSSD  5           212582.54            212582.54
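The backlog arithmetic above can be made exact by including the block part of each value. The following Python sketch is an illustration only. It assumes the sizes given in the notice (64 blocks per segment, 16K per block), and the function names are hypothetical.

```python
BLOCKS_PER_SEGMENT = 64  # per the notice above: 1 segment = 64 blocks
BLOCK_SIZE_KB = 16       # per the notice above: block size = 16K


def to_blocks(seg_block):
    """Convert a 'segment.block' value such as '600474.19' to a block count."""
    segment, _, block = seg_block.partition(".")
    return int(segment) * BLOCKS_PER_SEGMENT + int(block or 0)


def queue_backlog_mb(first_seg_block, last_seg_block):
    """Approximate the data still queued, in MB, between first and last."""
    blocks = to_blocks(last_seg_block) - to_blocks(first_seg_block)
    # An empty queue reports first = 0.1 and last = 0, so clamp at zero.
    return max(blocks, 0) * BLOCK_SIZE_KB / 1024


if __name__ == "__main__":
    # Queue 378:1 DBS.DBSglep_LC from the output above.
    print(queue_backlog_mb("600474.19", "600520.47"))
```

For queue 378:1 this gives 46 whole segments plus 28 blocks, which agrees with the estimate of about 46 MB.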

1.2.1.4 Replication Agent –

Sybase Replication Agent – An internal thread in Sybase ASE that scans the transaction log of the database server. Make sure the agent is collecting marked data, forwarding transactions to the Replication Server and, importantly, moving the secondary truncation point. To verify the state of the Replication Agent, execute the following stored procedure.

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<DBName>
1> sp_help_rep_agent <DBName>,process
2> go

dbname   spid  sleep status  retry count  last error
DBSallc  19    end of log    0            0

The “sleep status” column in the above output shows the current activity of the Replication Agent.

Status Comment

not running RepAgent is not running.

not active RepAgent is not in recovery mode.

Initial RepAgent is initializing in recovery mode.

end of log RepAgent is in recovery mode and has reached the end of the transaction log.

Unknown none of the above.

Further, the “Spid” shown for the Replication Agent can be verified using the stored procedure “sp_who”.

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<DBName>
1> sp_who "19"
2> go

fid  spid  status      loginame  origname  hostname  blk_spid  dbname   cmd        block_xloid
0    19    background  NULL      NULL                0         DBSallc  REP AGENT  0

The position of the secondary truncation point can be checked by querying master..syslogshold.

1> select * from master..syslogshold
2> where dbid = db_id("DBSallc")
3> and name = "$replication_truncation_point"
4> go

dbid  reserved  spid  page   xactid          masterxactid    starttime           name                           xloid
5     0         0     21762  0x000000000000  0x000000000000  Jan 26 2009 9:00AM  $replication_truncation_point  0

LTMs – Used for non-Sybase primary data only. The LTM collects the primary transaction log (i.e. the delta) and transfers it to the Replication Server. Monitoring its errorlog is as important as monitoring the Sybase Replication Agent.

1.2.1.5 Replication threads

SQM – Make sure the difference between “First Segment.Block” and “Last Seg.Block” (from “admin who,sqm”) stays close to zero. As mentioned earlier, the output below shows a difference of about 46 MB of data in the queue.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,sqm
2> go

Info                         Duplicates  First Segment.Block  Last Seg.Block
16777372:0 PRLRMDBS1         167         294680.39            294680.39
615:0 PRSDBS1.DBSvmst        150         2735.24              2735.24
606:0 PRSDBS1.DBSglep        3           159995.25            159995.25
379:0 DBS.DBSvmst_LC         0           0.1                  0
378:1 DBS.DBSglep_LC         12          600474.19            600520.47
378:0 DBS.DBSglep_LC         0           0.1                  0
201:0 PDSREP1.PRLDBS1B_RSSD  9           8.9                  8.9
201:1 PDSREP1.PRLDBS1B_RSSD  5           212582.54            212582.54

Notice: Segment = 1 MB, consisting of 64 blocks (i.e. block size = 16K)

SQT – Using “admin who,sqt”, look for any large transaction that the SQT is processing and make sure it is not affecting the replication system. If the “Full” column often shows “1”, the SQT cache is too small. The “Removed” column shows the number of transactions whose messages were moved out of the SQT cache because of their size. If there are many of these, or even a single one for a long time, observe other columns such as “Open” or “First Trans” (with ST = O and a large number of Cmds).

The “First Trans” column contains information in three parts:
ST: followed by O/C/R/D (Open/Closed/Read/Deleted)
Cmds: followed by the number of SQL commands in the first transaction
qid: followed by the exact position of the first transaction (i.e. segment:block:row)

Make sure “Cmds” is changing and the qid is increasing.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,sqt
2> go

Info                              Closed  Read  Open  Trunc  Removed  Full  SQM Blocked  First Trans               Parsed  SQM Reader  Change Oqids  Detect Orphans
209:1 DIST ITDSREP1.ITRLDR1_RSSD  0       0     0     0      0        0     1                                      0       0           0             0
323:1 DIST GIT.nisdb_LC2          0       0     0     0      0        0     1                                      0       0           0             0
210:1 DIST GIT.nisdb_LC           0       0     0     0      0        0     1                                      0       0           0             0
324 DTSGIT1.nisdb                 112     0     0     112    0        0     0            st:C,cmds:3,qid:15:23:0   0       0           0             1
212 DOSNIS1A.nisdb                72      0     0     72     0        0     0            st:C,cmds:3,qid:21:58:0   0       0           0             1
214 DTSGIT1.nisdbdev              112     0     0     112    0        0     0            st:C,cmds:3,qid:435:4:0   0       0           0             1
209 ITDSREP1.ITRLDR1_RSSD         0       0     0     0      0        0     0                                      0       0           0             1
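The three-part “First Trans” value described above is easy to split apart in a script, which makes it possible to compare two samples and see whether the first transaction is progressing. The following Python sketch is an illustration under stated assumptions. The function names are hypothetical, and the “no progress” rule (same qid and same command count in two consecutive samples) is an assumed interpretation of the guidance above, not a Sybase-defined test.

```python
import re

STATUS_NAMES = {"O": "Open", "C": "Closed", "R": "Read", "D": "Deleted"}

# Matches values such as "st:C,cmds:3,qid:21:58:0".
FIRST_TRANS_RE = re.compile(
    r"st:(?P<st>[OCRD]),cmds:(?P<cmds>\d+),"
    r"qid:(?P<seg>\d+):(?P<blk>\d+):(?P<row>\d+)",
    re.IGNORECASE)


def parse_first_trans(value):
    """Return status, command count and (segment, block, row), or None."""
    match = FIRST_TRANS_RE.fullmatch(value.strip())
    if match is None:
        return None  # the column is empty or 0 when no transaction is cached
    return {
        "status": STATUS_NAMES[match.group("st").upper()],
        "cmds": int(match.group("cmds")),
        "qid": (int(match.group("seg")), int(match.group("blk")),
                int(match.group("row"))),
    }


def no_progress(previous, current):
    """True when two samples show the same first transaction, unchanged."""
    if previous is None or current is None:
        return False
    return (previous["qid"] == current["qid"]
            and previous["cmds"] == current["cmds"])
```

A script could sample “admin who,sqt” at a fixed interval and report any queue for which no_progress() stays true over several samples.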

DIST – The “Status” column (from “admin who,dist”) gives the current status of the thread, either “Normal” or “Ignoring”. Other useful columns to look at are “PendingCmds” and “Duplicates”.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,dist
2> go

Info                       PrimarySite  Type  Status  PendingCmds  SqtBlocked  Duplicates  TransProcessed  CmdsProcessed  MaintUserCmds  NoRepDefCmds
200 PDSREP1.PRLDBS1A_RSSD  200          P     Normal  0            1           0           479             1625           0              0
543 DBS.u_DBScta_LC        544          L     Normal  0            1           0           344572          5245341        0              4550199
690 PDSDBS1.logindb        690          P     Normal  0            1           0           813             2439           0              0
376 DBS.DBSvend_LC         540          L     Normal  0            1           0           62246           203586         0              70644
375 DBS.DBSuomm_LC         539          L     Normal  0            1           0           6116            18166          0              0
283 DBS.DBSudef_LC         284          L     Normal  0            1           0           6168            18274          4              0
278 DBS.DBSfast_LC         281          L     Normal  0            1           0           12698           534422         4              503479

DSI – Make sure the DSI thread is up and running. It should not be down for a long time, in order to avoid a processing backlog. “admin who,dsi” displays many columns, but the following few give a quick status.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,dsi
2> go

Status            Info                       Maintenance User     Xacts_skipped  TriggerStatus  ReplStatus
Awaiting Message  618 PRSDBS1.u_DBScta       u_DBScta_maint       0              off            on
Awaiting Message  200 PDSREP1.PRLDBS1A_RSSD  PRLDBS1A_RSSD_maint  0              on             on
Awaiting Message  1530 PDSDBS5.u_DBScta      u_DBScta_maint       0              off            off
Awaiting Message  1529 PDSDBS5.DBSuomm       DBSuomm_maint        0              off            off
Awaiting Message  1528 PDSDBS5.DBSudef       DBSudef_maint        0              off            off
Awaiting Message  544 PDSDBS1.u_DBScta       u_DBScta_maint       0              on             on
Awaiting Message  690 PDSDBS1.logindb        logindb_maint        0              on             on
Awaiting Message  540 PDSDBS1.DBSvend        DBSvend_maint        0              on             on
Awaiting Message  539 PDSDBS1.DBSuomm        DBSuomm_maint        0              on             on
Awaiting Message  284 PDSDBS1.DBSudef        DBSudef_maint        0              on             on
Awaiting Message  281 PDSDBS1.DBSfast        DBSfast_maint        0              on             on
Awaiting Message  691 PRSDBS1.logindb        logindb_maint        20             on             on
Awaiting Message  614 PRSDBS1.DBSvend        DBSvend_maint        0              off            on
Awaiting Message  1526 PDSDBS5.DBSfast       DBSfast_maint        0              off            off
Awaiting Message  1527 PDSDBS5.DBSvend       DBSvend_maint        0              off            off


Other quick commands to check the status of the DSIs are as follows.

1> admin who_is_down
2> go
.
.
1> admin who_is_up
2> go

RSI – Make sure the status of the route is UP.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin who,rsi
2> go

Spid  State            Info       Packets Sent  Bytes Sent  Blocking Reads  Locater Sent                                                                Locater Deleted
38    Awaiting Wakeup  PRLRMDBS1  2655009       426460847   383262          0x0000000000000000000000000000000000000000000000000000000000014c5900320002  0x0000000000000000000000000000000000000000000000000000000000014c5900320002

DAEMONS – Make sure to monitor the status of the daemons dALARM, dAIO, dSUB and dCM regularly.

isql -U<user> -S<RepSrv> -P<pwd>
1> admin who
2> go

Spid  Name       State             Info
58    DIST       Awaiting Wakeup   376 DBS.DBSvend_LC
67    SQT        Awaiting Wakeup   376:1 DIST DBS.DBSvend_LC
29    SQM        Awaiting Message  376:1 DBS.DBSvend_LC
28    SQM        Awaiting Message  376:0 DBS.DBSvend_LC
82    DSI EXEC   Awaiting Command  540(1) PDSDBS1.DBSvend
42    DSI        Awaiting Message  540 PDSDBS1.DBSvend
8427  REP AGENT  Awaiting Command  PDSDBS1.DBSvend
8203  DSI EXEC   Awaiting Command  1527(1) PDSDBS5.DBSvend
8202  DSI        Awaiting Message  1527 PDSDBS5.DBSvend
4646  DSI EXEC   Awaiting Command  614(1) PRSDBS1.DBSvend
4645  DSI        Awaiting Message  614 PRSDBS1.DBSvend
32    SQM        Awaiting Message  614:0 PRSDBS1.DBSvend
38    RSI        Awaiting Wakeup   PRLRMDBS1
37    SQM        Awaiting Message  16777372:0 PRLRMDBS1
86    RSI USER   Awaiting Command  PRLRMDBS1
54    dSUB       Sleeping
15    dCM        Awaiting Message
18    dAIO       Awaiting Message
62    dREC       Sleeping          dREC
63    dSTATS     Sleeping
1152  USER       Active            sa
14    dALARM     Awaiting Wakeup

1.2.2 Errorlog

1.2.2.1 Database errorlogs

By default, the errorlogs are located under $SYBASE/$SYBASE_ASE/install.

Primary database errorlog – Look especially for errors related to the Sybase Replication Agent and/or any corruption in the primary database.

00:00000:00586:2008/12/13 14:10:13.30 server Started Rep Agent on database, 'DBSglep' (dbid = 21).
02:00000:00586:2008/12/13 14:10:13.45 server Error: 692, Severity: 20, State: 1
02:00000:00586:2008/12/13 14:10:13.45 server Uninitialized logical page '1498656' was read while accessing object '8' in database '21'. Please contact Sybase Technical Support.
02:00000:00586:2008/12/13 14:10:13.45 server Rep Agent Thread for database 'DBSglep' (dbid = 21) terminated abnormally with error. (major 0, minor 92)

Replicate database errorlog – Verify that the replicate site is up and running with adequate resources (i.e. it is not running out of connections, locks, log space etc.).

00:00000:00006:2007/10/13 16:21:55.88 server Error: 1105, Severity: 17, State: 4
00:00000:00006:2007/10/13 16:21:55.89 server Can't allocate space for object 'syslogs' in database 'DBSbank' because 'logsegment' segment is full/has no free extents. If you ran out of space in syslogs, dump the transaction log. Otherwise, use ALTER DATABASE to increase the size of the segment.

RSSD database errorlog – The RSSD is a very critical database for the Replication Server. Check the currently available space in all segments, and verify that the server is up and running with adequate resources (i.e. it is not running out of connections, locks etc.).

00:00000:00068:2008/07/31 16:27:18.07 server RepAgent(7): Received the following error message from the Replication Server: Msg 11060. CT/CS Lib function 'ct_results' failed. Retcode = 0..
00:00000:00068:2008/07/31 16:27:18.07 server RepAgent(7): This Rep Agent Thread is aborting due to an unrecoverable communications or Replication Server error.
00:00000:00068:2008/07/31 16:27:18.07 server Rep Agent Thread for database 'PRLDBS1D_RSSD' (dbid = 7) terminated abnormally with error. (major 92, minor 61)

1.2.2.2 Replication errorlog – Use “admin log_name” (as shown below) to find the location of the errorlog, where the Replication Server records informational and error messages.

isql -U<user> -S<RepSrv> -P<pwd>
admin log_name
go

The output will look like the following.

Log File Name
-------------
/cis1/PRLDBS1A/log/PRLDBS1A.log

The following message codes are important to observe:
I - Informational message
W - Warning
E - Error
H - A Replication Server thread died
F - The Replication Server died due to a serious error
N - Internal error

Below is a classic error from the Replication Server errorlog.

E. 2007/01/27 22:45:12. ERROR #1028 DSI EXEC(1010(1) PDSDBS5.u_DBScta) - dsiqmint.c(3071)
Message from server: Message: 2601, State 2, Severity 14 -- 'Attempt to insert duplicate key row in object 'esa_invoice_job' with unique index 'pk_esa_invoice_job'
'.
I. 2007/01/27 22:45:12. Message from server: Message: 3621, State 0, Severity 10 -- 'Command has been aborted.
'.
H. 2007/01/27 22:45:12. THREAD FATAL ERROR #5049 DSI EXEC(1010(1) PDSDBS5.u_DBScta) - dsiqmint.c(3078)
The DSI thread for database 'PDSDBS5.u_DBScta' is being shutdown. DSI received data server error #2601 which is mapped to STOP_REPLICATION. See logged data server errors for more information. The data server error was caused by output command #1 mapped from input command #2 of the failed transaction.
I. 2007/01/27 22:45:12. The DSI thread for database 'PDSDBS5.u_DBScta' is shutdown.
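The message codes listed above lend themselves to a simple errorlog scan. The following Python sketch is a hedged illustration: it assumes each entry starts with a one-letter code, a period and a timestamp in the form shown in the excerpt, and that lines not starting that way continue the previous entry. The function names and the default set of codes to report are hypothetical choices.

```python
import re

# Entry header as seen in the excerpt above, e.g.
# "E. 2007/01/27 22:45:12. ERROR #1028 DSI EXEC(...)"
ENTRY_RE = re.compile(
    r"^([IWEHFN])\. (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\. (.*)$")


def parse_entries(lines):
    """Group errorlog lines into entries with code, timestamp and message."""
    entries = []
    for raw in lines:
        line = raw.strip()
        match = ENTRY_RE.match(line)
        if match:
            entries.append({"code": match.group(1),
                            "timestamp": match.group(2),
                            "message": match.group(3)})
        elif entries and line:
            # Continuation line: attach it to the entry it belongs to.
            entries[-1]["message"] += " " + line
    return entries


def entries_needing_attention(lines, codes=("E", "H", "F", "N")):
    """Return the entries whose code is in the given set (assumed default)."""
    return [e for e in parse_entries(lines) if e["code"] in codes]
```

A cron-driven script could read only the lines appended since its last run and send a notification when entries_needing_attention() returns anything.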

If a stable device partition has failed (or is not available), look for failure messages in the errorlog as shown below.

I. 2008/08/22 00:38:36. Embedded database id is '101'.

E. 2008/08/22 00:38:36. ERROR #6078 GLOBAL RS(GLOBAL RS) - sun_svr4.c(139)

Could not open file '/dev/rdsk/c2t0d0s5'. System error 'No such file or directory(2)'

I. 2008/08/22 00:38:36. Unable to open partition '/dev/rdsk/c2t0d0s5'.

E. 2008/08/22 00:38:37. ERROR #6021 GLOBAL RS(GLOBAL RS) - m/sqmext.c(2086)

Stable queue '101:1' cannot be started. It is on a failed partition 'sq1'.

E. 2008/08/22 00:38:37. ERROR #6034 GLOBAL RS(GLOBAL RS) - m/sqmext.c(1259)

Cannot start the stable queue named '101:1'

W. 2008/08/22 00:38:37. WARNING #6131 GLOBAL RS(GLOBAL RS) - qm/sqmsp.c(2689)

Replication Server has no partitions.

I. 2008/08/22 00:38:38. Replication Server 'ITRLID1' is started.

I. 2008/08/22 00:38:38. DIST for 'ITDSREP1.ITRLID1_RSSD' is Starting

E. 2008/08/22 00:38:38. ERROR #30020 DIST(101 ITDSREP1.ITRLID1_RSSD) - xec/dist.c(1647)

Unable to start distributor thread for queue '101'.

I. 2008/08/22 00:38:38. The distributor for 'ITDSREP1.ITRLID1_RSSD' is shutting down

I. 2008/08/22 00:38:38. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is started.

I. 2008/08/22 00:38:38. SQM starting: 101:0 ITDSREP1.ITRLID1_RSSD

I. 2008/08/22 00:39:28. Replication Agent for ITDSREP1.ITRLID1_RSSD connected in passthru mode.

E. 2008/08/22 00:39:28. ERROR #14023 REP AGENT(ITDSREP1.ITRLID1_RSSD) - /execint.c(3463)

SQM had an error writing to the inbound-queue.

I. 2008/08/22 00:39:55. Shutting down.


Later, after executing “rebuild queues” to recover the Replication Server, look for loss detection messages in the errorlog.

I. 2008/08/22 01:02:00. Partition 'SQ2' is added.

I. 2008/08/22 01:48:06. Partition 'sq1' is in the process of being dropped.

I. 2008/08/22 01:50:26. Rebuild Queues: Starting

I. 2008/08/22 01:50:27. Resetting Replication Agent starting log position for ITDSREP1.ITRLID1_RSSD

I. 2008/08/22 01:50:27. Shutting down the DSI thread for 'ITDSREP1.ITRLID1_RSSD'.

I. 2008/08/22 01:50:27. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is shutdown.

I. 2008/08/22 01:50:27. DSI: enabled loss detection for 'ITDSREP1.ITRLID1_RSSD'.

I. 2008/08/22 01:50:27. Rebuild queues: deleting queue 101:1

I. 2008/08/22 01:50:27. Rebuild queues: done rebuilding queue 101:1. Restarting.

I. 2008/08/22 01:50:27. Rebuild queues: deleting queue 101:0


I. 2008/08/22 01:50:27. SQM stopping: 101:0 ITDSREP1.ITRLID1_RSSD

I. 2008/08/22 01:50:27. Rebuild queues: done rebuilding queue 101:0. Restarting.


I. 2008/08/22 01:50:28. Starting DIST for 101:1.

I. 2008/08/22 01:50:28. DIST for 'ITDSREP1.ITRLID1_RSSD' is Starting

I. 2008/08/22 01:50:28. Starting the DSI thread for 'ITDSREP1.ITRLID1_RSSD'.

I. 2008/08/22 01:50:28. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is started.

I. 2008/08/22 01:50:28. Rebuild Queues: Complete

The loss can also be detected by querying the RSSD tables as shown below.

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>
1> select dsname, dbname from rs_databases
2> where dbid in (
3>     select distinct case when origin_lsite_id = 0 then origin_site_id else origin_lsite_id end
4>     from rs_oqid where valid > 0)
5> go

1.2.2.3 LTM errorlog – In a hybrid replication system, it is important to monitor the log of the LTM process for its status and any related errors (i.e. errors connecting to the Replication Server etc.).

1.2.2.4 Dbltm (i.e. Rep agent for ERSSD) – Make sure to monitor this process if routes are used and the respective Replication Servers are using ERSSD.

1.2.3 Disk space

Database segments – Monitor the following:
• The log segment of the primary database
• All segments of the replicate database
• All segments of the RSSD

isql -U<user> -S<DBSrv> -P<pwd> -D<DBName>
sp_helpsegment <SegmentName>
go

segment name status

----------- ---- -----------

2 logsegment 0

device size

------ ----

raw07 320.0MB

free_pages

-----------

163197

table_name index_name indid

---------- ---------- -----------

syslogs syslogs 0

total_size total_pages free_pages used_pages reserved_pages

---------- ----------- ---------- ---------- --------------

320.0MB 163840 163197 643 0
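The page counts in the output above can be turned into a usage percentage for alerting. The following Python sketch is an illustration only. The function names are hypothetical and the 80 percent warning level is an assumed example value, not a Sybase recommendation.

```python
def segment_used_pct(total_pages, free_pages):
    """Percentage of a segment in use, from total_pages and free_pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return 100.0 * (total_pages - free_pages) / total_pages


def segment_needs_attention(total_pages, free_pages, warn_pct=80.0):
    """True when usage reaches the warning level (assumed default of 80%)."""
    return segment_used_pct(total_pages, free_pages) >= warn_pct


if __name__ == "__main__":
    # Values from the logsegment output above: 163840 total, 163197 free.
    print(round(segment_used_pct(163840, 163197), 2))
```

With the values shown above, 643 of 163840 pages are used, so the log segment is well under one percent full.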

Threshold monitoring can be set up for all required segments of the database.

Replication stable device – Use “admin disk_space” to monitor all stable devices.

isql -U<user> -S<RepSrv> -P<pwd>

admin disk_space

go

Partition                     Logical  Part.Id  Total Segs  Used Segs  State
/dev/vx/rdsk/sybase2/raw2g14  SQM14    112      2000        0          ON-LINE//
/dev/vx/rdsk/sybase2/raw2g13  SQM13    111      2000        6          ON-LINE//
/dev/vx/rdsk/sybase2/raw2g09  SQM1     110      2000        6          ON-LINE//

File system managed by the operating system (i.e. disk space for errorlogs) – Monitor the file system space used by all servers (i.e. database servers and Replication Servers). For example, for a Sybase Replication Server installation on Sun Solaris 10, the space can be monitored using the simple Unix “df” command.

hypnos-mis-/cis1/PRLDBS1A/log> df -k $SYBASE

Filesystem kbytes used avail capacity Mounted on

/dev/vx/dsk/sybase1/cis1_fs 10485760 5013734 5139665 50% /cis1
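The “admin disk_space” output shown earlier in this section can be summarized in the same way as the file system figures. The following Python sketch is a hedged illustration: it assumes each partition row has exactly six whitespace-separated fields in the order Partition, Logical, Part.Id, Total Segs, Used Segs, State, and the function names are hypothetical.

```python
def parse_disk_space(text):
    """Extract partition rows from captured 'admin disk_space' output."""
    partitions = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have six fields with numeric segment counts; the header
        # row fails the isdigit() test and is skipped.
        if len(parts) == 6 and parts[3].isdigit() and parts[4].isdigit():
            partitions.append({"partition": parts[0],
                               "logical": parts[1],
                               "total_segs": int(parts[3]),
                               "used_segs": int(parts[4]),
                               "state": parts[5]})
    return partitions


def overall_used_pct(partitions):
    """Percentage of stable device segments in use across all partitions."""
    total = sum(p["total_segs"] for p in partitions)
    used = sum(p["used_segs"] for p in partitions)
    return 100.0 * used / total if total else 0.0
```

Since one segment is 1 MB, the used segment count is also the approximate number of megabytes held in the stable queues.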

1.2.4 Replication topology Tools like Sybase central, Sybase power designer allows to generate graphical replication topology used for an organization. For large and complex replication system, regularly monitoring topology of the replication system for any changes mode to the replication system. The changes are not limited to… • Enabling data replication between two sites • Disabling data replication between two sites • Changing direction of data replication between two sites

1.2.4.1 Monitoring changes to table schema

The schema of all required databases/tables needs to be verified at least between:
• Primary site and replicate site(s)
• Primary site and the respective replication definition

This step is especially critical during an application upgrade, when the database schema is most likely to change. Many tools and methods can be used to find the schema of a table or the replication definition of the respective table, including but not limited to:
• Sybase Central
• Sybase PowerDesigner (replication requires the “Information Liquidity Model”)
• Shell/Perl scripts
• Simple SQL commands (i.e. “sp_help” for the table schema and “rs_helprep” for the replication definition)
• Other third-party tools


1.2.4.2 Marking for replication

This check makes sure that all required databases or database objects (mostly user tables) are marked for replication. The following sample stored procedure provides brief information:

create procedure sps_check_for_repmrk as

declare @dbnm varchar(100)

select @dbnm = db_name()

if (getdbrepstat() >= 0)

if exists (select 1 from sysobjects where type = 'U' and sysstat & -32768 = -32768)

select name from sysobjects where type = 'U' and sysstat & -32768 = -32768

else

select "No objects in database "+@dbnm+" is marked for replication"

else

select "Entire database "+@dbnm+" is marked for replication"

1.2.4.3 Data consistency

Regularly (weekly, monthly, or at least before important dates such as quarter-end or year-end closing) verify that data is consistent between the primary and replicate sites. The Sybase-provided tool “rs_subcmp” (called “subcmp” on Windows) can be used to find data inconsistencies between primary and replicate sites. For example, to find data inconsistencies between the primary site (server name = PDSDBS1, database name = DBSCOMMON, table name = attach) and its replicate site, create a configuration file (using “vi” or another editor):

# attach.cfg – This is the file name

# PDSDBS1.DBSCOMMON.dbo.attach with

# PRSDBS1.DBSCOMMON.dbo.attach.

#

PDS = PDSDBS1

RDS = PRSDBS1

PDB = DBSCOMMON

RDB = DBSCOMMON

PTABLE = attach

RTABLE = attach

PSELECT = select wijt_location,caller_id,create_date,file_name,description,file_contents,chgstamp

from attach

order by caller_id, create_date, file_name

RSELECT = select wijt_location,caller_id,create_date,file_name,description,file_contents,chgstamp

from attach

order by caller_id, create_date, file_name


PUSER = svr_maint

RUSER = svr_maint

PPWD = forget1t

RPWD = forget1t

KEY = caller_id

KEY = create_date

KEY = file_name

RECONCILE = N

VISUAL = Y

NUM_TRIES = 3

WAIT = 10

Then use the above configuration file as follows to find data inconsistencies:

$SYBASE/$SYBASE_REP/bin/rs_subcmp -f attach.cfg
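When many tables need their own configuration file, the file can be generated rather than typed. The following Python sketch is only an illustration: it renders the same parameter names used in attach.cfg above, while the function names and the decision to leave PUSER/RUSER/PPWD/RPWD out of the generated text (so that credentials are added separately) are assumptions of this example.

```python
def render_subcmp_config(settings: dict, keys: list) -> str:
    """Render 'NAME = value' lines followed by one KEY line per key column."""
    lines = [f"{name} = {value}" for name, value in settings.items()]
    lines += [f"KEY = {col}" for col in keys]
    return "\n".join(lines) + "\n"


def config_for_table(table: str, columns: list, keys: list) -> str:
    """Build a comparison config for one table (credentials not included)."""
    select = (f"select {','.join(columns)} from {table} "
              f"order by {', '.join(keys)}")
    settings = {
        "PDS": "PDSDBS1", "RDS": "PRSDBS1",      # servers from the example above
        "PDB": "DBSCOMMON", "RDB": "DBSCOMMON",  # databases from the example above
        "PTABLE": table, "RTABLE": table,
        "PSELECT": select, "RSELECT": select,
        "RECONCILE": "N", "VISUAL": "Y",
        "NUM_TRIES": 3, "WAIT": 10,
    }
    return render_subcmp_config(settings, keys)


print(config_for_table("attach",
                       ["caller_id", "create_date", "file_name"],
                       ["caller_id", "create_date", "file_name"]))
```

Keeping RECONCILE = N in generated files means a run only reports differences; switching it on is better left as a deliberate manual step.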

A separate user (i.e. svr_maint) can be created and bound to its own user-defined temporary database (which can in turn be bound to a user-defined cache) to avoid resource competition with other users. Another option is to use command-line switches for “rs_subcmp”. To verify or synchronize the whole database, it is recommended to create a batch process that runs “rs_subcmp” for every user table in the database. The following script can be used to generate that batch script (i.e. the script that actually verifies/synchronizes using rs_subcmp commands); it may require a few modifications to suit the local environment.

isql -Usa -SPDSDBS1 -DDBSCOMMON <<EOF

create table #table_list

(

id int ,

uu int

)

GO

create table #table_def

(

colid tinyint

,name char(30)

)

GO

insert #table_list

select id,uid

from sysobjects

where type='U'

and name not like 'rs_%'

order by name


GO

declare cursor_tabs cursor for

select id,uu from #table_list

GO

declare @dbname varchar(30)

,@tabid int

,@tabname varchar(100)

,@msg varchar(255)

,@pmsg varchar(255)

,@colid tinyint

,@indid int

,@counter int

,@colname varchar(30)

,@uu int

,@uuc varchar(100)

select @tabid = 0

open cursor_tabs

fetch cursor_tabs into @tabid,@uu

while (@@sqlstatus = 0)

begin

select @uuc = user_name(@uu)

setuser @uuc

select @tabname = object_name(@tabid)

insert #table_def

select A.colid, A.name

from syscolumns A

where A.id = @tabid

order by A.colid

select @msg = 'rs_subcmp -SPDSDBS1 -DDBSCOMMON -sPRSDBS1 -dDBSCOMMON -c"select * from '+@tabname+' order by '

print @msg

select @indid = min(indid)

from sysindexes

where id = @tabid

and indid > 0

and (status & 2) = 2

select @pmsg=' '

if (@indid is not NULL)

begin

select @counter = 1


while @counter <= 16

begin

select @colname = index_col(@tabname, @indid, @counter)

if (@colname is NULL)

break

if (@counter > 1)

select @pmsg = @pmsg + ", "

select @pmsg = @pmsg + rtrim(@colname)

select @counter = @counter + 1

end

end

else

begin

select @colid = 0

while (select min(colid) from #table_def where colid > @colid) != NULL

begin

select @colid = min(colid) from #table_def where colid > @colid

select @pmsg = @pmsg + convert(varchar(30),name)

from #table_def

where colid = @colid

if (select count(*) from #table_def) > 1

and exists (select * from #table_def where colid>@colid)

begin

select @pmsg = @pmsg + ","

end

end

end

select @msg = @msg + @pmsg + '" -u'+@uuc+' -U'+@uuc+' -t'+@tabname+' -V -k'+@pmsg

print @msg

truncate table #table_def

fetch cursor_tabs into @tabid,@uu

select @uuc = user_name(@uu)

setuser

end

GO

close cursor_tabs

GO

deallocate cursor cursor_tabs

GO

EOF



1.3 Performance monitoring

[Diagram: performance units to measure/monitor for effective monitoring]

1.3.1 Latency

Latency is the time difference (generally measured in seconds) between “work” done in the primary database and the same work applied in the replicate database. Several methods can be used to determine it, depending on exactly how latency is defined. Latency can be calculated for a single transaction, a batch job, or an entire database system.

1.3.1.1 Rs_lastcommit – Replication Server maintains this table (i.e. “rs_lastcommit”) in every replicate database. It stores the most recently committed transaction from each source/primary site. This is NOT the best method, because the times recorded in the table are generally NOT accurate. Because the table reports ONLY the last committed transaction, it is also difficult to link a row to the respective primary transaction, and in a large, complex environment it is difficult to identify the latency of individual batch processes or single transactions. For example, below is the output from one replicate database:

isql -U<user> -S<Replicate_DBSrv> -P<pwd> -D<DBName>

select origin_time,dest_commit_time from rs_lastcommit

go

origin_time          dest_commit_time
Jan 28 2009 2:02PM   Jan 28 2009 2:52PM

1.3.1.2 Heartbeat – This is a Sybase Central feature for monitoring latency in a replication system. It creates a replication-enabled table (called “rsm_heartbeat”) and modifies the table at frequent intervals. It presents latency in graphical form. Restrictions include:
• Sybase Central must be used (i.e. it must be connected to both primary and replicate)
• The latency measurement is only representative of single-row updates

To configure Heartbeat using Sybase Central, select the database connection (i.e. the primary database connection for which the heartbeat needs to be configured), right-click, and select “Heartbeats”. Detailed steps are available in the “Help” section of the RS plug-in.

[Screenshot: Heartbeat help in the Sybase Central RS plug-in]


1.3.1.3 Manually managed ping/time table(s) – User-defined table(s) can be created with columns that default to the current time on the respective database server. Rows can be inserted into these tables before, after, or during the work being measured (depending on the application/batch job), and latency can be derived by comparing the values in the tables. For example, the table can be defined as follows in the primary database:

isql -U<user> -S<PDBSrv> -P<pwd> -D<DBName>

Create table PRLREP1_timer

( daats_id int,

p_dt datetime default getdate()

)

Go

Create unique clustered index PRLREP1_timer_idx01 on PRLREP1_timer (daats_id)

Go

Grant all on PRLREP1_timer to public

go

On the replicate database, create a similar table as shown below:

isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>


Create table PRLREP1_timer

( daats_id int,

p_dt datetime,

r_dt datetime default getdate()

)

Go

Create unique clustered index PRLREP1_timer_idx01 on PRLREP1_timer (daats_id)

Go

Grant all on PRLREP1_timer to public

go

Once replication is set up for the table created above, the first two columns are replicated from the primary values. To calculate latency, insert a row into the table before and after a large batch of transactions:

isql -U<user> -S<PDBSrv> -P<pwd> -D<DBName>

Insert into PRLREP1_timer (daats_id) select isnull(max(daats_id),0) +1 from PRLREP1_timer

Go

/* EXECUTE BATCH PROCESS */

Insert into attach ….

Update attach set …

……

Insert into attach ….

Update attach set …

Go

Insert into PRLREP1_timer (daats_id) select isnull(max(daats_id),0) +1 from PRLREP1_timer

go

After the batch load has completed and been replicated, the following SQL can be used (on the replicate database) to calculate latency:

isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>

Select datediff(ss,min(p_dt),max(r_dt)) from PRLREP1_timer

go

Clocks for primary and replicate site must be synchronized to measure latency effectively.
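The same calculation can be done outside the database once the timer rows have been fetched. The sketch below is a Python illustration of the min(p_dt)/max(r_dt) difference; the function name and the sample timestamps are hypothetical, and it assumes the rows were read from the replicate copy of PRLREP1_timer.

```python
from datetime import datetime


def batch_latency_seconds(rows) -> float:
    """rows: iterable of (p_dt, r_dt) datetime pairs from the replicate table.

    Returns seconds between the earliest primary timestamp and the latest
    replicate timestamp, mirroring datediff(ss, min(p_dt), max(r_dt)).
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no timer rows")
    first_primary = min(p for p, _ in rows)
    last_replicate = max(r for _, r in rows)
    return (last_replicate - first_primary).total_seconds()


# Hypothetical marker rows written before and after a batch.
timer_rows = [
    (datetime(2009, 1, 28, 14, 2, 0), datetime(2009, 1, 28, 14, 2, 5)),
    (datetime(2009, 1, 28, 14, 10, 0), datetime(2009, 1, 28, 14, 10, 30)),
]
print(batch_latency_seconds(timer_rows))
```

As with the SQL version, the result is only meaningful when the clocks on both hosts are synchronized.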


1.3.1.4 Rs_ticket – An rs_ticket can be thought of as a message that travels from the primary database to the replicate database, stopping at the following Replication Server threads:
• EXEC
• DIST
• DSI

At each stop (i.e. when the message is handled by a particular replication thread) the time is appended to the message. Once the message arrives at its destination (i.e. the replicate database), the stored procedure “rs_ticket_report” can be modified to append the arrival time to the message and store the entire message in a user-defined table. That table can then be used for further analysis (i.e. measuring latency). The rs_ticket_report function string must be enabled by altering the replicate connection:

isql -U<user> -S<RepSrv> -P<pwd>

1> alter connection to servername.databasename set 'dsi_rs_ticket_report' to 'on'

2> go

On the replicate site, the stored procedure “rs_ticket_report” can be modified as follows:

isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>

create procedure rs_ticket_report

(@rs_ticket_param varchar(255))

as

begin

set nocount on

declare @new_cmd varchar(255),

@c_time datetime,

@c_secs numeric(6,3)

select @c_time = getdate()

select @c_secs = datepart( millisecond, @c_time)

select @c_secs = datepart( second, @c_time) + @c_secs/1000

select @new_cmd =

@rs_ticket_param + ";RDB(" + db_name()+ ")="

+ convert( varchar(2), datepart( hour, @c_time))

+ ":" + convert( varchar(2), datepart( minute, @c_time))

+ ":" + convert( varchar(6), @c_secs)

insert daats_tkt values (@new_cmd)


end

On the primary site, execute the following to gather performance-related data:

isql -U<user> -S<PrimaryASE> -P<pwd> -D<PrimaryDB>

1> exec rs_ticket 'BEGIN BATCH PROCESS'
2> EXECUTE BATCH PROCESS
3> exec rs_ticket 'COMPLETE BATCH PROCESS'
4> go

On the replicate site:

isql -U<user> -S<RepASE> -P<pwd> -D<RepDB>

select * from daats_tkt

go

The output should look like:

####################################################################
V=1;H1=BEGIN BATCH PROCESS;PDB(pdsdbs1)=09:51:49.180; EXEC(29)=09:51:49.0;B(29)=43690;DIST(20)=09:51:52.0;DSI(27)=09:51:55.0;RDB(prsdbs1)=09:51:55.413
V=1;H1=COMPLETE BATCH PROCESS;PDB(pdsdbs1)=09:51:49.193; EXEC(29)=09:51:49.0;B(29)=44894;DIST(20)=09:51:52.0;DSI(27)=09:51:55.0;RDB(prsdbs1)=09:51:55.413
####################################################################

Understanding the output from the “rs_ticket” process:
• V – Version number
• H – Header information; the string passed to “rs_ticket” at the primary site
• PDB – Primary database name and the time (from the host clock) rs_ticket was executed
• EXEC – spid number of the EXEC thread in Replication Server and the time rs_ticket passes through it
• B – spid number of the EXEC thread and the total bytes it received from the Replication Agent. In this case total bytes received = 43690
• DIST – spid number (shown in “admin who”) and the time rs_ticket passes through the DIST
• DSI – spid number (shown in “admin who”) and the time rs_ticket passes through the DSI
• RDB – Replicate database name and the time rs_ticket_report was called to add the message to the results table (i.e. daats_tkt)


As shown in the output above, latency is the difference between the time the “rs_ticket” message arrives at the replicate database and the time it was sent from the primary database, i.e. approximately six seconds (09:51:55.413 – 09:51:49.180 = 6.233 seconds).
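The subtraction can be scripted so that every row stored in the ticket table is evaluated the same way. The following Python sketch parses one ticket string in the format shown above and returns the RDB minus PDB difference; the function names are illustrative, and the sketch assumes both timestamps fall on the same calendar day, since the ticket carries no date.

```python
import re

# Matches fields such as "PDB(pdsdbs1)=09:51:49.180" or " EXEC(29)=09:51:49.0".
_TIMED_FIELD = re.compile(
    r"\s*([A-Z]+)\(([^)]*)\)=(\d+):(\d+):(\d+(?:\.\d+)?)\s*")


def parse_ticket(ticket: str) -> dict:
    """Map each timed tag (PDB, EXEC, DIST, DSI, RDB) to seconds since midnight."""
    stamps = {}
    for field in ticket.split(";"):
        match = _TIMED_FIELD.fullmatch(field)
        if match:  # untimed fields (V, H1, B) do not match and are skipped
            tag, _ident, hh, mm, ss = match.groups()
            stamps[tag] = int(hh) * 3600 + int(mm) * 60 + float(ss)
    return stamps


def ticket_latency(ticket: str) -> float:
    """End-to-end latency in seconds: replicate arrival minus primary send."""
    stamps = parse_ticket(ticket)
    return round(stamps["RDB"] - stamps["PDB"], 3)


ticket = ("V=1;H1=BEGIN BATCH PROCESS;PDB(pdsdbs1)=09:51:49.180; "
          "EXEC(29)=09:51:49.0;B(29)=43690;DIST(20)=09:51:52.0;"
          "DSI(27)=09:51:55.0;RDB(prsdbs1)=09:51:55.413")
print(ticket_latency(ticket))
```

The dictionary returned by `parse_ticket` also allows per-stage differences (for example DSI minus DIST) to be computed when it matters where the time is being spent.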

1.3.2 Throughput – Throughput can be calculated by measuring the latency (using the methods defined above) for a given amount of “work” (for example 1000 transactions or the total bytes transferred). Several commands report how much data (in bytes) has been processed by Replication Server; for example, the “Bytes” column of “admin who,sqm” shows the total number of bytes written. First run “admin who,sqm” and note the Bytes column for the connection of interest (the output below shows only three columns; the others are omitted):

State             Info                              Bytes
Awaiting Message  16777372:0 PRLRMDBS1              268793702
Awaiting Message  615:0 PRSDBS1.DBSvmst             24000982
Awaiting Message  606:0 PRSDBS1.DBSglep             -926256804
Awaiting Message  379:1 DBS.DBSvmst_LC              81859762
Awaiting Message  379:0 DBS.DBSvmst_LC              0
Awaiting Message  378:1 DBS.DBSglep_LC              -1344189992
Awaiting Message  378:0 DBS.DBSglep_LC              0
Awaiting Message  248:1 DBS.DBSwact_LC              -1605776577
Awaiting Message  248:0 DBS.DBSwact_LC              0
Awaiting Message  236:1 DBS.DBSarpc_LC              68229190
Awaiting Message  236:0 DBS.DBSarpc_LC              0
Awaiting Message  231:1 DBS.DBSallc_LC              120230729
Awaiting Message  231:0 DBS.DBSallc_LC              0
Awaiting Message  201:0 PDSREP1.PRLDBS1B_RSSD       257502
Awaiting Message  201:1 PDSREP1.PRLDBS1B_RSSD       1330086096

Now process the batch load as described in the previous sections (i.e. to calculate latency) and, once the batch has been replicated, execute “admin who,sqm” again:

State             Info                              Bytes
Awaiting Message  16777372:0 PRLRMDBS1              268793702
Awaiting Message  615:0 PRSDBS1.DBSvmst             24002982
Awaiting Message  606:0 PRSDBS1.DBSglep             -926256804
Awaiting Message  379:1 DBS.DBSvmst_LC              81859762
Awaiting Message  379:0 DBS.DBSvmst_LC              0
Awaiting Message  378:1 DBS.DBSglep_LC              -1344189992
Awaiting Message  378:0 DBS.DBSglep_LC              0
Awaiting Message  248:1 DBS.DBSwact_LC              -1605745641
Awaiting Message  248:0 DBS.DBSwact_LC              0
Awaiting Message  236:1 DBS.DBSarpc_LC              68229190
Awaiting Message  236:0 DBS.DBSarpc_LC              0
Awaiting Message  231:1 DBS.DBSallc_LC              120230729
Awaiting Message  231:0 DBS.DBSallc_LC              0
Awaiting Message  201:0 PDSREP1.PRLDBS1B_RSSD       257502
Awaiting Message  201:1 PDSREP1.PRLDBS1B_RSSD       1330088366

As shown, a total of 2000 bytes (24002982 – 24000982) were processed by the Replication Server for the PRSDBS1.DBSvmst queue during the batch load. To calculate throughput, divide the total bytes transferred by the latency measured for the same batch; for example, 2000 bytes with a latency of 2 seconds gives a throughput of 1000 bytes/sec. Another way to obtain the total bytes transferred (in order to calculate throughput) is the “admin statistics,SQM,BytesWritten” command. Make sure to reset the counters by executing “admin statistics,reset” before starting the large batch in the primary database.
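The division can be wrapped in a small helper so that the same arithmetic is applied to every queue. The Python sketch below is illustrative only. Because several values in the Bytes column above are negative, it assumes the column is a signed 32-bit counter that wraps around and takes the difference modulo 2^32; that interpretation is an assumption of this example, not something stated by the output.

```python
def bytes_delta(before: int, after: int, counter_bits: int = 32) -> int:
    """Bytes written between two readings, tolerating counter wraparound.

    Assumption: the counter is counter_bits wide and wraps at most once
    between the two readings.
    """
    return (after - before) % (1 << counter_bits)


def throughput(before: int, after: int, latency_seconds: float) -> float:
    """Bytes per second for one queue over the measured latency."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return bytes_delta(before, after) / latency_seconds


# Readings for queue 615:0 from the two listings above, with a 2-second latency.
print(throughput(24000982, 24002982, 2.0))
```

The same helper works for rows whose counter is currently negative, such as 248:1, as long as both readings come from the same queue.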

1.3.3 Statistics monitoring (i.e. monitors and counters) – Monitors and counters can be used to monitor replication in more detail, including but not limited to:
• Finding “ignored” transactions
• Managing segments
• Finding transaction sizes
• Finding command sizes
• Calculating throughput and latency
• Calculating the read/write rate of the stable device
• Helping to tune replication (i.e. sizing the SQT cache, parallel DSI, etc.)

In Replication Server version 12.6, counters were monitored using the following two commands:

isql -U<user> -S<RepSrv> -P<pwd>

1> admin who
2> go
.
.
1> admin statistics
2> go


Counters provide details about the following Replication Server modules:
• CM
• DIST
• DSI
• DSIEXEC
• REPAGENT
• RSI
• SQM
• SQT
• STS

Counters can be identified distinctly for each instance (i.e. occurrence) of a module. Counters for a single-instance module are identified by the module name (for example STS, CM). Counters for a multi-instance module can be identified in one of two ways:
• Module name and instance ID (i.e. LDBID, DBID), for example RSI/DSI-S/DIST
• Module name, instance ID, and instance value, for example SQT/DSI-Exec

Replication monitors and counters fall into one of the following groups, based on their output (i.e. the characteristic of the generated statistics):
• Observers – Report the number of occurrences
• Monitors – Report their current value
• Counters – Report collected statistics

Additionally, each monitor or counter can have one or more of the following status values, which determine how its results are calculated:
• CNT_SYSMON – Counters that can be used by the “admin statistics, sysmon” command
• CNT_MUST_SAMPLE – Counters whose results are always sampled
• CNT_NO_RESET – Counters that cannot be reset (i.e. initialized)
• CNT_DURATION – Counters that measure durations
• CNT_KEEP_OLD – Counters that keep both their current and previous values
• CNT_CONFIGURE – Counters that hold the current value of a Replication Server configuration parameter

1.3.3.1 Replication counters version 12.6 – Use the stored procedure “rs_helpcounter” (in the related RSSD) to find detailed information on each counter. In this version, counters are categorized into the following types:
• Total
• Last
• Max
• Avg

The following methods can be used to monitor the counters.

Using “admin statistics” – A quick and easy way to monitor the replication counters. For example, executing “admin statistics,sysmon” lists all non-intrusive counters in this category. A detailed explanation of those counters can be obtained by executing “rs_helpcounter sysmon”.

Configuring Replication Server to collect counters – This is the recommended method for monitoring the replication counters in order to derive performance and other useful statistics for the replication system. The high-level setup steps are below.

Start sampling for all types of counters (intrusive/non-intrusive) – The following commands can be used:

configure replication server set 'stats_sampling' to 'on'

admin stats_intrusive_counter, 'on'

Start collecting the counters into RSSD (..make sure to monitor RSSD database) – This step will collect the counters into following RSSD tables at regular interval

rs_statcounters – Details about ALL counters.

rs_statdetail – Collected counters

rs_statrun – Stored statistical information for each collection (i.e. flush to RSSD)

The following commands can be used to set up the collection:
1. configure replication server set 'stats_flush_rssd' to 'on'
2. configure replication server set 'stats_reset_afterflush' to 'on'
3. configure replication server set 'stats_daemon_sleep_time' to '<seconds>'
4. admin statistics, reset (optional; resets the counters manually at any point in time)

Filter collection – Once sampling is started, Replication Server collects many counters. Filters can be added so that only specific counters (for the required replication module) are collected into the RSSD tables. The following commands can be used to set up and check the filters:
1. admin stats_config_module
2. admin stats_config_connection
3. admin stats_config_route
4. admin statistics,flush_status

1.3.3.2 Replication counters version 15.0 – Unlike version 12.6, this version does not categorize counters into types (i.e. Total, Last, Max, Avg); instead, every counter collects:
• Number of observations
• Total of the observed values
• Last observed value
• Maximum observed value

Also, starting with version 15.0, there are no intrusive counters. From this version on, collecting and monitoring replication monitors/counters takes only a few simple steps. The “admin stats” command is used to define:
• What statistics to collect
• The final destination of the collected counters (i.e. screen/RSSD)
• How long to collect (i.e. sampling period and number of observations)

After “admin stats” is executed, the user is returned to the Replication Server command prompt and can exit the session or continue with other work. Later, “admin stats, status” can be used to view the progress of a previously executed “admin stats” collection, and “admin stats, cancel” can be used at any time to stop collection.

Once the counters are collected, the RSSD tables (mentioned above) can be queried for further analysis of the replication system. The newly introduced RSSD stored procedure “rs_dump_stats” dumps all collected counters into a CSV-formatted file, which can then be loaded into an Excel sheet for further analysis (note: an Excel sheet may be limited to storing about 65K rows).

1.3.3.3 Sp_sysmon – The Sybase ASE stored procedure “sp_sysmon” also provides a section specific to Replication Agent, with detailed statistics for each configured Replication Agent. Below is sample output of “sp_sysmon”. The “Log Scan Activity” section reports, among other things:
• Replicated DDL activity
• CLRs – log records for transactions that were partially or fully rolled back

Replication Agent


-----------------

Replication Agent: DBSCOMMON

Replication Server: PRLDBS1G

per sec per xact count % of total

------------ ------------ ---------- ----------

Log Scan Summary

Log Records Scanned n/a n/a 9 n/a

Log Records Processed n/a n/a 1 n/a

Log Scan Activity

Updates n/a n/a 0 n/a

Inserts n/a n/a 0 n/a

Deletes n/a n/a 0 n/a

Store Procedures n/a n/a 0 n/a

DDL Log Records n/a n/a 0 n/a

Writetext Log Records n/a n/a 0 n/a

Text/Image Log Records n/a n/a 0 n/a

CLRs n/a n/a 0 n/a

The “Transaction Activity” section can be used to check that (approximately) the number of transactions committed plus the number of transactions aborted equals the number of transactions opened.

Transaction Activity

Opened n/a n/a 1 n/a

Commited n/a n/a 1 n/a

Aborted n/a n/a 0 n/a

Prepared n/a n/a 0 n/a

Maintenance User n/a n/a 0 n/a

Log Extension Wait

Count n/a n/a 2 n/a

Amount of time (ms) n/a n/a 14133 n/a

Longest Wait (ms) n/a n/a 14133 n/a

Average Time (ms) n/a n/a 7066.5 n/a

Schema Cache Lookups

Forward Schema

Count n/a n/a 0 n/a

Total Wait (ms) n/a n/a 0 n/a



Backward Schema

Count n/a n/a 0 n/a


Total Wait (ms) n/a n/a 0 n/a



Truncation Point Movement

Moved n/a n/a 0 n/a

Gotten from RS n/a n/a 1 n/a

Connections to Replication Server

Success n/a n/a 0 n/a

Failed n/a n/a 0 n/a

Network Packet Information

Packets Sent n/a n/a 1 n/a

Full Packets Sent n/a n/a 0 n/a

Largest Packet n/a n/a 175 n/a

Amount of Bytes Sent n/a n/a 175 n/a

Average Packet n/a n/a 175.0 n/a

I/O Wait from RS

Count n/a n/a 2 n/a

Amount of Time (ms) n/a n/a 0 n/a


Average Wait (ms) n/a n/a 0.0 n/a

--------------------------------------------------------------------------------
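The check on the “Transaction Activity” rows in the output above can be automated once the report text has been captured. The Python sketch below is illustrative: the function names are hypothetical, and it assumes each row has the five whitespace-separated fields shown above (label, per sec, per xact, count, % of total). It accepts both the “Commited” spelling that appears in the sample and “Committed”.

```python
def transaction_counts(sysmon_text: str) -> dict:
    """Extract the 'count' column for the Transaction Activity rows."""
    wanted = {"Opened", "Commited", "Committed", "Aborted"}
    counts = {}
    for line in sysmon_text.splitlines():
        fields = line.split()
        # Rows look like: <label> n/a n/a <count> n/a
        if len(fields) == 5 and fields[0] in wanted:
            counts[fields[0]] = int(fields[3])
    return counts


def transactions_balanced(counts: dict) -> bool:
    """True when opened == committed + aborted."""
    committed = counts.get("Committed", counts.get("Commited", 0))
    return counts.get("Opened", 0) == committed + counts.get("Aborted", 0)


sample = """Transaction Activity
Opened n/a n/a 1 n/a
Commited n/a n/a 1 n/a
Aborted n/a n/a 0 n/a
Prepared n/a n/a 0 n/a"""
print(transactions_balanced(transaction_counts(sample)))
```

Because the document describes the relationship as approximate, an imbalance is better treated as a prompt to investigate than as an error.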

1.3.3.4 Measuring Replication Agent – How well the Replication Agent is keeping up can be measured using the following information:
• The beginning of the transaction log can be found in the master..sysdatabases table (column = logptr).

isql -U<user> -S<PrimaryDBSrv> -P<pwd>

1> select logptr from master..sysdatabases where name = "DBSCOMMON"
2> go

The output looks like:

logptr

-----------

19844


• Truncation points can be queried from the master..syslogshold table (column = page).

isql -U<user> -S<PrimaryDBSrv> -P<pwd>

1> select page from master..syslogshold
2> go

page

-----------

19845

• The current position of the Replication Agent can be found by executing “sp_help_rep_agent <db>” (column = current marker).

isql -U<user> -S<PrimaryDBSrv> -P<pwd>

1> sp_help_rep_agent "DBSCOMMON"
2> go

Replication Agent Recovery status

dbname connect dataserver connect database status rs servername rs username

------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------ ------------------------------

DBSCOMMON PDSDBS1 DBSCOMMON not active PRLDBS1G PRLDBS1G_ra

Replication Agent Process status

dbname spid sleep status retry count last error

------------------------------ ----------- ------------------------------ ----------- -----------

DBSCOMMON 18 end of log 0 0

Replication Agent Scan status

dbname start marker end marker current marker log recs scanned oldest transaction

------------------------------ ------------------------------ ------------------------------ ------------------------------ ---------------- ------------------------------

DBSCOMMON (19845,18) (19845,22) (19845,22) 0 (-1,0)


• The last page of the log can be determined using “dbcc pglinkage(<dbid>,<cur_pg>,0,0,0,1)”, where <cur_pg> can be any page in the log (i.e. the beginning of the log, the primary/secondary truncation point, or the current Replication Agent position).

Object ID for pages in this chain = 8.

End of chain reached.

2 pages scanned. Object ID = 8. Last page in scan = 19846.

DBCC execution completed. If DBCC printed error messages, contact a user with System Administrator (SA) role.

1.3.3.5 Measuring Queue (Inbound/Outbound) – Use “admin who,sqm” and the difference between the “Last Seg.Block” and “Next Read” columns to see how far Replication Server has progressed in processing each queue. If “Next Read” is greater than or equal to (>=) “Last Seg.Block”, there is nothing left in the queue for Replication Server to process.
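The comparison can be expressed as a small helper. The Python sketch below is an illustration under stated assumptions: that both columns are dotted strings beginning with segment and block numbers (any further component, such as a row number, is ignored), and that a segment holds 64 blocks. The function names and sample positions are hypothetical.

```python
BLOCKS_PER_SEGMENT = 64  # assumption: 1 MB segments made up of 16 KB blocks


def to_block_number(position: str) -> int:
    """Convert 'segment.block[.row]' to an absolute block count."""
    parts = position.split(".")
    segment, block = int(parts[0]), int(parts[1])
    return segment * BLOCKS_PER_SEGMENT + block


def queue_is_caught_up(last_seg_block: str, next_read: str) -> bool:
    """Apply the rule above: Next Read >= Last Seg.Block means nothing to process."""
    return to_block_number(next_read) >= to_block_number(last_seg_block)


def backlog_blocks(last_seg_block: str, next_read: str) -> int:
    """Approximate number of blocks still waiting to be read (never negative)."""
    return max(0, to_block_number(last_seg_block) - to_block_number(next_read))


print(queue_is_caught_up("10.5", "10.6.0"))
print(backlog_blocks("12.10", "10.6.0"))
```

Tracking the backlog figure over time shows whether a queue is draining or growing, which a single snapshot cannot.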


1.4 Alerting/Notification – User-defined scripts (to take appropriate action, send email, etc.) can be triggered by configuring Sybase Replication Server Manager (using Sybase Central) for certain replication events/conditions (i.e. DSI DOWN, SERVER DOWN, etc.). A similar setup can also be accomplished using Sybase RMS (Replication Monitoring Services).

[Diagram: commonly used alerting/notification methods]

1.4.1 RSM Event monitoring

Sybase Central can be used to instruct an RSM Server (which may reside on a remote host) to run user-defined scripts (located on the same host where RSM is running) for specific server events/conditions. An event is a change that occurs in a replication system managed by a specific RSM Server. To configure RSM event monitoring:


• Install and configure RSM server (using “rsmgen” utility located in $SYBASE/$SYBASE_RSM/install)

• Using Sybase Central (having Replication Plugin) connect to the newly configured RSM server.

• Add primary, replicate, replication and RSSD server to the RSM server using Sybase Central (Make sure to add RSSD server before adding replication server)

• Then right-click the RSM server and select “Server Events”. The next dialog box shows six different events (i.e. six different tabs):

o Server events – Events specific to a change in the state of a monitored server (either ASE or Replication Server). Select the “RSM Domain” for the specific “Server”. The following states are available for monitoring server events:

Active - indicates a server is functioning normally. This option is useful if you want to send an e-mail or pager message when a server begins functioning normally after experiencing a problem.

Quiesced - Indicates a server is quiesced. If you use RSM to quiesce a Replication Server, the Replication Server state becomes Suspect rather than Quiesced because the LTMs are suspended.

Suspect - indicates a server is still running but is experiencing a problem.

Hung - indicates RSM cannot connect to the server because of a connection timeout.

Shutdown - indicates the shutdown command was used to shut down a Replication Server or an LTM.

Dead - indicates a server was shut down using a method other than the Shutdown command; for example, you used the isql command to shut down a server.

Unknown - indicates RSM cannot communicate with another server because of a connectivity problem.

Invalid - indicates RSM encountered an error in critical information files, such as a missing or corrupt stored procedure in the RSSD of a monitored Replication Server.

Once the event is selected from the above list, select the “Servers” for which this event needs to be monitored.

o Route events: Events specific to changes in the status of a route. Select the “RSM Domain” and “Replication server” for the specific route.

o Connection events: Events specific to changes in the status of a connection. Select the “RSM Domain” and “Replication server” for the specific connection.

o Partition events: Events specific to partition thresholds (monitors partitions and raises an event when a partition's size equals or exceeds a specified threshold) and partition state changes (monitors partitions and raises an event when a partition's state changes to ONLINE, OFFLINE, or DROPPED). Select the “RSM Domain” and “Replication server” for the partition.

o Queue events: Events specific to queue thresholds (monitors queues for the specified Replication Servers and raises an event when a queue's size equals or exceeds the specified threshold) and queue latency (the amount of time that the first block has remained at the beginning of each stable queue). Select the “RSM Domain” and “Replication server” for the queue.

o Database events: Events specific to replication latency. Select the “RSM Domain” and “database” (select multiple databases using the Ctrl key).

Finally, make sure to provide a “Script Location” for each selected event. The script can, at a minimum:
• Send an email notification
• Send a page
• Write to an error log file
• Insert the error into a selected database
• Add a partition
• Restart a server

1.4.2 Scripts

Shell or Perl scripts can be used separately to monitor the replication system. For example, the sample scripts below can be used to monitor the primary database server, replicate database server, and replication server.

wrap_rep.csh

#!/bin/csh

#################################################################

# This script is a wrapper

# This script can be placed in startup script

#################################################################

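# Exit if another copy of this wrapper is already running
# (lines belonging to this process, PID $$, are ignored)
ps -ef | grep -v grep | grep "$0" | grep -v " $$ " > /dev/null

if ($status == 0) exit

while (1)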

check_srv PDSDBS1

check_srv PRSDBS1

check_rep_comp PRLDBS1A

sleep 300

end

check_srv.csh

#!/bin/csh


#################################################################

# This script checks connection to ASE/Rep Srv is OK

#################################################################

set EMAIL_LIST="[email protected]"

set usr=

set pass=

## Check whether a connection to the server can be established

date > /tmp/$$

isql -U$usr -w132 -S$1 <<EOF >> /tmp/$$

$pass

go

EOF

## If not, send an email and loop until a connection to the server

## can be established

if ("$status" != "0") mailx -s "PL Check $1" $EMAIL_LIST < /tmp/$$

isql -U$usr -w132 -S$1 <<EOF

$pass

go

EOF

while ($status != 0)

sleep 300

isql -U$usr -w132 -S$1 <<EOF

$pass

go

EOF

end

check_rep_comp.csh

#!/bin/csh


#######################################################################

# This script checks all components of Rep Srv are UP and running

#######################################################################

set EMAIL_LIST="[email protected]"

set usr=

set pass=

## Check whether all components of the Rep Srv are up and running

## Loop until they are up and running

check_srv $1

isql -U$usr -w132 -S$1 <<EOF | grep -i suspect

$pass

admin health

go

EOF

if ($status == 0) then

tail -100 $SYBASE/REP-12_6/install/PRLDBS1A.log | mailx -s "PL Check RepSrv" $EMAIL_LIST

endif

## The block above sends an email with the last 100 lines of the error log if any component is down

## Loop until all components are up again

check_srv $1

isql -U$usr -w132 -S$1 <<EOF | grep -i suspect
$pass

admin health

go

EOF

while ($status == 0)

sleep 300

check_srv $1

isql -U$usr -w132 -S$1 <<EOF | grep -i suspect
$pass


admin health

go

EOF

end
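The core test in check_rep_comp.csh is a case-insensitive search for “suspect” in the “admin health” output, followed by mailing the last 100 lines of the error log. The Python sketch below mirrors only that logic so it can be tried without a server; the function names and the sample output strings are hypothetical, and no assumption is made about how the text is obtained.

```python
def rep_server_suspect(admin_health_output: str) -> bool:
    """True when any output line mentions 'suspect' (like grep -i suspect)."""
    return any("suspect" in line.lower()
               for line in admin_health_output.splitlines())


def last_lines(log_text: str, count: int = 100) -> str:
    """Return the last `count` lines of a log (like tail -100)."""
    return "\n".join(log_text.splitlines()[-count:])


# Hypothetical captured output, for illustration only.
healthy = "Mode Quiesce Status\nNORMAL FALSE HEALTHY"
degraded = "Mode Quiesce Status\nNORMAL FALSE SUSPECT"

for text in (healthy, degraded):
    if rep_server_suspect(text):
        print("alert: Replication Server reports a suspect component")
```

Separating the decision from the command that produces the text makes the alert condition easy to test before wiring it into a scheduler.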


1.5 General Troubleshooting

1.5.1 Skipping transaction – This is needed when the DSI goes down because of a bad transaction (i.e. a transaction that fails in the replicate database). To continue (i.e. to ignore the current bad transaction), the connection can be resumed using the following command:

isql -U<user> -S<RepSrv> -P<pwd>

1> resume connection to <replicate_dataserver>.<replicate_db> skip tran
2> go

Replication Server moves the first bad transaction into the exceptions log (located in the RSSD) and continues (i.e. resumes the connection that was down/suspended) with the next transaction in the queue. To view the skipped transaction, log in to the respective RSSD (the Replication Server command “admin rssd_name” can be used to find the RSSD) and use the “rs_helpexception” stored procedure. Once the transaction has been reviewed, it can be deleted from the exceptions log using the “rs_delexception” stored procedure.

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>

1> rs_helpexception
2> go
.

.

-- Look for Xact which was logged most recently (i.e. Xact_id)

.

.

1> rs_helpexception <Xact_id>,v
2> go

1.5.2 Dumping stable queue – The entire queue, or part of it (by providing a particular segment:block:count), can be dumped to a file or to the screen using the “sysadmin dump_queue” command.

1.5.3 Disabling secondary truncation point – To prevent the transaction log from filling up in the primary database because of a replication failure, execute the following commands at the primary database:

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<PrimaryDB>

1> sp_stop_rep_agent <DBName>
2> go
.

.

1> dbcc settrunc(ltm,ignore) 2> go

1.5.4 Enabling secondary truncation point – After fixing the replication system/server failure, or after refreshing the primary from a backup, the following commands can be used to re-enable the secondary truncation point:

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>

1> rs_zeroltm <PrimaryDBSrv>,<DBName>
2> go

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<PrimaryDB>

1> dbcc settrunc(ltm,valid)
2> go
.

.

1> sp_start_rep_agent <DBName> 2> go


References
• www.sybase.com
• Replication Reference Manual
• Replication Administration Guide
• Replication Troubleshooting Guide
• Replication Heterogeneous Replication Guide
