Exadata Maintenance tasks 101
Nelson Calero
OTN Tour Latinoamérica
August 2015
About me
• Database Consultant at Pythian
• Computer Engineer
• Oracle Certified Professional DBA 10g/11g
• Oracle ACE
• Working with Oracle tools and Linux environments since 1996
• DBA Oracle (since 2001) & MySQL (since 2005)
• Oracle University Instructor
• Co-founder and President of the Oracle user Group of Uruguay
• LAOUC Director of events
• Blogger and frequent speaker: Oracle Open World, Collaborate, OTN Tour, JIAP, MySQL/NoSQL conferences
http://www.linkedin.com/in/ncalero @ncalerouy
2 © 2014 Pythian Confidential
Pythian overview
• 18 Years of data infrastructure management consulting
• 200+ Top brands
• 11700+ Systems under management
• Over 387 DBAs in 30 countries
• Top 5% of DBA work force, 10 Oracle ACEs, 4 ACED, 3 OakTable members, 2 OCM, 5 Microsoft MVPs, 1 Cloudera Champion of Big Data
• Oracle, Microsoft, MySQL, Hadoop, Cassandra, MongoDB, and more
• Infrastructure, Cloud, DevOps, and application expertise
Today’s topics
1. Introduction to Exadata
2. Changes for the DBA
3. Monitoring
– Configuring ASR
4. Maintenance
– Common procedures
– Patching
– Replacing parts
– Some examples
Introduction to Exadata
• “The highest-performing platform for running Oracle Database” – X5-2 (ref: oracle.com)
• Best for all database workloads: DW/OLTP/In-Memory
• Part of the Engineered Systems family
https://www.oracle.com/engineered-systems/index.html
– SuperCluster
– Private Cloud Appliance
– Database Appliance
– Big Data Appliance
– Exalogic Elastic Cloud
– Exalytics In-Memory
– Zero Data Loss Recovery Appliance
– FS1 Flash Storage System
– ZFS Storage Appliance
Exadata history
• V1: 2008 – HP Oracle Database machine
• V2: 2009 – Sun hardware
• X2: 2010 – X2-2/X2-8
– 2011 - Exadata Storage Expansion Rack
• X3: 2012 – X3-2/X3-8
• X4: 2013 – X4-2
– 2014 – X4-8
• X5: 2015 – X5-2
• Great summary in http://flashdba.com/history-of-exadata/
Exadata flavors
• 2 or 8 CPU sockets on database servers (Xn-2/Xn-8)
• Full Rack – 8 database servers on Xn-2, 2 on Xn-8
– 14 storage servers
– 86.9TB of flash disk (X5), 44.8TB (X4), 22.4TB (X3), 5.3TB (X2)
– 200TB disk (X5/X4), 100TB (X3)
• Half Rack – only for Xn-2, half of full rack
• Quarter Rack – only for Xn-2, half of half rack – 3 storage servers
• Eighth Rack – since X3, only for Xn-2: same servers as Quarter Rack, with half the disks and flash
• http://docs.oracle.com/cd/E50790_01/doc/doc.121/e51953/intro.htm#DBMSO109
Exadata hardware
• Hardware (example from latest X5-2)
– PCI flash storage – up to 230TB per rack (4 cards per storage server)
– InfiniBand internal connectivity (40Gb/s)
• 263 GB/s per rack from SQL
– 2 to 19 DB servers per rack (2x18 core, 256GB RAM each)
• Up to 684 CPU cores for database
• Up to 14.6TB RAM per rack for database
– 3 to 18 storage servers per rack
• Up to 288 CPU cores for storage
• Software
– Oracle database (11.2 / 12.1)
– Oracle Enterprise Linux (5.9 / 6.6)
– ZDP infiniband protocol, iDB for storage access
Exadata Workload Optimized Configurations (X5)
http://www.oracle.com/us/corporate/events/datacenter/index.html
Exadata architecture
http://www.oracle.com/technetwork/database/exadata/exadata-technical-whitepaper-134575.pdf
Exadata disks
Physical Disk → LUN → Cell Disk → Grid Disks → ASM Diskgroup
Exadata licensing
• Exadata storage server licenses
• Oracle Database licenses
– plus additional options such as Real Application Clusters, Partitioning, Diagnostic and Tuning Packs, Multitenant
• Both vary depending on the model, as they are based on the number of cores.
• Exadata hardware has its separate costs
Exadata functionalities
• Smart flash cache
• Database cell offloading
– Queries processed at storage level (w/conditions)
– Uses smart scan and storage indexes (cell in memory)
• Hybrid columnar compression
• Optimized SQL protocol - iDB
– Exafusion in 12c – reimplementation of RAC Cache Fusion for direct calls from Database
• IO Resource Manager
• OVM support in 12c
Premier support and Platinum Services
Extra cost support: Premier and Premier for systems
http://www.oracle.com/us/support/library/platinum-services-policies-1652886.pdf
Platinum services:
• Remote fault monitoring, accelerated response and patch deployment (4 per year)
• Free for qualified customers who have:
– Certified platinum configuration – Matrix on oracle.com
– Support services contract for software and systems
– Oracle licences
– Gateway, VPN, connectivity, etc.
Changes for the DBA
• New components to manage
– Storage cells
– Infiniband switches
– KVM, PDU
• New utilities
– cellcli
– dcli
– dbmcli -- on 12c
– ILOM access (DB, Cells, IB Switches) - Web / ssh / IPMI / remote console
• New troubleshooting tools
– Exachk / sundiag / ILOM snapshots
• Monitoring and alerting
– OEM Exadata plugin
– ASR
Monitoring cells using v$ views
• New views
– V$CELL
– V$CELL_CONFIG
– V$CELL_STATE
– V$CELL_THREAD_HISTORY
– V$CELL_REQUEST_TOTALS
• New stats recorded on V$SYSSTAT
• New wait events
– cell multiblock physical read
– cell smart index scan
– cell%
http://docs.oracle.com/cd/E50790_01/doc/doc.121/e50471/monitoring.htm#SAGUG20487
• Columns added to existing v$ views
– V$BACKUP_DATAFILE
– V$SQLFN_METADATA
– V$SQL
– V$SQLAREA
– V$SQLSTATS
– V$SQLAREA_PLAN_HASH
– V$SQLSTATS_PLAN_HASH
cellcli sample output
[root@exa1cel03 ~]# cellcli
CellCLI: Release 11.2.3.3.0 - Production on Sun Jul 26 19:21:16 EDT 2015
Copyright (c) 2007, 2013, Oracle. All rights reserved.
Cell Efficiency Ratio: 945
CellCLI> list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
DATA_EXA1_CD_00_exa1cel03 ONLINE Yes
DATA_EXA1_CD_01_exa1cel03 ONLINE Yes
DATA_EXA1_CD_02_exa1cel03 ONLINE Yes
…
DBFS_DG_CD_02_exa1cel03 ONLINE Yes
DBFS_DG_CD_03_exa1cel03 ONLINE Yes
DBFS_DG_CD_04_exa1cel03 ONLINE Yes
…
RECO_EXA1_CD_00_exa1cel03 ONLINE Yes
RECO_EXA1_CD_01_exa1cel03 ONLINE Yes
RECO_EXA1_CD_02_exa1cel03 ONLINE Yes
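The listing above can also be checked programmatically before a cell is taken offline. The following is a minimal Python sketch, assuming the three-column output shown (name, asmmodestatus, asmdeactivationoutcome) has already been captured as text. The function name `unsafe_griddisks` is hypothetical and not part of any Oracle tool.

```python
def unsafe_griddisks(cellcli_output):
    """Return names of grid disks that are not ONLINE or not safe to deactivate.

    Expects the text printed by:
      list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
    """
    flagged = []
    for line in cellcli_output.splitlines():
        # Split into at most three fields so a multi-word outcome stays together
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # skip blank lines, elisions and anything unexpected
        name, mode_status, deactivation_outcome = fields
        if mode_status != "ONLINE" or deactivation_outcome.strip().lower() != "yes":
            flagged.append(name)
    return flagged


sample = """\
DATA_EXA1_CD_00_exa1cel03 ONLINE Yes
DATA_EXA1_CD_01_exa1cel03 ONLINE Yes
RECO_EXA1_CD_00_exa1cel03 ONLINE Yes
"""
print(unsafe_griddisks(sample))  # prints [] because every disk is ONLINE / Yes
```

An empty result means every listed grid disk reports a redundant copy online; any name returned deserves a closer look before shutting the cell down.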
dcli sample output
[root@exa1db01 ~]# cat /opt/oracle.SupportTools/onecommand/dbs_group
exa1db01
exa1db02
exa1db03
exa1db04
exa2db01
exa2db02
[root@exa1db01 ~]# cd /opt/oracle.SupportTools/onecommand
[root@exa1db01 onecommand]# dcli -l root -g dbs_group "who -b"
exa1db01: system boot 2015-03-12 10:38
exa1db02: system boot 2015-03-12 11:27
exa2db01: system boot 2015-01-19 01:28
exa2db02: system boot 2015-01-19 01:56
exa2db03: system boot 2015-02-10 14:38
exa2db04: system boot 2015-02-10 10:46
ILOM access – ssh example
[root@exa1db02 ~]# ssh exa1db03-ilom
Password:
Oracle(R) Integrated Lights Out Manager
Version 3.1.2.10.c r81825
Copyright (c) 2013, Oracle and/or its affiliates. All rights reserved.
-> show /SP/policy
/SP/policy
Targets:
Properties:
ENHANCED_PCIE_COOLING_MODE = disabled
HOST_AUTO_POWER_ON = disabled
HOST_LAST_POWER_STATE = enabled
Listing flash storage installed – from OS
[root@exa1cel02 ~]# lsscsi | grep -i ATA
[8:0:0:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdn
[8:0:1:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdo
[8:0:2:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdp
[8:0:3:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdq
[9:0:0:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdr
[9:0:1:0] disk ATA MARVELL SD88SA02 D21Y /dev/sds
[9:0:2:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdt
[9:0:3:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdu
[10:0:0:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdv
[10:0:1:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdw
[10:0:2:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdx
[10:0:3:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdy
[11:0:0:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdz
[11:0:1:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdaa
[11:0:2:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdab
[11:0:3:0] disk ATA MARVELL SD88SA02 D21Y /dev/sdac
Listing flash storage installed – from cellcli
CellCLI> list physicaldisk attributes name, makemodel, physicalrpm, physicalport, status where disktype=flashdisk
FLASH_1_0 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_1_1 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_1_2 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_1_3 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_2_0 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_2_1 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_2_2 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_2_3 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_4_0 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_4_1 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_4_2 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_4_3 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_5_0 "Sun Flash Accelerator F40 PCIe Card" failed
FLASH_5_1 "Sun Flash Accelerator F40 PCIe Card" failed
FLASH_5_2 "Sun Flash Accelerator F40 PCIe Card" failed
FLASH_5_3 "Sun Flash Accelerator F40 PCIe Card" failed
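In the listing above all four failed modules sit on the same card. A small Python sketch of how such output could be grouped per card follows. It assumes the names follow a FLASH_<card>_<module> pattern as in the listing and that the status is the last word on each line. The function name `failed_flash_by_card` is made up for this illustration.

```python
from collections import defaultdict


def failed_flash_by_card(cellcli_output):
    """Map card number -> flash modules whose status is not 'normal'."""
    failed = defaultdict(list)
    for line in cellcli_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].startswith("FLASH_"):
            continue  # ignore anything that is not a flash disk row
        name, status = fields[0], fields[-1]
        if status != "normal":
            # Assumption: the second name component identifies the card
            card = name.split("_")[1]
            failed[card].append(name)
    return dict(failed)


sample = """\
FLASH_4_3 "Sun Flash Accelerator F40 PCIe Card" normal
FLASH_5_0 "Sun Flash Accelerator F40 PCIe Card" failed
FLASH_5_1 "Sun Flash Accelerator F40 PCIe Card" failed
"""
print(failed_flash_by_card(sample))  # prints {'5': ['FLASH_5_0', 'FLASH_5_1']}
```

When every module of one card is reported, the card itself is the likely replacement unit, which is useful context to add to the SR.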
Public documentation available
• Oracle® Exadata Storage Server Software User's Guide
http://docs.oracle.com/cd/E50790_01/doc/doc.121/e50471/toc.htm
• Oracle® Exadata Database Machine Maintenance Guide
http://docs.oracle.com/cd/E50790_01/doc/doc.121/e51951/toc.htm
• Several working examples on Oracle Learning Library (search for “Database Machine”):
https://apexapps.oracle.com/pls/apex/f?p=44785:2::FORCE_QUERY::2%2CCIR%2CRIR:P2_TAGS:Database+Machine
• Arup Nanda series: Oracle Exadata Commands Reference, June 2011
http://www.oracle.com/technetwork/articles/oem/exadata-commands-intro-402431.html
Exadata monitoring
• OEM using plugin
– New pages with all the information
– All Exadata components are monitored and emails are sent when thresholds are crossed – as usual
– OEM 12c Exadata discovery cookbook for configuration
http://www.oracle.com/technetwork/oem/exa-mgmt/em12c-exadata-discovery-cookbook-1662643.pdf
• Auto Service Request (ASR)
– Automatically creates an SR on support.oracle.com when a failure is detected
– Well-known issues that require maintenance are answered within seconds, with links to support notes
– After initial configuration, we receive an email notification for each SR created; it does not need user interaction as OEM does
OEM Exadata plugin
ASR – sample email from failure detected
Oracle ASR: Service Request 3-19771251173 Created
May 26
Service Request: 3-19771251173
Oracle Auto Service Request (ASR) has created a Service Request (SR) for the following ASR asset
Hostname: exa1cel02
Serial#: 1234FNN0A0
Please login to My Oracle Support to see the details of this SR. My Oracle Support can also be used to make any changes to the SR or to provide additional information.
The Oracle Auto Service Request documentation can be accessed on http://oracle.com/asr.
Please use My Oracle Support https://support.oracle.com for assistance.
ASR – sample SR on MOS
ASR configuration
• Optionally done by Oracle under Platinum support
• ASR server external to Exadata
• MOS account must have Administrator role or Admin Assets Access privilege
• Each Exadata node/IB switch must be configured (SNMP Traps)
• Assets must be accepted on MOS under each CSI
• Notification based on SNMP messages generated by ILOMs
– If ASR server is down, messages are lost
– On Solaris it can use another protocol to avoid loss
ASR configuration and usage
• ASR documentation
http://www.oracle.com/technetwork/systems/asr/documentation/index.html
• Auto Service Request Installation and Operations Guide
https://docs.oracle.com/cd/E37710_01/install.41/e18475/ch1_asr_overview.htm#ASRUD108
• Oracle Auto Service Request (ASR) (Doc ID 1185493.1)
• How To Manage and Approve Pending ASR Assets In My Oracle Support (Doc ID 1329200.1)
• Engineered Systems ASR Configuration Check via ASREXACHK (Doc ID 1450112.1)
ASR installation is easy
ASR Manager server requires:
• connectivity to the Internet using HTTPS
• network connectivity to Exadata assets, ILOM, and eth0 from ASR manager server
• JDK 7 (JDK 1.7.0_13) or later
• rpm-build package
Installation on Linux
export JAVA_HOME=/usr/java/jdk1.8.0_25/
export PATH=$JAVA_HOME/bin:$PATH:/opt/asrmanager/bin
export CLASSPATH=.
rpm -i asrmanager.5.0.2-20141215170108.rpm
/opt/asrmanager/bin/asr register
<enter MOS user and password>
/opt/asrmanager/bin/asr test_connection
/opt/asrmanager/bin/asr start
ASR configuration
• Configure each Exadata component to report to ASR Manager server
– Storage servers, Database nodes and Infiniband switches
– IB switch version and serial# should be validated, as they may be updated
• Activate Nodes on the ASR Manager
– If ILOM auto-activation did not occur, it should be activated manually
• Verify all nodes are visible on the ASR Manager
• Complete the registration on MOS, approving the ASR activation
• Make sure packets from DB nodes to ASRM use eth0
ASR configuration - example
• ASR manager server: 10.20.30.123
• Validate current configuration of storage server:
[root@exa1db02 ~]# dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "cellcli -e list cell attributes snmpsubscriber"
exa1cel01: ((host=exa1db01.acme.com,port=1830,community=public),(host=exa1db02.acme.com,port=1830,community=public),(host=exa2db02.acme.com,port=3872,community=public),(host=exa2db03.acme.com,port=3872,community=public))
…
ASR configuration - example
– port should be the agent listener port
– cells report to OEM agent on each DB node
• Modify previous output adding ASR manager host
[root@exa1db01 ~]# dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "cellcli -e alter cell snmpsubscriber=\
\(\(host=\'exa1db01.acme.com\',port=3872,community=public\),\
\(host=\'exa1db02.acme.com\',port=3872,community=public\),\
\(host=\'exa2db01.acme.com\',port=3872,community=public\),\
\(host=\'exa2db02.acme.com\',port=3872,community=public\),\
\(host=\'exa2db03.acme.com\',port=3872,community=public\),\
\(host=\'exa2db04.acme.com\',port=3872,community=public\),\
\(host=\'10.20.30.123\',port=162,community=public,type=ASR\)\)"
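Since the command above re-lists the existing subscribers together with the new ASR entry, typing the value by hand is error prone. The sketch below shows one way to assemble the unescaped value in Python, following the format of the command above. The function name and parameters are hypothetical, and the result would still need shell escaping before being passed through dcli as in the example.

```python
def snmp_subscriber_value(agent_hosts, agent_port, asr_host,
                          asr_port=162, community="public"):
    """Build the snmpsubscriber value: one entry per OEM agent host plus the ASR manager."""
    entries = [
        "(host='{0}',port={1},community={2})".format(host, agent_port, community)
        for host in agent_hosts
    ]
    # The ASR manager entry is the only one carrying type=ASR
    entries.append(
        "(host='{0}',port={1},community={2},type=ASR)".format(asr_host, asr_port, community)
    )
    return "(" + ",".join(entries) + ")"


value = snmp_subscriber_value(
    ["exa1db01.acme.com", "exa1db02.acme.com"], 3872, "10.20.30.123"
)
print("alter cell snmpsubscriber=" + value)
```

Generating the value from the current list of agent hosts keeps the existing OEM subscribers in place while adding the ASR manager.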
ASR configuration - example
[root@asrm]~# asr status
ASR Manager (pid 10794) is RUNNING.
[root@asrm]~#
[root@asrm]~# asr list_asset
IP_ADDRESS HOST_NAME SERIAL_NUMBER ASR PROTOCOL SOURCE PRODUCT_NAME
--------------- ---------------- ------------------- -------- --------- -------------- ------------------------------------
10.102.100.25 exa1sw-ib3 1234ABC-1234R114WY Enabled SNMP ILOM Sun Datacenter InfiniBand Switch 36
10.102.100.24 exa1sw-ib2 1234ABC-1234R114XY Enabled SNMP ILOM Sun Datacenter InfiniBand Switch 36
10.102.100.18 exa1cel01-ilom 2233BTT0C1 Enabled SNMP ILOM SUN FIRE X4270 M2 SERVER
10.102.100.20 exa1cel03-ilom 2233BTT0CB Enabled SNMP ILOM SUN FIRE X4270 M2 SERVER
10.102.100.16 exa1db01-ilom 2254QUI0RJ Enabled SNMP ILOM SUN FIRE X4170 M2 SERVER
10.102.100.17 exa1db02-ilom 2254QUI0U3 Enabled SNMP ILOM SUN FIRE X4170 M2 SERVER
10.102.100.49 exa2sw-ibs0 BT00123455 Enabled SNMP ILOM Sun Datacenter InfiniBand Switch 36
10.102.100.51 exa2sw-ibb0 BT00123458 Enabled SNMP ILOM Sun Datacenter InfiniBand Switch 36
10.102.100.50 exa2sw-iba0 BT00123459 Enabled SNMP ILOM Sun Datacenter InfiniBand Switch 36
10.102.100.11 exa1db01 2254QUI0RJ Enabled SNMP,HTTP EXADATA-SW,ADR SUN FIRE X4170 M2 SERVER
10.102.100.12 exa1db02 2254QUI0U3 Enabled SNMP,HTTP EXADATA-SW,ADR SUN FIRE X4170 M2 SERVER
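To verify that all nodes are visible and active, the listing above can be scanned for assets whose ASR column is anything other than Enabled. This is a rough Python sketch that assumes the whitespace-separated column layout shown above; `assets_not_enabled` is a hypothetical helper, not an ASR Manager command.

```python
def assets_not_enabled(list_asset_output):
    """Return (host_name, asr_state) for every asset whose ASR column is not 'Enabled'."""
    flagged = []
    for line in list_asset_output.splitlines():
        # The product name may contain spaces, so keep it as one trailing field
        fields = line.split(None, 6)
        if len(fields) < 6:
            continue
        if fields[0] == "IP_ADDRESS" or set(fields[0]) == {"-"}:
            continue  # header and separator rows
        host_name, asr_state = fields[1], fields[3]
        if asr_state != "Enabled":
            flagged.append((host_name, asr_state))
    return flagged


sample = """\
IP_ADDRESS      HOST_NAME        SERIAL_NUMBER       ASR      PROTOCOL  SOURCE  PRODUCT_NAME
--------------- ---------------- ------------------- -------- --------- ------- ------------------------
10.102.100.25   exa1sw-ib3       1234ABC-1234R114WY  Enabled  SNMP      ILOM    Sun Datacenter InfiniBand Switch 36
10.102.100.18   exa1cel01-ilom   2233BTT0C1          Enabled  SNMP      ILOM    SUN FIRE X4270 M2 SERVER
"""
print(assets_not_enabled(sample))  # prints [] since both assets are Enabled
```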
ASR configuration - example
• Activate Nodes on the ASR Manager
asr activate_asset -i [Node ILOM IP]
asr activate_exadata -i [Node IP] -h exa1cel01 -l [Node ILOM IP]
[root@asrm] asr activate_asset -i 10.105.200.17
exa2sw-iba0.acme.com : 1 service tags
Successfully submitted activation for the asset
Host Name: exa2sw-iba0
IP Address: 10.105.200.17
Serial Number: BC0001234
The e-mail address associated with the registration id for this asset's ASR Manager will receive an e-mail highlighting the asset activation
status and any additional instructions for completing activation.
Please use My Oracle Support http://support.oracle.com to complete the activation process.
The Oracle Auto Service Request documentation can be accessed on http://oracle.com/asr.
• For IB switches, an empty rule should be added using ILOM:
spsh
show /SP/alertmgmt/rules/[NUMBER]
set /SP/alertmgmt/rules/[NUMBER] type=snmptrap level=minor destination=10.20.30.123 snmp_version=2c community_or_username=public
ASR configuration - example
-> show 4
/SP/alertmgmt/rules/4
Targets:
Properties:
community_or_username = public
destination = 0.0.0.0
destination_port = 0
email_custom_sender = (none)
email_message_prefix = (none)
event_class_filter = (none)
event_type_filter = (none)
level = disable
snmp_version = 1
testrule = (Cannot show property)
type = snmptrap
View from MOS
• Assets are listed under Systems tab
• Hardware serial# identifies each component
– One for the Exadata machine groups them all
• CSI includes each
• SR is created for a specific CSI
View from MOS - assets
View from MOS – user privileges over assets
Maintenance
• Software updates
– OS / DB / Switches - Patching
– OS / DB / Switches - Upgrades
– OS / DB / Switches - Configuration change
• Hardware upgrade
• Preventive tasks
– on site health checks (after the second year with Premier/Platinum support)
– EOL parts are replaced: RAID HBA Batteries and Energy Storage Modules (ESM) in flash cards
– It does not include patching or upgrading
• Failed components
– hard drives
– flash cards
– Infiniband switch riser
– network cables
Maintenance – only planned?
• No SPOF, many redundant parts
• External issues can cause unplanned outage
– Electricity – been there
– All the usual unplanned failures: flooding, earthquakes, etc.
Maintenance – all Xn have same failures?
• Different parts in newest models
• Different configuration options
• Example with Flash Cards
Model           X2         X3         X4         X5
Flash card      F20 PCIe   F40 PCIe   F80 PCIe   F160 NVMe PCIe
Energy storage  Battery    Capacitor  Capacitor  Capacitor
Battery must be replaced every 3 years
Maintenance procedures
• Rolling fashion
– no outage required
– One server at a time
– Cells need to rebalance disks. Process for each cell is:
• Turn off grid disks
• Patch cell
• Turn on grid disks
• Wait for ASM rebalance to finish – time depends on activity
– Total time is more than double that of the full-outage procedure
Maintenance procedures
• Rolling fashion
– Watch out for bug 16788832 - ORA-27609: SMART I/O FAILED DUE TO A NETWORK ERROR TO THE CELL AFTER SHUTDOWN. Patchset available, fixed on 11.2.0.4
– Normal redundancy ASM: a disk failure during the maintenance will bring the system down, recoverable through backups
– High redundancy ASM: two disk failures will have the same effect
Maintenance procedures
• With complete outage
– services are shut down before starting
– cells are patched in parallel
• no need to rebalance disks
– Total time is less than half of the rolling procedure
Exadata Maintenance - Patches
• Quarterly Full Stack Download Patch (QFSDP)
– single patch for OS + Firmware + drivers
– storage and compute nodes
– full outage option faster than the rolling option
• Quarterly DB patch / Bundle Patch
– DB + GI + diskmon
– Rolling
– Includes latest PSU
– Platinum Services: 4 per year remotely done by Oracle (w/restrictions)
• PSU – classical – DB requires outage
– BP must be installed first
Exadata Maintenance - Patches
Components:
– Node firmware
– Operating system
– GI and RDBMS binaries
– Infiniband Switches
– Others: KVM, PDU
Infiniband switch patches are not cumulative: intermediate patches, if any, must be applied first
Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1)
Exadata Maintenance - Patches
• Storage server patches
– applied with patchmgr - binary included with the patch
– runs from compute node (DB)
– uses dcli utility
• Compute nodes are patched with cells
– Updating key software components on database hosts to match those on the cells (Doc ID 1284070.1)
– OS updated using yum repository, it can be local
• More resources:
http://www.pythian.com/blog/upgrade-exadata-to-11-2-0-3/
http://www.pythian.com/blog/exadata-patching-overview/
Exadata bundle patch - overview
1) Download and copy patch files to all servers
– dcli makes it easier
2) Prerequisites check
– Cell: ./patchmgr -cells /opt/oracle.SupportTools/onecommand/cell_group -patch_check_prereq -rolling
– Switch: ./patchmgr -ibswitches -upgrade -ibswitch_precheck
– Database: ./dbnodeupdate.sh -u -l 17809253_112330_Linux-x86-64.zip -v
3) Upgrade opatch to latest version (MOS patch 6880880)
4) Blackout involved targets on OEM
5) Single One-Off rolling patch to apply to database homes prior to Bundle Patch (example 17854520)
– time consuming depending on the number of Oracle Homes installed
– Example: 4 database homes on each database server, 6 database servers => database patch to be applied 24 times
6) Run rolling patch to Cell Servers - estimate 1:30h per cell
7) Run patch to Infiniband switches - 1:30h per switch
8) Run rolling patch to Database Servers - 1:30h per server
9) Run rolling patch to GI instances
10) Run rolling patch to DB instances - per each server and Oracle home
Half rack = 4 db servers, 7 cell servers, 2 Infiniband switches => insane amount of hours
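To put a number on that last line, the per-component estimates from steps 6 to 8 can be added up. The Python sketch below is only a back-of-the-envelope calculation: it assumes the 1:30h figure applies equally to every cell, switch and database server, and it leaves out steps 5, 9 and 10, for which no per-run estimate is given. The function names are made up for this example.

```python
HOURS_PER_COMPONENT = 1.5  # the 1:30h estimate quoted in steps 6 to 8


def rolling_patch_hours(cells, ib_switches, db_servers,
                        hours_each=HOURS_PER_COMPONENT):
    """Hours for steps 6-8 when components are patched one at a time."""
    return (cells + ib_switches + db_servers) * hours_each


def one_off_patch_runs(db_servers, homes_per_server):
    """Number of times the one-off database patch of step 5 has to be applied."""
    return db_servers * homes_per_server


# Half rack: 7 cells, 2 switches, 4 database servers
print(rolling_patch_hours(cells=7, ib_switches=2, db_servers=4))  # prints 19.5

# Example from step 5: 4 database homes on each of 6 database servers
print(one_off_patch_runs(db_servers=6, homes_per_server=4))  # prints 24
```

Even without the GI and database home patches, a half rack comes to roughly 19.5 hours of rolling work under these assumptions.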
Exadata Maintenance – Replacing parts
• Two types: customer replaceable units (CRU) and field replaceable units (FRU)
– CRUs are the customer's responsibility
• Oracle Support takes care of it and sends the bill
• List of replaceable parts on all Xn servers
http://docs.oracle.com/cd/E50790_01/doc/doc.121/e51951/app_fru.htm#DBMMN21100
• Examples to see in detail:
– RAID HBA Batteries
– Flash disks
Replacing parts - procedure
• Through an SR
– Automatically created for failures if using ASR
– Automatically created by Oracle for preventive maintenance
• We run checks and upload results to SR
– sundiag, exachk, ILOM snapshots
– Be careful to include only current files to avoid misunderstandings
• Oracle Support identifies the problem and creates a field task for the activity
• We propose a time
• An Oracle Field Engineer (OFE) is assigned
• We communicate with OFE
– Define details: rolling, servers and schedule
– Review the procedure – usually a MOS note
– Set expectations – we are responsible for the systems
• We get datacenter access granted for OFE
• Oracle Support gets the new parts delivered to DC or OFE
• Communicate with OFE at scheduled date and work together
Replacing parts – checks - sundiag
[root@exa1cel02 ~]# /opt/oracle.SupportTools/sundiag.sh
Oracle Exadata Database Machine - Diagnostics Collection Tool
Gathering Linux information
Skipping ILOM collection. Use the ilom or snapshot options, or login to ILOM
over the network and run Snapshot separately if necessary.
driveTool Version 1.30
Library loaded for MegaRAID SAS Controller.
…
Generating diagnostics tarball and removing temp directory
==============================================================================
Done. The report files are bzip2 compressed in /tmp/sundiag_exa1cel02_1152FMM0C0_2015_05_25_07_32.tar.bz2
==============================================================================
Replacing parts – checks - exachk
• Oracle Exadata Database Machine exachk or HealthCheck [ID 1070954.1]
• Original version installed on /opt/oracle.SupportTools/exachk
• Download latest version from MOS.
• From DB node:
./exachk -a -o verbose
./exachk -clusternodes exa2db01,exa2db02 -excludeprofiles storage,switch
./exachk -clusternodes exa1db01,exa1db02 -cells exa1cel01,exa1cel02,exa1cel03,exa1cel04,exa1cel05,exa1cel06,exa1cel07 -ibswitches
export RAT_ORACLE_HOME=/u01/app/oracle/product/11.2.0.3
./exachk -localonly -excludeprofile storage,switch
Replacing parts – checks – ILOM Snapshot
• How to run an ILOM Snapshot on a Sun/Oracle X86 System (Doc ID 1448069.1)
[root@exa2db02 ~]# ssh exa2cel02-ilom
Password:
Oracle(R) Integrated Lights Out Manager
Version 3.1.2.12.c r81826
Copyright (c) 2013, Oracle and/or its affiliates. All rights reserved.
-> set /SP/diag/snapshot dataset=normal
Set 'dataset' to 'normal'
-> set /SP/diag/snapshot dump_uri=sftp://root:<password>@10.10.10.74/temp
Set 'dump_uri' to 'sftp://root:<password>@10.10.10.74/temp'
-> cd /SP/diag/snapshot
/SP/diag/snapshot
-> show
/SP/diag/snapshot
Targets:
Properties:
dataset = normal
dump_uri = (Cannot show property)
encrypt_output = false
result = Collecting data into sftp://root:*****@10.10.10.74/tmp/exa2cel02-ilom_2419EZ419H_2015-04-13T17-23-51.zip
Snapshot Complete.
Done.
Exa2db02 IP: 10.10.10.74
Replacing parts examples
1. failing Flash disks
2. failing hard disks
3. proactive RAID HBA Batteries
4. troubleshooting server not powering up
Example - replace failing Flash disks
Host=exa2db03.acme.com
Target type=Oracle Exadata Storage Server
Target name=exa1cel02.acme.com
Categories=Fault
Message=Flash disk failed. Status : FAILED Manufacturer : Sun Model Number : Flash Accelerator F20 PCIe Card Size : 23GB Serial Number : 1039M04E85 Firmware : D21Y Slot Number : PCI Slot: 1; FDOM: 2 Cell Disk : FD_02_exa1cel02 Grid Disk : Not configured Flash Cache : Present Flash Log : Present
Severity=Critical
Event reported time=Feb 12, 2015 3:48:21 AM PDT
Target Lifecycle Status=Production
Line of Business=ExaProd_Grp
Location=Production_DC
Operating System=Linux
Platform=x86_64
Associated Incident Id=12345
Associated Incident Status=New
Associated Incident Owner=
Associated Incident Acknowledged By Owner=No
Associated Incident Priority=None
Associated Incident Escalation Level=0
Event Type=Metric Alert
Event name=Cell_Generated_Alert:alerttype
Notification Count=1
Metric Group=Cell Generated Alert
Metric=Alert Type
Metric value=Stateful
Key Value=A3C12C4EF7E4FB97480CF3HBA1471EA4
Key Column 1=Alert Name
Key Column 1 Value=Hardware
Key Column 2=Alert Sequence
Key Column 2 Value=63
Example - replace failing Flash disks
Previous OEM alert followed by many others, and ASR creates an SR.
At scheduled date when the Field Engineer brings the part for replacement:
1) Put a blackout on OEM to avoid pages when powering off the affected cell
Watch out for Bug 18297754 - "ALERTLOGADR ALERTS OCCURING IN BLACKOUT PERIOD GET REPORTED WHEN BLACKOUT ENDS":
alerts arrive together after blackout ends using OMS 12.1.0.3 / DB 11.2.0.3
2) Before shutting down cell
CellCLI> alter cell led on
Disks to be offline should have redundant copy online (asmdeactivationoutcome=YES)
CellCLI> list griddisk attributes name, asmmodestatus , asmdeactivationoutcome
3) Shutdown cell
CellCLI> alter cell shutdown services all
[root@exa1cel02 ~]# shutdown -h now
Example - replace failing Flash disks
4) Oracle Field Engineer replaces the flash disk
5) Engineer to power up cell - cell services start automatically
Validate flash disks now have normal status
CellCLI> list physicaldisk
20:0 E2WXF9 normal
...
FLASH_1_0 1219M0E48A normal
...
CellCLI> alter cell led off
6) The operation is complete once the ASM rebalance finishes
CellCLI> list griddisk attributes name, asmmodestatus
DATA_EXA1_CD_00_exa1cel02 SYNCING
DATA_EXA1_CD_01_exa1cel02 SYNCING
DATA_EXA1_CD_10_exa1cel02 ONLINE
…
[grid@exa1db02 ~]# SQL> select * from gv$asm_operation;
7) Remove OEM blackout
Example - replace failing hard disks
Also detailed MOS notes to replace failing hard drives:
– Note 1390836.1 for predictive failures
– Note 1386147.1 for hard failures
• Oracle ASM disks associated with the grid disks on the physical drive are automatically dropped - Pro-Active Disk Quarantine
– If cell also goes offline, disks are not dropped for DISK_REPAIR_TIME
– Hard failures drop the disk with FORCE option and ASM rebalance starts to restore data redundancy
– Predictive failures trigger an ASM rebalance to relocate the data to other disks, and we should wait for it to complete before replacement.
• We should identify the physical disk
CellCLI> LIST PHYSICALDISK WHERE diskType=HardDisk AND status like failed DETAIL
Replacing RAID HBA Batteries - overview
Example for X2-2 half rack (4 db nodes, 7 storage nodes):
• HBA backup battery units
– 11: one per DB and Storage node.
– Protects all internal drives connected to each RAID HBA
– Must be replaced every two years
– Node needs to be restarted
– Since Exadata X3 with image 12.1.2.1.2: remote mounted batteries, no node restart needed
• Energy Storage Module (ESM) in PCI flash cards
– 28 in storage nodes, four per node
– Protects the DRAM cache
– F20 PCIe card – battery must be replaced every four years (X2 and older models)
– Since X3 new cards do not use batteries
• X3 uses F40 PCIe card, F80 on X4, F160 on X5
https://docs.oracle.com/cd/E19682-01/E21358/z40002bc1401289.html
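The counts quoted for the X2-2 half rack follow directly from the server counts. A short Python sketch of that arithmetic, assuming one RAID HBA battery per server and one ESM per flash card as stated above (the function name is illustrative only):

```python
def battery_counts(db_nodes, storage_nodes, flash_cards_per_cell=4):
    """Return (RAID HBA batteries, flash card ESMs) for a given rack layout."""
    hba_batteries = db_nodes + storage_nodes            # one per DB and storage node
    flash_esms = storage_nodes * flash_cards_per_cell   # four flash cards per storage node
    return hba_batteries, flash_esms


# X2-2 half rack: 4 db nodes, 7 storage nodes
print(battery_counts(db_nodes=4, storage_nodes=7))  # prints (11, 28)
```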
Replacing RAID HBA Batteries - overview
Rolling or full outage operation
• Restart DB nodes
[root@exa1db02 ~]# crsctl stop crs
[root@exa1db02 ~]# shutdown -y -h now
• Field engineer starts up the server, we validate
[root@exa1db02 ~]# crsctl check crs
[root@exa1db02 ~]# crsctl stat res -t
• Restart cell nodes
– Similar to previous procedure for replacing failing Flash disk, but disks should be offlined before shutdown and onlined after
CellCLI> alter griddisk all inactive -- active to activate
GridDisk DATA_EXA1_CD_00_exa2cel03 successfully altered
GridDisk DATA_EXA1_CD_01_exa2cel03 successfully altered
GridDisk DATA_EXA1_CD_02_exa2cel03 successfully altered
Troubleshooting server not powering up
• Login to server ILOM console
ssh root@exa1db01-ilom
show faulty
cd /SP/faultmgmt
start shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y
faultmgmtsp> fmadm faulty
------------------- ------------------------------------ -------------- --------
Time UUID msgid Severity
------------------- ------------------------------------ -------------- --------
2014-02-04/18:45:49 c6ac83c9-bd36-c984-85ff-c940887f4925 SPX86-8001-VY Major
Fault class : fault.security.enclosure-open
FRU : /SYS/SP
(Part Number: unknown)
(Serial Number: unknown)
Description : A chassis intrusion failure has occurred.
Response : The chassis-wide service required LED will be illuminated.
Impact : Server is immediately powered off and the service processor will operate in a degraded mode.
Action : The administrator should review the ILOM event log for additional information pertaining to this diagnosis. Please refer to the Details section of the Knowledge Article for additional information.
Troubleshooting server not powering up
• Clear the fault using UUID part
faultmgmtsp> fmadm repair baab83b7-bd3a-b784-8aff-b740887f472a
• Fault should be cleared
faultmgmtsp> fmadm faulty
No faults found
Questions?
@ncalerouy
http://www.linkedin.com/in/ncalero
References
• Oracle Exadata Database Machine System Overview 12cR1
http://docs.oracle.com/cd/E50790_01/doc/doc.121/e51953/intro.htm#DBMSO109
• Oracle Learning Library - topics for Exadata:
https://apexapps.oracle.com/pls/apex/f?p=44785:2::FORCE_QUERY::2%2CCIR%2CRIR:P2_TAGS:Database+Machine
• Exadata Smart Flash Cache Features
http://www.oracle.com/technetwork/database/exadata/exadata-smart-flash-cache-366203.pdf
• Flash storage models
http://www.oracle.com/us/products/servers-storage/storage/flash-storage/f20/overview/index.html
• Exadata Storage Server Software User's Guide
http://docs.oracle.com/cd/E50790_01/doc/doc.121/e50471/toc.htm