Copyright © 2018, Oracle and/or its affiliates. All rights reserved. |
Exadata Maximum Availability Architecture: Best Practices and Recommendations
Michael Nowak, MAA Solutions Architect, Oracle Server Technologies
Eric Bezille, Chief Technologist, Oracle Cloud Infrastructure, Oracle France
October 23, 2018
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Program Agenda
1. Exadata & Maximum Availability Architecture (MAA) Overview
2. Exadata MAA: New Features and Best Practices *
3. Exadata MAA: Sneak Peek Into Future Best Practices and Features
4. BONUS! Customer HA Success Stories from Oracle France
* Includes a sampling of some new lifecycle operations best practices and expectations
Exadata MAA: Overview
Cost of Downtime
https://devops.com/real-cost-downtime/
For the Fortune 1000, the average total cost of unplanned application downtime per year is $1.25 billion to $2.5 billion.
The average hourly cost of an infrastructure failure is $100,000 per hour. The average cost of a critical application failure per hour is $500,000 to $1 million.
British Airways lost £80 million: https://www.reuters.com/article/us-iag-ceo/british-airways-ceo-puts-cost-of-recent-it-outage-at-80-million-pounds-idUSKBN1961H2
Beyond the monetary cost, company reputation and customer loyalty are also affected.
Example from CNBC today, 10/23: "Amazon's move off Oracle caused Prime Day outage in big Ohio warehouse, internal report says"
Oracle Maximum Availability Architecture (MAA)
• Applying 25+ years of lessons learned solving the toughest HA problems around the world
• Solutions to reduce downtime for planned & unplanned outages for enterprise customers with the most demanding workloads and requirements
• Service level oriented architectures
• MAA integrated Engineered Systems and Cloud
• Continuous feedback into products and Cloud
• Books, white papers, blueprints
High Availability, Disaster Recovery and Data Protection
[Diagram: production copy protected by database replication. Protect your data; maintain your service level.]
https://oracle.com/goto/maa
Oracle MAA Availability Tiers
Availability Service Levels for Unplanned and Planned Maintenance
• BRONZE (Dev, Test, Prod): Backup and Recovery. Local & remote backups; Zero Data Loss Backup to the Cloud use case.
• SILVER (Prod/Departmental): Bronze + Zero Downtime High Availability. Active/active database clustering (RAC) + backup & recovery.
• GOLD (Business Critical): Silver + Zero Data Loss HA and DR. Remote replication, zero data loss, reduced downtime; Zero Data Loss DR to the Cloud use case.
• PLATINUM (Mission Critical): Gold + Zero Downtime Maintenance / Migration. Zero Downtime GoldenGate Cloud Svc.; advanced capabilities for zero application outages and zero data loss.
http://www.oracle.com/technetwork/database/availability/maa-reference-architectures-2244929.pdf
Exadata: Built-in High Availability
• Redundant Database Servers
– Active-active highly available clustered servers
– Hot-swappable power supplies and fans
– Redundant power distribution units
– Integrated HA software/firmware stack
• Redundant Network
– Redundant 40Gb/s InfiniBand connections and switches
– Client access using HA bonded networks
– Integrated HA software/firmware stack
• Redundant Storage Grid
– Data mirrored across storage servers
– Redundant, non-blocking I/O paths
– Integrated HA software/firmware stack
Happy Birthday Exadata MAA
10 Years, Countless HA Features and Best Practices, World Class HA
https://www.oracle.com/technetwork/database/features/availability/exadata-maa-best-practices-155385.html
High Availability for Maximum Application Uptime
"Exadata and SuperCluster both achieve AL4 fault tolerance in a Maximum Availability Architecture configuration"
FIVE NINES (5x9): 99.999% availability. A new gold standard.
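Five-nines arithmetic is worth making concrete: the downtime budget implied by an availability level follows directly from the percentage. A minimal sketch (365-day year assumed):

```python
# Downtime budget implied by an availability percentage (365-day year).
def downtime_minutes_per_year(availability_pct: float) -> float:
    minutes_per_year = 365 * 24 * 60
    return (1 - availability_pct / 100) * minutes_per_year

print(round(downtime_minutes_per_year(99.999), 2))  # 5.26 -- "five nines"
print(round(downtime_minutes_per_year(99.9), 1))    # 525.6 -- "three nines"
```

So AL4 fault tolerance at 99.999% leaves roughly five minutes of unplanned downtime per year.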
Exadata MAA Evolution: Customer vs. Oracle Responsibilities
• On-Premises
– Customer: infrastructure management; architecture; configuration, tuning; database management; lifecycle operations; application performance
– Oracle: blueprints; feedback into products & features
• On-Premises Exadata
– Customer: infrastructure management; architecture; database management; configuration, tuning; lifecycle operations; application performance
– Oracle: blueprints; Exadata is the best integrated MAA DB platform
• Database / Exadata Cloud
– Customer: architecture; database management (tooling); configuration, tuning; lifecycle operations (tooling); application performance
– Oracle: owns and manages the best integrated MAA DB platform; cloud automation for provisioning and lifecycle operations
• Autonomous Database
– Customer: choosing the SLA policy; application performance
– Oracle: owns and manages infrastructure; policy-driven deployments; MAA integrated cloud; fully automated Self-Driving, Self-Securing, Self-Repairing Database
Exadata MAA: New Features and Best Practices
Smart Handshake For Storage Server Shutdown
• Clear communication to the diskmon process on the database servers when storage is shut down prevents errors and application blackouts. Your service level will smile!
[Diagram: database tier and storage tier coordinating during shutdown]
Grid Infrastructure 12c / Exadata 12.1+
Summary: Smart Handshake For Storage Server Shutdown
• Feature Oracle Has Provided: graceful database tier handling during storage server shutdown
• Best Practices You Can Implement (Tips!): use graceful shutdown procedures. Related: use patchmgr for storage server software updates, as it ensures grid disks are handled properly.
• Service Level Impact Expectations: no blackouts when the storage tier is shut down for maintenance; no false positive errors/alerts
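A graceful storage server shutdown can be sketched with CellCLI as below. This is a hedged outline of the flow, not the full MOS procedure; confirm against the current Exadata documentation, and prefer patchmgr for software updates since it performs these checks for you.

```shell
# 1) Confirm ASM can tolerate taking this cell's grid disks offline;
#    proceed only if every disk reports asmdeactivationoutcome = Yes
cellcli -e "LIST GRIDDISK ATTRIBUTES name, asmmodestatus, asmdeactivationoutcome"

# 2) Inactivate grid disks so diskmon on the database servers is informed
cellcli -e "ALTER GRIDDISK ALL INACTIVE"

# 3) Stop cell services, then power off or patch the storage server
cellcli -e "ALTER CELL SHUTDOWN SERVICES ALL"
```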
Database Tier IO Cancel (Oracle Grid Infrastructure & Database 18c)
[Diagram: database tier issuing IOs to the storage tier. Slow IO? Hung IO? Sick disk? Undiscovered hardware/software issue? Existing storage-tier protections: Cell IO Latency Capping, IO hang detection/repair, disk confinement. New database-tier protection: Database Tier IO Latency Capping keeps the IOs pumping.]
Summary: Database Tier IO Cancel
• Feature Oracle Has Provided: protection from uncommon storage tier stalls/hangs
• Best Practices You Can Implement (Tips!): nothing! Completely transparent.
• Service Level Impact Expectations: stable service level achieved through IO redirection on stalls/hangs
Smart OLTP Caching (Oracle Grid Infrastructure 19c)
Under the covers of this quarter rack with high redundancy:
• Cell with the primary mirror populated in the super low latency DRAM cache
• Cell with the secondary mirror populated in the low latency flash cache
• Cell with the tertiary mirror populated on high latency hard disk
• DBWR evicts the buffer while freeing up space in the database buffer cache
Smart OLTP Caching: Maintaining SLAs During Storage Failures
• SaaS application reading data from the primary mirror
• Storage failure on the cell containing the primary mirror
• No problem: just retrieve data from the secondary mirror on flash with low latency
• The tertiary mirror continues to provide protection, just in case it's one of those days
• After the storage failure is repaired and the cell caching state is deemed healthy again, return to the primary mirror
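The failover order above can be sketched as a toy read path: try the lowest-latency healthy mirror first, then fall back. This is purely illustrative, not Oracle code; the cell names and latency figures are invented.

```python
# Hypothetical latencies (microseconds) per cache tier -- invented figures.
LATENCY_US = {"dram": 20, "flash": 200, "disk": 5000}

def read_block(mirrors, healthy):
    """Return (cell, latency) of the first healthy mirror.

    mirrors: list of (cell, tier) ordered primary -> secondary -> tertiary.
    healthy: set of cells currently serving IO.
    """
    for cell, tier in mirrors:
        if cell in healthy:
            return cell, LATENCY_US[tier]
    raise IOError("no healthy mirror available")

mirrors = [("cel01", "dram"), ("cel02", "flash"), ("cel03", "disk")]
print(read_block(mirrors, {"cel01", "cel02", "cel03"}))  # ('cel01', 20)
print(read_block(mirrors, {"cel02", "cel03"}))           # ('cel02', 200)
```

The point of the real feature is that the fallback read still lands in a cache tier, so latency degrades gently rather than dropping to spinning disk.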
Summary: Smart OLTP Caching
• Features Oracle Has Provided: low latency IO even after storage failures; smart access to repaired storage only after it is fully cached
• Best Practices You Can Implement (Tips!): nothing! This feature works automatically and transparently.
• Service Level Impact Expectations: no cache misses on storage failure or repair = no performance related service level interruptions
Dynamic HugePages (Oracle Database 19c)
• 18c and lower: the buffer cache uses hugepages only if the operating system (sysctl -w ...) and the database (alter system set ...) are manually kept in sync.
• 19c: the DISM background process manages hugepages dynamically; use oedacli for the related lifecycle operations.
Summary: Dynamic HugePages
• Features Oracle Has Provided: Dynamic HugePages remove the need for manual configuration; oedacli automates many Exadata lifecycle operations including database drop/add
• Best Practices You Can Implement (Tips!): in 18c and lower, set use_large_pages=ONLY. In 19c, set use_large_pages=AUTO_ONLY. Use oedacli for lifecycle operations including database creation.
• Service Level Impact Expectations: stable service level achieved through proper use of hugepages
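Concretely, the settings referenced above look roughly like this. The hugepage count is a hypothetical value sized for one SGA (2 MB pages assumed); treat this as a sketch, not a sizing recommendation.

```
# OS side, 18c and lower: reserve hugepages manually
# (24576 x 2 MB pages ~ 48 GB -- hypothetical sizing)
sysctl -w vm.nr_hugepages=24576

-- Database side, 18c and lower: refuse to start unless the SGA fits in hugepages
ALTER SYSTEM SET use_large_pages=ONLY SCOPE=SPFILE;

-- Database side, 19c: let the instance manage hugepages dynamically
ALTER SYSTEM SET use_large_pages=AUTO_ONLY SCOPE=SPFILE;
```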
Smart Rebalance For High Redundancy Diskgroups (Oracle Grid Infrastructure 19c)
• Problem: rebalance runs out of space after disk failure (ORA-15041)
• Solution for 18c and lower: reserve free space, and run exachk, which reports on compliance with the MAA best practice:
– 15% free with a normal or high redundancy diskgroup having < 5 Exadata cells (GI versions 12.2 and 18c)
– 9% free with a normal or high redundancy diskgroup having 5 or more Exadata cells (GI versions 12.2 and 18c)
• Solution for 19c with high redundancy diskgroups: smart rebalance, with 0% free space required
– If there is not enough space to rebalance at the time of failure, offline the disk
– Upon replacement, efficiently repopulate it from partner disks automatically
– Like they say in New Jersey at the gas station: "Fill 'er up!"
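The reservation rule above can be captured in a small helper; the thresholds come straight from the slide, and the version check is a simplified sketch.

```python
def required_free_pct(gi_version: str, redundancy: str, num_cells: int) -> float:
    """Minimum ASM diskgroup free space (%) to reserve so a rebalance
    after disk failure does not hit ORA-15041, per the MAA rule above."""
    if gi_version.startswith("19") and redundancy == "high":
        return 0.0                         # 19c smart rebalance: no reservation
    return 15.0 if num_cells < 5 else 9.0  # GI 12.2 / 18c rule

print(required_free_pct("18.0", "high", 3))    # 15.0
print(required_free_pct("18.0", "normal", 7))  # 9.0
print(required_free_pct("19.3", "high", 3))    # 0.0
```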
Summary: Smart Rebalance For High Redundancy Diskgroups
• Feature Oracle Has Provided: elimination of the need to reserve free space for rebalance when using high redundancy
• Best Practices You Can Implement (Tips!): use high redundancy diskgroups. Use the MAA Exadata best practice power limit of 4. If desired, the ASM REPLACE DISK issued by Exadata auto disk management can be monitored in gv$asm_operation.
• Service Level Impact Expectations: high redundancy and seamless repair without risk of out-of-space errors = no service level impact
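The monitoring tip above looks like the following in SQL*Plus. The diskgroup, disk, and path names are hypothetical; on Exadata the REPLACE DISK is normally issued by auto disk management, so you would typically only run the query.

```sql
-- Watch a running rebalance / replace operation
SELECT inst_id, group_number, operation, state, power, est_minutes
  FROM gv$asm_operation;

-- The kind of statement Exadata auto disk management issues after a
-- failed disk is physically replaced (names and path are illustrative):
ALTER DISKGROUP data
  REPLACE DISK data_cd_03_cel01
  WITH 'o/192.168.10.1/DATA_CD_03_cel01'
  POWER 4;
```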
Exadata OVM Best Practices
• Out of scope for today, but see our recently updated recommendations (https://www.oracle.com/technetwork/database/availability/exadata-ovm-2795225.pdf) covering the following:
– Use Cases
– Exadata OVM Software Requirements
– Exadata Isolation Considerations
– Exadata OVM Sizing and Prerequisites
– Exadata OVM Deployment Overview
– Exadata OVM Administration and Operational Life Cycle
– Migration, HA, Backup/Restore, Upgrading/Patching
– Monitoring, Resource Management
Exadata: MAA Exadata Lifecycle Operations
• Software Maintenance
• Compute Elasticity
• Storage Elasticity
Recommended Update Schedule (Software Maintenance)

Frequency     Database / Grid          Exadata
3-12 months   Release Update (RU)      Sustaining Release
1-4 years     Annual Feature Release   Feature Release

• All software maintenance for Exadata: MOS note 888828.1
• Quality maintenance readiness with exachk: version recommendation; critical issues exposure report
• Late-breaking issues: MOS Alerts for Hot Topics
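A cadence like the one in the table is easy to self-audit. The sketch below is an illustrative helper (not an Oracle tool); the component names, window lengths, and dates are assumptions chosen to match the table's ranges.

```python
from datetime import date

# Outer bound of each recommended window from the table above (days).
CADENCE_DAYS = {"release_update": 365, "feature_release": 4 * 365}

def overdue(kind: str, last_applied: date, today: date) -> bool:
    """True when a component has gone longer than its recommended window."""
    return (today - last_applied).days > CADENCE_DAYS[kind]

print(overdue("release_update", date(2017, 1, 15), date(2018, 10, 23)))  # True
print(overdue("feature_release", date(2016, 5, 1), date(2018, 10, 23)))  # False
```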
Drop/Add Cells and Diskgroup Resizing
• Best Practices (Tips!)
– Use the Exadata best practice default asm_power_limit of 4 (total across clusters)
– For drop cell, follow the MAA best practice for space reserved to restore redundancy during rebalance (i.e. avoid ORA-15041)
– Keep the number of diskgroups per cluster to a minimum (e.g. DATA and RECO), both for simplicity and to avoid rebalances getting queued (only one rebalance can run per db node at a time)
– Leverage oedacli to simplify/automate the process
– Run exachk
• Service Level Impact Expectations
– Zero to low impact, because data cached in the original cell's flash cache prior to the operation is proactively cached in the new cell's flash cache during rebalance
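In SQL terms, a drop-cell operation following the tips above might look like this sketch. The diskgroup and failgroup names are hypothetical, and on Exadata the grid-disk side is best driven through oedacli rather than hand-typed ASM statements.

```sql
-- Confirm the cluster-wide rebalance ceiling (MAA best practice: 4)
SHOW PARAMETER asm_power_limit;

-- Drop one cell's disks from a diskgroup ahead of removing the cell
-- (diskgroup and failgroup names are illustrative)
ALTER DISKGROUP data DROP DISKS IN FAILGROUP cel05 REBALANCE POWER 4;

-- Only one rebalance runs per node at a time, so watch for queuing here
SELECT group_number, operation, state, power FROM gv$asm_operation;
```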
Real-World MAA Project Example: Add Cell
[Figure: cell single block read histograms from the Exadata AWR report, comparing the baseline histogram with the histogram during the add-cell operation]
Exadata MAA Solution Integration
• On-Premises Exadata: all Exadata MAA configuration best practices baked in; Exadata MAA operational best practices implemented by the customer
• Exadata Cloud / Autonomous DB: all Exadata MAA configuration best practices baked in; all Exadata MAA operational best practices baked in
Exadata MAA: Sneak Peek Into Future Best Practices and Features
Customer HA Success Stories from Oracle France
Utilities Customer
• Customer Context
– Critical BPEL application managing the deployment of 10+ million devices (and growing), with a high level of database IO load
– Deadline for the next device deployment at high risk
– High SLA, 24/7
[Diagram: BPEL front-end servers on the front-end network; 4x (32 cores, 396 GB RAM) Oracle DB RAC nodes attached to high-end SAN storage]
[Chart: device count growing to 40 million, a x4 increase]
Utilities Customer: Solution
• Solution
– Replace the existing high-end x86 servers + SAN storage with Exadata
– 1 production Exadata, 1 DRP Exadata, 1x dev/integration, 1x performance tests
– Configuration: 5x compute nodes, 3x Extreme Flash, 3x High Capacity, high redundancy
– Active Data Guard
[Diagram: OSB and BPEL front end; 5x (48 cores, 768 GB RAM) Oracle DB RAC nodes; 3x Extreme Flash (EF) and 3x High Capacity (HC) storage cells on InfiniBand, with routed external links (R.EXT.)]
Utilities Customer: Results
• Results
– Deployment in 8 weeks
– Database improvements exposed bottlenecks at the application layer => moved the front end from virtualized servers to bare metal
– Implemented backup to the FRA: HC = 4 GB/sec, EF = 12 GB/sec
– Tested against hyperconverged infrastructure: Exadata delivered x2 to x11 better performance
– Customer sleeps well
Architecture
[Diagram: two datacenters > 50 km apart, connected by Active Data Guard over routed external links (R.EXT.)
• PROD: OSB + BPEL tiers, 5x (48 cores, 768 GB RAM) Oracle DB RAC nodes on InfiniBand; Extreme Flash storage cells, triple mirror (44 TB flash (EF) + 106 TB disk (HC)); local FRA backup
• PRE-PROD / local failover and PERF./TESTS: High Capacity storage cells, triple mirror; backup
• DRP: OSB + BPEL stack with Extreme Flash cells; backup]
Fashion Customer
• Customer Context
– Securing the SAP ERP database backend
– Providing scalability and performance for the enterprise DWH
– Securing the POS database backend for the new 24/7 retail shop
– Providing capabilities to evolve gradually to a Database as a Service solution
• Solution
– Dual datacenter deployment
– Upgrade their existing Exadata deployment to support full rolling upgrade
– Introduce virtualization
– Implement a 5-day backup in the FRA
– Configuration: 3x compute nodes, 3x High Capacity cells, high redundancy
• Results
– Improvement of the patching procedure
– Improvement in the ability to restore very quickly
– Ability to segregate workloads and options
– Customer has room for growth
[Diagram: Production site (3 compute nodes, 3 HC cells hosting Prod1, Prod2, DR Test) and a DRP / non-prod site (2 compute nodes, 3 HC cells)]