Scalable, Fault-Tolerant NAS for Oracle - The Next Generation Kevin Closson Chief Software Architect Oracle Platform Solutions, Polyserve Inc


Scalable, Fault-Tolerant NAS for Oracle - The Next Generation

Kevin Closson

Chief Software Architect

Oracle Platform Solutions, Polyserve Inc


The Un-“Show Stopper”

• NAS for Oracle is not “file serving”; let me explain…
• Think of GbE NFS I/O paths from the Oracle servers to the NAS device that are totally direct, with no VLAN-style indirection.
  – In these terms, NFS over GbE is just a protocol, as is FCP over FibreChannel
  – The proof is in the numbers.
    • A single dual-socket/dual-core AMD server running Oracle10gR2 can push through 273MB/s of large I/Os (scattered reads, direct path read/write, etc.) over triple-bonded GbE NICs!
    • Compare that to the infrastructure and HW costs of 4Gb FCP (~450MB/s, but you need 2 cards for redundancy)
  – OLTP over modern NFS with GbE is not a challenging I/O profile.
• However, not all NAS devices are created equal by any means
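
As a rough sanity check on the 273MB/s figure, the sketch below compares it with the nominal line rate of three bonded GbE links. It assumes 1 Gbit/s equals 125 MB/s per link (decimal units, before any protocol overhead); the function name is illustrative and not from the presentation.

```python
# Assumption: nominal GbE payload ceiling of 125 MB/s per link (decimal MB).
GBE_MB_PER_SEC = 1_000_000_000 / 8 / 1_000_000


def bonded_utilization(measured_mb_s, links):
    """Fraction of the nominal bonded line rate that a measured throughput represents."""
    return measured_mb_s / (links * GBE_MB_PER_SEC)


if __name__ == "__main__":
    util = bonded_utilization(273, links=3)
    print(f"nominal: {3 * GBE_MB_PER_SEC:.0f} MB/s, utilization: {util:.0%}")
```

Under that assumption, a single server is already driving roughly three quarters of what three bonded links can nominally carry.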


Agenda

• Oracle on NAS

• NAS Architecture

• Proof of Concept Testing

• Special Characteristics


Oracle on NAS


Oracle on NAS

• Connectivity
  – Fantasyland Dream Grid™ would be nearly impossible with a FibreChannel switched fabric, for instance:
    • 128 nodes == 256 HBAs, 2 switches each with 256 ports just for the servers, then you have to work out storage paths
• Simplicity
  – NFS is simple. Anyone with a pulse can plug in cat-5 and mount filesystems.
  – MUCH MUCH MUCH MUCH MUCH simpler than:
    • Raw partitions for ASM
    • Raw, OCFS2 for CRS
    • Oracle Home? Local Ext3 or UFS?
    • What a mess
  – Supports shared Oracle Home, shared APPL_TOP too
  – But not simpler than a Certified Third Party Cluster Filesystem, but that is a different presentation
• Cost
  – FC HBAs are always going to be more expensive than NICs
  – Ports on enterprise-level FC switches are very expensive


Oracle on NAS

• NFS Client Improvements
  – Direct I/O
    • open(,O_DIRECT,) works with Linux NFS clients, Solaris NFS client, likely others
• Oracle Improvements
  • init.ora filesystemio_options=directIO
  • No async I/O on NFS, but look at the numbers
  • Oracle runtime checks mount options
    • Caveat: It doesn’t always get it right, but at least it tries (OSDS)
  • Don’t be surprised to see Oracle offer a platform-independent NFS client
• NFS V4 will have more improvements
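
To make the open(,O_DIRECT,) bullet concrete, here is a minimal sketch of a direct (page-cache bypassing) read from Python on Linux. It is not how Oracle issues its I/O. The 4096-byte block size, the function name, and the fallback to buffered I/O when the filesystem rejects O_DIRECT are all assumptions made for illustration.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; O_DIRECT needs aligned buffer, offset and size


def read_block(path, offset=0, size=BLOCK):
    """Read one block, using O_DIRECT when the platform and filesystem allow it.

    Returns (data, used_direct). Falls back to a buffered read on failure.
    """
    direct_flag = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    attempts = ((os.O_RDONLY | direct_flag, bool(direct_flag)), (os.O_RDONLY, False))
    for flags, is_direct in attempts:
        try:
            fd = os.open(path, flags)
        except OSError:
            continue  # e.g. the filesystem refuses O_DIRECT at open time
        buf = mmap.mmap(-1, size)  # anonymous mmap memory is page-aligned
        try:
            n = os.preadv(fd, [buf], offset)
            return buf[:n], is_direct
        except OSError:
            continue  # e.g. the direct read is rejected; try the buffered path
        finally:
            buf.close()
            os.close(fd)
    raise OSError(f"could not read {path}")
```

The page-aligned buffer is the important detail: with O_DIRECT the kernel transfers data straight into user memory, so an unaligned buffer is typically rejected.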


NAS Architecture


NAS Architecture

• Single-headed Filers

• Clustered Single-headed Filers

• Asymmetrical Multi-headed NAS

• Symmetrical Multi-headed NAS


Single Headed Filer Architecture


NAS Architecture: Single-headed Filer

[Diagram: a single-headed filer serving filesystems /u01, /u02, /u03 over a GigE network.]


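Oracle Servers Accessing a Single-headed Filer: I/O Bottleneck

[Diagram: Oracle database servers accessing filesystems /u01, /u02, /u03 on a single-headed filer, which is marked "I/O Bottleneck". Callout: a single one of these (a database server) has the same (or more) bus bandwidth as this (the filer)!]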


Oracle Servers Accessing a Single-headed Filer: Single Point of Failure

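[Diagram: Oracle database servers accessing filesystems /u01, /u02, /u03 on a single-headed filer. The filer is marked "Single Point of Failure"; the database tier is marked "Highly Available through failover-HA, DataGuard, RAC, etc."]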


Clustered Single-headed Filers


Architecture: Cluster of Single-headed Filers

[Diagram: clustered single-headed filers, one serving filesystems /u01 and /u02 and another serving /u03, with paths that become active after failover.]


Oracle Servers Accessing a Cluster of Single-headed Filers

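[Diagram: Oracle database servers accessing clustered single-headed filers, one serving filesystems /u01 and /u02 and another serving /u03, with paths that become active after failover.]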


Architecture: Cluster of Single-headed Filers

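[Diagram: Oracle database servers accessing clustered single-headed filers, one serving filesystems /u01 and /u02 and another serving /u03, with paths that become active after failover.]

What if /u03 I/O saturates this Filer?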


Filer I/O Bottleneck. Resolution == Data Migration

[Diagram: the same cluster of filers (/u01 and /u02 on one, /u03 on another, with paths active after failover), plus an additional filer serving a new filesystem /u04.]

Migrate some of the “hot” data to /u04


Data Migration Remedies I/O Bottleneck

[Diagram: the same cluster of filers (/u01 and /u02 on one, /u03 on another, with paths active after failover), plus an additional filer serving /u04. The /u04 filer is marked "NEW Single Point of Failure".]

Migrate some of the “hot” data to /u04


Summary: Single-headed Filers

• Cluster to mitigate S.P.O.F.
  – Clustering is a pure afterthought with filers
  – Failover times?
    • Long, really really long.
  – Transparent?
    • Not in many cases.
• Migrate data to mitigate I/O bottlenecks
  – What if the data “hot spot” moves with time? The Dog Chasing His Tail Syndrome
• Poor Modularity
• Expanded by pairs for data availability
• What’s all this talk about CNS?


Asymmetrical Multi-headed NAS Architecture


Asymmetrical Multi-headed NAS Architecture

[Diagram: Oracle database servers reach “pools of data” on a FibreChannel SAN through a SAN gateway made up of three active NAS heads and three NAS heads for failover.]

Note: Some variants of this architecture support M:1 Active:Standby, but that doesn’t really change much.


Asymmetrical NAS Gateway Architecture

• Really not much different than clusters of single-headed filers:

– 1 NAS head to 1 filesystem relationship

– Migrate data to mitigate I/O contention

– Failover not transparent

• But:

– More Modular

• Not necessary to scale up by pairs


Symmetric Multi-headed NAS


HP Enterprise File Services Clustered Gateway


Symmetric vs Asymmetric

[Diagram: two sets of three NAS heads serving the files /Dir1/File1, /Dir2/File2, and /Dir3/File3, contrasting the symmetric and asymmetric layouts. One set is labeled EFS-CG.]


Enterprise File Services Clustered Gateway Component Overview

• Cluster Volume Manager
  – RAID 0
  – Expand online
• Fully Distributed, Symmetric Cluster Filesystem
  – The embedded filesystem is a fully distributed, symmetric cluster filesystem
• Virtual NFS Services
  – Filesystems are presented through Virtual NFS Services
• Modular and Scalable
  – Add NAS heads without interruption
  – All filesystems can be presented for read/write through any/all NAS heads


EFS-CG Clustered Volume Manager

• RAID 0
  – LUNs are RAID 1, so this implements S.A.M.E.
• Expand online
  – Add LUNs, grow volume
• Up to 16TB
  – Single volume
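
The RAID 0 layer can be pictured as a simple address mapping from a logical volume block to a LUN and an offset within that LUN. The sketch below shows generic round-robin striping only. The stripe width of 256 blocks and the function name are assumptions, and the slides do not describe the actual on-disk layout used by the EFS-CG volume manager.

```python
def stripe_location(logical_block, num_luns, stripe_blocks=256):
    """Map a logical volume block to (lun_index, block_within_lun) for plain RAID 0.

    stripe_blocks is a hypothetical stripe width, chosen only for illustration.
    """
    stripe_number, offset_in_stripe = divmod(logical_block, stripe_blocks)
    lun_index = stripe_number % num_luns          # stripes rotate across the LUNs
    stripe_on_lun = stripe_number // num_luns     # how many full rotations came before
    return lun_index, stripe_on_lun * stripe_blocks + offset_in_stripe


if __name__ == "__main__":
    for block in (0, 256, 1024, 1030):
        print(block, stripe_location(block, num_luns=4))
```

Because consecutive stripes land on different LUNs, and each LUN is itself mirrored, large sequential I/O is spread across every spindle, which is the point of S.A.M.E. (stripe and mirror everything).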


The EFS-CG Filesystem

• All NAS devices have embedded operating systems and file systems, but the EFS-CG is:
  – Fully Symmetric
    • Distributed Lock Manager
    • No Metadata Server or Lock Server
  – General Purpose clustered file system
  – Standard C Library and POSIX support
  – Journaled with online recovery
• Proprietary format but uses standard Linux file system semantics and system calls, including flock() and fcntl() clusterwide
• Expand a single filesystem online up to 16TB, up to 254 filesystems in current release.
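
For readers less familiar with the locking calls named above, here is a small sketch of advisory flock() semantics as seen from Python on a local Linux filesystem. It only demonstrates the standard call; the claim that these semantics hold clusterwide is the presentation's, and the helper names are illustrative.

```python
import fcntl
import os


def try_exclusive_lock(fd):
    """Return True if an exclusive flock() was acquired without blocking."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


def demo(path):
    """Open the same file twice and show that the second lock attempt is refused."""
    a = os.open(path, os.O_RDWR)
    b = os.open(path, os.O_RDWR)  # a separate open file description
    try:
        first = try_exclusive_lock(a)    # lock is free, so this succeeds
        second = try_exclusive_lock(b)   # conflicts with the lock held via `a`
        fcntl.flock(a, fcntl.LOCK_UN)    # release the first lock
        third = try_exclusive_lock(b)    # now the second descriptor can lock
        return first, second, third
    finally:
        os.close(a)
        os.close(b)
```

On a symmetric cluster filesystem the two descriptors could belong to processes on different nodes, and the same conflict would need to be detected by the distributed lock manager.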


EFS-CG Filesystem Scalability


Scalability. Single Filesystem Export Using x86 Xeon-based NAS Heads (Old Numbers)

[Chart: NAS I/O Throughput (via NFS). X axis: Cluster Size (Nodes), i.e. NAS heads. Y axis: MegaBytes per Second (MB/s), 0 to 1,200. Values: 123, 246, 493, 739, 986, 1,084, and 1,196 MB/s at 1, 2, 4, 6, 8, 9, and 10 nodes. An annotation marks the approximate single-headed filer limit.]

# Servers   Total bytes (Mbytes)   Time (sec.)   Mbytes/Sec.   Gbits/Sec   Scale Factor   Scaling Coefficient
    1              16,384              133           123.19        0.96         1.00             100%
    2              32,768              133           246.38        1.92         2.00             100%
    4              65,536              133           492.75        3.85         4.00             100%
    6              98,304              133           739.13        5.77         6.00             100%
    8             131,072              133           985.50        7.70         8.00             100%
    9             147,456              136         1,084.24        8.47         8.80              98%
   10             163,840              137         1,195.91        9.34         9.71              97%

HP StorageWorks Clustered File System is optimized for both READ and WRITE performance.
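
The derived columns in the table above follow directly from the raw byte counts and elapsed times. The sketch below recomputes them, assuming the scale factor is throughput relative to the one-server run and the scaling coefficient is that factor divided by the server count; the names are illustrative.

```python
# (NAS heads, total MB transferred, elapsed seconds), taken from the table above.
ROWS = [
    (1, 16_384, 133),
    (2, 32_768, 133),
    (4, 65_536, 133),
    (6, 98_304, 133),
    (8, 131_072, 133),
    (9, 147_456, 136),
    (10, 163_840, 137),
]


def scaling(rows):
    """Return (heads, MB/s, scale factor, scaling coefficient in %) per row."""
    base = rows[0][1] / rows[0][2]  # single-head throughput is the baseline
    out = []
    for heads, total_mb, secs in rows:
        mb_s = total_mb / secs
        scale = mb_s / base
        out.append((heads, round(mb_s, 2), round(scale, 2), round(100 * scale / heads)))
    return out


if __name__ == "__main__":
    for row in scaling(ROWS):
        print(row)
```

The small drop at 9 and 10 heads comes entirely from the slightly longer elapsed times (136 and 137 seconds instead of 133).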


Virtual NFS Services

• Specialized Virtual Host IP

• Filesystem groups are exported through VNFS

• VNFS failover and rehosting are 100% transparent to the NFS client
  – Including active file descriptors, file locks (e.g. fcntl/flock), etc.


EFS-CG Filesystems and VNFS


[Diagram: Enterprise File Services Clustered Gateway. Four NAS heads present filesystems /u01, /u02, /u03, and /u04 to the Oracle database servers through virtual NFS services labeled vnfs1, vnfs1b, vnfs2b, and vnfs3b.]


EFS-CG Management Console


EFS-CG Proof of Concept


EFS-CG Proof of Concept

• Goals
  – Use Oracle10g (10.2.0.1) with a single high performance filesystem for the RAC database and measure:
    • Durability
    • Scalability
    • Virtual NFS functionality


EFS-CG Proof of Concept

• The 4 filesystems presented by the EFS-CG were:

– /u01. This filesystem contained all Oracle executables (e.g., $ORACLE_HOME)

– /u02. This filesystem contained the Oracle10gR2 clusterware files (e.g., OCR, CSS) and some datafiles and External Tables for ETL testing

– /u03. This filesystem was lower-performance space used for miscellaneous tests such as disk-to-disk backup

– /u04. This filesystem resided on a high-performance volume that spanned two storage arrays. It contained the main benchmark database


EFS-CG P.O.C. Parallel Tablespace Creation

• All datafiles created in a single exported filesystem

– Proof of multi-headed, single filesystem write scalability


EFS-CG P.O.C. Parallel Tablespace Creation

[Chart: Multi-headed EFS-CG Tablespace Creation Scalability, in MB/s. Single-head, single GigE path: 111. Multi-headed, dual GigE paths: 208.]


EFS-CG P.O.C. Full Table Scan Performance

• All datafiles located in a single exported filesystem

– Proof of multi-headed, single filesystem sequential I/O scalability


EFS-CG P.O.C. Parallel Query Scan Throughput

[Chart: Multi-headed EFS-CG Full Table Scan Scalability, in MB/s. Single-head, single GigE path: 98. Multi-headed, dual GigE paths: 188.]
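
A quick way to read the two bar charts (tablespace creation and full table scan) is as a speedup from adding a second head and a second GigE path. The sketch below computes that ratio from the published values; the dictionary layout and function name are illustrative.

```python
# MB/s as (single head with single GigE path, multi-headed with dual GigE paths).
RESULTS_MB_S = {
    "tablespace creation": (111, 208),
    "full table scan": (98, 188),
}


def speedup(single, dual):
    """Throughput ratio of the dual-path run over the single-path run."""
    return dual / single


if __name__ == "__main__":
    for test, (single, dual) in RESULTS_MB_S.items():
        print(f"{test}: {speedup(single, dual):.2f}x")
```

Both workloads come close to doubling while still targeting a single exported filesystem, which is the behavior these tests set out to show.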


EFS-CG P.O.C. OLTP Testing

• OLTP Database based on an Order Entry Schema and workload

• Test areas

  – Physical I/O Scalability under Oracle OLTP
  – Long Duration Testing


EFS-CG P.O.C. OLTP Workload Transaction Avg Cost

Oracle Statistics          Average Per Transaction
SGA Logical Reads          33
SQL Executions             5
Physical I/O               6.9 *
Block Changes              8.5
User Calls                 6
GCS/GES Messages Sent      12

* Averages with RAC can be deceiving; be aware of CR sends


EFS-CG P.O.C. OLTP Testing

[Chart: 10gR2 RAC Scalability on EFS-CG. X axis: RHEL4-64 RAC Servers. Y axis: Transactions per Second. Values: 650, 1246, 1773, and 2276 at 1, 2, 3, and 4 servers.]


EFS-CG P.O.C. OLTP Testing. Physical I/O Operations

[Chart: RAC OLTP I/O Scalability on EFS-CG. X axis: RHEL4-64 RAC Servers. Y axis: Random 4K IOps. Values: 5214, 8831, 11619, and 13743 at 1, 2, 3, and 4 servers.]
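
The RAC transaction and I/O charts can be summarized as scaling efficiency relative to a single server. This is a sketch under the assumption that perfect scaling means N servers deliver N times the one-server rate; the variable and function names are illustrative.

```python
TPS = [650, 1246, 1773, 2276]        # transactions per second at 1..4 RAC servers
IOPS = [5214, 8831, 11619, 13743]    # random 4K I/O operations per second at 1..4 servers


def efficiency(values):
    """Per-step ratio of the observed rate to perfect linear scaling of the first value."""
    base = values[0]
    return [v / (base * (i + 1)) for i, v in enumerate(values)]


if __name__ == "__main__":
    print("TPS :", [f"{e:.0%}" for e in efficiency(TPS)])
    print("IOPS:", [f"{e:.0%}" for e in efficiency(IOPS)])
```

Transaction throughput holds closer to linear than physical I/O does, which fits the earlier caution that per-transaction I/O averages under RAC can be deceiving.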


EFS-CG Handles all OLTP I/O Types Sufficiently—no Logging Bottleneck

[Chart: OLTP I/O by Type, in I/O Ops per Second. Redo writes: 893. Datafile writes: 5593. Datafile reads: 8150.]


Long Duration Stress Test

• Benchmarks do not prove durability

– Benchmarks are “sprints”

– Typically 30-60 minute measured runs (e.g., TPC-C)

• This long duration stress test was no benchmark by any means

– Ramp OLTP I/O up to roughly 10,000/sec

– Run non-stop until the aggregate I/O breaks through 10 Billion physical transfers

– 10,000 physical I/O transfers per second for every second of nearly 12 days
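
The "nearly 12 days" figure follows from the two numbers on this slide. A minimal check, assuming a constant rate of exactly 10,000 transfers per second:

```python
SECONDS_PER_DAY = 86_400


def days_to_reach(total_transfers, transfers_per_sec):
    """Days of continuous running needed to accumulate total_transfers at a fixed rate."""
    return total_transfers / transfers_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{days_to_reach(10_000_000_000, 10_000):.2f} days")
```

Ten billion transfers at that rate take one million seconds, a little over eleven and a half days.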


Long Duration Stress Test


Special Characteristics


Special Characteristics

• The EFS-CG NAS Heads are Linux Servers
  – Tasks can be executed directly within the EFS-CG NAS Heads at FCP speed:
    • Compression
    • ETL, data importing
    • Backup
    • etc.


Example of EFS-CG Special Functionality

• A table is exported on one of the RAC nodes
• The export file is then compressed on the EFS-CG NAS head:
  – CPU from the NAS head, instead of the database servers
    • The NAS heads are really just protocol engines. I/O DMAs are offloaded to the I/O subsystems. There are plenty of spare cycles.
  – Data movement at FCP rate instead of GigE
    • Offload the I/O fabric (NFS paths from servers to the EFS-CG)
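
As an illustration of the kind of task that could run on a NAS head, the sketch below compresses an export file in streaming fashion with Python's standard gzip module. The commands actually used in the presentation are not in the text, so the tool choice, the function name, and any path passed to it are assumptions.

```python
import gzip
import shutil


def compress_export(src_path, dst_path=None, level=6):
    """Gzip src_path into dst_path (default: src_path + '.gz') and return dst_path.

    Run on the NAS head, the reads and writes go straight to the SAN-attached
    filesystem instead of crossing the NFS paths from a database server.
    """
    dst_path = dst_path or src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # stream in 1 MB chunks
    return dst_path
```

A hypothetical call would look like `compress_export("/u03/exports/orders.dmp")`, where the path is made up for the example.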


Export a Table to NFS Mount


Compress it on the NAS Head


Questions and Answers


Backup Slide


EFS-CG Scales “Up” and “Out”

[Diagram: two EFS-CG NAS heads attached to the SAN through FibreChannel switches and to an Ethernet switch. Label: 3 GbE NFS paths, which can be triple bonded, etc.]