Design Scale-Out File Server Clusters in the Next Release of Windows Server
Claus Joergensen, Principal Program Manager, Microsoft Corporation
CDP-B325
Software Defined Storage
Application data storage on cost-effective, continuously available, high-performance SMB3 file shares backed by Storage Spaces
Disaggregated compute and storage for independent management and scale
1. Performance, Scale: SMB3 File Storage network
2. Continuous Availability and Seamless Scale Out with File Server Nodes
3. Elastic, Reliable, Optimized, Tiered Storage Spaces
4. Standard volume hardware for low cost
[Diagram: Hyper-V clusters connect over an SMB3 storage network fabric (1) to Scale-Out File Server cluster nodes (2) running Storage Spaces (3) on shared JBOD storage (4)]
Scale-Out File Server Clusters
Software Defined Storage – Storage Stack
Scale-Out File Server: access point for Hyper-V; scale-out data access; data access resiliency
Cluster Shared Volumes: single consistent namespace; fast failover
Storage Spaces: storage pooling; virtual disks; data resiliency
Hardware: standard volume hardware; fast and efficient networking; shared storage enclosures; SAS SSD; SAS HDD
[Diagram: a software-defined storage system of storage nodes exposing a Scale-Out File Server (\\FileServer\Share) over SMB, layered on Cluster Shared Volumes (C:\ClusterStorage), Storage Spaces virtual disks, and a storage pool built from shared JBOD storage]
Storage Spaces
Storage Spaces
- Storage virtualization
- Storage Pool: unit of aggregation, administration, isolation
- Storage Space: a virtual disk with resiliency and performance characteristics

Clustered Storage Pool
- Pool is read-write on one node, read-only on all other nodes
- Cluster infrastructure routes pool operations to the read-write node
- Automatic failover if the read-write node fails

Clustered Storage Space
- Physical disk resource, online on one node
- IO is routed to the node where the storage space is online (CSVFS)
- SMB client is redirected to the node where the storage space is online (SOFS)
- Automatic failover to another node
- Allocations are aware of fault domains (enclosures)
Interconnects and enclosures: shared SAS for clusters (SAS, SATA, USB for stand-alone)
Storage Spaces Reliability
Mirror resiliency
- 2-copy mirror tolerates one drive failure
- 3-copy mirror tolerates two drive failures
- Suitable for random I/O

Parity resiliency
- Lower-cost storage using LRC encoding
- Tolerates up to two drive failures
- Suitable for large sequential I/O

Enclosure awareness
- Tolerates the failure of an entire drive enclosure

Parallel rebuild
- Pseudo-random distribution weighted to favor less-used disks
- Reconstructed space is spread widely and rebuilt in parallel
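The parallel-rebuild behavior above can be sketched as a weighted pseudo-random placement. This is an illustrative Python model, not the actual Storage Spaces allocator; the disk names and the weighting function are invented for the sketch:

```python
import random

def pick_rebuild_targets(disk_used, extents, seed=0):
    """Distribute rebuilt extents across surviving disks, weighting the
    pseudo-random choice toward less-used disks (illustrative model of
    the parallel rebuild described above)."""
    rng = random.Random(seed)
    placement = {d: 0 for d in disk_used}
    for _ in range(extents):
        # Weight each disk by how far below total usage it sits, so
        # emptier disks are favored and the rebuild spreads widely.
        total = sum(disk_used.values())
        weights = [max(total - used, 1) for used in disk_used.values()]
        disk = rng.choices(list(disk_used), weights=weights)[0]
        placement[disk] += 1
        disk_used[disk] += 1  # the placed extent consumes capacity
    return placement

# Rebuild 1,000 extents; d1 is already heavily used, so it should
# receive far fewer of the reconstructed extents than the others.
targets = pick_rebuild_targets({"d1": 900, "d2": 100, "d3": 100, "d4": 100}, 1000)
```

Because every surviving disk receives some extents, many spindles write in parallel, which is why the rebuild throughput exceeds any single drive's speed.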
[Diagram: a storage pool of physical drives from SAS enclosures backing mirror and parity spaces, each mirror space keeping two data copies; on drive failure, data is rebuilt to multiple drives simultaneously using spare capacity]
Rebuild metrics (3 TB HDDs, 2-way, 4-column mirror space; source: internal testing, no foreground activity):
- Data rebuilt: 2,400 GB
- Time taken: 49 min
- Rebuild throughput: > 800 MB/s
Storage Spaces Tiering and WBC
Tiered Spaces leverage file system intelligence
- File system measures data activity at sub-file granularity
- Heat follows files
- Admin-controlled file pinning is possible

Data movement
- Automated promotion of hot data to the SSD tier
- Configurable scheduled task

Write-Back Cache (WBC)
- Helps smooth the effects of write bursts
- Uses a small amount of SSD capacity
- IO to the SSD tier bypasses the WBC
- Large IO bypasses the WBC

Complementary
- Together, the WBC and the SSD tier address data's short-term and long-term performance needs
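The tiering and WBC rules above can be condensed into a few lines of Python. This is an illustrative model only; the slab size, the "large IO" threshold, and the class names are assumptions for the sketch, not the real heat engine:

```python
from collections import Counter

SLAB = 1024 * 1024  # track heat per 1 MiB slab (illustrative granularity)

class HeatMap:
    """Illustrative model of sub-file heat tracking for tiered Spaces:
    IO activity accumulates per slab, and the scheduled task promotes
    the hottest slabs to the SSD tier."""
    def __init__(self, ssd_slabs):
        self.heat = Counter()
        self.ssd_slabs = ssd_slabs  # SSD tier capacity, in slabs

    def record_io(self, file, offset):
        self.heat[(file, offset // SLAB)] += 1

    def promote(self):
        # Promote the hottest slabs, up to the SSD tier's capacity.
        return [slab for slab, _ in self.heat.most_common(self.ssd_slabs)]

def goes_through_wbc(io_bytes, targets_ssd, large_io=256 * 1024):
    """WBC policy sketch: IO already bound for the SSD tier and large
    IO both bypass the write-back cache."""
    return not targets_ssd and io_bytes < large_io

hm = HeatMap(ssd_slabs=2)
for _ in range(10):
    hm.record_io("vm1.vhdx", 0)        # hot slab 0
hm.record_io("vm1.vhdx", 5 * SLAB)     # cold slab 5
hot = hm.promote()
```

The point of the sketch is the division of labor: the heat map serves long-term placement, while the WBC absorbs short random write bursts.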
[Diagram: a storage space with an SSD tier plus WBC holding hot data and an HDD tier holding cold data, on SAS SSDs and SAS HDDs; I/O activity accumulates heat at sub-file granularity]
Storage Virtualization and Reliability: summary

Performance
- Tiering: move data to the appropriate storage
- Write-Back Cache: buffer random writes on flash

Scalability
- Storage pools: up to 80 disks per pool, 4 pools per cluster, 480 TB per pool
- Storage Spaces: 64 storage spaces per pool

Availability
- Storage space: resiliency to disk and enclosure failures; parallel data rebuild

Operability
- NTFS: significant improvements
- Storage pools: aggregation, administration, isolation
- PowerShell / SMAPI management
Storage Spaces FAQ
Cluster-Wide File System (CSVFS)
CSVFS
- CSVFS is a clustered file system
- Enables all nodes to access common volumes
- Single consistent namespace
- Provides a layer of abstraction above the on-disk file system
- Application-consistent distributed backup
- Interoperability with backup, AV, BitLocker
- Supports NTFS and ReFS on-disk file systems
- Support for Storage Spaces

Transparent fault tolerance
- Does not require drive ownership changes on failover
- No dismounting and remounting of volumes
- Faster failover times (i.e., less downtime)

Workloads
- Hyper-V and SQL Server
- Scale-Out File Server
[Diagram: two storage nodes; on node 1, CSVFS sits above a CSV volume with the SMB server (CSV), CSV filter, and NTFS/ReFS over the space. IO flows as direct IO, metadata or file-system-redirected IO, or block-redirected IO between the nodes]
CSVFS components
CSV File System (CSVFS)
- Proxy file system on top of NTFS or ReFS
- Mounted on every node
- Decides direct IO vs. file-system-redirected IO

CSV Volume Manager
- Responsible for the creation of CSV volumes
- Direct IO for locally attached spaces; block-level IO redirect for non-locally attached spaces

CSV Filter
- Attaches to NTFS / ReFS for local clustered spaces
- Controls access to the on-disk file system
- Coordinates metadata operations
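The routing decision the three components make can be condensed into a toy dispatcher. A hedged Python sketch; the function name and return labels are invented for illustration:

```python
def route_io(op, space_attached_locally):
    """Illustrative CSVFS routing decision:
    - metadata operations are redirected to the coordinator node
    - data IO goes direct to disk when the space is locally attached
    - otherwise it is block-level redirected to an attached node."""
    if op == "metadata":
        return "fs-redirected"
    return "direct" if space_attached_locally else "block-redirected"
```

The design point: only metadata needs coordination, so the common case (data IO against a locally attached space) never leaves the node.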
CSVFS support for NTFS features and operations
CSVFS Zero Downtime CHKDSK
- Improved CHKDSK, with online scanning separated from offline repair
- With CSV, repair is also online

CHKDSK processing with CSV
- The cluster checks once a minute whether CHKDSK (spotfix) is required
- The cluster pauses the affected CSV file system and dismounts the underlying NTFS volume
- CHKDSK (spotfix) runs against only the affected files, for a maximum of 15 seconds
- The underlying NTFS volume is mounted and the CSV namespace is un-paused

If CHKDSK (spotfix) did not process all records
- The cluster waits 3 minutes before continuing
- Enables a large set of affected files to be processed over time

If the corruption is too large
- CHKDSK (spotfix) is not run and is marked to run at the next Physical Disk online
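The spotfix cycle above (bounded repair windows, with a pause between passes) can be modeled as a simple loop. Illustrative Python only; the per-pass record budget stands in for the 15-second time limit, and the 3-minute wait between passes is elided:

```python
def spotfix_cycle(corrupt_records, per_pass=40, max_passes=10):
    """Illustrative model of CSV online CHKDSK: each pass pauses the
    volume, repairs as many records as fit in the budget, un-pauses,
    and (in reality) waits 3 minutes before the next pass, so a large
    backlog is processed over time with only brief pauses."""
    passes = 0
    while corrupt_records > 0 and passes < max_passes:
        corrupt_records -= min(per_pass, corrupt_records)
        passes += 1
    return passes, corrupt_records

# 100 corrupt records at 40 per pass need three short repair windows.
passes, left = spotfix_cycle(100)
```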
Cluster-Wide File System: summary

Performance
- CSV block cache: 7x faster VDI VM boot time (avg.)
- Block-level I/O redirection

Scalability
- Single namespace: aggregates all file systems
- No more drive letters: leverages mount points

Availability
- Fault tolerance: fast failover on failures
- Zero downtime CHKDSK

Operability
- Shared access: all nodes can access volumes
- Simple management: manage from any node
Scale-Out File Server
SMB Transparent Failover
Planned and unplanned failovers with zero downtime

SMB client
- Server node failover is transparent to client-side applications
- Small IO delay during failover

SMB server
- Handles are always opened write-through
- Stores handle state in the Resume Key Filter (RKF) database

Resume Key
- Persists protocol server state to file
- Reconciles handle reconnects/replays with local file system state
- Protects file state during the reconnect window

Witness Service
- Clients are proactively notified of server node failures
- Clients can be instructed to switch server nodes
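A minimal sketch of the Resume Key idea in Python: the server persists handle state keyed by a resume key, so a reconnect after node failover can be reconciled with the prior state. The class and field names are invented for illustration and omit replay detection:

```python
class ResumeKeyDB:
    """Illustrative Resume Key Filter sketch: handle state is persisted
    (write-through in the real system) keyed by a client resume key."""
    def __init__(self):
        self.handles = {}

    def open(self, key, path):
        self.handles[key] = {"path": path, "fenced": False}
        return key

    def node_failed(self):
        # Handle state survives the failover; fence the files until the
        # owner reconnects or the reconnect window expires.
        for h in self.handles.values():
            h["fenced"] = True

    def resume(self, key):
        h = self.handles.get(key)
        if h is None:
            return None          # unknown key: no state to resume
        h["fenced"] = False      # owner is back; release the fence
        return h["path"]

db = ResumeKeyDB()
db.open("rk-1", r"\\fs\share\vm1.vhdx")
db.node_failed()
```

Fencing during the reconnect window is what protects the file from other openers while the original client replays its handles, which is why applications see only a small IO delay.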
[Diagram: a Hyper-V host (VMs, VHD parser, SMB client with Witness client, RVSS provider, VSS, DPM) connects through multiple NIC pairs (SMBD and TCP over LBFO/NDK, with a bandwidth limiter) to SOFS node 1, which hosts the SMB server, Witness service, RVSS service, DNN and SOFS cluster resources, Resume Key DB, CSVFS, and a shared VHD on NTFS/ReFS over a LUN/space; the Witness client also connects to the Witness service on SOFS nodes 2+]
SMB Scale-Out
Scaling out for throughput and management

Scale out
- Active-active SMB shares accessible through all nodes simultaneously
- Distributed NetName (DNN)
- Physical node IP addresses in DNS
- Client round-robins through the IPs (multiple parallel connects)
- Clients are redirected to the "optimal" server node (CSV / storage space owner)

Management / backup
- Simple management
- Extensive PowerShell
- Fan-out requests
- Remote VSS (MS-FSRVP)
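The DNN client behavior (round-robin through the node IPs returned by DNS, then redirect to the owning node) can be sketched as follows. Illustrative Python; real clients also race multiple parallel connects and handle unreachable nodes:

```python
class DnnClient:
    """Illustrative Distributed NetName client: DNS returns every
    node's IP, the client round-robins across them for new connections,
    and the server may redirect it to the storage space owner."""
    def __init__(self, node_ips):
        self.node_ips = node_ips
        self._next = 0

    def initial_node(self):
        ip = self.node_ips[self._next % len(self.node_ips)]
        self._next += 1
        return ip

    def connect(self, csv_owner):
        first = self.initial_node()
        # Redirect step: move to the owner node so IO is direct.
        return csv_owner if first != csv_owner else first

c = DnnClient(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
```

Round-robin spreads connection load across the active-active nodes, while the redirect keeps data IO on the node that owns the CSV, avoiding block-level redirection between nodes.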
SMB Multichannel
Network throughput and fault tolerance

Bandwidth aggregation and link fault tolerance
- IO is balanced over active interfaces
- Replays operations on alternate channels when a channel fails
- RSS aware, LBFO aware, NUMA aware

Zero configuration
- Client-driven NIC discovery and best-pair(s) selection
- Transparent fallback to less desirable interfaces in failure cases
- Periodic re-evaluation and transparent 'upgrade'
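Multichannel's "best pair(s)" selection can be approximated as a ranking over interface capabilities. A Python sketch under the assumption that RDMA-capable beats RSS-capable beats plain interfaces, with speed as a tiebreaker; the NIC records are invented:

```python
def best_interfaces(nics):
    """Illustrative SMB Multichannel selection: rank interfaces by
    capability (RDMA, then RSS, then link speed) and use all
    interfaces in the best capability class together, so bandwidth
    is aggregated across them."""
    ranked = sorted(nics,
                    key=lambda n: (n["rdma"], n["rss"], n["speed"]),
                    reverse=True)
    best = ranked[0]
    return [n["name"] for n in ranked
            if (n["rdma"], n["rss"]) == (best["rdma"], best["rss"])]

nics = [
    {"name": "eth0",  "rdma": False, "rss": True, "speed": 10},
    {"name": "rdma1", "rdma": True,  "rss": True, "speed": 40},
    {"name": "rdma2", "rdma": True,  "rss": True, "speed": 40},
]
chosen = best_interfaces(nics)
```

Periodic re-evaluation simply means re-running a selection like this; if a better interface comes up, the client transparently "upgrades" to it.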
SMB Direct
Low network latency and low CPU consumption

SMB Direct
- Provides a sockets-like layer over NDK / RDMA
- Low latency: a combination of the fabric and skipping the TCP stack
- Supports RoCE, iWARP, and InfiniBand
- Efficient: cycles/byte comparable with DAS

Results
- > 1 million 8K IOPS demonstrated by Violin Memory
- 16 GB/s large IOs (multiple InfiniBand links) with low CPU

Also utilized by (SMB Direct + SMB Multichannel):
- Hyper-V Live Migration, with a bandwidth limiter to avoid starving LM traffic
- CSVFS for internal traffic
Scale-Out File Server: summary

Performance
- SMB Direct: low latency and minimal CPU usage
- SMB performance: optimized for server app IO profiles

Scalability
- SMB Scale-Out: active/active file shares
- SMB Multichannel: network bandwidth aggregation

Availability
- SMB Transparent Failover: node failover transparent to VMs
- SMB Multichannel: network fault tolerance

Operability
- SMB PowerShell: manage from any node
- SMB analysis: performance counters
FAQ
- Support for IW workloads? Not recommended.
- CSV cache size? SOFS is not CPU- or memory-bound, so go big on the cache (64 GB).
- Using SOFS for a file share witness? Yes you can; see http://blogs.msdn.com/b/clustering/archive/2014/03/31/10512457.aspx
- How many nodes? Commonly 2-4 nodes, usually gated by SAS storage connectivity.
- How to evaluate performance? Do not use file copy! Do performance measurements inside a virtual machine; after all, that is the workload.
- Should I disable NetBIOS? Yes!
Scale-Out File Server is for Hyper-V and SQL Server

Technology area | Feature | General use file server | Scale-Out File Server
SMB | SMB Continuous Availability | Yes | Yes
SMB | SMB Multichannel | Yes | Yes
SMB | SMB Direct | Yes | Yes
SMB | SMB Encryption | Yes | Yes
SMB | SMB Transparent Failover | Yes (1) | Yes
File system | NTFS | Yes | NA
File system | Resilient File System (ReFS) | Yes | NA
File system | Cluster Shared Volume File System (CSV) | NA | Yes
File management | BranchCache | Yes | No (4)
File management | Data Deduplication (Windows Server 2012) | Yes | No (4)
File management | Data Deduplication (Windows Server 2012 R2) | Yes | Yes
File management | DFS Namespace (DFSN) root server | Yes | No (4)
File management | DFS Namespace (DFSN) folder target server | Yes | Yes
File management | DFS Replication (DFSR) | Yes | No (4)
File management | File Server Resource Manager (screens and quotas) | Yes | No (4)
File management | File Classification Infrastructure | Yes | No (4)
File management | Dynamic Access Control (claim-based access, CAP) | Yes | No (4)
File management | Folder Redirection | Yes | Yes (2)
File management | Offline Files (client-side caching) | Yes | Yes (5)
File management | Roaming User Profiles | Yes | Yes (2)
File management | Home Directories | Yes | Yes (2)
File management | Work Folders | Yes | No (4)
NFS | NFS Server | Yes | No (4)
Applications | Hyper-V | Yes (3) | Yes
Applications | Microsoft SQL Server | Yes (3) | Yes

Notes:
(1) Requires CA enabled on shares.
(2) Not recommended on Scale-Out File Servers.
(3) Not recommended on general use file servers.
(4) Requires NTFS.
(5) CSC is less compatible with CA shares than the other IW technologies, due to how it decides a share is offline combined with the SMB 3 client; Offline Files will stay online for 3-6 minutes even after the user loses access to the share.
Scale-Out File Server is not for Information Worker workloads!
And we only discussed the features highlighted in blue on the slide.
Windows Server 2012 and Windows Server 2012 R2 storage features
- Cluster-Aware Updating
- SMB3 & SMB Direct
- Virtual Fibre Channel
- Hyper-V Replica
- 8,000 VMs per cluster
- VM prioritization
- 64-node clusters
- Dedup
- Scale-Out File Server
- Storage Spaces
- Offloaded Data Transfer
- VM storage migration
- iSCSI Target Server
- ReFS
- VHDX
- Shared VHDX
- Hyper-V Storage QoS
- Work Folders
- SMI-S Storage Service
- NTFS Trim / Unmap
- NFS 4.1 Server
- SM API
- CSVFS online CHKDSK
- iSCSI Target Server with VHDX
- Dedup (live files / CSV)
- SMB Direct (> 1M IOPS)
- Live Migration over SMB
- Optimized Scale-Out File Server
- Storage Spaces Tiering
- Storage Spaces Write-Back Cache
- Storage Spaces Rebuild
- SMB Bandwidth Management
Microsoft Cloud Platform System powered by Dell
- Dell PowerEdge servers, Dell Storage, Dell Networking
- Tightly integrated components
- Windows Server 2012 R2, System Center 2012 R2, Windows Azure Pack
- Microsoft-designed architecture based on public cloud learning
- Microsoft-led support and orchestrated updates
- Optimized run-books for Microsoft applications

Cloud Platform System capabilities
- Pre-deployed infrastructure: switches, load balancer, storage, compute, network edge
- N+2 fault tolerant (N+1 networking)
- Pre-configured per best practices
- Integrated management: configure, deploy, patching; monitoring; backup and DR; automation
- 8,000 VMs*, 1.1 PB of total storage
- Optimized deployment and operations for Microsoft and other standard workloads

* VM topology: 2 vCPU, 1.75 GB RAM, 50 GB disk
[Diagram: CPS architecture: Windows Azure Pack admin and tenant portals and the Service Management API over System Center and SQL Server, on Hyper-V hosts and Hyper-V networking with SMB 3.0 & Storage Spaces, running on Dell PowerEdge servers, Dell Storage, and Dell Networking, with optimized racking and cabling for high density and reliability]
Storage Cluster (Storage Scale Unit)
Storage Scale Unit hardware (4x4)
- 4x Dell PowerEdge R620v2 servers: dual-socket Intel IvyBridge (E5-2650v2 @ 2.6 GHz), 128 GB memory, 2x LSI 9207-8e SAS controllers (shared storage), 2x 10 GbE Chelsio T520 (iWARP/RDMA)
- 4x PowerVault MD3060e JBODs: 48x 4 TB HDDs each (192 HDDs / 768 TB raw), 12x 800 GB SSDs each (48 SSDs / 38 TB raw)

Storage Spaces configuration
- 3 pools (2 tenant, 1 backup)
- VM storage: 16x enclosure-aware, 3-copy mirror @ ~9.5 TB; automatic tiered storage, write-back cache
- Backup storage: 16x enclosure-aware, dual parity @ 7.5 TB; SSD for logs only
- 24 TB of HDD and 3.2 TB of SSD capacity left unused for automatic rebuild
- Available space: tenant 156 TB; backup 126 TB (plus deduplication)
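The raw-capacity figures above can be double-checked with quick arithmetic. An illustrative Python calculation; note that usable tenant and backup space is lower than raw divided by the resiliency factor, because capacity is also split across pools and held back as rebuild reserve:

```python
def raw_tb(drive_count, size_tb):
    """Raw capacity in TB for a set of identical drives."""
    return drive_count * size_tb

hdd_raw = raw_tb(192, 4)    # 4 JBODs x 48 drives x 4 TB = 768 TB raw
ssd_raw = raw_tb(48, 0.8)   # 48 drives x 800 GB = 38.4 TB raw

# A 3-copy mirror stores every byte three times, so usable space is at
# most one third of raw (before the pool split and rebuild reserve).
mirror_usable_max = hdd_raw / 3
```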
Virtualized SQL Server OLTP workload
Test
- Max number of VMs running OLTP DBs with acceptable performance
- 90th-percentile response time must not exceed 1 second (application in VM)

Database configuration
- 1 medium OLTP DB per VM (2 vCPU, 3.5 GB vMem)

Deployment configuration
- 28 compute nodes hosting VMs running the OLTP workload
- 4-node Scale-Out File Server with Storage Spaces

Observations
- Storage optimization nearly doubles the number of DBs meeting the response time metric
- Response times drastically improved across all transactions after storage optimization
[Chart: before storage optimization, transaction response time vs. number of OLTP databases; the average 90th-percentile write-transaction response time stays within the expected 1-second limit up to a DB capacity of ~200]
[Chart: after storage optimization, the same measurement supports a DB capacity of ~400]
What is beyond?

Cluster Rolling Upgrades for Storage
- Applies to a storage (Scale-Out File Server) or Hyper-V cluster
- Seamless: zero-downtime cloud upgrades for Hyper-V and Scale-Out File Server
- Simple: easily roll in nodes with the new OS version
- Windows Server 2012 R2 and Windows Server vNext nodes within the same cluster
Storage Replica
BCDR
- Synchronous or asynchronous
- Cluster <-> cluster, server <-> server
- Microsoft Azure Site Recovery orchestration

Stretch cluster
- Synchronous stretch clusters across sites for HA

Benefits
- Block-level, host-based volume replication
- End-to-end software stack from Microsoft
- Works with any Windows volume
- Hardware agnostic; existing SANs work
- Uses SMB3 as transport
[Diagram: stretch cluster scenario: NODE1 and NODE2 in HVCLUS in the Manhattan DC replicate with SR over SMB3 to NODE3 and NODE4 in HVCLUS in the Jersey City DC]
[Diagram: server to server scenario: SRV1 in the Manhattan DC replicates with SR over SMB3 to SRV2 in the Jersey City DC]
Available in Windows Server Technical Preview for Stretch Cluster and Server to Server scenarios. Management tools are still in progress.
Storage QoS - Greater efficiency

[Diagram: virtual machines on a Hyper-V cluster with per-VM I/O schedulers and rate limiters, connected over the SMB3 storage network fabric to a Scale-Out File Server cluster running the policy manager]

Control and monitor storage performance

Flexible and customizable
- Policy per VHD, VM, service, or tenant
- Define minimum and maximum IOPS
- Fair distribution within a policy

Simple out-of-box behavior
- Enabled by default for Scale-Out File Server
- Automatic metrics (normalized IOPS and latency) per VM and VHD

Management
- System Center VMM and Operations Manager
- PowerShell built-in for Hyper-V and SOFS
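The minimum/maximum policy with fair distribution can be sketched as a clamp on each VHD's fair share. Illustrative Python only; the class name and numbers are invented, and the real scheduler works on normalized IOPS over time windows:

```python
class IopsPolicy:
    """Illustrative Storage QoS policy: each VHD's allotment is its
    fair share of the policy, clamped between the minimum
    (reservation) and maximum (cap) IOPS."""
    def __init__(self, min_iops, max_iops):
        self.min_iops = min_iops
        self.max_iops = max_iops

    def allot(self, fair_share):
        # Fair distribution within the policy, honoring min and max.
        return max(self.min_iops, min(fair_share, self.max_iops))

# A hypothetical "gold" tier: at least 500 IOPS, at most 2,000.
gold = IopsPolicy(min_iops=500, max_iops=2000)
```

A busy cluster squeezes a VHD no lower than its minimum, and a quiet one lets it climb no higher than its maximum, which is what makes per-tenant performance predictable.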
Storage Spaces Shared Nothing
Enabling cloud hardware designs
- Support for DAS (shared-nothing) storage hardware
- Prescriptive configurations
- No shared JBODs or SAS fabric needed behind the Scale-Out File Server nodes

Scalable pools
- Supports large pools
- Simple storage expansion and rebalancing

Fault tolerance
- Tolerates disk, enclosure, and node failures
- 3-copy mirror and dual parity

Management
- System Center and PowerShell

Key use cases
- Hyper-V IaaS storage
- Storage for backup and replication targets
[Diagram: Hyper-V clusters, SMB3 storage network fabric, Scale-Out File Server, shared JBOD storage]
Related content
- CDP-B222 Software Defined Storage in the Next Release of Windows Server (Tuesday, October 28, 5:00 PM)
- CDP-B246 Sneak Peek into the Next Release of Windows Server Hyper-V (Wednesday, October 29, 8:30 AM)
- CDP-B318 Building Scalable and Reliable Backup Solutions in the Next Release of Windows Server Hyper-V (Tuesday, October 28, 1:30 PM)
- CDP-B323 Delivering Predictable Storage Performance with Storage Quality of Service in the Next Release of Windows Server (Wednesday, October 29, 8:30 AM)
- CDP-B352 Stretching Failover Clusters and Using Storage Replica for Disaster Recovery in the Next Release of Windows Server (Wednesday, October 29, 5:00 PM)
- CDP-B354 Advantages of Upgrading Your Private Cloud Infrastructure in the Next Release of Windows Server (Wednesday, October 29, 10:15 AM)
- CDP-B341 Architectural Deep Dive into the Microsoft Cloud Platform System (Wednesday, October 29, 12:00 PM - 1:15 PM)
Find me later at the storage booth

Resources
- Provide cost-effective storage for Hyper-V
- Planning and Design Guide
- Deploy Clustered Storage Spaces
- Deploy Scale-Out File Server
- Storage Spaces Survival Guide
Resources
- Learning: Microsoft Certification & Training Resources: www.microsoft.com/learning
- TechNet, resources for IT professionals: http://microsoft.com/technet
- Sessions on demand: http://channel9.msdn.com/Events/TechEd
- Developer Network: http://developer.microsoft.com
Come visit us in the Microsoft Solutions Experience (MSE)! Look for the Cloud and Datacenter Platform area, TechExpo Hall 7.
For more information
- Windows Server Technical Preview: http://technet.microsoft.com/library/dn765472.aspx
- Microsoft Azure: http://azure.microsoft.com/en-us/
- System Center Technical Preview: http://technet.microsoft.com/en-us/library/hh546785.aspx
- Azure Pack: http://www.microsoft.com/en-us/server-cloud/products/windows-azure-pack
Azure certification and training
- Microsoft Azure Fundamentals: MVA online training (coming soon)
- Implementing Microsoft Azure Infrastructure Solutions: classroom training MOC 10979 / MOC 20533, online training, Exam 533
- Developing Microsoft Azure Solutions: classroom training MOC 20532, online training, Exam 532
- Architecting Microsoft Azure Solutions: online training (coming soon), Exam 534
- Links: http://bit.ly/Azure-Cert, http://bit.ly/Azure-MVA, http://bit.ly/Azure-Train
- Get certified for half the price at TechEd Europe 2014! http://bit.ly/TechEd-CertDeal
Submit your TechEd evaluations
- The TechEd mobile app for session evaluations is currently offline
- Fill out an evaluation via a CommNet station/PC: Schedule Builder
- Log in: europe.msteched.com/catalog
- We value your feedback!
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.