VMware Big Data Forum
© 2009 VMware Inc. All rights reserved
vSphere Big Data Extensions: Hadoop Reference Architecture and Performance Best Practices
李欣慧 (Li Xinhui), Senior Big Data R&D Engineer, VMware China R&D Center
2
Agenda
Recommended Deployment Topology
Plan Your Cluster
3
Standard Deployment Configuration on Single Worker
[Diagram labels: Virtualization Host; Shared storage (SAN/NAS); Local disks; OS Image – VMDK; Hadoop Virtual Node 2 running a Datanode and a Task-tracker; multiple VMDKs, each formatted as Ext4; mapred.local.dir]
4
Standard Deployment Configuration on Single Worker
[Diagram labels: Virtualization Host; Local disks; OS Image – VMDK; Hadoop Virtual Node 2 running a Datanode and a Task-tracker; multiple VMDKs, each formatted as Ext4; mapred.local.dir]
5
Standard Deployment Configuration
[Diagram labels: Virtualization Host; Shared storage (SAN/NAS); Local disks; OS Image – VMDK (one per node); Hadoop Virtual Node 1 and Hadoop Virtual Node 2, each running a Datanode and a Task-tracker; multiple VMDKs, each formatted as Ext4; mapred.local.dir]
6
Standard Deployment Configuration
[Diagram labels: Virtualization Host; Local disks; OS Image – VMDK (one per node); Hadoop Virtual Node 1 and Hadoop Virtual Node 2, each running a Datanode and a Task-tracker; multiple VMDKs, each formatted as Ext4; mapred.local.dir]
7
Standard Deployment Configuration for D/C Separation
[Diagram labels: Virtualization Host; Shared storage (SAN/NAS); Local disks; OS Image – VMDK (one per node); Hadoop Virtual Node 1 running a Task-tracker; Hadoop Virtual Node 2 running a Datanode; multiple VMDKs, each formatted as Ext4]
8
Data Path for Combined vs. Data/Compute Separation
[Diagram: two panels, each showing a Virtualization Host with Hadoop virtual nodes, TaskTrackers, and a Virtual Switch]
Serengeti provides local-storage-based temp space for D/C separation.
• Each compute VM needs its own temp space
• The required temp space differs from one application to another
• This can result in wasted space
9
Recommended Topology of Data/Compute Separation
[Diagram labels: Virtualization Host; Shared storage (SAN/NAS); Local disks; OS Image – VMDK (one per node); Hadoop Virtual Node 1 running a Task-tracker with an Ext4-formatted VMDK; Hadoop Virtual Node 2 running a Datanode; multiple VMDKs, each formatted as Ext4]
10
Data Path for Local TT Storage vs. NFS Temp
[Diagram: two panels, each showing a Virtualization Host with Hadoop Virtual Node 1, Hadoop Virtual Node 2, TaskTrackers, and a Virtual Switch]
Serengeti provides NFS-based temp space for D/C separation.
• Improves local storage space utilization
• Trades bandwidth efficiency against the overhead of NFS
11
Consolidated Storage on Single DN VM
[Diagram labels: Virtualization Host; Shared storage (SAN/NAS); Local disks; OS Image – VMDK (one per node); Hadoop Virtual Node 1 running a Task-tracker; Hadoop Virtual Node 2 running a Datanode; multiple VMDKs, each formatted as Ext4 and holding one dir; NFS Client; NFS Server]
12
Recommended Topology of Computing Only Cluster
[Diagram labels: Virtualization Host; Shared storage (SAN/NAS); OS Image – VMDK (one per node); Hadoop Virtual Node 1 running a Task-tracker; Hadoop Virtual Node 2 running a Datanode; multiple VMDKs, each formatted as Ext4]
13
Plan Your Cluster
Start with a small cluster and grow it as required
• Initially just four or six nodes
• Increase amount of computation/data/memory as required
• Available HDFS space = (DFS Remaining value × 95%) / dfs.replication value
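The capacity formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck: the function name `available_hdfs_space` and the example figures (60 TB remaining, replication factor 3) are assumptions chosen for the arithmetic.

```python
TB = 1024 ** 4  # bytes per tebibyte, used only to make the example readable


def available_hdfs_space(dfs_remaining_bytes: float,
                         replication: int,
                         usable_fraction: float = 0.95) -> float:
    """Apply the slide's rule: (DFS Remaining * 95%) / dfs.replication.

    dfs_remaining_bytes: the "DFS Remaining" value reported for the cluster.
    replication: the configured dfs.replication value.
    usable_fraction: the 95% factor from the slide.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if dfs_remaining_bytes < 0:
        raise ValueError("dfs_remaining_bytes must be non-negative")
    return dfs_remaining_bytes * usable_fraction / replication


# Hypothetical cluster: 60 TB reported as remaining, replication factor 3.
usable = available_hdfs_space(60 * TB, replication=3)
print(f"Roughly {usable / TB:.1f} TB of new data fits")
```

With a replication factor of 3, each stored block consumes three blocks of raw capacity, which is why the raw remaining figure is divided by `dfs.replication`.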
Choose the right hardware – master node
• The NameNode and JobTracker often run on the same machine in smaller clusters
• Consider HA/FT settings
• Place the NameNode and JobTracker on a different host from the slave nodes
• Dual power supplies
14
Plan Your Cluster
Choose the right hardware – slave node
• At least 2 quad-core CPUs, with Hyper-Threading (HT) enabled
• RAM
  • Consider 6% overhead for virtualization
  • Recommend 4-8 GB of memory per core
• Storage
  • At least 8 disks per host; 12 disks per host may be ideal for absolute performance but probably not for price-performance
  • Recommend 1-1.5 disks per core
  • JBOD with 7,200 RPM SATA disks is fine
  • A good practical maximum is 24 TB or 36 TB per slave node. More than that will result in massive network traffic if a node dies and block re-replication must take place.
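The per-core rules of thumb above can be turned into a quick sizing helper. The sketch below is illustrative only: `size_worker` and `WorkerSizing` are hypothetical names, and it assumes the 6% virtualization overhead is subtracted from host RAM to obtain the memory left for Hadoop VMs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSizing:
    cores: int
    ram_gb_min: float         # host RAM at 4 GB per core
    ram_gb_max: float         # host RAM at 8 GB per core
    usable_ram_gb_min: float  # after the 6% virtualization overhead
    usable_ram_gb_max: float
    disks_min: int            # 1 disk per core, but never fewer than 8
    disks_max: int            # 1.5 disks per core


def size_worker(cores: int, overhead: float = 0.06) -> WorkerSizing:
    """Apply the slide's slave-node guidelines to a given physical core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    ram_min, ram_max = 4.0 * cores, 8.0 * cores
    disks_min = max(8, cores)
    disks_max = max(disks_min, math.ceil(1.5 * cores))
    return WorkerSizing(
        cores=cores,
        ram_gb_min=ram_min,
        ram_gb_max=ram_max,
        usable_ram_gb_min=ram_min * (1 - overhead),
        usable_ram_gb_max=ram_max * (1 - overhead),
        disks_min=disks_min,
        disks_max=disks_max,
    )


# Two quad-core CPUs, the minimum suggested on the slide.
print(size_worker(8))
```

For the minimum 8-core host this yields 8 to 12 disks, which lines up with the "at least 8, ideally 12" guidance in the storage bullets.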
15
Plan Your Cluster
Networking
• Use dedicated switches for the Hadoop cluster, with nodes connected to a top-of-rack switch
• Connect nodes at a minimum of 1 Gb/s; consider 10 Gb/s for clusters with large volumes of intermediate data
• Racks are interconnected via core switches
• Core switches should connect to top-of-rack switches via dual 10 Gb/s links
• Use redundant top-of-rack and core switches
• Separate the management network from the VM network
• Adopt a vDS with dvPort groups that span hosts to ensure consistent configuration of VMs and virtual ports for functions such as vMotion and network storage
• Leave the management port out of your vDS
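To get a feel for why link speed and per-node capacity matter together, the following back-of-envelope sketch estimates how long it takes to move a node's worth of data over a single link. It is deliberately naive: it assumes the full line rate is available and that all data crosses one link, whereas real block re-replication is spread across many nodes and throttled. The function name `transfer_hours` is hypothetical.

```python
def transfer_hours(data_tb: float, link_gbps: float) -> float:
    """Hours needed to move data_tb terabytes (decimal) over a link_gbps link.

    Assumes the link runs at full line rate with no protocol overhead,
    so the result is an optimistic lower bound for a single link.
    """
    if data_tb < 0:
        raise ValueError("data_tb must be non-negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    bits = data_tb * 1e12 * 8            # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # gigabits per second -> bits per second
    return seconds / 3600


# Per-node capacities mentioned in the deck, at the two link speeds discussed.
for capacity_tb in (24, 36):
    for link in (1, 10):
        hours = transfer_hours(capacity_tb, link)
        print(f"{capacity_tb} TB over {link} Gb/s: {hours:.1f} h")
```

Even under these generous assumptions, a fully loaded node represents many hours of traffic at 1 Gb/s, which is consistent with the recommendation to consider 10 Gb/s links and to cap per-node capacity.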
16
Networking Configurations – Four 1G NICs
[Diagram: a Virtualization Host with four 1 Gb/s NICs (vmnic 0 to vmnic 3) uplinked to pSwitch 1 and pSwitch 2; Virtual Switch 1 carries the Hadoop cluster VM portgroup; Virtual Switch 0 carries the ports listed below]
• MGMT: 192.168.1.100
• VMKERNEL: 192.168.2.100
• VMOTION: 192.168.3.100
• FT: 192.168.4.100
Hadoop VM traffic goes through vSwitch1 (vmnic2 and vmnic3, both active)
On vSwitch0, MGMT and VMkernel traffic goes through vmnic0 (active, with vmnic1 on standby)
vMotion and FT traffic goes through vmnic1 (active, with vmnic0 on standby)
17
Networking Configurations – 10G for Hadoop VMs
[Diagram labels: Virtualization Host; vmnic 0, vmnic 1, vmnic 2; pSwitch 1, pSwitch 2, pSwitch 3; two 1 Gb/s links and one 10 GbE link; Virtual Switch 1 with the Hadoop cluster VM portgroup; Virtual Switch 0 with the ports listed below]
• MGMT: 192.168.1.100
• VMKERNEL: 192.168.2.100
• VMOTION: 192.168.3.100
• FT: 192.168.4.100
Hadoop VM traffic goes through vSwitch1 (vmnic3)
10G for Hadoop cluster VMs
• Greater performance benefits
• If needed, add redundancy with another set of vmnic/pSwitch
Keep redundancy for the management network
18
vSphere Configurations
Configure the NTP service on hosts to ensure that the time on all nodes is synchronized
Virtual Disk Settings
• One datastore per physical disk
• Warm-up is needed on the provisioned cluster
The NUMA scheduler is important for virtualized Hadoop performance
• Poor configuration can result in a 12% (1) performance degradation
• Data VMs should preferably be distributed across NUMA nodes
Provision the right VM size
• Reserve 6% memory for vSphere usage
• Avoid over-commitment
• Enable NUMA and keep VM size within the NUMA node
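The VM sizing rules above can be checked mechanically. The sketch below is an illustration under stated assumptions: it treats host memory as evenly divided among NUMA nodes and spreads the 6% vSphere reservation evenly across them. `Host`, `numa_node_budget`, and `fits_in_numa_node` are hypothetical names, not vSphere APIs.

```python
from dataclasses import dataclass

VSPHERE_MEMORY_RESERVE = 0.06  # the 6% reserved for vSphere, per the slide


@dataclass(frozen=True)
class Host:
    numa_nodes: int
    cores_per_numa_node: int
    ram_gb: float


def numa_node_budget(host: Host) -> tuple[int, float]:
    """Return (max vCPUs, max memory in GB) for a VM confined to one NUMA node.

    Assumes RAM is split evenly across NUMA nodes and that the vSphere
    reservation is taken proportionally from each node.
    """
    if host.numa_nodes < 1 or host.cores_per_numa_node < 1 or host.ram_gb <= 0:
        raise ValueError("host must have positive NUMA nodes, cores, and RAM")
    usable_gb = host.ram_gb * (1 - VSPHERE_MEMORY_RESERVE)
    return host.cores_per_numa_node, usable_gb / host.numa_nodes


def fits_in_numa_node(host: Host, vcpus: int, mem_gb: float) -> bool:
    """True if a VM of this size stays within a single NUMA node's budget."""
    max_vcpus, max_mem_gb = numa_node_budget(host)
    return vcpus <= max_vcpus and mem_gb <= max_mem_gb


# Hypothetical host: 2 NUMA nodes, 8 cores each, 128 GB RAM in total.
host = Host(numa_nodes=2, cores_per_numa_node=8, ram_gb=128)
print(numa_node_budget(host))
print(fits_in_numa_node(host, vcpus=8, mem_gb=56))
print(fits_in_numa_node(host, vcpus=8, mem_gb=64))
```

A check like this also helps avoid over-commitment: summing the sizes of all planned VMs against the per-node budgets shows whether the host is being asked for more than it physically has.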
19
For Existing Devices
Roughly match existing resource capacity to Hadoop's needs
• CPU : RAM : Throughput = 4 × 1333 MHz : 32 GB : 800M/s
Use a powerful machine to run the master node/compute node
Use a high-throughput machine for the slave node/data node
20
Q&A