Woodman, Shakshober: Performance Analysis and System Tuning

  • Performance Analysis and System Tuning

    Larry Woodman
    D. John Shakshober

  • Agenda: Red Hat Enterprise Linux (RHEL) Performance and Tuning

    References: valuable tuning guides/books
    Part 1: Memory Management / File System Caching
    Part 2: Disk and File System IO
    Part 3: Performance Monitoring Tools
    Part 4: Performance Tuning / Analysis
    Part 5: Case Studies

  • Linux Performance Tuning References

    Alikins, "System Tuning Info for Linux Servers", http://people.redhat.com/alikins/system_tuning.html
    Axboe, J., "Deadline IO Scheduler Tunables", SuSE, EDF R&D, 2003.
    Braswell, B., Ciliendo, E., "Tuning Red Hat Enterprise Linux on IBM eServer xSeries Servers", http://www.ibm.com/redbooks
    Corbet, J., "The Continuing Development of IO Scheduling", http://lwn.net/Articles/21274.
    Ezolt, P., Optimizing Linux Performance, www.hp.com/hpbooks, Mar 2005.
    Heger, D., Pratt, S., "Workload Dependent Performance Evaluation of the Linux 2.6 IO Schedulers", Linux Symposium, Ottawa, Canada, July 2004.
    Red Hat Enterprise Linux Performance Tuning Guide, http://people.redhat.com/dshaks/rhel3_perf_tuning.pdf
    Network and NFS performance are covered in separate talks: http://nfs.sourceforge.net/nfshowto/performance.html

  • Memory Management

    Physical Memory (RAM) Management
    NUMA versus UMA
    Virtual Address Space Maps
      32-bit: x86 up, smp, hugemem, 1G/3G vs 4G/4G
      64-bit: x86_64, IA64
    Kernel Wired Memory: static boot-time, slab cache, page tables, HugeTLBfs
    Reclaimable User Memory: page cache / anonymous split
    Page Reclaim Dynamics: kswapd, bdflush/pdflush, kupdated

  • Physical Memory (RAM) Management

    Physical memory layout: NUMA nodes, zones
    mem_map array
    Page lists: free list, active, inactive

  • Memory Zones

    32-bit:
      HighMem zone:  896MB (or 3968MB) up to 64GB (PAE)
      Normal zone:   16MB - 896MB (or 3968MB)
      DMA zone:      0 - 16MB

    64-bit:
      Normal zone:   16MB (or 4GB) up to end of RAM
      DMA zone:      0 - 16MB (or 4GB)

  • Per-NUMA-Node Resources: memory zones (DMA & Normal zones), CPUs, IO/DMA capacity, page reclamation daemon (kswapd#)

  • NUMA Nodes and Zones (64-bit)

    Node 0:  DMA zone (0 - 16MB or 4GB), Normal zone
    Node 1:  Normal zone (up to end of RAM)

  • Memory Zone Utilization

    DMA:            24-bit I/O
    Normal:         kernel static, kernel dynamic (slab cache, bounce buffers, driver allocations), user overflow
    Highmem (x86):  user anonymous, page cache, page tables

  • Per-Zone Resources: mem_map, free lists, active and inactive page lists, page reclamation, page reclamation watermarks

  • mem_map

    Kernel maintains a page struct for each 4KB (16KB on IA64) page of RAM.
    The mem_map array consumes a significant amount of lowmem at boot time.
    Page struct size:
      RHEL3 32-bit = 60 bytes    RHEL3 64-bit = 112 bytes
      RHEL4 32-bit = 32 bytes    RHEL4 64-bit = 56 bytes
    16GB x86 running RHEL3: 17179869184 / 4096 * 60 = ~250MB mem_map array!
    The RHEL4 mem_map is only about 50% of the RHEL3 mem_map.
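    A quick sanity check of the arithmetic above (shell sketch, using the page struct sizes listed on this slide):

      # bytes of RAM / page size * page struct size, reported in MB
      echo "$((17179869184 / 4096 * 60 / 1024 / 1024)) MB mem_map for 16GB x86 on RHEL3"
      echo "$((17179869184 / 4096 * 32 / 1024 / 1024)) MB mem_map for 16GB x86 on RHEL4"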

  • Per-zone free list / buddy allocator lists

    Kernel maintains a per-zone free list.
    Buddy allocator coalesces free pages into larger physically contiguous pieces:

      DMA:     1*4kB 4*8kB 6*16kB 4*32kB 3*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11588kB
      Normal:  217*4kB 207*8kB 1*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 3468kB
      HighMem: 847*4kB 409*8kB 17*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 7924kB

    Memory allocation failures: free list exhaustion, free list fragmentation.
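    On 2.6-based kernels (RHEL4) the same per-zone free list breakdown can be read without Alt-SysRq; a minimal sketch (each column is the count of free blocks per order, 4kB up through 4096kB, matching the lists above):

      cat /proc/buddyinfo
      grep -i "allocation failure" /var/log/messages    # look for logged allocation failures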

  • Per-zone page lists

    Active: most recently referenced
      Anonymous: stack, heap, bss
      Page cache: file system data
    Inactive: least recently referenced
      Dirty: modified
      Laundry: writeback in progress
      Clean: ready to free
    Free: coalesced by the buddy allocator

  • Virtual Address Space Maps

    32-bit: 3G/1G address space, 4G/4G address space
    64-bit: x86_64, IA64

  • Linux 32-bit Address Spaces

    3G/1G kernel (SMP):     user virtual 0 - 3GB, kernel 3 - 4GB; RAM divided into DMA / Normal / HighMem
    4G/4G kernel (Hugemem): separate 4GB user and 4GB kernel virtual spaces; kernel directly maps 0 - 3968MB (DMA / Normal), HighMem above 3968MB

  • Linux 64-bit Address Space

    x86_64: user and kernel virtual ranges each map up to 1TB (2^40) of RAM directly
    IA64:   virtual address space divided into regions 0 - 7, split between user and kernel

  • Memory Pressure

    32-bit: kernel allocations land in the DMA and Normal zones; user allocations land in Highmem
    64-bit: kernel and user allocations share the DMA and Normal zones

  • Kernel Memory Pressure

    Static, boot-time (DMA and Normal zones): kernel text, data, BSS; bootmem allocator; tables and hashes (mem_map)
    Slab cache (Normal zone): kernel data structs; inode cache, dentry cache and buffer header dynamics
    Page tables (Highmem/Normal zone): 32-bit versus 64-bit
    HugeTLBfs (Highmem/Normal zone): e.g. 4K pages w/ 4GB memory = 1 million TLB entries; 4M pages w/ 4GB memory = 1000 TLB entries (see the sketch below)
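    A minimal sketch of reserving huge pages to cut TLB pressure (the page count is illustrative; nr_hugepages is the tunable listed under Capacity Tuning later in this talk):

      echo 512 > /proc/sys/vm/nr_hugepages    # try to reserve 512 huge pages
      grep Huge /proc/meminfo                 # HugePages_Total / HugePages_Free / Hugepagesize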

  • User Memory Pressure: anonymous / page cache split

    The page cache pool grows through page cache allocations; the anonymous pool grows through page faults.

  • Page Cache / Anonymous memory split

    Page cache memory is global and grows when file system data is accessed, until memory is exhausted.
    Page cache is freed when:
      underlying files are deleted;
      the file system is unmounted;
      kswapd reclaims page cache pages when memory is exhausted.
    Anonymous memory is private and grows on user demand:
      allocation followed by page fault;
      swap-in.
    Anonymous memory is freed when:
      the process unmaps the anonymous region or exits;
      kswapd reclaims anonymous pages (swap-out) when memory is exhausted.
    Balance between page cache and anonymous memory:
      dynamic;
      controlled via /proc/sys/vm/pagecache (see the sketch below).
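    A quick way to watch the page cache / anonymous balance shift while a workload runs (the fields come straight from /proc/meminfo, shown later in this talk):

      watch -n 5 'grep -E "MemFree|Cached|Active|Inactive" /proc/meminfo'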

  • 32-bit Memory Reclamation

    Kernel allocations (DMA, Normal) and user allocations (Highmem)
    Kernel reclamation (kswapd): slab cache reaping, inode cache pruning, buffer head freeing, dentry cache pruning
    User reclamation (kswapd, bdflush/pdflush): page aging, page cache shrinking, swapping

  • 64-bit Memory Reclamation

    All of RAM takes both kernel and user allocations, and kernel and user reclamation operate against the same zones.

  • Anonymous / page cache reclaiming

    Page cache (filled by page cache allocations) is reclaimed by kswapd (with bdflush and kupdated), by deletion of a file, or by unmounting the file system.
    Anonymous memory (filled by page faults) is reclaimed by kswapd page reclaim (swap-out), by unmap, or at process exit.

  • Per Node/Zone Paging Dynamics

    User allocations fill the ACTIVE list; page aging moves pages to the INACTIVE list (dirty, then clean); reclaiming (swap-out, bdflush) and user deletions move pages to the FREE list; referenced pages are reactivated back to ACTIVE.

  • Part 2: Performance Monitoring Tools

    Standard Unix OS tools: monitoring cpu, memory, process, disk; oprofile
    Kernel tools: /proc, info (cpu, mem, slab), dmesg, Alt-SysRq; profiling (nmi_watchdog=1, profile=2)
    Tracing (separate summit talk): strace, ltrace; dprobe, kprobe
    3rd-party profiling / capacity monitoring: Perfmon, Caliper, VTune; SARcheck, KDE, BEA Patrol, HP OpenView

  • Red Hat Top Tools

    CPU tools:     1. top  2. vmstat  3. ps aux  4. mpstat -P all  5. sar -u  6. iostat  7. oprofile  8. gnome-system-monitor  9. KDE-monitor  10. /proc
    Memory tools:  1. top  2. vmstat -s  3. ps aur  4. ipcs  5. sar -r -B -W  6. free  7. oprofile  8. gnome-system-monitor  9. KDE-monitor  10. /proc
    Process tools: 1. top  2. ps -o pmem  3. gprof  4. strace, ltrace  5. sar
    Disk tools:    1. iostat -x  2. vmstat -D  3. sar -DEV #  4. nfsstat  5. NEED MORE!

  • top (press h help, m memory, t threads, > column sort)

    top - 09:01:04 up 8 days, 15:22, 2 users, load average: 1.71, 0.39, 0.12
    Tasks: 114 total, 1 running, 113 sleeping, 0 stopped, 0 zombie
    Cpu0: 5.3% us, 2.3% sy, 0.0% ni, 0.0% id, 92.0% wa, 0.0% hi, 0.3% si
    Cpu1: 0.3% us, 0.3% sy, 0.0% ni, 89.7% id, 9.7% wa, 0.0% hi, 0.0% si
    Mem:  2053860k total, 2036840k used, 17020k free, 99556k buffers
    Swap: 2031608k total, 160k used, 2031448k free, 417720k cached

    PID   USER   PR NI VIRT  RES  SHR  S %CPU %MEM  TIME+   COMMAND
    27830 oracle 16  0 1315m 1.2g 1.2g D  1.3 60.9  0:00.09 oracle
    27802 oracle 16  0 1315m 1.2g 1.2g D  1.0 61.0  0:00.10 oracle
    27811 oracle 16  0 1315m 1.2g 1.2g D  1.0 60.8  0:00.08 oracle
    27827 oracle 16  0 1315m 1.2g 1.2g D  1.0 61.0  0:00.11 oracle
    27805 oracle 17  0 1315m 1.2g 1.2g D  0.7 61.0  0:00.10 oracle
    27828 oracle 15  0 27584 6648 4620 S  0.3  0.3  0:00.17 tpcc.exe
    1     root   16  0  4744  580  480 S  0.0  0.0  0:00.50 init
    2     root   RT  0     0    0    0 S  0.0  0.0  0:00.11 migration/0
    3     root   34 19     0    0    0 S  0.0  0.0  0:00.00 ksoftirqd/0

  • vmstat of IOzone to EXT3 fs, 6GB mem   #! deplete memory until pdflush turns on

    procs ----------memory---------- --swap-- ----io---- --system-- ----cpu----
    r  b  swpd  free  buff  cache    si  so   bi  bo     in  cs     us sy wa id

    200448352420052423457600546315251303096

    020169784020052429314400057850482108539941221463

    3001537884200524384109200193589463243144307321842

    02052812020052462281720047888810177133921322246

    01046140200524671373600179110719144718251303535

    22050972200524670574400232119698131619710253144

    ....

    #! now transition from writes to reads

    procs ----------memory---------- --swap-- ----io---- --system-- ----cpu----
    r  b  swpd  free  buff  cache    si  so   bi  bo     in  cs     us sy wa id

    14051040200524670554400213351912658390265618

    1103506420052467127240040118911136720210354223

    01068264234372664702000767445420484032072073

    01034468234372667801600773913416202834091872

    01047320234372669035600810507717832916072073

    10038756234372669834400761364420273705191972

    01031472234372670653200767253316012807081973

  • iostat -x of the same IOzone EXT3 file system

    iostat metrics (rates per second, sizes and response times):
      r|w rqm/s   requests merged/s       avgrq-sz  average request size
      r|w sec/s   512-byte sectors/s      avgqu-sz  average queue size
      r|w kB/s    kilobytes/s             await     average wait time (ms)
      r|w /s      operations/s            svctm     average service time (ms)

    Linux 2.4.21-27.0.2.ELsmp (node1)   05/09/2005

    avg-cpu:  %user  %nice  %sys  %iowait  %idle
               0.40   0.00  2.63     0.91  96.06

    Device: rrqm/s   wrqm/s  r/s    w/s   rsec/s    wsec/s  rkB/s    wkB/s  avgrq-sz avgqu-sz await svctm %util
    sdi     16164.60  0.00  523.40  0.00  133504.00  0.00   66752.00  0.00   255.07   1.00    1.91  1.88   98.40
    sdi     17110.10  0.00  553.90  0.00  141312.00  0.00   70656.00  0.00   255.12   0.99    1.80  1.78   98.40
    sdi     16153.50  0.00  522.50  0.00  133408.00  0.00   66704.00  0.00   255.33   0.98    1.88  1.86   97.00
    sdi     17561.90  0.00  568.10  0.00  145040.00  0.00   72520.00  0.00   255.31   1.01    1.78  1.76  100.00

  • sar

    [root@localhost redhat]# sar -u 3 3
    Linux 2.4.21-20.EL (localhost.localdomain)   05/16/2005
    10:32:28 PM  CPU  %user  %nice  %system   %idle
    10:32:31 PM  all   0.00   0.00     0.00  100.00
    10:32:34 PM  all   1.33   0.00     0.33   98.33
    10:32:37 PM  all   1.34   0.00     0.00   98.66
    Average:     all   0.89   0.00     0.11   99.00

    [root]# sar -n DEV
    Linux 2.4.21-20.EL (localhost.localdomain)   03/16/2005
    01:10:01 PM  IFACE  rxpck/s  txpck/s  rxbyt/s  txbyt/s  rxcmp/s  txcmp/s  rxmcst/s
    01:20:00 PM  lo        3.49     3.49   306.16   306.16     0.00     0.00      0.00
    01:20:00 PM  eth0      3.89     3.53  2395.34   484.70     0.00     0.00      0.00
    01:20:00 PM  eth1      0.00     0.00     0.00     0.00     0.00     0.00      0.00

  • free / numastat memory allocation

    [root@localhost redhat]# free -l
                 total     used     free  shared  buffers  cached
    Mem:        511368   342336   169032       0    29712  167408
    Low:        511368   342336   169032       0        0       0
    High:            0        0        0       0        0       0
    -/+ buffers/cache:   145216   366152
    Swap:      1043240        0  1043240

    numastat (on a 2-cpu x86_64 based system)
                       node1      node0
    numa_hit         9803332   10905630
    numa_miss        2049018    1609361
    numa_foreign     1609361    2049018
    interleave_hit     58689      54749
    local_node       9770927   10880901
    other_node       2081423    1634090

  • ps, mpstat

    [root@localhost root]# ps aux
    [root@localhost root]# ps aux | more
    USER  PID  %CPU  %MEM   VSZ  RSS  TTY  STAT  START  TIME  COMMAND
    root    1   0.1   0.1  1528  516    ?  S     23:18  0:04  init
    root    2   0.0   0.0     0    0    ?  SW    23:18  0:00  [keventd]
    root    3   0.0   0.0     0    0    ?  SW    23:18  0:00  [kapmd]
    root    4   0.0   0.0     0    0    ?  SWN   23:18  0:00  [ksoftirqd/0]
    root    7   0.0   0.0     0    0    ?  SW    23:18  0:00  [bdflush]
    root    5   0.0   0.0     0    0    ?  SW    23:18  0:00  [kswapd]
    root    6   0.0   0.0     0    0    ?  SW    23:18  0:00  [kscand]

    [root@localhost redhat]# mpstat 3 3
    Linux 2.4.21-20.EL (localhost.localdomain)   05/16/2005
    10:40:34 PM  CPU  %user  %nice  %system  %idle  intr/s
    10:40:37 PM  all   3.00   0.00     0.00  97.00  193.67
    10:40:40 PM  all   1.33   0.00     0.00  98.67  208.00
    10:40:43 PM  all   1.67   0.00     0.00  98.33  196.00
    Average:     all   2.00   0.00     0.00  98.00  199.22

  • pstree

    [root@dhcp83-36 proc]# pstree
    init-+-atd
         |-auditd
         |-2*[automount]
         |-bdflush
         |-2*[bonobo-activati]
         |-cannaserver
         |-crond
         |-cupsd
         |-dhclient
         |-eggcups
         |-gconfd-2
         |-gdm-binary---gdm-binary---X
         |-gnome-session---ssh-agent
         |-2*[gnome-calculato]
         |-gnome-panel
         |-gnome-settings-
         |-gnome-terminal-+-bash---xchat
         |                |-bash---cscope---bash---cscope---bash---cscope---bash---cscope---bash---cscope---bash
         |                |-bash---cscope---bash---cscope---bash---cscope---bash---cscope---vi
         |                `-gnome-pty-helpe
         `-gnome-terminal-+-bash---su---bash---pstree
                          |-bash---cscope---vi
                          `-gnome-pty-helpe

  • The /proc file system

    /proc: acpi, bus, irq, net, scsi, sys, tty, pid#

  • 32-bit /proc/<pid>/maps

    [root@dhcp83-36 proc]# cat 5808/maps
    0022e000-0023b000 r-xp 00000000 03:03 4137068  /lib/tls/libpthread-0.60.so
    0023b000-0023c000 rw-p 0000c000 03:03 4137068  /lib/tls/libpthread-0.60.so
    0023c000-0023e000 rw-p 00000000 00:00 0
    0037f000-00391000 r-xp 00000000 03:03 523285   /lib/libnsl-2.3.2.so
    00391000-00392000 rw-p 00011000 03:03 523285   /lib/libnsl-2.3.2.so
    00392000-00394000 rw-p 00000000 00:00 0
    00c45000-00c5a000 r-xp 00000000 03:03 523268   /lib/ld-2.3.2.so
    00c5a000-00c5b000 rw-p 00015000 03:03 523268   /lib/ld-2.3.2.so
    00e5c000-00f8e000 r-xp 00000000 03:03 4137064  /lib/tls/libc-2.3.2.so
    00f8e000-00f91000 rw-p 00131000 03:03 4137064  /lib/tls/libc-2.3.2.so
    00f91000-00f94000 rw-p 00000000 00:00 0
    08048000-0804f000 r-xp 00000000 03:03 1046791  /sbin/ypbind
    0804f000-08050000 rw-p 00007000 03:03 1046791  /sbin/ypbind
    09794000-097b5000 rw-p 00000000 00:00 0
    b5fdd000-b5fde000 ---p 00000000 00:00 0
    b5fde000-b69de000 rw-p 00001000 00:00 0
    b69de000-b69df000 ---p 00000000 00:00 0
    b69df000-b73df000 rw-p 00001000 00:00 0
    b73df000-b75df000 r--p 00000000 03:03 3270410  /usr/lib/locale/locale-archive
    b75df000-b75e1000 rw-p 00000000 00:00 0
    bfff6000-c0000000 rw-p ffff8000 00:00 0

  • 64-bit /proc/<pid>/maps

    # cat /proc/2345/maps
    00400000-0100b000 r-xp 00000000 fd:00 1933328  /usr/sybase/ASE12_5/bin/dataserver.esd3
    0110b000-01433000 rw-p 00c0b000 fd:00 1933328  /usr/sybase/ASE12_5/bin/dataserver.esd3
    01433000-014eb000 rwxp 01433000 00:00 0
    40000000-40001000 ---p 40000000 00:00 0
    40001000-40a01000 rwxp 40001000 00:00 0
    2a95f73000-2a96073000 ---p 0012b000 fd:00 819273  /lib64/tls/libc-2.3.4.so
    2a96073000-2a96075000 r--p 0012b000 fd:00 819273  /lib64/tls/libc-2.3.4.so
    2a96075000-2a96078000 rw-p 0012d000 fd:00 819273  /lib64/tls/libc-2.3.4.so
    2a96078000-2a9607e000 rw-p 2a96078000 00:00 0
    2a9607e000-2a98c3e000 rw-s 00000000 00:06 360450  /SYSV0100401e (deleted)
    2a98c3e000-2a98c47000 rw-p 2a98c3e000 00:00 0
    2a98c47000-2a98c51000 r-xp 00000000 fd:00 819227  /lib64/libnss_files-2.3.4.so
    2a98c51000-2a98d51000 ---p 0000a000 fd:00 819227  /lib64/libnss_files-2.3.4.so
    2a98d51000-2a98d53000 rw-p 0000a000 fd:00 819227  /lib64/libnss_files-2.3.4.so
    2a98d53000-2a98d57000 r-xp 00000000 fd:00 819225  /lib64/libnss_dns-2.3.4.so
    2a98d57000-2a98e56000 ---p 00004000 fd:00 819225  /lib64/libnss_dns-2.3.4.so
    2a98e56000-2a98e58000 rw-p 00003000 fd:00 819225  /lib64/libnss_dns-2.3.4.so
    2a98e58000-2a98e69000 r-xp 00000000 fd:00 819237  /lib64/libresolv-2.3.4.so
    2a98e69000-2a98f69000 ---p 00011000 fd:00 819237  /lib64/libresolv-2.3.4.so
    2a98f69000-2a98f6b000 rw-p 00011000 fd:00 819237  /lib64/libresolv-2.3.4.so
    2a98f6b000-2a98f6d000 rw-p 2a98f6b000 00:00 0
    35c7e00000-35c7e08000 r-xp 00000000 fd:00 819469  /lib64/libpam.so.0.77
    35c7e08000-35c7f08000 ---p 00008000 fd:00 819469  /lib64/libpam.so.0.77
    35c7f08000-35c7f09000 rw-p 00008000 fd:00 819469  /lib64/libpam.so.0.77
    35c8000000-35c8011000 r-xp 00000000 fd:00 819468  /lib64/libaudit.so.0.0.0
    35c8011000-35c8110000 ---p 00011000 fd:00 819468  /lib64/libaudit.so.0.0.0
    35c8110000-35c8118000 rw-p 00010000 fd:00 819468  /lib64/libaudit.so.0.0.0
    35c9000000-35c900b000 r-xp 00000000 fd:00 819457  /lib64/libgcc_s-3.4.4-20050721.so.1
    35c900b000-35c910a000 ---p 0000b000 fd:00 819457  /lib64/libgcc_s-3.4.4-20050721.so.1
    35c910a000-35c910b000 rw-p 0000a000 fd:00 819457  /lib64/libgcc_s-3.4.4-20050721.so.1
    7fbfff1000-7fc0000000 rwxp 7fbfff1000 00:00 0

  • /proc/meminfo

    # cat /proc/meminfo
    MemTotal:        514060 kB
    MemFree:          23656 kB
    Buffers:          53076 kB
    Cached:          198344 kB
    SwapCached:           0 kB
    Active:          322964 kB
    Inactive:         60620 kB
    HighTotal:            0 kB
    HighFree:             0 kB
    LowTotal:        514060 kB
    LowFree:          23656 kB
    SwapTotal:      1044216 kB
    SwapFree:       1044056 kB
    Dirty:               40 kB
    Writeback:            0 kB
    Mapped:          168048 kB
    Slab:             88956 kB
    Committed_AS:    372800 kB
    PageTables:        3876 kB
    VmallocTotal:    499704 kB
    VmallocUsed:       6848 kB
    VmallocChunk:    491508 kB
    HugePages_Total:      0
    HugePages_Free:       0
    Hugepagesize:      2048 kB

  • /proc/slabinfo

    slabinfo - version: 2.0
    biovec-128            256     260   1536    5  2 : tunables   24  12  8 : slabdata     52     52  0
    biovec-64             256     260    768    5  1 : tunables   54  27  8 : slabdata     52     52  0
    biovec-16             256     270    256   15  1 : tunables  120  60  8 : slabdata     18     18  0
    biovec-4              256     305     64   61  1 : tunables  120  60  8 : slabdata      5      5  0
    biovec-1          5906938 5907188     16  226  1 : tunables  120  60  8 : slabdata  26138  26138  0
    bio               5906946 5907143    128   31  1 : tunables  120  60  8 : slabdata 190553 190553  0
    file_lock_cache         7     123     96   41  1 : tunables  120  60  8 : slabdata      3      3  0
    sock_inode_cache       29      63    512    7  1 : tunables   54  27  8 : slabdata      9      9  0
    skbuff_head_cache     202     540    256   15  1 : tunables  120  60  8 : slabdata     36     36  0
    sock                    6      10    384   10  1 : tunables   54  27  8 : slabdata      1      1  0
    proc_inode_cache      139     209    360   11  1 : tunables   54  27  8 : slabdata     19     19  0
    sigqueue                2      27    148   27  1 : tunables  120  60  8 : slabdata      1      1  0
    idr_layer_cache        82     116    136   29  1 : tunables  120  60  8 : slabdata      4      4  0
    buffer_head         66027  133800     52   75  1 : tunables  120  60  8 : slabdata   1784   1784  0
    mm_struct              44      70    768    5  1 : tunables   54  27  8 : slabdata     14     14  0
    kmem_cache            150     150    256   15  1 : tunables  120  60  8 : slabdata     10     10  0

  • Alt-SysRq-M  RHEL3/UMA

    SysRq : Show Memory
    Meminfo:
    Zone:DMA     freepages: 2929  min: 0    low: 0     high: 0
    Zone:Normal  freepages: 1941  min: 510  low: 2235  high: 3225
    Zone:HighMem freepages: 0     min: 0    low: 0     high: 0
    Free pages: 4870 (0 HighMem)
    (Active: 72404/13523, inactive_laundry: 2429, inactive_clean: 1730, free: 4870)
    aa:0      ac:0      id:0      il:0     ic:0     fr:2929
    aa:46140  ac:26264  id:13523  il:2429  ic:1730  fr:1941
    aa:0      ac:0      id:0      il:0     ic:0     fr:0
    1*4kB 4*8kB 2*16kB 2*32kB 1*64kB 2*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11716kB
    1255*4kB 89*8kB 5*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 7764kB
    Swap cache: add 958119, delete 918749, find 4611302/5276354, race 0+1
    27234 pages of slab cache
    244 pages of kernel stacks
    1303 lowmem pagetables, 0 highmem pagetables
    0 bounce buffer pages, 0 are on the emergency list
    Free swap: 598960 kB
    130933 pages of RAM
    0 pages of HIGHMEM
    3497 reserved pages
    34028 pages shared
    39370 pages swap cached

  • Alt-SysRq-M  RHEL3/NUMA

    SysRq : Show Memory
    Meminfo:
    Zone:DMA     freepages: 0       min: 0     low: 0     high: 0
    Zone:Normal  freepages: 369423  min: 1022  low: 6909  high: 9980
    Zone:HighMem freepages: 0       min: 0     low: 0     high: 0
    Zone:DMA     freepages: 2557    min: 0     low: 0     high: 0
    Zone:Normal  freepages: 494164  min: 1278  low: 9149  high: 13212
    Zone:HighMem freepages: 0       min: 0     low: 0     high: 0
    Free pages: 866144 (0 HighMem)
    (Active: 9690/714, inactive_laundry: 764, inactive_clean: 35, free: 866144)
    aa:0     ac:0     id:0    il:0    ic:0   fr:0
    aa:746   ac:2811  id:188  il:220  ic:0   fr:369423
    aa:0     ac:0     id:0    il:0    ic:0   fr:0
    aa:0     ac:0     id:0    il:0    ic:0   fr:2557
    aa:1719  ac:4414  id:526  il:544  ic:35  fr:494164
    aa:0     ac:0     id:0    il:0    ic:0   fr:0
    2497*4kB 1575*8kB 902*16kB 515*32kB 305*64kB 166*128kB 96*256kB 56*512kB 39*1024kB 30*2048kB 300*4096kB = 1477692kB
    Swap cache: add 288168, delete 285993, find 726/2075, race 0+0
    4059 pages of slab cache
    146 pages of kernel stacks
    388 lowmem pagetables, 638 highmem pagetables
    Free swap: 1947848 kB
    917496 pages of RAM
    869386 free pages
    30921 reserved pages
    21927 pages shared
    2175 pages swap cached
    Buffer memory: 9752 kB
    Cache memory: 34192 kB
    CLEAN: 696 buffers, 2772 kbyte, 51 used (last=696), 0 locked, 0 dirty, 0 delay
    DIRTY: 4 buffers, 16 kbyte, 4 used (last=4), 0 locked, 3 dirty, 0 delay

  • Alt-SysRq-M  RHEL4/UMA

    SysRq : Show Memory
    Meminfo:
    Free pages: 20128kB (0kB HighMem)
    Active: 72109  inactive: 27657  dirty: 1  writeback: 0  unstable: 0  free: 5032  slab: 19306  mapped: 41755  pagetables: 945
    DMA     free: 12640kB  min: 20kB   low: 40kB    high: 60kB    active: 0kB       inactive: 0kB       present: 16384kB   pages_scanned: 847  all_unreclaimable? yes
    protections[]: 0 0 0
    Normal  free: 7488kB   min: 688kB  low: 1376kB  high: 2064kB  active: 288436kB  inactive: 110628kB  present: 507348kB  pages_scanned: 0    all_unreclaimable? no
    protections[]: 0 0 0
    HighMem free: 0kB      min: 128kB  low: 256kB   high: 384kB   active: 0kB       inactive: 0kB       present: 0kB       pages_scanned: 0    all_unreclaimable? no
    protections[]: 0 0 0
    DMA:    4*4kB 4*8kB 3*16kB 4*32kB 4*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12640kB
    Normal: 1052*4kB 240*8kB 39*16kB 3*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 7488kB
    HighMem: empty
    Swap cache: add 52, delete 52, find 3/5, race 0+0
    Free swap: 1044056 kB
    130933 pages of RAM
    0 pages of HIGHMEM
    2499 reserved pages
    71122 pages shared
    0 pages swap cached

  • Alt-SysRq-M  RHEL4/NUMA

    Free pages: 16724kB (0kB HighMem)
    Active: 236461  inactive: 254776  dirty: 11  writeback: 0  unstable: 0  free: 4181  slab: 13679  mapped: 34073  pagetables: 853
    Node 1 DMA     free: 0kB      min: 0kB     low: 0kB     high: 0kB     active: 0kB       inactive: 0kB       present: 0kB        pages_scanned: 0     all_unreclaimable? no
    protections[]: 0 0 0
    Node 1 Normal  free: 2784kB   min: 1016kB  low: 2032kB  high: 3048kB  active: 477596kB  inactive: 508444kB  present: 1048548kB  pages_scanned: 0     all_unreclaimable? no
    protections[]: 0 0 0
    Node 1 HighMem free: 0kB      min: 128kB   low: 256kB   high: 384kB   active: 0kB       inactive: 0kB       present: 0kB        pages_scanned: 0     all_unreclaimable? no
    protections[]: 0 0 0
    Node 0 DMA     free: 11956kB  min: 12kB    low: 24kB    high: 36kB    active: 0kB       inactive: 0kB       present: 16384kB    pages_scanned: 1050  all_unreclaimable? yes
    protections[]: 0 0 0
    Node 0 Normal  free: 1984kB   min: 1000kB  low: 2000kB  high: 3000kB  active: 468248kB  inactive: 510660kB  present: 1032188kB  pages_scanned: 0     all_unreclaimable? no
    protections[]: 0 0 0
    Node 0 HighMem free: 0kB      min: 128kB   low: 256kB   high: 384kB   active: 0kB       inactive: 0kB       present: 0kB        pages_scanned: 0     all_unreclaimable? no
    protections[]: 0 0 0
    Node 1 DMA: empty
    Node 1 Normal: 0*4kB 0*8kB 30*16kB 10*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2784kB
    Node 1 HighMem: empty
    Node 0 DMA: 5*4kB 4*8kB 4*16kB 2*32kB 2*64kB 3*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11956kB
    Node 0 Normal: 0*4kB 0*8kB 0*16kB 0*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 1984kB
    Node 0 HighMem: empty
    Swap cache: add 44, delete 44, find 0/0, race 0+0
    Free swap: 2031432 kB
    524280 pages of RAM
    10951 reserved pages
    363446 pages shared
    0 pages swap cached

  • Alt-SysRq-T

    bash          R  current  0  1609  1606          (NOTLB)
    Call Trace:   [] snprintf [kernel] 0x27 (0xdb3c5e90)
                  [] call_console_drivers [kernel] 0x63 (0xdb3c5eb4)
                  [] printk [kernel] 0x153 (0xdb3c5eec)
                  [] printk [kernel] 0x153 (0xdb3c5f00)
                  [] show_trace [kernel] 0xd9 (0xdb3c5f0c)
                  [] show_trace [kernel] 0xd9 (0xdb3c5f14)
                  [] show_state [kernel] 0x62 (0xdb3c5f24)
                  [] __handle_sysrq_nolock [kernel] 0x7a (0xdb3c5f38)
                  [] handle_sysrq [kernel] 0x5d (0xdb3c5f58)
                  [] write_sysrq_trigger [kernel] 0x53 (0xdb3c5f7c)
                  [] sys_write [kernel] 0x97 (0xdb3c5f94)

    * this can get BIG; logged in /var/log/messages

  • Kernel profiling

    1. Enable kernel profiling.
       On the kernel boot line add profile=2 nmi_watchdog=1, i.e.
       kernel /vmlinuz-2.6.9-28.EL.smp ro profile=2 nmi_watchdog=1 root=0805
       then reboot.

    2. Create and run a shell script containing the following lines:

       #!/bin/sh
       while /bin/true; do
           echo; date
           /usr/sbin/readprofile -v | sort -nr +2 | head -15
           /usr/sbin/readprofile -r
           sleep 5
       done

  • Kernel profiling

    [root tiobench]# more rhel4_read_64k_prof.log
    Fri Jan 28 08:59:19 EST 2005
    0000000000000000  total                      239423    0.1291
    ffffffff8010e3a0  do_arch_prctl              238564  213.0036
    ffffffff80130540  del_timer                      95    0.5398
    ffffffff80115940  read_ldt                       50    0.6250
    ffffffff8015d21c  .text.lock.shmem               44    0.1048
    ffffffff8023e480  md_do_sync                     40    0.0329
    ffffffff801202f0  scheduler_tick                 38    0.0279
    ffffffff80191cf0  dma_read_proc                  30    0.2679
    ffffffff801633b0  get_unused_buffer_head         25    0.0919
    ffffffff801565d0  rw_swap_page_nolock            25    0.0822
    ffffffff8023d850  status_unused                  24    0.1500
    ffffffff80153450  scan_active_list               24    0.0106
    ffffffff801590a0  try_to_unuse                   23    0.0288
    ffffffff80192070  read_profile                   22    0.0809
    ffffffff80191f80  swaps_read_proc                18    0.1607
    Linux 2.6.9-5.ELsmp (perf1.lab.boston.redhat.com)  01/28/2005

    /usr/sbin/readprofile -v | sort -nr +2 | head -15

  • oprofile: built into RHEL4 (smp)

    opcontrol: turn data collection on/off
      --start              start collection
      --stop               stop collection
      --dump               output to disk
      --event=:name:count

    Example:
      # opcontrol --start
      # /bin/time test1 ; sleep 60
      # opcontrol --stop
      # opcontrol --dump

    opreport: analyze profile
      -r                 reverse-order sort
      -t [percentage]    threshold to view
      -f /path/filename
      -d                 details

    opannotate:
      -s /path/source
      -a /path/assembly
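    Putting the options above together as one runnable sequence (a sketch: the vmlinux path and test program are placeholders, and kernel symbols only resolve if a matching uncompressed vmlinux is available):

      opcontrol --setup --vmlinux=/boot/vmlinux-`uname -r`
      opcontrol --start
      /bin/time ./test1 ; sleep 60
      opcontrol --stop
      opcontrol --dump
      opreport -t 1 -r        # symbols above a 1% threshold, reverse sorted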

  • oprofile: opcontrol and opreport (CPU_CYCLES)

    # vmlinux 2.6.9 prep
    CPU: Itanium 2, speed 1300 MHz (estimated)
    Counted CPU_CYCLES events (CPU Cycles) with a unit mask of 0x00 (No unit mask) count 100000
    samples    %        image name    app name   symbol name
    9093689    68.9674  vmlinux       vmlinux    default_idle
    969885      7.3557  vmlinux       reread     _spin_unlock_irq
    744445      5.6459  vmlinux       reread     _spin_unlock_irqrestore
    420103      3.1861  vmlinux       vmlinux    _spin_unlock_irqrestore
    146413      1.1104  vmlinux       reread     __blockdev_direct_IO
    74918       0.5682  vmlinux       vmlinux    _spin_unlock_irq
    65213       0.4946  vmlinux       reread     kmem_cache_alloc
    59453       0.4509  vmlinux       vmlinux    dio_bio_complete
    58636       0.4447  vmlinux       reread     mempool_alloc
    56675       0.4298  scsi_mod.ko   reread     scsi_decide_disposition
    53965       0.4093  vmlinux       reread     dio_bio_complete
    53079       0.4026  vmlinux       reread     bio_check_pages_dirty
    53035       0.4022  vmlinux       vmlinux    bio_check_pages_dirty
    47430       0.3597  vmlinux       vmlinux    __end_that_request_first
    47263       0.3584  vmlinux       reread     get_request
    43383       0.3290  vmlinux       reread     __end_that_request_first
    40251       0.3053  qla2xxx.ko    reread     qla2xxx_get_port_name
    35919       0.2724  scsi_mod.ko   reread     __scsi_device_lookup
    35564       0.2697  vmlinux       reread     aio_read_evt
    32830       0.2490  vmlinux       reread     kmem_cache_free
    32738       0.2483  scsi_mod.ko   scsi_mod   scsi_remove_host

  • Profiling Tools: OProfile

    Open source project: http://oprofile.sourceforge.net
    Upstream; Red Hat contributes
    Originally modeled after DEC Continuous Profiling Infrastructure (DCPI)
    System-wide profiler (both kernel and user code)
    Sample-based profiler with SMP machine support
    Performance monitoring hardware support
    Relatively low overhead, typically

  • Profiling Tools: SystemTap

    Open Source project (started 01/05)
    Collaboration between Red Hat, Intel, and IBM
    Linux answer to Solaris DTrace
    A tool to take a deeper look into a running system:
      Provides insight into system operation
      Assists in identifying causes of performance problems
      Simplifies building instrumentation
    Current snapshots available from: http://sources.redhat.com/systemtap
    Scheduled for inclusion in Red Hat Enterprise Linux Update 2 (Fall 2005): x86, x86-64, PPC64, Itanium2

    Processing flow: probe script -> parse -> elaborate (probe set library, probe kernel object) -> translate to C, compile* -> load module, start probe -> extract output, unload -> probe output
    * Solaris DTrace is interpretive

  • How to tune Linux

    Capacity tuning: fixed by adding resources (CPU, memory, disk, network)

    Performance tuning methodology:
      1) Document config
      2) Baseline results
      3) While results are non-optimal:
         a) Monitor/instrument system and workload
         b) Apply tuning, 1 change at a time
         c) Analyze results, exit or loop
      4) Document final config
    (see the skeleton script below)

    Part 3: General System Tuning
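    A skeleton of that loop as a shell sketch (the workload command and the tunables tried are placeholders, not recommendations):

      #!/bin/sh
      uname -a > config.txt ; sysctl -a >> config.txt         # 1) document config
      ./run_workload > baseline.txt                           # 2) baseline results
      for t in "vm.swappiness=10" "vm.dirty_ratio=20"; do     # 3) one change at a time
          sysctl -w "$t"
          vmstat 5 12 > "vmstat.$t.log" &                     #    monitor/instrument
          ./run_workload > "result.$t.log"                    #    analyze, exit or loop
      done
      sysctl -a > final_config.txt                            # 4) document final config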

  • Tuning: how to set kernel parameters

    /proc:
      [root@hairball fs]# cat /proc/sys/kernel/sysrq
      0
      [root@hairball fs]# echo 1 > /proc/sys/kernel/sysrq
      [root@hairball fs]# cat /proc/sys/kernel/sysrq
      1

    sysctl command:
      [root@hairball fs]# sysctl kernel.sysrq
      kernel.sysrq = 0
      [root@hairball fs]# sysctl -w kernel.sysrq=1
      kernel.sysrq = 1
      [root@hairball fs]# sysctl kernel.sysrq
      kernel.sysrq = 1

    Edit the /etc/sysctl.conf file:
      # Kernel sysctl configuration file for Red Hat Linux
      # Controls the System Request debugging functionality of the kernel
      kernel.sysrq = 1

    Use the graphical tool /usr/bin/redhat-config-proc
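    Settings added to /etc/sysctl.conf can be applied without a reboot; one common way (sketch):

      sysctl -p /etc/sysctl.conf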

  • Capacity Tuning

    Memory: /proc/sys/vm/overcommit_memory, /proc/sys/vm/overcommit_ratio, /proc/sys/vm/max_map_count, /proc/sys/vm/nr_hugepages
    Kernel: /proc/sys/kernel/msgmax, msgmnb, msgmni, shmall, shmmax, shmmni, threads-max
    File systems: /proc/sys/fs/aio-max-nr, /proc/sys/fs/file-max
    (example sysctl.conf entries below)
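    Example /etc/sysctl.conf entries exercising the capacity tunables above (the values are illustrative only and should be sized to the application):

      kernel.shmmax = 2147483648     # largest single shared memory segment, bytes
      kernel.shmall = 524288         # total shared memory, in pages
      fs.file-max = 65536            # system-wide open file limit
      fs.aio-max-nr = 1048576        # outstanding async I/O requests
      vm.nr_hugepages = 512          # huge pages reserved at boot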

  • OOM kills: swap space exhaustion

    Meminfo:

    Zone:DMAfreepages:975min:1039low:1071high:1103

    Zone:Normalfreepages:126min:255low:1950high:2925

    Zone:HighMemfreepages:0min:0low:0high:0

    Freepages:1101(0HighMem)

    (Active:118821/401,inactive_laundry:0,inactive_clean:0,free:1101)

    aa:1938ac:18id:44il:0ic:0fr:974

    aa:115717ac:1148id:357il:0ic:0fr:126

    aa:0ac:0id:0il:0ic:0fr:0

    6*4kB0*8kB0*16kB1*32kB0*64kB0*128kB1*256kB1*512kB1*1024kB1*2048kB0*4096kB=3896kB)

    0*4kB1*8kB1*16kB1*32kB1*64kB1*128kB1*256kB0*512kB0*1024kB0*2048kB0*4096kB=504kB)

    Swapcache:add620870,delete620870,find762437/910181,race0+200

    2454pagesofslabcache

    484pagesofkernelstacks

    2008lowmempagetables,0highmempagetables

    Freeswap:0kB

    129008pagesofRAM

    0pagesofHIGHMEM

    3045reservedpages

    4009pagesshared

    0pagesswapcached

  • OOM kills: lowmem consumption

    Meminfo:

    Zone:DMAfreepages:2029min:0low:0high:0

    Zone:Normalfreepages:1249min:1279low:4544high:6304

    Zone:HighMemfreepages:746min:255low:29184high:43776

    Freepages:4024(746HighMem)

    (Active:703448/665000,inactive_laundry:99878,inactive_clean:99730,free:4024)

    aa:0ac:0id:0il:0ic:0fr:2029

    aa:128ac:3346id:113il:240ic:0fr:1249

    aa:545577ac:154397id:664813il:99713ic:99730fr:746

    1*4kB0*8kB1*16kB1*32kB0*64kB1*128kB1*256kB1*512kB1*1024kB1*2048kB1*4096kB=8116kB)

    543*4kB35*8kB77*16kB1*32kB0*64kB0*128kB1*256kB0*512kB1*1024kB0*2048kB0*4096kB=4996kB)

    490*4kB2*8kB1*16kB1*32kB1*64kB1*128kB1*256kB1*512kB0*1024kB0*2048kB0*4096kB=2984kB)

    Swapcache:add4327,delete4173,find190/1057,race0+0

    178558pagesofslabcache

    1078pagesofkernelstacks

    0lowmempagetables,233961highmempagetables

    Freeswap:8189016kB

    2097152pagesofRAM

    1801952pagesofHIGHMEM

    103982reservedpages

    115582774pagesshared

    154pagesswapcached

    Out of Memory: Killed process 27100 (oracle).

  • Performance Tuning: VM (RHEL3)

    /proc/sys/vm/bdflush
    /proc/sys/vm/pagecache
    /proc/sys/vm/inactive_clean_percent
    /proc/sys/vm/page-cluster
    /proc/sys/vm/kscand_work_percent
    Swap device location
    Kernel selection: x86 smp, x86 hugemem, x86_64 numa

  • RHEL3 /proc/sys/vm/bdflush

    int nfract;              /* Percentage of buffer cache dirty to activate bdflush */
    int ndirty;              /* Maximum number of dirty blocks to write out per wake-cycle */
    int dummy2;              /* old "nrefill" */
    int dummy3;              /* unused */
    int interval;            /* jiffies delay between kupdate flushes */
    int age_buffer;          /* Time for normal buffer to age before we flush it */
    int nfract_sync;         /* Percentage of buffer cache dirty to activate bdflush synchronously */
    int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */
    int dummy5;              /* unused */

    Example: settings for a server with ample IO config (the RHEL3 default is geared for a workstation):
      sysctl -w vm.bdflush="50 5000 0 0 200 5000 3000 60 20 0"

  • RHEL3 /proc/sys/vm/pagecache

    pagecache.minpercent: lower limit for page cache page reclaiming; kswapd will stop reclaiming page cache pages below this percent of RAM.
    pagecache.borrowpercent: kswapd attempts to keep the page cache at this percent of RAM.
    pagecache.maxpercent: upper limit for page cache page reclaiming.
      RHEL2.1: hard limit, page cache will not grow above this percent of RAM.
      RHEL3: kswapd only reclaims page cache pages above this percent of RAM.
      Increasing maxpercent will increase swapping.
    Example: echo 1 10 50 > /proc/sys/vm/pagecache

  • Performance Tuning: VM (RHEL4)

    /proc/sys/vm/swappiness
    /proc/sys/vm/dirty_ratio
    /proc/sys/vm/dirty_background_ratio
    /proc/sys/vm/vfs_cache_pressure
    /proc/sys/vm/lower_zone_protection
    Swap device location
    Kernel selection: x86 smp, x86 hugemem, x86_64 numa
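    Typical experiments with the RHEL4 VM tunables listed above (illustrative starting points, not recommendations):

      sysctl -w vm.swappiness=10             # prefer reclaiming page cache over swapping anonymous pages
      sysctl -w vm.dirty_ratio=20            # % of memory dirty before writers are throttled
      sysctl -w vm.dirty_background_ratio=5  # % of memory dirty before pdflush starts writeback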

  • X86 standard kernel (no PAE, 3G/1G): UP systems with
  • Zone:DMAfreepages:2207min:0low:0high:0

    Zone:Normalfreepages:484min:1279low:4544high:6304

    Zone:HighMemfreepages:266min:255low:61952high:92928

    Freepages:2957(266HighMem)

    (Active:245828/1297300,inactive_laundry:194673,inactive_clean:194668,free:2957)

    aa:0ac:0id:0il:0ic:0fr:2207

    aa:630ac:1009id:189il:233ic:0fr:484

    aa:195237ac:48952id:1297057il:194493ic:194668fr:266

    1*4kB1*8kB1*16kB1*32kB1*64kB0*128kB0*256kB1*512kB0*1024kB0*2048kB2*4096kB=8828kB)

    48*4kB8*8kB97*16kB4*32kB0*64kB0*128kB0*256kB0*512kB0*1024kB0*2048kB0*4096kB=1936kB)

    12*4kB1*8kB1*16kB1*32kB1*64kB1*128kB1*256kB1*512kB0*1024kB0*2048kB0*4096kB=1064kB)

    Swapcache:add3838024,delete3808901,find107105/1540587,race0+2

    138138pagesofslabcache

    1100pagesofkernelstacks

    0lowmempagetables,37046highmempagetables

    Freeswap:3986092kB

    4194304pagesofRAM

    3833824pagesofHIGHMEM

    Kernel selection (16GB x86 running SMP)

  • aa:0ac:0id:0il:0ic:0fr:0

    aa:901913ac:1558id:61553il:11534ic:6896fr:10539

    aa:0ac:0id:0il:0ic:0fr:0

    aa:0ac:0id:0il:0ic:0fr:0

    aa:867678ac:879id:100296il:19880ic:10183fr:17178

    aa:0ac:0id:0il:0ic:0fr:0

    aa:0ac:0id:0il:0ic:0fr:0

    aa:869084ac:1449id:100926il:18792ic:11396fr:14445

    aa:0ac:0id:0il:0ic:0fr:0

    aa:0ac:0id:0il:0ic:0fr:2617

    aa:769ac:2295id:256il:2ic:825fr:861136

    aa:0ac:0id:0il:0ic:0fr:0

    Swapcache:add2633120,delete2553093

    x86_64 numa

  • CPU Scheduler

    Recognizes differences between logical and physical processors (i.e. multi-core, hyperthreaded chips/sockets)
    Optimizes process scheduling to take advantage of shared on-chip cache and NUMA memory nodes
    Implements multilevel run queues for sockets and cores (as opposed to one run queue per processor or per system)
    Strong CPU affinity avoids task bouncing
    Requires system BIOS to report CPU topology correctly

    (Diagram: scheduler compute queues feeding processes to Socket 0 (Core 0 and Core 1, each with Thread 0 / Thread 1), Socket 1 (Thread 0 / Thread 1) and Socket 2)

  • NUMA considerations

    Red Hat Enterprise Linux 4 provides improved NUMA support over version 3
    Goal: locate application pages in low-latency memory (local to the CPU)
    AMD64, Itanium2
    Enabled by default (or boot command line numa=[on,off])
    numactl to set up NUMA behavior (see the sketch after the chart below)
    Used by latest TPC-H benchmark (>5% gain)

    (Chart: RHEL4 U2, HP DL585 4 dual-core AMD64, McCalpin Stream Copy b(x)=a(x); copy bandwidth (0-8000 MB/sec) for 1, 2, 4, 8 threads with numa=off vs numa=on, plus % gain numa vs non-numa (0-120%))
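    A minimal numactl sketch for pinning a bandwidth-sensitive process to one node (flags as in the numactl shipped with RHEL4; ./stream is the Stream binary used elsewhere in this talk):

      numactl --hardware                            # show node/CPU/memory topology
      numactl --cpunodebind=0 --membind=0 ./stream
      numastat                                      # per-node hit/miss counters (see the free/numastat slide)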

  • Disk IO

    IO stack lun limits:
      RHEL3: 255 in SCSI stack
      RHEL4: 2**20; 18k useful with Fibre Channel
    /proc/scsi tuning: queue depth tuning per lun
      edit RHEL3 /etc/modules.conf, RHEL4 modprobe.conf
    IRQ distribution: default, smp_affinity mask
      echo 03 > /proc/irq/<irq#>/smp_affinity
    Scalability:
      Luns tested up to 64 luns, 74k IO/sec
      Nodes tested up to 20 nodes w/ DIO

  • Asynchronous I/O to File Systems

    Allows the application to continue processing while I/O is in progress
    Eliminates the synchronous I/O stall
    Critical for I/O-intensive server applications
    Red Hat Enterprise Linux feature since 2002: support for RAW devices only
    With Red Hat Enterprise Linux 4, significant improvement:
      Support for Ext3, NFS, GFS file system access
      Supports Direct I/O (e.g. database applications)
      Makes benchmark results more appropriate for real-world comparisons

    (Diagram: with synchronous I/O the application stalls between I/O request issue and I/O request completion at the device driver; with asynchronous I/O there is no stall for completion)

  • (Charts: RHEL4 U2 FC AIO read and write performance; MB/sec vs number of AIOs (1-64) for 4k, 8k, 16k, 32k and 64k transfer sizes; read scale 0-160 MB/sec, write scale 0-180 MB/sec)

  • Performance Tuning: Disk (RHEL3)

    [root@dhcp83-36 sysctl]# /sbin/elvtune /dev/hda
    /dev/hda elevator ID 0
      read_latency:       2048
      write_latency:      8192
      max_bomb_segments:  6

    [root@dhcp83-36 sysctl]# /sbin/elvtune -r 1024 -w 2048 /dev/hda
    /dev/hda elevator ID 0
      read_latency:       1024
      write_latency:      2048
      max_bomb_segments:  6

  • Disk IO tuning: RHEL4

    RHEL4: 4 tunable I/O schedulers
      CFQ (elevator=cfq): Completely Fair Queuing; default; balanced; fair for multiple luns, adaptors, SMP servers
      NOOP (elevator=noop): no operation in kernel; simple, low CPU overhead; leaves optimization to ramdisk, RAID controller, etc.
      Deadline (elevator=deadline): optimizes for run-time-like behavior; low latency per IO; balances issues with large-IO luns/controllers
      Anticipatory (elevator=as): inserts delays to help the stack aggregate IO; best on systems with limited physical IO (SATA)
    Set at boot time on the kernel command line (see the grub example below)
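    A sketch of selecting an elevator on the boot line in /boot/grub/grub.conf (kernel version and root device are illustrative):

      kernel /vmlinuz-2.6.9-5.ELsmp ro root=/dev/sda2 elevator=deadline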

  • File Systems

    Separate swap and busy partitions, etc.
    EXT2/EXT3 (separate talk): http://www.redhat.com/support/wpapers/redhat/ext3/*.html
    tune2fs or mount options:
      data=ordered: only metadata journaled
      data=journal: both metadata and data journaled
      data=writeback: use with care!
    Set up the default block size at mkfs time (-b XX); a sketch follows this slide
    RHEL4 EXT3 improvements:
      scalability up to 5 million files per file system
      sequential write improved by using block reservations
      file system size increased up to 8TB
    GFS (Global File System): cluster file system
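    A short sketch of the mkfs/mount options above (device, block size and mount point are illustrative):

      mkfs -t ext3 -b 4096 /dev/sdb1
      mount -t ext3 -o data=ordered /dev/sdb1 /perf1
      tune2fs -o journal_data_writeback /dev/sdb1    # or set the journal mode in the superblock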

  • Part 4: RHEL3 vs RHEL4 Performance Case Study

    Scheduler: O(1), taskset
    IOzone RHEL3/4: EXT3, GFS, NFS
    OLTP Oracle 10G: o_direct, async IO, hugemem / huge pages
    RHEL IO elevators

  • IOzone Benchmark: http://www.iozone.org/

    IOzone is a filesystem benchmark tool. The benchmark tests file I/O performance for the following operations:
      Write, re-write, random write
      Read, re-read, random read, read backwards, read strided, pread
      Fread, fwrite, mmap, aio_read, aio_write

  • IOzone Sample Output

    (3-D surface chart: RHEL4 Ext3 sequential write, 100 MB; bandwidth in KB/sec (0-60000) as a function of transfer size (bytes) and file size (k))

  • Understanding IOzone Results

    GeoMean per category is statistically meaningful
    Understand the HW setup: disk, RAID, HBA, PCI
    Layout of file systems: LVM or MD devices, partitions w/ fdisk
    Baseline raw IO with dd/dt, then EXT3 perf w/ IOzone:
      In-cache: file sizes which fit; goal -> 90% of memory BW
      Out-of-cache: file sizes more than 2x memory size
      O_DIRECT: 95% of raw
      Global File System (GFS) goal -> 90-95% of local EXT3

    Use raw commands:
      fdisk /dev/sdX
      raw /dev/raw/rawX /dev/sdX1
      dd if=/dev/raw/rawX bs=64k

    Mount file system:
      mkfs -t ext3 /dev/sdX1
      mount -t ext3 /dev/sdX1 /perf1

    IOzone commands:
      iozone -a -f /perf1/t1           (in cache)
      iozone -a -I -f /perf1/t1        (w/ dio)
      iozone -s 2xmem -f /perf1/t1     (big)

  • NFS vs EXT3 Comparison

    (Chart: IOzone cached, RHEL4 U2 EXT3 vs NFS, GeoMean of 1mb-4gb files, 1k-1m transfers; bandwidth (0-2000000) and % difference (0-120%) for Fwrite, Re-fwrite, Fread, Re-fread and Overall GeoMean; series R4_U2 EXT3 vs R4_U2_NFS)

  • GFS vs EXT3 IOzone Comparison

    (Chart: IOzone cached, RHEL4 U2 EXT3 vs GFS, GeoMean of 1mb-4gb files, 1k-1m transfers; bandwidth (0-2000000) and % difference (88-98%) for Fwrite, Re-fwrite, Fread, Re-fread and Overall GeoMean; series R4_U2 vs R4_U2_GFS)

  • Using IOzone w/ o_direct to mimic a database

    Problem:
      File systems use memory for the file cache
      Databases use memory for the database cache
      Users want a file system for management outside database access (copy, backup, etc.)
      You DON'T want BOTH to cache.
    Solution:
      File systems that support Direct IO
      Open files with the o_direct option
      Databases which support Direct IO (Oracle)
      NO DOUBLE CACHING!

  • NFS vs EXT3 DIO IOzone Comparison

    (Chart: IOzone (DIO), RHEL4 U2 EXT3 vs NFS, GeoMean of 1mb-4gb files, 1k-1m transfers; bandwidth (0-100000) and % difference (0-70%) for Writer, Re-writer, Reader, Re-reader, Random Read, Random Write, Backward Read, Record Rewrite, Stride Read and Overall GeoMean; series R4_U2 EXT3 vs R4_U2_NFS)

  • GFS: Global Cluster File System

    GFS: separate summit talk
      V6.0 shipping in RHEL3
      V6.1 ships w/ RHEL4 U1
    Hint at GFS performance in RHEL3 (data from a different server/setup):
      HP AMD64, 4 cpu, 2.4GHz, 8GB memory
      1 QLA2300 Fibre Channel, 1 EVA5000
    Compared GFS iozone to EXT3

  • GFS vs EXT3 DIO IOzone Comparison

    (Chart: IOzone (DIO), RHEL4 U2 EXT3 vs GFS, GeoMean of 1mb-4gb files, 1k-1m transfers; bandwidth (0-120000) and % difference (85-115%) for Writer, Re-writer, Reader, Re-reader, Random Read, Random Write, Backward Read, Record Rewrite, Stride Read and Overall GeoMean; series R4_U2 vs R4_U2_GFS)

  • Evaluating Oracle Performance

    Use an OLTP workload based on TPC-C
    Results with various Oracle tuning options:
      RAW vs EXT3 w/ o_direct (i.e. direct IO in iozone)
      ASYNC IO options w/ Oracle, supported in RHEL4/EXT3
      HUGEMEM kernels on x86
    Results comparing RHEL4 IO schedulers: CFQ, DEADLINE, NOOP, AS, RHEL3 baseline

  • Oracle 10G OLTP: ext3, gfs/nfs; sync/aio/dio

    AIO in Oracle 10G:
      cd $ORACLE_HOME/rdbms/lib
      make -f ins_rdbms.mk async_on
      make -f ins_rdbms.mk ioracle

    Add to init.ora (usually in $ORACLE_HOME/dbs):
      disk_asynch_io=true            # for raw
      filesystemio_options=asynch
      filesystemio_options=directio
      filesystemio_options=setall

  • Oracle OLTP Filesystem Performance

    (Chart: RHEL4 U2 Oracle 10G OLTP performance with different file systems; transactions/minute (TPM, 0-10000) for EXT3, NFS and GFS under OLTP sync io, OLTP dio, OLTP aio, and OLTP aio+dio)

  • Disk IO elevators

    RHEL3: general purpose I/O elevator with tunable parameters
    RHEL4: 4 tunable I/O elevators:
      CFQ: Completely Fair Queuing
      NOOP: no operation in kernel
      Deadline: optimize for run time
      Anticipatory: optimize for interactive response
    2 Oracle 10G workloads:
      OLTP: 4k random, 50% read / 50% write
      DSS: 32k-256k sequential read

  • RHEL4 IO schedulers vs RHEL3 for Database: Oracle 10G OLTP/DSS (relative performance)

    (Chart, % tran/min and % queries/hour on a 0-125% scale; schedulers listed: AS, NOOP, RHEL3, Deadline, CFQ; value labels 100.0%, 87.2%, 84.1%, 77.7%, 28.4% (OLTP) and 100.0%, 108.9%, 84.8%, 75.9%, 23.2% (DSS))

  • HugeTLBFS

    The Translation Lookaside Buffer (TLB) is a small CPU cache of recently used virtual-to-physical address mappings
    TLB misses are extremely expensive on today's very fast, pipelined CPUs
    Large memory applications can incur high TLB miss rates
    HugeTLBs permit memory to be managed in very large segments
      E.g. Itanium: standard page 16KB, default huge page 256MB, a 16000:1 difference
    File system mapping interface (see the hugetlbfs sketch below)
    Ideal for databases
      E.g. the TLB can fully map a 32GB Oracle SGA

    (Diagram: TLB (128 data / 128 instruction entries) mapping the virtual address space to physical memory)
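    A minimal sketch of the file system mapping interface mentioned above (the mount point is illustrative; huge pages must already be reserved via nr_hugepages):

      mkdir -p /mnt/huge
      mount -t hugetlbfs none /mnt/huge
      # applications can then mmap files under /mnt/huge to get huge-page-backed memory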

  • hugemem kernel (4G/4G)

    (Chart: RHEL3 U6 with Oracle 10g TPC-C results (tpmC, 0-40000) comparing the hugemem kernel with and without hugepages enabled: EXT3 no hugepages, RAW no hugepages, EXT3 with hugepages, RAW with hugepages.
    Tests performed on a 2-Xeon EM64T cpu HT system with 6G RAM and 14 spindles using mdadm raid0.)

  • Linux Performance Tuning Summary

    Linux performance monitoring tools: *stat, /proc/*, top, sar, ps, oprofile
    Determine capacity vs tunable performance issue
    Tune OS parameters and repeat
    RHEL4 vs RHEL3 performance comparison:
      Have it your way: IO with 4 IO schedulers
      EXT3 improved block reservations, up to 3x!
      GFS within 95% of EXT3; NFS improves with EXT3
      Oracle w/ FS o_direct, aio, huge pages: 95% of raw

  • Questions?

  • top: 2 streams running on 2 dual-core AMD cpus

    1) Sometimes the scheduler chooses the cpu pair on one memory interface, depending on OS state:

    Tasks: 101 total, 3 running, 96 sleeping, 0 stopped, 0 zombie
    Cpu0:   0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu1:   0.1% us, 0.1% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu2: 100.0% us, 0.0% sy, 0.0% ni,   0.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu3: 100.0% us, 0.0% sy, 0.0% ni,   0.0% id, 0.0% wa, 0.0% hi, 0.0% si

    2) Scheduler w/ taskset -c cpu# ./stream: round robin odd, then even cpus:

    Tasks: 101 total, 2 running, 96 sleeping, 0 stopped, 0 zombie
    Cpu0:   0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu1: 100.0% us, 0.0% sy, 0.0% ni,   0.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu2:   0.0% us, 0.3% sy, 0.0% ni,  99.7% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu3: 100.0% us, 0.0% sy, 0.0% ni,   0.0% id, 0.0% wa, 0.0% hi, 0.0% si
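    A minimal taskset sketch reproducing case 2 above (CPU numbers follow the odd-CPU binding shown; adjust to the actual topology):

      taskset -c 1 ./stream &
      taskset -c 3 ./stream &
      wait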

  • McCalpin Stream on 2-cpu dual-core (4 CPUs), binding via taskset

    (Charts: RHEL4 U1, 2-cpu dual-core AMD64; McCalpin Stream Copy b(x)=a(x) and Triad c(x)=a(x)+b(x)*c(x); bandwidth in MB/sec (0-7000) for 1, 2, 4 CPUs, comparing the default scheduler to taskset affinity (Copy vs Copy w/ Aff, Triad smp vs Triad w/ Aff))
    c