June 22, 2006 Software Engineering Working Group Meeting 2
Motivation
- Petascale systems with 100K - 500K proc: trend or one-off?
  - LLNL: 128K proc IBM BG/L
  - IBM Watson: 40K proc IBM BG/L
  - Sandia: 10K proc Cray Red Storm
  - ORNL/NCCS: 5K proc Cray XT3; 10K (end of summer) -> 20K (Nov 06) -> ?
  - ANL: large IBM BG/P system
- We have prototypes for Petascale systems!
June 22, 2006 Software Engineering Working Group Meeting 3
Motivation (cont'd)
- Prototype Petascale application? POP @ 0.1 degree
  - BGW 30K proc --> 7.9 years/wallclock day
  - RedStorm 8K proc --> 8.1 years/wallclock day
- Can CCSM be a Petascale application? Look at each component separately:
  - Current scalability limitations
  - Changes necessary to enable execution on large processor counts
  - Check scalability on BG/L
June 22, 2006 Software Engineering Working Group Meeting 4
Motivation (cont'd)
Why examine scalability on BG/L?
- Prototype for a Petascale system
- Access to large processor counts
  - 2K easily; 40K through Blue Gene Watson Days
- Scalable architecture
- Limited memory: 256 MB (virtual-node mode), 512 MB (coprocessor mode)
- Dedicated resources give reproducible timings
- Lessons translate to other systems [Cray XT3]
June 22, 2006 Software Engineering Working Group Meeting 5
Outline:
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions
June 22, 2006 Software Engineering Working Group Meeting 6
Parallel Ocean Program (POP)
- Modified the POP 2.0 base code
  - Reduce execution time / improve scalability
  - Minor changes (~9 files):
    - Rework barotropic solver
    - Improve load balancing (space-filling curves)
    - Pilfered CICE boundary exchange [NEW]
- Significant advances in performance
  - POP @ 1 degree: 128 POWER4 processors --> 2.1x
  - POP @ 0.1 degree: 30K BG/L processors --> 2x; 8K RedStorm processors --> 1.3x
June 22, 2006 Software Engineering Working Group Meeting 7
POP using 20x24 blocks (gx1v3)
POP data structure:
- Flexible block structure with land-'block' elimination
- Small blocks
  - Better load balance, more land-block elimination
  - Larger halo overhead
- Larger blocks
  - Smaller halo overhead
  - Load imbalance
  - No land-block elimination
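As a rough illustration of the land-'block' elimination above (names are hypothetical, not the actual POP routines), a block is kept in the work list only if it contains at least one ocean point:

module land_block_mod
   implicit none
contains
   ! Keep only blocks that contain at least one ocean point;
   ! all-land blocks are dropped from the work list entirely.
   subroutine eliminate_land_blocks(mask_ocean, active_blocks, nactive)
      logical, intent(in)  :: mask_ocean(:,:,:)   ! (bx, by, nblocks), .true. = ocean
      integer, intent(out) :: active_blocks(:)    ! filled with surviving block ids
      integer, intent(out) :: nactive
      integer :: ib
      nactive = 0
      do ib = 1, size(mask_ocean, 3)
         if (any(mask_ocean(:,:,ib))) then
            nactive = nactive + 1
            active_blocks(nactive) = ib
         end if
      end do
   end subroutine eliminate_land_blocks
end module land_block_mod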
June 22, 2006 Software Engineering Working Group Meeting 8
Outline:
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions
June 22, 2006 Software Engineering Working Group Meeting 9
Alternate Data Structure
2D data structure
- Advantages
  - Regular stride-1 access
  - Compact form of the stencil operator
- Disadvantages
  - Includes land points
  - Problem-specific data structure

1D data structure
- Advantages
  - No more land points
  - General data structure
- Disadvantages
  - Indirect addressing
  - Larger stencil operator
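A minimal sketch of the 1D layout and its trade-off (illustrative names only, not the POP2 routines): ocean points are packed into one array and the stencil uses precomputed neighbor indices, so land points disappear but every access is indirect. A 5-point stencil is shown for brevity; the barotropic operator uses a larger one.

module stencil_1d_mod
   implicit none
contains
   ! Apply a 5-point stencil over packed ocean points (no land points).
   ! X(1:nActive) are local ocean points; X(nActive+1:n) hold halo copies.
   subroutine stencil_1d(nActive, A0, AN, AS, AE, AW, iN, iS, iE, iW, X, Y)
      integer, intent(in)  :: nActive
      real(8), intent(in)  :: A0(:), AN(:), AS(:), AE(:), AW(:)  ! coefficients
      integer, intent(in)  :: iN(:), iS(:), iE(:), iW(:)         ! neighbor indices
      real(8), intent(in)  :: X(:)
      real(8), intent(out) :: Y(:)
      integer :: i
      do i = 1, nActive
         Y(i) = A0(i)*X(i) + AN(i)*X(iN(i)) + AS(i)*X(iS(i)) &
              + AE(i)*X(iE(i)) + AW(i)*X(iW(i))                  ! indirect addressing
      end do
   end subroutine stencil_1d
end module stencil_1d_mod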
June 22, 2006 Software Engineering Working Group Meeting 10
Using 1D data structures in POP2 solver (serial)
- Replace solvers.F90
- Execution time on cache-based microprocessors
- Examine two CG algorithms w/ diagonal preconditioner:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: test [128x192 grid points] w/ (16x16) blocks
June 22, 2006 Software Engineering Working Group Meeting 11
Serial execution time on IBM POWER4 (test)
[Bar chart] Seconds for 20 timesteps on a 1.3 GHz POWER4 for PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D.
56% reduction in cost/iteration
June 22, 2006 Software Engineering Working Group Meeting 12
Using 1D data structure in POP2 solver (parallel)
- New parallel halo update
- Examine several CG algorithms w/ diagonal preconditioner:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology: Hypre (LLNL)
  - http://www.llnl.gov/CASC/linear_solvers
  - PCG solver
  - Preconditioners: diagonal
June 22, 2006 Software Engineering Working Group Meeting 13
Solver execution time for POP2 (20x24) on BG/L (gx1v3)
[Bar chart] Seconds for 200 timesteps on 64 processors for PCG2+2D, PCG1+2D, PCG2+1D, PCG1+1D, and Hypre (PCG+Diag).
48% cost/iteration
27% cost/iteration
June 22, 2006 Software Engineering Working Group Meeting 14
Outline:
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions
June 22, 2006 Software Engineering Working Group Meeting 15
CICE boundary exchange
- POP applies a 2D boundary exchange to 3D variables
  - 3D update: 2-33% of total time
- Specialized 3D boundary exchange
  - Reduces message count
  - Increases message length
  - Reduces dependence on machine latency
- Pilfer the CICE 4.0 boundary exchange: code reuse! :-)
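A sketch of the message-aggregation idea only, not the actual CICE 4.0 routine (the subroutine name, arguments, and single-neighbor scope are assumptions for illustration): all nz levels of one block edge are packed into a single buffer and exchanged in one message, instead of nz separate 2D updates.

subroutine halo_east_3d(nx, ny, nz, nghost, east_rank, comm, fld)
   use mpi
   implicit none
   integer, intent(in)    :: nx, ny, nz, nghost, east_rank, comm
   real(8), intent(inout) :: fld(nx,ny,nz)
   real(8) :: sbuf(nghost*ny*nz), rbuf(nghost*ny*nz)
   integer :: ierr, stat(MPI_STATUS_SIZE)
   ! pack every vertical level of the east edge once -> one long message
   sbuf = reshape(fld(nx-2*nghost+1:nx-nghost, :, :), [nghost*ny*nz])
   call MPI_Sendrecv(sbuf, size(sbuf), MPI_DOUBLE_PRECISION, east_rank, 101, &
                     rbuf, size(rbuf), MPI_DOUBLE_PRECISION, east_rank, 101, &
                     comm, stat, ierr)
   ! unpack the neighbor's data into the local ghost cells
   fld(nx-nghost+1:nx, :, :) = reshape(rbuf, [nghost, ny, nz])
end subroutine halo_east_3d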
June 22, 2006 Software Engineering Working Group Meeting 16
Simulation rate of POP @ gx1v3 on IBM POWER4
[Line chart] Simulated years/wallclock day vs. processor count (16, 32, 64, 80, 128) for 20x24+SFC+NB (PCG1+1D), 20x24+SFC (PCG1+1D), single block (PCG1+1D), and single block (PCG1+2D).
50% of time in solver
June 22, 2006 Software Engineering Working Group Meeting 17
Performance of POP@gx1v3
- Three code modifications:
  - 1D data structure
  - Space-filling curves
  - CICE boundary exchange
- Cumulative impact is huge
  - Separately: 10-20% each
  - Together: 2.1x on 128 processors
- Small improvements add up!
June 22, 2006 Software Engineering Working Group Meeting 18
Outline:
- Motivation
- POP
  - New barotropic solver
  - CICE boundary exchange
  - Space-filling curves
- CICE
- CLM
- CAM + Coupler
- Conclusions
June 22, 2006 Software Engineering Working Group Meeting 19
Partitioning with Space-filling Curves
- Map 2D -> 1D
- Curves come in a variety of sizes:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p) [New]
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [New]
- Partition the 1D array
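Once the ocean blocks are ordered along the curve, partitioning reduces to slicing the 1D list into npes contiguous pieces. A minimal sketch with illustrative names (not the POP2 decomposition code):

subroutine partition_sfc(nblocks, npes, sfc_order, owner)
   implicit none
   integer, intent(in)  :: nblocks, npes
   integer, intent(in)  :: sfc_order(nblocks)   ! block ids in space-filling-curve order
   integer, intent(out) :: owner(nblocks)       ! owner(block id) = processor 0..npes-1
   integer :: i
   do i = 1, nblocks
      owner(sfc_order(i)) = ((i-1)*npes)/nblocks   ! contiguous, near-equal slices
   end do
end subroutine partition_sfc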
June 22, 2006 Software Engineering Working Group Meeting 20
Partitioning with SFC
Partition for 3 processors
June 22, 2006 Software Engineering Working Group Meeting 21
POP using 20x24 blocks (gx1v3)
June 22, 2006 Software Engineering Working Group Meeting 22
POP (gx1v3) + Space-filling curve
June 22, 2006 Software Engineering Working Group Meeting 23
Space-filling curve (Hilbert Nb=24)
June 22, 2006 Software Engineering Working Group Meeting 25
Space-filling curve partition for 8 processors
June 22, 2006 Software Engineering Working Group Meeting 26
0.1 degree POP
- Global eddy-resolving
- Computational grid: 3600 x 2400 x 40
- Land creates problems: load imbalances, scalability
- Alternative partitioning algorithm: space-filling curves
- Evaluate using benchmark: 1 day / internal grid / 7-minute timestep
June 22, 2006 Software Engineering Working Group Meeting 27
POP 0.1 degree benchmark on Blue Gene/L
June 22, 2006 Software Engineering Working Group Meeting 28
POP 0.1 degree benchmark
Courtesy of Y. Yoshida, M. Taylor, P. Worley
50% of time in solver
33% of time in 3D-update
June 22, 2006 Software Engineering Working Group Meeting 29
Remaining Issues: POP
Parallel I/O:
- Decomposition is in the vertical
  - Only parallel for 3D fields
  - Needs all-to-one communication
- Need parallel I/O for 2D fields
- Example: 0.1 degree POP on 30K BG/L processors
  - Time to compute 1 day: 30 seconds
  - Time to read in 2D forcing files: 22 seconds
June 22, 2006 Software Engineering Working Group Meeting 30
Impact of 2x increase in simulation rate
- IPCC AR5 control run [1000 years]
  - 5 years per day ~= 6 months
  - 10 years per day ~= 3 months
- Huge jump in scientific productivity
  - Search a larger parameter space
  - Longer sensitivity studies
  - -> Find and fix problems much quicker
- What about the entire coupled system?
June 22, 2006 Software Engineering Working Group Meeting 31
Outline:
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions
June 22, 2006 Software Engineering Working Group Meeting 32
CICE: Sea-ice Model
- Shares grid and infrastructure with POP
- CICE 4.0
  - Not quite ready for general release
  - Sub-block data structures (POP2)
  - Minimal experience with the code base (<2 weeks)
- Reuse techniques from the POP2 work
- Partition the grid using weighted space-filling curves?
June 22, 2006 Software Engineering Working Group Meeting 33
Weighted Space-filling curves
Estimate the work for each grid block:

  Work_i = w0 + P_i * w1

- w0: fixed work for all blocks
- w1: work if the block contains sea-ice
- P_i: probability the block contains sea-ice
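A sketch of how this estimate could drive the partition (illustrative names only; the actual weighted-SFC code is not shown in these slides): walk the blocks in curve order and cut whenever the accumulated work reaches the next processor's quota.

subroutine partition_wsfc(nblocks, npes, sfc_order, prob_ice, w0, w1, owner)
   implicit none
   integer, intent(in)  :: nblocks, npes
   integer, intent(in)  :: sfc_order(nblocks)   ! block ids in curve order
   real(8), intent(in)  :: prob_ice(nblocks)    ! P_i: probability of sea-ice in block i
   real(8), intent(in)  :: w0, w1               ! fixed work, extra work if ice present
   integer, intent(out) :: owner(nblocks)
   real(8) :: work(nblocks), quota, acc
   integer :: i, p
   work  = w0 + prob_ice*w1                     ! Work_i = w0 + P_i*w1
   quota = sum(work)/real(npes,8)               ! target work per processor
   acc = 0.0d0
   p   = 0
   do i = 1, nblocks                            ! walk the curve, cut at work quotas
      owner(sfc_order(i)) = p
      acc = acc + work(sfc_order(i))
      if (acc >= real(p+1,8)*quota .and. p < npes-1) p = p + 1
   end do
end subroutine partition_wsfc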
June 22, 2006 Software Engineering Working Group Meeting 34
Weighted Space-filling curves (cont'd)
- Probability a block contains sea-ice depends on the climate scenario:
  - Control run, paleo, CO2 doubling
- Estimate of probability: a bad estimate -> slower simulation rate
- Weight the space-filling curve: partition for equal amounts of work
June 22, 2006 Software Engineering Working Group Meeting 35
Partitioning with w-SFC
Partition for 5 processors
June 22, 2006 Software Engineering Working Group Meeting 36
Remaining issues: CICE
- Parallel I/O
- Examine scalability with w-SFC
  - Active sea-ice covers ~15% of the ocean grid
  - Estimate for 0.1 degree:
    - RedStorm: ~4,000 processors
    - Blue Gene/L: ~10,000 processors
- Stay tuned!
June 22, 2006 Software Engineering Working Group Meeting 37
Outline:
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions
June 22, 2006 Software Engineering Working Group Meeting 38
Community Land Model (CLM2)
- Fundamentally a scalable code: no communication between grid points
- Has some serial components...
  - River Transport Model (RTM)
  - Serial I/O (collect on processor 0)
June 22, 2006 Software Engineering Working Group Meeting 40
Why is Serial code Evil?
- Seems innocent at first
- Leads to much larger problems
- Serial code:
  - Performance bottleneck for the code
  - Excessive memory usage
    - Collecting data on one processor
    - Message-passing information
June 22, 2006 Software Engineering Working Group Meeting 41
Cost of message passing information
- Parallel code: each processor communicates with a small number of neighbors
  - O(1) information
- Single serial component: one processor communicates with all processors
  - O(npes) information
June 22, 2006 Software Engineering Working Group Meeting 42
Memory usage in subroutine initDecomp
- Four integer arrays: dimension(ancells, npes)
  - ancells: number of land sub-grid points (~20,000)
- On 128 processors: 4*4*128*20,000 = 39 Mbytes per processor
- On 1,024 processors: 4*4*1024*20,000 = 312 Mbytes per processor
- On 10,000 processors: 4*4*10,000*20,000 = 2.98 Gbytes per processor
  -> 29 Tbytes across the entire system
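For reference, the per-processor numbers above are just (4 arrays) x (4 bytes) x ancells x npes; a few lines reproduce them:

program initdecomp_memory
   implicit none
   integer, parameter :: ancells = 20000            ! land sub-grid points
   integer :: npes(3) = (/ 128, 1024, 10000 /)
   integer :: i
   real(8) :: mbytes
   do i = 1, size(npes)
      ! 4 integer arrays, 4 bytes each, dimension(ancells, npes)
      mbytes = 4.0d0 * 4.0d0 * ancells * npes(i) / 2.0d0**20
      print '(a,i6,a,f8.1,a)', ' npes = ', npes(i), ':', mbytes, ' Mbytes per processor'
   end do
end program initdecomp_memory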
June 22, 2006 Software Engineering Working Group Meeting 43
Memory use in CLM
- Subroutine initDecomp deallocates the large arrays
- CLM configuration: 1x1.25 grid, no RTM, MAXPATCH_PFT = 4, no CN, no DGVM
- Measure stack and heap on 32-512 BG/L processors
June 22, 2006 Software Engineering Working Group Meeting 44
Memory use for CLM on BG/L
June 22, 2006 Software Engineering Working Group Meeting 45
Non-scalable memory usage
- A common problem
  - Easy to ignore on 128 processors
  - Fatal on large processor counts
- Avoid array dimensions that scale with npes; use fixed sizes
- Eliminate serial code!!
- Re-evaluate initialization code (is it scalable?)
- Remember: innocent-looking non-scalable code can kill!
June 22, 2006 Software Engineering Working Group Meeting 46
Outline:
- Motivation
- POP
- CICE
- CLM
- CAM + Coupler
- Conclusions
June 22, 2006 Software Engineering Working Group Meeting 47
CAM + Coupler
- CAM
  - Extensive benchmarking [P. Worley]
  - Generalizing interface for modular dynamics
    - Non lat-lon grids [B. Eaton]
    - Quasi-uniform grids (cubed-sphere, icosahedral)
  - Ported to BG/L [S. Ghosh]
    - Required rewrite of I/O
    - FV-core resolution limited due to memory
- Coupler
  - Will examine a single-executable concurrent system (Summer 06)
June 22, 2006 Software Engineering Working Group Meeting 48
A Petascale coupled system
- Design principles:
  - Simple/elegant design
  - Attention to implementation details
  - Single executable -> runs on anything vendors provide
  - Minimize communication hotspots
- Concurrent execution creates hotspots
  - e.g., wastes bisection bandwidth by passing fluxes to the coupler
June 22, 2006 Software Engineering Working Group Meeting 49
A Petascale coupled system (cont'd)
- Sequential execution
  - Flux interpolation is just a boundary exchange
  - Simplifies the cost budget
  - All components must be scalable
- Quasi-uniform grid
  - Flux interpolation should be communication with a small number of nearest neighbors
  - Minimizes interpolation costs
June 22, 2006 Software Engineering Working Group Meeting 50
Possible Configuration
- CAM (100 km, L66)
- POP @ 0.1 degree (demonstrated: 30 seconds per day)
- Sea-ice @ 0.1 degree
- Land model (50 km)
- Sequential coupler
June 22, 2006 Software Engineering Working Group Meeting 51
High-Resolution CCSM on ~30K BG/L processors
Time per day (secs):

Component              Demonstrated?       Budget   Actual
POP @ 0.1 degree       Yes [03/29/06]        30      30.1
Sea-ice @ 0.1 degree   No  [Summer 06]        8
Land (50 km)           No  [Summer 06]        5
Atm + Chem (100 km)    No  [Fall 06]         77
Coupler                No  [Fall 06]         10
Total                  No  [Spring 07]      130
~1.8 years/wallclock day
June 22, 2006 Software Engineering Working Group Meeting 52
Conclusions
- Examined the scalability of several components on BG/L
  - Stress the limits of resolution and processor count
  - Uncover problems in the code
- It is possible to use large processor counts: POP @ 0.1 degree
  - Results obtained by modifying ~9 files
  - BGW 30K proc --> 7.9 years/wallclock day
    - 33% of time in 3D update -> CICE boundary exchange
  - RedStorm 8K proc --> 8.1 years/wallclock day
    - 50% of time in solver -> use a preconditioner
June 22, 2006 Software Engineering Working Group Meeting 53
Conclusions (cont'd)
- CICE needs:
  - Improved load balancing (w-SFC)
- CLM needs:
  - Parallelized RTM and I/O
  - Cleanup of non-scalable data structures
- Common issues:
  - Focus on returning advances into the models
    - Vector mods in POP? Parallel I/O in CAM? High-resolution CRIEPI work?
  - Parallel I/O
  - Eliminate all serial code!
  - Watch the memory usage
June 22, 2006 Software Engineering Working Group Meeting 54
Conclusions (cont'd)
- Efficient use of a Petascale system is possible!
- Path to Petascale computing:
  1. Test the limits of our codes
  2. Fix the resulting problems
  3. Goto 1.
June 22, 2006 Software Engineering Working Group Meeting 55
Acknowledgements/Questions?
Thanks to: D. Bailey (NCAR), F. Bryan (NCAR), T. Craig (NCAR), J. Edwards (IBM), E. Hunke (LANL), B. Kadlec (CU), E. Jessup (CU), P. Jones (LANL), K. Lindsay (NCAR), W. Lipscomb (LANL), M. Taylor (SNL), H. Tufo (NCAR), M. Vertenstein (NCAR), S. Weese (NCAR), P. Worley (ORNL)

Computer time:
- Blue Gene/L time: NSF MRI grant, NCAR, University of Colorado, IBM (SUR) program, BGW Consortium Days, IBM Research (Watson), LLNL
- RedStorm time: Sandia
June 22, 2006 Software Engineering Working Group Meeting 56
! One iteration of the preconditioned CG solver over the 1D (land-free) data structure

   eta1_local = 0.0D0
   do i=1,nActive
      Z(i) = Minv2(i)*R(i)                 ! Apply the diagonal preconditioner
      eta1_local = eta1_local + R(i)*Z(i)  !*** (r,(PC)r)
   enddo
   Z(iptrHalo:n) = Minv2(iptrHalo:n)*R(iptrHalo:n)   ! precondition the halo points too

!-----------------------------------------------------------------------
! update conjugate direction vector s
!-----------------------------------------------------------------------
   if (lprecond) call update_halo(Z)
   eta1 = global_sum(eta1_local,distrb_tropic)
   cg_beta = eta1/eta0
   do i=1,n
      S(i) = Z(i) + S(i)*cg_beta
   enddo

   call matvec(n,A,Q,S)                    ! Q = A*S

!-----------------------------------------------------------------------
! compute next solution and residual
!-----------------------------------------------------------------------
   call update_halo(Q)
   eta0 = eta1
   rtmp_local = 0.0D0
   do i=1,nActive
      rtmp_local = rtmp_local + Q(i)*S(i)  ! (s,As)
   enddo
   rtmp = global_sum(rtmp_local,distrb_tropic)
   eta1 = eta0/rtmp                        ! step length
   do i=1,n
      X(i) = X(i) + eta1*S(i)
      R(i) = R(i) - eta1*Q(i)
   enddo
June 22, 2006 Software Engineering Working Group Meeting 57
! The corresponding iteration of the original solver over the 2D block data structure

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)

      if (lprecond) then
         call preconditioner(WORK1,R,this_block,iblock)
      else
         where (A0(:,:,iblock) /= c0)
            WORK1(:,:,iblock) = R(:,:,iblock)/A0(:,:,iblock)
         elsewhere
            WORK1(:,:,iblock) = c0
         endwhere
      endif

      WORK0(:,:,iblock) = R(:,:,iblock)*WORK1(:,:,iblock)
   end do ! block loop

!-----------------------------------------------------------------------
! update conjugate direction vector s
!-----------------------------------------------------------------------
   if (lprecond) &
      call update_ghost_cells(WORK1, bndy_tropic, field_loc_center, &
                              field_type_scalar)

   !*** (r,(PC)r)
   eta1 = global_sum(WORK0, distrb_tropic, field_loc_center, RCALCT_B)

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)

      S(:,:,iblock) = WORK1(:,:,iblock) + S(:,:,iblock)*(eta1/eta0)

!-----------------------------------------------------------------------
!     compute As
!-----------------------------------------------------------------------
      call btrop_operator(Q,S,this_block,iblock)
      WORK0(:,:,iblock) = Q(:,:,iblock)*S(:,:,iblock)
   end do ! block loop

!-----------------------------------------------------------------------
! compute next solution and residual
!-----------------------------------------------------------------------
   call update_ghost_cells(Q, bndy_tropic, field_loc_center, &
                           field_type_scalar)
   eta0 = eta1
   eta1 = eta0/global_sum(WORK0, distrb_tropic, &
                          field_loc_center, RCALCT_B)

   do iblock=1,nblocks_tropic
      this_block = get_block(blocks_tropic(iblock),iblock)

      X(:,:,iblock) = X(:,:,iblock) + eta1*S(:,:,iblock)
      R(:,:,iblock) = R(:,:,iblock) - eta1*Q(:,:,iblock)

      if (mod(m,solv_ncheck) == 0) then
         call btrop_operator(R,X,this_block,iblock)
         R(:,:,iblock) = B(:,:,iblock) - R(:,:,iblock)
         WORK0(:,:,iblock) = R(:,:,iblock)*R(:,:,iblock)
      endif
   end do ! block loop
June 22, 2006 Software Engineering Working Group Meeting 58
Piece of the 1D data structure solver
!-----------------------------------------------------
! compute next solution and residual
!-----------------------------------------------------
   call update_halo(Q)                           ! <-- Update halo
   eta0 = eta1
   rtmp_local = 0.0D0
   do i=1,nActive                                ! <-- Dot product
      rtmp_local = rtmp_local + Q(i)*S(i)
   enddo
   rtmp = global_sum(rtmp_local,distrb_tropic)
   eta1 = eta0/rtmp
   do i=1,n                                      ! <-- Update vectors
      X(i) = X(i) + eta1*S(i)
      R(i) = R(i) - eta1*Q(i)
   enddo
June 22, 2006 Software Engineering Working Group Meeting 59
POP 0.1 degree

blocksize    Nb     Nb^2    Max ||
36x24       100    10000     7545
30x20       120    14400    10705
24x16       150    22500    16528
18x12       200    40000    28972
15x10       240    57600    41352
12x8        300    90000    64074

Increasing || -->
Decreasing overhead -->
June 22, 2006 Software Engineering Working Group Meeting 60
Serial execution time on multiple platforms (test)
[Bar chart] Seconds for 20 timesteps for PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D on IBM POWER4 (1.3 GHz), IBM POWER5 (1.9 GHz), IBM PPC 440 (700 MHz), AMD Opteron (2.2 GHz), and Intel P4 (2.0 GHz).