
Copyright © 1997, American Institute of Aeronautics and Astronautics, Inc.

A97-32442    AIAA-97-1893

An Efficient Multiblock Method for Aerodynamic Analysis and Design on Distributed Memory Systems

J. Reuther†
RIACS, NASA Ames Research Center
Moffett Field, CA 94035

J. C. Vassberg‡
Aerodynamic Design
Douglas Aircraft Co.
Long Beach, CA 90846

A. Jameson∗
Department of Aeronautics & Astronautics
Stanford University
Stanford, CA 94305

J. J. Alonso§
Department of Aeronautics & Astronautics
Stanford University
Stanford, CA 94305

L. Martinelli¶
Department of Mechanical & Aerospace Engineering
Princeton University
Princeton, NJ 08544

The work presented in this paper describes the application of a multiblock gridding strategy to the solution of aerodynamic design optimization problems involving complex configurations. The design process is parallelized using the MPI (Message Passing Interface) Standard such that it can be efficiently run on a variety of distributed memory systems ranging from traditional parallel computers to networks of workstations. Substantial improvements to the parallel performance of the baseline method are presented, with particular attention to their impact on the scalability of the program as a function of the mesh size. Drag minimization calculations at a fixed coefficient of lift are presented for a business jet configuration that includes the wing, body, pylon, aft-mounted nacelle, and vertical and horizontal tails. An aerodynamic design optimization is performed with both the Euler and Reynolds Averaged Navier-Stokes (RANS) equations governing the flow solution, and the results are compared. These sample calculations establish the feasibility of efficient aerodynamic optimization of complete aircraft configurations using the RANS equations as the flow model. There still exists, however, the need for detailed studies of the importance of a true viscous adjoint method, which holds the promise of tackling the minimization of not only the wave and induced components of drag, but also the viscous drag.

    INTRODUCTION

During the course of the last few years, there has been a concentrated effort within our group to develop fast and efficient methods for the solution of viscous fluid flows over complex aircraft configurations. The path to achieve this goal has seen numerous improvements in convergence acceleration techniques (multigrid, implicit residual averaging), viscous discretization algorithms, higher order dissipation schemes for shock capturing and boundary layer resolution, unstructured and multiblock grid approaches, and parallel implementations based on domain decomposition ideas [30, 50, 43, 49, 2, 1]. The combination of all these factors has resulted in a variety of flow solvers that adhere to the highest standards of accuracy and efficiency.

† Research Scientist, Member AIAA
‡ Senior Principal Engineer, Senior Member AIAA
∗ T. V. Jones Professor of Engineering, AIAA Fellow
§ Assistant Professor, Member AIAA
¶ Assistant Professor, Member AIAA

AIAA Paper 97-1893. Copyright ©1997 by the Authors. Published by the AIAA, Inc. with permission.

Although direct flow analysis of existing configurations has in the past provided the aircraft designer with invaluable information to overcome a wide range of problems, there is the need for Computational Fluid Dynamics (CFD) methods which, in addition, provide information about geometry changes that are necessary to improve an existing design with respect to a pre-specified figure of merit. It is within this framework where the utilization of fast solution techniques is of utmost importance.

Most effective aerodynamic optimization methods are based on the calculation of the derivatives of the figure of merit with respect to the design variables in the problem. Unfortunately, the calculation of these derivatives using the straightforward method of finite differencing is usually prohibitively expensive. While it is possible to perform aerodynamic optimization on a limited class of problems using the finite difference approach [19, 18, 51, 44, 42], large scale problems that are of the greatest engineering interest do not belong to this class. An alternative technique, first suggested for problems involving partial differential equations by Lions [40] and extended to the treatment of compressible flow by Jameson [23, 24], is the control theory approach. This methodology employs control theory applied to systems governed by partial differential equations to derive a co-state or adjoint system of equations. The adjoint equation has similar complexity to the flow solution, and allows the calculation of the complete gradient of the figure of merit at a cost which is essentially independent of the number of design variables in the problem.
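The cost argument above can be made concrete with a small illustration. The following Python sketch (ours, not part of the solver described in this paper; the model problem R(w, F) = w - Σ F_k² is an arbitrary toy) counts "flow solves": finite differencing requires one solve per design variable, while the adjoint identity dI/dF_k = -ψᵀ ∂R/∂F_k recovers the same gradient from the baseline solution alone.

```python
# Toy illustration: cost of finite-difference gradients vs. an adjoint
# solve, for I(w(F)) with R(w, F) = 0. Here R(w, F) = w - sum(F_k^2),
# so the "flow solution" is w = sum(F_k^2) and the cost is I = w^2.

solves = {"flow": 0}

def flow_solve(F):
    solves["flow"] += 1
    return sum(f * f for f in F)          # w such that R(w, F) = 0

def cost(w):
    return w * w                          # figure of merit I(w)

n = 20
F = [0.1 * (k + 1) for k in range(n)]
h = 1e-6

# Finite differencing: one flow solve per design variable plus a baseline.
w0 = flow_solve(F)
I0 = cost(w0)
g_fd = []
for k in range(n):
    Fp = list(F); Fp[k] += h
    g_fd.append((cost(flow_solve(Fp)) - I0) / h)
fd_solves = solves["flow"]                # n + 1 = 21 solves

# Adjoint: (dR/dw)^T psi = dI/dw, then dI/dF_k = -psi * dR/dF_k.
# Here dR/dw = 1, dI/dw = 2 w0, dR/dF_k = -2 F_k.
psi = 2.0 * w0                            # adjoint "solve" is scalar here
g_adj = [-psi * (-2.0 * F[k]) for k in range(n)]

print(fd_solves)                          # 21
print(max(abs(a - b) for a, b in zip(g_fd, g_adj)))
```

The adjoint gradient matches the finite-difference gradient to truncation error while the solve count stays fixed as n grows, which is the efficiency claim made above.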

The computational cost of performing viscous based design is considerably larger than for design using the Euler equations because: a) the number of mesh points must be increased by a factor of about five to resolve boundary layers and wakes, b) there is the additional cost of computing the viscous terms and a turbulence model, and c) Navier-Stokes calculations generally converge much more slowly than Euler solutions because of stiffness arising from the highly stretched boundary layer cells. Therefore, the computational feasibility of viscous design hinges on the development of a rapidly convergent Navier-Stokes flow solver which is able to handle complex configurations and is efficiently implemented on the current generation of distributed memory architectures.

With this in mind, the logical approach to the solution of the aerodynamic design problem is to link together fast iterative solvers and the adjoint solution methodology in order to produce a computational method which can address the needs of the aircraft designer: high solution accuracy, fast turnaround, geometric complexity, and automated shape design. Even with the use of an adjoint solver, large scale design problems using the Navier-Stokes equations such as those considered in this work require massive computational resources, and future work will place even more extreme demands on the computational power needed. Therefore, it was decided to exploit the power of emerging distributed memory parallel computers with efficient standardized message-passing implementations. Thus, much emphasis in this paper has been placed not only on demonstrating the viability of performing automatic designs on complex configurations, but also on minimizing the communication overhead incurred by mapping the method onto either parallel computers or clusters of workstations.

In this paper we present one of the possible variations of the adjoint based design technique for complex geometries, where the flow and adjoint solvers have been implemented using a multiblock strategy. Both the Euler and Reynolds Averaged Navier-Stokes equations are used to solve drag minimization problems. In both circumstances, the adjoint system solved to obtain the sensitivity of the figure of merit with respect to the design variables is based on the inviscid equations only. The authors feel that the effective use of a viscous adjoint in a realistic design environment will require much further development and validation work. Furthermore, the approach presented here is a natural evolution of our previous work [50, 43, 49], which had already been extended to treat inviscid flows over complex configurations. The work directed towards the improvement of the parallel performance of the method was motivated by an interest in demonstrating the performance of the method on computational platforms which do not possess the bandwidth and latency that is realizable on highly integrated parallel machines.

CONTROL THEORY FORMULATION FOR SHAPE DESIGN

The presentation of the control theory approach to optimal design is well documented elsewhere [23, 28], and only a brief summary is given here.

The progress of the design procedure is measured in terms of a cost function I, which could be, for example, the drag coefficient or the lift to drag ratio. For the flow about an airfoil or wing, the aerodynamic properties which define the cost function are functions of the flow-field variables (w) and the physical location of the boundary, which may be represented by the function F, say. Then

    I = I(w, F),

and a change in F results in a change

    δI = (∂I/∂w)ᵀ δw + (∂I/∂F)ᵀ δF    (1)

in the cost function. Using control theory, the governing equations of the flow field are introduced as a constraint in such a way that the final expression for the gradient does not require multiple flow solutions. This corresponds to eliminating δw from (1).

Suppose that the governing equation R which expresses the dependence of w and F within the flow-field domain D can be written as

    R(w, F) = 0.    (2)

In our current work, R may be expressed by either the Euler or Navier-Stokes equations. Then δw is determined from the equation

    δR = (∂R/∂w) δw + (∂R/∂F) δF = 0.    (3)


Next, introducing a Lagrange multiplier ψ, we have

    δI = (∂I/∂w)ᵀ δw + (∂I/∂F)ᵀ δF − ψᵀ [ (∂R/∂w) δw + (∂R/∂F) δF ].

Choosing ψ to satisfy the adjoint equation

    (∂R/∂w)ᵀ ψ = ∂I/∂w,    (4)

the first term is eliminated, and we find that

    δI = G δF,    (5)

where

    G = (∂I/∂F)ᵀ − ψᵀ (∂R/∂F).

The advantage is that (5) is independent of δw, with the result that the gradient of I with respect to an arbitrary number of design variables can be determined without the need for additional flow-field evaluations. In the case that (2) is a partial differential equation, the adjoint equation (4) is also a partial differential equation, and determination of the appropriate boundary conditions requires careful mathematical treatment.

The computational cost of a single design cycle is roughly equivalent to the cost of two flow solutions, since the adjoint problem has similar complexity. When the number of design variables becomes large, the computational efficiency of the control theory approach over traditional finite differencing strategies, which require direct evaluation of the gradients by individually varying each design variable and recomputing the flow field, becomes compelling.

Once equation (5) is established, an improvement can be made with a shape change in the direction of the negative gradient

    δF = −λG,

where λ is positive, and small enough that the first variation is an accurate estimate of δI. Then

    δI = −λ Gᵀ G < 0.

After making such a modification, the gradient can be recalculated and the process repeated to follow a path of descent until a minimum is reached. Variations on the optimization procedure which allow for the treatment of structural and aerodynamic constraints can be readily incorporated in this approach.
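The descent procedure above can be sketched in a few lines. This toy Python example (ours; the quadratic cost and its gradient are arbitrary stand-ins for I and the adjoint-based G) shows that with a sufficiently small positive λ, each step δF = −λG yields δI = −λ GᵀG < 0, so the cost decreases monotonically along the path of descent.

```python
# Minimal steepest-descent sketch of the update dF = -lam * G.
# Toy cost I(F) = sum((F_k - 1)^2); its gradient stands in for the
# adjoint-based gradient G of the paper.

def cost(F):
    return sum((f - 1.0) ** 2 for f in F)

def gradient(F):
    return [2.0 * (f - 1.0) for f in F]

F = [0.0] * 5
lam = 0.1                             # step length, small enough that the
I = cost(F)                           # first variation predicts the change
for cycle in range(50):
    G = gradient(F)
    F = [f - lam * g for f, g in zip(F, G)]
    I_new = cost(F)
    assert I_new <= I                 # path of descent: cost never increases
    I = I_new

print(I)
```

In practice the step length would be adapted, and constraints would be folded in as the text notes, but the monotone decrease of the cost is the essential mechanism.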

    MULTIBLOCK FLOW SOLVER

    Multiblock strategy

FLO107-MB is a three-dimensional, multiblock, Euler and Navier-Stokes flow solver suitable for the solution of external and internal flows around complex configurations. The discretization of the governing equations of the flow is accomplished using a cell-centered finite volume method. The flow domain is divided into a large number of small subdomains, and the integral form of the conservation laws

    d/dt ∫_D w dV + ∫_B F · dS = 0

is applied to each subdomain. Here F is the flux function, which can include the viscous fluxes in the case of the Navier-Stokes equations, w is the vector of flow variables, and dS is the directed surface element of the boundary B of the domain D. The use of the integral form has the advantage that no assumption of the differentiability of the solution is implied, with the result that it remains a valid statement for a subdomain containing shock waves. In general the subdomains could be arbitrary, but in this work we use the hexahedral cells of a multiblock body-conforming curvilinear mesh. Discretizations of this type reduce to central differences on a regular Cartesian grid, and in order to eliminate possible odd-even decoupling modes allowed by the discretization, some form of artificial dissipation must be added. Moreover, when shock waves are present, it is necessary to upwind the discretization to provide a non-oscillatory capture of discontinuities. The current version of the multiblock flow solver accomplishes this task using either a switched scalar dissipation scheme or the more sophisticated Convective Upstream Split Pressure (CUSP) approach, coupled with an Essentially Local Extremum Diminishing (ELED) formulation. Details on these techniques, and an extensive validation of the scheme for both inviscid and viscous flow, can be found in [26, 27, 54].
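The flux-balance idea can be illustrated on the simplest possible case. The sketch below (ours, not the paper's scheme: 1D linear advection on a periodic domain, forward Euler in time, and a first-order scalar dissipation coefficient chosen large enough to make the scheme monotone, whereas the solver described here uses switched/CUSP dissipation and multistage time stepping) shows a cell-centered finite volume update: central face fluxes, plus added dissipation to suppress odd-even decoupling, with exact discrete conservation from the telescoping of face fluxes.

```python
# 1D cell-centered finite-volume sketch: integrate the conservation law
# over each cell, balance the two face fluxes, and add scalar artificial
# dissipation to the central (arithmetic-mean) flux. Linear advection
# w_t + a w_x = 0 on a periodic domain.

n, a = 64, 1.0
dx = 1.0 / n
dt = 0.4 * dx / a                       # CFL = 0.4
eps = 0.5                               # dissipation coefficient (assumed);
                                        # eps = 0.5 makes this monotone here

w = [1.0 if 16 <= i < 32 else 0.0 for i in range(n)]

def step(w):
    f = [a * wi for wi in w]            # flux function F(w) = a w
    wn = []
    for i in range(n):
        ip, im = (i + 1) % n, (i - 1) % n
        # central face fluxes plus first-order scalar dissipation
        fr = 0.5 * (f[i] + f[ip]) - eps * a * (w[ip] - w[i])
        fl = 0.5 * (f[im] + f[i]) - eps * a * (w[i] - w[im])
        wn.append(w[i] - dt / dx * (fr - fl))
    return wn

total0 = sum(w)
for _ in range(100):
    w = step(w)

print(abs(sum(w) - total0))             # face fluxes telescope: conserved
```

Because every interior face flux is added to one cell and subtracted from its neighbor, the total of w is conserved to rounding error, exactly the property the integral form guarantees even across shocks.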

In order to apply the finite volume technique to the solution of flows around complex configurations, we have chosen to implement a multiblock strategy. In a multiblock environment, a series of structured blocks of varying sizes is constructed such that these blocks fill the complete space and conform to the surface of the geometry of interest. This segmentation of the complete domain into smaller blocks avoids the topological problems present in constructing a grid around complex configurations and multiply connected regions. The general strategy in the solution procedure of the multiblock flow solver is to construct a halo of cells which surrounds each block and contains information from cells in the neighboring blocks. This halo of cells, when updated at appropriate times during the numerical procedure, allows the flow solution inside each block to proceed independently of the others.

This approach requires establishing the number and location of halo cells adjacent to block boundaries, and constructing lists of halo cells and their internal counterparts in the global mesh. In our case, we have chosen to carry out these setup procedures as part of a pre-processing module. During the pre-processing step, a two-level halo is constructed around each block. The requirement of this double halo results from the necessity to calculate all the necessary fluxes for the internal cells of each block without reference to additional cell locations outside the block in question. In particular, the second differences used for the third order dissipation terms require the values of the flow variables in the two neighboring cells on all sides of the cell in question. As we will see later, this approach is at the heart of the parallel implementation of the method.
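The double-halo idea can be demonstrated on a deliberately simplified layout. In this sketch (ours: a 1D periodic arrangement of blocks stands in for the paper's 3D multiblock mesh, and the halo "exchange" is a local copy rather than a message), each block is padded with two halo cells per face, so a stencil reaching two cells beyond an interior cell never leaves the padded block, and the per-block results glue back into exactly the single-domain answer.

```python
# Two-level halo sketch: pad each block with HALO = 2 cells copied from
# its neighbors, so wide dissipation stencils stay inside the padded block.

HALO = 2  # double halo: second differences reach two cells to each side

def split_into_blocks(field, nblocks):
    size = len(field) // nblocks
    return [field[b * size:(b + 1) * size] for b in range(nblocks)]

def exchange_halos(blocks):
    """Pad each block with HALO cells from its neighbors (periodic)."""
    padded = []
    for b, blk in enumerate(blocks):
        left = blocks[(b - 1) % len(blocks)][-HALO:]
        right = blocks[(b + 1) % len(blocks)][:HALO]
        padded.append(left + blk + right)
    return padded

def wide_stencil(padded):
    """Interior update with a stencil that reaches +/- 2 cells."""
    return [padded[i - 2] - 2.0 * padded[i] + padded[i + 2]
            for i in range(HALO, len(padded) - HALO)]

field = [float(i * i) for i in range(16)]
blocks = split_into_blocks(field, 4)
per_block = [wide_stencil(p) for p in exchange_halos(blocks)]
glued = [v for blk in per_block for v in blk]

# Reference: the same stencil applied to the undivided periodic field.
ref = [field[(i - 2) % 16] - 2.0 * field[i] + field[(i + 2) % 16]
       for i in range(16)]
print(glued == ref)
```

In the parallel implementation described later, `exchange_halos` is where the message passing would occur; everything else is block-local, which is why the single-block routines can be reused unchanged.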

The system of equations solved, as well as the solution strategy, follows that presented in many earlier works [34, 22, 21]. The governing equations of the flow may be written as

    ∂w/∂t + ∂f_i/∂x_i = 0,    (6)

where it is convenient to denote the Cartesian coordinates and velocity components by x_1, x_2, x_3 and u_1, u_2, u_3, and w and f_i are defined as

    w = ( ρ, ρu_1, ρu_2, ρu_3, ρE )ᵀ,

    f_i = ( ρu_i, ρu_iu_1 + p δ_{i1}, ρu_iu_2 + p δ_{i2}, ρu_iu_3 + p δ_{i3}, ρu_iH )ᵀ,    (7)

with δ_{ij} being the Kronecker delta function. Note that this definition of the flux functions f_i corresponds to the Euler equations. They can be expanded to include the appropriate viscous terms in the Reynolds Averaged Navier-Stokes equations without modification to this discussion. Details of the construction of these viscous fluxes are presented in the following section. Also,

    p = (γ − 1) ρ ( E − ½ u_i u_i ),    (8)

and

    ρH = ρE + p,    (9)

where γ is the ratio of the specific heats. Consider a transformation to coordinates ξ_1, ξ_2, ξ_3, where

    K = [ ∂x_i/∂ξ_j ],   J = det(K),   K⁻¹ = [ ∂ξ_i/∂x_j ].

Introduce scaled contravariant velocity components as

    U_i = Q_{ij} u_j,

where

    Q = J K⁻¹.

The Euler equations can now be written as

    ∂W/∂t + ∂F_i/∂ξ_i = 0,    (10)

with

    W = J ( ρ, ρu_1, ρu_2, ρu_3, ρE )ᵀ,

    F_i = ( ρU_i, ρU_iu_1 + Q_{i1}p, ρU_iu_2 + Q_{i2}p, ρU_iu_3 + Q_{i3}p, ρU_iH )ᵀ.    (11)
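Equations (7) through (9) are purely algebraic, so their evaluation is easy to check in isolation. The sketch below (ours; γ = 1.4 is the usual value for air, and the test state is arbitrary) evaluates the pressure from the equation of state (8), ρH from (9), and assembles the Cartesian flux vector f_i of (7).

```python
# Evaluate the flux vector f_i of equation (7) from the conservative
# state w = (rho, rho*u1, rho*u2, rho*u3, rho*E), using the equation of
# state (8) for p and relation (9) for rho*H.

gamma = 1.4  # ratio of specific heats (air)

def pressure(w):
    rho, m1, m2, m3, rhoE = w
    q2 = (m1 * m1 + m2 * m2 + m3 * m3) / (rho * rho)   # u_i u_i
    return (gamma - 1.0) * rho * (rhoE / rho - 0.5 * q2)  # eq. (8)

def flux(w, i):
    """Cartesian flux f_i, i in {1, 2, 3}; the Kronecker delta of (7)
    puts the pressure on the i-th momentum row only."""
    rho, m1, m2, m3, rhoE = w
    u = [m1 / rho, m2 / rho, m3 / rho]
    p = pressure(w)
    rhoH = rhoE + p                                    # eq. (9)
    ui = u[i - 1]
    return [rho * ui,
            rho * ui * u[0] + (p if i == 1 else 0.0),
            rho * ui * u[1] + (p if i == 2 else 0.0),
            rho * ui * u[2] + (p if i == 3 else 0.0),
            rhoH * ui]

w = [1.0, 0.5, 0.0, 0.0, 2.5]   # rho = 1, u = (0.5, 0, 0), E = 2.5
print(pressure(w))               # (gamma-1) * (2.5 - 0.125) = 0.95
```

The transformed fluxes F_i of (11) follow the same pattern with U_i = Q_{ij} u_j in place of u_i and the metric terms Q_{ij} multiplying the pressure.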

For the multiblock flow solver, the above notation applies to each block in turn. The flow is thus determined as the steady-state solution to equation (10) in all blocks, subject to the flow tangency or no-slip conditions on solid boundary faces:

    U_n = 0 on all B_S    (12)

for flow tangency, or

    u_i = 0,  i = 1, 2, 3  on all B_S    (13)

for no-slip boundary conditions. In this notation, n is 1, 2, or 3 depending on the direction that is normal to the face B_S where a solid surface is indicated. At the far field boundary faces, B_F, freestream conditions are specified for incoming waves, while outgoing waves are determined by the solution.

The time integration scheme follows that used in the single block solver [34]. The solution proceeds by performing the cell flux balance, updating the flow variables, and smoothing the residuals at each stage of the time-stepping scheme and at each level of the multigrid cycle. The main difference in the integration strategy is the need to loop over all blocks during each stage of the process. The use of the double-halo configuration permits standard single-block subroutines to be used, without modification, for the computation of the flow field within each individual block. This includes the single-block subroutines for convective and dissipative flux discretization, viscous discretization, multistage time stepping, and multigrid convergence acceleration.

The only difference between the integration strategies is in the implementation of the residual averaging technique. In the single-block solution strategy, tridiagonal systems of equations are set up and solved using flow information from the entire grid. Thus, each residual is replaced by a weighted average of itself and the residuals of its neighbors in the entire grid. In the multiblock strategy, the support for the residual smoothing is reduced to the extent of each block, in order to eliminate the need to solve scalar tridiagonal systems spanning the blocks, which would incur a penalty in communication costs. Depending on the topology of the overall mesh, the setup of tridiagonal systems that follow coordinate lines may lose the physical interpretation that it had in the single block implementation. This change has no effect on the final converged solution, and in all applications of the solver it has not led to any reduction in the rate of convergence.
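Restricting the smoothing to one block makes each tridiagonal solve entirely local. The following sketch (ours, not the paper's code; the smoothing coefficient ε = 0.5 and the spike residual are arbitrary) applies implicit residual averaging along one coordinate line of a block, (1 + 2ε) R̄_i − ε (R̄_{i−1} + R̄_{i+1}) = R_i, solved with the Thomas algorithm.

```python
# Block-local implicit residual averaging along one coordinate line:
# (1 + 2*eps) Rbar_i - eps*(Rbar_{i-1} + Rbar_{i+1}) = R_i,
# solved with the Thomas algorithm. Keeping the system inside one block
# avoids any communication across block faces.

def smooth_residuals(R, eps):
    n = len(R)
    a = [-eps] * n                 # sub-diagonal
    b = [1.0 + 2.0 * eps] * n      # diagonal
    c = [-eps] * n                 # super-diagonal
    # forward elimination
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], R[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (R[i] - a[i] * dp[i - 1]) / m
    # back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

R = [0.0, 0.0, 1.0, 0.0, 0.0]      # a residual spike on one line
Rbar = smooth_residuals(R, eps=0.5)
print(Rbar)                         # peak damped, support spread out
```

The smoothed residual spreads the spike over its neighbors and damps its peak, which is what enlarges the stability bound of the multistage scheme; here the support simply ends at the block boundary instead of spanning the whole grid.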

    Viscous discretization

In order to include the viscous terms of the Navier-Stokes equations in the spatial discretization scheme, it is necessary to approximate the velocity derivatives ∂u_i/∂x_j which constitute the stress tensor σ_{ij}. These derivatives may be evaluated by applying the Gauss formula to a control volume V with boundary S:

    ∫_V (∂u_i/∂x_j) dV = ∮_S u_i n_j dS,

where n_j is the outward normal. For a hexahedral cell this gives

    ∂u_i/∂x_j ≈ (1/V) Σ_faces ū_i n_j S,    (14)

where ū_i is an estimate of the average of u_i over the face, n_j is the jth component of the normal, and S is the face area.
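Formula (14) is easy to verify on a single control volume. The sketch below (ours; a 2D quadrilateral is used for brevity in place of the paper's hexahedral cells, so n_j S becomes the edge normal times the edge length) computes the Gauss-formula gradient and confirms that it is exact for a linear velocity field.

```python
# Gauss-formula gradient estimate on one 2D quadrilateral control volume:
# du/dx_j ~ (1/V) * sum over faces of (u_face * n_j * S). Exact for a
# linear field when the face average is the mean of the edge endpoints.

def cell_gradient(verts, u):
    """verts: 4 (x, y) corners in counter-clockwise order;
    u: field values at those corners."""
    n = len(verts)
    # cell "volume" (area) via the shoelace formula
    V = 0.5 * abs(sum(verts[i][0] * verts[(i + 1) % n][1]
                      - verts[(i + 1) % n][0] * verts[i][1]
                      for i in range(n)))
    gx = gy = 0.0
    for i in range(n):
        (x0, y0), (x1, y1) = verts[i], verts[(i + 1) % n]
        uf = 0.5 * (u[i] + u[(i + 1) % n])     # face-average estimate
        # outward normal times edge length, for CCW-ordered vertices
        nx_S, ny_S = (y1 - y0), -(x1 - x0)
        gx += uf * nx_S
        gy += uf * ny_S
    return gx / V, gy / V

verts = [(0.0, 0.0), (2.0, 0.1), (2.2, 1.3), (0.1, 1.1)]
u = [3.0 * x + 5.0 * y for x, y in verts]      # linear field u = 3x + 5y
gx, gy = cell_gradient(verts, u)
print(gx, gy)                                   # recovers (3, 5)
```

On the dual mesh described next, the same formula is applied around each vertex of the original mesh, with the face values taken from the surrounding cell centers.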

In a cell centered scheme, the integration is carried out on a dual mesh obtained by connecting the centers of the computational cells in the original mesh. This process yields an approximation of the stress tensor at the vertices of the original computational mesh. Once the stress tensor is computed at the cell vertices, it is averaged at the face centers before computing the viscous flux balance. This discretization is very efficient because it does not require the evaluation of the gradients separately for each cell face. However, as a consequence of the averaging process, the discretization may admit odd/even decoupling modes. These modes should be damped by the third-order artificial dissipation already added to damp the odd/even modes arising from the central difference approximation of the convective terms. Alternatively, it is possible to add a correction stencil to the velocity gradients calculated at the vertices to approximately convert them to the more compact stencil characteristic of a face-centered approach [31] without increasing the computational cost. The first approach described is used here.

The implementation of this discretization procedure for the viscous terms in the multiblock method requires only a single halo for both the flow values and the grid locations.

    MULTIBLOCK DESIGN STRATEGY

With the discussion of the multiblock flow solver completed, we will now describe the adjoint based design methodology. The development and implementation of adjoint approaches for aerodynamic shape optimization has reached a stage of maturity in which problems of practical interest are starting to be considered. In one of our recent publications, both transonic and supersonic shape optimizations were performed for complex aircraft configurations subject to a variety of geometric constraints [49]. Although the inviscid Euler equations were used as the core CFD algorithm for these design calculations, an accompanying paper at the same conference presented the derivation of a viscous adjoint algorithm and demonstrated a preliminary wing-body design capability using a viscous flow solver coupled with an inviscid adjoint solver [32]. The methodology presented here will follow this latter approach, to allow for the capability of complete aircraft shape design in the presence of viscous effects.

The course of action can be described as follows: first, the structured multiblock Navier-Stokes flow solver described in the previous section replaces the inviscid multiblock method used in reference [49]. Therefore, although the expression for the cost function does not include viscosity related terms, it reflects the effects of the presence of the boundary layer. The ability to perform inviscid design calculations is retained with a simple input flag. The gradient of the cost function with respect to geometric design variables is then calculated via the solution of the inviscid adjoint system of [49].

The lack of a viscous adjoint solver corresponding to the viscous flow solver has several important implications. First and foremost is the fact that the use of an Euler adjoint is mathematically inconsistent with a cost function evaluated via a set of viscous governing equations. Therefore, it is impossible to obtain gradients from this approach that match those obtained using finite differencing. However, for problems of engineering interest, the objective is not necessarily to find the true optimum at all costs, but to get within a reasonable vicinity of the minimum at an acceptable cost.
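The argument that an inconsistent gradient can still be useful has a simple quantitative analogue: descent driven by an approximate gradient still makes progress as long as the approximation keeps a positive inner product with the true gradient; it merely stalls in a vicinity of the true minimum rather than at it. The toy Python sketch below (ours; the quadratic cost, the scaling error, and the bias term are arbitrary stand-ins for the Euler-adjoint/RANS-flow mismatch) illustrates this.

```python
# Descent with a deliberately *approximate* gradient: wrong scaling plus
# a constant bias. Progress continues while the approximation remains a
# descent direction, and the iterates stall near (not at) the minimum.

def cost(x):
    return sum(v * v for v in x)       # true minimum at x = 0

def true_grad(x):
    return [2.0 * v for v in x]

def approx_grad(x):
    # systematic error: 20% under-scaling plus a small constant bias
    return [0.8 * g + 0.01 for g in true_grad(x)]

x = [1.0] * 4
lam = 0.2
for _ in range(200):
    g = approx_grad(x)
    x = [v - lam * gi for v, gi in zip(x, g)]

print(cost(x))   # small, but not exactly zero: vicinity of the minimum
```

The bias shifts the stall point slightly away from the exact optimum, which mirrors the trade made in the text: a reasonable vicinity of the minimum at an acceptable cost.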

It is interesting to note that other design approaches also suffer from a similar inconsistency, accepted for reasons of engineering practicality. Quasi-inverse design methods such as those used by Campbell [10, 11] assume a relationship between the pressure distribution and the local surface curvature. This relationship effectively provides an inconsistent gradient in order to obtain improved designs. In reference [11] the idea has been pursued in applications using the Navier-Stokes equations. These approaches, which may be applicable to a small sub-class of problems, are likely to fail in situations where the heuristic assumptions used to obtain gradient information cease to be valid.

With these ideas in mind, it is important to consider both the advantages and the limitations of the present design technique. If the aerodynamic figure of merit to be minimized has a direct dependence on viscosity, such as through the friction drag, the approach is rendered invalid, since the inviscid adjoint system lacks direct sensitivity to viscosity. However, for problems in which viscosity plays an indirect role, the proposed design technique is bound to produce useful results. Some important aerodynamic shape optimization problems fall into this latter category. Take for example the problem of pressure drag minimization for commercial transport aircraft. Without breakthroughs in either laminar flow control or turbulent skin friction reduction technologies, most of the aerodynamic performance improvements attainable for a given configuration can be achieved through pressure drag minimization (both induced drag and wave drag). In addition, since the pressure gradient normal to a viscous boundary layer for aircraft at cruise conditions is negligible, the pursuit of inviscid methods for aerodynamic shape optimization has yielded moderate success [15, 50, 49].

However, inviscid design methods must be used cautiously even for inverse pressure distribution or pressure drag minimization problems, since the viscous effects will indirectly alter these quantities. The most noticeable effect is due to the boundary layer displacement thickness. The magnitude and importance of the effective changes in wing shape caused by the presence of the boundary layer depend on the flow field in question, and generally become more pronounced under transonic conditions. The position and strength of shock waves, as well as the level of pressure recovery at the trailing edge, can be strongly impacted by the existence of a boundary layer. In transonic flow, it is thus highly desirable to take viscous effects into account when designing the aerodynamic shape of a wing to minimize pressure drag.

When the boundary layer couples very strongly with the outer flow, as is the case at transonic buffet or at maximum lift coefficient conditions, the ability to perform meaningful design without a viscous adjoint can be questioned.

In summary, a design methodology has been developed that uses the Navier-Stokes equations for the flow solution and an inviscid adjoint formulation to obtain gradient information. This method is suitable for a large class of problems of practical aerodynamic interest. For problems in which the viscous effects dominate the behavior of the flow, the viscous formulation of the adjoint equations will more than likely be necessary. It is our intention to pursue this issue further in the coming months.

    Adjoint Solver

The mathematical development of the inviscid adjoint equations used in this research has been extensively discussed in our earlier work [23, 24, 25, 29, 45, 33, 46, 47, 48, 50]. An introductory treatment of the derivation of a viscous adjoint has been given in reference [32]. In this section we present a short review of the development of the inviscid adjoint equations for the illustrative problem of pressure drag minimization subject to a variety of constraints.

    I = C_D = C_A cos α + C_N sin α

      = (1/S_ref) ∫∫_{B_S} C_p ( S_x cos α + S_y sin α ) dξ_1 dξ_2,

where S_x and S_y define projected surface areas, S_ref is the reference area, and ξ_1 and ξ_2 are the two coordinates in the plane of the face in question. Note that the integral in the final expression above is carried out over all solid boundary faces. The design problem is now treated as a control problem where the control function is the geometry shape, which is chosen to minimize I, subject to the constraints defined by the flow equations. A variation in the shape will cause a variation δp in the pressure, and consequently a variation in the cost function

    δI = (1/S_ref) ∫∫_{B_S} δC_p ( S_x cos α + S_y sin α ) dξ_1 dξ_2 + (∂I/∂α) δα,

where δC_p is the variation due to changes in the design parameters with α fixed. To treat the problem of practical design, drag must be minimized at a fixed lift coefficient. Thus an additional constraint is given by

    δC_L = 0,

which yields

    δC_L|_α + (∂C_L/∂α) δα = 0.

Combining these two expressions to eliminate δα gives

    δI = δC_D − (∂C_D/∂α)/(∂C_L/∂α) δC_L.    (15)
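The fixed-lift combination (15) can be checked numerically on a toy model. In the Python sketch below (ours; the algebraic C_L and C_D functions of a design variable x and angle of attack α are hypothetical, not aerodynamic data), the gradient assembled per (15) from fixed-α partial derivatives is compared against direct differencing of the drag while α is retrimmed to hold C_L constant.

```python
# Numerical check of equation (15): along the constraint CL = const,
# dCD/dx = dCD/dx|_alpha - (dCD/dalpha)/(dCL/dalpha) * dCL/dx|_alpha.

import math

def CL(x, a):
    return 2.0 * math.pi * a + 0.3 * x            # toy lift model

def CD(x, a):
    return 0.01 + 0.05 * CL(x, a) ** 2 + 0.02 * x * x  # toy drag model

def alpha_for_CL(x, CL_target):
    return (CL_target - 0.3 * x) / (2.0 * math.pi)

x0, CL_t = 0.2, 0.5
a0 = alpha_for_CL(x0, CL_t)
h = 1e-6

# partial derivatives at (x0, a0), with alpha held fixed
dCD_dx = (CD(x0 + h, a0) - CD(x0 - h, a0)) / (2 * h)
dCL_dx = (CL(x0 + h, a0) - CL(x0 - h, a0)) / (2 * h)
dCD_da = (CD(x0, a0 + h) - CD(x0, a0 - h)) / (2 * h)
dCL_da = (CL(x0, a0 + h) - CL(x0, a0 - h)) / (2 * h)

g_proj = dCD_dx - (dCD_da / dCL_da) * dCL_dx      # equation (15)

# reference: drag derivative while retrimming alpha to hold CL fixed
def CD_trimmed(x):
    return CD(x, alpha_for_CL(x, CL_t))

g_ref = (CD_trimmed(x0 + h) - CD_trimmed(x0 - h)) / (2 * h)
print(abs(g_proj - g_ref))                         # agreement to FD error
```

This is exactly the role (15) plays in the design loop: the adjoint supplies the fixed-α sensitivities, and the combination removes the component of the gradient that merely retrims the angle of attack.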

Since p depends on w through the equation of state, the variation δp can be determined from the variation δw. If a fixed computational domain is used, the variations in the shape result in variations in the mapping derivatives. Define the Jacobian matrices

    A_i = ∂f_i/∂w,   C_i = Q_{ij} A_j.    (16)


Then the equation for $\delta w$ in the steady state becomes

$$
\frac{\partial}{\partial \xi_i}\,\delta F_i = 0 , \tag{17}
$$

where in the domain

$$
\delta F_i = C_i\,\delta w + \delta Q_{ij}\,f_j ,
$$

and on the solid surface,

$$
\delta F_2 =
\begin{pmatrix}
0 \\
Q_{21}\,\delta p + \delta Q_{21}\,p \\
Q_{22}\,\delta p + \delta Q_{22}\,p \\
Q_{23}\,\delta p + \delta Q_{23}\,p \\
0
\end{pmatrix}
\quad \text{on any } B_S . \tag{18}
$$

Now, multiplying equation (17) by a vector co-state variable $\psi$, assuming the result is differentiable, and integrating by parts over the entire domain,

$$
\iint_{B} \psi^T n_i\,\delta F_i\, dB
- \iiint_{D} \frac{\partial \psi^T}{\partial \xi_i}\,\delta F_i\, dD = 0 , \tag{19}
$$

where $n_i$ are the components of a unit vector normal to the boundary. Equation (19) can now be subtracted from equation (15) without changing the value of $\delta I$. Then $\psi$ may be chosen to cancel the explicit terms in $\delta w$ and $\delta p$. For this purpose $\psi$ is set to the steady-state solution of the adjoint equation

$$
\frac{\partial \psi}{\partial t} - C_i^T\,\frac{\partial \psi}{\partial \xi_i} = 0 \quad \text{in } D , \tag{20}
$$

with the surface boundary condition

$$
\psi_2 Q_{21} + \psi_3 Q_{22} + \psi_4 Q_{23}
= \frac{1}{S_{ref}} \left[ \left( S_x \cos\alpha + S_y \sin\alpha \right)
- \frac{\partial C_D / \partial \alpha}{\partial C_L / \partial \alpha}
\left( S_y \cos\alpha - S_x \sin\alpha \right) \right]
\quad \text{on all } B_S . \tag{21}
$$

It is noted that the waves in the adjoint problem propagate in the opposite direction to those in the flow problem because of the transpose in equation (20). Finally we obtain the expression

$$
\delta I = - \iint_{B_S} p \left( \psi_2\,\delta Q_{21} + \psi_3\,\delta Q_{22} + \psi_4\,\delta Q_{23} \right) d\xi_1\, d\xi_2
+ \frac{1}{S_{ref}} \iint_{B_S} C_p \left[ \left( \delta S_x \cos\alpha + \delta S_y \sin\alpha \right)
- \frac{\partial C_D / \partial \alpha}{\partial C_L / \partial \alpha}
\left( \delta S_y \cos\alpha - \delta S_x \sin\alpha \right) \right] d\xi_1\, d\xi_2
+ \iiint_{D} \frac{\partial \psi^T}{\partial \xi_i}\,\delta Q_{ij}\,f_j\, dD . \tag{22}
$$

In order to evaluate the changes in the cost from the above expression, the function $\psi$ must be defined through the solution of (20). A major difference between the development of adjoint solvers presented by our group and those cited in references [4, 5, 6, 8, 7, 13, 14, 9, 38, 36, 20, 41, 37, 35] is that we have relied on a continuous formulation. In this formulation the adjoint system of equations is derived starting from the continuous governing equations to produce a set of continuous co-state equations including boundary conditions. This set of co-state equations is then discretized for computational analysis as a final step. A discrete formulation interchanges the order of these operations by starting from the discrete governing equations and employing linear algebra to obtain a discrete adjoint system. It is useful to note that the final results from these two approaches can be explained as alternate discretizations of the continuous adjoint formulation.

This subtle difference in the order of the adjoint and discretization operations has several important implications. Some of these differences have been explored in detail in the first author's Ph.D. dissertation [52] as well as in the recent work by Anderson and Venkatakrishnan [3]. For purposes of the present work, it is important to focus on one particular difference between the continuous and discrete adjoints. If a discrete approach is followed, gradients obtained via either the resulting adjoint or directly through finite differences should converge to the same result. Hence, the very idea of using an inviscid adjoint for a viscous state equation would not exist. The combination of methods used in this paper is derived from the natural flexibility of employing a continuous adjoint formulation.

Details of the particular discretization used here are covered in reference [52]. The discrete adjoint system is solved in precisely the same manner as the flow equations. More details of the approach as well as the development for other cost functions have been presented in references [25, 29, 33, 46, 47, 50, 43, 49].

Design Variables and Underlying Geometry Database

Even with the rapid developments of the last few years regarding the derivation and implementation



of adjoint solvers, many unresolved issues require further research efforts. Not the least of these remaining difficulties is the precise description of the machinery used to modify the shape of interest. This choice directly affects other aspects of the design algorithm. Observing that equation (22) requires not only the flow and the adjoint solutions, but also variations in the mesh metrics, we see the importance of choosing the design variable formulation. In order to obtain all the discrete gradient components from equation (22) it will be necessary either to develop an analytic expression for the variation in mesh metrics or to calculate them directly. The availability of this choice will be determined by the choice of the design variable formulation.

Available choices for the design variables span a wide spectrum, ranging from employing the locations of the actual mesh points to relying on the analytic control points used in a CAD definition of the geometry.

In the case of using the actual mesh points, no underlying geometry database exists. Constraints, if present, must be imposed directly on the locations of these mesh points. This approach will surely prove problematic in general. Consider, for example, the difficulties involved in the imposition of a wing fuel volume constraint. In addition, the treatment of surface intersections (such as the wing-body) raises difficulties for this approach since the path for the motion of the mesh points lying directly on these intersections is ill-defined.

However, an advantage of using the mesh points as design variables is that, when combined with an analytic mesh mapping transformation, the calculation of the gradient can be performed without explicitly computing the variations in the mesh metrics. Unfortunately, obtaining such a general mapping transformation increases in difficulty with added geometric complexity.

The alternative of using an underlying geometry database, which may be modified either by the direct application of design variables or by changes in the coefficients of its possibly analytic definition, also has its advantages. First, since the raw unintersected geometries are available, constraints and design changes affecting intersections are easily treated. This can be done without regard to the actual mesh that is used for the flow and adjoint calculations. However, these strengths are counterbalanced by the fact that additional computational work is required to calculate the mesh metric variations.

In the current research, we have used an underlying geometry database where a set of simple geometric entities, such as wings and bodies, are input to the design algorithm in addition to the multiblock mesh used for the calculations. Design variables which are defined as a set of analytic shape functions are applied directly to these geometric entities. Linear and nonlinear geometric constraints are then evaluated on these primary entities. At any particular point in the design process, changes to the mesh surfaces are obtained by first intersecting all of the geometric entities to construct a set of parametric surfaces representing the complete configuration. The location of each surface mesh point on this parametric representation of the geometry is determined for the initial configuration in a pre-processing step. Thus, the results of this pre-processed mapping from parametric geometry to the computational surface mesh points are also a part of the necessary input. The perturbed surface mesh point locations are determined by evaluating the parametric geometry surfaces at these predetermined locations. Once the surface mesh points have been updated, the volume mesh may be perturbed (see following section on mesh motion) and either the gradient or the solution can be calculated. The important feature of this approach is that a set of simple geometric entities lies at the core of the entire design process. This technique retains the typical way in which aerodynamic vehicles are defined, and provides strict control over how surface intersections are treated. Furthermore, since the chosen design variables act directly on the geometric entities, at the end of the design process these entities may be output for future analysis.

In the current implementation, input geometric entities are restricted to those defined by sets of points. However, in the future, CAD entities such as NURBS surfaces will also serve this role, thereby allowing both the input and the output from the aerodynamic surface optimization method to interface directly with a CAD database.

Mesh Perturbation Algorithm

After we have applied a set of design variables to the underlying geometry and mapped these changes to changes in the computational surface mesh points, two related tasks remain. For gradient calculations, variations in the mesh metrics must be calculated. In addition, when a design step is to be taken, it must be possible to deform the entire mesh to accommodate design changes. Both tasks are accomplished in this work by the approach presented in reference [49] and are only outlined here.
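The surface-update machinery described in this section (each surface mesh point is located once on the parametric geometry during pre-processing, and after every design change the stored parametric location is simply re-evaluated on the rebuilt surfaces) can be sketched as follows. The bilinear patch and all function names here are illustrative stand-ins, not the paper's actual geometry kernel.

```python
def eval_patch(corners, u, v):
    """Bilinearly interpolate a quad patch at parametric location (u, v)."""
    (p00, p10, p01, p11) = corners
    return tuple(
        (1 - u) * (1 - v) * a + u * (1 - v) * b + (1 - u) * v * c + u * v * d
        for a, b, c, d in zip(p00, p10, p01, p11)
    )

def locate_points(corners, mesh_points, samples=101):
    """Pre-processing: find the (u, v) of each surface mesh point by sampling."""
    locs = []
    for pt in mesh_points:
        best = min(
            ((i / (samples - 1), j / (samples - 1))
             for i in range(samples) for j in range(samples)),
            key=lambda uv: sum((a - b) ** 2
                               for a, b in zip(eval_patch(corners, *uv), pt)),
        )
        locs.append(best)
    return locs

# Baseline patch and two surface mesh points lying on it.
base = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0))
mesh = [(0.25, 0.5, 0.0), (0.75, 0.25, 0.0)]
uv = locate_points(base, mesh)          # stored once as part of the input

# Design change: raise the z of one corner; re-evaluate the stored (u, v).
perturbed = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0.2))
new_mesh = [eval_patch(perturbed, u, v) for u, v in uv]
```

Because the parametric locations are fixed once in pre-processing, every subsequent design change only costs one surface evaluation per mesh point.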

Since it would be difficult in the current application to obtain an explicit relationship between arbitrary surface changes and variations in the multiblock mesh metrics, these latter quantities are calculated by finite differences. This approach avoids


the use of multiple flow solutions to determine the gradient, but it unfortunately still requires the mesh to be regenerated repeatedly. The number of mesh generations required is proportional to the number of design variables. The inherent difficulty in the approach is two-fold. First, for complicated three-dimensional configurations, elliptic or hyperbolic partial differential equations must normally be solved iteratively in order to obtain acceptably smooth meshes. These iterative mesh generation procedures are usually computationally expensive. In the worst case they approach the cost of the flow solution process. Thus the use of finite difference methods for obtaining metric variations in combination with an iterative mesh generator leads to computational costs which strongly hinge on the number of design variables, despite the use of an adjoint solver to eliminate the flow variable variations. Second, multiblock mesh generation is by no means a trivial task. In fact no method currently exists that allows this to be accomplished as a completely automatic process for complex three-dimensional configurations.

Here, these difficulties are overcome through the use of a mesh perturbation technique. In this approach, a high quality mesh appropriate for the flow solver is first generated by any available procedure prior to the start of the design. In examples to be shown later, these meshes were created using the Gridgen software developed by Pointwise, Inc. [53]. This initial mesh becomes the basis for all subsequent meshes, which are obtained by analytic perturbations.

In order to perturb the multiblock mesh, two capabilities are required. First, the block corners, edges and faces must be moved in a manner that follows the desired geometric changes and simultaneously retains mesh continuity throughout the domain. The second requirement is to move all the points interior to each block such that the spacing distributions and smoothness of the original mesh are retained. This latter requirement is accomplished by the WARP3D algorithm [43]. Since our current flow solver and design algorithm assume a point-to-point match between blocks, each block may be independently perturbed by WARP3D, provided that perturbed surfaces are treated continuously across block boundaries. The methodology used to achieve the first requirement of maintaining continuity in the blocking structure is given as follows:

1. All faces that are directly affected by the design variables (active faces) are explicitly perturbed.

2. All edges that touch an active face, either in the same block or in an adjacent block, are implicitly perturbed by a simple arc-length-based algorithm.

3. All inactive faces that either include an implicitly perturbed edge or abut to an active face are implicitly perturbed by a quasi-3D form of WARP3D.

4. WARP3D is used on each block that has one or more explicitly or implicitly perturbed faces to determine the adjusted interior points.
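The arc-length-based edge perturbation of step (2) can be sketched as follows: the displacements of an edge's two end corners are blended into the interior edge points in proportion to normalized arc length, so the original point distribution along the edge is preserved. Function and variable names are illustrative only; the real algorithm operates on full multiblock edges.

```python
import math

def arc_length_fractions(points):
    """Cumulative arc length along a polyline, normalized to [0, 1]."""
    dists = [0.0]
    for a, b in zip(points, points[1:]):
        dists.append(dists[-1] + math.dist(a, b))
    total = dists[-1]
    return [d / total for d in dists]

def perturb_edge(points, delta_start, delta_end):
    """Move edge points by an arc-length blend of the corner displacements."""
    s = arc_length_fractions(points)
    return [
        tuple(p + (1 - si) * d0 + si * d1
              for p, d0, d1 in zip(pt, delta_start, delta_end))
        for pt, si in zip(points, s)
    ]

edge = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]   # uneven spacing
moved = perturb_edge(edge, delta_start=(0.0, 0.0, 0.0),
                     delta_end=(0.0, 0.3, 0.0))
# the middle point sits at 1/3 of the arc length, so it
# receives 1/3 of the end-corner displacement
```

Because the blend weight is the arc-length fraction rather than the point index, stretched point distributions keep their relative spacing after the perturbation.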

Note that much of the mesh, especially away from the surfaces, will not require mesh perturbations and thus may remain fixed through the entire design process. Close to the surfaces, many blocks will either contain an active face or touch a block which contains an active face, either by an edge or by a corner. As the design variations affect the active faces, the above scheme ensures that the entire mesh will remain attached along block boundaries. Added complexity is needed to accomplish step (2) since the connectivity of the various edges and corners must be indicated somehow. Currently, pointers to and from a set of master edges and master corners are determined as a pre-processing step. During the design calculation, perturbations to any edges or corners are fed to these master edges and master corners, which in turn communicate these changes to all connected edges and corners.

Since this mesh perturbation algorithm is explicit, it is possible to work out the analytic variations in the metric terms required for equation (22). This approach was followed in [46]. However, since the mesh perturbation algorithm used in the current paper is significantly more complex, and it was discovered that the computational cost of repeatedly using the block perturbation algorithm was within reason, finite differences were used to calculate $\delta Q_{ij}$ instead of deriving the exact analytical relationships.
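The finite-difference evaluation of the metric variations can be sketched with a toy one-dimensional mesh: perturb the design variable, re-perturb the mesh, and difference the metric terms. The stretched-grid "metric" below (cell sizes) stands in for the true multiblock face-area metrics $Q_{ij}$; all names are illustrative.

```python
def mesh(alpha, n=8):
    """Toy 1-D mesh whose stretching depends on a design variable alpha."""
    return [(i / n) ** (1.0 + alpha) for i in range(n + 1)]

def metrics(x):
    """Cell-size 'metrics' of the 1-D mesh."""
    return [b - a for a, b in zip(x, x[1:])]

def delta_metrics(alpha, h=1e-6):
    """Finite-difference variation of the metrics w.r.t. alpha."""
    q0 = metrics(mesh(alpha))
    q1 = metrics(mesh(alpha + h))      # one mesh re-generation per variable
    return [(b - a) / h for a, b in zip(q0, q1)]

dq = delta_metrics(0.5)
# the cell-size variations sum to ~0 because the total mesh length is fixed
```

As in the paper, the cost of this scheme is one mesh (re-)perturbation per design variable, which is why the cheap explicit perturbation algorithm makes it affordable.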

Optimization Algorithm and Problem Constraints

With all of the machinery to obtain gradients for an arbitrary set of design variables in place, it remains as a final detail to outline the numerical optimization algorithm and the imposition of constraints. The NPSOL optimization algorithm employed here was chosen because of its extensive past use on aerodynamic optimization problems, and its treatment of both linear and nonlinear inequality constraints. NPSOL [16] is a sequential quadratic programming (SQP) method in which the search direction is calculated by solving the quadratic subproblem where the Hessian is defined by a quasi-Newton approximation of an augmented Lagrangian merit function. The



Lagrange multipliers in this merit function serve to scale the effect of any nonlinear constraints that the design may contain. Linear constraints are treated by solving the quadratic subproblem such that the search direction remains in feasible space. A complete treatment of the method and other optimization strategies is given in [17].

    The entire design procedure is outlined below:

1. Decompose the multiblock mesh across an appropriate number of processors, and create lists of pointers for the communication of the processor halo cells.

2. Solve the flow field governing equations (6-11) for each design point.

3. Solve the adjoint equations (20) subject to the boundary condition (21) for each design point.

4. For each of the n design variables repeat the following:

• Perturb the design variable by a finite step to modify the geometric entities.

• Reintersect the geometric entities and form parametric geometry surfaces.

• Explicitly perturb all face mesh points affected by the geometry changes by evaluating their locations on the parametric geometries.

• Implicitly perturb all faces that share an edge with an explicitly perturbed face.

• Obtain the perturbed internal mesh point locations via WARP3D for those blocks with perturbed faces.

• Calculate all the delta metric terms, $\delta Q_{ij}$, within those blocks that were perturbed, using finite differencing.

• Integrate equation (22) to obtain $\delta I$ for those blocks that contain nonzero $\delta Q_{ij}$, and for each design variable, to determine the gradient component.

5. Calculate the search direction and perform a line search via NPSOL.

6. Return to (2) if a minimum has not been reached.
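The control flow of the design procedure above can be sketched at the level of callable stages. Every function name here is a placeholder for the corresponding solver component (flow solver, adjoint solver, mesh perturbation and gradient assembly, NPSOL step); this is an illustration of the loop structure, not the actual implementation.

```python
def design_loop(x0, flow, adjoint, gradient_component, search_step,
                n_design_vars, max_iter=50, tol=1e-8):
    x = list(x0)
    for _ in range(max_iter):
        w = flow(x)                                  # step 2: flow solution
        psi = adjoint(w, x)                          # step 3: adjoint solution
        grad = [gradient_component(i, x, w, psi)     # step 4: one mesh
                for i in range(n_design_vars)]       #   perturbation per variable
        x_new = search_step(x, grad)                 # step 5: NPSOL search/line step
        if max(abs(g) for g in grad) < tol:
            return x_new                             # step 6: minimum reached
        x = x_new
    return x

# Illustration with dummy stages minimizing sum((x_i - 1)^2):
x_opt = design_loop(
    [0.0, 2.0],
    flow=lambda x: list(x),
    adjoint=lambda w, x: None,
    gradient_component=lambda i, x, w, psi: 2.0 * (x[i] - 1.0),
    search_step=lambda x, g: [xi - 0.25 * gi for xi, gi in zip(x, g)],
    n_design_vars=2,
)
```

The key cost structure is visible in step 4: one flow and one adjoint solution per design iteration, plus one cheap mesh perturbation per design variable.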

    PARALLELIZATION STRATEGY

    Communication Management

Efficient parallel computation on a set of distributed processors is achieved by a combination of minimizing the overhead of communication between these

processors and balancing the partitioned workload among them. The first obvious choice in order to improve parallel performance is then to minimize the amount of communication required between processors. This includes minimizing the number of messages sent and received (latency) as well as the total amount of data to be transferred (bandwidth). Further improvement may be achieved by using other techniques such as latency hiding and scheduled communication, if either the problem or the specific architecture of the distributed platform allows it.

Latency hiding is in itself a form of local parallelism, where the communication and computations of an individual node proceed concurrently with each other. This benefit is accomplished by initiating the bi-directional asynchronous communication as soon as the data to be passed has been updated, and in such a manner that calculations that can be done (without updated data from remote nodes) are performed while the communication continues in the background.
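The overlap described above can be sketched with a background thread standing in for an asynchronous MPI send/receive: the halo exchange is initiated as soon as the boundary data is ready, interior work that needs no remote data proceeds concurrently, and the node waits on the transfer only when halo-dependent work must begin. The timing and all names are illustrative, not the solver's actual communication layer.

```python
import threading, time

def exchange_halo(result):
    time.sleep(0.05)                 # stand-in for network transfer time
    result["halo"] = [0.0] * 4       # received halo values

def update_block():
    result = {}
    comm = threading.Thread(target=exchange_halo, args=(result,))
    comm.start()                     # initiate communication immediately
    interior = sum(i * i for i in range(100000))   # overlapped interior work
    comm.join()                      # wait only before halo-dependent work
    return interior, result["halo"]

interior, halo = update_block()
```

If the interior work takes at least as long as the transfer, the communication cost disappears from the critical path entirely.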

In the case of networks which utilize a collision-detection based protocol (e.g., ethernet), scheduled information transfer may help reduce the communication overhead. A consequence of the collision-detection mechanism is that the effective bandwidth of a saturated network is degraded, since a portion of it is wasted when two or more processors are trying to initiate communication simultaneously. Hence, by scheduling or synchronizing the messages between processors, one can minimize this deterioration in performance. On the other hand, the current generation of parallel computers typically includes a networking environment which is capable of sustaining simultaneous communication among all the processors in the machine. This enhanced communication ability comes with a high price tag. An intermediate solution, where hardware switching is embedded in a workstation distributed computing environment, will be shown to be a compromise between these two options.

In this paper, we revisit the task of minimizing the amount of total communication required between processors. However, we avoid the temptation to communicate less often than is consistent with the baseline serial calculation, so that the convergence of the original scheme is exactly maintained. The possibility of improving the parallel performance of the method by further restricting the amount of communication at different points in the multigrid sequence will be investigated at a later time.

The following subsections describe the baseline method against which all improvements are measured, and the modifications (associated with communication overheads) developed under the present work.



    Baseline Communication

The multiblock solver is parallelized by using a domain decomposition model, a SPMD (Single Program Multiple Data) strategy, and the MPI Library for message passing.

The baseline parallel scheme is exactly consistent with the serial multiblock solution: the results produced by both programs are identical, including the convergence history of the method. Updates to the solution vector in all processors occur at every stage of the Runge-Kutta time-stepping scheme and in every level of a multigrid W-cycle. In addition, the baseline computations are performed with 64-bit arithmetic. Therefore, flow residuals in the calculation can be converged approximately 13 orders of magnitude before roundoff effects stall the convergence process.

The multiblock strategy adopted in this work allows the independent update of the internal cells of every block in the mesh by using a halo or ghost cell approach. The information in this halo of cells surrounding each block is transferred from the corresponding physical cells in the interior of the neighboring blocks. The baseline scheme utilizes a three-pass communication model which allows for the computation of solutions on arbitrarily oriented multiblock meshes.

Under this model, updated halo information is transferred across the six faces of each block during each phase of the three-pass communication. The first pass transfers face information, the second pass provides edge data, and the final pass is required to update the solution across the block corners. With this three-pass approach, each block is guaranteed to have the proper information in its complete halo (including edges and corners) regardless of the topology of the mesh. This is a particularly challenging situation when more than four blocks meet at a given edge. The double halo is used to compute the third-order artificial dissipation terms while preserving a fully conservative scheme. Although the current discretization of the viscous fluxes requires only a single level halo, future variations which require the presence of a double halo can be accommodated with this procedure.

In addition to the above, the blocks of the baseline solution are distributed to the individual processors in such a manner that the total number of unknowns per processor is as evenly loaded as possible. While finding the optimum distribution is recognized to be an NP-Complete problem, a simple algorithm is employed which routinely yields a load balancing in the neighborhood of the optimum. The essence of this algorithm is to take the largest of the remaining blocks (yet to be distributed) and assign it to the

Processor | Ncells  | Cell-Ratio
----------|---------|-----------
1         | 185,088 | 1.007
2         | 185,088 | 1.007
3         | 182,400 | 0.993
4         | 182,400 | 0.993

Table 1: Baseline Load-Balance for the Benchmark Test Case on 4 Processors.

processor with the smallest current load. Repeating this procedure until all blocks have been distributed, an effective load balance algorithm is obtained. Table 1 illustrates the effectiveness of this algorithm for our benchmarking test case.
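The greedy rule described above (largest remaining block to the least-loaded processor) is the classic longest-processing-time heuristic; a minimal sketch, with illustrative block sizes:

```python
def balance(block_sizes, n_procs):
    """Assign blocks to processors: largest block first, least-loaded wins."""
    loads = [0] * n_procs
    assignment = {}
    for block, size in sorted(enumerate(block_sizes), key=lambda b: -b[1]):
        proc = min(range(n_procs), key=lambda p: loads[p])
        loads[proc] += size
        assignment[block] = proc
    return assignment, loads

assignment, loads = balance([7, 5, 4, 4, 3, 3, 2], 2)
# here the greedy rule happens to reach the ideal 14 / 14 split
```

As in Table 1, the heuristic typically lands within a few percent of a perfectly even split without any combinatorial search.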

The test case of Table 1 is used to benchmark the effectiveness of the enhancements described below. It is a 72-block grid about a wing-fuselage-nacelle geometry. The total number of cells in the system is 734,976. Of these, roughly 300,000 are halo cells. This case was chosen specifically because it accentuates the penalty of communication. Yet, on a high-speed, low-latency network such as those on the IBM SP2 or the SGI Origin2000, the corresponding flow solutions scale reasonably with the number of processors.

    One-Pass Communication Model

The use of the original three-pass communication model was necessary for handling a completely general block structure. Drawbacks of this approach are that redundant communications are performed and that the second and third passes must wait until the previous passes have completed before they are started.

The source of redundant data passing can be seen by following the flow of information from one block to a neighboring block coincident with an edge or a corner of the originating block. For example, across an edge, information from one block to another (located above and to the right) can flow in one of two ways. Firstly, the data could flow from the originating block to its right-hand neighbor, then this information could be transferred from this neighbor to the block directly above it. Alternatively, the data could first move upward, then to the right. Because of the complexity involved in determining which path the data should flow along and which it should not, the baseline three-pass model transfers the information in both directions. Similarly, for communications across corners, this redundancy is three-fold.

Hence, an obvious source of improvement is to remove redundant data transfers from the communications model. This is accomplished by adopting


Processor | 1      | 2      | 3      | 4      | MPI
----------|--------|--------|--------|--------|--------
1         | 22,848 | 30,080 | 27,040 | 12,960 | 70,080
2         | 30,080 | 23,296 | 5,376  | 20,224 | 55,680
3         | 27,040 | 5,376  | 21,248 | 31,648 | 64,064
4         | 12,960 | 20,224 | 31,648 | 12,992 | 64,832
MPI       | 70,080 | 55,680 | 64,064 | 64,832 | 254,656

Table 2: Three-Pass Communication Matrix using the Baseline Load-Balance.

Processor | 1      | 2      | 3      | 4      | MPI
----------|--------|--------|--------|--------|--------
1         | 17,504 | 26,480 | 21,128 | 13,744 | 61,352
2         | 26,496 | 17,696 | 8,264  | 16,264 | 51,024
3         | 21,232 | 8,256  | 16,792 | 27,864 | 57,352
4         | 13,776 | 16,232 | 27,768 | 10,136 | 57,776
MPI       | 61,504 | 50,968 | 57,160 | 57,872 | 227,504

Table 3: One-Pass Communication Matrix using the Baseline Load-Balance.

a single-pass scheme which reproduces exactly the end state of the original three-pass model. In order to ensure that the one-pass communication model produces results identical to the three-pass approach, the original three-pass model is used to initialize the communication lists of the one-pass method. This is accomplished in the following manner: after the blocks of the grid system have been assigned to an appropriate processor through the load balancing procedure, the solution vector is colored with information that describes its starting location prior to any communication. This encoding includes information such as block and processor numbers and local cell indices. With this state set, the solution vector is processed with the original three-pass communication model. Upon completion of this data transfer, every halo cell in the distributed system has been reset with information which points back to its origin, i.e. block number, processor number and "distant" cell index. At this stage, new communication lists are constructed and returned to the source processor, which stores them for future use by the one-pass model.
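The initialization described above can be sketched in miniature: the solution vector is first "colored" with each cell's origin, the general exchange is run once on these tags, and the final halo tags are then inverted into direct source-to-destination lists for the one-pass model. The two-block, one-dimensional layout and all names here are purely illustrative.

```python
def three_pass_exchange(blocks):
    """Stand-in for the general exchange: fill each block's halo slots with
    whatever value the neighboring block holds in its boundary cell."""
    left, right = blocks
    left["halo_hi"] = right["cells"][0]
    right["halo_lo"] = left["cells"][-1]

def build_one_pass_lists(blocks):
    # Color every interior cell with its origin (block, cell index).
    for b, blk in enumerate(blocks):
        blk["cells"] = [(b, i) for i in range(len(blk["cells"]))]
    three_pass_exchange(blocks)          # run the slow model once on tags
    # Each halo slot now names its source cell: record the direct mapping.
    lists = []
    for b, blk in enumerate(blocks):
        for slot in ("halo_lo", "halo_hi"):
            if blk.get(slot) is not None:
                lists.append((blk[slot], (b, slot)))   # (src cell, dst slot)
    return lists

blocks = [
    {"cells": [None] * 3, "halo_lo": None, "halo_hi": None},
    {"cells": [None] * 3, "halo_lo": None, "halo_hi": None},
]
sends = build_one_pass_lists(blocks)
# e.g. ((1, 0), (0, "halo_hi")): block 0's upper halo comes from block 1, cell 0
```

The expensive, fully general model is thus paid for exactly once, as a pre-processing step, after which every exchange is a single direct pass.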

Tables 2 and 3 illustrate the reduction in communication achieved for the one-pass model. These tables provide the communication matrix (for the finest mesh in the multigrid sequence) of message sizes for each communication approach. The diagonal terms of these matrices correspond to messages that a processor needs to send to itself. For this kind of message, the present method uses a local memory copy instead of an actual MPI (Message-Passing Interface) message, which is used for inter-processor communication. For the benchmark test case, the one-pass model reduces the total message length by about 11% on the fine mesh. However, because there is no forced synchronization between passes as in the three-pass model, the overhead reduction approaches 25%.

    Delta Updates

In the baseline code, communication always transferred the actual values of the solution vector. In order to preserve 64-bit accuracy, all of these values were transferred as 64-bit floating-point numbers. In the present one-pass model, an additional choice of

communication model has been implemented. We refer to this communication model as the delta update procedure.

The purpose of including a delta form in the present work is motivated by the fact that these delta increments can be transferred as 32-bit numbers while maintaining 64-bit accuracy in the converged solution. Naturally, maintaining this level of precision during the course of convergence requires an occasional reset of the halo values with a 64-bit communication, although the large majority of the communication is now performed using only single precision 32-bit numbers. For this occasional reset, we have maintained the capability to transfer actual full precision values of the solution vector.

For all practical purposes, the communication overhead of the new delta form is half that of the baseline (full precision) transfers.
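The delta-update idea can be sketched as follows: instead of sending the full 64-bit halo value, send the change since the last exchange rounded to 32 bits, and occasionally resynchronize with a full-precision transfer. The `struct` round-trip stands in for a 32-bit MPI message; the update formula and reset interval are illustrative only.

```python
import math, struct

def f32(x):
    """Round a Python float to 32-bit precision (a '32-bit message')."""
    return struct.unpack("f", struct.pack("f", x))[0]

def run_exchange(values, steps, reset_every=50):
    """values: evolving 64-bit field on the sender; returns both copies."""
    recv = list(values)                  # receiver's halo copy, 64-bit
    prev = list(values)
    for step in range(1, steps + 1):
        values = [v + 1e-7 * math.sin(v) for v in values]   # solution update
        if step % reset_every == 0:
            recv = list(values)                              # 64-bit reset
        else:
            recv = [r + f32(v - p)                           # 32-bit delta
                    for r, v, p in zip(recv, values, prev)]
        prev = list(values)
    return values, recv

sender, receiver = run_exchange([i / 7 for i in range(8)], steps=200)
drift = max(abs(a - b) for a, b in zip(sender, receiver))
# drift stays near 32-bit roundoff of the small deltas, far below 64-bit scale
```

Because the deltas are small near convergence, their 32-bit rounding error is tiny in absolute terms, and the periodic full-precision reset prevents it from accumulating.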

    Communication-Weighted Load Balancing

As mentioned above, the original load-balancing algorithm was guided solely by the number of cells being distributed to the complete set of processors. This form of load balancing has proved to be quite acceptable for platforms with state-of-the-art communication capabilities such as the IBM SP2. However, for a cluster of workstations linked together with a lower performance network, this technique can be further refined.

A new load-balancing algorithm has been developed which includes the penalties associated with out-of-processor communication. In this setting, the load is defined as the time it takes each processor to complete all of its tasks: numerical processing as well as sending and receiving the necessary messages. The predicted times of each of these tasks are derived using experimentally obtained MFLOPS (Millions of Floating Point Operations per Second) ratings, and the MPI latency and bandwidth values associated with the particular distributed platform.

The new load-balancing algorithm is very similar to that of the original method, but the "size" of each block is now initialized assuming that the information of all halo cells will be transferred to another



PLATFORM         | MFLOPS | LATENCY (μsec) | BANDWIDTH (MBytes/s)
-----------------|--------|----------------|---------------------
SP2 Switch US    | 50     | 43             | 35.0
SP2 Switch IP    | 50     | 285            | 13.0
HP/J280-100BaseT | 30     | 290            | 7.0
HP/J280-ethernet | 30     | 600            | 0.8

Table 4: Observed Capacities of Various Platforms.

processor. The algorithm then proceeds by taking the largest of the remaining blocks (yet to be distributed) and temporarily assigning it to every processor. When assigned to each processor, a temporary update of the load of that processor is made by adding the size of the current block to that processor's previous load. This assignment is rewarded by a decrease in the equivalent size if neighboring blocks are already assigned to that processor and thus no communication is necessary. After all temporary assignments have been done, the processor whose load is the smallest after the assignment is selected, and the block is permanently assigned to that processor. The previous steps are repeated until all blocks have been distributed.
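The communication-weighted variant just described can be sketched as follows: each block's cost starts as compute time plus the worst-case time to exchange all of its halo cells, and placing a block on the same processor as an already-assigned neighbor refunds that pair's communication cost. The timing model mirrors the MFLOPS/latency/bandwidth quantities of Table 4, but every number and name below is illustrative.

```python
def comm_balance(blocks, neighbors, n_procs,
                 flops_per_cell=500.0, mflops=30.0,
                 latency=600e-6, bytes_per_halo_cell=40.0, bandwidth=0.8e6):
    """blocks: {name: (n_cells, n_halo)}; neighbors: {(a, b): shared halo}."""
    def compute(nc): return nc * flops_per_cell / (mflops * 1e6)
    def comm(nh):    return latency + nh * bytes_per_halo_cell / bandwidth

    loads = [0.0] * n_procs
    where = {}
    for name, (nc, nh) in sorted(blocks.items(), key=lambda kv: -kv[1][0]):
        best_proc, best_load = None, None
        for p in range(n_procs):                 # temporary assignments
            cost = compute(nc) + comm(nh)
            for (a, b), shared in neighbors.items():
                if name in (a, b):
                    other = b if a == name else a
                    if where.get(other) == p:
                        cost -= 2 * comm(shared)   # reward local neighbors
            trial = loads[p] + cost
            if best_load is None or trial < best_load:
                best_proc, best_load = p, trial
        where[name] = best_proc                  # permanent assignment
        loads[best_proc] = best_load
    return where, loads
```

On a slow network the communication refund dominates, so tightly coupled blocks gravitate to the same processor even when that leaves the raw cell counts uneven, exactly the behavior seen in Table 5.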

Table 4 provides representative values for the IBM SP2 using the switch in User Space mode (high performance communication mode), for the switch and IP (Internet Protocol), and for an HP workstation cluster of J280s using switched-100BaseT and standard ethernet. These values have been experimentally observed and may vary from site to site. They correspond to the measured values of latency and bandwidth using various implementations of the MPI standard on the different platforms mentioned above. They are not the manufacturer's published data for the communication hardware. It is interesting to note that the values of latency and bandwidth obtained for the switched-100BaseT network are quite close to those for the IBM SP2 system communicating in IP mode. Therefore, the SP2 can be used to simulate large networks of workstations linked together by a switched-100BaseT network.

Using the network characteristics for the HP cluster on standard ethernet, a new distribution of blocks is obtained. This distribution is provided in Table 5. Comparing it with Table 1, it is noticed that the number of cells per processor is not nearly as well "balanced" as before. Yet, the solution's cycle time of the new distribution on the HP cluster using an ethernet network is only 57.46 seconds, as compared with the original cycle time of 108.38 seconds.

The secret of this performance improvement can be seen by comparing the communication matrices of the original load-balanced distribution with that of the improved one. This information is provided by Tables 3 and 6, respectively. Notice that the new

Processor | Ncells  | Cell-Ratio
----------|---------|-----------
1         | 169,536 | 0.923
2         | 203,136 | 1.106
3         | 188,544 | 1.026
4         | 173,760 | 0.946

Table 5: New Load-Balance for the Benchmark Test Case on 4 Processors.

Processor | 1      | 2      | 3      | 4      | MPI
----------|--------|--------|--------|--------|--------
1         | 28,848 | 11,888 | 16,736 | 4,304  | 32,928
2         | 11,968 | 42,168 | 2,368  | 16,328 | 30,664
3         | 16,744 | 2,552  | 43,840 | 11,696 | 30,992
4         | 4,456  | 16,576 | 11,648 | 35,384 | 32,680
MPI       | 33,168 | 31,016 | 30,752 | 32,328 | 127,264

Table 6: One-Pass Communication Matrix using the New Load-Balance.

load-balancing algorithm has done an effective job of reducing the amount of data to be transferred via MPI. Under the baseline distribution, MPI messages are used to set a total of 227,504 halo cells in the fine mesh. Using the new load-balancing algorithm, MPI calls are now responsible for resetting only 127,264 halo cells. This is accomplished by increasing the amount of data transfer each processor does with itself (i.e., in a global sense, communication is drawn toward the diagonal of these matrices).

    Single-Layer Halo Communication

    In the baseline method, we stated that a double-layer halo surrounds each block and is utilized to facilitate calculation of the third-order artificial dissipation fluxes. However, upon close inspection of the 5-stage Runge-Kutta scheme and multigrid processes, we note that the dissipative fluxes are not recomputed as often as the solution updates occur. In particular, these dissipative terms are typically reset only during the odd stages of Runge-Kutta on the finest mesh and are never computed in any of the coarser levels of the multigrid scheme.

    Immediately, we can omit transferring the outer-layer halo data during the even (of five) stages of Runge-Kutta on the fine mesh. This reduces the fine-mesh communication by 20%.
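The 20% figure follows from simple stage counting, assuming (for the estimate) that the inner and outer halo layers hold comparable numbers of cells:

```python
# Per multigrid cycle on the finest mesh, count halo-layer transfers for
# the 5-stage Runge-Kutta scheme. Assumption: the inner and outer halo
# layers contain comparable numbers of cells, so each layer is one "unit".
STAGES = 5

baseline = STAGES * 2                     # both layers every stage: 10 units
improved = sum(2 if s % 2 == 1 else 1     # both layers on odd stages only,
               for s in range(1, STAGES + 1))  # inner layer alone on even: 8

reduction = 1 - improved / baseline       # 0.20, the 20% quoted above
```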

    For a 4-level multigrid W-cycle in the baseline code, more than 45% of the total data transferred during the cycle resides in the coarser-level meshes.†† By updating only the data of the inner halo during the coarse-level communication, an additional improvement is realized.
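The coarse-level share can be checked from the W-cycle visit counts given in the footnote (finest level once per cycle, second level twice, third and fourth levels four times) together with the bound that each coarser level holds at least one-fourth the halo cells of the next finer one:

```python
# W-cycle communication accounting, with communication on the finest level
# once per cycle, the second level twice, and levels 3 and 4 four times.
# Each coarser level has AT LEAST one-fourth the halo cells of the next
# finer one, so the 0.25 factor gives a lower bound on coarse-level traffic.
visits = {1: 1, 2: 2, 3: 4, 4: 4}
halo_fraction = {lvl: 0.25 ** (lvl - 1) for lvl in visits}

coarse = sum(visits[l] * halo_fraction[l] for l in (2, 3, 4))  # 0.8125
share = coarse / (1.0 + coarse)  # coarse-level share of all transfers
```

The lower bound gives coarse-level traffic at least 81% as intense as the finest level, i.e. a share of about 44.8% of the total; since real coarse-mesh halos exceed the one-fourth bound, the actual share is above 45%, as stated in the text.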

    ††A W-cycle with four levels of multigrid traverses the second-level grid exactly twice when communication is involved, and four times in the third and fourth levels. The number of halo cells of a coarse mesh is at least one-fourth that of



    The above two improvements combine to reduce the total amount of data transferred per multigrid cycle. Relative to the baseline communication, this reduction in overhead is between 33.3% and 54.7%, depending on the granularity of the mesh involved.

    Communication Improvement Summary

    For the 72-block mesh in question, the relative improvements in communication overhead with respect to the original scheme can be summarized as follows:

    • 20% reduction in overhead with one-pass (benchmark).

    • 50% reduction in overhead with delta form (in general).

    • 50% reduction in overhead with new load balancer (benchmark, ethernet).

    • 33%-55% reduction in overhead with single-halo transfers (in general).

    • Communication reduced by a minimum of 75% when combined.

    RESULTS

    Design Results

    The design test cases to be presented here will focus on the wing redesign of a typical transonic business jet. The designs will be carried out independently using the Euler and Navier-Stokes equations. The discussion will conclude with comparisons between the final Euler and Navier-Stokes designs. For the Euler design case, reference [50] gives a treatment of the reliability of the flow solver as well as the ability of the adjoint method to provide accurate gradients very efficiently. With regard to the validity of the Navier-Stokes case, a comparison will be made for the initial configuration using both the inviscid and the viscous equations. The adjoint gradients for the Navier-Stokes test case will not be compared with finite difference calculations for two reasons. First, since the adjoint used to obtain the gradients is not of the viscous type, it is understood and accepted that it will not produce gradients that are consistent with the finite difference approach. Second, the computational cost of obtaining finite difference gradients for the Navier-Stokes design on a large three-dimensional test case is prohibitive. In order to obtain accurate finite difference gradients,

    the next finer mesh. Hence, the baseline communication in the coarse grids is at least 81% as intense as it is in the finest mesh. Further study of a grid with only one interior cell in the fourth-level mesh shows that the baseline communication in the coarser grids can approach 183% that of the finest mesh.

    the flow solution must be converged at least two or three orders more than is necessary for adjoint gradients [52, 39]. Navier-Stokes solvers, with their notoriously slow convergence, would take an unacceptable number of iterations to achieve such a level of convergence.

    Flow Solver Comparison

    In the design demonstration of the multiblock optimization algorithm to follow, a typical transonic business jet configuration is considered. The same geometry was also studied in [50, 15, 49]. Here the complete configuration including wing, body, nacelle, pylon, vertical tail, and horizontal tail will be used. Prior to the start of the designs, flow analyses were completed using the Euler and Navier-Stokes equations.

    Two alternative meshes were constructed that shared the same block topologies and differed only in the normal wall spacings and cell counts. The meshes were both created with a general C-O topology and flow-through nacelles. Both meshes featured 240 blocks, with the Euler mesh having 4.1 million computational cells and the Navier-Stokes mesh having 5.8 million computational cells. The relative ratio between the two is smaller than expected since the Euler mesh was constructed from the Navier-Stokes mesh simply by coarsening in the viscous direction. Furthermore, while complete configurations are being modeled, only the wing is treated as a viscous solid surface; the other components are handled with inviscid boundary conditions. A single flow calculation for the Euler solution using 200 multigrid cycles converges 5 orders of magnitude in 0.8 hours of wall clock time on 32 processors of an IBM SP2 machine. By comparison, a Navier-Stokes analysis uses 300 multigrid cycles to converge 4.7 orders of magnitude and consumes 2.0 hours of wall clock time on the same 32 processors. Euler and Navier-Stokes solutions for the baseline configuration at the same over-design flight conditions are compared in Figure (1). The Cp distributions depicted in the figure show the usual trend of the shock strength being reduced and the shock location moving forward for the viscous analysis when compared to the inviscid analysis. The Navier-Stokes calculation was carried out using an all-turbulent boundary layer with a Baldwin-Lomax turbulence model. The wall normal spacing of the first cell was such that at the cruise condition a y+ = 1 would be attained at the half-span trailing edge, assuming a flat plate turbulent boundary layer. At the cruise condition (M = 0.80 and an altitude of 40,000 ft) the Reynolds number is 1.45 million/ft.
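The first-cell wall spacing implied by the y+ = 1 requirement can be estimated from a flat-plate correlation, as the text describes. The sketch below uses the common turbulent skin-friction fit cf = 0.026 Re_x^(-1/7); the correlation choice and the 5 ft chord length are assumptions for illustration, not values from the paper.

```python
import math

# Rough flat-plate estimate of the first-cell wall spacing giving y+ = 1.
# Assumptions: turbulent skin-friction correlation cf = 0.026 * Re_x**(-1/7)
# (one common grid-spacing estimate) and a 5 ft local chord; the unit
# Reynolds number 1.45 million/ft is the paper's cruise condition.
def first_cell_height(re_per_ft, x_ft, y_plus=1.0):
    re_x = re_per_ft * x_ft
    cf = 0.026 * re_x ** (-1.0 / 7.0)
    u_tau_over_u = math.sqrt(cf / 2.0)          # friction velocity / U
    return y_plus / (re_per_ft * u_tau_over_u)  # spacing in feet

dy = first_cell_height(1.45e6, 5.0)  # on the order of 1e-5 ft
```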



    The wing sweep for the design is a low 20 degrees. Thus, with the thick airfoil sections featured in the configuration, it represents a challenge to contain wave drag at the moderate Mach numbers of its design point (M = 0.75 - 0.82). Although they are not presented here, correlations of the wing pressure distributions have been obtained with experimental data. The comparisons with tunnel data are excellent except for a 5% difference in the location of the upper surface shock for the inviscid analysis. The Navier-Stokes solutions virtually overlay the wind tunnel data.

    Inviscid Transonic Flow Constrained Aircraft Design

    The Euler mesh described above was used during an inviscid redesign of the wing in the presence of the complete configuration. The baseline configuration was designed for flight at M = 0.80 with a CL = 0.30. In this inviscid design case, a single point constrained design is attempted in which the Mach number and CL are pushed to 0.82 and 0.35, respectively. The objective is to minimize configuration pressure drag at a fixed lift coefficient by modifying the wing shape. Eighteen Hicks-Henne design variables were chosen for each of six wing defining sections, for a total of 108 design variables. Spar thickness constraints were also enforced at each defining station at x/c = 0.2 and x/c = 0.8. Maximum thickness was forced to be preserved at x/c = 0.4 for all six defining sections. Each section was also constrained to have the thickness preserved at x/c = 0.95 to ensure an adequate included angle at the trailing edge. A total of 55 linear geometric constraints were imposed on the configuration. Figure (2) shows overlays of the Cp distributions at four stations along the wing for the initial design and the final design after 6 NPSOL design iterations. It is seen that the final result has reached a near-shock-free condition over much of the outboard wing panel. The drop in complete configuration pressure drag for this case was 24.6%. Noting that most of this drag reduction came from a decrease in wing wave drag implies that further improvements may be possible through the reshaping of other components. The program took 13.5 hours to complete 6 design cycles on 32 processors of an IBM SP2.
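The Hicks-Henne design variables mentioned above are smooth "bump" functions added to each defining section; each bump peaks at a chosen chordwise location and vanishes at the leading and trailing edges, so a handful of them per section can reshape the airfoil without destroying smoothness. A minimal sketch of the standard bump form (parameter names are ours):

```python
import math

# Standard Hicks-Henne bump function:
#   b(x) = amp * sin(pi * x**e)**width,  with  e = ln(0.5) / ln(loc),
# defined on the unit chord x in [0, 1]. The exponent e places the peak
# exactly at x = loc, and the bump vanishes at both ends of the chord.
def hicks_henne(x, loc, width, amp):
    if x <= 0.0 or x >= 1.0:
        return 0.0
    e = math.log(0.5) / math.log(loc)
    return amp * math.sin(math.pi * x ** e) ** width
```

With 18 such bumps on each of six defining sections, the 108 amplitudes are the design variables handed to the optimizer.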

    Viscous Transonic Flow Constrained Aircraft Design

    In a second design example for this complete business jet configuration, a viscous redesign of the wing is attempted. The design point is again chosen to be M = 0.82 with a CL = 0.35. Again the design algorithm, this time with a no-slip boundary condition on the wing and the viscous terms turned on,


    is run in drag minimization mode. Figure (3) shows an iso-Cp colored representation of the initial design and the final design after 5 NPSOL design iterations. It is clearly seen that the rather low Cp region terminated by a strong shock spanning the entire wing upper surface has been largely eliminated in the final design. Figure (4) shows the initial and final Cp distributions achieved using the same 108 design variables and 55 geometric constraints employed for the inviscid test case. Note that the strong shocks present on both the upper and lower surfaces in the initial configuration have been eliminated. Furthermore, it is apparent that the character of the changes to the pressure distribution follows, to some extent, that which occurred for the Euler-based design. The main difference is that the Navier-Stokes design tends to have more benign behavior in the pressure distributions. The overall pressure drag for the complete configuration was reduced by 21.5%. Before proceeding to the next section, it should be noted that these business jet design examples are only representative of the potential for automated design, and are not intended to provide designs for actual construction. First, in each case only 5 or 6 NPSOL steps were taken, where considerably more could have improved the designs slightly. More importantly, since these are only single point designs, either may suffer unacceptable off-design behavior. In our most recent previous paper [49] we treated the case of inviscid design at multiple design points, while here we address the case of viscous design at a single design point. Eventually, both the multipoint and viscous design capabilities must be treated concurrently. The calculation took 28 hours to complete 5 design cycles on 32 processors of an IBM SP2.

    Crosscheck

    To see the difference between the two design cases explored here, the Euler-based design was reanalyzed using the Navier-Stokes equations. The mesh for this cross check was created by entering the design variable coefficients produced by the Euler design process into a set of inputs for a Navier-Stokes analysis. The result of this reanalysis is shown in Figure (5). It is seen that the wing designed using the new Navier-Stokes approach employed here has slightly weaker shocks than the one designed via the inviscid approach. This conclusion is also supported by the drag results: the Euler design analyzed with the Navier-Stokes equations shows a 20.5% improvement, as contrasted with the 21.5% for the Navier-Stokes based design. However, it is seen from these two designs that the Euler strategy did not perform at all poorly and indeed was able to achieve much of the improvement that was possible from a Navier-Stokes based design


    method. This conclusion must be taken with caution since the sensitivity of the pressure distributions to the presence of the boundary layer can vary widely depending on the configuration. Furthermore, had we used a true viscous adjoint, it may have been possible to lower the pressure drag for the configuration even further. Clearly, much further testing of the design approaches developed here is needed. These calculations must be taken as the preliminary steps towards Navier-Stokes based aircraft design.

    Parallel Performance of the Method

    The following section presents a series of parallel scalability curves for the baseline code with the various added improvements to reduce the communication overhead. The scalability study was conducted for two different meshes whose ratios of computation to communication were chosen to be at both ends of the spectrum. The first mesh is the benchmark mesh referred to above. This mesh consists of 72 blocks of varying sizes and has a total of 734,976 cells, of which over 300,000 reside in the halos. As one can see, the results on this mesh will provide an extreme test of the scalability of the method since a large part of the time will be spent in the communication process. The second mesh is referred to as the fine mesh and consists of 48 blocks of different sizes with a total of 2.5 million cells, of which about 600,000 reside in the halos. This mesh has a much higher ratio of computation to communication, and therefore, the parallel scalability of the code is expected to improve when compared with the benchmark mesh. Nevertheless, the contrast provided between the results on the two meshes is intended to be illustrative of the range of performance that can be expected for meshes of varying sizes.

    Scalability studies were conducted on a range of platforms whose performance characteristics are presented in Table 4. These platforms include the very high performance communication network of the IBM SP2 (in User Space mode) and networks of workstations connected via standard 10BaseT ethernet as well as a higher performance (although still low cost) switched-100BaseT fast ethernet network. For the benchmark mesh, the range 1-16 was selected for the number of processors, whereas for the fine mesh, 4-32 was selected instead. This latter choice results from the inability to fit this large calculation in fewer than 4 processors. In order to present speedup data for this mesh, the best achievable timing for a one processor calculation was derived from the measurements of computational time (not communication) observed for the 4 processor case. It will be seen from the data that for this mesh the program must scale at worst linearly between 1 and 4 processors, and therefore our

    timing assumption is conservative.
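The bookkeeping behind this conservative speedup estimate can be sketched in a few lines; the timings below are hypothetical placeholders, not measured values from the paper.

```python
# Conservative speedup bookkeeping for the fine mesh: the one-processor
# time is estimated as 4x the computation-only time measured on 4
# processors, so any reported speedup is an underestimate.
# All timings below are HYPOTHETICAL, for illustration only.
def speedup(t1_estimate, t_p):
    return t1_estimate / t_p

t_comp_4 = 30.0        # computation-only seconds/cycle on 4 procs (made up)
t1 = 4 * t_comp_4      # conservative serial estimate: 120 s
s8 = speedup(t1, 17.0) # vs. a measured 8-processor cycle time (made up)
```

Because the 4-processor measurement excludes communication, the true serial time can only be larger than t1, so the speedups plotted from this estimate are lower bounds.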

    Figures 6-8 present the results obtained for the benchmark mesh using the IBM SP2 in User Space and Internet Protocol modes and the cluster of workstations using switched-100BaseT, for both the baseline and the improved load balancing algorithms. Since the SP2 in User Space mode has a very high performance network, the impact of the use of different load balancing techniques is small. For example, the parallel speedup for the 1-pass, single precision scheme using 16 processors remained virtually unchanged, from 11.88 to 11.89. When used in IP mode (a lower performance network that very well simulates the results on the network of workstations linked via switched-100BaseT), the results of improved load balancing schemes start to show up. For the same communication algorithm and the same mesh, the parallel speedup improved from 8.46 to 8.70 using 16 processors. Using 8 processors on the switched-100BaseT network for the same algorithm, the parallel speedup improves from 3.212 to 3.645. These results are even more dramatic for the case of the lowest performance network: unswitched 10BaseT (ethernet). In this case, the use of the improved load balancing scheme decreased the time to complete one multigrid iteration from 108.38 seconds to 57.46 seconds using 4 processors in the network.

    A more interesting point addressed by the results in these scalability plots is the increase in performance derived from successive improvements to the communication algorithm. Figures 6a, 7a, and 8a show the progression from the 3-pass, double precision scheme to the 1-pass, double precision scheme; the 1-pass, single precision scheme; and the 1-pass, single precision scheme with single-level halo, using the IBM SP2 in both US and IP modes and the network of switched-100BaseT workstations. These results use the original load balancing technique. Similar conclusions can be obtained from Figures 6b, 7b, and 8b, which use the improved load balancing algorithm. For the US results, the parallel speedup using 16 processors improved from 10.41 to 13.00. For the IP results, the improvement is larger (as expected from the use of a lower performance communication network): parallel scalability for the same number of processors improved from 7.11 to 10.69. It must be noted that in both cases, there is an upper limit to the parallel scalability achievable by the scheme, derived from the impossibility of evenly load balancing a calculation in which the sizes of the different blocks in the mesh vary. For this case, this upper limit on parallel scalability using 16 processors is found to be 14.9, which is very close to the actual achieved value using the IBM SP2 in US mode, considering the highly communicative nature of the benchmark mesh. At the same time, the results for the switched-100BaseT network of workstations using 8 processors show an improvement in parallel scalability from 2.98 to 5.41, pointing out how much more important these successive communication improvements are for networks with lower performance. From these data, it is clearly seen that a high performance message passing implementation is essential to allow the effective use of networks of workstations.
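The load-imbalance limit on speedup comes from the busiest processor setting the pace: with perfect communication, the best achievable speedup is the total work divided by the largest per-processor load. The sketch below applies this bound to the 4-processor distribution of Table 5; the 14.9 figure quoted for 16 processors comes from the same formula applied to the 16-processor distribution, which is not listed here.

```python
# Upper bound on parallel speedup imposed by load imbalance alone:
#   S_max = (total cells) / (max cells on any one processor).
# Cells per processor from Table 5 (benchmark mesh on 4 processors):
loads = [169536, 203136, 188544, 173760]

s_max = sum(loads) / max(loads)  # about 3.62 on 4 processors
```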

    A clearer representation of the improvement in communication performance can be seen in Figure 9a, where the parallel scalability results using the IBM SP2 US and IP modes for the benchmark mesh have been overlaid. The lower curve shows the speedup obtained using the original scheme (3-pass, double precision) with the SP2 in US mode, whereas the upper curve shows the speedup of the improved scheme (1-pass, single precision, single-level halo) using the SP2 in the lower performance IP communication mode. Since the IP mode is representative of a switched fast ethernet network of workstations, one can deduce from this graph that the improvements in communication performance introduced in this work make the use of networks of workstations for the design process entirely feasible.

    All these scalability studies were repeated for the fine mesh with 48 blocks and 2.5 million cells. The difficulties with parallel scalability exhibited in the previous figures are substantially ameliorated. In the interest of space, only a summary is presented in Figure 9b, where the results using up to 32 processors are presented for the IBM SP2 in US mode and the different variations in communication algorithms. It can be seen that the communication improvements shift the scalability curve slightly upwards. This shift is small since for numbers of processors up to 16 the ratio of communication time to processing time is very small. Note that some of the calculations exhibit superlinear speedup for some numbers of processors; this effect has been observed previously [12] and is attributed to a better utilization of the processor cache resulting from a decrease in the size of the datasets that the processor operates on. The decrease in performance for the 32 processor results is a consequence of the poor load balancing that can be obtained with a mesh that only has 48 blocks.

    Although not reported in any of the figures, scalability studies were performed for the baseline unswitched ethernet network of workstations (10BaseT). This communication fabric has the limitation that the total bandwidth of the system remains constant, independent of the number of processors in the calculation. This lack of bandwidth scalability results from the very nature of ethernet communication; the network can be used by only one processor at a time. Table 7 shows the results for the parallel speedups obtained using the baseline algorithm (original load balancing, 3-pass,

    No. Processors    Baseline    Optimized
    1                 1.000       1.000
    2                 0.768       1.552
    4                 0.688       2.301
    8                 -           1.493

    Table 7: Parallel Speedups for the Baseline and Optimized Communication Schemes Using Unswitched Ethernet (10BaseT).

    double precision) and the improved communication scheme (new load balancing, 1-pass, single precision, single-level halo). The disappointing performance of this network clearly points out that typical ethernet networks are not suitable for parallel computations such as these, which place a high demand on the communication layer. Despite this negative result, it is apparent that the new communication algorithms dramatically improved the performance of the method.

    CONCLUSIONS

    A general aerodynamic shape optimization method has been developed and demonstrated for the Euler and Navier-Stokes automatic redesign of a complete aircraft configuration in transonic flow. The design method is implemented on distributed memory systems (including parallel computers and distributed networks of workstations) using the MPI Standard, and it achieves excellent scalability on all platforms except for the simple unswitched ethernet case (10BaseT). For a full configuration viscous mesh containing 5.8 million cells and 240 blocks, a complete design including a total of 5 design iterations can be completed in 28 hours using 32 processors of an IBM SP2 system. The present method uses the Navier-Stokes equations for the solution of the flow and an inviscid adjoint solver for the calculation of the gradient information. Further research and development work is required to assess the applicability and usefulness of a viscous adjoint technique.

    With the present method, the automatic aerodynamic design of complex aircraft configurations using a high fidelity viscous flow model based on the RANS equations becomes feasible on the current generation of parallel computers. Moreover, the applicability of the method is demonstrated for use on networks of workstations with a moderate investment in networking resources (switched-100BaseT).

    ACKNOWLEDGEMENTS


    Our research has benefited greatly from the support of ARPA under Grant Number N00014-92-J-1796,


    AFOSR under Grant Number AFOSR-F49620-95-1-0259, and USRA through RIACS at the NASA Ames Research Center. The support provided by the IBM-CRA grant is gratefully acknowledged by the authors. The Douglas Aircraft Company is also acknowledged for supporting the implementation of the design method on a distributed network of workstations. Special thanks are extended to Mark Rimlinger for the generation of the initial meshes used in the calculations presented here.

    REFERENCES

    [1] J. J. Alonso, L. Martinelli, and A. Jameson. Multigrid unsteady Navier-Stokes calculations with aeroelastic applications. AIAA paper 95-0048, AIAA 33rd Aerospace Sciences Meeting and Exhibit, Reno, NV, January 1995.

    [2] J. J. Alonso, T. J. Mitty, L. Martinelli, and A. Jameson. A two-dimensional multigrid Navier-Stokes solver for multiprocessor architectures. In Satofuka, Periaux, and Ecer, editors, Parallel Computational Fluid Dynamics: New Algorithms and Applications, pages 435-442, Kyoto, Japan, May 1994. Elsevier Science B. V. Parallel CFD '94 Conference.

    [3] W. K. Anderson and V. Venkatakrishnan. Aerodynamic design optimization on unstructured grids with a continuous adjoint formulation. AIAA paper 97-0643, 35th Aerospace Sciences Meeting and Exhibit, Reno, Nevada, January 1997.

    [4] O. Baysal and M. E. Eleshaky. Aerodynamic sensitivity analysis methods for the compressible Euler equations. Journal of Fluids Engineering, 113(4):681-688, 1991.

    [5] O. Baysal and M. E. Eleshaky. Aerodynamic design optimization using sensitivity analysis and computational fluid dynamics. AIAA Journal, 30(3):718-725, 1992.

    [6] O. Baysal and M. E. Eleshaky. Airfoil shape optimization using sensitivity analysis on viscous flow equations. Journal of Fluids Engineering, 115:75-84, 1993.

    [7] G. W. Burgreen and O. Baysal. Aerodynamic shape optimization using preconditioned conjugate gradient methods. AIAA paper 93-3322, July 1993.

    [8] G. W. Burgreen and O. Baysal. Three-dimensional aerodynamic shape optimization of wings using sensitivity analysis. AIAA paper 94-0094, 32nd Aerospace Sciences Meeting and Exhibit, Reno, Nevada, January 1994.


    [9] G. W. Burgreen, O. Baysal, and M. E. Eleshaky. Improving the efficiency of aerodynamic shape optimization procedures. AIAA paper 92-4697, September 1992.

    [10] R. Campbell. An approach to constrained aerodynamic design with application to airfoils. NASA Technical Paper 3260, Langley Research Center, November 1992.

    [11] R. Campbell. Efficient constrained design using Navier-Stokes codes. AIAA paper 95-1808, 13th Applied Aerodynamics Conference, San Diego, CA, June 1995.

    [12] S. Cliff, S. Thomas, T. Baker, W. Cheng, and A. Jameson. Supersonic transport design on the IBM parallel system SP2. Santa Clara Convention Center, Sunnyvale, CA, March 1995. Proceedings of the Computational Aerosciences Workshop.

    [13] M. E. Eleshaky and O. Baysal. Preconditioned domain decomposition scheme for three-dimensional aerodynamic sensitivity analysis. In Proceedings of the 12th AIAA Computational Fluid Dynamics Conference, pages 1055-1056, July 1993.

    [14] M. E. Eleshaky and O. Baysal. Shape optimization of a 3-D nacelle near a flat plate wing using multiblock sensitivity analysis. AIAA paper 94-0160, 32nd Aerospace Sciences Meeting and Exhibit, Reno, Nevada, January 1994.

    [15] J. Gallman, J. Reuther, N. Pfeiffer, W. Forrest, and D. Bernstorf. Business jet wing design using aerodynamic shape optimization. AIAA paper 96-0554, 34th Aerospace Sciences Meeting and Exhibit, Reno, Nevada, January 1996.

    [16] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright. User's guide for NPSOL (version 4.0): a FORTRAN package for nonlinear programming. Technical Report SOL 86-2, Stanford University, Department of Operations Research, 1986.

    [17] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981.

    [18] R. M. Hicks and P. A. Henne. Wing design by numerical optimization. Journal of Aircraft, 15:407-412, 1978.

    [19] R. M. Hicks, E. M. Murman, and G. N. Vanderplaats. An assessment of airfoil design by numerical optimization. NASA TM X-3092, Ames Research Center, Moffett Field, California, July 1974.


    [20] A. C. Taylor III, G. W. Hou, and V. M. Korivi. Sensitivity analysis, approximate analysis, and design optimization for internal and external viscous flows. AIAA paper 91-3083, September 1991.

    [21] A. Jameson. Solution of the Euler equations for two dimensional transonic flow by a multigrid method. Applied Mathematics and Computation, 13:327-356, 1983.

    [22] A. Jameson. Multigrid algorithms for compressible flow calculations. In W. Hackbusch and U. Trottenberg, editors, Lecture Notes in Mathematics, Vol. 1228, pages 166-201. Proceedings of the 2nd European Conference on Multigrid Methods, Cologne, 1985, Springer-Verlag, 1986.

    [23] A. Jameson. Aerodynamic design via control theory. Journal of Scientific Computing, 3:233-260, 1988.

    [24] A. Jameson. Automatic design of transonic airfoils to reduce the shock induced pressure drag. In Proceedings of the 31st Israel Annual Conference on Aviation and Aeronautics, Tel Aviv, pages 5-17, February 1990.

    [25] A. Jameson. Optimum aerodynamic design via boundary control. In AGARD-VKI Lecture Series, Optimum Design Methods in Aerodynamics. von Karman Institute for Fluid Dynamics, 1994.

    [26] A. Jameson. Analysis and design of numerical schemes for gas dynamics 1, artificial diffusion, upwind biasing, limiters and their effect on multigrid convergence. Int. J. of Comp. Fluid Dyn., 4:171-218, 1995.

    [27] A. Jameson. Analysis and design of numerical schemes for gas dynamics 2, artificial diffusion and discrete shock structure. Int. J. of Comp. Fluid Dyn., 5:1-38, 1995.

    [28] A. Jameson. Optimum aerodynamic design using CFD and control theory. AIAA paper 95-1729, AIAA 12th Computational Fluid Dynamics Conference, San Diego, CA, June 1995.

    [29] A. Jameson. Optimum Aerodynamic Design Using Control Theory, Computational Fluid Dynamics Review 1995. Wiley, 1995.

    [30] A. Jameson and J. J. Alonso. Automatic aerodynamic optimization on distributed memory architectures. AIAA paper 96-0409, 34th Aerospace Sciences Meeting and Exhibit, Ren