
Runge-Kutta Software for the Parallel Solution of Boundary Value ODEs


P.H. Muir    R.N. Pancer    K.R. Jackson

August 18, 2000

Abstract

In this paper we describe the development of parallel software for the numerical solution of boundary value ordinary differential equations (BVODEs). The software, implemented on two shared memory, parallel architectures, is based on a modification of the MIRKDC package, which employs discrete and continuous mono-implicit Runge-Kutta schemes within a defect control algorithm. The primary computational costs are associated with the almost block diagonal (ABD) linear systems representing the Newton matrices arising from the iterative solution of the nonlinear algebraic systems which result from the discretization of the ODEs. The most significant modification featured in the parallel version of the code is the replacement of the sequential ABD linear system software, COLROW, which employs alternating row and column elimination, with new parallel ABD linear system software, RSCALE, which is based on a recently developed parallel block eigenvalue rescaling algorithm. Other modifications are associated with the parallelization of the setup of the ABD systems, the setup of the approximate solution interpolants, and the estimation of the defect. The numerical results show that nearly optimal speedups can be obtained, and that substantial speedups in overall solution time are achieved, compared with the sequential version of MIRKDC.

Keywords: Parallel software, boundary value ordinary differential equations, almost block diagonal linear systems, Runge-Kutta methods, shared memory.

AMS Subject Classifications: 65L10, 65Y05

1 Introduction
In this paper we consider the development of software for the efficient numerical solution of boundary value ordinary differential equations (BVODEs) on parallel computers based on shared memory architectures.

(This work is supported by the Natural Sciences and Engineering Research Council of Canada, the Information Technology Research Center of Ontario, and Communications and Information Technology Ontario. Author affiliations: P.H. Muir, Department of Mathematics and Computing Science, Saint Mary's University, Halifax, Nova Scotia, Canada B3H 3C3 (Paul.Muir@stmarys.ca); R.N. Pancer, Division of Physical Sciences, University of Toronto at Scarborough, Scarborough, Ontario, Canada M1C 1A4 (pancer@cs.toronto.edu); K.R. Jackson, Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3G4 (krj@cs.toronto.edu).)

We shall assume BVODEs having the general form

\[
y'(t) = f(t, y(t)), \tag{1}
\]

with boundary conditions in separated form,

\[
g_a(y(a)) = 0, \qquad g_b(y(b)) = 0, \tag{2}
\]

where y : ℝ → ℝ^n, f : ℝ × ℝ^n → ℝ^n, g_a : ℝ^n → ℝ^p, and g_b : ℝ^n → ℝ^q, with p + q = n.
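As a simple illustration of this form (our example, not one of the test problems considered later), the scalar second order problem u''(t) = u(t), u(0) = 1, u(1) = 0, can be written as a first order system with n = 2 and separated boundary conditions with p = q = 1:

\[
y(t) = \begin{pmatrix} u(t) \\ u'(t) \end{pmatrix}, \qquad
y'(t) = f(t, y(t)) = \begin{pmatrix} y_2(t) \\ y_1(t) \end{pmatrix}, \qquad
g_a(y(0)) = y_1(0) - 1, \qquad
g_b(y(1)) = y_1(1).
\]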

Such problems arise in a wide variety of application areas; see [7] for further details and examples.

There has been a considerable history of high quality software for the efficient numerical solution of (1), (2). The packages can be roughly divided into those based on initial value methods and those based on global methods. In the former group, most codes are based on some form of multiple shooting: one subdivides the problem interval with a set of breakpoints and considers an associated initial value ordinary differential equation (IVODE) (with estimated initial conditions) on each subinterval; the IVODEs are solved to a given user provided tolerance using high quality IVODE software. The IVODE solutions are then evaluated at the breakpoints and improved estimates of the initial conditions are computed, within the context of a Newton iteration. A refinement of the mesh of breakpoints is then performed to provide a better adaptation of the mesh to the solution behavior. The process terminates when the IVODE solutions match at the breakpoints to within the user tolerance. See, e.g., [7], Chapter 4, for further details.

Within the class of global methods, the general algorithmic approach is to subdivide the problem interval with a set of meshpoints and then apply a numerical discretization method, e.g., a finite difference method, to replace the ODEs with a system of algebraic equations, which are then solved with Newton's method. An estimate of the error is then used to assess solution quality and determine a better adapted mesh, and the process repeats until the estimated error satisfies the user tolerance. See, e.g., [7], Chapter 5, for further details. This class includes finite-difference, Runge-Kutta, and collocation methods.

In both groups of software, the dominant computational effort involves the solution of a large system of nonlinear equations by a Newton iteration. Because the computations can be localized within each subinterval, the Newton matrices generally have a sparsity structure which has been referred to in the literature as almost block diagonal (ABD) [15]. The excellent survey [3] provides an extensive review of ABD systems, including contexts in which they arise and methods for solving them. A number of software packages for the efficient factorization and solution of linear systems for which the coefficient matrix has an ABD structure have been developed over the last 25 years.

In §2, we will briefly review sequential software for the numerical solution of BVODEs and for the efficient treatment of ABD systems. This section also includes a brief discussion of more general BVODE software for the efficient treatment of a sequence of BVODEs arising in a parameter continuation framework. Parameter continuation computations, since they involve repeated use of a BVODE code, can be very computationally intensive. As well, the application of the transverse method of lines in the numerical solution of systems of parabolic partial differential equations (see, e.g., [7]) can lead to large BVODE systems which require substantial computational expense. Since parallel computers have become commonly available, it has become natural to consider the efficient numerical solution of BVODEs on such architectures.

A brief investigation of the relative costs of the principal algorithmic components of a typical BVODE software package reveals that the only significant computations which cannot be parallelized in a straightforward fashion are the factorization and solution of the ABD linear systems. The last 10-15 years have seen considerable effort expended in the search for efficient and stable parallel algorithms for ABD systems and a substantial number of papers have been written on the subject; we briefly review this work in §3. Despite the relatively large effort that has been invested in the development of parallel ABD algorithms, relatively little work has been done on the subsequent development of parallel software for BVODEs employing the new parallel ABD algorithms; we survey previous work in the development of parallel BVODE software in §4.

The goal of our paper is to describe the development of a parallel BVODE solver based on the MIRKDC software package [35]. A principal component of this modification is the replacement of the sequential ABD system solver, COLROW [26], [27], with a new parallel ABD system solver, RSCALE [44], [66], [65]. In §5, we briefly describe the underlying algorithms and provide some implementation details for the MIRKDC code and the RSCALE package. In §6, we report in detail on the modification of the MIRKDC package to enable parallel execution of the major algorithmic components. In §7, we identify the parallel architectures and test problems to be employed and then present and discuss results from several numerical experiments which demonstrate that nearly optimal linear speedup in the number of processors is attained in the parallel MIRKDC code. During our experimentation, it became apparent that in some instances the use of the RSCALE algorithm leads to performance improvements, observable even on a sequential computer, which appear to be attributable to the better stability of the RSCALE algorithm. We report on these, also in §7. Our paper closes with §8, in which we present our conclusions and identify areas for future work.

2 Sequential BVODE and ABD Software
2.1 BVODE Software
As mentioned earlier, software for BVODEs can be roughly divided into two groups: initial value methods (multiple shooting) and global methods. The text [7] mentions a number of software packages based on multiple shooting for nonlinear BVODEs which have been developed over the last 25 to 30 years. This list includes the code of Bulirsch, Stoer, and Deuflhard, and the MUSN code of Mattheij and Staarink. Among the currently available packages based on multiple shooting are the BVPMS code of IMSL [43], the codes D02HAF, D02HBF, D02AGF, and D02SAF from the NAG library [61], and the MUSN code, available from the ODE collection of netlib [62].

In the group of packages based on global methods, upon which we shall largely focus for the remainder of this paper, among the earliest codes are the PASVAR codes, e.g., PASVA3, PASVA4, [45], [46], of Lentini and Pereyra, which are based on the trapezoidal finite difference scheme coupled with an iterated deferred correction algorithm. Another early code is the COLSYS package [5], [6], by Ascher, Christiansen, and Russell, which is based on collocation and which employs the B-spline package [15] of de Boor for the representation of the piecewise polynomials upon which the collocation method is based. A later modification of the COLSYS code called COLNEW, [10], developed by Bader and Ascher, replaces the B-spline package with an implementation based on monomial splines.

This code also performs a local treatment (for each subinterval) of some of the linear algebra computations, using a "condensation" process.

Runge-Kutta methods, see, e.g., [20], originally developed for the solution of IVODEs, have also served as the basic methods in several codes for BVODEs. In fact, the trapezoidal method of the PASVAR codes and, for (1), collocation methods are equivalent to Runge-Kutta methods. A modification of the PASVA3 code which employs mono-implicit Runge-Kutta (MIRK) methods, [21], in a deferred correction algorithm is the code ABVRKS [40] of Gupta. The TWPBVP code [22], [23], developed by Cash and Wright, is also based on MIRK methods and a deferred correction algorithm. MIRK methods are employed in a defect control algorithm in the code MIRKDC [35], of Enright and Muir. The MATLAB problem solving environment [54] also includes a BVODE solver, bvp4c, [52] by Shampine and Kierzenka, based on a MIRK method and using defect control.

The IMSL library includes the finite difference code BVPFD (based on PASVA3/PASVA4). The NAG library includes the codes D02GAF, D02RAF (based on PASVA3/PASVA4), and D02TKF (based on COLSYS/COLNEW). The ODE collection at netlib includes the COLSYS and COLNEW codes and the TWPBVP and MIRKDC codes. The COLSYS code is also available in the ACM Collected Algorithms library [9]. All of the above mentioned codes are implementations in Fortran for sequential computers.

2.2 ABD Software
In this paper, we are interested in a special case of the ABD system structure which generally arises within almost all of the codes mentioned above, i.e., the multiple shooting, Runge-Kutta, collocation (with condensation), and finite difference codes, when they are applied to (1), (2). The Newton matrix has a block row structure in which all blocks in a row are zero except for those along the bi-diagonal, which we will refer to as L_i and R_i for the i-th row of blocks. That is, the Newton matrix has the structure

\[
\begin{pmatrix}
B_a & & & & \\
L_1 & R_1 & & & \\
 & L_2 & R_2 & & \\
 & & \ddots & \ddots & \\
 & & & L_{Nsub-1} & R_{Nsub-1} \\
 & & & & B_b
\end{pmatrix}. \tag{3}
\]

The blocks L_i and R_i are associated with the application of a numerical method on the i-th subinterval and the blocks B_a and B_b are associated with the derivatives of the boundary conditions. The L_i and R_i blocks are n × n; the B_a block is p × n; the B_b block is q × n. Nsub is the number of subintervals. A matrix with the structure exhibited in (3) is referred to in [3] as being in "standard ABD form".

There is a considerable history of high quality software for ABD systems for sequential computers; see [3]. The earliest of these packages, SOLVEBLOK [16], [17], developed by de Boor and Weiss, employs Gaussian elimination with partial pivoting, taking into account the ABD structure of the matrix. The stability of this algorithm follows from standard results for Gaussian elimination with partial pivoting for banded systems. This algorithm, however, introduces fill-in and thus requires extra storage and computation.

Fill-in can be eliminated and the computation can be made more efficient if a more sophisticated elimination algorithm is employed; the modified alternate row and column elimination algorithm of Diaz, Fairweather, and Keast is implemented in the ARCECO package [26], [27] (for the general ABD structure of Figure 1.1 of [3]), and in the COLROW package [26], [27] (for the more specialized ABD structure of Figure 1.2 of [3]). The stability of this algorithm follows from the observation that it can be expressed as Gaussian elimination with a restricted form of partial pivoting [71]. A version of ARCECO modified by Brankin and Gladwell to use the level 2 BLAS [32], [33] is included in the NAG library as F01LHF [18].

The approaches based on multiple shooting or collocation with condensation lead to an ABD matrix structure as in (3) but with the R_i block equal to the identity. In this case it is possible to exploit this extra structure; the modified alternate row and column elimination algorithm is implemented for such matrix systems in the code ABBPACK [56], [57], of Majaess, Keast, Fairweather, and Bennett. The SOLVEBLOK, ARCECO, COLROW, and ABBPACK packages are available through netlib in the ACM Collected Algorithms library. All of these packages are implemented in Fortran for sequential computers.

Because the efficient treatment of the Newton matrices is critical to the overall efficiency of a BVODE code, the BVODE codes mentioned earlier employ algorithms specifically designed to handle the ABD matrices which arise. In many cases, the authors of the BVODE software have employed, possibly with some modification, one of the ABD software packages mentioned above. The COLSYS package employs the original SOLVEBLOK code while the COLNEW package employs a modified version of SOLVEBLOK. The NAG collocation code, D02TKF, employs the NAG version of ARCECO, F01LHF. The TWPBVP code employs a modified version of the COLROW package and the MIRKDC code employs the original COLROW package. On the other hand, in the PASVAR codes, the authors of the BVODE software have implemented their own algorithms for handling the ABD systems; the ABD factorization and backsolve routines are called DECOMP and SOLVE. A feature of these routines is that they can handle matrices that have what is referred to in [3] as a bordered ABD (BABD) structure; such matrices arise from the application of any of the numerical methods discussed above when the original BVODE (1) has nonseparated boundary conditions,

\[
g(y(a), y(b)) = 0, \tag{4}
\]

where g : ℝ^n × ℝ^n → ℝ^n.

The factorization algorithm employed in DECOMP/SOLVE is of alternate row and column elimination type, similar to that of [26], [27], with fill-in along the bottom block row associated with the presence of the nonseparated boundary conditions. See [80] for a detailed description of the underlying algorithms. Although all of the above codes, except the PASVAR codes and others based on them, can only handle separated boundary conditions, (2), directly, it is possible to recast a BVODE to remove the nonseparated boundary conditions, a process which can double the number of ODEs to be treated and thus substantially increase the cost of solving the recast system; see [8].


The standard BABD structure is either

\[
\begin{pmatrix}
D_a & & & & D_b \\
L_1 & R_1 & & & \\
 & L_2 & R_2 & & \\
 & & \ddots & \ddots & \\
 & & & L_{Nsub-1} & R_{Nsub-1}
\end{pmatrix} \tag{5}
\]

or

\[
\begin{pmatrix}
L_1 & R_1 & & & \\
 & L_2 & R_2 & & \\
 & & \ddots & \ddots & \\
 & & & L_{Nsub-1} & R_{Nsub-1} \\
D_a & & & & D_b
\end{pmatrix} \tag{6}
\]

where D_a and D_b are n × n blocks associated with the derivatives of the boundary conditions.

The former characterization is used in [45]; the latter in [3]. The stability of methods for BABD systems can be much different than for ABD systems: the paper [78] demonstrates with some simple examples that Gaussian elimination with partial pivoting can be unstable for BABD systems. An interesting comment on the treatment of BABD systems is provided by Kraut and Gladwell in [53]: after separating the boundary conditions (through the introduction of extra unknowns and ODEs), the resultant BVODE leads to an ABD system, rather than a BABD system, which can then be handled efficiently and in a stable fashion by a Gaussian elimination type algorithm (such as alternating row and column elimination). The authors compare the cost of this approach with the cost of applying a (stable) QR factorization directly to the original BABD system, and show that both approaches have approximately the same cost.
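For completeness, here is a sketch of the standard recasting referred to above (see, e.g., [8]); the auxiliary unknown z(t) is our notation. Given (1) with nonseparated boundary conditions (4), introduce the constant unknowns z(t) ∈ ℝ^n, with z'(t) = 0, and solve the doubled system

\[
\begin{pmatrix} y'(t) \\ z'(t) \end{pmatrix}
=
\begin{pmatrix} f(t, y(t)) \\ 0 \end{pmatrix},
\qquad
z(a) - y(a) = 0,
\qquad
g(z(b), y(b)) = 0 .
\]

Since z(t) is constant, z(b) = z(a) = y(a), so the original condition g(y(a), y(b)) = 0 is recovered, but both sets of boundary conditions are now separated; the price is that the number of ODEs doubles from n to 2n.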

2.3 BVODE Parameter Continuation Software
For difficult parameter dependent BVODEs, having the general form

\[
y'(t) = f(t, y(t), \lambda), \tag{7}
\]

with parameter λ ∈ ℝ^l, it is common to employ a BVODE code within a "parameter continuation" algorithm; see, e.g., [7] and references within.

To obtain the solution to a problem corresponding to a difficult parameter value, one solves a sequence of BVODEs beginning with a simpler problem and proceeding through several parameter values until the difficult parameter range is reached, using the final mesh and solution from one problem as the initial mesh and initial solution estimate for the next. In its simplest form, the selection of the sequence of parameter values is done by the user and the BVODE code is simply called within a loop (a minimal sketch of such a loop appears at the end of this subsection). In some cases a code will have a built-in parameter continuation feature and the parameter sequence is chosen automatically within the code. A recent code of this type is COLMOD [24] by Cash, Moore, and Wright.

The continuation algorithm in COLMOD obtains an estimate of the next parameter value based on linear or quadratic interpolation from previous parameter values, coupled with a variety of heuristics. This code employs COLNEW (with a modified mesh selection algorithm [76]), but the parameter continuation implementation is external to the COLNEW code and thus the linear algebra software components are the same as in COLNEW, i.e., a modified SOLVEBLOK package. The same parameter continuation algorithm is implemented in the code ACDC [24], also by Cash, Moore, and Wright, which uses TWPBVP (with the mono-implicit Runge-Kutta formulas replaced by Lobatto collocation formulas [12]) to solve each BVODE problem in the sequence. Again the continuation algorithm is external to the modified TWPBVP package and thus the same linear algebra software, i.e., a modification of the COLROW package, is employed. The COLMOD and ACDC packages are available through the ODE library of netlib. Automatic parameter continuation is also available in the PASVA3/PASVA4 codes.

In standard parameter continuation, the purpose of solving the sequence of problems associated with different parameter values is simply to obtain a sufficiently good initial mesh and initial solution estimate so that the final, most difficult problem can be successfully solved. A generalization of the application of parameter continuation is homotopy path following, where one wishes to follow solution values along a path of parameter values, or bifurcation analysis, where one considers multiple solutions for a BVODE with a parameter. See, e.g., [7] and references within. A well-known BVODE code of this type is AUTO (see [29] and references within) by Doedel et al. Because of the presence of parameters and the possibility of nonseparated boundary conditions, the matrix structure that arises is BABD with an extra column of blocks of width l associated with the additional parameters λ. The linear algebra algorithm employed in AUTO is based on a structured Gaussian elimination (LU) algorithm with partial pivoting. An interesting aspect of this algorithm is that a similar version was later presented for use in the parallel solution of ABD systems by Wright in [79]. In [47], Liu and Russell replace the structured LU algorithm with a structured QR algorithm, previously considered by Wright in [77] for use in the parallel solution of ABD systems. They report improved stability associated with the use of the latter algorithm. The AUTO software is available from its primary author [28]. Another path following code, based on the collocation algorithm employed in COLNEW, is the COLCON code [11] of Bader and Kunkel; it employs a further modification of the SOLVEBLOK code to handle the linear systems that arise.
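To make the simplest, user-driven form of parameter continuation mentioned above concrete, the following sketch calls a BVODE solver in a loop over a user-chosen parameter sequence, reusing the mesh and solution from one problem as the starting point for the next. The subroutine solve_bvode, its argument list, and the particular parameter sequence are placeholders of ours, not the interface of any of the codes discussed here.

   program continuation_sketch
      implicit none
      integer, parameter :: nstep = 5
      double precision :: eps(nstep)
      logical :: have_guess
      integer :: k, info

      ! parameter sequence, from an easy problem to the difficult one
      eps = (/ 2.0d-3, 1.0d-3, 5.0d-4, 2.5d-4, 1.25d-4 /)

      have_guess = .false.
      do k = 1, nstep
         ! On the first pass the solver must build its own initial mesh and
         ! guess; afterwards the final mesh and solution of the previous,
         ! easier problem are reused as the starting point.
         call solve_bvode(eps(k), have_guess, info)
         if (info /= 0) stop 'continuation step failed'
         have_guess = .true.
      end do

   contains

      subroutine solve_bvode(eps_k, warm_start, info)
         ! Placeholder standing in for a call to a BVODE code such as
         ! MIRKDC or COLNEW; a real driver would also pass the ODE right
         ! hand side, boundary conditions, tolerance, and mesh/solution
         ! arrays, which are omitted here.
         double precision, intent(in) :: eps_k
         logical, intent(in) :: warm_start
         integer, intent(out) :: info
         print *, 'solving BVODE with parameter', eps_k, ' warm start:', warm_start
         info = 0
      end subroutine solve_bvode

   end program continuation_sketch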

3 Parallel Algorithms for ABD Systems
Over the last 15 years, a number of papers by various researchers have addressed the question of developing efficient parallel algorithms for the factorization and solution of ABD systems. Many of these algorithms allow a bordered ABD (BABD) matrix. An extensive survey of parallel algorithms for ABD and BABD systems is provided in [3].

3.1 Algorithms for Vector Processors
About two decades ago, commonly available "parallel" computers were limited to those which had a vector processing capability, and thus the earliest papers in the parallel treatment of ABD systems considered algorithms based on vector processors.


In [39], Goldmann describes a vectorized version of a cyclic reduction implementation of a "condensation" method of Bulirsch (not the same as the condensation employed in COLNEW) for BABD systems, modified to improve its stability, since the original algorithm appears to be related to the well-known compactification method, which is known to be unstable; see, e.g., [47]. In [41], Hay and Gladwell develop and test vectorized versions of the SOLVEBLOK, ARCECO, and COLROW codes and show that significant improvements in execution time are attained by the vectorized codes. In a related paper [36], these same authors consider the reductions in linear algebra costs associated with replacing the linear algebra packages in COLSYS and COLNEW with ARCECO, including consideration of vectorizability within the linear algebra computations. Both from an analysis of operation counts and from numerical experiments, they show that the use of COLSYS or COLNEW with the vectorized ARCECO code is superior to the use of these codes with the vectorized SOLVEBLOK or modified SOLVEBLOK routines. The paper also considers possible conflicts between vectorization and parallelization of a BVODE code. In [80], Wright and Pereyra investigate several vector-multiprocessor algorithms for the efficient treatment of BABD systems arising in PASVA4. For larger systems of BVODEs, the vector capabilities of the new algorithms are better demonstrated, and the algorithms are shown to improve upon the original DECOMP/SOLVE routines. However, the underlying parallel algorithm is essentially equivalent to compactification and thus will be unstable for some problems.

3.2 Algorithms based on Level 3 BLAS
The presence of multiple processors is exploited at a lower level of granularity in the papers [38], [68], by Gladwell and Paprzycki, which describe algorithms for the treatment of ABD systems using the level 3 BLAS [30], [31]; parallelism is achieved through the parallel execution of the computations inside the BLAS routines. On a Cray X-MP, the authors show that parallelism at this level can lead to speedups of about 4 or 5 when up to 40 processors are employed; thus this approach may be appropriate primarily for large BVODE systems with few processors.

3.3 Algorithms for Parallel Architectures
In addition to [80], another early attempt at the development of a parallel algorithm for the treatment of ABD or BABD systems was that of Ascher and Chan [4]. By first forming the normal equations corresponding to the original ABD system, Ascher and Chan obtained a positive definite, block tridiagonal system which can then be solved, stably, using standard block cyclic reduction (which is not stable for general linear systems since it does not include pivoting). Of course the difficulty with this approach is that the 2-norm condition number of the linear system to be solved is potentially squared. Another algorithm, considered by Paprzycki and Gladwell [67], involves "tearing" the ABD system into a sequence of smaller, independent ABD systems with the same structure, each of which is solved on a different processor using the sequential ABD solver F01LHF. The authors report speedups of 3 or 4 using up to 40 processors. An additional difficulty with this approach is that the conditioning of the smaller ABD systems may be poor even when the original ABD system is well-conditioned. A more recent followup to this work is that of Amodio and Paprzycki [1], in which an improved version of the algorithm of [67] is described.

The improved algorithm features a reduction in the number of operations and in the amount of fill-in and is specifically directed to distributed memory parallel architectures. The reported potential speedups improve on those of [67] but are still substantially less than linear in the number of processors.

The first two algorithms to be both stable and exhibit almost linear speedups in the number of processors were published by Wright [77], [79]; each partitions the BABD matrix into collections of contiguous block rows. The first, the structured QR (SQR) algorithm, employs local QR factorizations within each partition and can be proven to be stable; the second, the structured LU (SLU) algorithm, uses local LU factorizations within each partition, exhibits stable behavior for many problems (stability can be proven under certain assumptions), and involves about half the computational cost of the former. Wright followed up the second of these papers with [78], commenting on the possibility of the SLU algorithm exhibiting unstable behavior on simple BABD systems. These SQR and SLU algorithms were also developed independently by Jackson and Pancer [44], and subsequent work by these researchers has further investigated the potential for instability within the SLU algorithm [66], [65]. An algorithm essentially equivalent to the SLU algorithm was also proposed around this same time by Wright [75]; it employs local LU factorizations with partial pivoting within a block cyclic reduction algorithm rather than within a partitioning algorithm. In [2], Amodio and Paprzycki examine the standard block cyclic reduction algorithm for ABD systems and describe a "stabilized" version which uses block LU factorizations; this algorithm is related to the SLU algorithm and the authors base an argument for its stability on the stability of the SLU algorithm. The authors report better performance than the SLU algorithm when the number of BVODEs is large.

An algorithm which improves upon the SQR algorithm is described by Mattheij and Wright in [55]; a key idea in this approach is to explicitly consider the presence of non-increasing and non-decreasing fundamental solution modes associated with the BVODE system from which the ABD systems arise. An explicit decoupling of the columns of each ABD block corresponding to the two modes leads to a stable version of the compactification algorithm [7]. Mattheij and Wright describe a parallel version of this algorithm which partitions the ABD systems over the available processors and then applies stabilized compactification on each partition. The decoupling step, in addition to using a QR factorization, also employs a generalized SVD decomposition. (The authors note that for some problems, it may be necessary to reapply the decoupling step several times.) The factorizations employed within the stabilized compactification algorithm are QR factorizations, and in fact, the SQR algorithm is used on the reduced ABD system that arises after the parallel compactification phase. The new algorithm is reported to have the same stability as the SQR algorithm but with the computational costs of the SLU algorithm. (We note, however, that a careful implementation of the SLU algorithm can have a computational cost of about O(4n³) - see [44], pg. 25 - rather than the O((23/3)n³) cost reported in [55], making it substantially less computationally expensive than the parallel compactification algorithm of [55].)
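To fix ideas, the partitioned reduction shared by the SQR and SLU algorithms can be summarized as follows (our notation, not that of [77], [79]). Suppose partition p consists of the contiguous block rows i = s_p, ..., e_p of the BABD system, with y_i denoting the block of unknowns associated with the i-th mesh point, so that the partition contains the equations

\[
L_i\, y_i + R_i\, y_{i+1} = \phi_i, \qquad i = s_p, \ldots, e_p .
\]

Local QR (for SQR) or LU (for SLU) factorizations within the partition eliminate the interior unknowns y_{s_p+1}, ..., y_{e_p}, leaving a single block equation coupling the unknowns at the ends of the partition,

\[
\widehat{L}_p\, y_{s_p} + \widehat{R}_p\, y_{e_p+1} = \widehat{\phi}_p .
\]

Collecting one such equation from each of the P partitions, together with the block rows associated with the boundary conditions, gives a much smaller BABD system with roughly P + 1 block rows; once it is solved, the interior unknowns of each partition can be recovered in parallel by back-substitution. The two algorithms differ in the local factorization used, and hence in cost and stability, rather than in this overall structure.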
Another recently proposed algorithm which is related to the SQR algorithm is that of Osborne [64], in which a stabilized cyclic reduction algorithm based on the use of local QR factorizations is discussed. This was followed by the paper of Hegland and Osborne [42] in which local orthogonal factorizations (i.e., QR factorizations) are employed in a novel partitioning algorithm, described by the authors as a generalization of cyclic reduction partitioning, and called wrap-around partitioning. The basic idea is to reorder the block rows and columns of the original BABD matrix so that the blocks appearing in each lower diagonal partition are independent of each other and can thus be treated in parallel.

After the factorization phase, the resultant system can also be treated with wrap-around partitioning, allowing for recursive use of this algorithm. The algorithm has the same stability as the SQR algorithm; it also appears to have the same operation count as the SQR algorithm but may have advantages for architectures with vector capabilities.

Jackson and Pancer have developed a new algorithm called eigenvalue rescaling [44], [66], which, in terms of stability, compares well with the SQR algorithm - see [65] - and which has computational costs somewhat better than those of the SLU algorithm. (As mentioned earlier, a detailed analysis of a precise implementation of the SLU algorithm gives a cost that is O(4n³); the new eigenvalue rescaling algorithm has a cost which is O(3⅓n³).) It is implemented in the RSCALE package and will be discussed further later in this paper.

4 Parallel Software for BVODEs
4.1 Parallel Implementations for Multiple Shooting
One of the earlier papers in this area investigated vector algorithms for the numerical solution of BVODEs; in this paper [39], Goldmann considers the development of a vectorized version of a multiple shooting algorithm. This includes identification of algorithms for mesh selection, integration of the local IVODEs, and a modified Newton iteration that are appropriate for vector machines. The new code is compared, in both sequential and vector mode, with several sequential BVODE codes; the numerical results show that the new code run in sequential mode compares well with the best performances of the other sequential codes and that in vector mode it achieves speedups of about 3.

Another early investigation of parallel implementations for multiple shooting was conducted by Keller and Nelson [50], [51], in which they considered hypercube parallel architectures. After identifying the two major computational phases, integration of the local IVODEs and factorization and solution of the ABD system, they consider two approaches to parallelization: domain decomposition, which is the conventional approach of distributing the work associated with each subinterval across the available processors, and column decomposition, an alternative approach in which work associated with corresponding columns of the fundamental solution matrices for each subinterval in the IVODE phase, and work associated with corresponding columns of each block of the ABD matrix, is treated by the same processor. The authors do not consider parallel treatment of the ABD factorization and solution within the domain decomposition; a sequential alternate row and column elimination algorithm is assumed. The paper includes a careful analysis of efficiency for each approach, for each phase.

The paper [74] by Womble, Allen, and Baca discusses a parallel algorithm for the method of invariant imbedding (which is related to multiple shooting since the solution of a linear BVODE is computed via the solution of a number of IVODEs). They also consider the parallel treatment of partial differential equations using a (transverse) method-of-lines approach, which leads to a system of BVODEs which can be treated using their parallel algorithm. The paper [58] by Miele and Wang describes a method for the parallel solution of linear BVODEs based on the method of particular solutions, which is again related to multiple shooting since it obtains the solution of the BVODEs through a partitioning of the original problem interval followed by the solution of IVODEs on each subinterval.


A more recent effort by Chow and Enright [25] considers standard multiple shooting for distributed memory parallel architectures. The BABD systems which arise are treated with a parallel algorithm which appears to be related to compactification and may therefore have stability difficulties. The authors also acknowledge that standard multiple shooting can have stability problems due to the presence of rapidly increasing fundamental solution modes, which can cause difficulties for the underlying IVODE solver. To address these stability difficulties, the authors introduce a parallel iterative refinement scheme, which is shown through numerical examples to provide improved performance, counteracting, to some extent, the negative performance effects associated with the stability difficulties.

4.2 Parallel Chopping Algorithms
A number of authors have considered approaches which generalize standard multiple shooting by considering local BVODEs on each subinterval rather than local IVODEs. The local BVODEs must, of course, then be provided with appropriate local boundary conditions. Ascher and Chan discuss in [4] a parallel algorithm involving local (stable) BVODEs for which the local boundary conditions are chosen to be consistent with the increasing and decreasing fundamental solution modes of the global BVODE. Some preliminary work in the development of a practical algorithm for the determination of the local boundary conditions is discussed.

Paprzycki and Gladwell report in [69] on a parallel chopping algorithm based on the sequential chopping algorithm of Russell and Christiansen [72]. Using estimated boundary conditions, the independent BVODEs are treated in parallel using COLNEW. This is coarse-grain parallelism since it involves running the COLNEW code independently on each of several processors, as opposed to executing components of a single run of COLNEW in parallel. Their results show quite substantial speedups on some test problems and suggest the need for further investigation of the approach. A later paper by Katti and Goel [48] again considers the parallel chopping algorithm; however, in their approach the individual BVODEs are treated using a fourth order finite difference method derived by the authors rather than standard BVODE software. The method they derive requires a rather specialized assumption on the BVODE which in general does not hold for nonlinear BVODEs.

As in previous papers, Pasic and Zhang [70] consider a parallel algorithm based on local BVODEs. The authors comment that once the local solutions are obtained the usual approach is to impose continuity constraints globally by considering all the constraints associated with all the subintervals simultaneously; this gives the usual ABD matrix. In [70], the authors present algorithms which attempt to treat the continuity constraints locally, i.e., between pairs of adjacent subintervals, in an iterative context. This allows for an obvious parallel treatment of the continuity constraints. Examples are provided for single second order ODEs but the authors indicate that the approach generalizes to systems.

4.3 Parallelization Across the Method
Several authors have considered approaches which involve parallel execution of significant algorithmic components of existing sequential BVODE codes.


Gladwell and Hay, in [36], show that, for the COLSYS and COLNEW codes, linear algebra computations make a substantial contribution to the overall cost, and they highlight the factorization and solution of the ABD systems as particular bottlenecks. Their paper was written prior to the extensive work on parallel algorithms for ABD systems and thus pointed out the need for further research in this area.

In the paper by Bennett and Fairweather [14], a parallelization of the COLNEW code is described. They observe that, in COLNEW in sequential mode, the most significant computational cost is in the setup of the ABD systems; the factorization and solution of the ABD systems is actually relatively less important. (This is due to the condensation phase implemented in COLNEW; in COLSYS the factorization phase is relatively more costly.) Hence, the authors focus on the parallel implementation of the setup of the ABD systems. They attain nearly linear speedup on several test problems for the computational components associated with the condensation phase. They report that on systems with even a small number of parallel processors the ABD setup costs can be reduced to the point where the factorization of the ABD system can represent about 50% of the overall execution time, further supporting the need for research in the parallel treatment of the ABD systems. These authors followed up on this effort by modifying COLNEW to obtain the parallel code PCOLNEW [13], which also included an implementation of the parallel SLU algorithm of Wright [79], mentioned earlier, for the parallel factorization and backsolve of the ABD systems.

Muir and Remington [59] report on a modification of an early version of the MIRKDC code to incorporate parallelization of the ABD system setup, factorization, and solution phases, as well as the interpolant setup and defect estimation phases. The parallel ABD factorization and solution algorithm was based on an implementation of Wright's SLU algorithm. For the ABD setup, interpolant setup, and defect estimation phases, they report nearly linear speedups; however, the speedups reported for the factorization and solution phases do not become apparent until 3 or more processors are employed and are not as large as those of the other phases.

4.4 Special Methods for Singularly Perturbed Problems
In the paper by Gasparo and Macconi [37], a parallel version of a method referred to as an initial value method for singularly perturbed BVODEs is described. The method had been reported in previous papers by the authors; it is a rather specialized algorithm developed for second order linear and semilinear BVODEs satisfying certain special conditions that would not be expected to hold for general nonlinear BVODEs. The parallel version identifies independent subtasks and can employ up to 4 processors. Since only two of the tasks are dominant, speedups of about 2 are attained. For the singularly perturbed second order linear BVODE, with special assumptions on the coefficients of the ODE, Kadalbajoo and Rao [49], assuming a uniform mesh, derive a simple finite difference scheme which can be solved using several recurrence relations. They derive expressions for these recurrences which can be computed in parallel in order to obtain the solution to the ODE.

Despite a substantial volume of work in the investigation of the efficient and stable parallel treatment of ABD systems arising in the numerical solution of BVODEs, one can observe from the above discussion that relatively little has been done in following up this work by implementing any of the stable, parallel ABD algorithms within a BVODE package. The only reported work of this type, to our knowledge, is that mentioned above in [13] and [59].

5 Overview of MIRKDC and RSCALE
5.1 MIRKDC: Mono-implicit Runge-Kutta Software with Defect Control
The underlying algorithms employed in the MIRKDC code are described in detail in [34] and [35]. The numerical schemes used to discretize the ODEs are a special class of Runge-Kutta methods known as mono-implicit Runge-Kutta (MIRK) methods; see [19] and references within. Due to the special structure which relates the stages of these methods, they can be implemented, in the BVODE context, more efficiently than standard implicit Runge-Kutta methods. The discretization process leads to a system of nonlinear algebraic equations which is solved by a modified Newton iteration. This gives a discrete numerical solution at the meshpoints which subdivide the problem interval. This solution can be efficiently extended to a continuous approximate solution by augmenting the MIRK methods with continuous MIRK (CMIRK) methods [60]. This continuous solution approximation is a C¹ interpolant to the discrete numerical solution over the whole problem interval. The availability of this high order interpolant allows the MIRKDC code to employ defect control as an alternative to the standard control of a global error estimate usually employed in BVODE codes. For a continuous approximate solution, u(t), and the ODE given in (1), the defect is given by

\[
\delta(t) = u'(t) - f(t, u(t)).
\]

The MIRKDC code also employs adaptive mesh redistribution based on an equidistribution of the defect estimates for each subinterval. The algorithm implemented in the MIRKDC software is as follows:

Initialization
• Initial mesh, π ≡ {t_i}, i = 0, ..., Nsub, partitioning [a, b],
• Initial solution approximation, Y ≡ {y_i}, i = 0, ..., Nsub, at the mesh points.

REPEAT

1. Based on the current mesh and a MIRK method, discretize the ODEs and boundary conditions to obtain a nonlinear algebraic system, Φ(Y) = 0.

2. Compute the discrete solution, Y, of Φ(Y) = 0, using Newton's method: for q = 0, 1, ...
   (i) Set up the Newton matrix, ∂Φ(Y^(q))/∂Y,
   (ii) Set up the residual, Φ(Y^(q)),
   (iii) Factor the Newton matrix, ∂Φ(Y^(q))/∂Y,
   (iv) Solve the Newton system, (∂Φ(Y^(q))/∂Y) ΔY^(q) = −Φ(Y^(q)), for ΔY^(q),
   (v) Update the Newton iterate, Y^(q+1) = Y^(q) + ΔY^(q).

3. Based on the discrete solution, Y, obtained from step 2., and on a CMIRK method, construct the interpolant, u(t).

4. Estimate the defect, δ(t) = u'(t) − f(t, u(t)), on each subinterval, to obtain defect estimates, {δ_i}, i = 1, ..., Nsub.

5. Check the size of each defect estimate, ||δ_i||, for i = 1, ..., Nsub:
   If ||δ_i|| < TOL (for all i) then CONVERGE = TRUE
   else
   • CONVERGE = FALSE,
   • Use the current mesh, the defect estimates, and TOL to construct a new mesh, π*, which equidistributes the defect estimates with sufficient refinement to approximately satisfy TOL,
   • Evaluate u(t) on the new mesh, π*, to obtain a new discrete initial solution approximation, Y*, at the new mesh points.

UNTIL CONVERGE

With respect to the above algorithm, note that:
• TOL is the user defined tolerance.
• To improve efficiency, fixed Newton matrix iterations are sometimes employed, in which case steps 2.(i) and 2.(iii) are skipped. (Also, damped Newton steps are sometimes employed in 2.(v).)
• The Newton iteration, step 2., terminates when the Newton correction is sufficiently small, e.g., ||ΔY^(q)|| < TOL/100.
• If the Newton iteration (step 2.) fails to converge, the calculation is restarted with a "doubled mesh", obtained by dividing each subinterval of the current mesh in half.
• The defect is estimated by sampling it at several points per subinterval.
• The algorithm halts without obtaining a solution if a new mesh would require more than the maximum number of subintervals available (defined by the user).

An analysis of the relative execution costs for the components of the above algorithm (when the code is executed on a sequential architecture) shows that only a few of them dominate the computational costs (see §6). These are:

2.(i) Setup of the Newton matrix,
2.(ii) Setup of the residual,
2.(iii) Factorization of the Newton matrix,
2.(iv) Solution of the Newton system,
3. Setup of the interpolant,
4. Computation of the defect estimates.

It is worthwhile to note that for a single full Newton step, the costs of steps 2.(i) and 2.(iii) heavily dominate the costs of the other three steps. However, in a typical run of the MIRKDC code, fixed Newton matrix iterations will frequently be employed, and this will increase the relative costs of steps 2.(ii) and 2.(iv), since they will be executed more often than steps 2.(i) and 2.(iii). Further analysis of this observation will be provided later in this paper.
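As a concrete illustration of steps 3. and 4. above, the following self-contained sketch samples the defect of an interpolant at a few points on a single subinterval. It is only a sketch: the cubic Hermite interpolant used here is a stand-in for the C¹ CMIRK interpolant actually constructed by MIRKDC, and the ODE, subinterval, and sample points are arbitrary choices of ours.

   program defect_sample
      implicit none
      integer, parameter :: nsample = 3
      double precision, parameter :: rho(nsample) = (/ 0.25d0, 0.5d0, 0.75d0 /)
      double precision :: t0, t1, h, y0, y1, d0, d1
      double precision :: s, u, up, defect, defmax
      integer :: k

      ! Subinterval [t0, t1] and endpoint data for y' = -y, whose exact
      ! solution is exp(-t); the endpoint slopes come from the ODE itself.
      t0 = 0.0d0
      t1 = 0.1d0
      h  = t1 - t0
      y0 = exp(-t0)
      y1 = exp(-t1)
      d0 = -y0
      d1 = -y1

      defmax = 0.0d0
      do k = 1, nsample
         s = rho(k)      ! relative sample point in (0,1)
         ! cubic Hermite interpolant u and its derivative u' at t0 + s*h
         u  = (2.0d0*s**3 - 3.0d0*s**2 + 1.0d0)*y0  &
            + (s**3 - 2.0d0*s**2 + s)*h*d0          &
            + (-2.0d0*s**3 + 3.0d0*s**2)*y1         &
            + (s**3 - s**2)*h*d1
         up = ((6.0d0*s**2 - 6.0d0*s)*y0 + (-6.0d0*s**2 + 6.0d0*s)*y1)/h  &
            + (3.0d0*s**2 - 4.0d0*s + 1.0d0)*d0                           &
            + (3.0d0*s**2 - 2.0d0*s)*d1
         ! defect delta(t) = u'(t) - f(t, u(t)) with f(t, y) = -y
         defect = up - (-u)
         defmax = max(defmax, abs(defect))
      end do
      print *, 'defect estimate for this subinterval:', defmax
   end program defect_sample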

5.2 RSCALE: Parallel Block Rescaling and Factorization
Assuming nonseparated boundary conditions (4), the Newton matrix that arises during the q-th iteration in step 2. of the algorithm of the previous subsection has the BABD system structure of (5). Augmenting this matrix with the corresponding right-hand side, we may write the Newton system as

\[
\left[\,\frac{\partial \Phi(Y^{(q)})}{\partial Y}\;\Big|\;\Phi(Y^{(q)})\,\right]
=
\begin{pmatrix}
D_a & & & & D_b & g \\
L_1 & R_1 & & & & \phi_1 \\
 & L_2 & R_2 & & & \phi_2 \\
 & & \ddots & \ddots & & \vdots \\
 & & & L_{Nsub-1} & R_{Nsub-1} & \phi_{Nsub-1}
\end{pmatrix}, \tag{8}
\]

where g is the vector of boundary conditions and φ_i is the i-th component of Φ(Y).

Several variations of RSCALE exist for solving (8) and are discussed in [65]. Probably the most suitable variation for architectures with a modest number of processors—the variation implemented in PMIRKDC—is a two-level factorization. In the first level of such a scheme, the last Nsub−1 block-rows of (8) are divided into P partitions of (Nsub−1)/P block-rows each, in such a way that each partition can be factored independently. In the second level, the boundary blocks and a single block-row from each partition are extracted to form a P+1 block-row ABD matrix which is factored sequentially.

Figure 1 illustrates how local transformations are applied within each partition during the first level of the factorization. Transformation starts at the bottom of each partition and moves upward. Steps (a) and (d) incorporate block-column differencing to shift the eigenvalues in the R_i and R_{i−1} blocks, respectively. These are key steps in the rescaling process which improves stability in step (g) of the algorithm. Steps (b) and (f) are each implemented using an LU-factorization followed by n+1 back-solves; matrix inversion is shown for ease of notation only. Steps (d), (e), (f) and (g) are applied to successive block-row triples [i−2, i−1, i+1]_r, [i−3, i−2, i+1]_r, ..., until the top of the partition is reached. When the first-level factorization is complete, the unknown solution values associated with the block-rows at the bottom of each partition have been effectively uncoupled from the original system, and can be extracted to form a P+1 block-row ABD matrix which is factored sequentially. Further details on this two-level factorization scheme are given in [65].

Repetitive matrix multiplication in step (g) does not result in excessive growth of the underlying increasing fundamental solution modes. RSCALE controls the growth by rescaling the eigenvalues of L̃_{i−1} in step (f) so that L̃_{i+1} remains small. A stability proof for certain classes of ABD matrices is given in [65].

The local operation count for RSCALE is 3⅓n³ + O(n²) flops per block-row.

Figure 1: Steps in the RSCALE local transformation. [i]_r and [i]_c denote the block-row containing [L_i, R_i, φ_i] and the block-column containing [R_i^T, L_{i+1}^T]^T, respectively. The local transformations are:

(a)  [i]_c   ← [i]_c − [i−1]_c
(b)  [i]_r   ← (R_i − L_i)^{−1} [i]_r
(c)  [i+1]_r ← [i+1]_r − L_{i+1} [i]_r
(d)  [i−1]_c ← [i−1]_c − [i−2]_c
(e)  [i−1]_r ← [i−1]_r + R_{i−1} [i]_r
(f)  [i−1]_r ← (R_{i−1}(I + L̃_i) − L_{i−1})^{−1} [i−1]_r
(g)  [i+1]_r ← [i+1]_r − L̃_{i+1} [i−1]_r

If R_i = I, i = 1, ..., Nsub−1, as occurs in multiple shooting or collocation with condensation, the matrix multiplication in step (e) of Figure 1 is not required, and RSCALE uses only 2⅓n³ + O(n²) flops per block-row.

6 Parallelization of the MIRKDC code
Since the six components of the MIRKDC code identified in the previous section dominate the computational costs, it is important that our parallelization efforts focus on them. Since the parallel architectures employed in the work described in this paper are of shared memory type, the parallelization of each code component does not involve inter-processor communication. Rather, the parallelization process for each component can be described in terms of partitioning the associated computation and distributing the task over the available processors. This requires a careful specification of the partitioning of memory to the processors. Parallel Fortran compiler directives (provided by the Silicon Graphics Parallel Fortran compiler) are used to implement the parallelism and memory partitioning specifications.

We now provide further details for the parallelization of step 2.(ii), the computation of the residual, Φ(Y); the other steps, 2.(i), 3., and 4., are treated similarly, and the parallelization process for steps 2.(iii) and 2.(iv) has been discussed earlier. The evaluation of the residual, Φ(Y), is implemented, in the sequential version of MIRKDC, in a loop that iterates over the Nsub subintervals of the mesh. This loop has the form

      do 5 i = 1, Nsub
         ...
 5    continue

A straightforward parallelization of this loop would involve partitioning the Nsub loop iterates over the nprocs available processors. The Nsub loop iterates would simply be queued and the processors would execute the iterates as they were retrieved from the queue. During each of these iterates, the processor would calculate the component of Φ(Y) associated with an individual subinterval and access the corresponding memory locations. This would mean that the amount of work done by a processor during each such iterate would be relatively small, the number of iterates executed by each processor would be relatively large, approximately Nsub/nprocs, and that each processor would need to access several different parts of the large array in which the Φ(Y) values are stored. All of these factors contribute to inefficient use of the processors and memory accesses, and thus a degradation in the parallel speedups.

A significant improvement in the parallel implementation is obtained if the work associated with the evaluation of Φ(Y) is partitioned so that each processor works on a contiguous set of subintervals. The loop that is parallelized runs only over the available processors, and within each such loop, approximately Nsub/nprocs subintervals are treated. This means that each processor is called only once, minimizing processor setup overhead, and that each processor works on a contiguous portion of the right hand side vector, minimizing memory access conflicts. The improved loop has the form

      do 2 np = 1, nprocs
         npshift = (np-1)*neqns
         do 5 i = (np-1)*Nsub/nprocs+1, np*Nsub/nprocs
            ...
 5       continue
 2    continue

The compiler directive, for the above loop, which implements the parallel loop execution and specifies the memory distribution (parallel copies or shared) is given below.

C$DOACROSS SHARE(nprocs, neqns, Nsub, mesh, Y, PHI, leftbc,
C$&              k_discrete, s, work, fsub),
C$&        LOCAL(np, npshift, i, i1, h, t)
      do 2 np = 1, nprocs
         npshift = (np-1)*neqns
         do 5 i = (np-1)*Nsub/nprocs+1, np*Nsub/nprocs
            ...
 5       continue
 2    continue

This directive specifies that the loop indices and local variables np, npshift, i, i1, and the length and left endpoint of the current subinterval, h and t, respectively, must all be distributed to the local memory of each processor. The directive also specifies that the variables nprocs, neqns, Nsub, mesh, Y, leftbc, s (the number of processors, the number of ODEs, the number of subintervals, the mesh point array, the discrete solution array, the number of left boundary conditions, and the number of stages of the Runge-Kutta method) and the subroutine name fsub (which provides the right hand side of the ODE system, y'(t) = f(t, y(t))) can be shared (since they are read-only), and that the arrays PHI, k_discrete, and work (the array containing Φ(Y), the array containing the stage values of the Runge-Kutta method for each subinterval, and a work array) can be shared (since the processors will write into independent locations of these arrays).

In the sequential version of this code, only neqns locations of the work array are needed for extra storage. In the parallel version, each processor requires its own independent work space, so nprocs × neqns memory locations are needed; however, this represents a very minor increase in overall storage requirements for the parallel version. (A similar analysis of the other routines to be parallelized shows that other extensions to the storage requirements for the parallel version of the code are required, but these are at most of the order nprocs × neqns², and thus still represent minor increases in storage compared to the major storage costs associated with the large ABD Newton matrices and the storage for the Runge-Kutta stages associated with each subinterval.)

In the above discussion, we have specified that the processors will access adjacent memory blocks, each of size neqns, in the work array. When several processors attempt to simultaneously access adjacent blocks, we have observed a degradation in parallel performance due to processor contention for memory. This happens because each processor, due to the cached memory algorithm, actually attempts to retrieve a block of memory of size larger than neqns. This means that it will attempt to access adjacent work spaces associated with other processors.

We avoid this difficulty by making sure that the work spaces used by the processors are not adjacent; this is implemented through a simple change to the npshift variable which results in 99 memory blocks, each of size neqns, being placed between the work spaces. The assignment to npshift becomes

      npshift = (np-1)*100*neqns

This spacing was chosen somewhat arbitrarily and could be reduced by a careful analysis of the cache size. (The amount of extra storage employed in this way is quite small compared to the overall storage requirements.)

The interfaces for the RSCALE routines for the parallel factorization of the Newton matrix and the parallel backsolve of the Newton system are essentially identical to those for the corresponding components of COLROW, called CRDCMP and CRSLVE, and thus the replacement of COLROW by RSCALE in MIRKDC is relatively straightforward. However, the RSCALE algorithm does require some additional memory as well as access to the number of available processors. This requires some minor modifications to the required size of the work array and to the parameter lists for those MIRKDC routines that call RSCALE routines.

The resultant modified version of MIRKDC, with parallelization of the setup of the Newton matrix, the residual evaluation, the setup of the interpolant, the defect estimation, and the ABD linear system factorization and solution (by RSCALE), will be referred to as PMIRKDC. The original sequential version of MIRKDC (in which the ABD linear systems are factored and solved by COLROW) will continue to be referred to as MIRKDC.
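For readers without access to the SGI directives, the same "one contiguous chunk of subintervals per processor" partitioning can be expressed with standard OpenMP directives. The following self-contained sketch is ours, not part of MIRKDC or PMIRKDC; the trapezoidal-rule residual for the scalar ODE y'(t) = −y(t) merely stands in for the MIRK residual Φ(Y).

   program residual_chunks
   !$ use omp_lib
      implicit none
      integer, parameter :: Nsub = 1000
      double precision :: mesh(0:Nsub), Y(0:Nsub), PHI(Nsub)
      double precision :: h
      integer :: i, np, nprocs, ilo, ihi

      nprocs = 1
   !$ nprocs = omp_get_max_threads()

      ! uniform mesh on [0,1] and a dummy discrete solution
      do i = 0, Nsub
         mesh(i) = dble(i)/dble(Nsub)
         Y(i) = exp(-mesh(i))
      end do

      ! each thread handles one contiguous chunk of subintervals,
      ! mirroring the partitioning used in PMIRKDC
   !$omp parallel do private(np, i, ilo, ihi, h) shared(mesh, Y, PHI, nprocs)
      do np = 1, nprocs
         ilo = (np-1)*Nsub/nprocs + 1
         ihi = np*Nsub/nprocs
         do i = ilo, ihi
            h = mesh(i) - mesh(i-1)
            ! trapezoidal-rule residual on subinterval i for y' = -y
            PHI(i) = Y(i) - Y(i-1) - 0.5d0*h*(-Y(i) - Y(i-1))
         end do
      end do
   !$omp end parallel do

      print *, 'max residual component:', maxval(abs(PHI))
   end program residual_chunks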

7 Investigation of PMIRKDC
7.1 Parallel Architectures and Test Problems
We have employed three computer architectures in our work. The specifications for the three systems are given in Table 1. The Sun Ultra 2 is used as a sequential machine.

Table 1: Architecture specifications. (Vendors: SG: Silicon Graphics, SM: Sun Microsystems)

  acronym   model          vendor   # processors   processor speed/type   memory
  CHA       Challenge L    SG       8              150 MHz R4400          512 MB
  ORG       Origin 2000    SG       8              250 MHz R10000         2 GB
  ULT       Ultra 2/2170   SM       2              150 MHz SPARC          256 MB

The test problems are derived from a family of standard test problems described in [7] and referred to there as Swirling Flow III (SWF-III). The test family is:

\[
y_1'(t) = y_2(t), \qquad
y_2'(t) = \frac{1}{\epsilon}\left(y_1(t)\,y_4(t) - y_2(t)\,y_3(t)\right),
\]
\[
y_3'(t) = y_4(t), \qquad
y_4'(t) = y_5(t), \qquad
y_5'(t) = y_6(t), \qquad
y_6'(t) = \frac{1}{\epsilon}\left(-y_3(t)\,y_6(t) - y_1(t)\,y_2(t)\right),
\]

with boundary conditions

\[
y_3(a) = y_4(a) = y_3(b) = y_4(b) = 0, \qquad y_1(a) = -1, \qquad y_1(b) = 1.
\]

No closed form solution exists in general. For our testing, we consider problems A, B, C, D, and E from this family, obtained by specifying the values for the parameters ε, a, and b; see Table 2.

Table 2: Five SWF-III problems. Each problem is specified by ε and the left and right endpoints of the interval of integration, [a, b] (i.e., the distance between the rotating disks). The problems are listed in increasing order of difficulty.

  Problem   ε         [a, b]
  A         .002      [0, 1]
  B         .000125   [0, 1]
  C         .000125   [−1, 1]
  D         .0001     [−1, 1]
  E         .00275    [0, 10]

The initial guesses for the solution components are chosen to be straight lines through the boundary conditions, for components y1 (t), y3 (t), and y4 (t). We choose y5 (t) = y6 (t) = 0 as the initial guesses for the last two components, and choose the initial guess for y2 (t) to be the slope of the line used as the initial guess for y1 (t) (since y2 (t) represents the derivative of y1 (t)). These architectures and test problems provide the basis for a sequence of experiments we have conducted, the results of which we will present and analyze in the next two subsections. The speci?cations for these experiments are reported in Table 3. In all experiments, the MIRKDC and PMIRKDC codes are run with the fourth order option selected so that fourth order MIRK and CMIRK methods are employed - see [35]. For some experiments we were interested in studying code behavior on dif?cult problems for which the code, with the usual initial solution approximation and uniform initial mesh, would not be able to obtain a solution. In such cases we used a parameter continuation strategy to provide a suf?ciently good initial approximation and initial mesh for the problem. Table 3 below speci?es the continuation problem sequences in terms of the parameter ?.

7.2 Numerical Results for Parallel Execution
In this section we report on a number of experiments in which the sequential MIRKDC code is compared with the parallel PMIRKDC code. Since the setup of the interpolant is actually included as a preliminary step in the estimation of the defect, the costs for the interpolant setup are included in those for defect estimation and thus only costs for defect estimation are reported. For some experiments, the results will show that the number of backsolves is greater than the number of right hand side evaluations. This difference arises from some of the heuristics 20

Table 3: Ten MIRKDC vs. PMIRKDC Experiments. Each experiment is speci?ed by a host architecture (Table 1), a SWF-III problem (Table 2), and a solution strategy. A solution strategy is given by a defect tolerance τdefect , a number of initial mesh subintervals Nsub0 , and a parameter sequence. exp. # 1 2 3 4 5 6 7 8 9 10 arch. CHA CHA ORG ORG ORG ORG ORG ORG ULT ULT prb. A A A A B C D E C D solution strategy continuation strategy none none none none ? = {.002, .001, .0005, .00025, .000125} ? = {.002, .001, .0005, .00025, .000125} ? = {.002, .001, .0004, .0002, .0001} ? = {1, .1, .01, .005, .00275} none none

τdefect 10?11 10?11 10?11 10?11 10?8 10?6 10?7 10?7 10?6 10?7

Nsub0 7000 10 7000 10 10 10 10 10 10 10

employed within the modi?ed Newton iterations, discussed in earlier. When the MIRKDC code attempts a ?xed Jacobian step, the residual function is evaluated using the current iterate and a Newton correction is computed from a backsolve of the linear system with the new residual value as the right hand side and an “old” Jacobian as the coef?cient matrix. Under certain conditions, the resulting new iterate may be rejected. In this event the current residual value is saved and the Jacobian is updated and factored. Then another backsolve using the new Jacobian as the coef?cient matrix and the saved residual as the right hand side is performed. In other words, when a ?xed Jacobian step fails, the code will solve two linear systems with the same right hand side but with different coef?cient matrices, resulting in two backsolves but only one residual (right hand side) evaluation. We will now present results and discussion for the ten experiments identi?ed in Table 3. (Note that the associated ?gures, referred to in the following discussion, appear after the reference section.) 7.2.1 Experiments #1 and #2 In Experiment #1, the large number of initial mesh subintervals (Nsub0 = 7000) was chosen in order to force the codes to work with ?ne meshes and large ABD systems which lead to large execution times. After computing a solution on this initial mesh, the code redistributes the mesh and computes a ?nal solution on a mesh of 3744 subintervals. Figure 2, which provides results on overall execution time for MIRKDC and PMIRKDC, shows that in sequential mode PMIRKDC cost about 50% more than MIRKDC but with 2 or more processors, PMIRKDC outperforms MIRKDC. Figure 2 also shows that, initially, the overall speedups for PMIRKDC exhibit nearly optimal linear speedups as the number of processors increases, even though some parts of the code are being 21

run sequentially. This occurs because all of the components of the code which represent significant computational costs are being run in parallel. Figure 3 shows that all parallelized program components exhibit nearly optimal linear speedups. Figure 3 also shows the relative cost of each signi?cant algorithm component in the MIRKDC code, as well as the relative cost of each significant algorithm component in PMIRKDC as the number of processors varies. It is interesting to compare the relative costs of the components of MIRKDC and PMIRKDC when 1 processor is employed; in PMIRKDC the factorization costs are about four times greater than those in MIRKDC while the backsolve costs, which represent the dominant component in both codes, are about twice as much in PMIRKDC as in MIRKDC. As the number of processors is increased, the costs of the parallelized components of PMIRKDC decrease linearly; with 8 processors, the most signi?cant computational costs are those associated with the sequential components of PMIRKDC. In Experiment #2 problem A is again solved but with Nsub0 = 10. The small number of initial mesh subintervals is more realistic and allows the codes to work with smaller ABD systems, yielding a more typical solution computation. Figure 4 shows results for overall execution time for MIRKDC and PMIRKDC on various numbers of processors. The overall execution time can be seen to behave in a nearly optimal fashion despite the presence of sequential code components. On 1 processor the overall PMIRKDC execution time is about one third greater than that of MIRKDC and the parallelism pays off when 2 or more processors are used. Figure 5 provides information about the sizes of the meshes employed by the codes while attempting to compute a solution whose defect meets the required tolerance. It also shows the total number of calls to each of the parallelized solution components; the number of calls to the residual evaluation and backsolve components is much greater than the number of calls to the other components. This is because of the relatively large number of ?xed Newton Matrix steps taken by the code (which do not require matrix setup or factorization computations). Figure 5 also shows the percentage of calls to the factorization and backsolve routines for each mesh the code uses. It is interesting to note that most of the factorization and backsolve calls take place on the coarse meshes, indicating that the codes use substantially more full Newton and ?xed Newton matrix iterations on the coarse meshes, when less information about the solution is available and the meshes are not yet well adapted to the solution behavior. Figure 6 shows nearly optimal linear speedups for all parallelized components. This ?gure also gives the execution times for these components in MIRKDC and in PMIRKDC for various numbers of processors. With 1 processor, we see that the matrix construction costs are by far the dominant component in MIRKDC, while in PMIRKDC, the matrix factorization costs dominate. We can see that the factorization costs in MIRKDC are again about 4 to 5 times less than those in PMIRKDC, while the backsolve costs are about half. As in the previous experiment, the component costs in PMIRKDC drop almost linearly with the number of processors and by the time 8 processors are employed the sequential components represent the most signi?cant costs. The slight departures from linear speedup for larger numbers of processors are attributable to the computations associated with the coarser meshes. 
In such cases the total amount of work performed by each processor is small and thus the overhead associated with invoking the parallel processors is relatively more signi?cant.

22

7.2.2 Experiments #3 and #4 Experiments #3 and #4 are identical to #1 and #2, respectively, except that they are run on the SGI Origin 2000 rather than the Challenge L. For Experiment #3, Figure 7 provides results on the assessment of linear speedup for PMIRKDC and a comparison of the overall times for PMIRKDC vs. MIRKDC for varying numbers of processors, while Figure 8 provides the corresponding results for each signi?cant algorithm component of MIRKDC and PMIRKDC. From Figure 7, we see that with respect to overall time, PMIRKDC exhibits nearly linear speedups; as well we see that on 1 processor the PMIRKDC execution time is about twice that of MIRKDC and that parallelism begins to pay-off after 3 processors are employed. From Figure 8 we can see that the speedups of all the parallelized components of PMIRKDC are nearly linear. As well we can compare the relative costs of these components in MIRKDC and in PMIRKDC. For 1 processor, the backsolve costs in both MIRKDC and PMIRKDC are the largest; the costs in PMIRKDC are about 3 times those in MIRKDC; the factorization costs in PMIRKDC continue to be about 4 times those of MIRKDC. With 2 processors, the backsolve costs in PMIRKDC continue to be by far the most signi?cant costs; they are about 1.5 times those of MIRKDC. Unlike Experiment #1, even with 8 processors, the backsolve costs in PMIRKDC are still greater than the costs of the sequential components. For Experiment #4, in which a more realistic initial mesh is employed and thus a more typical computation sequence is exhibited, Figures 9 and 10 provide the results on linear speedup analysis and execution times for several processors for the overall execution time and for signi?cant algorithm components, respectively. Parallelism pays-off for 3 or more processors; however the speedup in overall time is less optimal than in the corresponding Experiment #2 - see comments below. For this experiment, the matrix setup costs are the dominant costs in MIRKDC, while in PMIRKDC, with one 1 processor, the matrix factorization costs are dominant, followed by the backsolve costs. With 2 or more processors the costs of the selected components drop approximately linearly until, for 8 processors, the sequential costs are the greatest. Three additional points regarding Experiments #3 and #4 are worth noting here. First, execution times are nearly 5 times as fast on the Origin 2000. This is somewhat surprising considering only the ratio of processor speeds (250 MHz on the Origin 2000 vs. 150 MHz on the Challenge L), however the Origin 2000 has many other performance enhancing features not listed in Table 1. This ?ve-fold increase in speed conforms with other benchmarks for this machine. Second, in program segments characterized by light work over many calls, speedup is not as optimal as on the Challenge L. This comments especially applies to the residual evaluation speedups shown in Figure 10 . We believe this is due at least in part to the increased processor speed on the Origin 2000. This conjecture is supported by the observation that residual evaluation speedup on the Origin 2000 can be made arbitrarily close to optimal by arti?cially in?ating the work-per-call (these results are not shown). Finally, we have noticed that there is a minor degradation in performance of some sequential program segments on the Origin 2000 as more processors are invoked (not shown here). We believe this is due to occasional non-optimal load balancing by the operating system. 
At times we have noticed as much as a 10% difference in the CPU rate among processors. 7.2.3 Experiments #5, #6, #7, and #8 Even with the minor performances glitches mentioned above, the Origin 2000 is considerably faster than the Challenge L, and more time-consuming problems are required to better illustrate its 23

power as a parallel machine. To this end, problems B, C, D, and E are solved in Experiments #5, #6, #7, and #8, respectively. In each experiment, a parameter continuation strategy is invoked to achieve convergence as the problems become increasingly dif?cult. The performance statistics reported in each experiment apply to the ?nal, most computationally-intensive iteration of the continuation loop only. For Experiment #5 the last continuation problem requires a mesh of 1057 subintervals followed by a mesh of 1162 subintervals. For Experiment #6 the last continuation problem requires a mesh of 367 subintervals followed by a mesh of 734 subintervals followed by a mesh of 1468 subintervals. For Experiment #7 the last continuation problem requires a mesh of 662 subintervals followed by a mesh of 1324 followed by a mesh of 2648 subintervals. The results for these problems are given in Figures 11-16. They are similar to the results of Experiment #8, which we discuss now in more detail. For the ?nal problem in the continuation sequence for Experiment #8 the speedup assessment for PMIRKDC and the overall execution times for MIRKDC and PMIRKDC are given in Figure 17. This is quite a dif?cult problem and thus substantial execution time is required even for the ?nal problem. This problem is also the most relevant of the test problems since we are of course primarily interested in the application of parallelism in problems with large execution times. We see that PMIRKDC exhibits nearly linear speedups, that on 1 processor the PMIRKDC cost is about 2.5 times that of MIRKDC, and that parallelism pays-off after 3 processors. Figure 18 gives the mesh sequence, the total number of calls for the signi?cant computational components, and the percentage of calls to the factorization and backsolve routines for each mesh. We see that the number of residual evaluation and backsolve calls dominates the number of calls to the other routines, due to the relatively large number of ?xed Newton matrix iterations. We also observe that except for one mesh, the number of Newton matrix factorizations and backsolves is quite uniform over the meshes considered. This is more typical of the latter part of a computation sequence where the meshes are somewhat better adapted to the solution behavior. Figure 19 shows nearly optimal linear speedups for all parallelized components. In addition it shows the relative costs of these components within the MIRKDC code and within the PMIRKDC code for various numbers of processors. For 1 processor the matrix setup costs dominate in MIRKDC while in PMIRKDC the matrix backsolve and factorization costs are the highest. The factorization cost in PMIRKDC is about 5 times that of MIRKDC. As the number of processors increases the cost of the parallelized components drops approximately linearly; for 8 processors we see that the factorization and backsolve costs have dropped to almost the same level as the sequential costs. Cumulative results for each of these experiments over all meshes considered throughout the entire continuation sequence for each experiment are provided in Figures 20-23. These ?gures give the total number of calls of each of the signi?cant algorithm components and an indication of the relative number of calls to the matrix factorization and backsolve routines on the various meshes considered. From these ?gures, we see that the number of residual evaluations and backsolves always substantially outnumbers the calls to the other routines. 
As well, we see that the percentage of calls to the matrix factorization and backsolve routines is generally higher for the larger meshes, a behavior that is consistent with the fact that these problems are very dif?cult and that even on ?ner meshes substantial work is still required to obtain accurate solutions.

24

7.3 Numerical Results for Sequential Execution
In all the experiments considered in the previous section, the solution pro?les exhibited by MIRKDC and PMIRKDC were identical. That is, the two versions of the code solved all problems using the same sequence of meshes having the same number of subintervals and employing the same number of Newton iterations to solve the Newton systems associated with each mesh, and producing answers in agreement to approximately round-off level. In this section we consider problems for which this is not the case. Experiments #9 and #10, each run sequentially on the Sun Ultra 2, illustrate a potential secondary advantage of RSCALE. Even though there is no parallelism, PMIRKDC greatly outperforms MIRKDC (Figures 24-25). The reduced execution times in PMIRKDC are explained by comparing the subroutine call and call-per-mesh pro?les, in Figures 26-27 for Experiment #9 and in Figures 28-29 for Experiment #10. From these ?gures it can be seen that PMIRKDC can use substantially coarser meshes with substantially fewer calls to all computational components, leading to great savings in execution time. Compared to MIRKDC, PMIRKDC solves the problems in Experiments #9 and #10 with savings of about 60% and 85%, respectively. We believe this happens because RSCALE may be acting as a mild “preconditioner” in these problems, producing more accurate ABD system solutions which in turn lead to shorter convergence patterns (i.e., meshes that more quickly adapt to the solution behavior using fewer Newton iterations) in PMIRKDC.

8 Conclusions and Future Work
Dif?cult BVODEs arise in many applications and can represent a substantial computational cost. It is therefore important to develop numerical methods which can take advantage of the availability of parallel processor architectures in an ef?cient manner. It has been known for some time that the most signi?cant bottleneck to the effective parallelization of numerical methods for BVODEs is the parallel factorization and solution of the ABD linear systems which arise. While there has been considerable research in the investigation of parallel algorithms for ABD systems and several high quality software packages have been developed, relatively very little has been done in the development of parallel BVODE codes based on this software. In this paper we have described in detail the software development effort associated with modifying a sequential BVODE code to allow it to run ef?ciently in a parallel processor environment; this modi?cation features the parallel treatment of ABD systems using the RSCALE package. The six most computationally intensive components of the BVODE code were targeted for parallelization. Extensive numerical testing demonstrates that almost optimal, linear speedups for all these components are attained, leading to substantial savings in overall execution time. Further reductions in the relative costs of these components would obviously be observed on an architecture with more processors. However, even with only eight processors, it is signi?cant to note that, generally, the remaining sequential algorithmic components of the BVODE code become relatively the most expensive. This implies that one area for further research should be in the development of parallel algorithms for some of these components; e.g., development of parallel mesh selection algorithms, may be worthwhile. As reported in this paper, it was necessary to provide a very careful implementation of the complier directives and associated use of memory in order to avoid parallel memory access con25

?icts. The parallel compiler directives provided by Silicon Graphics within their Fortran compliers furnish a convenient tool for implementing parallelism and managing parallel memory access on a shared memory machine. In fact, the parallel directives within Fortran provided by a number of different vendors are relatively similar and the task of modifying a code to run under a different compiler is sometimes not particularly dif?cult. However, it would be convenient to employ “portable” compiler directives. For shared memory architectures, the OpenMP Application Program Interface provides this facility and many vendors already provide or are in the process of providing OpenMP implementations. Another area for future work on this project is therefore the replacement of the vendor speci?c complier directives with portable OpenMP directives within PMIRKDC. We have observed signi?cant performance improvements, for certain problems, even on sequential machines, due to the replacement of COLROW by RSCALE. We are currently conducting an investigation of this phenomenon, including a study of the effect of the use of RSCALE on the conditioning of the ABD systems that arise throughout a computation. Another direction we are also currently considering is the possible advantage in using the RSCALE software, even in sequential processor environments, in the direct treatment of BVODEs with nonseparated boundary conditions. Since COLROW cannot handle BABD systems, one must ?rst uncouple the boundary conditions. For fully coupled conditions, this leads to a doubling of, n, the number of equations, and thus in some of the most computationally intensive components of a BVODE code (where the costs depend on n3 ), costs are increased by a factor of eight. Hence, even though the factorization costs of RSCALE are four times those of COLROW, the overall savings may favour the use of RSCALE.

References
[1] P.Amodio and M.Paprzycki, Parallel solution of almost block diagonal systems on a hypercube, Lin. Alg. Applic., 241-243, (1996), 85–103. [2] P. Amodio and M. Paprzycki, A cyclic reduction approach to the numerical solution of boundary value ODEs, SIAM J. Sci. Comput., 18, (1997), 56–68. [3] P. Amodio, J.R. Cash, G. Roussos, R.W. Wright, G. Fairweather, I. Gladwell, G.L. Kraut, and M. Paprzycki, Almost Block Diagonal Linear Systems: Sequential and Parallel Solution Techniques, and Applications, Numer. Linear Algebra Appl., 7, (2000), 275–317. [4] U.M. Ascher and S.Y.P. Chan, On parallel methods for boundary value ODEs, Computing, 46, (1991), 1–17. [5] U.M. Ascher, J. Christiansen, and R.D. Russell, A collocation solver for mixed order systems of boundary value problems, Math. Comp., 33, (1979), 659–679. [6] U.M. Ascher, J. Christiansen, and R.D. Russell, Collocation software for boundary value ODEs, ACM Trans. Math. Softw., 7, (1981), 209-222.

26

[7] U.M. Ascher, R.M.M. Mattheij, and R.D. Russell, Numerical Solution of Boundary Value Problems for Ordinary Differential Equations, Classics in Applied Mathematics Series, SIAM, Philadelphia, 1995. [8] U.M. Ascher and R.D. Russell, Reformulation of boundary value problems into ‘standard’ form, SIAM Review, 23, (1981), 238-254. [9] Association for Computing Machinery, New York, New York, USA, Collected Algorithms, www.acm.org/calgo. [10] G. Bader and U.M. Ascher, A new basis implementation for a mixed order boundary value ODE solver, SIAM J. Sci. Stat. Comput., 8, (1987), 483-500. [11] G. Bader and P. Kunkel, Continuation and collocation for parameter-dependent boundary value problems, SIAM J. Sci. Stat. Comput., 10, (1989), 72–88. [12] Z. Bashir-Ali, J.R. Cash, H.H.M. Silva, Lobatto deferred correction for stiff two-point boundary value problems, Comput. Math. Appl., 36, (1998), 59–69. [13] K.R. Bennett, Parallel collocation methods for boundary value problems, Ph.D. thesis, Department of Mathematics, University of Kentucky, Tech. Rep. CCS-91-1, University of Kentucky Center for Computational Sciences, 1991. [14] K.R. Bennett and G. Fairweather, A parallel boundary value ODE code for shared-memory machines, Int. J. High Speed Comput., 4, (1992), 71–86. [15] C. deBoor, A practical guide to splines, Applied Mathematical Sciences, 27, Springer-Verlag, New York-Berlin, 1978. [16] C. deBoor and R. Weiss, SOLVEBLOK: a package for solving almost block diagonal linear systems, ACM Trans. Math. Softw., 6, (1980), 80–87. [17] C. de Boor and R. Weiss, Algorithm 546: SOLVEBLOK, ACM Trans. Math. Softw., 6, (1980), 88–91. [18] R.W. Brankin and I. Gladwell, Codes for almost block diagonal systems, Comput. Math. Appl., 19 (1990), 1–6. [19] K. Burrage, F.H. Chipman, and P.H. Muir, Order results for mono-implicit Runge-Kutta methods, SIAM J. Numer. Anal., 31, (1994), 876-891. [20] J.C. Butcher, The numerical analysis of ordinary differential equations, Runge-Kutta and general linear methods, John Wiley & Sons, Ltd., Chichester, 1987. [21] J.R. Cash and A. Singhal, Mono-implicit Runge-Kutta formulae for the numerical integration of stiff differential systems, IMA J. Numer. Anal., 2 (1982), 211–227. [22] J.R. Cash and M.H. Wright, Implementation issues in solving nonlinear equations for twopoint boundary value problems, Computing, 45 (1990), 17–37. 27

[23] J.R. Cash and M.H. Wright, A deferred correction method for nonlinear two-point boundary value problems: implementation and numerical evaluation, SIAM J. Sci. Stat. Comput., 12, (1991), 971-989. [24] J.R. Cash, G. Moore, and R.W. Wright, An automatic continuation strategy for the solution of singularly perturbed linear two-point boundary value problems, J. Comput. Phys., 122, (1995), 266–279. [25] K.L. Chow and W.H. Enright, Distributed Parallel Shooting for BVODEs, Proceedings of High Performance Computing, A. Tentner, ed., The Society for Computer Simulation, Boston, (1998), 203-210. [26] J.C. Diaz, G. Fairweather, and P. Keast, FORTRAN packages for solving certain almost block diagonal linear systems by modi?ed alternate row and column elimination, ACM Trans. Math. Softw., 9, (1983), 358–375. [27] J.C. Diaz, G. Fairweather, and P. Keast, Algorithm 603. COLROW and ARCECO: FORTRAN packages for solving certain almost block diagonal linear systems by modi?ed alternate row and column elimination, ACM Trans. Math. Softw., 9, (1983), 376–380. [28] AUTO: ftp.cs.concordia.ca/pub/doedel/auto/. [29] E. Doedel, H.B. Keller, and J.-P. Kernevez, Numerical analysis and control of bifurcation problems. I. Bifurcation in ?nite dimensions, Internat. J. Bifur. Chaos Appl. Sci. Engrg., 1 (1991), 493–520. [30] J.J. Dongarra, J. Du Croz, S. Hammarling, and I.S. Duff, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Softw., 16, (1990), 1–17. [31] J.J. Dongarra, J. Du Croz, S. Hammarling, and I.S. Duff, Algorithm 679; a set of level 3 basic linear algebra subprograms: model implementation and test programs, ACM Trans. Math. Softw., 16, (1990), 18–28. [32] J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Softw., 14, (1988), 1–17. [33] J.J. Dongarra, J. Du Croz, S. Hammarling, and R.J. Hanson, Algorithm 656: an extended set of basic linear algebra subprograms: model implementation and test programs, ACM Trans. Math. Softw., 14, (1988), 18–32. [34] W.H. Enright and P.H. Muir, A Runge-Kutta Type Boundary Value ODE Solver with Defect Control, Dept. Comp. Sci. Tech. Rep. 93-267, University of Toronto, 1993. [35] W.H. Enright and P.H. Muir, Runge-Kutta software with defect control for boundary value ODEs, SIAM J. Sci. Comput., 17, (1996), 479-497. [36] I. Gladwell and R.I. Hay, Vector- and parallelization of ODE BVP codes, Parallel Comput., 12, (1989), 343–350. 28

[37] M.G. Gasparo and M. Macconi, Parallel initial value algorithms for singularly perturbed boundary value problems, J. Optim. Theory Appl., 73, (1992), 501–517. [38] I. Gladwell and M. Paprzycki, Parallel solution of almost block diagonal systems on the CRAY Y-MP using level 3 BLAS, J. Comput. Appl. Math., 45, (1993), 181–189. [39] M. Goldmann, Vectorization of the multiple shooting method for the nonlinear boundary value problem in ordinary differential equations, Parallel Comput., 7, (1988), 97–110. [40] S. Gupta, An adaptive boundary value Runge-Kutta solver for ?rst order boundary value problems, SIAM J. Numer. Anal., 22, (1985), 114-126. [41] R.I. Hay and I. Gladwell, Solving almost block diagonal linear equations on the CDC Cyber 205, Numerical Analysis Report 98, Department of Mathematics, University of Manchester, 1985. [42] M. Hegland and M.R. Osborne, Wrap-around partitioning for block bidiagonal linear systems, IMA J. Numer. Anal., 18 (1998), 373–383. [43] IMSL: Visual Numerics, www.vni.com/products/imsl. [44] K.R. Jackson and R.N. Pancer, The Parallel Solution of ABD Systems Arising in Numerical Methods for BVPs for ODEs, Technical report 255/91, Dept. Computer Science, Univ. of Toronto, 1992. [45] M. Lentini and V. Pereyra, An adaptive ?nite difference solver for nonlinear two-point boundary problems with mild boundary layers, SIAM J. Numer. Anal., 14, (1977), 94–111. [46] M. Lentini and V. Pereyra, PASVA4: an ODE boundary solver for problems with discontinuous interfaces and algebraic parameters, Mat. Apl. Comput., 2, (1983), 103–118. [47] L. Liu and R.D. Russell, Linear system solvers for boundary value ODEs, J. Comput. Appl. Math., 45, (1993), 103–117. [48] C.P. Katti and S. Goel, A parallel mesh chopping algorithm for a class of two-point boundary value problems, Comput. Math. Appl., 35, (1998), 121–128. [49] M.K. Kadalbajoo and A.A. Rao, Parallel discrete invariant embedding algorithm for singular perturbation problems, Int. J. Comput. Math., 66, (1998), 149–161. [50] H.B. Keller and P. Nelson, A comparison of hypercube implementations of parallel shooting, Mathematics for large scale computing, 49–79, Lecture Notes in Pure and Appl. Math., 120, Dekker, New York, 1989. [51] H.B. Keller and P. Nelson, Hypercube implementations of parallel shooting, Appl. Math. Comput., 31, (1989), 574–603. [52] J. Kierzenka and L.F. Shampine, A BVP Solver based on Residual Control and the MATLAB PSE, SMU Math Report 99-001, Department of Mathematics, Southern Methodist University, Dallas, TX, USA, 1999. 29

[53] G.L. Kraut and I. Gladwell, Cost of stable algorithms for bordered almost block diagonal systems, ICIAM/GAMM 95 (Hamburg, 1995), Z. Angew. Math. Mech., 76, (1996), suppl. 1, 151–154. [54] MATLAB: The MathWorks Inc., Natick, MA, USA, www.mathworks.com. [55] R.M.M. Mattheij and S.J. Wright, Parallel stable compacti?cation for ODEs with parameters and multipoint conditions, Appl. Numer. Math., 13, (1993), 305–333. [56] F. Majaess, P. Keast, and G. Fairweather, The solution of almost block diagonal linear systems arising in spline collocation at Gaussian points with monomial basis functions, ACM Trans. Math. Softw., 18, (1992), 193–204. [57] F. Majaess, P. Keast, G. Fairweather and K.R. Bennett, Algorithm 704: ABDPACK and ABBPACK-FORTRAN programs for the solution of almost block diagonal linear systems arising in spline collocation at Gaussian points with monomial basis functions, ACM Trans. Math. Softw., 18, (1992), 205–210. [58] A. Miele and T. Wang, Parallel computation of two-point boundary value problems via particular solutions, J.Optim. Theory Appl., 79, (1993), 5–29 [59] P.H. Muir and K. Remington, A parallel implementation of a Runge-Kutta code for systems of nonlinear boundary value ODEs, Cong. Numer., 99, (1994), 291–305. [60] P. Muir and B. Owren, Order barriers and characterizations for continuous mono-implicit Runge-Kutta schemes, Math. Comp., 61, (1993), 675-699. [61] NAG: Numerical Algorithms Group, OXFORD, UK, www.nag.co.uk. [62] Netlib: www.netlib.org. [63] The OpenMP Application Program Interface, www.openmp.org. [64] M.R. Osborne, Cyclic reduction, dichotomy, and the estimation of differential equations, J. Comput. Appl. Math., 86, (1997), 271–286. [65] R.N. Pancer, The Parallel Solution of Almost Block Diagonal Linear Systems, Ph.d. Thesis, Dept. of Comp. Sci., University of Toronto, 2000. [66] R.N. Pancer and K.R. Jackson, The Parallel Solution of Almost Block Diagonal Systems Arising in Numerical Methods for BVPs for ODEs, In the Proceedings of the Fifteenth IMACS World Congress, Berlin, Germany, Volume 2, 57–62, 1997. [67] M. Paprzycki and I. Gladwell, Solving almost block diagonal systems on parallel computers, Parallel Comput., 17, (1991), 133–153. [68] M. Paprzycki and I. Gladwell, Using level 3 BLAS to solve almost block diagonal systems, Proc. Fifth SIAM Conf. on Parallel Processing for Scienti?c Computing, (Houston, TX, 1991), SIAM, Philadelphia, PA, (1992), 52–62. 30

[69] M. Paprzycki and I. Gladwell, A parallel chopping algorithm for ODE boundary value problems, Parallel Comput., 19, (1993), 651–666. [70] H. Pasic and Y. Zhang, Parallel solutions of BVPs in ODEs based on local matching, Proceedings of the Eighth SIAM Conference on Parallel Processing for Scienti?c Computing, SIAM, Philadelphia, PA, 1997. [71] J.K. Reid and A. Jennings, On solving almost block diagonal (staircase) linear systems, ACM Trans. Math. Softw., 10, (1984), 196–201. [72] R.D. Russell and J. Christiansen, Adaptive mesh selection strategies for solving boundary value problems, SIAM J. Numer. Anal., 15, (1978), 59–80. [73] R. Weiss, The application of implicit Runge-Kutta and collocation methods to boundary value problems, Math. Comp., 28, (1974), 449-464. [74] D.E. Womble, R.C. Allen, and L.S. Baca, Invariant imbedding and the method of lines for parallel computers, Parallel Comput., 11, (1989), 263–273. [75] K. Wright, Parallel treatment of block-bidiagonal matrices in the solution of ordinary differential boundary value problems, J. Comput. Appl. Math., 45 (1993), 191–200. [76] R. Wright, J.R. Cash, and G. Moore, Mesh selection for stiff two-point boundary value problems, Numer. Algorithms, 7, (1994), 205–224. [77] S.J. Wright, Stable parallel algorithms for two-point boundary value problems, SIAM J. Sci. Statist. Comput., 13, (1992), 742–764. [78] S.J. Wright, A collection of problems for which Gaussian elimination with partial pivoting is unstable, SIAM J. Sci. Comput., 14, (1993), 231–238. [79] S.J. Wright, Stable parallel elimination for boundary value ODEs, Numer. Math., 67, (1994),521–535. [80] S.J. Wright and V. Pereyra, Adaptation of a two-point boundary value problem solver to a vector-multiprocessor environment, SIAM J. Sci. Stat. Comput., 11, (1990), 425–449.

31

Figure 2: Overall speedup and execution time of MIRKDC and PMIRKDC in Experiment #1. Parallelism begins to pay-off at 2 processors.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC. Absolute overall execution time Speedup 50
? 8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 optimal ?

X faster than on 1 processor

40

seconds

30 20 10 0

?

?

?

?

?

?

?

?

?

? ? ? ? ? ?

# of processors

1

2

3

4 5 6 # of processors

7

8

Figure 3: Speedup and execution time of selected program segments of MIRKDC and PMIRKDC in Experiment #1.
LEGEND: r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve, o ? all sequential program segments. PMIRKDC segment times PMIRKDC segment speedups
8 7 6 5 4 3 2 1

10

(17)

f r c

s

seconds

optimal r

optimal d

X faster than on 1 processor

MIRKDC segment times 10
s

8 6

↑ s

s f s o d r f c d o r d f c o r o

4 2
d

c

seconds

optimal c

8
c

6 4 2

r

0

1

2

3

4

f o d

10 8

8 7 6 5 4 3 2 1

0

1 # of processors

seconds

optimal f

optimal s

6 4 2 0
r d c f s o r d c f s so r d c f r d c f

o

so

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

# of processors

5

6 7 # of processors

8

32

Figure 4: Overall speedup and execution time of MIRKDC and PMIRKDC in Experiment #2. Parallelism begins to pay-off at 2 processors.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC. Absolute overall execution time Speedup 6
? 8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 optimal ?

X faster than on 1 processor

5

seconds

4 3 2 1

?

? ?

?

?

?

?

?

?

? ? ? ? ? ?

# of processors

0

1

2

3

4 5 6 # of processors

7

8

Figure 5: Subroutine call and call-per-mesh pro?les of MIRKDC and PMIRKDC in Experiment #2.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ?, ? call profile 70 60
2970 2582 1280 7% 320 320 80 7% 20 10 40% 33% 7% 7% 7% 2970 2582 38% 1280

?, ? call?per?mesh profiles
3% 3% 3% 5%

total number of calls

50 40 30 20 10 0 r d c f s program segment

8%

80 20 10

mesh

33%

mesh s (63 calls total)

f (15 calls total)

33

Figure 6: Speedup and execution time of selected program segments of MIRKDC and PMIRKDC in Experiment #2.
LEGEND: r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve, o ? all sequential program segments. PMIRKDC segment times PMIRKDC segment speedups
8 7 6 5 4 3 2 1

2
f

seconds

optimal r

optimal d

X faster than on 1 processor

MIRKDC segment times 2

1.5
c

1 0.5
r d

s f c o rd s o rd f c f so rd c s o

seconds

optimal c

1.5
c

0 2

1 0.5
r d fs

1

2

3

4

o

8 7 6 5 4 3 2 1

0

1 # of processors

seconds

optimal f

optimal s

1.5 1 0.5
f c rd s rd o c o s rd c o s rd c o s

f

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

f

f

# of processors

0

5

6 7 # of processors

8

34

Figure 7: Overall speedup and execution time of MIRKDC and PMIRKDC in Experiment #3. The same problem is solved as in Experiment #1, using the same solution strategy, but this time on the Origin 2000 instead of the Challenge L. Execution times are nearly 5 times as fast (compare to Figure 2). Parallelism begins to pay-off at 3 processors.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC. Absolute overall execution time Speedup 8
(9)
↑ ? ? ? ? ? ? ? ? ? ? ? ? ?

X faster than on 1 processor

8 7 6 5 4 3 2 1 1

optimal ?

6

seconds

4

2
2 3 4 5 6 7 8

?

?

?

# of processors

0

1

2

3

4 5 6 # of processors

7

8

Figure 8: Speedup and execution time of selected program segments of MIRKDC and PMIRKDC in Experiment #3. Compare to Challenge L results shown in Figure 3.
LEGEND: r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve, o ? all sequential program segments. PMIRKDC segment times PMIRKDC segment speedups
8 7 6 5 4 3 2 1

3
s

seconds

optimal r

optimal d

X faster than on 1 processor

MIRKDC segment times 3

2
c

f

(5)
↑ s

s s f r d c o r d c f o r d c f o

1

r o

seconds

optimal c

2
s c

d

0 3
f o

1

2

3

4

1

r

8 7 6 5 4 3 2 1

0

1 # of processors

seconds

optimal f

optimal s

d

2
s s f r c d o r c d f o r d c f s o r d c f s o

1

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

# of processors

0

5

6 7 # of processors

8

35

Figure 9: Overall speedup and execution time of MIRKDC and PMIRKDC in Experiment #4. The same problem is solved as in Experiment #2, using the same solution strategy, but this time on the Origin 2000 instead of the Challenge L. Execution times are nearly 5 times as fast (compare to Figure 4). Parallelism begins to pay-off at 3 processors.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC. Absolute overall execution time Speedup

X faster than on 1 processor

1.2 1

?

8 7 6 5 4 3 2 1 1

optimal ?

seconds

0.8 0.6 0.4 0.2
? ?? ? ? ? ? ? ? ? ? ? ? ? ?

2

3

4

5

6

7

8

# of processors

0

1

2

3

4 5 6 # of processors

7

8

Figure 10: Speedup and execution time of selected program segments of MIRKDC and PMIRKDC in Experiment #4. Compare to Challenge L results shown in Figure 6.
LEGEND: r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve, o ? all sequential program segments. PMIRKDC segment times PMIRKDC segment speedups
8 7 6 5 4 3 2 1

0.4

f s c f s d r o r d c o r d c f s o rd c f so

seconds

optimal r

optimal d

X faster than on 1 processor

MIRKDC segment times 0.4

0.3 0.2 0.1 0

seconds

optimal c

0.3
c

1

2

3

4

0.2 0.1
d r fs o

0.4

8 7 6 5 4 3 2 1

0

1 # of processors

seconds

optimal f

optimal s

0.3 0.2 0.1
f c rd s o rd c f o s rd c f o s rd c o f s

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

# of processors

0

5

6 7 # of processors

8

36

Figure 11: Overall speedup and execution time of MIRKDC and PMIRKDC in Experiment #5. Results are shown for the ?nal continuation step only. Parallelism begins to pay-off at 3 processors.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC. Absolute overall execution time Speedup 1
? 8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 optimal ?

X faster than on 1 processor

0.8

seconds

0.6
?

0.4 0.2 0

?

?

?

?

? ?

? ?

?

?

?

?

?

?

# of processors

1

2

3

4 5 6 # of processors

7

8

Figure 12: Speedup and execution time of selected program segments of MIRKDC and PMIRKDC in Experiment #5. Results are shown for the ?nal continuation step only.
LEGEND: r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve, o ? all sequential program segments. PMIRKDC segment times PMIRKDC segment speedups
8 7 6 5 4 3 2 1

0.4

s

seconds

optimal r

optimal d

X faster than on 1 processor

MIRKDC segment times 0.4

0.3 0.2
c f s f c d o r d o r d f c o rd c f s s o

0.1 0

r

seconds

optimal c

0.3 0.2
c

1

2

3

4

0.1
8 7 6 5 4 3 2 1 optimal f optimal s

r d

s f o

0.4

0

1 # of processors

seconds

0.3 0.2 0.1
s r dc f o r dc f s o r dc f s so

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

o r dc

f

# of processors

0

5

6 7 # of processors

8

37

Figure 13: Overall speedup and execution time of MIRKDC and PMIRKDC in Experiment #6. Results are shown for the ?nal continuation step only. Parallelism begins to pay-off at 3 processors.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC. Absolute overall execution time Speedup 6 5
(7)

X faster than on 1 processor

8 7 6 5 4 3 2 1 1

optimal ?

seconds

4 3 2 1
?

↑ ?

? ? ? ? ? ? ? ? ? ?

?

?

?

?

2

3

4

5

6

7

8

# of processors

0

1

2

3

4 5 6 # of processors

7

8

Figure 14: Speedup and execution time of selected program segments of MIRKDC and PMIRKDC in Experiment #6. Results are shown for the ?nal continuation step only.
LEGEND: r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve, o ? all sequential program segments. PMIRKDC segment times PMIRKDC segment speedups
8 7 6 5 4 3 2 1

3
s f

seconds

optimal r

optimal d

X faster than on 1 processor

MIRKDC segment times 3

2
c f s f c r c d o r d o r o d r d c s f s o

1

seconds

optimal c

2
c

0 3
r f d s o

1

2

3

4

1

8 7 6 5 4 3 2 1

0

1 # of processors

seconds

optimal f

optimal s

2

1
f s f s o c rd o rd c f s o rd c c f s rd o

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

# of processors

0

5

6 7 # of processors

8

38

Figure 15: Overall speedup and execution time of MIRKDC and PMIRKDC in Experiment #7. Results are shown for the ?nal continuation step only. Parallelism begins to pay-off at 3 processors.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC. Absolute overall execution time Speedup 10
(12)
↑ ? ? ? ? ? ? ? ? ? ? ? ?

X faster than on 1 processor

8 7 6 5 4 3 2 1 1

optimal ?

8

seconds

6 4 2

?

?

?

?

2

3

4

5

6

7

8

# of processors

0

1

2

3

4 5 6 # of processors

7

8

Figure 16: Speedup and execution time of selected program segments of MIRKDC and PMIRKDC in Experiment #7. Results are shown for the ?nal continuation step only.
LEGEND: r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve, o ? all sequential program segments. PMIRKDC segment times PMIRKDC segment speedups
8 7 6 5 4 3 2 1

5

s f

seconds

optimal r

optimal d

X faster than on 1 processor

MIRKDC segment times 5

4 3
c

s f f c c s f o d c r d s o

2 1
r d o r d

seconds

optimal c

4 0 3
c

o

r

1

2

3

4

2 1
s r d f o

5 4

8 7 6 5 4 3 2 1

0

1 # of processors

seconds

optimal f

optimal s

3 2 1 0
r d f s c o rd c f s o rd c f s o rd c f s o

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

# of processors

5

6 7 # of processors

8

39

Figure 17: Overall speedup and execution time of MIRKDC and PMIRKDC in Experiment #8. Results are shown for the ?nal continuation step only. Parallelism begins to pay-off at 3 processors.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC. Absolute overall execution time Speedup 40
(45)
↑ ? ? ? ? ? ? ? ? ? ? ? ? ? ?

X faster than on 1 processor

8 7 6 5 4 3 2 1 1

optimal ?

30

seconds

20

10
2 3 4 5 6 7 8

?

?

# of processors

0

1

2

3

4 5 6 # of processors

7

8

Figure 18: Subroutine call and call-per-mesh pro?les of MIRKDC and PMIRKDC in Experiment #8. Results are shown for the ?nal continuation step only, which is the most computationally intensive.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ?, ? call profile
8% 7648 3824 1912 24% 18% 956 478 956 478 ε
1

?, ? call?per?mesh profiles (ε < 2.5%)
14% 7648 3824 24% 1912 8% 20%

total number of calls

300

200

18% ε1

100

760 380

760 380 13% 22%

10% 19%

0

mesh r d c f s program segment

mesh

f (112 calls total)

s (348 calls total)

40

Figure 19: Speedup and execution time of selected program segments of MIRKDC and PMIRKDC in Experiment #8. Results are shown for the ?nal continuation step only.
LEGEND: r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve, o ? all sequential program segments. PMIRKDC segment times PMIRKDC segment speedups
8 7 6 5 4 3 2 1

20

s

seconds

optimal r

optimal d

X faster than on 1 processor

MIRKDC segment times 20

15 10 5
r c

f s f c c d o r d o r o d r d f c s f s o

seconds

optimal c

15 10 5
c s r f o

0

1

2

3

4

20

8 7 6 5 4 3 2 1

0

1 # of processors

seconds

optimal f

optimal s

d

15 10 5
f rd c s f o rd c s o rd c f s

1 2 3 4 5 6 7 8

1 2 3 4 5 6 7 8

o rd

# of processors

c

f so

0

5

6 7 # of processors

8

41

Figure 20: Subroutine call and call-per-mesh pro?les of MIRKDC and PMIRKDC in Experiment #5. Results are accumulated over all ?ve continuation steps.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ?, ? call profile
3%

?, ? call?per?mesh profiles (ε < 2.5%)
14% 1162 1057 961 874 795 14% 723 658 542 493 3% 320 3% 80 20 3% 10 11% 1162 1057 961 874 795 14% 723 658 542 493 3% 320 ε 1 ε 80 2 20 10 15% 14%

ε6 12% ε
5

150

total number of calls

3%

100

11%

16%

50

3%

ε

4

11% 17% 3%

18% ε3

0

mesh r d c f s program segment

mesh

f (35 calls total)

s (154 calls total)

Figure 21: Subroutine call and call-per-mesh pro?les of MIRKDC and PMIRKDC in Experiment #6. Results are accumulated over all ?ve continuation steps.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ?, ? call profile
5%

?, ? call?per?mesh profiles (ε < 2.5%)
7% 8% 1468 734 367 8% 334 304 ε 277 ε 1 2 248 224 6% 204 149 40 8% 20 ε 3 10 21% 4% 4% 1468 ε 734 ε 1 2 367 4% 334 304 277 5% 248 ε 3 224 204 5% 149 ε 40 4 20 7% 10 26%

300

total number of calls

250 200 150 100 50 0 r d c f s program segment

23%

mesh

33%

mesh

11%

ε

4

6%

f (73 calls total)

s (306 calls total)

42

Figure 22: Subroutine call and call-per-mesh pro?les of MIRKDC and PMIRKDC in Experiment #7. Results are accumulated over all ?ve continuation steps.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ?, ? call profile
6%

?, ? call?per?mesh profiles (ε < 2.5%)
7% 25% 8% 2648 1324 662 602 8% 548 499 ε 1 434 ε2 377 327 6% 160 40 20 8% 10 ε 20% 4% 2648 4% ε 1324 ε2 1 662 4% 602 548 499 7% 434 377 ε3 327 160 7% 40 20 ε 4 10

300

total number of calls

250 200 150 100 50 0 r d c f s program segment

20%

26% 11%

3

mesh

mesh

9%

ε4

9%

f (72 calls total)

s (307 calls total)

Figure 23: Subroutine call and call-per-mesh pro?les of MIRKDC and PMIRKDC in Experiment #8. Results are accumulated over all ?ve continuation steps.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ?, ? call profile 500 400 300 200 100 0
4% 7648 ε 3 3824 5% 1912 956 478 760 9% 380 340 296 179 149 66 40 19% 10 ε 3%
2

?, ? call?per?mesh profiles (ε < 2.5%)
ε1 4% 11% 7648 3824 6% 1912 ε3 956 478 760 9% 380 340 296 179 8% 149 66 40 10 ε2 4% ε1 7% 14%

total number of calls

14%

12%

13%

18%

mesh r d c f s program segment

ε4

10%

mesh

17%

ε4

7%

f (142 calls total)

s (497 calls total)

43

Figure 24: Overall and selected program segment execution times of MIRKDC and PMIRKDC in Experiment #9. This experiment is run sequentially on the Ultra 2. The vast difference in execution times is explained by the subroutine call and call-per-mesh pro?les shown in Figures 26 and 27.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. Overall execution time
?

Absolute execution time of program segments 6 5
? ?

15

seconds

10
?

seconds

4 3 2 1
? ? ? ? ? ? ? ? ?

?

5

0

0

r

d

c

f

s

other

Figure 25: Overall and selected program segment execution times of MIRKDC and PMIRKDC in Experiment #10. This experiment is run sequentially on the Ultra 2. The vast difference in execution times is explained by the subroutine call and call-per-mesh pro?les shown in Figures 28 and 29.
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. Overall execution time
?

Absolute execution time of program segments 20 15
(24)
↑ ? ?

60

seconds

40

seconds

10 5

?

?

20
?

? ?

?

?

? ?

0

0

? ?

r

d

c

f

s

other

44

Figure 26: Pro?les of MIRKDC in Experiment #9. These differ signi?cantly from those of PMIRKDC (Figure 27).
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ? call profile 400
2496 4% 9%

? call?per?mesh profiles (ε < 2.5%)
10% 2496 1248 10% 15% 624 312 257 9% 160 12% 80 40 20 9% ε 26% 7%
1

3% 12%

5%

13%

total number of calls

1248

300 200 100 0

624 312 257 160 80 40 20 10

16%

10

13% ε
1

20% 4%

mesh r d c f s program segment

mesh

f (91 calls total)

s (385 calls total)

Figure 27: Pro?les of PMIRKDC in Experiment #9. PMIRKDC computes an acceptable solution using fewer subroutine calls and fewer mesh subintervals than MIRKDC, resulting in substantially reduced execution time (Figure 24).
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ? call profile 250
13% 686 343 312 257 160 80 40 20 7% ε 13% 13%
1

? call?per?mesh profiles (ε < 2.5%)
7% 22% 7% 4% 686 343 312 18% 257 8% 15% 160 80 40 20 10 18% 19% 4% ε
1

23%

total number of calls

200 150 100 50 0

4%

10

mesh r d c f s program segment

mesh

f (60 calls total)

s (262 calls total)

45

Figure 28: Pro?les of MIRKDC in Experiment #10. These differ signi?cantly from those of PMIRKDC (Figure 29).
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ? call profile 500
9584 4% 8% 4792 2396 1198 599 545 392 160 80 40 20 10 22% 7% ε
1

? call?per?mesh profiles (ε < 2.5%)
9% 9584 4792 2396 1198 599 7% 545 10% 392 160 80 40 20 6% 10% 7% 10 ε
1

5% 11%

3%

11%

total number of calls

10% 9%

400 300 200 100 0

12%

10%

19%

5%

mesh r d c f s program segment

mesh

6%

8%

f (125 calls total)

s (492 calls total)

Figure 29: Pro?les of PMIRKDC in Experiment #10. PMIRKDC computes an acceptable solution using fewer subroutine calls and fewer mesh subintervals than MIRKDC, resulting in substantially reduced execution time (Figure 25).
LEGEND: ? ? MIRKDC, ? ? PMIRKDC, r ? residual evaluation, d ? defect estimation, c ? ABD matrix construction, f ? ABD matrix factorization, s ? ABD system backsolve. ? call profile 300
7% 1092 14% 546 546 392 160 20% 80 40 20 16% ε1 80 40 20 10 13% 13% 18% 18% ε1 19% 11%

? call?per?mesh profiles (ε < 2.5%)
5% 14% 1092 9% 19%

total number of calls

250 200 150 100 50 0 r d c f s program segment

392 160

10

mesh

mesh s (281 calls total)

f (69 calls total)

46


赞助商链接
相关文章:
更多相关文章: