Iterative Parallel Methods for Boundary Value Problems

G. Kraut*
Department of Mathematics and Computer Science
The University of Texas at Tyler
Tyler, TX 75799-0001

I. Gladwell†
Department of Mathematics
Southern Methodist University
Dallas, TX 75275-0156

*Author wishes to acknowledge the use of the Computing Facilities at the Advanced Computing Research Facility, Mathematics and Computer Science Division, Argonne National Laboratory.
†Author wishes to acknowledge the support of NATO grant CRG 920037.

Abstract

A bordered almost block diagonal (BABD) system results from discretizing and linearizing ordinary differential equation (ODE) boundary value problems (BVPs) with non-separated boundary conditions (BCs) by spline collocation, finite differences, or multiple shooting. After internal condensation, if necessary, this BABD system reduces to a standard finite difference BABD structure. The system can be solved either by a "direct" divide-and-conquer approach or by an iterative scheme such as preconditioned conjugate gradients (PCG). Preconditioners approximating the inverse of the finite difference operator are effective and can be computed and applied efficiently in a parallel environment. We present theoretical computational costs comparing direct and iterative methods, and numerical results computed on a Sequent Symmetry shared memory computer. These demonstrate that the PCG method can outperform the divide-and-conquer approach on systems with many processors when approximating large differential systems. The PCG method also "scales up" better than the implemented divide-and-conquer method.

1 Introduction

We consider the linear system arising from the discretization of linear ODE BVPs with non-separated BCs. We present evidence that PCG methods using a new preconditioner can outperform direct divide-and-conquer algorithms for the parallel solution of these systems. This paper extends [12] by presenting results for the implemented PCG method, rather than extrapolating from other theoretical and experimental evidence, and by giving the parallel implementation of the method. In section 2 we discuss the mathematical problem, its discretization, serial and parallel direct algorithms for the discretized system, and its numerical stability. In section 3 we develop a new preconditioner for the conjugate gradient algorithm. The currently implemented version of the divide-and-conquer algorithm is outlined in section 4. In section 5 we present the parallel implementation and the computational costs of the methods outlined in the previous sections. Finally, in section 6 we describe our test problems and tabulate the comparative timings and speedups of the implemented algorithms.

2 Mathematical Problem

Consider the system of n linear first-order two-point BVPs
\[ y' = A(x)y + q(x), \quad x \in [a,b], \tag{1} \]
where y, q \in R^n and A \in R^{n \times n}, subject to linear non-separated BCs
\[ B_a y(a) + B_b y(b) = \beta, \tag{2} \]
where B_a, B_b \in R^{n \times n}. A problem of the form (1), (2) is solved at every iteration of a quasilinearization method for the standard nonlinear problem
\[ y' = f(x,y), \quad g(y(a), y(b)) = 0, \quad x \in [a,b]. \tag{3} \]
Multiple shooting, finite difference, and spline collocation methods choose a mesh a = x_1 < x_2 < \cdots < x_{K+1} = b and a method-dependent discretization to generate vectors s_i that approximate y(x_i), i = 1, \ldots, K+1. In each case a linear system Ys = b must be solved for s_1, s_2, \ldots, s_{K+1}. Ys = b has the form
\[ \begin{pmatrix} B_a & & & & B_b \\ S_1 & R_1 & & & \\ & S_2 & R_2 & & \\ & & \ddots & \ddots & \\ & & & S_K & R_K \end{pmatrix} \begin{pmatrix} s_1 \\ s_2 \\ \vdots \\ s_{K+1} \end{pmatrix} = \begin{pmatrix} \beta \\ f_1 \\ \vdots \\ f_K \end{pmatrix}. \tag{4} \]
Here S_i, R_i \in R^{n \times n} and s_i, f_i \in R^n. For example, the midpoint finite difference approximation defines
\[ S_i = -I - \tfrac{h_i}{2} A(x_{i+1/2}), \quad R_i = I - \tfrac{h_i}{2} A(x_{i+1/2}), \quad q_i = h_i\, q(x_{i+1/2}), \quad 1 \le i \le K, \tag{5} \]
where h_i = x_{i+1} - x_i and x_{i+1/2} = x_i + h_i/2. After a suitable transformation the multiple shooting matrix
has a similar form [2]. In the case of spline collocation a parallelizable initial internal condensation is needed to reduce the system to this structure [2]. Except for the block B_b, resulting from the non-separated BCs, the matrix Y is almost block diagonal (ABD). Stable and efficient direct methods, such as NAG's F01LHF [15], COLROW [3], and ABDPACK/ABBPACK [13, 14], have been available for some time to solve ABD systems. Only recently has attention turned to BABD systems. Wright [18, 19], and others [1, 10], have designed variants on direct divide-and-conquer approaches for solving (4).

Wright [20] has shown that conventional Gaussian elimination with partial pivoting can be unstable for BABD linear systems (4) arising from exact multiple shooting. By an extension of his argument they are potentially unstable for finite difference and spline collocation discretizations. Consider the linear first-order two-point boundary value problem (1), (2), and the particular problem, taken from [20], defined on x \in [0, 60] by
\[ A(x) = \begin{pmatrix} -\tfrac{1}{6} & 1 \\ 1 & -\tfrac{1}{6} \end{pmatrix}, \quad B_a = B_b = I, \tag{6} \]
with q(x) and \beta as given in [20]. The problem defined in (6) is a well-conditioned BVP in the sense of [2], so we expect its discretizations to give well-conditioned linear systems. The blocks S_i and R_i of Y in (4) are constant. If we pre-multiply Y by diag(I, R^{-1}, R^{-1}, \ldots, R^{-1}) and let -B = R^{-1}S we obtain
\[ \tilde Y = \begin{pmatrix} I & & & & I \\ -B & I & & & \\ & -B & I & & \\ & & \ddots & \ddots & \\ & & & -B & I \end{pmatrix}. \]
Choose h small enough so that the elements of B are less than one in magnitude. This is certainly possible since
\[ -B = -R^{-1}S = \Bigl(I - \tfrac{h}{2}A\Bigr)^{-1}\Bigl(I + \tfrac{h}{2}A\Bigr) = (I + hA) + O(h^2) \approx e^{hA}, \]
and
\[ I + hA = \begin{pmatrix} 1 - \tfrac{h}{6} & h \\ h & 1 - \tfrac{h}{6} \end{pmatrix}. \]
If Gaussian elimination with row partial pivoting is applied to the matrix \tilde Y, no interchanges occur, and the following LU factorization of \tilde Y is obtained:
\[ \tilde Y = \hat L \hat U = \begin{pmatrix} I & & & \\ -B & I & & \\ & \ddots & \ddots & \\ & & -B & I \end{pmatrix} \begin{pmatrix} I & & & & I \\ & I & & & B \\ & & \ddots & & \vdots \\ & & & I & B^{K-1} \\ & & & & I + B^K \end{pmatrix}. \]
It is easy to see that exponential element growth takes place in the last column of \hat U. For example, when K = 200 (h = 0.3), the largest element in the \hat U factor is approximately 2.59 \times 10^{21}. This behavior occurs whenever A has negative diagonal elements, since then, provided h is sufficiently small, B has all its elements less than one in magnitude and no pivoting will occur. The instability is not restricted to matrices A with negative diagonal entries, nor to problems with constant coefficient matrices, nor to matrices derived from finite difference discretizations. A similar analysis applies to Wright's structured LU divide-and-conquer method (SLU) [18] discussed in section 4.

Complete pivoting on the BABD system produces a stable LU factorization for the matrices Y and \tilde Y. Both the L and U factors are often sparse, but the fill-in pattern is somewhat irregular and can be significant; the cost and memory requirement are unpredictable and can be large. The structured QR (SQR) divide-and-conquer algorithm [19] is the most efficient stable algorithm, in either serial or parallel implementation, of which we are aware. It costs about four times as much as the potentially unstable SLU divide-and-conquer algorithm [18], which we use in our comparisons. Separating the BCs as described in [2] and solving the expanded ABD system with an efficient ABD solver also produces a stable factorization. However, the doubled size of the expanded system leads to a cost similar to that of the SQR algorithm and double the storage space. A cost summary of these methods is provided in Table 1 in section 5.

Linear systems with large coefficient matrices that have regular nonzero structure, such as block-tridiagonal and block-banded matrices, are often solved more cost-effectively with iterative methods than with direct methods. An iterative method requires less storage and can be faster than a direct method. The BABD matrix in (4) is large and sparse with a unique block structure.

3 Preconditioned Conjugate Gradients

We solve this linear system using a PCG method and compare this approach to Wright's SLU divide-and-conquer approach. The discussion above indicates that a more stable direct parallel algorithm would cost significantly more (up to a factor of 4); this should be taken into account when studying the comparative costs in Tables 3-10 in section 6. Since the coefficient matrix of (4) is not symmetric positive definite we apply (implicitly) the CG method to the normal equations
\[ Y^T Y s = Y^T b \tag{7} \]
using a modified version of Algorithm 10.3.1 in Golub and Van Loan [5]; see Algorithm 1. In well-posed BVPs the condition number of Y (and hence of Y^T Y) is moderate. Many authors have studied CG methods for large systems of linear equations on serial [6, 7] as well as parallel [9, 16, 17] machines.

Algorithm 1
  Let M \approx (Y^T Y)^{-1} be the preconditioner; k = 0; s_0 = 0; r_0 = Y^T b
  while ||r_k|| / ||Y^T b|| > error tolerance
    z_k = M r_k; k = k + 1
    if k = 1 then
      p_1 = z_0
    else
      \beta_k = r_{k-1}^T z_{k-1} / r_{k-2}^T z_{k-2}
      p_k = z_{k-1} + \beta_k p_{k-1}
    end
    \alpha_k = r_{k-1}^T z_{k-1} / (Y p_k)^T (Y p_k)
    s_k = s_{k-1} + \alpha_k p_k
    r_k = r_{k-1} - \alpha_k Y^T Y p_k
  endwhile
  s = s_k

If M = I, the standard conjugate gradient method results. For a system of size N, the iteration must converge, using exact arithmetic, in at most N iterations, and the Y^T Y-norm of the error is decreased at every step. Although CG is guaranteed to converge in at most N iterations, Dongarra et al. [4] point out that, in practice, the number of iterations is roughly proportional to the square root of the condition number of the coefficient matrix Y^T Y. In our numerical tests we observe convergence in O(N) iterations when solving (4) using the standard CG method. This is unacceptable since the systems of interest can be very large: it is not unusual to have a system of n = 20 ordinary differential equations and K = 600 mesh points, which results in a system of N = 12000 linear equations.

Since the condition number of Y^T Y, and hence the speed of convergence of the conjugate gradient method, depends strongly on the spectrum of Y, we need a preconditioner M that gathers the eigenvalues of M Y^T Y, preferably near unity, so that the preconditioned conjugate gradient method converges quickly. Clearly, if M = (Y^T Y)^{-1} the iteration converges immediately; by choosing M \approx (Y^T Y)^{-1} we hope the iteration will converge very quickly. The difficulty lies in finding efficient preconditioners that allow a sufficient degree of parallelism. Both the computation and the application of the preconditioner are crucial to the cost of this algorithm, and both are generally difficult to parallelize. We have considered using a complete factorization of each diagonal block of Y^T Y or Y Y^T as a preconditioner. This can be implemented in parallel, but in our tests this version of PCG converged in O(N) iterations, essentially the same as with no preconditioning [11]. Since this and other widely used methods are not successful in improving the convergence of the CG method for the BABD system, we investigate the structure of the inverse of the matrix Y in (4). The inverse of Y is full.
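Algorithm 1 and the midpoint assembly (4)-(5) can be sketched as follows. This is a minimal dense Python illustration of ours, not the authors' FORTRAN 77 implementation; the smooth inhomogeneity q(x) is a hypothetical choice, and the mesh is short and uniform so the system stays small and well conditioned.

```python
import numpy as np

def assemble_babd(A, q, Ba, Bb, beta, mesh):
    """Assemble the BABD system (4) for the midpoint scheme (5), densely for clarity."""
    n, K = Ba.shape[0], len(mesh) - 1
    N = n * (K + 1)
    Y = np.zeros((N, N)); b = np.zeros(N)
    Y[:n, :n] = Ba; Y[:n, -n:] = Bb; b[:n] = beta
    for i in range(K):
        h = mesh[i + 1] - mesh[i]; xm = mesh[i] + h / 2
        r = slice(n * (i + 1), n * (i + 2))
        Y[r, n * i:n * (i + 1)]       = -np.eye(n) - (h / 2) * A(xm)   # S_i
        Y[r, n * (i + 1):n * (i + 2)] =  np.eye(n) - (h / 2) * A(xm)   # R_i
        b[r] = h * q(xm)                                               # f_i = q_i
    return Y, b

def pcg_normal_eqs(Y, b, M=None, tol=1e-12, maxit=None):
    """Algorithm 1: (preconditioned) CG applied implicitly to Y^T Y s = Y^T b."""
    N = Y.shape[0]; M = np.eye(N) if M is None else M
    s = np.zeros(N); r = Y.T @ b; rhs_norm = np.linalg.norm(r)
    p = None; rz_old = None; k = 0
    while np.linalg.norm(r) / rhs_norm > tol and k < (maxit or 10 * N):
        z = M @ r; rz = r @ z; k += 1
        p = z if p is None else z + (rz / rz_old) * p   # beta_k = rz / rz_old
        Yp = Y @ p
        alpha = rz / (Yp @ Yp)
        s = s + alpha * p
        r = r - alpha * (Y.T @ Yp)
        rz_old = rz
    return s, k

A = lambda x: np.array([[-1/6., 1.], [1., -1/6.]])   # constant coefficients as in (6)
q = lambda x: np.array([np.sin(x), np.cos(x)])       # hypothetical smooth inhomogeneity
Ba = Bb = np.eye(2); beta = np.ones(2)
mesh = np.linspace(0.0, 1.0, 33)                     # K = 32 intervals on [0, 1]
Y, b = assemble_babd(A, q, Ba, Bb, beta, mesh)
s, iters = pcg_normal_eqs(Y, b)                      # M = I: standard CG
assert np.linalg.norm(Y @ s - b) < 1e-6 * np.linalg.norm(b)
print("converged in", iters, "iterations")
```

With M = I this is the standard CG of the text; a preconditioner is supplied through the argument M.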
Each n \times n block of the inverse matrix depends on every n \times n block of the original matrix Y. The blocks with the largest elements appear in positions in the inverse matrix that depend on the problem, and generally not within the nonzero structure of the BABD matrix.

3.1 Construction of the Preconditioner

These observations lead us to explore a full matrix approximate inverse. When using a finite difference approximation, Y consists of blocks originating from the BCs and the discretization, for example the midpoint scheme as defined in (5). As the number of mesh points K becomes large, h becomes small and \tfrac{h_i}{2} A(x_{i+1/2}) tends to zero. We therefore approximate Y by
\[ Z = \begin{pmatrix} B_a & & & & B_b \\ -I & I & & & \\ & -I & I & & \\ & & \ddots & \ddots & \\ & & & -I & I \end{pmatrix} \]
and choose the preconditioner to be M = Z^{-1}(Z^{-1})^T. If (B_a + B_b)^{-1} exists, then
\[ Z^{-1} = (B_a + B_b)^{-1} \begin{pmatrix} I & -B_b & -B_b & \cdots & -B_b \\ I & B_a & -B_b & \cdots & -B_b \\ I & B_a & B_a & \cdots & -B_b \\ \vdots & \vdots & & \ddots & \vdots \\ I & B_a & B_a & \cdots & B_a \end{pmatrix}. \]
The simple approximate inverse preconditioner M = Z^{-1}(Z^{-1})^T produces remarkably rapid convergence for many examples. However, it has two disadvantages: (i) (B_a + B_b)^{-1} must exist, which is not true in many commonly occurring cases; (ii) no account is taken of the effect of the choice of step sizes h_i or of the behavior of A(x), so it can be ineffective for solving many boundary layer problems with realistic mesh distributions. In [11] the first author discusses some extensions to this choice of Z (and hence M) in which the identity matrices are replaced by other approximations of S_i and R_i. Though these approximations gave some improvement in some cases, none were uniformly better than those given here.
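The explicit block form of Z^{-1} means it can be applied to a vector with O(K) block operations and without ever forming Z^{-1}; a sketch of ours (not the paper's FORTRAN 77), using running block sums, with small hypothetical data:

```python
import numpy as np

def apply_Zinv(Ba, Bb, r):
    """Apply Z^{-1} to a stacked vector r = (r_1, ..., r_{K+1}) using its explicit
    block form: z_i = (Ba+Bb)^{-1} [ r_1 + Ba*sum_{2<=j<=i} r_j - Bb*sum_{j>i} r_j ],
    accumulated with running block sums instead of a dense solve."""
    n = Ba.shape[0]
    R = r.reshape(-1, n)                 # block rows r_1 ... r_{K+1}
    tail = R[1:].sum(axis=0)             # sum_{j=2}^{K+1} r_j
    C = np.linalg.inv(Ba + Bb)           # formed once, outside the PCG loop in practice
    z = np.empty_like(R)
    head = np.zeros(n)                   # sum_{j=2}^{i} r_j, empty when i = 1
    for i in range(R.shape[0]):
        if i > 0:
            head += R[i]; tail -= R[i]
        z[i] = C @ (R[0] + Ba @ head - Bb @ tail)
    return z.ravel()

# Hypothetical small data: n = 2, K = 4, with Ba + Bb invertible.
rng = np.random.default_rng(0)
n, K = 2, 4
Ba, Bb = np.eye(n), np.eye(n)
Z = np.zeros((n * (K + 1), n * (K + 1)))
Z[:n, :n] = Ba; Z[:n, -n:] = Bb
for i in range(K):
    Z[n*(i+1):n*(i+2), n*i:n*(i+1)] = -np.eye(n)
    Z[n*(i+1):n*(i+2), n*(i+1):n*(i+2)] = np.eye(n)
r = rng.standard_normal(n * (K + 1))
assert np.allclose(apply_Zinv(Ba, Bb, r), np.linalg.solve(Z, r))
```

Applying the full preconditioner M = Z^{-1}(Z^{-1})^T amounts to one transposed and one direct application of this kernel per iteration.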
4 Stable Parallel Elimination for Boundary Value ODE's
After stripping the BC blocks from (4), Wright's algorithm [18] slices the coefficient matrix and performs Gaussian elimination with row partial pivoting on the submatrix in each slice. For simplicity we use K/2 slices of two block rows each. For the first slice we obtain
\[ L_{1,n}P_{1,n} \cdots L_{1,1}P_{1,1} \begin{pmatrix} S_1 & R_1 & 0 \\ 0 & S_2 & R_2 \end{pmatrix} = \begin{pmatrix} G_1 & U_1 & E_1 \\ \tilde S_1 & 0 & \tilde R_1 \end{pmatrix}. \]
This process is repeated for each slice. A reduced system, involving only K/2 block rows of equations (excluding the BCs), can be formed by taking the last block row of each partition after the factorization above has been applied. This system has the coefficient matrix
\[ \begin{pmatrix} B_a & & & & B_b \\ \tilde S_1 & \tilde R_1 & & & \\ & \tilde S_2 & \tilde R_2 & & \\ & & \ddots & \ddots & \\ & & & \tilde S_{K/2} & \tilde R_{K/2} \end{pmatrix}. \tag{8} \]
Since the matrix in (8) has the same structure as that of the original system, the elimination procedure just described can be applied again, recursively, log_2 K times. Eventually a 2n \times 2n system of the form
\[ \begin{pmatrix} B_a & B_b \\ \tilde S & \tilde R \end{pmatrix} \tag{9} \]
is obtained, which is solved by standard Gaussian elimination with partial pivoting. The intermediate unknowns can be recovered by forward and back substitution, using the matrices U_i, G_i, and E_i. These forward and backsolves can be done either entirely on the last processor, or in log_2 K steps involving a fan-in and a fan-out parallel computation.

5 Parallel Computational Cost

The following definitions and assumptions are made in computing the costs below (the basic unit of cost is the linked triad operation a + b \cdot c):

  n - size of the ODE system (assumed fixed and large);
  K - number of internal blocks, equal to the number of mesh intervals in the discretization (for simplicity we assume K = 2^q, so there are K+1 block rows in the matrix Y);
  P - number of processors (parallelism is introduced at the block level);
  j - number of iterations of the PCG algorithm.

5.1 Preconditioned Conjugate Gradient

We describe the parallel implementation used in the computed examples. The computation of the matrix-vector products and the application of the preconditioner in Algorithm 1 are carried out in parallel. These operations comprise about 90% of the total execution time of Algorithm 1. Ideally, to perform the matrix-vector products or the application of the preconditioner in parallel, we would divide the (K+1) block rows of the BABD matrix into P slices with an equal number of block rows, giving a balanced load for each processor. However, we restrict ourselves to K = 2^q mesh points with K mod P = 0, in order to compare numerical results with Wright's SLU algorithm. This choice of K produces a well-balanced load for Wright's SLU algorithm but an imbalanced load for the PCG algorithm. In our implementation of the PCG algorithm we divide the K interior block rows into P equal sections of K/P block rows and include the boundary condition block row in the first section. Each slice is assigned to a processor, so each processor is accessed only once per matrix-vector product.

The matrix-vector products used to form the \alpha_k and r_k in Algorithm 1 require 4n^2 operations for each block row; at the block level they are implemented using level 2 BLAS. The simple explicit form of Z^{-1} is important and is exploited in an efficient parallel implementation. The application of the preconditioner in z_k = Z^{-1}(Z^{-1})^T r_k utilizes the block structure of each block row of Z^{-1}. We compute the building blocks (B_a + B_b)^{-1}, -(B_a + B_b)^{-1}B_b, and (B_a + B_b)^{-1}B_a of Z^{-1} before calling the PCG code, and (B_a + B_b)^{-1} r_k is computed only once per iteration. Two n \times n matrix-vector products per block row are then applied to summed subvectors of r_k. This results in a workload of 8n^2(K/P + 1) operations on the first processor and 8n^2 K/P operations on each of the remaining processors. Basing the cost on the work of the first processor, the approximate cost of PCG is
\[ C_{PCG} \approx 8n^2\Bigl(\frac{K}{P} + 1\Bigr) \tag{10} \]
operations per iteration. This PCG involves about twice the cost per iteration of the standard CG but converges much faster (see Table 2).

5.2 Parallel Elimination for BVP ODEs

The major cost of the divide-and-conquer approach is in the elimination step. Our results are based on Wright's SLU algorithm as implemented on the Alliant FX/8, which we adapted for the Sequent Symmetry. This code performs structured Gaussian elimination in three levels, of which only the first is implemented in parallel. For this first level, the matrix Y is divided into P sections with K/P block rows per processor. This leaves P block rows to be eliminated at the second level and the usual 2n \times 2n matrix to be eliminated at the third level. The cost of the entire elimination is approximately
\[ C_{SLU} \approx \frac{23n^3}{6}\Bigl(\frac{K}{P} + P\Bigr). \tag{11} \]
It is easy to see that this SLU algorithm becomes less efficient as P and n get large: the matrix at the second level becomes very large and its serial factorization expensive.

In Table 1 we cost the algorithms discussed in this paper, except Gaussian elimination with complete pivoting, for which the worst-case cost may be misleading. The cost of the corresponding serial algorithm is obtained by setting P = 1. The cost for F01LHF is based on the expanded system with 2n \times 2n interior blocks and n \times n boundary condition blocks. For all other algorithms the cost is based on the BABD matrix (4).

  Software                     Operation Count               Storage
  F01LHF (expanded, serial)    n^3 (23K + 8/3)               8Kn^2 + 4n^2
  SQR                          (46/3) n^3 (K/P + log P)      4Kn^2
  SLU                          (23/6) n^3 (K/P + log P)      4Kn^2
  SLU (implemented)            (23/6) n^3 (K/P + P)          4Kn^2
  PCG                          8n^2 (K/P + 1) j              K(2n^2 + 5n)

  Table 1: Cost Summary of Discussed Algorithms
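The two-block-row reduction that dominates the SLU costs above can be sketched as follows. This is our illustration, not Wright's code: for clarity we eliminate the interior unknown of each slice directly, without the row partial pivoting that SLU applies, which is safe here because the hypothetical R_i blocks are close to the identity.

```python
import numpy as np

# One level of the divide-and-conquer reduction of section 4, sketched without
# the row partial pivoting of Wright's SLU.
rng = np.random.default_rng(1)
n, K = 2, 8                                    # K/2 = 4 slices of two block rows
I = np.eye(n)
S = [-I - 0.05 * rng.standard_normal((n, n)) for _ in range(K)]
R = [ I - 0.05 * rng.standard_normal((n, n)) for _ in range(K)]
f = [rng.standard_normal(n) for _ in range(K)]
Ba, Bb, beta = I, I, rng.standard_normal(n)

def babd_solve(S, R, f, Ba, Bb, beta):
    """Dense solve of the BABD system (4), for reference only."""
    K, N = len(S), n * (len(S) + 1)
    Y = np.zeros((N, N)); b = np.zeros(N)
    Y[:n, :n], Y[:n, -n:], b[:n] = Ba, Bb, beta
    for i in range(K):
        Y[n*(i+1):n*(i+2), n*i:n*(i+1)] = S[i]
        Y[n*(i+1):n*(i+2), n*(i+1):n*(i+2)] = R[i]
        b[n*(i+1):n*(i+2)] = f[i]
    return np.linalg.solve(Y, b)

# Reduce each slice [S1 R1 0; 0 S2 R2] to one block row coupling s_{2i-1}, s_{2i+1}:
# s_2 = R1^{-1}(f1 - S1 s_1) gives (-S2 R1^{-1} S1) s_1 + R2 s_3 = f2 - S2 R1^{-1} f1.
St, Rt, ft = [], [], []
for i in range(0, K, 2):
    W = np.linalg.solve(R[i], np.column_stack([S[i], f[i]]))   # R1^{-1} [S1 | f1]
    St.append(-S[i + 1] @ W[:, :n])                            # tilde S
    Rt.append(R[i + 1])                                        # tilde R = R2
    ft.append(f[i + 1] - S[i + 1] @ W[:, n])                   # tilde f

s_full = babd_solve(S, R, f, Ba, Bb, beta)
s_red  = babd_solve(St, Rt, ft, Ba, Bb, beta)                  # reduced system (8)
assert np.allclose(s_red, s_full.reshape(-1, n)[::2].ravel())  # agrees at retained points
```

Because the reduced system has the same BABD structure, the same step can be repeated log_2 K times, as in section 4.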
5.3 Comparison

For PCG to be competitive with the implemented SLU algorithm, we require the number of iterations to satisfy
\[ j < \frac{23n}{48} \cdot \frac{K + P^2}{K + P}. \tag{12} \]
This cost comparison suggests that the PCG method should obtain the solution faster than SLU for large enough n and P.

6 Test Problems

We used the following problems in our comparisons:

Problem 1: Example 5.18 from [2], a constant coefficient problem:
\[ y'' - 4y = 16x + 12x^2 - 4x^4, \quad x \in [0,1], \quad y(0) = y'(1) = 0, \]
which has the unique solution y(x) = x^4 - 4x.

Problem 2: Example 5.1 from [2], a variable coefficient problem:
\[ y'' = -\frac{1}{x}y' + \Bigl(\frac{8}{8 - x^2}\Bigr)^2, \quad x \in [0,1], \quad y'(0) = y(1) = 0, \]
which has the unique solution y(x) = 2 \ln \frac{7}{8 - x^2}.

Problem 3: Example 1 from [20]; see (6). This problem is well conditioned, but Gaussian elimination with row partial pivoting is unstable for it.

Problem 4: A set of four first-order equations composed of Example 10.3 from [2] together with Problem 1.3.1 from Hemker [8], set up to be scaled up by using a new value of \varepsilon for each repeated set of equations on [0,1]. (In this way we avoid potential problems arising from using PCG on a set of problems created by repetition without variation of parameters, where we would expect PCG to converge in the same number of iterations as for one copy of the equations.) The first equation is
\[ \varepsilon y'' + (2 + \cos \pi x)y' - y = g(x), \]
where g(x) = -(1 + \varepsilon\pi^2)\cos \pi x - \pi(2 + \cos \pi x)\sin \pi x + \bigl(\tfrac{3\pi^2 x^2}{2\varepsilon} - 1\bigr)e^{-3x/\varepsilon}, with y'(0) = 3/\varepsilon and y(1) = -1, which has the exact solution y(x) = \cos \pi x - e^{-3x/\varepsilon} + O(\varepsilon^2) and a boundary layer at x = 0. The second equation is
\[ \varepsilon y'' - y' = 0, \quad y'(0) = \frac{1}{\varepsilon(e^{1/\varepsilon} - 1)}, \quad y(1) = 1, \]
which has the unique solution y(x) = \dfrac{e^{x/\varepsilon} - 1}{e^{1/\varepsilon} - 1} and a boundary layer at x = 1.

6.1 Computational Results

  Problem   No. of meshpts.   Order of matrix   CG iterations   PCG iterations
  1         100               202               202             13
            200               402               402             13
            500               1002              1002            13
  2         100               202               207             14
            200               402               411             14
            500               1002              1019            15
  3         200               402               77              55
            600               1202              210             52
            2400              4802              743             44
            6000              12002             1739            41
  The number of ODEs is n = 2.
  Table 2: Number of Iterations Using Algorithm 1

The SLU algorithm and the PCG algorithm were coded in FORTRAN 77 on both a Sun 4/490 server and a Sequent Symmetry shared memory multiprocessor. To show the performance of the new preconditioner we include in Table 2 the numerical results of the serial PCG algorithm, computed on the Sun; similar results were obtained on one processor of the Sequent Symmetry. The CG algorithm converges in about O(N) iterations, while PCG converges in an essentially constant number of iterations independent of K (see Table 2). We also constructed larger differential systems by repetition of Problems 1-3; for these, PCG converges in the same number of iterations as for one copy of the differential equation.

The computations for our parallel tests were performed on a Sequent Symmetry at the Advanced Computing Research Facility, Mathematics and Computer Science Division, Argonne National Laboratory. The results for the four test problems are given in Tables 3 through 10. The timings under SLU and PCG are in seconds. The PCG computations used an iteration error tolerance of 10^{-8}. To verify the predicted cost comparison, in which PCG should perform relatively well for large n, we include test results for n = 20, 32, and 40. We varied K to demonstrate that the rate of convergence is independent of the number of mesh points. For each of Problems 1 and 2 we give the timing comparison of SLU with PCG and the speedup for each algorithm. Here speedup is measured relative to the same algorithm implemented on one processor. It would normally be desirable to measure speedup relative to the fastest stable algorithm but, without a full implementation and thorough testing, it is not clear which algorithm to use for these basic measurements.

In Tables 3 and 4 we present the results for the singularly perturbed problem, Problem 4. For each additional set of four equations, the parameter \varepsilon is changed. The BVP becomes more difficult to solve as \varepsilon gets smaller. When n = 20, PCG converges in 40 iterations; it demonstrates better overall speedup than parallel SLU but takes three to four times as long to converge. When n = 40, PCG requires 135 iterations and on one processor takes about six times as long as SLU.

  P    SLU     Speedup   PCG      Speedup
  1    303.2             1203.8
  2    158.0   1.91      630.6    1.91
  4    87.0    3.48      337.6    3.57
  8    52.5    5.78      191.6    6.28
  16   39.7    7.64      120.5    9.99
  e_1 = 1.25, e_2 = 1.00, e_3 = 0.75, e_4 = 0.5, e_5 = 0.25; n = 20, K = 512; no. of PCG iterations = 40.
  Table 3: Problem 4: Comparison of PCG and SLU

  P    SLU     Speedup   PCG      Speedup
  1    591.5             3853.4
  2    311.0   1.90      2001.3   1.93
  4    173.1   3.42      1034.7   3.72
  8    119.7   4.94      576.8    6.68
  16   118.4   4.99      337.5    11.49
  e_1 = 1.75, e_2 = 1.50, e_3 = 1.25, e_4 = 1.00, e_5 = 0.75, e_6 = 0.50, e_7 = 0.25, e_8 = 0.175, e_9 = 0.15, e_10 = 0.125; n = 40, K = 128; no. of PCG iterations = 135.
  Table 4: Problem 4: Comparison of PCG and SLU

For Problem 3, with N > 200, SLU is unstable and does not obtain an accurate solution; PCG converges to the solution for all N. For Problem 3 we include two tests. One solves the original system with only 2 equations and large K; here the interval over which the solution is obtained is large, with h = 0.01 when K = 6000. In this case the parallel PCG algorithm is implemented at the level of 2 x 2 blocks and the speedup suffers, especially as we use more processors. In contrast, when n = 20 and K = 512, we see speedups similar to those for the other problems.

  P    SLU     Speedup   PCG     Speedup
  1    613.5             485.1
  2    307.0   1.99      248.7   1.95
  4    163.5   3.75      131.1   3.70
  8    94.0    6.53      72.4    6.70
  16   76.2    8.05      43.5    11.15
  n = 32, K = 256; no. of PCG iterations: 13.
  Table 6: 3-level SLU and PCG for Prob. 1

  P    SLU     Speedup   PCG     Speedup
  1    592.8             374.7
  2    310.8   1.91      191.9   1.95
  4    164.4   3.61      100.1   3.74
  8    108.6   5.46      55.8    6.72
  16   110.0   5.39      33.5    11.19
  n = 40, K = 128; no. of PCG iterations: 13.
  Table 7: 3-level SLU and PCG for Prob. 1

  P    SLU     Speedup   PCG     Speedup
  1    300.7             480.2
  2    161.6   1.86      246.4   1.94
  4    84.8    3.54      134.0   3.58
  8    52.7    5.71      76.7    6.25
  16   39.2    7.68      45.3    10.61
  n = 20, K = 512; no. of PCG iterations: 16.
  Table 8: 3-level SLU and PCG for Prob. 2

  P    SLU     Speedup   PCG     Speedup
  1    568.8             551.0
  2    299.2   1.90      282.6   1.94
  4    167.6   3.39      148.1   3.72
  8    100.9   5.64      82.2    6.70
  16   83.8    6.79      49.8    11.07
  n = 32, K = 256; no. of PCG iterations: 14.
  Table 9: 3-level SLU and PCG for Prob. 2

  P    SLU     Speedup   PCG     Speedup
  1    595.9             397.0
  2    311.2   1.92      202.9   1.95
  4    169.4   3.51      107.2   3.70
  8    115.3   5.17      59.2    6.70
  16   117.7   5.06      35.6    11.16
  n = 40, K = 128; no. of PCG iterations: 14.
  Table 10: 3-level SLU and PCG for Prob. 2

  P    n = 20   Speedup   n = 2   Speedup
  1    1604.8             479.7
  2    832.7    1.93      269.4   1.78
  4    444.7    3.61      165.3   2.90
  8    254.8    6.30      112.6   4.26
  16   158.7    10.11     87.9    5.46
  n = 20, K = 512, no. of PCG iterations: 54; n = 2, K = 6000, no. of PCG iterations: 41.
  Table 11: PCG for Problem 3
  P    SLU     Speedup   PCG     Speedup
  1    304.7             392.2
  2    153.6   1.98      205.5   1.91
  4    78.5    3.88      110.5   3.55
  8    43.5    7.00      62.9    6.24
  16   30.5    9.99      39.9    9.82
  n = 20, K = 512; no. of PCG iterations: 13.
  Table 5: 3-level SLU and PCG for Prob. 1

6.2 Speedup

The results in Tables 3 and 4 indicate that for a singularly perturbed problem PCG does not compete with SLU, but it scales up better than the implemented version of SLU. The results presented in Tables 5 through 10 indicate that PCG converges to the solution faster than SLU for large enough n. It is also clear from the results that the three-level implementation of SLU is not efficient for a large number of processors (see especially Tables 7 and 10). Too much of the work is done on the second level, which is sequential. As the number of processors increases, the second-level matrix gets larger and more and more of the elimination work is serial. At some stage the full parallel SLU, with its additional overhead, will outperform the 3-level implemented scheme. It is not clear whether the full scheme will ever be competitive with PCG.

The speedup of PCG is about what is to be expected. In [12] we showed that parallelizing the underlying matrix-vector multiplies gives linear speedup as long as the matrices are set up in parallel using the same block structure. These effects carry over clearly to the full PCG algorithm.
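The serial second level explains the SLU speedup plateau: from the cost model (11), the predicted speedup of the implemented SLU is S(P) = C_SLU(1)/C_SLU(P) = (K+1)/(K/P + P), which is maximized near P = sqrt(K). A quick check of this model (our sketch; it ignores all overheads):

```python
import math

def slu_speedup(K, P):
    """Predicted speedup of the implemented SLU from the cost model (11):
    C_SLU(P) ~ (23 n^3 / 6)(K/P + P), so S(P) = C(1)/C(P) = (K + 1)/(K/P + P)."""
    return (K + 1) / (K / P + P)

# For K = 128 (Tables 7 and 10) the model predicts no gain from P = 8 to P = 16,
# since the serial second level grows with P; the measured speedups indeed stall
# there (5.46 -> 5.39 in Table 7, 5.17 -> 5.06 in Table 10).
assert slu_speedup(128, 8) == slu_speedup(128, 16)

# The model's optimal processor count is near sqrt(K).
best = max(range(1, 65), key=lambda P: slu_speedup(128, P))
assert abs(best - math.sqrt(128)) < 2
```

For K = 512 the predicted optimum moves out past P = 16, consistent with the better SLU speedups in Tables 3, 5, and 8.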
7 Conclusions
We have presented a new full matrix preconditioner for the conjugate gradient algorithm. When it is successful, the number of iterations is independent of the size of the linear system, provided h is small (how small depends on the differential system being solved). This preconditioned conjugate gradient algorithm generally gives close to linear speedup when implemented on a shared memory parallel processor; that is, it has the potential for scaling up with problem size. It can outperform a divide-and-conquer direct approach, especially on large problems using a large number of processors.
References
[1] U. M. Ascher and S. Y. P. Chan, "On Parallel Methods for Boundary Value ODEs", Department of Computer Science, University of British Columbia, Vancouver, B.C. V6T 1W5, Preprint, 1989.
[2] U. M. Ascher, R. M. M. Mattheij, and R. D. Russell, Numerical Solution of Boundary Value Problems for Ordinary Differential Equations, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[3] J. C. Diaz, G. Fairweather, and P. Keast, "FORTRAN Packages for Solving Certain Almost Block Diagonal Linear Systems by Modified Alternate Row and Column Elimination", ACM Trans. Math. Softw., 9 (1983) 358-375.
[4] J. J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia, 1991.
[5] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd edition, Johns Hopkins University Press, Baltimore, 1989.
[6] A. Greenbaum, "Behavior of the conjugate gradient algorithm in finite precision arithmetic", UCRL-85752, Lawrence Livermore National Laboratory, CA 94550, 1981.
[7] L. Hageman and D. Young, Applied Iterative Methods, Academic Press, New York, 1981.
[8] P. W. Hemker, A Numerical Study of Stiff Two-Point Boundary Value Problems, Mathematisch Centrum, Amsterdam, 1977.
[9] W. H. Holter and T. C. Oppe, "Parallelizable Preconditioned Conjugate Gradient Methods for the CRI Cray Y-MP and TMC Connection Machine", Supercomputer Computations Research Institute, Florida State University, Tallahassee, FL 32306-4052, 1991.
[10] K. R. Jackson and R. N. Pancer, "The Parallel Solution of ABD Systems Arising in Numerical Methods for BVPs for ODEs", Technical Report No. 255/91, Computer Science Department, University of Toronto, Toronto, Ontario, Canada, M5S 1A4, 1992.
[11] G. L. Kraut, "Parallel Direct and Iterative Methods for Boundary Value Problems", Ph.D. Thesis, Southern Methodist University, Dallas, TX 75275-0156, 1993.
[12] G. L. Kraut and I. Gladwell, "Parallel Methods for Boundary Value Problem Linear Algebra", Proceedings, Sixth SIAM Conference on Parallel Processing for Scientific Computing, (1993) 647-651.
[13] F. Majaess, P. Keast, G. Fairweather, and K. R. Bennett, "The solution of almost block diagonal linear systems arising in spline collocation at Gaussian points with monomial basis functions", ACM Trans. Math. Softw., 18 (1992) 193-204.
[14] F. Majaess, P. Keast, G. Fairweather, and K. R. Bennett, "Algorithm 704: ABDPACK and ABBPACK - Fortran programs for the solution of almost block diagonal linear systems arising in spline collocation at Gaussian points with monomial basis functions", ACM Trans. Math. Softw., 18 (1992) 205-210.
[15] NAG FORTRAN 77 Library Manual, NAG Ltd, Wilkinson House, Jordan Hill Rd, Oxford OX2 8DR, England, 1988.
[16] D. P. O'Leary, "The Block Conjugate Gradient Algorithm and Related Methods", Lin. Alg. and its Applics., 29 (1980) 293-322.
[17] M. K. Seager, "Parallelizing conjugate gradient for the CRAY X-MP", Parallel Computing, 3 (1986) 35-47.
[18] S. J. Wright, "Stable Parallel Elimination for Boundary Value ODE's", Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, 1991.
[19] S. J. Wright, "Stable Parallel Algorithms for Two Point Boundary Value Problems", SIAM J. Sci. Stat. Comput., 13 (1992) 742-764.
[20] S. J. Wright, "A Collection of Problems for which Gaussian Elimination with Partial Pivoting is Unstable", SIAM J. Sci. Stat. Comput., 14 (1993) 231-238.