Working Results
Relevant zip files full of information have been moved to make way for the question-and-answer
section below. Check here if you're looking for a specific file
you don't have the address to.
Questions and Answers
In many respects, the main question motivating much of our work has
been: what determines the saturation of speed-up with
the number of processors? Thus far all of our work has
focused on strong scalability (fixed problem size, increasing
processor count), but that restriction isn't a necessity for everything we're interested
in.
The follow-up question to the saturation limits is whether we can
overcome these initial issues. Below we address as much of
this as possible and provide numerical evidence to support our
conclusions.
Note that the most significant of these results were presented at the
APS-DPP meeting in November. Where appropriate I'll note that the results
were presented there. As we continue to present more work, that
material will also be posted here. Upcoming venues include the FACETS
meeting in February and SIAM-CSE11.
LOGOS: I've used some very high-resolution pictures for presentations and posters in the past. Those can be found
here.
Considering a 1D radial decomposition
Our initial experiments, starting back in Fall 2009 and continuing until
Spring 2010, dealt with the use of the LU decomposition as the primary
solver for UEDGE. Because of the desire to conduct simulations with
active neutral gas terms, our initial belief was that a direct solver
was needed for the linear solves.
In fact, that belief may be valid for some steady state simulations.
Recall that we use a very large time step (dt = O(1)) to reach a steady state.
Thus, knowing the effect of the time step is significant in determining an acceptable solver.
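Schematically (assuming a backward-Euler-type step; the exact assembly inside UEDGE may differ), the linear systems
involve a matrix of roughly the form A ≈ I/dt - ∂F/∂u. A small dt contributes a strong diagonal term that keeps the
problem local and well conditioned, which is friendly to domain decomposition, while dt → ∞ leaves essentially the
bare steady-state Jacobian ∂F/∂u with all of its long-range coupling, which is where a direct solver earns its keep.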
Question: What effect does the choice of time step have on the 1D decomposition?
Answer: Decreasing the time step increases the viability of ASM.
That's the one-line answer, but the picture is more complicated if we treat more than just the time
step as a variable.
Below are pictures showing the deteriorating scalability as the time step increases with the neutral
equations active.
Note that these images were generated with a Jacobian lag of 1.
Figure 1: Note the speed-up values listed to the left of the legend.
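For reference, the two solvers being compared here correspond roughly to the following PETSc runtime options (the
option names are standard PETSc; exactly how they get passed through the UEDGE driver is an assumption on my part):

    # LU: direct factorization as the preconditioner
    # (a parallel LU requires an external package such as SuperLU_DIST or MUMPS)
    -ksp_type preonly -pc_type lu

    # ASM: one-level Additive Schwarz, with the overlap and sub-solver defaults defined further down
    -ksp_type gmres -pc_type asm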
The images above indicate that the scalability of the LU decomposition
is unaffected by dt, although it does suffer its traditional limitations.
The ASM preconditioner does falter significantly for larger time steps,
and actually fails to converge for sufficiently large processor counts
once the time step is large enough. This is part of what confirmed
our belief that the direct solver is necessary when incorporating
neutral gases in steady state simulations.
For small enough time steps we can use a solver other than LU and still solve the system.
Of course, we're not interested in merely solving the system; eventually we want to solve it as quickly as
possible.
To get at that, we need to understand what effect the component equations are having on the solver.
The common belief is that the neutral gas terms are holding back the performance, which leads to the following.
Question: What effect does the choice of time step have on the 1D decomposition in
the absence of neutrals? (plasma equations only)
Answer: For smaller time steps, ASM is the optimal solver.
Below are graphs that show the effect of the time step on the plasma equations only.
Figure 2: Note the speed-up values listed to the left of the legend.
Once again we see the improvement in speed for ASM and the time step agnosticism for LU.
The surprising feature here is that the gain in scalability with ASM eventually overcomes the loss in quality due
to neglected coupling between domains.
This, to my mind, helps confirm our belief that the presence of the neutrals is what causes the loss of
scalability in ASM.
The next logical test to supplement our scalability study is to analyze the solver in the absence
of the plasma equations, with neutral terms only.
Question: What effect does the choice of time step have on the 1D decomposition in
the absence of plasma equations? (neutral terms only)
Answer: Tests underway (AMG assumed to be preferred regardless of time step)
I expect that when these tests are done we will see multigrid as the preferred solver, but there is no guarantee
that the difference between multigrid and LU will be significant.
There are obviously a lot of tuning parameters for multigrid, so there may be something to look at there.
For the time being, let's work under the assumption that AMG is optimal; I'll fix that if the experiments come
back otherwise.
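If and when those experiments run, a plausible starting point for the neutral-only system is algebraic multigrid
through one of PETSc's AMG backends; the sketch below uses hypre's BoomerAMG, which assumes PETSc was built with
hypre:

    # AMG (BoomerAMG via hypre) as the preconditioner for the neutral-only runs
    -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg

Most of the multigrid tuning parameters alluded to above live under the -pc_hypre_boomeramg_* options, so that is
where any follow-up tuning would happen.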
Thus far I have been using the term ASM without properly defining it.
ASM is the Additive Schwarz domain-decomposition solver with overlap=1 and an LU solve on each subdomain.
The overlap roughly refers to how much coupling between the otherwise disjoint subdomains is allowed.
Each subdomain has its own solver, although they are always the same type of solver.
Given this information about ASM, it is reasonable to ask whether the overlap=1 and LU sub_solver defaults are
appropriate. Let's examine that below.
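In PETSc options terms, the ASM configuration used so far looks roughly like this (again, the exact spelling as it
passes through the UEDGE driver is my assumption); the two knobs examined in the next two questions are the overlap
and the sub-solver:

    # ASM as used so far: overlap=1 and an exact LU solve on each subdomain
    -pc_type asm -pc_asm_overlap 1 -sub_ksp_type preonly -sub_pc_type lu
    #
    # the knobs examined below:
    #   -pc_asm_overlap <k>      amount of overlap between neighboring subdomains
    #   -sub_pc_type <lu|ilu>    factorization used on each subdomain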
Question: What effect does the amount of overlap have on the ASM solver assuming
dt=1e-4?
Answer: Increasing the overlap improves the total speed of the solve, to a point.
What we see in the picture below is that there is a significant difference between overlap=0 and overlap=1.
Note that the overlap=0 case is the block Jacobi style of domain decomposition, and increasing the overlap pushes
the solver towards the fully coupled solution.
Recall that the bars are scaled so that all the bars would be of equal height under optimal
strong scalability (n times as many processors run in 1/n the time).
Figure 3: Note the average linear iterations per nonlinear iteration listed beneath each experiment.
Of significance here is the large initial improvement in scalability, which tapers off as the overlap increases.
Also note the shift in computational time from KSP Solve and PC Apply to LU Factor.
This is because the quality of the preconditioner is improving with increasing overlap and thus fewer linear
iterations are required.
The cost of that better preconditioner is that more work is required to create it.
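As a usage sketch, the overlap scan above amounts to something like the loop below; ./uedge_driver and its -dt flag
are hypothetical stand-ins for however the runs are actually launched:

    # hypothetical launcher; sweep the ASM overlap at fixed dt=1e-4 on 8 processors
    for k in 0 1 2 4; do
      mpiexec -n 8 ./uedge_driver -dt 1e-4 \
          -pc_type asm -pc_asm_overlap $k -sub_pc_type lu -log_summary
    done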
Question: What effect does the choice of sub_solver have on the 1D decomposition
assuming dt=1e-4?
Answer: No significant difference ... for NP ≤ 8
This result is only partially complete because of the difficulties we've had running on more than 8 processors.
So far it seems that LU and ILU are equally fine for the time step we've chosen.
It does look like for larger NP we would see a benefit from ILU, but I'd have to do an experiment to confirm.
It would also be valuable to consider the effect at other time steps.
Figure 4: Note the average linear iterations per nonlinear iteration listed beneath each experiment.
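For reference, the ILU variant differs from the LU default only in the subdomain factorization, roughly:

    # incomplete LU on each subdomain instead of an exact factorization
    # (ILU(0) shown; the fill level actually used in these runs is not something I'm asserting here)
    -pc_type asm -pc_asm_overlap 1 -sub_pc_type ilu -sub_pc_factor_levels 0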
For the moment, it seems as though the standard choices for Additive Schwarz are appropriate, although as the
conditions of our experiments change we may need to re-evaluate.
One big change that should be introduced now is the use of a physics preconditioner, which in PETSc
terminology is called FieldSplit.
Here the terms "physics preconditioner", "component-wise preconditioner", and "operator-specific preconditioner" are
all used interchangeably.
Question: Assuming dt=1e-4, how can we use what we've learned so far to
produce a better solver?
Answer: A FieldSplit preconditioner using the optimal solvers on subsets of the
variables.
The structure of the physics preconditioner used here, as well as a broader discussion of existing and potential
FieldSplit options will be provided elsewhere.
For the time being, this solver consists of two blocks where the plasma terms are solved independently of the
neutral terms and vice versa.
Any coupling between the two terms is neglected for simplicity, but may be of significance in future experiments.
Each set of variables is solved with its optimal solver: neutrals - AMG; plasma - ASM.
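Concretely, this two-block splitting maps onto PETSc options roughly as follows; the split names "plasma" and
"neutral" are illustrative and depend on how the fields are actually registered in the UEDGE driver, and the
additive composition is what drops the coupling between the blocks:

    # two-block FieldSplit: ASM on the plasma variables, AMG on the neutral variables
    -pc_type fieldsplit -pc_fieldsplit_type additive
    -fieldsplit_plasma_ksp_type preonly -fieldsplit_plasma_pc_type asm
    -fieldsplit_plasma_sub_pc_type lu
    -fieldsplit_neutral_ksp_type preonly -fieldsplit_neutral_pc_type hypre
    -fieldsplit_neutral_pc_hypre_type boomeramg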
The scalability results can be seen below.
Figure 5: Relevant information showing the effectiveness of FieldSplit over LU is given to the right.
Even with this simple splitting of the problem, the significance of operator-specific preconditioning is
immediately obvious.
The strong scalability of the solver is near optimal for up to 8 processors, and the number of linear iterations
doesn't undergo the explosion that we see for ASM.
This allows for the FieldSplit preconditioner to outperform LU not just in scalability, but in total time as well.
There are some questions that can be raised with FieldSplit given these results.
Question: What effect does the time step have on the effectiveness of FieldSplit?
Answer: Tests underway (assumed tendency toward LU-like speed)
Experiments will need to be done to determine the usefulness of FieldSplit for larger time steps, although my guess
is that performance will hold up even there.
I feel that way because ASM on the plasma equations alone at dt=1e-1 was about equivalent to LU, so I doubt FS would
do any worse than LU.
Given that, I imagine that for appropriately large time steps the FS preconditioner will behave much like the LU
preconditioner, as that will begin to dominate the computation.
Another issue we haven't tackled so far is the Jacobian lag.
We don't want to have to reevaluate the Jacobian at each SNES step if the previous Jacobian is still a "good
enough" preconditioner.
If FieldSplit is going to be the preferred 1D solver, we need to determine its usefulness for other lags.
Question: What effect does the Jacobian lag have on the effectiveness of FieldSplit?
Answer: A lag of up to 5 has no detrimental effect on FieldSplit.
The experiments above all use a Jacobian lag of 1 by default, but for a production-level code we would
want the solver to work with higher lag values.
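In PETSc terms the lagged runs correspond roughly to:

    # rebuild the Jacobian (and hence the preconditioner) only every 5th SNES iteration
    -snes_lag_jacobian 5

PETSc can also lag the preconditioner separately from the Jacobian (-snes_lag_preconditioner) if that distinction
ever becomes useful.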
The graph below compares the speed of the solve for LU and FS given lag=5.
Figure 6: Scalability information is listed to the left of the legend.
Here we see roughly the same margin of outperformance of FS over LU.
This is somewhat surprising, because it was previously assumed that without a low lag the solver would take too many
linear iterations and lose its edge.
Instead we see that is not the case, although the same can't necessarily be said for larger time steps, or even in
the multistepping case.
We'll need more experiments to say something about that.