Mixed MPI and OpenMP extension of the MOLSCAT v14 package.
Here you can access the repository for extensions of the scattering
code MOLSCAT originally developed by Sheldon Green and collaborators
until he passed away in 1995.
The present version, developed by P. Valiron and George C. McBane,
is targeted at large
close coupling calculations on parallel supercomputers and computer
grids, and combines
- the original serial version from the
MOLSCAT repository
maintained by Jeremy M. Hutson: see Version 14 and the latest release (2019 onwards) on the MOLSCAT repository on GitHub;
- the "Poor Man Parallel"
PMP MOLSCAT
variant developed by George C. McBane at
Grand Valley State University since 2005 for dispatching
independant (JTOT, M) calculations over an MPI parallel environment;
- further extensions and optimizations
for serial, MPI and multi-threading environments
(including full OpenMP support for ITYPE=1,2,3,4,7 and INTFLG=6,8);
- new algorithms to alleviate the
memory requirements for very large CC calculations involving ITYPE=3,4.
One important motivation for these developments was to extend close
coupling calculations to higher temperatures for molecules of
astrophysical interest, especially for collisions involving
para-H2(J=0,2) and ortho-H2.
Earlier developments
Original MOLSCAT v14 serial package
The original MOLSCAT v14 repository is
maintained by Jeremy M. Hutson. This v14 package is purely serial.
For reference, you may also download from here the corresponding
original Fortran source (gzipped file), the documentation in text
format (txt file), or the documentation in HTML format (gzipped tar file).
Early parallel variant at Daresbury Laboratory, UK.
An early parallel version of MOLSCAT, based on older version 12 of the
serial code, was implemented for the Intel iPSC/860 parallel
supercomputer at Daresbury Laboratory, UK. It was also successfully
used on Cray T3D machines. The User Manual for this implementation is
no longer available on the MOLSCAT v14 repository, but a paper version
has been scanned here for reference purposes. This code involved large changes
in the MOLSCAT source and was based on the PVM message-passing library,
which is now superseded by MPI. Due to the memory limitations of the
iPSC/860 and T3D computers, the storage of the matrix elements of the
angular expansion of the potential was optionally distributed over
several processes, making the code a bit complicated.
This code is "dead" and is not incorporated in the present
parallel version.
"Poor Man Parallel" PMP MOLSCAT MPI variant
George C. McBane at Grand Valley State University developed in 2005 a
variant of MOLSCAT v14 for dispatching independent (JTOT, M)
calculations over an MPI parallel environment. This PMP code is available
from McBane's
home page. For reference, you may also
download the corresponding PMP manual from here
(in pdf format).
This MPI code is very useful if your problem takes a long time to run
and involves many (JTOT, M) combinations. It is also expected to be very
reliable because it involves only minimal changes to the original
source.
The PMP code has, however, several limitations.
- The behaviour of the original MOLSCAT is somewhat modified. In
particular the IOS and pressure broadening calculations are not
parallelized, and the resonance searching and the automatic JTOT convergence
checking are disabled. This
means that the original v14 MOLSCAT source must be used when the
original behaviour is required.
- The price of this simplicity is a limitation of the dynamic dispatch
mode (the most effective for proper load balancing): one MPI task must
be devoted to dispatching the (JTOT, M) combinations and is lost for
computation. This limits the parallel efficiency, especially if a
moderate number of MPI tasks is used.
- Individual (JTOT, M) combinations are still run serially. Thus
they must fit within the memory and time limit allocated per CPU
core. The memory scales as MXLAM*N^2/2 real*8 words, where MXLAM is the
number of terms in the angular expansion of the potential and N the
number of channels. For a large problem, the required memory per core
may exceed several gigabytes, beyond the typical memory per core
available on today's massively parallel supercomputers. The corresponding
CPU times may also exceed current batch queue limits.
- PMP MOLSCAT only produces S-matrices and requires post-processing
utilities for ISAVEU files, such as those provided on the original MOLSCAT
page as S-matrix save tape post-processors. In particular, state-to-state
integral cross sections are produced by the utility sig_save.f (also
provided in the v14_tools/ sub-directory). However, the original
version by S. Green does not handle ITYP = 4, 8, or 9.
Developments provided in the present version
The user is expected to be familiar with the original MOLSCAT v14
serial code, and to have read the PMP manual in detail.
File naming conventions are the same as for PMP MOLSCAT. Input
is read from file molscat.in. If MPI is disabled (see pmpnone.f below), the
extension .9999 is used for output files to avoid any confusion with
MPI numbering.
Extensions to PMP MOLSCAT
The present code implements several extensions to the original PMP
MOLSCAT (compare to the PMP manual for reference).
- pmpnew.f
Compiling with pmpnew.f provides an extended
dynamic dispatch mode where MPI task 0 overlays dispatching and
computations. This feature improves the scalability at the price of a
small latency in the dispatching, and supersedes both routines pmpstat.f
and pmpdyn.f provided by the original PMP variant. It is possible to
revert to a dedicated dispatcher by setting LL_COMPUTE=.false.
in the namelist &INPUT (see the namelist sketch after this list).
Modified files: A synchronization call (call synchro_tasks) has been
added in many places to allow calculations on the master to run
concurrently with the dispatching of tasks: after all linear algebra
calls involving DGEMUL, DGESV, SYMINV, F02AAF, F02ABF, DSYMM, DSYR2K;
in the outer loops for the build up of matrix elements (couple.f,
cpl*.f, mcgcpl.f, entry cpl6 in set6.f); in I/O routines readir and
wridir.
- pmpnone.f
Compiling with pmpnone.f disables the MPI
environment completely. It also restores most functionalities of the plain
serial MOLSCAT which are unsupported in the PMP-enabled variant
(resonance searching, automatic JTOT convergence checking, etc.).
This makes it possible to keep
using the same code base for serial or OpenMP applications.
Modified files: driver.f and a few minor changes elsewhere.
- smerge.f
This program combines the S-matrices
produced independently by the MPI tasks. A minor extension makes it
possible to specify in the namelist &MERGE that the input
S-matrices are in big_endian or little_endian format. Usage:
convert='big_endian' or convert='little_endian' (see the namelist sketch
after this list). This feature is
useful if the data is collected on a computer distinct from the
production machines. PC computers are always
little-endian. Itanium-based machines are also little-endian, while
IBM PowerPC and Power-based machines and some mainframes are big-endian.
If the input
S-matrices are a mix of little-endian and big-endian files, you may
process the two endian sets separately and recombine them in a further
smerge step.
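As an illustration of the two namelist settings mentioned above, the
relevant input lines might look as follows. This is only a sketch: the
other &INPUT and &MERGE parameters (represented by "...") are the usual
ones documented in the MOLSCAT and PMP manuals, and the values shown are
purely illustrative.

  &INPUT
    ...                         ! usual MOLSCAT input parameters
    LL_COMPUTE = .false.,       ! revert to a dedicated dispatcher task (pmpnew.f)
  &END

  &MERGE
    ...                         ! smerge parameters as described in the PMP manual
    convert = 'little_endian',  ! or 'big_endian': endianness of the input S-matrices
  &END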
Support for OpenMP and "mixed-mode" OpenMP/MPI
A large fraction of the CPU time is spent in the linear algebra routines. Thus
an obvious way to accelerate individual (JTOT, M) calculations is to
link to multi-threaded Lapack and BLAS libraries. This also makes it
possible to share the memory requirements over several
CPU cores. However other parts of the code, and notably the
generation of the angular matrix elements, limit the parallelization
efficiency.
Full OpenMP support is presently provided for ITYPE=1,2,3,4,7 and
INTFLG=6,8, and achieves an excellent speedup for large problems
using 2 to 4 OpenMP threads.
Example of quasi-linear scalability on IDRIS (IBM zahir) for H2CO-H2,
using OpenMP and lapack + esslsmp libraries,
MXLAM=146, N=1032, INTFLG=6, NSTEPS=192 (one JTOT, two symmetries):
zahir1/vali01.892344: ELAPSED TIME 1635.78 CLK SECS (1 thread)
zahir2/vali02.892341: ELAPSED TIME 857.54 CLK SECS (2 threads)
zahir4/vali04.892357: ELAPSED TIME 413.57 CLK SECS (4 threads)
OpenMP multi-threading for individual (JTOT, M)
calculations can be freely combined with MPI dispatch of the various
(JTOT, M) combinations, resulting in "mixed-mode" OpenMP/MPI runs
(see the example below).
More details about multi-threading
here.
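As an illustration, a mixed-mode run might be launched along the
following lines; the executable name, the MPI launcher and the
task/thread counts depend on your installation and batch system and are
shown here only as an assumed example.

  $ export OMP_NUM_THREADS=4      # OpenMP threads per MPI task
  $ mpirun -np 8 ./molscat.x      # 8 MPI tasks, each running 4 OpenMP threads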
Alternative inversion routine SYMINV
The implementation of the SYMINV routine is important for achieving proper
serial or multi-threaded performance. Two alternative routines are provided.
- syminv_dsy.f corresponds to the original MOLSCAT inversion routine
for symmetric indefinite matrices and relies on the Lapack routines
DSYTRF and DSYTRI. As these routines mostly rely on Level 2 BLAS,
they do not parallelize well for multi-threaded runs.
- syminv_dge.f provides a variant based on the general matrix
inversion routines DGETRF and DGETRI, which parallelize much better
but are twice as costly in terms of floating-point operations. In some
implementations this general inversion scheme may nevertheless prove
faster than the symmetric indefinite one, especially for a large number
of channels, due to better handling of the cache (illustrated below).
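For reference, the sketch below illustrates the Lapack calling sequence
underlying the general-matrix variant. It is not the actual syminv_dge.f
source, merely a minimal self-contained example of the DGETRF/DGETRI
pair applied to a small symmetric matrix (link against Lapack/BLAS to
run it).

  program invert_demo
    ! Illustration only: invert a small symmetric matrix with the
    ! general-matrix LAPACK pair DGETRF (LU factorization) + DGETRI
    ! (inverse built from the LU factors), i.e. the scheme used by the
    ! syminv_dge.f variant.
    implicit none
    integer, parameter :: n = 3, lwork = 64*n
    double precision :: a(n,n), work(lwork)
    integer :: ipiv(n), info, i
    a = reshape([4.d0, 1.d0, 2.d0, &
                 1.d0, 3.d0, 0.d0, &
                 2.d0, 0.d0, 5.d0], [n, n])      ! small symmetric test matrix
    call dgetrf(n, n, a, n, ipiv, info)            ! LU factorization
    if (info /= 0) stop 'dgetrf failed'
    call dgetri(n, a, n, ipiv, work, lwork, info)  ! inverse from the LU factors
    if (info /= 0) stop 'dgetri failed'
    do i = 1, n
       print '(3f10.5)', a(i,:)                    ! the inverse, row by row
    end do
  end program invert_demo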
The appropriate routine must be selected in the compil* file
enabled in your Makefile. Hints for this selection are provided
here.
Faster diagonalisation routines F02ABF and F02AAF
These routines are heavily used in the Airy propagator to compute
eigenvalues and eigenvectors (F02ABF) or eigenvalues only (F02AAF).
They belong to the commercial NAG library and are implemented in
MOLSCAT using Lapack. The original routines call DSYEVX, which is now
superseded by DSYEVR. The performance gain on F02ABF is substantial on
some architectures, up to a factor of 6 on a Power4 mainframe, and the
parallelization efficiency is also generally improved.
These gains apply to the first energy. Additional energies (if NNRG
> 1) require eigenvalues only, with much less acceleration.
The appropriate routines must be selected in the compil* file
enabled in your Makefile. Hints for this selection are provided
here.
Note also that a proper implementation of DSYEVR is
essential.
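For reference, a minimal sketch of the DSYEVR calling sequence is given
below. It is not the F02ABF/F02AAF replacement code itself, only a
self-contained illustration of the Lapack interface involved
(RANGE='A' computes all eigenvalues, JOBZ='V' also returns the
eigenvectors).

  program dsyevr_demo
    ! Illustration only: all eigenvalues and eigenvectors of a small
    ! symmetric matrix with LAPACK DSYEVR.
    implicit none
    integer, parameter :: n = 3
    double precision :: a(n,n), w(n), z(n,n), work(26*n), vl, vu, abstol
    integer :: iwork(10*n), isuppz(2*n), il, iu, m, info
    a = reshape([ 2.d0, -1.d0,  0.d0, &
                 -1.d0,  2.d0, -1.d0, &
                  0.d0, -1.d0,  2.d0], [n, n])
    abstol = 0.d0                           ! let DSYEVR choose a default tolerance
    vl = 0.d0; vu = 0.d0; il = 1; iu = n    ! unused when RANGE='A'
    call dsyevr('V', 'A', 'U', n, a, n, vl, vu, il, iu, abstol, &
                m, w, z, n, isuppz, work, size(work), iwork, size(iwork), info)
    if (info /= 0) stop 'dsyevr failed'
    print *, 'eigenvalues:', w(1:m)
  end program dsyevr_demo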
Alternative routines for Wigner 3-j, 6-j and 9-j recoupling
coefficients.
For most ITYP values, and in particular ITYP=3,4, MOLSCAT uses
routine threej for 3-j coefficients with zero projections, and
routines thrj, sixj and xninej for 3-j, 6-j and 9-j coefficients,
respectively.
However, like all similar routines written in floating-point
arithmetic, the thrj, sixj and xninej routines suffer from severe
numerical stability problems for large quantum numbers (in the
range 50-100). As an example, compute with the original MOLSCAT routines
the 6-j symbol
{ J1  56  24 }
{ 35  12  42 }   for J1 = 32 .. 54.
The values obtained vary according to machine, compiler and
compiling options and may be wrong by orders of magnitude.
The correct value for J1=32 is -2.143216E-007.
Similar problems are encountered with the corresponding Wigner
recoupling routines written by B. Follmeg as provided in the popular
HIBRIDON package
developed by Millard Alexander et al. However, the HIBRIDON recoupling routines
are based on pre-computed logarithms of factorials and generally
present slightly less severe numerical stability problems.
The code provides several alternatives in the Makefile:
- Use the original MOLSCAT routines.
- Use the replacement routines from HIBRIDON. The 3-j and 6-j
HIBRIDON routines are significantly faster than the MOLSCAT
ones, and may be preferred to their MOLSCAT counterparts for
production.
- Use a quadruple accuracy version of the replacement routines
thrj, sixj and xninej for checking purposes or for production
involving large quantum numbers. These accurate routines are
slower than the original MOLSCAT ones by a moderate factor (2 to 4).
As a default choice, we recommend the faster replacement routines
adapted from HIBRIDON. If large quantum numbers are involved,
the safest choice is to check the accuracy with the quadruple
accuracy routines. If quadruple accuracy is not available on
your platform (as is the case with Opterons under Solaris),
or if it is too slow, you might compare results obtained with the
original MOLSCAT recoupling routines and the HIBRIDON ones. As the
algorithms involved are completely different, numerical
instabilities will develop differently and an agreement for
cross sections is likely to indicate that the recoupling
coefficients are okay.
MOLSCAT also uses routines j3j000, j6j and j9j which provide a
vector of results corresponding to all permissible values of
J1. These routines use algorithms proposed by Schulten and
Gordon (1975) and permit useful optimizations for a few ITYP
cases and for the evaluation of xninej. For the moment no
replacement routines for better numerical stability or for
quadruple accuracy have been implemented.
Alleviating the memory requirements
The memory required to store the angular coupling matrix elements may
grow very large for some ITYPE cases, as it scales as MXLAM*N^2/2
(in real*8 words) where MXLAM is the number of terms in the angular
expansion of the potential and N the number of channels. In addition
this memory is accessed at each step, generating a significant memory
bandwidth bottleneck especially on multi-core architectures.
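As a concrete illustration, for the H2CO-H2 benchmark quoted above
(MXLAM=146, N=1032) this amounts to 146 x 1032^2 / 2 ~ 7.8e7 real*8
words, i.e. about 0.6 GB for the angular coupling matrix elements alone.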
The required memory can be estimated using
storage.x.
It is advisable to limit these requirements, especially in view of
running MOLSCAT on architectures with limited core memory, such as the
BlueGene/P supercomputer presently installed at
IDRIS, or for experimenting with the emerging Cell or NVIDIA GPU
coprocessor chips.
An alternative kindly suggested by Millard Alexander is to exploit
the sparsity of the VL couplings, as is already done within the HIBRIDON package. An OpenMP-aware
implementation along these lines is now provided for ITYPE=3,4, and is
automatically disabled for other ITYPE values. This packing algorithm
for the VL array is selected by specifying VLTHRESH in the namelist &INPUT,
and is generally recommended as a default choice (see the sketch at the
end of this section).
Another alternative is to precalculate and spline-interpolate the
W coupling matrix over a distance mesh. The algorithm is presently
implemented for ITYPE=3,4, and is automatically disabled for other
ITYPE values. It runs in-core if enough storage is available, and
automatically switches to out-of-core storage with moderate I/O
requirements (~ N^2 per step) when needed. It is selected
by specifying the logical unit IRWU in the namelist &INPUT, and
by defining the corresponding distance mesh in the namelist &MESH
(see the sketch below). This
mesh-based algorithm is very efficient in terms of memory bandwidth,
but it may be more memory demanding than the packed one for in-core
operation. It also requires a distance mesh, which may be
inconvenient for problems involving very long-range interactions (such
as ultra-cold collisions). Its major benefit is to eliminate
completely the VL storage for out-of-core operation, which
permits running calculations with very large N values.
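As an illustration, the two memory-saving options might be selected as
sketched below. The VLTHRESH and IRWU values are purely illustrative,
the "..." stands for the usual input parameters, and the contents of the
&MESH namelist are not shown here; refer to the release notes and the
documentation of this package for the actual parameter names, their
exact meaning and recommended values.

  ! Option 1: packing of the VL array (ITYPE=3,4)
  &INPUT
    ...                   ! usual MOLSCAT input parameters
    VLTHRESH = 1.d-12,    ! illustrative threshold (see the release notes for its meaning)
  &END

  ! Option 2: mesh-based in-core/out-of-core storage of the W coupling matrix (ITYPE=3,4)
  &INPUT
    ...
    IRWU = 20,            ! illustrative logical unit for the W storage
  &END
  &MESH
    ...                   ! distance-mesh definition, see the package documentation
  &END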
Support for portability
Memory allocation
Originally, MOLSCAT allocates its workspace statically in the main routine.
On some platforms a simple trick makes it possible to allocate the
workspace at run time. This is especially useful on mainframes, to avoid
maintaining different executables adapted to various batch queues.
The corresponding value (in real*8 words) is then read from the
command line, e.g.
$ molscat.x 10000000
The required workspace can be estimated using
storage.x.
More details are given
here.
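The kind of trick involved is sketched below for illustration; this is
not the actual MOLSCAT driver code, merely a minimal example of reading
a workspace size from the command line and allocating it at run time.
It would be invoked in the same way as above, e.g. ./workspace_demo 10000000.

  program workspace_demo
    ! Illustration only: read the workspace size (in real*8 words) from
    ! the command line and allocate the workspace at run time.
    implicit none
    double precision, allocatable :: x(:)
    character(len=32) :: arg
    integer :: nwords, ierr
    call get_command_argument(1, arg)
    read (arg, *, iostat=ierr) nwords
    if (ierr /= 0 .or. nwords <= 0) stop 'usage: workspace_demo <nwords>'
    allocate(x(nwords), stat=ierr)
    if (ierr /= 0) stop 'allocation failed'
    print *, 'allocated', nwords, 'real*8 words (', 8.d-6*nwords, 'MB )'
    ! ... the allocated array would then be passed to the computational driver ...
  end program workspace_demo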
Utility routines
A large set of utility routines is provided to suit most architectures.
Also, the original utility routine PARITY has been renamed FARITY
to avoid a name conflict in some system libraries.
More details are given
here.
Makefile templates for various platforms
Presently supported platforms are:
- AIX Power with xlf, Lapack 3.1.1 and ESSL.
- Linux PC (32-bit and 64-bit) with Intel ifort and MKL (or ACML,
faster for AMD processors).
- Linux Itanium2 with Intel ifort, Lapack 3.1.1 and GotoBLAS (MKL
is presently
buggy).
- Solaris amd64 with Studio 11 compiler and ACML library
(faster than SunPerf for AMD processors).
More details are given
here.
Download source code
- Initial distribution of 26 March 2008.
- Distribution of 07 April 2008. Provides an important
bugfix in pmpnew.f, and updates to various files.
- Update of 17 April 2008. Provides a new diagonalisation
scheme for asymmetric top levels (ITYPE=4,6), which fixes an improper labeling
of levels in MOLSCAT output.
- Update of 28 April 2008. Implements a couple of
suggestions from George C. McBane after running the code at the San Diego
Supercomputer Center.
- Distribution of 05 May 2008. Provides
i) a bugfix in daprop.f for OpenMP;
ii) an efficient packing algorithm of the VL array for ITYPE=3,4
(selected by specifying VLTHRESH in the
namelist &INPUT);
iii) an OMP-parallelized version of COUPLE for ITYPE 1, 2, 7
contributed by George C. McBane;
iv) a new version of the smerge utility including the update
posted by George C. McBane;
v) a utility to estimate the memory storage
requirements for your calculations.
- Distribution of 12 May 2008 (gzipped tar file). Provides
i) a mesh-based algorithm for in-core or out-of-core storage of the
W coupling matrix, and ii) an updated version of the utility
storage.x.
- Update of 14 May 2008. Provides
OpenMP support in WAVVEC for ITYPE = 2 and 7 contributed by George C. McBane.
- Update of 27 May 2008 (gzipped tar file). Provides benchmarking tools to
select the proper matrix inversion and diagonalisation routines, and
updated makefiles.
More details are given
here.
- See release
notes for further details and known problems.
This web page was originally created by Dr. Pierre Valiron, who passed
away in August 2008. It is not currently being updated except for link
verification.
Contacting authors:
We acknowledge support from various sources, and in particular: