Mixed MPI and OpenMP extension of the MOLSCAT v14 package.

Here you can access the repository for extensions of the scattering code MOLSCAT, originally developed by Sheldon Green and collaborators until he passed away in 1995.

The present version, developed by P. Valiron and George C. McBane, is targeted at large close coupling calculations on parallel supercomputers and computer grids. It combines MPI dispatch of independent (JTOT, M) calculations with OpenMP multi-threading within each calculation.

One important motivation for these developments was to extend close coupling calculations to higher temperatures for molecules of astrophysical interest, especially for collisions involving para-H2(J=0,2) and ortho-H2.

Earlier developments

Original MOLSCAT v14 serial package

The original MOLSCAT v14 repository is maintained by Jeremy M. Hutson. This v14 package is purely serial.

For reference, you may also download from here the corresponding original Fortran source (gzipped file), the documentation in text format (txt file), or the documentation in HTML format (gzipped tar file).

Early parallel variant at Daresbury Laboratory, UK.

An early parallel version of MOLSCAT, based on the older version 12 of the serial code, was implemented for the Intel iPSC/860 parallel supercomputer at Daresbury Laboratory, UK. It was also used successfully on Cray T3D machines. The User Manual for this implementation is no longer available on the MOLSCAT v14 repository, but a paper copy has been scanned here for reference purposes. This code involved large changes to the MOLSCAT source and was based on the PVM message-passing library, which has since been superseded by MPI. Because of the memory limitations of the iPSC/860 and T3D computers, the storage of the matrix elements of the angular expansion of the potential could optionally be distributed over several processes, which made the code somewhat complicated.

This code is "dead" and is not incorporated in the present parallel version.

"Poor Man Parallel" PMP MOLSCAT MPI variant

In 2005, George C. McBane at Grand Valley State University developed a variant of MOLSCAT v14 for dispatching independent (JTOT, M) calculations over an MPI parallel environment. This PMP code is available from McBane's home page. For reference, you may also download the corresponding PMP manual from here (in pdf format). This MPI code is very useful if your problem takes a long time to run and involves many (JTOT, M) combinations. It is also expected to be very reliable because it involves only minimal changes to the original source.

The PMP code has, however, several limitations.

Developments provided in the present version

The user is expected to be familiar with the original MOLSCAT v14 serial code and to have read the PMP manual in detail. File naming conventions are the same as for PMP MOLSCAT. Input is read from the file molscat.in. If MPI is disabled (see pmpnone.f below), the extension .9999 is used for output files to avoid any confusion with MPI numbering.

Extensions to PMP MOLSCAT

The present code implements several extensions to the original PMP MOLSCAT (refer to the PMP manual for comparison).

Support for OpenMP and "mixed-mode" OpenMP/MPI

A large fraction of the CPU time is spent in the linear algebra routines. Thus an obvious way to accelerate individual (JTOT, M) calculations is to link against multi-threaded LAPACK and BLAS libraries. This also allows the memory requirements to be shared over several CPU cores. However, other parts of the code, notably the generation of the angular matrix elements, limit the parallelization efficiency.

Full OpenMP support is presently provided for ITYPE=1,2,3,4,7 and INTFLG=6,8, and achieves excellent speedup for large problems using 2 to 4 OpenMP threads.

      Example of quasi-linear scalability on IDRIS (IBM zahir) for H2CO-H2,
      using OpenMP and lapack + esslsmp libraries,
      MXLAM=146, N=1032, INTFLG=6, NSTEPS=192 (one JTOT, two symmetries):
      zahir1/vali01.892344: ELAPSED TIME     1635.78 CLK SECS (1 thread)
      zahir2/vali02.892341: ELAPSED TIME      857.54 CLK SECS (2 threads)
      zahir4/vali04.892357: ELAPSED TIME      413.57 CLK SECS (4 threads)
OpenMP multi-threading for individual (JTOT, M) calculations can be freely combined with MPI dispatch of the various (JTOT, M) combinations, resulting in "mixed-mode" OpenMP/MPI runs.
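
For example, a mixed-mode run on 8 MPI processes with 2 OpenMP threads each might be launched along the following lines (the launcher name and its options depend on your MPI installation; the thread count is set through the standard OMP_NUM_THREADS environment variable):

      $ export OMP_NUM_THREADS=2
      $ mpirun -np 8 molscat.x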

More details about multi-threading are given here.

Alternative inversion routine SYMINV

The implementation of the SYMINV routine is important for achieving proper serial or multi-threaded performance. Two alternative routines are provided. The appropriate routine must be selected in the compil* file enabled in your Makefile. Hints for this selection are provided here.
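
For orientation only, here is a minimal free-form sketch of one common LAPACK-based way to invert a real symmetric matrix (a Bunch-Kaufman factorization followed by DSYTRI). It is not necessarily what either of the provided SYMINV routines actually does; it only illustrates why the choice of LAPACK/BLAS library matters for serial and multi-threaded performance:

      ! Sketch only: invert the symmetric matrix A in place with LAPACK.
      ! Only the upper triangle of the inverse is returned.
      subroutine syminv_sketch(a, lda, n, ierr)
        implicit none
        integer, intent(in)             :: lda, n
        integer, intent(out)            :: ierr
        double precision, intent(inout) :: a(lda,n)
        integer          :: ipiv(n)
        double precision :: work(64*n)
        call dsytrf('U', n, a, lda, ipiv, work, 64*n, ierr)   ! factorize
        if (ierr /= 0) return
        call dsytri('U', n, a, lda, ipiv, work, ierr)         ! invert
      end subroutine syminv_sketch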

Faster diagonalisation routines F02ABF and F02AAF

These routines are heavily used in the Airy propagator to compute eigenvalues and eigenvectors (F02ABF) or eigenvalues only (F02AAF).

They belong to the commercial NAG library and are implemented in MOLSCAT using LAPACK. The original replacement routines call DSYEVX, which is now superseded by DSYEVR. The performance gain for F02ABF is very important on some architectures, up to a factor of 6 on a Power4 mainframe, and the parallelization efficiency is also generally improved. These gains apply to the first energy; additional energies (if NNRG > 1) require eigenvalues only, with much less acceleration.

The appropriate routines must be selected in the compil* file enabled in your Makefile. Hints for this selection are provided here.

Note also that a proper implementation of DSYEVR is essential.
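
For illustration, here is a minimal free-form sketch of how a wrapper can call DSYEVR for all eigenvalues and eigenvectors of a symmetric matrix, using the standard workspace query. It only demonstrates the LAPACK interface and is not the F02ABF replacement actually shipped with the code:

      subroutine eig_dsyevr_sketch(a, lda, n, w, z, ldz, ierr)
        implicit none
        integer, intent(in)             :: lda, n, ldz
        integer, intent(out)            :: ierr
        double precision, intent(inout) :: a(lda,n)   ! destroyed on exit
        double precision, intent(out)   :: w(n)       ! eigenvalues
        double precision, intent(out)   :: z(ldz,n)   ! eigenvectors
        integer :: m, lwork, liwork, isuppz(2*n), qiwork(1)
        double precision :: qwork(1)
        integer,          allocatable :: iwork(:)
        double precision, allocatable :: work(:)
        ! workspace query (lwork = liwork = -1)
        call dsyevr('V', 'A', 'U', n, a, lda, 0.d0, 0.d0, 1, n, 0.d0,  &
                    m, w, z, ldz, isuppz, qwork, -1, qiwork, -1, ierr)
        if (ierr /= 0) return
        lwork  = int(qwork(1))
        liwork = qiwork(1)
        allocate(work(lwork), iwork(liwork))
        ! actual diagonalisation: all eigenvalues and eigenvectors
        call dsyevr('V', 'A', 'U', n, a, lda, 0.d0, 0.d0, 1, n, 0.d0,  &
                    m, w, z, ldz, isuppz, work, lwork, iwork, liwork, ierr)
        deallocate(work, iwork)
      end subroutine eig_dsyevr_sketch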

Alternative routines for Wigner 3-j, 6-j and 9-j recoupling coefficients.

For most ITYPE values, and in particular ITYPE=3,4, MOLSCAT uses the routine threej for 3-j coefficients with zero projections, and the routines thrj, sixj and xninej for 3-j, 6-j and 9-j coefficients, respectively.

However, like all similar routines written in floating-point arithmetic, the thrj, sixj and xninej routines suffer from severe numerical stability problems for large quantum numbers (in the range 50-100). As an example, compute with the original MOLSCAT routines the 6-j symbol:

      { J1  56  24 }
      { 35  12  42 }    for J1 = 32 .. 54.
The values obtained vary according to machine, compiler and compiling options and may be wrong by orders of magnitude. The correct value for J1=32 is -2.143216E-007.

Similar problems are encountered with the corresponding Wigner recoupling routines written by B. Follmeg and provided in the popular HIBRIDON package developed by Millard Alexander et al. However, the HIBRIDON recoupling routines are based on pre-computed logarithms of factorials and generally present slightly less severe numerical stability problems.

The code provides several alternatives in the Makefile: the original MOLSCAT recoupling routines, faster replacement routines adapted from HIBRIDON, and quadruple-precision routines for accuracy checks.

As a default choice, we recommend the faster replacement routines adapted from HIBRIDON. If large quantum numbers are involved, the safest choice is to check the accuracy with the quadruple-precision routines. If quadruple precision is not available on your platform (as is the case with Opterons under Solaris), or if it is too slow, you might compare results obtained with the original MOLSCAT recoupling routines and with the HIBRIDON ones. As the algorithms involved are completely different, numerical instabilities will develop differently, and agreement between the cross sections likely indicates that the recoupling coefficients are correct.

MOLSCAT also uses the routines j3j000, j6j and j9j, which provide a vector of results corresponding to all permissible values of J1. These routines use algorithms proposed by Schulten and Gordon (1975) and permit useful optimizations for a few ITYPE cases and for the evaluation of xninej. For the moment, no replacement routines offering better numerical stability or quadruple precision have been implemented for them.

Alleviating the memory requirements

The memory required to store the angular coupling matrix elements may grow very large for some ITYPE cases, as it scales as MXLAM*N^2/2 (in real*8 words), where MXLAM is the number of terms in the angular expansion of the potential and N is the number of channels. In addition, this memory is accessed at each step, generating a significant memory bandwidth bottleneck, especially on multi-core architectures. The required memory can be estimated using storage.x.
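
As a worked example, the H2CO-H2 case quoted above (MXLAM=146, N=1032) already requires about

      146 * 1032^2 / 2  =  7.8e7 real*8 words,  i.e. roughly 0.6 GB.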

It is advisable to limit these requirements, especially in view of running MOLSCAT on architectures with limited core memory, such as the BlueGene/P supercomputer presently installed at IDRIS, or for experimenting with the emerging Cell or NVIDIA GPU coprocessor chips.

An alternative, kindly suggested by Millard Alexander, is to exploit the sparsity of the VL couplings, as is already done within the HIBRIDON package. An OpenMP-aware implementation of this idea is now provided for ITYPE=3,4 and is automatically disabled for other ITYPE values. This packing of the VL array is selected by specifying VLTHRESH in the namelist &INPUT, and is generally recommended as the default choice.
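
Schematically (the threshold value below is purely illustrative, and the usual &INPUT items are omitted), the packed VL storage is requested as

      &INPUT
       VLTHRESH = 1.D-12
      /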

Another alternative is to precalculate and spline-interpolate the W coupling matrix over a distance mesh. The algorithm is presently implemented for ITYPE=3,4 and is automatically disabled for other ITYPE values. It runs in-core if enough storage is available, and automatically switches to out-of-core storage with moderate I/O requirements (~ N^2 per step) when needed. It is selected by specifying the logical unit IRWU in the namelist &INPUT and by defining the corresponding distance mesh in the namelist &MESH. This mesh-based algorithm is very efficient in terms of memory bandwidth, but it may be more memory-demanding than the packed one for in-core operation. It also requires a distance mesh, which may be inconvenient for problems involving very long-range interactions (such as ultra-cold collisions). Its major benefit is that, in out-of-core operation, it eliminates the VL storage completely, which permits calculations with very large N values.
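
Schematically (the unit number below is arbitrary, the usual &INPUT items are omitted, and the distance mesh itself must be defined in the additional &MESH namelist as described in the detailed documentation), the mesh-based algorithm is requested as

      &INPUT
       IRWU = 8
      /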

Support for portability

Memory allocation

Originally, MOLSCAT allocates its workspace statically in the main routine. On some platforms a simple trick allows the workspace to be allocated at runtime instead. This is especially useful on mainframes, to avoid maintaining different executables adapted to various batch queues. The corresponding value (in real*8 words) is then read from the command line, e.g.
      $ molscat.x 10000000 
The required workspace can be estimated using storage.x. More details are given here.
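
A minimal free-form sketch of the idea (not the code actually shipped; the real trick may instead rely on platform-specific extensions such as GETARG):

      program wsalloc_sketch
        implicit none
        double precision, allocatable :: x(:)
        character(len=32) :: arg
        integer :: nwork, istat
        nwork = 10000000                      ! default size, in real*8 words
        call get_command_argument(1, arg, status=istat)
        if (istat == 0) read(arg,*) nwork     ! override from the command line
        allocate(x(nwork))
        ! ... pass x and nwork to the MOLSCAT driver here ...
      end program wsalloc_sketch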

Utility routines

A large set of utility routines is provided to suit most architectures. Also, the original utility routine PARITY has been renamed FARITY to avoid a name conflict in some system libraries. More details are given here.

Makefile templates for various platforms

More details about the presently supported platforms are given here.

Download source code


This web page was originally created by Dr. Pierre Valiron, who passed away in August 2008. It is not currently being updated except for link verification.

Contacting authors:

We acknowledge support from various sources, and in particular: