How To Write (Nearly) Portable Fortran Programs for Parallel Computers and
the MPI on NCX
Abstract:
There are many different supercomputer architectures. How can we make a
program run on different architectures with very high effective speed for
the user?
Based on a Monte Carlo program, we answer this question and discuss how
four principles (coarse parallelism, locality, high BLAS level, and
instruction parallelism) work together to achieve high effective speed and
portability. We also discuss MPI programming with compiler preprocessing.
It has often happened that someone invested a long time in writing a parallel program for a particular computer, only to find that, before the program had been put to much use, either the environment changed or the computer company went out of business. Users therefore dream of writing a program that works on any kind of supercomputer with almost the same high efficiency. That is a portable program.
Good news:
All the computer companies have accepted MPI as the Message Passing
Interface standard.
Today, portable programming is no longer just a dream; it is possible.
Portability
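      program main
      include 'mpif.h'
      double precision pi, mypi, h, sum, x, f, a
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      integer n, myid, numprocs, i, ierr
c     function to integrate: the integral of f over [0,1] is pi
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      n = numprocs*1000000
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      h = 1.0d0 / n
      sum = 0.0d0
c     each process takes every numprocs-th interval, starting at myid+1
      do i = myid + 1, n, numprocs
         x = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
      enddo
      mypi = h * sum
c     add the partial results of all processes on process 0
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
      endif
 97   format('  pi is approximately: ', F18.16,
     &       '  Error is: ', F18.16)
      call MPI_FINALIZE(ierr)
      end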
The same program runs on every processor; each process uses its rank "myid" to select a different part of the integration interval.
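To make the role of "myid" concrete, the following free-form Fortran sketch reproduces the same cyclic split without MPI: each call to partial_pi computes what one process would compute, and total_pi adds the pieces the way MPI_REDUCE with MPI_SUM would. The module and function names (pi_split, partial_pi, total_pi) are illustrative assumptions, not part of the original program.

```fortran
module pi_split
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Contribution of one rank: midpoint rule on the intervals
  ! i = myid+1, myid+1+numprocs, ... (the same cyclic split as the
  ! MPI program in the text).
  function partial_pi(n, myid, numprocs) result(mypi)
    integer, intent(in) :: n, myid, numprocs
    real(dp) :: mypi, h, x, s
    integer :: i
    h = 1.0_dp / n
    s = 0.0_dp
    do i = myid + 1, n, numprocs
      x = h * (real(i, dp) - 0.5_dp)
      s = s + 4.0_dp / (1.0_dp + x*x)
    end do
    mypi = h * s
  end function partial_pi

  ! Serial stand-in (assumption) for MPI_REDUCE with MPI_SUM:
  ! add up the contributions of ranks 0 .. numprocs-1.
  function total_pi(n, numprocs) result(pi)
    integer, intent(in) :: n, numprocs
    real(dp) :: pi
    integer :: myid
    pi = 0.0_dp
    do myid = 0, numprocs - 1
      pi = pi + partial_pi(n, myid, numprocs)
    end do
  end function total_pi
end module pi_split
```

Because every interval index belongs to exactly one rank, total_pi(n, 4) and total_pi(n, 1) agree up to rounding; only the order of the additions differs.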

On NCX, the f77, f90, mpif77, and mpif90 commands pass files ending in the .F extension (but not those ending in .f or .f90) through the C preprocessor as long as the +ccp option is left at its default or set to yes; that is, if +ccp = default or +ccp = yes. In contrast, specifying +ccp = no tells the compiler not to invoke the C preprocessor for any file on the command line, including those ending in .F.
Example:
#ifdef MPI
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
#endif
#ifdef Single
      myid=0
      numprocs=1
#endif
Compile the parallel or the single-processor version by defining the matching macro:
mpif77 -DMPI +O2 ....
f77 -DSingle +O2 ....

That means decomposing a large array into a number of sub-arrays, each located on a different processor.
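One common way to do such a decomposition is to give each process a contiguous block of the global index range. The free-form Fortran sketch below computes the block owned by a given rank; the names block_decomp, block_lo, and block_hi are hypothetical, and the rule that the first mod(n, numprocs) ranks receive one extra element is an assumption of this sketch, not something stated in the source.

```fortran
module block_decomp
  implicit none
contains
  ! First global index (1-based) owned by rank myid (0-based) when
  ! n elements are split into numprocs contiguous blocks. The first
  ! mod(n, numprocs) ranks get one extra element.
  integer function block_lo(n, numprocs, myid)
    integer, intent(in) :: n, numprocs, myid
    block_lo = myid*(n/numprocs) + min(myid, mod(n, numprocs)) + 1
  end function block_lo

  ! Last global index owned by rank myid: one before the start of
  ! the next rank's block.
  integer function block_hi(n, numprocs, myid)
    integer, intent(in) :: n, numprocs, myid
    block_hi = block_lo(n, numprocs, myid + 1) - 1
  end function block_hi
end module block_decomp
```

Each process would then allocate only block_hi - block_lo + 1 elements and work on its own sub-array, so the full array never has to fit in the memory of a single processor.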

The VECLIB test with a 3000 x 3000 matrix
Table 3: BLAS level and cache miss latency
The QCD test (1,000,000 x 1,000,000 sparse matrix):

Principle 2: Use high-BLAS-level algorithms and decompose large arrays.
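The reason a high BLAS level helps is that a matrix-matrix operation does on the order of n^3 arithmetic operations on n^2 data, so data can be reused while it is still in cache. The free-form Fortran sketch below illustrates that reuse with a tiled matrix multiply. It is a minimal illustration, not the routine used in the tests above: the names blocked_mm, blocked_matmul, and max_diff_vs_matmul are hypothetical, and the tile size nb is an untuned assumption. In production code one would more likely call a tuned library routine (DGEMM is the standard BLAS name for this operation).

```fortran
module blocked_mm
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! C = A*B for n x n matrices, processed in nb x nb tiles so that
  ! each tile of A, B and C is reused while it is still in cache.
  subroutine blocked_matmul(a, b, c, n, nb)
    integer, intent(in) :: n, nb
    real(dp), intent(in) :: a(n,n), b(n,n)
    real(dp), intent(out) :: c(n,n)
    integer :: ii, jj, kk, i, j, k
    c = 0.0_dp
    do jj = 1, n, nb
      do kk = 1, n, nb
        do ii = 1, n, nb
          ! Multiply one tile of A by one tile of B.
          do j = jj, min(jj + nb - 1, n)
            do k = kk, min(kk + nb - 1, n)
              do i = ii, min(ii + nb - 1, n)
                c(i,j) = c(i,j) + a(i,k)*b(k,j)
              end do
            end do
          end do
        end do
      end do
    end do
  end subroutine blocked_matmul

  ! Helper for checking: largest difference between the tiled result
  ! and the intrinsic matmul on random n x n matrices.
  function max_diff_vs_matmul(n, nb) result(d)
    integer, intent(in) :: n, nb
    real(dp) :: d
    real(dp) :: a(n,n), b(n,n), c(n,n)
    call random_number(a)
    call random_number(b)
    call blocked_matmul(a, b, c, n, nb)
    d = maxval(abs(c - matmul(a, b)))
  end function max_diff_vs_matmul
end module blocked_mm
```

The innermost loop runs over the first array index, which is the contiguous direction in Fortran, so each tile is traversed with unit stride.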


where F is the clock frequency and m, called the dimension, is the number of functional units inside the processor.
Loop unrolling
Unroll the most important loops as finely as possible by hand.
Example from the QCD program: SU(3)
SU(3) routine
      COMPLEX w(3,3,N), u(3,3,N), v(3,3,N)
c     for each of the N sites: w(:,:,I) = u(:,:,I) * v(:,:,I)
      w=0.0d00
      DO I=1,N
        DO J1=1,3
          DO J2=1,3
            DO J3=1,3
              w(J2,J1,I)=w(J2,J1,I)+u(J2,J3,I)*v(J3,J1,I)
            ENDDO
          ENDDO
        ENDDO
      ENDDO
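As an illustration of unrolling by hand, the free-form Fortran sketch below writes out the J2 and J3 loops of the SU(3) routine above, so that each column of w becomes three independent three-term sums that the processor's functional units can work on in parallel. This is one possible unrolling, not necessarily the one used in the original program; the names su3_unroll, su3_loops, su3_unrolled, and su3_max_diff are hypothetical.

```fortran
module su3_unroll
  implicit none
contains
  ! Reference: the triple loop from the text.
  subroutine su3_loops(w, u, v, n)
    integer, intent(in) :: n
    complex, intent(in) :: u(3,3,n), v(3,3,n)
    complex, intent(out) :: w(3,3,n)
    integer :: i, j1, j2, j3
    w = (0.0, 0.0)
    do i = 1, n
      do j1 = 1, 3
        do j2 = 1, 3
          do j3 = 1, 3
            w(j2,j1,i) = w(j2,j1,i) + u(j2,j3,i)*v(j3,j1,i)
          end do
        end do
      end do
    end do
  end subroutine su3_loops

  ! Hand-unrolled: the j2 and j3 loops are written out explicitly.
  subroutine su3_unrolled(w, u, v, n)
    integer, intent(in) :: n
    complex, intent(in) :: u(3,3,n), v(3,3,n)
    complex, intent(out) :: w(3,3,n)
    integer :: i, j1
    do i = 1, n
      do j1 = 1, 3
        w(1,j1,i) = u(1,1,i)*v(1,j1,i) + u(1,2,i)*v(2,j1,i) + u(1,3,i)*v(3,j1,i)
        w(2,j1,i) = u(2,1,i)*v(1,j1,i) + u(2,2,i)*v(2,j1,i) + u(2,3,i)*v(3,j1,i)
        w(3,j1,i) = u(3,1,i)*v(1,j1,i) + u(3,2,i)*v(2,j1,i) + u(3,3,i)*v(3,j1,i)
      end do
    end do
  end subroutine su3_unrolled

  ! Helper for checking: largest difference between the two versions
  ! on random input.
  function su3_max_diff(n) result(d)
    integer, intent(in) :: n
    real :: d
    real :: re(3,3,n), im(3,3,n)
    complex :: u(3,3,n), v(3,3,n), w1(3,3,n), w2(3,3,n)
    call random_number(re)
    call random_number(im)
    u = cmplx(re, im)
    call random_number(re)
    call random_number(im)
    v = cmplx(re, im)
    call su3_loops(w1, u, v, n)
    call su3_unrolled(w2, u, v, n)
    d = maxval(abs(w1 - w2))
  end function su3_max_diff
end module su3_unroll
```

The unrolled version also needs no separate initialization of w, because every element is assigned exactly once.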
REAL routine
