I was curious how to find the optimal way of multiplying two matrices, C = A*B. I wrote a program that does it in 5 different ways:
a) a triple nested loop in the main program,
b) the same code as a subroutine
c) the matmul function in F95
d) the DGEMM library routine from BLAS
e) my own compilation of the DGEMM source (F77 code!)
I tried optimization levels O3, O4, and O5, combined with Math=0, 3, 6, and 10. Here is the timing I get (for N=1000) with OMP_GET_WTIME() and the opemMP-option switched on in Project options:
Explicit loop My subroutine matmul DGEMM-library DGEMM compiled
O3, Math=0 0.61 1.44 1.38 2.98 2.98
O4, Math=0 0.57 1.46 1.40 2.97 2.96
O5, Math=0 0.60 0.49 0.49 3.12 3.13
O3, Math=3 0.60 1.47 1.20 3.03 0.93
O4, Math=3 0.61 1.45 1.21 2.99 0.94
O5, Math=3 0.46 0.43 0.43 3.11 1.00
O3, Math=6 0.60 1.45 1.20 3.00 0.94
O4, Math=6 0.59 1.46 1.24 3.03 0.94
O5, Math=6 0.46 0.42 0.51 3.10 0.97
O3, Math=10 1.08 1.37 1.11 2.96 0.93
O4, Math=10 1.12 1.39 1.00 2.97 0.93
O5, Math=10 0.46 0.42 0.43 3.12 0.98
These are the strange obsevations:
a) „explicit loop“ and „My subroutine“ are exactly the same code. Nevertheless, “explicit loop” is not affected by any optimization settings, except that it gets slow with Math=10 and O3 or O4. The subroutine is always three times slower than the explicit code, except with O5.
b) The FORTRAN function “matmul” performs much slower than the explicit loop, except for O5.
c) The BLAS-library routine is not affected by any optimization settings, this is expected since the code is fixed.
d) The DGEMM routine compiled with the compiler settings is not affected by O3, O4, O5 when Math=0, and performs like BLAS with O3. This is strange since “My subroutine” is twice as fast with O3, O4, and six times faster with O5! Once Math>=3 is set, DGEMM compiled is always the same speed, regardless of the setting of O3, O4, O5.
e) The BLAS routine is supposed to be the fastest option for basic linear algebra, so it is strange that a simple straightforward nested loop should be twice as fast!
Perhaps I am doing something wrong? I am familiar with F77, but new to F95. Here is my program:
Program TestMM
implicit none
! test timing for matrix multiplication
real*8, allocatable :: A(:,:),B(:,:),C(:,:)
real*8 :: RI,RJ,CJK
real*8, parameter :: zero=0.0D0, one=1.0D0
real :: Tim1,Tim2
integer N,I,J,K,L,NCYCLE
!
! input matrix dimension
!
write (*,*) ' matrix dimension'
read (*,*) N
NCYCLE = 1
allocate (A(N,N))
allocate (B(N,N))
allocate (C(N,N))
do I = 1,N
RI = DFLOAT(I)
DO J = 1,N
RJ = DFLOAT(J)
A(I,J) = RI*RJ
B(I,J) = A(I,J)
ENDDO
ENDDO
write (*,*) ' finished setting up matrices'
!
! multiply explicitely with nested loops
!
Tim1 = OMP_GET_WTIME()
DO I = 1,NCYCLE
DO J = 1,N
DO K = 1,N
CJK = 0.0D0
DO L = 1,N
CJK = CJK + A(J,L) * B(L,K)
ENDDO
C(J,K) = CJK
ENDDO
ENDDO
write (*,*) C(I,I)
ENDDO
Tim2 = OMP_GET_WTIME()
write (*,*) ' timing explicit loops ',Tim2-Tim1
!
! use same code, but now in subroutine
!
Tim1 = OMP_GET_WTIME()
DO I = 1,NCYCLE
CALL BDMAMU (N,A,B,C)
write (*,*) C(I,I)
ENDDO
Tim2 = OMP_GET_WTIME()
write (*,*) ' subroutine with nested loops ',Tim2-Tim1
!
! multiply with fortran matmul routine
!
Tim1 = OMP_GET_WTIME()
DO I = 1,NCYCLE
C = matmul(A,B)
write (*,*) C(I,I)
ENDDO
Tim2 = OMP_GET_WTIME()
write (*,*) ' timing matmul-routine ',Tim2-Tim1
!
! multiply with DGEMM from BLAS
!
Tim1 = OMP_GET_WTIME()
DO I = 1,NCYCLE
CALL DGEMM('N','N',N,N,N,one,A,N,B,N,zero,C,N)
write (*,*) C(I,I)
ENDDO
Tim2 = OMP_GET_WTIME()
write (*,*) ' timing DGEMM ',Tim2-Tim1
!
! multiply with DGEMM compiled (renamed as BDMM)
!
Tim1 = OMP_GET_WTIME()
DO I = 1,NCYCLE
CALL BDMM('N','N',N,N,N,one,A,N,B,N,zero,C,N)
write (*,*) C(I,I)
ENDDO
Tim2 = OMP_GET_WTIME()
write (*,*) ' timing DGEMM compiled ',Tim2-Tim1
STOP
END
!
! my own subroutine for matrix multiplication
!
SUBROUTINE BDMAMU (N,A,B,C)
implicit none
REAL*8 :: A(N,N),B(N,N),C(N,N)
REAL*8 :: CJK
INTEGER :: J,K,L,N
DO J = 1,N
DO K = 1,N
CJK = 0.0D0
DO L = 1,N
CJK = CJK + A(J,L) * B(L,K)
ENDDO
C(J,K) = CJK
ENDDO
ENDDO
RETURN
END SUBROUTINE BDMAMU
there is a second file that contains the source of DGEMM, only the subroutine was renamed to BDMM