gfxgfx
 
Please login or register.

Login with username, password and session length
 
gfx gfx
gfx
1565 Posts in 517 Topics by 752 Members - Latest Member: zaikunzhang September 25, 2021, 05:14:02 PM
*
gfx* Home | Help | Search | Login | Register | gfx
gfx
Absoft User Forum  |  Support  |  Windows  |  strange behavior of optimization
gfx
gfxgfx
 

Author Topic: strange behavior of optimization  (Read 3991 times)

BDick

  • Newbie
  • *
  • Posts: 4
strange behavior of optimization
« on: August 18, 2016, 06:06:52 PM »
I was curious how to find the optimal way of multiplying two matrices, C = A*B. I wrote a program that does it in 5 different ways:
a) a triple nested loop in the main program,
b) the same code as a subroutine
c) the matmul function in F95
d) the DGEMM library routine from BLAS
e) my own compilation of the DGEMM source (F77 code!)

I tried optimization levels O3, O4, and O5, combined with Math=0, 3, 6, and 10. Here is the timing I get (for N=1000) with OMP_GET_WTIME() and the opemMP-option switched on in Project options:

   Explicit loop   My subroutine   matmul   DGEMM-library   DGEMM compiled
O3, Math=0   0.61   1.44   1.38   2.98   2.98
O4, Math=0   0.57   1.46   1.40   2.97   2.96
O5, Math=0   0.60   0.49   0.49   3.12   3.13
O3, Math=3   0.60   1.47   1.20   3.03   0.93
O4, Math=3   0.61   1.45   1.21   2.99   0.94
O5, Math=3   0.46   0.43   0.43   3.11   1.00
O3, Math=6   0.60   1.45   1.20   3.00   0.94
O4, Math=6   0.59   1.46   1.24   3.03   0.94
O5, Math=6   0.46   0.42   0.51   3.10   0.97
O3, Math=10   1.08   1.37   1.11   2.96   0.93
O4, Math=10   1.12   1.39   1.00   2.97   0.93
O5, Math=10   0.46   0.42   0.43   3.12   0.98

These are the strange obsevations:
a)   „explicit loop“ and „My subroutine“ are exactly the same code. Nevertheless, “explicit loop” is not affected by any optimization settings, except that it gets slow with Math=10 and O3 or O4. The subroutine is always three times slower than the explicit code, except with O5.
b)   The FORTRAN function “matmul” performs much slower than the explicit loop, except for O5.
c)   The BLAS-library routine is not affected by any optimization settings, this is expected since the code is fixed.
d)   The DGEMM routine compiled with the compiler settings is not affected by O3, O4, O5 when Math=0, and performs like BLAS with O3. This is strange since “My subroutine” is twice as fast with O3, O4, and six times faster with O5! Once Math>=3 is set, DGEMM compiled is always the same speed, regardless of the setting of O3, O4, O5.
e)   The BLAS routine is supposed to be the fastest option for basic linear algebra, so it is strange that a simple straightforward nested loop should be twice as fast!

Perhaps I am doing something wrong? I am familiar with F77, but new to F95. Here is my program:

  Program TestMM
  implicit none
! test timing for matrix multiplication
  real*8, allocatable :: A(:,:),B(:,:),C(:,:)
  real*8 :: RI,RJ,CJK
  real*8, parameter :: zero=0.0D0, one=1.0D0
  real :: Tim1,Tim2
  integer N,I,J,K,L,NCYCLE
!
! input matrix dimension

  write (*,*) ' matrix dimension'
  read (*,*) N
  NCYCLE = 1
  allocate (A(N,N))
  allocate (B(N,N))
  allocate (C(N,N))
  do I = 1,N
    RI = DFLOAT(I)
    DO J = 1,N
      RJ = DFLOAT(J)
      A(I,J) = RI*RJ
      B(I,J) = A(I,J)
    ENDDO
  ENDDO
  write (*,*) ' finished setting up matrices'

! multiply explicitely with nested loops
!
  Tim1 = OMP_GET_WTIME()
  DO I = 1,NCYCLE
    DO J = 1,N
      DO K = 1,N
        CJK = 0.0D0
        DO L = 1,N
          CJK = CJK + A(J,L) * B(L,K)
        ENDDO
        C(J,K) = CJK
      ENDDO
    ENDDO
    write (*,*) C(I,I)
  ENDDO
  Tim2 = OMP_GET_WTIME()
  write (*,*) ' timing explicit loops            ',Tim2-Tim1 

! use same code, but now in subroutine
!
  Tim1 = OMP_GET_WTIME()
  DO I = 1,NCYCLE
    CALL BDMAMU (N,A,B,C)
    write (*,*) C(I,I)
  ENDDO
  Tim2 = OMP_GET_WTIME()
  write (*,*) ' subroutine with nested loops     ',Tim2-Tim1

! multiply with fortran matmul routine
!
  Tim1 = OMP_GET_WTIME()
  DO I = 1,NCYCLE
    C = matmul(A,B)
    write (*,*) C(I,I)
  ENDDO
  Tim2 = OMP_GET_WTIME()
  write (*,*) ' timing matmul-routine            ',Tim2-Tim1
!
! multiply with DGEMM from BLAS
!
  Tim1 = OMP_GET_WTIME()
  DO I = 1,NCYCLE
    CALL DGEMM('N','N',N,N,N,one,A,N,B,N,zero,C,N)
    write (*,*) C(I,I)
  ENDDO
  Tim2 = OMP_GET_WTIME()
  write (*,*) ' timing DGEMM                     ',Tim2-Tim1 
!
! multiply with DGEMM compiled (renamed as BDMM)
!
  Tim1 = OMP_GET_WTIME()
  DO I = 1,NCYCLE
    CALL BDMM('N','N',N,N,N,one,A,N,B,N,zero,C,N)
    write (*,*) C(I,I)
  ENDDO
  Tim2 = OMP_GET_WTIME()
  write (*,*) ' timing DGEMM compiled            ',Tim2-Tim1 
  STOP
END
!
!  my own subroutine for matrix multiplication
!
SUBROUTINE BDMAMU (N,A,B,C)
implicit none
REAL*8 :: A(N,N),B(N,N),C(N,N)
REAL*8 :: CJK
INTEGER :: J,K,L,N
    DO J = 1,N
      DO K = 1,N
        CJK = 0.0D0
        DO L = 1,N
          CJK = CJK + A(J,L) * B(L,K)
        ENDDO
        C(J,K) = CJK
      ENDDO
    ENDDO
RETURN
END  SUBROUTINE BDMAMU

there is a second file that contains the source of DGEMM, only the subroutine was renamed to BDMM




Absoft User Forum  |  Support  |  Windows  |  strange behavior of optimization
 

gfxgfx
gfx gfx
Powered by MySQL Powered by PHP Valid XHTML 1.0! Valid CSS!