Absoft User Forum  |  Support  |  General  |  autovectorization - which idioms are preferred?

Author Topic: autovectorization - which idioms are preferred?  (Read 5162 times)

sturlamolden

  • Newbie
  • *
  • Posts: 10
autovectorization - which idioms are preferred?
« on: July 22, 2010, 10:29:57 AM »

Inside the Cholesky factorization there is a Gaxpy update (e.g. algorithm 4.2.1 in Golub and Van Loan's textbook Matrix Computations). Setting aside LAPACK and level-2 BLAS libraries, which of these three idioms is preferred for Absoft's Fortran 95 compiler?

Assume we have an array A(n,n):

Alternative 1: matmul intrinsic

do j = 1,n
    if (j > 1) then
        ! transpose() accepts only rank-2 arrays, so the row A(j,1:j-1)
        ! is passed to matmul directly as a rank-1 vector
        A(j:n,j) = A(j:n,j) - matmul(A(j:n,1:j-1), A(j,1:j-1))
    end if
    [....]
end do


Alternative 2: nested do-loops

do j = 1,n
    if (j > 1) then
        do k = j,n
            do i = 1,j-1
                A(k,j) = A(k,j) - A(k,i)*A(j,i)
            end do
        end do
    end if
    [...]
end do


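Alternative 3: forall

do j = 1,n
    if (j > 1) then
        ! forall may not assign the same element more than once, so the
        ! sum over i is written with dot_product instead of a second index
        forall (k = j:n)
            A(k,j) = A(k,j) - dot_product(A(k,1:j-1), A(j,1:j-1))
        end forall
    end if
end do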


The Gaxpy update is the body of the if (j > 1) block in each alternative (highlighted in yellow in the original post). Which of these is the easiest for the compiler to optimize?

More questions:

  • Does the autovectorizer do any form of load balancing?
  • Will the autovectorizer use threads, or should I put in OpenMP directives?
  • Is there any way of controlling alignment so that it is efficient for SSE3 (e.g. 16-byte boundaries)? And how could I inform the compiler about this? In C there would be e.g. __declspec(align(16)). Or is the compiler smart enough to do this on its own?
  • Does the autovectorizer know that multithreading is not preferred if n is small?
  • Will threads be spawned or can OpenMP or the autovectorizer maintain a thread pool?
  • Can I make the autovectorizer verbose?


P.S. Yes, I know about the optimized LAPACK dpotrf and level-2 BLAS dgemv in e.g. Intel MKL or GotoBLAS. I am interested in knowing what kind of coding style is preferred for the Absoft Pro Fortran compiler (64-bit Windows in particular).


Regards,
Sturla Molden
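
Editor's sketch (not part of the original post): one way to compare the three idioms is to run them side by side on the same data and check that they agree. The program name, the test matrix, the size n = 6, and the tolerance are all assumptions made for illustration, and only the Gaxpy update is exercised, not the step elided as [...] in the post. Two details matter for standard Fortran: matmul takes the row A(j,1:j-1) as a rank-1 vector (transpose only accepts rank-2 arrays), and the forall runs over k alone with dot_product, because a forall statement cannot accumulate into the same element.

```fortran
program gaxpy_idioms
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 6              ! hypothetical small test size
    real(dp), parameter :: tol = 1.0e-12_dp  ! hypothetical agreement tolerance
    real(dp) :: A0(n,n), A1(n,n), A2(n,n), A3(n,n)
    integer :: i, j, k

    ! Arbitrary test data, chosen only so that every entry is nonzero
    do j = 1, n
        do i = 1, n
            A0(i,j) = 1.0_dp / real(i + j, dp)
        end do
    end do
    A1 = A0; A2 = A0; A3 = A0

    do j = 2, n
        ! Idiom 1: matmul, with the row passed as a rank-1 vector
        A1(j:n,j) = A1(j:n,j) - matmul(A1(j:n,1:j-1), A1(j,1:j-1))

        ! Idiom 2: nested do-loops
        do k = j, n
            do i = 1, j-1
                A2(k,j) = A2(k,j) - A2(k,i)*A2(j,i)
            end do
        end do

        ! Idiom 3: forall over k only; the sum over i is a dot_product
        forall (k = j:n)
            A3(k,j) = A3(k,j) - dot_product(A3(k,1:j-1), A3(j,1:j-1))
        end forall
    end do

    print *, 'max |matmul - loops| =', maxval(abs(A1 - A2))
    print *, 'max |forall - loops| =', maxval(abs(A3 - A2))

    ! Exit with a nonzero status if the idioms disagree beyond rounding
    if (maxval(abs(A1 - A2)) > tol .or. maxval(abs(A3 - A2)) > tol) stop 1
end program gaxpy_idioms
```

The three results can differ in the last few bits because the summation order is not guaranteed to be the same, which is why the check uses a tolerance rather than exact equality.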








forumadmin

  • Administrator
  • Sr. Member
  • *****
  • Posts: 333
Re: autovectorization - which idioms are preferred?
« Reply #1 on: July 23, 2010, 09:35:35 AM »
Example 1 uses runtime libraries that are pre-optimized. Examples 2 and 3 are basically the same.

1. Auto-vectorization doesn't need load balancing; auto-parallelization doesn't do load balancing.

2. The auto-vectorizer uses SIMD instructions. The auto-parallelizer uses threads and does not need OpenMP directives.

3. Absoft compilers promote data alignment to 16-byte boundaries automatically (required to use SSE3 instructions).

4. The auto-vectorizer doesn't vectorize the code if n is small; the auto-parallelizer will execute the serial copy instead of the parallelized copy if n is small.

5. The auto-parallelizer maintains a thread pool to save the cost of spawning threads.

6. Yes: -LNO:verbose=on


V11.1 will contain an SMP analysis tool to help with vectorization and parallelization issues, including OpenMP. There have been some improvements to the auto-parallelizer. It is in beta now. Contact beta at absoft dot com if you are interested.
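
Editor's sketch (not part of the original reply): for comparison with answers 2 and 4, this is roughly how the same behavior could be requested by hand with OpenMP directives instead of relying on the auto-parallelizer. The rows k = j..n of column j are independent, so the k loop can be shared among threads, and the standard OpenMP if clause keeps the loop serial when few rows remain. The program name, the problem size n, and the threshold min_rows are hypothetical and would need tuning; nothing here describes what the Absoft auto-parallelizer does internally.

```fortran
program gaxpy_openmp
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 200         ! hypothetical problem size
    integer, parameter :: min_rows = 64   ! hypothetical threshold, tune per machine
    real(dp), allocatable :: A(:,:)
    real(dp) :: s
    integer :: i, j, k

    allocate(A(n,n))
    ! Arbitrary test data
    do j = 1, n
        do i = 1, n
            A(i,j) = 1.0_dp / real(i + j, dp)
        end do
    end do

    do j = 2, n
        ! Each thread writes only its own A(k,j) and reads columns 1..j-1,
        ! which are not modified at step j, so the iterations do not conflict.
        ! The if clause runs the loop serially when fewer than min_rows remain.
        !$omp parallel do private(i, s) if (n - j + 1 >= min_rows)
        do k = j, n
            s = 0.0_dp
            do i = 1, j-1
                s = s + A(k,i)*A(j,i)
            end do
            A(k,j) = A(k,j) - s
        end do
        !$omp end parallel do
    end do

    print *, 'A(n,n) after the updates =', A(n,n)
end program gaxpy_openmp
```

Without an OpenMP option on the compiler command line the directives are treated as ordinary comments, so the same source also builds and runs serially.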

sturlamolden

  • Newbie
  • *
  • Posts: 10
Re: autovectorization - which idioms are preferred?
« Reply #2 on: July 25, 2010, 07:36:01 AM »
Thank you for the answer  :)
