[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: A bug and possible fix in atlas3.3.0
Shi,
> The problem was traced to:
> ATLAS/src/blas/level3/kernel/ATL_CtrsmKL.c
AAAARGH! Now we see the difference between a developer release and the
real thing ;>
Thank you very much for catching this; the p0 that is causing the error and
puzzling you is a manual prefetch of the next iteration's data. I just added
these routines in the developer release, and I now remember making a fatal
mistake: with the manual prefetch, you must stop your iteration 1 cycle before
the end, and then finish the last cycle by hand without prefetch. However,
the additional memory ref. will not usually (as we have seen) cause probs,
so I decided to develop the things with the full N-loop, and once everything
was debugged, go back in and unroll the last iteration once I was sure it
worked. Guess which step I forgot . . .
Anyway, the fix you suggest of moving the memory reference to this iteration
will also work, but it may cost you in performance . . .
I may take a bit to get the next developer release out; right now I have
support for user contribution of about 1/2 of the level 1 BLAS. I'd like
to finish that before issuing the next dev release . . .
Thanks,
Clint
From shi@shi.phy.okstate.edu Sun Jun 3 15:04:07 2001
Return-Path: <shi@shi.phy.okstate.edu>
Received: from shi.phy.okstate.edu (marvin@localhost)
by cs.utk.edu with ESMTP (cf v2.9s-UTK)
id PAA15339; Sun, 3 Jun 2001 15:04:05 -0400 (EDT)
Received: from shi.phy.okstate.edu (139.78.124.240 -> shi.phy.okstate.edu)
by cs.utk.edu (smtpshim v1.0); Sun, 3 Jun 2001 15:04:05 -0400
Received: from localhost (shi@localhost)
by shi.phy.okstate.edu (8.11.0/8.11.0) with ESMTP id f53J43627509
for <atlas@cs.utk.edu>; Sun, 3 Jun 2001 14:04:03 -0500
Date: Sun, 3 Jun 2001 14:04:03 -0500 (CDT)
From: Junren Shi <shi@shi.phy.okstate.edu>
X-X-Sender: <shi@xie0>
To: <atlas@cs.utk.edu>
Subject: A bug and possible fix in atlas3.3.0
Message-ID: <Pine.LNX.4.33.0106031328580.27488-100000@xie0>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: R
Hi,
I found a bug in atlas3.3.0. At certain situation, it causes
segment fault. In my case, I built a dynimic library for atlas3.3.0 for
Linux_P4SSE2, and link it with matlab 6.0. Under matlab:
>> A \ B; (A and B are complex, matlab will call zgesv)
will cause segment fault. The dimensions of the A and B which cause the
error is not repeatable. Sometimes working, sometimes not.
The problem was traced to:
ATLAS/src/blas/level3/kernel/ATL_CtrsmKL.c
In functions: trsm_LU2, trsm_LU3, trsm_LU4.
For instance, for trsm_LU2, the codes read:
TYPE *bn=B+ldb2;
int j;
p0 = B[2];
for (j=N; j; j--)
{
xr2 = p0 ; xi2 = B[3];
xr1 = *B ; xi1 = B[1];
t0 = xr2;
xr2 = ar22*xr2 - ai22*xi2;
xi2 = ar22*xi2 + ai22*t0; p0 = bn[2];
...
B = bn;
bn += ldb2;
}
apparently, p0 = bn[2] will make a invalid memory reference at the
final iteration.
I changed trsm_LU2 as following, the similar changes are also in
in trsm_LU3 and trsm_LU4. Basicly it changes for(j=N; j; ...) to
for(j=N-1; j; ...). And manually adds the codes for Nth interation. It
seems it is working now.
By the way, I don't really undestand the codes. It seems to me that the
codes could become simpler if written as:
for(j=N; j; ...){
p0 = B[2];
....;
xi2 = ar22*xi2 + ai22*t0;
...;
}
Am I correct? Or there is some special consideration here?
Thanks
Shi
P.S. Atlas3.3.0 is much faster than intel MKL in P4 now. The double
precision gemm can reach 1862 Mflops. Good job!
------------------------------------------------------------------------
static void trsmLU_2(const int N, const TYPE *A, TYPE *B, const int ldb)
/*
* 'Left, 'Upper', with 1 col prefetch, written with all dependencies
shown,
* so that compiler can optimize.
* A is known to be 2x2, with 1/alpha already applied, diagonals already
* inverted
*/
{
const TYPE ar11=*A, ai11=A[1], ar12=A[4], ai12=A[5];
const TYPE ar22=A[6], ai22=A[7];
TYPE xr1, xi1, xr2, xi2;
TYPE t0, p0;
const int ldb2=ldb+ldb;
TYPE *bn=B+ldb2;
int j;
p0 = B[2];
for (j=N-1; j; j--)
{
xr2 = p0 ; xi2 = B[3];
xr1 = *B ; xi1 = B[1];
t0 = xr2;
xr2 = ar22*xr2 - ai22*xi2;
xi2 = ar22*xi2 + ai22*t0; p0 = bn[2];
xr1 -= ar12*xr2 - ai12*xi2;
xi1 -= ar12*xi2 + ai12*xr2;
t0 = xr1;
xr1 = ar11*xr1 - ai11*xi1;
xi1 = ar11*xi1 + ai11*t0;
*B = xr1; B[1] = xi1;
B[2] = xr2; B[3] = xi2;
B = bn;
bn += ldb2;
}
xr2 = p0 ; xi2 = B[3];
xr1 = *B ; xi1 = B[1];
t0 = xr2;
xr2 = ar22*xr2 - ai22*xi2;
xi2 = ar22*xi2 + ai22*t0;
xr1 -= ar12*xr2 - ai12*xi2;
xi1 -= ar12*xi2 + ai12*xr2;
t0 = xr1;
xr1 = ar11*xr1 - ai11*xi1;
xi1 = ar11*xi1 + ai11*t0;
*B = xr1; B[1] = xi1;
B[2] = xr2; B[3] = xi2;
B = bn;
bn += ldb2;
}