Re: error in M cleanup
Clint-
My apologies. I had been using lowercase m= and n= in my own tester all
along! I don't know what I had actually been testing, but in any case the
files are now patched in my home directory and pass my amended tester,
which I'm including below this time for your comments (I think I should
also include compile-time MB and NB in this tester, for example):
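#!/bin/bash
# Usage: pass one precision (s, d, c or z) to test only that one;
# with no argument, all four are tested.
for i in s d c z ; do
  if [ "$1" = "" ] || [ "$1" = "$i" ] ; then
    # The complex precisions use the cmmutstcase make target.
    if [ $i = c ] || [ $i = z ] ; then
      pref=c;
    else
      pref="" ;
    fi;
    # Baseline run with kb=56 and default mb/nb.
    echo -n $i $(make ${pref}mmutstcase pre=$i kb=56 mmrout=../CASES/ATL_gemm_SSE.c 2>&1 | grep PASS) " ";
    # Probe with mb=0; if that passes, the sweep below uses mb=0 and
    # supplies M= instead of mb=.
    mrun="mb="
    if make ${pref}mmutstcase pre=$i kb=56 mb=0 m=57 mmrout=../CASES/ATL_gemm_SSE.c 2>&1 | grep PASS >/dev/null ; then
      mrun="mb=0 M="
      echo -n "mr "
    fi;
    # Same probe for the N dimension.
    nrun="nb="
    if make ${pref}mmutstcase pre=$i kb=56 nb=0 n=57 mmrout=../CASES/ATL_gemm_SSE.c 2>&1 | grep PASS >/dev/null ; then
      nrun="nb=0 N="
      echo -n "nr "
    fi;
    # Sweep kb = 4, 8, ..., 80; for each kb try every nb and mb in kb..kb+4.
    kb=4;
    while [ $kb -le 80 ] ; do
      nb=$kb;
      nbe=$(($kb+4))
      while [ $nb -le $nbe ] ; do
        mb=$kb;
        mbe=$(($kb+4))
        while [ $mb -le $mbe ] ; do
          if ! make ${pref}mmutstcase pre=$i kb=$kb $nrun$nb $mrun$mb mmrout=../CASES/ATL_gemm_SSE.c 2>&1 | grep PASS >/dev/null ; then
            # echo -n ${kb}_${nb}_${mb}" "
            # else
            # Report the first failing kb_nb_mb combination and stop.
            echo -n ${kb}_${nb}_${mb}x
            exit 1
          fi
          mb=$(($mb+1))
        done
        nb=$(($nb+1))
      done
      # Every nb/mb combination passed for this kb.
      echo -n "$kb "
      kb=$(($kb+4))
    done
  fi;
done
echo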
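
For the specific failures reported in the quoted message below (M = 2 + 4i
with mb=0), a narrower check might look like the following sketch. It reuses
the make invocation from that report unchanged except for M. The function
names cleanup_sizes, run_case and check_m_cleanup are hypothetical helpers
made up for this illustration; they are not part of ATLAS.

```shell
#!/bin/bash
# Print every M of the form 2 + 4i (i = 0, 1, 2, ...) up to the limit in $1.
cleanup_sizes() {
    local limit=$1 m=2
    while [ "$m" -le "$limit" ]; do
        echo "$m"
        m=$((m + 4))
    done
}

# Run the unit tester once with M = $1, using the invocation from the
# quoted report; only M varies (hypothetical wrapper).
run_case() {
    make mmutstcase mmrout=../CASES/ATL_gemm_SSE.c mb=0 nb=56 M="$1" N=56 K=56
}

# Try each size up to $1 and print those whose tester output lacks PASS.
# Returns non-zero if any size failed.
check_m_cleanup() {
    local m status=0
    for m in $(cleanup_sizes "$1"); do
        if ! run_case "$m" 2>&1 | grep PASS >/dev/null; then
            echo "M=$m failed"
            status=1
        fi
    done
    return $status
}
```

Calling check_m_cleanup 58 from the tester directory would try M = 2, 6,
10, ..., 58 and list only the sizes that do not report PASS.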
R Clint Whaley <rwhaley@cs.utk.edu> writes:
> Camm,
>
> The good news is that using your new SSE2 stuff I'm now getting a complete
> DGEMM (not just mmcase) of roughly 2Gflop. The bad news is that it still
> doesn't always get the right answer. In particular there appears to be
> an error in the M cleanup. For any i such that M = 2 + 4i, it produces
> the wrong answer. Here's some examples of making the tester fail:
>
> >> make mmutstcase mmrout=../CASES/ATL_gemm_SSE.c mb=0 nb=56 M=2 N=56 K=56
> >> make mmutstcase mmrout=../CASES/ATL_gemm_SSE.c mb=0 nb=56 M=10 N=56 K=56
>
> Seems like an error in cleanup of a 4 unrolled loop, but I obviously don't
> know. Can you confirm it's an error, and not just something I'm doing wrong?
>
> To give some good news with all this, I include timings below comparing the
> new SSE2 DGEMM versus the x86 FPU implementation.
>
> Thanks,
> Clint
>
>            100    200    300    400    500    600    700    800    900   1000
>         ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
> P4 x86  1025.6 1194.0 1181.2 1238.7 1209.7 1234.3 1247.3 1264.2 1276.8 1242.2
> P4 SSE2 1351.4 1837.0 1944.0 1828.6 1851.9 1878.3 1960.0 1932.1 1944.0 2000.0
>
>           1200   1400   1600   1800   2000   2200   2400   2600   2800   3000
>         ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
> P4 x86  1256.7 1250.1 1254.5 1262.3 1261.8 1258.6 1261.3 1261.7 1262.0 1260.5
> P4 SSE2 1986.2 1974.1 1974.0 1970.3 1990.0 1999.6 1991.9 1991.6 2002.0 1974.4
>
>
--
Camm Maguire camm@enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens." -- Baha'u'llah