SSE problems with -m32
-
Hi,
I wrote this code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <iostream>
    #include <xmmintrin.h>

    int main()
    {
        float *outIt;
        __m128 image4 ;
        int k;
        float sum;

        posix_memalign( (void**)&outIt, 16, sizeof(float)*8 );
        sum=0;
        printf("sum: %f\n", sum );
        for( k=0; k<8; k+=4 )
        {
            image4 = _mm_cvtpu8_ps( _mm_setr_pi8 ( 1+k, 2+k, 3+k, 4+k, 5+k, 6+k, 7+k, 8+k ) );
            _mm_store_ps( outIt, image4 );
            std::cerr << "16-byte aligned: " << outIt << std::endl;
            printf( "outIt[0]: %f, outIt[1]: %f, outIt[3]: %f, outIt[3]: %f\n", outIt[0], outIt[0], outIt[2], outIt[3] );
            printf( "outIt[0]: %f, outIt[1]: %f, outIt[3]: %f, outIt[3]: %f\n", outIt[0], outIt[1], outIt[2], outIt[3] );
            sum = outIt[0] + outIt[1] + outIt[2] + outIt[3];
            outIt += 4;
            printf("sum: %f\n", sum );
        }
    }
When I compile this on a 64-bit (Linux) system, I get:
sum: 0.000000
16-byte aligned: 0x100100080
outIt[0]: 1.000000, outIt[1]: 1.000000, outIt[3]: 3.000000, outIt[3]: 4.000000
outIt[0]: 1.000000, outIt[1]: 2.000000, outIt[3]: 3.000000, outIt[3]: 4.000000
sum: 10.000000
16-byte aligned: 0x100100090
outIt[0]: 5.000000, outIt[1]: 5.000000, outIt[3]: 7.000000, outIt[3]: 8.000000
outIt[0]: 5.000000, outIt[1]: 6.000000, outIt[3]: 7.000000, outIt[3]: 8.000000
sum: 26.000000
which is what I expect. However, if I compile it for 32-bit mode, the output is:
sum: 0.000000
16-byte aligned: 0x100160
outIt[0]: nan, outIt[1]: 1.000000, outIt[3]: 3.000000, outIt[3]: 4.000000
outIt[0]: 1.000000, outIt[1]: 2.000000, outIt[3]: 3.000000, outIt[3]: 4.000000
sum: 10.000000
16-byte aligned: 0x100170
outIt[0]: nan, outIt[1]: nan, outIt[3]: 7.000000, outIt[3]: nan
outIt[0]: 5.000000, outIt[1]: 6.000000, outIt[3]: 7.000000, outIt[3]: 8.000000
sum: 26.000000
There are always NaNs when I access elements of outIt for the first time. Does anybody know what I am doing wrong?
Thanks,
A.O.
-
Depending on your compiler version and the flags used, this code may be compiled to instructions that use the MMX registers. Those registers alias the x87 FPU registers, so using them destroys the contents of the FPU registers (there is no single CPU instruction for _mm_cvtpu8_ps, and there is no simple way to convert packed 8-bit integers to packed 32-bit integers). An _mm_empty() intrinsic is missing:
    image4 = _mm_cvtpu8_ps( _mm_setr_pi8 ( 1+k, 2+k, 3+k, 4+k, 5+k, 6+k, 7+k, 8+k ) );
    _mm_store_ps( outIt, image4 );
    _mm_empty();
    std::cerr << "16-byte aligned: " << outIt << std::endl;
In 64-bit mode, ordinary floating-point math is probably done in the SSE registers, so the problem does not surface; or it does not exist at all because the MMX registers are not used in the first place.
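If SSE2 is available, another option is to avoid the MMX registers altogether, so that no _mm_empty() is needed. The following is only a sketch of that alternative, not the code from the question: the helpers u8x4_to_ps and sum_block are hypothetical names made up for illustration, and it assumes an SSE2-capable target (with -m32 this usually means also passing -msse2) and a C++11 compiler.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (XMM registers only, no MMX)
#include <cstdint>
#include <cstring>

// Hypothetical helper, not from the original post: converts four unsigned
// 8-bit values to four floats without touching the MMX registers.
inline __m128 u8x4_to_ps(const std::uint8_t* src)
{
    std::int32_t packed;
    std::memcpy(&packed, src, sizeof packed);  // 4-byte load, no alignment needed
    __m128i v = _mm_cvtsi32_si128(packed);     // the 4 bytes sit in the low 32 bits
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);            // zero-extend 8-bit -> 16-bit
    v = _mm_unpacklo_epi16(v, zero);           // zero-extend 16-bit -> 32-bit
    return _mm_cvtepi32_ps(v);                 // packed int32 -> packed float
}

// Hypothetical helper: converts the values k+1 .. k+4 and sums them,
// mirroring one iteration of the loop in the question.
inline float sum_block(int k)
{
    const std::uint8_t in[4] = {
        static_cast<std::uint8_t>(1 + k), static_cast<std::uint8_t>(2 + k),
        static_cast<std::uint8_t>(3 + k), static_cast<std::uint8_t>(4 + k)
    };
    alignas(16) float out[4];                  // _mm_store_ps needs 16-byte alignment
    _mm_store_ps(out, u8x4_to_ps(in));
    return out[0] + out[1] + out[2] + out[3];
}
```

Calling sum_block(0) and sum_block(4) should give the same sums as the two loop iterations in the question. Since only XMM registers are involved, the x87 state is left alone even when the scalar additions are done on the FPU.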
-
Thanks, that's it!
A.O.