SSE problems with -m32

Hi,

I wrote this code:

#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <xmmintrin.h>

int main()
{
  float *outIt;
  __m128 image4 ;
  int k;
  float sum;

  posix_memalign( (void**)&outIt, 16, sizeof(float)*8 );
  sum=0;

  printf("sum: %f\n", sum );
  for( k=0; k<8; k+=4 )
  {
    image4 = _mm_cvtpu8_ps( _mm_setr_pi8 ( 1+k, 2+k, 3+k, 4+k, 5+k, 6+k, 7+k, 8+k ) );
    _mm_store_ps( outIt, image4 );
    std::cerr <<  "16-byte aligned: " << outIt << std::endl;
    printf( "outIt[0]: %f, outIt[1]: %f, outIt[3]: %f, outIt[3]: %f\n", outIt[0], outIt[0], outIt[2], outIt[3] );
    printf( "outIt[0]: %f, outIt[1]: %f, outIt[3]: %f, outIt[3]: %f\n", outIt[0], outIt[1], outIt[2], outIt[3] );
    sum = outIt[0] + outIt[1] + outIt[2] + outIt[3];
    outIt += 4;
    printf("sum: %f\n", sum );
  }
}

When I compile this on a 64-bit (Linux) system, I get:

sum: 0.000000
16-byte aligned: 0x100100080
outIt[0]: 1.000000, outIt[1]: 1.000000, outIt[3]: 3.000000, outIt[3]: 4.000000
outIt[0]: 1.000000, outIt[1]: 2.000000, outIt[3]: 3.000000, outIt[3]: 4.000000
sum: 10.000000
16-byte aligned: 0x100100090
outIt[0]: 5.000000, outIt[1]: 5.000000, outIt[3]: 7.000000, outIt[3]: 8.000000
outIt[0]: 5.000000, outIt[1]: 6.000000, outIt[3]: 7.000000, outIt[3]: 8.000000
sum: 26.000000

which is what I expect. However, if I compile it for 32-bit mode (-m32), the output is:

sum: 0.000000
16-byte aligned: 0x100160
outIt[0]: nan, outIt[1]: 1.000000, outIt[3]: 3.000000, outIt[3]: 4.000000
outIt[0]: 1.000000, outIt[1]: 2.000000, outIt[3]: 3.000000, outIt[3]: 4.000000
sum: 10.000000
16-byte aligned: 0x100170
outIt[0]: nan, outIt[1]: nan, outIt[3]: 7.000000, outIt[3]: nan
outIt[0]: 5.000000, outIt[1]: 6.000000, outIt[3]: 7.000000, outIt[3]: 8.000000
sum: 26.000000

There are always NaNs when I access elements of outIt for the first time. Does anybody know what I am doing wrong?

Thanks,

A.O.

camper

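Depending on your compiler version and the flags you use, this code may be compiled to use the MMX registers, which destroys the contents of the FPU registers. (_mm_cvtpu8_ps does not map to a single CPU instruction, because there is no simple way to convert packed 8-bit integers to packed 32-bit integers.) The MMX registers are aliased onto the x87 FPU registers, so x87 floating-point code that runs after MMX code can produce NaNs until the MMX state is cleared. A call to the _mm_empty() intrinsic is missing: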

    image4 = _mm_cvtpu8_ps( _mm_setr_pi8 ( 1+k, 2+k, 3+k, 4+k, 5+k, 6+k, 7+k, 8+k ) );
    _mm_store_ps( outIt, image4 );
    _mm_empty();
    std::cerr <<  "16-byte aligned: " << outIt << std::endl;
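
One way to make this harder to forget is to keep the MMX conversion and the _mm_empty() call together in a small helper. This is only a minimal sketch, assuming GCC or Clang on x86; the name store_u8x4_as_float is hypothetical and not part of the original code.

```cpp
#include <xmmintrin.h>  // _mm_setr_pi8, _mm_cvtpu8_ps, _mm_store_ps, _mm_empty

// Hypothetical helper: converts four bytes to floats, stores them at a
// 16-byte aligned address, and leaves MMX state before returning.
inline void store_u8x4_as_float(float *out16,
                                unsigned char a, unsigned char b,
                                unsigned char c, unsigned char d)
{
    // _mm_cvtpu8_ps converts only the low four bytes of the __m64.
    __m64 bytes = _mm_setr_pi8(a, b, c, d, 0, 0, 0, 0);
    __m128 floats = _mm_cvtpu8_ps(bytes);
    _mm_store_ps(out16, floats);  // out16 must be 16-byte aligned
    _mm_empty();                  // EMMS: x87 code is safe again after this
}
```

The caller can then use ordinary float math and printf right after the call without thinking about MMX state.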

In 64-bit mode, ordinary floating-point math is probably done in the SSE registers, so the problem either does not surface or does not exist at all because the MMX registers are not used in the first place.
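
If you would rather not touch the MMX registers at all, the byte-to-float conversion can also be written with SSE2 integer instructions, in which case no _mm_empty() is needed. The following is a rough sketch, not what the compiler generates for _mm_cvtpu8_ps; the function name u8x4_to_ps_sse2 is made up for illustration, and it assumes SSE2 is enabled (with GCC and -m32 that usually means adding -msse2).

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics (also pulls in xmmintrin.h)
#include <cstdint>
#include <cstring>

// Hypothetical helper: converts four unsigned bytes to four floats
// using only XMM registers, so the x87/MMX state is never modified.
inline __m128 u8x4_to_ps_sse2(const std::uint8_t *src)
{
    std::int32_t packed;
    std::memcpy(&packed, src, sizeof packed);  // read the 4 bytes safely
    __m128i v = _mm_cvtsi32_si128(packed);     // bytes land in the low 32 bits
    const __m128i zero = _mm_setzero_si128();
    v = _mm_unpacklo_epi8(v, zero);            // zero-extend 8-bit -> 16-bit
    v = _mm_unpacklo_epi16(v, zero);           // zero-extend 16-bit -> 32-bit
    return _mm_cvtepi32_ps(v);                 // 32-bit integers -> floats
}
```

The result can be stored with _mm_store_ps( outIt, ... ) exactly as in the original loop.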

Thanks, that's it!

A.O.