Xeon Phi - Peak Performance

I have a Xeon Phi 5100 and am trying to max out its compute performance.

I am using the example code from the book
"Intel Xeon Phi Coprocessor High-Performance Programming", page 35 ff., which I hope I typed in correctly.

However, my results are nowhere near as good as those in the book:

Instead of almost 2 teraflops, I get just under one teraflop.

Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <sys/time.h>

double dtime()
{    
  double tseconds = 0.0;    
  struct timeval mytime;    
  gettimeofday(&mytime,(struct timezone*)0);    
  tseconds = (double)(mytime.tv_sec + mytime.tv_usec*1.0e-6);    
  return( tseconds );
}

#define FLOPS_ARRAY_SIZE (1024*1024) 
#define MAXFLOPS_ITERS 100000000
#define LOOP_COUNT 128

#define FLOPSPERCALC 2

float fa[FLOPS_ARRAY_SIZE] __attribute__((align(64)));
float fb[FLOPS_ARRAY_SIZE] __attribute__((align(64)));

int main(int argc, char *argv[] )
{
  int i,j,k;
  int numthreads;

  double tstart, tstop, ttime;

  double gflops = 0.0;
  float a = 1.1;

  #pragma omp parallel for
  for(i=0; i < FLOPS_ARRAY_SIZE; i++)
  {
    if (i==0) numthreads = omp_get_num_threads();
    fa[i] = (float)i + 0.1;
    fb[i] = (float)i + 0.2;
  }

  printf("Start computing on %d threads\n", numthreads);

  tstart = dtime();

  #pragma omp parallel for private(j,k)
  for(i=0; i < numthreads; i++)
  {
    int offset = i * LOOP_COUNT; 

    for(j=0; j < MAXFLOPS_ITERS; j++)
    {
      for(k=0; k<LOOP_COUNT; k++)
      {
        fa[k+offset] = a * fa[k+offset] + fb[k+offset];
      }
    }
  }

  tstop = dtime();

  gflops = (double) (1.0e-9 * numthreads * LOOP_COUNT * MAXFLOPS_ITERS * FLOPSPERCALC);

  ttime = tstop - tstart;

  printf("GFlops = %10.3lf, Secs = %10.3lf, GFLops per Second = %10.3lf\r\n", gflops, ttime, gflops/ttime);

  return 0;
}

Can anyone spot reasons for the performance difference?
Did I make a typo somewhere?
Funnily enough, my best performance also came with 240 threads rather than the 120 used in the book.

Gregor

1. Is it the same model?

2. Is it the same compiler, and is it invoked with the same switches?