Question about AVX intrinsics with the Intel compiler

Hey, when I build a 64-bit C++ application with the Intel compiler at -O3 on 64-bit Windows from this source code:

#include <immintrin.h>
#include <iostream>
int main()
{
	double __declspec(align(32)) arr[4] = {1.,2.,3.,4.};
	__m256d a = _mm256_load_pd(arr);
	_mm256_store_pd(arr, _mm256_mul_pd(a,a));
	std::cout<<arr[3];
}

I get the following assembly output:

;double __declspec(align(32)) arr[4] = {1.,2.,3.,4.};
mov       rax, 03ff0000000000000H
mov       rdx, 04000000000000000H
mov       r8, 04008000000000000H
mov       r9, 04010000000000000H
mov       QWORD PTR [r13], rax
mov       QWORD PTR [8+r13], rdx
mov       QWORD PTR [16+r13], r8
mov       QWORD PTR [24+r13], r9
$LN26:

;__m256d a = _mm256_load_pd(arr);

vmovupd   ymm0, YMMWORD PTR [r13]

;_mm256_store_pd(arr, _mm256_mul_pd(a,a));

vmulpd    ymm2, ymm0, ymm0

vmovupd   YMMWORD PTR [r13], ymm2

Why does the compiler emit the unaligned form of the move instructions (that is, "vmovupd" instead of "vmovapd" for both "load_pd" and "store_pd")? I explicitly align my array to 32 bytes, and I also explicitly use the aligned intrinsics ("_mm256_load_pd" rather than "_mm256_loadu_pd"). Why does the Intel compiler change this on its own?