C++: VS/gcc/clang optimize poorly, Intel does better - any ideas?

I am currently trying to optimize a calculation (it runs several million times), and I noticed that VS, gcc and clang
leave a trivial calculation that looks optimizable to me completely unoptimized. Apparently only the Intel compiler (if I am reading the ASM code correctly)
is able to optimize this scenario. If the define ONLY_CONST_VALUES is enabled, you can see that all compilers are able to reduce the calculation completely to its result.

Here is the source code:

// gcc.godbolt.org
// x86-64 CL 19 2017 RTW
// x86-64 clang 4.0.0, -O2 -Wall -std=c++1z
// x86-64 gcc 7.1, -O2 -Wall -std=c++1z
// x86-64 icc 17

#include <cstddef>
#include <cassert>
#include <array>

namespace
{

template <typename ValueT>
class vector3
{
public:
    typedef ValueT value_type;
    typedef ValueT& reference;
    typedef const ValueT& const_reference;

    vector3()
    {
        m_data.fill(ValueT());
    }

    vector3(const_reference x_, const_reference y_, const_reference z_):m_data{x_,y_,z_}
    {
    }

    const_reference x() const
    {
        return m_data[0];
    }

    reference x()
    {
        return m_data[0];
    }

    const_reference y() const
    {
        return m_data[1];
    }

    reference y()
    {
        return m_data[1];
    }

    const_reference z() const
    {
        return m_data[2];
    }

    reference z()
    {
        return m_data[2];
    }

    vector3& operator+=(const vector3& other)
    {
        m_data[0] += other.m_data[0];
        m_data[1] += other.m_data[1];
        m_data[2] += other.m_data[2];
        return *this;
    }

    vector3& operator*=(const_reference value)
    {
        m_data[0] *= value;
        m_data[1] *= value;
        m_data[2] *= value;
        return *this;
    }

private:
    std::array<value_type, 3> m_data;
};

typedef double real_type;
typedef vector3<real_type> point3;

template <typename ValueT>
inline vector3<ValueT> operator*(typename vector3<ValueT>::const_reference a, vector3<ValueT> b)
{
    return b *= a;
}

template <typename ValueT>
inline vector3<ValueT> operator+(vector3<ValueT> a, const vector3<ValueT>& b)
{
    return a += b;
}

static inline point3 calc(const point3& p_point, const point3& p_direction_x,
    const point3& p_direction_y, const point3& p_direction_z, const point3& p_origin)
{
    return (p_point.x() * p_direction_x + p_point.y() * p_direction_y + p_point.z() * p_direction_z) + p_origin;
}

// 0=run calc function using structs
// 1=hand inlined calc function using structs
// 2=hand inlined calc function using x,y,z,direction,origin direct
// 3=hand optimized result of the calc function
#define TEST_LEVEL 0

static point3 test(const real_type& p_x, const real_type& p_y, const real_type& p_z)
{
#if TEST_LEVEL == 0 || TEST_LEVEL == 1
    const point3 point(p_x, p_y, p_z);
    const point3 direction_x(0.0, -1.0, 0.0);
    const point3 direction_y(1.0, 0.0, 0.0);
    const point3 direction_z(0.0, 0.0, 1.0);
    const point3 origin(0.0, 0.0, 0.0);
#endif

#if TEST_LEVEL == 0
    const point3 result = calc(point, direction_x, direction_y, direction_z, origin);
#elif TEST_LEVEL == 1
    const point3 result = (point.x() * direction_x + point.y() * direction_y + point.z() * direction_z) + origin;
#elif TEST_LEVEL == 2
    const point3 result = (p_x * point3(0.0, -1.0, 0.0) + p_y * point3(1.0, 0.0, 0.0) + p_z * point3(0.0, 0.0, 1.0)) + point3(0.0, 0.0, 0.0);
#elif TEST_LEVEL == 3
    const point3 result(p_y, -p_x, p_z);
#else
#error unknown TEST_LEVEL
#endif
    return result;
}

}

//#define ONLY_CONST_VALUES

int main(int argc, char* argv[])
{
    const real_type x = 1.3;
    const real_type y = 2.2;
    const real_type z = 3.1;

#if defined(ONLY_CONST_VALUES)
    const point3 result = test(x, y, z);
#else    
    //+ argv == anti-optimizer trick
    const point3 result = test(x+argv[0][0], y+argv[0][1], z+argv[0][2]);
#endif    

    const int dummy = (result.x() + result.y() + result.z()) * 1000;
#if defined(ONLY_CONST_VALUES)
    assert(dummy == 4000);
#endif

    return dummy; // anti-optimizer-trick
}

or on Pastebin: https://pastebin.com/nMW5Am0p

The direction (x, y, z) and origin vectors are compile-time constants and allow simple "*1" (the multiplication drops out) or "*0" (always yields 0) optimizations.
Only point (x, y, z) varies.

TEST_LEVEL=0 is my original code, TEST_LEVEL=3 is hand-optimized (the vector operations carried out symbolically on paper), and TEST_LEVEL=1 and 2 are just experiments.
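
The paper-and-pencil step behind TEST_LEVEL=3 can be written out as follows. This is only a sketch of the symbolic reduction, with p_x, p_y, p_z standing for the three arguments of test(); it holds for finite inputs and ignores the sign of zero:

```latex
\begin{aligned}
\text{result}
  &= p_x \begin{pmatrix} 0 \\ -1 \\ 0 \end{pmatrix}
   + p_y \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
   + p_z \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
   + \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \\
  &= \begin{pmatrix} p_y \\ -p_x \\ p_z \end{pmatrix}
\end{aligned}
```

This matches the line `const point3 result(p_y, -p_x, p_z);` in the TEST_LEVEL == 3 branch.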

Here are the source and the 4 compilers to play with, including ASM output, on gcc.godbolt:
https://godbolt.org/g/zIOEpH

Just set TEST_LEVEL in line 106 to 0 or 3 and look at the differences.

With TEST_LEVEL=0:
VS2017: hardly optimized, a huge number of additions/multiplications, far more than I would expect?
gcc/clang: somewhat better, but still all the additions/multiplications (the "*1" and "*0" knowledge is not exploited)
Intel: as well optimized as TEST_LEVEL=3, or is it?

With TEST_LEVEL=3:
as far as I can tell, all compilers are on the same level

With const/static/anonymous namespace and so on, I think I have done everything to make it "more optimizable".

Does anyone have an idea how I can help the compiler reach the same optimization with (modified) TEST_LEVEL=0 code as with
TEST_LEVEL=3? Or "should" one generally optimize this kind of thing by hand?

camper

With -ffast-math, clang and gcc already look quite good.

camper wrote:

With -ffast-math, clang and gcc already look quite good.

Not to dispute the success of -ffast-math,
but why can clang/gcc optimize better (only) with it? Shouldn't inlining the calc function "by hand" alone already be enough?

I also don't see why ignoring "strict IEEE compliance" helps so much here.

http://stackoverflow.com/questions/7420665/what-does-gccs-ffast-math-actually-do

First of all, of course, it does break strict IEEE compliance, allowing e.g. the reordering of instructions to something which is mathematically the same (ideally) but not exactly the same in floating point.

Second, it disables setting errno after single-instruction math functions, which means avoiding a write to a thread-local variable (this can make a 100% difference for those functions on some architectures).

Third, it makes the assumption that all math is finite, which means that no checks for NaN (or zero) are made in place where they would have detrimental effects. It is simply assumed that this isn't going to happen.

Fourth, it enables reciprocal approximations for division and reciprocal square root.

Further, it disables signed zero (code assumes signed zero does not exist, even if the target supports it) and rounding math, which enables among other things constant folding at compile-time.

Last, it generates code that assumes that no hardware interrupts can happen due to signalling/trapping math (that is, if these cannot be disabled on the target architecture and consequently do happen, they will not be handled).

I consider -ffast-math a rather blunt instrument. Or should one simply use it in one's projects? Probably not, right?

How would you proceed in such a situation? I have quite a few (10-20) more optimization spots like this in the code, and according to profiling they account for 70-80% of the run time.

camper

Gast3 wrote:

With -ffast-math, clang and gcc already look quite good.

Not to dispute the success of -ffast-math,
but why can clang/gcc optimize better (only) with it? Shouldn't inlining the calc function "by hand" alone already be enough?

At least with gcc you can specify the options more precisely; all that is actually needed is
-fno-signed-zeros -ffinite-math-only
This also hints at what is giving the compiler trouble: it would have to prove that all arithmetic operations have well-defined results and produce no infinities. In this particular case that would still be feasible in principle, but with more complicated operations and arbitrary input it cannot be guaranteed in general.
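
To illustrate why exactly these two flags matter, here is a small sketch (not taken from the thread; the helper names are made up for illustration). It checks whether the simplifications "x * 0.0 becomes 0.0" and "x + 0.0 becomes x" would change the result under default IEEE semantics, i.e. when compiled without -ffast-math:

```cpp
#include <cmath>
#include <limits>

// The two "obvious" simplifications needed for TEST_LEVEL 0
// (hypothetical helper names, for illustration only).
inline double times_zero(double x) { return x * 0.0; } // candidate result: +0.0
inline double plus_zero(double x)  { return x + 0.0; } // candidate result: x

// True if replacing x * 0.0 by +0.0 would change the observable result.
inline bool times_zero_fold_changes_result(double x)
{
    const double r = times_zero(x);
    // inf * 0.0 and NaN * 0.0 give NaN; negative finite x gives -0.0
    return std::isnan(r) || std::signbit(r);
}

// True if replacing x + 0.0 by x would change the observable result.
inline bool plus_zero_fold_changes_result(double x)
{
    const double r = plus_zero(x);
    if (std::isnan(r) || std::isnan(x))
        return std::isnan(r) != std::isnan(x);
    // -0.0 + 0.0 gives +0.0 in the default rounding mode
    return r != x || std::signbit(r) != std::signbit(x);
}
```

For an infinite or NaN input the first check returns true (the case -ffinite-math-only assumes away), and for -1.0 or -0.0 a sign of zero differs (the case -fno-signed-zeros assumes away). For ordinary positive finite values both checks return false, which is why the hand-optimized TEST_LEVEL=3 result looks equivalent in practice.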

-fno-signed-zeros -ffinite-math-only

Thanks

Further questions:

1. Is there anything comparable for Microsoft CL? /fp:fast apparently doesn't go far enough, so here I am probably forced to optimize by hand, right?

2. Could there be tricks that allow the compiler to do the optimization after all (without the -f(fast-math) options)? For example, not working with doubles directly but with some value wrapper that makes it possible to recognize that a 0 really is just 0, and so on.

3. Can the -f(fast-math) options also be applied only to individual functions or units, e.g. via #pragma ...?

4. @camper

How would you proceed in this example:
use the -f(fast-math) options or optimize by hand?
Actually the calc call is part of a chain that can be optimized even further,
but then it is no longer so easy to see what is being computed.
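
One possible shape for the value-wrapper idea raised above, as a minimal sketch and not something proposed in the thread: encode the constants 0, 1 and -1 as empty tag types, so that "*0" and "*1" are resolved by overload resolution instead of by the floating-point optimizer. All names here (zero_c, one_c, minus_one_c, vec3, calc_tagged) are hypothetical. Note that `x * zero_c` yields zero even for infinite or NaN x, so this bakes in the same assumptions as -ffinite-math-only and -fno-signed-zeros.

```cpp
#include <type_traits>

// Hypothetical tag types: the value lives in the type, not in a double.
struct zero_c {};      // exactly 0
struct one_c {};       // exactly 1
struct minus_one_c {}; // exactly -1

// scalar * constant (assumes the scalar is finite)
constexpr zero_c operator*(double, zero_c)        { return {}; }
constexpr double operator*(double x, one_c)       { return x; }
constexpr double operator*(double x, minus_one_c) { return -x; }

// sums involving the zero tag (double + double stays the built-in operator)
constexpr zero_c operator+(zero_c, zero_c)   { return {}; }
constexpr double operator+(zero_c, double x) { return x; }
constexpr double operator+(double x, zero_c) { return x; }

// Vector whose components may each be a double or one of the tags.
template <typename X, typename Y, typename Z>
struct vec3
{
    X x;
    Y y;
    Z z;
};

template <typename X, typename Y, typename Z>
constexpr auto operator*(double s, const vec3<X, Y, Z>& v)
{
    return vec3<decltype(s * v.x), decltype(s * v.y), decltype(s * v.z)>{
        s * v.x, s * v.y, s * v.z};
}

template <typename X1, typename Y1, typename Z1,
          typename X2, typename Y2, typename Z2>
constexpr auto operator+(const vec3<X1, Y1, Z1>& a, const vec3<X2, Y2, Z2>& b)
{
    return vec3<decltype(a.x + b.x), decltype(a.y + b.y), decltype(a.z + b.z)>{
        a.x + b.x, a.y + b.y, a.z + b.z};
}

// Same formula as calc() in the original post, with tagged constants.
constexpr vec3<double, double, double> calc_tagged(double px, double py, double pz)
{
    constexpr vec3<zero_c, minus_one_c, zero_c> direction_x{};
    constexpr vec3<one_c, zero_c, zero_c> direction_y{};
    constexpr vec3<zero_c, zero_c, one_c> direction_z{};
    constexpr vec3<zero_c, zero_c, zero_c> origin{};
    return (px * direction_x + py * direction_y + pz * direction_z) + origin;
}

// Scaling by a scalar keeps the zero components as tags.
static_assert(std::is_same<decltype(1.0 * vec3<zero_c, minus_one_c, zero_c>{}),
                           vec3<zero_c, double, zero_c>>::value,
              "zero components stay compile-time zeros");
```

With this, `calc_tagged(px, py, pz)` contains no multiplications or additions at all once inlined and evaluates to (py, -px, pz), the same as TEST_LEVEL=3. The price is that the direction vectors must be known at compile time and the vector type becomes heterogeneous, so whether this is preferable to hand optimization or to the two gcc flags is a judgment call.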