Fast Linear Algebra Classes for Games and Graphics

The purpose of the SIMD::Traits<T, N> specializations is to provide a mechanism for selecting implementations that are written for T and optimized for N elements, for use when developing the C++ template classes and template functions that form the core of the linear algebra features of the library.

The SIMD::Traits<T, N> specializations are used to implement a set of C++ classes that simplify SIMD development. These classes are contained in the HCCVectorMath.h header file.

The library exploits the ability of C++ to create zero-overhead abstractions, making efficient SIMD code readable:

using Vector = Math::Vector<float, 4>;

Vector v1( 1.0f, 2.0f, 3.f, 1.0f );
Vector v2( 1.0f, 2.0f, 3.f, 1.0f );
Vector v3 = v1 + v2 + v1 + v2 + v1 + v2;

where the + operator is implemented as:

template<SimdOrTupleType T, SimdOrTupleType U>
    requires IsCompatible<T, U> && Internal::HasTupleType<T, U>
inline auto operator + ( const T& lhs, const U& rhs ) noexcept
{
    using Traits = typename T::Traits;
    if constexpr ( Internal::HasSimdType<T, U> == false && T::Size < 3 )
    {
        return T( lhs.x + rhs.x, lhs.y + rhs.y );
    }
    else
    {
        using ResultType = Internal::MakeResultType<T>;
        return ResultType( Traits::Add( Internal::ToSimd( lhs ), 
                                        Internal::ToSimd( rhs ) ) );
    }
}

The compiler turns this into:

    using Vector = Math::Vector<float, 4>;

    Vector v1( 1.0f, 2.0f, 3.f, 1.0f );
00007FF6C7F162D8  vmovdqu     xmm4,xmmword ptr [__xmm@3f80000040400000400000003f800000 (07FF6C8387F60h)]  
    Vector v2( 1.0f, 2.0f, 3.f, 1.0f );
    Vector v3 = v1 + v2 + v1 + v2 + v1 + v2;
00007FF6C7F162E0  vaddps      xmm0,xmm4,xmm4  
00007FF6C7F162E4  vaddps      xmm1,xmm0,xmm4  
00007FF6C7F162E8  vaddps      xmm2,xmm1,xmm4  
00007FF6C7F162EC  vaddps      xmm3,xmm2,xmm4  
00007FF6C7F162F0  vaddps      xmm6,xmm3,xmm4  

The compiler detects that the two vectors are identical, performs a single load of the data into xmm4, and then generates code for the additions.

The important point is that using the classes does not introduce any unnecessary instructions.

Rearranging and grouping the terms:

Vector v3 = ( v1 + v2 ) + ( v1 + v2 ) + ( v1 + v2 );

improves the generated code significantly:

00007FF6BC1F39E8  vaddps      xmm1,xmm6,xmm6  
00007FF6BC1F39EC  vaddps      xmm0,xmm1,xmm1  
00007FF6BC1F39F0  vaddps      xmm1,xmm0,xmm1 

This reduces the number of vaddps instructions from five to three.

Doing the same thing using DirectXMath requires a bit more work:


XMFLOAT4A v1( { 1.0f, 2.0f, 3.f, 1.0f } );
XMFLOAT4A v2( { 1.0f, 2.0f, 3.f, 1.0f } );
auto v1Loaded = XMLoadFloat4A( &v1 );
auto v2Loaded = XMLoadFloat4A( &v2 );
        
auto v3Loaded = XMVectorAdd(
        XMVectorAdd( 
            XMVectorAdd( 
                v1Loaded,
                v2Loaded ),
            v1Loaded ),
        XMVectorAdd( 
            XMVectorAdd( 
                v2Loaded,
                v1Loaded ),
            v2Loaded ) );

This is harder to write, and if you didn’t write it, harder to understand.

Benchmarking the above DirectXMath code against

`Vector v3 = ( v1 + v2 ) + ( v1 + v2 ) + ( v1 + v2 );`

yields:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
BenchmarkVector2MultipleAdds                  4.36 ns         3.35 ns    224000000
BenchmarkVector2MultipleXMVectorAdd           4.56 ns         3.77 ns    165925926

So, there is rarely any overhead to using the library, compared to working directly with the SIMD compiler intrinsic functions.

Ideally, the compiler would generate optimal code for any computation, and it usually comes close. When code generation for the newer instruction sets is enabled using the /arch:AVX, /arch:AVX2, /arch:AVX512 or /arch:AVX10.1 compiler switches, it will utilize SIMD operations to improve performance. This just requires a rebuild of the solution, and will often improve performance significantly.

Note that AVX was introduced with the Sandy Bridge microarchitecture back in 2011, while AVX2 came later with the Haswell microarchitecture in 2013, and AMD added support for AVX2 in 2015. So it’s generally safe to assume that any modern server or workstation supports AVX2.

Vector

Vector is a C++ template class that holds a fixed number of elements:

template<typename ValueT, size_t N>
class alignas( Math::Internal::SIMD::Traits<ValueT,N>::AlignAs ) Vector
{
  ...
};

The template supports the unary - operator, the binary +, -, *, and / operators, the compound assignment operators +=, -=, *=, and /=, and the [] subscript operator. The basic mathematical operators are implemented as constexpr, allowing expressions to be evaluated at compile time while using SIMD instructions at runtime. For general purpose linear algebra, Eigen and Armadillo remain far better alternatives; this library targets the small, fixed sizes used in games and graphics.
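
A hedged sketch of the compile time evaluation, reusing the Vector alias from the example above and assuming its constructors are constexpr as well; the exact set of expressions accepted at compile time may differ:

// Evaluated entirely at compile time, since the operands are constant expressions;
// the same expression compiles to SIMD instructions when evaluated at runtime.
constexpr Vector a( 1.0f, 2.0f, 3.0f, 4.0f );
constexpr Vector b( 4.0f, 3.0f, 2.0f, 1.0f );
constexpr auto c = a + b;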

Tuple2, Tuple3 and Tuple4

The Tuple2, Tuple3 and Tuple4 templates, together with the TupleSimd template, implement most of the machinery required to provide an efficient set of classes and templates for linear algebra in games, graphics applications, and other applications working with two and/or three dimensional data.

The template arguments are the type derived from the template and the type used to store each coordinate’s value. Since the templates know their derived type, they can operate on, and return values of, the derived type.

When working with data in two dimensions, it’s a common convention that the type holding two dimensional data has two data fields, x and y. Similarly a type holding three dimensional data is expected to have three data fields named x, y and z.

The Tuple2 template fills this role for two dimensional data:

template<class DerivedT, typename T>
class Tuple2 : public Internal::TupleBase
{
public:
  ...
    union
    {
        ArrayType values;
        struct
        {
            value_type x, y;
        };
    };
  ...
};

where values is a std::array<value_type,2> sharing its memory location with x and y. Setting v.values[0] = 0 is the same as setting v.x = 0.

The union of values and struct { value_type x, y; } is important, as it provides a generic way to access the coordinates held by the object without explicitly naming x or y.

DerivedT is required to be a class derived from the Tuple2 template, and value_type is declared as using value_type = T;.
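
As a hedged sketch, a concrete two dimensional type can be derived from Tuple2 like this; MyPoint2f is a hypothetical name, not a class provided by the library:

class MyPoint2f : public Tuple2<MyPoint2f, float>
{
public:
    using Base = Tuple2<MyPoint2f, float>;
    // Inherit the Tuple2 constructors.
    using Base::Base;
};

MyPoint2f p( 1.0f, 2.0f );
// The operators inherited from Tuple2 return MyPoint2f&, not Tuple2&.
p += 2.0f;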

Similarly the Tuple3 template fills this role for three dimensional data:

template<class DerivedT, typename T>
class Tuple3 : public Internal::TupleBase
{
public:
  ...
    union
    {
        ArrayType values;
        struct
        {
            value_type x, y, z;
        };
    };
  ...
};

and here values is a std::array<value_type,3>.

Tuple4 adds an additional field w, and uses std::array<value_type,4> for the values array.

The Tuple2, Tuple3 and Tuple4 templates are derived from the empty Internal::TupleBase struct.

namespace Internal
{
    struct TupleBase
    { };
}

This provides a mechanism for distinguishing between types derived from Tuple2, Tuple3 and Tuple4 and types that are not:

namespace Internal
{
    template<typename T>
    concept TupleType = std::is_base_of_v<TupleBase, T>;
}

To use SIMD on the Intel/AMD x64 architecture, data must, as mentioned, be loaded into a SIMD type that is an abstract representation of an AVX or SSE4 register. The TupleSimd template represents SIMD types using one of the SIMD::Traits specializations combined with a tuple type.

template<typename TraitsT, typename TupleT>
class TupleSimd : public Internal::TupleSimdBase
{
  ...    
};

TupleSimd fills this role for each of the Tuple2, Tuple3 and Tuple4 template classes.

template<class DerivedT, typename T>
class Tuple3 : public Internal::TupleBase
{
public:
    using DerivedType = DerivedT;
  ...
    using Simd = TupleSimd<Traits, DerivedType>;
  ...
};

The above definition of Tuple3::Simd ensures that each class derived from Tuple3 gets a distinct Tuple3::Simd C++ type.
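
A hedged sketch of why this matters, using the hypothetical classes MyPoint3f and MyNormal3f, which are not part of the library:

class MyPoint3f : public Tuple3<MyPoint3f, float>
{
public:
    using Base = Tuple3<MyPoint3f, float>;
    using Base::Base;
};

class MyNormal3f : public Tuple3<MyNormal3f, float>
{
public:
    using Base = Tuple3<MyNormal3f, float>;
    using Base::Base;
};

// MyPoint3f::Simd and MyNormal3f::Simd are distinct C++ types, so overloads
// and conversions can keep intermediate results for points and normals apart.
// std::is_same_v requires <type_traits>.
static_assert( std::is_same_v<MyPoint3f::Simd, MyNormal3f::Simd> == false );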

Tuple2, Tuple3 and Tuple4 implementation details

Tuple2, Tuple3 and Tuple4 have similar implementations, following the same pattern:

template<typename DerivedT, typename T>
class Tuple3 : public Internal::TupleBase
{

Internal::TupleBase is used as the base class for Tuple2, Tuple3 and Tuple4, enabling the use of std::is_base_of_v<,> to distinguish between types that are derived from Tuple2, Tuple3 or Tuple4 and those that are not. This ensures that the templates designed for Tuple2, Tuple3 and Tuple4 are only enabled for classes derived from one of them.

Inside the template definition, DerivedType is defined as the class or struct derived from the Tuple3 template, together with value_type and size_type.

public:
    using DerivedType = DerivedT;
    using value_type = T;
    using size_type = size_t;

Tuple3 holds three values, x, y and z, since its main purpose is to store three dimensional coordinates, and Size specifies the number of values that Tuple3 holds:

    static constexpr size_type Size = 3;

    using Traits = SIMD::Traits<value_type, Size>;
    using SIMDType = typename Traits::SIMDType;

Above, Traits is defined for value_type and Size, selecting the SIMD::Traits specialization that fits the requirements of Tuple3.

Next ArrayType is defined:

    using ArrayType = typename Traits::ArrayType;

This is the same as std::array<value_type,Size>. While x, y and z are the common names for three dimensional data, values, as defined below, is more convenient when developing templates that must work with Tuple2, Tuple3, Tuple4 and any class derived from them.

The mathematical operations are performed using the Simd type, which holds a SIMD vector. It is the TupleSimd template instantiated for Traits and the class derived from the Tuple3 template, ensuring that each derived class gets a unique TupleSimd C++ type.

    using Simd = TupleSimd<Traits, DerivedType>;

The data fields of Tuple3:

    union
    {
        ArrayType values;
        struct
        {
            value_type x, y, z;
        };
    };
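
As noted above, the values member provides a generic way to access the coordinates. A hedged sketch of a hypothetical helper, not part of the library, that works for any class derived from Tuple2, Tuple3 or Tuple4:

// Hypothetical helper: sums the coordinates of any tuple type by iterating
// over the values array instead of naming x, y and z explicitly.
template<Internal::TupleType T>
typename T::value_type SumOfElements( const T& t ) noexcept
{
    typename T::value_type sum{};
    for ( auto v : t.values )
    {
        sum += v;
    }
    return sum;
}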

The default constructor initializes x, y and z to 0, and the second constructor initializes them from the three supplied values:

    Tuple3( ) noexcept
            : x{}, y{}, z{}
    { }

    Tuple3( value_type xv, value_type yv, value_type zv ) noexcept
        : x( xv ), y( yv ), z( zv )
    { }

The next constructor, and the matching assignment operator, initialize a Tuple3 from a compatible TupleSimd, which is any TupleSimd instantiated for the same SIMD::Traits specialization as the Tuple3. SIMD::Traits has a ToArray function that stores the values held by the SIMD type in a std::array<> of the same type as ArrayType and returns it.


    template<Internal::SimdType T>
        requires std::is_same_v<Traits, typename T::Traits>
    Tuple3( const T& other ) noexcept
        : values( Traits::ToArray( other.simd ) )
    { }

    template<Internal::SimdType T>
        requires std::is_same_v<Traits, typename T::Traits>
    DerivedType& operator = ( const T& other ) noexcept
    {
        values = Traits::ToArray( other.simd );
        return static_cast< DerivedType& >( *this );
    }

Comparison with another Tuple3 compares the coordinates element by element using IsSameValue:

    constexpr bool operator == ( const Tuple3& other ) const noexcept
    {
        return IsSameValue( x, other.x ) && IsSameValue( y, other.y ) 
                            && IsSameValue( z, other.z );
    }
    constexpr bool operator != ( const Tuple3& other ) const noexcept
    {
        return !IsSameValue( x, other.x ) || !IsSameValue( y, other.y ) 
                            || !IsSameValue( z, other.z );
    }

Comparison with a compatible TupleSimd loads the values before delegating to Traits::Equals:


    template<Internal::SimdType T>
        requires std::is_same_v<Traits, typename T::Traits>
    bool operator == ( const T& other ) const noexcept
    {
        return Traits::Equals( Traits::Load( values.data( ) ), other.simd );
    }

    template<Internal::SimdType T>
        requires std::is_same_v<Traits, typename T::Traits>
    bool operator != ( const T& other ) const noexcept
    {
        return Traits::Equals( Traits::Load( values.data( ) ), other.simd ) == false;
    }

Negation loads the data of the Tuple3 into a SIMDType, calls Traits::Negate, and returns the result as a Tuple3::Simd, the TupleSimd instantiation for this Tuple3 type:

    Simd operator-( ) const noexcept
    {
        return Traits::Negate( Traits::Load( values.data( ) ) );
    }

Tuple3 overloads the +=, -=, *= and /= operators. Each operator accepts a const reference to a Simd, a const reference to a Tuple3, or a scalar. Note that the overloads return a reference to DerivedType:


    DerivedType& operator += ( const Simd& other ) noexcept
    {
        values = Traits::ToArray( Traits::Add( Traits::Load( values ), other.simd ) );
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator += ( const Tuple3& other ) noexcept
    {
        x += other.x;
        y += other.y;
        z += other.z;
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator += ( const value_type& value ) noexcept
    {
        x += value;
        y += value;
        z += value;
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator -= ( const Simd& other ) noexcept
    {
        values = Traits::ToArray( Traits::Sub( Traits::Load( values ), other.simd ) );
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator -= ( const Tuple3& other ) noexcept
    {
        x -= other.x;
        y -= other.y;
        z -= other.z;
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator -= ( const value_type& value ) noexcept
    {
        x -= value;
        y -= value;
        z -= value;
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator *= ( const Simd& other ) noexcept
    {
        values = Traits::ToArray( Traits::Mul( Traits::Load( values ), other.simd ) );
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator *= ( const Tuple3& other ) noexcept
    {
        x *= other.x;
        y *= other.y;
        z *= other.z;
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator *= ( const value_type& value ) noexcept
    {
        x *= value;
        y *= value;
        z *= value;
        return static_cast< DerivedType& >( *this );
    }


    DerivedType& operator /= ( const Simd& other ) noexcept
    {
        values = Traits::ToArray( Traits::Div( Traits::Load( values ), other.simd ) );
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator /= ( const Tuple3& other ) noexcept
    {
        x /= other.x;
        y /= other.y;
        z /= other.z;
        return static_cast< DerivedType& >( *this );
    }

    DerivedType& operator /= ( const value_type& value ) noexcept
    {
        x /= value;
        y /= value;
        z /= value;
        return static_cast< DerivedType& >( *this );
    }
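
A hedged usage sketch of the compound assignment operators, reusing the hypothetical MyPoint3f from the earlier sketch:

MyPoint3f a( 1.0f, 2.0f, 3.0f );
MyPoint3f b( 4.0f, 5.0f, 6.0f );

// Element wise addition with another Tuple3 derived object.
a += b;
// Scales each coordinate by the scalar.
a *= 0.5f;
// b + b yields a MyPoint3f::Simd, which the Simd overload accepts directly.
a -= b + b;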

Since Tuple3 is a container, it implements common container member functions:

    const_reference operator[]( size_t index ) const noexcept
    {
        return values[ index ];
    }

    reference operator[]( size_t index ) noexcept
    {
        return values[ index ];
    }

    const_pointer data( ) const noexcept
    {
        return values.data( );
    }
    pointer data( ) noexcept
    {
        return values.data( );
    }

    constexpr size_t size( ) const noexcept
    {
        return Size;
    }

    const_reference front( ) const noexcept
    {
        return values.front( );
    }
    reference front( ) noexcept
    {
        return values.front( );
    }

    const_reference back( ) const noexcept
    {
        return values.back( );
    }
    reference back( ) noexcept
    {
        return values.back( );
    }

    const_iterator begin( ) const noexcept
    {
        return values.begin( );
    }
    const_iterator cbegin( ) const noexcept
    {
        return values.cbegin( );
    }
    iterator begin( ) noexcept
    {
        return values.begin( );
    }
    const_iterator end( ) const noexcept
    {
        return values.end( );
    }
    const_iterator cend( ) const noexcept
    {
        return values.cend( );
    }
    iterator end( ) noexcept
    {
        return values.end( );
    }
    const_reverse_iterator rbegin( ) const noexcept
    {
        return values.rbegin( );
    }
    reverse_iterator rbegin( ) noexcept
    {
        return values.rbegin( );
    }
    const_reverse_iterator rend( ) const noexcept
    {
        return values.rend( );
    }
    reverse_iterator rend( ) noexcept
    {
        return values.rend( );
    }

The above provides a reasonable level of integration with the standard C++ library.

Calling the Assign member functions from derived classes can be more convenient than calling Base::operator = ( arrayOfData ).

    void Assign( value_type xv, value_type yv, value_type zv ) noexcept
    {
        x = xv;
        y = yv;
        z = zv;
    }
    void Assign( const ArrayType& src ) noexcept
    {
        values = src;
    }
    void Assign( SIMDType src ) noexcept
    {
        values = Traits::ToArray( src );
    }
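
A hedged sketch of a derived class using Assign in one of its constructors; MyOffset3f and its ArrayType constructor are hypothetical:

class MyOffset3f : public Tuple3<MyOffset3f, float>
{
public:
    using Base = Tuple3<MyOffset3f, float>;
    using Base::Base;

    explicit MyOffset3f( const ArrayType& coords ) noexcept
    {
        // Slightly more convenient than calling Base::operator = ( coords ).
        Assign( coords );
    }
};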

Tuple3 also provides member functions for checking for NaN and infinity:


    bool HasNaN( ) const noexcept
    {
        return std::isnan( x ) || std::isnan( y ) || std::isnan( z );
    }

    bool IsFinite( ) const noexcept
    {
        return std::isfinite( x ) && std::isfinite( y ) && std::isfinite( z );
    }

    bool IsInfinite( ) const noexcept
    {
        return std::isinf( x ) || std::isinf( y ) || std::isinf( z );
    }

};
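
A brief usage sketch, again with the hypothetical MyPoint3f:

// std::numeric_limits requires <limits>.
MyPoint3f p( 1.0f, std::numeric_limits<float>::quiet_NaN( ), 3.0f );
bool hasNaN = p.HasNaN( );        // true
bool finite = p.IsFinite( );      // false
bool infinite = p.IsInfinite( );  // false, NaN is not an infinity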

The library defines two internal concepts: Internal::TupleType and Internal::SimdType.

Tuple2, Tuple3 and Tuple4 match the Internal::TupleType concept, while TupleSimd based types match the Internal::SimdType concept.

In the code below, Internal::IsCompatible<T,U> is used to verify that T and U are compatible types using the same SIMD::Traits specialization.

The first overload is used when both arguments are compatible TupleSimd based objects, so the implementation can perform the addition without loading any values into a SIMD datatype/register.

template<Internal::SimdType T, Internal::SimdType U>
    requires Internal::IsCompatible<T,U>
inline T operator + ( const T& lhs, const U& rhs ) noexcept
{
    using Traits = typename T::Traits;
    return Traits::Add( lhs.simd, rhs.simd );
}

The second overload accepts a TupleSimd as its left hand argument and any compatible Tuple2, Tuple3 or Tuple4 as its right hand argument.

The implementation loads the data from the tuple type before performing the addition, returning T, which is a TupleSimd based type.

template<Internal::SimdType T, Internal::TupleType U>
    requires Internal::IsCompatible<T, U>
inline T operator + ( const T& lhs, const U& rhs ) noexcept
{
    using Traits = typename T::Traits;
    return Traits::Add( lhs.simd, Traits::Load( rhs.values.data( ) ) );
}

The third overload accepts the same kinds of arguments as the second, but in reverse order: a Tuple2, Tuple3 or Tuple4 as its left hand argument and a TupleSimd as its right hand argument.

This time the contents of lhs is loaded before performing the addition.

template<Internal::TupleType U, Internal::SimdType T>
    requires Internal::IsCompatible<T, U>
inline T operator + ( const U& lhs, const T& rhs ) noexcept
{
    using Traits = typename T::Traits;
    return Traits::Add( Traits::Load( lhs.values.data( ) ), rhs.simd );
}

The fourth and last overload accepts a compatible Tuple2, Tuple3 or Tuple4 for both the left hand side and the right hand side of the addition, loading the data for both arguments before performing the addition.

template<Internal::TupleType T, Internal::TupleType U, 
                            typename ResultT = typename T::Simd>
    requires Internal::IsCompatible<T, U>
inline ResultT operator + ( const T& lhs, const U& rhs ) noexcept
{
    using Traits = typename T::Traits;
    return Traits::Add( Traits::Load( lhs.values.data( ) ), 
                                Traits::Load( rhs.values.data( ) ) );
}
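
A hedged sketch of which overload is selected for different combinations of arguments, again using the hypothetical MyPoint3f:

MyPoint3f a( 1.0f, 2.0f, 3.0f );
MyPoint3f b( 4.0f, 5.0f, 6.0f );

auto s1 = a + b;        // fourth overload: both tuples are loaded, s1 is a MyPoint3f::Simd
auto s2 = s1 + a;       // second overload: only a is loaded
auto s3 = a + s1;       // third overload: only a is loaded
auto s4 = s2 + s3;      // first overload: no loads at all
MyPoint3f result = s4;  // the values are stored back into a tuple only at the end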

Loading to and storing from the SIMD types is done using the values field of the Tuple2, Tuple3 and Tuple4 types, so they can all share the same operator and function implementations:

template<Internal::SimdType T>
inline T Round( const T& t ) noexcept
{
    using Traits = typename T::Traits;
    return Traits::Round( t.simd );
}
template<Internal::TupleType T, typename ResultT = typename T::Simd>
inline ResultT Round( const T& t ) noexcept
{
    using Traits = typename T::Traits;
    return Traits::Round( Traits::Load( t.values.data( ) ) );
}
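
A hedged usage sketch for Round, with the hypothetical MyPoint3f:

MyPoint3f p( 1.2f, 2.7f, 3.5f );
auto r1 = Round( p );       // TupleType overload: loads p.values before rounding
auto r2 = Round( p + p );   // SimdType overload: operates directly on the TupleSimd result
MyPoint3f rounded = r2;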

The operators and functions nearly always return a TupleSimd based type, ensuring that data is only loaded into a SIMD type when necessary.