Implementation Details
Fast Linear Algebra Classes for Games and Graphics
- Vector is a template for fixed size vectors of integer or floating point values.
- Tuple2, Tuple3 and Tuple4 are template classes that hold 2, 3 or 4 integer or floating point values. They are used to implement specializations of Vector and other classes holding 2, 3 or 4 integer or floating point values.
- SquareMatrix
- Quaternion
- Transforms
- Benchmarks
The purpose of the SIMD::Traits<T, N> specializations is to provide a mechanism for selecting implementations that are written for T and optimized for N elements, for use when developing the C++ template classes and template functions that form the core of the linear algebra features of the library.
The SIMD::Traits<T, N> specializations are used to implement a set of C++ classes that simplifies SIMD development. These classes are contained in the HCCVectorMath.h header file.
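To give a feel for what these specializations provide, here is an illustrative sketch - not the library's actual implementation - of how a specialization for four single precision floating point values could map some of the operations referenced throughout this section onto SSE intrinsics:
#include <cstddef>
#include <array>
#include <immintrin.h>

// Illustrative sketch only; the real SIMD::Traits specializations in the
// library are considerably more elaborate.
struct TraitsFloat4Sketch
{
    using SIMDType = __m128;
    using ArrayType = std::array<float, 4>;
    static constexpr size_t AlignAs = 16;

    static SIMDType Load( const float* src ) noexcept
    {
        return _mm_loadu_ps( src );
    }
    static ArrayType ToArray( SIMDType v ) noexcept
    {
        ArrayType result;
        _mm_storeu_ps( result.data( ), v );
        return result;
    }
    static SIMDType Add( SIMDType lhs, SIMDType rhs ) noexcept
    {
        return _mm_add_ps( lhs, rhs );
    }
    static SIMDType Negate( SIMDType v ) noexcept
    {
        return _mm_sub_ps( _mm_setzero_ps( ), v );
    }
};
The idea is that the template classes described below only go through the Traits interface, rather than calling the intrinsics directly.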
The library exploits the ability of C++ to create zero-overhead abstractions, making efficient SIMD code readable:
using Vector = Math::Vector<float, 4>;
Vector v1( 1.0f, 2.0f, 3.f, 1.0f );
Vector v2( 1.0f, 2.0f, 3.f, 1.0f );
Vector v3 = v1 + v2 + v1 + v2 + v1 + v2;
where the +
operator is implemented as:
template<SimdOrTupleType T, SimdOrTupleType U>
requires IsCompatible<T, U> && Internal::HasTupleType<T, U>
inline auto operator + ( const T& lhs, const U& rhs ) noexcept
{
using Traits = typename T::Traits;
if constexpr ( Internal::HasSimdType<T, U> == false && T::Size < 3 )
{
return T( lhs.x + rhs.x, lhs.y + rhs.y );
}
else
{
using ResultType = Internal::MakeResultType<T>;
return ResultType( Traits::Add( Internal::ToSimd( lhs ),
Internal::ToSimd( rhs ) ) );
}
}
The compiler turns this into:
using Vector = Math::Vector<float, 4>;
Vector v1( 1.0f, 2.0f, 3.f, 1.0f );
00007FF6C7F162D8 vmovdqu xmm4,xmmword ptr [__xmm@3f80000040400000400000003f800000 (07FF6C8387F60h)]
Vector v2( 1.0f, 2.0f, 3.f, 1.0f );
Vector v3 = v1 + v2 + v1 + v2 + v1 + v2;
00007FF6C7F162E0 vaddps xmm0,xmm4,xmm4
00007FF6C7F162E4 vaddps xmm1,xmm0,xmm4
00007FF6C7F162E8 vaddps xmm2,xmm1,xmm4
00007FF6C7F162EC vaddps xmm3,xmm2,xmm4
00007FF6C7F162F0 vaddps xmm6,xmm3,xmm4
The compiler detects that the two vectors are identical, performs a single load of the data into xmm4, and then generates code for the additions.
The important thing is that there are no unnecessary artifacts caused by using the classes.
Rearranging and grouping the terms:
Vector v3 = ( v1 + v2 ) + ( v1 + v2 ) + ( v1 + v2 );
improves the generated code significantly:
00007FF6BC1F39E8 vaddps xmm1,xmm6,xmm6
00007FF6BC1F39EC vaddps xmm0,xmm1,xmm1
00007FF6BC1F39F0 vaddps xmm1,xmm0,xmm1
This reduces the number of vaddps operations to just three.
Doing the same thing using DirectXMath requires a bit more work:
XMFLOAT4A v1( { 1.0f, 2.0f, 3.f, 1.0f } );
XMFLOAT4A v2( { 1.0f, 2.0f, 3.f, 1.0f } );
auto v1Loaded = XMLoadFloat4A( &v1 );
auto v2Loaded = XMLoadFloat4A( &v2 );
auto v3Loaded = XMVectorAdd(
XMVectorAdd(
XMVectorAdd(
v1Loaded,
v2Loaded ),
v1Loaded ),
XMVectorAdd(
XMVectorAdd(
v2Loaded,
v1Loaded ),
v2Loaded ) );
This is harder to write and, if you didn’t write it, harder to understand.
Benchmarking the above DirectXMath code against
`Vector v3 = ( v1 + v2 ) + ( v1 + v2 ) + ( v1 + v2 );`
yields:
----------------------------------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------
BenchmarkVector2MultipleAdds 4.36 ns 3.35 ns 224000000
BenchmarkVector2MultipleXMVectorAdd 4.56 ns 3.77 ns 165925926
So, there is rarely any overhead to using the library, compared to working directly with the SIMD compiler intrinsic functions.
Ideally, the compiler would generate optimal code for any computation, and it usually comes close - when enabled using the /arch:AVX, /arch:AVX2, /arch:AVX512 or /arch:AVX10.1 switches, it will utilize SIMD operations to improve performance. This just requires a rebuild of the solution, and will often improve performance significantly.
Note that AVX was introduced with the Sandy Bridge microarchitecture back in 2011, while AVX2 came later with the Haswell microarchitecture in 2013, and AMD added support for AVX2 in 2015. So it’s generally safe to assume that any modern server, or workstation, supports AVX2.
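For example, building a translation unit with AVX2 code generation from the command line is just a matter of adding the switch (the file name below is only a placeholder):
cl /std:c++20 /O2 /EHsc /arch:AVX2 Example.cpp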
Vector
Vector is a C++ template class that holds a fixed number of elements:
template<typename ValueT, size_t N>
class alignas( Math::Internal::SIMD::Traits<ValueT,N>::AlignAs ) Vector
{
...
};
The template supports the unary -, the binary -, +, * and / operators, the compound -=, +=, *= and /= operators, and the [] operator. The basic mathematical operators are implemented as constexpr, allowing code to be evaluated at compile time while still using SIMD instructions at runtime. For general purpose linear algebra, however, Eigen and Armadillo are far better alternatives.
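A minimal sketch of the compile time evaluation, assuming the constructors and the [] operator are constexpr as well:
using Vector = Math::Vector<float, 4>;
constexpr Vector v1( 1.0f, 2.0f, 3.0f, 1.0f );
constexpr Vector v2 = v1 + v1;      // evaluated entirely at compile time
static_assert( v2[ 0 ] == 2.0f );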
Tuple2, Tuple3 and Tuple4
The Tuple2, Tuple3 and Tuple4 templates, together with the TupleSimd template, implement most of the magic required to provide an efficient set of classes and templates that can handle linear algebra for games, graphics apps, and other apps that work with two and/or three dimensional data.
The template arguments are the type derived from the template and the type used to store each coordinate’s value. Since the templates know their derived type, they can operate on, and return, values of the derived type.
When working with data in two dimensions, it’s a common convention that the type holding two dimensional data has two data fields, x and y. Similarly, a type holding three dimensional data is expected to have three data fields named x, y and z.
The Tuple2 template fills this role for two dimensional data:
template<class DerivedT, typename T>
class Tuple2 : public Internal::TupleBase
{
public:
...
union
{
ArrayType values;
struct
{
value_type x, y;
};
};
...
};
where values is a std::array<value_type, 2> sharing the location of x and y in memory. Setting v.values[0] = 0 is the same as v.x = 0.
The union between values and struct { value_type x, y; } is important, as it provides a generic way to access the coordinates held by the object without explicitly accessing each value through x or y.
DerivedT is required to be a class derived from the Tuple2 template, and value_type is declared as using value_type = T;.
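As a hypothetical example, assuming Tuple2 lives in the Math namespace alongside Vector, a two dimensional point type could be declared like this; the library ships with its own concrete types, so Point2f here is only for illustration:
class Point2f : public Math::Tuple2<Point2f, float>
{
public:
    using Base = Math::Tuple2<Point2f, float>;
    using Base::Base;   // inherit the Tuple2 constructors
};

Point2f p( 1.f, 2.f );
p += 2.f;               // the compound operators return Point2f&, not a reference to the base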
Similarly, the Tuple3 template fills this role for three dimensional data:
template<class DerivedT, typename T>
class Tuple3 : public Internal::TupleBase
{
public:
...
union
{
ArrayType values;
struct
{
value_type x, y, z;
};
};
...
};
and here values is a std::array<value_type, 3>. Tuple4 adds an additional field, w, and uses std::array<value_type, 4> for the values array.
The Tuple2, Tuple3 and Tuple4 templates are derived from the empty Internal::TupleBase struct:
namespace Internal
{
struct TupleBase
{ };
}
This provides a mechanism for distinguishing between types that are derived from Tuple2, Tuple3 or Tuple4 and types that are not:
namespace Internal
{
template<typename T>
concept TupleType = std::is_base_of_v<TupleBase, T>;
}
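Using the hypothetical Point2f type from the previous sketch, and assuming the Internal namespace is nested inside Math as the Vector definition above suggests, the concept behaves as expected; application code would not normally spell out an internal concept like this:
static_assert( Math::Internal::TupleType<Point2f> );    // derives from TupleBase
static_assert( !Math::Internal::TupleType<float> );     // plain arithmetic types do not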
To use SIMD on the Intel/AMD x64 architecture, data must, as mentioned,
be loaded into a SIMD type that is an abstract representation of an
AVX or SSE4 register. The TupleSimd
template represents SIMD types
using one of the SIMD::Traits
specializations combined with
a tuple type.
template<typename TraitsT, typename TupleT>
class TupleSimd : public Internal::TupleSimdBase
{
...
};
TupleSimd fills this role for each of the Tuple2, Tuple3 and Tuple4 template classes:
template<class DerivedT, typename T>
class Tuple3 : public Internal::TupleBase
{
public:
using DerivedType = DerivedT;
...
using Simd = TupleSimd<Traits, DerivedType>;
...
};
The above definition of Tuple3::Simd ensures that each class derived from Tuple3 gets a distinct Tuple3::Simd C++ type, since the derived type itself is passed as the second template argument to TupleSimd.
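For example, two hypothetical types that both derive from Tuple3 with float coordinates still get distinct Simd types:
#include <type_traits>

class Point3f : public Math::Tuple3<Point3f, float>
{
public:
    using Base = Math::Tuple3<Point3f, float>;
    using Base::Base;
};

class Normal3f : public Math::Tuple3<Normal3f, float>
{
public:
    using Base = Math::Tuple3<Normal3f, float>;
    using Base::Base;
};

// Same Traits, same layout, but different DerivedType - and therefore different Simd types.
static_assert( !std::is_same_v<Point3f::Simd, Normal3f::Simd> );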
Tuple2, Tuple3 and Tuple4 implementation details
Tuple2, Tuple3 and Tuple4 have similar implementations, following the same pattern:
template<typename DerivedT, typename T>
class Tuple3 : public Internal::TupleBase
{
Internal::TupleBase is used as the base class for Tuple2, Tuple3 and Tuple4, enabling the use of std::is_base_of_v<,> to distinguish between types that are derived from one of them and those that are not. This ensures that the templates designed for Tuple2, Tuple3 and Tuple4 will only be enabled for classes derived from one of them.
Inside the template definition, DerivedType is defined as the class or struct derived from the Tuple3 template, together with value_type and size_type:
public:
using DerivedType = DerivedT;
using value_type = T;
using size_type = size_t;
Tuple3 holds three values, x, y and z, as its main purpose is to serve as storage for three dimensional coordinates, and Size specifies the number of values that Tuple3 holds:
static constexpr size_type Size = 3;
using Traits = SIMD::Traits<value_type, Size>;
using SIMDType = typename Traits::SIMDType;
Above, Traits is defined for value_type and Size, selecting the SIMD::Traits specialization that fits the requirements of Tuple3.
Next, ArrayType is defined:
using ArrayType = typename Traits::ArrayType;
This is the same as std::array<value_type, Size>, and while x, y and z are the common notation for three dimensional information, values, as defined below, is more convenient for developing templates that will work with Tuple2, Tuple3, Tuple4 and any class derived from them.
The mathematical operations are performed using the Simd type, which holds a SIMD vector. It is the TupleSimd template instantiated for the Traits and the class derived from the Tuple3 template, ensuring that each derived class gets a unique TupleSimd C++ type:
using Simd = TupleSimd<Traits, DerivedType>;
The data fields of Tuple3:
union
{
ArrayType values;
struct
{
value_type x, y, z;
};
};
The default constructor ensures that x, y and z are initialized to 0, while the second constructor initializes the fields from its three arguments:
Tuple3( ) noexcept
: x{}, y{}, z{}
{ }
Tuple3( value_type xv, value_type yv, value_type zv ) noexcept
: x( xv ), y( yv ), z( zv )
{ }
The next constructor initializes a Tuple3 from a compatible TupleSimd, which is any TupleSimd instantiated for the same SIMD::Traits specialization as the Tuple3. SIMD::Traits has a ToArray function that stores the values held by the SIMD type in a std::array<> of the same type as ArrayType and returns the data:
template<Internal::SimdType T>
requires std::is_same_v<Traits, typename T::Traits>
Tuple3( const T& other ) noexcept
: values( Traits::ToArray( other.simd ) )
{ }
template<Internal::SimdType T>
requires std::is_same_v<Traits, typename T::Traits>
DerivedType& operator = ( const T& other ) noexcept
{
values = Traits::ToArray( other.simd );
return static_cast< DerivedType& >( *this );
}
The equality operators compare the two tuples element by element using IsSameValue:
constexpr bool operator == ( const Tuple3& other ) const noexcept
{
return IsSameValue( x, other.x ) && IsSameValue( y, other.y )
&& IsSameValue( z, other.z );
}
constexpr bool operator != ( const Tuple3& other ) const noexcept
{
return !IsSameValue( x, other.x ) || !IsSameValue( y, other.y )
|| !IsSameValue( z, other.z );
}
Compare the Tuple3 with a compatible TupleSimd:
template<Internal::SimdType T>
requires std::is_same_v<Traits, typename T::Traits>
bool operator == ( const T& other ) const noexcept
{
return Traits::Equals( Traits::Load( values.data( ) ), other.simd );
}
template<Internal::SimdType T>
requires std::is_same_v<Traits, typename T::Traits>
bool operator != ( const T& other ) const noexcept
{
return Traits::Equals( Traits::Load( values.data( ) ), other.simd ) == false;
}
Negation loads the data of the Tuple3 into a SIMDType, calls Traits::Negate, and returns the result as a Tuple3::Simd, the TupleSimd specialization for this Tuple3 type:
Simd operator-( ) const noexcept
{
return Traits::Negate( Traits::Load( values.data( ) ) );
}
Tuple3 overloads the +=, -=, *= and /= operators, and each operator accepts a const reference to a Simd, a const reference to a Tuple3, or a scalar. Note that the overloads return a reference to DerivedType:
DerivedType& operator += ( const Simd& other ) noexcept
{
values = Traits::ToArray( Traits::Add( Traits::Load( values ), other.simd ) );
return static_cast< DerivedType& >(*this );
}
DerivedType& operator += ( const Tuple3& other ) noexcept
{
x += other.x;
y += other.y;
z += other.z;
return static_cast< DerivedType& >( *this );
}
DerivedType& operator += ( const value_type& value ) noexcept
{
x += value;
y += value;
z += value;
return static_cast< DerivedType& >( *this );
}
DerivedType& operator -= ( const Simd& other ) noexcept
{
values = Traits::ToArray( Traits::Sub( Traits::Load( values ), other.simd ) );
return static_cast< DerivedType& >( *this );
}
DerivedType& operator -= ( const Tuple3& other ) noexcept
{
x -= other.x;
y -= other.y;
z -= other.z;
return static_cast< DerivedType& >( *this );
}
DerivedType& operator -= ( const value_type& value ) noexcept
{
x -= value;
y -= value;
z -= value;
return static_cast< DerivedType& >( *this );
}
DerivedType& operator *= ( const Simd& other ) noexcept
{
values = Traits::ToArray( Traits::Mul( Traits::Load( values ), other.simd ) );
return static_cast< DerivedType& >( *this );
}
DerivedType& operator *= ( const Tuple3& other ) noexcept
{
x *= other.x;
y *= other.y;
z *= other.z;
return static_cast< DerivedType& >( *this );
}
DerivedType& operator *= ( const value_type& value ) noexcept
{
x *= value;
y *= value;
z *= value;
return static_cast< DerivedType& >( *this );
}
DerivedType& operator /= ( const Simd& other ) noexcept
{
values = Traits::ToArray( Traits::Div( Traits::Load( values ), other.simd ) );
return static_cast< DerivedType& >( *this );
}
DerivedType& operator /= ( const Tuple3& other ) noexcept
{
x /= other.x;
y /= other.y;
z /= other.z;
return static_cast< DerivedType& >( *this );
}
DerivedType& operator /= ( const value_type& value ) noexcept
{
x /= value;
y /= value;
z /= value;
return static_cast< DerivedType& >( *this );
}
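A short usage sketch with the hypothetical Point3f type from above, exercising each of the three overload families:
Point3f p( 1.f, 2.f, 3.f );
p *= 2.f;                          // scalar overload
p += Point3f( 1.f, 1.f, 1.f );     // Tuple3 overload
p -= ( p + p );                    // p + p yields a Simd, so this selects the Simd overload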
Since Tuple3
is a container, it implements common container member functions:
const_reference operator[]( size_t index ) const noexcept
{
return values[ index ];
}
reference operator[]( size_t index ) noexcept
{
return values[ index ];
}
const_pointer data( ) const noexcept
{
return values.data( );
}
pointer data( ) noexcept
{
return values.data( );
}
constexpr size_t size( ) const noexcept
{
return Size;
}
const_reference front( ) const noexcept
{
return values.front( );
}
reference front( ) noexcept
{
return values.front( );
}
const_reference back( ) const noexcept
{
return values.back( );
}
reference back( ) noexcept
{
return values.back( );
}
const_iterator begin( ) const noexcept
{
return values.begin( );
}
const_iterator cbegin( ) const noexcept
{
return values.cbegin( );
}
iterator begin( ) noexcept
{
return values.begin( );
}
const_iterator end( ) const noexcept
{
return values.end( );
}
const_iterator cend( ) const noexcept
{
return values.cend( );
}
iterator end( ) noexcept
{
return values.end( );
}
const_reverse_iterator rbegin( ) const noexcept
{
return values.rbegin( );
}
reverse_iterator rbegin( ) noexcept
{
return values.rbegin( );
}
const_reverse_iterator rend( ) const noexcept
{
return values.rend( );
}
reverse_iterator rend( ) noexcept
{
return values.rend( );
}
The above provides a reasonable level of integration with the standard C++ library.
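For example, the iterator interface is enough for the standard algorithms to operate directly on the hypothetical Point3f type used in the earlier sketches:
#include <numeric>
#include <algorithm>

Point3f p( 1.f, 2.f, 3.f );
float sum = std::accumulate( p.begin( ), p.end( ), 0.f );   // 6.f
std::ranges::fill( p, 0.f );                                // p is now ( 0, 0, 0 )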
Calling the Assign member functions from derived classes can be more convenient than calling Base::operator = ( arrayOfData ):
void Assign( value_type xv, value_type yv, value_type zv ) noexcept
{
x = xv;
y = yv;
z = zv;
}
void Assign( const ArrayType& src ) noexcept
{
values = src;
}
void Assign( SIMDType src ) noexcept
{
values = Traits::ToArray( src );
}
There are also member functions to check for NaN or infinity:
bool HasNaN( ) const noexcept
{
return std::isnan( x ) || std::isnan( y ) || std::isnan( z );
}
bool IsFinite( ) const noexcept
{
return std::isfinite( x ) && std::isfinite( y ) && std::isfinite( z );
}
bool IsInfinite( ) const noexcept
{
return std::isinf( x ) || std::isinf( y ) || std::isinf( z );
}
};
The library defines two internal concepts, Internal::TupleType and Internal::SimdType. Tuple2, Tuple3 and Tuple4 match the Internal::TupleType concept, while any TupleSimd derived type matches the Internal::SimdType concept. In the code below, Internal::IsCompatible<T, U> is used to verify that T and U are compatible types using the same SIMD::Traits specialization.
The first overload is used when both arguments are compatible TupleSimd based objects, so the implementation can perform the addition without loading any values into a SIMD datatype/register:
template<Internal::SimdType T, Internal::SimdType U>
requires Internal::IsCompatible<T,U>
inline T operator + ( const T& lhs, const U& rhs ) noexcept
{
using Traits = typename T::Traits;
return Traits::Add( lhs.simd, rhs.simd );
}
The second overload accepts a TupleSimd for the left hand argument and any compatible Tuple2, Tuple3 or Tuple4 as its right hand argument. The implementation loads the data from the tuple type before performing the addition, returning T, which is a TupleSimd based type:
template<Internal::SimdType T, Internal::TupleType U>
requires Internal::IsCompatible<T, U>
inline T operator + ( const T& lhs, const U& rhs ) noexcept
{
using Traits = typename T::Traits;
return Traits::Add( lhs.simd, Traits::Load( rhs.values.data( ) ) );
}
The third overload accepts the same argument types as the second, but in the opposite order: a Tuple2, Tuple3 or Tuple4 as the left hand argument and a TupleSimd as the right hand argument. This time the contents of lhs are loaded before performing the addition:
template<Internal::TupleType U, Internal::SimdType T>
requires Internal::IsCompatible<T, U>
inline T operator + ( const U& lhs, const T& rhs ) noexcept
{
using Traits = typename T::Traits;
return Traits::Add( Traits::Load( lhs.values.data( ) ), rhs.simd );
}
The fourth, and last, overload accepts a compatible Tuple2, Tuple3 or Tuple4 for both the left hand side and the right hand side of the addition, loading the data for both arguments before performing the addition:
template<Internal::TupleType T, Internal::TupleType U,
typename ResultT = typename T::Simd>
requires Internal::IsCompatible<T, U>
inline ResultT operator + ( const T& lhs, const U& rhs ) noexcept
{
using Traits = typename T::Traits;
return Traits::Add( Traits::Load( lhs.values.data( ) ),
Traits::Load( rhs.values.data( ) ) );
}
Loading to and storing from the SIMD types is done using the values field of the Tuple2, Tuple3 and Tuple4 types, so they can all share the same operator and function implementations:
template<Internal::SimdType T>
inline T Round( const T& t ) noexcept
{
using Traits = typename T::Traits;
return Traits::Round( t.simd );
}
template<Internal::TupleType T, typename ResultT = typename T::Simd>
inline ResultT Round( const T& t ) noexcept
{
using Traits = typename T::Traits;
return Traits::Round( Traits::Load( t.values.data( ) ) );
}
The operators and functions nearly always return a TupleSimd based type, ensuring that data is only loaded into a SIMD type when necessary.
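A sketch of how this plays out for a chained expression, again using the hypothetical Point3f type:
Point3f a( 1.f, 2.f, 3.f );
Point3f b( 4.f, 5.f, 6.f );
// a + b selects the Tuple/Tuple overload and returns a TupleSimd; the second '+'
// then selects the Simd/Tuple overload, so the intermediate result stays in a
// SIMD register. The assignment to c converts back using Traits::ToArray.
Point3f c = a + b + a;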