Different results between FPU and SIMD

Something unexpected can happen to floating-point calculations when you migrate a legacy project...

About two years ago, I had a chance to migrate an x86-based project to x64.
There wasn't much work to do. Really. It was just a simple tool, so all I needed to do was switch the platform architecture and fix some warnings caused by the different size of size_t.

Everything seemed to work fine, yet I found something that behaved differently than before.
I struggled quite a bit to trace the error, since the C++ code was identical between x86 and x64.

Anyway, I finally found where the error came from in the disassembly view. But first, let's look at the C++ code where the error occurred.

#include <cstdio>

int main(){
    volatile float value = 2048.f;
    volatile float episilon = 0.0001f;

    printf(((value + episilon) > value) ? "true" : "false");

    return 0;
}

On x86 it printed true, but on x64 it printed false.
I declared every variable as volatile only to keep the compiler from folding the values into constants during optimization.

Alright. Let's have a look at the assembly code the x86 compiler generated.

...
fld dword ptr [episilon]
fadd dword ptr [value]
fld dword ptr [value]
fcompp
fnstsw ax
test ah, 5
mov eax, offset string "true" (0BA2100h)
jnp main+34h (0BA1074h)
mov eax, offset string "false" (0BA2108h)
push eax
call printf (0BA1010h)
...

First, episilon is pushed onto the FPU register stack. Then value is added to the top of the stack, so ST(0) now holds value + episilon.
Next, value is pushed onto the stack. fcompp then compares ST(0) (value) with ST(1) (value + episilon) and pops the register stack twice. Depending on the result of the comparison, the condition flags C0, C2 and C3 of the FPU status word are updated.
The fnstsw instruction stores the FPU status word into its destination. Here the register ax holds 0100h, meaning only C0 (bit 8) is set, which is the result when ST(0) < ST(1). test ah, 5 then checks C0 and C2, and jnp skips the "false" assignment when exactly one of the two is set.

Okay. Then what does the x64 assembly look like? Unlike x86, the x64 compiler uses SSE2 by default. SSE2 is part of the x64 baseline, so every x64 processor supports it, and it is generally faster than the FPU.

Anyway, here is what the x64 code looks like.

...
lea rax, [string "true" (07FF7F46B2240h)]
lea rcx, [string "false" (07FF7F46B2248h)]
movss xmm2, dword ptr [episilon]
movss xmm0, dword ptr [value]
movss xmm1, dword ptr [value]
addss xmm2, xmm0
comiss xmm2, xmm1
cmova rcx, rax
call printf (07FF7F46B1010h)
...

As you can see, there is nothing special here.
But one thing I noticed is that after the addss instruction executes, xmm2 holds 2048.0. xmm1 also holds 2048.0, so the result is false.
Logically, this is the right result. 2048.0001 cannot be represented in single precision because the format does not have enough mantissa bits: near 2048, adjacent float values are 2^-12 (about 0.000244) apart, so adding 0.0001 rounds back to 2048.0.
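
The mantissa argument can be checked with a small sketch. It assumes a positive, finite x and the default round-to-nearest mode. ulp_above and is_absorbed are hypothetical helper names, and std::nextafter is the standard <cmath> function.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
float ulp_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Hypothetical helper: a positive epsilon smaller than half of that distance
// is rounded away when it is added to x in single precision.
bool is_absorbed(float x, float epsilon) {
    return epsilon < ulp_above(x) / 2.0f;
}
```

For x = 2048.0f the spacing is 2^-12 = 0.000244140625, and half of that is about 0.000122. So is_absorbed(2048.0f, 0.0001f) is true, while an epsilon of 0.0002f would survive the addition.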

Back to the x86 code: its result was true. That means the FPU compared the values at a higher precision.
So I started searching for information about the FPU and found an interesting description of the fld instruction.

It says:

If the source operand is in single-precision or double-precision floating-point format, it is automatically converted to the double extended-precision floating-point format before being pushed on the stack.

Now everything became clear.
On x86, every floating-point value, regardless of its precision, is automatically converted to the 80-bit double extended-precision format when it is loaded onto the FPU stack, so the sum still carries the small contribution of episilon.
On x64, on the other hand, since I declared the variables as float in the C++ code, the SSE instructions kept single precision in every operation.
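
Both outcomes can be reproduced without depending on which instruction set the compiler picks, by doing the addition in an explicitly chosen type. This is only a sketch: sum_exceeds is a hypothetical helper, and double merely stands in for the FPU's wider intermediate format. It is not the same 80-bit format, but it is wide enough to hold this particular sum exactly.

```cpp
// Add epsilon to value in type T and report whether the sum is greater than value.
template <typename T>
bool sum_exceeds(float value, float epsilon) {
    // volatile forces the sum to be stored, and therefore rounded, as T,
    // even if the compiler evaluates the expression at a higher precision.
    volatile T sum = static_cast<T>(value) + static_cast<T>(epsilon);
    return sum > static_cast<T>(value);
}
```

sum_exceeds<float>(2048.0f, 0.0001f) returns false, like the x64 build, while sum_exceeds<double>(2048.0f, 0.0001f) returns true, like the x86 build.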

So, if you're planning to switch a legacy project from the FPU to SIMD, be aware that floating-point results can change.

By the way, Visual Studio 2019 uses SSE2 by default even for x86 builds. But you can still use the legacy FPU by adding the /arch:IA32 compiler option.