C is not suited to SIMD (2019)

by zetalyrae on 1/23/25, 9:01 PM with 120 comments
by Const-me on 1/27/25, 3:45 PM

I would rather conclude that automatic vectorizers are still less than ideal, even though SIMD instructions have been widely available in commodity processors for 25 years now.

The language is actually great with SIMD, you just have to do it yourself with intrinsics, or use libraries. BTW, here’s a library which implements 4-wide vectorized exponent functions for FP32 precision on top of SSE, AVX and NEON SIMD intrinsics (MIT license): https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...
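
For a flavour of what that looks like (a hypothetical sketch, not code from DirectXMath; `scale_add` and the no-tail-handling assumption are mine), a 4-wide SSE loop is just ordinary C:

```C
#include <stddef.h>
#include <xmmintrin.h>  // SSE intrinsics

// Compute dst[i] = a[i] * s + b[i], four floats per iteration.
// Assumes n is a multiple of 4; a real version would handle the tail.
void scale_add(float *dst, const float *a, const float *b, float s, size_t n) {
    __m128 vs = _mm_set1_ps(s);                      // broadcast s into all 4 lanes
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             // unaligned 4-wide loads
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(_mm_mul_ps(va, vs), vb));
    }
}
```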

by AnimalMuppet on 1/23/25, 9:11 PM

The argument seems to be that C isn't suited to SIMD because some functions (exp, for instance) are not. But it then talks about how exp is implemented in hardware, in a way that forces it to be a non-SIMD calculation. That's not a C issue, that's a CPU issue. No language can save you there (unless it uses a software-only function for exp).

Is this article really confused, or did I misunderstand it?

The thing that makes C/C++ a good language for SIMD is how easily it lets you control memory alignment.
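
For example (a minimal sketch; `buf` and `zero_buf` are invented names), C11's `alignas` lets you guarantee that the aligned load/store forms are safe to use:

```C
#include <stdalign.h>
#include <stddef.h>
#include <xmmintrin.h>

alignas(16) static float buf[1024];        // storage pinned to a 16-byte boundary

void zero_buf(void) {
    __m128 z = _mm_setzero_ps();
    for (size_t i = 0; i < 1024; i += 4)
        _mm_store_ps(buf + i, z);          // aligned store; would fault on unaligned memory
}
```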

by PaulHoule on 1/23/25, 9:10 PM

So are those functional languages. The really cool SIMD code I see people writing now is by people like Lemire, using the latest extensions to do clever things like decoding UTF-8, which I think will always take assembly language to express unless you get an SMT solver to write it for you.

by npalli on 1/27/25, 5:18 PM

C++ does OK.

Google's highway [1]

Microsoft DirectXMath [2]

[1] https://github.com/google/highway

[2] https://github.com/microsoft/DirectXMath

by mlochbaum on 1/27/25, 1:47 PM

Several comments seem confused about this point: the article is not about manual SIMD, which of course C is perfectly capable of with intrinsics. It's discussing problems in compiling architecture-independent code to SIMD instructions, which in practice C compilers very often fail to do (so, exp having scalar hardware support may force other arithmetic not to be vectorized). An alternative mentioned is array programming, where operations are run one at a time on all the data; these languages serve as a proof of concept that useful programs can be run in a way that uses SIMD nearly all the time it's applicable, but at the cost of having to write every intermediate result to memory. So the hope is that "more fluent methods for compilation" can generate SIMD code without losing the advantages of scalar compilation.
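
A tiny sketch of that trade-off (hypothetical code, not taken from any array-language implementation): the array style is trivially SIMD-able but materializes the intermediate `t`, while the fused loop avoids the memory traffic and instead bets on the autovectorizer.

```C
#include <stddef.h>

// Array style: each primitive is a whole-array loop, easy to run as SIMD,
// but the intermediate t[] is written to and read back from memory.
void axpy_array(float *out, float *t, const float *x, const float *y, float a, size_t n) {
    for (size_t i = 0; i < n; i++) t[i] = a * x[i];
    for (size_t i = 0; i < n; i++) out[i] = t[i] + y[i];
}

// Fused scalar style: no intermediate buffer, but now the compiler must prove
// that vectorizing the combined loop body is safe and profitable.
void axpy_fused(float *out, const float *x, const float *y, float a, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a * x[i] + y[i];
}
```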

As an array implementer I've thought about the issue a lot and have been meaning to write a full page on it. For now I have some comments at https://mlochbaum.github.io/BQN/implementation/versusc.html#... and the last paragraph of https://mlochbaum.github.io/BQN/implementation/compile/intro....

by commandlinefan on 1/27/25, 8:35 PM

This seems like more of a generalization of Richard Stevens's observation:

"Modularity is the enemy of performance"

If you want optimal performance, you have to collapse the layers. Look at DeepSeek, for example.

by notorandit on 1/27/25, 1:25 PM

C was invented when CPUs, the few that were available, were just single-core and single-threaded.

Wanna SMP? Use multi-threading libraries. Wanna SIMD/MIMD? Use (inline) assembler functions. Or design your own language.

by camel-cdr on 1/27/25, 2:40 PM

This depends entirely on compiler support. Intel's ICX compiler can easily vectorize a sigmoid loop by calling SVML's vectorized expf function: https://godbolt.org/z/no6zhYGK6

If you implement a scalar expf in a vectorizer-friendly way, and it's visible to the compiler, then it can also be vectorized: https://godbolt.org/z/zxTn8hbEe
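
For reference, the kind of loop in question looks roughly like this (a sketch, not necessarily identical to the godbolt snippets):

```C
#include <math.h>
#include <stddef.h>

// Whether this becomes SIMD code depends entirely on the compiler having
// (or being handed) a vectorizable expf, e.g. ICX calling into SVML.
void sigmoid(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = 1.0f / (1.0f + expf(-in[i]));
}
```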

by svilen_dobrev on 1/27/25, 4:01 PM

reminds me of this old article.. ~2009, Sun/MIT - "The Future is Parallel" .. and "sequential decomposition and usual math are wrong horse.."

https://ocw.mit.edu/courses/6-945-adventures-in-advanced-sym...

by vkaku on 1/27/25, 9:16 PM

I also think that vectorizers and compilers can detect parallel memory adds/moves/subs, and where they can't, many people don't take the time to provide adequate hints to the compiler about it.
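
For example, hints of this kind (a hypothetical sketch; `__builtin_assume_aligned` is a GCC/Clang extension and `add_arrays` is an invented name):

```C
#include <stddef.h>

// restrict rules out aliasing between dst and src, and the alignment promise
// lets the autovectorizer use aligned loads/stores without a runtime check.
void add_arrays(float *restrict dst, const float *restrict src, size_t n) {
    float *d = __builtin_assume_aligned(dst, 32);
    const float *s = __builtin_assume_aligned(src, 32);
    for (size_t i = 0; i < n; i++)
        d[i] += s[i];
}
```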

Some people have vectorized successfully with C, even with all the hacks/pointers/union/opaque business. It requires careful programming, for sure. The ffmpeg cases are super good examples of how compiler misses happen, and how to optimize for full throughput in those cases. Worth a look for all compiler engineers.

by Vosporos on 1/27/25, 3:36 PM

Woo, very happy to see more posts by McHale these days!

by dvorack101 on 1/27/25, 7:20 PM

Not my code, but it illustrates how SIMD works.

https://github.com/dezashibi-c/a-simd_in_c

Copyright goes to Navid Dezashibi.

by exitcode0000 on 1/28/25, 3:18 AM

Of course C easily allows you to directly write SIMD routines via intrinsic instructions or inline assembly:

```Ada

  generic
     type T is private;
     Aligned : Bool := True;
  function Inverse_Sqrt_T (V : T) return T;
  function Inverse_Sqrt_T (V : T) return T is
    Result : aliased T;
    THREE         : constant Real   := 3.0;
    NEGATIVE_HALF : constant Real   := -0.5;
    VMOVPS        : constant String := (if Aligned then "vmovaps" else "vmovups");
    begin
      Asm (Clobber  => "xmm0, xmm1, xmm2, xmm3, memory",
           Inputs   => (Ptr'Asm_Input ("r", Result'Address),
                        Ptr'Asm_Input ("r", V'Address),
                        Ptr'Asm_Input ("r", THREE'Address),
                        Ptr'Asm_Input ("r", NEGATIVE_HALF'Address)),                     
           Template => VMOVPS & "       (%1), %%xmm0         " & E & --   xmm0 ← V
                       " vrsqrtps     %%xmm0, %%xmm1         " & E & --   xmm1 ← Reciprocal sqrt of xmm0
                       " vmulps       %%xmm1, %%xmm1, %%xmm2 " & E & --   xmm2 ← xmm1 \* xmm1
                       " vbroadcastss   (%2), %%xmm3         " & E & --   xmm3 ← NEGATIVE_HALF
                       " vfmsub231ps  %%xmm2, %%xmm0, %%xmm3 " & E & --   xmm3 ← (V - xmm2) \* NEGATIVE_HALF
                       " vbroadcastss   (%3), %%xmm0         " & E & --   xmm0 ← THREE
                       " vmulps       %%xmm0, %%xmm1, %%xmm0 " & E & --   xmm0 ← THREE \* xmm1
                       " vmulps       %%xmm3, %%xmm0, %%xmm0 " & E & --   xmm0 ← xmm3 \* xmm0
                       VMOVPS & "     %%xmm0,   (%0)         ");     -- Result ← xmm0
      return Result;
    end;

  function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_2D, Aligned => False);
  function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_3D, Aligned => False);
  function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_4D); 
```

```C

  vector_3d vector_inverse_sqrt(const vector_3d* v) {
  ...
  vector_4d vector_inverse_sqrt(const vector_4d* v) {
    vector_4d out;
    static const float THREE = 3.0f;          // 0x40400000
    static const float NEGATIVE_HALF = -0.5f; // 0xbf000000

    __asm__ (
        // Load the input vector into xmm0
        "vmovaps        (%1), %%xmm0\n\t"
        "vrsqrtps     %%xmm0, %%xmm1\n\t"
        "vmulps       %%xmm1, %%xmm1, %%xmm2\n\t"
        "vbroadcastss   (%2), %%xmm3\n\t"
        "vfmsub231ps  %%xmm2, %%xmm0, %%xmm3\n\t"
        "vbroadcastss   (%3), %%xmm0\n\t"
        "vmulps       %%xmm0, %%xmm1, %%xmm0\n\t"
        "vmulps       %%xmm3, %%xmm0, %%xmm0\n\t"
        "vmovups      %%xmm0,   (%0)\n\t"                         // Output operand
        :
        : "r" (&out), "r" (v), "r" (&THREE), "r" (&NEGATIVE_HALF) // Input operands
        : "xmm0", "xmm1", "xmm2", "memory"                        // Clobbered registers
    );

    return out;
  }
```

by jiehong on 1/27/25, 2:35 PM

It seems that Zig is well suited for writing SIMD [0].

If only GPU makers could standardise an extended ISA like AVX on CPUs, then we could all run SIMD or SIMT code without needing any libraries, just our compilers.

[0]: https://zig.guide/language-basics/vectors/

by musicale on 1/28/25, 6:15 AM

That's why CUDA never went anywhere. /s ;-)

(Oh, that's SIMT. Carry on then.)

by TinkersW on 1/27/25, 1:40 PM

This reads like gibberish.

C functions can't be vectorized? WTF are you talking about? You can certainly pass vector registers to functions.

Exp can also be vectorized; AVX-512 even includes specific instructions to make it easier. (There is no direct exp instruction on most hardware; it is generally a sequence of instructions.)
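
For instance, something like this (a sketch; `mul_add_ps` is an invented name) compiles to a handful of register-to-register AVX instructions:

```C
#include <immintrin.h>

// __m256 parameters and return values are ordinary C types; with the SysV
// x86-64 ABI (or __vectorcall on Windows) they travel in YMM registers.
static inline __m256 mul_add_ps(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);   // a*b + c, 8 floats at a time
}
```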