Intrinsics are better than direct assembly because GCC can simplify, combine, and reorder instructions (e.g., lift them out of loops). GCC handles register allocation, etc. Intrinsics dramatically simplify the programmer's job
The GCC codegen is fantastic for NN-512, overall
GCC 8.3 and earlier make a few mistakes, like compiling FNMADD as xor-negation followed by FMADD (using an extra register for the xor-negation constant), but those problems have been fixed in GCC 9.1 and above
The only codegen mistake I see in GCC 10 is when I load 512 bits from memory, convert the low 256 bits from packed-half to packed-single, and do the same thing for the high 256 bits. GCC sometimes reloads the 512 bits from memory (despite still having those bits in a register). It doesn't harm performance much, but it seems dumb. Not sure why GCC does this
GCC can reduce the liveness range of an in-register value by moving the producing instruction and consuming instruction closer together. This can be a big help if you're writing code that just barely fits in the register file. For example, NN-512 produces many loops that use almost all of the 32 ZMM vector registers. GCC generally does a good job avoiding spills, if the programmer doesn't make the job too hard
In my opinion, properly written C intrinsics produce very good AVX-512 machine code, much more easily than if I wrote the assembly by hand. You can write much larger, more complex, fully vectorized programs when GCC helps you
> The problem is that intrinsics are so unreliable that you have to manually check the result on every platform and every compiler you expect your code to be run on, and then tweak the intrinsics until you get a reasonable result. That's more work than just writing the assembly by hand.
Well, that sounds very bad. Have things improved since this article was written? Are intrinsics best avoided?
> Have things improved since this article was written?
Yes.
> Are intrinsics best avoided?
No.
---
If you are writing assembly code by hand, you probably care about the quality of the generated code.
Intrinsics are lower effort, portable, and can and often do generate much better code than using inline assembly.
I disagree with the claim that verifying the assembly output of intrinsics is more work than writing the assembly by hand. In my experience, it is significantly less work.
I also disagree with the post about what to do if an intrinsic doesn't generate good code.
"Tweak the intrinsic until you get a reasonable result" is probably the worst thing you can do, because once the intrinsic is fixed, your tweaks might prevent it from generating good code.
The two things you can and should do are:
- report the bug, so that it gets fixed (1-2 days for clang... fixing an intrinsic is just adding a new "case" in a pattern-matching table, writing down which instructions it should lower to... worst case you can fix it yourself if you know what it should lower to),
- if you can't live with worse code till the next compiler release, use inline assembly.
This second point is super rare. It's just not worth it. Intrinsics are usually fixed in a couple of days if you report the bug, and the fixed intrinsic will be in the next compiler release in a couple of months if you can't use the nightlies. So unless you really need this now, it is often not worth doing.
> Intrinsics are usually fixed in a couple of days
This is only helpful if everyone who builds the software is willing and able to use a bleeding edge version of a particular compiler. By comparison, using inline assembly will fix the problem for everyone, immediately, usually without anyone needing to use a different compiler or compiler version.
I don't think we disagree here. The only bad thing is trying to bend broken intrinsics into half-way doing what you want and complaining that it is too much work.
Don't do that.
---
Also, keep in mind that with inline assembly you often need one implementation per architecture and per compiler, since compilers have slightly different syntax (it's non-standard). I've seen multiple implementations even for the same compiler, depending on the version...
Another perspective is of course that of the embedded developer, a camp I count myself among.
In embedded software, it's not uncommon to have exactly one target for the software (commonly called "the target"). Sometimes the target changes due to components being end-of-lifed or so, but it's rare and slow.
In those situations, I have found intrinsics to be very helpful since they allow you to reason and talk about the software at a higher level (C is, after all, higher than assembly) and without making sure all developers on a team understand the inline assembly syntax. :)
It is still good practice to check the resulting code, especially since, if you're using intrinsics, chances are you're thinking more or less in assembly anyway; but you can do that once and be fairly sure you're getting the desired result.
The resulting code is of course also more portable, which can be helpful when you want to e.g. automate tests of code without external hardware dependencies such as data structures, utility functions, and so on.
He's right about assembly vs. inline assembly (GCC asm). But something well done like Intel Intrinsics [1] specializes an intrinsic for target platforms. It's (a lot) more work on the intrinsic writer's part, but it then provides something of a cross-platform abstraction for the programmer.