Imo the most "unhinged" CPUs for AVX-512 are early batches of Alder Lake, which is the only CPU family with nearly full coverage of all existing AVX-512 subsets.
It's a shame that Intel seemed to really not want people to use it, given they started disabling the ability to use it in future microcode, and fused it off in later parts.
> It's a shame that Intel seemed to really not want people to use it
AVX-512 was never part of the specification for those CPUs. It was never advertised as a feature or selling point. You had to disable the E cores to enable AVX-512, assuming your motherboard even supported it.
Alder Lake AVX-512 has reached mythical status, but I think the number of people angry about it is far higher than the number of people who ever could have taken advantage of it and benefitted from it. For general purpose workloads, having the E cores enabled (and therefore AVX-512 disabled) was faster. You had to have an extremely specific workload that didn't scale well with additional cores and also had hot loops that benefitted from AVX-512, which was not very common.
So you're right: They never wanted people to use it. It wasn't advertised and wasn't usable without sacrificing all of the E cores and doing a lot of manual configuration work. I suspect they didn't want people using it because they never validated it. AVX-512 mode increased the voltages, which would impact things like failure rate and warranty returns. They probably meant to turn it off but forgot in the first versions.
They had to disable AVX-512 only because Microsoft was too lazy to rewrite their thread scheduler to handle heterogeneous CPU cores.
The Intel-AMD x86-64 architecture is full of horrible things, starting with System Management Mode, added in 1990. These things were added by Intel only because, time after time, Microsoft refused to update Windows, expecting the hardware vendors to do Microsoft's work of keeping Windows running on newer hardware, even when that caused various disadvantages for the customers.
Moreover, even if Intel never said that Alder Lake would support AVX-512, they also never said that the P-cores of Alder Lake would not support it.
Therefore everybody expected that Intel would continue to provide backward compatibility, as it always had before, and that the P-cores of Alder Lake would continue to support any instruction subset that had been supported by Rocket Lake, Tiger Lake, Ice Lake, and Cannon Lake.
The failure to be compatible with their previous products was a surprise to everybody.
Windows can work without SMM, especially NT - the problem is that SMM was created for a world where the majority used DOS, and the idea of using OS services instead of every possible quirk of the IBM PC was anathema to developers.
Thus, SMM, because there was no other way to hook power management on a 386 laptop running "normal" DOS.
> Thus, SMM, because there was no other way to hook power management on a 386 laptop running "normal" DOS.
In theory, there was: you could have a separate microcontroller, accessed through some of the I/O ports, doing the power management; that's mostly how it's done nowadays, with the EC (Embedded Controller) on laptops (and there's also the PSP or ME, a separate processor core that does startup and power management for the main CPU cores). But back then, it would have been more expensive (a whole other chip) than simply adding an extra mode to the single CPU core (multiple cores back then usually required multiple CPU chips).
The problem is reliably interrupting the CPU in a way that didn't require extra OS support. SMM provided such a trigger, and in fact it is generally used as part of a scheme in which the EC cooperates.
If Windows could work without SMM, is there a historical reason why SMM didn't just die and become disused after Windows became popular and nobody used DOS any more? There are plenty of features in x86 that are disused.
The feature turned out to be too useful for all sorts of things, including dealing with the fact that, before NT loaded itself, you still had to emulate being an IBM PC, including the fiction of booting from cassette tape or jumping to ROM BASIC.
Also, it's been cheaper to implement various features through a small piece of code instead of adding a separate MCU to handle them, including prosaic things like handling NVRAM storage for variables (instead of interacting with an external MCU or having separate NVRAM, you end up with SMM code being "trusted" to update the homogeneous flash chip that contains both the NVRAM and the boot code).
I don't know if I'd call Microsoft lazy. Are there any existing operating systems that allow preemptive scheduling across cores with different ISA subsets? I'd sort of assume Microsoft research has a proof of concept for something like that but putting it into a production OS is a different kettle of fish.
> the P-cores of Alder Lake will continue to support any instruction subset that had been supported by Rocket Lake and Tiger Lake and Ice Lake and Cannon Lake
Wait. I thought the article says only Tiger Lake supports the vp2intersect instruction. Is that not true then?
Tiger Lake is the only one with vp2intersect, but before Alder Lake there had already been 3 generations of consumer CPUs with AVX-512 support (Cannon Lake in 2018/2019, Ice Lake in 2019/2020 and Tiger Lake + Rocket Lake in 2020/2021).
So it was expected that any future Intel CPUs will remain compatible. Removing an important instruction subset has never happened before in Intel's history.
Only AMD has removed instructions, in the transition from the 32-bit ISA to the 64-bit ISA, and most of those were obsolete (except that removing interrupt on overflow was bad, and it does not greatly simplify a CPU core, since there are many other sources of precise exceptions that must still be supported; the only important effect of removing INTO is that many instructions can be retired earlier than otherwise, which reduces the risk of filling up the retirement queue).
The reason you had to disable the E cores was... also an artificial barrier imposed by Intel. Enabling AVX-512 only looks like a problem when inside that false dichotomy. You can have both with a bit of scheduler awareness.
The problem with the validation argument is that the P-cores were advertising AVX-512 via CPUID with the E-cores disabled. If the AVX-512 support was not validated and meant to be used, it would not have been a good idea to set that CPUID bit, or even allow the instructions to be executed without faulting. It's strange that it launched with any AVX-512 support at all and there were rumors that the decision to drop AVX-512 support officially was made at the last minute.
As for the downsides of disabling the E-cores, there were Alder Lake SKUs that were P-core only and had no E-cores.
Not all workloads are widely parallelizable and AVX-512 has features that are also useful for highly serialized workloads such as decompression, even at narrower than 512-bit width. Part of the reason that AVX-512 has limited usage is that Intel has set back widespread adoption of AVX-512 by half a decade by dropping it again from their consumer SKUs, with AVX10/256 only to return starting in ~2026.
I use AVX-512 on Alder Lake and it does not increase the voltage above AVX2 voltages, and even then power dissipation is considerably lower than with AVX2.
https://www.reddit.com/r/rpcs3/comments/tqt1ko/clearing_up_s...
I think I have two of these sitting in a box, one prototype with avx512 and one retail without. Is it worth me breaking these out for ML experiments and such?
Do they cover anything Sapphire Rapids Xeons don't? I thought they share the same arch (Golden Cove).
According to this Wikipedia article [1], the only feature Sapphire Rapids doesn't support is VP2INTERSECT.
[1]: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
It seems that there are faster alternatives to it:
https://arxiv.org/abs/2112.06342
https://www.reddit.com/r/asm/comments/110pld0/fasterthannati...
Note that the article mentions using both outputs of the instruction, whereas the emulation is only able to compute one output efficiently.
Or Zen 5. :-p
Yes, you are right; I meant "consumer-grade CPU".
What about Zen5?
AMD advertised and enabled AVX-512 on all Zen 5 CPUs. You have to resort to workarounds to get AVX-512 working on Alder Lake.
Spoiler if you don’t want to read through the (wonderful but many) paragraphs of exposition: the instruction is `vp2intersectq k, zmm, zmm`.
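For the curious, the matching intrinsic is _mm512_2intersect_epi64. A minimal sketch of what the instruction computes (assuming a CPU with AVX512_VP2INTERSECT and a compiler flag like -mavx512vp2intersect):

    #include <immintrin.h>
    #include <stdint.h>

    /* vp2intersectq produces two masks at once: bit i of k1 is set when
       a[i] equals any element of b, and bit j of k2 when b[j] equals any
       element of a. */
    static void intersect_8x8(const uint64_t *pa, const uint64_t *pb,
                              __mmask8 *k1, __mmask8 *k2)
    {
        __m512i a = _mm512_loadu_si512((const void *)pa);
        __m512i b = _mm512_loadu_si512((const void *)pb);
        _mm512_2intersect_epi64(a, b, k1, k2);  /* emits vp2intersectq */
    }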
And, as noted in the article, that's an instruction which only works on two desktop CPU architectures (Tiger Lake and Zen 5), including one where it's arguably slower than not using it (Tiger Lake).
Meaning... this entire effort was for something that's faster on only a single kind of CPU (Zen 5).
This article is honestly one of the best I've read in a long time. It's esoteric and the result is 99.5% pointless objectively, but in reality it's incredibly useful and a wonderful guide to low-level x86 optimization end to end. The sections on cache alignment and uiCA + analysis notes are a perfect illustration of "how it's done."
Presumably Zen 5 cores will also get used in Threadripper and EPYC processors.
Yep. And the feature will probably be available on all AMD CPUs manufactured from here on.
It might be an esoteric feature today. But if it becomes a ubiquitous feature in a few years, it's nice to learn about using it now.
Not just that, but the fact that Intel CPUs execute it 20-30 times slower than AMD Zen 5 CPUs.
Also, the fact that it's deprecated by Intel.
Now the question is if Intel will revive it now that Zen 5 has it.
Fascinating blog post. Having said that, it may seem like nitpicking but I have to take issue with the point about recursion, which is often far too easily blamed for inefficiency.
The blog post mentions it as one of the reasons for the inefficiency of the conventional algorithm.
A glance at the algorithm shows that the recursion in question is a tail call. This means that any overhead can be readily eliminated using a technique known for nearly fifty years already.
Steele, Guy Lewis (1977). "Debunking the 'expensive procedure call' myth, or, procedure call implementations considered harmful, or, LAMBDA: The Ultimate GOTO". Proceedings of the 1977 Annual Conference (ACM '77).
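As a minimal illustration (a hypothetical accumulator-style function, not the post's actual code), tail-call elimination turns the recursive call into a back-edge, so stack usage stays constant:

    /* Hypothetical example: a tail-recursive sum. */
    long sum_rec(const long *xs, long n, long acc)
    {
        if (n == 0) return acc;
        return sum_rec(xs + 1, n - 1, acc + xs[0]);  /* tail call */
    }

    /* What tail-call elimination effectively produces: */
    long sum_iter(const long *xs, long n, long acc)
    {
        for (; n != 0; --n, ++xs)
            acc += xs[0];
        return acc;
    }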
A lot of modern programming languages do not do tail call optimization, often citing keeping accurate stack history for debugging as an excuse.
Regardless of how valid the excuse is, for such an obvious and old optimization, it’s very poorly supported.
The main problem with tail call optimization is that it's unreliable; small, apparently unrelated changes elsewhere in the function, a difference in the compiler command-line flags, or a different compiler version could all make a tail call become a non-tail call. Some languages and compilers have added explicit markers to force a call to be a tail call (and generate a compilation error if it can't be one); nothing has been standardized in C or C++ yet, but Clang ships such a marker as an extension.
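A sketch of Clang's extension, which makes the compiler either emit a guaranteed tail call or reject the code:

    /* Clang-only: compilation fails if this cannot be a tail call. */
    long sum_must(const long *xs, long n, long acc)
    {
        if (n == 0) return acc;
        __attribute__((musttail)) return sum_must(xs + 1, n - 1, acc + xs[0]);
    }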
Dumb question: does modern stack layout randomization affect the efficiency of recursion? On first glance I would be worried about cache misses.
Not specifically address space layout randomization in the way it's usually implemented; ASLR as applied in most modern production OSes randomizes each stack's base address, but each stack is still laid out and used in a normal way. There are some research projects towards actual stack layout randomization (involving stack rearrangement via static analysis, randomly sized stack padding frames, and other techniques) which would also definitely blow up cache, but none that are mainstream in a production system as far as I know.
However, for the naive case where the recursion is a full-blown function call, without some kind of optimization, security mitigations other than ASLR will significantly affect the efficiency of recursion by adding function call overhead (and possible cache side effects) - for example, the stack cookie will still be verified, and control-flow guard checks and the shadow/return stack will still be in play, if present.
What a post. It should have taken a week just to write it, never mind the amount of time it took to actually come up with all this stuff and overcome all the obstacles mentioned. What a dedication to improving the performance of phrase search.
I think you meant to write "it must have" rather than "it should have"?
Let's agree on "It likely has".
> Why are you merging up to one rare token at the beginning or at the end? Let’s consider that someone searched for C_0 R_1 C_2 C_3. If we don’t do this merge, we would end up searching for C_0, R_1, C_2 C_3, and this is bad. As established, intersecting common tokens is a problem, so it’s way better to search C_0 R_1, C_2 C_3. I learned this the hard way…
But since R_1 C_2 C_3 is in the index as well, instead of searching for C_0 R_1, C_2 C_3 with a distance of 2, you can instead search for C_0 R_1, R_1 C_2 C_3 with a distance of 1 (overlapping), which hopefully means that the lists to intersect are smaller.
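A minimal sketch of that overlapping intersection (a hypothetical helper, assuming each n-gram's postings are sorted word positions): with a distance of 1, both lists are gated by the rare token R_1, so they should be short:

    #include <stddef.h>
    #include <stdint.h>

    /* Counts positions p in `a` such that p + d occurs in `b`,
       via a linear merge of two sorted position lists. */
    static size_t intersect_at_distance(const uint32_t *a, size_t na,
                                        const uint32_t *b, size_t nb,
                                        uint32_t d)
    {
        size_t i = 0, j = 0, hits = 0;
        while (i < na && j < nb) {
            uint32_t want = a[i] + d;
            if      (b[j] < want) j++;
            else if (b[j] > want) i++;
            else { hits++; i++; j++; }
        }
        return hits;
    }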
The most unhinged AVX-512 instruction is GF2P8AFFINEQB.
From my 1980s 8-bit CPU perspective, the instruction is unhinged based solely on the number of letters. Compared to LDA, STA, RTS, that's not an assembler mnemonic, it's a novel. :-)
"Load accumulator" (LDA)
vs
"Galois Field 2^8 affine transform on quad binary words" (GF2P8AFFINEQB)
The compression factor isn't quite the same on character count, but it's still abbreviated. :)
Incidentally, how is it a GF(2^8) affine transform? As best as I can tell, it’s a GF(2)^8 affine transform, i.e. an affine transform of vectors of bits with normal XOR addition and AND multiplication, and the polynomial defining GF(2^8) just does not enter anywhere. It does enter into GF2P8AFFINEINVQB, but I’m having difficulties finding a geometric description for that one at all.
I believe that the polynomial for GF2P8AFFINEQB is user-defined. One argument is an 8x8 matrix in GF(2) and the result is [A.x + b] in GF(2)^8 for each 8-bit section. Don't quote me on this, but I believe that matrix multiply in GF(2)^8 gets you a transform in GF(2^8).
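As best as I can tell from the SDM pseudocode, a scalar model of the forward instruction looks like this - and indeed no reduction polynomial appears anywhere, which supports the GF(2)^8 reading:

    #include <stdint.h>

    /* Per-byte model of GF2P8AFFINEQB: y = A*x XOR b over GF(2), where
       the matrix row for output bit i is byte 7-i of the 64-bit matrix
       operand. */
    static uint8_t gf2_affine_byte(uint64_t A, uint8_t x, uint8_t b)
    {
        uint8_t y = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t row = (uint8_t)(A >> (8 * (7 - i)));
            y |= (uint8_t)(__builtin_parity(row & x) << i);  /* GCC/Clang builtin */
        }
        return y ^ b;
    }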
There's a pretty good list of weird off-label uses for the Galois Field instructions here: https://gist.github.com/animetosho/d3ca95da2131b5813e16b5bb1...
I think I actually need that instruction and have a use case for it; it does something with a matrix transpose, so I might finally have a real-world, useful demonstration of a matrix operation I can cite to people who don't know what those mean.
Here is Knuth introducing the MMIX instruction MXOR, which Intel later defined on vector registers under the name vgf2p8affineqb.
https://www.youtube.com/watch?v=r_pPF5npnio&t=3300 (55:00)
"This is an instruction that doesn't exist in any computer right now, so why should I put it in a machine, if it's supposed to be realistic? Well, it's because it's ahead of time."
MMIX? Now that's something I haven't heard in a long time...
What about GF2P8AFFINEINVQB?
It has a fixed polynomial, so it's not really that useful for anything but AES.
The only case where I've had use for GF(2^8) inverses is in FEC algorithms (Forney's algorithm), and then you need some kind of weird polynomial. But those needs are rarely in the hot path, and the FEC algos are way outdated.
I think the AFFINE and AFFINEINV instructions are specifically for FEC and maybe compression algorithms. I also think they smell like something requested by one of the big customers of Intel (e.g. the government).
Hmm, of course erasure codes would always need to solve these problems. Not sure what modern applications need that in the x86 world.
I really think it's only AES, since that's the only place I've seen that polynomial used. But of course maybe there's an obscure tape-backup FEC algo used somewhere in datacenters?
The forward affine matrix is useful for all sorts of bit manipulation, e.g. something as simple as a bit reversal.
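For instance, a sketch assuming GFNI support (SSE form shown; AVX and AVX-512 forms exist too). The qword constant encodes the 8x8 bit-reversal matrix; 0x0102040810204080 would be the identity:

    #include <immintrin.h>

    /* Reverse the bits of every byte: y = A*x + 0 over GF(2).
       Compile with -mgfni. */
    static __m128i reverse_bits_per_byte(__m128i x)
    {
        const __m128i A = _mm_set1_epi64x((long long)0x8040201008040201ULL);
        return _mm_gf2p8affine_epi64_epi8(x, A, 0);
    }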
potato, potato, tomato, tomato
Sometimes I read through the intrinsics guide just to play the game of spotting instructions defined primarily because certain cryptologic agencies asked for them.
Is the first example under the "The genius idea" heading missing entry #3 below?
I thought it was missing. However, he does introduce it with:
> The inverted index will look something like this:
He isn't wrong. It is indeed "something like".
Very cool blog post, but the fastest phrase search would just use a suffix array, which would make any phrase search take double-digit nanoseconds.
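A sketch of the core lookup that claim rests on (assuming a precomputed suffix array sa holding text offsets in lexicographic suffix order): all matches form one contiguous range found by binary search, O(m log n) in the phrase length m:

    #include <stddef.h>
    #include <string.h>

    /* Index of the first suffix >= phrase; matching suffixes start here
       and form a contiguous range (find its end the same way). */
    static size_t sa_lower_bound(const char *text, size_t tlen,
                                 const size_t *sa, size_t n,
                                 const char *phrase, size_t plen)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            size_t len = tlen - sa[mid];
            size_t k = len < plen ? len : plen;
            int c = memcmp(text + sa[mid], phrase, k);
            if (c < 0 || (c == 0 && len < plen))
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }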