Hacker News

This introduced me to the idea of load value predictors. Is Apple the only chip designer using these in commercially released microarchitecture?


In many CPU ISAs, load value predictors are unlikely to be useful, because the value that will be loaded cannot be guessed with acceptable probability.

The ARM ISA, like other ISAs with fixed-length instruction encoding, is an exception. Because instructions have a fixed length, typically 32 bits, most constants cannot be embedded in the instruction encoding.

As a workaround, when programming for such ISAs, the constants are stored in constant pools that are close to the code for the function that will use them, and the load instructions load the constants using program-counter-relative addressing.

Frequently such constants must be reloaded from the constant pool, which allows the load value predictor to predict the value based on previous loads from the same relative address.

In contrast to the Apple ARM CPUs, for x86-64 CPUs it is very unlikely that a load value predictor can be worthwhile, because constants are immediate values that are loaded directly into registers or used directly as operands. There is no need for constants stored outside the function code, which might be reloaded multiple times and thereby enable prediction.

All fast CPUs can forward the stored data from the store buffer to subsequent loads from the same address, instead of waiting for the store to be completed in the external memory. This is not load value prediction.


> for x86-64 CPUs it is very unlikely that a load value predictor can be worthwhile

I think you're making a good point about immediate encodings probably making ARM code more amenable to LVP, but I'm not sure I totally buy this statement.

If you take some random x86 program, chances are there are still many loads that are very very predictable. There's a very recent ISCA'24 paper[^1] about this (which also happens to be half-attributed to authors from Intel PARL!):

> [...] we first study the static load instructions that repeatedly fetch the same value from the same load address across the entire workload trace. We call such a load global-stable.

> [..] We make two key observations. First, 34.2% of all dynamic loads are global-stable. Second, the fraction of global-stable loads are much higher in Client, Enterprise, and Server work-loads as compared to SPEC CPU 2017 workloads.

[^1]: https://arxiv.org/pdf/2406.18786


Unfortunately, what you say is true for many legacy programs, but it is a consequence of the programs not being well structured by the programmer, or not being well optimized by the compiler, or of a defect of the ISA other than the lack of big immediate constants.

Some of the global-stable values are reloaded because the ISA does not provide enough explicitly-addressable registers, even though a modern CPU core may have 10 to 20 times more physical registers available, which could be used to hold the global-stable values.

This is one of the reasons why Intel wants to double the number of general-purpose directly addressable registers from 16 to 32 in the future Diamond Rapids CPU (the APX ISA extension).

In other cases the code is not well structured and it repeatedly tests some configuration options. This could be avoided by properly partitioning the code paths, eliminating the slow tests and reducing execution time, even at the price of a slight code-size expansion (similar to the effect of function inlining or loop unrolling).

Sometimes the use of such global-stable values could have been avoided by moving the evaluation of some expressions to compile time, possibly combined with dynamically loading executable objects compiled for the different configurations.

So I have seen many cases of such global-stable values being used, even for CPU ISAs that do not force their use, but almost none of them were justified. Improving such programs at programming time or at compile time would have yielded greater performance improvements, at a lower energy cost, than implementing a load-value predictor in the CPU.


I think you're underestimating the amount of pointer chasing that many types of code have to do. B-tree traversal for filesystems, mark loops for garbage collection, and sparse graph traversal are all places where you're doing a lot of pointer chasing.


Thank you, fantastic answer.

I do wonder if there are other common code patterns that a practical LVP could exploit. One that comes to mind immediately are effectively constants at one remove: Think processing a large array of structs with long runs of identical values for some little-used parameter field. Or large bitmasks that are nearly all 0xFF or 0x00.


Probably not, but I don't think anyone has talked about it explicitly.

Otherwise, there are known examples of related-but-less-aggressive optimizations for resolving loads early. I'm pretty sure both AMD[^1] and Intel[^2] have had predictive store-to-load forwarding.

edit: Just noticed the FLOP paper also has a nice footnote about distinguishing LVP from forwarding during testing (i.e. you want to drain your store queue)!

[^1]: https://www.amd.com/content/dam/amd/en/documents/processor-t...

[^2]: https://www.intel.com/content/www/us/en/developer/articles/t...


> I'm pretty sure both AMD[^1] and Intel[^2] have had predictive store-to-load forwarding.

IIRC this was how Spectre Variant 4 worked.


From doing some work on GC a couple years ago: at that time Apple was the only one with it. The performance is awesome; it makes graph traversal ~2x faster.



