
You're not getting a boost, you're avoiding a penalty. In some (but not all) cases you can avoid the penalty and the exploits by disabling SMT. Remember, SMT isn't twice as many cores, just twice as many half-cores. You'll be fine.
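
For what it's worth, on a reasonably recent Linux kernel you can usually toggle this at runtime rather than in firmware; a sketch, assuming the sysfs interface is exposed on your machine:

    # current state: on / off / forceoff / notsupported
    cat /sys/devices/system/cpu/smt/control
    # take the sibling threads offline until next boot (as root)
    echo off > /sys/devices/system/cpu/smt/control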


Disabling SMT alone isn’t enough to mitigate CPU vulnerabilities. For full protection against issues like L1TF or MDS, you must both enable the relevant mitigations and disable SMT. Mitigations defend against attacks where an attacker executes on the same core after the victim, while disabling SMT protects against scenarios where the attacker runs concurrently with the victim.
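
On Linux the combination you describe roughly corresponds to one kernel command-line parameter; a sketch, assuming a kernel new enough to support it:

    # enable all default mitigations, and disable SMT
    # where the selected mitigations require it
    mitigations=auto,nosmt

    # per-vulnerability status after boot
    grep . /sys/devices/system/cpu/vulnerabilities/*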


In my experience SMT is still faster for most workloads even with the mitigations.


It's a common misunderstanding that the CPU suddenly has a performance envelope twice as large when SMT is enabled. Only specialized software/scenarios will tangibly benefit from the parasitic gains of SMT-induced extra parallelization, e.g. video encoders like x264 or CPU-bound raytracers to name a few examples. These gains typically amount to about 15-20% at the very extreme end. In some cases you'll see a performance drop due to the inherent contention of two "cores" sharing one actual core. If you're stuck with a dual-core CPU for your desktop setup, you should absolutely enable SMT to make your general experience feel a bit more responsive.
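
If anyone wants to sanity-check this on their own workload, one rough method is to pin a run to one thread per physical core and compare it against all logical cores. A hypothetical A/B on an 8-core/16-thread part, assuming CPUs 0-7 are distinct physical cores (the numbering varies, check /sys/devices/system/cpu/cpu*/topology/thread_siblings_list):

    # one thread per physical core
    taskset -c 0-7 x264 --threads 8 -o out.264 input.y4m
    # all logical cores
    x264 --threads 16 -o out.264 input.y4m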


> It's a common misunderstanding that the CPU suddenly has a performance envelope twice as large when SMT is enabled.

Perhaps, but I am not under this misunderstanding and never expressed it.

> Only specialized software/scenarios will tangibly benefit from the parasitic gains of SMT-induced extra parallelization

In my experience it also speeds up C++/Rust compilation, which is the main thing I care about. I can't find any benchmarks now but I have definitely seen a benefit in the past.
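
For anyone who wants numbers for their own codebase, a crude A/B (nproc tracks online CPUs, so -j adjusts by itself once SMT is off):

    time make -j"$(nproc)"                        # SMT on
    echo off | sudo tee /sys/devices/system/cpu/smt/control
    make clean && time make -j"$(nproc)"          # SMT off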


Are you sure about your statement?

> video encoders like x264 or CPU-bound raytracers to name a few examples. These gains typically amount to about 15-20% at the very extreme end.

Normally those types of compute-heavy, data-streamlined processes don’t see much benefit from SMT. After all, SMT only provides a performance benefit by allowing the CPU to pull from two distinct chains of instructions, filling the pipeline gaps from one thread with instructions from the other. It’s effectively instruction-by-instruction scheduling of two different threads.

But if you’re running an optimised and efficient process that doesn’t have significant unpredictable branching or significant unpredictable memory operations, then SMT offers you very little, because the instruction pipeline for each thread is almost fully packed, offering few opportunities to schedule instructions from a different thread.
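
To make the two regimes concrete, here's a minimal C sketch of my own (purely illustrative, not from any real benchmark): a scattered pointer chase that stalls on every load, leaving gaps an SMT sibling can fill, next to a dense arithmetic loop that keeps the execution ports busy on its own:

    /* illustrative only: latency-bound vs throughput-bound loops */
    #include <stdio.h>
    #include <stdlib.h>

    struct node { struct node *next; long val; };

    int main(void) {
        enum { N = 1 << 20 };
        struct node *nodes = malloc(N * sizeof *nodes);

        /* latency-bound: each iteration stalls on a cache-hostile load,
           leaving pipeline gaps a sibling thread could use */
        for (size_t i = 0; i < N; i++) {
            nodes[i].next = &nodes[(i * 48271u) & (N - 1)];
            nodes[i].val = (long)i;
        }
        long sum = 0;
        struct node *p = nodes;
        for (size_t i = 0; i < N; i++) { sum += p->val; p = p->next; }

        /* throughput-bound: independent multiply-adds keep the ports
           nearly full, so a sibling mostly just contends */
        double acc = 0.0, x = 1.0001;
        for (size_t i = 0; i < N; i++) { acc += x * x; x += 1e-9; }

        printf("%ld %f\n", sum, acc);
        free(nodes);
        return 0;
    }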


Compression is inherently unpredictable (if you can predict it, it's not compressed enough), which, vaguely speaking, is how it can help x264.


I agree that compression is all about increasing entropy per bit, which makes the output of a good compressor highly unpredictable.

But that doesn’t mean the process of compression involves significant amounts of unpredictable branching, if for no other reason than that it would be extremely slow and inefficient: many branching operations mean you’re either processing input pixel by pixel, or your SIMD pipeline is full of dead zones that you can’t actually reschedule, because doing so would desync your processing waves.

Video compression is mostly very clever signal processing built on top of primitives like convolutions. You’re taking large blocks of data, and performing uniform mathematical operations over all the data to perform what is effectively statistical analysis of that data. That analysis can then be used to drive a predictor, then you “just” need to XOR the predictor output with the actual data, and record the result (using some kind of variable length encoding scheme that lets you remove most of the unneeded bytes).

But just like computing the median of a large dataset can be done with no branches, regardless of how random or how large the input is, video compression can also largely be done the same way, and indeed has to be done that way to be performant. There’s no other way to cram up to 4k * 3 bytes per frame (~11MB) through a commercial CPU at a reasonable speed. You must build your compressor on top of SIMD primitives, which inherently makes branching extremely expensive (many orders of magnitude more expensive than branching SISD operations).
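
For a feel of what those SIMD primitives look like, here's a minimal sketch of a branchless block cost, the kind of thing motion search is built on. The function name and 16x16 block size are my choices for illustration, not x264's actual code (SSE2, x86):

    #include <emmintrin.h>
    #include <stdint.h>

    /* sum of absolute differences over a 16x16 pixel block:
       no branches, no data-dependent control flow, one PSADBW per row */
    unsigned sad16x16(const uint8_t *a, const uint8_t *b, int stride) {
        __m128i acc = _mm_setzero_si128();
        for (int y = 0; y < 16; y++) {
            __m128i pa = _mm_loadu_si128((const __m128i *)(a + y * stride));
            __m128i pb = _mm_loadu_si128((const __m128i *)(b + y * stride));
            acc = _mm_add_epi64(acc, _mm_sad_epu8(pa, pb));
        }
        /* PSADBW leaves partial sums in both 64-bit halves */
        return (unsigned)(_mm_cvtsi128_si32(acc)
                        + _mm_cvtsi128_si32(_mm_srli_si128(acc, 8)));
    }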


> You’re taking large blocks of data, and performing uniform mathematical operations over all the data to perform what is effectively statistical analysis of that data.

It doesn't behave this way. If you're thinking of the DCT it uses, that's mostly 4x4, which is not very large. As for motion analysis, there are so many possible candidates (since it's on quarter-pixels) that it can't try all of them and very quickly starts trying to filter them out.
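
In other words the hot path looks less like uniform number crunching and more like an early-out search. A hypothetical sketch of that pattern in C (the candidate list, cost stub, and threshold are all stand-ins, not x264's real heuristics):

    typedef struct { int x, y; } mv;

    /* stand-in for a real cost: SAD plus a penalty for motion-vector bits */
    static unsigned block_cost(mv m) { return (unsigned)(m.x * m.x + m.y * m.y); }

    mv pick_mv(const mv *cands, int n, unsigned early_out) {
        mv best = cands[0];
        unsigned best_cost = block_cost(best);
        for (int i = 1; i < n; i++) {
            if (best_cost <= early_out) break;  /* good enough: stop searching */
            unsigned c = block_cost(cands[i]);  /* data-dependent branching */
            if (c < best_cost) { best_cost = c; best = cands[i]; }
        }
        return best;
    }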


> it uses, that's mostly 4x4, which is not very large

That's 16x32, which is AVX512. What other size would you suggest using and (more importantly) what commercially available CPU architecture are you running it on?


4x4 is 4x4, not 16x32?


> Are you sure about your statement

Yes. From actual experience.


It usually speeds up basically everything parallelizable that looks kind of like a parser, lexer, tokenizer, .... Unless somebody goes out of their way to design a format with fewer data dependencies, those workloads are crippled on modern CPUs. That includes (de)compression routines, compilers, protobuf parsing, ....

The only real constraint is that you can actually leverage multiple threads. For protos as an example, that requires a modified version of the format with checkpoints or similar (which nobody does) or having many to work on concurrently (very common in webservers or whatever).
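
A minimal sketch of that second case (the names and the stand-in parser are mine): each message is independent, so the dependency-heavy work inside parse_one never has to wait on another thread's data:

    #include <pthread.h>
    #include <stdio.h>

    #define NMSG     1024
    #define NTHREADS 8                /* e.g. one per logical CPU */

    static const char *msgs[NMSG];    /* pretend: serialized messages */
    static long results[NMSG];

    /* stand-in for a real parser full of serial data dependencies */
    static long parse_one(const char *m) { return m != NULL; }

    static void *worker(void *arg) {
        long id = (long)arg;
        for (long i = id; i < NMSG; i += NTHREADS)  /* static striping */
            results[i] = parse_one(msgs[i]);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("parsed %d messages\n", NMSG);
        return 0;
    }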



