> Was there any particular reason why you chose 16 bytes instead of 32 bytes? Like maybe targeting AVX but not AVX2?? (Then again, I see a rare ymm register here and there... sooo... I guess I'm just a bit confused with the design decision)
Yeah, my target use case rarely had strings longer than 16 bytes (which is quite common for prefix matching problems in the real world), so 16 made the most sense. I'll probably do AVX2 and AVX-512 versions down the track; with 32-byte and 64-byte registers to play with, I'll be able to do some more interesting things (either longer prefix strings, or more comparisons).
> * SSE4.2 has string-specific instructions. Did you look into any of the SSE4.2 string instructions to see if they'd help in your case? (Ironically, forcing you back down to xmm registers, since it's SSE only. So kind of a contradiction with my first question, lol) I fully admit that I don't actually know how to use the SSE4.2 string instructions, so don't look into my question too hard.
Instructions like pcmpistri were considered and explicitly rejected :-) They are surprisingly slow, and they only process 16 bytes at a time, so you don't get the implicit benefit on longer strings that you'd expect if you could do something like a `rep pcmpistri` (along the lines of the accelerated `rep stosq`, etc.).
The first twelve instructions of the negative match logic execute in about 6 CPU cycles. A single pcmpistri-type instruction will often clock in at around 7-14 cycles. So, with my approach, using "basically free" instructions like vpcmpeqb and vpcmpgtb, I can detect that my input string has no prefix match in a table of 16 prefixes in ~6 cycles, which is pretty neat.
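
To give a feel for why the cheap compare instructions make the negative case so fast, here's a rough sketch in C with SSE2 intrinsics. It is not my actual routine, just a simplified illustration: the `prefix_first_chars` struct, the `build_first_chars` and `candidate_mask` helpers, and the "one first character per slot" layout are all made up for this example, and it assumes every prefix is non-empty. The idea is that a single broadcast, a single `pcmpeqb` and a single `pmovmskb` are enough to rule out all 16 prefixes at once.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: byte i holds the first character of prefix i.
   Unused slots hold 0 (assumes no prefix is the empty string). */
typedef struct {
    uint8_t first_chars[16];
} prefix_first_chars;

/* Pack the first character of up to 16 prefixes into one 16-byte lane. */
static prefix_first_chars build_first_chars(const char *const *prefixes,
                                            size_t count)
{
    prefix_first_chars t;
    assert(count <= 16);
    memset(t.first_chars, 0, sizeof t.first_chars);
    for (size_t i = 0; i < count; i++)
        t.first_chars[i] = (uint8_t)prefixes[i][0];
    return t;
}

/* Returns a 16-bit mask; bit i is set when prefix i starts with the same
   character as the input. A result of 0 is the negative match: no prefix
   in the table can possibly match, so the caller can bail out early. */
static unsigned candidate_mask(const prefix_first_chars *t, const char *input)
{
    if (input[0] == '\0')
        return 0;  /* empty input would otherwise match the unused slots */

    __m128i table = _mm_loadu_si128((const __m128i *)t->first_chars);
    __m128i first = _mm_set1_epi8(input[0]);      /* broadcast first byte */
    __m128i eq    = _mm_cmpeq_epi8(table, first); /* pcmpeqb: 16 compares */
    return (unsigned)_mm_movemask_epi8(eq);       /* pmovmskb: to bitmask */
}
```

A caller would build the table once, then test `candidate_mask(&t, str) == 0` before doing any real comparison work. Only the set bits identify slots worth a full prefix compare.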