Here is a way to do it with a perfect hash table (so only one lookup).
Find a 128-bit mask that will be used to select bits of the input string. You want to minimize the number of 1 bits in the mask, since the table size is 2^(number of 1 bits). On the other hand, you need to pick enough 1 bits so that there is only one string per table entry.
AND this mask with the input string. COMPRESS these bits using the mask. The result is the index into the table. Each entry has a string to compare and thermometer coded mask (number of 1 bits equals length of string in bits). AND the input string with the thermometer coded mask and compare it with the string. If they match, you have a hit.
COMPRESS (or generalized extract) means to move selected bits to the right so that they are all next to each other and adjacent to the LSB. There is a clever algorithm for this, see section 7-4 of Hacker's Delight:
Excellent! I built a string matcher based on these same principles at Intel while working on Hyperscan but never released it. I'm pleased to see that you have discovered and publicized this yourself as I can say the cat is out of the bag. :-)
You can use PEXT alone without even messing with SIMD as long as there's enough good distinguishing bits in the early (first/last 8) parts of the string. The only thing that complicates this is that, as in the example in the article, sometimes some strings are proper prefixes/suffixes of others (which one is problematic depends on which direction you're using).
But seriously though, I played around with compression a bit initially: https://github.com/tpn/tracer/blob/master/StringTable/String.... Referenced Hacker's Delight when I was working on it. Granted, this was just for compressing the USHORT lengths into BYTE-sized lengths.
All of the compress/expand routines mentioned in that Hacker's Delight section you're quoting seem to mention upwards of 160+ instructions. My negative match logic fast-path only requires 12 instructions.
True. A gperf alike hash table (with special key indices) works only ok with compile-time known strings. But a simple CMP-alike perfect hash table with double indirection would be worthwhile to test against.
I called it Hanov after http://stevehanov.ca/blog/index.php?id=119
Problem is there that you have to calc a hash for each string, which is only fast with __builtin_crc/_mm_crc32_u64
A hash alone isn't going to help much; he has both a 1-char string in the set and to top it off, most of the other strings begin with (a distinct from the 1-char string one) the same string. There are some tricks (overpopulate the hash table with short string) but hashing is tricky here.
Source: 11 years of string matching work. :-) We built a multiple hash table string matcher in the early days of Hyperscan (it's gone now).
There is an Intel PEXT instruction, but I don't know the cycle count and I think it's limited to 64 bits.
With PEXT the whole prefix match is four instructions.
PEXT is pretty cheap... but, I still don't think I understand how what you're suggesting could optimally apply to prefix matching against all 16 strings at once, or why it would offer a substantial speed-up to the approach I've used? Can you elaborate?
I can't see how this avoids needing to do a comparison against the string in the slot once you've actually found the index though. I think I need to see this in C or asm to grok what the benefits are.
Is there any way to quickly compute 32-bit hash code of the string using SIMD? If yes, then we get a very fast perfect hash map: https://github.com/tatumizer/pigeon_map. Hash code in question is not supposed to be of crypto quality, just "good enough" will be good enough :)
Where's all this newfound interest in hash maps coming from? Once you hash something, you lose the prefix information, making it kinda' useless for this activity. Unless you create a hash for every string length variation... but, that seems like a lot of overhead for very little gain.
I also can't see how you'd get that negative match fast path performance if you were reliant on hashes. That owes its speed to the length and unique char filter working in concert.
Find a 128-bit mask that will be used to select bits of the input string. You want to minimize the number of 1 bits in the mask, since the table size is 2^(number of 1 bits). On the other hand, you need to pick enough 1 bits so that there is only one string per table entry.
AND this mask with the input string. COMPRESS these bits using the mask. The result is the index into the table. Each entry has a string to compare and thermometer coded mask (number of 1 bits equals length of string in bits). AND the input string with the thermometer coded mask and compare it with the string. If they match, you have a hit.
COMPRESS (or generalized extract) means to move selected bits to the right so that they are all next to each other and adjacent to the LSB. There is a clever algorithm for this, see section 7-4 of Hacker's Delight:
https://doc.lagout.org/security/Hackers%20Delight.pdf