Here is a way to do it with a perfect hash table (so only one lookup). Find a 12...

glangdale · on May 14, 2018

Excellent! I built a string matcher based on these same principles at Intel while working on Hyperscan but never released it. I'm pleased to see that you have discovered and publicized this yourself as I can say the cat is out of the bag. :-)

You can use PEXT alone without even messing with SIMD as long as there's enough good distinguishing bits in the early (first/last 8) parts of the string. The only thing that complicates this is that, as in the example in the article, sometimes some strings are proper prefixes/suffixes of others (which one is problematic depends on which direction you're using).

trentnelson · on May 14, 2018

Patches welcome :-)

But seriously though, I played around with compression a bit initially: https://github.com/tpn/tracer/blob/master/StringTable/String.... Referenced Hacker's Delight when I was working on it. Granted, this was just for compressing the USHORT lengths into BYTE-sized lengths.

Are you sure your compression suggestion would work for prefix matches? If you can get lower latency than the fastest assembly version, I'm all ears: http://trent.me/is-prefix-of-string-in-table/#IsPrefixOfStri...

All of the compress/expand routines mentioned in that Hacker's Delight section you're quoting seem to mention upwards of 160+ instructions. My negative match logic fast-path only requires 12 instructions.

rurban · on May 14, 2018

True. A gperf alike hash table (with special key indices) works only ok with compile-time known strings. But a simple CMP-alike perfect hash table with double indirection would be worthwhile to test against.

I called it Hanov after http://stevehanov.ca/blog/index.php?id=119 Problem is there that you have to calc a hash for each string, which is only fast with __builtin_crc/_mm_crc32_u64

glangdale · on May 15, 2018

A hash alone isn't going to help much; he has both a 1-char string in the set and to top it off, most of the other strings begin with (a distinct from the 1-char string one) the same string. There are some tricks (overpopulate the hash table with short string) but hashing is tricky here.

Source: 11 years of string matching work. :-) We built a multiple hash table string matcher in the early days of Hyperscan (it's gone now).

jhallenworld · on May 14, 2018

There is an Intel PEXT instruction, but I don't know the cycle count and I think it's limited to 64 bits. With PEXT the whole prefix match is four instructions.

trentnelson · on May 14, 2018

PEXT is pretty cheap... but, I still don't think I understand how what you're suggesting could optimally apply to prefix matching against all 16 strings at once, or why it would offer a substantial speed-up to the approach I've used? Can you elaborate?

jhallenworld · on May 14, 2018

Here is a small example:

Strings (written right to left):

    -----101
    --100101
    -----010
    --100010

Mask:

00111010

Table:

    Index Contents
    0000  -----101
    0001  -----010
    0010  -----101
    0011  -----010
    0100  -----101
    0101  -----010
    0110  -----101
    0111  -----010
    1000  --100101
    1001  --100010
    1010  -----101
    1011  -----010
    1100  -----101
    1101  -----010
    1110  -----101
    1111  -----010

Suppose the input is --101101

After the mask, the result is 00101000

After the compress, the result is 1010

Table index 1010 has -----101, so it's a hit. with string 101.

Now Suppose the input is --100101

After the mask, the result is 00100000

After the compress, the result is 1000

Table index 1000 has --100101, so it's a hit. with string 100101.

trentnelson · on May 14, 2018

I can't see how this avoids needing to do a comparison against the string in the slot once you've actually found the index though. I think I need to see this in C or asm to grok what the benefits are.

resource0x · on May 14, 2018

Is there any way to quickly compute 32-bit hash code of the string using SIMD? If yes, then we get a very fast perfect hash map: https://github.com/tatumizer/pigeon_map. Hash code in question is not supposed to be of crypto quality, just "good enough" will be good enough :)

trentnelson · on May 15, 2018

Where's all this newfound interest in hash maps coming from? Once you hash something, you lose the prefix information, making it kinda' useless for this activity. Unless you create a hash for every string length variation... but, that seems like a lot of overhead for very little gain.

I also can't see how you'd get that negative match fast path performance if you were reliant on hashes. That owes its speed to the length and unique char filter working in concert.