Wider branching factors (3-ary, 4-ary, etc.) are less efficient in the total number of comparisons, but they give you more memory-level parallelism (MLP), and large searches are dominated by memory access behavior and critical paths, not by total comparisons or instruction throughput. So better MLP can make up for the inefficiency of higher-arity searches... to a point.
E.g., with 4-ary search, your search tree is half the depth of binary search (effectively, you are doing two levels of binary search in one level), but you do 3x the comparisons per level, so 1.5x the total.
However, a comparison (1 cycle) is very fast compared to the time to fetch the element from memory (~5, ~12, ~40, and 200+ cycles for an L1, L2, L3, or DRAM hit, respectively). The 4-ary search can issue its 3 probes in parallel, so if the total time is dominated by memory access, you might expect it to be ~2x faster. In practice, things like branch mispredictions (if the search is branchy) complicate the picture.
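To make the shape of this concrete, here's a hypothetical sketch (not from the thread) of a 4-ary search over a sorted array: the three probe loads in each iteration have no dependencies on each other, so the CPU can have all three in flight at once, unlike the strictly dependent loads of binary search.

```cpp
#include <cstddef>
#include <vector>

// Sketch of 4-ary search. The three probe loads per iteration are
// independent, so they can be issued in parallel (the MLP argument),
// at the cost of extra comparisons per level.
std::size_t quaternary_search(const std::vector<int>& a, int key) {
    std::size_t lo = 0, hi = a.size();              // invariant: key, if present, is in [lo, hi)
    while (hi - lo > 4) {
        std::size_t q  = (hi - lo) / 4;
        std::size_t m1 = lo + q, m2 = lo + 2 * q, m3 = lo + 3 * q;
        int p1 = a[m1], p2 = a[m2], p3 = a[m3];     // three independent loads
        if      (key < p1) hi = m1;
        else if (key < p2) { lo = m1; hi = m2; }
        else if (key < p3) { lo = m2; hi = m3; }
        else               lo = m3;
    }
    for (std::size_t i = lo; i < hi; ++i)           // tiny linear finish
        if (a[i] == key) return i;
    return a.size();                                // sentinel: not found
}
```

The branchy `if`/`else` chain is for readability; a tuned version would make the range update branchless so mispredictions don't serialize the probes.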
Sounds right. I think the fastest solution (of this approach) would do something like: fetch a cache-line-sized batch; check its upper and lower ends; then, depending on the result, either choose the next batch by binary subdivision or do a linear walk to find the match. Not sure if doing it precisely would be an improvement over the magic numbers in the example, though.
Then again, I'd like to experiment with prefetch here as well. It may be possible to squeeze out even more performance.
I don't think the trick of checking each end of the cache line helps much, except perhaps at the very end of the search (where there are say low 100s of bytes left in the region).
When the region is large, it barely cuts the search space down at all: it's a rounding error.
Now you might think it's basically free, but clogging up the out-of-order execution structures with extra instructions can hurt MLP, because fewer memory accesses fit in the window, even if those instructions always execute entirely in the shadow of existing misses.
There is a similar trick with pages: you might try to avoid touching more pages than necessary, so it can be worth moving probe points slightly to avoid touching a new page. For example, if you just probed at bytes 0 and 8200, the middle is 4100, but that's on a new 4k page (bytes 4096 thru 8191), so maybe you adjust slightly and probe at 4090 instead, since you already touched that page with your probe at 0.
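A hypothetical helper for that adjustment might look like this (the slack threshold is an arbitrary assumption, and it pulls the probe back to the page boundary rather than a specific offset like 4090):

```cpp
#include <cstddef>

// Hypothetical: if the midpoint lands only slightly inside the page after
// the one 'lo' sits in, pull it back to the last byte of lo's page, which
// is already warm in the TLB.
constexpr std::size_t kPageSize = 4096;   // assume 4 KiB pages
constexpr std::size_t kSlack    = 64;     // arbitrary "slightly inside" cutoff

std::size_t adjust_probe(std::size_t lo, std::size_t mid) {
    std::size_t lo_page  = lo / kPageSize;
    std::size_t mid_page = mid / kPageSize;
    std::size_t into     = mid - mid_page * kPageSize;  // bytes into mid's page
    if (mid_page == lo_page + 1 && into < kSlack)
        return mid_page * kPageSize - 1;  // e.g. 4100 -> 4095, back onto lo's page
    return mid;
}
```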
Making all of these calculations dynamically is a bit annoying and maybe slow, so it's best if the whole search is unrolled so the probe points according to whatever rules can be calculated at compile time and embedded into the code.
Much more important than either of these things is avoiding overloading the cache sets. Some implementations have a habit of choosing their probe points such that only a very small number of sets in the L1 and L2 are used, so the caching behavior is terrible. Paul Khuong talks about this in detail on pvk.ca.
Doing a linear walk at the end definitely helps a bit, though: SIMD is fastest for this part.
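The usual shape for that tail walk is a branchless counting loop (a sketch, not from the comment); since there is no data-dependent branch, compilers typically auto-vectorize it into SIMD compares at -O2/-O3, so you get the SIMD version without writing intrinsics.

```cpp
#include <cstddef>

// Branchless linear finish over a small remainder: count how many
// elements are below the key. No data-dependent branch, so the loop
// auto-vectorizes into SIMD compares plus a horizontal sum.
std::size_t linear_lower_bound(const int* a, std::size_t n, int key) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        count += (a[i] < key);   // 0 or 1 per element
    return count;                // index of the first element >= key
}
```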
If you want to optimize a data structure for binary search, it sounds like it might be best to reorder the data itself in memory to make caching more effective.
For example the first access in a binary search will be the middle element, followed by the lower quartile or upper quartile. If you store all of those together in memory, a single cache line fetch can serve all those requests.
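One way to get exactly that layout is the standard in-order construction of a level-order (Eytzinger) array, 1-indexed so the children of node k sit at 2k and 2k+1 (a sketch under those conventions):

```cpp
#include <cstddef>
#include <vector>

// Walk the implicit tree (children of k are 2k and 2k+1) in order and hand
// out the sorted values as we go. b must have size a.size() + 1; b[0] is
// unused padding. The sorted middle lands at b[1] and the two quartile
// elements at b[2] and b[3], so the first probes share cache lines.
std::size_t to_level_order(const std::vector<int>& a, std::vector<int>& b,
                           std::size_t i = 0, std::size_t k = 1) {
    if (k <= a.size()) {
        i = to_level_order(a, b, i, 2 * k);
        b[k] = a[i++];
        i = to_level_order(a, b, i, 2 * k + 1);
    }
    return i;
}
```

For sorted input {1..7} this produces {4, 2, 6, 1, 3, 5, 7} starting at b[1]: middle first, then the two quartile elements right next to it.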
Yes, although the design space explodes once you can reorder, so it's also interesting to consider the restricted problem where you can't reorder. This is still realistic since in many scenarios the layout may be fixed by an external constraint.
A levelwise traversal usually performs well without much effort if you want to optimize for cache friendliness. It's not optimal iirc, but it's dead-simple and much easier on the prefetcher than an inorder traversal.
Edit: Oops, TIL the eytzinger layout other comments have mentioned _is_ a levelwise traversal. Anywho, it's usually a good starting point for this kind of thing.
Binary search is used when you can't optimize the data structure. You have a sorted array, and have to work with that.
If you store the middle element together with its two quartile elements, you're making a B-tree. You will need indirection (pointers) to locate the other chunks that are not stored together, and so on.
"Binary search" usually refers to the sorted array algorithm, though of course in the abstract sense of subdivision, any subdivision-by-two search is binary and in that sense we have the term "binary search tree".
On a GPU, for example, if you have an array with 100,000 elements and want to find the index of one value, you are probably best off using 100,000 threads to find it.
So that's 100,000-ary search.
On a CPU, using an Eytzinger layout will probably also give you better performance than any of the N-ary approaches.
I've never heard of the Eytzinger layout, but it seems like it's just a static binary tree on an array (children of node i are at 2i and 2i+1), commonly used for heaps and segment trees.
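For what it's worth, searching that layout can be done branchlessly (a sketch assuming a 1-indexed level-order array and a GCC/Clang builtin; the bit trick at the end recovers the lower-bound node from the path taken):

```cpp
#include <cstddef>
#include <vector>

// Lower-bound search in a 1-indexed level-order (Eytzinger) array:
// descend to child 2k or 2k+1 with no branches, then strip the trailing
// 1-bits of k (plus one more bit) to recover the last node where we
// went left. Returns the index in b of the first element >= key, or 0
// if every element is smaller.
std::size_t eytzinger_lower_bound(const std::vector<int>& b, int key) {
    std::size_t n = b.size() - 1;        // b[0] is unused padding
    std::size_t k = 1;
    while (k <= n)
        k = 2 * k + (b[k] < key);        // branchless descent
    k >>= __builtin_ctzll(~k) + 1;       // GCC/Clang: count trailing ones
    return k;
}
```

Because every step is `k = 2k + cmp`, there is nothing for the branch predictor to mispredict; the cost is purely the chain of dependent loads, which is where the layout's cache friendliness pays off.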
Random idea: a replacement function library that accepts hardware tuning parameters like this and a corresponding tool which sets them once during OS installation/upgrade and provides a config file or environment variable setup, so apps can share the setup...
Bonus if this library also detected specialized hardware (GPU, AVX, etc.) and leveraged it appropriately.
Obviously, these parameters would need optional per-app-invocation overrides, e.g. so random apps didn't hog shared resources, e.g. for benchmarking, etc.
(I'm guessing this exists but not widely deployed...?)
This was somewhat the approach used in the ATLAS[1] linear algebra library. In the ATLAS case, systems were characterized through testing and a set of parameters generated. Then at installation time code was generated to match the system type using the given parameters.