
It would be interesting to benchmark how much mmap hurts when operating in a non-parallel mode.

I think a lot of the residual love for mmap is because it actually did give decent results back when single core machines were the norm. However, once your program becomes multithreaded it imposes a lot of hidden synchronization costs, especially on munmap().

The fastest option might well be to use mmap sometimes but have a collection of single-thread processes instead of a single multi-threaded one so that their VM maps aren't shared. However, this significantly complicates the work-sharing and output-merging stages. If you want to keep all the benefits you'd need a shared-memory area and do manual allocation inside it for all common data which would be a lot of work.
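The shared-memory-area idea can be sketched with Python's stdlib primitives (this is purely illustrative and has nothing to do with ripgrep's code; the function names and the one-byte-per-worker result slots are my own invention, and it ignores matches that straddle a chunk boundary):

```python
from multiprocessing import Process, shared_memory

def worker(shm_name, start, end, needle, out_name, slot):
    """Search a slice of the shared buffer; write the hit count into a result slot."""
    shm = shared_memory.SharedMemory(name=shm_name)
    out = shared_memory.SharedMemory(name=out_name)
    try:
        hits = bytes(shm.buf[start:end]).count(needle)
        out.buf[slot] = min(hits, 255)  # one byte per worker; enough for a demo
    finally:
        shm.close()
        out.close()

def parallel_count(data, needle, nworkers=2):
    """Place the haystack in a shared-memory area and let single-thread
    worker processes search disjoint slices of it."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    out = shared_memory.SharedMemory(create=True, size=nworkers)
    shm.buf[:len(data)] = data
    try:
        chunk = len(data) // nworkers
        procs = []
        for i in range(nworkers):
            start = i * chunk
            end = len(data) if i == nworkers - 1 else (i + 1) * chunk
            # Demo limitation: a match spanning start/end would be missed.
            p = Process(target=worker,
                        args=(shm.name, start, end, needle, out.name, i))
            p.start()
            procs.append(p)
        for p in procs:
            p.join()
        return sum(out.buf[:nworkers])
    finally:
        shm.close(); shm.unlink()
        out.close(); out.unlink()

if __name__ == "__main__":
    print(parallel_count(b"abxxabxxab" * 100, b"ab"))
```

The "lot of work" part shows through even here: you need out-of-band result slots, careful chunking, and explicit lifetime management (`unlink`) that a single multi-threaded process gets for free.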

It might also be that mmap is a loss these days even for single-threaded... I don't know.

Side note: when I last looked at this problem (on Solaris, 20ish years ago), one trick I used when mmap'ing was to skip the "madvise(MADV_SEQUENTIAL)" if the file size was below some threshold. If the file was small enough to be completely prefetched from disk, the hint had no effect and was just a wasted syscall. On larger files it seemed to help, though.
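That trick translates directly to Python's mmap wrapper, for what it's worth (the 64 KiB threshold here is an arbitrary stand-in for whatever cutoff measurement suggests, and `MADV_SEQUENTIAL` is Unix-only, hence the `hasattr` guard):

```python
import mmap
import os

PREFETCH_THRESHOLD = 64 * 1024  # arbitrary cutoff; tune by measurement

def map_for_scan(path):
    """Map a file read-only; hint sequential access only for large files."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        m = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    finally:
        os.close(fd)  # the mapping keeps its own reference to the file
    if size >= PREFETCH_THRESHOLD and hasattr(mmap, "MADV_SEQUENTIAL"):
        # Small files get fully prefetched anyway, so the syscall would be wasted.
        m.madvise(mmap.MADV_SEQUENTIAL)
    return m
```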



One thing I did benchmark was the use of memory maps for single file search (cf. `subtitles_literal`). In that case, it was slightly (but measurably) faster to memory map the file than to read it incrementally. Memory maps were only slower in parallel search on large directories.
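As a toy illustration of the two strategies (not ripgrep's actual code), here is one-file match counting done both ways; the incremental reader has to carry an overlap so matches straddling a buffer boundary aren't missed, which is exactly the bookkeeping the memory map makes unnecessary:

```python
import mmap

def count_mmap(path, needle):
    """Search via a memory map: the kernel pages the file in on demand."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        count, pos = 0, m.find(needle)
        while pos != -1:
            count += 1
            pos = m.find(needle, pos + len(needle))  # non-overlapping, like bytes.count
        return count

def count_read(path, needle, bufsize=64 * 1024):
    """Search via incremental reads, keeping len(needle)-1 bytes of overlap
    between buffers so boundary-straddling matches are still found."""
    count = 0
    overlap = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                return count
            buf = overlap + chunk
            count += buf.count(needle)
            # The overlap is shorter than a full match, so nothing is double-counted.
            overlap = buf[-(len(needle) - 1):] if len(needle) > 1 else b""
```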

Thankfully, ripgrep makes it easy to switch between memory maps and incremental reading. So I can just do this for you right now on the spot:

    $ time rg -j1 PM_SUSPEND | wc -l
    335
    
    real    0m0.406s
    user    0m0.350s
    sys     0m0.293s

    $ time rg -j1 PM_SUSPEND --mmap | wc -l
    335
    
    real    0m0.482s
    user    0m0.380s
    sys     0m0.317s
Note that this is on a Linux x64 box. I bet you'd get completely different results on a different OS.


Interesting that user time went up as well... not sure if that's significant.

I guess it's not too surprising that mmap isn't much of a win for anything these days... SIMD can copy a memory page pretty fast.

I just installed rg from homebrew and it's quite impressive... about 2.5x faster than ag on my macbook pro. Interestingly, I get another 25% improvement by dropping to -j3 even though I'm on a quad-core machine. Not sure what is bottlenecking since it's all in cache.


Yeah, figuring out the optimal thread count has always seemed like a bit of a black art to me. I can pretty reliably figure it out for my system (which has 8 physical cores, 16 logical), but it's hard to generalize that to others.

-j3 will spawn 3 workers for searching while the main thread does directory traversal. It sounds like I should do `num_cpus - 1` for the default `-j` instead of `num_cpus`.
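That default can be sketched in a few lines (hypothetical, not ripgrep's actual code): reserve one core for the traversal thread and hand the rest to searchers.

```python
import os

def default_workers():
    """One thread walks the directory tree; the rest search.
    Leave a core free for the traversal thread, but never go below 1."""
    ncpus = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, ncpus - 1)
```

On the quad-core machine above this yields 3 workers, matching the hand-tuned -j3.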


I recently dug into if/why parallel mmap might be slower, without reaching a satisfactory conclusion. One specific thing I couldn't answer is whether reusing the same filesystem buffer and program memory addresses (as incremental reads do) is cheaper than touching a wide range of mapped memory addresses.


Since all processors must share the mapping,

- The initial mapping of each file in any thread must halt all of the threads which are otherwise active.

- Every page fault in any mapping must also halt all of the threads.

Worse, since the page tables are getting munged, some or all of the TLB cache is getting flushed every time, again, on every processor.

I'm not sure of the details, but this hypothesis should be directly testable. IIRC, there are some hardware performance counters for time spent waiting on TLB lookups.

Addendum: One other possibility is that the mere act of extending the working set size (of the address space) is blowing the TLB cache.


    - The initial mapping of each file in any thread must halt all of the threads which are otherwise active.

    - Every page fault in any mapping must also halt all of the threads.
These are certainly not the case on Linux, and I'd imagine not on other OSes, as it would be terrible for performance.

Each mapping (i.e., mmap(2) call) is synchronized with other paths that read process memory maps, such as other mmap(2), munmap(2), etc. syscalls, and page faults being handled for other threads (i.e., mmap(2) takes the mmap_sem semaphore for writing). Running threads are not halted. The page tables are not touched at all during mmap, unless MAP_POPULATE is passed. (Linux delays actual population of the page tables until the page is accessed.)

The page fault handler takes mmap_sem for reading the mappings (synchronizing with mmap(2), etc., but allowing multiple page fault handlers to read concurrently) and takes page_table_lock only for the very small window when it actually updates the page tables.

Again, running threads are not halted. The active page tables are updated while other cores may be accessing them. This must be done carefully to avoid spurious faults, but it is certainly feasible.

In fact, at least on x86, handling page faults does not require a TLB flush. The TLB does not cache non-present page table entries, and taking a page fault invalidates the entry that caused the fault, if one existed.

There are plenty of places here that may cause contention, but nothing nearly so bad as halting execution.

munmap will be rather noisy. It involves tons of invalidations and a TLB flush. I wouldn't be surprised if a good bit of performance could be regained by avoiding munmapping the file until the process exits.


So, I did some testing with 'perf'. This is on an older Intel processor, 2-cores with hyperthreading. These were all done on the same set of files, using the binary release of ripgrep v0.1.16 on Debian Jessie:

At -j1, --mmap: 95k dTLB load-misses, 2800 page faults

At -j2, --mmap: 170k dTLB load-misses, 2840 page faults

At -j3, --mmap: 230k dTLB load-misses, 2800 page faults

At -j4, --mmap: 4180 context switches, 2900 page faults, 200M instructions, 280k dTLB load-misses, 35M dTLB loads

At -j1, --no-mmap: 50k dTLB load-misses, 635 page faults

At -j2, --no-mmap: 70k dTLB load-misses, 675 page faults

At -j3, --no-mmap: 90k dTLB load-misses, 715 page faults

At -j4, --no-mmap: 377-400 context switches, 750 page faults, 275M instructions, 100k dTLB load-misses

As the number of threads goes up, the total amount of TLB pressure goes up in both cases. These results are consistent with a number of TLB cache flushes proportional to N_threads * M_mappings + C for the --mmap case, and N_threads * M_buffer_perthread + D for the --no-mmap case. I think that supports the model that each thread's mmap adds pressure to all of the threads' TLBs.
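A quick sanity check on the linear claim, fitting a least-squares line to the j1–j3 numbers above (this only quantifies the per-thread growth; it can't separate the C/D terms of the model):

```python
# dTLB load-misses (in thousands) from the perf runs above, keyed by -j level.
mmap_misses = {1: 95, 2: 170, 3: 230}
read_misses = {1: 50, 2: 70, 3: 90}

def slope_intercept(points):
    """Least-squares line through (threads, misses) pairs."""
    n = len(points)
    mx = sum(points.keys()) / n
    my = sum(points.values()) / n
    slope = (sum((x - mx) * (y - my) for x, y in points.items())
             / sum((x - mx) ** 2 for x in points.keys()))
    return slope, my - slope * mx

print(slope_intercept(mmap_misses))  # slope ~67.5k extra misses per thread
print(slope_intercept(read_misses))  # slope ~20k extra misses per thread
```

Both cases grow linearly, but the --mmap slope is more than 3x steeper, which is what you'd expect if each thread's mappings tax every thread's TLB.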


I did some experimentation last night as well. I suspected a lot of the cost came from unmapping the files, with the invalidations and TLB shootdowns that requires.

I made rg simply not munmap files when it was done with them (I made this drop do nothing: https://github.com/danburkert/memmap-rs/blob/master/src/unix...)

Searching for PM_RESUME in the Linux source gave me these results:

    --no-mmap: ~400ms
    --mmap (with munmap): ~750ms
    --mmap (without munmap): ~550ms
So eliding munmap made a big difference, but it was still not enough to beat out reading the files. perf shows that the mmap syscall itself is just too expensive (this is --mmap (with munmap)):

      Children      Self  Command  Shared Object       Symbol
    -   81.88%     0.00%  rg       rg                  [.] __rust_try
       - __rust_try
          - 50.57% std::panicking::try::call::ha112cda315d6c57d
             - 47.73% rg::Worker::search_mmap::h5179a76c63e344d0
                - 23.91% __GI___munmap
                     6.08% smp_call_function_many
                     3.14% rwsem_spin_on_owner
                     1.86% native_queued_spin_lock_slowpath
                     0.94% osq_lock
                     0.67% native_write_msr_safe
                     0.52% unmap_page_range
                + 21.41% _$LT$rg..search_buffer..BufferSearcher$LT$$u27$a$C$$u20$W$GT$$GT$::run::hd0f8b2830716be0c
                  0.80% memchr
          - 17.99% __mmap64
               5.20% rwsem_down_write_failed
               1.96% rwsem_spin_on_owner
               0.77% osq_lock
               0.59% native_queued_spin_lock_slowpath
          + 6.79% 0x1080d
            1.72% __GI___libc_close
            1.27% __memcpy_sse2_unaligned
            0.98% __fxstat64
            0.56% __GI___ioctl



