
It would be interesting to benchmark how much mmap hurts when operating in a non-parallel mode.

I think a lot of the residual love for mmap is because it actually did give decent results back when single core machines were the norm. However, once your program becomes multithreaded it imposes a lot of hidden synchronization costs, especially on munmap().

The fastest option might well be to use mmap sometimes but have a collection of single-thread processes instead of a single multi-threaded one so that their VM maps aren't shared. However, this significantly complicates the work-sharing and output-merging stages. If you want to keep all the benefits you'd need a shared-memory area and do manual allocation inside it for all common data which would be a lot of work.
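The shared-memory-area idea can be sketched with Python's stdlib primitives (this is purely illustrative and has nothing to do with ripgrep's code; the function names and the one-byte-per-worker result slots are my own invention, and it ignores matches that straddle a chunk boundary):

```python
from multiprocessing import Process, shared_memory

def worker(shm_name, start, end, needle, out_name, slot):
    """Search a slice of the shared buffer; write the hit count into a result slot."""
    shm = shared_memory.SharedMemory(name=shm_name)
    out = shared_memory.SharedMemory(name=out_name)
    try:
        hits = bytes(shm.buf[start:end]).count(needle)
        out.buf[slot] = min(hits, 255)  # one byte per worker; enough for a demo
    finally:
        shm.close()
        out.close()

def parallel_count(data, needle, nworkers=2):
    """Place the haystack in a shared-memory area and let single-thread
    worker processes search disjoint slices of it."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    out = shared_memory.SharedMemory(create=True, size=nworkers)
    shm.buf[:len(data)] = data
    try:
        chunk = len(data) // nworkers
        procs = []
        for i in range(nworkers):
            start = i * chunk
            end = len(data) if i == nworkers - 1 else (i + 1) * chunk
            # Demo limitation: a match spanning start/end would be missed.
            p = Process(target=worker,
                        args=(shm.name, start, end, needle, out.name, i))
            p.start()
            procs.append(p)
        for p in procs:
            p.join()
        return sum(out.buf[:nworkers])
    finally:
        shm.close(); shm.unlink()
        out.close(); out.unlink()

if __name__ == "__main__":
    print(parallel_count(b"abxxabxxab" * 100, b"ab"))
```

The "lot of work" part shows through even here: you need out-of-band result slots, careful chunking, and explicit lifetime management (`unlink`) that a single multi-threaded process gets for free.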

It might also be that mmap is a loss these days even for single-threaded... I don't know.

Side note: when I last looked at this problem (on Solaris, 20ish years ago), one trick I used when mmap'ing was to skip the "madvise(MADV_SEQUENTIAL)" if the file size was below some threshold. If the file was small enough to be completely prefetched from disk, the hint had no effect and was just a wasted syscall. On larger files it seemed to help, though.
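That trick translates directly to Python's mmap wrapper, for what it's worth (the 64 KiB threshold here is an arbitrary stand-in for whatever cutoff measurement suggests, and `MADV_SEQUENTIAL` is Unix-only, hence the `hasattr` guard):

```python
import mmap
import os

PREFETCH_THRESHOLD = 64 * 1024  # arbitrary cutoff; tune by measurement

def map_for_scan(path):
    """Map a file read-only; hint sequential access only for large files."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        m = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    finally:
        os.close(fd)  # the mapping keeps its own reference to the file
    if size >= PREFETCH_THRESHOLD and hasattr(mmap, "MADV_SEQUENTIAL"):
        # Small files get fully prefetched anyway, so the syscall would be wasted.
        m.madvise(mmap.MADV_SEQUENTIAL)
    return m
```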



One thing I did benchmark was the use of memory maps for single file search (cf. `subtitles_literal`). In that case, it was slightly (but measurably) faster to memory map the file than to read it incrementally. Memory maps were only slower in parallel search on large directories.
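As a toy illustration of the two strategies (not ripgrep's actual code), here is one-file match counting done both ways; the incremental reader has to carry an overlap so matches straddling a buffer boundary aren't missed, which is exactly the bookkeeping the memory map makes unnecessary:

```python
import mmap

def count_mmap(path, needle):
    """Search via a memory map: the kernel pages the file in on demand."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        count, pos = 0, m.find(needle)
        while pos != -1:
            count += 1
            pos = m.find(needle, pos + len(needle))  # non-overlapping, like bytes.count
        return count

def count_read(path, needle, bufsize=64 * 1024):
    """Search via incremental reads, keeping len(needle)-1 bytes of overlap
    between buffers so boundary-straddling matches are still found."""
    count = 0
    overlap = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                return count
            buf = overlap + chunk
            count += buf.count(needle)
            # The overlap is shorter than a full match, so nothing is double-counted.
            overlap = buf[-(len(needle) - 1):] if len(needle) > 1 else b""
```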

Thankfully, ripgrep makes it easy to switch between memory maps and incremental reading. So I can just do this for you right now on the spot:

    $ time rg -j1 PM_SUSPEND | wc -l
    335
    
    real    0m0.406s
    user    0m0.350s
    sys     0m0.293s

    $ time rg -j1 PM_SUSPEND --mmap | wc -l
    335
    
    real    0m0.482s
    user    0m0.380s
    sys     0m0.317s
Note that this is on a Linux x64 box. I bet you'd get completely different results on a different OS.


Interesting that user time went up as well... not sure if that's significant.

I guess it's not too surprising that mmap isn't much of a win for anything these days... SIMD can copy a memory page pretty fast.

I just installed rg from homebrew and it's quite impressive... about 2.5x faster than ag on my macbook pro. Interestingly, I get another 25% improvement by dropping to -j3 even though I'm on a quad-core machine. Not sure what is bottlenecking since it's all in cache.


Yeah, figuring out the optimal thread count has always seemed like a bit of a black art to me. I can pretty reliably figure it out for my system (which has 8 physical cores, 16 logical), but it's hard to generalize that to others.

-j3 will spawn 3 workers for searching while the main thread does directory traversal. It sounds like I should do `num_cpus - 1` for the default `-j` instead of `num_cpus`.
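That default can be sketched in a few lines (hypothetical, not ripgrep's actual code): reserve one core for the traversal thread and hand the rest to searchers.

```python
import os

def default_workers():
    """One thread walks the directory tree; the rest search.
    Leave a core free for the traversal thread, but never go below 1."""
    ncpus = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, ncpus - 1)
```

On the quad-core machine above this yields 3 workers, matching the hand-tuned -j3.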


I recently dug into if/why parallel mmap might be slower, without reaching a satisfactory conclusion. One specific thing I couldn't answer is whether reusing the same filesystem buffer and program memory addresses (as incremental reads do) is cheaper than touching a wide range of mapped memory addresses.


Since all processors must share the mapping,

- The initial mapping of each file in any thread must halt all of the threads which are otherwise active.

- Every page fault in any mapping must also halt all of the threads.

Worse, since the page tables are getting munged, some or all of the TLB cache is getting flushed every time, again, on every processor.

I'm not sure of the details, but this hypothesis should be directly testable. IIRC, there are some hardware performance counters for time spent waiting on TLB lookups.

Addendum: One other possibility is that the mere act of extending the working set size (of the address space) is blowing the TLB cache.


    - The initial mapping of each file in any thread must halt all of the threads which are otherwise active.

    - Every page fault in any mapping must also halt all of the threads.
These are certainly not the case on Linux, and I'd imagine not on other OSes, as it would be terrible for performance.

Each mapping (i.e., mmap(2) call) is synchronized with other paths that read process memory maps, such as other mmap(2), munmap(2), etc. syscalls, and page faults being handled for other threads (i.e., mmap(2) takes the mmap_sem semaphore for writing). Running threads are not halted. The page tables are not touched at all during mmap, unless MAP_POPULATE is passed. (Linux delays actual population of the page tables until the page is accessed.)

The page fault handler takes mmap_sem for reading the mappings (synchronizing with mmap(2), etc., but allowing multiple page fault handlers to read concurrently) and takes page_table_lock only for the very small window when it actually updates the page tables.

Again, running threads are not halted. The active page tables are updated while other cores may be accessing them. This must be done carefully to avoid spurious faults, but it is certainly feasible.

In fact, at least on x86, handling page faults does not require a TLB flush. The TLB does not cache non-present page table entries, and taking a page fault invalidates the entry that caused the fault, if one existed.

There are plenty of places here that may cause contention, but nothing nearly so bad as halting execution.

munmap will be rather noisy. It involves tons of invalidations and a TLB flush. I wouldn't be surprised if a good bit of performance could be regained by avoiding munmapping the file until the process exits.


So, I did some testing with 'perf'. This is on an older Intel processor, 2-cores with hyperthreading. These were all done on the same set of files, using the binary release of ripgrep v0.1.16 on Debian Jessie:

At -j1, --mmap: 95k dTLB load-misses, 2800 page faults

At -j2, --mmap: 170k dTLB load-misses, 2840 page faults

At -j3, --mmap: 230k dTLB load-misses, 2800 page faults

At -j4, --mmap: 4180 context switches, 2900 page faults, 200M instructions, 280k dTLB load-misses, 35M dTLB loads

At -j1, --no-mmap: 50k dTLB load-misses, 635 page faults

At -j2, --no-mmap: 70k dTLB load-misses, 675 page faults

At -j3, --no-mmap: 90k dTLB load-misses, 715 page faults

At -j4, --no-mmap: 377-400 context switches, 750 page faults, 275M instructions, 100k dTLB load-misses

As the number of threads goes up, the total amount of TLB pressure goes up in both cases. These results are consistent with a number of TLB cache flushes proportional to N_threads * M_mappings + C for the --mmap case, and N_threads * M_buffer_perthread + D for the --no-mmap case. I think that supports the model that each thread's mmap adds pressure to all of the threads' TLBs.
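A quick sanity check on the linear claim, fitting a least-squares line to the j1–j3 numbers above (this only quantifies the per-thread growth; it can't separate the C/D terms of the model):

```python
# dTLB load-misses (in thousands) from the perf runs above, keyed by -j level.
mmap_misses = {1: 95, 2: 170, 3: 230}
read_misses = {1: 50, 2: 70, 3: 90}

def slope_intercept(points):
    """Least-squares line through (threads, misses) pairs."""
    n = len(points)
    mx = sum(points.keys()) / n
    my = sum(points.values()) / n
    slope = (sum((x - mx) * (y - my) for x, y in points.items())
             / sum((x - mx) ** 2 for x in points.keys()))
    return slope, my - slope * mx

print(slope_intercept(mmap_misses))  # slope ~67.5k extra misses per thread
print(slope_intercept(read_misses))  # slope ~20k extra misses per thread
```

Both cases grow linearly, but the --mmap slope is more than 3x steeper, which is what you'd expect if each thread's mappings tax every thread's TLB.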


I did some experimentation last night as well. I suspected a lot of the cost came from unmapping the files, with the invalidations and TLB shootdowns that requires.

I made rg simply not munmap files when it was done with them (I made this drop do nothing: https://github.com/danburkert/memmap-rs/blob/master/src/unix...)

Searching for PM_RESUME in the Linux source gave me these results:

    --no-mmap: ~400ms
    --mmap (with munmap): ~750ms
    --mmap (without munmap): ~550ms
So eliding munmap made a big difference, but it was still not enough to beat out reading the files. perf shows that the mmap syscall itself is just too expensive (this is --mmap (with munmap)):

      Children      Self  Command  Shared Object       Symbol
    -   81.88%     0.00%  rg       rg                  [.] __rust_try
       - __rust_try
          - 50.57% std::panicking::try::call::ha112cda315d6c57d
             - 47.73% rg::Worker::search_mmap::h5179a76c63e344d0
                - 23.91% __GI___munmap
                     6.08% smp_call_function_many
                     3.14% rwsem_spin_on_owner
                     1.86% native_queued_spin_lock_slowpath
                     0.94% osq_lock
                     0.67% native_write_msr_safe
                     0.52% unmap_page_range
                + 21.41% _$LT$rg..search_buffer..BufferSearcher$LT$$u27$a$C$$u20$W$GT$$GT$::run::hd0f8b2830716be0c
                  0.80% memchr
          - 17.99% __mmap64
               5.20% rwsem_down_write_failed
               1.96% rwsem_spin_on_owner
               0.77% osq_lock
               0.59% native_queued_spin_lock_slowpath
          + 6.79% 0x1080d
            1.72% __GI___libc_close
            1.27% __memcpy_sse2_unaligned
            0.98% __fxstat64
            0.56% __GI___ioctl



