
- Roll your own unbuffered input using raw system calls.

A small toy project I wrote last year was a modification of GNU grep that did the opposite -- it aggressively prefetched ahead of the file it was currently reading. This helped performance dramatically on fragmented data (e.g. tons of small files).

For most typical greps (at least of files as opposed to standard input), "grep" is likely disk-bound, not CPU-bound.

(Note: I mean literal prefetch, not actually reading the file from disk. This is important because file input is a blocking operation in UNIX -- the thread blocks when read() is called and can only be resumed when the read is complete, unlike output, where write() can return once the data is queued. This prevents the filesystem from reordering or merging multiple reads unless they come simultaneously from different threads. This is why reads are often slower than writes on typical data.)



Regarding this bit: "Roll your own unbuffered input using raw system calls. / A small toy project I wrote last year was a modification of GNU grep that did the opposite -- it aggressively prefetched ahead of the file it was currently reading. "

I think you misunderstood. "Roll your own unbuffered input [....]" does not mean "do not read ahead"; it means "do not use stdio". Stdio buffers have a lot of overhead (relatively speaking) in support of features that grep does not need.


How did that compare to --mmap?


Wow, I didn't realise grep had a separate --mmap option. Interesting.

I don't think it would make much difference in the parent's case of many, small, fragmented files because if you're mmapping each file in turn and it's not cached, it still needs to be loaded from disk - it just happens in a page fault instead of the read() call.

Possibly if you mmapped all of the files and then used madvise() or something to prefetch in front of where you are in the list of files. Maybe grep does that, I don't know?

I guess the case where that technique would help is actually when you have a combination of (a) many files, (b) a computationally expensive pattern match (even just -i is a measurable hit) and (c) largeish files.

Because on many small files and a simple match, the disk I/O is still going to be the major component - even if you prefetch you still can't get around needing to load all the file contents from disk.


Possibly if you mmapped all of the files and then used madvise() or something to prefetch in front of where you are. Maybe grep does that, I don't know?

It doesn't do that, and that's the first method I used in my hack.

fadvise() works just as well though, I think.


It doesn't do that, and that's the first method I used in my hack.

Fair enough. :) If you feel like sharing then I'm very curious as to what the additional methods were.


It was a total hack, so I just opened files ahead of time, fadvised them, then closed them.


"mmap" is not a general solution for grep. It's nice when it can be used, but grep must be able to grep files much too large to reasonably mmap, as well as various kinds of streams.


Yes, but you would think a reasonable default solution would be to use both, i.e. mmap by default, and if the file is larger than MAX (based on memory use, file size) then fall back to read().
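That size-based dispatch could be sketched like so; MMAP_MAX and the names here are illustrative placeholders, not grep's actual policy:

```c
/* Sketch: choose mmap for small files, stream large ones with read(). */
#include <sys/types.h>

#define MMAP_MAX (64L * 1024 * 1024) /* hypothetical 64 MiB cutoff */

enum input_method { USE_MMAP, USE_READ };

/* Large files would pin too much address space / memory if mapped
   whole, so anything over the threshold is streamed with read(). */
enum input_method choose_method(off_t file_size)
{
    return file_size <= MMAP_MAX ? USE_MMAP : USE_READ;
}
```

Streams (pipes, stdin) have no size and can't be mapped at all, so they would always take the read() path.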



