
- Roll your own unbuffered input using raw system calls.

A small toy project I wrote last year was a modification of GNU grep that did the opposite -- it aggressively prefetched ahead of the file it was currently reading. This helped performance dramatically on fragmented data (e.g. tons of small files).

For most typical greps (at least of files as opposed to standard input), "grep" is likely disk-bound, not CPU-bound.

(Note: I mean literal prefetch, not actually reading the file from disk. This is important because file input is a blocking operation in UNIX -- the thread blocks when read() is called and can only be resumed when the read is complete, unlike output, where write() can return once the data is queued. This prevents the filesystem from reordering or merging multiple reads unless they come simultaneously from different threads. This is why reads are often slower than writes on typical data.)



Regarding this bit: "Roll your own unbuffered input using raw system calls. / A small toy project I wrote last year was a modification of GNU grep that did the opposite -- it aggressively prefetched ahead of the file it was currently reading. "

I think you misunderstood. "Roll your own unbuffered input [....]" does not mean "do not read ahead"; it means "do not use stdio". Stdio buffers have a lot of overhead (relatively speaking) in support of features that grep does not need.


How did that compare to --mmap?


Wow, I didn't realise grep had a separate --mmap option. Interesting.

I don't think it would make much difference in the parent's case of many, small, fragmented files because if you're mmapping each file in turn and it's not cached, it still needs to be loaded from disk - it just happens in a page fault instead of the read() call.

Possibly if you mmapped all of the files and then used madvise() or something to prefetch in front of where you are in the list of files. Maybe grep does that, I don't know?

I guess the case where that technique would help is actually when you have a combination of (a) many files, (b) a computationally expensive pattern match (even just -i is a measurable hit) and (c) largeish files.

Because on many small files and a simple match, the disk I/O is still going to be the major component - even if you prefetch you still can't get around needing to load all the file contents from disk.


Possibly if you mmapped all of the files and then used madvise() or something to prefetch in front of where you are. Maybe grep does that, I don't know?

It doesn't do that, and that's the first method I used in my hack.

fadvise() works just as well though, I think.


It doesn't do that, and that's the first method I used in my hack.

Fair enough. :) If you feel like sharing then I'm very curious as to what the additional methods were.


It was a total hack, so I just opened files ahead of time, fadvised them, then closed them.


"mmap" is not a general solution for grep. It's nice when it can be used, but grep must be able to grep files much too large to reasonably mmap, as well as various kinds of streams.


Yes, but you would think a reasonable default solution would be to use both, i.e. mmap by default, and if the file is larger than MAX (based on memory use, file size) then fall back to read().
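That size-based dispatch could be sketched like so; MMAP_MAX and the names here are illustrative placeholders, not grep's actual policy:

```c
/* Sketch: choose mmap for small files, stream large ones with read(). */
#include <sys/types.h>

#define MMAP_MAX (64L * 1024 * 1024) /* hypothetical 64 MiB cutoff */

enum input_method { USE_MMAP, USE_READ };

/* Large files would pin too much address space / memory if mapped
   whole, so anything over the threshold is streamed with read(). */
enum input_method choose_method(off_t file_size)
{
    return file_size <= MMAP_MAX ? USE_MMAP : USE_READ;
}
```

Streams (pipes, stdin) have no size and can't be mapped at all, so they would always take the read() path.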



