
3) if all the lines of the 10+GB file are actually unique, wouldn't awk keep the whole file in RAM? For files larger than my RAM could this leave my system unresponsive because it's thrashing on swap?


The sort | uniq method has to sort the entire file before piping it to uniq, a far more memory-intensive operation than the single-pass awk check. You can write your own hash function in AWK if you think you may run out of memory, but then you risk hash collisions. It's a tradeoff.
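One hedged sketch of that idea (illustrative, not tuned): key the awk array on a cheap rolling hash of the line, taken mod M, instead of on the line itself. That bounds the array at M entries at most, but any collision silently drops a unique line.

```shell
# Sketch only: dedupe by a rolling hash mod M rather than by the line itself.
# Memory is bounded by M entries, at the cost of possible false "duplicates".
printf 'aa\nbb\naa\n' | awk '
BEGIN { M = 999983; for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }
{
  h = 0
  for (i = 1; i <= length($0); i++) h = (h * 31 + ord[substr($0, i, 1)]) % M
  if (!seen[h]++) print
}'
```

The ord[] lookup table works around awk having no built-in char-to-code function.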

I tried it on a 1U server with 24GB of RAM a few years ago and found that sort was thrashing at the 10GB file size while awk handled it easily.


Or you can use `sort -u` and not have to pipe to `uniq`.


Sort will actually do an external sort: it sorts blocks into temp files and then merges them. Adjusting the block size can help with thrashing.

Awk may still be better for uniquification.
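With GNU coreutils sort, the knobs for this are -S (cap the in-memory buffer; sorted runs spill to temp files and get merged) and -T (where those runs go). A small runnable demo, with illustrative rather than tuned sizes:

```shell
# GNU sort external-sort knobs: -S caps the in-memory buffer, -T sets the
# temp directory for spilled runs, -u dedupes during the merge.
printf 'b\na\nb\nc\na\n' > demo.txt
sort -S 1M -T "${TMPDIR:-/tmp}" -u demo.txt
rm -f demo.txt
```

For a real 10GB file you'd use something like -S 512M (or more) so the number of merge passes stays small.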


What about 'sort -u' ?


I'm not sure. I haven't closely studied the difference between each algorithm. My guess is that sort -u would perform better as the data set grows (given a good block size setting), because it does an external sort. Cardinality also affects performance: if the unique set fits comfortably in memory, an external sort over a large data set wouldn't be very efficient.


Yes, awk does some real black magic. It's awesome how it can parse really, really big files.


Yes it will, and a bit (read: a lot) more than 10GB, as it needs to store every distinct line as a key in the array x (along with the hash-table overhead and the counter value for each entry). There's no other magic way it can 'know' whether a particular line has been seen before. You can't rely on hash values alone, as hashes aren't guaranteed to be unique.

For files with relatively few duplicates it's going to be a lot slower than sort | uniq.

Trying it on a 128MB file (nowhere near enough time to test a 10GB file) filled with lines of 7 random upper case characters[1] (so hardly any duplicates):-

    $ wc -l x.out
    16777216 x.out
    
    $ time ( sort x.out | uniq ) | wc -l
    16759719

    real    0m17.982s
    user    0m42.575s
    sys     0m0.876s
    
    $ time ( sort -u x.out ) | wc -l
    16759719

    real    0m20.582s
    user    0m43.775s
    sys     0m0.688s
Not much difference between "sort | uniq" and "sort -u".

As for the awk method:-

    $ time awk '!x[$0]++' x.out | wc -l
    
has been running for more than 20 minutes and still hasn't returned. For that 128MB file the awk process is also using 650MB of memory (according to ps). Will check up on it later (have to go out now).

This Linux machine has ~16GB of memory so the file was going to be completely cached in memory before the first test. All things being equal, the awk method will be roughly O(n) (i.e. linear in file size) and sort/uniq will be O(n log n). So, theoretically, the awk method will eventually surpass the sort method because it's doing less work (it's only checking for a previously seen key rather than sorting the entire file), but I'm not sure the crossover will be anywhere useful if the file doesn't contain many duplicates.

Repeating it for a file containing lots of duplicates (same 128MB file size, but the contents are only 7-letter words made up of A or B, so only 128 possible entries):-

    $ time awk '!x[$0]++' y.out | wc -l
    128
    
    real    0m1.207s
    user    0m1.192s
    sys     0m0.016s
    
    $ time ( sort y.out | uniq ) | wc -l
    128

    real    0m14.320s
    user    0m31.414s
    sys     0m0.428s
    
    $ time ( sort -u y.out | uniq ) | wc -l
    128
    
    real    0m12.638s
    user    0m30.366s
    sys     0m0.188s
Notice that "sort -u" doesn't do anything clever for files with lots of duplicates.

So awk is much faster for files with lots of duplicates. No great surprises. When I get a chance I'll repeat it for a 1GB file and a 10GB file (with lots of duplicates, otherwise the awk version will take far too long).

1. Example contents:-

EPQKHPH DLJCROB WICVGQY MHWTPSR HMPNECN
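For anyone wanting to reproduce this, here's one hypothetical way to generate such a file (not necessarily how it was done here): one random 7-letter uppercase word per line, so 16777216 lines x 8 bytes = 128MB. N is kept small below; bump it to 16777216 for the full-size file.

```shell
# Sketch: generate N lines of 7 random uppercase letters each into x.out.
N=${N:-1000}
awk -v n="$N" 'BEGIN {
  srand()
  for (j = 0; j < n; j++) {
    s = ""
    for (i = 0; i < 7; i++) s = s sprintf("%c", 65 + int(rand() * 26))
    print s
  }
}' > x.out
```

srand() without an argument seeds from the current time, so each run produces a different file.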


The awk run against a file containing almost no duplicates finished after over an hour (compared to 43sec for the sort method).

    $ time awk '!x[$0]++' x.out | wc -l
    16759719

    real    64m41.089s
    user    64m31.970s
    sys     0m3.136s
Peak memory usage (given that it was a 128MB input file) was (pid, rss, vsz, comm):

     8972 1239744 1246488  \_ awk
So > 1GB for a 128MB input file.


Yes it will ... [need] to store the contents of variable x in a hash table [...] There's no other magic way it can 'know' whether a particular line has been seen before. You can't rely on hash keys alone as the hashes aren't guaranteed to be unique.

Technically that's true, but the result of a cryptographic hash like SHA-256 is (practically) guaranteed to be unique. Depending on the average length of an input line and how many of the lines are unique, storing only the SHA-256 hash value could take far less memory than storing the input lines along with a non-cryptographic 32-bit hash value.
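A sketch of that idea (assumed approach, not a drop-in tool, and it needs bash 4+ for associative arrays): remember one SHA-256 digest per seen line instead of the line itself, so each unique entry costs a fixed-size key regardless of line length. Spawning sha256sum per line makes this far too slow for a 10GB file; it only illustrates the memory shape.

```shell
# Sketch only: print each line the first time its SHA-256 digest is seen.
# Per-line cost is one sha256sum process -- fine for illustration, not speed.
uniq_by_hash() {
  declare -A seen   # digest -> 1 (declare inside a function is local in bash)
  local line h
  while IFS= read -r line; do
    h=$(printf '%s' "$line" | sha256sum)
    h=${h%% *}                          # keep just the hex digest
    if [[ -z ${seen[$h]} ]]; then
      seen[$h]=1
      printf '%s\n' "$line"
    fi
  done
}
```

A fast version would do the hashing inside one process (e.g. a small C or Python program) rather than forking per line.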


You are describing a bad implementation of a bloom filter [1]. Anyway, people expect "uniq" to be correct in all cases (i.e., to never filter out a unique line). A default implementation where it would be possible (even with a minuscule chance) that this doesn't happen would be a recipe for disaster. It may be a cool option though ;)

http://en.wikipedia.org/wiki/Bloom_filter
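A toy version of that idea in awk (assumed sketch, not a vetted tool): a single shared "bit array" of size M with two cheap hash functions per line. Memory is bounded by M instead of by the input, but a false positive silently drops a unique line, which is exactly the correctness problem described above.

```shell
# Sketch: bloom-filter-style dedupe with two hash functions over one bit array.
# A line is treated as seen only if BOTH of its bits are already set.
printf 'foo\nbar\nfoo\nbaz\n' | awk '
BEGIN { M = 1000003; for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }
{
  h1 = 0; h2 = 0
  for (i = 1; i <= length($0); i++) {
    c = ord[substr($0, i, 1)]
    h1 = (h1 * 31  + c) % M
    h2 = (h2 * 131 + c) % M
  }
  if (!(bits[h1] && bits[h2])) print    # probably unseen
  bits[h1] = 1; bits[h2] = 1
}'
```

A real bloom filter would size M and the number of hash functions from the expected cardinality and a target false-positive rate.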


No, this is not a bad implementation of a bloom filter, and the size of the minuscule chance matters; unless SHA-256 has a flaw in it that we don't know about, SHA-256 collisions are far less likely than undetected hardware errors in your computer. The universe contains roughly 2²⁶⁵ protons, 500 protons per distinct SHA-256 value, and has existed for roughly 2⁵⁸ seconds, which means there are roughly 2¹⁹⁸ SHA-256 values per second of the age of the universe.

Typical undetected bit error rates on hard disks are one error per 10¹⁴ bits, which is about 2⁴⁷. If your lines are about 64 characters long, you'll have an undetected bit error roughly every 2^(47-6-3) = 2³⁸ lines. SHA-256 will give you an undetected hash collision roughly every 2²⁵⁵ lines. That is, for every 2²¹⁷ disk errors, SHA-256 will introduce one additional error. If you're hashing a billion lines a second (2³⁰), that's a SHA-256 collision roughly every 2^(255-30) = 2²²⁵ seconds, while the disk is giving you an undetected bit error every few minutes. A year is about 2²⁵ seconds, so that's about 2²⁰⁰ years, about 10⁶⁰. By comparison, stars will cease to form in about 10¹⁴ years, all planets will be flung from their orbits around the burned-out remnants of stars by random gravitational perturbations in about 10¹⁵ years, the stellar remnants will have evaporated from their cold, dark galaxies in about 10²⁰ years, and all protons will have decayed in about 10⁴⁰ years.

And if you somehow manage to keep running uniq on your very large file at a billion lines a second, SHA-256 won't produce your first random false collision until long after nothing is left of matter but black holes.


Another possibly relevant note: there are about 2¹⁴⁹ Planck times per second. None of the above takes into account the Landauer and related quantum-mechanical limits to computation, which may be more stringent.



