3) if all the lines of the 10+GB file are actually unique, wouldn't awk keep the whole file in RAM? For files larger than my RAM could this leave my system unresponsive because it's thrashing on swap?
The sort | uniq method needs to sort the file and pipe it to uniq, a far more memory-intensive operation than the single-pass awk check. You can write your own hash function in AWK if you think you may run out of memory, but of course you then risk hash collisions. It's a tradeoff.
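As a hypothetical illustration of that hash tradeoff (Python purely for brevity; djb2 is just one toy choice of hash): keep a small fingerprint of each line instead of the line itself.

```python
def djb2(data: bytes) -> int:
    """Toy 32-bit hash (djb2). Small, but collisions are entirely plausible."""
    h = 5381
    for b in data:
        h = (h * 33 + b) & 0xFFFFFFFF
    return h

def approx_unique(lines):
    """Single-pass dedup keeping only a 4-byte fingerprint per line.
    Far less memory than storing the lines, but a fingerprint collision
    silently drops a genuinely unique line -- the tradeoff in question."""
    seen = set()
    for line in lines:
        fp = djb2(line.encode())
        if fp not in seen:
            seen.add(fp)
            yield line
```

With a 32-bit hash, birthday-paradox collisions become likely after only ~2^16 distinct lines, so a scheme like this is only safe for small unique sets.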
I tried it on a 1U server with 24GB of RAM a few years ago and found that sort was thrashing at the 10GB file size while awk handled it easily.
I'm not sure. I haven't closely studied the difference between each algorithm. My guess would be that sort -u would perform better as the data set gets larger with a good block size setting because it does do an external sort. Cardinality would also affect the performance. If the unique set handily fits in memory, an external sort on a large data set wouldn't be very efficient.
Yes it will, and a bit (read: a lot) more than 10GB, as it needs to store the contents of the array x in a hash table (each distinct line as the key, with the counter as the value). There's no other magic way it can 'know' whether a particular line has been seen before. You can't rely on hash values alone, as hashes aren't guaranteed to be unique.
For files with relatively few duplicates it's going to be a lot slower than sort | uniq.
Trying it on a 128MB file (there's nowhere near enough time to test a 10GB file) filled with lines of 7 random upper-case characters[1] (so hardly any duplicates):-
$ wc -l x.out
16777216 x.out
$ time ( sort x.out | uniq ) | wc -l
16759719
real 0m17.982s
user 0m42.575s
sys 0m0.876s
$ time ( sort -u x.out ) | wc -l
16759719
real 0m20.582s
user 0m43.775s
sys 0m0.688s
Not much difference between "sort | uniq" and "sort -u".
As for the awk method:-
$ time awk '!x[$0]++' x.out | wc -l
has been running for more than 20 minutes and still hasn't returned. For that 128MB file the awk process is also using 650MB of memory (according to ps). Will check up on it later (have to go out now).
This Linux machine has ~16GB of memory so the file was going to be completely cached in memory before the first test. All things being equal, the awk method will be roughly O(n) (i.e. linear in file size) and sort/uniq will be O(n log n). So, theoretically, the awk method will eventually surpass the sort method because it's doing less work (it's only checking for a previously seen key rather than sorting the entire file), but I'm not sure the crossover will be anywhere useful if the file doesn't contain many duplicates.
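If it helps to see the two strategies side by side, here's a rough in-memory sketch in Python (a toy stand-in, not the actual awk/sort implementations, and it ignores I/O and external sorting entirely):

```python
import random
import string
import timeit

# Hypothetical micro-benchmark: set-based single pass (the awk approach)
# vs. sort-then-scan (the sort | uniq approach), on in-memory strings.
lines = ["".join(random.choices(string.ascii_uppercase, k=7))
         for _ in range(100_000)]

def dedup_set(ls):
    """One pass, one hash lookup per line: roughly O(n)."""
    seen, out = set(), []
    for l in ls:
        if l not in seen:
            seen.add(l)
            out.append(l)
    return out

def dedup_sort(ls):
    """Sort first, then drop adjacent duplicates: roughly O(n log n)."""
    out, prev = [], object()
    for l in sorted(ls):
        if l != prev:
            out.append(l)
            prev = l
    return out

assert sorted(dedup_set(lines)) == dedup_sort(lines)
print(timeit.timeit(lambda: dedup_set(lines), number=3))
print(timeit.timeit(lambda: dedup_sort(lines), number=3))
```

Constant factors and caching dominate at sizes like this, so treat any numbers it prints as illustrative only.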
Repeating it for a file containing lots of duplicates (same 128MB file size, but the contents are only 7-letter words consisting of A or B, so just 2⁷ = 128 possible entries):-
$ time awk '!x[$0]++' y.out | wc -l
128
real 0m1.207s
user 0m1.192s
sys 0m0.016s
$ time ( sort y.out | uniq ) | wc -l
128
real 0m14.320s
user 0m31.414s
sys 0m0.428s
$ time ( sort -u y.out | uniq ) | wc -l
128
real 0m12.638s
user 0m30.366s
sys 0m0.188s
Notice that "sort -u" doesn't do anything clever for files with lots of duplicates.
So awk is much faster for files with lots of duplicates. No great surprises. When I get a chance I'll repeat it for a 1GB file and a 10GB file (with lots of duplicates, otherwise the awk version will take far too long).
Yes it will ... [need] to store the contents of variable x in a hash table [...] There's no other magic way it can 'know' whether a particular line has been seen before. You can't rely on hash keys alone as the hashes aren't guaranteed to be unique.
Technically that's true, but the result of a cryptographic hash like SHA-256 is (practically) guaranteed to be unique. Depending on the average length of an input line and how many of the lines are unique, storing only the SHA-256 hash value could take far less memory than storing the input lines along with a non-cryptographic 32-bit hash value.
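As a hedged sketch of what that could look like (Python's hashlib here is just a convenient stand-in for whatever tool you'd actually use):

```python
import hashlib

def count_unique(lines):
    """Store only the SHA-256 digest of each line: a fixed 32 bytes per
    distinct line, regardless of line length. A collision would undercount
    by one, but SHA-256 collisions are (practically) impossible."""
    digests = set()
    for line in lines:                       # lines as bytes
        digests.add(hashlib.sha256(line).digest())
    return len(digests)
```

For a mostly-unique file with 64-byte average lines, this roughly halves memory versus storing the lines themselves, and the saving grows with line length.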
You are describing a bad implementation of a Bloom filter [1]. Anyway, people expect "uniq" to be correct in all cases (i.e., to never filter out a unique line). A default implementation where it would be possible (even with a minuscule chance) for this to happen would be a recipe for disaster. It might be a cool option though ;)
No, this is not a bad implementation of a bloom filter, and the size of the minuscule chance matters; unless SHA-256 has a flaw in it that we don't know about, SHA-256 collisions are far less likely than undetected hardware errors in your computer. The universe contains roughly 2²⁶⁵ protons, 500 protons per distinct SHA-256 value, and has existed for roughly 2⁵⁸ seconds, which means there are roughly 2¹⁹⁸ SHA-256 values per second of the age of the universe.
Typical undetected bit error rates on hard disks are one error per 10¹⁴ bits, which is about 2⁴⁷. If your lines are about 64 characters long, you'll have an undetected bit error roughly every 2^(47-6-3) = 2³⁸ lines. SHA-256 will give you an undetected hash collision roughly every 2²⁵⁵ lines. That is, for every 2²¹⁷ disk errors, SHA-256 will introduce an additional error. If you're hashing a billion lines a second (2³⁰) then that will be 2^(217-30) = 2¹⁸⁷ seconds, while the disk is giving you an undetected bit error every minute or so. A year is about 2²⁵ seconds, so that's about 2¹⁶² years, about 10⁴⁹. By comparison, stars will cease to form in about 10¹⁴ years, all planets will be flung from their orbits around the burned-out remnants of stars by random gravitational perturbations in about 10¹⁵ years, the stellar remnants will cease to form cold, dark galaxies in about 10²⁰ years, and all protons will have decayed in about 10⁴⁰ years.
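For anyone who wants to redo the arithmetic, the powers of two above check out:

```python
import math

bits_per_line = 64 * 8                         # 64-character lines
assert 2**47 / bits_per_line == 2**38          # lines per undetected disk bit error
assert 255 - 38 == 217                         # disk errors per SHA-256 false collision
seconds = 2**(217 - 30)                        # at 2^30 lines/second: 2^187 seconds
years = seconds / 2**25                        # ~2^25 seconds per year: 2^162 years
assert round(math.log10(years)) == 49          # i.e. about 10^49 years
```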
And if you somehow manage to keep running uniq on your very large file at a billion lines a second, in a mere 500 times the amount of time from the universe's birth to the time when nothing is left of matter but black holes, SHA-256 will have produced your first random false collision.
Another possibly relevant note: there are about 2¹⁴⁹ Planck times per second. None of the above takes into account the Landauer and related quantum-mechanical limits to computation, which may be more stringent.