linux - How to find unique rows in a large file?
I have a large file (4 billion rows) where each row contains one word. I want to find the list of unique words and their corresponding counts.
I tried:
sort largefile | uniq -c > outfile
but it is still running with no output.
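As an aside, if this stays on the sort route and GNU sort is available, forcing the C locale and giving sort a larger buffer, more threads, and a scratch directory can speed it up considerably. A rough sketch, with the buffer size, thread count, and temp path as placeholders to adjust:
LC_ALL=C sort -S 4G --parallel=4 -T /path/to/tmp largefile | uniq -c > outfile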
then tried:
awk '!arr[$1]++' largefile > outfile
but it does not print the counts. How can I use awk to print the counts as well? Or is there another alternative approach that can handle large files?
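For reference, the usual awk counting pattern keeps the counts in an array and prints them in an END block. A rough sketch, untested at this scale (the array has to hold every unique word in memory):
awk '{ count[$1]++ } END { for (w in count) print w, count[w] }' largefile > outfile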
Edit: there are 17 million unique words in the file.
How large are the files? How many unique words are you expecting? In most cases the sort | uniq solution is a good start, but if the files are really big it's not good. A Perl script that saves each word in a hash might work for you.
This is untested and from memory, so it may have a bunch of errors...
my %words = ();
# replace "yourfile" with the actual input file name
open(my $in, "<", "yourfile") or die "arrgghh file didn't open: $!";
while (<$in>) {
    chomp;
    $words{$_}++;    # count every occurrence of each word
}
close($in);
foreach my $k (keys %words) {
    print "$k $words{$k}\n";    # word followed by its count
}
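Assuming the script is saved as, say, count_words.pl (the name is arbitrary) and "yourfile" is changed to the real file name, it runs as:
perl count_words.pl > outfile
With the roughly 17 million unique words mentioned in the edit, the hash should fit in memory on a reasonably modern machine, which is the point of this approach over the external sort.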