linux - How to find unique rows in a large file?
I have a large file (4 billion rows) where each row contains one word. I want to find the list of unique words and their corresponding counts.
I tried:
sort largefile | uniq -c > outfile
but it is still running with no output.
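As an aside, if this stays on the sort route and GNU sort is available, forcing the C locale and giving sort a larger buffer, more threads, and a scratch directory can speed it up considerably. A rough sketch, with the buffer size, thread count, and temp path as placeholders to adjust:
LC_ALL=C sort -S 4G --parallel=4 -T /path/to/tmp largefile | uniq -c > outfile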
then tried:
awk '!arr[$1]++' largefile > outfile
but it does not print the counts. How can I use awk to print the counts as well? Or is there another alternative approach that can handle large files?
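For reference, the usual awk counting pattern keeps the counts in an array and prints them in an END block. A rough sketch, untested at this scale (the array has to hold every unique word in memory):
awk '{ count[$1]++ } END { for (w in count) print w, count[w] }' largefile > outfile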
Edit: there are 17 million unique words in the file.
How large are the files? How many unique words are you expecting? In most cases the sort | uniq solution is a good start, but if the files are really big it's not good. A Perl script that saves each word in a hash might work for you.
This is untested and from memory, so it may have a bunch of errors...
my %words = ();
# replace "yourfile" with the actual input file name
open(my $in, "<", "yourfile") or die "arrgghh file didn't open: $!";
while (<$in>) {
    chomp;
    $words{$_}++;    # count every occurrence of each word
}
close($in);
foreach my $k (keys %words) {
    print "$k $words{$k}\n";    # word followed by its count
}
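Assuming the script is saved as, say, count_words.pl (the name is arbitrary) and "yourfile" is changed to the real file name, it runs as:
perl count_words.pl > outfile
With the roughly 17 million unique words mentioned in the edit, the hash should fit in memory on a reasonably modern machine, which is the point of this approach over the external sort.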