bash - Fast way of finding lines in one file that are not in another?
I have two large files (sets of filenames), with about 30,000 lines in each. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if file1 contains:
line1
line2
line3
and file2 contains:
line1
line4
line5
then the result/output should be:
line2
line3
This works:
grep -v -f file2 file1
but it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff(), where the output is just the lines and nothing else, but I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff():
diff file2 file1 | grep '^>' | sed 's/^> //'
Surely there must be a better way?
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:
diff --new-line-format="" --unchanged-line-format="" file1 file2
The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ):
diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)
In the above, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines, in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc.) for less strict matching.
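As a quick check, here is a minimal demonstration using the sample files from the question (the printf lines just create the two inputs, which happen to be sorted already):

printf 'line1\nline2\nline3\n' > file1
printf 'line1\nline4\nline5\n' > file2
diff --new-line-format="" --unchanged-line-format="" file1 file2
# prints:
# line2
# line3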
Explanation
The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to an empty "" prevents output of that kind of line.
If you are familiar with the unified diff format, you can partly recreate it with:
diff --old-line-format=$'-%l\n' --unchanged-line-format=$' %l\n' \
     --new-line-format=$'+%l\n' file1 file2
(bash's $'...' quoting supplies the trailing newline that %l does not include.)
The %l specifier is the contents of the line in question, and each one is prefixed with "+", "-" or " " as diff -u does (note that this only outputs the differences; it lacks the ---, +++ and @@ lines at the top of each grouped change). You can also use it for other useful things, such as numbering each line with %dn.
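For example, here is a sketch of that numbering idea, assuming GNU diff: %dn expands to the line's number in its input file, and %L is the line including its trailing newline, so the following should prefix each missing line with its position in file1:

diff --new-line-format="" --unchanged-line-format="" \
     --old-line-format='%dn: %L' file1 file2
# with the question's sample files:
# 2: line2
# 3: line3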
The diff method (along with other suggestions like comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.
# output lines in file1 that are not in file2
BEGIN { FS="" }                         # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
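To try it, save the script as linesnotin.awk (the name assumed by the split example further down) and pass file1 before file2; the order matters, because the NR==FNR test is what distinguishes the first input file from the second:

gawk -f linesnotin.awk file1 file2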
This stores the entire contents of file1 line by line in the line-number indexed array ll1[], and the entire contents of file2 line by line in the line-content indexed associative array ss2[]. After both files are read, it iterates over ll1 and uses the in operator to determine whether each line from file1 is present in file2. (This will produce different output from the diff method if there are duplicates.)
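For comparison, the comm alternative mentioned above can be written as a one-liner, again sorting in place with process substitution; a sketch, relying on comm's -2 and -3 flags to suppress the lines unique to file2 (column 2) and the lines common to both files (column 3), leaving only the lines unique to file1:

comm -23 <(sort file1) <(sort file2)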
In the event that the files are sufficiently large that storing both of them in the awk script causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
BEGIN { FS="" }
(NR==FNR) {  # file1, index by lineno and string
  ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {  # file2
  if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
  for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}
The above stores the entire contents of file1 in two arrays, one indexed by line number (ll1[]) and one indexed by line content (ss1[]). As file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.
In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), with repeated runs over chunks of file1, reading file2 completely each time:
split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1
Note the use and placement of - (meaning stdin) on the gawk command line; it is fed by split from file1, in chunks of 20,000 lines per invocation.
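To see the chunking in action before committing to the full run, one sketch is to substitute a harmless command for the awk script; split hands each chunk to the filter on stdin, so letting the filter count lines shows the chunk sizes:

split -l 20000 --filter='wc -l' < file1
# with a 30,000-line file1 this prints 20000, then 10000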
For users on non-GNU systems, there is a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provides GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.
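If Homebrew is available (an assumption on my part), the GNU versions can also be installed on OSX with, e.g.:

brew install coreutils gawk
# note: Homebrew's coreutils installs its tools with a g prefix
# (gsplit, gls, ...) to avoid shadowing the system binaries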