bash - Fast way of finding lines in one file that are not in another?
I have two large files (sets of filenames), with about 30,000 lines in each. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if file1 contains:
line1
line2
line3
and file2 contains:
line1
line4
line5
then the result/output should be:
line2
line3
This works:
grep -v -f file2 file1
but it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff(), where the output is just the lines and nothing else, but I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff():
diff file2 file1 | grep '^>' | sed 's/^> //'
Surely there must be a better way?
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:
diff --new-line-format="" --unchanged-line-format="" file1 file2
The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ):
diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)
In the above, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines, in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc.) for less strict matching.
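As a quick check, here is a minimal demonstration using the sample files from the question (the printf lines just create the two inputs, which happen to be sorted already):

printf 'line1\nline2\nline3\n' > file1
printf 'line1\nline4\nline5\n' > file2
diff --new-line-format="" --unchanged-line-format="" file1 file2
# prints:
# line2
# line3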
Explanation
The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to an empty "" prevents output of that kind of line.
If you are familiar with the unified diff format, you can partly recreate it with:
diff --old-line-format=$'-%l\n' --unchanged-line-format=$' %l\n' \
     --new-line-format=$'+%l\n' file1 file2
(bash's $'...' quoting supplies the trailing newline that %l does not include.)
The %l specifier is the contents of the line in question, and each one is prefixed with "+", "-" or " " as diff -u does (note that this only outputs the differences; it lacks the ---, +++ and @@ lines at the top of each grouped change). You can also use it for other useful things, such as numbering each line with %dn.
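For example, here is a sketch of that numbering idea, assuming GNU diff: %dn expands to the line's number in its input file, and %L is the line including its trailing newline, so the following should prefix each missing line with its position in file1:

diff --new-line-format="" --unchanged-line-format="" \
     --old-line-format='%dn: %L' file1 file2
# with the question's sample files:
# 2: line2
# 3: line3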
The diff method (along with other suggestions like comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.
# output lines in file1 that are not in file2
BEGIN { FS="" }                         # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }     # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }                # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
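To try it, save the script as linesnotin.awk (the name assumed by the split example further down) and pass file1 before file2; the order matters, because the NR==FNR test is what distinguishes the first input file from the second:

gawk -f linesnotin.awk file1 file2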
This stores the entire contents of file1 line by line in the line-number indexed array ll1[], and the entire contents of file2 line by line in the line-content indexed associative array ss2[]. After both files are read, it iterates over ll1 and uses the in operator to determine whether each line from file1 is present in file2. (This will produce different output from the diff method if there are duplicates.)
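For comparison, the comm alternative mentioned above can be written as a one-liner, again sorting in place with process substitution; a sketch, relying on comm's -2 and -3 flags to suppress the lines unique to file2 (column 2) and the lines common to both files (column 3), leaving only the lines unique to file1:

comm -23 <(sort file1) <(sort file2)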
In the event that the files are sufficiently large that storing both of them in the awk script causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
BEGIN { FS="" }
(NR==FNR) {  # file1, index by lineno and string
  ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {  # file2
  if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
  for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}
The above stores the entire contents of file1 in two arrays, one indexed by line number (ll1[]) and one indexed by line content (ss1[]). As file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.
In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), with repeated runs over chunks of file1, reading file2 completely each time:
split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1
Note the use and placement of - (meaning stdin) on the gawk command line; it is fed by split from file1, in chunks of 20,000 lines per invocation.
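To see the chunking in action before committing to the full run, one sketch is to substitute a harmless command for the awk script; split hands each chunk to the filter on stdin, so letting the filter count lines shows the chunk sizes:

split -l 20000 --filter='wc -l' < file1
# with a 30,000-line file1 this prints 20000, then 10000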
For users on non-GNU systems, there is a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provides GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.
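If Homebrew is available (an assumption on my part), the GNU versions can also be installed on OSX with, e.g.:

brew install coreutils gawk
# note: Homebrew's coreutils installs its tools with a g prefix
# (gsplit, gls, ...) to avoid shadowing the system binaries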