Monday, June 20, 2016

Proofreading for illusions with grep and AWK

http://www.thelinuxrain.com/articles/proofreading-for-illusions-with-grep-and-awk

Lexical illusions are very hard to find when proofreading. The most common lexical illusion is a duplicated word, as in this well-known example:
A lexical illusion:
many people are not aware that the
the brain will automatically ignore
a second instance of the word 'the'
when it starts a new line.
1
But if you let grep and AWK do the proofreading — problem solved!

Duplications on one line: grep

Word duplications are sometimes just as easily overlooked when they're on a single line, as in this example:
The Liberal Party has taken the old line so beloved of
economic think tanks since the 1980s of cutting company
taxes and and assisting innovation, while the ALP's approach
involves an emphasis on on education spending and concerns
over inequality that is increasingly becoming the new standard.
2
A good way to find the dupes is with this grep code, which I found here:
grep -onE '(\b.+) \1\b'
3
The grep options used are '-o', which returns only the looked-for string, rather than the whole line; '-n', which gives the number of the line with the looked-for string; and '-E', which allows grep to use nifty things like backreferences.
The looked-for string is a regular expression. The bit in brackets is word boundary (\b) followed by 1 or more appearances (+) of any character (.). This is followed by a space, then a backreference (\1) representing the bit in brackets, and closing with another word boundary. If that last '\b' wasn't in the regex, grep would return non-duplications like 'on one occasion'.

Duplications across successive lines: AWK

Here's an AWK command that finds line pairs in which the last word of the first line is also the first word of the second line. It relies on the fact that AWK recognises words as separate fields, because they're separated by whitespace, and that's the default field separator for AWK.
awk '$1==a {print b"\n"$0; a=""; b=""} {a=$NF; b=$0}'
4
AWK reads the text line by line, and the command begins with a pattern-action statement: if the first field (word) equals the variable 'a', AWK does the action in brackets, namely print... But wait! No variable 'a' has been defined yet, so that action can't happen on the first line. AWK moves to the second action, which is to set the variable 'a' equal to the contents of the last field/word ($NF) in the first line, and the variable 'b' to the contents of the whole first line ($0).
Next, AWK reads the second line. If the first word in this line is the same as the last word in the last line, AWK prints the last line (as stored in 'b'), followed by a newline, followed by the whole current line. It then 'empties' the two variables by setting them equal to the empty string '""'. Whether the first word equals the last word or not, AWK also does the second action, namely fill the two variables with the last word in the current line and the whole current line, ready for a test of the third line. If no line pairs pass the test, AWK doesn't print any lines.
We can make this command slightly cooler by allowing for the possibility that there are lexical illusions on successive lines, as here:
A lexical illusion:
many people are not aware that the
the brain will automatically ignore
ignore a second instance of the word 'the'
when it starts a new line.

awk '$1==a {print b"\n"$0"\n"; a=""; b=""} {a=$NF; b=$0}'
5
The command now adds a blank line after each two-line match, so that successive illusion-pairs appear separately in the output. Note that AWK will ignore spaces at the beginnings and ends of lines, since they're separators, not fields:
6

About the Author

Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.

No comments:

Post a Comment