Tuesday, November 3, 2015

Skimming Files Before You Grep Them

http://freedompenguin.com/articles/how-to/skimming-files-before-you-grep-them

Grep is a utility for finding string patterns in text files. I can't show you all the magic grep can perform, but I can show you this: the stuff you want to find in a file is often buried in a lot of crud you don't care about. Web server logs? Yeah. Those. Spreadsheets? Unfortunately, they get large too. Image files? Yes, those too! I'll focus on those towards the end. So, let's review a few tricks…
Do you need Regular Expressions?
Grep could be grep, fgrep, or egrep in the Linux world. If you're looking for a fixed string (whole words or numbers) rather than a regular expression, you can save processor time by using fgrep, which skips the regex engine entirely.
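A quick sketch of the difference (the CHANGELOG filename is just an example): a dot in a grep pattern is a regex wildcard, while fgrep takes it literally.

    # grep: '.' matches any character, so "1a2b3" would also match
    grep '1.2.3' CHANGELOG
    # fgrep: the pattern is a fixed string; only a literal "1.2.3" matches
    fgrep '1.2.3' CHANGELOG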
Did it start with a "Capital Letter?" Get into the habit of using the -i switch with grep to make your searches case-insensitive. It saves you time by letting first runs catch more matches.
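For example, assuming an Apache error log at the usual path, one case-insensitive pass catches "Error", "ERROR", and "error" alike:

    grep -i 'error' /var/log/httpd/error_log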
Server Log Files
Your mom always told you to clean your room. Well, the logrotate utility is like cleaning your log file closet, but with cron. As a habit, I rotate my log files daily and compress them. Zgrep is a good way to grep through compressed log files. If you're doing a simple search, you can make zgrep use fgrep behind the scenes with the GREP environment variable:
    user@foo$ GREP=fgrep zgrep -il 'php' /var/log/httpd/*gz
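For context, here's a minimal sketch of the kind of logrotate stanza that produces those daily compressed files. The /etc/logrotate.d/httpd location and log path are assumptions; adjust for your distribution:

    # /etc/logrotate.d/httpd (hypothetical)
    /var/log/httpd/*log {
        daily          # rotate every day
        rotate 30      # keep 30 old logs around
        compress       # gzip rotated files, giving the .gz names above
        missingok
        notifempty
    }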
Back to zgrep: see that -l switch? It just lists the names of matching files. If you're hunting down the date ranges where a string appears, that helps. Conversely, if you already know the date range of the files you need to search, don't bother grepping everything…avoid it if you can. "Globbing" is your friend:
    GREP=fgrep zgrep -il 'php' /var/log/httpd/error_log-201510{13,14,15}.gz
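Strictly speaking, the {13,14,15} part is shell brace expansion rather than a glob, but the effect is the same: the shell builds the file list before zgrep ever runs. You can preview it with echo:

    # prints the three expanded paths without reading any files
    echo /var/log/httpd/error_log-201510{13,14,15}.gz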
Is file compression actually faster or slower? That depends on your processor and your disk bandwidth. If you have a 5400rpm laptop hard drive, the fewer blocks you read off the disk, the better.
If you have an SSD, you might not notice. If you have a Raspberry Pi running on an SD card, the disk is actually your arch rival and compression becomes very important. Text files compress very well, and decompression is not expensive for CPUs; it's writing the decompressed data back to disk that costs you in IO (input/output), which is why decompression seems slow. If you have a desktop processor, your base clock speed is so much faster than an SSD's read throughput that you won't even notice the CPU cost. Use the compression.
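If you want to test the trade-off on your own hardware, a rough sketch (the access_log name is just an example) is to time reading both copies of the same log. Drop the page cache between runs, as we do later in this article, or the second read will come straight from RAM:

    # decompress on the fly: measures compressed-read IO plus CPU cost
    time zcat access_log.gz > /dev/null
    # plain read: measures raw disk bandwidth for the full-size file
    time cat access_log > /dev/null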
Can you grep Pictures?
Yes. You can grep anything, even ISO files, but you'll likely find a lot of useless crap. It's best to run the strings utility on binary files before you start grepping for things. It filters out the binary data, leaving only the printable character sequences.
    strings < img4123.jpg | grep '2015'
This is a way to paw through the EXIF data in an image to see if you took it this year. Neat, huh? You don’t need to learn any ExifTool syntax to do that. Let’s try it out in a folder with … uh … 408 images!
    > ls *jpg | while read f ; do echo -n "$f " ; strings "$f" | grep -c 2015 ; done
    2015-05-24-two-flowers-1920x1272.jpg 1
    _img2746.jpg 3
    _img2747.jpg 3
    _img2748.jpg 3
    _img2749.jpg 3
    _img2759.jpg 3
    _img2760.jpg ^C
Oh, I have plenty of SSD, but that's still taking a while. Why? Because I'm chewing through 2.9 GB of pictures. Let's profile that more closely by timing the run:
    > time for f in *jpg ; do strings "$f" | grep -c 2015 >/dev/null ; done
    real 1m15.339s
    user 0m41.112s
    sys 0m5.440s
Now let's try that reading only the area of the file where we know the EXIF data lives: at the start. We can be pretty safe in assuming we won't need to read more than 4K of data to find it. Let's drop our caches first, so the previous run's page cache doesn't skew the timing, and try again.
    su -c 'echo 3 > /proc/sys/vm/drop_caches'
    > time for f in *jpg ; do head -c4096 "$f" | strings | grep -c 2015 >/dev/null ; done
    real 0m5.601s
    user 0m0.308s
    sys 0m1.812s
Wow, that's impressive, huh? Why 4096? That's 4 KiB: the standard Linux page size, and the default block size on most Linux file systems.
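If you're curious what those numbers are on your own box, both are easy to check (the "." just means the file system holding the current directory):

    getconf PAGE_SIZE   # kernel page size, in bytes
    stat -f -c %S .     # fundamental block size of this file system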
Let's tie that in with my image-processing workflow: ImgDate.sh is the wee utility I glossed over in the previous article about organizing my image files. You'll recognize how it works now:
    #!/bin/bash
    # ImgDate.sh: print a JPEG's EXIF date as YYYY-MM-DD
    [ -z "$1" ] && echo "no filename, bye" && exit 1
    head -c4096 "$1" \
    | strings \
    | perl -ne 'm/(\d\d\d\d):(\d\d):(\d\d)/ && print "$1-$2-$3\n"' \
    | head -n1
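Usage is just the script plus an image file; run against the IMAG0656.jpg example used below, it should print the same capture date:

    ./ImgDate.sh IMAG0656.jpg
    2015-05-19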
I begin with my standard guard clause and swing my perl wand at it. I could also have used sed or grep, but perl seems to work just as quickly and its regular expressions are easier to write. Perl also let me use the capture variables ($1, $2, $3) to output the format right where I found it.
There are places (like embedded systems) where Perl isn't an option. In those places, this grep and tr equivalent would have been my fallback:
    > head -c4096 IMAG0656.jpg \
    | strings | egrep -o '[0-9]{4}:[0-9]{2}:[0-9]{2}' \
    | head -n1 | tr ':' '-'
    2015-05-19
So: the less you read off the disk, the faster grep is going to go.
