Tuesday, June 21, 2016

Never Lose Another File By Mastering Mlocate

http://www.linux-server-security.com/linux_servers_howtos/linux_mlocate_command.html

It’s not uncommon for a Sysadmin to have to find needles which are buried deep inside haystacks. On a busy machine there can be files in their hundreds of thousands present on your filesystems. What do you do when a pesky colleague needs to check that a single configuration file is up-to-date but can’t remember where it is located?

If you’ve used Unix-type machines for a while then you’ve almost certainly come across the “find” command before. It is unquestionably exceptionally sophisticated and highly functional. Here’s an example which just searches for links inside a directory, ignoring files:
# find . -lname "*"
You can do seemingly endless things with the “find” command; there’s no denying that. The “find” command is however nice and succinct when it wants to be but it can also easily grow arms and legs very quickly. It’s not necessarily just thanks to the “find” command itself but coupled with “xargs” you can pass it all sorts of options to tune your output, and indeed delete those files which you have found.
There often comes a time when simplicity is the preferred route however. Especially when a testy boss is leaning over your shoulder, chatting away about how time is of the essence. And, imagine trying to vaguely guess the path of the file that you haven’t ever seen before but your boss is certain lives somewhere on the busy “/var” partition.
Step forward, “mlocate”. You may be aware of one of its close relatives “slocate” (which securely, note the prepended letter “s” for “secure”, took note of the pertinent file permissions to avoid unprivileged users seeing privileged files). Additionally there is also the older, original “locate” command from whence they came.
The differences, between other members of its family (according to “mlocate” at least) is that when scanning your filesystems mlocate doesn’t need to continually rescan all your filesystem(s). Instead it merges its findings (note the prepended letter “m” for “merge”) with any existing file lists, making it much more performant and less heavy on system caches.
In this article we’ll look at “mlocate” (and simply refer to it as “locate”) due to its popularity and how to quickly and easily you can tune it to your heart’s content.

   Compact And Bijou


If you’re anything like me unless you re-use complex commands frequently then ultimately you forget them and need to look them up.The beauty of the locate command is that you can query entire filesystems very quickly and without worrying about top-level, root, paths with a simple command using “locate”.
In the past you might well have discovered that the “find” command can be very stubborn and cause you lots of unwelcome head-scratching. You know, a missing semicolon here or a special character not being escaped properly there. Let’s leave the complicated “find” command alone now, put our feet up and have a gentle look into the clever little command that is “locate”.
You will most likely want to check that it’s on your system first by running these commands:
Red Hat Derivatives
# yum install mlocate
Debian Derivatives
# apt-get install mlocate
There shouldn’t be any differences between distributions but there are almost definitely a few subtle differences between versions, beware.
Next we’ll introduce a key component to the locate command, namely “updatedb”. As you can probably guess this is the command which “updates” the locate command’s “db”. It’s hardly named counter-intuitively after all.
The “db” is the locate command’s file list which I mentioned earlier. That list is held in a relatively simple and highly efficient database for performance. The “updatedb” runs periodically, usually at quiet times of the day, scheduled via a “cron job”. In Listing One we can see the innards of the file “/etc/cron.daily/mlocate.cron” (both the file’s path and its contents might possibly be distro and version dependent).
#!/bin/sh
nodevs=$(< /proc/filesystems awk '$1 == "nodev" { print $2 }')
renice +19 -p $$ >/dev/null 2>&1
ionice -c2 -n7 -p $$ >/dev/null 2>&1
/usr/bin/updatedb -f "$nodevs"
Listing One: How the “updatedb” command is triggered every day
As we can see the “mlocate.cron” script makes careful use of the excellent “nice” commands in order to have as little impact on system performance as possible. I haven’t explicitly stated that this command runs at a set time every day (although if my addled memory serves the original locate command was associated with a slow-your-computer-down scheduled run at midnight). This is thanks to the fact that on some “cron” versions delays are now introduced into overnight start times.
This is probably because of the so-called “Thundering Herd Problem”.
Imagine there’s lots of computers (or hungry animals) waking up at the same time to demand food (or resources) from a single or limited source. This can happen when all your hippos set their wristwatches using NTP (okay, this allegory is getting stretched too far but bear with me). Imagine that exactly every five minutes (just as a “cron job” might) they all demand access to food or something otherwise being served.
If you don’t believe me then have a quick look at the config from, a version of “cron” which is called “Anacron”, in Listing Two, which is the guts of the file “/etc/anacrontab”.
# /etc/anacrontab: configuration file for anacron
# See anacron(8) and anacrontab(5) for details.
SHELL=/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# the maximal random delay added to the base delay of the jobs
RANDOM_DELAY=45
# the jobs will be started during the following hours only
START_HOURS_RANGE=3-22
#period in days   delay in minutes   job-identifier   command
1       5       cron.daily              nice run-parts /etc/cron.daily
7       25      cron.weekly             nice run-parts /etc/cron.weekly
@monthly 45     cron.monthly            nice run-parts /etc/cron.monthly
Listing Two: How delays are introduced into when “cron” jobs are run
From Listing Two you have hopefully spotted both “RANDOM_DELAY” and the “delay in minutes” column. If this aspect of “cron” is new to you then you can find out more here:
# man anacrontab
Failing that you don’t need to be using Anacron, you can introduce a delay yourself if you’d like. An excellent Web page (now more than a decade old) discusses this issue in a perfectly sensible way (sadly, it's now showing a 404 but may return): http://www.moundalexis.com/archives/000076.php
That excellent website discusses using “sleep” to introduce a level of randomality, as we can see in Listing Three.
#!/bin/sh

# Grab a random value between 0-240.
value=$RANDOM
while [ $value -gt 240 ] ; do
 value=$RANDOM
done

# Sleep for that time.
sleep $value

# Syncronize.
/usr/bin/rsync -aqzC --delete --delete-after masterhost::master /some/dir/
Listing Three: A shell script to introduce random delays before triggering an event, to avoid Thundering Herds of Hippos, which was found at http://www.moundalexis.com/archives/000076.php
The aim in mentioning these (potentially surprising) delays was to point you at the file “/etc/crontab” or the “root” user’s own “crontab” file. If you want to change the time of when the locate command runs specifically because of disk access slowdowns then it’s not too tricky. There may be a more graceful way of achieving this result but you can also just move the file “/etc/cron.daily/mlocate.cron” somewhere else (I’ll use the “/usr/local/etc” directory) and as the root user add an entry into the “root” user’s “crontab” with this command and paste the content as below:
# crontab -e
33 3 * * * /usr/local/etc/mlocate.cron
Rather than trapse through “/var/log/cron” and it’s older, rotated, versions you can quickly tell the last time your “cron.daily” jobs were fired, in the case of “anacron” at least, as so:
# ls -hal /var/spool/anacron

   Well Situated


Incidentally you might get a little perplexed if trying to look up the manuals for updatedb and the locate command. Even though it’s actually the “mlocate” command and the binary is “/usr/bin/updatedb” on my filesystem you probably want to use varying versions of these “man” commands to find what you’re looking for:
# man locate
# man updatedb
# man updatedb.conf
Let’s look at the important “updatedb” command in a little more detail now. It’s worth mentioning that after installing the locate utility you will need to initialise your file-list database before doing anything else. You have to do this as the “root” user in order to reach all the relevant areas of your filesystems or the locate command will complain otherwise. Initialise or update your database file, whenever you like, with this command:
# updatedb
Obviously the first time that this is run it may take a little while to complete but when I’ve installed the locate command afresh I’ve almost always been pleasantly surprised at how quickly it finishes. After a hop, a skip and a jump you can then immediately query your file database. However let’s wait a moment before doing that.
We’re dutifully informed by its manual that the database created as a result of running the “updatedb” command resides at the following location: “/var/lib/mlocate/mlocate.db”.
If we want to change how the “updatedb” command is run then we need to affect it with our config file, a reminder that it should live here: “/etc/updatedb.conf”. Listing Four shows the contents of it on my system:
PRUNE_BIND_MOUNTS = "yes"
PRUNEFS = "9p afs anon_inodefs auto autofs bdev binfmt_misc cgroup cifs coda configfs cpuset debugfs devpts ecryptfs exofs fuse fusectl gfs gfs2 hugetlbfs inotifyfs iso9660 jffs2 lustre mqueue ncpfs nfs nfs4 nfsd pipefs proc ramfs rootfs rpc_pipefs securityfs selinuxfs sfs sockfs sysfs tmpfs ubifs udf usbfs"
PRUNENAMES = ".git .hg .svn"
PRUNEPATHS = "/afs /media /net /sfs /tmp /udev /var/cache/ccache /var/spool/cups /var/spool/squid /var/tmp"
Listing Four: The innards of the file “/etc/updatedb.conf” which affects how our database is created
The first thing that my eye is drawn to is the “PRUNENAMES” section. As you can see by stringing together a list of directory names, delimited with spaces, you can suitably ignore them. One caveat is that only directory names can be skipped and you can’t use wildcards. As we can see all of the otherwise-hidden files in a Git repository (the “.git” directory” might be an example of putting this option to good use.
If you need to be more specific then, again using spaces to separate your entries, you can instruct the locate command to ignore certain paths. Imagine for example that you’re generating a whole host of temporary files overnight which are only valid for one day. You’re aware that this is a special directory of sorts which employs a familiar naming convention for its thousands of files. It would take the locate command a relatively long time to process the subtle changes every night adding unnecessary stress to your system. The solution is of course to simply add it to your faithful “ignore” list.

   Perfectly Appointed


As we can see from Listing Five the file “/etc/mtab” offers not just a list of the more familiar filesystems such as “/dev/sda1” but also a number of others that you may not immediately remember.
/dev/sda1 /boot ext4 rw,noexec,nosuid,nodev 0 0
proc /proc proc rw 0 0
sysfs /sys sysfs rw 0 0
devpts /dev/pts devpts rw,gid=5,mode=620 0 0
/tmp /var/tmp none rw,noexec,nosuid,nodev,bind 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
Listing Five: A mashed up example of the innards of the file “/etc/mtab”
As some of these filesystems shown in Listing Five contain ephemeral content and indeed content that belongs to pseudo-filesystems it is clearly important to ignore their files. If for no other reason than because of the stress added to your system during each overnight update.
In Listing Four the “PRUNEFS” option takes care of this and ditches those not suitable (for most cases). There’s certainly a few different filesystems to consider as you can see:
PRUNEFS = "9p afs anon_inodefs auto autofs bdev binfmt_misc cgroup cifs coda configfs cpuset debugfs devpts ecryptfs exofs fuse fusectl gfs gfs2 hugetlbfs inotifyfs iso9660 jffs2 lustre mqueue ncpfs nfs nfs4 nfsd pipefs proc ramfs rootfs rpc_pipefs securityfs selinuxfs sfs sockfs sysfs tmpfs ubifs udf usbfs"
The “updatedb.conf” manual succinctly informs us of the following information in relation to the “PRUNE_BIND_MOUNTS” option:
“If PRUNE_BIND_MOUNTS is 1 or yes, bind mounts are not scanned by updatedb(8).  All file systems mounted in the subtree of a bind mount are skipped as well, even if they are not bind mounts.  As an exception, bind mounts of a directory on itself are not skipped.”
Assuming that makes sense, before moving onto some locate command examples, a quick note. Excluding some versions of the “updatedb” command it can also be told to ignore certain “non-directory files” but this does not always apply so don’t blindly copy and paste config between versions if you use such an option.

 Needs Modernisation


As mentioned earlier there are times when finding a specific file needs to be so quick that it’s at your fingertips before you’ve consciously recalled the command. This is the irrefutable beauty of the locate command.
And, if you’ve ever sat in front of a horrendously slow Windows machine watching the hard disk light flash manically, as if it was suffering a conniption, thanks to the indexing service running (apparently in the background) then I can assure you that the performance that you’ll receive from the “updatedb” command will be of very welcome relief.
You should bear in mind, that unlike the “find” command, there’s no need to remember the base paths of where your file might be residing. By that I mean that all of your (hopefully) relevant filesystems are immediately accessed with one simple command and that remembering paths is almost a thing of the past.
In its most simple form the locate command looks like this:
# locate chrisbinnie.pdf
There’s also no need to escape hidden files which start with a dot or indeed expand a search with an asterisk:
# locate .bash
Listing Six shows us what has been returned, in an instant, from the many partitions the clever locate command has scanned previously.
/etc/bash_completion.d/yum.bash
/etc/skel/.bash_logout
/etc/skel/.bash_profile
/etc/skel/.bashrc
/home/chrisbinnie/.bash_history
/home/chrisbinnie/.bash_logout
/home/chrisbinnie/.bash_profile
/home/chrisbinnie/.bashrc
/usr/share/doc/git-1.5.1/contrib/completion/git-completion.bash
/usr/share/doc/util-linux-ng-2.16.1/getopt-parse.bash
/usr/share/doc/util-linux-ng-2.16.1/getopt-test.bash
Listing Six: The search results from running the command: “locate .bash”
I’m suspicious that the following usage has altered slightly, from back in the day when the “slocate” command was more popular or possibly the original locate command, but you can receive different results by adding an asterisk to that query as so:
# locate .bash*
In Listing Seven we can see the difference between that of Listing Six’s output. Thankfully the results make more sense now that we can see them side by side. In this case the addition of the asterisk is asking the locate command to return files beginning with “.bash” as opposed to all files containing that string of characters.
/etc/skel/.bash_logout
/etc/skel/.bash_profile
/etc/skel/.bashrc
/home/d609288/.bash_history
/home/d609288/.bash_logout
/home/d609288/.bash_profile
/home/d609288/.bashrc
Listing Seven: The search results from running the command: “locate .bash*” with the addition of an asterisk
If you remember I mentioned “xargs” earlier and the “find” command. Our trusty friend the locate command can also play nicely with the “--null” option of “xargs” by outputting all of the results onto one line (without spaces which isn’t great if you want to read it yourself) by using the “-0” switch like this:
# locate -0 .bash
An option which I like to use (admittedly that’s if I remember to use it because the locate command rarely needs queried twice to find a file thanks to the syntax being so simple) is that of the “-e” option.
# locate -e .bash
For the curious that “-e” switch means “existing”. And, in this case, you can use “-e” to ensure that any files returned by the locate command do actually exist at the time of the query on your filesystems.
It’s almost magical, that even on a slow machine, the mastery of the modern locate command allows us to query its file database and then check against the actual existence of many files in seemingly no time whatsoever. Let’s try a quick test with a file search that’s going to return a zillion results and use the “time” command to see how long it takes both with and without the “-e” option being enabled.
I’ll choose files with the compressed “.gz” extension. Starting with a count we can see there’s not quite a zillion but a fair number of files ending “.gz” on my machine, note the “-c” for “count”:
# locate -c .gz
7539
This time we’ll output the list but “time” it and see the abbreviated results as follows:
# time locate .gz
real    0m0.091s
user    0m0.025s
sys     0m0.012s
That’s pretty swift but it’s only reading from the overnight-run database. Let’s get it to do a check against those 7,539 files too, to see if they truly exist and haven’t been deleted or renamed since last night and time the command again:
# time locate -e .gz
real    0m0.096s
user    0m0.028s
sys     0m0.055s
The speed difference is nominal as you can see. There’s no point in talking about lightning or you-blink-and-you-miss-it, because those aren’t suitable yardsticks. Relative to the other Indexing Service I mentioned a few moments ago let’s just say that’s pretty darned fast.
If you need to move the efficient database file used by the locate command (in my version it lives here: “/var/lib/mlocate/mlocate.db”) then that’s also easy to do. You may wish to do this for example because you’ve generated a massive database file (which is only 1.1MB in my case so it’s really tiny in reality) which needs to be put onto a faster filesystem.
Incidentally even the “mlocate” utility appears to have created an “slocate” group of users on my machine so don’t be too alarmed if you see something similar, as we can see here from a standard file listing:
-rw-r-----. 1 root slocate 1.1M Jan 11 11:11 /var/lib/mlocate/mlocate.db
Back to the matter in hand. If you want to move away from “/var/lib/mlocate” as your directory being used by the database then you can use this command syntax (and you’ll have to become the “root” user with “sudo -i” or “su -” for at least the first command to work correctly):
# updatedb -o /home/chrisbinnie/my_new.db
# locate -d /home/chrisbinnie/my_new.db SEARCH_TERM
Obviously replace your database name and path. The “SEARCH_TERM” element is the fragment of the filename that you’re looking for (wildcards and all).
If you remember I mentioned that you need to run “updatedb” command as the superuser in order to reach all the areas of your filesystems.
This next example should cover two useful scenarios in one. According to the manual you can also create a “private” database for standard users as follows:
# updatedb -l 0 -o DATABASE -U source_directory
Here the previously seen “-o” option means that we output our database to a file (obviously called “DATABASE”). The “-l 0” addition apparently means that the “visibility” of the database file is affected. It means (if I’m reading the docs correctly) that my user can read it but otherwise, without that option, only the locate command can.
The second useful scenario for this example is that we can create a little database file specifying exactly which path its top-level should be. Have a look at the “database-root” or “-U source_directory” option in our example. If you don’t specify a new root file path then the whole filesystem(s) is scanned instead.
If you wanted to get clever and chuck a couple of top-level source directories into one command then you can manage that having created two separate databases. Very useful for scripting methinks.
You can achieve that like so with this command:
# locate -d /home/chrisbinnie/database_one -d /home/chrisbinnie/database_two SEARCH_TERM
The manual dutifully warns however that ALL users that can read the “DATABASE” file can also get the complete list of files in the subdirectories of the chosen “source_directory”. So use these commands with some care as a result.

   Priced To Sell


Back to the mind-blowingly simplicity of the locate command being used on a day-to-day basis.
There are many times when newbies get confused with case-sensitivity on Unix-type systems. Simply use the conventional “-i” option to ignore case entirely when using the flexible locate command:
# locate -i ChrisBinnie.pdf
If you have a file structure that has a number of symlinks holding it together then there might be occasion when you want to remove broken symlinks from the search results. You can do that with this command:
# locate -Le chrisbinnie_111111.xml
If you needed to limit the search results then you could use this functionality, also in a script for example (similar to the “-c” option for counting), as so:
# locate -l25 *.gz
This command simply stops after outputting the first 25 files that were found. Coupled with being piped through the “grep” command it’s very useful on a super busy system.

   Popular Area


We briefly touched upon performance earlier and I couldn’t help but stumble across this nicely-written Blog entry (http://jvns.ca/blog/2015/03/05/how-the-locate-command-works-and-lets-rewrite-it-in-one-minute/). The author discusses thoughts on the trade-offs between the database size becoming unwieldy and the speed at which results are delivered.
What piqued my interest is the comments on how the original locate command was written and what limiting factors were considered during its creation. Namely how disk space isn’t quite so precious any longer and nor is the delivery of results even when 700,000 files are involved.
I’m certain that the author(s) of “mlocate” and its forebears would have something to say in response to that Blog post. I suspect that holding onto the file permissions to give us the “secure” and “slocate” functionality in the database might be a fairly big hit in terms of overheads. And, as much as I enjoyed the post, needless to say I won’t be writing a Bash script to replace “mlocate” any time soon. I’m more than happy with the locate command and extol its qualities at every opportunity.

   Sold


Hopefully you have now had enough of an insight into the superb locate command to prune, tweak, adjust and tune it to your unique set of requirements.
As we’ve seen it’s fast, convenient, powerful and efficient. Additionally you can ignore the “root” user demands and use it within scripts for very specific tasks.
My favourite feature however has to be when I’ve been woken up at 4am, called out because of an emergency. It’s not a good look, having to remember this complex “find” command and typing it slowly with bleary eyes (and managing to add lots of typos):
# find . -type f -name "*.gz"
Instead I can just use this simple locate command (they do produce slightly different results but I’m sure you get the point):
# locate *.gz
As has been said, any fool can create things that are bigger, bolder, rougher and tougher but it takes a modicum of genius to create something simpler. And in terms of introducing more people to the venerable Unix-type command line there’s little argument that the locate command welcomes them with open arms.

No comments:

Post a Comment