Thursday, March 29, 2012

Using Nagios to Monitor Your Clusters’ Health


The Nagios network monitoring and alerting framework lets you easily keep track of a wide variety of hosts and services, and generate reports and alerts targeted to specific teams or individuals. By using plugins, you can further enhance Nagios’s functionality, giving it capabilities not available in the core product. One such plugin lets you monitor the health of your cluster instead of that of individual hosts.
A cluster is a group of hosts or services that perform a common function in tandem. Some clusters are tied together at the operating system level (Beowulf clusters, for example) while some are defined by the tasks they perform. For instance, say your organization has four machines that work as HTTP servers behind a common front end. These constitute an HTTP service cluster, and you would want Nagios to monitor all of these machines and the HTTP service running on them.
Typically, each node in such a cluster is configured to pass its work to the other nodes should it fail unexpectedly, so it would not be an emergency if one machine went down, though you'd want to know about it immediately. I would worry if two nodes went down, and would declare an emergency if three of them were not in a condition to serve users.
The check_cluster plugin can check the health of two kinds of clusters: host clusters, which comprise all the machines as a whole, and service clusters, which encompass a particular service running on the hosts. To understand the difference, consider a scenario where a particular physical machine is running several HTTP server instances on different ports, running a reverse proxy server on the front end to hide the port. All these HTTP services can be said to be in a service cluster. The Nagios check_cluster plugin can monitor these services and report a state of OK, warning, or critical, depending on the way you define the cluster’s health.

Service Cluster Monitoring

Suppose you want to monitor this HTTP service cluster. You first need to monitor each individual server instance using the check_http plugin. Make a note of the service_description of each check, as we will use it in a moment when defining the cluster check. The service description acts as an identifier for the check. Once you have set up monitoring for the individual server instances, you can put them in a cluster check.
First, you need to define the command to execute the check in /etc/nagios/conf.d/. We’ll call it check_cluster_service:
define command{
 command_name check_cluster_service
 command_line path_to_plugin/check_cluster --service -w $ARG1$ -c $ARG2$ -d $ARG3$
 }
In the parameters passed to command_line, --service indicates this is a service cluster check, -w defines the number of individual service checks that must fail to get the warning state, -c defines the critical range, and -d contains the list of physical hosts along with the service check description you noted earlier. You can add or remove parameters passed to command_line as you need; see the man page for instructions.
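To make the -w and -c values more concrete, here is a minimal Python sketch of how a threshold range can be evaluated. It assumes check_cluster follows the standard range format from the Nagios Plugins development guidelines and compares the ranges against the number of non-OK checks; the function names are hypothetical and are not part of the plugin.

```python
import math


def parse_range(spec):
    """Parse a threshold such as '2', '1:2', '2:', '~:3' or '@1:2'.

    Returns (low, high, inside). The default is to alert when the value
    falls outside [low, high]; a leading '@' inverts that, so the alert
    fires when the value is inside the range.
    """
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        low_text, high_text = "0", spec  # a bare N means the range 0..N
    low = -math.inf if low_text == "~" else float(low_text or "0")
    high = math.inf if high_text == "" else float(high_text)
    return low, high, inside


def alerts(value, spec):
    """Return True if value triggers an alert for the given range."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if inside else not in_range


def cluster_state(non_ok, warning, critical):
    """Check the critical range first, then the warning range."""
    if alerts(non_ok, critical):
        return "CRITICAL"
    if alerts(non_ok, warning):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for failed in range(5):
        print(failed, cluster_state(failed, "1", "2"))
```

Under this reading, a bare 1 alerts once two or more checks are non-OK, while a two-sided range such as 1:2 also alerts when the count is 0. It is worth running the thresholds you choose through a check like this, and against the plugin itself, before relying on them.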
Now you need to decide the warning and critical state thresholds or ranges. Let’s make the failure of two out of four services a warning and three out of four critical:
define service{
  hostgroup_name http-hosts
  service_description Cluster Check - HTTP Service
  check_command check_cluster_service!1:2!2:3!$SERVICESTATEID:host1:HTTP Service Check$,$SERVICESTATEID:host2:HTTP Service Check$,$SERVICESTATEID:host3:HTTP Service Check$,$SERVICESTATEID:host4:HTTP Service Check$
  contacts noc, sysad
 }
check_command takes its arguments in the order expected by the command_line in the define command section above, separated by exclamation points. Here the first two arguments are the warning and critical ranges, and the third is the list of states to evaluate. You can read more about defining ranges in the Nagios Plugins development guidelines.
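The mapping from the exclamation-point-separated arguments to $ARG1$, $ARG2$, and $ARG3$ can be sketched in a few lines of Python. This is a simplified model for illustration only: expand_command is a hypothetical helper, and real Nagios also handles escaping and expands the other macros before running the command.

```python
def expand_command(check_command, command_line):
    """Split check_command on '!' and substitute $ARG1$, $ARG2$, ...

    Returns the command name and the command line with the positional
    argument macros filled in.
    """
    name, *args = check_command.split("!")
    expanded = command_line
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return name, expanded


if __name__ == "__main__":
    # The third argument stands in for already-expanded state IDs.
    name, line = expand_command(
        "check_cluster_service!1!2!0,0,2,0",
        "path_to_plugin/check_cluster --service -w $ARG1$ -c $ARG2$ -d $ARG3$",
    )
    print(name)
    print(line)
```

Seeing the expanded line makes it easier to run the same command by hand when a cluster check does not behave as expected.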
We have also defined the hosts and services to be included using the SERVICESTATEID macro. Macros enable you to use information from various sources in real time. The SERVICESTATEID macro enables you to get the current state of the service from the HTTP check we defined earlier.
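The macro list gets long and is easy to mistype by hand, so you may prefer to generate it. The following Python sketch builds the comma-separated SERVICESTATEID list for a set of hosts; the helper name is hypothetical, and the host names and service description are the ones used in this example.

```python
def service_state_macros(hosts, service_description):
    """Build the comma-separated $SERVICESTATEID:host:description$ list."""
    return ",".join(
        f"$SERVICESTATEID:{host}:{service_description}$" for host in hosts
    )


if __name__ == "__main__":
    hosts = ["host1", "host2", "host3", "host4"]
    print(service_state_macros(hosts, "HTTP Service Check"))
```

Paste the output in as the last argument of check_command, after the warning and critical ranges.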
Restart Nagios to put these configuration changes into action. You should see the check appear in the Nagios web interface, hopefully in the OK state, which will change to warning or critical depending on the number of services that are down. You can define the team or person to alert using Nagios's contactgroup definition.

Putting in the Host Cluster Check

Setting up host cluster checks is a bit easier than setting up service cluster checks. Suppose you want to monitor whether a cluster of hosts is down instead of tracking the services running on them. As with service cluster checks, you first have to create a define command section:
define command{
 command_name check_cluster_host
 command_line path_to_plugin/check_cluster --host -w $ARG1$ -c $ARG2$ -d $ARG3$
 }
The parameters passed are similar to the ones for service checks, with the exception of --host.
Now you need to decide the alerting thresholds and define the check. Here again let's keep it similar to the service check, where a warning was the failure of two out of four boxes and critical was the failure of three out of four boxes.
define service{
  hostgroup_name http-hosts
  service_description Cluster Check - HTTP Hosts
  check_command check_cluster_host!1:2!2:3!$HOSTSTATEID:host1$,$HOSTSTATEID:host2$,$HOSTSTATEID:host3$,$HOSTSTATEID:host4$
  contacts noc, sysad
 }
Again, the parameters passed are similar to the service cluster check, with the exception of the macro passed. The HOSTSTATEID macro expands to give the status of the host in real time.
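After macro expansion, the plugin receives a plain list of numeric host states. The Python sketch below shows one way to read such a list, using the standard Nagios host state IDs (0 for UP, 1 for DOWN, 2 for UNREACHABLE). It assumes the cluster check counts every host that is not UP toward the thresholds; the function names are hypothetical.

```python
# Standard Nagios host state IDs as returned by $HOSTSTATEID$.
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}


def summarize_host_states(data):
    """Count hosts per state from a comma-separated list of state IDs."""
    counts = {name: 0 for name in HOST_STATES.values()}
    for token in data.split(","):
        counts[HOST_STATES[int(token)]] += 1
    return counts


def hosts_not_up(data):
    """Number of hosts that are DOWN or UNREACHABLE."""
    counts = summarize_host_states(data)
    return counts["DOWN"] + counts["UNREACHABLE"]


if __name__ == "__main__":
    expanded = "0,1,0,2"  # example value of the -d argument after expansion
    print(summarize_host_states(expanded))
    print(hosts_not_up(expanded))
```

The number returned by hosts_not_up is the value that the warning and critical ranges would be compared against under that assumption.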
Again, restart Nagios to see the checks appear in your web interface. For medium to large clusters, I usually turn off the alerts for individual hosts and services and watch only the cluster's health. If I find a problem, I can fix things before users see any downtime.
By using Nagios plugins, you can keep an eye on the health of not just individual nodes and services, but entire clusters. That should give you more time to handle other tasks, and sysadmin time is one resource that’s always at a premium in the data center.
