wiki:Doc/Monitoring/Nagios

Version 26 (modified by /DC=org/DC=terena/DC=tcs/C=NL/O=Nikhef/CN=Ronald Starink ronalds@…, 13 years ago)

--

Nagios configuration requires both a set of client templates for commands to be run on clients by the Nagios Remote Plug-in Executor (NRPE) and a set of server templates configuring contacts for alarms, hosts to be monitored, services (AKA sensors) and so on.

Configuring the Nagios server

The configuration of a Nagios server is done in a set of standard templates, in the 'monitoring/nagios' namespace. Also, sensors are provided for many of the plug-ins described on the GridPP wiki. An example Nagios server template is included in the QWG distribution.

Quickstart

To configure a basic Nagios server, include the template monitoring/nagios/config in the server's template. This automatically generates a Nagios configuration that monitors all Quattor-managed machines.
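As a minimal sketch, a server profile could look like the following. The profile name is hypothetical and the other includes a real profile needs are omitted; only monitoring/nagios/config comes from the text above. The include follows the unquoted style used elsewhere on this page; recent versions of the Pan compiler may expect the template name as a quoted string.

```pan
# Hypothetical profile for the Nagios server (name is an assumption).
object template profile_nagios01;

# ... the usual site, OS and machine-type includes go here ...

# Pull in the standard Nagios server configuration.
include monitoring/nagios/config;
```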

Monitoring grid services

Preliminary work has been done to integrate monitoring of grid services according to the EGEE model (described at https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNcg). Please note that in order to retrieve SAM results for your site, you will need to request access to the SAM database from your Nagios server according to the procedure described at https://twiki.cern.ch/twiki/bin/view/LCG/SamProgInterfaceACL

The template monitoring/nagios/ncg_services automatically generates the service, service group, service dependency and external info configuration for the CEs and SEs defined in the CE_HOST and SE_HOST variables. The NCG services template must be included before the core Nagios config template.
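The include order described above can be sketched as follows. This assumes that CE_HOST and SE_HOST have already been defined by the site's grid configuration; nothing else is implied about the contents of either template.

```pan
# Assumption: CE_HOST and SE_HOST are defined earlier in the profile.

# First the NCG services, so their definitions exist when the
# core configuration is generated.
include monitoring/nagios/ncg_services;

# Then the core Nagios server configuration.
include monitoring/nagios/config;
```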

Customising your configuration

What is monitored

In principle, all hosts present in the DB_MACHINE database are expected to be monitored and are added to the Nagios configuration. (This is done via the variable HOSTSLIST, which is automatically derived from DB_MACHINE.) If you want to monitor additional hosts that are not Quattor-managed, specify them in the variable NAGIOS_EXTRA_HOSTS.

Currently, all hosts are given the same settings. Finer-grained settings (for instance, monitoring some hosts only during working hours and others only outside them) are not yet implemented.
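For example, assuming the variable takes a list of FQDNs as described in the variable index, extra hosts could be declared like this. The host names are placeholders, and defining the variable before including the Nagios configuration is an assumption.

```pan
# Hosts outside Quattor's control that should still be monitored.
# The names below are placeholders.
variable NAGIOS_EXTRA_HOSTS = list(
    "router.example.org",
    "dns1.example.org"
);
```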

Hardware-related monitoring

The variable HW_LISTINGS is essentially the inverse of DB_MACHINE: it is a structure whose fields correspond to hardware attributes and whose values are the lists of nodes with that attribute. For instance, HW_LISTINGS["per_cpu"]["_4"] is the list of nodes with 4 CPU cores, whether four single-core Pentiums or a single quad-core Barcelona chip. See the variable index below for the full description of each field.
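To illustrate the shape only, the per_cpu part of such a structure might look like the sketch below. HW_LISTINGS itself is computed automatically and is not set by hand; the variable names EXAMPLE_HW_LISTINGS and QUAD_CORE_NODES and the host names are hypothetical.

```pan
# Illustration of the per_cpu layout; not the real variable.
variable EXAMPLE_HW_LISTINGS = nlist(
    "per_cpu", nlist(
        "_2", list("node01.example.org"),
        "_4", list("node02.example.org", "node03.example.org")
    )
);

# Reading the real variable: all nodes with 4 CPU cores.
variable QUAD_CORE_NODES = HW_LISTINGS["per_cpu"]["_4"];
```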

Specifying commands

A structure template with all the commands used at UAM is provided as an example. The variable NAGIOS_COMMANDS_TEMPLATE holds the name of a structure template with all the command definitions you need. You can write your own commands template, and even include the example one in it. A command template is just a set of "command_name" = "command_definition" lines, like this:

"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
"notify-by-email" = "/bin/echo '$SERVICEOUTPUT$' | /bin/mail -s '$SERVICESTATE$ alert for $HOSTALIAS$($HOSTNAME$)/$SERVICEDESC$' $CONTACTEMAIL$";
"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";

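Put together, a site-specific commands template and the variable that selects it could look like the sketch below. The template name site/nagios/my-commands is hypothetical; the command line is copied from the example above.

```pan
# File site/nagios/my-commands.tpl (hypothetical name).
structure template site/nagios/my-commands;

# One "command_name" = "command_definition" pair per line.
"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";
```

```pan
# In the server profile: tell the Nagios templates which
# commands template to use.
variable NAGIOS_COMMANDS_TEMPLATE = "site/nagios/my-commands";
```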
Host aliases and host-specific customization

Additional per-host customization is possible with another template, named by NAGIOS_HOSTALIASES_TEMPLATE. It is a full template, not a structure template. This is a rather poor way of specifying customizations; better ideas are welcome.

Host groups

A structure template for defining host groups can be supplied with the NAGIOS_HOSTGROUPS_TEMPLATE variable.

Macros

User macros can be supplied with the NAGIOS_MACROS_TEMPLATE. By default, $USER1$ and $USER2$ are set with sensible values. The macros file will be written with 0600 permissions, so it is safe to place passwords here, if needed.

Time periods

A structure template defining time periods can be supplied with NAGIOS_TIMEPERIODS_TEMPLATE. The default templates provide only an "always" time period.

Contacts and contact groups

Contacts and contact groups for alarms can be specified with structure templates named by NAGIOS_CONTACTS_TEMPLATE and NAGIOS_CONTACTGROUPS_TEMPLATE. An example template is provided under monitoring/nagios/contacts, but it must be adapted to your needs.

Services

Each service must be provided via a structure template. This way, the Nagios use directive can be imitated with a plain include. The variable NAGIOS_SERVICE_TEMPLATES is a list with the names of the templates that fully describe services.

For instance,

include monitoring/nagios/services/fast-service;

'service_description' = 'DNS response time';
'check_command' = list(
    'check_dns',
    'www.cern.ch',
    '2',
    '5'
);
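Assuming the definition above is saved as a structure template, it can then be registered through NAGIOS_SERVICE_TEMPLATES. The template name site/nagios/services/dns-response is hypothetical.

```pan
# Names of the structure templates that each describe one service.
# The template name below is a placeholder for your own.
variable NAGIOS_SERVICE_TEMPLATES = list(
    "site/nagios/services/dns-response"
);
```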

Generic services

As seen in the previous example, you can reduce the amount of code by including some other structure template and changing the appropriate values. For this, 6 generic templates are provided:

  • generic-service: a "generic" service, with the appropriate contact groups and other usual settings. Before an alarm is sent, the check is tried 4 times. Alarms are re-sent every 4 hours.
  • fast-service: a generic-service that is checked every 5 minutes, or every minute in case of problems.
  • performance-service: a fast-service that is checked every 30 minutes, or every 5 minutes in case of problems.
  • security-service: a volatile fast-service that is checked every 20 minutes.
  • slow-service: a fast-service that is checked every hour, or every 20 minutes in case of problems. Alarms are re-sent every 7 hours.
  • expire-service: a fast-service that is checked every 48 hours. Alarms are re-sent every 50 hours. It is useful for checking certificate expirations.

Checking for a host load

The acceptable load on a host depends on its number of CPUs and cores. For instance, a load of 4 is problematic on a single-core node, whereas an 8-core node with the same load is leaving CPU capacity unused.

The function generate-check-load-services is aimed at this. It generates service structures based on the monitoring/nagios/services/load templates, for all hosts in DB_MACHINE, setting sensible limits. These are:

            Load 1      Load 5        Load 15
Warning     cores*2     cores*1.75    cores*1.5
Critical    cores*3     cores*2       cores*1.75

The limits are hardcoded in the function; a more flexible approach will be provided (some day).
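The arithmetic in the table can be sketched as a small Pan function. This is not the code of the actual function; the name example_load_limits and the layout of the returned structure are assumptions made for illustration.

```pan
# Hypothetical helper: compute the limits from the table above
# for a given number of cores.
function example_load_limits = {
    cores = ARGV[0];
    # Each list holds the limits for load 1, load 5 and load 15.
    nlist(
        "warning",  list(cores * 2, cores * 1.75, cores * 1.5),
        "critical", list(cores * 3, cores * 2, cores * 1.75)
    );
};

# For a 4-core node the table gives warning limits 8 / 7 / 6
# and critical limits 12 / 8 / 7.
variable EXAMPLE_LIMITS_4_CORES = example_load_limits(4);
```

With the standard check_load plug-in, these values would correspond to the arguments -w 8,7,6 -c 12,8,7.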

External configuration files

If you have existing configuration files, or if parts of your configuration are generated by tools other than ncm-nagios, you can integrate them with the NAGIOS_EXTERNAL_FILES variable.
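For example, assuming the variable takes a list of file names as described in the variable index (the paths below are placeholders):

```pan
# Configuration files maintained outside ncm-nagios that should
# be added to the generated Nagios configuration.
variable NAGIOS_EXTERNAL_FILES = list(
    "/etc/nagios/site/legacy-services.cfg",
    "/etc/nagios/site/generated-hosts.cfg"
);
```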

General options

The component provides sensible defaults for most of the Nagios options. If you want to override any of these, set the variable NAGIOS_GENERAL_OPTIONS.
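A minimal sketch, using the option mentioned in the variable index (whether to accept external commands). The key follows the name of the corresponding Nagios option; that the component expects a boolean value here is an assumption, so check the component's schema before copying this.

```pan
# Override selected general options; anything not listed keeps
# the default chosen by the component.
variable NAGIOS_GENERAL_OPTIONS = nlist(
    # Assumed to be a boolean in the component's schema.
    "check_external_commands", true
);
```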

Visualizing performance metrics

Many service and host checks return performance metrics to the Nagios server. The performance data can be visualized in plots based on rrdtool using pnp4nagios.

To enable visualization with pnp4nagios on the web interface, pnp4nagios must be enabled (via PNP4NAGIOS_ENABLE) and the base URL for pnp4nagios on the web server must be supplied (via PNP4NAGIOS_BASE_URL). The exact locations of the host and service metrics can be modified via variables PNP4NAGIOS_HOST_ACTION_URL and PNP4NAGIOS_SERVICE_ACTION_URL, although by default they are derived from the base location:

variable PNP4NAGIOS_ENABLE = true;
variable PNP4NAGIOS_BASE_URL = '/nagios/html/pnp4nagios/index.php';
variable PNP4NAGIOS_HOST_ACTION_URL = PNP4NAGIOS_BASE_URL + '?host=$HOSTNAME$';
variable PNP4NAGIOS_SERVICE_ACTION_URL = PNP4NAGIOS_BASE_URL + '?host=$HOSTNAME$&srv=$SERVICEDESC$';

Every host or service for which visualization is required should contain an action_url in its definition. Hosts that are derived from the generic host template monitoring/nagios/generic_host already have this definition:

"action_url" = PNP4NAGIOS_HOST_ACTION_URL;

However, the action_url must be set for every service for which it is desired (note that it may not be useful for all services!). The definition of the load service template in the example cluster (template sites/example/monitoring/nagios/services/load.tpl) contains an example:

'action_url' = PNP4NAGIOS_SERVICE_ACTION_URL;

Note: the above assumes that the Nagios server is configured to collect performance metrics. The default Nagios master template monitoring/nagios/master contains the required definitions.

Passive service checks (NSCA)

Besides displaying results or sending notifications for regular checks, Nagios can also receive results from checks that it did not initiate itself. Such checks are passive checks. They are executed via some mechanism external to the Nagios server (e.g., a cron job on a node) and the check result is sent to the NSCA daemon that runs on the Nagios server. On the node, the send_nsca executable submits the results to the NSCA daemon. The NSCA daemon accepts the passive check results and inserts them into Nagios.

Passive service checks are required when using a hierarchy of Nagios servers. The slave servers that execute the (active) service checks submit the check results to the master server.

There are 3 configuration variables that enable the different parts of the NSCA configuration:

  • NSCA_SEND_TEMPLATE: required for nodes that run passive checks and slave servers to submit check results to the master
  • NSCA_SUBMIT_RESULT_TEMPLATE: required for Nagios slave servers to submit check results to the master
  • NSCA_DAEMON_TEMPLATE: required for a Nagios server that must accept passive check results and for a Nagios master server when using a hierarchy.

The NSCA configuration uses the following variables:

  • NSCA_PORT: port on which the server listens (and to which the client sends the results), defaulting to 5667.
  • NSCA_ENCRYPTION_METHOD: encryption scheme used by send_nsca, defaulting to 1.
  • NSCA_DECRYPTION_METHOD: decryption scheme used by the daemon, should be (and is by default) identical to the encryption.
  • NSCA_PASSWORD: password used for encrypting and decrypting connection. This variable must be set and there is no default value.
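A minimal sketch of these settings follows. The password is a placeholder, and the assumption is that the port and encryption method are given as numbers. In NSCA itself, method 1 is a simple XOR, which offers little protection; a stronger method may be worth considering if your NSCA build supports it.

```pan
# Port of the NSCA daemon on the Nagios server (the default).
variable NSCA_PORT = 5667;

# Encryption scheme used by send_nsca (the default).
# NSCA_DECRYPTION_METHOD follows this value by default.
variable NSCA_ENCRYPTION_METHOD = 1;

# Mandatory, no default. Placeholder value: replace it.
variable NSCA_PASSWORD = "replace-with-a-site-secret";
```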

Hierarchy of servers

Nagios servers can be used in a hierarchy, where there is 1 master server and 2 or more slave servers. The master server runs the web interface, handles notifications, performance data etc. The slave servers schedule and execute checks and submit the check results (via NSCA) to the master server. Each slave server runs (a subset of the) checks against (a subset of the) nodes. The master server receives the results for all services and all hosts.

The QWG templates contain some generic templates for Nagios master and slave servers. However, much of the configuration is site-specific. An example setup is provided under directory sites/example/site/nagios. This setup must be modified to match the desired configuration of your site.

The generic setup consists of the following templates (under standard/):

  • monitoring/nagios/master: template containing specific settings for the Nagios master server, such as accepting passive checks, running the web interface, collecting performance data etc. Service templates must be included as passive checks, therefore variable NAGIOS_DEFAULT_SERVICE_TEMPLATE is forced to monitoring/nagios/services/passive-service.
  • monitoring/nagios/slave: template defining specific Nagios settings for a slave server, such as suppressing notifications, submission of check results, disabling performance data.

The site-specific setup consists of the following templates (under sites/example):

  • site/nagios/master: contains specific settings for the Nagios master for this cluster; specifically, it includes all host, hostgroup and service templates.
  • site/nagios/slave-A: contains specific settings for a Nagios slave server that monitors a subset of the hosts. Only the hosts, hostgroups and services that are monitored by this server must be included.
  • site/nagios/slave-B: this template is very similar to the one for slave-A, except that it uses a different subset of hosts, hostgroups and services.
  • site/nagios/common: site-specific configuration that is common to all servers (master and all slaves).
  • site/nagios/hosts/: this directory contains the list of hosts per monitoring slave server (templates cluster-A and cluster-B) as well as a template aggregating all hosts for the master server (all).
  • site/nagios/hostsgroups/: this directory contains the list of hostgroups per monitoring slave server and an aggregating template
  • site/nagios/config/services/standard-templates: list of services to be monitored; in the example cluster, there is only 1 service (load) that is monitored for all hosts.
  • monitoring/nagios/services/load: this template overrides the standard load template. It uses a variable for the service template (via variable NAGIOS_DEFAULT_SERVICE_TEMPLATE) to distinguish the active check (needed by slave servers) from the passive variant (used by the master). Furthermore, it includes support for an action_url to visualize performance metrics.
  • site/nagios/nsca: various variables for NSCA.
  • site/nagios/OCP_setup: this template can optionally be used by Nagios slave servers to push service check results to the master using OCP. This performs much better than running a separate 'send_nsca' per check result, in the sense that the latency of service checks decreases dramatically. Using this template is recommended. It relies on monitoring/nagios/nsca/OCP_daemon (under namespace standard), which provides the OCP daemon code and the init.d script for the daemon.

Other templates present under the sites/example/ directory that are not specific for the hierarchy:

  • site/nagios/config/apache: example of an Apache configuration for a Nagios web server, with support for X509 certificate-based authentication via https. Users connecting to port 80 of the web server are mapped to the guest user. Because of the complexity of the Apache configuration, this template uses ncm-filecopy to get the configuration files on the server.
  • site/nagios/config/webinterface: this template adds the certificate DNs for administrators to the list of people who can access the Nagios web interface via https. The administrators get access to all services, hosts, system and configuration options via the web interface.
  • site/nagios/site-commands: commands that are specific to the site; that is, all commands that require non-standard 3rd party software to be installed on either the Nagios server or the nodes to be monitored. Non-standard here means any software not contained in the nagios-plugins or nagios-plugins-nrpe RPMs.

Variable index

  • NAGIOS_BASE_HOST: a structure template defining a basic host, with all fields except, perhaps, an alias, set to sensible values.
  • HW_LISTINGS: lists all machines, grouped by hardware characteristics. It is set automatically by monitoring/nagios/cfg and cannot be modified. It has the following fields:
      • per_hardware: nlist in which the keys are the hardware templates (the values of DB_MACHINE). Each value is another structure with the fields:
          • host_list: list of all hosts associated with that hardware template.
          • hardware_structure: the complete hardware information as described in the hardware template.
      • per_cpu: nlist with the machines grouped by number of processors. The key has the form "_n", where n is the number of CPUs. The value is the list of hosts.
  • HOSTSLIST: plain nlist with all host structures, extracted from HW_LISTINGS.
  • NAGIOS_EXTRA_HOSTS: list of FQDNs of additional hosts to be monitored. Put here any host outside your CDB that you want to monitor, for instance routers or external DNS servers.
  • NAGIOS_COMMANDS_TEMPLATE: a structure template with home-made commands. A default set with all commands used at UAM as of 6/2008 is provided.
  • NAGIOS_HOSTSALIASES_TEMPLATE: a template with additional per-host customizations. To be improved.
  • NAGIOS_HOSTGROUPS_TEMPLATE: a structure template with all hostgroup definitions.
  • NAGIOS_MACROS_TEMPLATE: a structure template with any user-defined macros, not enclosed by '$' symbols. For instance, to set $USER1$, use: "USER1" = "/usr/lib/nagios/plugins";
  • NAGIOS_TIMEPERIODS_TEMPLATE: a structure template with time period definitions.
  • NAGIOS_CONTACTS_TEMPLATE: a structure template with all contacts.
  • NAGIOS_CONTACTGROUPS_TEMPLATE: a structure template describing how the contacts are grouped.
  • NAGIOS_SERVICE_TEMPLATES: list with the names of the Nagios structure templates that define each sensor.
  • NAGIOS_EXPLICIT_SERVICES: list with the complete definition of any services you don't want to specify as structure templates.
  • NAGIOS_EXPLICIT_SERVICEGROUPS: list with the complete definition of any service groups you don't want to specify as structure templates.
  • NAGIOS_EXPLICIT_SERVICEDEPENDENCIES: list with the complete definition of any service dependencies you don't want to specify as structure templates.
  • NAGIOS_EXPLICIT_SERVICEEXTINFO: list with the complete definition of any service external info you don't want to specify as structure templates.
  • NAGIOS_GENERAL_OPTIONS: nlist with all generic options you may want, such as whether to accept external commands or not. Check the Nagios documentation to see which options are available.
  • NAGIOS_EXTERNAL_FILES: list of external files to be added to the Nagios configuration.
  • NAGIOS_SERVICEEXTINFO_TEMPLATES: list of templates containing definitions for extended service information.
  • NAGIOS_LOAD_BASE_TEMPLATE: base template for the load checks, which are generated by the generate_load_checks function.