Version 22 (modified by Ronald Starink, 13 years ago)

Nagios configuration requires both a set of client templates for commands to be run on clients by the Nagios Remote Plug-in Executor (NRPE) and a set of server templates configuring contacts for alarms, hosts to be monitored, services (AKA sensors) and so on.

Configuring the Nagios server

The configuration of a Nagios server is done in a set of standard templates, in the 'monitoring/nagios' namespace. Also, sensors are provided for many of the plug-ins described on the GridPP wiki. An example Nagios server template is included in the QWG distribution.

Quickstart

To configure a basic Nagios server, include the template monitoring/nagios/config in the server's template. This automatically generates a Nagios configuration that monitors all Quattor-managed machines.
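As a minimal sketch, the quickstart could look like this in the server's profile. The profile name is hypothetical, and the braces form of the include statement is used on the assumption that the Pan compiler in use accepts it.

```pan
# Hypothetical object template (machine profile) for the Nagios server.
object template profile_nagios_server;

# ... the site's usual machine type and OS includes go here ...

# Generates a Nagios configuration covering all Quattor-managed machines.
include { 'monitoring/nagios/config' };
```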

Monitoring grid services

Preliminary work has been done to integrate monitoring of grid services according to the EGEE model (described at https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNcg). Please note that in order to retrieve SAM results for your site, you will need to request access to the SAM database from your Nagios server according to the procedure described at https://twiki.cern.ch/twiki/bin/view/LCG/SamProgInterfaceACL

The template monitoring/nagios/ncg_services automatically generates service, service group, service dependency and external info configuration for the CEs and SEs defined in the CE_HOST and SE_HOST variables. (The NCG services configuration template should be included before the core Nagios config template.)
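The include order described above can be sketched as follows. This fragment assumes that CE_HOST and SE_HOST are already defined elsewhere in the site configuration; it only illustrates the ordering of the two includes.

```pan
# CE_HOST and SE_HOST are assumed to be defined already by the site configuration.

# NCG-derived service definitions first...
include { 'monitoring/nagios/ncg_services' };

# ...then the core Nagios configuration.
include { 'monitoring/nagios/config' };
```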

Customising your configuration

What is monitored

In principle, all hosts present in the DB_MACHINE database are expected to be monitored and are added to the Nagios configuration. (This is done via the variable HOSTSLIST, which is automatically derived from DB_MACHINE.) If you want to monitor additional hosts that are not Quattor-managed, specify them in the variable NAGIOS_EXTRA_HOSTS.
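A minimal sketch of adding hosts that are not Quattor-managed, using the list-of-FQDNs form given in the variable index. The host names are placeholders, and defining the variable before the core Nagios template is included is an assumption.

```pan
# Hosts outside Quattor's control that should still be monitored.
# The FQDNs below are placeholders, not real hosts.
variable NAGIOS_EXTRA_HOSTS = list(
    'router.example.org',
    'dns.example.org'
);
```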

Currently, all hosts are given the same settings. Finer-grained settings (for instance, monitoring some hosts only during working hours and others only outside them) are not yet implemented.

Hardware-related monitoring

The variable HW_LISTINGS is essentially the inverse of DB_MACHINE: it is a structure whose fields correspond to hardware attributes and whose values are the lists of nodes with that attribute. For instance, HW_LISTINGS["per_cpu"]["_4"] is the list of nodes with 4 CPU cores, whether four single-core Pentiums or a single quad-core Barcelona chip. See the variable index for the full description of each variable.
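To make the structure concrete, the sketch below shows an illustrative shape for the per_cpu field as a comment, followed by a lookup. The host names and the variable QUAD_CORE_NODES are hypothetical; HW_LISTINGS itself is generated automatically and is not meant to be set by hand.

```pan
# Illustrative shape only (host names are placeholders):
#
#   HW_LISTINGS = nlist(
#       'per_cpu', nlist(
#           '_4', list('node01.example.org', 'node02.example.org'),
#           '_8', list('node03.example.org')
#       )
#   );

# Hypothetical helper variable: all nodes with 4 CPU cores.
variable QUAD_CORE_NODES = HW_LISTINGS['per_cpu']['_4'];
```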

Specifying commands

A structure template with all the commands used at UAM is provided, as an example. Variable NAGIOS_COMMANDS_TEMPLATE is the name of a structure template with all the command definitions you need. You can write your own commands template, and even include the example one. A command template is just a set of "command_name" = "command_definition" lines, like this:

"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
"notify-by-email" = "/bin/echo '$SERVICEOUTPUT$' | /bin/mail -s '$SERVICESTATE$ alert for $HOSTALIAS$($HOSTNAME$)/$SERVICEDESC$' $CONTACTEMAIL$";
"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";
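Put together, a site-specific commands template and its registration might look like the sketch below. The template name site/nagios/commands is hypothetical; the command definitions are taken from the example above.

```pan
# File: site/nagios/commands.tpl (hypothetical name)
structure template site/nagios/commands;

# Each entry maps a command name to its command line.
"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";

# In the server configuration (a separate template), point the variable
# at the template name:
#
#   variable NAGIOS_COMMANDS_TEMPLATE = 'site/nagios/commands';
```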

Host aliases and host-specific customization

Additional per-host customization is possible with another template, called NAGIOS_HOSTALIASES_TEMPLATE. It is a full template, not a structure one. This is a rather poor way of specifying customizations; better ideas are welcome.

Host groups

A structure template for defining host groups can be supplied with the NAGIOS_HOSTGROUPS_TEMPLATE variable.

Macros

User macros can be supplied with the NAGIOS_MACROS_TEMPLATE. By default, $USER1$ and $USER2$ are set with sensible values. The macros file will be written with 0600 permissions, so it is safe to place passwords here, if needed.
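A macros template could be sketched as follows. The USER1 value is the one shown in the variable index; the template name, the USER3 macro and its value are hypothetical, and it is assumed that NAGIOS_MACROS_TEMPLATE holds the template name in the same way as NAGIOS_COMMANDS_TEMPLATE.

```pan
# File: site/nagios/macros.tpl (hypothetical name)
structure template site/nagios/macros;

# Keys are macro names without the enclosing '$' symbols.
"USER1" = "/usr/lib/nagios/plugins";

# Hypothetical macro holding a credential; commands would refer to it as $USER3$.
"USER3" = "example-password";

# In the server configuration (a separate template):
#
#   variable NAGIOS_MACROS_TEMPLATE = 'site/nagios/macros';
```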

Time periods

A structure template can be supplied with the NAGIOS_TIMEPERIODS_TEMPLATE variable. The default templates provide only an "always" time period.

Contacts and contact groups

Contacts and contact groups for alarms can be specified with structure templates called NAGIOS_CONTACTS_TEMPLATE and NAGIOS_CONTACTGROUPS_TEMPLATE. An example template is provided under monitoring/nagios/contacts, but please adapt it to your needs.

Services

Each service must be provided via a structure template. This way, the Nagios use directive can be imitated with a simple include. The variable NAGIOS_SERVICE_TEMPLATES is therefore a list of the names of the templates that fully describe services.

For instance,

include monitoring/nagios/services/fast-service;

'service_description' = 'DNS response time';
'check_command' = list(
    'check_dns',
    'www.cern.ch',
    '2',
    '5'
);
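Assuming the service definition above is saved as a structure template with the hypothetical name site/nagios/services/dns, it could be registered as sketched below. If NAGIOS_SERVICE_TEMPLATES already holds entries at your site, they would need to be kept in the list.

```pan
# Register the service templates that Nagios should use.
# 'site/nagios/services/dns' is a hypothetical template name.
variable NAGIOS_SERVICE_TEMPLATES = list(
    'site/nagios/services/dns'
);
```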

Generic services

As seen in the previous example, you can reduce the amount of code by including some other structure template and changing the appropriate values. For this, 6 generic templates are provided:

generic-service
A "generic" service, with the appropriate contact groups and other usual settings. Before sending an alarm, the check is tried 4 times. Alarms are re-sent every 4 hours.
fast-service
A generic-service to be checked every 5 minutes, or every minute in case of problems.
performance-service
A fast-service to be checked every 30 minutes, or every 5 minutes in case of problems.
security-service
A volatile fast-service to be checked every 20 minutes.
slow-service
A fast-service to be checked every hour, or every 20 minutes in case of problems. Alarms are re-sent every 7 hours.
expire-service
A fast-service to be checked every 48 hours. Alarms are re-sent every 50 hours. It is useful for checking certificate expirations.

Checking for a host load

The acceptable load on a host depends on the number of CPUs and cores it has. For instance, a load of 4 is problematic on a single-core node, whereas an 8-core node at that same load is leaving CPU capacity unused.

The function generate-check-load-services is aimed at this. It generates service structures based on the monitoring/nagios/services/load templates, for all hosts in DB_MACHINE, setting sensible limits. These are:

            Load 1     Load 5        Load 15
Warning     cores*2    cores*1.75    cores*1.5
Critical    cores*3    cores*2       cores*1.75

The limits are hardcoded in the function; a more flexible approach will be provided (some day).
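As a worked example of the limits in the table above, the fragment below computes them for a hypothetical 4-core host. It only illustrates the arithmetic and is not the code of the function itself; the variable names are made up for this example.

```pan
# Hypothetical 4-core host.
variable EXAMPLE_CORES = 4;

# Warning limits for load 1, 5 and 15: 4*2 = 8, 4*1.75 = 7, 4*1.5 = 6
variable EXAMPLE_LOAD_WARNING = list(
    EXAMPLE_CORES * 2,
    EXAMPLE_CORES * 1.75,
    EXAMPLE_CORES * 1.5
);

# Critical limits for load 1, 5 and 15: 4*3 = 12, 4*2 = 8, 4*1.75 = 7
variable EXAMPLE_LOAD_CRITICAL = list(
    EXAMPLE_CORES * 3,
    EXAMPLE_CORES * 2,
    EXAMPLE_CORES * 1.75
);
```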

External configuration files

If you have existing configuration files, or if parts of your configuration are generated by tools other than ncm-nagios, you can integrate them with the NAGIOS_EXTERNAL_FILES variable.
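A minimal sketch, assuming that the entries of NAGIOS_EXTERNAL_FILES are paths to configuration files on the Nagios server. The path shown is a placeholder.

```pan
# Existing Nagios configuration files to be added to the generated configuration.
# The path below is hypothetical.
variable NAGIOS_EXTERNAL_FILES = list(
    '/etc/nagios/legacy/printers.cfg'
);
```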

General options

The component provides sensible defaults for most of the Nagios options. If you want to override any of these, set the variable NAGIOS_GENERAL_OPTIONS.

Visualizing performance metrics

Many service and host checks return performance metrics to the Nagios server. The performance data can be visualized in plots based on rrdtool using pnp4nagios.

To enable visualization with pnp4nagios on the web interface, pnp4nagios must be enabled (via PNP4NAGIOS_ENABLE) and the base URL for pnp4nagios on the web server must be supplied (via PNP4NAGIOS_BASE_URL). The exact locations of the host and service metrics can be modified via variables PNP4NAGIOS_HOST_ACTION_URL and PNP4NAGIOS_SERVICE_ACTION_URL, although by default they are derived from the base location:

variable PNP4NAGIOS_ENABLE = true;
variable PNP4NAGIOS_BASE_URL = '/nagios/html/pnp4nagios/index.php';
variable PNP4NAGIOS_HOST_ACTION_URL = PNP4NAGIOS_BASE_URL + '?host=$HOSTNAME$';
variable PNP4NAGIOS_SERVICE_ACTION_URL = PNP4NAGIOS_BASE_URL + '?host=$HOSTNAME$&srv=$SERVICEDESC$';

Every host or service for which visualization is required should contain an action_url in its definition. Hosts that are derived from the generic host template monitoring/nagios/generic_host already have this definition:

"action_url" = PNP4NAGIOS_HOST_ACTION_URL;

However, the action_url must be set for every service for which it is desired (note that it may not be useful for all services!). The definition of the load service template in the example cluster (template sites/example/monitoring/nagios/services/load.tpl) contains an example:

'action_url' = PNP4NAGIOS_SERVICE_ACTION_URL;

Note: the above assumes that the Nagios server is configured to collect performance metrics. The default Nagios master template monitoring/nagios/master contains the required definitions.

Passive service checks (NSCA)

Besides displaying results or sending notifications for regular checks, Nagios can also receive results from checks that it did not initiate itself. Such checks are passive checks. They are executed via some mechanism external to the Nagios server (e.g., a cron job on a node) and the check result is sent to the NSCA daemon that runs on the Nagios server. On the node, the send_nsca executable submits the results to the NSCA daemon. The NSCA daemon accepts the passive check results and inserts them into Nagios.

Passive service checks are required when using a hierarchy of Nagios servers: the slave servers, which execute the (active) service checks, submit the check results to the master server.

There are 3 configuration variables to enable parts of the NSCA configuration:

  • NSCA_SEND_TEMPLATE: required for nodes that run passive checks, and for slave servers, so that they can submit check results to the master
  • NSCA_SUBMIT_RESULT_TEMPLATE: required for Nagios slave servers to submit check results to the master
  • NSCA_DAEMON_TEMPLATE: required for a Nagios server that must accept passive check results and for a Nagios master server when using a hierarchy.

Hierarchy of servers

Nagios servers can be used in a hierarchy, where there is 1 master server and 2 or more slave servers. The master server runs the web interface, handles notifications, performance data etc. The slave servers schedule and execute checks and submit the check results (via NSCA) to the master server. Each slave server runs (a subset of the) checks against (a subset of the) nodes. The master server receives the results for all services and all hosts.

Variable index

NAGIOS_BASE_HOST A structure template defining a basic host, with all fields except, perhaps, an alias, set to sensible values.
HW_LISTINGS Lists all machines, grouped by hardware characteristics. It is set automatically by monitoring/nagios/cfg and cannot be modified.
    per_hardware nlist whose keys are the hardware templates (the values of DB_MACHINE). Each value is another structure:
        host_list List of all hosts associated with that hardware template.
        hardware_structure The complete hardware information as described in the hardware template.
    per_cpu nlist with the machines grouped by number of processors. The key has the form "_n", where n is the number of CPUs. The value is the list of hosts.
HOSTSLIST Plain nlist with all host structures, extracted from HW_LISTINGS.
NAGIOS_EXTRA_HOSTS List of FQDNs with additional hosts to be monitored. Put here any host outside your CDB you want to monitor, for instance routers or external DNS servers.
NAGIOS_COMMANDS_TEMPLATE A structure template with home-made commands. A default set with all commands in use at UAM as of June 2008 is provided.
NAGIOS_HOSTSALIASES_TEMPLATE A template with additional per-host customizations. To be improved.
NAGIOS_HOSTGROUPS_TEMPLATE A structure template with all hostgroups definitions.
NAGIOS_MACROS_TEMPLATE A structure template with any user-defined macros, not enclosed by '$' symbols. For instance, for setting $USER1$, do: "USER1" = "/usr/lib/nagios/plugins";
NAGIOS_TIMEPERIODS_TEMPLATE A structure template with time periods definitions.
NAGIOS_CONTACTS_TEMPLATE A structure template with all contacts.
NAGIOS_CONTACTGROUPS_TEMPLATE A structure template with the different ways the CONTACTS are grouped.
NAGIOS_SERVICE_TEMPLATES List with the names of the nagios structure templates that define each sensor.
NAGIOS_EXPLICIT_SERVICES List with the complete definition of any services you don't want to specify as structure templates.
NAGIOS_EXPLICIT_SERVICEGROUPS List with the complete definition of any service groups you don't want to specify as structure templates.
NAGIOS_EXPLICIT_SERVICEDEPENDENCIES List with the complete definition of any service dependencies you don't want to specify as structure templates.
NAGIOS_EXPLICIT_SERVICEEXTINFO List with the complete definition of any service external info you don't want to specify as structure templates.
NAGIOS_GENERAL_OPTIONS nlist with all generic options you may want, such as whether to accept external commands or not. Check the Nagios documentation to see what options are available.
NAGIOS_EXTERNAL_FILES List of external files to be added to Nagios configuration.
NAGIOS_SERVICEEXTINFO_TEMPLATES List of templates containing definitions for extended service information.
NAGIOS_LOAD_BASE_TEMPLATE Base template for the load checks, which are generated by the generate_load_checks function.