Version 6 (modified by /DC=es/DC=irisgrid/O=uam/CN=luisf-munnoz, 16 years ago)

Configuring Nagios

Nagios configuration requires both a set of client templates for commands to be run on clients by the Nagios Remote Plug-in Executor (NRPE) and a set of server templates configuring contacts for alarms, hosts to be monitored, services (AKA sensors) and so on.

Configuring the Nagios server

The Nagios server is configured through a set of standard templates in the monitoring/nagios namespace.

Sensors are also provided for many of the plug-ins described in GridPP's wiki.

What is monitored

In principle, all hosts present in DB_MACHINE are expected to be monitored and are added to the Nagios configuration. This is done with the variable HOSTSLIST, which is automatically derived from DB_MACHINE. Additional hosts can be specified with the variable NAGIOS_EXTRA_HOSTS.
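As an illustration, extra hosts could be declared as follows. This is a minimal sketch: the host names are placeholders, and the variable is assumed to take a list of FQDNs, as described in the variable index.

```pan
# Fragment for a site template; both host names are hypothetical.
variable NAGIOS_EXTRA_HOSTS = list(
    "router.example.org",
    "ns1.example.org"
);
```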

Currently, all hosts share the same settings. Finer-grained settings (for instance, separate sets of hosts to be monitored during working and non-working hours) have yet to be implemented.

Hardware-related monitoring

The variable HW_LISTINGS is roughly the "inverse" of DB_MACHINE: it is a structure whose fields correspond to hardware attributes and whose values are the lists of nodes with that attribute. For instance, HW_LISTINGS["per_cpu"]["_4"] is the list of nodes with 4 CPU cores, whether that means four old Pentiums or a single Barcelona chip. See the variable index for the full description of each field.
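A lookup of this kind might be written as in the sketch below. It assumes the fragment is evaluated after monitoring/nagios/cfg has filled in HW_LISTINGS; the variable name QUAD_CORE_HOSTS is hypothetical and used only for illustration.

```pan
# Hosts with 4 CPU cores, or an empty list if there are none.
# QUAD_CORE_HOSTS is a hypothetical name, not part of the standard templates.
variable QUAD_CORE_HOSTS = {
    if (exists(HW_LISTINGS["per_cpu"]["_4"])) {
        HW_LISTINGS["per_cpu"]["_4"];
    } else {
        list();
    };
};
```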

Specifying commands

A structure template with all the commands used at UAM is provided, as an example. Variable NAGIOS_COMMANDS_TEMPLATE is the name of a structure template with all the command definitions you need. You can write your own commands template, and even include the example one. A command template is just a set of "command_name" = "command_definition" lines, like this:

"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
"notify-by-email" = "/bin/echo '$SERVICEOUTPUT$' | /bin/mail -s '$SERVICESTATE$ alert for $HOSTALIAS$($HOSTNAME$)/$SERVICEDESC$' $CONTACTEMAIL$";
"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";
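A site-specific commands template that reuses the example one could look like the sketch below. The template names site/nagios/commands and monitoring/nagios/commands are assumptions made for illustration (this page does not name the example template), and the two parts belong in separate files.

```pan
# File 1 (hypothetical name): site/nagios/commands
structure template site/nagios/commands;

# Pull in the example commands; this template name is an assumption.
include monitoring/nagios/commands;

# Add a site-specific command.
"check_ssh" = "$USER1$/check_ssh $HOSTADDRESS$";

# File 2: fragment for the template that configures the Nagios server
variable NAGIOS_COMMANDS_TEMPLATE = "site/nagios/commands";
```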

Host aliases and host-specific customization

Additional per-host customization is possible with another template, named by the variable NAGIOS_HOSTALIASES_TEMPLATE. It is a full template, not a structure one. This is a rather poor way of specifying customizations; better ideas are welcome.

Host groups

A structure template for defining host groups can be supplied with the NAGIOS_HOSTGROUPS_TEMPLATE variable.

Macros

User macros can be supplied with the NAGIOS_MACROS_TEMPLATE variable. By default, $USER1$ and $USER2$ are set to sensible values. The macros file is written with 0600 permissions, so it is safe to place passwords here if needed.

Time periods

A structure template can be supplied with the NAGIOS_TIMEPERIODS_TEMPLATE variable. The default templates provide only an "always" time period.

Contacts and contact groups

Contacts and contact groups for alarms can be specified with the structure templates named by NAGIOS_CONTACTS_TEMPLATE and NAGIOS_CONTACTGROUPS_TEMPLATE. An example template is provided under monitoring/nagios/contacts, but please adapt it to your needs.

Services

Each service must be provided via a structure template. This way, Nagios's use directive (object inheritance) can be imitated with a plain include. The variable NAGIOS_SERVICE_TEMPLATES is a list with the names of the templates that fully describe the services.

For instance,

# Inherit the settings of the generic fast-service
include monitoring/nagios/services/fast-service;

'service_description' = 'DNS response time';
# Command name followed by its arguments
'check_command' = list(
    'check_dns',
    'www.cern.ch',
    '2',
    '5');
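Service templates such as the one above are then registered through NAGIOS_SERVICE_TEMPLATES. A minimal sketch, in which both template names are hypothetical:

```pan
# Names of the structure templates that describe each service.
# Both template names are placeholders for your own templates.
variable NAGIOS_SERVICE_TEMPLATES = list(
    "site/nagios/services/dns-response",
    "site/nagios/services/ftp"
);
```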

Generic services

As seen in the previous example, you can reduce the amount of code by including some other structure template and changing the appropriate values. For this, 6 generic templates are provided:

generic-service
A "generic" service, with the appropriate contact groups and other usual settings. The check is tried 4 times before an alarm is sent. Alarms are re-sent every 4 hours.
fast-service
A generic-service checked every 5 minutes, or every minute in case of problems.
performance-service
A fast-service checked every 30 minutes, or every 5 minutes in case of problems.
security-service
A volatile fast-service checked every 20 minutes.
slow-service
A fast-service checked every hour, or every 20 minutes in case of problems. Alarms are re-sent every 7 hours.
expire-service
A fast-service checked every 48 hours. Alarms are re-sent every 50 hours. It is useful for checking certificate expirations.

Checking for a host load

The acceptable load on a host depends on the number of CPUs and cores it has. For instance, a load of 4 is problematic on a single-core node, whereas an 8-core node with that same load is wasting CPU.

The function generate-check-load-services addresses this. It generates service structures based on the monitoring/nagios/services/load templates for all hosts in DB_MACHINE, setting sensible limits. These are:

           Load 1    Load 5      Load 15
Warning    cores*2   cores*1.75  cores*1.5
Critical   cores*3   cores*2     cores*1.75

The limits are hardcoded in the function; a more flexible approach will be provided (some day).
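To make the arithmetic concrete, the limits could be computed as in the sketch below. This is not the code of generate-check-load-services; load_limits and LOAD_LIMITS_4_CORES are hypothetical names used only for this example.

```pan
# Hypothetical helper, not the actual generate-check-load-services code.
# ARGV[0]: number of cores on the host.
# Returns the warning and critical limits for load 1, load 5 and load 15.
function load_limits = {
    cores = ARGV[0];
    nlist(
        "warning",  list(cores * 2, cores * 1.75, cores * 1.5),
        "critical", list(cores * 3, cores * 2,    cores * 1.75)
    );
};

# Limits for a 4-core node
variable LOAD_LIMITS_4_CORES = load_limits(4);
```

For a 4-core node this gives warning limits of 8, 7 and 6 and critical limits of 12, 8 and 7 for load 1, load 5 and load 15 respectively.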

External configuration files

If you have existing configuration files, or if parts of your configuration are generated by tools other than ncm-nagios, you can integrate them with the NAGIOS_EXTERNAL_FILES variable.
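For example, assuming the variable takes a list of file paths as described in the variable index, and using placeholder paths:

```pan
# Existing Nagios configuration files to be added to the configuration.
# Both paths are hypothetical.
variable NAGIOS_EXTERNAL_FILES = list(
    "/etc/nagios/local/legacy-hosts.cfg",
    "/etc/nagios/local/legacy-services.cfg"
);
```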

General options

The component provides sensible defaults for most of the Nagios options. If you want to override any of these, set the variable NAGIOS_GENERAL_OPTIONS.
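A minimal sketch of such an override follows. The option name check_external_commands comes from the Nagios main configuration file; whether the component expects a boolean or a 0/1 value here is an assumption, so check the component's schema before using it.

```pan
# Accept external commands. The value type (boolean) is assumed;
# the component's schema may require a different type.
variable NAGIOS_GENERAL_OPTIONS = nlist(
    "check_external_commands", true
);
```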

Variable index

NAGIOS_BASE_HOST

A structure template defining a basic host, with all fields except, perhaps, an alias, set to sensible values.

HW_LISTINGS

Lists all machines, grouped by hardware characteristics. It is set automatically by monitoring/nagios/cfg and cannot be modified.

per_hardware

nlist whose keys are the hardware templates (the values of DB_MACHINE). Each value is another structure with these fields:

host_list

List of all hosts associated to that hardware template.

hardware_structure

The complete hardware information as described on the hardware template.

per_cpu

nlist with the machines grouped by number of processors. The key is in the form "_n", where n is the number of CPUs. The value is the list of hosts.

HOSTSLIST

Plain nlist with all host structures, extracted from HW_LISTINGS.

NAGIOS_EXTRA_HOSTS

List of FQDNs of additional hosts to be monitored. Put here any host outside your CDB that you want to monitor, for instance routers or external DNS servers.

NAGIOS_COMMANDS_TEMPLATE

A structure template with home-made commands. A default set with all commands used today (6/2008) at UAM is provided.

NAGIOS_HOSTSALIASES_TEMPLATE

A template with additional per-host customizations. To be improved.

NAGIOS_HOSTGROUPS_TEMPLATE

A structure template with all host group definitions.

NAGIOS_MACROS_TEMPLATE

A structure template with any user-defined macros, with their names not enclosed by '$' symbols. For instance, for setting $USER1$, do:

"USER1" = "/usr/lib/nagios/plugins";

NAGIOS_TIMEPERIODS_TEMPLATE

A structure template with time period definitions.

NAGIOS_CONTACTS_TEMPLATE

A structure template with all contacts.

NAGIOS_CONTACTGROUPS_TEMPLATE

A structure template with the different ways the contacts are grouped.

NAGIOS_SERVICE_TEMPLATES

List with the names of the Nagios structure templates that define each sensor.

NAGIOS_EXPLICIT_SERVICES

List with the complete definition of any services you don't want to specify as structure templates.

NAGIOS_GENERAL_OPTIONS

nlist with all generic options you may want, such as whether to accept external commands or not. Check the Nagios documentation to see what options are available.

NAGIOS_EXTERNAL_FILES

List of external files to be added to the Nagios configuration.