Changes between Version 2 and Version 3 of Doc/Monitoring/Nagios


Timestamp:
Jun 9, 2008, 8:13:43 PM (16 years ago)
Author:
/DC=es/DC=irisgrid/O=uam/CN=luisf-munnoz
Comment:

Nagios server documentation, pending lists of packages.

  • Doc/Monitoring/Nagios

    v2 v3  
    44[[TOC(inline)]]
    55
    6 Nagios configuration requires both a set of client templates for commands to be run on clients by the Nagios Remote Plug-in Executor (NRPE) and a set of server templates configuring contacts for alarms, hosts to be monitored services and so on.
     6Nagios configuration requires both a set of client templates for commands to be run on clients by the Nagios Remote Plug-in Executor (NRPE) and a set of server templates configuring contacts for alarms, hosts to be monitored, services (AKA sensors) and so on.
    77
    88= Configuring the Nagios server =
     9
     10The configuration of a Nagios server is done through a set of ''standard'' templates in the ''monitoring/nagios'' namespace.
     11
     12Sensors are also provided for many of the plug-ins described in [http://www.gridpp.ac.uk/wiki/Nagios_Plugins GridPP's wiki].
     13
     14== What is monitored ==
     15
     16In principle, all hosts present in `DB_MACHINE` are expected to be monitored and are added to the Nagios configuration. This is done with the variable `HOSTSLIST`, which is automatically derived from `DB_MACHINE`. Additional hosts can be specified with the variable `NAGIOS_EXTRA_HOSTS`.
     17
     18Currently, all hosts are assumed to have the same settings. Finer-grained settings (for instance, different sets of hosts monitored during working and non-working hours) are not yet implemented.
     19
     20== Hardware-related monitoring ==
     21
     22The variable `HW_LISTINGS` is roughly the "inverse" of `DB_MACHINE`: it is a structure whose fields correspond to hardware attributes and whose values are the lists of nodes with that attribute. For instance, `HW_LISTINGS["per_cpu"]["_4"]` is the list of nodes with 4 CPU cores, whether those are 4 old Pentiums or a single Barcelona chip. See [#Variable-index here] for the full description of each variable.
     23
     24== Specifying commands ==
     25
     26A structure template with all the commands used at UAM is provided, as an example. Variable `NAGIOS_COMMANDS_TEMPLATE` is the name of a structure template with all the command definitions you need. You can write your own commands template, and even include the example one. A command template is just a set of `"command_name" = "command_definition"` lines, like this:
     27{{{
     28"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
     29"notify-by-email" = "/bin/echo '$SERVICEOUTPUT$' | /bin/mail -s '$SERVICESTATE$ alert for $HOSTALIAS$($HOSTNAME$)/$SERVICEDESC$' $CONTACTEMAIL$";
     30"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";
     31}}}
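
As a minimal sketch of a site-specific commands template: the template name `site/nagios/commands` is hypothetical (it is not one of the provided templates), and the example assumes the standard `check_ssh` plug-in is installed in the directory that `$USER1$` points to.

{{{
# Hypothetical site template; the name is an assumption for illustration.
structure template site/nagios/commands;

# One "command_name" = "command_definition" line per command.
"check_ssh" = "$USER1$/check_ssh -H $HOSTADDRESS$";
}}}

Such a template would then be selected by setting `NAGIOS_COMMANDS_TEMPLATE` to its name, for instance `variable NAGIOS_COMMANDS_TEMPLATE = "site/nagios/commands";`.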
     32
     33== Host aliases and host-specific customization ==
     34
     35Additional per-host customization is possible with another template, named by the variable `NAGIOS_HOSTALIASES_TEMPLATE`. It is a full template, not a structure one. This is a rather poor way of specifying customizations; better ideas are welcome.
     36
     37== Host groups ==
     38
     39A structure template for defining host groups can be supplied with the `NAGIOS_HOSTGROUPS_TEMPLATE` variable.
     40
     41== Macros ==
     42
     43User macros can be supplied with the `NAGIOS_MACROS_TEMPLATE` variable. By default, `$USER1$` and `$USER2$` are set to sensible values. The macros file is written with 0600 permissions, so it is safe to place passwords there if needed.
     44
     45== Time periods ==
     46
     47A structure template can be supplied with the `NAGIOS_TIMEPERIODS_TEMPLATE` variable. The default templates provide only an '''always''' time period.
     48
     49== Contacts and contact groups ==
     50
     51Contacts and contact groups for alarms can be specified with structure templates named by the variables `NAGIOS_CONTACTS_TEMPLATE` and `NAGIOS_CONTACTGROUPS_TEMPLATE`. An example template is provided under '''monitoring/nagios/contacts''', but please adapt it to your needs.
     52
     53== Services ==
     54
     55Each service must be provided via a structure template. This way, the Nagios `use` directive (template inheritance) can be imitated with a plain include. The variable `NAGIOS_SERVICE_TEMPLATES` is therefore a list with the names of the templates that fully describe services.
     56
     57For instance,
     58{{{
     59include monitoring/nagios/services/fast-service;
     60
     61'service_description'='DNS response time';
     62'check_command'= list (
     63    'check_dns',
     64    'www.cern.ch',
     65    '2',
     66    '5');
     67}}}
     68
     69=== Generic services ===
     70
     71As seen in the previous example, you can reduce the amount of code by including some other structure template and changing the appropriate values. For this, 6 generic templates are provided:
     72
     73 `generic-service`::
     74   A "generic" service, with the appropriate contact groups and other usual settings. Before an alarm is sent, the check is tried 4 times. Alarms are re-sent every 4 hours.
     75 `fast-service`::
     76   A `generic-service` to be checked every 5 minutes, or every minute in case of problems.
     77 `performance-service`::
     78   A `fast-service` to be checked every 30 minutes, or every 5 minutes in case of problems.
     79 `security-service`::
     80   A ''volatile'' `fast-service` to be checked every 20 minutes.
     81 `slow-service`::
     82   A `fast-service` to be checked every hour, or every 20 minutes in case of problems. Alarms are re-sent every 7 hours.
     83 `expire-service`::
     84   A `fast-service` to be checked every 48 hours. Alarms are re-sent every 50 hours. It is useful for checking certificate expirations.
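
A service built on one of these generic templates might look like the sketch below. The template name `site/nagios/services/ftp` is hypothetical, and the include path for `slow-service` is assumed to follow the same pattern as the `fast-service` path used earlier; `check_ftp` is the command from the commands example.

{{{
# Hypothetical service template built on a generic one.
structure template site/nagios/services/ftp;

# Path assumed by analogy with monitoring/nagios/services/fast-service.
include monitoring/nagios/services/slow-service;

# Override only the values that differ from the generic template.
'service_description' = 'FTP availability';
'check_command' = list('check_ftp');
}}}

For the service to be used, its template name would also have to appear in `NAGIOS_SERVICE_TEMPLATES`, for instance `variable NAGIOS_SERVICE_TEMPLATES = list("site/nagios/services/ftp");`.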
     85
     86=== Checking for a host load ===
     87
     88The acceptable load on a host depends on the number of CPUs and cores it has. For instance, a load of 4 on a single-core node is problematic, whereas an 8-core node with that same load is leaving CPU capacity unused.
     89
     90The function `generate-check-load-services` addresses this. It generates service structures based on the ''monitoring/nagios/services/load'' templates for all hosts in `DB_MACHINE`, setting sensible limits. These are:
     91|| ||'''Load 1'''||'''Load 5'''||'''Load 15'''||
     92||'''Warning'''||cores*2||cores*1.75||cores*1.5||
     93||'''Critical'''||cores*3||cores*2||cores*1.75||
     94
     95The limits are currently hardcoded in the function; a more flexible approach is planned.
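
As a worked illustration of the table above, the limits for a 4-core node can be written as a command definition in the style of the commands template. The command name is hypothetical, and the definition actually produced by `generate-check-load-services` may look different; only the numbers follow from the table. The `-w` and `-c` options of the standard `check_load` plug-in take the 1, 5 and 15 minute limits, in that order.

{{{
# Illustration only: limits for a 4-core node, computed from the table.
# warning  = 4*2, 4*1.75, 4*1.5  -> 8,7,6
# critical = 4*3, 4*2,    4*1.75 -> 12,8,7
# The command name is an assumption, not something the function is known to emit.
"check_load_4cores" = "$USER1$/check_load -w 8,7,6 -c 12,8,7";
}}}
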
     96== External configuration files ==
     97
     98If you have existing configuration files, or if parts of your configuration are generated by tools other than ncm-nagios, you can integrate them with the `NAGIOS_EXTERNAL_FILES` variable.
     99
     100== General options ==
     101
     102The component provides sensible defaults for most of the Nagios options. If you want to override any of these, set the variable `NAGIOS_GENERAL_OPTIONS`.
     103
     104== Variable index ==