| | 9 | |
| | 10 | A Nagios server is configured through a set of ''standard'' templates in the ''monitoring/nagios'' namespace. |
| | 11 | |
| | 12 | Sensors are also provided for many of the plug-ins described in [http://www.gridpp.ac.uk/wiki/Nagios_Plugins GridPP's wiki]. |
| | 13 | |
| | 14 | == What is monitored == |
| | 15 | |
| | 16 | In principle, all hosts present in `DB_MACHINE` are expected to be monitored and are added to the Nagios configuration. This is done with the variable `HOSTSLIST`, which is automatically derived from `DB_MACHINE`. Additional hosts can be specified with the variable `NAGIOS_EXTRA_HOSTS`. |
| | 17 | |
| | 18 | Currently, all hosts are considered to have the same settings. Finer-grained settings (for instance, separate sets of hosts to be monitored during working and non-working hours) are not implemented yet. |
| | 19 | |
| | 20 | == Hardware-related monitoring == |
| | 21 | |
| | 22 | The variable `HW_LISTINGS` is roughly the "inverse" of `DB_MACHINE`: it is a structure whose fields correspond to hardware attributes and whose values are the lists of nodes having each attribute. For instance, `HW_LISTINGS["per_cpu"]["_4"]` is the list of nodes with 4 CPU cores, whether those come from four old Pentiums or from a single Barcelona chip. See [#Variable-index here] for the full description of each variable. |
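| | | |
| | | As a rough illustration of that shape, the fragment below writes out what such a structure could look like. The host names are made up, and the assignment is only meant to show the nesting, not how the variable is actually populated: |
| | | {{{ |
| | | # Illustration only: hypothetical host names. |
| | | variable HW_LISTINGS = nlist( |
| | |     'per_cpu', nlist( |
| | |         '_4', list('node01.example.org', 'node02.example.org'), |
| | |         '_8', list('node03.example.org') |
| | |     ) |
| | | ); |
| | | }}} |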
| | 23 | |
| | 24 | == Specifying commands == |
| | 25 | |
| | 26 | A structure template with all the commands used at UAM is provided as an example. The variable `NAGIOS_COMMANDS_TEMPLATE` holds the name of the structure template that contains all the command definitions you need. You can write your own commands template, and even include the example one in it. A commands template is just a set of `"command_name" = "command_definition"` lines, like this: |
| | 27 | {{{ |
| | 28 | "check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$"; |
| | 29 | "notify-by-email" = "/bin/echo '$SERVICEOUTPUT$' | /bin/mail -s '$SERVICESTATE$ alert for $HOSTALIAS$($HOSTNAME$)/$SERVICEDESC$' $CONTACTEMAIL$"; |
| | 30 | "check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd"; |
| | 31 | }}} |
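| | | |
| | | A site-specific commands template that builds on the example one might look like the sketch below. Both template names (`site/nagios/commands` and `monitoring/nagios/commands`) and the `check_http_site` command are assumptions made for illustration; use the names that exist in your own tree: |
| | | {{{ |
| | | structure template site/nagios/commands; |
| | | |
| | | # Pull in the example definitions (hypothetical template name). |
| | | include { 'monitoring/nagios/commands' }; |
| | | |
| | | # Add a site-specific command on top of them. |
| | | "check_http_site" = "$USER1$/check_http -H $HOSTADDRESS$"; |
| | | |
| | | # Then, in a separate (non-structure) template of the Nagios server: |
| | | # variable NAGIOS_COMMANDS_TEMPLATE = 'site/nagios/commands'; |
| | | }}} |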
| | 32 | |
| | 33 | == Host aliases and host-specific customization == |
| | 34 | |
| | 35 | Additional per-host customization is possible with another template, named by the variable `NAGIOS_HOSTALIASES_TEMPLATE`. It is a full template, not a structure one. This is a rather poor way of specifying customizations; better ideas are welcome. |
| | 36 | |
| | 37 | == Host groups == |
| | 38 | |
| | 39 | A structure template for defining host groups can be supplied with the `NAGIOS_HOSTGROUPS_TEMPLATE` variable. |
| | 40 | |
| | 41 | == Macros == |
| | 42 | |
| | 43 | User macros can be supplied with the `NAGIOS_MACROS_TEMPLATE` variable. By default, `$USER1$` and `$USER2$` are set to sensible values. The macros file is written with 0600 permissions, so it is safe to place passwords there if needed. |
| | 44 | |
| | 45 | == Time periods == |
| | 46 | |
| | 47 | A structure template can be supplied with the `NAGIOS_TIMEPERIODS_TEMPLATE` variable. The default templates provide only an '''always''' time period. |
| | 48 | |
| | 49 | == Contacts and contact groups == |
| | 50 | |
| | 51 | Contacts and contact groups for alarms can be specified with structure templates named by the variables `NAGIOS_CONTACTS_TEMPLATE` and `NAGIOS_CONTACTGROUPS_TEMPLATE`. An example template is provided under '''monitoring/nagios/contacts''', but please adapt it to your needs. |
| | 52 | |
| | 53 | == Services == |
| | 54 | |
| | 55 | Each service must be provided via a structure template. This way, the Nagios `use` directive can be imitated with just an include. The variable `NAGIOS_SERVICE_TEMPLATES` is a list with the names of the templates that fully describe services. |
| | 56 | |
| | 57 | For instance, |
| | 58 | {{{ |
| | 59 | include monitoring/nagios/services/fast-service; |
| | 60 | |
| | 61 | 'service_description'='DNS response time'; |
| | 62 | 'check_command'= list ( |
| | 63 | 'check_dns', |
| | 64 | 'www.cern.ch', |
| | 65 | '2', |
| | 66 | '5'); |
| | 67 | }}} |
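| | | |
| | | To have such a service picked up, its template name would then be listed in `NAGIOS_SERVICE_TEMPLATES`. The sketch below assumes the example above was saved as a structure template called `site/nagios/services/dns-response`, which is a hypothetical name: |
| | | {{{ |
| | | # Hypothetical template name; list every service template you want configured. |
| | | variable NAGIOS_SERVICE_TEMPLATES = list( |
| | |     'site/nagios/services/dns-response' |
| | | ); |
| | | }}} |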
| | 68 | |
| | 69 | === Generic services === |
| | 70 | |
| | 71 | As seen in the previous example, you can reduce the amount of code by including another structure template and changing only the values that differ. For this purpose, six generic templates are provided: |
| | 72 | |
| | 73 | `generic-service`:: |
| | 74 | A "generic" service, with the appropriate contact groups and other usual settings. Before an alarm is sent, the check is tried 4 times. Alarms are re-sent every 4 hours. |
| | 75 | `fast-service`:: |
| | 76 | A `generic-service` to be checked every 5 minutes, or every minute in case of problems. |
| | 77 | `performance-service`:: |
| | 78 | A `fast-service` to be checked every 30 minutes, or every 5 minutes in case of problems. |
| | 79 | `security-service`:: |
| | 80 | A ''volatile'' `fast-service` to be checked every 20 minutes. |
| | 81 | `slow-service`:: |
| | 82 | A `fast-service` to be checked every hour, or every 20 minutes in case of problems. Alarms are re-sent every 7 hours. |
| | 83 | `expire-service`:: |
| | 84 | A `fast-service` to be checked every 48 hours. Alarms are re-sent every 50 hours. It is useful for checking certificate expirations. |
| | 85 | |
| | 86 | === Checking for a host load === |
| | 87 | |
| | 88 | The acceptable load on a host depends on its number of CPUs and cores. For instance, a load of 4 is problematic on a single-core node, whereas an 8-core node with that same load is wasting CPU. |
| | 89 | |
| | 90 | The function `generate-check-load-services` addresses this. It generates service structures based on the ''monitoring/nagios/services/load'' templates for all hosts in `DB_MACHINE`, setting sensible limits. These are: |
| | 91 | || ||'''Load 1'''||'''Load 5'''||'''Load 15'''|| |
| | 92 | ||'''Warning'''||cores*2||cores*1.75||cores*1.5|| |
| | 93 | ||'''Critical'''||cores*3||cores*2||cores*1.75|| |
| | 94 | |
| | 95 | The limits are hardcoded in the function; a more flexible approach will be provided (some day). |
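| | | |
| | | The arithmetic in the table can be sketched as a small Pan function. This is not the code of `generate-check-load-services`; the name `example_load_limits` and the layout of the returned structure are assumptions made for illustration, and the definition would have to live inside a template: |
| | | {{{ |
| | | # Hypothetical helper, not part of the shipped templates. |
| | | # ARGV[0]: number of cores on the host. |
| | | # Returns the limits for load 1, load 5 and load 15, in that order. |
| | | function example_load_limits = { |
| | |     cores = to_double(ARGV[0]); |
| | |     nlist( |
| | |         'warning',  list(cores * 2.0, cores * 1.75, cores * 1.5), |
| | |         'critical', list(cores * 3.0, cores * 2.0,  cores * 1.75) |
| | |     ); |
| | | }; |
| | | }}} |
| | | For a 4-core node this gives warning limits of 8, 7 and 6 and critical limits of 12, 8 and 7. |
| | | |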
| | 96 | == External configuration files == |
| | 97 | |
| | 98 | If you have existing configuration files, or if parts of your configuration are generated by tools other than ncm-nagios, you can integrate them with the `NAGIOS_EXTERNAL_FILES` variable. |
| | 99 | |
| | 100 | == General options == |
| | 101 | |
| | 102 | The component provides sensible defaults for most of the Nagios options. If you want to override any of these, set the variable `NAGIOS_GENERAL_OPTIONS`. |
| | 103 | |
| | 104 | == Variable index == |