[[TracNav]]

[[TOC(inline,depth=1)]]

Nagios configuration requires both a set of client templates for commands to be run on clients by the Nagios Remote Plug-in Executor (NRPE) and a set of server templates configuring contacts for alarms, hosts to be monitored, services (AKA sensors) and so on.

= Configuring the Nagios server =

The Nagios server is configured through a set of ''standard'' templates in the [source:templates/trunk/standard/monitoring/nagios ''monitoring/nagios''] namespace. Sensors are also provided for many of the plug-ins described in [http://www.gridpp.ac.uk/wiki/Nagios_Plugins GridPP's wiki].

== What is monitored ==

In principle, all hosts present in `DB_MACHINE` are expected to be monitored and are added to the Nagios configuration. This is done with the variable `HOSTSLIST`, which is automatically derived from `DB_MACHINE`. Additional hosts can be specified with the variable `NAGIOS_EXTRA_HOSTS`.

Currently, all hosts are considered to have the same settings. Finer-grained settings (for instance, different sets of hosts to be monitored during working and non-working hours) are not implemented yet.

== Hardware-related monitoring ==

The variable `HW_LISTINGS` is roughly the "inverse" of `DB_MACHINE`: it is a structure whose fields correspond to hardware attributes and whose values are the lists of nodes with that attribute. For instance, `HW_LISTINGS["per_cpu"]["_4"]` is the list of nodes with 4 CPU cores, whether they are 4 old Pentiums or a single Barcelona chip.

See [#Variable-index here] for the full description of each variable.

== Specifying commands ==

A structure template with all the commands used at UAM is provided as an example. The variable `NAGIOS_COMMANDS_TEMPLATE` holds the name of a structure template with all the command definitions you need. You can write your own commands template, and even include the example one.
A command template is just a set of `"command_name" = "command_definition"` lines, like this:
{{{
"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
"notify-by-email" = "/bin/echo '$SERVICEOUTPUT$' | /bin/mail -s '$SERVICESTATE$ alert for $HOSTALIAS$($HOSTNAME$)/$SERVICEDESC$' $CONTACTEMAIL$";
"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";
}}}

== Host aliases and host-specific customization ==

Additional per-host customization is possible with another template, whose name is given by `NAGIOS_HOSTALIASES_TEMPLATE`. It is a full template, not a structure one. This is a rather poor way of specifying customizations; better ideas are welcome.

== Host groups ==

A structure template for defining host groups can be supplied with the `NAGIOS_HOSTGROUPS_TEMPLATE` variable.

== Macros ==

User macros can be supplied with the `NAGIOS_MACROS_TEMPLATE` variable. By default, `$USER1$` and `$USER2$` are set to sensible values. The macros file is written with 0600 permissions, so it is safe to place passwords here if needed.

== Time periods ==

A structure template can be supplied with the `NAGIOS_TIMEPERIODS_TEMPLATE` variable. The default templates provide only an '''always''' time period.

== Contacts and contact groups ==

Contacts and contact groups for alarms can be specified with structure templates named by `NAGIOS_CONTACTS_TEMPLATE` and `NAGIOS_CONTACTGROUPS_TEMPLATE`. An example template is provided under '''monitoring/nagios/contacts''', but please adapt it to your needs.

== Services ==

Each service must be provided via a structure template. This way, the Nagios `use` directive can be imitated with just an include. The variable `NAGIOS_SERVICE_TEMPLATES` is therefore a list with the names of the templates that fully describe services.
For instance,
{{{
include monitoring/nagios/services/fast-service;
'service_description' = 'DNS response time';
'check_command' = list('check_dns', 'www.cern.ch', '2', '5');
}}}

=== Generic services ===

As seen in the previous example, you can reduce the amount of code by including some other structure template and changing the appropriate values. For this, 6 generic templates are provided:

 `generic-service`::
   A "generic" service, with the appropriate contact groups and other usual settings. Before sending an alarm, the check is tried 4 times. Alarms are re-sent every 4 hours.
 `fast-service`::
   A `generic-service` to be checked every 5 minutes, or every minute in case of problems.
 `performance-service`::
   A `fast-service` to be checked every 30 minutes, or every 5 minutes in case of problems.
 `security-service`::
   A ''volatile'' `fast-service` to be checked every 20 minutes.
 `slow-service`::
   A `fast-service` to be checked every hour, or every 20 minutes in case of problems. Alarms are re-sent every 7 hours.
 `expire-service`::
   A `fast-service` to be checked every 48 hours. Alarms are re-sent every 50 hours. It is useful for checking certificate expirations.

=== Checking for a host load ===

The acceptable load on a host depends on the number of CPUs and cores it has. For instance, a load of 4 is problematic on a single-core node, whereas an 8-core node with that same load is wasting CPU.

The function `generate-check-load-services` is aimed at this. It generates service structures based on the ''monitoring/nagios/services/load'' templates for all hosts in `DB_MACHINE`, setting sensible limits. These are:

|| ||'''Load 1'''||'''Load 5'''||'''Load 15'''||
||'''Warning'''||cores*2||cores*1.75||cores*1.5||
||'''Critical'''||cores*3||cores*2||cores*1.75||

The limits are hardcoded in the function; a more flexible approach will be provided (some day).
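
As a worked instance of the limits above, a 4-core node would get warning thresholds of 8, 7 and 6 and critical thresholds of 12, 8 and 7 for the 1, 5 and 15 minute loads. The following is only a sketch of what such a service could look like if written by hand. The template name, the `check_load` command name and its two threshold arguments are assumptions, not the output of `generate-check-load-services`, and restricting the service to the 4-core hosts (for example those in `HW_LISTINGS["per_cpu"]["_4"]`) is left out:
{{{
# Hypothetical template name; the real load services are generated.
structure template site/nagios/services/load-4-cores;

include monitoring/nagios/services/fast-service;
'service_description' = 'Load on 4-core nodes';
# Assumed command 'check_load' taking a warning and a critical triple.
# Warning:  4*2, 4*1.75, 4*1.5  -> 8, 7, 6
# Critical: 4*3, 4*2,    4*1.75 -> 12, 8, 7
'check_command' = list('check_load', '8,7,6', '12,8,7');
}}}
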
== External configuration files ==

If you have existing configuration files, or if parts of your configuration are generated by tools other than ncm-nagios, you can integrate them with the `NAGIOS_EXTERNAL_FILES` variable.

== General options ==

The component provides sensible defaults for most of the Nagios options. If you want to override any of these, set the variable `NAGIOS_GENERAL_OPTIONS`.

== Variable index ==

|| `NAGIOS_BASE_HOST` || A structure template defining a basic host, with all fields except, perhaps, an alias, set to sensible values. ||
|| `HW_LISTINGS` || Lists all machines, grouped by hardware characteristics. It is set automatically by ''monitoring/nagios/cfg'' and cannot be modified. ||
||`  per_hardware` || Field of `HW_LISTINGS`: nlist whose keys are the hardware templates (the values of `DB_MACHINE`). Each value is another structure with the fields `host_list` and `hardware_structure`. ||
||`    host_list` || List of all hosts associated with that hardware template. ||
||`    hardware_structure` || The complete hardware information as described in the hardware template. ||
||`  per_cpu` || Field of `HW_LISTINGS`: nlist with the machines grouped by number of processors. The key has the form "_n", where n is the number of CPUs. The value is the list of hosts. ||
|| `HOSTSLIST` || Plain nlist with all host structures, extracted from `HW_LISTINGS`. ||
|| `NAGIOS_EXTRA_HOSTS` || List of FQDNs of additional hosts to be monitored. Put here any host outside your CDB that you want to monitor, for instance routers or external DNS servers. ||
|| `NAGIOS_COMMANDS_TEMPLATE` || A structure template with home-made commands. A default set with all commands used at UAM as of 6/2008 is provided. ||
|| `NAGIOS_HOSTSALIASES_TEMPLATE` || A template with additional per-host customizations. To be improved. ||
|| `NAGIOS_HOSTGROUPS_TEMPLATE` || A structure template with all host group definitions. ||
|| `NAGIOS_MACROS_TEMPLATE` || A structure template with any user-defined macros. Macro names are written without the enclosing '$' symbols. For instance, to set `$USER1$`, use: {{{"USER1" = "/usr/lib/nagios/plugins";}}} ||
|| `NAGIOS_TIMEPERIODS_TEMPLATE` || A structure template with time period definitions. ||
|| `NAGIOS_CONTACTS_TEMPLATE` || A structure template with all contacts. ||
|| `NAGIOS_CONTACTGROUPS_TEMPLATE` || A structure template describing how the contacts are grouped. ||
|| `NAGIOS_SERVICE_TEMPLATES` || List with the names of the Nagios structure templates that define each sensor. ||
|| `NAGIOS_EXPLICIT_SERVICES` || List with the complete definition of any services you don't want to specify as structure templates. ||
|| `NAGIOS_GENERAL_OPTIONS` || nlist with all generic options you may want, such as whether to accept external commands or not. Check the Nagios documentation to see what options are available. ||
|| `NAGIOS_EXTERNAL_FILES` || List of external files to be added to the Nagios configuration. ||
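
To tie the variables above together, a site could set some of them in one of its own templates, presumably before the standard Nagios server templates are included. This is a minimal sketch under that assumption. Every template name under `site/` and both host names are hypothetical, and whether a value set this way replaces or extends the defaults is not covered here:
{{{
# Hypothetical site template; adapt all names to your own namespace.
template site/nagios/settings;

# Hosts outside CDB that should also be monitored.
variable NAGIOS_EXTRA_HOSTS = list('router.example.org', 'ns1.example.org');

# Name of your own structure template with command definitions.
variable NAGIOS_COMMANDS_TEMPLATE = 'site/nagios/commands';

# Names of the structure templates that define each sensor.
variable NAGIOS_SERVICE_TEMPLATES = list('site/nagios/services/dns-response');
}}}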