[[TracNav]]

[[TOC(inline,depth=1)]]

Nagios configuration requires both a set of client templates, for the commands run on clients by the Nagios Remote Plug-in Executor (NRPE), and a set of server templates, which configure contacts for alarms, hosts to be monitored, services (also known as sensors) and so on.

= Configuring the Nagios server =

The Nagios server is configured through a set of ''standard'' templates in the [source:templates/trunk/standard/monitoring/nagios ''monitoring/nagios''] namespace. Sensors are also provided for many of the plug-ins described on the [http://www.gridpp.ac.uk/wiki/Nagios_Plugins GridPP wiki]. An [source:templates/trunk/clusters/example-3.1/profiles/nagios-server.example.org.tpl example Nagios server template] is included in the QWG distribution.

== Quickstart ==

To configure a basic Nagios server, include the template [source:templates/trunk/standard/monitoring/nagios/config monitoring/nagios/config] in the server's template. This automatically generates a Nagios configuration that monitors all Quattor-managed machines.

== Monitoring grid services ==

Preliminary work has been done to integrate monitoring of grid services according to the EGEE model (described at https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNcg). Please note that, in order to retrieve SAM results for your site, you need to request access to the SAM database from your Nagios server, following the procedure described at https://twiki.cern.ch/twiki/bin/view/LCG/SamProgInterfaceACL

The template [source:templates/trunk/standard/monitoring/nagios/ncg_services monitoring/nagios/ncg_services] automatically generates service, service group, service dependency and external info configuration for the CEs and SEs defined in the `CE_HOST` and `SE_HOST` variables. The NCG services configuration template should be included before the core Nagios config template.
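
As a rough sketch of the include order described above, a server profile might look like the following. The profile name is the one used by the example shipped with QWG; the `include { ... }` form is an assumption about the pan syntax in use at your site, and `CE_HOST` and `SE_HOST` are assumed to be defined elsewhere in the site configuration.

{{{
object template nagios-server.example.org;

# CE_HOST and SE_HOST are assumed to be set already by the site configuration.

# Grid services first: this template should precede the core configuration.
include { 'monitoring/nagios/ncg_services' };

# Core Nagios server configuration: monitors all Quattor-managed machines.
include { 'monitoring/nagios/config' };
}}}

A real profile also needs the usual OS and site includes, which are omitted here.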
= Customising your configuration =

== What is monitored ==

In principle, all hosts present in the `DB_MACHINE` database are expected to be monitored and are added to the Nagios configuration. (This is done via the variable `HOSTSLIST`, which is automatically derived from `DB_MACHINE`.) If you want to monitor additional hosts that are not Quattor-managed, list them in the variable `NAGIOS_EXTRA_HOSTS`.

Currently, all hosts are given the same settings. Finer-grained settings (for instance, separate sets of hosts to be monitored during working and non-working hours) are not yet implemented.

== Hardware-related monitoring ==

The variable `HW_LISTINGS` is essentially the inverse of `DB_MACHINE`: it is a structure whose fields correspond to hardware attributes and whose values are the lists of nodes with that attribute. For instance, `HW_LISTINGS["per_cpu"]["_4"]` is the list of nodes with 4 CPU cores, whether 4 single-core Pentiums or a single quad-core Barcelona chip. See the [#Variable-index variable index] for the full description of each variable.

== Specifying commands ==

A structure template with all the commands used at UAM is provided as an example. The variable `NAGIOS_COMMANDS_TEMPLATE` holds the name of a structure template with all the command definitions you need. You can write your own commands template, and even include the example one from it. A command template is just a set of `"command_name" = "command_definition"` lines, like this:

{{{
"check_ftp" = "$USER1$/check_ftp -H $HOSTADDRESS$";
"notify-by-email" = "/bin/echo '$SERVICEOUTPUT$' | /bin/mail -s '$SERVICESTATE$ alert for $HOSTALIAS$($HOSTNAME$)/$SERVICEDESC$' $CONTACTEMAIL$";
"check_nrpe_ncd" = "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_ncd";
}}}

== Host aliases and host-specific customization ==

Additional per-host customization is possible with another template, named by the variable `NAGIOS_HOSTALIASES_TEMPLATE`. It is a full template, not a structure one. It is a rather poor way of specifying customizations; better ideas are welcome.
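
To tie together the variables from "What is monitored" and "Specifying commands", here is a minimal sketch of site-level settings. The host names and the template name `site/nagios/commands` are hypothetical, and it is assumed (not confirmed here) that these variables have to be set before `monitoring/nagios/config` is included.

{{{
# Hypothetical hosts outside Quattor control that should also be monitored.
variable NAGIOS_EXTRA_HOSTS = list(
    'router.example.org',
    'dns.example.org'
);

# Name of a site-specific structure template with command definitions
# (hypothetical name; its body would hold "command_name" = "definition" lines).
variable NAGIOS_COMMANDS_TEMPLATE = 'site/nagios/commands';
}}}

The structure template named in `NAGIOS_COMMANDS_TEMPLATE` would then contain lines of the same form as the command examples shown earlier.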
== Host groups ==

A structure template for defining host groups can be supplied with the `NAGIOS_HOSTGROUPS_TEMPLATE` variable.

== Macros ==

User macros can be supplied with the `NAGIOS_MACROS_TEMPLATE` variable. By default, `$USER1$` and `$USER2$` are set to sensible values. The macros file is written with 0600 permissions, so it is safe to place passwords here if needed.

== Time periods ==

A structure template can be supplied with the `NAGIOS_TIMEPERIODS_TEMPLATE` variable. The default templates provide only an '''always''' time period.

== Contacts and contact groups ==

Contacts and contact groups for alarms can be specified with structure templates named by `NAGIOS_CONTACTS_TEMPLATE` and `NAGIOS_CONTACTGROUPS_TEMPLATE`. An example template is provided under '''monitoring/nagios/contacts''', but please adapt it to your needs!

== Services ==

Each service must be provided via a structure template. This way, the Nagios `use` directive can be imitated with just an include. The variable `NAGIOS_SERVICE_TEMPLATES` is therefore a list with the names of the templates that fully describe services. For instance:

{{{
include monitoring/nagios/services/fast-service;
'service_description' = 'DNS response time';
'check_command' = list('check_dns', 'www.cern.ch', '2', '5');
}}}

=== Generic services ===

As seen in the previous example, you can reduce the amount of code by including another structure template and changing only the appropriate values. For this, 6 generic templates are provided:

 `generic-service`:: A "generic" service, with the appropriate contact groups and other usual settings. Before sending an alarm, the check is tried 4 times. Alarms are re-sent every 4 hours.
 `fast-service`:: A `generic-service` to be checked every 5 minutes, or every minute in case of problems.
 `performance-service`:: A `fast-service` to be checked every 30 minutes, or every 5 minutes in case of problems.
 `security-service`:: A ''volatile'' `fast-service` to be checked every 20 minutes.
 `slow-service`:: A `fast-service` to be checked every hour, or every 20 minutes in case of problems. Alarms are re-sent every 7 hours.
 `expire-service`:: A `fast-service` to be checked every 48 hours. Alarms are re-sent every 50 hours. It is useful for checking certificate expirations.

=== Checking for a host load ===

The acceptable load on a host depends on the number of CPUs and cores it has. For instance, a load of 4 is problematic on a single-core node, whereas an 8-core node is wasting CPU with that same load. The function `generate-check-load-services` addresses this. It generates service structures based on the ''monitoring/nagios/services/load'' templates for all hosts in `DB_MACHINE`, setting sensible limits. These are:

|| ||'''Load 1'''||'''Load 5'''||'''Load 15'''||
||'''Warning'''||cores*2||cores*1.75||cores*1.5||
||'''Critical'''||cores*3||cores*2||cores*1.75||

The limits are hardcoded in the function; a more flexible approach will be provided (some day).

== External configuration files ==

If you have existing configuration files, or if parts of your configuration are generated by tools other than ncm-nagios, you can integrate them with the `NAGIOS_EXTERNAL_FILES` variable.

== General options ==

The component provides sensible defaults for most of the Nagios options. If you want to override any of these, set the variable `NAGIOS_GENERAL_OPTIONS`.

== Visualizing performance metrics ==

Many service and host checks return performance metrics to the Nagios server. The performance data can be visualized in rrdtool-based plots using pnp4nagios.

== Variable index ==

|| `NAGIOS_BASE_HOST` || A structure template defining a basic host, with all fields except, perhaps, an alias, set to sensible values. ||
|| `HW_LISTINGS` || Lists all machines, grouped by hardware characteristics. It is set automatically by ''monitoring/nagios/cfg'' and cannot be modified. ||
|| `  per_hardware` || nlist in which keys are the hardware templates (the values of `DB_MACHINE`). The value is another structure: ||
|| `    host_list` || List of all hosts associated with that hardware template. ||
|| `    hardware_structure` || The complete hardware information as described in the hardware template. ||
|| `  per_cpu` || nlist with the machines grouped by number of processors. The key is in the form "_n", where n is the number of CPUs. The value is the list of hosts. ||
|| `HOSTSLIST` || Plain nlist with all host structures, extracted from `HW_LISTINGS`. ||
|| `NAGIOS_EXTRA_HOSTS` || List of FQDNs of additional hosts to be monitored. Put here any host outside your CDB that you want to monitor, for instance routers or external DNS servers. ||
|| `NAGIOS_COMMANDS_TEMPLATE` || A structure template with home-made commands. A default set with all commands used today (6/2008) at UAM is provided. ||
|| `NAGIOS_HOSTSALIASES_TEMPLATE` || A template with additional per-host customizations. To be improved. ||
|| `NAGIOS_HOSTGROUPS_TEMPLATE` || A structure template with all host group definitions. ||
|| `NAGIOS_MACROS_TEMPLATE` || A structure template with any user-defined macros, not enclosed by '$' symbols. For instance, for setting `$USER1$`, do: {{{"USER1" = "/usr/lib/nagios/plugins";}}} ||
|| `NAGIOS_TIMEPERIODS_TEMPLATE` || A structure template with time period definitions. ||
|| `NAGIOS_CONTACTS_TEMPLATE` || A structure template with all contacts. ||
|| `NAGIOS_CONTACTGROUPS_TEMPLATE` || A structure template with the different ways the contacts are grouped. ||
|| `NAGIOS_SERVICE_TEMPLATES` || List with the names of the Nagios structure templates that define each sensor. ||
|| `NAGIOS_EXPLICIT_SERVICES` || List with the complete definition of any services you don't want to specify as structure templates. ||
|| `NAGIOS_EXPLICIT_SERVICEGROUPS` || List with the complete definition of any service groups you don't want to specify as structure templates. ||
|| `NAGIOS_EXPLICIT_SERVICEDEPENDENCIES` || List with the complete definition of any service dependencies you don't want to specify as structure templates. ||
|| `NAGIOS_EXPLICIT_SERVICEEXTINFO` || List with the complete definition of any service external info you don't want to specify as structure templates. ||
|| `NAGIOS_GENERAL_OPTIONS` || nlist with all generic options you may want, such as whether to accept external commands or not. Check the Nagios documentation to see what options are available. ||
|| `NAGIOS_EXTERNAL_FILES` || List of external files to be added to the Nagios configuration. ||
|| `NAGIOS_SERVICEEXTINFO_TEMPLATES` || List of templates containing definitions for extended service information. ||
|| `NAGIOS_LOAD_BASE_TEMPLATE` || Base template for the load checks, which are generated by the `generate_load_checks` function. ||
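
For the general-options and external-files variables listed in the index, a minimal sketch might look like this. The file path is hypothetical; `check_external_commands` is the standard Nagios main-configuration option for accepting external commands, but keying the nlist by the Nagios option name and giving it a boolean value are assumptions about what the component schema expects.

{{{
# Override selected defaults of the main Nagios configuration.
# Assumption: options are keyed by their Nagios name and this one is a boolean.
variable NAGIOS_GENERAL_OPTIONS = nlist(
    'check_external_commands', true
);

# Hypothetical configuration file maintained outside ncm-nagios.
variable NAGIOS_EXTERNAL_FILES = list('/etc/nagios/site/extra-services.cfg');
}}}

Check the component schema and the Nagios documentation for the option names and value types your version accepts before relying on this.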